AP Statistics Study Guide Flashcards

General Tips & Chapter 1: One Variable Data

  • Introduction to Statistics

    • Statistics is the science and art of collecting, analyzing, and drawing conclusions from data.
    • Data can be:
      • Categorical: Data that falls into categories.
      • Quantitative: Numerical data representing measurements or counts.
    • Key Terms:
      • Individual: An object described in a set of data (also known as cases/observational units).
      • Variable: An aspect that can take different values for different individuals.
      • Distribution: The pattern of variation of a variable, showing what values the variable takes and how often it takes those values.
    • Types of Statistics:
      • Descriptive Statistics: Summarizing and describing data (Units 1-7).
      • Inferential Statistics: Drawing conclusions about a population based on sample data (Units 8-12).
  • FRQs (Free Response Questions)

    • Always include context: variable and units.
  • MCQs (Multiple Choice Questions) Strategies

    • Process of elimination.
    • Read actively and carefully.
    • Underline important information.
    • Anticipate answers.
  • Types of Variables

    • Categorical:
      • Nominal: Categories with no inherent order.
      • Ordinal: Categories with a specific order.
        • Numbers can be ordinal if they don’t measure anything (e.g., cell phone digits).
    • Quantitative:
      • Discrete: Fixed set of possible values with gaps between them.
        • Whole numbers or defined intervals.
        • Countable (finite or countably infinite).
      • Continuous: Infinite possibilities.
        • Decimals/fractions.
        • Any value in an interval on the number line.
      • The main difference: discrete values are counted, continuous values are measured.

1.1 Analyzing Categorical Data

  • Tables and Graphs
    • Frequency Table: Shows counts for each category.
    • Relative Frequency Table: Shows proportions/percentages for each category.
    • Bar Graph: Graphical representation of categorical data with bars representing frequencies or relative frequencies.
      • Tips:
        • Equal width bars.
        • Leave gaps between bars.
        • Label and scale axes.
        • Indicate whether frequencies or relative frequencies are used.
    • Pie Chart: Circular chart divided into slices proportional to frequencies or relative frequencies.
      • Good for: comparing categories to the whole.
      • Areas of slices proportional to frequencies/relative frequencies.
      • Must include all possible categories in the whole (add an “other” option if necessary).
      • Include a legend key.
    • Two-Way Tables: Summarizes data on the relationship between two categorical variables for a group of individuals.
      • With A = a cell count, B = a row or column total, and C = the grand total:
        • Marginal Relative Frequency: B/C
        • Joint Relative Frequency: A/C
        • Conditional Relative Frequency: A/B
    • Side-by-Side Bar Graph: Bar graphs showing the distribution of a categorical variable for each value of another categorical variable.
      • Tip: Within each value of the first categorical variable, the bars may touch, but leave gaps between the groups of bars for different values.
    • Segmented Bar Graph: Distribution of a categorical variable as segments of a whole (bars stacked on top of each other and proportional to relative frequencies).
      • Uses relative frequencies.
      • Tip: The bars do not touch here!
    • Mosaic Plot: Similar to segmented, except the width of each bar is proportional to the number of individuals in that category.
      • Tip: The bars do touch here!

*Note: Tables are not data, they are summaries of data!
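The three relative frequencies for a two-way table can be sketched in Python; the categories and counts below are made up for illustration, with A = a cell count, B = a row total, C = the grand total:

```python
# Hypothetical two-way table: rows = grade level, columns = preferred subject.
table = {
    ("junior", "math"): 20, ("junior", "english"): 10,
    ("senior", "math"): 15, ("senior", "english"): 25,
}

grand_total = sum(table.values())                                          # C = 70
junior_total = sum(v for (row, _), v in table.items() if row == "junior")  # B = 30

joint = table[("junior", "math")] / grand_total         # A/C: joint relative frequency
marginal = junior_total / grand_total                   # B/C: marginal relative frequency
conditional = table[("junior", "math")] / junior_total  # A/B: conditional relative frequency

print(round(joint, 3), round(marginal, 3), round(conditional, 3))  # 0.286 0.429 0.667
```

Note that the conditional relative frequency divides by the total of the condition's row, not by the grand total.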

  • Avoid Bad Statistics Practices

    • Truncating axes.
    • Using pictograms.
  • Association

    • Knowing the value of one variable helps you predict the value of the other.
    • Association does NOT equal causation!

1.2 Displaying Quantitative Data with Graphs

  • Types of Graphs
    • Dotplot: Each data value is shown as a dot above its location on a number line.
      • Pros: Can see every individual value, easy to see shape.
      • Cons: Difficult to make with large data sets.
    • Stemplot: Separates each data value into a stem (all but the final digit) and a leaf (the final digit).
      • Tips: Always add a key; Split stems to better see distribution if needed (try to get at least 5 stems); Make sure each stem has an equal number of possible leaf digits.
      • Pros: Can see every individual value, easy to see shape.
      • Cons: Difficult to make with large data sets.
    • Histogram: Groups data into bins (intervals) and shows the frequency (or relative frequency) of values within each bin.
      • Tips: Equal size bins; Try to go with a minimum of 5 bins; The bin size will affect the appearance of the distribution (more bins -> more detail but less clear overall pattern); Edges of bins either inclusive or noninclusive (typically the right edge of a bin is noninclusive); Bars touch; Either frequencies or relative frequencies (but need to indicate which one!).
      • Pros: Easier to make for large data sets, easy to see shape (especially for large data sets–simplifies the overall pattern).
      • Cons: Doesn’t show every individual value.
        *Use relative frequencies if you’re using the histogram to compare distributions with different numbers of observations.
    • Boxplot: Represents the five-number summary (minimum, Q1, median, Q3, maximum) and any outliers.
      • Pros: Easy to make for large data sets, shows five-number summaries, splits data into quartiles.
      • Cons: Doesn’t show individual values; can hide features like gaps, clusters, modality, and slight skewness.

1.3 Describing Quantitative Data with Numbers

  • Measures of Center

    • Mean: Average (\bar{x} or µ).
    • Median: Middle value.
    • Mode: Most common value.
  • Measures of Variability

    • Range: Maximum – Minimum.
      • Pros: Easy to calculate.
      • Cons: Nonresistant and doesn’t express variability from the center.
    • IQR: Interquartile Range: Q3 – Q1 (spans the middle 50% of values).
      • Pros: Resistant.
      • Calculating quartiles: Split the sorted data into halves at the median (leave the median out!) and take the median of each half.
    • Standard Deviation: Typical distance from the mean (s_x or σ).
      • s_x = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n-1}}
      • To calculate: calculate all the deviations (value – mean), square each, add up, divide by n-1, take square root.
      • Properties: Only use in tandem with the mean; for the same data set, s_x (dividing by n-1) is always greater than or equal to σ (dividing by n).
      • Cons: Nonresistant.
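The step-by-step calculation above can be sketched in Python (the data set is made up):

```python
from math import sqrt

data = [4, 7, 5, 9, 5]              # small made-up data set
n = len(data)
mean = sum(data) / n                # x-bar = 6.0

# Steps: deviations (value - mean), square each, add them up,
# divide by n - 1, take the square root.
deviations = [x - mean for x in data]
s_x = sqrt(sum(d ** 2 for d in deviations) / (n - 1))

print(mean, s_x)  # 6.0 2.0
```

Interpretation in context: values in this data set are typically about 2 units from the mean of 6.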
  • Describing Quantitative Distributions: SOCV

    • Shape: Skewness & modality, and any clusters/gaps.
    • Outliers: 1.5 x IQR (above Q3 or below Q1).
    • Center: Measures of the typical value: mean/median/mode.
    • Variability: Range/IQR/st dev.
    • Always add context! (variables & units).
  • Comparing Mean & Median

    • Mean: Nonresistant.
    • Median: Resistant (not sensitive to skewness/outliers).
  • Outliers

    • 1.5 x IQR (above Q3 or below Q1).
    • Five Number Summary: minimum, Q1, median, Q3, maximum
    • Also good to know: upper & lower bounds (1.5 x IQR) (might not be actual data values)

*Decide which measures to use based on whether resistance is a concern for the distribution
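The 1.5 × IQR rule can be sketched as follows, using the quartile convention from this chapter (split the data at the median, leaving the median out); the data set is made up:

```python
def five_number_summary(data):
    """Min, Q1, median, Q3, max; quartiles are medians of each half,
    leaving the overall median out when n is odd."""
    xs = sorted(data)
    n = len(xs)

    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    half = n // 2
    lower, upper = xs[:half], xs[n - half:]   # drops the middle value if n is odd
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

data = [5, 7, 10, 14, 18, 19, 25, 29, 60]
mn, q1, med, q3, mx = five_number_summary(data)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # may not be actual data values
outliers = [x for x in data if x < low_fence or x > high_fence]
print((mn, q1, med, q3, mx), outliers)  # (5, 8.5, 18, 27, 60) [60]
```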

  • Additional Vocab
    • Statistic: A value that describes a characteristic of a sample.
    • Parameter: A value that describes a characteristic of a population.

Types of MCQs (or FRQs)

  • Compare mean & median (use knowledge of resistance & skewness to answer).
  • Interpret graphical or tabular representation of data:
    • Using it to answer questions about the variable/distribution of the variable.
  • Interpret summary statistics/a value about the distribution:
    • Put it in context (variable & units!).
    • Especially interpreting standard deviation: “the typical distance from the mean”.
  • Match data with its graphical representation OR match graphical representations of the same data.
  • Determine whether there is an association between two categorical variables given data: graphical/tabular representations OR summary statistics:
    • Calculate distribution of one categorical variable for each value of the other (basically a whole bunch of conditional relative frequencies).
    • And then find whether knowing the value of one variable allows you to predict the value of another (eg. are those distributions you calculated the same or not).
  • Describe a distribution (quantitative):
    • SOCV (shape, outliers, center, variability) (see content overview for more information).
    • Context (variables & units of the distribution).
  • Compare distributions (quantitative):
    • SOCV for both distributions.
    • Use explicitly comparative language that relates the two distributions.
  • Make a claim/argument based on a distribution (quantitative):
    • Refer to specific characteristics (eg. SOCV) of the distribution in your answer.
    • Give specific numbers as much as possible.
    • Context (variables & units of the distribution).
    • Then, explain why those characteristics support your claim/argument.
  • Describe a distribution/compare distributions/make a claim based on a distribution (categorical).
  • Construct a certain type of graph for given data:
    • Follow appropriate guidelines for the type of graph (see content summaries for tips).
    • Always label & scale axes appropriately! with units!
    • Add a title with context & others.
  • What is apparent from the histogram but not from the boxplot?
  • Misrepresenting/manipulating data:
    • Why would it be misleading to only report [insert statistic/parameter here]?
    • What would you want to report in order to [achieve specified goal]?
  • Why does one method for determining outliers give you more outliers than the other?

Language & Wording / General Common Mistakes

  • Language & Wording
    • Always include context: distribution & variables & units.
    • Describing distributions: “appears to be” / “approximately” (bc you cannot be sure).
      • Ex. “approximately normal” & “roughly symmetric” (this is a very important one!).
    • Be VERY careful with relative frequencies vs. frequencies / raw counts! (this is a very important one!).
      • Use relative frequencies with groups of different sizes!
      • & say “a greater percentage” not “more”.
      • Plurality vs majority.
      • Always indicate which one you’re using.
    • For histograms & boxplots: keep in mind that you can’t conclusively determine what the values are.
      • Esp for histograms: need to say that a value is in a certain bin (“between [value] and [value]”).
  • Common Mistakes
    • Range, IQR, and st. dev. are single values! not a range of values.
    • Avoid bad statistics: truncated axes & pictograms.
    • Association is not causation!

Chapter 2: Modeling Distributions of Quantitative Data

2.1 Describing Location in a Distribution

  • Percentile: The pth percentile is the value with p% of observations less than or equal to it.

    • Works well with a frequency table of quantitative data.
  • Cumulative Relative Frequency Graphs / Ogives: Plots points corresponding to the percentile of a value in the distribution & points connected with line segments to create the graph.

    • Another way to describe location in a distribution.
  • Standardized Scores (z-scores): How many standard deviations from the mean a value is (& what direction).

    • (value – mean) / st dev
    • Allows for a standard scale to compare values from different distributions.
    • A way to describe location in a distribution.
  • Transforming Data

    • Adding/Subtracting Constant: Affects measure of center/location (not shape/variability).
    • Multiplying/Dividing Constant: Affects measures of center, location, variability (not shape).
    • Multiple Transformations: Follow order of operations.
    • Transformations Related to Z-scores: In a distribution of z-scores, shape remains the same as original distribution, mean always 0, standard deviation always 1.
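A quick check of the transformation rules above (made-up data): adding a constant shifts the center but not the variability, and standardizing gives mean 0 and standard deviation 1:

```python
from statistics import mean, stdev

data = [10, 12, 15, 19, 24]     # made-up data
m, s = mean(data), stdev(data)

# Adding a constant: center shifts by the constant, variability unchanged
shifted = [x + 100 for x in data]
assert mean(shifted) == m + 100
assert abs(stdev(shifted) - s) < 1e-12

# Z-scores: mean 0, standard deviation 1 (shape unchanged)
z = [(x - m) / s for x in data]
assert abs(mean(z)) < 1e-12 and abs(stdev(z) - 1) < 1e-12
```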

2.2 Density Curves & Normal Distributions

  • Density Curve: Simplified model of a distribution of a quantitative variable
    • Always on or above horizontal axis, has an area of exactly 1 underneath it.
    • Always an approximation of data (not an exact model).
    • Models continuous data but often used to approximate discrete distributions as well.
    • Describing Density Curves:
      • Shape: same ways as usual
      • Center: mean (balance point) (µ) & median (divides area of curve in half) (if symmetric, they’re the same)
      • Variability: same measures as usual (σ)
  • Normal Distributions: Bell shaped & symmetric & unimodal distribution
    • Approximated with a normal curve (density curve).
    • Can fully be described by mean (same as median) (µ) and standard deviation (σ).
    • Useful for: real data, chance outcomes, inference methods
  • Empirical Rule: 68% (within 1σ of µ) - 95% (within 2σ of µ) - 99.7% (within 3σ of µ) (for normal distributions)
    • Standard Normal Distribution: distribution of z-scores (mean 0, st dev 1)
  • Finding areas in Normal Distributions
    • Empirical Rule (when applicable).
    • Find z-score & use Table A to look up the area to the left of the z-score (the proportion of values below it): Table A connects z-scores to percentiles in a normal distribution.
    • Technology (see calculator functions)
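Outside of Table A or a calculator, the area to the left of a z-score can be computed from the error function; a sketch (the distribution's mean and standard deviation are made up):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area to the left of z under the standard normal curve
    (what Table A reports)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical distribution: approximately Normal with mean 64, st dev 2.5
mu, sigma = 64, 2.5
z = (68 - mu) / sigma                       # z = 1.6

left = normal_cdf(z)                        # area to the left of 68
right = 1 - left                            # area to the right
between = normal_cdf(1) - normal_cdf(-1)    # ~0.68, matching the empirical rule
print(round(left, 4), round(between, 4))    # 0.9452 0.6827
```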
  • Types of Problems: Area to left, area to right, area between, working backwards (z-score given area)
  • Assessing Normality
    • Plot data (see if it looks normal)
    • Check against empirical rule: Check amount of data within 1, 2, and 3 st dev from mean (w/in 3-5% is pretty good!)
    • Normal Probability / Normal Quantile Plot:
      • Plots each data value’s actual z-score (x) against the z-score predicted if the distribution were normal (y).
      • Look for a straight-ish line on the normal probability plot
        *ideally use all three methods to check!

Types of MCQs (or FRQs)

  • Interpret percentile or z-score (see content summary for info):
    • Provide specific values for the percentages / mean & st dev.
    • Use cumulative relative frequency graphs to determine percentile.
  • Describe how distribution of data will change with a given type of transformation.
  • Find area in a normal distribution (see content summary for info on how).
  • Use percentiles or z-scores to evaluate claims about data:
    • Find & interpret percentile / z-score
      • Percentile: value with p% of observations less than or equal to it
      • Z-score: abt that many standard deviations above / below the mean
    • Draw conclusion using percentile / z-score
  • Normal Distribution Questions:
    • Picture
      • Draw normal curve
      • Label specific distribution (context & mean / st dev)
      • Label boundary values & shade area of interest
    • Calculate z-score(s)
    • Calculate areas using Table A or calculator

Language & Wording/General Common Mistakes

Overall: Units 1 & 2 & 3

  • If you use calculator, ALWAYS LABEL VALUES: State answer in context
  • Is it extrapolation / is the answer reasonable?
  • Predict value & comment on whether it’s reliable:
    • E: correct prediction, plugged into formula for LSRL to get predicted value, say whether it is reliable or not (extrapolation)
  • ALWAYS ADD CONTEXT esp when describing location in a distribution
  • Percentile: “at” a percentile NOT “in” a percentile (bc it’s a location!)
  • Z-scores: always provide the direction! not just “away” from the mean (and provide context (units & distribution / variables)!
  • Normal Distributions:
    • Be careful with direction & tails (two-sided vs one-sided)
    • Distributions of real world data always “approximately normal” (never perfect)
  • Quantitative Data:
    • Discrete & Continuous
    • One-Var:
      • Tabular: frequency table, relative / cumulative frequency table
      • Graphical: dotplot, stemplot, boxplot, histogram, cumulative relative frequency graph
      • Numerical: 5-number summary, center, variability, percentile / z-score
    • Two-Var:
      • Graphical: scatterplot
      • Numerical: r, r2 , LSRL, s
    • Simplified model of data: density curves
  • Categorical Data:
    • Nominal & Ordinal
    • One-Var:
      • Tabular: frequency table, relative frequency table
      • Graphical: bar chart, pie chart
      • Numerical: proportions, etc
    • Two-Var:
      • Tabular: two-way table (frequency OR relative frequency)
      • Graphical: side-by-side bar chart, segmented bar chart, mosaic plot
      • Numerical: proportions, association, etc

Chapter 3: Exploring Two-Variable Quantitative Data

3.1 Scatterplots & Correlation

  • Scatterplots: explanatory x-axis, response y-axis; label & scale axes (you CAN truncate the axes here)
  • Describing Scatterplots: CDOFS
    • Context: state variables & units
    • Direction: pos / neg / no correlation
    • Outliers / unusual features: outliers & points outside the general pattern / clusters
    • Form: linear / nonlinear
    • Strength: r (correlation coefficient) / r2 value (measures whether LSRL is a good fit)
  • r value: measures strength & direction (ONLY for linear models)
    • Cautions
      • r is nonresistant
      • only for linear
      • correlation is not causation!
      • no units
      • unaffected by changing units / changing explanatory & response variables
    • | r | less than 0.5: weak, | r | between 0.5 and 0.75: moderate, | r | greater than 0.75: strong
  • Extrapolation: using a regression line to make predictions way outside of the interval of x-values used to generate the line (beyond the scope of your data)
    • won’t be accurate bc it might not remain linear at such extreme points

3.2 Linear Regression

  • Regression Line: model of how response variable (y) changes as explanatory variable (x) changes
  • Residuals: actual value – predicted value (based on line)
  • Least-Squares Regression Line: line that minimizes sum of squared residuals
  • Explanatory & Response Variables: not necessarily causation (though it could be), just which helps to explain the other
  • \hat{y} = a + bx (y-hat is predicted y-value for a given x-value); predicting so it’s okay if y-hat is not an integer for a real world situation (think of it as an average)
  • A good linear regression line: minimizes the residuals; sum of residuals on an LSRL is always 0
  • Residual Plots: scatterplot that plots residuals against explanatory variable
    • Determines whether a linear model is appropriate (check for random scatter & no leftover curved pattern)
  • s: standard deviation of residuals
    • How well does the line work? -> how good will predictions be?
    • Measures typical residual (distance between predicted & actual)
    • Calculate in the same way as st dev but divide by n-2
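A sketch of the s calculation, using made-up points and an assumed prediction line (not necessarily the true LSRL of these points):

```python
from math import sqrt

xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]
predict = lambda x: 1 + 2 * x        # assumed line: y-hat = 1 + 2x

residuals = [y - predict(x) for x, y in zip(xs, ys)]      # actual - predicted
s = sqrt(sum(r ** 2 for r in residuals) / (len(xs) - 2))  # divide by n - 2
print(round(s, 4))  # 0.2236
```

Interpretation in context: predictions from this line are typically off by about 0.22 units of y.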
  • r2: coefficient of determination (value between 0 and 1, usually expressed as a percentage)
    • Square of correlation r
    • When finding r from r2, make sure to consider direction of correlation!
    • How well the LSRL fits the data: percent reduction in sum of squared residuals when using LSRL instead of mean to make predictions
    • The percent of the variability in the response variable that can be explained by the linear association.
  • Regression to the Mean
    • (to calculate LSRL): slope: b = r (sy / sx ); y-int: a = \bar{y} - b(\bar{x})
    • since LSRL passes through (\bar{x}, \bar{y})
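The slope and intercept formulas can be applied directly to summary statistics; the values below are made up:

```python
# b = r * (s_y / s_x),  a = y-bar - b * x-bar
r = 0.8
x_bar, s_x = 10.0, 2.0     # mean & st dev of x (hypothetical)
y_bar, s_y = 50.0, 5.0     # mean & st dev of y (hypothetical)

b = r * (s_y / s_x)        # slope = 2.0
a = y_bar - b * x_bar      # y-intercept = 30.0

predict = lambda x: a + b * x
assert predict(x_bar) == y_bar   # the LSRL passes through (x-bar, y-bar)
print(b, a)  # 2.0 30.0
```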
  • Correlation & Regression Wisdom
    • Correlation and LSRLs only describe linear relationships
    • r, s, r2 , and LSRL are non-resistant (see influential points)
  • Influential Points: points that, if removed, substantially change the slope, y-int, r, r2 , or s
    • Outliers and high-leverage points are very often influential (but not automatically guaranteed to be)
    • Can do regression calculations with & without the points to see how much influence they have
  • Outliers: doesn’t follow pattern of data and has a large residual
  • High Leverage: x-values much larger / smaller than the other x-values in the data set
  • Removing High-Leverage Points
    • Lower than line & slope negative: slope closer to 0 & y-int lower
    • Lower than line & slope positive: slope steeper & y-int lower
    • Higher than line & slope negative: slope steeper & y-int higher
    • Higher than line & slope positive: slope closer to 0 & y-int higher
    • and sometimes effects on r, r2 , or s (use the point to evaluate this)
  • Removing Outliers
    • Heavily impacts r, r2 , and s: r and r2 usually go up (strength of association & fit of LSRL way higher), while s goes down (the typical residual shrinks)
    • Doesn’t generally impact the LSRL itself much, since outliers without high leverage have typical x-values

3.3 Transforming to Achieve Linearity

  • Applying a function to a quantitative variable (changes the scale of measurement) in order to make the scatterplot more approximately linear (in order to use linear regression methods)
  • Transforming with powers & roots (for power models: y = ax^p)
    • Option 1: Raise values of x to power p (graph (x^p, y) (it will be linear))
    • Option 2: pth root of values of y (graph (x, \sqrt[p]{y}) (it will be linear))
    • When p is known: use the above methods; When p is unknown: guess & check OR use log (more universal & works for unknown power models)
  • Transforming with Logs (for power OR exponential models)
    • Apply log transformation (log10 or ln)
    • For power models (y = ax^p): use a log-log (both variables)
    • For exponential models (y = ab^x): take log of y-var
    • To choose a model: most random scatter (tiebreaker highest r2 value)
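A sketch of the exponential-model case: made-up data generated from y = 3 · 2^x becomes exactly linear after taking ln(y), and undoing the transformation recovers the model:

```python
from math import log, exp

xs = [0, 1, 2, 3, 4]
ys = [3 * 2 ** x for x in xs]        # exponential model: y = 3 * 2^x

log_ys = [log(y) for y in ys]        # ln(y) is linear in x for exponential models

# Least-squares line through (x, ln y)
n = len(xs)
x_bar = sum(xs) / n
ly_bar = sum(log_ys) / n
b = (sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, log_ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = ly_bar - b * x_bar

# Transform back: y-hat = e^a * (e^b)^x
coeff, base = exp(a), exp(b)
print(round(coeff, 6), round(base, 6))  # 3.0 2.0
```

When interpreting the fitted line, remember the units are log units until you transform back.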

Types of MCQs (or FRQs)

  • Slope: for every [increase / decrease] of one [unit of x], there is a predicted [increase / decrease] of [slope] [units of y]
  • Y-int: when the [context of x] is 0 [units], the predicted value of [context of y] would be [y-int]
  • r-value / correlation coefficient: the correlation coefficient of [r] indicates that there is a [strong / moderate / weak], [positive / negative] correlation between [context of x] and [context of y]
  • s (standard deviation of residuals): on average, the model mispredicts [context of y] by [s units] using the LSRL
  • r2 value: [r2] percent of the variability in [context of y] can be explained by the linear association with [context of x]
  • Residual plot: the residual plot [is randomly scattered / has a pattern], indicating that a linear model [is / is not] appropriate
  • Effect of outliers / high leverage points on measures of strength or the LSRL
    • What would happen when they are removed?
  • Can you infer causation based on correlation? (no) (might not be worded like this directly though)
  • Which is explanatory & which is response?
  • Interpret a feature of the association / regression line
  • LSRL in general: for every increase of 1 [unit of x], there is a predicted change of [b, the slope] [units of y], starting from a predicted value of [a, the y-int] when x = 0

Tips/Common Errors

  • Be very careful with predicted vs actual values
    • Remember to add a hat on top of any predicted value! And ALWAYS SAY THEY’RE “PREDICTED”
  • When defining LSRL: always define variables & add units for x & y
    • Particularly important when data has been transformed to get the LSRL
  • With transformed data: be mindful of units & always convert back to “regular” units where appropriate / needed!
  • With residuals: pay attention to whether the question is asking for predicted residual (from LSRL / LSRL equation) or actual residual (from residual plot / graph)!
  • Can’t go backwards with LSRL & predict x value given a definite y value (can find x value from y-hat though)
  • How to tell if linear model is a good fit: high r2 value, s is small relative to the data

Chapter 4: Collecting Data

4.1 Sampling & Surveys

  • Sampling: selecting a group of individuals out of a whole population to learn about the population (ideally a sample that’s representative of the population)
  • Sampling Frame: the group of members from the population from which we select our sample
  • Sample Survey: collects data from the individuals in the sample (to learn about the population)
  • Types of Sampling
    • Random Sampling: involves a chance process to determine which individuals are in the sample
      • SRS: every group of n individuals has an equal chance of being selected. Label individuals with numbers, use a random number generator, and select the individuals that correspond. Sample without replacement: don’t include repeated numbers! Calculator: MATH -> PROB -> randIntNoRep(1, N), OR use Table D
      • Stratified: an SRS is selected from each stratum; Stratum: a group with similar characteristics assumed to be associated with the variables being measured: ensures you get some individuals from each stratum (more precise & accurate estimates)
      • Clustered: randomly select entire clusters; Clusters: groups that each (hopefully) mirror the population, so responses vary within a cluster but the clusters resemble one another: no statistical advantage but resource-efficient
      • Systematic: randomly select starting point & select every kth individual after: make sure no patterns coincide with your systematic pattern; allows you not to have to have identifiers (eg. names) for all the individuals in the population (useful w unknown population size)
        *Decide on sampling type based on population & variable & resources available to you
    • Bad Sampling
      • Convenience Sampling: individuals who are easy to reach
      • Voluntary Response Sampling: allows individuals to choose to be in sample: leads to voluntary response bias; individuals who feel strongly / have similar opinions more likely to respond
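The SRS procedure (label individuals, generate random numbers, sample without replacement) can be sketched in Python; the population names are hypothetical:

```python
import random

# Label individuals 1..100; random.sample chooses without replacement,
# so no repeated numbers need to be skipped (like randIntNoRep on a calculator).
population = [f"student_{i}" for i in range(1, 101)]

random.seed(42)                       # fixed seed for a reproducible illustration
srs = random.sample(population, 10)

assert len(set(srs)) == 10            # 10 distinct individuals, no repeats
```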
  • Bias: a sampling method is biased if it is likely to systematically overestimate or underestimate the true value
    • Undercoverage: certain individuals less likely / cannot be chosen in a sample
    • Nonresponse: individual chosen for sample can’t be contacted / doesn’t participate: diff from voluntary response bias bc this is only after sample has been selected
    • Response: systematic pattern of inaccurate answers to a survey question; ex. wording or order of questions
  • Sampling uses resources more efficiently than a census (which collects data from every individual in the population)

4.2 Experiments

  • Observational Studies: observes individuals & measures variables of interest (does not influence responses)
    • Retrospective: uses existing data; Prospective: tracks individuals into the future
    • Pros: fewer ethical concerns; Cons: confounding variables (cannot determine causality with observational studies)
  • Experiments: imposes a treatment on individuals & measures their responses; pros: can establish causality bc avoids confounding; cons: resources, ethics. Important vocab:
    • Placebo: no active ingredient; optional: placebo effect
    • Treatment: condition imposed on individuals
    • Experimental Unit: individual to which treatment applied: Subject: human experimental unit
    • Factor: explanatory var that’s manipulated (may cause change in response var); treatments formed using levels of each of the factors (in a multifactor experiment)
    • Levels: diff possible values of a factor
  • Confounding: when variables are associated in such a way that their effects on a response variable can’t be distinguished from one another
  • Control Group: provides a baseline for comparison; not always required but there does need to be some sort of comparison
  • Random Assignment of Treatments: avoids confounding variables
  • Control: all other variables constant; avoids confounding variables
  • Replication: use enough subjects (so that differences in effects can be distinguished from chance variation); reduces the impact of chance variation
  • Double Blind: neither subjects nor the ppl measuring know the treatment; Triple Blind: statistician doesn’t know either
  • Single-Blind: only one of the groups (above) knows
  • Selection Biases: voluntary, convenience, undercoverage; The other two: nonresponse, response
Designing Experiments (required in a good experiment)
  • Completely Randomized Design: experimental units assigned to treatments completely at random
  • Randomized Block Design: random assignment within each block: Form blocks based on: confounding variable / the variable that’s the best predictor of the response variable; It accounts for variability in [response var] created by [block var]
  • Block: group of experimental units known to be similar in some way that could affect their response to the treatments
  • Matched Pairs Design: a type of RBD where blocks are pairs; pairs of similar experimental units, either true pairs & randomize treatment or each individual receives both treatments in a random order: Can establish causality

4.3 Using Studies Wisely

  • Statistical Significance: observed diff is larger than can be attributed to chance alone: Random assignment of treatments allows a significant difference to be attributed to the treatments (causation)
  • Statistical Inference: generalizing results to a population: Assumes sample is representative of the population (ensured by random sampling)
  • Ethical Data Gathering: informed consent, benefit
  • Sampling Variability: diff random samples (same size, same population) produce diff estimates
  • Larger sample sizes produce more precise estimates (less sampling variability, so estimates tend to be closer to the true value)
Types of MCQs (or FRQs)
  • Is it an observational study or an experiment?
  • Describe the type of bias present
    • Describe how members respond differently, then describe how this leads to overestimated / underestimated values
  • Describe a confounding variable
    • Describe that variable’s association with both explanatory AND response vars
  • Describe an appropriate experiment design
    • Describe how to randomly assign treatments
      • Create groups AND define which group gets which treatment
    • Describe why certain experiment designs would be preferable. Draw a diagram to explain the experiment design

Tips/Common Errors

  • Describing how to use a random number generator / table: ALWAYS account for repeated numbers
  • If you flip coins: make sure everyone flips a coin (not just until you have one group & then put the rest in another; this would not be a completely random assignment)
  • Be really careful not to mix up language for experiments & language for observational studies!

Chapter 5: Probability

5.1 Randomness, Probability, and Simulation

  • Random Process: generates outcomes purely by chance; unpredictable in short-term but predictable in the long run
  • Probability: the likelihood that an event happens; the proportion of times an outcome would occur in the long run; P(A) = (number of outcomes in event A) / (total number of outcomes in sample space)
  • Law of Large Numbers: more trials means proportion approaches true probability (more accurate)
  • Simulation: imitates random process such that simulated outcomes are consistent with real-world outcomes: Options: random number generator, random number table, flipping coin, draw cards, etc; describing simulation: remember to say that repeated numbers will be ignored!; for random number table: remember that numbers need to be the same length digit-wise
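A minimal simulation of the law of large numbers: as the number of coin-flip trials grows, the proportion of heads settles near the true probability 0.5:

```python
import random

random.seed(1)                        # fixed seed for a reproducible illustration

def prop_heads(trials):
    """Simulate fair coin flips and return the proportion of heads."""
    return sum(random.random() < 0.5 for _ in range(trials)) / trials

for n in (10, 1_000, 100_000):
    print(n, prop_heads(n))           # proportions drift toward 0.5 as n grows
```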

5.2 Probability Rules

  • Probability Model: description of a random process that includes a list of all possible outcomes & the probability for each outcome. Sum of all probabilities is 1 (or 100%); each probability is between 0 & 1 (or 0% & 100%)
  • Sample Space: list of all outcomes: options: chart, table, venn diagram, probability tree
  • Event: any collection of outcomes from a random process. Notation & vocab: P(A) = (number of outcomes in event A) / (total number of outcomes in sample space)
  • Complement: the probability that an event does not occur: P(A^C) = 1 – P(A)
  • Intersection: A ∩ B (both A and B must be true): P(A and B) = P(A ∩ B) (joint probability)
  • Union: A ⋃ B (at least one–either A or B, or both–must be true): P(A or B) = P(A ⋃ B)
  • Mutually Exclusive Events: cannot occur simultaneously (no outcomes in common), i.e. P(A and B) = 0 (also known as disjoint):
    • Addition Rule: P(A or B) = P(A) + P(B)
  • Non-Mutually Exclusive Events: can occur simultaneously: use a two-way table!
    • General Addition Rule: P(A or B) = P(A) + P(B) – P(A and B)
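The general addition rule can be checked by brute force over a small sample space (one roll of a fair die):

```python
sample_space = {1, 2, 3, 4, 5, 6}

A = {x for x in sample_space if x % 2 == 0}   # even roll: {2, 4, 6}
B = {x for x in sample_space if x > 3}        # roll above 3: {4, 5, 6}

p = lambda event: len(event) / len(sample_space)

lhs = p(A | B)                     # P(A or B) = 4/6
rhs = p(A) + p(B) - p(A & B)       # 3/6 + 3/6 - 2/6 = 4/6
assert abs(lhs - rhs) < 1e-12      # general addition rule holds
```

Since A and B are not mutually exclusive (they share 4 and 6), subtracting P(A and B) avoids double-counting.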

5.3 Conditional Probability & Independence

  • Conditional Probability: probability that an event happens given that another event is known to have happened: P(A | B) = P(A ∩ B) / P(B); tree diagrams are useful with conditional probability