AP Statistics Exam Study Guide

AP Statistics Cumulative AP Exam Study Guide

Statistics Basics

  • Statistics: Science of collecting, analyzing, and drawing conclusions from data.
  • Descriptive Statistics: Methods for organizing and summarizing data.
  • Inferential Statistics: Making generalizations from a sample to a population.
  • Population: Entire collection of individuals or objects.
  • Sample: Subset of the population selected for study.
  • Variable: Characteristic whose value changes.
  • Data: Observations on one or more variables.

Types of Variables

  • Categorical (Qualitative): Basic characteristics.
  • Numerical (Quantitative): Measurements or observations of numerical data.
    • Discrete: Listable sets (counts).
    • Continuous: Any value over an interval (measurements).
  • Univariate: One variable.
  • Bivariate: Two variables.
  • Multivariate: Many variables.

Distributions

  • Symmetrical: Data with roughly the same shape and size on both sides of the center.
  • Uniform: Every class has equal frequency.
  • Skewed: One side (tail) is longer than the other. Skewness is in the direction the tail points.
  • Bimodal: Two or more classes have large frequencies separated by another class.

Describing Numerical Graphs (S.O.C.S.)

  • Shape: Symmetrical, skewed (right/left), uniform, or bimodal.
  • Outliers: Unusual features such as outliers, gaps, clusters, etc.
  • Center: Middle of the data (mean, median, mode).
  • Spread: Variability (range, standard deviation, IQR).
  • Context: Everything must be in context to the data and situation.
  • Comparison: When comparing distributions, use comparative language.

Parameters vs. Statistics

  • Parameter: Value of a population (typically unknown).
  • Statistic: Value calculated from a sample, used to estimate a population parameter.

Measures of Center

  • Median: Middle point of the data (50th percentile) in numerical order.
  • Mean: μ for population, \bar{x} for sample.
  • Mode: Occurs most in the data. Can have multiple modes or none.

Measures of Spread (Variability)

  • Range: Max - Min.
  • IQR: Interquartile range (Q3 - Q1).
  • Standard Deviation: σ for population, s for sample. Measures typical deviation from the mean. The sample standard deviation divides the sum of squared deviations by df = n - 1 before taking the square root.
  • Sum of deviations from the mean is always zero.
  • Variance: Standard deviation squared.
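The measures of center and spread above can be checked with a short Python sketch that uses only the standard library. The data set is made up for illustration, and `statistics.quantiles` uses one of several quartile conventions (its default "exclusive" method), so a calculator may report slightly different quartiles for other data sets.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample, made up for illustration

mean = statistics.mean(data)          # x-bar
median = statistics.median(data)      # 50th percentile
s = statistics.stdev(data)            # sample standard deviation (uses n - 1)
variance = statistics.variance(data)  # sample variance = s squared
data_range = max(data) - min(data)    # Max - Min

# Quartiles; the default "exclusive" method is one convention among several.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Deviations from the mean always sum to zero (up to rounding).
deviations_sum = sum(x - mean for x in data)

print(mean, median, round(s, 3), variance, data_range, iqr, deviations_sum)
```

Replacing the largest value with an extreme one changes the mean, range, and standard deviation noticeably but leaves the median untouched, which is the resistant vs. non-resistant distinction in action.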

Resistant vs. Non-Resistant Measures

  • Resistant: Not affected by outliers (Median, IQR).
  • Non-Resistant: Affected by outliers (Mean, Range, Variance, Standard Deviation, Correlation Coefficient (r), Least Squares Regression Line (LSRL), Coefficient of Determination (r^2)).

Comparison of Mean & Median Based on Graph Type

  • Symmetrical: Mean and median are approximately the same.
  • Skewed Right: Mean > Median.
  • Skewed Left: Mean < Median.
  • Mean is pulled in the direction of the skew.
  • Trimmed Mean: Use a % to remove observations from the top and bottom to eliminate outliers.

Linear Transformations of Random Variables

  • μ_{a + bX} = a + bμ_X (Mean is changed by both addition/subtraction & multiplication/division).
  • σ_{a + bX} = |b|σ_X (Standard deviation is changed by multiplication/division ONLY).

Combination of Two (or More) Random Variables

  • μ_{X ± Y} = μ_X ± μ_Y (Add or subtract the means).
  • σ^2_{X ± Y} = σ^2_X + σ^2_Y (Always add the variances - X & Y MUST be independent).
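As a rough illustration of the transformation and combination rules, the sketch below applies them to made-up means and standard deviations. The function names and the numbers are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def linear_transform(mu_x, sigma_x, a, b):
    """Mean and standard deviation of a + bX."""
    return a + b * mu_x, abs(b) * sigma_x

def combine_independent(mu_x, sigma_x, mu_y, sigma_y, sign=+1):
    """Mean and SD of X + Y (sign=+1) or X - Y (sign=-1).

    Assumes X and Y are independent, so the variances add in both cases.
    """
    mean = mu_x + sign * mu_y
    sd = math.sqrt(sigma_x ** 2 + sigma_y ** 2)
    return mean, sd

# Hypothetical values: X has mean 10 and SD 3; Y has mean 4 and SD 4.
mean_t, sd_t = linear_transform(10, 3, a=5, b=-2)
mean_d, sd_d = combine_independent(10, 3, 4, 4, sign=-1)

print(mean_t, sd_t)  # mean and SD of 5 - 2X
print(mean_d, sd_d)  # mean and SD of X - Y
```

Note that the standard deviation of X - Y is found by adding the variances (9 + 16 = 25) and then taking the square root, never by subtracting the standard deviations.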

Z-Score

  • Standardized score indicating how many standard deviations an observation is from the mean. Standardizing a normal distribution gives the standard normal curve with μ = 0 & σ = 1.
  • z = {x - μ \over σ}

Normal Curve

  • Bell-shaped and symmetrical.
  • As σ increases, the curve flattens.
  • As σ decreases, the curve thins.

Empirical Rule (68-95-99.7)

  • Measures 1σ, 2σ, and 3σ on normal curves from the center μ.
  • About 68% of the population is within 1σ of μ (between μ - 1σ and μ + 1σ).
  • About 95% of the population is within 2σ of μ (between μ - 2σ and μ + 2σ).
  • About 99.7% of the population is within 3σ of μ (between μ - 3σ and μ + 3σ).
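The 68-95-99.7 percentages can be confirmed numerically. This is a small sketch using `statistics.NormalDist` from the Python standard library; the z-score example at the end uses made-up values (x = 130, μ = 100, σ = 15).

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # standard normal curve

def within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_1, within_2, within_3 = within(1), within(2), within(3)
print(round(within_1, 4), round(within_2, 4), round(within_3, 4))

# z-score for a hypothetical observation: z = (x - mu) / sigma
z_score = (130 - 100) / 15
print(z_score)
```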

Boxplots

  • For medium or large numerical data; doesn't contain original observations.
  • Use modified boxplots with fences at 1.5 * IQR from the ends of the box (Q1 & Q3).
  • Points outside the fence are outliers.
  • Whiskers extend to the smallest & largest observations within the fences.
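The fence rule for modified boxplots is simple arithmetic, sketched below. The data set is hypothetical, and the quartiles are supplied by hand (medians of the lower and upper halves) because quartile conventions differ between calculators and software.

```python
def fences(q1, q3):
    """Lower and upper fences, 1.5 * IQR beyond the ends of the box."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 4, 4, 5, 7, 9, 30]  # hypothetical data
q1, q3 = 4, 9                  # quartiles found by hand for this data set

lower_fence, upper_fence = fences(q1, q3)

# Points outside the fences are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers stop at the most extreme observations still inside the fences.
whisker_low = min(x for x in data if x >= lower_fence)
whisker_high = max(x for x in data if x <= upper_fence)

print(lower_fence, upper_fence, outliers, whisker_low, whisker_high)
```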

5-Number Summary

  • Minimum, Q1 (25th Percentile), Median, Q3 (75th Percentile), Maximum

Probability Rules

  • Sample Space: Collection of all outcomes.
  • Event: Any subset of outcomes from the sample space.
  • Complement: All outcomes not in the event.
  • Union: A or B, all outcomes in A, in B, or in both. A ∪ B
  • Intersection: A and B, only the outcomes that are in both A and B (the overlap). A ∩ B
  • Mutually Exclusive (Disjoint): A and B have no intersection; they cannot happen at the same time.
  • Independent: Knowing that one event occurred doesn't change the probability of the other.
  • Experimental Probability: Number of successes from an experiment divided by the total amount from the experiment.
  • Law of Large Numbers: As an experiment is repeated, the experimental probability gets closer to the true probability.

Probability Rules (Formulas)

  • All values are 0 ≤ P ≤ 1.
  • Probability of sample space is 1.
  • Complement: P(not A) = 1 - P(A)
  • Addition: P(A or B) = P(A) + P(B) – P(A & B)
  • Multiplication: P(A & B) = P(A) * P(B) if A & B are independent.
  • P(at least 1) = 1 – P(none)
  • Conditional Probability: P(A|B) = {P(A & B) \over P(B)}
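A worked example may help tie the formulas together. The sketch below uses a standard 52-card deck with A = "card is a heart" and B = "card is a face card (J, Q, K)"; exact fractions are used so no rounding is involved. The "at least one" example assumes three draws with replacement so the draws are independent.

```python
from fractions import Fraction

p_a = Fraction(13, 52)       # P(heart)
p_b = Fraction(12, 52)       # P(face card)
p_a_and_b = Fraction(3, 52)  # P(heart and face card)

p_not_a = 1 - p_a                   # complement rule
p_a_or_b = p_a + p_b - p_a_and_b    # addition rule
p_a_given_b = p_a_and_b / p_b       # conditional probability

# A and B are independent exactly when P(A & B) = P(A) * P(B).
independent = (p_a_and_b == p_a * p_b)

# P(at least one heart in 3 draws with replacement) = 1 - P(no hearts)
p_at_least_one = 1 - p_not_a ** 3

print(p_not_a, p_a_or_b, p_a_given_b, independent, p_at_least_one)
```

Here P(A|B) equals P(A), which is another way of seeing that suit and face-card status are independent in a standard deck.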

Correlation Coefficient

  • (r) - Quantitative assessment of the strength and direction of a linear relationship. (use ρ (rho) for population parameter)
  • Values: [-1, 1] ; 0 - no correlation, (0, ±0.5) - weak, [±0.5, ±0.8) - moderate, [±0.8, ±1] - strong

Least Squares Regression Line (LSRL)

  • Line of best fit. Minimizes the sum of the squared deviations (residuals) from the line. Used with bivariate data.
  • \hat{y} = a + bx; x is independent, y is dependent.
  • Residuals (error) - vertical difference of a point from the LSRL. All residuals sum to 0.
  • Residual = y - \hat{y}
  • Residual Plot - scatterplot of (x, residual). No pattern indicates that a linear model is appropriate.
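To make the LSRL and residual definitions concrete, here is a minimal sketch that computes the slope, intercept, and correlation from the standard least-squares formulas (b = S_xy / S_xx and a = ȳ - b·x̄, which the guide does not spell out). The five data points are invented for illustration.

```python
import math

def lsrl(xs, ys):
    """Return (a, b, r) for y-hat = a + b*x using the least-squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    b = sxy / sxx                  # slope
    a = y_bar - b * x_bar          # intercept; the line passes through (x-bar, y-bar)
    r = sxy / math.sqrt(sxx * syy) # correlation coefficient
    return a, b, r

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical responses

a, b, r = lsrl(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # residual = y - y-hat

print(round(a, 3), round(b, 3), round(r, 4), round(r ** 2, 3))
print([round(e, 3) for e in residuals], round(sum(residuals), 10))
```

Plotting `xs` against `residuals` would give the residual plot described above; for these points the residuals sum to zero, as they always do for an LSRL.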

Coefficient of Determination

  • (r^2) - Proportion of variation in y explained by the relationship of (x, y). Never use the adjusted r^2.
  • Interpretations: (must be in context!)
    • Slope (b): For each one-unit increase in x, the predicted y increases/decreases by the slope amount.
    • Correlation coefficient (r): There is a (strength, direction, linear) association between x & y.
    • Coefficient of determination (r^2): Approximately r^2% of the variation in y can be explained by the LSRL of x and y.
  • Extrapolation - the LSRL should not be used to predict values outside the range of the original data.
  • Influential Points - if removed, significantly change the LSRL.
  • Outliers - points with large residuals.

Census

  • A complete count of the population. Why not to use a census?
    • Expensive
    • Impossible to do
    • If sampling is destructive, a census would destroy the entire population

Sampling Frame

  • A list of every individual in the population, from which the sample is drawn.

Sampling Design

  • Refers to the method used to choose a sample.

SRS (Simple Random Sample)

  • Chosen so that each unit has an equal chance of being selected and every possible set of n units has an equal chance of being the sample.
    • Advantages: Easy and unbiased
    • Disadvantages: Large σ^2 (variance) and must have a list of the entire population.

Stratified

  • Divide the population into homogeneous groups called strata, then take an SRS from each stratum.
    • Advantages: More precise than an SRS and cost reduced if strata are already available.
    • Disadvantages: Difficult to divide into groups, more complex formulas & must know population.

Systematic

  • Use a systematic approach (every 50th) after choosing randomly where to begin.
    • Advantages: Unbiased, the sample is evenly distributed across the population & don’t need to know population.
    • Disadvantages: A large σ^2 and can be confounded by trends.

Cluster Sample

  • Based on location. Select a random location and sample ALL at that location.
    • Advantages: Cost is reduced, unbiased & don’t need to know the population.
    • Disadvantages: May not be representative of the population and has complex formulas.
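The four sampling designs above differ mainly in where the randomness enters. The sketch below is one possible way to express each design with Python's `random` module; the population of 100 numbered units, the strata, the clusters, and the seed are all hypothetical.

```python
import random

def srs(population, n, rng):
    """Simple random sample: every set of n units is equally likely."""
    return rng.sample(population, n)

def stratified(strata, n_per_stratum, rng):
    """Take an SRS from each stratum (strata: dict of name -> list of units)."""
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

def systematic(population, k, rng):
    """Pick a random start among the first k units, then take every k-th unit."""
    start = rng.randrange(k)
    return population[start::k]

def cluster(clusters, rng):
    """Pick one cluster at random and sample ALL units in it."""
    name = rng.choice(sorted(clusters))
    return name, clusters[name]

rng = random.Random(2024)          # seeded so the example is repeatable
population = list(range(1, 101))   # hypothetical units numbered 1..100

srs_sample = srs(population, 10, rng)
strat_sample = stratified({"low": population[:50], "high": population[50:]}, 5, rng)
sys_sample = systematic(population, 10, rng)
cluster_name, cluster_sample = cluster(
    {"north": population[:25], "south": population[25:50],
     "east": population[50:75], "west": population[75:]}, rng)

print(srs_sample, strat_sample, sys_sample, cluster_name, sep="\n")
```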

Random Digit Table

  • Each entry is equally likely, and each digit is independent of the rest.

Random # Generator

  • Calculator or computer program

Bias

  • Systematic error that favors certain outcomes. Bias relates to the center of a sampling distribution: a biased method produces estimates that are not centered at the true parameter.

Sources of Bias

  • Voluntary Response: People choose themselves to participate.
  • Convenience Sampling: Ask people who are easy or comfortable to ask.
  • Undercoverage: Some group(s) are left out of the selection process.
  • Non-response: Someone cannot or does not want to participate.
  • Response: False answers due to question wording.

Experimental Design

  • Observational Study: Observe outcomes without giving a treatment.
  • Experiment: Actively impose a treatment on the subjects.
  • Experimental Unit: Single individual or object that receives a treatment.
  • Factor: Explanatory variable being tested.
  • Level: A specific value for the factor.
  • Response Variable: What you are measuring with the experiment.
  • Treatment: Experimental condition applied to each unit.
  • Control Group: Group used to compare the factor to for effectiveness (doesn't have to be placebo).
  • Placebo: Treatment with no active ingredients.
  • Blinding: Subjects are unaware of which treatment they are receiving.
  • Double Blinding: Neither subjects nor evaluators know which treatment is being given.

Principles of Experimental Design

  • Control: Keep all extraneous variables constant.
  • Replication: Use many subjects to quantify the natural variation in the response.
  • Randomization: Use chance to assign subjects to treatments.
  • The only way to show cause and effect is with a well designed, well controlled experiment.

Experimental Designs

  • Completely Randomized: All units are randomly allocated among the treatments.
  • Randomized Block: Units are blocked and then randomly assigned within each block (reduces variation).
  • Matched Pairs: Units are matched and then randomly assigned. OR individuals do both treatments in random order (assignment is dependent).
  • Confounding Variables: Effect of the variable on the response cannot be separated from the factor being tested - happens in observational studies.
  • Randomization reduces bias by spreading extraneous variables to all groups.
  • Blocking helps reduce variability. Another way to reduce variability is to increase sample size.
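Random assignment is easy to carry out with software. The following is a sketch, not a prescribed procedure: it shuffles the units and deals them out to treatments, first for a completely randomized design and then separately within each block. The subject labels, the age blocks, and the seed are made up.

```python
import random

def completely_randomized(units, treatments, rng):
    """Randomly allocate units among treatments in (nearly) equal group sizes."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

def randomized_block(blocks, treatments, rng):
    """Randomize separately within each block (blocks: dict of name -> units)."""
    return {name: completely_randomized(units, treatments, rng)
            for name, units in blocks.items()}

rng = random.Random(7)                       # seeded so the example is repeatable
subjects = [f"S{i}" for i in range(1, 13)]   # 12 hypothetical subjects
treatments = ["drug", "placebo"]

crd = completely_randomized(subjects, treatments, rng)
blocks = {"under 40": subjects[:6], "40 and over": subjects[6:]}
rbd = randomized_block(blocks, treatments, rng)

print(crd)
print(rbd)
```

In the blocked version every block contributes the same number of subjects to each treatment, which is how blocking removes the blocking variable as a source of variation between groups.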

Random Variable

  • A numerical value that depends on the outcome of an experiment.

Discrete Probability Distributions

  • Gives values & probabilities associated with each possible x.
  • μ_X = Σ x_i * p(x_i)
  • σ^2_X = Σ (x_i - μ_X)^2 * p(x_i)
  • Fair Game: The expected pay-out equals the cost to play, so the expected net gain is 0.
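The mean and variance formulas for a discrete random variable can be applied directly to a probability table. In this sketch the payouts and probabilities describe an invented game; the fair price is taken to be the expected payout, since a game is fair when the cost to play equals the expected payout.

```python
import math

payouts = [0, 1, 5]        # hypothetical payouts x_i in dollars
probs = [0.6, 0.3, 0.1]    # hypothetical probabilities p(x_i); they sum to 1

mu = sum(x * p for x, p in zip(payouts, probs))               # expected value
var = sum((x - mu) ** 2 * p for x, p in zip(payouts, probs))  # variance
sd = math.sqrt(var)                                           # standard deviation

fair_price = mu  # charging this amount makes the expected net gain zero

print(round(mu, 2), round(var, 2), round(sd, 3), round(fair_price, 2))
```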

Special Discrete Distributions

  • Binomial Distributions:

    • Two mutually exclusive outcomes, fixed number of trials (n), each trial is independent, probability (p) of success is the same for all trials.
    • Random variable - is the number of successes out of a fixed # of trials. Starts at X = 0 and is finite.
    • μ_X = np
    • σ_X = \sqrt{npq}, where q = 1 - p
    • Calculator: binomialpdf (n, p, x) = single outcome P(X = x), binomialcdf (n, p, x) = cumulative outcome P(X ≤ x), 1 - binomialcdf (n, p, (x - 1)) = cumulative outcome P(X ≥ x)
  • Geometric Distributions:

    • Two mutually exclusive outcomes, each trial is independent, probability (p) of success is the same for all trials. (NOT a fixed number of trials)
    • Random Variable – the number of the trial on which the FIRST success occurs. Starts at X = 1 and has no upper limit.
    • Calculator: geometricpdf (p, a) = single outcome P(X = a), geometriccdf (p, a) = cumulative outcome P(X ≤ a), 1 - geometriccdf (p, (a - 1)) = cumulative outcome P(X ≥ a)
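For readers without a graphing calculator, the same probabilities can be computed from the standard formulas. The helper names below are hypothetical (they are not calculator commands), and each cumulative function returns P(X ≤ x), so "at least" probabilities come from the complement with x - 1. The geometric formulas P(X = a) = (1 - p)^(a - 1) p and P(X ≤ a) = 1 - (1 - p)^a are standard results not written out in the guide.

```python
from math import comb, sqrt

def binom_pmf(n, p, x):
    """P(X = x) for a binomial random variable."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(n, p, x):
    """P(X <= x) for a binomial random variable."""
    return sum(binom_pmf(n, p, k) for k in range(x + 1))

def geom_pmf(p, a):
    """P(X = a): the first success occurs on trial a."""
    return (1 - p) ** (a - 1) * p

def geom_cdf(p, a):
    """P(X <= a): the first success occurs on or before trial a."""
    return 1 - (1 - p) ** a

# Hypothetical binomial setting: n = 10 trials, p = 0.5
p_exactly_3 = binom_pmf(10, 0.5, 3)
p_at_most_3 = binom_cdf(10, 0.5, 3)
p_at_least_4 = 1 - binom_cdf(10, 0.5, 3)     # P(X >= 4) = 1 - P(X <= 3)
binom_mean, binom_sd = 10 * 0.5, sqrt(10 * 0.5 * 0.5)

# Hypothetical geometric setting: p = 0.2
p_first_on_3 = geom_pmf(0.2, 3)
p_by_trial_3 = geom_cdf(0.2, 3)

print(p_exactly_3, p_at_most_3, p_at_least_4, binom_mean, round(binom_sd, 3))
print(round(p_first_on_3, 3), round(p_by_trial_3, 3))
```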

Continuous Random Variable

  • Numerical values that fall within a range or interval (measurements). Area under the curve always = 1. To find probabilities, find the area under the curve.

Unusual Density Curves

  • Any shape (triangles, etc.)

Uniform Distributions

  • Uniformly (evenly) distributed, shape of a rectangle

Normal Distributions

  • Symmetrical, unimodal, bell shaped curves defined by the parameters μ & σ.
  • Calculator:
    • Normalpdf – used for graphing only
    • Normalcdf(lower bound, upper bound, μ, σ) – finds probability
    • InvNorm(p) – z-score OR InvNorm(p, μ, σ) – gives x-value
  • To assess Normality - Use graphs – dotplots, boxplots, histograms, or normal probability plot.
  • Distribution – is all of the values of a random variable.
  • Sampling Distribution – of a statistic is the distribution of the values the statistic takes over all possible samples of the same size. When the sampling distribution is approximately normal, use normalcdf to calculate probabilities.

Sampling Distributions

  • μ_{\bar{x}} = μ_X
  • σ_{\bar{x}} = {σ \over \sqrt{n}} (standard deviation of the sample means)
  • μ_{\hat{p}} = p
  • σ_{\hat{p}} = \sqrt{{pq \over n}} (standard deviation of the sample proportions)
  • μ_{\bar{x}_1 - \bar{x}_2} = μ_1 - μ_2
  • σ_{\bar{x}_1 - \bar{x}_2} = \sqrt{{σ_1^2 \over n_1} + {σ_2^2 \over n_2}} (standard deviation of the difference in sample means)
  • μ_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2
  • σ_{\hat{p}_1 - \hat{p}_2} = \sqrt{{p_1 q_1 \over n_1} + {p_2 q_2 \over n_2}} (standard deviation of the difference in sample proportions)
  • μ_b = β
  • s_b: standard error of the slope of the LSRL (do not need to find; usually given in computer printout)
  • Standard error – estimate of the standard deviation of the statistic
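As an illustration of how these formulas are used, the sketch below builds the approximate sampling distributions of a sample mean and a sample proportion and finds one tail probability. It assumes the normal approximation is appropriate; the population values (μ = 100, σ = 15, n = 36, p = 0.5, n = 100) and the function names are invented for the example.

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_dist(mu, sigma, n):
    """Approximate sampling distribution of x-bar: mean mu, SD sigma / sqrt(n)."""
    return NormalDist(mu, sigma / sqrt(n))

def sample_prop_dist(p, n):
    """Approximate sampling distribution of p-hat: mean p, SD sqrt(pq / n)."""
    return NormalDist(p, sqrt(p * (1 - p) / n))

x_bar = sample_mean_dist(mu=100, sigma=15, n=36)
p_above_105 = 1 - x_bar.cdf(105)   # P(x-bar > 105), like normalcdf(105, infinity, 100, 2.5)

p_hat = sample_prop_dist(p=0.5, n=100)

print(x_bar.mean, x_bar.stdev, round(p_above_105, 4))
print(p_hat.mean, p_hat.stdev)
```

An individual observation of 105 would be only a third of a standard deviation above the mean, but a sample mean of 105 from 36 observations is two standard deviations above it, which is why the tail probability is so small.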

Central Limit Theorem

  • When n is sufficiently large (n ≥ 30 is a common rule of thumb), the sampling distribution of the sample mean is approximately normal even if the population distribution is not normal.

Confidence Intervals

  • Point Estimate: Uses a single statistic based on sample data.
  • Confidence Intervals: Used to estimate the unknown population parameter.
  • Margin of Error: The smaller the margin of error, the more precise our estimate.
  • Steps:
    • Assumptions – see table below
    • Calculations – C.I. = statistic ± critical value * (standard deviation of the statistic)
    • Conclusion – Write your statement in context. We are [x]% confident that the true [parameter] of [context] is between [a] and [b].
  • What makes the margin of error smaller:
    • Make critical value smaller (lower confidence level).
    • Get a sample with a smaller s.
    • Make n larger.
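The calculation step "statistic ± critical value * (standard deviation of the statistic)" can be sketched for a one-sample proportion. This is an illustrative example only: the function name and the counts (60 successes out of 100) are made up, and the conditions for a z-interval are assumed to have been checked already.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Return (low, high, margin_of_error) for a one-proportion z-interval."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value
    se = sqrt(p_hat * (1 - p_hat) / n)                       # standard error of p-hat
    moe = z_star * se
    return p_hat - moe, p_hat + moe, moe

low, high, moe = one_prop_z_interval(60, 100)

# The margin of error shrinks with a lower confidence level or a larger n.
moe_90 = one_prop_z_interval(60, 100, confidence=0.90)[2]
moe_big_n = one_prop_z_interval(240, 400)[2]

print(round(low, 3), round(high, 3), round(moe, 3))
print(round(moe_90, 3), round(moe_big_n, 3))
```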

T distributions compared to standard normal curve

  • Centered around 0
  • More spread out and shorter
  • More area under the tails.
  • When you increase n, t-curves become more normal.
  • There can be no outliers in the sample data (a condition for using t-procedures)
  • Degrees of Freedom = n – 1
  • Robust – if the assumption of normality is not met, the confidence level or p-value does not change much – this is true of t-distributions because there is more area in the tails

Hypothesis Tests

  • Tells us whether an observed result could plausibly have occurred by random chance alone. If it is unlikely to occur by random chance, then it is statistically significant.
  • Null Hypothesis: H_0 is the statement being tested. Null hypothesis should be “no effect”, “no difference”, or “no relationship”
  • Alternate Hypothesis: H_a is the statement suspected of being true.
  • P-Value: Assuming the null is true, the probability of obtaining the observed result or more extreme
  • Level of Significance: α is the cutoff for the p-value; reject the null hypothesis when the p-value is less than α. A smaller α demands stronger evidence before rejecting.
  • Steps:
    • Assumptions – see table below
    • Hypotheses - don’t forget to define parameter
    • Calculations – find z or t test statistic & p-value
    • Conclusion – Write your statement in context. Since the p-value is < (>) α, I reject (fail to reject) H_0. There is (is not) sufficient evidence to suggest that [H_a].
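The calculation and conclusion steps can be sketched for a one-proportion z-test. The example is hypothetical (60 successes in 100 trials, H_0: p = 0.5, H_a: p > 0.5, α = 0.05), the function name is invented, and the conditions are assumed to have been checked. Note that the standard error uses the hypothesized p_0, not the sample proportion.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for a one-proportion z-test of H0: p = p0."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)   # standard deviation assuming H0 is true
    z = (p_hat - p0) / se
    std_normal = NormalDist()
    if alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":
        p_value = std_normal.cdf(z)
    else:
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value

alpha = 0.05
z, p_value = one_prop_z_test(60, 100, p0=0.5, alternative="greater")
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(round(z, 2), round(p_value, 4), decision)
```

With the same data and a two-sided alternative the p-value doubles, so the choice of H_a has to be made before looking at the data.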

Type I and II Errors and Power

  • Type I Error: Reject H0 when H0 is actually true (probability is α).
  • Type II Error: Fail to reject H0, and H0 is actually false (probability is β).
  • α and β are inversely related. Consequences are the results of making a Type I or Type II error.
  • The Power of a Test – is the probability that the test will reject the null hypothesis when the null hypothesis is false (that is, when a specific alternative value is true). Power = 1 – β
If you increase | Type I error (α) | Type II error (β) | Power
α               | Increases        | Decreases         | Increases
n               | Same             | Decreases         | Increases
(μ_0 – μ_a)     | Same             | Decreases         | Increases
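The relationships between α, n, the distance between μ_0 and μ_a, and power can be checked numerically. The sketch below makes simplifying assumptions that go slightly beyond the guide: a one-sided z-test of H_0: μ = μ_0 against H_a: μ > μ_0 with a known σ. All numbers and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu0, mu_a, sigma, n, alpha):
    """Power of a one-sided z-test (Ha: mu > mu0) when the true mean is mu_a.

    Assumes sigma is known, which is a simplification for illustration.
    """
    se = sigma / sqrt(n)
    # Reject H0 when the sample mean exceeds this cutoff.
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Power = P(x-bar > cutoff) when x-bar is centered at mu_a.
    return 1 - NormalDist(mu_a, se).cdf(cutoff)

base = power_one_sided_z(mu0=100, mu_a=105, sigma=15, n=36, alpha=0.05)
beta = 1 - base                                              # Type II error probability

more_alpha = power_one_sided_z(100, 105, 15, 36, alpha=0.10)  # larger alpha
more_n = power_one_sided_z(100, 105, 15, 100, alpha=0.05)     # larger n
more_gap = power_one_sided_z(100, 110, 15, 36, alpha=0.05)    # mu_a farther from mu0

print(round(base, 3), round(beta, 3))
print(round(more_alpha, 3), round(more_n, 3), round(more_gap, 3))
```

Each of the three changes raises the power above the baseline, matching the "Increases" entries in the Power column.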

\chi^2 Test

  • Used to test counts of categorical data.
    • Goodness of Fit (univariate)
    • Independence (bivariate)
    • Homogeneity (univariate 2 (or more) samples)
  • \chi^2 Distribution: All curves are skewed right, every df has a different curve, and as the degrees of freedom increase the \chi^2 curve becomes more normal.
  • Goodness of Fit: Univariate categorical data from a single sample. Does the observed count “fit” what we expect? Must use list to perform, df = number of the categories – 1, use χ2cdf (χ2, ∞, df) to calculate p-value
  • Independence: Bivariate categorical data from one sample. Are the two variables independent or dependent? Use matrices to calculate
  • Homogeneity: Single categorical variable from 2 (or more) samples. Are distributions homogeneous? Use matrices to calculate

For both \chi^2 tests of independence & homogeneity:

  • Expected counts = {(row total)(column total) \over grand total}
  • df = (r – 1)(c – 1)
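The expected-count and degrees-of-freedom formulas can be applied to a small two-way table. The observed counts below are invented, and the test statistic uses the standard formula χ² = Σ (observed - expected)² / expected, which the guide does not write out. Finding the p-value still requires a calculator or table (for example χ2cdf), since the Python standard library has no chi-square distribution.

```python
def expected_counts(table):
    """Expected count for each cell: (row total)(column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

observed = [[30, 20],
            [20, 30]]  # hypothetical 2 x 2 table of counts

expected = expected_counts(observed)
chi2 = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Condition check from the assumptions: all expected counts must exceed 5.
counts_ok = all(e > 5 for row in expected for e in row)

print(expected, chi2, df, counts_ok)
```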

Regression Model

  • X & Y have a linear relationship where the true LSRL is μ_y = α + βx
  • The responses (y) are normally distributed for a given x-value.
  • The standard deviation of the responses (σ_y) is the same for all values of x.
    • s is the estimate for σ_y

Confidence Interval

  • b ± t^*s_b

Hypothesis Testing

  • t = {b - β \over s_b}, where β is the slope stated in the null hypothesis (usually 0) and df = n – 2

Assumptions:

  • Proportions (z-procedures)
    • One sample:
      • SRS from population
      • Can be approximated by normal distribution if np & n(1 – p) > 10
      • Population size is at least 10n
    • Two samples:
      • 2 independent SRS's from populations (or randomly assigned treatments)
      • Can be approximated by normal distribution if n_1 p_1, n_1(1 – p_1), n_2 p_2, & n_2(1 – p_2) > 10
      • Population sizes are at least 10n
  • Means (t-procedures)
    • One sample:
      • SRS from population
      • Distribution is approximately normal (one of the following): given; large sample size; graph of data is approximately symmetrical and unimodal with no outliers
    • Matched pairs:
      • SRS from population
      • Distribution of differences is approximately normal (one of the following): given; large sample size; graph of differences is approximately symmetrical and unimodal with no outliers
    • Two samples:
      • 2 independent SRS's from populations (or randomly assigned treatments)
      • Distributions are approximately normal (one of the following): given; large sample sizes; graphs of data are approximately symmetrical and unimodal with no outliers
  • Counts (\chi^2 procedures)
    • All types:
      • Reasonably random sample(s)
      • All expected counts > 5 (must show expected counts)
  • Bivariate data (t-procedures on slope)
    • SRS from population
    • There is a linear relationship between x & y. (Residual plot has no pattern.)
    • The standard deviation of the responses is constant for all values of x. (Points are scattered evenly across the LSRL in the scatterplot.)
    • The responses are approximately normally distributed. (Graph of residuals is approximately symmetrical & unimodal with no outliers.)