AP Statistics Cumulative AP Exam Study Guide

Statistics

  • Statistics is the science of collecting, analyzing, and drawing conclusions from data.
  • Descriptive statistics involves methods of organizing and summarizing data.
  • Inferential statistics involves making generalizations from a sample to the population.

Basic Definitions

  • Population: An entire collection of individuals or objects.
  • Sample: A subset of the population selected for study.
  • Variable: Any characteristic whose value changes.
  • Data: Observations on one or more variables.

Types of Variables

  • Categorical (Qualitative): Characteristics that place an individual into a group or category.
  • Numerical (Quantitative): Measurements or observations of numerical data.
    • Discrete: Listable sets (counts).
    • Continuous: Any value over an interval of values (measurements).
  • Univariate: One variable.
  • Bivariate: Two variables.
  • Multivariate: Many variables.

Distributions

  • Symmetrical: Data on which both sides are fairly the same shape and size. Often resembles a "Bell Curve".
  • Uniform: Every class has an equal frequency (number). Graphically, this forms "a rectangle".
  • Skewed: One side (tail) is longer than the other side. The skewness is in the direction that the tail points (left or right).
  • Bimodal: Data where two or more classes have large frequencies separated by another class between them, resembling a "double hump camel".

Describing Numerical Graphs (S.O.C.S.)

  • Shape: Overall type (symmetrical, skewed right/left, uniform, or bimodal).
  • Outliers: Any unusual features, such as outliers, gaps, and clusters.
  • Center: Middle of the data (mean, median, and mode).
  • Spread: Refers to variability (range, standard deviation, and IQR).
  • Context: Everything must be in context to the data and situation of the graph.
  • Comparison: When comparing two distributions, MUST use comparative language!

Parameter vs. Statistic

  • Parameter: Value of a population (typically unknown).
  • Statistic: A value calculated from a sample, typically used to estimate a population parameter.

Measures of Center

  • Median: The middle point of the data (50th percentile) when the data are in numerical order. If there are two middle values, average them together.
  • Mean:
    • μ is for a population (parameter).
    • \bar{x} is for a sample (statistic).
  • Mode: Occurs the most in the data. There can be more than one mode, or no mode at all if all data points occur once.

Variability

  • Variability allows statisticians to distinguish between usual and unusual occurrences.

Measures of Spread (variability)

  • Range: A single value calculated as (Max - Min).
  • IQR: Interquartile range, calculated as (Q3 - Q1).
  • Standard deviation:
    • σ for population (parameter).
    • s for sample (statistic).
    • Measures the typical or average deviation of observations from the mean. The sample standard deviation divides the sum of the squared deviations by df = n - 1 before taking the square root.
      *The sum of the deviations from the mean is always zero!
  • Variance: Standard deviation squared.
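
The measures of center and spread above can be checked with Python's standard `statistics` module. This is a minimal sketch; the data values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]           # hypothetical sample

mean = statistics.mean(data)            # x-bar
median = statistics.median(data)        # middle value of the ordered data
data_range = max(data) - min(data)      # Max - Min
s = statistics.stdev(data)              # sample SD (divides by n - 1)
sigma = statistics.pstdev(data)         # population SD (divides by n)
variance = statistics.variance(data)    # sample variance, equal to s squared

# Deviations from the mean always sum to zero (up to rounding).
deviation_sum = sum(x - mean for x in data)

print(mean, median, data_range, variance, round(s, 3), round(sigma, 3))
```

Note that `stdev` is always a little larger than `pstdev` for the same data because it divides by n - 1 instead of n.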

Resistance to Outliers

  • Resistant:
    • Median
    • IQR
  • Non-Resistant:
    • Mean
    • Range
    • Variance
    • Standard Deviation
    • Correlation Coefficient (r)
    • Least Squares Regression Line (LSRL)
    • Coefficient of Determination (r^2)

Comparison of Mean & Median Based on Graph Type

  • Symmetrical: Mean and the median are the same value.
  • Skewed Right: Mean is a larger value than the median.
  • Skewed Left: The mean is smaller than the median.
  • The mean is always pulled in the direction of the skew away from the median.

Trimmed Mean

  • Use a % to take observations away from the top and bottom of the ordered data. This possibly eliminates outliers.

Linear Transformations of Random Variables

  • μ_{a + bX} = a + bμ_X The mean is changed by both addition (subtraction) & multiplication (division).
  • σ_{a + bX} = |b|σ_X The standard deviation is changed by multiplication (division) ONLY.
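
A quick numerical check of the two transformation rules above. The values of x and the constants a and b are hypothetical; a negative b is used to show why the absolute value is needed for the standard deviation.

```python
import statistics

x = [10, 20, 30, 40, 50]            # hypothetical values of X
a, b = 32, -1.8                     # hypothetical constants in a + bX

y = [a + b * xi for xi in x]        # transformed values

mu_x, sd_x = statistics.mean(x), statistics.pstdev(x)
mu_y, sd_y = statistics.mean(y), statistics.pstdev(y)

print(mu_y, a + b * mu_x)           # mean follows a + b * mu_X
print(sd_y, abs(b) * sd_x)          # SD follows |b| * sigma_X (a has no effect)
```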

Combination of Two (or More) Random Variables

  • μ_{X ± Y} = μ_X ± μ_Y Just add or subtract the two (or more) means.
  • σ_{X ± Y}^2 = σ_X^2 + σ_Y^2 Always add the variances – X & Y MUST be independent.
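
A short worked example of the combination rules above, assuming X and Y are independent. The means and standard deviations are hypothetical.

```python
import math

mu_x, sigma_x = 50, 3       # hypothetical mean and SD of X
mu_y, sigma_y = 20, 4       # hypothetical mean and SD of Y

mean_sum = mu_x + mu_y      # mean of X + Y
mean_diff = mu_x - mu_y     # mean of X - Y

# Variances add for both the sum and the difference; SDs do not.
sd_sum = math.sqrt(sigma_x ** 2 + sigma_y ** 2)
sd_diff = math.sqrt(sigma_x ** 2 + sigma_y ** 2)

print(mean_sum, mean_diff, sd_sum, sd_diff)
```

The standard deviation of the difference is 5, not 3 - 4 or 3 + 4, which is the most common mistake with this rule.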

Z-Score

  • Is a standardized score. This tells you how many standard deviations from the mean an observation is. Standardizing a normal distribution produces the standard normal curve, which consists of z-scores with μ = 0 & σ = 1.
  • z = \frac{x - μ}{σ}
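
The z-score formula above as a small function. The function name `z_score` and the numbers are hypothetical, for illustration only.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

z_high = z_score(130, 100, 15)   # observation above the mean
z_low = z_score(85, 100, 15)     # observation below the mean

print(z_high, z_low)
```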

Normal Curve

  • Is a bell-shaped and symmetrical curve.
    • As σ increases the curve flattens.
    • As σ decreases the curve thins.

Empirical Rule (68-95-99.7)

  • Measures 1σ, 2σ, and 3σ on normal curves from a center of μ.
    • 68% of the population is between -1σ and 1σ.
    • 95% of the population is between -2σ and 2σ.
    • 99.7% of the population is between -3σ and 3σ.
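
The three percentages above are rounded areas under the standard normal curve. A minimal check using `statistics.NormalDist` (Python 3.8 or later):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} SD: {area:.4f}")
```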

Boxplots

  • Are for medium or large numerical data. It does not contain original observations. Always use modified boxplots where the fences are 1.5 IQRs from the ends of the box (Q1 & Q3). Points outside the fence are considered outliers. Whiskers extend to the smallest & largest observations within the fences.

5-Number Summary

  • Minimum, Q1 (1st Quartile – 25th Percentile), Median, Q3 (3rd Quartile – 75th Percentile), Maximum

Probability Rules

  • Sample Space: The collection of all possible outcomes.
  • Event: Any collection (subset) of outcomes from the sample space.
  • Complement: All outcomes not in the event.
  • Union: A or B, all the outcomes in either circle (including the overlap). A ∪ B
  • Intersection: A and B, the outcomes in the overlap of A and B. A ∩ B
  • Mutually Exclusive (Disjoint): A and B have no intersection. They cannot happen at the same time.
  • Independent: Knowing that one event has occurred does not change the probability of the other event.
  • Experimental Probability: The number of successes in an experiment divided by the total number of trials in the experiment.
  • Law of Large Numbers: As an experiment is repeated the experimental probability gets closer and closer to the true (theoretical) probability. The difference between the two probabilities will approach “0”.

Rules of Probability

  • (1) All probabilities satisfy 0 ≤ P ≤ 1.
  • (2) Probability of sample space is 1.
  • (3) Complement: P(A) + P(\text{not } A) = 1, so P(\text{not } A) = 1 - P(A)
  • (4) Addition P(A \text{ or } B) = P(A) + P(B) – P(A \& B)
  • (5) Multiplication P(A \& B) = P(A) · P(B) if A & B are independent
  • (6) P(at least 1) = 1 – P(none)
  • (7) Conditional Probability – takes into account a certain condition.
    • P(A|B) = \frac{P(A \& B)}{P(B)} = \frac{P(\text{both})}{P(\text{given})}
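
The rules above can be worked through with exact fractions. The probabilities assigned to A and B are hypothetical; the die example assumes a fair six-sided die rolled 4 times.

```python
from fractions import Fraction

p_a = Fraction(1, 2)          # hypothetical P(A)
p_b = Fraction(2, 5)          # hypothetical P(B)
p_a_and_b = Fraction(1, 5)    # hypothetical P(A & B)

p_not_a = 1 - p_a                        # complement rule
p_a_or_b = p_a + p_b - p_a_and_b         # addition rule
p_a_given_b = p_a_and_b / p_b            # conditional probability
independent = p_a_and_b == p_a * p_b     # product rule holds only if independent

# P(at least one six in 4 rolls) = 1 - P(no sixes in 4 rolls)
p_at_least_one_six = 1 - Fraction(5, 6) ** 4

print(p_not_a, p_a_or_b, p_a_given_b, independent, p_at_least_one_six)
```

Here P(A|B) equals P(A), which is another way of seeing that these two hypothetical events are independent.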

Correlation Coefficient

  • (r) – is a quantitative assessment of the strength and direction of a linear relationship. (use ρ (rho) for population parameter)
  • Values – [-1, 1] 0 – no correlation, (0, ±0.5) – weak, [±0.5, ±0.8) – moderate, [±0.8, ±1] - strong

Least Squares Regression Line (LSRL)

  • Is a line of mathematical best fit. Minimizes the sum of the squared deviations (residuals) from the line. Used with bivariate data.
  • \hat{y} = a + bx
    • x is independent, the explanatory variable & y is dependent, the response variable

Residuals (error)

  • The vertical difference between a point and the LSRL. The residuals always sum to “0”.
  • Residual = y - \hat{y}

Residual Plot

  • A scatterplot of (x (or \hat{y} ) , residual). No pattern indicates that a linear model is appropriate.

Coefficient of Determination

  • (r^2) - gives the proportion of variation in y (response) that is explained by the relationship of (x, y). Never use the adjusted r^2.
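
The LSRL, residuals, r, and r^2 described above can be computed directly from sums of squares. This is a sketch with hypothetical data; the formulas b = Sxy / Sxx and a = ȳ - b·x̄ are the standard least squares results and are not stated in this guide.

```python
import math
import statistics

x = [1, 2, 3, 4, 5]     # hypothetical explanatory values
y = [2, 4, 5, 4, 5]     # hypothetical response values

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                     # slope
a = y_bar - b * x_bar             # intercept: the line passes through (x-bar, y-bar)
r = sxy / math.sqrt(sxx * syy)    # correlation coefficient
r_squared = r ** 2                # coefficient of determination

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]   # y - y-hat

print(round(a, 3), round(b, 3), round(r, 3), round(r_squared, 3))
print(round(sum(residuals), 10))  # residuals sum to zero
```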

Interpretations (must be in context!)

  • Slope (b) – For each one-unit increase in x, the predicted y increases/decreases by the slope amount.
  • Correlation coefficient (r) – There is a [strength], [direction], linear association between x & y.
  • Coefficient of determination (r^2) - Approximately r^2 % of the variation in y can be explained by the LSRL of x and y.

Cautions About Regression

  • Extrapolation – the LSRL should not be used to predict values outside of the range of the original data.
  • Influential Points – are points that if removed significantly change the LSRL.
  • Outliers – are points with large residuals.

Census

  • A complete count of the population.
  • Why not to use a census?
    • Expensive
    • Impossible to do
    • If sampling is destructive, a census would destroy the entire population

Sampling Frame

  • Is a list of everyone in the population.

Sampling Design

  • Refers to the method used to choose a sample.

SRS (Simple Random Sample)

  • Chosen so that each unit has an equal chance of being selected and every possible set of n units has an equal chance of being the sample.
  • Advantages: Easy and unbiased
  • Disadvantages: Large σ^2 and must know population.

Stratified

  • Divide the population into homogeneous groups called strata, then take an SRS from each stratum.
  • Advantages: More precise than an SRS and cost reduced if strata already available.
  • Disadvantages: Difficult to divide into groups, more complex formulas & must know population.

Systematic

  • Use a systematic approach (every 50th) after choosing randomly where to begin.
  • Advantages: Unbiased, the sample is evenly distributed across population & don’t need to know population.
  • Disadvantages: A large σ^2 and can be confounded by trends.

Cluster Sample

  • Based on location. Select a random location and sample ALL at that location.
  • Advantages: Cost is reduced, is unbiased & don’t need to know population.
  • Disadvantages: May not be representative of population and has complex formulas.
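
The four sampling designs above can be sketched with Python's `random` module. The population (units numbered 1 to 100), the two strata, the clusters of 20, and the fixed seed are all hypothetical choices made for illustration.

```python
import random

rng = random.Random(2024)             # fixed seed so the sketch is repeatable
population = list(range(1, 101))      # hypothetical units numbered 1 to 100

# SRS: every set of 10 units is equally likely to be chosen.
srs = rng.sample(population, 10)

# Stratified: split into strata, then take an SRS from each stratum.
strata = {"low": population[:50], "high": population[50:]}
stratified = [unit for group in strata.values() for unit in rng.sample(group, 5)]

# Systematic: random starting point, then every 10th unit.
start = rng.randrange(10)
systematic = population[start::10]

# Cluster: pick one cluster at random and sample ALL units in it.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
cluster_sample = rng.choice(clusters)

print(srs, stratified, systematic, cluster_sample, sep="\n")
```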

Random Digit Table

  • Each entry is equally likely and each digit is independent of the rest.

Random # Generator

  • Calculator or computer program

Bias

  • Error – favors a certain outcome, has to do with center of sampling distributions – if centered over true parameter then considered unbiased

Sources of Bias

  • Voluntary Response – people choose themselves to participate.
  • Convenience Sampling – ask people who are easy to reach, friendly, or comfortable to approach.
  • Undercoverage – some group(s) are left out of the selection process.
  • Non-response – someone cannot or does not want to be contacted or participate.
  • Response – false answers – can be caused by a variety of things
  • Wording of the Questions – leading questions.

Experimental Design

  • Observational Study – observe outcomes without giving a treatment.
  • Experiment – actively imposes a treatment on the subjects.
  • Experimental Unit – single individual or object that receives a treatment.
  • Factor – is the explanatory variable, what is being tested
  • Level – a specific value for the factor.
  • Response Variable – what you are measuring with the experiment.
  • Treatment – experimental condition applied to each unit.
  • Control Group – a group used to compare the factor to for effectiveness – does NOT have to be placebo
  • Placebo – a treatment with no active ingredients (provides control).
  • Blinding – a method used so that the subjects are unaware of the treatment (who gets a placebo or the real treatment).
  • Double Blinding – neither the subjects nor the evaluators know which treatment is being given.

Principles of Experimental Design

  • Control – keep all extraneous variables (not being tested) constant
  • Replication – uses many subjects to quantify the natural variation in the response.
  • Randomization – uses chance to assign the subjects to the treatments.
  • The only way to show cause and effect is with a well designed, well controlled experiment.

Experimental Designs

  • Completely Randomized – all units are randomly allocated among the treatments
  • Randomized Block – units are blocked and then randomly assigned in each block –reduces variation
  • Matched Pairs – units are matched up by characteristics and then randomly assigned. Once one member of a pair receives a certain treatment, the other member automatically receives the second treatment. OR individuals do both treatments in random order (before/after or pretest/post-test). Assignment is dependent.
  • Confounding Variables – are where the effect of the variable on the response cannot be separated from the effects of the factor being tested – happens in observational studies – when you use random assignment to treatments you do NOT have confounding variables!
  • Randomization – reduces bias by spreading extraneous variables to all groups in the experiment.
  • Blocking – helps reduce variability. Another way to reduce variability is to increase sample size.

Random Variable

  • A numerical value that depends on the outcome of an experiment.
    • Discrete – takes a countable set of values (counts)
    • Continuous – takes any value in an interval (measurements)

Discrete Probability Distributions

  • Gives values & probabilities associated with each possible x.
  • μ_X = \sum x_i p(x_i)
  • σ_X^2 = \sum (x_i - μ_X)^2 p(x_i)
  • Calculator shortcut – 1 VARSTAT L1,L2
  • Fair Game – a fair game is one in which all pay-ins equal all pay-outs.
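
The mean and variance formulas above, applied to a small hypothetical probability distribution. This mirrors what the calculator shortcut computes from the two lists.

```python
import math

values = [0, 1, 2, 3]             # hypothetical values of X (like list L1)
probs = [0.1, 0.2, 0.3, 0.4]      # their probabilities (like list L2), summing to 1

mu = sum(x * p for x, p in zip(values, probs))                 # expected value
var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))    # variance
sd = math.sqrt(var)                                            # standard deviation

print(round(mu, 4), round(var, 4), round(sd, 4))
```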

Special Discrete Distributions

Binomial Distributions

  • Properties:
    • Two mutually exclusive outcomes.
    • Fixed number of trials (n).
    • Each trial is independent.
    • The probability (p) of success is the same for all trials.
  • Random variable -
    • Is the number of successes out of a fixed # of trials.
    • Starts at X = 0 and is finite.
  • μ_X = np
  • σ_x = \sqrt{npq}
  • Calculator:
    • binomialpdf (n, p, x) = single outcome P(X = x)
    • binomialcdf (n, p, x) = cumulative outcome P(X ≤ x)
    • 1 - binomialcdf (n, p, (x - 1)) = cumulative outcome P(X ≥ x)
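
A sketch of what the calculator functions above compute, written with `math.comb`. The helper names `binom_pmf` and `binom_cdf` are hypothetical, and the example uses n = 10 trials with p = 0.5.

```python
from math import comb, sqrt

def binom_pmf(n, p, x):
    """P(X = x) for a binomial random variable."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(n, p, x):
    """P(X <= x): add up the single-outcome probabilities from 0 to x."""
    return sum(binom_pmf(n, p, k) for k in range(x + 1))

n, p = 10, 0.5                              # hypothetical number of trials and P(success)

p_exactly_3 = binom_pmf(n, p, 3)            # P(X = 3)
p_at_most_3 = binom_cdf(n, p, 3)            # P(X <= 3)
p_at_least_3 = 1 - binom_cdf(n, p, 2)       # P(X >= 3) = 1 - P(X <= 2)

mean = n * p                                # np
sd = sqrt(n * p * (1 - p))                  # sqrt(npq)

print(p_exactly_3, p_at_most_3, p_at_least_3, mean, round(sd, 4))
```

The cumulative function includes the endpoint, so "at least 3" needs the complement of "at most 2".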

Geometric Distributions

  • Properties
    • Two mutually exclusive outcomes.
    • Each trial is independent.
    • Probability (p) of success is the same for all trials. (NOT a fixed number of trials)
  • Random Variable – the trial on which the FIRST success occurs. Starts at X = 1 and has no upper limit (∞).
  • Calculator:
    • geometricpdf (p, a) = single outcome P(X = a)
    • geometriccdf (p, a) = cumulative outcomes P(X ≤ a)
    • 1 - geometriccdf (p, (a - 1)) = cumulative outcome P(X ≥ a)
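
A sketch of the geometric calculations above. The helper names `geom_pmf` and `geom_cdf` are hypothetical, and p = 0.2 is an assumed probability of success.

```python
def geom_pmf(p, a):
    """P(X = a): a - 1 failures followed by the first success."""
    return (1 - p) ** (a - 1) * p

def geom_cdf(p, a):
    """P(X <= a): the first success occurs on or before trial a."""
    return 1 - (1 - p) ** a

p = 0.2                                  # hypothetical probability of success

p_first_on_3 = geom_pmf(p, 3)            # P(X = 3)
p_by_3 = geom_cdf(p, 3)                  # P(X <= 3)
p_3_or_later = 1 - geom_cdf(p, 2)        # P(X >= 3) = 1 - P(X <= 2)

print(round(p_first_on_3, 4), round(p_by_3, 4), round(p_3_or_later, 4))
```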

Continuous Random Variable

  • Numerical values that fall within a range or interval (measurements), use density curves where the area under the curve always = 1. To find probabilities, find area under the curve

  • Unusual Density Curves -any shape (triangles, etc.)

  • Uniform Distributions –uniformly (evenly) distributed, shape of a rectangle

Normal Distributions

  • Symmetrical, unimodal, bell shaped curves defined by the parameters μ & σ
  • Calculator:
    • Normalpdf – used for graphing only
    • Normalcdf(lower bound, upper bound, μ, σ) – finds probability
    • InvNorm(p) – z-score OR InvNorm(p, μ, σ) – gives x-value
  • To assess Normality - Use graphs – dotplots, boxplots, histograms, or normal probability plot.
  • Distribution – is all of the values of a random variable.
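
The calculator commands above have close equivalents in Python's `statistics.NormalDist`. The wrapper names `normalcdf` and `inv_norm` are hypothetical stand-ins that copy the calculator's argument order; the values μ = 100 and σ = 15 are assumed for illustration.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0, sigma=1):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(p, mu=0, sigma=1):
    """Value with area p to its left (a z-score when mu = 0 and sigma = 1)."""
    return NormalDist(mu, sigma).inv_cdf(p)

area = normalcdf(85, 115, 100, 15)    # P(85 < X < 115), within 1 SD of the mean
z_975 = inv_norm(0.975)               # z-score with 97.5% of the area to its left
x_90 = inv_norm(0.90, 100, 15)        # 90th percentile on the original scale

print(round(area, 4), round(z_975, 3), round(x_90, 2))
```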

Sampling Distribution

  • The sampling distribution of a statistic is the distribution of the values of that statistic over all possible samples of the same size. Use normalcdf to calculate probabilities – be sure to use the correct SD
  • μ_{\bar{X}} = μ_X
  • σ_{\bar{X}} = \frac{σ_X}{\sqrt{n}} (standard deviation of the sample means)
  • μ_{\hat{p}} = p
  • σ_{\hat{p}} = \sqrt{\frac{pq}{n}} (standard deviation of the sample proportions)
  • μ_{\bar{X}_1 - \bar{X}_2} = μ_{X_1} - μ_{X_2}
  • σ_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{σ_1^2}{n_1} + \frac{σ_2^2}{n_2}} (standard deviation of the difference in sample means)
  • μ_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2
  • σ_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} (standard deviation of the difference in sample proportions)
  • μ_b = β
  • SE_b (standard error of the slope of the LSRL; do not need to find, usually given in computer printout)
  • Standard error – estimate of the standard deviation of the statistic
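
A sketch of a sampling distribution probability for a sample mean. It assumes a normal population with hypothetical values μ = 100 and σ = 15 and a sample of size n = 25, so the sampling distribution of the sample mean is normal with standard deviation σ/√n.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 100, 15, 25            # hypothetical population values and sample size

sd_xbar = sigma / sqrt(n)             # SD of the sample means, not of individuals
xbar_dist = NormalDist(mu, sd_xbar)   # sampling distribution of x-bar

p_above_106 = 1 - xbar_dist.cdf(106)              # P(x-bar > 106)
p_individual = 1 - NormalDist(mu, sigma).cdf(106) # P(X > 106) for one individual

print(sd_xbar, round(p_above_106, 4), round(p_individual, 4))
```

Using σ instead of σ/√n gives a much larger (and wrong) probability, which is why the correct SD matters.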

Central Limit Theorem

  • When n is sufficiently large (n > 30) the sampling distribution of the sample mean is approximately normal even if the population distribution is not normal.

Confidence Intervals

  • Point Estimate – uses a single statistic based on sample data, this is the simplest approach.
  • Confidence Intervals – used to estimate the unknown population parameter.
  • Margin of Error – the smaller the margin of error, the more precise our estimate

Steps for Creating a Confidence Interval

  • Assumptions – see table below
  • Calculations – C.I. = statistic ± (critical value)(standard deviation of the statistic)
  • Conclusion – Write your statement in context.
    • We are [x]% confident that the true [parameter] of [context] is between [a] and [b].
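
A sketch of the calculation step for a one-sample z-interval for a proportion, following statistic ± (critical value)(standard deviation of the statistic). The counts (240 successes out of 400) and the 95% confidence level are hypothetical; the sample proportion is used in the standard error because p is unknown.

```python
from math import sqrt
from statistics import NormalDist

successes, n = 240, 400               # hypothetical sample counts
confidence = 0.95

p_hat = successes / n                                      # point estimate
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # critical value, about 1.96
standard_error = sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = z_star * standard_error

interval = (p_hat - margin_of_error, p_hat + margin_of_error)

print(round(interval[0], 3), round(interval[1], 3))
```

In context this would read: we are 95% confident that the true proportion is between about 0.552 and 0.648.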

Factors That Make the Margin of Error Smaller

  • Make critical value smaller (lower confidence level).
  • Get a sample with a smaller s.
  • Make n larger.

T Distributions Compared to Standard Normal Curve

  • Centered around 0
  • More spread out and shorter
  • More area under the tails.
  • When you increase n, t-curves become more normal.
  • There can be no outliers in the sample data
  • Degrees of Freedom = n – 1
  • Robust – if the assumption of normality is not met, the confidence level or p-value does not change much – this is true of t-distributions because there is more area in the tails

Hypothesis Tests

  • Hypothesis Testing – tells us if a value occurs by random chance or not. If it is unlikely to occur by random chance then it is statistically significant.
  • Null Hypothesis – H0 is the statement being tested. Null hypothesis should be “no effect”, “no difference”, or “no relationship”
  • Alternate Hypothesis – Ha is the statement suspected of being true.
  • P-Value – assuming the null is true, the probability of obtaining the observed result or more extreme
  • Level of Significance – α is the amount of evidence necessary before rejecting the null hypothesis.

Steps for Hypothesis Testing

  • Assumptions – see table below
  • Hypotheses - don’t forget to define parameter
  • Calculations – find z or t test statistic & p-value
  • Conclusion – Write your statement in context.
    • Since the p-value is < (>) α, I reject (fail to reject) the Ho. There is (is not) sufficient evidence to suggest that [Ha].
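
A sketch of the calculation and conclusion steps for a one-sample z-test for a proportion. The hypotheses (H0: p = 0.5 against Ha: p > 0.5), the data (58 successes in 100 trials), and α = 0.05 are hypothetical. The standard error uses the null value of p because the p-value is computed assuming the null is true.

```python
from math import sqrt
from statistics import NormalDist

x, n = 58, 100        # hypothetical sample: 58 successes in 100 trials
p0 = 0.5              # value of p under H0
alpha = 0.05          # significance level

p_hat = x / n
se_null = sqrt(p0 * (1 - p0) / n)       # SD of p-hat assuming H0 is true
z = (p_hat - p0) / se_null              # test statistic
p_value = 1 - NormalDist().cdf(z)       # one-sided (Ha: p > p0)

decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(round(z, 2), round(p_value, 4), decision)
```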

Type I and II Errors and Power

  • Type I Error – is when one rejects H0 when H0 is actually true. (probability is α)
  • Type II Error – is when you fail to reject H0, and H0 is actually false. (probability is β)
  • α and β are inversely related. Consequences are the results of making a Type I or Type II error. Every decision has the possibility of making an error.
  • The Power of a Test – is the probability that the test will reject the null hypothesis when the null hypothesis is false (that is, when a specific alternative is true). Power = 1 – β

Relationship Between Type I Error, Type II Error, Power and Sample Size

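  • When α increases: β decreases and power increases.
  • When n increases: α stays the same, β decreases, and power increases.
  • When the difference (μ0 – μa) increases: α stays the same, β decreases, and power increases.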
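
The relationships above can be seen numerically. This sketch assumes a one-sided z-test for a mean with known σ (H0: μ = μ0 against Ha: μ > μ0), which is a simplification chosen for illustration; the function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu0, mu_a, sigma, n, alpha):
    """Power of H0: mu = mu0 vs Ha: mu > mu0 when the true mean is mu_a."""
    se = sigma / sqrt(n)
    # Reject H0 when the sample mean is above this cutoff.
    critical_mean = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability of landing in the rejection region when mu_a is true.
    return 1 - NormalDist(mu_a, se).cdf(critical_mean)

base = power_one_sided_z(100, 105, 15, 25, 0.05)
bigger_alpha = power_one_sided_z(100, 105, 15, 25, 0.10)
bigger_n = power_one_sided_z(100, 105, 15, 100, 0.05)
bigger_gap = power_one_sided_z(100, 110, 15, 25, 0.05)

print(round(base, 3), round(bigger_alpha, 3), round(bigger_n, 3), round(bigger_gap, 3))
```

Each change (larger α, larger n, larger gap between μ0 and μa) raises power, so β = 1 – power falls in every case.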

χ^2 Test

  • Is used to test counts of categorical data.

Types of χ^2 Tests

  • Goodness of Fit (univariate)
  • Independence (bivariate)
  • Homogeneity (univariate 2 (or more) samples)
  • χ^2 distribution – All curves are skewed right, every df has a different curve, and as the degrees of freedom increase the χ^2 curve becomes more normal.

Goodness of Fit

  • Is for univariate categorical data from a single sample. Do the observed counts “fit” what we expect? Must use lists to perform, df = \text{number of categories} – 1, use χ^2cdf(χ^2, ∞, df) to calculate the p-value

Independence

  • Bivariate categorical data from one sample. Are the two variables independent or dependent? Use matrices to calculate

Homogeneity

  • Single categorical variable from 2 (or more) samples. Are distributions homogeneous? Use matrices to calculate

For Both χ^2 Tests of Independence & Homogeneity:

  • Expected counts = \frac{\text{(row total)(column total)}}{\text{grand total}}
  • df = (r – 1)(c – 1)
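
A sketch of the expected-count and degrees-of-freedom formulas above, using a hypothetical 2 x 2 table of observed counts. The test statistic uses the standard sum of (observed - expected)^2 / expected, which this guide does not state explicitly; the p-value step is left to the calculator.

```python
observed = [[20, 30],
            [30, 20]]       # hypothetical two-way table of counts

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = (row total)(column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

df = (len(row_totals) - 1) * (len(col_totals) - 1)   # (r - 1)(c - 1)

print(expected, chi_square, df)
```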

Regression Model:

  • X & Y have a linear relationship where the true LSRL is μ_y = α + βx
  • The responses (y) are normally distributed for a given x-value.
  • The standard deviation of the responses (σ_y) is the same for all values of x.
    • s is the estimate for σ_y

Confidence Interval for Regression Model

  • b ± t^* SE_b, with df = n – 2

Hypothesis Testing for Regression Model:

  • t = \frac{b - β}{SE_b}, with df = n – 2 (under H0, β is usually 0)

Assumptions for Statistical Procedures

Proportions (z - procedures)

One sample:

  • SRS from population
  • Can be approximated by normal distribution if n(p) \geq 10 & n(1 – p) \geq 10
  • Population size is at least 10n

Two samples:

  • 2 independent SRS’s from populations (or randomly assigned treatments)
  • Can be approximated by normal distribution if n_1 p_1 \geq 10, n_1(1 – p_1) \geq 10, n_2 p_2 \geq 10, & n_2(1 – p_2) \geq 10
  • Population sizes are at least 10n

Means (t - procedures)

One sample:

  • SRS from population
  • Distribution is approximately normal
    • Given
    • Large sample size
    • Graph of data is approximately symmetrical and unimodal with no outliers

Matched pairs:

  • SRS from population
  • Distribution of differences is approximately normal
    • Given
    • Large sample size
    • Graph of differences is approximately symmetrical and unimodal with no outliers

Two samples:

  • 2 independent SRS’s from populations (or randomly assigned treatments)
  • Distributions are approximately normal
    • Given
    • Large sample sizes
    • Graphs of data are approximately symmetrical and unimodal with no outliers

Bivariate Data: (t – procedures on slope)

  • SRS from population
  • There is linear relationship between x & y.
  • Residual plot has no pattern.
  • The standard deviation of the responses is constant for all values of x.
    • Points are scattered evenly across the LSRL in the scatterplot.
  • The responses are approximately normally distributed.
    • Graph of residuals is approximately symmetrical & unimodal with no outliers.

Counts (\chi^2 - procedures)

  • All types:
    • Reasonably random sample(s)
    • All expected counts > 5
      • Must show expected counts