AP Statistics Exam Study Guide
AP Statistics Cumulative AP Exam Study Guide
Statistics Basics
- Statistics: Science of collecting, analyzing, and drawing conclusions from data.
- Descriptive Statistics: Methods for organizing and summarizing data.
- Inferential Statistics: Making generalizations from a sample to a population.
- Population: Entire collection of individuals or objects.
- Sample: Subset of the population selected for study.
- Variable: Characteristic whose value changes.
- Data: Observations on one or more variables.
Types of Variables
- Categorical (Qualitative): Values are categories or characteristics (labels, not measurements).
- Numerical (Quantitative): Measurements or observations of numerical data.
- Discrete: Listable sets (counts).
- Continuous: Any value over an interval (measurements).
- Univariate: One variable.
- Bivariate: Two variables.
- Multivariate: Many variables.
Distributions
- Symmetrical: Roughly the same shape and size on both sides of the center.
- Uniform: Every class has equal frequency.
- Skewed: One side (tail) is longer than the other. Skewness is in the direction the tail points.
- Bimodal: Two classes (peaks) with large frequencies, separated by a class with a smaller frequency.
Describing Numerical Graphs (S.O.C.S.)
- Shape: Symmetrical, skewed (right/left), uniform, or bimodal.
- Outliers: Unusual values that fall far from the rest of the data; also note gaps, clusters, etc.
- Center: Middle of the data (mean, median, mode).
- Spread: Variability (range, standard deviation, IQR).
- Context: Everything must be in context to the data and situation.
- Comparison: When comparing distributions, use comparative language.
Parameters vs. Statistics
- Parameter: Value of a population (typically unknown).
- Statistic: Value calculated from a sample, used to estimate a population parameter.
Measures of Center
- Median: Middle point of the data (50th percentile) in numerical order.
- Mean: $\mu$ for population, $\bar{x}$ for sample.
- Mode: Occurs most in the data. Can have multiple modes or none.
Measures of Spread (Variability)
- Range: $\text{max} - \text{min}$.
- IQR: Interquartile range, $Q_3 - Q_1$.
- Standard Deviation: $\sigma$ for population, $s$ for sample. Measures typical deviation from the mean. The sample standard deviation formula divides by $n - 1$.
- Sum of deviations from the mean is always zero.
- Variance: Standard deviation squared.
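The measures of spread above can be checked by hand on a small data set. The following is a minimal Python sketch, assuming a made-up list of values; the variable names are illustrative only. It shows that the deviations from the mean sum to zero and that the sample standard deviation divides by n - 1 while the population version divides by n.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data for illustration

n = len(data)
mean = sum(data) / n
deviations = [x - mean for x in data]  # these always sum to zero

data_range = max(data) - min(data)
sum_sq = sum(d ** 2 for d in deviations)
pop_sd = math.sqrt(sum_sq / n)           # divide by n for a population
sample_sd = math.sqrt(sum_sq / (n - 1))  # divide by n - 1 for a sample
sample_var = sample_sd ** 2              # variance is the standard deviation squared

print(data_range, sum(deviations), pop_sd, round(sample_sd, 3))
```

The hand-computed values agree with `statistics.stdev` and `statistics.variance` from the Python standard library, which also divide by n - 1.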
Resistant vs. Non-Resistant Measures
- Resistant: Not affected by outliers (Median, IQR).
- Non-Resistant: Affected by outliers (Mean, Range, Variance, Standard Deviation, Correlation Coefficient (r), Least Squares Regression Line (LSRL), Coefficient of Determination ($r^2$)).
Comparison of Mean & Median Based on Graph Type
- Symmetrical: Mean and median are approximately the same.
- Skewed Right: Mean > Median.
- Skewed Left: Mean < Median.
- Mean is pulled in the direction of the skew.
- Trimmed Mean: Remove a fixed percentage of observations from both the top and the bottom of the ordered data before averaging, to reduce the effect of outliers.
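To see how an outlier pulls the mean but not the median, here is a small Python sketch using made-up right-skewed data. The helper `trimmed_mean` is a hypothetical name written for this illustration; it drops the given percentage of observations from each end.

```python
import statistics

def trimmed_mean(data, percent):
    """Mean after dropping `percent`% of the observations from EACH end."""
    xs = sorted(data)
    k = int(len(xs) * percent / 100)  # number of observations to drop per end
    trimmed = xs[k:len(xs) - k]
    return sum(trimmed) / len(trimmed)

right_skewed = [1, 2, 3, 4, 100]  # made-up data with one high outlier

mean = statistics.mean(right_skewed)      # pulled toward the outlier
median = statistics.median(right_skewed)  # resistant to the outlier
trimmed = trimmed_mean(right_skewed, 20)  # drops the 1 and the 100

print(mean, median, trimmed)
```

Here the mean (22) is far above the median (3), matching the skewed-right rule, while the 20% trimmed mean lands back at 3.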
Linear Transformations of Random Variables
- $\mu_{a + bX} = a + b\mu_X$ (Mean is changed by both addition/subtraction & multiplication/division).
- $\sigma_{a + bX} = |b|\sigma_X$ (Standard deviation is changed by multiplication/division ONLY).
Combination of Two (or More) Random Variables
- $\mu_{X \pm Y} = \mu_X \pm \mu_Y$ (Add or subtract the means).
- $\sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y$ (Always add the variances - X & Y MUST be independent).
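The transformation and combination rules can be verified numerically. The sketch below is an illustration under assumed inputs: a made-up data list for the linear transformation Y = a + bX, and two fair dice (all 36 equally likely pairs, so the dice are independent) for the sum and difference.

```python
import itertools
import math
import statistics

# Linear transformation Y = a + bX applied to made-up data
a, b = 3, -2
x = [1, 2, 3, 4, 5]
y = [a + b * xi for xi in x]

# The mean is shifted and scaled; the standard deviation is only scaled by |b|
mean_rule_ok = math.isclose(statistics.mean(y), a + b * statistics.mean(x))
sd_rule_ok = math.isclose(statistics.pstdev(y), abs(b) * statistics.pstdev(x))

# Combining independent variables: two fair dice, all 36 equally likely pairs
die = [1, 2, 3, 4, 5, 6]
pairs = list(itertools.product(die, die))
var_die = statistics.pvariance(die)
var_sum = statistics.pvariance([d1 + d2 for d1, d2 in pairs])
var_diff = statistics.pvariance([d1 - d2 for d1, d2 in pairs])

print(mean_rule_ok, sd_rule_ok, var_die, var_sum, var_diff)
```

Both the sum and the difference of the dice have variance equal to twice the variance of one die, which is why variances are always added, never subtracted.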
Z-Score
- Standardized score indicating how many standard deviations an observation is from the mean: $z = \frac{x - \mu}{\sigma}$. Standardizing a normal variable gives the standard normal curve with $\mu = 0$ & $\sigma = 1$.
Normal Curve
- Bell-shaped and symmetrical.
- As $\sigma$ increases, the curve flattens and spreads out.
- As $\sigma$ decreases, the curve becomes narrower and taller.
Empirical Rule (68-95-99.7)
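- Measures $1\sigma$, $2\sigma$, and $3\sigma$ on normal curves from the center $\mu$.
- 68% of the population is between $\mu - 1\sigma$ and $\mu + 1\sigma$.
- 95% of the population is between $\mu - 2\sigma$ and $\mu + 2\sigma$.
- 99.7% of the population is between $\mu - 3\sigma$ and $\mu + 3\sigma$.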
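The z-score formula and the 68-95-99.7 percentages can be reproduced with `statistics.NormalDist` from the Python standard library. This is a minimal sketch; the test score, mean, and standard deviation are made-up numbers, and `z_score` and `area_within` are hypothetical helper names.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

std_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

z = z_score(85, 70, 10)  # made-up score: 1.5 standard deviations above the mean
for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The printed areas are the more precise values behind the rounded 68%, 95%, and 99.7% figures.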
Boxplots
- For medium or large numerical data; doesn't contain original observations.
- Use modified boxplots with fences at $1.5 \times \text{IQR}$ from the ends of the box (Q1 & Q3).
- Points outside the fence are outliers.
- Whiskers extend to the smallest & largest observations within the fences.
5-Number Summary
- Minimum, Q1 (25th Percentile), Median, Q3 (75th Percentile), Maximum
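The five-number summary and the boxplot fences can be computed as in the Python sketch below. It assumes the common classroom convention that Q1 and Q3 are the medians of the lower and upper halves of the data (excluding the overall median when n is odd); calculators and software may use slightly different quartile rules. The data and function names are made up for illustration.

```python
import statistics

def five_number_summary(data):
    """Min, Q1, median, Q3, max using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                              # values below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]  # values above the median position
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

def outliers_and_whiskers(data):
    """Flag points outside the 1.5 * IQR fences and find the whisker ends."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return outliers, (min(inside), max(inside))

data = [1, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data with one high value
summary = five_number_summary(data)
outliers, whiskers = outliers_and_whiskers(data)
print(summary, outliers, whiskers)
```

For this data the upper fence is 8.5 + 1.5(5) = 16, so 30 is an outlier and the upper whisker stops at 9.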
Probability Rules
- Sample Space: Collection of all outcomes.
- Event: Any subset of outcomes from the sample space.
- Complement: All outcomes not in the event.
- Union: A or B, all outcomes in either circle of a Venn diagram (A, B, or both).
- Intersection: A and B, the outcomes in the overlap of A and B.
- Mutually Exclusive (Disjoint): A and B have no intersection; they cannot happen at the same time.
- Independent: Knowing that one event occurred doesn't change the probability of the other.
- Experimental Probability: Number of successes from an experiment divided by the total number of trials in the experiment.
- Law of Large Numbers: As an experiment is repeated, the experimental probability gets closer to the true probability.
Probability Rules (Formulas)
- All probabilities satisfy $0 \le P(A) \le 1$.
- Probability of sample space is 1.
- Complement: $P(A^C) = 1 - P(A)$
- Addition: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
- Multiplication: $P(A \text{ and } B) = P(A) \cdot P(B)$ if A & B are independent.
- Conditional Probability: $P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$
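A worked example helps tie the probability formulas together. The Python sketch below uses exact fractions for one card drawn from a standard 52-card deck; the choice of events (hearts and face cards) is just an illustration.

```python
from fractions import Fraction

# Drawing one card from a standard 52-card deck
p_heart = Fraction(13, 52)          # A: card is a heart
p_face = Fraction(12, 52)           # B: card is a face card (J, Q, K)
p_heart_and_face = Fraction(3, 52)  # A and B: J, Q, K of hearts

p_not_heart = 1 - p_heart                              # complement rule
p_heart_or_face = p_heart + p_face - p_heart_and_face  # addition rule
p_heart_given_face = p_heart_and_face / p_face         # conditional probability

# Independence check: does P(A and B) equal P(A) * P(B)?
independent = (p_heart_and_face == p_heart * p_face)

print(p_not_heart, p_heart_or_face, p_heart_given_face, independent)
```

Because P(heart | face) equals P(heart), knowing the card is a face card does not change the probability of a heart, so the two events are independent even though they are not mutually exclusive.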
Correlation Coefficient
- (r) - Quantitative assessment of the strength and direction of a linear relationship. (Use $\rho$ (rho) for the population parameter.)
- Values: $-1 \le r \le 1$; as a rough guide, 0 - no correlation, $0 < |r| < 0.5$ - weak, $0.5 \le |r| < 0.8$ - moderate, $0.8 \le |r| \le 1$ - strong
Least Squares Regression Line (LSRL)
- Line of best fit. Minimizes the sum of the squared deviations (residuals) from the line. Used with bivariate data.
- $\hat{y} = a + bx$; x is the independent (explanatory) variable, y is the dependent (response) variable.
- Residuals (error) - vertical difference of a point from the LSRL. All residuals sum to 0.
- Residual = $y - \hat{y}$ (observed minus predicted)
- Residual Plot - scatterplot of (x, residual). No pattern indicates a linear relationship.
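The LSRL, its residuals, and r can be computed directly from the sums of squares. The following is a minimal Python sketch on made-up (x, y) data, writing the line as predicted y = a + bx; it uses the standard least-squares formulas rather than any particular calculator routine.

```python
import math

x = [1, 2, 3, 4, 5]  # made-up explanatory values
y = [2, 4, 5, 4, 5]  # made-up response values

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                   # slope
a = y_bar - b * x_bar           # intercept; the line passes through (x_bar, y_bar)
r = sxy / math.sqrt(sxx * syy)  # correlation coefficient

predicted = [a + b * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, predicted)]  # observed minus predicted

print(round(a, 2), round(b, 2), round(r, 3), round(r ** 2, 2))
```

For this data the line is predicted y = 2.2 + 0.6x, the residuals sum to zero (up to rounding), and r squared is 0.6, so about 60% of the variation in y is explained by the linear relationship with x.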
Coefficient of Determination
- $r^2$ - Proportion of variation in y explained by the linear relationship between x and y. Never use the adjusted $r^2$.
- Interpretations: (must be in context!)
- Slope (b): For each one-unit increase in x, the predicted y increases/decreases by the slope amount.
- Correlation coefficient (r): There is a (strength, direction, linear) association between x & y.
- Coefficient of determination ($r^2$): Approximately $100r^2$% of the variation in y can be explained by the LSRL of x and y.
- Extrapolation - The LSRL should not be used to predict values outside the range of the original data.
- Influential Points - if removed, significantly change the LSRL.
- Outliers - points with large residuals.
Census
- A complete count of the population. Reasons not to use a census:
- Expensive
- Often impossible to carry out
- If sampling is destructive, a census would destroy the entire population
Sampling Frame
- A list of every individual in the population.
Sampling Design
- Refers to the method used to choose a sample.
SRS (Simple Random Sample)
- Units are chosen so that each unit has an equal chance of being selected and every possible set of n units has an equal chance of being selected.
- Advantages: Easy and unbiased
- Disadvantages: Large variance ($\sigma^2$) and must know population.
Stratified
- Divide the population into homogeneous groups called strata, then take an SRS from each stratum.
- Advantages: More precise than an SRS and cost reduced if strata are already available.
- Disadvantages: Difficult to divide into groups, more complex formulas & must know population.
Systematic
- Use a systematic approach (every 50th) after choosing randomly where to begin.
- Advantages: Unbiased, the sample is evenly distributed across the population & don’t need to know population.
- Disadvantages: A large variance ($\sigma^2$) and can be confounded by trends.
Cluster Sample
- Based on location. Select a random location and sample ALL at that location.
- Advantages: Cost is reduced, unbiased & don’t need to know the population.
- Disadvantages: May not be representative of the population and has complex formulas.
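The four sampling designs above differ mainly in where the randomness enters. The Python sketch below illustrates each one on a hypothetical sampling frame of 100 ID numbers; the strata, cluster sizes, and seed are assumptions made only so the example is concrete and repeatable.

```python
import random

rng = random.Random(1)            # seeded only so the sketch is repeatable
population = list(range(1, 101))  # hypothetical sampling frame: IDs 1..100

# SRS: every set of 10 IDs is equally likely to be chosen
srs = rng.sample(population, 10)

# Stratified: split into two strata, then take an SRS from each stratum
strata = {"low": population[:50], "high": population[50:]}
stratified = [unit for group in strata.values() for unit in rng.sample(group, 5)]

# Systematic: random starting point, then every 10th unit
start = rng.randrange(10)
systematic = population[start::10]

# Cluster: split into 10 clusters of 10, pick one at random, sample ALL of it
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = rng.choice(clusters)

print(sorted(srs), sorted(stratified), systematic, cluster_sample, sep="\n")
```

Note that the stratified sample always contains exactly five units from each stratum, while the SRS gives no such guarantee.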
Random Digit Table
- Each entry is equally likely, and each digit is independent of the rest.
Random # Generator
- Calculator or computer program
Bias
- Systematic error that favors a certain outcome; it affects the center of the sampling distribution.
Sources of Bias
- Voluntary Response: People choose themselves to participate.
- Convenience Sampling: Ask people who are easy or comfortable to ask.
- Undercoverage: Some group(s) are left out of the selection process.
- Non-response: Someone cannot or does not want to participate.
- Response: People give false or inaccurate answers, for example because of how the question is worded.
Experimental Design
- Observational Study: Observe outcomes without giving a treatment.
- Experiment: Actively impose a treatment on the subjects.
- Experimental Unit: Single individual or object that receives a treatment.
- Factor: Explanatory variable being tested.
- Level: A specific value for the factor.
- Response Variable: What you are measuring with the experiment.
- Treatment: Experimental condition applied to each unit.
- Control Group: Group used to compare the factor to for effectiveness (doesn't have to be placebo).
- Placebo: Treatment with no active ingredients.
- Blinding: Subjects are unaware of which treatment they are receiving.
- Double Blinding: Neither subjects nor evaluators know which treatment is being given.
Principles of Experimental Design
- Control: Keep all extraneous variables constant.
- Replication: Use many subjects to quantify the natural variation in the response.
- Randomization: Use chance to assign subjects to treatments.
- The only way to show cause and effect is with a well designed, well controlled experiment.
Experimental Designs
- Completely Randomized: All units are randomly allocated among the treatments.
- Randomized Block: Units are blocked and then randomly assigned within each block (reduces variation).
- Matched Pairs: Units are matched and then randomly assigned. OR individuals do both treatments in random order (assignment is dependent).
- Confounding Variables: Effect of the variable on the response cannot be separated from the factor being tested - happens in observational studies.
- Randomization reduces bias by spreading extraneous variables to all groups.
- Blocking helps reduce variability. Another way to reduce variability is to increase sample size.
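Random assignment for a completely randomized design and a randomized block design can be sketched as follows. This is an illustration only: the subject labels, the two treatments, and the blocking variable are hypothetical, and the function names are not from any library.

```python
import random

rng = random.Random(7)  # seeded only so the sketch is repeatable

def completely_randomized(units, treatments):
    """Shuffle all units, then deal them out to the treatments in turn."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    return {t: shuffled[i::len(treatments)] for i, t in enumerate(treatments)}

def randomized_block(blocks, treatments):
    """Randomly assign treatments separately within each block."""
    return {name: completely_randomized(units, treatments)
            for name, units in blocks.items()}

subjects = [f"S{i}" for i in range(1, 13)]  # 12 hypothetical subjects
crd = completely_randomized(subjects, ["drug", "placebo"])

# Hypothetical blocking variable: first six subjects vs. last six
blocks = {"block A": subjects[:6], "block B": subjects[6:]}
rbd = randomized_block(blocks, ["drug", "placebo"])

print(crd)
print(rbd)
```

In the block design every block contributes equally to both treatment groups, which is how blocking removes the block-to-block variation from the comparison.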
Random Variable
- A numerical value that depends on the outcome of an experiment.
Discrete Probability Distributions
- Gives values & probabilities associated with each possible x.
- Fair Game: A game is fair when the expected net winnings are 0, i.e., on average all pay-ins equal all pay-outs.
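Whether a game is fair can be checked by computing the expected value of the net winnings. The Python sketch below uses the standard formulas for the mean (sum of x times its probability) and variance of a discrete random variable; the dice game and its prizes are a made-up example.

```python
from fractions import Fraction

def expected_value(dist):
    """Mean of a discrete random variable given as {value: probability}."""
    assert sum(dist.values()) == 1, "probabilities must sum to 1"
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Variance: probability-weighted squared deviations from the mean."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

# Hypothetical game: pay $1 to roll a die and win $5 if it lands on 6.
# X = net winnings, so a win is +4 and a loss is -1.
unfair = {4: Fraction(1, 6), -1: Fraction(5, 6)}

# Same game with a $6 prize: net winnings are +5 or -1.
fair = {5: Fraction(1, 6), -1: Fraction(5, 6)}

print(expected_value(unfair), expected_value(fair), variance(fair))
```

The first game loses 1/6 of a dollar per play on average, while the second has expected net winnings of exactly 0 and is therefore fair.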
Special Discrete Distributions
Binomial Distributions:
- Two mutually exclusive outcomes, fixed number of trials (n), each trial is independent, probability (p) of success is the same for all trials.
- Random variable X is the number of successes out of a fixed number of trials. Starts at X = 0 and is finite (maximum X = n).
- Calculator: binomialpdf(n, p, x) = single outcome P(X = x), binomialcdf(n, p, x) = cumulative outcome P(X ≤ x), 1 - binomialcdf(n, p, x - 1) = cumulative outcome P(X ≥ x)
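For readers without a calculator at hand, the binomial probabilities can be computed from the formula directly. In the Python sketch below, `binom_pdf` and `binom_cdf` are hypothetical stand-ins for the calculator functions, with the same argument order (n, p, x); the cumulative version includes x itself.

```python
from math import comb

def binom_pdf(n, p, x):
    """P(X = x): exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(n, p, x):
    """P(X <= x): x or fewer successes in n independent trials."""
    return sum(binom_pdf(n, p, k) for k in range(x + 1))

n, p = 10, 0.5  # e.g. 10 tosses of a fair coin
p_exactly_3 = binom_pdf(n, p, 3)
p_at_most_3 = binom_cdf(n, p, 3)
p_at_least_4 = 1 - binom_cdf(n, p, 3)  # complement of "3 or fewer"

print(p_exactly_3, p_at_most_3, p_at_least_4)
```

Note that "at least 4" is found by subtracting the cumulative probability up to 3, one less than the value of interest.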
Geometric Distributions:
- Two mutually exclusive outcomes, each trial is independent, probability (p) of success is the same for all trials. (NOT a fixed number of trials)
- Random variable X is the trial on which the FIRST success occurs. Starts at X = 1 and has no upper limit (∞).
- Calculator: geometricpdf(p, a) = single outcome P(X = a), geometriccdf(p, a) = cumulative outcome P(X ≤ a), 1 - geometriccdf(p, a - 1) = cumulative outcome P(X ≥ a)
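The geometric probabilities follow from requiring a run of failures before the first success. In this Python sketch, `geom_pdf` and `geom_cdf` are hypothetical stand-ins for the calculator functions and take (p, a) in that order; the success probability 0.2 is a made-up value.

```python
def geom_pdf(p, a):
    """P(X = a): first success on trial a, after a - 1 failures."""
    return (1 - p) ** (a - 1) * p

def geom_cdf(p, a):
    """P(X <= a): first success on or before trial a."""
    return 1 - (1 - p) ** a

p = 0.2  # made-up probability of success on each trial
p_first_on_3 = geom_pdf(p, 3)
p_within_3 = geom_cdf(p, 3)
p_at_least_4 = 1 - geom_cdf(p, 3)  # no success in the first 3 trials

print(round(p_first_on_3, 3), round(p_within_3, 3), round(p_at_least_4, 3))
```

There is no n argument here because the number of trials is not fixed in a geometric setting.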
Continuous Random Variable
- Numerical values that fall within a range or interval (measurements). Area under the curve always = 1. To find probabilities, find the area under the curve.
Unusual Density Curves
- Any shape (triangles, etc.)
Uniform Distributions
- Uniformly (evenly) distributed, shape of a rectangle
Normal Distributions
- Symmetrical, unimodal, bell shaped curves defined by the parameters $\mu$ & $\sigma$.
- Calculator:
- Normalpdf – used for graphing only
- Normalcdf(lower bound, upper bound, μ, σ) – finds probability
- InvNorm(p) – gives the z-score with area p to its left OR InvNorm(p, μ, σ) – gives the x-value
- To assess Normality - Use graphs – dotplots, boxplots, histograms, or normal probability plot.
- Distribution – all of the values of a random variable and how often each occurs.
- Sampling Distribution – the distribution of a statistic's values over all possible samples of the same size. When it is approximately normal, use normalcdf to calculate probabilities.
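The normal calculator functions have close equivalents in the Python standard library. The sketch below wraps `statistics.NormalDist` in two hypothetical helpers, `normal_cdf` and `inv_norm`, named to mirror the calculator commands; the mean of 100 and standard deviation of 15 are made-up values.

```python
from statistics import NormalDist

def normal_cdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(p, mu=0.0, sigma=1.0):
    """Value with area p to its left (a z-score when mu = 0 and sigma = 1)."""
    return NormalDist(mu, sigma).inv_cdf(p)

p_between = normal_cdf(85, 115, mu=100, sigma=15)  # within 1 SD of the mean
z_star = inv_norm(0.975)                           # z-score with 97.5% below it
x_cutoff = inv_norm(0.975, mu=100, sigma=15)       # same cutoff on the x scale

print(round(p_between, 4), round(z_star, 2), round(x_cutoff, 1))
```

The x-value is just the z-score converted back to the original scale: 100 + 1.96(15), or about 129.4.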
Sampling Distributions
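- $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ (standard deviation of the sample means)
- $\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$ (standard deviation of the sample proportions)
- $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$ (standard deviation of the difference in sample means)
- $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$ (standard deviation of the difference in sample proportions)
- $s_b$ (standard error of the slopes of the LSRLs; do not need to find, usually given in computer printout)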
- Standard error – estimate of the standard deviation of the statistic
Central Limit Theorem
- When n is sufficiently large (n ≥ 30 as a rule of thumb), the sampling distribution of the sample mean is approximately normal even if the population distribution is not normal.
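A typical sampling-distribution question asks for the probability that a sample mean exceeds some value. The Python sketch below works one such problem under assumed numbers (population mean 50, standard deviation 10, sample size 36, which is large enough to lean on the Central Limit Theorem); the helper names are hypothetical.

```python
import math
from statistics import NormalDist

def sd_sample_mean(sigma, n):
    """Standard deviation of the sampling distribution of the sample mean."""
    return sigma / math.sqrt(n)

def sd_sample_proportion(p, n):
    """Standard deviation of the sampling distribution of the sample proportion."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical population: mu = 50, sigma = 10; samples of size n = 36
sigma_xbar = sd_sample_mean(10, 36)
p_mean_above_53 = 1 - NormalDist(50, sigma_xbar).cdf(53)

# Hypothetical proportion: p = 0.5 with samples of size n = 100
sigma_phat = sd_sample_proportion(0.5, 100)

print(round(sigma_xbar, 3), round(p_mean_above_53, 4), round(sigma_phat, 3))
```

A single observation above 53 would not be unusual in this population, but a mean of 36 observations above 53 is, because the sample mean varies far less than individual values do.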
Confidence Intervals
- Point Estimate: Uses a single statistic based on sample data.
- Confidence Intervals: Used to estimate the unknown population parameter.
- Margin of Error: The smaller the margin of error, the more precise our estimate.
- Steps:
- Assumptions – see table below
- Calculations – C.I. = statistic ± critical value * (standard deviation of the statistic)
- Conclusion – Write your statement in context. We are [x]% confident that the true [parameter] of [context] is between [a] and [b].
- What makes the margin of error smaller:
- Make critical value smaller (lower confidence level).
- Get a sample with a smaller s.
- Make n larger.
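The effect of confidence level and sample size on the margin of error can be seen in a one-proportion z-interval. The following Python sketch is an illustration with made-up counts (60 successes in 100 trials); `one_prop_z_interval` is a hypothetical helper, and the conditions for the procedure are assumed to be met.

```python
import math
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """statistic +/- critical value * (standard error of the statistic)."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin

low, high, margin = one_prop_z_interval(60, 100)
_, _, margin_4n = one_prop_z_interval(240, 400)                  # same p-hat, 4x the n
_, _, margin_90 = one_prop_z_interval(60, 100, confidence=0.90)  # lower confidence

print(round(low, 3), round(high, 3), round(margin, 3),
      round(margin_4n, 3), round(margin_90, 3))
```

Quadrupling the sample size cuts the margin of error in half, and dropping from 95% to 90% confidence also shrinks it, at the cost of being less confident.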
T distributions compared to standard normal curve
- Centered around 0
- More spread out and shorter
- More area under the tails.
- When you increase n, t-curves become more normal.
- There can be no outliers in the sample data
- Degrees of Freedom = n – 1
- Robust – if the assumption of normality is not met, the confidence level or p-value does not change much – this is true of t-distributions because there is more area in the tails
Hypothesis Tests
- Tells us whether an observed result could plausibly have occurred by random chance. If it is unlikely to occur by random chance, then it is statistically significant.
- Null Hypothesis ($H_0$): The statement being tested. The null hypothesis should be “no effect”, “no difference”, or “no relationship”.
- Alternate Hypothesis ($H_a$): The statement suspected of being true.
- P-Value: Assuming the null is true, the probability of obtaining the observed result or one more extreme.
- Level of Significance ($\alpha$): The cutoff for how small the p-value must be (how much evidence is necessary) before rejecting the null hypothesis.
- Steps:
- Assumptions – see table below
- Hypotheses - don’t forget to define parameter
- Calculations – find z or t test statistic & p-value
- Conclusion – Write your statement in context. Since the p-value is < (>) $\alpha$, I reject (fail to reject) $H_0$. There is (is not) sufficient evidence to suggest that [$H_a$].
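The calculation step of a hypothesis test can be sketched in Python for the one-proportion case. This example is an illustration with made-up counts (224 successes in 400 trials) and an upper-tailed alternative; `one_prop_z_test` is a hypothetical helper, and the assumptions for the test are taken as already checked.

```python
import math
from statistics import NormalDist

def one_prop_z_test(successes, n, p0):
    """Upper-tailed test of H0: p = p0 versus Ha: p > p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # uses p0 because H0 is assumed true
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)  # area at or beyond the observed z
    return z, p_value

alpha = 0.05
z, p_value = one_prop_z_test(224, 400, 0.5)
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(round(z, 2), round(p_value, 4), decision)
```

Since the p-value is below 0.05, the null hypothesis is rejected; a full answer would still state the hypotheses with the parameter defined and write the conclusion in context.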
Type I and II Errors and Power
- Type I Error: Reject $H_0$ when $H_0$ is actually true (probability is $\alpha$).
- Type II Error: Fail to reject $H_0$ when $H_0$ is actually false (probability is $\beta$).
- $\alpha$ and $\beta$ are inversely related. Consequences are the results of making a Type I or Type II error.
- The Power of a Test – the probability that the test will reject the null hypothesis when the null hypothesis is false. Power = $1 - \beta$
| If you increase | Type I error | Type II error | Power |
|---|---|---|---|
| $\alpha$ | Increases | Decreases | Increases |
| n | Same | Decreases | Increases |
| Distance between $\mu_0$ and $\mu_a$ | Same | Decreases | Increases |
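The directions in the table can be confirmed with a direct power calculation. The Python sketch below assumes the simplest setting, an upper-tailed z test for a mean with known standard deviation, and uses made-up values for the null mean, the true mean, and the sample size; `power_upper_tail_z` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def power_upper_tail_z(mu0, mu_a, sigma, n, alpha):
    """Power of an upper-tailed z test of H0: mu = mu0 when the true mean is mu_a."""
    se = sigma / math.sqrt(n)
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # reject H0 if x-bar > cutoff
    return 1 - NormalDist(mu_a, se).cdf(cutoff)          # P(reject H0 | true mean mu_a)

base = power_upper_tail_z(mu0=100, mu_a=103, sigma=10, n=25, alpha=0.05)
more_alpha = power_upper_tail_z(100, 103, 10, 25, alpha=0.10)  # larger alpha
more_n = power_upper_tail_z(100, 103, 10, 100, alpha=0.05)     # larger sample
bigger_gap = power_upper_tail_z(100, 106, 10, 25, alpha=0.05)  # true mean farther from mu0

print(round(base, 2), round(more_alpha, 2), round(more_n, 2), round(bigger_gap, 2))
```

Each change raises the power above the baseline, and since the probability of a Type II error is 1 minus the power, each change lowers that error probability.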
$\chi^2$ Test
- Used to test counts of categorical data.
- Goodness of Fit (univariate)
- Independence (bivariate)
- Homogeneity (univariate 2 (or more) samples)
- $\chi^2$ Distribution: All curves are skewed right, every df has a different curve, and as the degrees of freedom increase the curve becomes more normal.
- Goodness of Fit: Univariate categorical data from a single sample. Does the observed count “fit” what we expect? Must use lists to perform, df = number of categories – 1, use $\chi^2$cdf($\chi^2$, ∞, df) to calculate the p-value
- Independence: Bivariate categorical data from one sample. Are the two variables independent or dependent? Use matrices to calculate
- Homogeneity: Single categorical variable from 2 (or more) samples. Are distributions homogeneous? Use matrices to calculate
For both tests of independence & homogeneity:
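- Expected counts = $\frac{(\text{row total})(\text{column total})}{\text{grand total}}$
- df = (r – 1)(c – 1), where r is the number of rows and c is the number of columns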
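Expected counts and the test statistic for a two-way table can be computed as in the Python sketch below. The observed counts are made up, and the statistic uses the standard chi-square formula (the sum of (observed - expected) squared divided by expected over all cells); the p-value is left to a calculator or table, since the Python standard library has no chi-square distribution function.

```python
def expected_counts(observed):
    """Expected count for each cell: (row total)(column total) / (grand total)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over every cell."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

observed = [[30, 10],   # made-up 2 x 2 table of counts
            [20, 40]]
expected = expected_counts(observed)
chi_sq = chi_square_statistic(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected, round(chi_sq, 3), df)
```

Printing the expected counts also makes it easy to show that every one of them is large enough for the procedure.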
Regression Model
- X & Y have a linear relationship where the true LSRL is $\mu_y = \alpha + \beta x$
- The responses (y) are normally distributed for a given x-value.
- The standard deviation of the responses ($\sigma_y$) is the same for all values of x.
- s is the estimate for $\sigma_y$
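- Confidence Interval: $b \pm t^* s_b$ with df = n – 2
- Hypothesis Testing: $t = \frac{b - \beta}{s_b}$ with df = n – 2, where $\beta$ is the hypothesized slope (usually 0)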
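Although the standard error of the slope is usually read from a computer printout, it can be computed by hand on a small data set. The Python sketch below is an illustration on made-up (x, y) data using the standard formulas s = sqrt(SSE / (n - 2)) and s_b = s / sqrt(sum of squared x deviations), and it tests the hypothesized slope 0. The p-value is not computed because the standard library has no t distribution.

```python
import math

x = [1, 2, 3, 4, 5]  # made-up explanatory values
y = [2, 4, 5, 4, 5]  # made-up response values

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx          # sample slope
a = y_bar - b * x_bar  # sample intercept

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # sum of squared residuals
s = math.sqrt(sse / (n - 2))  # estimates the standard deviation of the responses
s_b = s / math.sqrt(sxx)      # standard error of the slope
t = (b - 0) / s_b             # test statistic for H0: true slope = 0
df = n - 2

print(round(s, 4), round(s_b, 4), round(t, 3), df)
```

With the t statistic and df = n - 2 in hand, the p-value or the critical value for a confidence interval comes from a t table or calculator.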
Assumptions:
| | Proportions: z-procedures | Means: t-procedures | Counts: $\chi^2$-procedures |
|---|---|---|---|
| One sample | • SRS from population • Can be approximated by normal distribution if np & n(1 – p) ≥ 10 • Population size is at least 10n | • SRS from population • Distribution is approximately normal (given, OR large sample size, OR graph of data is approximately symmetrical and unimodal with no outliers) | All types: • Reasonably random sample(s) • All expected counts ≥ 5 (must show expected counts) |
| Two samples | • 2 independent SRS’s from populations (or randomly assigned treatments) • Can be approximated by normal distribution if n1p1, n1(1 – p1), n2p2, & n2(1 – p2) ≥ 10 • Population sizes are at least 10n | Matched pairs: • SRS from population • Distribution of differences is approximately normal (given, OR large sample size, OR graph of differences is approximately symmetrical and unimodal with no outliers) | |
| | | Two samples: • 2 independent SRS’s from populations (or randomly assigned treatments) • Distributions are approximately normal (given, OR large sample sizes, OR graphs of data are approximately symmetrical and unimodal with no outliers) | |
| Bivariate data | | t-procedures on slope: • SRS from population • There is a linear relationship between x & y. • Residual plot has no pattern. • The standard deviation of the responses is constant for all values of x. • Points are scattered evenly across the LSRL in the scatterplot. • The responses are approximately normally distributed. • Graph of residuals is approximately symmetrical & unimodal with no outliers. | |