AP Statistics Cumulative AP Exam Study Guide Notes
Statistics
- Statistics is the science of collecting, analyzing, and drawing conclusions from data.
- Descriptive statistics involve methods of organizing and summarizing data.
- Inferential statistics involve making generalizations from a sample to the population.
Basic Definitions
- Population: An entire collection of individuals or objects.
- Sample: A subset of the population selected for study.
- Variable: Any characteristic whose value may change from one individual or object to another.
- Data: Observations on one or more variables.
Types of Variables
- Categorical (Qualitative): Basic characteristics.
- Numerical (Quantitative): Measurements or observations of numerical data.
- Discrete: Listable sets (counts).
- Continuous: Any value over an interval of values (measurements).
- Univariate: One variable.
- Bivariate: Two variables.
- Multivariate: Many variables.
Distributions
- Symmetrical: Data on which both sides are fairly the same shape and size (Bell Curve).
- Uniform: Every class has an equal frequency (number) (a rectangle).
- Skewed: One side (tail) is longer than the other side. The skewness is in the direction that the tail points (left or right).
- Bimodal: Data of two or more classes have large frequencies separated by another class between them (double hump camel).
Describing Numerical Graphs: S.O.C.S.
- Shape: Overall type (symmetrical, skewed right, skewed left, uniform, or bimodal).
- Outliers: Unusual values that fall far from the rest of the data. Also note gaps, clusters, etc.
- Center: Middle of the data (mean, median, and mode).
- Spread: Refers to variability (range, standard deviation, and IQR).
- Everything must be in context to the data and situation of the graph.
- When comparing two distributions – MUST use comparative language!
Parameter vs. Statistic
- Parameter: Value of a population (typically unknown).
- Statistic: A value calculated from a sample (typically used to estimate a parameter).
Measures of Center
- Median: The middle point of the data (50th percentile) when the data is in numerical order. If there are two middle values, average them together.
- Mean: μ is for a population (parameter) and \bar{x} is for a sample (statistic).
- Mode: Occurs the most in the data. There can be more than one mode, or no mode at all if all data points occur once.
Variability
- Allows statisticians to distinguish between usual and unusual occurrences.
Measures of Spread (Variability)
- Range: A single value – (Max – Min)
- IQR (Interquartile Range): (Q3 – Q1)
- Standard Deviation: σ for population (parameter) & s for sample (statistic) – measures the typical or average deviation of observations from the mean – the sample variance divides the sum of squared deviations by df = n – 1, and s is its square root
- The sum of the deviations from the mean is always zero!
- Variance: Standard deviation squared
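The measures of spread above can be checked on a small data set. This is a minimal sketch using Python's standard `statistics` module; the data values are made up for illustration.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

mean = statistics.mean(data)
deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)      # deviations from the mean cancel out

data_range = max(data) - min(data)       # Max - Min
sample_var = statistics.variance(data)   # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(data)       # s, the square root of the sample variance
pop_sd = statistics.pstdev(data)         # sigma, divides by n (data treated as a population)

print(data_range, sample_var, sample_sd, pop_sd)
```

Note that s is slightly larger than σ for the same data because it divides by n – 1 instead of n.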
Resistant vs. Non-Resistant Measures
- Resistant: Not affected by outliers.
- Non-Resistant:
- Mean
- Range
- Variance
- Standard Deviation
- Correlation Coefficient (r)
- Least Squares Regression Line (LSRL)
- Coefficient of Determination (r^2)
- Symmetrical: Mean and the median are the same value.
- Skewed Right: Mean is a larger value than the median.
- Skewed Left: The mean is smaller than the median.
- The mean is always pulled in the direction of the skew away from the median.
- Trimmed Mean: Use a % to take observations away from the top and bottom of the ordered data. This possibly eliminates outliers.
- μ_{a+bx} = a + bμ_x The mean is changed by both addition (subtraction) & multiplication (division).
- σ_{a+bx} = |b|σ_x The standard deviation is changed by multiplication (division) ONLY.
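The two linear transformation rules can be verified numerically. A minimal sketch, assuming made-up data and hypothetical constants a and b:

```python
import statistics

x = [10, 12, 15, 19, 24]   # made-up data
a, b = 32, -1.8            # hypothetical constants for y = a + b*x

y = [a + b * xi for xi in x]

mean_x, sd_x = statistics.mean(x), statistics.stdev(x)
mean_y, sd_y = statistics.mean(y), statistics.stdev(y)

print(mean_y, a + b * mean_x)   # mean is shifted by a and scaled by b
print(sd_y, abs(b) * sd_x)      # sd is scaled by |b| only; adding a has no effect
```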
Combination of Two (or More) Random Variables
- μ_{x ± y} = μ_x ± μ_y Just add or subtract the two (or more) means
- σ_{x ± y} = \sqrt{σ_x^2 + σ_y^2} Always add the variances – X & Y MUST be independent
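A short worked example of combining two independent random variables. The means and standard deviations below are hypothetical values chosen so the arithmetic is easy to follow.

```python
import math

mu_x, sigma_x = 50.0, 3.0   # hypothetical parameters for X
mu_y, sigma_y = 30.0, 4.0   # hypothetical parameters for Y (independent of X)

mean_sum = mu_x + mu_y
mean_diff = mu_x - mu_y

# Variances add for both the sum AND the difference.
sd_sum = math.sqrt(sigma_x**2 + sigma_y**2)
sd_diff = math.sqrt(sigma_x**2 + sigma_y**2)

print(mean_sum, mean_diff, sd_sum, sd_diff)
```

A common mistake is to add the standard deviations directly (3 + 4 = 7); the correct value here is \sqrt{9 + 16} = 5.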
Z-Score
- Is a standardized score. This tells you how many standard deviations from the mean an observation is. It creates a standard normal curve consisting of z-scores with a μ = 0 & σ = 1.
- z = \frac{x - μ}{σ}
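As a quick illustration of the z-score formula, here is a minimal sketch. The function name `z_score` and the numbers are illustrative assumptions.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

z = z_score(130, 100, 15)   # hypothetical observation, mean, and sd
print(z)                    # positive z means the observation is above the mean
```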
Normal Curve
- Is a bell-shaped and symmetrical curve.
- As σ increases the curve flattens.
- As σ decreases the curve thins.
Empirical Rule (68-95-99.7)
- Measures 1σ, 2σ, and 3σ on normal curves from a center of μ.
- 68% of the population is between -1σ and 1σ
- 95% of the population is between -2σ and 2σ
- 99.7% of the population is between -3σ and 3σ
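The 68-95-99.7 percentages are rounded areas under the standard normal curve. A minimal sketch that recovers them with `statistics.NormalDist` from Python's standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# Area between -k and +k standard deviations for k = 1, 2, 3.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} sd of the mean: {area:.4f}")
```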
Boxplots
- Are for medium or large numerical data. It does not contain original observations. Always use modified boxplots where the fences are 1.5 IQRs from the ends of the box (Q1 & Q3). Points outside the fence are considered outliers. Whiskers extend to the smallest & largest observations within the fences.
5-Number Summary
- Minimum, Q1 (1st Quartile – 25th Percentile), Median, Q3 (3rd Quartile – 75th Percentile), Maximum
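The five-number summary and the 1.5 IQR fence rule can be sketched as below. This assumes the "median of each half" method for quartiles (the convention used by many graphing calculators); other software may compute quartiles slightly differently. The function names and data are illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]             # values below the median position
    upper = data[half + (n % 2):]   # values above the median position
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

def outliers(values):
    """Return the points that fall outside the 1.5 * IQR fences."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 25]   # made-up data
print(five_number_summary(data))
print(outliers(data))
```

For this data Q1 = 4 and Q3 = 8, so IQR = 4 and the fences sit at -2 and 14, which flags 25 as an outlier.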
Probability Rules
- Sample Space: Is the collection of all outcomes.
- Event: Any collection of outcomes from the sample space.
- Complement: All outcomes not in the event.
- Union: A or B, all the outcomes in either circle (including the overlap). A ∪ B
- Intersection: A and B, happening in the middle of A and B. A ∩ B
- Mutually Exclusive (Disjoint): A and B have no intersection. They cannot happen at the same time.
- Independent: Knowing that one event has occurred does not change the probability of the other.
- Experimental Probability: Is the number of successes from an experiment divided by the total number of trials in the experiment.
- Law of Large Numbers: As an experiment is repeated the experimental probability gets closer and closer to the true (theoretical) probability. The difference between the two probabilities will approach “0”.
Rules of Probabilities
- All values are 0 ≤ P ≤ 1.
- Probability of sample space is 1.
- Complement: P(A) + P(\text{not } A) = 1, so P(\text{not } A) = 1 – P(A)
- Addition P(A \text{ or } B) = P(A) + P(B) – P(A \& B)
- Multiplication P(A \& B) = P(A) ⋅ P(B) if A & B are independent
- P (\text{at least 1 or more}) = 1 – P (\text{none})
- Conditional Probability – takes into account a certain condition.
- P(A|B) = \frac{P(A \& B)}{P(B)} = \frac{P(\text{both})}{P(\text{given})}
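The addition, multiplication, "at least one", and conditional probability rules can be worked through with a few lines of arithmetic. All probabilities below are hypothetical.

```python
p_a, p_b, p_a_and_b = 0.5, 0.4, 0.2   # hypothetical probabilities

# Addition rule: subtract the overlap so it is not counted twice.
p_a_or_b = p_a + p_b - p_a_and_b

# Conditional probability: P(A | B) = P(both) / P(given).
p_a_given_b = p_a_and_b / p_b

# A and B are independent exactly when P(A & B) = P(A) * P(B),
# which is the same as saying P(A | B) = P(A).
independent = abs(p_a_and_b - p_a * p_b) < 1e-12

# At least one success in 5 independent trials, each with P(success) = 0.1.
p_none = (1 - 0.1) ** 5
p_at_least_one = 1 - p_none

print(p_a_or_b, p_a_given_b, independent, p_at_least_one)
```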
Correlation Coefficient (r)
- Is a quantitative assessment of the strength and direction of a linear relationship. (use ρ (rho) for population parameter)
- Values – [-1, 1] 0 – no correlation, (0, ±0.5) – weak, [±0.5, ±0.8) – moderate, [±0.8, ±1] - strong
Least Squares Regression Line (LSRL)
- Is a line of mathematical best fit. Minimizes the sum of the squared deviations (residuals) from the line. Used with bivariate data.
- \hat{y} = a + bx .
- x is independent, the explanatory variable & y is dependent, the response variable
- Residuals (error): The vertical difference of a point from the LSRL (residual = y – \hat{y}). All residuals sum up to “0”.
- Residual Plot: A scatterplot of (x \text{ (or } \hat{y} ), residual). No pattern indicates that a linear model is appropriate.
Coefficient of Determination (r^2)
- Gives the proportion of variation in y (response) that is explained by the relationship of (x, y). Never use the adjusted r^2.
- Interpretations: must be in context!
- Slope (b): For each one-unit increase in x, the predicted y will increase/decrease by the slope amount.
- Correlation coefficient (r): There is a [strength], [direction], linear association between x & y.
- Coefficient of determination (r^2): Approximately r^2 (as a %) of the variation in y can be explained by the LSRL of x and y.
- Extrapolation: The LSRL should not be used to predict values outside of the range of the original data.
- Influential Points: Are points that if removed significantly change the LSRL.
- Outliers: Are points with large residuals.
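The slope, intercept, r, r^2, and residuals can be computed by hand for a small data set. This sketch uses the standard least-squares formulas b = S_xy / S_xx and a = \bar{y} – b\bar{x}, which are not stated in the notes above, and made-up data.

```python
import math

x = [1, 2, 3, 4, 5]   # explanatory variable (made-up data)
y = [2, 4, 5, 4, 5]   # response variable (made-up data)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = s_xy / s_xx                   # slope
a = mean_y - b * mean_x           # intercept
r = s_xy / math.sqrt(s_xx * s_yy) # correlation coefficient
r_squared = r ** 2                # coefficient of determination

# Residual = observed y minus predicted y-hat.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.3f}, r^2 = {r_squared:.2f}")
print(sum(residuals))   # zero, up to floating-point rounding
```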
Census
- A complete count of the population. Why not to use a census?
- Expensive
- Impossible to do
- If sampling is destructive, a census would destroy the entire population
Sampling Frame
- Is a list of everyone in the population.
Sampling Design
- Refers to the method used to choose a sample.
- SRS (Simple Random Sample): One chooses so that each unit has an equal chance and every set of units has an equal chance of being selected.
- Advantages: easy and unbiased
- Disadvantages: large σ^2 and must know population.
- Stratified: Divide the population into homogeneous groups called strata, then take an SRS from each stratum.
- Advantages: more precise than an SRS and cost reduced if strata already available.
- Disadvantages: difficult to divide into groups, more complex formulas & must know population.
- Systematic: Use a systematic approach (every 50th) after choosing randomly where to begin.
- Advantages: unbiased, the sample is evenly distributed across population & don’t need to know population.
- Disadvantages: a large σ^2 and can be confounded by trends.
- Cluster Sample: Based on location. Select a random location and sample ALL at that location.
- Advantages: cost is reduced, is unbiased & don’t need to know population.
- Disadvantages: May not be representative of population and has complex formulas.
- Random Digit Table: Each entry is equally likely and each digit is independent of the rest.
- Random # Generator: Calculator or computer program
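The four sampling designs can be imitated with a random number generator. A minimal sketch using Python's `random` module; the population, the strata labels, and the cluster size are all hypothetical.

```python
import random

random.seed(1)  # fixed seed only so the sketch is repeatable

# Hypothetical sampling frame of 200 students.
population = [f"student_{i:03d}" for i in range(1, 201)]

# SRS: every set of 20 units is equally likely to be chosen.
srs = random.sample(population, 20)

# Systematic: random starting point, then every 10th unit.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified: split into strata, then take an SRS from EACH stratum.
strata = {"juniors": population[:120], "seniors": population[120:]}
stratified = [unit for group in strata.values()
              for unit in random.sample(group, 10)]

# Cluster: split into clusters (e.g. homerooms of 25), pick one at random,
# and sample EVERYONE in it.
clusters = [population[i:i + 25] for i in range(0, len(population), 25)]
cluster_sample = random.choice(clusters)

print(len(srs), len(systematic), len(stratified), len(cluster_sample))
```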
Bias
- Error – favors a certain outcome, has to do with center of sampling distributions – if centered over true parameter then considered unbiased
Sources of Bias
- Voluntary Response: People choose themselves to participate.
- Convenience Sampling: Ask people who are easy, friendly, or comfortable asking.
- Undercoverage: Some group(s) are left out of the selection process.
- Non-response: Someone cannot or does not want to be contacted or participate.
- Response: False answers – can be caused by a variety of things
- Wording of the Questions: Leading questions.
Experimental Design
- Observational Study: Observe outcomes without giving a treatment.
- Experiment: Actively imposes a treatment on the subjects.
- Experimental Unit: Single individual or object that receives a treatment.
- Factor: Is the explanatory variable, what is being tested
- Level: A specific value for the factor.
- Response Variable: What you are measuring with the experiment.
- Treatment: Experimental condition applied to each unit.
- Control Group: A group used to compare the factor to for effectiveness – does NOT have to be placebo
- Placebo: A treatment with no active ingredients (provides control).
- Blinding: A method used so that the subjects are unaware of the treatment (who gets a placebo or the real treatment).
- Double Blinding: Neither the subjects nor the evaluators know which treatment is being given.
Principles of Experimental Design
- Control: Keep all extraneous variables (not being tested) constant
- Replication: Uses many subjects to quantify the natural variation in the response.
- Randomization: Uses chance to assign the subjects to the treatments.
- The only way to show cause and effect is with a well-designed, well-controlled experiment.
Experimental Designs
- Completely Randomized: All units are allocated to all of the treatments randomly
- Randomized Block: Units are blocked and then randomly assigned in each block – reduces variation
- Matched Pairs: Units are matched up by characteristics and then randomly assigned. Once one member of a pair receives a certain treatment, the other member automatically receives the second treatment. OR individuals do both treatments in random order (before/after or pretest/post-test). Assignment is dependent
- Confounding Variables: Are where the effect of the variable on the response cannot be separated from the effects of the factor being tested – happens in observational studies – when you use random assignment to treatments you do NOT have confounding variables!
- Randomization: Reduces bias by spreading extraneous variables to all groups in the experiment.
- Blocking: Helps reduce variability. Another way to reduce variability is to increase sample size.
Random Variable
- A numerical value that depends on the outcome of an experiment.
- Discrete: The random variable is a count (listable values)
- Continuous: The random variable is a measurement (any value in an interval)
Discrete Probability Distributions
- Gives values & probabilities associated with each possible x.
- μ_X = \sum x_i p(x_i)
- σ_X^2 = \sum (x_i - μ_X)^2 p(x_i)
- Calculator shortcut – 1 VARSTAT L1,L2
- Fair Game: A fair game is one in which all pay-ins equal all pay-outs.
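The mean and variance formulas for a discrete random variable can be applied directly to a probability table. A minimal sketch with a hypothetical game whose payout is x dollars with probability p(x):

```python
import math

# Hypothetical payout distribution: {payout x: probability p(x)}
distribution = {0: 0.5, 1: 0.3, 5: 0.2}

# A valid distribution has probabilities that sum to 1.
total_probability = sum(distribution.values())

mu = sum(x * p for x, p in distribution.items())                # expected value
var = sum((x - mu) ** 2 * p for x, p in distribution.items())   # variance
sd = math.sqrt(var)                                             # standard deviation

print(mu, var, sd)
```

Here the expected payout is $1.30, so under the fair game definition above the game would be fair if it cost $1.30 to play.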
Special Discrete Distributions
- Binomial Distributions
- Properties: two mutually exclusive outcomes, fixed number of trials (n), each trial is independent, the probability (p) of success is the same for all trials.
- Random variable - is the number of successes out of a fixed # of trials. Starts at X = 0 and is finite.
- μ_X = np
- σ_x = \sqrt{npq}
- Calculator:
- binomialpdf (n, p, x) = single outcome P(X = x)
- binomialcdf (n, p, x) = cumulative outcome P(X ≤ x)
- 1 - binomialcdf (n, p, (x - 1)) = cumulative outcome P(X ≥ x)
- Geometric Distributions
- Properties – two mutually exclusive outcomes, each trial is independent, probability (p) of success is the same for all trials. (NOT a fixed number of trials)
- Random Variable – the number of the trial on which the FIRST success occurs. Starts at X = 1 and has no upper limit (∞).
- Calculator:
- geometricpdf (p, a) = single outcome P(X = a)
- geometriccdf (p, a) = cumulative outcomes P(X ≤ a)
- 1 - geometriccdf (p, (a - 1)) = cumulative outcome P(X ≥ a)
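To see what the calculator pdf and cdf commands compute, here is a minimal sketch of the binomial and geometric formulas in Python. The function names are illustrative stand-ins, not calculator commands, and the values of n and p are hypothetical.

```python
from math import comb, sqrt

def binom_pdf(n, p, x):
    """P(X = x) for a binomial random variable."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(n, p, x):
    """P(X <= x): add up the single-outcome probabilities from 0 through x."""
    return sum(binom_pdf(n, p, k) for k in range(x + 1))

def geom_pdf(p, a):
    """P(first success occurs on trial a): a - 1 failures, then a success."""
    return (1 - p) ** (a - 1) * p

def geom_cdf(p, a):
    """P(first success occurs on or before trial a)."""
    return 1 - (1 - p) ** a

n, p = 10, 0.5                              # hypothetical binomial setting
p_exactly_3 = binom_pdf(n, p, 3)            # P(X = 3)
p_at_most_3 = binom_cdf(n, p, 3)            # P(X <= 3)
p_at_least_3 = 1 - binom_cdf(n, p, 2)       # P(X >= 3), note the x - 1
mean, sd = n * p, sqrt(n * p * (1 - p))     # np and sqrt(npq)

g_exactly_3 = geom_pdf(0.2, 3)              # P(X = 3) with p = 0.2
g_at_least_3 = 1 - geom_cdf(0.2, 2)         # P(X >= 3)

print(p_exactly_3, p_at_most_3, p_at_least_3, g_exactly_3, g_at_least_3)
```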
Continuous Random Variable
- Numerical values that fall within a range or interval (measurements), use density curves where the area under the curve always = 1. To find probabilities, find area under the curve
- Unusual Density Curves: Any shape (triangles, etc.)
- Uniform Distributions: Uniformly (evenly) distributed, shape of a rectangle
- Normal Distributions: Symmetrical, unimodal, bell-shaped curves defined by the parameters μ & σ
- Calculator:
- Normalpdf – used for graphing only
- Normalcdf(lower bound, upper bound, μ, σ) – finds probability
- InvNorm(p) – z-score OR InvNorm(p, μ, σ) – gives x-value
- To assess Normality - Use graphs – dotplots, boxplots, histograms, or normal probability plot.
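The Normalcdf and InvNorm calculations have direct counterparts in Python's `statistics.NormalDist`. A minimal sketch, assuming a hypothetical normal distribution with μ = 100 and σ = 15:

```python
from statistics import NormalDist

scores = NormalDist(mu=100, sigma=15)   # hypothetical normal distribution

# Like Normalcdf(85, 115, 100, 15): area between two bounds.
p_between = scores.cdf(115) - scores.cdf(85)

# Like Normalcdf(130, infinity, 100, 15): area in the upper tail.
p_above = 1 - scores.cdf(130)

# Like InvNorm(0.90): z-score with 90% of the area to its left.
z_90 = NormalDist().inv_cdf(0.90)

# Like InvNorm(0.90, 100, 15): x-value at the 90th percentile.
x_90 = scores.inv_cdf(0.90)

print(round(p_between, 4), round(p_above, 4), round(z_90, 3), round(x_90, 1))
```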
Distribution
- Is all of the values of a random variable.
Sampling Distribution
- Of a statistic is the distribution of all possible values of the statistic from all possible samples of the same size. Use normalcdf to calculate probabilities – be sure to use correct SD.
- μ_{\bar{X}} = μ_X
- σ_{\bar{x}} = \frac{σ}{\sqrt{n}}
- μ_{\hat{p}} = p
- σ_{\hat{p}} = \sqrt{\frac{pq}{n}}
- μ_{\bar{X}_1 - \bar{X}_2} = μ_{\bar{X}_1} - μ_{\bar{X}_2}
- σ_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{σ_1^2}{n_1} + \frac{σ_2^2}{n_2}}
- μ_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2
- σ_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}
- μ_b=β
- s_b: standard error of the slope of the LSRL (do not need to find, usually given in computer printout)
- Standard error: Estimate of the standard deviation of the statistic
- Central Limit Theorem: When n is sufficiently large (n ≥ 30) the sampling distribution of \bar{x} is approximately normal even if the population distribution is not normal.
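Using the "correct SD" means using the standard deviation of the statistic, not of the population. A minimal sketch for a sample mean and a sample proportion, with hypothetical population values and assuming the sampling distributions are approximately normal:

```python
import math
from statistics import NormalDist

# Sample mean: hypothetical population with mu = 50, sigma = 12, samples of n = 36.
mu, sigma, n = 50, 12, 36
sd_xbar = sigma / math.sqrt(n)                 # sigma / sqrt(n), NOT sigma
p_mean_above_53 = 1 - NormalDist(mu, sd_xbar).cdf(53)

# Sample proportion: hypothetical p = 0.4, samples of n = 150.
p, n_p = 0.4, 150
sd_phat = math.sqrt(p * (1 - p) / n_p)         # sqrt(pq / n)
p_phat_below_35 = NormalDist(p, sd_phat).cdf(0.35)

print(sd_xbar, round(p_mean_above_53, 4))
print(round(sd_phat, 4), round(p_phat_below_35, 4))
```

In the first case a sample mean of 53 is 1.5 standard deviations of \bar{x} above μ, even though it is only 0.25 population standard deviations above μ.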
Confidence Intervals
- Point Estimate: Uses a single statistic based on sample data, this is the simplest approach.
- Confidence Intervals: Used to estimate the unknown population parameter.
- Margin of Error: The smaller the margin of error, the more precise our estimate
- Steps:
- Assumptions – see table below
- Calculations – C.I. = statistic ± critical value (standard deviation of the statistic)
- Conclusion – Write your statement in context.
- We are [x]% confident that the true [parameter] of [context] is between [a] and [b].
- What makes the margin of error smaller
- make critical value smaller (lower confidence level).
- get a sample with a smaller s.
- make n larger.
- T distributions compared to standard normal curve
- centered around 0
- more spread out and shorter
- more area under the tails.
- when you increase n, t-curves become more normal.
- can be no outliers in the sample data
- Degrees of Freedom = n – 1
- Robust: If the assumption of normality is not met, the confidence level or p-value does not change much – this is true of t-distributions because there is more area in the tails
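The calculation step (statistic ± critical value × standard error) can be sketched for a one-proportion z-interval. The function name and the survey counts are hypothetical; a t-interval for a mean follows the same pattern but needs a t critical value from a table or calculator.

```python
import math
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Return (lower, upper, margin_of_error) for a population proportion."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value
    std_error = math.sqrt(p_hat * (1 - p_hat) / n)           # estimated SD of p-hat
    margin = z_star * std_error
    return p_hat - margin, p_hat + margin, margin

lower, upper, margin = one_prop_z_interval(120, 300)   # hypothetical survey
print(f"95% CI: ({lower:.4f}, {upper:.4f}), margin of error {margin:.4f}")

# Same p-hat with a lower confidence level or a larger n gives a smaller margin.
_, _, margin_90 = one_prop_z_interval(120, 300, confidence=0.90)
_, _, margin_big_n = one_prop_z_interval(480, 1200)
```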
Hypothesis Tests
- Hypothesis Testing: Assesses whether an observed result could plausibly have occurred by random chance alone. If it is unlikely to occur by random chance then it is statistically significant.
- Null Hypothesis: H_0 is the statement being tested. Null hypothesis should be “no effect”, “no difference”, or “no relationship”
- Alternate Hypothesis: H_a is the statement suspected of being true.
- P-Value: Assuming the null is true, the probability of obtaining the observed result or more extreme
- Level of Significance: α is the amount of evidence necessary before rejecting the null hypothesis.
- Steps:
- Assumptions – see table below
- Hypotheses - don’t forget to define parameter
- Calculations – find z or t test statistic & p-value
- Conclusion – Write your statement in context.
- Since the p-value is < (>) α, I reject (fail to reject) the H_0. There is (is not) sufficient evidence to suggest that [H_a].
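The calculation and decision steps can be sketched for a one-proportion z-test. The hypothesized value, sample counts, and significance level are hypothetical, and the alternative is assumed to be two-sided.

```python
import math
from statistics import NormalDist

p0 = 0.5                  # hypothesized proportion under H_0 (hypothetical)
successes, n = 170, 300   # made-up sample results
alpha = 0.05              # significance level

p_hat = successes / n
sd_null = math.sqrt(p0 * (1 - p0) / n)   # uses p0 because H_0 is assumed true
z = (p_hat - p0) / sd_null               # test statistic

# Two-sided alternative (H_a: p != p0), so count both tails.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = p_value < alpha

print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H_0: {reject_null}")
```

For a one-sided alternative, the p-value would be the area in a single tail instead of double the tail area.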
Type I and II Errors and Power
- Type I Error: Is when one rejects H_0 when H_0 is actually true. (probability is α)
- Type II Error: Is when you fail to reject H_0, and H_0 is actually false. (probability is β)
- α and β are inversely related. Consequences are the results of making a Type I or Type II error. Every decision has the possibility of making an error.
- The Power of a Test: Is the probability that the test will reject the null hypothesis when the null hypothesis is actually false. Power = 1 – β
- If you increase α: Type I error probability increases, β decreases, power increases
- If you increase n: α stays the same, β decreases, power increases
- If you increase the difference (μ_0 – μ_a): α stays the same, β decreases, power increases
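These relationships can be checked numerically. The sketch below assumes a simplified setting that is not described in the notes: a one-sided z-test for a mean (H_a: μ > μ_0) with a known σ. The function name and all numbers are hypothetical.

```python
import math
from statistics import NormalDist

def power(mu0, mu_a, sigma, n, alpha):
    """Power of a one-sided z-test of H_0: mu = mu0 vs H_a: mu > mu0 (sigma known)."""
    std_error = sigma / math.sqrt(n)
    # Reject H_0 when the sample mean exceeds this cutoff.
    cutoff = NormalDist(mu0, std_error).inv_cdf(1 - alpha)
    # beta = P(fail to reject H_0) when the true mean is actually mu_a.
    beta = NormalDist(mu_a, std_error).cdf(cutoff)
    return 1 - beta

base = power(mu0=100, mu_a=103, sigma=15, n=50, alpha=0.05)
bigger_alpha = power(mu0=100, mu_a=103, sigma=15, n=50, alpha=0.10)
bigger_n = power(mu0=100, mu_a=103, sigma=15, n=100, alpha=0.05)
bigger_gap = power(mu0=100, mu_a=106, sigma=15, n=50, alpha=0.05)

print(round(base, 3), round(bigger_alpha, 3), round(bigger_n, 3), round(bigger_gap, 3))
```

Each of the three changes raises the power above the baseline, which matches the pattern listed above.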
Chi-Square (χ2) Test
- Is used to test counts of categorical data.
- Types
- Goodness of Fit (univariate)
- Independence (bivariate)
- Homogeneity (univariate 2 (or more) samples)
- χ2 distribution – All curves are skewed right, every df has a different curve, and as the degrees of freedom increase the χ2 curve becomes more normal.
- Goodness of Fit: Is for univariate categorical data from a single sample. Does the observed count “fit” what we expect. Must use list to perform, df = number of the categories – 1, use χ2cdf (χ2, ∞, df) to calculate p-value
- Independence: Bivariate categorical data from one sample. Are the two variables independent or dependent? Use matrices to calculate
- Homogeneity: Single categorical variable from 2 (or more) samples. Are distributions homogeneous? Use matrices to calculate
- For both χ2 tests of independence & homogeneity:
- Expected counts = \frac{(\text{row total}) (\text{column total})}{\text{grand total}} & df = (r – 1)(c – 1)
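The expected counts, the χ2 statistic, and the degrees of freedom for a two-way table can be computed as follows. The observed counts are made up, and the statistic uses the standard formula χ2 = \sum (observed – expected)^2 / expected. The p-value would then come from χ2cdf on a calculator.

```python
# Hypothetical 2 x 3 table of observed counts (rows = groups, columns = categories).
observed = [[20, 30, 50],
            [30, 30, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = (row total)(column total) / (grand total)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)

# Condition check: every expected count should be at least 5.
counts_ok = all(e >= 5 for row in expected for e in row)

print(expected, round(chi_square, 4), df, counts_ok)
```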
Regression Model
- X & Y have a linear relationship where the true LSRL is μ_y = α + βx
- The responses (y) are normally distributed for a given x-value.
- The standard deviation of the responses (σ_y) is the same for all values of x.
- S is the estimate for σ_y
- Confidence Interval: b ± t^*s_b
- Hypothesis Testing: t = \frac{b - β}{s_b}
Assumptions:
- Proportions (z - procedures)
- One sample:
- SRS from population
- Can be approximated by normal distribution if np & n(1 – p) ≥ 10
- Population size is at least 10n
- Two samples:
- 2 independent SRS’s from populations (or randomly assigned treatments)
- Can be approximated by normal distribution if n_1p_1, n_1(1 – p_1), n_2p_2, & n_2(1 – p_2) ≥ 10
- Population sizes are at least 10n
- Means (t - procedures)
- One sample:
- SRS from population
- Distribution is approximately normal, justified by one of the following:
- Given in the problem
- Large sample size
- Graph of data is approximately symmetrical and unimodal with no outliers
- Matched pairs:
- SRS from population
- Distribution of differences is approximately normal, justified by one of the following:
- Given in the problem
- Large sample size
- Graph of differences is approximately symmetrical and unimodal with no outliers
- Two samples:
- 2 independent SRS’s from populations (or randomly assigned treatments)
- Distributions are approximately normal, justified by one of the following:
- Given in the problem
- Large sample sizes
- Graphs of data are approximately symmetrical and unimodal with no outliers
- Counts (χ^2 - procedures)
- All types:
- Reasonably random sample(s)
- All expected counts ≥ 5
- Must show expected counts
- Bivariate Data (t – procedures on slope)
- SRS from population
- There is linear relationship between x & y.
- Residual plot has no pattern.
- The standard deviation of the responses is constant for all values of x.
- Points are scattered evenly across the LSRL in the scatterplot.
- The responses are approximately normally distributed.
- Graph of residuals is approximately symmetrical & unimodal with no outliers.