AP Statistics Exam Study Guide

Statistics Basics

Statistics: Science of collecting, analyzing, and drawing conclusions from data.
Descriptive Statistics: Methods for organizing and summarizing data.
Inferential Statistics: Making generalizations from a sample to a population.
Population: Entire collection of individuals or objects.
Sample: Subset of the population selected for study.
Variable: Characteristic whose value changes.
Data: Observations on one or more variables.

Types of Variables

Categorical (Qualitative): Basic characteristics.
Numerical (Quantitative): Measurements or observations of numerical data.
- Discrete: Listable sets (counts).
- Continuous: Any value over an interval (measurements).
Univariate: One variable.
Bivariate: Two variables.
Multivariate: Many variables.

Distributions

Symmetrical: Data with roughly the same shape and size on both sides of the center.
Uniform: Every class has equal frequency.
Skewed: One side (tail) is longer than the other. Skewness is in the direction the tail points.
Bimodal: Two classes with large frequencies (peaks) are separated by at least one class with a lower frequency.

Describing Numerical Graphs (S.O.C.S.)

Shape: Symmetrical, skewed (right/left), uniform, or bimodal.
Outliers: Unusual values, along with other unusual features such as gaps and clusters.
Center: Middle of the data (mean, median, mode).
Spread: Variability (range, standard deviation, IQR).
Context: Everything must be in context to the data and situation.
Comparison: When comparing distributions, use comparative language.

Parameters vs. Statistics

Parameter: Value of a population (typically unknown).
Statistic: Value calculated from a sample, used to estimate a population parameter.

Measures of Center

Median: Middle point of the data (50th percentile) in numerical order.
Mean: $μ$ for population, $\bar{x}$ for sample.
Mode: Occurs most in the data. Can have multiple modes or none.

Measures of Spread (Variability)

Range: $Max - Min$ .
IQR: Interquartile range $(Q3 - Q1)$ .
Standard Deviation: $σ$ for population, $s$ for sample. Measures typical deviation from the mean. For the sample standard deviation, the sum of squared deviations is divided by $df = n-1$ before taking the square root.
Sum of deviations from the mean is always zero.
Variance: Standard deviation squared.
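
The definitions above can be checked by hand on a small data set. The following is a minimal sketch in Python; the data values are invented for illustration and the variable names are assumptions, not part of the guide.

```python
import math
import statistics

# Hypothetical data set, invented for illustration only.
data = [2, 4, 4, 5, 7, 8]

mean = statistics.mean(data)
deviations = [x - mean for x in data]

# Deviations from the mean always cancel out.
sum_of_deviations = sum(deviations)

# Sample variance: sum of squared deviations divided by df = n - 1.
sample_variance = sum(d ** 2 for d in deviations) / (len(data) - 1)

# Standard deviation is the square root of the variance.
sample_sd = math.sqrt(sample_variance)

data_range = max(data) - min(data)

print(sum_of_deviations, sample_variance, sample_sd, data_range)
```

The hand-computed `sample_sd` should agree with the library's `statistics.stdev(data)`, which also divides by n - 1.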

Resistant vs. Non-Resistant Measures

Resistant: Not affected by outliers (Median, IQR).
Non-Resistant: Affected by outliers (Mean, Range, Variance, Standard Deviation, Correlation Coefficient (r), Least Squares Regression Line (LSRL), Coefficient of Determination $(r^2)$ ).

Comparison of Mean & Median Based on Graph Type

Symmetrical: Mean and median are approximately the same.
Skewed Right: Mean > Median.
Skewed Left: Mean < Median.
Mean is pulled in the direction of the skew.
Trimmed Mean: Use a % to remove observations from the top and bottom to eliminate outliers.

Linear Transformations of Random Variables

$μ_{a+bx} = a + bμ_x$ (Mean is changed by both addition/subtraction & multiplication/division).
$σ_{a+bx} = |b|σ_x$ (Standard deviation is changed by multiplication/division ONLY).

Combination of Two (or More) Random Variables

$μ_{x ± y} = μ_x ± μ_y$ (Add or subtract the means).
$σ^2_{x ± y} = σ^2_x + σ^2_y$ (Always add the variances - X & Y MUST be independent).
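
A short numeric sketch of the transformation and combination rules, assuming two independent random variables whose means and standard deviations are invented for illustration:

```python
import math

# Hypothetical independent random variables X and Y (values invented).
mu_x, sigma_x = 10.0, 3.0
mu_y, sigma_y = 4.0, 4.0

# Linear transformation a + bX with a = 5 and b = -2.
a, b = 5.0, -2.0
mu_linear = a + b * mu_x          # the mean uses both a and b
sigma_linear = abs(b) * sigma_x   # the standard deviation uses only |b|

# Difference X - Y: subtract the means, but ADD the variances.
mu_diff = mu_x - mu_y
sigma_diff = math.sqrt(sigma_x ** 2 + sigma_y ** 2)

print(mu_linear, sigma_linear, mu_diff, sigma_diff)
```

Note that the standard deviations themselves are never added; only the variances are, and the square root is taken at the end.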

Z-Score

Standardized score indicating how many standard deviations an observation is from the mean. Standardized values have $μ = 0$ & $σ = 1$; when the original distribution is normal, the z-scores follow the standard normal curve.
$z = {x - μ \over σ}$
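
Z-scores are useful for comparing observations from different distributions. A minimal sketch, assuming two hypothetical exams whose means and standard deviations are invented for illustration:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical exam results (values invented for illustration).
z_math = z_score(82, mu=70, sigma=8)
z_english = z_score(88, mu=80, sigma=10)

# The larger z-score is the better result relative to its own distribution.
better_subject = "math" if z_math > z_english else "english"
print(z_math, z_english, better_subject)
```

Here the raw English score is higher, but the math score is further above its own mean in standard deviation units.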

Normal Curve

Bell-shaped and symmetrical.
As $σ$ increases, the curve flattens.
As $σ$ decreases, the curve thins.

Empirical Rule (68-95-99.7)

Measures $1σ$ , $2σ$ , and $3σ$ on normal curves from the center $μ$ .
68% of the population is between $-1σ$ and $1σ$ .
95% of the population is between $-2σ$ and $2σ$ .
99.7% of the population is between $-3σ$ and $3σ$ .
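
The three percentages can be verified from areas under the standard normal curve. This sketch uses `NormalDist` from Python's standard `statistics` module; the helper name `area_within` is an assumption made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = area_within(1)
within_2 = area_within(2)
within_3 = area_within(3)

print(round(within_1, 4), round(within_2, 4), round(within_3, 4))
```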

Boxplots

For medium or large numerical data; doesn't contain original observations.
Use modified boxplots with fences at $1.5 * IQR$ from the ends of the box (Q1 & Q3).
Points outside the fence are outliers.
Whiskers extend to the smallest & largest observations within the fences.

5-Number Summary

Minimum, Q1 (25th Percentile), Median, Q3 (75th Percentile), Maximum
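
The fence rule for modified boxplots can be sketched as follows. The data are invented, and the quartiles are found with the median-of-halves convention (the median itself is excluded from each half when n is odd); other software may use a slightly different quartile rule, so treat that choice as an assumption.

```python
import statistics

def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median position
    upper = ordered[half + n % 2:]    # values above the median position
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

# Hypothetical data set, invented for illustration only.
data = [1, 3, 4, 5, 6, 7, 8, 9, 30]

minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers stop at the most extreme observations inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

print(q1, q3, lower_fence, upper_fence, outliers, whiskers)
```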

Probability Rules

Sample Space: Collection of all outcomes.
Event: Any subset of outcomes from the sample space.
Complement: All outcomes not in the event.
Union: A or B, all outcomes in either circle (A, B, or both). $A ∪ B$
Intersection: A and B, the outcomes in the overlap of A and B. $A ∩ B$
Mutually Exclusive (Disjoint): A and B have no intersection; they cannot happen at the same time.
Independent: Knowing that one event has occurred doesn't change the probability of the other.
Experimental Probability: Number of successes from an experiment divided by the total amount from the experiment.
Law of Large Numbers: As an experiment is repeated, the experimental probability gets closer to the true probability.

Probability Rules (Formulas)

All values are 0 ≤ P ≤ 1.
Probability of sample space is 1.
Complement: $P(A^c) = 1 - P(A)$
Addition: P(A or B) = P(A) + P(B) – P(A & B)
Multiplication: P(A & B) = P(A) * P(B) if A & B are independent.
$P(\text{at least 1}) = 1 - P(\text{none})$
Conditional Probability: $P(A|B) = {P(A \text{ and } B) \over P(B)}$
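
A worked sketch of these formulas, assuming a hypothetical group of 100 students in which 40 play a sport (event A), 30 are in band (event B), and 12 do both. The counts are invented for illustration; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

total = 100
p_a = Fraction(40, total)        # P(A): plays a sport
p_b = Fraction(30, total)        # P(B): is in band
p_a_and_b = Fraction(12, total)  # P(A and B)

p_not_a = 1 - p_a                        # complement rule
p_a_or_b = p_a + p_b - p_a_and_b         # addition rule
p_a_given_b = p_a_and_b / p_b            # conditional probability
independent = p_a_and_b == p_a * p_b     # multiplication rule check

# At least one success in 3 independent trials, each with probability P(A).
p_at_least_one = 1 - (1 - p_a) ** 3

print(p_a_or_b, p_a_given_b, independent, p_at_least_one)
```

Because P(A|B) equals P(A) in this example, knowing B does not change the probability of A, which matches the independence check.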

Correlation Coefficient

(r) - Quantitative assessment of the strength and direction of a linear relationship. (use $ρ$ (rho) for population parameter)
Values: $[-1, 1]$ ; 0 - no correlation, $(0, ±0.5)$ - weak, $[±0.5, ±0.8)$ - moderate, $[±0.8, ±1]$ - strong

Least Squares Regression Line (LSRL)

Line of best fit. Minimizes the sum of the squared deviations (residuals) from the line. Used with bivariate data.
$\hat{y} = a + bx$ ; x is independent, y is dependent.
Residuals (error) - vertical difference of a point from the LSRL. All residuals sum to 0.
Residual = $y - \hat{y}$
Residual Plot - scatterplot of $(x, residual)$ . No pattern indicates a linear model is appropriate.
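
The slope, intercept, correlation, and residuals can be computed directly from their definitions. A minimal sketch with invented data; the variable names (`s_xx`, `s_xy`, and so on) are assumptions chosen for illustration.

```python
import math

# Hypothetical bivariate data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products around the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx                     # slope
a = y_bar - b * x_bar               # intercept: the line passes through (x_bar, y_bar)
r = s_xy / math.sqrt(s_xx * s_yy)   # correlation coefficient

predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

print(a, b, r, r ** 2, sum(residuals))
```

Up to floating-point rounding, the residuals sum to 0, and `r ** 2` gives the proportion of variation in y explained by the line.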

Coefficient of Determination

$(r^2)$ - Proportion of variation in y explained by the relationship of (x, y). Never use the adjusted $r^2$ .
Interpretations: (must be in context!)
- Slope (b): For each one-unit increase in x, the predicted y increases/decreases by the slope amount.
- Correlation coefficient (r): There is a (strength, direction, linear) association between x & y.
- Coefficient of determination $(r^2)$ : Approximately $r^2$ (as a %) of the variation in y can be explained by the LSRL of x and y.
Extrapolation - The LSRL should not be used to predict values outside the range of the original data.
Influential Points - if removed, significantly change the LSRL.
Outliers - points with large residuals.

Census

A complete count of the population. Why not to use a census?
- Expensive
- Impossible to do
- With destructive sampling, a census would destroy the entire population

Sampling Frame

Is a list of everyone in the population.

Sampling Design

Refers to the method used to choose a sample.

SRS (Simple Random Sample)

The sample is chosen so that each unit has an equal chance of being selected and every possible set of n units has an equal chance of being the sample.
- Advantages: Easy and unbiased
- Disadvantages: Large $σ^2$ and must know population.
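
Drawing an SRS with a random number generator can be sketched as below. The sampling frame of 200 labelled students and the seed are invented for illustration.

```python
import random

rng = random.Random(2024)   # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: a list of every individual in the population.
sampling_frame = [f"student_{i:03d}" for i in range(1, 201)]

# random.sample draws without replacement; every set of 20 is equally likely.
srs = rng.sample(sampling_frame, k=20)

print(srs)
```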

Stratified

Divide the population into homogeneous groups called strata, then take an SRS from each stratum.
- Advantages: More precise than an SRS and cost reduced if strata are already available.
- Disadvantages: Difficult to divide into groups, more complex formulas & must know population.

Systematic

Use a systematic approach (every 50th) after choosing randomly where to begin.
- Advantages: Unbiased, the sample is evenly distributed across the population & don’t need to know population.
- Disadvantages: A large $σ^2$ and can be confounded by trends.

Cluster Sample

Based on location. Select a random location and sample ALL at that location.
- Advantages: Cost is reduced, unbiased & don’t need to know the population.
- Disadvantages: May not be representative of the population and has complex formulas.

Random Digit Table

Each entry is equally likely, and each digit is independent of the rest.

Random # Generator

Calculator or computer program

Bias

Systematic error that favors a certain outcome; it shifts the center of the sampling distribution away from the true parameter.

Sources of Bias

Voluntary Response: People choose themselves to participate.
Convenience Sampling: Ask people who are easy or comfortable to ask.
Undercoverage: Some group(s) are left out of the selection process.
Non-response: Someone cannot or does not want to participate.
Response: False answers due to question wording.

Experimental Design

Observational Study: Observe outcomes without giving a treatment.
Experiment: Actively impose a treatment on the subjects.
Experimental Unit: Single individual or object that receives a treatment.
Factor: Explanatory variable being tested.
Level: A specific value for the factor.
Response Variable: What you are measuring with the experiment.
Treatment: Experimental condition applied to each unit.
Control Group: Group used to compare the factor to for effectiveness (doesn't have to be placebo).
Placebo: Treatment with no active ingredients.
Blinding: Subjects are unaware of the treatment.
Double Blinding: Neither subjects nor evaluators know which treatment is being given.

Principles of Experimental Design

Control: Keep all extraneous variables constant.
Replication: Use many subjects to quantify the natural variation in the response.
Randomization: Use chance to assign subjects to treatments.
The only way to show cause and effect is with a well designed, well controlled experiment.

Experimental Designs

Completely Randomized: All units are randomly allocated to all treatments.
Randomized Block: Units are blocked and then randomly assigned within each block (reduces variation).
Matched Pairs: Units are matched and then randomly assigned. OR individuals do both treatments in random order (assignment is dependent).
Confounding Variables: Effect of the variable on the response cannot be separated from the factor being tested - happens in observational studies.
Randomization reduces bias by spreading extraneous variables to all groups.
Blocking helps reduce variability. Another way to reduce variability is to increase sample size.
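
Random assignment in a completely randomized design can be sketched as follows, assuming 20 hypothetical subjects split evenly between a treatment group and a control group (labels and seed invented for illustration).

```python
import random

rng = random.Random(42)   # fixed seed so the sketch is repeatable

# Hypothetical experimental units.
subjects = [f"subject_{i}" for i in range(1, 21)]

# Shuffle a copy, then split: chance alone decides who gets which treatment.
shuffled = subjects[:]
rng.shuffle(shuffled)
treatment_group = shuffled[:10]
control_group = shuffled[10:]

print(treatment_group)
print(control_group)
```

For a randomized block design, the same shuffle-and-split step would be repeated separately inside each block.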

Random Variable

A numerical value that depends on the outcome of an experiment.

Discrete Probability Distributions

Gives values & probabilities associated with each possible x.
$μ_X = Σ x_i \cdot p(x_i)$
$σ^2_X = Σ (x_i - μ_X)^2 \cdot p(x_i)$
Fair Game = Expected pay-out equals the pay-in (expected net gain is 0).
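
A minimal sketch of the mean and variance formulas for a discrete random variable, using a probability distribution invented for illustration:

```python
import math

# Hypothetical probability distribution (values invented).
values = [0, 1, 2, 3]
probabilities = [0.1, 0.3, 0.4, 0.2]

# Probabilities of all possible values must sum to 1.
total_probability = sum(probabilities)

# Mean: each value weighted by its probability.
mu = sum(x * p for x, p in zip(values, probabilities))

# Variance: squared deviation from the mean, weighted by probability.
variance = sum((x - mu) ** 2 * p for x, p in zip(values, probabilities))
sigma = math.sqrt(variance)

print(mu, variance, sigma)
```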

Special Discrete Distributions

Binomial Distributions:
- Two mutually exclusive outcomes, fixed number of trials (n), each trial is independent, probability (p) of success is the same for all trials.
- Random variable - is the number of successes out of a fixed # of trials. Starts at X = 0 and is finite.
- $μ_X = np$
- $σ_X = \sqrt{npq}$ , where $q = 1 - p$
- Calculator: binomialpdf (n, p, x) = single outcome P(X = x), binomialcdf (n, p, x) = cumulative outcome P(X ≤ x), 1 - binomialcdf (n, p, (x - 1)) = cumulative outcome P(X ≥ x)
Geometric Distributions:
- Two mutually exclusive outcomes, each trial is independent, probability (p) of success is the same for all trials. (NOT a fixed number of trials)
- Random Variable – the trial on which the FIRST success occurs. Starts at X = 1 and has no upper limit.
- Calculator: geometricpdf (p, a) = single outcome P(X = a), geometriccdf (p, a) = cumulative outcomes P(X ≤ a), 1 - geometriccdf (p, (a - 1)) = cumulative outcome P(X ≥ a)
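
What the calculator's pdf and cdf commands compute can be sketched in Python. The function names below are assumptions chosen for this sketch, not calculator or library functions; each cdf is cumulative up to and including its last argument.

```python
from math import comb, sqrt

def binomial_pdf(n, p, x):
    """P(X = x): exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(n, p, x):
    """P(X <= x): at most x successes in n trials."""
    return sum(binomial_pdf(n, p, k) for k in range(x + 1))

def geometric_pdf(p, a):
    """P(X = a): first success occurs on trial a."""
    return (1 - p) ** (a - 1) * p

def geometric_cdf(p, a):
    """P(X <= a): first success occurs on or before trial a."""
    return 1 - (1 - p) ** a

# Hypothetical binomial setting: n = 10 trials, p = 0.5.
p_exactly_3 = binomial_pdf(10, 0.5, 3)
p_at_most_3 = binomial_cdf(10, 0.5, 3)
p_at_least_4 = 1 - binomial_cdf(10, 0.5, 3)   # complement of "at most 3"
binomial_mean = 10 * 0.5
binomial_sd = sqrt(10 * 0.5 * 0.5)

# Hypothetical geometric setting: p = 0.2.
g_exactly_3 = geometric_pdf(0.2, 3)
g_at_least_4 = 1 - geometric_cdf(0.2, 3)      # complement of "at most 3"

print(p_exactly_3, p_at_most_3, p_at_least_4, g_exactly_3, g_at_least_4)
```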

Continuous Random Variable

Numerical values that fall within a range or interval (measurements). Area under the curve always = 1. To find probabilities, find the area under the curve.

Unusual Density Curves

Any shape (triangles, etc.)

Uniform Distributions

Uniformly (evenly) distributed, shape of a rectangle

Normal Distributions

Symmetrical, unimodal, bell shaped curves defined by the parameters $μ$ & $σ$ .
Calculator:
- Normalpdf – used for graphing only
- Normalcdf(lower bound, upper bound, μ, σ) – finds probability
- InvNorm(p) – z-score OR InvNorm(p, μ, σ) – gives x-value
To assess Normality - Use graphs – dotplots, boxplots, histograms, or normal probability plot.
Distribution – is all of the values of a random variable.
Sampling Distribution – of a statistic is the distribution of the values the statistic takes over all possible samples of the same size. When it is approximately normal, use normalcdf to calculate probabilities.
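
The normal calculations listed above can be mirrored with `NormalDist` from Python's standard `statistics` module. The wrapper names `normal_cdf` and `inv_norm` are assumptions for this sketch, as are the example values $μ = 100$ and $σ = 15$.

```python
from statistics import NormalDist

def normal_cdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(p, mu=0.0, sigma=1.0):
    """Value with area p to its left under the normal curve."""
    return NormalDist(mu, sigma).inv_cdf(p)

# Hypothetical normal variable with mean 100 and standard deviation 15.
p_between = normal_cdf(85, 115, mu=100, sigma=15)
z_975 = inv_norm(0.975)                       # a z-score
x_90th = inv_norm(0.90, mu=100, sigma=15)     # an x-value

print(round(p_between, 4), round(z_975, 3), round(x_90th, 2))
```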

Sampling Distributions

$μ_{\bar{x}} = μ_X$
$σ_{\bar{x}} = {σ \over \sqrt{n}}$ (standard deviation of the sample means)
$\mu_{\hat{p}} = p$
$σ_{\hat{p}} = \sqrt{{pq \over n}}$ (standard deviation of the sample proportions)
$μ_{\bar{x}_1 - \bar{x}_2} = μ_{x_1} - μ_{x_2}$
$σ_{\bar{x}_1 - \bar{x}_2} = \sqrt{{σ_1^2 \over n_1} + {σ_2^2 \over n_2}}$ (standard deviation of the difference in sample means)
$μ_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$
$σ_{\hat{p}_1 - \hat{p}_2} = \sqrt{{p_1 q_1 \over n_1} + {p_2 q_2 \over n_2}}$ (standard deviation of the difference in sample proportions)
$\mu_b = β$
$s_b$ (standard error of the slopes of the LSRLs; do not need to find, usually given in computer printout)
Standard error – estimate of the standard deviation of the statistic

Central Limit Theorem

When n is sufficiently large (n ≥ 30) the sampling distribution of the sample mean is approximately normal even if the population distribution is not normal.
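
A simulation can illustrate the theorem. This sketch repeatedly samples from a strongly right-skewed population (exponential with mean 1 and standard deviation 1, an assumption chosen for illustration) and compares the spread of the sample means with $σ / \sqrt{n}$. The seed, sample size, and number of repetitions are arbitrary.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the sketch is repeatable

n = 40                   # size of each sample
num_samples = 2000       # number of repeated samples

def one_sample_mean():
    """Mean of one sample of size n from an exponential population (mean 1, sd 1)."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

sample_means = [one_sample_mean() for _ in range(num_samples)]

mean_of_means = statistics.mean(sample_means)   # should be near the population mean
sd_of_means = statistics.stdev(sample_means)    # should be near sigma / sqrt(n)
predicted_sd = 1 / n ** 0.5

print(mean_of_means, sd_of_means, predicted_sd)
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is far from normal.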

Confidence Intervals

Point Estimate: Uses a single statistic based on sample data.
Confidence Intervals: Used to estimate the unknown population parameter.
Margin of Error: The smaller the margin of error, the more precise our estimate.
Steps:
- Assumptions – see table below
- Calculations – C.I. = statistic ± critical value * (standard deviation of the statistic)
- Conclusion – Write your statement in context. We are [x]% confident that the true [parameter] of [context] is between [a] and [b].
What makes the margin of error smaller:
- Make critical value smaller (lower confidence level).
- Get a sample with a smaller s.
- Make n larger.
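
The calculation step and the margin-of-error facts above can be sketched for a one-proportion z-interval. The counts (240 successes out of 400) are invented, and the function name is an assumption for this sketch.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """One-proportion z-interval: statistic +/- critical value * standard error."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z_star * standard_error
    return p_hat - margin, p_hat + margin, margin

# Hypothetical sample: 240 successes out of 400.
low, high, margin = proportion_interval(240, 400)

# Same sample proportion with 4 times the sample size: margin is cut in half.
_, _, margin_4n = proportion_interval(960, 1600)

# Lower confidence level: smaller critical value, smaller margin.
_, _, margin_90 = proportion_interval(240, 400, confidence=0.90)

print(round(low, 3), round(high, 3), round(margin, 4))
```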

T distributions compared to standard normal curve

Centered around 0
More spread out and shorter
More area under the tails.
When you increase n, t-curves become more normal.
The sample data should have no outliers.
Degrees of Freedom = n – 1
Robust – if the assumption of normality is not met, the confidence level or p-value does not change much – this is true of t-distributions because there is more area in the tails

Hypothesis Tests

Tells us whether an observed result could plausibly occur by random chance alone. If it is unlikely to occur by random chance then it is statistically significant.
Null Hypothesis: $H_0$ is the statement being tested. Null hypothesis should be “no effect”, “no difference”, or “no relationship”
Alternate Hypothesis: $H_a$ is the statement suspected of being true.
P-Value: Assuming the null is true, the probability of obtaining the observed result or more extreme
Level of Significance: $α$ is the amount of evidence necessary before rejecting the null hypothesis.
Steps:
- Assumptions – see table below
- Hypotheses - don’t forget to define parameter
- Calculations – find z or t test statistic & p-value
- Conclusion – Write your statement in context. Since the p-value is < (>) $α$ , I reject (fail to reject) $H_0$ . There is (is not) sufficient evidence to suggest that [ $H_a$ ].
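
The calculation and decision steps can be sketched for a one-proportion z-test of $H_0: p = 0.5$ against $H_a: p > 0.5$. The sample counts and significance level are invented for illustration.

```python
import math
from statistics import NormalDist

# Hypothetical setting: H0: p = 0.5, Ha: p > 0.5, with 220 successes in 400 trials.
p_0 = 0.5
n = 400
successes = 220
alpha = 0.05

p_hat = successes / n

# The standard deviation uses p_0 because the null is assumed true.
standard_deviation = math.sqrt(p_0 * (1 - p_0) / n)
z = (p_hat - p_0) / standard_deviation

# Upper-tail p-value: probability of a result this extreme or more, if H0 is true.
p_value = 1 - NormalDist().cdf(z)

reject_null = p_value < alpha
print(round(z, 2), round(p_value, 4), reject_null)
```

For a two-sided alternative the tail area would be doubled; for a lower-tail alternative the area to the left of z would be used instead.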

Type I and II Errors and Power

Type I Error: Reject $H_0$ when $H_0$ is actually true (probability is $α$ ).
Type II Error: Fail to reject $H_0$ , and $H_0$ is actually false (probability is $β$ ).
$α$ and $β$ are inversely related. Consequences are the results of making a Type I or Type II error.
The Power of a Test – is the probability that the test will reject the null hypothesis when the null hypothesis is false (that is, when a specific alternative value is true). Power = 1 – $β$

If you increase	Type I error $α$	Type II error $β$	Power
$α$	Increases	Decreases	Increases
n	Same	Decreases	Increases
$(μ_0 – μ_a)$	Same	Decreases	Increases
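
The pattern in the table can be checked numerically. This sketch assumes a simplified setting that is not taken from the guide: an upper-tail z-test for a mean with known $σ$, where power is the probability that the sample mean lands above the rejection cutoff when the true mean is $μ_a$. All numbers are invented.

```python
import math
from statistics import NormalDist

def power_upper_tail(mu_0, mu_a, sigma, n, alpha):
    """Power of an upper-tail z-test of H0: mu = mu_0 when the true mean is mu_a."""
    z = NormalDist()
    standard_error = sigma / math.sqrt(n)
    # Reject H0 when the sample mean exceeds this cutoff.
    critical_mean = mu_0 + z.inv_cdf(1 - alpha) * standard_error
    # beta: probability of failing to reject when the true mean is mu_a.
    beta = z.cdf((critical_mean - mu_a) / standard_error)
    return 1 - beta

base = power_upper_tail(100, 105, sigma=15, n=25, alpha=0.05)
larger_alpha = power_upper_tail(100, 105, sigma=15, n=25, alpha=0.10)
larger_n = power_upper_tail(100, 105, sigma=15, n=100, alpha=0.05)
larger_gap = power_upper_tail(100, 110, sigma=15, n=25, alpha=0.05)

print(round(base, 3), round(larger_alpha, 3), round(larger_n, 3), round(larger_gap, 3))
```

Each change (larger $α$, larger n, larger gap between $μ_0$ and $μ_a$) raises the power above the base case, matching the three rows of the table.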

$\chi^2$ Test

Used to test counts of categorical data.
- Goodness of Fit (univariate)
- Independence (bivariate)
- Homogeneity (univariate 2 (or more) samples)
$\chi^2$ Distribution: All curves are skewed right, every df has a different curve, and as the degrees of freedom increase the $\chi^2$ curve becomes more normal.
Goodness of Fit: Univariate categorical data from a single sample. Does the observed count “fit” what we expect? Must use list to perform, df = number of the categories – 1, use χ2cdf (χ2, ∞, df) to calculate p-value
Independence: Bivariate categorical data from one sample. Are the two variables independent or dependent? Use matrices to calculate
Homogeneity: Single categorical variable from 2 (or more) samples. Are distributions homogeneous? Use matrices to calculate

For both $\chi^2$ tests of independence & homogeneity:

Expected counts = ${(row total)(column total) \over grand total}$
df = (r – 1)(c – 1)
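
The expected-count and degrees-of-freedom formulas can be sketched for a two-way table. The observed counts below are invented for illustration; the p-value step is left to the calculator's $\chi^2$ cdf command.

```python
# Hypothetical observed counts: 2 rows by 3 columns (values invented).
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = (row total)(column total) / grand total.
expected = [[r * c / grand_total for c in column_totals] for r in row_totals]

# Test statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected, round(chi_square, 3), df)
```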

Regression Model

X & Y have a linear relationship where the true LSRL is $μ_y = α + βx$
The responses (y) are normally distributed for a given x-value.
The standard deviation of the responses ( $σ_y$ ) is the same for all values of x.
- s is the estimate for $σ_y$

Confidence Interval

$b ± t^*s_b$

Hypothesis Testing

$t = {b - β \over s_b}$ with df = n – 2

Assumptions:

Proportions (z-procedures)

One sample:
- SRS from population
- Can be approximated by normal distribution if $np$ & $n(1 – p)$ > 10
- Population size is at least 10n
Two samples:
- 2 independent SRS's from populations (or randomly assigned treatments)
- Can be approximated by normal distribution if $n_1p_1$ , $n_1(1 – p_1)$ , $n_2p_2$ , & $n_2(1 – p_2)$ > 10
- Population sizes are at least 10n

Means (t-procedures)

One sample:
- SRS from population
- Distribution is approximately normal: given, OR large sample size, OR graph of data is approximately symmetrical and unimodal with no outliers
Matched pairs:
- SRS from population
- Distribution of differences is approximately normal: given, OR large sample size, OR graph of differences is approximately symmetrical and unimodal with no outliers
Two samples:
- 2 independent SRS's from populations (or randomly assigned treatments)
- Distributions are approximately normal: given, OR large sample sizes, OR graphs of data are approximately symmetrical and unimodal with no outliers

Counts ( $\chi^2$ -procedures)

All types:
- Reasonably random sample(s)
- All expected counts > 5 (must show expected counts)

Bivariate Data (t-procedures on slope)

- SRS from population
- There is a linear relationship between x & y.
- Residual plot has no pattern.
- The standard deviation of the responses is constant for all values of x.
- Points are scattered evenly across the LSRL in the scatterplot.
- The responses are approximately normally distributed.
- Graph of residuals is approximately symmetrical & unimodal with no outliers.
