AP Statistics Cumulative AP Exam Study Guide Notes

Statistics

Statistics is the science of collecting, analyzing, and drawing conclusions from data.
Descriptive statistics involve methods of organizing and summarizing data.
Inferential statistics involve making generalizations from a sample to the population.

Basic Definitions

Population: An entire collection of individuals or objects.
Sample: A subset of the population selected for study.
Variable: Any characteristic whose value may change from one individual to another.
Data: Observations on one or more variables.

Types of Variables

Categorical (Qualitative): Basic characteristics.
Numerical (Quantitative): Measurements or observations of numerical data.
- Discrete: Listable sets (counts).
- Continuous: Any value over an interval of values (measurements).
Univariate: One variable.
Bivariate: Two variables.
Multivariate: Many variables.

Distributions

Symmetrical: Both sides of the distribution have roughly the same shape and size (e.g., a bell curve).
Uniform: Every class has an equal frequency (number) (a rectangle).
Skewed: One side (tail) is longer than the other side. The skewness is in the direction that the tail points (left or right).
Bimodal: Data of two or more classes have large frequencies separated by another class between them (double hump camel).

Describing Numerical Graphs: S.O.C.S.

Shape: Overall type (symmetrical, skewed right, skewed left, uniform, or bimodal).
Outliers: Gaps, clusters, etc.
Center: Middle of the data (mean, median, and mode).
Spread: Refers to variability (range, standard deviation, and IQR).
Everything must be in context to the data and situation of the graph.
When comparing two distributions – MUST use comparative language!

Parameter vs. Statistic

Parameter: Value of a population (typically unknown).
Statistic: A value calculated from a sample(s), used to estimate a population parameter.

Measures of Center

Median: The middle point of the data (50th percentile) when the data are in numerical order. If there are two middle values (an even number of observations), average them together.
Mean: μ is for a population (parameter) and \bar{x} is for a sample (statistic).
Mode: Occurs the most in the data. There can be more than one mode, or no mode at all if all data points occur once.

Variability

Allows statisticians to distinguish between usual and unusual occurrences.

Measures of Spread (Variability)

Range: A single value – (Max – Min)
IQR (Interquartile Range): (Q3 – Q1)
Standard Deviation: σ for population (parameter) & s for sample (statistic) – measures the typical or average deviation of observations from the mean – for the sample standard deviation, the sum of squared deviations is divided by df = n - 1 before taking the square root
The sum of the deviations from the mean is always zero!
Variance: Standard deviation squared
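
As a quick check of the spread measures above, here is a minimal Python sketch. The data values are made up for illustration, and the variable names are not from the original notes; only the standard-library `statistics` module is used.

```python
import statistics

# Hypothetical sample (made-up values, for illustration only)
data = [4, 8, 6, 5, 3, 10]

x_bar = statistics.mean(data)            # sample mean
deviations = [x - x_bar for x in data]   # each observation minus the mean

# Sample variance: sum of squared deviations divided by df = n - 1
n = len(data)
sample_var = sum(d ** 2 for d in deviations) / (n - 1)
sample_sd = sample_var ** 0.5            # standard deviation = sqrt(variance)

data_range = max(data) - min(data)       # a single value: max - min

print("mean:", x_bar)
print("sum of deviations:", sum(deviations))
print("variance:", sample_var, "sd:", round(sample_sd, 4))
print("range:", data_range)
```

The hand-computed values agree with `statistics.variance(data)` and `statistics.stdev(data)`, which also divide by n - 1.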

Resistant vs. Non-Resistant Measures

Resistant: Not affected by outliers.
- Median
- IQR
Non-Resistant:
- Mean
- Range
- Variance
- Standard Deviation
- Correlation Coefficient (r)
- Least Squares Regression Line (LSRL)
- Coefficient of Determination (r^2)

Comparison of Mean & Median Based on Graph Type

Symmetrical: Mean and the median are the same value.
Skewed Right: Mean is a larger value than the median.
Skewed Left: The mean is smaller than the median.
The mean is always pulled in the direction of the skew away from the median.
Trimmed Mean: Use a % to take observations away from the top and bottom of the ordered data. This possibly eliminates outliers.

Linear Transformations of Random Variables

μ_{a+bX} = a + bμ_X The mean is changed by both addition (subtraction) & multiplication (division).
σ_{a+bX} = |b|σ_X The standard deviation is changed by multiplication (division) ONLY.

Combination of Two (or More) Random Variables

μ_{X ± Y} = μ_X ± μ_Y Just add or subtract the two (or more) means
σ_{X ± Y} = \sqrt{σ_X^2 + σ_Y^2} Always add the variances – X & Y MUST be independent
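
The rules for linear transformations and for combining independent random variables can be sketched numerically. All parameter values below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

# Hypothetical parameters (made up for illustration)
mu_x, sigma_x = 50.0, 4.0
mu_y, sigma_y = 30.0, 3.0

# Linear transformation a + bX with a = 10, b = -2
a, b = 10.0, -2.0
mu_lin = a + b * mu_x           # the mean is shifted by a and scaled by b
sigma_lin = abs(b) * sigma_x    # the SD is scaled by |b| only; adding a has no effect

# Difference X - Y, assuming X and Y are independent
mu_diff = mu_x - mu_y
sigma_diff = math.sqrt(sigma_x ** 2 + sigma_y ** 2)  # variances ADD even for X - Y

print(mu_lin, sigma_lin)    # -90.0 8.0
print(mu_diff, sigma_diff)  # 20.0 5.0
```

Note that the standard deviations themselves are never added or subtracted directly: 4 + 3 = 7 would overstate the spread of X - Y, which is 5.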

Z-Score

Is a standardized score. This tells you how many standard deviations from the mean an observation is. It creates a standard normal curve consisting of z-scores with a μ = 0 & σ = 1.
z = \frac{x - μ}{σ}

Normal Curve

Is a bell-shaped and symmetrical curve.
As σ increases the curve flattens.
As σ decreases the curve thins.

Empirical Rule (68-95-99.7)

Measures 1σ, 2σ, and 3σ on normal curves from a center of μ.
68% of the population is between -1σ and 1σ
95% of the population is between -2σ and 2σ
99.7% of the population is between -3σ and 3σ
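
The 68-95-99.7 percentages can be checked with a short Python sketch. The helper names `normal_cdf` and `z_score` are illustrative, not calculator functions; the normal area is computed from the standard-library error function `math.erf`.

```python
import math

def normal_cdf(z):
    """Area to the left of z under the standard normal curve (mean 0, SD 1)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Area within k standard deviations of the mean, for k = 1, 2, 3
areas = {k: normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"within {k} SD: {area:.4f}")

# Hypothetical example: x = 130 from a distribution with mu = 100, sigma = 15
print(z_score(130, 100, 15))  # 2.0
```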

Boxplots

Used for medium or large numerical data sets. A boxplot does not show the original observations. Always use modified boxplots where the fences are 1.5 IQRs from the ends of the box (Q1 & Q3). Points outside the fences are considered outliers. Whiskers extend to the smallest & largest observations within the fences.

5-Number Summary

Minimum, Q1 (1st Quartile – 25th Percentile), Median, Q3 (3rd Quartile – 75th Percentile), Maximum
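
The 5-number summary and the 1.5 IQR fences for a modified boxplot can be sketched as follows. The data are made up, and the quartile convention is an assumption: Q1 and Q3 are taken as the medians of the lower and upper halves, leaving out the overall median when n is odd. Other software may use a different quartile rule and give slightly different values.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]        # values below the median position
    upper = xs[n - half:]    # values above the median position
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

# Hypothetical data set (made up for illustration)
data = [2, 4, 5, 7, 8, 9, 10, 12, 30]

minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(minimum, q1, med, q3, maximum)   # 2 4.5 8 11.0 30
print(iqr, lower_fence, upper_fence)   # 6.5 -5.25 20.75
print(outliers)                        # [30]
```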

Probability Rules

Sample Space: Is the collection of all outcomes.
Event: Any collection (subset) of outcomes from the sample space.
Complement: All outcomes not in the event.
Union: A or B, all the outcomes in either circle (including the overlap). A ∪ B
Intersection: A and B, happening in the middle of A and B. A ∩ B
Mutually Exclusive (Disjoint): A and B have no intersection. They cannot happen at the same time.
Independent: Knowing that one event has occurred does not change the probability of the other.
Experimental Probability: Is the number of successes from an experiment divided by the total number of trials in the experiment.
Law of Large Numbers: As an experiment is repeated the experimental probability gets closer and closer to the true (theoretical) probability. The difference between the two probabilities will approach “0”.

Rules of Probabilities

All values are 0 ≤ P ≤ 1.
Probability of sample space is 1.
Complement: P(A) + P(\text{not } A) = 1, so P(\text{not } A) = 1 - P(A)
Addition P(A \text{ or } B) = P(A) + P(B) – P(A \& B)
Multiplication P(A \& B) = P(A) ⋅ P(B) if A & B are independent
P (\text{at least 1 or more}) = 1 – P (\text{none})
Conditional Probability – takes into account a certain condition.
- P(A|B) = \frac{P(A \& B)}{P(B)} = \frac{P(\text{both})}{P(\text{given})}
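
A small worked example of the addition rule, conditional probability, and the independence check, using exact fractions. The counts below are a hypothetical two-way table, not data from the notes.

```python
from fractions import Fraction

# Hypothetical counts for 200 individuals (made up for illustration)
both, a_only, b_only, neither = 30, 20, 45, 105
total = both + a_only + b_only + neither

p_a = Fraction(both + a_only, total)   # P(A)
p_b = Fraction(both + b_only, total)   # P(B)
p_a_and_b = Fraction(both, total)      # P(A & B)

# Addition rule: subtract the overlap so it is not counted twice
p_a_or_b = p_a + p_b - p_a_and_b

# Conditional probability: P(A | B) = P(both) / P(given)
p_a_given_b = p_a_and_b / p_b

# A and B are independent only if P(A | B) equals P(A)
independent = (p_a_given_b == p_a)

print(p_a, p_b, p_a_and_b)        # 1/4 3/8 3/20
print(p_a_or_b, p_a_given_b)      # 19/40 2/5
print(independent)                # False
```

Here P(A | B) = 2/5 differs from P(A) = 1/4, so multiplying P(A) ⋅ P(B) would give the wrong value for P(A & B).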

Correlation Coefficient (r)

Is a quantitative assessment of the strength and direction of a linear relationship. (use ρ (rho) for population parameter)
Values – [-1, 1] 0 – no correlation, (0, ±0.5) – weak, [±0.5, ±0.8) – moderate, [±0.8, ±1] - strong

Least Squares Regression Line (LSRL)

Is a line of mathematical best fit. Minimizes the sum of the squared deviations (residuals) from the line. Used with bivariate data.
\hat{y} = a + bx .
- x is independent, the explanatory variable & y is dependent, the response variable
Residuals (error): The vertical difference between a point and the LSRL. All residuals sum to “0”.
- Residual = y - \hat{y}
Residual Plot: A scatterplot of (x \text{ (or } \hat{y} ), residual). No pattern indicates that a linear model is appropriate.
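
A minimal sketch of fitting the LSRL by hand and checking that the residuals sum to zero. The data are made up, and the slope and intercept use the standard least-squares formulas b = S_xy / S_xx and a = \bar{y} - b\bar{x}, which are not written out in the notes above.

```python
# Hypothetical bivariate data (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Predicted values and residuals (observed minus predicted)
y_hat = [a + b * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

print(f"y-hat = {a:.1f} + {b:.1f}x")
print([round(r, 2) for r in residuals])
print(round(sum(residuals), 10))
```

Plotting `residuals` against `x` would give the residual plot described above.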

Coefficient of Determination (r^2)

Gives the proportion of variation in y (response) that is explained by the relationship of (x, y). Never use the adjusted r^2.
Interpretations: must be in context!
- Slope (b): For each one-unit increase in x, the predicted y increases/decreases by the slope amount.
- Correlation coefficient (r): There is a [strength], [direction], linear association between x & y.
- Coefficient of determination (r^2): Approximately r^2 (as a %) of the variation in y can be explained by the LSRL of y on x.
Extrapolation: The LSRL should not be used to predict values outside of the range of the original data.
Influential Points: Are points that if removed significantly change the LSRL.
Outliers: Are points with large residuals.

Census

A complete count of the population. Why not to use a census?
- Expensive
- Impossible to do
- If destructive sampling you get extinction

Sampling Frame

Is a list of everyone in the population.

Sampling Design

Refers to the method used to choose a sample.
- SRS (Simple Random Sample): One chooses so that each unit has an equal chance and every set of units has an equal chance of being selected.
  - Advantages: easy and unbiased
  - Disadvantages: large σ^2 and must know population.
- Stratified: Divide the population into homogeneous groups called strata, then take an SRS from each stratum.
  - Advantages: more precise than an SRS and cost reduced if strata already available.
  - Disadvantages: difficult to divide into groups, more complex formulas & must know population.
- Systematic: Use a systematic approach (every 50th) after choosing randomly where to begin.
  - Advantages: unbiased, the sample is evenly distributed across population & don’t need to know population.
  - Disadvantages: a large σ^2 and can be confounded by trends.
- Cluster Sample: Based on location. Select a random location and sample ALL at that location.
  - Advantages: cost is reduced, is unbiased & don’t need to know population.
  - Disadvantages: May not be representative of population and has complex formulas.
- Random Digit Table: Each entry is equally likely and each digit is independent of the rest.
- Random # Generator: Calculator or computer program

Bias

Error – favors a certain outcome, has to do with center of sampling distributions – if centered over true parameter then considered unbiased

Sources of Bias

Voluntary Response: People choose themselves to participate.
Convenience Sampling: Ask people who are easy to reach, friendly, or comfortable to ask.
Undercoverage: Some group(s) are left out of the selection process.
Non-response: Someone cannot or does not want to be contacted or participate.
Response: False answers – can be caused by a variety of things
Wording of the Questions: Leading questions.

Experimental Design

Observational Study: Observe outcomes without giving a treatment.
Experiment: Actively imposes a treatment on the subjects.
Experimental Unit: Single individual or object that receives a treatment.
Factor: Is the explanatory variable, what is being tested
Level: A specific value for the factor.
Response Variable: What you are measuring with the experiment.
Treatment: Experimental condition applied to each unit.
Control Group: A group used to compare the factor to for effectiveness – does NOT have to be placebo
Placebo: A treatment with no active ingredients (provides control).
Blinding: A method used so that the subjects are unaware of the treatment (who gets a placebo or the real treatment).
Double Blinding: Neither the subjects nor the evaluators know which treatment is being given.

Principles of Experimental Design

Control: Keep all extraneous variables (not being tested) constant
Replication: Uses many subjects to quantify the natural variation in the response.
Randomization: Uses chance to assign the subjects to the treatments.
The only way to show cause and effect is with a well-designed, well-controlled experiment.

Experimental Designs

Completely Randomized: All units are randomly allocated among the treatments
Randomized Block: Units are blocked and then randomly assigned in each block – reduces variation
Matched Pairs: Units are matched up by characteristics and then randomly assigned. Once one member of a pair receives a certain treatment, the other member automatically receives the second treatment. OR individuals do both treatments in random order (before/after or pretest/post-test). Assignment is dependent
Confounding Variables: Are where the effect of the variable on the response cannot be separated from the effects of the factor being tested – happens in observational studies – when you use random assignment to treatments you do NOT have confounding variables!
Randomization: Reduces bias by spreading extraneous variables to all groups in the experiment.
Blocking: Helps reduce variability. Another way to reduce variability is to increase sample size.

Random Variable

A numerical value that depends on the outcome of an experiment.
Discrete: A count of a random variable
Continuous: A measure of a random variable

Discrete Probability Distributions

Gives values & probabilities associated with each possible x.
μ_X = \sum x_i p(x_i)
σ_X^2 = \sum (x_i - μ_X)^2 p(x_i)
Calculator shortcut – 1 VARSTAT L1,L2
Fair Game: A game in which the expected pay-out equals the cost to play (the expected net winnings are 0).
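
The mean and variance formulas for a discrete probability distribution can be sketched as below. The distribution is hypothetical, chosen so that the probabilities sum to 1 and the results come out cleanly.

```python
# Hypothetical discrete distribution (made up for illustration)
values = [0, 1, 2, 3]
probs = [0.1, 0.3, 0.4, 0.2]

# A valid probability distribution must have probabilities that sum to 1
total_prob = sum(probs)

# Mean (expected value): sum of x * p(x)
mu = sum(x * p for x, p in zip(values, probs))

# Variance: sum of (x - mu)^2 * p(x); the SD is its square root
variance = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
sd = variance ** 0.5

print(round(total_prob, 10), round(mu, 4), round(variance, 4), round(sd, 4))
```

For this distribution the mean is 1.7 and the standard deviation is 0.9.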

Special Discrete Distributions

Binomial Distributions
- Properties: two mutually exclusive outcomes, fixed number of trials (n), each trial is independent, the probability (p) of success is the same for all trials.
- Random variable - is the number of successes out of a fixed # of trials. Starts at X = 0 and is finite.
- μ_X = np
- σ_X = \sqrt{npq}, where q = 1 - p
- Calculator:
  - binompdf (n, p, x) = single outcome P(X = x)
  - binomcdf (n, p, x) = cumulative outcome P(X ≤ x)
  - 1 - binomcdf (n, p, (x - 1)) = cumulative outcome P(X ≥ x)
Geometric Distributions
- Properties - two mutually exclusive outcomes, each trial is independent, probability (p) of success is the same for all trials. (NOT a fixed number of trials)
- Random Variable – the trial on which the FIRST success occurs. Starts at X = 1 and has no upper limit (∞).
- Calculator:
  - geometpdf (p, a) = single outcome P(X = a)
  - geometcdf (p, a) = cumulative outcomes P(X ≤ a)
  - 1 - geometcdf (p, (a - 1)) = cumulative outcome P(X ≥ a)
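
The calculator commands for both distributions can be mimicked in Python to see exactly which outcomes each one covers. The function names below are illustrative stand-ins, not calculator or library functions, and the values of n and p are made up. A cumulative function includes the endpoint, so it gives P(X ≤ x), and one minus the cumulative value at x - 1 gives P(X ≥ x).

```python
from math import comb

def binom_pmf(n, p, x):
    """P(X = x) for a binomial random variable."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(n, p, x):
    """P(X <= x): adds the probabilities for 0, 1, ..., x."""
    return sum(binom_pmf(n, p, k) for k in range(x + 1))

def geom_pmf(p, a):
    """P(first success occurs on trial a): a - 1 failures, then a success."""
    return (1 - p) ** (a - 1) * p

def geom_cdf(p, a):
    """P(first success occurs on or before trial a)."""
    return 1 - (1 - p) ** a

# Hypothetical binomial setting: n = 10 trials, p = 0.5
print(binom_pmf(10, 0.5, 3))       # P(X = 3)
print(binom_cdf(10, 0.5, 3))       # P(X <= 3)
print(1 - binom_cdf(10, 0.5, 3))   # P(X >= 4)

# Hypothetical geometric setting: p = 0.2
print(geom_pmf(0.2, 3))            # P(X = 3)
print(geom_cdf(0.2, 3))            # P(X <= 3)
print(1 - geom_cdf(0.2, 3))        # P(X >= 4)
```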

Continuous Random Variable

Numerical values that fall within a range or interval (measurements), use density curves where the area under the curve always = 1. To find probabilities, find area under the curve
Unusual Density Curves: Any shape (triangles, etc.)
Uniform Distributions: Uniformly (evenly) distributed, shape of a rectangle
Normal Distributions: Symmetrical, unimodal, bell-shaped curves defined by the parameters μ & σ
- Calculator:
  - Normalpdf – used for graphing only
  - Normalcdf(lower bound, upper bound, μ, σ) – finds probability
  - InvNorm(p) – z-score OR InvNorm(p, μ, σ) – gives x-value
To assess Normality - Use graphs – dotplots, boxplots, histograms, or normal probability plot.

Distribution

Is all of the values of a random variable.

Sampling Distribution

The sampling distribution of a statistic is the distribution of the values the statistic takes over all possible samples of the same size. Use normalcdf to calculate probabilities – be sure to use correct SD.
μ_{\bar{X}} = μ_X
σ_{\bar{X}} = \frac{σ}{\sqrt{n}}
μ_{\hat{p}} = p
σ_{\hat{p}} = \sqrt{\frac{pq}{n}}
μ_{\bar{X}_1 - \bar{X}_2} = μ_{\bar{X}_1} - μ_{\bar{X}_2}
σ_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{σ_1^2}{n_1} + \frac{σ_2^2}{n_2}}
μ_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2
σ_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}
μ_b = β
s_b: standard error of the slope of the LSRL (do not need to find; usually given in computer printout)
Standard error: Estimate of the standard deviation of the statistic
Central Limit Theorem: When n is sufficiently large (n ≥ 30), the sampling distribution of the sample mean is approximately normal even if the population distribution is not normal.
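
A small simulation can illustrate the formulas for the sampling distribution of the sample mean. Everything here is an assumption made for illustration: the population is Uniform(0, 1), which is not normal and has μ = 0.5 and σ = \sqrt{1/12}; the sample size is 25; and 2000 samples are drawn with a fixed seed so the run is repeatable.

```python
import random
import statistics

random.seed(1)       # fixed seed so repeated runs give the same numbers

n = 25               # size of each sample
num_samples = 2000   # number of samples drawn

# Population: Uniform(0, 1), a rectangle rather than a bell curve
mu = 0.5
sigma = (1 / 12) ** 0.5

# Draw many samples and record the mean of each one
sample_means = [
    statistics.mean([random.random() for _ in range(n)])
    for _ in range(num_samples)
]

theoretical_sd = sigma / n ** 0.5            # sigma / sqrt(n)
simulated_center = statistics.mean(sample_means)
simulated_sd = statistics.stdev(sample_means)

print("center of sample means:", round(simulated_center, 4), "vs mu =", mu)
print("SD of sample means:", round(simulated_sd, 4),
      "vs sigma/sqrt(n) =", round(theoretical_sd, 4))
```

The simulated center and spread should land close to μ and σ/\sqrt{n}, and a histogram of `sample_means` would look roughly bell-shaped even though the population is flat.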

Confidence Intervals

Point Estimate: Uses a single statistic based on sample data, this is the simplest approach.
Confidence Intervals: Used to estimate the unknown population parameter.
Margin of Error: The smaller the margin of error, the more precise our estimate
Steps:
- Assumptions – see table below
- Calculations – C.I. = statistic ± (critical value)(standard deviation of the statistic)
- Conclusion – Write your statement in context.
  - We are [x]% confident that the true [parameter] of [context] is between [a] and [b].
What makes the margin of error smaller
- make critical value smaller (lower confidence level).
- get a sample with a smaller s.
- make n larger.
T distributions compared to standard normal curve
- centered around 0
- more spread out and shorter
- more area under the tails.
- when you increase n, t-curves become more normal.
- can be no outliers in the sample data
- Degrees of Freedom = n – 1
Robust: If the assumption of normality is not met, the confidence level or p-value does not change much – this is true of t-distributions because there is more area in the tails
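
The calculation step for a confidence interval can be sketched for one case, a one-sample z-interval for a proportion. The counts are hypothetical, and z* = 1.96 is the usual critical value for 95% confidence.

```python
import math

# Hypothetical sample: 224 successes out of n = 400 (made up for illustration)
successes, n = 224, 400
p_hat = successes / n

# Large-counts check: n * p_hat and n * (1 - p_hat) should both be at least 10
large_counts_ok = successes >= 10 and (n - successes) >= 10

z_star = 1.96  # critical value for 95% confidence
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = z_star * standard_error

lower = p_hat - margin_of_error
upper = p_hat + margin_of_error
print(f"95% CI: ({lower:.4f}, {upper:.4f})")

# Making n four times larger cuts the standard error (and the margin) in half
se_4n = math.sqrt(p_hat * (1 - p_hat) / (4 * n))
print(round(standard_error / se_4n, 6))
```

In context, the conclusion would read: we are 95% confident that the true proportion is between about 0.511 and 0.609.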

Hypothesis Tests

Hypothesis Testing: Tells us whether an observed result could plausibly have occurred by random chance alone. If it is unlikely to occur by random chance, then it is statistically significant.
Null Hypothesis: H_0 is the statement being tested. Null hypothesis should be “no effect”, “no difference”, or “no relationship”
Alternate Hypothesis: H_a is the statement suspected of being true.
P-Value: Assuming the null is true, the probability of obtaining the observed result or more extreme
Level of Significance: α is the amount of evidence necessary before rejecting the null hypothesis.
Steps:
- Assumptions – see table below
- Hypotheses - don’t forget to define parameter
- Calculations – find z or t test statistic & p-value
- Conclusion – Write your statement in context.
  - Since the p-value is < (>) α, I reject (fail to reject) the H0. There is (is not) sufficient evidence to suggest that [Ha].
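
The calculation and decision steps can be sketched for one case, a one-sample z-test for a proportion with a one-sided alternative. The hypotheses, counts, and α below are hypothetical, and `normal_cdf` is an illustrative helper built on `math.erf`.

```python
import math

def normal_cdf(z):
    """Area to the left of z under the standard normal curve."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical test: H0: p = 0.5 versus Ha: p > 0.5
p0 = 0.5
successes, n = 224, 400
alpha = 0.05

p_hat = successes / n

# The standard deviation uses p0 because the null is assumed true
sd_null = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / sd_null

# One-sided p-value: probability of a result this large or larger if H0 is true
p_value = 1 - normal_cdf(z)
reject_h0 = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Since the p-value (about 0.008) is less than α = 0.05, the conclusion would be to reject H0: there is sufficient evidence to suggest that the true proportion is greater than 0.5.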

Type I and II Errors and Power

Type I Error: Is when one rejects H0 when H0 is actually true. (probability is α)
Type II Error: Is when you fail to reject H0, and H0 is actually false. (probability is β)
α and β are inversely related. Consequences are the results of making a Type I or Type II error. Every decision has the possibility of making an error.
The Power of a Test: Is the probability that the test will reject the null hypothesis when the null hypothesis is actually false. Power = 1 – β
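Effect of increasing each quantity:
- Increase α: P(Type I error) increases, β decreases, power increases
- Increase n: P(Type I error) stays the same, β decreases, power increases
- Increase the distance (μ0 – μa): P(Type I error) stays the same, β decreases, power increases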

Chi-Square (χ2) Test

Is used to test counts of categorical data.
Types
- Goodness of Fit (univariate)
- Independence (bivariate)
- Homogeneity (univariate 2 (or more) samples)
χ2 distribution – All curves are skewed right, every df has a different curve, and as the degrees of freedom increase the χ2 curve becomes more normal.
Goodness of Fit: Is for univariate categorical data from a single sample. Does the observed count “fit” what we expect. Must use list to perform, df = number of the categories – 1, use χ2cdf (χ2, ∞, df) to calculate p-value
Independence: Bivariate categorical data from one sample. Are the two variables independent or dependent? Use matrices to calculate
Homogeneity: Single categorical variable from 2 (or more) samples. Are distributions homogeneous? Use matrices to calculate
For both χ2 tests of independence & homogeneity:
- Expected counts = \frac{(\text{row total}) (\text{column total})}{\text{grand total}} & df = (r – 1)(c – 1)
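
The expected-count formula and the χ^2 statistic for a two-way table can be sketched as follows. The observed counts are hypothetical, and the statistic is computed as the sum of (observed - expected)^2 / expected over all cells, the standard definition that the notes leave to the calculator.

```python
# Hypothetical observed counts: 2 rows by 3 columns (made up for illustration)
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell = (row total)(column total) / grand total
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Condition check from the assumptions: every expected count is greater than 5
counts_ok = all(e > 5 for row in expected for e in row)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)

print(expected)
print(round(chi_sq, 4), df, counts_ok)
```

The p-value would then be the area to the right of this statistic under the χ^2 curve with 2 degrees of freedom.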

Regression Model

X & Y have a linear relationship where the true LSRL is μ_y = α + βx
The responses (y) are normally distributed for a given x-value.
The standard deviation of the responses (σ_y) is the same for all values of x.
- s is the estimate for σ_y
Confidence Interval: b ± t^* s_b
Hypothesis Testing: t = \frac{b - β}{s_b}, where β is the slope stated in H_0 (usually 0)

Assumptions:

Proportions (z - procedures)
- One sample:
  - SRS from population
  - Can be approximated by normal distribution if n(p) & n(1 – p) > 10
  - Population size is at least 10n
- Two samples:
  - 2 independent SRS’s from populations (or randomly assigned treatments)
  - Can be approximated by normal distribution if n1(p1), n1(1 – p1), n2p2, & n2(1 – p2) > 10
  - Population sizes are at least 10n
Means (t - procedures)
- One sample:
  - SRS from population
  - Distribution is approximately normal
    - Given
    - Large sample size
    - Graph of data is approximately symmetrical and unimodal with no outliers
- Matched pairs:
  - SRS from population
  - Distribution of differences is approximately normal
    - Given
    - Large sample size
    - Graph of differences is approximately symmetrical and unimodal with no outliers
- Two samples:
  - 2 independent SRS’s from populations (or randomly assigned treatments)
  - Distributions are approximately normal
    - Given
    - Large sample sizes
    - Graphs of data are approximately symmetrical and unimodal with no outliers
Counts (χ^2 - procedures)
- All types:
  - Reasonably random sample(s)
  - All expected counts > 5
    - Must show expected counts
Bivariate Data (t – procedures on slope)
- SRS from population
- There is linear relationship between x & y.
- Residual plot has no pattern.
- The standard deviation of the responses is constant for all values of x.
- Points are scattered evenly across the LSRL in the scatterplot.
- The responses are approximately normally distributed.
- Graph of residuals is approximately symmetrical & unimodal with no outliers.