AP Statistics Cumulative AP Exam Study Guide Notes
Statistics
- The science of collecting, analyzing, and drawing conclusions from data.
- Descriptive: Methods of organizing and summarizing data.
- Inferential: Making generalizations from a sample to the population.
Population
- An entire collection of individuals or objects.
Sample
- A subset of the population selected for study.
Variable
- Any characteristic whose value changes.
Data
- Observations on one or more variables.
Variables
- Categorical (Qualitative): Basic characteristics.
- Numerical (Quantitative): Measurements or observations of numerical data.
- Discrete: Listable sets (counts).
- Continuous: Any value over an interval of values (measurements).
Variable Types
- Univariate: One variable.
- Bivariate: Two variables.
- Multivariate: Many variables.
Distributions
- Symmetrical: Data on which both sides are fairly the same shape and size. "Bell Curve"
- Uniform: Every class has an equal frequency (number). "A rectangle"
- Skewed: One side (tail) is longer than the other side. The skewness is in the direction that the tail points (left or right).
- Bimodal: Data of two or more classes have large frequencies separated by another class between them. "Double hump camel"
Describing Numerical Graphs (S.O.C.S.)
- Shape: Overall type (symmetrical, skewed right/left, uniform, or bimodal).
- Outliers: Unusual values, gaps, clusters, etc.
- Center: Middle of the data (mean, median, and mode).
- Spread: Refers to variability (range, standard deviation, and IQR).
- Everything must be in context to the data and situation of the graph.
- When comparing two distributions, MUST use comparative language!
Parameter vs. Statistic
- Parameter: Value of a population (typically unknown).
- Statistic: A value calculated from a sample, used to estimate the population parameter.
Measures of Center
- Median: The middle point of the data (50th percentile) when the data is in numerical order. If there are an even number of observations, average the two middle values.
- Mean: μ is for a population (parameter) and \bar{x} is for a sample (statistic).
- Mode: Occurs the most in the data. There can be more than one mode, or no mode at all if all data points occur once.
- Variability allows statisticians to distinguish between usual and unusual occurrences.
Measures of Spread (Variability)
- Range: A single value (Max - Min).
- IQR (Interquartile Range): (Q3 - Q1).
- Standard Deviation: σ for population (parameter) & s for sample (statistic). Measures the typical or average deviation of observations from the mean. Sample standard deviation is divided by df = n - 1.
- The sum of the deviations from the mean is always zero!
- Variance: Standard deviation squared.
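The measures of center and spread above can be computed with Python's standard library; the data values below are made up for illustration. Note how `stdev` (sample, divides by n - 1) differs from `pstdev` (population, divides by n):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)       # 40 / 8 = 5.0
median = statistics.median(data)   # average of middle values 4 and 5 = 4.5
pop_sd = statistics.pstdev(data)   # population sigma: divides by n
samp_sd = statistics.stdev(data)   # sample s: divides by df = n - 1

print(mean, median, pop_sd, samp_sd)
```

Because the sample version divides by the smaller n - 1, `samp_sd` is always a bit larger than `pop_sd` for the same data.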
Resistant Measures
- Resistant: Not affected by outliers (e.g., median, IQR).
- Non-Resistant (affected by outliers):
- Mean
- Range
- Variance
- Standard Deviation
- Correlation Coefficient (r)
- Least Squares Regression Line (LSRL)
- Coefficient of Determination r^2
- Symmetrical: Mean and the median are the same value.
- Skewed Right: Mean is a larger value than the median.
- Skewed Left: The mean is smaller than the median.
- The mean is always pulled in the direction of the skew away from the median.
Trimmed Mean
- Use a % to take observations away from the top and bottom of the ordered data. This possibly eliminates outliers.
Linear Transformations
- Y = a + bX
- The mean is changed by both addition (subtraction) & multiplication (division): \mu_Y = a + b\mu_X.
- The standard deviation is changed by multiplication (division) ONLY: \sigma_Y = |b|\sigma_X.
Combination of Two (or More) Random Variables
- E(X ± Y) = E(X) ± E(Y)
- Just add or subtract the two (or more) means.
- Always add the variances: Var(X ± Y) = Var(X) + Var(Y). X & Y MUST be independent.
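The transformation and combination rules above can be checked directly on small discrete distributions (the values and probabilities here are invented for illustration):

```python
import itertools

# Hypothetical discrete distributions: {value: probability}
X = {1: 0.5, 3: 0.5}   # E(X) = 2, Var(X) = 1
Y = {0: 0.2, 5: 0.8}   # E(Y) = 4, Var(Y) = 4

def E(dist):
    return sum(x * p for x, p in dist.items())

def Var(dist):
    m = E(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())

# Linear transformation 10 + 2X: mean shifts and scales, variance only scales.
a, b = 10, 2
T = {a + b * x: p for x, p in X.items()}
assert E(T) == a + b * E(X)        # 10 + 2*2 = 14
assert Var(T) == b ** 2 * Var(X)   # 2^2 * 1 = 4

# Sum of independent X and Y: means add, variances add.
S = {}
for (x, px), (y, py) in itertools.product(X.items(), Y.items()):
    S[x + y] = S.get(x + y, 0) + px * py
assert abs(E(S) - (E(X) + E(Y))) < 1e-9    # 2 + 4 = 6
assert abs(Var(S) - (Var(X) + Var(Y))) < 1e-9  # 1 + 4 = 5
```

The variance check in the last line only works because X and Y were combined as independent variables; for dependent variables the variances do not simply add.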
Z-Score
- Is a standardized score. This tells you how many standard deviations from the mean an observation is.
- It creates a standard normal curve consisting of z-scores with \mu = 0 & σ = 1.
Normal Curve
- Is a bell-shaped and symmetrical curve.
- As σ increases the curve flattens. As σ decreases the curve thins.
Empirical Rule (68-95-99.7)
- Measures 1σ, 2σ, and 3σ on normal curves from a center of \mu.
- 68% of the population is between -1σ and 1σ
- 95% of the population is between -2σ and 2σ
- 99.7% of the population is between -3σ and 3σ
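The 68-95-99.7 figures are rounded; the exact areas can be checked with `statistics.NormalDist`:

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)   # standard normal curve

for k, rule in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    area = Z.cdf(k) - Z.cdf(-k)   # area between -k sigma and +k sigma
    print(f"within {k} sigma: {area:.4f} (rule says about {rule})")
```

The exact areas (about 0.6827, 0.9545, and 0.9973) show why the Empirical Rule is a good approximation for any normal curve.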
Boxplots
- Are for medium or large numerical data sets. A boxplot does not show the original observations.
- Always use modified boxplots where the fences are 1.5 IQRs from the ends of the box (Q1 & Q3).
- Points outside the fence are considered outliers.
- Whiskers extend to the smallest & largest observations within the fences.
5-Number Summary
- Minimum, Q1 (1st Quartile - 25th Percentile), Median, Q3 (3rd Quartile - 75th Percentile), Maximum
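A 5-number summary can be sketched with `statistics.quantiles` (the data are made up; note that quartile conventions differ between calculators and software, so the `method="inclusive"` choice here is one of several):

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15, 17]

q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, med, q3, max(data))
iqr = q3 - q1

# Modified boxplot fences: 1.5 IQRs beyond Q1 and Q3.
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outliers = [x for x in data if x < fences[0] or x > fences[1]]
print(five_number, iqr, outliers)
```

For this data set the fences are well outside the observations, so the outlier list is empty.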
Probability Rules
- Sample Space: Is a collection of all outcomes.
- Event: Any subset of outcomes from the sample space.
- Complement: All outcomes not in the event.
- Union: A or B; all outcomes in either event (or both). A \cup B
- Intersection: A and B; outcomes that are in both events at once. A \cap B
- Mutually Exclusive (Disjoint): A and B have no intersection. They cannot happen at the same time.
- Independent: If knowing one event does not change the outcome of another.
- Experimental Probability: The number of successes from an experiment divided by the total number of trials.
- Law of Large Numbers: As an experiment is repeated the experimental probability gets closer and closer to the true (theoretical) probability. The difference between the two probabilities will approach 0.
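The Law of Large Numbers can be sketched with a seeded coin-flip simulation (the seed is arbitrary, chosen only to make the run reproducible):

```python
import random

random.seed(1)        # reproducible illustration
p_true = 0.5          # theoretical probability of heads for a fair coin

heads = 0
flips = 0
for n in (100, 10_000, 1_000_000):
    while flips < n:
        heads += random.random() < p_true   # True counts as 1
        flips += 1
    # gap between experimental and theoretical probability;
    # it typically shrinks as n grows
    print(n, abs(heads / n - p_true))
```

Any single run can wobble, but the gap at n = 1,000,000 is reliably far smaller than the gap at n = 100.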
Rules of Probability
- All values are 0 \le P \le 1.
- Probability of sample space is 1.
- Complement: P(A^c) = 1 - P(A)
- Addition: P(A \cup B) = P(A) + P(B) - P(A \cap B)
- Multiplication: P(A \cap B) = P(A) \cdot P(B) if A & B are independent. In general, P(A \cap B) = P(A) \cdot P(B | A).
- P(\text{at least one}) = 1 - P(\text{none})
Conditional Probability
- Takes into account a certain condition.
- P(A | B) = \frac{P(A \cap B)}{P(B)}
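The conditional probability formula can be sketched from a two-way table of counts (the counts below are hypothetical):

```python
# Hypothetical two-way counts: (grade level, pet ownership)
counts = {("10th", "pet"): 30, ("10th", "no pet"): 20,
          ("11th", "pet"): 25, ("11th", "no pet"): 25}

total = sum(counts.values())                        # 100 students
p_pet = (counts[("10th", "pet")] + counts[("11th", "pet")]) / total
p_10th_and_pet = counts[("10th", "pet")] / total

# P(10th | pet) = P(10th and pet) / P(pet)
p_10th_given_pet = p_10th_and_pet / p_pet
print(p_10th_given_pet)   # 0.30 / 0.55, about 0.545
```

Conditioning on "pet" shrinks the sample space from all 100 students to the 55 pet owners, which is exactly what dividing by P(pet) does.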
Correlation Coefficient (r)
- Is a quantitative assessment of the strength and direction of a linear relationship. (use ρ (rho) for population parameter)
- Values: [-1, 1]
- 0 = no correlation
- (0, ±0.5) = weak
- [±0.5, ±0.8) = moderate
- [±0.8, ±1] = strong
Least Squares Regression Line (LSRL)
- Is a line of mathematical best fit. Minimizes the deviations (residuals) from the line. Used with bivariate data.
- \hat{y} = a + bx
- x: independent, the explanatory variable & y: dependent, the response variable
Residuals (error)
- Is the vertical difference of a point from the LSRL. The residuals always sum to 0.
- Residual = y - \hat{y}
Residual Plot
- A scatterplot of (x (or \hat{y}), residual). No pattern indicates a linear relationship.
Coefficient of Determination (r^2)
- Gives the proportion of variation in y (response) that is explained by the relationship of (x, y). Never use the adjusted r^2
Interpretations (Must be in context!)
- Slope (b): For each unit increase in x, the predicted y increases/decreases by [slope] on average.
- Correlation coefficient (r): There is a strength, direction, linear association between x & y.
- Coefficient of determination (r^2): Approximately % of the variation in y can be explained by the LSRL of x and y.
- Extrapolation: The LSRL cannot be used to predict values outside the range of the original x data.
- Influential Points: Are points that if removed significantly change the LSRL.
- Outliers: Are points with large residuals.
Census
- A complete count of the population.
Why Not to Use a Census?
- Expensive
- Impossible to do
- Destructive sampling would destroy the entire population being studied.
Sampling Frame
- Is a list of everyone in the population.
Sampling Design
- Refers to the method used to choose a sample.
SRS (Simple Random Sample)
- Each unit has an equal chance of selection, and every set of n units has an equal chance of being selected.
- Advantages: Easy and unbiased
- Disadvantages: Large σ^2 and must know population.
Stratified
- Divide the population into homogeneous groups called strata, then SRS each strata.
- Advantages: More precise than an SRS and cost reduced if strata already available.
- Disadvantages: Difficult to divide into groups, more complex formulas & must know population.
Systematic
- Use a systematic approach (every 50th) after choosing randomly where to begin.
- Advantages: Unbiased, the sample is evenly distributed across population & don't need to know population.
- Disadvantages: A large σ^2 and can be confounded by trends.
Cluster Sample
- Based on location. Select a random location and sample ALL at that location.
- Advantages: Cost is reduced, is unbiased & don't need to know population.
- Disadvantages: May not be representative of the population and has complex formulas.
Random Digit Table
- Each entry is equally likely and each digit is independent of the rest.
Random Generator
- Calculator or computer program
Bias (Error)
- Favors a certain outcome; bias has to do with the center of the sampling distribution. If the distribution is centered over the true parameter, the statistic is unbiased.
Sources of Bias
- Voluntary Response: People choose themselves to participate.
- Convenience Sampling: Ask people who are easy, friendly, or comfortable asking.
- Undercoverage: Some group(s) are left out of the selection process.
- Non-response: Someone cannot or does not want to be contacted or participate.
- Response: False answers - can be caused by a variety of things
- Wording of the Questions: Leading questions.
Experimental Design
- Observational Study: Observe outcomes without giving a treatment.
- Experiment: Actively imposes a treatment on the subjects.
- Experimental Unit: Single individual or object that receives a treatment.
- Factor: Is the explanatory variable, what is being tested
- Level: A specific value for the factor.
- Response Variable: What you are measuring with the experiment.
- Treatment: Experimental condition applied to each unit.
- Control Group: A group used to compare the factor to for effectiveness. Does NOT have to be placebo
- Placebo: A treatment with no active ingredients (provides control).
- Blinding: A method used so that the subjects are unaware of the treatment (who gets a placebo or the real treatment).
- Double Blinding: Neither the subjects nor the evaluators know which treatment is being given.
Principles of Experimental Design
- Control: Keep all extraneous variables (not being tested) constant
- Replication: Uses many subjects to quantify the natural variation in the response.
- Randomization: Uses chance to assign the subjects to the treatments.
- The only way to show cause and effect is with a well-designed, well-controlled experiment.
Experimental Designs
Completely Randomized
- All units are randomly assigned among all of the treatments.
Randomized Block
- Units are blocked and then randomly assigned in each block. Reduces variation
Matched Pairs
- Units are paired by matching characteristics, then treatments are randomly assigned within each pair: once one member of a pair receives a treatment, the other member automatically receives the other treatment. OR individuals receive both treatments in random order (before/after or pretest/post-test). Assignment is dependent.
Confounding Variables
- Are where the effect of the variable on the response cannot be separated from the effects of the factor being tested. Happens in observational studies. When you use random assignment to treatments you do NOT have confounding variables!
- Randomization reduces bias by spreading extraneous variables to all groups in the experiment.
- Blocking helps reduce variability.
- Another way to reduce variability is to increase sample size.
Random Variable
- A numerical value that depends on the outcome of an experiment.
- Discrete: A count of a random variable
- Continuous: A measure of a random variable
Discrete Probability Distributions
- Gives values & probabilities associated with each possible x.
- \mu_X = \sum x_i \, p(x_i)
- \sigma_X = \sqrt{\sum (x_i - \mu_X)^2 \, p(x_i)}
- Calculator shortcut: 1-Var Stats L1, L2
- Fair Game: A fair game is one in which all pay-ins equal all pay-outs.
Special Discrete Distributions
Binomial Distributions
- Properties: Two mutually exclusive outcomes, fixed number of trials (n), each trial is independent, the probability (p) of success is the same for all trials.
- Random variable: Is the number of successes out of a fixed number of trials. Starts at X = 0 and is finite.
- Calculator:
- binomialpdf(n, p, x) = P(X = x), a single outcome
- binomialcdf(n, p, x) = P(X \le x), cumulative
- 1 - binomialcdf(n, p, x - 1) = P(X \ge x), cumulative
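The calculator commands can be mirrored in a few lines of Python, which makes the formula P(X = x) = C(n, x) p^x (1-p)^(n-x) explicit (the n = 10, p = 0.3 values are arbitrary examples):

```python
from math import comb

def binom_pdf(n, p, x):
    """P(X = x) for a Binomial(n, p) random variable."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(n, p, x):
    """P(X <= x): sum of the pdf from 0 up to x."""
    return sum(binom_pdf(n, p, k) for k in range(x + 1))

# Example: 10 trials, p = 0.3
print(binom_pdf(10, 0.3, 3))        # P(X = 3)
print(binom_cdf(10, 0.3, 3))        # P(X <= 3)
print(1 - binom_cdf(10, 0.3, 2))    # P(X >= 3)
```

The last line is the complement trick from the notes: subtracting the cdf at x - 1 flips a "less than" probability into an "at least" probability.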
Geometric Distributions
- Properties: Two mutually exclusive outcomes, each trial is independent, probability (p) of success is the same for all trials. (NOT a fixed number of trials)
- Random Variable: When the FIRST success occurs. Starts at 1 and is \infty.
- Calculator:
- geometricpdf(p, a) = P(X = a), a single outcome
- geometriccdf(p, a) = P(X \le a), cumulative
- 1 - geometriccdf(p, a - 1) = P(X \ge a), cumulative
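The geometric formulas are simple enough to write out directly: the first success on trial a requires a - 1 failures followed by one success (p = 0.2 below is an arbitrary example):

```python
def geom_pdf(p, a):
    """P(X = a): first success occurs on trial a."""
    return (1 - p) ** (a - 1) * p

def geom_cdf(p, a):
    """P(X <= a): success occurs within the first a trials."""
    return 1 - (1 - p) ** a

# Example: p = 0.2
print(geom_pdf(0.2, 3))   # P(first success on trial 3)
print(geom_cdf(0.2, 3))   # P(success within 3 trials)
```

The cdf has a closed form because "success within a trials" is the complement of "a straight failures."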
Continuous Random Variable
- Numerical values that fall within a range or interval (measurements), use density curves where the area under the curve always = 1. To find probabilities, find area under the curve
Unusual Density Curves
- Any shape (triangles, etc.)
- Uniformly (evenly) distributed, shape of a rectangle
Normal Distributions
- Symmetrical, unimodal, bell-shaped curves defined by the parameters \mu & σ.
- Calculator:
- normalpdf: used for graphing only
- normalcdf(lower bound, upper bound, \mu, σ): finds the probability
- invNorm(p): gives a z-score; invNorm(p, \mu, σ): gives an x-value
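The normalcdf and invNorm commands have direct analogues in `statistics.NormalDist` (the mean and SD below are hypothetical example values):

```python
from statistics import NormalDist

heights = NormalDist(mu=64, sigma=2.5)   # hypothetical heights in inches

# normalcdf(lower, upper, mu, sigma): area between two bounds
p = heights.cdf(66.5) - heights.cdf(61.5)   # P(61.5 < X < 66.5)

# invNorm(p, mu, sigma): x-value with the given area to its left
cutoff = heights.inv_cdf(0.90)              # 90th percentile

print(p, cutoff)
```

Since 61.5 and 66.5 are exactly one SD from the mean, `p` comes out to about 0.68, matching the Empirical Rule.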
Assessing Normality
- Use graphs— dotplots, boxplots, histograms, or normal probability plot.
Distribution Types
- Distribution: Is all of the values of a random variable.
- Sampling Distribution: The distribution of a statistic's values over all possible samples of a given size.
- Use normalcdf to calculate probabilities - be sure to use correct SD
- \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}, \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}, \quad \mu_{\hat{p}} = p, \quad \mu_{\bar{x}} = \mu
Standard Error Calculation
- Standard error is the estimate of the standard deviation of a statistic. You usually do not need to compute it by hand; it is given in a computer printout.
- SE_{\bar{x}} = \frac{s}{\sqrt{n}} (standard deviation of the sample means)
- SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} (standard deviation of the sample proportions)
- SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} (standard deviation of the difference in sample means)
- SE_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} (standard deviation of the difference in sample proportions)
- SE_b (standard error of the slope of the LSRL; read it from the computer printout)
Central Limit Theorem
- When n is sufficiently large (n > 30) the sampling distribution is approximately normal even if the population distribution is not normal.
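The CLT can be sketched by sampling from a strongly skewed population; here an exponential distribution with mean 1 and SD 1 (the seed and sample sizes are arbitrary illustration choices):

```python
import random
import statistics

random.seed(7)   # reproducible illustration

n = 40           # each sample's size (comfortably above 30)
# 5000 sample means from an exponential(1) population (skewed right)
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(5000)]

print(statistics.mean(means))    # close to the population mean, 1
print(statistics.stdev(means))   # close to sigma / sqrt(n) = 1 / sqrt(40)
```

Even though each individual draw comes from a skewed distribution, a histogram of `means` would look approximately normal, centered at mu with spread sigma/sqrt(n).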
Confidence Intervals
- Point Estimate: Uses a single statistic based on sample data, this is the simplest approach.
- Confidence Intervals: Used to estimate the unknown population parameter.
- Margin of Error: The smaller the margin of error, the more precise the estimate.
Steps for Creating Confidence Intervals
- Assumptions: Verify the conditions required for the chosen procedure.
- Calculations: C.I. = statistic ± critical value * (standard deviation of the statistic)
- Conclusion: Write your statement in context. We are [x]% confident that the [parameter] of [context] is between [a] and [b].
- What makes the margin of error smaller:
- Make the critical value smaller (lower confidence level).
- Get a sample with a smaller s.
- Make n larger.
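The statistic ± critical value × (standard deviation of the statistic) recipe can be sketched for a one-proportion z-interval (the counts 120 out of 200 are hypothetical):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical sample: 120 successes out of 200
x, n = 120, 200
p_hat = x / n
conf_level = 0.95

# Critical value z*: central 95% leaves 2.5% in each tail
z_star = NormalDist().inv_cdf((1 + conf_level) / 2)   # about 1.96

margin = z_star * sqrt(p_hat * (1 - p_hat) / n)       # margin of error
interval = (p_hat - margin, p_hat + margin)
print(interval)
```

Raising the confidence level increases `z_star` (wider interval); raising `n` shrinks the square-root term (narrower interval), matching the bullet points above.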
T Distributions
- Compared to standard normal curve:
- Centered around 0
- More spread out and shorter
- More area under the tails.
- When we increase n, t-curves become more normal.
- There can be no outliers in the sample data.
- Degrees of Freedom = n - 1
Robustness
- If the assumption of normality is not met, the confidence level or p-value does not change much. This is true of t-distributions because there is more area in the tails
Hypothesis Tests
- Hypothesis Testing tells us if a value occurs by random chance or not. If it is unlikely to occur by random chance then it is statistically significant.
- Null Hypothesis (H_0): Is the statement being tested. The null hypothesis should be "no effect", "no difference", or "no relationship"
- Alternate Hypothesis (H_a): Is the statement suspected of being true.
- P-Value: Assuming the null is true, the probability of obtaining the observed result or more extreme
- Level of Significance (\alpha): Is the amount of evidence necessary before rejecting the null hypothesis.
Steps for Hypothesis Testing
- Assumptions
- Hypotheses - don't forget to define parameter
- Calculations - find z or t test statistic & p-value
- Conclusion - Write your statement in context. Since the p-value is < (>) \alpha, I reject (fail to reject) the H0. There is (is not) sufficient evidence to suggest that [Ha].
Type I and II Errors and Power
- Type I Error: Is when one rejects H0 when H0 is actually true. (probability is \alpha)
- Type II Error: Is when you fail to reject H0, and H0 is actually false. (probability is β)
- \alpha and β are inversely related.
- Consequences are the results of making a Type I or Type II error. Every decision has the possibility of making an error.
- The Power of a Test: The probability that the test correctly rejects the null hypothesis when the null hypothesis is false.
- Power = 1 - β
Error Relations
| | Type I error | Type II error | Power | Confidence Level |
|---|---|---|---|---|
| Increase \alpha | increases | decreases | increases | decreases |
| Increase n | no change | decreases | increases | no change |
Chi-Square \chi^2 Test
- Test is used to test counts of categorical data.
- Types:
- Goodness of Fit (univariate)
- Independence (bivariate)
- Homogeneity (univariate 2 (or more) samples)
- \chi^2 distribution. All curves are skewed right and every df has a different curve. As the degrees of freedom increase the curve becomes more normal.
Chi-Square Test Types
Goodness of Fit
- Is for univariate categorical data from a single sample. Does the observed count "fit" what we expect?
- Must use lists to perform. df = (number of categories) - 1. Use \chi^2\text{cdf}(\chi^2, \infty, df) to calculate the p-value.
Independence
- Bivariate categorical data from one sample. Are the two variables independent or dependent? Use matrices to calculate Expected counts.
Homogeneity
- Single categorical variable from 2 (or more) samples. Are distributions homogeneous? Use matrices to calculate.
Expected Counts for both Independence and Homogeneity
- Expected counts = \frac{(row total) * (column total)}{grand total}
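The expected-count formula and the chi-square statistic \chi^2 = \sum \frac{(O - E)^2}{E} can be sketched over a small two-way table (the observed counts are hypothetical):

```python
# Hypothetical observed counts: a 2 x 3 two-way table
observed = [[20, 30, 50],
            [30, 20, 50]]

row_totals = [sum(row) for row in observed]        # [100, 100]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 50, 100]
grand = sum(row_totals)                            # 200

# Expected count for each cell: (row total)(column total) / grand total
expected = [[r * c / grand for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e
             for o_row, e_row in zip(observed, expected)
             for o, e in zip(o_row, e_row))
print(expected)
print(chi_sq)
```

For a test of independence or homogeneity on a table with r rows and c columns, this statistic is compared to a chi-square curve with df = (r - 1)(c - 1).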
Regression Model
- X & Y have a linear relationship where the true LSRL is \mu_y = \alpha + \beta x
- The responses (y) are normally distributed for a given x-value.
- The standard deviation of the responses (\sigma_y) is the same for all values of x.
- s is the estimate for \sigma_y
- Confidence Interval: b \pm t^* \cdot SE_b
- Hypothesis Test: t = \frac{b}{SE_b} with df = n - 2
Assumptions for statistical procedures
Proportions (z - procedures)
One sample:
- SRS from population
- Can be approximated by normal distribution if n(p) & n(1 - p) > 10
- Population size is at least 10n
Two samples:
- 2 independent SRS's from populations (or randomly assigned treatments)
- Can be approximated by normal distribution if n_1 p_1, n_1(1 - p_1), n_2 p_2, & n_2(1 - p_2) > 10
- Population sizes are at least 10n
Means (t - procedures)
One sample:
- SRS from population
- Distribution is approximately normal: either given (stated in the problem), a large sample size (CLT), or the graph of the data is approximately symmetrical and unimodal with no outliers
Matched pairs:
- SRS from population
- Distribution of differences is approximately normal: either given, a large sample size, or the graph of the differences is approximately symmetrical and unimodal with no outliers
Two samples:
- 2 independent SRS's from populations (or randomly assigned treatments)
- Distributions are approximately normal: either given, large sample sizes, or the graphs of the data are approximately symmetrical and unimodal with no outliers
Counts (Chi-Square - procedures)
All types:
- Reasonably random sample(s)
- All expected counts > 5
- Must show expected counts
Bivariate Data: (t — procedures on slope)
- SRS from population
- There is linear relationship between x & y.
- Residual plot has no pattern. The standard deviation of the responses is constant for all values of x.
- Points are scattered evenly across the LSRL in the scatterplot.
- The responses are approximately normally distributed. Graph of residuals is approximately symmetrical & unimodal with no outliers.