AP Statistics Cumulative AP Exam Study Guide Notes
Statistics
- The science of collecting, analyzing, and drawing conclusions from data.
- Descriptive: Methods of organizing and summarizing data.
- Inferential: Making generalizations from a sample to the population.
Population
- An entire collection of individuals or objects.
Sample
- A subset of the population selected for study.
Variable
- Any characteristic whose value changes.
Data
- Observations on one or more variables.
Variables
- Categorical (Qualitative): Basic characteristics.
- Numerical (Quantitative): Measurements or observations of numerical data.
- Discrete: Listable sets (counts).
- Continuous: Any value over an interval of values (measurements).
Variable Types
- Univariate: One variable.
- Bivariate: Two variables.
- Multivariate: Many variables.
Distributions
- Symmetrical: Data on which both sides are fairly the same shape and size. "Bell Curve"
- Uniform: Every class has an equal frequency (number). "A rectangle"
- Skewed: One side (tail) is longer than the other side. The skewness is in the direction that the tail points (left or right).
- Bimodal: Data of two or more classes have large frequencies separated by another class between them. "Double hump camel"
Describing Numerical Graphs (S.O.C.S.)
- Shape: Overall type (symmetrical, skewed right/left, uniform, or bimodal).
- Outliers: Unusual values, gaps, clusters, etc.
- Center: Middle of the data (mean, median, and mode).
- Spread: Refers to variability (range, standard deviation, and IQR).
- Everything must be in context to the data and situation of the graph.
- When comparing two distributions, MUST use comparative language!
Parameter vs. Statistic
- Parameter: Value of a population (typically unknown).
- Statistic: A value calculated from a sample, used to estimate the population parameter.
Measures of Center
- Median: The middle point of the data (50th percentile) when the data is in numerical order. If there are an even number of observations, average the two middle values.
- Mean: μ is for a population (parameter) and \bar{x} is for a sample (statistic).
- Mode: Occurs the most in the data. There can be more than one mode, or no mode at all if all data points occur once.
- Variability allows statisticians to distinguish between usual and unusual occurrences.
Measures of Spread (Variability)
- Range: A single value (Max - Min).
- IQR (Interquartile Range): (Q3 - Q1).
- Standard Deviation: σ for population (parameter) & s for sample (statistic). Measures the typical or average deviation of observations from the mean. Sample standard deviation is divided by df = n - 1.
- The sum of the deviations from the mean is always zero!
- Variance: Standard deviation squared.
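The measures of center and spread above can be computed with Python's standard library; the data values below are made up for illustration. Note how `stdev` (sample, divides by n - 1) differs from `pstdev` (population, divides by n):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)       # 40 / 8 = 5.0
median = statistics.median(data)   # average of middle values 4 and 5 = 4.5
pop_sd = statistics.pstdev(data)   # population sigma: divides by n
samp_sd = statistics.stdev(data)   # sample s: divides by df = n - 1

print(mean, median, pop_sd, samp_sd)
```

Because the sample version divides by the smaller n - 1, `samp_sd` is always a bit larger than `pop_sd` for the same data.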
Resistant Measures
- Resistant: Not affected by outliers (e.g., median, IQR).
- Non-Resistant (affected by outliers):
- Mean
- Range
- Variance
- Standard Deviation
- Correlation Coefficient (r)
- Least Squares Regression Line (LSRL)
- Coefficient of Determination r^2
- Symmetrical: Mean and the median are the same value.
- Skewed Right: Mean is a larger value than the median.
- Skewed Left: The mean is smaller than the median.
- The mean is always pulled in the direction of the skew away from the median.
Trimmed Mean
- Use a % to take observations away from the top and bottom of the ordered data. This possibly eliminates outliers.
Linear Transformations
- Y = a + bX
- The mean is changed by both addition (subtraction) & multiplication (division): \mu_Y = a + b\mu_X.
- The standard deviation is changed by multiplication (division) ONLY: \sigma_Y = |b|\sigma_X.
Combination of Two (or More) Random Variables
- E(X ± Y) = E(X) ± E(Y)
- Just add or subtract the two (or more) means.
- Always add the variances: Var(X ± Y) = Var(X) + Var(Y). X & Y MUST be independent.
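The transformation and combination rules above can be checked directly on small discrete distributions (the values and probabilities here are invented for illustration):

```python
import itertools

# Hypothetical discrete distributions: {value: probability}
X = {1: 0.5, 3: 0.5}   # E(X) = 2, Var(X) = 1
Y = {0: 0.2, 5: 0.8}   # E(Y) = 4, Var(Y) = 4

def E(dist):
    return sum(x * p for x, p in dist.items())

def Var(dist):
    m = E(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())

# Linear transformation 10 + 2X: mean shifts and scales, variance only scales.
a, b = 10, 2
T = {a + b * x: p for x, p in X.items()}
assert E(T) == a + b * E(X)        # 10 + 2*2 = 14
assert Var(T) == b ** 2 * Var(X)   # 2^2 * 1 = 4

# Sum of independent X and Y: means add, variances add.
S = {}
for (x, px), (y, py) in itertools.product(X.items(), Y.items()):
    S[x + y] = S.get(x + y, 0) + px * py
assert abs(E(S) - (E(X) + E(Y))) < 1e-9    # 2 + 4 = 6
assert abs(Var(S) - (Var(X) + Var(Y))) < 1e-9  # 1 + 4 = 5
```

The variance check in the last line only works because X and Y were combined as independent variables; for dependent variables the variances do not simply add.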
Z-Score
- Is a standardized score. This tells you how many standard deviations from the mean an observation is.
- It creates a standard normal curve consisting of z-scores with \mu = 0 & σ = 1.
Normal Curve
- Is a bell-shaped and symmetrical curve.
- As σ increases the curve flattens. As σ decreases the curve thins.
Empirical Rule (68-95-99.7)
- Measures 1σ, 2σ, and 3σ on normal curves from a center of \mu.
- 68% of the population is between -1σ and 1σ
- 95% of the population is between -2σ and 2σ
- 99.7% of the population is between -3σ and 3σ
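The 68-95-99.7 figures are rounded; the exact areas can be checked with `statistics.NormalDist`:

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)   # standard normal curve

for k, rule in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    area = Z.cdf(k) - Z.cdf(-k)   # area between -k sigma and +k sigma
    print(f"within {k} sigma: {area:.4f} (rule says about {rule})")
```

The exact areas (about 0.6827, 0.9545, and 0.9973) show why the Empirical Rule is a good approximation for any normal curve.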
Boxplots
- Are for medium or large numerical data sets. A boxplot does not show the original observations.
- Always use modified boxplots where the fences are 1.5 IQRs from the ends of the box (Q1 & Q3).
- Points outside the fence are considered outliers.
- Whiskers extend to the smallest & largest observations within the fences.
5-Number Summary
- Minimum, Q1 (1st Quartile - 25th Percentile), Median, Q3 (3rd Quartile - 75th Percentile), Maximum
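A 5-number summary can be sketched with `statistics.quantiles` (the data are made up; note that quartile conventions differ between calculators and software, so the `method="inclusive"` choice here is one of several):

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15, 17]

q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, med, q3, max(data))
iqr = q3 - q1

# Modified boxplot fences: 1.5 IQRs beyond Q1 and Q3.
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outliers = [x for x in data if x < fences[0] or x > fences[1]]
print(five_number, iqr, outliers)
```

For this data set the fences are well outside the observations, so the outlier list is empty.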
Probability Rules
- Sample Space: Is a collection of all outcomes.
- Event: Any subset of outcomes from the sample space.
- Complement: All outcomes not in the event.
- Union: A or B; all outcomes in either event (or both). A \cup B
- Intersection: A and B; outcomes that are in both events at once. A \cap B
- Mutually Exclusive (Disjoint): A and B have no intersection. They cannot happen at the same time.
- Independent: If knowing one event does not change the outcome of another.
- Experimental Probability: The number of successes from an experiment divided by the total number of trials.
- Law of Large Numbers: As an experiment is repeated the experimental probability gets closer and closer to the true (theoretical) probability. The difference between the two probabilities will approach 0.
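The Law of Large Numbers can be sketched with a seeded coin-flip simulation (the seed is arbitrary, chosen only to make the run reproducible):

```python
import random

random.seed(1)        # reproducible illustration
p_true = 0.5          # theoretical probability of heads for a fair coin

heads = 0
flips = 0
for n in (100, 10_000, 1_000_000):
    while flips < n:
        heads += random.random() < p_true   # True counts as 1
        flips += 1
    # gap between experimental and theoretical probability;
    # it typically shrinks as n grows
    print(n, abs(heads / n - p_true))
```

Any single run can wobble, but the gap at n = 1,000,000 is reliably far smaller than the gap at n = 100.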
Rules of Probability
- All values are 0 \le P \le 1.
- Probability of sample space is 1.
- Complement: P(A^c) = 1 - P(A)
- Addition: P(A \cup B) = P(A) + P(B) - P(A \cap B)
- Multiplication: P(A \cap B) = P(A) \cdot P(B) if A & B are independent. In general, P(A \cap B) = P(A) \cdot P(B | A).
- P(\text{at least one}) = 1 - P(\text{none})
Conditional Probability
- Takes into account a certain condition.
- P(A | B) = \frac{P(A \cap B)}{P(B)}
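The conditional probability formula can be sketched from a two-way table of counts (the counts below are hypothetical):

```python
# Hypothetical two-way counts: (grade level, pet ownership)
counts = {("10th", "pet"): 30, ("10th", "no pet"): 20,
          ("11th", "pet"): 25, ("11th", "no pet"): 25}

total = sum(counts.values())                        # 100 students
p_pet = (counts[("10th", "pet")] + counts[("11th", "pet")]) / total
p_10th_and_pet = counts[("10th", "pet")] / total

# P(10th | pet) = P(10th and pet) / P(pet)
p_10th_given_pet = p_10th_and_pet / p_pet
print(p_10th_given_pet)   # 0.30 / 0.55, about 0.545
```

Conditioning on "pet" shrinks the sample space from all 100 students to the 55 pet owners, which is exactly what dividing by P(pet) does.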
Correlation Coefficient (r)
- Is a quantitative assessment of the strength and direction of a linear relationship. (use ρ (rho) for population parameter)
- Values: [-1, 1]
- 0 = no correlation
- (0, ±0.5) = weak
- [±0.5, ±0.8) = moderate
- [±0.8, ±1] = strong
Least Squares Regression Line (LSRL)
- Is a line of mathematical best fit. Minimizes the deviations (residuals) from the line. Used with bivariate data.
- \hat{y} = a + bx
- x: independent, the explanatory variable & y: dependent, the response variable
Residuals (error)
- Is the vertical difference of a point from the LSRL. The residuals always sum to 0.
- Residual = y - \hat{y}
Residual Plot
- A scatterplot of (x (or \hat{y}), residual). No pattern indicates a linear relationship.
Coefficient of Determination (r^2)
- Gives the proportion of variation in y (response) that is explained by the relationship of (x, y). Never use the adjusted r^2
Interpretations (Must be in context!)
- Slope (b): For each unit increase in x, the predicted y increases/decreases by [slope] on average.
- Correlation coefficient (r): There is a strength, direction, linear association between x & y.
- Coefficient of determination (r^2): Approximately % of the variation in y can be explained by the LSRL of x and y.
- Extrapolation: The LSRL cannot be used to predict values outside the range of the original x data.
- Influential Points: Are points that if removed significantly change the LSRL.
- Outliers: Are points with large residuals.
Census
- A complete count of the population.
Why Not to Use a Census?
- Expensive
- Impossible to do
- Destructive sampling would destroy the entire population being studied.
Sampling Frame
- Is a list of everyone in the population.
Sampling Design
- Refers to the method used to choose a sample.
SRS (Simple Random Sample)
- Each unit has an equal chance of selection, and every set of n units has an equal chance of being selected.
- Advantages: Easy and unbiased
- Disadvantages: Large σ^2 and must know population.
Stratified
- Divide the population into homogeneous groups called strata, then SRS each strata.
- Advantages: More precise than an SRS and cost reduced if strata already available.
- Disadvantages: Difficult to divide into groups, more complex formulas & must know population.
Systematic
- Use a systematic approach (every 50th) after choosing randomly where to begin.
- Advantages: Unbiased, the sample is evenly distributed across population & don't need to know population.
- Disadvantages: A large σ^2 and can be confounded by trends.
Cluster Sample
- Based on location. Select a random location and sample ALL at that location.
- Advantages: Cost is reduced, is unbiased & don't need to know population.
- Disadvantages: May not be representative of the population and has complex formulas.
Random Digit Table
- Each entry is equally likely and each digit is independent of the rest.
Random Generator
- Calculator or computer program
Bias (Error)
- Favors a certain outcome; bias has to do with the center of the sampling distribution. If the distribution is centered over the true parameter, the statistic is unbiased.
Sources of Bias
- Voluntary Response: People choose themselves to participate.
- Convenience Sampling: Ask people who are easy, friendly, or comfortable asking.
- Undercoverage: Some group(s) are left out of the selection process.
- Non-response: Someone cannot or does not want to be contacted or participate.
- Response: False answers - can be caused by a variety of things
- Wording of the Questions: Leading questions.
Experimental Design
- Observational Study: Observe outcomes without giving a treatment.
- Experiment: Actively imposes a treatment on the subjects.
- Experimental Unit: Single individual or object that receives a treatment.
- Factor: Is the explanatory variable, what is being tested
- Level: A specific value for the factor.
- Response Variable: What you are measuring with the experiment.
- Treatment: Experimental condition applied to each unit.
- Control Group: A group used to compare the factor to for effectiveness. Does NOT have to be placebo
- Placebo: A treatment with no active ingredients (provides control).
- Blinding: A method used so that the subjects are unaware of the treatment (who gets a placebo or the real treatment).
- Double Blinding: Neither the subjects nor the evaluators know which treatment is being given.
Principles of Experimental Design
- Control: Keep all extraneous variables (not being tested) constant
- Replication: Uses many subjects to quantify the natural variation in the response.
- Randomization: Uses chance to assign the subjects to the treatments.
- The only way to show cause and effect is with a well-designed, well-controlled experiment.
Experimental Designs
Completely Randomized
- All units are randomly assigned among all of the treatments.
Randomized Block
- Units are blocked and then randomly assigned in each block. Reduces variation
Matched Pairs
- Units are paired by matching characteristics, then treatments are randomly assigned within each pair: once one member of a pair receives a treatment, the other member automatically receives the other treatment. OR individuals receive both treatments in random order (before/after or pretest/post-test). Assignment is dependent.
Confounding Variables
- Are where the effect of the variable on the response cannot be separated from the effects of the factor being tested. Happens in observational studies. When you use random assignment to treatments you do NOT have confounding variables!
- Randomization reduces bias by spreading extraneous variables to all groups in the experiment.
- Blocking helps reduce variability.
- Another way to reduce variability is to increase sample size.
Random Variable
- A numerical value that depends on the outcome of an experiment.
- Discrete: A count of a random variable
- Continuous: A measure of a random variable
Discrete Probability Distributions
- Gives values & probabilities associated with each possible x.
- \mu_X = \sum x_i \, p(x_i)
- \sigma_X = \sqrt{\sum (x_i - \mu_X)^2 \, p(x_i)}
- Calculator shortcut: 1-Var Stats L1, L2
- Fair Game: A fair game is one in which all pay-ins equal all pay-outs.
Special Discrete Distributions
Binomial Distributions
- Properties: Two mutually exclusive outcomes, fixed number of trials (n), each trial is independent, the probability (p) of success is the same for all trials.
- Random variable: Is the number of successes out of a fixed number of trials. Starts at X = 0 and is finite.
- Calculator:
- binomialpdf(n, p, x) = P(X = x), a single outcome
- binomialcdf(n, p, x) = P(X \le x), cumulative
- 1 - binomialcdf(n, p, x - 1) = P(X \ge x), cumulative
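The calculator commands can be mirrored in a few lines of Python, which makes the formula P(X = x) = C(n, x) p^x (1-p)^(n-x) explicit (the n = 10, p = 0.3 values are arbitrary examples):

```python
from math import comb

def binom_pdf(n, p, x):
    """P(X = x) for a Binomial(n, p) random variable."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(n, p, x):
    """P(X <= x): sum of the pdf from 0 up to x."""
    return sum(binom_pdf(n, p, k) for k in range(x + 1))

# Example: 10 trials, p = 0.3
print(binom_pdf(10, 0.3, 3))        # P(X = 3)
print(binom_cdf(10, 0.3, 3))        # P(X <= 3)
print(1 - binom_cdf(10, 0.3, 2))    # P(X >= 3)
```

The last line is the complement trick from the notes: subtracting the cdf at x - 1 flips a "less than" probability into an "at least" probability.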
Geometric Distributions
- Properties: Two mutually exclusive outcomes, each trial is independent, probability (p) of success is the same for all trials. (NOT a fixed number of trials)
- Random Variable: When the FIRST success occurs. Starts at 1 and is \infty.
- Calculator:
- geometricpdf(p, a) = P(X = a), a single outcome
- geometriccdf(p, a) = P(X \le a), cumulative
- 1 - geometriccdf(p, a - 1) = P(X \ge a), cumulative
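The geometric formulas are simple enough to write out directly: the first success on trial a requires a - 1 failures followed by one success (p = 0.2 below is an arbitrary example):

```python
def geom_pdf(p, a):
    """P(X = a): first success occurs on trial a."""
    return (1 - p) ** (a - 1) * p

def geom_cdf(p, a):
    """P(X <= a): success occurs within the first a trials."""
    return 1 - (1 - p) ** a

# Example: p = 0.2
print(geom_pdf(0.2, 3))   # P(first success on trial 3)
print(geom_cdf(0.2, 3))   # P(success within 3 trials)
```

The cdf has a closed form because "success within a trials" is the complement of "a straight failures."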
Continuous Random Variable
- Numerical values that fall within a range or interval (measurements), use density curves where the area under the curve always = 1. To find probabilities, find area under the curve
Unusual Density Curves
- Any shape (triangles, etc.)
- Uniformly (evenly) distributed, shape of a rectangle
Normal Distributions
- Symmetrical, unimodal, bell-shaped curves defined by the parameters \mu & σ.
- Calculator:
- normalpdf: used for graphing only
- normalcdf(lower bound, upper bound, \mu, σ): finds the probability
- invNorm(p): gives a z-score; invNorm(p, \mu, σ): gives an x-value
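The normalcdf and invNorm commands have direct analogues in `statistics.NormalDist` (the mean and SD below are hypothetical example values):

```python
from statistics import NormalDist

heights = NormalDist(mu=64, sigma=2.5)   # hypothetical heights in inches

# normalcdf(lower, upper, mu, sigma): area between two bounds
p = heights.cdf(66.5) - heights.cdf(61.5)   # P(61.5 < X < 66.5)

# invNorm(p, mu, sigma): x-value with the given area to its left
cutoff = heights.inv_cdf(0.90)              # 90th percentile

print(p, cutoff)
```

Since 61.5 and 66.5 are exactly one SD from the mean, `p` comes out to about 0.68, matching the Empirical Rule.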
Assessing Normality
- Use graphs— dotplots, boxplots, histograms, or normal probability plot.
Distribution Types
- Distribution: Is all of the values of a random variable.
- Sampling Distribution: The distribution of a statistic's values over all possible samples of a given size.
- Use normalcdf to calculate probabilities - be sure to use correct SD
- \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}, \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}, \quad \mu_{\hat{p}} = p, \quad \mu_{\bar{x}} = \mu
Standard Error Calculation
- Standard error is the estimate of the standard deviation of a statistic. You usually do not need to compute it by hand; it is given in a computer printout.
- SE_{\bar{x}} = \frac{s}{\sqrt{n}} (standard deviation of the sample means)
- SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} (standard deviation of the sample proportions)
- SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} (standard deviation of the difference in sample means)
- SE_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} (standard deviation of the difference in sample proportions)
- SE_b (standard error of the slope of the LSRL; read it from the computer printout)
Central Limit Theorem
- When n is sufficiently large (n > 30) the sampling distribution is approximately normal even if the population distribution is not normal.
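The CLT can be sketched by sampling from a strongly skewed population; here an exponential distribution with mean 1 and SD 1 (the seed and sample sizes are arbitrary illustration choices):

```python
import random
import statistics

random.seed(7)   # reproducible illustration

n = 40           # each sample's size (comfortably above 30)
# 5000 sample means from an exponential(1) population (skewed right)
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(5000)]

print(statistics.mean(means))    # close to the population mean, 1
print(statistics.stdev(means))   # close to sigma / sqrt(n) = 1 / sqrt(40)
```

Even though each individual draw comes from a skewed distribution, a histogram of `means` would look approximately normal, centered at mu with spread sigma/sqrt(n).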
Confidence Intervals
- Point Estimate: Uses a single statistic based on sample data, this is the simplest approach.
- Confidence Intervals: Used to estimate the unknown population parameter.
- Margin of Error: The smaller the margin of error, the more precise the estimate.
Steps for Creating Confidence Intervals
- Assumptions: Verify the conditions required for the chosen procedure.
- Calculations: C.I. = statistic ± critical value * (standard deviation of the statistic)
- Conclusion: Write your statement in context. We are [x]% confident that the [parameter] of [context] is between [a] and [b].
- What makes the margin of error smaller:
- Make the critical value smaller (lower confidence level).
- Get a sample with a smaller s.
- Make n larger.
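The statistic ± critical value × (standard deviation of the statistic) recipe can be sketched for a one-proportion z-interval (the counts 120 out of 200 are hypothetical):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical sample: 120 successes out of 200
x, n = 120, 200
p_hat = x / n
conf_level = 0.95

# Critical value z*: central 95% leaves 2.5% in each tail
z_star = NormalDist().inv_cdf((1 + conf_level) / 2)   # about 1.96

margin = z_star * sqrt(p_hat * (1 - p_hat) / n)       # margin of error
interval = (p_hat - margin, p_hat + margin)
print(interval)
```

Raising the confidence level increases `z_star` (wider interval); raising `n` shrinks the square-root term (narrower interval), matching the bullet points above.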
T Distributions
- Compared to standard normal curve:
- Centered around 0
- More spread out and shorter
- More area under the tails.
- When we increase n, t-curves become more normal.
- There can be no outliers in the sample data.
- Degrees of Freedom = n - 1
Robustness
- If the assumption of normality is not met, the confidence level or p-value does not change much. This is true of t-distributions because there is more area in the tails
Hypothesis Tests
- Hypothesis Testing tells us if a value occurs by random chance or not. If it is unlikely to occur by random chance then it is statistically significant.
- Null Hypothesis (H_0): Is the statement being tested. The null hypothesis should be "no effect", "no difference", or "no relationship"
- Alternate Hypothesis (H_a): Is the statement suspected of being true.
- P-Value: Assuming the null is true, the probability of obtaining the observed result or more extreme
- Level of Significance (\alpha): Is the amount of evidence necessary before rejecting the null hypothesis.
Steps for Hypothesis Testing
- Assumptions
- Hypotheses - don't forget to define parameter
- Calculations - find z or t test statistic & p-value
- Conclusion - Write your statement in context. Since the p-value is < (>) \alpha, I reject (fail to reject) the H0. There is (is not) sufficient evidence to suggest that [Ha].
Type I and II Errors and Power
- Type I Error: Is when one rejects H0 when H0 is actually true. (probability is \alpha)
- Type II Error: Is when you fail to reject H0, and H0 is actually false. (probability is β)
- \alpha and β are inversely related.
- Consequences are the results of making a Type I or Type II error. Every decision has the possibility of making an error.
- The Power of a Test: The probability that the test correctly rejects the null hypothesis when the null hypothesis is false.
- Power = 1 - β
Error Relations
| | Type I error | Type II error | Power | Confidence Level |
|---|---|---|---|---|
| Increase \alpha | increases | decreases | increases | decreases |
| Increase n | no change | decreases | increases | no change |
Chi-Square \chi^2 Test
- Test is used to test counts of categorical data.
- Types:
- Goodness of Fit (univariate)
- Independence (bivariate)
- Homogeneity (univariate 2 (or more) samples)
- \chi^2 distribution. All curves are skewed right and every df has a different curve. As the degrees of freedom increase the curve becomes more normal.
Chi-Square Test Types
Goodness of Fit
- Is for univariate categorical data from a single sample. Does the observed count "fit" what we expect?
- Must use lists to perform. df = (number of categories) - 1. Use \chi^2\text{cdf}(\chi^2, \infty, df) to calculate the p-value.
Independence
- Bivariate categorical data from one sample. Are the two variables independent or dependent? Use matrices to calculate Expected counts.
Homogeneity
- Single categorical variable from 2 (or more) samples. Are distributions homogeneous? Use matrices to calculate.
Expected Counts for both Independence and Homogeneity
- Expected counts = \frac{(row total) * (column total)}{grand total}
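The expected-count formula and the chi-square statistic \chi^2 = \sum \frac{(O - E)^2}{E} can be sketched over a small two-way table (the observed counts are hypothetical):

```python
# Hypothetical observed counts: a 2 x 3 two-way table
observed = [[20, 30, 50],
            [30, 20, 50]]

row_totals = [sum(row) for row in observed]        # [100, 100]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 50, 100]
grand = sum(row_totals)                            # 200

# Expected count for each cell: (row total)(column total) / grand total
expected = [[r * c / grand for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e
             for o_row, e_row in zip(observed, expected)
             for o, e in zip(o_row, e_row))
print(expected)
print(chi_sq)
```

For a test of independence or homogeneity on a table with r rows and c columns, this statistic is compared to a chi-square curve with df = (r - 1)(c - 1).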
Regression Model
- X & Y have a linear relationship where the true LSRL is \mu_y = \alpha + \beta x
- The responses (y) are normally distributed for a given x-value.
- The standard deviation of the responses (\sigma_y) is the same for all values of x.
- s is the estimate for \sigma_y
- Confidence Interval: b \pm t^* \cdot SE_b
- Hypothesis Test: t = \frac{b}{SE_b} with df = n - 2
Assumptions for statistical procedures
Proportions (z - procedures)
One sample:
- SRS from population
- Can be approximated by normal distribution if n(p) & n(1 - p) > 10
- Population size is at least 10n
Two samples:
- 2 independent SRS's from populations (or randomly assigned treatments)
- Can be approximated by normal distribution if n_1 p_1, n_1(1 - p_1), n_2 p_2, & n_2(1 - p_2) > 10
- Population sizes are at least 10n
Means (t - procedures)
One sample:
- SRS from population
- Distribution is approximately normal: either given (stated in the problem), a large sample size (CLT), or the graph of the data is approximately symmetrical and unimodal with no outliers
Matched pairs:
- SRS from population
- Distribution of differences is approximately normal: either given, a large sample size, or the graph of the differences is approximately symmetrical and unimodal with no outliers
Two samples:
- 2 independent SRS's from populations (or randomly assigned treatments)
- Distributions are approximately normal: either given, large sample sizes, or the graphs of the data are approximately symmetrical and unimodal with no outliers
Counts (Chi-Square - procedures)
All types:
- Reasonably random sample(s)
- All expected counts > 5
- Must show expected counts
Bivariate Data: (t — procedures on slope)
- SRS from population
- There is linear relationship between x & y.
- Residual plot has no pattern. The standard deviation of the responses is constant for all values of x.
- Points are scattered evenly across the LSRL in the scatterplot.
- The responses are approximately normally distributed. Graph of residuals is approximately symmetrical & unimodal with no outliers.