AP Statistics Cumulative AP Exam Study Guide Notes

Statistics

  • The science of collecting, analyzing, and drawing conclusions from data.

  • Descriptive: Methods of organizing and summarizing data.

  • Inferential: Making generalizations from a sample to the population.

Basic Definitions

  • Population: An entire collection of individuals or objects.

  • Sample: A subset of the population selected for study.

  • Variable: Any characteristic whose value changes.

  • Data: Observations on one or more variables.

Types of Variables

  • Categorical (Qualitative): Values that name categories or basic characteristics.

  • Numerical (Quantitative): Measurements or observations of numerical data.

    • Discrete: Listable sets (counts).

    • Continuous: Any value over an interval of values (measurements).

  • Univariate: One variable.

  • Bivariate: Two variables.

  • Multivariate: Many variables.

Distributions

  • Symmetrical: Both sides of the distribution have roughly the same shape and size. "Bell curve".

  • Uniform: Every class has an equal frequency (number) "a rectangle".

  • Skewed: One side (tail) is longer than the other side. The skewness is in the direction that the tail points (left or right).

  • Bimodal: Two (or more) classes have large frequencies, separated by a class with lower frequency between them. "Double-hump camel".

Describing Numerical Graphs (S.O.C.S.)

  • Shape: Overall type (symmetrical, skewed right/left, uniform, or bimodal).

  • Outliers: Unusual values, gaps, clusters, etc.

  • Center: Middle of the data (mean, median, and mode).

  • Spread: Refers to variability (range, standard deviation, and IQR).

  • Everything must be in context to the data and situation of the graph.

  • When comparing two distributions – MUST use comparative language!

Parameter vs. Statistic

  • Parameter: Value of a population (typically unknown).

  • Statistic: A value calculated from a sample, used to estimate the population parameter.

Measures of Center

  • Median: The middle value of the data (50th percentile) when the data are in numerical order. If there is an even number of values, average the two middle values.

  • Mean: μ is for a population (parameter) and x̄ is for a sample (statistic).

  • Mode: Occurs the most in the data. There can be more than one mode, or no mode at all if all data points occur once.

Variability

  • Allows statisticians to distinguish between usual and unusual occurrences.

Measures of Spread (Variability)

  • Range: A single value – (Max – Min)

  • IQR: Interquartile range – (Q3 – Q1)

  • Standard deviation: σ for population (parameter) & s for sample (statistic) – measures the typical or average deviation of observations from the mean – sample standard deviation is divided by df = n-1

  • Sum of the deviations from the mean is always zero!

  • Variance: Standard deviation squared
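A minimal Python sketch of these spread measures using the standard statistics module; the data values below are invented for illustration:

```python
import statistics

data = [4, 8, 15, 16, 23, 42]   # hypothetical sample

rng = max(data) - min(data)                   # Range = Max - Min
q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles
iqr = q3 - q1                                 # IQR = Q3 - Q1
s = statistics.stdev(data)       # sample SD: divides by df = n - 1
sigma = statistics.pstdev(data)  # population SD: divides by n

# The deviations from the mean always sum to zero.
mean = statistics.mean(data)
assert abs(sum(x - mean for x in data)) < 1e-9
```

Note that the sample SD (n − 1) is always a bit larger than the population SD (n) for the same data, and that quantiles' interpolation method can differ slightly from the TI-84's quartile rule.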

Resistant vs. Non-Resistant Measures

  • Resistant: Not affected by outliers.

    • Median

    • IQR

  • Non-Resistant:

    • Mean

    • Range

    • Variance

    • Standard Deviation

    • Correlation Coefficient (r)

    • Least Squares Regression Line (LSRL)

    • Coefficient of Determination (r^2)

Comparison of Mean & Median Based on Graph Type

  • Symmetrical: Mean and the median are the same value.

  • Skewed Right: Mean is a larger value than the median.

  • Skewed Left: The mean is smaller than the median.

  • The mean is always pulled in the direction of the skew away from the median.

  • Trimmed Mean: Use a % to take observations away from the top and bottom of the ordered data. This possibly eliminates outliers.

Linear Transformations of Random Variables

  • μ_{a+bx} = a + bμ_x — The mean is changed by both addition (subtraction) & multiplication (division).

  • σ{a +bx} = |b|σx The standard deviation is changed by multiplication (division) ONLY.
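These transformation rules can be sanity-checked numerically; the data and the constants a, b below are arbitrary choices:

```python
import statistics

x = [2.0, 4.0, 6.0, 10.0]   # assumed sample values
a, b = 5.0, -3.0            # shift and scale (both made up)

y = [a + b * xi for xi in x]

mu_x, sd_x = statistics.mean(x), statistics.pstdev(x)
mu_y, sd_y = statistics.mean(y), statistics.pstdev(y)

assert abs(mu_y - (a + b * mu_x)) < 1e-9   # mean: shifted AND scaled
assert abs(sd_y - abs(b) * sd_x) < 1e-9    # SD: scaled by |b| only
```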

Combination of Two (or More) Random Variables

  • μ_{x ± y} = μ_x ± μ_y — Just add or subtract the two (or more) means.

  • σ²_{x ± y} = σ²_x + σ²_y — Always add the variances; X & Y MUST be independent.
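A quick simulation (the distributions and sample size are assumptions) showing why the variances add even when the variables are subtracted:

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable
n = 200_000
xs = [random.gauss(10, 3) for _ in range(n)]   # X ~ N(10, 3)
ys = [random.gauss(4, 4) for _ in range(n)]    # Y ~ N(4, 4), independent of X

diff = [x - y for x, y in zip(xs, ys)]

mu = statistics.mean(diff)          # close to 10 - 4 = 6
var = statistics.pvariance(diff)    # close to 3**2 + 4**2 = 25, NOT 3**2 - 4**2
```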

Z-Score

  • Is a standardized score. This tells you how many standard deviations from the mean an observation is. It creates a standard normal curve consisting of z-scores with a μ = 0 & σ = 1.

  • z = {x - μ \over σ}

Normal Curve

  • Is a bell shaped and symmetrical curve.

  • As σ increases the curve flattens.

  • As σ decreases the curve thins.

Empirical Rule (68-95-99.7)

  • Measures 1σ, 2σ, and 3σ on normal curves from a center of μ.

  • 68% of the population is between -1σ and 1σ

  • 95% of the population is between -2σ and 2σ

  • 99.7% of the population is between -3σ and 3σ
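The 68–95–99.7 figures are approximations to exact normal areas, which can be checked with statistics.NormalDist (Python 3.8+):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu = 0, sigma = 1

def within(k):
    """Area between -k sigma and +k sigma."""
    return z.cdf(k) - z.cdf(-k)

print(round(within(1), 4))  # ~0.6827
print(round(within(2), 4))  # ~0.9545
print(round(within(3), 4))  # ~0.9973
```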

Boxplots

  • Are for medium or large numerical data. It does not contain original observations. Always use modified boxplots where the fences are 1.5 IQRs from the ends of the box (Q1 & Q3). Points outside the fence are considered outliers. Whiskers extend to the smallest & largest observations within the fences.

5-Number Summary

  • Minimum, Q1 (1st Quartile – 25th Percentile), Median, Q3 (3rd Quartile – 75th Percentile), Maximum
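The five numbers and the 1.5·IQR outlier fences from the modified-boxplot rule can be computed directly; the data set is invented:

```python
import statistics

data = sorted([7, 12, 13, 14, 15, 16, 18, 19, 40])  # hypothetical data

q1, med, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr    # below this: outlier
high_fence = q3 + 1.5 * iqr   # above this: outlier

five_number = (min(data), q1, med, q3, max(data))
outliers = [x for x in data if x < low_fence or x > high_fence]
```

With these numbers the fences come out to 3.5 and 27.5, so 40 is flagged as an outlier and the upper whisker would stop at 19, the largest observation inside the fence.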

Probability Rules

  • Sample Space: Is collection of all outcomes.

  • Event: Any sample of outcomes.

  • Complement: All outcomes not in the event.

  • Union: A or B, all the outcomes in both circles. A ∪ B

  • Intersection: A and B, happening in the middle of A and B. A ∩ B

  • Mutually Exclusive (Disjoint): A and B have no intersection. They cannot happen at the same time.

  • Independent: If knowing one event does not change the outcome of another.

  • Experimental Probability: Is the number of successes from an experiment divided by the total amount from the experiment.

  • Law of Large Numbers: As an experiment is repeated the experimental probability gets closer and closer to the true (theoretical) probability. The difference between the two probabilities will approach “0”.

Rules of Probability

  • (1) All probabilities satisfy 0 ≤ P ≤ 1.

  • (2) Probability of sample space is 1.

  • (3) Complement: P(A) + P(A^c) = 1, so P(A^c) = 1 − P(A)

  • (4) Addition P(A or B) = P(A) + P(B) – P(A & B)

  • (5) Multiplication P(A & B) = P(A) · P(B) if A & B are independent

  • (6) P (at least 1 or more) = 1 – P (none)

  • (7) Conditional Probability – takes into account a certain condition.

    • P(A|B) = {P(A & B) \over P(B)} = {\text{P both} \over \text{P given}}
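A small dice simulation (the events A and B are made-up examples) that illustrates the conditional-probability rule P(A|B) = P(A and B) / P(B):

```python
import random

random.seed(0)  # repeatable run
trials = 100_000
b_count = both = 0

for _ in range(trials):
    d1 = random.randint(1, 6)
    d2 = random.randint(1, 6)
    a = (d1 + d2 == 8)   # event A: the sum is 8
    b = (d1 == 4)        # event B (the "given"): first die shows 4
    if b:
        b_count += 1
        if a:
            both += 1

p_a_given_b = both / b_count   # estimates P(A|B)
```

Given d1 = 4, event A happens only when d2 = 4, so the true conditional probability is 1/6; the simulated ratio "P(both) over P(given)" should land close to that.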

Correlation Coefficient

  • (r): Is a quantitative assessment of the strength and direction of a linear relationship. (use ρ (rho) for population parameter)

  • Values lie in [-1, 1]: 0 – no correlation, (0, ±0.5) – weak, [±0.5, ±0.8) – moderate, [±0.8, ±1] – strong

Least Squares Regression Line (LSRL)

  • Is a line of mathematical best fit. Minimizes the deviations (residuals) from the line. Used with bivariate data.

  • ŷ = a + bx where x is independent, the explanatory variable & y is dependent, the response variable

  • Residuals (error): Is vertical difference of a point from the LSRL. All residuals sum up to “0”.

  • Residual = y - ŷ

  • Residual Plot: A scatterplot of (x (or ŷ), residual). No pattern indicates a linear relationship.

Coefficient of Determination

  • (r^2): Gives the proportion of variation in y (response) that is explained by the linear relationship with x. Never use the adjusted r^2.

Interpretations (Must Be in Context!)

  • Slope (b): For each one-unit increase in x, the predicted y increases/decreases by b (the slope), on average.

  • Correlation coefficient (r): There is a [strength] (weak/moderate/strong), [direction] (positive/negative), linear association between x & y.

  • Coefficient of determination (r^2): Approximately r^2% of the variation in y can be explained by the LSRL of y on x.

  • Extrapolation: The LSRL cannot be used to predict values outside the range of the original data.

  • Influential Points: Are points that if removed significantly change the LSRL.

  • Outliers: Are points with large residuals.

Census

  • A complete count of the population.

  • Why not to use a census?

    • Expensive

    • Impossible to do

    • If sampling is destructive, a census would destroy the population (extinction)

Sampling Frame

  • Is a list of everyone in the population.

Sampling Design

  • Refers to the method used to choose a sample.

Types of Samples

  • SRS (Simple Random Sample): One chooses so that each unit has an equal chance and every set of units has an equal chance of being selected.

    • Advantages: easy and unbiased

    • Disadvantages: large σ^2 and must know population.

  • Stratified: Divide the population into homogeneous groups called strata, then take an SRS within each stratum.

    • Advantages: more precise than an SRS and cost reduced if strata already available.

    • Disadvantages: difficult to divide into groups, more complex formulas & must know population.

  • Systematic: Use a systematic approach (every 50th) after choosing randomly where to begin.

    • Advantages: unbiased, the sample is evenly distributed across population & don’t need to know population.

    • Disadvantages: a large σ^2 and can be confounded by trends.

  • Cluster Sample: Based on location. Select a random location and sample ALL at that location.

    • Advantages: cost is reduced, is unbiased & you don’t need to know the population.

    • Disadvantages: May not be representative of population and has complex formulas.

  • Random Digit Table: Each entry is equally likely and each digit is independent of the rest.

  • Random # Generator: Calculator or computer program

Bias

  • A systematic error that favors a certain outcome. Bias concerns the center of a sampling distribution: if the distribution is centered over the true parameter, the statistic is considered unbiased.

Sources of Bias

  • Voluntary Response: People choose themselves to participate.

  • Convenience Sampling: Asking people who are easy to reach or comfortable to ask.

  • Undercoverage: Some group(s) are left out of the selection process.

  • Non-response: Someone cannot or does not want to be contacted or participate.

  • Response: False answers – can be caused by a variety of things

  • Wording of the Questions: Leading questions.

Experimental Design

  • Observational Study: Observe outcomes without imposing a treatment.

  • Experiment: Actively imposes a treatment on the subjects.

  • Experimental Unit: Single individual or object that receives a treatment.

  • Factor: Is the explanatory variable, what is being tested

  • Level: A specific value for the factor.

  • Response Variable: What you are measuring with the experiment.

  • Treatment: Experimental condition applied to each unit.

  • Control Group: A group used to compare the factor to for effectiveness – does NOT have to be placebo

  • Placebo: A treatment with no active ingredients (provides control).

  • Blinding: A method used so that the subjects are unaware of the treatment (who gets a placebo or the real treatment).

  • Double Blinding: Neither the subjects nor the evaluators know which treatment is being given.

Principles of Experimental Design

  • Control: Keep all extraneous variables (not being tested) constant

  • Replication: Uses many subjects to quantify the natural variation in the response.

  • Randomization: Uses chance to assign the subjects to the treatments.

  • The only way to show cause and effect is with a well designed, well controlled experiment.

Experimental Designs

  • Completely Randomized: All units are randomly allocated among the treatments.

  • Randomized Block: Units are blocked (grouped by a shared characteristic) and then randomly assigned to treatments within each block – reduces variation.

  • Matched Pairs: Units are paired by similar characteristics, and treatments are randomly assigned within each pair. Once one member of a pair receives a treatment, the other member automatically receives the other treatment. OR each individual does both treatments in random order (before/after or pretest/post-test). Assignment is dependent.

  • Confounding Variables: Are where the effect of the variable on the response cannot be separated from the effects of the factor being tested – happens in observational studies – when you use random assignment to treatments you do NOT have confounding variables!

  • Randomization: Reduces bias by spreading extraneous variables to all groups in the experiment.

  • Blocking: Helps reduce variability. Another way to reduce variability is to increase sample size.

Random Variable

  • A numerical value that depends on the outcome of an experiment.

  • Discrete: A count of a random variable

  • Continuous: A measure of a random variable

Discrete Probability Distributions

  • Gives values & probabilities associated with each possible x.

  • μ_x = ∑ x_i · p(x_i)

  • σ²_x = ∑ (x_i − μ_x)² · p(x_i)

  • Calculator shortcut – 1-Var Stats L1, L2

  • Fair Game: A fair game is one in which the expected pay-out equals the pay-in (the expected net gain is zero).
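The two formulas above in code, applied to an invented raffle-style payout table:

```python
xs = [0, 10, 50]          # possible winnings (hypothetical)
ps = [0.90, 0.08, 0.02]   # their probabilities; must sum to 1

assert abs(sum(ps) - 1) < 1e-9

mu = sum(x * p for x, p in zip(xs, ps))                # expected value
var = sum((x - mu) ** 2 * p for x, p in zip(xs, ps))   # variance
sd = var ** 0.5

# "Fair game" check: a ticket priced at mu would make the expected net gain zero.
```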

Special Discrete Distributions

  • Binomial Distributions

    • Properties: two mutually exclusive outcomes, fixed number of trials (n), each trial is independent, the probability (p) of success is the same for all trials,

    • Random variable: is the number of successes out of a fixed # of trials. Starts at X = 0 and is finite.

    • μ_X = np

    • σ_x = \sqrt{npq}

    • Calculator:

      • binomialpdf (n, p, x) = single outcome P(X= x)

      • binomialcdf (n, p, x) = cumulative outcome P(X ≤ x)

      • 1 - binomialcdf (n, p, (x - 1)) = cumulative outcome P(X ≥ x)
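The calculator commands can be reproduced from the binomial formula P(X = k) = C(n, k)·p^k·(1−p)^(n−k); the n and p below are assumptions:

```python
import math

def binompdf(n, p, k):
    """P(X = k) for a binomial random variable."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomcdf(n, p, k):
    """P(X <= k), matching the calculator's binomialcdf."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

n, p = 10, 0.3                         # assumed trials and success probability
mu = n * p                             # mean = np
sd = math.sqrt(n * p * (1 - p))        # SD = sqrt(npq)

p_at_least_4 = 1 - binomcdf(n, p, 3)   # P(X >= 4) = 1 - P(X <= 3)
```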

  • Geometric Distributions

    • Properties: two mutually exclusive outcomes, each trial is independent, probability (p) of success is the same for all trials. (NOT a fixed number of trials)

    • Random Variable: the trial on which the FIRST success occurs. Starts at X = 1 and can go on indefinitely (no fixed upper limit).

    • Calculator:

      • geometricpdf (p, a) = single outcome P(X = a)

      • geometriccdf (p, a) = cumulative outcomes P(X ≤ a)

      • 1 - geometriccdf (p, (a - 1)) = cumulative outcome P(X ≥ a)
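The geometric counterpart, using P(X = k) = (1−p)^(k−1)·p for the first success on trial k (p is an assumed value):

```python
def geompdf(p, k):
    """P(X = k): first success occurs on trial k."""
    return (1 - p) ** (k - 1) * p

def geomcdf(p, k):
    """P(X <= k) = 1 - P(no success in the first k trials)."""
    return 1 - (1 - p) ** k

p = 0.25                             # assumed success probability
p_more_than_3 = 1 - geomcdf(p, 3)    # P(X > 3): the first 3 trials all fail

# the cdf is just the accumulated pdf values
assert abs(geomcdf(p, 3) - sum(geompdf(p, k) for k in range(1, 4))) < 1e-9
```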

  • Continuous Random Variable: numerical values that fall within a range or interval (measurements), use density curves where the area under the curve always = 1. To find probabilities, find area under the curve

  • Unusual Density Curves: any shape (triangles, etc.)

  • Uniform Distributions: uniformly (evenly) distributed, shape of a rectangle

  • Normal Distributions: symmetrical, unimodal, bell shaped curves defined by the parameters μ & σ

    • Calculator:

      • Normalpdf – used for graphing only

      • Normalcdf(lower bound, upper bound, μ, σ) – finds probability

      • InvNorm(p) – z-score OR InvNorm(p, μ, σ) – gives x-value

    • To assess Normality: Use graphs – dotplots, boxplots, histograms, or normal probability plot.

Distribution

  • Is all of the values of a random variable.

Sampling Distribution

  • Of a statistic: the distribution of the statistic's values from all possible samples of the same size. Use normalcdf to calculate probabilities – be sure to use the correct SD.

  • μ_x̄ = μ

  • σ_x̄ = σ/√n
    (standard deviation of the sample means)

  • μ_p̂ = p

  • σ_p̂ = √(pq/n)
    (standard deviation of the sample proportions)

  • μ_{x̄1 − x̄2} = μ_1 − μ_2

  • σ_{x̄1 − x̄2} = √(σ1²/n1 + σ2²/n2)
    (standard deviation of the difference in sample means)

  • μ_{p̂1 − p̂2} = p1 − p2

  • σ_{p̂1 − p̂2} = √(p1q1/n1 + p2q2/n2)
    (standard deviation of the difference in sample proportions)

  • μ_b = β; s_b (the standard error of the slope of the LSRL) does not need to be computed by hand – it is usually given in the computer printout.

  • Standard error – estimate of the standard deviation of the statistic

  • Central Limit Theorem: When n is sufficiently large (n > 30) the sampling distribution is approximately normal even if the population distribution is not normal.
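A CLT sketch: sample means drawn from a strongly skewed (exponential) population still stack up roughly normally with standard deviation σ/√n. All simulation settings here are arbitrary choices.

```python
import random
import statistics

random.seed(2)  # repeatable run
n, reps = 40, 5_000

# population: exponential with mu = 1 and sigma = 1 (heavily skewed right)
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

center = statistics.mean(means)   # should be near mu = 1
spread = statistics.stdev(means)  # should be near sigma / sqrt(n) ~= 0.158
```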

Confidence Intervals

  • Point Estimate: Uses a single statistic based on sample data, this is the simplest approach.

  • Confidence Intervals: Used to estimate the unknown population parameter.

  • Margin of Error: The smaller the margin of error, the more precise our estimate

  • Steps:

    • Assumptions – see table below

    • Calculations – C.I. = statistic ± critical value (standard deviation of the statistic)

    • Conclusion – Write your statement in context.

      • We are [x]% confident that the true [parameter] of [context] is between [a] and [b].

    • What makes the margin of error smaller:

      • make critical value smaller (lower confidence level).

      • get a sample with a smaller s.

      • make n larger.

  • T distributions compared to standard normal curve

    • centered around 0

    • more spread out and shorter

    • more area under the tails.

    • when you increase n, t-curves become more normal.

    • can be no outliers in the sample data

    • Degrees of Freedom = n – 1

  • Robust: if the assumption of normality is not met, the confidence level or p-value does not change much – this is true of t-distributions because there is more area in the tails
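The "statistic ± critical value · (standard deviation of the statistic)" recipe, sketched for a one-proportion z-interval with invented counts; the z* critical value comes from statistics.NormalDist:

```python
from math import sqrt
from statistics import NormalDist

x, n = 210, 500   # hypothetical successes and sample size
conf = 0.95

p_hat = x / n
z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # ~1.96 for 95%
me = z_star * sqrt(p_hat * (1 - p_hat) / n)         # margin of error
interval = (p_hat - me, p_hat + me)
```

The conclusion in context would read: we are 95% confident that the true proportion is between the two endpoints. (A t* critical value would be needed for a mean; t quantiles are not in Python's standard library.)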

Hypothesis Tests

  • Hypothesis Testing: Tells us if a value occurs by random chance or not. If it is unlikely to occur by random chance then it is statistically significant.

  • Null Hypothesis: H_0 is the statement being tested. Null hypothesis should be “no effect”, “no difference”, or “no relationship”

  • Alternate Hypothesis: H_a is the statement suspected of being true.

  • P-Value: Assuming the null is true, the probability of obtaining the observed result or more extreme

  • Level of Significance: α is the amount of evidence necessary before rejecting the null hypothesis.

  • Steps:

    • Assumptions – see table below

    • Hypotheses - don’t forget to define parameter

    • Calculations – find z or t test statistic & p-value

    • Conclusion – Write your statement in context.

      • Since the p-value is < (>) α, I reject (fail to reject) the Ho. There is (is not) sufficient evidence to suggest that [Ha].
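The same four steps, sketched for a one-proportion z-test (the counts and null value are invented), with the p-value taken from the standard normal CDF:

```python
from math import sqrt
from statistics import NormalDist

x, n, p0 = 290, 500, 0.5   # hypothetical data; H0: p = 0.5

p_hat = x / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)      # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided Ha: p != p0

alpha = 0.05
reject_h0 = p_value < alpha
```

Since the p-value here lands far below α = 0.05, the conclusion would be: reject H0; there is sufficient evidence to suggest p ≠ 0.5.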

Type I and II Errors and Power

  • Type I Error: Is when one rejects H0 when H0 is actually true. (probability is α)

  • Type II Error: Is when you fail to reject H0, and H0 is actually false. (probability is β)

  • α and β are inversely related. Consequences are the results of making a Type I or Type II error. Every decision has the possibility of making an error.

  • The Power of a Test: Is the probability that the test will reject the null hypothesis when the null hypothesis is actually false. Power = 1 – β

    • Effects of increasing each quantity (Type I error / Type II error / Power):

      • α: Increases / Decreases / Increases

      • n: Same / Decreases / Increases

      • |μ0 – μa|: Same / Decreases / Increases

χ2 Test

  • Is used to test counts of categorical data.

  • Types

    • Goodness of Fit (univariate)

    • Independence (bivariate)

    • Homogeneity (univariate 2 (or more) samples)

  • χ^2 distribution: All curves are skewed right, every df has a different curve, and as the degrees of freedom increase the χ^2 curve becomes more normal.

  • Goodness of Fit: Is for univariate categorical data from a single sample. Do the observed counts “fit” what we expect? Must use lists to perform, df = number of categories – 1, use χ^2cdf (χ^2, ∞, df) to calculate p-value

  • Independence: Bivariate categorical data from one sample. Are the two variables independent or dependent? Use matrices to calculate

  • Homogeneity: Single categorical variable from 2 (or more) samples. Are distributions homogeneous? Use matrices to calculate

  • For both χ^2 tests of independence & homogeneity:

    • Expected counts = {(\text{row total} \cdot \text{column total}) \over \text{grand total}}
      & df = (r – 1)(c – 1)
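Expected counts and the χ² statistic for a two-way table, using expected = (row total · column total) / grand total; the observed table is made up:

```python
observed = [[30, 20],
            [20, 30]]   # hypothetical 2x2 two-way table

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi_sq += (obs - expected) ** 2 / expected   # sum of (O - E)^2 / E

df = (len(observed) - 1) * (len(observed[0]) - 1)    # (r - 1)(c - 1)
```

The p-value would then come from χ²cdf(chi_sq, ∞, df) on the calculator; the chi-square CDF is not in Python's standard library.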

Regression Model

  • X & Y have a linear relationship where the true LSRL is μ_y = α + βx

  • The responses (y) are normally distributed for a given x-value.

  • The standard deviation of the responses (σy) is the same for all values of x. S is the estimate for σy

Confidence Interval

  • b ± t^* s_b

Hypothesis Testing

  • t = {b - β \over s_b} (with β = 0 under the usual null hypothesis of no linear relationship)

Assumptions

  • Proportions: z - procedures

    • One sample:

      • SRS from population

      • Can be approximated by normal distribution if np & n(1 – p) > 10

      • Population size is at least 10n

    • Two samples:

      • 2 independent SRS’s from populations (or randomly assigned treatments)

      • Can be approximated by normal distributions if n1p1, n1(1 – p1), n2p2, & n2(1 – p2) > 10

      • Population sizes are at least 10n

  • Means: t - procedures

    • One sample:

      • SRS from population

      • Distribution is approximately normal

        • Given

        • Large sample size

        • Graph of data is approximately symmetrical and unimodal with no outliers

    • Matched pairs:

      • SRS from population

      • Distribution of differences is approximately normal

        • Given

        • Large sample size

        • Graph of differences is approximately symmetrical and unimodal with no outliers

    • Two samples:

      • 2 independent SRS’s from populations (or randomly assigned treatments)

      • Distributions are approximately normal

        • Given

        • Large sample sizes

        • Graphs of data are approximately symmetrical and unimodal with no outliers

  • Counts: χ^2 - procedures

    • All types:

      • Reasonably random sample(s)

      • All expected counts > 5

        • Must show expected counts

  • Bivariate Data: t – procedures on slope

    • SRS from population

    • There is linear relationship between x & y.

    • Residual plot has no pattern.

    • The standard deviation of the responses is constant for all values of x.

    • Points are scattered evenly across the LSRL in the scatterplot.

    • The responses are approximately normally distributed.

    • Graph of residuals is approximately symmetrical & unimodal with no outliers.