AP Statistics Unit 1-8 Comprehensive Notes

Categorical vs. Quantitative Data

  • Quantitative Data:

    • Deals with numbers (quantity).

    • Examples: Heights, class size, population size.

  • Categorical Data:

    • Deals with names and labels.

    • Examples: Eye color, hair color (cannot be quantified numerically).

Two-Way Tables for Categorical Data

  • Representation of two variables and their intersections.

  • Example:

    • Variables: Math students, English students (rows) vs. Internal, External (columns).

    • The table shows the number of students in each category (e.g., math students who are internal).

  • Vocab:

    • Marginal Relative Frequency:

      • Percentage of data in a single row or column compared to the total.

      • Calculated as b/d (row total / total) or c/d (column total / total).

    • Joint Relative Frequency:

      • Percentage of data in a single group compared to the total.

      • Calculated as a/d (single group / total).

      • Example: Math internal (3/23).

      • Example: English external (9/23).

    • Conditional Relative Frequencies:

      • Percentage of data in a single category when given a specific group.

      • Calculated as a/b or a/c.

      • Example: Given the student is in math (total 10), the percentage that is internal is 3/10.
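The three relative frequencies can be sketched in Python. The table below is a hypothetical layout consistent with the examples in the notes (Math internal = 3, Math total = 10, English external = 9, grand total = 23):

```python
# A two-way table consistent with the examples in the notes (assumed cell values).
table = {
    "Math":    {"Internal": 3, "External": 7},
    "English": {"Internal": 4, "External": 9},
}

grand_total = sum(sum(row.values()) for row in table.values())  # 23

# Marginal relative frequency: a row (or column) total over the grand total.
math_marginal = sum(table["Math"].values()) / grand_total       # 10/23

# Joint relative frequency: one cell over the grand total.
math_internal_joint = table["Math"]["Internal"] / grand_total   # 3/23

# Conditional relative frequency: one cell over its row (or column) total.
internal_given_math = table["Math"]["Internal"] / sum(table["Math"].values())  # 3/10
```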

Describing Quantitative Data (CSOCS Acronym)

  • C: Context

  • S: Shape

    • Symmetrical or skewed.

    • Number of peaks (unimodal, bimodal).

  • O: Outliers

    • Data points that are far from the rest of the data.

  • C: Center

    • Mean or median.

  • S: Spread

    • Range, standard deviation, IQR (interquartile range).

  • Tips:

    • Use descriptive language (strongly, roughly).

    • Use comparative language.

Basic Terms for Quantitative Data

  • Mean:

    • Sum of all values divided by the number of values (average).

  • Standard Deviation:

    • Measure of variation.

    • Interpretation: "The value/context typically varies by [standard deviation value] from the mean of [mean]."

    • Example: "The average PrepWorks subscriber's IQ typically varies by 5 IQ points from the mean of 169 IQ points."

  • Median:

    • 50th percentile.

    • The value in the middle when data is organized from least to greatest.

  • Range:

    • Value of max minus min in the dataset.
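These four summary statistics can be computed with Python's standard library (the dataset is a made-up example):

```python
import statistics

# Hypothetical small dataset for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)        # sum of values / number of values = 40/8 = 5
median = statistics.median(data)    # middle value of the sorted data = 4.5
stdev = statistics.stdev(data)      # sample standard deviation (measure of variation)
data_range = max(data) - min(data)  # max minus min = 9 - 2 = 7
```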

Box Plots and Five-Number Summary

  • Five-Number Summary:

    • Minimum: Smallest value.

    • Q1 (25th percentile): Median of the lower half of the data (between the minimum and the median).

    • Median (50th percentile).

    • Q3 (75th percentile).

    • Maximum Value.

  • IQR (Interquartile Range):

    • Q3 - Q1

  • Outliers:

    • Low-end outlier: Any value less than Q1 - 1.5 × IQR.

    • High-end outlier: Any value greater than Q3 + 1.5 × IQR.
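A sketch of the five-number summary and outlier fences, using a made-up dataset and the AP convention that Q1/Q3 are the medians of the lower and upper halves (excluding the overall median when n is odd):

```python
import statistics

# Hypothetical dataset; 100 is planted as a high-end outlier.
data = [7, 15, 36, 39, 40, 41, 100]

def five_number_summary(values):
    values = sorted(values)
    n = len(values)
    mid = n // 2
    lower = values[:mid]                            # lower half
    upper = values[mid + 1:] if n % 2 else values[mid:]  # upper half
    return (values[0], statistics.median(lower),
            statistics.median(values), statistics.median(upper), values[-1])

mn, q1, med, q3, mx = five_number_summary(data)
iqr = q3 - q1                 # Q3 - Q1
low_fence = q1 - 1.5 * iqr    # values below this are low-end outliers
high_fence = q3 + 1.5 * iqr   # values above this are high-end outliers
outliers = [v for v in data if v < low_fence or v > high_fence]
```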

Percentiles and Cumulative Relative Frequency

  • Percentile: Percentage of values that are less than or equal to a specific value.

  • Cumulative Relative Frequency: Cumulative percentages from each interval up through all the data.

  • Relative Frequency: Chance of something occurring. Calculated as occurrence/frequency over total.

  • Visual Representation:

    • Data points are graphed. When no data exists, the graph plateaus.

Z-Scores

  • Number of standard deviations a value is away from the mean.

  • Equation: z = (value - mean) / standard deviation
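As a quick sketch, using the notes' IQ example (mean 169, standard deviation 5) with an assumed value of 179:

```python
# z-score: number of standard deviations a value lies from the mean.
def z_score(value, mean, sd):
    return (value - mean) / sd

z = z_score(179, 169, 5)   # (179 - 169) / 5 = 2.0, i.e. 2 SDs above the mean
```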

Transforming Data

  • Adding/Subtracting a Constant:

    • Shape and variability stay the same.

    • Center (mean, median) moves up or down by that amount.

  • Multiplying/Dividing by a Constant:

    • Shape stays the same.

    • Center and variability are multiplied or divided by that amount.

Density Curves and Normal Distributions

  • Density Curve:

    • On or above the horizontal axis.

    • Area of one.

    • Shows probability distribution.

    • Normal distribution is a type of density curve.

  • Uniform Density Curve: Rare; constant height (rectangular) with total area of one.

  • Normal Distribution:

    • 68-95-99.7 Rule:

      • 68% of values are within one standard deviation of the mean.

      • 95% of values are within two standard deviations of the mean.

      • 99.7% of values are within three standard deviations of the mean.

  • Calculator Commands:

    • normalpdf: Height of the density curve at a specific value (used for graphing, not for probability).

    • normalcdf: Probability over a set interval.

    • Inverse normal: Finds a value that corresponds to a given percentile (area).

  • Normal Probability Plot:

    • Plots actual values versus theoretical z-values.

    • Shows how well the data fits a normal distribution.

    • Roughly linear = roughly normal distribution.

    • Not roughly linear = not approximately normal.
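The calculator commands above have standard-library analogs in Python's `statistics.NormalDist`. A sketch, with an assumed distribution (mean 100, SD 15); note the interval within one SD recovers the 68% from the empirical rule:

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)   # hypothetical normal distribution

density = dist.pdf(100)                   # like normalpdf: curve height at a value
between = dist.cdf(115) - dist.cdf(85)    # like normalcdf: P(85 < X < 115) ≈ 0.68
cutoff = dist.inv_cdf(0.90)               # like inverse normal: 90th-percentile value
```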

Describing Scatter Plots (SEDAW Acronym)

  • S: Context

  • E: Direction

    • Positive or negative.

  • D: State Outliers

  • A: Form

    • Linear or nonlinear.

  • W: Strength

    • Strongly linear, weakly linear.

Correlation Coefficient (r-Value)

  • Ranges from -1 to 1.

    • Closer to -1 or 1 = stronger linear correlation.

  • Examples:

    • r = -0.97: Strong linear correlation.

    • r = 0: No correlation.

    • r = 0.21: Weak linear correlation.

  • Changing units or switching the x and y axes does not impact the correlation coefficient.

Effect of Outliers on r-Value

  • Outliers within the pattern of data: strengthen r.

  • Outliers outside the pattern of data: weaken r.

  • Correlation does not equal causation.

Regression Lines

  • Best fit line used to estimate values.

  • Equation: $\hat{y} = a + bx$

    • $\hat{y}$: Predicted value.

    • a: y-intercept (constant).

    • b: Slope.

  • Residual: Degree of error of the regression line prediction.

    • Calculated as actual value minus predicted value.

    • Negative residual: Overestimation.

    • Positive residual: Underestimation.

Interpreting Regression Lines

  • Focus on using proper language when interpreting slope, y-intercept, and residuals.

  • Explanatory Variable: Independent variable.

  • Response Variable: Dependent variable.

Least Squares Regression Line

  • Line that minimizes the sum of the squared residuals.

  • s-value: Typical distance the actual values fall from the LSRL (the standard deviation of the residuals).

  • r-squared value: Coefficient of determination.

    • Percent of the variation in the response variable that can be explained by the explanatory variable.

  • Optimal LSRL: Low s-value, high r-squared value.

Computer Printouts

  • Memorize the location of y-intercept, slope, s-value, r-squared value.

Effect of Outliers on the Least Squares Regression Line

  • Outliers decrease correlation.

  • Outliers far away from the mean of y (horizontal line):

    • Decrease slope.

    • Increase y-intercept.

  • Outliers above the mean of x (vertical line):

    • Slope stays the same.

    • Y-intercept decreases.

  • Outliers below the mean of x (vertical line):

    • Slope stays the same.

    • Y-intercept increases.

Residual Plots

  • Plots residual values versus the explanatory variable.

  • Clear pattern: Linear function is unlikely to be best fit.

  • Unclear pattern: Linear function is likely to be best fit.

Sampling Methods

  • Simple Random Sample (SRS):

    • Randomly selected subset of the population.

    • All members have an equal chance of being selected.

    • Process:

      • Label individuals.

      • Randomize.

      • Select members.

  • Stratified Random Sample:

    • Split population into groups/strata based on shared traits (homogeneous groups).

    • Randomly select samples from each group.

    • Example: Divide school by grade level and pick students from each grade.

  • Cluster Sample:

    • Split population into groups/clusters.

    • Randomly select ENTIRE clusters to sample.

    • Example: Split a city into neighborhoods and survey everyone in a few randomly chosen neighborhoods.

  • Systematic Random Sample:

    • Select individuals at regular/set intervals starting at a random point.

    • Example: Survey every fifth person that walks into a building.
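Three of the methods above can be sketched with Python's `random` module, on a hypothetical labeled population of 100 (the seed is only for reproducibility of the sketch):

```python
import random

random.seed(1)
population = list(range(1, 101))   # hypothetical labeled population of 100

# Simple random sample: every member has an equal chance of selection.
srs = random.sample(population, 10)

# Systematic random sample: random starting point, then every 10th member.
start = random.randrange(10)
systematic = population[start::10]

# Cluster sample: split into 10 clusters, survey EVERYONE in 2 random clusters.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
chosen = random.sample(clusters, 2)
cluster_sample = [person for cluster in chosen for person in cluster]
```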

Bad Sampling Methods

  • Convenience Sample:

    • Choose people who are easy to reach.

    • Example: Surveying people at a nearby mall.

  • Voluntary Response Sampling:

    • Allow people to choose to participate.

    • Example: Online poll.

Shortcomings of Sampling

  • Undercoverage: Some groups are left out or underrepresented.

    • Example: Survey sent only to people with internet access; those lacking internet access are excluded.

  • Non-response: Selected individuals don't/can't respond.

  • Response Bias: People give false or misleading answers.

  • Wording in the question: Poorly phrased/biased question influences answers.

Observational Studies vs. Experiments

  • Observational Study:

    • Observe and collect data without influencing the subjects.

    • Example: Recording seatbelt usage.

  • Experiment:

    • Manipulate variables/apply treatments to observe and measure effects.

    • Principles: comparison, random assignment, controls, replication.

Key Vocab for Experiments

  • Factor: Explanatory/independent variable(s).

  • Level: Specific value/category of the factor.

    • Example: low vs. high levels of sunlight

  • Confounding: Another variable affects the results.

  • Placebo: Fake treatment; participants may still respond favorably to it (the placebo effect).

  • Single Blind: Subjects don't know which group they're in; researchers do.

  • Double Blind: Neither subjects nor researchers know who gets what treatment.

Experimental Designs

  • Randomized Block Design:

    • Subjects divided into blocks/groups based on specific characteristics.

    • Each block is randomly assigned a treatment.

    • Block: Group with experimental units with same characteristic.

      • Example: PrepWorks subscribers vs. not subscribers.

  • Matched Pairs Design:

    • Subjects are paired based on specific characteristics.

    • Each pair is randomly assigned a treatment, or one subject in each pair is a control.

      • Example: Pairing male vs. female

Definition and Interpretation of Probability

  • Chance something happens, on a scale from 0 to 1.

    • 0: Impossible.

    • 1: Certain (100% chance of happening).

  • Unpredictable in the short term due to randomness.

  • Predictable in the long term after many repeated trials. Results approach true probability.

Simulation

  • Used to model and mimic real events to estimate probabilities in a variety of situations, such as forecasting the weather.

  • Four-step process:

    • Define the problem/question.

    • Use random numbers to model the situation.

    • Run/perform the simulation.

    • Use the results to estimate probabilities.
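The four-step process can be sketched for a simple assumed question, estimating P(at least one head in 3 coin flips), whose true value is 1 - (1/2)³ = 0.875:

```python
import random

random.seed(42)   # fixed seed so the sketch is reproducible

trials = 10_000
successes = 0
for _ in range(trials):                                 # run the simulation
    flips = [random.choice("HT") for _ in range(3)]     # model with random outcomes
    if "H" in flips:
        successes += 1

estimate = successes / trials   # use the results to estimate the probability
```

With many trials, the long-run estimate approaches the true probability of 0.875.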

Mutually Exclusive Events

  • Events have no overlap and cannot occur at the same time, so P(A or B) = P(A) + P(B).

    • Probability of A: the number of outcomes in A divided by the total number of outcomes.

    • Probability of B: the number of outcomes in B divided by the total number of outcomes.

    • Independent: The outcome of one event doesn't affect the outcome of the other.

Independence

  • Two events are independent if the outcome of one does not affect the outcome of the other

  • The visualization methods:

    • Venn diagram: represents a set of objects; shows probabilities in and between events.

    • Two-way table: visual tool that is useful for analyzing categorical data; presents a summary of the relationship between the two.

    • Probability tree: displays all possible outcome combinations and their associated probabilities; the tree must start with the probability of the first event.

Transformation of Probability Distributions

  • Adding or subtracting the same constant c to each value: shape unchanged; center increases or decreases by c; variability unchanged.

  • Multiplying or dividing each value by the same constant c: shape unchanged; center is multiplied or divided by c; variability is multiplied or divided by c.

  • Equations: $\mu_{X+Y} = \mu_X + \mu_Y$, $\mu_{X-Y} = \mu_X - \mu_Y$

Types of Random Variables

  • Discrete Random Variable:

    • Takes on specific, countable values. For example, number of heads, or cars sold per day.

    • Mean (expected value): $\mu_X = \sum x \cdot p(x)$

  • Continuous Random Variable:

    • Takes on any value within a range or interval. So not just 1 or 2; it includes everything between 1 and 2. For example, time or height.

  • Binomial Random Variable:

    • Counts successes in a fixed number of independent trials.

    • BINS: Binary outcomes, Independent trials, Number of trials fixed, Same probability of success for each trial.

    • Calculator commands: binompdf, binomcdf.

  • Geometric Random Variable:

    • Counts the number of trials needed to achieve the first success in a series of independent trials. For example, flipping a coin to see how many flips it takes to get the first head.

    • Calculator commands: geometpdf, geometcdf. The probability of success per trial must be constant, and trials must be independent.
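Stdlib analogs of the binompdf/geometpdf calculator commands can be sketched directly from the formulas (the coin-flip numbers are assumed examples):

```python
from math import comb

def binom_pdf(n, p, k):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geomet_pdf(p, k):
    """P(first success occurs on trial k), trials independent, constant p."""
    return (1 - p)**(k - 1) * p

# P(exactly 2 heads in 4 fair-coin flips) = C(4,2) / 2^4 = 6/16 = 0.375
p_two_heads = binom_pdf(4, 0.5, 2)

# P(first head on the third flip) = (1/2)^2 * (1/2) = 0.125
p_first_on_third = geomet_pdf(0.5, 3)
```

The cdf variants would just sum these pdf values over k.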

Differentiating Statistic vs. Parameter

  • Statistic:

    • A number that describes something from sample data.

  • Parameter:

    • A number that describes something from the entire population of data.

Core concept to remember

  • The mean of a statistic's sampling distribution should equal the population parameter; when it does, the statistic is an unbiased estimator of the parameter.

Sampling Distribution of Proportions

  • Repeated samples of the same size from the population, finding the proportion of each sample, then plotting info on a distribution.

  • Formulas:

    • Mean: $\mu_{\hat{p}} = p$

    • Standard Deviation: $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$

    • Increasing the sample size will decrease your variability
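The formulas above can be sketched numerically (p = 0.6 and n = 100 are assumed example values); note how quadrupling the sample size halves the standard deviation:

```python
from math import sqrt

p, n = 0.6, 100   # hypothetical population proportion and sample size

mean_p_hat = p                        # mu_p_hat = p
sd_p_hat = sqrt(p * (1 - p) / n)      # sigma_p_hat = sqrt(p(1-p)/n) ≈ 0.049

# Increasing the sample size decreases variability:
sd_p_hat_4n = sqrt(p * (1 - p) / (4 * n))   # exactly half of sd_p_hat
```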

Conditions for Inference

  • Need random sampling (or random assignment).

  • 10% condition: the sample is at most 10% of the population, $n \le 0.10N$.

  • Large counts: $np \ge 10$ and $n(1-p) \ge 10$.

Probability Calculations with z-Values

  • Sampling distribution is approximately normal.

  • z-value formula: $z = \frac{\hat{p} - p}{\sigma_{\hat{p}}}$

    • Calculator command: normalcdf.

  • Combining independent random variables:

    • Means: $\mu_{X+Y} = \mu_X + \mu_Y$; $\mu_{X-Y} = \mu_X - \mu_Y$

    • Variances: $\sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y$; $\sigma^2_{X-Y} = \sigma^2_X + \sigma^2_Y$

Key Components of the Sampling Distribution of the Mean

  • Take repeated samples of the same size from the population, calculate the mean of each sample, and plot the means as a distribution (a probability distribution).

  • Increasing the sample size decreases variability.

  • Conditions: 10% condition ($n \le 0.10N$) and large counts ($np \ge 10$, $n(1-p) \ge 10$).

  • Central Limit Theorem: If the sample size is at least 30, the sampling distribution of the sample mean is approximately normal regardless of the population's shape.

Confidence Intervals

  • Parameter of interest: the population value being estimated.

  • Point estimate: the sample statistic used to estimate the parameter.

  • Confidence interval: point estimate ± margin of error.

  • Confidence level: increasing it increases the margin of error (wider interval). The margin of error accounts for sampling variability, not bias.

  • Increasing the sample size reduces the margin of error, making the interval narrower.

PANIC (Confidence Interval Steps)

  • P: Parameter - define the parameter of interest.

  • A: Assumptions/conditions - random sample, 10% condition, large counts (at least 10 successes and 10 failures).

  • N: Name the interval.

  • I: Interval - calculate it.

  • C: Conclusion in context.

One-Sample z-Test / z-Interval for p (Two Samples: p1 - p2)

  • For two-sample procedures, check the same conditions (including randomness) for both samples.

  • Name the test, state the distribution, and check the assumptions/conditions.

  • State the alpha value (significance level).

  • State the null hypothesis H0 and the alternative hypothesis Ha, then compare the p-value to alpha to decide whether to reject or fail to reject H0.

  • Type I error: rejecting H0 when H0 is actually true (false positive); its probability equals the alpha value.

  • Type II error: failing to reject H0 when H0 is actually false (false negative).

  • Power = 1 - P(Type II error): the probability of correctly rejecting H0 when H0 is false.

  • Ways to increase power:

    • Increase the alpha value.

    • Increase the sample size.

    • Increase the distance between the null and alternative hypothesis values.
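The decision step can be sketched as a one-sample z-test for a proportion; all the numbers (H0: p = 0.5 vs Ha: p > 0.5, 60 successes in 100 trials, alpha = 0.05) are assumed for illustration:

```python
from math import sqrt
from statistics import NormalDist

p0, successes, n, alpha = 0.5, 60, 100, 0.05   # hypothetical test setup

p_hat = successes / n
se = sqrt(p0 * (1 - p0) / n)          # standard deviation of p_hat under H0
z = (p_hat - p0) / se                 # test statistic: z = 0.1 / 0.05 = 2.0
p_value = 1 - NormalDist().cdf(z)     # one-sided p-value ≈ 0.023

reject_h0 = p_value < alpha           # True: evidence against H0 at alpha = 0.05
```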

Types of Chi-Squared Tests

  • Test for Goodness of Fit: Does one categorical variable's observed distribution differ from an expected/claimed distribution? Uses one variable from one sample. Conditions: random sample and all expected counts at least 5.

    • Degrees of freedom = number of categories - 1.

  • A larger chi-squared statistic means a greater discrepancy between observed and expected counts (stronger evidence against the claimed distribution).

  • Test of Homogeneity: Compares the distribution of one categorical variable across two or more groups/samples.

    • Hypotheses: H0 = no difference among the group distributions; Ha = the distributions differ.

    • Expected count = (row total × column total) / table total; all expected counts must be at least 5.

  • Test of Independence/Association: Tests whether two categorical variables from a single sample are associated.

  • Conclusion: Compare the p-value to alpha and reject or fail to reject H0 in context.
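The goodness-of-fit statistic can be sketched directly from its formula; the die-roll counts below are hypothetical:

```python
# Chi-squared goodness-of-fit statistic for 60 rolls of a (claimed fair) die.
observed = [12, 8, 10, 9, 11, 10]
expected = [60 / 6] * 6            # 10 per face under H0: the die is fair

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # = 1.0
df = len(observed) - 1             # categories minus one = 5

# Compare chi_sq to a table critical value (9.236 at alpha = 0.10, df = 5);
# 1.0 is far below it, so fail to reject H0 here.
```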

Assumptions/Conditions for Chi-Squared Tests

  • Random sampling (or random assignment), 10% condition (n at most 10% of N), and all expected counts at least 5.

Inference for Slopes

  • Assumptions/conditions (five in total, LINER): Linear relationship between x and y, Independent observations, Normal residuals, Equal variance of residuals, Random sampling or assignment.

  • The residual plot should show no pattern (e.g., no increasing spread) for a linear model to be appropriate.

  • If the conditions are not met, do not rely on the resulting interval or test.

  • For experiments, treatments must be randomly assigned.