AP Statistics Unit 1-8 Comprehensive Notes
Categorical vs. Quantitative Data
Quantitative Data:
Deals with numbers (quantity).
Examples: Heights, class size, population size.
Categorical Data:
Deals with names and labels.
Examples: Eye color, hair color (cannot be quantified numerically).
Two-Way Tables for Categorical Data
Representation of two variables and their intersections.
Example:
Variables: Math students, English students (rows) vs. Internal, External (columns).
The table shows the number of students in each category (e.g., math students who are internal).
Vocab:
Marginal Relative Frequency:
Percentage of data in a single row or column compared to the total.
Calculated as (row total / total) or (column total / total).
Joint Relative Frequency:
Percentage of data in a single group compared to the total.
Calculated as (single group / total).
Example: Math internal (3/23).
Example: external English (9/23).
Conditional Relative Frequencies:
Percentage of data in a single category when given a specific group.
Calculated as (cell count / row total) or (cell count / column total).
Example: Given the student is in math (total 10), the percentage that is internal is 3/10.
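The three calculations above can be sketched in Python. The math-internal count of 3, the math total of 10, and the grand total of 23 come from the examples; the remaining cell counts (external math = 7, internal English = 4) are inferred so the totals match.

```python
# Two-way table from the example above: rows = subject, columns = internal/external.
table = {
    ("math", "internal"): 3, ("math", "external"): 7,
    ("english", "internal"): 4, ("english", "external"): 9,
}
total = sum(table.values())   # grand total: 23 students

# Joint relative frequency: one cell over the grand total.
joint_math_internal = table[("math", "internal")] / total            # 3/23

# Marginal relative frequency: a row (or column) total over the grand total.
math_total = sum(v for (subject, _), v in table.items() if subject == "math")
marginal_math = math_total / total                                   # 10/23

# Conditional relative frequency: a cell over its row (or column) total.
cond_internal_given_math = table[("math", "internal")] / math_total  # 3/10
```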
Describing Quantitative Data (CSOCS Acronym)
C: Context
S: Shape
Symmetrical or skewed.
Number of peaks (unimodal, bimodal).
O: Outliers
Data points that are far from the rest of the data.
C: Center
Mean or median.
S: Spread
Range, standard deviation, IQR (interquartile range).
Tips:
Use descriptive language (strongly, roughly).
Use comparative language.
Basic Terms for Quantitative Data
Mean:
Sum of all values divided by the number of values (average).
Standard Deviation:
Measure of variation.
Interpretation: "The value/context typically varies by [standard deviation value] from the mean of [mean]."
Example: "The average PrepWorks subscriber's IQ typically varies by 5 IQ points from the mean of 169 IQ points."
Median:
50th percentile.
The value in the middle when data is organized from least to greatest.
Range:
Value of max minus min in the dataset.
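These four summaries map directly onto Python's standard `statistics` module; the dataset below is made up for illustration.

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9]   # hypothetical sample

mean = statistics.mean(data)          # sum of values / number of values
median = statistics.median(data)      # middle value when sorted
stdev = statistics.stdev(data)        # sample standard deviation (divides by n - 1)
data_range = max(data) - min(data)    # max minus min

print(mean, median, data_range)   # 6 6 6
```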
Box Plots and Five-Number Summary
Five-Number Summary:
Minimum: Smallest value.
Q1 (25th percentile): Median of the lower half of the data (below the overall median).
Median (50th percentile).
Q3 (75th percentile): Median of the upper half of the data (above the overall median).
Maximum Value.
IQR (Interquartile Range): IQR = Q3 − Q1.
Outliers:
Low-end outlier: Any value less than Q1 − 1.5 × IQR.
High-end outlier: Any value greater than Q3 + 1.5 × IQR.
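A sketch of the quartiles and the 1.5 × IQR outlier rule, using the "median of each half" quartile convention described above (other software may use slightly different conventions); the data are hypothetical.

```python
import statistics

data = sorted([1, 3, 5, 7, 9, 11, 13, 15, 40])   # hypothetical; 40 looks extreme

# Split around the median (for an odd count, the median itself is excluded),
# matching the "median of each half" definition of Q1 and Q3.
n = len(data)
lower, upper = data[: n // 2], data[(n + 1) // 2 :]
q1, q3 = statistics.median(lower), statistics.median(upper)
iqr = q3 - q1

# 1.5 * IQR fences for flagging outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q3, iqr, outliers)   # 4.0 14.0 10.0 [40]
```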
Percentiles and Cumulative Relative Frequency
Percentile: Percentage of values that are less than or equal to a specific value.
Cumulative Relative Frequency: Cumulative percentages from each interval up through all the data.
Relative Frequency: Proportion of times an outcome occurs. Calculated as (frequency of the outcome) / (total number of observations).
Visual Representation:
Data points are graphed. When no data exists, the graph plateaus.
Z-Scores
Number of standard deviations a value is away from the mean.
Equation: z = (value − mean) / standard deviation.
Transforming Data
Adding/Subtracting a Constant:
Shape and variability stay the same.
Center (mean, median) moves up or down by that amount.
Multiplying/Dividing by a Constant:
Shape stays the same.
Center and variability are multiplied or divided by that amount.
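Both transformation rules can be verified numerically; the dataset and constant below are hypothetical.

```python
import statistics

data = [10, 20, 30, 40]
c = 5

shifted = [x + c for x in data]   # adding a constant c
scaled = [x * c for x in data]    # multiplying by a constant c

mean_shift = statistics.mean(shifted)   # center moves up by c
sd_shift = statistics.stdev(shifted)    # variability unchanged
mean_scale = statistics.mean(scaled)    # center multiplied by c
sd_scale = statistics.stdev(scaled)     # variability multiplied by c

print(mean_shift, mean_scale)   # 30 and 125 (original mean is 25)
```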
Density Curves and Normal Distributions
Density Curve:
On or above the horizontal axis.
Area of one.
Shows probability distribution.
Normal distribution is a type of density curve.
Uniform Density Curve: Constant height (a rectangle); rarely seen, but the total area is still one.
Normal Distribution:
68-95-99.7 Rule:
68% of values are within one standard deviation of the mean.
95% of values are within two standard deviations of the mean.
99.7% of values are within three standard deviations of the mean.
Calculator Commands:
normalpdf: Height of the normal curve at a specific value (a density, not a probability).
normalcdf: Probability over a given interval.
invNorm (inverse normal): Finds the value that corresponds to a given percentile (area to the left).
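Python's standard library offers the same three operations via `statistics.NormalDist`; the IQ distribution below (mean 100, SD 15) is a hypothetical example.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)   # hypothetical normal population

# z-score: number of standard deviations away from the mean.
z = (130 - iq.mean) / iq.stdev            # (value - mean) / sd = 2.0

# normalcdf-style: probability over an interval.
within_one_sd = iq.cdf(115) - iq.cdf(85)  # about 0.68, matching the empirical rule

# inverse normal: value at a given percentile (area to the left).
p90 = iq.inv_cdf(0.90)                    # 90th percentile

print(z, round(within_one_sd, 3), round(p90, 1))   # 2.0 0.683 119.2
```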
Normal Probability Plot:
Plots actual values versus theoretical z-values.
Shows how well the data fits a normal distribution.
Roughly linear = roughly normal distribution.
Not linear = not approximately normal.
Describing Scatter Plots (SEDAW Acronym)
S: Context
E: Direction
Positive or negative.
D: State Outliers
A: Form
Linear or nonlinear.
W: Strength
Strongly linear, weakly linear.
Correlation Coefficient (r-Value)
Ranges from −1 to +1.
Closer to −1 or +1 = stronger linear correlation.
Examples:
r ≈ −0.9 or +0.9: Strong linear correlation.
r ≈ 0: No linear correlation.
r ≈ −0.3 or +0.3: Weak linear correlation.
Changing units or switching x and y axis: does not impact the correlation coefficient.
Effect of Outliers on r-Value
Outliers within the pattern of data: strengthen r.
Outliers outside the pattern of data: weaken r.
Correlation does not equal causation.
Regression Lines
Best fit line used to estimate values.
Equation: $\hat{y} = a + bx$
$\hat{y}$: Predicted value.
a: y-intercept (constant).
b: Slope.
Residual: Degree of error of the regression line prediction.
Calculated as actual value minus predicted value.
Negative residual: Overestimation.
Positive residual: Underestimation.
Interpreting Regression Lines
Focus on using proper language when interpreting slope, y-intercept, and residuals.
Explanatory Variable: Independent variable.
Response Variable: Dependent variable.
Least Squares Regression Line
Line that minimizes the sum of the squared residuals.
s-value: Standard deviation of the residuals; the typical distance of the actual values from the LSRL.
r-squared value: Coefficient of determination.
Percent of the variation in the response variable that can be explained by the explanatory variable.
Optimal LSRL: Low s-value, high r-squared value.
Computer Printouts
Memorize the location of y-intercept, slope, s-value, r-squared value.
Effect of Outliers on the Least Squares Regression Line
Outliers generally weaken the correlation.
Outliers far from the mean of x (in the horizontal direction):
Have high leverage: they pull the line toward themselves, changing the slope (and, with it, the y-intercept).
Outliers near the mean of x but above the line (in the vertical direction):
Slope stays roughly the same.
Y-intercept increases.
Outliers near the mean of x but below the line (in the vertical direction):
Slope stays roughly the same.
Y-intercept decreases.
Residual Plots
Plots residual values versus the explanatory variable.
Clear pattern: Linear function is unlikely to be best fit.
Unclear pattern: Linear function is likely to be best fit.
Sampling Methods
Simple Random Sample (SRS):
Randomly selected subset of the population.
All members have an equal chance of being selected.
Process:
Label individuals.
Randomize.
Select members.
Stratified Random Sample:
Split population into groups/strata based on shared traits (homogeneous groups).
Randomly select samples from each group.
Example: Divide school by grade level and pick students from each grade.
Cluster Sample:
Split population into groups/clusters.
Randomly select ENTIRE clusters to sample.
Example: Split a city into neighborhoods and survey everyone in a few randomly chosen neighborhoods.
Systematic Random Sample:
Select individuals at regular/set intervals starting at a random point.
Example: Survey every fifth person that walks into a building.
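Three of the random sampling methods can be sketched in a few lines of Python; the population of 100 "students" and the two strata are hypothetical.

```python
import random

random.seed(1)   # fixed seed so the example is reproducible
population = [f"student{i}" for i in range(100)]

# Simple random sample: every member equally likely to be chosen.
srs = random.sample(population, 10)

# Stratified random sample: split into strata, sample from each.
strata = [population[:50], population[50:]]   # e.g., two grade levels
stratified = [s for stratum in strata for s in random.sample(stratum, 5)]

# Systematic random sample: every 10th member from a random start.
start = random.randrange(10)
systematic = population[start::10]

print(len(srs), len(stratified), len(systematic))   # 10 10 10
```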
Bad Sampling Methods
Convenience Sample:
Choose people who are easy to reach.
Example: Surveying people at a nearby mall.
Voluntary Response Sampling:
Allow people to choose to participate.
Example: Online poll.
Shortcomings of Sampling
Undercoverage: Some groups are left out or underrepresented.
Example: Survey sent only to people with internet access; those lacking internet access are excluded.
Non-response: Selected individuals don't/can't respond.
Response Bias: People give false or misleading answers.
Wording in the question: Poorly phrased/biased question influences answers.
Observational Studies vs. Experiments
Observational Study:
Observe and collect data without influencing the subjects.
Example: Recording seatbelt usage.
Experiment:
Manipulate variables/apply treatments to observe and measure effects.
Principles: comparison, random assignment, controls, replication.
Key Vocab for Experiments
Factor: Explanatory/independent variable(s).
Level: Specific value/category of the factor.
Example: low vs. high levels of sunlight
Confounding: Another variable affects the results.
Placebo: Fake treatment; participants may respond favorably simply because they believe they are being treated.
Single Blind: Subjects don't know which group they're in, researchers do.
Double Blind: Neither subjects nor researchers know who gets what treatment.
Experimental Designs
Randomized Block Design:
Subjects divided into blocks/groups based on specific characteristics.
Within each block, subjects are randomly assigned to treatments.
Block: Group of experimental units sharing the same characteristic.
Example: PrepWorks subscribers vs. non-subscribers.
Matched Pairs Design:
Subjects are paired based on specific characteristics.
Within each pair, treatments are randomly assigned (or one subject in each pair serves as a control).
Example: Pairing male vs. female
Definition and Interpretation of Probability
Chance something happens. Scale runs from 0 to 1.
0: Impossible.
1: 100% chance of happening.
Unpredictable in the short term due to randomness.
Predictable in the long term after many repeated trials. Results approach true probability.
Simulation
Used to model and mimic real events to estimate probabilities in a variety of situations. For example, predicting the weather forecast.
Four-step process:
Define the problem and question of interest.
Use random numbers to model the situation.
Run the simulation many times.
Use the results to estimate the probability.
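The four-step process above, applied to a hypothetical question: what is the probability of getting at least 8 heads in 10 fair coin flips?

```python
import random

random.seed(0)   # fixed seed so the run is reproducible

# Model each flip as a random 0/1 digit; repeat the 10-flip experiment many times.
trials = 10_000
hits = 0
for _ in range(trials):
    heads = sum(random.randint(0, 1) for _ in range(10))
    if heads >= 8:
        hits += 1

# Estimate the probability from the results.
estimate = hits / trials   # close to the exact answer 56/1024, about 0.055
print(estimate)
```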
Mutually Exclusive Events
Events have no overlap and cannot occur at the same time, so P(A and B) = 0.
Probability of A: the number of outcomes in A divided by the total number of outcomes.
Probability of B: the number of outcomes in B divided by the total number of outcomes.
Independent: The outcome of one event doesn't affect the outcome of the other.
Independence
Two events are independent if the outcome of one does not affect the outcome of the other
The visualization methods:
Venn diagram: represents a set of objects; shows probabilities in and between events.
Two-way table: visual tool that is useful for analyzing categorical data; presents a summary of the relationship between the two.
Probability tree: displays all possible outcome combinations and their associated probabilities; the branches must start with the probabilities of the first event.
Transformation of Probability Distributions
Add or subtract the same constant c to each value in the dataset: shape unchanged; center increases or decreases by c; variability unchanged.
Multiply or divide each value by the same constant c: shape unchanged; center is multiplied or divided by c; variability is multiplied or divided by c.
Equations: $\mu_{aX+b} = a\mu_X + b$; $\sigma_{aX+b} = |a|\sigma_X$.
Types of Random Variables
Discrete random variable:
Takes specific, countable values. For example, the number of heads in several coin flips, or cars sold per day.
Continuous random variable:
Takes any value within a range or interval. For example, time or height; not just 1 or 2, but any value in between.
Binomial Random Variable:
Counts successes in a fixed number of independent trials, each with the same probability of success.
BINS: Binary outcomes, Independent trials, fixed Number of trials, same probability of Success for each trial.
Calculator commands: binompdf, binomcdf.
Geometric Random Variable:
Counts the number of trials needed to achieve the first success in a series of independent trials. For example, the number of coin flips needed to get the first head.
Calculator commands: geometpdf, geometcdf (the probability of success p must be constant and the trials independent).
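The calculator commands correspond to short formulas. Below is a sketch of `binompdf`, `binomcdf`, and `geometpdf` built from the definitions above; the function names mirror the calculator commands and are not a library API.

```python
from math import comb

def binompdf(n, p, k):
    """P(X = k) for a binomial: exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomcdf(n, p, k):
    """P(X <= k): cumulative binomial probability."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

def geometpdf(p, k):
    """P(first success on trial k): (k - 1) failures, then a success."""
    return (1 - p) ** (k - 1) * p

print(binompdf(10, 0.5, 8))   # 45/1024, about 0.0439
print(geometpdf(0.5, 3))      # 0.125: tails, tails, heads
```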
Differentiating Statistic vs. Parameter
Statistic:
A number that describes something from sample data.
Parameter:
A number that describes something from the entire population of data.
Core concept to remember
The mean of the sampling distribution of an unbiased statistic equals the population parameter; for example, the mean of all possible sample means equals the population mean.
Sampling Distribution of Proportions
Repeated samples of the same size from the population, finding the proportion of each sample, then plotting info on a distribution.
Formulas: $\mu_{\hat{p}} = p$; $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$.
Increasing the sample size will decrease your variability.
Conditions:
Random: the data come from random sampling (or random assignment in an experiment).
10% condition: the sample is at most 10% of the population, $n \le 0.10N$.
Large counts: $np \ge 10$ and $n(1-p) \ge 10$.
Probability calculation with z value
Sampling distribution is approximately normal.
Z-value formula: $z = (\hat{p} - p) / \sqrt{p(1-p)/n}$.
Calculator command: normalcdf.
Mean and variance formulas for combining random variables: $\mu_{X+Y} = \mu_X + \mu_Y$; $\mu_{X-Y} = \mu_X - \mu_Y$; for independent X and Y, $\sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y$ and $\sigma^2_{X-Y} = \sigma^2_X + \sigma^2_Y$.
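A sketch tying the sampling-distribution formulas, the conditions, and a normalcdf-style probability together; p = 0.6, n = 100, and N = 5000 are hypothetical values.

```python
from math import sqrt
from statistics import NormalDist

p, n, N = 0.6, 100, 5000   # hypothetical population proportion and sizes

# Conditions: 10% (n <= 0.10 * N) and large counts (np >= 10, n(1-p) >= 10).
assert n <= 0.10 * N and n * p >= 10 and n * (1 - p) >= 10

mean_phat = p                      # mean of the sampling distribution of p-hat
sd_phat = sqrt(p * (1 - p) / n)    # standard deviation of p-hat

# P(p-hat > 0.65) via a z-score and the normal cdf (normalcdf on a calculator).
z = (0.65 - mean_phat) / sd_phat
prob = 1 - NormalDist().cdf(z)

print(round(sd_phat, 3), round(z, 2), round(prob, 3))   # 0.049 1.02 0.154
```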
CENTRAL LIMIT THEOREM: If the sample size is at least 30, the sampling distribution of the sample mean is approximately normal, regardless of the population's shape.
What is the parameter of interest?
*Point estimate: the single best guess for the parameter.
*Confidence interval: point estimate ± margin of error.
*Confidence levels: increasing the confidence level increases the margin of error. The margin of error does not account for bias.
*Increasing the sample size reduces the margin of error, making the interval narrower.
PANIC (steps for inference):
*Check the parameter of interest.
*Assumptions/conditions: the data must be random; check the 10% condition, then large counts (successes and failures each at least 10).
*One-sample z-test / z-interval for p; for two samples (p1 − p2), the data must be random for both and the same conditions checked for both samples.
*Name the test, state the distribution, and verify the assumptions/conditions.
What's the alpha value? The significance level, commonly 0.05.
Check the hypotheses: state H0 (null) and Ha (alternative) in context, then decide whether to reject or fail to reject H0.
*Type 1 error: rejecting H0 when it is actually true (false positive); its probability equals the alpha value.
*Type 2 error: failing to reject H0 when it is actually false (false negative); its probability is beta.
*Power = 1 − P(Type 2 error): the probability of correctly rejecting H0 when H0 is false.
Ways to increase power:
increase the alpha value,
increase the sample size, and
increase the distance between the null and alternative hypothesis values.
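A minimal one-proportion z-test sketch using the standard library; the hypotheses and counts (H0: p = 0.5 vs. Ha: p > 0.5, with 60 successes in 100 trials) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

p0, successes, n, alpha = 0.5, 60, 100, 0.05   # hypothetical test setup

p_hat = successes / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # test statistic uses p0, not p_hat
p_value = 1 - NormalDist().cdf(z)            # one-sided p-value for Ha: p > p0

reject = p_value < alpha                     # compare p-value to alpha
print(round(z, 2), round(p_value, 4), reject)   # 2.0 0.0228 True
```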
Types of Chi-Squared Tests
***Goodness-of-fit test: does the distribution of one categorical variable differ from an expected/claimed distribution? Conditions: random sample, all expected counts at least 5.
Goodness of fit: degrees of freedom = number of categories − 1.
***The larger the chi-squared statistic, the larger the discrepancy between observed and expected counts, and the stronger the evidence against the null hypothesis.
***Test of homogeneity: do two or more groups share the same distribution of one categorical variable? Uses separate random samples and one variable.
*Hypotheses: H0 = no difference between the groups' distributions; Ha = at least one group's distribution differs.
*Expected count formula: (row total × column total) / table total; each expected count should be at least 5.
***Test of independence/association: are two categorical variables associated? Same statistic and expected counts as homogeneity, but uses one sample with two variables rather than separate samples.
Conclusion: compare the p-value to alpha, then reject or fail to reject H0 in context.
Assumptions of conditions
All tests require: random sampling or random assignment, sample at most 10% of the population, and all expected counts at least 5.
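A goodness-of-fit sketch under the conditions above; the die-roll counts are made up, and the critical value 11.07 is the standard chi-square table value for df = 5 at alpha = 0.05.

```python
# Hypothetical goodness-of-fit test: is a six-sided die fair?
observed = [12, 8, 11, 9, 10, 10]   # 60 rolls
expected = [60 / 6] * 6             # all expected counts = 10 (each >= 5)

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1              # categories minus 1 = 5

# Critical value for df = 5 at alpha = 0.05 (from a chi-square table): 11.07.
reject = chi_sq > 11.07
print(round(chi_sq, 2), df, reject)   # 1.0 5 False
```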
Inference for Slopes
***Assumptions/conditions (5 of them): Linear relationship between x and y, Independent observations (10% condition), Normal residuals, Equal standard deviation of residuals for all x, Random sample or random assignment.
***Check the residual plot: it should show no clear pattern and no increasing spread.
***If the conditions are not met, confidence intervals and tests for the slope are not reliable.
***For experiments, the randomness condition is satisfied by random assignment of treatments.