AP Statistics Unit 1-9 Notes
Categorical vs. Quantitative Data
Quantitative Data: Deals with numbers (quantity).
Examples: Heights, class size, population size.
Categorical Data: Deals with names and labels.
Examples: Eye color, hair color.
Representing Categorical Data: Two-Way Tables
Two variables, one along the rows and one along the columns; each cell shows the count where a row category and a column category intersect.
Example: Math vs. English students, and Internal vs. External students.
Vocab Terms:
Marginal Relative Frequency: Percentage of data in a single row or column compared to the total.
Looks at (row total)/(grand total) or (column total)/(grand total) in a two-way table.
Joint Relative Frequency: Percentage of data in a single group compared to the total.
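Looks at $P(A \text{ and } B)$ = (cell count)/(grand total), where the cell is the intersection of one row category A and one column category B.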
Example: Math Internal (3/23) or External English (9/23).
Conditional Relative Frequencies: Percentage of data in a single category when given a specific group.
Looks at $P(A \mid B)$ or $P(B \mid A)$: (cell count)/(total of the given row or column).
Example: Given a student is in Math (total 10), the percentage that are internal is 3/10.
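As a rough sketch of the three frequencies, the Python below rebuilds the two-way table from the examples above. The counts for Math External (7) and English Internal (4) are not stated in these notes; they are inferred from the totals (Math total 10, grand total 23) and should be treated as an assumption.

```python
# Two-way table of counts: rows = subject, columns = student type.
# Math/External = 7 and English/Internal = 4 are inferred from the totals.
table = {
    "Math": {"Internal": 3, "External": 7},
    "English": {"Internal": 4, "External": 9},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal relative frequency: row (or column) total over the grand total.
marginal_math = sum(table["Math"].values()) / grand_total

# Joint relative frequency: one cell over the grand total.
joint_math_internal = table["Math"]["Internal"] / grand_total

# Conditional relative frequency: one cell over its row (or column) total.
internal_given_math = table["Math"]["Internal"] / sum(table["Math"].values())

print(f"P(Math) = {marginal_math:.3f}")
print(f"P(Math and Internal) = {joint_math_internal:.3f}")
print(f"P(Internal | Math) = {internal_given_math:.3f}")
```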
Describing Quantitative Data: C-SOAPs
C: Context.
S: Shape.
Symmetrical or skewed.
Number of peaks: unimodal, bimodal.
O: Outliers.
Data points far from the rest.
C: Center.
Mean or median.
S: Spread.
Range, standard deviation, IQR.
Tip: Use descriptive language (strongly, roughly) and comparative language.
Basic Terms for Quantitative Data
Mean: Sum of values divided by the number of values (average).
Standard Deviation: Measure of variation.
Description in context: "The [value in context] typically varies by [standard deviation value] from the mean of [mean]."
Example: "The average prep worker subscriber's IQ typically varies by 5 IQ points (standard deviation) from the mean of 169 IQ points."
Median: 50th percentile (value in the middle when data is ordered).
Range: Value of max minus min in the data set.
Box Plots: Five-Number Summary
Minimum: Smallest value.
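Q1 (25th percentile): Median of the lower half of the data (the values between the minimum and the median).
Median (50th percentile).
Q3 (75th percentile): Median of the upper half of the data (the values between the median and the maximum).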
Maximum: Maximum value.
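IQR (Interquartile Range): $Q_3 - Q_1$, the range of the middle 50% of the data.
Outliers
Low-End Outliers: Any value less than $Q_1 - 1.5 \times \text{IQR}$.
High-End Outliers: Any value greater than $Q_3 + 1.5 \times \text{IQR}$.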
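A minimal sketch of the five-number summary and the 1.5 x IQR outlier rule. The data are made up, and the quartiles are taken as the medians of the lower and upper halves (excluding the middle value when the count is odd); other software may use a slightly different quartile rule.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of each half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # values above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]

scores = [2, 4, 5, 7, 8, 9, 10, 12, 30]  # made-up data
low, q1, med, q3, high = five_number_summary(scores)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr     # anything below this is a low-end outlier
high_fence = q3 + 1.5 * iqr    # anything above this is a high-end outlier
outliers = [x for x in scores if x < low_fence or x > high_fence]
print(low, q1, med, q3, high, iqr, outliers)
```

Here only the value 30 falls outside the fences.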
Other Terms
Percentile: Percentage of values less than or equal to a specific value.
Cumulative Relative Frequency: Cumulative percentages from each interval up through all the data.
Visual: The graph rises over intervals that contain data and is flat (a plateau) over intervals with no data.
Relative Frequency: Occurrence/frequency over total.
Z-Scores
Number of standard deviations a value is away from the mean.
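Equation: $z = \frac{x - \mu}{\sigma}$, where $x$ is the value, $\mu$ is the mean, and $\sigma$ is the standard deviation.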
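A short illustration of the z-score calculation. The mean (169) and standard deviation (5) are borrowed from the IQ example earlier in these notes; the individual values 179 and 164 are made up.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev

above = z_score(179, 169, 5)   # two standard deviations above the mean
below = z_score(164, 169, 5)   # one standard deviation below the mean
print(above, below)
```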
Transforming Data
Adding/Subtracting a Constant:
Shape and variability stay the same.
Center moves up/down by that amount.
Multiplying/Dividing:
Shape stays the same.
Center and variability are multiplied/divided by that amount.
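The effect of each transformation can be checked on a small made-up data set. This sketch uses the population standard deviation (`pstdev`) as the measure of variability; the same pattern holds for the sample standard deviation.

```python
from statistics import mean, pstdev

data = [2, 4, 6, 8]                  # made-up data
shifted = [x + 10 for x in data]     # add a constant
scaled = [x * 3 for x in data]       # multiply by a constant

print(mean(data), pstdev(data))
print(mean(shifted), pstdev(shifted))   # center moves by 10, spread unchanged
print(mean(scaled), pstdev(scaled))     # center and spread both tripled
```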
Density Curves and Normal Distribution
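Density Curve: Always on or above the horizontal axis, with a total area of exactly one underneath it; models a probability distribution.
Normal Distribution: A symmetric, bell-shaped density curve described by its mean and standard deviation.
Uniform Density Curve: A flat (constant-height) density curve; its total area is still one.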
68-95-99.7 Rule:
68% of values are within one standard deviation of the mean.
95% of values are within two standard deviations of the mean.
99.7% of values are within three standard deviations of the mean.
Calculator Commands for Normal Distribution
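normalpdf: Gives the height of the density curve at a specific value; for a continuous distribution this is not a probability.
normalcdf: Finds the probability (area under the curve) over a given interval.
invNorm (Inverse Normal): Finds the value corresponding to a given percentile (area to the left).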
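The three calculator commands have close analogues in Python's standard library (`statistics.NormalDist`). The sketch below assumes a hypothetical normal distribution with mean 100 and standard deviation 15, and it also checks the first two parts of the 68-95-99.7 rule.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

# Like the cdf command: area (probability) between two values.
within_one_sd = dist.cdf(115) - dist.cdf(85)
within_two_sd = dist.cdf(130) - dist.cdf(70)

# Like inverse normal: the value with a given area to its left.
percentile_90 = dist.inv_cdf(0.90)

# Like the pdf command: the height of the curve, not a probability.
height_at_mean = dist.pdf(100)

print(round(within_one_sd, 4), round(within_two_sd, 4))
print(round(percentile_90, 2), round(height_at_mean, 4))
```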
Normal Probability Plot
Plots actual values versus theoretical z-values.
Shows how well the data fits a normal distribution.
Roughly linear: Roughly normal distribution.
Not linear: Not a normal distribution.
Describing Scatter Plots: SEED
S: Context.
E: Direction (Positive or Negative).
E: Outliers. State any outliers.
D: Form (Linear or Nonlinear).
D: Strength of that form (strongly linear? weakly linear?).
Correlation Coefficient (R Value)
Ranges from -1 to 1.
Closer to -1 or 1 indicates a stronger linear correlation.
-1: Perfect negative linear correlation.
1: Perfect positive linear correlation.
-0.97: Strong negative linear correlation.
0: No linear correlation.
0.021: Very weak positive linear correlation.
Factors That Do Not Impact R Value
Changing units.
Switching X and Y axes.
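A small check of both claims, using made-up data and a hand-written correlation function (the names `hours` and `scores` are illustrative only). Converting hours to minutes and swapping the two variables should leave r unchanged.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]          # made-up data
scores = [52, 60, 61, 70, 77]

r = correlation(hours, scores)
r_minutes = correlation([h * 60 for h in hours], scores)  # change units
r_swapped = correlation(scores, hours)                    # switch x and y
print(round(r, 4), round(r_minutes, 4), round(r_swapped, 4))
```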
Outliers and Their Effects on R Value
Outliers that follow the pattern of the data strengthen r.
Outliers that fall outside the pattern of the data weaken r.
Correlation does not equal causation.
Regression Lines
Best fit line used to estimate values for points not given.
Equation: $\hat{y} = a + bx$
$\hat{y}$: the predicted value.
a: y-intercept (constant).
b: slope.
Residuals: Error of the regression line's prediction; residual = actual value minus predicted value, $y - \hat{y}$.
Negative residual: Overestimated.
Positive residual: Underestimated.
Interpreting Regression Lines
Use specific language to interpret slope, y-intercept, and residuals.
Explanatory variable: Independent variable
Response variable: Dependent variable
Least Squares Regression Line
The line that minimizes the sum of the squared residuals; found using a graphing calculator.
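S Value: Standard deviation of the residuals; the typical distance between the actual values and the values predicted by the LSRL.
R-squared value: Coefficient of determination; percentage of the variation in the response variable that can be explained by the linear relationship with the explanatory variable.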
Optimal Line: Low S value and high R-squared value.
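A calculator normally does this work, but the quantities can be sketched by hand. The data below are made up, and the formulas used (slope from the deviations, S with n - 2 in the denominator, R-squared as 1 - SSE/SST) are the standard ones rather than anything specific to these notes.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]            # made-up explanatory values
y = [52, 60, 61, 70, 77]       # made-up response values
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Slope b and y-intercept a of the least squares regression line.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x

predicted = [a + b * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]  # actual - predicted

sse = sum(e ** 2 for e in residuals)            # sum of squared residuals
sst = sum((yi - mean_y) ** 2 for yi in y)       # total variation in y
s = sqrt(sse / (n - 2))                         # typical residual size
r_squared = 1 - sse / sst

print(f"y-hat = {a:.1f} + {b:.1f}x, s = {s:.2f}, r^2 = {r_squared:.3f}")
```

The residuals of a least squares line always sum to zero, which is a quick sanity check on the arithmetic.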
Computer Printout
Identifies where to find the values needed to analyze the data: y-intercept, slope, S value, and R-squared value.
Outliers
Tend to decrease your correlation.
An outlier added far from the mean of X in the horizontal direction can tilt the line, changing both the slope and the y-intercept.
An outlier added near the mean of X (vertical line) but well above the other points: slope stays about the same, y-intercept increases.
An outlier added near the mean of X (vertical line) but well below the other points: slope stays about the same, y-intercept decreases.
Residual Plot
Plots residual values versus the explanatory/independent variable.
Clear pattern (such as a curve): A linear model is unlikely to be the best fit.
No clear pattern (random scatter): A linear model is likely to be appropriate.
Sampling Methods
Simple Random Sample (SRS):
Randomly selected subset of the population; every individual, and every possible group of the same size, has an equal chance of being selected.
Define population, label individuals, randomize, and select members.
Stratified Random Sample:
Split the population into groups/strata based on shared traits (homogeneous groups).
Randomly select samples from each group.
Example: Divide a school by grade level, then randomly pick students from each grade.
Cluster Sample:
Split the population into heterogeneous groups/clusters, then randomly select entire clusters to sample.
Example: Splitting a city into neighborhoods and surveying everyone in a few randomly chosen neighborhoods.
Systematic Random Sample:
Select individuals at regular or set intervals; start at a random point.
Example: Surveying every fifth person entering a building.
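The four sampling methods can be sketched with Python's `random` module. The population here is hypothetical (40 students, 10 per grade, split into 8 homerooms of 5), and the sample sizes are arbitrary choices for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 40 students, 10 in each of grades 9-12.
population = [(f"student{i}", 9 + i % 4) for i in range(40)]

# Simple random sample: every group of 8 students is equally likely.
srs = random.sample(population, 8)

# Stratified: a separate random sample of 2 from each grade (stratum).
stratified = []
for grade in (9, 10, 11, 12):
    stratum = [p for p in population if p[1] == grade]
    stratified += random.sample(stratum, 2)

# Cluster: split into 8 homerooms of 5, then take everyone in 2 random homerooms.
homerooms = [population[i:i + 5] for i in range(0, 40, 5)]
cluster = [p for room in random.sample(homerooms, 2) for p in room]

# Systematic: random starting point, then every 5th student.
start = random.randrange(5)
systematic = population[start::5]

print(len(srs), len(stratified), len(cluster), len(systematic))
```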
Bad Sampling Methods
Convenience Sample:
Choosing people who are easily reached/accessible.
Example: Surveying people at a nearby mall only because it's close.
Voluntary Response Sampling:
Allowing people to choose to participate.
Example: Posting an online poll that anyone can choose to answer, which introduces bias because people with strong opinions are more likely to respond.
Shortcomings
Undercoverage:
Some groups are left out or underrepresented in the sample.
Example: Sending a survey only to people with internet access.
Non-response:
Selected individuals don't or can't respond.
Example: Calling someone for a phone survey, but they decline.
Response Bias:
People give false or misleading answers.
Example: If you are a friend of the survey maker, you may give the answers you think they want to hear.
Wording in the Question:
Poorly phrased or biased questions influence answers.
Example: "Do you want to subscribe to Prepper's Education to receive $1 million?"
Observational Studies vs. Experiments
Observational Study:
Observe and collect data without influencing the subjects.
Example: Sitting in a car watching how many people use seat belts.
Experiment:
Manipulating variables or applying treatments to observe and measure the effects on the subjects.
Four Principles of Experimental Design:
* Comparison: compare two or more treatment groups.
* Random assignment: reduces bias.
* Control: keep all other variables the same.
* Replication: have enough subjects.
Key Vocab Terms to Know:
* Factor: the explanatory (independent) variable.
* Level: a specific value or category of the factor.
* Confounding: when another variable affects the results; a proper control group helps determine the cause of your result.
* Single blind: the subjects don't know which group they are in.
* Double blind: neither the subjects nor the researchers know who gets which treatment.
Randomized Block Design
Subjects are divided into blocks/groups based on specific characteristics, and treatments are randomly assigned within each block.
Matched Pairs Design:
Subjects are paired based on the specific characteristics, and each subject in a pair is randomly assigned a treatment; one subject in each pair could be the control.
What is Probability?
Chance that something happens, written as a number from 0 to 1.
Unpredictable in the short term, but predictable in the long term.
Simulation
Model used to mimic real-world events to estimate probabilities.
Define the problem.
Describe how to use a chance process to model it.
Perform the simulation many times.
Estimate the probability based on the results.
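The simulation steps can be sketched as follows. The problem is a made-up one (the chance of at least 2 heads in 3 fair coin flips, whose true value is 0.5), and the seed and number of trials are arbitrary.

```python
import random

def one_trial(rng):
    """One repetition of the chance process: flip 3 fair coins, count heads."""
    return sum(rng.random() < 0.5 for _ in range(3))

rng = random.Random(42)      # seeded generator so results are repeatable
trials = 10_000              # perform the chance process many times

successes = sum(one_trial(rng) >= 2 for _ in range(trials))
estimate = successes / trials   # estimated probability from the results
print(estimate)
```

With more trials the estimate settles closer to 0.5, which is the long-run predictability described above.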
Probability Rules
Mutually Exclusive:
Two events have no overlap and cannot occur at the same time.
Independence:
Outcome of one does not affect the outcome of the other.
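Two Possible Events: A and B, with probabilities P(A) and P(B).
Finding the Probability of A or B: the probability that at least one occurs, so it doesn't matter whether it's A or B.
* In general, add the probabilities and subtract the overlap: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$.
* If the events are mutually exclusive, simply add the probabilities: $P(A \text{ or } B) = P(A) + P(B)$.
Finding the Probability of A and B
If they are not independent, multiply the probability of A by the probability of B given A: $P(A \text{ and } B) = P(A) \cdot P(B \mid A)$.
If they are independent, simply multiply them: $P(A \text{ and } B) = P(A) \cdot P(B)$.
Complement Rule: $P(\text{not } A) = 1 - P(A)$.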
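The addition, multiplication, and complement rules can be checked by listing outcomes. The example is made up: one roll of a fair die, with A = "even number" and B = "greater than 4". Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

outcomes = range(1, 7)                       # one roll of a fair die
A = {x for x in outcomes if x % 2 == 0}      # event A: even number
B = {x for x in outcomes if x > 4}           # event B: greater than 4

def prob(event):
    """Probability of an event when all 6 outcomes are equally likely."""
    return Fraction(len(event), 6)

p_a_or_b = prob(A) + prob(B) - prob(A & B)   # addition rule
p_b_given_a = prob(A & B) / prob(A)          # conditional probability
p_a_and_b = prob(A) * p_b_given_a            # general multiplication rule
p_not_a = 1 - prob(A)                        # complement rule
independent = prob(A & B) == prob(A) * prob(B)

print(p_a_or_b, p_a_and_b, p_not_a, independent)
```

In this particular example the two events happen to be independent, so both versions of the multiplication rule give the same answer.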
Visualizing Probability
Venn diagram
Two-way table
Probability tree
Random Variables: Four Types
Discrete
Continuous
Binomial
Geometric
Types
* Discrete: takes specific, countable values; think number of heads.
* Continuous: takes any value in a range/interval; think of everything in between two values.
To find the mean and standard deviation of a random variable, use its distribution (the values and their probabilities).
* Enter them into the calculator's statistics functions.
Expected Value: the long-run average value of the random variable over many repetitions of the chance process.
Weighted Average: the mean of a discrete random variable, with each value weighted by its probability, $\mu_X = \sum x_i p_i$.
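A minimal sketch of the weighted-average calculation for a discrete random variable. The distribution is a made-up example (number of heads in 2 fair coin flips); the variance line uses the standard formula, which these notes do not spell out.

```python
from math import sqrt

# Probability distribution of X = number of heads in 2 fair coin flips.
distribution = {0: 0.25, 1: 0.50, 2: 0.25}

# Expected value: each value weighted by its probability.
expected_value = sum(x * p for x, p in distribution.items())

# Variance: weighted average of squared distances from the expected value.
variance = sum((x - expected_value) ** 2 * p for x, p in distribution.items())
std_dev = sqrt(variance)

print(expected_value, variance, round(std_dev, 4))
```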
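Transforming a random variable: add/subtract or multiply/divide every value by the same constant.
Adding or subtracting: shape is unchanged, center increases or decreases by that amount, and variability remains the same.
Multiplying or dividing: shape is unchanged; center and variability are multiplied or divided by that amount.
In each case the same constant is applied to every value in the data set.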
Binomial Random Variable: check the conditions with the acronym BINS.
Binary: each trial has to be a success or a failure.
Independent: trials have to be independent.
Number: fixed number of trials.
Same probability of success on every trial.
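Once the four conditions hold, binomial probabilities follow from the standard binomial formula. The sketch below is a hand-written version (the function names are illustrative, not calculator commands), applied to a hypothetical setting of 10 independent trials with success probability 0.5.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials, success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k): add up the probabilities for 0 through k successes."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Hypothetical setting: 10 independent trials, each with a 0.5 chance of success.
n, p = 10, 0.5
exactly_three = binom_pmf(3, n, p)
at_most_three = binom_cdf(3, n, p)
print(round(exactly_three, 4), round(at_most_three, 4))
```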
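A geometric random variable models the number of trials up to and including the first success.
Trials have to be independent, each with the same probability of success; the random variable counts the trials that occur until the first success.
Calculator commands: geometpdf gives the probability that the first success occurs on exactly a given trial; geometcdf gives the probability that it occurs on or before a given trial.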
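The two geometric calculations can be written out directly. These are hand-written stand-ins for the calculator commands (the function names are illustrative), applied to a made-up setting: rolling a fair die until the first 6 appears.

```python
def geomet_pdf(p, k):
    """P(first success happens on exactly trial k)."""
    return (1 - p) ** (k - 1) * p

def geomet_cdf(p, k):
    """P(first success happens on or before trial k)."""
    return 1 - (1 - p) ** k

# Hypothetical setting: rolling a fair die until the first 6 appears.
p = 1 / 6
on_third_roll = geomet_pdf(p, 3)
within_three_rolls = geomet_cdf(p, 3)
print(round(on_third_roll, 4), round(within_three_rolls, 4))
```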
Statistic vs. Parameter
Statistic: Describes something from a sample of data
Parameter: Describes something from an entire population of data.
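Common examples: mean ($\bar{x}$ for a sample, $\mu$ for a population) and proportion ($\hat{p}$ for a sample, $p$ for a population).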
Sampling Distribution
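The probability distribution of a statistic, obtained through repeated sampling from the same population.
Simulation: repeatedly sample from the population and take the mean of each sample; the center of these means should line up with the population mean because the sample mean is an unbiased estimator.
More specifically: take repeated samples of the same size from the population and plot the statistic from each sample; that plot is the sampling distribution.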
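Formulas: the mean of the sampling distribution equals the parameter, $\mu_{\hat{p}} = p$ and $\mu_{\bar{x}} = \mu$.
Increasing the sample size decreases the variability: $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.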
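Assumptions
Random and independence conditions must be checked.
Key components
Random: the data come from a random sample or from random assignment.
10% condition: when sampling without replacement, the sample is no more than 10% of the population.
Large Counts: the expected numbers of successes and failures, $np$ and $n(1-p)$, both have to be at least 10.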
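Large Counts Condition
When it is met, the sampling distribution of the sample proportion is roughly normal, so normal calculations can be used.
Z-score formula: $z = \frac{\text{statistic} - \text{parameter}}{\text{standard deviation of the statistic}}$, used to find what percent of sample statistics fall above or below a given value.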
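A sketch of checking the conditions and then using a z-score for a sample proportion. All four numbers (true proportion 0.60, sample size 100, population size 10,000, observed sample proportion 0.66) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical values: true proportion, sample size, population size, observed p-hat.
p, n, population_size, p_hat = 0.60, 100, 10_000, 0.66

ten_percent_ok = n <= 0.10 * population_size
large_counts_ok = n * p >= 10 and n * (1 - p) >= 10

std_dev = sqrt(p * (1 - p) / n)          # standard deviation of p-hat
z = (p_hat - p) / std_dev                # (statistic - parameter) / standard deviation
prob_at_least = 1 - NormalDist().cdf(z)  # P(p-hat >= 0.66), valid if conditions hold

print(ten_percent_ok, large_counts_ok, round(z, 2), round(prob_at_least, 3))
```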
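Central Limit Theorem
When the sample size is large (about 30 or more), the sampling distribution of the sample mean is approximately normal, whatever the shape of the population distribution; check this before using normal calculations.
Point Estimate
A single value used as the best estimate of a population parameter.
Example: On a scale of 1 to 5, the point estimate might be 4.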
Confidence Interval: A range of plausible values for the parameter, reported with a percentage confidence level.
Interpretation: the first step is to state the conclusion.
Make sure the interpretation is in context.
Remember: setting the confidence level higher results in a wider margin of error.
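Main thing to understand
Increasing the sample size decreases the margin of error, so the estimate is more precise; the sampling distribution becomes less variable.
Bias does not affect the margin of error; the margin of error only accounts for sampling variability.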
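A minimal sketch of how the margin of error responds to the confidence level and the sample size. It assumes a one-proportion z-interval (these notes do not say which interval they have in mind), and the point estimate of 0.5 is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence):
    """Margin of error for a one-proportion z-interval."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)  # critical value
    return z_star * sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.5  # hypothetical point estimate (sample proportion)
for n, confidence in [(100, 0.95), (100, 0.99), (400, 0.95)]:
    moe = margin_of_error(p_hat, n, confidence)
    # interval = point estimate plus or minus the margin of error
    print(n, confidence, round(p_hat - moe, 3), round(p_hat + moe, 3))
```

Raising the confidence level from 95% to 99% widens the interval, while quadrupling the sample size cuts the margin of error in half.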
Acronyms you need to know