AP Statistics Cheat Sheet
Data Analysis
Displaying Categorical Data
Frequency (counts)
Relative Frequency (percent/proportion)
Two-Way Table
Marginal Relative Frequency: P(c)
Percent or proportion of individuals that have a specific value for one categorical variable.
Joint Relative Frequency: P(A \cap c)
Percent or proportion of individuals that have a specific value for one categorical variable and a specific value for another.
Conditional Relative Frequency: P(A|c)
Percent or proportion of individuals that have a specific value for one categorical variable among individuals who share the same value of another categorical variable (condition).
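The three relative frequencies above can be sketched in Python. The two-way table below is made up for illustration (the counts and the category names are assumptions, not data from any study):

```python
# Hypothetical two-way table of counts: rows = grade level, columns = preferred snack.
counts = {
    "Junior": {"Chips": 30, "Fruit": 20},
    "Senior": {"Chips": 10, "Fruit": 40},
}

# Grand total of all individuals in the table.
total = sum(sum(row.values()) for row in counts.values())

# Marginal relative frequency: P(Senior) = row total / grand total.
marginal_senior = sum(counts["Senior"].values()) / total

# Joint relative frequency: P(Senior and Fruit) = one cell / grand total.
joint_senior_fruit = counts["Senior"]["Fruit"] / total

# Conditional relative frequency: P(Fruit | Senior) = one cell / its row total.
cond_fruit_given_senior = counts["Senior"]["Fruit"] / sum(counts["Senior"].values())

print(marginal_senior, joint_senior_fruit, cond_fruit_given_senior)
```

Note that the joint and conditional frequencies use the same cell count; only the denominator changes.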
Graphs of Categorical Variables
Side-by-Side Bar Graph
Pie Chart
Bar Graph
Segmented Bar Graph
Mosaic Plot
Association
There is an association between two variables if knowing the value of one variable helps predict the value of the other variable.
Describing Quantitative Data with Graphs
Dotplot: shows each data value as a dot above its location on a number line.
Stemplot (Stem-and-Leaf Plot)
Histogram
Describing the Distribution (SOCV + Context)
Shape: The distribution of (context) is (shape) with a peak at (highest point) and gaps between (gap).
Roughly right-skewed
Left-skewed
Uniform
Double-peaked (bimodal)
Symmetric (unimodal)
Outliers: There appear to be outliers at (values).
Center: The (mean/median) of the distribution is (mean/median + units).
Symmetric -> use mean
Skewed -> use median
Variability: The distribution of (context) has a (SD/IQR/range) of (value + units).
Comparing Distributions (SOCV + Context)
Identify any outliers for both distributions.
Compare the center (which is greater/lesser).
Describe the shape of both distributions.
Compare the variability (which varies more).
Always write in context of the problem.
Describing Quantitative Data with Numbers
Measures of Center
Mean: \bar{x} = \frac{\sum x_i}{n}
The mean is greatly affected by outliers (non-resistant).
Median: Middle value of the ordered data
Middle value (odd number of data values)
Average of the two middle values (even number of data values)
The median is not affected by outliers (resistant).
Measures of Variability
Range: Maximum - Minimum
Standard Deviation (SD): s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}
Interquartile Range (IQR): IQR = Q3 - Q1 (Quartile 1 (25%), Quartile 3 (75%))
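A small Python sketch of the measures of center and variability above, using the standard library's statistics module. The data values are hypothetical, chosen only to show how one outlier moves the mean far more than the median:

```python
import statistics

data = [2, 4, 4, 5, 7, 8]      # hypothetical data set
with_outlier = data + [40]     # same data plus one extreme value

mean_before = statistics.mean(data)             # 30 / 6 = 5
median_before = statistics.median(data)         # average of 4 and 5 = 4.5
mean_after = statistics.mean(with_outlier)      # 70 / 7 = 10 (pulled toward the outlier)
median_after = statistics.median(with_outlier)  # middle value = 5 (barely moves)

data_range = max(data) - min(data)              # 8 - 2 = 6
s_x = statistics.stdev(data)                    # sample SD, divides by n - 1

# The same SD computed directly from the formula.
x_bar = sum(data) / len(data)
s_x_manual = (sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)) ** 0.5

print(mean_before, median_before, mean_after, median_after, data_range, round(s_x, 3))
```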
How to Find Outliers?
1.5 x IQR Rule:
Low outlier < Q1 - (1.5 x IQR)
High outlier > Q3 + (1.5 x IQR)
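The 1.5 x IQR rule can be sketched as follows. This assumes the common classroom convention that Q1 and Q3 are the medians of the lower and upper halves of the ordered data (the overall median is left out of both halves when n is odd); other software may use a different quartile convention. The function names and the data are hypothetical:

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]        # values below the median position
    upper = data[n - half:]    # values above the median position
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

def find_outliers(values):
    """Return the values that fall outside the 1.5 x IQR fences."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

scores = [1, 3, 5, 6, 7, 8, 9, 10, 30]   # hypothetical data
print(five_number_summary(scores))
print(find_outliers(scores))
```

For these scores Q1 = 4 and Q3 = 9.5, so IQR = 5.5 and the fences are -4.25 and 17.75; only 30 falls outside them.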
Interpreting SD
"The (context) typically varies by about (SD + unit) from the mean of (\bar{X} + unit)."
Boxplots
Min, Q1, Median, Q3, Max
If the min or max is an outlier:
Plot the outliers as separate points and label them on your boxplot.
The new min is the lowest data value that is not an outlier (same idea for the max).
Parameter vs. Statistic
Parameter: A number (or statement) that describes a population.
Statistic: A number (or statement) that describes a sample.
Five Number Summary
Minimum (Min)
Q1 (Quartile 1; 25th percentile)
Median
Q3 (Quartile 3; 75th percentile)
Maximum (Max)
Modeling Distributions of Quantitative Data
Describing Location in a Distribution
Percentiles: The pth percentile of a distribution is the value with p% of observations less than or equal to it.
Standardized scores (z-scores): Tell us how many standard deviations from the mean a value falls, and in what direction.
z = \frac{x - \mu}{\sigma}
x = value
μ = mean
σ = SD
Interpretation: "(Context) is (z-score) standard deviations (above/below) the mean of (μ + unit)."
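A minimal sketch of the z-score formula in Python. The test score, mean, and SD are hypothetical numbers used only for illustration:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical context: a test with mean 80 points and SD 4 points.
z_high = z_score(86, 80, 4)   # (86 - 80) / 4 = 1.5
z_low = z_score(74, 80, 4)    # (74 - 80) / 4 = -1.5

print(f"A score of 86 is {z_high} standard deviations above the mean of 80 points.")
print(f"A score of 74 is {abs(z_low)} standard deviations below the mean of 80 points.")
```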
Cumulative relative frequency graph (ogive)
An ogive allows you to examine location in a distribution.
The completed graph allows you to estimate the percentile for an individual value & vice-versa.
Transformation of Data
Addition/Subtraction
Centers, Location: Change
Shape: No change
Variability: No change
Multiplication/Division
Centers, Location: Change
Shape: No change
Variability: Change
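The transformation rules above can be checked numerically. This is a sketch with hypothetical data: adding a constant shifts the mean but leaves the SD alone, while multiplying by a constant scales both.

```python
import statistics

data = [10, 12, 15, 19, 24]           # hypothetical data
shifted = [x + 5 for x in data]       # add 5 to every value
scaled = [x * 2 for x in data]        # multiply every value by 2

mean_data, sd_data = statistics.mean(data), statistics.stdev(data)
mean_shifted, sd_shifted = statistics.mean(shifted), statistics.stdev(shifted)
mean_scaled, sd_scaled = statistics.mean(scaled), statistics.stdev(scaled)

print(mean_data, round(sd_data, 3))        # original center and variability
print(mean_shifted, round(sd_shifted, 3))  # center moves by 5, SD unchanged
print(mean_scaled, round(sd_scaled, 3))    # center and SD both doubled
```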
Density Curves and Normal Distributions
Density curve - models the distribution of a quantitative variable
Is always on or above the horizontal axis.
Has an area of exactly 1 underneath it.
The area under the curve and above any interval of values on the horizontal axis estimates the proportion of all observations that fall in that interval.
Mean of a density curve - point at which the curve would balance if made of solid material.
Median of a density curve - is the equal areas point, the point that divides the area under the curve in half.
Uniform Density Curve
Why is the height 1/2?
Since the area under the curve must equal 1, the height is the reciprocal of the width of the interval on the horizontal axis (a width of 2 gives a height of 1/2).
Approximately Normal
Described by a roughly symmetric, single-peaked, bell-shaped density curve called a Normal curve.
Any Normal distribution is completely specified by two parameters: mean (\mu) & SD (\sigma).
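Finding the area under the curve (Probability)
Inputs: lower, upper, μ, σ
If using z-scores: μ = 0, σ = 1
If there is no upper bound: upper = 1000
If there is no lower bound: lower = -1000
Finding a value
Inputs: area, μ, σ
If using z-scores: μ = 0, σ = 1
*To find an area or a value in context:
μ: mean from the context
σ: SD from the context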
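The same area and value calculations can be sketched in Python with the standard library's statistics.NormalDist. The context below (heights with mean 64 and SD 2.5) is a made-up example, not from the notes. Because cdf gives the area to the left of a value directly, no stand-in bounds such as 1000 or -1000 are needed here:

```python
from statistics import NormalDist

# Hypothetical context: heights are approximately Normal with mean 64 and SD 2.5.
heights = NormalDist(mu=64, sigma=2.5)

# Finding an area: proportion of heights between 62 and 67.
area_between = heights.cdf(67) - heights.cdf(62)

# Finding an area with no upper bound: proportion of heights above 67.
area_above = 1 - heights.cdf(67)

# Finding a value: the height with 90% of the area to its left (90th percentile).
value_90th = heights.inv_cdf(0.90)

# Working in z-scores: the standard Normal distribution has mean 0 and SD 1.
standard = NormalDist(mu=0, sigma=1)
area_within_1 = standard.cdf(1) - standard.cdf(-1)

print(round(area_between, 4), round(area_above, 4), round(value_90th, 2), round(area_within_1, 4))
```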
Empirical Rule (68-95-99.7 Rule)
Count the observations within 1, 2, and 3 SDs of the mean and convert each count to a percent:
Interval: count
μ ± 1(SD): count
μ ± 2(SD): count
μ ± 3(SD): count
If these percents are close to 68%, 95%, and 99.7%, then the distribution is approximately Normal.
Normal Probability Plot
A scatterplot of the ordered pairs (data value, expected z-score) for each individual in a quantitative data set.
Look for an almost linear form of the scatterplot.
If it's almost linear, then the distribution is approximately Normal.
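The Empirical Rule check described above (counting observations within 1, 2, and 3 SDs of the mean) can be sketched in Python. The data here are simulated from a Normal model purely as an assumption for illustration, and the sample mean and SD stand in for μ and σ; the function name percent_within is hypothetical:

```python
import statistics
from statistics import NormalDist

def percent_within(values, k):
    """Percent of values that lie within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    count = sum(1 for x in values if mean - k * sd <= x <= mean + k * sd)
    return 100 * count / len(values)

# Simulated data (assumption): 1000 draws from a Normal model with mean 100 and SD 15.
data = NormalDist(mu=100, sigma=15).samples(1000, seed=1)

for k in (1, 2, 3):
    print(f"within {k} SD: {percent_within(data, k):.1f}%")
```

If the three percents land near 68, 95, and 99.7, the data are consistent with an approximately Normal distribution. Percents far from those targets suggest otherwise.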
Exploring Two-Variable Quantitative Data
Explanatory variable (input) - helps predict or explain changes in a response variable.
Response variable (predicted output) - measures the outcome of a study.
Scatterplot
Correlation (r) - only applies to linear association.
Preferably, a graph is shown.
Only a number. NO UNITS.
Does not imply causation.
r = -1: perfect negative linear correlation
r near 0: weak (r = 0 means no linear association)
r = +1: perfect positive linear correlation
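A sketch of how r can be computed, assuming the standard formula r = \frac{1}{n-1} \sum z_x z_y (the notes do not state this formula explicitly). The study-hours data and the function name are hypothetical:

```python
import statistics

def correlation(xs, ys):
    """Correlation r as the sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

hours = [1, 2, 3, 4, 5]          # hypothetical explanatory variable
score = [52, 60, 63, 71, 74]     # hypothetical response variable

r = correlation(hours, score)
print(round(r, 3))

# r has no units: rescaling the response (for example, points x 10) leaves it unchanged.
r_rescaled = correlation(hours, [10 * s for s in score])
```

Because r is built from z-scores, changing the units of either variable does not change its value.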
How to describe this scatterplot?
Direction: (positive/negative/none)
Form: (linear/nonlinear)
Strength: (weak/moderate/strong)
Unusual Feature: (outlier)
"The correlation of r = (#) confirms that the linear association between (explanatory) and (response) is (positive/negative) and (weak/moderate/strong)."
Least-Squares Regression Line (LSRL)
Residual (e) = Actual - Predicted
"The actual (y-context) was (residual + unit) (above/below) the predicted value for x = (# in context)."
Slope (b):
"For every 1-unit increase in (x-context), the predicted (y-context) (increases/decreases) by (slope + unit of y)."
Y-int (a):
"When (x-context) is 0, the predicted (y-context) is (y-int + unit)."
Standard Deviation of the Residuals (s):
"The actual (y-context) is typically about (s + unit) away from the value predicted by the LSRL with x = (x-context)."
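Coefficient of determination (r^2):
"About (r^2 as a percent) of the variability in (y-context) is accounted for by the LSRL with x = (x-context)."
Residual Plot:
This determines if a LINEAR MODEL is APPROPRIATE.
*We look to see if there's NO leftover curved pattern.
*residual = y - \hat{y}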
*If given r, s_x, s_y, \bar{x} & \bar{y}, use these formulas to find the LSRL equation: b = r \frac{s_y}{s_x}, a = \bar{y} - b\bar{x}
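A short Python sketch of building the LSRL from summary statistics (slope b = r s_y / s_x, intercept a = \bar{y} - b\bar{x}) and then computing a residual. All numbers and the function name are hypothetical:

```python
def lsrl_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Return (a, b) for the line y-hat = a + b*x, built from summary statistics."""
    b = r * s_y / s_x          # slope
    a = y_bar - b * x_bar      # y-intercept: the line passes through (x-bar, y-bar)
    return a, b

# Hypothetical summary statistics.
a, b = lsrl_from_summary(r=0.8, s_x=2.0, s_y=5.0, x_bar=10.0, y_bar=50.0)

predicted = a + b * 12         # predicted y when x = 12
actual = 57                    # hypothetical observed y at x = 12
residual = actual - predicted  # residual = actual - predicted

print(f"y-hat = {a} + {b}x, predicted = {predicted}, residual = {residual}")
```

A positive residual means the actual value was above the predicted value. A negative residual means it was below.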
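Extrapolation: using the LSRL to predict for values of the explanatory variable that are outside the range of data from which the LSRL was calculated.
**Influential points - can greatly affect correlation and regression calculations.
Outliers: points that fall out of the pattern (large residuals).
High-leverage points: points with very large (or very small) x-values.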
Power Model:
Option 1: raise the values of the explanatory variable to an integer power, p.
Option 2: take the pth root of the response values.
Exponential & Logarithmic Models: take the logarithm (log (base 10) or ln (base e)) of one or both variables.
Always check the scatterplot & residual plot before concluding whether a LINEAR MODEL is APPROPRIATE.
Collecting Data
Simple Random Sample (SRS)
Gives every possible sample of a given size the same chance to be chosen.
Make sure to do SAMPLING WITHOUT REPLACEMENT when doing SRS.
How to choose an SRS?
Technology:
Label: Label each individual from 1 to N.
Randomize: Use an RNG to get n different integers (ignore repeats, if necessary).
Select: Choose the individuals that correspond to the integers.
Slips of paper:
Label: Write corresponding numbers or letters on identical slips of paper.
Randomize: Put the slips in a bowl or hat, shuffle them, and let individuals take one slip each (no replacement).
Select: Group individuals based on the slip of paper they got.
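The Technology steps above (label, randomize, select) can be sketched in Python. The roster and the function name choose_srs are hypothetical; random.sample draws without replacement, so no repeats need to be ignored:

```python
import random

def choose_srs(population, n, seed=None):
    """Choose a simple random sample of size n without replacement."""
    rng = random.Random(seed)
    labels = range(1, len(population) + 1)       # Label: number individuals 1 to N
    chosen = rng.sample(labels, n)               # Randomize: n different integers
    return [population[i - 1] for i in chosen]   # Select: individuals with those labels

# Hypothetical roster of 30 students.
roster = [f"Student {i}" for i in range(1, 31)]
sample = choose_srs(roster, 5, seed=1)
print(sample)
```

Passing a seed only makes the example repeatable. Leave it as None for a fresh random sample each time.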
Types of Sampling
Convenience Sampling
Chooses individuals easiest to reach.
Voluntary Sampling
Individuals choose to be a part of the study b/c of open invitation.
Both of these sampling methods can lead to BIAS: the results are likely to consistently over- or underestimate the value you want to know.
Stratified Random Sampling
Divide the population into strata (groups of individuals that are similar in some way that might affect their response).
Then choose a separate SRS in each stratum & combine these SRSs to form the sample.
*Strata are similar within (HOMOGENEOUS) but different between. Stratified samples tend to give more precise estimates of unknown population values than SRSs.
Cluster Sampling
Divide the population into non-overlapping groups of individuals that are located near each other.
Randomly select some of these clusters and all the individuals in the chosen clusters are included in the sample.
Systematic Random Sampling
Selects every kth individual based on the population size & desired sample size. Randomly select a value from 1 to k to identify the first individual, and choose every kth individual.
*If there's a pattern in the way the population is ordered, the sample may not be representative of the population.
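The stratified, cluster, and systematic methods above can be compared side by side in a Python sketch. The school population, group names, and function names are all hypothetical, and taking k as N // n is an assumption about how "every kth individual" is determined:

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """Take a separate SRS from every stratum, then combine them."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and include everyone in them."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

def systematic_sample(population, n, rng):
    """Pick a random start from 1 to k, then take every kth individual."""
    k = len(population) // n
    start = rng.randint(1, k)
    return population[start - 1::k][:n]

rng = random.Random(42)  # seeded only so the example is repeatable

strata = {"freshmen": [f"F{i}" for i in range(1, 41)],
          "seniors": [f"S{i}" for i in range(1, 61)]}
clusters = {f"homeroom {h}": [f"{h}{i}" for i in range(1, 6)] for h in "ABCD"}
population = list(range(1, 101))

strat = stratified_sample(strata, 5, rng)     # 5 from each stratum
clust = cluster_sample(clusters, 2, rng)      # everyone in 2 homerooms
syst = systematic_sample(population, 10, rng) # every 10th individual
print(strat, clust, syst, sep="\n")
```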
What else can go WRONG ???
Undercoverage: Occurs when some members of the population are less likely to be chosen or cannot be chosen in a sample.
Nonresponse: Occurs when an individual chosen for the sample can't be contacted or refuses to participate.
Response Bias: Occurs when there is a systematic pattern of inaccurate answers to a survey question.
Types of Studies
Observational
Observes individuals and measures variables of interest but does not attempt to influence the response.
Experimental
Deliberately imposes treatments (conditions) on individuals to measure their responses.
Vocab
Factor: An explanatory variable that is manipulated and may cause a change in the response variable.
Levels: Different values of a factor.
Treatment: A specific condition applied to the individuals in an experiment. With more than one factor, a treatment is a combination of specific levels of each factor.
Confounding: Occurs when two variables are associated in such a way that their effects on a response variable cannot be distinguished from each other.
Control group: Used to provide a baseline for comparing the effects of other treatments.
Placebo effect: Describes the fact that some subjects in an experiment will respond favorably to any treatment, even an inactive one.
Block: A group of experimental units that are known before the experiment to be similar in some way that is expected to affect the response.
Experimental unit: The object to which a treatment is randomly assigned.
Subjects: Experimental units that are human beings.
Random assignment: Experimental units are assigned to treatments using a chance process.
BASIC PRINCIPLES OF EXPERIMENTAL DESIGN:
Comparison: Use a design that compares two or more treatments.
Random assignment: Use a chance process (slips of paper, RNG, table) to assign experimental units to treatments.
Control: Keep other variables the same for all groups, avoiding confounding and reducing variation in responses.
Replication: Give each treatment to enough experimental units that the effects of the treatments can be distinguished from chance differences between the groups.
BLOCK DESIGN
MATCHED PAIRS DESIGN: A design that uses blocks of size 2.
Sampling variability: Different random samples from the same population produce different estimates.
Statistically significant: The observed results of a study are too unusual to be explained by chance alone.
Collecting Data Continued
All planned studies must be reviewed in advance by an institutional review board charged with protecting the safety and well-being of the subjects.
All individuals who are subjects in a study must give their informed consent before data are collected.
All individual data must be kept confidential. Only statistical summaries for groups of subjects may be made public.
SAMPLE SIZE: Larger random samples tend to produce estimates that are closer to the true population value.
In other words, estimates from larger samples are more precise.
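1) The ASSOCIATION is strong: The association between the explanatory and response variables is strong.
2) The association is consistent: This reduces the chance that another variable specific to one group or one study explains the association.
3) Larger values of the explanatory variable are associated with stronger responses.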
THE continuedAPPLICATION that cause show
*PERCENTAGES (P-VALUES)
*SAMPLING VARIABILITY (Random)
*conducted
Inference about the population
RANDOM IS IMPORTANT
Probability
Random process