AP Statistics Cheat Sheet

Data Analysis

  • Displaying Categorical Data

    • Frequency (counts)

    • Relative Frequency (percent/proportion)

    • Two-Way Table

      • Marginal Relative Frequency: P(c)

        • Percent or proportion of individuals that have a specific value for one categorical variable.

      • Joint Relative Frequency: P(A \cap c)

        • Percent or proportion of individuals that have a specific value for one categorical variable and a specific value for another.

      • Conditional Relative Frequency: P(A|c)

        • Percent or proportion of individuals that have a specific value for one categorical variable among individuals who share the same value of another categorical variable (condition).
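The marginal, joint, and conditional relative frequencies can be computed directly from a two-way table of counts. The sketch below is only an illustration: the counts and the names `counts`, `marginal`, `joint`, and `conditional` are made up for this example.

```python
# Hypothetical two-way table of counts: rows = variable A, columns = variable c.
counts = {
    "A1": {"c1": 20, "c2": 30},
    "A2": {"c1": 10, "c2": 40},
}

# Grand total of all individuals in the table.
total = sum(sum(row.values()) for row in counts.values())


def marginal(col):
    """P(c): proportion of all individuals that fall in column `col`."""
    return sum(row[col] for row in counts.values()) / total


def joint(row, col):
    """P(A and c): proportion of all individuals in both `row` and `col`."""
    return counts[row][col] / total


def conditional(row, col):
    """P(A | c): proportion in `row` among only the individuals in `col`."""
    col_total = sum(r[col] for r in counts.values())
    return counts[row][col] / col_total


print(marginal("c1"))           # 30 of the 100 individuals
print(joint("A1", "c1"))        # 20 of the 100 individuals
print(conditional("A1", "c1"))  # 20 of the 30 individuals in c1
```

Note that the joint and conditional frequencies share a numerator; only the denominator (everyone vs. the condition group) changes.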

    • Graphs of Categorical Variables

      • Pie Chart

      • Bar Graph (including side-by-side bar graphs)

      • Segmented Bar Graph

      • Mosaic Plot

    • Association

      • There is an association between two variables if knowing the value of one variable helps predict the value of the other variable.

Describing Quantitative Data with Graphs

  • Dotplot: shows each individual data value as a dot above its location on a number line

  • Stemplot (Stem-and-Leaf Plot)

  • Histogram

  • Describing the Distribution (SOCV + Context)

    • Shape: The distribution of (context) is (shape) with a peak at (highest point) and gaps between (gap).

      • Roughly right-skewed

      • Left-skewed

      • Uniform

      • Double-peaked (bimodal)

      • Symmetric (unimodal)

    • Outlier: There seem to be outliers at (values).

    • Center: The (mean/median) of the distribution is (mean/median + units).

      • Symmetric -> use mean

      • Skewed -> use median

    • Variability: The distribution of (context) has a (SD/IQR/range) of (value + units).

  • Comparing Distributions (SOCV + Context)

    • Identify any outliers for both distributions.

    • Compare the center (which is greater/lesser).

    • Describe the shape of both distributions.

    • Compare the variability (which varies more).

    • Always write in context of the problem.

Describing Quantitative Data with Numbers

  • Measures of Center

    • Mean: \bar{x} = \frac{\sum x_i}{n}

      • The mean is greatly affected by outliers (non-resistant).

    • Median: Middle value

      • Odd number of data values: the single middle value

      • Even number of data values: the average of the two middle values

      • The median is not greatly affected by outliers (resistant).

  • Measures of Variability

    • Range: Maximum - Minimum

    • Standard Deviation (SD): s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}

    • Interquartile Range (IQR): IQR = Q3 - Q1 (Quartile 1 (25%), Quartile 3 (75%))
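As a rough check on the mean, SD, and range formulas, here is a small Python sketch. The data values are made up, and the variable names (`lengths`, `x_bar`, `s_x`, `data_range`) are chosen only for this example.

```python
from math import sqrt
from statistics import stdev

lengths = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data values

n = len(lengths)
x_bar = sum(lengths) / n  # mean: sum of the values divided by n

# Sample SD: sum of squared deviations from the mean, divided by n - 1.
s_x = sqrt(sum((x - x_bar) ** 2 for x in lengths) / (n - 1))

data_range = max(lengths) - min(lengths)  # range: maximum - minimum

print(x_bar, s_x, data_range)
print(stdev(lengths))  # library version of the same n - 1 formula
```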

  • How to Find Outliers?

    • 1.5 x IQR Rule:

      • Low outlier < Q1 - (1.5 x IQR)

      • High outlier > Q3 + (1.5 x IQR)
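The 1.5 x IQR rule can be sketched in Python as follows. This is a minimal illustration, not a definitive method: it assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data (leaving the overall median out when n is odd), and it assumes at least two data values. The helper names `quartiles` and `find_outliers` and the data are made up.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: the overall median is left out of both halves when n is odd.
    """
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]
    upper = values[-half:]
    return median(lower), median(upper)


def find_outliers(data):
    """Return the values flagged by the 1.5 x IQR rule."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]


scores = [2, 4, 5, 6, 7, 8, 9, 10, 30]  # made-up data
print(quartiles(scores))      # Q1 and Q3
print(find_outliers(scores))  # values beyond the fences
```

Calculators and software sometimes use slightly different quartile conventions, so fences computed elsewhere may differ a little.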

  • Interpreting SD

    • "The (context) typically varies by about (SD + unit) from the mean of (\bar{x} + unit)."

  • Boxplots

    • Min, Q1, Median, Q3, Max

    • If the min or max is an outlier:

      • Plot the outliers as separate points and label them on your boxplot.

      • The new min is the lowest data value that is not an outlier (same for max).

  • Parameter vs. Statistic

    • Parameter: A number (or statement) that describes a population.

    • Statistic: A number (or statement) that describes a sample.

  • Five Number Summary

    • Minimum (Min)

    • Q1 (Quartile 1; 25th percentile)

    • Median

    • Q3 (Quartile 3; 75th percentile)

    • Maximum (Max)

Modeling Distributions of Quantitative Data

  • Describing Location in a Distribution

    • Percentiles: The pth percentile of a distribution is the value with p% of observations less than or equal to it.

    • Standardized scores (z-scores): Tell us how many standard deviations from the mean a value falls, and in what direction.

      • z = \frac{x - \mu}{\sigma}

      • x = value

      • μ = mean

      • σ = SD

      • Interpretation: "(Context) is (z-score) standard deviations (above/below) the mean of (μ + unit)."
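A z-score calculation can be sketched in a few lines of Python. The function name `z_score` and the numbers (a value of 86 with mean 80 and SD 4) are hypothetical and used only for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


z_high = z_score(86, 80, 4)  # value above the mean -> positive z-score
z_low = z_score(74, 80, 4)   # value below the mean -> negative z-score
print(z_high, z_low)
```

The sign gives the direction (above or below the mean), and the size gives the distance in standard deviations.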

    • Cumulative relative frequency graph (ogive)

      • An ogive allows you to examine location in a distribution.

      • The completed graph allows you to estimate the percentile for an individual value & vice-versa.

  • Transformation of Data

    • Addition/Subtraction

      • Centers, Location: Change

      • Shape: No change

      • Variability: No change

    • Multiplication/Division

      • Centers, Location: Change

      • Shape: No change

      • Variability: Change
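The effect of each transformation on center and variability can be checked numerically. This is a small sketch with made-up data; the names `data`, `shifted`, and `scaled` are assumptions for the example.

```python
from statistics import mean, stdev

data = [1, 2, 3, 4, 5]            # made-up data
shifted = [x + 10 for x in data]  # add 10 to every value
scaled = [x * 3 for x in data]    # multiply every value by 3

# Adding a constant moves the center but leaves the SD unchanged.
print(mean(data), mean(shifted))
print(stdev(data), stdev(shifted))

# Multiplying by a constant changes both the center and the SD.
print(mean(data), mean(scaled))
print(stdev(data), stdev(scaled))
```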

  • Density Curves and Normal Distributions

    • Density curve - models the distribution

      • Is always on or above the horizontal axis.

      • Has an area of exactly 1 underneath it.

      • The area under the curve and above any interval of values on the horizontal axis estimates the proportion of all observations that fall in that interval.

    • Mean of a density curve - point at which the curve would balance if made of solid material.

    • Median of a density curve - the equal-areas point: the point that divides the area under the curve in half.

    • Uniform Density Curve

      • Why is the height 1/2?

        • Since the area under the curve must equal 1, the height is the reciprocal of the width on the horizontal axis (a width of 2 gives a height of 1/2).

    • Approximately Normal

      • Described by a roughly symmetric, single-peaked, bell-shaped density curve called a Normal curve.

      • Any Normal distribution is completely specified by two parameters: mean (\mu) & SD (\sigma).

    • Finding the area under the curve (Probability)

      • To find an area: enter the lower bound, upper bound, μ, and σ

        • For z-scores: μ = 0, σ = 1

        • If there is no upper bound, use upper: 1000; if there is no lower bound, use lower: -1000

      • To find a value: enter the area, μ, and σ

        • μ: mean in context

        • σ: SD in context

    • Empirical Rule (68-95-99.7 Rule)

      • Count the data values within each interval:

        • μ ± 1(SD): count

        • μ ± 2(SD): count

        • μ ± 3(SD): count

      • If the percentages of values in these intervals are close to 68-95-99.7, then the distribution is approximately Normal.
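One way to carry out this 68-95-99.7 check is to count the proportion of data values within 1, 2, and 3 SDs of the mean. The sketch below assumes the sample mean and sample SD stand in for μ and σ; the function name `empirical_rule_check` and the data are made up.

```python
from statistics import mean, stdev


def empirical_rule_check(data):
    """Return the proportion of values within 1, 2, and 3 SDs of the mean."""
    m = mean(data)
    s = stdev(data)
    n = len(data)
    return [sum(1 for x in data if abs(x - m) <= k * s) / n for k in (1, 2, 3)]


values = list(range(1, 11))  # made-up, evenly spread data (not bell-shaped)
within_1, within_2, within_3 = empirical_rule_check(values)
print(within_1, within_2, within_3)  # compare with 0.68, 0.95, 0.997
```

With only ten evenly spread values the proportions should not be expected to match 68-95-99.7 closely; the comparison is more meaningful for larger data sets.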

  • Normal Probability Plot: a scatterplot of the ordered pairs (data value, expected z-score) for each individual in a quantitative data set.

    • Look for an almost linear form of the scatterplot.

    • If it's almost linear, then the distribution is approximately Normal.
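The Normal-curve calculations in this section (finding an area from boundaries, or finding a value from an area) can also be sketched with Python's standard-library `statistics.NormalDist`. The mean of 64 and SD of 2.5 are hypothetical, and this is an alternative to the calculator steps, not a description of them.

```python
from statistics import NormalDist

heights = NormalDist(mu=64, sigma=2.5)  # hypothetical Normal model

# Finding an area: proportion of values between 62 and 67.
area = heights.cdf(67) - heights.cdf(62)

# Finding a value: the value with 90% of the area to its left.
value = heights.inv_cdf(0.90)

# Standard Normal (z-scores): mu = 0, sigma = 1.
standard = NormalDist(0, 1)
z_area = standard.cdf(1) - standard.cdf(-1)  # area within 1 SD of the mean

print(area, value, z_area)
```

Because `cdf` gives the area to the left of a boundary, no artificial bound such as ±1000 is needed here: an upper-tail area is simply 1 minus the `cdf`.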

Exploring Two-Variable Quantitative Data

  • Explanatory Variable (input) - helps predict or explain changes in a response variable.

  • Response variable (predicted output) - measures the outcome of a study.

  • Scatterplot

  • Correlation (r) - only applies to linear association.

    • Preferably, a graph is shown.

    • Only a number. NO UNITS.

    • Does not imply causation.

    • r = -1: perfect negative linear association

    • r near 0: very weak linear association

    • r = +1: perfect positive linear association

  • How to describe this scatterplot?

    • Direction: (positive/negative/none)

    • Form: (linear/nonlinear)

    • Strength: (weak/moderate/strong)

    • Unusual Feature: (outlier)

  • "The correlation of r = (#) confirms that the linear association between (explanatory) and (response) is (positive/negative) and (weak/moderate/strong)."
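The sheet does not give a formula for r. As a sketch, the standard definition averages the products of the z-scores of x and y (dividing by n - 1). The function name `correlation` and the data below are made up for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r = (1 / (n - 1)) * sum of (z-score of x) * (z-score of y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


hours = [1, 2, 3, 4, 5]     # made-up explanatory values
scores = [2, 3, 7, 8, 12]   # made-up response values

r = correlation(hours, scores)
r_minutes = correlation([h * 60 for h in hours], scores)  # same data, new units
print(r, r_minutes)
```

Because r is built from z-scores, changing the units of either variable leaves it unchanged, which is why r is reported with no units.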

  • Least-Squares Regression Line (LSRL)

  • Residual (e) = (Actual - Predicted)

    • "The actual (y-context) was (residual + unit) (above/below) the predicted value for x = (# in context)."

  • Slope (b):

    • "For every 1-unit increase in (x-context), the predicted (y-context) (increases/decreases) by (slope + unit of y)."

  • Y-int (a):

    • "When (x-context) is 0, the predicted (y-context) is (y-int + unit)."

  • Standard Deviation of the Residuals (s):

    • "The actual (y-context) is typically about (s + unit) away from the number predicted by the LSRL with x = (x-context)."

  • Coefficient of determination (r^2):

    • r^2 = 1 - \frac{\sum \text{residuals}^2}{\sum (y_i - \bar{y})^2}

  • Residual Plot: helps determine whether a LINEAR MODEL is APPROPRIATE. *We check that there is NO leftover curved pattern.

  • *residual = y - \hat{y}

  • *If given r, s_x, s_y, \bar{x}, & \bar{y}, use these formulas to find the LSRL equation \hat{y} = a + bx: b = r \frac{s_y}{s_x}, a = \bar{y} - b\bar{x}
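Building the LSRL from summary statistics uses b = r(s_y / s_x) for the slope and a = ȳ - b·x̄ for the y-intercept. The Python sketch below applies those two formulas and then computes one prediction and residual. All numbers and the function name `lsrl_from_summary` are hypothetical.

```python
def lsrl_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Return (a, b) for the line y-hat = a + b * x."""
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # y-intercept: the line passes through (x_bar, y_bar)
    return a, b


# Hypothetical summary statistics.
a, b = lsrl_from_summary(r=0.8, s_x=2.0, s_y=5.0, x_bar=10.0, y_bar=50.0)

predicted = a + b * 12     # predicted y when x = 12
actual = 57                # made-up observed y at x = 12
residual = actual - predicted  # residual = actual - predicted

print(a, b, predicted, residual)
```

A positive residual means the actual value was above the prediction; a negative residual means it was below.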

  • Extrapolation: using the LSRL to predict for values of the explanatory variable that are outside the range of data from which the LSRL was calculated.

  • *Influential points: can greatly affect correlation and regression calculations.

    • Out of pattern (large residuals).

    • Very large values.

  • Power Model:

    • Option 1: Raise the values of the explanatory variable to an integer power, p.

    • Option 2: Take the pth root of the response values.

  • Exponential & Logarithm Models: take the logarithm (log (base 10) or ln (base e)) of one or both variables.

  • Always check the scatterplot with the LSRL & the residual plot before concluding whether a LINEAR MODEL is APPROPRIATE.

Collecting Data

  • Simple Random Sample (SRS)

    • Gives every possible sample of a given size the same chance to be chosen.

    • Make sure to do SAMPLING WITHOUT REPLACEMENT when doing SRS.

    • How to choose an SRS?

      • Technology:

        • Label: Label each individual from 1 to N.

        • Randomize: Use an RNG to get n different integers (ignore repeats, if necessary).

        • Select: Choose the individuals that correspond to the integers.

      • Slips of paper:

        • Label: Write corresponding numbers or letters on identical slips of paper.

        • Randomize: Put in a bowl or hat, shuffle the papers, and let individuals take one paper (no replacement).

        • Select: Group individuals based on the slip of paper they got.
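The technology steps (label, randomize, select) can be sketched with Python's `random.sample`, which samples without replacement. The function name `choose_srs`, the `seed` parameter, and the student labels are assumptions for this example.

```python
import random


def choose_srs(population, n, seed=None):
    """Choose an SRS of size n: label 1..N, randomize, then select."""
    rng = random.Random(seed)                 # seed only makes the demo repeatable
    labels = range(1, len(population) + 1)    # Label: 1 to N
    chosen = rng.sample(labels, n)            # Randomize: n different integers
    return [population[i - 1] for i in chosen]  # Select: matching individuals


students = [f"student_{i}" for i in range(1, 31)]  # made-up population of 30
sample = choose_srs(students, 5, seed=1)
print(sample)
```

Since `random.sample` never repeats a label, there are no repeats to ignore.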

  • Types of Sampling

    • Convenience Sampling

      • Chooses individuals easiest to reach.

    • Voluntary Sampling

      • Individuals choose to be a part of the study b/c of open invitation.

      • Both of these sampling methods can lead to BIAS: consistently over- or underestimating the value you want to know.

    • Stratified Random Sampling

      • Divide the population into strata: groups of individuals that are similar in some way that might affect their response.

      • Then choose a separate SRS from each stratum & combine these SRSs to form the sample.

      • *Strata are similar within (HOMOGENEOUS) but different between; stratified samples tend to give more precise estimates of unknown population values than SRSs.
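A stratified random sample takes a separate SRS from every stratum and combines them. The Python sketch below shows the idea under simple assumptions: the same number is drawn from every stratum, and the strata (grade levels) and names are made up.

```python
import random


def stratified_sample(strata, n_per_stratum, seed=None):
    """Take an SRS of size n_per_stratum from each stratum and combine them."""
    rng = random.Random(seed)  # seed only makes the demo repeatable
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))  # SRS within a stratum
    return sample


# Hypothetical strata: students grouped by grade level.
strata = {
    "grade_9": [f"g9_{i}" for i in range(1, 21)],
    "grade_10": [f"g10_{i}" for i in range(1, 21)],
    "grade_11": [f"g11_{i}" for i in range(1, 21)],
}
strat_sample = stratified_sample(strata, 2, seed=7)
print(strat_sample)
```

Every stratum is guaranteed to appear in the sample, which is not true of an SRS or a cluster sample.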

    • Cluster Sampling

      • Divide the population into non-overlapping groups of individuals that are located near each other.

      • Randomly select some of these clusters and all the individuals in the chosen clusters are included in the sample.

    • Systematic Random Sampling

      • Selects every kth individual based on the population size & desired sample size. Randomly select a value from 1 to k to identify the first individual, and choose every kth individual.

      • *If there's a pattern in the way the population is ordered, the sample may not be representative of the population.
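Systematic random sampling can be sketched as: compute k from the population size and the desired sample size, pick a random start from 1 to k, then take every kth individual. The function name `systematic_sample` and the choice k = N // n are assumptions for this example.

```python
import random


def systematic_sample(population, n, seed=None):
    """Pick a random start from 1 to k, then take every kth individual."""
    rng = random.Random(seed)        # seed only makes the demo repeatable
    k = len(population) // n         # assumed rule: k = population size // sample size
    start = rng.randint(1, k)        # random position from 1 to k (inclusive)
    positions = range(start, len(population) + 1, k)
    return [population[i - 1] for i in positions][:n]


customers = list(range(1, 101))      # made-up ordered population of 100
sys_sample = systematic_sample(customers, 10, seed=3)
print(sys_sample)
```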

  • What else can go WRONG?

    • Undercoverage: Occurs when some members of the population are less likely to be chosen or cannot be chosen in a sample.

    • Nonresponse: Occurs when an individual chosen for the sample can't be contacted or refuses to participate.

    • Response Bias: Occurs when there is a systematic pattern of inaccurate answers to a survey question.

  • Types of Studies

    • Observational

      • Observes individuals and measures variables of interest but does not attempt to influence the response.

    • Experimental

      • Deliberately imposes treatments (conditions) on individuals to measure their responses.

  • Vocab

    • Factor: An explanatory variable that is manipulated and may cause a change in the response variable.

    • Levels: Different values of a factor.

    • Confounding: Occurs when two variables are associated in such a way that their effects on a response variable cannot be distinguished from each other.

    • Control group: Used to provide a baseline for comparing the effects of other treatments.

    • Placebo effect: Describes the fact that some subjects in an experiment will respond favorably to any treatment, even an inactive one.

    • Block: A group of experimental units that are similar in some way that is expected to affect the response.

    • Treatment: A specific condition applied to the individuals in an experiment.

    • Experimental unit: The object to which a treatment is randomly assigned.

    • Subjects: Experimental units that are human beings.

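    • Random assignment: Experimental units are assigned to treatments using a chance process.

  • BASIC PRINCIPLES OF EXPERIMENTAL DESIGN:

    • Comparison: Use a design that compares two or more treatments.

    • Random assignment: Use a chance process (slips of paper, RNG, table) to assign experimental units to treatments.

    • Control: Keep other variables the same for all groups, avoiding confounding and reducing variation in responses.

    • Replication: Give each treatment to enough experimental units that the effects of the treatments can be distinguished from chance differences between the groups.

  • BLOCK DESIGN

  • MATCHED PAIRS DESIGN: a design for comparing 2 treatments that uses blocks of size 2.

  • *Sampling variability: different samples from the same population produce different estimates.

  • Observed results of a study are statistically significant when they are too unusual to be explained by chance alone.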

Collecting Data Continued

  • All planned studies must be reviewed in advance by an institutional review board charged with protecting the safety and well-being of the subjects.

  • All individuals who are subjects in a study must give their informed consent before data are collected.

  • All individual data must be kept confidential. Only statistical summaries for groups of subjects may be made public.

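SAMPLE SIZE: Larger random samples tend to produce estimates that are closer to the true population value.

In other words, estimates from larger samples are more precise.

1) STRONG ASSOCIATION: The association between the explanatory and response variables is strong.

REDUCING THE CHANCE that some other variable specific to one group or study explains the association

Larger values of the explanatory variable are associated with stronger responses.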

THE continued APPLICATION that cause show

*PERCENTAGES (P-VALUES)

*SAMPLING VARIABILITY: Random
*conducted

Inference about the population

RANDOM IS IMPORTANT

Probability

  • Random process