AP Statistics Study Guide
Key Exam Details
- The AP Statistics course is a first-semester, college-level class.
- The exam is 3 hours long with 46 questions.
- 40 multiple-choice questions (50% of the exam).
- 6 free-response questions (50% of the exam).
- Exam covers these course content categories:
- Exploring One-Variable Data: 15%‒23%
- Exploring Two-Variable Data: 5%‒7%
- Collecting Data: 12%‒15%
- Probability, Random Variables, and Probability Distributions: 10%‒20%
- Sampling Distributions: 7%‒12%
- Inference for Categorical Data: Proportions: 12%‒15%
- Inference for Quantitative Data: Means: 10%‒18%
- Inference for Categorical Data: Chi-Square: 2%‒5%
- Inference for Quantitative Data: Slopes: 2%‒5%
Exploring One-Variable Data
- 15‒23% of questions will fall under this topic.
Variables and Frequency Tables
- A variable is a characteristic or quantity that differs between individuals.
- A categorical variable classifies an individual by group or category.
- A quantitative variable takes on a numerical value that can be measured.
Examples of Variables
- Categorical variables:
- The country in which a product is manufactured
- The political party with which a person is affiliated
- The color of a car
- Quantitative variables:
- The height, in inches, of a person
- The number of red cars that pass through an intersection in a day
- A zip code is categorical data because it's a label for a location, not a quantity.
- Quantitative variables are further classified as discrete or continuous:
- Discrete: Can take on only countably many values (finite or countably infinite).
- Continuous: Can take on uncountably many values. Between any two possible values, another value can be found.
Graphs for Categorical Variables
- A frequency table shows how many individual items fall into each category.
- A relative frequency table gives the proportion of the total that is accounted for by each category.
- A bar chart represents the frequencies or relative frequencies of a categorical variable.
- Categories are organized along a horizontal axis.
- The height of the bar corresponds to the number of observations of that category.
- The vertical axis may be labeled with frequencies or with relative frequencies.
- A bar chart representing data from more than one set is useful for comparing the frequencies across the sets.
Graphs for Quantitative Variables
- A histogram is related to a bar chart but is used for quantitative data.
- Data is split into intervals, or bins, and the number of data points in each interval is counted.
- The horizontal axis contains the different intervals, which are adjacent to each other, as they form a number line.
- The vertical axis shows the count for each interval.
- How the data is split into intervals can have a big impact on the appearance of the histogram.
- A stem-and-leaf plot is another graphical representation of a quantitative variable.
- Each data value is split into a stem (one or more digits) and a leaf (the last digit).
- The stems are arranged in a column, and the leaves are listed alongside the stem to which they belong.
- In a dotplot, each data value is represented by a dot placed above a horizontal axis.
- The height of a column of dots shows how many repetitions there are of that value.
The Distribution of a Quantitative Variable
- The distribution of quantitative data is described by reference to shape, center, variability, and unusual features such as outliers, clusters, and gaps.
- When a distribution has a longer tail on either the right or left, the distribution is said to be skewed in that direction.
- If the right and left sides are approximately mirror images, the distribution is symmetric.
- A distribution with a single peak is unimodal; if it has two distinct peaks, it is bimodal.
- A distribution without any noticeable peaks is uniform.
- An outlier is a value that is unusually large or small.
- A gap is a significant interval that contains no data points, and a cluster is an interval that contains a high concentration of data points.
- In many cases, a cluster will be surrounded by gaps.
Summary Statistics and Outliers
- A statistic is a value that summarizes and is derived from a sample.
- Measures of center and position include the mean, median, quartiles, and percentiles.
- The commonly used measures of variability are variance, standard deviation, range, and IQR.
- The mean of a sample is denoted \bar{x} and is defined as the sum of the values divided by the number of values. That is, \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. The median is the value in the center when the data points are in order.
- The first quartile, Q1, and the third quartile, Q3, are the medians of the lower and upper halves of the data set.
- The pth percentile is the data point that has p% of the data less than or equal to it.
- The first and third quartiles are the 25th and 75th percentiles, respectively.
- The range of a data set is the difference between the maximum and minimum values, and the interquartile range, or IQR, is the difference between the first and third quartiles. That is, IQR = Q3 - Q1.
- Variance is defined in terms of the squares of the differences between the data points and the mean. More precisely, the variance s^2 is given by the formula s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. The standard deviation is then simply the square root of the variance: s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}.
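The measures of center and variability defined above can be computed directly from their formulas. Below is a minimal sketch using a hypothetical data set; the quartiles are found as medians of the lower and upper halves, matching the definition used in this guide.

```python
# Sketch: computing summary statistics from their definitions.
# The data set here is a hypothetical example.
import statistics

data = [4, 7, 8, 5, 9, 12, 6, 10]

n = len(data)
mean = sum(data) / n                      # x-bar = (1/n) * sum of x_i
median = statistics.median(data)

ordered = sorted(data)
lower, upper = ordered[:n // 2], ordered[(n + 1) // 2:]
q1 = statistics.median(lower)             # median of the lower half
q3 = statistics.median(upper)             # median of the upper half
iqr = q3 - q1                             # IQR = Q3 - Q1
data_range = max(data) - min(data)

variance = sum((x - mean) ** 2 for x in data) / (n - 1)  # s^2
std_dev = variance ** 0.5                                # s

print(mean, median, q1, q3, iqr, data_range)
```

For this data set the mean is 7.625, the median 7.5, and the IQR is 9.5 - 5.5 = 4.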
- When units of measurement are changed, summary statistics behave in predictable ways that depend on the type of operation done.
| Statistic | Original value | Value after multiplying all data points by a constant c | Value after adding a constant c to all data points |
|---|---|---|---|
| Mean | \bar{x} | c\bar{x} | \bar{x} + c |
| Median/Quartile/Percentile | m | cm | m + c |
| Range/IQR | r | cr | r |
| Variance | s^2 | c^2s^2 | s^2 |
| Standard deviation | s | cs | s |
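The rows of the table above can be checked empirically. This sketch applies both operations to a hypothetical data set with a hypothetical constant c and verifies the behavior of the mean and standard deviation.

```python
# Sketch verifying the transformation table: effect of multiplying by a
# constant and of adding a constant. Data and c are hypothetical.
import math
import statistics

data = [2.0, 4.0, 6.0, 8.0, 10.0]
c = 3.0

scaled = [c * x for x in data]    # multiply every data point by c
shifted = [x + c for x in data]   # add c to every data point

# Mean: multiplied by c, or shifted by c
print(statistics.mean(data), statistics.mean(scaled), statistics.mean(shifted))
# Standard deviation: multiplied by c, or unchanged
print(statistics.stdev(data), statistics.stdev(scaled), statistics.stdev(shifted))
```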
- There are many possible ways to define an outlier. There are two methods commonly used in AP Statistics, depending on what statistic is being used to describe the spread of the distribution.
- When the IQR is used to describe the spread, the 1.5 \times IQR rule is used to define outliers.
- Under this rule, a value is considered an outlier if it lies more than 1.5 \times IQR away from one of the quartiles.
- Specifically, an outlier is a value that is either less than Q1 - 1.5 \times IQR or greater than Q3 + 1.5 \times IQR.
- On the other hand, if the standard deviation is being used to describe the variation of the distribution, then any value that is more than 2 standard deviations away from the mean is considered an outlier.
- In other words, a value is an outlier if it is less than \bar{x} - 2s or greater than \bar{x} + 2s.
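Both outlier rules can be sketched as short functions. The data set below is a hypothetical example containing one clear outlier; note that `statistics.quantiles` uses a common interpolation method that may differ slightly from the median-of-halves quartiles used on the AP exam.

```python
# Sketch of the two outlier rules: 1.5*IQR and 2 standard deviations.
import statistics

def outliers_iqr(data):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

def outliers_2sd(data):
    """Values more than 2 sample standard deviations from the mean."""
    m, s = statistics.mean(data), statistics.stdev(data)
    return [x for x in data if x < m - 2 * s or x > m + 2 * s]

data = [10, 11, 12, 12, 13, 13, 14, 15, 40]   # hypothetical data set
print(outliers_iqr(data))   # 40 is flagged by the IQR rule
print(outliers_2sd(data))   # 40 is also more than 2 sd above the mean
```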
- If the existence of an outlier does not have a significant effect on the value of a certain statistic, we say that statistic is resistant (or robust).
- The median and IQR are examples of resistant statistics.
- On the other hand, some statistics, including mean, standard deviation, and range, are changed significantly by an outlier.
- These statistics are called nonresistant (or non-robust).
- Related to the idea of robustness is the relationship between mean and median in skewed distributions.
- If a distribution is close to symmetric, the mean and median will be approximately equal to each other.
- On the other hand, in a skewed distribution the mean will usually be pulled in the direction of the skew.
- That is, if the distribution is skewed right, the mean will usually be greater than the median, while if the distribution is skewed left, the mean will usually be less than the median.
Graphs of Summary Statistics
- The five-number summary of a data set is composed of the following five values, in order: minimum, first quartile, median, third quartile, and maximum.
- A boxplot is a graphical representation of the five-number summary that can be drawn vertically or horizontally along a number line.
- In a boxplot, a box is constructed that spans the distance between the quartiles.
- A line, representing the median, cuts the box in two.
- Lines, often called whiskers, connect the ends of the box with the maximum and minimum points.
- If the set contains one or more outliers, the whiskers end at the most extreme values that are not outliers, and the outliers themselves are indicated by stars or dots.
- Note that the two sections of the box, along with the two whiskers, each represent a section of the number line that contains approximately 25% of the values.
- Boxplots can be used to compare two or more distributions to each other.
- The relative positions and sizes of the sections of the box and the whiskers can demonstrate differences in the center and spread of the distributions.
The Normal Distribution
- A normal distribution is unimodal and symmetric.
- It is often described as a bell curve.
- In fact, there are infinitely many normal distributions.
- Any single one is described by two parameters: the mean, \mu, and the standard deviation, \sigma.
- The mean is the center of the distribution, and the standard deviation determines whether the peak is relatively tall and narrow or short and wide.
- The empirical rule gives guidelines for how much of a normally distributed data set is located within certain distances from the center.
- Approximately 68% of the data points are within 1 standard deviation of the mean.
- Approximately 95% are within 2 standard deviations of the mean.
- Approximately 99.7% are within 3 standard deviations of the mean.
- In practice, many sets of data that arise in statistics can be described as approximately normal: they are well modeled by a normal distribution, although it is rarely perfect.
- The standardized score, or z-score, of a data point is the number of standard deviations above or below the mean at which it lies.
- The formula is z = \frac{x - \mu}{\sigma}.
- It is analogous to a percentile in the sense that it describes the relative position of a point within a data set.
- If the z-score is positive, the value is greater than the mean, while if it is negative, the value is less than the mean.
- In either case, the absolute value of the z-score describes how far away the value is from the center of the distribution.
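The z-score formula is a one-line computation. This sketch uses a hypothetical distribution with assumed parameters \mu = 100 and \sigma = 15.

```python
# Sketch: standardized scores (z-scores) for a hypothetical normal
# distribution; mu and sigma are assumed known parameters.
mu, sigma = 100.0, 15.0

def z_score(x):
    # z = (x - mu) / sigma: number of standard deviations from the mean
    return (x - mu) / sigma

print(z_score(130))   # 2.0 -> two standard deviations above the mean
print(z_score(85))    # -1.0 -> one standard deviation below the mean
```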
Exploring Two-Variable Data
- On your AP exam, 5‒7% of questions will fall under the topic of Exploring Two-Variable Data.
Two Categorical Variables
- When a data set involves two categorical variables, a contingency table can show how the data points are distributed among the categories.
- Totals can be calculated for the rows and columns, along with a grand total for the entire table.
- The entries can be given as relative frequencies by representing the value in each cell as a percentage of either its row total or its column total.
- When the percentages are relative to the column totals, for example, each column sums to 100%, and each column then shows the conditional distribution of the row variable.
- The row or column totals, each shown as a percentage of the table total, are referred to as a marginal distribution.
- If the entries are given as relative frequencies by dividing by the total for the entire table, rather than by the row or column totals, the entries are referred to as joint relative frequencies.
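The three kinds of relative frequencies can be computed from a single table of counts. The 2x2 table below (rows and column labels included) is entirely hypothetical.

```python
# Sketch: joint, marginal, and conditional relative frequencies for a
# hypothetical 2x2 contingency table (row variable x column variable).
table = {
    ("male",   "yes"): 30, ("male",   "no"): 20,
    ("female", "yes"): 25, ("female", "no"): 25,
}
grand_total = sum(table.values())                      # 100

# Joint relative frequency: each cell divided by the grand total.
joint = {cell: count / grand_total for cell, count in table.items()}

# Marginal distribution of the row variable.
rows = {"male", "female"}
marginal_row = {r: sum(v for (ri, _), v in table.items() if ri == r) / grand_total
                for r in rows}

# Conditional distribution of the column variable given row = "male".
male_total = sum(v for (r, _), v in table.items() if r == "male")
conditional_male = {c: table[("male", c)] / male_total for c in ("yes", "no")}

print(joint[("male", "yes")])     # 30/100 = 0.3
print(marginal_row["male"])       # 50/100 = 0.5
print(conditional_male["yes"])    # 30/50  = 0.6
```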
Two Quantitative Variables
- When data consists of two quantitative variables, it can be represented as a scatterplot, which shows the relationship between the two variables.
- The variables are assigned to the x- and y-axes, and then each observation can be represented by a point on the xy-plane.
- The variable that is chosen for the x-axis is often referred to as the explanatory variable, while the variable represented on the y-axis is the response variable.
- A scatterplot shows what kind of association, if any, exists between the two variables.
- The direction of the association can be described as positive or negative; positive means that as one variable increases, the other increases as well, while negative means that as one variable increases, the other decreases.
- The form of an association describes the shape that the points make.
- In particular, we are generally most interested in whether or not the association is linear.
- When it is nonlinear, it may also be described as having another form, such as exponential or quadratic.
- The strength of an association is determined by how closely the points in the scatterplot follow a pattern, whether the pattern is linear or not.
- Finally, a scatterplot might have some unusual features.
- Just as with data involving a single variable, these features include clusters and outliers.
Correlation
- The correlation between two variables is a single number, r, that quantifies the direction and strength of a linear association: r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right).
- In this formula, s_x and s_y denote the sample standard deviations of the x and y variables, respectively.
- Although it is possible to calculate r by hand, it is impractical for all but the smallest data sets.
- The correlation is always between –1 and 1.
- The sign of r indicates the direction of the association, and the absolute value is a measure of its strength: values close to 0 indicate a weak association, and the strength increases as the values move toward –1 or 1.
- If r is 0, there is absolutely no linear relationship between the variables, whereas an r of –1 or 1 indicates a perfect linear relationship.
- It is important to note that a value close to –1 or 1 does not, by itself, imply that a linear model is appropriate for the data set.
- On the other hand, a value close to 0 does indicate that a linear model is probably not appropriate.
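The correlation can be computed directly from its definition as an average product of z-scores. The paired data below is a hypothetical example.

```python
# Sketch: computing r from its definition,
# r = (1/(n-1)) * sum of (z-score of x_i) * (z-score of y_i).
import statistics

xs = [1, 2, 3, 4, 5]       # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]       # hypothetical response values

n = len(xs)
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)

r = sum((x - x_bar) / s_x * (y - y_bar) / s_y
        for x, y in zip(xs, ys)) / (n - 1)
print(round(r, 4))    # positive, moderately strong linear association
```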
Regression and Residuals
- A linear regression model is a linear equation that relates the explanatory and response variables of a data set.
- The model is given by \hat{y} = a + bx, where a is the y-intercept, b is the slope, x is the value of the explanatory variable, and \hat{y} is the predicted value of the response variable.
- The purpose of the linear regression model is to predict a y given an x that does not appear within the data set used to construct the model.
- If the x used is outside of the range of x-values of the original data set, using the model for prediction is called extrapolation.
- This tends to yield less reliable predictions than interpolation, which is the process of predicting y-values for x-values that are within the range of the original data set.
- Since regression models are rarely perfect, we need methods to analyze the prediction errors that occur.
- The difference between an actual y and the predicted y, y - \hat{y}, is called a residual.
- When the residuals for every data point are calculated and plotted versus the explanatory variable, x, the resulting scatterplot is called a residual plot.
- A residual plot gives useful information about the appropriateness of a linear model.
- In particular, any obvious pattern or trend in the residuals indicates that a linear model is probably inappropriate.
- When a linear model is appropriate, the points on the residual plot should appear random.
- The most common method for creating a linear regression model is called least-squares regression.
- The least-squares model is defined by two features: it minimizes the sum of the squares of the residuals, and its line passes through the point (\bar{x}, \bar{y}).
- The slope b of the least-squares regression line is given by the formula b = r \frac{s_y}{s_x}.
- The slope of the line is best interpreted as the predicted amount of change in y for every unit increase in x.
- Once the slope is known, the y-intercept, a, can be determined by ensuring that the line contains the point (\bar{x}, \bar{y}): a = \bar{y} - b\bar{x}.
- The y-intercept represents the predicted value of y when x is 0.
- Depending on the type of data under consideration, however, this may or may not have a reasonable interpretation.
- The y-intercept is always needed to define the line, but it does not necessarily have contextual significance.
- The square of the correlation r, or r^2, is also called the coefficient of determination.
- Its interpretation is subtle: it is usually described as the proportion of the variation in y that is explained by its linear relationship with x.
- There are three ways to classify unusual points in the context of linear regression:
- A point that has a particularly large residual is called an outlier.
- A point whose x-value is much larger or smaller than those of the other points is called a high-leverage point.
- An influential point is any point that, if removed, would cause a significant change in the regression model.
- There are situations in which transforming one of the variables results in a stronger linear model than the original data allows.
- A successful transformation yields a higher correlation and a residual plot that does not show any obvious patterns, indicating that the transformed data is well suited to a linear model.
- There are many other transformations that can be tried, including squaring or taking the square root of one of the variables.
Collecting Data
- About 12‒15% of the questions on your AP Statistics exam will cover the category of Collecting Data.
Planning a Study
- The entire set of people, items, or subjects of interest to us is called a population.
- Because it is often not feasible to collect data from a population, a sample, or smaller subset, is selected from the population.
- One of the goals of statistics is to use sample data to make reliable inferences about populations.
- Once a sample is selected, data collection must take place.
- In an experiment, the participants or subjects are explicitly assigned to two or more different conditions, or treatments.
- For example, a medical study investigating a new cold medication might assign half of the people in the study to a group that receives the medication, and the other half to a group that receives an older medication.
- Experiments are the only way to determine causal relationships between variables.
- In the experiment just described, the manufacturer of the medication would like to be able to state that taking their medication causes a reduced duration of the cold.
- When experiments are not possible to do for logistical or ethical reasons, observational studies often take their place.
- In an observational study, treatments are not assigned.
- Rather, data that already exists is collected and analyzed.
- As noted, an observational study can never be used to determine causality.
- Whether a study is experimental or observational, it is important to keep in mind that the results can only be generalized to the population from which the sample was selected.
Data Collection
- The methods used in collecting data play a large role in determining what conclusions can be drawn from statistical analysis of the data.
- A sampling method is a technique, or plan, used in selecting a sample from a population.
- When a sampling method allows for the possibility of an item being selected more than once, the sampling is said to be done with replacement.
- If that is not possible, so that each item can be selected at most once, the sampling is without replacement.
- A random sample is one in which every item from the population has an equal chance of being chosen for the sample.
- A simple random sample, or SRS, is one in which every group of a given size has an equal chance of being chosen.
- Every simple random sample is also random, but the opposite is not true: some sampling techniques lead to random samples that are not simple random samples.
- In a stratified sample, the population is first divided into groups, or strata, based on some shared characteristic.
- A sample is then selected from within each stratum, and these are combined into a single larger sample.
- A stratified sample may be random, but it will never be an SRS.
- Another kind of sample is called a cluster sample.
- As with a stratified sample, the population is first divided into groups, called clusters.
- A sample of clusters is then chosen, and every item within each of the chosen clusters is used as part of the larger sample.
- Here again, a cluster sample may be random, but it will never be an SRS.
- A systematic random sample consists of choosing a random starting point within a population and then selecting every item at a fixed periodic interval.
- For example, perhaps every 10th item in a list is chosen.
- Again, this kind of sample is not an SRS.
- Each of these sampling methods has pros and cons that depend on the population from which they are drawn, as well as the kind of study being done.
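Several of the sampling methods above can be sketched with Python's `random` module. The population of 20 numbered items, the strata boundaries, and the sampling interval are all hypothetical choices for illustration.

```python
# Sketch of three sampling methods using a hypothetical population of 20 IDs.
import random

random.seed(0)   # fixed seed so the illustration is reproducible
population = list(range(1, 21))

# Simple random sample (SRS) of size 4: every group of 4 is equally likely.
srs = random.sample(population, 4)

# Stratified sample: split into two strata (IDs 1-10 and 11-20), sample 2 each.
strata = [population[:10], population[10:]]
stratified = [x for stratum in strata for x in random.sample(stratum, 2)]

# Systematic random sample: random starting point, then every 5th item.
start = random.randrange(5)
systematic = population[start::5]

print(len(srs), len(stratified), len(systematic))
```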
Problems with Sampling
- There are many potential problems with sampling that can lead to unreliable statistical conclusions.
- Bias occurs when certain values or responses are more likely to be obtained than others.
- Examples of bias include:
- Voluntary response bias, which occurs when a sample consists of people who choose to participate
- Undercoverage bias, which happens when some segment of the population has a smaller chance of being included in a sample
- Nonresponse bias, which happens when data cannot be obtained from some part of the chosen sample
- Question wording bias, which is the result of confusing or leading questions
- A random sample, and specifically a simple random sample, is an important tool in helping to avoid bias, though it certainly does not guarantee that bias will not occur.
Experimental Design
- A well-designed experiment is the only kind of statistical study that can lead to a claim of a causal relationship.
- A sample is broken into one or more groups, and each group is assigned a treatment.
- The results of the data collection that follows show the effect that the treatment had on the subjects.
- In an experiment, the experimental units are the individuals that are assigned one of the treatments being investigated; these may or may not be people.
- When they are people, they are also called participants or subjects.
- The explanatory variable in an experiment is whatever variable is being manipulated by the experimenter, and the different values that it takes on are called treatments.
- The response variable is the outcome that is measured to determine what effects, if any, the treatments had.
- A potential problem in any experiment is the existence of confounding variables.
- A confounding variable has an effect on the response variable, and may create the impression of a relationship between the explanatory and response variable even where none exists.
- When possible, confounding variables should be controlled for by careful design of treatments and data collection.
- Even when they cannot be entirely controlled for, they should be acknowledged as potentially having an effect on the results of the experiment.
- A well-designed experiment should always consist of at least two treatment groups, so that the treatment under investigation can be compared to something else.
- Often, it is compared to a control group, whose sole purpose is to provide comparison data.
- The control group either receives no treatment, or treatment with an inactive substance called a placebo.
- It is important to realize, however, that there is a well-documented phenomenon called the placebo effect, in which subjects do respond to treatment with a placebo.
- Blinding is a precaution taken to ensure that the subjects and/or the researcher do not know which treatment is being given to a particular individual.
- In a single-blind experiment, one of the two parties (the subject or the researcher) knows which treatment is being given, but the other does not.
- In a double-blind experiment, neither party has this information.
- The experimental units should always be randomly assigned to the different treatment groups; if they are not, bias of the sort discussed in the previous section is likely to be an issue.
- In a completely randomized design, experimental units are assigned to treatment groups completely at random.
- This is usually done using random number generators, or some other technique for generating random choices.
- This design is most useful for controlling confounding variables.
- In a randomized block design, the experimental units are first grouped, or blocked, based on a blocking variable.
- The members of each block are then randomly assigned to treatment groups.
- This means that all the values of the blocking variable are represented in each treatment group, which helps ensure that it does not act as a confounding variable in the experiment.
- A matched pairs design is a particular kind of block design in which the experimental units are first arranged into pairs based on factors relevant to the experiment.
- Each pair is then randomly split into the two treatment groups.
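Completely randomized assignment, described above, can be sketched as a shuffle followed by a split. The unit labels and group sizes are hypothetical.

```python
# Sketch: completely randomized assignment of 12 hypothetical experimental
# units to two equally sized groups (treatment and control).
import random

random.seed(1)
units = [f"unit{i}" for i in range(1, 13)]   # 12 experimental units

shuffled = units[:]
random.shuffle(shuffled)        # put the units in a random order
treatment = shuffled[:6]        # first half -> treatment group
control = shuffled[6:]          # second half -> control group

print(len(treatment), len(control))
```

A randomized block design would repeat this shuffle-and-split within each block instead of over the whole sample.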
Probability, Random Variables, and Probability Distributions
- On your AP Statistics exam, 10‒20% of questions will cover the topic of Probability, Random Variables, and Probability Distributions.
Basic Probability
- The field of probability involves random processes, that is, processes whose results are determined by chance.
- The set of all possible outcomes is called the sample space, and an event is any subset of the sample space.
- The probability of an event is the likelihood of it occurring and is represented as a number between 0 and 1, inclusive.
- If the chance process is repeatable, the probability can be interpreted as the relative frequency with which the event will occur if the process is repeated many times.
- If all of the outcomes in the sample space are equally likely to occur, then the probability of an event E is the ratio of the number of outcomes in E to the number of outcomes in the sample space.
- The complement of an event E, denoted E' or E^c, is the event that consists of all outcomes that are not in E.
- The probability of an event and its complement always sum to one: P(E) + P(E’) = 1.
- Rearranging the terms, this is equivalent to P(E') = 1 - P(E).
- In many real-world situations, probabilities can be very difficult to calculate.
- When this happens, simulation can be used.
- Simulation is a technique in which random events are simulated in a way that matches as closely as possible the random process that gives rise to the probability.
- This is usually done by generating random numbers.
- The simulation can be repeated many times, and the simulated outcome examined for each repetition.
- The relative frequency of an event in this sequence of simulated outcomes is an estimate of the probability of the event.
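The simulation procedure above can be sketched directly. As a hypothetical example, we estimate the probability of rolling at least one six in four rolls of a fair die, whose exact value is 1 - (5/6)^4 ≈ 0.5177.

```python
# Sketch: estimating a probability by simulation with random numbers.
import random

random.seed(42)
trials = 100_000
hits = 0
for _ in range(trials):
    rolls = [random.randint(1, 6) for _ in range(4)]   # simulate four rolls
    if 6 in rolls:                                      # did the event occur?
        hits += 1

estimate = hits / trials    # relative frequency estimates the probability
print(round(estimate, 3))   # should be near the exact value 0.5177
```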
Joint and Conditional Probability
- When a probability involves two events both occurring, it is referred to as a joint probability.
- The joint event is denoted using the intersection symbol \cap, as in A \cap B.
- Sometimes we are interested in a probability that depends on knowledge about whether or not another event occurred.
- This is called a conditional probability.
- The probability that an event A will occur given that another event B is known to have occurred is denoted P(A|B), and its value is given by P(A|B) = \frac{P(A \cap B)}{P(B)}.
- Rearranging the terms in this formula, we get the multiplication rule for joint probabilities: P(A \cap B) = P(A) \cdot P(B|A).
- If P(A|B) = P(A), then events A and B are said to be independent.
- The significance of independence is that whether or not one of the events occurs has no influence on the probability of the other event.
- The roles of A and B can always be switched, so that P(B|A) = P(B) will also be true if A and B are independent.
- Another important consequence of independence is that the multiplication rule simplifies to P(A \cap B) = P(A) \cdot P(B).
- This last equation can also be used to check for independence.
Unions and Mutually Exclusive Events
- The event consisting of either A or B occurring is called a union, and is denoted by A \cup B.
- Its probability is given by the addition rule: P(A \cup B) = P(A) + P(B) - P(A \cap B).
- Note that the union is inclusive, so any outcomes that are in both A and B are included in A \cup B.
- Two events are called mutually exclusive if they cannot both occur, so that their joint probability is 0.
- In other words, A and B are mutually exclusive if P(A \cap B) = 0.
- When this occurs, the last term in the addition rule given previously is 0.
- Therefore, if A and B are mutually exclusive, the addition rule simplifies to P(A \cup B) = P(A) + P(B).
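The conditional-probability, multiplication, and addition rules can all be checked on a small equally-likely sample space. The events below (for a single fair die roll) are hypothetical examples.

```python
# Sketch: probability rules on one roll of a fair die.
# A = "roll is even" = {2, 4, 6}; B = "roll is at most 4" = {1, 2, 3, 4}.
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {1, 2, 3, 4}

def p(event):
    # Equally likely outcomes: P(E) = |E| / |S|
    return Fraction(len(event), len(sample_space))

p_a_and_b = p(A & B)                    # joint probability P(A and B)
p_a_given_b = p_a_and_b / p(B)          # P(A|B) = P(A and B) / P(B)
p_a_or_b = p(A) + p(B) - p_a_and_b      # addition rule for the union

print(p_a_given_b)                      # 1/2
print(p_a_or_b)                         # 5/6
print(p_a_given_b == p(A))              # True: A and B are independent
```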
Random Variables and Probability Distributions
- A random variable is a variable whose numerical value depends on the outcome of a random experiment, so that it takes on different values with certain probabilities.
- A random variable is called discrete if it can take on finitely or countably many values.
- The sum of the probabilities of the possible values is always equal to 1, since they represent all possible outcomes of the experiment.
- A probability distribution represents the possible values of a random variable along with their respective probabilities.
- It is often represented as a table or graph, as in the following example:
| X | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| P(X = x) | 0.2 | 0.3 | 0.1 | 0.25 | 0.15 |
- The table shows a random variable X that can take on each of the values 1, 2, 3, 4, and 5.
- It takes on the value 1 with probability 0.2, the value 2 with probability 0.3, and so on.
- Note that the sum of the probabilities is 0.2 + 0.3 + 0.1 + 0.25 + 0.15 = 1, as expected.
- The notation P(X = x) in the second row represents the probability of the random variable X taking on one of its possible values x.
- Sometimes it is beneficial to have a cumulative probability distribution, which shows the probability that the random variable is less than or equal to each given value.
- The cumulative distribution for the example in the previous table is as follows:
| X | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| P(X <= x) | 0.2 | 0.5 | 0.6 | 0.85 | 1 |
- A probability distribution has a mean and a standard deviation, just like a population.
- The mean, or expected value, of a discrete random variable X is \mu_X = \sum x_i \cdot P(x_i).
- Its standard deviation is \sigma_X = \sqrt{\sum (x_i - \mu_X)^2 \cdot P(x_i)}.
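These formulas can be applied directly to the probability distribution table shown earlier in this section.

```python
# Sketch: mean and standard deviation of the discrete random variable X
# from the probability distribution table above.
import math

values = [1, 2, 3, 4, 5]
probs = [0.2, 0.3, 0.1, 0.25, 0.15]

mu = sum(x * p for x, p in zip(values, probs))   # expected value of X
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in zip(values, probs)))

print(round(mu, 2))        # 2.85
print(round(sigma, 4))     # about 1.3883
```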
Combining Random Variables
- If X and Y are two discrete random variables, a new random variable can be constructed by combining X and Y in a linear combination aX + bY, where a and b are any real numbers.
- The mean of this new random variable is \mu_{aX+bY} = a\mu_X + b\mu_Y.
- If the two variables are independent, so that information obtained about one of them does not affect the distribution of the other, then the standard deviation of the linear combination is \sigma_{aX+bY} = \sqrt{a^2\sigma_X^2 + b^2\sigma_Y^2}.
- If the variables are not independent, the computation of the standard deviation of the linear combination is well beyond the scope of AP Statistics.
- A single random variable can also be transformed into a new one by means of the linear equation Y = a + bX.
- The mean of the transformed variable is \mu_Y = a + b\mu_X, and its standard deviation is \sigma_Y = |b|\sigma_X.
- In addition, if b is positive, then the distribution of Y has the same shape as the distribution of X.
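The mean and standard deviation formulas for a linear combination of independent variables can be checked by simulation. The distributions and constants below are hypothetical choices.

```python
# Sketch: simulating Z = a*X + b*Y for independent X and Y to check
# mu_Z = a*mu_X + b*mu_Y and sigma_Z = sqrt(a^2*sigma_X^2 + b^2*sigma_Y^2).
import math
import random

random.seed(7)
a, b = 2.0, 3.0
n = 200_000

xs = [random.gauss(10, 2) for _ in range(n)]   # X: mean 10, sd 2
ys = [random.gauss(5, 1) for _ in range(n)]    # Y: mean 5, sd 1, independent of X
zs = [a * x + b * y for x, y in zip(xs, ys)]

mean_z = sum(zs) / n
var_z = sum((z - mean_z) ** 2 for z in zs) / (n - 1)

print(round(mean_z, 1))              # near a*10 + b*5 = 35
print(round(math.sqrt(var_z), 1))    # near sqrt(4*4 + 9*1) = 5
```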
Binomial and Geometric Distributions
- A Bernoulli trial is an experiment that satisfies the following conditions:
- There are only two possible outcomes, called success and failure
- The probability of success is the same every time the experiment is conducted
- We will let p denote the probability of success.
- Because failure is the complement of success, the probability of failure is then 1 – p.
- Consider repeating a Bernoulli trial n times and counting the number of successes that occur in these repetitions.
- If we call the number of successes X, then X is called a binomial random variable.
- The probability of exactly x successes in n trials is given by
P(X = x) = {n \choose x} p^x (1-p)^{n-x}. Here {n \choose x} is the binomial coefficient, often referred to as a combination. Its value is given by {n \choose x} = \frac{n!}{(n-x)!x!}.
- The mean of a binomial random variable is \mu_X = np, and its standard deviation is \sigma_X = \sqrt{np(1-p)}.
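The binomial formula translates directly into code. The values n = 10 and p = 0.5 below are a hypothetical example.

```python
# Sketch: binomial probability P(X = x) = C(n, x) p^x (1-p)^(n-x),
# with mean np and standard deviation sqrt(np(1-p)).
import math

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.5                      # hypothetical number of trials and p
mu = n * p                          # mean of the binomial random variable
sigma = math.sqrt(n * p * (1 - p))  # its standard deviation

print(binomial_pmf(5, n, p))        # probability of exactly 5 successes
print(mu, round(sigma, 4))
```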
- A geometric random variable is also related to Bernoulli trials.
- Unlike a binomial random variable, a geometric random variable X is the number of the trial on which a success first occurs.
- The value is