AP Statistics Unit 2 Notes: Understanding Linear Regression for Two-Variable Data
Least-Squares Regression Lines
What a regression line is (and what it is not)
When you have two quantitative variables measured on the same individuals—an explanatory variable x and a response variable y—a **regression line** is a mathematical model you use to describe how y tends to change as x changes. In AP Statistics, the main model you start with is the least-squares regression line (LSRL), written as:
\hat{y} = a + bx
Here, \hat{y} (read “y-hat”) is the **predicted value** of y for a given x. The line is not claiming every point will lie on it—real data vary. Instead, it summarizes the overall linear trend.
A common misconception is to treat the regression line as “the true relationship.” In statistics, it’s better to think of it as a best-fitting linear approximation to the pattern in the data you observed.
Why “least squares” matters
Many lines could be drawn through a scatterplot. The least-squares line is special because it chooses the slope and intercept that make the vertical prediction errors as small as possible overall.
For any data point, the vertical difference between the actual value y and the predicted value \hat{y} is called a residual (you’ll study residuals deeply in the next major section). Least squares chooses the line that minimizes the sum of the squared residuals:
\sum (y - \hat{y})^2
Why square them?
- Squaring makes negative and positive residuals not cancel.
- Squaring penalizes large errors more heavily, so the fit is strongly influenced by points far from the line.
That “penalize large errors” idea is also why unusual points can have a big effect on the regression line.
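The minimization idea can be checked directly. Below is a small sketch using hypothetical study-time data (the numbers are invented for illustration): `np.polyfit` computes the least-squares line, and nudging either the slope or the intercept away from that fit always increases the sum of squared residuals.

```python
import numpy as np

# Hypothetical data: hours studied (x) and quiz scores (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([62.0, 70.0, 71.0, 80.0, 83.0])

# Least-squares fit: np.polyfit returns [slope, intercept] for degree 1.
b, a = np.polyfit(x, y, 1)

def ssr(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return np.sum((y - (a + b * x)) ** 2)

best = ssr(a, b)

# Any other line has a strictly larger SSR than the LSRL.
assert ssr(a, b + 0.5) > best
assert ssr(a + 1.0, b) > best
print(f"LSRL: y-hat = {a:.2f} + {b:.2f}x, SSR = {best:.2f}")
```

Because SSR is a convex function of the slope and intercept, the least-squares pair is the unique minimizer, which is why every perturbation raises the total.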
How the LSRL is determined (key formulas and meaning)
In AP Statistics, you are expected to know (and use) the standard relationships among the LSRL, correlation, and summary statistics.
The slope b of the LSRL is:
b = r\frac{s_y}{s_x}
- r is the correlation between x and y.
- s_x is the standard deviation of x.
- s_y is the standard deviation of y.
The intercept a is:
a = \bar{y} - b\bar{x}
- \bar{x} is the mean of x.
- \bar{y} is the mean of y.
Two extremely important interpretations:
- Slope interpretation: For each increase of 1 unit in x, the predicted y changes by b units (on average, according to the model).
- Intercept interpretation: When x = 0, the predicted value of y is a.
That second one causes trouble: the intercept is only meaningful if x = 0 is within the scope of the data (or at least reasonable). If x = 0 is far outside the observed range, the intercept may be a purely mathematical artifact.
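These formulas agree exactly with a direct least-squares fit, which is worth seeing once. The sketch below uses invented data: it computes b and a from r, the standard deviations, and the means, then checks the result against `np.polyfit`.

```python
import numpy as np

# Hypothetical data, just to check the summary-statistic formulas.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([65.0, 72.0, 70.0, 82.0, 88.0])

r = np.corrcoef(x, y)[0, 1]
b = r * np.std(y, ddof=1) / np.std(x, ddof=1)  # b = r * s_y / s_x
a = np.mean(y) - b * np.mean(x)                # a = y-bar - b * x-bar

# The same line, fit directly by least squares.
b_fit, a_fit = np.polyfit(x, y, 1)
assert np.isclose(b, b_fit) and np.isclose(a, a_fit)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```

The intercept formula also explains a property tested below: since a = \bar{y} - b\bar{x}, plugging x = \bar{x} into the line always returns \bar{y}.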
Key properties you can rely on
The LSRL has a few properties that are frequently tested because they let you reason without redoing calculations:
- The regression line always passes through the point (\bar{x}, \bar{y}).
- If r = 0, then b = 0, so the LSRL is horizontal at \hat{y} = \bar{y}.
- The sign of b matches the sign of r (a positive association gives a positive slope; a negative association gives a negative slope).
Measuring how well the line fits: r^2
The coefficient of determination, r^2, measures the proportion of variation in y that is explained by the linear relationship with x (using the regression line).
If r^2 = 0.64, you say: “About 64% of the variability in y is explained by the linear regression of y on x.”
Two common misunderstandings to avoid:
- r^2 is not the percent of points “on” the line.
- A high r^2 does not prove causation. It only describes how tightly the points follow a linear pattern.
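The phrase "proportion of variation explained" has a precise meaning: for simple linear regression, r^2 = 1 - \frac{\sum(y - \hat{y})^2}{\sum(y - \bar{y})^2}. The sketch below (with invented data) verifies that the squared correlation matches this ratio.

```python
import numpy as np

# Hypothetical data: show that r^2 equals 1 - SSR/SST, i.e. the share of
# variation in y accounted for by the regression line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([55.0, 64.0, 61.0, 75.0, 79.0, 86.0])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ssr = np.sum((y - y_hat) ** 2)       # unexplained (leftover) variation
sst = np.sum((y - np.mean(y)) ** 2)  # total variation in y
r = np.corrcoef(x, y)[0, 1]

assert np.isclose(r ** 2, 1 - ssr / sst)
print(f"r^2 = {r**2:.3f}: about {100 * r**2:.0f}% of variation explained")
```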
Example: Building an LSRL from summary statistics
Suppose a class studies the relationship between hours studied x and quiz score y. You are given:
- \bar{x} = 4, s_x = 2
- \bar{y} = 78, s_y = 10
- r = 0.60
Step 1: Find the slope
b = r\frac{s_y}{s_x} = 0.60 \cdot \frac{10}{2} = 3
Interpretation: Each additional hour studied is associated with an increase of about 3 points in the predicted quiz score.
Step 2: Find the intercept
a = \bar{y} - b\bar{x} = 78 - 3(4) = 66
Regression equation
\hat{y} = 66 + 3x
Interpretation of intercept: A student who studied 0 hours would be predicted to score 66. Whether that’s meaningful depends on whether 0 hours is realistic in the context.
Prediction example: For x = 6 hours,
\hat{y} = 66 + 3(6) = 84
A vital exam habit: state the context and units in interpretations (hours, points).
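The whole worked example fits in a few lines of code. This sketch simply replays the arithmetic above from the given summary statistics:

```python
import math

# Given summary statistics from the study-hours example.
r, s_x, s_y = 0.60, 2.0, 10.0
x_bar, y_bar = 4.0, 78.0

b = r * s_y / s_x      # slope: 0.60 * (10 / 2) = 3
a = y_bar - b * x_bar  # intercept: 78 - 3(4) = 66
y_hat_6 = a + b * 6    # prediction for x = 6 hours

assert math.isclose(b, 3.0) and math.isclose(a, 66.0)
assert math.isclose(y_hat_6, 84.0)
print(f"y-hat = {a:.0f} + {b:.0f}x; predicted score at 6 hours: {y_hat_6:.0f}")
```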
Exam Focus
- Typical question patterns:
  - You’re given r, \bar{x}, \bar{y}, s_x, s_y and asked to find \hat{y} = a + bx and interpret b.
  - You’re given a regression equation and asked to predict \hat{y} for a value of x, then interpret the prediction in context.
  - You’re asked to interpret r^2 in context (proportion of variation explained).
- Common mistakes:
  - Interpreting r^2 as “percent accurate” or “percent of points on the line” instead of explained variability.
  - Giving a slope interpretation without units or without specifying “predicted” change in y per 1 unit of x.
  - Treating the intercept as meaningful when x = 0 is far outside the observed data (extrapolation issue).
Residuals, Residual Plots, and Assessing Linearity
Residuals: the model’s “errors” for each point
After you fit a regression line, you should immediately ask: “How well does it fit, and where does it fail?” A residual answers that question for a single data point.
For an observed pair (x, y), the residual is:
e = y - \hat{y}
- If e > 0, the point lies above the line (the model underpredicted).
- If e < 0, the point lies below the line (the model overpredicted).
Residuals matter because the regression line is only a summary. Patterns in residuals reveal whether a linear model is appropriate or whether something more complex is happening.
Residual plots: the main tool for checking a linear model
A residual plot is a graph of residuals versus the explanatory variable x (or sometimes versus the predicted values \hat{y}). You typically plot the points (x, e).
What you want to see if a linear model is appropriate:
- Residuals scattered randomly around 0.
- No clear curve or pattern.
- Roughly similar vertical spread across the range of x.
Why this works: If the model captures the main trend, what’s left (the residuals) should look like random noise. If the residuals still show structure, the line is missing something systematic.
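Two facts make "scatter around 0" the right expectation for an LSRL: the residuals always sum to zero, and they are uncorrelated with x (both follow from the least-squares conditions). The sketch below, with invented data, checks both.

```python
import numpy as np

# Hypothetical data: for an LSRL, the residuals average to zero and show
# no leftover linear trend in x -- the "random scatter around 0" idea.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([58.0, 66.0, 63.0, 77.0, 80.0, 85.0])

b, a = np.polyfit(x, y, 1)
e = y - (a + b * x)  # residuals

assert np.isclose(np.sum(e), 0.0, atol=1e-8)      # residuals sum to 0
assert np.isclose(np.sum(e * x), 0.0, atol=1e-8)  # no linear trend left vs x
```

Any remaining structure in a residual plot (a curve, a fan) is therefore something the line, by construction, could not absorb.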
Assessing linearity: what “good” and “bad” residual plots look like
A residual plot helps you diagnose several issues.
1) Nonlinearity (curved pattern)
If residuals form a curve (for example, positive, then negative, then positive), the relationship is likely not linear. A straight line is systematically too high in some regions and too low in others.
In real life, many processes curve: diminishing returns (studying more helps, but each additional hour helps less), growth that levels off, and so on.
2) Changing spread (non-constant variability)
If residuals start small and then “fan out” (or the reverse), the variability of y changes across x. This is often called non-constant variance.
A linear model could still describe the center trend, but predictions may be less reliable in parts of the x-range. On AP questions, you should describe what you see and connect it to the model’s appropriateness.
3) Outliers (unusually large residuals)
Points with residuals much larger in magnitude than others stand out. They can distort summaries and may strongly affect conclusions.
Important: an outlier in the y-direction means large vertical deviation from the line. That’s different from being extreme in x (which relates to leverage, discussed later).
Standard deviation of the residuals: typical prediction error
While residual plots show pattern visually, you also sometimes summarize typical residual size with the standard deviation of the residuals (also called the standard error of the regression), commonly denoted s:
s = \sqrt{\frac{\sum e^2}{n-2}}
- e are the residuals.
- n is the number of data points.
- The n-2 reflects that two parameters (slope and intercept) were estimated.
Interpretation: s estimates the typical distance (in y units) that the observed values fall from the regression line.
This is not the same as s_y:
- s_y is the spread of the original y values.
- s is the spread of the vertical errors around the fitted line.
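The contrast between s and s_y is easy to demonstrate. With invented data, the sketch below computes s from the residuals using the n - 2 formula and compares it to s_y:

```python
import numpy as np

# Hypothetical data: compute s, the standard deviation of the residuals,
# and contrast it with s_y, the spread of the raw y values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([60.0, 68.0, 69.0, 78.0, 81.0])

b, a = np.polyfit(x, y, 1)
e = y - (a + b * x)

n = len(x)
s = np.sqrt(np.sum(e ** 2) / (n - 2))  # typical prediction error, in y units
s_y = np.std(y, ddof=1)                # spread of y, ignoring x entirely

assert s < s_y  # the line explains some variation, so s is smaller
print(f"s = {s:.2f} points, s_y = {s_y:.2f} points")
```

Whenever the linear fit is reasonably strong, s comes out well below s_y, which is exactly the point of using x to predict y.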
Example: Computing and interpreting residuals
Using the earlier regression model:
\hat{y} = 66 + 3x
Suppose a student studied x = 6 hours and scored y = 80.
Step 1: Predict
\hat{y} = 66 + 3(6) = 84
Step 2: Compute residual
e = y - \hat{y} = 80 - 84 = -4
Interpretation: The student scored 4 points below what the model predicted.
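The same computation in code, with the subtraction order made explicit:

```python
# Residual for the worked example: model y-hat = 66 + 3x.
def predict(x):
    return 66 + 3 * x

x_obs, y_obs = 6, 80
residual = y_obs - predict(x_obs)  # y - y-hat, NOT y-hat - y

assert residual == -4  # 4 points below the model's prediction
```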
If you computed residuals for many students and graphed them against hours studied, you’d look for random scatter around 0. If low-study and high-study students tended to have negative residuals while middle-study students had positive residuals, that would suggest curvature (the line is not capturing the pattern).
Reading residual plots in words (a tested skill)
On AP Statistics questions, you are often given a residual plot and asked whether a linear model is appropriate. Your job is to describe the pattern and connect it to the model.
Strong answers usually include:
- A clear description (random scatter vs curved pattern vs fanning).
- A conclusion about linearity (appropriate or not).
- If not appropriate, what type of departure is suggested (curvature, changing variability, clusters).
A weak answer just says “it looks random” without describing what “random” means in the graph.
Exam Focus
- Typical question patterns:
  - You’re given a residual plot and asked whether a linear model is appropriate, with justification.
  - You’re given a regression equation and a data point and asked to compute and interpret the residual.
  - You’re asked to interpret s (typical prediction error) in context.
- Common mistakes:
  - Mixing up residual sign: students sometimes compute \hat{y} - y instead of y - \hat{y}.
  - Saying “no correlation” when the residual plot is curved; the correct point is “nonlinear association,” not necessarily “no association.”
  - Ignoring a fan-shaped residual plot and claiming the model is perfect; changing spread is a real departure you should mention.
Departures from Linearity and Influential Points
What counts as a “departure” from a linear model?
A departure from linearity happens when the relationship between x and y is not well described by a straight line, even if there is still a strong association.
The main types you should recognize (often through scatterplots and residual plots) are:
- Curvature: the trend bends.
- Changing variability: residual spread increases or decreases across x.
- Clusters or subgroups: points form separate clouds, possibly indicating a lurking variable (for example, two different populations mixed together).
- Outliers: points far from the overall pattern.
These matter because regression gives you numerical outputs (slope, intercept, r^2) that can look “official.” Departures remind you that the model is only valid if its assumptions are reasonably met.
Outliers vs high leverage: two different ways a point can be unusual
Students often call any unusual point an “outlier,” but in regression there are two distinct ideas.
- A response outlier (often just called an outlier in regression context) is a point with an unusually large residual—far above or below the regression line.
- A high leverage point is a point with an extreme x value compared to the rest of the data.
A point can be high leverage without being a large-residual outlier if it happens to fall near the extended line. And a point can be a large-residual outlier without being high leverage if it sits near the middle of the x range but far vertically.
Why leverage matters: because the regression line is chosen to minimize squared residuals, points far out in x can “pull” the line toward themselves, especially if they don’t line up with the existing pattern.
Influential points: when a point changes the regression line
An influential point is a point that, if removed, would noticeably change the regression line (slope and/or intercept). Influence is about impact on the fitted model, not just being unusual.
A classic (and testable) idea:
- High leverage points are the ones most likely to be influential.
- But not all high leverage points are influential; if they lie on the same trend, they may reinforce it rather than distort it.
A practical way to assess influence conceptually (often how AP problems frame it) is:
- Fit the regression line with all points.
- Remove the suspected influential point.
- Fit again (or reason about how the line would change).
- Compare the two lines.
If the slope/intercept changes a lot, the point is influential.
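The remove-and-refit check is easy to automate. This sketch uses invented data with one point that is both extreme in x and off the trend, and compares the slope with and without it:

```python
import numpy as np

# Hypothetical data: the last point has an extreme x value AND sits
# well below the trend of the other five points.
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 15.0])
y = np.array([10.0, 12.0, 13.0, 15.0, 16.0, 5.0])

b_all, a_all = np.polyfit(x, y, 1)           # fit with every point
b_wo, a_wo = np.polyfit(x[:-1], y[:-1], 1)   # refit without the suspect point

print(f"slope with point: {b_all:.2f}, without: {b_wo:.2f}")
# A large change in slope flags the removed point as influential.
assert abs(b_all - b_wo) > 1.0
```

Here the slope even changes sign, so by any standard this point is influential.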
How influential points affect correlation and r^2
Because correlation r is sensitive to outliers and because the regression slope depends on r, a single influential point can dramatically change:
- the direction of association (even flipping the sign of r),
- the strength of association,
- the slope and intercept,
- and therefore predictions.
This is why scatterplots are essential. A computed regression equation without a plot can hide the fact that one point is driving the entire result.
Example: A high leverage point that is (and isn’t) influential
Imagine most data points have x values between 2 and 8 and follow a positive linear trend.
Case A: high leverage but not very influential
You add one new point at x = 20 that lies close to the extension of the existing line. This point has high leverage (extreme x), but it agrees with the trend. It may slightly strengthen the linear pattern and increase r^2, but it might not dramatically change the slope.
Case B: high leverage and influential
Instead, the point at x = 20 lies far above or below the extension of the existing trend. Now it has high leverage and a large residual relative to the old line—this is the dangerous combination. The least-squares line may rotate toward that point to reduce the squared residual, changing predictions for many other x values.
Even without calculating anything, you can often predict the direction the line will move: the line will try to get closer to the influential point, especially if it sits far out in x.
Departures from linearity: what to do with them (AP-level expectations)
In AP Statistics, you are not usually required in this unit to fit advanced nonlinear models. But you are expected to:
- detect when linear regression is inappropriate,
- explain what feature of the graph shows the problem,
- and describe consequences for prediction and interpretation.
Common appropriate next steps (depending on the question) include:
- Suggest that a different model might fit better (for example, a curved model).
- Restrict predictions to the observed x range (avoid extrapolation).
- Investigate the context for an outlier or influential point (data entry error? different subgroup? unusual circumstances?).
A subtle but important point: removing an influential point is not automatically justified. You need a reason tied to context (measurement error, different population, etc.), not just “it makes the line look nicer.”
Extrapolation: a “departure” in how you use the model
Even with a strong linear pattern, predicting outside the observed range of x is risky. **Extrapolation** means using the regression line to predict y for an x value beyond the data range.
The line is not guaranteed to remain reasonable outside the range you observed. Many real relationships change behavior—limits, saturation, thresholds—so extrapolation can produce unrealistic predictions.
AP questions often test whether you can recognize and criticize extrapolation in context.
A worked influence demonstration with simple numbers
Suppose you have four points that follow a perfect line:
(1, 2), (2, 4), (3, 6), (4, 8)
These lie exactly on:
\hat{y} = 2x
Now add a high leverage point at x = 10.
- If the new point is (10, 20), it matches the pattern and will not change the line much.
- If the new point is (10, 0), it is far below the pattern. Because x = 10 is extreme compared to 1 through 4, the regression line will be pulled downward and the slope will decrease substantially.
You don’t need the exact new equation to answer the key conceptual question: that point is high leverage and very likely influential because it conflicts with the established trend while being far out in x.
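For the record, the numbers bear this out. Fitting the four original points plus (10, 20) leaves \hat{y} = 2x unchanged, while replacing that fifth point with (10, 0) drags the slope all the way from 2 down to -0.4:

```python
import numpy as np

# Four points lying exactly on y-hat = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x

# Agreeing high-leverage point (10, 20): the line is unchanged.
b1, a1 = np.polyfit(np.append(x, 10), np.append(y, 20), 1)
assert np.isclose(b1, 2.0) and np.isclose(a1, 0.0, atol=1e-8)

# Conflicting high-leverage point (10, 0): the slope flips sign.
b2, a2 = np.polyfit(np.append(x, 10), np.append(y, 0), 1)
assert np.isclose(b2, -0.4) and np.isclose(a2, 5.6)
print(f"with (10, 0): y-hat = {a2:.1f} + ({b2:.1f})x")
```

One conflicting point out of five not only weakens the fit but reverses the direction of the fitted association.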
Exam Focus
- Typical question patterns:
  - You’re shown a scatterplot (sometimes with a regression line) and asked to identify an outlier, a high leverage point, or an influential point.
  - You’re asked what happens to the regression line if a specific point is removed (slope increases/decreases, line shifts up/down).
  - You’re asked to diagnose curvature or fanning from a residual plot and explain why linear regression is not appropriate.
- Common mistakes:
  - Calling any point with a large residual “influential” without considering leverage or whether the line would change if removed.
  - Confusing “outlier” (large residual) with “high leverage” (extreme x); they are different and tested separately.
  - Justifying removal of points purely for a better fit rather than using context (error, different condition, or other legitimate reason).