AP Statistics Unit 9 Notes: Inference for Linear Regression Slopes
Introducing Statistics for the Slope of a Regression Model
What “inference for regression” is really about
When you fit a least-squares regression line to data, you get a sample regression equation that describes the relationship in your sample. In AP Statistics, inference for linear regression asks a bigger question: what can you conclude about the population relationship between two quantitative variables based on this sample?
The key population quantity is the slope of the population regression line, usually written as \beta_1. Conceptually:
- \beta_1 is the _true_ average change in the response variable y for a one-unit increase in the explanatory variable x, in the population.
- Your data produce a sample slope b_1 (often just b), which is an estimate of \beta_1.
So the overall goal is to use b_1 (plus information about scatter) to estimate or test claims about \beta_1.
Why the slope matters (and why we do inference)
A regression slope from a sample can look impressive just by chance—especially with noisy data or small samples. Inference lets you quantify uncertainty:
- A confidence interval for \beta_1 gives a range of plausible values for the true slope.
- A significance test for \beta_1 evaluates whether the data provide convincing evidence that the true slope is different from a hypothesized value (often 0, meaning “no linear relationship”).
This is closely connected to what you already know about inference for means: you use a point estimate, a standard error, and a t distribution. The big difference is that the estimate is a **slope**, and its variability depends on the scatter around the regression line and the spread of the x values.
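To see why chance alone can produce an impressive-looking slope, here is a small simulation sketch (the sample sizes and noise level are made-up choices for illustration): even when the true population slope is 0, small noisy samples routinely produce nonzero sample slopes.

```python
import random

random.seed(1)

def sample_slope(n, true_slope=0.0, noise_sd=5.0):
    """Fit a least-squares slope to one simulated sample drawn from a
    population whose true slope is `true_slope`."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [true_slope * x + random.gauss(0, noise_sd) for x in xs]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# With a true slope of 0, many small samples still yield slopes well
# away from 0 -- that chance variation is what inference quantifies.
slopes = [sample_slope(n=10) for _ in range(1000)]
print(min(slopes), max(slopes))
```

The sample slopes center near 0 but spread widely, which is exactly the sampling variability that the standard error of the slope measures.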
The model behind regression inference
Inference procedures rely on a statistical model. The simple linear regression model is:
y = \beta_0 + \beta_1 x + \epsilon
where:
- \beta_0 is the population intercept.
- \beta_1 is the population slope.
- \epsilon is a random error term with mean 0.
In a given sample of size n, you compute the least-squares line:
\hat{y} = b_0 + b_1 x
Here \hat{y} is the predicted value of y, and the residual is:
e = y - \hat{y}
Residuals are the “leftover” vertical distances. Inference for slope depends heavily on how these residuals behave.
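The least-squares computations above can be sketched in a few lines (the data here are made up for illustration):

```python
# Fit \hat{y} = b_0 + b_1 x by least squares, then compute the
# residuals e = y - \hat{y}. Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]             # explanatory variable
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # response variable

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# A property of least squares: the residuals sum to (essentially) zero.
print(round(sum(residuals), 10))
```

Note that the residuals always summing to zero is a consequence of the least-squares fit itself; it is the *pattern* and *spread* of the residuals that matter for inference.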
Conditions for regression inference (what must be true)
AP Statistics expects you to check conditions before doing a confidence interval or test for \beta_1. A common memory aid is L.I.N.E.:
- Linear: The relationship between x and y is approximately linear.
- Independent: Observations are independent (often justified by random sampling or random assignment).
- Normal: For each fixed x, the distribution of y values (equivalently, the errors) is approximately Normal.
- Equal variance: The spread of residuals is about the same across the range of x values.
How you actually check these in practice:
- Linearity: Scatterplot of y vs x should look roughly linear; residual plot should show no curved pattern.
- Independence: Comes from the data-collection design (random sample, randomized experiment). This is not something a plot can prove.
- Normality: Residuals should be roughly symmetric/unimodal; a Normal probability plot of residuals (if provided) should be roughly linear. In AP, a histogram of residuals is common.
- Equal variance: Residual plot should show roughly constant vertical spread (no funnel shape).
Important nuance: these are conditions for inference, not for computing the regression line. You can always calculate b_1, but your confidence interval and p-value are only trustworthy if conditions are reasonably met.
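As a companion to the plots (which are the standard AP tool), the equal-variance condition can also be sanity-checked numerically, for example by comparing the typical residual size for low-x points versus high-x points. The (x, residual) pairs below are hypothetical:

```python
def spread(values):
    """Sample standard deviation."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5

# Hypothetical (x, residual) pairs for illustration.
pairs = [(1, 0.4), (2, -0.3), (3, 0.2), (4, -0.5), (5, 0.3),
         (6, -0.2), (7, 0.5), (8, -0.4), (9, 0.1), (10, -0.1)]
pairs.sort()
half = len(pairs) // 2
low = [e for _, e in pairs[:half]]    # residuals for the smaller x values
high = [e for _, e in pairs[half:]]   # residuals for the larger x values

ratio = spread(high) / spread(low)
print(round(ratio, 2))  # a ratio near 1 suggests roughly constant spread
```

A funnel-shaped residual plot would show up here as a ratio far from 1; this check does not replace the plot, but it makes the condition concrete.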
Why we use a t-distribution and why degrees of freedom are n - 2
In slope inference, you estimate two parameters from the data: \beta_0 and \beta_1. That “uses up” two degrees of freedom, so the relevant t distribution uses:
df = n - 2
The general structure is:
t = \frac{\text{estimate} - \text{hypothesized value}}{SE}
For slope, the estimate is b_1, and the standard error is the standard error of the slope.
The standard error of the slope (what it measures)
The standard error of the slope, written SE_{b_1}, measures how much the estimated slope b_1 would typically vary from sample to sample if you repeatedly sampled from the same population.
Intuitively, SE_{b_1} gets smaller when:
- Points cluster tightly around the line (less vertical scatter).
- You have a larger sample size.
- The x values are spread out (it’s easier to detect a slope when x varies a lot).
In many AP problems, you are given SE_{b_1} in computer output (like a regression table). You usually do not have to compute it from scratch.
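For reference, here is a sketch of how SE_{b_1} is computed from raw data, using the standard formula SE_{b_1} = s / \sqrt{S_{xx}}, where s = \sqrt{\sum e^2 / (n-2)} is the residual standard error and S_{xx} = \sum (x - \bar{x})^2. The data are made up for illustration:

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]  # hypothetical data

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

# Residual standard error s estimates the sd of the errors, using n - 2
# degrees of freedom because two parameters (b0, b1) were estimated.
ss_resid = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(ss_resid / (n - 2))
se_b1 = s / math.sqrt(sxx)

print(round(b1, 3), round(se_b1, 3))
```

The formula also explains the intuition above: more vertical scatter makes s (and hence SE_{b_1}) larger, while more spread in x makes S_{xx} larger and SE_{b_1} smaller.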
Notation you might see (be fluent in equivalents)
Different calculators and software use slightly different symbols.
| Meaning | Common notation | Other notation you may see |
|---|---|---|
| Population slope | \beta_1 | (sometimes written as “true slope”) |
| Sample slope estimate | b_1 | b |
| Standard error of slope | SE_{b_1} | SE_b, “SE Coef” for x |
| Degrees of freedom | n - 2 | “df” |
| Test statistic | t | “t-ratio” |
Worked example (conceptual, not yet inference)
Suppose you record x = hours studied and y = exam score for a random sample of students. You fit a least-squares line and get b_1 = 3.2.
Interpretation of b_1 (sample slope): In this sample, each additional hour studied is associated with an increase of about 3.2 points in predicted exam score.
Inference question: Is the true slope \beta_1 in the population positive? Or could a slope like 3.2 happen just due to random sampling variation?
That’s exactly what the confidence interval and significance test address.
Exam Focus
- Typical question patterns:
- You’re given a regression output table and asked to interpret b_1 and/or SE_{b_1} in context.
- You’re asked to check regression inference conditions using a scatterplot and residual plot.
- You’re asked to identify the correct degrees of freedom for slope inference.
- Common mistakes:
- Confusing b_1 (sample slope) with \beta_1 (population slope) when writing conclusions.
- Checking conditions using only the scatterplot (you often need the residual plot for linearity and equal variance).
- Assuming independence without referencing the data-collection method.
Confidence Interval for the Slope of a Regression Model
What a slope confidence interval means
A confidence interval for the slope estimates the population slope \beta_1 using the sample slope b_1 plus a margin of error.
A C% confidence interval for \beta_1 is a range of slopes that are **plausible** given your data and the regression model assumptions. If you repeated the sampling process many times and built the interval the same way each time, about C% of those intervals would capture the true \beta_1.
A crucial interpretation point: the interval is about the average change in y per unit x in the population, not about individual predictions.
The formula and pieces you must identify
The confidence interval for \beta_1 is:
b_1 \pm t^* SE_{b_1}
Where:
- b_1 = sample slope from the regression.
- SE_{b_1} = standard error of the slope (from output).
- t^* = critical value from a t distribution with df = n - 2 at the desired confidence level.
This looks like a one-sample t interval in structure: estimate ± (critical value)(standard error). The only difference is what you’re estimating.
How to choose t^*
To find t^* you need:
- The confidence level (commonly 90%, 95%, 99%).
- Degrees of freedom df = n - 2.
You might use a table, a calculator, or be given t^* directly. On AP-style questions where no table is provided, the problem usually supplies t^* or enough information to find it.
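If software is available, t^* can also be found programmatically; the sketch below assumes SciPy is installed (on the exam you would use a t table or calculator instead):

```python
from scipy.stats import t

def t_star(confidence, n):
    """Critical value t* for a C% confidence interval for the slope,
    using df = n - 2."""
    df = n - 2
    tail = (1 + confidence) / 2  # e.g. 0.975 for a 95% interval
    return t.ppf(tail, df)

# 95% confidence with n = 18 (df = 16): t* is about 2.12.
print(round(t_star(0.95, 18), 3))
```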
Step-by-step process (what you would write on an FRQ)
- Identify the parameter: \beta_1, the true slope of the population regression line relating y to x.
- Check conditions (L.I.N.E.):
- Show how plots and context support linearity and equal variance.
- Use the design to justify independence.
- Discuss Normality of residuals if evidence is provided.
- Calculate the interval using b_1 \pm t^* SE_{b_1}.
- Interpret in context: “We are C% confident that for each 1-unit increase in x, the mean (predicted) value of y in the population increases by between … and … units.”
Worked example: building and interpreting a CI
A random sample of n = 18 houses is selected. Let x be house size (in hundreds of square feet) and y be selling price (in thousands of dollars). Regression output gives:
- b_1 = 12.4
- SE_{b_1} = 3.1
You want a 95% confidence interval for \beta_1.
Step 1: Degrees of freedom
df = n - 2 = 18 - 2 = 16
Step 2: Critical value
For 95% confidence with df = 16, t^* \approx 2.12 (from a t table).
Step 3: Compute the interval
b_1 \pm t^* SE_{b_1} = 12.4 \pm 2.12(3.1)
Margin of error:
2.12(3.1) = 6.572
Interval:
12.4 - 6.572 = 5.828
12.4 + 6.572 = 18.972
So the 95% CI is approximately:
[5.83, 18.97]
Step 4: Interpret in context
“We are 95% confident that in the population of houses like these, for each additional 100 square feet of house size, the mean selling price increases by between about 5.83 and 18.97 thousand dollars, on average.”
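The arithmetic in this example is simple enough to verify in a few lines:

```python
# Reproducing the worked example: 95% CI for the slope with
# b1 = 12.4, SE = 3.1, and t* = 2.12 (df = 16).
b1 = 12.4
se_b1 = 3.1
t_star = 2.12  # from a t table with df = 16

margin = t_star * se_b1           # 6.572
lower, upper = b1 - margin, b1 + margin
print(round(lower, 2), round(upper, 2))  # -> 5.83 18.97
```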
Connecting the CI to “is the relationship real?”
A very common conceptual link: if a confidence interval for \beta_1 does not include 0, that suggests the slope is significantly different from 0 at a corresponding significance level (for a two-sided test). In plain language: if 0 is not plausible, “no linear relationship” is not plausible.
Be careful: this is a helpful shortcut, but you still need to match the confidence level to the test’s \alpha and whether the test is one-sided or two-sided.
What can go wrong (and how to avoid it)
- Interpreting the slope as “for each 1-unit increase in x, y increases” without saying “predicted/mean response” or clarifying it’s an average relationship. Regression describes the conditional mean, not every individual.
- Using the wrong degrees of freedom. For slope inference it’s n - 2, not n - 1.
- Ignoring a curved pattern in the residual plot. A CI for \beta_1 assumes the linear model is appropriate.
- Extrapolation: Even a beautifully computed CI does not justify conclusions for x values far outside the data range.
Exam Focus
- Typical question patterns:
- Given b_1 and SE_{b_1} (often from regression output), construct and interpret a C% CI for \beta_1.
- Justify conditions using provided scatterplots/residual plots.
- Decide whether 0 is a plausible slope based on the interval.
- Common mistakes:
- Interpreting the CI as “C% of data fall in this range” (it’s about the parameter \beta_1).
- Forgetting to interpret the slope using the units of y per unit of x.
- Using t^* for the wrong confidence level or the wrong df.
Significance Test for the Slope of a Regression Model
What the slope significance test asks
A significance test for the slope evaluates whether the sample provides convincing evidence about the population slope \beta_1.
The most common test in AP Statistics is whether there is evidence of a linear relationship between x and y in the population. That is framed as:
- Null hypothesis: H_0: \beta_1 = 0 (no linear association in the population)
- Alternative hypothesis: H_a: \beta_1 \ne 0 (some linear association)
Depending on context, one-sided alternatives also appear:
- H_a: \beta_1 > 0 (positive linear relationship)
- H_a: \beta_1 < 0 (negative linear relationship)
Why testing the slope is linked to “association vs. causation”
A significant slope tells you that a linear relationship is unlikely to be due to random chance alone (given the model assumptions). But it does not automatically mean x causes y.
- With a random sample from a population, you can generalize the association to that population.
- With a randomized experiment (random assignment of treatments), a significant slope can support a causal conclusion (because random assignment helps eliminate confounding).
AP questions often test whether you can make the correct type of conclusion based on the study design.
The test statistic and its distribution
The test statistic for \beta_1 is:
t = \frac{b_1 - \beta_{1,0}}{SE_{b_1}}
where \beta_{1,0} is the slope value in the null hypothesis (often 0). Under the null and the regression conditions, this statistic follows a t distribution with:
df = n - 2
How the p-value is interpreted
The p-value is the probability, assuming the null hypothesis is true, of observing a slope estimate at least as extreme as the one you got (in the direction(s) specified by H_a).
- Small p-value (typically less than \alpha) means the data would be surprising if \beta_1 were really 0, so you reject H_0.
- Large p-value means the data are reasonably consistent with \beta_1 = 0, so you fail to reject H_0.
Language matters on the exam: “fail to reject” is not the same as “accept.” A large p-value does not prove there is no relationship; it means these data do not provide strong evidence of one.
Step-by-step testing procedure (FRQ-ready reasoning)
- State hypotheses about \beta_1 in context.
- Check conditions (L.I.N.E.).
- Compute the test statistic using t = \frac{b_1 - \beta_{1,0}}{SE_{b_1}}.
- Find the p-value from the t distribution with df = n - 2 (or read it from regression output).
- Make a decision (reject or fail to reject) at significance level \alpha.
- Conclude in context about evidence for a linear relationship (or for a positive/negative slope).
Worked example: testing whether the slope differs from 0
A botanist takes a random sample of n = 25 plants of the same species. Let x be hours of sunlight per day and y be weekly growth (cm). Regression output gives:
- b_1 = 0.85
- SE_{b_1} = 0.30
Test at \alpha = 0.05:
- H_0: \beta_1 = 0
- H_a: \beta_1 \ne 0
Degrees of freedom
df = n - 2 = 25 - 2 = 23
Compute the test statistic
t = \frac{0.85 - 0}{0.30} = 2.833\ldots
P-value (approximate)
For df = 23, a t value around 2.83 is quite large. The two-sided p-value is less than 0.01 (you would confirm with a calculator/table or be given output).
Decision
Since p < 0.05, reject H_0.
Conclusion in context
“There is convincing evidence that the true slope \beta_1 is not 0. In the population of these plants, hours of sunlight per day is linearly associated with mean weekly growth.”
Notice what we did not say: we did not claim sunlight “causes” growth unless the data came from a randomized experiment where sunlight hours were assigned.
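The mechanics of this example can be verified with a short script (SciPy is assumed here for the p-value; on the exam the p-value would come from a table, calculator, or printed output):

```python
from scipy.stats import t as t_dist

# Botanist example: b1 = 0.85, SE = 0.30, n = 25.
b1, se_b1, n = 0.85, 0.30, 25
df = n - 2                                # 23
t_stat = (b1 - 0) / se_b1                 # hypothesized slope is 0
p_value = 2 * t_dist.sf(abs(t_stat), df)  # two-sided p-value

print(round(t_stat, 3), p_value < 0.01)   # t ~ 2.833, and p < 0.01
```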
A second example: one-sided alternative
Suppose an economics question expects you to justify that higher advertising spending increases sales. Then you might test:
- H_0: \beta_1 = 0
- H_a: \beta_1 > 0
Everything about the mechanics is the same, but the p-value is one-sided: when the observed t falls in the direction of H_a, it is exactly half of the corresponding two-sided p-value (because the t distribution is symmetric). A common exam trap is using a two-sided p-value when the alternative is one-sided.
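The halving relationship follows from the symmetry of the t distribution, and is easy to confirm numerically (SciPy assumed; the t value and df below are arbitrary choices for illustration):

```python
from scipy.stats import t as t_dist

t_stat, df = 2.5, 20                         # arbitrary illustrative values
p_one_sided = t_dist.sf(t_stat, df)          # for H_a: slope > 0
p_two_sided = 2 * t_dist.sf(abs(t_stat), df)

# When t is in the direction of H_a, one-sided p is exactly half.
print(abs(p_two_sided - 2 * p_one_sided) < 1e-12)  # -> True
```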
How the test relates to correlation and regression output
In simple linear regression (one x variable), testing \beta_1 = 0 is closely connected to testing whether the population correlation is 0. Practically, AP questions often give regression output with a row for the explanatory variable that includes:
- the slope estimate b_1
- SE_{b_1}
- the t statistic
- the p-value
If the p-value listed for x is small, that corresponds to rejecting H_0: \beta_1 = 0.
What can go wrong (common reasoning errors)
- Confusing “no slope” with “no relationship.” The test is about linear relationship. There could be a strong curved relationship even if a linear slope test fails.
- Concluding causation from observational data. Significant slope does not eliminate confounding.
- Not tying hypotheses to the parameter. Hypotheses should be about \beta_1, not about b_1.
- Ignoring outliers/influential points. One influential point can drastically change b_1 and the p-value. Residual plots and scatterplots help you notice this.
Exam Focus
- Typical question patterns:
- Given regression output, conduct a test of H_0: \beta_1 = 0 and write a conclusion in context.
- Choose the correct alternative hypothesis (two-sided vs one-sided) based on the story.
- Explain whether a significant result supports a causal claim based on whether the study was an experiment or observational.
- Common mistakes:
- Writing hypotheses about b_1 instead of \beta_1.
- Using df = n - 1 or using a Normal distribution instead of t with df = n - 2.
- Saying “we accept H_0” or interpreting a large p-value as proof that the slope is 0.