AP Statistics Unit 9: Comprehensive Study Guide on Inference for Slope

Overview of Unit 9: Inference for Slope

Unit 9 is the final unit of content in the AP Statistics curriculum.
The primary focus is extending the concepts of bivariate quantitative data from Unit 2 to make judgments about entire populations using sample data.
Bivariate Quantitative Data Review: This data consists of two quantitative variables measured on each individual. It can be displayed on a scatter plot and summarized with a Least Squares Regression Line (LSRL).
The Concept of Inference for Slope: The slope calculated from a sample ( $b$ ) is just one estimate. Inference allows us to use this sample slope to make a judgment about the true population slope ( $\beta$ ).

The Sampling Distribution of Sample Slopes

To perform inference, one must understand the behavior of all possible sample slopes from samples of the same size taken from the same population.
Center: The mean of the sampling distribution of sample slopes is the true population slope ( $\beta$ ), provided the samples are drawn randomly to avoid bias.
Spread (Standard Deviation): The formula for the standard deviation of the sampling distribution is:
- $\sigma_b = \frac{\sigma}{\sigma_x \sqrt{n}}$
- In this formula, $\sigma$ is the standard deviation of the residuals about the true regression line, $\sigma_x$ is the standard deviation of the population X-values, and $n$ is the sample size.
Standard Error ( $SE_b$ ): Because population parameters ( $\sigma$ and $\sigma_x$ ) are typically unknown, we use the standard error of the slope as an estimate:
- $SE_b = \frac{s}{s_x \sqrt{n-1}}$
- Here, $s$ is the standard deviation of the sample residuals, $s_x$ is the standard deviation of the X-values in the sample, and $n-1$ is used in the denominator (the reasons for using $n-1$ are beyond the scope of AP Statistics).
- Note: This value is almost always provided in a computer regression analysis output table.
Shape: The distribution of sample slopes is approximately normal provided certain conditions (specifically regarding residuals) are met.
T-Model Application: When building the sampling distribution using standard error, a T-model is used with degrees of freedom ( $df$ ) calculated as:
- $df = n - 2$
- The subtraction of 2 reflects the two parameters estimated from the sample to fit the line: the slope and the y-intercept.
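The standard error and degrees of freedom described above can be checked by hand on a small data set. The following is a minimal sketch, assuming a made-up sample (the `x` and `y` values are invented for illustration) and assuming the usual convention that $s$ is computed from the residuals with $n - 2$ in the denominator.

```python
import math
import statistics

# Hypothetical sample, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n = len(x)

# Least squares slope b and intercept a.
x_bar = statistics.mean(x)
y_bar = statistics.mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residuals and their standard deviation s (assumed divisor: n - 2).
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
df = n - 2
s = math.sqrt(sum(r ** 2 for r in residuals) / df)

# Standard error of the slope: SE_b = s / (s_x * sqrt(n - 1)).
s_x = statistics.stdev(x)  # sample SD of the X-values (divisor n - 1)
se_b = s / (s_x * math.sqrt(n - 1))

print(f"b = {b:.4f}, SE_b = {se_b:.4f}, df = {df}")
```

Because $s_x \sqrt{n-1}$ equals the square root of the sum of squared deviations of the X-values, the same value comes out of `s / math.sqrt(s_xx)`, which is the form many regression programs use internally.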

Constructing a T-Interval for Slope

A T-interval is used to estimate the true population slope ( $\beta$ ).
Four-Step Process:
1. Name the Interval: Explicitly state it is a "T-interval for the slope of [context]."
2. Check Conditions: Verify the assumptions necessary for the sampling distribution.
3. Build the Interval: Use the formula $b \pm t^* \times SE_b$ .
4. Interpret the Interval: State the confidence level and what the interval represents in terms of the population relationship.

Necessary Conditions for Inference

Linearity: The scatter plot of the sample data must appear somewhat linear. A regression line should not be applied to non-linear data.
Randomness: Observations must be selected randomly to avoid bias and allow for generalization.
Independence (10% Condition): When sampling without replacement, the sample size $n$ must be less than $10\%$ of the population size so that the observations can be treated as independent.
Normality of Residuals: The residuals must be approximately normal with no major outliers or extreme skewness. This is verified by checking a histogram or dot plot of the residuals.
Constant Variance (Equal SD of Residuals): The standard deviation of the Y-values should not vary with X. When looking at a residual plot, there should be similar variability (vertical spread) across all X-values, rather than a "fan pattern" (small residuals at one end and large ones at the other).
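On the AP exam the residual conditions are checked by looking at plots, but the quantities behind those plots are easy to compute. The sketch below uses invented data and a rough numeric stand-in for the visual check: it compares the spread of the residuals for the smaller X-values with the spread for the larger X-values. The split at the midpoint and the idea of comparing the two standard deviations are illustrative assumptions, not an official AP procedure.

```python
import statistics

# Hypothetical sample, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [3.2, 4.8, 7.1, 9.3, 10.6, 13.4, 14.9, 17.2, 18.8, 21.3]

# Fit the least squares regression line y-hat = a + b*x.
x_bar = statistics.mean(x)
y_bar = statistics.mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

# Residual = observed y minus predicted y.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Order residuals by x, then compare spread in the low-x and high-x halves.
ordered = [r for _, r in sorted(zip(x, residuals))]
half = len(ordered) // 2
sd_low = statistics.stdev(ordered[:half])
sd_high = statistics.stdev(ordered[half:])
spread_ratio = sd_high / sd_low

print(f"SD of residuals (low x):  {sd_low:.3f}")
print(f"SD of residuals (high x): {sd_high:.3f}")
print(f"Ratio high/low: {spread_ratio:.2f}")
```

A ratio far from 1 would hint at the "fan pattern" described above. A residual plot is still the expected evidence in a written response.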

Example Case Study: Car Weight and Gas Mileage

Scenario: A dealership analyzes the relationship between car weight (X, in pounds) and gas mileage (Y, in miles per gallon) using a random sample of $n = 12$ cars.
Sample Data/Output:
- Sample Slope ( $b$ ): $-0.00714$
- Standard Error ( $SE_b$ ): $0.00056$
Confidence Interval (95%) Calculation:
- Step 1: T-interval for the slope of the regression line between weight and gas mileage.
- Step 2: Conditions met (linear scatter plot, random sample of 12, 12 is less than $10\%$ of all cars, normal residuals, similar variability in residual plot).
- Step 3:
  - $df = 12 - 2 = 10$
  - $t^*$ (for $95\%$ confidence and $df = 10$ ) is found using invT(0.975, 10), which gives approximately $2.228$ (invT(0.025, 10) returns the same value with a negative sign).
  - Interval: $-0.00714 \pm 2.228 \times 0.00056$
  - Resulting Interval: $-0.00839$ to $-0.00589$
Interpreting the Result: "I am $95\%$ confident that the interval from $-0.00839$ to $-0.00589\,mpg/lb$ captures the slope of the true regression line relating gas mileage to weight."
Practical Meaning: For every one-pound increase in car weight, the gas mileage is predicted to decrease by between $0.00589$ and $0.00839\,mpg$ .
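The Step 3 arithmetic for this example can be reproduced without a calculator. The sketch below is one possible approach, not the calculator's own algorithm: it integrates the t density numerically and finds $t^*$ by bisection. The helper names `t_pdf`, `t_cdf`, and `t_critical` are made up for this illustration, and only the slope, standard error, and sample size come from the scenario above.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(conf_level, df):
    """Positive t* leaving (1 - conf_level)/2 in each tail (bisection)."""
    target = 1 - (1 - conf_level) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Values from the car weight / gas mileage scenario.
b = -0.00714
se_b = 0.00056
n = 12
df = n - 2

t_star = t_critical(0.95, df)
margin = t_star * se_b
lower, upper = b - margin, b + margin

print(f"t* = {t_star:.3f}")
print(f"95% interval for slope: {lower:.5f} to {upper:.5f}")
```

On a TI calculator, the `t_critical(0.95, df)` call plays the role of invT with an area of 0.975 to the left.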

Special Considerations: Zero in the Interval

If a confidence interval for slope contains zero (e.g., from $-0.5$ to $+0.2$ ), it suggests that there may be no linear relationship between the variables in the population.
A slope of zero represents a horizontal line, meaning changes in X do not predict changes in Y.
If the entire interval is negative, there is strong evidence of a negative relationship.
If the entire interval is positive, there is strong evidence of a positive relationship.

Kelly’s Case Study: Car Price vs. Mileage

Scenario: Kelly examines $n = 10$ cars to predict Price (Y) based on Mileage (X).
Details:
- Sample size: $10$
- $df = 10 - 2 = 8$
- Slope ( $b$ ): $-0.2083$
- Standard Error ( $SE_b$ ): $0.0684$
99% Confidence Interval Calculation:
- $t^*$ for $99\%$ confidence and $df = 8$ is approximately $3.3554$
- Interval: $-0.2083 \pm 3.3554 \times 0.0684$
- Resulting Interval: $-0.4378$ to $+0.0212$
Conclusion: Since the interval contains zero, there is not convincing evidence of a linear relationship. The price could decrease by $44\,\text{cents}$ per mile, increase by $2\,\text{cents}$ per mile, or not change at all (slope of zero).
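Kelly's interval and the "zero in the interval" check can be verified with a few lines of arithmetic. This is a minimal sketch that takes $b$, $SE_b$, and $t^*$ directly from the case study. The variable names are chosen for illustration.

```python
# Values stated in Kelly's case study.
b = -0.2083        # sample slope
se_b = 0.0684      # standard error of the slope
t_star = 3.3554    # critical value for 99% confidence, df = 8

# Interval: b +/- t* x SE_b
margin = t_star * se_b
lower = b - margin
upper = b + margin

# A slope of zero is plausible when zero lies inside the interval.
contains_zero = lower <= 0 <= upper

print(f"99% interval for slope: {lower:.4f} to {upper:+.4f}")
print("Zero is inside the interval" if contains_zero
      else "Zero is outside the interval")
```

The same check applies to any slope interval: if `contains_zero` is true, the data do not rule out a horizontal population regression line.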

Significance Tests for Slope

Purpose: To determine if there is statistically significant evidence of a linear relationship between two variables.
Step 1: Hypotheses and Naming:
- Null Hypothesis ( $H_0$ ): $\beta = \beta_0$ (Usually $\beta = 0$ , meaning no relationship).
- Alternative Hypothesis ( $H_a$ ): $\beta > 0$ , $\beta < 0$ , or $\beta \neq 0$ .
Step 2: Conditions and Sampling Distribution:
- Same conditions as the T-interval.
- Assume $H_0$ is true, so the sampling distribution of $b$ is centered at $\beta_0$ (zero when $H_0$ is $\beta = 0$ ).
Step 3: Calculating Test Statistic and P-value:
- Test Statistic ( $t$ ): $t = \frac{b - \beta_0}{SE_b}$
- P-value: Use tcdf(lower, upper, df) to find the area under the T-curve corresponding to the alternative hypothesis.
Step 4: Conclusion:
- If $P < \alpha$ : Reject $H_0$ ; there is convincing evidence of a linear relationship.
- If $P > \alpha$ : Fail to reject $H_0$ ; there is not convincing evidence of a linear relationship.

Example Case Study: Baby Walking Age and Temperature

Scenario: Researchers studied $n = 16$ babies to see if the average daily temperature in their city affects the age (in days) at which they first walk.
Hypotheses:
- $H_0: \beta = 0$ (Temperature doesn't affect walking age).
- $H_a: \beta < 0$ (In warmer cities, babies walk sooner).
Data:
- Sample Slope ( $b$ ): $-0.0555$
- Standard Error ( $SE_b$ ): $0.0906$
- $df = 16 - 2 = 14$
Calculations:
- $t = \frac{-0.0555 - 0}{0.0906} = -0.6126$
- P-value: tcdf(-99, -0.6126, 14) yields approximately $0.275$ .
Conclusion: With a P-value of about $0.275$ , which is greater than standard alpha levels ( $0.01$ or $0.05$ ), we fail to reject the null. There is not convincing evidence that a negative linear relationship exists between temperature and walking age. A sample slope this far below zero could plausibly occur by chance alone even if the true slope were zero.
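The test statistic and left-tail P-value for this example can be recomputed as a check. The sketch below stands in for the calculator's tcdf by integrating the t density numerically. The helper names `t_pdf` and `t_cdf` and the choice of $\alpha = 0.05$ are assumptions made for this illustration, while the slope, standard error, and sample size come from the study above.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

# Values from the baby walking age study.
b = -0.0555
se_b = 0.0906
n = 16
beta_0 = 0.0          # slope claimed by the null hypothesis
alpha = 0.05          # assumed significance level

df = n - 2
t_stat = (b - beta_0) / se_b

# H_a: beta < 0, so the P-value is the area to the left of t_stat.
p_value = t_cdf(t_stat, df)

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.4f}, df = {df}, P-value = {p_value:.4f}")
print(f"At alpha = {alpha}: {decision}")
```

For a right-tailed alternative the P-value would be `1 - t_cdf(t_stat, df)`, and for a two-sided alternative it would be twice the smaller tail area.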

Final Tips for the AP Exam

Computer Output Reading: You are rarely required to calculate $SE_b$ or the slope manually. You must be able to identify them in a table. The slope is usually in the row labeled with the X-variable name, and the Standard Error is in the column labeled "SE Coeff" or "Standard Error" next to the slope.
Multiple Choice: Often tests the ability to construct the formula part of the interval ( $b \pm t^* \times SE_b$ ). Remember $df = n-2$ .
Free Response: Focus on interpreting the results (contextualizing the slope and the confidence interval).