AP Statistics Unit 9: Comprehensive Study Guide on Inference for Slope
Overview of Unit 9: Inference for Slope
Unit 9 is the final unit of content in the AP Statistics curriculum.
The primary focus is extending the concepts of bivariate quantitative data from Unit 2 to make judgments about entire populations using sample data.
Bivariate Quantitative Data Review: This data consists of two quantitative variables that can be visualized on a scatter plot to create a Least Squares Regression Line (LSRL).
The Concept of Inference for Slope: The slope calculated from a sample (b) is just one estimate. Inference allows us to use this sample slope to make a judgment about the true population slope (β).
The Sampling Distribution of Sample Slopes
To perform inference, one must understand the behavior of all possible sample slopes from samples of the same size taken from the same population.
Center: The mean of the sampling distribution of sample slopes is the true population slope (β), provided the samples are drawn randomly to avoid bias.
Spread (Standard Deviation): The formula for the standard deviation of the sampling distribution is:
- σb = σ / (σx·√n)
- In this formula, σ is the standard deviation of the residuals about the true (population) regression line, σx is the standard deviation of the population X-values, and n is the sample size.
Standard Error (SEb): Because population parameters (σ and σx) are typically unknown, we use the Standard Error of the sample slopes as an estimate:
- SEb = s / (sx·√(n−1))
- Here, s is the standard deviation of the sample residuals, sx is the standard deviation of the X-values in the sample, and n−1 is used in the denominator (the reasons for using n−1 are beyond the scope of AP Statistics).
- Note: This value is almost always provided in a computer regression analysis output table.
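The standard error formula above can be sketched in code. This is a minimal illustration, not output from any case in this guide: the data values are invented, and the variable names are assumptions. It takes s to be the residual standard deviation computed with n−2 in the denominator (the usual regression convention) and checks that s / (sx·√(n−1)) matches the equivalent form s / √Σ(x−x̄)².

```python
import math
import statistics

# Hypothetical data, invented purely for illustration (x = weight in lb, y = mpg).
xs = [2500, 2800, 3000, 3200, 3500, 3800, 4000, 4300]
ys = [34.1, 31.0, 30.2, 27.9, 26.5, 24.8, 22.0, 21.3]
n = len(xs)

# Least squares slope (b) and intercept (a).
x_bar = statistics.mean(xs)
y_bar = statistics.mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = sxy / sxx
a = y_bar - b * x_bar

# s: standard deviation of the residuals, using n - 2 in the denominator.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# sx: sample standard deviation of the X-values (divides by n - 1).
s_x = statistics.stdev(xs)

# Standard error of the slope, as written in the guide.
se_b = s / (s_x * math.sqrt(n - 1))

# Equivalent form, since sx * sqrt(n - 1) equals sqrt(sum of squared x-deviations).
se_b_alt = s / math.sqrt(sxx)

print(f"b = {b:.5f}, s = {s:.4f}, sx = {s_x:.2f}, SEb = {se_b:.6f}")
```

In practice this value is read straight from the regression output, so the calculation is mainly useful for seeing how a larger spread in X or a larger n shrinks SEb.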
Shape: The distribution of sample slopes is approximately normal provided certain conditions (specifically regarding residuals) are met.
T-Model Application: When building the sampling distribution using standard error, a T-model is used with degrees of freedom (df) calculated as:
- df=n−2
- The subtraction of 2 reflects the two parameters estimated from the sample data: the slope and the y-intercept of the regression line.
Constructing a T-Interval for Slope
A T-interval is used to estimate the true population slope (β).
Four-Step Process:
1. Name the Interval: Explicitly state it is a "T-interval for the slope of [context]."
2. Check Conditions: Verify the assumptions necessary for the sampling distribution.
3. Build the Interval: Use the formula b±t∗×SEb.
4. Interpret the Interval: State the confidence level and what the interval represents in terms of the population relationship.
Necessary Conditions for Inference
Linearity: The scatter plot of the sample data must appear somewhat linear. A regression line should not be applied to non-linear data.
Randomness: Observations must be selected randomly to avoid bias and allow for generalization.
Independence (10% Condition): The sample size n must be less than 10% of the population size to assume independence.
Normality of Residuals: The residuals must be approximately normal with no major outliers or extreme skewness. This is verified by checking a histogram or dot plot of the residuals.
Constant Variance (Equal SD of Residuals): The standard deviation of the Y-values should not vary with X. When looking at a residual plot, there should be similar variability (vertical spread) across all X-values, rather than a "fan pattern" (small residuals at one end and large ones at the other).
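The normality and constant-variance conditions are both judged from the residuals. As a rough sketch (with invented data and hypothetical variable names, not taken from any case in this guide), the residuals behind a residual plot or histogram could be computed like this:

```python
import statistics

# Hypothetical data, invented purely for illustration.
xs = [2500, 2800, 3000, 3200, 3500, 3800, 4000, 4300]
ys = [34.1, 31.0, 30.2, 27.9, 26.5, 24.8, 22.0, 21.3]

# Fit the least squares regression line y-hat = a + b*x.
x_bar = statistics.mean(xs)
y_bar = statistics.mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = sxy / sxx
a = y_bar - b * x_bar

# Residual = observed y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# List residuals in order of x so that a fan pattern or curve would be visible.
for x, r in sorted(zip(xs, residuals)):
    print(f"x = {x:>5}   residual = {r:+.3f}")
```

The printed list is only a stand-in for the graphs: the conditions themselves are still assessed visually, from a residual plot (constant variance, linearity) and a histogram or dot plot of the residuals (normality).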
Example Case Study: Car Weight and Gas Mileage
Scenario: A dealership analyzes the relationship between car weight (X, in pounds) and gas mileage (Y, in miles per gallon) using a random sample of n=12 cars.
Confidence Interval (95%) Calculation:
- Step 1: T-interval for the slope of the regression line between weight and gas mileage.
- Step 2: Conditions met (linear scatter plot, random sample of 12, 12 is less than 10% of all cars, normal residuals, similar variability in residual plot).
- Step 3:
- df=12−2=10
- t∗ (for 95% confidence and df=10) is found using invT(0.025, 10) and taking the absolute value of the result.
- Interval: −0.00714±(t∗)×0.00056
- Resulting Interval: −0.00834 to −0.00594
Interpreting the Result: "I am 95% confident that the interval from −0.00834 to −0.00594 mpg/lb captures the slope of the true regression line relating gas mileage to weight."
Practical Meaning: For every one-pound increase in car weight, the gas mileage is predicted to decrease by between 0.00594 and 0.00834 mpg.
Special Considerations: Zero in the Interval
If a confidence interval for slope contains zero (e.g., from −0.5 to +0.2), it suggests that there may be no linear relationship between the variables in the population.
A slope of zero represents a horizontal line, meaning changes in X do not predict changes in Y.
If the entire interval is negative, there is strong evidence of a negative relationship.
If the entire interval is positive, there is strong evidence of a positive relationship.
Kelly’s Case Study: Car Price vs. Mileage
Scenario: Kelly examines n=10 cars to predict Price (Y) based on Mileage (X).
99% Confidence Interval Calculation:
- t∗ for 99% confidence and df=8 is approximately 3.3554
- Interval: −0.2083±3.3554×0.0684
- Resulting Interval: −0.4378 to +0.0212
Conclusion: Since the interval contains zero, there is not convincing evidence of a linear relationship. The price could decrease by as much as 44 cents per mile, increase by as much as 2 cents per mile, or not change at all (slope of zero).
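Kelly's interval can be reproduced with a short sketch. The slope, standard error, and sample size come from the case study above; everything else is an assumption made for illustration. In particular, the helper functions t_pdf, t_cdf, and t_star are hypothetical names, and the critical value is approximated numerically (Simpson's rule plus bisection) as a stand-in for a calculator's invT.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """Area to the left of x, via Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_star(confidence, df):
    """Positive critical value with the given central area, found by bisection."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:   # widen the bracket until it contains t*
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Values from Kelly's case study.
b, se_b, n = -0.2083, 0.0684, 10

t_crit = t_star(0.99, n - 2)        # df = n - 2 = 8
margin = t_crit * se_b
lower, upper = b - margin, b + margin
contains_zero = lower <= 0 <= upper

print(f"t* = {t_crit:.4f}")
print(f"interval = ({lower:.4f}, {upper:.4f}), contains zero: {contains_zero}")
```

On the exam the critical value comes from invT or a t-table, so only the last few lines correspond to work a student would actually show.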
Significance Tests for Slope
Purpose: To determine if there is statistically significant evidence of a linear relationship between two variables.
Step 1: Hypotheses and Naming:
- Null Hypothesis (H0): β=β0 (Usually β=0, meaning no relationship).
- Alternative Hypothesis (Ha): β>β0, β<β0, or β≠β0 (with β0=0, these are β>0, β<0, or β≠0).
Step 2: Conditions and Sampling Distribution:
- Same conditions as the T-interval.
- Assume H0 is true, so the sampling distribution is centered at zero (if β=0).
Step 3: Calculating Test Statistic and P-value:
- Test Statistic (t): t = (b − β0) / SEb
- P-value: Use tcdf(lower, upper, df) to find the area under the T-curve corresponding to the alternative hypothesis.
Step 4: Conclusion:
- If P<α: Reject H0, stating there is convincing evidence of a linear relationship.
- If P≥α: Fail to reject H0, stating there is not convincing evidence of a linear relationship.
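Steps 3 and 4 can be sketched in code. For illustration this reuses the slope and standard error from Kelly's case study (b=−0.2083, SEb=0.0684, n=10) with a two-sided alternative, which is an assumption since the guide does not run a test on that data. The functions t_pdf, t_cdf, and slope_t_test are hypothetical names, and t_cdf approximates a calculator's tcdf with Simpson's rule.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """Area to the left of x, via Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def slope_t_test(b, se_b, n, beta0=0.0, alternative="two-sided"):
    """Return (t, df, p_value) for H0: beta = beta0."""
    df = n - 2
    t = (b - beta0) / se_b
    if alternative == "less":          # Ha: beta < beta0
        p = t_cdf(t, df)
    elif alternative == "greater":     # Ha: beta > beta0
        p = 1 - t_cdf(t, df)
    else:                              # Ha: beta != beta0
        p = 2 * (1 - t_cdf(abs(t), df))
    return t, df, p

t_stat, df, p_value = slope_t_test(-0.2083, 0.0684, 10)
alpha = 0.01
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.4f}, df = {df}, P-value = {p_value:.4f}, {decision} at alpha = {alpha}")
```

With α=0.01 the two-sided test fails to reject H0, which agrees with the 99% confidence interval for the same data containing zero.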
Example Case Study: Baby Walking Age and Temperature
Scenario: Researchers studied n=16 babies to see if the average daily temperature in their city affects the age (in days) at which they first walk.
Hypotheses:
- H0:β=0 (Temperature doesn't affect walking age).
- Ha: β<0 (In warmer cities, babies walk sooner).
Conclusion: With a P-value of 0.2769, which is greater than standard alpha levels (0.01 or 0.05), we fail to reject the null. There is not convincing evidence of a negative linear relationship between temperature and walking age. A sample slope like the one observed could plausibly occur by chance alone even if the true slope were zero.
Final Tips for the AP Exam
Computer Output Reading: You are rarely required to calculate SEb or the slope manually. You must be able to identify them in a table. The slope is usually in the row labeled with the X-variable name, and the Standard Error is in the column labeled "SE Coeff" or "Standard Error" next to the slope.
Multiple Choice: Often tests the ability to construct the formula part of the interval (b±t∗×SEb). Remember df=n−2.
Free Response: Focus on interpreting the results (contextualizing the slope and the confidence interval).