AP Statistics Unit 9: Comprehensive Study Guide on Inference for Slope

Overview of Unit 9: Inference for Slope

  • Unit 9 is the final unit of content in the AP Statistics curriculum.
  • The primary focus is extending the concepts of bivariate quantitative data from Unit 2 to make judgments about entire populations using sample data.
  • Bivariate Quantitative Data Review: These data consist of two quantitative variables measured on the same individuals, displayed on a scatter plot and summarized with a Least Squares Regression Line (LSRL).
  • The Concept of Inference for Slope: The slope calculated from a sample ($b$) is just one estimate. Inference allows us to use this sample slope to make a judgment about the true population slope ($\beta$).

The Sampling Distribution of Sample Slopes

  • To perform inference, one must understand the behavior of all possible sample slopes from samples of the same size taken from the same population.
  • Center: The mean of the sampling distribution of sample slopes is the true population slope ($\beta$), provided the samples are drawn randomly to avoid bias.
  • Spread (Standard Deviation): The standard deviation of the sampling distribution is:
    - $\sigma_b = \frac{\sigma}{\sigma_x \sqrt{n}}$
    - In this formula, $\sigma$ is the standard deviation of the residuals about the true (population) regression line, $\sigma_x$ is the standard deviation of the population X-values, and $n$ is the sample size.
  • Standard Error ($SE_b$): Because the population parameters ($\sigma$ and $\sigma_x$) are typically unknown, we use the standard error of the sample slope as an estimate:
    - $SE_b = \frac{s}{s_x \sqrt{n-1}}$
    - Here, $s$ is the standard deviation of the sample residuals, $s_x$ is the standard deviation of the X-values in the sample, and $n-1$ is used in the denominator (the reasons for using $n-1$ are beyond the scope of AP Statistics).
    - Note: This value is almost always provided in a computer regression analysis output table.
  • Shape: The distribution of sample slopes is approximately normal provided certain conditions (specifically regarding residuals) are met.
  • T-Model Application: When the sampling distribution is built using the standard error, a T-model is used with degrees of freedom ($df$) calculated as:
    - $df = n - 2$
    - Two degrees of freedom are subtracted because two parameters, the slope and the y-intercept, are estimated from the sample data.
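
The center, spread, and standard-error claims above can be checked with a small simulation. The following is a minimal Python sketch, not a required AP procedure. The population line ($y = 5 + 2x$ with residual SD 3), the x-values, and every function name are made up for illustration. It also assumes the same x-values are reused in every sample, so $\sigma_x$ is computed from those x-values.

```python
import math
import random
import statistics

random.seed(42)

# Hypothetical population model (values made up): y = 5 + 2x + noise, noise SD = 3.
ALPHA, BETA, SIGMA = 5.0, 2.0, 3.0
X_VALUES = list(range(25))  # assumption: the same x-values are used in every sample
N = len(X_VALUES)

def fit_slope_and_se(x, y):
    """Return the LSRL slope b and SE_b = s / (s_x * sqrt(n - 1))."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    # s: standard deviation of the residuals, dividing by n - 2
    s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    s_x = statistics.stdev(x)  # sample SD of x (divides by n - 1)
    return b, s / (s_x * math.sqrt(n - 1))

def draw_sample():
    """Simulate one sample of y-values from the hypothetical population."""
    return [ALPHA + BETA * xi + random.gauss(0, SIGMA) for xi in X_VALUES]

results = [fit_slope_and_se(X_VALUES, draw_sample()) for _ in range(2000)]
slopes = [b for b, _ in results]
standard_errors = [se for _, se in results]

# Theoretical spread: sigma_b = sigma / (sigma_x * sqrt(n))
sigma_x = statistics.pstdev(X_VALUES)
sigma_b = SIGMA / (sigma_x * math.sqrt(N))

print(f"mean of sample slopes: {statistics.mean(slopes):.4f} (true slope {BETA})")
print(f"SD of sample slopes:   {statistics.stdev(slopes):.4f} (formula {sigma_b:.4f})")
print(f"average SE_b:          {statistics.mean(standard_errors):.4f}")
```

The mean of the simulated slopes should land close to the true slope, and both the SD of the slopes and the typical $SE_b$ should land close to $\sigma_b$. Any single sample's $SE_b$ will vary around that value.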

Constructing a T-Interval for Slope

  • A T-interval is used to estimate the true population slope ($\beta$).
  • Four-Step Process:
    1. Name the Interval: Explicitly state it is a "T-interval for the slope of [context]."
    2. Check Conditions: Verify the assumptions necessary for the sampling distribution.
    3. Build the Interval: Use the formula $b \pm t^* \times SE_b$.
    4. Interpret the Interval: State the confidence level and what the interval represents in terms of the population relationship.

Necessary Conditions for Inference

  • Linearity: The scatter plot of the sample data must appear somewhat linear. A regression line should not be applied to non-linear data.
  • Randomness: Observations must be selected randomly to avoid bias and allow for generalization.
  • Independence (10% Condition): When sampling without replacement, the sample size $n$ must be less than $10\%$ of the population size to assume independence.
  • Normality of Residuals: The residuals must be approximately normal with no major outliers or extreme skewness. This is verified by checking a histogram or dot plot of the residuals.
  • Constant Variance (Equal SD of Residuals): The standard deviation of the Y-values should not vary with X. When looking at a residual plot, there should be similar variability (vertical spread) across all X-values, rather than a "fan pattern" (small residuals at one end and large ones at the other).
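
The residual-based conditions are normally checked by eye on a residual plot and a histogram. As a rough illustration only, the Python sketch below computes the residuals that such plots display and compares their spread for smaller versus larger x-values. The data are hypothetical, and the half-versus-half comparison is an informal numeric proxy for the "fan pattern," not a standard AP procedure.

```python
import statistics

# Hypothetical sample (made up): car weight in pounds (x), gas mileage in mpg (y).
x = [2100, 2400, 2650, 2900, 3100, 3300, 3550, 3800, 4000, 4200]
y = [38.1, 35.2, 34.0, 31.5, 30.9, 28.4, 27.7, 25.0, 24.6, 22.1]

# Fit the LSRL: y-hat = a + b * x
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

# Residual = observed y minus predicted y; a residual plot graphs these against x.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Informal proxy for a fan pattern: residual spread for the lower half of the
# x-values versus the upper half. Very different spreads would be a warning sign.
ordered = [r for _, r in sorted(zip(x, residuals))]
half = len(ordered) // 2
spread_low = statistics.stdev(ordered[:half])
spread_high = statistics.stdev(ordered[half:])

for xi, r in zip(x, residuals):
    print(f"x = {xi:5d}  residual = {r:+.2f}")
print(f"SD of residuals, lower half of x: {spread_low:.2f}")
print(f"SD of residuals, upper half of x: {spread_high:.2f}")
```

On the exam, the graphical check is what earns credit; numbers like these only help describe what the plot shows.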

Example Case Study: Car Weight and Gas Mileage

  • Scenario: A dealership analyzes the relationship between car weight (X, in pounds) and gas mileage (Y, in miles per gallon) using a random sample of $n = 12$ cars.
  • Sample Data/Output:
    - Sample Slope ($b$): $-0.00714$
    - Standard Error ($SE_b$): $0.00056$
  • Confidence Interval (95%) Calculation:
    - Step 1: T-interval for the slope of the regression line relating gas mileage to weight.
    - Step 2: Conditions met (linear scatter plot, random sample of 12 cars, 12 is less than $10\%$ of all cars, approximately normal residuals, similar variability in the residual plot).
    - Step 3:
      - $df = 12 - 2 = 10$
      - $t^*$ (for $95\%$ confidence and $df = 10$) is found using invT(0.975, 10), which gives $t^* \approx 2.228$. (invT(0.025, 10) returns the same value with a negative sign.)
      - Interval: $-0.00714 \pm 2.228 \times 0.00056$
      - Resulting Interval: $-0.00839$ to $-0.00589$
  • Interpreting the Result: "I am $95\%$ confident that the interval from $-0.00839$ to $-0.00589$ mpg/lb captures the slope of the true regression line relating gas mileage to weight."
  • Practical Meaning: For every one-pound increase in car weight, the predicted gas mileage decreases by between $0.00589$ and $0.00839$ mpg.
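
Step 3 can be reproduced without a calculator. The Python sketch below uses only the standard library; the helper names (t_pdf, t_cdf, t_star, slope_interval) are invented for this illustration. Here t_star stands in for invT by numerically integrating the t density and then bisecting. With SciPy installed, scipy.stats.t.ppf would be the usual shortcut.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): 0.5 plus the Simpson's-rule integral of the density from 0 to x."""
    if x == 0:
        return 0.5
    h = x / steps  # signed step, so negative x also works
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_star(confidence, df):
    """Positive critical value with (1 - confidence) / 2 in each tail.

    Bisection on [0, 100], which is wide enough for typical AP problems.
    """
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def slope_interval(b, se_b, n, confidence):
    """T-interval for slope: b plus or minus t* times SE_b, with df = n - 2."""
    margin = t_star(confidence, n - 2) * se_b
    return b - margin, b + margin

# Values from the car weight and gas mileage example.
lower, upper = slope_interval(b=-0.00714, se_b=0.00056, n=12, confidence=0.95)
print(f"t* = {t_star(0.95, 10):.3f}")
print(f"95% CI for slope: ({lower:.5f}, {upper:.5f}) mpg per pound")
```

Because the whole interval is computed from $b$, $SE_b$, $n$, and the confidence level, the same function can be reused for any slope interval read off a regression output table.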

Special Considerations: Zero in the Interval

  • If a confidence interval for slope contains zero (e.g., from $-0.5$ to $+0.2$), it suggests that there may be no linear relationship between the variables in the population.
  • A slope of zero represents a horizontal line, meaning changes in X do not predict changes in Y.
  • If the entire interval is negative, there is strong evidence of a negative relationship.
  • If the entire interval is positive, there is strong evidence of a positive relationship.

Kelly’s Case Study: Car Price vs. Mileage

  • Scenario: Kelly examines $n = 10$ cars to predict Price (Y) based on Mileage (X).
  • Details:
    - Sample size: $10$
    - $df = 10 - 2 = 8$
    - Slope ($b$): $-0.2083$
    - Standard Error ($SE_b$): $0.0684$
  • 99% Confidence Interval Calculation:
    - $t^*$ for $99\%$ confidence and $df = 8$ is approximately $3.3554$
    - Interval: $-0.2083 \pm 3.3554 \times 0.0684$
    - Resulting Interval: $-0.4378$ to $+0.0212$
  • Conclusion: Since the interval contains zero, there is not convincing evidence of a linear relationship. The price could decrease by about 44 cents per mile, increase by about 2 cents per mile, or not change at all (slope of zero).
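
The arithmetic and the "does it contain zero?" check for Kelly's interval can be sketched in a few lines of Python. The function names are made up for illustration, and the critical value $3.3554$ is taken directly from the calculation above rather than computed.

```python
def slope_interval(b, t_star, se_b):
    """Return (lower, upper) for the interval b plus or minus t* times SE_b."""
    margin = t_star * se_b
    return b - margin, b + margin

def contains_zero(lower, upper):
    """True when zero is a plausible value for the population slope."""
    return lower <= 0 <= upper

# Values from Kelly's price vs. mileage example.
lower, upper = slope_interval(b=-0.2083, t_star=3.3554, se_b=0.0684)
print(f"99% CI for slope: ({lower:.4f}, {upper:.4f})")

if contains_zero(lower, upper):
    print("Zero is in the interval: no convincing evidence of a linear relationship.")
else:
    print("Zero is not in the interval: convincing evidence of a linear relationship.")
```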

Significance Tests for Slope

  • Purpose: To determine if there is statistically significant evidence of a linear relationship between two variables.
  • Step 1: Hypotheses and Naming:
    - Null Hypothesis ($H_0$): $\beta = \beta_0$ (usually $\beta = 0$, meaning no linear relationship).
    - Alternative Hypothesis ($H_a$): $\beta > 0$, $\beta < 0$, or $\beta \neq 0$.
  • Step 2: Conditions and Sampling Distribution:
    - Same conditions as the T-interval.
    - Assume $H_0$ is true, so the sampling distribution is centered at $\beta_0$ (zero when $H_0$ is $\beta = 0$).
  • Step 3: Calculating Test Statistic and P-value:
    - Test Statistic ($t$): $t = \frac{b - \beta_0}{SE_b}$
    - P-value: Use tcdf(lower, upper, df) to find the area under the T-curve corresponding to the alternative hypothesis.
  • Step 4: Conclusion:
    - If $P < \alpha$: Reject $H_0$; there is convincing evidence of a linear relationship in the direction stated by $H_a$.
    - If $P \geq \alpha$: Fail to reject $H_0$; there is not convincing evidence of a linear relationship.

Example Case Study: Baby Walking Age and Temperature

  • Scenario: Researchers studied $n = 16$ babies to see if the average daily temperature in their city affects the age (in days) at which they first walk.
  • Hypotheses:
    - $H_0: \beta = 0$ (Temperature doesn't affect walking age).
    - $H_a: \beta < 0$ (In warmer cities, babies walk sooner).
  • Data:
    - Sample Slope ($b$): $-0.0555$
    - Standard Error ($SE_b$): $0.0906$
    - $df = 16 - 2 = 14$
  • Calculations:
    - $t = \frac{-0.0555 - 0}{0.0906} \approx -0.6126$
    - P-value: tcdf(-99, -0.6126, 14) yields approximately $0.2750$.
  • Conclusion: With a P-value of about $0.2750$, which is greater than standard alpha levels ($0.01$ or $0.05$), we fail to reject the null hypothesis. There is not convincing evidence of a negative linear relationship between temperature and walking age. A sample slope this far below zero could plausibly arise from sampling variability alone.
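
The test statistic and P-value for this example can be reproduced with a short standard-library Python sketch. The helper names (t_pdf, t_cdf, slope_t_test) and the "alternative" labels are assumptions made for this illustration; t_cdf plays the role of the calculator's tcdf with a lower bound far in the left tail. With SciPy installed, scipy.stats.t.cdf would be the usual shortcut.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): 0.5 plus the Simpson's-rule integral of the density from 0 to x."""
    if x == 0:
        return 0.5
    h = x / steps  # signed step, so negative x also works
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def slope_t_test(b, se_b, n, beta_0=0.0, alternative="less"):
    """Return (t, P-value) for H0: beta = beta_0, using df = n - 2."""
    df = n - 2
    t_stat = (b - beta_0) / se_b
    if alternative == "less":          # Ha: beta < beta_0
        p_value = t_cdf(t_stat, df)
    elif alternative == "greater":     # Ha: beta > beta_0
        p_value = 1 - t_cdf(t_stat, df)
    elif alternative == "two-sided":   # Ha: beta != beta_0
        p_value = 2 * (1 - t_cdf(abs(t_stat), df))
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return t_stat, p_value

# Values from the baby walking age and temperature example.
t_stat, p_value = slope_t_test(b=-0.0555, se_b=0.0906, n=16, alternative="less")
alpha = 0.05
print(f"t = {t_stat:.4f}, P-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The three branches mirror the three possible alternative hypotheses; only the tail area that matches $H_a$ is counted.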

Final Tips for the AP Exam

  • Computer Output Reading: You are rarely required to calculate $SE_b$ or the slope manually. You must be able to identify them in a table. The slope is usually in the row labeled with the X-variable name, and the Standard Error is in the column labeled "SE Coeff" or "Standard Error" next to the slope.
  • Multiple Choice: Often tests the ability to construct the formula part of the interval ($b \pm t^* \times SE_b$). Remember $df = n - 2$.
  • Free Response: Focus on interpreting the results (contextualizing the slope and the confidence interval).