Chapter 26 - Inferences for Regression

Homework

  • Chapter 26 CW "College Admissions"
  • Chapter 26 WS2 "High Stakes Test"
  • Chapter 26 FRQ 2011 #5

Objective

  • Determine the statistical significance of a least squares regression model.

An Example: Body Fat and Waist Size

  • The chapter example revolves around the relationship between % body fat and waist size (in inches).

Remembering Regression

  • Model the relationship between two quantitative and continuous variables: a predictor variable and a response variable.
  • Use a sample to estimate the model in the population.
  • The population regression line assumes that the means of the distributions of the response variable (for each value of the predictor variable) fall along the line with values normally distributed above and below the mean response value for that predictor value.
  • A correlation can always be calculated, and a regression model can always be created, even with random data.
  • The key question is whether the model is statistically significant or simply a function of sampling variability.

Inference for Regression

  • Test hypotheses and make confidence intervals about the slope and intercept of the regression model.
  • Regression models provide an idea of reality beyond just the data in the study, allowing inference to a larger population.

The Population and the Sample

  • Even with complete population data, the data would not perfectly align on a straight line.
  • There's a distribution of %body fat for men with the same waist size.
  • The idealized model assumes the means of the distributions of %body fat for each waist size fall along the line and the individual values are scattered above and below the line.
  • With all population values, least squares methodology can find the slope and intercept of an idealized regression line.

Population Regression Line

  • Represented using Greek letters (parameters):
    • \mu_y = \beta_0 + \beta_1 x
    • \beta_0 is the intercept.
    • \beta_1 is the slope.
  • Individual y values are not all at these means; some lie above and some below the line.
  • Models have errors (residuals), denoted by \epsilon.
  • Fitted line: \hat{y} = b_0 + b_1 x
  • Errors should be random and can be positive or negative.
  • Equation for individual scores: y = \beta_0 + \beta_1 x + \epsilon
    • True for each data point because \epsilon absorbs the residual.
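The individual-score equation can be sketched with simulated data; the parameter values below (beta0, beta1, sigma, and the waist-size range) are made up for illustration, not taken from the chapter's data:

```python
import numpy as np

# Sketch: simulate individual scores from the population regression model
# y = beta0 + beta1*x + eps, with eps ~ Normal(0, sigma).
# All parameter values here are hypothetical.
rng = np.random.default_rng(1)
beta0, beta1, sigma = -42.0, 1.7, 4.0    # hypothetical parameters
waist = rng.uniform(30, 45, size=200)    # predictor values
body_fat = beta0 + beta1 * waist + rng.normal(0, sigma, size=200)

# Each epsilon absorbs the gap between an individual y and the mean line,
# so the errors average out to roughly 0.
eps = body_fat - (beta0 + beta1 * waist)
print(round(eps.mean(), 2))
```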

Assumptions and Conditions

  • Previously, data needed to be quantitative, sufficiently linear, and have no significant outliers.
  • Now, to make inferences about the coefficients, more assumptions need verification.
  • Order of checking conditions is important; if an initial assumption is false, there's no point in checking later ones.
  • Always draw a picture.

Specific Assumptions and Conditions

  1. Linearity Assumption:
    • Check that the data are Quantitative and Continuous.
    • Examine the scatterplot for linearity. If not linear, straighten the data or stop.
    • Check the residual plot (part 1): The residuals should appear randomly scattered.
  2. Independence Assumption:
    • Randomization Condition: Data values are randomly selected and provide a representative sample from the population.
  3. Equal Variance Assumption (Homoscedasticity):
    • Check the residual plot (part 2): The spread of the residuals should be uniform.
  4. Normal Population Assumption:
    • Sufficiently Unimodal and Symmetric: Check a histogram of the residuals; it should be unimodal and symmetric.
    • Outlier Condition: Check for outliers and gaps in the data.
  • The residuals are crucial for checking conditions but are computed after the regression model.

Idealized Regression Model

  • If all four assumptions are satisfied:
    • At each value of x, there is a distribution of y-values that follows a Normal model (bivariate normal).
    • Each Normal model is centered on the line and has the same standard deviation (homoscedastic).

Order of Analysis

  1. Scatterplot: Check for linearity. If not straight, re-express data or stop.
  2. Regression Model: Fit a regression model and find residuals (e) and predicted values (y-hat).
  3. Residual Plot: Scatterplot of residuals against x or predicted values.
    • Should have no pattern; check for bends, inconsistent variation, or outliers.
    • If data is measured over time, plot residuals against time to check for dependence.
    • If conditions are satisfied, proceed with inference.
  4. Histogram/Normal Probability Plot: Check if residuals are sufficiently unimodal and symmetric.

Intuition About Regression Inference

  • A representative sample will produce a model with slope b_1 whose expected value is the true slope, \beta_1.
  • The variability between regression models from different samples is affected by:
    • Spread around the line.
    • Spread of the x's.
    • Sample size.

Factors Affecting Variability

  • Spread around the line: Less scatter means the slope will be more consistent from sample to sample. Measured by the residual standard deviation, s_e (usually labeled 's' in regression output).
  • Spread of x values: A large standard deviation of x, s_x, provides a more stable regression; a large range of x values makes more consistent models.
  • Sample size: Larger sample size, n, gives more consistent estimates.

Standard Error for the Slope

  • Three aspects affecting the standard error of the regression slope:
    • Spread around the line (s_e).
    • Spread of x values (s_x).
    • Sample size (n).
  • Formula for the standard error:
    • SE(b_1) = \frac{s_e}{s_x \sqrt{n-1}}
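A quick check of this formula, plugging in the anxiety-example values used later in the chapter (s_e = 18.6931, s_x = 2.5137, n = 24):

```python
import math

# Sketch: SE(b1) = s_e / (s_x * sqrt(n - 1)), with the chapter's
# anxiety-example summary statistics.
s_e, s_x, n = 18.6931, 2.5137, 24
se_b1 = s_e / (s_x * math.sqrt(n - 1))
print(round(se_b1, 4))   # prints 1.5506
```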

Sampling Distribution

  • Sampling distribution of possible models (slopes) from all possible samples from the population.
  • When conditions are met, the standardized estimated regression slope follows a Student’s t-model with n – 2 degrees of freedom.
  • t_{n-2} = \frac{b_1 - \beta_1}{SE(b_1)}
  • Where:
    • s_e = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}}, the standard error of the residuals.
    • s_x is the ordinary standard deviation of the x-values.
    • n is the number of data values.

Formula Sheet

  • The AP formula sheet provides the formula for the standard error of the slope as:
    • s_{b_1} = \frac{\sqrt{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n-2)}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}} = \frac{s_e}{s_x \sqrt{n-1}}
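The two forms are algebraically identical, since s_x \sqrt{n-1} = \sqrt{\sum (x_i - \bar{x})^2}; a numeric check on made-up data:

```python
import numpy as np

# Sketch: verify numerically that the AP formula-sheet form of s_b1
# equals s_e / (s_x * sqrt(n - 1)). The data below are made up.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

s_e = np.sqrt(np.sum(resid**2) / (n - 2))
formula_sheet = s_e / np.sqrt(np.sum((x - x.mean())**2))  # formula-sheet form
s_x = x.std(ddof=1)
shortcut = s_e / (s_x * np.sqrt(n - 1))                   # shortcut form

print(np.isclose(formula_sheet, shortcut))   # prints True
```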

What About the Intercept?

  • The same reasoning applies for the intercept.
  • The intercept is often of very little value.
  • Most hypothesis tests and confidence intervals for regression are about the slope \beta_1 (and \rho).
  • We can write:
    • t_{n-2} = \frac{b_0 - \beta_0}{SE(b_0)}, but we rarely use this fact for anything.

Regression Inference

  • A null hypothesis of a zero slope suggests that the entire claim of a linear relationship between the two variables is not valid.
  • To test H_0: \beta_1 = 0, find t = \frac{b_1 - \beta_1}{SE(b_1)} and continue as with any t-test.
  • The formula for a confidence interval for \beta_1 is:
    • b_1 \pm t^*_{n-2} \frac{s_e}{s_x \sqrt{n-1}}

Standard Errors for Predicted Values

  • Distinction between predicting the mean %body fat for all men with a certain waist size and predicting the %body fat for a particular man with that waist size.
  • The mean can be predicted with more precision.
  1. Mean %body fat for all men with a waist size of, say, 38 inches?
  2. Estimate the %body fat for a particular man with a 38-inch waist?

Predicting for a New Individual

  • Start with the same prediction in both cases:
    • Call his x-value x_v (38 inches).
    • \hat{y}_v = b_0 + b_1 x_v
  • Confidence intervals for both predictions take the form:
    • \hat{y}_v \pm t^*_{n-2} \cdot SE
  • However, the SE’s will be different for the two questions.

Standard Error of the Mean Predicted Value

  • Individuals vary more than means, so the standard error for a single predicted value is larger than the standard error for the mean:

  • Standard error of the mean predicted value:

    • SE(\hat{\mu}_v) = \sqrt{SE^2(b_1) \cdot (x_v - \bar{x})^2 + \frac{s_e^2}{n}}
  • Standard error for a single predicted value:

    • SE(\hat{y}_v) = \sqrt{SE^2(b_1) \cdot (x_v - \bar{x})^2 + \frac{s_e^2}{n} + s_e^2}
  • Models for means are more consistent than models for individual values.
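A sketch comparing the two standard errors at the same x-value; every input below (SE(b1), s_e, n, x_v, x-bar) is a hypothetical value, not from the chapter:

```python
import math

# Sketch: SE for predicting a mean vs. SE for predicting an individual
# at the same x. All inputs are hypothetical.
se_b1, s_e, n = 0.9, 7.0, 30
x_v, x_bar = 38.0, 36.0

se_mean = math.sqrt(se_b1**2 * (x_v - x_bar)**2 + s_e**2 / n)
se_indiv = math.sqrt(se_b1**2 * (x_v - x_bar)**2 + s_e**2 / n + s_e**2)

# The individual SE adds the extra s_e^2 term, so it is always larger.
print(round(se_mean, 3), round(se_indiv, 3))
```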

Confidence Intervals for Predictions

  • The narrower interval is a confidence interval for the predicted mean value at x_v.
  • The wider interval is a prediction interval for an individual with that x-value, x_v.

Cautions for Regression

  • Don't extrapolate beyond your data; it's dangerous to predict for x-values far from the center of the data.
  • Watch out for highly influential points and outliers.
  • Tests are usually two-tailed; for a one-tailed test, divide the reported P-value in half.
  • Remember, models are likely wrong for any individual prediction.
  • Don't be too enamored of models.

Example: Anxiety and Test Scores

  • Determine if the correlation and regression model are statistically significant, and what the model tells us.

  • Given data:

    • test score = 91.6583 – 4.4862(anxiety level)
    • r = –.5250, r2 = .2756

Hypotheses Testing

  • Testing whether the weak relationship between anxiety level and math test score is an actual relationship or due to sampling error.
    • b_1 = r \frac{s_y}{s_x}
  • Hypotheses:
    • H_0: \beta_1 = 0 (no linear association between math test score and anxiety level).
    • H_a: \beta_1 \neq 0 (there is a linear association between math test score and anxiety level).
  • Residuals:
    • Linearity? The scatterplot shows a definite linear relationship.
    • Independence: Unless students are cheating, there is no reason to believe test scores are not independent.
    • Variability Condition: The residual plot shows relatively consistent variation.
    • Residual Normality Condition: A histogram of the residuals shows a sufficiently unimodal, symmetric distribution.
  • Conditions for a t-test with 24-2 = 22 degrees of freedom are met.

Calculation using Calculator

  • Calculator Steps:
    • STAT ➔ TESTS ➔ F:LinRegTTest
  • Output:
    • t = –2.893240474
    • p = .0084352114
    • df = 22
    • a = 91.65825688 (intercept)
    • b = –4.486238532 (slope)
    • s = 18.6930621 (se, sample estimate of standard deviation of errors (residuals))
    • r2 = .275620968
    • r = –.52499616
  • Based on the t-statistic of –2.8932 and p-value of .0084, reject H0.
  • Sufficient evidence to believe there is a relationship between anxiety level and test score.

Manual Calculation of t-statistic

  • Calculate the t statistic:
    • t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{-4.4862 - 0}{\frac{18.6931}{2.5137 \sqrt{23}}} = -2.8932
    • P(t \le -2.8932 \text{ or } t \ge 2.8932) = P(b_1 \le -4.4862 \text{ or } b_1 \ge 4.4862) = tcdf(-9, -2.8932, 22) \times 2 = .0042 \times 2 = .0084
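The hand calculation can be re-checked in Python (a sketch, assuming SciPy is available):

```python
from scipy import stats

# Sketch: recompute the t-statistic and two-sided p-value for the anxiety
# example from the chapter's rounded summary values.
b1, s_e, s_x, n = -4.4862, 18.6931, 2.5137, 24
se_b1 = s_e / (s_x * (n - 1) ** 0.5)
t = (b1 - 0) / se_b1
p = 2 * stats.t.cdf(t, df=n - 2)   # two-sided p-value, 22 df

print(round(t, 4), round(p, 4))    # prints -2.8932 0.0084
```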

Confidence Interval

  • Create a confidence interval around the slope of the regression model.
    • Conditions and parameter are the same as the test.
  • Calculator Steps:
    • STAT ➔ TESTS ➔ G:LinRegTInt
  • Output:
    • (-7.702, -1.271)
    • b = –4.486238532
    • df = 22
    • s = 18.6930621
    • a = 91.65825688
    • r2 = .275620968
    • r = –.52499616
  • We are 95% confident that, for each one-unit increase in anxiety level, the mean test score decreases by between 1.3 and 7.7 points.

Hand Calculation for Confidence Interval

  • Calculating Confidence Interval by hand:
    • b_1 \pm t^*_{n-2} \frac{s_e}{s_x \sqrt{n-1}} = -4.4862 \pm 2.0738 \cdot \frac{18.6931}{2.5137 \sqrt{23}} = (-7.7019, -1.2705)
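The same interval, computed in Python from the chapter's rounded values (t* = 2.0738 for 22 df is taken from the notes):

```python
import math

# Sketch: 95% CI for the slope, b1 ± t* · SE(b1), using the anxiety
# example values from the notes.
b1, s_e, s_x, n, t_star = -4.4862, 18.6931, 2.5137, 24, 2.0738
se_b1 = s_e / (s_x * math.sqrt(n - 1))
lo, hi = b1 - t_star * se_b1, b1 + t_star * se_b1
print(round(lo, 4), round(hi, 4))   # prints -7.7019 -1.2705
```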

Computer Output Interpretation

  • Key elements to look for in computer output in the regression analysis:
    • Response Variable
    • r^2
    • s_e (standard deviation of the residuals)
    • Predictor Variable
    • Slope b_1
    • Intercept b_0
    • SE(b_1)
    • SE(b_0)
    • t = b_1 / SE(b_1)
    • t = b_0 / SE(b_0)
    • df
    • p-values
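Most of these quantities can be reproduced in Python with, for example, scipy.stats.linregress; the data below are made up, so the printed numbers are not the chapter's output:

```python
import numpy as np
from scipy import stats

# Sketch: the main quantities a regression printout lists, computed with
# scipy.stats.linregress on made-up data.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.2, 5.6, 5.9, 7.2, 7.4])

res = stats.linregress(x, y)
t = res.slope / res.stderr            # t = b1 / SE(b1)

print(round(res.slope, 4))            # slope b1
print(round(res.intercept, 4))        # intercept b0
print(round(res.rvalue**2, 4))        # r^2
print(round(res.stderr, 4))           # SE(b1)
print(round(t, 4))                    # t-statistic for the slope
print(round(res.pvalue, 4))           # two-sided p-value for H0: beta1 = 0
```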

Example: Seat Location and Achievement

  • Least squares regression analysis for Seat Location and Achievement for 30 students.
  • Conditions for inference:
    • Linear: The scatterplot shows a weak, negative linear relationship. The residual plot shows no pattern. ✔
    • Independent: Students were presumably randomly assigned to seats and, of course, scores were independent (no cheating). ✔
    • Normal: The histogram of residuals is sufficiently unimodal and symmetric, the normal probability plot is mostly linear. ✔
    • Equal Variance: The variation in scores from each row varies a bit but that variance is sufficiently constant (no pattern). ✔
  • Conditions are met and can continue with inference.

Interpreting Results

  • (a) Identify the standard error of the slope SE_b from the computer output. Interpret this value in context.
  • (b) Calculate the 95% confidence interval for the true slope.
  • (c) Interpret the interval in context
  • (d) Based on the interval, is there convincing evidence that seat location affects scores?

Solution: Seat Location and Achievement

  • (a) Identify the standard error of the slope SE_b from the computer output. Interpret this value in context.
    • SE_b = 0.9472. If we repeated the random assignment many times, the slope of the estimated regression line would typically vary by about 0.9472 from the slope of the true regression line.
  • (b) Calculate the 95% confidence interval for the true slope.
    • df = 30 - 2 = 28. t_{28}^* = invt(.975, 28) = 2.0484
    • 95% CI = –1.1171 ± 2.0484(0.9472) = (–3.0573, 0.8231)
  • (c) Interpret the interval in context.
    • We are 95% confident the true slope of the model falls between –3.0573 and 0.8231.
  • (d) Based on the interval, is there convincing evidence that seat location affects scores?
    • Because the interval captures 0, we do not have convincing evidence of an association between test score and row number.
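A quick re-check of the arithmetic in part (b):

```python
import math

# Sketch: recompute the seat-location CI, b1 ± t* · SE(b1), from the
# values given in the example (b1 = -1.1171, SE = 0.9472, t*_28 = 2.0484).
b1, se_b1, t_star = -1.1171, 0.9472, 2.0484
lo, hi = b1 - t_star * se_b1, b1 + t_star * se_b1
print(round(lo, 4), round(hi, 4))   # prints -3.0573 0.8231
```

Because the interval runs from a negative to a positive value, it contains 0, which is what part (d) turns on.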