Regression Part 2
Standard Error
Definition:
The standard error of the estimate (s_{y.x}) is described as the "standard deviation of the errors", which indicates the dispersion of data points above and below the regression line.
Equation:
s_{y.x} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum(y_i - \tilde{y}_i)^2}{n-2}}
Where:
SSE = Sum of Squared Errors
n = Number of observations
y_i = Observed values
\tilde{y}_i = Predicted values
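As a minimal sketch of the formula above, the following Python computes the standard error of the estimate for a small data set. The data values and the helper names (fit_line, standard_error_of_estimate) are made up for illustration and are not from these notes.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def standard_error_of_estimate(xs, ys):
    """s_{y.x} = sqrt(SSE / (n - 2)), where SSE sums the squared residuals."""
    b0, b1 = fit_line(xs, ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))

# Hypothetical illustrative data (not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s_yx = standard_error_of_estimate(xs, ys)
print(f"standard error of the estimate: {s_yx:.4f}")
```

Note the divisor n - 2 sits inside the square root: two degrees of freedom are used up by estimating the intercept and the slope.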
Interpretation:
A perfect prediction (all points lie on the regression line) results in:
Standard Error = 0, implying R^2 = 1.0.
Greater scatter around the regression line leads to a larger standard error, indicating poorer predictability.
Example: In the scatterplot provided, the highlighted spread of points around the regression line illustrates this scatter.
Regression Interpretation
Regression Equation:
The regression equation derived from the regression output is:
\tilde{y} = 1676 - 265x
When x = 0 (not a minority), the predicted monthly salary is $1,676.
When x = 1 (is a minority), the predicted monthly salary falls to $1,411.
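The two predictions can be reproduced with a short sketch. The function name predicted_salary is an assumption made for illustration; the coefficients are the rounded values from the equation above.

```python
def predicted_salary(is_minority):
    """Predicted monthly salary from y = 1676 - 265x, where x is a 0/1 dummy."""
    b0 = 1676   # intercept: predicted salary when x = 0
    b1 = -265   # slope: change in predicted salary when x goes from 0 to 1
    return b0 + b1 * is_minority

non_minority = predicted_salary(0)
minority = predicted_salary(1)
print(non_minority, minority, minority - non_minority)
```

Because x takes only the values 0 and 1, the difference between the two predictions equals the slope.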
Hypothesis Testing for the Slope (b_1):
Hypothesis: H_0: \beta_1 = 0 (no relationship between x and y)
Alternative Hypothesis: H_1: \beta_1 \neq 0
Testing involves observing the sampling distribution of the slope and its standard error:
The slope coefficient (b_1) is -265.26 and its standard error (s_{b_1}) is 73.50, giving the t-statistic:
t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{-265.26 - 0}{73.50} = -3.61
Significance Determination:
The p-value (0.00043) is far smaller than \alpha = 0.05, so the null hypothesis is rejected. This supports the conclusion that minority employees earn less (about $265 per month less) than non-minority employees.
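The arithmetic of the test can be sketched as follows. The p-value is taken as reported in the regression output rather than recomputed, because the notes do not give the sample size needed for the degrees of freedom.

```python
b1 = -265.26        # estimated slope from the regression output
beta1_null = 0.0    # slope value under the null hypothesis
se_b1 = 73.50       # standard error of the slope

# t-statistic: how many standard errors the estimate lies from the null value
t_stat = (b1 - beta1_null) / se_b1

p_value = 0.00043   # two-tailed p-value as reported in the output (not recomputed)
alpha = 0.05
reject_null = p_value < alpha

print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
```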
Interpretation of b Weight (Coefficient)
B Weight Size:
The size of the b weight does not necessarily equate to importance. It is subject to:
Scale of x and y. Example: A b weight of 0.01 may be significant for small y values (0 to 0.1), whereas a b weight of 10,000 may be non-significant for large y values (costs in millions).
Statistical Test Requirement: Avoid eyeballing the significance of coefficients; always use a t-test for significance verification.
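To illustrate why the raw size of a b weight says little on its own, the sketch below fits the same made-up data twice, once with x in thousands and once with x in single units. The data and the helper name slope_and_t are assumptions for illustration only.

```python
import math

def slope_and_t(xs, ys):
    """Return the least-squares slope b1 and its t-statistic for H0: slope = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(sse / (n - 2))      # standard error of the estimate
    se_b1 = s_yx / math.sqrt(sxx)        # standard error of the slope
    return b1, b1 / se_b1

# Hypothetical data: the same x measured in thousands and in single units
x_thousands = [1, 2, 3, 4, 5]
x_units = [1000 * x for x in x_thousands]
ys = [2, 4, 5, 4, 5]

b_k, t_k = slope_and_t(x_thousands, ys)
b_u, t_u = slope_and_t(x_units, ys)
print(f"x in thousands: b = {b_k:.4f}, t = {t_k:.3f}")
print(f"x in units:     b = {b_u:.6f}, t = {t_u:.3f}")
```

The slope shrinks by a factor of 1,000 when x is rescaled, while the t-statistic stays the same, which is why significance should be judged by the t-test and not by the magnitude of b.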
Effect of Value Range:
Truncating the range of x can make its effect on y appear non-significant. E.g., predicting GPA from a narrow GMAT score range may result in a non-significant b weight despite GMAT being a crucial predictor overall.
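One way to see the range effect is through the standard formula for the standard error of the slope, s_{b_1} = s_{y.x} / \sqrt{\sum(x_i - \bar{x})^2}: a narrower spread of x gives a larger standard error and therefore a smaller t-statistic. The sketch below uses hypothetical GMAT scores and an assumed value of s_{y.x}; none of these numbers come from the notes.

```python
import math

def slope_standard_error(s_yx, xs):
    """s_b1 = s_yx / sqrt(sum of squared deviations of x from its mean)."""
    x_bar = sum(xs) / len(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return s_yx / math.sqrt(sxx)

s_yx = 0.4                                  # assumed scatter around the line (GPA points)
wide_gmat = [400, 500, 600, 700, 800]       # hypothetical full range of scores
narrow_gmat = [680, 690, 700, 710, 720]     # hypothetical truncated range

se_wide = slope_standard_error(s_yx, wide_gmat)
se_narrow = slope_standard_error(s_yx, narrow_gmat)
print(f"wide range:   s_b1 = {se_wide:.5f}")
print(f"narrow range: s_b1 = {se_narrow:.5f}")
print(f"ratio: {se_narrow / se_wide:.1f}")
```

With the same scatter and the same true slope, the truncated sample has a much larger standard error, so the same b weight is far less likely to test as significant.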
Interpretation of the Equation
Y-Intercept:
If x = 0 lies outside the data range, the intercept has limited interpretive value. In this context, x = 0 is within the data and signifies white employees.
The intercept value of 1676 represents the predicted monthly salary for white employees.
Slope Interpretation:
A slope of -265 indicates a decrease in predicted salary as one moves from non-minority to minority status.
Interpretation using prediction:
If x = 1 (minority),
\tilde{y} = 1676 - 265(1) = $1411
This indicates minority employees earn $265 less per month compared to non-minorities.
Additional Notes
t-test for Y-Intercept:
Tests the null hypothesis (H_0: \text{Intercept} = 0). Generally not meaningful unless x = 0 lies within the range of the data.
Two-Tailed p-Values:
The p-values in regression output are two-tailed. Always compare the p-value with alpha and check that the sign of the slope matches the direction stated in the hypothesis.
ANOVA and t-test for Slope:
With a single predictor (x), the relationship F = t^2 holds, so the ANOVA F-test and the t-test of the slope give the same p-value.
For simple regression, rely on the t-test of the slope rather than on ANOVA.
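The F = t^2 relationship can be checked numerically. The sketch below uses a small made-up data set (not from the notes) and computes both statistics from their definitions.

```python
import math

# Hypothetical illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in xs]
sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))   # error sum of squares
ssr = sum((p - y_bar) ** 2 for p in predicted)           # regression sum of squares

mse = sse / (n - 2)            # mean square error (n - 2 degrees of freedom)
f_stat = (ssr / 1) / mse       # ANOVA F with 1 regression degree of freedom

se_b1 = math.sqrt(mse / sxx)   # standard error of the slope
t_stat = b1 / se_b1            # t-statistic for H0: slope = 0

print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
```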
Confidence Intervals for b Weights:
These provide a range of values within which the true population slope (or intercept) is estimated to lie with a certain level of confidence (e.g., 95%). They are crucial for understanding the precision of the estimated regression coefficients. A narrower interval suggests a more precise estimate.
It is important to distinguish them from prediction intervals, which estimate the range for an individual y value, and confidence intervals for the mean response, which estimate the range for the average y value at specific x.
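As a rough sketch, a 95% confidence interval for the slope in the salary example can be approximated as b_1 \pm z \cdot s_{b_1}. The exact interval uses a t critical value with n - 2 degrees of freedom, but the notes do not give n, so the normal critical value is used here as a large-sample assumption.

```python
from statistics import NormalDist

b1 = -265.26      # estimated slope from the regression output
se_b1 = 73.50     # standard error of the slope
confidence = 0.95

# Assumption: large-sample normal critical value in place of the exact
# t critical value, because the sample size is not given in the notes.
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z_crit * se_b1
ci_lower = b1 - margin
ci_upper = b1 + margin
print(f"approximate 95% CI for the slope: {ci_lower:.1f} to {ci_upper:.1f}")
```

The approximate interval runs from roughly -409 to -121. It excludes zero, which agrees with rejecting H_0 at \alpha = 0.05; an interval based on the t distribution would be somewhat wider.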
Reporting Suggestions:
When presenting regression results, streamline outputs for clarity by focusing on key information. This includes the overall model significance (F-statistic and its p-value), the coefficient of determination (R^2), and particularly, the estimated regression coefficients (b weights), their standard errors, t-statistics, and p-values. Visualizations (e.g., scatterplots with regression line) are often more impactful than raw statistical tables. Avoid overwhelming the audience with irrelevant statistics or direct software output.
Forecasting Models
Caution in Predictions:
Extrapolating predictions beyond the data range is risky. The relationship may follow a different pattern outside the observed data, so extrapolated predictions can lead to incorrect conclusions.
Example:
When forecasting house prices from square footage, the price relationship may change substantially for houses larger than any in the data, so predictions at those sizes are unreliable.
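One simple safeguard is to flag any prediction whose x value falls outside the observed range. The sketch below assumes hypothetical coefficients and square-footage data; the function name predict_price and all numbers are illustrative and do not come from the notes.

```python
def predict_price(sqft, b0, b1, observed_sqft):
    """Return (predicted price, is_extrapolation) for a fitted line y = b0 + b1 * x."""
    lowest, highest = min(observed_sqft), max(observed_sqft)
    prediction = b0 + b1 * sqft
    is_extrapolation = not (lowest <= sqft <= highest)
    return prediction, is_extrapolation

# Hypothetical fitted coefficients and observed house sizes
b0, b1 = 50_000, 120
observed_sqft = [900, 1200, 1500, 2100, 2800]

inside = predict_price(2000, b0, b1, observed_sqft)
outside = predict_price(6000, b0, b1, observed_sqft)
print(inside)    # within the observed range
print(outside)   # beyond the largest observed house: treat with caution
```

The flag does not make the extrapolated number any more reliable; it only marks predictions that the data cannot support.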