Regression Part 2

Standard Error

Definition:
- The standard error of the estimate (s_{y.x}) is described as the "standard deviation of the errors", which indicates the dispersion of data points above and below the regression line.
Equation:
- s_{y.x} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum(y_i - \tilde{y}_i)^2}{n-2}}
- Where:
  - SSE = Sum of Squared Errors
  - n = Number of observations
  - y_i = Observed values
  - \tilde{y}_i = Predicted values
Interpretation:
- A perfect prediction (all points lie on the regression line) results in a standard error of 0, implying R^2 = 1.0.
- Greater scatter around the regression line leads to a larger standard error, indicating poorer predictability.
- Example: In the accompanying scatterplot, the highlighted spread of points above and below the regression line illustrates this scatter.
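
As a minimal sketch of the formula above, the standard error of the estimate can be computed directly from observed and predicted values. The function name and the small data set are hypothetical and chosen only for illustration; the predicted values are the least-squares fit for x = 1, 2, 3, 4, 5.

```python
import math

def standard_error_of_estimate(y_obs, y_pred):
    """Return s_{y.x} = sqrt(SSE / (n - 2)) for a one-predictor regression."""
    n = len(y_obs)
    if n != len(y_pred):
        raise ValueError("y_obs and y_pred must have the same length")
    if n <= 2:
        raise ValueError("need more than 2 observations (n - 2 must be positive)")
    # SSE: sum of squared differences between observed and predicted values
    sse = sum((y - yp) ** 2 for y, yp in zip(y_obs, y_pred))
    return math.sqrt(sse / (n - 2))

# Hypothetical data, not taken from the notes
y_obs = [2.0, 4.0, 5.0, 4.0, 5.0]
y_pred = [2.8, 3.4, 4.0, 4.6, 5.2]
s_yx = standard_error_of_estimate(y_obs, y_pred)
print(f"standard error of the estimate: {s_yx:.4f}")
```

When every observed value equals its predicted value, SSE is 0 and the function returns 0, which matches the perfect-prediction case described above.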

Regression Interpretation

Regression Equation:
- The form of regression equation derived from regression output is:
  \tilde{y} = 1676 - 265x
- When x = 0 (not a minority), the predicted monthly salary is $1,676.
- When x = 1 (is a minority), the predicted salary reduces to $1,411.
Hypothesis Testing for the Slope (b_1):
- Null Hypothesis: H_0: \beta_1 = 0 (no relationship between x and y)
- Alternative Hypothesis: H_1: \beta_1 \neq 0
- Testing involves observing the sampling distribution of the slope and its standard error:
  - Coefficient for the slope (b_1) is -265.26, standard error (s_{b_1}) is 73.50, leading to a t-statistic:
    t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{-265.26 - 0}{73.50} = -3.61
- Significance Determination: The p-value (0.00043) is far below \alpha = 0.05, so the null hypothesis is rejected. The data support the conclusion that minority employees earn less (about $265 per month less) than non-minority employees.
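
The t-statistic above can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, and because the notes do not state the sample size, the p-value here uses a large-sample normal approximation rather than the t distribution with n - 2 degrees of freedom. It will therefore come out somewhat smaller than the 0.00043 reported in the regression output.

```python
from statistics import NormalDist

def slope_t_statistic(b1, se_b1, beta1_null=0.0):
    """t = (b1 - beta1_null) / s_b1."""
    return (b1 - beta1_null) / se_b1

def two_tailed_p_normal(t):
    """Two-tailed p-value using a standard normal approximation (assumes large n)."""
    return 2.0 * NormalDist().cdf(-abs(t))

# Values taken from the notes
t_stat = slope_t_statistic(b1=-265.26, se_b1=73.50)
p_approx = two_tailed_p_normal(t_stat)
reject_null = p_approx < 0.05  # compare against alpha = 0.05

print(f"t = {t_stat:.2f}, approximate p = {p_approx:.5f}, reject H0: {reject_null}")
```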

Interpretation of b Weight (Coefficient)

B Weight Size:
- The size of the b weight does not necessarily equate to importance. It depends on the scale of x and y.
  - Example: A b weight of 0.01 may be significant for small y values (0 to 0.1), whereas a b weight of 10,000 may be non-significant for large y values (costs in millions).
- Statistical Test Requirement: Avoid eyeballing the significance of coefficients; always use a t-test for significance verification.
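
The scale dependence described above can be illustrated with a small sketch: the same hypothetical data are fitted twice, once with x in dollars and once with x in thousands of dollars. The b weight changes by a factor of 1,000 while the t-statistic stays the same. The helper name and the data are assumptions made for illustration.

```python
import math

def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, se_b1, t)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))   # standard error of the estimate
    se_b1 = s_yx / math.sqrt(sxx)     # standard error of the slope
    return b0, b1, se_b1, b1 / se_b1

# Hypothetical data: the same predictor expressed in two different units
y = [2.0, 4.0, 5.0, 4.0, 5.0]
x_dollars = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
x_thousands = [1.0, 2.0, 3.0, 4.0, 5.0]

_, b1_dollars, _, t_dollars = fit_simple_regression(x_dollars, y)
_, b1_thousands, _, t_thousands = fit_simple_regression(x_thousands, y)

print(f"b1 (dollars)   = {b1_dollars:.6f}, t = {t_dollars:.4f}")
print(f"b1 (thousands) = {b1_thousands:.6f}, t = {t_thousands:.4f}")
```

Because the t-statistic is unaffected by the unit change, it is a better guide to significance than the raw size of the coefficient.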
Effect of Value Range:
- Truncating the range of x can weaken its apparent effect on y, sometimes to the point of non-significance. E.g., predicting GPA from a narrow GMAT score range may result in a non-significant b weight despite GMAT being a crucial predictor overall.

Interpretation of the Equation

Y-Intercept:
- If x = 0 is outside the data range, interpretation of the intercept is limited. In this context, x = 0 is within the data and signifies non-minority (white) employees.
- The intercept value of 1676 represents the predicted monthly salary for non-minority (white) employees.
Slope Interpretation:
- A slope of -265 indicates a decrease in predicted salary as one moves from non-minority to minority status.
- Interpretation using prediction:
- If x = 1 (minority),
  \tilde{y} = 1676 - 265(1) = $1411
- This indicates minority employees earn $265 less per month compared to non-minorities.
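
With a 0/1 predictor, the least-squares intercept equals the mean of the x = 0 group and the slope equals the difference between the two group means. The sketch below checks this on a small set of hypothetical salaries (not the data behind the notes); the function name is also an assumption.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical monthly salaries: x = 0 (non-minority), x = 1 (minority)
x = [0, 0, 0, 1, 1, 1]
y = [1600.0, 1700.0, 1800.0, 1400.0, 1450.0, 1500.0]

b0, b1 = fit_line(x, y)
mean_group0 = sum(yi for xi, yi in zip(x, y) if xi == 0) / x.count(0)
mean_group1 = sum(yi for xi, yi in zip(x, y) if xi == 1) / x.count(1)

print(f"intercept = {b0:.1f}, mean of x=0 group = {mean_group0:.1f}")
print(f"slope = {b1:.1f}, difference in means = {mean_group1 - mean_group0:.1f}")
```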

Additional Notes

t-test for Y-Intercept:
- Tests the null hypothesis (H_0: \text{Intercept} = 0). Generally not meaningful unless x=0 is in the data.
Two-Tailed p-Values:
- Regression output reports two-tailed p-values by default. Always compare the p-value with alpha, and check that the sign of the slope matches the hypothesized direction.
ANOVA and t-test for Slope:
- When there is a single predictor (x), the relationship F = t^2 holds.
- For simple regression, rely on the t-test of the slope rather than ANOVA.
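
The F = t^2 relationship can be checked numerically. The sketch below fits a one-predictor regression on hypothetical data, computes F from the ANOVA sums of squares and t from the slope and its standard error, and compares the two. The function name and data are assumptions for illustration.

```python
import math

def f_and_t(x, y):
    """Return (F, t) for a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # regression sum of squares
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error sum of squares
    mse = sse / (n - 2)
    f_stat = (ssr / 1) / mse              # one predictor, so 1 regression df
    t_stat = b1 / math.sqrt(mse / sxx)    # slope divided by its standard error
    return f_stat, t_stat

# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
f_stat, t_stat = f_and_t(x, y)
print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
```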
Confidence Intervals for b Weights:
- These provide a range of values within which the true population slope (or intercept) is estimated to lie with a certain level of confidence (e.g., 95%). They are crucial for understanding the precision of the estimated regression coefficients. A narrower interval suggests a more precise estimate.
- It is important to distinguish them from prediction intervals, which estimate the range for an individual y value, and confidence intervals for the mean response, which estimate the range for the average y value at specific x.
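
A confidence interval for the slope has the form b_1 ± (critical value) × s_{b_1}. The sketch below applies this to the slope and standard error from the notes. Since the sample size is not given, it assumes a large sample and uses the normal critical value (about 1.96 for 95%) in place of the t critical value with n - 2 degrees of freedom; the exact interval would be slightly wider. The function name is hypothetical.

```python
from statistics import NormalDist

def slope_confidence_interval(b1, se_b1, confidence=0.95):
    """Approximate CI for the slope using a normal critical value (assumes large n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * se_b1
    return b1 - margin, b1 + margin

# Slope and standard error taken from the notes
low, high = slope_confidence_interval(b1=-265.26, se_b1=73.50)
print(f"approximate 95% CI for the slope: ({low:.2f}, {high:.2f})")
```

The interval lies entirely below zero, which agrees with rejecting H_0: \beta_1 = 0 at \alpha = 0.05.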
Reporting Suggestions:
- When presenting regression results, streamline outputs for clarity by focusing on key information. This includes the overall model significance (F-statistic and its p-value), the coefficient of determination (R^2), and particularly, the estimated regression coefficients (b weights), their standard errors, t-statistics, and p-values. Visualizations (e.g., scatterplots with regression line) are often more impactful than raw statistical tables. Avoid overwhelming the audience with irrelevant statistics or direct software output.

Forecasting Models

Caution in Predictions:
- Extrapolating predictions beyond the data range is risky. The relationship between x and y may follow a different pattern outside the observed data, so extrapolated predictions can lead to incorrect conclusions.
- Example: When forecasting house prices based on square footage, the actual price relationship may differ substantially at larger sizes than those observed, so predictions there are unreliable.
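
One simple safeguard is to refuse to predict outside the range of x used to fit the model. The sketch below is one possible way to do that; the function name, the coefficients, and the square-footage range are all hypothetical and not taken from the notes.

```python
def predict_within_range(x_new, b0, b1, x_min, x_max):
    """Return b0 + b1*x_new, but only if x_new lies inside the fitted data range."""
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"x = {x_new} is outside the observed range [{x_min}, {x_max}]; "
            "the prediction would be an extrapolation"
        )
    return b0 + b1 * x_new

# Hypothetical house-price model fitted on homes between 800 and 3000 square feet
B0, B1 = 50_000.0, 150.0
X_MIN, X_MAX = 800, 3000

price = predict_within_range(2000, B0, B1, X_MIN, X_MAX)
print(f"predicted price for 2000 sq ft: {price:,.0f}")

try:
    predict_within_range(6000, B0, B1, X_MIN, X_MAX)
except ValueError as err:
    print(f"refused: {err}")
```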