Regression Part 2

Standard Error
  • Definition:

    • The standard error of the estimate (\(s_{y.x}\)) is described as the "standard deviation of the errors", which indicates the dispersion of data points above and below the regression line.

  • Equation:

    • \[ s_{y.x} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \tilde{y}_i)^2}{n-2}} \]

    • Where:

      • \(SSE\) = Sum of Squared Errors

      • \(n\) = Number of observations

      • \(y_i\) = Observed values

      • \(\tilde{y}_i\) = Predicted values

  • Interpretation:

    • A perfect prediction (all points lie on the regression line) results in a standard error of 0, which corresponds to \(R^2 = 1.0\).

    • Greater scatter around the regression line leads to a larger standard error, indicating poorer predictability.

    • Example: In the scatterplot provided, the highlighted spread of points around the regression line illustrates this scatter.
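
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name and the sample values are hypothetical, and the predicted values are simply assumed rather than taken from a fitted model.

```python
import math


def standard_error_of_estimate(observed, predicted):
    """Return s_{y.x} = sqrt(SSE / (n - 2)) for a simple linear regression."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    n = len(observed)
    if n <= 2:
        raise ValueError("need more than 2 observations (n - 2 degrees of freedom)")
    # SSE: sum of squared differences between observed and predicted values
    sse = sum((y - y_pred) ** 2 for y, y_pred in zip(observed, predicted))
    return math.sqrt(sse / (n - 2))


# Hypothetical values, for illustration only
observed = [3.0, 5.0, 7.0, 10.0]
predicted = [3.0, 5.0, 8.0, 9.0]

print(standard_error_of_estimate(observed, predicted))   # SSE = 2, n - 2 = 2 -> 1.0
print(standard_error_of_estimate(predicted, predicted))  # perfect prediction -> 0.0
```

The second call shows the perfect-prediction case from the interpretation above: when every point lies on the line, SSE is 0 and so is the standard error.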


Regression Interpretation
  1. Regression Equation:

    • The regression equation derived from the regression output is:

      \[ \tilde{y} = 1676 - 265x \]

    • When \(x = 0\) (not a minority), the predicted monthly salary is $1,676.

    • When \(x = 1\) (is a minority), the predicted monthly salary drops to $1,411.

  2. Hypothesis Testing for the Slope (\(b_1\)):

    • Null Hypothesis: \(H_0: \beta_1 = 0\) (no relationship between \(x\) and \(y\))

    • Alternative Hypothesis: \(H_1: \beta_1 \neq 0\)

    • Testing involves observing the sampling distribution of the slope and its standard error:

      • The slope coefficient (\(b_1\)) is -265.26 (about -265) and its standard error (\(s_{b_1}\)) is 73.50, giving the t-statistic:

        \[ t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{-265.26 - 0}{73.50} = -3.61 \]

    • Significance Determination: The p-value (0.00043) is far smaller than \(\alpha = 0.05\), so the null hypothesis is rejected. This supports the conclusion that minorities earn less than non-minorities (about $265 less per month).
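
The test above can be reproduced as a short sketch. The function names are hypothetical. The slope, standard error, and p-value are the ones quoted in these notes; the p-value is taken from the regression output rather than recomputed, because the notes do not give the sample size needed for the degrees of freedom.

```python
def slope_t_statistic(b1, se_b1, beta1_null=0.0):
    """t = (b1 - beta1_null) / se_b1, the test statistic for the slope."""
    if se_b1 <= 0:
        raise ValueError("standard error must be positive")
    return (b1 - beta1_null) / se_b1


def reject_null(p_value, alpha=0.05):
    """Reject H0 when the reported p-value is below the significance level."""
    return p_value < alpha


t = slope_t_statistic(b1=-265.26, se_b1=73.50)
print(round(t, 2))           # -3.61
print(reject_null(0.00043))  # True: 0.00043 < 0.05
```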


Interpretation of b Weight (Coefficient)
  • B Weight Size:

    • The size of the b weight does not necessarily equate to importance. It depends on:

    • Scale of \(x\) and \(y\). Example: A b weight of 0.01 may be significant for small \(y\) values (0 to 0.1), whereas a b weight of 10,000 may be non-significant for large \(y\) values (costs in millions).

    • Statistical Test Requirement: Avoid eyeballing the significance of coefficients; always use a t-test for significance verification.

  • Effect of Value Range:

    • Truncating the range of \(x\) can weaken its apparent effect on \(y\). E.g., predicting GPA from a narrow range of GMAT scores may yield a non-significant b weight even though GMAT is a crucial predictor overall.
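
Both points in this section (scale and range) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the GMAT and GPA values are invented, the noise is a fixed alternating pattern rather than real data, and `fit_slope` is a hypothetical helper that uses the standard least-squares formulas, with the slope's standard error computed as \(s_{y.x} / \sqrt{\sum (x_i - \bar{x})^2}\).

```python
import math


def fit_slope(x, y):
    """Return (b1, se_b1, t) for a simple least-squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))   # standard error of the estimate
    se_b1 = s / math.sqrt(sxx)     # standard error of the slope
    return b1, se_b1, b1 / se_b1


# Hypothetical data: GMAT 400..780, GPA rising with GMAT plus alternating noise
gmat = [400 + 20 * i for i in range(20)]
gpa = [2.0 + 0.002 * g + 0.15 * (-1) ** i for i, g in enumerate(gmat)]

# Full range of x: the slope is clearly detectable (|t| well above 2)
b1_full, se_full, t_full = fit_slope(gmat, gpa)

# Truncated range (GMAT 560..620 only): much larger standard error, small |t|
b1_narrow, se_narrow, t_narrow = fit_slope(gmat[8:12], gpa[8:12])

# Rescaled x (GMAT in hundreds): b1 is 100 times larger, t is unchanged
b1_scaled, se_scaled, t_scaled = fit_slope([g / 100 for g in gmat], gpa)

print(t_full, t_narrow)
print(b1_full, b1_scaled, t_scaled)
```

Changing the units of \(x\) changes the size of the b weight but not its t-statistic, which is why the size alone says little about importance. Narrowing the range of \(x\) inflates the slope's standard error, so the same underlying relationship can fail the t-test.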


Interpretation of the Equation
  • Y-Intercept:

    • If \(x = 0\) is outside the data range, the intercept has limited interpretive value. In this context, \(x = 0\) signifies white (non-minority) employees, so the intercept is interpretable.

    • The intercept value of 1676 represents the predicted monthly salary ($1,676) for white employees.

  • Slope Interpretation:

    • A slope of -265 indicates a decrease in predicted salary as one moves from non-minority to minority status.

    • Interpretation using prediction:

    • If \(x = 1\) (minority), the predicted monthly salary is $1,411:

      \[ \tilde{y} = 1676 - 265(1) = 1411 \]

    • This indicates minority employees earn $265 less per month compared to non-minorities.
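
As a quick check of the arithmetic, the fitted equation can be written as a function. The constant and function names are hypothetical; the coefficients are the rounded values used in these notes.

```python
INTERCEPT = 1676.0  # predicted monthly salary when x = 0 (non-minority)
SLOPE = -265.0      # change in predicted salary when x goes from 0 to 1


def predicted_salary(is_minority):
    """Predicted monthly salary from the fitted equation y = 1676 - 265x."""
    x = 1 if is_minority else 0  # dummy coding of minority status
    return INTERCEPT + SLOPE * x


print(predicted_salary(False))                           # 1676.0
print(predicted_salary(True))                            # 1411.0
print(predicted_salary(True) - predicted_salary(False))  # -265.0
```

With a 0/1 predictor, the slope is exactly the difference between the two group predictions, which is why it can be read directly as "$265 less per month".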


Additional Notes
  • t-test for Y-Intercept:

    • Tests the null hypothesis \(H_0: \text{Intercept} = 0\). Generally not meaningful unless \(x = 0\) is within the data range.

  • Two-Tailed p-Values:

    • Regression output reports two-tailed p-values by default. Always compare the p-value with \(\alpha\), and check that the sign of the slope matches the direction stated in the hypothesis.

  • ANOVA and t-test for Slope:

    • When there is a single predictor (\(x\)), the relationship \(F = t^2\) holds.

    • For simple regression, reliance on the t-test of the slope is preferred over ANOVA.

  • Confidence Intervals for b Weights:

    • These provide a range of values within which the true population slope (or intercept) is estimated to lie with a certain level of confidence (e.g., 95%). They are crucial for understanding the precision of the estimated regression coefficients. A narrower interval suggests a more precise estimate.

    • It is important to distinguish them from prediction intervals, which estimate the range for an individual \(y\) value, and confidence intervals for the mean response, which estimate the range for the average \(y\) value at a specific \(x\).

  • Reporting Suggestions:

    • When presenting regression results, streamline outputs for clarity by focusing on key information. This includes the overall model significance (\(F\)-statistic and its \(p\)-value), the coefficient of determination (\(R^2\)), and particularly, the estimated regression coefficients (\(b\) weights), their standard errors, \(t\)-statistics, and \(p\)-values. Visualizations (e.g., scatterplots with regression line) are often more impactful than raw statistical tables. Avoid overwhelming the audience with irrelevant statistics or direct software output.
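
Two of the notes above, \(F = t^2\) and confidence intervals for b weights, can be checked numerically. This is a sketch under stated assumptions: the small dataset is invented, the function names are hypothetical, and the interval uses a normal (large-sample) critical value because the notes do not give the sample size. Regression software would use a t critical value with \(n - 2\) degrees of freedom, which gives a slightly wider interval.

```python
import math
from statistics import NormalDist


def slope_t_and_f(x, y):
    """Return (t, F) for a simple regression: t for the slope, F from ANOVA."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # error SS
    ssr = sum(((b0 + b1 * xi) - y_bar) ** 2 for xi in x)           # regression SS
    mse = sse / (n - 2)
    t = b1 / math.sqrt(mse / sxx)
    f = (ssr / 1) / mse  # one predictor, so regression df = 1
    return t, f


def approx_slope_ci(b1, se_b1, confidence=0.95):
    """Approximate CI for the slope using a normal critical value (assumption)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return b1 - z * se_b1, b1 + z * se_b1


# Hypothetical data, for illustration only
t, f = slope_t_and_f([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(math.isclose(t ** 2, f, rel_tol=1e-9))  # True: F equals t squared

# Slope and standard error quoted earlier in these notes
low, high = approx_slope_ci(-265.26, 73.50)
print(low, high)  # both bounds are negative, so the interval excludes 0
```

An interval that excludes 0 agrees with the earlier rejection of \(H_0: \beta_1 = 0\).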


Forecasting Models
  • Caution in Predictions: Extrapolating predictions beyond the data range is risky. The relationship may follow a different pattern outside the observed data, so extrapolated predictions can lead to incorrect conclusions.

    • Example:

    • When forecasting house prices from square footage, the price relationship may differ substantially at sizes larger than those observed, so predictions there are unreliable.
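
One simple safeguard is to record the range of \(x\) used to fit the model and flag any prediction outside it. The sketch below is illustrative only: the function name, the coefficients, and the square-footage range are all hypothetical and do not come from a fitted model.

```python
import warnings


def predict_price(sqft, intercept, slope, sqft_min, sqft_max):
    """Linear prediction that warns when sqft lies outside the fitted range."""
    if not (sqft_min <= sqft <= sqft_max):
        warnings.warn(
            f"sqft={sqft} is outside the fitted range "
            f"[{sqft_min}, {sqft_max}]; prediction is an extrapolation",
            stacklevel=2,
        )
    return intercept + slope * sqft


# Hypothetical model: price = 50,000 + 120 * sqft, fitted on homes of 800 to 3,000 sq ft
print(predict_price(2000, 50_000, 120, 800, 3000))  # inside the range: 290000
print(predict_price(6000, 50_000, 120, 800, 3000))  # emits a warning, still returns a number
```

The function still returns a value for the 6,000 sq ft home; the warning only signals that the linear pattern has not been verified at that size.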