Regression Part 1

Simple Linear Regression
  1. Type of Data:

    • Requires 1 quantitative dependent variable (denoted as y)

    • Requires 1 quantitative or dichotomous predictor variable (denoted as x).

    • If the dependent variable (y) is dichotomous, use logistic (logit) regression instead.

  2. Purpose/Use:

    1. Predict the value of y based on the value of x.

    2. Understand the relationship between x and y.

  3. Regression Equation:

    • General form: y = b_0 + b_1 x

    • Here, the slope b_1 can be expressed with the formula b_1 = r_{xy}\frac{s_y}{s_x}, where:

      • r_{xy}: correlation coefficient between x and y

      • s_y: standard deviation of y

      • s_x: standard deviation of x

  4. Example:

    • Predicting sales (y) based on advertising spend (x).

    • This regression analysis helps quantify how strongly sales respond to different levels of advertising spend.

  5. Additional Information:

    • Similar to correlation, but regression yields an equation for prediction of y.

    • It includes a significance test of the slope; in simple regression this is equivalent to testing whether the correlation is significant.

    • Types of Regression Models:

      • Population Equation: y = \beta_0 + \beta_1 x + \varepsilon (includes population parameters).

      • Sample Equation: \tilde{y} = b_0 + b_1 x (uses sample estimates).

      • The error term (denoted \varepsilon) represents the distance between actual and predicted values.

      • The error term does not vanish from the model; it remains in the analysis, capturing individual deviations from the line that x does not explain (see the simulation sketch below).
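  • To make the population-vs.-sample distinction concrete, here is a minimal Python/NumPy simulation sketch; the parameter values (\beta_0 = 2, \beta_1 = 0.5) and the noise level are illustrative assumptions, not from the notes:

    import numpy as np

    rng = np.random.default_rng(42)

    # Population parameters (illustrative assumptions)
    beta0, beta1 = 2.0, 0.5

    # Population model: y = beta0 + beta1*x + error
    x = rng.uniform(0, 10, size=140)
    error = rng.normal(0, 1.5, size=140)      # individual deviations
    y = beta0 + beta1 * x + error

    # Sample equation: estimate b0, b1 from the data
    b1, b0 = np.polyfit(x, y, 1)              # returns [slope, intercept]
    print(f"population: beta0={beta0}, beta1={beta1}")
    print(f"sample:     b0={b0:.3f}, b1={b1:.3f}")    # close, but not equal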

The Regression Line with r > 0
  • Model Estimation:

    • Find the b weights that make the residuals (y - \tilde{y}) as small as possible overall; they cannot all equal zero, but for the OLS line they sum to zero.

    • Use Ordinary Least Squares (OLS) to minimize the sum of squared errors, \sum (y - \tilde{y})^2; exactly one regression line meets this criterion (a numerical check follows below).

  • Understanding Residuals:

    • Each point’s residual is the difference between observed and predicted values.

    • OLS estimates the best-fitting line that minimizes the sum of squared residuals.
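  • As a numerical check of the OLS criterion, this sketch (simulated data, Python/NumPy assumed) compares the sum of squared residuals for the OLS line against slightly perturbed lines, and verifies that the OLS residuals sum to (numerically) zero:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 50)
    y = 3 + 0.8 * x + rng.normal(0, 2, 50)

    b1, b0 = np.polyfit(x, y, 1)               # OLS estimates

    def sse(b0_, b1_):
        resid = y - (b0_ + b1_ * x)            # residuals y - y_tilde
        return np.sum(resid ** 2)

    print("SSE at OLS line:   ", sse(b0, b1))
    print("SSE, slope nudged: ", sse(b0, b1 + 0.1))          # larger
    print("SSE, intercept off:", sse(b0 + 0.5, b1))          # larger
    print("sum of residuals:  ", np.sum(y - (b0 + b1 * x)))  # ~ 0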

Interpretation of the Regression Line
  1. Conditional Mean of y:

    • The regression line depicts the conditional means of y given values of x.

  2. Average Values:

    • If x is at a particular value, the regression calculates the average of y values occurring at that x.

    • The line assumes that, at each value of x, the observations average out to the point on the line; the relationship holds on average even though individual observations vary around it.

  3. Linearity:

    • Assumes that the relationship between x and y is linear.

    • The observed conditional means may deviate from the regression line because of sampling error (the sketch below compares the two).
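  • To see the conditional-mean interpretation numerically, this sketch (simulated data) uses a discrete x so the mean of y at each x can be computed directly and compared with the fitted line:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.integers(1, 6, size=500).astype(float)   # x takes values 1..5
    y = 10 + 2 * x + rng.normal(0, 3, size=500)

    b1, b0 = np.polyfit(x, y, 1)

    # Compare the observed mean of y at each x with the line's prediction
    for xv in sorted(set(x)):
        cond_mean = y[x == xv].mean()
        print(f"x={xv:.0f}: mean(y)={cond_mean:6.2f}  line={b0 + b1 * xv:6.2f}")
    # The small gaps between the two columns reflect sampling error.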

The Regression Line with r = 0
  1. Behavior of the Line:

    • When r=0, predictions yield the same y value across all x, resulting in a flat line.

    • Here, \tilde{y} = \bar{y} implies that knowing x provides no predictive value about y.

  2. Implications:

    • Even for different values of x, the mean value of y remains unchanged, so the intercept b_0 equals the mean of y (from b_0 = \bar{y} - b_1\bar{x} with b_1 = 0); a quick demonstration follows.
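  • A quick sketch of the r = 0 case, generating y independently of x so the true correlation is zero (values are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, 1000)
    y = rng.normal(50, 5, 1000)            # y does not depend on x

    r = np.corrcoef(x, y)[0, 1]
    b1, b0 = np.polyfit(x, y, 1)

    print(f"r  = {r:.3f}")                 # near 0
    print(f"b1 = {b1:.3f}")                # near 0 -> flat line
    print(f"b0 = {b0:.2f}, mean(y) = {y.mean():.2f}")  # b0 ~ y-bar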

Simple Regression Equation
  1. Equations:

    • The established regression equation: \tilde{y} = b_0 + b_1 x

    • Slope (b_1):

    1. b_1 = r_{xy}\frac{s_y}{s_x}: indicates the expected change in y for a one-unit change in x.

    • y-intercept (b_0):

    1. b_0 = \bar{y} - b_1\bar{x}: signifies where the line crosses the y-axis (both formulas are checked in the sketch after this list).

    2. Important to note if x=0 is relevant to the dataset.
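  • A minimal sketch (simulated data) computing b_1 and b_0 directly from the correlation, standard deviations, and means, and checking the result against np.polyfit:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, 100)
    y = 5 + 1.5 * x + rng.normal(0, 2, 100)

    r_xy = np.corrcoef(x, y)[0, 1]
    b1 = r_xy * y.std(ddof=1) / x.std(ddof=1)   # b1 = r_xy * s_y / s_x
    b0 = y.mean() - b1 * x.mean()               # b0 = y-bar - b1 * x-bar

    slope, intercept = np.polyfit(x, y, 1)
    print(f"formula: b1={b1:.4f}, b0={b0:.4f}")
    print(f"polyfit: b1={slope:.4f}, b0={intercept:.4f}")   # identical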

Interpreting Regression Outputs
  1. Key Questions:

    • How much of the variance in y does the model explain?

    • What is the regression equation along with the respective b values?

    • Is the slope significantly different from zero?

      • If yes, x significantly predicts y and the correlation is significant.

      • If no, x does not predict y, and correlation is not significant.

    • How should the findings and the resulting equation be interpreted?

  2. Final Note on Interpretation:

    • If the answer to the significance question is no, there is little point in further interpretation (the sketch below shows how to obtain these answers in practice).
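  • One practical way to answer these questions is scipy.stats.linregress, which returns the slope, intercept, correlation, and a two-sided p-value for the test that the slope is zero (data simulated for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 10, 140)
    y = 20 + 0.6 * x + rng.normal(0, 4, 140)

    res = stats.linregress(x, y)
    print(f"equation : y~ = {res.intercept:.2f} + {res.slope:.2f}x")
    print(f"R^2      : {res.rvalue ** 2:.3f}")  # variance in y explained
    print(f"p (slope): {res.pvalue:.4g}")       # < .05 -> slope differs from 0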

Application Example of Regression
  • Study Case: Salary Differences based on Minority Status

    1. Examining whether salaries differ between minority and non-minority employees, i.e., possible salary discrimination.

    2. The scatterplot suggests a downward trend in salary, but regression analysis is needed to assess whether the relationship is significant.

  • Statistical Output Understanding:

    1. R Square (R²):

    • Indicates the proportion of variance in salary explained by the minority status variable, calculated at 8.6% for the model.

    • Common misinterpretation: R² is not the percentage of correct salary predictions; it is the proportion of variance in salary explained by the regression.

    2. Adjusted R²:

    • Adjusted for the number of predictor variables, it gives a more accurate estimate of the proportion of variance explained.

    • Formula: \text{Adj } R^2 = 1 - \frac{SS_E/(n-K-1)}{SS_T/(n-1)}.

    • Where:

      • SS_E: Sum of Squares Error (or Residual Sum of Squares)

      • SS_T: Total Sum of Squares

      • n: Number of observations (sample size)

      • K: Number of predictor variables (for Simple Linear Regression, K=1)

    • Important for guarding against overstating model fit when additional predictor variables are included (a short computation follows below).
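  • A minimal sketch of the adjusted R² formula applied to a fitted simple regression (K = 1 predictor; data simulated for illustration):

    import numpy as np

    rng = np.random.default_rng(5)
    n, K = 140, 1                               # sample size, predictors
    x = rng.uniform(0, 10, n)
    y = 30 + 0.4 * x + rng.normal(0, 5, n)

    b1, b0 = np.polyfit(x, y, 1)
    y_tilde = b0 + b1 * x

    sse = np.sum((y - y_tilde) ** 2)            # SS_E: residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)           # SS_T: total sum of squares

    r2 = 1 - sse / sst
    adj_r2 = 1 - (sse / (n - K - 1)) / (sst / (n - 1))
    print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")  # adj. slightly lower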

Sample Size Consideration
  1. General Guideline:

    • About 10 data points are recommended per predictor variable.

    • Our case analyzes 140 observations with one predictor, signifying adequate sample size.

    • Note: With only two data points, R² will always equal 1 because a line fits them perfectly; sampling variation across more observations is needed for the regression to generalize (a quick verification follows).
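  • The two-point note is easy to verify: any line through two distinct points has zero residuals, so R² = 1 regardless of the underlying relationship (the values below are arbitrary):

    import numpy as np

    x = np.array([1.0, 2.0])
    y = np.array([7.3, 4.1])                    # any two points

    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"residuals = {resid},  R^2 = {r2:.1f}")  # residuals ~ 0, R^2 = 1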

Assumptions of Simple Linear Regression
  1. Linearity: The relationship between the independent variable (x) and the mean of the dependent variable (y) is linear.

  2. Independence of Errors: The errors (residuals) are independent of each other. This is especially important in time series data.

  3. Homoscedasticity: The variance of the errors is constant across all levels of the independent variable (i.e., no pattern in residuals vs. predicted values).

  4. Normality of Errors: The errors are normally distributed. This assumption is more important for smaller sample sizes when constructing confidence intervals and performing hypothesis tests (can be checked with a Q-Q plot of residuals).
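  • A sketch of standard residual diagnostics for these assumptions, assuming Python with matplotlib and scipy: the residuals-vs.-fitted plot checks linearity and homoscedasticity, and the Q-Q plot checks normality of errors (data simulated for illustration):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(6)
    x = rng.uniform(0, 10, 140)
    y = 15 + 0.9 * x + rng.normal(0, 3, 140)

    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Residuals vs. predicted: look for no pattern (linearity)
    # and a constant vertical spread (homoscedasticity).
    ax1.scatter(b0 + b1 * x, resid, alpha=0.6)
    ax1.axhline(0, color="red")
    ax1.set(xlabel="predicted", ylabel="residual", title="Residuals vs. fitted")

    # Q-Q plot: points near the line suggest normally distributed errors.
    stats.probplot(resid, dist="norm", plot=ax2)

    plt.tight_layout()
    plt.show()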

Conclusion
  • Understanding regression through these steps enables a basic but thorough analysis of empirical relationships, supporting informed business decisions about predictions and the insights drawn from statistical analyses.