Regression Part 1
Simple Linear Regression
Type of Data:
Requires 1 quantitative dependent variable (denoted as y)
Requires 1 quantitative or dichotomous predictor variable (denoted as x).
If the dependent variable (y) is dichotomous, use Logit regression instead.
Purpose/Use:
Predict the value of y based on the value of x.
Understand the relationship between x and y.
Regression Equation:
General form: y = b_0 + b_1x
Here, the slope b_1 can be mathematically expressed with the formula b_1 = r_{xy}\frac{s_y}{s_x}, where:
r_{xy}: correlation coefficient between x and y
s_y: standard deviation of y
s_x: standard deviation of x
Example:
Predicting sales (y) based on advertising spend (x).
This regression analysis quantifies how much sales change across different levels of advertising spend.
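A minimal sketch of this example in Python, assuming made-up advertising and sales figures; scipy.stats.linregress returns the fitted intercept (b_0) and slope (b_1):

```python
# A minimal sketch (hypothetical numbers): fitting sales on advertising
# spend with scipy.stats.linregress; the data values are made up.
from scipy import stats

ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]   # x, e.g. advertising in $1000s
sales = [2.1, 3.9, 6.2, 7.8, 10.1]     # y, e.g. sales in $1000s

fit = stats.linregress(ad_spend, sales)
print(f"b0 = {fit.intercept:.3f}, b1 = {fit.slope:.3f}")

# Predict sales at a new advertising level, e.g. x = 6
print("predicted sales at x = 6:", fit.intercept + fit.slope * 6)
```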
Additional Information:
Similar to correlation, but regression yields an equation for prediction of y.
It includes a significance test of the slope, which is equivalent to testing whether the correlation is significant.
Types of Regression Models:
Population Equation: y = \beta_0 + \beta_1 x + \varepsilon (includes population parameters).
Sample Equation: \tilde{y} = b_0 + b_1x (uses sample estimates).
The error (denoted \varepsilon) represents the distance between predicted and actual values.
The error term does not vanish from the model; it remains in the analysis as individual deviations from the line, although its expected value is zero.
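A minimal simulation sketch of the population equation, with arbitrary choices for \beta_0, \beta_1, and the noise scale; the sample estimates b_0 and b_1 approximately recover the population parameters:

```python
# A minimal sketch: simulating the population model y = beta0 + beta1*x + error.
# beta0, beta1, and the noise level are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1 = 2.0, 0.5

x = rng.uniform(0, 10, size=200)
error = rng.normal(loc=0.0, scale=1.0, size=200)  # E[error] = 0
y = beta0 + beta1 * x + error

# Sample estimates b0, b1 approximate the population parameters
b1, b0 = np.polyfit(x, y, deg=1)
print(f"b0 = {b0:.3f} (beta0 = {beta0}), b1 = {b1:.3f} (beta1 = {beta1})")
```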
The Regression Line with r > 0
Model Estimation:
Find the b weights such that the residuals (y - \tilde{y}) sum to zero and are collectively as small as possible.
Use Ordinary Least Squares (OLS) to minimize the sum of squared errors, \sum (y - \tilde{y})^2; exactly one regression line satisfies this criterion.
Understanding Residuals:
Each point’s residual is the difference between observed and predicted values.
OLS estimates the best-fitting line that minimizes the sum of squared residuals.
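A minimal sketch of these OLS properties with illustrative data: the residuals of the fitted line sum to (numerically) zero, and perturbing the slope increases the sum of squared errors:

```python
# A minimal sketch: OLS residuals sum to ~0, and any other line has a larger
# sum of squared residuals than the OLS fit. Data values are illustrative.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2])

b1, b0 = np.polyfit(x, y, deg=1)          # OLS estimates
residuals = y - (b0 + b1 * x)

print("sum of residuals:", residuals.sum())   # ~0 up to rounding
print("SSE (OLS):", (residuals ** 2).sum())

# Nudging the slope always increases the sum of squared errors
worse = y - (b0 + (b1 + 0.1) * x)
print("SSE (perturbed):", (worse ** 2).sum())
```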
Interpretation of the Regression Line
Conditional Mean of y:
The regression line depicts the conditional means of y given values of x.
Average Values:
For a particular value of x, the regression line gives the estimated average (conditional mean) of the y values occurring at that x.
The line assumes that the observations at a given x average out at that point; individual observations may still vary around it.
Linearity:
Assumes that the relationship between x and y is linear.
Due to sampling error, the observed conditional averages may deviate somewhat from the regression line.
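A minimal sketch of the conditional-mean interpretation, using made-up data with repeated x values; the fitted line lands close to, but not exactly on, each group average:

```python
# A minimal sketch: the regression line tracks the conditional means of y
# at each x. Data with repeated x values are made up for illustration.
import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=float)
y = np.array([2.0, 2.4, 2.2, 3.1, 2.9, 3.3, 4.0, 4.2, 3.8])

b1, b0 = np.polyfit(x, y, deg=1)

for xv in np.unique(x):
    cond_mean = y[x == xv].mean()   # average of y at this x
    fitted = b0 + b1 * xv           # point on the regression line
    print(f"x={xv:.0f}: mean(y)={cond_mean:.2f}, line={fitted:.2f}")
```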
The Regression Line with r = 0
Behavior of the Line:
When r=0, predictions yield the same y value across all x, resulting in a flat line.
Here, \tilde{y} = \bar{y} implies that knowing x provides no predictive value about y.
Implications:
Across all values of x, the mean value of y remains unchanged, so the intercept b_0 equals the mean of y.
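A minimal sketch of the r = 0 case with simulated noise unrelated to x; the fitted slope is near zero and the intercept is near the mean of y:

```python
# A minimal sketch: when r ~ 0, the fitted slope is ~0 and the prediction
# for every x collapses to the mean of y. The noise data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)
y = rng.normal(loc=5.0, scale=1.0, size=500)   # y unrelated to x

b1, b0 = np.polyfit(x, y, deg=1)
print(f"b1 = {b1:.4f} (near 0), b0 = {b0:.3f}, mean(y) = {y.mean():.3f}")
```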
Simple Regression Equation
Equations:
The established regression equation: \tilde{y} = b_0 + b_1x
Slope (b_1):
b_1 = r_{xy}\frac{s_y}{s_x}, indicates how much y changes, on average, for a one-unit change in x.
y-intercept (b_0):
b_0 = \bar{y} - b_1 \bar{x}, signifies where the line crosses the y-axis.
Its interpretation is only meaningful if x = 0 is a plausible value in the dataset.
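A minimal sketch verifying both formulas against a library fit, with illustrative data:

```python
# A minimal sketch: computing b1 and b0 directly from the formulas and
# checking them against scipy's fit. Data values are made up.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2])

r_xy = np.corrcoef(x, y)[0, 1]
b1 = r_xy * y.std(ddof=1) / x.std(ddof=1)   # b1 = r_xy * s_y / s_x
b0 = y.mean() - b1 * x.mean()               # b0 = y_bar - b1 * x_bar

fit = stats.linregress(x, y)
print(f"formula: b0={b0:.4f}, b1={b1:.4f}")
print(f"library: b0={fit.intercept:.4f}, b1={fit.slope:.4f}")
```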
Interpreting Regression Outputs
Key Questions:
How much of the variance in y does the model explain?
What is the regression equation along with the respective b values?
Is the slope significantly different from zero?
If yes, x significantly predicts y and the correlation is significant.
If no, x does not predict y and the correlation is not significant.
How to interpret the findings and subsequent equations?
Final Note on Interpretation:
If the slope is not significant, there is no point in interpreting the equation further.
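A minimal sketch of the slope test with illustrative data; scipy.stats.linregress reports the p-value for the null hypothesis that the slope is zero:

```python
# A minimal sketch: testing whether the slope differs significantly from
# zero via the p-value from scipy.stats.linregress. Data are illustrative.
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

fit = stats.linregress(x, y)
print(f"b1 = {fit.slope:.3f}, p-value = {fit.pvalue:.4f}")

if fit.pvalue < 0.05:
    print("slope differs from zero: x significantly predicts y")
else:
    print("no significant slope: stop interpreting the equation")
```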
Application Example of Regression
Study Case: Salary Differences based on Minority Status
Examining the relationship regarding possible salary discrimination against minorities vs. non-minorities.
The scatterplot indicates a downward trend in salary for minorities, but regression analysis is needed to test whether the trend is significant.
Statistical Output Understanding:
R Square (R²):
Indicates the proportion of variance in salary explained by the minority status variable, calculated at 8.6% for the model.
Misinterpretation to avoid: R² is not the percentage of correct salary predictions; it is the proportion of variance in salary explained by the regression.
Adjusted R²:
Adjusted for the number of predictor variables, it offers a more accurate estimate of the proportion of variance explained.
Formula: \text{Adj } R^2 = 1 - \frac{SS_E/(n-K-1)}{SS_T/(n-1)}
Where:
SS_E: Sum of Squares Error (or Residual Sum of Squares)
SS_T: Total Sum of Squares
n: Number of observations (sample size)
K: Number of predictor variables (for Simple Linear Regression, K=1)
This penalizes the addition of predictors, guarding against an inflated impression of fit when more variables are included.
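A minimal sketch of this formula, using the section's figures (R² = 0.086, n = 140, K = 1) as inputs; the equivalent form 1 - (1 - R^2)\frac{n-1}{n-K-1} follows algebraically because SS_E/SS_T = 1 - R^2:

```python
# A minimal sketch: adjusted R^2 from the formula above, using the
# section's figures (R^2 = 0.086, n = 140, K = 1) as assumed inputs.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    # Adj R^2 = 1 - (SSE/(n-K-1)) / (SST/(n-1)) = 1 - (1-R^2)*(n-1)/(n-K-1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"Adj R^2 = {adjusted_r2(0.086, n=140, k=1):.4f}")  # ~0.0794
```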
Sample Size Consideration
General Guideline:
About 10 data points are recommended per predictor variable.
Our case analyzes 140 observations with one predictor, signifying adequate sample size.
Note: with only two data points, R² always equals 1 because a line fits them perfectly; a meaningful regression needs enough observations to capture sampling variation and generalize.
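A minimal sketch of the two-point case with arbitrary values; the squared correlation, and hence R², is exactly 1:

```python
# A minimal sketch: with only two data points a line fits perfectly, so
# R^2 (the squared correlation) is exactly 1 regardless of the values.
from scipy import stats

fit = stats.linregress([1.0, 2.0], [3.0, 7.5])  # arbitrary two points
print("R^2 =", fit.rvalue ** 2)                  # 1.0
```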
Assumptions of Simple Linear Regression
Linearity: The relationship between the independent variable (x) and the mean of the dependent variable (y) is linear.
Independence of Errors: The errors (residuals) are independent of each other. This is especially important in time series data.
Homoscedasticity: The variance of the errors is constant across all levels of the independent variable (i.e., no pattern in residuals vs. predicted values).
Normality of Errors: The errors are normally distributed. This assumption is more important for smaller sample sizes when constructing confidence intervals and performing hypothesis tests (can be checked with a Q-Q plot of residuals).
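A minimal sketch of two common visual checks, homoscedasticity (residuals vs. fitted values) and normality of errors (Q-Q plot), using simulated data:

```python
# A minimal sketch: visual checks of homoscedasticity and normality of
# errors via residual and Q-Q plots. Data values are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2 + 0.5 * x + rng.normal(0, 1, size=100)

b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(fitted, residuals)        # should show no pattern (homoscedastic)
ax1.axhline(0, color="gray")
ax1.set(xlabel="fitted values", ylabel="residuals")

stats.probplot(residuals, plot=ax2)   # Q-Q plot: points near the line = normal
plt.tight_layout()
plt.show()
```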
Conclusion
Understanding regression through these steps enables a basic but thorough analysis of empirical data relationships, supporting informed business decisions based on predictions and the insights gained from statistical analysis.