Regression Part 1
Simple Linear Regression
Type of Data:
Requires 1 quantitative dependent variable (denoted as y)
Requires 1 quantitative or dichotomous predictor variable (denoted as x).
If the dependent variable (y) is dichotomous, use Logit regression instead.
Purpose/Use:
Predict the value of y based on the value of x.
Understand the relationship between x and y.
Regression Equation:
General form: y = b_0 + b_1x
Here, the slope b_1 can be mathematically expressed with the formula b_1 = r_{xy}\frac{s_y}{s_x}, where:
r_{xy}: correlation coefficient between x and y
s_y: standard deviation of y
s_x: standard deviation of x
Example:
Predicting sales (y) based on advertising spend (x).
This regression analysis quantifies how much sales change across different levels of advertising spend.
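A minimal sketch of this example in Python, assuming made-up advertising and sales figures; scipy.stats.linregress returns the fitted intercept (b_0) and slope (b_1):

```python
# A minimal sketch (hypothetical numbers): fitting sales on advertising
# spend with scipy.stats.linregress; the data values are made up.
from scipy import stats

ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]   # x, e.g. advertising in $1000s
sales = [2.1, 3.9, 6.2, 7.8, 10.1]     # y, e.g. sales in $1000s

fit = stats.linregress(ad_spend, sales)
print(f"b0 = {fit.intercept:.3f}, b1 = {fit.slope:.3f}")

# Predict sales at a new advertising level, e.g. x = 6
print("predicted sales at x = 6:", fit.intercept + fit.slope * 6)
```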
Additional Information:
Similar to correlation, but regression yields an equation for prediction of y.
It includes a significance test of the slope, which is equivalent to testing whether the correlation is significant.
Types of Regression Models:
Population Equation: y = \beta_0 + \beta_1 x + \varepsilon (includes population parameters).
Sample Equation: \tilde{y} = b_0 + b_1x (uses sample estimates).
The error (denoted \varepsilon) represents the distance between predicted and actual values.
The error term does not vanish from the model; it remains in the analysis as individual deviations from the line, although its expected value is zero.
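A minimal simulation sketch of the population equation, with arbitrary choices for \beta_0, \beta_1, and the noise scale; the sample estimates b_0 and b_1 approximately recover the population parameters:

```python
# A minimal sketch: simulating the population model y = beta0 + beta1*x + error.
# beta0, beta1, and the noise level are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1 = 2.0, 0.5

x = rng.uniform(0, 10, size=200)
error = rng.normal(loc=0.0, scale=1.0, size=200)  # E[error] = 0
y = beta0 + beta1 * x + error

# Sample estimates b0, b1 approximate the population parameters
b1, b0 = np.polyfit(x, y, deg=1)
print(f"b0 = {b0:.3f} (beta0 = {beta0}), b1 = {b1:.3f} (beta1 = {beta1})")
```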
The Regression Line with r > 0
Model Estimation:
Find the b weights such that the residuals (y - \tilde{y}) sum to zero and are collectively as small as possible.
Use Ordinary Least Squares (OLS) to minimize the sum of squared errors, \sum (y - \tilde{y})^2; exactly one regression line satisfies this criterion.
Understanding Residuals:
Each point’s residual is the difference between observed and predicted values.
OLS estimates the best-fitting line that minimizes the sum of squared residuals.
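A minimal sketch of these OLS properties with illustrative data: the residuals of the fitted line sum to (numerically) zero, and perturbing the slope increases the sum of squared errors:

```python
# A minimal sketch: OLS residuals sum to ~0, and any other line has a larger
# sum of squared residuals than the OLS fit. Data values are illustrative.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2])

b1, b0 = np.polyfit(x, y, deg=1)          # OLS estimates
residuals = y - (b0 + b1 * x)

print("sum of residuals:", residuals.sum())   # ~0 up to rounding
print("SSE (OLS):", (residuals ** 2).sum())

# Nudging the slope always increases the sum of squared errors
worse = y - (b0 + (b1 + 0.1) * x)
print("SSE (perturbed):", (worse ** 2).sum())
```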
Interpretation of the Regression Line
Conditional Mean of y:
The regression line depicts the conditional means of y given values of x.
Average Values:
For a particular value of x, the regression line gives the estimated average (conditional mean) of the y values occurring at that x.
The line assumes that the observations at a given x average out at that point; individual observations may still vary around it.
Linearity:
Assumes that the relationship between x and y is linear.
Due to sampling error, the observed conditional averages may deviate somewhat from the regression line.
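A minimal sketch of the conditional-mean interpretation, using made-up data with repeated x values; the fitted line lands close to, but not exactly on, each group average:

```python
# A minimal sketch: the regression line tracks the conditional means of y
# at each x. Data with repeated x values are made up for illustration.
import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=float)
y = np.array([2.0, 2.4, 2.2, 3.1, 2.9, 3.3, 4.0, 4.2, 3.8])

b1, b0 = np.polyfit(x, y, deg=1)

for xv in np.unique(x):
    cond_mean = y[x == xv].mean()   # average of y at this x
    fitted = b0 + b1 * xv           # point on the regression line
    print(f"x={xv:.0f}: mean(y)={cond_mean:.2f}, line={fitted:.2f}")
```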
The Regression Line with r = 0
Behavior of the Line:
When r=0, predictions yield the same y value across all x, resulting in a flat line.
Here, \tilde{y} = \bar{y} implies that knowing x provides no predictive value about y.
Implications:
Across all values of x, the mean value of y remains unchanged, so the intercept b_0 equals the mean of y.
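A minimal sketch of the r = 0 case with simulated noise unrelated to x; the fitted slope is near zero and the intercept is near the mean of y:

```python
# A minimal sketch: when r ~ 0, the fitted slope is ~0 and the prediction
# for every x collapses to the mean of y. The noise data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)
y = rng.normal(loc=5.0, scale=1.0, size=500)   # y unrelated to x

b1, b0 = np.polyfit(x, y, deg=1)
print(f"b1 = {b1:.4f} (near 0), b0 = {b0:.3f}, mean(y) = {y.mean():.3f}")
```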
Simple Regression Equation
Equations:
The established regression equation: \tilde{y} = b_0 + b_1x
Slope (b_1):
b_1 = r_{xy}\frac{s_y}{s_x}, indicates how much y changes, on average, for a one-unit change in x.
y-intercept (b_0):
b_0 = \bar{y} - b_1 \bar{x}, signifies where the line crosses the y-axis.
Its interpretation is only meaningful if x = 0 is a plausible value in the dataset.
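A minimal sketch verifying both formulas against a library fit, with illustrative data:

```python
# A minimal sketch: computing b1 and b0 directly from the formulas and
# checking them against scipy's fit. Data values are made up.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2])

r_xy = np.corrcoef(x, y)[0, 1]
b1 = r_xy * y.std(ddof=1) / x.std(ddof=1)   # b1 = r_xy * s_y / s_x
b0 = y.mean() - b1 * x.mean()               # b0 = y_bar - b1 * x_bar

fit = stats.linregress(x, y)
print(f"formula: b0={b0:.4f}, b1={b1:.4f}")
print(f"library: b0={fit.intercept:.4f}, b1={fit.slope:.4f}")
```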
Interpreting Regression Outputs
Key Questions:
How much of the variance in y does the model explain?
What is the regression equation along with the respective b values?
Is the slope significantly different from zero?
If yes, x significantly predicts y and the correlation is significant.
If no, x does not predict y and the correlation is not significant.
How to interpret the findings and subsequent equations?
Final Note on Interpretation:
If the slope is not significant, there is no point in interpreting the equation further.
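A minimal sketch of the slope test with illustrative data; scipy.stats.linregress reports the p-value for the null hypothesis that the slope is zero:

```python
# A minimal sketch: testing whether the slope differs significantly from
# zero via the p-value from scipy.stats.linregress. Data are illustrative.
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

fit = stats.linregress(x, y)
print(f"b1 = {fit.slope:.3f}, p-value = {fit.pvalue:.4f}")

if fit.pvalue < 0.05:
    print("slope differs from zero: x significantly predicts y")
else:
    print("no significant slope: stop interpreting the equation")
```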
Application Example of Regression
Study Case: Salary Differences based on Minority Status
Examining the relationship regarding possible salary discrimination against minorities vs. non-minorities.
The scatterplot indicates a downward trend in salary for minorities, but regression analysis is needed to test whether the trend is significant.
Statistical Output Understanding:
R Square (R²):
Indicates the proportion of variance in salary explained by the minority status variable, calculated at 8.6% for the model.
Misinterpretation to avoid: R² is not the percentage of correct salary predictions; it is the proportion of variance in salary explained by the regression.
Adjusted R²:
Adjusted for the number of predictor variables, it offers a more accurate estimate of the proportion of variance explained.
Formula: \text{Adj } R^2 = 1 - \frac{SS_E/(n-K-1)}{SS_T/(n-1)}
Where:
SS_E: Sum of Squares Error (or Residual Sum of Squares)
SS_T: Total Sum of Squares
n: Number of observations (sample size)
K: Number of predictor variables (for Simple Linear Regression, K=1)
This penalizes the addition of predictors, guarding against an inflated impression of fit when more variables are included.
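A minimal sketch of this formula, using the section's figures (R² = 0.086, n = 140, K = 1) as inputs; the equivalent form 1 - (1 - R^2)\frac{n-1}{n-K-1} follows algebraically because SS_E/SS_T = 1 - R^2:

```python
# A minimal sketch: adjusted R^2 from the formula above, using the
# section's figures (R^2 = 0.086, n = 140, K = 1) as assumed inputs.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    # Adj R^2 = 1 - (SSE/(n-K-1)) / (SST/(n-1)) = 1 - (1-R^2)*(n-1)/(n-K-1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"Adj R^2 = {adjusted_r2(0.086, n=140, k=1):.4f}")  # ~0.0794
```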
Sample Size Consideration
General Guideline:
About 10 data points are recommended per predictor variable.
Our case analyzes 140 observations with one predictor, signifying adequate sample size.
Note: with only two data points, R² always equals 1 because a line fits them perfectly; a meaningful regression needs enough observations to capture sampling variation and generalize.
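A minimal sketch of the two-point case with arbitrary values; the squared correlation, and hence R², is exactly 1:

```python
# A minimal sketch: with only two data points a line fits perfectly, so
# R^2 (the squared correlation) is exactly 1 regardless of the values.
from scipy import stats

fit = stats.linregress([1.0, 2.0], [3.0, 7.5])  # arbitrary two points
print("R^2 =", fit.rvalue ** 2)                  # 1.0
```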
Assumptions of Simple Linear Regression
Linearity: The relationship between the independent variable (x) and the mean of the dependent variable (y) is linear.
Independence of Errors: The errors (residuals) are independent of each other. This is especially important in time series data.
Homoscedasticity: The variance of the errors is constant across all levels of the independent variable (i.e., no pattern in residuals vs. predicted values).
Normality of Errors: The errors are normally distributed. This assumption is more important for smaller sample sizes when constructing confidence intervals and performing hypothesis tests (can be checked with a Q-Q plot of residuals).
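A minimal sketch of two common visual checks, homoscedasticity (residuals vs. fitted values) and normality of errors (Q-Q plot), using simulated data:

```python
# A minimal sketch: visual checks of homoscedasticity and normality of
# errors via residual and Q-Q plots. Data values are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2 + 0.5 * x + rng.normal(0, 1, size=100)

b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(fitted, residuals)        # should show no pattern (homoscedastic)
ax1.axhline(0, color="gray")
ax1.set(xlabel="fitted values", ylabel="residuals")

stats.probplot(residuals, plot=ax2)   # Q-Q plot: points near the line = normal
plt.tight_layout()
plt.show()
```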
Conclusion
Understanding regression through these steps enables a basic but thorough analysis of empirical data relationships, supporting informed business decisions based on predictions and the insights gained from statistical analysis.