Regression

Simple Linear Regression Overview

  • Definition: Simple linear regression is a statistical model used to describe the relationship between two numerical (quantitative) variables.
  • Variables: Both variables are recorded as numbers, so each can be summarized and graphed separately to understand its pattern before modeling the relationship.

Developing a Linear Regression Model

  • Purpose: To explore the correlation between the two variables and build a model of their relationship, represented by a linear equation.
  • Example: A dataset on the tax efficiency of stock portfolios, with energy securities as the predictor variable, visualized with a scatter plot.

Regression Equation

  • Model Formula: The basic equation of a simple linear regression model can be expressed as: y = \beta_0 + \beta_1 x + \varepsilon
    • Where:
    • \beta_0 (intercept) is the expected value of y when x = 0,
    • \beta_1 (slope) is the expected change in y for a one-unit increase in x,
    • \varepsilon is the random error term.

Estimating Parameters

  • Estimation Method: The parameters \beta_0 and \beta_1 are estimated using the method of least squares.
  • Purpose of Least Squares: Minimize the sum of squared differences (errors) between observed values and the values predicted by the regression line, specifically the vertical distances from the data points to the line.
    • Let:
    • y_i denote the observed values and \hat{y}_i the predicted (fitted) values,
    • The line is chosen to minimize \sum (y_i - \hat{y}_i)^2, the sum of squared vertical distances.
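
The least-squares criterion can be written out explicitly; here y_i are the observed responses and b_0 + b_1 x_i are the fitted values:

```latex
\min_{b_0,\, b_1} \; \sum_{i=1}^{n} \bigl( y_i - (b_0 + b_1 x_i) \bigr)^2
```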

Formulas for Estimation

  • Estimators for Parameters:
    • b_1 = \frac{S_{xy}}{S_{xx}}
    • b_0 = \bar{y} - b_1 \bar{x}
  • Where:
    • S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}), the sum of products of the deviations of x and y,
    • S_{xx} = \sum (x_i - \bar{x})^2, the sum of squared deviations of x,
    • \bar{x} and \bar{y} = the sample means of x and y.
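
The estimators above can be sketched in plain Python; the five (x, y) pairs below are a made-up toy dataset used purely for illustration:

```python
# Least-squares estimates for simple linear regression, computed
# directly from the deviation sums S_xy and S_xx.
x = [1, 2, 3, 4, 5]  # illustrative data
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# S_xy = sum of products of deviations; S_xx = sum of squared x deviations
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx         # slope
b0 = y_bar - b1 * x_bar  # intercept

print(b1, b0)  # 0.6 2.2
```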

Computational Procedures

  • Procedure: Steps to estimate coefficients:
    1. Write down columns of x and y.
    2. Compute the products of corresponding x and y values and the squares of the x values.
    3. Use statistical software or Excel to sum these columns, obtaining the totals needed for the estimators.
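
As one illustration of delegating the arithmetic to software, NumPy's `polyfit` performs the least-squares fit directly (the data here are the same kind of made-up toy values):

```python
import numpy as np

# np.polyfit with degree 1 fits y = b1*x + b0 by least squares.
x = np.array([1, 2, 3, 4, 5], dtype=float)  # illustrative data
y = np.array([2, 4, 5, 4, 5], dtype=float)

b1, b0 = np.polyfit(x, y, deg=1)  # returns highest-degree coefficient first
print(round(float(b1), 4), round(float(b0), 4))  # 0.6 2.2
```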

Interpretation of Results

  • Output from Regression Software: Common outputs across software like R, Python, Minitab, etc.
    • Regression equation
    • R-squared value
    • Coefficients and their p-values for hypothesis testing.
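
If SciPy is available, `scipy.stats.linregress` returns these standard outputs in one call; the slope's p-value tests the null hypothesis that the slope is zero (toy data, illustrative only):

```python
from scipy import stats

# linregress reports slope, intercept, correlation coefficient r,
# and the two-sided p-value for H0: slope = 0.
x = [1, 2, 3, 4, 5]  # illustrative data
y = [2, 4, 5, 4, 5]

result = stats.linregress(x, y)
print(f"y = {result.intercept:.2f} + {result.slope:.2f} x")
print(f"R-squared = {result.rvalue ** 2:.2f}, p-value = {result.pvalue:.3f}")
```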

Terminology

  • Explanatory Variable (Predictor): The independent variable, or predictor variable (x).
  • Response Variable (Dependent): The dependent variable influenced by x (y).
  • Multiple Regression: Refers to models with more than one explanatory variable; multivariate regression considers multiple response variables.

Strength and Direction of Relationships

  • Strength: Indicated by the size of the coefficients and their statistical significance (p-value).
  • Direction: Determined by the sign of the slope:
    • A positive slope implies a positive relationship; a negative slope implies a negative relationship.

Statistical Significance of Model

  • P-Value: A p-value smaller than a designated alpha level (commonly 0.05) indicates significant relationships between variables.

Evaluating Linear Models

  • Model Assessment: Using R-squared and residual analysis to determine the goodness of fit for the model.
    • R-squared is the proportion of the variance in the dependent variable that is explained by the independent variable.
    • Value range: 0 to 1; higher is better.
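
R-squared can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares; the sketch below assumes a line \hat{y} = 2.2 + 0.6x fitted by least squares to made-up data:

```python
# R-squared = 1 - SSE/SST for an assumed fitted line y_hat = 2.2 + 0.6 x.
x = [1, 2, 3, 4, 5]  # illustrative data
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6    # least-squares estimates for this toy data

y_bar = sum(y) / len(y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)                       # total sum of squares

r_squared = 1 - sse / sst
print(round(r_squared, 4))  # 0.6
```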

Residual Analysis

  • Definition: Residuals are differences between observed (actual) and predicted (fitted) values.
  • Purpose: Evaluate whether the model is appropriate:
    • Residuals should be random, roughly normally distributed, and centered around zero.
    • Patterns may indicate model inadequacies (e.g., nonlinearity).
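
A minimal sketch of computing residuals, again assuming a toy dataset and its least-squares line \hat{y} = 2.2 + 0.6x; a least-squares fit with an intercept forces the residuals to sum to zero, and any systematic pattern in their sequence hints at model inadequacy:

```python
# Residuals e_i = y_i - y_hat_i for an assumed fitted line.
x = [1, 2, 3, 4, 5]  # illustrative data
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6    # least-squares estimates for this toy data

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
# Their sum is ~0 by construction; inspect the sequence for patterns.
print([round(e, 2) for e in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
```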

Common Issues with Simple Linear Regression

  1. Extrapolation: Predictions made outside the range of observed data can lead to unreliable estimates.
  2. Outliers: Data points that lie far from the others can skew results and influence model fit significantly.
  3. Nonlinear Relationships: Simple linear regression does not capture nonlinear relationships between variables.
  4. Correlation vs Causation: Just because two variables are correlated does not imply that one causes the other.

Correlation Coefficient and Determination

  • Definitions:
    • Correlation coefficient (r): Ranges from -1 to +1, measuring the strength and direction (negative or positive) of a linear relationship.
    • Coefficient of determination (R^2): Represents the proportion of explained variance, closer to 1 indicating a better model fit.
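
For simple linear regression, R^2 is exactly the square of r; the sketch below checks this on made-up data using r = S_{xy} / \sqrt{S_{xx} S_{yy}}:

```python
import math

# In simple linear regression, the coefficient of determination equals r**2.
x = [1, 2, 3, 4, 5]  # illustrative data
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)  # correlation coefficient
print(round(r, 4), round(r ** 2, 4))  # 0.7746 0.6
```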

Summary of ANOVA in Regression

  • ANOVA Table: Used to assess the significance of the regression model as a whole.
    • The overall null hypothesis is that all regression coefficients (in simple regression, the slope) are zero; the alternative is that at least one coefficient is non-zero.
    • Components include variation sources: regression, error, and total.
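
The ANOVA decomposition SST = SSR + SSE and the resulting F statistic can be sketched as follows, again assuming a toy dataset and its least-squares line \hat{y} = 2.2 + 0.6x (degrees of freedom 1 and n - 2):

```python
# ANOVA decomposition for simple linear regression on illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6    # least-squares estimates for this toy data

n = len(x)
y_bar = sum(y) / n
fitted = [b0 + b1 * xi for xi in x]

ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # regression sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares

# F statistic for H0: slope = 0, with df = (1, n - 2)
f_stat = (ssr / 1) / (sse / (n - 2))
print(round(ssr, 2), round(sse, 2), round(sst, 2), round(f_stat, 2))  # 3.6 2.4 6.0 4.5
```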

Conclusion

  • Simple linear regression is a crucial tool for understanding relationships between quantitative variables, yet caution must be exercised regarding assumptions and interpretations.