regression
Simple Linear Regression Overview
- Definition: Simple linear regression is a statistical model used to describe the relationship between two numerical (quantitative) variables.
- Variables: In this model, both variables are expressed as numbers, so each can first be examined on its own with summary statistics and graphs to understand its pattern.
Developing a Linear Regression Model
- Purpose: To explore the correlation and build a model that reflects the relationship between the two variables, represented by linear equations.
- Example: Data on tax efficiency and stock portfolios, using energy securities as a proxy, displayed in a scatter plot.
Regression Equation
- Model Formula: The basic equation of a simple linear regression model can be expressed as: $y = \beta_0 + \beta_1 x + \varepsilon$
- Where:
- $\beta_0$ (intercept) is the value of y when x = 0,
- $\beta_1$ (slope) reflects the change in y for a one-unit change in x,
- $\varepsilon$ is the random error term.
Estimating Parameters
- Estimation Method: The intercept $\beta_0$ and the slope $\beta_1$ are estimated using the method of least squares.
- Purpose of Least Squares: Minimize the sum of squared differences between observed values and predicted values (errors) from the regression line, specifically the vertical distances from data points to the line.
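- Let:
- $y_i$ represent the observed values and $\hat{y}_i$ the values predicted by the line at $x_i$,
- The vertical distance from each point to the line is the error $y_i - \hat{y}_i$; least squares chooses the line that minimizes the sum of squared errors, $\sum_i (y_i - \hat{y}_i)^2$.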
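As a rough illustration of the least-squares criterion, the sketch below computes the sum of squared vertical distances for two candidate lines. The data values, the helper name `sum_squared_errors`, and the second candidate line are all made up for this example; the first candidate is the least-squares line for this particular data set.

```python
# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def sum_squared_errors(intercept, slope, xs, ys):
    """Sum of squared vertical distances between observed and predicted y."""
    total = 0.0
    for xi, yi in zip(xs, ys):
        predicted = intercept + slope * xi
        total += (yi - predicted) ** 2
    return total


# Candidate 1 is the least-squares line for this data; candidate 2 is arbitrary.
sse_least_squares = sum_squared_errors(2.2, 0.6, x, y)
sse_other_line = sum_squared_errors(2.0, 0.7, x, y)

print(f"least-squares line: SSE = {sse_least_squares:.2f}")
print(f"other line:         SSE = {sse_other_line:.2f}")
```

Any other choice of intercept and slope gives a sum of squared errors at least as large as that of the least-squares line.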
Formulas for Estimation
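- Estimators for Parameters: $\hat{\beta}_1 = S_{xy} / S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
- Where:
- $S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$ = sum of the product of deviations of x and y,
- $S_{xx} = \sum_i (x_i - \bar{x})^2$ = sum of squares of deviations of x,
- $\bar{x}$ and $\bar{y}$ = mean values of x and y.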
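The estimators can be sketched directly from their definitions: the slope is the sum of products of deviations of x and y divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. The data and variable names below are assumptions chosen for illustration.

```python
# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Sum of products of deviations of x and y.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
# Sum of squared deviations of x.
s_xx = sum((xi - x_mean) ** 2 for xi in x)

slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```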
Computational Procedures
- Procedure: Steps to estimate coefficients:
- Write down columns of x and y.
- Compute the product of each pair of corresponding x and y values and the square of each x value.
- Use statistical software or Excel for the calculations to obtain the intermediate sums needed for the estimates.
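The column-based procedure above can be sketched as a small worksheet. The data is made up, and the code uses the algebraically equivalent shortcut form of the estimators based on raw column sums, which is an assumption about how the sums are meant to be combined rather than something the notes spell out.

```python
# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Worksheet columns: x, y, x*y, x^2.
rows = [(xi, yi, xi * yi, xi * xi) for xi, yi in zip(x, y)]

print(f"{'x':>4} {'y':>4} {'xy':>4} {'x^2':>4}")
for xi, yi, xy, x2 in rows:
    print(f"{xi:>4} {yi:>4} {xy:>4} {x2:>4}")

# Column sums.
n = len(rows)
sum_x = sum(r[0] for r in rows)
sum_y = sum(r[1] for r in rows)
sum_xy = sum(r[2] for r in rows)
sum_x2 = sum(r[3] for r in rows)

# Shortcut form of the least-squares estimators using only the sums.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n

print(f"sums: x={sum_x}, y={sum_y}, xy={sum_xy}, x^2={sum_x2}")
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

In a spreadsheet, the same four columns and their totals would be built with cell formulas instead of list comprehensions.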
Interpretation of Results
- Output from Regression Software: Common outputs across software such as R, Python, and Minitab include:
- Regression equation
- R-squared value
- Coefficients and their p-values for hypothesis testing.
Terminology
- Explanatory Variable (Predictor): The independent variable, or predictor variable (x).
- Response Variable (Dependent): The dependent variable influenced by x (y).
- Multiple Regression: Refers to models with more than one explanatory variable, whereas multivariate regression refers to models with more than one response variable.
Strength and Direction of Relationships
- Strength: Indicated by the size of the coefficients and their statistical significance (p-value); note that the size of a coefficient depends on the units in which x and y are measured.
- Direction: Determined by the sign of the estimated slope coefficient:
- Positive slope implies a positive relationship, negative slope implies a negative relationship.
Statistical Significance of Model
- P-Value: A p-value smaller than the chosen significance level alpha (commonly 0.05) indicates a statistically significant relationship between the variables.
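A minimal sketch of the slope test, under stated assumptions: the data is made up, and because the Python standard library has no t distribution, the code compares the t statistic with a tabulated critical value instead of computing a p-value. For a two-sided test, the p-value is below alpha exactly when the absolute t statistic exceeds that critical value.

```python
import math

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Residual variance estimate uses n - 2 degrees of freedom.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

# Standard error of the slope and the t statistic for H0: slope = 0.
slope_se = math.sqrt(mse / s_xx)
t_stat = slope / slope_se

# Two-sided critical value for alpha = 0.05 with 3 degrees of freedom,
# taken from a standard t table (valid only for this sample size).
T_CRITICAL = 3.182
significant = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f}, significant at 0.05: {significant}")
```

With only five made-up points, the slope is not significant at the 0.05 level even though the fitted line slopes upward, which shows why the test matters alongside the estimate itself.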
Evaluating Linear Models
- Model Assessment: Using R-squared and residual analysis to determine the goodness of fit for the model.
- R-squared is the proportion of variance in the dependent variable that is explained by the independent variable.
- Value range: 0 to 1; higher is better.
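One common way to compute R-squared is as one minus the ratio of unexplained variation to total variation. The sketch below applies that formula to made-up data; the helper name `fit_line` is hypothetical.

```python
# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_mean) ** 2 for xi in xs)
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope


intercept, slope = fit_line(x, y)
y_mean = sum(y) / len(y)

# Unexplained variation: squared distances from points to the fitted line.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
# Total variation: squared distances from points to the mean of y.
sst = sum((yi - y_mean) ** 2 for yi in y)

r_squared = 1 - sse / sst
print(f"R-squared = {r_squared:.2f}")
```

Here the line accounts for about 60% of the variation in y for this illustrative data set.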
Residual Analysis
- Definition: Residuals are differences between observed (actual) and predicted (fitted) values.
- Purpose: Evaluate whether the model is appropriate:
- Residuals should be randomly scattered around zero and approximately normally distributed.
- Patterns may indicate model inadequacies (e.g., nonlinearity).
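To illustrate how a residual pattern can reveal nonlinearity, the sketch below fits a straight line to made-up data that actually follows a curve (y equals x squared). The residuals sum to zero, as least-squares residuals always do when an intercept is included, but their signs form a U shape rather than a random scatter.

```python
# Hypothetical curved data (y = x squared), made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Residual = observed value minus fitted value.
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for xi, ri in zip(x, residuals):
    sign = "+" if ri > 0 else "-"
    print(f"x = {xi}: residual = {ri:5.1f}  ({sign})")

print(f"sum of residuals = {sum(residuals):.1f}")
```

Positive residuals at both ends and negative residuals in the middle suggest that a straight line is not an adequate model for this data.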
Common Issues with Simple Linear Regression
- Extrapolation: Predictions made outside the range of observed data can lead to unreliable estimates.
- Outliers: Data points that lie far from the others can skew results and influence model fit significantly.
- Nonlinear Relationships: Simple linear regression does not capture nonlinear relationships between variables.
- Correlation vs Causation: Just because two variables are correlated does not imply that one causes the other.
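The influence of an outlier can be sketched by fitting the same line with and without one extreme point. Both the data and the added point (6, 20) are made up for illustration, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_mean) ** 2 for xi in xs)
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope


# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Same data with one extreme point appended.
x_out = x + [6]
y_out = y + [20]

intercept_clean, slope_clean = fit_line(x, y)
intercept_out, slope_out = fit_line(x_out, y_out)

print(f"without outlier: slope = {slope_clean:.2f}")
print(f"with outlier:    slope = {slope_out:.2f}")
```

A single point more than quadruples the slope in this example, so it is worth checking a scatter plot before trusting the fitted coefficients.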
Correlation Coefficient and Determination
- Definitions:
- Correlation coefficient (r): Ranges from -1 to +1, indicating the strength and direction (negative or positive) of a linear relationship.
- Coefficient of determination (R^2): Represents the proportion of explained variance, with values closer to 1 indicating a better model fit; in simple linear regression it equals the square of r.
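As a small check on how the two quantities relate, the sketch below computes r from sums of deviations and compares its square with R^2 obtained from the fitted line. The data is made up, and the variable names are assumptions for this example.

```python
import math

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_yy = sum((yi - y_mean) ** 2 for yi in y)

# Correlation coefficient from the sums of deviations.
r = s_xy / math.sqrt(s_xx * s_yy)

# R^2 from the fitted line: 1 - (unexplained variation / total variation).
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - sse / s_yy

print(f"r = {r:.4f}, r^2 = {r * r:.4f}, R^2 = {r_squared:.4f}")
```

The sign of r matches the sign of the slope, and with a single predictor the squared correlation agrees with R^2 up to floating-point rounding.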
Summary of ANOVA in Regression
- ANOVA Table: Used to assess the significance of the regression model as a whole.
- The overall null hypothesis is that every slope coefficient is zero; rejecting it indicates that at least one regression coefficient is non-zero.
- Components include variation sources: regression, error, and total.
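The three sources of variation can be laid out as a small ANOVA table. The sketch below uses made-up data and computes the F statistic only; turning F into a p-value needs the F distribution, which is outside the Python standard library, so that step is left out here.

```python
# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean
fitted = [intercept + slope * xi for xi in x]

# Sums of squares for each source of variation.
ss_regression = sum((fi - y_mean) ** 2 for fi in fitted)
ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_total = sum((yi - y_mean) ** 2 for yi in y)

# Degrees of freedom with one predictor: 1, n - 2, and n - 1.
df_regression, df_error, df_total = 1, n - 2, n - 1

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error
f_stat = ms_regression / ms_error

print(f"{'Source':<12}{'df':>4}{'SS':>8}{'MS':>8}{'F':>8}")
print(f"{'Regression':<12}{df_regression:>4}{ss_regression:>8.2f}"
      f"{ms_regression:>8.2f}{f_stat:>8.2f}")
print(f"{'Error':<12}{df_error:>4}{ss_error:>8.2f}{ms_error:>8.2f}")
print(f"{'Total':<12}{df_total:>4}{ss_total:>8.2f}")
```

The regression and error sums of squares add up to the total, and with a single predictor the F statistic equals the square of the t statistic for the slope.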
Conclusion
- Simple linear regression is a crucial tool for understanding relationships between quantitative variables, yet caution must be exercised regarding assumptions and interpretations.