Definition: Simple linear regression is a statistical model used to describe the relationship between two numerical (quantitative) variables.
Variables: In this model, both variables are expressed as numbers, enabling separate summary statistics and graphs to understand their patterns.
Developing a Linear Regression Model
Purpose: To explore the correlation and build a model that reflects the relationship between the two variables, represented by linear equations.
Example: Data on the tax efficiency of stock portfolios, using energy securities as an illustration, visualized with a scatter plot.
Regression Equation
Model Formula: The basic equation of a simple linear regression model can be expressed as:
y = \beta_0 + \beta_1 x
Where:
\beta_0 (intercept) is the value of y when x = 0,
\beta_1 (slope) reflects the change in y for a one-unit change in x.
Estimating Parameters
Estimation Method: Parameters \beta_0 and \beta_1 are estimated using the method of least squares.
Purpose of Least Squares: Minimize the sum of squared differences between observed values and predicted values (errors) from the regression line, specifically the vertical distances from data points to the line.
Let:
e_i = y_i - \hat{y}_i denote the residual for observation i (the vertical distance from the data point to the line),
The vertical distances are minimized by minimizing the sum of their squares, \sum e_i^2.
Formulas for Estimation
Estimators for Parameters:
b_1 = \frac{S_{xy}}{S_{xx}}
b_0 = \bar{y} - b_1 \bar{x}
Where:
S_{xy} = sum of the product of deviations of x and y,
S_{xx} = sum of squares of deviations of x,
\bar{x} and \bar{y} = mean values of x and y.
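The estimators above can be sketched directly in code. This is a minimal illustration with made-up data (the x and y values are not from the document's example); it computes S_{xy}, S_{xx}, and the resulting b_1 and b_0 by hand.

```python
# Least-squares estimation by hand on illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# S_xy: sum of products of deviations of x and y
# S_xx: sum of squared deviations of x
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
S_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = S_xy / S_xx          # slope estimate
b0 = y_bar - b1 * x_bar   # intercept estimate
print(b0, b1)
```

For this data, S_{xy} = 19.9 and S_{xx} = 10, giving a slope of 1.99 and an intercept of 0.05.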
Computational Procedures
Procedure: Steps to estimate coefficients:
Write down columns of x and y.
Compute the products of corresponding x and y values and the squares of the x values.
Use statistical software or Excel to obtain the intermediate sums needed for the estimates.
Interpretation of Results
Output from Regression Software: Common outputs across software like R, Python, Minitab, etc.
Regression equation
R-squared value
Coefficients and their p-values for hypothesis testing.
Terminology
Explanatory Variable (Predictor): The independent variable, or predictor variable (x).
Response Variable (Dependent): The dependent variable influenced by x (y).
Multiple Regression: Refers to models with more than one explanatory variable; multivariate regression, by contrast, considers multiple response variables.
Strength and Direction of Relationships
Strength: Indicated by the size of the coefficients and their statistical significance (p-value).
Direction: Determined by the sign of the slope: a positive slope implies a positive relationship, a negative slope implies a negative relationship.
Statistical Significance of Model
P-Value: A p-value smaller than a designated alpha level (commonly 0.05) indicates significant relationships between variables.
Evaluating Linear Models
Model Assessment: Using R-squared and residual analysis to determine the goodness of fit for the model.
R-squared gives the proportion of variance in the dependent variable that is explained by the independent variable.
Value range: 0 to 1; higher is better.
Residual Analysis
Definition: Residuals are differences between observed (actual) and predicted (fitted) values.
Purpose: Evaluate whether the model is appropriate:
Residuals should be random, approximately normal, and centered around zero.
Patterns may indicate model inadequacies (e.g., nonlinearity).
Common Issues with Simple Linear Regression
Extrapolation: Predictions made outside the range of observed data can lead to unreliable estimates.
Outliers: Data points that lie far from the others can skew results and influence model fit significantly.
Nonlinear Relationships: Simple linear regression does not capture nonlinear relationships between variables.
Correlation vs Causation: Just because two variables are correlated does not imply that one causes the other.
Correlation Coefficient and Determination
Definitions:
Correlation coefficient (r): Ranges from -1 to +1, indicating the strength and direction (negative or positive) of a linear relationship.
Coefficient of determination (R^2): Represents the proportion of explained variance, closer to 1 indicating a better model fit.
Summary of ANOVA in Regression
ANOVA Table: Used to assess the significance of the regression model as a whole.
The overall null hypothesis is that all slope coefficients are zero; rejecting it indicates that at least one regression coefficient is non-zero.
Components include variation sources: regression, error, and total.
Conclusion
Simple linear regression is a crucial tool for understanding relationships between quantitative variables, yet caution must be exercised regarding assumptions and interpretations.