Definition: Simple linear regression is a statistical model used to describe the relationship between two numerical (quantitative) variables.
Variables: In this model, both variables are expressed as numbers, enabling separate summary statistics and graphs to understand their patterns.
Developing a Linear Regression Model
Purpose: To explore the correlation and build a model that reflects the relationship between the two variables, represented by linear equations.
Example: Data on the tax efficiency of stock portfolios, using energy securities as an illustration, visualized with a scatter plot.
Regression Equation
Model Formula: The basic equation of a simple linear regression model can be expressed as:
y = \beta_0 + \beta_1 x
Where:
\beta_0 (intercept) is the value of y when x = 0,
\beta_1 (slope) reflects the change in y for a one-unit change in x.
Estimating Parameters
Estimation Method: Parameters \beta_0 and \beta_1 are estimated using the method of least squares.
Purpose of Least Squares: Minimize the sum of squared differences between observed values and predicted values (errors) from the regression line, specifically the vertical distances from data points to the line.
Let:
e_i = y_i - \hat{y}_i denote the residual for observation i (the vertical distance from the data point to the line),
The vertical distances are minimized by minimizing the sum of their squares, \sum e_i^2.
Formulas for Estimation
Estimators for Parameters:
b_1 = \frac{S_{xy}}{S_{xx}}
b_0 = \bar{y} - b_1 \bar{x}
Where:
S_{xy} = sum of the product of deviations of x and y,
S_{xx} = sum of squares of deviations of x,
\bar{x} and \bar{y} = mean values of x and y.
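The estimators above can be sketched directly in code. This is a minimal illustration with made-up data (the x and y values are not from the document's example); it computes S_{xy}, S_{xx}, and the resulting b_1 and b_0 by hand.

```python
# Least-squares estimation by hand on illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# S_xy: sum of products of deviations of x and y
# S_xx: sum of squared deviations of x
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
S_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = S_xy / S_xx          # slope estimate
b0 = y_bar - b1 * x_bar   # intercept estimate
print(b0, b1)
```

For this data, S_{xy} = 19.9 and S_{xx} = 10, giving a slope of 1.99 and an intercept of 0.05.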
Computational Procedures
Procedure: Steps to estimate coefficients:
Write down columns of x and y.
Compute the products of corresponding x and y values and the squares of the x values.
Use statistical software or Excel to obtain the intermediate sums needed for the estimates.
Interpretation of Results
Output from Regression Software: Common outputs across software like R, Python, Minitab, etc.
Regression equation
R-squared value
Coefficients and their p-values for hypothesis testing.
Terminology
Explanatory Variable (Predictor): The independent variable, or predictor variable (x).
Response Variable (Dependent): The dependent variable influenced by x (y).
Multiple Regression: Refers to models with more than one explanatory variable; multivariate regression, by contrast, considers multiple response variables.
Strength and Direction of Relationships
Strength: Indicated by the size of the coefficients and their statistical significance (p-value).
Direction: Determined by the sign of the slope: a positive slope implies a positive relationship, a negative slope implies a negative relationship.
Statistical Significance of Model
P-Value: A p-value smaller than a designated alpha level (commonly 0.05) indicates significant relationships between variables.
Evaluating Linear Models
Model Assessment: Using R-squared and residual analysis to determine the goodness of fit for the model.
R-squared gives the proportion of variance in the dependent variable that is explained by the independent variable.
Value range: 0 to 1; higher is better.
Residual Analysis
Definition: Residuals are differences between observed (actual) and predicted (fitted) values.
Purpose: Evaluate whether the model is appropriate:
Residuals should be random, approximately normal, and centered around zero.
Patterns may indicate model inadequacies (e.g., nonlinearity).
Common Issues with Simple Linear Regression
Extrapolation: Predictions made outside the range of observed data can lead to unreliable estimates.
Outliers: Data points that lie far from the others can skew results and influence model fit significantly.
Nonlinear Relationships: Simple linear regression does not capture nonlinear relationships between variables.
Correlation vs Causation: Just because two variables are correlated does not imply that one causes the other.
Correlation Coefficient and Determination
Definitions:
Correlation coefficient (r): Ranges from -1 to +1, indicating the strength and direction (negative or positive) of a linear relationship.
Coefficient of determination (R^2): Represents the proportion of explained variance, closer to 1 indicating a better model fit.
Summary of ANOVA in Regression
ANOVA Table: Used to assess the significance of the regression model as a whole.
The overall null hypothesis is that all slope coefficients are zero; rejecting it indicates that at least one regression coefficient is non-zero.
Components include variation sources: regression, error, and total.
Conclusion
Simple linear regression is a crucial tool for understanding relationships between quantitative variables, yet caution must be exercised regarding assumptions and interpretations.