Linear Regression

Simple Linear Regression

  • Definition: Linear regression models a dependent variable $y$ as a linear function of a set of independent variables $x = (x_1, \dots, x_r)$, where $r$ is the number of predictors.
  • Assumption: A linear relationship exists between $x$ and $y$.

Problem Formulation

  • Regression Equation: $y = b_0 + b_1 x_1 + \dots + b_r x_r + \epsilon$, where:
    • $b_0, b_1, \dots, b_r$ are regression coefficients.
    • $\epsilon$ is a random error.

Estimating Regression Coefficients

  • Linear regression calculates estimators of the regression coefficients (predicted weights), denoted $\hat{b}_0, \hat{b}_1, \dots, \hat{b}_r$.
  • Estimated Regression Function: $\hat{f}(x) = \hat{b}_0 + \hat{b}_1 x_1 + \dots + \hat{b}_r x_r$. This function captures the dependencies between the inputs and the output.
  • The estimated (predicted) response $\hat{f}(x_i)$ for each observation $i = 1, \dots, n$ should be as close as possible to the actual response $y_i$.

Residuals

  • Definition: The differences between the actual and predicted responses: $y_i - \hat{f}(x_i)$ for all observations $i = 1, \dots, n$.
  • Regression aims to determine the best predicted weights, that is, those that make the residuals as small as possible.

Ordinary Least Squares

  • Method: Minimizes the sum of squared residuals (SSR) over all observations $i = 1, \dots, n$.
  • Formula: $SSR = \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$.
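To make the least squares criterion concrete, here is a small NumPy sketch (reusing the sample data from the Python example later in this document) that computes the SSR of the fitted line and checks that perturbing the fitted weights only increases it:

```python
import numpy as np

# Same sample data as the Python example later in this document
x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

def ssr(b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

# np.polyfit with degree 1 returns [slope, intercept] of the least squares line
b1_hat, b0_hat = np.polyfit(x, y, 1)

print(ssr(b0_hat, b1_hat))      # SSR at the least squares solution
print(ssr(b0_hat + 1, b1_hat))  # any other weights give a larger SSR
```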

Coefficient of Determination ($R^2$)

  • Indicates the amount of variation in $y$ that can be explained by the dependence on $x$ using the regression model.
  • A larger $R^2$ indicates a better fit.
  • $R^2 = 1$ corresponds to $SSR = 0$, representing a perfect fit.

Slope-Intercept Form

  • General Form: $y = mx + b$, where:
    • $m$ = slope of the line
    • $b$ = y-intercept of the line
  • Statistical Form: $\hat{y} = b_0 + b_1 x$, where:
    • $\hat{y}$ = the predicted value of $y$
    • $b_0$ = the population y-intercept
    • $b_1$ = the population slope

Specific Dependent Variable Value

$y_i = b_0 + b_1 x_i + \epsilon_i$, where:

  • $x_i$ = the value of the independent variable for the $i^{th}$ observation
  • $y_i$ = the value of the dependent variable for the $i^{th}$ observation
  • $\epsilon_i$ = the error of prediction for the $i^{th}$ observation
  • $b_0 + b_1 x_i$ is the deterministic portion.
  • $b_0 + b_1 x_i + \epsilon_i$ is the probabilistic model.
  • In a deterministic model, all points are assumed to lie on the line, so $\epsilon$ is zero in every case.

Sample Data

  • Regression analyses use sample data, not population data.
  • Population parameters ($b_0$ and $b_1$) are estimated using sample statistics ($\hat{b}_0$ and $\hat{b}_1$).
  • Equation of the Simple Regression Line: $\hat{y} = \hat{b}_0 + \hat{b}_1 x$

Least Squares Analysis

  • The process used to determine the values of $\hat{b}_0$ and $\hat{b}_1$.
  • A regression model is developed by minimizing the sum of the squared error values.
  • The least squares regression line is the line that results in the smallest sum of squared errors.

Slope of the Regression Line

  • Numerator of the slope expression: $S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
  • Denominator of the slope expression: $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$
  • Formula for the slope: $\hat{b}_1 = \frac{S_{xy}}{S_{xx}}$

Y-Intercept of the Regression Line

  • Intercept formula: $\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}$
  • Data needed from the sample to compute the slope and intercept (unless sample means are used): $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, and $n$
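As a sketch of these computations (assuming NumPy, and reusing the sample data from the Python example later in this document), the slope can be obtained either from the deviation sums $S_{xy}$ and $S_{xx}$ or from the raw sums, and the intercept follows from the sample means:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)
n = len(x)

# Deviation form: S_xy / S_xx
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_xx = np.sum((x - x.mean()) ** 2)
b1 = s_xy / s_xx                # slope
b0 = y.mean() - b1 * x.mean()   # intercept: b0 = y-bar - b1 * x-bar

# Raw-sum form using sum(x), sum(y), sum(xy), sum(x^2), and n
b1_sums = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)

print(b1, b0)  # slope 0.54, intercept approximately 5.633
```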

Residual Analysis

  • A researcher tests whether a regression line is a good fit for the data by running historical data through the model.
  • Values of the independent variable (x values) are inserted into the regression model and a predicted value is obtained for each x value.
  • Compares predicted values to actual y values to determine error.
  • Each difference between the actual y values and the predicted values is the error (residual).
  • The sum of squares of these residuals is minimized to find the least squares line.

Residuals Sum

  • The sum of the residuals is approximately zero, because the least squares line is placed geometrically through the middle of all the points.
  • The vertical distances from the points to the line cancel each other out and sum to zero.
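This property is easy to verify numerically; a minimal sketch with NumPy and the sample data used later in this document:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

b1, b0 = np.polyfit(x, y, 1)    # least squares slope and intercept
residuals = y - (b0 + b1 * x)

# The positive and negative residuals cancel: the sum is zero
# up to floating-point rounding error
print(residuals.sum())
```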

Outliers

  • Outliers are data points that lie apart from the rest of the points.
  • Outliers can produce residuals with large magnitudes.
  • Outliers can be the result of mis-recorded or miscoded data, or they may simply be data points that do not conform to the general trend.

Standard Error of the Estimate

  • An alternative way of examining the error of the model, which provides a single measurement of the regression error.

Sum of Squares of Error

  • Because the sum of the residuals is zero, squaring the residuals and then summing them is necessary to determine the total amount of error.

Standard Error of the Estimate (se)

  • A standard deviation of the error of the regression model.
  • More useful than SSE because it is expressed in the same units as the dependent variable.
  • Formula: $s_e = \sqrt{\frac{SSE}{n-2}}$
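A quick sketch of this formula (with NumPy and the sample data from the Python example later in this document):

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, 1)              # least squares fit
sse = np.sum((y - (b0 + b1 * x)) ** 2)    # sum of squares of error

s_e = np.sqrt(sse / (n - 2))              # standard error of the estimate
print(s_e)  # same units as y
```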

Coefficient of Determination ($r^2$)

  • A widely used measure of fit for regression models.

  • The proportion of variability of the dependent variable (y) accounted for or explained by the independent variable (x).

  • Ranges from 0 to 1.

  • The variation of the dependent variable $y$ is measured by the total sum of squares ($SS_{yy}$), the sum of the squared deviations of the $y$ values from the mean of $y$.

  • This variation can be broken into two additive parts: the explained variation, measured by the sum of squares of regression ($SSR$), and the unexplained variation, measured by the sum of squares of error ($SSE$).

  • Relationship: $SS_{yy} = SSR + SSE$

  • Dividing each term in the equation by $SS_{yy}$ gives: $1 = \frac{SSR}{SS_{yy}} + \frac{SSE}{SS_{yy}}$

  • The term $r^2$ is the proportion of the $y$ variability that is explained by the regression model.

  • Substituting into the preceding relationship gives: $r^2 = \frac{SSR}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$

  • Computational Formula for $r^2$:
    $r^2 = \frac{SSR}{SS_{yy}}$
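The decomposition $SS_{yy} = SSR + SSE$ and both forms of $r^2$ can be checked numerically; a sketch using NumPy and the sample data from the Python example later in this document:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

b1, b0 = np.polyfit(x, y, 1)           # least squares fit
y_hat = b0 + b1 * x

ss_yy = np.sum((y - y.mean()) ** 2)    # total variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation

# SS_yy = SSR + SSE, so the two r^2 formulas agree
print(ssr / ss_yy, 1 - sse / ss_yy)
```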

R-squared and Adjusted R-squared:

  • The R-squared ($R^2$) ranges from 0 to 1.
  • It represents the proportion of information in the data that can be explained by the model.
  • The adjusted R-squared adjusts for the degrees of freedom.
  • $R^2$ measures how well the model fits the data.
  • For a simple linear regression, $R^2$ is the square of the Pearson correlation coefficient.
  • A high value of $R^2$ generally indicates a good fit.
  • The value of $R^2$ tends to increase as more predictors are added to the model, even when they add little information.
  • Consider the adjusted R-squared, which penalizes $R^2$ for a higher number of predictors.
  • An (adjusted) $R^2$ close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model.
  • A number near 0 indicates that the regression model did not explain much of the variability in the outcome.
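The adjusted R-squared formula is not spelled out above; one common definition is $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$, where $n$ is the number of observations and $p$ the number of predictors. A sketch with scikit-learn, using the sample data from the Python example below:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

model = LinearRegression().fit(x, y)
r2 = model.score(x, y)

n, p = x.shape  # n observations, p predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2, adj_r2)  # the adjustment lowers r2 here, since p >= 1
```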

Python Implementation

Step 1: Import Packages and Classes
  • Import NumPy, LinearRegression from sklearn.linear_model, and matplotlib.pyplot.
  • Use NumPy for array manipulation and LinearRegression for performing linear regression.
  • matplotlib.pyplot is used for visualizing the data and regression line.
Step 2: Provide Data
  • Define the input (regressors, x) and output (response, y) as arrays or similar objects.
  • Reshape the input array x to be two-dimensional with one column and as many rows as necessary using .reshape((-1, 1)).
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
print(x)
print(y)

plt.scatter(x, y)
plt.show()
Step 3: Create a Model and Fit it
  • Create an instance of the LinearRegression class.
model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None)
  • Parameters:
    • fit_intercept: Boolean, decides whether to calculate the intercept (default: True).
    • copy_X: Boolean, decides whether to copy the input variables (default: True).
    • n_jobs: Integer or None, the number of jobs used in parallel computation (default: None).
  • Note: the normalize parameter has been removed from recent versions of scikit-learn; if scaling is needed, standardize the inputs beforehand (for example with sklearn.preprocessing.StandardScaler).
  • Use .fit() to calculate the optimal values of the weights using the input and output data.
model.fit(x, y)
Step 4: Get Results
  • Obtain the coefficient of determination ($R^2$) using .score().
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
  • Get the intercept and coefficients.
print(f"intercept: {model.intercept_}")
print(f"coefficients: {model.coef_}")
  • The intercept ($b_0$) represents the predicted response when x is zero.
  • The coefficient ($b_1$) represents the change in the predicted response when x is increased by one.
Step 5: Predict Response
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")
  • Equivalently, the predictions can be computed manually as model.intercept_ + model.coef_ * x (this yields a two-dimensional array, since x was reshaped).
Step 6: Visualize the Regression Line
plt.scatter(x, y)
plt.plot(x, y_pred)
plt.show()
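As a cross-check on the scikit-learn results above, SciPy (if installed) offers scipy.stats.linregress, which returns the slope, intercept, and correlation coefficient of a simple linear regression in a single call:

```python
import numpy as np
from scipy import stats

x = np.array([5, 15, 25, 35, 45, 55])  # 1-D: linregress needs no reshape
y = np.array([5, 20, 14, 32, 22, 38])

result = stats.linregress(x, y)
print(result.slope, result.intercept)  # match model.coef_ and model.intercept_
print(result.rvalue ** 2)              # matches model.score(x, y)
```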