Linear Regression

Simple Linear Regression

  • Definition: Linear regression models a dependent variable $y$ as a linear function of a set of independent variables $x = (x_1, \dots, x_r)$, where $r$ is the number of predictors.
  • Assumption: A linear relationship exists between $x$ and $y$.

Problem Formulation

  • Regression Equation: $y = b_0 + b_1 x_1 + \dots + b_r x_r + \epsilon$, where:
    • $b_0, b_1, \dots, b_r$ are regression coefficients.
    • $\epsilon$ is a random error.

Estimating Regression Coefficients

  • Linear regression calculates estimators of the regression coefficients (predicted weights), denoted $\hat{b}_0, \hat{b}_1, \dots, \hat{b}_r$.
  • Estimated Regression Function: $\hat{f}(x) = \hat{b}_0 + \hat{b}_1 x_1 + \dots + \hat{b}_r x_r$. This function captures the dependencies between the inputs and the output.
  • The estimated (predicted) response $\hat{f}(x_i)$ for each observation $i = 1, \dots, n$ should be as close as possible to the actual response $y_i$.

Residuals

  • Definition: The differences between the actual and predicted responses: $y_i - \hat{f}(x_i)$ for all observations $i = 1, \dots, n$.
  • Regression aims to determine the best predicted weights, that is, those that make the residuals as small as possible.

Ordinary Least Squares

  • Method: Minimizes the sum of squared residuals (SSR) over all observations $i = 1, \dots, n$.
  • Formula: $SSR = \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$.
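To make the least squares criterion concrete, here is a small NumPy sketch (reusing the sample data from the Python example later in this document) that computes the SSR of the fitted line and checks that perturbing the fitted weights only increases it:

```python
import numpy as np

# Same sample data as the Python example later in this document
x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

def ssr(b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

# np.polyfit with degree 1 returns [slope, intercept] of the least squares line
b1_hat, b0_hat = np.polyfit(x, y, 1)

print(ssr(b0_hat, b1_hat))      # SSR at the least squares solution
print(ssr(b0_hat + 1, b1_hat))  # any other weights give a larger SSR
```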

Coefficient of Determination ($R^2$)

  • Indicates the amount of variation in $y$ that can be explained by the dependence on $x$ using the regression model.
  • A larger $R^2$ indicates a better fit.
  • $R^2 = 1$ corresponds to $SSR = 0$, representing a perfect fit.

Slope-Intercept Form

  • General Form: $y = mx + b$, where:
    • $m$ = slope of the line
    • $b$ = y-intercept of the line
  • Statistical Form: $\hat{y} = b_0 + b_1 x$, where:
    • $\hat{y}$ = the predicted value of $y$
    • $b_0$ = the population y-intercept
    • $b_1$ = the population slope

Specific Dependent Variable Value

$y_i = b_0 + b_1 x_i + \epsilon_i$, where:

  • $x_i$ = the value of the independent variable for the $i^{th}$ observation
  • $y_i$ = the value of the dependent variable for the $i^{th}$ observation
  • $\epsilon_i$ = the error of prediction for the $i^{th}$ observation
  • $b_0 + b_1 x_i$ is the deterministic portion.
  • $b_0 + b_1 x_i + \epsilon_i$ is the probabilistic model.
  • In a deterministic model, all points are assumed to lie on the line, so $\epsilon$ is zero in every case.

Sample Data

  • Regression analyses use sample data, not population data.
  • Population parameters ($b_0$ and $b_1$) are estimated using sample statistics ($\hat{b}_0$ and $\hat{b}_1$).
  • Equation of the Simple Regression Line: $\hat{y} = \hat{b}_0 + \hat{b}_1 x$

Least Squares Analysis

  • The process used to determine the values of $\hat{b}_0$ and $\hat{b}_1$.
  • A regression model is developed by minimizing the sum of the squared error values.
  • The least squares regression line is the line that results in the smallest sum of squared errors.

Slope of the Regression Line

  • Numerator of the slope expression: $S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
  • Denominator of the slope expression: $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$
  • Formula for the slope: $\hat{b}_1 = \frac{S_{xy}}{S_{xx}}$

Y-Intercept of the Regression Line

  • Intercept formula: $\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}$
  • Data needed from the sample to compute the slope and intercept (unless sample means are used): $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, and $n$
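As a sketch of these computations (assuming NumPy, and reusing the sample data from the Python example later in this document), the slope can be obtained either from the deviation sums $S_{xy}$ and $S_{xx}$ or from the raw sums, and the intercept follows from the sample means:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)
n = len(x)

# Deviation form: S_xy / S_xx
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_xx = np.sum((x - x.mean()) ** 2)
b1 = s_xy / s_xx                # slope
b0 = y.mean() - b1 * x.mean()   # intercept: b0 = y-bar - b1 * x-bar

# Raw-sum form using sum(x), sum(y), sum(xy), sum(x^2), and n
b1_sums = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)

print(b1, b0)  # slope 0.54, intercept approximately 5.633
```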

Residual Analysis

  • A researcher tests whether a regression line is a good fit for the data by running historical data through the model.
  • Values of the independent variable (x values) are inserted into the regression model and a predicted value is obtained for each x value.
  • Compares predicted values to actual y values to determine error.
  • Each difference between the actual y values and the predicted values is the error (residual).
  • The sum of squares of these residuals is minimized to find the least squares line.

Residuals Sum

  • The sum of the residuals is approximately zero, because the least squares line is placed geometrically through the middle of all the points.
  • The vertical distances from the points to the line cancel each other out and sum to zero.
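This property is easy to verify numerically; a minimal sketch with NumPy and the sample data used later in this document:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

b1, b0 = np.polyfit(x, y, 1)    # least squares slope and intercept
residuals = y - (b0 + b1 * x)

# The positive and negative residuals cancel: the sum is zero
# up to floating-point rounding error
print(residuals.sum())
```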

Outliers

  • Outliers are data points that lie apart from the rest of the points.
  • Outliers can produce residuals with large magnitudes.
  • Outliers can be the result of mis-recorded or miscoded data, or they may simply be data points that do not conform to the general trend.

Standard Error of the Estimate

  • An alternative way of examining the error of the model, which provides a single measurement of the regression error.

Sum of Squares of Error

  • Because the sum of the residuals is zero, squaring the residuals and then summing them is necessary to determine the total amount of error.

Standard Error of the Estimate (se)

  • A standard deviation of the error of the regression model.
  • More useful than SSE because it is expressed in the same units as the dependent variable.
  • Formula: $s_e = \sqrt{\frac{SSE}{n-2}}$
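A quick sketch of this formula (with NumPy and the sample data from the Python example later in this document):

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, 1)              # least squares fit
sse = np.sum((y - (b0 + b1 * x)) ** 2)    # sum of squares of error

s_e = np.sqrt(sse / (n - 2))              # standard error of the estimate
print(s_e)  # same units as y
```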

Coefficient of Determination ($r^2$)

  • A widely used measure of fit for regression models.

  • The proportion of variability of the dependent variable (y) accounted for or explained by the independent variable (x).

  • Ranges from 0 to 1.

  • The variation of the dependent variable $y$ is measured by the total sum of squares ($SS_{yy}$), the sum of the squared deviations of the $y$ values from the mean of $y$.

  • This variation can be broken into two additive parts: the explained variation, measured by the sum of squares of regression ($SSR$), and the unexplained variation, measured by the sum of squares of error ($SSE$).

  • Relationship: $SS_{yy} = SSR + SSE$

  • Dividing each term in the equation by $SS_{yy}$ gives: $1 = \frac{SSR}{SS_{yy}} + \frac{SSE}{SS_{yy}}$

  • The term $r^2$ is the proportion of the $y$ variability that is explained by the regression model.

  • Substituting into the preceding relationship gives: $r^2 = \frac{SSR}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$

  • Computational Formula for $r^2$:
    $r^2 = \frac{SSR}{SS_{yy}}$
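The decomposition $SS_{yy} = SSR + SSE$ and both forms of $r^2$ can be checked numerically; a sketch using NumPy and the sample data from the Python example later in this document:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

b1, b0 = np.polyfit(x, y, 1)           # least squares fit
y_hat = b0 + b1 * x

ss_yy = np.sum((y - y.mean()) ** 2)    # total variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation

# SS_yy = SSR + SSE, so the two r^2 formulas agree
print(ssr / ss_yy, 1 - sse / ss_yy)
```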

R-squared and Adjusted R-squared:

  • The R-squared ($R^2$) ranges from 0 to 1.
  • It represents the proportion of information in the data that can be explained by the model.
  • The adjusted R-squared adjusts for the degrees of freedom.
  • $R^2$ measures how well the model fits the data.
  • For a simple linear regression, $R^2$ is the square of the Pearson correlation coefficient.
  • A high value of $R^2$ generally indicates a good fit.
  • The value of $R^2$ tends to increase as more predictors are added to the model, even when they add little information.
  • Consider the adjusted R-squared, which penalizes $R^2$ for a higher number of predictors.
  • An (adjusted) $R^2$ close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model.
  • A number near 0 indicates that the regression model did not explain much of the variability in the outcome.
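The adjusted R-squared formula is not spelled out above; one common definition is $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$, where $n$ is the number of observations and $p$ the number of predictors. A sketch with scikit-learn, using the sample data from the Python example below:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

model = LinearRegression().fit(x, y)
r2 = model.score(x, y)

n, p = x.shape  # n observations, p predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2, adj_r2)  # the adjustment lowers r2 here, since p >= 1
```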

Python Implementation

Step 1: Import Packages and Classes
  • Import NumPy, LinearRegression from sklearn.linear_model, and matplotlib.pyplot.
  • Use NumPy for array manipulation and LinearRegression for performing linear regression.
  • matplotlib.pyplot is used for visualizing the data and regression line.
Step 2: Provide Data
  • Define the input (regressors, x) and output (response, y) as arrays or similar objects.
  • Reshape the input array x to be two-dimensional with one column and as many rows as necessary using .reshape((-1, 1)).
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
print(x)
print(y)

plt.scatter(x, y)
plt.show()
Step 3: Create a Model and Fit it
  • Create an instance of the LinearRegression class.
model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None)
  • Parameters:
    • fit_intercept: Boolean, decides whether to calculate the intercept (default: True).
    • copy_X: Boolean, decides whether to copy the input variables (default: True).
    • n_jobs: Integer or None, the number of jobs used in parallel computation (default: None).
  • Note: the normalize parameter has been removed from recent versions of scikit-learn; if scaling is needed, standardize the inputs beforehand (for example with sklearn.preprocessing.StandardScaler).
  • Use .fit() to calculate the optimal values of the weights using the input and output data.
model.fit(x, y)
Step 4: Get Results
  • Obtain the coefficient of determination ($R^2$) using .score().
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
  • Get the intercept and coefficients.
print(f"intercept: {model.intercept_}")
print(f"coefficients: {model.coef_}")
  • The intercept ($b_0$) represents the predicted response when x is zero.
  • The coefficient ($b_1$) represents the change in the predicted response when x is increased by one.
Step 5: Predict Response
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")
  • Equivalently, the predictions can be computed manually as model.intercept_ + model.coef_ * x (this yields a two-dimensional array, since x was reshaped).
Step 6: Visualize the Regression Line
plt.scatter(x, y)
plt.plot(x, y_pred)
plt.show()
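As a cross-check on the scikit-learn results above, SciPy (if installed) offers scipy.stats.linregress, which returns the slope, intercept, and correlation coefficient of a simple linear regression in a single call:

```python
import numpy as np
from scipy import stats

x = np.array([5, 15, 25, 35, 45, 55])  # 1-D: linregress needs no reshape
y = np.array([5, 20, 14, 32, 22, 38])

result = stats.linregress(x, y)
print(result.slope, result.intercept)  # match model.coef_ and model.intercept_
print(result.rvalue ** 2)              # matches model.score(x, y)
```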