Linear Regression
Simple Linear Regression
- Definition: Linear regression models a dependent variable $y$ on a set of independent variables $x = (x_1, \ldots, x_r)$, where $r$ is the number of predictors.
- Assumption: A linear relationship exists between $y$ and $x$.
Problem Formulation
- Regression Equation: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + \varepsilon$, where:
- $\beta_0, \beta_1, \ldots, \beta_r$ are regression coefficients.
- $\varepsilon$ is a random error.
Estimating Regression Coefficients
- Linear regression calculates estimators of the regression coefficients (predicted weights), denoted $b_0, b_1, \ldots, b_r$.
- Estimated Regression Function: $f(x) = b_0 + b_1 x_1 + \cdots + b_r x_r$. This function captures the dependencies between inputs and output.
- The estimated/predicted response $f(x_i)$ for each observation $i = 1, \ldots, n$ should be as close as possible to the actual response $y_i$.
Residuals
- Definition: The differences between the actual and predicted responses, $y_i - f(x_i)$, for all observations $i = 1, \ldots, n$.
- Regression aims to determine the best-predicted weights, minimizing residuals.
Ordinary Least Squares
- Method: Minimizes the sum of squared residuals (SSR) over all observations $i = 1, \ldots, n$.
- Formula: $SSR = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$.
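As a concrete illustration, the OLS minimization can be carried out with NumPy's least squares solver. This is a minimal sketch; the small data set is the same illustrative one used in the Python implementation later in these notes:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(x), x])

# lstsq returns the weights (b0, b1) that minimize the sum of squared residuals
(b0, b1), ssr, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(b0, b1)  # intercept and slope
print(ssr)     # minimized SSR
```
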
Coefficient of Determination ($R^2$)
- Indicates the amount of variation in $y$ that can be explained by the dependence on $x$ using the regression model.
- A larger $R^2$ indicates a better fit.
- $R^2 = 1$ corresponds to $SSR = 0$, representing a perfect fit.
Slope-Intercept Form
- General Form: $y = mx + b$, where:
- $m$ = slope of the line
- $b$ = y-intercept of the line
- Statistical Form: $\hat{y} = \beta_0 + \beta_1 x$, where:
- $\hat{y}$ = the predicted value of y
- $\beta_0$ = the population y-intercept
- $\beta_1$ = the population slope
Specific Dependent Variable Value
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where:
- $x_i$ = the value of the independent variable for the $i$th observation
- $y_i$ = the value of the dependent variable for the $i$th observation
- $\varepsilon_i$ = the error of prediction for the $i$th observation
- $\beta_0 + \beta_1 x_i$ is the deterministic portion.
- The full equation, including $\varepsilon_i$, is the probabilistic model.
- In a deterministic model, all points are assumed to be on the line and $\varepsilon_i$ is zero in all cases.
Sample Data
- Regression analyses use sample data, not population data.
- Population parameters ($\beta_0$ and $\beta_1$) are estimated using sample statistics ($b_0$ and $b_1$).
- Equation of the Simple Regression Line: $\hat{y} = b_0 + b_1 x$
Least Squares Analysis
- The process used to determine the values of $b_0$ and $b_1$.
- The least squares regression line is the line that produces the minimum sum of squared errors: no other line through the data has a smaller sum of squared residuals.
Slope of the Regression Line
- Numerator of the slope expression: $SS_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$
- Denominator of the slope expression: $SS_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$
- Slope: $b_1 = \dfrac{SS_{xy}}{SS_{xx}}$
- Alternative formula for slope: $b_1 = \dfrac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}$
Y-Intercept of the Regression Line
- Formula: $b_0 = \bar{y} - b_1 \bar{x} = \dfrac{\sum y}{n} - b_1 \dfrac{\sum x}{n}$
- Data needed from sample information to compute the slope and intercept (unless sample means are used): $\sum x$, $\sum y$, $\sum x^2$, and $\sum xy$
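The slope and intercept formulas above can be verified numerically. A minimal sketch using the raw sums, with the illustrative data set that appears later in these notes:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)
n = len(x)

# Sums needed from the sample data
sum_x, sum_y = x.sum(), y.sum()
sum_x2, sum_xy = (x ** 2).sum(), (x * y).sum()

ss_xy = sum_xy - (sum_x * sum_y) / n   # numerator of the slope
ss_xx = sum_x2 - sum_x ** 2 / n        # denominator of the slope

b1 = ss_xy / ss_xx                     # slope
b0 = sum_y / n - b1 * (sum_x / n)      # y-intercept
print(b1, b0)
```
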
Residual Analysis
- A researcher tests whether a regression line is a good fit for the data by running historical data back through the model.
- Values of the independent variable (x values) are inserted into the regression model, and a predicted value $\hat{y}$ is obtained for each x value.
- The predicted values are compared to the actual y values to determine how much error there is.
- Each difference $y - \hat{y}$ between an actual and a predicted value is an error, or residual.
- The sum of squares of these residuals is minimized to find the least squares line.
Residuals Sum
- The sum of the residuals is approximately zero due to the placement of the line geometrically in the middle of all points.
- Vertical distances from the line to the points will cancel each other and sum to zero.
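This cancellation is easy to check numerically. A minimal sketch with the illustrative data used later in these notes:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Closed-form least squares fit for one predictor
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
print(residuals.sum())  # ~0: positive and negative residuals cancel
```
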
Outliers
- Outliers are data points that lie apart from the rest of the points.
- Outliers can produce residuals with large magnitudes.
- Outliers can be the result of mis-recorded or miscoded data, or they may simply be data points that do not conform to the general trend.
Standard Error of the Estimate
- An alternative way of examining the error of the model, which provides a single measurement of the regression error.
Sum of Squares of Error
- Because the sum of the residuals is zero, the residuals are squared and then summed to measure the total amount of error: $SSE = \sum (y - \hat{y})^2$.
Standard Error of the Estimate ($s_e$)
- A standard deviation of the error of the regression model.
- More useful than SSE as a single measure of regression error, because it is expressed in the original units of y.
- Formula: $s_e = \sqrt{\dfrac{SSE}{n - 2}}$
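A sketch of computing SSE and the standard error of the estimate, again with the illustrative data from the Python section of these notes:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)
n = len(x)

# Least squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)   # sum of squares of error
se = np.sqrt(sse / (n - 2))      # standard error of the estimate
print(sse, se)
```
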
Coefficient of Determination ($r^2$)
- A widely used measure of fit for regression models.
- The proportion of variability of the dependent variable (y) accounted for or explained by the independent variable (x).
- Ranges from 0 to 1.
- The variation of the dependent variable y is measured by the sum of squares of y, $SS_{yy} = \sum (y - \bar{y})^2$, the sum of the squared deviations of the y values from the mean value of y.
- This variation can be broken into two additive variations: the explained variation, measured by the sum of squares of regression ($SSR$), and the unexplained variation, measured by the sum of squares of error ($SSE$).
- Relationship: $SS_{yy} = SSR + SSE$
- If each term in the equation is divided by $SS_{yy}$, the resulting equation is: $1 = \dfrac{SSR}{SS_{yy}} + \dfrac{SSE}{SS_{yy}}$
- The term $\dfrac{SSR}{SS_{yy}}$ is the proportion of the y variability that is explained by the regression model: $r^2 = \dfrac{SSR}{SS_{yy}}$
- Substituting this into the preceding relationship gives: $r^2 = 1 - \dfrac{SSE}{SS_{yy}}$
- Computational Formula for $r^2$: $r^2 = 1 - \dfrac{SSE}{\sum y^2 - \frac{(\sum y)^2}{n}}$
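The computational formula can be checked against the illustrative data from the Python section of these notes; this is a minimal sketch:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)
n = len(y)

# Least squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)            # unexplained variation
ss_yy = np.sum(y ** 2) - y.sum() ** 2 / n  # total variation in y

r2 = 1 - sse / ss_yy
print(r2)
```
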
R-squared and Adjusted R-squared:
- The R-squared ($R^2$) ranges from 0 to 1.
- Represents the proportion of information (variation) in the data that can be explained by the model.
- The adjusted R-squared adjusts for the degrees of freedom.
- $R^2$ measures how well the model fits the data.
- For a simple linear regression, $R^2$ is the square of the Pearson correlation coefficient.
- A high value of $R^2$ is a good indication.
- The value of $R^2$ tends to increase when more predictors are added to the model.
- Consider the adjusted R-squared, which is $R^2$ penalized for a higher number of predictors.
- An (adjusted) $R^2$ that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model.
- A value near 0 indicates that the regression model did not explain much of the variability in the outcome.
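The adjusted R-squared can be sketched with the standard penalized formula, $R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}$, where $n$ is the number of observations and $p$ the number of predictors (this formula is an assumption; the notes do not spell it out). Using the illustrative data from the Python section:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)
n, p = len(y), 1  # observations, predictors

# Least squares fit and plain R-squared
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)
ss_yy = np.sum((y - y.mean()) ** 2)
r2 = 1 - sse / ss_yy

# Adjusted R-squared penalizes additional predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, adj_r2)
```

Because the penalty factor $(n-1)/(n-p-1)$ exceeds 1, the adjusted value is always at or below the plain $R^2$.
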
Python Implementation
Step 1: Import Packages and Classes
- Import NumPy, LinearRegression from sklearn.linear_model, and matplotlib.pyplot.
- Use NumPy for array manipulation and LinearRegression for performing linear regression.
- matplotlib.pyplot is used for visualizing the data and regression line.
Step 2: Provide Data
- Define the input (regressors, x) and output (response, y) as arrays or similar objects.
- Reshape the input array x to be two-dimensional with one column and as many rows as necessary using .reshape((-1, 1)).
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
print(x)
print(y)
plt.scatter(x, y)
plt.show()
Step 3: Create a Model and Fit it
- Create an instance of the LinearRegression class.
model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None)
- Parameters:
- fit_intercept: Boolean, decides whether to calculate the intercept (default: True).
- copy_X: Boolean, decides whether to copy the input variables (default: True).
- n_jobs: Integer or None, the number of jobs used in parallel computation (default: None).
- Note: the older normalize parameter was deprecated and removed in scikit-learn 1.2; standardize inputs with sklearn.preprocessing.StandardScaler instead.
- Use .fit() to calculate the optimal values of the weights using the input and output data.
model.fit(x, y)
Step 4: Get Results
- Obtain the coefficient of determination ($R^2$) using .score().
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
- Get the intercept and coefficients.
print(f"intercept: {model.intercept_}")
print(f"coefficients: {model.coef_}")
- The intercept ($b_0$, stored in model.intercept_) represents the predicted response when x is zero.
- The coefficient ($b_1$, stored in model.coef_) represents the increase in the predicted response when x is increased by one.
Step 5: Predict Response
y_pred = model.intercept_ + model.coef_ * x
print(f"predicted response:\n{y_pred}")
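The fitted model can also predict responses for inputs it has not seen, via .predict(). A minimal self-contained sketch (the x_new values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)

# Predict responses for new inputs; .predict() applies b0 + b1*x
x_new = np.arange(5).reshape((-1, 1))
y_new = model.predict(x_new)
print(y_new)
```
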
Step 6: Visualize the Regression Line
plt.scatter(x, y)
plt.plot(x, y_pred)
plt.show()