Linear Regression Study Notes

Overview of Linear Regression

In this chapter, we will explore various aspects of linear regression, specifically covering the following topics:

Simple linear regression model
Least squares method
Assessing the fit of a simple linear regression model
Multiple linear regression model
Inference in linear regression
Categorical independent variables
Modeling nonlinear relationships
Model fitting
Big data and linear regression
Prediction with linear regression

Simple Linear Regression Model

A simple linear regression model can be represented by the following equation:
$y = \beta_0 + \beta_1 x + \epsilon$
Where:

y is the dependent variable,
\beta_0 is the intercept,
\beta_1 is the slope of the line,
x is the independent variable, and
\epsilon is the error term.

Since the values of \beta_0 and \beta_1 are unknown, we estimate them using a sample. The estimated simple linear regression equation is given by:
$\hat{Y} = B_0 + B_1 x$
Where:

\hat{Y} is the predicted value of Y,
B_0 is the point estimate of \beta_0, and
B_1 is the point estimate of \beta_1.

Here, \hat{Y} serves as a point estimator for the expected value of Y given x. Specifically, it estimates the mean value of Y for a specific value of x.

Least Squares Method

The least squares method is a statistical approach used to derive the estimated linear regression equation from sample data by minimizing the sum of the squared differences between observed and predicted values. The method aims to minimize the following summation:
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
This can be rewritten as:
$\min \sum_{i=1}^{n} (y_i - B_0 - B_1 x_i)^2$
Where:

y_i is the observed value of the dependent variable for the i-th observation,
\hat{y}_i is the predicted value of the dependent variable for the i-th observation,
n is the total number of observations.
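As a rough illustration of the minimization above, the sketch below computes the two estimates with the standard closed-form least squares solution for one independent variable (slope = S_xy / S_xx, intercept = y-bar minus slope times x-bar). The function name fit_simple_ols and the five data points are made up for this example; they are not taken from the Butler Trucking file.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) that minimize sum((y_i - b0 - b1 * x_i) ** 2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares of deviations from the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    b1 = s_xy / s_xx          # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


# Hypothetical illustrative data: miles driven and travel time in hours
miles = [50.0, 60.0, 80.0, 100.0, 110.0]
hours = [4.0, 5.0, 6.0, 8.0, 9.0]

b0, b1 = fit_simple_ols(miles, hours)
print(f"estimated equation: y_hat = {b0:.4f} + {b1:.4f} x")
```

Any other choice of intercept and slope would give a larger sum of squared differences on the same data.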

Application in Excel Using Sample Data

Using Excel, we can analyze the relationship between miles and time from a dataset of driving assignments for Butler Trucking. Here, miles serves as the independent variable and time is the dependent variable.
To analyze the data, create a scatter plot in Excel:

Highlight the data.
Insert a scatter plot.
Right-click on a point to add a trend line and display the regression equation.

As a result, suppose our estimated regression equation appears as follows:
$\hat{Y} = 0.0678x + 1.2739$

This suggests a positive relationship: as miles increase, time also increases.

Assessing the Fit of the Simple Linear Regression Model

Key statistical measures include:

Sum of Squares Due to Error (SSE):
$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Total Sum of Squares (SST):
$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$
Sum of Squares Due to Regression (SSR):
$SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$

The relationship among these is expressed as:
$SST = SSR + SSE$

Coefficient of Determination

This measure provides the proportion of variance in the dependent variable that is predictable from the independent variable, calculated as:
$R^2 = \frac{SSR}{SST}$
The value of $R^2$ can range from 0 to 1, where a value of 1 indicates a perfect model fit.
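The following sketch shows how SSE, SST, SSR and R^2 could be computed for a least squares fit, and checks numerically that SST = SSR + SSE. The function name sums_of_squares and the small dataset are assumptions made for illustration only.

```python
def sums_of_squares(x, y):
    """Fit y_hat = b0 + b1 * x by least squares, then return (sse, sst, ssr)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained variation
    return sse, sst, ssr


# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sse, sst, ssr = sums_of_squares(x, y)
r_squared = ssr / sst
print(f"SSE={sse:.2f} SST={sst:.2f} SSR={ssr:.2f} R^2={r_squared:.2f}")
```

For this data the fit is y_hat = 2.2 + 0.6x, giving SSE = 2.4, SSR = 3.6, SST = 6 and R^2 = 0.6. The identity SST = SSR + SSE holds because the fitted values come from a least squares fit that includes an intercept.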

Multiple Linear Regression Model

A multiple linear regression model can be expressed as:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q + \epsilon$
Where:

x_1, x_2, …, x_q are the independent variables.

The estimated multiple linear regression equation is:
$\hat{Y} = B_0 + B_1 x_1 + B_2 x_2 + \dots + B_q x_q$
Where B_0, B_1, …, B_q are point estimates of the corresponding \beta values.

For example, Excel's data analysis tool can be applied to a dataset (e.g., Butler with 300 assignments), with Time as the dependent variable and both Miles and Deliveries as independent variables.
After performing the regression analysis, one might obtain:
$\hat{Y} = 0.127 + 0.067 \times \text{Miles} + 0.69 \times \text{Deliveries}$
The associated $R^2$ is 0.817, meaning the model accounts for about 81.7% of the variation in Time, which indicates a strong fit.
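Outside Excel, the same kind of estimate can be sketched by solving the least squares normal equations (X'X)B = X'y. The code below is a minimal standard-library version that assumes the predictors are not perfectly collinear. The helper names solve_linear_system and fit_multiple_ols are hypothetical, and the data is synthetic, generated from known coefficients so the recovered estimates can be checked. It is not the Butler dataset.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_ols(rows, y):
    """rows holds [x1, ..., xq] per observation; returns [b0, b1, ..., bq]."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data: (miles, deliveries) pairs with time built from known coefficients
rows = [[100, 4], [50, 3], [80, 2], [75, 2], [65, 1]]
true_coefs = [0.5, 0.05, 0.7]
hours = [true_coefs[0] + true_coefs[1] * m + true_coefs[2] * d for m, d in rows]

coefs = fit_multiple_ols(rows, hours)
print("b0, b_miles, b_deliveries =", [round(c, 4) for c in coefs])
```

Because the synthetic times contain no error term, the estimates match the generating coefficients up to floating-point rounding. With real data they would only approximate the unknown \beta values.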

Inference in Linear Regression

Statistical inference allows us to make estimates about population characteristics based on sample data analysis. Conditions for valid inference in linear regression are:

The population of potential errors \epsilon is normally distributed with a mean of zero and constant variance for any combination of independent variables.
The values of \epsilon are statistically independent.

To evaluate these conditions, one must analyze residuals from the regression model. Utilizing scatter plots to compare residuals against predicted values and independent variables aids in identifying potential violations of these assumptions.

Performing Residual Analysis in Excel

Using the Butler deliveries file and Excel's data analysis tools, one can create residual plots to check whether the conditions for linear regression inference are satisfied. Ideally, the residuals should scatter randomly around zero, look approximately normally distributed, and show roughly constant spread across the levels of each independent variable and across the predicted values. If the plots show no systematic pattern, this supports the validity of inference based on the regression model.
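The quantities behind a residual plot are simple to compute. The sketch below, for one independent variable, returns the fitted values and residuals that would be plotted against each other. The function name fitted_and_residuals and the data are illustrative assumptions, and the plotting itself is left out.

```python
def fitted_and_residuals(x, y):
    """Least squares fit of y on x; returns (fitted values, residuals)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]  # observed minus predicted
    return fitted, residuals


# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

fitted, residuals = fitted_and_residuals(x, y)
print("predicted  residual")
for f, e in zip(fitted, residuals):
    print(f"{f:9.2f}  {e:+8.2f}")
print(f"mean residual: {sum(residuals) / len(residuals):.6f}")
```

With an intercept in the model, least squares residuals always average to zero, so the mean alone says nothing about the assumptions. What matters is whether the plotted points show a trend, a funnel shape, or other structure.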

Testing Parameter Significance

The significance of each regression parameter can be assessed with a hypothesis test, specifically a t-test. The null hypothesis states that:
$H_0: \beta_j = 0$
The alternative hypothesis is:
$H_a: \beta_j \neq 0$

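To determine the significance of a coefficient, divide the coefficient by its standard error to obtain a t-statistic. For instance:

For Miles:
$T_{stat} = \frac{0.067}{0.0025} \approx 26.8$
With the rounded values shown, the quotient is about 26.8; the figure of 27.37 presumably reflects the unrounded coefficient and standard error. The resulting p-value is very small, which allows us to reject the null hypothesis, suggesting Miles has a significant effect on Time.
Similarly, for Deliveries:
$T_{stat} = \frac{0.69}{0.0295} \approx 23.4$
A standard error of 0.295 would give a t-statistic of only about 2.3, so the value consistent with the figure of 23.37 is 0.0295. The p-value is again very small, indicating a significant relationship between Deliveries and Time.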
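To show where a coefficient's standard error comes from, here is a sketch for the simple (one-variable) case, using the textbook formulas s^2 = SSE / (n - 2) and se(B_1) = s / sqrt(S_xx). The function name slope_t_statistic and the data are assumptions for illustration. Excel's multiple regression output uses the matrix generalization of the same idea.

```python
import math


def slope_t_statistic(x, y):
    """Return (b1, standard error of b1, t-statistic) for a simple regression."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 observations (n - 2 degrees of freedom)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_squared = sse / (n - 2)            # estimate of the error variance
    se_b1 = math.sqrt(s_squared / s_xx)  # standard error of the slope
    return b1, se_b1, b1 / se_b1


# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, se_b1, t_stat = slope_t_statistic(x, y)
print(f"b1={b1:.3f} se={se_b1:.4f} t={t_stat:.3f} (df={len(x) - 2})")
```

The t-statistic is then compared with a t distribution with n - 2 degrees of freedom to obtain the p-value. That lookup is omitted here because the Python standard library does not include the t distribution.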

Multicollinearity Considerations

When testing whether independent variables have significant relationships with the dependent variable, it is crucial to check that the independent variables are not strongly correlated with one another. Such multicollinearity inflates the standard errors of the coefficient estimates and makes the individual t-tests unreliable. This can be demonstrated through further analysis utilizing various datasets.
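One simple screening step is to compute the sample correlation between pairs of independent variables. The sketch below does that for two predictors and also reports the variance inflation factor, VIF = 1 / (1 - r^2). The VIF is a common diagnostic that these notes do not mention; it is added here as an assumption about how one might quantify the problem. The function name pearson_r and the data are hypothetical.

```python
import math


def pearson_r(a, b):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Hypothetical illustrative predictor values
miles = [100, 50, 80, 75, 65]
deliveries = [4, 3, 2, 2, 1]

r = pearson_r(miles, deliveries)
vif = 1 / (1 - r ** 2)  # with exactly two predictors, both share this VIF
print(f"correlation between predictors: {r:.3f}, VIF: {vif:.3f}")
```

A correlation near +1 or -1 (equivalently, a large VIF) would signal that the two predictors carry largely the same information, so their individual coefficients are hard to estimate separately.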