OPRE 207: Statistics for Business and Management Science - Simple Linear Regression
Simple Linear Regression
Overview
- Simple Linear Regression Model
- Least Squares Method
- Coefficient of Determination
- Model Assumptions
- Testing for Significance
- Using the Estimated Regression Equation for Estimation and Prediction
Introduction
- Managerial decisions often rely on understanding relationships between two or more variables.
- Example 1: Predicting sales based on advertising expenditure.
- Example 2: Predicting electricity usage based on daily high temperatures.
- Regression analysis helps develop equations that show how variables are related.
Regression Terminology
- Dependent Variable (y): The variable being predicted.
- Independent Variable (x): The variable used to predict the dependent variable.
- Examples:
- Sales (y) depend on advertising spending (x).
- Next week’s electricity usage (y) depends on daily high temperatures (x).
- Alternative names:
- Independent variable (x): predictor, feature, experimental variable.
- Dependent variable (y): response, target, outcome variable.
Simple Linear Regression
- Involves one independent variable (x) and one dependent variable (y).
- The relationship between the variables is approximated by a straight line.
- Multiple Regression: Regression analysis involving two or more independent variables (e.g., x1, x2, …).
Simple Linear Regression Model
- The regression model describes how y is related to x and an error term (\epsilon).
- Equation: y = \beta_0 + \beta_1 x + \epsilon
- \beta_0 and \beta_1 are parameters of the model.
- \epsilon is a random variable called the error term.
- The error term accounts for the variability in y that the linear relationship between x and y cannot explain.
Simple Linear Regression Equation
- Describes how the expected value of y is related to x.
- Equation: E[y] = \beta_0 + \beta_1 x
- \beta_0 is the y-intercept of the regression line.
- \beta_1 is the slope of the regression line.
- E[y] is the expected value of y for a given x value.
Simple Linear Regression Equation - Relationships
- Positive Linear Relationship
- Negative Linear Relationship
- No Relationship
Estimated Simple Linear Regression Equation
- If the population parameters \beta_0 and \beta_1 were known, E[y] = \beta_0 + \beta_1 x could be used to compute the mean value of y for a given value of x.
- In practice, parameter values are estimated using sample data.
- Sample statistics (b_0 and b_1) are computed as estimates of the population parameters \beta_0 and \beta_1.
- Substituting sample statistics into the regression equation, we obtain the estimated regression equation.
Estimated Simple Linear Regression Equation - Equation
- Equation: \hat{y} = b_0 + b_1 x
- b_0 is the y-intercept of the regression line.
- b_1 is the slope of the regression line.
- \hat{y} is the estimated value of y for a given x value.
Estimation Process
- In linear regression, each observation consists of two values:
- One for the independent variable (x).
- One for the dependent variable (y).
Least Squares Method
- The least squares (LS) method is a procedure for using sample data to find the estimated regression equation \hat{y} = b_0 + b_1 x.
- The method finds the values of b_0 and b_1 that minimize the sum of the squares of the deviations between the observed values of the dependent variable (y_i) and the predicted values (\hat{y}_i).
- Least Squares Criterion: \min \sum(y_i - \hat{y}_i)^2
- y_i is the observed value of the dependent variable for the i^{th} observation.
- \hat{y}_i is the estimated value of the dependent variable for the i^{th} observation.
- Using differential calculus, the values that minimize the LS criterion are (see the sketch after this list):
- b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}
- b_0 = \bar{y} - b_1\bar{x}
- Where:
- x_i is the value of the independent variable for the i^{th} observation.
- y_i is the value of the dependent variable for the i^{th} observation.
- \bar{x} is the mean value for the independent variable.
- \bar{y} is the mean value for the dependent variable.
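As an illustration, here is a minimal Python sketch of these two formulas (the function name is ours, not part of the course material):

```python
def least_squares_fit(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar  # b0 = y_bar - b1 * x_bar
    return b0, b1
```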
Example – Armand’s Pizza
- Armand’s Pizza, a chain of Italian-food restaurants, finds its most successful locations are near college campuses.
- Managers believe that quarterly sales (y) are positively related to the size of the student population (x).
- Restaurants near campuses with a large student population tend to generate more sales.
- Using sample data, develop a regression equation showing how y is related to x.
Example – Armand’s Pizza - Calculations
- Given the data, we first calculate \bar{x} and \bar{y}:
- \bar{x} = \frac{\sum x_i}{n} = \frac{140}{10} = 14
- \bar{y} = \frac{\sum y_i}{n} = \frac{1300}{10} = 130
Example – Armand’s Pizza - Slope
- Slope for the estimated regression equation:
- b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = \frac{2840}{568} = 5
Example – Armand’s Pizza - Intercept
- y-intercept for the estimated regression equation:
- b_0 = \bar{y} - b_1\bar{x} = 130 - (5)(14) = 60
- Estimated Regression Equation: \hat{y} = 60 + 5x
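As a check, the sketch above reproduces these values on the dataset commonly used for this textbook example (assuming that data; note its totals match \sum x_i = 140 and \sum y_i = 1300 used above):

```python
# Armand's Pizza sample data: x = student population (1000s),
# y = quarterly sales ($1000s); sum(x) = 140, sum(y) = 1300
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = least_squares_fit(x, y)  # from the sketch above
print(b0, b1)                     # 60.0 5.0, i.e. y_hat = 60 + 5x
```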
Example – Armand’s Pizza - Diagram
- Scatter diagrams for regression analysis are constructed with the independent variable x on the horizontal axis and the dependent variable y on the vertical axis
Interpretation of Intercept and Slope
- \hat{y} = 60 + 5x
- Interpretation of intercept (b_0 = 60):
- When the student population is 0 (x = 0), the expected quarterly sales is 60 (in thousands).
- Interpretation of slope (b_1 = 5):
- A one-unit increase in the student population (x) is associated with a 5-unit increase in quarterly sales (in thousands).
- Sales are expected to increase by $5,000 for every 1,000 additional students.
Coefficient of Determination
- Now that we have learned how to develop a simple linear regression model, the question is: How well does the estimated regression equation fit the data?
- Coefficient of determination (r^2) is a measure of the goodness of fit for the estimated regression equation
Residual (Error)
- For observation i, the difference between the observed value of the dependent variable (y_i) and the predicted value (\hat{y}_i) is called the i^{th} residual (or error).
- The residual represents the error in using \hat{y}_i to estimate y_i.
- Thus, for the i^{th} observation, the residual is y_i - \hat{y}_i.
- The sum of squares of these residuals is the quantity minimized by the least squares method.
- This quantity is known as the sum of squares due to error, or SSE:
- SSE = \sum(y_i - \hat{y}_i)^2
- The value of SSE is a measure of the error in using the estimated regression equation to predict the values of the dependent variable.
Total Sum of Squares (SST)
- For the i^{th} observation in the sample, the difference y_i - \bar{y} provides a measure of the error involved in using the sample mean (\bar{y}) to predict the dependent variable (y_i).
- The corresponding sum of squares is called the total sum of squares, or SST:
- SST = \sum(y_i - \bar{y})^2
Deviations Around the Estimated Regression Line for Armand’s Pizza
- Deviations around the estimated regression line \hat{y} = 60 + 5x and the line \bar{y} for Armand's Pizza.
Sum of Squares due to Regression (SSR)
- To measure how much the \hat{y} values on the estimated regression line deviate from \bar{y}, another sum of squares is computed.
- This sum of squares is called the sum of squares due to regression, or SSR:
- SSR = \sum(\hat{y}_i - \bar{y})^2
Relationship Among SST, SSR, and SSE
- The relationship among SST, SSR, and SSE provides one of the most important results in statistics:
- SST = SSR + SSE
- \sum(y_i - \bar{y})^2 = \sum(\hat{y}_i - \bar{y})^2 + \sum(y_i - \hat{y}_i)^2
- SSR can be thought of as the explained portion of SST.
- SSE can be thought of as the unexplained portion of SST.
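Continuing the Python sketch from the Armand's Pizza example, the three sums of squares and the identity can be checked directly (variable names are ours):

```python
y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]                     # fitted values

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained
sst = sum((yi - y_bar) ** 2 for yi in y)               # total

assert abs(sst - (ssr + sse)) < 1e-6                   # SST = SSR + SSE
print(sst, ssr, sse)                                   # 15730.0 14200.0 1530.0
```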
Coefficient of Determination
- The ratio SSR/SST, which will take values between 0 and 1, is used to evaluate the goodness of fit for the estimated regression equation.
- This ratio is called the coefficient of determination and is denoted by r^2
- r^2 = \frac{SSR}{SST}
- Where:
- SSR = sum of squares due to regression
- SST = total sum of squares
Example – Armand’s Pizza - Coefficient of Determination
- Calculate the coefficient of determination for the Armand’s Pizza example.
- Recall that SST = 15,730 and SSE = 1,530
- SSR = SST - SSE = 15730 - 1530 = 14200
- r^2 = \frac{SSR}{SST} = \frac{14200}{15730} = 0.9027
- Conclusion: 90.27% of the total sum of squares can be explained by using the estimated regression equation \hat{y} = 60 + 5x to predict quarterly sales. In other words, 90.27% of the variability in sales can be explained by the linear relationship between the size of the student population and sales
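The same figure falls out of the running sketch:

```python
r_squared = ssr / sst        # SSR / SST
print(round(r_squared, 4))   # 0.9027
```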
Correlation Coefficient
- Correlation coefficient is a measure of the strength of the linear relationship between two variables x and y (previously discussed).
- The correlation coefficient can take on values between –1 and +1
- Values near –1 indicate a strong negative linear relationship
- Values near +1 indicate a strong positive linear relationship
- The closer the correlation is to zero, the weaker the linear relationship
Sample Correlation Coefficient
- If a regression analysis has already been performed and the coefficient of determination r^2 computed, the sample correlation coefficient can be computed as follows:
- r = (sign \ of \ b_1)\sqrt{r^2}
- The sign for the sample correlation coefficient is positive if the estimated regression equation has a positive slope (b_1 > 0) and negative if the estimated regression equation has a negative slope (b_1 < 0).
Example – Armand’s Pizza - Correlation Coefficient
- Calculate the sample correlation coefficient for the Armand’s Pizza example. What conclusion can be made?
- Recall that r^2 = 0.9027 and \hat{y} = 60 + 5x
- r = + \sqrt{0.9027} = +0.9501
- With a sample correlation coefficient of r = +0.9501, we would conclude that a strong positive linear relationship exists between x and y.
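In the running Python sketch, attaching the sign of the slope gives the same value:

```python
import math

# r = (sign of b1) * sqrt(r^2)
r = math.copysign(math.sqrt(r_squared), b1)
print(round(r, 4))  # 0.9501
```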
A Note on Coefficient of Determination vs. Correlation Coefficient
- In the case of a linear relationship between two variables, both the coefficient of determination (r^2) and the sample correlation coefficient (r) provide measures of the strength of the relationship.
- The coefficient of determination (r^2) provides a measure between 0 and 1.
- The sample correlation coefficient (r) provides a measure between −1 and +1.
- Although r is restricted to measuring the strength of a linear relationship between two variables, r^2 can also be used for nonlinear relationships and for relationships with two or more independent variables.
- Thus, the coefficient of determination (r^2) provides a wider range of applicability.
Model Assumptions
- Consider the regression model: y = \beta_0 + \beta_1 x + \epsilon
- We make the following assumptions about \epsilon:
- The error \epsilon is a random variable with a mean of zero, i.e., E[\epsilon] = 0
- The variance of \epsilon, denoted by \sigma^2, is the same for all values of x
- The values of \epsilon are independent for all values of x
- The error \epsilon is a normally distributed random variable for all values of x
Assumptions about the Error Term in the Regression Model
- The value of E[y] changes according to the specific value of x considered.
- However, regardless of the x value, the probability distribution of \epsilon and hence the probability distributions of y are normally distributed, each with the same variance \sigma^2.
- The specific value of the error \epsilon at any particular point depends on whether the actual value of y is greater than or less than E[y].
- Bottom line: For all x values, y is a normally distributed random variable with:
- E[y] = \beta_0 + \beta_1 x
- Variance = \sigma^2
Testing for Significance
- The value of the coefficient of determination (r^2) is a measure of the goodness of fit of the estimated regression equation.
- However, even with a large value of r^2, the estimated regression equation should not be used until further analysis of the appropriateness of the assumed model has been conducted.
- An important step in determining whether the assumed model is appropriate involves testing for the significance of the relationship between x and y
Testing for Significance - Explanation
- In the regression equation E[y] = \beta_0 + \beta_1 x:
- If \beta_1 = 0, we can conclude that x and y are not related.
- If \beta_1 \neq 0, we can conclude that x and y are related.
- To test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of \beta_1 is zero or not.
- Two tests are commonly used: the t test and the F test
- Both tests require an estimate of \sigma^2, the variance of \epsilon in the regression model
- In this course, we cover only the t test
Estimate of \sigma^2
- Recall that the sum of squared errors (SSE) is a measure of the variability of the actual observations (y_i) around the estimated regression line (\hat{y}_i).
- It can be shown that the mean square error (MSE) provides an unbiased estimator of \sigma^2, which is denoted by s^2:
- s^2 = MSE = \frac{SSE}{n-2}
- Where SSE = \sum(y_i - \hat{y}_i)^2 = \sum(y_i - b_0 - b_1 x_i)^2
- n-2 is the degrees of freedom for SSE
Standard Error of the Estimate
- Standard error of the estimate is the square root of s^2:
- s = \sqrt{MSE} = \sqrt{\frac{SSE}{n-2}}
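In the running sketch, both quantities follow directly from SSE:

```python
n = len(y)
mse = sse / (n - 2)  # s^2: unbiased estimate of sigma^2 (191.25 here)
s = mse ** 0.5       # standard error of the estimate (about 13.829)
```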
Sampling Distribution of b0 and b1
- Recall that in the Armand’s Pizza example, we collected data from a sample of 10 restaurants and used the least squares method to develop the regression equation \hat{y} = 60 + 5x
- If we take a new sample of 10 restaurants, and use least squares again, we will obtain a different regression equation
- Thus, b_0 and b_1 are indeed random variables, and we can define a sampling distribution for them
- We are particularly interested in the “sampling distribution of b_1” because we are trying to conduct a hypothesis test about \beta_1
Sampling Distribution of b_1
- The “sampling distribution of b_1” can be defined by its mean, standard deviation (standard error), and shape
- Expected value (mean): E[b_1] = \beta_1
- Standard deviation (standard error): s_{b_1} = \frac{s}{\sqrt{\sum(x_i - \bar{x})^2}}, where s = \sqrt{\frac{SSE}{n-2}}
- Shape: Normal
Testing for Significance: t Test
- The simple linear regression model is y = \beta_0 + \beta_1 x + \epsilon
- If x and y are linearly related, we must have \beta_1 \neq 0
- The purpose of the t test is to see whether we can conclude that \beta_1 \neq 0
- We will use the sample data to test the following hypotheses about the parameter \beta_1:
- H_0: \beta_1 = 0
- H_1: \beta_1 \neq 0
Steps of Hypothesis Testing for Significance - p-value approach
- The p-value Approach:
- Step 1. Develop the null and alternative hypotheses
- H_0: \beta_1 = 0
- H_1: \beta_1 \neq 0
- Step 2. Specify the level of significance \alpha
- Step 3. Collect the sample data, then calculate b_1 (the least squares estimate of \beta_1), the standard error s_{b_1}, and the test statistic t = \frac{b_1}{s_{b_1}}
- Step 4. Compute the two-tailed p-value as follows:
- p-value = 2 \cdot P(t > |test statistic|)
- In Excel: =1-(T.DIST(ABS(t), n-2, TRUE)-T.DIST(-ABS(t), n-2, TRUE))
- Step 5. Reject H_0 if the p-value ≤ \alpha
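A sketch of Steps 3 through 5 in Python, using scipy.stats.t for the t distribution (the helper name and return format are ours):

```python
from scipy import stats

def slope_t_test(x, y, b1, s, alpha=0.05):
    """Two-tailed t test of H0: beta_1 = 0, p-value approach."""
    n = len(x)
    x_bar = sum(x) / n
    # standard error of b1: s / sqrt(sum((x_i - x_bar)^2))
    s_b1 = s / sum((xi - x_bar) ** 2 for xi in x) ** 0.5
    t_stat = b1 / s_b1
    # p-value = 2 * P(T > |t|), T ~ t distribution with n - 2 df
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    return t_stat, p_value, p_value <= alpha  # True means reject H0
```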
Steps of Hypothesis Testing for Significance - Critical Value Approach
- The Critical Value Approach:
- Step 1. Develop the null and alternative hypotheses
- H_0: \beta_1 = 0
- H_1: \beta_1 \neq 0
- Step 2. Specify the level of significance \alpha
- Step 3. Collect the sample data, then calculate b_1 (the least squares estimate of \beta_1), the standard error s_{b_1}, and the test statistic t = \frac{b_1}{s_{b_1}}
- Step 4. Compute the critical value t_{\alpha/2} as follows:
- t_{\alpha/2} = T.INV(\alpha/2, n-2)
- Step 5. Reject H_0 if |t| \geq |t_{\alpha/2}|
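The corresponding call in Python mirrors the Excel formula; like T.INV, scipy's t.ppf returns the lower-tail (negative) value:

```python
from scipy import stats

alpha, n = 0.05, 10
t_crit = stats.t.ppf(alpha / 2, df=n - 2)  # -2.306 for df = 8
# Rejection rule: reject H0 if abs(t_stat) >= abs(t_crit)
```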
Example – Armand’s Pizza - Calculating t and s_b1
- Conduct a hypothesis test for the significance of the relationship between student population and quarterly sales at \alpha = 0.05
- Recall that we previously found that SSE = 1530
- s^2 = \frac{SSE}{n-2} = \frac{1530}{10-2} = 191.25
- s = \sqrt{MSE} = \sqrt{191.25} = 13.829
- We also calculated \sum(x_i - \bar{x})^2 previously. So,
- \sum(x_i - \bar{x})^2 = 568
- s_{b_1} = \frac{s}{\sqrt{\sum(x_i - \bar{x})^2}} = \frac{13.829}{\sqrt{568}} = 0.5803
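These values also drop out of the running sketch (s was computed after the standard error section above):

```python
s_b1 = s / 568 ** 0.5  # 13.829 / 23.833
print(round(s_b1, 4))  # 0.5803
```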
Example – Armand’s Pizza - p-value approach
- Here's how to conduct the hypothesis test using the p-value approach:
- Step 1. The null and alternative hypotheses are:
- H_0: \beta_1 = 0
- H_1: \beta_1 \neq 0
- Step 2. \alpha = 0.05
- Step 3. We previously calculated b_1 and s_{b_1}. So, the test statistic is:
- t = \frac{b_1}{s_{b_1}} = \frac{5}{0.5803} = 8.62
- Step 4. The p-value is:
- p-value = 2 \cdot P(t > |test statistic|)
- In Excel: =1-(T.DIST(ABS(t), n-2, TRUE)-T.DIST(-ABS(t), n-2, TRUE)) ≈ 0
- Step 5. We reject H_0 because the p-value ≤ \alpha (i.e., ≈ 0 < 0.05)
- Conclusion: A significant relationship exists between student population and quarterly sales at Armand’s Pizza restaurant
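Numerically, with scipy (the same computation as the Excel formula above):

```python
from scipy import stats

t_stat = 5 / 0.5803                     # 8.62
p_value = 2 * stats.t.sf(t_stat, df=8)  # far below 0.05, hence "p-value ~ 0"
print(p_value <= 0.05)                  # True: reject H0
```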
Example – Armand’s Pizza - Critical Value Approach
- Here's how to conduct the hypothesis test using the critical value approach:
- Steps 1-3 are identical to the p-value approach
- Step 4. The critical value t_{\alpha/2} is:
- t_{\alpha/2} = T.INV(\alpha/2, n-2) = T.INV(0.025, 8) = -2.306
- Step 5. We reject H_0 because |t| \geq |t_{\alpha/2}| (i.e., 8.62 > 2.306)
Hypothesis Testing for Significance in a Regression Model - Summary
- Two-Tailed Test
- Hypotheses
- H_0: \beta_1 = 0
- H_1: \beta_1 \neq 0
- Test Statistic
- t = \frac{b_1}{s_{b_1}}
- p-Value
- p-value = 2 \cdot P(t > |test statistic|)
- In Excel: =1-(T.DIST(ABS(t), n-2, TRUE)-T.DIST(-ABS(t), n-2, TRUE))
- Rejection Rule: p-Value Approach
- Reject H_0 if p-value ≤ \alpha
- Critical Value
- t_{\alpha/2} = T.INV(\alpha/2, n-2)
- Rejection Rule: Critical Value Approach
- Reject H_0 if |t| \geq |t_{\alpha/2}|
- Formulas:
- b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}
- s_{b_1} = \frac{s}{\sqrt{\sum(x_i - \bar{x})^2}}
- SSE = \sum(y_i - \hat{y}_i)^2 = \sum(y_i - b_0 - b_1 x_i)^2
- s = \sqrt{MSE} = \sqrt{\frac{SSE}{n-2}}
- \hat{y} = b_0 + b_1 x
Some Cautions about the Interpretation of Significance Tests
- Regression analysis cannot be used as evidence of a cause-and-effect relationship.
- For example, a test of significance for a relationship between age (x) and blood sugar (y) rejects H_0: \beta_1 = 0, confirming a significant statistical relationship. However, we cannot necessarily conclude that an increase in age causes an increase in blood sugar
Some Cautions about the Interpretation of Significance Tests - Linearity
- Just because we are able to reject H_0 and demonstrate statistical significance does not enable us to conclude that the relationship between x and y is truly linear
- We can state only that x and y are related and that a linear relationship explains a significant portion of the variability in y over the range of values for x observed in the sample
Summary of Formulas
- Simple Linear Regression Model: y = \beta_0 + \beta_1 x + \epsilon
- Simple Linear Regression Equation: E[y] = \beta_0 + \beta_1 x
- Estimated Simple Linear Regression Equation: \hat{y} = b_0 + b_1 x
- Least Squares Criterion: \min \sum(y_i - \hat{y}_i)^2
- Slope and y-Intercept for the Estimated Regression Equation:
- b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}
- b_0 = \bar{y} - b_1\bar{x}
- Sum of Squares Due to Error:
- SSE = \sum(y_i - \hat{y}_i)^2
- Total Sum of Squares:
- SST = \sum(y_i - \bar{y})^2
- Sum of Squares Due to Regression:
- SSR = \sum(\hat{y}_i - \bar{y})^2
- Relationship Among SST, SSR, and SSE:
- SST = SSR + SSE
- Coefficient of Determination:
- r^2 = \frac{SSR}{SST}
- Sample Correlation Coefficient:
- r = (sign \ of \ b_1)\sqrt{r^2}
- Mean Square Error (Estimate of \sigma^2):
- s^2 = MSE = \frac{SSE}{n-2}
- Standard Error of the Estimate:
- s = \sqrt{MSE} = \sqrt{\frac{SSE}{n-2}}
- Estimated Standard Deviation of b_1:
- s_{b_1} = \frac{s}{\sqrt{\sum(x_i - \bar{x})^2}}
- t Test Statistic:
- t = \frac{b_1}{s_{b_1}}