OPRE 207: Statistics for Business and Management Science - Simple Linear Regression
Simple Linear Regression
Overview
- Simple Linear Regression Model
- Least Squares Method
- Coefficient of Determination
- Model Assumptions
- Testing for Significance
- Using the Estimated Regression Equation for Estimation and Prediction
Introduction
- Managerial decisions often rely on understanding relationships between two or more variables.
- Example 1: Predicting sales based on advertising expenditure.
- Example 2: Predicting electricity usage based on daily high temperatures.
- Regression analysis helps develop equations that show how variables are related.
Regression Terminology
- Dependent Variable (y): The variable being predicted.
- Independent Variable (x): The variable used to predict the dependent variable.
- Examples:
- Sales (y) depend on advertising spending (x).
- Next week’s electricity usage (y) depends on daily high temperatures (x).
- Alternative names:
- Independent variable (x): predictor, feature, experimental variable.
- Dependent variable (y): response, target, outcome variable.
Simple Linear Regression
- Involves one independent variable (x) and one dependent variable (y).
- The relationship between the variables is approximated by a straight line.
- Multiple Regression: Regression analysis involving two or more independent variables (e.g., x1, x2, …).
Simple Linear Regression Model
- The regression model describes how y is related to x and an error term (\epsilon).
- Equation: y = \beta_0 + \beta_1 x + \epsilon
- \beta_0 and \beta_1 are parameters of the model.
- \epsilon is a random variable called the error term.
- The error term accounts for the variability in y that the linear relationship between x and y cannot explain.
Simple Linear Regression Equation
- Describes how the expected value of y is related to x.
- Equation: E[y] = \beta_0 + \beta_1 x
- \beta_0 is the y-intercept of the regression line.
- \beta_1 is the slope of the regression line.
- E[y] is the expected value of y for a given x value.
Simple Linear Regression Equation - Relationships
- Positive Linear Relationship
- Negative Linear Relationship
- No Relationship
Estimated Simple Linear Regression Equation
- If the population parameters \beta_0 and \beta_1 were known, E[y] = \beta_0 + \beta_1 x could be used to compute the mean value of y for a given value of x.
- In practice, parameter values are estimated using sample data.
- Sample statistics (b_0 and b_1) are computed as estimates of the population parameters \beta_0 and \beta_1.
- Substituting sample statistics into the regression equation, we obtain the estimated regression equation.
Estimated Simple Linear Regression Equation - Equation
- Equation: \hat{y} = b_0 + b_1 x
- b_0 is the y-intercept of the regression line.
- b_1 is the slope of the regression line.
- \hat{y} is the estimated value of y for a given x value.
Estimation Process
- In linear regression, each observation consists of two values:
- One for the independent variable (x).
- One for the dependent variable (y).
Least Squares Method
- The least squares (LS) method is a procedure for using sample data to find the estimated regression equation \hat{y} = b_0 + b_1 x.
- The method finds the values of b_0 and b_1 that minimize the sum of the squares of the deviations between the observed values of the dependent variable (y_i) and the predicted values (\hat{y}_i).
- Least Squares Criterion: \min \sum(y_i - \hat{y}_i)^2
- y_i is the observed value of the dependent variable for the i^{th} observation.
- \hat{y}_i is the estimated value of the dependent variable for the i^{th} observation.
- Using differential calculus, the values that minimize the LS criterion are (see the sketch after this list):
- b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}
- b_0 = \bar{y} - b_1\bar{x}
- Where:
- x_i is the value of the independent variable for the i^{th} observation.
- y_i is the value of the dependent variable for the i^{th} observation.
- \bar{x} is the mean value for the independent variable.
- \bar{y} is the mean value for the dependent variable.
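As an illustration, here is a minimal Python sketch of these two formulas (the function name is ours, not part of the course material):

```python
def least_squares_fit(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar  # b0 = y_bar - b1 * x_bar
    return b0, b1
```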
Example – Armand’s Pizza
- Armand’s Pizza, a chain of Italian-food restaurants, finds its most successful locations are near college campuses.
- Managers believe that quarterly sales (y) are positively related to the size of the student population (x).
- Restaurants near campuses with a large student population tend to generate more sales.
- Using sample data, develop a regression equation showing how y is related to x.
Example – Armand’s Pizza - Calculations
- Given the data, we first calculate \bar{x} and \bar{y}:
- \bar{x} = \frac{\sum x_i}{n} = \frac{140}{10} = 14
- \bar{y} = \frac{\sum y_i}{n} = \frac{1300}{10} = 130
Example – Armand’s Pizza - Slope
- Slope for the estimated regression equation:
- b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = \frac{2840}{568} = 5
Example – Armand’s Pizza - Intercept
- y-intercept for the estimated regression equation:
- b_0 = \bar{y} - b_1\bar{x} = 130 - (5)(14) = 60
- Estimated Regression Equation: \hat{y} = 60 + 5x
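As a check, the sketch above reproduces these values on the dataset commonly used for this textbook example (assuming that data; note its totals match \sum x_i = 140 and \sum y_i = 1300 used above):

```python
# Armand's Pizza sample data: x = student population (1000s),
# y = quarterly sales ($1000s); sum(x) = 140, sum(y) = 1300
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = least_squares_fit(x, y)  # from the sketch above
print(b0, b1)                     # 60.0 5.0, i.e. y_hat = 60 + 5x
```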
Example – Armand’s Pizza - Diagram
- Scatter diagrams for regression analysis are constructed with the independent variable x on the horizontal axis and the dependent variable y on the vertical axis
Interpretation of Intercept and Slope
- \hat{y} = 60 + 5x
- Interpretation of intercept (b_0 = 60):
- When the student population is 0 (x = 0), the expected quarterly sales is 60 (in thousands).
- Interpretation of slope (b_1 = 5):
- A one-unit increase in the student population (x) is associated with a 5-unit increase in quarterly sales (in thousands).
- Sales are expected to increase by $5,000 for every 1,000 additional students.
Coefficient of Determination
- Now that we have learned how to develop a simple linear regression model, the question is: How well does the estimated regression equation fit the data?
- Coefficient of determination (r^2) is a measure of the goodness of fit for the estimated regression equation
Residual (Error)
- For observation i, the difference between the observed value of the dependent variable (y_i) and the predicted value (\hat{y}_i) is called the i^{th} residual (or error).
- The residual represents the error in using \hat{y}_i to estimate y_i.
- Thus, for the i^{th} observation, the residual is y_i - \hat{y}_i.
- The sum of squares of these residuals is the quantity minimized by the least squares method.
- This quantity is known as the sum of squares due to error, or SSE:
- SSE = \sum(y_i - \hat{y}_i)^2
- The value of SSE is a measure of the error in using the estimated regression equation to predict the values of the dependent variable.
Total Sum of Squares (SST)
- For the i^{th} observation in the sample, the difference y_i - \bar{y} provides a measure of the error involved in using the sample mean (\bar{y}) to predict the dependent variable (y_i).
- The corresponding sum of squares is called the total sum of squares, or SST:
- SST = \sum(y_i - \bar{y})^2
Deviations Around the Estimated Regression Line for Armand’s Pizza
- Deviations around the estimated regression line \hat{y} = 60 + 5x and the line \bar{y} for Armand's Pizza.
Sum of Squares due to Regression (SSR)
- To measure how much the \hat{y} values on the estimated regression line deviate from \bar{y}, another sum of squares is computed.
- This sum of squares is called the sum of squares due to regression, or SSR:
- SSR = \sum(\hat{y}_i - \bar{y})^2
Relationship Among SST, SSR, and SSE
- The relationship among SST, SSR, and SSE provides one of the most important results in statistics:
- SST = SSR + SSE
- \sum(y_i - \bar{y})^2 = \sum(\hat{y}_i - \bar{y})^2 + \sum(y_i - \hat{y}_i)^2
- SSR can be thought of as the explained portion of SST.
- SSE can be thought of as the unexplained portion of SST.
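Continuing the Python sketch from the Armand's Pizza example, the three sums of squares and the identity can be checked directly (variable names are ours):

```python
y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]                     # fitted values

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained
sst = sum((yi - y_bar) ** 2 for yi in y)               # total

assert abs(sst - (ssr + sse)) < 1e-6                   # SST = SSR + SSE
print(sst, ssr, sse)                                   # 15730.0 14200.0 1530.0
```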
Coefficient of Determination
- The ratio SSR/SST, which will take values between 0 and 1, is used to evaluate the goodness of fit for the estimated regression equation.
- This ratio is called the coefficient of determination and is denoted by r^2
- r^2 = \frac{SSR}{SST}
- Where:
- SSR = sum of squares due to regression
- SST = total sum of squares
Example – Armand’s Pizza - Coefficient of Determination
- Calculate the coefficient of determination for the Armand’s Pizza example.
- Recall that SST = 15,730 and SSE = 1,530
- SSR = SST - SSE = 15730 - 1530 = 14200
- r^2 = \frac{SSR}{SST} = \frac{14200}{15730} = 0.9027
- Conclusion: 90.27% of the total sum of squares can be explained by using the estimated regression equation \hat{y} = 60 + 5x to predict quarterly sales. In other words, 90.27% of the variability in sales can be explained by the linear relationship between the size of the student population and sales
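The same figure falls out of the running sketch:

```python
r_squared = ssr / sst        # SSR / SST
print(round(r_squared, 4))   # 0.9027
```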
Correlation Coefficient
- Correlation coefficient is a measure of the strength of the linear relationship between two variables x and y (previously discussed).
- The correlation coefficient can take on values between –1 and +1
- Values near –1 indicate a strong negative linear relationship
- Values near +1 indicate a strong positive linear relationship
- The closer the correlation is to zero, the weaker the linear relationship
Sample Correlation Coefficient
- If a regression analysis has already been performed and the coefficient of determination r^2 computed, the sample correlation coefficient can be computed as follows:
- r = (sign \ of \ b_1)\sqrt{r^2}
- The sign for the sample correlation coefficient is positive if the estimated regression equation has a positive slope (b_1 > 0) and negative if the estimated regression equation has a negative slope (b_1 < 0).
Example – Armand’s Pizza - Correlation Coefficient
- Calculate the sample correlation coefficient for the Armand’s Pizza example. What conclusion can be made?
- Recall that r^2 = 0.9027 and \hat{y} = 60 + 5x
- r = + \sqrt{0.9027} = +0.9501
- With a sample correlation coefficient of r = +0.9501, we would conclude that a strong positive linear relationship exists between x and y.
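In the running Python sketch, attaching the sign of the slope gives the same value:

```python
import math

# r = (sign of b1) * sqrt(r^2)
r = math.copysign(math.sqrt(r_squared), b1)
print(round(r, 4))  # 0.9501
```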
A Note on Coefficient of Determination vs. Correlation Coefficient
- In the case of a linear relationship between two variables, both the coefficient of determination (r^2) and the sample correlation coefficient (r) provide measures of the strength of the relationship.
- The coefficient of determination (r^2) provides a measure between 0 and 1.
- The sample correlation coefficient (r) provides a measure between −1 and +1.
- Although r is restricted to measuring the strength of a linear relationship between two variables, r^2 can also be used for nonlinear relationships and for relationships with two or more independent variables.
- Thus, the coefficient of determination (r^2) provides a wider range of applicability.
Model Assumptions
- Consider the regression model: y = \beta_0 + \beta_1 x + \epsilon
- We make the following assumptions about \epsilon:
- The error \epsilon is a random variable with a mean of zero, i.e., E[\epsilon] = 0
- The variance of \epsilon, denoted by \sigma^2, is the same for all values of x
- The values of \epsilon are independent for all values of x
- The error \epsilon is a normally distributed random variable for all values of x
Assumptions about the Error Term in the Regression Model
- The value of E[y] changes according to the specific value of x considered.
- However, regardless of the x value, the probability distribution of \epsilon and hence the probability distributions of y are normally distributed, each with the same variance \sigma^2.
- The specific value of the error \epsilon at any particular point depends on whether the actual value of y is greater than or less than E[y].
- Bottom line: For all x values, y is a normally distributed random variable with:
- E[y] = \beta_0 + \beta_1 x
- Variance = \sigma^2
Testing for Significance
- The value of the coefficient of determination (r^2) is a measure of the goodness of fit of the estimated regression equation.
- However, even with a large value of r^2, the estimated regression equation should not be used until further analysis of the appropriateness of the assumed model has been conducted.
- An important step in determining whether the assumed model is appropriate involves testing for the significance of the relationship between x and y
Testing for Significance - Explanation
- In the regression equation E[y] = \beta_0 + \beta_1 x:
- If \beta_1 = 0, we can conclude that x and y are not related.
- If \beta_1 \neq 0, we can conclude that x and y are related.
- To test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of \beta_1 is zero or not.
- Two tests are commonly used: the t test and the F test
- Both tests require an estimate of \sigma^2, the variance of \epsilon in the regression model
- In this course, we cover only the t test
Estimate of \sigma^2
- Recall that the sum of squared errors (SSE) is a measure of the variability of the actual observations (y_i) around the estimated regression line (\hat{y}_i).
- It can be shown that the mean square error (MSE) provides an unbiased estimator of \sigma^2, which is denoted by s^2:
- s^2 = MSE = \frac{SSE}{n-2}
- Where SSE = \sum(y_i - \hat{y}_i)^2 = \sum(y_i - b_0 - b_1 x_i)^2
- n-2 is the degrees of freedom for SSE
Standard Error of the Estimate
- Standard error of the estimate is the square root of s^2:
- s = \sqrt{MSE} = \sqrt{\frac{SSE}{n-2}}
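In the running sketch, both quantities follow directly from SSE:

```python
n = len(y)
mse = sse / (n - 2)  # s^2: unbiased estimate of sigma^2 (191.25 here)
s = mse ** 0.5       # standard error of the estimate (about 13.829)
```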
Sampling Distribution of b0 and b1
- Recall that in the Armand’s Pizza example, we collected data from a sample of 10 restaurants and used the least squares method to develop the regression equation \hat{y} = 60 + 5x
- If we take a new sample of 10 restaurants, and use least squares again, we will obtain a different regression equation
- Thus, b_0 and b_1 are indeed random variables, and we can define a sampling distribution for them
- We are particularly interested in the “sampling distribution of b_1” because we are trying to conduct a hypothesis test about \beta_1
Sampling Distribution of b_1
- The “sampling distribution of b_1” can be defined by its mean, standard deviation (standard error), and shape
- Expected value (mean): E[b_1] = \beta_1
- Standard deviation (standard error): s_{b_1} = \frac{s}{\sqrt{\sum(x_i - \bar{x})^2}}, where s = \sqrt{\frac{SSE}{n-2}}
- Shape: Normal
Testing for Significance: t Test
- The simple linear regression model is y = \beta_0 + \beta_1 x + \epsilon
- If x and y are linearly related, we must have \beta_1 \neq 0
- The purpose of the t test is to see whether we can conclude that \beta_1 \neq 0
- We will use the sample data to test the following hypotheses about the parameter \beta_1:
- H_0: \beta_1 = 0
- H_1: \beta_1 \neq 0
Steps of Hypothesis Testing for Significance - p-value approach
- The p-value Approach:
- Step 1. Develop the null and alternative hypotheses
- H_0: \beta_1 = 0
- H_1: \beta_1 \neq 0
- Step 2. Specify the level of significance \alpha
- Step 3. Collect the sample data, then calculate b_1 (the least squares estimate of \beta_1), the standard error s_{b_1}, and the test statistic t = \frac{b_1}{s_{b_1}}
- Step 4. Compute the two-tailed p-value as follows:
- p-value = 2 \cdot P(t > |test statistic|)
- In Excel: =1-(T.DIST(ABS(t), n-2, TRUE)-T.DIST(-ABS(t), n-2, TRUE))
- Step 5. Reject H_0 if the p-value ≤ \alpha
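A sketch of Steps 3 through 5 in Python, using scipy.stats.t for the t distribution (the helper name and return format are ours):

```python
from scipy import stats

def slope_t_test(x, y, b1, s, alpha=0.05):
    """Two-tailed t test of H0: beta_1 = 0, p-value approach."""
    n = len(x)
    x_bar = sum(x) / n
    # standard error of b1: s / sqrt(sum((x_i - x_bar)^2))
    s_b1 = s / sum((xi - x_bar) ** 2 for xi in x) ** 0.5
    t_stat = b1 / s_b1
    # p-value = 2 * P(T > |t|), T ~ t distribution with n - 2 df
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    return t_stat, p_value, p_value <= alpha  # True means reject H0
```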
Steps of Hypothesis Testing for Significance - Critical Value Approach
- The Critical Value Approach:
- Step 1. Develop the null and alternative hypotheses
- H_0: \beta_1 = 0
- H_1: \beta_1 \neq 0
- Step 2. Specify the level of significance \alpha
- Step 3. Collect the sample data, then calculate b_1 (the least squares estimate of \beta_1), the standard error s_{b_1}, and the test statistic t = \frac{b_1}{s_{b_1}}
- Step 4. Compute the critical value t_{\alpha/2} as follows:
- t_{\alpha/2} = T.INV(\alpha/2, n-2)
- Step 5. Reject H_0 if |t| \geq |t_{\alpha/2}|
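The corresponding call in Python mirrors the Excel formula; like T.INV, scipy's t.ppf returns the lower-tail (negative) value:

```python
from scipy import stats

alpha, n = 0.05, 10
t_crit = stats.t.ppf(alpha / 2, df=n - 2)  # -2.306 for df = 8
# Rejection rule: reject H0 if abs(t_stat) >= abs(t_crit)
```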
Example – Armand’s Pizza - Calculating t and s_b1
- Conduct a hypothesis test for the significance of the relationship between student population and quarterly sales at \alpha = 0.05
- Recall that we previously found that SSE = 1530
- s^2 = \frac{SSE}{n-2} = \frac{1530}{10-2} = 191.25
- s = \sqrt{MSE} = \sqrt{191.25} = 13.829
- We also calculated \sum(x_i - \bar{x})^2 previously. So,
- \sum(x_i - \bar{x})^2 = 568
- s_{b_1} = \frac{s}{\sqrt{\sum(x_i - \bar{x})^2}} = \frac{13.829}{\sqrt{568}} = 0.5803
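These values also drop out of the running sketch (s was computed after the standard error section above):

```python
s_b1 = s / 568 ** 0.5  # 13.829 / 23.833
print(round(s_b1, 4))  # 0.5803
```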
Example – Armand’s Pizza - p-value approach
- Here's how to conduct the hypothesis test using the p-value approach:
- Step 1. The null and alternative hypotheses are:
- H_0: \beta_1 = 0
- H_1: \beta_1 \neq 0
- Step 2. \alpha = 0.05
- Step 3. We previously calculated b_1 and s_{b_1}. So, the test statistic is:
- t = \frac{b_1}{s_{b_1}} = \frac{5}{0.5803} = 8.62
- Step 4. The p-value is:
- p-value = 2 \cdot P(t > |test statistic|)
- In Excel: =1-(T.DIST(ABS(t), n-2, TRUE)-T.DIST(-ABS(t), n-2, TRUE)) ≈ 0
- Step 5. We reject H_0 because the p-value ≤ \alpha (i.e., ≈ 0 < 0.05)
- Conclusion: A significant relationship exists between student population and quarterly sales at Armand’s Pizza restaurant
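Numerically, with scipy (the same computation as the Excel formula above):

```python
from scipy import stats

t_stat = 5 / 0.5803                     # 8.62
p_value = 2 * stats.t.sf(t_stat, df=8)  # far below 0.05, hence "p-value ~ 0"
print(p_value <= 0.05)                  # True: reject H0
```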
Example – Armand’s Pizza - Critical Value Approach
- Here's how to conduct the hypothesis test using the critical value approach:
- Steps 1-3 are identical to the p-value approach
- Step 4. The critical value t_{\alpha/2} is:
- t_{\alpha/2} = T.INV(\alpha/2, n-2) = T.INV(0.025, 8) = -2.306
- Step 5. We reject H_0 because |t| \geq |t_{\alpha/2}| (i.e., 8.62 > 2.306)
Hypothesis Testing for Significance in a Regression Model - Summary
- Two-Tailed Test
- Hypotheses
- H_0: \beta_1 = 0
- H_1: \beta_1 \neq 0
- Test Statistic
- t = \frac{b_1}{s_{b_1}}
- p-Value
- p-value = 2 \cdot P(t > |test statistic|)
- In Excel: =1-(T.DIST(ABS(t), n-2, TRUE)-T.DIST(-ABS(t), n-2, TRUE))
- Rejection Rule: p-Value Approach
- Reject H_0 if p-value ≤ \alpha
- Critical Value
- t_{\alpha/2} = T.INV(\alpha/2, n-2)
- Rejection Rule: Critical Value Approach
- Reject H_0 if |t| \geq |t_{\alpha/2}|
- Formulas:
- b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}
- s_{b_1} = \frac{s}{\sqrt{\sum(x_i - \bar{x})^2}}
- SSE = \sum(y_i - \hat{y}_i)^2 = \sum(y_i - b_0 - b_1 x_i)^2
- s = \sqrt{MSE} = \sqrt{\frac{SSE}{n-2}}
- \hat{y} = b_0 + b_1 x
Some Cautions about the Interpretation of Significance Tests
- Regression analysis cannot be used as evidence of a cause-and-effect relationship.
- For example, a test of significance for a relationship between age (x) and blood sugar (y) rejects H_0: \beta_1 = 0, confirming a significant statistical relationship. However, we cannot necessarily conclude that an increase in age causes an increase in blood sugar
Some Cautions about the Interpretation of Significance Tests - Linearity
- Just because we are able to reject H_0 and demonstrate statistical significance does not enable us to conclude that the relationship between x and y is truly linear
- We can state only that x and y are related and that a linear relationship explains a significant portion of the variability in y over the range of values for x observed in the sample
Summary of Formulas
- Simple Linear Regression Model: y = \beta_0 + \beta_1 x + \epsilon
- Simple Linear Regression Equation: E[y] = \beta_0 + \beta_1 x
- Estimated Simple Linear Regression Equation: \hat{y} = b_0 + b_1 x
- Least Squares Criterion: \min \sum(y_i - \hat{y}_i)^2
- Slope and y-Intercept for the Estimated Regression Equation:
- b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}
- b_0 = \bar{y} - b_1\bar{x}
- Sum of Squares Due to Error:
- SSE = \sum(y_i - \hat{y}_i)^2
- Total Sum of Squares:
- SST = \sum(y_i - \bar{y})^2
- Sum of Squares Due to Regression:
- SSR = \sum(\hat{y}_i - \bar{y})^2
- Relationship Among SST, SSR, and SSE:
- SST = SSR + SSE
- Coefficient of Determination:
- r^2 = \frac{SSR}{SST}
- Sample Correlation Coefficient:
- r = (sign \ of \ b_1)\sqrt{r^2}
- Mean Square Error (Estimate of \sigma^2):
- s^2 = MSE = \frac{SSE}{n-2}
- Standard Error of the Estimate:
- s = \sqrt{MSE} = \sqrt{\frac{SSE}{n-2}}
- Estimated Standard Deviation of b_1:
- s_{b_1} = \frac{s}{\sqrt{\sum(x_i - \bar{x})^2}}
- t Test Statistic:
- t = \frac{b_1}{s_{b_1}}