
Week 10: Regression Analysis

Introduction

  • Regression analysis is used to model relationships between variables.

  • It helps isolate and quantify the effect of one independent variable on a dependent variable while holding other factors constant.

  • Example: Quantifying the effect of advertising investment on sales. If investment in advertising increases by $1000, what change in sales should we expect?

  • Regression analysis explores causal links among variables, placing the “cause” X on the X-axis and the “consequence” Y on the Y-axis.

Regression Line

  • Regression finds the line that best fits the data.

  • The regression line is a prediction/forecasting device.

  • A line in mathematics is represented as Y = a + bX, where a is the intercept and b is the slope.

  • An upward-sloping line implies b > 0: Y increases with X.

  • A downward-sloping line implies b < 0: Y decreases with X.

  • A flat line implies b = 0: Y is not affected by X.

Simple Regression Model

  • The simple regression model (one independent variable) is given by: y_i = \beta_0 + \beta_1 x_i + \epsilon_i

    • \epsilon_i is the error/disturbance term for the i-th observation and it follows a normal distribution: \epsilon_i \sim N(0, \sigma). It measures random noise that causes observations to deviate from the line Y = \beta_0 + \beta_1 X.

    • \beta_0 is the intercept of the regression.

    • \beta_1 is the slope of X.

  • Population regression model: y_i = \beta_0 + \beta_1 x_i + \epsilon_i

  • Sample estimated regression model can be represented in two styles:

    • Style 1: \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i

    • Style 2: \hat{y}_i = b_0 + b_1 x_i

  • The residual e_i is computed as: e_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i

Computing Coefficients

  • The regression line is found by minimizing the differences between the actual values y_i and the predicted values \hat{y}_i, which is equivalent to minimizing the residuals e_i.

  • The best regression line is defined by the \hat{\beta}_0 and \hat{\beta}_1 that minimize the sum of squared residuals (see the sketch after this list):

    • \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2
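
As a minimal sketch (with made-up numbers, not data from the lecture), the least-squares estimates for the simple model have a closed form that is easy to compute directly in Python:

```python
import numpy as np

# Hypothetical data: advertising spend (x, in $1K) and sales (y, in $1K)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.1])

# Closed-form least-squares estimates for y = b0 + b1 * x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residuals and the minimized sum of squared residuals
e = y - (b0 + b1 * x)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}, SSR = {np.sum(e**2):.4f}")
```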

Multiple Regression Model

  • A regression model with more than one independent variable.

  • A multiple regression model has three components: variables, coefficients, and error terms.

  • The multiple regression model is given by: y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + \cdots + \beta_j x_{j,i} + \epsilon_i

    • \epsilon_i \sim N(0, \sigma) is the normally distributed error term.

Components of Multiple Regression Model

  • We have j independent variables/regressors/predictors/explanatory variables x_{1,i}, x_{2,i}, \dots, x_{j,i} for the i-th observation.

  • y_i is the dependent variable for the i-th observation.

  • \beta_0 is the intercept, also called the coefficient of the constant.

  • \beta_j is the slope for the j-th independent variable X_j, also called the coefficient of X_j.

Computing Coefficients for Multiple Regression

  • The coefficients for a multiple regression model are estimated by minimizing the sum of squared residuals.

  • The estimated coefficients \hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_j are chosen to minimize the sum of squared residuals (see the sketch below): \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i} - \cdots - \hat{\beta}_j x_{j,i})^2
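
A minimal sketch of this minimization in Python, with hypothetical data for two regressors: stacking a column of ones with the regressors and calling numpy's least-squares solver returns exactly the coefficients that minimize the sum of squared residuals.

```python
import numpy as np

# Hypothetical data: n = 6 observations of two independent variables
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([5.2, 5.1, 9.8, 9.7, 14.3, 14.1])

# Prepend a column of ones so the first coefficient is the intercept beta_0
X_design = np.column_stack([np.ones(len(y)), X])

# lstsq returns the coefficient vector minimizing sum((y - X_design @ beta)^2)
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("estimates (beta0_hat, beta1_hat, beta2_hat):", beta_hat)
```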

Example Regression Model

  • Regression model: y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i for i = 1, \dots, 100.

    • \epsilon_i \sim N(0, \sigma) is the normally distributed error term.

    • y_i is the dependent variable: The annual income of individual i (in $1K).

    • x_{1,i} is an independent variable: Years of schooling of individual i.

    • x_{2,i} is an independent variable: Years of working experience of individual i.

Regression Analysis: Prediction

  • Given the values of the coefficients (the \beta’s), we can predict the value of the dependent variable for an observation with given x’s via the regression line (i.e. the regression model without \epsilon_i).

  • Suppose we have a regression model: y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i

  • Suppose the coefficients are \beta_0 = 33, \beta_1 = 1.1, \beta_2 = 2.3.

  • We can write down the regression line: Y = 33 + 1.1X_1 + 2.3X_2
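
  • For example, with hypothetical values X_1 = 16 (years of schooling) and X_2 = 5 (years of experience), the prediction is Y = 33 + 1.1 \times 16 + 2.3 \times 5 = 33 + 17.6 + 11.5 = 62.1, i.e. an expected annual income of $62.1K.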

Interpretation of Slope Coefficients

  • A one-unit increase in x (the independent variable) increases y (the dependent variable) by \beta units, holding all other variables in the model constant.

  • Given the coefficients \beta_0 = 33, \beta_1 = 1.1, \beta_2 = 2.3, we can write down the regression line: Y = 33 + 1.1X_1 + 2.3X_2

Regression Analysis: Interpretation of β’s

  • The regression coefficients, or \beta_1, \dots, \beta_j (the slopes), bear the following interpretation:

    • \beta_1: Keeping other variables constant, a one-unit increase in X_1 is expected to increase Y by \beta_1 units.

    • \beta_2: Keeping other variables constant, a one-unit increase in X_2 is expected to increase Y by \beta_2 units.

    • \beta_j: Keeping other variables constant, a one-unit increase in X_j is expected to increase Y by \beta_j units.

  • The coefficient of the constant, or \beta_0 (the intercept), bears the following interpretation:

    • \beta_0: When all independent variables equal 0, Y is expected to equal \beta_0. That is, \beta_0 is the expected value of Y when all X’s are zero.

Importance of Holding Variables Constant

  • Omitting a variable that is correlated with an included variable can lead to biased estimates. E.g. if years of experience affect income and are correlated with years of schooling, omitting experience biases the estimated effect of schooling.

  • Multiple regression allows us to isolate the individual effect of the different independent variables.

Interpreting Coefficients

  • Consider a regression model of house price on size, age, and location:

    • y_i = 1.3 + 2.9x_{1,i} - 0.05x_{2,i} - 0.2x_{3,i} + \epsilon_i

      • \epsilon_i \sim N(0, \sigma) is the normally distributed error term.

      • y_i is the price of the i-th property (in $K/m^2).

      • x_{1,i} is the size of the i-th property (in m^2).

      • x_{2,i} is the number of years since the i-th property was built.

      • x_{3,i} is the distance of the i-th property from the CBD (in km).

  • Interpret the regression coefficients:

    • Keeping age and location constant, one square meter increase in size is expected to increase the property price by 2.9 thousand dollars per square meter.

    • Keeping size and location constant, one more year since the property was built is expected to decrease the property price by 0.05 thousand dollars per square meter.

    • Keeping size and age constant, one kilometer further away from the CBD is expected to decrease the property price by 0.2 thousand dollars per square meter.

  • Interpret the intercept: If the property size, the number of years since it was built, and the distance to the CBD all equal zero, the property price is expected to be 1.3 thousand dollars per square meter.

Interpreting Regression Results

  • The Excel regression output has three tables: the regression summary, the table for the test of joint significance (the ANOVA table), and the table of estimated coefficients.

Estimated Coefficients

  • This table reports all estimated coefficients.

  • From this table, we can write down the estimated model.

  • y_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1,i} + \hat{\beta}_2 x_{2,i} + \hat{\beta}_3 x_{3,i} + e_i

  • where \hat{\beta}_0 = 1.1993, \hat{\beta}_1 = 2.8944, \hat{\beta}_2 = -0.0453 and \hat{\beta}_3 = -0.2472 (see the sketch below)
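
The Excel tables themselves are not reproduced in these notes. As a sketch, the same three blocks of output can be generated in Python with statsmodels, here on simulated data loosely following the house-price example (all numbers are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100

# Simulated regressors: size (m^2), age (years), distance to CBD (km)
size = rng.uniform(50, 200, n)
age = rng.uniform(0, 50, n)
dist = rng.uniform(1, 30, n)
price = 1.2 + 2.9 * size - 0.05 * age - 0.25 * dist + rng.normal(0, 5, n)

# add_constant prepends the intercept column; OLS minimizes squared residuals
X = sm.add_constant(np.column_stack([size, age, dist]))
results = sm.OLS(price, X).fit()

# summary() reports R-squared, the joint F test (Excel's "Significance F"
# appears as "Prob (F-statistic)"), and the coefficient table with standard
# errors, t statistics, p-values, and 95% confidence intervals
print(results.summary())
```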

Hypothesis Testing Coefficients

  • Used to test H_0: \beta_j = 0

  • The 95% confidence interval is constructed as follows (the critical value t_{0.025,v} is not reported in the output):

    • \hat{\beta} \pm t_{0.025,v} \times standard\ error

  • t\ Stat is the test statistic t\ Stat = \frac{\hat{\beta} - 0}{standard\ error} for testing H_0: \beta_j = 0 against H_a: \beta_j \neq 0

  • This is to test whether the j-th independent variable X_j has a significant effect on the dependent variable Y. E.g. whether or not size has a significant effect on property price.
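
A sketch of these two computations, taking the size coefficient \hat{\beta}_1 = 2.8944 from the estimated model above and an assumed standard error (the actual standard error is not reported in these notes):

```python
from scipy import stats

beta_hat = 2.8944   # estimated coefficient for size (from the notes)
se = 0.15           # standard error -- assumed for illustration only
n, j = 100, 3       # sample size and number of regressors (assumed)
v = n - j - 1       # residual degrees of freedom

t_stat = (beta_hat - 0) / se                 # t Stat for H0: beta = 0
t_crit = stats.t.ppf(0.975, v)               # critical value t_{0.025, v}
ci_low = beta_hat - t_crit * se              # lower end of 95% CI
ci_high = beta_hat + t_crit * se             # upper end of 95% CI
p_value = 2 * stats.t.sf(abs(t_stat), v)     # two-sided p-value
print(f"t = {t_stat:.2f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f}), p = {p_value:.4g}")
```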

Test for Statistically Significant Effect

  • The exam may ask which variables are statistically significant (i.e. have a significant effect on Y).

  • All we need to do is to compare the reported p-value with the given significance level \alpha.

  • If the j-th p-value < \alpha, reject H_0: \beta_j = 0, so \beta_j is significantly different from 0. X_j is a significant variable: it has a significant effect on Y.

  • If the j-th p-value > \alpha, fail to reject H_0: \beta_j = 0, so \beta_j is not significantly different from 0. X_j is not a significant variable: it has no significant effect on Y (see the sketch below).
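
The decision rule is mechanical, as this small sketch shows (the p-values here are invented for illustration):

```python
# Hypothetical p-values read off an Excel coefficient table
p_values = {"Size": 0.0001, "Age": 0.3200, "Distance to CBD": 0.0150}
alpha = 0.05  # given significance level

for name, p in p_values.items():
    verdict = "significant" if p < alpha else "not significant"
    print(f"{name}: p-value = {p:.4f} -> {verdict} at alpha = {alpha}")
```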

Test for Joint Significance

  • Apart from testing whether or not an individual variable X_j is significant, we can test whether the whole regression model has explanatory power for Y (whether a regression model is useful).

  • This is called test for joint significance.

  • Joint significance means the independent variables, taken together, have explanatory power for Y:

  • The null and the alternative hypothesis:

    • H_0: all regression coefficients are zero, or \beta_1 = \beta_2 = \cdots = \beta_j = 0

    • H_a: at least one regression coefficient is nonzero

Test for Joint Significance (Continued)

  • In the example, we have H_0: \beta_1 = \beta_2 = \beta_3 = 0 and H_a: at least one regression coefficient is nonzero.

  • We focus on the p-value Significance\ F in the table for this test.

  • If \alpha < Significance\ F, fail to reject the null

  • If \alpha > Significance\ F, reject the null
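
For reference, Significance F can be reproduced from R^2, the sample size, and the number of regressors. A sketch with assumed numbers (the lecture's actual values are not reproduced here):

```python
from scipy import stats

r2 = 0.85     # R Square -- assumed for illustration
n = 100       # number of observations
j = 3         # number of independent variables

# F statistic for H0: beta_1 = ... = beta_j = 0
f_stat = (r2 / j) / ((1 - r2) / (n - j - 1))
significance_f = stats.f.sf(f_stat, j, n - j - 1)   # the p-value "Significance F"
print(f"F = {f_stat:.2f}, Significance F = {significance_f:.2e}")
```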

Coefficient of Determination, or R^2

  • An important statistic in the regression summary table is R\ Square, the so-called coefficient of determination, or R^2.

  • It is always between 0 and 1.

  • R^2 measures the explanatory power/performance/overall fit of a regression model.

  • R^2 \times 100\% of the variation in the dependent variable (Y) can be explained by variation in the independent variables (X’s).

  • A model with R^2 = 0 is a useless model, because the X’s explain none of the variation in Y.

  • A model with R^2 = 1 is a perfect model, because Y is fully explained by X’s.

  • A model with a higher R^2 is usually preferred because it is more “powerful”.
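
A sketch of how R^2 is computed from a fitted model (hypothetical numbers): it is one minus the share of the variation in Y left unexplained by the model.

```python
import numpy as np

# Hypothetical actual values y and fitted values y_hat from some regression
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.1])
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])

ss_res = np.sum((y - y_hat) ** 2)       # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.4f}")  # fraction of variation in y explained by the X's
```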

Regression Analysis: Summary

  • Motivation of regression analysis:

    1. We hypothesize that some variables (X_1, X_2, \dots, X_j) may have an effect on a variable (Y), and we want to quantify that effect. E.g. we hypothesize that size, age, and location may affect the property price.

    2. Given some value of independent variables, we want to predict the value of Y. E.g. Given size, age, and location, what is a property expected to cost?

  • In the exam, you need to know the following:

    1. How to write down a regression model

    2. How to write down the estimated model given Excel output

    3. How to interpret the coefficients

    4. How to predict Y given values of X’s

    5. How to test if a variable is significant or has a significant effect on the dependent variable, based on the Excel output. What are the null and the alternative hypotheses?

    6. How to test if independent variables are jointly significant, based on the Excel output. What are the null and the alternative hypotheses?

    7. How to interpret the R^2