Regression Analysis

Is a Linear Regression Appropriate?

  • A regression line may be found for any relationship, but it may not be appropriate or make sense.

Regression Conditions

  • Conditions that need to be met to properly use a least squares regression line:
    • Linearity
    • Approximately normal residuals
    • Constant variability
    • Independence

Linearity

  • Linear regression requires a linear relationship.
  • Ways to check linearity:
    • Scatterplot
    • Residuals Plot

Scatterplot

  • Visually assess if the data appears roughly linear.

Idea Behind Residuals Plot

  • Residual = observed value - predicted value
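The residual definition above can be sketched in a few lines of pure Python, using a made-up fitted line and made-up observations (both are illustrative, not from the course data):

```python
# Residual = observed value - predicted value.
# Hypothetical fitted line: y_hat = 2 + 3x.
def predict(x):
    return 2 + 3 * x

xs = [1, 2, 3]
observed = [5.5, 7.5, 11.5]

# One residual per observation: how far each point falls from the line.
residuals = [y - predict(x) for x, y in zip(xs, observed)]
print(residuals)  # [0.5, -0.5, 0.5]
```

A residuals plot is simply these values plotted against `xs`.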

Residuals Plot

  • Plot residuals against the predictor variable.
  • Should show no discernible pattern if linearity is met.

Approximately Normal Residuals

  • Residuals should be approximately normally distributed.
  • Check using a histogram and QQ-plot of the residuals.

Residuals QQ-plot

  • Points should fall close to the line to indicate normality.

Constant Variability

  • The variability of the residuals should be similar across all values of the predictor variable, similar to the equal-variance condition in ANOVA.
  • Check using the residual plot; look for consistent spread.

Independence

  • Observations must be independent.
  • Check how the data were collected.

Other Ways Regression Can Go Wrong

  • Meeting the conditions doesn’t guarantee appropriateness.
  • There must be a natural outcome variable and predictor variable.
  • Outliers can significantly affect the line and lead to unstable conclusions.

Outliers

  • Observations that fall outside the main group of points.
  • Leverage: An outlier’s level of pull on the regression line. High leverage points are horizontally far from the main group.
  • Influential points: Points whose removal substantially changes the slope of the regression line. High-leverage points are often influential.

Outliers Examples

  • Examples provided to illustrate different types of outliers and their influence on the regression line.

Regression with Categorical Predictor

  • Using a categorical variable as a predictor.
  • Example: Predicting price based on whether the item is new or used.

Idea Behind Regression with Categorical Predictor

  • Finds the mean of the outcome when the categorical variable is 0 (e.g. used) and the mean of the outcome when the categorical variable is 1 (e.g. new).
  • Still possible to fit a "line" to the relationship by showing how the means might differ.

Regression Output

  • Illustrative regression output:

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    42.87       0.81   52.67   0.0000
    cond_new       10.90       1.26    8.66   0.0000
    
  • If the reference level is "used", used games have a mean price of 42.87, and new games have a mean price of 42.87 + 10.90 = 53.77.

Interpreting the Slope

  • The slope represents the difference in means between the two groups.
  • The significance of the slope tests whether the two means are different.
  • A two-sample t-test or ANOVA with the same variables would yield the same significance results.
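The equivalence above can be verified directly: regressing the outcome on a 0/1 dummy gives an intercept equal to the reference-group mean and a slope equal to the difference in group means. A pure-Python sketch with hypothetical prices (0 = used, 1 = new):

```python
# Hypothetical data: condition dummy (0 = used, 1 = new) and prices.
cond = [0, 0, 0, 1, 1, 1]
price = [40, 43, 45, 52, 54, 56]

mean_used = sum(p for c, p in zip(cond, price) if c == 0) / 3
mean_new = sum(p for c, p in zip(cond, price) if c == 1) / 3

# Least squares fit of price on the 0/1 dummy.
n = len(cond)
mx = sum(cond) / n
my = sum(price) / n
b1 = (sum((c - mx) * (p - my) for c, p in zip(cond, price))
      / sum((c - mx) ** 2 for c in cond))
b0 = my - b1 * mx

# Intercept = mean of the reference group; slope = difference in means.
print(b0, b1)
```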

More than Two Categories

  • Categorical variables with more than two categories can be included, but they need to be converted into multiple binary (dummy) variables.

Making Dummy Variables

  • Example: Predicting email character count based on number category (none, small, big).
  • Use two binary variables: numbersmall and numberbig.

Dummy Variables

  • Example table showing how a three-category variable (number) is transformed into two binary variables (numbersmall, numberbig).
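The transformation can be sketched in pure Python: each level other than the reference level ("none") gets its own 0/1 column. The category values here are illustrative:

```python
# Three-level categorical variable; "none" is the reference level.
numbers = ["none", "small", "big", "small", "none"]

# Two dummy columns: a row is 1 only for its own level;
# the reference level is all zeros.
rows = [
    {"numbersmall": int(v == "small"), "numberbig": int(v == "big")}
    for v in numbers
]
print(rows[0])  # {'numbersmall': 0, 'numberbig': 0}  (reference level)
print(rows[1])  # {'numbersmall': 1, 'numberbig': 0}
```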

Regression Output

  • Example regression output with a categorical variable having more than two categories:

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   1.9955     0.6014   3.318 0.000915
    numbersmall   9.2233     0.6572  14.033  < 2e-16
    numberbig    14.8294     0.8521  17.403  < 2e-16
    

So Far

  • Discussion of moving from simple linear regression to multiple regression by including more predictor variables.
  • Analogy to moving from one-way ANOVA to factorial ANOVA.

Idea Behind Multiple Regression

  • Simple Linear Regression: ŷ = β0 + β1 * x
  • Multiple Regression: ŷ = β0 + β1 * x1 + β2 * x2 + β3 * x3

Multiple Regression

  • The β0, β1, β2, β3 values are estimated using the same concept of least squares (minimizing the sum of squared residuals).
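"Minimizing the sum of squared residuals" can be illustrated in the one-predictor case with pure Python and made-up data: the closed-form least squares estimates give a smaller SSR than any nearby candidate values.

```python
# Hypothetical data.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

def ssr(b0, b1):
    # Sum of squared residuals for candidate coefficients (b0, b1).
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Closed-form least squares estimates.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
b0 = my - b1 * mx

# Perturbing either coefficient only increases the SSR.
print(ssr(b0, b1) <= ssr(b0 + 0.1, b1))  # True
print(ssr(b0, b1) <= ssr(b0, b1 - 0.1))  # True
```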

Regression with Two Predictor Variables

  • Can include both variables and their interaction effect, similar to two-way ANOVA.
  • ŷ = β0 + β1 * x1 + β2 * x2 + β3 * x1 * x2

Example

  • Predicting miles per gallon using horsepower and weight:
  • mpg = β0 + β1 * hp + β2 * wt + β3 * hp * wt

Regression Output

  • Example regression output:

                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 49.80842    3.60516  13.816 5.01e-14
    hp          -0.12010    0.02470  -4.863 4.04e-05
    wt          -8.21662    1.26971  -6.471 5.20e-07
    hp:wt        0.02785    0.00742   3.753 0.000811
    
  • Estimated regression line: mpg = 49.80842 - 0.12010 * hp - 8.21662 * wt + 0.02785 * hp * wt
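The estimated line can be used for prediction by plugging in values. A sketch evaluating the coefficients from the output above at a hypothetical car (110 horsepower, weight 2.62; the car itself is an assumption, not from the notes):

```python
# Coefficients from the regression output above.
hp, wt = 110, 2.62
mpg = 49.80842 - 0.12010 * hp - 8.21662 * wt + 0.02785 * hp * wt
print(round(mpg, 2))  # ≈ 23.1
```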

Interpreting Coefficients

  • With multiple predictor variables, each coefficient must be interpreted in the context of the other variables in the model.

Interpreting Coefficients

  • The intercept shows the relationship when the other variable(s) are equal to 0.
  • Example: If weight is 0, for each additional horsepower, miles per gallon decreases by 0.12.

Interpreting Coefficients

  • Knowing the slope of horsepower when weight is 0 isn't substantively meaningful, because a car with zero weight doesn't make sense in the context of the problem.

Interpreting Interactions

  • A significant interaction term indicates that the effect of one variable differs according to the values of another variable.
  • Example: The effect of horsepower on miles per gallon differs according to the different values of weight.
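With an interaction term, the slope of horsepower is β1 + β3 * wt, so it changes with weight. A sketch using the hp and hp:wt coefficients from the earlier output (the two example weights are arbitrary):

```python
# Coefficients from the earlier hp/wt regression output.
b1 = -0.12010   # hp
b3 = 0.02785    # hp:wt interaction

# The effect of one extra horsepower depends on the car's weight.
slopes = {wt: b1 + b3 * wt for wt in (2.0, 4.0)}
print(slopes)  # steeper (more negative) hp effect for lighter cars
```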

Multiple Regression with Numerical and Categorical Variables

  • Both numerical and categorical variables can be included in a linear regression.
  • Example: Predicting miles per gallon using horsepower and transmission type (automatic or manual).
  • mpg = β0 + β1 * hp + β2 * trans + β3 * hp * trans

Interpreting Coefficients

  • When transmission type is the reference level (0), the effect of horsepower on miles per gallon is β1. When transmission type is 1, the effect of horsepower on miles per gallon is β1 + β3.

Interpreting Coefficients

  • If automatic transmission is the reference level (0):
    • When Automatic: mpg = β0 + β1 * hp
    • When Manual: mpg = (β0 + β2) + (β1 + β3) * hp

Interpreting Coefficients

  • Given regression output:

                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 26.6248479  2.1829432  12.197 1.01e-12
    hp          -0.0591370  0.0129449  -4.568 9.02e-05
    am           5.2176534  2.6650931   1.958   0.0603
    hp:am        0.0004029  0.0164602   0.024   0.9806
    
  • When the transmission (am) is automatic: mpg = 26.6248479 - 0.0591370 * hp

Interpreting Coefficients

  • Given the same regression output as before:
  • When the transmission (am) is manual: mpg = (26.6248479 + 5.2176534) + (-0.0591370 + 0.0004029) * hp, which simplifies to mpg = 31.8425013 - 0.0587341 * hp.
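Combining the coefficients for the manual line is plain arithmetic, sketched here with the values from the output above:

```python
# Coefficients from the output above (automatic = reference level).
b0, b1 = 26.6248479, -0.0591370   # intercept and hp slope for automatic
b2, b3 = 5.2176534, 0.0004029     # shifts applied when am = 1 (manual)

manual_intercept = b0 + b2   # 31.8425013
manual_slope = b1 + b3       # -0.0587341
print(manual_intercept, manual_slope)
```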

Multiple Regression

  • Any number of variables can be included in a linear regression.
  • Typically, not every single interaction is included because it can become complicated quickly.