Regression Analysis

Is a Linear Regression Appropriate?

  • A least squares line can be computed for any set of paired data, but the resulting line may not be appropriate or meaningful.

Regression Conditions

  • Conditions that need to be met to properly use a least squares regression line:
    • Linearity
    • Approximately normal residuals
    • Constant variability
    • Independence

Linearity

  • Linear regression requires a linear relationship.
  • Ways to check linearity:
    • Scatterplot
    • Residuals Plot

Scatterplot

  • Visually assess if the data appears roughly linear.

Idea Behind Residuals Plot

  • Residual = observed value - predicted value: e_i = y_i - \hat{y}_i

Residuals Plot

  • Plot residuals against the predictor variable.
  • Should show no discernible pattern if linearity is met.
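  • As a concrete illustration, here is a minimal R sketch (using the built-in mtcars data, an assumption; the original slides' data are not shown):

    # Fit a simple linear regression on built-in data
    fit <- lm(mpg ~ hp, data = mtcars)

    # Plot residuals against the predictor; a patternless cloud around 0
    # supports the linearity condition
    plot(mtcars$hp, residuals(fit), xlab = "Horsepower", ylab = "Residual")
    abline(h = 0, lty = 2)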

Approximately Normal Residuals

  • Residuals should be approximately normally distributed.
  • Check using a histogram and QQ-plot of the residuals.

Residuals QQ-plot

  • Points should fall close to the line to indicate normality.
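  • Continuing the same hedged mtcars sketch, both normality checks are one-liners in R:

    fit <- lm(mpg ~ hp, data = mtcars)

    # Histogram: look for a roughly symmetric, bell-shaped pile of residuals
    hist(residuals(fit), main = "Histogram of residuals", xlab = "Residual")

    # QQ-plot with reference line: points hugging the line suggest normality
    qqnorm(residuals(fit))
    qqline(residuals(fit))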

Constant Variability

  • Variances should be similar across different values of the predictor variable, much like the equal-variance condition in ANOVA.
  • Check using the residual plot; look for consistent spread.

Independence

  • Observations must be independent.
  • Check how the data were collected.

Other Ways Regression Can Go Wrong

  • Meeting the conditions doesn’t guarantee appropriateness.
  • Regression treats the two variables asymmetrically, so there must be a natural outcome (response) variable and predictor (explanatory) variable.
  • Outliers can significantly affect the line and lead to unstable conclusions.

Outliers

  • Observations that fall outside the main group of points.
  • Leverage: an outlier's potential to pull the regression line toward itself. High leverage points are horizontally far from the main group (far from the mean of the predictor).
  • Influential points: high leverage points that substantially change the slope of the regression line when they are included.

Outliers Examples

  • Examples (shown as figures in the original slides) illustrate different types of outliers and their influence on the regression line; a simulated sketch follows.
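  • Since the original figures are not reproduced here, a small simulated R illustration of a high-leverage, influential point:

    set.seed(1)
    x <- rnorm(30)
    y <- 2 * x + rnorm(30)

    # Add one point far to the right (high leverage) that also falls far
    # from the trend, so it drags the slope toward itself
    x2 <- c(x, 10)
    y2 <- c(y, 0)

    coef(lm(y ~ x))    # slope near the true value of 2
    coef(lm(y2 ~ x2))  # slope noticeably pulled down by the single outlier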

Regression with Categorical Predictor

  • Using a categorical variable as a predictor.
  • Example: Predicting price based on whether the item is new or used.

Idea Behind Regression with Categorical Predictor

  • Finds the mean of the outcome when the categorical variable is 0 (e.g. used) and the mean of the outcome when the categorical variable is 1 (e.g. new).
  • Still possible to fit a "line" to the relationship by showing how the means might differ.

Regression Output

  • Illustrative regression output:

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    42.87       0.81   52.67   0.0000
    cond_new       10.90       1.26    8.66   0.0000
    
  • If the reference level is "used" (cond_new = 0), used games have a mean price of 42.87, and new games have a mean price of 42.87 + 10.90 = 53.77.
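  • A minimal sketch of fitting such a model in R, using tiny made-up data (the data frame games and its values are hypothetical, not the original dataset):

    # Hypothetical toy data: cond_new is 0 for used, 1 for new
    games <- data.frame(price    = c(40, 44, 43, 52, 55, 54),
                        cond_new = c(0, 0, 0, 1, 1, 1))

    fit <- lm(price ~ cond_new, data = games)
    coef(fit)
    # (Intercept) is the mean price of used games;
    # cond_new is the difference in means (new minus used)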

Interpreting the Slope

  • The slope represents the difference in means between the two groups.
  • The significance of the slope tests whether the two means are different.
  • A two-sample t-test or ANOVA with the same variables would yield the same significance results.
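  • That equivalence is easy to verify in R on the same hypothetical toy data; the pooled-variance t-test gives the same p-value (and the same t statistic up to sign):

    games <- data.frame(price    = c(40, 44, 43, 52, 55, 54),
                        cond_new = c(0, 0, 0, 1, 1, 1))

    # t and p for the slope...
    summary(lm(price ~ cond_new, data = games))$coefficients

    # ...match the equal-variance two-sample t-test
    t.test(price ~ cond_new, data = games, var.equal = TRUE)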

More than Two Categories

  • Categorical variables with more than two categories can be included, but they need to be converted into multiple binary (dummy) variables.

Making Dummy Variables

  • Example: Predicting email character count based on number category (none, small, big).
  • Use two binary variables: numbersmall and numberbig.

Dummy Variables

  • Example: a three-category variable (number) is transformed into two binary variables (numbersmall, numberbig), with "none" as the reference level:

    number   numbersmall   numberbig
    none               0           0
    small              1           0
    big                0           1
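  • R constructs these dummy variables automatically from a factor; a minimal sketch:

    number <- factor(c("none", "small", "big", "small", "none"),
                     levels = c("none", "small", "big"))

    # model.matrix shows the coding: "none" is the reference level,
    # so both dummy columns are 0 for it
    model.matrix(~ number)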

Regression Output

  • Example regression output with a categorical variable having more than two categories:

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   1.9955     0.6014   3.318 0.000915
    numbersmall   9.2233     0.6572  14.033  < 2e-16
    numberbig    14.8294     0.8521  17.403  < 2e-16

  • The intercept is the mean outcome for the reference level (none); each slope is the difference in means between that category and the reference level.

So Far

  • Discussion of moving from simple linear regression to multiple regression by including more predictor variables.
  • Analogy to moving from one-way ANOVA to factorial ANOVA.

Idea Behind Multiple Regression

  • Simple Linear Regression: \hat{y} = \beta_0 + \beta_1 * x
  • Multiple Regression: \hat{y} = \beta_0 + \beta_1 * x_1 + \beta_2 * x_2 + \beta_3 * x_3

Multiple Regression

  • The \beta_0, \beta_1, \beta_2, \beta_3 values are estimated using the same concept of least squares (minimizing the sum of squared residuals), as sketched below.
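  • A hedged sketch making this concrete: minimizing the sum of squared residuals numerically recovers lm()'s coefficients (mtcars again, as an assumed example):

    # Sum of squared residuals for the model mpg = b0 + b1*hp + b2*wt
    rss <- function(b) {
      with(mtcars, sum((mpg - (b[1] + b[2] * hp + b[3] * wt))^2))
    }

    # Numerical minimization lands on (essentially) the lm() estimates
    optim(c(0, 0, 0), rss, method = "BFGS")$par
    coef(lm(mpg ~ hp + wt, data = mtcars))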

Regression with Two Predictor Variables

  • Can include both variables and their interaction effect, similar to two-way ANOVA.
  • \hat{y} = \beta_0 + \beta_1 * x_1 + \beta_2 * x_2 + \beta_3 * x_1 * x_2

Example

  • Predicting miles per gallon using horsepower and weight:
  • mpg = \beta_0 + \beta_1 * hp + \beta_2 * wt + \beta_3 * hp * wt
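  • The output that follows is consistent with fitting this model to R's built-in mtcars data (an inference from the numbers, not stated in the original):

    # hp * wt expands to hp + wt + hp:wt (both main effects plus interaction)
    fit <- lm(mpg ~ hp * wt, data = mtcars)
    summary(fit)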

Regression Output

  • Example regression output:

                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 49.80842    3.60516  13.816 5.01e-14
    hp          -0.12010    0.02470  -4.863 4.04e-05
    wt          -8.21662    1.26971  -6.471 5.20e-07
    hp:wt        0.02785    0.00742   3.753 0.000811
    
  • Estimated regression line: mpg = 49.80842 - 0.12010 * hp - 8.21662 * wt + 0.02785 * hp * wt

Interpreting Coefficients

  • With multiple predictor variables, the coefficients must be interpreted together: each slope describes the effect of one variable at particular values of the others.

Interpreting Coefficients

  • With an interaction in the model, each slope describes the relationship when the other variable equals 0.
  • Example: when weight is 0, each additional unit of horsepower decreases miles per gallon by 0.12.

Interpreting Coefficients

  • Knowing the slope of horsepower when weight is 0 isn't substantively meaningful, because a weight of 0 makes no sense in the context of the problem.

Interpreting Interactions

  • A significant interaction term indicates that the effect of one variable differs according to the values of another variable.
  • Example: the effect of horsepower on miles per gallon differs across the different values of weight. Grouping the hp terms shows this directly: mpg = \beta_0 + \beta_2 * wt + (\beta_1 + \beta_3 * wt) * hp, so the slope on hp is \beta_1 + \beta_3 * wt (see the sketch below).
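  • A sketch of that calculation, computing the implied horsepower slope at a few weights (using the mtcars fit assumed above):

    fit <- lm(mpg ~ hp * wt, data = mtcars)
    b <- coef(fit)

    # Slope of hp depends on wt: beta_1 + beta_3 * wt
    wt_values <- c(2, 3, 4)  # in 1000s of lbs, the units of mtcars wt
    b["hp"] + b["hp:wt"] * wt_values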

Multiple Regression with Numerical and Categorical Variables

  • Both numerical and categorical variables can be included in a linear regression.
  • Example: Predicting miles per gallon using horsepower and transmission type (automatic or manual).
  • mpg = \beta_0 + \beta_1 * hp + \beta_2 * trans + \beta_3 * hp * trans

Interpreting Coefficients

  • When transmission type is the reference level (0), the effect of horsepower on miles per gallon is \beta_1. When transmission type is 1, the effect of horsepower on miles per gallon is \beta_1 + \beta_3.

Interpreting Coefficients

  • If automatic transmission is the reference level (0):
    • When Automatic: mpg = \beta_0 + \beta_1 * hp
    • When Manual: mpg = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) * hp

Interpreting Coefficients

  • Given regression output:

                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) 26.6248479  2.1829432  12.197 1.01e-12
    hp          -0.0591370  0.0129449  -4.568 9.02e-05
    am           5.2176534  2.6650931   1.958   0.0603
    hp:am        0.0004029  0.0164602   0.024   0.9806
    
  • When the transmission (am) is automatic: mpg = 26.6248479 - 0.0591370 * hp

Interpreting Coefficients

  • Given the same regression output as before:
  • When the transmission (am) is manual: mpg = (26.6248479 + 5.2176534) + (-0.0591370 + 0.0004029) * hp, which simplifies to mpg = 31.8425 - 0.0587341 * hp
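  • Both lines can be extracted directly from the fitted model (the output above is consistent with mtcars, where am is 0 for automatic and 1 for manual):

    fit <- lm(mpg ~ hp * am, data = mtcars)
    b <- coef(fit)

    # Automatic (am = 0): intercept b0, slope b1
    c(b["(Intercept)"], b["hp"])

    # Manual (am = 1): intercept b0 + b2, slope b1 + b3
    c(b["(Intercept)"] + b["am"], b["hp"] + b["hp:am"])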

Multiple Regression

  • Any number of variables can be included in a linear regression.
  • Typically, not every possible interaction is included, because the model becomes complicated quickly; a main-effects-only sketch follows.
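  • For instance (mtcars again, as an assumed example), three predictors with main effects only keep the model simple:

    # No interaction terms: each slope is the effect of one variable
    # holding the others fixed
    fit <- lm(mpg ~ hp + wt + am, data = mtcars)
    summary(fit)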