Regression Analysis
Is a Linear Regression Appropriate?
- A regression line can be computed for any two quantitative variables, but the line may not be appropriate or meaningful.
Regression Conditions
- Conditions that need to be met to properly use a least squares regression line:
- Linearity
- Approximately normal residuals
- Constant variability
- Independence
Linearity
- Linear regression requires a linear relationship.
- Ways to check linearity:
- Scatterplot
- Residuals Plot
Scatterplot
- Visually assess if the data appears roughly linear.
Idea Behind Residuals Plot
- Residual = observed value - predicted value
Residuals Plot
- Plot residuals against the predictor variable.
- Should show no discernible pattern if linearity is met.
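As a minimal sketch in R (the data frame df and the variables y and x are hypothetical names), a residuals plot can be made like this:

    # Fit a simple linear regression and plot residuals vs. the predictor
    fit <- lm(y ~ x, data = df)

    plot(df$x, resid(fit),
         xlab = "x", ylab = "Residual",
         main = "Residuals vs. predictor")
    abline(h = 0, lty = 2)  # residuals should scatter randomly around 0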
Approximately Normal Residuals
- Residuals should be approximately normally distributed.
- Check using a histogram and QQ-plot of the residuals.
Residuals QQ-plot
- Points should fall close to the line to indicate normality.
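Continuing the sketch above, both checks take one line each in R:

    # Histogram and QQ-plot of the residuals from the fitted model
    hist(resid(fit), main = "Histogram of residuals", xlab = "Residual")

    qqnorm(resid(fit))  # points near the line suggest approximate normality
    qqline(resid(fit))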
Constant Variability
- The spread of the outcome should be similar across the values of the predictor variable, analogous to the equal-variance condition in ANOVA.
- Check using the residual plot; look for consistent spread.
Independence
- Observations must be independent.
- Check how the data were collected.
Other Ways Regression Can Go Wrong
- Meeting the conditions doesn’t guarantee appropriateness.
- There must be a natural outcome variable and predictor variable.
- Outliers can significantly affect the line and lead to unstable conclusions.
Outliers
- Observations that fall outside the main group of points.
- Leverage: An outlier’s level of pull on the regression line. High leverage points are horizontally far from the main group.
- Influential points: Points whose inclusion substantially changes the slope of the regression line; high-leverage points are the most likely to be influential.
Outliers Examples
- Examples provided to illustrate different types of outliers and their influence on the regression line.
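R also provides numeric companions to eyeballing these plots; for a fitted model fit as in the earlier sketches:

    hatvalues(fit)       # leverage: how far each point's x is from the bulk of the x values
    cooks.distance(fit)  # influence: how much each point moves the fitted line

    # A common rule of thumb flags points with Cook's distance near or above 1
    which(cooks.distance(fit) > 1)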
Regression with Categorical Predictor
- Using a categorical variable as a predictor.
- Example: Predicting price based on whether the item is new or used.
Idea Behind Regression with Categorical Predictor
- Finds the mean of the outcome when the categorical variable is 0 (e.g. used) and the mean of the outcome when the categorical variable is 1 (e.g. new).
- Still possible to fit a "line" to the relationship by showing how the means might differ.
Regression Output
Illustrative regression output:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)     42.87       0.81   52.67   0.0000
    cond_new        10.90       1.26    8.66   0.0000

If the reference level is "used", used games have a mean price of 42.87, and new games have a mean price of 42.87 + 10.90 = 53.77.
Interpreting the Slope
- The slope represents the difference in means between the two groups.
- The significance of the slope tests whether the two means are different.
- A pooled two-sample t-test or a one-way ANOVA on the same variables yields the same p-value.
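A quick way to see this equivalence in R (games, price, and cond_new are hypothetical names matching the output above):

    fit <- lm(price ~ cond_new, data = games)
    summary(fit)  # slope t-test

    # Same p-value from the pooled t-test and the one-way ANOVA
    t.test(price ~ factor(cond_new), data = games, var.equal = TRUE)
    anova(fit)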
More than Two Categories
- Categorical variables with more than two categories can be included, but they need to be converted into multiple binary (dummy) variables.
Making Dummy Variables
- Example: Predicting email character count based on number category (none, small, big).
- Use two binary variables: numbersmall and numberbig.
Dummy Variables
- Example: the three-category variable number becomes two binary variables:

        number   numbersmall   numberbig
        none          0             0
        small         1             0
        big           0             1
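R does this coding automatically for a factor; model.matrix() shows the columns it will use (emails is a hypothetical data frame):

    emails$number <- factor(emails$number, levels = c("none", "small", "big"))
    head(model.matrix(~ number, data = emails))
    # Columns: (Intercept), numbersmall, numberbig -- "none" is the reference level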
Regression Output
Example regression output with a categorical variable having more than two categories:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)    1.9955     0.6014   3.318 0.000915
    numbersmall    9.2233     0.6572  14.033  < 2e-16
    numberbig     14.8294     0.8521  17.403  < 2e-16
So Far
- Discussion of moving from simple linear regression to multiple regression by including more predictor variables.
- Analogy to moving from one-way ANOVA to factorial ANOVA.
Idea Behind Multiple Regression
- Simple Linear Regression: ŷ = b0 + b1 * x
- Multiple Regression: ŷ = b0 + b1 * x1 + b2 * x2 + b3 * x3 + ...
Multiple Regression
- The b0, b1, b2, b3 values are estimated using the same least squares criterion (minimizing the sum of squared residuals).
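In R this is the same lm() call with more terms on the right-hand side (y, x1, x2, x3, and df are hypothetical names):

    fit <- lm(y ~ x1 + x2 + x3, data = df)
    coef(fit)          # b0, b1, b2, b3

    sum(resid(fit)^2)  # the quantity least squares minimizes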
Regression with Two Predictor Variables
- Can include both variables and their interaction effect, similar to two-way ANOVA.
Example
- Predicting miles per gallon using horsepower and weight:
Regression Output
Example regression output:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 49.80842    3.60516  13.816 5.01e-14
    hp          -0.12010    0.02470  -4.863 4.04e-05
    wt          -8.21662    1.26971  -6.471 5.20e-07
    hp:wt        0.02785    0.00742   3.753 0.000811

Estimated regression line: mpg = 49.80842 - 0.12010 * hp - 8.21662 * wt + 0.02785 * hp * wt
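These numbers appear consistent with R's built-in mtcars data; the hp * wt shorthand expands to both main effects plus their interaction:

    # hp * wt is shorthand for hp + wt + hp:wt
    fit <- lm(mpg ~ hp * wt, data = mtcars)
    summary(fit)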
Interpreting Coefficients
- With multiple predictor variables, the coefficients must be interpreted together rather than in isolation.
Interpreting Coefficients
- With an interaction in the model, each main-effect slope describes the relationship when the other variable equals 0.
- Example: If weight is 0, each additional horsepower decreases miles per gallon by 0.12.
Interpreting Coefficients
- Knowing the slope of horsepower when weight is 0 tells you nothing substantive, because a weight of 0 makes no sense in the context of the problem.
Interpreting Interactions
- A significant interaction term indicates that the effect of one variable differs according to the values of another variable.
- Example: The effect of horsepower on miles per gallon differs according to the different values of weight.
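A sketch of this using the model above: the slope of hp shifts with wt (in mtcars, wt is recorded in 1000s of pounds):

    b <- coef(lm(mpg ~ hp * wt, data = mtcars))
    for (w in c(2, 3, 4)) {
      cat("hp slope at wt =", w, ":", b["hp"] + b["hp:wt"] * w, "\n")
    }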
Multiple Regression with Numerical and Categorical Variables
- Both numerical and categorical variables can be included in a linear regression.
- Example: Predicting miles per gallon using horsepower and transmission type (automatic or manual).
Interpreting Coefficients
- When transmission type is the reference level (0), the effect of horsepower on miles per gallon is the hp coefficient alone. When transmission type is 1, the effect of horsepower is the hp coefficient plus the hp:am interaction coefficient.
Interpreting Coefficients
- If automatic transmission is the reference level (0):
- When Automatic: mpg = b0 + b1 * hp
- When Manual: mpg = (b0 + b2) + (b1 + b3) * hp
- Here b2 is the am coefficient and b3 is the hp:am interaction coefficient.
Interpreting Coefficients
Given regression output:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) 26.6248479  2.1829432  12.197 1.01e-12
    hp          -0.0591370  0.0129449  -4.568 9.02e-05
    am           5.2176534  2.6650931   1.958   0.0603
    hp:am        0.0004029  0.0164602   0.024   0.9806

When the transmission (am) is automatic:
mpg = 26.6248479 - 0.0591370 * hp
Interpreting Coefficients
- Given the same regression output as before:
- When the transmission (am) is manual: mpg = (26.6248479 + 5.2176534) + (-0.0591370 + 0.0004029) * hp, which simplifies to mpg = 31.8425013 - 0.0587341 * hp
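This output also appears consistent with mtcars, so the manual-line arithmetic can be done from the coefficients directly:

    fit <- lm(mpg ~ hp * am, data = mtcars)
    b <- coef(fit)

    # Manual (am = 1) intercept and hp slope
    c(intercept = unname(b["(Intercept)"] + b["am"]),
      slope     = unname(b["hp"] + b["hp:am"]))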
Multiple Regression
- Any number of variables can be included in a linear regression.
- Typically, not every single interaction is included because it can become complicated quickly.
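As an illustrative sketch (this particular model choice is hypothetical, not prescribed by the source), one might keep only the interaction believed to matter:

    # Three predictors, but only the hp:wt interaction retained
    fit <- lm(mpg ~ hp + wt + am + hp:wt, data = mtcars)
    summary(fit)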