Regression Analysis

Is a Linear Regression Appropriate?

  • A regression line may be found for any relationship, but it may not be appropriate or make sense.

Regression Conditions

  • Conditions that need to be met to properly use a least squares regression line:
    • Linearity
    • Approximately normal residuals
    • Constant variability
    • Independence

Linearity

  • Linear regression requires a linear relationship.
  • Ways to check linearity:
    • Scatterplot
    • Residuals Plot

Scatterplot

  • Visually assess if the data appears roughly linear.

Idea Behind Residuals Plot

  • Residual = observed value - predicted value
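The residual definition above can be sketched in a few lines of pure Python, using a made-up fitted line and made-up observations (both are illustrative, not from the course data):

```python
# Residual = observed value - predicted value.
# Hypothetical fitted line: y_hat = 2 + 3x.
def predict(x):
    return 2 + 3 * x

xs = [1, 2, 3]
observed = [5.5, 7.5, 11.5]

# One residual per observation: how far each point falls from the line.
residuals = [y - predict(x) for x, y in zip(xs, observed)]
print(residuals)  # [0.5, -0.5, 0.5]
```

A residuals plot is simply these values plotted against `xs`.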

Residuals Plot

  • Plot residuals against the predictor variable.
  • Should show no discernible pattern if linearity is met.

Approximately Normal Residuals

  • Residuals should be approximately normally distributed.
  • Check using a histogram and QQ-plot of the residuals.

Residuals QQ-plot

  • Points should fall close to the line to indicate normality.

Constant Variability

  • The variability of the residuals should be similar across all values of the predictor variable, similar to the equal-variance condition in ANOVA.
  • Check using the residual plot; look for consistent spread.

Independence

  • Observations must be independent.
  • Check how the data were collected.

Other Ways Regression Can Go Wrong

  • Meeting the conditions doesn’t guarantee appropriateness.
  • There must be a natural outcome variable and predictor variable.
  • Outliers can significantly affect the line and lead to unstable conclusions.

Outliers

  • Observations that fall outside the main group of points.
  • Leverage: An outlier’s level of pull on the regression line. High leverage points are horizontally far from the main group.
  • Influential points: Points whose removal substantially changes the slope of the regression line. High-leverage points are often influential.

Outliers Examples

  • Examples provided to illustrate different types of outliers and their influence on the regression line.

Regression with Categorical Predictor

  • Using a categorical variable as a predictor.
  • Example: Predicting price based on whether the item is new or used.

Idea Behind Regression with Categorical Predictor

  • Finds the mean of the outcome when the categorical variable is 0 (e.g. used) and the mean of the outcome when the categorical variable is 1 (e.g. new).
  • Still possible to fit a "line" to the relationship by showing how the means might differ.

Regression Output

  • Illustrative regression output:

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    42.87       0.81   52.67   0.0000
    cond_new       10.90       1.26    8.66   0.0000
    
  • If the reference level is "used", used games have a mean price of 42.87, and new games have a mean price of 42.87 + 10.90 = 53.77.

Interpreting the Slope

  • The slope represents the difference in means between the two groups.
  • The significance of the slope tests whether the two means are different.
  • A two-sample t-test or ANOVA with the same variables would yield the same significance results.
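The equivalence above can be verified directly: regressing the outcome on a 0/1 dummy gives an intercept equal to the reference-group mean and a slope equal to the difference in group means. A pure-Python sketch with hypothetical prices (0 = used, 1 = new):

```python
# Hypothetical data: condition dummy (0 = used, 1 = new) and prices.
cond = [0, 0, 0, 1, 1, 1]
price = [40, 43, 45, 52, 54, 56]

mean_used = sum(p for c, p in zip(cond, price) if c == 0) / 3
mean_new = sum(p for c, p in zip(cond, price) if c == 1) / 3

# Least squares fit of price on the 0/1 dummy.
n = len(cond)
mx = sum(cond) / n
my = sum(price) / n
b1 = (sum((c - mx) * (p - my) for c, p in zip(cond, price))
      / sum((c - mx) ** 2 for c in cond))
b0 = my - b1 * mx

# Intercept = mean of the reference group; slope = difference in means.
print(b0, b1)
```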

More than Two Categories

  • Categorical variables with more than two categories can be included, but they need to be converted into multiple binary (dummy) variables.

Making Dummy Variables

  • Example: Predicting email character count based on number category (none, small, big).
  • Use two binary variables: numbersmall and numberbig.

Dummy Variables

  • Example table showing how a three-category variable (number) is transformed into two binary variables (numbersmall, numberbig).
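The transformation can be sketched in pure Python: each level other than the reference level ("none") gets its own 0/1 column. The category values here are illustrative:

```python
# Three-level categorical variable; "none" is the reference level.
numbers = ["none", "small", "big", "small", "none"]

# Two dummy columns: a row is 1 only for its own level;
# the reference level is all zeros.
rows = [
    {"numbersmall": int(v == "small"), "numberbig": int(v == "big")}
    for v in numbers
]
print(rows[0])  # {'numbersmall': 0, 'numberbig': 0}  (reference level)
print(rows[1])  # {'numbersmall': 1, 'numberbig': 0}
```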

Regression Output

  • Example regression output with a categorical variable having more than two categories:

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   1.9955     0.6014   3.318 0.000915
    numbersmall   9.2233     0.6572  14.033  < 2e-16
    numberbig    14.8294     0.8521  17.403  < 2e-16
    

So Far

  • Discussion of moving from simple linear regression to multiple regression by including more predictor variables.
  • Analogy to moving from one-way ANOVA to factorial ANOVA.

Idea Behind Multiple Regression

  • Simple Linear Regression: ŷ = β0 + β1 * x
  • Multiple Regression: ŷ = β0 + β1 * x1 + β2 * x2 + β3 * x3

Multiple Regression

  • The β0, β1, β2, β3 values are estimated using the same concept of least squares (minimizing the sum of squared residuals).
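"Minimizing the sum of squared residuals" can be illustrated in the one-predictor case with pure Python and made-up data: the closed-form least squares estimates give a smaller SSR than any nearby candidate values.

```python
# Hypothetical data.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

def ssr(b0, b1):
    # Sum of squared residuals for candidate coefficients (b0, b1).
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Closed-form least squares estimates.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
b0 = my - b1 * mx

# Perturbing either coefficient only increases the SSR.
print(ssr(b0, b1) <= ssr(b0 + 0.1, b1))  # True
print(ssr(b0, b1) <= ssr(b0, b1 - 0.1))  # True
```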

Regression with Two Predictor Variables

  • Can include both variables and their interaction effect, similar to two-way ANOVA.
  • ŷ = β0 + β1 * x1 + β2 * x2 + β3 * x1 * x2

Example

  • Predicting miles per gallon using horsepower and weight:
  • mpg = β0 + β1 * hp + β2 * wt + β3 * hp * wt

Regression Output

  • Example regression output:

                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 49.80842    3.60516  13.816 5.01e-14
    hp          -0.12010    0.02470  -4.863 4.04e-05
    wt          -8.21662    1.26971  -6.471 5.20e-07
    hp:wt        0.02785    0.00742   3.753 0.000811
    
  • Estimated regression line: mpg = 49.80842 - 0.12010 * hp - 8.21662 * wt + 0.02785 * hp * wt
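The estimated line can be used for prediction by plugging in values. A sketch evaluating the coefficients from the output above at a hypothetical car (110 horsepower, weight 2.62; the car itself is an assumption, not from the notes):

```python
# Coefficients from the regression output above.
hp, wt = 110, 2.62
mpg = 49.80842 - 0.12010 * hp - 8.21662 * wt + 0.02785 * hp * wt
print(round(mpg, 2))  # ≈ 23.1
```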

Interpreting Coefficients

  • With multiple predictor variables, each coefficient must be interpreted in the context of the other variables in the model.

Interpreting Coefficients

  • The intercept shows the relationship when the other variable(s) are equal to 0.
  • Example: If weight is 0, for each additional horsepower, miles per gallon decreases by 0.12.

Interpreting Coefficients

  • Knowing the slope of horsepower when weight is 0 isn't substantively meaningful, because a car with zero weight doesn't make sense in the context of the problem.

Interpreting Interactions

  • A significant interaction term indicates that the effect of one variable differs according to the values of another variable.
  • Example: The effect of horsepower on miles per gallon differs according to the different values of weight.
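With an interaction term, the slope of horsepower is β1 + β3 * wt, so it changes with weight. A sketch using the hp and hp:wt coefficients from the earlier output (the two example weights are arbitrary):

```python
# Coefficients from the earlier hp/wt regression output.
b1 = -0.12010   # hp
b3 = 0.02785    # hp:wt interaction

# The effect of one extra horsepower depends on the car's weight.
slopes = {wt: b1 + b3 * wt for wt in (2.0, 4.0)}
print(slopes)  # steeper (more negative) hp effect for lighter cars
```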

Multiple Regression with Numerical and Categorical Variables

  • Both numerical and categorical variables can be included in a linear regression.
  • Example: Predicting miles per gallon using horsepower and transmission type (automatic or manual).
  • mpg = β0 + β1 * hp + β2 * trans + β3 * hp * trans

Interpreting Coefficients

  • When transmission type is the reference level (0), the effect of horsepower on miles per gallon is β1. When transmission type is 1, the effect of horsepower on miles per gallon is β1 + β3.

Interpreting Coefficients

  • If automatic transmission is the reference level (0):
    • When Automatic: mpg = β0 + β1 * hp
    • When Manual: mpg = (β0 + β2) + (β1 + β3) * hp

Interpreting Coefficients

  • Given regression output:

                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 26.6248479  2.1829432  12.197 1.01e-12
    hp          -0.0591370  0.0129449  -4.568 9.02e-05
    am           5.2176534  2.6650931   1.958   0.0603
    hp:am        0.0004029  0.0164602   0.024   0.9806
    
  • When the transmission (am) is automatic: mpg = 26.6248479 - 0.0591370 * hp

Interpreting Coefficients

  • Given the same regression output as before:
  • When the transmission (am) is manual: mpg = (26.6248479 + 5.2176534) + (-0.0591370 + 0.0004029) * hp, which simplifies to mpg = 31.8425013 - 0.0587341 * hp.
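Combining the coefficients for the manual line is plain arithmetic, sketched here with the values from the output above:

```python
# Coefficients from the output above (automatic = reference level).
b0, b1 = 26.6248479, -0.0591370   # intercept and hp slope for automatic
b2, b3 = 5.2176534, 0.0004029     # shifts applied when am = 1 (manual)

manual_intercept = b0 + b2   # 31.8425013
manual_slope = b1 + b3       # -0.0587341
print(manual_intercept, manual_slope)
```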

Multiple Regression

  • Any number of variables can be included in a linear regression.
  • Typically, not every single interaction is included because it can become complicated quickly.