EGM 101 - Week 6, Lecture 2: The Coefficient of Determination, Outliers, and Interpolation

Correlation

Assessing linear relationships between variables

Regression

Fitting linear models to observations

Variability of each individual y-value (yi)

Difference between that y-value and the mean of y: yi − y̅

Coefficient of Determination (R²)

Portion of total variability accounted for by the model

Often expressed as a percent
Best case (perfect fit): R² = 1 (100%)
It is possible to have R² < 0 (the model fits worse than simply predicting the mean of y)
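
As a minimal sketch of this card, assuming the common definition R² = 1 − SS_res / SS_tot (the card itself gives no formula), the snippet below computes R² for three sets of predictions. The function name `r_squared` and the data are made up for illustration.

```python
def r_squared(y_obs, y_pred):
    """R^2 = 1 - SS_res / SS_tot (assumed definition, not from the card)."""
    y_mean = sum(y_obs) / len(y_obs)
    # Total variability: squared deviations of each y from the mean of y
    ss_tot = sum((y - y_mean) ** 2 for y in y_obs)
    # Variability left unexplained: squared residuals
    ss_res = sum((y - p) ** 2 for y, p in zip(y_obs, y_pred))
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]                      # made-up observations

perfect = r_squared(y, y)                     # predictions match exactly -> 1.0
mean_only = r_squared(y, [2.5] * 4)           # always predict the mean -> 0.0
worse = r_squared(y, [4.0, 3.0, 2.0, 1.0])    # worse than the mean -> -3.0

print(perfect, mean_only, worse)
```

The third case shows how R² can drop below zero: the predictions leave more squared error than the mean alone would.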

Coefficient of Determination (Simple Linear Regression Case)

R² is the square of Pearson’s correlation coefficient r; R² = r²
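
The identity can be checked numerically. The sketch below fits a least-squares line to a small made-up data set and compares R² (computed as 1 − SS_res / SS_tot, an assumed definition) with the square of Pearson's r. The helper names `fit_line` and `pearson_r` are hypothetical.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def pearson_r(x, y):
    """Pearson's correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]          # made-up data
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
y_mean = sum(y) / len(y)
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

r2_from_fit = 1.0 - ss_res / ss_tot
r2_from_r = pearson_r(x, y) ** 2

print(r2_from_fit, r2_from_r)          # the two values agree up to rounding
```

The equality holds for a straight-line least-squares fit with an intercept, which is the simple linear regression case named on the card.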

What R² Says

What R² tells us
- Scatter of data points around the best-fit line
- Proportion of variability of dependent variable explained by the independent variable
What R² does not tell us
- Whether the model is appropriate for the data

A high R² does not mean the model is the right one (true/false)

True

Random Errors

No apparent pattern in the errors

Systematic Bias

Non-random errors (pattern); indicates variability not accounted for in the model (bad fit)
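
One way to see systematic bias is to fit a straight line to data that are actually curved. The sketch below uses made-up data following y = x² (an assumption chosen for illustration) and prints the sign of each residual. The helper name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

x = [float(i) for i in range(11)]      # 0, 1, ..., 10
y = [xi ** 2 for xi in x]              # curved (quadratic) data, made up

slope, intercept = fit_line(x, y)
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# A run of same-signed residuals is a pattern, not random scatter
signs = "".join("+" if r > 0 else "-" for r in residuals)

y_mean = sum(y) / len(y)
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
ss_res = sum(r ** 2 for r in residuals)
r2 = 1.0 - ss_res / ss_tot

print(signs)            # ++-------++
print(round(r2, 3))     # 0.928
```

The residuals are positive at both ends and negative in the middle, a clear pattern, even though R² is above 0.9. This is the sense in which a high R² does not guarantee the right model.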

What’s a Good R²?

Depends on the context and the goal
- Understanding relationship between variables
- Predicting unknown values
Other questions
- How much of variability can be explained?
- Is the relationship statistically significant?
- How precise are the predictions of the model?

Outlier

Values that lie far away from the rest of the data

Large Outliers

Tend to “pull” the regression line toward themselves

Basic Rule of Thumb for Identifying Outliers

Points farther than two standard deviations from the regression line are (probably) outliers

Use the standard deviation of the residuals, not of the observations!
When estimating this standard deviation, divide by n – 2 instead of n – 1 (two parameters, the slope and the intercept, are estimated from the data)
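
The rule of thumb above can be sketched as follows, assuming a least-squares line fitted to all points, including the suspect one. The data are made up, and the names `fit_line` and `flag_outliers` are hypothetical.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def flag_outliers(x, y, k=2.0):
    """Return indices of points more than k residual standard deviations from the line."""
    n = len(x)
    slope, intercept = fit_line(x, y)
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    # Standard deviation of the residuals, dividing by n - 2 rather than n - 1
    s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return [i for i, r in enumerate(residuals) if abs(r) > k * s]

# Made-up data: roughly y = 2x, with one suspicious value at x = 5
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.0, 8.2, 25.0, 12.1, 13.8, 16.1, 18.0, 19.9]

flagged = flag_outliers(x, y)
print(flagged)    # [4], i.e. the point at x = 5
```

Because the suspect point is part of the fit, it pulls the line toward itself and inflates the residual standard deviation, so the rule is only a rough screen.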

Handling Outliers

First understand why it’s an outlier

Measurement/data entry error?
An actual difference?
- Not part of the population
- Just an extreme value

Interpolation

Estimating unknown values of response variable within the range of observed (x) values

Extrapolation

Estimating unknown values of response variable outside of the range of observed (x) values

Dangers of Extrapolation

Interpolation is safer: new observations are less likely to completely contradict the regression
Can only be sure about “shape” of the relationship within the range of our observations
Outside of this range, we don’t know what we don’t know
- Relationship could be non-linear
- Could lead to ridiculous conclusions
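
To illustrate the points above, the sketch below assumes a hypothetical "true" relationship that levels off, y = 100x / (x + 10), observed only for x from 0 to 5, where it looks nearly linear. A line fitted to those observations is then used inside and far outside that range. The function `true_y`, the helper `fit_line`, and the chosen x-values are all assumptions for illustration.

```python
def true_y(x):
    """Hypothetical saturating relationship (an assumption, not from the lecture)."""
    return 100.0 * x / (x + 10.0)

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

x_obs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]     # observed range: 0 to 5
y_obs = [true_y(xi) for xi in x_obs]
slope, intercept = fit_line(x_obs, y_obs)

def predict(x):
    return slope * x + intercept

def is_interpolation(x):
    """True when x lies within the range of observed x-values."""
    return min(x_obs) <= x <= max(x_obs)

x_inside, x_outside = 2.5, 50.0
interp_err = abs(predict(x_inside) - true_y(x_inside))
extrap_err = abs(predict(x_outside) - true_y(x_outside))

print(is_interpolation(x_inside), interp_err)
print(is_interpolation(x_outside), extrap_err)
```

Inside the observed range the line misses the assumed true value by a small amount; at x = 50 it overshoots by far more, because the linear "shape" seen between 0 and 5 does not continue.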

When is it Okay to Extrapolate?

It depends on what you are trying to model
- Reasonable example: the sun will come out tomorrow
- Unreasonable example: number of husbands over time.
In general, it’s good to have some kind of theoretical basis for your model first.
