Correlation
Assessing linear relationships between variables
Regression
Fitting linear models to observations
Variability of each individual y-value (yi)
Difference between y-value and mean value of y ( = yi - y̅)
Coefficient of Determination (R²)
Portion of total variability accounted for by the model
Often expressed as a percent
Best case (perfect fit) : R² = 1 (100%)
It is possible to have R² < 0 (the model fits worse than simply predicting the mean of y)
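A minimal sketch of how R² can be computed from observed and predicted values, assuming the common definition R² = 1 - SS_res / SS_tot. The function name `r_squared` and the toy data are illustrative assumptions, not part of the original cards.

```python
def r_squared(y, y_pred):
    """Coefficient of determination, assuming R^2 = 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    # Variability the model leaves unexplained
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    # Total variability: sum of squared (yi - y_bar)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]                  # made-up observations
perfect = r_squared(y, y)                 # predictions match exactly -> 1.0
mean_only = r_squared(y, [2.5] * 4)       # always predict the mean -> 0.0
worse = r_squared(y, y[::-1])             # predictions run the wrong way -> -3.0
print(perfect, mean_only, worse)
```

The third case shows how a negative value can arise: the predictions are farther from the data than the mean of y is.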
Coefficient of Determination (Simple Linear Regression Case)
R² is the square of Pearson’s Correlation; R² = r²
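As a rough numerical check of R² = r² for a least-squares line with one predictor, the sketch below fits a line by hand and compares the two quantities. The helper names `fit_line` and `pearson_r` and the data are hypothetical, chosen only for illustration.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def pearson_r(x, y):
    """Pearson's correlation coefficient r."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]      # made-up data
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
y_pred = [intercept + slope * xi for xi in x]
y_bar = sum(y) / len(y)
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)

r2 = 1.0 - ss_res / ss_tot
r = pearson_r(x, y)
print(r2, r ** 2)   # the two values should agree up to rounding error
```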
What R² Says
What R² tells us
Scatter of data points around the best-fit line
Proportion of variability of dependent variable explained by the independent variable
What R² does not tell us
How good the model is
High R² does not mean it’s the right model? (true/false)
True
Random Errors
No apparent pattern in the errors
Systematic Bias
Non-random errors (pattern); indicates variability not accounted for in the model (bad fit)
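To illustrate why a high R² does not rule out systematic bias, this sketch fits a straight line to data that are actually quadratic. The data (y = x² for x = 0..10) are an assumption made for the example.

```python
# Made-up data: the true relationship is curved (y = x^2), but we fit a line.
x = [float(i) for i in range(11)]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Residual = observed value minus value predicted by the line
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r2 = 1.0 - ss_res / ss_tot

print(round(r2, 3))   # about 0.928, which looks like a good fit
print(residuals)      # positive at both ends, negative in the middle
```

The residuals form a U shape rather than scattering randomly, which is the kind of pattern that signals variability the linear model has not accounted for.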
What’s a Good R²?
Depends on context, goal
Understanding relationship between variables
Predicting unknown values
Other questions
How much of variability can be explained?
Is the relationship statistically significant?
How precise are the predictions of the model?
Outlier
Values that lie far away from the rest of the data
Large Outliers
Tend to “pull” the regression line toward themselves
Basic Rule of Thumb for Identifying Outliers
Points farther than two standard deviations from the regression line are (probably) outliers
Use the standard deviation of the residuals, not of the observations!
When estimating this standard deviation, divide by n – 2 instead of n – 1
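The rule of thumb above can be sketched as follows, assuming a simple least-squares line and made-up data in which one point (x = 5, y = 25) has been deliberately pushed away from the trend y = 2x.

```python
from math import sqrt

# Made-up data: y = 2x everywhere except the point at x = 5
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.0, 4.0, 6.0, 8.0, 25.0, 12.0, 14.0, 16.0, 18.0, 20.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Residuals: vertical distances from the regression line
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Standard deviation of the residuals, dividing by n - 2 (not n - 1)
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Flag points more than two residual standard deviations from the line
outliers = [(xi, yi) for xi, yi, e in zip(x, y, residuals) if abs(e) > 2 * s]
print(outliers)   # only (5.0, 25.0) is flagged for this data
```

Note that the outlier itself inflates s and shifts the fitted line, so a flagged point is a candidate for closer inspection, not an automatic deletion.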
Handling Outliers
First understand why it’s an outlier
Measurement/data entry error?
An actual difference?
Not part of the population
Just an extreme value
Interpolation
Estimating unknown values of response variable within the range of observed (x) values
Extrapolation
Estimating unknown values of response variable outside of the range of observed (x) values
Dangers of Extrapolation
With interpolation, new observations are less likely to completely contradict the regression
Can only be sure about “shape” of the relationship within the range of our observations
Outside of this range, we don’t know what we don’t know
Relationship could be non-linear
Could lead to ridiculous conclusions
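A small sketch of how extrapolation can go wrong when the true relationship is non-linear. It assumes, purely for illustration, that the true relationship is y = x² and that observations cover only x between 0 and 2.

```python
# Made-up setup: the true relationship is y = x^2, observed only on [0, 2]
def true_y(x):
    return x ** 2

x = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [true_y(xi) for xi in x]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

def predict(x_new):
    """Prediction from the fitted straight line."""
    return intercept + slope * x_new

# Interpolation: x = 1.25 lies inside the observed range
interp_error = abs(predict(1.25) - true_y(1.25))
# Extrapolation: x = 10 lies far outside the observed range
extrap_error = abs(predict(10.0) - true_y(10.0))

print(interp_error, extrap_error)   # the second error is far larger
```

Inside the observed range the line is a tolerable approximation; far outside it, the line predicts 19.5 where the assumed true value is 100.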
When is it Okay to Extrapolate?
It depends on what you are trying to model
Reasonable example: the sun will come out tomorrow
Unreasonable example: number of husbands over time.
In general, it’s good to have some kind of theoretical basis for your model first.