Correlation and Regression
Predicting the future
based on incomplete information
One way to predict the outcome for an individual
Find others who are like that individual and whose outcomes you know
Given two numeric variables (x,y)
x - explanatory (information we know / data)
y - outcome we are trying to predict
trend
positive association
negative association
The correlation coefficient = r
measures linear association
based on standard units: r is the average of the products of x and y, each measured in standard units
-1 <= r <= 1
1: perfect straight line sloping up
-1: perfect straight line sloping down
0: no linear association
common pitfalls:
false conclusions of causation
nonlinearity
outliers
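The definition of r can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (standard_units, correlation) and the data are made up for the example, and SD here means the population SD.

```python
from statistics import mean, pstdev

def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

def correlation(x, y):
    """r = average of the products of x and y in standard units."""
    xs, ys = standard_units(x), standard_units(y)
    return mean([a * b for a, b in zip(xs, ys)])

# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5]
y_up = [3, 5, 7, 9, 11]     # exactly y = 2x + 1
y_down = [10, 8, 6, 4, 2]   # exactly y = -2x + 12
y_curve = [4, 1, 0, 1, 4]   # exactly y = (x - 3)^2, a nonlinear relation

r_up = correlation(x, y_up)        # perfect line sloping up
r_down = correlation(x, y_down)    # perfect line sloping down
r_curve = correlation(x, y_curve)  # strong relation, but not linear

print(r_up, r_down, r_curve)
```

The third case shows the nonlinearity pitfall: y is completely determined by x, yet r is essentially 0 because the association is not linear.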
Nearest neighbor regression
group each x with similar (nearby) x values
average the corresponding y values for each group
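The two steps above can be sketched as follows. The function name neighbor_average, the width parameter that defines "nearby", and the data are all assumptions made for this example; the notes do not say how wide a group should be.

```python
from statistics import mean

def neighbor_average(x_new, xs, ys, width=1.0):
    """Predict y at x_new by averaging the y values of nearby points.

    'Nearby' is assumed to mean |x - x_new| <= width.
    Returns None when no point falls inside the window.
    """
    nearby = [y for x, y in zip(xs, ys) if abs(x - x_new) <= width]
    return mean(nearby) if nearby else None

# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 6, 6, 9]

pred_at_3 = neighbor_average(3, xs, ys)    # averages y for x = 2, 3, 4
pred_at_10 = neighbor_average(10, xs, ys)  # no neighbors within the window

print(pred_at_3, pred_at_10)
```

A wider window gives smoother but less local predictions; a narrower one follows the data more closely but may find no neighbors at all.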
Linear Regression
regression line / least squares line → estimated y (in standard units) = r * x (in standard units)
y = slope * x + intercept
slope of the regression line = r * (SD of y / SD of x)
intercept of the regression line = average of y - slope * average of x
Goal: predict y using x
e.g., predict y = final exam score using x = midterm score
These are numeric variables but predictions could use categorical variables as well
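A small sketch of the slope and intercept formulas, using the midterm/final example. The scores are invented for illustration and the function names are hypothetical; SD is taken to be the population SD.

```python
from statistics import mean, pstdev

def correlation(x, y):
    """r = average product of x and y in standard units."""
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])

def regression_line(x, y):
    """Return (slope, intercept) of the regression line of y on x."""
    slope = correlation(x, y) * pstdev(y) / pstdev(x)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept

# Hypothetical scores, made up for illustration.
midterm = [60, 70, 80, 90, 100]
final = [65, 70, 85, 80, 100]

slope, intercept = regression_line(midterm, final)

# Predicted final score for a student who scored 75 on the midterm.
predicted = slope * 75 + intercept
print(slope, intercept, predicted)
```

For this data the line works out to roughly final = 0.8 * midterm + 16.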
Error in estimation
error = actual value - estimate = y - yHat
Some errors are positive and some are negative, so we square them: squared errors cannot cancel each other out, and large errors count more heavily.
“least squares”: take the sum of the squared errors and choose the line (slope and intercept) that makes that sum as small as possible
best fit line
least squares line
regression line
all the same
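One way to see the "least squares" property is to compare the sum of squared errors for the regression line against a few nearby lines. This is only a spot check under assumed data, not a proof; the function name and the perturbation sizes are arbitrary choices for the example. The slope is computed as sum of cross-deviations over sum of squared x deviations, which is algebraically the same as r * (SD of y / SD of x).

```python
from statistics import mean

def sum_squared_errors(slope, intercept, x, y):
    """Sum of (actual - estimate)^2 over all points."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, made up for illustration.
x = [60, 70, 80, 90, 100]
y = [65, 70, 85, 80, 100]

x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

best = sum_squared_errors(slope, intercept, x, y)

# Nudge the slope and intercept; every nudged line should do worse.
others = [
    sum_squared_errors(slope + ds, intercept + di, x, y)
    for ds in (-0.1, 0, 0.1)
    for di in (-2, 0, 2)
    if (ds, di) != (0, 0)
]

print(best, min(others))
```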
Residuals
error in regression estimate
one residual corresponding to each point (x, y)
y - yHat, yHat = least squares estimate
A scatter diagram of residuals…
should look like an unassociated blob for linear relations
but will show patterns for non-linear relations
used to check whether linear regression is appropriate
look for curves, trends, changes in spread, outliers, or other patterns
residuals from a linear regression always
have a mean of 0
no correlation to x
no correlation to yHat (the fitted values); they are generally still correlated with y itself
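These residual properties can be checked numerically. The sketch below uses invented data and a hypothetical correlation helper. It also computes the correlation between the residuals and the raw y values, which for a least squares fit equals sqrt(1 - r^2) rather than 0.

```python
from math import sqrt
from statistics import mean, pstdev

def correlation(a, b):
    """r = average product of a and b in standard units."""
    ma, mb, sa, sb = mean(a), mean(b), pstdev(a), pstdev(b)
    return mean([((u - ma) / sa) * ((v - mb) / sb) for u, v in zip(a, b)])

# Hypothetical data, made up for illustration.
x = [60, 70, 80, 90, 100]
y = [65, 70, 85, 80, 100]

r = correlation(x, y)
slope = r * pstdev(y) / pstdev(x)
intercept = mean(y) - slope * mean(x)

fitted = [slope * xi + intercept for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

mean_residual = mean(residuals)              # essentially 0
corr_res_x = correlation(residuals, x)       # essentially 0
corr_res_fit = correlation(residuals, fitted)  # essentially 0
corr_res_y = correlation(residuals, y)       # sqrt(1 - r^2), not 0

print(mean_residual, corr_res_x, corr_res_fit, corr_res_y)
```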
The average value of y at a given x is estimated with a confidence interval
A new individual value of Y at a given x is predicted with a prediction interval
prediction intervals are wider than confidence intervals, because they also account for the random error of a single observation
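The difference in width comes from the standard errors. The sketch below uses the usual simple-regression formulas, which are not written out in these notes: SE for the mean response is s * sqrt(1/n + (x0 - xbar)^2 / Sxx), and the prediction SE adds 1 under the square root. The data is invented, and the multiplier is a normal quantile used as a large-sample stand-in; an exact interval would use a t quantile with n - 2 degrees of freedom (for example qt(0.975, n - 2) in R), so the numbers here are too narrow for a sample this small.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical data, made up for illustration.
x = [60, 70, 80, 90, 100]
y = [65, 70, 85, 80, 100]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

# Residual standard error s, with n - 2 degrees of freedom.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))

def half_widths(x0, multiplier):
    """Half-widths of the confidence and prediction intervals at x0."""
    core = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci = multiplier * s * sqrt(core)      # average y at x0
    pi = multiplier * s * sqrt(1 + core)  # one new y at x0
    return ci, pi

# ASSUMPTION: normal quantile as a large-sample approximation.
z = NormalDist().inv_cdf(0.975)

fitted = intercept + slope * 75
ci_half, pi_half = half_widths(75, z)
print(fitted, ci_half, pi_half)
```

Both intervals are centered on the same fitted value; only the half-width differs, and the prediction half-width is always the larger of the two.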
Y = intercept + slope * x + random error
Y = B0 + B1 * X + e
is the model useful?
H0: B1 = 0 (the model is not useful)
HA: B1 ≠ 0 (the model is useful)
Model: Y = B0 + B1 * X + e
Fit: YHat = B0Hat + B1Hat * X
Inference for a parameter (B1) is based on its estimate (B1Hat)
Standard Error
test statistic = t = B1Hat / SE of B1Hat
for large samples, when H0 is true:
t will be approximately standard normal (z)
For small samples, a t-distribution with n - 2 degrees of freedom is used in place of the standard normal distribution
In R, the CDF of the t distribution: pt(value, df)
for a two-tailed test: p-value = 2 * pt(-abs(t), n - 2)
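For comparison, here is a rough Python sketch of the same test. Python's standard library has no t-distribution CDF, so this uses the large-sample normal approximation described above in place of pt; that substitution is an assumption of the example, not the R procedure itself. The data is generated by a made-up deterministic formula, and SE of B1Hat is computed with the usual formula s / sqrt(Sxx), which the notes do not spell out.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical data: a line with slope 0.5 plus a fixed "noise" pattern.
x = list(range(1, 41))
noise = [(7 * i) % 11 - 5 for i in x]
y = [2 + 0.5 * xi + ei for xi, ei in zip(x, noise)]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0_hat = y_bar - b1_hat * x_bar

# Residual standard error and the standard error of the slope estimate.
sse = sum((yi - (b0_hat + b1_hat * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))
se_b1 = s / sqrt(sxx)

# Test statistic for H0: B1 = 0.
t_stat = b1_hat / se_b1

# ASSUMPTION: standard normal CDF instead of the t CDF with n - 2 df.
p_value = 2 * NormalDist().cdf(-abs(t_stat))

print(b1_hat, se_b1, t_stat, p_value)
```

With only a handful of points the normal approximation understates the p-value, which is why the t distribution with n - 2 degrees of freedom is preferred for small samples.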
R reports an F-test in the regression summary
in simple linear regression, the t-test for the utility of x and the F-test are equivalent: F = t^2 and the two p-values are the same
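The link between the two tests can be checked numerically: with one explanatory variable, the F statistic equals the square of the slope's t statistic. The sketch below uses invented data and the standard definitions F = (SSR / 1) / (SSE / (n - 2)) and t = B1Hat / (s / sqrt(Sxx)), which are assumed here rather than taken from the notes.

```python
from math import sqrt
from statistics import mean

# Hypothetical data, made up for illustration.
x = [60, 70, 80, 90, 100]
y = [65, 70, 85, 80, 100]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0_hat = y_bar - b1_hat * x_bar
fitted = [b0_hat + b1_hat * xi for xi in x]

# Split the variation in y into explained (SSR) and unexplained (SSE).
ssr = sum((fi - y_bar) ** 2 for fi in fitted)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

f_stat = (ssr / 1) / (sse / (n - 2))

s = sqrt(sse / (n - 2))
t_stat = b1_hat / (s / sqrt(sxx))

print(f_stat, t_stat ** 2)
```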