Correlation and Regression

Predicting the future

  • based on incomplete information

  • One way to predict the outcome for an individual

    • Find others who are like that individual and whose outcomes you know

Given two numeric variables (x,y)

x - explanatory (information we know / data)

y - outcome we are trying to predict

  • trend

    • positive association

    • negative association

The correlation coefficient = r

  • measures linear association

  • based on standard units

  • -1 <= r <= 1

    • 1: perfect straight line sloping up

    • -1: perfect straight line sloping down

    • 0: no linear association
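
A minimal Python sketch of how r can be computed from standard units, as the average of the products of x and y in standard units. The scores are made up for illustration, and the helper names standard_units and correlation are assumptions, not part of these notes.

```python
from statistics import fmean, pstdev

def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    mean = fmean(values)
    sd = pstdev(values)  # population SD
    return [(v - mean) / sd for v in values]

def correlation(x, y):
    """r = average of the products of x and y, both in standard units."""
    xs = standard_units(x)
    ys = standard_units(y)
    return fmean([a * b for a, b in zip(xs, ys)])

# Hypothetical scores, made up for illustration
midterm = [55, 62, 70, 78, 85, 91]
final = [58, 60, 75, 74, 88, 90]

r = correlation(midterm, final)
print(round(r, 3))
```

Because r is built from standard units, it does not change when x or y is rescaled or shifted (for example, points converted to percentages).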

common pitfalls:

  • false conclusions of causation

  • nonlinearity

  • outliers

Nearest neighbor regression

  • group each x with similar (nearby) x values

  • average the corresponding y values for each group
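
The two steps above can be sketched in Python as follows. This is only one simple way to define "nearby": a fixed window around the x of interest. The function name nn_predict and the width parameter are assumptions made for this sketch, and the scores are made up.

```python
from statistics import fmean

def nn_predict(x_values, y_values, x_new, width=5):
    """Average the y values of all points whose x is within `width` of x_new."""
    nearby = [y for x, y in zip(x_values, y_values) if abs(x - x_new) <= width]
    if not nearby:
        return None  # no neighbors fall inside the window
    return fmean(nearby)

# Hypothetical scores, made up for illustration
midterm = [55, 62, 70, 78, 85, 91]
final = [58, 60, 75, 74, 88, 90]

# Neighbors of x = 80 within 6 points are x = 78 and x = 85
prediction = nn_predict(midterm, final, 80, width=6)
print(prediction)
```

A wider window gives smoother but less local predictions; a narrower one may leave some x values with no neighbors at all.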

Linear Regression

regression line / least squares line → estimated y (in standard units) = r * x (in standard units)

y = slope * x + intercept

slope of the regression line = r * (SD of y / SD of x)

intercept of the regression line = average of y - slope * average of x
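
A short Python sketch of the slope and intercept formulas above, using only the standard library. The function name regression_line and the data are assumptions made for illustration.

```python
from statistics import fmean, pstdev

def regression_line(x, y):
    """Return (slope, intercept) using slope = r * SD(y) / SD(x)."""
    mean_x, mean_y = fmean(x), fmean(y)
    sd_x, sd_y = pstdev(x), pstdev(y)
    # r = average product of x and y in standard units
    r = fmean([((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)])
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical scores, made up for illustration
midterm = [55, 62, 70, 78, 85, 91]
final = [58, 60, 75, 74, 88, 90]

slope, intercept = regression_line(midterm, final)
predicted_final = slope * 80 + intercept  # estimate for a midterm score of 80
print(round(slope, 3), round(intercept, 3), round(predicted_final, 1))
```

The intercept formula forces the line through the point (average of x, average of y).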

Goal: predict y using x

e.g. predict y = final exam score using x = midterm score

These are numeric variables but predictions could use categorical variables as well

Error in estimation

error = actual value - estimate = y - yHat

Some errors are positive and some are negative, so we square the errors to eliminate the sign.

“least squares”: take the sum of the squared errors and adjust the line so that this sum is minimized

  • best fit line

  • least squares line

  • regression line

all the same
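
A small Python check of the least squares idea, on made-up data: the sum of squared errors for the fitted line is compared with the same sum for slightly nudged lines. The helper name sum_squared_errors is an assumption; the slope is computed as Sxy / Sxx, which is algebraically the same as r * (SD of y / SD of x).

```python
from statistics import fmean

def sum_squared_errors(x, y, slope, intercept):
    """Sum of (actual - estimate)^2 over all points."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))

# Hypothetical scores, made up for illustration
midterm = [55, 62, 70, 78, 85, 91]
final = [58, 60, 75, 74, 88, 90]

mean_x, mean_y = fmean(midterm), fmean(final)
sxx = sum((a - mean_x) ** 2 for a in midterm)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(midterm, final))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

best = sum_squared_errors(midterm, final, slope, intercept)
steeper = sum_squared_errors(midterm, final, slope + 0.1, intercept)
shifted = sum_squared_errors(midterm, final, slope, intercept + 1)
print(best < steeper, best < shifted)
```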

Residuals

error in regression estimate

one residual corresponding to each point (x, y)

y - yHat, yHat = least squares estimate

A scatter diagram of residuals…

  • should look like an unassociated blob for linear relations

  • but will show patterns for non-linear relations

  • used to check whether linear regression is appropriate

  • look for curves, trends, changes in spread, outliers, or other patterns

residuals from a linear regression always

  • have a mean of 0

  • have no correlation with x

  • have no correlation with the fitted values yHat (they are generally still correlated with y itself)
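
A quick numeric check in Python, on made-up data, that least squares residuals average to 0 and are uncorrelated with x. Variable names are illustrative assumptions; tiny nonzero values in the output are floating-point rounding.

```python
from statistics import fmean

# Hypothetical scores, made up for illustration
midterm = [55, 62, 70, 78, 85, 91]
final = [58, 60, 75, 74, 88, 90]

mean_x, mean_y = fmean(midterm), fmean(final)
sxx = sum((a - mean_x) ** 2 for a in midterm)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(midterm, final))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# One residual per point: actual y minus the least squares estimate
residuals = [b - (slope * a + intercept) for a, b in zip(midterm, final)]

mean_residual = fmean(residuals)
# The numerator of the correlation between residuals and x; zero means r = 0
residual_x_covariance = sum((a - mean_x) * e for a, e in zip(midterm, residuals))
print(mean_residual, residual_x_covariance)
```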

Estimating the average value of y at a given x is done using a confidence interval

Estimating a new value of Y at a given x is done using a prediction interval

prediction intervals are wider than confidence intervals
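
A hedged Python sketch of why the prediction interval is wider, using the standard formulas for the two standard errors in simple linear regression. Assumptions: the data are made up (a linear trend plus a deterministic wiggle standing in for noise), and the multiplier comes from the standard normal as a large-sample approximation rather than from the t-distribution with n - 2 degrees of freedom.

```python
from math import sqrt
from statistics import NormalDist, fmean

# Made-up data: linear trend plus a deterministic wiggle standing in for noise
x = list(range(1, 41))
y = [2 + 0.5 * xi + ((7 * xi) % 11 - 5) for xi in x]

n = len(x)
mean_x, mean_y = fmean(x), fmean(y)
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual standard error, with n - 2 degrees of freedom
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))

x0 = 20
y_hat = intercept + slope * x0
z = NormalDist().inv_cdf(0.975)  # large-sample stand-in for the t quantile

# SE for the average y at x0, and for a single new Y at x0 (extra "1 +" term)
se_mean = s * sqrt(1 / n + (x0 - mean_x) ** 2 / sxx)
se_new = s * sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)

conf_interval = (y_hat - z * se_mean, y_hat + z * se_mean)
pred_interval = (y_hat - z * se_new, y_hat + z * se_new)
print(conf_interval, pred_interval)
```

The extra 1 under the square root accounts for the random error of the individual new observation, which is why the prediction interval is always the wider of the two.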

Y = intercept + slope(x) + random error

Y = B0 + B1X + e

is the model useful?

H0: B1 = 0 (the model is not useful)

HA: B1 ≠ 0 (the model is useful)

Model: Y = B0 + B1X + e

Fit: YHat = B0Hat + B1HatX

Inference for a parameter (B1) is based on its estimate (B1Hat)

Standard Error

test statistic = t = B1Hat / SE of B1Hat

for large sample, when H0 is true:

  • t will be approximately standard normal (z)
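
A Python sketch of the test statistic, assuming the usual formula SE of B1Hat = s / sqrt(Sxx), where s is the residual standard error with n - 2 degrees of freedom. The data are made up, and the p-value uses the large-sample normal approximation described above, not the exact t-distribution.

```python
from math import sqrt
from statistics import NormalDist, fmean

# Made-up data: linear trend plus a deterministic wiggle standing in for noise
x = list(range(1, 41))
y = [2 + 0.5 * xi + ((7 * xi) % 11 - 5) for xi in x]

n = len(x)
mean_x, mean_y = fmean(x), fmean(y)
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b1_hat = sxy / sxx                 # estimated slope
b0_hat = mean_y - b1_hat * mean_x  # estimated intercept

residuals = [yi - (b0_hat + b1_hat * xi) for xi, yi in zip(x, y)]
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # residual standard error
se_b1 = s / sqrt(sxx)                               # SE of the slope estimate

t = b1_hat / se_b1
# Two-tailed p-value under H0: B1 = 0, large-sample normal approximation
p_value = 2 * NormalDist().cdf(-abs(t))
print(t, p_value)
```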

A t-distribution is often used in place of the standard normal distribution in statistical procedures that involve small samples

In R, CDF of t distn: pt(value, df)

for a two-tailed test: 2 * pt(-abs(t), n - 2)

F-test shown by R on the regression summary

  • in the case of simple linear regression, the t-test for the utility of x and the F-test are equivalent: F = t^2 and the p-values are the same
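
A numeric check in Python, on made-up data, that in simple linear regression the F statistic equals the square of the slope's t statistic. Here F is computed as (regression sum of squares / 1) / (residual sum of squares / (n - 2)); variable names are illustrative assumptions.

```python
from math import sqrt
from statistics import fmean

# Made-up data: linear trend plus a deterministic wiggle standing in for noise
x = list(range(1, 41))
y = [2 + 0.5 * xi + ((7 * xi) % 11 - 5) for xi in x]

n = len(x)
mean_x, mean_y = fmean(x), fmean(y)
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1_hat = sxy / sxx
b0_hat = mean_y - b1_hat * mean_x
fitted = [b0_hat + b1_hat * xi for xi in x]

ss_regression = sum((fi - mean_y) ** 2 for fi in fitted)
ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
mean_square_error = ss_residual / (n - 2)

f_stat = (ss_regression / 1) / mean_square_error  # 1 predictor
t_stat = b1_hat / (sqrt(mean_square_error) / sqrt(sxx))
print(f_stat, t_stat ** 2)
```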