Residuals and Correlation

Fitting a Line 

A perfect linear relationship would let us predict the exact value of y just by knowing the value of x. Obviously, this is unrealistic in any natural process. 

Linear regression is the statistical method for fitting a line to data in which the relationship between two variables can be modelled by a straight line, with some error:

y = β0 + β1x + ε,  

with β0 and β1 representing the model's parameters, whose point estimates b0 and b1 are computed from the data, and ε representing the error. Here x is the explanatory (predictor) variable and y is the response. Note that, when writing the model, ε is often omitted since our focus is predicting the average outcome.

Since data rarely fall exactly on a straight line, we usually observe a cloud of points scattered around the model's line, and hence there is some uncertainty in our estimates of β0 and β1.

When writing the estimated model, we use ŷ (‘y-hat’) to make this distinction: ŷ = b0 + b1x. We can think of ŷ as the average y value for each x.
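As a minimal sketch, the point estimates b0 and b1 can be computed with the usual least-squares formulas; the arrays below are made-up values purely for illustration.

    import numpy as np

    # Hypothetical data: x is the predictor, y the response.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    # Least-squares point estimates:
    #   b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    #   b0 = ybar - b1 * xbar
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar

    y_hat = b0 + b1 * x  # estimated average y at each x
    print(b0, b1)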

 

Residuals 

Residuals (e) are the leftover variation in the data after accounting for the model fit, so y = ŷ + e. Each observation has one residual, which is positive if the observation sits above the regression line and negative if it sits below.

The residual of the ith observation (xi, yi) is the difference between the observed response (yi) and the response we would predict based on the model fit (ŷi):

ei = yi − ŷi

We find ŷi by plugging xi into the estimated model.

The magnitude of a residual is the vertical distance (parallel to the y-axis) between the observation and the regression line. In picking the right linear model, our goal is to minimise the size of the residuals.
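Continuing the hypothetical data above, the residuals are a one-line computation; here np.polyfit(x, y, 1) returns the least-squares slope and intercept.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    b1, b0 = np.polyfit(x, y, 1)  # least-squares slope and intercept
    y_hat = b0 + b1 * x           # predicted responses
    e = y - y_hat                 # residuals: positive above the line, negative below

    print(e)          # signed residuals
    print(np.abs(e))  # magnitudes: vertical distances from the line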

A residual plot lets us evaluate how well a linear model fits a data set and identify characteristics or patterns still apparent in the data after fitting the model. We plot each residual at its original horizontal x location, with the signed value of the residual on the y-axis; the effect is to tip the scatterplot so that the regression line lies horizontal.
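A residual plot can be sketched with matplotlib along these lines (same hypothetical data as before):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    b1, b0 = np.polyfit(x, y, 1)
    e = y - (b0 + b1 * x)           # residuals

    plt.scatter(x, e)               # residuals at their original x locations
    plt.axhline(0, linestyle="--")  # the regression line, now "tipped" horizontal
    plt.xlabel("x")
    plt.ylabel("residual")
    plt.show()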

 

Correlation 

Correlation (R) is a value between −1 and 1 describing the strength of the linear relationship between two variables. It is generally calculated by computer software, but it can also be computed by hand: the correlation of observations (x1, y1), (x2, y2), ..., (xn, yn) is

R = 1/(n − 1) · Σᵢ₌₁ⁿ [(xi − x̄)/sx] · [(yi − ȳ)/sy],

where x̄ and ȳ are the sample means and sx and sy are the sample standard deviations of x and y.
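The formula translates directly into code; as a quick sanity check, the hand computation should match numpy's built-in np.corrcoef (same hypothetical data as before).

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    n = len(x)
    sx, sy = x.std(ddof=1), y.std(ddof=1)  # sample standard deviations (n - 1 divisor)
    R = np.sum((x - x.mean()) / sx * (y - y.mean()) / sy) / (n - 1)

    print(R)                        # hand-computed correlation
    print(np.corrcoef(x, y)[0, 1])  # numpy's value; the two agree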

The correlation is exactly −1 or 1 only when the relationship is perfectly linear; conversely, if there is no linear relationship between the variables, the correlation will be (near) 0.

Correlation is intended to measure the strength of a linear trend. Non-linear trends, even strong ones, can therefore produce correlations that do not appropriately reflect the strength of the relationship, as the sketch below illustrates.
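For instance, a perfect but non-linear relationship can have a correlation of essentially zero; this sketch uses y = x² over a range of x that is symmetric about 0.

    import numpy as np

    x = np.linspace(-3, 3, 61)  # symmetric around 0
    y = x ** 2                  # perfectly determined by x, but not linearly

    n = len(x)
    R = np.sum((x - x.mean()) / x.std(ddof=1)
               * (y - y.mean()) / y.std(ddof=1)) / (n - 1)
    print(R)  # essentially 0, despite the exact quadratic dependence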

Since correlation does not imply causation, we cannot use a strong association to infer a causal relationship.