Correlation & Simple Linear Regression Notes (STAT 2020)
Introduction to Correlation & Simple Linear Regression
Purpose: Analyze relationships in paired (bivariate, quantitative) data to determine linear associations and predict outcomes using a linear model.
Initial Step: Always begin with a scatterplot to visualize the relationship, checking for linearity and potential outliers. The pattern may show a positive association, a negative association, no association, or a nonlinear relationship.
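As a minimal sketch of this first step (assuming made-up data and matplotlib, neither of which come from the notes), the snippet below draws a scatterplot to eyeball the pattern before computing anything:

```python
import matplotlib.pyplot as plt

# Hypothetical paired (x, y) data -- illustrative values only
x = [2, 4, 5, 7, 8, 10, 12, 13]
y = [3.1, 4.0, 4.8, 6.2, 6.9, 8.1, 9.5, 9.9]

plt.scatter(x, y)                      # one point per (x, y) pair
plt.xlabel("x (predictor)")
plt.ylabel("y (response)")
plt.title("Scatterplot: check for a linear pattern and outliers")
plt.show()
```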
Correlation
Definition: The linear correlation coefficient r measures the strength and direction of the linear association between paired x and y values in a sample.
Properties: -1 \le r \le 1. Values closer to -1 or 1 indicate a stronger linear correlation, while values near 0 suggest a weak or no linear correlation. r is scale-invariant, symmetric in x and y, not resistant to outliers, and measures association, not causation.
Conditions for Inference: Requires a paired random sample, quantitative variables, an approximately straight-line pattern in the scatterplot, and removal of any outliers known to be errors.
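A minimal computational sketch of r, using a small hypothetical sample and NumPy (both are illustrative assumptions, not part of the notes); np.corrcoef returns a 2x2 correlation matrix, so the off-diagonal entry is the sample correlation:

```python
import numpy as np

# Hypothetical paired sample -- illustrative values only
x = np.array([2, 4, 5, 7, 8, 10, 12, 13], dtype=float)
y = np.array([3.1, 4.0, 4.8, 6.2, 6.9, 8.1, 9.5, 9.9], dtype=float)

# Off-diagonal entry of the 2x2 correlation matrix is r
r = np.corrcoef(x, y)[0, 1]

# Equivalent "by hand" computation from standardized values
z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)
r_by_hand = np.sum(z_x * z_y) / (len(x) - 1)

print(round(r, 4), round(r_by_hand, 4))   # both lie in [-1, 1] and agree
```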
Coefficient of Determination (R^2): R^2 = r^2 represents the proportion of the variation in y that is explained by the linear relationship with x.
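For example, if a sample yields r = 0.90, then R^2 = 0.90^2 = 0.81, so roughly 81% of the variation in y is accounted for by the linear relationship with x, with the remaining 19% due to other sources of variation.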
Simple Linear Regression (SLR)
Purpose: Express the linear association between an independent predictor (x) and a dependent response (y) with a linear model.
Least-Squares Regression Line: The most common model is \hat{y} = b_0 + b_1 x. This unique line minimizes the sum of squared vertical distances from the data points to the line.
Formulas for Coefficients:
Slope (b_1): b_1 = r \frac{s_y}{s_x}. It describes the average change in y for a one-unit increase in x.
Intercept (b_0): b_0 = \bar{y} - b_1 \bar{x}. It is the predicted value of y when x = 0, which may not be meaningful in context.
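A minimal sketch of these coefficient formulas, reusing the same kind of hypothetical sample as above (illustrative values, not from the notes); np.polyfit, which fits the least-squares line directly, is included only as a cross-check:

```python
import numpy as np

# Hypothetical paired sample -- illustrative values only
x = np.array([2, 4, 5, 7, 8, 10, 12, 13], dtype=float)
y = np.array([3.1, 4.0, 4.8, 6.2, 6.9, 8.1, 9.5, 9.9], dtype=float)

r = np.corrcoef(x, y)[0, 1]

# Slope and intercept from the formulas in the notes
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # b_1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()            # b_0 = y-bar - b_1 * x-bar

# Cross-check: np.polyfit minimizes the sum of squared vertical distances
slope, intercept = np.polyfit(x, y, 1)

print(round(b1, 4), round(slope, 4))      # should match
print(round(b0, 4), round(intercept, 4))  # should match
```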
Predictions: Use the regression equation to predict y for x values within the observed data range. Extrapolation (making predictions outside this range) should be avoided because it can be unreliable.
Residuals: \varepsilon_i = y_i - \hat{y}_i are the vertical distances between the observed y values and the predicted \hat{y} values. They help assess model fit; large residuals can indicate outliers or model inadequacy.
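Continuing the hypothetical sketch, the snippet below forms the predicted values and residuals for the observed x values and makes one prediction inside the data range; the sample values are again illustrative:

```python
import numpy as np

# Same hypothetical sample as in the previous sketch
x = np.array([2, 4, 5, 7, 8, 10, 12, 13], dtype=float)
y = np.array([3.1, 4.0, 4.8, 6.2, 6.9, 8.1, 9.5, 9.9], dtype=float)
b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept

y_hat = b0 + b1 * x            # predicted values on the fitted line
residuals = y - y_hat          # vertical distances: observed minus predicted

# Residuals sum to (approximately) zero for the least-squares line;
# unusually large residuals flag potential outliers or a poor fit.
print(np.round(residuals, 3), round(residuals.sum(), 8))

# Prediction for a new x inside the observed data range (no extrapolation)
x_new = 9.0
print(round(b0 + b1 * x_new, 3))
```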
Assumptions and Cautions
Association vs. Causation: A strong correlation or regression model does not imply a causal relationship. Lurking or confounding variables can create spurious associations.
Linearity: The linear regression model is appropriate only if the relationship between variables is approximately linear.
Data Range: Interpretations and predictions are valid primarily within the observed range of the independent variable x.