Correlation & Simple Linear Regression Notes (STAT 2020)

Introduction to Correlation & Simple Linear Regression
  • Purpose: Analyze relationships in paired (bivariate, quantitative) data to determine linear associations and predict outcomes using a linear model.

  • Initial Step: Always begin with a scatterplot to visualize the relationship, checking for linearity and potential outliers. The pattern may show a positive association, a negative association, no association, or a nonlinear relationship.

Correlation
  • Definition: The linear correlation coefficient r measures the strength and direction of the linear association between paired x and y values in a sample.

  • Properties: -1 \le r \le 1 . Values closer to -1 or 1 indicate a stronger linear association, while values near 0 suggest a weak or no linear association. r is unitless and unchanged by changes of scale, is symmetric in x and y, is not resistant to outliers, and measures association, not causation.

  • Conditions for Inference: Requires a paired random sample, quantitative variables, an approximately straight-line pattern in the scatterplot, and removal of any outliers known to be errors.

  • Coefficient of Determination ( R^2 ): R^2 = r^2 represents the proportion of variation in y that is explained by the linear relationship with x.
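
The two quantities above can be computed directly with NumPy. A minimal sketch using a small made-up sample (hours studied vs. exam score; the data are illustrative, not from these notes):

```python
import numpy as np

# Hypothetical paired sample: hours studied (x) and exam score (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 55.0, 61.0, 64.0, 70.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]

# Coefficient of determination: proportion of variation in y explained by x
r_squared = r ** 2
```

Here r is close to 1, consistent with the strongly linear pattern a scatterplot of these points would show.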

Simple Linear Regression (SLR)
  • Purpose: Express the linear association between an independent predictor (x) and a dependent response (y) with a linear model.

  • Least-Squares Regression Line: The most common model is \hat{y} = b_0 + b_1 x . This unique line minimizes the sum of squared vertical distances from the data points.

  • Formulas for Coefficients:

    • Slope ( b_1 ): b_1 = r \frac{s_y}{s_x} . It describes the average change in y for a one-unit increase in x.

    • Intercept ( b_0 ): b_0 = \bar{y} - b_1 \bar{x} . It's the predicted value of y when x = 0, which may not be meaningful in context.
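
These two formulas can be verified numerically. The sketch below (hypothetical data, all values made up) computes the slope from r and the sample standard deviations, the intercept from the means, and cross-checks against NumPy's own least-squares fit:

```python
import numpy as np

# Hypothetical paired sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 55.0, 61.0, 64.0, 70.0])

r = np.corrcoef(x, y)[0, 1]

# ddof=1 gives the sample standard deviations used in the slope formula
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b_1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()            # intercept: b_0 = ybar - b_1 * xbar

# Cross-check: np.polyfit(deg=1) returns [slope, intercept] for the same line
slope, intercept = np.polyfit(x, y, 1)
```

Both routes produce the same least-squares line, which is reassuring since the line minimizing the squared residuals is unique.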

  • Predictions: Use the regression equation to predict y for x values within the observed data range. Extrapolation (making predictions outside this range) should be avoided as it can be unreliable.

  • Residuals: \varepsilon_i = y_i - \hat{y}_i are the vertical distances between observed y values and predicted \hat{y} values. They help assess model fit; large residuals can indicate outliers or model inadequacy.
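
Residuals are simply observed minus predicted values; for a least-squares fit they sum to (numerically) zero, which makes a quick sanity check. Continuing the hypothetical example:

```python
import numpy as np

# Hypothetical paired sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 55.0, 61.0, 64.0, 70.0])

b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = b0 + b1 * x            # predicted values from the fitted line
residuals = y - y_hat          # e_i = y_i - y_hat_i for each data point

# A large residual relative to the others would flag a point worth examining
```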

Assumptions and Cautions
  • Association vs. Causation: A strong correlation or regression model does not imply a causal relationship. Lurking or confounding variables can create spurious associations.

  • Linearity: The linear regression model is appropriate only if the relationship between variables is approximately linear.

  • Data Range: Interpretations and predictions are valid primarily within the observed range of the independent variable x.