Correlation & Simple Linear Regression Notes (STAT 2020)

Introduction to Correlation & Simple Linear Regression
  • Purpose: Analyze relationships in paired (bivariate, quantitative) data to determine linear associations and predict outcomes using a linear model.

  • Initial Step: Always begin with a scatterplot to visualize the relationship, checking for linearity and potential outliers. The pattern may show a positive association, a negative association, no association, or a nonlinear relationship.

Correlation
  • Definition: The linear correlation coefficient r measures the strength and direction of the linear association between paired x and y values in a sample.

  • Properties: -1 \le r \le 1 . Values closer to -1 or 1 indicate a stronger linear association, while values near 0 suggest a weak or no linear association. r is unitless and unchanged by changes of scale, is symmetric in x and y, is not resistant to outliers, and measures association, not causation.

  • Conditions for Inference: Requires a paired random sample, quantitative variables, an approximately straight-line pattern in the scatterplot, and removal of any outliers known to be errors.

  • Coefficient of Determination ( R^2 ): R^2 = r^2 represents the proportion of variation in y that is explained by the linear relationship with x.
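
The two quantities above can be computed directly with NumPy. A minimal sketch using a small made-up sample (hours studied vs. exam score; the data are illustrative, not from these notes):

```python
import numpy as np

# Hypothetical paired sample: hours studied (x) and exam score (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 55.0, 61.0, 64.0, 70.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]

# Coefficient of determination: proportion of variation in y explained by x
r_squared = r ** 2
```

Here r is close to 1, consistent with the strongly linear pattern a scatterplot of these points would show.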

Simple Linear Regression (SLR)
  • Purpose: Express the linear association between an independent predictor (x) and a dependent response (y) with a linear model.

  • Least-Squares Regression Line: The most common model is \hat{y} = b_0 + b_1 x . This unique line minimizes the sum of squared vertical distances from the data points.

  • Formulas for Coefficients:

    • Slope ( b_1 ): b_1 = r \frac{s_y}{s_x} . It describes the average change in y for a one-unit increase in x.

    • Intercept ( b_0 ): b_0 = \bar{y} - b_1 \bar{x} . It's the predicted value of y when x = 0, which may not be meaningful in context.
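
These two formulas can be verified numerically. The sketch below (hypothetical data, all values made up) computes the slope from r and the sample standard deviations, the intercept from the means, and cross-checks against NumPy's own least-squares fit:

```python
import numpy as np

# Hypothetical paired sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 55.0, 61.0, 64.0, 70.0])

r = np.corrcoef(x, y)[0, 1]

# ddof=1 gives the sample standard deviations used in the slope formula
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b_1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()            # intercept: b_0 = ybar - b_1 * xbar

# Cross-check: np.polyfit(deg=1) returns [slope, intercept] for the same line
slope, intercept = np.polyfit(x, y, 1)
```

Both routes produce the same least-squares line, which is reassuring since the line minimizing the squared residuals is unique.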

  • Predictions: Use the regression equation to predict y for x values within the observed data range. Extrapolation (making predictions outside this range) should be avoided as it can be unreliable.

  • Residuals: \varepsilon_i = y_i - \hat{y}_i are the vertical distances between observed y values and predicted \hat{y} values. They help assess model fit; large residuals can indicate outliers or model inadequacy.
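
Residuals are simply observed minus predicted values; for a least-squares fit they sum to (numerically) zero, which makes a quick sanity check. Continuing the hypothetical example:

```python
import numpy as np

# Hypothetical paired sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 55.0, 61.0, 64.0, 70.0])

b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = b0 + b1 * x            # predicted values from the fitted line
residuals = y - y_hat          # e_i = y_i - y_hat_i for each data point

# A large residual relative to the others would flag a point worth examining
```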

Assumptions and Cautions
  • Association vs. Causation: A strong correlation or regression model does not imply a causal relationship. Lurking or confounding variables can create spurious associations.

  • Linearity: The linear regression model is appropriate only if the relationship between variables is approximately linear.

  • Data Range: Interpretations and predictions are valid primarily within the observed range of the independent variable x.