
Week 4 Lecture: Correlation, Regression, and Multiple Regression

Correlation and Regression Foundations

  • Correlation coefficient r: a single number that summarizes the association between two data sets (two variables) in bivariate data.

    • Two key properties of $r$:

    • Strength: magnitude of the correlation (how close to -1 or +1).

    • Direction: positive or negative sign indicates the direction of the relationship.

    • Perfect correlation: $r = \pm 1$ (rare in psychology).

    • Zero correlation: $r = 0$.

  • Regression basics: regression line (line of best fit) through a scatter plot.

    • Goal: predict a dependent variable $y$ from a predictor $x$.

    • Notation: $\hat{y}$ = predicted value of $y$; $x$ = predictor (independent variable).

    • Bivariate regression is the link between correlation and regression.

Coefficient of Determination and Shared Variance

  • Coefficient of determination: $R^2$ is the proportion of variance shared between two variables in a bivariate correlation.

    • If $r = 0.75$, then $R^2 = r^2 = 0.75^2 \approx 0.56$, meaning about 56% of the variance is shared.

    • Residual variance (unshared variance) = $1 - R^2$ (e.g., 44% in the example).

  • Cohen’s guidelines for strength (approximate):

    • Strong: $|r| > 0.5$

    • Moderate: $|r| \approx 0.3$ to $0.5$

    • Weak: $|r| \approx 0.1$ to $0.3$

  • Residual variance: the portion of variance not shared by the two variables; errors due to other variables.

  • Relationship between shared variance and error: $R^2$ is the shared variance; residual variance is $1-R^2$.
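
  • A minimal numpy sketch of these quantities (the data and variable names are invented purely for illustration):

```python
import numpy as np

# Hypothetical bivariate data: hours studied (x) and exam score (y)
x = np.array([2, 4, 5, 7, 8, 10, 11, 13], dtype=float)
y = np.array([50, 55, 61, 60, 70, 74, 79, 83], dtype=float)

r = np.corrcoef(x, y)[0, 1]        # Pearson correlation coefficient
r_squared = r ** 2                 # coefficient of determination (shared variance)
residual_variance = 1 - r_squared  # unshared variance (error due to other variables)

print(f"r = {r:.2f}, R^2 = {r_squared:.2f}, 1 - R^2 = {residual_variance:.2f}")
```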

The Simple Linear Regression Equation and Interpretation

  • Four components of the regression line (simple linear regression):

    • $\hat{y}$: predicted dependent variable score.

    • $x$: raw score on the independent variable.

    • $b_0$: intercept (the predicted value of $y$ when $x = 0$), i.e., where the line crosses the $y$-axis.

    • $b_1$: slope (regression coefficient) indicating the change in $\hat{y}$ for a one-unit change in $x$.

  • Equation of the line: $\hat{y} = b_0 + b_1 x$

  • Example interpretation: if $b_1 = 0.33$, a one-unit increase in $x$ increases $\hat{y}$ by 0.33 units.

  • Practical example (forward calculation): given $x = 80$, $b_0 = 2$, $b_1 = 0.33$ (see the sketch after this list):

    • $\hat{y} = 2 + 0.33 \times 80 = 2 + 26.4 = 28.4$.

    • If the scale of $y$ runs from 0 to 35, $\hat{y} = 28.4$ indicates a relatively high predicted outcome.

  • Interpretation of regression coefficient (slope): positive slope indicates higher predictor scores are associated with higher outcome values; the magnitude reflects the strength of that association in the context of the scale.
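
  • A small Python sketch of the forward calculation above (the function name is mine; the numbers mirror the lecture example):

```python
def predict(x, b0=2.0, b1=0.33):
    """Simple linear regression prediction: y-hat = b0 + b1 * x."""
    return b0 + b1 * x

y_hat = predict(80)                      # 2 + 0.33 * 80 = 28.4
print(f"predicted score = {y_hat:.1f}")  # relatively high on a 0-35 outcome scale
```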

From Bivariate to Multiple Regression

  • Bivariate correlation describes the strength and direction of the association between two variables.

  • Bivariate regression predicts $y$ from a single $x$.

  • Multiple correlation and multiple regression extend this to multiple predictors:

    • Multiple regression: a single $y$ is predicted from several $x$ values.

    • The model controls for each predictor to isolate its unique contribution.

  • Goals of multiple regression:

    • Assess how much variance in $y$ is explained by the set of predictors together.

    • Determine the unique contribution of each predictor while holding others constant.

    • Identify the most important predictors (ranked by their independent contribution).

  • Examples of predictors for a final grade or exam score could include: high school GPA, academic effort, time spent on nonacademic activities, family education, etc.

The Multiple Regression Model and Coefficients

  • General multiple regression equation: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$

    • $b_0$: intercept.

    • $b_j$: regression coefficient for predictor $x_j$ (unstandardised).

    • There can be as many predictors as needed (written $p$ here; the slides use $n$).

  • Coefficients in SPSS-style outputs come in two forms:

    • Unstandardised coefficient: $B_j$ (raw units; predicted change in $y$ per unit change in $x_j$, holding others constant).

    • Standardised coefficient: $\beta_j$ (in standard deviation units), enabling comparison across predictors with different scales.

  • Important note on interpretation:

    • Unstandardised $B_j$ tells you the actual unit change in $y$ per unit change in $x_j$ when the other predictors are held constant.

    • Standardised $\beta_j$ allows comparison across predictors with different measurement scales.
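
  • A rough statsmodels sketch of the two kinds of coefficients, on invented data; fitting the model on z-scored variables yields the standardised betas (this mirrors, but is not, the SPSS output):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)                           # hypothetical predictor 1
x2 = 10 * rng.normal(size=n)                      # hypothetical predictor 2 (different scale)
y = 2 + 0.5 * x1 + 0.15 * x2 + rng.normal(size=n)

# Unstandardised coefficients B_j: fit on the raw scores
raw_fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("B (unstandardised):", raw_fit.params[1:])

# Standardised coefficients beta_j: z-score everything, then refit
def z(v):
    return (v - v.mean()) / v.std(ddof=1)

std_fit = sm.OLS(z(y), sm.add_constant(np.column_stack([z(x1), z(x2)]))).fit()
print("beta (standardised):", std_fit.params[1:])  # comparable across scales
```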

Model Fit, Sums of Squares, and F Statistic

  • Model (explained variance) vs residual (unexplained variance):

    • SST = SSR + SSM, where

    • SST: total sum of squares (total variance in $y$).

    • SSR: sum of squared residuals (unexplained variance).

    • SSM: sum of squares due to the model (explained variance).

  • Mean squares: $MS_{Model} = SSM / df_{Model}$ and $MS_{Residual} = SSR / df_{Residual}$.

  • F statistic to test model significance: $F = \frac{MS_{Model}}{MS_{Residual}} = \frac{SSM/df_{Model}}{SSR/df_{Residual}}$.

  • If $F$ is large and $p$-value $< 0.05$, the model explains a significant amount of variance beyond the baseline (mean) model.

  • Baseline model: a flat line at the mean of $y$; compare model to baseline using $F$ and $R^2$.

  • Output tells you overall model significance (via $F$) and the exact contributions of each predictor (via t-tests on coefficients).
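
  • A hedged numpy/scipy sketch of the sums-of-squares decomposition and the F test (an illustrative function, not the SPSS computation itself):

```python
import numpy as np
from scipy import stats

def model_f_test(y, y_hat, p):
    """F test of a regression model against the baseline (mean) model.
    y: observed outcome, y_hat: model predictions, p: number of predictors."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares (SST)
    ssr = np.sum((y - y_hat) ** 2)      # residual (unexplained) sum of squares (SSR)
    ssm = sst - ssr                     # model (explained) sum of squares (SSM)
    df_model, df_resid = p, n - p - 1
    f = (ssm / df_model) / (ssr / df_resid)
    p_value = stats.f.sf(f, df_model, df_resid)
    return f, p_value, ssm / sst        # F, its p-value, and R^2
```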

Coefficients and Significance Testing

  • For each predictor, test whether its coefficient significantly contributes to predicting $y$ while holding others constant.

    • t-test for coefficient: $t_j = \frac{B_j}{SE(B_j)}$, with associated p-value.

  • Look for significant predictors (p-value typically < 0.05, or stricter in some contexts).

  • Interpretation of coefficients in a concrete example (three predictors):

    • Unstandardised coefficients (example numbers):

    • Advertising spend: $B_1 = 0.085$ (per unit of currency, $y$ increases by 0.085 units).

    • Airplay: $B_2 = 3.37$ (each additional unit of airplay increases $y$ by 3.37 units).

    • Band attractiveness: $B_3 = 11.09$ (each unit increase in attractiveness increases $y$ by 11.09 units).

    • All three predictors can be significant and contribute unique variance.

  • Zero-order, partial, and semi-partial correlations (brief):

    • Zero-order: simple correlation between $x_j$ and $y$.

    • Partial correlation: correlation between $x_j$ and $y$ controlling for other predictors.

    • Semi-partial (part) correlation: correlation between $x_j$ and $y$ after removing from $x_j$ only the variance it shares with the other predictors (a computational sketch follows this list).

    • Squared semi-partial correlation ($\text{SP}^2$) indicates the unique variance in $y$ explained by predictor $x_j$.
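
  • One way (among several) to obtain the part (semi-partial) correlation by hand is to residualise $x_j$ on the other predictors and correlate those residuals with $y$; a sketch with assumed array inputs:

```python
import numpy as np
import statsmodels.api as sm

def semi_partial_r(y, xj, other_X):
    """Part (semi-partial) correlation of predictor xj with y:
    remove the other predictors' variance from xj only, then correlate with y."""
    xj_resid = sm.OLS(xj, sm.add_constant(other_X)).fit().resid
    return np.corrcoef(np.asarray(y, float), xj_resid)[0, 1]

# semi_partial_r(...) ** 2 gives the unique variance in y explained by xj (SP^2)
```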

Model Fit and Interpretation Summary

  • If the overall model is significant (large $F$ and small p-value): the set of predictors collectively helps predict $y$.

  • If individual predictors are significant (t-tests on $B_j$): each predictor adds unique predictive value while others are held constant.

  • $R$, $R^2$, and adjusted $R^2$:

    • $R$ is the multiple correlation between observed $y$ and predicted $y$.

    • $R^2$ is the proportion of variance in $y$ explained by the predictors.

    • Adjusted $R^2$ accounts for the number of predictors relative to sample size.

  • Practical significance: even with a significant model, check for multicollinearity and ensure predictors contribute meaningful unique variance.
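
  • The notes do not spell out adjusted $R^2$; the standard formula (the one SPSS reports) penalises $R^2$ for the number of predictors $p$ relative to the sample size $n$, e.g.:

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(adjusted_r_squared(0.66, n=200, p=3))   # arbitrary illustrative numbers -> ~0.655
```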

Outliers and Multivariate Outliers in Regression

  • Outliers can be univariate (extreme on one variable) or multivariate (extreme across several variables).

  • Tools for detecting multivariate outliers:

    • Mahalanobis distance: $D^2 = (x - \mu)^\top S^{-1} (x - \mu)$, where $x$ is the vector of predictor values for a case, $\mu$ is the centroid (mean vector) of the predictors, and $S$ is their covariance matrix (a computational sketch follows this list).

    • Threshold: compare $D^2$ to the critical $\chi^2$ value with $k$ degrees of freedom (number of predictors). E.g., $D^2 > \chi^2_{k, \alpha}$ with $\alpha = 0.001$ is a conservative criterion.

    • Cook’s distance: a case-by-case influence measure; values > 1 often indicate influential cases.

  • Residual plots and scatter plots help visually identify multivariate outliers and influential cases.

  • Data cleaning approach:

    • Inspect outliers in context of raw data (possible data entry errors, malingering, or genuine extreme but valid cases).

    • Consider removing outliers if justified (cite a source) and re-run analyses to compare results.

    • If a multivariate outlier is present, decide whether to remove it or keep it (based on its influence and theoretical justification).
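
  • A minimal sketch of the Mahalanobis screen described above, assuming X is a cases-by-predictors numpy array:

```python
import numpy as np
from scipy import stats

def mahalanobis_flags(X, alpha=0.001):
    """Flag multivariate outliers: D^2 above the critical chi-square value
    with k = number of predictors degrees of freedom (conservative alpha = .001)."""
    X = np.asarray(X, dtype=float)
    diffs = X - X.mean(axis=0)                      # distance of each case from the centroid
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix
    d2 = np.einsum("ij,jk,ik->i", diffs, S_inv, diffs)
    cutoff = stats.chi2.ppf(1 - alpha, df=X.shape[1])
    return d2, d2 > cutoff                          # D^2 per case, and outlier flags
```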

Assumptions for Multiple Regression

  • Measurement scale:

    • Prefer interval/ratio predictors; dichotomous/categorical predictors can be used in some circumstances.

    • If the outcome is dichotomous, logistic regression is appropriate instead of standard multiple regression.

  • Sample size considerations:

    • Larger samples are better; common rules of thumb include:

    • N > 50 + 8p (where p is the number of predictors).

    • Or 10–15 participants per predictor.

    • Some authors propose N > 104 + p (or similar) depending on the source.

    • Software power analyses (e.g., G*Power) depend on the expected effect size (small/medium/large) and the number of predictors.

  • Outliers and data quality: ensure data are complete (listwise deletion may be used) and consider the impact of missing data.

  • Key regression assumptions to test post-hoc:

    • Linearity: relationships between predictors and outcome are linear.

    • Normality of residuals: residuals are normally distributed.

    • Homoscedasticity: constant variance of residuals across levels of predicted values.

  • Specific diagnostics for multicollinearity:

    • Multicollinearity or singularity is problematic when predictors are highly correlated.

    • Warning indicators: high pairwise correlations between predictors, high VIF, or low tolerance.

    • VIF (Variance Inflation Factor): $\text{VIF}_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ from regressing predictor $x_j$ on the other predictors.

    • Tolerance: $T_j = 1 - R_j^2 = \frac{1}{\text{VIF}_j}$ (a computational sketch follows this list).

    • Common thresholds:

    • VIF between 1 and 5 is usually acceptable; VIF > 5–10 signals potential multicollinearity concerns.

    • Tolerance < 0.2 (or sometimes < 0.1) indicates potential multicollinearity concerns.

  • Residual diagnostics for normality/linearity/homoscedasticity:

    • Residual plots (residuals vs predicted values) are used to assess the three properties collectively.

    • PP plots (probability plots) and histograms can assess normality of residuals.

    • Partial and semi-partial plots can provide additional diagnostic intuition, but are usually secondary to residual plots.
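
  • A sketch of the VIF/tolerance check referenced above (statsmodels ships a helper; the constant column must be included in the design matrix):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def collinearity_check(X):
    """Return (VIF, tolerance) for each predictor column of X (cases x predictors)."""
    exog = sm.add_constant(np.asarray(X, dtype=float))
    results = []
    for j in range(1, exog.shape[1]):              # skip the constant term
        vif = variance_inflation_factor(exog, j)   # 1 / (1 - R_j^2)
        results.append((vif, 1.0 / vif))           # tolerance = 1 - R_j^2 = 1 / VIF
    return results
```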

Regression Diagnostics: Visual and Statistical Checks

  • Residual plots help assess normality, linearity, and homoscedasticity in a single graph when you plot standardized residuals against standardized predicted values.

  • If residuals form a tight, rectangular cluster around zero with no clear pattern, assumptions are met.

  • If residuals fan out (heteroscedasticity) or show a curved pattern (nonlinearity), transformations or alternative modeling may be needed.

  • Normality checks can be done via:

    • Histogram of standardized residuals with a normal curve overlay.

    • PP plot: points close to the diagonal line suggest normality.

  • Casewise diagnostics (in SPSS or similar):

    • Standardized residuals beyond |2| or |3| may indicate potential outliers; more conservative cutoffs (|3.29| for p < .001) are sometimes used.

    • Mahalanobis distance, Cook’s distance, and residual-based plots provide comprehensive outlier and influence assessment.
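
  • A hedged statsmodels sketch of these casewise diagnostics, given an already-fitted OLS model (the cutoffs follow the notes above):

```python
import numpy as np

def casewise_diagnostics(fitted_ols, resid_cutoff=3.29, cooks_cutoff=1.0):
    """Flag cases with large standardised residuals or large Cook's distance."""
    influence = fitted_ols.get_influence()
    std_resid = influence.resid_studentized_internal    # standardised residuals
    cooks_d = influence.cooks_distance[0]                # Cook's distance per case
    flagged = np.where((np.abs(std_resid) > resid_cutoff) | (cooks_d > cooks_cutoff))[0]
    return std_resid, cooks_d, flagged                   # indices of suspicious cases
```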

Practical Example: Album Sales and SPSS Workflow

  • Data scenario (two versions shown in lecture):

    • Example 1: Predict album sales (Y) from advertising budget and airplay (X1, X2).

    • Example 2 (more complex): Predict album sales (Y) from promotion dollars (X1), band attractiveness (X2), and radio airplay (X3).

  • Standard multiple regression (SMR): all predictors entered simultaneously in one step.

  • SPSS workflow (high-level):

    • Analyze → Regression → Linear.

    • Set Dependent variable: album sales; Enter: X1, X2, X3 (three predictors).

    • Options:

    • Statistics: keep model fit, coefficients, partial correlations, semi-partial correlations; include $R^2$, $R^2_{adj}$, and other relevant statistics.

    • Plots: request residual plots (Y = standardized residual, X = predicted) to check assumptions; include PP plots and histograms if desired.

    • Save: request Mahalanobis distance, Cook’s distance, residuals, standardized/unstandardized values for diagnostics.

    • Casewise diagnostics: inspect standardized residuals; flag potential outliers beyond |2| or |3|; compare against Mahalanobis distance and Cook’s distance to determine influence.

    • Output interpretation: correlation table (bivariate correlations among all variables), model summary ($R$, $R^2$, adjusted $R^2$, standard error of the estimate, Durbin-Watson), ANOVA table (F statistic for the overall model), regression coefficients (unstandardised $B$, standardised $\beta$, t-values, p-values).

  • Interpreting the album-sales results (example):

    • Overall model significant (high $F$, $p<0.001$).

    • All three predictors entered are significant (each $t$-test p < .001).

    • Unstandardised coefficients give practical interpretation:

    • $B_1$: effect of $X1$ (e.g., promotion spend) on album sales, holding others constant.

    • $B_2$: effect of $X2$ (e.g., band attractiveness) on album sales, holding others constant.

    • $B_3$: effect of $X3$ (e.g., airplay) on album sales, holding others constant.

    • Partial correlations help understand unique contribution of each predictor; semi-partial (part) correlations show the unique variance each predictor adds to $y$.

  • Model diagnostics for the album-sales example:

    • Multicollinearity check: use correlation table and VIF/Tolerance; expect moderate VIFs (between 1 and 5 is common) and tolerances > 0.2.

    • Linearity, normality, and homoscedasticity checks via residual plots, histograms, PP plots, and a single comprehensive residual plot.

    • If assumptions hold, report a final model with interpretation of each predictor and overall model significance.
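
  • For readers without SPSS, a rough Python analogue of the standard multiple regression step (the file name and column names are my assumptions about an exported copy of the Album Sales data, so adjust them to your version):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed CSV export of the Album Sales dataset with these column names
df = pd.read_csv("album_sales.csv")
model = smf.ols("sales ~ adverts + airplay + attract", data=df).fit()

print(model.summary())     # R^2, adjusted R^2, F and p, coefficients, t-tests
print(model.params)        # unstandardised B values, as in the SPSS coefficients table
```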

Types of Multiple Regression Approaches

  • Standard Multiple Regression (SMR): all predictors entered simultaneously; no prior ordering; interpret unique contributions of each predictor.

  • Hierarchical (Sequential) Regression: predictors entered in a theoretically determined order (blocks or steps). Each step shows the incremental contribution of the added predictors, typically reported as the change in $R^2$ ($\Delta R^2$); see the sketch after this list.

  • Statistical (Stepwise) Regression: predictors selected algorithmically by the software based on their partial correlations with the outcome; predictors are added/removed one at a time according to statistical criteria. This approach is controversial because it is data-driven rather than theory-driven.

  • Practical guidance:

    • If you have a strong theoretical rationale, use hierarchical regression to test incremental value of predictors.

    • If you have no strong theoretical ordering, SMR is often preferred to avoid overfitting or data-driven selection.

    • Use statistical/stepwise with caution; consider theoretical justification and replication.
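
  • A sketch of the hierarchical (blockwise) comparison mentioned above, on invented data: fit block 1, add block 2, and test $\Delta R^2$ with an incremental F test (statsmodels' compare_f_test handles nested OLS models):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data so the sketch runs end-to-end
rng = np.random.default_rng(0)
n = 150
df = pd.DataFrame({"x1": rng.normal(size=n),
                   "x2": rng.normal(size=n),
                   "x3": rng.normal(size=n)})
df["y"] = 1 + 0.4 * df["x1"] + 0.6 * df["x2"] + 0.3 * df["x3"] + rng.normal(size=n)

step1 = smf.ols("y ~ x1", data=df).fit()             # block 1: control variable
step2 = smf.ols("y ~ x1 + x2 + x3", data=df).fit()   # block 2: predictors of interest

delta_r2 = step2.rsquared - step1.rsquared           # Delta R^2 for the second block
f_change, p_change, df_diff = step2.compare_f_test(step1)
print(f"Delta R^2 = {delta_r2:.3f}, F-change = {f_change:.2f}, p = {p_change:.4f}")
```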

What to Consider Before Running a Regression

  • Measurement scale of each variable (nominal/ordinal/interval/ratio) and the nature of the outcome (continuous vs categorical): regression assumptions apply mainly to continuous outcomes; logistic regression is used for categorical outcomes.

  • Adequate sample size: as above, N rules of thumb depend on $p$ (number of predictors) and desired power/effect size.

  • Outliers: plan data cleaning steps (univariate and multivariate outliers) and assess their influence on results.

  • Multicollinearity: plan to check VIF and Tolerance; consider removing or combining predictors that are highly correlated.

  • Linearity, normality, and homoscedasticity: plan to examine residuals and plots to verify assumptions.

  • Reporting: provide correlation tables, model summary, ANOVA, coefficients (unstandardised and standardised), and residual diagnostics; discuss practical significance and limitations.

Key Takeaways for Week 4 Regression Practice

  • Regression extends correlation to prediction; $R^2$ represents the proportion of shared variance between $y$ and predicted $y$.

  • The simple regression model is $\hat{y} = b_0 + b_1 x$; multiple regression extends this to $\hat{y} = b_0 + \sum_{j=1}^{p} b_j x_j$.

  • The model fit is evaluated via $F$ statistics, $R^2$, and residual analyses; the baseline model is the mean model.

  • Coefficients indicate the strength and direction of each predictor’s unique contribution; unstandardised $B$ values are on the original scales; standardised $\beta$ values enable comparison across predictors.

  • Outliers and multivariate outliers must be carefully identified and handled; Mahalanobis distance and Cook’s distance are common diagnostics.

  • Assumptions of linearity, normality, and homoscedasticity must be checked via residual plots, histograms, and PP plots; transformations may be needed to address violations.

  • SPSS workflows provide practical steps for SMR, hierarchical, and stepwise approaches, including saving diagnostics (e.g., Mahalanobis distance, Cook’s distance) for deeper inspection.

  • Readings and resources suggested include Andy Field, Tabachnick & Fidell, and related multivariate statistics texts; the Album Sales SPSS dataset is available for practice.

LaTeX Cheat Sheet (References in Transcript)

  • Simple regression: $\hat{y} = b_0 + b_1 x$

  • Multiple regression: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$

  • Coefficient of determination: $R^2 = r^2$

  • Model vs residual sums of squares: $SST = SSR + SSM$

  • F statistic: $F = \frac{SSM/df_{Model}}{SSR/df_{Residual}}$

  • Multicollinearity indicators: $\text{VIF}_j = \frac{1}{1 - R_j^2}, \quad T_j = 1 - R_j^2$

  • Mahalanobis distance: $D^2 = (x - \mu)^\top S^{-1} (x - \mu)$

  • Threshold: $D^2 > \chi^2_{k, \alpha}$ (e.g., $\alpha = 0.001$)

  • T-distribution for coefficients: $t_j = \frac{B_j}{SE(B_j)}$

  • Residuals: $e_i = y_i - \hat{y}_i$

  • Standardized residuals: typically used with cutoffs like $|\text{standardized residual}| > 2$ or $> 3.29$ for strict criteria

End of notes