Week 4 Lecture: Correlation, Regression, and Multiple Regression
Correlation and Regression Foundations
Correlation coefficient r: a single number that summarizes the association between two variables in bivariate data.
Two key properties of r:
Strength: magnitude of the correlation (how close to -1 or +1).
Direction: positive or negative sign indicates the direction of the relationship.
Perfect correlation: $r = \pm 1$ (rare in psychology).
Zero correlation: $r = 0$.
Regression basics: regression line (line of best fit) through a scatter plot.
Goal: predict a dependent variable $y$ from a predictor $x$.
Notation: $\hat{y}$ = predicted value of $y$; $x$ = predictor (independent variable).
Bivariate regression is the link between correlation and regression.
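To make these ideas concrete, here is a minimal Python sketch (made-up numbers, not lecture data) that computes $r$ with scipy and fits the line of best fit:

```python
# A minimal sketch: correlation coefficient r and the line of best fit.
# The x and y values below are illustrative only.
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])       # predictor (independent variable)
y = np.array([10.0, 14.0, 15.0, 21.0, 22.0, 27.0])  # outcome (dependent variable)

r, p_value = stats.pearsonr(x, y)                   # strength and direction of association
slope, intercept, _, _, _ = stats.linregress(x, y)  # regression (line of best fit)

print(f"r = {r:.2f}, p = {p_value:.3f}")
print(f"y-hat = {intercept:.2f} + {slope:.2f} * x")
```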
Coefficient of Determination and Shared Variance
Coefficient of determination: $R^2$ is the proportion of variance shared between two variables in a bivariate correlation.
If $r = 0.75$, then $R^2 = r^2 = 0.75^2 \approx 0.56$, meaning about 56% of the variance is shared.
Residual variance (unshared variance) = $1 - R^2$ (e.g., 44% in the example).
Cohen’s guidelines for strength (approximate):
Strong: $|r| > 0.5$
Moderate: $|r| \approx 0.3$ to $0.5$
Weak: $|r| \approx 0.1$ to $0.3$
Residual variance: the portion of variance not shared by the two variables; errors due to other variables.
Relationship between shared variance and error: $R^2$ is the shared variance; residual variance is $1-R^2$.
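A quick check of the lecture's worked numbers ($r = 0.75$):

```python
# Worked numbers from the lecture example (r = 0.75).
r = 0.75
r_squared = r ** 2        # shared variance (coefficient of determination)
residual = 1 - r_squared  # unshared (residual) variance
print(f"R^2 = {r_squared:.2f}, residual variance = {residual:.2f}")  # ~0.56 and ~0.44
```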
The Simple Linear Regression Equation and Interpretation
Four components of the regression line (simple linear regression):
$\hat{y}$: predicted dependent variable score.
$x$: raw score on the independent variable.
$b_0$: intercept (value of $y$ when $x = 0$) on the $y$-axis.
$b_1$: slope (regression coefficient) indicating the change in $\hat{y}$ for a one-unit change in $x$.
Equation of the line: $\hat{y} = b_0 + b_1 x$
Example interpretation: if $b_1 = 0.33$, a one-unit increase in $x$ increases $\hat{y}$ by 0.33 units.
Practical example (forward calculation): given $x = 80$, $b_0 = 2$, $b_1 = 0.33$:
$\hat{y} = 2 + 0.33 \times 80 = 2 + 26.4 = 28.4$.
If the scale of $y$ runs from 0 to 35, $\hat{y} = 28.4$ indicates a relatively high predicted outcome.
Interpretation of regression coefficient (slope): positive slope indicates higher predictor scores are associated with higher outcome values; the magnitude reflects the strength of that association in the context of the scale.
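The forward calculation above, as a short check:

```python
# Forward prediction using the intercept and slope from the lecture example.
b0, b1 = 2.0, 0.33   # intercept and slope
x = 80               # raw score on the predictor
y_hat = b0 + b1 * x  # 2 + 26.4 = 28.4
print(y_hat)         # 28.4, near the top of a 0-35 outcome scale
```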
From Bivariate to Multiple Regression
Bivariate correlation describes the strength and direction of the association between two variables.
Bivariate regression predicts $y$ from a single $x$.
Multiple correlation and multiple regression extend this to multiple predictors:
Multiple regression: a single $y$ is predicted from several $x$ values.
The model controls for the other predictors so that each predictor's unique contribution can be isolated.
Goals of multiple regression:
Assess how much variance in $y$ is explained by the set of predictors together.
Determine the unique contribution of each predictor while holding others constant.
Identify the most important predictors (ranked by their independent contribution).
Examples of predictors for a final grade or exam score could include: high school GPA, academic effort, time spent on nonacademic activities, family education, etc.
The Multiple Regression Model and Coefficients
General multiple regression equation: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$
$b_0$: intercept.
$b_j$: regression coefficient for predictor $x_j$ (unstandardised).
There can be as many predictors as needed (the slides denote the number of predictors by $n$; the equations here use $p$).
Coefficients in SPSS-style outputs come in two forms:
Unstandardised coefficient: $B_j$ (raw units; predicted change in $y$ per unit change in $x_j$, holding others constant).
Standardised coefficient: $\beta_j$ (in standard deviation units), enabling comparison across predictors with different scales.
Important note on interpretation:
Unstandardised $B_j$ tells you the actual unit change in $y$ per unit change in $x_j$ when other predictors are held constant.
Standardised $\beta_j$ allows comparison across predictors with different measurement scales.
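A brief sketch, using simulated data and statsmodels (variable names are illustrative only), of how unstandardised $B$ and standardised $\beta$ are obtained: fit the raw-score model for $B$, then refit on z-scored variables for $\beta$.

```python
# Unstandardised B (raw units) vs standardised beta (SD units). Simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(50, 10, n)   # predictor on a wide scale
x2 = rng.normal(5, 1, n)     # predictor on a narrow scale
y = 3 + 0.4 * x1 + 6.0 * x2 + rng.normal(0, 5, n)

X = sm.add_constant(np.column_stack([x1, x2]))
B = sm.OLS(y, X).fit().params                 # unstandardised: raw-unit change in y

z = lambda v: (v - v.mean()) / v.std(ddof=1)  # convert to z-scores
Xz = sm.add_constant(np.column_stack([z(x1), z(x2)]))
beta = sm.OLS(z(y), Xz).fit().params          # standardised: SD-unit change in y

print("B (raw units):   ", np.round(B, 3))
print("beta (SD units): ", np.round(beta, 3))
```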
Model Fit, Sums of Squares, and F Statistic
Model (explained variance) vs residual (unexplained variance):
SST = SSR + SSM, where
SST: total sum of squares (total variance in $y$).
SSR: sum of squared residuals (unexplained variance).
SSM: sum of squares due to the model (explained variance).
Mean squares: $MS_{Model} = SSM/df_{Model}$, $MS_{Residual} = SSR/df_{Residual}$.
F statistic to test model significance: $F = \frac{MS_{Model}}{MS_{Residual}} = \frac{SSM/df_{Model}}{SSR/df_{Residual}}$.
If $F$ is large and $p$-value $< 0.05$, the model explains a significant amount of variance beyond the baseline (mean) model.
Baseline model: a flat line at the mean of $y$; compare model to baseline using $F$ and $R^2$.
The output reports overall model significance (via $F$) and the contribution of each predictor (via t-tests on the coefficients).
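A sketch of the sums-of-squares decomposition and the $F$ test on simulated data (not the lecture's dataset); it reproduces the logic of comparing the model to the baseline mean model:

```python
# SST = SSM + SSR and the F test for overall model fit. Simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 100, 2
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.8, 0.5]) + rng.normal(0, 1, n)

Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # least-squares coefficients
y_hat = Xd @ b

SST = np.sum((y - y.mean()) ** 2)   # total variance around the mean (baseline model)
SSR = np.sum((y - y_hat) ** 2)      # residual (unexplained) variance
SSM = SST - SSR                     # model (explained) variance

F = (SSM / p) / (SSR / (n - p - 1))            # MS_model / MS_residual
p_value = stats.f.sf(F, p, n - p - 1)
print(f"R^2 = {SSM / SST:.2f}, F({p}, {n - p - 1}) = {F:.1f}, p = {p_value:.4f}")
```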
Coefficients and Significance Testing
For each predictor, test whether its coefficient significantly contributes to predicting $y$ while holding others constant.
t-test for coefficient: $t_j = \frac{B_j}{SE(B_j)}$, with associated p-value.
Look for significant predictors (p-value typically < 0.05, or stricter in some contexts).
Interpretation of coefficients in a concrete example (three predictors):
Unstandardised coefficients (example numbers):
Advertising spend: $B_1 = 0.085$ (per unit of currency, $y$ increases by 0.085 units).
Airplay: $B_2 = 3.37$ (each additional unit of airplay increases $y$ by 3.37 units).
Band attractiveness: $B_3 = 11.09$ (each unit increase in attractiveness increases $y$ by 11.09 units).
All three predictors can be significant and contribute unique variance.
Zero-order, partial, and semi-partial correlations (brief):
Zero-order: simple correlation between $x_j$ and $y$.
Partial correlation: correlation between $x_j$ and $y$ controlling for other predictors.
Semi-partial (part) correlation: correlation between $x_j$ and $y$ after removing only the variance $x_j$ shares with the other predictors (nothing is removed from $y$).
Squared semi-partial correlation ($sr^2$) indicates the unique variance in $y$ explained by predictor $x_j$.
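A sketch, on simulated data, of the coefficient t test and of obtaining a squared semi-partial as the drop in $R^2$ when one predictor is removed (statsmodels is used here; SPSS reports these quantities directly):

```python
# Coefficient t tests (t = B / SE(B)) and the squared semi-partial correlation
# for x1, computed as the drop in R^2 when x1 is removed. Simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 150
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2 + 0.6 * x1 + 0.3 * x2 + rng.normal(size=n)

X_full = sm.add_constant(np.column_stack([x1, x2]))
full = sm.OLS(y, X_full).fit()
print("t values:", np.round(full.tvalues, 2))   # B / SE(B) for intercept, x1, x2
print("p values:", np.round(full.pvalues, 4))

reduced = sm.OLS(y, sm.add_constant(x2)).fit()  # model with x1 removed
sr2_x1 = full.rsquared - reduced.rsquared       # unique variance in y due to x1
print(f"squared semi-partial for x1: {sr2_x1:.3f}")
```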
Model Fit and Interpretation Summary
If the overall model is significant (large $F$ and small p-value): the set of predictors collectively helps predict $y$.
If individual predictors are significant (t-tests on $B_j$): each predictor adds unique predictive value while others are held constant.
$R$, $R^2$, and adjusted $R^2$:
$R$ is the multiple correlation between observed $y$ and predicted $y$.
$R^2$ is the proportion of variance in $y$ explained by the predictors.
Adjusted $R^2$ accounts for the number of predictors relative to sample size.
Practical significance: even with a significant model, check for multicollinearity and ensure predictors contribute meaningful unique variance.
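For reference, adjusted $R^2$ can be computed from $R^2$, the sample size $n$, and the number of predictors $p$ with the standard formula (illustrative numbers, not lecture data):

```python
# Adjusted R^2 penalises R^2 for the number of predictors relative to sample size:
# adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
def adjusted_r2(r2: float, n: int, p: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(r2=0.40, n=60, p=3))  # illustrative values only
```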
Outliers and Multivariate Outliers in Regression
Outliers can be univariate (extreme on one variable) or multivariate (extreme across several variables).
Tools for detecting multivariate outliers:
Mahalanobis distance: $D^2 = (x - \mu)^\top S^{-1} (x - \mu)$,
where $x$ is the vector of predictor values for a case, $\mu$ is the centroid (mean vector) of the predictors, and $S$ is the covariance matrix. Threshold: compare $D^2$ to the critical $\chi^2$ value with $k$ degrees of freedom (number of predictors); e.g., $D^2 > \chi^2_{k, \alpha}$ with $\alpha = 0.001$ is a conservative criterion.
Cook’s distance: a case-by-case influence measure; values > 1 often indicate influential cases.
Residual plots and scatter plots help visually identify multivariate outliers and influential cases.
Data cleaning approach:
Inspect outliers in context of raw data (possible data entry errors, malingering, or genuine extreme but valid cases).
Consider removing outliers if justified (cite a source) and re-run analyses to compare results.
If a multivariate outlier is present, decide whether to remove it or keep it (based on its influence and theoretical justification).
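A sketch of both diagnostics on simulated data; the cutoffs ($\chi^2$ at $\alpha = .001$ for $D^2$, 1 for Cook's distance) follow the guidelines above:

```python
# Multivariate-outlier screening: Mahalanobis D^2 vs a chi-square cutoff,
# plus Cook's distance from a fitted regression. Simulated data.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
n, k = 100, 3
X = rng.normal(size=(n, k))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(size=n)

mu = X.mean(axis=0)                                 # centroid of the predictors
S_inv = np.linalg.inv(np.cov(X, rowvar=False))      # inverse covariance matrix
diff = X - mu
D2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)    # Mahalanobis D^2 per case
cutoff = stats.chi2.ppf(0.999, df=k)                # chi-square critical value, alpha = .001
print("cases beyond D^2 cutoff:", np.where(D2 > cutoff)[0])

fit = sm.OLS(y, sm.add_constant(X)).fit()
cooks_d = fit.get_influence().cooks_distance[0]     # Cook's distance per case
print("cases with Cook's D > 1:", np.where(cooks_d > 1)[0])
```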
Assumptions for Multiple Regression
Measurement scale:
Prefer interval/ratio predictors; dichotomous/categorical predictors can be used in some circumstances.
If the outcome is dichotomous, logistic regression is appropriate instead of standard multiple regression.
Sample size considerations:
Larger samples are better; common rules of thumb include:
N > 50 + 8p (where p is the number of predictors).
Or 10–15 participants per predictor.
Some authors propose N > 104 + p (or similar) depending on the source.
Software power analyses (e.g., G*Power) depend on expected effect size (small/medium/large) and the number of predictors.
Outliers and data quality: ensure data are complete (listwise deletion may be used) and consider the impact of missing data.
Key regression assumptions to test post-hoc:
Linearity: relationships between predictors and outcome are linear.
Normality of residuals: residuals are normally distributed.
Homoscedasticity: constant variance of residuals across levels of predicted values.
Specific diagnostics for multicollinearity:
Multicollinearity or singularity is problematic when predictors are highly correlated.
Typical indicators: high pairwise correlations, high VIF, or low tolerance.
VIF (Variance Inflation Factor): $\text{VIF}_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ from regressing predictor $x_j$ on the other predictors (a worked check follows this list).
Tolerance: $T_j = 1 - R_j^2 = \frac{1}{\text{VIF}_j}$.
Common thresholds:
VIF between 1 and 5 is usually acceptable; VIF > 5–10 signals potential multicollinearity concerns.
Tolerance < 0.2 (or sometimes < 0.1) indicates potential multicollinearity concerns.
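A worked VIF/tolerance check on simulated predictors (x3 is deliberately built to overlap with x1, so its VIF is inflated), using statsmodels' variance_inflation_factor helper:

```python
# VIF and tolerance for each predictor in a design matrix with a constant column.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # highly collinear with x1

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for j, name in enumerate(["x1", "x2", "x3"], start=1):  # skip the constant column
    vif = variance_inflation_factor(X, j)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```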
Residual diagnostics for normality/linearity/homoscedasticity:
Residual plots (residuals vs predicted values) are used to assess the three properties collectively.
PP plots (probability–probability plots) and histograms can assess normality of residuals.
Partial and semi-partial plots can provide additional diagnostic intuition, but are usually secondary to residual plots.
Regression Diagnostics: Visual and Statistical Checks
Residual plots help assess normality, linearity, and homoscedasticity in a single graph when you plot standardized residuals against standardized predicted values.
If residuals form a tight, rectangular cluster around zero with no clear pattern, assumptions are met.
If residuals fan out (heteroscedasticity) or show a curved pattern (nonlinearity), transformations or alternative modeling may be needed.
Normality checks can be done via:
Histogram of standardized residuals with a normal curve overlay.
PP plot: points close to the diagonal line suggest normality.
Casewise diagnostics (in SPSS or similar):
Standardized residuals beyond |2| or |3| may indicate potential outliers; more conservative cutoffs (|3.29| for p < .001) are sometimes used.
Mahalanobis distance, Cook’s distance, and residual-based plots provide comprehensive outlier and influence assessment.
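A sketch of these visual checks in Python (simulated data; in SPSS the equivalent plots are requested in the Plots dialog):

```python
# Standardized residuals vs standardized predicted values, residual histogram,
# and a P-P plot for normality. Simulated data.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.gofplots import ProbPlot

rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 2))
y = 1 + X @ np.array([0.6, 0.4]) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
zresid = fit.get_influence().resid_studentized_internal     # standardized residuals
zpred = (fit.fittedvalues - fit.fittedvalues.mean()) / fit.fittedvalues.std()

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(zpred, zresid, s=10)                         # look for fanning or curvature
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="standardized predicted", ylabel="standardized residual")
axes[1].hist(zresid, bins=20)                                # roughly normal?
axes[1].set(title="residual histogram")
ProbPlot(zresid).ppplot(line="45", ax=axes[2])               # points near the diagonal
plt.tight_layout()
plt.show()
```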
Practical Example: Album Sales and SPSS Workflow
Data scenario (two versions shown in lecture):
Example 1: Predict album sales (Y) from advertising budget and airplay (X1, X2).
Example 2 (more complex): Predict album sales (Y) from promotion dollars (X1), band attractiveness (X2), and radio airplay (X3).
Standard multiple regression (SMR): all predictors entered simultaneously in one step.
SPSS workflow (high-level):
Analyze → Regression → Linear.
Set Dependent variable: album sales; Enter: X1, X2, X3 (three predictors).
Options:
Statistics: keep model fit, coefficients, partial correlations, semi-partial correlations; include $R^2$, $R^2_{adj}$, and other relevant statistics.
Plots: request residual plots (Y = standardized residual, X = predicted) to check assumptions; include PP plots and histograms if desired.
Save: request Mahalanobis distance, Cook’s distance, residuals, standardized/unstandardized values for diagnostics.
Casewise diagnostics: inspect standardized residuals; flag potential outliers beyond |2| or |3|; compare against Mahalanobis distance and Cook’s distance to determine influence.
Output interpretation: correlation table (bivariate correlations among all variables), model summary ($R$, $R^2$, adjusted $R^2$, standard error of the estimate, Durbin–Watson), ANOVA table ($F$ statistic for the overall model), regression coefficients (unstandardised $B$, standardised $\beta$, t-values, p-values).
Interpreting the album-sales results (example):
Overall model significant (high $F$, $p<0.001$).
All three predictors entered are significant (each $t$-test p < .001).
Unstandardised coefficients give practical interpretation:
$B_1$: effect of $X_1$ (e.g., promotion spend) on album sales, holding others constant.
$B_2$: effect of $X_2$ (e.g., band attractiveness) on album sales, holding others constant.
$B_3$: effect of $X_3$ (e.g., airplay) on album sales, holding others constant.
Partial correlations help understand unique contribution of each predictor; semi-partial (part) correlations show the unique variance each predictor adds to $y$.
Model diagnostics for the album-sales example:
Multicollinearity check: use correlation table and VIF/Tolerance; expect moderate VIFs (between 1 and 5 is common) and tolerances > 0.2.
Linearity, normality, and homoscedasticity checks via residual plots, histograms, PP plots, and a single comprehensive residual plot.
If assumptions hold, report a final model with interpretation of each predictor and overall model significance.
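A rough Python/statsmodels analogue of the SPSS workflow above, run on simulated stand-in data rather than the actual Album Sales dataset (variable names and coefficients here are illustrative):

```python
# Standard multiple regression: all predictors entered in one step.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 200
df = pd.DataFrame({
    "adverts": rng.uniform(0, 2000, n),   # promotion/advertising spend
    "airplay": rng.integers(0, 60, n),    # radio airplay
    "attract": rng.integers(1, 10, n),    # band attractiveness rating
})
df["sales"] = (50 + 0.08 * df["adverts"] + 3.4 * df["airplay"]
               + 11.0 * df["attract"] + rng.normal(0, 40, n))

model = smf.ols("sales ~ adverts + airplay + attract", data=df).fit()
print(model.summary())   # F test, R^2, adjusted R^2, B, t, p per predictor
```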
Types of Multiple Regression Approaches
Standard Multiple Regression (SMR): all predictors entered simultaneously; no prior ordering; interpret unique contributions of each predictor.
Hierarchical (Sequential) Regression: predictors entered in a theoretically determined order (blocks or steps). Each step shows the incremental contribution of added predictors, typically as changes in $R^2$ (i.e., $\Delta R^2$).
Statistical (Stepwise) Regression: predictors selected algorithmically by the software based on their partial correlations with the outcome; predictors are added/removed one at a time according to statistical criteria. This approach is controversial because it is data-driven rather than theory-driven.
Practical guidance:
If you have a strong theoretical rationale, use hierarchical regression to test incremental value of predictors.
If you have no strong theoretical ordering, SMR is often preferred to avoid overfitting or data-driven selection.
Use statistical/stepwise with caution; consider theoretical justification and replication.
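A minimal sketch of the hierarchical approach on simulated data: fit block 1, add block 2, and test the change in $R^2$ ($\Delta R^2$):

```python
# Hierarchical (sequential) regression: incremental contribution of a second block.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 180
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 0.5 * x1 + 0.3 * x2 + 0.2 * x3 + rng.normal(size=n)

step1 = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()          # block 1
step2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()  # blocks 1 + 2

delta_r2 = step2.rsquared - step1.rsquared
f_change, p_change, _ = step2.compare_f_test(step1)   # F test for the R^2 change
print(f"delta R^2 = {delta_r2:.3f}, F change = {f_change:.2f}, p = {p_change:.4f}")
```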
What to Consider Before Running a Regression
Measurement scale of each variable (nominal/ordinal/interval/ratio) and the nature of the outcome (continuous vs categorical): regression assumptions apply mainly to continuous outcomes; logistic regression is used for categorical outcomes.
Adequate sample size: as above, N rules of thumb depend on $p$ (number of predictors) and desired power/effect size.
Outliers: plan data cleaning steps (univariate and multivariate outliers) and assess their influence on results.
Multicollinearity: plan to check VIF and Tolerance; consider removing or combining predictors that are highly correlated.
Linearity, normality, and homoscedasticity: plan to examine residuals and plots to verify assumptions.
Reporting: provide correlation tables, model summary, ANOVA, coefficients (unstandardised and standardised), and residual diagnostics; discuss practical significance and limitations.
Key Takeaways for Week 4 Regression Practice
Regression extends correlation to prediction; $R^2$ represents the proportion of shared variance between $y$ and predicted $y$.
The simple regression model is $\hat{y} = b_0 + b_1 x$; multiple regression extends to $\hat{y} = b_0 + \sum_{j=1}^{p} b_j x_j$.
The model fit is evaluated via $F$ statistics, $R^2$, and residual analyses; the baseline model is the mean model.
Coefficients indicate the strength and direction of each predictor’s unique contribution; unstandardised $B$ values are on the original scales; standardised $\beta$ values enable comparison across predictors.
Outliers and multivariate outliers must be carefully identified and handled; Mahalanobis distance and Cook’s distance are common diagnostics.
Assumptions of linearity, normality, and homoscedasticity must be checked via residual plots, histograms, and PP plots; transformations may be needed to address violations.
SPSS workflows provide practical steps for SMR, hierarchical, and stepwise approaches, including saving diagnostics (e.g., Mahalanobis distance, Cook’s distance) for deeper inspection.
Readings and resources suggested include Andy Field, Tabachnick & Fidell, and related multivariate statistics texts; the Album Sales SPSS dataset is available for practice.
LaTeX Cheat Sheet (References in Transcript)
Simple regression: $\hat{y} = b_0 + b_1 x$
Multiple regression: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$
Coefficient of determination (bivariate): $R^2 = r^2$
Model vs residual sums of squares: $SST = SSR + SSM$
F statistic: $F = \frac{SSM/df_{Model}}{SSR/df_{Residual}}$
Multicollinearity indicators: $\text{VIF}_j = \frac{1}{1 - R_j^2}, \quad T_j = 1 - R_j^2$
Mahalanobis distance: $D^2 = (x - \mu)^\top S^{-1} (x - \mu)$
Threshold: $D^2 > \chi^2_{k, \alpha}$ (e.g., $\alpha = 0.001$)
t statistic for coefficients: $t_j = \frac{B_j}{SE(B_j)}$
Residuals: $e_i = y_i - \hat{y}_i$
Standardized residuals: typically used with cutoffs like $|\text{standardized residual}| > 2$ or $> 3.29$ for strict criteria
End of notes