Comprehensive Notes on Linear Regression, Correlation, GLM1, T-Tests, Single-Case Statistics, and Ethics
Linear Regression and Model Fit (Regression)
Two variable types drive regression analysis: predictors (also known as independent variables or covariates) and the outcome (also known as the dependent variable or response variable). The fundamental aim of regression is to quantify and predict the effect: specifically, if the predictor increases by one unit, what is the expected change in the outcome? For example, by using an advertising budget as a predictor, one can forecast album sales. The basic form of a simple regression model, which includes only one predictor, is:
\[ Y_i = b_0 + b_1 x_i + \varepsilon_i \]
In this equation, (Y_i) represents the observed outcome for observation (i), (b_0) is the intercept (the predicted value of the outcome when the predictor (x_i) is zero), (b_1) is the slope (the unstandardized measure of the relationship, indicating the change in the outcome for a one-unit increase in the predictor), (x_i) is the value of the predictor for observation (i), and (\varepsilon_i) is the error term or residual, which captures the part of the outcome not explained by the predictor. Using the album sales example from the material, an intercept (b_0 = 134.14) and a slope (b_1 = 0.096) would mean that each one-unit increase in advertising budget is associated with an increase of 0.096 units in predicted album sales, while 134.14 is the predicted baseline sales when no advertising budget is spent.
Estimating the model and goodness of fit
A fitted line is generated through the data points, representing the predicted relationship. Visually, observed values are often depicted in blue, and predicted values in orange. The residuals (or errors) are the vertical distances between the observed values and the corresponding predicted values on the regression line. The primary goal of model estimation is to minimize these residuals. This is typically achieved through Ordinary Least Squares (OLS) estimation, which finds the values of (b0) and (b1) that minimize the sum of the squared residuals. This method ensures that the errors above and below the line do not cancel each other out during minimization by squaring them, and it further penalizes larger errors more heavily, leading to a unique optimal solution for the regression line.
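As a minimal sketch of the OLS idea described above, the closed-form solution for one predictor can be written in a few lines of plain Python. The data values and the names `ols_simple`, `budget`, and `sales` are hypothetical and chosen only for illustration; they are not the album sales data from the material.

```python
def ols_simple(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: cross-products of deviations divided by squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: forces the line through the point (mean_x, mean_y)
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical illustration data (not from the material)
budget = [10, 20, 30, 40, 50]
sales = [135, 137, 136, 139, 138]

b0, b1 = ols_simple(budget, sales)
# Residual = observed value minus predicted value on the fitted line
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(budget, sales)]
print(b0, b1, residuals)
```

With an intercept in the model, the OLS residuals sum to zero, which is why the raw (unsquared) residuals cannot serve as the quantity to minimize.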
Goodness of fit, or how well the model explains the variance in the outcome, is assessed relative to a baseline model, which typically assumes no relationship between the predictor and the outcome, using only the mean of the outcome variable as a predictor. The total sum of squares (SST) represents the total variability in the outcome around its mean. This total can be decomposed into two components, SSM and SSR, so that SST = SSM + SSR. The three quantities are:
SST (Total Sum of Squares) is the total variance in the outcome around its mean. It represents the total variability of the dependent variable, essentially the sum of squared differences between each observed outcome and the overall mean outcome.
SSR (Residual Sum of Squares) is the variance unexplained by the model (the sum of squared residuals). It reflects the error or noise remaining after the model has been fitted, representing how much the observed data points deviate from the regression line.
SSM (Model Sum of Squares) is the variance explained by the model (the sum of squared differences between the predicted values and the outcome mean). It quantifies how much variability in the outcome is captured by the regression model, indicating the improvement over simply using the mean as a predictor.
A larger SSM relative to SST indicates that the model explains a substantial proportion of the total variance, thus signifying a better model fit. The proportion of explained variance is quantified by (R^2), also known as the coefficient of determination:
\[ R^2 = \frac{\text{SSM}}{\text{SST}} \]
(R^2) values range from 0 to 1, where 0 indicates that the model explains no variance, and 1 indicates that the model explains all variance in the outcome. To formally test whether the model provides a statistically meaningful improvement over the baseline, an F-statistic is computed. This F-statistic compares the variance explained by the model to the unexplained variance:
\[ F = \frac{MS_M}{MS_R} \]
where (MS_M) (Mean Square Model) and (MS_R) (Mean Square Residual) are the mean squares, derived by dividing each sum of squares by its degrees of freedom ((p) for the model and (n - p - 1) for residuals, where (p) is the number of predictors and (n) is the sample size). A larger F-value, coupled with a significant p-value (typically p < .05), indicates that the regression model explains a statistically nontrivial amount of variance in the outcome beyond what the baseline model would explain. An alternative (and directly related) descriptor of fit is the square root of (R^2), which, in the case of simple regression (one predictor), equals the absolute value of Pearson’s correlation coefficient (r); the sign of (r) matches the sign of the slope:
\[ |r| = \sqrt{R^2} \]
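The decomposition and the fit statistics above can be sketched as follows for a single predictor. This is an illustrative sketch only: the function name `fit_statistics` and the data are hypothetical, and the degrees of freedom follow the convention of (p) predictors and (n - p - 1) residual degrees of freedom.

```python
def fit_statistics(x, y):
    """Sums of squares, R^2 and F for a simple (one-predictor) regression."""
    n = len(y)
    p = 1  # number of predictors in a simple regression
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    predicted = [b0 + b1 * xi for xi in x]

    sst = sum((yi - mean_y) ** 2 for yi in y)                     # total
    ssr = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))     # residual
    ssm = sum((pi - mean_y) ** 2 for pi in predicted)             # model

    ms_m = ssm / p
    ms_r = ssr / (n - p - 1)
    return {"SST": sst, "SSM": ssm, "SSR": ssr,
            "R2": ssm / sst, "F": ms_m / ms_r}


# Hypothetical illustration data (not from the material)
x = [10, 20, 30, 40, 50]
y = [135, 137, 136, 139, 138]
fit = fit_statistics(x, y)
print(fit)
```

The p-value for F would come from the F distribution with (p) and (n - p - 1) degrees of freedom; that lookup is left out here to keep the sketch dependency-free.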
Model assumptions, residuals, and bias
Linear models, including regression, rely on several key assumptions for the validity of their statistical inference:
Additivity and Linearity: The relationship between the outcome and predictor(s) must be linear and additive. This means that the effect of a predictor is independent of other predictors, and the relationship can be represented by a straight line. This assumption can be visually checked using scatter plots of the outcome vs. each predictor, or residuals vs. predicted values.
Independent Errors: The residuals (errors) should be independent of each other. This implies that there is no systematic pattern or correlation among the errors, meaning the error for one observation does not predict the error for another. The Durbin–Watson statistic is used to assess this; a value of 2 indicates no autocorrelation, values less than 2 suggest positive autocorrelation (errors are correlated in the same direction), and values greater than 2 suggest negative autocorrelation (errors are correlated in opposite directions). Violation of this assumption is common in time-series data.
Homoscedasticity: The variance of the residuals should be constant across all levels of the predictor(s). This means the spread of residuals should be roughly the same along the fitted regression line. Visual inspection of residual plots (residuals vs. predicted values or residuals vs. independent variables) should show a random scatter with no discernible pattern (e.g., a fanning-out or fanning-in shape). Violation of this (heteroscedasticity) can lead to biased standard errors and incorrect p-values, making the model’s coefficients appear more or less significant than they truly are. Alternatives like weighted least squares estimation or robust standard errors can address heteroscedasticity, and formal tests like the Breusch-Pagan test or White test can also be used.
Normally Distributed Errors: The residuals should be normally distributed. While the Central Limit Theorem can mitigate issues for large sample sizes, severe non-normality can affect the accuracy of p-values and confidence intervals, particularly in smaller samples. This can be assessed visually through Q-Q plots (which should show residuals falling approximately along a straight line) or histograms of residuals, and formally through tests like the Shapiro-Wilk test or Kolmogorov-Smirnov test, though caution is advised against over-reliance on formal tests with large sample sizes.
Residuals are crucial for diagnosing bias and assessing overall model adequacy. Visual inspection of residual plots (e.g., residuals vs. predicted values, or Q-Q plots for normality) is essential. Large residuals can indicate outliers (data points that deviate substantially from the general trend), which can disproportionately influence the regression line and distort the model coefficients. Influential cases are observations that, when removed, substantially change the model parameters; they are often, but not always, outliers. Measures such as Cook’s distance are used to detect these cases; values greater than 1 are often considered problematic and may warrant further investigation, such as rechecking data entry or considering robust estimation methods. Other diagnostics include leverage (hat values) and the related Mahalanobis distance, which quantify how far an observation's predictor values are from the mean of the predictor values, and DFFit or DFBeta, which assess how much the predicted value or a specific coefficient changes when an observation is removed.
To make residuals comparable across different models or datasets, standardized residuals (z-scores) are used. These express each residual in standard deviation units, so that, if the errors are normally distributed, extreme values can be judged against the standard normal distribution. Common trouble thresholds, based on the properties of the standard normal distribution, include:
(|\text{Resids}| > 3.29) (extremely unlikely, corresponding to (p < .001))
(|\text{Resids}| > 3.0) (far in the tail of the distribution)
(|\text{Resids}| > 2.58) (rare, corresponding to (p < .01); expected for only about 1% of cases)
(|\text{Resids}| > 1.96) (corresponding to (p < .05); expected for only about 5% of cases)
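The thresholds above can be applied mechanically once residuals are rescaled. The sketch below assumes one common definition of a standardized residual, namely the raw residual divided by the square root of the residual mean square; the function names and the example values are hypothetical.

```python
import math


def standardized_residuals(residuals, n_predictors=1):
    """Divide each residual by the residual standard deviation sqrt(SSR / (n - p - 1))."""
    n = len(residuals)
    ms_r = sum(e ** 2 for e in residuals) / (n - n_predictors - 1)
    return [e / math.sqrt(ms_r) for e in residuals]


def flag_summary(z_scores):
    """Proportion of cases whose absolute standardized residual exceeds each cutoff."""
    n = len(z_scores)
    return {cut: sum(abs(z) > cut for z in z_scores) / n
            for cut in (1.96, 2.58, 3.29)}


# Hypothetical standardized residuals (not from the material)
example_z = [0.5, -2.0, 3.5, 1.0]
print(flag_summary(example_z))
```

Comparing the returned proportions with the roughly 5%, 1%, and 0.1% expected under normality gives a quick indication of whether the model has more extreme cases than it should.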
If influential cases are detected, a researcher should consider diagnosing them with Cook's distance, examining leverage points, and potentially rechecking the data for entry errors. If the points are valid, robust regression techniques might be more appropriate.
Generalization and cross-validation
Beyond merely fitting the model to the current dataset, it is critically important to assess whether the model generalizes well to other, unseen samples or the broader population. This ensures the model is not overfitting the noise in the current data. Key concepts for generalization include ensuring the previously mentioned assumptions (additivity/linearity, independence of errors, homoscedasticity, and normality of residuals) hold. Cross-validation offers a practical and robust approach to evaluate generalizability. This technique involves splitting the data into distinct sets: a training set (used to fit or estimate the model parameters) and a testing set (used to assess the model's predictive performance on new data). Common cross-validation schemes include K-fold cross-validation (where data is split into K subsets, and the model is trained K times, each time leaving one subset out for testing) or leave-one-out cross-validation (a special case of K-fold where K equals the sample size, using N-1 observations for training and 1 for testing).
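The K-fold scheme described above can be sketched for a one-predictor model as follows. This is a minimal illustration under stated assumptions: `fit_line` and `k_fold_mse` are hypothetical names, the data are invented, and mean squared prediction error is used as the performance measure (other measures are equally possible).

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - b1 * mean_x, b1


def k_fold_mse(x, y, k=5, seed=0):
    """Average out-of-sample mean squared error across k folds."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)       # random but reproducible split
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsets
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # Fit on the training folds only
        b0, b1 = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        # Evaluate on the fold that was left out
        mse = sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / k


# Hypothetical data: a linear trend with a small alternating disturbance
x = list(range(20))
y = [3 + 0.5 * xi + (-1) ** xi for xi in x]
print(k_fold_mse(x, y, k=5))
print(k_fold_mse(x, y, k=len(x)))  # leave-one-out: k equals the sample size
```

A cross-validated error much larger than the in-sample error is a sign that the model is fitting noise in the training data.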
When reporting model fit, it is advisable to include adjusted (R^2). Unlike (R^2), which never decreases as more predictors are added (even irrelevant ones), adjusted (R^2) accounts for the number of predictors in the model and the sample size. It provides a more realistic estimate of the population (R^2), thus reflecting model complexity in the population context. The adjusted (R^2) formula is:
\[ R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} \]
where (n) is the sample size and (p) is the number of predictors in the model. Adjusted (R^2) can be lower than (R^2) and can even be negative if the model provides a very poor fit. It is generally preferred over (R^2) when comparing models with different numbers of predictors because it penalizes for model complexity, providing a more honest estimate of a model's explanatory power in the population.
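The adjustment is simple enough to check by hand or with a one-line function. In the sketch below, the function name `adjusted_r2` is hypothetical, and the sample size of 200 is an assumption made purely for illustration (the notes do not state the sample size).

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Assumed sample size (hypothetical) with three predictors
print(adjusted_r2(0.665, n=200, p=3))

# With a weak model, many predictors, and few cases, the value can fall below zero
print(adjusted_r2(0.01, n=20, p=5))
```

The penalty grows as (p) approaches (n), which is why adding predictors to a small sample can lower adjusted (R^2) even while (R^2) itself rises.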
Two-or-more predictors; hierarchical modeling and multicollinearity
With two or more predictors, the model becomes multiple linear regression. While it remains a linear model, the interpretation of the coefficients ((b)-values) becomes more complex. Each coefficient now represents the predicted change in the outcome for a one-unit increase in that specific predictor, while holding all other predictors in the model constant. This is known as a partial effect.
Predictor selection can be guided by theoretical frameworks, prior empirical knowledge, or by carefully designed hierarchical regression. In hierarchical regression, predictors are added to the model in blocks based on a predetermined order (e.g., control variables first, then theoretical predictors). This allows for assessing the unique contribution of each block to the explained variance. Conversely, stepwise procedures (forward, backward, or hybrid selection) are generally discouraged in theory-driven research due to their tendency to capitalize on chance, produce unstable models across samples, and yield biased parameter estimates.
When adding predictors in a multiple regression model, it is essential to assess the improvement in model fit. This is often done by examining the change in (R^2) (denoted (\Delta R^2)) from one step to the next in a hierarchical model. A significant (\Delta R^2) indicates that the newly added predictors explain a statistically significant amount of additional variance. For example, if (R^2) increases from 0.335 to 0.665 after adding new predictors, this suggests a substantial improvement.
A crucial diagnostic in multiple regression is checking for multicollinearity, which occurs when predictors in the model are highly correlated with one another. High multicollinearity poses several problems:
It makes it difficult to ascertain the unique contribution of each predictor, impeding the interpretation of individual (b)-values because their effects are confounded by shared variance.
It can lead to inflated standard errors of the regression coefficients, making them appear non-significant (higher p-values) even when they might have real effects, thereby increasing the risk of Type II errors.
It reduces the reliability and stability of the (b)-values across different samples, meaning small changes in data can lead to large changes in coefficient estimates.
Common diagnostics for multicollinearity include the Variance Inflation Factor (VIF) and tolerance.
VIF quantifies how much the variance of an estimated regression coefficient is inflated due to collinearity. A VIF value of 1 indicates no correlation between the predictor and other predictors, while higher values indicate increasing collinearity. A common caution is to avoid VIF values greater than 10 (some researchers suggest 5), as a VIF of 10 implies that the variance of the coefficient is inflated tenfold, and its standard error by a factor of about (\sqrt{10} \approx 3.2), due to collinearity.
Tolerance is the reciprocal of VIF (i.e., Tolerance = (1 / \text{VIF})), which equals (1 - R_j^2), where (R_j^2) is obtained by regressing that predictor on the other predictors. Consequently, a common caution is to avoid tolerance values less than 0.1 (or 0.2), as this indicates that 90% (or 80%) or more of the variance of a predictor is explained by the other predictors in the model.
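The relationship between tolerance, VIF, and the inflation of the standard error can be sketched as below. The function name is hypothetical, and `r2_j` stands for the (R^2) obtained when one predictor is regressed on all the other predictors (a quantity introduced here for illustration).

```python
import math


def tolerance_and_vif(r2_j):
    """Return (tolerance, VIF, standard-error inflation factor) for one predictor.

    r2_j: R^2 from regressing this predictor on the remaining predictors.
    """
    tolerance = 1 - r2_j
    vif = 1 / tolerance
    se_inflation = math.sqrt(vif)  # standard errors scale with the square root of VIF
    return tolerance, vif, se_inflation


# A predictor that is 90% explained by the others sits right at the usual cutoffs
print(tolerance_and_vif(0.90))
# An uncorrelated predictor has tolerance 1 and VIF 1
print(tolerance_and_vif(0.0))
```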
An illustrative multiple-predictor example from the material involved advertising budget, airplay, and image rating. The regression equation would look like this:
\[ \text{Sales}_i = b_0 + b_1 \text{Advertising}_i + b_2 \text{Airplay}_i + b_3 \text{Image}_i + \varepsilon_i \]
In such an example, adding predictors like airplay and image to an initial model with only advertising budget would typically increase (R^2) (e.g., from 0.335 to 0.665), indicating more variance explained. The adjusted (R^2) would provide a more conservative estimate of this explanatory power in the population. The diagnostic table for such a model would typically report multicollinearity diagnostics with VIF values below 10 and tolerances above 0.2, suggesting that collinearity is not a major concern and the individual coefficients can be reliably interpreted.
Model summaries, interpretation, and reporting
In practical academic and research reporting, a comprehensive presentation of regression results typically includes:
Parameter estimates: The intercept ((b_0)) and the slope(s) ((b_1, b_2, \dots)), which represent the unstandardized partial effects of each predictor.
Standard Errors (SE): These quantify the precision of the coefficient estimates. Smaller standard errors indicate more precise estimates and less variability in the estimated coefficient across different samples.
t-statistics: Calculated by dividing each coefficient by its standard error ((t = b / SE_b)). These are used to test the null hypothesis that the true population coefficient is zero (i.e., that the predictor has no effect on the outcome when controlling for other predictors).
p-values: Associated with the t-statistics, indicating the probability of observing a coefficient as extreme as, or more extreme than, the one calculated, assuming the null hypothesis (no effect) is true. A small p-value (e.g., < .05) leads to rejection of the null hypothesis.
Confidence Intervals (CIs): Typically 95% CIs for each regression coefficient. These provide a range of plausible values for the true population parameter. If the CI for a coefficient does not include zero, the effect is considered statistically significant (at the specified confidence level). Bootstrap-based confidence intervals (e.g., 95% BCa, or Bias-Corrected and accelerated CIs) are often preferred as they are more robust to violations of normality assumptions and can be particularly valuable for non-normal distributions or when dealing with small sample sizes.
Beyond individual coefficients, one often reports the overall model fit indices:
(R^2): The coefficient of determination, indicating the proportion of variance in the outcome explained by the predictor(s).
Adjusted (R^2): A more robust estimate of population (R^2), accounting for model complexity and sample size, offering a more conservative and generalizable measure of fit.
F-statistic: For the overall model, along with its degrees of freedom and associated p-value, indicating whether the model as a whole explains a significant portion of variance compared to a baseline model.
Additionally, comprehensive reporting requires including model diagnostics such as checks for independence of errors (Durbin-Watson statistic, typically reported), normality of residuals (e.g., Q-Q plots, Shapiro-Wilk test results, or a statement about visual inspection), and homoscedasticity (e.g., residual plots, Levene's test for variance homogeneity, or Breusch-Pagan test results). When these assumptions hold and the model demonstrates a good fit, the population value for a slope ((b)) is likely to fall within the reported confidence interval. If a genuine positive relationship exists, this interval should not cross zero, indicating that the effect is reliably non-zero and in the predicted direction. Similarly, for a negative relationship, the interval should be entirely below zero.
ANOVA-style GLM1 and the regression view of ANOVA
ANOVA (Analysis of Variance), often referred to as GLM1 (General Linear Model, type 1), can be entirely understood and conducted through the lens of regression. In this framework, you model the outcome variable as a function of a categorical predictor, which is entered into the regression model using dummy variables (also called indicator variables or binary variables). For instance, in a one-way between-subjects ANOVA with three groups (e.g., a control group, a 30-minute intervention group, and a 15-minute intervention group), the categorical predictor (Group) is transformed into dummy variables. If there are (k) groups, (k-1) dummy variables are created.
In this example, two dichotomous dummy variables might be created (e.g., dummy1 and dummy2). One group is designated as the baseline or reference group, and its dummy variables are typically coded as 0. For instance, the control group could be the baseline, meaning dummy1=0 and dummy2=0 for participants in the control group. The other groups then have one of their respective dummy variables coded as 1 to indicate membership:
Control Group: dummy1 = 0, dummy2 = 0
30-minute Group: dummy1 = 1, dummy2 = 0
15-minute Group: dummy1 = 0, dummy2 = 1
The regression equation for this ANOVA setup would be:
\[ Y_i = b_0 + b_1 \text{dummy1}_i + b_2 \text{dummy2}_i + \varepsilon_i \]
In this regression framework,
The intercept ((b_0)) directly corresponds to the mean of the baseline (reference) group (e.g., the mean of the control group).
The coefficient ((b_1)) represents the difference between the mean of the 30-minute group and the mean of the control group.
The coefficient ((b_2)) represents the difference between the mean of the 15-minute group and the mean of the control group.
Thus, testing the significance of (b_1) or (b_2) is equivalent to performing a specific contrast between those groups and the baseline. The overall ANOVA tests the fit of the group means to the grand mean (i.e., whether there are any significant differences among the group means), utilizing the same decomposition of variance (SST, SSM, SSR) as regression. The F-statistic ((F = MS_M / MS_R)) for the overall model indicates whether the group means differ significantly from each other, reflecting the overall effect of the categorical predictor.
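The correspondence between dummy-coded regression coefficients and group means can be checked numerically. The sketch below fits the three-group model by solving the normal equations with a small hand-written Gaussian elimination routine so that it needs no external libraries. The scores, the helper names `solve_linear_system` and `ols`, and the group labels are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def ols(design, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical scores for the three groups (not from the material)
groups = {
    "control": [3, 2, 1, 1, 4],
    "30min": [5, 2, 4, 2, 3],
    "15min": [7, 4, 5, 3, 6],
}
# Dummy codes per group: (dummy1, dummy2); the control group is the baseline
codes = {"control": (0, 0), "30min": (1, 0), "15min": (0, 1)}

design, outcome = [], []
for name, scores in groups.items():
    for score in scores:
        design.append([1, *codes[name]])  # leading 1 is the intercept column
        outcome.append(score)

b0, b1, b2 = ols(design, outcome)
means = {name: sum(s) / len(s) for name, s in groups.items()}
print(b0, b1, b2)
print(means)
```

In this sketch the intercept reproduces the control mean, and each slope reproduces the difference between one intervention group's mean and the control mean.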
Effect-size measures for ANOVA include:
Eta-squared ((\eta^2)): Defined as
\[ \eta^2 = \frac{\text{SSM}}{\text{SST}} \]
It represents the proportion of total variance explained by the categorical predictor. It is a descriptive measure and is sometimes used as an analogue to () in regression. However, it is known to overestimate the effect size in the population, especially in smaller samples. Conventional guidelines suggest: .01 (small), .06 (medium), .14 (large).
Omega-squared ((\omega^2)): Provides a less biased estimate of the proportion of explained variance in the population, making it more suitable for generalization than eta-squared. The formula is:
\[ \omega^2 = \frac{\text{SSM} - df_M \times MS_R}{\text{SST} + MS_R} \]
The example reported in the material indicates a medium-to-large effect according to conventional guidelines. The illustration also shows how adding predictors in a regression context increases explained variance, with changes in (R^2) (that is, (\Delta R^2)) and adjusted (R^2) reflecting this improvement (e.g., (R^2) rising from 0.335 to 0.665, and adjusted (R^2) from 0.331 to 0.66). Bootstrap confidence intervals may also be reported for regression coefficients and differences in means when viewed through the GLM framework. Cross-validation is also relevant here for assessing the generalizability of group differences across samples.
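Both effect sizes can be computed directly from the sums of squares of a one-way design. The following is a sketch under the usual definitions (model degrees of freedom equal to the number of groups minus one, residual degrees of freedom equal to the sample size minus the number of groups); the function name and the scores are hypothetical.

```python
def anova_effect_sizes(groups):
    """Return (eta_squared, omega_squared) for a one-way between-subjects design."""
    all_scores = [score for group in groups for score in group]
    n = len(all_scores)
    grand_mean = sum(all_scores) / n

    ss_t = sum((score - grand_mean) ** 2 for score in all_scores)
    # Model sum of squares: group size times squared distance of group mean from grand mean
    ss_m = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_r = ss_t - ss_m

    df_m = len(groups) - 1
    ms_r = ss_r / (n - len(groups))

    eta_sq = ss_m / ss_t
    omega_sq = (ss_m - df_m * ms_r) / (ss_t + ms_r)
    return eta_sq, omega_sq


# Hypothetical scores for three groups (not from the material)
scores = [[3, 2, 1, 1, 4], [5, 2, 4, 2, 3], [7, 4, 5, 3, 6]]
eta_sq, omega_sq = anova_effect_sizes(scores)
print(eta_sq, omega_sq)
```

Omega-squared comes out smaller than eta-squared here, consistent with its role as the less biased (more conservative) population estimate.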
Correlation (Bivariate and Partial)
Correlation quantifies the strength and direction of a linear relationship between two (or more) variables. For two continuous variables, the Pearson correlation coefficient ((r)) is the most widely used measure, defined as:
\[ r = \frac{\text{cov}_{xy}}{s_x s_y} \]
where (\text{cov}_{xy}) is the covariance between variables (X) and (Y), and (s_x) and (s_y) are their respective standard deviations. In example calculations, if one set (Orange) had a variance of 9, another (Blue) had a variance of 2.8, and the covariance was 4.25, the resultant (r = 4.25 / (\sqrt{9} \times \sqrt{2.8}) \approx .85), indicating a strong positive linear relationship.
Variance, covariance, and interpretation
Variance (e.g., (s_x^2)) is the average of the squared deviations from the mean for a single variable (in a sample, the sum of squared deviations divided by (N - 1)). It quantifies the spread or dispersion of data points around their average, indicating how scattered the data points are from their mean.
Covariance (e.g., (\text{cov}_{xy})) is the average product of the deviations of two variables from their respective means (again dividing by (N - 1) in a sample). It indicates the degree to which two variables change together. A positive covariance means that as one variable generally increases, the other also tends to increase (deviations occur in the same direction). A negative covariance means that as one variable increases, the other tends to decrease (deviations occur in opposite directions). The magnitude of covariance, however, is dependent on the scale of the variables, making it difficult to interpret universally.
The Pearson correlation coefficient (r) standardizes covariance, making it interpretable regardless of the variables' scales. It ranges from -1 to 1:
(r = +1): Perfect positive linear relationship (as one variable increases, the other increases proportionally).
(r = -1): Perfect negative linear relationship (as one variable increases, the other decreases proportionally).
(r = 0): No linear relationship (variables show no consistent linear pattern together).
Higher absolute values of (r) indicate stronger linear associations, while values closer to zero suggest weaker or no linear association. Note that correlation only captures linear relationships, and a zero correlation does not necessarily mean no relationship, only no linear relationship; a strong non-linear relationship might still exist.
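The chain from deviations to covariance to (r) can be sketched in plain Python. The function names and the small data sets are hypothetical; the only figures taken from these notes are the variances of 9 and 2.8 and the covariance of 4.25 from the worked example.

```python
import math


def covariance(x, y):
    """Sample covariance: mean product of deviations, dividing by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


def pearson_r(x, y):
    """Covariance standardized by the two standard deviations."""
    # covariance(x, x) is the variance of x, so its square root is the SD
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Summary-statistic version, using the figures quoted in the worked example
r_summary = 4.25 / (math.sqrt(9) * math.sqrt(2.8))
print(round(r_summary, 2))

# Hypothetical data: a perfect linear relationship
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))

# Hypothetical data: a perfect but non-linear (U-shaped) relationship
curved_x = [-2, -1, 0, 1, 2]
curved_y = [v ** 2 for v in curved_x]
print(pearson_r(curved_x, curved_y))
```

The U-shaped case illustrates the caveat above: the variables are perfectly related, yet the linear correlation is zero.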
Assumptions and robustness; bootstrapping
Pearson correlation relies on several assumptions:
Linearity: The relationship between the two variables should be linear. This is typically assessed by visually inspecting a scatter plot of the two variables.
Normality: The joint distribution of the two variables should be bivariate normal. This implies that each variable individually is normally distributed, and that for any given value of one variable, the values of the other variable are normally distributed. While strictly required for exact p-values, Pearson (r) is fairly robust to mild deviations, especially with large sample sizes.
No Outliers: Outliers can disproportionately influence the correlation coefficient, potentially leading to an over- or under-estimation of the true relationship. Visual inspection of scatter plots is crucial for identifying outliers.
When these assumptions are severely violated, or when the data contain significant outliers, nonparametric alternatives are generally preferred because they are less sensitive to distributional assumptions and extreme values:
Spearman’s rho ((r_s)): This is the Pearson correlation coefficient calculated on the ranks of the data rather than the raw data. It is appropriate for ordinal data or when the presence of outliers or non-normal distributions might distort Pearson’s (r). It captures monotonic relationships (where variables move in a consistent direction, but not necessarily linearly), making it useful for both linear and non-linear monotonic relationships.
Kendall’s tau ((\tau)): Another rank-based correlation coefficient, often suitable for small samples or when there are many tied ranks (i.e., multiple observations have the same rank) in the data. It is generally more robust to sample size issues and specific distributional shapes than Spearman's rho, and it is also based on counting concordant and discordant pairs.
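Spearman's rho, as described above, is simply Pearson's (r) applied to ranks. The sketch below assigns average ranks to tied values, which is one common convention; the function names and data are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of the data."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical monotonic but non-linear data
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(spearman_rho(x, y), pearson_r(x, y))
```

Because the relationship is monotonic, the rank-based coefficient reaches 1 while Pearson's (r) on the raw values stays below 1.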
Point-biserial correlation is a special case of Pearson correlation used when one variable is dichotomous (coded as 0/1, e.g., male/female) and the other is continuous. The underlying statistical principle is identical to Pearson correlation but adapted for a binary group division, reflecting the strength of association between group membership and the continuous variable.
Partial correlation ((r_{xy \cdot z})) measures the linear relationship between two variables (X and Y) after controlling for (or statistically removing the influence of) one or more other variables (Z). This helps isolate the unique association between X and Y, making it useful in multivariate analysis to rule out confounding effects.
For robust inference, particularly when assumptions are in doubt or samples are small, bootstrapping can be used to generate confidence intervals for correlation coefficients (e.g., BCa bootstrap CIs), providing more reliable estimates that do not rely on strong distributional assumptions.
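For a single control variable, the partial correlation can be obtained from the three pairwise correlations using the standard first-order formula. The sketch below assumes that formula; the function name and the input correlations are hypothetical.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation between X and Y, controlling for Z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator


# If Z is unrelated to both X and Y, controlling for it changes nothing
print(partial_correlation(0.5, 0.0, 0.0))

# If the X-Y correlation is fully accounted for by Z, the partial correlation vanishes
print(partial_correlation(0.6, 0.8, 0.75))
```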
Reporting correlation results
Reporting correlation results adheres to specific conventions, typically including the correlation coefficient, its confidence interval, and its statistical significance (p-value). It is common practice to report exact p-values where possible rather than only a threshold such as (p < .01), or at least to indicate the level of significance. When bootstrapping is used, bootstrapped 95% BCa confidence intervals should be specified. For presenting multiple correlations, a correlation matrix table is customary, summarizing correlations with their CIs and the relevant sample sizes (N, especially if N varies across correlations) in a clear format.
In APA style guidelines:
Decimals for correlation coefficients are reported without a leading zero (e.g., .44 instead of 0.44).
Confidence intervals are typically placed in square brackets beneath the coefficients or provided in a separate row.
Many reports present the table with off-diagonal correlations only (to avoid redundancy, as (r_{xy} = r_{yx}) and (r_{xx} = 1)) and often include a footer legend for significance markers (e.g., * for (p < .05), ** for (p < .01), *** for (p < .001)).
Examples from the material illustrate proper reporting:
"Exam performance correlated with exam anxiety, () [(-0.56), (-0.30)], (p < .001)." (This shows a moderate negative correlation).
"Time spent revising correlated with exam anxiety, () [(-0.86), (-0.49)], (p < .001)." (This indicates a strong negative correlation).
Reporting correlations: specific notes
When presenting a table of correlations, the correlation matrix is typically structured such that the diagonal contains the correlation of a variable with itself (which is always 1) or descriptive statistics like means and standard deviations. The upper or lower diagonal then presents the pairwise correlation coefficients. A clear footer should explain any abbreviations, significance legends, and the number of observations. Bootstrap confidence intervals are particularly valuable for quantifying uncertainty in correlation estimates, as they are less reliant on the assumption of bivariate normality compared to traditional parametric confidence intervals. They are constructed by resampling the data many times with replacement and calculating the correlation coefficient for each resample, then using the distribution of these resampled statistics to form the confidence interval.
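The resampling procedure described above can be sketched as follows. Note that this sketch uses the simpler percentile method rather than the BCa method named in the text, because BCa requires additional bias and acceleration corrections; the function names, the seed, and the data are all hypothetical.

```python
import math
import random


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci_r(x, y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for Pearson's r (not BCa)."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        # Resample cases (pairs) with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # r is undefined when a resample has no variance
        estimates.append(pearson_r(xs, ys))
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Hypothetical paired scores (not from the material)
revision_hours = [4, 11, 27, 53, 4, 22, 16, 21, 25, 18]
anxiety = [86, 88, 70, 61, 89, 60, 81, 75, 69, 82]

ci_low, ci_high = bootstrap_ci_r(revision_hours, anxiety)
print(pearson_r(revision_hours, anxiety), ci_low, ci_high)
```

Fixing the seed makes the interval reproducible; in a report, the number of resamples and the interval type should be stated alongside the result.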
Independent Samples t-Test and Related Effect Sizes
The independent samples t-test is a classic statistical procedure used to compare the means of two distinct and independent groups on a single continuous measurement. For example, one might compare the test scores of a treatment group with those of a control group. From a regression perspective, this two-group difference can be conceptualized as a simple linear regression using a dichotomous predictor (coded 0 for one group and 1 for the other).
The regression model for an independent samples t-test is:
\[ Y_i = b_0 + b_1 x_i + \varepsilon_i \]
where (x_i) is the dummy-coded predictor (e.g., 0 for Group 1, 1 for Group 2). In this model:
The intercept ((b_0)) represents the mean of the reference group (the group coded 0).
The slope ((b_1)) directly represents the difference between the mean of Group 2 (coded 1) and the mean of Group 1 (coded 0). Thus, testing the significance of (b_1) in the regression model is statistically equivalent to performing an independent samples t-test.
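The equivalence can be checked numerically by computing the classic pooled-variance t-statistic and the t-statistic for the slope of a 0/1-coded regression on the same data. This is an illustrative sketch: the function names and scores are hypothetical, and equal variances are assumed (the pooled form of the test).

```python
import math


def pooled_t(group0, group1):
    """Classic independent samples t-statistic with pooled variance."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = sum(group0) / n0, sum(group1) / n1
    ss0 = sum((v - m0) ** 2 for v in group0)
    ss1 = sum((v - m1) ** 2 for v in group1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    se_diff = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
    return (m1 - m0) / se_diff


def regression_t(group0, group1):
    """Return (b0, b1, t for b1) from regressing the outcome on a 0/1 dummy."""
    x = [0] * len(group0) + [1] * len(group1)
    y = list(group0) + list(group1)
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    ss_r = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(ss_r / (n - 2) / sxx)  # residual mean square over Sxx
    return b0, b1, b1 / se_b1


# Hypothetical well-being scores (not from the material)
control = [3, 4, 5, 6, 7, 5]
treatment = [5, 6, 7, 8, 9, 8]

b0, b1, t_reg = regression_t(control, treatment)
t_classic = pooled_t(control, treatment)
print(b0, b1, t_reg, t_classic)
```

Both routes use (n - 2) degrees of freedom, so the p-values agree as well.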
Results and effect sizes
A typical report of an independent samples t-test includes:
The mean and standard deviation for each group.
The t-statistic, degrees of freedom (df), and the p-value. The t-statistic essentially measures the size of the difference relative to the variability within the samples.
An effect size measure, which quantifies the magnitude of the difference between the groups, independent of sample size. Common effect sizes for two independent groups are:
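Cohen’s d (pooled SD): This is a widely used effect size that expresses the mean difference in units of the pooled standard deviation. It is calculated as:
\[ d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]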
Cohen's conventional guidelines for interpreting (d) are: 0.2 (small effect, practically negligible), 0.5 (medium effect, noticeable), 0.8 (large effect, substantial and clearly discernible).
Glass’s delta ((\Delta)): Similar to Cohen's d, but it uses the standard deviation of the control group only in the denominator, which is particularly useful when the control group is considered the standard or when variances are substantially unequal and pooling is inappropriate:
\[ \Delta = \frac{\bar{X}_1 - \bar{X}_2}{s_{\text{control}}} \]
where (s_{\text{control}}) is the SD of the control group. This is appropriate when one group is a control and the variance of the control group is more stable or representative of the population without intervention.
Hedges’ g: A bias-corrected version of Cohen's d, specifically designed to correct for the slight upward bias in Cohen's d when sample sizes are small (typically < 20). Hedges' g is typically preferred in meta-analyses because of this bias adjustment, providing a slightly more conservative and accurate estimate of population effect size.
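The three effect sizes differ only in the standardizer and in the small-sample correction, which the sketch below makes explicit. It assumes the common approximate correction factor for Hedges' g, (1 - 3 / (4(n_1 + n_2) - 9)); the function name and scores are hypothetical.

```python
import math


def two_group_effect_sizes(treatment, control):
    """Return (Cohen's d, Glass's delta, Hedges' g) for two independent groups."""
    n1, n2 = len(treatment), len(control)
    m1, m2 = sum(treatment) / n1, sum(control) / n2
    var1 = sum((v - m1) ** 2 for v in treatment) / (n1 - 1)
    var2 = sum((v - m2) ** 2 for v in control) / (n2 - 1)

    # Cohen's d: mean difference over the pooled standard deviation
    s_pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    d = (m1 - m2) / s_pooled

    # Glass's delta: mean difference over the control group's SD only
    delta = (m1 - m2) / math.sqrt(var2)

    # Hedges' g: d shrunk by an approximate small-sample correction factor
    g = d * (1 - 3 / (4 * (n1 + n2) - 9))
    return d, delta, g


# Hypothetical scores (not from the material)
treatment = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]
d, delta, g = two_group_effect_sizes(treatment, control)
print(d, delta, g)
```

When the two groups have equal variances, as in this example, Cohen's d and Glass's delta coincide, and Hedges' g is slightly smaller than d.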
In the example data provided, an independent samples t-test comparing a treatment group to a control group on reported well-being (Outcome) would be reported with its t-statistic, degrees of freedom, and p-value. If the p-value falls below .05 and the treatment group's mean well-being is higher, this tells us that the treatment group scored significantly higher than the control group. The effect size, Cohen's d, would then quantify this difference (described in the material as a 'medium-to-large' effect), providing a measure of the practical importance of the difference independent of sample size.