Notes on Design, GLM, and Model Comparison for Exam Preparation

Design Types: Between-Subjects vs Within-Subjects

  • Core distinction: how the predictor variable (often denoted as $x$) is applied or observed across participants.

  • Between-subjects design: each participant belongs to one group only; groups may be randomly assigned (e.g., experimental vs control) or observed (e.g., male vs female, different races, religions).

    • Example: random assignment to exposed vs non-exposed conditions.

    • Implication: differences in the dependent variable (denoted $y$) across groups are attributed to the between-group factor.

  • Within-subjects design: the same participants experience multiple conditions or are measured repeatedly over time; participants can be in multiple contexts across time or states.

    • Example: longitudinal tracking of beliefs (atheist → Christian → Hindu) or repeated measurements of the same person’s emotion regulation across different moments.

    • ABA or single-case designs in some clinical contexts illustrate switching between contexts over time.

  • Practical cue to identify within-subjects vs between-subjects:

    • If the dependent variable is measured multiple times on the same person, you are likely in a within-subjects framework.

    • If the predictor is a single characteristic that assigns individuals to one of several groups (e.g., gene present vs absent, treatment vs control), you are likely in a between-subjects design.

  • Important caveat: some clearly between-subjects examples (e.g., presence vs absence of a gene) can be analyzed with within-subjects approaches if you structure the data as repeated measures (but this changes the model and assumptions).

  • Why this distinction matters:

    • The type of statistical model you can validly estimate depends on whether designs are between-subjects or within-subjects.

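    • Mis-specifying the design (e.g., treating within-subject data as between-subjects) violates the independence assumption, so standard errors are wrong and the actual Type I error rate no longer matches the nominal alpha level (it is often inflated).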

    • Many standard techniques assume between-subjects data; within-subjects data often require special handling (e.g., repeated measures, multilevel modeling).

  • Practical takeaway: always assess your data structure first, confirm whether observations are independent or correlated within subjects, and choose the analysis accordingly.
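
One rough way to check whether repeated observations are correlated within subjects is to see how strongly two measurements from the same people move together. The sketch below uses made-up scores and a hypothetical helper `pearson_r` (neither comes from the lecture); a correlation well away from zero suggests the observations should not be treated as independent.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical data: the same five participants measured at two time points.
time1 = [10, 12, 15, 18, 20]
time2 = [11, 14, 15, 19, 22]

r = pearson_r(time1, time2)
print(f"correlation between repeated measures: {r:.2f}")
```

With these invented scores the correlation is close to 1, which is the kind of within-subject dependence that a between-subjects analysis would ignore.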

Signs of Within-Subjects Data and How They Shape Analysis

  • A key sign: the dependent variable is measured more than once per subject (or across time points) within the same study.

  • Example scenario: tracking emotion regulation across different belief states for the same participants (atheist → Christian → Hindu) and measuring at each state.

  • In contrast, a between-subjects example might be comparing two groups defined by a fixed attribute (e.g., gene presence) where each participant is observed once per study.

  • When you have time- or condition-related changes within participants, you should consider within-subject designs and appropriate analyses (e.g., repeated-measures ANOVA, repeated measures regression, multilevel modeling).

  • Special case: obvious, simple categorical division (e.g., gene present vs absent) may be treated as between-subjects, but if you collect repeated measurements, you can still model within-subject variation (increasing complexity, but potentially more power).

  • Simple practical check for within-subjects vs between-subjects:

    • If you have multiple observations per subject, do not ignore the within-subject correlation; use methods that account for the nesting (e.g., multilevel modeling).
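
The practical check above can be sketched in a few lines, assuming the data are in long format (one row per measurement) with a subject identifier on each row. The function names and the example IDs are hypothetical.

```python
from collections import Counter

def observations_per_subject(subject_ids):
    """Count how many rows of long-format data each subject contributes."""
    return Counter(subject_ids)

def looks_within_subjects(subject_ids):
    """Return True if at least one subject appears more than once."""
    counts = observations_per_subject(subject_ids)
    return any(count > 1 for count in counts.values())

# Hypothetical long-format data: one entry per measurement occasion.
repeated = ["s1", "s1", "s1", "s2", "s2", "s2"]
single = ["s1", "s2", "s3", "s4"]

print(looks_within_subjects(repeated))  # True
print(looks_within_subjects(single))    # False
```

A `True` result only flags the data structure; choosing between repeated-measures ANOVA, repeated measures regression, or multilevel modeling still depends on the research question.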

Statistical Model: What It Is and What It Does

  • A statistical model is an algebraic statement that represents how the dependent variable $y$ is related to one or more independent variables (predictors) such as $x$, $z$, etc.

  • General idea: the model explains variability in $y$ across individuals by attributing parts of that variability to predictors and to residual noise.

  • Notation and concepts used in the lecture:

    • Dependent variable: $y$ (the outcome you want to explain).

    • Independent variables: $x$, $z$, etc. (predictors that may cause or explain variation in $y$).

    • Coefficients: Greek-letter parameters (e.g., $\beta_0, \beta_1, \beta_2, \dots$) that quantify the size and direction of effects.

    • Predicted value (fitted value): $\hat{y}$, the model’s estimate of $y$ for each observation.

    • Residual: $e = y - \hat{y}$, the difference between observed and predicted values.

  • Two interpretive views:

    • Causation view: predictors are treated as causal factors that explain changes in $y$.

    • Explanatory view: predictors account for variability in $y$ and are used to predict new observations.

  • Typical representation (linear):

    • General linear model form: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + e$

    • In text: the dependent variable is a linear combination of predictors plus an error term.

  • The role of coefficients (population parameters) and their estimation:

    • Coefficients (e.g., $\beta_1, \beta_2, \ldots$) represent population effects (e.g., average difference in $y$ per unit change in the predictor).

    • We estimate coefficients using data to obtain $\hat{\beta}_1, \hat{\beta}_2, \ldots$.

    • The sign and magnitude indicate direction and size of effects (e.g., a negative $\hat{\beta}_1$ suggests that increasing the predictor reduces $y$).

  • Why we need hypotheses about coefficients:

    • We test whether coefficients are different from zero (no effect) for population-level claims.

    • Example: a coefficient of $-5$ would imply that treatment reduces the outcome by 5 units on average, all else equal.

  • Notation about variables:

    • Latin letters (e.g., $x, z$) denote observed or design variables.

    • Greek letters (e.g., $\alpha, \beta, \gamma$) denote population coefficients; we estimate them from data.

  • What the model is trying to do: explain variability in $y$ across individuals by attributing portions to the predictors and the residual noise.
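
To make the notation concrete, here is a minimal sketch of how a fitted value and a residual are computed once coefficient estimates are in hand. The intercept of 50, the slopes of -5 and 2, and the single participant are all invented for illustration; `predict` and `residual` are hypothetical helper names.

```python
def predict(intercept, slopes, predictors):
    """Fitted value: intercept plus the sum of slope * predictor."""
    return intercept + sum(b * x for b, x in zip(slopes, predictors))

def residual(observed, fitted):
    """Residual: observed value minus fitted value."""
    return observed - fitted

# Hypothetical coefficient estimates: b0 = 50, b1 = -5 (for x), b2 = 2 (for z).
b0, slopes = 50.0, [-5.0, 2.0]

# One hypothetical participant: x = 1, z = 1, observed y = 49.
y_hat = predict(b0, slopes, [1, 1])
e = residual(49.0, y_hat)
print(y_hat, e)  # 47.0 2.0
```

The residual of 2 is the part of this participant's score that the two predictors leave unexplained.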

Coefficients, Predictions, and the Role of the Residual

  • Coefficients are population parameters; we estimate them with sample data.

  • Predicted values (fit) for each observation: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots$ (computed from the estimated coefficients).

  • Residuals: $e_i = y_i - \hat{y}_i$ for each observation $i$.

  • Fitting objective: ordinary least squares (OLS) selects coefficient estimates that minimize the sum of squared residuals (SSE):

    • $SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2$

  • Model goodness: smaller SSE indicates a better fit to the data; but zero residuals are neither typical nor desirable because they imply an overly complex model with little parsimony.

  • Why not overfit to zero residuals:

    • A model with many predictors can drive SSE toward zero but becomes overly complex and less generalizable (parsimony principle).

  • The role of nested models and model comparison:

    • A restricted model (null) imposes constraints (e.g., certain coefficients fixed to zero or set to known values).

    • A full model (alternative) includes additional predictors or parameters.

    • The two models are nested if you can obtain the full model by adding parameters to the restricted model.

    • Comparison aims to determine whether the additional parameters significantly improve fit beyond what the restricted model explains.
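
The least-squares idea in this section can be illustrated for the simplest case, one predictor plus an intercept, where the minimizing estimates have a closed form. The data are invented and the helper names `ols_simple` and `sse` are hypothetical; this is a sketch, not the lecture's own computation.

```python
def ols_simple(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared residuals for a given intercept and slope."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data.
x = [0, 1, 2, 3, 4]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

b0, b1 = ols_simple(x, y)
best = sse(x, y, b0, b1)                         # SSE at the OLS estimates
nudged = sse(x, y, b0, b1 + 0.5)                 # SSE after moving the slope away
intercept_only = sse(x, y, sum(y) / len(y), 0.0) # restricted model: slope fixed at 0

print(b0, b1, best, nudged, intercept_only)
```

Any change to the estimates raises the SSE, and the restricted (intercept-only) model cannot have a lower SSE than the full model that also estimates the slope.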

Restricted vs Full Models: A Concrete Example

  • Setup: a simple between-subjects example with two groups defined by a predictor (e.g., treatment vs control).

  • The restricted model (null):

    • Example form: $y = \beta_{R0} + 0 \cdot x + e$ or, in a simple mean form, the model that uses a fixed mean (e.g., $\mu = 100$ for all).

    • In the lecture, for a one-group scenario, the restricted model used the fixed mean (e.g., 100) as the predicted value for all observations.

  • The full model (alternative):

    • Example form: $y = \beta_0 + \beta_1 x + e$, where $x$ is a group indicator (0 = control, 1 = experimental).

    • In the lecture, a specific numeric example used: $y = -5x + 2z + e$ with an added predictor $z$ (e.g., sex) to illustrate how adding predictors changes predictions and residuals.

  • Data arrangement example (10 subjects):

    • First 5 subjects in control (x = 0), next 5 in experimental (x = 1); each row records the observed value of the dependent variable $y$ and the observed predictor values (e.g., sex).

    • For the restricted model, the predicted $\hat{y}$ is computed using only the restricted form (e.g., $\hat{y} = -5 \cdot x$), so each observation gets one of two possible predictions ($x = 0$ maps to $0$, $x = 1$ maps to $-5$).

    • Residuals under the restricted model: $e_i = y_i - \hat{y}_i$.

    • For the full model, include the additional predictor (e.g., $z$) and compute new predictions, leading to new residuals which can be smaller if the extra predictor explains additional variance.

  • How residuals illustrate model improvement:

    • If the full model yields smaller residuals (closer to zero across observations) than the restricted model, it suggests that the extra predictor(s) provide meaningful explanatory power.

    • A clear example from the lecture showed that residuals decreased when moving from the restricted model to the full model, illustrating better fit.

  • Summary of the intuition:

    • You compare two models using their residuals: the full model should not be worse than the restricted model; it should be strictly better (lower SSE) if the added predictor is meaningful.

    • The next step (in later lectures) is to quantify the difference with a formal statistic and p-value, taking into account degrees of freedom and the nested nature of the models.
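
The comparison described in this section can be sketched with the two prediction rules from the notes, $\hat{y} = -5x$ (restricted) and $\hat{y} = -5x + 2z$ (full). The ten $y$ values and the 0/1 coding of $z$ below are invented for illustration; they are not the lecture's data.

```python
# Hypothetical data for 10 subjects: first 5 control (x = 0), next 5 experimental (x = 1).
x = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
z = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]          # hypothetical 0/1 coding of the extra predictor
y = [0, 2, 1, 2, -1, -3, -5, -2, -6, -3]    # invented outcome scores

def restricted_prediction(xi):
    """Restricted model from the notes: y-hat = -5 * x."""
    return -5 * xi

def full_prediction(xi, zi):
    """Full model from the notes: y-hat = -5 * x + 2 * z."""
    return -5 * xi + 2 * zi

# Residual = observed - predicted, for every subject under each model.
res_r = [yi - restricted_prediction(xi) for xi, yi in zip(x, y)]
res_f = [yi - full_prediction(xi, zi) for xi, zi, yi in zip(x, z, y)]

sse_r = sum(e ** 2 for e in res_r)
sse_f = sum(e ** 2 for e in res_f)
print(sse_r, sse_f)  # 28 4
```

With these made-up scores the full model's residuals shrink toward zero, so its SSE is much lower; whether such a drop is statistically significant is the question the formal test answers.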

How to Do Model Comparison: From Residuals to Inference

  • Core idea: you cannot rely on eyeballing residuals; you need a single numeric summary to compare models.

  • Stepwise process described in the lecture:

    • Fit the restricted model and compute residuals for each observation.

    • Fit the full model (with additional predictor(s)) and compute residuals for each observation.

    • Compute the sum of squared residuals for each model, $SSE_R$ and $SSE_F$, where

    • $SSE_R = \sum_{i=1}^n (y_i - \hat{y}_{R,i})^2$

    • $SSE_F = \sum_{i=1}^n (y_i - \hat{y}_{F,i})^2$

  • Compare via an F-type test for nested models (the standard approach in the Fisher framework):

    • Define degrees of freedom:

    • $df_R = n - p_R$ and $df_F = n - p_F$, where $p_R$ and $p_F$ are the number of estimated parameters in the restricted and full models respectively (including the intercept).

    • The F-statistic for nested models can be expressed as:

    • $F = \frac{(SSE_R - SSE_F) / (df_R - df_F)}{SSE_F / df_F}$

    • Under the null hypothesis that the additional parameters do not improve fit, $F$ follows an $F(df_R - df_F, df_F)$ distribution.

    • A significant F indicates that the full model provides a significantly better fit than the restricted model.

  • The idea of model comparison in the Fisher tradition:

    • You can view any statistical test as a model comparison: the restricted model represents the null; the full model represents the alternative.

    • The data decide which model fits better by comparing residual sums of squares (or equivalent information criteria).

  • Proportional Increase in Error (PIE): a heuristic measure discussed in the lecture:

    • Aims to quantify the change in error when moving from restricted to full model.

    • A common informal formulation (to be formalized later) is

    • $PIE = \frac{SSE_F - SSE_R}{SSE_R}$

    • Interpretation:

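    • If PIE > 0, SSE increased when moving to the full model (the fit got worse); with nested least-squares fits this cannot happen, because $SSE_F \le SSE_R$.

    • If PIE < 0, SSE decreased (the full model fit improved); the further below zero, the larger the proportional reduction in error.

    • Note: the exact formulation of PIE varies across texts. Some define it in the opposite direction, as $(SSE_R - SSE_F) / SSE_F$, the proportional increase in error incurred by imposing the restriction, which is the quantity that the F ratio scales by $df_F / (df_R - df_F)$. The main point is that model comparison hinges on whether the full model meaningfully improves fit beyond the restricted model.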

  • Summary of the approach:

    • Nested model comparison uses $SSE_R$ and $SSE_F$, $df_R$ and $df_F$, and yields an F-test to determine statistical significance of added predictors.

    • The practical goal is to choose the model that yields a significantly better fit with a reasonable amount of complexity (parsimony).
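
The F ratio for nested models translates directly into code. The sketch below is a minimal version with a hypothetical function name `nested_f` and invented inputs; it stops at the test statistic, because turning it into a p-value needs the F distribution from a statistics library, which is left out here.

```python
def nested_f(sse_r, sse_f, n, p_r, p_f):
    """F statistic for comparing a restricted model with a full model.

    sse_r, sse_f: sums of squared residuals for the restricted and full models
    n: number of observations
    p_r, p_f: number of estimated parameters in each model
    Returns (F, numerator df, denominator df).
    """
    df_r = n - p_r
    df_f = n - p_f
    numerator = (sse_r - sse_f) / (df_r - df_f)
    denominator = sse_f / df_f
    return numerator / denominator, df_r - df_f, df_f

# Hypothetical inputs: 22 observations, intercept-only vs intercept plus one predictor.
f_stat, df_num, df_den = nested_f(sse_r=120.0, sse_f=80.0, n=22, p_r=1, p_f=2)
print(f"F({df_num}, {df_den}) = {f_stat:.2f}")  # F(1, 20) = 10.00
```

The numerator is the drop in error per added parameter; the denominator is the full model's remaining error per residual degree of freedom.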

The General Linear Model (GLM) and Why It Matters

  • The general linear model (GLM) provides a unifying framework for many different analyses:

    • Ordinary least squares (OLS) regression (continuous predictors or dummy-coded categorical predictors).

    • Repeated measures regression and multilevel modeling (nested data, random effects).

    • Structural equation modeling (SEM) and path analysis are related under GLM formalisms.

    • Categorical data analysis builds on the same principles (e.g., logistic regression is a generalized linear model, which extends the general linear model with a link function and a non-normal error distribution).

  • The GLM perspective: many statistical analyses are specific instances (special cases) of the general linear framework.

  • Why this is helpful for students:

    • It connects seemingly different analyses under a single conceptual umbrella.

    • It clarifies how to translate research questions into algebraic models and how to interpret coefficients, predictions, and residuals.

  • Key linear-model components you will repeatedly encounter:

    • Dependent variable: $y$

    • Predictors: $x_1, x_2, \dots$ (could be dummy-coded for groups or continuous)

    • Intercept (baseline level): $\beta_0$ (or, in some contexts, $\mu$ for the mean in a one-group design)

    • Coefficients: $\beta_j$ represents the effect of predictor $x_j$ on $y$, controlling for other predictors

    • Error term: $e$ accounts for variability not explained by the predictors

  • A concise, commonly used GLM expression (as a reminder):

    • $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + e, \quad E[e] = 0, \; \mathrm{Var}(e) = \sigma^2 I$

  • Practical tip: when you see a model written with a Greek letter, think of it as the population parameter; the data-driven analysis estimates that parameter (e.g., $\hat{\beta}_1$). The symbols $x, z$ are the observed variables (predictors) in your study.
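
As one illustration of the "special case" idea, a two-group comparison can be written as a regression on a dummy-coded predictor: the intercept estimate equals the control-group mean and the slope estimate equals the difference between group means. The scores below are invented and `ols_simple` is a hypothetical helper.

```python
def ols_simple(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# Hypothetical scores for two groups of four participants each.
control = [10.0, 12.0, 11.0, 13.0]
treated = [7.0, 6.0, 8.0, 5.0]

# Dummy coding: 0 = control, 1 = treated.
x = [0] * len(control) + [1] * len(treated)
y = control + treated

b0, b1 = ols_simple(x, y)
control_mean = sum(control) / len(control)
treated_mean = sum(treated) / len(treated)

print(b0, control_mean)                 # intercept matches the control mean
print(b1, treated_mean - control_mean)  # slope matches the mean difference
```

Here the slope comes out at -5: the treated group scores 5 units lower on average, which is exactly what a comparison of the two group means would report.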

One-Group Design: A Simple Starting Point for Model Comparison

  • The one-group (single-group) design is the simplest case and provides intuition for restricted vs full model comparisons.

  • Research question: estimate the mean of a subpopulation and test whether it differs from a known value (the test value).

    • Example from the lecture: ADHD kids’ IQ compared to a standard mean of 100.

  • Setup details:

    • Only one group is observed (all subjects come from the same population, so their scores differ only through random variation).

    • The restricted model uses a fixed mean as the prediction for all observations (e.g., the null says mean = 100).

    • The full model estimates the mean from the data (i.e., the sample mean) and uses it as the prediction for all observations.

  • Notational framing for the one-group case:

    • Restricted model: $\hat{y}_i^{(R)} = \mu_0$ (e.g., 100).

    • Full model: $\hat{y}_i^{(F)} = \hat{\mu}$, where $\hat{\mu}$ is the sample mean (the best estimate of the population mean under the unrestricted model).

  • What the data show (illustrative example):

    • If the sample mean equals the test value (e.g., 100), the restricted and full models yield identical predictions and identical residuals.

    • If the sample mean differs (e.g., 104), the full model will produce a smaller SSE than the restricted model; whether that reduction is large enough to count as evidence that the true mean differs from the test value is what the formal test decides.

  • How the comparison is made numerically:

    • Compute residuals for the restricted model: $e_i^{(R)} = y_i - \mu_0$, and then the sum of squares: $SSE_R = \sum_{i=1}^n \big(e_i^{(R)}\big)^2$.

    • Compute residuals for the full model: $e_i^{(F)} = y_i - \hat{\mu}$, and then the sum of squares: $SSE_F = \sum_{i=1}^n \big(e_i^{(F)}\big)^2$.

    • Compare $SSE_R$ vs $SSE_F$ to determine if the data significantly favor the full model over the restricted one.

  • Connecting to familiar test statistics:

    • The standard path is to translate this into a formal test (e.g., one-sample t-test or F-test in a regression framework) with degrees of freedom based on sample size (n) and the number of estimated parameters.

    • In the tutorial, the instructor notes that the next step would involve degrees of freedom, the F ratio, and computing a p-value, tying the example to a formal hypothesis test.

  • Conceptual takeaway:

    • In a one-group design, the key question is whether the population mean differs from a stated value, tested by comparing a restricted (mean fixed) model to a full (mean estimated from data) model.
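
The one-group comparison can be worked through numerically. The six IQ scores below are invented (chosen so the sample mean is 104, echoing the illustrative value above), and the test value of 100 comes from the notes. The sketch assumes the restricted model estimates no parameters ($p_R = 0$, because the mean is fixed in advance) while the full model estimates one ($p_F = 1$).

```python
from math import sqrt

# Hypothetical IQ scores for one group; the test value is 100.
iq = [96, 110, 104, 99, 111, 104]
mu0 = 100
n = len(iq)
mu_hat = sum(iq) / n  # sample mean, the full model's prediction

# Sum of squared residuals under each model.
sse_r = sum((y - mu0) ** 2 for y in iq)      # restricted: predict 100 for everyone
sse_f = sum((y - mu_hat) ** 2 for y in iq)   # full: predict the sample mean

# Degrees of freedom: n minus the number of estimated parameters.
df_r = n - 0
df_f = n - 1
f_stat = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)

# One-sample t statistic computed the usual way, for comparison.
sd = sqrt(sse_f / (n - 1))
t_stat = (mu_hat - mu0) / (sd / sqrt(n))

print(mu_hat, sse_r, sse_f)
print(f"F = {f_stat:.3f}, t^2 = {t_stat ** 2:.3f}")
```

The two printed statistics agree: in the one-group case the model-comparison F is the square of the one-sample t, which is how this framework connects to the familiar test.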

General Principles: Why the Right Model and the Right Test Matter

  • The design (between vs within) dictates which statistical models and tests are appropriate; mis-specifying can lead to invalid inferences.

  • The role of residuals:

    • Residuals quantify the unexplained variance after accounting for predictors.

    • Smaller residuals indicate better predictive accuracy, but zero residuals are neither expected nor desirable (parsimony concerns).

  • The nested-model framework provides a coherent way to compare hypotheses:

    • Restricted model embodies the null hypothesis (constrained parameters).

    • Full model embodies the alternative hypothesis (additional parameters allowed).

    • The difference in SSEs, adjusted for degrees of freedom, yields a test statistic (F) and a p-value.

  • The broader GLM philosophy:

    • Most statistical analyses in psychology, education, and related fields can be framed as GLMs or special cases (e.g., repeated measures, SEM, categorical data analysis).

    • This framing supports a unified approach to estimation, prediction, and hypothesis testing.

Terminology Recap: Symbols, Notation, and their Meaning

  • y: dependent variable (outcome).

  • x, z: observed predictor variables (could be binary indicators or continuous).

  • $\beta_j$: population coefficient for predictor $x_j$ (the effect size of $x_j$).

  • $\hat{y}$: predicted value from the model given the estimated coefficients.

  • e: residual/error term (unexplained portion of y).

  • $\mu$: mean in a one-group design (intercept-like parameter in the restricted case).

  • SSE: sum of squared residuals, a measure of model misfit.

  • $df_R$, $df_F$: degrees of freedom for restricted and full models, respectively.

  • F: test statistic used to compare nested models, taking their degrees of freedom into account.

  • PIE: Proportional Increase in Error (informal model-comparison quantity discussed in the lecture).

Practical Takeaways for Exam Preparation

  • Always distinguish between between-subjects and within-subjects designs; this determines the appropriate analysis and how you interpret effects.

  • Understand the components of a linear model: intercept (baseline), predictors with their coefficients, and the residual term.

  • Recognize the difference between a restricted model (null) and a full model (alternative); remember that nested models allow formal comparison via SSE and F-statistics.

  • Be comfortable with the concept of residuals, predicted values, and how residual sums of squares reflect model fit.

  • In one-group designs, the core exercise is comparing a fixed test value to the sample mean using a restricted vs full-model framework, leading to a test statistic and p-value.

  • Know that the general linear model unifies many analyses, including between/within designs, repeated measures, SEM, and categorical data methods; if you learn GLM well, you can connect many techniques across courses.

  • Quick formula cheatsheet (to memorize and apply in problems):

    • General linear model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + e$

    • Predicted: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots$

    • Residual: $e_i = y_i - \hat{y}_i$

    • SSE: $SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2$

    • F-statistic for nested models: $F = \frac{(SSE_R - SSE_F) / (df_R - df_F)}{SSE_F / df_F}$ with $df_R = n - p_R, \; df_F = n - p_F$