Reliability in Psychological Measurement: Test–Retest, Alternate Forms, Internal Consistency, and Inter-Rater Reliability

Reliability coefficients in psychology (overview)

  • In classical test theory, observed score = true score + error. The observed score we see on a test is an estimate of the true score, with error contaminating the measurement.

  • Reliability coefficients are, essentially, correlation coefficients. They describe how consistently a test measures a construct across repeated measurements or parallel forms.

  • A higher reliability coefficient (closer to 1) means less error and more stable measurement. Correlation coefficients can also describe validity relationships, but here we treat reliability as the correlation between repeated or parallel measurements of the same construct (illustrated in the simulation sketch after this list).

  • Because reliability is about consistency, multiple methods exist to estimate it (test–retest, alternate forms, internal consistency, interrater reliability).
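
To make the observed-score model and the idea of reliability as a correlation concrete, here is a minimal Python sketch (NumPy assumed; the sample size and variances are illustrative, not from the notes). It simulates observed scores as true score plus random error and shows that the correlation between two administrations drops as error variance grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                  # hypothetical sample size
true_score = rng.normal(100, 15, n)       # latent true scores (T)

def observe(true, error_sd):
    """Observed score X = T + E, with E ~ Normal(0, error_sd)."""
    return true + rng.normal(0, error_sd, n)

for error_sd in (5, 15, 30):
    x1 = observe(true_score, error_sd)    # first administration
    x2 = observe(true_score, error_sd)    # second administration
    r = np.corrcoef(x1, x2)[0, 1]         # reliability estimate
    print(f"error SD = {error_sd:2d}  ->  r(X1, X2) = {r:.2f}")
```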

Test–retest reliability

  • Definition: administer the same test to the same group at two different times (Time 1 and Time 2) and compute the correlation between the two sets of scores (a computational sketch follows at the end of this section). Let X1 represent Time 1 scores and X2 represent Time 2 scores.

  • Ideal outcome: r ≈ 1.0, indicating high stability of the trait or ability over the interval.

  • Rationale: Some constructs (e.g., personality traits, stable abilities) are relatively stable, so scores should correlate across occasions.

  • Interval (time between administrations): there is no universal rule; depends on the test and construct. Interval should be chosen to balance potential changes in the trait with practical considerations.

    • Short interval: reduces the chance of real change but may inflate learning or memory effects.

    • Long interval: increases chances of real change or external influences, which can lower the observed reliability.

  • Assumptions:

    • Test takers have not changed in the target skill/trait between administrations (e.g., typing speed should be similar across days unless there’s a genuine change).

    • Test administrations are the same: same instructions, same environment, no major intervening life events that could affect performance.

  • Potential sources of error:

    • Differences in administration (instructions, pacing, environment).

    • Life events between testing sessions (breakups, stress) that affect performance.

    • Practice effects: familiarity with the test content can artificially improve scores on Time 2, not because the trait increased but because the test was practiced.

    • Measurement error in general; error reduces reliability.

  • Practice effects: prior exposure to the test can boost performance on the retest, leading to misleading conclusions about true ability or knowledge. For example, after taking a math achievement test once, a score gain on the retest may reflect test familiarity rather than a genuine increase in math ability.

  • Appropriateness of test–retest:

    • If the trait/ability is stable and not expected to change over the interval, test–retest can be useful for estimating reliability.

    • If the construct can change with learning or intervention, the correlation may reflect both stability and change, which complicates interpretation.

    • Longer intervals can degrade reliability due to forgetting or genuine change; short intervals can inflate reliability due to practice effects.

  • Practical implication: there is no one-size-fits-all interval; the choice depends on the test, the construct, and the intended use of the reliability estimate.
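
As a concrete illustration of the test–retest procedure above, here is a minimal Python sketch (NumPy assumed; the score vectors are hypothetical): the same eight people are measured at Time 1 and Time 2, and the reliability estimate is the Pearson correlation between the two occasions. A high correlation combined with a systematic mean increase at Time 2 is the classic signature of a practice effect.

```python
import numpy as np

# Hypothetical scores for the same eight people on two occasions
time1 = np.array([12, 18, 25, 30, 22, 15, 28, 20], dtype=float)
time2 = np.array([14, 20, 26, 31, 24, 17, 30, 22], dtype=float)

# Test–retest reliability: Pearson correlation across occasions
r_tt = np.corrcoef(time1, time2)[0, 1]

# A mean gain at Time 2 alongside a high correlation suggests a
# practice effect: rank order is stable, but scores shift upward.
print(f"r_tt = {r_tt:.2f}, mean change = {np.mean(time2 - time1):+.1f}")
```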

Alternate forms reliability

  • Definition: develop two parallel but different forms of the same test (Form A and Form B) that measure the same construct and are designed to be equivalent.

  • Purpose: reduce practice effects and memory/retention of item content that could inflate reliability if the same items were reused.

  • Procedure: administer both forms to the same participants, with the order of forms counterbalanced (ideally via random assignment) to control for order effects; see the sketch at the end of this section.

    • Example: split participants into two groups; Group 1 takes Form A first, then Form B; Group 2 takes Form B first, then Form A. Random assignment helps prevent bias.

  • Reliability estimate: compute the correlation between Form A scores and Form B scores across participants. A high correlation suggests that the two forms are equivalent in measuring the same construct.

  • Target correlation: a value of r ≈ .80 or higher is typically considered strong for alternate forms.

  • Assumptions:

    • Forms A and B are truly equivalent in terms of psychometric properties (difficulty, discrimination, etc.).

    • The content on both forms targets the same underlying construct.

  • Considerations: alternate forms primarily address practice effects and order effects; they require careful construction to ensure equivalence and to minimize form-specific biases.

  • When to use: particularly valuable when practice effects are a concern or when repeated measurement is necessary but using the exact same items would bias results.
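
A minimal Python sketch of the counterbalanced alternate-forms design (NumPy assumed; the simulated abilities, error variances, and group split are illustrative). Every participant takes both forms, a random half takes Form A first, and the reliability estimate is the correlation between Form A and Form B scores across all participants.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
ability = rng.normal(50, 10, n)            # latent construct

# Each participant takes BOTH forms; the forms are built to be equivalent
form_a = ability + rng.normal(0, 4, n)
form_b = ability + rng.normal(0, 4, n)

# Counterbalancing: a random half takes Form A first, the rest Form B first
a_first = rng.permutation(n) < n // 2

# Alternate-forms reliability: correlation between the two forms
r_ab = np.corrcoef(form_a, form_b)[0, 1]
print(f"alternate-forms reliability r_AB = {r_ab:.2f}")
print(f"A-first group: {a_first.sum()}, B-first group: {(~a_first).sum()}")
```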

Internal consistency reliability

  • Core idea: items within a single test should reflect the same underlying construct. Internal consistency assesses whether the items are homogeneous and interrelated.

  • Item homogeneity: items should measure the same trait or dimension (e.g., all items intended to assess extraversion).

  • Why it matters: high internal consistency suggests that items are tapping a common construct; low internal consistency suggests multidimensionality or heterogeneous items.

  • Problems: high inter-item correlation does not guarantee that all items measure the same construct (there can be a “third variable” or multidimensionality). For example, some items might correlate because they are related to related but distinct constructs (e.g., warmth vs. sociability as facets of extraversion).

  • Methods to estimate internal consistency:

    • Split-half reliability: divide the test into two halves (e.g., even vs odd items) and compute the correlation between the two halves. This provides a reliability estimate based on halves.

    • Issue: halves are shorter; reliability tends to be lower with fewer items.

    • Correction: the Spearman–Brown prophecy formula adjusts the observed split-half correlation to account for the shortened halves, estimating what the reliability would be at full test length: $r_{full} = \frac{2\,r_{hh}}{1 + r_{hh}}$, where $r_{hh}$ is the correlation between the two halves (see the sketch at the end of this section).

    • Coefficient alpha (Cronbach’s alpha): the most common internal consistency statistic.

    • Definition (Cronbach’s alpha):
      $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_T^2}\right)$
      where:

    • $k$ = number of items on the test,

    • $\sigma_i^2$ = variance of item i,

    • $\sigma_T^2$ = total variance of the test (variance of the sum of all item scores).

    • Interpretation: higher $\alpha$ (closer to 1) indicates greater internal consistency; reliability generally increases with more items (assuming item homogeneity).

  • When is internal consistency appropriate?

    • Very useful for many settings because it’s easy to compute (e.g., in SPSS with a few clicks).

    • Particularly valuable when you want to ensure that a scale intended to measure a single construct (e.g., a personality facet) is internally coherent.

  • Important caveats:

    • Item homogeneity is required; if items actually tap several related constructs, a high alpha can still emerge from their intercorrelations, giving a misleading impression that the scale reliably measures a single trait.

    • Internal consistency does not guarantee unidimensionality or validity; items could be highly redundant or could tap different facets of a broader construct.

    • A high alpha could reflect redundancy (very similar items) rather than a broad, reliable measure of a single construct.

  • Practical note: Cronbach’s alpha is widely reported in journals and used in grad school work; always check that the scale is unidimensional or that the alpha is interpreted in light of potential multidimensionality.
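
The internal-consistency statistics above can be computed directly from an item-response matrix. Below is a minimal Python sketch (NumPy assumed; the Likert-style item data are simulated for illustration) that computes an odd/even split-half correlation, applies the Spearman–Brown correction, and computes Cronbach's alpha from the item and total-score variances.

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_items = 200, 10
trait = rng.normal(0, 1, (n_people, 1))
# Simulated 5-point Likert-style items that all load on the same trait
items = np.clip(np.round(3 + trait + rng.normal(0, 1, (n_people, n_items))), 1, 5)

# --- Split-half reliability (odd- vs. even-numbered items) ---
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman–Brown correction: estimated reliability of the full-length test
r_full = 2 * r_half / (1 + r_half)

# --- Cronbach's alpha ---
k = n_items
item_vars = items.var(axis=0, ddof=1)          # variance of each item
total_var = items.sum(axis=1).var(ddof=1)      # variance of the total score
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(f"split-half r = {r_half:.2f}, Spearman–Brown corrected = {r_full:.2f}")
print(f"Cronbach's alpha = {alpha:.2f}")
```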

Inter-rater reliability (observational scoring)

  • When scoring requires human judgment (e.g., interviews, clinical ratings, rubric-based assessments), reliability concerns extend to raters rather than a single test form.

  • Cohen’s Kappa: a statistic used to measure inter-rater agreement for categorical data (or ordinal data with weighted versions). It assesses the extent to which raters give the same scores beyond what would be expected by chance.

  • Assumptions for reliable ratings:

    • Raters follow standardized scoring instructions or rubrics.

    • An individual rater applies the same standards across assessments (ratings are consistent over time).

  • What Cohen’s Kappa measures:

    • Inter-rater agreement: the consistency between two raters (or more, with extensions) in their judgments.

    • It accounts for chance agreement, unlike simple percent agreement.

  • Practical example: structured clinical interviews (SCID) or job interviews scored with a rubric. Two clinicians use the same scoring scheme to rate responses; kappa indicates how consistently they agree.

  • Interpretation: higher kappa indicates better agreement. Values close to 1 imply strong agreement beyond chance; values near 0 indicate agreement no better than chance; negative values indicate agreement worse than chance (rare in practice).

  • Relationship to reliability: inter-rater reliability is a form of reliability assessment analogous to test–retest or internal consistency, but it focuses on consistency across judges rather than across repeated measurements or items.
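
Cohen's kappa can be computed directly from two raters' category assignments. Here is a minimal Python sketch (NumPy assumed; the ratings are made up for illustration) that contrasts raw percent agreement with chance-corrected kappa.

```python
import numpy as np

# Two raters assign each of 12 interview responses to one of three categories
rater1 = np.array(["low", "low", "med", "high", "med", "low",
                   "high", "med", "low", "high", "med", "low"])
rater2 = np.array(["low", "med", "med", "high", "med", "low",
                   "high", "low", "low", "high", "med", "low"])

categories = np.unique(np.concatenate([rater1, rater2]))

# Observed agreement: proportion of cases with identical ratings
p_o = np.mean(rater1 == rater2)

# Expected chance agreement from each rater's marginal category proportions
p_e = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)

# Cohen's kappa: agreement beyond what chance alone would produce
kappa = (p_o - p_e) / (1 - p_e)
print(f"percent agreement = {p_o:.2f}, Cohen's kappa = {kappa:.2f}")
```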

Practical implications and cautions

  • Reliability vs. validity: reliability concerns consistency; validity concerns whether the measure actually assesses the intended construct. A measure can be reliable without being valid, but it cannot be valid if it is not reliable.

  • Practical threats to reliability in real-world settings:

    • Inadequate standardization of administration or scoring can inflate error variance.

    • Practice effects can bias results in repeated measurements (e.g., hiring tests, training assessments).

    • Time-related changes in the construct (learning, fatigue, mood) can affect stability across sessions.

  • Implications for selection and decision-making:

    • Practice effects can make some candidates appear more capable than they are; this is problematic for decisions like employment selection.

    • When selecting measurement strategies, consider whether a form of reliability (test–retest, alternate forms, internal consistency, or inter-rater) best fits the construct and context.

  • Real-world relevance:

    • Employers or researchers rely on reliable measures to inform decisions; understanding the strengths and limitations of each reliability method helps interpret results correctly.

    • Random assignment and counterbalancing in alternate-form designs help control for order effects and other biases.

  • Quick recap of key methods:

    • Test–retest: stability over time; interval choice matters; assumes trait stability and equivalent conditions.

    • Alternate forms: parallel forms to reduce practice effects; requires form equivalence and randomization to control order effects.

    • Internal consistency: how well items on a single test measure the same construct; commonly via split-half (with Spearman–Brown correction) and Cronbach’s alpha; requires item homogeneity and careful interpretation regarding dimensionality.

    • Inter-rater reliability: Cohen’s Kappa measures agreement between raters beyond chance; hinges on standardized scoring and consistent application of the rubric.

Connections to broader concepts and real-world relevance

  • Reliability is a foundational pillar of test theory; without reliability, any conclusions drawn about a trait or ability are questionable.

  • The different reliability methods address different sources of error: time-related changes (test–retest), content familiarity (alternate forms), item-level coherence (internal consistency), and evaluator judgment (inter-rater reliability).

  • In practice, researchers and practitioners often report multiple reliability indices to give a comprehensive view of a measure’s stability and dependability.

  • Ethical and practical implications: bias due to practice effects or inconsistent administration can lead to unfair outcomes (e.g., biased hiring decisions); careful design and analysis help mitigate these risks.

Quick glossary (key formulas and terms)

  • Observed score model (CTT): $X = T + E$, where

    • $X$ = observed score,

    • $T$ = true score,

    • $E$ = error term.

  • Test–retest reliability (correlation between Time 1 and Time 2): $r_{tt} = \mathrm{corr}(X_1, X_2)$

  • Cronbach’s alpha (internal consistency):
    $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_T^2}\right)$, where $k$ is the number of items, $\sigma_i^2$ is the variance of item $i$, and $\sigma_T^2$ is the total variance of the test score.

  • Split-half reliability (concept): correlate two halves of a test to estimate reliability of the full test; requires Spearman–Brown correction:

    • Spearman–Brown correction: adjusts the split-half correlation $r_{hh}$ to estimate full-length test reliability, $r_{full} = \frac{2\,r_{hh}}{1 + r_{hh}}$; the general form for a test lengthened by a factor of $n$ is $\frac{n\,r}{1 + (n-1)\,r}$.

  • Alternate forms reliability: correlation between Form A and Form B scores across participants; ideally close to one.

  • Inter-rater reliability: Cohen’s Kappa measures agreement between raters beyond chance; used when scoring involves subjective judgments.
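
As a quick worked check of the Spearman–Brown correction listed above (assuming a hypothetical split-half correlation of .70):

  $r_{full} = \dfrac{2(.70)}{1 + .70} = \dfrac{1.40}{1.70} \approx .82$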

End of notes