Reliability in Psychological Measurement: Test–Retest, Alternate Forms, Internal Consistency, and Inter-Rater Reliability
Reliability coefficients in psychology (overview)
In classical test theory, observed score = true score + error. The observed score we see on a test is an estimate of the true score, with error contaminating the measurement.
Reliability coefficients are, essentially, correlation coefficients. They describe how consistently a test measures a construct across repeated measurements or parallel forms.
A higher reliability coefficient (closer to 1) means less error and more stable measurement. Correlation coefficients can describe both reliability and validity relationships, but here the focus is on reliability: correlations between repeated or parallel measurements of the same construct.
Because reliability is about consistency, multiple methods exist to estimate it (test–retest, alternate forms, internal consistency, interrater reliability).
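To make the observed score = true score + error decomposition concrete, here is a minimal simulation sketch in Python (the variable names, sample size, and score distributions are illustrative assumptions, not from the notes). It shows that reliability corresponds to the share of observed-score variance that comes from true scores, and that the correlation between two error-laden administrations estimates the same quantity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 10_000

true_scores = rng.normal(loc=50, scale=10, size=n_people)  # T: stable trait levels
error_1 = rng.normal(scale=5, size=n_people)               # E at administration 1
error_2 = rng.normal(scale=5, size=n_people)               # independent E at administration 2

observed_1 = true_scores + error_1   # X = T + E
observed_2 = true_scores + error_2

# Reliability in CTT: proportion of observed-score variance that is true-score variance.
reliability_theoretical = true_scores.var() / observed_1.var()

# The correlation between two parallel administrations estimates the same quantity.
reliability_empirical = np.corrcoef(observed_1, observed_2)[0, 1]

print(f"var(T) / var(X)      = {reliability_theoretical:.3f}")
print(f"correlation X1 vs X2 = {reliability_empirical:.3f}")
```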
Test–retest reliability
Definition: administer the same test to the same group at two different times (Time 1 and Time 2) and compute the correlation between the two sets of scores. Let X1 represent Time 1 scores and X2 represent Time 2 scores.
Ideal outcome: r ≈ 1.0, indicating high stability of the trait or ability over the interval.
Rationale: Some constructs (e.g., personality traits, stable abilities) are relatively stable, so scores should correlate across occasions.
Interval (time between administrations): there is no universal rule; depends on the test and construct. Interval should be chosen to balance potential changes in the trait with practical considerations.
Short interval: reduces the chance of real change in the trait, but memory and practice effects may artificially inflate the correlation.
Long interval: increases chances of real change or external influences, which can lower the observed reliability.
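Computationally, the test–retest coefficient is simply the Pearson correlation between the two sets of scores. A minimal sketch with hypothetical typing-speed data (all values are made up for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical typing-speed scores (words per minute) for the same 8 people,
# measured at Time 1 and again at Time 2 after a two-week interval.
time1 = np.array([52, 61, 47, 70, 55, 63, 49, 58])
time2 = np.array([54, 60, 45, 72, 57, 61, 50, 59])

r, p_value = pearsonr(time1, time2)
print(f"test-retest reliability r = {r:.2f}")  # values near 1.0 indicate high stability
```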
Assumptions:
Test takers have not changed in the target skill/trait between administrations (e.g., typing speed should be similar across days unless there’s a genuine change).
Test administrations are the same: same instructions, same environment, no major intervening life events that could affect performance.
Potential sources of error:
Differences in administration (instructions, pacing, environment).
Life events between testing sessions (breakups, stress) that affect performance.
Practice effects: familiarity with the test content can artificially improve scores on Time 2, not because the trait increased but because the test was practiced.
Measurement error in general; error reduces reliability.
Practice effects: prior exposure to the test can boost performance on the retest, leading to misleading conclusions about true ability or knowledge. For example, after taking a math achievement test once, a higher retest score may reflect test familiarity rather than an actual increase in math ability.
Appropriateness of test–retest:
If the trait/ability is stable and not expected to change over the interval, test–retest can be useful for estimating reliability.
If the construct can change with learning or intervention, the correlation may reflect both stability and change, which complicates interpretation.
Longer intervals can degrade reliability due to forgetting or genuine change; short intervals can inflate reliability due to practice effects.
Practical implication: there is no one-size-fits-all interval; the choice depends on the test, the construct, and the intended use of the reliability estimate.
Alternate forms reliability
Definition: develop two parallel but different forms of the same test (Form A and Form B) that measure the same construct and are designed to be equivalent.
Purpose: reduce practice effects and memory/retention of item content that could inflate reliability if the same items were reused.
Procedure: administer both forms to the same participants, ideally with the administration order counterbalanced across participants to control for order effects.
Example: split participants into two groups; Group 1 takes Form A first, then Form B; Group 2 takes Form B first, then Form A. Random assignment helps prevent bias.
Reliability estimate: compute the correlation between Form A scores and Form B scores across participants. A high correlation suggests that the two forms are equivalent in measuring the same construct.
Target correlation: a high correlation between the two forms (approaching 1.0) is typically considered strong evidence of alternate-forms equivalence.
Assumptions:
Forms A and B are truly equivalent in terms of psychometric properties (difficulty, discrimination, etc.).
The content on both forms targets the same underlying construct.
Considerations: alternate forms primarily address practice effects and order effects; they require careful construction to ensure equivalence and to minimize form-specific biases.
When to use: particularly valuable when practice effects are a concern or when repeated measurement is necessary but using the exact same items would bias results.
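Putting the counterbalanced design and the reliability estimate together, here is a minimal sketch (the scores and the order assignment are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scores: each participant completes both Form A and Form B.
form_a = np.array([23, 30, 18, 27, 25, 33, 21, 29, 26, 31])
form_b = np.array([22, 31, 20, 26, 24, 32, 23, 28, 27, 30])

# Counterbalancing: randomly assign half the sample to take Form A first and
# half to take Form B first, so order/practice effects are spread across forms.
n = len(form_a)
order = np.array(["A then B"] * (n // 2) + ["B then A"] * (n - n // 2))
rng.shuffle(order)
print("order assignment:", order)

# Alternate-forms reliability: correlation between the two forms across participants.
r_ab = np.corrcoef(form_a, form_b)[0, 1]
print(f"alternate-forms reliability r = {r_ab:.2f}")
```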
Internal consistency reliability
Core idea: items within a single test should reflect the same underlying construct. Internal consistency assesses whether the items are homogeneous and interrelated.
Item homogeneity: items should measure the same trait or dimension (e.g., all items intended to assess extraversion).
Why it matters: high internal consistency suggests that items are tapping a common construct; low internal consistency suggests multidimensionality or heterogeneous items.
Problems: high inter-item correlations do not guarantee that all items measure the same construct (there can be a “third variable” or multidimensionality). For example, some items might correlate because they tap related but distinct constructs (e.g., warmth vs. sociability as facets of extraversion).
Methods to estimate internal consistency:
Split-half reliability: divide the test into two halves (e.g., even vs odd items) and compute the correlation between the two halves. This provides a reliability estimate based on halves.
Issue: each half contains only part of the items, and reliability tends to be lower for shorter tests, so the raw split-half correlation underestimates the full test’s reliability.
Correction: Spearman–Brown correction to adjust the split-half correlation to estimate reliability of the full-length test. The correction is known as the Spearman–Brown prophecy formula.
Spearman–Brown correction (concept): adjusts the observed split-half reliability to account for the reduced-length halves, providing an estimate of what the reliability would be if the full test were used.
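A sketch of the split-half procedure and the Spearman–Brown correction, assuming a persons-by-items score matrix (the simulated responses are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate 200 test takers answering 10 items that share one underlying trait:
# each item score = common trait + item-specific error.
n_people, n_items = 200, 10
trait = rng.normal(size=(n_people, 1))
items = trait + rng.normal(scale=1.0, size=(n_people, n_items))

# Split the test into odd- and even-numbered items and total each half per person.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlation between the two half-test scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction: estimate full-length reliability from the
# half-test correlation (the prophecy formula applied to a doubling of length).
r_full = (2 * r_half) / (1 + r_half)

print(f"split-half r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")
```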
Coefficient alpha (Cronbach’s alpha): the most common internal consistency statistic.
Definition (Cronbach’s alpha):
$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_T^2}\right)$$
where:
$k$ = number of items on the test,
$\sigma_i^2$ = variance of item $i$,
$\sigma_T^2$ = total variance of the test (variance of the sum of all item scores).
Interpretation: higher $\alpha$ (closer to 1) indicates greater internal consistency; reliability generally increases with more items (assuming item homogeneity).
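The formula above translates directly into code. Below is a minimal sketch of computing alpha from a persons-by-items matrix (the function name and simulated data are illustrative assumptions):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a persons-by-items matrix of scores."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total test score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated responses: 300 people, 8 items, each item = shared construct + error.
rng = np.random.default_rng(3)
trait = rng.normal(size=(300, 1))
items = trait + rng.normal(scale=1.0, size=(300, 8))

print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```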
When is internal consistency appropriate?
Very useful for many settings because it’s easy to compute (e.g., in SPSS with a few clicks).
Particularly valuable when you want to ensure that a scale intended to measure a single construct (e.g., a personality facet) is internally coherent.
Important caveats:
Item homogeneity is required; if items do not all measure the same trait, alpha can still look acceptable because of variance shared across multiple constructs, giving a misleading sense of reliability for a single trait.
Internal consistency does not guarantee unidimensionality or validity; items could be highly redundant or could tap different facets of a broader construct.
A high alpha could reflect redundancy (very similar items) rather than a broad, reliable measure of a single construct.
Practical note: Cronbach’s alpha is widely reported in journals and used in grad school work; always check that the scale is unidimensional or that the alpha is interpreted in light of potential multidimensionality.
Inter-rater reliability (observational scoring)
When scoring requires human judgment (e.g., interviews, clinical ratings, rubric-based assessments), reliability concerns extend to raters rather than a single test form.
Cohen’s Kappa: a statistic used to measure inter-rater agreement for categorical data (or ordinal data with weighted versions). It assesses the extent to which raters give the same scores beyond what would be expected by chance.
Assumptions for reliable ratings:
Raters follow standardized scoring instructions or rubrics.
An individual rater applies the same standards across assessments (ratings are consistent over time).
What Cohen’s Kappa measures:
Inter-rater agreement: the consistency between two raters (or more, with extensions) in their judgments.
It accounts for chance agreement, unlike simple percent agreement.
Practical example: structured clinical interviews (SCID) or job interviews scored with a rubric. Two clinicians use the same scoring scheme to rate responses; kappa indicates how consistently they agree.
Interpretation: higher kappa indicates better agreement. Values close to 1 imply strong agreement beyond chance; values near 0 indicate agreement no better than chance; negative values indicate agreement worse than chance (rare in practice).
Relationship to reliability: inter-rater reliability is a form of reliability assessment analogous to test–retest or internal consistency, but it focuses on consistency across judges rather than across repeated measurements or items.
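A minimal sketch of Cohen’s Kappa for two raters assigning categorical codes; it makes the chance correction explicit (the ratings below are hypothetical, and the function is a simple illustrative implementation rather than a packaged routine):

```python
import numpy as np

def cohens_kappa(rater1, rater2) -> float:
    """Cohen's kappa for two raters' categorical judgments of the same cases."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(rater1, rater2)

    # Observed agreement: proportion of cases where the two raters match.
    p_observed = np.mean(rater1 == rater2)

    # Chance agreement: expected matches if each rater assigned categories
    # independently according to their own base rates.
    p_chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)

    return (p_observed - p_chance) / (1 - p_chance)

# Two clinicians rate the same 12 interview responses as "present" or "absent".
clinician_1 = ["present", "absent", "present", "present", "absent", "absent",
               "present", "absent", "present", "present", "absent", "present"]
clinician_2 = ["present", "absent", "present", "absent", "absent", "absent",
               "present", "absent", "present", "present", "present", "present"]

print(f"kappa = {cohens_kappa(clinician_1, clinician_2):.2f}")
```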
Practical implications and cautions
Reliability vs. validity: reliability concerns consistency; validity concerns whether the measure actually assesses the intended construct. A measure can be reliable without being valid, but it cannot be valid if it is not reliable.
Practical threats to reliability in real-world settings:
Inadequate standardization of administration or scoring can inflate error variance.
Practice effects can bias results in repeated measurements (e.g., hiring tests, training assessments).
Time-related changes in the construct (learning, fatigue, mood) can affect stability across sessions.
Implications for selection and decision-making:
Practice effects can make some candidates appear more capable than they are; this is problematic for decisions like employment selection.
When selecting measurement strategies, consider whether a form of reliability (test–retest, alternate forms, internal consistency, or inter-rater) best fits the construct and context.
Real-world relevance:
Employers or researchers rely on reliable measures to inform decisions; understanding the strengths and limitations of each reliability method helps interpret results correctly.
Random assignment and counterbalancing in alternate-form designs help control for order effects and other biases.
Quick recap of key methods:
Test–retest: stability over time; interval choice matters; assumes trait stability and equivalent conditions.
Alternate forms: parallel forms to reduce practice effects; requires form equivalence and randomization to control order effects.
Internal consistency: how well items on a single test measure the same construct; commonly via split-half (with Spearman–Brown correction) and Cronbach’s alpha; requires item homogeneity and careful interpretation regarding dimensionality.
Inter-rater reliability: Cohen’s Kappa measures agreement between raters beyond chance; hinges on standardized scoring and consistent application of the rubric.
Connections to broader concepts and real-world relevance
Reliability is a foundational pillar of test theory; without reliability, any conclusions drawn about a trait or ability are questionable.
The different reliability methods address different sources of error: time-related changes (test–retest), content familiarity (alternate forms), item-level coherence (internal consistency), and evaluator judgment (inter-rater reliability).
In practice, researchers and practitioners often report multiple reliability indices to give a comprehensive view of a measure’s stability and dependability.
Ethical and practical implications: bias due to practice effects or inconsistent administration can lead to unfair outcomes (e.g., biased hiring decisions); careful design and analysis help mitigate these risks.
Quick glossary (key formulas and terms)
Observed score model (CTT): $X = T + E$, where
$X$ = observed score,
$T$ = true score,
$E$ = error term.
Test–retest reliability: $r_{X_1 X_2}$, the correlation between Time 1 scores ($X_1$) and Time 2 scores ($X_2$).
Cronbach’s alpha (internal consistency): $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_T^2}\right)$, where $k$ is the number of items, $\sigma_i^2$ is the variance of item $i$, and $\sigma_T^2$ is the total variance of the test score.
Split-half reliability (concept): correlate two halves of a test to estimate the reliability of the full test; requires the Spearman–Brown correction, $r_{full} = \frac{2\,r_{hh}}{1 + r_{hh}}$, where $r_{hh}$ is the correlation between the two halves.
Spearman–Brown prophecy formula (general form): $r_{new} = \frac{n\,r}{1 + (n-1)\,r}$, the projected reliability when a test is lengthened or shortened by a factor of $n$; the split-half correction above is the special case $n = 2$.
Alternate forms reliability: correlation between Form A and Form B scores across participants; ideally close to one.
Inter-rater reliability: Cohen’s Kappa measures agreement between raters beyond chance; used when scoring involves subjective judgments.