Reliability
≈ consistency/dependability of measurement, reflecting the precision with which an instrument measures something. It does not speak to the “goodness” or validity of what is being measured.
Everyday example: a train that always arrives precisely at 7:02, even if it's consistently late; psychometrics: an instrument that consistently yields similar scores each time it is used under similar conditions.
Quantified by a reliability coefficient r_{xx} \in [0,1]; a value closer to 1 indicates greater consistency and less measurement error, while 0 indicates no consistency.
Measurement error = the inevitable, inherent uncertainty or deviation in any observed score from the true score, which exists even after eliminating preventable mistakes and blunders.
Two broad categories:
Random error (“noise”) → unpredictable and non-systematic fluctuations that tend to cancel out over many measurements, leading to an average error of zero (e.g., a sudden, temporary surge in an examinee's blood sugar, a thunderclap during a testing session, momentarily remembering the word “effervescent” during an SAT verbal section).
Systematic error → a predictable and consistent bias that inflates or deflates scores in one particular direction (e.g., a 12-inch ruler that is actually 12.1 inches long, a clinician’s personal religious bias consistently leading to higher or lower ratings of suicide risk).
Bias (in statistics) = the degree to which systematic error consistently skews score results, leading to an inaccurate representation of the true value.
Observed score X = the raw score or what we actually record from a measurement instrument.
True score T = a hypothetical, unobservable long-term average score of an individual if they were to take the test an infinite number of times under parallel administrations with no carry-over effects. It is a theoretical construct that represents the individual's consistent characteristic being measured.
Error score E = the difference between the observed score and the true score.
Classic measurement model: X = T + E. This fundamental equation posits that every observed score is composed of a true score component and an error component.
Construct score = a person’s actual standing on the theoretical variable or latent trait being measured (e.g., intelligence, anxiety, extraversion). This is instrument-independent. Validity specifically concerns how well a test measures this true construct; reliability concerns how precisely the observed score approximates the true score.
Carry-over effects when retesting, which can confound test-retest reliability:
Practice effects (learning) → scores might artificially increase due to familiarity with the test content or format.
Fatigue effects (reduced energy/attention) → scores might artificially decrease due to mental or physical exhaustion.
Variance decomposition: The total variance of observed scores \sigma^2 is theoretically composed of two independent components:
True variance \sigma^2_t (variance attributed to actual differences in true scores among individuals).
Error variance \sigma^2_e (variance attributed to random measurement error).
Thus, total variance \sigma^2 = \sigma^2_t + \sigma^2_e.
Reliability r_{xx} = \dfrac{\sigma^2_t}{\sigma^2} represents the proportion of total observed score variance that is attributable to true score variance. It can also be conceptualized as the squared correlation between observed scores and true scores.
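A minimal simulation sketch of this decomposition (assuming numpy and hypothetical true-score and error SDs of 15 and 6): reliability computed as the variance ratio closely matches the squared correlation between observed and true scores.

```python
import numpy as np

# Minimal simulation of the classical model X = T + E (hypothetical numbers).
rng = np.random.default_rng(0)
n_people = 100_000

true_scores = rng.normal(loc=100, scale=15, size=n_people)   # T: latent true scores
errors = rng.normal(loc=0, scale=6, size=n_people)           # E: random error, mean 0
observed = true_scores + errors                               # X = T + E

var_true = true_scores.var()
var_obs = observed.var()

reliability = var_true / var_obs                 # r_xx = sigma^2_t / sigma^2
r_xt = np.corrcoef(observed, true_scores)[0, 1]  # correlation between X and T

print(f"reliability (variance ratio): {reliability:.3f}")
print(f"squared correlation r(X,T)^2: {r_xt**2:.3f}")  # approximately equal
# Expected value here: 15^2 / (15^2 + 6^2) ≈ 0.862
```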
Figure 5-1 context: The Standard Error of Measurement (SEM) is conceptually the standard deviation of repeated scores an individual would obtain around their true score T.
Test construction (item/content sampling): Errors originating from how the test is designed or content is selected.
Examples: ambiguous item wording, inappropriate difficulty level of items for the target population, inadequate or imbalanced coverage of the content domain leading to an unrepresentative sample of behaviors.
Test administration: Errors introduced during the test delivery process.
Environment: variations in temperature, lighting, noise levels (e.g., a squeaky chair), distractions, or social contexts (e.g., testing during wartime vs. peace).
Examinee state: temporary fluctuations in an examinee's physical or psychological condition (e.g., sleep deprivation, effects of medication/drugs, anxiety, motivation levels, or a temporary drop in cognitive function linked to weight-related glucose fluctuations).
Examiner variables: appearance, demeanor, giving unintended cues, inconsistent adherence to standardized instructions, or examiner's personal biases (e.g., religiosity influencing suicide risk assessment).
Scoring/interpretation: Errors that arise during the process of evaluating or scoring responses.
Objective scoring minimizes error (e.g., machine-scored multiple-choice tests), though minor errors such as data-entry mistakes are still possible. Subjective scoring often introduces significant rater variance, where different scorers apply criteria inconsistently (e.g., essay grading, ratings of creative work, Rorschach inkblot interpretations).
Training scorers, utilizing clear rubrics, and conducting group calibration sessions significantly reduce error by standardizing judgment and application of criteria.
Methodological & sampling errors: Broader errors related to research design or population representation.
Examples: polling margins influenced by sample size and selection; variations introduced by different interviewer training levels or inconsistent interviewing techniques; biased wording in survey questions that systematically affects responses, overlapping with test construction errors.
Test-Retest Reliability
Involves administering the same test to the same group of individuals on two separate occasions.
The correlation between the two sets of scores provides the test-retest reliability coefficient, indicating the consistency of scores over time.
Suitable for measuring stable traits (e.g., intelligence, personality traits). When the interval between administrations is long (greater than 6 months), the resulting estimate is typically called a “coefficient of stability”.
Threats: factors that cause true changes in the construct over time, confounding the reliability estimate (e.g., learning or memory of test items, therapeutic interventions that change an individual's state, emotional trauma, or rapid developmental spurts in children).
Parallel-Forms & Alternate-Forms Reliability
Parallel forms: Two distinct versions of a test designed to be statistically interchangeable, meaning they have equal means, variances, and correlations with other measures. Alternate forms are intended to be equivalent but are not statistically identical.
The correlation between scores on the two forms is known as the coefficient of equivalence. It reflects consistency across different sets of items.
Trade-offs: Requires the development of two distinct forms of the test. While it adds item-sampling error (variability due to different items being sampled), it significantly reduces memory and practice effects because examinees encounter different items.
Useful for make-up exams, longitudinal research designs where repeated testing is necessary, and situations requiring minimization of practice effects.
Internal Consistency Estimates
Split-Half
Steps:
Split the test into two statistically equivalent halves (e.g., random assignment of items, odd–even numbered items, or content-balanced subtests). This ensures each half measures the same construct to a similar degree.
Correlate the scores from the two halves (r_{hh}).
Adjust the half-test correlation using the Spearman–Brown prophecy formula, which corrects for the fact that the computed reliability applies to a test half the original length: r_{SB}=\frac{2r_{hh}}{1+r_{hh}} (specific form for doubling test length). The general form is r_{SB}=\frac{n\,r_{xy}}{1+(n-1)r_{xy}}, where n is the factor by which the test length is changed.
Reliability generally increases with test length; adding more items tends to average out random errors. Figure 5-3 illustrates the multiplier needed to reach a desired reliability coefficient by increasing test length.
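A small sketch of the Spearman–Brown step-up (hypothetical r values; the second calculation mirrors the Figure 5-3 idea of finding the length multiplier needed to reach a target reliability):

```python
def spearman_brown(r: float, n: float) -> float:
    """General Spearman–Brown prophecy: reliability if test length is changed by factor n."""
    return (n * r) / (1 + (n - 1) * r)

# Split-half case: the half-test correlation r_hh is stepped up by doubling (n = 2).
r_hh = 0.70                                   # hypothetical half-test correlation
print(spearman_brown(r_hh, 2))                # ≈ 0.824, the full-length estimate

# How many times longer must a test with r = .60 be to reach r ≈ .90?
# Solving the formula for n gives n = r_target*(1 - r) / (r*(1 - r_target)).
r, r_target = 0.60, 0.90
n_needed = r_target * (1 - r) / (r * (1 - r_target))
print(n_needed)                               # = 6.0 (the test must be 6x longer)
```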
KR-20 / KR-21 (dichotomous) & Cronbach’s \alpha (multipoint)
Cronbach’s \alpha (for tests with multipoint items, e.g., Likert scales) and KR-20/KR-21 (for dichotomous items, i.e., right/wrong) are measures of internal consistency.
\alpha = \left(\frac{k}{k-1}\right)\left(1-\frac{\sum_i \sigma^2_i}{\sigma^2_{total}}\right), where k is the number of items, \sigma^2_i is the variance of item i, and \sigma^2_{total} is the variance of the total test score (a computational sketch follows this subsection).
Conceptually, \alpha represents the mean of all possible split-half correlation coefficients. It is often considered a lower-bound estimate of true reliability. The coefficient typically ranges from 0 to 1 (though negative values can occur due to problematic items or small samples, and are typically reported as 0).
Limitations: \alpha can be inflated by excessively redundant items (items that measure virtually the same nuance) or by a large number of items. It underestimates reliability when items do not have equal true score variances or do not load equally on the same factor (i.e., violate the assumption of tau-equivalence).
McDonald’s \omega is an alternative reliability estimate based on factor analysis that does not assume equal item loadings, providing a more robust estimate, especially for multidimensional tests.
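A computational sketch of coefficient \alpha for a hypothetical person-by-item score matrix (plain numpy; the data are made up for illustration):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_persons x k_items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)        # sigma^2_i for each item
    total_variance = items.sum(axis=1).var(ddof=1)    # sigma^2_total of the summed score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 5-item Likert responses for 6 people (rows = persons, columns = items).
scores = np.array([
    [4, 5, 4, 4, 5],
    [2, 3, 2, 3, 2],
    [5, 5, 4, 5, 5],
    [3, 3, 3, 2, 3],
    [1, 2, 1, 2, 1],
    [4, 4, 5, 4, 4],
])
print(f"alpha = {cronbach_alpha(scores):.3f}")
```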
Inter-Scorer Reliability
Refers to the degree of agreement or consistency among two or more independent raters, scorers, judges, or observers when evaluating the same responses or behaviors.
Statistics commonly used include the Pearson product-moment correlation coefficient (r), Cohen’s \kappa (which accounts for chance agreement), and the intraclass correlation coefficient (ICC); see the \kappa sketch at the end of this subsection.
This type of reliability is enhanced by comprehensive training of raters, providing clear and specific scoring criteria, and conducting consensus discussions to calibrate judgments.
Historical example: Starch and Elliott (1912) famously demonstrated the wide variability in scoring English compositions, with grades for the same essay ranging from 50% to 98% among different teachers, highlighting the need for inter-scorer reliability.
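A minimal sketch of Cohen’s \kappa for two hypothetical raters making binary diagnoses (hand-rolled rather than taken from a stats package, so the chance correction is explicit):

```python
import numpy as np

def cohens_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa: agreement between two raters corrected for chance agreement."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    categories = np.union1d(a, b)
    p_observed = np.mean(a == b)                       # raw proportion of agreement
    # Chance agreement: product of each rater's marginal proportions, summed over categories.
    p_chance = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical diagnoses (1 = disorder present, 0 = absent) from two clinicians.
clinician_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
clinician_2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(f"kappa = {cohens_kappa(clinician_1, clinician_2):.2f}")   # 0.80 for these data
```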
Selecting a Method
The choice of reliability estimation method should always align with the purpose of the measurement, the nature of the test, and the specific sources of error that are most relevant to minimize. Table 5-1 in the textbook typically provides a guide linking methods to the primary error sources they address.
Spearman–Brown (general): r_{SB}=\frac{n\,r_{xy}}{1+(n-1)r_{xy}}, used to estimate the reliability of a test if its length were to be changed.
Standard Error of Measurement (SEM): \sigma_{meas}=\sigma\sqrt{1-r_{xx}}. SEM provides an estimate of the standard deviation of error scores for an individual score. A smaller SEM indicates greater precision of the observed score and higher reliability.
Standard Error of Difference (SE_{diff}): Used to determine if the difference between two scores (e.g., pre-test vs. post-test, or scores from two different individuals) is statistically significant. \sigma_{diff}=\sqrt{\sigma_{meas_1}^{2}+\sigma_{meas_2}^{2}} = \sigma\sqrt{2-r_1-r_2}.
Confidence Intervals (CI): A range within which an individual’s true score is likely to fall, based on their observed score and the SEM.
Formula: X \pm z\,\sigma_{meas}, where X is the observed score and z is the z-score corresponding to the desired confidence level (e.g., z=1 for 68%, z=1.96 for 95%, z=2.58 for 99%).
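A sketch implementing the three formulas above (SEM, SE_{diff}, CI); the scale SD, reliabilities, and observed score are hypothetical:

```python
import math

def sem(sd: float, r_xx: float) -> float:
    """Standard error of measurement: sigma * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - r_xx)

def se_diff(sd: float, r1: float, r2: float) -> float:
    """Standard error of the difference between two scores: sigma * sqrt(2 - r1 - r2)."""
    return sd * math.sqrt(2 - r1 - r2)

def confidence_interval(x: float, sd: float, r_xx: float, z: float = 1.96):
    """CI around an observed score: X +/- z * SEM (z = 1.96 for ~95%)."""
    margin = z * sem(sd, r_xx)
    return (x - margin, x + margin)

# Hypothetical IQ-style scale: sd = 15, reliability = .91, observed score = 106.
print(f"SEM    = {sem(15, 0.91):.2f}")             # 15 * sqrt(.09) = 4.50
print(f"SEdiff = {se_diff(15, 0.91, 0.85):.2f}")   # 15 * sqrt(.24) ≈ 7.35
lo, hi = confidence_interval(106, 15, 0.91)
print(f"95% CI = ({lo:.1f}, {hi:.1f})")            # (97.2, 114.8)
```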
Item homogeneity vs heterogeneity: Tests with homogeneous items (measuring a very specific single construct) tend to have higher internal consistency reliability. Tests with heterogeneous items (measuring a broader or more complex construct with diverse facets) may have lower internal consistency but might offer better content coverage and comprehensive measurement.
Dynamic (state) vs static (trait) constructs: Reliability estimation methods should suit the nature of the construct. Dynamic constructs (states that fluctuate, e.g., mood) are less suited for long-interval test-retest reliability. Static constructs (traits that are stable, e.g., intelligence) are well-suited for test-retest.
Range restriction/inflation: The variability of scores in a sample affects reliability coefficients (which are correlations). Restricting the range of scores (e.g., testing only high-ability students) typically lowers the reliability coefficient. Conversely, inflating the range can artificially increase it.
Speed vs power tests:
Speed tests: Items are typically easy, and the challenge is completing as many as possible within a time limit. Split-half reliability on a single timed administration yields spuriously high reliability because virtually all attempted items are correct, making items highly correlated. Proper assessment requires two separately timed halves or a test-retest approach.
Power tests: Items are typically difficult, and there is ample time for all examinees to attempt all items. The challenge lies in the difficulty of the items, not the speed of response.
Criterion-Referenced Tests:
These tests focus on whether an individual has achieved a specific level of mastery or competence (e.g., passing a driving test). They often exhibit low total-score variance because most examinees either master the content (high scores) or not (low scores). Traditional reliability coefficients designed for norm-referenced tests (which emphasize individual differences) can be misleading. Alternate indices focusing on classification accuracy (e.g., consistency in classifying individuals as masters or non-masters) are more appropriate.
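A minimal sketch of one such classification-accuracy index: the proportion of examinees classified the same way (master vs. non-master) across two administrations; scores and cutoff are hypothetical:

```python
import numpy as np

def classification_consistency(form_a, form_b, cutoff: float) -> float:
    """Proportion of examinees classified identically (master/non-master) on both forms."""
    a_pass = np.asarray(form_a) >= cutoff
    b_pass = np.asarray(form_b) >= cutoff
    return np.mean(a_pass == b_pass)

# Hypothetical mastery-test scores for 8 examinees on two administrations (cutoff = 80).
admin_1 = [95, 88, 79, 62, 85, 91, 55, 83]
admin_2 = [92, 90, 81, 60, 87, 89, 58, 78]
print(f"consistent classifications: {classification_consistency(admin_1, admin_2, 80):.2f}")  # 0.75
```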
Classical Test Theory (CTT)
Widely used and foundational due to its relative simplicity and minimal assumptions.
Issues: CTT assumes all items are equally informative or contribute equally to the true score, which is often not realistic. It favors longer tests, as reliability, by formula, generally increases with test length.
Domain Sampling & Generalizability Theory (G-Theory)
Domain sampling: Conceptualizes a test as a sample of items drawn from a larger theoretical universe or domain of all possible items that could measure a given construct. Test reliability is, therefore, the representativeness of the sampled items relative to this universe of items.
Generalizability Theory (G-Theory): An extension of domain sampling that allows researchers to simultaneously estimate and differentiate between multiple sources of error variance (called “facets”, e.g., items, raters, occasions, settings). It yields coefficients of generalizability, which reflect the dependability of scores across different measurement conditions. A Generalizability study (G-study) estimates variance components, while a Decision study (D-study) evaluates how score dependability changes under different measurement designs for specific uses.
Item Response Theory (IRT)
A family of modern latent-trait models that link the probability of a correct or endorsed response to an item to an individual's unobservable trait level (\theta).
Item parameters (estimated for each item):
Difficulty (b): The location on the trait continuum where an examinee has a 50% probability of answering the item correctly (or endorsing it).
Discrimination (a): The slope of the item characteristic curve (ICC), indicating how well an item differentiates between individuals with high and low trait levels.
Guessing (c): In 3-parameter logistic (3-PL) models, this parameter accounts for the probability of low-ability examinees answering an item correctly by chance. (A response-probability sketch follows this block.)
Rasch model: A strict subset of IRT models that assumes all items have equal discrimination values, simplifying the model. A key advantage is that it yields invariant measurement, meaning item parameters are independent of the specific sample of persons, and person parameters are independent of the specific sample of items.
Advantages: Provides item-level information, enabling the development of Computer Adaptive Testing (CAT) for efficient and precise measurement, and facilitates item banking. Challenges: Requires more mathematical sophistication, larger sample sizes, and specialized software.
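A sketch of the 3-PL response function with hypothetical a, b, c values (the 1.7 scaling constant sometimes used in logistic IRT models is omitted for simplicity):

```python
import math

def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3-PL item response function: probability of a correct response at trait level theta."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Hypothetical item: discrimination a = 1.2, difficulty b = 0.5, guessing c = 0.20.
for theta in (-2, -1, 0, 0.5, 1, 2):
    print(f"theta = {theta:+.1f} -> P(correct) = {p_correct_3pl(theta, 1.2, 0.5, 0.20):.2f}")

# Note: with c > 0 the probability at theta = b is (1 + c)/2 rather than exactly .50;
# setting c = 0 gives the 2-PL model, and additionally fixing a across items gives the Rasch/1-PL form.
```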
Replicability Crisis (Close-Up)
The Open Science Collaboration (OSC) (2015) attempted to redo 100 psychology studies and found that only 40–60% could be replicated, leading to widespread concern about the robustness of scientific findings.
Causes: scarcity of replication publications (historically, less than 1.07% of published studies are direct replications), publication bias for positive or novel findings, and questionable research practices (QRPs) such as p-hacking (running analyses until a significant result is found) and HARKing (Hypothesizing After Results are Known).
Remedies: preregistration of studies (publicly documenting research hypotheses, design, and analysis plans before data collection) prevents p-hacking and HARKing; open science platforms promote data and code sharing; judicial gatekeeping emphasizes the need for robust and replicable scientific evidence in legal contexts.
Method Matters in Reliability (Everyday Psychometrics)
DSM-5 diagnostic reliability: A study using audio-recordings of clinical interviews yielded a kappa coefficient (\kappa) of approximately .80 (considered excellent reliability) for psychiatric diagnoses, while a 1-week test-retest method yielded a \kappa of approximately .47 (considered fair).
This difference highlights how methodology influences reliability estimates: the audio-recording method constrains information (all raters hear the exact same stimulus), which tends to inflate agreement and reduce observation error. In contrast, the test-retest method mirrors real-world independent assessments, capturing more sources of error such as patient state variability and clinician judgment differences over time, thus providing a more realistic estimate of reliability in practice.
Standard Error of Measurement (SEM): Provides a measure of the precision of an individual's observed score; a lower SEM indicates greater precision and is associated with higher reliability.
Example: If the observed score standard deviation is \sigma = 10 and reliability r_{xx} = .84, then SEM \sigma_{meas} = 10\sqrt{1-.84} = 10\sqrt{.16} = 10 \times .4 = 4. This suggests that, for any given observed score, the true score is likely to lie within \pm 4 points about 68% of the time.
SB5 SEM table: For instance, the SEM for the Full Scale IQ (FSIQ) on the Stanford-Binet 5 (SB5) typically ranges from approximately 2.1–2.6 points, while the Abbreviated Battery IQ (ABIQ) generally has a higher SEM of about 3–5 points, reflecting its shorter length and lower reliability.
Use SEM to decide intellectual disability cutoff: For example, if an IQ score of 70 is the cutoff for intellectual disability, using SEM allows for establishing a confidence interval (e.g., 70 \pm 5 at the 95% CI) around the observed score to account for measurement error before making a diagnosis.
Standard Error of Difference (SE_{diff}): Tells us when two scores differ significantly. For example, in a Scholastic Mental Test (SMT) example, if \sigma_{diff}=5.6, a difference between two scores would need to be greater than approximately 11.2 points (2 \times 5.6) to be considered statistically significant at the 95% certainty level, indicating a true difference rather than just random error.
Rule-of-thumb grading (these are general guidelines and highly context-dependent):
.90s = “A” (excellent reliability, suitable for critical individual decisions, such as clinical diagnoses, high-stakes selection, or life-or-death evaluations).
.80s = “B” (good reliability, appropriate for important decisions, typically acceptable for many research and practical applications).
.65–.79 = borderline/weak (suggests caution; might be acceptable for preliminary research or screening purposes, but generally insufficient for individual decision-making).
<.65 = often unacceptable for most psychometric applications, indicating substantial measurement error.
Always interpret relative to:
Purpose of the test: A test used for low-stakes screening might tolerate lower reliability than one used for high-stakes individual diagnosis or placement.
Error sources captured by the chosen coefficient: Different reliability methods account for different sources of error (e.g., test-retest captures temporal stability error, internal consistency captures content sampling error).
Test characteristics and population: Factors like test length, item format (e.g., multiple choice, essay), and the variability of the sample (e.g., homogeneous vs. heterogeneous group) can influence the observed reliability coefficient.
Highly reliable but invalid tests can profoundly mislead: Consistency does not equate to truth or accuracy. A test can consistently measure something erroneously, leading to systematically wrong but consistent decisions.
Bias & systematic error raise equity concerns: If a test contains systematic biases (e.g., culturally unfair content, differential impact due to demographic factors), even if highly reliable, it can lead to inequitable or discriminatory outcomes (e.g., disproportionate negative impacts in employment selection, educational placement, or courtroom stakes).
Reliability is a prerequisite for validity: A test cannot be valid if it is not reliable. An unreliable measure cannot consistently measure anything, let alone what it intends to measure; hence, insufficient reliability caps the maximum achievable validity of a test.
Replicability and transparency in scientific research are crucial for safeguarding societal decisions: The integrity of scientific findings, especially in legal, clinical, and educational contexts, relies on reproducible results. Open science practices promote accountability and trust in scientific evidence.