Reliability Instrumentation, Calculation & Issues in Psychology 272 – Spring 2025
Introduction
This section is dedicated to the memory of Joseph T. McCord (1990 – 2020).
Overview
The following topics are discussed:
Definition of Reliability
The Relationship between Reliability and Error
True Scores
Relationship with latent knowledge, skills, and abilities (KSAs)
Implications of Reliability Coefficients
Test-retest reliability
Parallel/alternate forms reliability
Effect of Test Length on Reliability
Spearman-Brown Formula
Internal Consistency
Cronbach’s alpha, KR-20, KR-21
Interrater Reliability
Reliability
Definition: Reliability is defined as a measure of the consistency or stability of an instrument, which should measure the same thing in the same way every time it is used.
Consistency: If the same instrument is administered to the same individual repeatedly, the scores should be nearly uniform across trials.
Analogy: A speedometer that gives a different reading each time the car travels at the same speed is useless as a measure of speed.
Measurement Error and Reliability
Reliable Measurements: A measurement is reliable to the extent that it is free of measurement error.
Quantification: While no instrument is entirely free of error, we can quantify the degree of reliability through a reliability coefficient.
Types of Reliability Coefficients: Common coefficients include Cronbach’s alpha, Spearman-Brown, KR-20, KR-21.
Degree of Reliability: Reliability is not simply a binary question; it exists on a continuum due to the inherent error present in observations.
Reliability and Error
In statistical analyses such as t-tests and ANOVAs, data can be viewed as having two components:
Systematic variability (the desirable component)
Error variability (the undesirable component)
Variability in measurement instruments can lead to inconsistent results, indicating a lack of reliability.
Sources of Error in Measurement
There are two primary types of error affecting measurements:
Method Error: Issues with the experimenter or testing conditions, including faulty equipment, poor instructions, and distractions.
Trait Error: Challenges originating with participants, such as dishonesty, illness, fatigue, or effects stemming from being observed (the Hawthorne effect).
Reliability Calculation and Implications
The reliability of scores can be expressed as the ratio of true-score variance to total observed-score variance: $r_{XX'} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$.
A smaller error variance yields a reliability closer to 1.0, with coefficients above 0.80 considered reliable and those below 0.40 considered poor.
Concept of True Score
Definition: A subject’s true score reflects their latent ability on a given construct, which is abstract and not directly measurable.
Observed vs. True Score: True scores are latent and must be inferred from observed scores; they represent underlying KSAs.
Reliability Coefficients
Reliability coefficients are calculated using specific forms of the correlation coefficient.
Values above .80 are generally interpreted as good reliability, and values below .40 as poor.
These coefficients assume that analyses are conducted on populations rather than samples, which affects the calculations.
Implications of Reliability Coefficients
A coefficient of 1.0 indicates that observed scores reflect true scores exactly, with no measurement error.
A coefficient of 0.0 indicates that all measurements represent random error only.
For values between 0.0 and 1.0, observed scores include both true score variance and error variance.
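To make the decomposition concrete, here is a minimal simulation sketch (not from the lecture; the seed, sample size, and standard deviations are arbitrary choices) showing that the correlation between two parallel measurements of the same true scores recovers the ratio of true-score variance to total variance:

```python
# Classical test theory sketch: observed = true + error.
# All parameter values below are arbitrary illustration choices.
import numpy as np

rng = np.random.default_rng(272)  # arbitrary seed
n = 10_000
sigma_true, sigma_error = 15.0, 7.5

true = rng.normal(100, sigma_true, n)        # latent true scores
obs1 = true + rng.normal(0, sigma_error, n)  # first measurement occasion
obs2 = true + rng.normal(0, sigma_error, n)  # second occasion, independent errors

theoretical = sigma_true**2 / (sigma_true**2 + sigma_error**2)
empirical = np.corrcoef(obs1, obs2)[0, 1]    # correlation of parallel measures

print(f"theoretical reliability: {theoretical:.3f}")  # 0.800
print(f"empirical estimate:      {empirical:.3f}")    # close to 0.80
```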
Methods of Estimating Reliability
Overview of Methods
Multiple methods exist for determining reliability, all related to correlation:
Notable early work by Thorndike (1918) emphasized the connection between reliability and validity.
Test-Retest Reliability
Definition: Involves testing the same subjects on two occasions with the same instrument.
Calculation: The reliability coefficient is the Pearson correlation between the two score sets.
Use Case: Common in personality tests and behavioral assessments. Less suitable for cognitive tests because of carry-over and practice effects, whose impact depends on the testing interval and participants' familiarity with the material.
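As an illustration, a minimal sketch of the calculation; the score vectors are hypothetical, invented for demonstration:

```python
# Test-retest reliability: Pearson correlation between two administrations
# of the same instrument to the same subjects (hypothetical scores).
from scipy.stats import pearsonr

time1 = [23, 31, 28, 35, 40, 27, 33, 29, 38, 25]  # first administration
time2 = [25, 30, 27, 36, 41, 26, 35, 28, 37, 24]  # retest, same subjects

r, p = pearsonr(time1, time2)
print(f"test-retest reliability r = {r:.2f}")
```

The identical computation applies to parallel-forms reliability (next section): correlate scores from form C with scores from form D.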
Parallel-Forms Reliability
Assumptions: For two test forms (C and D) to be truly parallel, specific assumptions must hold, namely equivalent true scores and equivalent error variances across forms.
Constructing Parallel Forms: Involves ensuring equivalency in true-score and error variances, which requires a robust understanding of latent constructs.
Calculation Method: Correlate results from both forms (C and D) to ascertain reliability.
Effect of Test Length on Reliability
General Implication: Adding items tends to increase reliability.
Spearman-Brown Formula: Utilized to evaluate increases in reliability due to added test items or length adjustments.
Expressed by the formula $r_{\text{new}} = \frac{n \, r_{\text{old}}}{1 + (n - 1)\, r_{\text{old}}}$, where $n$ is the factor by which the test length changes (e.g., $n = m/k$ when moving from $k$ to $m$ items).
Example Calculation: If an instrument has an initial reliability of $0.70$ and its length is doubled from $k = 10$ to $m = 20$ items ($n = 2$), the predicted reliability is $\frac{2 \times 0.70}{1 + 0.70} \approx 0.82$, a relative increase of roughly 18%.
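The prophecy formula can be written as a short sketch that reproduces the worked example (the function name is illustrative):

```python
# Spearman-Brown prophecy formula: predicted reliability when test
# length is multiplied by a factor n.
def spearman_brown(r_old: float, n: float) -> float:
    return (n * r_old) / (1 + (n - 1) * r_old)

r_new = spearman_brown(0.70, n=20 / 10)  # doubling from k=10 to m=20 items
print(f"new reliability: {r_new:.3f}")   # 0.824, roughly 18% above 0.70
```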
Internal Consistency
Definition: Determined by correlating two halves of an assessment, which may be partitioned in various ways (e.g., first-half/second-half, even-odd).
Adaptation: A simplified version of the Spearman-Brown formula (with length factor $n = 2$) steps the half-test correlation up to full test length; for dichotomously scored items, the Kuder-Richardson formulas serve the same purpose.
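For instance, a minimal odd-even split-half sketch on hypothetical 0/1 item responses, applying the Spearman-Brown correction with $n = 2$ to the half-test correlation:

```python
# Split-half reliability: correlate odd- and even-numbered items, then
# step the half-test correlation up to full length with Spearman-Brown.
import numpy as np

scores = np.array([  # rows = subjects, columns = items (hypothetical data)
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
])
odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = (2 * r_half) / (1 + r_half)     # Spearman-Brown with n = 2
print(f"split-half r = {r_half:.2f}, corrected to full length = {r_full:.2f}")
```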
Reliability Estimates:
Cronbach’s Alpha: Established by Cronbach (1951); represents the most prevalent reliability measure, functioning for both dichotomous and polytomous data.
The formula for Cronbach's alpha is: $\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_i}{\sigma^2_X}\right)$, where $k$ is the number of items, $\sigma^2_i$ is the variance of item $i$, and $\sigma^2_X$ is the variance of total test scores.
Coefficients of $0.70$ or higher generally indicate acceptable reliability results.
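The alpha formula translates directly into code; the item responses below are hypothetical (rows are subjects, columns are items), and on dichotomous 0/1 data the same computation yields KR-20:

```python
# Cronbach's alpha computed from its definition:
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

scores = np.array([  # hypothetical responses on a 1-5 scale
    [4, 3, 5, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 4, 3],
    [1, 2, 2, 1],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```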
Interrater Reliability
Defined by Cohen’s kappa coefficient (κ), a chance-corrected agreement statistic, computed from a contingency table, that quantifies raters' agreement beyond what would be expected by chance.
Example Case: A psychologist classifies behavioral problems among adolescents; a secondary rater checks this classification.
Potential for Chance Agreements: Simply calculating percent agreement can be misleading without accounting for agreements expected to occur by chance.
Expected Frequencies Calculation: In assessing multiple classifications, observed and expected frequencies are essential, specifically focusing on diagonal cells within contingency tables.
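A minimal sketch with hypothetical counts makes the chance-correction point concrete: the two raters below agree 70% of the time, yet κ is only 0.40 once chance agreement is removed:

```python
# Cohen's kappa from a two-rater contingency table:
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
# (diagonal cells) and p_e is the agreement expected by chance.
import numpy as np

table = np.array([  # rows = rater 1's categories, columns = rater 2's
    [20,  5],
    [10, 15],
])
n = table.sum()
p_o = np.trace(table) / n                             # observed agreement
p_e = (table.sum(axis=1) @ table.sum(axis=0)) / n**2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(f"percent agreement = {p_o:.2f}, kappa = {kappa:.2f}")
```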
Summary of Reliability Interpretation Guidelines
Cohen’s κ can be interpreted by level of agreement as follows (Landis & Koch, 1977):
< 0.00: Poor agreement
0.00 - 0.20: Slight agreement
0.21 - 0.40: Fair agreement
0.41 - 0.60: Moderate agreement
0.61 - 0.80: Substantial agreement
0.81 - 1.0: Almost perfect agreement
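These benchmarks can be encoded as a small lookup helper (a convenience sketch, not part of any standard library):

```python
# Landis & Koch (1977) benchmarks for interpreting Cohen's kappa.
def interpret_kappa(kappa: float) -> str:
    if kappa < 0.00:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(interpret_kappa(0.40))  # fair
```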
References
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151–160.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Luecht, R. M. (2004). Reliability. Presentation to ERM 667: Foundations of Educational Measurement, The University of North Carolina at Greensboro. Greensboro, NC: Author.
Spearman, C. C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
Thorndike, E. L. (1918). The nature, purposes and general methods of measurements of educational products. National Society for the Study of Educational Products: Seventeenth Yearbook, 16–24. Chicago, IL: University of Chicago Press.