Reliability Instrumentation, Calculation & Issues in Psychology 272 – Spring 2025
Introduction
- This section is dedicated to the memory of Joseph T. McCord (1990 – 2020).
Overview
- The following topics are discussed:
- Definition of Reliability
- The Relationship between Reliability and Error
- True Scores
- Relationship with latent knowledge, skills, and abilities (KSAs)
- Implications of Reliability Coefficients
- Test-retest reliability
- Parallel/alternate forms reliability
- Effect of Test Length on Reliability
- Spearman-Brown Formula
- Internal Consistency
- Cronbach’s alpha, KR-20, KR-21
- Interrater Reliability
Reliability
- Definition: Reliability is defined as a measure of the consistency or stability of an instrument, which should measure the same thing in the same way every time it is used.
- Consistency: If the same instrument is administered to the same individual repeatedly, the scores should be nearly uniform across trials.
- Analogy: A speedometer that gives different readings at the same actual speed is useless.
Measurement Error and Reliability
- Reliable Measurements: A measurement is reliable to the extent that it is free of measurement error.
- Quantification: While no instrument is entirely free of error, we can quantify the degree of reliability through a reliability coefficient.
- Types of Reliability Coefficients: Common coefficients include Cronbach’s alpha, Spearman-Brown, KR-20, KR-21.
- Degree of Reliability: Reliability is not simply a binary question; it exists on a continuum due to the inherent error present in observations.
Reliability and Error
- In statistical analyses such as t-tests and ANOVAs, data can be viewed as having two components:
- Systematic variability (the desirable component)
- Error variability (the undesirable component)
- Variability in measurement instruments can lead to inconsistent results, indicating a lack of reliability.
Sources of Error in Measurement
- There are two primary types of error affecting measurements:
- Method Error: Issues with the experimenter or testing conditions, including faulty equipment, poor instructions, and distractions.
- Trial Error: Challenges faced by participants, such as dishonesty, illness, fatigue, or effects stemming from being observed (Hawthorne effect).
Reliability Calculation and Implications
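- The reliability of scores can be expressed as the ratio of true-score variance to total observed-score variance: $r_{XX} = \sigma_T^2 / (\sigma_T^2 + \sigma_E^2)$, where $\sigma_T^2$ is the true-score variance and $\sigma_E^2$ is the error variance.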
- A smaller error will yield a reliability score closer to 1.0, with scores above 0.8 considered reliable and those below 0.4 viewed as poor.
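The ratio can be illustrated with a minimal sketch. It assumes the classical decomposition in which observed-score variance is the sum of true-score variance and error variance; the function name, argument names, and example variances are illustrative and do not come from the course materials.

```python
def reliability_ratio(true_var: float, error_var: float) -> float:
    """Share of observed-score variance that is attributable to true scores."""
    if true_var < 0 or error_var < 0:
        raise ValueError("variances must be non-negative")
    # Assumption: true scores and error are uncorrelated, so variances add.
    observed_var = true_var + error_var
    if observed_var == 0:
        raise ValueError("observed variance must be positive")
    return true_var / observed_var


# Hypothetical variances: shrinking the error variance pushes the ratio toward 1.0.
high = reliability_ratio(true_var=9.0, error_var=1.0)
low = reliability_ratio(true_var=9.0, error_var=9.0)
print(round(high, 2), round(low, 2))
```

In practice the true-score variance is not observed directly, so the ratio is estimated with the correlation-based methods described in the later sections.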
Concept of True Score
- Definition: A subject’s true score reflects their latent ability on a given construct, which is abstract and not directly measurable.
- Observed vs. True Score: Observed scores are the basis for inferences about true scores, which represent KSAs.
Reliability Coefficients
- Calculated using specific forms of correlation coefficients.
- Values above .80 are typically taken to indicate good reliability, and values below .40 poor reliability.
- The coefficient is defined at the population level, so values calculated from sample data are estimates of it.
Implications of Reliability Coefficients
- A coefficient of 1.0 indicates that observed scores reflect true scores exactly: true and observed scores are perfectly correlated, and there is no measurement error.
- A coefficient of 0.0 indicates that all measurements represent random error only.
- For values between 0.0 and 1.0, observed scores include both true score variance and error variance.
Methods of Estimating Reliability
Overview of Methods
- Multiple methods exist for determining reliability, all of them based on correlation.
- Notable early work by Thorndike (1918) emphasized the connection between reliability and validity.
Test-Retest Reliability
- Definition: Involves testing the same subjects on two occasions with the same instrument.
- Calculation: The reliability coefficient is the Pearson correlation between the two score sets.
- Use Case: Common in personality tests and behavioral assessments. Not well suited to cognitive tests, where carry-over and practice effects can distort the retest scores; the size of these effects depends on the testing interval and on participant familiarity with the material.
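As a rough sketch of the calculation described above, the test-retest coefficient can be computed as a Pearson correlation between the two sets of scores. The helper name `pearson_r` and the scores for five people are hypothetical and serve only as an illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary on both occasions")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for the same five people on two occasions.
time_1 = [12, 15, 9, 20, 17]
time_2 = [13, 14, 10, 19, 18]
test_retest_r = pearson_r(time_1, time_2)
print(round(test_retest_r, 3))
```

The same function would apply to parallel forms, with the two lists holding scores on each form rather than on each occasion.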
Parallel-Forms Reliability
- Assumptions: For two test forms (C and D) to be truly parallel, specific assumptions must be accepted.
- Constructing Parallel Forms: Involves ensuring equivalency in true-score and error variances, which requires a robust understanding of latent constructs.
- Calculation Method: Correlate results from both forms (C and D) to ascertain reliability.
Effect of Test Length on Reliability
- General Implication: Adding items tends to increase reliability.
- Spearman-Brown Formula: Utilized to evaluate increases in reliability due to added test items or length adjustments.
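- Explained using the formula $r_{mm} = \frac{(m/k)\, r_{kk}}{1 + (m/k - 1)\, r_{kk}}$, where $r_{kk}$ is the reliability of the original $k$-item test and $r_{mm}$ is the predicted reliability of the $m$-item test.
- Example Calculation: If an instrument initially has a reliability of $0.70$ and the number of items is doubled from $k = 10$ to $m = 20$, the predicted reliability is $(2 \times 0.70) / (1 + 0.70) \approx 0.82$, a relative increase of approximately $18\%$.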
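The example calculation can be checked with a short sketch of the Spearman-Brown prediction. The function name and the `length_factor` parameter (new length divided by old length, $m / k$) are illustrative choices, not notation from the course.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability after changing test length.

    length_factor is new length / old length (m / k), e.g. 2.0 when doubling.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


old_r = 0.70
new_r = spearman_brown(old_r, length_factor=20 / 10)  # k = 10 items to m = 20 items
relative_gain = (new_r - old_r) / old_r  # increase relative to the starting value
print(round(new_r, 3), round(relative_gain, 3))
```

A `length_factor` below 1.0 gives the predicted reliability of a shortened test, under the same assumption that the added or removed items behave like the existing ones.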
Internal Consistency
- Definition: Determined by correlating two halves of an assessment, which may be partitioned in various ways (e.g., first-half/second-half, even-odd).
- Adaptation: A simplified version of the Spearman-Brown formula applies in this context, using dichotomous scoring for test items.
- Reliability Estimates:
- Cronbach’s Alpha: Established by Cronbach (1951); represents the most prevalent reliability measure, functioning for both dichotomous and polytomous data.
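- The formula for Cronbach's alpha is given by: $\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)$, where $k$ is the number of items, $\sigma_i^2$ is the variance of item $i$, and $\sigma_X^2$ is the variance of the total scores.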
- Coefficients of $0.70$ or higher generally indicate acceptable reliability results.
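A minimal sketch of the alpha computation follows. It assumes the standard form of the coefficient (number of items, sum of item variances, variance of total scores) and uses a small, made-up response matrix; the helper names are illustrative.

```python
def variance(values):
    """Sample variance (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(item_scores):
    """item_scores: one row per respondent, one column per item."""
    n_items = len(item_scores[0])
    if n_items < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each item (column) and of the total score (row sums).
    item_vars = [variance([row[i] for row in item_scores]) for i in range(n_items)]
    total_var = variance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total scores must vary across respondents")
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: five respondents, four items on a 1-5 scale.
responses = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [4, 4, 4, 3],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

When every item is scored 0/1, the same computation corresponds to KR-20, provided the item variances and the total-score variance use the same denominator.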
Interrater Reliability
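- Commonly quantified by Cohen’s kappa coefficient (κ), a chance-corrected statistic of raters' agreement. It is computed from a contingency table of the two raters' classifications, with expected frequencies obtained in the same way as for a chi-square test.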
- Example Case: A psychologist classifies behavioral problems among adolescents; a secondary rater checks this classification.
- Potential for Chance Agreements: Simply calculating percent agreement can be misleading without accounting for agreements expected to occur by chance.
- Expected Frequencies Calculation: In assessing multiple classifications, observed and expected frequencies are essential, specifically focusing on diagonal cells within contingency tables.
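The role of the diagonal cells can be sketched in code. The example assumes the usual definition of kappa, (observed agreement minus chance agreement) divided by (1 minus chance agreement), and uses a made-up 3 × 3 table of counts for two raters; the function name and the data are hypothetical.

```python
def cohens_kappa(table):
    """table[i][j]: number of cases rater A put in category i and rater B in j."""
    n_cat = len(table)
    if any(len(row) != n_cat for row in table):
        raise ValueError("contingency table must be square")
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("table must contain at least one case")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(n_cat)) for j in range(n_cat)]
    # Observed agreement: proportion of cases on the diagonal.
    observed = sum(table[i][i] for i in range(n_cat)) / total
    # Chance agreement: expected diagonal proportions from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(n_cat)) / total ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical counts: rows are rater A's categories, columns are rater B's.
ratings = [
    [20, 5, 0],
    [3, 15, 2],
    [1, 4, 10],
]
percent_agreement = sum(ratings[i][i] for i in range(3)) / 60
kappa = cohens_kappa(ratings)
print(round(percent_agreement, 2), round(kappa, 3))
```

In this made-up table the raw percent agreement is noticeably higher than kappa, which is the gap that chance agreement accounts for.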
Summary of Reliability Interpretation Guidelines
- Cohen’s κ is interpreted by level of agreement as follows (Landis & Koch, 1977):
- < 0.00: Poor agreement
- 0.00 - 0.20: Slight agreement
- 0.21 - 0.40: Fair agreement
- 0.41 - 0.60: Moderate agreement
- 0.61 - 0.80: Substantial agreement
- 0.81 - 1.0: Almost perfect agreement
References
- Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
- Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
- Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.
- Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151–160.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
- Luecht, R. M. (2004). Reliability. Presentation to ERM 667: Foundations of Educational Measurement, The University of North Carolina at Greensboro. Greensboro, NC: Author.
- Spearman, C. C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
- Thorndike, E. L. (1918). The nature, purposes and general methods of measurements of educational products. National Society for the Study of Education: Seventeenth Yearbook, 16–24. Chicago, IL: University of Chicago Press.