
Reliability Instrumentation, Calculation & Issues in Psychology 272 – Spring 2025


Introduction

  • This section is dedicated to the memory of Joseph T. McCord (1990 – 2020).

Overview

  • The following topics are discussed:
    • Definition of Reliability
    • The Relationship between Reliability and Error
    • True Scores
    • Relationship with latent knowledge, skills, and abilities (KSAs)
    • Implications of Reliability Coefficients
    • Test-retest reliability
    • Parallel/alternate forms reliability
    • Effect of Test Length on Reliability
    • Spearman-Brown Formula
    • Internal Consistency
    • Cronbach’s alpha, KR-20, KR-21
    • Interrater Reliability

Reliability

  • Definition: Reliability is the consistency or stability of an instrument: a reliable instrument measures the same thing in the same way every time it is used.
  • Consistency: If the same instrument is administered to the same individual repeatedly, the scores should be nearly uniform across trials.
  • Analogy: A speedometer that gives a different reading each time the car travels at the same speed is useless as a measure of speed.

Measurement Error and Reliability

  • Reliable Measurements: A measurement is reliable if it is devoid of measurement error.
  • Quantification: While no instrument is entirely free of error, we can quantify the degree of reliability through a reliability coefficient.
  • Types of Reliability Coefficients: Common coefficients include Cronbach’s alpha, Spearman-Brown, KR-20, KR-21.
  • Degree of Reliability: Reliability is not simply a binary question; it exists on a continuum due to the inherent error present in observations.

Reliability and Error

  • In statistical analyses such as t-tests and ANOVAs, data can be viewed as having two components:
    • Systematic variability (the desirable component)
    • Error variability (the undesirable component)
  • Variability in measurement instruments can lead to inconsistent results, indicating a lack of reliability.

Sources of Error in Measurement

  • There are two primary types of error affecting measurements:
    • Method Error: Issues with the experimenter or testing conditions, including faulty equipment, poor instructions, and distractions.
    • Trait Error: Challenges arising from the participants themselves, such as dishonesty, illness, fatigue, or effects stemming from being observed (the Hawthorne effect).

Reliability Calculation and Implications

  • The reliability of scores can be expressed as the ratio of true-score variance to observed-score variance: $r_{xx} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$. A computational sketch follows this list.
  • Smaller error variance yields a reliability closer to 1.0; coefficients above 0.8 are generally considered reliable, while those below 0.4 are considered poor.
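The following minimal Python sketch illustrates the ratio; the variance components are hypothetical values chosen only to show the arithmetic, not results from any actual instrument.

```python
# Classical test theory: reliability is the share of observed-score
# variance attributable to true scores. All values are hypothetical.
true_score_variance = 80.0   # sigma^2_T (assumed for illustration)
error_variance = 20.0        # sigma^2_E (assumed for illustration)

observed_variance = true_score_variance + error_variance  # sigma^2_X
reliability = true_score_variance / observed_variance

print(f"Reliability = {reliability:.2f}")  # 0.80, at the "reliable" threshold
```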

Concept of True Score

  • Definition: A subject’s true score reflects their latent ability on a given construct, which is abstract and not directly measurable.
  • Observed vs. True Score: Observed scores are the basis for inferences about true scores, which represent the underlying KSAs.

Reliability Coefficients

  • Calculated using specific forms of correlation coefficients:
    • Values above .80 generally indicate good reliability; values below .40 indicate poor reliability.
  • The classical formulation assumes that variances are population values rather than sample estimates, which affects how the coefficients are calculated.

Implications of Reliability Coefficients

  • As a reliability coefficient approaches 1.0, observed scores increasingly reflect true scores; at exactly 1.0, true and observed scores correlate perfectly and there is no measurement error.
  • A coefficient of 0.0 indicates that all measurements represent random error only.
  • For values between 0.0 and 1.0, observed scores include both true score variance and error variance.

Methods of Estimating Reliability

Overview of Methods

  • Multiple methods exist for determining reliability, all related to correlation:
    • Notable early work by Thorndike (1918) emphasized the connection between reliability and validity.

Test-Retest Reliability

  • Definition: Involves testing the same subjects on two occasions with the same instrument.
  • Calculation: The reliability coefficient is the Pearson correlation between the two score sets.
  • Use Case: Common for personality tests and behavioral assessments. It is less suitable for cognitive tests, where carry-over and practice effects can distort the correlation, depending on the testing interval and participants’ familiarity with the material. A computational sketch follows this list.
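As a minimal sketch of the calculation, the following Python snippet correlates hypothetical scores for eight subjects tested on two occasions; the reliability coefficient is simply the Pearson correlation between the two administrations.

```python
import numpy as np

# Hypothetical scores from two administrations of the same instrument
# to the same eight subjects.
time1 = np.array([12, 15, 9, 20, 18, 14, 11, 17])
time2 = np.array([13, 14, 10, 19, 17, 15, 10, 18])

# Test-retest reliability is the Pearson correlation between the two sets.
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: {r_tt:.2f}")
```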

Parallel-Forms Reliability

  • Assumptions: For two test forms (C and D) to be truly parallel, they must have equal means, equal true-score variances, and equal error variances.
  • Constructing Parallel Forms: Achieving this equivalency in true-score and error variances requires a robust understanding of the latent construct being measured.
  • Calculation Method: Correlate scores from the two forms (C and D) to estimate reliability, as sketched below.
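A sketch of the parallel-forms calculation, again with hypothetical data: truly parallel forms should show approximately equal means and variances, and the correlation between forms estimates the reliability.

```python
import numpy as np

# Hypothetical scores of the same eight subjects on two forms, C and D.
form_c = np.array([22, 18, 25, 30, 27, 19, 24, 28])
form_d = np.array([23, 17, 26, 29, 28, 20, 23, 27])

# Parallel forms should have (approximately) equal means and variances.
print(f"Means:     C = {form_c.mean():.2f}, D = {form_d.mean():.2f}")
print(f"Variances: C = {form_c.var(ddof=1):.2f}, D = {form_d.var(ddof=1):.2f}")

# The correlation between the two forms estimates the reliability.
r_pf = np.corrcoef(form_c, form_d)[0, 1]
print(f"Parallel-forms reliability: {r_pf:.2f}")
```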

Effect of Test Length on Reliability

  • General Implication: Adding items tends to increase reliability.
  • Spearman-Brown Formula: Used to predict the change in reliability when a test is lengthened (or shortened) by a factor of $n$ (new number of items divided by old):
    $r_{\text{new}} = \dfrac{n \, r_{\text{old}}}{1 + (n - 1) \, r_{\text{old}}}$
  • Example Calculation: If an instrument has a reliability of 0.70 and the number of items is doubled from $k = 10$ to $m = 20$ (so $n = 2$), the predicted reliability is $\frac{2 \times 0.70}{1 + 0.70} \approx 0.82$, a relative increase of approximately 18%. The sketch below reproduces this calculation.
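The formula is straightforward to code; this minimal Python sketch reproduces the worked example above (initial reliability 0.70, items doubled).

```python
def spearman_brown(old_reliability: float, n: float) -> float:
    """Predicted reliability after changing test length by a factor of n
    (n = new number of items / old number of items)."""
    return (n * old_reliability) / (1 + (n - 1) * old_reliability)

# Doubling a 10-item test with reliability 0.70: n = 20 / 10 = 2.
new_r = spearman_brown(0.70, n=20 / 10)
print(f"New reliability: {new_r:.3f}")               # ~0.824
print(f"Relative increase: {new_r / 0.70 - 1:.1%}")  # ~17.6%, i.e. about 18%
```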

Internal Consistency

  • Definition: Determined by correlating two halves of an assessment, which may be partitioned in various ways (e.g., first-half/second-half, even-odd).
  • Adaptation: A simplified version of the Spearman-Brown formula (with $n = 2$) corrects the half-test correlation up to full test length; for dichotomously scored items, the KR-20 and KR-21 formulas (Kuder & Richardson, 1937) serve the same purpose.
  • Reliability Estimates:
    • Cronbach’s Alpha: Established by Cronbach (1951); represents the most prevalent reliability measure, functioning for both dichotomous and polytomous data.
    • The formula is $\alpha = \dfrac{k}{k - 1}\left(1 - \dfrac{\sum_{i=1}^{k} \sigma^2_i}{\sigma^2_X}\right)$, where $k$ is the number of items, $\sigma^2_i$ is the variance of item $i$, and $\sigma^2_X$ is the variance of total scores. An implementation sketch follows this list.
    • Coefficients of $0.70$ or higher generally indicate acceptable reliability results.
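A minimal implementation of the alpha formula above, applied to a hypothetical subjects-by-items score matrix; the items here are dichotomous (0/1), but the same code handles polytomous ratings.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a subjects-by-items score matrix."""
    k = items.shape[1]
    sum_item_variances = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_variance = items.sum(axis=1).var(ddof=1)        # variance of total scores
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical data: six subjects answering four dichotomous items.
scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
])
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
```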

Interrater Reliability

  • Defined by Cohen’s kappa coefficient (κ; Cohen, 1960), a statistic that quantifies agreement between raters while correcting for agreement expected by chance.
  • Example Case: A psychologist classifies behavioral problems among adolescents; a secondary rater checks this classification.
  • Potential for Chance Agreements: Simply calculating percent agreement can be misleading without accounting for agreements expected to occur by chance.
  • Expected Frequencies Calculation: As in a chi-square analysis, observed and expected frequencies are computed for the diagonal (agreement) cells of the contingency table; then $\kappa = \frac{P_o - P_e}{1 - P_e}$, where $P_o$ is the observed proportion of agreement and $P_e$ the proportion expected by chance. A computational sketch follows this list.
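A sketch of the kappa computation from a contingency table; the 3 × 3 table below is hypothetical, standing in for the behavioral-problem classifications in the example case.

```python
import numpy as np

def cohens_kappa(table: np.ndarray) -> float:
    """Cohen's kappa from a raters' contingency table."""
    n = table.sum()
    p_observed = np.trace(table) / n  # diagonal cells are agreements
    # Chance agreement from the marginal (row and column) totals.
    p_expected = table.sum(axis=1) @ table.sum(axis=0) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical table: two raters classifying 50 adolescents into
# three behavioral-problem categories.
table = np.array([
    [15,  2,  1],
    [ 3, 12,  2],
    [ 1,  2, 12],
])
print(f"Cohen's kappa: {cohens_kappa(table):.2f}")  # ~0.67: substantial agreement
```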

Summary of Reliability Interpretation Guidelines

  • Cohen’s κ is interpreted by level of agreement as follows (Landis & Koch, 1977):
    • < 0.00: Poor agreement
    • 0.00 - 0.20: Slight agreement
    • 0.21 - 0.40: Fair agreement
    • 0.41 - 0.60: Moderate agreement
    • 0.61 - 0.80: Substantial agreement
    • 0.81 - 1.0: Almost perfect agreement

References

  • Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
  • Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.
  • Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151–160.
  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
  • Luecht, R. M. (2004). Reliability. Presentation to ERM 667: Foundations of Educational Measurement, The University of North Carolina at Greensboro. Greensboro, NC: Author.
  • Spearman, C. C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
  • Thorndike, E.L. (1918). The nature, purposes and general methods of measurements of educational products. National Society for the Study of Educational Products: Seventeenth Yearbook, 16–24. Chicago, IL: University of Chicago Press.