Assessing Validity and Reliability of Diagnostic and Screening Tests

Learning Objectives

  • To define the validity and reliability of screening and diagnostic tests.

  • To compare measures of validity, including sensitivity and specificity.

  • To illustrate the use of multiple tests (sequential and simultaneous testing).

  • To introduce positive and negative predictive value.

  • To address measures of reliability, including percent agreement and kappa.

Introduction

  • The Problem: The quality of diagnostic and screening tests is a constant concern in both clinical practice and public health.

  • Core Question: How effective is a given test (e.g., physical exam, X-ray, blood assay) at differentiating individuals with a disease from those without it?

  • Module's Purpose: This module focuses on methods to assess the quality of new screening and diagnostic tests to guide decisions about their appropriate use and interpretation.

  • Key Challenge: Biologic variation within the human population complicates test assessment.

Validity of Screening Tests

  • Definition: Validity refers to a test's ability to accurately distinguish between individuals who have a disease and those who do not.

  • Two Components of Validity:

    • Sensitivity: The ability of a test to correctly identify individuals who do have the disease (True Positives).

    • Specificity: The ability of a test to correctly identify individuals who do not have the disease (True Negatives).

Tests with Dichotomous Results (Positive or Negative)

Test Results | Have the Disease | Do Not Have the Disease
Positive | True Positive (TP) | False Positive (FP)
Negative | False Negative (FN) | True Negative (TN)

  • Formulas:

    • Sensitivity: \text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}} (Proportion of truly diseased individuals identified as positive)

    • Specificity: \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} (Proportion of truly non-diseased individuals identified as negative)

Example Calculation

Test Results | Have the Disease | Do Not Have the Disease
Positive | 80 | 100
Negative | 20 | 800

  • Sensitivity Calculation: \frac{80}{80 + 20} = \frac{80}{100} = 80\%

  • Specificity Calculation: \frac{800}{800 + 100} = \frac{800}{900} \approx 89\%
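The two calculations above can be sketched in Python (a minimal illustration; the function names are mine, and the counts come from the example table):

```python
def sensitivity(tp, fn):
    """Proportion of truly diseased people the test calls positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of truly non-diseased people the test calls negative."""
    return tn / (tn + fp)

# Counts from the example table: 80 TP, 20 FN, 100 FP, 800 TN
print(round(sensitivity(80, 20), 2))    # 0.8  -> 80%
print(round(specificity(800, 100), 2))  # 0.89 -> ~89%
```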

Why are Sensitivity and Specificity Important?
  • False Positives (FP):

    • Lead to more expensive follow-up tests, placing a burden on the healthcare system.

    • Induce significant anxiety and worry for the individual.

    • Can have negative impacts on insurance and employment status.

  • False Negatives (FN):

    • The importance of false negatives depends on several factors:

      • Nature and Severity of the Disease: How serious is the condition?

      • Effectiveness of Available Intervention Measures: Are there treatments?

      • Timing of Intervention: Is early intervention significantly more effective in the disease's natural history?

Use of Multiple Tests

1. Sequential (Two-Stage) Testing
  • Process:

    1. An initial, generally less expensive, less invasive, or less uncomfortable test is administered.

    2. Only those individuals who screen positive on the first test are recalled for a second, often more expensive, more invasive, or more uncomfortable test.

    3. The second test typically has higher sensitivity and specificity than the first.

  • Objective: This strategy primarily aims to reduce false positives by using a more definitive follow-up test for initial positives.

2. Simultaneous Testing
  • Process: Two or more tests (e.g., Test A and Test B) are administered at the same time to the population.

  • Example Scenario: A population of 1,000, with a disease prevalence of 20% (so 200 people have the disease). Two tests are used simultaneously.

    • Test A: Sensitivity = 80%, Specificity = 60%

    • Test B: Sensitivity = 90%, Specificity = 90%

  • Net Sensitivity (for two simultaneous tests):

    • A person is considered positive if identified by Test A, Test B, or both tests.

    • In the example, the net sensitivity is 98%.

  • Net Specificity (for two simultaneous tests):

    • A person is considered negative only if identified as negative by both tests.

    • In the example, the net specificity is 54%.
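Assuming the two tests err independently, the net figures quoted above follow directly from the individual sensitivities and specificities. A minimal Python sketch (function names are illustrative):

```python
def net_sensitivity(sens_a, sens_b):
    """Simultaneous testing: a person is positive if either test is
    positive, so a case is missed only when both tests miss it."""
    return 1 - (1 - sens_a) * (1 - sens_b)

def net_specificity(spec_a, spec_b):
    """Simultaneous testing: a person is negative only when both
    tests are negative (assuming the tests err independently)."""
    return spec_a * spec_b

# Test A: sens 80%, spec 60%; Test B: sens 90%, spec 90%
print(round(net_sensitivity(0.80, 0.90), 2))  # 0.98 -> 98%
print(round(net_specificity(0.60, 0.90), 2))  # 0.54 -> 54%
```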

Comparison of Simultaneous and Sequential Testing
  • Sequential Testing: Generally results in a net loss in sensitivity but a net gain in specificity compared to either test alone.

  • Simultaneous Testing: Generally results in a net gain in sensitivity but a net loss in specificity compared to either test alone.

  • Decision-Making: The choice between sequential and simultaneous testing depends on:

    • The objectives of the testing (e.g., screening vs. diagnosis).

    • Practical considerations in the testing setting (e.g., length of hospital stay, costs, degree of invasiveness of tests).

    • Extent of third-party insurance coverage.

Predictive Value of a Test

  • Concept: Predictive value focuses on the proportion of people who, given a test result, are correctly identified as having or not having the disease.

1. Positive Predictive Value (PPV)
  • Definition: If a test result is positive, PPV is the probability that the patient actually has the disease.

  • Formula: \text{PPV} = \frac{\text{TP}}{\text{TP} + \text{FP}}

2. Negative Predictive Value (NPV)
  • Definition: If a test result is negative, NPV is the probability that the patient actually does not have the disease.

  • Formula: \text{NPV} = \frac{\text{TN}}{\text{TN} + \text{FN}}

Example Calculation

Test Results | Disease | No Disease
+ | 80 | 100
- | 20 | 800

  • PPV Calculation: \frac{80}{80 + 100} = \frac{80}{180} \approx 44.4\%

  • NPV Calculation: \frac{800}{800 + 20} = \frac{800}{820} \approx 97.6\%
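The predictive-value calculations can be sketched the same way (illustrative function names; counts from the example table):

```python
def ppv(tp, fp):
    """Probability that a patient with a positive result has the disease."""
    return tp / (tp + fp)

def npv(tn, fn):
    """Probability that a patient with a negative result is disease-free."""
    return tn / (tn + fn)

# From the example table: 80 TP, 100 FP, 20 FN, 800 TN
print(round(ppv(80, 100) * 100, 1))  # 44.4 (%)
print(round(npv(800, 20) * 100, 1))  # 97.6 (%)
```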

Relationship between PPV and Disease Prevalence and Test Specificity
  • Prevalence and PPV: A higher disease prevalence in the tested population leads to a higher PPV.

    • Implication: Screening programs are most productive and cost-effective when directed at high-risk target populations.

    • Screening a general population for a rare disease can be wasteful, yielding few new cases relative to the effort.

    • Directing screening to a high-risk group increases productivity, motivation to participate, and likelihood of compliance with recommendations.

  • Specificity and PPV: In populations with low disease prevalence, the specificity of the test has a greater effect on the PPV than its sensitivity.
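The dependence of PPV on prevalence can be made concrete with Bayes' theorem, which expresses PPV from sensitivity, specificity, and prevalence alone. A sketch (the specific prevalence values are chosen only for illustration):

```python
def ppv_from_rates(sens, spec, prevalence):
    """PPV via Bayes' theorem: true positives as a share of all positives."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same test quality (sens 80%, spec 89%) applied at different prevalences:
# PPV rises steeply as the tested population becomes higher-risk.
for prev in (0.001, 0.01, 0.10, 0.30):
    print(prev, round(ppv_from_rates(0.80, 0.89, prev), 3))
```

At a prevalence of 0.1% the same test yields a PPV under 1%, which is why screening a general population for a rare disease is wasteful while targeting a high-risk group is productive.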

Reliability (Repeatability) of Tests

  • Definition: Reliability addresses whether a test result can be consistently replicated if the test is repeated under the same conditions.

  • Importance: If test results are not reproducible, the value and usefulness of the test are greatly diminished, regardless of its sensitivity and specificity.

  • Factors Affecting Reliability:

    1. Intra-subject Variation: Natural physiological changes within the same individual over time. (e.g., blood pressure fluctuations over a day or seasonally).

    2. Intra-observer Variation: Variability in readings or interpretations made by the same observer on the same test results at different times. This often involves subjective factors (e.g., a radiologist reading the same X-ray differently on separate occasions).

    3. Inter-observer Variation: Differences in readings or interpretations of the same test results by different observers. Quantifying this agreement/disagreement is crucial for assessing healthcare quality (e.g., two physicians examining the same patient).

Kappa Statistic

  • Limitation of Percent Agreement: While useful, simple percent agreement between observers doesn't account for agreement that occurs purely by chance. Even completely untrained observers would agree on some observations by random occurrence.

  • The Need for Kappa: We want to measure the extent to which observed agreement surpasses what would be expected by chance alone. This reflects the true improvement in agreement due to training and refined criteria.

  • Introduction: The kappa statistic, proposed by Cohen in 1960, addresses this.

  • What Kappa Quantifies:

    • Kappa expresses the degree to which observed agreement exceeds agreement expected by chance alone.

    • It presents this excess agreement as a proportion of the maximum possible improvement in agreement beyond chance.

    • Formulaically, it can be thought of as: \text{Kappa} = \frac{\text{Percent Agreement Observed} - \text{Percent Agreement Expected by Chance}}{100\% - \text{Percent Agreement Expected by Chance}}

  • Interpretation (Landis and Koch's Suggestions):

    • \text{Kappa} > 0.75: Represents excellent agreement beyond chance.

    • 0.40 \le \text{Kappa} \le 0.75: Represents intermediate to good agreement.

    • \text{Kappa} < 0.40: Represents poor agreement.
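The kappa formula above can be sketched for two observers rating the same subjects; the 2×2 agreement counts in the example are hypothetical:

```python
def kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 inter-observer agreement table:
    a = both observers say positive, d = both say negative,
    b and c = the two kinds of disagreement."""
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement, computed from each observer's marginal totals
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical example: out of 100 subjects, the observers agree on
# 40 positives and 40 negatives and split on the remaining 20.
# Observed agreement = 80%, chance agreement = 50%, so
# kappa = (0.80 - 0.50) / (1 - 0.50) = 0.6 -> intermediate to good.
print(round(kappa(40, 10, 10, 40), 2))  # 0.6
```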

Relationship between Validity and Reliability

  • Group vs. Individual: It's important to remember that a test's validity for a group or population may not directly translate to validity for an individual in a clinical setting.

  • Reliability's Impact: If a test's reliability or repeatability is poor, its validity for a particular individual will also likely be poor. Poor reliability undermines individual validity.

  • Key Distinction: Maintaining the distinction between group validity and individual validity is crucial when assessing the quality of diagnostic and screening tests.