Chapter 7 – Test Utility and Utility Analysis
Introduction: Everyday vs Psychometric “Utility”
Everyday meaning = “usefulness”; psychometrics = practical value of using a test, battery, training, or intervention to aid decision-making.
Representative questions utility seeks to answer:
How does Test A compare to Test B?
Does adding a test to a battery improve screening?
Will an admissions/personnel test select better applicants than supervisor judgment alone?
Does the test save time, money, or other resources?
Core Definition of Test Utility
A measure of efficiency gains when testing is implemented.
Applies to single instruments and full testing programs.
Utility judgments draw on:
Reliability data
Validity data
Additional information (costs, benefits, logistics, ethics, etc.).
Factors Influencing a Test’s Utility
1. Psychometric Soundness
Reliability → sets the ceiling for validity; validity (especially criterion-related) tends to raise utility but does not guarantee it.
Example: a sweat patch for cocaine detection showed r = .92 agreement with urine tests when untampered, yet had low utility because of frequent patch tampering (Chawarski et al., 2007).
2. Costs (Economic & Non-Economic)
Direct : purchase, protocols, scoring software, staff time, facilities, insurance, legal, overhead.
Indirect: cost of not testing or of using an ineffective test (e.g. an airline stops assessing pilots → lawsuits, loss of confidence).
Noneconomic: harm, public safety, morale, ethics (e.g. missing child-abuse fractures due to fewer X-ray views).
3. Benefits (Economic & Non-Economic)
Economic: ↑ productivity, ↓ waste, ↑ profit, ROI.
Noneconomic (often convert to economic benefits later): better work climate, fewer accidents, lower turnover, societal safety from accurate involuntary-hospitalization decisions.
Utility Analysis: Concepts & Purposes
Family of cost-benefit techniques; guides choice among testing, training, interventions.
Typical decisions:
Choose Test A vs Test B vs no test.
Add/subtract tools in a battery.
Compare training programs or intervention elements.
End product = “educated decision” (optimal course of action).
Expectancy Data
Scatterplot → expectancy table showing probability of criterion success per predictor band.
Classic aids: Taylor–Russell tables & Naylor–Shine tables.
Inputs: validity r_{xy}; selection ratio; base rate.
Output: percentage of hires predicted successful after adding test.
Limitation: assumes linear predictor-criterion relation & clear pass/fail criterion.
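A minimal simulation sketch of the Taylor–Russell logic, assuming a bivariate-normal predictor–criterion relationship (function name and sample size are illustrative; the .55/.20/.60 case reproduces the ≈ .88 success rate cited in the Key Formulas section):

```python
import numpy as np

def taylor_russell_sim(validity, selection_ratio, base_rate, n=1_000_000, seed=0):
    """Approximate a Taylor-Russell table lookup by simulating a
    bivariate-normal predictor (x) and criterion (y) with correlation = validity."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)                                            # predictor scores
    y = validity * x + np.sqrt(1 - validity**2) * rng.standard_normal(n)  # criterion scores
    x_cut = np.quantile(x, 1 - selection_ratio)   # hire the top SR of applicants
    y_cut = np.quantile(y, 1 - base_rate)         # "successful" = top BR on the criterion
    hired = x >= x_cut
    return (y[hired] >= y_cut).mean()             # proportion of hires who succeed

# Base rate .60, selection ratio .20, validity .55 → projected success ≈ .88
print(round(taylor_russell_sim(0.55, 0.20, 0.60), 2))
```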
Close-Up: Flecha Esmaralda Road Test (FERT)
Scenario: South-American courier hiring.
Existing policy = license + no criminal record → 50 % of hires rated “qualified.”
New on-road FERT studied (predictive validity r=.40).
Three illustrative cut scores:
18 (low):
Selection ratio = .95 (57/60 hired).
Positive Predictive Value (PPV) = .526.
False negatives 0 %, but utility gain trivial.
80 (high):
Selection ratio = .10 (6/60 hired).
PPV = 1.00, yet overall accuracy only 60 %.
Requires ≈ 600 applicants to yield 60 hires (costly recruitment).
48 (moderate) – chosen:
Selection ratio = .517 (31/60 hired).
Miss rate ↓ from 50 % to 15 %.
PPV = .839.
Misclassifications cut from 30 to 9 drivers.
ROI computed with the BCG formula ≈ 12.5:1 (see below).
Brogden–Cronbach–Gleser (BCG) Utility Formula
\text{Utility Gain} = N\,T\,r_{xy}\,SD_y\,\bar Z_m - N\,C
Example values (FERT):
N = 60 drivers hired/yr, T = 1.5-yr average tenure,
r_{xy} = 0.40, SD_y = \$9,000 (≈ 40 % of salary),
\bar Z_m = 1.0, C = \$200/test; with SR ≈ .50, about 120 applicants are tested to make 60 hires → total test cost = \$24,000.
\text{Benefit} = 60 \times 1.5 \times 0.40 \times \$9{,}000 \times 1.0 = \$324{,}000.
\text{Utility Gain} = \$324{,}000 - \$24{,}000 = \$300{,}000.
Each testing dollar returns >\$12.50.
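As a sketch, the FERT arithmetic can be wrapped in a small function (the names are mine; the 120-applicant figure assumes roughly a .50 selection ratio, as noted above):

```python
def bcg_utility_gain(n_hired, tenure_yrs, validity, sd_y, mean_z_hired,
                     cost_per_test, n_tested):
    """Brogden-Cronbach-Gleser linear utility estimate, in dollars."""
    benefit = n_hired * tenure_yrs * validity * sd_y * mean_z_hired
    cost = cost_per_test * n_tested
    return benefit - cost

# FERT: 60 hires, 1.5-yr tenure, r_xy = .40, SD_y = $9,000,
# mean standardized predictor score of hires = 1.0, $200/test, ~120 tested.
print(round(bcg_utility_gain(60, 1.5, 0.40, 9_000, 1.0, 200, 120)))  # 300000
```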
Productivity Variant
\text{Productivity Gain} = N\,T\,r_{xy}\,SD_p\,\bar Z_m - N\,C
SD_p = SD of output units (not dollars).
Decision Theory & Cut Scores
Four potential outcomes: True Positive, False Positive, False Negative, True Negative.
Trade-off managed via selection ratio & cut score.
Guidelines: set stricter cutoffs when false-positives are more costly (e.g. airline pilots).
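A short sketch tying the four outcomes to the Close-Up: the counts below are reconstructed from the cut-score-48 figures (31 of 60 hired, PPV ≈ .839, 9 total misclassifications, 50 % base rate) and should be read as illustrative:

```python
def classification_metrics(tp, fp, fn, tn):
    """Decision-theory metrics from the four selection outcomes."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),      # qualified applicants correctly hired
        "specificity": tn / (tn + fp),      # unqualified applicants correctly rejected
        "ppv": tp / (tp + fp),              # hires who turn out qualified
        "accuracy": (tp + tn) / total,
        "selection_ratio": (tp + fp) / total,
    }

# Reconstructed FERT counts at cut score 48: TP = 26, FP = 5, FN = 4, TN = 25.
print(classification_metrics(tp=26, fp=5, fn=4, tn=25))
# → ppv ≈ .839, accuracy ≈ .85, selection_ratio ≈ .517
```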
Cut-Score Taxonomy
Fixed/Absolute (criterion-referenced): e.g. a driver’s road test.
Relative/Norm-referenced: e.g. top 10 % get an A.
Multiple Cut Scores: tiers (A–F).
Multiple Hurdle process: must pass sequential stages to continue (application → test → interview …).
Compensatory Model: weighted predictors; a high score in one area offsets a low score in another (implemented via multiple regression; see the sketch below).
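A minimal compensatory-composite sketch, assuming two predictors and entirely hypothetical scores; regression weights let strength on one predictor offset weakness on the other:

```python
import numpy as np

# Hypothetical data: two predictor scores per applicant and a criterion rating.
X = np.array([[70, 55], [60, 72], [80, 65], [50, 90], [65, 60]], dtype=float)
y = np.array([75, 78, 82, 80, 70], dtype=float)

X1 = np.column_stack([np.ones(len(X)), X])        # add an intercept column
weights, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares regression weights
composite = X1 @ weights                          # weighted compensatory composite

print(np.round(weights, 3))    # intercept + one weight per predictor
print(np.round(composite, 1))  # applicants ranked by predicted criterion score
```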
Methods for Establishing Cut Scores
Angoff: SMEs judge the probability that a minimally competent candidate answers each item correctly; the item-level mean probabilities, summed across items, yield the cut score (sketch after this entry).
Pros: simple; Cons: low inter-rater reliability possible.
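A minimal Angoff sketch with entirely hypothetical SME ratings for a 10-item test; each cell is one SME's probability judgment for the minimally competent candidate:

```python
# Rows = SMEs, columns = items (all ratings assumed for illustration).
ratings = [
    [0.9, 0.8, 0.6, 0.7, 0.5, 0.9, 0.4, 0.6, 0.8, 0.7],
    [0.8, 0.7, 0.5, 0.8, 0.6, 0.9, 0.5, 0.5, 0.7, 0.6],
    [0.9, 0.9, 0.6, 0.6, 0.4, 0.8, 0.5, 0.7, 0.8, 0.7],
]
item_means = [sum(col) / len(col) for col in zip(*ratings)]  # average across SMEs
cut_score = sum(item_means)  # expected raw score of a minimally competent candidate
print(round(cut_score, 1))   # ≈ 6.8 of 10 items
```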
Known/Contrasting Groups: test administered to groups already known to pass/fail; cut = score at the intersection of the two distributions (sketch below).
Sensitive to group definition choices.
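A contrasting-groups sketch with hypothetical score samples; with roughly equal spreads, the two distributions' densities cross near the midpoint of the group means (a simplifying assumption, not the general case):

```python
def contrasting_groups_cut(fail_scores, pass_scores):
    """Place the cut where the two group distributions intersect;
    with equal variances this is the midpoint of the two means."""
    m_fail = sum(fail_scores) / len(fail_scores)
    m_pass = sum(pass_scores) / len(pass_scores)
    return (m_fail + m_pass) / 2

# Hypothetical scores from groups already classified as failing / passing:
fail_scores = [38, 41, 44, 45, 47, 50, 52]
pass_scores = [55, 58, 60, 62, 64, 67, 70]
print(round(contrasting_groups_cut(fail_scores, pass_scores), 1))  # ≈ 53.8
```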
IRT-Based
Item-Mapping: SMEs review histogram columns of equal item difficulty.
Bookmark: SMEs place a bookmark in an ordered item booklet at the point where a minimally competent examinee would answer correctly 50 % of the time (sketch below).
Advantages: ties cut to item difficulty, not raw % correct.
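A bookmark sketch under a Rasch (1-PL) model, with assumed item difficulties and an assumed minimally-competent ability (theta_mc); the bookmark goes after the last ordered item that examinee clears with probability ≥ .50:

```python
import math

def rasch_p(theta, b):
    """Rasch (1-PL) probability of a correct response."""
    return 1 / (1 + math.exp(-(theta - b)))

item_difficulties = sorted([-1.8, -1.2, -0.6, -0.1, 0.3, 0.8, 1.4, 2.0])  # easiest → hardest
theta_mc = 0.5  # assumed ability of the minimally competent examinee

bookmark = max(i for i, b in enumerate(item_difficulties)
               if rasch_p(theta_mc, b) >= 0.50)
print(f"bookmark after item {bookmark + 1}")  # item 5: last with b <= theta_mc
```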
Additional historical methods: Predictive Yield (Thorndike), Decision-Theoretic approaches, discriminant-function analysis.
Practical Issues in Utility Studies
Applicant Pool Size & Quality: many models assume limitless applicants & 100 % offer acceptance → real-world overestimation.
Empirical adjustment: reduce projected gains by up to 80 % (Murphy, 1986); see the sketch after this list.
Job Complexity: higher complexity → wider SD of performance, affects SD_y and utility.
Base Rates: at extreme values (very high or very low) a test adds little incremental accuracy (also illustrated below).
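Two quick checks on these cautions, reusing the FERT gain and the taylor_russell_sim sketch from the Expectancy Data section (the 80 % haircut is Murphy's upper bound; the base-rate values are assumed):

```python
# Murphy (1986) upper-bound haircut applied to the FERT projection:
print(round(300_000 * (1 - 0.80)))  # 60000 — a lower-bound utility gain

# Base-rate effect: the same validity and selection ratio buy far less
# incremental accuracy when the base rate is already extreme.
print(round(taylor_russell_sim(0.55, 0.20, 0.95), 2))  # ≈ 1.00, tiny gain over .95
print(round(taylor_russell_sim(0.55, 0.20, 0.60), 2))  # ≈ 0.88, large gain over .60
```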
Real-World Illustration: Police Body Cameras (Ariel et al., 2015)
Randomized controlled trial: 988 shifts, Rialto CA.
“Camera” shifts vs “No-Camera” shifts.
Results: use-of-force incidents ↓ by > 50 %; citizen complaints ↓ dramatically.
Demonstrates high diagnostic/treatment utility of BWC technology despite high initial procurement costs.
Ethical, Philosophical & Practical Implications
Misuse of utility arguments can lead to discriminatory or unsafe practices (e.g. dropping assessment to save money).
Utility is not purely monetary: social justice, safety, individual rights, and morale weigh in.
Decision-makers must integrate psychometrics with prudence, vision, common sense.
Key Formulas & Numerical References
Taylor–Russell example: base rate $=.60$, selection ratio $=.20$, validity $=.55$ → projected success $=.88$.
ROI example: \text{ROI} = \frac{\$300{,}000}{\$24{,}000} = 12.5:1.
Selection ratios illustrated: SR = .95, .517, .10 for cut scores 18, 48, 80 respectively.
Vocabulary Quick-Reference
Utility, Utility Analysis, Utility Gain, ROI.
Costs vs Benefits.
Psychometric Soundness.
Cut Score (fixed, relative, multiple, hurdle).
Angoff, Known-Groups, Bookmark, Item-Mapping.
Brogden–Cronbach–Gleser formula.
Decision-theory: TP, FP, FN, TN; Sensitivity, Specificity.
Compensatory vs Multiple-Hurdle selection.
Connections & Foundational Principles
Reliability → ceiling on validity; validity often correlates with utility but context matters.
Concepts integrate with earlier chapters on criterion-related validity, expectancy data, selection ratio, base rate.
Utility perspective extends psychometrics from “measurement quality” to organizational & societal impact.
Study Prompts
Compute utility gain given new r_{xy}, SD_y, selection ratio.
Contrast Angoff & Bookmark in settings where item difficulty varies widely.
Debate ethical boundaries: when is a false-negative more acceptable than a false-positive?
Design a multiple-hurdle hiring system for airline pilots incorporating both fixed and compensatory elements.