Everyday meaning = “usefulness”; in psychometrics, utility = the practical value of using a test, battery, training program, or intervention to aid decision-making.
Representative questions utility seeks to answer:
How does Test A compare to Test B?
Does adding a test to a battery improve screening?
Will an admissions/personnel test select better applicants than supervisor judgment alone?
Does the test save time, money, or other resources?
A measure of efficiency gains when testing is implemented.
Applies to single instruments and full testing programs.
Utility judgments draw on:
Reliability data
Validity data
Additional information (costs, benefits, logistics, ethics, etc.).
Reliability → sets the ceiling for validity; validity (especially criterion-related) → supports utility but does not guarantee it.
Example: a sweat patch for cocaine detection showed r = .92 agreement with urine tests when left intact, yet had low utility because patches were frequently tampered with (Chawarski et al., 2007).
Direct: purchase, protocols, scoring software, staff time, facilities, insurance, legal, overhead.
Indirect: cost of not testing or of using an ineffective test (e.g., an airline stops assessing pilots → lawsuits, loss of confidence).
Noneconomic: harm, public safety, morale, ethics (e.g., missing child-abuse fractures because fewer X-ray views are taken).
Economic: ↑ productivity, ↓ waste, ↑ profit, ROI.
Noneconomic (often convert to economic benefits later): better work climate, fewer accidents, lower turnover, societal safety from accurate involuntary-hospitalization decisions.
Family of cost-benefit techniques; guides choice among testing, training, interventions.
Typical decisions:
Choose Test A vs Test B vs no test.
Add/subtract tools in a battery.
Compare training programs or intervention elements.
End product = “educated decision” (optimal course of action).
Scatterplot → expectancy table showing probability of criterion success per predictor band.
Classic aids: Taylor–Russell tables & Naylor–Shine tables.
Inputs: validity (r_{xy}), selection ratio, base rate.
Output: percentage of hires predicted to be successful after adding the test.
Limitation: assumes a linear predictor–criterion relation and a criterion dichotomized into pass/fail (a computational sketch follows below).
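Where the printed tables are unavailable, the tabled value can be approximated directly. A minimal sketch, assuming the bivariate normality that underlies the Taylor–Russell tables (function name and the scipy approach are illustrative, not from the source):

```python
# Approximating a Taylor-Russell table entry under bivariate normality.
from scipy.stats import norm, multivariate_normal

def taylor_russell(validity: float, selection_ratio: float, base_rate: float) -> float:
    """Proportion of selected applicants expected to succeed on the criterion."""
    x_cut = norm.ppf(1 - selection_ratio)   # predictor cut score in z units
    y_cut = norm.ppf(1 - base_rate)         # criterion "success" threshold in z units
    bvn = multivariate_normal(mean=[0, 0], cov=[[1, validity], [validity, 1]])
    # P(X > x_cut and Y > y_cut) via inclusion-exclusion on the joint CDF
    p_both = 1 - norm.cdf(x_cut) - norm.cdf(y_cut) + bvn.cdf([x_cut, y_cut])
    return p_both / selection_ratio         # P(success | selected)

# Reproduces the worked example later in these notes:
# base rate .60, selection ratio .20, validity .55
print(round(taylor_russell(0.55, 0.20, 0.60), 2))   # ~0.88
```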
Scenario: hiring drivers for a South American courier company.
Existing policy = license + no criminal record → 50 % of hires rated “qualified.”
A new on-road test (FERT) is studied (predictive validity r = .40).
Three illustrative cut scores:
18 (low)
Selection ratio = .95 (57/60 hired).
Positive predictive value (PPV) = .526.
False negatives = 0 %, but the utility gain is trivial.
80 (high)
Selection ratio = .10 (6/60 hired).
PPV = 1.00, but overall accuracy is only 60 %.
Requires ≈ 600 applicants to yield 60 hires (costly recruitment).
48 (moderate) – chosen
Selection ratio = .517 (31/60 hired).
Miss rate ↓ from 50 % to ≈ 15 %.
PPV = .839.
Misclassifications cut from 30 to 9 drivers (verified in the sketch after this list).
ROI computed with the Brogden–Cronbach–Gleser (BCG) formula ≈ 12.5 : 1 (see below).
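The cut-48 figures can be reproduced from a 2 × 2 decision table. A minimal check; the four cell counts are back-solved from the stated PPV (.839), 31 hires, and 9 total misclassifications, so treat them as a reconstruction rather than source data:

```python
# Reconstructed classification counts at cut score 48 (of 60 applicants,
# 30 truly qualified given the 50% base rate).
tp, fp, fn, tn = 26, 5, 4, 25

hired = tp + fp
total = tp + fp + fn + tn
selection_ratio = hired / total
ppv = tp / hired                       # P(qualified | hired)
accuracy = (tp + tn) / total
miss_rate_among_hires = fp / hired     # unqualified drivers who slip through

print(round(selection_ratio, 3))        # 0.517
print(round(ppv, 3))                    # 0.839
print(accuracy)                         # 0.85
print(round(miss_rate_among_hires, 3))  # 0.161 (~15-16%, down from 50%)
```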
\text{Utility Gain} = (N)(T)(r_{xy})(SD_y)(\bar{Z}_m) - (N)(C)
Example values (FERT):
N = 60 drivers hired/yr, T = 1.5 yr average tenure,
r_{xy} = 0.40, SD_y = \$9,000 (≈ 40 % of salary),
\bar{Z}_m = 1.0, C = \$200/test; ≈ 120 applicants tested → total testing cost = \$24,000.
\text{Benefit} = 60 \times 1.5 \times 0.40 \times \$9{,}000 \times 1.0 = \$324{,}000.
\text{Utility Gain} = \$324{,}000 - \$24{,}000 = \$300{,}000.
Each testing dollar returns >\$12.50.
\text{Productivity Gain} = (N)(T)(r_{xy})(SD_p)(\bar{Z}_m) - (N)(C)
SD_p = SD of output in production units (not dollars); the dollar version is checked in the sketch below.
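A minimal sketch checking the FERT arithmetic above. Function and variable names are illustrative; the 120 applicants tested is inferred from the \$24,000 total cost at \$200 per test:

```python
# Brogden-Cronbach-Gleser utility gain with the FERT values.
def bcg_utility_gain(n_hired, tenure_yrs, validity, sd_y, mean_z_hired,
                     n_tested, cost_per_test):
    """Dollar gain = N * T * r_xy * SD_y * Z-bar_m  -  (applicants tested) * C."""
    benefit = n_hired * tenure_yrs * validity * sd_y * mean_z_hired
    cost = n_tested * cost_per_test
    return benefit - cost, cost

gain, cost = bcg_utility_gain(60, 1.5, 0.40, 9_000, 1.0, 120, 200)
print(gain)         # 300000.0
print(gain / cost)  # 12.5 -> the 12.5:1 ROI
```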
Four potential outcomes: True Positive, False Positive, False Negative, True Negative.
Trade-off managed via selection ratio & cut score.
Guidelines: set stricter cut scores when false positives are more costly (e.g., airline pilots); a cost comparison is sketched below.
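A toy illustration of that guideline, reusing the FERT counts reconstructed above; the per-error and per-applicant dollar figures are invented purely for illustration:

```python
# Pick the cut score that minimizes total expected cost.
CUTS = {           # cut -> (false positives, false negatives, applicants needed)
    18: (27, 0, 63),     # SR .95  -> 60/.95  ~ 63 applicants
    48: (5, 4, 116),     # SR .517 -> 60/.517 ~ 116 applicants
    80: (0, 24, 600),    # SR .10  -> 60/.10  = 600 applicants
}
COST_FP, COST_FN, COST_RECRUIT = 50_000, 5_000, 500   # hypothetical dollars

for cut, (fp, fn, pool) in CUTS.items():
    total = fp * COST_FP + fn * COST_FN + pool * COST_RECRUIT
    print(cut, total)
# prints: 18 1381500 / 48 328000 / 80 420000
# Weighting false positives heavily rules out the lenient cut; adding
# recruitment cost is what makes the moderate cut beat the strictest one.
```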
Fixed/Absolute (criterion-referenced): e.g., a driver’s road test.
Relative/Norm-referenced: top 10 % get A.
Multiple Cut Scores: tiers (A–F).
Multiple Hurdle process: must pass sequential stages to continue (application → test → interview …).
Compensatory Model: weighted predictors; a high score in one area offsets a low score in another (typically implemented via multiple regression; see the sketch below).
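A minimal sketch of a compensatory composite, assuming regression-derived weights; the predictor names, weights, and scores are illustrative only:

```python
import numpy as np

weights = np.array([0.5, 0.3, 0.2])        # e.g., road test, interview, references
applicant_z = np.array([1.2, -0.8, 0.4])   # standardized predictor scores

composite = weights @ applicant_z          # a high score offsets a low one
print(round(float(composite), 3))          # 0.44
```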
Angoff: subject-matter experts (SMEs) judge the probability that a minimally competent candidate answers each item correctly; the judged probabilities are averaged across SMEs and summed across items to set the cut (sketched below).
Pros: simple. Cons: inter-rater reliability can be low.
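A minimal sketch of the Angoff arithmetic; all ratings below are invented for illustration:

```python
import numpy as np

# rows = SMEs, columns = items (5 raters x 4 items); each entry is the judged
# probability that a minimally competent candidate answers that item correctly
ratings = np.array([
    [0.9, 0.6, 0.4, 0.7],
    [0.8, 0.7, 0.5, 0.6],
    [0.9, 0.5, 0.3, 0.8],
    [0.7, 0.6, 0.4, 0.7],
    [0.8, 0.6, 0.4, 0.7],
])
# average across raters per item, then sum across items:
cut_score = ratings.mean(axis=0).sum()   # expected raw score of a borderline candidate
print(round(cut_score, 2))               # 2.52 of 4 items
```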
Known/Contrasting Groups: the test is administered to groups already known to pass or fail; the cut = the score at the intersection of the two distributions (sketched below).
Sensitive to how the groups are defined.
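A minimal sketch of the known-groups cut, assuming both groups score roughly normally with equal spread; the group means are invented for illustration:

```python
# With equal-variance normal distributions, the two densities intersect at
# the midpoint of the group means, which serves as the cut score.
pass_mean, fail_mean = 72.0, 58.0   # means of known-pass / known-fail groups
cut = (pass_mean + fail_mean) / 2
print(cut)                          # 65.0
```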
IRT-Based
Item-Mapping: SMEs review histogram columns of equal item difficulty.
Bookmark: SMEs place a bookmark in a difficulty-ordered item booklet at the point where a minimally competent examinee would answer correctly 50 % of the time.
Advantage: ties the cut to item difficulty rather than raw percent correct (see the Rasch sketch below).
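A minimal sketch of the bookmark logic under a Rasch model, where an examinee at ability theta answers an item of difficulty b correctly with probability 1 / (1 + exp(-(theta - b))), i.e., exactly .50 when theta = b. Item difficulties and the bookmark position are invented for illustration:

```python
import math

difficulties = sorted([-1.4, -0.6, 0.1, 0.8, 1.5])   # ordered item booklet
bookmark_index = 2   # SMEs judge item 3 the last one a borderline examinee
                     # would answer correctly 50% of the time (RP50)
cut_theta = difficulties[bookmark_index]   # cut on the ability scale, not raw %
print(cut_theta)                           # 0.1

def p_correct(theta: float, b: float) -> float:
    """Rasch probability of a correct response."""
    return 1 / (1 + math.exp(-(theta - b)))

print(round(p_correct(cut_theta, difficulties[bookmark_index]), 2))  # 0.5
```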
Additional historical methods: Predictive Yield (Thorndike), Decision-Theoretic approaches, discriminant-function analysis.
Applicant Pool Size & Quality: many utility models assume a limitless applicant pool and 100 % offer acceptance → real-world gains are overestimated.
Empirical adjustment: projected gains may need to be reduced by as much as 80 % (Murphy, 1986).
Job Complexity: higher complexity → wider SD of performance, affects SD_y and utility.
Base Rates: at extreme values a test adds little incremental accuracy.
Randomized controlled trial: 988 shifts, Rialto CA.
“Camera” shifts vs “No-Camera” shifts.
Results: use-of-force incidents ↓ by > 50 %; citizen complaints ↓ dramatically.
Demonstrates the high utility of BWC technology despite high initial procurement costs.
Misuse of utility arguments can lead to discriminatory or unsafe practices (e.g., dropping assessment to save money).
Utility not purely monetary: social justice, safety, individual rights, and morale weigh in.
Decision-makers must integrate psychometrics with prudence, vision, and common sense.
Taylor–Russell example: base rate = .60, selection ratio = .20, validity = .55 → projected success = .88.
ROI example: \text{ROI}=\frac{\$300,000}{\$24,000}=12.5:1.
Selection ratios illustrated: SR = .95, .517, .10 for cut scores 18, 48, 80, respectively.
Utility, Utility Analysis, Utility Gain, ROI.
Costs vs Benefits.
Psychometric Soundness.
Cut Score (fixed, relative, multiple, hurdle).
Angoff, Known-Groups, Bookmark, Item-Mapping.
Brogden–Cronbach–Gleser formula.
Decision-theory: TP, FP, FN, TN; Sensitivity, Specificity.
Compensatory vs Multiple-Hurdle selection.
Reliability → ceiling on validity; validity often correlates with utility but context matters.
Concepts integrate with earlier chapters on criterion-related validity, expectancy data, selection ratio, base rate.
Utility perspective extends psychometrics from “measurement quality” to organizational & societal impact.
Compute utility gain given new r_{xy}, SD_y, and selection ratio.
Contrast Angoff & Bookmark in settings where item difficulty varies widely.
Debate ethical boundaries: when is a false-negative more acceptable than a false-positive?
Design a multiple-hurdle hiring system for airline pilots incorporating both fixed and compensatory elements.