Chapter 7 – Test Utility and Utility Analysis

Introduction: Everyday vs Psychometric “Utility”

  • Everyday meaning = “usefulness”; psychometrics = practical value of using a test, battery, training, or intervention to aid decision-making.

  • Representative questions utility seeks to answer:

    • How does Test A compare to Test B?

    • Does adding a test to a battery improve screening?

    • Will an admissions/personnel test select better applicants than supervisor judgment alone?

    • Does the test save time, money, or other resources?

Core Definition of Test Utility

  • A measure of efficiency gains when testing is implemented.

  • Applies to single instruments and full testing programs.

  • Utility judgments draw on:

    1. Reliability data

    2. Validity data

    3. Additional information (costs, benefits, logistics, ethics, etc.).

Factors Influencing a Test’s Utility

1. Psychometric Soundness

  • Reliability → sets ceiling for validity; validity (especially criterion-related) raises utility but does not guarantee it.

  • Example: A sweat-patch test for cocaine detection showed r = .92 agreement with urine tests when untampered, yet had low utility due to frequent patch tampering (Chawarski et al., 2007).

2. Costs (Economic & Non-Economic)

  • Direct: purchase, protocols, scoring software, staff time, facilities, insurance, legal, overhead.

  • Indirect: cost of not testing or of using an ineffective test (e.g. an airline stops assessing pilots → lawsuits, loss of confidence).

  • Noneconomic: harm, public safety, morale, ethics (e.g. missing child-abuse fractures due to fewer X-ray views).

3. Benefits (Economic & Non-Economic)

  • Economic: ↑ productivity, ↓ waste, ↑ profit, ROI.

  • Noneconomic (often convertible to economic benefits later): better work climate, fewer accidents, lower turnover, social safety from accurate involuntary-hospitalization decisions.

Utility Analysis: Concepts & Purposes

  • Family of cost-benefit techniques; guides choice among testing, training, interventions.

  • Typical decisions:

    • Choose Test A vs Test B vs no test.

    • Add/subtract tools in a battery.

    • Compare training programs or intervention elements.

  • End product = “educated decision” (optimal course of action).

Expectancy Data

  • Scatterplot → expectancy table showing probability of criterion success per predictor band.

  • Classic aids: Taylor–Russell tables & Naylor–Shine tables.

    • Inputs: validity ρ_{xy}; selection ratio; base rate.

    • Output: percentage of hires predicted successful after adding test.

    • Limitation: assumes linear predictor-criterion relation & clear pass/fail criterion.
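Taylor–Russell values come from published tables, but the logic behind a lookup can be sketched by simulation under the tables' bivariate-normal assumption. A minimal Monte Carlo illustration (the function name, sample size, and seed are illustrative, not from the chapter):

```python
import math
import random

def simulated_success_rate(validity, selection_ratio, base_rate,
                           n=200_000, seed=7):
    """Monte Carlo sketch of a Taylor-Russell lookup: sample correlated
    (predictor, criterion) pairs from a bivariate normal, hire the top
    `selection_ratio` on the predictor, and report the share of hires
    who clear the criterion cutoff implied by `base_rate`."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        # criterion correlated with the predictor at r = validity
        y = validity * x + math.sqrt(1 - validity ** 2) * rng.gauss(0, 1)
        pairs.append((x, y))
    xs = sorted(p[0] for p in pairs)
    ys = sorted(p[1] for p in pairs)
    x_cut = xs[int((1 - selection_ratio) * n)]   # hire this fraction
    y_cut = ys[int((1 - base_rate) * n)]         # success at this base rate
    hired = [p for p in pairs if p[0] >= x_cut]
    return sum(1 for p in hired if p[1] >= y_cut) / len(hired)

# The worked values below: base rate .60, selection ratio .20,
# validity .55 -> tabled success rate .88; the simulation lands nearby.
rate = simulated_success_rate(validity=0.55, selection_ratio=0.20, base_rate=0.60)
print(round(rate, 2))
```

The simulation makes the limitation above concrete: the whole calculation leans on the assumed linear (bivariate-normal) predictor–criterion relation and a single pass/fail criterion cutoff.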

Close-Up: Flecha Esmaralda Road Test (FERT)

  • Scenario: South-American courier hiring.

  • Existing policy = license + no criminal record → 50 % of hires rated “qualified.”

  • New on-road FERT studied (predictive validity r = .40).

  • Three illustrative cut scores:

    1. 18 (low)

      • Selection ratio = .95 (57/60 hired).

      • Positive Predictive Value (PPV) = .526.

      • False negatives 0 %, but the utility gain is trivial.

    2. 80 (high)

      • Selection ratio = .10 (6/60 hired).

      • PPV = 1.00; overall accuracy only 60 %.

      • Requires ≈ 600 applicants to yield 60 hires (costly recruitment).

    3. 48 (moderate) – chosen

      • Selection ratio = .517 (31/60 hired).

      • Miss rate ↓ from 50 % to 15 %.

      • PPV = .839.

      • Misclassifications cut from 30 to 9 drivers.

      • ROI computed with the BCG formula ≈ 12.5 : 1 (see below).

Brogden–Cronbach–Gleser (BCG) Utility Formula

  • Utility gain = (N)(T)(r_xy)(SD_y)(Z̄_m) − (N)(C)

  • Example values (FERT):

    • N = 60 drivers/yr; T = 1.5 yr tenure.

    • r_xy = .40; SD_y = $9,000 (≈ 40 % of salary).

    • Z̄_m = 1.0; C = $200/test → total test cost = $24,000.

    • Benefit = 60 × 1.5 × .40 × $9,000 × 1.0 = $324,000.

    • Utility gain = $324,000 − $24,000 = $300,000.

    • Each testing dollar returns > $12.50.

Productivity Variant

  • Productivity gain = (N)(T)(r_xy)(SD_p)(Z̄_m) − (N)(C)

  • SD_p = SD of performance in output units (not dollars).
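The BCG computation is plain arithmetic; a minimal sketch using the FERT figures (function and argument names are illustrative; the total test cost is passed directly as given in the example):

```python
def bcg_utility_gain(n_hired, tenure_yrs, validity, sd_y,
                     mean_z_selected, total_test_cost):
    """Brogden-Cronbach-Gleser utility gain in dollars:
    (N)(T)(r_xy)(SD_y)(Z-bar_m) - total testing cost."""
    benefit = n_hired * tenure_yrs * validity * sd_y * mean_z_selected
    return benefit - total_test_cost

# FERT figures: N = 60, T = 1.5 yr, r_xy = .40, SD_y = $9,000,
# Z-bar_m = 1.0, total test cost = $24,000
gain = bcg_utility_gain(60, 1.5, 0.40, 9_000, 1.0, 24_000)
print(gain)           # 300000.0 -> $300,000 utility gain
print(gain / 24_000)  # 12.5 -> each testing dollar returns > $12.50
```

The productivity variant is the same computation with SD_p (output units) substituted for SD_y.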

Decision Theory & Cut Scores

  • Four potential outcomes: True Positive, False Positive, False Negative, True Negative.

  • Trade-off managed via selection ratio & cut score.

  • Guidelines: set stricter cutoffs when false-positives are more costly (e.g. airline pilots).
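The four outcomes combine into the familiar screening statistics. A short sketch using counts consistent with the FERT moderate cut score of 48 (31 of 60 hired, 9 misclassifications; the individual cell counts are inferred from the reported PPV and miss rate, not stated in the text):

```python
def decision_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and overall accuracy from the
    four decision-theory outcomes."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # qualified correctly hired
        "specificity": tn / (tn + fp),   # unqualified correctly rejected
        "ppv": tp / (tp + fp),           # hires who turn out qualified
        "accuracy": (tp + tn) / total,
    }

# Inferred FERT cells: 26 true positives, 5 false positives,
# 4 false negatives, 25 true negatives (60 applicants total)
m = decision_metrics(tp=26, fp=5, fn=4, tn=25)
print(round(m["ppv"], 3))       # 0.839
print(round(m["accuracy"], 2))  # 0.85 -> 15 % miss rate
```

Raising the cut score trades false positives for false negatives; the guideline above amounts to moving the cut until the costlier cell is acceptably small.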

Cut-Score Taxonomy

  • Fixed/Absolute (criterion-referenced): e.g. driver’s road test.

  • Relative/Norm-referenced: top 10 % get A.

  • Multiple Cut Scores: tiers (A–F).

  • Multiple Hurdle process: must pass sequential stages to continue (application → test → interview …).

  • Compensatory Model: weighted predictors; high score in one area offsets low in another (implemented via multiple regression).
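The contrast between the two selection models can be sketched in a few lines (the applicant scores, stage cutoffs, and regression-style weights below are hypothetical):

```python
def multiple_hurdle(applicant, hurdles):
    """Sequential screen: failing any single stage eliminates the applicant."""
    return all(applicant[stage] >= cut for stage, cut in hurdles)

def compensatory_score(applicant, weights):
    """Weighted linear composite: a high score on one predictor can
    offset a low score on another."""
    return sum(w * applicant[k] for k, w in weights.items())

# Hypothetical applicant: weak test score, strong interview
applicant = {"application": 70, "test": 55, "interview": 82}
hurdles = [("application", 60), ("test", 60), ("interview", 70)]
weights = {"application": 0.2, "test": 0.5, "interview": 0.3}

passed = multiple_hurdle(applicant, hurdles)
composite = compensatory_score(applicant, weights)
print(passed)               # False: eliminated at the test stage
print(round(composite, 1))  # 66.1: the interview partly offsets the test
```

The same applicant can fail a hurdle system yet earn a competitive composite, which is why the choice of model is itself a utility decision.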

Methods for Establishing Cut Scores

  • Angoff: SMEs judge probability minimally competent candidate answers each item correctly; average probabilities = cut.

    • Pros: simple; Cons: low inter-rater reliability possible.

  • Known/Contrasting Groups: test administered to groups already known pass/fail; cut = score at intersection of distributions.

    • Sensitive to group definition choices.

  • IRT-Based

    • Item-Mapping: SMEs review histogram columns of equal item difficulty.

    • Bookmark: SMEs place bookmark in ordered item booklet at point minimally competent examinee would answer correctly 50 % of time.

    • Advantages: ties cut to item difficulty, not raw % correct.

  • Additional historical methods: Predictive Yield (Thorndike), Decision-Theoretic approaches, discriminant-function analysis.
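The Angoff and known/contrasting-groups procedures reduce to simple computations once the judgments or group scores are collected. A sketch with hypothetical inputs (the intersection of the two distributions is approximated here by the cut that minimizes total misclassifications):

```python
def angoff_cut(sme_item_probs):
    """Angoff: each SME estimates, per item, the probability that a
    minimally competent candidate answers correctly; the cut score is
    the mean of each SME's summed expected score."""
    totals = [sum(probs) for probs in sme_item_probs]
    return sum(totals) / len(totals)

def contrasting_groups_cut(pass_scores, fail_scores):
    """Known/contrasting groups: place the cut where the two score
    distributions cross, approximated by the integer score that
    minimizes total misclassifications."""
    candidates = range(int(min(fail_scores)), int(max(pass_scores)) + 1)
    def errors(cut):
        fn = sum(1 for s in pass_scores if s < cut)   # qualified rejected
        fp = sum(1 for s in fail_scores if s >= cut)  # unqualified accepted
        return fn + fp
    return min(candidates, key=errors)

# Hypothetical judgments: 3 SMEs x 5 items
sme_probs = [[0.8, 0.6, 0.7, 0.5, 0.9],
             [0.7, 0.5, 0.8, 0.6, 0.8],
             [0.9, 0.7, 0.6, 0.5, 0.9]]
cut = angoff_cut(sme_probs)
print(round(cut, 2))  # 3.5 (out of 5 items)

# Hypothetical scores from groups already known to pass or fail
group_cut = contrasting_groups_cut(pass_scores=[72, 80, 85, 90, 78],
                                   fail_scores=[55, 60, 64, 70, 58])
print(group_cut)  # 71: first score separating the two groups cleanly
```

Both functions make the listed weaknesses visible: Angoff's cut moves with every disagreement among SMEs, and the contrasting-groups cut moves with how the pass/fail groups were defined.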

Practical Issues in Utility Studies

  • Applicant Pool Size & Quality: many models assume limitless applicants & 100 % offer acceptance → real-world overestimation.

    • Empirical adjustment: reduce projected gains up to 80 % (Murphy 1986).

  • Job Complexity: higher complexity → wider SD of performance, which affects SD_y and utility.

  • Base Rates: at extreme values a test adds little incremental accuracy.

Real-World Illustration: Police Body Cameras (Ariel et al., 2015)

  • Randomized controlled trial: 988 shifts, Rialto, CA.

  • “Camera” shifts vs “No-Camera” shifts.

  • Results: use-of-force incidents ↓ by > 50 %; citizen complaints ↓ dramatically.

  • Demonstrates high diagnostic/treatment utility of BWC technology despite high initial procurement costs.

Ethical, Philosophical & Practical Implications

  • Misuse of utility arguments can lead to discriminatory or unsafe practices (e.g. dropping assessment to save money).

  • Utility not purely monetary: social justice, safety, individual rights, and morale weigh in.

  • Decision-makers must integrate psychometrics with prudence, vision, common sense.

Key Formulas & Numerical References

  • Taylor–Russell example: base rate = .60, selection ratio = .20, validity = .55 → projected success = .88.

  • ROI example: ROI = $300,000 / $24,000 = 12.5 : 1.

  • Selection ratios illustrated: SR = .95, .517, .10 for cut scores 18, 48, 80 respectively.

Vocabulary Quick-Reference

  • Utility, Utility Analysis, Utility Gain, ROI.

  • Costs vs Benefits.

  • Psychometric Soundness.

  • Cut Score (fixed, relative, multiple, hurdle).

  • Angoff, Known-Groups, Bookmark, Item-Mapping.

  • Brogden–Cronbach–Gleser formula.

  • Decision-theory: TP, FP, FN, TN; Sensitivity, Specificity.

  • Compensatory vs Multiple-Hurdle selection.

Connections & Foundational Principles

  • Reliability → ceiling on validity; validity often correlates with utility but context matters.

  • Concepts integrate with earlier chapters on criterion-related validity, expectancy data, selection ratio, base rate.

  • Utility perspective extends psychometrics from “measurement quality” to organizational & societal impact.

Study Prompts

  • Compute utility gain given new r_xy, SD_y, selection ratio.

  • Contrast Angoff & Bookmark in settings where item difficulty varies widely.

  • Debate ethical boundaries: when is a false-negative more acceptable than a false-positive?

  • Design a multiple-hurdle hiring system for airline pilots incorporating both fixed and compensatory elements.