Accuracy, Bias, and Communication in Forensic Risk Assessment

Defining and Measuring Risk Assessment Accuracy

Accurate Assessments: Accuracy occurs when a prediction matches the subsequent outcome. There are two forms of accuracy:
- Positive Accuracy: Predicting that an individual will be reconvicted, and they do indeed go on to commit a new offense and receive a new conviction.
- Negative Accuracy: Predicting that an individual will not be reconvicted, and they survive for a period (e.g., 2 to 5 years) in the community without further recorded offending.

Inaccurate Assessments: Inaccuracy occurs when the prediction and outcome do not align:
- Predicting no criminal behavior/conviction when the individual is actually reconvicted.
- Predicting reconviction when the individual does not reoffend.

Quantifying Accuracy: Simply stating that a measure is 50% or 95% accurate is often meaningless without specifying what that percentage signifies. There is no single percentage used to measure accuracy; rather, researchers use a variety of metrics.

The Two-by-Two Outcome Grid in Prediction

The outcomes of risk predictions are categorized into a four-cell matrix based on the relationship between the prediction and the actual outcome:
- True Positive: A person is predicted to be reconvicted and is subsequently reconvicted. The "positive" refers to the occurrence of the event, and "true" refers to the correct prediction.
- False Positive: A person is predicted to be reconvicted but is not. This represents an overestimation of risk.
  Impact: These errors can lead to unjustified reductions in civil liberties and human rights (e.g., harsher sentences or stricter conditions).
- False Negative: A person is predicted not to be reconvicted but does commit an offense.
  Impact: These cases are highly visible and are typically the ones that "hit the front pages of the newspaper."
- True Negative: A person is predicted not to reoffend and indeed does not reoffend.

The Complexity of Uncertainty: In practice, risk assessment usually provides an estimate of uncertainty rather than a simple binary "yes/no." However, a two-by-two grid is used when applying a "cutoff" score (e.g., individuals above a certain score are deemed likely to offend).

Case Study: Gender as a Crude Risk Assessment Measure

The Scenario: Gender is used as the sole risk assessment metric: all men are predicted as "high risk" (expected to reoffend) and all women are predicted as "low risk" (not expected to reoffend).

The Data (Based on Real Numbers):
- Men in Sample: 454
  - Recidivists: 109
  - Non-recidivists: 345
- Women in Sample: 62
  - Recidivists: 3
  - Non-recidivists: 59

Success of the Measure:
- Recidivism Rates: The rate for men (24%) was about five times the rate for women (5%).
- Identifying Recidivists: Of the total 112 people who reoffended in the sample, the crude measure correctly identified 109 of them (97%).
- Conclusion on Accuracy: From a detection standpoint, the measure appears to justify prioritizing services or longer prison stays for men.

The Flip Side (Limitations):
- Overall Accuracy Rate: When combining true positives and true negatives, the prediction was correct for only about one-third of the total sample.
- False Positives: 345 men were predicted to reoffend but did not. This is a massive overestimation of risk.
- Consequences: This policy would be extremely expensive financially (e.g., keeping more people in prison) and ethically problematic due to unnecessary reductions in liberty.

Statistical Discrimination: Area Under the Curve (AUC)

Limitations of the 2x2 Grid: Real risk assessment uses scores (e.g., 7 out of 10 vs. 5 out of 10) rather than simple categories. We need to know if higher scores meaningfully discriminate between recidivists and non-recidivists.

AUC Definition: The Area Under the Curve (AUC) is essentially a statistical average of the two-by-two grid calculated for every possible cutoff score on a risk measure.

AUC Interpretation:
- The score ranges from 0 to 1.
- 0.5 indicates accuracy equivalent to a coin flip (chance).
- The value represents the probability that a randomly selected recidivist will have a higher risk score than a randomly selected non-recidivist.

Reality of Accuracy Scores:
- Large meta-analyses (covering sexual offending, violence, and general offending) show that typical risk measures score in the realm of 0.65 to 0.75.
- A score around 0.7 is considered to have moderate to high accuracy.

Why Accuracy Isn't Higher: Human behavior is complex and influenced by individual factors, societal structures, law enforcement policies, and the fact that most offending goes undetected. Predicting behavior is inherently difficult, but being above chance allows these tools to inform policy and resource allocation.

Bias and Ethnicity in Risk Assessment

The Racism Question: A major debate focuses on whether risk assessment algorithms are racist.

Measuring Bias: It is important to distinguish between differences in scores and differences in predictive validity:
- Higher Scores: A specific group (e.g., Indigenous populations) might score higher on a measure, but this may reflect historical factors or systemic discrimination in society (e.g., over-policing) rather than a bias in the measure itself.
- Predictive Validity: The real question is whether the connection between scores and outcomes is the same across groups. For instance, does the AUC score remain consistent for both Indigenous and non-Indigenous populations?

Research Findings: A meta-analysis comparing Indigenous and non-Indigenous groups found:
- Consistent moderate predictive validity across both groups for most measures.
- Static Measures: Historical, unchangeable factors showed more potential for bias because they reflect the enforcement environment the individual lived in.

Abandoning Tools vs. Improvement: Experts argue against abandoning these tools unless they are replaced by a system with superior accuracy. Without algorithms, the system falls back on professional clinical judgment, which research shows is even more biased and harder to quantify or test.

Best Practices and Recommendations

Practice Recommendations:
- Avoid Sole Reliance: Decisions should never be based solely on a score-based classification.
- Prioritize Dynamic Measures: Use measures that account for changeable aspects of a person's life to reduce the impact of historical bias.
- Cultural Competence: Increase training for assessors to understand and address human biases.

Research Recommendations:
- Evaluate all risk measures to determine if they predict equally well across racial, ethnic, and gender groups.
- Study how risk instruments are actually applied in decision-making contexts to see if they reduce or increase disparities.
- Investigate how the ethnicity of the rater (the person performing the assessment) affects bias.

Methods of Risk Communication

Categorical Labels (Low, Moderate, High):
- Subjectivity: These are interpreted differently by different people.
- Survey Findings: In a student survey, "Low" risk was estimated on average as an approximately 19% chance of reoffending, "Moderate" as approximately 50%, and "High" as approximately 80%.
- Inconsistency: Different measures use different numbers of categories (e.g., some have 3, some have 5), meaning a "high" label can mean different things across tools.

Absolute Recidivism Estimates: Providing an exact percentage chance of reoffending (e.g., 40%).
- Pros: Easy for interpretive/threshold decisions (e.g., laws requiring an "imminent" risk threshold).
- Cons: Can be misleading as it refers to a group property (sharing characteristics with a previously researched sample) rather than a precise individual prediction; not consistent across different jurisdictions (e.g., NZ vs. Canada).

Relative Risk Measures: Expressing risk as a ratio (e.g., "three times as likely as the average offender").
- Pros: More reliable across different geographical samples and jurisdictions.
- Cons: Often misinterpreted by those who struggle with fractions/proportions; meaningless without knowing the "base rate" of the average offender.

Percentile Ranks: Comparing an individual to others (e.g., "in the 50th percentile of risk").

Consensus on Communication: Experts suggest combining measures (a category label, an absolute recidivism estimate, and a relative risk ratio together) to mitigate the weaknesses of each individual method.

The Common Risk Language Initiative

Goal: To standardize risk communication across different measures and assessors using five levels (Level 1 through Level 5) instead of subjective labels.

Level 2 Example:
- Recidivism Estimate: Between 5% and 30%.
- Action: Monitoring for compliance; minimal intervention.
- Prognosis: Generally good; unlikely to be a career offender.

Level 5 Example:
- Characteristics: Multiple, chronic, severe, and entrenched criminal lifestyles.
- Recidivism Estimate: 85% or higher (virtually certain).
- Action: Intensive supervision and monitoring; approximately 300 hours of intensive therapy/intervention.

Remaining Challenges:
- It is difficult to compare different types of offending. For example, a Level 3 score for general offending may involve a 40% recidivism rate, but a Level 3 for sexual offending would necessarily be lower because sexual offending is a smaller component of overall crime.
- Potential for confusion remains for parole boards when an individual has different levels for different offense types (e.g., Level 4 for violence but Level 1 for sexual offending).
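
Worked Example: The Gender Case Study as a Two-by-Two Grid

The figures quoted in the gender case study can be recomputed from the four reported counts. The sketch below is illustrative only: the variable names are chosen for this example, and the AUC is computed from the pairwise definition given in the notes, with the added assumption (a common convention, not stated in the notes) that a tie between a recidivist and a non-recidivist counts as half a win.

```python
# Counts from the case study. The only prediction rule is:
# men -> "high risk" (predicted to reoffend), women -> "low risk".
men_recid, men_nonrecid = 109, 345
women_recid, women_nonrecid = 3, 59

tp = men_recid        # predicted to reoffend, did reoffend
fp = men_nonrecid     # predicted to reoffend, did not
fn = women_recid      # predicted not to reoffend, did reoffend
tn = women_nonrecid   # predicted not to reoffend, did not
total = tp + fp + fn + tn

sensitivity = tp / (tp + fn)           # share of recidivists detected
specificity = tn / (tn + fp)           # share of non-recidivists cleared
false_positive_rate = fp / (fp + tn)   # share of non-recidivists flagged
overall_accuracy = (tp + tn) / total   # share of all predictions correct

# AUC as the probability that a random recidivist outranks a random
# non-recidivist. With a binary "score", most pairs are ties, and a tie
# is counted as half a win (assumed convention).
pairs = (tp + fn) * (fp + tn)
wins = tp * tn                # male recidivist vs. female non-recidivist
ties = tp * fp + fn * tn      # same predicted category on both sides
auc = (wins + 0.5 * ties) / pairs

print(f"sensitivity         = {sensitivity:.3f}")
print(f"specificity         = {specificity:.3f}")
print(f"false positive rate = {false_positive_rate:.3f}")
print(f"overall accuracy    = {overall_accuracy:.3f}")
print(f"AUC                 = {auc:.3f}")
```

Under these assumptions the measure detects about 97% of recidivists, yet overall accuracy is about 33% and the AUC is about 0.56, only slightly better than a coin flip. A single headline percentage therefore says little until it is clear which cell of the grid it describes.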
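
Worked Example: Why a Relative Risk Ratio Needs a Base Rate

The point that a relative risk ratio is meaningless without the base rate can be illustrated with a few lines of arithmetic. This is a minimal sketch: the function name and the two base rates (5% and 20%) are hypothetical values chosen for illustration, not figures from any study or jurisdiction.

```python
def absolute_risk(relative_risk: float, base_rate: float) -> float:
    """Convert a relative risk ratio into an absolute recidivism estimate.

    base_rate is the recidivism rate of the "average offender" in the
    reference sample, expressed as a proportion between 0 and 1.
    """
    if not 0.0 <= base_rate <= 1.0:
        raise ValueError("base_rate must be a proportion between 0 and 1")
    if relative_risk < 0:
        raise ValueError("relative_risk must be non-negative")
    # A probability cannot exceed 1, so the product is capped.
    return min(1.0, relative_risk * base_rate)


# The same statement, "three times as likely as the average offender",
# under two hypothetical base rates.
for base_rate in (0.05, 0.20):
    estimate = absolute_risk(3.0, base_rate)
    print(f"base rate {base_rate:.0%} -> absolute estimate {estimate:.0%}")
```

With a 5% base rate, "three times as likely" corresponds to a 15% estimate; with a 20% base rate, the same phrase corresponds to 60%. This is one reason for reporting a category label, an absolute estimate, and a relative ratio together.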