Test Validity in Psychological Assessment

Core aim: understand test validity—how well a test measures what it claims to measure.
Learning outcomes covered:
1. Define validity and explain its importance.
2. Describe the relationship between validity and reliability.
3. Identify, contrast, and illustrate the four major forms of validity evidence.

Short definition: Degree to which a test actually measures the construct it purports to measure.
Illustration:
• Valid example ➜ Ishihara color-blindness plates correctly distinguish red–green color vision.
• Invalid example ➜ “Big-toe-length test” for color-blindness. Appears unrelated; would never convince the scientific community.

Psychological tests often target latent constructs (anxiety, empathy, intelligence, etc.) that cannot be observed directly.
Acceptance of a test’s validity can depend on whether one accepts the theoretical existence of the construct (e.g., Freudian “anal-retentive personality” vs cognitive-psychology perspective).
Validity is therefore:
• Evidence-based, yet
• Theory-laden and never 100% proven: it stays open to revision with new data or new theoretical lenses.

Reliability = consistency / lack of measurement error.
Relationship:
• A test must be reliable to be valid.
• Reliability is necessary but not sufficient.
• Example: consistent toe-length readings → reliable but still not a valid color-blindness test.

Criteria Binet used (early 20th-century Paris):
1. Children judged “bright” by teachers should score higher than those judged “dull.”
2. Older children should outperform younger children (assumes intelligence grows with age).
Only items satisfying both criteria were kept, letting data guide item selection—an early, pragmatic approach to establishing validity.
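The two-criterion screening rule can be sketched in code. This is a minimal illustration, not Binet's actual procedure or data: the item names, pass rates, `keep_item` helper, and `min_gap` threshold are all hypothetical, and "scores higher" is simplified to a comparison of group pass rates.

```python
# Hypothetical pass rates (proportion of children answering correctly).
# None of these numbers come from Binet's work; they only illustrate the rule.
items = {
    "repeat_digits": {"bright": 0.80, "dull": 0.45, "older": 0.85, "younger": 0.50},
    "name_colours": {"bright": 0.70, "dull": 0.68, "older": 0.90, "younger": 0.40},
    "copy_square": {"bright": 0.75, "dull": 0.40, "older": 0.55, "younger": 0.60},
}


def keep_item(rates, min_gap=0.10):
    """Keep an item only if it satisfies BOTH criteria by at least min_gap.

    min_gap is an assumed threshold, added so that trivial differences
    do not count as one group outperforming the other.
    """
    teacher_criterion = rates["bright"] - rates["dull"] >= min_gap
    age_criterion = rates["older"] - rates["younger"] >= min_gap
    return teacher_criterion and age_criterion


kept = [name for name, rates in items.items() if keep_item(rates)]
print(kept)
```

Here `name_colours` fails the teacher-judgement criterion and `copy_square` fails the age criterion, so only `repeat_digits` survives the screen.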

(You must memorise and be able to contrast the four forms of validity evidence: face, content, predictive, and construct.)

Face validity question: “Does the test look like it measures the construct to the test-taker?”
Examples:
• Traditional spelling test (spell “assessment”) → high face validity.
• “List your favourite TV shows” as a spelling test → low face validity.
• Myers–Briggs ➜ looks personality-related (high face validity) yet has weak scientific support.
• Finger-tapping speed to assess concussion ➜ poor face validity but strong empirical validity.
Importance: Enhances credibility and motivation for examinees, yet provides the weakest scientific evidence.

Content validity question: “Does the test cover all facets of the construct?”
Many constructs are multidimensional. Missing dimensions → compromised content validity.
Clinical illustration: Generalised Anxiety Disorder (GAD). The DSM lists at least eight symptom clusters (anxiety, uncontrollable worry, restlessness, fatigue, concentration problems, irritability, muscle tension, sleep disturbance). A good GAD scale samples all clusters.
Unit illustration: PSY2041 performance comprises Theoretical Basics, Application Series, Statistics. An exam assessing only statistics would lack content validity.
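The GAD illustration above can be turned into a simple coverage check. The sketch below assumes a hypothetical draft scale whose items have already been tagged with the cluster they sample; the item wordings and the `missing_clusters` helper are invented for illustration and do not come from any published instrument.

```python
# The eight GAD symptom clusters listed in the notes above.
GAD_CLUSTERS = {
    "anxiety", "uncontrollable worry", "restlessness", "fatigue",
    "concentration problems", "irritability", "muscle tension",
    "sleep disturbance",
}

# Hypothetical draft scale: each item is tagged with the cluster it samples.
draft_scale = {
    "I feel nervous most days": "anxiety",
    "I cannot stop worrying": "uncontrollable worry",
    "I find it hard to sit still": "restlessness",
    "I tire easily": "fatigue",
    "I snap at people": "irritability",
    "I have trouble falling asleep": "sleep disturbance",
}


def missing_clusters(scale, clusters):
    """Return the clusters that no item in the scale samples, sorted by name."""
    covered = set(scale.values())
    return sorted(clusters - covered)


print(missing_clusters(draft_scale, GAD_CLUSTERS))
```

Any cluster reported as missing marks a gap in content validity. Note that the check only confirms each facet is sampled at all, not that it is sampled well.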
Textbook caveat: The text blurs face and content validity—keep them conceptually distinct.

Core question (predictive validity): “Do test scores predict real-world outcomes (criteria) that represent the same construct?”
Technical framing: If $X$ = test score and $Y$ = criterion score, does $X$ significantly predict $Y$ (assessed by regression or correlation)?
Key examples:
• Driving test → predicts future on-road safety.
• Marital-satisfaction scale → predicts divorce rates 12 months later.
• ATAR → predicts university GPA.
• New anxiety inventory → correlates with clinician ratings obtained concurrently.
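The technical framing above (test score $X$, criterion score $Y$) is often summarised as a validity coefficient: the Pearson correlation between $X$ and $Y$. A minimal sketch follows, assuming made-up ATAR and GPA values and a hand-written `pearson_r` helper; neither comes from the lecture.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: X = ATAR at Time 1, Y = first-year GPA at Time 2.
atar = [62.0, 70.5, 78.0, 84.5, 90.0, 95.5]
gpa = [1.9, 2.4, 2.2, 3.0, 3.1, 3.6]

validity_coefficient = pearson_r(atar, gpa)
print(round(validity_coefficient, 2))
```

A coefficient near 1 would support the claim that the test predicts the criterion. Whether it is statistically significant depends on sample size, which this sketch does not test.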
Concurrent vs. Predictive (timing nuance): Some authors reserve “predictive” for Time-1 ➜ Time-2 designs and use “concurrent” when test and criterion are measured at the same time. The lecturer treats both under “predictive validity.”
Choosing a criterion:
1. Must itself be reliable.
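2. Must be theoretically appropriate in context (where divorce was illegal or heavily restricted, as in many places in the 1950s, divorce rate is a weak criterion for marital satisfaction).
3. Must be uncontaminated (avoid item overlap). E.g., rewording the Beck Depression Inventory (BDI) with a thesaurus and then correlating the result with the BDI would artificially inflate validity.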
Known-Groups Validity (special case): Criterion = group membership. Example: Chronic-pain patients vs healthy controls. A valid pain-severity scale should yield higher scores in the patient group.
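A known-groups comparison can be sketched as a difference in group means, expressed as a standardised effect size. The scores below are hypothetical, and Cohen's d with a pooled standard deviation is one common choice of effect size rather than something prescribed by the lecture.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardised mean difference (group_a minus group_b), pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical pain-severity scores on a 0-10 scale.
patients = [7, 8, 6, 9, 7, 8]
controls = [2, 3, 1, 2, 4, 3]

d = cohens_d(patients, controls)
print(mean(patients) > mean(controls), round(d, 2))
```

A large positive d in the expected direction counts as known-groups evidence; a d near zero would suggest the scale does not separate the groups it should.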

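Construct validity is the most theoretical and comprehensive form. It asks: “Are the test’s assumptions about the latent construct defensible?”
Two evidential pillars (construct validity framework: Cronbach & Meehl; convergent/discriminant distinction: Campbell & Fiske):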
1. Convergent Evidence – Test correlates with related measures (should be high).
  • New depression scale should correlate with measures of low mood, anhedonia, guilt, etc.
2. Discriminant Evidence – Test does not correlate with unrelated constructs (should be low).
  • Color-blindness test shouldn’t merely track overall visual acuity (Snellen chart scores).
Illustration: Implicit-Association Test (IAT) critics argue it may capture stereotype awareness, not attitude endorsement—challenging its construct validity.
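The convergent/discriminant pattern can be sketched with two correlations. All three score lists below are invented, the `pearson_r` helper is hand-written, and `unrelated_trait` is simply a stand-in for any measure that theory says should not track depression.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same eight respondents.
new_depression = [5, 9, 12, 15, 20, 24, 27, 30]   # scale under evaluation
low_mood = [6, 8, 13, 14, 19, 25, 26, 31]         # related construct
unrelated_trait = [9, 7, 10, 8, 8, 10, 7, 9]      # theoretically unrelated

convergent_r = pearson_r(new_depression, low_mood)
discriminant_r = pearson_r(new_depression, unrelated_trait)
print(round(convergent_r, 2), round(discriminant_r, 2))
```

Construct validity is supported when the convergent correlation is high and the discriminant correlation is low. A high discriminant correlation would suggest the scale is picking up something other than the intended construct.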

All four ask variants of the master question: “Does the test measure what it says it measures?”
They differ in who or what supplies evidence: appearance to examinees (face), content mapping (content), real-world outcomes (predictive), theoretical network of relations (construct).

$\text{Valid} \Rightarrow \text{Reliable}$ (always).
$\text{Reliable} \nRightarrow \text{Valid}$ (toe-length example).
Measurement error undermines both: a highly unreliable test measures nothing consistently, hence cannot be valid.
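Classical test theory makes the "necessary but not sufficient" point quantitative. As a sketch, write $r_{XX}$ and $r_{YY}$ for the reliabilities of the test and the criterion, and $\rho_{T_X T_Y}$ for the correlation between their true scores (these symbols are introduced here and are not used elsewhere in these notes). The observed validity coefficient $r_{XY}$ then satisfies:

$$
r_{XY} = \rho_{T_X T_Y}\,\sqrt{r_{XX}\, r_{YY}}
\quad\Longrightarrow\quad
|r_{XY}| \le \sqrt{r_{XX}\, r_{YY}}
$$

For example, with $r_{XX} = 0.49$ and $r_{YY} = 0.81$, the ceiling is $\sqrt{0.49 \times 0.81} = 0.63$, however well the underlying constructs are related. High reliability raises the ceiling but does not guarantee that $\rho_{T_X T_Y}$ is itself large, which is why a reliable test can still be invalid.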

Using tests without adequate validity can misclassify, stigmatise, or harm individuals (e.g., toe-length color test, flawed driving exams).
High face validity can foster participant engagement, but practitioners must not confuse “looks right” with “is right.”
Researchers should remain open to new data challenging validity claims and be willing to refine or abandon tests.

Create a mnemonic for F-C-P-C (Face, Content, Predictive, Construct).
Map each validity type to: guiding question, primary evidence source, illustrative example.
Practise distinguishing evidence types in journal articles or case vignettes.
Re-watch segments or reread notes until you can:
• Give a concise definition of each validity form.
• Provide at least one real-world example and explain why it fits.
• Explain how reliability conditions validity.
