Everyday usage: “valid” = meaningful, well-grounded, executed with proper formalities.
Psychological testing usage: a judgement of how well a test measures what it purports to measure in a particular context.
Focus is on the appropriateness of inferences drawn from scores, not on the test per se.
Validity is not universal; bounded by purpose, population, culture, and time.
Validity can fade as culture or technology changes → periodic re-validation is essential.
Validation = gathering & evaluating evidence.
Test developer supplies evidence in the manual.
Test users may conduct local validation studies when:
Administering the test to a new population.
Modifying the format (e.g., a Braille version, a translation).
If a local study is not feasible, users must consult the independent literature or seek expert consultation before use.
Content Validity – representativeness of items.
Criterion-Related Validity – relationship with external measure(s).
Construct Validity – integration of all evidence within a theoretical framework.
Construct validity functions as “umbrella validity.”
Critics of this three-part view (e.g., Messick) argue for a unitary model of validity that incorporates social consequences.
Ecological Validity – generalizability to real-life contexts and moments (ties to Ecological Momentary Assessment).
Face Validity – appearance of relevance to test-taker; PR value rather than technical merit.
High face validity can boost cooperation & acceptance (e.g., Introversion/Extraversion questionnaire).
Definition: degree to which items sample the universe of behaviors the construct covers.
Achieved through a test blueprint:
Specifies topics, item counts, weightings, formats.
Informed by syllabi, textbooks, SMEs, job analyses.
Example: an assertiveness test → items span home, job, and social scenarios.
Cultural relativity: what counts as “content” depends on history & politics.
The Gavrilo Princip example illustrates how the "correct" answer differs across textbooks used by Bosnia's ethnic groups.
Uses an external criterion (standard) to evaluate test scores.
Two forms:
Concurrent Validity – scores & criterion collected simultaneously.
Predictive Validity – criterion obtained in the future.
A good criterion is:
Relevant – directly linked to the construct.
Valid – itself measured accurately.
Uncontaminated – independent of predictor (avoid criterion contamination).
Validity Coefficient r_{xy}: correlation between test (X) and criterion (Y).
Typically computed as the Pearson r:
r_{xy}=\frac{\sum_i (X_i-\bar X)(Y_i-\bar Y)}{\sqrt{\sum_i (X_i-\bar X)^2}\,\sqrt{\sum_i (Y_i-\bar Y)^2}}
Affected by range restriction/inflation.
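A minimal sketch of this computation in Python (NumPy); the test and criterion data are invented for illustration:

```python
import numpy as np

def validity_coefficient(x, y):
    """Pearson r between test scores (X) and criterion scores (Y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Hypothetical data: aptitude test scores vs. supervisor ratings
test = [52, 61, 47, 70, 66, 58, 75, 49]
criterion = [3.1, 3.8, 2.9, 4.5, 4.0, 3.4, 4.7, 3.0]
print(round(validity_coefficient(test, criterion), 3))  # prints r_xy
```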
Expectancy Data & Charts – show likelihood of specific outcomes per score band.
Hit rate – proportion correctly classified.
Miss rate – proportion misclassified.
False Positive – predicted "has trait" but doesn’t.
False Negative – predicted "lacks trait" but actually has it.
Base Rate – prevalence of trait in population; influences predictive power.
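A short sketch tying these terms together, using hypothetical screening decisions (all data invented):

```python
import numpy as np

def classification_rates(predicted, actual):
    """Hit/miss breakdown from binary predictions vs. true status (1 = has trait)."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    tp = np.sum((predicted == 1) & (actual == 1))  # hits (true positives)
    tn = np.sum((predicted == 0) & (actual == 0))  # hits (true negatives)
    fp = np.sum((predicted == 1) & (actual == 0))  # false positives
    fn = np.sum((predicted == 0) & (actual == 1))  # false negatives
    n = len(actual)
    return {
        "hit rate": (tp + tn) / n,
        "miss rate": (fp + fn) / n,
        "false positive rate": fp / n,
        "false negative rate": fn / n,
        "base rate": actual.mean(),  # prevalence of the trait in this sample
    }

# Hypothetical screening decisions for 10 cases
pred   = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
status = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(classification_rates(pred, status))
```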
Incremental Validity – the added value of a new predictor (\Delta R^2 via hierarchical regression).
Maximal when:
Strong correlation with criterion.
Low correlation with existing predictors (non-redundant).
Emotional Intelligence research: modest incremental validity beyond g and personality.
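A minimal sketch of the hierarchical-regression \Delta R^2 check on simulated data; the predictor names (g, new) and all coefficients are placeholders, not real measures:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 200
g = rng.normal(size=n)                         # existing predictor (e.g., cognitive ability)
new = 0.3 * g + rng.normal(size=n)             # candidate predictor, partly redundant with g
y = 0.5 * g + 0.2 * new + rng.normal(size=n)   # criterion

r2_base = r_squared(g.reshape(-1, 1), y)                 # step 1: existing predictor only
r2_full = r_squared(np.column_stack([g, new]), y)        # step 2: add the new predictor
print(f"delta R^2 = {r2_full - r2_base:.3f}")            # incremental validity of `new`
```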
BDI with Adolescents: concurrent validation against an established adolescent instrument → adequate validity.
Corporate Selection (Dr. Shoemaker): test with high face validity but poor criterion validity retained only for realistic job preview.
Construct = unobservable trait inferred from theory (e.g., intelligence, anxiety).
Evidence types:
Homogeneity / Unidimensionality
Internal consistency; factor analysis.
Developmental Changes
Scores vary with age/time as theory predicts.
Pretest–Posttest Changes
Scores shift after interventions (therapy, training).
Contrasted (Known) Groups
Expected score differences among distinct groups (e.g., depressed vs. non-depressed).
Convergent Evidence – high correlation with related measures.
Discriminant Evidence – low correlation with unrelated constructs.
Multitrait-Multimethod Matrix (MTMM) – simultaneous appraisal of convergent & discriminant validity.
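A toy MTMM-style check on simulated data; the traits, methods, and variance weights are all assumptions invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
anxiety = rng.normal(size=n)       # latent trait 1 (hypothetical)
sociability = rng.normal(size=n)   # latent trait 2 (hypothetical)

def measure(trait, method_bias):
    # each observed score = trait + shared method variance + noise
    return trait + 0.3 * method_bias + 0.5 * rng.normal(size=n)

self_report_bias = rng.normal(size=n)   # method 1: self-report
observer_bias = rng.normal(size=n)      # method 2: observer rating

scores = np.column_stack([
    measure(anxiety, self_report_bias),      # trait A, method 1
    measure(sociability, self_report_bias),  # trait B, method 1
    measure(anxiety, observer_bias),         # trait A, method 2
    measure(sociability, observer_bias),     # trait B, method 2
])
R = np.corrcoef(scores, rowvar=False)
# Convergent validities (same trait, different method) should be high;
# discriminant correlations (different traits) should be low.
print("convergent:", round(R[0, 2], 2), round(R[1, 3], 2))
print("discriminant:", round(R[0, 1], 2), round(R[2, 3], 2))
```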
Factor Analysis
Exploratory (EFA) identifies latent factors.
Confirmatory (CFA) tests hypothesized structure.
Factor Loading = weight linking item to factor.
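A minimal EFA sketch with scikit-learn on simulated two-factor data; a full CFA would need a dedicated SEM package, so this only illustrates how loadings link items to factors:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n = 500
f1, f2 = rng.normal(size=(2, n))   # two latent factors
# Six items: three load on f1, three on f2 (plus noise)
items = np.column_stack(
    [f1 + 0.4 * rng.normal(size=n) for _ in range(3)] +
    [f2 + 0.4 * rng.normal(size=n) for _ in range(3)]
)

efa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
# Loadings: rows = factors, columns = items; a high |loading| ties an item to a factor.
print(np.round(efa.components_, 2))
```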
Example – a marital satisfaction scale: reduced from 73→48 items via item–total correlations > .50 (see the sketch after this example).
Validated through:
Homogeneity.
Pre-/post-therapy score changes.
Contrasted happily vs. unhappy couples.
Convergent r = 0.79 with Marital Adjustment Test.
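The item–total retention rule mentioned above can be sketched as follows; the 5-item data are invented, and the corrected (item-removed) variant of the correlation is used:

```python
import numpy as np

def corrected_item_total(items):
    """Correlation of each item with the total of the remaining items."""
    items = np.asarray(items, float)
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])

# Hypothetical 5-item scale, 6 respondents (Likert 1-5); item 4 is a poor fit
X = np.array([
    [4, 5, 4, 2, 5],
    [2, 1, 2, 4, 1],
    [5, 4, 5, 1, 4],
    [3, 3, 3, 3, 3],
    [1, 2, 1, 5, 2],
    [4, 4, 5, 2, 5],
])
r_it = corrected_item_total(X)
keep = r_it > 0.50   # retention rule analogous to the one above
print(np.round(r_it, 2), keep)
```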
A second scale example: reduced from 29→18 items after EFA.
CFA supported a 2-factor model; criterion relations (trait anxiety, punctuality, wildfire preparedness) matched predictions.
Test bias: a statistical artefact causing systematic error for a group.
Detection: DIF analyses, intercept bias (consistent under/over-prediction), slope bias (weaker correlations).
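A minimal sketch of slope/intercept bias screening: regress the criterion on test scores separately per group and compare the fitted lines (simulated data; the 4-point under-prediction for group B is built in for illustration):

```python
import numpy as np

def group_regression(x, y):
    """Slope and intercept of the criterion-on-test regression for one group."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

rng = np.random.default_rng(3)
x_a = rng.normal(50, 10, 200)               # test scores, group A
x_b = rng.normal(50, 10, 200)               # test scores, group B
y_a = 0.8 * x_a + rng.normal(0, 5, 200)     # criterion, group A
y_b = 0.8 * x_b - 4 + rng.normal(0, 5, 200) # group B: criterion shifted down 4 points

print("A:", group_regression(x_a, y_a))
print("B:", group_regression(x_b, y_b))
# Similar slopes but different intercepts -> intercept bias: a single common
# regression line would systematically mis-predict one group's criterion.
```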
Test fairness: a broader, value-laden issue – the impartial, just, and equitable use of tests.
A test can be valid yet used unfairly (e.g., for political repression in the Cold War-era USSR).
Rating errors:
Leniency / Generosity Error – ratings systematically too high.
Severity Error – systematically low ratings.
Central Tendency Error – avoidance of extremes.
Halo Effect – undifferentiated positive (or negative) impression spills over.
Remedies: rater training, forced rankings, behavioural anchors.
Psychometric techniques for addressing group differences in test scores (each with ethical & legal debates):
Addition of constant points.
Differential scoring / empirical keying by group.
Elimination of items showing Differential Item Functioning (DIF).
Differential cutoffs.
Separate ranking lists.
Within-group (race) norming (now illegal in U.S. employment).
Banding & sliding bands (see the sketch after this list).
Explicit preference policies.
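One common way banding is operationalized is a sketch like the following, assuming the standard-error-of-the-difference approach; the score list, SD, and reliability are hypothetical. All scores within the band are treated as statistically indistinguishable ties:

```python
import numpy as np

def band_width(sd, reliability, z=1.96):
    """Width of a score band based on the standard error of the difference.

    SEM = SD * sqrt(1 - reliability); SED = SEM * sqrt(2).
    Scores within z * SED of the top score are treated as equivalent.
    """
    sem = sd * np.sqrt(1 - reliability)
    return z * sem * np.sqrt(2)

scores = np.array([93, 91, 90, 88, 84, 83, 79])
width = band_width(sd=10, reliability=0.90)   # hypothetical test statistics
in_band = scores >= scores.max() - width
print(f"band width = {width:.1f}; in top band: {scores[in_band]}")
```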
Pro-adjustment: redress past wrongs, ensure diversity, correct biased items.
Anti-adjustment: undermines individual merit, may harm intended beneficiaries, violates legislation (e.g., U.S. Civil Rights Act 1991 §106).
Validity evidence guides ethical assessment, informs legal defensibility, and affects societal outcomes (employment, admissions, clinical decisions).
Cultural context crucial: item interpretations, historical narratives, political climates (e.g., Bosnian textbooks; Palestinian exam censorship).
Practical takeaway: Continuous monitoring, transparent reporting, and multi-method evidence are mandatory for responsible test use.
Pearson correlation: r_{xy}=\frac{\sum_i (X_i-\bar X)(Y_i-\bar Y)}{\sqrt{\sum_i (X_i-\bar X)^2}\,\sqrt{\sum_i (Y_i-\bar Y)^2}}
Regression model with a new predictor: Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon; incremental validity assessed via \Delta R^2 when X_2 is added.
Hit/Miss classification table (simplified):
\begin{array}{c|cc}
 & \text{Predicted +} & \text{Predicted -} \\
\hline
\text{Actual +} & \text{Hit (true positive)} & \text{False Negative} \\
\text{Actual -} & \text{False Positive} & \text{Hit (true negative)} \\
\end{array}
Understand definitions of all validity types.
Be able to design a validation study, choosing appropriate criteria & statistics.
Recognize cultural and ethical dimensions of test use.
Calculate and interpret r_{xy}, hit/miss rates, and incremental \Delta R^2.
Identify rating errors and propose corrective actions.
Discuss pros/cons of various bias-mitigation techniques.