Everyday usage: “valid” = meaningful, well-grounded, executed with proper formalities.
Psychological testing usage: a judgement of how well a test measures what it purports to measure in a particular context.
Focus is on the appropriateness of inferences drawn from scores, not on the test per se.
Validity is not universal; bounded by purpose, population, culture, and time.
Validity can fade as culture or technology changes → periodic re-validation is essential.
Validation = gathering & evaluating evidence.
Test developer supplies evidence in the manual.
Test users may conduct local validation studies when:
Administering the test to a new population.
Modifying the format (e.g., a Braille version, a translation).
If a local study is not feasible, users must consult the independent literature or seek expert consultation before use.
Content Validity – representativeness of items.
Criterion-Related Validity – relationship with external measure(s).
Construct Validity – integration of all evidence within a theoretical framework.
Construct validity functions as “umbrella validity.”
Critics of this three-part view (e.g., Messick) argue for a unitary model of validity that incorporates social consequences.
Ecological Validity – generalizability to real-life contexts and moments (ties to Ecological Momentary Assessment).
Face Validity – appearance of relevance to test-taker; PR value rather than technical merit.
High face validity can boost cooperation & acceptance (e.g., Introversion/Extraversion questionnaire).
Definition: degree to which items sample the universe of behaviors the construct covers.
Achieved through a test blueprint:
Specifies topics, item counts, weightings, formats.
Informed by syllabi, textbooks, SMEs, job analyses.
Example: an assertiveness test → items span home, job, and social scenarios.
Cultural relativity: what counts as “content” depends on history & politics.
The Gavrilo Princip example illustrates how the "correct" answer differs across textbooks used by Bosnia's ethnic groups.
Uses an external criterion (standard) to evaluate test scores.
Two forms:
Concurrent Validity – scores & criterion collected simultaneously.
Predictive Validity – criterion obtained in the future.
A good criterion is:
Relevant – directly linked to the construct.
Valid – itself measured accurately.
Uncontaminated – independent of predictor (avoid criterion contamination).
Validity Coefficient r_{xy}: correlation between test (X) and criterion (Y).
Typically computed as the Pearson r:
r_{xy}=\frac{\sum_i (X_i-\bar X)(Y_i-\bar Y)}{\sqrt{\sum_i (X_i-\bar X)^2}\,\sqrt{\sum_i (Y_i-\bar Y)^2}}
Affected by range restriction/inflation.
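A minimal sketch of this computation in Python (NumPy); the test and criterion data are invented for illustration:

```python
import numpy as np

def validity_coefficient(x, y):
    """Pearson r between test scores (X) and criterion scores (Y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Hypothetical data: aptitude test scores vs. supervisor ratings
test = [52, 61, 47, 70, 66, 58, 75, 49]
criterion = [3.1, 3.8, 2.9, 4.5, 4.0, 3.4, 4.7, 3.0]
print(round(validity_coefficient(test, criterion), 3))  # prints r_xy
```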
Expectancy Data & Charts – show likelihood of specific outcomes per score band.
Hit rate – proportion correctly classified.
Miss rate – proportion misclassified.
False Positive – predicted "has trait" but doesn’t.
False Negative – predicted "lacks trait" but actually has it.
Base Rate – prevalence of trait in population; influences predictive power.
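A short sketch tying these terms together, using hypothetical screening decisions (all data invented):

```python
import numpy as np

def classification_rates(predicted, actual):
    """Hit/miss breakdown from binary predictions vs. true status (1 = has trait)."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    tp = np.sum((predicted == 1) & (actual == 1))  # hits (true positives)
    tn = np.sum((predicted == 0) & (actual == 0))  # hits (true negatives)
    fp = np.sum((predicted == 1) & (actual == 0))  # false positives
    fn = np.sum((predicted == 0) & (actual == 1))  # false negatives
    n = len(actual)
    return {
        "hit rate": (tp + tn) / n,
        "miss rate": (fp + fn) / n,
        "false positive rate": fp / n,
        "false negative rate": fn / n,
        "base rate": actual.mean(),  # prevalence of the trait in this sample
    }

# Hypothetical screening decisions for 10 cases
pred   = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
status = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(classification_rates(pred, status))
```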
Incremental Validity – the added value of a new predictor (\Delta R^2 via hierarchical regression).
Maximal when:
Strong correlation with criterion.
Low correlation with existing predictors (non-redundant).
Emotional Intelligence research: modest incremental validity beyond g and personality.
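A minimal sketch of the hierarchical-regression \Delta R^2 check on simulated data; the predictor names (g, new) and all coefficients are placeholders, not real measures:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 200
g = rng.normal(size=n)                         # existing predictor (e.g., cognitive ability)
new = 0.3 * g + rng.normal(size=n)             # candidate predictor, partly redundant with g
y = 0.5 * g + 0.2 * new + rng.normal(size=n)   # criterion

r2_base = r_squared(g.reshape(-1, 1), y)                 # step 1: existing predictor only
r2_full = r_squared(np.column_stack([g, new]), y)        # step 2: add the new predictor
print(f"delta R^2 = {r2_full - r2_base:.3f}")            # incremental validity of `new`
```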
BDI with Adolescents: concurrent validation against an established adolescent instrument → adequate validity.
Corporate Selection (Dr. Shoemaker): test with high face validity but poor criterion validity retained only for realistic job preview.
Construct = unobservable trait inferred from theory (e.g., intelligence, anxiety).
Evidence types:
Homogeneity / Unidimensionality
Internal consistency; factor analysis.
Developmental Changes
Scores vary with age/time as theory predicts.
Pretest–Posttest Changes
Scores shift after interventions (therapy, training).
Contrasted (Known) Groups
Expected score differences among distinct groups (e.g., depressed vs. non-depressed).
Convergent Evidence – high correlation with related measures.
Discriminant Evidence – low correlation with unrelated constructs.
Multitrait-Multimethod Matrix (MTMM) – simultaneous appraisal of convergent & discriminant validity.
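A toy MTMM-style check on simulated data; the traits, methods, and variance weights are all assumptions invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
anxiety = rng.normal(size=n)       # latent trait 1 (hypothetical)
sociability = rng.normal(size=n)   # latent trait 2 (hypothetical)

def measure(trait, method_bias):
    # each observed score = trait + shared method variance + noise
    return trait + 0.3 * method_bias + 0.5 * rng.normal(size=n)

self_report_bias = rng.normal(size=n)   # method 1: self-report
observer_bias = rng.normal(size=n)      # method 2: observer rating

scores = np.column_stack([
    measure(anxiety, self_report_bias),      # trait A, method 1
    measure(sociability, self_report_bias),  # trait B, method 1
    measure(anxiety, observer_bias),         # trait A, method 2
    measure(sociability, observer_bias),     # trait B, method 2
])
R = np.corrcoef(scores, rowvar=False)
# Convergent validities (same trait, different method) should be high;
# discriminant correlations (different traits) should be low.
print("convergent:", round(R[0, 2], 2), round(R[1, 3], 2))
print("discriminant:", round(R[0, 1], 2), round(R[2, 3], 2))
```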
Factor Analysis
Exploratory (EFA) identifies latent factors.
Confirmatory (CFA) tests hypothesized structure.
Factor Loading = weight linking item to factor.
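A minimal EFA sketch with scikit-learn on simulated two-factor data; a full CFA would need a dedicated SEM package, so this only illustrates how loadings link items to factors:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n = 500
f1, f2 = rng.normal(size=(2, n))   # two latent factors
# Six items: three load on f1, three on f2 (plus noise)
items = np.column_stack(
    [f1 + 0.4 * rng.normal(size=n) for _ in range(3)] +
    [f2 + 0.4 * rng.normal(size=n) for _ in range(3)]
)

efa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
# Loadings: rows = factors, columns = items; a high |loading| ties an item to a factor.
print(np.round(efa.components_, 2))
```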
Example – a marital satisfaction scale: reduced from 73→48 items via item–total correlations > .50 (see the sketch after this example).
Validated through:
Homogeneity.
Pre-/post-therapy score changes.
Contrasted happily vs. unhappy couples.
Convergent r = 0.79 with Marital Adjustment Test.
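The item–total retention rule mentioned above can be sketched as follows; the 5-item data are invented, and the corrected (item-removed) variant of the correlation is used:

```python
import numpy as np

def corrected_item_total(items):
    """Correlation of each item with the total of the remaining items."""
    items = np.asarray(items, float)
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])

# Hypothetical 5-item scale, 6 respondents (Likert 1-5); item 4 is a poor fit
X = np.array([
    [4, 5, 4, 2, 5],
    [2, 1, 2, 4, 1],
    [5, 4, 5, 1, 4],
    [3, 3, 3, 3, 3],
    [1, 2, 1, 5, 2],
    [4, 4, 5, 2, 5],
])
r_it = corrected_item_total(X)
keep = r_it > 0.50   # retention rule analogous to the one above
print(np.round(r_it, 2), keep)
```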
A second scale example: reduced from 29→18 items after EFA.
CFA supported a 2-factor model; criterion relations (trait anxiety, punctuality, wildfire preparedness) matched predictions.
Test bias: a statistical artefact causing systematic error for a group.
Detection: DIF analyses, intercept bias (consistent under/over-prediction), slope bias (weaker correlations).
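A minimal sketch of slope/intercept bias screening: regress the criterion on test scores separately per group and compare the fitted lines (simulated data; the 4-point under-prediction for group B is built in for illustration):

```python
import numpy as np

def group_regression(x, y):
    """Slope and intercept of the criterion-on-test regression for one group."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

rng = np.random.default_rng(3)
x_a = rng.normal(50, 10, 200)               # test scores, group A
x_b = rng.normal(50, 10, 200)               # test scores, group B
y_a = 0.8 * x_a + rng.normal(0, 5, 200)     # criterion, group A
y_b = 0.8 * x_b - 4 + rng.normal(0, 5, 200) # group B: criterion shifted down 4 points

print("A:", group_regression(x_a, y_a))
print("B:", group_regression(x_b, y_b))
# Similar slopes but different intercepts -> intercept bias: a single common
# regression line would systematically mis-predict one group's criterion.
```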
Test fairness: a broader, value-laden issue – the impartial, just, and equitable use of tests.
A test can be valid yet used unfairly (e.g., for political repression in the Cold War-era USSR).
Rating errors:
Leniency / Generosity Error – ratings systematically too high.
Severity Error – systematically low ratings.
Central Tendency Error – avoidance of extremes.
Halo Effect – undifferentiated positive (or negative) impression spills over.
Remedies: rater training, forced rankings, behavioural anchors.
Psychometric techniques for addressing group differences in test scores (each with ethical & legal debates):
Addition of constant points.
Differential scoring / empirical keying by group.
Elimination of items showing Differential Item Functioning (DIF).
Differential cutoffs.
Separate ranking lists.
Within-group (race) norming (now illegal in U.S. employment).
Banding & sliding bands (see the sketch after this list).
Explicit preference policies.
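One common way banding is operationalized is a sketch like the following, assuming the standard-error-of-the-difference approach; the score list, SD, and reliability are hypothetical. All scores within the band are treated as statistically indistinguishable ties:

```python
import numpy as np

def band_width(sd, reliability, z=1.96):
    """Width of a score band based on the standard error of the difference.

    SEM = SD * sqrt(1 - reliability); SED = SEM * sqrt(2).
    Scores within z * SED of the top score are treated as equivalent.
    """
    sem = sd * np.sqrt(1 - reliability)
    return z * sem * np.sqrt(2)

scores = np.array([93, 91, 90, 88, 84, 83, 79])
width = band_width(sd=10, reliability=0.90)   # hypothetical test statistics
in_band = scores >= scores.max() - width
print(f"band width = {width:.1f}; in top band: {scores[in_band]}")
```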
Pro-adjustment: redress past wrongs, ensure diversity, correct biased items.
Anti-adjustment: undermines individual merit, may harm intended beneficiaries, violates legislation (e.g., U.S. Civil Rights Act 1991 §106).
Validity evidence guides ethical assessment, informs legal defensibility, and affects societal outcomes (employment, admissions, clinical decisions).
Cultural context crucial: item interpretations, historical narratives, political climates (e.g., Bosnian textbooks; Palestinian exam censorship).
Practical takeaway: Continuous monitoring, transparent reporting, and multi-method evidence are mandatory for responsible test use.
Pearson correlation: r_{xy}=\frac{\sum_i (X_i-\bar X)(Y_i-\bar Y)}{\sqrt{\sum_i (X_i-\bar X)^2}\,\sqrt{\sum_i (Y_i-\bar Y)^2}}
Regression model with a new predictor: Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon; incremental validity assessed via \Delta R^2 when X_2 is added.
Hit/Miss classification table (simplified):
\begin{array}{c|cc}
 & \text{Predicted +} & \text{Predicted -} \\
\hline
\text{Actual +} & \text{Hit (true positive)} & \text{False Negative} \\
\text{Actual -} & \text{False Positive} & \text{Hit (true negative)} \\
\end{array}
Understand definitions of all validity types.
Be able to design a validation study, choosing appropriate criteria & statistics.
Recognize cultural and ethical dimensions of test use.
Calculate and interpret r_{xy}, hit/miss rates, and incremental \Delta R^2.
Identify rating errors and propose corrective actions.
Discuss pros/cons of various bias-mitigation techniques.