Comprehensive Notes on Tests and Measurement

Why Measurement? An Introduction

  • Historical Context of Testing
    • Evidence of testing dates back to 2200 BCE in China, where citizens took civil service exams in writing, arithmetic, horsemanship, and archery.
    • Modern civil service systems in Britain (1830) and America (1889) were modeled after these early competitive systems.
    • Charles Darwin's Origin of Species (1859) catalyzed interest in individual differences, which is the core of most psychological tests.
    • Francis Galton, Darwin's cousin, established the first anthropometric lab to measure physical variables like strength and hand steadiness.
    • James McKeen Cattell coined the term "mental test" (1890) and co-founded the Psychological Corporation in 1921.
    • Alfred Binet and Theodore Simon created the first intelligence test in 1905 to identify Parisian schoolchildren in need of specialized instruction.
  • Definition of a Test
    • A test is a tool, procedure, device, examination, investigation, assessment, or measure of an outcome (usually behavior).
    • Formats range from 50-question multiple-choice exams to 30-minute qualitative interviews.
  • Purposes of Testing
    • Selection: Identifying individuals for specific roles (e.g., jet pilots).
    • Placement: Assigning individuals to appropriate levels (e.g., college math classes).
    • Diagnosis: Identifying mental disorders or specific strengths/weaknesses.
    • Hypothesis Testing: Validating "if…then" scientific statements.
    • Classification: Helping individuals choose career paths based on aptitude.
  • Modern Economic Scale
    • The Brookings Institution estimates that $1.7 billion is spent annually on assessment for the K–12 population in the United States alone.

The Psychology of Psychometrics

Levels of Measurement

  • Nominal Level
    • Variables are categorical or discrete.
    • Names or labels are assigned (e.g., "Republican" vs. "Democrat," "Nurse" vs. "Non-nurse").
    • Categories must be mutually exclusive; a score cannot belong to more than one group.
  • Ordinal Level
    • Variables are ordered along a continuum.
    • Indicates "more than" or "less than" relationships (e.g., ranking childhood fears).
    • Does not provide information on the distance between ranks.
  • Interval Level
    • Values are based on an underlying continuum with equal intervals.
    • Allows for the calculation of differences between scores (e.g., IQ scores where the difference between 100 and 102 is the same as 102 and 104).
    • Does not have an absolute zero.
  • Ratio Level
    • Characterized by all properties of nominal, ordinal, and interval scales, plus an absolute zero (absence of the trait).
    • Examples: Height, weight, number of finger taps.
    • Hard to apply to psychological constructs (e.g., a 0 on a spelling test does not mean zero spelling ability).

Reliability: Consistency of Measurement

  • The Reliability Formula
    • Observed Score = True Score + Error Score.
    • True Score: The theoretical, 100% accurate reflection of underlying knowledge.
    • Error Score: Difference between observed and true scores, consisting of Trait Error (individual factors like fatigue) and Method Error (situational factors like unclear instructions).
    • Conceptual Reliability Formula: $\text{Reliability} = \frac{\text{True Score}}{\text{True Score} + \text{Error Score}}$.
  • Types of Reliability
    • Test-Retest: Consistency over time; measured by correlating scores from two different time points.
    • Parallel Forms: Similarity between two different versions of the same test.
    • Internal Consistency: Whether items in a single test measure one dimension. Tools include:
      • Split-Half: Correlating odd vs. even items.
      • Cronbach’s Alpha ($\alpha$): Based on how each item relates to every other item and to the total score; used for non-binary items (e.g., Likert scales).
      • Kuder-Richardson (KR20): Internal consistency for binary (Right/Wrong) items.
    • Interrater Reliability: Level of agreement between two or more observers. Formula: $\frac{\text{Number of Agreements}}{\text{Number of Possible Agreements}}$.
  • The Spearman-Brown Formula
    • Used to correct split-half reliability: splitting a test in half yields the reliability of a half-length test, and shorter tests are less reliable. Formula: $r_t = \frac{2r_h}{1 + r_h}$, where $r_h$ is the correlation between the two halves.
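The split-half correction and interrater-agreement formulas above can be sketched in Python. This is a minimal illustration; all score data below is invented.

```python
# A minimal sketch of two reliability calculations from these notes,
# using invented score data (not from any real test).

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_brown(r_h):
    """Correct a split-half correlation to full test length: r_t = 2r_h / (1 + r_h)."""
    return 2 * r_h / (1 + r_h)

def interrater_agreement(rater_a, rater_b):
    """Number of agreements / number of possible agreements."""
    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return agreements / len(rater_a)

# Split-half: odd-item subtotals vs. even-item subtotals for five examinees
odd_totals = [10, 12, 9, 14, 11]
even_totals = [11, 13, 8, 15, 10]
r_h = pearson_r(odd_totals, even_totals)
print(round(spearman_brown(r_h), 3))   # 0.971

# Two observers coding the same ten behaviors (1 = occurred, 0 = did not)
print(interrater_agreement([1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
                           [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]))   # 0.8
```

Note that the corrected coefficient (0.971) exceeds the raw half-test correlation, as the formula intends.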

Validity: Accuracy of Measurement

  • General Definition
    • The extent to which inferences made from a test are appropriate, meaningful, and useful.
    • A test must be reliable before it can be valid, but a reliable test is not necessarily valid.
  • Types of Validity
    • Content Validity: Whether items sample the entire universe of possible items for the domain (essential for achievement tests).
    • Criterion Validity: How well a test correlates with an external criterion.
      • Concurrent: Criterion is measured at the same time.
      • Predictive: Test predicts future performance (e.g., GRE scores predicting grad school GPA).
    • Construct Validity: Most complex; whether a test measures an underlying theoretical construct (e.g., shyness).
      • Multitrait-Multimethod Matrix: Uses multiple traits and methods to establish Convergent (high similarity) and Discriminant (low similarity) validity.
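Criterion validity in practice reduces to a correlation between test scores and the criterion; the resulting value is the validity coefficient. A minimal sketch with made-up numbers (hypothetical admissions scores against later GPA):

```python
# Sketch of a predictive-validity check: correlate test scores with a
# criterion measured later. All numbers here are invented for illustration.

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

test_scores = [152, 160, 148, 167, 155, 158]   # hypothetical admissions scores
later_gpa = [3.1, 3.6, 2.9, 3.8, 3.3, 3.4]     # criterion measured a year later
print(round(pearson_r(test_scores, later_gpa), 3))   # 0.988 (validity coefficient)
```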

Norms, Percentiles, and Standard Scores

  • Percentiles ($P_r$)
    • Indicates the point below which a certain percentage of scores fall. Formula: $P_r = \frac{B}{N} \times 100$, where $B$ is the number of lower values and $N$ is total observations.
  • Stanines
    • Divides a distribution into nine equal segments. Mean = 5, SD = 2.
  • Standard Scores
    • z Score: Represents the number of standard deviations a score is from the mean. Formula: $z = \frac{X - \bar{X}}{s}$.
    • T Score: Transformed score to eliminate negatives and decimals. Formula: $T = 50 + 10z$.
  • Standard Error of Measurement (SEM)
    • Measure of variability in an individual's score upon repeated testing. Formula: $SEM = s\sqrt{1 - r}$.
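The score transformations above can be sketched directly from the formulas; the seven-score distribution below is hypothetical:

```python
# Sketch of percentile rank, z score, T score, and SEM from these notes.
# The score distribution is invented for illustration.
from statistics import mean, stdev

def percentile_rank(score, scores):
    """P_r = (B / N) * 100, where B = count of values below the score."""
    below = sum(1 for s in scores if s < score)
    return below / len(scores) * 100

def z_score(x, xs):
    """Standard deviations from the mean: z = (X - mean) / s."""
    return (x - mean(xs)) / stdev(xs)

def t_score(z):
    """T = 50 + 10z: removes negatives and decimals."""
    return 50 + 10 * z

def sem(s, r):
    """Standard error of measurement: s * sqrt(1 - r)."""
    return s * (1 - r) ** 0.5

scores = [70, 75, 80, 85, 90, 95, 100]
print(round(percentile_rank(85, scores), 1))   # 42.9 (3 of 7 scores fall below 85)
print(t_score(z_score(85, scores)))            # 50.0 (85 is exactly the mean)
print(round(sem(10, 0.91), 1))                 # 3.0 (s = 10, reliability = .91)
```

The last line shows why high reliability matters: with $r = .91$ the SEM shrinks to 3 points on a 10-point-SD scale.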

Item Response Theory (IRT)

  • Core Concept
    • Focuses on the characteristics of individual items rather than total scores. Also called "Latent Trait Theory."
  • The Item Characteristic Curve (ICC)
    • A graph with $\theta$ (theta), the underlying ability, on the x-axis and $P(\theta)$, the probability of a correct response, on the y-axis.
    • Difficulty ($b$): The point on the x-axis where the probability of success is 0.50.
    • Discrimination ($a$): Represented by the steepness of the curve.
    • Guessing ($c$): The probability of low-ability test takers getting the item right by chance.
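The three parameters above correspond to the three-parameter logistic (3PL) form of the ICC. The sketch below uses the basic logistic parameterization (omitting the scaling constant $D \approx 1.7$ that some texts include); the item parameters are invented:

```python
# Sketch of a 3PL item characteristic curve (basic logistic form).
# a = discrimination, b = difficulty, c = guessing; theta = latent ability.
import math

def icc_3pl(theta, a, b, c):
    """P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta == b with no guessing (c = 0), success probability is exactly 0.50,
# matching the definition of difficulty (b) above.
print(icc_3pl(0.0, a=1.2, b=0.0, c=0.0))             # 0.5
# A very low-ability examinee still scores near the guessing floor c.
print(round(icc_3pl(-3.0, a=1.2, b=0.0, c=0.2), 3))  # 0.221
```

Raising `a` steepens the curve around `b`, which is what the discrimination parameter represents.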

The Tao and How of Testing: Item Construction

  • Short-Answer and Completion Items
    • Best for lower-level thinking (memorization, facts).
    • Advantage: Minimizes guessing (no options provided).
  • Essay Items
    • Best for higher-order thinking (synthesis, analysis).
    • Open-ended vs. Closed-ended (restricted) formats.
    • Scoring requires batched grading, model answers, and anonymity to reduce bias.
  • Multiple-Choice Items
    • Anatomy: Stem (premise), Key (correct alternative), and Distracters (plausible incorrect options).
    • Difficulty Index ($D$): Proportion who got the item right: $D = \frac{N_h + N_l}{T}$, where $N_h$ and $N_l$ are the numbers answering correctly in the high- and low-scoring groups and $T$ is the total number of test takers.
    • Discrimination Index ($d$): How well the item separates high from low performers: $d = \frac{N_h - N_l}{0.5T}$.
  • Matching and True-False
    • Matching: Uses premises (Column A) and options (Column B).
    • True-False: Dichotomous format. Correction for guessing: $CS = R - W$, where $R$ is the number right and $W$ the number wrong.
  • Portfolios
    • Systematic collections of work showing progress over time. Both formative (ongoing) and summative (final) evaluation.
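The item-analysis indices above can be sketched as follows; the group counts are invented for illustration:

```python
# Sketch of the item-analysis formulas from these notes, with made-up counts.

def difficulty_index(n_high_correct, n_low_correct, total):
    """D = (N_h + N_l) / T: proportion answering the item correctly."""
    return (n_high_correct + n_low_correct) / total

def discrimination_index(n_high_correct, n_low_correct, total):
    """d = (N_h - N_l) / (0.5 * T): separation of high vs. low performers."""
    return (n_high_correct - n_low_correct) / (0.5 * total)

def corrected_score(right, wrong):
    """True-false correction for guessing: CS = R - W."""
    return right - wrong

# 50 examinees: 22 of the top half and 8 of the bottom half got the item right
print(difficulty_index(22, 8, 50))       # 0.6  (moderate difficulty)
print(discrimination_index(22, 8, 50))   # 0.56 (item separates the groups well)
print(corrected_score(40, 6))            # 34
```

A positive $d$ means high scorers outperformed low scorers on the item; a near-zero or negative $d$ flags the item for revision.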

Areas of Assessment

  • Intelligence Tests
    • Historically based on Spearman’s g factor (general factor) vs. Thurstone’s Primary Mental Abilities.
    • Robert Sternberg’s Triarchic Theory: Componential, Experiential, and Contextual.
    • Howard Gardner’s Multiple Intelligences: Musical, Bodily-Kinesthetic, Logical-Mathematical, Linguistic, Spatial, Interpersonal, Intrapersonal, and Naturalist.
    • Emotional Intelligence (Goleman): Focuses on self-awareness and empathy.
  • Neuropsychological Testing
    • Assessment of cognitive skills relating to brain function.
    • Areas covered: Memory, Language, Visuospatial ability, and Executive Function (e.g., Stroop Test).
  • Personality Testing
    • Objective: Clear stimuli (e.g., MMPI-2, NEO-4).
    • Projective: Ambiguous stimuli (e.g., Rorschach Inkblot, Thematic Apperception Test or TAT).
  • Career Choices
    • John Holland’s Hexagon (RIASEC model): Realistic, Investigative, Artistic, Social, Enterprising, Conventional.

Legal and Ethical Issues

  • Major Legislation
    • NCLB (2002): No Child Left Behind; focused on closing achievement gaps through high-stakes testing.
    • IDEA (1997 reauthorization; originally PL 94-142, 1975): Individuals with Disabilities Education Act; guarantees free appropriate public education in the "Least Restrictive Environment" (LRE).
    • FERPA (1974): Protects the privacy of student education records.
    • Truth in Testing: New York law requiring disclosure of items and scoring processes for admissions tests.
  • Ethics
    • Key principles: No physical or psychological harm, informed consent, confidentiality, anonymity, and appropriate use of incentives.