Comprehensive Notes on Tests and Measurement

Why Measurement? An Introduction

  • Historical Context of Testing
    • Evidence of testing dates back to 2200 BCE in China, where citizens took civil service exams in writing, arithmetic, horsemanship, and archery.
    • Modern civil service systems in Britain (1830) and America (1889) were modeled after these early competitive systems.
    • Charles Darwin's Origin of Species (1859) catalyzed interest in individual differences, which is the core of most psychological tests.
    • Francis Galton, Darwin's cousin, established the first anthropometric lab to measure physical variables like strength and hand steadiness.
    • James McKeen Cattell coined the term "mental test" (1890) and co-founded the Psychological Corporation in 1921.
    • Alfred Binet and Theodore Simon created the first intelligence test in 1905 to identify Parisian schoolchildren in need of specialized instruction.
  • Definition of a Test
    • A test is a tool, procedure, device, examination, investigation, assessment, or measure of an outcome (usually behavior).
    • Formats range from 50-question multiple-choice exams to 30-minute qualitative interviews.
  • Purposes of Testing
    • Selection: Identifying individuals for specific roles (e.g., jet pilots).
    • Placement: Assigning individuals to appropriate levels (e.g., college math classes).
    • Diagnosis: Identifying mental disorders or specific strengths/weaknesses.
    • Hypothesis Testing: Validating "if…then" scientific statements.
    • Classification: Helping individuals choose career paths based on aptitude.
  • Modern Economic Scale
    • The Brookings Institution estimates that $1.7 billion is spent annually on assessment for the K–12 population in the United States alone.

The Psychology of Psychometrics

Levels of Measurement

  • Nominal Level
    • Variables are categorical or discrete.
    • Names or labels are assigned (e.g., "Republican" vs. "Democrat," "Nurse" vs. "Non-nurse").
    • Categories must be mutually exclusive; a score cannot belong to more than one group.
  • Ordinal Level
    • Variables are ordered along a continuum.
    • Indicates "more than" or "less than" relationships (e.g., ranking childhood fears).
    • Does not provide information on the distance between ranks.
  • Interval Level
    • Values are based on an underlying continuum with equal intervals.
    • Allows for the calculation of differences between scores (e.g., IQ scores where the difference between 100 and 102 is the same as 102 and 104).
    • Does not have an absolute zero.
  • Ratio Level
    • Characterized by all properties of nominal, ordinal, and interval scales, plus an absolute zero (absence of the trait).
    • Examples: Height, weight, number of finger taps.
    • Hard to apply to psychological constructs (e.g., a 0 on a spelling test does not mean zero spelling ability).

Reliability: Consistency of Measurement

  • The Reliability Formula
    • Observed Score = True Score + Error Score.
    • True Score: The theoretical, 100% accurate reflection of underlying knowledge.
    • Error Score: Difference between observed and true scores, consisting of Trait Error (individual factors like fatigue) and Method Error (situational factors like unclear instructions).
    • Conceptual Reliability Formula: $\text{Reliability} = \frac{\text{True Score}}{\text{True Score} + \text{Error Score}}$.
  • Types of Reliability
    • Test-Retest: Consistency over time; measured by correlating scores from two different time points.
    • Parallel Forms: Similarity between two different versions of the same test.
    • Internal Consistency: Whether items in a single test measure one dimension. Tools include:
      • Split-Half: Correlating odd vs. even items.
      • Cronbach’s Alpha ($\alpha$): Based on how each item relates to every other item and to the total score; used for non-binary items (e.g., Likert scales).
      • Kuder-Richardson (KR20): Internal consistency for binary (Right/Wrong) items.
    • Interrater Reliability: Level of agreement between two or more observers. Formula: $\frac{\text{Number of Agreements}}{\text{Number of Possible Agreements}}$.
  • The Spearman-Brown Formula
    • Used to correct split-half reliability: splitting a test in half yields the reliability of a half-length test, and shorter tests are less reliable. Formula: $r_t = \frac{2r_h}{1 + r_h}$, where $r_h$ is the correlation between the two halves.
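The split-half correction and interrater-agreement formulas above can be sketched in Python. This is a minimal illustration; all score data below is invented.

```python
# A minimal sketch of two reliability calculations from these notes,
# using invented score data (not from any real test).

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_brown(r_h):
    """Correct a split-half correlation to full test length: r_t = 2r_h / (1 + r_h)."""
    return 2 * r_h / (1 + r_h)

def interrater_agreement(rater_a, rater_b):
    """Number of agreements / number of possible agreements."""
    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return agreements / len(rater_a)

# Split-half: odd-item subtotals vs. even-item subtotals for five examinees
odd_totals = [10, 12, 9, 14, 11]
even_totals = [11, 13, 8, 15, 10]
r_h = pearson_r(odd_totals, even_totals)
print(round(spearman_brown(r_h), 3))   # 0.971

# Two observers coding the same ten behaviors (1 = occurred, 0 = did not)
print(interrater_agreement([1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
                           [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]))   # 0.8
```

Note that the corrected coefficient (0.971) exceeds the raw half-test correlation, as the formula intends.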

Validity: Accuracy of Measurement

  • General Definition
    • The extent to which inferences made from a test are appropriate, meaningful, and useful.
    • A test must be reliable before it can be valid, but a reliable test is not necessarily valid.
  • Types of Validity
    • Content Validity: Whether items sample the entire universe of possible items for the domain (essential for achievement tests).
    • Criterion Validity: How well a test correlates with an external criterion.
      • Concurrent: Criterion is measured at the same time.
      • Predictive: Test predicts future performance (e.g., GRE scores predicting grad school GPA).
    • Construct Validity: Most complex; whether a test measures an underlying theoretical construct (e.g., shyness).
      • Multitrait-Multimethod Matrix: Uses multiple traits and methods to establish Convergent (high similarity) and Discriminant (low similarity) validity.
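Criterion validity in practice reduces to a correlation between test scores and the criterion; the resulting value is the validity coefficient. A minimal sketch with made-up numbers (hypothetical admissions scores against later GPA):

```python
# Sketch of a predictive-validity check: correlate test scores with a
# criterion measured later. All numbers here are invented for illustration.

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

test_scores = [152, 160, 148, 167, 155, 158]   # hypothetical admissions scores
later_gpa = [3.1, 3.6, 2.9, 3.8, 3.3, 3.4]     # criterion measured a year later
print(round(pearson_r(test_scores, later_gpa), 3))   # 0.988 (validity coefficient)
```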

Norms, Percentiles, and Standard Scores

  • Percentiles ($P_r$)
    • Indicates the point below which a certain percentage of scores fall. Formula: $P_r = \frac{B}{N} \times 100$, where $B$ is the number of lower values and $N$ is total observations.
  • Stanines
    • Divides a distribution into nine equal segments. Mean = 5, SD = 2.
  • Standard Scores
    • z Score: Represents the number of standard deviations a score is from the mean. Formula: $z = \frac{X - \bar{X}}{s}$.
    • T Score: Transformed score to eliminate negatives and decimals. Formula: $T = 50 + 10z$.
  • Standard Error of Measurement (SEM)
    • Measure of variability in an individual's score upon repeated testing. Formula: $SEM = s\sqrt{1 - r}$.
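The score transformations above can be sketched directly from the formulas; the seven-score distribution below is hypothetical:

```python
# Sketch of percentile rank, z score, T score, and SEM from these notes.
# The score distribution is invented for illustration.
from statistics import mean, stdev

def percentile_rank(score, scores):
    """P_r = (B / N) * 100, where B = count of values below the score."""
    below = sum(1 for s in scores if s < score)
    return below / len(scores) * 100

def z_score(x, xs):
    """Standard deviations from the mean: z = (X - mean) / s."""
    return (x - mean(xs)) / stdev(xs)

def t_score(z):
    """T = 50 + 10z: removes negatives and decimals."""
    return 50 + 10 * z

def sem(s, r):
    """Standard error of measurement: s * sqrt(1 - r)."""
    return s * (1 - r) ** 0.5

scores = [70, 75, 80, 85, 90, 95, 100]
print(round(percentile_rank(85, scores), 1))   # 42.9 (3 of 7 scores fall below 85)
print(t_score(z_score(85, scores)))            # 50.0 (85 is exactly the mean)
print(round(sem(10, 0.91), 1))                 # 3.0 (s = 10, reliability = .91)
```

The last line shows why high reliability matters: with $r = .91$ the SEM shrinks to 3 points on a 10-point-SD scale.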

Item Response Theory (IRT)

  • Core Concept
    • Focuses on the characteristics of individual items rather than total scores. Also called "Latent Trait Theory."
  • The Item Characteristic Curve (ICC)
    • A graph with $\theta$ (theta), the underlying ability, on the x-axis and $P(\theta)$, the probability of a correct response, on the y-axis.
    • Difficulty ($b$): The point on the x-axis where the probability of success is 0.50.
    • Discrimination ($a$): Represented by the steepness of the curve.
    • Guessing ($c$): The probability of low-ability test takers getting the item right by chance.
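The three parameters above correspond to the three-parameter logistic (3PL) form of the ICC. The sketch below uses the basic logistic parameterization (omitting the scaling constant $D \approx 1.7$ that some texts include); the item parameters are invented:

```python
# Sketch of a 3PL item characteristic curve (basic logistic form).
# a = discrimination, b = difficulty, c = guessing; theta = latent ability.
import math

def icc_3pl(theta, a, b, c):
    """P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta == b with no guessing (c = 0), success probability is exactly 0.50,
# matching the definition of difficulty (b) above.
print(icc_3pl(0.0, a=1.2, b=0.0, c=0.0))             # 0.5
# A very low-ability examinee still scores near the guessing floor c.
print(round(icc_3pl(-3.0, a=1.2, b=0.0, c=0.2), 3))  # 0.221
```

Raising `a` steepens the curve around `b`, which is what the discrimination parameter represents.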

The Tao and How of Testing: Item Construction

  • Short-Answer and Completion Items
    • Best for lower-level thinking (memorization, facts).
    • Advantage: Minimizes guessing (no options provided).
  • Essay Items
    • Best for higher-order thinking (synthesis, analysis).
    • Open-ended vs. Closed-ended (restricted) formats.
    • Scoring requires batched grading, model answers, and anonymity to reduce bias.
  • Multiple-Choice Items
    • Anatomy: Stem (premise), Key (correct alternative), and Distracters (plausible incorrect options).
    • Difficulty Index ($D$): Proportion who got the item right: $D = \frac{N_h + N_l}{T}$, where $N_h$ and $N_l$ are the numbers answering correctly in the high- and low-scoring groups and $T$ is the total number of test takers.
    • Discrimination Index ($d$): How well the item separates high from low performers: $d = \frac{N_h - N_l}{0.5T}$.
  • Matching and True-False
    • Matching: Uses premises (Column A) and options (Column B).
    • True-False: Dichotomous format. Correction for guessing: $CS = R - W$, where $R$ is the number right and $W$ the number wrong.
  • Portfolios
    • Systematic collections of work showing progress over time. Both formative (ongoing) and summative (final) evaluation.
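The item-analysis indices above can be sketched as follows; the group counts are invented for illustration:

```python
# Sketch of the item-analysis formulas from these notes, with made-up counts.

def difficulty_index(n_high_correct, n_low_correct, total):
    """D = (N_h + N_l) / T: proportion answering the item correctly."""
    return (n_high_correct + n_low_correct) / total

def discrimination_index(n_high_correct, n_low_correct, total):
    """d = (N_h - N_l) / (0.5 * T): separation of high vs. low performers."""
    return (n_high_correct - n_low_correct) / (0.5 * total)

def corrected_score(right, wrong):
    """True-false correction for guessing: CS = R - W."""
    return right - wrong

# 50 examinees: 22 of the top half and 8 of the bottom half got the item right
print(difficulty_index(22, 8, 50))       # 0.6  (moderate difficulty)
print(discrimination_index(22, 8, 50))   # 0.56 (item separates the groups well)
print(corrected_score(40, 6))            # 34
```

A positive $d$ means high scorers outperformed low scorers on the item; a near-zero or negative $d$ flags the item for revision.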

Areas of Assessment

  • Intelligence Tests
    • Historically based on Spearman’s g factor (general factor) vs. Thurstone’s Primary Mental Abilities.
    • Robert Sternberg’s Triarchic Theory: Componential, Experiential, and Contextual.
    • Howard Gardner’s Multiple Intelligences: Musical, Bodily-Kinesthetic, Logical-Mathematical, Linguistic, Spatial, Interpersonal, Intrapersonal, and Naturalist.
    • Emotional Intelligence (Goleman): Focuses on self-awareness and empathy.
  • Neuropsychological Testing
    • Assessment of cognitive skills relating to brain function.
    • Areas covered: Memory, Language, Visuospatial ability, and Executive Function (e.g., Stroop Test).
  • Personality Testing
    • Objective: Clear stimuli (e.g., MMPI-2, NEO-4).
    • Projective: Ambiguous stimuli (e.g., Rorschach Inkblot, Thematic Apperception Test or TAT).
  • Career Choices
    • John Holland’s Hexagon (RIASEC model): Realistic, Investigative, Artistic, Social, Enterprising, Conventional.

Legal and Ethical Issues

  • Major Legislation
    • NCLB (2002): No Child Left Behind; focused on closing achievement gaps through high-stakes testing.
    • IDEA (1997 reauthorization; originally PL 94-142, 1975): Individuals with Disabilities Education Act; guarantees free appropriate public education in the "Least Restrictive Environment" (LRE).
    • FERPA (1974): Protects the privacy of student education records.
    • Truth in Testing: New York law requiring disclosure of items and scoring processes for admissions tests.
  • Ethics
    • Key principles: No physical or psychological harm, informed consent, confidentiality, anonymity, and appropriate use of incentives.