Does not provide information on the distance between ranks.
Interval Level
Values are based on an underlying continuum with equal intervals.
Allows for the calculation of differences between scores (e.g., IQ scores, where the difference between 100 and 102 is the same as the difference between 102 and 104).
Does not have an absolute zero.
Ratio Level
Characterized by all properties of nominal, ordinal, and interval scales, plus an absolute zero (absence of the trait).
Examples: Height, weight, number of finger taps.
Hard to apply to psychological constructs (e.g., a 0 on a spelling test does not mean zero spelling ability).
Reliability: Consistency of Measurement
The Reliability Formula
Observed Score = True Score + Error Score.
True Score: The theoretical, 100% accurate reflection of underlying knowledge.
Error Score: Difference between observed and true scores, consisting of Trait Error (individual factors like fatigue) and Method Error (situational factors like unclear instructions).
Test-Retest: Consistency over time; measured by correlating scores from two different time points.
Parallel Forms: Similarity between two different versions of the same test.
Internal Consistency: Whether items in a single test measure one dimension. Tools include:
Split-Half: Correlating odd vs. even items.
Cronbach’s Alpha ($\alpha$): Correlating each item with the total score for non-binary items (e.g., Likert scales).
Kuder-Richardson (KR20): Internal consistency for binary (Right/Wrong) items.
Interrater Reliability: Level of agreement between two or more observers. Formula: $\text{Interrater Reliability} = \frac{\text{Number of Agreements}}{\text{Number of Possible Agreements}}$.
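As a rough illustration of the internal-consistency and interrater indices listed above, the sketch below uses the standard variance-based formulas for Cronbach's alpha and KR-20. These notes do not spell those formulas out, so treat them as an assumption; the function names and the example ratings are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Alpha for a score matrix: one row per test taker, one column per item.

    Assumed formula: alpha = k/(k-1) * (1 - sum(item variances) / variance of totals).
    """
    k = len(scores[0])
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def kr20(scores):
    """KR-20 for right/wrong items coded 1/0 (same layout as above).

    Assumed formula: k/(k-1) * (1 - sum(p*q) / variance of totals),
    where p is the proportion answering an item correctly and q = 1 - p.
    """
    k = len(scores[0])
    n = len(scores)
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n
        sum_pq += p * (1 - p)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_pq / total_var)


def interrater_agreement(rater_a, rater_b):
    """Number of agreements divided by number of possible agreements."""
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)


# Hypothetical observations from two raters over four intervals.
rater_a = ["on", "off", "on", "on"]
rater_b = ["on", "on", "on", "off"]
print(interrater_agreement(rater_a, rater_b))  # 2 agreements out of 4 -> 0.5
```

For items scored 0/1, the two internal-consistency functions return the same value, which is one way to see that KR-20 is the binary-item case of alpha. Neither function guards against a total-score variance of zero.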
The Spearman-Brown Formula
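Used to correct split-half reliability because shortening a test reduces its reliability. Formula: $r_t = \frac{2r_h}{1 + r_h}$, where $r_h$ is the correlation between the two halves and $r_t$ is the estimated reliability of the full-length test.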
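A minimal sketch of the split-half procedure with the Spearman-Brown correction, assuming the odd vs. even split described earlier and a plain Pearson correlation between the two half scores. The helper names and the layout of `scores` (one row per test taker, one column per item) are illustrative assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-test reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(scores):
    """Return (r_half, corrected reliability) for an odd/even item split."""
    odd_totals = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)


# A half-test correlation of 0.60 is corrected upward to about 0.75.
print(round(spearman_brown(0.60), 2))
```

The correction always raises a positive half-test correlation (unless it is already 1.0), which reflects the point above that the shortened halves understate the reliability of the full test.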
Validity: Accuracy of Measurement
General Definition
The extent to which inferences made from a test are appropriate, meaningful, and useful.
A test must be reliable before it can be valid, but a reliable test is not necessarily valid.
Types of Validity
Content Validity: Whether items sample the entire universe of possible items for the domain (essential for achievement tests).
Criterion Validity: How well a test correlates with an external criterion.
Concurrent: Criterion is measured at the same time.
Predictive: Test predicts future performance (e.g., GRE scores predicting grad school GPA).
Construct Validity: Most complex; whether a test measures an underlying theoretical construct (e.g., shyness).
Multitrait-Multimethod Matrix: Measures multiple traits with multiple methods to establish Convergent validity (different methods measuring the same trait correlate highly) and Discriminant validity (measures of different traits correlate weakly).
Norms, Percentiles, and Standard Scores
Percentiles ($P_r$)
Indicates the point below which a certain percentage of scores fall. Formula: $P_r = \frac{B}{N} \times 100$, where $B$ is the number of lower values and $N$ is total observations.
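The percentile formula can be sketched in a few lines. The function name and the sample scores are made up for illustration.

```python
def percentile_rank(score, scores):
    """Percentage of observations that fall below `score` (B / N * 100)."""
    below = sum(1 for s in scores if s < score)  # B: number of lower values
    return below * 100 / len(scores)             # N: total observations


scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_rank(85, scores))  # 6 of 10 scores are lower -> 60.0
```

Note that this counts only strictly lower values, as in the formula above; some texts also credit half of the tied scores, which this sketch does not do.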
Stanines
Divides a distribution into nine bands numbered 1 through 9 (each interior band is half a standard deviation wide). Mean = 5, SD = 2.
Standard Scores
z Score: Represents the number of standard deviations a score is from the mean. Formula: $z = \frac{X - \bar{X}}{s}$.
T Score: Transformed score to eliminate negatives and decimals. Formula: T=50+10z.
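A short sketch of the two standard-score conversions, assuming the mean and standard deviation of the distribution are already known. The function names and the example values are illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def t_score(z):
    """Rescale a z score to a distribution with mean 50 and SD 10."""
    return 50 + 10 * z


# A raw score of 110 on a test with mean 100 and SD 10.
z = z_score(110, 100, 10)
print(z, t_score(z))  # 1.0 60.0

# A score 1.5 SDs below the mean is negative as a z score,
# but becomes a positive T score.
print(t_score(-1.5))  # 35.0
```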
Standard Error of Measurement (SEM)
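Measure of variability in an individual's score upon repeated testing. Formula: $SEM = s\sqrt{1 - r}$, where $s$ is the standard deviation of the test scores and $r$ is the reliability coefficient of the test.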
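To make the SEM concrete, the sketch below computes it and then builds a band of plus or minus one SEM around an observed score. Reading that band as the range likely to contain the true score is a common interpretation that assumes normally distributed error; it is an assumption here, as are the function names and example values.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = s * sqrt(1 - r)."""
    return sd * sqrt(1 - reliability)


def score_band(observed, sd, reliability, n_sem=1):
    """Return (low, high): the observed score plus or minus n_sem SEMs."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - n_sem * sem, observed + n_sem * sem


# Hypothetical test with SD = 10 and reliability = 0.84: SEM is about 4.
print(round(standard_error_of_measurement(10, 0.84), 2))

# An observed score of 100 then has a one-SEM band of roughly 96 to 104.
low, high = score_band(100, 10, 0.84)
print(round(low, 2), round(high, 2))
```

As reliability approaches 1.0 the SEM shrinks toward zero, and with a reliability of 0 the SEM equals the standard deviation of the test.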
Item Response Theory (IRT)
Core Concept
Focuses on the characteristics of individual items rather than total scores. Also called "Latent Trait Theory."
The Item Characteristic Curve (ICC)
A graph with theta ($\theta$) on the x-axis, representing underlying ability, and $P(\theta)$ on the y-axis, representing the probability of a correct response.
Difficulty ($b$): The point on the x-axis where the probability of success is 0.50.
Discrimination ($a$): Represented by the steepness of the curve.
Guessing ($c$): The probability of low-ability test takers getting the item right by chance.
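One common way to draw such a curve is the three-parameter logistic (3PL) model, sketched below. These notes do not name a specific model, so the logistic form is an assumption; the function name is illustrative, and the parameters $a$, $b$, and $c$ follow the definitions above.

```python
from math import exp


def icc(theta, a, b, c=0.0):
    """Probability of a correct response under an assumed 3PL model.

    theta: ability level, a: discrimination (steepness),
    b: difficulty (location on the x-axis), c: guessing (lower asymptote).
    """
    return c + (1 - c) / (1 + exp(-a * (theta - b)))


# With no guessing, the probability at theta == b is 0.50.
print(icc(0.0, a=1.0, b=0.0, c=0.0))  # 0.5

# A larger a gives a steeper curve: just above b, the probability is higher.
print(icc(0.5, a=2.0, b=0.0) > icc(0.5, a=1.0, b=0.0))  # True

# With guessing, very low-ability test takers still succeed at about rate c.
print(round(icc(-10.0, a=1.0, b=0.0, c=0.25), 3))
```

Under this model, the 0.50 rule for difficulty holds exactly only when $c = 0$; with a nonzero guessing parameter, the probability at $\theta = b$ is $(1 + c)/2$.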
The Tao and How of Testing: Item Construction
Short-Answer and Completion Items
Best for lower-level thinking (memorization, facts).
Advantage: Minimizes guessing (no options provided).
Essay Items
Best for higher-order thinking (synthesis, analysis).
Open-ended vs. Closed-ended (restricted) formats.
Scoring requires batched grading, model answers, and anonymity to reduce bias.
NCLB (2002): No Child Left Behind; focused on closing achievement gaps through high-stakes testing.
IDEA (1997 reauthorization; originally PL 94-142, 1975): Individuals with Disabilities Education Act; guarantees a free appropriate public education in the "Least Restrictive Environment" (LRE).
FERPA (1974): Protects the privacy of student education records.
Truth in Testing: New York law requiring disclosure of items and scoring processes for admissions tests.
Ethics
Key principles: No physical or psychological harm, informed consent, confidentiality, anonymity, and appropriate use of incentives.