Psychometrics
The science behind psychological and educational measurement.
Test
A standardized procedure producing numerical scores.
Reliability
Degree to which test scores are consistent and dependable.
Validity
Whether a test measures what it claims to measure (most important quality).
Sample vs Population
Sample: the observed group; population: the larger group the sample represents.
Descriptive Statistics
Describe a sample (means, SDs, distributions).
Inferential Statistics
Use sample data to make conclusions about a population.
Mean
Average of all scores.
Median
Middle value (50% above/below).
Mode
Most frequent score.
Range
Max − min.
Variance
Average squared deviation from the mean.
Standard Deviation (SD, σ)
Square root of variance; typical distance from mean.
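The descriptive statistics above are easy to verify directly; a minimal sketch using Python's standard library, with a made-up score list:

```python
from statistics import mean, median, mode, pvariance, pstdev

scores = [85, 90, 90, 95, 100, 105, 110, 110, 110, 120]  # illustrative data

print(mean(scores))               # mean: average of all scores
print(median(scores))             # median: middle value
print(mode(scores))               # mode: most frequent score (110)
print(max(scores) - min(scores))  # range: max − min
print(pvariance(scores))          # variance: average squared deviation
print(pstdev(scores))             # SD: square root of variance
```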
Frequency Distribution
Shows how often each score occurs.
Histogram
Graph of frequency distribution for continuous variable.
Normal Distribution
Symmetric bell curve where mean = median = mode.
Empirical Rule
~68% within ±1 SD, ~95% within ±2 SD, ~99.7% within ±3 SD.
Z-score
z = (x − M)/SD; number of SDs a score is from the mean.
Z-score properties
Mean = 0, SD = 1.
Percentile
Percent of sample below a score (no 100th percentile).
Z to percentile heuristic
z = −2 ≈ 2nd %ile; −1 ≈ 16th; 0 = 50th; 1 ≈ 84th; 2 ≈ 98th.
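A quick check of the z-score formula and the z-to-percentile heuristic, using the standard library's NormalDist (the score, mean, and SD values are illustrative):

```python
from statistics import NormalDist

def z_score(x, m, sd):
    # z = (x − M) / SD: number of SDs a score lies from the mean
    return (x - m) / sd

z = z_score(130, 100, 15)        # illustrative: score 130, M = 100, SD = 15
pct = NormalDist().cdf(z) * 100  # percent of the distribution below z
print(z, round(pct))             # 2.0 98 → matches 'z = 2 ≈ 98th %ile'
```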
Standard Score (IQ-style)
Mean = 100, SD = 15.
Subtest Scaled Score
Mean = 10, SD = 3.
T-score
Mean = 50, SD = 10.
Clinically significant T-score
Often T ≥ 65 (1.5 SD) or T ≥ 70 (2 SD) depending on context.
Stanine
9-point scale; 4-6 = average.
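All of the standard metrics above (IQ-style, T, scaled, stanine) are linear rescalings of z; a minimal sketch, with the stanine clamped to its 1–9 range:

```python
def from_z(z, mean, sd):
    """Convert a z-score to a scale with the given mean and SD."""
    return mean + z * sd

z = 1.0                                        # one SD above the mean
print(from_z(z, 100, 15))                      # IQ-style standard score: 115.0
print(from_z(z, 50, 10))                       # T-score: 60.0
print(from_z(z, 10, 3))                        # subtest scaled score: 13.0
print(min(9, max(1, round(from_z(z, 5, 2)))))  # stanine (mean 5, SD 2): 7
```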
Age-equivalent score
Matches median performance at a given age (controversial/misleading).
Grade-equivalent score
Matches median performance at a given grade (controversial).
Developmental scores caveat
Can be misinterpreted; not equal-interval.
Norm-referenced test
Interprets scores vs a norm group.
Criterion-referenced test
Interprets scores against absolute standard/criterion.
Norm group
Standardization sample used for comparisons; should be representative and recent.
Percentile rank interpretation
e.g., 66th %ile means scored above 66% of norm group.
Raw score
Number of items correct (unadjusted).
Reliability (CTT) formula concept
observed score X = true score T + error E.
Reliability coefficient
Correlation (0 to 1) estimating proportion of variance from true scores.
Reliability thresholds
≥ .90 excellent (high-stakes); ≥ .80 good (screening); ≥ .70 acceptable (research).
Alternate-form reliability
Correlation between two different but parallel test forms.
Parallel tests
Two forms designed to give same true score to same person.
Test-retest reliability
Correlation of same test across time (stability).
Interscorer (interrater) reliability
Agreement between different scorers/raters.
Split-half reliability
Correlate two halves of a single test (internal consistency).
Cronbach's alpha (α)
Average of all possible split-half reliabilities; common internal consistency measure.
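Cronbach's alpha can be computed directly from an item-score matrix via α = k/(k − 1) · (1 − Σ item variances / total-score variance); a minimal numpy sketch with made-up data:

```python
import numpy as np

# Made-up data: 5 people (rows) × 3 items (columns).
items = np.array([[4, 5, 4],
                  [2, 3, 2],
                  [5, 5, 4],
                  [3, 3, 3],
                  [1, 2, 2]])

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)      # variance of each item
total_var = items.sum(axis=1).var(ddof=1)  # variance of the total scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))                     # ≈ 0.96 for this toy data
```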
Omega
McDonald's omega; a factor-model-based internal-consistency coefficient that does not assume equal item loadings, advocated by some psychometricians as an alternative to alpha.
Short tests & reliability
Shorter tests generally less reliable (sample less content).
Practice effects
Improved scores on retest due to familiarity, lowering test-retest usefulness.
ICC (Intraclass Correlation)
Measures agreement (often for interrater).
Cohen's kappa
Agreement beyond chance for categorical/binary ratings.
Percent agreement
Simple interrater metric (inflated by chance).
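Percent agreement versus Cohen's kappa for two raters making binary calls (the ratings below are hypothetical); kappa corrects observed agreement for the agreement expected by chance, κ = (p_o − p_e)/(1 − p_e):

```python
rater_a = [1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1]

n = len(rater_a)
p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # percent agreement

# Chance agreement: both rate 1 by chance, plus both rate 0 by chance.
p_a1, p_b1 = sum(rater_a) / n, sum(rater_b) / n
p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)

kappa = (p_o - p_e) / (1 - p_e)
print(p_o, round(kappa, 2))  # 0.75 0.47 — kappa is well below raw agreement
```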
SEM (Standard Error of Measurement)
SEM = SD × √(1 − reliability).
SEM interpretation
SD of distribution of observed scores around the true score.
SEM empirical rule
~68% within ±1 SEM, 95% within ±2 SEM, 99.7% within ±3 SEM.
Confidence band/interval for observed score
Observed ± (z * SEM) e.g., ±1 SEM = 68% CI.
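The SEM and confidence-band arithmetic from the cards above, with made-up SD, reliability, and observed-score values:

```python
import math

sd, reliability = 15, 0.91  # illustrative IQ-style scale
sem = sd * math.sqrt(1 - reliability)
print(round(sem, 1))        # 4.5 points

observed = 110
for z, level in [(1, "68%"), (2, "95%")]:
    print(f"{level} band: {observed - z * sem:.1f}–{observed + z * sem:.1f}")
```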
Regression to the mean
Extreme scores tend to move closer to average on retest.
Reliability depends on sample
Estimates generalize only to similar samples.
Effect of restricted range
Reduces correlations (and reliability/validity coefficients).
Classical Test Theory (CTT)
Observed = True + Error; widely used for reliability.
Item Response Theory (IRT)
Models probability of correct response as function of ability and item properties.
IRT item parameters
Difficulty, discrimination (and guessing for some models).
Computerized Adaptive Testing (CAT)
Uses IRT to tailor items to examinee ability for efficiency.
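A minimal sketch of an IRT item response function (the 3PL model; with guessing c = 0 it reduces to the 2PL). The parameter values are illustrative:

```python
import math

def p_correct(theta, a, b, c=0.0):
    """P(correct) given ability theta, discrimination a,
    difficulty b, and guessing parameter c."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# An item of average difficulty (b = 0): higher ability → higher P(correct).
for theta in (-2, 0, 2):
    print(theta, round(p_correct(theta, a=1.5, b=0.0, c=0.2), 2))
# -2 0.24   0 0.6   2 0.96
```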
Convergent validity
Test correlates strongly with other measures of same/related constructs (prefer r ≥ .50).
Discriminant (divergent) validity
Test correlates weakly with unrelated constructs (closer to 0 is better).
Construct validity
Evidence that test measures the intended construct (pattern of correlations, groups, change).
Content validity
Items adequately sample domain; often judged by experts.
Face validity
Superficial judgment: does test 'look' like it measures the construct? (weak evidence).
Criterion-related validity
Correlate scores with external criterion (predictive or concurrent).
Predictive validity
Test predicts future outcome (e.g., SAT → college GPA).
Concurrent validity
Test correlates with criterion measured at same time.
Incremental validity
Degree to which a test adds unique predictive information beyond existing measures.
Internal structure evidence
Factor structure matches the construct's theoretical structure.
Factor analysis
Identifies underlying factors that explain correlations among items/subtests.
Exploratory Factor Analysis (EFA)
Discover factor structure without prior hypothesis.
Confirmatory Factor Analysis (CFA)
Test fit of hypothesized measurement model to data.
PCA vs EFA
Principal Components Analysis (PCA) summarizes total variance into components, whereas EFA models only shared variance with latent factors; PCA is sometimes used as a simpler exploratory alternative.
Factor
New variable representing shared variance of a set of observed variables.
Factor loading
Correlation/weight of an item with a factor; high loading = item strongly related to factor.
Factor score
Weighted average of observed variables for an individual on a factor.
% Variance explained
Amount of total variance accounted for by each factor.
Naming factors
Based on commonality of variables loading highly on that factor.
High loading example
.80 = strong relationship between item and factor.
Cross-loading
Item loads strongly on multiple factors; may indicate ambiguity or multidimensionality.
EFA first factor
Typically explains the largest share of variance.
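A minimal EFA-style sketch using scikit-learn's FactorAnalysis on made-up data: six items are generated from two latent factors, and the estimated loadings recover the planted two-factor structure:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))                # two true factors
loadings = np.array([[.8, 0], [.7, 0], [.6, 0],   # items 1–3 load on factor 1
                     [0, .8], [0, .7], [0, .6]])  # items 4–6 load on factor 2
items = latent @ loadings.T + rng.normal(scale=.5, size=(500, 6))

fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_.T, 2))  # estimated loadings (items × factors)
print(fa.transform(items)[:3])        # factor scores for the first 3 people
```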
CFA fit indices
Check model fit: Chi-square, CFI, TLI, RMSEA, SRMR, AIC, etc.
Chi-square in CFA
Sensitive to sample size; nonsignificant = good fit (rare with large N).
CFI/TLI thresholds
CFI/TLI ≥ .95 = good; ≥ .90 acceptable (context matters).
RMSEA threshold
≤ .08 acceptable; ≤ .05 good.
SRMR threshold
≤ .08 acceptable; ≤ .05 good.
Chi-square/df (normed χ²)
≤ 5 acceptable; ≤ 2 good.
AIC
Lower values better for model comparison (not absolute).
Sample size & CFA
Larger samples make χ² sensitive; interpret multiple indices.
Use of factor analysis in testing
Provides internal-structure evidence and guides subscale development.
Internal structure & score interpretation
If structure doesn't support indexes, be cautious interpreting index scores.
Composite vs subtest scores
Composites often have higher reliability than subscores.
Base rates
Prevalence of condition in population; affects PPV and NPV.
Sensitivity
Probability test correctly identifies true positives (TP/(TP+FN)).
Specificity
Probability test correctly identifies true negatives (TN/(TN+FP)).
PPV (Positive Predictive Value)
Probability a person has the condition given a positive test (TP/(TP+FP)).
NPV (Negative Predictive Value)
Probability a person does NOT have the condition given a negative test (TN/(TN+FN)).
Sensitivity vs Specificity tradeoff
Lowering the cutoff ↑ sensitivity but ↓ specificity (and vice versa).
PPV/NPV depend on base rate
Higher prevalence → higher PPV, lower NPV; lower prevalence → lower PPV, higher NPV.
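PPV and NPV follow from sensitivity, specificity, and the base rate via Bayes' rule; a minimal sketch with hypothetical values, showing PPV collapsing when the condition is rare even though the test itself is unchanged:

```python
def ppv_npv(sens, spec, base_rate):
    tp = sens * base_rate              # true positives (population fractions)
    fp = (1 - spec) * (1 - base_rate)  # false positives
    tn = spec * (1 - base_rate)        # true negatives
    fn = (1 - sens) * base_rate        # false negatives
    return tp / (tp + fp), tn / (tn + fn)

for prevalence in (0.50, 0.05):
    ppv, npv = ppv_npv(sens=0.90, spec=0.90, base_rate=prevalence)
    print(f"base rate {prevalence:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
# base rate 50%: PPV 0.90, NPV 0.90
# base rate 5%:  PPV 0.32, NPV 0.99
```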