Psychometrics and Statistical Concepts in Educational Testing


411 Terms

1
New cards

Psychometrics

The science behind psychological and educational measurement.

2
New cards

Test

A standardized procedure producing numerical scores.

3
New cards

Reliability

Degree to which test scores are consistent and dependable.

4
New cards

Validity

Whether a test measures what it claims to measure (most important quality).

5
New cards

Sample vs Population

Sample: the observed group; Population: the larger group the sample represents.

6
New cards

Descriptive Statistics

Describe a sample (means, SDs, distributions).

7
New cards

Inferential Statistics

Use sample data to make conclusions about a population.

8
New cards

Mean

Average of all scores.

9
New cards

Median

Middle value (50% above/below).

10
New cards

Mode

Most frequent score.

11
New cards

Range

Max − min.

12
New cards

Variance

Average squared deviation from the mean.

13
New cards

Standard Deviation (SD, σ)

Square root of variance; typical distance from mean.
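
The mean, variance, and SD definitions above can be checked with a short stdlib sketch (the scores are hypothetical):

```python
import statistics

scores = [85, 90, 100, 110, 115]  # hypothetical test scores

mean = statistics.fmean(scores)    # arithmetic mean
var = statistics.pvariance(scores) # population variance: average squared deviation
sd = statistics.pstdev(scores)     # SD: square root of the variance

print(mean)  # 100.0
print(var)   # squared deviations 225,100,0,100,225 → 650/5 = 130.0
print(sd)    # sqrt(130) ≈ 11.40
```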

14
New cards

Frequency Distribution

Shows how often each score occurs.

15
New cards

Histogram

Graph of frequency distribution for continuous variable.

16
New cards

Normal Distribution

Symmetric bell curve where mean = median = mode.

17
New cards

Empirical Rule

~68% within ±1 SD, ~95% within ±2 SD, ~99.7% within ±3 SD.

18
New cards

Z-score

z = (x − M)/SD; number of SDs a score is from the mean.
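
A quick sketch of the formula (the values are hypothetical):

```python
def z_score(x, mean, sd):
    """Number of SDs a score lies from the mean: z = (x - M) / SD."""
    return (x - mean) / sd

z = z_score(115, 100, 15)  # IQ-style score of 115 (mean 100, SD 15)
print(z)  # 1.0
```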

19
New cards

Z-score properties

Mean = 0, SD = 1.

20
New cards

Percentile

Percent of sample below a score (no 100th percentile).

21
New cards

Z to percentile heuristic

z = −2 ≈ 2nd %ile; −1 ≈ 16th; 0 = 50th; 1 ≈ 84th; 2 ≈ 98th.
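
The heuristic above follows from the normal CDF, which the stdlib can evaluate via `math.erf`; a minimal check:

```python
import math

def normal_percentile(z):
    """Percent of the normal distribution falling below a given z: Phi(z) * 100."""
    return 50 * (1 + math.erf(z / math.sqrt(2)))

for z in (-2, -1, 0, 1, 2):
    print(z, round(normal_percentile(z), 1))
# -2 → 2.3, -1 → 15.9, 0 → 50.0, 1 → 84.1, 2 → 97.7
```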

22
New cards

Standard Score (IQ-style)

Mean = 100, SD = 15.

23
New cards

Subtest Scaled Score

Mean = 10, SD = 3.

24
New cards

T-score

Mean = 50, SD = 10.
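
The three scales above (standard, scaled, T) are all linear transformations of z; a minimal sketch:

```python
def to_standard(z):  # IQ-style standard score: mean 100, SD 15
    return 100 + 15 * z

def to_scaled(z):    # subtest scaled score: mean 10, SD 3
    return 10 + 3 * z

def to_t(z):         # T-score: mean 50, SD 10
    return 50 + 10 * z

z = 1.0  # one SD above the mean
print(to_standard(z), to_scaled(z), to_t(z))  # 115.0 13.0 60.0
```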

25
New cards

Clinically significant T-score

Often T ≥ 65 (1.5 SD) or T ≥ 70 (2 SD) depending on context.

26
New cards

Stanine

9-point scale; 4-6 = average.

27
New cards

Age-equivalent score

Matches median performance at a given age (controversial/misleading).

28
New cards

Grade-equivalent score

Matches median performance at a given grade (controversial).

29
New cards

Developmental scores caveat

Can be misinterpreted; not equal-interval.

30
New cards

Norm-referenced test

Interprets scores vs a norm group.

31
New cards

Criterion-referenced test

Interprets scores against absolute standard/criterion.

32
New cards

Norm group

Standardization sample used for comparisons; should be representative and recent.

33
New cards

Percentile rank interpretation

e.g., 66th %ile means scored above 66% of norm group.

34
New cards

Raw score

Number of items correct (unadjusted).

35
New cards

Reliability (CTT) formula concept

observed score X = true score T + error E.

36
New cards

Reliability coefficient

Correlation (0 to 1) estimating proportion of variance from true scores.

37
New cards

Reliability thresholds

≥.90 excellent (high-stakes); ≥.80 good (screening); ≥.70 acceptable (research).

38
New cards

Alternate-form reliability

Correlation between two different but parallel test forms.

39
New cards

Parallel tests

Two forms designed to give same true score to same person.

40
New cards

Test-retest reliability

Correlation of same test across time (stability).
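
Test-retest (like other correlational reliability estimates) reduces to a Pearson correlation between paired scores; a stdlib-only sketch with made-up scores:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between paired scores (e.g., test and retest)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

time1 = [10, 12, 14, 16, 18]  # hypothetical first-administration scores
time2 = [11, 12, 15, 15, 19]  # the same people at retest
print(round(pearson_r(time1, time2), 2))  # ≈ 0.96, high stability
```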

41
New cards

Interscorer (interrater) reliability

Agreement between different scorers/raters.

42
New cards

Split-half reliability

Correlate two halves of a single test (internal consistency).
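
The half-length correlation is conventionally stepped up with the Spearman-Brown formula to estimate full-test reliability; a minimal sketch:

```python
def spearman_brown(r_half):
    """Step up a half-test correlation to estimate full-length reliability."""
    return 2 * r_half / (1 + r_half)

# If odd and even halves correlate r = .70, the full test's estimated reliability is:
print(round(spearman_brown(0.70), 3))  # 0.824
```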

43
New cards

Cronbach's alpha (α)

Average of all possible split-half reliabilities; common internal consistency measure.
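
Alpha can also be computed directly from item variances; a sketch using the standard formula on hypothetical data:

```python
import statistics

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
    items: one inner list per item, same respondents in the same order."""
    k = len(items)
    item_vars = sum(statistics.pvariance(item) for item in items)
    totals = [sum(resp) for resp in zip(*items)]
    return (k / (k - 1)) * (1 - item_vars / statistics.pvariance(totals))

# Three hypothetical items answered by four respondents:
items = [[2, 4, 4, 5], [3, 4, 5, 5], [2, 3, 4, 5]]
print(round(cronbach_alpha(items), 3))  # 0.964
```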

44
New cards

Omega

Alternative to alpha advocated by some psychometricians.

45
New cards

Short tests & reliability

Shorter tests generally less reliable (sample less content).

46
New cards

Practice effects

Improved scores on retest due to familiarity, lowering test-retest usefulness.

47
New cards

ICC (Intraclass Correlation)

Measures agreement (often for interrater).

48
New cards

Cohen's kappa

Agreement beyond chance for categorical/binary ratings.
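
Kappa corrects the raw percent agreement (next card) for agreement expected by chance; a sketch with hypothetical binary ratings:

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters' categorical ratings."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n  # raw percent agreement
    cats = set(a) | set(b)
    p_chance = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (p_obs - p_chance) / (1 - p_chance)

rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
rater2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
# Raters agree on 6/8 cases (75%), but chance agreement is 50% here:
print(cohens_kappa(rater1, rater2))  # 0.5
```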

49
New cards

Percent agreement

Simple interrater metric (inflated by chance).

50
New cards

SEM (Standard Error of Measurement)

SEM = SD_x * sqrt(1 − reliability).

51
New cards

SEM interpretation

SD of distribution of observed scores around the true score.

52
New cards

SEM empirical rule

~68% within ±1 SEM, 95% within ±2 SEM, 99.7% within ±3 SEM.

53
New cards

Confidence band/interval for observed score

Observed ± (z * SEM) e.g., ±1 SEM = 68% CI.
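
The SEM formula and confidence band from the cards above, as one sketch (the score, SD, and reliability are hypothetical):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def confidence_band(observed, sd, reliability, z=1.96):
    """Observed ± z * SEM (z = 1.96 for a 95% band, 1.0 for ~68%)."""
    e = z * sem(sd, reliability)
    return observed - e, observed + e

# IQ-style score of 110 on a test with SD 15 and reliability .91:
print(round(sem(15, 0.91), 2))  # 4.5
lo, hi = confidence_band(110, 15, 0.91, z=1.0)
print(round(lo, 1), round(hi, 1))  # 105.5 114.5 → 68% band
```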

54
New cards

Regression to the mean

Extreme scores tend to move closer to average on retest.

55
New cards

Reliability depends on sample

Estimates generalize only to similar samples.

56
New cards

Effect of restricted range

Reduces correlations (and reliability/validity coefficients).

57
New cards

Classical Test Theory (CTT)

Observed = True + Error; widely used for reliability.

58
New cards

Item Response Theory (IRT)

Models probability of correct response as function of ability and item properties.

59
New cards

IRT item parameters

Difficulty, discrimination (and guessing for some models).

60
New cards

Computerized Adaptive Testing (CAT)

Uses IRT to tailor items to examinee ability for efficiency.

61
New cards

Convergent validity

Test correlates strongly with other measures of same/related constructs (prefer r ≥ .50).

62
New cards

Discriminant (divergent) validity

Test correlates weakly with unrelated constructs (closer to 0 is better).

63
New cards

Construct validity

Evidence that test measures the intended construct (pattern of correlations, groups, change).

64
New cards

Content validity

Items adequately sample domain; often judged by experts.

65
New cards

Face validity

Superficial judgment: does test 'look' like it measures the construct? (weak evidence).

66
New cards

Criterion-related validity

Correlate scores with external criterion (predictive or concurrent).

67
New cards

Predictive validity

Test predicts future outcome (e.g., SAT → college GPA).

68
New cards

Concurrent validity

Test correlates with criterion measured at same time.

69
New cards

Incremental validity

Degree to which a test adds unique predictive information beyond existing measures.

70
New cards

Internal structure evidence

Factor structure matches the construct's theoretical structure.

71
New cards

Factor analysis

Identifies underlying factors that explain correlations among items/subtests.

72
New cards

Exploratory Factor Analysis (EFA)

Discover factor structure without prior hypothesis.

73
New cards

Confirmatory Factor Analysis (CFA)

Test fit of hypothesized measurement model to data.

74
New cards

PCA vs EFA

Principal Components Analysis (PCA) analyzes total variance (data reduction), while EFA models only the shared variance among variables; PCA is sometimes used as an exploratory alternative to EFA.

75
New cards

Factor

New variable representing shared variance of a set of observed variables.

76
New cards

Factor loading

Correlation/weight of an item with a factor; high loading = item strongly related to factor.

77
New cards

Factor score

Weighted average of observed variables for an individual on a factor.

78
New cards

% Variance explained

Amount of total variance accounted for by each factor.
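
In the simplest two-variable case the arithmetic is transparent: a 2×2 correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 − r, so the first component's share of variance is (1 + r)/2. A sketch with a hypothetical r:

```python
# Two standardized variables correlating r = .60:
r = 0.60
eigenvalues = [1 + r, 1 - r]          # eigenvalues of [[1, r], [r, 1]]
total = sum(eigenvalues)              # equals the number of variables (2)
pct_explained = [ev / total for ev in eigenvalues]
print([round(p, 2) for p in pct_explained])  # [0.8, 0.2] → first component: 80%
```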

79
New cards

Naming factors

Based on commonality of variables loading highly on that factor.

80
New cards

High loading example

.80 = strong relationship between item and factor.

81
New cards

Cross-loading

Item loads strongly on multiple factors; may indicate ambiguity or multidimensionality.

82
New cards

EFA first factor

Typically explains the largest share of variance.

83
New cards

CFA fit indices

Check model fit: Chi-square, CFI, TLI, RMSEA, SRMR, AIC, etc.

84
New cards

Chi-square in CFA

Sensitive to sample size; nonsignificant = good fit (rare with large N).

85
New cards

CFI/TLI thresholds

CFI/TLI ≥ .95 = good; ≥ .90 acceptable (context matters).

86
New cards

RMSEA threshold

≤ .08 acceptable; ≤ .05 good.

87
New cards

SRMR threshold

≤ .05 indicates good fit.

88
New cards

Chi-square/df (normed χ²)

≤5 acceptable; ≤2 good.

89
New cards

AIC

Lower values better for model comparison (not absolute).

90
New cards

Sample size & CFA

Larger samples make χ² sensitive; interpret multiple indices.

91
New cards

Use of factor analysis in testing

Provides internal-structure evidence and guides subscale development.

92
New cards

Internal structure & score interpretation

If structure doesn't support indexes, be cautious interpreting index scores.

93
New cards

Composite vs subtest scores

Composites often have higher reliability than subscores.

94
New cards

Base rates

Prevalence of condition in population; affects PPV and NPV.

95
New cards

Sensitivity

Probability test correctly identifies true positives (TP/(TP+FN)).

96
New cards

Specificity

Probability test correctly identifies true negatives (TN/(TN+FP)).

97
New cards

PPV (Positive Predictive Value)

Probability a person has the condition given a positive test (TP/(TP+FP)).

98
New cards

NPV (Negative Predictive Value)

Probability a person does NOT have the condition given a negative test (TN/(TN+FN)).
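
The four conditional probabilities above can all be computed from one 2×2 confusion matrix (counts hypothetical):

```python
# Hypothetical screening results against a gold-standard diagnosis:
TP, FN = 45, 5    # 50 people who have the condition
TN, FP = 80, 20   # 100 people who do not

sensitivity = TP / (TP + FN)  # 45/50  = 0.90
specificity = TN / (TN + FP)  # 80/100 = 0.80
ppv = TP / (TP + FP)          # 45/65  ≈ 0.692
npv = TN / (TN + FN)          # 80/85  ≈ 0.941

print(sensitivity, specificity, round(ppv, 3), round(npv, 3))
```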

99
New cards

Sensitivity vs Specificity tradeoff

Lowering the cutoff ↑ sensitivity but ↓ specificity (and vice versa).

100
New cards

PPV/NPV depend on base rate

Higher prevalence → higher PPV, lower NPV; lower prevalence → lower PPV, higher NPV.
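
The base-rate dependence follows from Bayes' rule; a sketch holding sensitivity and specificity fixed while varying prevalence (all numbers hypothetical):

```python
def ppv_from_base_rate(sensitivity, specificity, prevalence):
    """Bayes' rule: P(has condition | positive test)."""
    tp = sensitivity * prevalence              # true-positive rate in the population
    fp = (1 - specificity) * (1 - prevalence)  # false-positive rate in the population
    return tp / (tp + fp)

# The same test (sens .90, spec .90) at two base rates:
print(round(ppv_from_base_rate(0.90, 0.90, 0.50), 3))  # 0.9   → common condition
print(round(ppv_from_base_rate(0.90, 0.90, 0.02), 3))  # 0.155 → rare condition
```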