Test Construction
Validity
The extent to which a test measures what it claims to measure.
Reliability
The consistency or stability of test scores across time, forms, or raters.
Construct Validity
The degree to which a test measures the theoretical construct it is intended to assess.
Content Validity
The extent to which a test represents all facets of the construct being measured.
Criterion-Related Validity
The extent to which a test's scores are related to an external criterion. This includes:
Predictive Validity: How well test scores predict future outcomes.
Concurrent Validity: How well test scores correlate with a criterion measure (often an established measure of the same construct) obtained at about the same time.
Test-Retest Reliability
The consistency of scores when the same test is administered to the same individuals at two different points in time.
Inter-Rater Reliability
The degree of agreement among different raters or evaluators.
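One common way to quantify agreement between two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is illustrative only: the function name `cohens_kappa` and the pass/fail ratings are hypothetical, and it assumes exactly two raters assigning categorical labels to the same items.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: proportion of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label counts.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1.0:
        # Kappa is undefined when both raters use one identical label throughout;
        # returning 1.0 here is an assumption made for this sketch.
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical ratings of six essays.
rater_a = ["pass", "pass", "fail", "pass", "fail", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Raw percent agreement would be higher than kappa here, because some of the matching labels would be expected by chance alone.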
Internal Consistency
The extent to which items within a test measure the same construct, often assessed using coefficients like Cronbach’s alpha.
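As a rough illustration of the coefficient mentioned above, Cronbach’s alpha can be computed from the item variances and the variance of the total scores: alpha = k / (k - 1) × (1 - sum of item variances / variance of totals), where k is the number of items. The function name `cronbach_alpha` and the response data are hypothetical, and sample (n - 1) variances are assumed.

```python
from statistics import variance

def cronbach_alpha(scores):
    """scores: one row per respondent, each row a list of item scores."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Transpose rows (respondents) into columns (items).
    items = list(zip(*scores))
    item_variances = [variance(item) for item in items]
    # Variance of each respondent's total score (must be non-zero).
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses: five respondents, three items rated 1 to 5.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 3],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.3f}")
```

Higher values indicate that the items vary together, which is what a set of items measuring the same construct would be expected to do.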
Standardization
The process of administering a test under consistent conditions and using established norms to interpret scores.
Norms
Reference scores derived from a standardization sample, used for comparing individual test performance.
Raw Score
The original, untransformed score obtained directly from a test.
Scaled Score
A raw score transformed onto a common scale, often one with a fixed mean and standard deviation, so that scores can be compared across tests, forms, or populations.
Standard Error of Measurement (SEM)
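An estimate of how much an individual’s observed score is expected to vary around their true score because of measurement error; commonly computed as SD × √(1 - reliability).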
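A short sketch of the usual classical-test-theory calculation, SEM = SD × √(1 - reliability), and of the score band it implies. The function names, the SD of 15, the reliability of 0.91, and the observed score of 110 are all hypothetical values chosen for illustration.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM from the test's standard deviation and its reliability coefficient."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

def score_band(observed, sem, z=1.96):
    """Approximate band around an observed score (z = 1.96 for roughly 95%)."""
    return observed - z * sem, observed + z * sem

# Hypothetical test with SD = 15 and reliability = 0.91.
sem = standard_error_of_measurement(sd=15.0, reliability=0.91)
low, high = score_band(110.0, sem)
print(f"SEM = {sem:.2f}, band = {low:.2f} to {high:.2f}")
```

The more reliable the test, the smaller the SEM and the narrower the band around any single observed score.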
Item Analysis
The process of evaluating individual test items for their difficulty, discrimination, and overall contribution to test quality.
Item Difficulty
The proportion of test-takers who answer a test item correctly; used to assess how easy or hard the item is.
Item Discrimination
The degree to which an item differentiates between high and low performers on the overall test.
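The two item statistics above can be sketched for dichotomously scored items (1 = correct, 0 = incorrect). Difficulty is the proportion correct; discrimination is shown here as the upper-lower index, one of several possible indices (point-biserial correlation is another). The function names, the 27% group fraction, and the response matrix are assumptions for illustration.

```python
def item_difficulty(responses, item):
    """Proportion of test-takers who answered the item correctly."""
    return sum(row[item] for row in responses) / len(responses)

def item_discrimination(responses, item, fraction=0.27):
    """Upper-lower index: proportion correct in the top-scoring group
    minus proportion correct in the bottom-scoring group."""
    # Rank test-takers by total score, highest first (ties keep input order).
    ranked = sorted(responses, key=sum, reverse=True)
    group_size = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:group_size], ranked[-group_size:]
    return item_difficulty(upper, item) - item_difficulty(lower, item)

# Hypothetical responses: eight test-takers, four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
for item in range(4):
    p = item_difficulty(responses, item)
    d = item_discrimination(responses, item)
    print(f"item {item}: difficulty = {p:.3f}, discrimination = {d:.3f}")
```

Note that a higher difficulty value means an easier item, since it is the proportion answering correctly.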
Ceiling Effect
When a test is too easy, leading many test-takers to achieve near-perfect scores, limiting the ability to distinguish among high-ability test-takers.
Floor Effect
When a test is too difficult, causing many test-takers to score near the lowest possible score, limiting the ability to distinguish among low-ability test-takers.
Face Validity
The extent to which a test appears to measure what it is supposed to measure, based on superficial inspection.
Factor Analysis
A statistical method used to identify underlying dimensions or factors within a set of test items.
Likert Scale
A common scaling method in survey research where respondents rate their level of agreement with statements.
Construct
A theoretical concept or trait that a test aims to measure (e.g., intelligence, anxiety).
Bias
Systematic errors in test scores that unfairly favor one group over another.
Differential Item Functioning (DIF)
When test-takers from different groups who have the same level of the measured ability have different probabilities of answering an item correctly.
Standard Deviation (SD)
A measure of variability that indicates how far test scores typically fall from the mean; computed as the square root of the average squared deviation from the mean.
Percentile Rank
A score that indicates the percentage of scores in a norm group that fall below a given score.
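A minimal sketch of the definition above, counting only scores strictly below the given score. Some conventions also count half of the tied scores; that variant is not shown. The function name `percentile_rank` and the norm-group scores are hypothetical.

```python
def percentile_rank(score, norm_scores):
    """Percentage of norm-group scores that fall strictly below `score`."""
    if not norm_scores:
        raise ValueError("norm group must not be empty")
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm group of ten scores.
norm_group = [55, 60, 62, 68, 70, 70, 75, 81, 88, 93]
pr = percentile_rank(75, norm_group)
print(f"percentile rank = {pr:.0f}")  # six of the ten scores are below 75
```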
Z-Score
A standardized score that represents the number of standard deviations a raw score is from the mean.
T-Score
A standardized score with a mean of 50 and a standard deviation of 10.
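The two standardized scores above are linked by simple arithmetic: z = (raw - mean) / SD, and T = 50 + 10 × z. The sketch below assumes the mean and SD are taken from the full reference group (population SD); the function names and raw scores are hypothetical.

```python
from statistics import mean, pstdev

def z_score(raw, group_mean, group_sd):
    """Number of standard deviations a raw score lies from the mean."""
    return (raw - group_mean) / group_sd

def t_score(z):
    """Rescale a z-score to a mean of 50 and a standard deviation of 10."""
    return 50.0 + 10.0 * z

# Hypothetical reference group of raw scores.
raw_scores = [10, 12, 14, 16, 18]
group_mean = mean(raw_scores)
group_sd = pstdev(raw_scores)  # population SD; an assumption of this sketch

z = z_score(18, group_mean, group_sd)
t = t_score(z)
print(f"z = {z:.3f}, T = {t:.1f}")
```

A raw score equal to the mean gives z = 0 and T = 50, which is a quick sanity check on any conversion.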
Split-Half Reliability
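A measure of reliability obtained by correlating scores on two halves of a test (e.g., odd and even items), usually adjusted with the Spearman-Brown formula to estimate the reliability of the full-length test.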
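A sketch of one way to carry this out: split the items into odd and even positions, correlate the two half-test totals, then apply the Spearman-Brown correction, r_full = 2r / (1 + r), to estimate full-length reliability. The odd-even split, the function names, and the response data are assumptions for illustration.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

def split_half_reliability(responses):
    """Return (half-test correlation, Spearman-Brown corrected estimate)."""
    # Items at positions 0, 2, ... form one half; positions 1, 3, ... the other.
    first_half = [sum(row[0::2]) for row in responses]
    second_half = [sum(row[1::2]) for row in responses]
    r_half = pearson_r(first_half, second_half)
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full

# Hypothetical responses: eight test-takers, four items scored 0 or 1.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
r_half, r_full = split_half_reliability(responses)
print(f"half-test r = {r_half:.3f}, corrected = {r_full:.3f}")
```

The correction is needed because each half is only half as long as the full test, and shorter tests tend to be less reliable.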
Adaptive Testing
A testing method in which the difficulty of each question is chosen based on the test-taker’s responses to earlier questions.
Pilot Testing
Preliminary testing of a new test on a small sample to identify issues before full-scale implementation.
Sensitivity
The ability of a test to correctly identify individuals who have a specific condition or trait.
Specificity
The ability of a test to correctly identify individuals who do not have a specific condition or trait.
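Sensitivity and specificity can both be read off a comparison of test results with true status: sensitivity = true positives / (true positives + false negatives), and specificity = true negatives / (true negatives + false positives). The function name and the ten-person data set below are hypothetical.

```python
def sensitivity_specificity(has_condition, test_positive):
    """Return (sensitivity, specificity) from paired boolean lists.

    Assumes at least one person with the condition and one without it;
    otherwise the corresponding ratio is undefined (division by zero).
    """
    pairs = list(zip(has_condition, test_positive))
    true_pos = sum(1 for actual, flagged in pairs if actual and flagged)
    false_neg = sum(1 for actual, flagged in pairs if actual and not flagged)
    true_neg = sum(1 for actual, flagged in pairs if not actual and not flagged)
    false_pos = sum(1 for actual, flagged in pairs if not actual and flagged)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity

# Hypothetical screening results for ten people.
has_condition = [True, True, True, True, False, False, False, False, False, False]
test_positive = [True, True, True, False, True, False, False, False, False, False]
sens, spec = sensitivity_specificity(has_condition, test_positive)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```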
Item Response Theory (IRT)
A family of statistical models that relate a test-taker’s level on a latent trait to the probability of a given item response, using item parameters such as difficulty, discrimination, and guessing.
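The three item parameters named above correspond to the three-parameter logistic (3PL) model, in which the probability of a correct response is P(θ) = c + (1 - c) / (1 + e^(-a(θ - b))), with a = discrimination, b = difficulty, and c = guessing. The sketch below only evaluates this curve; it does not estimate parameters from data. The function name and parameter values are hypothetical.

```python
import math

def prob_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta: test-taker's latent trait level
    a: item discrimination, b: item difficulty, c: guessing (lower asymptote)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderately discriminating, slightly hard, 20% guessing floor.
a, b, c = 1.2, 0.5, 0.2
for theta in (-2.0, 0.5, 3.0):
    p = prob_correct_3pl(theta, a, b, c)
    print(f"theta = {theta:+.1f}: P(correct) = {p:.3f}")
```

When theta equals the item difficulty b, the probability sits halfway between the guessing floor c and 1, which is one way to interpret the difficulty parameter.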
Latent Trait
An unobservable characteristic or quality inferred from test performance.
Cultural Bias
Occurs when a test reflects the values, norms, and knowledge of a particular culture, making it unfairly advantageous for individuals from that culture and disadvantageous for those from others.
Construct Bias
When the construct being measured is defined or operationalized differently across groups, resulting in invalid comparisons.
Content Bias
Arises when test items are more relevant or familiar to certain groups than others, leading to an uneven playing field.
Sampling Bias
Happens when the sample used to standardize the test does not accurately represent the broader population, leading to skewed norms.
Response Bias
A tendency of test-takers to respond in a particular way, regardless of their true feelings or abilities. Examples include:
Social Desirability Bias: Answering in a manner perceived to be socially acceptable.
Acquiescence Bias: Tendency to agree with statements regardless of their content.
Extreme Response Bias: Preferring extreme options on scales (e.g., always selecting "strongly agree" or "strongly disagree").
Central Tendency Bias: Avoiding extreme options and choosing middle responses instead.
Gender Bias
When test items or interpretations favor one gender over another, often stemming from stereotypes or unbalanced representation.
Language Bias
Occurs when the language of the test is more accessible to certain groups, disadvantaging non-native speakers or those with limited vocabulary.
Stereotype Threat
The risk that individuals may underperform on a test due to anxiety about confirming negative stereotypes about their social group.
Anchoring Bias
When raters or evaluators give undue weight to an initial piece of information (e.g., first impressions or early responses) when assessing performance.
Halo Effect
When the evaluator's positive impression of one characteristic influences their evaluation of other characteristics, leading to skewed judgments.
Horn Effect
The opposite of the halo effect, where a negative impression of one characteristic negatively influences evaluation of others.
Confirmation Bias
The tendency to favor information or interpretations that confirm pre-existing beliefs or hypotheses.
Availability Bias
Overweighting recent or easily recalled events or information when interpreting test responses or patterns.
Differential Prediction Bias
Occurs when the same test score predicts different criterion outcomes for different groups (e.g., when the regression of the outcome on test scores differs in slope or intercept across groups).
Mode of Administration Bias
Variations in test performance caused by differences in how the test is administered (e.g., paper vs. computer, group vs. individual).
Ecological Bias
When features of the testing environment, such as noise levels, location, or other situational factors, disproportionately affect certain groups.
Leniency Bias
When a rater consistently scores individuals more favorably than warranted.
Severity Bias
When a rater consistently scores individuals more harshly than warranted.
Implicit Bias
Unconscious attitudes or stereotypes that influence judgments and decisions without the evaluator being aware of them.
Intersectionality Bias
When overlapping social categories (e.g., race, gender, socioeconomic status) lead to compounded disadvantages in testing.
Test-Wiseness Bias
An advantage gained by individuals who are more familiar with test-taking strategies, regardless of their actual knowledge or ability.
Age Bias
When test content or interpretations are more appropriate for certain age groups, disadvantaging others (e.g., younger or older individuals).
Cognitive Bias
Systematic errors in judgment stemming from cognitive limitations, such as overgeneralization or faulty logic, that distort the interpretation of test results.
Measurement Bias
Systematic error that arises from the measurement tool itself, leading to consistent overestimation or underestimation of true scores.