Test Construction

Validity
- The extent to which a test measures what it claims to measure.
Reliability
- The consistency or stability of test scores across time, forms, or raters.
Construct Validity
- The degree to which a test measures the theoretical construct it is intended to assess.
Content Validity
- The extent to which a test represents all facets of the construct being measured.
Criterion-Related Validity
- The extent to which a test's scores are related to an external criterion. This includes:
  - Predictive Validity: How well test scores predict future outcomes.
  - Concurrent Validity: How well test scores correlate with a criterion measure obtained at about the same time.
Test-Retest Reliability
- The consistency of scores when the same test is administered to the same individuals at two different points in time.
Inter-Rater Reliability
- The degree of agreement among different raters or evaluators.
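
One common way to quantify agreement between two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The glossary does not name a specific statistic, so treat this as one illustrative option; the function name and the ratings below are hypothetical.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who each assign one category per case."""
    n = len(ratings_a)
    # Observed agreement: proportion of cases where both raters chose the same category
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_chance = sum((counts_a[c] / n) * (counts_b[c] / n) for c in counts_a)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical pass (1) / fail (0) judgments from two raters on ten essays
rater_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rater_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 8 of 10 cases, but half of that agreement would be expected by chance, so kappa is noticeably lower than the raw 80% agreement.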
Internal Consistency
- The extent to which items within a test measure the same construct, often assessed using coefficients like Cronbach’s alpha.
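
As a rough sketch, Cronbach's alpha can be computed from the item variances and the variance of the total score: alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals), where k is the number of items. The function name and the response data below are made up for illustration.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores: one row per respondent, one column per item."""
    k = len(item_scores[0])
    # Variance of each item across respondents
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of each respondent's total score
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 4 respondents, 3 items rated 1 to 5
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 5],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

Population variance is used for both the items and the totals; using sample variance for both would give the same alpha, because the correction factor cancels in the ratio.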
Standardization
- The process of administering a test under consistent conditions and using established norms to interpret scores.
Norms
- Reference scores derived from a standardization sample, used for comparing individual test performance.
Raw Score
- The original, untransformed score obtained directly from a test.
Scaled Score
- A transformed score that allows comparison across tests or populations, usually expressed on a scale with a fixed mean and standard deviation.
Standard Error of Measurement (SEM)
- An estimate of how much an individual’s observed score is expected to vary around their true score because of measurement error; it shrinks as test reliability increases.
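
In classical test theory the SEM is commonly computed as SD * sqrt(1 - reliability). The sketch below applies that formula and builds an approximate 95% band around an observed score; the SD of 15, the reliability of 0.91, and the observed score of 110 are assumed values chosen for illustration.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def confidence_band(observed, sem, z=1.96):
    """Approximate band around an observed score (z = 1.96 for about 95%)."""
    return observed - z * sem, observed + z * sem

# Assumed values for illustration only
sem = standard_error_of_measurement(sd=15, reliability=0.91)
low, high = confidence_band(observed=110, sem=sem)
print(f"SEM = {sem:.1f}, 95% band = {low:.1f} to {high:.1f}")
```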
Item Analysis
- The process of evaluating individual test items for their difficulty, discrimination, and overall contribution to test quality.
Item Difficulty
- The proportion of test-takers who answer a test item correctly; used to assess how easy or hard the item is.
Item Discrimination
- The degree to which an item differentiates between high and low performers on the overall test.
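
Both item statistics can be sketched in a few lines. Difficulty is the proportion correct; for discrimination, this example uses the upper-lower index (proportion correct in the top-scoring group minus the proportion correct in the bottom-scoring group). The 27% group size is a conventional choice, not something the glossary specifies, and the data are hypothetical.

```python
def item_difficulty(item_responses):
    """Proportion of test-takers answering the item correctly (1 = correct, 0 = incorrect)."""
    return sum(item_responses) / len(item_responses)

def discrimination_index(item_responses, total_scores, fraction=0.27):
    """Upper-lower index: p(correct) in the top group minus p(correct) in the bottom group."""
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Indices of test-takers ordered from lowest to highest total score
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:group_size], order[-group_size:]
    p_upper = sum(item_responses[i] for i in upper) / group_size
    p_lower = sum(item_responses[i] for i in lower) / group_size
    return p_upper - p_lower

# Hypothetical data: one item's responses and total test scores for ten test-takers
item = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
totals = [9, 8, 8, 7, 6, 6, 4, 3, 5, 2]
difficulty = item_difficulty(item)
discrimination = discrimination_index(item, totals)
print(difficulty, discrimination)
```

A difficulty near 1.0 means an easy item; a discrimination index near 0 or below means the item does not separate high scorers from low scorers.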
Ceiling Effect
- When a test is too easy, so that many test-takers score at or near the maximum and the test cannot distinguish among high-ability individuals.
Floor Effect
- When a test is too difficult, so that many test-takers score at or near the minimum and the test cannot distinguish among low-ability individuals.
Face Validity
- The extent to which a test appears to measure what it is supposed to measure, based on superficial inspection.
Factor Analysis
- A statistical method used to identify underlying dimensions or factors within a set of test items.
Likert Scale
- A common scaling method in survey research where respondents rate their level of agreement with statements.
Construct
- A theoretical concept or trait that a test aims to measure (e.g., intelligence, anxiety).
Bias
- Systematic errors in test scores that unfairly favor one group over another.
Differential Item Functioning (DIF)
- When test-takers from different groups who have the same ability level have different probabilities of answering an item correctly.
Standard Deviation (SD)
- A measure of variability that indicates how far scores typically fall from the mean; computed as the square root of the average squared deviation from the mean.
Percentile Rank
- A score that indicates the percentage of scores in a norm group that fall below a given score.
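
A minimal sketch of this definition, counting only scores strictly below the given score (some conventions also count half of the tied scores). The norm group is hypothetical.

```python
def percentile_rank(score, norm_scores):
    """Percentage of scores in the norm group that fall strictly below the given score."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

# Hypothetical norm group of ten scores
norm_group = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
rank = percentile_rank(80, norm_group)
print(rank)  # 50.0
```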
Z-Score
- A standardized score that represents the number of standard deviations a raw score is from the mean.
T-Score
- A standardized score with a mean of 50 and a standard deviation of 10.
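
The two standardized scores are related by simple linear transformations: z = (raw - mean) / SD and T = 50 + 10z. The sketch below assumes a norm group with a mean of 100 and an SD of 15, values chosen purely for illustration.

```python
def z_score(raw, mean, sd):
    """Number of standard deviations a raw score lies from the mean."""
    return (raw - mean) / sd

def t_score(z):
    """Rescale a z-score to a mean of 50 and a standard deviation of 10."""
    return 50 + 10 * z

# Assumed norm-group mean and SD
z = z_score(raw=115, mean=100, sd=15)
t = t_score(z)
print(z, t)  # 1.0 60.0
```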
Split-Half Reliability
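- A measure of reliability obtained by correlating scores on two halves of a test, usually adjusted with the Spearman-Brown formula to estimate the reliability of the full-length test.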
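
A sketch of one way to do this: split the items into odd-numbered and even-numbered halves, correlate the half scores, and apply the Spearman-Brown correction, r_full = 2r / (1 + r), because each half is only half as long as the full test. The odd-even split is one of several possible splits, and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def split_half_reliability(item_scores):
    """Return (half-test correlation, Spearman-Brown corrected estimate)."""
    # Odd-numbered items (1st, 3rd, ...) versus even-numbered items (2nd, 4th, ...)
    odd_half = [sum(row[0::2]) for row in item_scores]
    even_half = [sum(row[1::2]) for row in item_scores]
    r_half = pearson_r(odd_half, even_half)
    return r_half, 2 * r_half / (1 + r_half)

# Hypothetical right/wrong responses: 5 test-takers, 4 items
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
r_half, r_full = split_half_reliability(responses)
print(f"half-test r = {r_half:.2f}, corrected = {r_full:.2f}")
```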
Adaptive Testing
- A testing method where the difficulty of questions is adjusted based on the test-taker’s responses.
Pilot Testing
- Preliminary testing of a new test on a small sample to identify issues before full-scale implementation.
Sensitivity
- The ability of a test to correctly identify individuals who have a specific condition or trait.
Specificity
- The ability of a test to correctly identify individuals who do not have a specific condition or trait.
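
Sensitivity and specificity are both proportions from a 2x2 classification table: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The counts below are hypothetical screening results.

```python
def sensitivity(true_pos, false_neg):
    """Proportion of people with the condition whom the test correctly flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of people without the condition whom the test correctly clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical screening results: 50 people with the condition, 50 without
sens = sensitivity(true_pos=40, false_neg=10)
spec = specificity(true_neg=45, false_pos=5)
print(sens, spec)  # 0.8 0.9
```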
Item Response Theory (IRT)
- A family of models that relate the probability of a given item response to the test-taker's latent trait level and to item parameters such as difficulty, discrimination, and guessing.
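
The three parameters correspond to the three-parameter logistic (3PL) model, in which the probability of a correct response is c + (1 - c) / (1 + exp(-a(theta - b))), with a = discrimination, b = difficulty, c = guessing, and theta = the test-taker's trait level. The sketch below uses made-up parameter values.

```python
import math

def prob_correct_3pl(theta, a, b, c):
    """3PL item response function: probability of a correct response at trait level theta."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty, 20% guessing floor
for theta in (-2.0, 0.0, 2.0):
    p = prob_correct_3pl(theta, a=1.2, b=0.0, c=0.2)
    print(f"theta = {theta:+.1f}: P(correct) = {p:.2f}")
```

When theta equals the difficulty b, the probability sits halfway between the guessing floor c and 1; setting c = 0 gives the two-parameter model.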
Latent Trait
- An unobservable characteristic or quality inferred from test performance.

Cultural Bias
- Occurs when a test reflects the values, norms, and knowledge of a particular culture, making it unfairly advantageous for individuals from that culture and disadvantageous for those from others.
Construct Bias
- When the construct being measured is defined or operationalized differently across groups, resulting in invalid comparisons.
Content Bias
- Arises when test items are more relevant or familiar to certain groups than others, leading to an uneven playing field.
Sampling Bias
- Happens when the sample used to standardize the test does not accurately represent the broader population, leading to skewed norms.
Response Bias
- A tendency of test-takers to respond in a particular way, regardless of their true feelings or abilities. Examples include:
  - Social Desirability Bias: Answering in a manner perceived to be socially acceptable.
  - Acquiescence Bias: Tendency to agree with statements regardless of their content.
  - Extreme Response Bias: Preferring extreme options on scales (e.g., always selecting "strongly agree" or "strongly disagree").
  - Central Tendency Bias: Avoiding extreme options and choosing middle responses instead.
Gender Bias
- When test items or interpretations favor one gender over another, often stemming from stereotypes or unbalanced representation.
Language Bias
- Occurs when the language of the test is more accessible to certain groups, disadvantaging non-native speakers or those with limited vocabulary.
Stereotype Threat
- The risk that individuals may underperform on a test due to anxiety about confirming negative stereotypes about their social group.
Anchoring Bias
- When raters or evaluators give undue weight to an initial piece of information (e.g., first impressions or early responses) when assessing performance.
Halo Effect
- When the evaluator's positive impression of one characteristic influences their evaluation of other characteristics, leading to skewed judgments.
Horn Effect
- The opposite of the halo effect, where a negative impression of one characteristic negatively influences evaluation of others.
Confirmation Bias
- The tendency to favor information or interpretations that confirm pre-existing beliefs or hypotheses.
Availability Bias
- Overweighting recent or easily recalled events or information when interpreting test responses or patterns.
Differential Prediction Bias
- Occurs when a test predicts outcomes differently for different groups, even when scores are the same.
Mode of Administration Bias
- Variations in test performance caused by differences in how the test is administered (e.g., paper vs. computer, group vs. individual).
Ecological Bias
- When features of the testing environment (e.g., noise levels, location, or other situational factors) disproportionately affect certain groups.
Leniency Bias
- When a rater consistently scores individuals more favorably than warranted.
Severity Bias
- When a rater consistently scores individuals more harshly than warranted.
Implicit Bias
- Unconscious attitudes or stereotypes that influence judgments and decisions without the evaluator being aware of them.
Intersectionality Bias
- When overlapping social categories (e.g., race, gender, socioeconomic status) lead to compounded disadvantages in testing.
Test-Wiseness Bias
- An advantage gained by individuals who are more familiar with test-taking strategies, regardless of their actual knowledge or ability.
Age Bias
- When test content or interpretations are more appropriate for certain age groups, disadvantaging others (e.g., younger or older individuals).
Cognitive Bias
- Systematic errors in judgment, such as overgeneralization or faulty logic, that distort the interpretation of test results.
Measurement Bias
- Errors that arise from the measurement tool itself, leading to systematic overestimation or underestimation of true scores.