Test Construction

  • Validity

    • The extent to which a test measures what it claims to measure.

  • Reliability

    • The consistency or stability of test scores across time, forms, or raters.

  • Construct Validity

    • The degree to which a test measures the theoretical construct it is intended to assess.

  • Content Validity

    • The extent to which a test represents all facets of the construct being measured.

  • Criterion-Related Validity

    • The extent to which a test's scores are related to an external criterion. This includes:

      • Predictive Validity: How well test scores predict future outcomes.

      • Concurrent Validity: How well test scores correlate with a criterion measure obtained at approximately the same time.

  • Test-Retest Reliability

    • The consistency of scores when the same test is administered to the same individuals at two different points in time.
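
A minimal sketch of how test-retest reliability is often quantified, assuming the Pearson correlation between the two administrations is used as the coefficient. The function name and the scores are hypothetical illustrations.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same five people at two points in time.
time_1 = [10, 12, 15, 18, 20]
time_2 = [11, 13, 14, 19, 21]

test_retest_r = pearson_r(time_1, time_2)
print(round(test_retest_r, 3))
```

Values close to 1 suggest that people keep roughly the same rank order across the two administrations.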

  • Inter-Rater Reliability

    • The degree of agreement among different raters or evaluators.
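
Agreement between two raters can be sketched as follows. The example assumes categorical ratings and uses Cohen's kappa, one common chance-corrected index (other indices exist); the rater data and function names are hypothetical.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of cases on which both raters gave the same rating."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions, summed over categories.
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical pass/fail ratings of eight essays by two raters.
rater_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(agreement, round(kappa, 3))
```

Kappa is lower than raw agreement here because some matches would be expected even if both raters guessed.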

  • Internal Consistency

    • The extent to which items within a test measure the same construct, often assessed using coefficients like Cronbach’s alpha.
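
Cronbach's alpha can be sketched as below, assuming the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores) with sample variances. The response data and function name are hypothetical.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: one row per respondent, one column per item."""
    k = len(item_scores[0])                      # number of items
    items = list(zip(*item_scores))              # transpose to per-item columns
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical responses of five people to three 5-point items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 3],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Alpha rises when items covary strongly relative to their individual variances, which is what "measuring the same construct" looks like numerically.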

  • Standardization

    • The process of administering a test under consistent conditions and using established norms to interpret scores.

  • Norms

    • Reference scores derived from a standardization sample, used for comparing individual test performance.

  • Raw Score

    • The original, untransformed score obtained directly from a test.

  • Scaled Score

    • A raw score converted to a common scale with a fixed mean and standard deviation, allowing comparison across tests, test forms, or populations.

  • Standard Error of Measurement (SEM)

    • An estimate of how much an individual’s observed score is expected to vary around their true score because of measurement error; it shrinks as test reliability increases.
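
A minimal sketch of the SEM, assuming the classical test theory formula SEM = SD * sqrt(1 - reliability). The SD, reliability, and observed score below are hypothetical values chosen for illustration.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM from the score SD and a reliability coefficient."""
    return sd * sqrt(1 - reliability)

def score_band(observed, sem, z=1.96):
    """Approximate band around an observed score (z = 1.96 for about 95%)."""
    return observed - z * sem, observed + z * sem

# Hypothetical test with SD = 15 and reliability = 0.91.
sem = standard_error_of_measurement(sd=15, reliability=0.91)  # roughly 4.5
low, high = score_band(observed=110, sem=sem)
print(round(sem, 2), round(low, 1), round(high, 1))
```

The band illustrates why a single observed score is better read as a range than as an exact value.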

  • Item Analysis

    • The process of evaluating individual test items for their difficulty, discrimination, and overall contribution to test quality.

  • Item Difficulty

    • The proportion of test-takers who answer a test item correctly; used to assess how easy or hard the item is.

  • Item Discrimination

    • The degree to which an item differentiates between high and low performers on the overall test.
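
Item difficulty and item discrimination can be sketched together. The example assumes dichotomous (0/1) scoring and uses the upper-lower discrimination index with 27% extreme groups, which is one common convention rather than the only option; the data and function names are hypothetical.

```python
def item_difficulty(item_responses):
    """Proportion of test-takers answering the item correctly (the p-value)."""
    return sum(item_responses) / len(item_responses)

def discrimination_index(item_responses, total_scores, fraction=0.27):
    """Difference in proportion correct between top and bottom scorers."""
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Indices of test-takers ordered from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower = order[:group_size]
    upper = order[-group_size:]
    p_upper = sum(item_responses[i] for i in upper) / group_size
    p_lower = sum(item_responses[i] for i in lower) / group_size
    return p_upper - p_lower

# Hypothetical data: one item's 0/1 responses and total test scores for ten people.
item = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
totals = [19, 18, 17, 15, 14, 12, 10, 8, 6, 5]

p = item_difficulty(item)                # 0.5
d = discrimination_index(item, totals)   # 1.0
print(p, d)
```

An index near 0 would suggest the item does not separate high and low scorers, and a negative value would flag an item that low scorers answer correctly more often.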

  • Ceiling Effect

    • When a test is too easy, so many test-takers achieve near-perfect scores and the test cannot distinguish among high-ability individuals.

  • Floor Effect

    • When a test is too difficult, so many test-takers score near the lowest possible score and the test cannot distinguish among low-ability individuals.

  • Face Validity

    • The extent to which a test appears to measure what it is supposed to measure, based on superficial inspection.

  • Factor Analysis

    • A statistical method used to identify underlying dimensions or factors within a set of test items.

  • Likert Scale

    • A common scaling method in survey research where respondents rate their level of agreement with statements on an ordered set of response options (e.g., from "strongly disagree" to "strongly agree").

  • Construct

    • A theoretical concept or trait that a test aims to measure (e.g., intelligence, anxiety).

  • Bias

    • Systematic errors in test scores that unfairly favor one group over another.

  • Differential Item Functioning (DIF)

    • When test-takers from different groups who have the same ability level have different probabilities of answering an item correctly.

  • Standard Deviation (SD)

    • A measure of variability that indicates how far test scores typically fall from the mean score; computed as the square root of the variance.

  • Percentile Rank

    • A score that indicates the percentage of scores in a norm group that fall below a given score.
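
A short sketch of the percentile rank as defined above (percentage of norm-group scores strictly below the given score). Some conventions also count half of the tied scores; that variant is not shown. The norm group is hypothetical.

```python
def percentile_rank(score, norm_scores):
    """Percentage of norm-group scores that fall below the given score."""
    below = sum(s < score for s in norm_scores)
    return 100 * below / len(norm_scores)

# Hypothetical norm group of ten scores.
norm_group = [55, 60, 62, 65, 70, 70, 75, 80, 85, 90]

rank = percentile_rank(75, norm_group)  # 60.0
print(rank)
```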

  • Z-Score

    • A standardized score that represents the number of standard deviations a raw score is from the mean.

  • T-Score

    • A standardized score with a mean of 50 and a standard deviation of 10.
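
The z-score and T-score definitions translate directly into arithmetic: z = (raw - mean) / SD and T = 50 + 10z. The norm mean and SD below are hypothetical.

```python
def z_score(raw, mean, sd):
    """Number of standard deviations a raw score lies from the mean."""
    return (raw - mean) / sd

def t_score(z):
    """Rescale a z-score to a mean of 50 and a standard deviation of 10."""
    return 50 + 10 * z

# Hypothetical norms: mean 100, SD 15.
z = z_score(115, mean=100, sd=15)  # 1.0
t = t_score(z)                     # 60.0
print(z, t)
```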

  • Split-Half Reliability

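    • A measure of reliability obtained by correlating scores on two halves of a test (e.g., odd vs. even items), usually adjusted with the Spearman-Brown formula to estimate the reliability of the full-length test.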
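
A minimal sketch of split-half reliability, assuming an odd-even split of the items and the Spearman-Brown correction 2r / (1 + r) to estimate full-length reliability. The response matrix and function names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)

# Hypothetical 0/1 responses: six people (rows) by four items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

odd_half = [sum(row[0::2]) for row in responses]   # items 1 and 3
even_half = [sum(row[1::2]) for row in responses]  # items 2 and 4

r_half = pearson_r(odd_half, even_half)
full_test_estimate = spearman_brown(r_half)
print(round(r_half, 3), round(full_test_estimate, 3))
```

The correction matters because each half is only half as long as the real test, and shorter tests tend to be less reliable.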

  • Adaptive Testing

    • A testing method where the difficulty of questions is adjusted based on the test-taker’s responses.

  • Pilot Testing

    • Preliminary testing of a new test on a small sample to identify issues before full-scale implementation.

  • Sensitivity

    • The ability of a test to correctly identify individuals who have a specific condition or trait.

  • Specificity

    • The ability of a test to correctly identify individuals who do not have a specific condition or trait.
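
Sensitivity and specificity can be computed from a cross-tabulation of true status against test result: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The screening data below are hypothetical.

```python
def sensitivity_specificity(has_condition, test_positive):
    """Return (sensitivity, specificity) from paired boolean lists."""
    pairs = list(zip(has_condition, test_positive))
    tp = sum(1 for actual, flagged in pairs if actual and flagged)
    fn = sum(1 for actual, flagged in pairs if actual and not flagged)
    tn = sum(1 for actual, flagged in pairs if not actual and not flagged)
    fp = sum(1 for actual, flagged in pairs if not actual and flagged)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical screening results for ten people (4 with the condition, 6 without).
has_condition = [True, True, True, True, False, False, False, False, False, False]
test_positive = [True, True, True, False, True, False, False, False, False, False]

sens, spec = sensitivity_specificity(has_condition, test_positive)
print(sens, round(spec, 3))
```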

  • Item Response Theory (IRT)

    • A family of models that relate a test-taker's latent trait level to the probability of a given item response, using item parameters such as difficulty, discrimination, and guessing.
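
One way to see how the three item parameters work together is the three-parameter logistic (3PL) model, sketched here as an assumption about which IRT model is meant: P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b))), where a is discrimination, b is difficulty, and c is the guessing parameter. The parameter values are hypothetical.

```python
from math import exp

def prob_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1 - c) / (1 + exp(-a * (theta - b)))

# Hypothetical item: discrimination 1.2, difficulty 0.0, guessing 0.2.
item = {"a": 1.2, "b": 0.0, "c": 0.2}

# Trace the item characteristic curve at a few ability levels.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(prob_correct_3pl(theta, **item), 3))

p_at_difficulty = prob_correct_3pl(0.0, **item)
```

When ability equals the item's difficulty, the probability sits halfway between the guessing floor and 1, and it never drops below c even at very low ability.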

  • Latent Trait

    • An unobservable characteristic or quality inferred from test performance.

  • Cultural Bias

    • Occurs when a test reflects the values, norms, and knowledge of a particular culture, making it unfairly advantageous for individuals from that culture and disadvantageous for those from others.

  • Construct Bias

    • When the construct being measured is defined or operationalized differently across groups, resulting in invalid comparisons.

  • Content Bias

    • Arises when test items are more relevant or familiar to certain groups than others, leading to an uneven playing field.

  • Sampling Bias

    • Happens when the sample used to standardize the test does not accurately represent the broader population, leading to skewed norms.

  • Response Bias

    • A tendency of test-takers to respond in a particular way, regardless of their true feelings or abilities. Examples include:

      • Social Desirability Bias: Answering in a manner perceived to be socially acceptable.

      • Acquiescence Bias: Tendency to agree with statements regardless of their content.

      • Extreme Response Bias: Preferring extreme options on scales (e.g., always selecting "strongly agree" or "strongly disagree").

      • Central Tendency Bias: Avoiding extreme options and choosing middle responses instead.

  • Gender Bias

    • When test items or interpretations favor one gender over another, often stemming from stereotypes or unbalanced representation.

  • Language Bias

    • Occurs when the language of the test is more accessible to certain groups, disadvantaging non-native speakers or those with limited vocabulary.

  • Stereotype Threat

    • The risk that individuals may underperform on a test due to anxiety about confirming negative stereotypes about their social group.

  • Anchoring Bias

    • When raters or evaluators give undue weight to an initial piece of information (e.g., first impressions or early responses) when assessing performance.

  • Halo Effect

    • When the evaluator's positive impression of one characteristic influences their evaluation of other characteristics, leading to skewed judgments.

  • Horn Effect

    • The opposite of the halo effect, where a negative impression of one characteristic negatively influences evaluation of others.

  • Confirmation Bias

    • The tendency to favor information or interpretations that confirm pre-existing beliefs or hypotheses.

  • Availability Bias

    • Overweighting recent or easily recalled events or information when interpreting test responses or patterns.

  • Differential Prediction Bias

    • Occurs when the same test score predicts different criterion outcomes for different groups, so a single prediction equation over- or underpredicts for some groups.

  • Mode of Administration Bias

    • Variations in test performance caused by differences in how the test is administered (e.g., paper vs. computer, group vs. individual).

  • Ecological Bias

    • When features of the testing environment, such as noise levels, location, or other situational factors, disproportionately affect certain groups.

  • Leniency Bias

    • When a rater consistently scores individuals more favorably than warranted.

  • Severity Bias

    • When a rater consistently scores individuals more harshly than warranted.

  • Implicit Bias

    • Unconscious attitudes or stereotypes that influence judgments and decisions without the evaluator being aware of them.

  • Intersectionality Bias

    • When overlapping social categories (e.g., race, gender, socioeconomic status) lead to compounded disadvantages in testing.

  • Test-Wiseness Bias

    • An advantage gained by individuals who are more familiar with test-taking strategies, regardless of their actual knowledge or ability.

  • Age Bias

    • When test content or interpretations are more appropriate for certain age groups, disadvantaging others (e.g., younger or older individuals).

  • Cognitive Bias

    • Systematic errors in reasoning, such as overgeneralization or faulty logic, that distort how test results are interpreted.

  • Measurement Bias

    • Errors that arise from the measurement tool itself, leading to systematic overestimation or underestimation of true scores.