Notes on Psychological Testing: Norming, Group Norms, Ethics, and Measurement

Norming, interpretation, and measurement error

This portion of the lecture revisits norming and how we interpret scores. The speaker reminds us that no test is perfect, and that observed scores include measurement error. If one student scores an 80 and another a 78 on an invented 0–100 scale (a simplified stand-in for something like the SAT), you cannot simply declare that the first student is better. Because measurement error can shift scores up or down, direct raw-score comparisons are incomplete. The observed score is conceptually the sum of a student’s true score and some measurement error:

X_{\text{obs}} = X_{\text{true}} + \varepsilon

where \varepsilon represents random or systematic error. We don’t know the exact error for any given test-taker, but we can estimate it with the standard error of measurement (SEM). The SEM relates to a test’s reliability and dispersion of scores (and will be covered in more detail next week). A core practical consequence is that interpretation should consider potential error, not just a raw score.
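
As a worked sketch (invented numbers, not from the lecture): the SEM is commonly computed as SEM = SD * sqrt(1 - r_xx), where SD is the test’s standard deviation and r_xx its reliability. The snippet below brackets the 80-vs-78 example above with approximate 95% bands, assuming a hypothetical SD of 10 and reliability of .91.

    import math

    def sem(sd: float, reliability: float) -> float:
        """Standard error of measurement: SEM = SD * sqrt(1 - r_xx)."""
        return sd * math.sqrt(1.0 - reliability)

    # Hypothetical test properties, invented for illustration.
    sd, r_xx = 10.0, 0.91
    e = sem(sd, r_xx)  # 10 * sqrt(0.09) = 3.0

    for observed in (80, 78):
        lo, hi = observed - 1.96 * e, observed + 1.96 * e
        print(f"score {observed}: ~95% band [{lo:.1f}, {hi:.1f}]")
    # The bands overlap heavily, so the 2-point raw difference is not
    # evidence that the first student's true score is higher.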

Beyond single-score interpretation, norming allows comparisons within groups rather than across the entire tested population.

Within-group norming and grouping strategies

Norming within groups means comparing a student to norms that are specific to a subgroup (e.g., ethnicity or race) rather than to a single, universal norm. In practice, this involves calculating separate norms for categories such as White, Black, Hispanic, Asian, and Other (often with the acknowledgment that “mixed race” is a real category, though not always cleanly handled in statistics). The idea is to ask: how would a student perform relative to peers within the same subgroup?
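
A minimal sketch of the arithmetic (hypothetical norm values, not the lecture’s data): within-group norming swaps a single overall mean and SD for subgroup-specific ones, so the same raw score maps to a different percentile depending on the reference group.

    from statistics import NormalDist

    # Hypothetical norm tables: (mean, standard deviation) per group.
    norms = {
        "overall":    (72.0, 11.0),
        "subgroup_a": (75.0, 10.0),
        "subgroup_b": (68.0, 12.0),
    }

    def percentile(raw: float, group: str) -> float:
        """Percentile of a raw score against a (normal) norm group."""
        mean, sd = norms[group]
        return 100.0 * NormalDist().cdf((raw - mean) / sd)

    raw = 80.0
    for group in norms:
        print(f"{group}: {percentile(raw, group):.0f}th percentile")
    # The same raw score of 80 lands at a different percentile in each
    # reference group, which is precisely what makes the practice contested.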

In this framework, a student who is White would be compared to White norms, a student who is Black to Black norms, and so on. The intuitive rationale is that different groups might have different average scores due to a variety of factors beyond raw ability, so comparing apples to apples could seem more fair. The speaker notes that this approach is controversial because it implicitly treats race/ethnicity as a causal variable for differences in scores, which raises essentialist and philosophical concerns (a nod to discussions about race as a social rather than strictly biological construct).

This controversy intersects with policy and ethics. In the United States, the Civil Rights Act of 1991 addressed discrimination in selection and hiring practices (including prohibiting race-based score adjustments in employment testing), sparking ongoing debates about whether within-group norming constitutes discrimination and whether it appropriately isolates construct validity from group membership. The discussion also links to broader questions about the history of eugenics, a movement that asserted biological bases for group differences, and to the distinction between phenotype (observable traits) and genotype (genetic makeup). The speaker emphasizes that race is not a strict biological determinant, and that statistics often require simplifying categories for practical measurement, which can inadvertently reinforce problematic assumptions.

Ethical, legal, and practical implications of norming and race categories

The lecture covers several intertwined issues:

  • The essentialist argument: assuming race is the causal variable for score differences.
  • The historical context of eugenics and its influence on how tests are designed and interpreted.
  • The legal risk of discrimination when using certain norming practices in employment and other high-stakes decisions (cited: the Civil Rights Act of 1991).
  • The ethical obligation to ensure tests measure the intended construct rather than proxies for group membership, and to be mindful of how cultural, linguistic, and SES factors influence performance.
  • The practical challenges of cultural translation, including language, figures of speech, and culturally familiar knowledge (e.g., whether a test-taker has ever encountered items such as a “basement” or a “mirror”).

The speaker also discusses how cultural differences can affect test content and interpretation. For example, materials developed in the U.S. may rely on concepts or experiences that are not universal, so direct transfer to other cultures requires cultural translation of both content and context.

Historical and contemporary controversies tied to testing and social policy

Several major themes recur:

  • Eugenics and claims about group differences rooted in biology or genetics. The lecture emphasizes that race as a biological category is contentious and scientifically questionable; differences observed in scores likely reflect sociocultural and environmental factors as much as any biological component.
  • The 1990s popularization of the “bell curve” argument (Herrnstein & Murray, 1994) and the ongoing debate about the interpretation of group differences. The speaker notes that contemporary science supports multiple explanations for disparities, not a single natural hierarchy among groups.
  • Informed consent and the ethical track record of assessments. The speaker recalls that formal informed consent for testing became widely recognized only in the 1960s–1970s, and that earlier practice often proceeded without clients’ transparent awareness of how they were being assessed.
  • Data privacy and the ethics of testing in the digital age. A 2015 episode at Facebook (now Meta), later known as the Cambridge Analytica scandal, showed that online personality quizzes could collect data used to influence political advertising, illustrating that seemingly innocuous tests can have far-reaching consequences for autonomy and democracy.

Standardization, norms, and administration of tests

A key distinction in testing is standardization: standardized tests have uniform instructions, administration procedures, and scoring conventions, plus normative data that allow interpretation relative to a reference group. The lecture emphasizes three interrelated components:

  • Content standardization: items and scoring rules are fixed across administrations.
  • Administration standardization: the way the test is given (instructions, timing, environment) is consistent.
  • Norms: the reference group or groups used to interpret an individual’s score.

Self-report instruments (e.g., personality inventories like the NEO) rely on respondents’ own reports about preferences and behaviors, which are then interpreted using normative data. In contrast, observer-based data (e.g., behavioral coding) involve external ratings that are not self-reported. The speaker notes that standardization also applies to how results are interpreted; without norms, a test cannot be meaningfully standardized.

Items on tests are not always questions. The term item refers to prompts, statements, or tasks used to elicit a response. Distinguishing items from questions matters because some items require performance (doing something) rather than selecting a response to a direct question. In child-testing contexts, many tasks are manipulatives or problem-solving activities that yield observable performance scores.

Achievement vs aptitude: what tests measure

The differentiation between achievement and aptitude is central to understanding what a test claims to measure. The SAT is used as a focal example with two parts: verbal and quantitative, plus a writing component in some versions. The question is whether the test measures what the student has learned (achievement) or the potential to learn and perform in the future (aptitude). The speaker notes that the SAT has historically been used for both purposes, depending on context.

Other tests illustrate this dichotomy. The ASVAB (Armed Services Vocational Aptitude Battery) is discussed as a military example that partitions domains like arithmetic, word knowledge, and paragraph comprehension, highlighting how domains are parceled out and assessed for potential in different areas. The underlying premise is whether a test assesses current mastery or future capacity, and how that informs high-stakes decisions like college admission or job placement.

The speaker notes that most tests used in high-stakes contexts aim to measure achievement (past mastery) or aptitude (potential). In practice, many tests blend both notions to varying degrees depending on content and purpose.

Major test families and constructs in psychology and applied settings

A range of test types and constructs are referenced:

  • Personality assessments: the Big Five (Five-Factor Model), measured by instruments like the NEO and commonly used in organizational settings. Other personality measures include the MBTI (Myers-Briggs Type Indicator), which has faced criticism for weak criterion validity in predicting job performance and outcomes. The lecture also mentions the DISC model as another personality framework used in teams.
  • Clinical and diagnostic measures: the MMPI (Minnesota Multiphasic Personality Inventory) and the MCMI (Millon Clinical Multiaxial Inventory) are noted for clinical application and for criterion validity regarding personality disorders.
  • Projective assessments vs. objective tests: Projective tests use ambiguous stimuli with interpretive scoring, while objective tests have clear scoring rules and fixed response options. These distinctions matter for reliability, validity, and interpretability.

The speaker stresses that the most important issue is not which test is inherently “better,” but how the test is used and what inferences are drawn from it. Criterion validity (the degree to which a test predicts a relevant outcome) and practical utility in decision-making are emphasized as central concerns.
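
As an illustrative sketch only (invented paired data, not from the lecture): criterion validity is often summarized as the correlation between test scores and a later outcome, such as job-performance ratings.

    from statistics import correlation  # available in Python 3.10+

    # Invented data: selection-test scores and later performance ratings.
    test_scores = [62, 70, 75, 81, 85, 90, 93]
    performance = [2.1, 2.8, 2.6, 3.4, 3.2, 3.9, 4.1]

    r = correlation(test_scores, performance)
    print(f"criterion validity coefficient r = {r:.2f}")
    # A higher r means the test tracks the criterion more closely; an r
    # near zero would question the test's usefulness for selection.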

Observational methods, performance contexts, and measurement strategies

Testing can combine multiple approaches to capture different aspects of ability or trait:

  • Observation and behavioral coding: evidence collected through watching and coding behaviors (e.g., active listening cues such as paraphrasing, nonverbal feedback, and eye contact). Observational data are not self-reported and depend on the observer’s reliability.
  • Maximal performance vs. typical performance: maximal performance asks for the best possible effort on a given task, whereas typical performance reflects how a person behaves under normal conditions. The choice affects test design and interpretation in organizational settings.
  • Multi-method batteries: the lecture references combining different approaches (e.g., standardized tests, interviews, observational checklists) to obtain a fuller assessment. This approach raises questions about how to weight and integrate results; one simple weighting sketch follows this list.
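
One simple integration scheme, sketched with invented norms and weights (the lecture does not prescribe a method): standardize each method’s score against its own norms, then combine the z-scores with explicit weights.

    # Hypothetical (mean, sd) norms and weights per assessment method.
    norms = {
        "standardized_test": (100.0, 15.0),
        "interview":         (3.5, 0.8),
        "observation":       (20.0, 5.0),
    }
    weights = {"standardized_test": 0.5, "interview": 0.3, "observation": 0.2}

    def composite(scores: dict) -> float:
        """Weighted sum of z-scores; weights encode each method's importance."""
        total = 0.0
        for method, raw in scores.items():
            mean, sd = norms[method]
            total += weights[method] * (raw - mean) / sd
        return total

    candidate = {"standardized_test": 112.0, "interview": 3.9, "observation": 24.0}
    print(f"composite z = {composite(candidate):+.2f}")  # +0.71 for these inputs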

Integrity tests, faking, and high-stakes decisions

Integrity tests (honesty tests) are discussed as predictors of certain work behaviors, such as reliability and follow-through. The degree to which these tests are vulnerable to faking is debated; some educators believe there are methods to mitigate faking (e.g., test design, validity scales, or using multiple indicators). The practical implication is that selection decisions in high-stakes contexts (e.g., hiring) require careful consideration of test properties, faking risk, and how results will inform decisions.

Ethics, social impact, and ongoing debates in testing

One central thread is the ethical use of tests. The lecture emphasizes that:

  • Tests should measure the intended construct and provide information useful for decision-making, not serve as carte blanche for discrimination.
  • Informed consent and transparency about how tests are used and how data will be handled remain crucial.
  • The use of tests to support discriminatory outcomes—whether explicit or implicit—raises ethical concerns and policy implications.

Examples from recent history illustrate how data from testing and profiling can be misused. The Facebook/Meta example shows how psychologically informed data can influence political outcomes, underscoring the need for ethical safeguards in data collection and use.

Language, culture, and translation challenges in testing

Cultural and linguistic differences can impede fair assessment. Language fluency affects performance on verbal sections or items that rely on vocabulary and comprehension. Cultural knowledge and familiarity with certain constructs influence responses. The instructor highlights the importance of cultural translation when adapting tests for different populations, including avoiding overreliance on figures of speech that may not translate across languages.

An illustrative example concerns basements and other culturally specific knowledge. If a test assumes domestic architecture or common experiences from one culture, it may disadvantage others who have not had exposure to those contexts. This underscores the need for careful cross-cultural test development and interpretation.

The role of SES, ZIP codes, and education in testing disparities

The discussion links test disparities to social determinants such as socioeconomic status (SES) and neighborhood effects (e.g., ZIP code correlates with educational attainment). The speaker notes that disparities in attainment across geographies can reflect structural inequalities and differential access to resources, rather than fixed differences in ability. This reinforces the importance of considering context when interpreting scores.

Practical test administration notes and terminology

Several terminology points are clarified for students:

  • A test’s “norms” are the reference data against which an individual’s score is interpreted.
  • A “standardized test” requires consistent administration, scoring, and interpretation, plus norms.
  • Tests can be self-report (e.g., NEO) or observer-rated (e.g., behavioral checklists).
  • The term “item” is used for components of a test (prompts, statements, tasks) rather than always “questions.”
  • The MBTI and the Five-Factor Model are both discussed as personality frameworks, but they differ in empirical support, predictive validity, and how results are used in practice.
  • Criterion validity is a central concern when evaluating whether a test predicts relevant outcomes; a lack of clear criterion validity raises questions about a test’s usefulness in guiding decisions.

Closing practical notes on study and classroom context

The speaker mentions practical classroom logistics (e.g., upcoming PsychCon conference, chapter focus, and workbook items) to help students connect theoretical content to applied settings. A recurring message is that understanding psychological assessment requires combining concepts (norms, validity, bias, ethics, and application) and recognizing that tests are tools whose value depends on how they are used, interpreted, and integrated with context.