Chapter 3: A Statistics Refresher
Objective: Overview the elements of a good test
Basic assumptions about assessment will be elaborated later on in this chapter and subsequent chapters
Chapter 4: Of Tests and Testing
Importance of tests in addressing important questions in various situations
Examples of situations where tests are used:
Emergency room patient awaiting a diagnosis
Court case turning on a defendant's competence to stand trial
Hiring decisions in a large corporation
College admissions decisions
Custody dispute in a bitter divorce
Assessment professionals need confidence in the tests they use
Overview of the elements of a good test
Assumptions About Psychological Testing and Assessment
Assumption 1: Psychological traits and states exist
Traits defined as distinguishable, relatively enduring ways individuals vary from each other
States also distinguish individuals but are relatively less enduring
Psychological traits cover a wide range of characteristics
Psychological traits exist as constructs, developed to describe or explain behavior
Traits cannot be seen, heard, or touched, but their existence can be inferred from overt behavior
Controversy exists regarding how psychological traits exist
Psychological traits are relatively enduring but not expected to be manifested in behavior 100% of the time
Manifestation of traits depends on the strength of the trait and the nature of the situation
Trait manifestation is situation-dependent
Behavior context plays a role in selecting appropriate trait terms
Trait and state terms are relative and involve comparisons with the average person or specific groups
Measuring sensation seeking
Sensation seeking defined as the need for varied, novel, and complex sensations and experiences
Sensation-Seeking Scale (SSS) used to identify people high or low on this trait
The context in which behavior occurs influences the interpretation of trait terms
Paper-and-pencil measures and performance-based measures have comparative advantages
Page 6:
Influence of reference group on conclusions or judgments
Example of a psychologist administering a test of shyness to a male exotic dancer
Interpretation of test data differs based on reference group (other males in his age group or other male exotic dancers in his age group)
Assumption 2: Psychological traits and states can be quantified and measured
Most psychological traits and states vary by degree and can be quantified
Different ways of defining and looking at the same phenomenon
Importance of clear operational definition of the construct under study
Different ways of defining "aggressive behavior" in different contexts
Test developers consider item content to gauge the strength of a trait
Universe of behaviors indicative of the targeted trait
Examples of items for measuring intelligence and social judgment
Page 7:
Comparative value of test items influenced by technical considerations, definition of construct, and societal value
Developing appropriate test items, scoring methods, and interpretation of results
Cumulative scoring: the higher the total score, the stronger the presumed trait; the model is familiar from elementary school spelling tests
Assumption 3: Test-related behavior predicts non-test-related behavior
Objective of tests is to provide indication of other aspects of examinee's behavior
Use of test patterns in decision making regarding mental disorders
Even tests designed to mimic actual behaviors provide only a sample of the behavior that occurs under nontest conditions
Predicting future behavior or postdicting past behavior in forensic matters
All tests have limits and imperfections
Competent test users understand how tests were developed and appropriate circumstances for use
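The cumulative scoring model mentioned above can be sketched in a few lines; the items and responses below are hypothetical.

```python
# A sketch of cumulative scoring with hypothetical true/false items,
# keyed so that 1 indicates the targeted trait (or, on a spelling
# test, a correct answer).

def cumulative_score(responses):
    """Sum the item scores; a higher total signals a stronger trait."""
    return sum(responses)

responses = [1, 0, 1, 1, 1]  # five hypothetical items
print(cumulative_score(responses))  # 4 of a possible 5
```

As on a spelling test, the total by itself says nothing about which specific items were missed; that requires item-level inspection.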
Page 8: Assumptions About Psychological Testing and Assessment (cont.)
Test results inform decisions about education, employment, and treatment
The stakes of these decisions make accurate and reliable tests crucial
Measurement involves assigning numbers to represent the construct being measured
Assumption 4: Tests and other measurement techniques have strengths and weaknesses
Tests can provide valuable information, but they also have limitations
Test users should be aware of these limitations and interpret results accordingly
Competent test users understand how to administer a test and how to interpret its results
Codes of ethics emphasize the importance of test users being aware of test limitations
Assumption 5: Various sources of error are part of the assessment process
Error refers to factors other than what the test attempts to measure that influence performance
Error variance is the component of a test score attributable to sources other than the trait or ability measured
Sources of error variance include assessors, assessees, and measuring instruments
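The score model behind error variance can be sketched numerically; the true score, error spread, and sample size below are hypothetical.

```python
import random

random.seed(0)

TRUE_SCORE = 50   # hypothetical trait component, fixed for one testtaker
ERROR_SD = 5      # hypothetical spread contributed by error sources

# Each observed score = true score + error; influences from assessor,
# assessee, and instrument are modeled here as one random component.
observed = [TRUE_SCORE + random.gauss(0, ERROR_SD) for _ in range(1000)]

mean_obs = sum(observed) / len(observed)
error_variance = sum((x - mean_obs) ** 2 for x in observed) / len(observed)
print(round(mean_obs, 1), round(error_variance, 1))  # mean near 50, variance near 25
```

The spread of repeated observations around the fixed true score is exactly the error variance the outline describes: variation attributable to something other than the trait.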
Page 9: Assumptions about Testing and Assessment (cont.)
Assumption 6: Unfair and biased assessment procedures can be identified and reformed
Test developers strive to develop fair instruments following guidelines in the test manual
Procedures to identify and correct test bias have been developed
Fairness-related problems can arise when tests are used with people for whom they were not intended
Assumption 7: Testing and assessment offer powerful benefits to society
Tests and assessments play a crucial role in making important decisions
Despite limitations and challenges, testing and assessment provide valuable information
Testing and assessment are tools that can be used properly or improperly
Like any tool, tests can be used in a fair and appropriate manner or misused
The importance of understanding and applying these assumptions in the field of testing and assessment
Page 10:
The importance of tests and assessment procedures in society
Without tests, people could present themselves as professionals regardless of their qualifications
Nepotism may be used in hiring instead of merit
Teachers and school administrators may place children in special classes arbitrarily
The need for instruments to diagnose educational difficulties and neuropsychological impairments
The military's need to screen recruits based on key variables
The criteria for a good test
Clear instructions for administration, scoring, and interpretation
Economy in time and money
Measures what it purports to measure
Page 11:
The concept of reliability in tests
Reliability involves the consistency and precision of the measuring tool
Example of three scales weighing a 1-pound gold bar
Scale A consistently measures 1 pound, making it reliable
Scale B consistently measures 1.3 pounds, making it consistently inaccurate but still reliable
Scale C registers different weights every time, making it unreliable
The concept of validity in tests
A test is considered valid if it measures what it purports to measure
Example of a scale that consistently measures a 1-pound gold bar as 1 pound, making it both reliable and valid
Controversy exists regarding the definition of intelligence, affecting the validity of intelligence tests
Questions regarding a test's validity may focus on the test as a whole, on individual items, and on the interpretation of test scores
Other considerations for a good test
Ease of administration, scoring, and interpretation by trained examiners
Utility and actionable results that benefit testtakers or society
The importance of norms or normative data for comparing test results
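The three-scale example above can be sketched: the spread of repeated readings captures reliability, while closeness to the true weight captures validity. The readings for Scale C are hypothetical.

```python
TRUE_WEIGHT = 1.0  # the gold bar's actual weight in pounds

scale_a = [1.0, 1.0, 1.0, 1.0]  # reliable and valid
scale_b = [1.3, 1.3, 1.3, 1.3]  # reliable but not valid
scale_c = [0.8, 1.4, 1.1, 0.9]  # neither reliable nor valid (hypothetical readings)

def spread(readings):
    """Range of readings: 0 means perfectly consistent, i.e., reliable."""
    return max(readings) - min(readings)

def bias(readings):
    """Distance of the average reading from the true weight: 0 means valid."""
    return abs(sum(readings) / len(readings) - TRUE_WEIGHT)

for name, r in (("A", scale_a), ("B", scale_b), ("C", scale_c)):
    print(name, spread(r), round(bias(r), 2))
```

Scale B illustrates the key asymmetry: a measure can be perfectly reliable (zero spread) while still invalid (nonzero bias), but an unreliable measure cannot be valid.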
Everyday Psychometrics: Putting Tests to the Test
Why Use This Particular Instrument or Method?
Choice of measuring instruments
Objective of using a test and how well it meets that objective
Targeted testtakers and appropriateness for them
Definition of what the test measures
Types of data generated and other necessary data
Existence of alternate forms of the test
Are There Any Published Guidelines for the Use of This Test?
Awareness of published guidelines from professional associations
Example of guidelines for child custody evaluations
Three types of assessments relevant to custody decisions
Supplementing the selected instrument with other tools of assessment
Use of interviews, behavioral observation, and case history analysis
Is This Instrument Reliable?
Importance of reliability in measurement
Assessing reliability through reading the test's manual and published research
Measuring emotional states as an example of reliability challenges
Page 13:
Precision and Retest Reliability
Emotional states can change quickly, so low retest reliability does not necessarily mean inaccurate scores.
To estimate the reliability of emotional states, measure it twice over short intervals or with a short series of test items.
Statistical procedures can estimate the reliability of the measurement from the consistency of item responses.
Instrument Validity
Validity refers to the extent to which a test measures what it purports to measure.
Research to determine instrument validity starts with a careful reading of the test's manual and published research on the test.
Questions related to the validity of a test can be complex and colored more in shades of gray than black or white.
Interrater reliability is the degree of agreement or consistency between two or more raters evaluating the same behavior or trait.
Discrepant ratings between parents and teachers in assessing childhood behavior problems may reflect reality.
Multiple sources of data are needed to base an opinion, both for ethical mandates and meeting a burden of proof in court.
Instrument Cost-Effectiveness
Group tests can be more cost-effective than individually administered tests.
Group tests were developed by the armed services during World Wars I and II to quickly screen recruits for intelligence.
Inferences and Generalizability
In evaluating a test, consider the inferences that can be reasonably made from administering the test.
Generalizability of findings is important, and factors like test norms and cultural considerations can affect it.
Culture must be taken into account in the development, administration, scoring, and interpretation of any test.
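Interrater reliability in its simplest form, percent agreement, can be sketched; the parent and teacher ratings below are hypothetical.

```python
# Hypothetical ratings of eight children (1 = behavior problem present)
parent_ratings  = [1, 0, 1, 1, 0, 0, 1, 0]
teacher_ratings = [1, 0, 0, 1, 0, 1, 1, 0]

def percent_agreement(rater1, rater2):
    """Proportion of cases on which two raters give the same rating."""
    matches = sum(a == b for a, b in zip(rater1, rater2))
    return matches / len(rater1)

print(percent_agreement(parent_ratings, teacher_ratings))  # 0.75
```

Discrepant ratings lower this index, but as the outline notes, parent-teacher disagreement may reflect genuinely different behavior in home and school settings rather than unreliable raters.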
Page 14:
Norm-Referenced Testing and Assessment
Norm-referenced testing compares an individual testtaker's score to scores of a group of testtakers.
Norms are the test performance data of a particular group of testtakers used as a reference for evaluating individual test scores.
Normative sample is the group of people whose performance on a test is analyzed for reference.
Norming is the process of deriving norms, and it can be modified to describe specific types of norm derivation.
Race norming, norming on the basis of race or ethnic background, was once practiced but is now outlawed.
Norming a test with a nationally representative sample can be expensive, so some test manuals provide user norms or program norms based on descriptive statistics.
Sampling to Develop Norms
Standardization or test standardization is the process of administering a test to a representative sample of testtakers for establishing norms.
Sampling is necessary to understand how norms are derived through formal sampling methods.
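Using norms as a reference can be sketched by locating a raw score within a normative sample; the scores below are hypothetical.

```python
# Raw scores of a small hypothetical normative sample
normative_sample = [61, 64, 68, 70, 72, 75, 78, 80, 85, 91]

def percentile_rank(score, norms):
    """Percentage of the normative sample scoring below the given raw score."""
    below = sum(1 for s in norms if s < score)
    return 100 * below / len(norms)

print(percentile_rank(78, normative_sample))  # 60.0
```

A raw score of 78 means little by itself; against this normative sample it places the testtaker above 60% of the reference group.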
Page 15: How "Standard" Is Standard in Measurement?
Introduction
The foot, a unit of distance measurement in the United States, originated from the length of a British king's foot used as a standard.
Different localities used different "feet" for measurement in the past.
There is still confusion in the field of psychological testing and assessment regarding the meaning of terms like standard and standardization.
Definition of "Standard"
The word standard can be a noun or an adjective with multiple definitions.
As a noun, standard refers to something that others are compared to or evaluated against.
Example: A test with exceptional psychometric properties is considered "the standard against which all similar tests are judged."
Example: The Standards for Educational and Psychological Testing is a well-known manual that sets forth ideals of professional behavior.
As an adjective, standard refers to what is usual, generally accepted, or commonly employed.
Example: The standard way of conducting a measurement procedure is contrasted with newer or experimental procedures.
Example: Researchers studying alcoholism have adopted the concept of a standard drink to better understand and quantify alcohol consumption patterns.
The Verb "to Standardize"
The verb "to standardize" refers to making or transforming something into a basis of comparison or judgment.
Example: Researchers standardize an alcoholic beverage that contains 15 milliliters of alcohol as a "standard drink."
Page 15: Ben's Cold Cut Preference Test (CCPT)
Ben owns a small "deli boutique" that sells 10 varieties of private-label cold cuts.
Ben created his own "standardized test" called the Cold Cut Preference Test (CCPT).
The CCPT consists of two questions: "What would you like today?" and "How much of that would you like?"
Ben trains his wife on test administration and scoring of the CCPT.
The question is raised whether the CCPT qualifies as a "standardized test."
Page 15: Figure 1
Figure 1 shows Ben's Cold Cut Preference Test (CCPT).
Page 16: How "Standard" Is Standard in Measurement?
Standardizing tests involves developing replicable procedures for administering, scoring, and interpreting the test.
Test items must be clearly specified, along with rules for administering and scoring them.
Traditionally, standardized tests also have norms and come with manuals that provide all the necessary information for responsible test use.
Standardized tests should be administered in a standardized manner, with little deviation from examiner to examiner.
Test manuals should contain detailed scoring guidelines and examples of correct, incorrect, or partially correct responses.
The test manual should also provide guidelines for interpreting the test results, including appropriate and inappropriate generalizations.
Page 17:
The term "standard score" is sometimes reserved for z scores, while other types of standardized scores are referred to as "standardized scores."
In psychological testing and assessment, there are various types of standard errors, such as the standard error of measurement, standard error of estimate, standard error of the mean, and standard error of the difference.
Critical thinking is encouraged when encountering the word "standard" in any context, as it may not always be as standard as expected.
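Two of those standard quantities can be sketched: the z score, and the standard error of measurement, whose formula sd * sqrt(1 - reliability) comes from classical reliability theory; the mean, SD, and reliability values below are hypothetical.

```python
import math

def z_score(raw, mean, sd):
    """Standard (z) score: distance from the mean in standard-deviation units."""
    return (raw - mean) / sd

def standard_error_of_measurement(sd, reliability):
    """SEM = sd * sqrt(1 - reliability), from classical test theory."""
    return sd * math.sqrt(1 - reliability)

print(z_score(115, 100, 15))                           # 1.0
print(round(standard_error_of_measurement(15, 0.91), 2))  # 4.5
```

The SEM shrinks as reliability rises: a perfectly reliable test (reliability = 1) would have zero measurement error around any obtained score.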
Sampling
Test developers target a defined population for which the test is designed.
It is usually impossible, impractical, or expensive to administer the test to everyone in the population, so a sample is used.
The sample should be representative of the whole population.
Subgroups within the population may need to be proportionately represented in the sample.
Page 18:
Different types of sampling procedures:
Stratified sampling:
Includes people representing different subgroups of the population
Helps prevent sampling bias
Aids in the interpretation of findings
Becomes stratified-random sampling when every member of each stratum has an equal chance of being selected
Purposive sampling:
Selects a sample believed to be representative of the population
Used by manufacturers to test products in specific markets
Can lead to non-representative samples or biased results
Incidental sampling or convenience sampling:
Uses the most convenient sample available
Often due to budgetary limitations or other constraints
Generalization of findings must be made with caution
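Proportional stratified sampling can be sketched as below; the population, strata, and sizes are hypothetical.

```python
import random

random.seed(1)

# Hypothetical population of 1,000 cases tagged by stratum
population = [("urban", i) for i in range(600)] + [("rural", i) for i in range(400)]

def stratified_sample(pop, n):
    """Draw n cases so each stratum keeps its share of the population."""
    strata = {}
    for stratum, case in pop:
        strata.setdefault(stratum, []).append(case)
    sample = []
    for stratum, cases in strata.items():
        k = round(n * len(cases) / len(pop))          # proportional allocation
        sample += [(stratum, c) for c in random.sample(cases, k)]
    return sample

sample = stratified_sample(population, 100)
print(len(sample), sum(1 for stratum, _ in sample if stratum == "urban"))  # 100 60
```

With many strata, per-stratum rounding can make the totals drift from n, so real norming plans adjust the allocation; the point here is that each subgroup retains its population proportion, guarding against sampling bias.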
Page 19:
Exclusionary criteria for normative samples:
Persons tested on any intelligence measure in the six months prior to the testing
Persons not fluent in English or who are primarily nonverbal
Persons with uncorrected visual impairment or hearing loss
Persons with upper-extremity disability that affects motor performance
Persons currently admitted to a hospital or mental or psychiatric facility
Persons currently taking medication that might depress test performance
Persons previously diagnosed with any physical condition or illness that might depress test performance
Importance of standard set of instructions and conditions for giving the test:
Makes test scores of the normative sample more comparable with future testtakers
Ensures consistent conditions for accurate interpretation of test scores
Descriptive statistics and analysis of test data:
Summarizes data using measures of central tendency and variability
Importance of representative normative sample:
Basis for comparison becomes questionable if the normative group is different from future testtakers
Encouragement for test developers to provide information to support recommended interpretations of the results, including the nature of the content, norms or comparison groups, and other technical evidence
Variability in descriptions of normative samples in test manuals:
Test authors may present tests in a favorable light, overlooking shortcomings in the standardization procedure
Generalizability of norms to specific groups or individuals may be questionable
Page 20:
Different ways norms can be categorized
Test manuals provide guidelines for establishing local norms
Normative sample and standardization sample are sometimes used interchangeably
New norms can be developed based on a new normative sample
New normative sample may include underrepresented groups
Types of norms
Age norms
Grade norms
National norms
National anchor norms
Local norms
Norms from a fixed reference group
Subgroup norms
Percentile norms
Age norms
Also known as age-equivalent scores
Indicate the average performance of test takers at different ages
Examples: height in inches, performance on psychological tests as a function of advancing age
Age norm tables for physical characteristics are widely accepted
Age norm tables for psychological characteristics, like intelligence, are controversial
Mental age concept used to identify the "mental age" of a test taker
Mental age concept criticized for being too broad and generalized
Grade norms
Designed to indicate the average test performance of test takers in a given school grade
Developed by administering the test to representative samples of children over consecutive grade levels
Mean or median score for each grade level is calculated
Fractions between grade levels are expressed as decimals (e.g., a grade norm of 6.5)
Grade norms have intuitive appeal and widespread application, especially for elementary school children
Grade norms do not provide information about the similarity of abilities between different grade levels
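Deriving a grade-equivalent score from per-grade medians can be sketched as a simple interpolation; the medians below are hypothetical, and fractions between grades appear as decimals.

```python
# Hypothetical median raw scores by grade on a reading test
GRADE_MEDIANS = {4: 20, 5: 26, 6: 32}

def grade_equivalent(raw):
    """Interpolate between adjacent grade medians; fractions become decimals."""
    grades = sorted(GRADE_MEDIANS)
    for lo, hi in zip(grades, grades[1:]):
        lo_med, hi_med = GRADE_MEDIANS[lo], GRADE_MEDIANS[hi]
        if lo_med <= raw <= hi_med:
            return lo + (raw - lo_med) / (hi_med - lo_med)
    raise ValueError("raw score outside the normed range")

print(grade_equivalent(29))  # 5.5: midway between the grade 5 and 6 medians
```

Note what this does not say: a 5.5 tells you where the score falls among grade medians, not that the abilities of a grade-5 and a grade-6 child are otherwise similar, which is the limitation the outline flags.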
Page 22
Grade norms and age norms
Grade norms are used to compare a student's performance with that of fellow students in the same grade.
Grade norms are only applicable to students who have completed a certain number of years and months of schooling.
Grade norms are not designed for children who are not yet in school or adults who have returned to school.
Grade norms and age norms are both types of developmental norms.
National norms
National norms are derived from a normative sample that is nationally representative.
National norms are obtained by testing large numbers of people who are representative with respect to variables of interest such as age, gender, race/ethnicity, socioeconomic status, and geographical location.
Norms may be obtained for students in every grade if the test is designed for use in schools.
Factors related to the representativeness of the school, such as funding, religious orientation, and library resources, may also be considered when selecting the normative sample.
Different tests claiming to have nationally representative samples may differ in important respects, so it is important to check the manual for details.
National anchor norms
National anchor norms provide stability to test scores by anchoring them to other test scores.
Equivalency tables (national anchor norms) are established by computing percentile norms for each test and equating the scores that fall at the same percentile.
National anchor norms must be obtained on the same sample, with each member taking both tests.
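The equating step can be sketched in a deliberately simplified form: after both tests are given to the same sample, the scores occupying the same percentile position anchor to each other. The score lists below are hypothetical.

```python
# Hypothetical scores of one sample whose members took both Test X and Test Y
test_x_scores = sorted([40, 45, 50, 55, 60, 65, 70, 75, 80, 85])
test_y_scores = sorted([12, 14, 16, 18, 20, 22, 24, 26, 28, 30])

# Pair off the ordered scores: the k-th ordered score on X sits at the
# same percentile as the k-th ordered score on Y, so they are declared
# equivalent in the anchor (equivalency) table.
equivalency_table = list(zip(test_x_scores, test_y_scores))
print(equivalency_table[4])  # (60, 20)
```

So a 60 on Test X would be read as anchored to a 20 on Test Y; real anchor norms smooth and interpolate these percentile matches rather than pairing raw lists directly.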
Page 23
Subgroup norms
A normative sample can be segmented by criteria used in selecting subjects, resulting in more narrowly defined subgroup norms.
Subgroup norms provide normative information for specific subgroups, such as age, educational level, socioeconomic level, geographic region, community type, and handedness.
Local norms
Local norms are developed by test users themselves and provide normative information for the local population's performance on a test.
Local norms may be more useful than national norms for specific purposes, such as personnel selection or counseling students.
Local norms may be developed for abbreviated forms of existing tests or when substituting subtests within a larger test.
Fixed Reference Group Scoring Systems
Fixed reference group scoring systems use the distribution of scores obtained from a fixed reference group as the basis for calculating test scores for future administrations of the test.
The SAT is an example of a test that uses a fixed reference group scoring system.
The distribution of scores from a specific year is used as a standard for converting raw scores on future administrations of the test.
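Fixed-reference-group scoring can be sketched as a linear conversion anchored to the reference group's raw-score statistics; all numbers below are hypothetical, not actual SAT parameters.

```python
REF_MEAN, REF_SD = 30.0, 8.0  # hypothetical raw-score stats of the fixed reference group

def scaled_score(raw, scale_mean=500, scale_sd=100):
    """Convert a raw score to the scale anchored to the fixed reference group."""
    z = (raw - REF_MEAN) / REF_SD
    return scale_mean + scale_sd * z

print(scaled_score(38))  # 600.0: one reference-group SD above its mean
print(scaled_score(30))  # 500.0
```

Because the anchoring statistics are frozen, a 600 on any later administration means the same thing: one reference-group standard deviation above the reference group's mean.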
Page 24:
SAT scores are scaled to make them comparable across different versions of the test
Scaled scores are calibrated based on the difficulty of the test
SAT scores are typically interpreted by local decision-making bodies with respect to local norms
College admissions officers rely on their own norms to make selection decisions
Admissions decisions are not based solely on SAT scores, but on various criteria
Page 24-25:
Test scores can be evaluated using norm-referenced or criterion-referenced approaches
Norm-referenced evaluation compares an individual's performance to others who took the test
Criterion-referenced evaluation assesses whether a test-taker meets a specific standard or criterion
Criterion-referenced tests are often used to gauge achievement or mastery
Criterion-referenced approach is used in computer-assisted education programs
Mastery tests assess whether a test-taker has mastered a specific set of materials
Cut scores in mastery testing determine the passing threshold
Critics argue that criterion-referenced approach may overlook information about performance relative to others and may not be applicable at the upper end of the knowledge/skill continuum
Norm-referenced interpretations are better suited for identifying brilliance and superior abilities
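The two interpretive frames can be sketched side by side; the group scores and cut score below are hypothetical.

```python
group_scores = [55, 62, 70, 74, 81, 88, 93]
CUT_SCORE = 75  # hypothetical mastery criterion

def criterion_referenced(score):
    """Pass/fail against a fixed criterion, regardless of how others did."""
    return "pass" if score >= CUT_SCORE else "fail"

def norm_referenced(score, group):
    """Percentile standing relative to the group that took the test."""
    below = sum(1 for s in group if s < score)
    return 100 * below / len(group)

# The same raw score can fail the criterion while still outscoring
# a sizable share of the group.
print(criterion_referenced(74), round(norm_referenced(74, group_scores), 1))  # fail 42.9
```

This is the critics' point in miniature: the criterion-referenced verdict discards the testtaker's relative standing, which the norm-referenced figure preserves.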
Page 25:
Norm-referenced and criterion-referenced approaches are not mutually exclusive and can be used in different contexts
Page 26:
Testing is ultimately normative, even in pass-fail scores
Pass-fail scores acknowledge a continuum of abilities
Norm-referenced assessments may not always have a representative norm
Criterion level for special populations may be "far from the norm"
Chicago Bulls used personality testing and behavioral interviewing for player selection
Aimed to evaluate competencies necessary for success
Used well-validated personality assessment tools from the business world
Data collected allowed for the validation of a regression formula
Information collected on athletes used to assist coaching staff
Page 27:
Culture is an important factor in test administration, scoring, and interpretation
Responsible test users should consider the appropriateness of test norms for the targeted population
Historical context should be taken into consideration in evaluation
Introduction of the term culturally informed assessment and guidelines for accomplishing it
Guidelines can be seen as themes that may be repeated in different ways
Psychometric concept of reliability