Chapter 3: A Statistics Refresher
Objective: Overview the elements of a good test
Basic assumptions about assessment will be elaborated later on in this chapter and subsequent chapters
Chapter 4: Of Tests and Testing
Importance of tests in addressing important questions in various situations
Examples of situations where tests are used:
Emergency room patient awaiting a diagnosis
Court case turning on a defendant's competence to stand trial
Hiring decisions in a large corporation
College admissions decisions
Custody dispute in a bitter divorce
Assessment professionals need confidence in the tests they use
Overview of the elements of a good test
Assumptions About Psychological Testing and Assessment
Assumption 1: Psychological traits and states exist
Traits defined as distinguishable, relatively enduring ways individuals vary from each other
States also distinguish individuals but are relatively less enduring
Psychological traits cover a wide range of characteristics
Psychological traits exist as constructs, developed to describe or explain behavior
Traits cannot be seen, heard, or touched, but their existence can be inferred from overt behavior
Controversy exists regarding how psychological traits exist
Psychological traits are relatively enduring but not expected to be manifested in behavior 100% of the time
Manifestation of traits depends on the strength of the trait and the nature of the situation
Trait manifestation is situation-dependent
Behavior context plays a role in selecting appropriate trait terms
Trait and state terms are relative and involve comparisons with the average person or specific groups
Measuring sensation seeking
Sensation seeking defined as the need for varied, novel, and complex sensations and experiences
Sensation-Seeking Scale (SSS) used to identify people high or low on this trait
The context in which behavior occurs influences the interpretation of trait terms
Paper-and-pencil measures and performance-based measures have comparative advantages
Page 6:
Influence of reference group on conclusions or judgments
Example of a psychologist administering a test of shyness to a male exotic dancer
Interpretation of test data differs based on reference group (other males in his age group or other male exotic dancers in his age group)
Assumption 2: Psychological traits and states can be quantified and measured
Most psychological traits and states vary by degree and can be quantified
Different ways of defining and looking at the same phenomenon
Importance of clear operational definition of the construct under study
Different ways of defining "aggressive behavior" in different contexts
Test developers consider item content to gauge the strength of a trait
Universe of behaviors indicative of the targeted trait
Examples of items for measuring intelligence and social judgment
Page 7:
Comparative value of test items influenced by technical considerations, definition of construct, and societal value
Developing appropriate test items, scoring methods, and interpretation of results
Cumulative scoring: the higher the total score, the stronger the presumed trait; the model is familiar from elementary school spelling tests
Assumption 3: Test-related behavior predicts non-test-related behavior
Objective of tests is to provide indication of other aspects of examinee's behavior
Use of test patterns in decision making regarding mental disorders
Even tests designed to mimic actual behaviors provide only a sample of the behavior that occurs under nontest conditions
Predicting future behavior or postdicting past behavior in forensic matters
All tests have limits and imperfections
Competent test users understand how tests were developed and appropriate circumstances for use
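The cumulative scoring model mentioned above can be sketched in a few lines; the items and responses below are hypothetical.

```python
# A sketch of cumulative scoring with hypothetical true/false items,
# keyed so that 1 indicates the targeted trait (or, on a spelling
# test, a correct answer).

def cumulative_score(responses):
    """Sum the item scores; a higher total signals a stronger trait."""
    return sum(responses)

responses = [1, 0, 1, 1, 1]  # five hypothetical items
print(cumulative_score(responses))  # 4 of a possible 5
```

As on a spelling test, the total by itself says nothing about which specific items were missed; that requires item-level inspection.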
Page 8: Assumptions About Psychological Testing and Assessment (cont.)
Test results inform decisions about education, employment, and treatment
The stakes of these decisions make accurate and reliable tests crucial
Measurement involves assigning numbers to represent the construct being measured
Assumption 4: Tests and other measurement techniques have strengths and weaknesses
Tests can provide valuable information, but they also have limitations
Test users should be aware of these limitations and interpret results accordingly
Competent test users understand how to administer a test and how to interpret its results
Codes of ethics emphasize the importance of test users being aware of test limitations
Assumption 5: Various sources of error are part of the assessment process
Error refers to factors other than what the test attempts to measure that influence performance
Error variance is the component of a test score attributable to sources other than the trait or ability measured
Sources of error variance include assessors, assessees, and measuring instruments
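The score model behind error variance can be sketched numerically; the true score, error spread, and sample size below are hypothetical.

```python
import random

random.seed(0)

TRUE_SCORE = 50   # hypothetical trait component, fixed for one testtaker
ERROR_SD = 5      # hypothetical spread contributed by error sources

# Each observed score = true score + error; influences from assessor,
# assessee, and instrument are modeled here as one random component.
observed = [TRUE_SCORE + random.gauss(0, ERROR_SD) for _ in range(1000)]

mean_obs = sum(observed) / len(observed)
error_variance = sum((x - mean_obs) ** 2 for x in observed) / len(observed)
print(round(mean_obs, 1), round(error_variance, 1))  # mean near 50, variance near 25
```

The spread of repeated observations around the fixed true score is exactly the error variance the outline describes: variation attributable to something other than the trait.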
Page 9: Assumptions about Testing and Assessment (cont.)
Assumption 6: Unfair and biased assessment procedures can be identified and reformed
Test developers strive to develop fair instruments following guidelines in the test manual
Procedures to identify and correct test bias have been developed
Fairness-related problems can arise when tests are used with people for whom they were not intended
Assumption 7: Testing and assessment offer powerful benefits to society
Tests and assessments play a crucial role in making important decisions
Despite limitations and challenges, testing and assessment provide valuable information
Testing and assessment are tools that can be used properly or improperly
Like any tool, tests can be used in a fair and appropriate manner or misused
The importance of understanding and applying these assumptions in the field of testing and assessment
Page 10:
The importance of tests and assessment procedures in society
Without tests, people could present themselves as professionals regardless of their qualifications
Nepotism may be used in hiring instead of merit
Teachers and school administrators may place children in special classes arbitrarily
The need for instruments to diagnose educational difficulties and neuropsychological impairments
The military's need to screen recruits based on key variables
The criteria for a good test
Clear instructions for administration, scoring, and interpretation
Economy in time and money
Measures what it purports to measure
Page 11:
The concept of reliability in tests
Reliability involves the consistency and precision of the measuring tool
Example of three scales weighing a 1-pound gold bar
Scale A consistently measures 1 pound, making it reliable
Scale B consistently measures 1.3 pounds, making it consistently inaccurate but still reliable
Scale C registers different weights every time, making it unreliable
The concept of validity in tests
A test is considered valid if it measures what it purports to measure
Example of a scale that consistently measures a 1-pound gold bar as 1 pound, making it both reliable and valid
Controversy exists regarding the definition of intelligence, affecting the validity of intelligence tests
Questions regarding a test's validity may focus on the test as a whole, on individual items, and on the interpretation of test scores
Other considerations for a good test
Ease of administration, scoring, and interpretation by trained examiners
Utility and actionable results that benefit testtakers or society
The importance of norms or normative data for comparing test results
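The three-scale example above can be sketched: the spread of repeated readings captures reliability, while closeness to the true weight captures validity. The readings for Scale C are hypothetical.

```python
TRUE_WEIGHT = 1.0  # the gold bar's actual weight in pounds

scale_a = [1.0, 1.0, 1.0, 1.0]  # reliable and valid
scale_b = [1.3, 1.3, 1.3, 1.3]  # reliable but not valid
scale_c = [0.8, 1.4, 1.1, 0.9]  # neither reliable nor valid (hypothetical readings)

def spread(readings):
    """Range of readings: 0 means perfectly consistent, i.e., reliable."""
    return max(readings) - min(readings)

def bias(readings):
    """Distance of the average reading from the true weight: 0 means valid."""
    return abs(sum(readings) / len(readings) - TRUE_WEIGHT)

for name, r in (("A", scale_a), ("B", scale_b), ("C", scale_c)):
    print(name, spread(r), round(bias(r), 2))
```

Scale B illustrates the key asymmetry: a measure can be perfectly reliable (zero spread) while still invalid (nonzero bias), but an unreliable measure cannot be valid.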
Everyday Psychometrics: Putting Tests to the Test
Why Use This Particular Instrument or Method?
Choice of measuring instruments
Objective of using a test and how well it meets that objective
Targeted testtakers and appropriateness for them
Definition of what the test measures
Types of data generated and other necessary data
Existence of alternate forms of the test
Are There Any Published Guidelines for the Use of This Test?
Awareness of published guidelines from professional associations
Example of guidelines for child custody evaluations
Three types of assessments relevant to custody decisions
Supplementing the selected instrument with other tools of assessment
Use of interviews, behavioral observation, and case history analysis
Is This Instrument Reliable?
Importance of reliability in measurement
Assessing reliability through reading the test's manual and published research
Measuring emotional states as an example of reliability challenges
Page 13:
Precision and Retest Reliability
Emotional states can change quickly, so low retest reliability does not necessarily mean inaccurate scores.
To estimate the reliability of emotional states, measure it twice over short intervals or with a short series of test items.
Statistical procedures can estimate the reliability of the measurement from the consistency of item responses.
Instrument Validity
Validity refers to the extent to which a test measures what it purports to measure.
Research to determine instrument validity starts with a careful reading of the test's manual and published research on the test.
Questions related to the validity of a test can be complex and colored more in shades of gray than black or white.
Interrater reliability is the degree of agreement or consistency between two or more raters evaluating the same behavior or trait.
Discrepant ratings between parents and teachers in assessing childhood behavior problems may reflect reality.
Multiple sources of data are needed to base an opinion, both for ethical mandates and meeting a burden of proof in court.
Instrument Cost-Effectiveness
Group tests can be more cost-effective than individually administered tests.
Group tests were developed by the armed services during World Wars I and II to quickly screen recruits for intelligence.
Inferences and Generalizability
In evaluating a test, consider the inferences that can be reasonably made from administering the test.
Generalizability of findings is important, and factors like test norms and cultural considerations can affect it.
Culture must be taken into account in the development, administration, scoring, and interpretation of any test.
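Interrater reliability in its simplest form, percent agreement, can be sketched; the parent and teacher ratings below are hypothetical.

```python
# Hypothetical ratings of eight children (1 = behavior problem present)
parent_ratings  = [1, 0, 1, 1, 0, 0, 1, 0]
teacher_ratings = [1, 0, 0, 1, 0, 1, 1, 0]

def percent_agreement(rater1, rater2):
    """Proportion of cases on which two raters give the same rating."""
    matches = sum(a == b for a, b in zip(rater1, rater2))
    return matches / len(rater1)

print(percent_agreement(parent_ratings, teacher_ratings))  # 0.75
```

Discrepant ratings lower this index, but as the outline notes, parent-teacher disagreement may reflect genuinely different behavior in home and school settings rather than unreliable raters.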
Page 14:
Norm-Referenced Testing and Assessment
Norm-referenced testing compares an individual testtaker's score to scores of a group of testtakers.
Norms are the test performance data of a particular group of testtakers used as a reference for evaluating individual test scores.
Normative sample is the group of people whose performance on a test is analyzed for reference.
Norming is the process of deriving norms, and it can be modified to describe specific types of norm derivation.
Race norming, norming on the basis of race or ethnic background, was once practiced but is now outlawed.
Norming a test with a nationally representative sample can be expensive, so some test manuals provide user norms or program norms based on descriptive statistics.
Sampling to Develop Norms
Standardization or test standardization is the process of administering a test to a representative sample of testtakers for establishing norms.
Sampling is necessary to understand how norms are derived through formal sampling methods.
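Using norms as a reference can be sketched by locating a raw score within a normative sample; the scores below are hypothetical.

```python
# Raw scores of a small hypothetical normative sample
normative_sample = [61, 64, 68, 70, 72, 75, 78, 80, 85, 91]

def percentile_rank(score, norms):
    """Percentage of the normative sample scoring below the given raw score."""
    below = sum(1 for s in norms if s < score)
    return 100 * below / len(norms)

print(percentile_rank(78, normative_sample))  # 60.0
```

A raw score of 78 means little by itself; against this normative sample it places the testtaker above 60% of the reference group.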
Page 15: How "Standard" Is Standard in Measurement?
Introduction
The foot, a unit of distance measurement in the United States, originated from the length of a British king's foot used as a standard.
Different localities used different "feet" for measurement in the past.
There is still confusion in the field of psychological testing and assessment regarding the meaning of terms like standard and standardization.
Definition of "Standard"
The word standard can be a noun or an adjective with multiple definitions.
As a noun, standard refers to something that others are compared to or evaluated against.
Example: A test with exceptional psychometric properties is considered "the standard against which all similar tests are judged."
Example: The Standards for Educational and Psychological Testing is a well-known manual that sets forth ideals of professional behavior.
As an adjective, standard refers to what is usual, generally accepted, or commonly employed.
Example: The standard way of conducting a measurement procedure is contrasted with newer or experimental procedures.
Example: Researchers studying alcoholism have adopted the concept of a standard drink to better understand and quantify alcohol consumption patterns.
The Verb "to Standardize"
The verb "to standardize" refers to making or transforming something into a basis of comparison or judgment.
Example: Researchers standardize an alcoholic beverage that contains 15 milliliters of alcohol as a "standard drink."
Page 15: Ben's Cold Cut Preference Test (CCPT)
Ben owns a small "deli boutique" that sells 10 varieties of private-label cold cuts.
Ben created his own "standardized test" called the Cold Cut Preference Test (CCPT).
The CCPT consists of two questions: "What would you like today?" and "How much of that would you like?"
Ben trains his wife on test administration and scoring of the CCPT.
The question is raised whether the CCPT qualifies as a "standardized test."
Page 15: Figure 1
Figure 1 shows Ben's Cold Cut Preference Test (CCPT).
Page 16: How "Standard" Is Standard in Measurement?
Standardizing tests involves developing replicable procedures for administering, scoring, and interpreting the test.
Test items must be clearly specified, along with rules for administering and scoring them.
Traditionally, standardized tests also have norms and come with manuals that provide all the necessary information for responsible test use.
Standardized tests should be administered in a standardized manner, with little deviation from examiner to examiner.
Test manuals should contain detailed scoring guidelines and examples of correct, incorrect, or partially correct responses.
The test manual should also provide guidelines for interpreting the test results, including appropriate and inappropriate generalizations.
Page 17:
The term "standard score" is sometimes reserved for z scores, while other types of standardized scores are referred to as "standardized scores."
In psychological testing and assessment, there are various types of standard errors, such as the standard error of measurement, standard error of estimate, standard error of the mean, and standard error of the difference.
Critical thinking is encouraged when encountering the word "standard" in any context, as it may not always be as standard as expected.
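Two of those standard quantities can be sketched: the z score, and the standard error of measurement, whose formula sd * sqrt(1 - reliability) comes from classical reliability theory; the mean, SD, and reliability values below are hypothetical.

```python
import math

def z_score(raw, mean, sd):
    """Standard (z) score: distance from the mean in standard-deviation units."""
    return (raw - mean) / sd

def standard_error_of_measurement(sd, reliability):
    """SEM = sd * sqrt(1 - reliability), from classical test theory."""
    return sd * math.sqrt(1 - reliability)

print(z_score(115, 100, 15))                           # 1.0
print(round(standard_error_of_measurement(15, 0.91), 2))  # 4.5
```

The SEM shrinks as reliability rises: a perfectly reliable test (reliability = 1) would have zero measurement error around any obtained score.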
Sampling
Test developers target a defined population for which the test is designed.
It is usually impossible, impractical, or expensive to administer the test to everyone in the population, so a sample is used.
The sample should be representative of the whole population.
Subgroups within the population may need to be proportionately represented in the sample.
Page 18:
Different types of sampling procedures:
Stratified sampling:
Includes people representing different subgroups of the population
Helps prevent sampling bias
Aids in the interpretation of findings
Becomes stratified-random sampling when every member of each stratum has an equal chance of being selected
Purposive sampling:
Selects a sample believed to be representative of the population
Used by manufacturers to test products in specific markets
Can lead to non-representative samples or biased results
Incidental sampling or convenience sampling:
Uses the most convenient sample available
Often due to budgetary limitations or other constraints
Generalization of findings must be made with caution
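Proportional stratified sampling can be sketched as below; the population, strata, and sizes are hypothetical.

```python
import random

random.seed(1)

# Hypothetical population of 1,000 cases tagged by stratum
population = [("urban", i) for i in range(600)] + [("rural", i) for i in range(400)]

def stratified_sample(pop, n):
    """Draw n cases so each stratum keeps its share of the population."""
    strata = {}
    for stratum, case in pop:
        strata.setdefault(stratum, []).append(case)
    sample = []
    for stratum, cases in strata.items():
        k = round(n * len(cases) / len(pop))          # proportional allocation
        sample += [(stratum, c) for c in random.sample(cases, k)]
    return sample

sample = stratified_sample(population, 100)
print(len(sample), sum(1 for stratum, _ in sample if stratum == "urban"))  # 100 60
```

With many strata, per-stratum rounding can make the totals drift from n, so real norming plans adjust the allocation; the point here is that each subgroup retains its population proportion, guarding against sampling bias.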
Page 19:
Exclusionary criteria for normative samples:
Persons tested on any intelligence measure in the six months prior to the testing
Persons not fluent in English or who are primarily nonverbal
Persons with uncorrected visual impairment or hearing loss
Persons with upper-extremity disability that affects motor performance
Persons currently admitted to a hospital or mental or psychiatric facility
Persons currently taking medication that might depress test performance
Persons previously diagnosed with any physical condition or illness that might depress test performance
Importance of standard set of instructions and conditions for giving the test:
Makes test scores of the normative sample more comparable with future testtakers
Ensures consistent conditions for accurate interpretation of test scores
Descriptive statistics and analysis of test data:
Summarizes data using measures of central tendency and variability
Importance of representative normative sample:
Basis for comparison becomes questionable if the normative group is different from future testtakers
Encouragement for test developers to provide information to support recommended interpretations of the results, including the nature of the content, norms or comparison groups, and other technical evidence
Variability in descriptions of normative samples in test manuals:
Test authors may present tests in a favorable light, overlooking shortcomings in the standardization procedure
Generalizability of norms to specific groups or individuals may be questionable
Page 20:
Different ways norms can be categorized
Test manuals provide guidelines for establishing local norms
Normative sample and standardization sample are sometimes used interchangeably
New norms can be developed based on a new normative sample
New normative sample may include underrepresented groups
Types of norms
Age norms
Grade norms
National norms
National anchor norms
Local norms
Norms from a fixed reference group
Subgroup norms
Percentile norms
Age norms
Also known as age-equivalent scores
Indicate the average performance of test takers at different ages
Examples: height in inches, performance on psychological tests as a function of advancing age
Age norm tables for physical characteristics are widely accepted
Age norm tables for psychological characteristics, like intelligence, are controversial
Mental age concept used to identify the "mental age" of a test taker
Mental age concept criticized for being too broad and generalized
Grade norms
Designed to indicate the average test performance of test takers in a given school grade
Developed by administering the test to representative samples of children over consecutive grade levels
Mean or median score for each grade level is calculated
Fractions between grade levels are expressed as decimals (e.g., a grade norm of 6.5)
Grade norms have intuitive appeal and widespread application, especially for elementary school children
Grade norms do not provide information about the similarity of abilities between different grade levels
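Deriving a grade-equivalent score from per-grade medians can be sketched as a simple interpolation; the medians below are hypothetical, and fractions between grades appear as decimals.

```python
# Hypothetical median raw scores by grade on a reading test
GRADE_MEDIANS = {4: 20, 5: 26, 6: 32}

def grade_equivalent(raw):
    """Interpolate between adjacent grade medians; fractions become decimals."""
    grades = sorted(GRADE_MEDIANS)
    for lo, hi in zip(grades, grades[1:]):
        lo_med, hi_med = GRADE_MEDIANS[lo], GRADE_MEDIANS[hi]
        if lo_med <= raw <= hi_med:
            return lo + (raw - lo_med) / (hi_med - lo_med)
    raise ValueError("raw score outside the normed range")

print(grade_equivalent(29))  # 5.5: midway between the grade 5 and 6 medians
```

Note what this does not say: a 5.5 tells you where the score falls among grade medians, not that the abilities of a grade-5 and a grade-6 child are otherwise similar, which is the limitation the outline flags.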
Page 22
Grade norms and age norms
Grade norms are used to compare a student's performance with that of fellow students in the same grade.
Grade norms are only applicable to students who have completed a certain number of years and months of schooling.
Grade norms are not designed for children who are not yet in school or adults who have returned to school.
Grade norms and age norms are both types of developmental norms.
National norms
National norms are derived from a normative sample that is nationally representative.
National norms are obtained by testing large numbers of people who are representative with respect to variables of interest such as age, gender, race/ethnicity, socioeconomic status, and geographical location.
Norms may be obtained for students in every grade if the test is designed for use in schools.
Factors related to the representativeness of the school, such as funding, religious orientation, and library resources, may also be considered when selecting the normative sample.
Different tests claiming to have nationally representative samples may differ in important respects, so it is important to check the manual for details.
National anchor norms
National anchor norms provide stability to test scores by anchoring them to other test scores.
Equivalency tables (national anchor norms) are established by computing percentile norms for each test and equating the scores that fall at the same percentile.
National anchor norms must be obtained on the same sample, with each member taking both tests.
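The equating step can be sketched in a deliberately simplified form: after both tests are given to the same sample, the scores occupying the same percentile position anchor to each other. The score lists below are hypothetical.

```python
# Hypothetical scores of one sample whose members took both Test X and Test Y
test_x_scores = sorted([40, 45, 50, 55, 60, 65, 70, 75, 80, 85])
test_y_scores = sorted([12, 14, 16, 18, 20, 22, 24, 26, 28, 30])

# Pair off the ordered scores: the k-th ordered score on X sits at the
# same percentile as the k-th ordered score on Y, so they are declared
# equivalent in the anchor (equivalency) table.
equivalency_table = list(zip(test_x_scores, test_y_scores))
print(equivalency_table[4])  # (60, 20)
```

So a 60 on Test X would be read as anchored to a 20 on Test Y; real anchor norms smooth and interpolate these percentile matches rather than pairing raw lists directly.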
Page 23
Subgroup norms
A normative sample can be segmented by criteria used in selecting subjects, resulting in more narrowly defined subgroup norms.
Subgroup norms provide normative information for specific subgroups, such as age, educational level, socioeconomic level, geographic region, community type, and handedness.
Local norms
Local norms are developed by test users themselves and provide normative information for the local population's performance on a test.
Local norms may be more useful than national norms for specific purposes, such as personnel selection or counseling students.
Local norms may be developed for abbreviated forms of existing tests or when substituting subtests within a larger test.
Fixed Reference Group Scoring Systems
Fixed reference group scoring systems use the distribution of scores obtained from a fixed reference group as the basis for calculating test scores for future administrations of the test.
The SAT is an example of a test that uses a fixed reference group scoring system.
The distribution of scores from a specific year is used as a standard for converting raw scores on future administrations of the test.
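Fixed-reference-group scoring can be sketched as a linear conversion anchored to the reference group's raw-score statistics; all numbers below are hypothetical, not actual SAT parameters.

```python
REF_MEAN, REF_SD = 30.0, 8.0  # hypothetical raw-score stats of the fixed reference group

def scaled_score(raw, scale_mean=500, scale_sd=100):
    """Convert a raw score to the scale anchored to the fixed reference group."""
    z = (raw - REF_MEAN) / REF_SD
    return scale_mean + scale_sd * z

print(scaled_score(38))  # 600.0: one reference-group SD above its mean
print(scaled_score(30))  # 500.0
```

Because the anchoring statistics are frozen, a 600 on any later administration means the same thing: one reference-group standard deviation above the reference group's mean.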
Page 24:
SAT scores are scaled to make them comparable across different versions of the test
Scaled scores are calibrated based on the difficulty of the test
SAT scores are typically interpreted by local decision-making bodies with respect to local norms
College admissions officers rely on their own norms to make selection decisions
Admissions decisions are not based solely on SAT scores, but on various criteria
Page 24-25:
Test scores can be evaluated using norm-referenced or criterion-referenced approaches
Norm-referenced evaluation compares an individual's performance to others who took the test
Criterion-referenced evaluation assesses whether a test-taker meets a specific standard or criterion
Criterion-referenced tests are often used to gauge achievement or mastery
Criterion-referenced approach is used in computer-assisted education programs
Mastery tests assess whether a test-taker has mastered a specific set of materials
Cut scores in mastery testing determine the passing threshold
Critics argue that criterion-referenced approach may overlook information about performance relative to others and may not be applicable at the upper end of the knowledge/skill continuum
Norm-referenced interpretations are better suited for identifying brilliance and superior abilities
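The two interpretive frames can be sketched side by side; the group scores and cut score below are hypothetical.

```python
group_scores = [55, 62, 70, 74, 81, 88, 93]
CUT_SCORE = 75  # hypothetical mastery criterion

def criterion_referenced(score):
    """Pass/fail against a fixed criterion, regardless of how others did."""
    return "pass" if score >= CUT_SCORE else "fail"

def norm_referenced(score, group):
    """Percentile standing relative to the group that took the test."""
    below = sum(1 for s in group if s < score)
    return 100 * below / len(group)

# The same raw score can fail the criterion while still outscoring
# a sizable share of the group.
print(criterion_referenced(74), round(norm_referenced(74, group_scores), 1))  # fail 42.9
```

This is the critics' point in miniature: the criterion-referenced verdict discards the testtaker's relative standing, which the norm-referenced figure preserves.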
Page 25:
Norm-referenced and criterion-referenced approaches are not mutually exclusive and can be used in different contexts
Page 26:
Testing is ultimately normative, even in pass-fail scores
Pass-fail scores acknowledge a continuum of abilities
Norm-referenced assessments may not always have a representative norm
Criterion level for special populations may be "far from the norm"
Chicago Bulls used personality testing and behavioral interviewing for player selection
Aimed to evaluate competencies necessary for success
Used well-validated personality assessment tools from the business world
Data collected allowed for the validation of a regression formula
Information collected on athletes used to assist coaching staff
Page 27:
Culture is an important factor in test administration, scoring, and interpretation
Responsible test users should consider the appropriateness of test norms for the targeted population
Historical context should be taken into consideration in evaluation
Introduction of the term culturally informed assessment and guidelines for accomplishing it
Guidelines can be seen as themes that may be repeated in different ways
Psychometric concept of reliability