The Test Development Process
Test Conceptualization
- Defines the construct(s) being measured and explores how they resemble or differ from constructs measured by existing tests.
- Defines the objective of the test and explores how its goals resemble or differ from those of other tests measuring the same construct.
- Determines the need for the test by reviewing existing tests and identifying its unique contribution.
- Determines who will use the test (e.g., clinicians, educators).
- Determines who will take the test, including age range and special considerations.
- Considers what content the test will cover, including culture-specific aspects.
- Determines how the test will be administered (individually, in groups, manually, computer-based).
- Determines the ideal format for the test (true-false, multiple-choice, etc.).
- Considers developing multiple forms of the test via cost-benefit analysis.
- Determines special training required for test users, including background, qualifications, and restrictions.
- Considers the types of responses required of test takers, accounting for abilities and disabilities.
- Considers who benefits from the test and what social benefits can be derived.
- Evaluates the potential for harm and includes safeguards in the testing procedure.
- Determines how meaning will be attributed to scores, comparing to a norm or criterion group (norm-referenced or criterion-referenced).
Test Construction
- Guided by practical and scientific rules applied before, during, and after the construction of each item.
- Theoretical aspects:
- Concepts on which the test is constructed.
- Goals and objectives of the test.
- General components, parts, and sections of the test.
- Theoretical and technical foundations for writing test items.
Scaling
- Process of setting rules for assigning numbers in measurement.
- Designing and calibrating a measuring device.
- Assigning numbers to different amounts of the trait, attribute, or characteristic being measured.
- Scales as instruments used to measure.
Types of Scales
- Age-based scale: performance as a function of age.
- Grade-based scale: performance as a function of grade.
- Stanine scale: raw scores transformed into scores ranging from 1 to 9.
Scale Categorization
- Unidimensional vs. multidimensional.
- Comparative vs. categorical.
Comparative vs. Categorical Scales
- Comparative: test takers' answers are compared to each other.
- Categorical: sorting items into categories (e.g., never justified, sometimes justified, always justified).
Scaling Methods
- Rating scale: grouping of words, statements, or symbols to indicate the strength of a trait, attitude, or emotion.
- Summative scale: final test scores obtained by summing the ratings across all items.
- Likert scale: used extensively in psychology to scale attitudes, presenting five (sometimes seven) alternative responses on an agree–disagree or approve–disapprove continuum.
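As an illustration of summative (Likert) scoring, here is a minimal Python sketch. The item names, response range, and reverse-keying below are hypothetical, not taken from any particular instrument.

```python
# Minimal sketch of summative (Likert) scoring with hypothetical items.
# Each response is an integer 1-5 (strongly disagree .. strongly agree);
# reverse-keyed items are flipped before summing.

def score_likert(responses, reverse_keyed=(), points=5):
    """Sum item ratings into a total scale score.

    responses: dict mapping item id -> rating (1..points)
    reverse_keyed: item ids whose rating must be flipped first
    """
    total = 0
    for item, rating in responses.items():
        if item in reverse_keyed:
            rating = points + 1 - rating  # flip 1<->5, 2<->4, etc.
        total += rating
    return total

# Example: three hypothetical attitude items, item "q2" reverse-keyed.
print(score_likert({"q1": 4, "q2": 2, "q3": 5}, reverse_keyed={"q2"}))  # 13
```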
Morally Debatable Behaviors Scale-Revised (MDBS-R; Katz et al., 1994)
- Assesses beliefs, strength of convictions, and individual differences in moral tolerance.
- Contains 30 items with descriptions of moral issues or behaviors.
- Test takers express their opinion on a 10-point scale from “never justified” to “always justified.”
Writing Items
- Determine the range of content the items should cover.
- Select the types of item formats to be employed.
- Determine the number of items to be written in total and for each content area.
Item Pool
- The reservoir or well from which items will be drawn for the final version of the test.
- Initial draft contains approximately twice the number of items that the final version of the test will contain.
Item Format
- Variables such as the form, plan, structure, arrangement, and layout of individual test items.
- Selected-response format: requires test takers to select a response from a set of alternatives.
- Constructed-response format: requires test takers to supply or create the correct answer.
Item Example
- Stem: "A psychological test, an interview, and a case study are:"
- Correct alternative: a. psychological assessment tools
- Distractors: b. standardized behavioral samples, c. reliable assessment instruments, d. theory-linked measures
Scoring Items
- Cumulative model: higher score indicates a higher level of the measured ability, trait, or characteristic.
- Class or category scoring: placement in a particular class or category based on pattern of responses.
- Ipsative scoring: comparing a test taker’s score on one scale within a test to another scale within the same test.
Ipsative Scoring
- Allows only intra-individual conclusions about the test taker.
- Example: “John’s need for achievement is higher than his need for affiliation.”
- Example Test: Edwards Personal Preference Schedule (EPPS) - designed to measure the relative strength of different psychological needs
- Forced-choice item: respondents would indicate which is “more true” of themselves:
- “I feel depressed when I fail at something.”
- “I feel nervous when giving a talk before a group.”
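A minimal sketch of what ipsative interpretation means in practice; the scale names and scores below are illustrative and do not reflect actual EPPS scoring.

```python
# Sketch of ipsative interpretation: scale scores are compared only
# *within* one test taker, never across people. Scale names are
# illustrative (EPPS-style needs), not the actual EPPS scoring key.

def rank_within_person(scale_scores):
    """Return scales ordered from the person's strongest to weakest need."""
    return sorted(scale_scores, key=scale_scores.get, reverse=True)

john = {"achievement": 18, "affiliation": 12, "autonomy": 15}
print(rank_within_person(john))  # ['achievement', 'autonomy', 'affiliation']
# Legitimate conclusion: John's need for achievement is higher than his
# need for affiliation. Comparing John's 18 with another person's 18 is not.
```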
Face and Content Validity Stage
- Items are referred to a 3-member committee: a specialist in the content area, a specialist in measurement, and an experienced educator.
- Each item is either accepted as is, accepted after modifications, or rejected as invalid.
Committee Judgement
- Based on a set questionnaire, the committee judges each item on various dimensions, including the nature of the item, item difficulty, item bias, and item quality.
- Justifications have to be provided if an item is deemed invalid.
Review
- Computerized items are reviewed by four testing experts to verify:
- Strict implementation of the referring committee's comments.
- Accuracy of numbers and graphs.
- No grammatical or spelling errors or misprints.
- Full compliance of items with the test's pre-set standards and regulations.
- Answer key is provided for all test items.
- Screening out the items that seem too complicated or might take too much time to answer.
Test Try Out
- Item Trial (pilot testing):
- A set of test questions is first administered to a small group of people deemed to be representative of the population for which the final test is intended.
Item Analysis
- Items are statistically analyzed so that the valid ones are selected for the test, while the invalid ones are rejected.
- It helps in explaining why a test shows a certain level of reliability and validity.
Item Analysis Tools
- Index of the item’s difficulty.
- Index of the item’s reliability.
- Index of the item’s validity.
- Index of item discrimination.
Item Difficulty Index
- Indicates how easy or difficult an item is.
- Appropriate for maximal performance tests such as achievement and aptitude tests.
- Requires that test items can be scored as correct and incorrect.
- Obtained by calculating the proportion of test takers who answered the item correctly.
- The value of an item-difficulty index can range from 0 (if no one got the item right) to 1 (if everyone got the item right).
Formula
- $p = \frac{\text{number of examinees answering the item correctly}}{N}$
- p = item difficulty for a particular item
- N = total number of people taking the test
Example
- If 50 of the 100 examinees in a Math Proficiency Test answered item 1 correctly, then the item-difficulty index for this item would be:
- $p = \frac{50}{100} = 0.5$
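The same computation as a short Python sketch, using hypothetical 0/1 item scores:

```python
# Sketch: item-difficulty index p as the proportion answering correctly.
# Scores are hypothetical (1 = correct, 0 = incorrect).

def item_difficulty(item_scores):
    return sum(item_scores) / len(item_scores)

# 50 of 100 examinees answered item 1 correctly:
scores = [1] * 50 + [0] * 50
print(item_difficulty(scores))  # 0.5
```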
Evaluation of Item Difficulty
- Above 0.90: very easy item
- Below 0.20: very difficult item
Alternative Formula
- $p = \frac{R}{T}$
- p = item difficulty index
- R = the number of correct responses to the test item
- T = the total number of responses (both correct and incorrect)
Optimal Item Difficulty
- The optimal average item difficulty is usually the midpoint between 1.00 and the chance success proportion (probability of answering correctly by random guessing).
- The midpoint representing the optimal item difficulty is obtained by summing the chance success proportion and 1.00 and then dividing the sum by 2.
Examples of Optimal Item Difficulty
- True-False item: the optimal item difficulty is halfway between 0.50 and 1.00, or 0.75.
- Five-option multiple-choice item: the probability of guessing correctly is 0.20; the optimal item difficulty is 0.60.
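A one-function sketch of this midpoint calculation:

```python
# Sketch: optimal average item difficulty as the midpoint between the
# chance-success proportion and 1.00.

def optimal_difficulty(n_options):
    chance = 1 / n_options          # probability of guessing correctly
    return (chance + 1.0) / 2

print(optimal_difficulty(2))  # true-false: 0.75
print(optimal_difficulty(5))  # five-option multiple choice: 0.6
```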
Item-Difficulty vs. Item-Endorsement Index
- Achievement testing: item-difficulty index (percent of people passing the item).
- Personality testing: item-endorsement index (percent of people who agreed with the item).
Floor and Ceiling Effects
- Floor effect: diminished utility for distinguishing test takers at the low end of the measured attribute.
- Ceiling effect: diminished utility for distinguishing test takers at the high end of the measured attribute.
Reliability
- Refers to the stability or consistency of the measurement.
- Includes the notion that each individual measurement has an element of error.
- Error variance: the component of test score attributable to sources other than the trait/ability measured.
- Standard Error of Measurement (SEM): amount of error inherent in measurement.
- Reliable test: test takers fall in the same positions relative to each other across measurements.
Classical Test Score Theory
- Test scores reflect two factors:
- True characteristics: stable characteristics of the individual.
- Random measurement error: chance features of the individual.
Formula
- $X = T + E$
- X = a person's test score (raw score)
- T = a person's stable characteristic/knowledge (true score)
- E = chance events (error score)
- In a reliable test, the value of E should be close to 0 and the value of T should be close to the actual test score X.
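A small simulation can make the model concrete. This sketch (with arbitrary means and standard deviations) generates true scores T, adds random error E, and checks that reliability, defined in this model as var(T)/var(X), is high when the error variance is small:

```python
# Sketch: simulating X = T + E. With small error variance, observed
# scores X stay close to true scores T and the test is reliable.
import random

random.seed(0)
true_scores = [random.gauss(100, 15) for _ in range(1000)]   # T
errors      = [random.gauss(0, 3)    for _ in range(1000)]   # E, mean 0
observed    = [t + e for t, e in zip(true_scores, errors)]   # X = T + E

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability here is var(T)/var(X): roughly 15^2 / (15^2 + 3^2) ≈ 0.96.
print(var(true_scores) / var(observed))
```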
Measuring Reliability
- Test-Retest reliability:
- Administering a test at 2 different times.
- Measures relatively enduring characteristics.
- Coefficient of stability: interval between testing is greater than 6 months.
- Parallel/Alternate forms reliability:
- Compares 2 equivalent forms of a test that measure the same attribute.
- Two forms of tests use different items, but the rules used to select items of a particular difficulty level are the same.
- Coefficient of equivalence.
- Internal Consistency/Inter Item Consistency:
- Refers to the degree of correlation among all the items on a scale.
- Calculated using a single administration of a single test form.
- Useful for gauging the homogeneity of the test.
- Methods used: Split-half reliability, KR-20 (Kuder-Richardson Formula) & Coefficient Alpha.
Split-Half Reliability
- A test is given and divided into halves that are scored separately.
- The results of one half of the test are then compared with the results of the other half.
- A common approach is the odd-even system, whereby one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items.
- Internal consistency is estimated by correlating the two half-scores and applying the Spearman-Brown formula.
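A minimal sketch of the odd-even split with the Spearman-Brown correction; the 0/1 item matrix is hypothetical:

```python
# Sketch: odd-even split-half reliability with Spearman-Brown correction.
# Rows = examinees, columns = hypothetical 0/1 item scores.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_matrix):
    odd  = [sum(row[0::2]) for row in item_matrix]  # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return (2 * r_half) / (1 + r_half)              # Spearman-Brown correction

data = [[1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 0], [1, 1, 1, 1]]
print(split_half_reliability(data))
```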
KR-20 (Kuder-Richardson Formula)
- For dichotomous items (right (1) or wrong (0) format).
Coefficient Alpha
- For non-dichotomous items, or when there is no right or wrong answer.
- Equivalent to the mean of all possible split-half correlations.
- Used when the two halves of the test have unequal variances.
- Ranges from 0 (absolutely no similarity) to 1 (perfectly identical).
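A minimal sketch of coefficient alpha using the standard formula $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum \text{item variances}}{\text{total-score variance}}\right)$; the rating data are hypothetical:

```python
# Sketch: coefficient (Cronbach's) alpha from an examinee x item matrix.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(item_matrix):
    k = len(item_matrix[0])                       # number of items
    items = list(zip(*item_matrix))               # columns = items
    item_vars = sum(variance(list(col)) for col in items)
    totals = [sum(row) for row in item_matrix]    # each examinee's total
    return (k / (k - 1)) * (1 - item_vars / variance(totals))

data = [[3, 4, 3], [2, 2, 1], [5, 4, 5], [4, 5, 4]]  # hypothetical ratings
print(cronbach_alpha(data))
```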
Cronbach's Alpha Rule of Thumb
- $\alpha \geq 0.9$: Excellent
- $0.8 \leq \alpha < 0.9$: Good
- $0.7 \leq \alpha < 0.8$: Acceptable
- $0.6 \leq \alpha < 0.7$: Questionable
- $0.5 \leq \alpha < 0.6$: Poor
- $\alpha < 0.5$: Unacceptable
Inter-rater Reliability/Inter-scorer Reliability
- The degree of agreement/consistency between 2 or more scorers/judges/raters with regard to a particular measure.
- A judge rates the examinees' answers.
- Used for creativity tests or projective tests.
- Methods: Kappa statistics (assess interrater agreement), Cohen’s Kappa (2 raters) & Fleiss’ Kappa (3 or more raters)
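For two raters, Cohen's kappa can be computed with scikit-learn's cohen_kappa_score (assuming the library is available); the ratings below are hypothetical:

```python
# Sketch: inter-rater agreement between two raters via Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 2, 3, 3, 2, 1, 2, 3]
rater_b = [1, 2, 3, 2, 2, 1, 2, 3]
# 1.0 = perfect agreement; 0 = agreement expected by chance alone.
print(cohen_kappa_score(rater_a, rater_b))
# For three or more raters, Fleiss' kappa applies
# (e.g., statsmodels.stats.inter_rater.fleiss_kappa).
```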
Item Reliability Index
- Provides an indication of the internal consistency of a test.
- The higher this index, the greater the test’s internal consistency.
- Related tools: factor analysis and inter-item consistency.
- Factor analysis is a statistical tool useful in determining whether items on a test appear to be measuring the same thing(s).
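A sketch under the common textbook definition of this index (the item's standard deviation multiplied by its item-total correlation); the data are hypothetical:

```python
# Sketch (assumed textbook definition): item-reliability index =
# item standard deviation x item-total correlation.

def item_reliability_index(item_scores, total_scores):
    n = len(item_scores)
    mi = sum(item_scores) / n
    mt = sum(total_scores) / n
    si = (sum((x - mi) ** 2 for x in item_scores) / n) ** 0.5
    st = (sum((y - mt) ** 2 for y in total_scores) / n) ** 0.5
    r = sum((x - mi) * (y - mt)
            for x, y in zip(item_scores, total_scores)) / (n * si * st)
    return si * r

item  = [1, 0, 1, 1, 0, 1]   # hypothetical 0/1 item scores
total = [9, 4, 8, 7, 5, 9]   # hypothetical total test scores
print(item_reliability_index(item, total))
```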
Summary of Reliability Types
| Type of Reliability | Number of Testing Sessions | Number of Test Forms | Sources of Error Variance | Statistical Procedures |
|---|---|---|---|---|
| Test-retest | 2 | 1 | Administration | Pearson r or Spearman rho |
| Alternate-forms | 1 or 2 | 2 | Test construction or administration | Pearson r or Spearman rho |
| Internal consistency | 1 | 1 | Test construction | Pearson r between equivalent test halves with Spearman-Brown correction, Kuder-Richardson formulas for dichotomous items, or coefficient alpha for multipoint items |
| Inter-scorer | 1 | 1 | Scoring and interpretation | Pearson r or Spearman rho |
Validity
- Judgement or estimate of how well a test measures what it purports to measure in a particular context.
- Judgement based on evidence about the appropriateness of inferences drawn from test scores.
- Validation
- Process of gathering and evaluating evidence about validity
- Local Validation
- Validation process carried out when test users plan to alter the format, instructions, language, or content of the test.
Types of Validity
- Face Validity
- When the items look like they measure what they are supposed to measure
- Content Validity
- The adequacy with which the test samples the conceptual domain it is designed to cover.
- A qualitative process in which test items are compared to a detailed description of the test domain.
- Important whenever a test is used to make inferences about the broader domain of knowledge/skills represented by the sample of items.
- Criterion Validity
- How well a test score corresponds to a particular criterion measure.
- Predictive validity:
- An index of the degree to which a test score predicts some criterion measure
- Concurrent validity:
- An index of the degree to which a test score is related to some criterion measure obtained at the same time.
- Construct validity
- A test measures what it is intended to measure.
- Divergent/discriminant validity: the measure shows low correlation with measures of theoretically unrelated constructs.
- Convergent validity: the measure correlates highly with other measures of the same construct.
Item Validity Index
- A statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure.
- The higher the item-validity index, the greater the test’s criterion-related validity.
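Under the parallel textbook definition (item standard deviation multiplied by the item-criterion correlation), a minimal sketch with hypothetical data:

```python
# Sketch (assumed textbook definition): item-validity index =
# item standard deviation x item-criterion correlation.

def item_validity_index(item_scores, criterion_scores):
    n = len(item_scores)
    mi = sum(item_scores) / n
    mc = sum(criterion_scores) / n
    si = (sum((x - mi) ** 2 for x in item_scores) / n) ** 0.5
    sc = (sum((y - mc) ** 2 for y in criterion_scores) / n) ** 0.5
    r = sum((x - mi) * (y - mc)
            for x, y in zip(item_scores, criterion_scores)) / (n * si * sc)
    return si * r

item      = [1, 0, 1, 1, 0, 1]        # hypothetical 0/1 item scores
criterion = [78, 62, 75, 70, 64, 80]  # hypothetical criterion measure
print(item_validity_index(item, criterion))
```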
Item Discrimination Index
- Measures of item discrimination indicate how adequately an item separates or discriminates between high scorers and low scorers on an entire test.
- A multiple-choice item on an achievement test is a good item if most of the high scorers answer correctly and most of the low scorers answer incorrectly.
Item-discrimination Index (d)
- A measure of item discrimination, symbolized by a lowercase italic d.
- This estimate of item discrimination, in essence, compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores.
Formula
- $d = \frac{U - L}{n}$
- U = number of examinees in the upper group who answered the item correctly
- L = number of examinees in the lower group who answered the item correctly
- n = number of examinees in each group
- The item-discrimination index is a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly.
- The higher the value of d, the greater the proportion of high scorers, relative to low scorers, answering the item correctly.
- A negative d value on a particular item indicates that low-scoring examinees are more likely to answer the item correctly than high-scoring examinees.
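A sketch of computing d; taking the upper and lower groups as the top and bottom 27% of total scorers is one common convention, assumed here:

```python
# Sketch: item-discrimination index d = (U - L) / n, with the upper and
# lower groups taken as the top and bottom 27% of total scorers.

def discrimination_index(item_correct, total_scores, fraction=0.27):
    """item_correct: 1/0 per examinee; total_scores: total score per examinee."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    n = max(1, int(len(order) * fraction))        # examinees per extreme group
    lower, upper = order[:n], order[-n:]
    u = sum(item_correct[i] for i in upper)       # correct answers, upper group
    l = sum(item_correct[i] for i in lower)       # correct answers, lower group
    return (u - l) / n

item   = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]           # hypothetical 0/1 item scores
totals = [95, 90, 88, 60, 55, 85, 40, 35, 92, 30]  # hypothetical total scores
print(discrimination_index(item, totals))  # positive: item favors high scorers
```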
Discrimination Index Grade
- Above 0.39: Excellent; preserve
- 0.30–0.39: Good; possibilities for enhancement
- 0.20–0.29: Average; needs verification/review
- 0.00–0.19: Poor; reject or review in depth
- Below 0.00: Worst; remove
Analysis of Item Alternatives
- By charting the number of testtakers in the U and L groups who selected each alternative, the test developer can get an idea of the effectiveness of a distractor by means of a simple eyeball test.
Other Considerations in Item Analysis
- Guessing
- Item fairness
- A biased test item is an item that favors one particular group of examinees in relation to another when differences in group ability are controlled.
- Speed tests
- Item analyses of tests taken under speed conditions yield misleading or uninterpretable results. The closer an item is to the end of the test, the more difficult it may appear to be. This is because testtakers simply may not get to items near the end of the test before time runs out.
Qualitative Item Analysis
- A general term for various nonstatistical procedures designed to explore how individual test items work.
- Qualitative methods are techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.
Think Aloud Test Administration
- An innovative approach to cognitive assessment entails having respondents verbalize thoughts as they occur.
Expert Panels
- Expert panels may also provide qualitative analyses of test items.
Sensitivity Review
- A study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations.
Errors in Behavioral Assessment
- Rating errors: judgement errors resulting from the intentional or unintentional misuse of a rating scale.
- Halo effect: rating a person too favorably overall because of a generalized positive impression of one aspect.
- Leniency error: always positive or high rating response
- Severity error: always negative or low rating response
- Central tendency: always neutral or middle rating response
Flynn Effect
- James R. Flynn noticed that measured intelligence seems to rise on average, year by year, starting in the year in which a test is normed.
Revision and Test Production
- The test is finally produced.
- The test is constructed in its final form.
- Items are randomly selected from those deposited in the question bank in a manner that adequately represents all test dimensions.
- Trial questions are included, and various versions of the test are prepared to avoid cheating.
- The test is printed in booklets that include test instructions and test items.
- Note that in test development, one revision might not be enough. The process goes back to test try out, then analysis and revision again. This process is repeated several times until the test has reached its “near perfection.”
The Normal Curve
- Review of Statistics
Key Features
- Centered
- Fixed score distribution
- Unimodal
- Symmetrical
- Mean = Median = Mode
Quartile
- A distribution of test scores can be divided into four parts such that 25% of the test scores occur in each quarter. The dividing points between the four quarters in the distribution are the quartiles.
- Quartile refers to a specific point, whereas quarter refers to an interval; e.g., an individual score may fall at the third quartile or in the third quarter.
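A quick illustration with NumPy (assumed available), using hypothetical scores:

```python
# Sketch: quartiles as the three dividing points between the four
# quarters of a score distribution.
import numpy as np

scores = np.array([55, 60, 62, 65, 70, 72, 75, 80, 85, 90, 95, 98])
q1, q2, q3 = np.percentile(scores, [25, 50, 75])
print(q1, q2, q3)  # a score at q3 is "at the third quartile";
                   # scores between q2 and q3 fall "in the third quarter"
```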
Not All Data Are Normal
- Modality
- Symmetry
- Peakedness
Kurtosis
- The steepness of a distribution in its center is called kurtosis.
- Mesokurtic: of moderate peakedness, like the normal curve
- Leptokurtic: relatively peaked (steep center)
- Platykurtic: relatively flat
Skewness
- Refers to asymmetry
- Skewed to the Left: Mean < Median < Mode, Long Left Tail
- Normal Distribution: Mean = Median = Mode, No Skew
- Skewed to the Right: Mean > Median > Mode, Long Right Tail
Modality
- Unimodal: one-peak
- Bimodal: two-peaks
- Multimodal: multiple peaks
Standard Scores
- A standard score is a raw score that has been converted from one scale to another scale, where the latter scale has some arbitrarily set mean and standard deviation.
- Raw scores may be converted to standard scores because standard scores are more easily interpretable than raw scores.
- The position of a testtaker’s performance relative to other testtakers is readily apparent with a standard score.
Z Scores
- A z score results from the conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution.
- Mean = 0
- SD = 1
T Scores
- One advantage in using T scores is that none of the scores is negative.
- Mean = 50
- SD = 10
Stanines
- Stanines take on whole values from 1 to 9, which represent a range of performance that is half of a standard deviation in width
- Mean = 5
- SD = 2
Other Standard Scores
- IQ Deviation
- Mean = 100
- SD = 15
- Sten
- Mean = 5.5
- SD = 2
- GRE/SAT (Graduate Record Examination & Scholastic Aptitude Test)
- Mean = 500
- SD = 100
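A closing sketch tying the standard scores together: any raw score can be converted to a z score and then rescaled to each of the scales above (the raw-score mean and SD here are hypothetical):

```python
# Sketch: converting one raw score to the standard scores listed above,
# given the raw-score distribution's mean and standard deviation.

def z_score(raw, mean, sd):
    return (raw - mean) / sd

def rescale(z, new_mean, new_sd):
    return new_mean + new_sd * z

raw, mean, sd = 65, 50, 10          # hypothetical raw-score distribution
z = z_score(raw, mean, sd)          # 1.5
print("z       :", z)
print("T       :", rescale(z, 50, 10))    # 65.0
print("IQ dev. :", rescale(z, 100, 15))   # 122.5
print("sten    :", rescale(z, 5.5, 2))    # 8.5
print("GRE/SAT :", rescale(z, 500, 100))  # 650.0
# Stanines take whole values from 1 to 9, so round and clip:
print("stanine :", round(min(max(rescale(z, 5, 2), 1), 9)))  # 8
```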