These flashcards review key concepts, procedures, and statistics involved in Chapter 8’s topic—how psychological and educational tests are developed, analyzed, and revised.
What are the five major stages of test development?
(1) Test conceptualization, (2) Test construction, (3) Test try-out, (4) Item analysis, (5) Test revision.
During test construction, which four key activities usually occur?
Writing/revising items, formatting items, setting scoring rules, and designing the overall test.
What is the purpose of a test try-out?
To administer a preliminary test to a representative sample under simulated final conditions in order to gather performance data for item analysis.
Define item analysis.
A set of statistical (and sometimes qualitative) procedures used to evaluate individual test items for difficulty, reliability, validity, and discrimination.
What is test revision (in the development cycle of a new test)?
Actions taken to modify content or format, based on try-out results and item analysis, to improve the test’s effectiveness.
Explain the difference between norm-referenced and criterion-referenced tests.
Norm-referenced tests compare an examinee’s score to others’, while criterion-referenced tests evaluate whether mastery of specific criteria has been achieved.
What is pilot work (pilot study)?
Preliminary research with prototype items or procedures to discover the best ways to measure the targeted construct before full test construction.
Define scaling in test development.
The process of establishing rules for assigning numbers (or other indicators) to different amounts of the trait being measured.
What characterizes a Likert scale?
A summative rating scale with typically 5 (or 7) ordered response options ranging from strong disagreement to strong agreement.
What is the method of paired comparisons?
A scaling technique where respondents choose one preferred stimulus from each presented pair, allowing ordinal measurement.
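A minimal sketch of how paired-comparison choices can be tallied into an ordinal ranking (the stimuli and choices below are hypothetical, not from the chapter):

```python
from collections import Counter
from itertools import combinations

# Hypothetical stimuli and one respondent's choice for each pair
stimuli = ["A", "B", "C"]
choices = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "C"}

# Tally how often each stimulus is preferred across all pairings
wins = Counter()
for pair in combinations(stimuli, 2):
    wins[choices[pair]] += 1

# Order stimuli from most to least preferred (an ordinal scale)
ranking = sorted(stimuli, key=lambda s: wins[s], reverse=True)
print(ranking)  # ['C', 'A', 'B']
```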
In multiple-choice items, what are distractors (foils)?
Incorrect alternatives designed to appear plausible and to distract examinees who do not know the correct answer.
Give two advantages of binary-choice (true-false) items.
Easy to write and score; can sample a broad content area quickly.
What is a completion (short-answer) item?
A constructed-response item that requires the examinee to supply a word, phrase, or brief answer to complete a statement.
Describe computerized adaptive testing (CAT).
A computer-administered testing process where the selection of each item depends on the examinee’s previous responses.
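A minimal sketch of the adaptive idea, assuming a hypothetical item bank and a crude up/down ability rule (operational CAT programs use IRT-based ability estimation instead):

```python
# Hypothetical item bank: item id -> difficulty on an arbitrary scale
bank = {"i1": -1.0, "i2": -0.5, "i3": 0.0, "i4": 0.5, "i5": 1.0}

def run_cat(answer_item, n_items=3, ability=0.0, step=0.5):
    """Administer items adaptively: each item is chosen to match the
    current ability estimate, which moves up after a correct response
    and down after an incorrect one (a crude stand-in for IRT scoring)."""
    remaining = dict(bank)
    for _ in range(n_items):
        # Select the unused item whose difficulty is closest to the estimate
        item = min(remaining, key=lambda i: abs(remaining[i] - ability))
        correct = answer_item(item)
        ability += step if correct else -step
        del remaining[item]
    return ability

# Example: an examinee who answers every administered item correctly
print(run_cat(lambda item: True))  # the ability estimate drifts upward
```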
What are floor and ceiling effects?
Floor effect: test can’t discriminate among low-ability examinees; ceiling effect: test can’t discriminate among high-ability examinees.
Item branching refers to what capability in computer testing?
Programming that tailors which item appears next based on the examinee’s response pattern.
Explain cumulative scoring.
A scoring model where higher total scores indicate more of the trait or ability measured, with each keyed response adding credit.
What is ipsative scoring?
Scoring that compares an individual’s scores on different scales within the same test, rather than comparing to other people.
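A short sketch contrasting the two scoring models, using hypothetical keyed responses; cumulative scores sum credit across items, while the ipsative comparison looks only within the same person:

```python
# Hypothetical keyed responses (1 = keyed, 0 = not keyed) for one examinee
responses = {"Scale_A": [1, 1, 0, 1], "Scale_B": [1, 0, 0, 0]}

# Cumulative scoring: each keyed response adds credit to a scale total
totals = {scale: sum(items) for scale, items in responses.items()}
print(totals)  # {'Scale_A': 3, 'Scale_B': 1}

# Ipsative interpretation: compare the examinee's own scales with each other,
# not with other people (e.g., Scale_A is this person's relative strength)
strongest = max(totals, key=totals.get)
print(strongest)  # 'Scale_A'
```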
Define the item-difficulty index (p).
The proportion of examinees who answered an item correctly (or endorsed it); values range from 0 (hard) to 1 (easy).
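For example (hypothetical figures), if 60 of 80 examinees answer an item correctly:

$$p = \frac{60}{80} = .75$$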
What value of average item-difficulty is optimal for maximum discrimination on a normed test with four-choice items?
About .625, the midpoint between the chance success rate (.25) and 1.00.
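The value follows from taking the point halfway between the chance success rate and a perfect score:

$$p_{\text{optimal}} = .25 + \frac{1.00 - .25}{2} = .625$$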
What does the item-discrimination index (d) indicate?
How well an item differentiates between high and low scorers on the total test; calculated from the difference in the proportions answering correctly in the upper- and lower-scoring groups.
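For example (hypothetical figures), if 24 of 30 examinees in the upper group and 12 of 30 in the lower group answer the item correctly:

$$d = p_{\text{upper}} - p_{\text{lower}} = \frac{24}{30} - \frac{12}{30} = .40$$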
Why is a negative d-value problematic?
It means low scorers are more likely than high scorers to answer the item correctly, indicating a flawed item.
What information does an item-characteristic curve (ICC) in IRT show?
The relationship between examinee ability (theta) and probability of a correct response, revealing item difficulty and discrimination.
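One common formulation of an ICC is the two-parameter logistic model (one of several IRT models), where $a$ is the item's discrimination, $b$ its difficulty, and $\theta$ the examinee's ability:

$$P(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}$$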
What is differential item functioning (DIF)?
When examinees from different groups with the same ability have different probabilities of endorsing a test item, suggesting potential bias.
What is a sensitivity review?
Expert examination of items for fairness, bias, stereotypes, or offensive content affecting specific groups.
How does cross-validation differ from co-validation?
Cross-validation re-tests validity on a new sample; co-validation (co-norming) validates or norms multiple tests simultaneously on the same sample.
What is validity shrinkage?
The expected decrease in item or test validity coefficients when the test is administered to a new sample during cross-validation.
Describe an anchor protocol in quality assurance.
A protocol scored by experts that serves as a standard to detect scoring drift among different raters.
In classroom testing, what informal method can a professor use to address content validity?
Creating a test blueprint that proportionally samples each lecture/topic and reading assignment.
Briefly state three common reasons publishers revise an existing standardized test.
(1) Stimulus materials or language become dated. (2) Norms no longer represent current populations. (3) New theory or data allow improved reliability/validity.