These flashcards review key concepts, procedures, and statistics involved in Chapter 8’s topic—how psychological and educational tests are developed, analyzed, and revised.
What are the five major stages of test development?
(1) Test conceptualization, (2) Test construction, (3) Test try-out, (4) Item analysis, (5) Test revision.
During test construction, which four key activities usually occur?
Writing/revising items, formatting items, setting scoring rules, and designing the overall test.
What is the purpose of a test try-out?
To administer a preliminary test to a representative sample under simulated final conditions in order to gather performance data for item analysis.
Define item analysis.
A set of statistical (and sometimes qualitative) procedures used to evaluate individual test items for difficulty, reliability, validity, and discrimination.
What is test revision (in the development cycle of a new test)?
Actions taken to modify content or format, based on try-out results and item analysis, to improve the test’s effectiveness.
Explain the difference between norm-referenced and criterion-referenced tests.
Norm-referenced tests compare an examinee’s score to others’, while criterion-referenced tests evaluate whether mastery of specific criteria has been achieved.
What is pilot work (pilot study)?
Preliminary research with prototype items or procedures to discover the best ways to measure the targeted construct before full test construction.
Define scaling in test development.
The process of establishing rules for assigning numbers (or other indicators) to different amounts of the trait being measured.
What characterizes a Likert scale?
A summative rating scale with typically 5 (or 7) ordered response options ranging from strong disagreement to strong agreement.
What is the method of paired comparisons?
A scaling technique where respondents choose one preferred stimulus from each presented pair, allowing ordinal measurement.
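A minimal sketch of how paired-comparison choices can be tallied into an ordinal ranking (the stimuli and choices below are hypothetical, not from the chapter):

```python
from collections import Counter
from itertools import combinations

# Hypothetical stimuli and one respondent's choice for each pair
stimuli = ["A", "B", "C"]
choices = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "C"}

# Tally how often each stimulus is preferred across all pairings
wins = Counter()
for pair in combinations(stimuli, 2):
    wins[choices[pair]] += 1

# Order stimuli from most to least preferred (an ordinal scale)
ranking = sorted(stimuli, key=lambda s: wins[s], reverse=True)
print(ranking)  # ['C', 'A', 'B']
```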
In multiple-choice items, what are distractors (foils)?
Incorrect alternatives designed to appear plausible and to distract examinees who do not know the correct answer.
Give two advantages of binary-choice (true-false) items.
Easy to write and score; can sample a broad content area quickly.
What is a completion (short-answer) item?
A constructed-response item that requires the examinee to supply a word, phrase, or brief answer to complete a statement.
Describe computerized adaptive testing (CAT).
A computer-administered testing process where the selection of each item depends on the examinee’s previous responses.
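A minimal sketch of the adaptive idea, assuming a hypothetical item bank and a crude up/down ability rule (operational CAT programs use IRT-based ability estimation instead):

```python
# Hypothetical item bank: item id -> difficulty on an arbitrary scale
bank = {"i1": -1.0, "i2": -0.5, "i3": 0.0, "i4": 0.5, "i5": 1.0}

def run_cat(answer_item, n_items=3, ability=0.0, step=0.5):
    """Administer items adaptively: each item is chosen to match the
    current ability estimate, which moves up after a correct response
    and down after an incorrect one (a crude stand-in for IRT scoring)."""
    remaining = dict(bank)
    for _ in range(n_items):
        # Select the unused item whose difficulty is closest to the estimate
        item = min(remaining, key=lambda i: abs(remaining[i] - ability))
        correct = answer_item(item)
        ability += step if correct else -step
        del remaining[item]
    return ability

# Example: an examinee who answers every administered item correctly
print(run_cat(lambda item: True))  # the ability estimate drifts upward
```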
What are floor and ceiling effects?
Floor effect: test can’t discriminate among low-ability examinees; ceiling effect: test can’t discriminate among high-ability examinees.
Item branching refers to what capability in computer testing?
Programming that tailors which item appears next based on the examinee’s response pattern.
Explain cumulative scoring.
A scoring model where higher total scores indicate more of the trait or ability measured, with each keyed response adding credit.
What is ipsative scoring?
Scoring that compares an individual’s scores on different scales within the same test, rather than comparing to other people.
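A short sketch contrasting the two scoring models, using hypothetical keyed responses; cumulative scores sum credit across items, while the ipsative comparison looks only within the same person:

```python
# Hypothetical keyed responses (1 = keyed, 0 = not keyed) for one examinee
responses = {"Scale_A": [1, 1, 0, 1], "Scale_B": [1, 0, 0, 0]}

# Cumulative scoring: each keyed response adds credit to a scale total
totals = {scale: sum(items) for scale, items in responses.items()}
print(totals)  # {'Scale_A': 3, 'Scale_B': 1}

# Ipsative interpretation: compare the examinee's own scales with each other,
# not with other people (e.g., Scale_A is this person's relative strength)
strongest = max(totals, key=totals.get)
print(strongest)  # 'Scale_A'
```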
Define the item-difficulty index (p).
The proportion of examinees who answered an item correctly (or endorsed it); values range from 0 (hard) to 1 (easy).
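For example (hypothetical figures), if 60 of 80 examinees answer an item correctly:

$$p = \frac{60}{80} = .75$$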
What value of average item-difficulty is optimal for maximum discrimination on a normed test with four-choice items?
About .625, the midpoint between the chance success rate (.25) and 1.00.
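The value follows from taking the point halfway between the chance success rate and a perfect score:

$$p_{\text{optimal}} = .25 + \frac{1.00 - .25}{2} = .625$$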
What does the item-discrimination index (d) indicate?
How well an item differentiates between high and low scorers on the total test; calculated from the difference in the proportions answering correctly in the upper- and lower-scoring groups.
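For example (hypothetical figures), if 24 of 30 examinees in the upper group and 12 of 30 in the lower group answer the item correctly:

$$d = p_{\text{upper}} - p_{\text{lower}} = \frac{24}{30} - \frac{12}{30} = .40$$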
Why is a negative d-value problematic?
It means low scorers are more likely than high scorers to answer the item correctly, indicating a flawed item.
What information does an item-characteristic curve (ICC) in IRT show?
The relationship between examinee ability (theta) and probability of a correct response, revealing item difficulty and discrimination.
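One common formulation of an ICC is the two-parameter logistic model (one of several IRT models), where $a$ is the item's discrimination, $b$ its difficulty, and $\theta$ the examinee's ability:

$$P(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}$$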
What is differential item functioning (DIF)?
When examinees from different groups with the same ability have different probabilities of endorsing a test item, suggesting potential bias.
What is a sensitivity review?
Expert examination of items for fairness, bias, stereotypes, or offensive content affecting specific groups.
How does cross-validation differ from co-validation?
Cross-validation re-tests validity on a new sample; co-validation (co-norming) validates or norms multiple tests simultaneously on the same sample.
What is validity shrinkage?
The expected decrease in item or test validity coefficients when the test is administered to a new sample during cross-validation.
Describe an anchor protocol in quality assurance.
A protocol scored by experts that serves as a standard to detect scoring drift among different raters.
In classroom testing, what informal method can a professor use to address content validity?
Creating a test blueprint that proportionally samples each lecture/topic and reading assignment.
Briefly state three common reasons publishers revise an existing standardized test.
(1) Stimulus materials or language become dated. (2) Norms no longer represent current populations. (3) New theory or data allow improved reliability/validity.