
Test Development Flashcards

Test Development

Test Development Process

  • Initial Progress:
    • Review of relevant book chapters: Cohen (8), Kaplan (6).
    • Initial steps completed.
  • Test Development Process Cycle:
    • Conceptualization → Construction → Tryout → Item Analysis → Revision.
    • Revision may loop back to the tryout phase for iterative improvements.

Test Construction

  • Stimulus: An emerging social phenomenon or pattern of behavior can prompt the development of a new test.
  • Preliminary Questions:
    • What is the test designed to measure?
      • Focuses on defining the construct and differentiating it from similar tests.
    • What is the objective of the test?
      • Defines the test's goal and its uniqueness compared to other tests.
    • Is there a need for this test?
      • Justifies the test by highlighting its advantages over existing ones.
    • Who will use this test?
      • Specifies the intended users (e.g., clinicians, educators) and settings.
    • Who will take this test?
      • Considers factors like age range, reading level, and cultural aspects affecting test-taker responses.
    • What content will the test cover?
      • Defines the scope of content to be included in the test.
    • How will the test be administered?
      • Specifies whether it will be individual or group administration, and pen-and-paper or computerized format.
    • What is the ideal format of the test?
      • Determines the format, such as true-false, essay, or multiple-choice.
    • Should more than one form of the test be developed?
      • Considers parallel or alternate forms for repeated testing.
    • What special training will be required of test users for administering and interpreting the test?
      • Highlights necessary qualifications for test administrators.
    • What types of responses will be required of test-takers?
      • Addresses adaptations and accommodations for individuals with disabilities.
    • Who benefits from an administration of this test?
      • Identifies the beneficiaries of the test results.
    • Is there any potential harm as the result of an administration of this test?
      • Considers ethical implications and potential negative impacts.
    • How will meaning be attributed to scores on this test?
      • Decides between norm-referenced or criterion-referenced interpretation.

Norm-Referenced vs. Criterion-Referenced

  • Norm-Referenced:
    • A good item differentiates between high and low scorers.
    • High scorers on the test as a whole tend to answer the item correctly, while low scorers tend to answer it incorrectly.
  • Criterion-Referenced:
    • Items assess whether test-takers meet specific criteria.
    • Commonly used in licensing and mastery testing contexts.
    • Items should discriminate between those who have mastered the material and those who have not.
    • Items that effectively discriminate between mastery groups are considered 'good items.'

Pilot Work

  • Preliminary research involving a test prototype.
  • Test items are pilot-studied to assess their suitability for the final instrument.
  • Aims to optimize the measurement of the targeted construct.
  • Involves literature reviews, experimentation, and iterative item creation, revision, and deletion.
  • Essential for tests intended for publication and widespread use.

Test Construction - Scaling

  • Scaling:
    • Assigning numbers according to rules in measurement.
    • Designing and calibrating a measuring device.
    • Assigning scale values to different amounts of the measured trait or attribute.
  • L.L. Thurstone:
    • Introduced absolute scaling to measure item difficulty across test-taker ability levels.

Types of Scales

  • Scales used to measure traits, states, or abilities.
    • Age-based scale: Performance as a function of age.
    • Grade-based scale: Performance as a function of grade.
    • Stanine scale: Transforms raw scores into a scale from 1 to 9.
  • Dimensionality:
    • Unidimensional: Measures a single underlying dimension.
    • Multidimensional: Measures multiple dimensions.
  • Nature of judgment:
    • Comparative: Compares stimuli against each other.
    • Categorical: Places stimuli into categories that differ quantitatively.

Scaling Methods

  • Rating Scale:

    • Words, statements, or symbols indicate the strength of a trait or attitude.
    • Records judgments about oneself, others, experiences, or objects.
    • Yields ordinal-level data.
  • Summative Scale:

    • The final score is the sum of ratings across all items.
  • Likert Scale:

    • Presents 5-7 response options on an agree-disagree continuum.
    • Generally reliable.
    • Assigning weights of 1 through 5 typically works best.
  • Method of Paired Comparisons:

    • Test-takers choose between pairs of stimuli based on a rule.
    • Higher score for selecting the option deemed more justifiable by judges.
  • Guttman Scale:

    • Items range sequentially from weaker to stronger expressions.
    • Endorsement of a stronger statement implies agreement with milder statements.
    • Developed through scalogram analysis.
    • Arranges items so that endorsement of one item implies endorsement of less extreme positions.
  • Method of Equal-Appearing Intervals (Thurstone Scaling):

    • Obtains interval-level data.
    • Steps:
      • Collect a large number of statements reflecting positive and negative attitudes.
      • Judges evaluate each statement on a scale (e.g., 1-9) indicating the strength of the variable.
      • Judges focus on the statements, not their own views.
      • Calculate the mean and standard deviation of judges’ ratings for each statement.
      • Select items that both cover the attitude comprehensively and can be sorted confidently into equal-appearing intervals.
      • A low standard deviation (judges agree on a statement's scale value) indicates a good item.
      • Administration and scoring: a test-taker's score is the average scale value of the statements they endorse (see the sketch after this list).
    • Direct estimation scaling method: responses require no further transformation.
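
A minimal Python sketch of the judging and scoring steps above, assuming a hypothetical set of statements, 11 judges rating on a 1-9 scale, and an arbitrary standard-deviation cutoff; none of these specifics come from the notes themselves.

```python
import statistics

# Hypothetical judges' ratings (1-9 scale) for three candidate statements.
judge_ratings = {
    "Statement A": [8, 9, 8, 7, 9, 8, 8, 9, 8, 7, 8],
    "Statement B": [2, 7, 4, 9, 1, 6, 3, 8, 5, 2, 7],   # judges disagree widely
    "Statement C": [3, 3, 2, 4, 3, 3, 2, 4, 3, 3, 2],
}

MAX_SD = 1.5  # illustrative cutoff: low spread means judges agree on the value

selected = {}
for statement, ratings in judge_ratings.items():
    scale_value = statistics.mean(ratings)   # the statement's scale value
    spread = statistics.stdev(ratings)       # judge disagreement
    if spread <= MAX_SD:
        selected[statement] = scale_value

# Scoring a test-taker: average the scale values of the statements they endorsed.
endorsed = ["Statement A", "Statement C"]
score = statistics.mean(selected[s] for s in endorsed if s in selected)
print(f"Attitude score: {score:.2f}")
```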

Writing Items

  • Item Pool:
    • The collection of items considered for the final test.
    • Comprehensive sampling ensures content validity.
    • The initial pool should contain approximately twice the number of items needed for the final version.
    • New or rewritten items must be tried out to maintain content validity.
  • Multiple Forms:
    • Multiply the number of items for one form by the number of forms planned to determine the total item pool size.
    • Items can be generated from personal experience, academic knowledge, or with the help of experts.

Item Format

  • Form, plan, arrangement, and layout of individual test items.
  • Selected-Response Format:
    • Test takers select a response from alternatives.
      • Multiple-choice format:
        • Stem: Question or statement.
        • Correct alternative/option.
        • Distractors/foils: Incorrect options.
      • Matching item:
        • Premises (left column) and responses (right column).
        • Match the best response to each premise.
      • Binary-choice item:
        • Statement requiring a true/false or yes/no response.
        • Should contain a single, clear idea and be free from debate.
        • The correct response should be definitive.
        • Cannot contain distractor alternatives.
  • Constructed-Response Format:
    • Test takers create the answer.
      • Completion item:
        • Fill-in-the-blank.
      • Short-answer item:
        • Identification-based.
        • Answer is a word, term, sentence, or paragraph.
      • Essay item:
        • Requires a composition demonstrating recall, understanding, analysis, and/or interpretation.
        • Useful for assessing in-depth knowledge.

Writing Items for Computer Administration

  • Item Bank:

    • A large, accessible collection of test questions.
      • Advantages:
        • Accessibility to a large number of items.
        • Items classified by subject area, item statistics, etc.
        • Items can be added, withdrawn, and modified.
  • Computerized Adaptive Testing (CAT):

    • Interactive, computer-administered testing.
    • Item presentation adapts to the test-taker's performance.
      • Features:
        • Begins with sample/practice items.
        • Test continuation may depend on satisfactory performance on practice items.
        • The test is tailored to each test-taker.
        • Items that nearly all test-takers answer the same way provide little information and are avoided.
        • Reduces floor and ceiling effects.
  • Item Branching:

    • The computer tailors content and item order based on responses to previous items.
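
A toy sketch of item branching, assuming a hypothetical item pool keyed by difficulty level and a simple move-up/move-down rule; operational CAT systems select items using IRT ability estimates rather than this rule.

```python
# Toy item-branching loop: the next item depends on the previous response.
items_by_difficulty = {
    1: "2 + 2 = ?",
    2: "12 x 3 = ?",
    3: "Solve: 3x + 5 = 20",
    4: "Differentiate x^2 * sin(x)",
    5: "Evaluate the integral of e^(-x^2) over the real line",
}

def run_adaptive_test(answer_correctly, start_level=3, n_items=4):
    """Move up one difficulty level after a correct answer, down after an error."""
    level, administered = start_level, []
    for _ in range(n_items):
        item = items_by_difficulty[level]
        administered.append((level, item))
        if answer_correctly(item):
            level = min(level + 1, max(items_by_difficulty))
        else:
            level = max(level - 1, min(items_by_difficulty))
    return administered

# Example: a test-taker who answers every presented item correctly.
print(run_adaptive_test(lambda item: True))
```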

Scoring Items

  • Cumulative Model:
    • Higher score indicates a higher level of the measured characteristic.
  • Class Scoring/Category Scoring:
    • Responses earn credit toward placement in a category.
    • Used in diagnostic systems where specific patterns of responses are required for diagnosis.
  • Ipsative Scoring:
    • Compares a test-taker’s scores on different scales within the same test.
    • Example: "John's openness to experience is higher than his extraversion."

Test Tryout

  • Administer the test to a sample similar to the target population.
  • Informal rule of thumb: at least 5-10 subjects per item.
    • More subjects are preferable.
  • Conditions should mimic standardized test administration.
  • Differences in responses should be due to the items themselves, not extraneous factors.
  • A good test item is reliable, valid, and discriminates between test-takers.
    • Answered correctly by high scorers on the test as a whole.
    • Items answered correctly by low scorers may not be good items.

Item Analysis

Item-Difficulty Index

  • Proportion of test-takers who answered the item correctly.

  • p_1 denotes item difficulty for item 1.

  • A larger index indicates an easier item.

  • Calculated as p_1 = (number of examinees who answered item 1 correctly) / (total number of examinees).

    • Example: p_1 = 50/100 = 0.5
  • Index of difficulty of the average test item: sum of item difficulty indices for all items divided by the total number of items.

  • Optimal Average Item Difficulty:

    • Approximately 0.5 for maximum discrimination.
    • Individual items should range from 0.3 to 0.8.
    • Midpoint between 1.00 and the chance success proportion.
  • Adjusting for Chance Success:

    • Binary-choice items:
      • Chance success proportion = 1/2 or 0.5.
      • Optimal item difficulty = (0.5 + 1.00) / 2 = 0.75.
    • Five-option multiple-choice items:
      • Chance success proportion = 1/5 or 0.2.
      • Optimal item difficulty = (0.2 + 1.00) / 2 = 0.60.
    • Seven-option multiple-choice items:
      • Chance success proportion = 1/7 or approximately 0.14.
      • Optimal item difficulty = (0.14 + 1.00) / 2 = 0.57.
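
A short Python sketch of the calculations above; the function names are for illustration only.

```python
# Item difficulty p_i = (number answering item i correctly) / (total examinees),
# plus the optimal-difficulty midpoint that adjusts for chance success.
def item_difficulty(n_correct, n_examinees):
    return n_correct / n_examinees

def optimal_difficulty(n_options):
    """Midpoint between 1.00 and the chance-success proportion (1 / options)."""
    chance = 1 / n_options
    return (chance + 1.00) / 2

print(item_difficulty(50, 100))          # 0.5, as in the example above
print(round(optimal_difficulty(2), 2))   # 0.75 for binary-choice items
print(round(optimal_difficulty(5), 2))   # 0.6 for five-option multiple choice
print(round(optimal_difficulty(7), 2))   # 0.57 for seven-option items
```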

Item-Reliability Index

  • Indicates internal consistency.
  • A higher index indicates greater internal consistency.
  • Equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score.
  • Factor analysis identifies items that do not "load on" the intended factor, which can be revised or eliminated.

Item-Validity Index

  • Indicates the degree to which a test measures what it intends to measure.
  • A higher index indicates greater criterion-related validity.
  • Requires two statistics:
    • Item-score standard deviation (s).
    • Correlation between the item score and the criterion score (r_{1c}).
  • Calculated as item-score standard deviation multiplied by the correlation between item score and criterion score.
  • Important for maximizing the criterion-related validity of the test.
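
A small sketch computing both the item-reliability and item-validity indices as the product of the item-score standard deviation and the relevant correlation; the data are made up, and the use of the population standard deviation is an assumption.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical data: one item scored 0/1, total test scores, criterion scores.
item_scores = [1, 0, 1, 1, 0, 1, 0, 1]
total_scores = [42, 30, 45, 40, 28, 44, 33, 41]
criterion = [3.5, 2.1, 3.9, 3.2, 2.0, 3.8, 2.6, 3.4]

s = statistics.pstdev(item_scores)              # item-score standard deviation
item_reliability_index = s * pearson_r(item_scores, total_scores)
item_validity_index = s * pearson_r(item_scores, criterion)

print(round(item_reliability_index, 3), round(item_validity_index, 3))
```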

Item-Discrimination Index

  • Denoted by d.

  • Indicates how well an item separates high scorers from low scorers.

  • Compares performance on an item with performance in the upper and lower regions of a distribution of continuous test scores.

  • In a normal distribution, the upper and lower 27% of scores are used.

  • In a platykurtic (flatter-than-normal) distribution, the upper and lower 33% of scores are used.

  • The higher the value of d, the better the discrimination.

  • A negative d value indicates that low-scoring examinees are more likely to answer the item correctly.

  • Calculated as d = (U - L) / n, where:

    • U = number of test-takers in the upper (high-scoring) group who answered the item correctly.
    • L = number of test-takers in the lower (low-scoring) group who answered the item correctly.
    • n = number of test-takers in each group (the groups are of equal size).
  • Interpretation:

    • More U-group members answer correctly than L-group members: the item is reasonable or good.
    • All U-group members answer correctly and all L-group members answer incorrectly: the item is excellent (d = +1.00).
    • Equal numbers of U- and L-group members answer correctly: the item is not discriminating (d = 0).
    • All L-group members answer correctly and all U-group members answer incorrectly: the item is bad (d = -1.00).
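
A small sketch of the calculation, assuming hypothetical (total score, item score) records and the 27% upper/lower split described above.

```python
# d = (U - L) / n, using the upper and lower 27% of total-score earners.
# Hypothetical data: each tuple is (total test score, 1/0 on the item of interest).
examinees = [(95, 1), (91, 1), (88, 1), (84, 1), (80, 0), (77, 1), (74, 0),
             (70, 1), (66, 0), (61, 0), (55, 0), (50, 1), (44, 0), (38, 0)]

examinees.sort(key=lambda rec: rec[0], reverse=True)
n = round(len(examinees) * 0.27)          # size of each extreme group
upper, lower = examinees[:n], examinees[-n:]

U = sum(correct for _, correct in upper)  # upper-group members answering correctly
L = sum(correct for _, correct in lower)  # lower-group members answering correctly
d = (U - L) / n

print(f"U={U}, L={L}, n={n}, d={d:.2f}")  # values of d near +1 discriminate well
```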

Analysis of Item Alternatives

  • Charting the number of test-takers in the U and L groups who chose each alternative makes it possible to evaluate the effectiveness of the distractors.
  • Example scenarios:
    • Good item: More U-group than L-group members answered correctly, and each distractor attracted some test-takers.
    • Perfect item: All U-group members answered correctly, while some L-group members were attracted by distractors.
    • Effective but flawed item: U-group members still outperformed the L group, but one distractor was overly attractive, drawing most of the L-group members.
    • Poor item: More L-group than U-group members answered correctly, and U-group members were attracted by distractors.
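
A brief sketch of such a chart, using made-up response counts for a five-option item keyed to alternative "b".

```python
# Tally of which alternative each U-group and L-group member chose on one
# hypothetical five-option item; "b" is the keyed (correct) answer.
choices = {
    "U group": {"a": 2, "b": 20, "c": 1, "d": 2, "e": 0},
    "L group": {"a": 6, "b": 5,  "c": 4, "d": 4, "e": 6},
}
key = "b"

for group, counts in choices.items():
    correct = counts[key]
    distractor_pulls = {alt: n for alt, n in counts.items() if alt != key}
    print(f"{group}: correct={correct}, distractors={distractor_pulls}")

# Signs of a working item: more U- than L-group members choose the key, and
# each distractor attracts at least some L-group members. A distractor chosen
# by nobody, or one chosen mainly by the U group, is a candidate for rewriting.
```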

Qualitative Interpretation of Item Difficulty

  • 0.00 - 0.20: Very Difficult, Unacceptable.
  • 0.21 - 0.40: Difficult, Unacceptable.
  • 0.41 - 0.60: Moderate, Highly acceptable.
  • 0.61 - 0.80: Easy, Highly acceptable/Acceptable.
  • 0.81 - 1.00: Very Easy, Acceptable.

Qualitative Interpretation of Item Discrimination

  • Below 0.10: Poor item, Reject or revise.
  • 0.10 - 0.19: Marginal item, Needs revision.
  • 0.20 - 0.29: Fair item, Acceptable.
  • 0.30 - 0.39: Good item, Highly acceptable/Acceptable.
  • 0.40 and above: Very good item, Highly acceptable.

Item-Characteristic Curves

  • Graphic representation of item difficulty and discrimination.

  • Horizontal axis: ability.

  • Vertical axis: probability of correct response (PCR).

  • Item discrimination: slope (steeper slope = greater discrimination).

  • Item difficulty: the position of the curve along the ability axis (an easy item's curve rises at lower ability levels; a difficult item's curve rises only at higher ability levels).

  • Examples:

    • Bad item: People of low ability get it correct, and people of high ability get it wrong.
    • Good item: People of low ability get it wrong, while people of high ability get it correct.
    • Desirable curve: the probability of a correct response increases steadily as ability increases.
    • Excellent item: High probability that all test-takers at or above moderate ability will respond correctly.
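
The notes above describe the curves only qualitatively; the two-parameter logistic function below is one common way to draw them, and the parameter values are illustrative assumptions.

```python
import math

def icc(ability, difficulty, discrimination):
    """Probability of a correct response under a two-parameter logistic curve."""
    return 1 / (1 + math.exp(-discrimination * (ability - difficulty)))

# A steeper slope (larger discrimination) separates ability levels more sharply;
# a larger difficulty value shifts the curve toward higher ability.
for theta in (-2, -1, 0, 1, 2):
    easy_item = icc(theta, difficulty=-1.0, discrimination=1.5)
    hard_item = icc(theta, difficulty=1.0, discrimination=1.5)
    flat_item = icc(theta, difficulty=0.0, discrimination=0.3)  # poor discriminator
    print(f"theta={theta:+d}  easy={easy_item:.2f}  hard={hard_item:.2f}  flat={flat_item:.2f}")
```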

Other Considerations in Item Analysis

Guessing

  • Correction for guessing criteria:
    • Recognize that guessing is not always random (test-taker knowledge may allow them to eliminate options).
    • Account for omitted items.
    • Acknowledge that some test-takers may be luckier than others.
  • Test developer actions regarding guessing:
    • Include explicit instructions regarding guessing in the test manual.
    • Provide specific instructions for scoring and interpreting omitted items.
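
The notes list criteria a correction must satisfy but give no formula; the sketch below uses the classic correction R - W / (k - 1), which assumes that wrong answers arise from random guessing.

```python
def corrected_score(n_right, n_wrong, n_options):
    """Classic correction for guessing: R - W / (k - 1).

    Omitted items are neither rewarded nor penalized, consistent with the
    criterion above that omissions be handled explicitly. The notes caution
    that guessing is not always random, so the assumption is only approximate.
    """
    return n_right - n_wrong / (n_options - 1)

# 60 right, 20 wrong, 20 omitted on a four-option multiple-choice test.
print(round(corrected_score(60, 20, 4), 2))   # 53.33: 20/3 points removed
```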

Item Fairness

  • The degree to which an item is biased.
  • A biased item favors one group over another when group abilities are controlled.
  • Item-characteristic curves can identify biased items: if groups do not differ in total test score but exhibit significantly different item-characteristic curves, the item is biased.
  • The same proportion of persons from each group should pass any given item, provided that the persons all earned the same total score on the test. Biased items must be revised or eliminated.
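
A simplified sketch of that check, using hypothetical group labels and records matched on total score; operational differential item functioning analyses are more elaborate.

```python
from collections import defaultdict

# Hypothetical records: (group, total test score, 1/0 on the item under review).
records = [("A", 40, 1), ("B", 40, 0), ("A", 40, 1), ("B", 40, 1),
           ("A", 35, 1), ("B", 35, 0), ("A", 35, 0), ("B", 35, 0),
           ("A", 30, 0), ("B", 30, 0)]

# Compare pass rates within each total-score band: matched examinees from
# different groups should pass the item at roughly the same rate.
passes = defaultdict(list)
for group, total, correct in records:
    passes[(total, group)].append(correct)

for (total, group), outcomes in sorted(passes.items()):
    rate = sum(outcomes) / len(outcomes)
    print(f"total={total}  group={group}  pass rate={rate:.2f}")
```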

Speed Tests

  • Item analyses of speed tests yield misleading or uninterpretable results.

  • Items near the end of the test may appear more difficult because test-takers may not reach them before time runs out.

  • Late-appearing items may show spuriously high item discrimination or item-total correlations because only a select group of examinees reaches them.

  • Recommended approach:

    • Administer the test with generous time limits for item analysis.
    • Establish norms using the speed conditions intended for actual use.

Qualitative Item Analysis

  • Techniques relying on verbal rather than mathematical procedures.
  • Non-statistical procedures explore how items work.
  • Compares individual test items to each other and to the test as a whole.

"Think Aloud" Test Administration

  • Examinees think aloud while responding to items.
  • Provides insights into how individuals perceive, interpret, and respond to items.

Expert Panels

  • Sensitivity review examines items for fairness and offensive content.
  • Forms of content bias: status, stereotype, familiarity, offensive language.
  • Cultural experts advise on achieving desired measurement with specific populations.

Test Revision Approaches

  • Characterize each item by strengths and weaknesses.

  • Items with many weaknesses are prime candidates for deletion or revision.

  • Very difficult items may lack reliability and validity.

    • Balance various strengths and weaknesses.
  • Test developers may purposefully include some more difficult items on a test that has good items but is somewhat easy.

  • Revision priorities based on test purpose:

    • Educational placement/employment: item bias.
    • Skills/abilities testing: item discrimination.
  • A large item pool facilitates the elimination of poor items.

  • Poor items can be eliminated in favor of those that were shown in the test tryout to be good items.

  • After balancing concerns, the revised test is administered to a second sample under standardized conditions.

  • If item analysis indicates the test is not in finished form, the steps of revision, tryout, and item analysis are repeated until the test is satisfactory and standardization can occur.

Test Revision in the Life Cycle of an Existing Test

  • An existing test should be revised when:
    • Stimulus materials look dated.
    • Verbal content contains dated vocabulary not readily understood by current test-takers.
    • Words/expressions are inappropriate or offensive.
    • Test norms are inadequate due to population changes or age-shifts.
    • Reliability, validity, or item effectiveness can be improved.
    • The underlying theory has improved significantly.
      Note: Test revision in this sense follows steps similar to developing a brand-new test, especially when a new or improved theory is involved; any change in scores between the old and revised versions cannot automatically be viewed as a change in examinee performance.

Cross-validation and Co-validation

  • Cross-validation:
    • Revalidation on a new sample.
    • Item validities are expected to decrease due to chance (validity shrinkage).
  • Co-validation:
    • Validation of two or more tests using the same sample.
    • Can also be called co-norming when used with norm creation/revision.
    • More economical and reduces sampling error.