• Five sequential phases that guide the systematic creation and refinement of psychological tests:
– Test Conceptualization: The initial stage where the need for a new test is identified and defined.
– Test Construction: Involves writing and formatting items, and designing the test structure.
– Test Try-out: Administering the preliminary test to a representative sample to gather initial data.
– Item Analysis: Statistically evaluating individual test items based on data from the try-out phase.
– Test Revision: Modifying, deleting, or adding items based on item analysis results to improve the test.
• This process is inherently iterative, meaning it is cyclical rather than linear, often cycling back through previous stages (e.g., from Item Analysis to Test Construction, then back to Test Try-out) multiple times before the test is finalized and standardized. This iterative nature (see Figure 8-1 in many psychometric texts) ensures continuous improvement.
• This phase originates from a perceived measurement need, often phrased as, "There ought to be a test for…" reflecting an identified gap in existing assessment tools.
• Stimuli for new tests can be diverse, including:
– Psychometric weaknesses of existing tools: Identifying poor reliability, validity, or outdated norms in current tests.
– Emerging social phenomena: Needing to measure new constructs or experiences not previously addressed (e.g., digital literacy, specific social anxieties).
– New occupations/roles: Assessment tools required for evaluating skills or aptitudes for newly developed professions.
– Medical analogies: Just as new diseases require diagnostic tests, new psychological conditions or constructs may require new assessment instruments.
• Example hot domains requiring new measurement tools: high-definition electronics aptitude, environmental engineering expertise, wireless communication proficiency, and understanding LGBTQIA2S+ experiences (especially underrepresented identities like asexuality), among others.
• During this crucial preliminary stage, every test developer must address core questions comprehensively:
– What construct?: Precisely defining the hypothetical construct to be measured (e.g., intelligence, anxiety, job satisfaction) and delineating its theoretical boundaries.
– Definitions?: Providing clear, operational definitions of the construct and its facets.
– Distinctiveness?: How does this construct differ from other related constructs? Is there discriminant validity from the start?
– Objective & anticipated real-world correlates: What is the ultimate purpose of the test, and what real-world behaviors or outcomes should its scores predict (e.g., academic success, clinical diagnosis, job performance)?
– Market need & advantages over existing tools: Is there a demand for this test? How will it be superior, more efficient, or more appropriate than tools already available?
– Intended users & purposes: Who will administer and interpret the test? What specific decisions will be made based on its results (e.g., screening, diagnosis, selection, research)?
– Target examinees: Who is the test designed for (e.g., age range, reading level, cultural background, clinical vs. non-clinical populations)? This influences language, content, and instructions.
– Content coverage & cultural specificity: What specific content domains or behaviors should the test cover to comprehensively represent the construct? Are there cultural nuances or biases to consider and address?
– Administration mode: Will the test be individual or group administered? Paper-and-pencil, computer-based, or adaptive?
– Item format: What type of items will be used (e.g., multiple-choice, true/false, essay, Likert scales)?
– Number of forms: Will there be parallel forms, short forms, or only one version?
– Scoring model: How will responses be translated into scores (e.g., cumulative, ipsative, class/category)?
– User qualifications: What training or credentials will be required for test administrators and interpreters?
– Accommodations: What provisions need to be made for individuals with disabilities or specific needs?
– Benefits vs. harms: What are the potential positive impacts of using the test versus any potential negative consequences (e.g., misdiagnosis, labeling)?
– Norm-referenced vs Criterion-referenced orientation: Will score interpretation be based on comparison to a normative group or against a predefined standard of mastery?
• The Asexuality Identification Scale (AIS) is a 12-item, sex- and gender-neutral self-report measure developed by Yule et al. (2015) specifically designed to identify asexuality.
• Its development followed a rigorous, multi-stage process:
Initial Item Generation: Researchers began with 8 open-ended qualitative questions administered to individuals to gather rich data on experiences related to asexuality. The responses were then subjected to thematic analysis, leading to the generation of 111 initial multiple-choice items.
First Field Test & Refinement: These 111 items were administered to a large sample of n=917 participants, comprising 165 individuals who self-identified as asexual and 752 who identified as sexual. Thorough factor analysis and item analysis (evaluating item difficulty, discrimination, etc.) were conducted to identify and retain only the most psychometrically sound items, resulting in a reduced set of 37 items.
Second Field Test & Finalization: The refined 37 items were then administered to an even larger sample of n=1242 participants (specifically, 316 asexual-identifying and 926 sexual-identifying individuals). Further rigorous psychometric analysis, including item response theory (IRT) methods, led to the final selection of just 12 items.
• Validity evidence for the AIS is robust:
– Known-groups validity: An optimal cutoff score of 40/60 on the AIS demonstrated excellent differentiation between groups, accurately identifying 93\% of asexual individuals while effectively excluding 95\% of sexual individuals.
– Incremental validity: The AIS showed a moderate positive correlation with the Klein Sexual Orientation Grid, indicating that it contributes unique variance beyond existing measures of sexual orientation.
– Convergent validity: It had a weak positive correlation with solitary sexual desire and a moderate negative correlation with dyadic sexual desire on the Sexual Desire Inventory (SDI), consistent with theoretical expectations for asexuality.
– Discriminant validity: The AIS showed non-significant correlations (ns) with measures of childhood trauma (CTQ), interpersonal problems (IIP-SC), and personality traits (BFI), confirming that it measures a distinct construct separate from these other psychological variables.
• Implication: The AIS serves as a valuable research and clinical tool. It aids in the representative recruitment and screening of asexual individuals for studies, reducing reliance on potentially biased self-identification and improving the accuracy of research samples.
• Norm-referenced tests: The meaning of an individual's score is derived from their relative position within a predefined distribution of scores from a norm group (e.g., "Johnny scored in the 80th percentile, meaning only 20% of his peers scored higher"). For these tests, a "good item" is one that is answered correctly by high-scoring individuals on the overall test and incorrectly by low-scoring individuals, effectively differentiating between test-takers based on their ability or trait level.
• Criterion-referenced tests: The meaning of an individual's score is solely derived from their mastery of predefined knowledge, skills, or criteria, independent of how other examinees perform (e.g., "Mary passed, meaning she has mastered 85% of the course objectives"). Item quality is judged by its ability to accurately separate individuals who have achieved mastery of a specific criterion from those who have not, regardless of their relative rank within a group.
• Examples: Licensing exams (e.g., medical boards, bar exams) and tests used in mastery learning educational models are classic examples of criterion-referenced assessments because they evaluate whether an individual meets a minimum standard rather than how they compare to others.
• Pilot work refers to preliminary research activities undertaken to thoroughly explore a construct and to craft initial prototype items for a test. This phase is critical for laying a solid foundation for test development.
• Activities involved in pilot work can include:
– Extensive literature reviews: To understand the construct, its theoretical underpinnings, existing measures, and relevant research findings.
– Open-ended interviews: With subject matter experts (SMEs) or target populations to gather qualitative insights, terminology, and perspectives related to the construct.
– Physiological monitoring: In certain contexts (e.g., stress, emotion), preliminary physiological data might inform item content.
– Cognitive interviews: Having potential examinees think aloud as they respond to draft items to identify ambiguities, misunderstandings, or cognitive processes involved.
• Pilot work is considered essential for the development of published, large-scale, and high-stakes tests, as it helps identify potential issues early on. While optional for informal assessments like classroom quizzes, even then, informal pilot testing (e.g., trying questions on a few students) can significantly improve item quality.
• Scaling refers to the set of rules and procedures by which numbers are assigned to psychological attributes or observations. It defines how quantitative values represent different levels of a construct.
• L.L. Thurstone was a pioneer in the field, initially developing psychophysical scaling methods (relating physical stimuli to psychological sensations) and then extending these principles to psychological scaling (measuring psychological attributes directly). He introduced the concept of absolute scaling, which aims to assign scale values to items or stimuli that are independent of the particular sample of judges used.
• Different types of scales used in psychometrics include:
– Age-based scales: Scores are interpreted based on the average performance of individuals at specific age levels (e.g., mental age).
– Grade-based scales: Interpret scores relative to performance expected at particular academic grade levels. They provide a general benchmark but can be misleading as grade levels are broad.
– Stanine scales: A method of normalizing scores into a nine-point scale, with a mean of 5 and a standard deviation of 2, allowing for quick comparisons within a group.
– Unidimensional vs. multidimensional scales: Unidimensional scales measure a single underlying construct (e.g., anxiety), while multidimensional scales measure multiple related constructs or facets (e.g., different aspects of intelligence).
– Comparative vs. categorical scales: Comparative scales involve judging items relative to each other (e.g., which of these two is more important), whereas categorical scales involve assigning items to distinct categories based on predefined criteria.
• Rating Scales: These are the most common scaling methods. Respondents indicate their agreement, frequency, or intensity of a feeling or behavior on a continuum.
– Likert scale: Typically uses a 5-point or 7-point format (e.g., 1 = Strongly Disagree to 5 = Strongly Agree). Scores for individual items are often summed to create a total score, assuming that each point on the scale represents an equal interval of the underlying construct.
– Summative scores: Scores are typically summed across multiple items to yield a total score that is interpreted as representing the amount of the trait possessed.
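• A minimal Python sketch of summative Likert scoring follows; the item names, the reverse-keyed item, and the 5-point range are illustrative assumptions, not items from any real scale.

```python
# Minimal sketch: summative (Likert) scoring with reverse-keyed items.
# Item names, keying, and the 5-point range are illustrative assumptions.

def score_likert(responses, reverse_keyed=(), scale_max=5, scale_min=1):
    """Sum item ratings into a total score, flipping reverse-keyed items."""
    total = 0
    for item, rating in responses.items():
        if item in reverse_keyed:
            rating = scale_max + scale_min - rating  # e.g., 5 -> 1, 4 -> 2
        total += rating
    return total

# Example: four hypothetical items, one worded in the opposite direction.
answers = {"q1": 4, "q2": 5, "q3": 2, "q4": 1}
print(score_likert(answers, reverse_keyed={"q3"}))  # 4 + 5 + 4 + 1 = 14
```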
• Paired Comparisons: In this method, respondents are presented with all possible pairs of stimuli (e.g., items, objects, statements) and asked to choose, for each pair, which one they prefer or which better exemplifies a particular attribute. Scale values are then derived from how often each stimulus is chosen over the others across all pairs and judges, reflecting the overall preference hierarchy.
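• A minimal sketch of tallying paired-comparison data is shown below. The stimuli and judge choices are invented, and the sketch simply counts how often each stimulus is preferred; it is not a full Thurstone Case V scaling solution.

```python
# Minimal sketch: tallying paired-comparison choices into a preference order.
# The stimuli and judge choices are invented; this counts "wins" per stimulus
# rather than applying a full Thurstone Case V scaling solution.
from collections import Counter
from itertools import combinations

stimuli = ["honesty", "loyalty", "ambition"]
# Each judge sees every pair once and names the preferred member of the pair.
judge_choices = [
    {("honesty", "loyalty"): "honesty", ("honesty", "ambition"): "honesty",
     ("loyalty", "ambition"): "loyalty"},
    {("honesty", "loyalty"): "loyalty", ("honesty", "ambition"): "honesty",
     ("loyalty", "ambition"): "ambition"},
]

wins = Counter()
for judge in judge_choices:
    for pair in combinations(stimuli, 2):
        wins[judge[pair]] += 1

# Proportion of times each stimulus was preferred across all pairs and judges.
n_comparisons = len(judge_choices) * (len(stimuli) - 1)
for s in sorted(stimuli, key=lambda x: -wins[x]):
    print(s, wins[s] / n_comparisons)
```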
• Sorting Tasks: These tasks require respondents to organize items into groups or put them in order.
– Comparative sorting (Rank Order): Respondents arrange a set of stimuli in a specific order based on a given criterion (e.g., ranking a list of values from most to least important).
– Categorical sorting (Pile-sorting): Respondents sort items into qualitative bins or categories that they define themselves or that are predefined (e.g., sorting traits into "like me" vs. "not like me" piles).
• Guttman Scalogram: This method constructs a hierarchical scale where items are ordered in terms of difficulty or extremity. Endorsement of a stronger, more extreme item implies endorsement of all weaker, less extreme items. For example, if a person agrees with the statement "I am willing to die for my country," they are also assumed to agree with "I am willing to serve my country." It aims for cumulative items that reflect an underlying continuum.
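• The sketch below checks how well a set of responses fits a cumulative Guttman pattern. The item ordering, response matrix, and the simple error-counting rule are assumptions made for illustration; a real scalogram analysis would use a standard reproducibility procedure on a full dataset.

```python
# Minimal sketch: checking how well responses fit a cumulative (Guttman) pattern.
# Items are assumed to be ordered from least to most extreme; data are invented.
# Reproducibility is computed as 1 - (errors / total responses), counting errors
# as the fewest changes needed to turn each response vector into an ideal
# "all weaker items endorsed" pattern.

def guttman_errors(pattern):
    n = len(pattern)
    ideals = [[1] * k + [0] * (n - k) for k in range(n + 1)]
    return min(sum(a != b for a, b in zip(pattern, ideal)) for ideal in ideals)

responses = [  # 1 = endorsed; columns run weakest -> strongest statement
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],  # non-cumulative pattern -> one error
    [1, 1, 1, 1],
]

total_errors = sum(guttman_errors(r) for r in responses)
reproducibility = 1 - total_errors / (len(responses) * len(responses[0]))
print(f"Coefficient of reproducibility: {reproducibility:.2f}")  # 0.94 here
```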
• Equal-Appearing Intervals (Thurstone) procedure: This multi-step method aims to create attitude scales where the intervals between scale points appear to be equal.
Statement Generation: A large pool of statements (e.g., 100-200) expressing various degrees of pro- and anti-attitude towards a specific topic is generated.
Expert Judges Rating: A panel of expert judges (typically 50-100) independently rates each statement on an 11-point continuum (from 1 = extremely unfavorable to 11 = extremely favorable), based on how strongly the statement expresses the attitude, assuming equal intervals between points.
Statistical Analysis: For each statement, the mean (\bar x) and standard deviation (s) of the judges' ratings are computed. The mean indicates the scale value of the item, and the standard deviation indicates the level of agreement among judges.
Item Selection: Items are selected that have low standard deviations (indicating high agreement among judges) and that cover a broad range of mean scale values (\bar x spread), ensuring that the selected items represent distinct points along the attitude continuum.
Respondent Endorsement & Scoring: During test administration, respondents simply endorse the statements with which they agree. Their attitude score is then calculated as the mean scale value of all the items they endorsed.
• It is a common practice to draft approximately twice the number of items intended for the final test. This generous initial item count helps ensure that sufficient items remain after the elimination of problematic items during item analysis, thereby maintaining adequate domain representation and content coverage.
• Sources for item generation are varied and often combined to ensure comprehensive and relevant item content:
– Literature reviews: Existing theories, research, and previous scales provide a foundation for item construction.
– Subject matter experts (SMEs): Experts in the field (e.g., psychologists, educators, engineers) contribute items based on their specialized knowledge.
– Interviews with relevant populations: Conversations with clinicians, teachers, workers, or individuals from the target population can reveal real-world experiences, common misconceptions, or critical behaviors to be assessed.
– Critical incidents technique: Identifying specific behaviors or events that are critical to the construct being measured.
• Item Formats offer different ways for examinees to respond:
– Selected-response formats: Require examinees to choose from given options, facilitating objective scoring. Examples include:
– Multiple-choice questions (MCQ): Consist of a stem (the question or incomplete statement), a keyed correct answer, and several plausible distractors (incorrect options designed to be attractive to those who lack true knowledge).
– Matching items: Present two columns of stimuli, and examinees pair items from one column with items from the other.
– True–false (binary choice): Simplest format, requiring a decision between two alternatives.
– Constructed-response formats: Require examinees to generate their own answer, often allowing for deeper assessment of knowledge but requiring more subjective scoring. Examples include:
– Completion items (fill-in-the-blank): Examinees provide a word or phrase to complete a sentence.
– Short-answer questions: Require a brief, concise response.
– Essay questions: Demand extended, organized responses, demonstrating critical thinking, analysis, and writing skills.
• General Guidelines for Item Writing focus on clarity, precision, and fairness:
– Clarity and simplicity: Items should be clear, unambiguous, and written at an appropriate reading level for the target audience.
– Single idea: Each item should ideally assess only one specific concept or idea to avoid confusion and ensure accurate measurement.
– Parallel grammar and structure: For selected-response items, all options should be grammatically consistent with the stem and similar in length and complexity, reducing extraneous cues.
– Plausible distractors: Distractors in MCQs should be incorrect but realistic and attractive to those who do not know the correct answer. Avoid obviously incorrect or absurd options.
– Avoid clues: Ensure that items do not contain inadvertent clues to the correct answer (e.g., grammatical agreement with the stem, specific determiners like "always" or "never").
– Avoid negatives/double negatives: Phrasing items negatively can increase cognitive load and confusion.
• Item Banks: These are large, organized repositories of test items that have been extensively calibrated and tagged with psychometric properties (e.g., difficulty, discrimination, content area). They allow for efficient test construction and the generation of multiple, equivalent test forms.
• Computer Adaptive Testing (CAT): An advanced form of computerized testing where an algorithm dynamically selects items for each examinee based on their responses to previous items. The system continuously estimates the examinee's ability level and presents items that are maximally informative at that specific ability level.
– Benefits of CAT: A significant reduction in test length (often about 50\% fewer items than a traditional linear test) while maintaining or even increasing precision (with measurement error reduced by as much as 50\%), making tests more efficient and less burdensome; a simple branching sketch follows this list.
– Key Concepts in CAT:
– Item branching: The process by which the computer selects the next item based on the examinee's previous answer (e.g., if correct, present a harder item; if incorrect, present an easier item).
– Floor effect: Occurs when a test or item set is not sensitive enough to discriminate among individuals at the very low end of the ability or trait continuum, potentially leading to inaccurate low scores.
– Ceiling effect: The opposite of a floor effect, occurring when a test fails to discriminate among individuals at the very high end of the ability or trait continuum, leading to inaccurately high scores that don't fully capture exceptional performance.
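• The toy sketch below illustrates item branching with a simple up/down (staircase) rule: a correct answer leads to a harder item, an incorrect answer to an easier one. The item bank, the response model, and the examinee's "true ability" are all invented; a production CAT would select items via IRT-based information maximization rather than this fixed step rule.

```python
# Toy sketch of item branching in a computer adaptive test.
# This is a simple up/down (staircase) rule, not a full IRT-based CAT:
# a correct answer leads to a harder item, an incorrect answer to an easier one.
# The item bank and the examinee's "true ability" are invented for illustration.
import random

random.seed(0)
item_bank = list(range(1, 21))             # item difficulty levels 1 (easy) .. 20 (hard)
true_ability = 12                          # hypothetical examinee

def answers_correctly(ability, difficulty):
    """Probability of success falls as difficulty exceeds ability."""
    return random.random() < 1 / (1 + 2 ** (difficulty - ability))

level = 10                                 # start near the middle of the bank
administered = []
for _ in range(8):                         # fixed, short test length for the demo
    administered.append(level)
    if answers_correctly(true_ability, level):
        level = min(level + 2, max(item_bank))   # branch to a harder item
    else:
        level = max(level - 2, min(item_bank))   # branch to an easier item

print("Difficulty of items administered:", administered)
print("Rough ability estimate:", sum(administered[-4:]) / 4)
```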
• Cumulative Scoring Model: This is the most common model, where higher total scores are interpreted as indicative of a greater amount of the measured trait or ability. The sum of correct answers or scale ratings directly reflects the extent to which an individual possesses the construct.
• Class/Category Scoring Model: This model classifies examinees into distinct diagnostic groups or categories based on their test responses. Instead of a continuous score, the outcome is a categorical assignment (e.g., clinical diagnosis, eligibility for a program, primary learning style). This is often used in diagnostic assessments.
• Ipsative Scoring Model: This model involves intra-individual comparisons, meaning it describes an individual's strengths and weaknesses relative to their own profile rather than comparing them to others. For example, in the Edwards Personal Preference Schedule (EPPS), respondents make forced choices between statements, and scores indicate the relative strength of different needs within that individual. Ipsative scores are therefore unsuitable for inter-individual comparisons because they do not reflect absolute levels of traits but rather their relative hierarchy within one person.
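• The short sketch below contrasts cumulative and ipsative scoring. The forced-choice pairs are invented and only loosely inspired by instruments like the EPPS; they are not its actual items or scales.

```python
# Minimal sketch contrasting cumulative and ipsative scoring.
# The forced-choice pairs below are invented and only loosely inspired by
# instruments like the EPPS; they are not its actual items or scales.
from collections import Counter

# Cumulative: simply sum keyed responses (e.g., number of "true" answers).
keyed_responses = [1, 0, 1, 1, 0, 1]
cumulative_score = sum(keyed_responses)          # higher total = more of the trait

# Ipsative: each forced choice awards a point to one need *at the expense of*
# another, so scores describe the person's internal hierarchy of needs.
forced_choices = [
    ("achievement", "affiliation", "achievement"),
    ("achievement", "autonomy", "autonomy"),
    ("affiliation", "autonomy", "affiliation"),
    ("achievement", "affiliation", "achievement"),
]
ipsative_profile = Counter(choice for _, _, choice in forced_choices)

print("Cumulative score:", cumulative_score)        # comparable across people
print("Ipsative profile:", dict(ipsative_profile))  # comparable only within person
```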
• This phase involves administering the newly constructed or revised test to a preliminary sample of examinees. The sample should be similar in its characteristics to the target population for whom the final test is intended (e.g., age, educational background, relevant demographics, clinical status).
• A general rule-of-thumb for sample size in test try-out is to include 5 to 10 subjects per item on the test. For a 50-item test, this would suggest a sample of 250 to 500 participants.
• To ensure that the data collected are genuinely representative and useful for improving the test, testing conditions during the try-out should meticulously mimic the predicted conditions for future standardized administration. This involves adhering to standardized instructions, time limits, environmental settings, and proctoring procedures. The goal is to control extraneous variance (factors unrelated to the construct being measured but that might influence scores), thereby maximizing the representativeness and generalizability of the try-out data.
• Item analysis involves a suite of statistical procedures used to evaluate the quality and effectiveness of individual test items based on data collected during the test try-out phase. It helps identify items that are too easy, too hard, poor discriminators, or misleading; a computational sketch of the core indices appears after this list.
– Difficulty Index (p_i=\frac{\#\text{ correct}}{N}): This index, denoted by p_i, represents the proportion of examinees who answered an item correctly. It ranges from 0 (no one answered correctly, the item is extremely difficult) to 1 (everyone answered correctly, the item is extremely easy). For power tests (where speed is not a factor and all examinees are expected to attempt all items), the optimal average difficulty for items is typically around 0.5 to maximize variance and discrimination among test-takers. For k-option multiple-choice questions, the optimal difficulty can be calculated as \frac{1+\text{chance}}{2}. For example, for a 5-option MCQ (chance of guessing correct is 0.20), the optimal difficulty would be \frac{1+0.20}{2}=0.60. If an item is too easy or too hard, it doesn't effectively differentiate between test-takers.
– Item-Score Standard Deviation (s_i=\sqrt{p_i(1-p_i)}): This measures the variability of scores on a single item. Items with higher standard deviations contribute more to the overall test variance and thus have greater potential to discriminate among examinees. An item with very low s_i (e.g., nearly everyone gets it right or wrong) provides little information.
– Item-Reliability Index (s_i r_{iT}): This index is the product of the item's standard deviation (s_i) and its point-biserial correlation (r_{iT}) with the total test score. The point-biserial correlation indicates how well an item discriminates between high and low scorers on the overall test. A higher item-reliability index suggests that the item is a consistent measure of the same construct as the rest of the test, contributing to the internal consistency reliability of the test.
– Item-Validity Index (s_i r_{iC}): Similar to the item-reliability index, but here r_{iC} represents the correlation of the item score with an external criterion measure. A higher item-validity index indicates that the item contributes meaningfully to the overall predictive validity of the test concerning an external outcome.
– Discrimination Index (d=\frac{U-L}{n_{\text{subgroup}}}): This is a simple yet powerful index that compares the performance of the upper and lower groups of examinees on a particular item. Typically, it calculates the difference between the number of examinees in the upper 27\% of overall test scorers (U) who answered the item correctly and the number of examinees in the lower 27\% of overall test scorers (L) who answered the item correctly, divided by the number of examinees in each subgroup (n_{\text{subgroup}}). The range of d is from -1 to 1. A high positive d indicates good discrimination (high scorers get it right, low scorers get it wrong). A negative d value indicates a serious flaw (low scorers are more likely to get the item correct than high scorers), suggesting the item is misleading, confusing, or keyed incorrectly.
– Item Characteristic Curves (ICCs): Used in Item Response Theory (IRT), ICCs graphically represent the relationship between an examinee's underlying ability or trait level (\theta, typically on the x-axis) and the probability (P(\text{correct}), on the y-axis) of answering a specific item correctly. The shape of the ICC provides crucial information about the item:
– A steep slope indicates high discrimination, meaning the item effectively differentiates between individuals with slightly different ability levels.
– A flat or irregular curve suggests a poor item: a flat curve indicates low discrimination, while an irregular (non-monotonic) curve suggests the item is not well aligned with the underlying construct (psychometric texts often illustrate these patterns with example Items A–D).
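• The promised sketch computes the classical indices defined above on a small, invented 0/1 response matrix. Because the sample is tiny, the upper and lower groups are formed by a simple half split rather than the usual top and bottom 27\%.

```python
# Minimal sketch of classical item analysis on a small, invented 0/1 response
# matrix (rows = examinees, columns = items). With so few examinees the
# upper/lower groups are formed by a simple half split rather than the usual 27%.
import numpy as np

responses = np.array([   # 1 = correct, 0 = incorrect
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])
total = responses.sum(axis=1)                    # total test score per examinee
order = np.argsort(total)
lower, upper = order[:3], order[-3:]             # bottom and top half

for i in range(responses.shape[1]):
    item = responses[:, i]
    p = item.mean()                              # difficulty index p_i
    s = np.sqrt(p * (1 - p))                     # item-score standard deviation s_i
    d = item[upper].mean() - item[lower].mean()  # discrimination index d
    # Point-biserial r_iT between item and total; item-reliability index = s * r_iT
    r_it = np.corrcoef(item, total)[0, 1]
    print(f"Item {i + 1}: p={p:.2f}  s={s:.2f}  d={d:+.2f}  s*r_iT={s * r_it:.2f}")
```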
• Guessing: This is a perennial problem in selected-response tests, as random guessing can inflate scores. Complications arise because it's difficult to ascertain if an answer was a lucky guess, an informed elimination of options, or truly known. Traditional correction for guessing formulas (e.g., score = R - \frac{W}{k-1} where R=rights, W=wrongs, k=number of options) attempt to adjust scores but must consider factors like whether examinees omitted items or made educated guesses. Many modern psychometricians argue against correction for guessing due to its limitations.
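• A quick worked application of the traditional correction-for-guessing formula is shown below; the item counts are invented, and omitted items are simply left unpenalized under this convention.

```python
# Quick worked example of the traditional correction-for-guessing formula
# score = R - W / (k - 1). The counts below are invented; omitted items are
# simply not penalized under this convention.

def corrected_score(rights, wrongs, options_per_item):
    return rights - wrongs / (options_per_item - 1)

# 60-item test, 4 options per item: 45 right, 12 wrong, 3 omitted.
print(corrected_score(rights=45, wrongs=12, options_per_item=4))  # 45 - 12/3 = 41.0
```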
• Item Fairness / Differential Item Functioning (DIF): This refers to whether an item functions differently for different groups of examinees, even when those groups have the same underlying ability level. For example, an item might be disproportionately harder for one gender or racial group, even if their true ability on the construct is equivalent to others. DIF is typically flagged by comparing the shapes of ICCs across groups—if the curves for different groups are not approximately equal, DIF may be present. DIF analysis is a statistical method for detecting such biases, and items flagged for DIF typically undergo sensitivity reviews by expert panels to determine if cultural context, language, or content might be contributing to unfairness.
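• The sketch below is a simplified DIF screen: within strata of matched total score, it compares an item's proportion correct across two groups. It is only in the spirit of a Mantel-Haenszel DIF analysis (not the full statistic), and the records and flagging threshold are invented.

```python
# Simplified sketch of a DIF screen: within strata of matched total score,
# compare an item's proportion correct across two groups. This is only in the
# spirit of a Mantel-Haenszel DIF analysis, not the full statistic; all data
# are invented.
from collections import defaultdict

# (group, total_score_band, answered_item_correctly)
records = [
    ("A", "low", 0), ("A", "low", 1), ("A", "low", 0),
    ("B", "low", 0), ("B", "low", 0), ("B", "low", 0),
    ("A", "high", 1), ("A", "high", 1), ("A", "high", 1),
    ("B", "high", 1), ("B", "high", 0), ("B", "high", 1),
]

by_stratum = defaultdict(lambda: defaultdict(list))
for group, band, correct in records:
    by_stratum[band][group].append(correct)

for band, groups in by_stratum.items():
    p_a = sum(groups["A"]) / len(groups["A"])
    p_b = sum(groups["B"]) / len(groups["B"])
    flag = "possible DIF" if abs(p_a - p_b) > 0.2 else "comparable"
    print(f"{band:>4} scorers: group A {p_a:.2f} vs group B {p_b:.2f} -> {flag}")
```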
• Speed Tests: In tests where time limits are very strict and many examinees do not complete all items, item statistics can become distorted. Items appearing late in the test might incorrectly seem harder simply because fewer examinees reached them. To avoid this, item statistics for speeded tests are often analyzed separately for items completed by most examinees, or the test is administered with generous time limits to ensure all items are attempted by all, effectively becoming a power test for analysis purposes.
• Qualitative Item Analysis: Beyond statistical methods, qualitative approaches provide rich insights into how examinees interpret and respond to items:
– Questionnaires and surveys: Administered to examinees post-test to gather feedback on item clarity, difficulty, and relevance.
– Focus groups: Group discussions with examinees or experts to explore perceptions of test items and instructions.
– "Think-aloud" protocols: Examinees verbalize their thought processes as they respond to items, revealing cognitive strategies, misunderstandings, or reasons for choosing specific options.
– Expert/Sensitivity panels: Panels of content experts and cultural sensitivity reviewers review items for accuracy, bias, and appropriateness for diverse populations.
– Everyday Psychometrics example: The use of the HANAA (Here-and-Now Aboriginal Assessment) illustrates a culturally congruent approach. It incorporates yarning, a traditional Aboriginal conversational style, into the assessment process, respecting cultural communication norms and fostering trust, thereby improving the validity of assessments for this population.
• This is an iterative stage where the test is refined based on the comprehensive evidence gathered during item analysis (both statistical and qualitative). The primary goal is to improve the psychometric properties of the test and ensure it meets its intended purpose.
• Actions taken during revision include:
– Deletion of problematic items: Removing items that are poor discriminators, too easy/difficult, or exhibit bias.
– Rewriting items: Revising ambiguous, confusing, or poorly worded items, or those with ineffective distractors.
– Adding new items: If content domains are found to be underrepresented after item deletion, new items may be developed and added to maintain the test's content blueprint.
• The revision process is part of a continuous iterative cycle: Test Try-out (of the revised version) → further Item Analysis → further Revision. This cycle continues until the test developers deem the psychometric properties satisfactory (e.g., meeting target reliability and validity coefficients, desired item distribution). Only then is the test ready for administration to a Standardization Sample to establish robust norms.
• Tests that have been in use for some time periodically require revision themselves to maintain their relevance and psychometric soundness. Key triggers for revising existing tests include:
– Outdated stimuli/language: Language evolves, and stimuli (e.g., pictures, scenarios) can become culturally irrelevant or offensive.
– Cultural shifts: Societal changes may render certain test content or norms inappropriate or biased.
– Norm obsolescence: Normative data gathered years or decades ago may no longer accurately reflect the abilities or characteristics of current populations due to factors like societal changes, educational improvements, or the Flynn effect (the generational increase in IQ scores).
– Theory change: Advances in psychological theory may necessitate changes in how a construct is conceptualized and measured.
– Reliability/validity gains: New research or statistical methods may offer opportunities to improve the test's psychometric properties.
• The steps for revising an existing test largely parallel those for developing a new test (conceptualization, construction, try-out, item analysis, revision). However, a crucial additional consideration is linkage studies.
– Linkage studies: These are research efforts designed to compare scores on the old and new forms of a test (or between different forms of a test) to ensure continuity and comparability of scores over time. This is critical for longitudinal research and clinical tracking.
• Cross-validation: After a test has been revised or its validity established on one sample, it is crucial to apply the test to a new, independent sample. When applying a test (or its scoring key) to a new sample, one should typically expect some validity shrinkage. This means the test's predictive validity (or other forms of validity) will likely be slightly lower in a fresh sample compared to the original sample on which it was developed or refined. This is a common phenomenon in statistics caused by capitalizing on chance factors in the original sample.
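• The sketch below illustrates validity shrinkage: weights derived on one sample usually predict a criterion less well in a fresh, independent sample. The data are simulated, and ordinary least squares is used only as a convenient stand-in for any empirically derived scoring key.

```python
# Minimal sketch of validity shrinkage under cross-validation: regression
# weights derived on one sample usually predict a criterion less well in a
# fresh, independent sample. All data are simulated.
import numpy as np

rng = np.random.default_rng(42)
n, k = 60, 10                                    # small derivation sample, many predictors
X_dev = rng.normal(size=(n, k))
y_dev = X_dev[:, 0] * 0.4 + rng.normal(size=n)   # only one predictor truly matters

w, *_ = np.linalg.lstsq(X_dev, y_dev, rcond=None)   # derive weights on sample 1

X_new = rng.normal(size=(n, k))                  # independent cross-validation sample
y_new = X_new[:, 0] * 0.4 + rng.normal(size=n)

r_dev = np.corrcoef(X_dev @ w, y_dev)[0, 1]
r_new = np.corrcoef(X_new @ w, y_new)[0, 1]
print(f"Validity in derivation sample:       r = {r_dev:.2f}")
print(f"Validity in cross-validation sample: r = {r_new:.2f}  (shrinkage expected)")
```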
• Co-validation/Co-norming: This process involves standardizing and norming multiple tests (e.g., an intelligence test and a memory test) on the same representative sample of individuals. This approach is highly advantageous because it significantly reduces sampling error when comparing scores across different tests administered to the same person (e.g., comparing a WAIS-IV IQ score with a WMS-IV memory index score). Such procedures ensure that score differences are more reliably attributable to actual differences in the constructs measured rather than to differences in the normative samples.
• Quality Assurance in Test Administration and Scoring: To maintain the integrity of test results, particularly for standardized assessments, several quality assurance measures are essential:
– Examiner training: Ensuring that all test administrators are thoroughly trained in standardized test administration procedures.
– Double-scoring with resolvers: For subjective items (like essays) or complex scoring, two independent scorers evaluate responses, and discrepancies are resolved by a third expert.
– Anchor protocols: Providing examples of scored responses at different levels of quality to guide consistent scoring.
– Monitoring scoring drift: Periodically checking to ensure that scorers maintain consistent standards over time and addressing any unintentional shifts.
– Data-entry checks: Implementing procedures (e.g., double data entry) to minimize errors when raw scores are entered into databases.
• Classical Test Theory (CTT): The older, more traditional framework for test development. Key characteristics include:
– Simpler math: Based on the observed score = true score + error formula (X=T+E), the statistical models are relatively straightforward and easier to compute manually or with basic software.
– Small N suffices: CTT methods can be applied with relatively smaller sample sizes for item analysis and test construction compared to IRT.
– Limitations: Item statistics (e.g., difficulty, discrimination) are sample-dependent, meaning they can vary significantly across different groups of examinees. Also, the precision of measurement (reliability) is typically constant across all ability levels; longer tests are often required to achieve acceptable reliability.
• Item Response Theory (IRT): A more modern and sophisticated framework that focuses on the relationship between an examinee's underlying ability/trait and their probability of responding to an item in a certain way. Key characteristics include:
– Sample-free item parameters: A major advantage is that item parameters (difficulty, discrimination) are estimated to be independent of the specific sample of examinees used, allowing for standardized item banks.
– Supports CAT & item banking: Its principles are foundational for developing computer adaptive tests (CAT) and building robust item banks, as item characteristics are precisely calibrated.
– Test information functions: IRT provides test information curves, which show the precision of measurement (reliability) at different points along the ability continuum, allowing test developers to optimize test length and item selection for specific ability ranges.
– Requirements: IRT models generally require large sample sizes (typically N \approx 200+ examinees) for stable parameter estimation and involve more complex mathematical and statistical models.
– Applications: IRT is used to refine tests by understanding how individual items perform at different ability levels, to detect Differential Item Functioning (DIF) more rigorously, and to build calibrated item banks for flexible test construction.
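• The sketch below shows item characteristic curves and item/test information under a two-parameter logistic (2PL) model; the item parameters (a = discrimination, b = difficulty) are invented for illustration.

```python
# Minimal sketch of two-parameter logistic (2PL) item characteristic curves and
# item information. Item parameters (a = discrimination, b = difficulty) are
# invented for illustration.
import numpy as np

def icc_2pl(theta, a, b):
    """P(correct | theta) under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return a ** 2 * p * (1 - p)

theta = np.linspace(-3, 3, 7)                  # ability levels to inspect
items = [(1.8, -1.0), (0.6, 0.0), (1.8, 1.5)]  # (a, b) pairs

for a, b in items:
    probs = icc_2pl(theta, a, b)
    print(f"a={a}, b={b:+.1f}: P(correct) =", np.round(probs, 2))

# The test information function is the sum of item informations; its peak shows
# where along theta this item set measures most precisely.
test_info = sum(item_information(theta, a, b) for a, b in items)
print("Test information:", np.round(test_info, 2), "-> peak near theta =",
      theta[int(np.argmax(test_info))])
```

A steep, well-placed ICC (high a, b near the ability range of interest) contributes the most information, which is why IRT-based test assembly selects items by where their information peaks.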
• The process of developing a robust item bank using IRT involves several systematic steps:
Collect Items: Gather a large pool of items, including existing validated items (if available) and newly developed items.
Content & Cognitive Reviews: Subject all items to rigorous reviews by subject matter experts for content accuracy and relevance, and by cognitive psychologists or through cognitive interviews to ensure clarity and appropriate cognitive processing by examinees. Items are refined based on this feedback.
Preliminary Bank: The refined items are organized into a preliminary item bank.
Field Test on Large N: The items are then administered to a large, diverse sample of examinees that is representative of the target population. A large sample (N > 200) is crucial for stable IRT parameter estimation.
Calibrate (IRT): Apply IRT models to the field test data to estimate item parameters (e.g., difficulty, discrimination) for each item. This process places all items on a common proficiency scale.
Final Bank: Based on the calibration results and further qualitative review, the best-performing items are included in the final item bank, complete with their estimated IRT parameters. Poor items are discarded or revised.
• Outputs and Benefits of an IRT Item Bank:
– Fixed-length parallel forms: Ability to generate multiple equivalent test forms automatically from the bank, ensuring comparability across administrations.
– CAT engines: The calibrated item bank serves as the foundation for developing sophisticated computer adaptive tests.
– Public pools: In some contexts, item banks can contribute to public pools of calibrated items used for various assessment purposes.
• Many psychometric complaints regarding classroom tests (e.g., "that question was unfair," "the test didn't cover what we learned") can be directly translated into issues of reliability and validity. An unfair question often reflects poor item discrimination or ambiguity, impacting reliability. A test not covering learned material points to a lack of content validity.
• While formal psychometric analyses are rarely feasible for individual instructors, professors can still address key psychometric principles, primarily focusing on validity and minimizing measurement error:
– Content Validity via Blueprinting: Instructors consciously ensure content validity by creating a test blueprint that outlines the specific chapters, lecture topics, and cognitive skills (e.g., recall, application, analysis) that will be covered on the test, ensuring alignment with learning objectives.
– Mitigate Error: Strategies to reduce measurement error include:
– Clear instructions: Providing unambiguous directions for test completion.
– Uniform administration: Ensuring consistent testing conditions (e.g., time limits, environment) for all students.
– Double-scoring essays/subjective items: Having two graders score essays independently greatly enhances scoring reliability and reduces bias.
– INFORMAL Methods for Psychometric Evaluation: Professors can approximate formal psychometric evaluation through practical, informal methods:
– Item review: Carefully reading and rereading items for clarity, ambiguity, and potential flaws.
– Discussions with students: Gathering feedback spontaneously or via structured discussions after a test to identify confusing items.
– Grade analysis: Observing the distribution of grades, identifying whether particular items were missed by many high-achieving students, or consistently answered by low-achieving students, which could indicate a problematic item.
• Item difficulty (proportion correct): p=\frac{k}{N} where k is the number of examinees who answered the item correctly, and N is the total number of examinees who attempted the item.
• Item Standard Deviation: s=\sqrt{p(1-p)} where p is the item difficulty index.
• Optimal item difficulty for k-option MCQs: p_{opt}=\frac{1+P_{chance}}{2} where P_{chance} is the probability of guessing the correct answer randomly (\frac{1}{k}, where k is the number of options).
• Discrimination index: d=\frac{\#U-\#L}{n} where \#U is the number of examinees in the upper scoring group who answered correctly, \#L is the number in the lower scoring group who answered correctly, and n is the number of examinees in each subgroup (typically the top and bottom 27\%).
• Item-Reliability Index: s\, r_{iT} where s is the item standard deviation and r_{iT} is the point-biserial correlation between the item score and the total test score.
• Item-Validity Index: s\, r_{iC} where s is the item standard deviation and r_{iC} is the correlation between the item score and an external criterion score.
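• A tiny worked application of these summary formulas follows; the item scores, total scores, and criterion values are all invented.

```python
# Tiny worked application of the summary formulas above; all numbers invented.
import numpy as np

# Optimal difficulty for a 4-option multiple-choice item: p_opt = (1 + 1/k) / 2
k = 4
p_opt = (1 + 1 / k) / 2
print(f"Optimal p for a {k}-option MCQ: {p_opt:.3f}")   # 0.625

# Item-reliability (s * r_iT) and item-validity (s * r_iC) indices for one item.
item      = np.array([1, 0, 1, 1, 0, 1, 1, 0])          # 0/1 item scores
total     = np.array([38, 22, 30, 35, 25, 33, 40, 28])  # total test scores
criterion = np.array([3.6, 2.1, 3.0, 3.4, 2.5, 2.9, 3.8, 2.7])  # external criterion

p = item.mean()
s = np.sqrt(p * (1 - p))                 # item-score standard deviation
r_it = np.corrcoef(item, total)[0, 1]    # point-biserial with total score
r_ic = np.corrcoef(item, criterion)[0, 1]

print(f"Item-reliability index s*r_iT = {s * r_it:.3f}")
print(f"Item-validity index    s*r_iC = {s * r_ic:.3f}")
```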