What are precursors to item response theory (IRT)?
In classical test theory: using the persons-by-items response matrix to calculate person scores and item statistics (difficulty, discrimination).
In Thurstone scaling: estimating scale values for items, using the scale values to score people, placing people and items on the same scale.
In Guttman scaling: placing items and persons on the same scale, the “step function” (examining the relationship between total score and proportion correct).
In Likert scaling: Estimating scale values for the response options of polytomous items.
What is item response theory?
A family of mathematical models for developing measurement instruments of all types (ability, achievement, personality, attitudes and preferences). Includes models for dichotomously scored items, polytomously scored items, unidimensional constructs, multidimensional constructs, response times, and pair-comparison scaling. Previously called latent trait theory and item characteristic curve (ICC) theory.
While IRT models differ in specifics, what do they all have in common?
All have a similar general form (they are probabilistic, persons and items are on the same scale, probability of a response is primarily a function of the relationship between a person and an item).
All have the following features: item statistics (called “parameters”) are estimated from the persons-by-items response matrix; person parameters are estimated from a person’s responses to each item, taking into account each item’s parameters, which also yields an individualized standard error of measurement. Measuring instruments are constructed and evaluated based on their information structure.
What are item response functions (IRFs) and what are the two primary purposes for which they are used?
IRFs are estimated from the persons-by-items response matrix using complex mathematical algorithms. The currently popular algorithm is called marginal maximum likelihood. The quality of the estimates depends on how the algorithm is implemented. The output is a, b, and c parameters for each item.
They are used for two primary purposes: constructing measuring instruments, by transforming the IRFs and combining the resulting functions; and scoring people, by combining IRFs with person responses to estimate a theta and SEM for each person.
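As a concrete illustration, here is a minimal Python sketch of a three-parameter logistic (3PL) IRF; the item parameter values are made up purely for illustration, and the optional D = 1.7 scaling constant is omitted.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """3PL item response function: probability of a correct response at
    trait level theta, given discrimination a, difficulty b, and
    pseudo-guessing c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Made-up item parameters, for illustration only.
theta = np.linspace(-3, 3, 61)
p_correct = irf_3pl(theta, a=1.2, b=0.5, c=0.2)
```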
Constructing measuring instruments: item information
Estimate IRFs from response data. Transform the IRFs to item information functions.
Constructing measuring instruments: test information
Sum the item information conditional on theta for all items in the test. The result is the test information function. Information reflects precision of measurement. Tells you where on the scale the test measures well and where it measures poorly. Design your test to put the precision where you need it.
Constructing measuring instruments: conditional standard error of measurement
From the test information function: At each value of theta, take the square root of the value of test information. Compute its reciprocal. This is the model-predicted standard error of measurement. It reflects imprecision or error of measurement on the theta scale.
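Tying the last three cards together, here is a minimal sketch (made-up 3PL parameters) that computes item information, sums it into test information, and converts that into the model-predicted conditional SEM.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_info_3pl(theta, a, b, c):
    """3PL item information (Birnbaum's formula)."""
    p = irf_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

# Made-up parameters for a short three-item test.
items = [(1.2, -1.0, 0.20), (0.8, 0.0, 0.25), (1.5, 1.0, 0.20)]
theta = np.linspace(-3, 3, 61)

# Test information: the sum of item information, conditional on theta.
test_info = sum(item_info_3pl(theta, a, b, c) for a, b, c in items)

# Conditional SEM: reciprocal of the square root of test information.
sem = 1 / np.sqrt(test_info)
```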
What are the steps for constructing measuring instruments using IRT?
Write items that are assumed to measure the trait.
Administer those items to a “tryout” (or calibration) group of appropriate examinees.
Estimate IRT item parameters, using a model appropriate to the item format and appropriate software. Use a 2-parameter model for “free response” or “constructed response” items. Use a 3-parameter model for multiple-choice items. Use a polytomous model for Likert-type (rating scale) items and for “partial credit” items.
Delete non-fitting items.
Compute item information functions for each item.
Specify a target test information function based on the purpose for which the measurements are to be used. Example purposes: measure a wide range of individual differences with equal precision; measure individuals within a restricted range; classify people around a cut score without regard to individual differences; classify people around a cut score but also measure individual differences above (or below) that score.
Select items based on their information functions, adding each item's information values to the sum of information for previously selected items (a sketch of this selection step appears after these steps).
Compare the resulting test information function to the target test information function.
The test is complete when the observed and target information functions are close to each other throughout the range of interest.
If additional items are necessary, write new items in the appropriate range of the trait (to the extent possible), and go back to Step 2.
If another (parallel) form of the test is needed, select another set of items for which the test information function is similar to the target test information function.
Calculate the test SEM function to determine whether the test will provide the level of precision that is required (low SEM in the appropriate theta ranges).
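One possible implementation of the selection step above is a greedy loop that keeps adding the unused item that best fills the remaining gap to the target; the bank, the flat target, and the grid here are all invented for illustration.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_info_3pl(theta, a, b, c):
    p = irf_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

rng = np.random.default_rng(0)
theta = np.linspace(-3, 3, 61)

# Made-up calibrated bank of 100 items.
bank = [(rng.uniform(0.5, 2.0), rng.uniform(-2.5, 2.5), 0.2) for _ in range(100)]
infos = [item_info_3pl(theta, a, b, c) for a, b, c in bank]

# Invented flat target: information of 10 across -2 <= theta <= 2 (equal precision).
target = np.where(np.abs(theta) <= 2, 10.0, 0.0)

selected, achieved = [], np.zeros_like(theta)
remaining = set(range(len(bank)))
while remaining and np.any(achieved < target):
    gap = np.maximum(target - achieved, 0.0)
    # Greedily pick the unused item whose information best fills the remaining gap.
    best = max(remaining, key=lambda i: float(np.sum(infos[i] * gap)))
    selected.append(best)
    achieved += infos[best]
    remaining.remove(best)
# `achieved` is the observed test information function; compare it to `target`.
```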
How is measuring individuals different using classical test theory vs item response theory?
With classical test methods, people are measured by counting the number of items answered correctly. In a 6-item test measuring 20 people, how many distinctions can be made among those 20 people? 7. In a 6-item test measuring 1,000 people, how many distinctions can be made among those 1,000 people? 7. How many distinctions can be made with a 20-item test measuring 1,000 people? 21. Each item is considered to be equivalent and all items receive the same weight (1 = correct, 0 = incorrect). A difficult and an easy item are considered equivalent. A “good” item (high discrimination) receives the same weight as a “poor” item (low discrimination).
With IRT methods, number correct is not used. Items are “weighted” by difficulty, discrimination, guessing, and whether or not each was answered correctly. The entire pattern of item responses is taken into account for each examinee. How many distinctions among people can be made with a 6-item test using IRT? 2^6 = 2 x 2 x 2 x 2 x 2 x 2 = 64 (vs 7 for number-correct scoring). How many distinctions among people can be made with a 20-item test using IRT? 2^20 = 1,048,576 (vs 21 for number-correct scoring).
How do we measure people with IRT?
Basic method is called “maximum likelihood estimation” or MLE. When an item is answered correctly, the IRF for that item is used. When an item is answered incorrectly, 1 minus the IRF is used. Multiply the appropriate IRFs for each item answered. The result is a likelihood function, which reflects how likely each value of theta is, given a particular response pattern to each item answered.
How does maximum likelihood estimation work?
The likelihood function is computed for each person independently of other persons. It uses the person’s complete pattern of responses in conjunction with the estimated parameters of each item. It provides two numerical values for each person: an estimate of the person’s trait level, by locating the maximum of the likelihood function over theta, and an estimate of the imprecision of the theta estimate, from the curvature of the function at its maximum. More curvature means less imprecision (more precise); less curvature means more imprecision (less precise).
This value is the observed standard error of measurement (SEM). It differs from the model-predicted SEM, which comes from the test information function, because it takes into account the person’s response pattern.
Each response pattern will result in a different likelihood function and therefore a different theta-hat and a different SEM.
What are the steps for maximum likelihood estimation?
Administer a set of test items to a group of individuals and estimate IRT parameters for each item (a, b, c).
Administer the items (or a subset) to the person whose trait level you want to measure.
Score the items as correct or incorrect.
Based on that person's pattern of correct and incorrect responses, determine the appropriate item response functions (IRFs) for each item. If the item was answered correctly, use the probability of a correct response for the IRF. If the item was answered incorrectly, use the probability of an incorrect response as the IRF (i.e., 1 minus probability correct).
Multiply the appropriate IRFs across all items at all values of theta. The result is the likelihood function.
Determine the value of the likelihood that is the largest, i.e., the maximum of the likelihood function. Associated with this largest value is a value of theta, which is the person's theta estimate. It is the level of theta that “most likely gave rise to that pattern of correct/incorrect answers to that set of items with the specified item parameters,” given the IRT model used. It is, therefore, the maximum likelihood estimate of theta.
Evaluate the curvature of the likelihood function at its maximum. That curvature is related to the “information” in the response pattern. Take the reciprocal of the square root of that value. The result is the observed standard error of the estimate. It can be interpreted as the standard error of measurement for that person's answers to that set of items.
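A grid-based sketch of these steps, with invented item parameters and responses; the curvature at the maximum is approximated with a numerical second difference of the log-likelihood.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# Made-up calibrated item parameters and one person's scored responses.
items = [(1.2, -1.0, 0.20), (0.9, -0.3, 0.25), (1.5, 0.4, 0.20), (1.1, 1.2, 0.20)]
responses = [1, 1, 0, 1]  # 1 = correct, 0 = incorrect

theta = np.linspace(-4, 4, 801)
log_lik = np.zeros_like(theta)
for (a, b, c), u in zip(items, responses):
    p = irf_3pl(theta, a, b, c)
    # Correct response: use the IRF; incorrect: use 1 minus the IRF.
    log_lik += np.log(p if u == 1 else 1 - p)

i = int(np.argmax(log_lik))
theta_hat = theta[i]  # maximum likelihood estimate of theta

# Curvature at the maximum via a second difference; its negative is the
# observed information, and the SEM is the reciprocal of its square root.
h = theta[1] - theta[0]
curvature = (log_lik[i - 1] - 2 * log_lik[i] + log_lik[i + 1]) / h**2
sem = 1 / np.sqrt(-curvature)
```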
Describe Bayesian theta estimation
The likelihood function is modified by a prior distribution to create a posterior distribution. For maximum likelihood estimation, the prior distribution is rectangular, and therefore has no differential effect.
What are three Bayesian methods for estimating theta?
Maximum A Posteriori = MAP. Determines the estimate by finding the maximum of the posterior distribution (the product of the likelihood function and the prior distribution). Also called “Bayesian Modal” since the maximum of the distribution is also the mode of the distribution.
Expected A Posteriori = EAP. Determines the estimate by calculating the mean of the posterior distribution. In statistics, the mean is the “expectation.”
Weighted Maximum Likelihood. Weights the likelihood function by the test information function. This is an “empirical” prior distribution.
MAP and EAP use theoretical priors. Their SEMs are the standard deviation of the posterior distribution. They do not necessarily result in the same theta estimates.
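A grid-based sketch of MAP and EAP under a standard normal prior (items and responses invented); note how MLE's rectangular prior is replaced by a normal density.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

items = [(1.2, -1.0, 0.20), (0.9, -0.3, 0.25), (1.5, 0.4, 0.20)]  # made up
responses = [1, 0, 1]

theta = np.linspace(-4, 4, 801)
lik = np.ones_like(theta)
for (a, b, c), u in zip(items, responses):
    p = irf_3pl(theta, a, b, c)
    lik *= p if u == 1 else 1 - p

prior = np.exp(-0.5 * theta**2)      # standard normal prior (unnormalized)
post = lik * prior
post /= np.trapz(post, theta)        # normalize the posterior on the grid

theta_map = theta[np.argmax(post)]         # MAP: mode of the posterior
theta_eap = np.trapz(theta * post, theta)  # EAP: mean of the posterior
# SEM for EAP: standard deviation of the posterior distribution.
sem_eap = np.sqrt(np.trapz((theta - theta_eap) ** 2 * post, theta))
```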
Describe how IRT works with polytomous rating scale items.
The Graded Response Model is based on a series of ordered two-parameter logistic models estimated at the boundaries between successive dichotomizations, called boundary response functions (BRFs). The slopes of these functions (discriminations) are estimated to be the same across boundaries but different across items. Each option of a Likert-type rating scale item has its own response function, called a “category response function” or an “option response function.” These are obtained by successive subtraction of the BRFs. Each item has a single discrimination.
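A minimal sketch of GRM category response functions for one made-up four-category item: the BRFs are 2PL curves sharing one discrimination, and the CRFs come from successive subtraction.

```python
import numpy as np

def brf(theta, a, b):
    """Boundary response function: a 2PL curve at one category boundary."""
    return 1 / (1 + np.exp(-a * (theta - b)))

# Made-up GRM parameters for one 4-category Likert item:
# one discrimination, three ordered boundary difficulties.
a, bounds = 1.3, [-1.5, 0.0, 1.2]

theta = np.linspace(-4, 4, 161)
# P*(k): probability of responding in category k or higher.
star = ([np.ones_like(theta)]
        + [brf(theta, a, b) for b in bounds]
        + [np.zeros_like(theta)])
# Category response functions by successive subtraction of adjacent BRFs.
crf = [star[k] - star[k + 1] for k in range(len(star) - 1)]
assert np.allclose(sum(crf), 1.0)  # the CRFs sum to 1 at every theta
```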
Describe how theta estimation works for polytomous models
For dichotomous items, multiply the appropriate IRFs for each item answered; for polytomous items, multiply the category response functions corresponding to the options chosen. For both dichotomous and polytomous items, the theta estimate is obtained from the maximum of the function and the SEM from the curvature at the maximum.
Describe item fit for polytomous items
Fit is evaluated for each option of an item, then aggregated across the options for the item.
What are some advantages of IRT over classical models?
IRT is a coherent set of mathematical models that can be used with all types of measuring instruments. It shares a common set of characteristics: persons and items are on the same scale, items are characterized by sets of item parameters, and model fit can be evaluated for items.
Items and tests can be described in terms of information. Item information can be calculated for each item. Test information is the sum of item information. Test information can be transformed into a test SEM function. Tests can be built to match a target information function. Target is based on purpose of the measurements. Information reflects precision. Allows precision to be where it is needed on the theta scale.
Persons are scored using maximum likelihood. Uses all the information in the person’s response pattern combined with the item parameters for each item. Result is a theta estimate and a standard error of the estimate, and the likelihood associated with the theta estimate. The theta estimates are independent of the number correct/keyed. They allow for many more distinctions among individuals, allowing for better description of individual differences. Observed standard error of measurement is individualized. Can differ across theta levels and can differ within theta levels. Reflects “person fit” to the model used. Does NOT depend on “reliability” or other persons.
What is an adaptive test?
A test that dynamically adjusts to the trait level of each examinee as the test is being administered.
The First Adaptive Test was Alfred Binet’s IQ Test, developed in France around 1905. Individually administered by a trained psychologist. Later published as the Stanford-Binet IQ Test. Still in use in schools and clinics today. The standard against which IQ tests are compared. Incorporates all the elements of an adaptive test.
What have we learned from Binet’s Test? Is based on a calibrated item bank. Can use a different starting point for each examinee. Uses an adaptive item selection procedure. Uses a scoring method that allows a common score to be obtained from different subsets of items. Can vary in length across examinees by using a variable termination rule. But it is inefficient because it requires that all items be administered at each level; it requires a scoring procedure based on age norming of items; it is expensive because it is individually administered.
What is a CAT?
A Computerized Adaptive Test is a test administered by computer that dynamically adjusts to the trait level of each examinee as the test is being administered.
What are advantages of CAT?
Efficiency: CATs are more efficient than conventional tests. They generally reduce test length by 50% or more.
Control of measurement precision: A properly designed CAT can measure all examinees with the same degree of precision. We refer to this as “equiprecise measurement.”
What are fully adaptive CATs?
Based on item response theory (IRT). Select items one at a time. Use all the information in the examinee’s responses and in each test question/item. Score examinee after each item response. Use unstructured item banks.
What are components of a CAT?
A pre-calibrated item bank
A starting rule for selecting the first item. Can start everyone with the same theta estimate (e.g., theta = 0.0), so everyone gets the same first item. Can assign a random theta estimate within an interval, e.g., between theta = -1.0 and theta = +1.0. Can use prior information available for a given examinee: subjective evaluations (e.g., below average, above average); theta estimates from tests previously administered in the same or a prior test session; theta estimates from the same test administered at a previous time.
A procedure for scoring item responses and estimating trait level. For a single item or a non-mixed response pattern: assign an arbitrary theta estimate (e.g., -3.0 for incorrect and +3.0 for correct, or -1.0 and +1.0), or use a theta estimation method that can produce an estimate from a single response (Bayesian or weighted maximum likelihood). For mixed response patterns, use maximum likelihood or weighted maximum likelihood. The result after every item is a theta estimate and an estimate of the standard error of measurement.
A method of selecting the next item. Use the item information functions for each item in the bank and select the item that has maximum information at the current theta estimate. Referred to as Maximum Information Item Selection.
A rule for ending the test. Fixed standard error of the theta estimate (SEM): useful with a well-developed item bank; an SEM of .25 results in a 95% error-of-measurement band of about ±.50 around a theta estimate. Standard errors are reduced most quickly by highly discriminating items. Alternatively, minimum change in the theta estimate. NOT a fixed number of items!
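A sketch tying these components together into one adaptive loop: everyone starts at theta = 0, items are chosen by maximum information, scoring uses grid-based EAP (which handles single and non-mixed responses), and the test stops at SEM <= .25 or 50 items. The bank, the simulated examinee, and the limits are all invented.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_info_3pl(theta, a, b, c):
    p = irf_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

rng = np.random.default_rng(1)
bank = [(rng.uniform(0.5, 2.0), rng.uniform(-2.5, 2.5), 0.2) for _ in range(200)]
true_theta = 0.8  # simulated examinee, used only to generate responses

grid = np.linspace(-4, 4, 401)
post = np.exp(-0.5 * grid**2)  # start from a standard normal prior

theta_hat, sem, used = 0.0, np.inf, set()
while sem > 0.25 and len(used) < 50:
    # Maximum Information Item Selection at the current theta estimate.
    item = max((i for i in range(len(bank)) if i not in used),
               key=lambda i: item_info_3pl(theta_hat, *bank[i]))
    used.add(item)
    a, b, c = bank[item]

    # Simulate a response, then update the posterior and re-score (EAP).
    u = rng.random() < irf_3pl(true_theta, a, b, c)
    p = irf_3pl(grid, a, b, c)
    post *= p if u else 1 - p
    dens = post / np.trapz(post, grid)
    theta_hat = float(np.trapz(grid * dens, grid))
    sem = float(np.sqrt(np.trapz((grid - theta_hat) ** 2 * dens, grid)))
```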
Summary observations about how CAT works
Differences in theta estimates are largest at the beginning of a CAT and generally decrease as test length increases.
Observed SEM is generally largest at the beginning and decreases as test length increases.
Item responses alternate between correct and incorrect toward the end of a CAT. Proportion correct converges on .50 plus half the guessing probability (e.g., about .60 when c = .20).
Item Bank Usage: Each examinee receives a subset of items adapted to their theta level as it is continuously estimated.
What are some psychological or “public relations” issues associated with CAT?
CAT equalizes the “psychological environment” of the test across ability levels. High ability students will get about 50% correct. Low ability students will get about 50% correct.
Students cannot change their answer to an item once they have submitted it. They should be told in advance of starting the test. Because CAT is “dynamic” it can recover from an occasional error in answering an item. Literature shows little or no gain from answer changing.
Different examinees will get different numbers of items. Results from examinee response patterns, e.g., lack of person fit, and also from item bank characteristics. Examinees need to be educated in advance.
Summary of CAT
CAT is efficient: Typical minimum reduction in test length is 50%, reductions of up to 95% have been observed.
CAT is effective: With a properly designed item bank, it can measure almost all examinees to the same level of precision, assuming that the examinees are responding according to the IRT model used, so that the observed SEM equals the model-predicted SEM.
A classification CAT can classify most examinees with a known and controllable degree of error. Except for those very close to the cut score, but their degree of classification error will be known.
CAT can be implemented with all IRT models: dichotomously scored items, polytomously scored items (personality, attitudes), and unidimensional and multidimensional IRT models.
What is classification testing?
Alternative procedures have been proposed in place of confidence-interval-overlap CAT, primarily “sequential” methods (SCTs, or sequential classification tests), erroneously referred to in much of the literature as CATs. They differ from CATs in that the item bank has a different structure: items are pre-ordered by information at the cut score, all examinees begin with an item at the cut score, and items are administered in a fixed sequence based on their information at the cut score until a classification can be made. The test is ended when an examinee can be classified into one of the criterion categories.
But theta is still estimated by MLE or WMLE (or EAP or MAP)
SPRT vs GLR for CAT
Termination rule: based on whether a classification can be made after each item using likelihoods from the examinee’s likelihood function.
Both methods use an arbitrary “indifference region” around the cut score. A small band on the theta scale around the cut score. If theta estimate is in that region, no classification will be made. Upper bound is above the cut score and lower bound is below the cut score.
Sequential probability ratio test (SPRT): Two likelihoods are obtained from the examinee’s likelihood function: the likelihood of the response pattern at the upper bound of the indifference region, and the likelihood of the response pattern at the lower bound. Then the ratio of these two likelihoods (LR) is computed. Two additional values are specified: A, to determine if an examinee should be classified as below the theta cut score (e.g., 1/9), and B, to determine if an examinee should be classified as above the theta cut score (e.g., 9). If the LR is between A and B, the test is continued. If the LR is less than A, classify as below the cutoff and end the test. If the LR is greater than B, classify as above the cutoff and end the test.
Generalized likelihood ratio (GLR)
As for the SPRT, an indifference region and A and B values are specified. If the theta estimate is within the indifference region, administer another item. Otherwise, three likelihoods are obtained from the examinee’s likelihood function: the likelihood of the response pattern at the upper bound of the indifference region, the likelihood of the response pattern at the lower bound, and the likelihood of the response pattern at the current θ estimate. Then, one of two likelihood ratios (LRs) is computed. If the theta estimate is at or above the upper bound, then the generalized likelihood ratio (GLR) is computed as GLR = likelihood at theta / likelihood at the lower bound. If the theta estimate is at or below the lower bound, then the generalized likelihood ratio (GLR) is computed as GLR = likelihood at the upper bound / likelihood at theta. If GLR is less than A, classify as below cutoff and end the test. If GLR is greater than B, classify as above cutoff and end the test.
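A sketch of both decision rules side by side; the cut score, indifference region, bounds A and B, item parameters, responses, and current theta estimate are all invented for illustration.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def pattern_likelihood(theta, items, responses):
    """Likelihood of the observed response pattern at a single theta value."""
    lik = 1.0
    for (a, b, c), u in zip(items, responses):
        p = irf_3pl(theta, a, b, c)
        lik *= p if u == 1 else 1 - p
    return lik

# Invented cut score, indifference region, and decision bounds A and B.
cut, delta = 0.0, 0.2
lower, upper = cut - delta, cut + delta
A, B = 1 / 9, 9.0

items = [(1.2, -0.5, 0.20), (1.0, 0.1, 0.20), (1.4, 0.3, 0.25)]  # made up
responses = [1, 0, 1]

def classify(lr):
    if lr < A:
        return "below cut"  # classify and end the test
    if lr > B:
        return "above cut"  # classify and end the test
    return "continue"       # administer another item

# SPRT: likelihood at the upper bound over likelihood at the lower bound.
sprt_decision = classify(pattern_likelihood(upper, items, responses)
                         / pattern_likelihood(lower, items, responses))

# GLR: replace the nearer bound with the current theta estimate itself.
theta_hat = 0.35  # assumed current MLE/WMLE estimate
if lower < theta_hat < upper:
    glr_decision = "continue"  # inside the indifference region
elif theta_hat >= upper:
    glr_decision = classify(pattern_likelihood(theta_hat, items, responses)
                            / pattern_likelihood(lower, items, responses))
else:
    glr_decision = classify(pattern_likelihood(upper, items, responses)
                            / pattern_likelihood(theta_hat, items, responses))
```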
Improving predictive validity
In classical test theory, correlational validity is limited by score reliability (a group statistic). Higher reliability means higher (potential) validity; lower reliability means lower (potential) validity. This is the basis of the correction for “attenuation.”
But reliability is not an IRT concept. It is replaced by score (im)precision conditional on a theta estimate, i.e., the observed SEM.
Can observed SEMs be used to enhance prediction?