Measurement final exam - instrument construction




37 Terms

1
New cards

What are the unique characteristics, measurement strategies, and special problems for ability/achievement type variables?

Unique Characteristics: Operationalized by questions/items for which there is a "correct" answer

Measurement Strategies: Test construction methods designed to create "homogeneous" scales: classical test theory methods, item response theory (IRT)

Special Problems: Speed versus power (time limits), changes in the dimensionality of the test with increases in trait level, guessing

2
New cards

What are the unique characteristics, measurement strategies, and special problems for personality type variables?

Unique Characteristics: Operationalized by questions in which the individual describes her/himself in terms of some "typical" performance

Measurement Strategies: Empirical/criterion-oriented test construction methods (designed to predict or classify people, not to measure; rarely used for new instruments but used for classic instruments, e.g., the MMPI); homogeneous test construction methods used in ability measurement (proportion “correct” (item “difficulty”) becomes proportion “keyed”)

Special Problems: response sets (yea-saying), faking, lying, social desirability

3
New cards

What are the unique characteristics, measurement strategies, and special problems for preference/perceptual type variables (e.g., interests, attitudes)?

Unique Characteristics: Individual describes her/his evaluation of some stimulus and reports that evaluation (e.g., attitude toward religion, job, political figures; values; vocational interests and preferences)

Measurement Strategies: Empirical or criterion-oriented test construction methods (primarily for interests); homogeneous scaling methods, e.g., Thurstone scaling, Likert scaling, Guttman scaling; pair comparison methods; multidimensional scaling

Special Problems: Same as personality measurement: response sets (“yea-saying”), faking, social desirability. Normative (inter-individual) versus ipsative (intra-individual)

4
New cards

What are three examples of how empirical keying or criterion-oriented scale construction was used in early test development?

Minnesota Multiphasic Personality Inventory (MMPI), Strong Vocational Interest Blanks, Career Assessment Inventory

5
New cards

What are the steps for empirical keying of criterion-oriented scale construction?

  1. Begin with a heterogeneous set of items that are relevant for the purpose of the instrument (e.g., items for the MMPI, vocational interest inventories)

  2. Identify a “criterion group” (a group of people who have a specified characteristic that you want to use to classify other people: e.g., diagnostic classification, occupational group)

  3. Identify a “reference group” (not the criterion group)

  4. For each criterion group (one at a time), do an “empirical keying item analysis”: tabulate response frequencies for each item separately for criterion and reference group; examine relationship between item response and group membership. Repeat this process for every item and a given criterion group.

  5. Apply the scoring key to all members of the criterion group and all members of the reference group. Result is a score for each person. Score represents how similar the person’s response is to the responses of the criterion group.

  6. Validate the scores on that scale by comparing the scores of the criterion group to the scores of the reference group.

  7. Repeat the entire process for each new criterion group.
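Steps 4 and 5 above can be sketched in Python. Everything here is hypothetical: the tiny 0/1 response matrices, the 0.20 endorsement-difference threshold, and the function names are illustrative, not a standard implementation.

```python
# Sketch of an empirical keying item analysis (hypothetical data).
# For each item, compare the endorsement proportion in the criterion
# group to that in the reference group; items whose proportions differ
# enough enter the scoring key, keyed in the criterion group's direction.

def build_key(criterion, reference, threshold=0.20):
    """criterion/reference: lists of 0/1 response vectors (one per person).
    Returns a dict item_index -> keyed response (1 or 0)."""
    n_items = len(criterion[0])
    key = {}
    for j in range(n_items):
        p_crit = sum(person[j] for person in criterion) / len(criterion)
        p_ref = sum(person[j] for person in reference) / len(reference)
        diff = p_crit - p_ref
        if abs(diff) >= threshold:          # item separates the groups
            key[j] = 1 if diff > 0 else 0   # key toward the criterion group
    return key

def score(person, key):
    """Number of key-matching responses: similarity to the criterion group."""
    return sum(1 for j, keyed in key.items() if person[j] == keyed)

criterion = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]]
reference = [[0, 1, 1, 0], [0, 1, 1, 1], [1, 1, 0, 0]]
key = build_key(criterion, reference)
print(key)                                    # keyed items and directions
print([score(p, key) for p in criterion + reference])
```

With real data the threshold would be replaced by a statistical test of the item-response-by-group relationship, but the logic is the same: the score counts agreements with the criterion group's characteristic responses.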

6
New cards

What are some issues with empirical keying or criterion-oriented scale construction?

This is not measurement in the usual sense; it is only empirical prediction used to make a dichotomous classification. Scores should not be interpreted as measuring a construct. E.g., the MMPI Depression scale does not measure “depression”; it reflects the similarity of an individual's response pattern to the responses given by a group diagnosed as depressed. The scoring key capitalizes on chance characteristics of the samples of people in the two groups, so cross-validate: apply the scoring key to new criterion and reference groups and examine overlap and misclassification rates in the new groups.

7
New cards

What are the steps for constructing measuring instruments for ability/achievement variables using classical test theory?

  1. Write items that are assumed to measure the trait. Carefully define the trait, i.e., articulate its theory. Content validity: Is it unidimensional or multidimensional? Heterogeneous or homogeneous? Select type(s) of item: free response, constructed response, open-ended? Multiple-choice – how many options? True-false – not a good idea. Technology enhanced items (if computer administered).

  2. Administer the items to an appropriate group of examinees. Should be representative of the target group of examinees. Need at least N = 50; 100 or more is better.

  3. Score the items and examinees and do an item analysis. Determine for each item and for each item alternative ("distractor") in multiple-choice tests: item difficulty (proportion correct) and item discrimination (point-biserial correlation). Calculate the internal consistency reliability (e.g., alpha or Hoyt’s) for the full set of items. Eliminate items for which a non-correct alternative performs better than the correct alternative: the correct/keyed alternative should correlate positively with total score, and incorrect alternatives should correlate negatively with total score. Eliminate items with the lowest discriminations and recompute reliability. Eliminate items with extreme difficulties, keeping items with difficulties nearest to .50, and recompute reliability.

  4. If reliability is "satisfactory" (.90 or above?) at any previous step, define the final test as comprised of the surviving items.

  5. If reliability is not satisfactory, try adding in some items with difficulties that deviate a little more from .50, and recompute reliability. If this does not increase reliability, write more items, and repeat the above steps beginning at Step 2 (pretest new items along with surviving items).

8
New cards

When constructing measuring instruments for ability/achievement variables using classical test theory, what should final tests have?

High reliability, items with relatively high discriminations (point-biserial r), items around .50 difficulty (proportion correct)

9
New cards

For test and item analysis with classical test theory for dichotomous items, how is difficulty defined?

Item Difficulty (“p-value”) of a test item is defined as the proportion of individuals answering the item correctly, p. It should really be called an item "easiness" index, because high values indicate an easy item: p = .90 indicates that 90% of individuals answered the item correctly. Low values indicate a difficult item: p = .20 indicates that only 20% were able to answer the item correctly. The proportion of individuals answering an item incorrectly is defined as q = 1.0 - p. Item difficulty is sample dependent: for a given item, a high-ability sample will yield a high p (the item appears easy), and a low-ability sample will yield a low p (the item appears difficult).

10
New cards

For test and item analysis with classical test theory for dichotomous items, how is item variance defined?

The variance of a test item is defined as pq. Other things being equal, items with an item difficulty near p = .50 will help maximize the reliability of the test, because p = .50 is the difficulty value that maximizes item variance and therefore maximizes total score variance (internal consistency reliability is a function of total score variance).
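These two definitions can be sketched directly from a response matrix. The tiny 0/1 matrix below is hypothetical; rows are persons, columns are items.

```python
# Item difficulty (p) and item variance (pq) for dichotomous items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
n = len(responses)
for j in range(3):
    p = sum(row[j] for row in responses) / n   # proportion correct
    q = 1.0 - p
    print(f"item {j}: p = {p:.2f}, q = {q:.2f}, variance pq = {p*q:.4f}")
# pq is maximized at p = .50 (variance .25), which is why items near .50
# difficulty help maximize total-score variance and hence reliability.
```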

11
New cards

For test and item analysis with classical test theory for dichotomous items, how is item discrimination defined?

Item discrimination is the capability of a test item to differentiate between examinees of varying levels of the trait/characteristic being measured.

12
New cards

For test and item analysis with classical test theory for dichotomous items, what is a “classical” index of discrimination, and what is a better index of discrimination?

“Classical” index of discrimination: Identify the highest- and lowest-scoring examinees (grouped by "ability"), usually the upper and lower 27%; compute the proportions correct (p-values) in each of these two ability groups. The index of discrimination is the difference between the p-values of the high and low groups.

A better index of discrimination: The point-biserial correlation between each person’s score on each item and that person’s total score.
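The point-biserial is just the Pearson correlation between a 0/1 item score and the total score, so it can be sketched with the standard library. The response matrix is hypothetical.

```python
# Point-biserial item discrimination: the Pearson correlation between
# a dichotomous item score and the total test score.
from statistics import mean, pstdev

def point_biserial(item, totals):
    mx, my = mean(item), mean(totals)
    cov = mean((x - mx) * (y - my) for x, y in zip(item, totals))
    return cov / (pstdev(item) * pstdev(totals))

responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
totals = [sum(r) for r in responses]
for j in range(3):
    item = [r[j] for r in responses]
    print(f"item {j}: r_pb = {point_biserial(item, totals):.3f}")
```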

13
New cards

What are some issues in classical test and item analysis?

Point-biserial correlation with total score is proportional to average item correlation. By selecting items with the highest item discrimination values (point-biserial r) you can maximize reliability.

For each alternative/option/distractor, not just the correct (keyed) response, compute the proportion endorsing the option and its point-biserial correlation with total score; for incorrect options, the correlation should be negative.

Spuriousness, because each item contributes to the total score: the item discrimination index (point-biserial) is inflated, especially for short tests, so a statistical "correction for spuriousness" should be used for test lengths of 25 or fewer items.
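One common way to correct for spuriousness (an assumption here, since the card does not name the specific correction) is the item-rest correlation: correlate each item with the total score minus that item, so the item no longer contributes to the score it is correlated with. Data below are hypothetical.

```python
# Corrected item-total ("item-rest") correlation: the item is removed
# from the total before correlating, avoiding the spurious inflation.
from statistics import mean, pstdev

def corrected_item_total(item, totals):
    rest = [t - x for x, t in zip(item, totals)]   # total minus the item
    mx, my = mean(item), mean(rest)
    cov = mean((x - mx) * (y - my) for x, y in zip(item, rest))
    return cov / (pstdev(item) * pstdev(rest))

item = [1, 1, 1, 0]
totals = [3, 2, 1, 0]
print(round(corrected_item_total(item, totals), 3))  # lower than the raw r_pb
```

For this item the corrected value is noticeably smaller than the uncorrected point-biserial, illustrating how strong the inflation can be on a short test.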

Person scores: typically number correct, which assumes all items are of equal importance; scores are item dependent: different sets of items can result in different scores for a person.

Item statistics: are group dependent, both difficulty and discrimination, are dependent on sample size.

Reliability is item dependent and is sample dependent.

Standard error of measurement is based on reliability and within group standard deviation, is both item and group dependent, is a constant for all values of total score.

14
New cards

In classical item analysis, what characterizes good items?

The correct answer correlates positively and highly with total score, all incorrect answers are selected by some examinees, all incorrect answers correlate negatively with total score, and, for classically built tests, item difficulty is around .50.

15
New cards

What are the steps for constructing a Thurstone Equal-Appearing Intervals Scale?

  1. Define a continuum along which the attitude to be measured can vary e.g., 1 = unfavorable, 11 = favorable

  2. Write a series of items designed to reflect different levels of the attitude on that continuum, e.g., Measuring attitudes toward multiple-choice tests. Item 1: Multiple-choice tests are fair measures of an individual's knowledge. Item 2: Multiple-choice tests are unfair measures of an individual's knowledge. Item 3: Multiple-choice tests should be used in all graduate-level courses. Item 4: Multiple-choice tests should not be used in any graduate-level courses. Item 5: Multiple-choice tests are about as good as any other.

  3. Select a group of judges who are asked to rate each item as to its location on the continuum defined in #1.

  4. Tabulate the ratings made by the judges

  5. Convert the frequencies in #4 to cumulative proportions

  6. Determine the scale value of each item: Find the median of the cumulative proportion distribution. The rating associated with it is the item's scale value (the median is indicated above by a solid circle). Find the inter-quartile range of the cumulative proportion distribution, i.e., the values of the 25th and 75th percentiles (indicated by the dotted circles).

  7. Select items to comprise the measuring instrument: Select a set of items that represent all levels of the attitude continuum. If two (or more) items appear at the same point on the continuum, select the item with the lowest inter-quartile range.

  8. Construct the measuring instrument using the selected items.

  9. Ask your respondents to “Agree” or “Disagree” with each item.

  10. Score people by determining the median scale value of the items that they marked "agree."

  11. Place individuals on the attitude continuum according to their score. Check for the "relevance" of each item by determining the interquartile range of the total scores for people who endorsed each item. Eliminate those items that have higher interquartile ranges.
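The scoring rule in Step 10 can be sketched in a few lines. The scale values below are hypothetical stand-ins for judge-derived values on the 1-11 continuum.

```python
# Scoring a Thurstone equal-appearing intervals scale: a person's score
# is the median scale value of the items they marked "agree".
from statistics import median

scale_values = {   # item -> judged location on the 1 (unfavorable) to 11 (favorable) continuum
    "item1": 9.2, "item2": 2.1, "item3": 10.4, "item4": 1.5, "item5": 6.0,
}
endorsed = ["item1", "item3", "item5"]            # items marked "agree"
score = median(scale_values[i] for i in endorsed)
print(score)                                      # a favorable attitude
```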

16
New cards

What are some issues with equal-appearing intervals scales?

One of the earliest scaling methods; no longer in use

The use of judges to obtain scale values is problematic. Judges can provide biased ratings.

The assumption of equal intervals can be questioned.

Is susceptible to all the problems of self-report in personality measurement.

But has characteristics that foreshadow modern measurement methods: Items have “scale values” that order them onto a continuum with arbitrary equal intervals. People are scored using the scale values of the items that they endorsed. People and items are on the same continuum.

17
New cards

What are some characteristics of Guttman scaling?

Originally called “scalogram analysis.” Simultaneously scales persons and items. Can be used with dichotomous or polytomous items. In its “pure” form it is deterministic: it does not allow for error. In its typical application, a coefficient of reproducibility reflects how well real data approximate the ideal scale.

18
New cards

What are the steps to creating a Guttman scale?

  1. Write items to span the range of the attitude

  2. Administer to people whose attitudes are to be measured, asking them to agree or disagree with each item (or reply yes or no)

  3. Score the items 1 = Yes/Agree, 0 = No/Disagree

  4. Record item responses in a data matrix of people by items

  5. Calculate person total scores and item total scores

  6. Order items by item score

  7. Order persons by total score
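The sorting steps above can be sketched as follows, using a hypothetical 0/1 response matrix. A perfect Guttman scale produces a triangular pattern after sorting.

```python
# Arrange a 0/1 response matrix for Guttman scalogram analysis:
# sort items by item score and persons by total score.
responses = {            # person -> responses to items A..D
    "p1": [1, 1, 1, 0],
    "p2": [1, 0, 0, 0],
    "p3": [1, 1, 1, 1],
    "p4": [1, 1, 0, 0],
}
items = ["A", "B", "C", "D"]
item_scores = [sum(r[j] for r in responses.values()) for j in range(4)]
item_order = sorted(range(4), key=lambda j: -item_scores[j])   # most endorsed first
person_order = sorted(responses, key=lambda p: -sum(responses[p]))
for p in person_order:
    print(p, [responses[p][j] for j in item_order])   # triangular if scalable
```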

19
New cards

What is a perfect Guttman scale?

The total person score allows perfect reproducibility of the response pattern for each person. A given total score can be achieved in only one way. The total item score allows perfect reproducibility of the response pattern for each item. A given item score can be achieved in only one way. A diagonal line can be drawn that separates all the 1s from all the 0s, with no errors.

20
New cards

What is a more typical result of a Guttman scale?

Zeroes to the left of the line represent deviations from perfect reproducibility and are, therefore, errors. Determine the fit of the data to a perfect scale with the coefficient of reproducibility: reproducibility = 1 - (number of errors / total number of responses).
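The reproducibility calculation can be sketched as below. This uses one common error-counting convention (Goodenough-Edwards style: mismatches against the ideal pattern implied by each person's total score); the data matrix is hypothetical, with items already ordered from most to least endorsed.

```python
# Coefficient of reproducibility for a Guttman scale.
def reproducibility(matrix):
    """matrix: list of 0/1 rows, columns ordered by item popularity."""
    k = len(matrix[0])
    errors = 0
    for row in matrix:
        t = sum(row)
        ideal = [1] * t + [0] * (k - t)    # perfect-scale pattern for score t
        errors += sum(o != i for o, i in zip(row, ideal))
    return 1 - errors / (len(matrix) * k)

matrix = [[1, 1, 1, 1],
          [1, 1, 1, 0],
          [1, 0, 1, 0],    # deviates from ideal [1, 1, 0, 0]: 2 errors
          [1, 0, 0, 0]]
print(reproducibility(matrix))   # 1 - 2/16 = 0.875
```

Values of .90 or above are conventionally taken to indicate an acceptable approximation to a Guttman scale.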

21
New cards

How is a Guttman scale similar to and different from a Thurstone’s equal appearing intervals scale?

Different: Does not use independent judges, simultaneously scales people and items

Similar: Both use dichotomous response to measure people. Both place people on the scale relative to items. Both provide a means to examine performance of items. Both are precursors to modern measurement theory and methods.

22
New cards

What are the steps to constructing a Likert scale?

  1. Define a continuum for measuring attitude, e.g., strongly disagree, disagree, undecided, agree, strongly agree. Wide variety of options for the rating scale: Bipolar versus unipolar, label end points only, different numbers of points/options on the scale.

  2. Write a series of items to reflect different levels of the attitude.

  3. Administer the items to a group of people whose attitudes you want to measure. Assume that there is a range of individual differences in the group. Each person selects one response option for each item.

  4. Tabulate the response frequencies for each item and compute the cumulative proportion endorsed for each item

  5. For each item, determine the cumulative proportion below each category plus one-half the proportion within that category (i.e., the midpoint of each category's cumulative interval). Convert these values to z values based on the cumulative normal distribution. These represent scale values for each rating category/option of each item.

  6. Transform scale values if desired: To eliminate negative numbers, subtract the lowest from the others, e.g., Z- transformed scale values; eliminate decimal points by rounding to nearest integer.

  7. Result is a set of weights for each category response for each item

  8. Score each person by summing the weights for the responses given to each item. The resulting scores can be used to order persons. The average score (i.e., total score divided by the number of items) gives the average adjusted or integer scale value of the item categories selected by the individual.
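Steps 4 and 5 can be sketched with the standard library's normal distribution. The endorsement frequencies for the five response options are hypothetical.

```python
# Likert scaling of response categories: convert each category's midpoint
# cumulative proportion to a z value under the cumulative normal.
from statistics import NormalDist

freqs = [10, 20, 30, 25, 15]        # SD, D, U, A, SA endorsements for one item
n = sum(freqs)
props = [f / n for f in freqs]
cum_below = 0.0
scale_values = []
for p in props:
    midpoint = cum_below + p / 2    # cumulative proportion below + half the category
    scale_values.append(NormalDist().inv_cdf(midpoint))
    cum_below += p
print([round(z, 3) for z in scale_values])
# To eliminate negatives, subtract the lowest value from each (Step 6).
```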

23
New cards

Describe inappropriate Likert rating scale development

Adjusted scale values and integer scale values were a convenience in pre-computer days. Likert did a small study in the 1930s or 1940s demonstrating, on one dataset, that assigning arbitrary integer weights to the response categories (e.g., 1, 2, 3, 4, 5) did not affect the internal consistency reliability of some Likert scales. From then on, almost everyone assigned arbitrary integer weights to the response categories! This is completely incorrect: it ignores the data that go into calculating the scale values for the response options, e.g., a set of response options integer-weighted 1, 2, 3, 4, 5 when the empirical data would result in options weighted 0, 1, 1, 2, 3 or 0, 2, 2, 2, 2.

24
New cards

What are the steps to item analysis for developing a Likert scale?

  1. Assign arbitrary sequential integer weights to the response options of each item (e.g., 1, 2, 3, 4, 5)

  2. Score each person using those weights, i.e., compute a mean summed score (average rank choice) for each person

  3. For each option of each item, compute mean of the mean scores for each person who selected that response = mean score for each item response (MSIR)

  4. Replace original arbitrary weights with MSIR for each option of each item (retaining the number of people on which each MSIR is based)

  5. Rescore each person using the MSIR weights, and average each person’s score

  6. Using the MSIR weight matrix, compute an eta coefficient (correlation ratio) for each item. Remove items with low etas. Re-compute internal consistency reliability after removing each item; reliability should increase as poor items are removed. Stop removing items when reliability stabilizes.
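Steps 1 through 4 can be sketched as follows. The response matrix (persons by items, options coded 0-4) is hypothetical.

```python
# MSIR item analysis: replace arbitrary integer option weights with the
# mean (arbitrary-weight) person score of everyone who chose each option.
from statistics import mean

responses = [            # rows = persons, entries = option chosen per item
    [4, 3, 4],
    [3, 3, 2],
    [1, 2, 1],
    [0, 1, 0],
]
person_means = [mean(r) for r in responses]   # arbitrary-weight mean scores

n_items = len(responses[0])
msir = []
for j in range(n_items):
    weights = {}
    for opt in set(r[j] for r in responses):
        chosen_by = [person_means[i] for i, r in enumerate(responses) if r[j] == opt]
        weights[opt] = mean(chosen_by)        # MSIR for this option of this item
    msir.append(weights)
print(msir)   # empirical weight matrix, one dict of option weights per item
```

Rescoring each person with these MSIR weights (Step 5) then uses the same summing-and-averaging logic as the original arbitrary weights.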

25
New cards

What are some advantages of the method of constructing Likert scales?

Eliminates the need to impute missing responses. Person scores and item scoring weights are on the same scale. Identifies items that are not correlated well with total score. Provides empirical item scoring weights that reflect actual relationships with total score.

26
New cards

How are Likert scales different from and similar to Thurstone’s Equal Appearing Intervals scales?

Different: does not use independent judges, scales people and items on the same scale, operates with graded polytomous items, not dichotomous items

Similar: results in scale values, but for item response options, like Guttman scales, aspects of it are precursors to modern measurement theory, is a two-step procedure (Step 1: “Scale values” are estimated for each response option, Step 2: Persons are scored using the scale values)

27
New cards

How are Likert scales different from Guttman scales?

Likert scores are not a precursor to item response theory

28
New cards

How does the method of pair comparisons differ from Thurstone, Guttman, and Likert scaling?

They all scale people on a single variable. Their scores are all normative. Individuals are scaled on the dimension relative to other individuals. Pair comparisons scales a number of stimuli on a single preference (or other) dimension. When used to scale a single person, scores are “ipsative” not “normative.” Ipsative = intraindividual. But with modifications, can be made normative. Can also be used to measure a group. Results are then within-group ipsative (i.e., non-normative). But with appropriate modifications can be made group normative. Has a built-in “validity” type of index.

29
New cards

What is pair comparison scaling?

Requires a set of stimuli that can be scaled on a single dimension (example: Scaling flavors of ice cream). Practical problem: Want to know how much of a new flavor to make and distribute to stores.

30
New cards

What is the data collection process for pair comparisons scaling?

  1. Define a dimension on which to scale your stimuli: in this case, preference for flavors of ice cream.

  2. Pair each stimulus with each other stimulus. The number of pairs is [k(k-1)]/2, where k is the number of stimuli. Randomize order within pairs and between pairs.

  3. Ask respondents to choose one within each pair “Which flavor do you prefer?” (e.g., chunky monkey or cherry garcia)

  4. Give your pair comparisons questionnaire to a sample of your target group of people.

  5. Using each person's choices for each pair of stimuli, aggregate across people and obtain the Frequency Matrix, or the number of times the column stimulus was chosen over the row stimulus.

  6. Convert the Frequency Matrix to a Proportion Matrix, by dividing each entry in the Frequency Matrix by N, the number of respondents (94 in this example). Put .50 in the diagonal entries. Sum the columns.

  7. Convert the Proportion Matrix to a z Matrix, using the z value equivalent of each proportion from the cumulative normal distribution. Compute the scale value of each stimulus as the average of its z values. If desired, eliminate negative values by subtracting the lowest scale value.
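Steps 5 through 7 can be sketched end to end. The frequency matrix below is hypothetical (3 stimuli, N = 20 respondents), with entry [i][j] giving the number of times column stimulus j was chosen over row stimulus i.

```python
# Pair comparisons scaling: frequency matrix -> proportion matrix ->
# z matrix -> scale values (column means of z values).
from statistics import NormalDist

N = 20
freq = [[ 0, 14, 18],
        [ 6,  0, 12],
        [ 2,  8,  0]]
k = len(freq)
# Proportion matrix, with .50 in the diagonal entries
prop = [[freq[i][j] / N if i != j else 0.50 for j in range(k)] for i in range(k)]
# z value equivalent of each proportion from the cumulative normal
z = [[NormalDist().inv_cdf(prop[i][j]) for j in range(k)] for i in range(k)]
# Scale value of each stimulus = average of its column's z values
scale = [sum(z[i][j] for i in range(k)) / k for j in range(k)]
low = min(scale)
print([round(s - low, 3) for s in scale])   # shifted so the lowest value is 0
```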

31
New cards

What are some properties of pair comparisons scaling for group-scaled data?

The scale values are anchored within groups. A specific numerical value within a group is not comparable to the same value in another group, and the zero point within a group is not comparable to the zero point in another group. The ordering of scale values can be compared between groups, and the differences between scale values within groups can also be compared between groups. But one cannot, from a comparison of scale-value rankings between groups, conclude anything about the absolute level of preferences.

32
New cards

How can pair comparisons scaling be applied to a single person?

In this case the measurements become ipsative. Ipsative = intra-individual, in contrast with normative, which means inter-individual. Each person answers exactly the same number of pairs of choices, so the only difference among people is the number of times they select a given stimulus over the other stimuli. E.g., for seven stimuli, each person makes 21 choices; all that differs among people is the distribution of their choices among the seven stimuli, i.e., which they preferred most, which they preferred least, and the frequencies of choice for the stimuli in between. Interpretation of results between people is limited to the interpretations appropriate for group-scaled data.

33
New cards

Describe pair comparisons scaling with a “psychological zero” point.

Adding a perceived or “psychological” zero point can convert ipsative scores to normative scores. It can be applied to a single person or aggregated across a group. In addition to the comparative judgments (e.g., chunky monkey or pistachio pistachio), add a series of “absolute judgment” questions, e.g., “Indicate whether you like or do not like each:” chunky monkey: yes or no. Add another row and column to the response matrix. For each “no” response, record a “1” in that column and a zero in the row for the zero point; a “1” indicates that the zero point was chosen over the stimulus. Scale the expanded response matrix as before, but now adjust the scale values by subtracting the zero-point value.

34
New cards

How should the zero point be interpreted for pair comparisons?

The zero point can be interpreted as a “threshold”: the point at which positive preference turns to negative preference. Stimuli above zero (positive scale values) are preferred; stimuli below zero (negative scale values) are not preferred; stimuli with a 0.0 scale value are “ambivalent,” neither preferred nor not preferred. Scale values of other stimuli represent distances from the threshold. By adding the zero point, scale values can be compared across people: they are “anchored” by each individual’s threshold. Absolute judgment data can also be accumulated at the group level and a group threshold scaled, making normative comparisons between groups possible.

35
New cards

What is an intra-individual “validity” index for pair comparisons?

Any complete pair comparisons design allows the calculation of “circular triads.” A triad is a set of choices among three stimuli. Triads can be consistent, non-circular, and logical, or inconsistent, circular, and illogical. Circular triads reflect response inconsistency: illogical, invalid sets of responses. They can result from inattention (intentional or unintentional), lying, faking, confusion, or inability to make choices because the stimuli are not perceived as different. The number of circular triads can be computed from a person’s number of choices for each stimulus, and the circular triad score has a known distribution under random responding.
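The count of circular triads can be computed directly from the choice totals. The formula below is Kendall's (an assumption, since the card does not name it): c = k(k-1)(2k-1)/12 - (1/2) Σ aᵢ², where aᵢ is the number of times stimulus i was chosen across all pairs.

```python
# Number of circular triads from a person's per-stimulus choice counts.
def circular_triads(choice_counts):
    k = len(choice_counts)
    # The expression is always a non-negative integer mathematically,
    # so round() just cleans up floating-point arithmetic.
    return round(k * (k - 1) * (2 * k - 1) / 12
                 - sum(a * a for a in choice_counts) / 2)

# Perfectly transitive ordering of 4 stimuli (counts 3, 2, 1, 0): no circles
print(circular_triads([3, 2, 1, 0]))   # 0
# One circular triad among 3 stimuli (A>B, B>C, C>A): each chosen once
print(circular_triads([1, 1, 1]))      # 1
```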

36
New cards

Describe multidimensional scaling as it applies to pair comparisons methods.

Not all stimuli can be scaled on a single dimension. For a person (or a group) presented with subsets of stimulus pairs, some subsets are “too close” together to differentiate on that dimension; this might be evidenced by circular triads among the subset. So it is necessary to see whether they can be differentiated on another dimension.

Example: Scaling flavors of ice cream: might need a chocolatey, a nutty and a fruity dimension. Multidimensional scaling is a set of procedures designed to scale preferences on multiple dimensions.

37
New cards

Compare and contrast Equal-appearing intervals (Thurstone) scaling, Guttman scaling, Likert scaling, and Pair comparisons scaling at a high level.

Equal-Appearing Intervals (Thurstone): The first scaling method. Scales one variable with multiple items. Judges scale stimuli and provide “scale values.” Persons are scaled relative to the item scale values.

Guttman scaling: Scales one variable with multiple items. Scales items and persons at the same time.

Likert scaling: Scales one variable with multiple items. Scales items and persons sequentially. Uses multicategory ordered item responses.

Pair comparisons scaling: Scales multiple stimuli on a single dimension. Can scale with or without a zero point (threshold).