Measurement test 2


64 Terms

1

what are the goals that make a good scale?

  1. A clear purpose

  2. Reliability (consistency, less random error)

  3. Validity (accuracy in measuring the intended target)

2

what strategies make a good scale?

a. clarity of items

b. specific to a single idea (e.g., you usually don't want to see "and," "either," or "or" in an item)

c. gets at all the facets/domains

d. get a variety of responses

e. avoid jargon, negatives

f. have differently worded items

g. appropriate reading level

h. try to avoid bias

3

steps for developing a summated rating scale

define construct → design scale → pilot test → administration and item analysis → validate and norm

(with feedback loops: pilot testing feeds back into scale design, and administration/item analysis feeds back into the construct definition)

5

how to define something invisible or abstract

  1. What is the context in which this construct exists? 

  2. Inductive approach: Give as clear of a definition as possible before you create items. 

  3. Read well. 

  4. How narrow or broad of a concept do you want to measure?

6

questions to get a good definition

  1. What is the purpose of the scale?

  2. What is the construct?

  3. What definitions do you want to adopt from the literature? 

  4. Any good measures already? 

  5. Are there any gaps in the literature?

7

Bandalos's 15 steps?

  1. Determine item format. 

  2. Develop a blueprint for your test objectives. 

  3. Create initial candidate item pool. 

  4. Review items (with multiple people).

  5. Large pilot test of items - initial analysis. 

  6. Larger analysis of items. 

  7. Revise items.

  8. Calculate reliability of scale. 

  9. Conduct a test with a 2nd sample. 

  10. Repeat steps 3-9 if needed.

  11. Test the evidence of validity. 

  12. Prepare guidelines for administration.

8

how do we get there?

1) Teamwork makes the dream work!

  • Collaborate, take perspective

  • Reward disagreements.

  • Expert involvement

2) read

  • Adopt a good definition.

  • Clarify the strengths and gaps in the literature. 

3) Make your blueprint: 

  • Definition of what you want to measure. 

  • Objectives of your measure.

  • Choose appropriate response choices

    • Agreement, frequency, evaluation

    • Spacing

    • Bipolar/unipolar?

    • Number of choices (forced choice? If Likert scale, 5-9 choices?)

  • Types of items

9

what to read?

  1. Peer-Reviewed Measure Development Articles

  2. Peer-Reviewed Measure Review Articles

  3. Peer-Reviewed Literature Reviews

  4. Monographs by leading authors in a field

10

where to read?

  1. Google Scholar, Google

  2. Library

    1. PsycINFO

    2. ProQuest

    3. PubMed

    4. PsycArticles

    5. InterLibrary Loan

    6. IPIP 

    7. *Mental Measurements Yearbook

    8. *PsycTESTS

    9. *Tests in Print

  3. Email/ResearchGate - requests

11

measuring something cognitive

  • Recall/demonstrate knowledge of something

  • Show understanding or comprehension of something

  • Show application of knowledge

  • Require analytical skills

  • Require skills of synthesis of information into something coherent/directed (or, comparisons and making inferences) 

  • Require skills at evaluation/making judgments

12

measuring something affective

  • Receiving - willing to pay attention 

  • Responding - endorsement of an opinion 

  • Commitment - Voicing a strong opinion or intention to change something 

  • Organization - Demonstrate a change in opinion or attitude

  • Characterization - Demonstrating commitment to a change via habits and development of traits/a worldview

13

speed in measuring achievement and aptitude:

quickly respond to the task within a time limit

14

power in measuring achievement and aptitude:

respond to items that increase in difficulty

15

the next steps with your blueprint

  • Generate good items

  • Design your responses

    • Choose format

    • Choose rater of items (Self or other or both)

    • Clear instructions

    • Defined population

  • Develop a strategy for how you could evaluate or test your measure

    • Item analysis (e.g., Internal Consistency, Factor Analysis, Item-Response Theory)

    • Criterion-Related Validity 

    • Generalizability to certain populations (Measurement Invariance)

16

tips for developing good Likert items

  • Consistent item stems

  • Items with similar lengths

  • Task of item is clear 

  • Good grammar

  • Items adequately get at the domain of the topic you want to cover. 

  • Items clearly express only one idea. 

  • Use positively worded items, but have reverse items too. 

    • Reverse items should not rely on negative words like “no” or “not”

  • Who is your language accessible to? 

    • Jargon? Colloquialisms? 

    • Think of the reading level of your participants. 

  • Keep items present or future-oriented.

  • Avoid facts.

  • Choose items people would have a dispersion of scores on across the range of scores.

  • Keep items short.

  • Be careful with adverbs like “usually,” “only,” “just”

17

cognitive items

  • True/False

  • Matching

  • Multiple Choice

  • Checklists

  • Short answer

  • Completing a sentence

  • Performance Task

18

non-cognitive items

  • Thurstone Scaling

    • Items with interval scales are ranked by multiple judges from least favorable to most favorable (an 11-point scale), and the final measure includes items representative across the scale, weighted according to the rankings. Takes a long time to do.

  • Guttman Scaling

    • Deterministic method: Items level up in a hierarchy from less extreme or likely to most extreme or likely. Hard to get right. 

  • Likert Scaling

    • Summative method (add up scores on items). Very common, low cost.

19

what do we mean by cognitive?

As shorthand, with cognitive, think of conscious intellectual activities that require effort and knowledge, such as memory, problem-solving, and reasoning. IQ tests fall here.

20

what do we mean by non-cognitive?

For non-cognitive, think of other mental aspects like attitude, affect, motivation, traits, skills, or behaviors. Personality tests fall here.

21

what comes after creating candidate items?

Design: 

  • Clear Instructions

    • Common frame of reference for participants to respond

  • Choose an appropriate rating scale

    • Agreement? Frequency? Evaluation? - Spector, p. 19

  • How will participants take your measure?

Review

  • Share, get feedback, revise

Pilot 

  • Get feedback from a sample of people on items

Data Collection:

  • IRB approval of a research proposal 

  • Collect data in accordance with IRB

  • Initial item analysis

More Data Analysis:

  • Mean, Range, Standard Deviations…

  • Reliability analyses

  • Validity analyses

22

issues to consider

What is getting measured: Person, Situation, or Person x Situation?

  • Level of effort (satisfice vs. optimize; Krosnick, 1991)

  • Level of enjoyment of the survey, motivation to take the survey

  • Participant’s understanding/interpretation of item

  • How easy is it for the participant to generate a response to the item?

  • How appropriate are the response choices across all items?

  • To what degree can participants edit responses?

  • How good is the scale in terms of its conversational utility?

    • Amount of information

    • Quality of information

    • Relevance of information

    • Clarity of Information

23

examples of response distortion

  • Socially Desirable Responding

  • Acquiescence

  • Malingering

  • Extreme Response (or consistently neutral response)

24

why go through this process of scale development?

  1. Advances scientific understanding in our field. (scientific advancement) 

  2. Helps determine to whom and how measures should be applied to inform interventions, accommodations, or decisions. (inform practitioners)

  3. Learning to do measurement analysis and development will open doors professionally. (open professional doors)

25

item-remainder coefficient

Correlation of an item with the total score of the other items (i.e., the total minus the specific item being examined). r < .35 may be a good rule of thumb for removing an item, but run this over multiple iterations, because the coefficients change each time you take out an item.
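A minimal sketch of this check, assuming pandas and hypothetical item data (the names q1-q3 and values are made up):

```python
import pandas as pd

# Hypothetical 1-5 Likert responses: rows = respondents, columns = items
items = pd.DataFrame({
    "q1": [4, 5, 3, 4, 2, 5, 4, 3],
    "q2": [4, 4, 3, 5, 2, 5, 4, 2],
    "q3": [3, 2, 4, 3, 3, 2, 4, 3],  # likely a bad item
})

total = items.sum(axis=1)
for col in items.columns:
    remainder = total - items[col]      # total score minus the item itself
    r = items[col].corr(remainder)
    flag = "  <- candidate for removal (r < .35)" if r < 0.35 else ""
    print(f"{col}: item-remainder r = {r:.2f}{flag}")
```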

26

non-cognitive steps after initial item pool - 1st step

  1. Collect a sample of at least 200 people 

    1. Reliability Analysis

      1. Check data/scoring of items

      2. Look at item-remainder coefficient

      3. Check internal consistency (e.g., alpha; see the sketch after this list)

        1. Overall score: >.70?

        2. Check internal consistency if item deleted

          1. Improve or get worse?

    2. Exploratory Factor Analysis 

      1. How many factors? 

        1. Unidimensional or multidimensional?

      2. How are they related? 

        1. Orthogonal (not correlated) or oblique (correlated)?

      3. How many items? 

        1. Convergent validity

        2. Caution - need enough items that cover various aspects of the construct and don’t just say the same thing. 

    3. Descriptive Statistics (Mean, SD, Skew, Kurt)
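A minimal numpy sketch of the reliability piece above (overall alpha and alpha-if-item-deleted), with hypothetical data; the >.70 benchmark is the one named in the list:

```python
import numpy as np

def cronbach_alpha(X: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)       # per-item variances
    total_var = X.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 8 respondents x 4 items
X = np.array([[4, 4, 5, 4],
              [2, 3, 2, 2],
              [5, 5, 4, 5],
              [3, 3, 3, 4],
              [1, 2, 1, 2],
              [4, 5, 4, 4],
              [2, 2, 3, 2],
              [5, 4, 5, 5]], dtype=float)

print(f"alpha = {cronbach_alpha(X):.2f}")   # overall score: > .70?

# Alpha-if-item-deleted: does consistency improve or get worse?
for j in range(X.shape[1]):
    reduced = np.delete(X, j, axis=1)
    print(f"without item {j + 1}: alpha = {cronbach_alpha(reduced):.2f}")
```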

27

non-cognitive steps after initial item pool - 2nd step

Collect a new sample of 200-500 people

a. Descriptives: Measures of central tendency, range, variance/standard deviation; skewness and kurtosis (normality)

b. Reliability Analysis

c. Construct Validity: Confirmatory Factor Analysis (Replicate measure’s factor structure)

d. Construct Validity: Convergent and Divergent Correlations (MTMM)

28

non-cognitive steps after initial item pool - 3rd step

collect a new sample of 200-300 people 

a. Reliability Analysis

b. Descriptives

c. Measurement Invariance (CFA for a new sample)

d. Predictive Validity Correlation: What should your scale predict? Why does it matter?

29

non-cognitive steps after initial item pool - 4th step

 Collect a new sample with 2 data points

  1. Test-retest reliability

  2. Predictive Validity - does your measure predict maintenance or change in something over time? 

30

developing test norms

  • Collect multiple representative large samples

  • Share whether groups differ on means/SDs

    • Should there be separate norms of these values for different groups? 

      • Are they all representative of a single population or multiple populations?

    • Examples: 

      • Age Cohort/Grade

      • Gender or Sex

      • Race or Ethnicity

      • Nationality

31

tips for evaluating a measure

Use multiple samples and tests to give your scale the best chance to fail at each step

  1. Sample 1: Initial Test. Descriptives, Internal Consistency, and EFA; maybe a correlation with another measure of the construct.

  2. Sample 2: Replicate. Reliability, Descriptives, and CFA; initial correlations

  3. Sample 3: Validity. Reliability, CFA, Look at convergent and divergent correlations. Give the scale a good chance of failing 

  4. A capstone study: Evaluate predictive validity or generalizability to other samples. Why was it important to develop this measure? For whom is this measure appropriate?

32

considerations for cognitive item analysis

  • P: Item Difficulty

    • Proportion of correct responses/total number of responses

    • Range ~ .3 to .7

  • D: Item Discrimination (Option 1)

    • Difference between proportion of upper and lower level groups of test scores on whether they got an item correct (internal criterion).

      • Evaluate a test’s performance.

    • Difference between proportion of upper and lower level groups on an item when groups are formed from something external to the test (external criterion).

      • Predict behavior, make diagnostic or employment decisions

    • Range from -1 to +1. The closer to 0, the worse it is at distinguishing between levels.

    • Degree to which correct and incorrect answers on an item distinguish between knowledge/abilities/skills of test-takers

33

considerations for cognitive item analysis pt 2

  • Item Discrimination (Option 2)

    • Item-Total Test Score Correlation 

      • Biserial/Point-Biserial (correlations with dichotomous variables)

  • Distractor Analysis

    • Multiple choice items

      • Do distractors attract more of the lower level than upper level group?

      • Decisions on analysis will vary between norm or criterion-referenced tests (norm - more discrimination with good distractors; criterion - less)

  • Corrections for Guessing

    • Corrected score: R (# correct) − W (# incorrect) / [C (# of choices) − 1]

      • Goal: reduce error, measure something more accurately

      • Evidence is not consistent to support this.

34

Non-Cognitive Item Analysis summary

  1. Multiple tests with multiple samples to get your scale right…

    1. Item descriptives, variance

    2. Reliability 

    3. Validation 

    4. Generalizability 

  2. A good rule of thumb is to try to make your measure fail.

  3. Balance statistics with theory/pragmatic considerations in analysis.

35

cognitive item analysis summary

  1. Item Difficulty

  2. Item Discrimination

  3. Performance of Distractors in Performance Tests

36

item difficulty 

Proportion of correct responses/total number of responses

- Range ~ .3 to .7

Formula: P = # of correct responses / # of total responses
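A tiny sketch of the computation, assuming responses are already scored 0/1 (hypothetical data):

```python
import numpy as np

# Hypothetical scored responses: rows = test-takers, columns = items (1 = correct)
scores = np.array([[1, 1, 0, 1],
                   [1, 0, 0, 1],
                   [1, 1, 1, 1],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1]])

p = scores.mean(axis=0)          # P = proportion correct per item
print(p)
print((p < 0.3) | (p > 0.7))     # True = item is too hard or too easy (~.3-.7 range)
```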

37

item discrimination

Degree to which an item distinguishes high-scoring from low-scoring test-takers on the overall measure. Ranges from –1 to +1; higher positive = better.
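One way to compute D, sketched with hypothetical data; the top/bottom 27% grouping is a common convention assumed here, not something fixed by the definition:

```python
import numpy as np

def discrimination_index(item, total, frac=0.27):
    """D = P(correct | upper group) - P(correct | lower group).

    frac = 0.27 (top/bottom 27% of total scores) is a common convention, assumed here.
    """
    n = max(1, int(round(len(total) * frac)))
    order = np.argsort(total)                # sort test-takers by total score
    lower, upper = order[:n], order[-n:]
    return item[upper].mean() - item[lower].mean()

# Hypothetical: 0/1 scores on one item, plus each test-taker's total score
item = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
total = np.array([38, 22, 35, 30, 25, 21, 40, 19, 33, 24])
print(discrimination_index(item, total))     # near +1 = discriminates well
```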

38

Distractor analysis

For multiple-choice items, checking how well incorrect options attract lower-ability respondents and are avoided by higher-ability ones.
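A minimal sketch of a distractor check with hypothetical data; well-behaved distractors should draw mostly from the lower group:

```python
import pandas as pd

# Hypothetical: each respondent's chosen option on one item ("A" is the key),
# plus their ability group based on total score
df = pd.DataFrame({
    "choice": ["A", "B", "A", "C", "D", "A", "B", "C", "A", "D"],
    "group":  ["upper", "lower", "upper", "lower", "lower",
               "upper", "lower", "lower", "upper", "lower"],
})

# Distractors (B, C, D) should attract the lower group; the key (A) the upper
print(pd.crosstab(df["choice"], df["group"]))
```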

39

correction for guessing

Formula adjustment to reduce inflated scores from random guessing (e.g., R – W / [C – 1]); rarely used in counseling tests but conceptually reduces error.
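A worked sketch with hypothetical numbers: a test-taker with 40 right and 12 wrong on 5-choice items gets 40 − 12/4 = 37.

```python
def corrected_score(r: int, w: int, c: int) -> float:
    """Corrected score = R - W / (C - 1)."""
    return r - w / (c - 1)

print(corrected_score(r=40, w=12, c=5))   # 37.0
```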

40

item-total correlations

Correlation between each item and the total test score (excluding that item). Indicates how well an item fits the construct.

41

point biserial/biserial correlations

Relationship between a continuous variable (e.g., total score) and a dichotomous item (right / wrong). Point-biserial = observed dichotomy; biserial = assumed underlying continuity.
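A minimal sketch using scipy (hypothetical scores):

```python
import numpy as np
from scipy import stats

item = np.array([1, 0, 1, 1, 0, 1, 0, 1])           # dichotomous: right/wrong
total = np.array([38, 22, 35, 30, 25, 40, 20, 33])  # continuous total scores

r_pb, p_value = stats.pointbiserialr(item, total)
print(f"point-biserial r = {r_pb:.2f} (p = {p_value:.3f})")
```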

42

range

Difference between the highest and lowest score in a distribution.

43

skewness

Degree of asymmetry in score distribution. Positive = tail on right (few high scores); negative = tail on left (few low scores).

44

kurtosis

“Peakedness” of distribution.

  • Leptokurtic = tall/narrow (scores cluster near mean)

  • Platykurtic = flat/wide (scores spread out)

45

standard deviation

Average distance of scores from the mean; indicates variability.

46

variance

SD squared; measure of total spread of scores.
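A single sketch covering this and the neighboring descriptive-statistics cards (range, SD, variance, skewness, kurtosis), with hypothetical scores; note that scipy reports excess kurtosis (normal = 0) by default:

```python
import numpy as np
from scipy import stats

scores = np.array([55, 60, 62, 65, 65, 68, 70, 72, 75, 90], dtype=float)

print("range    :", scores.max() - scores.min())
print("SD       :", scores.std(ddof=1))        # sample standard deviation
print("variance :", scores.var(ddof=1))        # SD squared
print("skewness :", stats.skew(scores))        # positive = tail on the right
print("kurtosis :", stats.kurtosis(scores))    # excess kurtosis (normal = 0)
```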

47

Guttman scale

Hierarchical—endorsing a strong item implies endorsement of easier ones.
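A minimal sketch of checking how Guttman-like a set of items is, via the coefficient of reproducibility CR = 1 − errors/total responses; the data are hypothetical and CR ≥ .90 is a commonly cited benchmark:

```python
import numpy as np

# Hypothetical 0/1 endorsements (rows = respondents, columns = items)
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 1, 0]])    # last row breaks the hierarchy

# Ideal Guttman pattern: a respondent endorsing t items endorses the t easiest
order = np.argsort(-X.mean(axis=0))       # items from most to least endorsed
Xs = X[:, order]
totals = Xs.sum(axis=1)
ideal = (np.arange(Xs.shape[1]) < totals[:, None]).astype(int)

errors = np.abs(Xs - ideal).sum()
cr = 1 - errors / X.size                  # coefficient of reproducibility
print(f"CR = {cr:.2f}  (>= .90 is a common benchmark)")
```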

48

Thurstone scale

Experts rank items from least → most favorable; weighted by judge agreement.

49

Likert scale

Summated-rating scale using multiple agree–disagree items averaged or summed for a total score.
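A minimal sketch of summated (Likert) scoring with one reverse-scored item; the item names and data are hypothetical:

```python
import pandas as pd

# Hypothetical 1-5 Likert responses; q2 is a reverse-worded item
df = pd.DataFrame({"q1": [4, 2, 5], "q2": [2, 4, 1], "q3": [5, 1, 4]})

df["q2"] = 6 - df["q2"]    # reverse-score: (max + min) - response = 6 - x
df["total"] = df[["q1", "q2", "q3"]].sum(axis=1)   # summated rating
print(df)
```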

50

item-remainder coefficient

Correlation between an item and the total score minus that item; identifies how much each item contributes to internal consistency.

51

internal consistency

Degree to which all items on a scale measure the same underlying construct at one time (e.g., Cronbach’s α, McDonald’s ω); a form of reliability.

52

convergent validity

Evidence that a test correlates strongly with other measures of the same or related construct.

53

divergent validity

Evidence that a test shows low correlations with measures of different or unrelated constructs.

54

concurrent validity

Form of criterion validity where test scores relate to an external criterion measured at the same time (e.g., an aptitude test correlated with current GPA).

55

criterion-related validity

Extent to which test scores predict or correspond with an external performance criterion.
Includes predictive (future) and concurrent (present).

56

factor analysis

Statistical method identifying clusters of items (factors) that represent underlying dimensions of a construct.

  • EFA = explore; CFA = confirm.
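A minimal EFA sketch, assuming the third-party factor_analyzer package (pip install factor-analyzer) and synthetic two-factor data:

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package, assumed installed

# Hypothetical pilot data: two latent factors, six items (rows = respondents)
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
loadings = np.array([[.8, 0], [.7, 0], [.6, 0],    # items 1-3 load on factor 1
                     [0, .8], [0, .7], [0, .6]])   # items 4-6 load on factor 2
X = pd.DataFrame(latent @ loadings.T + rng.normal(scale=.5, size=(300, 6)),
                 columns=[f"q{i}" for i in range(1, 7)])

fa = FactorAnalyzer(n_factors=2, rotation="oblimin")  # oblique: factors may correlate
fa.fit(X)
print(pd.DataFrame(fa.loadings_, index=X.columns).round(2))  # which items load where
```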

57

Cronbach/coefficient alpha

Index of internal consistency; estimates average inter-item correlation adjusted for item count. ≥ .70 = acceptable for research; ≥ .90 = clinical use.
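For reference, the standard formula, where k is the number of items, σ²ᵢ the variance of item i, and σ²ₓ the variance of total scores:

α = [k / (k − 1)] · (1 − Σ σ²ᵢ / σ²ₓ)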

58

compiling norms

Creating reference data (means, SDs, percentiles, T-scores) from large representative samples so individual scores can be interpreted relative to others.
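A minimal sketch of turning a raw score into norm-referenced scores (z, T, percentile) with a hypothetical norm sample; T = 50 + 10z is the standard T-score convention:

```python
import numpy as np

norm_sample = np.array([12, 15, 11, 18, 14, 16, 13, 17], dtype=float)  # hypothetical
mean, sd = norm_sample.mean(), norm_sample.std(ddof=1)

raw = 17.0
z = (raw - mean) / sd                      # standard score relative to norm group
t = 50 + 10 * z                            # T-score: mean 50, SD 10
pct = (norm_sample < raw).mean() * 100     # simple percentile rank
print(f"z = {z:.2f}, T = {t:.1f}, percentile ~ {pct:.0f}")
```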

59

Key 15 steps of measurement design

1- state the purpose of the scale

2- identify and define the domain of the construct to be measured

3- determine whether a measure already exists 

4- determine item format

5- develop a test blueprint or test objectives

6- create initial item pool

7- conduct the initial item review (and revisions)

8- conduct a large scale field test of items

9- analyze items

10- revise items

11- calculate reliability

12- conduct a second field test of items

13- repeat steps 8-11 if needed

14- conduct validation studies

15- prepare guidelines for administration 

60

key characteristics of scales

  • Unidimensional – Items measure one main construct.

  • Reliable – Produces consistent results (low random error).

  • Valid – Accurately measures what it claims to.

  • Clear and specific – Avoids vague, double-barreled, or biased wording.

  • Appropriate reading level – Understandable to the target population.

  • Balanced – Includes both positively and negatively worded items.

61

analyses involved in measurement design

  • Descriptive Statistics: Understand means, SDs, skewness, kurtosis of item scores.

  • Item Analysis: Identify poor items based on difficulty, discrimination, or item-total correlation.

  • Reliability Analysis: Estimate internal consistency (Cronbach’s alpha or McDonald’s Omega).

  • Exploratory Factor Analysis (EFA): Discover how many dimensions underlie the data.

  • Confirmatory Factor Analysis (CFA): Confirm the factor structure in a new sample.

  • Criterion/Construct Validity Correlations: Check that scores relate appropriately to other variables.

62

key tips for creating good items

  • Keep items simple, short, and clear.

  • Avoid “and/or” statements (each item should measure one thing).

  • Keep the reading level accessible.

  • Use positively worded items but include some reverse items.

  • Make items present- or future-oriented.

  • Avoid double negatives or jargon.

  • Ensure response options are consistent and evenly spaced (for Likert scales).


63

potential problems for reliability and validity

  • Social desirability bias: People answer in ways that make them look good.

  • Acquiescence bias: Participants agree with everything.

  • Random responding: Low reliability; participants not paying attention.

  • Restricted range: Reduces correlations and reliability estimates.

  • Cultural bias: Items may not be relevant or fair across groups.

  • Ambiguous wording: Increases measurement error.


64

research design in terms of sampling, data analysis

  • Multiple samples help verify reliability and validity.

  • Sample 1: Pilot → Item and reliability analysis.

  • Sample 2: Replication → Factor analysis and consistency check.

  • Sample 3: Validity → Convergent/divergent and predictive validity.

  • Sample 4: Test–retest reliability or measurement invariance (across groups).