what are the goals that make a good scale?
A clear purpose
Reliability (consistency, less random error)
Validity (accuracy in measuring the intended target)
what strategies make a good scale?
a. clarity of items
b. specific to a single idea (for ex. you usually don't want to see "and," "either," or "or" in an item)
c. gets at all the facets/domains
d. get a variety of responses
e. avoid jargon, negatives
f. have differently worded items
g. appropriate reading level
h. try to avoid bias
steps for developing a summated rating scale
Define construct → Design scale → Pilot test → Administration and item analysis → Validate and norm
(with feedback loops: pilot testing can send you back to the design-scale step, and administration/item analysis can send you back to the define-construct step)
how to define something invisible or abstract
What is the context in which this construct exists?
Inductive approach: Give as clear of a definition as possible before you create items.
Read well.
How narrow or broad of a concept do you want to measure?
questions to get a good definition
What is the purpose of the scale?
What is the construct?
What definitions do you want to adopt from the literature?
Any good measures already?
Are there any gaps in the literature?
Bandalos's 15 steps?
Determine item format.
Develop a blueprint for your test objectives.
Create initial candidate item pool.
Review items (with multiple people).
Large pilot test of items - initial analysis.
Larger analysis of items.
Revise items.
Calculate reliability of scale.
Conduct a test with a 2nd sample.
Repeat Steps 3-9 if needed.
Test the evidence of validity.
Prepare guidelines for administration.
how do we get there?
1) Teamwork makes the dream work!
Collaborate, take perspective
Reward disagreements.
Expert involvement
2) read
Adopt a good definition.
Clarify the strengths and gaps in the literature.
3) Make your blueprint:
Definition of what you want to measure.
Objectives of your measure.
Choose appropriate response choices
Agreement, frequency, evaluation
Spacing
Bipolar/unipolar?
Number of choices (Forced choice? If Likert scale, 5-9 choices?)
Types of items
what to read?
Peer-Reviewed Measure Development Articles
Peer-Reviewed Measure Review Articles
Peer-Reviewed Literature Reviews
Monographs by leading authors in a field
where to read?
Google Scholar, Google
Library
PsycINFO
ProQuest
PubMed
PsycArticles
InterLibrary Loan
IPIP
*Mental Measurements Yearbook
*PsycTESTS
*Tests in Print
Email/ResearchGate - requests
measuring something cognitive
Recall/demonstrate knowledge of something
Show understanding or comprehension of something
Show application of knowledge
Require analytical skills
Require skills of synthesis of information into something coherent/directed (or, comparisons and making inferences)
Require skills at evaluation/making judgments
measuring something affective
Receiving - willing to pay attention
Responding - endorsement of an opinion
Commitment - Voicing a strong opinion or intention to change something
Organization - Demonstrate a change in opinion or attitude
Characterization - Demonstrating commitment to a change via habits and developments of traits/a worldview
speed in measuring achievement and aptitude:
quickly respond to the task within a time limit
power in measuring achievement and aptitude:
respond to items that increase in difficulty
the next steps with your blueprint
Generate good items
Design your responses
Choose format
Choose rater of items (Self or other or both)
Clear instructions
Defined population
Develop a strategy for how you could evaluate or test your measure
Item analysis (e.g., Internal Consistency, Factor Analysis, Item-Response Theory)
Criterion-Related Validity
Generalizability to certain populations (Measurement Invariance)
tips for developing good likert items
Consistent item stems
Items with similar lengths
Task of item is clear
Good grammar
Items adequately get at the domain of the topic you want to cover.
Items clearly express only one idea.
Use positively worded items, but have reverse items too.
Reverse items should not rely on negative words like “no” or “not”
Who is your language accessible to?
Jargon? Colloquialisms?
Think of the reading level of your participants.
Keep items present or future-oriented.
Avoid facts.
Choose items people would have a dispersion of scores on across the range of scores.
Keep items short.
Be careful with adverbs like “usually,” “only,” “just”
cognitive items
True/False
Matching
Multiple Choice
Checklists
Short answer
Completing a sentence
Performance Task
non-cognitive items
Thurstone Scaling
Items are ranked by multiple judges from least favorable to most favorable on an interval scale (11-point scale); the final measure includes items representative across the scale, weighted according to the rankings. Takes a long time to do.
Guttman Scaling
Deterministic method: Items level up in a hierarchy from less extreme or likely to most extreme or likely. Hard to get right.
Likert Scaling
Summative method (add up scores on items). Very common, low cost.
what do we mean by cognitive?
As shorthand, cognitive refers to conscious intellectual activities that require effort and knowledge, such as memory, problem-solving, and reasoning (e.g., IQ tests).
what do we mean for non-cognitive
Non-cognitive refers to other mental aspects such as attitudes, affect, motivation, traits, skills, or behaviors (e.g., personality tests).
what comes after creating candidate items?
Design:
Clear Instructions
Common frame of reference for participants to respond
Choose an appropriate rating scale
Agreement? Frequency? Evaluation? - Spector, p. 19
How will participants take your measure?
Review:
Share, get feedback, revise
Pilot
Get feedback from a sample of people on items
Data Collection:
IRB approval of a research proposal
Collect data in accordance with IRB
Initial item analysis
More Data Analysis:
Mean, Range, Standard Deviations…
Reliability analyses
Validity analyses
issues to consider
What is getting measured: Person, Situation, or Person x Situation?
Level of effort (satisfice vs. optimize; Krosnick, 1991)
Level of enjoyment of the survey, motivation to take the survey
Participant’s understanding/interpretation of item
How easy is it for the participant to generate a response to the item?
How appropriate are the response choices across all items?
To what degree can participants edit responses?
How good is the scale in terms of its conversational utility?
Amount of information
Quality of information
Relevance of information
Clarity of Information
examples of response distortion
Socially Desirable Responding
Acquiescence
Malingering
Extreme Response (or consistently neutral response)
why go through this process of scale development?
Advances scientific understanding in our field. (scientific advancement)
Helps inform to whom and how measures should be applied to inform interventions, accommodations, or decisions. (Inform practitioners)
Learning to do measurement analysis and development will open doors professionally. (open professional doors)
Item remainder coefficient -
correlation of an item with the total score of the other items (i.e., the total minus the item being examined). < .35 may be a good rule of thumb that the item should be removed, but do this over multiple iterations because the values can change each time you take out an item.
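A minimal sketch of how this coefficient could be computed, assuming item responses sit in a pandas DataFrame with one column per item; the column names, data, and the .35 cutoff shown are illustrative:

```python
import pandas as pd

def item_remainder_coefficients(items: pd.DataFrame) -> pd.Series:
    """Correlate each item with the total of the remaining items."""
    total = items.sum(axis=1)
    # Subtract the item itself so it is not correlated with its own score.
    return pd.Series(
        {col: items[col].corr(total - items[col]) for col in items.columns}
    )

# Hypothetical 5-item Likert data (rows = respondents).
data = pd.DataFrame({
    "item1": [4, 5, 3, 4, 2, 5],
    "item2": [3, 5, 2, 4, 1, 4],
    "item3": [2, 1, 4, 2, 5, 1],  # a reverse-keyed item left unrecoded on purpose
    "item4": [4, 4, 3, 5, 2, 5],
    "item5": [3, 4, 3, 4, 2, 4],
})
coefs = item_remainder_coefficients(data)
print(coefs)
print(coefs[coefs < .35].index.tolist())  # candidates for removal under the ~.35 rule of thumb
```

Because removing an item changes the remainder totals, these coefficients are typically recomputed after each deletion, as the card notes.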
non-cognitive steps after initial item pool - 1st step
Collect a sample of at least 200 people
Reliability Analysis
Check data/scoring of items
Look at item-remainder coefficient
Check internal consistency (e.g., alpha)
Overall score: > .70?
Check internal consistency if each item is deleted
Does it improve or get worse? (see the sketch after this card)
Exploratory Factor Analysis
How many factors?
Unidimensional or multidimensional?
How are they related?
Orthogonal (not correlated) or oblique (correlated)?
How many items?
Convergent validity
Caution - need enough items that cover various aspects of the construct and don’t just say the same thing.
Descriptive Statistics (Mean, SD, Skew, Kurt)
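A minimal sketch of the internal-consistency checks above (coefficient alpha and alpha-if-item-deleted), assuming item scores are columns of a NumPy array; function names and the data are illustrative:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def alpha_if_item_deleted(items: np.ndarray) -> list:
    """Recompute alpha with each item removed in turn."""
    return [cronbach_alpha(np.delete(items, j, axis=1)) for j in range(items.shape[1])]

# Hypothetical responses: rows = respondents, columns = items.
X = np.array([[4, 3, 4, 4],
              [5, 5, 4, 5],
              [3, 2, 3, 3],
              [4, 4, 5, 4],
              [2, 1, 2, 2],
              [5, 4, 5, 4]], dtype=float)
print(round(cronbach_alpha(X), 2))                       # is the overall scale > .70?
print([round(a, 2) for a in alpha_if_item_deleted(X)])   # does dropping any item help?
```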
non-cognitive steps after initial item pool - 2nd step
Collect a new sample of 200-500 people
a. Descriptives: Measures of central tendency, range, variance/standard deviation; skewness and kurtosis (normality)
b. Reliability Analysis
c. Construct Validity: Confirmatory Factor Analysis (Replicate measure’s factor structure)
d. Construct Validity: Convergent and Divergent Correlations (MTMM)
non-cognitive steps after initial item pool - 3rd step
collect a new sample of 200-300 people
a. Reliability Analysis
b. Descriptives
c. Measurement Invariance (CFA for a new sample)
d. Predictive Validity Correlation: What should your scale predict? Why does it matter?
non-cognitive steps after initial item pool - 4th step
Collect a new sample with 2 data points
Test-retest reliability
Predictive Validity - does your measure predict maintenance or change in something over time?
developing test norms
Collect multiple representative large samples
Share whether groups differ on means/SDs
Should there be separate norms of these values for different groups?
Are they all representative of a single population or multiple populations?
Examples:
Age Cohort/Grade
Gender or Sex
Race or Ethnicity
Nationality
tips for evaluating a measure
Use multiple samples and tests to give your scale the best chance to fail at each step
Sample 1: Initial Test. Descriptives, Internal Consistency, and EFA. maybe correlation with another measure of construct.
Sample 2: Replicate. Reliability, Descriptives, and CFA; initial correlations
Sample 3: Validity. Reliability, CFA, Look at convergent and divergent correlations. Give the scale a good chance of failing
A capstone study: Evaluate predictive validity or generalizability to other samples. Why was it important to develop this measure? For whom is this measure appropriate?
considerations for cognitive item analysis
P: Item Difficulty
Proportion of correct responses/total number of responses
Range ~ .30 to .70
D: Item Discrimination (Option 1)
Difference between the proportions of upper- and lower-scoring groups (formed from total test scores) who got an item correct (internal criterion).
Used to evaluate a test's performance.
Difference between the proportions of upper and lower groups on an item when the groups are formed from something external to the test (external criterion).
Used to predict behavior or make diagnostic or employment decisions.
Ranges from -1 to +1; the closer to 0, the worse the item distinguishes between levels.
Degree to which correct and incorrect answers on an item distinguish between the knowledge/abilities/skills of test-takers (see the sketch after this card).
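A minimal sketch of item difficulty (P) and upper-versus-lower-group item discrimination (D) for items scored 0/1; the 27% grouping fraction, function names, and data are illustrative assumptions:

```python
import numpy as np

def item_difficulty(correct: np.ndarray) -> float:
    """P = number of correct responses / total number of responses."""
    return correct.mean()

def item_discrimination(correct: np.ndarray, total_scores: np.ndarray,
                        top_frac: float = 0.27) -> float:
    """D = proportion correct in the upper group minus proportion correct in the lower group."""
    n = len(total_scores)
    k = max(1, int(round(top_frac * n)))
    order = np.argsort(total_scores)   # groups formed from total test scores (internal criterion)
    lower, upper = order[:k], order[-k:]
    return correct[upper].mean() - correct[lower].mean()

# Hypothetical item (1 = correct) for 10 examinees, plus their total test scores.
item = np.array([1, 1, 0, 1, 0, 1, 0, 1, 1, 0])
totals = np.array([38, 35, 12, 30, 15, 40, 10, 28, 33, 18])
print(item_difficulty(item))               # aim for roughly .30 to .70
print(item_discrimination(item, totals))   # closer to +1 is better
```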
considerations for cognitive item analysis pt 2
Item Discrimination (Option 2)
Item-Total Test Score Correlation
Biserial/Point-Biserial (correlations with dichotomous variables)
Distractor Analysis
Multiple choice items
Do distractors attract more of the lower level than upper level group?
Decisions on analysis will vary between norm or criterion-referenced tests (norm - more discrimination with good distractors; criterion - less)
Corrections for Guessing
Corrected score = R (# correct) - W (# incorrect) / [C (# of choices) - 1]
Goal: reduce error, measure something more accurately
Evidence is not consistent to support this.
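A minimal sketch of the guessing correction above (corrected score = R - W / (C - 1)); the numbers are illustrative:

```python
def corrected_for_guessing(num_correct: int, num_incorrect: int, num_choices: int) -> float:
    """Corrected score = R - W / (C - 1); omitted items count as neither correct nor incorrect."""
    return num_correct - num_incorrect / (num_choices - 1)

# Hypothetical 40-item multiple-choice test with 4 options per item:
# 28 correct, 8 incorrect, 4 omitted.
print(corrected_for_guessing(28, 8, 4))  # 28 - 8/3 = 25.33
```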
Non-Cognitive Item Analysis summary
Multiple tests with multiple samples to get your scale right…
Item descriptives, variance
Reliability
Validation
Generalizability
A good rule of thumb is to try to make your measure fail.
Balance statistics with theory/pragmatic considerations in analysis.
cognitive item analysis summary
Item Difficulty
Item Discrimination
Performance of Distractors in Performance Tests
item difficulty
Proportion of correct responses/total number of responses
- Range ~ .30 to .70
formula: P = # of correct responses / # of total responses
item discrimination
Degree to which an item distinguishes high-scoring from low-scoring test-takers on the overall measure. Ranges from –1 to +1; higher positive = better.
Distractor analysis
For multiple-choice items, checking how well incorrect options attract lower-ability respondents and are avoided by higher-ability ones.
correction for guessing
Formula adjustment to reduce inflated scores from random guessing (e.g., R – W / [C – 1]); rarely used in counseling tests but conceptually reduces error.
item-total correlations
Correlation between each item and the total test score (excluding that item). Indicates how well an item fits the construct.
point biserial/biserial correlations
Relationship between a continuous variable (e.g., total score) and a dichotomous item (right / wrong). Point-biserial = observed dichotomy; biserial = assumed underlying continuity.
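A minimal sketch of a point-biserial item-total correlation using SciPy, assuming a dichotomous (0/1) item and continuous total scores; the data are illustrative:

```python
import numpy as np
from scipy import stats

# Hypothetical dichotomous item (1 = correct) and total test scores.
item = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])
totals = np.array([34, 18, 30, 28, 15, 36, 20, 27, 16, 31])

r_pb, p_value = stats.pointbiserialr(item, totals)
print(round(r_pb, 2), round(p_value, 3))  # a higher positive r_pb means the item tracks overall ability
```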
range
Difference between the highest and lowest score in a distribution.
skewness
Degree of asymmetry in score distribution. Positive = tail on right (few high scores); negative = tail on left (few low scores).
kurtosis
“Peakedness” of distribution.
Leptokurtic = tall/narrow (scores cluster near mean)
Platykurtic = flat/wide (scores spread out)
standard deviation
Average distance of scores from the mean; indicates variability.
variance
SD squared; measure of total spread of scores.
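A minimal sketch computing the descriptives defined in the cards above (range, standard deviation, variance, skewness, kurtosis) with NumPy/SciPy; the scores are illustrative:

```python
import numpy as np
from scipy import stats

scores = np.array([12, 15, 14, 18, 21, 9, 16, 17, 13, 22, 11, 19], dtype=float)

print("range:", scores.max() - scores.min())
print("mean:", scores.mean())
print("sd:", scores.std(ddof=1))             # sample standard deviation
print("variance:", scores.var(ddof=1))       # SD squared
print("skewness:", stats.skew(scores))       # positive = tail on the right
print("kurtosis:", stats.kurtosis(scores))   # excess kurtosis: > 0 leptokurtic, < 0 platykurtic
```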
Guttman scale
Hierarchical—endorsing a strong item implies endorsement of easier ones.
thurstone scale
Experts rank items from least → most favorable; weighted by judge agreement.
Likert scale
Summated-rating scale using multiple agree–disagree items averaged or summed for a total score.
item-remainder coefficient
Correlation between an item and the total score minus that item; identifies how much each item contributes to internal consistency.
internal consistency
Degree to which all items on a scale measure the same underlying construct at one time (e.g., Cronbach’s α, McDonald’s ω); a form of reliability.
convergent validity
Evidence that a test correlates strongly with other measures of the same or related construct.
divergent validity
Evidence that a test shows low correlations with measures of different or unrelated constructs.
concurrent validity
Form of criterion validity where test scores relate to an external criterion measured at the same time (e.g., current GPA ↔ aptitude test).
criterion-related validity
Extent to which test scores predict or correspond with an external performance criterion.
Includes predictive (future) and concurrent (present).
factor analysis
Statistical method identifying clusters of items (factors) that represent underlying dimensions of a construct.
EFA = explore; CFA = confirm.
cronbach/coefficient alpha
Index of internal consistency; estimates average inter-item correlation adjusted for item count. ≥ .70 = acceptable for research; ≥ .90 = clinical use.
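For reference, this is the standard form of coefficient alpha for a k-item scale, where the item variances are summed in the numerator and the variance of the total score sits in the denominator (the same quantity the alpha sketch earlier in this set computes):

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right)
```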
compiling norms
Creating reference data (means, SDs, percentiles, T-scores) from large representative samples so individual scores can be interpreted relative to others.
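A minimal sketch of turning a raw score into norm-referenced values (z-score, T-score with mean 50 and SD 10, percentile rank), assuming a normative sample is already in hand; the sample and raw score are illustrative:

```python
import numpy as np
from scipy import stats

# Hypothetical normative sample of raw scores.
norm_sample = np.array([22, 30, 27, 35, 25, 28, 31, 24, 29, 33], dtype=float)
mean, sd = norm_sample.mean(), norm_sample.std(ddof=1)

raw = 32.0                                        # an individual's raw score
z = (raw - mean) / sd                             # standard score
t = 50 + 10 * z                                   # T-score: mean 50, SD 10
pct = stats.percentileofscore(norm_sample, raw)   # percentile rank within the norm group
print(round(z, 2), round(t, 1), pct)
```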
Key 15 steps of measurement design
1- state the purpose of the scale
2- identify and define the domain of the construct to be measured
3- determine whether a measure already exists
4- determine item format
5- develop a test blueprint or test objectives
6- create initial item pool
7- conduct the initial item review (and revisions)
8- conduct a large scale field test of items
9- analyze items
10- revise items
11- calculate reliability
12- conduct a second field test of items
13- repeat steps 8-11 if needed
14- conduct validation studies
15- prepare guidelines for administration
key characteristics of scales
Unidimensional – Items measure one main construct.
Reliable – Produces consistent results (low random error).
Valid – Accurately measures what it claims to.
Clear and specific – Avoids vague, double-barreled, or biased wording.
Appropriate reading level – Understandable to the target population.
Balanced – Includes both positively and negatively worded items.
analyses involved in measurement design
Analysis Type | Purpose |
---|---|
Descriptive Statistics | Understand means, SDs, skewness, kurtosis of item scores. |
Item Analysis | Identify poor items based on difficulty, discrimination, or item-total correlation. |
Reliability Analysis | Estimate internal consistency (Cronbach’s alpha or McDonald’s Omega). |
Exploratory Factor Analysis (EFA) | Discover how many dimensions underlie the data (see the sketch after this table). |
Confirmatory Factor Analysis (CFA) | Confirm the factor structure in a new sample. |
Criterion/Construct Validity Correlations | Check that scores relate appropriately to other variables. |
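A minimal sketch of one common way to decide how many factors to retain in EFA (eigenvalues of the item correlation matrix and Kaiser's eigenvalue-greater-than-1 rule); the simulated data are illustrative, and a real analysis would use a dedicated EFA routine with rotation:

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
f2 = rng.normal(size=200)
noise = 0.5 * rng.normal(size=(200, 6))
# Hypothetical 6-item scale: items 1-3 load on one factor, items 4-6 on another.
X = np.column_stack([f1, f1, f1, f2, f2, f2]) + noise

corr = np.corrcoef(X, rowvar=False)            # item correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # sorted largest first
print(np.round(eigenvalues, 2))
print("factors retained by the Kaiser rule:", int((eigenvalues > 1).sum()))
```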
key tips for creating good items
Keep items simple, short, and clear.
Avoid “and/or” statements (each item should measure one thing).
Keep the reading level accessible.
Use positively worded items but include some reverse items.
Make items present- or future-oriented.
Avoid double negatives or jargon.
Ensure response options are consistent and evenly spaced (for Likert scales).
potential problems for reliability and validity
Problem | Effect |
---|---|
Social desirability bias | People answer in ways that make them look good. |
Acquiescence bias | Participants agree with everything. |
Random responding | Low reliability; participants not paying attention. |
Restricted range | Reduces correlations and reliability estimates. |
Cultural bias | Items may not be relevant or fair across groups. |
Ambiguous wording | Increases measurement error. |
research design in terms of sampling, data analysis
Multiple samples help verify reliability and validity.
Sample 1: Pilot → Item and reliability analysis.
Sample 2: Replication → Factor analysis and consistency check.
Sample 3: Validity → Convergent/divergent and predictive validity.
Sample 4: Test–retest reliability or measurement invariance (across groups).