what are the goals that make a good scale?
A clear purpose
Reliability (consistency, less random error)
Validity (accuracy in measuring the intended target)
what strategies make a good scale?
a. clarity of items
b. specific to a single idea (for ex. you usually don't want to see "and," "either," or "or" in an item)
c. gets at all the facets/domains
d. get a variety of responses
e. avoid jargon, negatives
f. have differently worded items
g. appropriate reading level
h. try to avoid bias
steps for developing a summated rating scale
Define construct → Design scale → Pilot test → Administration and item analysis → Validate and norm
(with feedback loops: pilot testing can send you back to the design-scale step, and administration/item analysis can send you back to the define-construct step)
how to define something invisible or abstract
What is the context in which this construct exists?
Inductive approach: Give as clear of a definition as possible before you create items.
Read well.
How narrow or broad of a concept do you want to measure?
questions to get a good definition
What is the purpose of the scale?
What is the construct?
What definitions do you want to adopt from the literature?
Any good measures already?
Are there any gaps in the literature?
Bandalos's 15 steps?
Determine item format.
Develop a blueprint for your test objectives.
Create initial candidate item pool.
Review items (with multiple people).
Large pilot test of items - initial analysis.
Larger analysis of items.
Revise items.
Calculate reliability of scale.
Conduct a test with a 2nd sample.
Repeat Steps 3-9 if needed.
Test the evidence of validity.
Prepare guidelines for administration.
how do we get there?
1) Teamwork makes the dream work!
Collaborate, take perspective
Reward disagreements.
Expert involvement
2) read
Adopt a good definition.
Clarify the strengths and gaps in the literature.
3) Make your blueprint:
Definition of what you want to measure.
Objectives of your measure.
Choose appropriate response choices
Agreement, frequency, evaluation
Spacing
Bipolar/unipolar?
Number of choices (Forced choice? If Likert scale, 5-9 choices?)
Types of items
what to read?
Peer-Reviewed Measure Development Articles
Peer-Reviewed Measure Review Articles
Peer-Reviewed Literature Reviews
Monographs by leading authors in a field
where to read?
Google Scholar, Google
Library
PsycINFO
ProQuest
PubMed
PsycArticles
InterLibrary Loan
IPIP
*Mental Measurements Yearbook
*PsycTESTS
*Tests in Print
Email/ResearchGate - requests
measuring something cognitive
Recall/demonstrate knowledge of something
Show understanding or comprehension of something
Show application of knowledge
Require analytical skills
Require skills of synthesis of information into something coherent/directed (or, comparisons and making inferences)
Require skills at evaluation/making judgments
measuring something affective
Receiving - willing to pay attention
Responding - endorsement of an opinion
Commitment - Voicing a strong opinion or intention to change something
Organization - Demonstrate a change in opinion or attitude
Characterization - Demonstrating commitment to a change via habits and developments of traits/a worldview
speed in measuring achievement and aptitude:
quickly respond to the task within a time limit
power in measuring achievement and aptitude:
respond to items that increase in difficulty
the next steps with your blueprint
Generate good items
Design your responses
Choose format
Choose rater of items (Self or other or both)
Clear instructions
Defined population
Develop a strategy for how you could evaluate or test your measure
Item analysis (e.g., Internal Consistency, Factor Analysis, Item-Response Theory)
Criterion-Related Validity
Generalizability to certain populations (Measurement Invariance)
tips for developing good likert items
Consistent item stems
Items with similar lengths
Task of item is clear
Good grammar
Items adequately get at the domain of the topic you want to cover.
Items clearly express only one idea.
Use positively worded items, but have reverse items too.
Reverse items should not rely on negative words like “no” or “not”
Who is your language accessible to?
Jargon? Colloquialisms?
Think of the reading level of your participants.
Keep items present or future-oriented.
Avoid facts.
Choose items people would have a dispersion of scores on across the range of scores.
Keep items short.
Be careful with adverbs like “usually,” “only,” “just”
cognitive items
True/False
Matching
Multiple Choice
Checklists
Short answer
Completing a sentence
Performance Task
non-cognitive items
Thurstone Scaling
Items are ranked by multiple judges from least favorable to most favorable on an interval scale (11-point scale); the final measure includes items representative across the scale, weighted according to the rankings. Takes a long time to do.
Guttman Scaling
Deterministic method: Items level up in a hierarchy from less extreme or likely to most extreme or likely. Hard to get right.
Likert Scaling
Summative method (add up scores on items). Very common, low cost.
what do we mean by cognitive?
As shorthand, cognitive refers to conscious intellectual activities that require effort and knowledge, such as memory, problem-solving, and reasoning (e.g., IQ tests).
what do we mean for non-cognitive
Non-cognitive refers to other mental aspects such as attitudes, affect, motivation, traits, skills, or behaviors (e.g., personality tests).
what comes after creating candidate items?
Design:
Clear Instructions
Common frame of reference for participants to respond
Choose an appropriate rating scale
Agreement? Frequency? Evaluation? - Spector, p. 19
How will participants take your measure?
Review:
Share, get feedback, revise
Pilot
Get feedback from a sample of people on items
Data Collection:
IRB approval of a research proposal
Collect data in accordance with IRB
Initial item analysis
More Data Analysis:
Mean, Range, Standard Deviations…
Reliability analyses
Validity analyses
issues to consider
What is getting measured: Person, Situation, or Person x Situation?
Level of effort (satisfice vs. optimize; Krosnick, 1991)
Level of enjoyment of the survey, motivation to take the survey
Participant’s understanding/interpretation of item
How easy is it for the participant to generate a response to the item?
How appropriate are the response choices across all items?
To what degree can participants edit responses?
How good is the scale in terms of its conversational utility?
Amount of information
Quality of information
Relevance of information
Clarity of Information
examples of response distortion
Socially Desirable Responding
Acquiescence
Malingering
Extreme Response (or consistently neutral response)
why go through this process of scale development?
Advances scientific understanding in our field. (scientific advancement)
Helps inform to whom and how measures should be applied to inform interventions, accommodations, or decisions. (Inform practitioners)
Learning to do measurement analysis and development will open doors professionally. (open professional doors)
Item remainder coefficient -
correlation of an item with the total score of the other items (i.e., the total minus the item being examined). < .35 may be a good rule of thumb that the item should be removed, but do this over multiple iterations because the values can change each time you take out an item.
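A minimal sketch of how this coefficient could be computed, assuming item responses sit in a pandas DataFrame with one column per item; the column names, data, and the .35 cutoff shown are illustrative:

```python
import pandas as pd

def item_remainder_coefficients(items: pd.DataFrame) -> pd.Series:
    """Correlate each item with the total of the remaining items."""
    total = items.sum(axis=1)
    # Subtract the item itself so it is not correlated with its own score.
    return pd.Series(
        {col: items[col].corr(total - items[col]) for col in items.columns}
    )

# Hypothetical 5-item Likert data (rows = respondents).
data = pd.DataFrame({
    "item1": [4, 5, 3, 4, 2, 5],
    "item2": [3, 5, 2, 4, 1, 4],
    "item3": [2, 1, 4, 2, 5, 1],  # a reverse-keyed item left unrecoded on purpose
    "item4": [4, 4, 3, 5, 2, 5],
    "item5": [3, 4, 3, 4, 2, 4],
})
coefs = item_remainder_coefficients(data)
print(coefs)
print(coefs[coefs < .35].index.tolist())  # candidates for removal under the ~.35 rule of thumb
```

Because removing an item changes the remainder totals, these coefficients are typically recomputed after each deletion, as the card notes.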
non-cognitive steps after initial item pool - 1st step
Collect a sample of at least 200 people
Reliability Analysis
Check data/scoring of items
Look at item-remainder coefficient
Check internal consistency (e.g., alpha)
Overall score: > .70?
Check internal consistency if each item is deleted
Does it improve or get worse? (see the sketch after this card)
Exploratory Factor Analysis
How many factors?
Unidimensional or multidimensional?
How are they related?
Orthogonal (not correlated) or oblique (correlated)?
How many items?
Convergent validity
Caution - need enough items that cover various aspects of the construct and don’t just say the same thing.
Descriptive Statistics (Mean, SD, Skew, Kurt)
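A minimal sketch of the internal-consistency checks above (coefficient alpha and alpha-if-item-deleted), assuming item scores are columns of a NumPy array; function names and the data are illustrative:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def alpha_if_item_deleted(items: np.ndarray) -> list:
    """Recompute alpha with each item removed in turn."""
    return [cronbach_alpha(np.delete(items, j, axis=1)) for j in range(items.shape[1])]

# Hypothetical responses: rows = respondents, columns = items.
X = np.array([[4, 3, 4, 4],
              [5, 5, 4, 5],
              [3, 2, 3, 3],
              [4, 4, 5, 4],
              [2, 1, 2, 2],
              [5, 4, 5, 4]], dtype=float)
print(round(cronbach_alpha(X), 2))                       # is the overall scale > .70?
print([round(a, 2) for a in alpha_if_item_deleted(X)])   # does dropping any item help?
```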
non-cognitive steps after initial item pool - 2nd step
Collect a new sample of 200-500 people
a. Descriptives: Measures of central tendency, range, variance/standard deviation; skewness and kurtosis (normality)
b. Reliability Analysis
c. Construct Validity: Confirmatory Factor Analysis (Replicate measure’s factor structure)
d. Construct Validity: Convergent and Divergent Correlations (MTMM)
non-cognitive steps after initial item pool - 3rd step
collect a new sample of 200-300 people
a. Reliability Analysis
b. Descriptives
c. Measurement Invariance (CFA for a new sample)
d. Predictive Validity Correlation: What should your scale predict? Why does it matter?
non-cognitive steps after initial item pool - 4th step
Collect a new sample with 2 data points
Test-retest reliability
Predictive Validity - does your measure predict maintenance or change in something over time?
developing test norms
Collect multiple representative large samples
Share whether groups differ on means/SDs
Should there be separate norms of these values for different groups?
Are they all representative of a single population or multiple populations?
Examples:
Age Cohort/Grade
Gender or Sex
Race or Ethnicity
Nationality
tips for evaluating a measure
Use multiple samples and tests to give your scale the best chance to fail at each step
Sample 1: Initial Test. Descriptives, Internal Consistency, and EFA. maybe correlation with another measure of construct.
Sample 2: Replicate. Reliability, Descriptives, and CFA; initial correlations
Sample 3: Validity. Reliability, CFA, Look at convergent and divergent correlations. Give the scale a good chance of failing
A capstone study: Evaluate predictive validity or generalizability to other samples. Why was it important to develop this measure? For whom is this measure appropriate?
considerations for cognitive item analysis
P: Item Difficulty
Proportion of correct responses/total number of responses
Range ~ .30 to .70
D: Item Discrimination (Option 1)
Difference between the proportions of upper- and lower-scoring groups (formed from total test scores) who got an item correct (internal criterion).
Used to evaluate a test's performance.
Difference between the proportions of upper and lower groups on an item when the groups are formed from something external to the test (external criterion).
Used to predict behavior or make diagnostic or employment decisions.
Ranges from -1 to +1; the closer to 0, the worse the item distinguishes between levels.
Degree to which correct and incorrect answers on an item distinguish between the knowledge/abilities/skills of test-takers (see the sketch after this card).
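A minimal sketch of item difficulty (P) and upper-versus-lower-group item discrimination (D) for items scored 0/1; the 27% grouping fraction, function names, and data are illustrative assumptions:

```python
import numpy as np

def item_difficulty(correct: np.ndarray) -> float:
    """P = number of correct responses / total number of responses."""
    return correct.mean()

def item_discrimination(correct: np.ndarray, total_scores: np.ndarray,
                        top_frac: float = 0.27) -> float:
    """D = proportion correct in the upper group minus proportion correct in the lower group."""
    n = len(total_scores)
    k = max(1, int(round(top_frac * n)))
    order = np.argsort(total_scores)   # groups formed from total test scores (internal criterion)
    lower, upper = order[:k], order[-k:]
    return correct[upper].mean() - correct[lower].mean()

# Hypothetical item (1 = correct) for 10 examinees, plus their total test scores.
item = np.array([1, 1, 0, 1, 0, 1, 0, 1, 1, 0])
totals = np.array([38, 35, 12, 30, 15, 40, 10, 28, 33, 18])
print(item_difficulty(item))               # aim for roughly .30 to .70
print(item_discrimination(item, totals))   # closer to +1 is better
```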
considerations for cognitive item analysis pt 2
Item Discrimination (Option 2)
Item-Total Test Score Correlation
Biserial/Point-Biserial (correlations with dichotomous variables)
Distractor Analysis
Multiple choice items
Do distractors attract more of the lower level than upper level group?
Decisions on analysis will vary between norm or criterion-referenced tests (norm - more discrimination with good distractors; criterion - less)
Corrections for Guessing
Corrected score = R (# correct) - W (# incorrect) / [C (# of choices) - 1]
Goal: reduce error, measure something more accurately
Evidence is not consistent to support this.
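A minimal sketch of the guessing correction above (corrected score = R - W / (C - 1)); the numbers are illustrative:

```python
def corrected_for_guessing(num_correct: int, num_incorrect: int, num_choices: int) -> float:
    """Corrected score = R - W / (C - 1); omitted items count as neither correct nor incorrect."""
    return num_correct - num_incorrect / (num_choices - 1)

# Hypothetical 40-item multiple-choice test with 4 options per item:
# 28 correct, 8 incorrect, 4 omitted.
print(corrected_for_guessing(28, 8, 4))  # 28 - 8/3 = 25.33
```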
Non-Cognitive Item Analysis summary
Multiple tests with multiple samples to get your scale right…
Item descriptives, variance
Reliability
Validation
Generalizability
A good rule of thumb is to try to make your measure fail.
Balance statistics with theory/pragmatic considerations in analysis.
cognitive item analysis summary
Item Difficulty
Item Discrimination
Performance of Distractors in Performance Tests
item difficulty
Proportion of correct responses/total number of responses
- Range ~ .30 to .70
formula: P = # of correct responses / # of total responses
item discrimination
Degree to which an item distinguishes high-scoring from low-scoring test-takers on the overall measure. Ranges from –1 to +1; higher positive = better.
Distractor analysis
For multiple-choice items, checking how well incorrect options attract lower-ability respondents and are avoided by higher-ability ones.
correction for guessing
Formula adjustment to reduce inflated scores from random guessing (e.g., R – W / [C – 1]); rarely used in counseling tests but conceptually reduces error.
item-total correlations
Correlation between each item and the total test score (excluding that item). Indicates how well an item fits the construct.
point biserial/biserial correlations
Relationship between a continuous variable (e.g., total score) and a dichotomous item (right / wrong). Point-biserial = observed dichotomy; biserial = assumed underlying continuity.
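A minimal sketch of a point-biserial item-total correlation using SciPy, assuming a dichotomous (0/1) item and continuous total scores; the data are illustrative:

```python
import numpy as np
from scipy import stats

# Hypothetical dichotomous item (1 = correct) and total test scores.
item = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])
totals = np.array([34, 18, 30, 28, 15, 36, 20, 27, 16, 31])

r_pb, p_value = stats.pointbiserialr(item, totals)
print(round(r_pb, 2), round(p_value, 3))  # a higher positive r_pb means the item tracks overall ability
```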
range
Difference between the highest and lowest score in a distribution.
skewness
Degree of asymmetry in score distribution. Positive = tail on right (few high scores); negative = tail on left (few low scores).
kurtosis
“Peakedness” of distribution.
Leptokurtic = tall/narrow (scores cluster near mean)
Platykurtic = flat/wide (scores spread out)
standard deviation
Average distance of scores from the mean; indicates variability.
variance
SD squared; measure of total spread of scores.
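A minimal sketch computing the descriptives defined in the cards above (range, standard deviation, variance, skewness, kurtosis) with NumPy/SciPy; the scores are illustrative:

```python
import numpy as np
from scipy import stats

scores = np.array([12, 15, 14, 18, 21, 9, 16, 17, 13, 22, 11, 19], dtype=float)

print("range:", scores.max() - scores.min())
print("mean:", scores.mean())
print("sd:", scores.std(ddof=1))             # sample standard deviation
print("variance:", scores.var(ddof=1))       # SD squared
print("skewness:", stats.skew(scores))       # positive = tail on the right
print("kurtosis:", stats.kurtosis(scores))   # excess kurtosis: > 0 leptokurtic, < 0 platykurtic
```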
Guttman scale
Hierarchical—endorsing a strong item implies endorsement of easier ones.
thurstone scale
Experts rank items from least → most favorable; weighted by judge agreement.
Likert scale
Summated-rating scale using multiple agree–disagree items averaged or summed for a total score.
item-remainder coefficient
Correlation between an item and the total score minus that item; identifies how much each item contributes to internal consistency.
internal consistency
Degree to which all items on a scale measure the same underlying construct at one time (e.g., Cronbach’s α, McDonald’s ω); a form of reliability.
convergent validity
Evidence that a test correlates strongly with other measures of the same or related construct.
divergent validity
Evidence that a test shows low correlations with measures of different or unrelated constructs.
concurrent validity
Form of criterion validity where test scores relate to an external criterion measured at the same time (e.g., current GPA ↔ aptitude test).
criterion-related validity
Extent to which test scores predict or correspond with an external performance criterion.
Includes predictive (future) and concurrent (present).
factor analysis
Statistical method identifying clusters of items (factors) that represent underlying dimensions of a construct.
EFA = explore; CFA = confirm.
cronbach/coefficient alpha
Index of internal consistency; estimates average inter-item correlation adjusted for item count. ≥ .70 = acceptable for research; ≥ .90 = clinical use.
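For reference, this is the standard form of coefficient alpha for a k-item scale, where the item variances are summed in the numerator and the variance of the total score sits in the denominator (the same quantity the alpha sketch earlier in this set computes):

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right)
```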
compiling norms
Creating reference data (means, SDs, percentiles, T-scores) from large representative samples so individual scores can be interpreted relative to others.
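A minimal sketch of turning a raw score into norm-referenced values (z-score, T-score with mean 50 and SD 10, percentile rank), assuming a normative sample is already in hand; the sample and raw score are illustrative:

```python
import numpy as np
from scipy import stats

# Hypothetical normative sample of raw scores.
norm_sample = np.array([22, 30, 27, 35, 25, 28, 31, 24, 29, 33], dtype=float)
mean, sd = norm_sample.mean(), norm_sample.std(ddof=1)

raw = 32.0                                        # an individual's raw score
z = (raw - mean) / sd                             # standard score
t = 50 + 10 * z                                   # T-score: mean 50, SD 10
pct = stats.percentileofscore(norm_sample, raw)   # percentile rank within the norm group
print(round(z, 2), round(t, 1), pct)
```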
Key 15 steps of measurement design
1- state the purpose of the scale
2- identify and define the domain of the construct to be measured
3- determine whether a measure already exists
4- determine item format
5- develop a test blueprint or test objectives
6- create initial item pool
7- conduct the initial item review (and revisions)
8- conduct a large scale field test of items
9- analyze items
10- revise items
11- calculate reliability
12- conduct a second field test of items
13- repeat steps 8-11 if needed
14- conduct validation studies
15- prepare guidelines for administration
key characteristics of scales
Unidimensional – Items measure one main construct.
Reliable – Produces consistent results (low random error).
Valid – Accurately measures what it claims to.
Clear and specific – Avoids vague, double-barreled, or biased wording.
Appropriate reading level – Understandable to the target population.
Balanced – Includes both positively and negatively worded items.
analyses involved in measurement design
Analysis Type | Purpose |
---|---|
Descriptive Statistics | Understand means, SDs, skewness, kurtosis of item scores. |
Item Analysis | Identify poor items based on difficulty, discrimination, or item-total correlation. |
Reliability Analysis | Estimate internal consistency (Cronbach’s alpha or McDonald’s Omega). |
Exploratory Factor Analysis (EFA) | Discover how many dimensions underlie the data (see the sketch after this table). |
Confirmatory Factor Analysis (CFA) | Confirm the factor structure in a new sample. |
Criterion/Construct Validity Correlations | Check that scores relate appropriately to other variables. |
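A minimal sketch of one common way to decide how many factors to retain in EFA (eigenvalues of the item correlation matrix and Kaiser's eigenvalue-greater-than-1 rule); the simulated data are illustrative, and a real analysis would use a dedicated EFA routine with rotation:

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
f2 = rng.normal(size=200)
noise = 0.5 * rng.normal(size=(200, 6))
# Hypothetical 6-item scale: items 1-3 load on one factor, items 4-6 on another.
X = np.column_stack([f1, f1, f1, f2, f2, f2]) + noise

corr = np.corrcoef(X, rowvar=False)            # item correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # sorted largest first
print(np.round(eigenvalues, 2))
print("factors retained by the Kaiser rule:", int((eigenvalues > 1).sum()))
```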
key tips for creating good items
Keep items simple, short, and clear.
Avoid “and/or” statements (each item should measure one thing).
Keep the reading level accessible.
Use positively worded items but include some reverse items.
Make items present- or future-oriented.
Avoid double negatives or jargon.
Ensure response options are consistent and evenly spaced (for Likert scales).
potential problems for reliability and validity
Problem | Effect |
---|---|
Social desirability bias | People answer in ways that make them look good. |
Acquiescence bias | Participants agree with everything. |
Random responding | Low reliability; participants not paying attention. |
Restricted range | Reduces correlations and reliability estimates. |
Cultural bias | Items may not be relevant or fair across groups. |
Ambiguous wording | Increases measurement error. |
research design in terms of sampling, data analysis
Multiple samples help verify reliability and validity.
Sample 1: Pilot → Item and reliability analysis.
Sample 2: Replication → Factor analysis and consistency check.
Sample 3: Validity → Convergent/divergent and predictive validity.
Sample 4: Test–retest reliability or measurement invariance (across groups).