Psychometrics Midterm 2

38 Terms

1

Assumptions of CTT

-Assumes true score is constant

-Assumes all error is random and not systematic (errors are uncorrelated, and with repeated testing the mean of the errors would be 0)
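
In symbols, a standard statement of the CTT model behind these assumptions (conventional notation, not from the card itself):

X = T + E, \quad \mathbb{E}(E) = 0, \quad \operatorname{Cov}(T, E) = 0

where X is the observed score, T the true score, and E random error.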

2

Test Models (Parallel to Congeneric)

-Parallel - same true score, error variance, difficulty, factor loadings, and item means; the most strict

-Tau-equivalent - same true score, but error variances differ, so measurement precision differs

-Essentially tau-equivalent - true scores differ by a systematic constant; error variances differ; same discriminating power (loadings) but different difficulty (means)

-Congeneric - least restrictive; allows even the factor loadings to differ (see the equations after this list)
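
In conventional notation (a standard sketch for test parts i and j sharing true score T):

-Parallel: X_i = T + E_i, \quad \sigma^2_{E_i} = \sigma^2_{E_j}

-Tau-equivalent: X_i = T + E_i (error variances may differ)

-Essentially tau-equivalent: X_i = a_i + T + E_i (means/intercepts may differ)

-Congeneric: X_i = a_i + b_i T + E_i (loadings b_i may differ too)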

3

Equal true score variance

When the true score variance is equal across tests, as required for parallel tests. True score variance reflects genuine, stable differences in the construct.

4

Reliability formula through variance

True score variance (genuine variance in the construct) divided by observed score variance (the sum of true score variance and error variance)
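
As an equation (standard CTT notation):

r_{XX} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}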

5

Reliability as proportion

Proportion of observed variance that is due to true variance instead of error variance. Signal over noise.

6

Reliability as percentage

True variance accounts for __% of the observed variance, leaving __% of the observed score variance due to error variance.

7

Cronbach’s alpha measures

average inter-item correlation (covariance) adjusted for test length. The average of how the items correlate with each other.
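
The standard computational formula (not stated on the card; k = number of items, \sigma^2_i = variance of item i, \sigma^2_X = variance of the total score):

\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_i}{\sigma^2_X}\right)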

8

Cronbach’s alpha interpretation of values and when to use

Used for essentially tau-equivalent and higher (same T, same factor loadings (strength), but different means (difficulty) and different error variance). Need higher reliability for individual decisions than for research.

-Under .70 - concerning for research, unacceptable for individual decisions

-.70-.80 - acceptable for research

-.80-.90 - Good

-Over .90 - Excellent, required for high-stakes individual decisions (diagnoses, etc.)

9

Improving reliability strategies (4 strategies, plus what to prioritize)

Spearman-Brown - when a test is split in half, this corrects the underestimated reliability of the half-length test (formulas after this list). However, it assumes parallel halves (really rare), is inefficient, and is old school; it has been replaced by Cronbach's alpha

Remove poor items - if they have low item-total correlation, it isn’t measuring the same construct as the other

Replace weak items with better items - get items that avoid bias or misinterpretation and load more strongly with higher total and inter-item correlations

Write items that better capture the construct - allows for the observed score to get closer to the true score

(In general, increasing reliability through prioritizing stronger average inter-item correlation is more important, even if it shortens the test)
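
The Spearman-Brown formulas referenced in the first strategy (standard forms): for a split-half correlation r_{hh}, the full-test estimate is r_{XX} = \frac{2 r_{hh}}{1 + r_{hh}}; more generally, lengthening a test by a factor of n gives r_{new} = \frac{n r}{1 + (n - 1) r}.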

10

“Corrected” item total correlation

How well an item correlates with the total of all other items (controlling for itself). Helps detect items that don't belong with the others. A higher item-total correlation means the item correlates well with the rest of the items.

11

Interrater Reliability Statistics (3 types). ICC or alpha?

Intraclass Correlation Coefficient (ICC) - continuous ratings from multiple raters. Most flexible and uses ANOVA methods to compare between-group variance to total variance.

Cohen's Kappa - categorical ratings from two raters; the best option for categories, but easily affected by prevalence and bias (formula after this list). It is the proportion of potential non-chance agreement (it controls for chance agreement), so you can have a low kappa while having a high percentage agreement.

Percentage agreement - simple, but doesn’t account for chance because it is the strict percentage of agreement/identical results.

ICC over alpha because alpha measures internal consistency (correlation), so you can have a high alpha while raters' scores are vastly apart in absolute terms (but still correlated).
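
Cohen's kappa, referenced above, in its standard form (p_o = observed agreement, p_e = chance-expected agreement):

\kappa = \frac{p_o - p_e}{1 - p_e}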

12

r

A correlation coefficient (such as item-total correlation). The closer to |1|, the stronger the relationship.

-Under .20 is poor

-.20-.30 is small

-.30-.40 is good

-Over .40 is very good

13

EFA vs CFA

CFA is where you already have a structure and fit your items to it, whereas EFA discovers the underlying structure/dimensions in the items

14

Factor loading

How strongly an item correlates with the underlying factor found among the items.

15

3 approaches to deciding number of factors

Kaiser criterion - extract any factor with an eigenvalue greater than 1; simple, but may over-extract. (An eigenvalue is the sum of the squared factor loadings across items; a squared factor loading gives the % of variance in an item explained by the factor, so an eigenvalue greater than 1 means the factor explains more than one complete item's worth of variance.)

Scree plot - look for the elbow where eigenvalues level off (anything to the left/above)

Parallel analysis - most sophisticated; compare the data to random data and retain factors whose eigenvalues exceed the eigenvalues of random data due to chance alone (signal over noise; see the sketch below)
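
A minimal sketch of parallel analysis in Python (the function name, iteration count, and mean-eigenvalue threshold are illustrative choices, not from the card; assumes a cases-by-items numpy array):

import numpy as np

def parallel_analysis(data, n_iter=100, seed=0):
    # Eigenvalues of the observed correlation matrix, largest first.
    obs_eig = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    n, k = data.shape
    rng = np.random.default_rng(seed)
    rand_eig = np.empty((n_iter, k))
    for i in range(n_iter):
        # Random data of the same shape: its eigenvalues reflect chance alone.
        rand = rng.standard_normal((n, k))
        rand_eig[i] = np.linalg.eigvalsh(np.corrcoef(rand, rowvar=False))[::-1]
    # Keep factors whose observed eigenvalue beats the average chance eigenvalue
    # (some implementations use the 95th percentile instead of the mean).
    return int(np.sum(obs_eig > rand_eig.mean(axis=0)))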

16

Rotation Methods

Orthogonal rotation - Varimax, forces factors to be uncorrelated

Oblique rotation - Promax, allows factors to correlate. Promax uses pattern matrices (the unique relationship between each item and a factor, controlling for the other factors) and/or structure matrices (simple correlations, with no such control).

Use oblique for factors expected to relate with each other (as most psychological constructs do)

17

Factor loadings (ideal, problems)

-|.40| is the cutoff for substantial. 

-Simple structure is ideal where an item loads strongly on only one factor (instead of multiple factors)

-Cross-loadings - when an item loads greater than |.40| on multiple factors, usually problematic

18

KMO (what it is, what it is used for, the scoring)

The size of observed correlations compared to the size of partial correlations (formula after the scoring list).

Tells us whether the data are suitable for an EFA.

Scoring

-Marvelous - over .90

-Meritorious - .80-.89

-Middling - .70-.79

-Mediocre - .60-.69

-Miserable - under .60, not suitable for factor analysis as the data/variables aren’t very strongly correlated
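
The KMO statistic in its standard form (r_{ij} = observed correlations, p_{ij} = partial correlations, summed over all pairs i \ne j):

KMO = \frac{\sum_{i \ne j} r_{ij}^2}{\sum_{i \ne j} r_{ij}^2 + \sum_{i \ne j} p_{ij}^2}

Small partial correlations push KMO toward 1 (good); large ones push it toward 0.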

19

Cognitive item difficulty

p-value - the proportion answering correctly (so a higher difficulty value (p) means an easier item)

Optimal difficulty depends on the purpose, but .50 maximizes discrimination (so you typically look for a range between .30 and .70)

Very easy items (p greater than .80) or very hard items (p under .20) provide little discrimination or information about individual differences

20

Cognitive item discrimination (3 points)

A discrimination index closer to 1 means the high scorers got it right while low scorers got it wrong; -1 would be the reverse (one common formula is given below).

Point-biserial correlation - measures the relationship between one dichotomous (binary) variable and one continuous variable.

In contrast, biserial correlation looks at the relationship between an artificially dichotomous variable and a continuous variable.
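
One common way to compute the discrimination index from the first point (the upper-lower index; the top/bottom 27% grouping is a common convention, not stated on the card):

D = p_{upper} - p_{lower}

where p_{upper} and p_{lower} are the proportions answering the item correctly in the high- and low-scoring groups.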

21

Distractor warning signs

It is chosen by more high-scorers than low-scorers (may be ambiguous or scored wrong)

It is chosen by almost no one (implausible; it isn't doing its job of distracting)

22

What statistics are used for noncognitive item analysis

Mean - shows central tendency, alerts for ceiling/floor effects

Standard deviation - shows variability, to check whether there is adequate spread to discriminate between individuals (on a 5-point Likert scale, an SD above roughly 1 is generally desired)

Corrected item-total correlation (r_it)

23

Corrected item-total correlation

How well an item correlates with the rest of the items (controlling for itself; leaving the item in the total would inflate the correlation, since the item correlates perfectly with itself)

Generally want r>.30

Low values (such as under .20) suggest that the item is measuring a different construct than the rest of the test
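
In symbols (standard definition): r_{i(t-i)} = \operatorname{corr}\left(X_i, \sum_{j \ne i} X_j\right), the correlation between item i and the sum of all the other items.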

24

Cognitive item analysis vs noncognitive item analysis

While cognitive can look at difficulty (since there is a right answer), noncognitive analysis focuses more on variability and consistency across items

25

Old view of validity

There are 3 different types of validity (content, criterion, construct). The issue is that these are all essentially construct validity, and researchers tried to checkmark each "type" of validity instead of assessing overall validity.

26

New view of validity (4 points)

Unified view. Validity is one broad concept about whether the inferences drawn from a score are justified

The same test can have valid interpretations for one purpose but invalid ones for another (which is why we now look at the validity of the inferences rather than the validity of the test).

Validity is never fully “established” - it is an ongoing process of accumulating evidence

5 approaches to validity (Content, response processes, internal structure, relations to other variables, consequences)

27

Content approach to validity (question it addresses, how to assess)

Do items represent the construct domain adequately?

Assess through expert judgement on item relevance and representativeness. Assess test blueprint/specifications.

28

Response processes approach to validity (question it addresses, how to assess)

Do respondents engage with items as intended?

Use think-aloud protocols, cognitive interviews, analysis of response patterns

29

Internal structure approach to validity (question it addresses, how to assess)

Does the pattern of item relationships match theory?

Use factor analysis (items fall under the hypothesized factors), item intercorrelations (did the items correlate as expected), test dimensionality (did it show uni- vs. multidimensionality as expected), and item difficulty and discrimination indices

30

Relations to other variables approach to validity (question it addresses, how to assess)

Do the scores relate to other measures as expected?

Convergent (correlates with theoretically related constructs) and discriminant (does not correlate with theoretically unrelated constructs) evidence; both can be assessed with the MTMM (multitrait-multimethod) matrix, which separates trait variance from method variance. Criterion evidence (does it predict relevant outcomes?)

31

Consequences approach to validity (question it addresses)

Does score interpretation lead to appropriate decisions?

Are there adverse impacts on particular groups?

32

Reliability vs Validity and their relationship

Reliability - consistency, would you get the same score again?

Validity - accuracy/meaning, does the score mean what you interpret it as?

You can be reliably invalid/wrong (it measures consistently, but it isn’t measuring what you intended it to)

33

Kane’s argument-based approach to building a logical argument for validity

  1. Identify each inference (an untested assumption of how a piece of evidence is linked to a broader interpretation about a test taker) in the interpretive chain

  2. Gather evidence to support each inference

  3. Identify potential threats to each inference

  4. Address threats through design or acknowledge the limitations

Validating score interpretation - get evidence that the way a score is being interpreted is accurate and justified

34

Deliberate practice features (FIERCCE)

Guided by expert Feedback

Focused on Improving specific aspects of performance

Not inherently Enjoyable (it is work)

Requires full Concentration and effort

Involves stepping outside of the Comfort zone

It is not simply Experience or repetition

35

10,000 hr rule

Ericsson. The 10,000 hours was an average found in his data, not a law, since improvement depends on the type of practice. You can practice something that long, but if the practice isn't deliberate, then improvement may not occur (you may remain at a good-enough plateau within your comfort zone of skill level)

36

Kind vs Wicked Learning Environments

Kind - clear rules, immediate and accurate feedback, reliable patterns (you know what to expect and what is expected of you). Better for deliberate practice.

Wicked - unclear rules, delayed or misleading feedback, unpredictable patterns

37

Gladwell vs Henderson vs Pink vs Goodhart

Gladwell - Tortoise and the Hare argument: a timed test measures knowledge AND speed.

Henderson - different students succeed under different or no time constraints (so time pressure changes what is being measured).

Pink - French class phenomenon: good scores on measures intended to capture understanding/fluency do not necessarily mean true understanding/fluency of the concept.

Goodhart - when a measure becomes high-stakes, it no longer measures what it was intended to measure.

38

Prevalence effect

Kappa values are reduced (and thus misleading) when a particular outcome has high prevalence, even when observed agreement is high. This happens because kappa is chance-corrected: when the prevalence of one outcome is high, the expected chance agreement is also high, and so kappa is reduced.
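
A worked example with illustrative numbers: if raters agree on 90% of cases (p_o = .90) but one outcome is so prevalent that chance agreement is p_e = .85, then \kappa = (.90 - .85) / (1 - .85) \approx .33, despite the high raw agreement.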