Assumptions of CTT
-Assumes the true score is constant across administrations
-Assumes all error is random, not systematic (errors are uncorrelated with true scores and with each other, and with repeated testing the mean of the errors would be 0)
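In standard CTT notation (the symbols are the usual ones, not from the cards themselves):

$$X = T + E, \qquad E(E) = 0, \qquad \rho_{TE} = 0, \qquad \rho_{E_1 E_2} = 0$$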
Test Models (Parallel to Congeneric)
-Parallel - same true scores, error variances, factor loadings, and item means (difficulty); the most strict
-Tau-equivalent - same true scores but different error variances, so measurement precision differs across tests
-Essentially tau-equivalent - true scores differ only by an additive constant; error variances can differ; same discriminating power (loadings) but different difficulty (item means)
-Congeneric - least restrictive, allows different factor loadings (as well as different means and error variances)
Equal true score variance
When the variance of true scores is equal across test forms (as with parallel tests). True score variance reflects genuine, stable differences in the construct.
Reliability formula through variance
True score variance (genuine variance in the construct) / observed score variance (the sum of true score variance and error variance); formula sketch below
Reliability as proportion
Proportion of observed variance that is due to true variance instead of error variance. Signal over noise.
Reliability as percentage
True variance accounts for __% of the observed variance, leaving __% of the observed score variance due to error variance.
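In symbols (standard notation), with made-up numbers to make the percentage reading concrete:

$$r_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}, \qquad \text{e.g. } \frac{80}{80 + 20} = .80$$

so true variance would account for 80% of the observed variance, leaving 20% due to error.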
Cronbach’s alpha measures
average inter-item correlation (covariance) adjusted for test length. The average of how the items correlate with each other.
Cronbach’s alpha interpretation of values and when to use
Used for essentially tau-equivalent models and higher (true scores equal up to an additive constant, same factor loadings (strength), but different item means (difficulty) and different error variances). Need higher reliability for individual decisions than for research (a computation sketch follows the value guide below).
-Under .70 - concerning for research, unacceptable for individual decisions
-.70-.80 - acceptable for research
-.80-.90 - Good
-Over .90 - Excellent, required for high-stakes individual decisions (diagnoses, etc.)
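A minimal computational sketch of alpha (assumes NumPy and made-up Likert data; cronbach_alpha is an illustrative helper, not a named library function):

```python
import numpy as np

def cronbach_alpha(items):
    """Alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Made-up 5-respondent x 4-item Likert matrix, just to show the call
scores = np.array([[4, 5, 4, 5],
                   [2, 3, 2, 2],
                   [5, 5, 4, 4],
                   [1, 2, 1, 2],
                   [3, 3, 4, 3]])
print(round(cronbach_alpha(scores), 2))
```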
Improving reliability strategies (4 strategies, plus what to prioritize)
Spearman-Brown - corrects the underestimate of reliability you get from correlating two halves of a test, since each half is only half as long as the full test (formula sketched after this list). However, it assumes parallel halves (rare), is inefficient, and is old school; it has largely been replaced by Cronbach’s alpha
Remove poor items - if they have low item-total correlation, it isn’t measuring the same construct as the other
Replace weak items with better items - get items that avoid bias or misinterpretation and load more strongly with higher total and inter-item correlations
Write items that better capture the construct - allows for the observed score to get closer to the true score
(In general, prioritize a stronger average inter-item correlation over test length; a shorter test with better items can be more reliable)
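The Spearman-Brown prophecy formula (standard form; the .60 split-half correlation is just an example value):

$$r_{kk} = \frac{k \, r_{11}}{1 + (k - 1)\, r_{11}}, \qquad \text{split-half } (k = 2): \ \frac{2(.60)}{1 + .60} = .75$$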
“Corrected” item total correlation
How well an item correlates with the total of all other items (controlling for itself). Helps detect items that don’t belong with the others. A higher item-total correlation means the item correlates well with the rest of the items.
Interrater Reliability Statistics (3 types). ICC or alpha?
Intraclass Correlation Coefficient (ICC) - continuous ratings from multiple raters. Most flexible and uses ANOVA methods to compare between-group variance to total variance.
Cohen’s Kappa - categorical ratings from two raters; the standard choice for that case, but it is easily affected by prevalence and bias. Proportion of potential non-chance agreement (so it controls for chance agreement). You can have a low kappa while having a high percentage agreement (see the sketch after this card).
Percentage agreement - simple, but doesn’t account for chance because it is the strict percentage of agreement/identical results.
ICC over alpha because alpha indexes internal consistency (relative consistency), so you can have a high alpha while raters’ scores are far apart in absolute terms (but still correlated); ICC can capture absolute agreement.
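A minimal sketch contrasting percentage agreement with kappa (assumes NumPy; the ratings and helper names are made up for illustration):

```python
import numpy as np

def percent_agreement(r1, r2):
    """Proportion of cases where the two raters give identical ratings."""
    return np.mean(np.asarray(r1) == np.asarray(r2))

def cohens_kappa(r1, r2):
    """Kappa = (p_o - p_e) / (1 - p_e); chance agreement p_e comes from each rater's marginals."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    p_o = np.mean(r1 == r2)
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in np.union1d(r1, r2))
    return (p_o - p_e) / (1 - p_e)

rater1 = ["yes", "yes", "no", "yes", "yes", "yes", "no", "yes"]
rater2 = ["yes", "yes", "yes", "yes", "yes", "yes", "no", "yes"]
print(percent_agreement(rater1, rater2))        # 0.875 agreement
print(round(cohens_kappa(rater1, rater2), 2))   # 0.6, lower once chance agreement is removed
```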
r
A correlation coefficient (such as item-total correlation). The closer to |1|, the stronger the relationship.
-Under .20 is poor
-.20-.30 is small
-.30-.40 is good
-Over .40 is very good
EFA vs CFA
CFA tests whether your items fit a structure you specify in advance, whereas EFA discovers the underlying structure/dimensions in the items
Factor loading
How strongly an item correlates with an underlying factor identified among the items.
3 approaches to deciding number of factors
Kaiser criterion - extract any factor with an eigenvalue greater than 1. An eigenvalue is the sum of squared factor loadings across items (a squared loading is the proportion of that item’s variance explained by the factor), so an eigenvalue over 1 means the factor explains more than one complete item’s worth of variance. Simple, but may over-extract
Scree plot - look for the elbow where eigenvalues level off (anything to the left/above)
Parallel analysis - most sophisticated, compare the data to random data and maintain factors that have eigenvalues greater than the eigenvalues of random data that is just due to chance (signal over noise)
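A minimal sketch of the Kaiser criterion and parallel analysis on made-up single-factor data (assumes NumPy; helper names and the mean-eigenvalue benchmark are illustrative choices, some versions use the 95th percentile instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def eigenvalues(data):
    """Descending eigenvalues of the item correlation matrix."""
    return np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

def random_benchmark(n, p, n_sims=100):
    """Average eigenvalues from random data of the same shape (eigenvalues due to chance alone)."""
    return np.mean([eigenvalues(rng.normal(size=(n, p))) for _ in range(n_sims)], axis=0)

# Made-up data: one common factor driving 6 items, 200 respondents
factor = rng.normal(size=(200, 1))
items = 0.7 * factor + rng.normal(size=(200, 6))

real = eigenvalues(items)
benchmark = random_benchmark(*items.shape)

print("Kaiser keeps:", int((real > 1).sum()))
# Parallel analysis: retain leading factors until one falls below the chance benchmark
n_keep = 0
for r, b in zip(real, benchmark):
    if r <= b:
        break
    n_keep += 1
print("Parallel analysis keeps:", n_keep)
```

With this setup both methods should point to a single factor.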
Rotation Methods
Orthogonal rotation - Varimax, forces factors to be uncorrelated
Oblique rotation - Promax, allows factors to correlate. Oblique solutions give a pattern matrix (the unique relationship between each item and a factor, controlling for the other factors) and a structure matrix (simple item-factor correlations, with no such control).
Use oblique for factors expected to relate with each other (as most psychological constructs do)
Factor loadings (ideal, problems)
-|.40| is the cutoff for substantial.
-Simple structure is ideal where an item loads strongly on only one factor (instead of multiple factors)
-Cross-loadings - when an item loads greater than |.40| on multiple factors, usually problematic
KMO (what it is, what it is used for, the scoring)
The size of the observed correlations compared to the size of the partial correlations.
Tells us whether the data are suitable for an EFA (formula sketched after the scoring guide below)
Scoring
-Marvelous - over .90
-Meritorious - .80-.89
-Middling - .70-.79
-Mediocre - .60-.69
-Miserable - under .60, not suitable for factor analysis as the data/variables aren’t very strongly correlated
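The usual KMO formula (standard notation), where r are the observed correlations and p the partial correlations; large partial correlations drag KMO down:

$$\text{KMO} = \frac{\sum_{i \neq j} r_{ij}^2}{\sum_{i \neq j} r_{ij}^2 + \sum_{i \neq j} p_{ij}^2}$$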
Cognitive item difficulty
p-value - the proportion answering correctly (so a higher difficulty value (p) means an easier item)
Optimal difficulty depends on the purpose, but .50 maximizes discrimination (so you typically look for a range between .30 and .70; a computation sketch follows the discrimination card below)
Very easy (greater than .80) or very hard (under .20) provide little discrimination information or information about individual differences
Cognitive item discrimination (3 points)
A discrimination index closer to 1 means the high scorers got it right while low scorers got it wrong. (-1 would be the reverse)
Point-biserial correlation - measures the relationship between one dichotomous (binary) variable and one continuous variable.
In contrast, biserial correlation looks at the relationship between an artificially dichotomous variable and a continuous variable.
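A minimal sketch of difficulty (p), the upper-lower discrimination index, and the point-biserial on made-up 0/1 data (assumes NumPy; the 27% split and all names are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up 0/1 response matrix: 50 examinees x 5 items of varying easiness
responses = (rng.random((50, 5)) < np.linspace(0.3, 0.8, 5)).astype(int)
total = responses.sum(axis=1)

# Difficulty: p = proportion answering each item correctly (higher p = easier item)
p_values = responses.mean(axis=0)

# Discrimination index: proportion correct in the top 27% of total scores minus the bottom 27%
hi = responses[total >= np.quantile(total, 0.73)].mean(axis=0)
lo = responses[total <= np.quantile(total, 0.27)].mean(axis=0)
disc_index = hi - lo

# Point-biserial: Pearson correlation of a 0/1 item with the (continuous) total score
# (uncorrected here; subtract the item from the total for the corrected version)
r_pb = np.corrcoef(responses[:, 0], total)[0, 1]

print(np.round(p_values, 2), np.round(disc_index, 2), round(r_pb, 2))
```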
Distractor warning signs
It is chosen by more high-scorers than low-scorers (may be ambiguous or scored wrong)
It is chosen by almost no one (implausible; it isn’t doing its job (distracting))
What statistics are used for noncognitive item analysis
Mean - shows central tendency, alerts for ceiling/floor effects
Standard deviation - shows variability to see if there is an adequate spread to be able to discriminate between individuals (in a 5-point Likert scale, generally want something over 1?)
Corrected item-total correlation (r_it)
Corrected item-total correlation
How well an item correlates with the rest of the items (with the item itself removed from the total, since leaving it in would inflate the correlation because the item would partly correlate with itself); a computation sketch follows this card
Generally want r > .30
Low values (such as under .20) suggest that the item is measuring a different construct than the rest of the test
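A minimal sketch of the corrected item-total correlation on made-up Likert data (assumes NumPy; the helper name is illustrative):

```python
import numpy as np

def corrected_item_total(items):
    """Correlation of each item with the sum of the OTHER items (the item is removed from the total)."""
    total = items.sum(axis=1)
    return np.array([np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                     for j in range(items.shape[1])])

# Made-up 6-respondent x 4-item Likert matrix; item 4 runs against the other three
likert = np.array([[4, 5, 4, 2],
                   [2, 2, 3, 5],
                   [5, 4, 5, 1],
                   [1, 2, 1, 4],
                   [3, 3, 4, 3],
                   [4, 4, 5, 2]])
print(np.round(corrected_item_total(likert), 2))  # item 4 should show a low (here negative) value
```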
Cognitive item analysis vs noncognitive item analysis
While cognitive can look at difficulty (since there is a right answer), noncognitive analysis focuses more on variability and consistency across items
Old view of validity
There are 3 different types of validity (content, criterion, construct). The issue is that they all essentially reduce to construct validity, and researchers tried to checkmark each “type” of validity instead of assessing overall validity.
New view of validity (4 points)
Unified view. Validity is one broad concept about whether the inferences drawn from a score are justified
The same test can have valid interpretations for one purpose but invalid ones for another (which is why we now look at the validity of the inferences rather than the validity of the test).
Validity is never fully “established” - it is an ongoing process of accumulating evidence
5 approaches to validity (Content, response processes, internal structure, relations to other variables, consequences)
Content approach to validity (question it addresses, how to assess)
Do items represent the construct domain adequately?
Assess through expert judgement on item relevance and representativeness. Assess test blueprint/specifications.
Response processes approach to validity (question it addresses, how to assess)
Do respondents engage with items as intended?
Use think-aloud protocols, cognitive interviews, analysis of response patterns
Internal structure approach to validity (question it addresses, how to assess)
Does the pattern of item relationships match theory?
Use factor analysis (do items fall under the hypothesized factors), item intercorrelations (did the items correlate as expected), test dimensionality (did it show uni- vs. multi-dimensionality as expected), and item difficulty and discrimination indices
Relations to other variables approach to validity (question it addresses, how to assess)
Do the scores relate to other measures as expected?
Convergent (correlates with theoretically related constructs) and discriminant (doesn’t correlate with theoretically unrelated constructs) evidence; both can be assessed with the MTMM (multitrait-multimethod matrix), which separates trait variance from method variance. Criterion evidence (does it predict relevant outcomes)
Consequences approach to validity (question it addresses)
Does score interpretation lead to appropriate decisions?
Are there adverse impacts on particular groups?
Reliability vs Validity and their relationship
Reliability - consistency, would you get the same score again?
Validity - accuracy/meaning, does the score mean what you interpret it as?
You can be reliably invalid/wrong (it measures consistently, but it isn’t measuring what you intended it to)
Kane’s argument-based approach to building a logical argument for validity
Identify each inference (an untested assumption of how a piece of evidence is linked to a broader interpretation about a test taker) in the interpretive chain
Gather evidence to support each inference
Identify potential threats to each inference
Address threats through design or acknowledge the limitations
Validating score interpretation - get evidence that the way a score is being interpreted is accurate and justified
Deliberate practice features (FIERCCE)
Guided by expert Feedback
Focused on Improving specific aspects of performance
Not inherently Enjoyable (it is work)
Requires full Concentration and effort
Involves stepping outside of the Comfort zone
It is not simply Experience or repetition
10,000 hr rule
Ericsson. 10,000 hours was the average found, not a rule; it depends on the type of practice. You can practice something that long, but if the practice isn’t deliberate, then improvement may not occur (you may remain on a good-enough plateau inside the comfort zone of your skill level)
Kind vs Wicked Learning Environments
Kind - clear rules, immediate and accurate feedback, reliable patterns (you know what to expect and what is expected of you). Better for deliberate practice.
Wicked - unclear rules, delayed or misleading feedback, unpredictable patterns
Gladwell vs Henderson vs Pink vs Goodhart
Tortoise and the Hare argument. Timed test measures knowledge AND speed.
Different students succeed under different or no time constraints (so time pressure changes what is being measured)
French class phenomenon. Good scores on measures intended to measure understanding/fluency do not necessarily mean true understanding/fluency of the concept
When a measure becomes high-stakes, it no longer measures what it was intended to measure
Prevalence effect
Kappa values are reduced (and thus misleading) when a particular outcome has high prevalence, even when observed agreement is high. Because kappa is chance-corrected, high prevalence of one outcome means the expected chance agreement is also high, and thus kappa is reduced
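A made-up 2x2 example: two raters judge 100 cases, agreeing “yes” on 90, agreeing “no” on 2, and splitting the 8 disagreements evenly:

$$p_o = .92, \qquad p_e = (.94)(.94) + (.06)(.06) \approx .887, \qquad \kappa = \frac{.92 - .887}{1 - .887} \approx .29$$

Agreement is 92%, but because “yes” is so prevalent, chance agreement is already about 89%, which leaves kappa low.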