Reliability
Refers to precision in measurement
Is determined by the consistency of scores obtained by the same persons on equivalent/parallel tests
Error
Is inevitable; it is the difference between the observed and true scores due to test limitations
Psychological Traits
Are abstract; measured using imperfect tools that may over- or underestimate them
Psychologists must
Evaluate how much error exists in their measurement tools
Using unreliable tools
Risks flawed understandings of behaviour
Spearman (1904)
Combined sampling error & correlation to develop reliability theory
Classical Test Theory
Observed score (X) = true score (T) + random error (E)
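As a compact sketch in standard CTT notation (the zero-mean, uncorrelated-error assumptions come from the Error (E) card later in this set):

```latex
% Classical Test Theory decomposition of an observed score:
X = T + E, \qquad \mathbb{E}[E] = 0, \qquad \operatorname{Cov}(T, E) = 0
% Hence observed-score variance splits into true-score and error variance:
\sigma_X^2 = \sigma_T^2 + \sigma_E^2
```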
Random Error
Causes variability in repeated test scores, producing a normal distribution around the true score
Greater dispersion
Less reliability
Narrow dispersion
more accurate representation of ‘true’ ability
Domain Sampling Method
A technique used in test construction where multiple items are drawn from a larger domain to better estimate a person's true ability.
As test length (number of items) increases
sampling error decreases and reliability increases
Repeated random sampling
yields normally distributed estimates of the true score
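A quick simulation sketch of the domain sampling idea (all numbers hypothetical): longer tests yield tighter, roughly normal distributions of score estimates around the true score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item domain: one person's outcomes on 100,000 possible
# items, with a true proportion-correct of 0.70.
domain = rng.binomial(1, 0.70, size=100_000)
true_score = domain.mean()

for k in (10, 40, 160):
    # Draw many random k-item tests from the domain and score each one.
    estimates = [rng.choice(domain, size=k, replace=False).mean()
                 for _ in range(2_000)]
    print(f"{k:>3} items: mean estimate = {np.mean(estimates):.3f}, "
          f"SD of estimates = {np.std(estimates):.3f}")

# The SD of the estimates (the sampling error) shrinks as k grows,
# while the mean stays near the true score.
```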
Item Response Theory
Modern alternative to Classical Test Theory that improves measurement precision. Adapts item difficulty based on the individual's responses (adaptive testing), focusing testing around the individual's actual ability level for greater accuracy. Leads to shorter tests with higher reliability than classical methods.
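One common IRT model (the specific model isn't named on the card) is the 2-parameter logistic, shown here as an illustrative sketch:

```latex
% 2PL model: probability that a person with ability \theta answers
% item i correctly, given difficulty b_i and discrimination a_i:
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
```

Roughly, adaptive testing repeatedly picks the next item whose difficulty is near the current estimate of the test-taker's ability.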
Reliability Coefficient
Ratio of variance of true scores to variance of observed scores. Tells us what proportion of the test variance is non-error.
Reliability coefficient of .75
75% of variance in test scores is due to true differences in ability & 25% of the variance in test scores is due to error.
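In symbols (standard notation, following the variance decomposition above):

```latex
% Reliability as the proportion of observed variance that is true variance:
r_{XX} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
% e.g. r_{XX} = .75 means \sigma_T^2 = .75\,\sigma_X^2 and \sigma_E^2 = .25\,\sigma_X^2
```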
Time Sampling Error
The error that occurs when test scores are influenced by the particular moment in time when the testing is conducted, potentially affecting the measurement of a person’s true ability.
Test-Retest Reliability
Extent to which scores can be generalised and remain unchanged over time when measuring stable constructs (e.g., personality).
Test-Retest Reliability Coefficient
Correlation between scores obtained on identical tests administered on separate occasions.
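A minimal sketch of the computation (the scores below are made up for illustration):

```python
import numpy as np

# Hypothetical scores for the same eight people on two administrations
# of the same test, several weeks apart.
time1 = np.array([23, 31, 27, 35, 19, 28, 33, 25])
time2 = np.array([25, 30, 26, 36, 21, 27, 31, 26])

# Test-retest reliability = Pearson correlation across the two occasions.
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r_tt:.2f}")
```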
Test-retest correlations _______________ as inter-test interval lengthens.
tend to decrease
Inter-test Interval for Test-Retest Reliability
Should not be too long, or trait being measured is likely to undergo real change
Longer Inter-test Intervals
Introduce other influencing factors
Test-Retest Reliability is only applicable to ____________
stable traits (e.g., intelligence) not changing states (e.g., mood)
Test-Retest Reliability primarily addresses error due to __________
temporary changes in the test taker, for example illness, tiredness, emotional problems, effects of medication, etc.
Test-Retest Reliability can also be influenced by
Error due to test administration and scoring/interpretation
Test-retest Reliability does not
account for error due to variation in test content, since the same test is used
A test-retest reliability limitation
It is often a nuisance to obtain test-retest data, since the same test-takers must be tested twice
In test-retest reliability, performance on the first test may ____________
influence performance on the second test; for example, practice may produce different degrees of improvement in retest scores.
Alternate Forms Reliability
Two equivalent forms with different items but same selection rules are used to calculate reliability.
Alternate Forms Reliability ensures
test scores aren’t dependent on a specific set of items from the domain (reducing item sampling error)
Alternate Forms Reliability Coefficient
Correlations between scores obtained on two equivalent test forms
Immediate Succession Alternate Forms Reliability
Primarily addresses unreliability due to content sampling
Inter-test Interval Alternate Forms Reliability
Addresses unreliability due to content sampling & variations due to temporary changes in test-taker.
Alternate forms are __________
Not used frequently because most tests don’t have alternate forms
Inter-Scorer Reliability
Degree of agreement or consistency between two or more scorers or raters
Inter-Scorer Reliability provides information about ___________
unsystematic error arising from variation in scoring & interpretation BUT not any other source of error
Inter-Scorer Reliability is important when _________
judgement enters the scoring process for example in projective personality tests
Internal Consistency
Extent to which items measure the same underlying construct.
Internal Consistency determined by
examining the relationships among items on one test at a single point in time. If the items measure the same construct, they should correlate with each other.
Split-Half Method
Measure of internal consistency involving correlating one half of a test with the other half (random split or odd-even).
Split-Half Method correlation
Underestimates reliability, as each half is shorter and thus less reliable
Spearman-Brown Formula
Takes the split-half correlation as input and converts it to an estimate of the reliability of the full-length test.
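The standard Spearman-Brown formulas (half-length correction and the general prophecy form):

```latex
% Correct a split-half correlation up to full test length:
r_{\text{full}} = \frac{2\,r_{\text{half}}}{1 + r_{\text{half}}}
% General form for a test lengthened by a factor of n:
r_n = \frac{n\,r}{1 + (n-1)\,r}
```

For example, a split-half correlation of .60 corrects to 2(.60)/(1 + .60) = .75 for the full-length test.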
Kuder-Richardson Formula
Method used to assess the internal consistency of a measure based on dichotomous data (right or wrong).
Kuder-Richardson (KR-20) Formula
Gives a coefficient for any test equal to the average of all possible split-half coefficients, which makes it robust; high item covariance increases reliability.
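The KR-20 formula itself (standard form, not spelled out on the cards):

```latex
% k = number of items, p_i = proportion passing item i, q_i = 1 - p_i,
% \sigma_X^2 = variance of total test scores:
r_{KR20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2}\right)
```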
Cronbach’s Alpha
Method used to assess internal consistency of a measure — generalises KR20 to apply to non-dichotomous items (e.g., Likert scales)
For alpha to be meaningful
Tests should be built to assess a single domain/trait
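A minimal sketch of computing alpha from raw item scores (the data and function name are illustrative, not from any particular library):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point Likert responses: 6 respondents x 4 items.
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 2, 1],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```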
Various measures of internal consistency assess unreliability due to ______
content sampling
Tests can be developed to have high internal consistency by
having items with highly similar content; however, item sampling may then be so narrow that the test becomes trivial
.90s
high reliability (any higher may indicate items are too similar!)
.80s
moderate to high reliability
.70s
low to moderate reliability (must be at least this value!)
.60s
unacceptably low reliability
___________ have higher reliabilities (.90s)
Cognitive ability tests
______________ have second highest reliabilities (.80s)
Self-report tests of personality
Research requires ___________ alpha level
.70–.80 is acceptable
Clinical decision-making settings require _______ alpha level
equal to or greater than .90 (a must)
Reliability can be improved by _________ number of items
increasing
Reliability can be improved by _______ items that reduce reliability
discarding
Reliability problems can also be addressed by estimating what the correlation would be without __________
measurement error (correction for attenuation)
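Assuming the card refers to the classic correction for attenuation, the formula is:

```latex
% Estimated correlation between true scores, given the observed
% correlation r_{xy} and the two tests' reliabilities r_{xx}, r_{yy}:
\hat{r} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
```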
A test may yield scores that can be reliably used in some situations ______________
but not in others
Homogenous items
Measure one factor and are appropriate for internal consistency estimates
Heterogenous items
Measure a range of factors and are therefore not appropriate for internal consistency
Dynamic traits
Traits such as mood fluctuate, so test-retest reliability is not appropriate
Static traits
Traits such as intelligence remain stable over time and are thus appropriate for test-retest reliability
Range Restriction
Reliability decreases when variance of true scores decreases
Range Inflation
Reliability increases when variance of true scores increases
Criterion-referenced test
Evaluates whether a specific criterion (e.g., pass/fail) has been achieved.
Reduces true score variability, thereby lowering reliability—even if individual performance is stable.
Reliability becomes less critical when the test is used for prediction.
Speed test
time limited; focus is on speed rather than difficulty
Items are interdependent → Internal consistency inappropriate.
Test-retest reliability is appropriate.
Power test
Focuses on difficulty (untimed).
Treated like regular tests for reliability (internal consistency, test-retest, etc.).
Standard Error of Measurement (SEM)
Estimates how repeated measures of a person on the same instrument tend to be distributed around their “true” score
Purpose of Standard Error of Measurement (SEM)
Evaluates the precision of an individual’s observed score as an estimate of their true score (vs. reliability coefficients, which assess overall test quality).
how much an observed score might deviate from the true score due to unsystematic error.
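The usual computing formula (standard CTT result, assuming the test's SD and reliability are known):

```latex
% SEM from the test's standard deviation and reliability coefficient:
SEM = \sigma_X \sqrt{1 - r_{XX}}
```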
Reliability
Quality of a ruler
SEM
How precise one measurement is with that ruler
Small SEM
High precision → Observed scores are close to the true score (high confidence).
Large SEM
Low precision → More error, less confidence in observed scores.
±1 SEM
~68% of scores
±1.96 SEM
~95% of scores
±2.58 SEM
~99% of scores
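A worked example with hypothetical numbers (an IQ-style test with SD 15 and reliability .89):

```latex
SEM = 15\sqrt{1 - .89} \approx 4.97 \approx 5
% 95\% band around an observed score of 110:
110 \pm 1.96 \times 5 \;\approx\; [100.2,\ 119.8]
```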
True Score (T)
The theoretical, unchanging value of a trait (e.g., IQ, extroversion) for an individual. Never directly known but inferred through repeated testing (mean of observed scores).
If someone takes a test infinitely, their average score = true score.
Error (E)
Unsystematic factors (e.g., mood, guessing, distractions). Causes observed scores to vary around the true score in repeated testing. It is random (mean = 0) and uncorrelated with true scores.
Observed Score Variability
Each test is a "sample" of possible items/trials; scores naturally fluctuate. Repeated testing produces a range of observed scores around T (like a sampling distribution).
Practical Implications of CTT
The goal is to minimize error (E) so that observed scores (X) are closer to true scores (T).
Use reliability coefficients (e.g., Cronbach’s α) and SEM to quantify error
Confidence intervals (e.g., "IQ = 110 ± 5") reflect CTT’s error model.
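Putting the pieces together, a small sketch (hypothetical numbers; the function names are mine, not from any library):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def score_band(observed: float, sd: float, reliability: float,
               z: float = 1.96) -> tuple[float, float]:
    """Observed score +/- z SEMs (95% band when z = 1.96)."""
    e = z * sem(sd, reliability)
    return observed - e, observed + e

# IQ-style test: observed score 110, SD 15, reliability .89
print(score_band(110, 15, 0.89))         # ~ (100.2, 119.8), 95% band
print(score_band(110, 15, 0.89, z=1.0))  # ~ (105.0, 115.0), i.e. 110 +/- 5
```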