Reliability
Precision =
Estimated reliability
Consistency of test scores across a sample
Consistency of test scores across time
Trueness
Estimated validity
Identify relations with other constructs (theoretical or empirical)
Model fit (theoretical models and measurement models)
Measurement error = all factors associated with the process of measuring some variable, other than the variable being measured
Random error = a source of error in measuring caused by fluctuations and inconsistency of other variables in the measurement process
Systematic error = a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be a true value of the variable measured
Reliability
Measure of consistency
With the same measures and circumstances the results should be similar
Results in reliable and replicable studies
Psychological measures = assess psychological differences, hinge on ability to do so accurately
Classical test theory
Measurement theory that defines the conceptual basis for reliability
Holds that, for a reliable test, differences in test scores reflect actual differences in true levels of the attribute (e.g. attitudes) rather than measurement error
Came from the physical sciences: each result has two components, true score and error
Assumptions = observed score = true score + measurement error; true scores and errors are uncorrelated; error occurs as if it were random; errors will cancel out
With measurement comes unreliability
Reliability is a test property that derives from observed scores (values obtained from the measurement of some characteristic in a person), true scores (real amounts of that characteristic) and measurement error
Error = the true score is impossible to detect directly (it is the mean score from an infinite number of administrations of the test, a hypothetical construct); true scores are obscured by errors; error occurs as if random
Observed score = true score + error
Error occurs as if random, so effects are independent of true values, no correlation between true and error scores
Random error
Observed score can be higher or lower than the true score across multiple tests, so errors are expected to cancel out in the long run (mean error of 0)
Sources of random error = candidate-related, procedural, environmental, test-related
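The classical test theory assumptions above can be illustrated with a small simulation. This is only a sketch: the score distributions (true scores with mean 100 and SD 15, errors with mean 0 and SD 5), the sample size and the helper pearson_r are assumptions made for illustration, and the ratio var(true) / var(observed) is the standard CTT definition of reliability, which these notes do not state explicitly.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical true scores and purely random errors (assumed distributions)
true_scores = [rng.gauss(100, 15) for _ in range(n)]
errors = [rng.gauss(0, 5) for _ in range(n)]

# Observed score = true score + error
observed = [t + e for t, e in zip(true_scores, errors)]

mean_error = statistics.fmean(errors)            # should be close to 0
r_true_error = pearson_r(true_scores, errors)    # should be close to 0
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)

print(f"mean error: {mean_error:.3f}")
print(f"r(true, error): {r_true_error:.3f}")
print(f"var(true) / var(observed): {reliability:.3f}")
```

With these assumed SDs the expected variance ratio is 225 / (225 + 25) = 0.9; the simulated value will fluctuate around that figure.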
Reliability = testing consistency
Methods for assessing reliability =
Multiple administration = test-retest reliability, alternate forms reliability
Single administration = split half reliability, Cronbach's alpha
Multiple judges = inter-rater reliability
Test retest reliability =
Same group tested twice and results are correlated using Pearson's r (measure of correlation)
Each individual will get different results from another
If the scale is reliable, each individual and the group should get similar results at time 1 and time 2
Assesses consistency over time
Problems = practice effects, participant attrition, maturation effects and participant memory
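A minimal sketch of the test-retest calculation described above: the same group is scored twice and the two sets of scores are correlated with Pearson's r. The scores for the eight people are invented for illustration, and pearson_r is a hypothetical helper written out by hand so the example needs only the standard library.

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


# Invented scores for the same eight people at two administrations
time_1 = [12, 15, 9, 20, 17, 11, 14, 18]
time_2 = [13, 14, 10, 19, 18, 10, 15, 17]

test_retest_r = pearson_r(time_1, time_2)
print(f"test-retest r = {test_retest_r:.2f}")
```

Individuals differ from one another, but each person keeps roughly the same rank at both times, which is what a high r reflects. The same calculation applies to alternate forms reliability, with form A and form B scores in place of time 1 and time 2.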
Alternate forms reliability
Two independent versions of measure used
Both versions are given consecutively to the same group and the scores are compared with Pearson's r
Alternate forms reliability gives a measure of consistency
Alternate forms are less vulnerable to practice effects than test-retest
But can we ever know if two alternate tests are truly parallel?
Inter-scorer (inter-rater) reliability
Degree of agreement between two or more scorers regarding a measure
Used in observational studies and measures the degree of consensus between observers
Affected by the degree of objectivity in the measurement system (it shouldn't matter who gives the measure)
Low inter-rater reliability can be due to a defective scale or insufficient training
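One way to put a number on the degree of agreement between two scorers is sketched below. The notes do not name a statistic, so both choices here are assumptions: simple percent agreement, and Cohen's kappa, which corrects that agreement for the amount expected by chance. The ratings are invented.

```python
from collections import Counter


def percent_agreement(rater_1, rater_2):
    """Proportion of cases on which the two raters give the same code."""
    return sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)


def cohens_kappa(rater_1, rater_2):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) if chance agreement p_e equals 1,
    e.g. when both raters use a single category for every case.
    """
    n = len(rater_1)
    p_o = percent_agreement(rater_1, rater_2)
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    # Chance agreement: product of the raters' marginal proportions per category
    p_e = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Invented codes from two observers rating the same ten behaviours
rater_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
rater_b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than raw agreement here because two raters who each say "yes" 60% of the time would agree fairly often by chance alone.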
Internal consistency
Measures how related the items on a test are to one another
Items that measure a similar attribute are related to each other, so the test is internally consistent
Split half reliability
Scores on one half of the survey are correlated with scores on the other half
Can compare the first half with the second half, or odd-numbered items with even-numbered ones
Measures internal consistency
Addresses fatigue and memory effects and time factor
Reliability depends on number of items
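The odd-even split described above can be sketched as follows. The response matrix (six respondents, six items) is invented, and pearson_r is a hypothetical helper. Because reliability depends on the number of items, the correlation between two half-length tests understates the reliability of the full test; the Spearman-Brown correction 2r / (1 + r) used here is the standard adjustment, though the notes do not mention it by name.

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


# Invented data: one row per respondent, one column per item (1-5 ratings)
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 4, 5],
    [1, 2, 1, 2, 2, 1],
    [3, 4, 3, 3, 4, 4],
]

# Items 1, 3, 5 form one half; items 2, 4, 6 form the other
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

r_half = pearson_r(odd_half, even_half)

# Spearman-Brown correction: estimate for the full-length test
r_full = 2 * r_half / (1 + r_half)

print(f"half-test correlation = {r_half:.3f}")
print(f"corrected split-half reliability = {r_full:.3f}")
```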
Cronbach's alpha
Conceptually equivalent to computing the split-half reliability for every possible split of the items and taking the average
Value of 0.7 or higher is conventionally considered acceptable
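In practice alpha is not computed by enumerating every split. A minimal sketch using the usual variance formula, alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), is given below; the formula is standard but is not written out in these notes, and the response matrix is invented.

```python
import statistics


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # transpose: one tuple per item
    item_variances = [statistics.variance(item) for item in items]
    total_variance = statistics.variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Invented data: one row per respondent, one column per item (1-5 ratings)
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 4, 5],
    [1, 2, 1, 2, 2, 1],
    [3, 4, 3, 3, 4, 4],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.3f}")
print("acceptable by the 0.7 rule of thumb" if alpha >= 0.7 else "below 0.7")
```

The invented items move together almost perfectly, so alpha comes out far higher than is typical of real scales.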
Reliability issues
Research measures don't have to be as reliable as those used in clinical practice (a large sample size evens out individually unreliable scores, whereas individuals' lives can be strongly affected by the results of clinical measures)
Sometimes it is better for a scale to have low internal consistency (example: Cleckley's key characteristics of antisocial personality disorder (psychopathy) include unrelated features; removing items to make the scale more internally consistent would make it less effective at identifying people with high psychopathy scores; the issue is consistency versus unidimensionality)
Reliability estimate = the nature of the test determines which reliability metric is appropriate, e.g. whether:
Test items are homogeneous or heterogeneous in nature
Characteristic, ability, or trait being measured is presumed to be dynamic or static
Range of scores is/is not restricted
Test is a speed or power test
Test is/is not criterion referenced
Reliability issues
Reliability is a matter of degree
Place more reliance on reliability estimates based on large samples
Sample size for calculating reliability should not be below 100
Expect to see test-retest reliability for ability and reasoning measures
Ability, aptitude and IQ measures should have coefficients above 0.8
Personality and other measures usually have coefficients between 0.6 and 0.8, but 0.7 is the recommended minimum
Always look for the reliability coefficient and the size of the sample used to calculate it
Improve reliability
Clear conceptualization
Standardization
Increase number of items
Use more precise measurement
Use multiple indicators
Pilot testing and replication
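The effect of increasing the number of items can be estimated with the Spearman-Brown prophecy formula, sketched below. The formula is standard but is not given in these notes; it assumes the added items are comparable in quality to the existing ones. The starting reliability of 0.60 and both function names are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the test is made length_factor times longer."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach the target reliability."""
    return (target * (1 - reliability)) / (reliability * (1 - target))


current = 0.60  # hypothetical reliability of an existing scale

doubled = spearman_brown(current, 2)
tripled = spearman_brown(current, 3)
factor_for_080 = length_factor_needed(current, 0.80)

print(f"twice as many items:  {doubled:.2f}")
print(f"three times as many:  {tripled:.2f}")
print(f"length factor needed for 0.80: {factor_for_080:.2f}")
```

The gains shrink as the test grows: the step from two to three times the length adds less than the first doubling did.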
Standard error of measurement (SEM)
Measure of precision of an observed test score, estimate of amount of error inherent in an observed score or measurement
High reliability = low standard error; the SEM can be used to estimate the extent to which an observed score deviates from a true score
Confidence interval = range of scores that is likely to contain the true score
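A short worked sketch of the SEM and a confidence interval follows. It uses the standard formula SEM = SD * sqrt(1 - reliability), which these notes do not write out, and a 95% interval of observed score plus or minus 1.96 SEM. The SD of 15, reliability of 0.90 and observed score of 110 are assumed values chosen for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def confidence_interval(observed, sem, z=1.96):
    """Range around the observed score; z = 1.96 gives a 95% interval."""
    return observed - z * sem, observed + z * sem


sd = 15.0           # assumed standard deviation of test scores
reliability = 0.90  # assumed reliability coefficient
observed = 110.0    # one person's observed score

sem = standard_error_of_measurement(sd, reliability)
low, high = confidence_interval(observed, sem)

print(f"SEM = {sem:.2f}")
print(f"95% CI = {low:.1f} to {high:.1f}")
```

With the same SD, a reliability of 0.70 would give a SEM of about 8.2 instead of about 4.7, which shows how lower reliability widens the interval around an observed score.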