Reliability Flashcards
Reliability
Concept of Reliability
Reliability coefficient is an index of reliability.
It indicates the ratio between the true score variance and the total variance.
Estimates in the range of 0.70 to 0.80 are considered good for basic research purposes.
In clinical settings, a reliability of .90+ may be required.
In such settings, one should try to find a test with reliability greater than 0.95.
Classical Test Theory
A test score reflects the testtaker's true score plus error.
The true score for an individual remains constant with repeated applications of the same test.
Error is a component of the observed test score unrelated to the testtaker's ability.
Errors of measurement are random.
Formula: $X = T + E$, where:
$X$ = observed score
$T$ = true score
$E$ = error
Variance and Error
Variance describes test score variability (standard deviation squared).
Variance can be broken into components:
True variance: variance from true differences.
Error variance: variance from irrelevant, random sources.
Basic Sampling Theory
Distribution of random errors is bell-shaped.
The center represents the true score.
Dispersion around the mean displays the distribution of sampling errors.
The true score can be estimated by finding the mean of repeated observations.
Reliability as Proportion of True Variance
Reliability is the proportion of total variance attributed to true variance.
A greater proportion of total variance attributed to true variance indicates a more reliable test.
True differences are stable and yield consistent scores.
Error variance can affect test score consistency and reliability.
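A minimal sketch of the ratio described above, assuming true scores and errors are uncorrelated so that total variance is simply true variance plus error variance. The function name and the numbers are illustrative, not from the source.

```python
def reliability(true_variance: float, error_variance: float) -> float:
    """Proportion of total variance that is true variance."""
    total_variance = true_variance + error_variance
    if total_variance <= 0:
        raise ValueError("total variance must be positive")
    return true_variance / total_variance

# Hypothetical numbers: 80 units of true variance, 20 of error variance.
print(reliability(80.0, 20.0))  # 0.8
```

The more of the total variance that is true variance, the closer the result is to 1.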
Measurement Error
Measurement error includes all factors associated with measuring a variable, other than the variable being measured.
Random error is caused by unpredictable fluctuations and inconsistencies.
It fluctuates from one testing situation to another with no discernible pattern.
Examples include unanticipated events or physical events happening within the testtaker.
Systematic error is constant or proportionate to the true value of the variable.
Systematic error does not affect score consistency.
Example: a weighing scale overweighing by 7 pounds.
Even with this error, the relative standings of those who use the scale remain unaffected.
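The scale example can be checked with a small sketch. The weights are made-up values; the point is only that adding a constant 7 pounds to every reading leaves the rank order untouched.

```python
def rank_order(values):
    """Return the indices of the values, sorted from smallest to largest."""
    return sorted(range(len(values)), key=lambda i: values[i])

true_weights = [120.0, 150.5, 135.2, 180.0]  # hypothetical true weights (lb)
observed = [w + 7 for w in true_weights]     # scale overweighs by 7 lb

same_order = rank_order(true_weights) == rank_order(observed)
print(same_order)  # True
```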
Sources of Error Variance
Test Construction
Item sampling (content sampling) refers to variation among test items.
The extent to which a testtaker’s score is affected by the content sampled.
The test developer must maximize true variance and minimize error variance.
Test administration
Reactions to untoward influences are sources of error variance.
Test environment: room temperature, lighting, ventilation, and noise.
The instrument used to enter responses and the writing surface.
Events of the day.
Testtaker variables
Emotional problems, physical discomfort, lack of sleep, effects of drugs or medication.
Formal learning experiences, life experiences, therapy, illness, changes in mood or mental state.
Examiner-related variables: Examiner's physical appearance and demeanor, presence or absence of an examiner.
Test Scoring and Interpretation
Scorers and scoring systems are potential sources of error variance.
Technical glitches in computer scoring.
Subjectivity in scoring.
Methodological Error
Examples: Interviewers not being trained properly, ambiguous wording, biased items.
Reliability Estimates
Test-Retest Reliability Estimates
Obtained by correlating pairs of scores from the same people on two different administrations of the same test.
Appropriate for tests that measure stable traits.
As the time interval increases, the correlation decreases.
When the interval between testings is greater than six months, the estimate is often called the coefficient of stability.
Evaluation must consider possible intervening factors.
Carryover effect: the first testing session influences scores from the second session.
Test-retest correlation usually overestimates the true reliability when there are carryover effects.
Random carryover effects occur when changes are unpredictable or affect some but not all testtakers.
Practice effects: skills improve with practice.
Testtakers score better because they have sharpened their abilities by having taken the test the first time.
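A test-retest estimate is a Pearson correlation between the two administrations. The sketch below computes it by hand; the score lists are hypothetical and the function name is an assumption for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores of the same five people on two administrations.
time1 = [10, 12, 15, 18, 20]
time2 = [11, 13, 14, 19, 21]
test_retest_r = pearson_r(time1, time2)
print(round(test_retest_r, 2))
```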
Parallel-Forms and Alternate Forms Reliability Estimates
Coefficient of equivalence: degree of relationship between various forms of a test.
Parallel forms: the means and the variances of observed test scores are equal.
The forms use different items, but the rules used to select items of a particular difficulty level are the same.
Parallel-forms reliability: an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
Alternate forms: different versions of a test that have been constructed to be parallel. Although they do not meet the requirements for the legitimate designation “parallel”, they are typically designed to be equivalent with respect to variables such as content and level of difficulty.
Alternate forms are not required to have the same means and variances of observed test scores.
Alternate-forms reliability: an estimate of the extent to which these different forms of the same test have been affected by item sampling error or other error.
Two administrations with the same group are required.
Test scores may be affected by motivation, fatigue, or intervening events.
Item sampling is a source of error variance.
Testtakers may do better or worse due to the items selected.
Split-Half Reliability Estimates
Obtained by correlating two pairs of scores from equivalent halves of a single test administered once.
Steps:
Divide the test into equivalent halves.
Calculate a Pearson r between scores on the two halves of the test.
Adjust the half-test reliability using the Spearman-Brown Formula.
Simply dividing the test in the middle is not recommended.
Acceptable ways to split a test:
Randomly assign items to one or the other half of the test.
Odd-even reliability: assign odd-numbered items to one half and the even-numbered ones to the other.
Divide the test by content so that each half contains items equivalent with respect to content and difficulty.
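The three steps above can be sketched with an odd-even split. This is an illustration under assumptions: the response matrix is invented (rows are testtakers, columns are items scored 0 or 1), and the helper names are not from the source.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def full_length_from_half(r_half):
    """Spearman-Brown adjustment of a half-test correlation."""
    return 2 * r_half / (1 + r_half)

def odd_even_reliability(item_scores):
    # Step 1: split into odd-numbered and even-numbered items.
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    # Step 2: Pearson r between the two half scores.
    r_half = pearson_r(odd, even)
    # Step 3: adjust to full test length.
    return full_length_from_half(r_half)

responses = [  # hypothetical data
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
split_half = odd_even_reliability(responses)
print(round(split_half, 2))
```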
Methods of Estimating Internal Consistency
Inter-Item Consistency
Degree of correlation among all the items on a scale.
Calculated from a single administration of a single form of a test.
Useful in assessing homogeneity.
Homogeneity: Items measure a single trait.
Heterogeneity: Items measure more than one trait.
The more homogeneous a test, the more inter-item consistency it can be expected to have.
Methods:
Spearman-Brown formula: estimates internal consistency reliability from the correlation between two halves of a test.
Also used to estimate the reliability of a test that has been shortened or lengthened.
Usually, but not always, reliability increases as test length increases.
This assumes that the additional test items are equivalent to the original items with respect to content and range of difficulty.
The formula can also be used to estimate the effect of shortening a test on its reliability.
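The general form of the Spearman-Brown formula is r_SB = n·r / (1 + (n − 1)·r), where r is the reliability of the existing test and n is the factor by which its length changes (n = 2 when stepping up from a half test, n below 1 when shortening). A minimal sketch, with an assumed function name:

```python
def spearman_brown(r_xy: float, n: float) -> float:
    """Estimated reliability after changing test length by a factor of n."""
    return n * r_xy / (1 + (n - 1) * r_xy)

# Half-test correlation of .50 stepped up to the full test: about .67.
print(round(spearman_brown(0.5, 2), 2))
# A test with reliability .80 cut to half its length: about .67.
print(round(spearman_brown(0.8, 0.5), 2))
```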
Kuder-Richardson Formula 20 (KR-20)
Statistic of choice for determining the inter-item consistency of dichotomous items.
Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar.
If items are heterogeneous, KR-20 will yield lower reliability estimates than the split-half method.
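Formula: $r_{KR20} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{\sigma^2}\right)$, where:
$k$ = number of test items
$\sigma^2$ = variance of total test scores
$p$ = proportion of testtakers who pass the item
$q$ = proportion of people who fail the item
$\sum pq$ = sum of the pq products over all items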
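A sketch of KR-20, computed as (k / (k − 1)) · (1 − Σpq / σ²) from a matrix of dichotomous (0/1) item scores. The data are hypothetical, and the total-score variance is taken as the population variance (dividing by the number of testtakers) so that it is on the same footing as the pq terms; that choice is an assumption.

```python
def kr20(item_scores):
    """KR-20 for rows of 0/1 item scores (one row per testtaker)."""
    n_people = len(item_scores)
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n_people
    # Population variance of total scores.
    variance = sum((t - mean_total) ** 2 for t in totals) / n_people
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n_people  # proportion passing
        sum_pq += p * (1 - p)                              # q = 1 - p
    return (k / (k - 1)) * (1 - sum_pq / variance)

responses = [  # hypothetical data
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
kr20_estimate = kr20(responses)
print(round(kr20_estimate, 2))
```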
KR-21
Used when there is reason to assume that all the test items have approximately the same difficulty or that the average difficulty level is 50%.
It is used to obtain an approximation of KR-20
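KR-21 needs only the number of items, the mean total score, and the variance of total scores: (k / (k − 1)) · (1 − M(k − M) / (k·σ²)). A minimal sketch with an assumed function name and hypothetical summary statistics:

```python
def kr21(k: int, mean_total: float, variance_total: float) -> float:
    """KR-21 approximation of KR-20 from summary statistics only."""
    return (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * variance_total))

# Hypothetical 6-item test with mean total 3.2 and total-score variance 4.16.
kr21_estimate = kr21(6, 3.2, 4.16)
print(round(kr21_estimate, 2))
```

When item difficulties are not all equal, this approximation comes out lower than KR-20.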
Cronbach’s Coefficient Alpha
Mean of all possible split-half coefficients, each corrected with the Spearman-Brown formula.
Appropriate for use on tests containing nondichotomous items.
Can be used when two halves of a test have unequal variances.
Preferred statistic for obtaining an estimate of internal consistency reliability.
Provides the lowest estimate of reliability that one can expect.
When the variances of the two halves of the test are equal, the Spearman-Brown coefficient and coefficient alpha give the same results.
Requires only one administration of the test.
Calculated to help answer questions about how similar sets of data are.
ranges from 0 to 1
0 = absolutely no similarity
1 = perfectly identical
.70 to .80 are “good enough”
a value of alpha above .90 may be “too high” and indicate redundancy in items
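Coefficient alpha is commonly computed as (k / (k − 1)) · (1 − Σσᵢ² / σ²), where σᵢ² is the variance of each item and σ² the variance of total scores. The sketch below uses population variances throughout and an invented set of ratings on three Likert-type items; both are assumptions made for illustration.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Coefficient alpha for rows of item scores (one row per testtaker)."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

likert = [  # hypothetical ratings
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 3],
]
alpha = cronbach_alpha(likert)
print(round(alpha, 2))
```

With 0/1 items and population variances, this function returns the same value as KR-20.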
Average Proportional Distance (APD)
A measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.
Steps:
Calculate the absolute difference between scores for all of the items
Average the difference between scores
Obtain the APD by dividing the average difference between scores by the number of response options on the test, minus one
General rule of thumb for interpretation:
0.20 or lower is indicative of excellent internal consistency
a value of 0.20 to 0.25 is in the acceptable range
a value above 0.25 is suggestive of problems with the internal consistency of the test
An advantage of APD over Cronbach’s alpha is that APD is not connected to the number of items on a measure, whereas Cronbach’s alpha tends to be higher when a measure has more than 25 items
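The APD steps can be sketched as follows. This is one reading of the steps, taking "all of the items" to mean every pair of items; the item scores and the 7-point response scale are hypothetical.

```python
from itertools import combinations

def average_proportional_distance(item_scores, n_options):
    """APD from a list of item scores and the number of response options."""
    # Step 1: absolute difference between scores for every pair of items.
    diffs = [abs(a - b) for a, b in combinations(item_scores, 2)]
    # Step 2: average the differences.
    mean_diff = sum(diffs) / len(diffs)
    # Step 3: divide by the number of response options minus one.
    return mean_diff / (n_options - 1)

# Hypothetical scores on three items rated on a 7-point scale.
apd = average_proportional_distance([5, 6, 5], n_options=7)
print(round(apd, 2))
```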
Measures of Inter-Scorer Reliability
Degree of agreement or consistency between two or more scorers (or raters) with regard to a particular measure.
Reference to levels of inter-scorer reliability may be published in the test’s manual or elsewhere
If coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training
It is often used when coding nonverbal behavior. Example: depressed mood
the researcher starts by composing a checklist of behaviors that constitute depressed mood
each subject is given a depressed mood score by a rater
having at least one other individual observe and rate the same behaviors safeguards against the ratings being a product of personal bias
if there is a consensus in ratings, the researchers can be more confident regarding the accuracy of the ratings and their conformity with the established rating system
Coefficient of inter-scorer reliability
Calculating the coefficient of correlation between scorers’ ratings
Kappa statistic
Best method for assessing level of agreement among several observers
Cohen’s kappa: a measure of agreement between two judges who each rate a set of objects using nominal scales
Fleiss’ kappa: extends the method to consider the agreement between any number of observers
Kappa
Indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement
ranges between 1 (perfect agreement) and -1 (less agreement than can be expected on the basis of chance alone)
> 0.75 = excellent agreement
0.40 to 0.75 = fair to good (satisfactory) agreement
< 0.40 = poor agreement
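For two raters, kappa is (observed agreement − chance agreement) / (1 − chance agreement). A minimal sketch of Cohen's version, with invented ratings of "dep" (depressed mood) versus "not":

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal categories."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's category proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of six subjects by two raters.
rater_a = ["dep", "dep", "not", "not", "dep", "not"]
rater_b = ["dep", "not", "not", "not", "dep", "not"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))
```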
Using and Interpreting a Coefficient of Reliability
Purpose of Reliability Coefficient
test designed for use at various times: test-retest reliability
test designed for a single administration only: internal consistency
transient error: a source of error attributable to variations in the testtaker’s feelings, moods, or mental state over time
Nature of the Test
Homogeneity vs Heterogeneity of test items
homogeneous items are functionally uniform throughout
tests designed to measure one factor, such as one ability or trait, are expected to be homogeneous in items
expected to have high internal consistency
heterogeneous tests: an internal-consistency estimate may be low relative to a more appropriate estimate of test-retest reliability
Dynamic vs Static characteristics
dynamic: a trait, state, or ability presumed to be ever-changing
best estimate of reliability is internal consistency
test-retest would be of little help
static: a trait, state, or ability that is relatively unchanging
e.g. intelligence
obtained measurement would not be expected to vary significantly as a function of time
either test-retest or alternate-forms would be appropriate
Restriction or Inflation of range
if the variance of either variable in a correlational analysis is restricted by the sampling procedure used, the resulting correlation coefficient tends to be lower
if the variance is inflated by the sampling procedure used, the resulting correlation coefficient tends to be higher
Speed tests vs Power tests
Power Test: the time limit is long enough to allow testtakers to attempt all items, and some of the items are so difficult that no testtaker is able to obtain a perfect score
Speed Test
generally contains items of uniform level of difficulty (typically low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly
the time limit on a speed test is established so that few if any of the testtakers will be able to complete the entire test
reliability estimate should be based on performance from two independent testing periods using one of the following:
test-retest
alternate-forms
split-half from two separately timed half tests: the obtained reliability coefficient is for a half test and should be adjusted using the Spearman-Brown formula
if a speed test is administered once and some measure of internal consistency is calculated (KR-20 or split-half), the result will be a spuriously high reliability coefficient because the difficulty of the items is uniformly low
Summary
test-retest: correlation between total scores on two administrations of the same test
alternate-forms: correlation between total scores on two forms of the test
split-half: correlation between scores on two halves of the test, adjusted using the Spearman-Brown formula to obtain an estimate for the whole test
Sources of Measurement Error and Methods of Reliability Assessment
Time sampling
same test given at different points in time may produce different scores, even if given to the same test takers
assessed using test-retest method
Item sampling
same construct or attribute may be assessed using a wide pool of items
assessed using alternate-forms or parallel-forms
Internal consistency
intercorrelations among items within the same test; if the test is designed to measure a single construct and all items are equally good candidates to measure that attribute, then there should be a high correspondence among items
assessed using split-half, KR-20, or coefficient alpha
Observer differences
different observers recording the same behavior: even though they have the same instructions, different judges observing the same event may record different numbers
assessed using kappa statistic
True Score Model of Measurement and Alternatives
Classical Test Theory (CTT)
everyone has a “true score”: the measure of a trait or ability that the “observed score” would equal if it contained no “error”
a person’s true score on one intelligence test can differ from the same person’s true score on another intelligence test
Domain Sampling Theory
seeks to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score
a test’s reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample
conceptualizes reliability as the ratio of the variance of the observed score on the shorter test and the variance of the long-run true score
the measurement considered is the error introduced by using a sample of items rather than the entire domain
as the sample gets larger, it represents the domain more and more accurately
domain: the universe of items that could conceivably measure a behavior; a hypothetical construct that shares certain characteristics with (and is measured by) the sample of items that make up the test
items in the domain have the same means and variances as those in the test that samples from the domain
internal consistency is the most compatible with this theory
Generalizability Theory
a “universe score” replaces that of a “true score”
based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation
instead of conceiving all variability in a person’s scores as error, test developers are encouraged to describe the details of the particular test situation (universe) leading to a specific test score
universe described in terms of its facets, such as: the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration
according to the generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained
universe score is analogous to a true score in the true score model
the person will ordinarily have a different universe score for each universe
one person’s universe score covering tests on a specific day will not agree perfectly with their score for the whole month
for any measure, there are many “true scores”, each corresponding to a different universe
Note: When we use a single observation as if it represented the universe, we are generalizing. In other words, a true score is a specific score for a specific time, setting, or “universe”.
It becomes generalized when that “true score” is considered a “universe score”, thereby making it representative of a person’s universe. If the observed scores from a procedure agree closely with the universe score, we can say that the observation is accurate, or reliable, or generalizable.
Application of theory: tests should be developed with the aid of a generalizability study followed by a decision study
generalizability study: examines how generalizable scores from a particular test are if the test is administered in different situations
the influence of particular facets on the test score is represented by coefficients of generalizability, which are similar to reliability coefficients in the true score model
decision study: developers examine the usefulness of test scores in helping the test user make decisions; it is designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use
Item Response Theory (IRT)
synonymous with latent-trait theory
provides a way to model the probability that a person with X ability will be able to perform at a level of Y
refers to a family of theories and methods, with many other names used to distinguish specific approaches
Difficulty: the attribute of not being easily accomplished, solved, or comprehended
Discrimination: the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever is being measured
Dichotomous test items: test items that can be answered with only one of two alternative responses
Polytomous test items: test items with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct
Rasch Model: each item on the test is assumed to have an equivalent relationship with the construct being measured by the test
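As an illustration of modeling "the probability that a person with X ability will perform at level Y", the Rasch (one-parameter logistic) model gives the probability of a keyed response from the gap between person ability and item difficulty. The sketch assumes both are expressed on the same logit scale; the function name and values are illustrative.

```python
from math import exp

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct (keyed) response under the Rasch model."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))

# A person whose ability equals the item's difficulty has a 50% chance.
print(rasch_probability(0.0, 0.0))  # 0.5
# Higher ability relative to difficulty raises the probability.
print(round(rasch_probability(2.0, 0.0), 2))
```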
The Standard Error of Measurement (SEM)
provides a measure of the precision of an observed test score.
provides an estimate of the amount of error inherent in an observed score or measurement.
the higher the reliability of a test, the lower the SEM.
allows us to estimate the degree to which a test provides inaccurate readings
because we assume that the distribution of random errors will be the same for all people, CTT uses the standard deviation of errors as the basic measure of error
larger SEM = less accuracy with which an attribute is measured
formula: $\sigma_{meas} = \sigma\sqrt{1 - r_{xx}}$, where:
$\sigma_{meas}$ = standard error of measurement
$\sigma$ = standard deviation of test scores by the group of testtakers
$r_{xx}$ = reliability coefficient of the test
establishes a confidence interval: a range or band of test scores that is likely to contain the true score
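A sketch of the SEM, computed as the standard deviation of test scores times the square root of (1 − reliability), and of a confidence band around an observed score. The standard deviation of 15, reliability of .91, observed score of 100, and the 1.96 multiplier for an approximate 95% band under a normal error distribution are illustrative assumptions.

```python
from math import sqrt

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = sd * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def confidence_interval(observed: float, sem: float, z: float = 1.96):
    """Band around an observed score; z = 1.96 gives roughly 95%."""
    return observed - z * sem, observed + z * sem

sem = standard_error_of_measurement(sd=15, reliability=0.91)  # about 4.5
low, high = confidence_interval(observed=100, sem=sem)
print(round(sem, 2), round(low, 1), round(high, 1))
```

Raising the reliability shrinks the SEM and therefore narrows the band.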
Standard Error of Difference
statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant.
if the probability is more than 5% that the difference occurred by chance, then it is presumed that there was no difference
the standard error of the difference between two scores will be larger than the standard error of measurement for either score alone because the former is affected by measurement error in both scores
recall: $X = T + E$ (observed score = true score + error)
in a difference score, error is expected to be larger than either observed score or true score because E absorbs error from both of the scores used to create the difference score
T might be expected to be smaller than E because whatever is common to both measures is canceled out when the difference score is created
the reliability of a difference score is expected to be lower than the reliability of either score on which it is based
if two tests measure exactly the same trait, then the score representing the difference between them is expected to have a reliability of 0
it is most convenient to find difference scores by first creating Z scores for each measure
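Two commonly used formulas make these points concrete. The standard error of the difference combines the measurement error of both scores, sd · sqrt(2 − r₁ − r₂), assuming both scores are on the same scale with the same standard deviation. The reliability of a difference score between standardized measures is ((r₁₁ + r₂₂) / 2 − r₁₂) / (1 − r₁₂). Function names and numbers below are illustrative.

```python
from math import sqrt

def standard_error_of_difference(sd: float, r1: float, r2: float) -> float:
    """SE of the difference between two scores on the same scale."""
    return sd * sqrt(2 - r1 - r2)

def difference_score_reliability(r11: float, r22: float, r12: float) -> float:
    """Reliability of a difference between two standardized scores.

    r11, r22: reliabilities of the two measures; r12: their correlation.
    """
    return ((r11 + r22) / 2 - r12) / (1 - r12)

se_diff = standard_error_of_difference(sd=15, r1=0.91, r2=0.91)
print(round(se_diff, 2))  # larger than the SEM of either score (about 4.5)
# Reliabilities of .90 and .70 with a correlation of .70: about .33.
print(round(difference_score_reliability(0.90, 0.70, 0.70), 2))
```

When the correlation between the two measures equals their average reliability, the numerator is zero and so is the reliability of the difference score.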
What to Do about Low Reliability
Increase Number of Items
the larger the sample, the more likely that the test will represent the true characteristic
in domain sampling model, the reliability of a test increases as the number of items increases
the Spearman-Brown prophecy formula can help estimate how many items will have to be added in order to bring a test to an acceptable level of reliability
to find the number of items required, multiply the number of items on the current test by N, the lengthening factor given by the formula
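Solving the Spearman-Brown formula for the length factor gives N = r_d(1 − r_o) / (r_o(1 − r_d)), where r_o is the observed reliability and r_d the desired reliability. A sketch with hypothetical values (a 20-item test with reliability .70 and a target of .90):

```python
def length_factor(r_observed: float, r_desired: float) -> float:
    """Factor N by which the test must be lengthened to reach r_desired."""
    return r_desired * (1 - r_observed) / (r_observed * (1 - r_desired))

n_factor = length_factor(r_observed=0.70, r_desired=0.90)
current_items = 20  # hypothetical current test length
print(round(n_factor, 2), round(current_items * n_factor, 1))
```

The estimate assumes the added items are comparable to the existing ones in content and difficulty.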
Factor and Item Analysis
tests are most reliable if they are unidimensional: one factor should account for considerably more of the variance than any other factor
items that do not load on this factor might be best omitted
discriminability analysis: examines the correlation between each item and the total score for the test
when the correlation is low, the item is probably measuring something different from the other items on the test
a low correlation might also mean that the item is so easy or so hard that people do not differ in their response to it
such an item should be excluded, as it drags down the estimate of reliability
Correction for Attenuation
if a test is unreliable, information obtained with it is of little or no value. Thus, we say that potential correlations are attenuated, or diminished, by measurement error
measurement theory allows one to estimate what the correlation between two measures would have been if they had not been measured with error
these methods “correct” for the attenuation in the correlations caused by the measurement error
to correct for attenuation, one needs to know only the reliabilities of the two tests and the correlation between them
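The standard correction divides the observed correlation by the square root of the product of the two reliabilities: r̂₁₂ = r₁₂ / sqrt(r₁₁ · r₂₂). A minimal sketch with hypothetical values:

```python
from math import sqrt

def correct_for_attenuation(r12: float, r11: float, r22: float) -> float:
    """Estimated correlation between two measures if both were error-free.

    r12: observed correlation; r11, r22: reliabilities of the two tests.
    """
    return r12 / sqrt(r11 * r22)

# Observed correlation .34 between tests with reliabilities .61 and .75.
corrected = correct_for_attenuation(0.34, 0.61, 0.75)
print(round(corrected, 2))
```

With perfectly reliable tests (both reliabilities equal to 1), the corrected and observed correlations coincide.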