Reliability Flashcards

Reliability

Concept of Reliability

Reliability coefficient is an index of reliability.
It indicates the ratio between the true score variance and the total variance.
Estimates in the range of 0.70 to 0.80 are considered good for basic research purposes.
In clinical settings, a reliability of .90+ may be required.
When test scores inform important decisions about individuals, a test with reliability greater than 0.95 should be sought.

Classical Test Theory

A test score reflects the testtaker's true score plus error.
The true score for an individual remains constant with repeated applications of the same test.
Error is a component of the observed test score unrelated to the testtaker's ability.
Errors of measurement are random.
Formula: $X = T + E$ , where:
- $X$ = observed score
- $T$ = true score
- $E$ = error

Variance and Error

Variance describes test score variability (standard deviation squared).
Variance can be broken into components:
- True variance: variance from true differences.
- Error variance: variance from irrelevant, random sources.

Basic Sampling Theory

Distribution of random errors is bell-shaped.
The center represents the true score.
Dispersion around the mean displays the distribution of sampling errors.
The true score can be estimated by finding the mean of repeated observations.

Reliability as Proportion of True Variance

Reliability is the proportion of total variance attributed to true variance.
A greater proportion of total variance attributed to true variance indicates a more reliable test.
True differences are stable and yield consistent scores.
Error variance can affect test score consistency and reliability.
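As a worked illustration of the proportion described above, here is a minimal Python sketch. The function name and the variance values are hypothetical, chosen only to show the arithmetic.

```python
def reliability_from_variances(true_var: float, error_var: float) -> float:
    """Reliability = true variance / total variance (true + error)."""
    total_var = true_var + error_var
    if total_var <= 0:
        raise ValueError("total variance must be positive")
    return true_var / total_var


# Hypothetical example: 80 units of true variance and 20 of error variance.
r_xx = reliability_from_variances(80.0, 20.0)
print(round(r_xx, 2))  # 0.8
```

Holding true variance fixed, any increase in error variance lowers the ratio, which is why reducing error variance raises reliability.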

Measurement Error

Measurement error includes all factors associated with measuring a variable, other than the variable being measured.
Random error is caused by unpredictable fluctuations and inconsistencies.
It fluctuates from one testing situation to another with no discernible pattern.
Examples include unanticipated events or physical events happening within the testtaker.
Systematic error is constant or proportionate to the true value of the variable.
Systematic error does not affect score consistency.
Example: a weighing scale overweighing by 7 pounds.
Even with this error, the relative standings of those who use the scale remain unaffected.
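The weighing-scale example can be sketched in a few lines of Python. The four true weights are made up, and the helper `ranks` is illustrative only; the point is that a constant 7-pound error changes every reading but not the rank order.

```python
def ranks(values):
    """Return the rank (1 = smallest) of each value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


true_weights = [120.0, 150.0, 135.0, 180.0]   # hypothetical true weights (lb)
observed = [w + 7.0 for w in true_weights]    # scale overweighs by 7 lb

# Relative standings are identical before and after the systematic error.
same_standing = ranks(true_weights) == ranks(observed)
print(same_standing)  # True
```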

Sources of Error Variance

Test Construction
- Item sampling (content sampling) refers to variation among test items.
- The extent to which a testtaker’s score is affected by the content sampled.
- The test developer must maximize true variance and minimize error variance.
Test administration
- Reactions to untoward influences are sources of error variance.
- Test environment: room temperature, lighting, ventilation, and noise.
- The instrument used to enter responses and the writing surface.
- Events of the day.
Testtaker variables
- Emotional problems, physical discomfort, lack of sleep, effects of drugs or medication.
- Formal learning experiences, life experiences, therapy, illness, changes in mood or mental state.
- Examiner-related variables: Examiner's physical appearance and demeanor, presence or absence of an examiner.
Test Scoring and Interpretation
- Scorers and scoring systems are potential sources of error variance.
- Technical glitches in computer scoring.
- Subjectivity in scoring.
Methodological Error
- Examples: Interviewers not being trained properly, ambiguous wording, biased items.

Reliability Estimates

Test-Retest Reliability Estimates
- Obtained by correlating pairs of scores from the same people on two different administrations of the same test.
- Appropriate for tests that measure stable traits.
- As the time interval increases, the correlation decreases.
- When the interval between administrations is greater than six months, the estimate is often called the coefficient of stability.
- Evaluation must consider possible intervening factors.
- Carryover effect: the first testing session influences scores from the second session.
- Test-retest correlation usually overestimates the true reliability when there are carryover effects.
- Random carryover effects occur when changes are not predictable from earlier scores or when something affects some but not all testtakers.
- Practice effects: skills improve with practice.
- Testtakers score better because they have sharpened their abilities by having taken the test the first time.
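A test-retest estimate is a Pearson correlation between the two sets of scores. The sketch below computes it by hand; the function name `pearson_r` and the two score lists are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of the same five people on two administrations.
time1 = [10, 12, 15, 18, 20]
time2 = [11, 13, 14, 19, 21]
test_retest_r = pearson_r(time1, time2)
print(round(test_retest_r, 2))
```

A value close to 1 suggests that testtakers kept roughly the same relative standing across the two administrations.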

Parallel-Forms and Alternate Forms Reliability Estimates

Coefficient of equivalence: degree of relationship between various forms of a test.
Parallel forms: the means and the variances of observed test scores are equal.
- The forms use different items, but the rules used to select items of a particular difficulty level are the same.
- Parallel-forms reliability: an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
Alternate forms: different versions of a test that have been constructed to be parallel. Although they do not meet the requirements for the legitimate designation “parallel”, they are typically designed to be equivalent with respect to variables such as content and level of difficulty.
- Alternate forms are not required to have the same means and variances of observed test scores.
- Alternate-forms reliability: an estimate of the extent to which these different forms of the same test have been affected by item sampling error or other error.
Two administrations with the same group are required.
Test scores may be affected by motivation, fatigue, or intervening events.
Item sampling is a source of error variance.
Testtakers may do better or worse due to the items selected.

Split-Half Reliability Estimates

Obtained by correlating two pairs of scores from equivalent halves of a single test administered once.
Steps:
- Divide the test into equivalent halves.
- Calculate a Pearson r between scores on the two halves of the test.
- Adjust the half-test reliability using the Spearman-Brown Formula.
- Simply dividing the test in the middle is not recommended.
Acceptable ways to split a test:
- Randomly assign items to one or the other half of the test.
- Odd-even reliability: assign odd-numbered items to one half and the even-numbered ones to the other.
- Divide the test by content so that each half contains items equivalent with respect to content and difficulty
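The three steps above can be sketched in Python using an odd-even split. The data set (five testtakers, six dichotomous items) and the function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def odd_even_split_half(item_scores):
    """item_scores: one list of item scores per testtaker."""
    # Step 1: split into halves (items 1, 3, 5, ... versus items 2, 4, 6, ...).
    odd_half = [sum(row[0::2]) for row in item_scores]
    even_half = [sum(row[1::2]) for row in item_scores]
    # Step 2: correlate the two half-test scores.
    r_half = pearson_r(odd_half, even_half)
    # Step 3: Spearman-Brown adjustment to estimate whole-test reliability.
    return 2 * r_half / (1 + r_half)


# Hypothetical responses (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
split_half_reliability = odd_even_split_half(responses)
print(round(split_half_reliability, 2))
```

The adjusted value is higher than the raw half-test correlation because each half is only half as long as the full test.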

Methods of Estimating Internal Consistency

Inter-Item Consistency
- Degree of correlation among all the items on a scale.
- Calculated from a single administration of a single form of a test.
- Useful in assessing homogeneity.
- Homogeneity: Items measure a single trait.
- Heterogeneity: Items measure more than one trait.
- The more homogeneous a test, the more inter-item consistency it can be expected to have.
Methods:
- Spearman-Brown formula estimates the internal consistency reliability from a correlation of two halves of a test.
- Also used for estimating the reliability of a test that has been shortened or lengthened.
- Usually, but not always, reliability increases as test length increases.
- This assumes that the additional test items are equivalent to the original items with respect to content and range of difficulty.
- The formula can also be used to estimate the effect of shortening a test on its reliability.
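In its general form, the Spearman-Brown formula is r_SB = n·r / (1 + (n − 1)·r), where r is the reliability of the existing test and n is the factor by which its length changes. A minimal sketch follows; the function name and example values are hypothetical.

```python
def spearman_brown(r_xy: float, n: float) -> float:
    """Estimated reliability after changing test length by a factor of n.

    r_xy: reliability (or half-test correlation) of the existing test.
    n:    new length divided by old length (2 doubles, 0.5 halves).
    """
    return n * r_xy / (1 + (n - 1) * r_xy)


# Doubling a test (or stepping up from a half-test correlation of 0.60).
print(round(spearman_brown(0.60, 2), 2))    # 0.75
# Halving a test whose reliability is 0.80.
print(round(spearman_brown(0.80, 0.5), 2))  # 0.67
```

With n = 2 the expression reduces to 2r / (1 + r), the form used to adjust a split-half correlation.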

Kuder-Richardson Formula 20 (KR-20)

Statistic of choice for determining the inter-item consistency of dichotomous items.
Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar.
If items are heterogeneous, KR-20 will yield lower reliability estimates than split-half.
Formula: $r_{KR20} = \left(\frac{N}{N-1}\right)\left(\frac{σ^2 - Σpq}{σ^2}\right)$
- $N$ = number of test items
- $σ^2$ = variance of total test scores
- $p$ = proportion of testtakers who pass the item
- $q$ = proportion of people who fail the item
- $Σpq$ = sum of the pq products over all items
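A minimal Python sketch of the KR-20 computation, term by term as defined above. It assumes the population variance (dividing by the number of testtakers) for σ²; the data set and function name are hypothetical.

```python
def kr20(responses):
    """KR-20 for dichotomous items; responses is one 0/1 list per testtaker."""
    n_people = len(responses)
    n_items = len(responses[0])                      # N
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    # Assumption: population variance of total scores is used for sigma^2.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people  # proportion passing
        q = 1 - p                                        # proportion failing
        sum_pq += p * q
    return (n_items / (n_items - 1)) * ((var_total - sum_pq) / var_total)


# Hypothetical responses of five testtakers to six items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(kr20(responses), 3))  # 0.792
```

Here σ² = 4.0 and Σpq = 1.36, so the estimate is (6/5) × (2.64 / 4.0) = 0.792.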

KR-21

Used when there is reason to assume that all the test items have approximately the same difficulty or that the average difficulty level is 50%.
It is used to obtain an approximation of KR-20
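For reference, the commonly cited form of KR-21 needs only the number of items N, the mean total score M, and the variance of total scores σ²: KR-21 = (N / (N − 1)) × (1 − M(N − M) / (N·σ²)). The sketch below assumes the population variance; the function name and total scores are hypothetical.

```python
def kr21(total_scores, n_items):
    """KR-21 approximation from total scores alone (no item-level data)."""
    n_people = len(total_scores)
    mean_total = sum(total_scores) / n_people
    # Assumption: population variance of total scores.
    var_total = sum((t - mean_total) ** 2 for t in total_scores) / n_people
    return (n_items / (n_items - 1)) * (
        1 - mean_total * (n_items - mean_total) / (n_items * var_total)
    )


# Hypothetical total scores of five testtakers on a six-item test.
totals = [6, 4, 3, 2, 0]
print(round(kr21(totals, 6), 2))  # 0.75
```

When item difficulties are not all equal, this approximation does not exceed the corresponding KR-20 value.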

Cronbach’s Coefficient Alpha

Can be thought of as the mean of all possible split-half coefficients, corrected by the Spearman-Brown formula.
Appropriate for use on tests containing nondichotomous items.
Can be used when two halves of a test have unequal variances.
Preferred statistic for obtaining an estimate of internal consistency reliability.
Provides the lowest estimate of reliability that one can expect.
When the variances of the two halves of the test are equal, the Spearman-Brown coefficient and coefficient alpha give the same results.
Requires only one administration of the test.
Calculated to help answer questions about how similar sets of data are.
Typically ranges from 0 to 1:
- 0 = absolutely no similarity
- 1 = perfectly identical
- .70 to .80 are “good enough”
- a value of alpha above .90 may be “too high” and indicate redundancy in items
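Coefficient alpha is commonly computed as α = (k / (k − 1)) × (1 − Σσᵢ² / σ²), where k is the number of items, σᵢ² is the variance of item i, and σ² is the variance of total scores. The sketch below uses that form with population variances; the rating data and function names are hypothetical.

```python
def population_variance(values):
    """Variance dividing by the number of values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def cronbach_alpha(scores):
    """scores: one list of item scores per respondent."""
    k = len(scores[0])
    item_variances = [
        population_variance([row[j] for row in scores]) for j in range(k)
    ]
    total_variance = population_variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: four respondents, three items on a 5-point scale.
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 2))
```

For items scored 0 or 1, each item variance equals pq, so this computation gives the same value as KR-20.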

Average Proportional Distance (APD)

A measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.
Steps:
- Calculate the absolute difference between scores for all of the items
- Average the difference between scores
- Obtain the APD by dividing the average difference between scores by the number of response options on the test, minus one
General rule of thumb for interpretation:
- a value of 0.20 or lower is indicative of excellent internal consistency
- a value between 0.20 and 0.25 is in the acceptable range
- a value above 0.25 is suggestive of problems with the internal consistency of the test
An advantage of APD over Cronbach’s alpha is that APD is not connected to the number of items on a measure, whereas Cronbach’s alpha tends to be higher when a measure has more than 25 items.
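The three APD steps can be sketched as follows. This sketch assumes that "the absolute difference between scores for all of the items" means the difference for every pair of items answered by one respondent; the function name and scores are hypothetical.

```python
from itertools import combinations


def average_proportional_distance(item_scores, n_options):
    """APD for one respondent's item scores on a scale with n_options points."""
    # Step 1: absolute difference for every pair of items (assumed reading).
    diffs = [abs(a - b) for a, b in combinations(item_scores, 2)]
    # Step 2: average the differences.
    mean_diff = sum(diffs) / len(diffs)
    # Step 3: divide by the number of response options minus one.
    return mean_diff / (n_options - 1)


# Hypothetical: three items on a 7-point scale scored 4, 5, and 6.
apd = average_proportional_distance([4, 5, 6], 7)
print(round(apd, 2))  # 0.22
```

The pairwise differences are 1, 2, and 1, which average to 1.33; dividing by 6 gives about 0.22.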

Measures of Inter-Scorer Reliability

Degree of agreement or consistency between two or more scorers (or raters) with regard to a particular measure.
Reference to levels of inter-scorer reliability may be published in the test’s manual or elsewhere
If coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training
It is often used when coding nonverbal behavior. Example: depressed mood.
- the researcher starts by composing a checklist of behaviors that constitute depressed mood
- a rater gives each subject a depressed mood score, and at least one other individual observes and rates the same behaviors
- this safeguards against the ratings being a product of one rater’s personal bias
- if there is a consensus in ratings, the researchers can be more confident regarding the accuracy of the ratings and their conformity with the established rating system
Coefficient of inter-scorer reliability: obtained by calculating the coefficient of correlation between scorers’ ratings.
Kappa statistic
- Best method for assessing level of agreement among several observers
- Cohen’s kappa: a measure of agreement between two judges who each rate a set of objects using nominal scales
- Fleiss extended the method to consider the agreement between any number of observers
Kappa
- Indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement
- ranges between 1 (perfect agreement) and -1 (less agreement than can be expected on the basis of chance alone)
  - > 0.75 = excellent agreement
  - 0.40 to 0.75 = fair to good (satisfactory) agreement
  - < 0.40 = poor agreement
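Cohen's kappa for two raters is (observed agreement − chance agreement) / (1 − chance agreement). A minimal sketch follows; the function name, category labels, and ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal categories."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must rate the same non-empty set")
    # Proportion of cases on which the raters actually agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten subjects ("dep" = depressed mood observed).
rater_1 = ["dep", "dep", "not", "not", "dep", "not", "dep", "not", "dep", "not"]
rater_2 = ["dep", "dep", "not", "dep", "dep", "not", "not", "not", "dep", "not"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))  # 0.6
```

The raters agree on 8 of 10 cases (0.80) and chance agreement is 0.50, so kappa is 0.30 / 0.50 = 0.60, in the fair-to-good range.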

Using and Interpreting a Coefficient of Reliability

Purpose of Reliability Coefficient
- test designed for use at various times: test-retest
- test designed for a single administration only: internal consistency
- transient error: source of error attributable to variations in the testtaker’s feelings, moods, or mental state over time

Nature of the Test

Homogeneity vs Heterogeneity of test items
- homogeneous items are functionally uniform throughout
  - tests designed to measure one factor, such as one ability or trait, are expected to be homogeneous in items
  - expected to have high internal consistency
- heterogeneous items: an internal consistency estimate may be low relative to a more appropriate estimate of test-retest reliability
Dynamic vs Static characteristics
- dynamic: a trait, state, or ability presumed to be ever-changing
  - best estimate of reliability is internal consistency
  - test-retest would be of little help
- static: a trait, state, or ability that is relatively unchanging
  - e.g. intelligence
  - obtained measurement would not be expected to vary significantly as a function of time
  - either test-retest or alternate-forms would be appropriate

Restriction or Inflation of range

If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, the resulting correlation coefficient tends to be lower.
If the variance is inflated by the sampling procedure used, the resulting correlation coefficient tends to be higher.

Speed tests vs Power tests

Power Test: the time limit is long enough to allow testtakers to attempt all items, and some of the items are so difficult that no testtaker is able to obtain a perfect score.
Speed Test
- generally contains items of uniform level of difficulty (typically low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly
- the time limit on a speed test is established so that few if any of the testtakers will be able to complete the entire test
- reliability estimate should be based on performance from two independent testing periods using one of the following:
  - test-retest
  - alternate-forms
  - split-half from two separately timed half tests; the obtained reliability coefficient is for a half test and should be adjusted using the Spearman-Brown formula
If a speed test is administered once and some measure of internal consistency is calculated (KR-20 or split-half), the result will be a spuriously high reliability coefficient because the difficulty of the items is uniformly low.

Summary

test-retest: correlation between total scores on two administrations of the same test
alternate-forms: correlation between total scores on two forms of the test
split-half: correlation between scores on two halves of the test, adjusted using the Spearman-Brown formula to obtain an estimate for the whole test

Sources of Measurement Error and Methods of Reliability Assessment

Time sampling
- same test given at different points in time may produce different scores, even if given to the same test takers
- assessed using test-retest method
Item sampling
- same construct or attribute may be assessed using a wide pool of items
- assessed using alternate-forms or parallel-forms
Internal consistency
- intercorrelations among items within the same test; if the test is designed to measure a single construct and all items are equally good candidates to measure that attribute, then there should be a high correspondence among items
- assessed using split-half, KR-20, or coefficient alpha
Observer differences
- arise when different observers record the same behavior; even though they have the same instructions, different judges observing the same event may record different numbers
- assessed using kappa statistic

True Score Model of Measurement and Alternatives

Classical Test Theory (CTT)
- everyone has a “true score”: the actual measure of a trait or ability that would be obtained if the “observed score” contained no “error”
- a person’s true score on one intelligence test can vary from the same person’s true score on another intelligence test
Domain Sampling Theory
- seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score
- a test’s reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample
- conceptualizes reliability as the ratio of the variance of the observed score on the shorter test and the variance of the long-run true score
- the measurement error considered is the error introduced by using a sample of items rather than the entire domain
- as the sample gets larger, it represents the domain more and more accurately
- domain: the universe of items that could conceivably measure a behavior; a hypothetical construct that shares certain characteristics with (and is measured by) the sample of items that make up the test
- items in the domain have the same means and variances as those in the test that samples from the domain
- internal consistency is the most compatible with this theory

Generalizability Theory

a “universe score” replaces that of a “true score”
based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation
instead of conceiving all variability in a person’s scores as error, test developers are encouraged to describe the details of the particular test situation (universe) leading to a specific test score
universe described in terms of its facets:
- number of items in the test
- amount of training the test scorers have had
- purpose of the test administration
according to the generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained
universe score is analogous to a true score in the true score model
the person will ordinarily have a different universe score for each universe
- one person’s universe score covering tests on a specific day will not agree perfectly with their score for the whole month
- for any measure, there are many “true scores”, each corresponding to a different universe
Note: When we use a single observation as if it represented the universe, we are generalizing. In other words, a true score is a specific score for a specific time, setting, or “universe”.
It becomes generalized when that “true score” is treated as a “universe score”, making it representative of a person’s universe. If the observed scores from a procedure agree closely with the universe score, we can say that the observation is accurate, or reliable, or generalizable.
Application of theory: tests should be developed with the aid of a generalizability study followed by a decision study
- generalizability study: examines how generalizable scores from a particular test are if the test is administered in different situations
- the influence of particular facets on the test score is represented by coefficients of generalizability, which are similar to reliability coefficients in the true score model
- decision study: developers examine the usefulness of test scores in helping the test user make decisions; it is designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use

Item Response Theory (IRT)

synonymous with latent-trait theory
provides a way to model the probability that a person with X ability will be able to perform at a level of Y
refers not to a single theory or method but to a family of theories and methods, with many other names used to distinguish specific approaches
Difficulty: the attribute of not being easily accomplished, solved, or comprehended
Discrimination: the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever is being measured
Dichotomous test items: test items that can be answered with only one of two alternative responses
Polytomous test items: test items with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct
Rasch Model: each item on the test is assumed to have an equivalent relationship with the construct being measured by the test

The Standard Error of Measurement (SEM)

provides a measure of the precision of an observed test score.
provides an estimate of the amount of error inherent in an observed score or measurement.
the higher the reliability of a test, the lower the SEM.
allows us to estimate the degree to which a test provides inaccurate readings
because we assume that the distribution of random errors will be the same for all people, CTT uses the standard deviation of errors as the basic measure of error
larger SEM = less accuracy with which an attribute is measured
formula:
$σ_{meas} = σ\sqrt{1 - r_{xx}}$
- $σ_{meas}$ = standard error of measurement
- $σ$ = standard deviation of test scores by the group of testtakers
- $r_{xx}$ = reliability coefficient of the test
establishes a confidence interval: a range or band of test scores that is likely to contain the true score
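A short sketch of the SEM formula and the confidence interval built from it. The function names are hypothetical, the example values (standard deviation 15, reliability 0.91, observed score 100) are made up, and the interval assumes normally distributed errors with z = 1.96 for roughly 95% confidence.

```python
from math import sqrt


def standard_error_of_measurement(sd: float, r_xx: float) -> float:
    """SEM = sd * sqrt(1 - r_xx)."""
    if not 0 <= r_xx <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1 - r_xx)


def confidence_interval(observed: float, sem: float, z: float = 1.96):
    """Band around an observed score; z = 1.96 is assumed for about 95%."""
    return (observed - z * sem, observed + z * sem)


sem = standard_error_of_measurement(15.0, 0.91)
low, high = confidence_interval(100.0, sem)
print(round(sem, 2), round(low, 2), round(high, 2))  # 4.5 91.18 108.82
```

Raising the reliability to 0.96 with the same standard deviation shrinks the SEM to 3.0, which narrows the band.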

Standard Error of Difference

statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant.
if the probability is more than 5% that the difference occurred by chance, then it is presumed that there was no difference
the standard error of the difference between two scores will be larger than the standard error of measurement for either score alone because the former is affected by measurement error in both scores
recall: $X = T + E$
in a difference score, error is expected to be larger than either observed score or true score because E absorbs error from both of the scores used to create the difference score
T might be expected to be smaller than E because whatever is common to both measures is canceled out when the difference score is created
the reliability of a difference score is expected to be lower than the reliability of either score on which it is based
if two tests measure exactly the same trait, then the score representing the difference between them is expected to have a reliability of 0
it is most convenient to find difference scores by first creating Z scores for each measure
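Two standard formulas make these points concrete. For two scores on the same scale with standard deviation σ, the standard error of the difference is σ·sqrt(2 − r₁ − r₂). For standardized scores, the reliability of a difference score is (½(r₁₁ + r₂₂) − r₁₂) / (1 − r₁₂). The function names and example values below are hypothetical.

```python
from math import sqrt


def standard_error_of_difference(sd: float, r_1: float, r_2: float) -> float:
    """Combines the measurement error of both scores (same scale assumed)."""
    return sd * sqrt(2 - r_1 - r_2)


def difference_score_reliability(r_11: float, r_22: float, r_12: float) -> float:
    """Reliability of a difference between two standardized (z) scores.

    r_11, r_22: reliabilities of the two measures.
    r_12:       correlation between the two measures.
    """
    return (0.5 * (r_11 + r_22) - r_12) / (1 - r_12)


# z scores (sd = 1) from tests with reliabilities 0.90 and 0.80.
print(round(standard_error_of_difference(1.0, 0.90, 0.80), 3))  # 0.548
# Reliabilities 0.90 and 0.70 with a correlation of 0.70 between the tests.
print(round(difference_score_reliability(0.90, 0.70, 0.70), 2))  # 0.33
```

When the correlation between the two tests equals their reliabilities, the numerator is zero and so is the reliability of the difference score.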

What to Do about Low Reliability

Increase Number of Items
- the larger the sample, the more likely that the test will represent the true characteristic
- in domain sampling model, the reliability of a test increases as the number of items increases
- the Spearman-Brown prophecy formula can help estimate how many items will have to be added in order to bring a test to an acceptable level of reliability
- to find the number of items required, multiply the number of items on the current test by N, the lengthening factor obtained from the prophecy formula
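Solving the Spearman-Brown formula for the lengthening factor gives N = r_d(1 − r_o) / (r_o(1 − r_d)), where r_o is the observed reliability and r_d the desired reliability. A minimal sketch follows; the function name and the 20-item example are hypothetical.

```python
from math import ceil


def length_factor(r_observed: float, r_desired: float) -> float:
    """Factor N by which the current test length must be multiplied."""
    return r_desired * (1 - r_observed) / (r_observed * (1 - r_desired))


# Hypothetical: a 20-item test with reliability 0.87; the target is 0.95.
current_items = 20
n_factor = length_factor(0.87, 0.95)
items_needed = ceil(current_items * n_factor)
print(round(n_factor, 2), items_needed)  # 2.84 57
```

The estimate assumes that the added items are equivalent in content and difficulty to the existing ones.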
Factor and Item Analysis
- tests are most reliable if they are unidimensional: one factor should account for considerably more of the variance than any other factor
- items that do not load on this factor might be best omitted
- discriminability analysis examines the correlation between each item and the total score for the test; when the correlation is low, the item is probably measuring something different from the other items on the test
- a low correlation might also mean that the item is so easy or so hard that people do not differ in their response to it; such an item should be excluded because it drags down the estimate of reliability
Correction for Attenuation
- if a test is unreliable, information obtained with it is of little or no value; potential correlations are said to be attenuated, or diminished, by measurement error
- measurement theory allows one to estimate what the correlation between two measures would have been if they had not been measured with error
- these methods “correct” for the attenuation in the correlations caused by the measurement error
- to correct for attenuation in both measures, one needs to know only the reliabilities of the two tests and the correlation between them; to correct for one unreliable test only, the reliability of that test and the correlation suffice
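The standard correction divides the observed correlation by the square root of the product of the two reliabilities: r̂₁₂ = r₁₂ / sqrt(r₁₁·r₂₂). A minimal sketch follows; the function name and example values are hypothetical, and the default `r_22=1.0` is a convenience for correcting only one measure.

```python
from math import sqrt


def correct_for_attenuation(r_12: float, r_11: float, r_22: float = 1.0) -> float:
    """Estimated correlation if the measures were free of measurement error.

    r_12: observed correlation between the two measures.
    r_11: reliability of the first measure.
    r_22: reliability of the second measure (leave at 1.0 to correct one only).
    """
    return r_12 / sqrt(r_11 * r_22)


# Hypothetical: observed r = 0.40 with reliabilities 0.64 and 0.81.
print(round(correct_for_attenuation(0.40, 0.64, 0.81), 2))  # 0.56
# Correcting for the unreliability of the first measure only.
print(round(correct_for_attenuation(0.40, 0.64), 2))        # 0.5
```

The corrected value is an estimate of what the correlation would be with perfectly reliable measures, not a correlation that was actually observed.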