Reliability
The ‘CONSISTENCY’ in measurement
• The proportion of the total variance attributed to true variance.
• The greater the proportion of the total variance attributed to true variance, the more reliable the test.
Reliability Coefficient
An index of reliability describing the consistency of scores across contexts (e.g., different times, items, or raters).
THE CONCEPT OF RELIABILITY
Classical Test Theory
• A score on an ability test is presumed to reflect not only the test taker’s true score on the ability being measured but also error.
Error
Measurement Error
Variance (σ²)
Error
• The component of the observed test score that does not have to do with the testtaker's ability.
• X = T + E
o X: observed score
o T: true score
o E: error
Measurement Error
• All of the factors associated with the process of measuring some variable, other than the variable being measured.
o e.g., an English-language test on the subject of 12th-grade algebra being administered, in English, to a sample of 12th-grade students newly arrived in the United States from China.
• 1 Random Error
• 2 Systematic Error
• 1 Random Error
o A source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process.
▪ e.g., a lightning strike or a sudden surge in the testtaker's blood pressure.
• 2 Systematic Error
o A source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.
o Does not affect score consistency (though it does affect the accuracy of scores).
Variance (σ²)
• The standard deviation squared; it describes the sources of test-score variability.
o True Variance: Variance from true differences.
o Error Variance: Variance from irrelevant, random sources.
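A minimal Python sketch (with made-up numbers) of the idea that observed-score variance is the sum of true variance and error variance, and that reliability is the proportion of total variance attributable to true variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation of classical test theory: X = T + E
true_scores = rng.normal(loc=50, scale=10, size=10_000)   # T (true scores)
errors = rng.normal(loc=0, scale=5, size=10_000)           # E (random error)
observed = true_scores + errors                             # X (observed scores)

true_var = true_scores.var()
total_var = observed.var()

# Reliability = true variance / total variance
reliability = true_var / total_var
print(f"reliability ≈ {reliability:.2f}")   # about .80 (100 / 125) in expectation
```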
1. SOURCES OF ERROR VARIANCE
Test Construction
Test Administration
Test Scoring and Interpretation
Test Construction
Item Sampling (Content Sampling)
The variation among items within a test as well as variation among items between tests.
• e.g., tests differ in the way items are worded or in which items are included.
Test Administration
Test Environment
Testtaker Variables
Examiner-Related Variables
Test Environment
• e.g., room temperature, level of lighting, amount of ventilation and noise, and the events of the day.
Testtaker Variables
• e.g., emotional problems, physical discomfort, lack of sleep, the effects of drugs or medication, formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state.
Examiner-Related Variables
• Examiner's physical appearance and demeanor, professionalism, and nonverbal behavior.
Test Scoring and Interpretation
Scorers and Scoring Systems
• Computer scoring: a technical glitch might contaminate the data.
• Bias and subjectivity of scorers.
RELIABILITY ESTIMATES
1. Test-Retest Reliability Estimates
2. Parallel-Forms and Alternate-Forms Reliability Estimates
3. Split-Half Reliability Estimates
4. Other Methods of Estimating Internal Consistency
5. Inter-Scorer Reliability
1. Test-Retest Reliability Estimates
• An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
o Useful if you are measuring something that is relatively stable over time (e.g., personality traits).
• Coefficient of Stability
o The estimate of test-retest reliability.
▪ The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower.
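A minimal Python sketch (hypothetical scores) of how a test-retest estimate is obtained: the coefficient of stability is simply the Pearson r between the two administrations.

```python
import numpy as np

# Hypothetical scores for the same five testtakers on two administrations of the same test
time_1 = np.array([85, 92, 78, 88, 95])
time_2 = np.array([83, 94, 75, 90, 96])

# Coefficient of stability = Pearson r between the two sets of scores
coefficient_of_stability = np.corrcoef(time_1, time_2)[0, 1]
print(f"test-retest reliability ≈ {coefficient_of_stability:.2f}")
```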
2. Parallel-Forms and Alternate-Forms Reliability Estimates
• Administering different forms of a test to the same group.
• Coefficient of Equivalence
o The degree of the relationship between various forms of a test.
Parallel Forms
• The means and the variances of observed test scores are equal in each form of the test.
• Parallel Forms Reliability
o An estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
Alternate Forms
• Simply different versions of a test that have been constructed so as to be parallel.
• Alternate Forms Reliability
o An estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.
3. Split-Half Reliability Estimates
• Correlating two pairs of scores obtained from equivalent halves of a single test administered once.
Three-Step Process
Splitting the Test
Spearman–Brown formula
Three-Step Process
• Step 1. Divide the test into equivalent halves.
• Step 2. Calculate a Pearson r between scores on the two halves of the test.
• Step 3. Adjust the half-test reliability using the Spearman–Brown formula.
Splitting the Test
• 1. Randomly assign items to one or the other half of the test.
• 2. Odd-Even Reliability
o Assign odd-numbered items to one half of the test and even-numbered items to the other half.
• 3. Divide the test by content so that each half contains items equivalent with respect to content and difficulty.
• Do not simply divide the test in the middle (first half vs. second half).
o Doing so could spuriously raise or lower the reliability coefficient.
Spearman–Brown formula
• Allows estimation of internal consistency reliability from a correlation of two halves of a test.
• General form: rSB = (n × rxy) / [1 + (n − 1) × rxy]
o rSB: the reliability adjusted by the Spearman–Brown formula.
o rxy: the Pearson r of the original-length test.
o n: the number of items in the revised version divided by the number of items in the original version.
• To estimate the reliability of a whole test from its two halves: rSB = (2 × rHH) / (1 + rHH)
o rHH: the Pearson r of scores on the two half-tests.
o Because a whole test is two times longer than half a test, n becomes 2.
• !! Usually, but not always, reliability increases as test length increases.
• May be used to:
o Estimate the effect of shortening a test on its reliability.
o Determine the number of items needed to attain a desired level of reliability.
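A minimal Python sketch of the three-step split-half procedure (hypothetical item scores; an odd-even split is assumed):

```python
import numpy as np

def spearman_brown(r_original: float, n: float) -> float:
    """General Spearman–Brown adjustment; n = revised length / original length."""
    return (n * r_original) / (1 + (n - 1) * r_original)

# Hypothetical item scores: rows = testtakers, columns = items (1 = correct, 0 = incorrect)
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 1, 1, 0],
])

# Step 1: split into equivalent halves (odd vs. even items); Step 2: Pearson r between halves
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)
r_hh = np.corrcoef(odd_half, even_half)[0, 1]

# Step 3: adjust the half-test r to full length (the whole test is twice as long, so n = 2)
r_sb = spearman_brown(r_hh, n=2)
print(f"half-test r = {r_hh:.2f}, split-half reliability = {r_sb:.2f}")
```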
4. Other Methods of Estimating Internal Consistency
Inter-Item Consistency
Kuder-Richardson Formula (KR-20)
Coefficient Alpha
Average Proportional Distance (APD)
Inter-Item Consistency
• The degree of correlation among all the items on a scale.
Homogeneity
• From the Greek words homos, meaning "same," and genos, meaning "kind"; the degree to which a test measures a single factor.
Heterogeneity
• The degree to which a test measures different factors.
• Composed of items that measure more than one trait.
Kuder-Richardson Formula (KR-20)
• G. Frederic Kuder and M. W. Richardson
o Named because it was the 20th formula developed in a series.
• Used to determine the inter-item consistency of dichotomous items.
o Items that can be scored right or wrong (e.g., multiple-choice items).
• KR-20 = [k / (k − 1)] × [1 − (Σpq / σ²)]
o k: the number of test items.
o σ²: the variance of total test scores.
o p: the proportion of testtakers who pass the item.
o q: the proportion of testtakers who fail the item.
o Σpq: the sum of the pq products over all items.
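A minimal Python sketch of KR-20 on a hypothetical right/wrong (1/0) response matrix:

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """Kuder–Richardson formula 20 for dichotomously scored items (rows = testtakers)."""
    k = items.shape[1]                        # number of items
    p = items.mean(axis=0)                    # proportion passing each item
    q = 1 - p                                 # proportion failing each item
    total_var = items.sum(axis=1).var()       # variance of total test scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical responses: rows = testtakers, columns = items (1 = right, 0 = wrong)
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
])
print(f"KR-20 ≈ {kr20(responses):.2f}")
```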
Coefficient Alpha
• Developed by Cronbach (1951)
• The mean of all possible split-half correlations, corrected by the Spearman–Brown formula.
o Appropriate for use on tests containing non-dichotomous items.
• rα = [k / (k − 1)] × [1 − (Σσᵢ² / σ²)]
o rα: coefficient alpha.
o k: the number of items.
o σᵢ²: the variance of one item.
o Σσᵢ²: the sum of the variances of each item.
o σ²: the variance of the total test scores.
Coefficient Alpha
• Ranges in value from 0 to 1
o Helps answer questions about how similar sets of data are.
o 0: absolutely no similarity; 1: perfectly identical.
o A value of alpha above .90 may be "too high" and indicate redundancy in the items.
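A minimal Python sketch of coefficient alpha on a hypothetical matrix of non-dichotomous (Likert-type) item scores:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha (rows = testtakers, columns = items; items need not be dichotomous)."""
    k = items.shape[1]                        # number of items
    item_vars = items.var(axis=0)             # variance of each item
    total_var = items.sum(axis=1).var()       # variance of total test scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 ratings: rows = testtakers, columns = items
likert = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
])
print(f"coefficient alpha ≈ {cronbach_alpha(likert):.2f}")
```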
Average Proportional Distance (APD)
o A measure that focuses on the degree of difference that exists between item scores.
▪ .2 or lower: Excellent internal consistency
▪ .2 to .25: Acceptable internal consistency
5. Inter-Scorer Reliability
• The degree of agreement or consistency between two or more scorers with regard to a particular measure.
o Often used when coding nonverbal behavior.
▪ e.g., a checklist of behaviors associated with depressed mood.
• Coefficient of Inter-Scorer Reliability
o Way of determining the degree of consistency among scorers in the scoring of a test.
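A minimal Python sketch (hypothetical ratings) in which the coefficient of inter-scorer reliability is taken as the correlation between two scorers' totals for the same protocols:

```python
import numpy as np

# Hypothetical totals assigned by two scorers to the same ten behavior checklists
scorer_a = np.array([12, 15, 9, 20, 17, 11, 14, 18, 10, 16])
scorer_b = np.array([13, 14, 9, 19, 18, 10, 15, 18, 11, 15])

# Coefficient of inter-scorer reliability as the correlation between the two scorers
r_inter_scorer = np.corrcoef(scorer_a, scorer_b)[0, 1]
print(f"inter-scorer reliability ≈ {r_inter_scorer:.2f}")
```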
USING AND INTERPRETING A COEFFICIENT OF RELIABILITY
1. THE PURPOSE OF THE RELIABILITY COEFFICIENT
2. THE NATURE OF THE TEST
3. THE TRUE SCORE MODEL OF MEASUREMENT AND ALTERNATIVES TO IT
1. THE PURPOSE OF THE RELIABILITY COEFFICIENT
2. THE NATURE OF THE TEST
-Homogeneity Versus Heterogeneity of Test Items
-Dynamic versus Static Characteristics
-Restriction or Inflation of Range
-Speed Tests versus Power Tests
-Criterion-Referenced Tests
-Homogeneity Versus Heterogeneity of Test Items
Homogeneous Items
• Tests designed to measure one factor.
• High degree of internal consistency.
Heterogeneous Items
• An estimate of internal consistency might be low.
-Dynamic versus Static Characteristics
Dynamic Characteristics
• A trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.
o A test-retest estimate is not appropriate; a measure of internal consistency is.
-Dynamic versus Static Characteristics
Static Characteristics
• A trait, state, or ability presumed to be relatively unchanging (e.g., intelligence).
o Test-retest or alternate-forms methods are appropriate.
-Restriction or Inflation of Range
• Also called 'restriction of range' or 'restriction of variance'.
• If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower.
• If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
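A minimal Python sketch (simulated data) of how restricting the range of one variable lowers the obtained correlation coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate two correlated variables in the full group
x = rng.normal(size=5_000)
y = 0.7 * x + rng.normal(scale=0.7, size=5_000)
r_full = np.corrcoef(x, y)[0, 1]

# Restrict the range by keeping only high scorers on x (e.g., a selected subsample)
selected = x > 1.0
r_restricted = np.corrcoef(x[selected], y[selected])[0, 1]

print(f"full-range r ≈ {r_full:.2f}, restricted-range r ≈ {r_restricted:.2f}")
```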
-Speed Tests versus Power Tests
Speed Test
• Contains items of a uniform level of difficulty so that, when given generous time limits, all testtakers should be able to complete all the test items correctly.
o A reliability estimate should be based on (1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half tests.
• Reliability should not be calculated from a single administration of the test with a single time limit.
-Speed Tests versus Power Tests
Power Test
• The time limit is long enough to allow testtakers to attempt all items, but some items are so difficult that no testtaker is able to obtain a perfect score.
-Criterion-Referenced Tests
• Designed to provide an indication of where a testtaker stands with respect to some variable or criterion.
o Traditional procedures of estimating reliability are usually not appropriate for use with criterion-referenced tests.
3. THE TRUE SCORE MODEL OF MEASUREMENT AND ALTERNATIVES TO IT
Classical Test Theory (CTT)
Domain Sampling Theory and Generalizability Theory
Item Response Theory (IRT)
Classical Test Theory (CTT)
• Also known as the 'true score (or classical) model of measurement'.
• True Score
o A value that, according to classical test theory, genuinely reflects an individual's ability level as measured by a particular test.
Classical Test Theory (CTT)
• Advantages
o Notion that everyone has a “true score” on a test.
o Its assumptions allow for its application in most situations.
o Its compatibility and ease of use with widely used statistical techniques.
• Disadvantages
o CTT assumptions are characterized as "weak" precisely because they are so readily met.
▪ By contrast, IRT assumptions are characterized in terms such as "strong," "hard," "rigorous," and "robust."
o All items are presumed to be contributing equally to the score total.
o Test length: the CTT model favors the development of longer rather than shorter tests.
Domain Sampling Theory and Generalizability Theory
1. Domain Sampling Theory
2. Generalizability Theory
1. Domain Sampling Theory
• Seeks to estimate the extent to which specific sources of variation under defined conditions contribute to the test score.
o Domain of Behavior: the universe of items that could conceivably measure that behavior; it can be thought of as a hypothetical construct.
o !! Measures of internal consistency are perhaps the most compatible with domain sampling theory.
2. Generalizability Theory
• Developed by Lee J. Cronbach (1970) and his colleagues
• A person’s test scores vary from testing to testing because of variables in the testing situation.
o Universe: describes the details of the particular test situation leading to a specific test score.
o Facets: include the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.
o Universe Score: the test score that would be obtained under the exact same conditions on all the facets in the universe.
• Tests should be developed with the aid of a generalizability study followed by a decision study.
o Generalizability Study: Examines how generalizable scores from a particular test are if the test is administered in different situations.
o Decision Study: Developers examine the usefulness of test scores in helping the test user make decisions.
• Asserts that a test's reliability is very much a function of the circumstances under which the test is developed, administered, and interpreted.
Item Response Theory (IRT)
• Also called 'latent-trait theory'.
• Models test performance as a function of the testtaker's standing on the underlying trait together with item characteristics such as the difficulty level of an item and the item's level of discrimination.
o Difficulty: the attribute of not being easily accomplished, solved, or comprehended.
o Discrimination: the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.
▪ Not all items of the test might be given equal weight.
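These notes do not spell out a specific IRT model, but a minimal sketch of one common choice, the two-parameter logistic model, shows how difficulty (b) and discrimination (a) shape the probability of a correct response (Python, hypothetical parameter values):

```python
import numpy as np

def p_correct(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic item response function.
    theta = trait level, a = item discrimination, b = item difficulty."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Compare a harder item (b = 1.0) with an easier item (b = -1.0) across trait levels
for theta in (-1.0, 0.0, 1.0):
    hard = p_correct(theta, a=1.5, b=1.0)
    easy = p_correct(theta, a=1.5, b=-1.0)
    print(f"theta = {theta:+.1f}: P(correct | hard item) = {hard:.2f}, P(correct | easy item) = {easy:.2f}")
```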
Item Response Theory (IRT)
• 1. Dichotomous Test Items
o Test items or questions that can be answered with only one of two alternative responses.
o e.g., true–false, yes–no, or correct–incorrect questions.
• 2. Polytomous Test Items
o Test items or questions with three or more alternative responses.
o Only one is scored correct or scored as being consistent with a targeted trait or other construct.
RELIABILITY AND INDIVIDUAL SCORES
1. STANDARD ERROR OF MEASUREMENT (SEM or σmeas)
2. THE STANDARD ERROR OF THE DIFFERENCE BETWEEN TWO SCORES
1. STANDARD ERROR OF MEASUREMENT (SEM or σmeas)
• Tool used to estimate or infer the extent to which an observed score deviates from a true score.
o Provides an estimate of the amount of error inherent in an observed score or measurement.
o The higher the reliability of a test, the lower the SEM.
Standard Error of a Score
• Denoted by the symbol σmeas
• An index of the extent to which one individual's scores vary over tests presumed to be parallel.
• σmeas = σ√(1 − rxx)
o σ: the standard deviation of test scores obtained by the group of testtakers.
o rxx: the reliability coefficient of the test.
Confidence Interval
• A range or band of test scores that is likely to contain the true score.
o The SEM is used to set the confidence interval for a particular score or to determine whether a score is significantly different from a criterion.
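A minimal Python sketch of the standard error of measurement and a 95% confidence interval around an observed score (hypothetical values: SD = 15, reliability = .90, observed score = 110):

```python
import numpy as np

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: sigma * sqrt(1 - r_xx)."""
    return sd * np.sqrt(1 - reliability)

observed_score = 110
s = sem(sd=15, reliability=0.90)                       # ≈ 4.74
lower, upper = observed_score - 1.96 * s, observed_score + 1.96 * s
print(f"SEM ≈ {s:.2f}; 95% confidence interval ≈ {lower:.1f} to {upper:.1f}")
```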
2. THE STANDARD ERROR OF THE DIFFERENCE BETWEEN TWO SCORES
• Standard Error of the Difference
o A statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant.
▪ If the probability is more than 5% that the difference occurred by chance, then, for all intents and purposes, it is presumed that there was no difference.
The formula for the standard error of the difference between two scores:
o σdiff = √(σ²meas1 + σ²meas2)
▪ σdiff: the standard error of the difference between two scores.
▪ σ²meas1: the squared standard error of measurement for test 1.
▪ σ²meas2: the squared standard error of measurement for test 2.
o Substituting reliability coefficients for the standard errors of measurement of the separate scores: σdiff = σ√(2 − r1 − r2)
▪ r1: the reliability coefficient of test 1.
▪ r2: the reliability coefficient of test 2.
▪ σ: the standard deviation (both scores must be on the same scale).
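A minimal Python sketch of the standard error of the difference using reliability coefficients (hypothetical values: both tests scaled with SD = 15, reliabilities .90 and .85):

```python
import numpy as np

def se_difference(sd: float, r1: float, r2: float) -> float:
    """Standard error of the difference between two scores on the same scale:
    sigma * sqrt(2 - r1 - r2)."""
    return sd * np.sqrt(2 - r1 - r2)

sd_diff = se_difference(sd=15, r1=0.90, r2=0.85)        # = 7.5
# A difference larger than about 1.96 * sd_diff is significant at the .05 level
print(f"standard error of the difference = {sd_diff:.2f}")
print(f"minimum significant difference ≈ {1.96 * sd_diff:.1f}")
```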
---fin---