Measurement final exam: reliability

36 Terms

1
New cards

What are the two major criteria for evaluating measurements?

Validity: accuracy/truthfulness of interpretations or inferences made from test scores; reliability: consistency, dependability, repeatability

2
New cards

What is the relationship between reliability and validity?

Just because measurements are reliable does not ensure that they are valid. Example: a yard stick from a country that does not use yards.

3
New cards

Describe reliability as repeatability over time

Any observation has error. Assume errors are random. Some error is positive; some is negative. Positive and negative errors cancel out. The mean of repeated measurements can thus be considered the “true” score.
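
A minimal numerical sketch of this idea, assuming a hypothetical true score of 100 and 1,000 random errors with mean zero (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 100.0                                  # hypothetical "true" value of the trait
errors = rng.normal(loc=0.0, scale=5.0, size=1000)  # random errors with mean zero
observed = true_score + errors                      # each observation = true score + error

# Positive and negative errors tend to cancel, so the mean of many
# repeated observations converges toward the true score.
print(round(observed.mean(), 2))                    # close to 100
```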

4
New cards

How does the concept of reliability as repeatability over time differ between physical and psychological measurements?

Physical measurements can be repeated numerous times without changing the object being measured or its properties; psychological measurements cannot.

5
New cards

What are some possible sources of error in psychological measurement?

Individual sources: fatigue, illness, motivation, fluctuations in the trait; external sources: testing situation (distractions, room conditions), administration errors, scoring errors, content of test items

6
New cards

What is test-retest reliability?

Repeated testing as a method for determining reliability. Each person is administered the same test at two time points. Reliability is estimated as the correlation between their scores.
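
A minimal sketch of the estimate, assuming hypothetical scores for six examinees at the two time points:

```python
import numpy as np

# Hypothetical scores for the same six examinees at two time points
time1 = np.array([12, 15, 20, 22, 18, 25], dtype=float)
time2 = np.array([13, 14, 21, 20, 19, 26], dtype=float)

# Test-retest reliability is estimated as the Pearson correlation
# between scores on the two occasions.
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 3))
```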

7
New cards

What are some problems with test-retest reliability?

Motivation: decreases reliability
Memory: increases reliability
Learning: can increase or decrease reliability
People may change as a result of the measurement process
The trait may change over time which can be incorrectly viewed as measurement error

8
New cards

Why is there a need for reliability theory?

We can’t measure people multiple times and take the average of their scores: if test-retest reliability is problematic, more testing occasions would be even worse, and a stable mean would require at least 30 observations. Nor can we measure people with a number of instruments measuring the same construct and average their scores: we would need at least 30 instruments, instruments are expensive to create, there might not be a sufficient number of instruments for a particular trait, and different instruments might measure the trait differently.

9
New cards

What are the four stages in the evolution of reliability theory?

Classical true-and-error score theory, infinite parallel tests theory, domain sampling theory, generalizability theory

10
New cards

What are the assumptions of classical true-and-error score theory?

  1. Observed score = True Score + Error

  2. True score is “expected” or mean of observed scores

  3. Errors are not correlated with True Score, i.e., are random

  4. Errors on two parallel tests are uncorrelated
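
The same assumptions in standard classical-test-theory notation (a sketch, with X the observed score, T the true score, E the error, and E₁, E₂ the errors on two parallel tests):

```latex
X = T + E, \qquad
T = \mathbb{E}(X), \qquad
\rho_{TE} = 0, \qquad
\rho_{E_1 E_2} = 0
```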

11
New cards

What are some conclusions drawn from true-and-error score theory?

  1. Mean (expectation) of error scores for any examinee is zero

  2. Observed Score variance can be partitioned into True Score variance and Error Score variance

  3. Reliability coefficient is the squared correlation between Observed Score and True Score, estimated by the correlation between scores on two parallel tests

  4. Reliability is the ratio of True Score variance to Observed Score variance

  5. Reliability is one minus the proportion of Error Score variance to Observed Score variance

  6. Index of reliability (IR) is the square root of the reliability coefficient, i.e., the (unsquared) correlation between Observed Score and True Score

  7. Error variance is the amount of Observed Score variance that is due to unreliability

  8. Standard Error of Measurement (SEM) is the standard deviation of the errors of measurement around an estimate of the true score. The estimate of the true score is obtained from the IR. The SEM defines a confidence interval in which the “real” True Score lies.
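
A minimal simulation sketch of these conclusions, assuming hypothetical true scores (SD = 10) and errors (SD = 5) on two parallel forms; all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Simulated true scores and random errors for two parallel forms
true_scores = rng.normal(50, 10, size=n)
err1 = rng.normal(0, 5, size=n)
err2 = rng.normal(0, 5, size=n)
x1 = true_scores + err1          # observed scores, form 1
x2 = true_scores + err2          # observed scores, form 2

rel_ratio = true_scores.var() / x1.var()      # True Score variance / Observed Score variance
rel_parallel = np.corrcoef(x1, x2)[0, 1]      # correlation between the two parallel forms
sem = x1.std() * np.sqrt(1 - rel_ratio)       # standard error of measurement

print(round(rel_ratio, 3), round(rel_parallel, 3), round(sem, 2))
# Both reliability estimates land near 100 / (100 + 25) = 0.8,
# and the SEM lands near the error SD of 5.
```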

12
New cards

What are some issues with classical true-and-error score theory?

Parallel tests measure the same construct, have the same number of items, have the same mean and standard deviation, and if there are more than two, the correlations between them are equal. In practice, parallel tests are almost impossible to create (i.e., with same mean and standard deviation). Reliability coefficients are sample dependent, and True Score estimates are sample dependent. The SEM and its confidence interval are sample dependent. The SEM is constant for all score levels, and there is only one SEM for each reliability coefficient.

13
New cards

What are some assumptions of infinite parallel tests theory?

For any construct, there are an infinite number of parallel tests that can be constructed by randomly drawing a specified number of items from all possible items. These parallel tests will all correlate with each other to the same extent (plus or minus sampling error). An observed score can be decomposed into a True Score and an Error Score. A True Score is defined as the mean Observed Score across infinite parallel tests.

14
New cards

What are some implications of infinite parallel tests theory?

The reliability coefficient is the correlation between Observed Scores on any two parallel tests, varying by sampling error depending on the specific tests used. The IR is the correlation of Observed Scores with the average score over an infinite number of parallel tests (the True Score). The SEM is the standard deviation of error scores across infinite parallel tests for one individual. In theory, the SEM can vary from examinee to examinee, but there is no method for implementing this variable SEM in practice. The formulas for the reliability coefficient, IR, and SEM are the same as in true-and-error score theory.

15
New cards

What is the basic idea behind domain sampling theory?

Any measuring instrument is composed of a random sample of items from a specified domain. Domain sampling is the simplest case of infinite parallel tests. Reliability is repeatability across random samples from the domain, as calculated by the correlation between scores on any two random samples.

16
New cards

What are the assumptions of domain sampling theory?

With repeated samples from the domain, the averages of the mean, standard deviation, and correlation of items in the samples are the same as the mean, standard deviation, and correlation of all the items in the domain.

17
New cards

What is reliability in domain sampling theory?

The average correlation of scores on one item with all other items in the domain. Can be interpreted as the proportion of Observed Score variance not attributable to Error.

18
New cards

What are the IR and SEM in domain sampling theory?

IR: correlation between Observed Scores on the random samples of items and scores on the whole domain of items (True Scores)

SEM: standard deviation of individual scores over sets of items. Can vary from person to person in theory but computationally is the same.

Formulas for IR and SEM are the same as in true-and-error score theory and infinite parallel tests theory.

19
New cards

What is parallel forms reliability?

The reliability coefficient is the Pearson product-moment correlation between scores for a group of examinees on two parallel tests.

20
New cards

What are problems with parallel forms reliability?

It’s difficult to get explicitly parallel forms as required by true-and-error score theory. The randomly parallel forms of infinite parallel tests theory are easier to obtain.

Scores on one form may affect scores on another form due to memory and/or motivation.

The time interval between the administration of the forms can affect the correlation between parallel forms.

The reliability coefficient will depend on the heterogeneity of the group to which the two forms are administered (and other factors that affect correlations).

The reliability coefficient will reflect how reliable the scores are and how parallel the tests are.

21
New cards

What are split-test methods for approximating parallel forms?

The Pearson product-moment correlation between scores on odd vs. even items, first half vs. second half, etc.

The Spearman-Brown formula is used to step up the split-test correlation to what it would be for the number of items in the complete test.
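
A minimal sketch of the step-up computation, assuming a hypothetical split-half correlation of .70 (the function name and inputs are illustrative):

```python
def spearman_brown(r_split: float, length_factor: float = 2.0) -> float:
    """Project a reliability estimate to a test 'length_factor' times as long.

    For a half-test correlation, length_factor = 2 "steps up" the
    split-half correlation to the reliability of the full-length test.
    """
    return (length_factor * r_split) / (1.0 + (length_factor - 1.0) * r_split)

# Example: a split-half (odd vs. even) correlation of .70
print(round(spearman_brown(0.70), 3))   # 0.824 for the full-length test
```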

22
New cards

What are some assumptions of the Spearman-Brown formula?

The new items and the original items have the same standard deviations, same item difficulties (proportion correct), same item intercorrelations, and measure the same trait.

23
New cards

What are the steps to Rulon’s formula for split-test reliability?

  1. Split test into two parts: odd/even, first half/second half

  2. Compute total score for all items for each examinee

  3. Compute total score for each half for each examinee

  4. Compute the difference score between the two parts for each examinee

  5. Compute the variance of the difference scores

  6. Compute the variance of the all-items total scores

  7. Reliability is one minus the ratio of the difference-score variance to the total-score variance

Not a correlation and does not have the problems associated with correlations!
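
A minimal sketch of the computation, assuming hypothetical half-test scores for five examinees (array and function names are illustrative):

```python
import numpy as np

def rulon_reliability(half1: np.ndarray, half2: np.ndarray) -> float:
    """Rulon's split-test reliability: 1 - Var(difference) / Var(total)."""
    diff = half1 - half2          # difference between the two half-test scores
    total = half1 + half2         # total score across all items
    return 1.0 - diff.var(ddof=1) / total.var(ddof=1)

# Hypothetical half-test scores for five examinees
odd = np.array([10, 12, 15, 9, 14], dtype=float)
even = np.array([11, 13, 14, 8, 15], dtype=float)
print(round(rulon_reliability(odd, even), 3))
```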

24
New cards

What are some limitations of split-test estimates?

They are not applicable to speeded tests. They inflate reliability by treating momentary fluctuations in performance as systematic variance. First and second halves (or odd-even splits) are frequently not equivalent, which can reduce reliability.

25
New cards

Describe internal consistency reliability as a whole-test method based on domain sampling.

Internal consistency is the logical extreme of split-test methods.

  1. Divide the test into its smallest parts, i.e., single items

  2. Consider each item as a mini-test of length 1

  3. If there were 100 items, instead of using 50 items in two split-half tests, use each item as a “test.”

  4. Intercorrelate all items into a 100 x 100 correlation matrix.

  5. Take the average of those correlations to estimate reliability.

  6. Use the Spearman-Brown formula to get the reliability of the 100-item test from the (average) reliability of the 1-item tests, since individual item responses correlate only weakly with one another.

This approach is the conceptual basis for internal consistency reliability coefficients.
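
A minimal sketch of this whole-test idea, assuming a small hypothetical Persons × Items score matrix (the function name is illustrative); the average off-diagonal item correlation is stepped up to the full test length with Spearman-Brown:

```python
import numpy as np

def stepped_up_reliability(item_scores: np.ndarray) -> float:
    """Average inter-item correlation "stepped up" to full test length.

    item_scores: 2-D array, rows = examinees, columns = items.
    """
    k = item_scores.shape[1]
    corr = np.corrcoef(item_scores, rowvar=False)   # k x k item correlation matrix
    off_diag = corr[~np.eye(k, dtype=bool)]         # drop the 1s on the diagonal
    r_bar = off_diag.mean()                         # average inter-item correlation
    return (k * r_bar) / (1.0 + (k - 1.0) * r_bar)  # Spearman-Brown step-up to length k

# Hypothetical responses of six examinees to four items
scores = np.array([[1, 2, 2, 1],
                   [3, 3, 4, 3],
                   [2, 2, 3, 2],
                   [4, 5, 4, 4],
                   [2, 3, 2, 3],
                   [5, 4, 5, 5]], dtype=float)
print(round(stepped_up_reliability(scores), 3))
```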

26
New cards

Describe Cronbach’s alpha reliability coefficient

An example of a whole-test method for internal consistency reliability. Error variance is the summed variance of the test items. Implies that reliability of a test is a function of the positive inter-item correlations of the items composing it. Can apply to dichotomous or non-dichotomous items. Item intercorrelations are assumed to be equal. Not usable for speeded tests. Ranges from negative infinity to 1.0. Can be interpreted as the “stepped up” average item intercorrelation.
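
A minimal sketch of the coefficient, assuming a small hypothetical Persons × Items matrix of non-dichotomous (Likert-type) responses; the summed item variances play the role of error variance in the formula:

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)      # per-item variances
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of examinees' total scores
    return (k / (k - 1.0)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical Likert-type responses: rows = examinees, columns = items
scores = np.array([[2, 3, 3, 2],
                   [4, 4, 5, 4],
                   [3, 3, 4, 3],
                   [5, 5, 4, 5],
                   [1, 2, 2, 1]], dtype=float)
print(round(cronbach_alpha(scores), 3))
```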

27
New cards

What is the Kuder-Richardson Formula 20 (KR-20)?

A reliability coefficient for dichotomously scored test data. The same as Cronbach’s alpha but for 0-1 scored items.
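
A minimal sketch, assuming a small hypothetical matrix of 0/1 responses (rows = examinees, columns = items):

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """KR-20 for 0/1 items: k/(k-1) * (1 - sum(p*q) / variance of total scores)."""
    k = responses.shape[1]
    p = responses.mean(axis=0)                      # proportion correct for each item
    q = 1.0 - p
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1.0)) * (1.0 - (p * q).sum() / total_var)

# Hypothetical 0/1 responses: rows = examinees, columns = items
resp = np.array([[1, 1, 0, 1, 0],
                 [1, 0, 0, 1, 0],
                 [1, 1, 1, 1, 1],
                 [0, 0, 0, 1, 0],
                 [1, 1, 1, 0, 1]], dtype=float)
print(round(kr20(resp), 3))
```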

28
New cards

What is Hoyt’s reliability coefficient?

Divides the total variance in a Persons by Items matrix of item responses into three variance components: variance between persons, variance between items, and the person-by-item interaction variance (which is defined as the error variance in the generic formulas for reliability). Gives equivalent results to KR-20 and Cronbach’s alpha.
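
A minimal sketch of the ANOVA computation, assuming a small hypothetical Persons × Items matrix; the person-by-item interaction mean square plays the role of error:

```python
import numpy as np

def hoyt_reliability(x: np.ndarray) -> float:
    """Hoyt's ANOVA reliability from a Persons x Items matrix of responses."""
    n, k = x.shape
    grand = x.mean()
    ss_persons = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_items = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ss_resid = ss_total - ss_persons - ss_items        # person-by-item interaction ("error")
    ms_persons = ss_persons / (n - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))
    return (ms_persons - ms_resid) / ms_persons

# Hypothetical Persons x Items matrix (same layout used for alpha / KR-20)
x = np.array([[2, 3, 3, 2],
              [4, 4, 5, 4],
              [3, 3, 4, 3],
              [5, 5, 4, 5],
              [1, 2, 2, 1]], dtype=float)
print(round(hoyt_reliability(x), 3))   # equals Cronbach's alpha for these data
```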

29
New cards

What are two ways internal consistency reliability can be increased?

By deleting items that are different from other items; by adding items that are similar to other items.

30
New cards

What is generalizability theory?

An extension of Hoyt’s analysis-of-variance approach. It expands Hoyt’s definition of error by breaking down the interaction term to determine which factors account for the interaction (e.g., sex, age, testing conditions), resulting in an expanded analysis of variance. Sources of variation that are not statistically significant are said to be “generalizable,” i.e., scores are not affected by those sources or their interactions. Error is defined as those testing conditions that have significant effects on test scores (Relative Error). The model depends on parallel tests, referred to as randomly parallel tests (equal-sized random samples of items from the same universe of admissible observations); they are not required to have equal means, variances, or intercorrelations. Reliability-like coefficients, called generalizability coefficients, are defined as the proportion of Observed Score variance due to Universe Score variance. A Universe Score is the statistical expectation of a person’s observed score over all conditions in the universe of generalization.

31
New cards

What are some issues with generalizability theory?

g (the generalizability coefficient) depends on the conditions selected for the universe of generalization, and the magnitude of error depends on how the investigator defines that universe. Changing the conditions of measurement will change the g coefficient because both Universe Scores and Relative Error are affected. Its reliance on parallel forms limits its usefulness. It is a large-sample method, since it requires enough examinees in each cell of the ANOVA table to test statistical significance. It analyzes only mean differences across sources of variation.

32
New cards

Describe reliability for ratings data

Internal consistency methods are generally not applicable to scores (ratings) assigned by two or more raters. Raters might rate observed behaviors or writing that others have produced, and special methods apply to these kinds of ratings. Interrater agreement = the extent to which the different judges tend to make exactly the same judgment about the rated subject. Interrater reliability = the degree to which the ratings of different judges are rank-ordered in a similar fashion (i.e., are proportional). Evidence of both interrater agreement and interrater reliability is required before ratings can be “accepted.” The manner in which the index was calculated should be described along with its assumptions.
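
A minimal sketch of the distinction, assuming hypothetical ratings of eight essays by two judges; exact agreement and rank-order similarity are computed separately:

```python
import numpy as np

# Hypothetical ratings by two judges of the same eight essays (1-5 scale)
rater_a = np.array([3, 4, 2, 5, 3, 4, 1, 5], dtype=float)
rater_b = np.array([4, 5, 3, 5, 4, 5, 2, 5], dtype=float)

# Interrater agreement: proportion of exactly identical judgments
agreement = (rater_a == rater_b).mean()

# Interrater reliability: do the raters rank-order the essays similarly?
reliability = np.corrcoef(rater_a, rater_b)[0, 1]

print(round(agreement, 2), round(reliability, 2))
# Rater B rates about one point higher than Rater A, so exact agreement is
# low even though the rank ordering (reliability) is high.
```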

33
New cards

Describe reliability for criterion-referenced tests

The purpose of a criterion-referenced (or mastery) test is to compare each individual to some specified standard. Student variability is not essential, so classical estimates of reliability are not appropriate. The type of precision desired is reflected in the purpose of the test: if the score is intended for reference to a domain, the focus is on the precision of the score itself; if the purpose of the test is to make a classification decision with respect to some “cut score,” then the focus is on the precision of the categorization decision (decision consistency). A number of different approaches have been proposed for estimating reliability under these circumstances.
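
A minimal sketch of a decision-consistency check, assuming hypothetical scores on two administrations and a hypothetical cut score of 20:

```python
import numpy as np

# Hypothetical scores for the same examinees on two forms of a mastery test
form1 = np.array([22, 15, 30, 18, 27, 12, 25, 19], dtype=float)
form2 = np.array([24, 14, 28, 21, 26, 13, 23, 17], dtype=float)
cut_score = 20                                   # hypothetical mastery cut score

# Decision consistency: proportion of examinees given the same
# master / non-master classification on both administrations.
same_decision = (form1 >= cut_score) == (form2 >= cut_score)
print(same_decision.mean())
```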

34
New cards

Give a summary of reliability

The essence of reliability is “repeatability.” There are different kinds of reliability for different purposes: test-retest reliability to evaluate stability of scores over time; parallel forms reliability as an approximation to internal consistency reliability; internal consistency to evaluate repeatability within one measurement occasion; interrater reliability/agreement for judges’ ratings; and special methods for mastery tests. For the measurement of individual differences, the end result of reliability calculations should be the standard error of measurement (SEM).

35
New cards

What are some problems with reliability: parallel forms?

Strictly parallel forms are almost impossible to create. Split-test methods (an approximation to parallel forms) are rarely parallel. The Spearman-Brown formula frequently does not give correct predictions of reliability for lengthened or shortened tests. Parallel-forms coefficients confound the parallelism of the forms with the repeatability of the measurements, thus underestimating reliability, and they are subject to all the factors that affect correlations: restricted range, linearity, outliers, mixed samples, and skewed distributions.

36
New cards

What are some problems with reliability: internal consistency?

Not appropriate for speeded tests. Other things being equal, the more heterogeneous the group, the higher the reliability. These coefficients are group dependent because they involve the variance of the observed scores; the SEM and the estimate of the “true score” depend on the reliability, so a single examinee with the same score in two different groups of examinees could have two different true-score estimates and two different error bands around them. They are also item dependent: they are, in effect, the “stepped up” average of the item intercorrelations, and removing items that don’t correlate as highly with the others can improve the reliability estimate. They result in a single standard error of measurement regardless of score level. Finally, reliabilities high enough to obtain very precise individual score estimates are difficult to obtain.