Reliability/Precision
Refers to the consistency of test scores in general
The most important attribute of a measurement instrument is its reliability. Reliability describes the consistency of test scores; every score contains some degree of error, which can affect reliability and consistency.
Measurement Error
Variations in measurement using a reliable instrument
Example: a yardstick - user error
Errors are usually due to random mistakes or inconsistencies of the person using the measurement tool. The tool has internal consistency.
Reliable Test
Is one we can trust to measure each person in approximately the same way every time it is used
A test must be reliable if it is to be used to measure attributes and compare people
Just because a test has been shown to produce reliable scores does not mean the test is also valid. Evidence of reliability does not mean that the inferences a test user makes from scores on the test are correct, or that the test is being used properly (validity)
Classical Test Theory
According to classical test theory, a person's test score (called the observed score) is made up of two independent parts: a true score and random error
X = T + E
True Score
(T) part of classical test theory
A true score is a measurement of the amount of the attribute that the test is designed to measure
An individual's true score on a test is a value that can never really be known or determined; it represents the score that would be obtained if that individual took the test an infinite number of times and the average score was computed. If we could average all the scores, the result would represent a score without random error.
One way to think about a true score is to think about choosing a member for your competitive video gaming team: you could choose someone based on a single game, but your best chance lies with choosing someone based on his or her average success, which better estimates his or her true ability
Random Error
(E) part of classical test theory
The second part of an observed score consists of random error that occurs anytime a person takes a test.
Random error is defined as the difference between a person's actual score on a test (the observed score) and that person's true score (T).
Because it is random, over an infinite number of testings the error will increase and decrease a person's score by exactly the same amount; in other words, the mean of all the error scores over infinite testings will be zero.
Two other important characteristics of measurement error are that it is normally distributed and that it is uncorrelated with true scores
Random error lowers the reliability of a test
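A minimal simulation sketch of these properties (not from the source; the means, standard deviations, and sample size are arbitrary assumptions):
```python
# Sketch of classical test theory: X = T + E, where E is random error
# with mean zero that is uncorrelated with the true scores T.
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(loc=50, scale=10, size=10_000)   # T (hypothetical)
errors = rng.normal(loc=0, scale=5, size=10_000)          # E, mean zero
observed = true_scores + errors                           # X = T + E

print("Mean of errors (approx. 0):", errors.mean().round(2))
print("Corr(E, T) (approx. 0):", np.corrcoef(errors, true_scores)[0, 1].round(3))
```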
Measurement error is both:
Random error & Systematic error
Systematic Error
When a single source of error always increases or decreases the true score by the same amount. For example, a scale that consistently weighs 3 pounds more than the actual weight; in this case, the error your scale makes is predictable and systematic.
Systematic error is often difficult to identify; practice effects and order effects can add systematic error as well as random error to test scores.
Another important distinction between random error and systematic error is that random error lowers the reliability of a test, whereas systematic error does not, because a constant shift leaves the consistency of the scores intact
Reliability Coefficient
The correlation between the two sets of test scores
Determining True Score & Its Reliability Coefficient
We can never really determine a person's true score on any measure. A true score is the score that a person would get if he or she took a test an infinite number of times and we averaged all the results, which is something we can never actually do.
Because we cannot ever know what a person's true score actually is, we can never exactly calculate a reliability coefficient, but there are methods we can use to estimate it.
Methods to Measure Reliability Coefficient
The test retest method
The alternate forms method
The internal consistency method (split-half, coefficient alpha, and methods that evaluate scorer reliability or agreement)
Each of these methods takes into account various conditions that can produce inconsistencies in test scores. The method chosen to estimate reliability depends on the test itself and the conditions under which the test user plans to administer the test. Each method produces a numerical reliability coefficient, which enables us to estimate and evaluate the reliability of the test.
Parallel (Key Term)
Let's assume that we could build two different forms of a test that measured the exact same construct in exactly the same way. Technically, we would say that these alternate forms of the test were parallel. If we gave these two forms of the test to the same group of people, we would still not expect everyone to score exactly the same on the second administration as they did on the first, because there will always be some measurement error that influences everyone's scores in a random, non-predictable fashion. Of course, if the test were really measuring the same concepts in the same way, we would expect people's scores to be very similar across the two testing sessions. The more similar the scores are, the higher the reliability of the test.
Reliability Coefficient (Key Term)
If there were no measurement error, we would expect everyone's observed scores on the two parallel tests to be the same. If so, the correlation between the two sets of scores, called the reliability coefficient, would be a perfect 1.0. It would also be the case that if the two groups of test scores were exactly the same for all individuals, the variance of the scores on each test would be exactly the same as well.
A measure of the accuracy of a test obtained by measuring the same individuals twice and computing the correlation of the two sets of measures
Hypothetically, if no measurement error existed, observed scores on parallel tests would be the same, and the reliability coefficient would be a perfect 1.0, meaning perfectly reliable
The reason the addition of random error reduces the reliability of a test is that reliability is about estimating the proportion of variability in a set of observed test scores that is attributable only to true scores
Reliability is defined as true score variance divided by total observed score variance
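A sketch of this definition on simulated data (hypothetical numbers): with true-score variance 100 and error variance 25, both the variance ratio and the parallel-forms correlation should land near 100/125 = 0.8.
```python
# Reliability as true-score variance over observed-score variance,
# checked against the correlation between two simulated parallel forms.
import numpy as np

rng = np.random.default_rng(1)
true_scores = rng.normal(50, 10, size=10_000)           # T, var = 100
form_a = true_scores + rng.normal(0, 5, size=10_000)    # parallel form A
form_b = true_scores + rng.normal(0, 5, size=10_000)    # parallel form B

rel_definition = true_scores.var() / form_a.var()       # var(T) / var(X)
rel_parallel = np.corrcoef(form_a, form_b)[0, 1]        # reliability coefficient
print(round(rel_definition, 3), round(rel_parallel, 3)) # both approx. 0.8
```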
Test Retest Method
1/3 methods to measure reliability coefficient
A test developer gives the same test to the same group of test takers on two different occasions
Example: PAI (Personality Assessment Inventory)
Intervals between testing can vary from hours to years; reliability decreases over time
The main assumption is that test takers have not changed between the first and second administration
Keeping the circumstances of testing as stable as possible, as well as the individual's initial state of being (rested and fed), helps decrease error
Correlation: The scores from the first and second administrations are then compared
Can lead to practice effects
Practice Effects
*test retest methods
Practice effects occur when test takers benefit from taking the test the first time, which enables them to solve problems more quickly and correctly the second time
The test retest method is appropriate only when the test takers are not likely to learn something the first time they take the test, or when the interval is spaced appropriately
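A sketch of the correlation step in the test-retest method (the scores below are made up for illustration):
```python
# Hypothetical test-retest data: the same five test takers measured twice.
import numpy as np

time_1 = np.array([85, 90, 78, 92, 88])   # first administration
time_2 = np.array([83, 91, 80, 90, 87])   # second administration

r_test_retest = np.corrcoef(time_1, time_2)[0, 1]
print(round(r_test_retest, 3))            # approx. 0.95: scores are stable
```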
Alternate Forms Method
2/3 methods to measure reliability coefficient
Used when the test developer develops two different forms of the same test
The two different forms of a test are compared using correlation
example: TONI-4 (intelligence test)
The test developers assessed the alternate forms reliability by giving the two forms to the same group of subjects in the same testing session; results indicated that the correlation showed high reliability
Largest risk is inequivalence between the two forms
Alternate forms are much easier to develop for well-defined characteristics, such as mathematical ability, than for personality traits
Can lead to order effects
Order Effects
*alternate forms method
Changes in test scores resulting from the order in which the tests were taken
Internal Consistency Method
3/3 methods to measure reliability coefficient
Internal consistency method is a measure of how related the items, or groups of items, on the test are to one another
One can consider whether knowledge of how a person answered one item would give you information that would help you correctly predict how he or she answered another item
Split Half Reliability
*internal consistency
The split half method is to divide the test into halves and then compare the set of individual test scores on the first half with the set of individual test scores on the second half
The best way is to use random assignment to place each question in one half or the other
When using the split-half method, we must mathematically adjust the reliability coefficient to compensate for the impact of splitting the test in half, using the Spearman-Brown formula
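A sketch of split-half reliability with the Spearman-Brown correction (hypothetical item responses; the Spearman-Brown step, r_full = 2r / (1 + r), projects the half-test correlation back to full test length):
```python
# Split-half reliability: randomly assign items to halves, correlate the
# half scores, then apply the Spearman-Brown correction.
import numpy as np

rng = np.random.default_rng(2)
ability = rng.normal(0, 1, size=(200, 1))
items = ability + rng.normal(0, 1, size=(200, 10))   # 10 items, one trait

cols = rng.permutation(items.shape[1])               # random assignment to halves
half_1 = items[:, cols[:5]].sum(axis=1)
half_2 = items[:, cols[5:]].sum(axis=1)

r_half = np.corrcoef(half_1, half_2)[0, 1]
r_full = (2 * r_half) / (1 + r_half)                 # Spearman-Brown formula
print(round(r_half, 3), round(r_full, 3))
```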
Homogeneous tests vs Heterogeneous Tests
Homogeneous test only measure one trait or characteristic
- estimating reliability using methods of internal consistency is appropriate only for homogeneous tests
Heterogeneous test measure more than one trait or characteristic
- if a test is heterogeneous, the test developer should calculate and report an estimate of internal consistency for each homogeneous subtest or factor
Scorer Reliability
Scorer reliability or inter-scorer agreement:
The amount of consistency among scorers' judgement
An individual can make mistakes in scoring, which add error to test scores, particularly when the scorer must make judgments about whether an answer is right or wrong.
The amount of consistency among scorers' judgments becomes an important consideration for tests that require decisions by the administrator or scorer
Example: WCST (Wisconsin Card Sorting Test)
intrascorer reliability
whether each clinician was consistent in the way he or she assigned scores from test to test
interrater reliability
The degree to which raters are consistent in their observations and scoring in instances where more than one person scores the test results
Whenever you use humans as part of your measurement procedure, you have to worry about whether the results to get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.
There are several ways to estimate interrater reliability.
Measuring interrater reliability
1. Give the test once, and have it scored by two scorers (Pearson product moment correlation)
Measuring interrater agreement
1. Create a rating instrument, and have it completed by two judges (Cohen's kappa)
- nominal or ordinal
2. Calculate the consistency of scores for a single scorer. A single scorer rates or scores the same thing on more than one occasion. (Intraclass correlation coefficient)
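A sketch of Cohen's kappa for two judges making nominal decisions (the ratings are hypothetical; kappa corrects the observed agreement for agreement expected by chance):
```python
# Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement)
from collections import Counter

judge_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
judge_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "pass"]

n = len(judge_1)
p_observed = sum(a == b for a, b in zip(judge_1, judge_2)) / n   # 0.75
c1, c2 = Counter(judge_1), Counter(judge_2)
p_chance = sum(c1[k] * c2[k] for k in c1) / n ** 2               # 0.5

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 3))   # 0.5: moderate agreement beyond chance
```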
Interrater reliability
Refers to the extent to which two or more individuals agree. If the observers agreed perfectly on all items, then interrater reliability would be perfect
intraclass correlation
A special type of correlation appropriate for comparing responses of more than two raters or more than two sets of scores
interrater agreement
An index of how consistently the scorers rate or make decisions
intrarater agreement
When one scorer makes judgments, the researcher also wants assurance that the scorer makes consistent judgements across all tests
Coefficient Alpha / KR-20
Imagine that we compute one split-half reliability, then randomly divide the items into another set of split halves and recompute, and keep doing this until we have computed all possible split-half estimates of reliability. Cronbach's alpha is mathematically equivalent to the average of all possible split-half estimates, although that is not how we compute it.
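A sketch of coefficient alpha computed directly from its standard formula, alpha = (k / (k - 1)) × (1 − Σ item variances / total-score variance), on hypothetical item data:
```python
# Coefficient (Cronbach's) alpha from a people-x-items response matrix.
import numpy as np

rng = np.random.default_rng(3)
ability = rng.normal(0, 1, size=(300, 1))
items = ability + rng.normal(0, 1, size=(300, 8))    # 8 items, one trait

k = items.shape[1]
sum_item_vars = items.var(axis=0, ddof=1).sum()      # sum of item variances
total_var = items.sum(axis=1).var(ddof=1)            # variance of total scores

alpha = (k / (k - 1)) * (1 - sum_item_vars / total_var)
print(round(alpha, 3))
```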
Pearson correlation coefficient
The Pearson Product Moment Correlation ( r ) requires that the data collected on the two variables be CONTINUOUS (Interval or Ratio) and that the relationship between the two variables be LINEAR.
Range -1.00 (perfect negative) to +1.00 (perfect positive)
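A sketch computing r from its definition, r = cov(x, y) / (sd_x × sd_y), on made-up interval data:
```python
# Pearson product-moment correlation from covariance and standard deviations.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])    # hypothetical interval-level data
y = np.array([1.5, 3.9, 6.2, 7.8, 10.1])

r = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
print(round(r, 3))                          # matches np.corrcoef(x, y)[0, 1]
```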
standard error of measurement
SEM: the standard deviation of the sample scores multiplied by the square root of one minus the reliability of the scores, i.e., SEM = SD × √(1 − r)
It is an estimate of how much the individual's observed test score (X) might differ from the individual's true test score (T)
The reliability coefficient in the formula ranges from 0.0 (not reliable) to 1.0 (perfectly reliable)
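A minimal sketch of the SEM formula (the SD and reliability values are assumed for illustration):
```python
# SEM = SD * sqrt(1 - reliability)
import math

sd = 10.0           # standard deviation of the test scores (assumed)
reliability = 0.75  # reliability coefficient of the scores (assumed)

sem = sd * math.sqrt(1 - reliability)
print(sem)          # 5.0: expected spread of observed scores around the true score
```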
Difference between standard deviation and the standard error of measurement
The standard error of measurement estimates how repeated measures of a person on the same instrument tend to be distributed around his or her true score. The SEM is a direct function of the reliability of a test: the larger the SEM, the lower the reliability of the test and the less precision there is in the measures taken and scores obtained. Since all measurements contain some error, it is highly unlikely that any test will yield the same scores for a given person each time he or she is retested.
The SEM quantifies how much we would expect a student's score to vary if the student took the same test over and over on the same day.
The standard deviation measures the spread in the data; it is a property of the data set that does not change. We often estimate the standard deviation by measuring the standard error. The standard error measures the uncertainty in the estimate of the mean and depends on how you measure and on the sample size.
LOOKING AT NORMAL CURVE:
We will assume that the first administration of the test yields a population standard deviation of 10. If we set the mean of the curve at zero, then one standard deviation above the mean is +10 and one standard deviation below the mean is -10. The true score would fall within that middle range 68% of the time; the range within two standard deviations (-20 to +20) covers 95% of the population.
————-
Assume that the standard error of measurement has been calculated based on the reliability of the test. The standard error of measurement estimates how repeated measures of a person on the same instrument tend to be distributed around his or her true score. In this case the SEM is 5. When we apply the inferential curve to the sample, we use the obtained score as the sample mean and then construct a theoretically normal curve around that obtained score.
Raw score = 50
SEM = 5
95% confident his true score is between roughly 40 and 60 (50 ± 1.96 × 5 ≈ 50 ± 10)
Confidence intervals
A range of scores that we feel confident will include the test taker's true score
Formula for a 95% confidence interval:
95% CI = X ± 1.96(SEM)
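Applying the formula to the worked example above (raw score 50, SEM 5):
```python
# 95% confidence interval around an observed score: X +/- 1.96 * SEM
x, sem = 50, 5
lower = x - 1.96 * sem
upper = x + 1.96 * sem
print(lower, upper)   # 40.2 to 59.8, i.e. roughly 40 to 60
```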
Factors That Influence Reliability:
test length, homogeneity, test-retest interval, test administration, scoring, cooperation of test takers
generalizability theory
An approach to estimating reliability/precision
This theory concerns how well and under what conditions we can generalize an estimation of reliability/precision of test scores from one test administration to another. In other words, the test user can predict the reliability/precision of test scores obtained under different circumstances, such as administering a test in various plant locations or school systems. This theory proposes separating sources of systematic error from random error to eliminate systematic error.
Why is the separation of systematic error and random error important?
We can assume that if we were able to record the amount of random error in each measurement, the average error would be zero, and over time random error would not interfere with obtaining an accurate measurement. However, systematic error does affect the accuracy of a measurement; therefore, using this theory, our goal is to eliminate systematic error.
Using this theory, you could look for systematic or ongoing predictable error that occurs when you weigh yourself. For instance, the weight of your clothes and shoes will vary systematically depending on the weather and the time of year. Likewise, your weight will be greater later in the day. On the other hand, variations in the measurement mechanism and in your ability to read the scale accurately vary randomly. We would predict, therefore, that if you weighed yourself at the same time of day wearing the same clothes, or better yet none at all, you would have a more accurate measurement of your weight. When you have the most accurate measurement of your weight, you can confidently assume that changes in your weight from measurement to measurement are due to real weight gain or loss and not to measurement error.
Researchers and test developers identify systematic error in test scores by using a statistical procedure called analysis of variance (ANOVA). As you recall, we discussed four sources of error: the test itself, test administration, test scoring, and the test taker. Researchers and test developers can set up a generalizability study in which two or more of the sources of error (the independent variables) can be varied for the purpose of analyzing the variance of the test scores (the dependent variable) to find systematic error.
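A rough, simplified sketch of that idea for a persons × raters design (all numbers hypothetical; a real generalizability study would estimate variance components from ANOVA expected mean squares rather than this shortcut):
```python
# Decompose scores into person variance, systematic rater variance, and
# residual (random) error for a persons-x-raters layout.
import numpy as np

rng = np.random.default_rng(4)
true_scores = rng.normal(50, 10, size=(30, 1))    # 30 persons
rater_bias = np.array([0.0, 3.0, -2.0])           # systematic rater effects
scores = true_scores + rater_bias + rng.normal(0, 2, size=(30, 3))

grand = scores.mean()
var_persons = scores.mean(axis=1).var(ddof=1)     # person (true-score) spread
var_raters = scores.mean(axis=0).var(ddof=1)      # systematic (rater) spread
residual = (scores - scores.mean(axis=1, keepdims=True)
            - scores.mean(axis=0, keepdims=True) + grand)
print(round(var_persons, 1), round(var_raters, 1), round(residual.var(ddof=1), 2))
```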