Reliability/Precision
Refers to the consistency of test scores in general
The most important attribute of a measurement instrument is its reliability. Reliability describes the consistency of test scores; every score contains some degree of error, which can affect reliability and consistency.
Measurement Error
Variations in measurement using a reliable instrument
Example: a yardstick - user error
Errors are usually due to random mistakes or inconsistencies of the person using the measurement tool. The tool has internal consistency.
Reliable Test
Is one we can trust to measure each person in approximately the same way every time it is used
A test must be reliable if it is to be used to measure attributes and compare people
Just because a test has been shown to produce reliable scores does not mean the test is also valid. Evidence of reliability does not mean that the inferences a test user makes from scores on the test are correct, or that the test is being used properly (validity)
Classical Test Theory
According to classical test theory, a person's test score (called the observed score) is made up of two independent parts: a true score and random error
X = T + E
True Score
(T) part of classical test theory
A true score is a measurement of the amount of the attribute that the test is designed to measure
An individual's true score on a test is a value that can never really be known or determined; it represents the score that would be obtained if that individual took the test an infinite number of times and the average score was computed. If we could average all the scores, the result would represent a score without random error.
One way to think about a true score is to think about choosing a member for your competitive video gaming team: you could choose someone based on a single game, but your best chance lies with choosing someone based on his or her average success, which better estimates his or her true ability
Random Error
(E) part of classical test theory
The second part of an observed score consists of random error that occurs anytime a person takes a test.
Random error is defined as the difference between a person's actual score on a test (the observed score) and that person's true score (T).
Because it is random, over an infinite number of testings the error will increase and decrease a person's score by exactly the same amount; in other words, the mean of all the error scores over infinite testings will be zero.
Two other important characteristics of measurement error are that it is normally distributed and that it is uncorrelated with true scores
Random error lowers the reliability of a test
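A minimal simulation sketch of these properties (not from the source; the means, standard deviations, and sample size are arbitrary assumptions):
```python
# Sketch of classical test theory: X = T + E, where E is random error
# with mean zero that is uncorrelated with the true scores T.
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(loc=50, scale=10, size=10_000)   # T (hypothetical)
errors = rng.normal(loc=0, scale=5, size=10_000)          # E, mean zero
observed = true_scores + errors                           # X = T + E

print("Mean of errors (approx. 0):", errors.mean().round(2))
print("Corr(E, T) (approx. 0):", np.corrcoef(errors, true_scores)[0, 1].round(3))
```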
Measurement error is both:
Random error & Systematic error
Systematic Error
When a single source of error always increases or decreases the true score by the same amount. For example, a scale that consistently weighs 3 pounds more than the actual weight; in this case, the error your scale makes is predictable and systematic.
Systematic error is often difficult to identify; practice effects and order effects can add systematic error as well as random error to test scores.
Another important distinction between random error and systematic error is that random error lowers the reliability of a test, whereas systematic error does not, because a constant shift leaves the consistency of the scores intact
Reliability Coefficient
The correlation between the two sets of test scores
Determining True Score & Its Reliability Coefficient
We can never really determine a person's true score on any measure. A true score is the score that a person would get if he or she took a test an infinite number of times and we averaged all the results, which is something we can never actually do.
Because we cannot ever know what a person's true score actually is, we can never exactly calculate a reliability coefficient, but there are methods we can use to estimate it.
Methods to Measure Reliability Coefficient
The test retest method
The alternate forms method
The internal consistency method (split-half, coefficient alpha, and methods that evaluate scorer reliability or agreement)
Each of these methods takes into account various conditions that can produce inconsistencies in test scores. The method chosen to estimate reliability depends on the test itself and the conditions under which the test user plans to administer the test. Each method produces a numerical reliability coefficient, which enables us to estimate and evaluate the reliability of the test.
Parallel (Key Term)
Let's assume that we could build two different forms of a test that measured the exact same construct in exactly the same way. Technically, we would say that these alternate forms of the test were parallel. If we gave these two forms of the test to the same group of people, we would still not expect everyone to score exactly the same on the second administration as they did on the first, because there will always be some measurement error that influences everyone's scores in a random, non-predictable fashion. Of course, if the test were really measuring the same concepts in the same way, we would expect people's scores to be very similar across the two testing sessions. The more similar the scores are, the higher the reliability of the test.
Reliability Coefficient (Key Term)
If there were no measurement error, we would expect everyone's observed scores on the two parallel tests to be the same. If so, the correlation between the two sets of scores, called the reliability coefficient, would be a perfect 1.0. It would also be the case that if the two groups of test scores were exactly the same for all individuals, the variance of the scores on each test would be exactly the same as well.
A measure of the accuracy of a test obtained by measuring the same individuals twice and computing the correlation of the two sets of measures
Hypothetically, if no measurement error existed, observed scores on parallel tests would be the same, and the reliability coefficient would be a perfect 1.0, meaning perfectly reliable
The reason the addition of random error reduces the reliability of a test is that reliability is about estimating the proportion of variability in a set of observed test scores that is attributable only to true scores
Reliability is defined as true score variance divided by total observed score variance
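A sketch of this definition on simulated data (hypothetical numbers): with true-score variance 100 and error variance 25, both the variance ratio and the parallel-forms correlation should land near 100/125 = 0.8.
```python
# Reliability as true-score variance over observed-score variance,
# checked against the correlation between two simulated parallel forms.
import numpy as np

rng = np.random.default_rng(1)
true_scores = rng.normal(50, 10, size=10_000)           # T, var = 100
form_a = true_scores + rng.normal(0, 5, size=10_000)    # parallel form A
form_b = true_scores + rng.normal(0, 5, size=10_000)    # parallel form B

rel_definition = true_scores.var() / form_a.var()       # var(T) / var(X)
rel_parallel = np.corrcoef(form_a, form_b)[0, 1]        # reliability coefficient
print(round(rel_definition, 3), round(rel_parallel, 3)) # both approx. 0.8
```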
Test Retest Method
1/3 methods to measure reliability coefficient
A test developer gives the same test to the same group of test takers on two different occasions
Example: PAI (Personality Assessment Inventory)
Intervals between testing can vary from hours to years; reliability decreases over time
The main assumption is that test takers have not changed between the first and second administration
Keeping the circumstances of testing as stable as possible, as well as the individual's initial state of being (rested and fed), helps decrease error
Correlation: The scores from the first and second administrations are then compared
Can lead to practice effects
Practice Effects
*test retest methods
Practice effects occur when test takers benefit from taking the test the first time, which enables them to solve problems more quickly and correctly the second time
The test retest method is appropriate only when the test takers are not likely to learn something the first time they take the test, or when the interval is spaced appropriately
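A sketch of the correlation step in the test-retest method (the scores below are made up for illustration):
```python
# Hypothetical test-retest data: the same five test takers measured twice.
import numpy as np

time_1 = np.array([85, 90, 78, 92, 88])   # first administration
time_2 = np.array([83, 91, 80, 90, 87])   # second administration

r_test_retest = np.corrcoef(time_1, time_2)[0, 1]
print(round(r_test_retest, 3))            # approx. 0.95: scores are stable
```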
Alternate Forms Method
2/3 methods to measure reliability coefficient
Used when the test developer develops two different forms of the same test
The two different forms of a test are compared using correlation
example: TONI-4 (intelligence test)
The test developers assessed the alternate forms reliability by giving the two forms to the same group of subjects in the same testing session; results indicated that the correlation showed high reliability
Largest risk is inequivalence between the two forms
Alternate forms are much easier to develop for well-defined characteristics, such as mathematical ability, than for personality traits
Can lead to order effects
Order Effects
*alternate forms method
Changes in test scores resulting from the order in which the tests were taken
Internal Consistency Method
3/3 methods to measure reliability coefficient
Internal consistency method is a measure of how related the items, or groups of items, on the test are to one another
One can consider whether knowledge of how a person answered one item would give you information that would help you correctly predict how he or she answered another item
Split Half Reliability
*internal consistency
The split half method is to divide the test into halves and then compare the set of individual test scores on the first half with the set of individual test scores on the second half
The best way is to use random assignment to place each question in one half or the other
When using the split-half method, we must mathematically adjust the reliability coefficient to compensate for the impact of splitting the test in half, using the Spearman-Brown formula
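A sketch of split-half reliability with the Spearman-Brown correction (hypothetical item responses; the Spearman-Brown step, r_full = 2r / (1 + r), projects the half-test correlation back to full test length):
```python
# Split-half reliability: randomly assign items to halves, correlate the
# half scores, then apply the Spearman-Brown correction.
import numpy as np

rng = np.random.default_rng(2)
ability = rng.normal(0, 1, size=(200, 1))
items = ability + rng.normal(0, 1, size=(200, 10))   # 10 items, one trait

cols = rng.permutation(items.shape[1])               # random assignment to halves
half_1 = items[:, cols[:5]].sum(axis=1)
half_2 = items[:, cols[5:]].sum(axis=1)

r_half = np.corrcoef(half_1, half_2)[0, 1]
r_full = (2 * r_half) / (1 + r_half)                 # Spearman-Brown formula
print(round(r_half, 3), round(r_full, 3))
```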
Homogeneous tests vs Heterogeneous Tests
Homogeneous test only measure one trait or characteristic
- estimating reliability using methods of internal consistency is appropriate only for homogeneous tests
Heterogeneous test measure more than one trait or characteristic
- if a test is heterogeneous, the test developer should calculate and report an estimate of internal consistency for each homogeneous subtest or factor
Scorer Reliability
Scorer reliability or inter-scorer agreement:
The amount of consistency among scorers' judgement
An individual can make mistakes in scoring, which add error to test scores, particularly when the scorer must make judgments about whether an answer is right or wrong.
The amount of consistency among scorers' judgments becomes an important consideration for tests that require decisions by the administrator or scorer
Example: WCST (Wisconsin Card Sorting Test)
intrascorer reliability
whether each clinician was consistent in the way he or she assigned scores from test to test
interrater reliability
The degree to which raters are consistent in their observations and scoring in instances where more than one person scores the test results
Whenever you use humans as part of your measurement procedure, you have to worry about whether the results to get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.
There are several ways to estimate interrater reliability.
Measuring interrater reliability
1. Give the test once, and have it scored by two scorers (Pearson product moment correlation)
Measuring interrater agreement
1. Create a rating instrument, and have it completed by two judges (Cohen's kappa)
- nominal or ordinal
2. Calculate the consistency of scores for a single scorer. A single scorer rates or scores the same thing on more than one occasion. (Intraclass correlation coefficient)
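A sketch of Cohen's kappa for two judges making nominal decisions (the ratings are hypothetical; kappa corrects the observed agreement for agreement expected by chance):
```python
# Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement)
from collections import Counter

judge_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
judge_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "pass"]

n = len(judge_1)
p_observed = sum(a == b for a, b in zip(judge_1, judge_2)) / n   # 0.75
c1, c2 = Counter(judge_1), Counter(judge_2)
p_chance = sum(c1[k] * c2[k] for k in c1) / n ** 2               # 0.5

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 3))   # 0.5: moderate agreement beyond chance
```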
Interrater reliability
Refers to the extent to which two or more individuals agree. If the observers agreed perfectly on all items, then interrater reliability would be perfect
intraclass correlation
A special type of correlation appropriate for comparing responses of more than two raters or more than two sets of scores
interrater agreement
An index of how consistently the scorers rate or make decisions
intrarater agreement
When one scorer makes judgments, the researcher also wants assurance that the scorer makes consistent judgements across all tests
Coefficient Alpha / KR-20
Imagine that we compute one split-half reliability, then randomly divide the items into another set of split halves and recompute, and keep doing this until we have computed all possible split-half estimates of reliability. Cronbach's alpha is mathematically equivalent to the average of all possible split-half estimates, although that is not how we compute it.
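A sketch of coefficient alpha computed directly from its standard formula, alpha = (k / (k - 1)) × (1 − Σ item variances / total-score variance), on hypothetical item data:
```python
# Coefficient (Cronbach's) alpha from a people-x-items response matrix.
import numpy as np

rng = np.random.default_rng(3)
ability = rng.normal(0, 1, size=(300, 1))
items = ability + rng.normal(0, 1, size=(300, 8))    # 8 items, one trait

k = items.shape[1]
sum_item_vars = items.var(axis=0, ddof=1).sum()      # sum of item variances
total_var = items.sum(axis=1).var(ddof=1)            # variance of total scores

alpha = (k / (k - 1)) * (1 - sum_item_vars / total_var)
print(round(alpha, 3))
```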
Pearson correlation coefficient
The Pearson Product Moment Correlation ( r ) requires that the data collected on the two variables be CONTINUOUS (Interval or Ratio) and that the relationship between the two variables be LINEAR.
Range -1.00 (perfect negative) to +1.00 (perfect positive)
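A sketch computing r from its definition, r = cov(x, y) / (sd_x × sd_y), on made-up interval data:
```python
# Pearson product-moment correlation from covariance and standard deviations.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])    # hypothetical interval-level data
y = np.array([1.5, 3.9, 6.2, 7.8, 10.1])

r = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
print(round(r, 3))                          # matches np.corrcoef(x, y)[0, 1]
```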
standard error of measurement
SEM: the standard deviation of the sample scores multiplied by the square root of one minus the reliability of the scores, i.e., SEM = SD × √(1 − r)
It is an estimate of how much the individual's observed test score (X) might differ from the individual's true test score (T)
The reliability coefficient in the formula ranges from 0.0 (not reliable) to 1.0 (perfectly reliable)
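A minimal sketch of the SEM formula (the SD and reliability values are assumed for illustration):
```python
# SEM = SD * sqrt(1 - reliability)
import math

sd = 10.0           # standard deviation of the test scores (assumed)
reliability = 0.75  # reliability coefficient of the scores (assumed)

sem = sd * math.sqrt(1 - reliability)
print(sem)          # 5.0: expected spread of observed scores around the true score
```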
Difference between standard deviation and the standard error of measurement
The standard error of measurement estimates how repeated measures of a person on the same instrument tend to be distributed around his or her true score. The SEM is a direct function of the reliability of a test: the larger the SEM, the lower the reliability of the test and the less precision there is in the measures taken and scores obtained. Since all measurements contain some error, it is highly unlikely that any test will yield the same scores for a given person each time he or she is retested.
The SEM quantifies how much we would expect a student's score to vary if the student took the same test over and over on the same day.
The standard deviation measures the spread in the data; it is a property of the data set that does not change. We often estimate the standard deviation by measuring the standard error. The standard error measures the uncertainty in the estimate of the mean and depends on how you measure and on the sample size.
LOOKING AT NORMAL CURVE:
We will assume that the first administration of the test yields a population standard deviation of 10. If we set the mean of the curve at zero, then one standard deviation above the mean is +10 and one standard deviation below the mean is -10. The true score would fall within that middle range 68% of the time; the range within two standard deviations (-20 to +20) covers 95% of the population.
————-
Assume that the standard error of measurement has been calculated based on the reliability of the test. The standard error of measurement estimates how repeated measures of a person on the same instrument tend to be distributed around his or her true score. In this case the SEM is 5. When we apply the inferential curve to the sample, we use the obtained score as the sample mean and then construct a theoretically normal curve around that obtained score.
Raw score = 50
SEM = 5
95% confident his true score is between roughly 40 and 60 (50 ± 1.96 × 5 ≈ 50 ± 10)
Confidence intervals
A range of scores that we feel confident will include the test taker's true score
Formula for a 95% confidence interval:
95% CI = X ± 1.96(SEM)
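Applying the formula to the worked example above (raw score 50, SEM 5):
```python
# 95% confidence interval around an observed score: X +/- 1.96 * SEM
x, sem = 50, 5
lower = x - 1.96 * sem
upper = x + 1.96 * sem
print(lower, upper)   # 40.2 to 59.8, i.e. roughly 40 to 60
```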
Factors That Influence Reliability:
test length, homogeneity, test-retest interval, test administration, scoring, cooperation of test takers
generalizability theory
An approach to estimating reliability/precision
This theory concerns how well and under what conditions we can generalize an estimation of reliability/precision of test scores from one test administration to another. In other words, the test user can predict the reliability/precision of test scores obtained under different circumstances, such as administering a test in various plant locations or school systems. This theory proposes separating sources of systematic error from random error to eliminate systematic error.
Why is the separation of systematic error and random error important?
We can assume that if we were able to record the amount of random error in each measurement, the average error would be zero, and over time random error would not interfere with obtaining an accurate measurement. However, systematic error does affect the accuracy of a measurement; therefore, using this theory, our goal is to eliminate systematic error.
Using this theory, you could look for systematic or ongoing predictable error that occurs when you weigh yourself. For instance, the weight of your clothes and shoes will vary systematically depending on the weather and the time of year. Likewise, your weight will be greater later in the day. On the other hand, variations in the measurement mechanism and in your ability to read the scale accurately vary randomly. We would predict, therefore, that if you weighed yourself at the same time of day wearing the same clothes, or better yet none at all, you would have a more accurate measurement of your weight. When you have the most accurate measurement of your weight, you can confidently assume that changes in your weight from measurement to measurement are due to real weight gain or loss and not to measurement error.
Researchers and test developers identify systematic error in test scores by using a statistical procedure called analysis of variance (ANOVA). As you recall, we discussed four sources of error: the test itself, test administration, test scoring, and the test taker. Researchers and test developers can set up a generalizability study in which two or more of the sources of error (the independent variables) can be varied for the purpose of analyzing the variance of the test scores (the dependent variable) to find systematic error.
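A rough, simplified sketch of that idea for a persons × raters design (all numbers hypothetical; a real generalizability study would estimate variance components from ANOVA expected mean squares rather than this shortcut):
```python
# Decompose scores into person variance, systematic rater variance, and
# residual (random) error for a persons-x-raters layout.
import numpy as np

rng = np.random.default_rng(4)
true_scores = rng.normal(50, 10, size=(30, 1))    # 30 persons
rater_bias = np.array([0.0, 3.0, -2.0])           # systematic rater effects
scores = true_scores + rater_bias + rng.normal(0, 2, size=(30, 3))

grand = scores.mean()
var_persons = scores.mean(axis=1).var(ddof=1)     # person (true-score) spread
var_raters = scores.mean(axis=0).var(ddof=1)      # systematic (rater) spread
residual = (scores - scores.mean(axis=1, keepdims=True)
            - scores.mean(axis=0, keepdims=True) + grand)
print(round(var_persons, 1), round(var_raters, 1), round(residual.var(ddof=1), 2))
```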