Reliability of Test Scores and Test Items
an umbrella term under which different types of score stability are assessed
suggests trustworthiness and stability
can pertain to stability of scores over time (test-retest), stability of item scores across items (internal consistency), or stability of ratings across judges, or raters, of a person, object, event, and so on (interrater reliability)
a quality of test scores that suggests they are sufficiently consistent and free from measurement error to be useful
the evaluation of score reliability involves a 2-step process that consists of (a) determining what possible sources of error may enter into test scores and (b) estimating the magnitude of those errors
Error can enter into the scores of psychological tests for an enormous number of reasons, many of which are outside the purview of psychometric estimates of reliability.
Generally speaking, however, the errors that enter into test scores may be categorized as stemming from one or more of the following 3 sources
the context in which the testing takes place
the test taker
the test itself
when a test is scored with a degree of subjectivity
the label assigned to the error that may enter into scores whenever the element of subjectivity plays a part in scoring a test
it is assumed that different judges will not always assign the same exact scores or ratings to a given test performance even if:
the scoring directions specified in the test manual are explicit and detailed
the scorers are conscientious in applying those directions
it refers to variations in scores that stem from differences in the subjective judgement of the scorers
Scorer Reliability
refers to the variability inherent in test scores as a function of the fact that they are obtained at one point in time rather than at another
whereas a certain amount of time sampling error is assumed to enter into all test scores, as a rule, one should expect less of it in the scores of tests that assess relatively stable traits
Test-Retest Reliability
the term used to label the trait-irrelevant variability that can enter into test scores as a result of fortuitous factors related to the content of the specific items included in a test
Alternate-Form Reliability
To investigate this kind of reliability, 2 or more different forms of the test -- identical in purpose but differing in specific content -- need to be prepared and administered to the same group of subjects. The test takers' scores on each of the versions are then correlated to obtain alternate-form reliability coefficients
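The coefficient itself is just the Pearson correlation between the two sets of scores. A minimal sketch (the form scores below are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical scores for the same 6 test takers on two parallel forms
form_a = np.array([12, 15, 9, 20, 17, 11])
form_b = np.array([13, 14, 10, 19, 18, 12])

# The alternate-form coefficient (coefficient of equivalence) is the
# Pearson r between scores on the two forms
r_equivalence = np.corrcoef(form_a, form_b)[0, 1]
print(f"alternate-form reliability = {r_equivalence:.3f}")
```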
Split-Half Reliability
Administer a test to a group of individuals and create 2 scores for each person by splitting the test into halves
refers to error in scores that results from fluctuations in items across an entire test, as opposed to the content sampling error emanating from the particular configuration of items included in the test as a whole
Such inconsistencies can be due to a variety of factors, including content sampling error and content heterogeneity
Content Heterogeneity
results from the inclusion of items or sets of items that tap content knowledge or psychological functions that differ from those tapped by other items in the same test
can be checked using split-half reliability or inter-item consistency
the 2 most frequently used formulas for calculating inter-item consistency are the Kuder-Richardson Formula 20 (KR-20) and coefficient alpha (α), also known as Cronbach's alpha
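Neither formula appears in these notes, so here is a hedged NumPy sketch of both (the 0/1 score matrix is hypothetical; population variances, ddof=0, are used throughout so that KR-20 and alpha agree exactly on dichotomous items, though some texts use sample variances instead):

```python
import numpy as np

def cronbach_alpha(items):
    # alpha = (k / (k-1)) * (1 - sum(item variances) / total-score variance)
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0)         # variance of each item (ddof=0)
    total_var = items.sum(axis=1).var()   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def kr20(items):
    # KR-20 = (k / (k-1)) * (1 - sum(p*q) / total-score variance), items scored 0/1
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)                # proportion passing each item
    total_var = items.sum(axis=1).var()
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

# Hypothetical 0/1 responses: 5 test takers x 4 items
scores = [[1, 1, 0, 1],
          [1, 0, 0, 0],
          [1, 1, 1, 1],
          [0, 0, 0, 1],
          [1, 1, 1, 0]]
print(f"KR-20 = {kr20(scores):.3f}, alpha = {cronbach_alpha(scores):.3f}")
```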
An estimate of reliability obtained by correlating pairs of scores from the same people on 2 different administrations of the same test
Appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait
The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms coefficient of reliability, which is often termed the coefficient of equivalence
Alternate-Forms
simply different versions of a test that have been constructed so as to be parallel
An estimate of split-half reliability is obtained by correlating 2 pairs of scores obtained from equivalent halves of a single test administered once
It is a useful measure of reliability when it is impractical or undesirable to assess reliability with 2 tests or to administer a test twice (because of factors such as time and expense)
Step 1: Divide the test into equivalent halves
Step 2: Calculate a Pearson r between scores on the 2 halves of the test
Step 3: Adjust the half-test reliability using the Spearman-Brown formula
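Putting the 3 steps together, a minimal sketch (the item scores are hypothetical, and the odd-even split shown is just one common way of forming "equivalent halves"):

```python
import numpy as np

# Hypothetical 0/1 item scores: 6 test takers x 8 items
items = np.array([[1, 1, 0, 1, 1, 0, 1, 1],
                  [0, 1, 0, 0, 1, 0, 0, 1],
                  [1, 1, 1, 1, 1, 1, 0, 1],
                  [0, 0, 0, 1, 0, 0, 0, 0],
                  [1, 0, 1, 1, 1, 1, 1, 1],
                  [0, 1, 0, 0, 0, 1, 0, 0]])

# Step 1: divide the test into equivalent halves (here, odd vs. even items)
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Step 2: Pearson r between scores on the 2 halves
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Step 3: Spearman-Brown adjustment, r_sb = 2 * r_half / (1 + r_half),
# which estimates the reliability of the full-length test
r_sb = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.3f}, Spearman-Brown adjusted = {r_sb:.3f}")
```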
refers to the degree of correlation among all the items on a scale
An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test.
the degree of agreement or consistency between 2 or more scorers (judges or raters) with regard to a particular measure
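No specific agreement statistic is named in these notes; one common choice for categorical ratings is Cohen's kappa, which corrects raw percent agreement for the agreement expected by chance. A minimal sketch with hypothetical ratings:

```python
import numpy as np

# Hypothetical ratings (categories 1-3) by 2 judges of the same 8 performances
rater_1 = np.array([1, 2, 2, 3, 1, 2, 3, 3])
rater_2 = np.array([1, 2, 3, 3, 1, 2, 3, 2])

# Observed agreement: proportion of performances rated identically
p_observed = np.mean(rater_1 == rater_2)

# Chance agreement: summed products of each rater's category proportions
categories = np.union1d(rater_1, rater_2)
p_chance = sum(np.mean(rater_1 == c) * np.mean(rater_2 == c) for c in categories)

# Cohen's kappa = (p_o - p_c) / (1 - p_c)
kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"agreement = {p_observed:.2f}, kappa = {kappa:.2f}")
```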
describes the degree to which a test measures different factors. A heterogeneous, or nonhomogeneous, test is composed of items that measure more than one trait
increase the number of items (see the Spearman-Brown sketch after this list)
factor and item analysis
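To see why adding items helps, the general Spearman-Brown formula (the Step 3 adjustment above is its special case for doubling) predicts the reliability of a lengthened test. A sketch, assuming the added items are parallel to the existing ones:

```python
def spearman_brown_prophecy(r_current, length_factor):
    # Predicted reliability when a test is lengthened by length_factor
    # (e.g., 2.0 = twice as many items), assuming the new items are
    # parallel to the existing ones:
    # r_new = n * r / (1 + (n - 1) * r)
    n = length_factor
    return n * r_current / (1 + (n - 1) * r_current)

# e.g., a test with reliability .70, doubled in length
print(f"{spearman_brown_prophecy(0.70, 2.0):.3f}")  # prints 0.824
```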