Essentials of Test Validity
Validity, as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
It is a judgment based on evidence about the appropriateness of inferences drawn from test scores.
Characterizations of the validity of tests and test scores are frequently phrased in terms such as “acceptable” or “weak.” These terms reflect a judgment about how adequately the test measures what it purports to measure.
Inherent in a judgment of an instrument’s validity is a judgment of how useful it is for a particular purpose with a particular population of people.
No test or measurement technique is “universally valid” for all time, for all uses, with all types of test taker populations.
Tests may be shown to be valid within what we could characterize as reasonable boundaries of a contemplated usage.
Validation: the process of gathering and evaluating evidence about validity
It is the test developer’s responsibility to supply validity evidence in the test manual. It may sometimes be appropriate for test users to conduct their own validation studies with their own groups of test takers.
One way measurement specialists have traditionally conceptualized validity is according to three categories:
Content Validity
Criterion-Related Validity
Construct Validity
It might be useful to visualize construct validity as “umbrella validity,” since every other variety of validity falls under it.
Three approaches to assessing validity -- associated, respectively, with content validity, criterion-related validity, and construct validity -- are:
scrutinizing the test’s content
relating scores obtained on the test to other test scores or other measures
executing a comprehensive analysis of
how scores on the test relate to other test scores and measures
how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure
Accurate description, prediction, and explanation depend on the ability to manipulate or measure specific variables that are deemed important.
It affects our understanding of the world by guiding us in making decisions in research and about individuals.
Face validity relates more to what a test appears to measure to the person being tested than to what the test actually measures.
It is a judgment concerning how relevant the test items appear to be.
If a test definitely appears to measure what it purports to measure “on the face of it,” then it could be said to be high in face validity.
Judgments about face validity are frequently thought of from the perspective of the test taker, not the test user.
A test’s lack of face validity could contribute to a lack of confidence in the perceived effectiveness of the test -- with a consequent decrease in the test taker’s cooperation or motivation to do his or her best.
Content validity describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample.
Usually, from the pooled information (along with the judgment of the test developer), a test blueprint emerges for the “structure” of the evaluation; that is, a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, and so forth.
One method of measuring content validity, developed by C.H. Lawshe, is essentially a method for gauging agreement among raters or judges regarding how essential a particular item is.
Lawshe (1975) proposed that each rater respond to the following question for each item: “Is the skill or knowledge measured by this item (essential / useful but not essential / not necessary) to the performance of the job?”
For each item, the number of panelists stating that the item is essential is noted. According to Lawshe, if more than half the panelists indicate that an item is essential, that item has at least some content validity.
Content Validity Ratio
CVR = (ne − N/2) / (N/2)
wherein
ne = the number of subject-matter experts (SMEs) rating the item as “essential”
N = total number of SMEs providing ratings
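As a quick illustration of the formula, here is a minimal sketch in Python (the function name and the example panel of 12 raters are hypothetical, not from Lawshe):

```python
def content_validity_ratio(n_essential, n_raters):
    """Lawshe's CVR: ranges from -1 (no rater says "essential")
    to +1 (every rater does); 0 when exactly half the panel does."""
    return (n_essential - n_raters / 2) / (n_raters / 2)

# Example: 9 of 12 SMEs rate an item "essential"
print(content_validity_ratio(9, 12))  # (9 - 6) / 6 = 0.5
```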
Criterion-related validity is a judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest -- the measure of interest being the criterion.
Criterion
Usually, we define a criterion broadly as a standard on which a judgment or decision may be based.
In the context of criterion-related validity, criterion is the standard against which a test or a test score is evaluated.
Concurrent validity is an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently).
If test scores are obtained at about the same time that the criterion measures are obtained, measures of the relationship between the test scores and the criterion provide evidence of concurrent validity.
Statements of concurrent validity indicate the extent to which test scores may be used to estimate an individual’s present standing on a criterion.
Predictive validity is an index of the degree to which a test score predicts some criterion measure.
Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test; that is, how accurately scores on the test predict some criterion measure.
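In practice, the criterion-related validity coefficient is commonly a Pearson correlation between test scores and criterion scores. A minimal sketch with fabricated scores (all names and values are illustrative):

```python
import numpy as np

# Hypothetical data: scores on a new selection test and a criterion
# (e.g., later supervisor ratings for predictive validity, or ratings
# collected at about the same time for concurrent validity).
test_scores = np.array([12, 15, 9, 20, 17, 11, 14, 18])
criterion = np.array([3.1, 3.8, 2.5, 4.6, 4.0, 2.9, 3.4, 4.3])

# The validity coefficient is the Pearson r between test and criterion.
r = np.corrcoef(test_scores, criterion)[0, 1]
print(f"validity coefficient r = {r:.2f}")
```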
Construct validity is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.
A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior.
Evidence of construct validity can take several forms, for example:
The test is homogeneous, measuring a single construct.
Test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation as theoretically predicted.
Test scores obtained after some event or the mere passage of time (that is, post test scores) differ from pretest scores as theoretically predicted.
Test scores obtained by people from distinct groups vary as predicted by the theory.
Test scores correlate with scores on other tests in accordance with what would be predicted from a theory that covers the manifestation of the construct in question.
Homogeneity refers to how uniform a test is in measuring a single concept.
Subtests that in the test developer’s judgment do not correlate very well with the test as a whole might have to be reconstructed or eliminated lest the test not measure the construct.
If all test items show significant, positive correlations with total test scores, and if high scorers on the test tend to pass each item more than low scorers do, then each item is probably measuring the same construct as the total test. Each item is contributing to test homogeneity.
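A minimal sketch of such an item-total check, using fabricated 0/1 item responses (the corrected correlation excludes each item from the total so the item is not correlated with itself):

```python
import numpy as np

# Hypothetical pass/fail responses: rows = test takers, columns = items
items = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
])

for j in range(items.shape[1]):
    rest = items.sum(axis=1) - items[:, j]    # total excluding item j
    r = np.corrcoef(items[:, j], rest)[0, 1]  # corrected item-total r
    print(f"item {j + 1}: item-total r = {r:.2f}")
```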
If a test score purports to be a measure of a construct that could be expected to change over time, then the test score, too, should show the same progressive changes with age to be considered a valid measure of the construct.
Also called the method of contrasted groups, one way of providing evidence for the validity of a test is to demonstrate that scores on the test vary in a predictable way as a function of membership in some group.
The rationale here is that if a test is a valid measure of a particular construct, then groups of people presumed to differ with respect to that construct should have correspondingly different test scores.
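A minimal sketch of a contrasted-groups comparison, assuming fabricated scores from a clinical group and a control group that theory predicts should differ:

```python
import numpy as np
from scipy import stats

# Hypothetical anxiety-test scores from two groups that theory says
# should differ: an anxiety-disorder group and a control group.
clinical = np.array([34, 41, 38, 45, 39, 42, 36])
controls = np.array([22, 27, 25, 30, 24, 28, 26])

# If the test measures the construct, the group means should differ
# in the predicted direction.
t, p = stats.ttest_ind(clinical, controls)
print(f"t = {t:.2f}, p = {p:.4f}")
```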
Evidence for the construct validity of a particular test may converge from a number of sources, such as other tests or measures designed to assess the same or similar construct
If scores on the test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established, and already validated tests designed to measure the same or similar construct, this provides convergent evidence of construct validity.
A validity coefficient showing little (that is, a statistically insignificant) relationship between test scores and other variables with which scores on the test should not theoretically be correlated provides discriminant evidence of construct validity (also referred to as discriminant validity).
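A minimal sketch of the expected correlational pattern, using simulated scores (all variables are fabricated): the new test should correlate substantially with an established measure of the same construct and near zero with a theoretically unrelated one:

```python
import numpy as np

rng = np.random.default_rng(0)
established = rng.normal(size=100)                            # validated test, same construct
new_test = established + rng.normal(scale=0.5, size=100)      # tracks the same construct
unrelated = rng.normal(size=100)                              # theoretically unrelated measure

print(np.corrcoef(new_test, established)[0, 1])  # high: convergent evidence
print(np.corrcoef(new_test, unrelated)[0, 1])    # near zero: discriminant evidence
```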
Both convergent and discriminant evidence of construct validity can be obtained by the use of factor analysis.
Factor analysis is a shorthand term for a class of mathematical procedures designed to identify factors, or specific variables that are typically attributes, characteristics, or dimensions on which people may differ.
The purpose may be to identify the factor or factors in common between test scores on subscales within a particular test, or the factors in common between scores on a series of tests.
Exploratory factor analysis typically entails estimating or extracting factors, deciding how many factors to retain, and rotating factors to an interpretable orientation.
In confirmatory factor analysis, a factor structure is explicitly hypothesized and is tested for its fit with the observed covariance structure of the measured variables.
Factor loading is a term commonly employed in factor analysis; it conveys information about the extent to which the factor determines the test score or scores.
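A minimal sketch using scikit-learn’s FactorAnalysis on simulated data (the two latent abilities and six subtests are fabricated, and rotation="varimax" requires a recent scikit-learn); after rotation, the components_ matrix holds the loadings -- the extent to which each factor determines each observed score:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500

# Two fabricated latent constructs (e.g., verbal and spatial ability)
verbal = rng.normal(size=n)
spatial = rng.normal(size=n)

def noise():
    return rng.normal(scale=0.4, size=n)

# Six observed subtest scores: three driven by each latent construct
scores = np.column_stack([
    verbal + noise(), verbal + noise(), verbal + noise(),
    spatial + noise(), spatial + noise(), spatial + noise(),
])

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(scores)

# Rows = factors, columns = subtests; large absolute loadings show
# which subtests each factor determines.
print(np.round(fa.components_, 2))
```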