Validity is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
It is a judgment based on evidence about the appropriateness of inferences drawn from test scores.
Inference: A logical result or deduction.
Characterizations of validity are frequently phrased in terms such as “acceptable” or “weak”.
A test has been shown to be valid for a particular use with a particular population of test-takers at a particular time.
Tests should be valid within reasonable boundaries of the contemplated usage; once those boundaries are exceeded, validity may be questioned.
The validity of a test may diminish as the culture or the times change; validity may therefore have to be re-established with the same, as well as other, test-taker populations.
Validation is the process of gathering and evaluating evidence about validity.
Test developers provide validity evidence in the test manual.
Test users conduct local validation studies which may yield insights regarding a particular population of test-takers as compared to the norming sample described in a test manual.
Validation is necessary when a test user intends to alter a test’s format, instructions, language, or content.
Validation is necessary if a test user seeks to use a test with a population of test-takers that differs in some significant way from the population on which the test was standardized.
Trinitarian View of Validity
Content validity, construct validity, and criterion-related validity all contribute to a unified picture of a test’s validity.
A test user may not need to know about all three as one type of validity evidence may be more relevant than others depending on the use of the test.
Ecological Momentary Assessment (EMA)
Refers to in-the-moment and in-the-place evaluation of targeted variables (more analogous to states than traits).
Uses ecological validity, which is a judgment regarding how well a test measures what it purports to measure at the time and place that the variable being measured is actually emitted.
Higher ecological validity = greater generalizability of the measurement results to particular real-life circumstances.
Face Validity
What a test appears to measure to the person being tested rather than what the test actually measures.
Not a technical judgment, but rather a gut-level judgment concerning how relevant the test items appear to be.
High face validity = the test definitely APPEARS to measure what it is supposed to measure.
Frequently thought of from the perspective of the test-taker, not the test user.
Lack of face validity could contribute to a lack of confidence in the perceived effectiveness of the test, and a consequential decrease in test-taker cooperation or motivation to do their best.
Concerns stem from the belief that the use of such tests will result in invalid conclusions.
A test that lacks face validity may nonetheless be valid, relevant, and useful.
Face validity may be more a matter of public relations than psychometric soundness.
Content Validity
Describes a judgment of how adequately a test samples behavior representative of the universe of that behavior that the test was designed to sample.
Do the test items reflect the behavior or characteristic being measured?
Test developers have a clear vision of the construct being measured, and the clarity of this vision can be reflected in the content validity of the test.
Include key components of the construct and exclude content irrelevant to the construct being measured.
Example: Educational achievement tests are content-valid measures when the proportion of the material covered by the test approximates the proportion of material covered in the course.
Test blueprint: “Structure” of the evaluation plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of items in the test, etc.
Represents the culmination of efforts to adequately sample the universe of content areas that conceivably could be sampled in such a test.
In general, the information needed to blueprint content that is representative of the construct must come from subject matter experts.
For an employment test to be content-valid, its content must be a representative sample of the job-related skills required for employment.
Blueprinting techniques include interviews with, or behavioral observation of, workers and supervisors.
Criterion-Related Validity
Judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest—the criterion.
Criterion: The standard against which a test or test score is evaluated.
An adequate criterion is relevant (pertinent or applicable to the matter at hand).
An adequate criterion is valid: if one test (X) is being used as the criterion to validate a second test (Y), then evidence should exist that test X is itself valid.
If the criterion is a rating by a judge or a panel, then evidence should exist that the rating is valid (e.g., check the credentials of the raters).
An adequate criterion is also uncontaminated. Criterion contamination: a criterion measure that has been based, at least in part, on predictor measures.
Two types:
Concurrent validity: Index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently).
Extent to which test scores may be used to estimate an individual’s present standing on a criterion.
Relationship between the test scores and the criterion measures that are obtained at about the same time.
Example: Test A is the predictor and Test B is a validated measure of the variable that Test A purports to measure. In concurrent validation, both tests are administered at about the same time and scores are obtained; a positive relationship suggests that Test A measures the same variable as Test B.
The question to be addressed is “How well does Test A compare with Test B?”
Test B is the validating criterion.
If scores on Test A (the new test) relate well to scores on Test B, we could conclude that Test A is a valid measure of the variable because its results are on par with those of Test B (an established, valid measure of the variable); a minimal sketch of this check follows.
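A minimal sketch of this check in Python, assuming both tests were administered to the same group at about the same time (all scores below are invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical scores for ten test-takers who took both tests concurrently.
test_a = np.array([12, 15, 9, 20, 17, 11, 14, 18, 10, 16])  # new test (predictor)
test_b = np.array([48, 55, 40, 68, 60, 44, 52, 63, 41, 58])  # established, validated test

# The validity coefficient here is the Pearson correlation between the two score sets.
r, p = stats.pearsonr(test_a, test_b)
print(f"Concurrent validity coefficient: r = {r:.2f} (p = {p:.3f})")
```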
Predictive validity: Index of the degree to which a test score predicts some criterion measures obtained at a future time, usually after some intervening event has taken place.
How accurately scores on the test predict some criterion measure.
Considerations in evaluating predictive validity:
Base rate: Extent to which a particular trait, behavior, characteristic, or attribute exists in the population, expressed as a proportion.
Hit rate: Proportion of people a test accurately identifies as possessing or exhibiting a particular trait, behavior, characteristic, or attribute.
Miss rate: Proportion of people the test fails to identify as having, or not having, a particular characteristic or attribute.
Types of misses:
False positive: A miss wherein the test predicted the test-taker DID possess the particular characteristic or attribute being measured when in fact the test-taker DID NOT.
False negative: A miss wherein the test predicted the test-taker DID NOT possess the particular characteristic or attribute being measured when in fact the test-taker ACTUALLY DID.
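A worked sketch of these quantities, assuming a binary test decision and a known true status for each test-taker (all data invented; “hit rate” is read here as the proportion of correct classifications):

```python
import numpy as np

# Hypothetical data: 1 = possesses the attribute, 0 = does not.
truth     = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])  # actual status
predicted = np.array([1, 1, 0, 0, 1, 0, 0, 1, 0, 0])  # test's classification

base_rate = truth.mean()                              # proportion who have the attribute
hit_rate  = (predicted == truth).mean()               # correct classifications
false_pos = ((predicted == 1) & (truth == 0)).mean()  # flagged, but lacks the attribute
false_neg = ((predicted == 0) & (truth == 1)).mean()  # not flagged, but has the attribute
miss_rate = false_pos + false_neg                     # all misclassifications

print(f"base rate = {base_rate:.1f}, hit rate = {hit_rate:.1f}, "
      f"false positive rate = {false_pos:.1f}, false negative rate = {false_neg:.1f}, "
      f"miss rate = {miss_rate:.1f}")
```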
Judgments of criterion-related validity are based on two types of statistical evidence:
Validity coefficient: Correlation coefficient that provides a measure of a relationship between test scores and scores on the criterion measure.
The Pearson correlation coefficient is commonly used to compute the validity coefficient between two continuous measures.
When correlating rankings, use the Spearman rho rank-order correlation coefficient.
The validity coefficient is affected by restriction or inflation of range; a key issue is whether the range of scores is appropriate to the objective of the correlational analysis.
Attrition in the number of subjects may adversely affect the validity coefficient by restricting the range of scores (see the simulation sketch below).
According to Cronbach and Gleser, validity coefficients need to be large enough to enable the test user to make accurate decisions within the unique context in which a test is being used.
Validity coefficients should be high enough to result in the identification and differentiation of test-takers with respect to the target attributes.
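A small simulation sketch of how restriction of range depresses a validity coefficient (synthetic data; the true correlation of about .60 and the median selection cutoff are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate test and criterion scores with a true correlation of about .60.
n = 1000
test = rng.normal(size=n)
criterion = 0.6 * test + 0.8 * rng.normal(size=n)

r_full, _ = stats.pearsonr(test, criterion)

# Restrict the range: keep only test-takers above the median test score
# (e.g., only those who were hired or admitted appear in the criterion data).
selected = test > np.median(test)
r_restricted, _ = stats.pearsonr(test[selected], criterion[selected])

# For ranked data, Spearman rho would be used instead of Pearson r:
rho, _ = stats.spearmanr(test, criterion)

print(f"full-range r = {r_full:.2f}, restricted-range r = {r_restricted:.2f}, "
      f"Spearman rho (full range) = {rho:.2f}")
```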
Incremental validity: In the context of predicting some criterion through multiple predictors, each measure used as a predictor should have criterion-related predictive validity; additional predictors should possess incremental validity.
Degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.
Example: the criterion is academic success and the main predictor is GWA. Additional predictors related to GWA could be time spent in the library, time spent studying, and time spent sleeping; these additional predictors should help predict changes in GWA.
For these predictors to have incremental validity, they should provide predictive information about GWA that is not explained by other predictors.
If time spent studying and time spent in the library are highly correlated with each other (and with GWA), only one of them is needed.
Time spent sleeping may have good incremental validity because it explains an aspect of GWA that is not explained by the existing predictors.
Time spent studying and time spent sleeping thus have good incremental validity because they explain different aspects of what could predict the rise or fall of a student’s GWA (the main predictor) and, in turn, academic success (the criterion); a sketch of this check appears below.
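A minimal sketch of checking incremental validity via hierarchical regression, comparing R² before and after adding a predictor (variable names and data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical data: hours studying and hours sleeping both influence GWA.
study = rng.normal(10, 2, n)
sleep = rng.normal(7, 1, n)
gwa = 0.4 * study + 0.3 * sleep + rng.normal(0, 1, n)

def r_squared(X, y):
    """R^2 from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_base = r_squared(study.reshape(-1, 1), gwa)             # study time alone
r2_full = r_squared(np.column_stack([study, sleep]), gwa)  # study + sleep time

# The increment in R^2 is the variance in GWA explained by sleep time
# over and above what study time already explains.
print(f"R^2 (study only) = {r2_base:.3f}, R^2 (+ sleep) = {r2_full:.3f}, "
      f"incremental validity: delta R^2 = {r2_full - r2_base:.3f}")
```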
Construct Validity
Judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.
Construct: An informed, scientific idea developed or hypothesized to describe or explain behavior.
Unobservable, presupposed traits that a test developer may invoke to describe test behavior or criterion performance.
The researcher investigating the construct validity of a test must formulate hypotheses about the expected behavior of high scorers and low scorers on the test.
Hypotheses give rise to a tentative theory about the nature of the construct.
If the test is valid, high and low scorers will behave as predicted in the theory.
If the test is invalid, high scorers and low scorers will not behave as predicted, leading to contrary evidence.
Contrary evidence may be explained by the following reasons:
The test simply does not measure the construct.
The theory may need to be reexamined or new hypotheses must be formed.
The statistical procedures used may have been inadequate or inappropriate.
Contrary evidence can provide a stimulus for the discovery of new facets of the construct as well as alternative methods of measurement.
Evidence of Construct Validity
* Homogeneity: How uniform a test is in measuring a single concept.
* Methods for increasing homogeneity:
* Correlations between subtest scores and total test score
* For tests with dichotomous items (e.g., true/false), items that do not show significant correlation coefficients with total test scores are removed.
* Homogeneity is supported when all test items show significant, positive correlations with total test scores and when high scorers on the test tend to pass each item more often than low scorers do.
* For tests with multipoint items (e.g., Likert-type scales), each response is assigned a numerical score; items that do not show significant Spearman rank-order correlations with the total score are eliminated.
* Homogeneity is supported when all remaining items show significant, positive correlations with the total test score; coefficient alpha may also be used to assess homogeneity.
* Item analysis: Relationship between test-takers’ scores on individual items and their scores on the entire test. If high scorers tend to answer an item in a way consistent with the construct while low scorers tend to answer it in a way inconsistent with the construct, the item supports the test’s homogeneity (see the sketch below). Knowing that a test is homogeneous contributes no information about how the construct being measured relates to other constructs.
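A sketch of the item-total and coefficient-alpha checks described above, assuming a small matrix of dichotomous item scores (rows are test-takers, columns are items; data invented):

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 item responses for eight test-takers on four items.
items = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
])
total = items.sum(axis=1)

# Item-total correlations: each item should correlate positively with the total
# (a stricter variant correlates each item with the total excluding that item).
for j in range(items.shape[1]):
    r, _ = stats.pearsonr(items[:, j], total)
    print(f"item {j + 1}: item-total r = {r:.2f}")

# Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of totals).
k = items.shape[1]
alpha = k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum() / total.var(ddof=1))
print(f"coefficient alpha = {alpha:.2f}")
```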
* Changes with age: If a test score purports to measure a construct that could be expected to change over time, then the test score should show the same progressive changes with age to be considered a valid measure of the construct. Some variables, such as marital satisfaction, may be less stable over time or more vulnerable to situational events; this is also true for personality measures. Evidence of age-related change does not in itself provide information about how the construct relates to other constructs.
* Pretest-posttest changes: Evidence that test scores change as a result of some experience between a pretest and a posttest. Almost any intervening life experience could be predicted to yield changes in scores from pretest to posttest. Pretest-posttest research should ideally include a control group (test-takers who did not receive the intervention) to rule out alternative explanations of the findings, as in the sketch below.
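A minimal sketch of a pretest-posttest comparison with a control group, using invented scores; comparing gain scores across groups is one simple way to frame the analysis:

```python
import numpy as np
from scipy import stats

# Hypothetical scores before and after an intervention.
treat_pre  = np.array([50, 48, 53, 47, 51, 49])
treat_post = np.array([58, 55, 60, 54, 59, 56])
ctrl_pre   = np.array([49, 52, 48, 50, 47, 51])
ctrl_post  = np.array([50, 53, 49, 51, 48, 52])

# Gain scores isolate change from pretest to posttest within each group.
treat_gain = treat_post - treat_pre
ctrl_gain = ctrl_post - ctrl_pre

# If the test is sensitive to the intervention, treatment gains should exceed
# control gains; the control group rules out change due to mere retesting.
t, p = stats.ttest_ind(treat_gain, ctrl_gain)
print(f"mean gain: treatment = {treat_gain.mean():.1f}, control = {ctrl_gain.mean():.1f}, "
      f"t = {t:.2f}, p = {p:.3f}")
```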
* Evidence from distinct groups: Also referred to as the method of contrasted groups, which demonstrates that scores on the test vary in a predictable way as a function of membership in some group. If a test is a valid measure of a particular construct, then groups of people presumed to differ with respect to that construct should have correspondingly different test scores (see the sketch below).
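A sketch of the method of contrasted groups, comparing the mean scores of two groups presumed to differ on the construct (scores invented):

```python
import numpy as np
from scipy import stats

# Hypothetical depression-scale scores for a clinically depressed group
# and a non-clinical comparison group.
clinical = np.array([31, 28, 35, 30, 27, 33, 29])
nonclinical = np.array([12, 15, 10, 14, 11, 13, 16])

# If the test validly measures the construct, the groups' mean scores should differ.
t, p = stats.ttest_ind(clinical, nonclinical)
print(f"clinical mean = {clinical.mean():.1f}, non-clinical mean = {nonclinical.mean():.1f}, "
      f"t = {t:.2f}, p = {p:.4f}")
```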
* Convergent validity: Scores on the test undergoing construct validation correlate highly with scores on an older, more established test shown to measure the same or a similar construct.
* Divergent (discriminant) validity: Scores on the test undergoing construct validation show a low or negative correlation with scores on an older, more established test shown to measure a construct that is different from, or opposite to, the one being measured.
* Multitrait-multimethod matrix: The matrix or table that results from correlating variables (traits) within and between methods. Values for any number of traits, as obtained by various methods, are inserted into the table, and the resulting matrix of correlations provides insight with respect to both the convergent and the discriminant validity of the methods used (see the sketch below).
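A sketch of assembling a simple multitrait-multimethod-style matrix with pandas, using two hypothetical traits each measured by two methods; high same-trait, different-method correlations suggest convergent validity, while low different-trait correlations suggest discriminant validity:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 100

# Two hypothetical traits, each measured by two methods
# (self-report and observer rating), with added measurement noise.
anxiety = rng.normal(size=n)
sociability = rng.normal(size=n)

df = pd.DataFrame({
    "anxiety_self": anxiety + 0.3 * rng.normal(size=n),
    "anxiety_observer": anxiety + 0.4 * rng.normal(size=n),
    "sociability_self": sociability + 0.3 * rng.normal(size=n),
    "sociability_observer": sociability + 0.4 * rng.normal(size=n),
})

# Correlating every variable with every other yields the multitrait-multimethod matrix.
print(df.corr().round(2))
```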
* Factor analysis: Class of mathematical procedures designed to identify factors or specific variables that are typically attributes, characteristics, or dimensions on which people may differ. Frequently employed as a data reduction method in which several sets of scores and the correlations between them are analyzed. The purpose may be to identify the factor/s in common between test scores on subscales within a particular test, or the factors in common between scores on a series of tests
* Two types:
* Exploratory factor analysis
* Estimating or extracting factors
* Deciding how many factors to retain
* Rotating factors to an interpretable orientation
* Confirmatory factor analysis: Tests the degree to which a hypothetical model (which includes factors) fits the actual data
* Factor loading: Conveys information about the extent to which a factor determines the test score or scores (see the loadings printed in the sketch below).
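A sketch of exploratory factor analysis using scikit-learn’s FactorAnalysis (the rotation argument assumes a recent scikit-learn version); six synthetic “subtests” are constructed to load on two underlying factors:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
n = 500

# Two latent factors (e.g., verbal and spatial ability) generating six subtest scores.
verbal = rng.normal(size=n)
spatial = rng.normal(size=n)
scores = np.column_stack([
    verbal + 0.5 * rng.normal(size=n),   # vocabulary
    verbal + 0.5 * rng.normal(size=n),   # reading comprehension
    verbal + 0.5 * rng.normal(size=n),   # verbal analogies
    spatial + 0.5 * rng.normal(size=n),  # block design
    spatial + 0.5 * rng.normal(size=n),  # mental rotation
    spatial + 0.5 * rng.normal(size=n),  # maze tracing
])

# Extract two factors and rotate them to an interpretable orientation (varimax).
fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(scores)

# Each row gives one subtest's loadings: the extent to which each factor
# determines that subtest's scores.
print(np.round(fa.components_.T, 2))
```

Confirmatory factor analysis, by contrast, would test a prespecified loading pattern against the data, typically with a structural equation modeling package.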
Validity, Bias, and Fairness
Bias: A factor inherent in a test that systematically prevents accurate, impartial measurement.
Implies systematic variation, as opposed to random variation accounted for by random error.
Apparent bias may sometimes stem from the research study rather than the test itself (e.g., too few test-takers in one of the groups representing a certain population).
Rating error: Judgment resulting from the intentional or unintentional misuse of a rating scale.
Restriction-of-range rating errors:
Leniency/generosity error: Tendency to score higher than what is deserved.
Severity/strictness error: Tendency to score lower than what is deserved.
Central tendency error: Tendency to rate in the middle; scores cluster in the middle of the rating continuum.
One way to overcome these errors is to use rankings, which force the rater to rank ratees against one another instead of evaluating each against an absolute rating scale.
Halo effect: For some raters, ratees can do no wrong; tendency to give a particular ratee a higher rating than they objectively deserve because of the rater’s failure to discriminate among conceptually distinct and potentially independent aspects of a ratee’s behavior.
Test Fairness: Extent to which a test is used in an impartial, just, and equitable way.
Test developer: Test development and test usage guidelines.
Test user: The way the test is actually used.
Misunderstandings:
Tests are not “unfair” merely because they discriminate (i.e., differentiate) among groups of people.
It is no truism that all people are equal in whatever attribute a test measures.
Administering a test to a population not included in the standardization sample may be criticized as unfair, but doing so does not in itself invalidate the test for use with that group.
Questions of test bias can sometimes be answered with mathematical precision and finality, while questions of fairness tend to be rooted more in thorny issues regarding values.