Validity Notes
Validity
The Concept of Validity
Validity in testing is an evaluation of how well a test measures what it intends to measure within a specific context.
A test's validity is specific to its use, the population taking it, and the time it is administered.
No test is universally valid for all times, uses, or test-taker populations.
Validation
Validation is the process of gathering and assessing evidence to support a test's validity.
Both test users and developers are involved in validating a test for a particular purpose.
Test developers are responsible for providing validity evidence in the test manual.
Test users may conduct their own validation studies with their own test-taker groups; these are called local validation studies.
Local Validation Studies
Local validation studies are essential when a test user modifies the test's format, instructions, language, or content.
Necessary when using a test with a population that differs significantly from the standardization population.
Validity Categories
Face Validity
Content validity
Criterion-related validity
Concurrent Validity
Predictive Validity
Construct validity
Face Validity
Face validity is about what a test appears to measure to the test-taker, rather than what it actually measures.
It's a judgment of how relevant the test items seem to be.
A test with high face validity appears, on the surface, to measure what it claims to measure.
Judgments about face validity are from the test-taker's viewpoint, unlike reliability and other forms of validity, which are from the test user's perspective.
A lack of face validity can decrease test-taker confidence, cooperation, and motivation.
A test lacking face validity may still be relevant and useful, but negative perceptions can arise from test-takers, parents, legislators, etc.
Content Validity
Content validity is an evaluation of how well a test samples the behavior representative of the entire range of behavior it was designed to assess.
For example, a test of assertiveness must cover a wide variety of situations to have content validity.
Content Validity – Educational Achievement Tests
An educational achievement test is considered content-valid when the proportion of material covered in the test matches the proportion covered in the course.
Content Validity – Employment Tests
For an employment test to be content-valid, it must be a representative sample of the job-related skills needed for the job.
Behavioral observation of successful employees is used to determine the content areas to include in the test.
Measuring Content Validity
C.H. Lawshe developed a method to measure content validity by gauging agreement among raters regarding the essential nature of an item.
Raters answer the question: Is the skill or knowledge measured by this item:
Essential?
Useful but not essential?
Not necessary to the performance of the job?
If more than half the panelists rate an item as essential, it has some content validity.
Higher agreement among panelists indicates greater content validity.
Lawshe developed the content validity ratio (CVR) formula:
CVR = (n_e – (N/2)) / (N/2)
Where:
n_e = number of panelists indicating essential
N = total number of panelists
When fewer than half the panelists indicate “essential,” the CVR is negative.
When exactly half the panelists indicate “essential,” the CVR is zero.
When more than half, but not all, the panelists indicate “essential,” the CVR is positive and ranges between 0.00 and 0.99.
Lawshe suggested eliminating items whose observed level of agreement has more than a 5% probability of occurring by chance.
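The CVR formula can be sketched as a small Python function (the function name is illustrative):

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR: (n_e - N/2) / (N/2)."""
    half = n_panelists / 2
    return (n_essential - half) / half

# 7 of 10 panelists rate an item "essential"
print(content_validity_ratio(7, 10))   # 0.4
# exactly half -> 0.0; fewer than half -> negative
print(content_validity_ratio(5, 10))   # 0.0
print(content_validity_ratio(3, 10))   # -0.4
```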
Measuring Content Validity – Minimal Values of the Content Validity Ratio
Minimal CVR values for a 5% level of chance
Here's a table of minimum CVR values based on the number of panelists:
| Number of panelists | Minimum CVR value |
|---|---|
| 5 | 0.99 |
| 6 | 0.99 |
| 7 | 0.99 |
| 8 | 0.75 |
| 10 | 0.62 |
| 12 | 0.56 |
| 14 | 0.51 |
| 15 | 0.49 |
| 20 | 0.42 |
| 25 | 0.37 |
| 30 | 0.33 |
| 35 | 0.31 |
| 40 | 0.29 |
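The table of minimum values can be encoded as a lookup for an item-retention check — a minimal sketch assuming the tabled 5%-level minimums (the function name is illustrative):

```python
# Minimum CVR values at the 5% level, keyed by panel size (from the table above)
MIN_CVR = {5: 0.99, 6: 0.99, 7: 0.99, 8: 0.75, 10: 0.62, 12: 0.56,
           14: 0.51, 15: 0.49, 20: 0.42, 25: 0.37, 30: 0.33, 35: 0.31, 40: 0.29}

def retain_item(n_essential, n_panelists):
    """Retain the item only if its CVR meets the tabled minimum for the panel size."""
    half = n_panelists / 2
    cvr = (n_essential - half) / half
    return cvr >= MIN_CVR[n_panelists]

print(retain_item(9, 10))   # CVR = 0.80 >= 0.62 -> True
print(retain_item(7, 10))   # CVR = 0.40 <  0.62 -> False
```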
Criterion-Related Validity
Criterion-related validity assesses how well a test score can be used to predict an individual's standing on a measure of interest (the criterion).
Criterion-Related Validity – Concurrent & Predictive Validity
Two types of validity evidence:
Concurrent validity: The degree to which a test score relates to a criterion measure obtained at the same time.
Predictive validity: The degree to which a test score predicts a criterion measure.
What is a Criterion?
A criterion is the standard against which a test or test score is evaluated.
It can be a test score, behavior, rating, diagnosis, etc.
Characteristics of a Criterion
An adequate criterion should be relevant.
An adequate criterion must also be valid for the purpose for which it is being used.
If one test (X) is being used as the criterion to validate a second test (Y), then evidence should exist that test X is valid.
If the criterion used is a rating made by a judge or a panel, then evidence should exist that the rating is valid (e.g., the training and experience of the raters).
Ideally, a criterion should be uncontaminated
Criterion contamination occurs when the criterion measure is based on predictor measures.
If a psychiatric diagnosis (criterion) is based on MMPI-2 scores (predictor), the predictor measure has contaminated the criterion measure.
Criterion-Related Validity – Concurrent Validity
Concurrent validity is established when test scores and criterion measures are obtained at the same time.
It indicates how well test scores estimate an individual's current standing on a criterion.
If a psychodiagnostic test is validated against a criterion of already diagnosed psychiatric patients, it is concurrent validation.
Once established, the test can provide a faster, less expensive method for diagnosis or classification.
Concurrent validation can also be used to appraise the validity of Test A with respect to Test B, provided Test B has already been shown to be valid.
Criterion-Related Validity – Predictive Validity
Test scores are obtained at one time, and criterion measures are obtained in the future.
The intervening event between the test scores and criterion measures can be training, experience, therapy, medication, or simply the passage of time.
Predictive validity indicates how accurately test scores predict a criterion measure obtained in the future.
Judgments of Criterion-Related Validity
Judgments of criterion-related validity (concurrent or predictive) are based on:
Validity Coefficient
Expectancy Data
Criterion-Related Validity – The Validity Coefficient
The validity coefficient is a correlation coefficient measuring the relationship between test scores and criterion scores.
A correlation coefficient between a score on a psychodiagnostic test and a criterion score assigned by psychodiagnosticians is an example.
Typically, the Pearson correlation coefficient is used to determine the validity between the two measures.
Depending on variables such as the type of data, the sample size, and the shape of the distribution, other correlation coefficients may be used.
For example, in correlating self-rankings of performance on some job with rankings made by job supervisors, the Spearman rho rank-order correlation would be employed.
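Both coefficients can be sketched in pure Python; the sample scores below are invented for illustration, and the tie-handling in the rank helper is deliberately simplified:

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman_rho(x, y):
    """Spearman rho: the Pearson correlation of the ranks (no tie correction)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    return pearson_r(ranks(x), ranks(y))

test_scores = [55, 62, 70, 75, 88]   # hypothetical predictor scores
criterion   = [60, 65, 64, 80, 90]   # hypothetical criterion scores
print(round(pearson_r(test_scores, criterion), 3))
print(round(spearman_rho(test_scores, criterion), 3))
```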
Criterion-Related Validity – Incremental Validity
Test users predicting a criterion from test scores are often interested in the utility of multiple predictors.
The value of multiple predictors depends on:
Each predictor having criterion-related predictive validity.
Additional predictors possessing incremental validity.
Incremental validity is the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.
Incremental validity may be used when predicting something like academic success in college.
Grade-point average (GPA) at the end of the first year may be used as a measure of academic success.
A study of potential predictors of GPA may reveal that time spent in the library and time spent studying are highly correlated with GPA.
How much sleep a student’s roommate allows the student to have during exam periods correlates with GPA to a smaller extent.
One approach, employing the principles of incremental validity, is to start with the best predictor: the one most highly correlated with GPA.
This may be time spent studying. Then, using multiple-regression techniques, one would examine the usefulness of the other predictors.
Any additional predictor should also be efficient, adding enough predictive value to justify the cost of measuring it.
Even though time in the library is highly correlated with GPA, it may not possess incremental validity if it overlaps too much with the first predictor, time spent studying.
If time spent studying and time in the library are so highly correlated with each other that they reflect essentially the same thing, then only one needs to be included as a predictor.
By contrast, the variable of how much sleep a student’s roommate allows the student to have during exams may have good incremental validity.
This is because it reflects a different aspect of preparing for exams (resting) than does the first predictor (studying).
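The reasoning above can be illustrated with a small simulation. The data, variable names, and coefficients below are all invented: library time is constructed to nearly duplicate study time, while sleep carries independent information about GPA:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
study   = rng.normal(20, 5, 200)            # hours studying (hypothetical)
library = study + rng.normal(0, 1, 200)     # nearly redundant with studying
sleep   = rng.normal(7, 1, 200)             # independent of studying
gpa     = 1.0 + 0.08 * study + 0.15 * sleep + rng.normal(0, 0.3, 200)

base         = r_squared(study.reshape(-1, 1), gpa)
with_library = r_squared(np.column_stack([study, library]), gpa)
with_sleep   = r_squared(np.column_stack([study, sleep]), gpa)
print(f"studying alone: {base:.3f}")
print(f"+ library time: {with_library:.3f}  (little gain)")
print(f"+ sleep:        {with_sleep:.3f}  (incremental validity)")
```

Note that R² never decreases when a predictor is added, so incremental validity is judged by the *size* of the gain, not its mere presence.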
Criterion-Related Validity – Expectancy Data
Expectancy data provide information that can be used in evaluating the criterion-related validity of a test.
Using a score obtained on some test(s) or measure(s), expectancy tables illustrate the likelihood that the test-taker will score within some interval of scores on a criterion measure.
An expectancy table shows the percentage of people within specified test-score intervals who subsequently were placed in various categories of a criterion (for example, a “passed” or “failed” category).
An expectancy table may be created from a scatterplot of test scores against criterion scores.
Example: the relationship between scores on the Language Usage Subtest of the Differential Aptitude Test (DAT) and course grades in American history for eleventh-grade boys.
Of the students who scored 40 or above on the DAT subtest, 83% earned a grade of 80 or above in their American history course.
Expectancy Table Example:
Row entries are percentages of students in each course-grade interval:

| Language Usage Subtest Score | Grade 0-69 | 70-79 | 80-89 | 90-100 |
|---|---|---|---|---|
| 40 and above | 0 | 17 | 29 | 54 |
| 30-39 | 8 | 46 | 29 | 17 |
| 20-29 | 15 | 59 | 24 | 2 |
| Below 20 | 37 | 57 | 7 | 0 |
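A table like this can be built from raw (test score, course grade) pairs. The sketch below uses invented data and illustrative bin boundaries:

```python
from collections import defaultdict

def expectancy_table(pairs, score_bins, grade_bins):
    """
    pairs: (test_score, course_grade) tuples.
    score_bins / grade_bins: lists of (low, high) inclusive intervals.
    Returns, per score interval, the percentage of students whose grade
    fell in each grade interval (each row sums to roughly 100).
    """
    counts = defaultdict(lambda: [0] * len(grade_bins))
    for score, grade in pairs:
        for i, (lo, hi) in enumerate(score_bins):
            if lo <= score <= hi:
                for j, (glo, ghi) in enumerate(grade_bins):
                    if glo <= grade <= ghi:
                        counts[i][j] += 1
    table = {}
    for i, interval in enumerate(score_bins):
        row, total = counts[i], sum(counts[i])
        table[interval] = [round(100 * c / total) if total else 0 for c in row]
    return table

# invented score/grade pairs for illustration
pairs = [(45, 92), (42, 85), (33, 74), (28, 71), (31, 78), (15, 55), (22, 76), (44, 95)]
score_bins = [(40, 100), (30, 39), (20, 29), (0, 19)]
grade_bins = [(0, 69), (70, 79), (80, 89), (90, 100)]
for interval, pcts in expectancy_table(pairs, score_bins, grade_bins).items():
    print(interval, pcts)
```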