Psychology Licensing Exam: Test Construction Vocabulary

Overview of Test Construction in Psychology Licensing

Exam Representation:
- Test construction is not a heavily emphasized area on the psychology licensing exam.
- It accounts for approximately $10\%$ to $12\%$ of the total exam questions.
- Distribution of difficulty:
  - Approximately three-fourths ($75\%$) of these questions address basic or introductory level concepts.
  - The remaining one-fourth ($25\%$) cover more advanced material.

Definition of a Test: A test is defined as a "systematic procedure for measuring a sample of an individual's behavior."

The Three Ways a Test is Systematic:
- Systematic Content: Items are selected in a structured, deliberate manner from the domain of interest.
- Systematic Administration (Standardization): The test developer provides specific, uniform guidelines for administering the test to ensure consistency across different examiners and settings.
- Systematic Scoring: The developer specifies clear rules or steps for evaluating examinees' responses, removing subjective bias.

Tests as Samples of Behavior:
- A test cannot assess every possible element of a domain (e.g., all knowledge in psychology). Instead, it measures a representative sample of terms, concepts, and theories.
- Two primary problems arise due to sampling:
  - Problem of Validity: Does the sample of items accurately and thoroughly represent the behavior intended to be measured?
  - Problem of Reliability: Would the examinee obtain the same score if they took the test at a different time or responded to a different sample of items?

Fundamental Concepts of Reliability

True Score ($T$): The score a person would obtain on a test if it were perfectly reliable and performance was entirely unaffected by error.

Measurement Error ($E$):
- Includes all factors irrelevant to the behavior being measured that affect scores in a random or unpredictable way.
- Features of measurement error:
  - Randomness: It is unsystematic. It may increase, decrease, or have no effect on performance depending on the individual.
  - Sources: Examples include external distractions (which might cause arousal for some but distraction for others), confusing items, or ambiguous language.

Theoretical Equation of Scores:
- The obtained score ($X$) is a function of the true score plus measurement error: $X = T + E$
  - $X$: Scores obtained by a sample of examinees.
  - $T$: Variability in scores due to differences in true scores (truth).
  - $E$: Variability in scores due to measurement error (error).

The Reliability Coefficient ($r_{XX}$):
- Reliability is estimated indirectly by measuring the consistency of scores across time, versions, items, or scorers.
- Assumption: True scores are consistent, while measurement error is inconsistent.
- Notation: Expressed as a correlation coefficient, usually denoted $r_{XX}$ or $r_{YY}$. The matching subscripts signify that the test is being correlated with itself.
- Interpretive Properties:
  - It ranges from $0$ to $+1$ and is never squared for interpretation, because it already represents a proportion of variance.
  - It is interpreted directly as the proportion of variability in obtained test scores that is due to true score variability.
  - Example: A reliability coefficient of $0.9$ indicates that $90\%$ of variability reflects true scores, while $10\%$ reflects measurement error.
- Threshold: A coefficient of $0.8$ or higher is generally considered adequate for most tests.
- Limitation: Reliability does not indicate what the test measures (that is a question of validity), only the consistency of whatever it measures.
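
The proportion-of-variance reading of the reliability coefficient can be written out as a short worked equation. This is a sketch in standard classical test theory notation, not notation taken from these notes: the symbols $\sigma_X^2$, $\sigma_T^2$, and $\sigma_E^2$ (variance of obtained scores, true scores, and error) are introduced here for illustration, and the decomposition assumes that true scores and errors are uncorrelated.

$$
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad r_{XX} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
$$

With $r_{XX} = 0.9$, this gives $\sigma_T^2 / \sigma_X^2 = 0.90$ and $\sigma_E^2 / \sigma_X^2 = 0.10$, matching the $90\%$ / $10\%$ split in the example.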

Four Main Methods for Evaluating Reliability

1. Test-Retest Reliability:
   - Procedure: Administer the same test to the same group on two different occasions and correlate the scores.
   - Alternate Name: Coefficient of Stability.
   - Suitability: Appropriate for stable attributes (aptitude, traits). Inappropriate for fluctuating characteristics or tests susceptible to significant memory or practice effects.
2. Alternate Forms Reliability:
   - Procedure: Administer two equivalent versions of a test to the same sample and correlate the scores.
   - Alternate Names: Parallel forms, equivalent forms reliability.
   - Alternate Name of Coefficient: Coefficient of Equivalence.
   - Suitability: Best for stable characteristics. Limited if exposure to the first form unsystematically affects performance on the second.
3. Internal Consistency Reliability:
   - Procedure: Administer a test once to a single sample.
   - Suitability: Appropriate for unstable characteristics and tests affected by memory/practice. Not suitable for speeded tests (it tends to overestimate their reliability).
   - Three Methods:
     - Split-Half: The test is divided into halves (e.g., odd-even items) and scores on the halves are correlated. Limitation: Shorter tests are less reliable, so the split-half coefficient underestimates the full test's reliability.
       - Spearman-Brown Prophecy Formula: Used to correct the split-half coefficient to estimate what the reliability would be for the full length of the test. Also used to estimate the effect of lengthening or shortening a test.
     - Cronbach's Coefficient Alpha: Conceptualized as the average of all possible split-half coefficients corrected by the Spearman-Brown formula.
     - Kuder-Richardson Formula 20 (KR-20): Used specifically when items are scored dichotomously (e.g., right/wrong, even if multiple choice).
4. Inter-Rater (Inter-Scorer) Reliability:
   - Requirement: Necessary for subjective tests (projective tests, essays).
   - Procedure: Two or more independent raters score the same sample of tests.
   - Statistics Used:
     - Kappa Statistic: Used for nominal data (categories).
     - Coefficient of Concordance: Used for data in the form of ranks.
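
The internal consistency methods in item 3 can be illustrated with a minimal Python sketch. The function names (`pearson_r`, `spearman_brown`, `cronbach_alpha`) and the small `scores` matrix are hypothetical, made up for illustration. The formulas are the standard textbook forms, which these notes describe only in words. Because the example items are scored 0/1, the alpha computed here is also what KR-20 would give.

```python
from statistics import mean, pvariance

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

def spearman_brown(r, length_factor=2.0):
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * r) / (1 + (length_factor - 1) * r)

def cronbach_alpha(item_scores):
    """Coefficient alpha; item_scores holds one row per examinee, one column per item."""
    k = len(item_scores[0])
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 examinees x 4 dichotomous (right/wrong) items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

# Odd-even split: items 1 and 3 versus items 2 and 4.
odd_half = [row[0] + row[2] for row in scores]
even_half = [row[1] + row[3] for row in scores]

half_r = pearson_r(odd_half, even_half)   # reliability of a half-length test
full_r = spearman_brown(half_r)           # corrected to full length
alpha = cronbach_alpha(scores)            # about 0.80 for this data

print(round(half_r, 3), round(full_r, 3), round(alpha, 3))
```

Passing a `length_factor` below 1 (for example `0.5`) estimates the effect of shortening the test instead of lengthening it.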

Standard Error of Measurement and Confidence Intervals

Standard Error of Measurement (SEM):
- If an examinee took the same test an infinite number of times, their scores would form a distribution.
- The mean of this distribution is the True Score.
- The standard deviation of this distribution is the SEM, representing variability due to measurement error.

Confidence Intervals:
- Used to interpret an obtained score as a range where the true score likely falls.
- Construction:
  - $68\%$ Confidence Interval: $\text{Obtained Score} \pm 1 \times SEM$
  - $95\%$ Confidence Interval: $\text{Obtained Score} \pm 2 \times SEM$
- Example Scenario:
  - Obtained score: $50$
  - SEM: $5$
  - $68\%$ Interval: $45$ to $55$ ($50 \pm 5$).
  - $95\%$ Interval: $40$ to $60$ ($50 \pm 10$).
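
The interval construction above can be sketched in a few lines of Python. The notes do not state how the SEM is calculated. The sketch assumes the standard formula $SEM = SD \times \sqrt{1 - r_{XX}}$, and the standard deviation of $10$ and reliability of $0.75$ are hypothetical values chosen only so that the SEM comes out to $5$, as in the example scenario. The function names are also illustrative.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM from the test's standard deviation and reliability coefficient
    (standard formula, assumed here; not stated in the notes)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(obtained, sem, n_sem=1):
    """Range of obtained score plus or minus n_sem standard errors."""
    return (obtained - n_sem * sem, obtained + n_sem * sem)

# Hypothetical SD and reliability that yield the SEM of 5 used in the example.
sem = standard_error_of_measurement(sd=10, reliability=0.75)

ci_68 = confidence_interval(50, sem, n_sem=1)  # 45 to 55
ci_95 = confidence_interval(50, sem, n_sem=2)  # 40 to 60

print(sem, ci_68, ci_95)
```

Under this formula the SEM shrinks toward zero as reliability approaches $1$, so a more reliable test yields narrower intervals around the same obtained score.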

Factors Affecting the Magnitude of Reliability

Test Length: Longer tests are generally more reliable than shorter ones (all things being equal).
Examinee Heterogeneity: Reliability coefficients are higher when the sample of examinees is heterogeneous regarding the trait being measured. This produces an unrestricted range of scores; a restricted range lowers the correlation.
Content Homogeneity: A test with homogeneous content (measuring one specific domain) has higher internal consistency than one with heterogeneous content.

Types of Validity

Definition: A test is valid when it accurately measures what it was designed to measure.
1. Content Validity:
   - Purpose: Assessing mastery of a specific content or behavior domain (e.g., achievement tests, job sample tests).
   - Procedure: Built-in during construction by identifying the domain, dividing it into subcategories, and writing representative items.
   - Evaluation: Primarily through the agreement of Subject Matter Experts (SMEs) that items representatively sample the domain.
2. Construct Validity:
   - Purpose: Measuring a hypothetical, intangible trait or construct (e.g., intelligence, self-esteem, depression).
   - Evaluation: Accumulated evidence, including expert judgment, group comparisons, and the Multi-trait Multi-method (MTMM) Matrix.
3. Criterion-Related Validity:
   - Purpose: Using a score (Predictor) to estimate status on an external measure (Criterion).
   - Predictive Validity: Forecasting future performance (e.g., job selection test predicting performance six months later).
   - Concurrent Validity: Estimating current status (e.g., job selection test predicting performance immediately).
   - Validity Coefficient ($r_{XY}$): Degree of association between predictor and criterion.
   - Coefficient of Determination ($r_{XY}^2$): Squared validity coefficient indicating shared variance. Example: If $r_{XY} = 0.60$, then $0.60^2 = 0.36$, so $36\%$ of variability is shared.

Construct Validity: The Multi-trait Multi-method (MTMM) Matrix

Assumptions: A score reflects the trait being measured and the method used. A valid test reflects the trait more than the method.
Convergent Validity: High correlation between measures of the same/similar traits using different methods.
Divergent (Discriminant) Validity: Low correlation between measures of unrelated traits.
MTMM Coefficients:
- Monotrait-Monomethod: Same trait, same method. This is a reliability coefficient. High reliability is necessary because it limits validity.
- Monotrait-Heteromethod: Same trait, different method. Provides evidence for convergent validity (should be high).
- Heterotrait-Monomethod: Different traits, same method. Provides evidence for divergent validity (should be low).
- Heterotrait-Heteromethod: Different traits, different method. Provides evidence for divergent validity (should be low).
Conclusion: Construct validity is supported when the Monotrait-Heteromethod coefficient is larger than both the Heterotrait-Monomethod and Heterotrait-Heteromethod coefficients.
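
The decision rule in the conclusion can be expressed as a small check. This is a sketch only: the function name and the coefficient values are hypothetical, and an actual MTMM matrix contains many coefficients of each type rather than one of each.

```python
def supports_construct_validity(mono_hetero, hetero_mono, hetero_hetero):
    """True when the convergent coefficient (same trait, different method)
    exceeds both discriminant coefficients (different traits)."""
    return mono_hetero > hetero_mono and mono_hetero > hetero_hetero

# Hypothetical coefficients: strong convergence, weak correlations across traits.
favorable = supports_construct_validity(
    mono_hetero=0.70, hetero_mono=0.25, hetero_hetero=0.10
)

# Hypothetical coefficients: different traits correlate highly when measured
# the same way, which suggests scores reflect the method more than the trait.
method_dominated = supports_construct_validity(
    mono_hetero=0.40, hetero_mono=0.65, hetero_hetero=0.10
)

print(favorable, method_dominated)  # True False
```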

Criterion-Related Validity: Prediction and Decisions

Regression Equation: Used to predict a criterion score from a predictor score.
Standard Error of Estimate: Used to construct a confidence interval around a predicted criterion score.
- $68\%$ Interval: $\text{Predicted Criterion Score} \pm 1 \times \text{Standard Error of Estimate}$
- $95\%$ Interval: $\text{Predicted Criterion Score} \pm 2 \times \text{Standard Error of Estimate}$
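
Prediction and the interval around a predicted score can be sketched as follows. The notes do not give formulas for either step. The sketch assumes an ordinary least-squares line for the regression equation and the standard formula $SD_Y \times \sqrt{1 - r_{XY}^2}$ for the standard error of estimate. The function names and all data values are hypothetical.

```python
import math
from statistics import mean

def fit_line(predictor, criterion):
    """Least-squares slope and intercept for predicting criterion from predictor."""
    mx, my = mean(predictor), mean(criterion)
    sxy = sum((x - mx) * (y - my) for x, y in zip(predictor, criterion))
    sxx = sum((x - mx) ** 2 for x in predictor)
    slope = sxy / sxx
    return slope, my - slope * mx

def standard_error_of_estimate(sd_criterion, validity):
    """Standard formula (assumed here): SD of the criterion times sqrt(1 - r_xy^2)."""
    return sd_criterion * math.sqrt(1 - validity ** 2)

def prediction_interval(predicted, see, n_see=1):
    """Predicted criterion score plus or minus n_see standard errors of estimate."""
    return (predicted - n_see * see, predicted + n_see * see)

# Hypothetical predictor and criterion scores for five examinees.
slope, intercept = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
predicted = slope * 4 + intercept          # predicted criterion for a predictor score of 4

# Hypothetical criterion SD of 10 and validity coefficient of 0.60.
see = standard_error_of_estimate(sd_criterion=10, validity=0.60)   # about 8
interval_68 = prediction_interval(70, see, n_see=1)                # about 62 to 78
interval_95 = prediction_interval(70, see, n_see=2)                # about 54 to 86

print(round(slope, 2), round(intercept, 2), round(predicted, 2), round(see, 2))
```

Under this formula a validity coefficient of $0$ leaves the standard error of estimate equal to the criterion's standard deviation, meaning the predictor adds no precision.
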
Incremental Validity: The increase in decision-making accuracy achieved by adding a new predictor.
- Base Rate: The proportion of correct decisions made without the new predictor (e.g., successful employees hired via current methods).
- Positive Hit Rate: The proportion of individuals selected with the new predictor who turn out to be successful. Calculated as: $\frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$.
- Calculation: $\text{Incremental Validity} = \text{Positive Hit Rate} - \text{Base Rate}$.
Scatter Plot Quadrants (Relative to Cutoff Scores):
- True Positives: High predictor, high criterion.
- False Positives: High predictor, low criterion.
- True Negatives: Low predictor, low criterion.
- False Negatives: Low predictor, high criterion.
- Rule for Cutoffs: Raising the predictor cutoff score decreases the number of positives (both true and false) and increases the number of negatives (both true and false).

The Relationship Between Reliability and Validity

Reliability Limits Validity: A test's validity cannot be higher than the square root of its reliability.
- $\text{Validity} \le \sqrt{\text{Reliability}}$
- Example: If reliability is $0.81$, the maximum possible validity is $\sqrt{0.81} = 0.9$.
Necessity vs. Sufficiency: Reliability is a necessary condition for validity, but it is not sufficient. High reliability does not guarantee that a test measures what it intends to measure.
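
The ceiling on validity follows from the standard classical test theory bound on a correlation between two imperfectly reliable measures. This is a sketch under assumptions not stated in these notes: here $r_{XX}$ denotes the reliability of the predictor, $r_{YY}$ the reliability of the criterion, and $r_{XY}$ the validity coefficient.

$$
r_{XY} \le \sqrt{r_{XX}\, r_{YY}} \le \sqrt{r_{XX}}
$$

The second inequality holds because $r_{YY} \le 1$. With $r_{XX} = 0.81$ and a perfectly reliable criterion, the ceiling is $\sqrt{0.81} = 0.9$; an unreliable criterion lowers the ceiling further.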

Questions & Discussion

Question 1 (Reliability for Fluctuating Characteristics):
- Prompt: To evaluate the reliability of a characteristic that varies in severity or intensity over time, which coefficient should be used: a) Equivalence, b) Stability, c) Alienation, or d) Internal Consistency?
- Response: The correct answer is d) Coefficient of Internal Consistency. The coefficients of stability (test-retest) and equivalence (alternate forms) both rely on two administrations, typically at different times, so genuine change in a fluctuating characteristic is mistaken for measurement error. The coefficient of alienation measures non-association rather than reliability. Internal consistency requires only one administration, capturing the state at that moment.
Question 2 (Heterogeneous vs. Homogeneous Samples):
- Prompt: For a sample with IQ scores from $50$ to $150$, the reliability is $0.80$. If calculated for gifted children only, it would likely be: a) $0.80$, b) greater than $0.80$, c) less than $0.80$, or d) either?
- Response: The correct answer is c) less than $0.80$. Restricting the range of scores (moving to a homogeneous sample like gifted children only) lowers the correlation coefficient.
Question 3 (MTMM Matrix Interpretation):
- Prompt: In an MTMM matrix, a large Heterotrait-Monomethod coefficient suggests the test: a) has convergent validity, b) has discriminant validity, c) lacks convergent validity, or d) lacks discriminant validity?
- Response: The correct answer is d) lacks discriminant (divergent) validity. A large coefficient between different traits (Heterotrait) implies the test is correlating with something it shouldn't, failing to discriminate between constructs.
Question 4 (Changing Cutoff Scores):
- Prompt: A test developer finds that raising the cutoff score on a selection test: a) increases TP and TN, b) decreases TP and TN, c) decreases TP and FP, or d) increases TP and FP?
- Response: The correct answer is c) decreases the number of true and false positives. Raising the predictor cutoff moves the vertical cutoff line on the scatter plot to the right, shrinking the area to the right of the line, where all positives (true and false) fall.