Psychology Licensing Exam: Test Construction Vocabulary

Overview of Test Construction in Psychology Licensing

  • Exam Representation:
    - Test construction is not a heavily emphasized area on the psychology licensing exam.
    - It accounts for approximately 10% to 12% of the total exam questions.
    - Distribution of difficulty:
        - Approximately three-fourths (75%) of these questions address basic or introductory-level concepts.
        - The remaining one-fourth (25%) cover more advanced material.

  • Definition of a Test:
    - A test is defined as a "systematic procedure for measuring a sample of an individual's behavior."

  • The Three Ways a Test is Systematic:
    - Systematic Content: Items are selected in a structured, deliberate manner from the domain of interest.
    - Systematic Administration (Standardization): The test developer provides specific, uniform guidelines for administering the test to ensure consistency across different examiners and settings.
    - Systematic Scoring: The developer specifies clear rules or steps for evaluating examinees' responses, removing subjective bias.

  • Tests as Samples of Behavior:
    - A test cannot assess every possible element of a domain (e.g., all knowledge in psychology). Instead, it measures a representative sample of terms, concepts, and theories.
    - Two primary problems arise due to sampling:
        - Problem of Validity: Does the sample of items accurately and thoroughly represent the behavior intended to be measured?
        - Problem of Reliability: Would the examinee obtain the same score if they took the test at a different time or responded to a different sample of items?

Fundamental Concepts of Reliability

  • True Score ($T$):
    - The score a person would obtain on a test if it were perfectly reliable and performance was entirely unaffected by error.

  • Measurement Error ($E$):
    - Includes all factors irrelevant to the behavior being measured that affect scores in a random or unpredictable way.
    - Features of measurement error:
        - Randomness: It is unsystematic. It may increase, decrease, or have no effect on performance depending on the individual.
        - Sources: Examples include external disturbances (which may arouse some examinees but distract others), confusing items, and ambiguous language.

  • Theoretical Equation of Scores:
    - The obtained score ($X$) is the sum of the true score and measurement error:
    - $X = T + E$
    - Across a sample of examinees, each term corresponds to a source of variability:
        - $X$: Scores obtained by a sample of examinees.
        - $T$: Variability in scores due to differences in true scores (truth).
        - $E$: Variability in scores due to measurement error (error).

  • The Reliability Coefficient ($r_{XX}$):
    - Reliability is estimated indirectly by measuring the consistency of scores across time, versions, items, or scorers.
    - Assumption: True scores are consistent, while measurement error is inconsistent.
    - Symbology: Expressed as a correlation coefficient, usually denoted as $r_{XX}$ or $r_{YY}$. The matching subscripts signify the test is being correlated with itself.
    - Interpretive Properties:
        - It is already a squared number, ranging from 0 to +1.
        - It is never squared for interpretation. It is interpreted directly as the proportion of variability in obtained test scores due to true score variability.
        - Example: A reliability coefficient of 0.9 indicates that 90% of variability reflects true scores, while 10% reflects measurement error.
        - Threshold: A coefficient of 0.8 or higher is generally considered adequate for most tests.
    - Limitation: Reliability does not indicate what the test measures (Validity), only the consistency of whatever it measures.
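The direct interpretation of the reliability coefficient follows from splitting obtained-score variance into its two parts. The sketch below assumes that true scores and errors are uncorrelated (the usual classical test theory assumption, which these notes do not state explicitly) and uses $\sigma^2$ for variance, a symbol introduced here for illustration.

```latex
\sigma_X^2 = \sigma_T^2 + \sigma_E^2,
\qquad
r_{XX} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

With $r_{XX} = 0.9$, the ratio $\sigma_T^2 / \sigma_X^2$ is 0.9 and the remaining 0.1 is error variance, which matches the 90%/10% example above.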

Four Main Methods for Evaluating Reliability

  • 1. Test-Retest Reliability:
    - Procedure: Administer the same test to the same group on two different occasions and correlate the scores.
    - Alternate Name: Coefficient of Stability.
    - Suitability: Appropriate for stable attributes (aptitude, traits). Inappropriate for fluctuating characteristics or tests susceptible to significant memory or practice effects.
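As a rough illustration of the test-retest procedure, the coefficient of stability can be computed as a Pearson correlation between the two administrations. This is a minimal sketch: the function name `pearson_r` and the scores are hypothetical, not taken from the notes.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical scores for the same five examinees on two occasions.
time_1 = [10, 12, 15, 18, 20]
time_2 = [11, 13, 14, 19, 21]

stability = pearson_r(time_1, time_2)
print(f"coefficient of stability: {stability:.2f}")
```

The same function would serve for alternate forms reliability by passing scores from Form A and Form B instead of two occasions.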

  • 2. Alternate Forms Reliability:
    - Procedure: Administer two equivalent versions of a test to the same sample and correlate the scores.
    - Alternate Names: Parallel forms, equivalent forms reliability.
    - Alternate Name of Coefficient: Coefficient of Equivalence.
    - Suitability: Best for stable characteristics. Limited if exposure to the first form unsystematically affects performance on the second.

  • 3. Internal Consistency Reliability:
    - Procedure: Administer a test once to a single sample.
    - Suitability: Appropriate for unstable characteristics and tests affected by memory/practice. Not suitable for speeded tests (it tends to overestimate their reliability).
    - Methods and Related Formulas:
        - Split-Half: The test is divided into halves (e.g., odd-even items) and scores on the two halves are correlated. Limitation: Because shorter tests are less reliable, the split-half coefficient underestimates the full test's reliability.
        - Spearman-Brown Prophecy Formula: Used to correct the split-half coefficient to estimate what the reliability would be for the full length of the test. Also used to estimate the effect of lengthening or shortening a test.
        - Cronbach's Coefficient Alpha: Conceptualized as the average of all possible split-half coefficients corrected by the Spearman-Brown formula.
        - Kuder-Richardson Formula 20 (KR-20): Used specifically when items are scored dichotomously (e.g., right/wrong, even if the items are multiple choice).
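The Spearman-Brown correction and coefficient alpha can be sketched in a few lines. The function names and the small data set are hypothetical. The sketch uses population variances throughout, so with dichotomous (0/1) items the alpha value is the same as KR-20.

```python
def spearman_brown(r, length_factor=2.0):
    """Estimated reliability when test length is multiplied by length_factor.

    length_factor=2 corrects a split-half coefficient to full test length.
    """
    return (length_factor * r) / (1 + (length_factor - 1) * r)

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def cronbach_alpha(item_scores):
    """item_scores: one list per examinee, one score per item."""
    k = len(item_scores[0])
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# A split-half correlation of 0.60, corrected to full length.
corrected = spearman_brown(0.60)

# Hypothetical right/wrong responses: 4 examinees x 3 items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(f"corrected split-half: {corrected:.2f}, alpha: {alpha:.2f}")
```

Passing a `length_factor` below 1 (for example 0.5) estimates the reliability of a shortened test, which is the second use of the formula mentioned above.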

  • 4. Inter-Rater (Inter-Scorer) Reliability:
    - Requirement: Necessary for subjective tests (projective tests, essays).
    - Procedure: Two or more independent raters score the same sample of tests.
    - Statistics Used:
        - Kappa Statistic: Used for nominal data (categories).
        - Coefficient of Concordance: Used for data in the form of ranks.
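For nominal ratings from two raters, one common form of the kappa statistic is Cohen's kappa, which adjusts observed agreement for the agreement expected by chance. The notes do not say which kappa variant is meant, so treat this as an illustrative sketch with hypothetical ratings.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categories to the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(rater_a)
    # Proportion of cases on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    # Note: undefined (division by zero) when expected agreement is 1.
    return (observed - expected) / (1 - expected)

# Hypothetical diagnostic decisions for ten cases.
rater_a = ["dx", "dx", "dx", "dx", "dx", "none", "none", "none", "none", "none"]
rater_b = ["dx", "dx", "dx", "dx", "none", "none", "none", "none", "none", "dx"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa: {kappa:.2f}")
```

Here the raters agree on 8 of 10 cases, but half of that agreement would be expected by chance, so kappa is lower than the raw 80% agreement.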

Standard Error of Measurement and Confidence Intervals

  • Standard Error of Measurement (SEM):
    - If an examinee took the same test an infinite number of times, their scores would form a distribution.
    - The mean of this distribution is the True Score.
    - The standard deviation of this distribution is the SEM, representing variability due to measurement error.

  • Confidence Intervals:
    - Used to interpret an obtained score as a range where the true score likely falls.
    - Construction:
        - 68% Confidence Interval: $\text{Obtained Score} \pm 1 \times SEM$
        - 95% Confidence Interval: $\text{Obtained Score} \pm 2 \times SEM$
    - Example Scenario:
        - Obtained score: 50
        - SEM: 5
        - 68% Interval: 45 to 55 ($50 \pm 5$).
        - 95% Interval: 40 to 60 ($50 \pm 10$).
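The interval arithmetic can be sketched as follows. The notes give the SEM directly. The helper `standard_error_of_measurement` adds the standard formula SEM = SD × √(1 − reliability), which is not stated in the notes, and the SD of 10 and reliability of 0.75 are hypothetical values chosen so that the SEM comes out to 5.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM from the test's standard deviation and reliability coefficient."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(obtained, sem, n_sems):
    """Range of obtained score plus or minus n_sems standard errors."""
    return obtained - n_sems * sem, obtained + n_sems * sem

# Hypothetical test: SD = 10, reliability = 0.75, so SEM = 5.
sem = standard_error_of_measurement(sd=10, reliability=0.75)

ci_68 = confidence_interval(50, sem, 1)   # obtained score 50, +/- 1 SEM
ci_95 = confidence_interval(50, sem, 2)   # obtained score 50, +/- 2 SEM
print(ci_68, ci_95)
```

Because the SEM shrinks as reliability rises, a more reliable test gives a narrower interval around the same obtained score.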

Factors Affecting the Magnitude of Reliability

  • Test Length: All else being equal, longer tests are generally more reliable than shorter ones.

  • Examinee Heterogeneity: Reliability coefficients are higher when the sample of examinees is heterogeneous regarding the trait being measured. This produces an unrestricted range of scores; a restricted range lowers the correlation.

  • Content Homogeneity: A test with homogeneous content (measuring one specific domain) has higher internal consistency than one with heterogeneous content.

Types of Validity

  • Definition: A test is valid when it accurately measures what it was designed to measure.

  • 1. Content Validity:
    - Purpose: Assessing mastery of a specific content or behavior domain (e.g., achievement tests, job sample tests).
    - Procedure: Built-in during construction by identifying the domain, dividing it into subcategories, and writing representative items.
    - Evaluation: Primarily through the agreement of Subject Matter Experts (SMEs) that items representatively sample the domain.

  • 2. Construct Validity:
    - Purpose: Measuring a hypothetical, intangible trait or construct (e.g., intelligence, self-esteem, depression).
    - Evaluation: Accumulated evidence, including expert judgment, group comparisons, and the Multi-trait Multi-method (MTMM) Matrix.

  • 3. Criterion-Related Validity:
    - Purpose: Using a score (the predictor) to estimate status on an external measure (the criterion).
    - Predictive Validity: Forecasting future performance (e.g., a job selection test predicting job performance six months later).
    - Concurrent Validity: Estimating current status (e.g., a job selection test estimating current job performance, with predictor and criterion measured at about the same time).
    - Validity Coefficient ($r_{XY}$): Degree of association between predictor and criterion.
    - Coefficient of Determination ($r^2$): Squared validity coefficient indicating shared variance. Example: If $r_{XY} = 0.60$, then $0.60^2 = 0.36$, so 36% of variability is shared.

Construct Validity: The Multi-trait Multi-method (MTMM) Matrix

  • Assumptions: A score reflects the trait being measured and the method used. A valid test reflects the trait more than the method.

  • Convergent Validity: High correlation between measures of the same/similar traits using different methods.

  • Divergent (Discriminant) Validity: Low correlation between measures of unrelated traits.

  • MTMM Coefficients:
    - Monotrait-Monomethod: Same trait, same method. This is a reliability coefficient. High reliability is necessary because it limits validity.
    - Monotrait-Heteromethod: Same trait, different method. Provides evidence for convergent validity (should be high).
    - Heterotrait-Monomethod: Different traits, same method. Provides evidence for divergent validity (should be low).
    - Heterotrait-Heteromethod: Different traits, different method. Provides evidence for divergent validity (should be low).

  • Conclusion: Construct validity is supported when the Monotrait-Heteromethod coefficient is larger than both the Heterotrait-Monomethod and Heterotrait-Heteromethod coefficients.
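The comparison rule in the conclusion can be written as a simple check. This is only a sketch of that one rule with hypothetical coefficients. It does not judge whether the coefficients are high or low in absolute terms, and the function name is made up for illustration.

```python
def mtmm_supports_construct_validity(mono_hetero, hetero_mono, hetero_hetero):
    """True when the convergent coefficient exceeds both divergent coefficients.

    mono_hetero:   same trait, different methods (convergent evidence)
    hetero_mono:   different traits, same method (divergent evidence)
    hetero_hetero: different traits, different methods (divergent evidence)
    """
    return mono_hetero > hetero_mono and mono_hetero > hetero_hetero

# Hypothetical coefficients from two different matrices.
supported = mtmm_supports_construct_validity(0.65, 0.30, 0.15)
not_supported = mtmm_supports_construct_validity(0.40, 0.55, 0.20)
print(supported, not_supported)
```

In the second case the Heterotrait-Monomethod coefficient (0.55) exceeds the convergent coefficient, which suggests scores reflect the shared method more than the trait.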

Criterion-Related Validity: Prediction and Decisions

  • Regression Equation: Used to predict a criterion score from a predictor score.

  • Standard Error of Estimate: Used to construct a confidence interval around a predicted criterion score.
    - 68% Interval: $\text{Predicted Criterion Score} \pm 1 \times \text{Standard Error of Estimate}$
    - 95% Interval: $\text{Predicted Criterion Score} \pm 2 \times \text{Standard Error of Estimate}$
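A small sketch of prediction with an interval follows. The regression intercept and slope, the criterion SD, and the validity coefficient are all hypothetical. The standard error of estimate is computed with the standard formula SD of the criterion × √(1 − validity²), which the notes do not state.

```python
import math

def predict_criterion(intercept, slope, predictor_score):
    """Regression equation: predicted criterion = intercept + slope * predictor."""
    return intercept + slope * predictor_score

def standard_error_of_estimate(sd_criterion, validity):
    """Spread of actual criterion scores around the predicted score."""
    return sd_criterion * math.sqrt(1 - validity ** 2)

def interval_around_prediction(predicted, see, n_errors):
    """Predicted score plus or minus n_errors standard errors of estimate."""
    return predicted - n_errors * see, predicted + n_errors * see

# Hypothetical regression equation and test statistics.
predicted = predict_criterion(intercept=2.0, slope=0.5, predictor_score=60)
see = standard_error_of_estimate(sd_criterion=10, validity=0.6)

interval_68 = interval_around_prediction(predicted, see, 1)
interval_95 = interval_around_prediction(predicted, see, 2)
print(predicted, see, interval_68, interval_95)
```

A higher validity coefficient shrinks the standard error of estimate, so predictions from a more valid test come with narrower intervals.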

  • Incremental Validity: The increase in decision-making accuracy achieved by adding a new predictor.
    - Base Rate: The proportion of correct decisions made without the new predictor (e.g., successful employees hired via current methods).
    - Positive Hit Rate: The proportion of individuals selected with the new predictor who turn out to be successful. Calculated as: $\frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$.
    - Calculation: $\text{Incremental Validity} = \text{Positive Hit Rate} - \text{Base Rate}$.
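The calculation can be sketched with hypothetical counts. The notes give no formula for the base rate, so this sketch assumes one common convention: the proportion of all people hired under the current method who succeed on the criterion, (TP + FN) / total.

```python
def positive_hit_rate(tp, fp):
    """Proportion of those selected by the new predictor who succeed."""
    return tp / (tp + fp)

def base_rate(tp, fp, tn, fn):
    """Assumed convention: proportion of all current hires who succeed."""
    return (tp + fn) / (tp + fp + tn + fn)

def incremental_validity(tp, fp, tn, fn):
    """Positive hit rate minus base rate."""
    return positive_hit_rate(tp, fp) - base_rate(tp, fp, tn, fn)

# Hypothetical counts for 100 current employees.
counts = {"tp": 30, "fp": 10, "tn": 40, "fn": 20}

gain = incremental_validity(**counts)
print(f"positive hit rate: {positive_hit_rate(30, 10):.2f}")
print(f"base rate: {base_rate(**counts):.2f}")
print(f"incremental validity: {gain:.2f}")
```

With these counts, 50% of current hires succeed, while 75% of those the new predictor would select succeed, so the predictor adds 25 percentage points of accuracy.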

  • Scatter Plot Quadrants (Relative to Cutoff Scores):
    - True Positives: High predictor, high criterion.
    - False Positives: High predictor, low criterion.
    - True Negatives: Low predictor, low criterion.
    - False Negatives: Low predictor, high criterion.
    - Rule for Cutoffs: Raising the predictor cutoff score decreases the number of positives (both true and false) and increases the number of negatives (both true and false).
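The quadrant definitions and the cutoff rule can be checked with a short classification sketch. The score pairs and cutoff values are hypothetical, and scores at or above a cutoff are treated as "high", a convention assumed here.

```python
def classify(pairs, predictor_cutoff, criterion_cutoff):
    """Count quadrant membership for (predictor, criterion) score pairs."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for predictor, criterion in pairs:
        selected = predictor >= predictor_cutoff      # "positive" on the predictor
        successful = criterion >= criterion_cutoff    # "high" on the criterion
        if selected and successful:
            counts["TP"] += 1
        elif selected:
            counts["FP"] += 1
        elif successful:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Hypothetical (predictor, criterion) scores for eight applicants.
pairs = [(30, 40), (45, 70), (55, 45), (60, 80),
         (70, 65), (80, 90), (65, 50), (40, 60)]

low_cutoff = classify(pairs, predictor_cutoff=50, criterion_cutoff=60)
high_cutoff = classify(pairs, predictor_cutoff=65, criterion_cutoff=60)
print(low_cutoff)
print(high_cutoff)
```

Raising the predictor cutoff from 50 to 65 moves applicants out of the two positive quadrants and into the two negative ones, while the total number of applicants stays the same.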

The Relationship Between Reliability and Validity

  • Reliability Limits Validity: A test's validity cannot be higher than the square root of its reliability.
    - $\text{Validity} \le \sqrt{\text{Reliability}}$
    - Example: If reliability is 0.81, the maximum possible validity is $\sqrt{0.81} = 0.9$.
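The ceiling on validity is a one-line calculation. The function name below is hypothetical, and the sketch only computes the upper bound. It says nothing about what a test's actual validity is.

```python
import math

def max_validity(reliability):
    """Upper bound on a validity coefficient for a given reliability."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return math.sqrt(reliability)

ceiling = max_validity(0.81)
print(f"maximum possible validity: {ceiling:.2f}")
```

Because the square root of a number between 0 and 1 is larger than the number itself, the ceiling is always at least as high as the reliability coefficient.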

  • Necessity vs. Sufficiency: Reliability is a necessary condition for validity, but it is not sufficient. High reliability does not guarantee that a test measures what it intends to measure.

Questions & Discussion

  • Question 1 (Reliability for Fluctuating Characteristics):
    - Prompt: To evaluate the reliability of a characteristic that varies in severity or intensity over time, which coefficient should be used: a) Equivalence, b) Stability, c) Alienation, or d) Internal Consistency?
    - Response: The correct answer is d) Coefficient of Internal Consistency. Stability (test-retest) and Equivalence (alternate forms) both involve two administrations, which is problematic for fluctuating traits. Alienation measures non-association. Internal consistency requires only one administration, capturing the state at that moment.

  • Question 2 (Heterogeneous vs. Homogeneous Samples):
    - Prompt: For a sample with IQ scores from 50 to 150, the reliability is 0.80. If calculated for gifted children only, it would likely be: a) 0.80, b) > 0.80, c) < 0.80, or d) either higher or lower than 0.80?
    - Response: The correct answer is c) less than 0.80. Restricting the range of scores (moving to a homogeneous sample like gifted children only) lowers the correlation coefficient.

  • Question 3 (MTMM Matrix Interpretation):
    - Prompt: In an MTMM matrix, a large Heterotrait-Monomethod coefficient suggests the test: a) has convergent validity, b) has discriminant validity, c) lacks convergent validity, or d) lacks discriminant validity?
    - Response: The correct answer is d) lacks discriminant (divergent) validity. A large coefficient between different traits (Heterotrait) implies the test is correlating with something it shouldn't, failing to discriminate between constructs.

  • Question 4 (Changing Cutoff Scores):
    - Prompt: A test developer finds that raising the cutoff score on a selection test: a) increases TP and TN, b) decreases TP and TN, c) decreases TP and FP, or d) increases TP and FP?
    - Response: The correct answer is c) decreases the number of true and false positives. Raising the predictor cutoff moves the vertical cutoff line on the scatter plot to the right, shrinking the area to the right of the line, which contains all positives (true and false).