RELIABILITY + VALIDITY OF TEST
Characteristics of a Test
Reliability
The reliability of a test refers to its degree of stability, consistency, predictability, and accuracy.
It addresses the extent to which scores obtained by a person remain the same if the person is reexamined with the same test on different occasions.
Reliability reflects how much of the variability in test scores is due to random error: the smaller the error component, the more reliable the test.
Random errors can include:
Misunderstanding a question
Mistakes during test administration
Changes in the person’s mood
Although no test is completely free of errors, high reliability implies that differences in scores are more likely to reflect true differences in abilities rather than random chance.
Factors Affecting Reliability
There are two primary reasons why test scores might contain some error:
Natural Variation in Human Performance
People don’t perform identically every time; variability is smaller in tests measuring abilities (e.g., intelligence) compared to tests measuring personality traits (e.g., mood).
Abilities change gradually over time, while personality traits can change frequently due to temporary factors. Thus, tests measuring personality are generally less reliable than ability tests.
Imprecise Measurement in Psychological Testing
Unlike physical sciences where measurements can be direct, psychological traits like intelligence must be measured indirectly through behaviors, introducing additional error.
Normal variations in performance can occur even without testing errors, necessitating good test design to minimize these inaccuracies.
Methods of Assessing Reliability
There are four primary methods for assessing reliability:
(a) Test-Retest Reliability
Checks the consistency of test results over time.
The same test is administered to the same group of individuals on two separate occasions. Scores are correlated to assess stability.
(b) Alternate-Forms Reliability
Also known as parallel-forms reliability; measures consistency when two different but equivalent versions of a test are used.
Two equivalent forms are created and administered to the same group, with scores compared for correlation.
(c) Internal Consistency
Assesses how well items on a test correlate with each other, ensuring that items measure the same construct. Methods include:
Split-Half Reliability
Coefficient Alpha
(d) Interscorer Reliability
Evaluates consistency when different examiners score the same test. High correlation indicates consistent scoring among examiners.
Test-Retest Reliability
Involves administering the same test to the same group at two different times (e.g., two weeks apart) and correlating the scores.
High correlation implies that the test yields stable and consistent results, so that any score changes are more likely to reflect real change in the individual than random factors.
Important to choose a proper time interval between tests to avoid:
Carryover effect (if too short, examinees may simply remember their earlier answers)
Real-life changes in the individual (if too long)
Example: A psychologist administering a stress tolerance test with a correlation of 0.85 between two sessions signifies good reliability.
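A minimal sketch of how such a coefficient can be computed, using NumPy and entirely hypothetical scores; the same Pearson-correlation computation also yields alternate-forms reliability (Form A vs. Form B totals) and interscorer reliability (rater 1 vs. rater 2 scores):
```python
import numpy as np

# Hypothetical stress-tolerance scores for the same ten people,
# tested two weeks apart
session_1 = np.array([62, 75, 58, 81, 69, 90, 55, 73, 66, 84])
session_2 = np.array([65, 72, 60, 79, 71, 88, 58, 70, 64, 86])

# Test-retest reliability is the Pearson correlation between the sessions
r = np.corrcoef(session_1, session_2)[0, 1]
print(f"test-retest reliability = {r:.2f}")
```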
Alternate-Forms Reliability
Measures test result consistency when two different versions of the same test are used.
Developers create two equivalent forms of a test, each designed to measure the same trait at the same difficulty level.
High correlation between scores from both forms indicates their consistency in measuring the true construct reliably.
Example: Two forms of a math test (Form A and Form B) showing a correlation of 0.88 suggest high alternate-forms reliability.
Internal Consistency
Two methods are used for this reliability check:
Split-Half Reliability
Divides the test into two halves and correlates the scores from each half, indicating how well the two halves agree.
Coefficient Alpha (Cronbach’s alpha)
Evaluates how all items on the test relate to one another, providing a comprehensive measure of consistency.
Both methods have limitations: splitting a test leaves fewer items per half, reducing stability (the Spearman-Brown formula is commonly applied to correct for the shortened length), while longer tests tend to be more reliable.
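A brief sketch of both internal-consistency computations, assuming NumPy and a hypothetical matrix of item responses (rows are examinees, columns are items); the split-half value is stepped up with the Spearman-Brown correction mentioned above:
```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (examinees x items) score matrix."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def split_half(items: np.ndarray) -> float:
    """Odd-even split-half reliability with Spearman-Brown correction."""
    odd = items[:, 0::2].sum(axis=1)
    even = items[:, 1::2].sum(axis=1)
    r_halves = np.corrcoef(odd, even)[0, 1]
    return 2 * r_halves / (1 + r_halves)       # step up to full-length test

# Hypothetical responses: 6 examinees answering 6 items (1 = correct)
scores = np.array([
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
])
print(f"coefficient alpha = {cronbach_alpha(scores):.2f}")
print(f"split-half (corrected) = {split_half(scores):.2f}")
```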
Interscorer Reliability
Determines consistency in test results when different examiners evaluate the same test submissions.
Particularly relevant for tests that involve subjective scoring (e.g., projective tests such as the Rorschach).
High correlation between different scorers indicates less subjectivity in the scoring.
Example: If two psychologists score the same Rorschach protocol and arrive at similar ratings, the scoring is considered reliable.
Standard Error of Measurement (SEM)
The SEM is a statistical estimate that indicates the amount of error to be expected in test scores.
SEM Explanation
Test scores are composed of:
True Score: The score obtained if there were no errors.
Error: Random influences (e.g., mood, distractions).
No score can be perfectly accurate; SEM provides a range of likely true scores based on the test's reliability.
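The formula is not stated in these notes, but in classical test theory the SEM is conventionally obtained from the test's standard deviation and its reliability coefficient:
$SEM = SD\sqrt{1 - r}$
where $SD$ is the standard deviation of the test scores and $r$ is the reliability coefficient.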
SEM Function
SEM indicates the range within which a person's true score likely lies, functioning similarly to a standard deviation in indicating score variance.
About:
68% of the time, an individual's true score is within ±1 SEM of the observed score.
95% of the time, an individual's true score is within ±2 SEMs.
Example Calculation: If an individual's IQ score is 100 with a SEM of 3:
68% confidence that true IQ is between 97 and 103 (100 ± 3).
95% confidence that true IQ is between 94 and 106 (100 ± 6).
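A quick numeric check of this example, assuming the conventional IQ standard deviation of 15 and a hypothetical reliability of 0.96 (values chosen so the SEM works out to 3):
```python
import math

sd = 15.0           # conventional IQ standard deviation (assumed)
reliability = 0.96  # hypothetical reliability coefficient
observed = 100      # observed IQ score

sem = sd * math.sqrt(1 - reliability)  # 15 * 0.2 = 3.0
print(f"SEM = {sem:.1f}")
print(f"68% range: {observed - sem:.0f} to {observed + sem:.0f}")          # 97 to 103
print(f"95% range: {observed - 2 * sem:.0f} to {observed + 2 * sem:.0f}")  # 94 to 106
```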
Validity
Validity is the most critical aspect of test construction, focusing on whether a test measures what it claims to measure.
A test must be reliable to be considered valid, but a reliable test is not necessarily valid.
Example: A musical preference test may consistently produce results (reliable) but not accurately measure creativity (not valid).
Challenges in Proving Validity
Validating a test is challenging due to the abstract nature of psychological traits.
Definitions of these traits change over time, necessitating continuous validation efforts.
Test designers must clearly define the construct and develop appropriate questions.
Example: IQ tests may predict academic success but may not capture the full essence of intelligence as conceptualized.
Types of Validity
Content Validity
Content validity assesses how representative and relevant the test items are to the intended construct.
Test constructors must ensure that item selection covers all significant areas of the construct.
Example: A Spanish test lacking a listening component lacks content validity in measuring comprehensive language ability.
Face Validity
Face validity evaluates a test based on subjective judgment, often reviewed by users or stakeholders.
While it provides superficial assurance, psychometricians regard it as less rigorous than content validity.
Example: An arithmetic test for mechanics shows face validity if its problems visibly involve machines, so that test takers and stakeholders agree it measures what it is intended to.
Criterion Validity
Criterion validity is assessed by comparing the test scores with an external measure that is theoretically related to the test.
Types of criterion validity include:
Concurrent Validity: Assessed when both the test and external measure are taken simultaneously.
Example: Comparing intelligence and academic achievement tests administered within the same week.
Predictive Validity: Evaluated when the test is given first and the external measure is collected later.
Example: Correlating intelligence test scores with GPA in the subsequent year.
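As an illustration of how a predictive-validity coefficient is obtained, here is a minimal sketch with hypothetical intelligence-test scores and next-year GPAs (NumPy assumed):
```python
import numpy as np

# Hypothetical data: intelligence-test scores and GPA one year later
test_scores = np.array([112, 95, 130, 104, 88, 121, 99, 116])
gpa_next_year = np.array([3.4, 2.8, 3.9, 3.1, 2.5, 3.6, 3.0, 3.3])

# Predictive validity is the correlation between the test and the later criterion
validity = np.corrcoef(test_scores, gpa_next_year)[0, 1]
print(f"predictive validity coefficient = {validity:.2f}")
```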
Construct Validity
Construct validity investigates whether a test measures the intended theoretical trait accurately through a sequence of steps:
Define the construct clearly.
Predict how the construct relates to other variables.
Conduct research to verify these predictions.
Example: A dominance test should correlate positively with leadership roles and negatively with submissive behaviors.
Various methods can establish construct validity, including age-related changes and intervention effects, to demonstrate that test behaviors align with the theoretical predictions.
Types of Construct Validity
Convergent Validity: Confirms that measures of related constructs correlate as expected.
Discriminant Validity: Confirms the expected lack of relationship with unrelated constructs.
Example: A self-esteem measure should correlate with related constructs such as social skills and optimism (convergent evidence) while remaining largely unrelated to variables outside the construct (discriminant evidence), as in the sketch below.
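A sketch of how convergent and discriminant evidence might be checked numerically; all scores are hypothetical, and shoe size stands in for a variable the construct should not relate to:
```python
import numpy as np

self_esteem = np.array([32, 40, 25, 38, 29, 41, 35, 27])
optimism = np.array([28, 35, 22, 33, 27, 37, 30, 24])  # related construct
shoe_size = np.array([8, 9, 8, 9, 8, 8, 10, 9])        # unrelated variable

# Convergent: measures of related constructs should correlate substantially
print(f"convergent r   = {np.corrcoef(self_esteem, optimism)[0, 1]:.2f}")
# Discriminant: the correlation with an unrelated variable should be much weaker
print(f"discriminant r = {np.corrcoef(self_esteem, shoe_size)[0, 1]:.2f}")
```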
Standard Error of Estimate (SEE)
The SEE measures the prediction accuracy in regression analysis, indicating the standard deviation of residuals (the discrepancy between actual and predicted values).
A low SEE indicates accurate predictions, while a high SEE indicates less reliable predictions, impacting confidence in regression results.
Analogy: SEE is like the grouping of darts around a target (bullseye), showing average error in predictions.
Example: In predicting exam scores from study hours, an SEE of 4 signifies that predictions typically deviate from the actual scores by about 4 points.
Formula for SEE
To calculate the SEE from the standard deviation of the Y values and the correlation coefficient between X and Y:
$SEE = SD_{Y}\sqrt{1 - r^{2}}$
Where:
$SD_{Y}$ = Standard deviation of the Y values
$r$ = Correlation coefficient between X and Y values
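A quick numeric illustration, with hypothetical values chosen to reproduce the exam-score example above ($SD_{Y} = 8$ and $r \approx 0.87$ give an SEE of about 4):
```python
import math

sd_y = 8.0  # hypothetical standard deviation of exam scores
r = 0.866   # hypothetical correlation between study hours and exam scores

see = sd_y * math.sqrt(1 - r ** 2)  # SEE = SD_Y * sqrt(1 - r^2)
print(f"SEE = {see:.1f}")  # ~4.0: predictions typically miss by about 4 points
```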