L5 - Reliability Estimates

Complete Notes on Reliability Estimates

I. RELIABILITY

  • Definition: Refers to the consistency and precision of the results of the measurement process.

  • Importance: To have confidence or trust in scores, test users require evidence that scores obtained from a test would be consistent if the test were repeated on the same individuals or groups, and that the scores are reasonably precise.

II. ERRORS IN MEASUREMENT

  • Concept: Error is the component of the observed test score that does not have to do with the testtaker’s ability.

  • Formula: X = T + E

    • X: Observed score

    • T: True score

    • E: Error

  • Variance: Used to describe sources of test-score variability.

    • True Variance: The part of total variance that comes from real differences between people, free of measurement error. It is assumed to be stable across time and repeated measurements.

    • Error Variance: This is the unwanted variation in scores due to random errors.

    • Total Variance: This is the total amount of variation in test scores, made up of both true differences between people and errors in measurement.

      • Formula: total variance = true variance + error variance (illustrated in the sketch at the end of this section)

  • Factors Associated with the Measurement Process: Factors associated with the process of measuring a variable, other than the variable being measured; examples include test administration procedures, the testtaker’s culture, etc.

  • Random Error: Consists of unpredictable fluctuations and inconsistencies of other variables in the measurement process.

    • Sometimes referred to as “noise,” this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores.

  • Systematic Error: Source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.
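
The score model and the variance decomposition can be illustrated with a small simulation. The following is a minimal sketch in Python with NumPy; the trait mean, the standard deviations, and the sample size are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000                      # simulated testtakers
true = rng.normal(50, 10, n)    # true scores (T), the stable component
error = rng.normal(0, 5, n)     # random error (E), the "noise"
observed = true + error         # X = T + E

# With T and E independent, total variance is (approximately) the sum
# of true variance and error variance.
print(np.var(observed, ddof=1))                      # ~125
print(np.var(true, ddof=1) + np.var(error, ddof=1))  # ~125

# Reliability: the proportion of total variance that is true variance.
print(np.var(true, ddof=1) / np.var(observed, ddof=1))  # ~0.80
```

Note that a purely systematic error (e.g., a constant added to every score) would shift the mean but leave the variance, and hence the true-to-total ratio, unchanged; this is why systematic error does not reduce score consistency.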

III. SOURCES OF ERROR VARIANCE

  • Test Construction: One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests. Some items may be so poorly worded that different testtakers interpret them differently, which makes the test unreliable.

  • Test Administration:

    • Test Environment

    • Testtaker

    • Test User

  • Test Scoring and Interpretation

  • Sampling Error

IV. RELIABILITY COEFFICIENT

  • Definition: A statistic that quantifies reliability, ranging from 0 (not at all reliable) to 1 (perfectly reliable).

  • Interpretation: A proportion indicating the ratio of true variance to total variance on a test.
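
  • Example: If the true variance is 80 and the total variance is 100, the reliability coefficient is 80/100 = .80; that is, 80% of the variability in observed scores reflects true differences among testtakers and the remaining 20% reflects error.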

V. HOW TO MEASURE RELIABILITY?

  • A. Test-Retest Reliability Estimates

  • B. Parallel-Forms and Alternate Forms Reliability Estimates

  • C. Split-Half Reliability Estimates

  • D. Internal Consistency

  • E. Inter-Scorer Reliability

A. TEST-RETEST RELIABILITY ESTIMATES

  • Definition: An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.

  • Applicability: Only applicable to tests that measure something that is stable over time.

  • Coefficient of Stability: The reliability coefficient obtained with a test-retest procedure.
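
A minimal sketch of the computation, assuming hypothetical scores from two administrations of the same test to the same five testtakers (Python with NumPy):

```python
import numpy as np

# Hypothetical scores for the same five testtakers, tested twice.
time1 = np.array([12, 15, 19, 22, 30])
time2 = np.array([13, 14, 20, 21, 29])

# The coefficient of stability is the Pearson r between administrations.
r = np.corrcoef(time1, time2)[0, 1]
print(round(r, 3))  # close to 1 here, i.e., highly stable scores
```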

B. PARALLEL-FORMS AND ALTERNATE-FORMS RELIABILITY ESTIMATES

  • Parallel Forms: Exist when, for each form of the test, the means and the variances of observed test scores are equal.

  • Alternate Forms: Different versions of a test that have been constructed so as to be parallel.

  • Coefficient of Equivalence: The estimate of the degree of relationship between various forms of a test, evaluated by means of an alternate-forms or parallel-forms reliability coefficient.

  • Similarities with test-retest reliability estimates:

    • Two test administrations with the same group are required.

    • Test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning or therapy.
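
A minimal sketch, again with hypothetical scores, of checking the parallel-forms condition (equal means and variances) and computing the coefficient of equivalence:

```python
import numpy as np

# Hypothetical scores for the same six testtakers on two forms of a test.
form_a = np.array([24, 30, 18, 27, 21, 33])
form_b = np.array([25, 29, 19, 26, 22, 32])

# Strictly parallel forms require equal means and variances of observed
# scores; in practice, check that they are approximately equal.
print(form_a.mean(), form_b.mean())
print(form_a.var(ddof=1), form_b.var(ddof=1))

# The coefficient of equivalence is the correlation between the two forms.
print(np.corrcoef(form_a, form_b)[0, 1])
```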

C. SPLIT-HALF RELIABILITY ESTIMATE

  • Definition: Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.

  • Usefulness: It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice.

  • Computation Steps:

    • Step 1: Divide the test into equivalent halves.

    • Step 2: Calculate a Pearson r between scores on the two halves of the test.

    • Step 3: Adjust the half-test reliability using the Spearman-Brown formula.

  • Step 1: Divide the test into equivalent halves.

    • Randomly assign items to one or the other half of the test.

    • Odd-Even Reliability: Assigning odd-numbered items to one half of the test and even-numbered items to the other half. This method is a popular way to split a test.

    • Divide the test by content so that each half contains items equivalent with respect to content and difficulty.

  • Step 2: Compute the Pearson r between scores from both halves.

  • Step 3: Adjust the half-test reliability using the Spearman-Brown formula (for internal consistency): r_SB = 2r_hh / (1 + r_hh), where r_hh is the correlation between the two halves.

    • It allows a test developer or user to estimate internal consistency reliability from a correlation between two halves of a test.

    • By determining the reliability of one half of a test, a test developer can use the Spearman-Brown formula to estimate the reliability of a whole test.

    • Estimates of reliability based on consideration of the entire test therefore tend to be higher than those based on half of a test.
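
The three steps can be sketched in a few lines of Python (NumPy); the item-response matrix is hypothetical, with rows as testtakers and columns as dichotomously scored items:

```python
import numpy as np

# Hypothetical responses: 5 testtakers x 8 items (1 = correct, 0 = incorrect).
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0],
])

# Step 1: odd-even split (items 1, 3, 5, ... vs. items 2, 4, 6, ...).
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Step 2: Pearson r between scores on the two halves.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Step 3: Spearman-Brown adjustment to estimate full-test reliability.
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 3), round(r_full, 3))  # the adjusted value is higher
```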

D. INTERNAL CONSISTENCY

  • Inter-Item Consistency: Refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test.

  • Test of Homogeneity: Tests are said to be homogeneous if they contain items that measure a single trait.

1. INTER-ITEM CONSISTENCY

  • Kuder-Richardson formula (KR-20): Statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong.

    • Developed by Kuder and Richardson (1937).

  • Coefficient Alpha / Cronbach’s Alpha: Appropriate for use on tests containing non-dichotomous items (e.g., items rated on a 1-5 scale).

    • Developed by Cronbach (1951) and subsequently elaborated on by others.

    • Widely used as a measure of reliability, in part because it requires only one administration of the test.

    • Typically ranges in value from 0 to 1.

    • Coefficient alpha is calculated to help answer questions about how similar sets of data are.

    • Bigger is not always better: a coefficient above .90 may indicate redundancy among the items.

    • Assumes that each item has equal weight, i.e., equal strength or sensitivity to the construct being measured.
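
A minimal sketch of coefficient alpha on small hypothetical data sets. For dichotomous items the same formula yields KR-20 (each item's variance is then p × q), so one helper covers both cases:

```python
import numpy as np

def coefficient_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Dichotomous (right/wrong) items: alpha here equals KR-20.
dichotomous = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])
print(round(coefficient_alpha(dichotomous), 3))

# Non-dichotomous items (e.g., 1-5 ratings) use the same formula.
likert = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
])
print(round(coefficient_alpha(likert), 3))
```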

E. INTER-SCORER RELIABILITY

  • Definition: The degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.

  • Usage: Often used when coding nonverbal behavior.
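
The simplest index is percent agreement; a chance-corrected statistic such as Cohen's kappa (not covered in the notes above, but widely used for this purpose) is often preferred. A minimal sketch with hypothetical ratings from two raters:

```python
import numpy as np

# Hypothetical codes from two raters for the same ten observed behaviors
# (1 = behavior present, 0 = absent).
rater_a = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
rater_b = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])

# Simple percent agreement.
p_obs = (rater_a == rater_b).mean()
print(p_obs)  # 0.8

# Cohen's kappa corrects observed agreement for chance agreement:
# kappa = (p_obs - p_exp) / (1 - p_exp)
categories = np.unique(np.concatenate([rater_a, rater_b]))
p_exp = sum((rater_a == c).mean() * (rater_b == c).mean() for c in categories)
kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(kappa, 3))  # lower than raw agreement
```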

VI. TYPE OF RELIABILITY AND SOURCES OF ERROR

| Type of Reliability | Description | Key Differences | Primary Source of Error Variance |
| --- | --- | --- | --- |
| A. Test-Retest Reliability | Measures the consistency of test results over time by administering the same test to the same group on two different occasions. | Focuses on stability over time. | Time sampling (changes in testtakers or conditions between sessions) |
| B. Parallel-Forms and Alternate-Forms Reliability | Assesses the consistency of test results by administering different versions of a test (that measure the same construct) to the same group. | Focuses on equivalence between different forms of a test. | Item (content) sampling; also time sampling if administrations are separated |
| C. Split-Half Reliability | Involves splitting a test into two halves and correlating the scores from each half to assess internal consistency. | Focuses on the consistency of test items within a single test. | Item (content) sampling |
| D. Internal Consistency | Measures the consistency of results across items within a test, often using Cronbach's alpha. | Focuses on the homogeneity of items within a test. | Item (content) sampling and item heterogeneity |
| E. Inter-Scorer Reliability | Assesses the consistency of scores assigned by different scorers or raters. | Focuses on agreement between different evaluators. | Scorer (rater) differences |

Each type of reliability estimate serves a different purpose and is used in different contexts to ensure the accuracy and consistency of measurements.
