Reliability

Introduction

Reliability refers to consistency of measurement. A test is not considered reliable if its scores vary noticeably from one occasion or situation to another; such a test cannot be used to make predictions about examinees' behaviour.

Reliability is not the same as stability. In determining reliability, it is assumed that the test is measuring a relatively stable characteristic. Unreliability results from measurement errors produced by temporary internal states (for example, low motivation or indisposition) or by external conditions such as a distracting or uncomfortable testing environment.

Classical Reliability Theory

A person's true score (a score we can never observe directly) on a particular test is defined as the average of the scores the person would obtain if they took the test an infinite number of times; this is why the true score is a theoretical quantity. A true score can never be measured exactly; it must be estimated from the examinee's observed score on the test.

🔢 σ²(observed) = σ²(true) + σ²(error)

Reliability: r = σ²(true) / σ²(observed)

r = 1.00 means perfect reliability; r = 0.00 means complete unreliability.
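
A small simulation can make this decomposition concrete. The sketch below (Python with NumPy; all numbers are hypothetical) generates true scores and independent errors, then recovers reliability as the ratio of true-score variance to observed-score variance:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000                          # number of examinees (hypothetical)
    true = rng.normal(50, 8, n)          # true scores: mean 50, SD 8
    error = rng.normal(0, 4, n)          # measurement error, independent of true score
    observed = true + error              # observed score = true score + error

    # Reliability = variance of true scores / variance of observed scores
    reliability = true.var() / observed.var()
    print(round(reliability, 3))         # close to 64 / (64 + 16) = 0.80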

Test-Retest Coefficient or Coefficient of Stability

The test-retest coefficient, also called the coefficient of stability, is computed to determine whether a test measures consistently from one time to another. It is obtained by correlating the scores of a group of people on one administration of the test with their scores on a second administration. Differences in the conditions of testing, and in the examinees themselves, are likely to be greater after a long interval than after a short one.

Therefore, test-retest reliability tends to be higher (less error) when the interval between the initial test and retest is short (days or a few weeks).
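
As a sketch (the scores below are hypothetical), the coefficient of stability is simply the Pearson correlation between the two administrations:

    import numpy as np

    # Scores for the same ten examinees on two administrations (hypothetical)
    test = np.array([12, 15, 19, 22, 25, 27, 30, 31, 34, 38])
    retest = np.array([14, 14, 20, 21, 27, 26, 29, 33, 33, 37])

    # Coefficient of stability = Pearson r between the two sets of scores
    r_stability = np.corrcoef(test, retest)[0, 1]
    print(round(r_stability, 3))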

Parallel Forms Coefficient/Coefficient of Equivalence

The parallel forms coefficient is another measure of reliability. With test-retest reliability, when the interval between the initial test and the retest is short, examinees usually remember many of the questions and their responses from the initial test. Some people remember more of the test material than others, causing the correlation between test and retest to be less than perfect.

With parallel forms, on the first administration Form A is given to one half of the group and Form B to the other half; on the second administration the forms are reversed. The problem is that it is very difficult to develop two truly parallel versions of a test: constructing them is expensive and demanding, which is why internal consistency coefficients came about.
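
A sketch of the counterbalanced design (hypothetical data): each examinee takes both forms, and the coefficient of equivalence is the correlation between Form A and Form B scores, regardless of which form each half of the group took first:

    import numpy as np

    # Form A and Form B scores for the same ten examinees (hypothetical);
    # half took A first and half took B first, to balance out order effects.
    form_a = np.array([41, 45, 48, 52, 55, 58, 60, 63, 67, 70])
    form_b = np.array([43, 44, 50, 51, 54, 59, 58, 65, 66, 72])

    # Coefficient of equivalence = Pearson r between the two forms
    r_equivalence = np.corrcoef(form_a, form_b)[0, 1]
    print(round(r_equivalence, 3))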

Internal Consistency Coefficients

There are three methods: the split-half method, the Kuder-Richardson method, and coefficient alpha.

Split Half Method

A single test is treated as composed of two parts (parallel forms) measuring the same thing. The test can therefore be administered once and separate scores assigned on two arbitrarily selected halves, for example, one half consisting of the odd-numbered items and the other of the even-numbered items.

Assuming that the two halves have equal means and variances, the reliability of the test as a whole may be estimated by the Spearman-Brown formula: r = 2r_hh / (1 + r_hh), where r_hh is the correlation between the two halves.
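
A sketch of an odd-even split with the Spearman-Brown correction (the simulated 0/1 item responses are hypothetical):

    import numpy as np

    rng = np.random.default_rng(1)
    # Hypothetical right/wrong responses: 200 examinees x 20 items,
    # driven by a common ability so the two halves correlate.
    ability = rng.normal(0, 1, (200, 1))
    items = (rng.normal(0, 1, (200, 20)) < ability).astype(int)

    odd_half = items[:, 0::2].sum(axis=1)    # score on odd-numbered items
    even_half = items[:, 1::2].sum(axis=1)   # score on even-numbered items

    r_hh = np.corrcoef(odd_half, even_half)[0, 1]   # half-test correlation
    r_full = 2 * r_hh / (1 + r_hh)                  # Spearman-Brown correction
    print(round(r_hh, 3), round(r_full, 3))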

Kuder-Richardson Method

A test can be divided in half in many different ways, which may result in different values of r. The idea behind this method is to take the average of the reliability coefficients from all possible half-splits as the overall reliability estimate. In practice this is done through a short-cut procedure, the Kuder-Richardson formula.

This method is applicable only when the test items are dichotomous, for example 'yes' or 'no' or right/wrong, because each item must be scored 0 or 1.
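
A sketch of the Kuder-Richardson formula 20 (KR-20) for dichotomous items; the function name and the small data matrix below are illustrative:

    import numpy as np

    def kr20(items):
        # KR-20 for a matrix of 0/1 item scores (examinees x items)
        k = items.shape[1]                   # number of items
        p = items.mean(axis=0)               # proportion passing each item
        q = 1 - p                            # proportion failing each item
        total_var = items.sum(axis=1).var()  # variance of total test scores
        return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

    # Hypothetical right/wrong responses: 6 examinees x 5 items
    items = np.array([[1, 1, 1, 0, 1],
                      [1, 0, 1, 1, 1],
                      [0, 1, 0, 0, 1],
                      [1, 1, 1, 1, 1],
                      [0, 0, 1, 0, 0],
                      [1, 1, 0, 1, 1]])
    print(round(kr20(items), 3))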

Coefficient Alpha

The Kuder-Richardson formulas are applicable only when the test items are scored 0 or 1. Coefficient alpha, by contrast, is a general formula for estimating reliability when a test consists of items on which different scoring weights may be assigned to different responses. Hence, questionnaires using Likert scales can use this coefficient.
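
A sketch of coefficient alpha (Cronbach's alpha) for Likert-type items; the function name and response matrix are hypothetical:

    import numpy as np

    def cronbach_alpha(items):
        # Coefficient alpha for a matrix of item scores (examinees x items)
        k = items.shape[1]
        item_vars = items.var(axis=0)        # variance of each item
        total_var = items.sum(axis=1).var()  # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical 1-5 Likert responses: 6 respondents x 4 items
    likert = np.array([[4, 5, 4, 4],
                       [3, 3, 4, 3],
                       [5, 5, 5, 4],
                       [2, 2, 3, 2],
                       [4, 4, 4, 5],
                       [1, 2, 2, 1]])
    print(round(cronbach_alpha(likert), 3))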

Interscorer/Interrater Reliability

The most common approach for determining this is to have two persons score the responses of a sizable number of examinees and then compute the correlation between the two sets of scores. Another approach is to have many people score the test responses of one examinee, or of several examinees; this yields an intraclass coefficient or a coefficient of concordance.
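
As a sketch of the two-scorer approach (the ratings below are hypothetical), the interscorer coefficient is simply the Pearson correlation between the two sets of scores:

    import numpy as np

    # Two scorers' ratings of the same twelve essays (hypothetical)
    scorer_1 = np.array([7, 5, 8, 6, 9, 4, 7, 8, 5, 6, 9, 3])
    scorer_2 = np.array([6, 5, 8, 7, 9, 5, 6, 8, 4, 6, 8, 4])

    r_interscorer = np.corrcoef(scorer_1, scorer_2)[0, 1]
    print(round(r_interscorer, 3))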

Interpreting Reliability Coefficients

How high should a reliability coefficient be for a test or other psychometric instrument to be useful? If the test is to be used for determining whether the mean scores of two groups of people are significantly different, a reliability coefficient of .60 or .70 may be sufficient.

When the test is used, however, to compare one person's score with the scores of other people, or a person's score on one test with his or her score on another test, a reliability coefficient of at least .85 is needed before small differences in scores can be treated as significant. Therefore, the reliability benchmark is related to the USE of the test at hand.

Variability and Test Length

Reliability coefficients tend to be higher when the variance of the test scores, item scores, or ratings is large rather than small. Test length matters as well: other things being equal, a longer test is more reliable, because it provides a larger sample of the behaviour being measured; the general Spearman-Brown formula estimates the effect of lengthening or shortening a test, as sketched below.
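
The general Spearman-Brown formula gives the projected reliability when a test is lengthened n-fold with comparable items: r_nn = n·r / (1 + (n − 1)·r). A quick sketch (the starting reliability of .70 is a hypothetical figure):

    def spearman_brown(r, n):
        # Projected reliability when a test is lengthened n-fold
        return n * r / (1 + (n - 1) * r)

    # Doubling a test whose reliability is .70
    print(round(spearman_brown(0.70, 2), 3))    # 0.824
    # Halving it instead (n = 0.5)
    print(round(spearman_brown(0.70, 0.5), 3))  # 0.538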

Reliability of Criterion-Referenced Tests

The traditional concept of reliability refers to norm-referenced tests, which are designed primarily to differentiate among individuals who possess various amounts of a specific characteristic. The greater the range of individual scores on a norm-referenced test, the higher the reliability of the test.

The goal in constructing criterion-referenced tests is to classify people into one of two groups: those who have reached mastery of a skill and those who fall below the benchmark. Here traditional internal consistency procedures are inappropriate; instead we use the coefficient of agreement and Cohen's kappa coefficient, as sketched below.
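
A sketch of the two indices for mastery/non-mastery classifications (the classifications below, from two administrations or two forms, are hypothetical): the coefficient of agreement is the proportion of consistent classifications, and Cohen's kappa corrects that proportion for chance agreement:

    import numpy as np

    def kappa(a, b):
        # Cohen's kappa for two sets of 0/1 mastery classifications
        a, b = np.asarray(a), np.asarray(b)
        p_observed = (a == b).mean()                  # coefficient of agreement
        p_both_master = a.mean() * b.mean()           # chance: both say mastery
        p_both_non = (1 - a.mean()) * (1 - b.mean())  # chance: both say non-mastery
        p_chance = p_both_master + p_both_non
        return (p_observed - p_chance) / (1 - p_chance)

    # 1 = mastery, 0 = non-mastery, for ten examinees (hypothetical)
    first = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
    second = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
    print(round(kappa(first, second), 3))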


Generalisability Theory

Psychometricians emphasise that a test has many reliabilities, and these depend on the various sources of measurement error that are taken into account in computing a reliability coefficient. The generalisability coefficient may then be computed as the ratio of the expected variance of scores in the universe to the variance of scores in the sample.
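
As a rough sketch, assuming a simple person-by-occasion design with hypothetical variance components: the person component estimates the universe-score variance, and the generalisability coefficient is that component divided by itself plus the error variance averaged over the number of observations.

    # Hypothetical variance components from a person x occasion G-study
    var_persons = 36.0     # universe-score variance (persons)
    var_error = 12.0       # residual/error variance per observation
    n_occasions = 3        # observations per person in the intended design

    # Generalisability coefficient: universe-score variance over
    # universe-score variance plus averaged error variance
    g_coefficient = var_persons / (var_persons + var_error / n_occasions)
    print(round(g_coefficient, 3))   # 0.9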

By emphasising the conditions under which a test is administered and the purposes for which it is designed, generalisability theory shifted the focus of test users beyond a preoccupation with whether the test itself is good or poor in general to the question "good or poor for what purpose?"