Scales, Reliability, and Validity
Scales
Definition of a Scale
A composite measure of a variable, formed by combining multiple items into a single index. This lets researchers measure complex variables that cannot be captured by a single item.
Key Characteristics of Rating Scales
More points mean more differentiation: Increasing the number of points on a scale (e.g., from 5 to 7, 9, 11, 13) allows for finer distinctions in measurement.
Higher number means higher degree: The numerical values on a scale should consistently reflect a higher degree or intensity of the measured concept. Reversing this order is incorrect.
Use of Odd Numbers: It is generally recommended to use odd numbers of points (e.g., 5, 7) in scales to allow for a neutral or middle option, providing a clear reference point (e.g., neither agree nor disagree).
Types of Rating Scales in Social Science
Less Common Scales (in Mass Communication)
Guttman scale and Thurstone scale: These are more prevalent in fields like economics and sociology, and rarely used in mass communication research.
Most Popular Scales (in Mass Communication) - You Must Know These
Likert Scale:
Named after Rensis Likert, who invented it.
Purpose: Primarily designed to measure people's agreement or disagreement with a statement.
Range: Typically ranges from "Strongly Disagree" to "Strongly Agree."
Example:
Strongly Disagree
Disagree
Neutral
Agree
Strongly Agree
Measures one specific thing: Agreement.
Semantic Differential Scale:
Purpose: Designed to measure perceptions using bipolar adjectives (opposite terms).
Structure: Uses pairs of opposite words as anchors for endpoints.
Examples of Bipolar Adjectives: Biased / Unbiased, Unfair / Fair, Good / Bad, Kind / Unkind, Positive / Negative.
Measures two opposing attributes: Unlike the Likert scale, which measures a single concept (agreement), the semantic differential scale positions a perception between two opposing attributes at once (e.g., positive vs. negative).
Reliability
Definition: The extent to which a measure consistently produces the same result again and again under the same conditions.
Key Concept: All about consistency, not accuracy or appropriateness.
Example: A student who is always 10 minutes late is reliable in their lateness because the behavior is consistent.
Real-world Example: A weighing scale is reliable if it produces consistent weight measurements when a person steps on and off multiple times within a short period. If the numbers fluctuate significantly, the scale lacks reliability.
Reliability Coefficient:
A numerical value that quantifies reliability.
Range: Between 0 and 1. 0 indicates no reliability, and 1 indicates perfect reliability.
Interpretation: Closer to 1 is better (higher reliability). Closer to 0 is not good.
Cutoff: Scholars generally recommend a reliability coefficient of 0.7 and above as an acceptable level of reliability.
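As an illustration of how such a coefficient is computed in practice, one widely used option (not named in the notes above, but a standard choice for multi-item scales) is Cronbach's alpha. A minimal sketch in plain Python, using made-up Likert responses:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha for a multi-item scale.

    items: list of item columns, each holding that item's
    score for every respondent.
    """
    k = len(items)                                    # number of items
    item_vars = sum(pvariance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent total
    total_var = pvariance(totals)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Three 5-point Likert items answered by five respondents (illustrative data)
items = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 1, 4],
]
print(round(cronbach_alpha(items), 2))  # → 0.92, above the 0.7 cutoff
```

Here the coefficient comes out above 0.7, so by the rule of thumb above the three items form an acceptably reliable scale.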
Types of Reliability
Test-Retest Reliability:
Procedure: Involves giving the same measure twice to the same individuals and then checking the consistency between their scores.
Example: Taking Exam 1 today and then taking the exact same Exam 1 a week later.
Potential Issue/Limitation: The practice effect (or learning effect). Individuals are likely to score better on the second test because they learned from the first attempt; that improvement counts as inconsistency (error), so the reliability coefficient typically will not reach a perfect 1.
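Test-retest reliability is commonly quantified as the correlation between the two administrations. A minimal sketch with invented exam scores, where the second sitting drifts upward (the practice effect described above):

```python
def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Same five students take the same exam twice, one week apart (made-up data)
test1 = [70, 85, 90, 60, 75]
test2 = [78, 86, 95, 70, 74]  # second sitting tends higher: practice effect
print(round(pearson_r(test1, test2), 2))  # → 0.93, high but not a perfect 1
```

The correlation is high but below 1, matching the point above: the practice effect keeps test-retest reliability from being perfect.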
Split-Half Reliability:
Procedure: Dividing the entire set of items (e.g., survey questions) into two halves (e.g., odd-numbered questions vs. even-numbered questions) and checking the consistency between the scores of these two halves.
Example: Dividing a 100-question exam into 50 odd-numbered questions and 50 even-numbered questions, and comparing performance on both sets.
Potential Issue/Limitation: It is challenging, almost impossible, to ensure that both halves have equal difficulty or measure the exact same aspect with the same intensity. This inequality can lead to inconsistent scores and thus lower reliability.
Cross-Test Reliability:
Procedure: Using two different instruments or versions (with different items) to measure the same concept. Consistency is then checked between the scores of these different versions.
Example: Creating Version A and Version B of Exam 1 to measure the same course knowledge, and then comparing scores between the two versions.
Potential Issue/Limitation: Similar to split-half reliability, it is difficult to create two different versions that are identically difficult or equivalent in their measurement. Differences in difficulty or item content will lead to inconsistencies and prevent perfect reliability.
Inter-Rater Reliability (or Inter-coder Reliability):
Procedure: Measures the consistency in ratings or observations among different individuals (raters) who are assessing the same phenomenon.
Example: Asking two different raters (e.g., Brandon and Eric) to rate the humor in a television show. Each rater assigns numerical values for humor they perceive.
Potential Issue/Limitation: Different raters often have subjective biases, different interpretations, or varying criteria (e.g., what one person finds humorous, another might not). This leads to inconsistent ratings and lower inter-rater reliability.
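One common statistic for two raters (not named in the notes above, but standard in content analysis) is Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch using the Brandon-and-Eric humor example with invented codes:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: two-rater agreement, corrected for chance."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    categories = set(ratings_a) | set(ratings_b)
    # Chance agreement: product of each rater's marginal proportions
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n)
        for c in categories
    )
    return (observed - expected) / (1 - expected)

# Brandon and Eric code ten scenes as humorous ("H") or not ("N") (made up)
brandon = ["H", "H", "N", "H", "N", "N", "H", "H", "N", "H"]
eric    = ["H", "N", "N", "H", "N", "H", "H", "H", "N", "N"]
print(round(cohens_kappa(brandon, eric), 2))  # → 0.4
```

Here the raters agree on 7 of 10 scenes, but after discounting chance agreement kappa is only 0.4, illustrating how subjective differences in what counts as "humorous" pull inter-rater reliability down.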
General Conclusion on Reliability: All types of reliability have inherent limitations and are not perfect. There is always room for error, meaning a reliability coefficient of 1 is rarely achieved in practice.
Validity
Relationship between Reliability and Validity
Reliability does NOT guarantee validity: A measure can consistently produce the same results (be reliable) but still not measure what it's supposed to measure (lack validity).
Validity GUARANTEES reliability: If a measure is valid (accurately measures what it claims), it must inherently be reliable (because consistency is a prerequisite for accuracy).
Order of Checking: Always check reliability first. If a measure is not reliable, there's no point in discussing its validity.
Analogy: A broken clock stuck at 10:00 is reliable (perfectly consistent) but not valid, since it shows the correct time only twice a day. A clock that always shows the correct time is both valid and reliable.
Definition of Measurement Validity: The degree to which a measure actually measures what it claims to measure (i.e., its accuracy).
Key Concept: All about accuracy.
Difference from Internal/External Validity: This topic specifically refers to measurement validity (how well an instrument measures a concept), which is distinct from internal validity (accuracy of cause-and-effect relationships within a study) and external validity (generalizability of findings to other populations/settings).
Complexity: Unlike reliability, measurement validity is not typically quantified by a single numerical coefficient. It's an ongoing, complex process of building a logical argument and providing evidence and reasoning to support that a measure is accurate.
Types of Measurement Validity (Four Most Important)
Face Validity:
Procedure: An intuitive judgment based on whether an instrument appears or looks like it measures what it's supposed to on the surface.
Example of Lacking Face Validity: Using a ruler to measure a person's weight clearly lacks face validity because a ruler is for length, not weight. Or, asking a simple yes/no question like "Did you learn something?" to assess complex exam achievement.
Nature: It's a starting point, relying on common sense and expert judgment, but usually requires additional, more direct evidence.
Predictive Validity:
Procedure: Assesses whether a measure can accurately predict future outcomes that it theoretically should predict.
Example: If a measure of behavioral intention (e.g., intention to vote) is valid, it should accurately predict actual behavior (e.g., actual voting turnout). Similarly, a valid exam should predict a student's future ability to apply course knowledge.
Concurrent Validity:
Procedure: Checks how well a new measure correlates with other pre-existing, established, and validated measurements of the same concept that are administered at roughly the same time.
Example: If you develop a new intelligence test, it should show a strong correlation with scores from an older, widely accepted intelligence test among the same group of people. Individuals who score high on the old test should also score high on the new test.
Construct Validity:
Procedure: The most challenging form of validity. It ensures that an instrument accurately measures the specific theoretical construct (concept) it intends to measure, and only that construct, distinguishing it from other related but distinct constructs.
Example: A measure designed to assess "happiness" should primarily capture feelings of happiness and not inadvertently measure other emotions like anger, sadness, or fear, even though these are all emotions. It must be precise in what it measures and exclude what it does not intend to measure.