
Reliability and Validity of Measurement

Learning Objectives

Define reliability, including the different types and how they are assessed.
Define validity, including the different types and how they are assessed.
Describe the kinds of evidence that would be relevant to assessing the reliability and validity of a particular measure.

Introduction to Measurement

Measurement in psychology involves assigning scores to individuals so that the scores accurately represent some characteristic of those individuals (e.g., intelligence, self-esteem, depression). To check that the scores genuinely reflect the intended characteristic, researchers conduct studies to validate their measurement tools. If the results indicate that a measure does not work, psychologists stop using it.

Informal Example of Measurement

Consider an individual dieting for a month.
- If clothes fit more loosely and friends ask about weight loss, a scale showing a 10-pound loss appears to be working properly.
- Conversely, if the scale shows a 10-pound gain, the dieter would reasonably conclude that the scale is faulty.

Dimensions of Measurement

Researchers evaluate measurement methods primarily through two lenses: reliability and validity.

Reliability

Reliability refers to the consistency of a measure. Researchers consider three main types:

Test-Retest Reliability
- Definition: The consistency of scores over time when measuring a construct assumed to remain stable.
- Example: Intelligence is generally assumed to be stable over time, so the same person should receive similar scores a week apart.
- Assessment involves administering the measure twice to the same group and calculating the correlation (Pearson's r).
- A test-retest correlation of +.80 or greater is taken as indicative of good reliability.
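
The test-retest procedure above can be sketched in a few lines. This is a minimal illustration, not a prescribed analysis: the function name `pearson_r` and the two score lists are hypothetical, invented only to show the calculation.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = mean(x), mean(y)
    # Sum of cross-products and sums of squares of deviations from the means
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ssx * ssy)

# Hypothetical scores for the same five people, tested one week apart
time1 = [100, 110, 95, 120, 105]
time2 = [102, 108, 97, 118, 107]

r = pearson_r(time1, time2)
print(f"test-retest r = {r:.2f}")
print("meets the +.80 guideline" if r >= 0.80 else "below the +.80 guideline")
```

With these made-up scores the correlation is well above +.80, which is what a stable construct measured reliably should produce.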

Internal Consistency
- Definition: The consistency of responses across items in a multiple-item measure.
- Example: In the Rosenberg Self-Esteem Scale, agreement on self-worth items should correlate with agreement on items about good qualities.
- Internal consistency can be assessed using split-half correlation (comparing two sets of item scores).
- A split-half correlation of +.80 or greater is considered indicative of good internal consistency.
- Cronbach's α (alpha) is the most common statistic for internal consistency. Conceptually, it is the mean of all possible split-half correlations for a set of items.
- A Cronbach's α of +.80 or greater is generally taken to indicate good internal consistency.
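
A sketch of both internal-consistency statistics follows. It assumes the usual variance-based formula for α, k/(k-1) × (1 - sum of item variances / variance of total scores), rather than literally averaging every split-half correlation. The odd/even split is just one of many possible splits, and the response data and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (all the same length)."""
    k = len(rows[0])
    items = list(zip(*rows))                         # one tuple of scores per item
    item_var = sum(variance(col) for col in items)   # sum of item variances
    total_var = variance([sum(r) for r in rows])     # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

def split_half_r(rows):
    """Correlation between odd-item totals and even-item totals."""
    odd = [sum(r[0::2]) for r in rows]
    even = [sum(r[1::2]) for r in rows]
    mo, me = mean(odd), mean(even)
    cov = sum((a - mo) * (b - me) for a, b in zip(odd, even))
    return cov / sqrt(sum((a - mo) ** 2 for a in odd) *
                      sum((b - me) ** 2 for b in even))

# Hypothetical responses: 5 respondents x 4 items on a 1-4 agreement scale
responses = [
    [4, 4, 3, 4],
    [2, 2, 2, 1],
    [3, 3, 4, 3],
    [1, 2, 1, 1],
    [4, 3, 4, 4],
]

alpha = cronbach_alpha(responses)
half_r = split_half_r(responses)
print(f"alpha = {alpha:.2f}, split-half r = {half_r:.2f}")
```

Real scales need far more than five respondents. The tiny data set is only there to make the arithmetic easy to follow.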

Interrater Reliability
- Definition: The consistency of judgments across different observers.
- Example: Evaluating social skills via video ratings should yield highly correlated scores from different raters.
- Often assessed using Cronbach's α (for quantitative data) or Cohen's κ (for categorical data).
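
For categorical judgments, Cohen's κ corrects the raw agreement rate for the agreement two raters would reach by chance: κ = (observed - expected) / (1 - expected). The sketch below assumes two raters coding the same ten video clips. The category labels and ratings are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one category per case."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must judge the same non-empty set of cases")
    n = len(rater_a)
    # Proportion of cases on which the raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codes for ten clips: "agg" = aggressive act, "not" = none
rater_1 = ["agg", "agg", "not", "not", "agg", "not", "not", "agg", "not", "not"]
rater_2 = ["agg", "not", "not", "not", "agg", "not", "agg", "agg", "not", "not"]

kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 8 of 10 clips, but because chance agreement is already .52, κ is noticeably lower than the raw 80% agreement.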

Validity

Validity refers to the extent to which scores from a measure represent the construct they are intended to measure. Reliability is necessary for validity but not sufficient: a measure can be reliable (yielding consistent scores) yet lack validity if those scores do not reflect the intended construct.

Types of Validity

Face Validity
- Definition: The extent to which a measurement method appears, at face value, to measure the intended construct.
- Example: Self-esteem questionnaires typically ask about self-worth, aligning with common perceptions of self-esteem measurement.
- Although face validity can be quantitatively assessed, it is often considered weak evidence for true validity.

Content Validity
- Definition: The degree to which a measure encompasses the full range of the construct of interest.
- Example: A measure of test anxiety should cover both physiological symptoms and cognitive aspects if test anxiety is defined in those terms.
- Content validity is usually assessed through expert evaluation against the conceptual definition.

Criterion Validity
- Definition: How well scores on a measure correlate with other measures or variables expected to be related (criteria).
- Example: A new test anxiety measure should be negatively correlated with exam performance.
- When the criterion is measured at the same time as the construct, this is referred to as concurrent validity. When the criterion is measured at some later point, it is termed predictive validity.
- Convergent Validity: Scores should correlate positively with existing measures of the same construct (e.g., a new test anxiety measure should correlate with established test anxiety measures).
- Discriminant Validity: Scores should not correlate highly with measures of conceptually distinct variables, which shows that the measure is specific to its construct (e.g., a self-esteem measure should not correlate highly with mood).

Key Takeaways

Psychologists validate their measures through research, discontinuing those that do not work.
Reliability encompasses test-retest, internal consistency, and interrater reliability.
Validity encompasses face, content, and criterion validity, with an emphasis on the relationship of scores to other expected measures.
The assessment of reliability and validity is an ongoing process; neither is established by a single study.

Detailed Examination of Reliability

Fundamental Concept

Reliability is crucial in psychological measurement: a reliable instrument yields scores that consistently reflect the true value of the variable being measured.

Perfect reliability would mean that observed scores reflect only the true score, free from extraneous influences. This is rarely achieved in practice.
Reliability Definition: The proportion of variance in observed scores that is attributable to the true score of the assessed variable.

Methods for Computing Reliability

Various methods exist for estimating reliability. All of them estimate the ratio of true score variance to total observed variance, where the total is made up of true score variance plus error variance.
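
A small simulation can make the variance ratio concrete. This sketch assumes the simplest model, observed score = true score + independent random error, with normally distributed components. The standard deviations, sample size, seed, and function name are arbitrary illustrative choices.

```python
import random
from statistics import variance

def simulate_reliability(n=20000, true_sd=1.0, error_sd=0.5, seed=42):
    """Estimate reliability as var(true) / var(observed) from simulated scores."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0, true_sd) for _ in range(n)]
    # Each observed score is the true score plus independent measurement error
    observed = [t + rng.gauss(0, error_sd) for t in true_scores]
    return variance(true_scores) / variance(observed)

# Reliability implied by the chosen variances: 1.0 / (1.0 + 0.25) = 0.8
expected = 1.0 ** 2 / (1.0 ** 2 + 0.5 ** 2)
estimate = simulate_reliability()
print(f"expected {expected:.2f}, simulated {estimate:.2f}")
```

In real data the true scores are never observed directly, which is why the estimation methods described in this section are needed.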

Analysis of Variance (ANOVA) Method

Variance in scores can be partitioned into sources of substantive interest (signal) and error (noise), allowing for reliability assessment.
Repeated temperature readings taken with a potentially faulty thermometer illustrate the method: variation between the objects measured is signal, and variation among repeated readings of the same object is noise.
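
One concrete way to turn this partition into a reliability estimate is the one-way intraclass correlation, (MSB - MSW) / (MSB + (k - 1) × MSW), where MSB and MSW are the between-object and within-object mean squares and k is the number of readings per object. The text does not specify which ANOVA-based coefficient it has in mind, so treat this as one reasonable choice. The thermometer readings below are hypothetical.

```python
from statistics import mean

def icc_oneway(groups):
    """One-way ANOVA intraclass correlation; every group has k readings."""
    n = len(groups)                 # number of objects measured
    k = len(groups[0])              # readings per object
    if n < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need at least 2 objects, each with the same k >= 2 readings")
    grand = mean([x for g in groups for x in g])
    # Between-object mean square (signal plus noise)
    ms_between = k * sum((mean(g) - grand) ** 2 for g in groups) / (n - 1)
    # Within-object mean square (noise only)
    ms_within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical: four objects, each measured three times with one thermometer
readings = [
    [20.1, 19.9, 20.0],
    [25.2, 24.8, 25.0],
    [30.1, 30.0, 29.9],
    [35.0, 35.3, 34.7],
]

icc = icc_oneway(readings)
print(f"ICC = {icc:.3f}")
```

Because the repeated readings of each object differ by only a few tenths of a degree while the objects differ by several degrees, almost all of the variance is signal and the coefficient is close to 1.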

Internal Consistency Reliability

Definition: The degree to which the items within a scale are homogeneous.
- A reliable scale's items should correlate strongly with one another, because they reflect the same latent variable.
Typically assessed using Cronbach's α (alpha), which estimates the proportion of variance in scale scores attributable to the true score.

Alpha Calculation

Coefficient alpha normally ranges from 0.0 to 1.0. A negative value signals a problem, usually negative correlations among items (for example, a reverse-worded item that was not reverse-scored).
Suggested minimum for adequate reliability is 0.70, particularly for stable instruments.
Alpha's stability may be influenced by sample size and item quantity during developmental phases.

Effects of Scale Length

Scale reliability correlates positively with the number of items. Shorter scales trade reliability for brevity, so any potential loss in meaning must be weighed during scale design.

Techniques for Optimizing Scale Length

The relation between item number and internal consistency implies that longer scales yield more reliable results, although minimizing burden on participants is crucial.
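
The trade-off between length and reliability can be quantified with the Spearman-Brown formula, which the text does not name but which is the standard tool for this question: new reliability = n × r / (1 + (n - 1) × r), where r is the current reliability and n is the factor by which the scale is lengthened or shortened. It assumes the added or removed items are comparable in quality to the existing ones. The starting reliability of .70 is a hypothetical example.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing scale length by length_factor.

    length_factor = 2 doubles the number of items; 0.5 halves it.
    Assumes new items are comparable (parallel) to the existing ones.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

current = 0.70  # hypothetical reliability of an existing scale
doubled = spearman_brown(current, 2)
halved = spearman_brown(current, 0.5)
print(f"doubled: {doubled:.2f}, halved: {halved:.2f}")
```

Under these assumptions, doubling the scale raises the predicted reliability from .70 to about .82, while halving it drops the prediction to about .54. That shows why shortening a scale to reduce participant burden has a real cost.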

Sampling Techniques

Probability Sampling gives every member of the population a known, nonzero chance of selection (an equal chance in the case of simple random sampling). Common types include simple random sampling, stratified sampling, and cluster sampling.
Nonprobability Sampling, often more feasible in psychology, runs the risk of bias, necessitating methods for enhancing sample representativeness.
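
Two of the probability sampling designs mentioned above can be sketched with the standard library. The population of 100 students, the `year` grouping variable, the 10% sampling fraction, and the function names are all hypothetical. Cluster sampling is omitted to keep the example short.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Draw n members without replacement; each member is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum (proportionate allocation)."""
    rng = random.Random(seed)
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)   # sample size for this stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 first-year students and 40 seniors
population = [{"id": i, "year": "first" if i < 60 else "senior"}
              for i in range(100)]

srs = simple_random_sample(population, 10)
strat = stratified_sample(population, lambda p: p["year"], 0.10)
n_first = sum(p["year"] == "first" for p in strat)
print(f"stratified sample: {n_first} first-year, {len(strat) - n_first} senior")
```

The stratified sample always contains 6 first-year students and 4 seniors, matching the population proportions. A simple random sample of 10 matches them only on average.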

Conclusion

Rigor in measuring psychological constructs is essential; researchers rely heavily on established statistical metrics for reliability and validity to build sound measurement tools.
Continuous assessment and adjustments enhance both reliability and validity to ensure meaningful, accurate psychological assessments.