Reliability and External Validity

Overview and Phases of Construct Investigation

Conceptual Framework: Scale development and construct validation are organized into three distinct phases, as summarized by Flake, Pek, and Hehman ($2017$, $2020$).
Phase 1: Substantive Phase: * Construct conceptualization and literature review. * Generating items intended to measure the construct.
Phase 2: Structural Phase: * Item analysis. * Determining dimensionality (e.g., using factor analysis). * Assessing reliability.
Phase 3: External Phase: * Establishing convergent and discriminant validity.

Concepts of Reliability and Validity

General Definitions: * Reliability: Whether the measure captures the identified construct consistently, that is, whether the items reliably reflect the same underlying entity. * External Validity: Whether scores on the items behave as expected for what they are meant to represent, judged by comparing them with measures of other known constructs.
Reliability vs. Validity Visualizations: * Valid but Not Reliable: Data points are centered around the goal but widely dispersed. * Reliable but Not Valid: Data points are tightly clustered together but far away from the intended target/goal. * Reliable and Valid: Data points are tightly clustered exactly at the target/goal.

Types of Reliability

Reliability Across Time (Test-Retest Reliability): * Definition: The consistency of a measurement when used under the same conditions with the same participants at different time points. * Procedure: Administer the scale at two separate times to each participant and compute the correlation ( $r$ ) between the two scores (Time $1$ and Time $2$ ). * Considerations: One must be mindful of the type of construct (e.g., mood, which is fleeting, vs. religious belief, which is stable) and the lag time between measurements. * Requirement: Assumes the construct itself is stable across the chosen timeframe.
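
Illustration (R): A minimal sketch of the procedure above, assuming each participant has one total score per administration. The scores and the names time1, time2, and retest_r are hypothetical and do not come from the source.

```r
# Hypothetical total scores for the same six participants at two time points
time1 <- c(12, 18, 9, 15, 20, 11)
time2 <- c(13, 17, 10, 14, 19, 13)

# Test-retest reliability: Pearson correlation between the two administrations
retest_r <- cor(time1, time2, method = "pearson")
print(round(retest_r, 2))
```

A high value here would only be meaningful if the construct is expected to stay stable over the chosen lag.
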
Internal Consistency: * Definition: Focuses on item homogeneity, that is, the extent to which different subsets of items capture the same thing. * Requirement: Needs only a single administration of the scale. * Standard Practice: This is the form of reliability that most research papers report.

Internal Consistency: Split-Half and Cronbach's Alpha

Split-Half Reliability: * Process: Split the scale's items into two halves (e.g., two sets of $3$ items for a $6$-item scale). Calculate an average score for each half, then correlate the two half-scale scores. * Limitation: The estimate depends on how the items are split. Different splits of the same responses yield different correlation coefficients ($r$).
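
Illustration (R): The split dependence can be seen in a small simulation. This is a sketch on simulated data, not the source's data: the six items are generated as a shared score plus noise, and split_half_r is a hypothetical helper written for this example.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
n <- 100
true_score <- rnorm(n)

# Six hypothetical items: shared true score plus item-specific noise
items <- sapply(1:6, function(i) true_score + rnorm(n, sd = 1))
colnames(items) <- paste0("item", 1:6)

# Correlate the mean of one half of the items with the mean of the other half
split_half_r <- function(items, half_a, half_b) {
  cor(rowMeans(items[, half_a]), rowMeans(items[, half_b]))
}

r_first_last <- split_half_r(items, 1:3, 4:6)
r_odd_even   <- split_half_r(items, c(1, 3, 5), c(2, 4, 6))
print(c(first_vs_last = r_first_last, odd_vs_even = r_odd_even))

# All 10 distinct ways to split 6 items into two halves of 3
# (keeping item 1 in the first half avoids counting each split twice)
first_halves <- combn(2:6, 2, FUN = function(rest) c(1, rest), simplify = FALSE)
all_r <- sapply(first_halves, function(a) split_half_r(items, a, setdiff(1:6, a)))
print(range(all_r))
```

The spread between the smallest and largest value in all_r is the arbitrariness that a single split-half coefficient hides.
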
Cronbach's Alpha ($\alpha$): * Refinement: Cronbach's $\alpha$ can be viewed as the average of all possible split-half reliability estimates, which removes the dependence on any single split. * Interpretation: Read on a scale similar to Pearson's $r$. Values typically range from $0$ (no internal consistency) to $1$ (perfect internal consistency). Negative values are possible and usually signal a problem (e.g., items that should have been reverse-coded were not). * Rule-of-Thumb: Acceptable reliability is often taken to be $\alpha \geq 0.7$, though the appropriate threshold depends on the construct and the stage of research.
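
Illustration (R): Alpha is commonly computed from variances as $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_X^2}\right)$, where $k$ is the number of items, $s_i^2$ is the variance of item $i$, and $s_X^2$ is the variance of the summed score. The sketch below implements that formula directly in base R. The function name cronbach_alpha and the responses are hypothetical, and the notation is introduced here rather than taken from the source.

```r
# Cronbach's alpha from item variances and the variance of the summed score
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                      # number of items
  item_vars <- apply(items, 2, var)     # variance of each item
  total_var <- var(rowSums(items))      # variance of the summed scale score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Hypothetical responses of five participants to three items (1-5 scale)
responses <- data.frame(
  q1 = c(1, 2, 3, 4, 5),
  q2 = c(2, 2, 3, 5, 5),
  q3 = c(1, 3, 3, 4, 4)
)

alpha_est <- cronbach_alpha(responses)
print(round(alpha_est, 2))
```

In practice a package function such as psych::alpha() would normally be used; the hand-written version is only meant to show what the coefficient is built from.
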

Limitations and Misuse of Cronbach's Alpha

Prevalence: Alpha is reported by $87\%$ of papers discussing internal consistency. The original paper (Cronbach, $1951$) has approximately $65{,}000$ citations.
Assumption of Tau-Equivalence: * Alpha assumes a factor model in which all items load equally on the factor (tau-equivalence) and each item indicates only one factor. * In reality, items have strong primary loadings that are rarely equal, and they often have weak loadings on other factors that are simply ignored.
Sensitivity to Scale Length (Cortina, $1993$): * A scale with $3$ items and an average inter-item correlation of $0.57$ yields $\alpha = 0.80$. * A scale with $10$ items and an average inter-item correlation of $0.28$ also yields $\alpha = 0.80$. * Conclusion: Alpha can be inflated simply by adding more items, even when the items are only weakly intercorrelated.
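
Illustration (R): The two figures above can be reproduced with the standardized form of alpha, $\alpha = \frac{k\bar{r}}{1 + (k-1)\bar{r}}$, where $k$ is the number of items and $\bar{r}$ is the average inter-item correlation. This sketch assumes that form applies here; the function name standardized_alpha is hypothetical.

```r
# Standardized alpha from the number of items (k) and the mean inter-item correlation
standardized_alpha <- function(k, mean_r) {
  (k * mean_r) / (1 + (k - 1) * mean_r)
}

print(round(standardized_alpha(3, 0.57), 2))   # about 0.80
print(round(standardized_alpha(10, 0.28), 2))  # about 0.80

# Holding the mean inter-item correlation fixed at 0.28, alpha rises with length
scale_lengths <- c(3, 10, 20, 40)
alpha_by_length <- sapply(scale_lengths, standardized_alpha, mean_r = 0.28)
print(round(alpha_by_length, 2))
```

The last call shows the inflation directly: item quality is unchanged across the four scale lengths, yet alpha keeps increasing.
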
Unidimensionality Misconception: Alpha is designed for unidimensional (single-factor) scales, but it cannot be used to show that a scale is unidimensional. If a toaster heats falafel, it does not mean the falafel is bread; similarly, a high alpha does not prove that the items form only one factor.

McDonald's Omega

General Overview: McDonald ( $1978$ , $1999$ ) proposed Omega ( $\omega$ ) as an alternative to Alpha that does not assume tau-equivalence or unidimensionality.
Structure: Omega uses the factor structure obtained from factor analysis and assumes a general factor (an overarching, higher-order factor).
Types of Omega: * Omega Hierarchical ( $\omega_h$ ): Appropriate for unidimensional scales where items share variance with a general factor. It is typically smaller than Omega Total for multidimensional scales. * Omega Total ( $\omega_t$ ): Appropriate for multidimensional scales. It accounts for variance shared with both specific extracted factors and the general factor.
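
Illustration (R): To see why dropping the tau-equivalence assumption matters, the sketch below uses the simplest case only: a single-factor model with standardized items, where omega is $\omega = \frac{(\sum_i \lambda_i)^2}{(\sum_i \lambda_i)^2 + \sum_i (1 - \lambda_i^2)}$ and $\lambda_i$ is the loading of item $i$. This is not the general-factor model behind $\omega_h$ and $\omega_t$, which is normally fitted by dedicated software. The loadings and function names are hypothetical.

```r
# Omega for a single-factor model with standardized items:
# (sum of loadings)^2 / ((sum of loadings)^2 + sum of unique variances)
omega_one_factor <- function(loadings) {
  common <- sum(loadings)^2
  unique_var <- sum(1 - loadings^2)
  common / (common + unique_var)
}

# Standardized alpha implied by the same loadings: the model-implied
# correlation between items i and j is loading_i * loading_j
alpha_from_loadings <- function(loadings) {
  k <- length(loadings)
  implied_r <- outer(loadings, loadings)
  mean_r <- mean(implied_r[upper.tri(implied_r)])
  (k * mean_r) / (1 + (k - 1) * mean_r)
}

equal_loadings   <- c(0.6, 0.6, 0.6, 0.6)  # tau-equivalent (hypothetical values)
unequal_loadings <- c(0.9, 0.7, 0.5, 0.3)  # not tau-equivalent (hypothetical values)

print(c(alpha = alpha_from_loadings(equal_loadings),
        omega = omega_one_factor(equal_loadings)))
print(c(alpha = alpha_from_loadings(unequal_loadings),
        omega = omega_one_factor(unequal_loadings)))
```

With equal loadings the two coefficients coincide; with unequal loadings alpha falls below omega, which is the sense in which alpha depends on tau-equivalence.
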

Implementation of Omega in R

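Code Structure: Use item_omega() or a similar wrapper around psych::omega() (item_omega() is not part of psych itself; in the easystats ecosystem it is provided by the performance package). * Call: raq_omg <- item_omega(data_object, n = 4, factor_method = "minres", poly_cor = TRUE) * n: Number of factors identified in the initial factor analysis. * factor_method: Estimation method (e.g., "minres" for minimum residual estimation). * poly_cor: Logical value indicating whether polychoric correlations are used (typically TRUE for ordinal items such as Likert responses).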
Testing Assumptions: The output includes a table of factor loadings. The g column shows each item's loading on the general factor, while the subsequent columns show loadings on the specific factors. These should be similar, but not identical, to the results of the initial factor analysis, because the model now also includes the general factor.

Reverse-Coding and Composite Scores

Reverse-Coding Items: * Alpha and Omega assume all items are coded in the same direction (e.g., high scores always mean high levels of the construct). * Example (Fear of Statistics): "Statistics make me cry" (positively phrased) vs. "Standard deviations excite me" (reverse-phrased). The latter must be reverse-coded so that disagreement receives a high score: on a $5$-point scale, Strongly Disagree is recoded to $5$, which then aligns with high fear.
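
Illustration (R): A minimal sketch of reverse-coding, assuming a $1$ to $5$ response scale on which $1$ means Strongly Disagree. The helper reverse_code and the responses are hypothetical.

```r
# Reverse-code by subtracting each response from (scale minimum + scale maximum)
reverse_code <- function(x, scale_min = 1, scale_max = 5) {
  (scale_min + scale_max) - x
}

# Hypothetical raw responses to "Standard deviations excite me"
excite_raw <- c(1, 2, 3, 4, 5)
excite_rev <- reverse_code(excite_raw)
print(excite_rev)  # 5 4 3 2 1
```

After recoding, a participant who strongly disagreed (raw score $1$) scores $5$, in line with the positively phrased items.
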
Composite Scores: * Defined as a single score obtained by aggregating (summing or averaging) scale items. * Represents a participant's level of the target construct for use in downstream statistical testing. * Multiple Factors: If a scale captures multiple related constructs, separate composite scores must be calculated for each factor (e.g., Factor $1$: Items $1$ and $2$; Factor $2$: Items $3$ and $4$).
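
Illustration (R): A minimal sketch of per-factor composites by averaging, following the Items $1$ and $2$ versus Items $3$ and $4$ example above. The data frame, its column names, and the responses are hypothetical, and any reverse-phrased items are assumed to have been reverse-coded already.

```r
# Hypothetical responses of three participants to a four-item scale
responses <- data.frame(
  item1 = c(4, 2, 5),
  item2 = c(5, 1, 4),
  item3 = c(2, 3, 1),
  item4 = c(1, 3, 2)
)

# One composite per factor: the mean of the items belonging to that factor
responses$factor1 <- rowMeans(responses[, c("item1", "item2")])
responses$factor2 <- rowMeans(responses[, c("item3", "item4")])
print(responses)
```

Averaging keeps the composite on the original response scale; summing (rowSums) is the other common choice and differs only by a constant multiplier when no responses are missing.
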

External Validity: Convergent and Discriminant

Construct Validity Subtypes: * Convergent Validity: Tests whether constructs that theoretically should be related are related in reality (measures correlate). * Discriminant Validity: Tests whether constructs that theoretically should not be related are not related in reality (measures do not correlate).
Two-Step Process: * Step 1: Identify theoretically related and unrelated constructs. * Step 2: Correlate the measures. If measures of related constructs correlate, this supports convergent validity. If measures of unrelated constructs do not correlate, this supports discriminant validity.
Applied Case (R Anxiety Questionnaire): * Convergent: Fear of statistics was significantly negatively correlated with average statistics grades ( $r = -0.28, p = 0.031$ ). * Discriminant: Fear of statistics showed no significant correlation with non-stats module grades ( $r = 0.02, p = 0.332$ ).
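
Illustration (R): The two-step check can be sketched with cor.test() from base R. The data below are simulated for illustration and are not the R Anxiety Questionnaire data; the variable names and the built-in relationship between fear and statistics grades are assumptions of this example.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 200

# Simulated scores: stats grades are built to fall as fear rises,
# while grades on other modules are generated independently of fear
fear        <- rnorm(n)
stats_grade <- -0.5 * fear + rnorm(n)
other_grade <- rnorm(n)

# Convergent validity: a theoretically related measure should correlate
convergent <- cor.test(fear, stats_grade)

# Discriminant validity: a theoretically unrelated measure should not
discriminant <- cor.test(fear, other_grade)

print(c(r = unname(convergent$estimate), p = convergent$p.value))
print(c(r = unname(discriminant$estimate), p = discriminant$p.value))
```

Each result holds the correlation in its estimate element and the significance test in p.value, which correspond to the $r$ and $p$ values reported in the applied case.
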

Questions & Discussion

Poll Question 1: In which of the following situations is Cronbach's alpha a sensible choice for testing reliability? * Options: When testing unidimensional scales; multidimensional scales; scales with a large number of items; none of the above. * Answer: When testing unidimensional scales (though omega is often preferred).
Poll Question 2: In which of the following situations is McDonald's omega total a sensible choice for testing reliability? * Options: (A) When testing multidimensional scales that also assume the existence of a general factor; (B) When testing unidimensional scales; (C) When testing a unidimensional scale that is tau-equivalent; (D) All of the above. * Answer: (D) All of the above (Omega is more flexible than Alpha).