Hypothesis Testing with Multiple Groups
Chapter 11 Hypothesis Testing with Multiple Groups
11.1 Getting Ready
This chapter extends the examination of mean differences from comparisons of two groups to comparisons of several subgroup means.
These comparisons apply to independent variables with more than two categories, and to continuous variables collapsed into several discrete categories.
Limitation: Comparing only two groups discards information from the independent variable when using several categories is feasible.
Focus: Methods for testing hypotheses across multiple categories.
R Setup: Load the countries2 data set and attach the necessary libraries: descr, DescTools, gplots, effectsize, Hmisc.
11.2 Internet Access as an Indicator of Development
Objective: Study indicators of development worldwide through the lens of internet access.
Context: Access to the internet is increasingly seen as a crucial infrastructure component, akin to traditional forms of transportation and communication (roads, bridges).
Focus Variable: internet, from the countries2 data set, which estimates the percentage of households with regular internet access at the country level.
Initial Exploration:
Plotting a histogram to assess the distribution of internet access.
Generating descriptive statistics for understanding the variable's spread:
Minimum: 0.0
1st Quartile: 27.6
Median: 58.4
Mean: 54.2
3rd Quartile: 79.6
Maximum: 99.7
NA's: 1
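The exploration steps above can be sketched in base R. The countries2 data are not reproduced in these notes, so the block below simulates a stand-in variable; the seed, the sample size of 194, and the beta distribution are all assumptions made for illustration, and the output will not match the figures listed above.

```r
# Stand-in for countries2$internet: simulated percentages, NOT the real data
set.seed(11)
internet <- round(100 * rbeta(194, 1.2, 1), 1)
internet[5] <- NA   # mimic the single missing value reported above

# Histogram of the distribution (hist() drops the NA automatically)
hist(internet,
     xlab = "% of households with internet access",
     main = "Simulated internet access")

# Minimum, quartiles, median, mean, maximum, and the NA count
internet_summary <- summary(internet)
print(internet_summary)
```

With the real data, the same two calls would be applied to countries2$internet.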
11.2.1 Distribution Insights
Histogram Analysis:
Distribution shows a fairly even spread from low to middle, with a spike in the 70-90% access range.
The mean (54.2) falling below the median (58.4) confirms a slight left skew, though it is not pronounced.
Key Findings from Descriptive Statistics:
Significant variation in internet access noted globally:
Bottom Quartile: 0% to about 28% access.
Top Quartile: Approximately 80% to almost 100% access.
Resulting disparity highlights a clear divide between ‘haves’ and ‘have nots.’
11.3 Analyzing the Relationship between Wealth and Internet Access
Inquiry: Does wealth (measured by GDP per capita) correlate with variations in internet access?
Hypothesis: Higher GDP should yield higher internet access on average.
Methodology: Initially employ a t-test to compare mean access across two GDP groups (low vs high).
R Implementation:
Create a two-category GDP variable:
Use cut2 to split the original GDP data into two equal-sized groups and store the result in countries2$gdp2.
Use compmeans to compare mean internet access across the two groups.
Observed Means:
Low GDP Group: Mean internet access at 33.39%.
High GDP Group: Mean internet access at 76.41%.
Test Results:
Welch Two Sample t-test: t = -16.0, df = 176, p < 2e-16.
Effect size: Cohen's d = -2.40, a very large standardized difference that indicates a strong relationship.
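The two-group workflow above (median split, group means, Welch t-test, Cohen's d) can be sketched in base R. This is an illustration under stated assumptions rather than the chapter's own code: the data are simulated, cut() with quantile() stands in for cut2, tapply() stands in for compmeans, and Cohen's d is computed by hand from the pooled standard deviation. The numbers it prints will differ from those reported above.

```r
# Simulated stand-in for countries2 (NOT the real data)
set.seed(11)
n <- 180
gdp <- exp(rnorm(n, mean = 9, sd = 1.2))   # right-skewed, like GDP per capita
internet <- pmin(pmax(12 * (log(gdp) - 4.5) + rnorm(n, sd = 12), 0), 100)
dat <- data.frame(gdp, internet)

# Median split into two equal-sized groups (base R stand-in for cut2)
dat$gdp2 <- cut(dat$gdp,
                breaks = quantile(dat$gdp, c(0, 0.5, 1)),
                include.lowest = TRUE,
                labels = c("Low", "High"))

# Mean internet access in each group
group_means <- tapply(dat$internet, dat$gdp2, mean)

# Welch two-sample t-test (t.test() does not assume equal variances by default)
welch <- t.test(internet ~ gdp2, data = dat)

# Cohen's d: mean difference (Low minus High) over the pooled standard deviation
n_g <- tapply(dat$internet, dat$gdp2, length)
v_g <- tapply(dat$internet, dat$gdp2, var)
pooled_sd <- sqrt(sum((n_g - 1) * v_g) / (sum(n_g) - 2))
d <- unname((group_means["Low"] - group_means["High"]) / pooled_sd)

print(group_means)
print(welch)
print(d)
```

Because the low group is listed first, both the t-score and d come out negative when the low-GDP mean is the smaller one, which matches the sign of the results reported above.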
11.4 Analysis of Variance (ANOVA)
Purpose: Assess the significance of differences in mean values of internet access across categories of GDP.
Null Hypothesis (H0): No variation in mean internet access across GDP categories.
Hypothesis Testing: With four GDP categories, H0: μ1 = μ2 = μ3 = μ4
Key Concept: Total variation in the dependent variable is measured by the total sum of squares (SST), which decomposes into:
Sum of Squares Between (SSB): variation in means across GDP categories.
Sum of Squares Within (SSW): variation around means within categories.
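Statistical Components for ANOVA:
Formula: SST = SSB + SSW
Degrees of Freedom:
For SSW: n - k, where n is the number of observations and k is the number of categories.
For SSB: k - 1
Use the F-ratio for comparisons: F = (SSB / (k - 1)) / (SSW / (n - k))
Compare the F-ratio to the critical value of the F distribution to decide whether to reject the null hypothesis.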
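The decomposition of SST into SSB and SSW can be checked by hand on a small example. The sketch below uses hypothetical toy data (three groups of three values, invented for this illustration) so that the arithmetic is easy to follow, then cross-checks the result against R's built-in aov().

```r
# Hypothetical toy data: three groups of three observations each
y <- c(2, 4, 6,   5, 7, 9,   8, 10, 12)
g <- factor(rep(c("A", "B", "C"), each = 3))

n <- length(y)    # number of observations
k <- nlevels(g)   # number of categories

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_n     <- tapply(y, g, length)

sst <- sum((y - grand_mean)^2)                       # total variation: 78
ssb <- sum(group_n * (group_means - grand_mean)^2)   # between groups: 54
ssw <- sum((y - ave(y, g))^2)                        # within groups: 24

# F-ratio: between-group mean square over within-group mean square
f_ratio <- (ssb / (k - 1)) / (ssw / (n - k))         # (54 / 2) / (24 / 6) = 6.75
p_value <- pf(f_ratio, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Cross-check against the built-in one-way ANOVA table
fit <- summary(aov(y ~ g))[[1]]
print(fit)
```

Here the group means are 4, 7, and 10 around a grand mean of 7, so SSB = 3 × (9 + 0 + 9) = 54, and each group contributes 8 to SSW.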
11.5 Effect Size and Its Calculation
Effect Size Measure: Eta-squared (η² = SSB / SST) shows how much of the variance in the dependent variable is explained by the categories of the independent variable.
Calculation in R:
Computing η² yields a value that indicates a strong relationship:
η² = 0.79, meaning the independent variable categories explain 79% of the variance.
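Eta-squared is the between-group sum of squares divided by the total sum of squares, so it can be read straight off an ANOVA table. The sketch below does this in base R on simulated data; the group labels, means, and spread are assumptions for illustration, not the chapter's data, and the chapter's own function call is not shown in these notes.

```r
set.seed(11)
# Simulated stand-in: four groups whose true means differ (NOT the real data)
group  <- factor(rep(c("Q1", "Q2", "Q3", "Q4"), each = 45))
access <- rnorm(180, mean = rep(c(20, 45, 65, 85), each = 45), sd = 12)

# One-way ANOVA table: row 1 is the between-group row, row 2 the residual row
tab <- summary(aov(access ~ group))[[1]]
ss  <- tab[["Sum Sq"]]

# Eta-squared = SSB / (SSB + SSW) = SSB / SST
eta_sq <- ss[1] / sum(ss)
print(round(eta_sq, 2))
```

In a one-way design this value equals the R-squared from regressing the outcome on the group factor, which offers a quick consistency check.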
11.6 Population Size and Internet Access
Inquiry: Does population size influence internet access?
The analysis begins by creating a four-category population measure, countries2$pop4.
Mean internet access is fairly homogeneous across the population quartiles.
The ANOVA results show that the differences across population sizes are not statistically significant (p-value = 0.16).
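A four-category measure like pop4 can be built from the quartiles of a continuous variable and passed to a one-way ANOVA. The sketch below is a base R illustration on simulated data in which population is unrelated to access by construction; how the chapter itself builds pop4 is not shown in these notes, so the cut() and quantile() approach here is an assumption.

```r
set.seed(11)
# Simulated stand-in (NOT the real data): population unrelated to access
pop      <- exp(rnorm(180, mean = 16, sd = 1.8))
internet <- runif(180, min = 0, max = 100)

# Four equal-sized groups defined by the quartiles of population
pop4 <- cut(pop,
            breaks = quantile(pop, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE,
            labels = c("Q1", "Q2", "Q3", "Q4"))

# Mean access in each quartile, then the ANOVA p-value for the group differences
quartile_means <- tapply(internet, pop4, mean)
p_value <- summary(aov(internet ~ pop4))[[1]][["Pr(>F)"]][1]

print(quartile_means)
print(p_value)
```

A p-value above the chosen significance level, such as the 0.16 reported above, means the null hypothesis of equal means is not rejected.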
11.7 Connection Between t-Scores and F-Ratios
Reinforcement: The F-ratio and the t-score provide analogous information.
When the independent variable has only two categories, the two tests give equivalent results: the F-ratio equals the square of the t-score from an equal-variance t-test (F = t²), and the p-values match.
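The link between the two statistics can be verified numerically. The sketch below uses simulated two-group data (an assumption for illustration) and the pooled-variance t-test, requested with var.equal = TRUE; the identity is exact for that version of the test, while the Welch t-test used earlier will generally give a slightly different value.

```r
set.seed(11)
# Simulated two-group data (NOT the real data)
group <- factor(rep(c("Low", "High"), times = c(40, 50)))
y <- rnorm(90, mean = ifelse(group == "Low", 50, 58), sd = 15)

# Pooled-variance (equal-variance) t-test and one-way ANOVA on the same data
t_pooled  <- t.test(y ~ group, var.equal = TRUE)
anova_tab <- summary(aov(y ~ group))[[1]]

t_sq    <- unname(t_pooled$statistic)^2
f_ratio <- anova_tab[["F value"]][1]

# The squared t-score matches the F-ratio, and the p-values agree
print(c(t_squared = t_sq, F = f_ratio))
print(c(p_t = t_pooled$p.value, p_F = anova_tab[["Pr(>F)"]][1]))
```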
11.8 Future Directions in Statistical Analysis
Dependent variables measured at the ordinal or nominal level call for other tools, such as cross-tabulation and the chi-square test.
The next chapters extend hypothesis testing to these methods.
11.9 Assignments
Exercises involving the application and interpretation of ANOVA in practical scenarios, emphasizing critical analysis and conclusion formation on statistical significance and effects observed in datasets.