Hypothesis Testing with Multiple Groups

This chapter extends the examination of mean differences from comparisons of two groups to comparisons among several subgroup means.
These comparisons apply when the independent variable has more than two categories, or when a continuous variable is collapsed into several discrete categories.
Limitation: Comparing only two outcomes discards information from the independent variable when using several categories is feasible.
Focus: Exploring hypothesis-testing methods for multi-category independent variables.
R Setup: Load the countries2 data set and attach necessary libraries:
- descr
- DescTools
- gplots
- effectsize
- Hmisc
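The setup step can be sketched as follows. This is a minimal sketch, not the chapter's own code: the file name `countries2.rda` is a hypothetical placeholder, since the notes do not say how the data set is distributed, and the block only attaches the packages that are already installed.

```r
# Packages named in the setup list
pkgs <- c("descr", "DescTools", "gplots", "effectsize", "Hmisc")

# Check which packages are installed (requireNamespace does not attach them)
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}

# ASSUMPTION: hypothetical file name and location for the countries2 data
data_file <- "countries2.rda"
if (file.exists(data_file)) {
  load(data_file)
}
```

Any package reported as not installed can be added with `install.packages()` before rerunning the block.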

Objective: Study indicators of development worldwide through the lens of internet access.
Context: Access to the internet is increasingly seen as a crucial infrastructure component, akin to more traditional infrastructure for transportation and communication (roads, bridges).
Focus Variable: internet from the countries2 data set, which estimates the percentage of households with regular internet access at the country level.
Initial Exploration:
- Plotting a histogram to assess the distribution of internet access.
- Generating descriptive statistics for understanding the variable's spread:
  - Minimum: 0.0
  - 1st Quartile: 27.6
  - Median: 58.4
  - Mean: 54.2
  - 3rd Quartile: 79.6
  - Maximum: 99.7
  - NA's: 1
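The exploration steps can be illustrated with base R. The `countries2` data are not reproduced in these notes, so the sketch below uses a simulated stand-in vector (an assumption), and its numbers will not match the statistics listed above.

```r
set.seed(42)

# ASSUMPTION: simulated stand-in for countries2$internet (percent, 0 to 100)
internet <- round(100 * rbeta(195, 1.2, 1.0), 1)
internet[10] <- NA  # one missing value, mirroring the single NA in the summary

# Histogram of the distribution (missing values are dropped automatically)
hist(internet,
     xlab = "% of households with internet access",
     main = "Distribution of internet access (simulated)")

# Five-number summary plus the mean and the count of missing values
internet_summary <- summary(internet)
print(internet_summary)
```

With the real data, the same two calls would be applied to `countries2$internet`.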

Histogram Analysis:
- Distribution shows a fairly even spread from low to middle, with a spike in the 70-90% access range.
- The mean (54.2) falls below the median (58.4), confirming a slight left skew, though not a pronounced one.
Key Findings from Descriptive Statistics:
- Substantial variation in internet access worldwide:
  - Bottom Quartile: 0% to about 28% access.
  - Top Quartile: Approximately 80% to almost 100% access.
- The resulting disparity highlights a clear divide between ‘haves’ and ‘have nots.’

Inquiry: Does wealth (measured by GDP per capita) correlate with variations in internet access?
Hypothesis: Higher GDP should yield higher internet access on average.
Methodology: Initially employ a t-test to compare mean access across two GDP groups (low vs high).
R Implementation:
- Create a two-category GDP variable:
  - Use cut2 to split the original GDP data into two equal-sized groups and store the result in countries2$gdp2.
- Use compmeans to compare mean internet access across the two groups.
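The two steps above can be approximated in base R. This is a sketch under stated assumptions: the data are simulated, and `cut()` with a median break and `tapply()` stand in for the `cut2` and `compmeans` calls named in the notes, whose exact arguments are not shown there.

```r
set.seed(1)
n <- 190

# ASSUMPTION: simulated stand-ins for GDP per capita and internet access
gdp <- exp(rnorm(n, mean = 9, sd = 1.2))
internet <- pmin(100, pmax(0, 15 * (log(gdp) - 5.5) + rnorm(n, sd = 12)))
dat <- data.frame(gdp, internet)

# Split GDP at the median into two equal-sized groups
dat$gdp2 <- cut(dat$gdp,
                breaks = quantile(dat$gdp, probs = c(0, 0.5, 1)),
                labels = c("Low GDP", "High GDP"),
                include.lowest = TRUE)

# Group sizes and mean internet access in each group
group_n <- table(dat$gdp2)
group_means <- tapply(dat$internet, dat$gdp2, mean)
print(group_n)
print(round(group_means, 2))
```

Splitting at the median is what makes the two groups equal in size; any other cut point would trade that balance for a substantively chosen threshold.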
Observed Means (consistent with the hypothesis):
- Low GDP Group: Mean internet access of 33.39%.
- High GDP Group: Mean internet access of 76.41%.
Test Results:
- Welch Two Sample t-test:
  - t = -16.0, df = 176, p < 2e-16.
- Effect size:
  - Cohen's d: -2.40 (indicating a strong relationship; the negative sign reflects that the low-GDP group, with the lower mean, is listed first).
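A Welch t-test and Cohen's d can be reproduced in base R as below. The data are simulated (an assumption), with group means only loosely based on the values reported above, and Cohen's d is computed by hand with the pooled standard deviation rather than with a package helper.

```r
set.seed(2)

# ASSUMPTION: simulated two-group data, not the countries2 values
low  <- rnorm(90, mean = 33, sd = 18)
high <- rnorm(90, mean = 76, sd = 18)

# t.test() defaults to var.equal = FALSE, which gives the Welch test
welch <- t.test(low, high)
print(welch)

# Cohen's d: mean difference divided by the pooled standard deviation
n1 <- length(low)
n2 <- length(high)
pooled_sd <- sqrt(((n1 - 1) * var(low) + (n2 - 1) * var(high)) / (n1 + n2 - 2))
cohens_d <- (mean(low) - mean(high)) / pooled_sd
print(cohens_d)
```

Because the low group is entered first, both the t statistic and d come out negative; swapping the argument order flips the signs without changing the conclusion.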

Purpose: Use analysis of variance (ANOVA) to assess whether mean internet access differs significantly across four categories of GDP.
Null Hypothesis (H0): Mean internet access is the same in every GDP category.
Formal Statement: H0: μ1 = μ2 = μ3 = μ4
Key Concept: The total variation in the dependent variable is partitioned as follows:
- Total Sum of Squares (SST), which can be decomposed into:
  - Sum of Squares Between (SSB): variation of the category means around the overall mean.
  - Sum of Squares Within (SSW): variation of observations around their own category means.

Formula:
- $SST = SSB + SSW$
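Written out, the three sums of squares take the standard one-way ANOVA form. The notation is an assumption introduced here for illustration: $y_{ij}$ is the value for observation $i$ in category $j$, $n_j$ is the number of observations in category $j$, $\bar{y}_j$ is the mean of category $j$, $\bar{y}$ is the overall mean, and $k$ is the number of categories.
- $SST = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2$
- $SSB = \sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2$
- $SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2$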
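Degrees of Freedom:
- For SSW: $\text{dfw} = n - k$
- For SSB: $\text{dfb} = k - 1$
- Here $n$ is the number of observations and $k$ is the number of categories.
Use the F-ratio for comparisons:
- $F = \frac{MSB}{MSW}$, where $MSB = SSB / \text{dfb}$ and $MSW = SSW / \text{dfw}$ (MSW is also called the mean squared error, MSE).
- Reject the null hypothesis when $F$ exceeds the critical value for (dfb, dfw) degrees of freedom.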
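The decomposition and the F-ratio can be checked by hand and compared with `aov()`. This is a sketch on simulated data for four groups (an assumption); the variable names such as `msb` and `msw` are chosen here for illustration.

```r
set.seed(3)
k <- 4
n_per <- 45

# ASSUMPTION: simulated internet access for four GDP categories
group_means <- c(20, 45, 65, 85)
gdp4 <- factor(rep(paste0("Q", 1:k), each = n_per))
internet <- rnorm(k * n_per, mean = rep(group_means, each = n_per), sd = 12)

n <- length(internet)
grand_mean <- mean(internet)
means_j <- tapply(internet, gdp4, mean)   # category means
n_j <- tapply(internet, gdp4, length)     # category sizes

# Sums of squares
sst <- sum((internet - grand_mean)^2)
ssb <- sum(n_j * (means_j - grand_mean)^2)
ssw <- sum((internet - ave(internet, gdp4))^2)  # ave() returns each row's group mean

# Degrees of freedom, mean squares, and the F-ratio
dfb <- k - 1
dfw <- n - k
msb <- ssb / dfb
msw <- ssw / dfw
f_ratio <- msb / msw

# Critical value at the 0.05 level and the p-value
f_crit <- qf(0.95, dfb, dfw)
p_value <- pf(f_ratio, dfb, dfw, lower.tail = FALSE)

# The same test with the built-in ANOVA function
fit <- aov(internet ~ gdp4)
print(summary(fit))
```

The hand-computed `f_ratio` should agree with the F value in the `aov()` table, which is a useful check that the decomposition was done correctly.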

Effect Size Measure: Eta-squared ($\eta^2 = SSB / SST$) shows how much of the variance in the dependent variable is explained by the categories of the independent variable.
Calculation in R:
- The computed $\eta^2$ indicates a strong relationship:
  - $\eta^2 = 0.79$, meaning the categories account for 79% of the variance in internet access.
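Eta-squared can be computed directly from the ANOVA table as the between-groups share of the total sum of squares. The sketch below uses simulated data (an assumption) and base R only; the notes do not show which package function the chapter calls, so none is assumed here.

```r
set.seed(4)

# ASSUMPTION: simulated internet access for four GDP categories
gdp4 <- factor(rep(c("Q1", "Q2", "Q3", "Q4"), each = 40))
internet <- rnorm(160, mean = rep(c(20, 45, 65, 85), each = 40), sd = 12)

fit <- aov(internet ~ gdp4)

# The "Sum Sq" column holds SSB (first row) and SSW (second row)
ss <- summary(fit)[[1]][["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)
print(round(eta_sq, 3))

# For a one-way design this equals R-squared from the matching linear model
r_sq <- summary(lm(internet ~ gdp4))$r.squared
print(round(r_sq, 3))
```

Reading eta-squared as a proportion of variance explained is what allows a value such as 0.79 to be stated as 79%.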

Inquiry: Does population size influence internet access?
- The analysis begins by creating a four-category population measure: countries2$pop4
Observed mean internet access is fairly homogeneous across the population quartiles.
The ANOVA results show that the differences across population categories are not statistically significant (p-value = 0.16).
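The population analysis follows the same pattern: build a four-category variable, inspect the group means, then run the ANOVA. In this sketch the data are simulated so that internet access is unrelated to population (an assumption), and base R's `cut()` with quartile breaks stands in for whatever call the chapter uses to create `pop4`.

```r
set.seed(5)
n <- 192

# ASSUMPTION: simulated data; access is generated independently of population
pop <- exp(rnorm(n, mean = 16, sd = 1.8))
internet <- pmin(100, pmax(0, rnorm(n, mean = 54, sd = 28)))

# Four equal-sized population categories based on quartiles
pop4 <- cut(pop,
            breaks = quantile(pop, probs = seq(0, 1, 0.25)),
            labels = c("Q1 (smallest)", "Q2", "Q3", "Q4 (largest)"),
            include.lowest = TRUE)

# Mean internet access in each population quartile
print(round(tapply(internet, pop4, mean), 2))

# One-way ANOVA of internet access on population category
fit <- aov(internet ~ pop4)
print(summary(fit))
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
```

A p-value above the chosen significance level (commonly 0.05) means the null hypothesis of equal means is not rejected.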

Reinforcement: The F-ratio and the t-score provide analogous information:
- When the independent variable has only two categories, $t^2 = F$ (for the equal-variance t-test), so the two tests lead to the same conclusion about statistical significance.
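The link between the two statistics is easy to verify numerically. The sketch below uses simulated two-group data (an assumption) and the equal-variance t-test, because the identity is exact only in that case; the Welch test adjusts the variance estimate and the degrees of freedom.

```r
set.seed(6)

# ASSUMPTION: simulated two-group data
group <- factor(rep(c("Low GDP", "High GDP"), each = 60))
internet <- rnorm(120, mean = rep(c(35, 75), each = 60), sd = 15)

# Equal-variance (pooled) t-test
t_pooled <- t.test(internet ~ group, var.equal = TRUE)
t_sq <- unname(t_pooled$statistic)^2

# One-way ANOVA on the same two groups
fit <- aov(internet ~ group)
f_value <- summary(fit)[[1]][["F value"]][1]
f_p <- summary(fit)[[1]][["Pr(>F)"]][1]

print(c(t_squared = t_sq, F = f_value))
print(c(t_test_p = t_pooled$p.value, anova_p = f_p))
```

The squared t statistic and the F value match, as do the two p-values, so with two groups the choice between the tests does not change the conclusion.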

Other types of dependent variables (ordinal, nominal) often call for alternative tools such as cross-tabulation and the chi-square test.
The next chapters extend hypothesis testing to these additional methods.

Exercises: Apply and interpret ANOVA in practical scenarios, with emphasis on critical analysis and on drawing conclusions about the statistical significance and size of the effects observed in data sets.