Hypothesis Testing with Multiple Groups

Chapter 11 Hypothesis Testing with Multiple Groups

11.1 Getting Ready

  • This chapter extends the examination of mean differences from two groups to comparisons of several subgroup means.

  • The comparison of subgroup means includes independent variables with more than two categories or continuous variables collapsed into several discrete categories.

  • Limitation: Comparing only two groups discards information when the independent variable can reasonably be divided into several categories.

  • Focus: Exploring hypothesis-testing methods for independent variables with several categories.

  • R Setup: Load the countries2 data set and attach necessary libraries:

    • descr

    • DescTools

    • gplots

    • effectsize

    • Hmisc
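
The setup step above can be sketched as follows. This is a minimal sketch, assuming the five packages are already installed; the file path and the .rda format in the last line are placeholders, not details given in these notes.

```r
# Packages named in the setup list above.
pkgs <- c("descr", "DescTools", "gplots", "effectsize", "Hmisc")

# Find any packages that are not installed, rather than failing inside library().
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(not_installed) > 0) {
  message("Install these first: ", paste(not_installed, collapse = ", "))
}

# Attach the packages that are available.
for (p in setdiff(pkgs, not_installed)) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}

# Hypothetical path and file format: point this at the local copy of countries2.
# load("path/to/countries2.rda")
```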

11.2 Internet Access as an Indicator of Development

  • Objective: Study indicators of development worldwide through the lens of internet access.

  • Context: Access to the internet is increasingly seen as a crucial infrastructure component, akin to traditional forms of transportation and communication (roads, bridges).

  • Focus Variable: internet from the countries2 data set, which estimates the percentage of households with regular internet access at the country level.

  • Initial Exploration:

    • Plotting a histogram to assess the distribution of internet access.

    • Generating descriptive statistics for understanding the variable's spread:

    • Minimum: 0.0

    • 1st Quartile: 27.6

    • Median: 58.4

    • Mean: 54.2

    • 3rd Quartile: 79.6

    • Maximum: 99.7

    • NA's: 1
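
The initial exploration can be sketched in base R. Because the countries2 data are not reproduced in these notes, the example simulates a stand-in variable; the sample size, the beta distribution, and its parameters are assumptions chosen only to give a plausible shape, so the printed numbers will not match the statistics listed above.

```r
set.seed(11)

# Hypothetical stand-in for countries2$internet: percentages on a 0-100 scale,
# with one missing value to mirror the single NA reported above.
internet <- round(100 * rbeta(195, shape1 = 1.3, shape2 = 1.1), 1)
internet[195] <- NA

# Histogram of the distribution (hist() ignores the missing value).
hist(internet,
     xlab = "Percent of households with internet access",
     main = "Distribution of internet access (simulated)")

# Minimum, quartiles, mean, and maximum; summary() also reports the NA count.
stats <- summary(internet)
print(stats)

# A mean below the median is one quick indication of a left skew.
mean_below_median <- mean(internet, na.rm = TRUE) < median(internet, na.rm = TRUE)
print(mean_below_median)
```

With the real data, the same two calls would be applied to countries2$internet.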

11.2.1 Distribution Insights

  • Histogram Analysis:

    • Distribution shows a fairly even spread from low to middle, with a spike in the 70-90% access range.

    • The mean (54.2) falling below the median (58.4) confirms a slight left (negative) skew, though not a pronounced one.

  • Key Findings from Descriptive Statistics:

    • Substantial variation in internet access across countries:

    • Bottom Quartile: 0% to about 28% access.

    • Top Quartile: Approximately 80% to almost 100% access.

    • Resulting disparity highlights a clear divide between ‘haves’ and ‘have nots.’

11.3 Analyzing the Relationship between Wealth and Internet Access

  • Inquiry: Does wealth (measured by GDP per capita) correlate with variations in internet access?

  • Hypothesis: Countries with higher GDP per capita should, on average, have higher internet access.

  • Methodology: Initially employ a t-test to compare mean access across two GDP groups (low vs high).

  • R Implementation:

    • Create a two-category GDP variable:

    • Use cut2 (from Hmisc) to split the original GDP variable into two equal-sized groups, stored as countries2$gdp2.

    • Use compmeans (from descr) to compare mean internet access across the two groups.

  • Observed Means:

    • Low GDP Group: Mean internet access at 33.39%.

    • High GDP Group: Mean internet access at 76.41%.

  • Test Results:

    • Welch two-sample t-test:

    • t = -16.0, df = 176, p < 2e-16.

    • Effect size:

    • Cohen's d: -2.40 (a strong relationship: the group means differ by about 2.4 pooled standard deviations).
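
The two-group workflow above can be sketched in base R so that it runs without the chapter's packages. The data are simulated (all parameters below are made up), so the output will not reproduce the means, t-score, or Cohen's d reported above. The median split is a base-R analogue of cut2 with two groups, and the group table mirrors the kind of summary compmeans reports.

```r
set.seed(11)

# Hypothetical stand-ins for the countries2 variables (simulated, not real data).
n   <- 180
gdp <- exp(rnorm(n, mean = 9, sd = 1.2))   # right-skewed, like GDP per capita
internet <- pmin(100, pmax(0, 18 * log(gdp) - 110 + rnorm(n, sd = 12)))
dat <- data.frame(gdp, internet)

# Split GDP at its median into two equal-sized groups.
dat$gdp2 <- cut(dat$gdp,
                breaks = quantile(dat$gdp, probs = c(0, 0.5, 1)),
                labels = c("Low", "High"),
                include.lowest = TRUE)

# Mean, group size, and standard deviation of internet access by GDP group.
group_stats <- aggregate(internet ~ gdp2, data = dat,
                         FUN = function(x) c(mean = mean(x), n = length(x), sd = sd(x)))
print(group_stats)

# Welch two-sample t-test (does not assume equal variances).
welch <- t.test(internet ~ gdp2, data = dat)
print(welch)

# Cohen's d with a pooled standard deviation: (mean_low - mean_high) / sd_pooled.
low  <- dat$internet[dat$gdp2 == "Low"]
high <- dat$internet[dat$gdp2 == "High"]
sd_pooled <- sqrt(((length(low) - 1) * var(low) + (length(high) - 1) * var(high)) /
                  (length(low) + length(high) - 2))
cohens_d <- (mean(low) - mean(high)) / sd_pooled
print(cohens_d)
```

Because the difference is taken as low minus high, both t and d come out negative when the high-GDP group has the larger mean, which matches the signs reported above.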

11.4 Analyzing Variance (ANOVA)

  • Purpose: Assess the significance of differences in mean values of internet access across categories of GDP.

  • Null Hypothesis (H0): Mean internet access is the same in every GDP category: μ1 = μ2 = μ3 = μ4.

  • Alternative Hypothesis (H1): At least one category mean differs from the others.

  • Key Concept: Total variation in internet access, the total sum of squares (SST), is decomposed into two parts:

    • Sum of Squares Between (SSB): variation of the category means around the overall mean.

    • Sum of Squares Within (SSW): variation of individual countries around their own category mean.

Statistical Components for ANOVA:

  • Formula:

    • $SST = SSB + SSW$

  • Degrees of Freedom:

    • For SSW: $\text{df}_w = n - k$

    • For SSB: $\text{df}_b = k - 1$

  • Use the F-ratio for comparisons:

    • $F = \frac{MSB}{MSE}$

    • Compare the F-ratio with the critical value of the F distribution; reject the null hypothesis if F exceeds it.
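
The components above can be computed by hand, which makes the decomposition concrete. Here n is the number of observations, k the number of categories, MSB = SSB / (k - 1), and MSE = SSW / (n - k). The sketch below uses simulated data with made-up category means (not the countries2 values) and cross-checks the hand calculation against R's built-in aov().

```r
set.seed(11)

# Hypothetical data: internet access for 180 countries in four GDP categories.
k <- 4
gdp4 <- factor(rep(paste0("Q", 1:k), each = 45))
category_means <- c(Q1 = 20, Q2 = 45, Q3 = 65, Q4 = 85)   # made-up values
internet <- unname(category_means[as.character(gdp4)]) + rnorm(length(gdp4), sd = 12)
n <- length(internet)

grand_mean  <- mean(internet)
group_means <- tapply(internet, gdp4, mean)
group_n     <- tapply(internet, gdp4, length)

# Sums of squares.
sst <- sum((internet - grand_mean)^2)                        # total
ssb <- sum(group_n * (group_means - grand_mean)^2)           # between categories
ssw <- sum((internet - group_means[as.character(gdp4)])^2)   # within categories

# Degrees of freedom and mean squares.
df_b <- k - 1
df_w <- n - k
msb  <- ssb / df_b
mse  <- ssw / df_w

# F-ratio and its p-value from the F distribution.
f_ratio <- msb / mse
p_value <- pf(f_ratio, df_b, df_w, lower.tail = FALSE)
print(c(SST = sst, SSB = ssb, SSW = ssw, F = f_ratio, p = p_value))

# Cross-check against the built-in one-way ANOVA table.
anova_table <- summary(aov(internet ~ gdp4))
print(anova_table)
```

Using the p-value from pf() is equivalent to comparing F with a critical value: F exceeds the critical value at a given significance level exactly when the p-value falls below that level.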

11.5 Effect Size and Its Calculation

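  • Effect Size Measure: Eta-squared ($\eta^2$) shows the proportion of variance in the dependent variable explained by the categories of the independent variable: $\eta^2 = SSB / SST$.

  • Calculation in R:

    • Computing $\eta^2$ for GDP categories and internet access indicates a strong relationship:

    • $\eta^2 = 0.79$, meaning GDP category explains about 79% of the variance in internet access.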
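
Eta-squared can be read directly off the ANOVA sums of squares. The sketch below uses base R and simulated data with made-up category means, so the value will differ from the 0.79 reported above. The effectsize and DescTools packages loaded in this chapter each provide an eta-squared helper (eta_squared() and EtaSq() respectively); which one the chapter calls is not stated in these notes.

```r
set.seed(11)

# Hypothetical data: four GDP categories with made-up means.
gdp4 <- factor(rep(c("Q1", "Q2", "Q3", "Q4"), each = 45))
category_means <- c(Q1 = 20, Q2 = 45, Q3 = 65, Q4 = 85)
internet <- unname(category_means[as.character(gdp4)]) + rnorm(180, sd = 12)

# One-way ANOVA; the table's "Sum Sq" column holds SSB (gdp4) then SSW (residuals).
fit <- aov(internet ~ gdp4)
ss  <- summary(fit)[[1]][["Sum Sq"]]

# Eta-squared = SSB / SST, where SST = SSB + SSW.
eta_sq <- ss[1] / sum(ss)
print(eta_sq)

# For a one-way design, eta-squared equals R-squared from the equivalent regression.
r_sq <- summary(lm(internet ~ gdp4))$r.squared
print(r_sq)
```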

11.6 Population Size and Internet Access

  • Inquiry: Influence of population size on internet access:

    • Analysis begins by creating a four-category population measure: countries2$pop4

  • Mean internet access is fairly similar across the four population categories.

  • ANOVA results: the differences across population categories are not statistically significant (p-value = 0.16), so the null hypothesis of equal means cannot be rejected.
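
The population analysis follows the same pattern as the GDP analysis. The sketch below assumes a quartile split for the four categories (the notes do not say how pop4 was built) and simulates internet access independently of population. The data are hypothetical, so the p-value it prints is not the 0.16 reported above.

```r
set.seed(11)

# Hypothetical data: population and internet access generated independently.
n   <- 180
pop <- exp(rnorm(n, mean = 16, sd = 1.8))                     # right-skewed sizes
internet <- pmin(100, pmax(0, rnorm(n, mean = 54, sd = 28)))

# Four equal-sized population categories (assumed quartile split).
pop4 <- cut(pop,
            breaks = quantile(pop, probs = seq(0, 1, by = 0.25)),
            labels = c("Q1 (smallest)", "Q2", "Q3", "Q4 (largest)"),
            include.lowest = TRUE)

# Mean internet access in each population category.
print(tapply(internet, pop4, mean))

# One-way ANOVA and the p-value for the F-ratio.
fit <- aov(internet ~ pop4)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
print(p_value)

# Decision at the conventional 0.05 level.
alpha <- 0.05
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
print(decision)
```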

11.7 Connection Between t-Scores and F-Ratios

  • Reinforcement: F-ratio and t-scores provide analogous information:

    • Notably, $t^2 = F$ when the independent variable has only two categories (exactly so for the equal-variance t-test), and the two tests lead to the same conclusion about statistical significance.
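
The equivalence can be checked numerically. The sketch below uses simulated two-group data (group means and spread are made up) and the pooled-variance t-test, since the identity is exact only under the equal-variance assumption; a Welch t-test would give a value of t squared that is close to F but not necessarily identical.

```r
set.seed(11)

# Hypothetical two-group data with made-up group means.
group <- factor(rep(c("Low", "High"), each = 90), levels = c("Low", "High"))
group_mean <- c(Low = 33, High = 76)
internet <- unname(group_mean[as.character(group)]) + rnorm(180, sd = 18)

# Pooled-variance t-test (var.equal = TRUE).
t_stat <- unname(t.test(internet ~ group, var.equal = TRUE)$statistic)

# One-way ANOVA on the same two groups.
f_stat <- summary(aov(internet ~ group))[[1]][["F value"]][1]

# The squared t-score and the F-ratio agree up to floating-point error.
print(c(t_squared = t_stat^2, F = f_stat))
print(all.equal(t_stat^2, f_stat))
```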

11.8 Future Directions in Statistical Analysis

  • When the dependent variable is ordinal or nominal rather than numeric, comparing means is no longer appropriate, and alternative tools such as cross-tabulation and chi-square are used instead.

  • The next chapters extend hypothesis testing to these methods.

11.9 Assignments

  • Exercises involving the application and interpretation of ANOVA in practical scenarios, emphasizing critical analysis and conclusion formation on statistical significance and effects observed in datasets.