Statistics Inference Notes

ANOVA Assumptions

  • Analysis of Variance (ANOVA) is used to determine whether the means of three or more populations differ.
  • One-way ANOVA compares population means based on one categorical variable.
  • Populations are approximately normally distributed.
  • The population standard deviations (variances) are unknown but assumed equal.
  • Samples are selected randomly and independently from each population.
  • Here we compare a total of c populations, rather than just two.

Competing Hypotheses for One-Way ANOVA

  • $H_0: \mu_1 = \mu_2 = \dots = \mu_c$
  • $H_1$: Not all population means are equal
  • Sample means come from different sampling distributions and are not as close together when $H_0$ is false.
  • Sample means are close together because there is only one sampling distribution when $H_0$ is true.

Test Statistic

  • $F(df_1, df_2) = \frac{MSTR}{MSE}$
    • Where $df_1 = (c - 1)$ and $df_2 = (n_T - c)$
  • These inferences are based on the F-distribution and a chosen significance level $\alpha$.
  • $F_{\alpha}$ denotes the value with area $\alpha$ in the right tail, which corresponds to a left-hand probability of $(1 - \alpha)$.
  • $(F_{df_1, df_2})_{\alpha} = \frac{1}{(F_{df_2, df_1})_{1-\alpha}}$
  • The parameters $df_1$ and $df_2$ are called the numerator and denominator degrees of freedom.
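
As a worked illustration of the reciprocal relation, take the critical value $(F_{3,20})_{.05} = 3.10$ used in the public transportation example in these notes. Setting $df_1 = 3$, $df_2 = 20$, and $\alpha = 0.05$ gives the value that leaves 5% in the left tail of the F-distribution with $(20, 3)$ degrees of freedom:

$$(F_{20,3})_{0.95} = \frac{1}{(F_{3,20})_{0.05}} = \frac{1}{3.10} \approx 0.32$$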

Calculating MSTR and MSE

  • MSTR: Mean Square for Treatment (Variance between groups/samples)
  • MSE: Mean Square Error (Variance within groups/samples)
  • $\bar{x} = \frac{\sum_{i=1}^{c} n_i \bar{x}_i}{n_T}$ (Grand mean)
  • $SSTR = \sum_{i=1}^{c} n_i (\bar{x}_i - \bar{x})^2$ (Sum of squares due to treatment)
  • $MSTR = \frac{SSTR}{(c - 1)}$
  • $SSE = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 = \sum_{i=1}^{c} (n_i - 1)s_i^2$ (Error sum of squares)
  • $MSE = \frac{SSE}{(n_T - c)}$
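
The formulas above can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the function name `one_way_anova` and the sample data are hypothetical, and SSE is computed with the $(n_i - 1)s_i^2$ form.

```python
from statistics import mean, variance


def one_way_anova(groups):
    """Return (MSTR, MSE, F) for a list of samples, one list per group."""
    c = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # n_T, the combined sample size
    group_means = [mean(g) for g in groups]

    # Grand mean: group means weighted by their sample sizes.
    grand_mean = sum(len(g) * m for g, m in zip(groups, group_means)) / n_total

    # SSTR: variation of the group means around the grand mean.
    sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))

    # SSE: within-group variation, using (n_i - 1) * s_i^2 for each group.
    sse = sum((len(g) - 1) * variance(g) for g in groups)

    mstr = sstr / (c - 1)
    mse = sse / (n_total - c)
    return mstr, mse, mstr / mse


# Hypothetical data: three groups of unequal size.
samples = [[4, 5, 6], [7, 8, 9, 10], [1, 2, 3]]
mstr, mse, f_stat = one_way_anova(samples)
print(f"MSTR = {mstr:.3f}, MSE = {mse:.3f}, F = {f_stat:.3f}")
```

The resulting F value would then be compared with the critical value for $df_1 = c - 1$ and $df_2 = n_T - c$.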

Sum of Squares Decomposition

  • Total Sum of Squares (SST) = Sum of Squares due to Treatment (SSTR) + Error Sum of Squares (SSE)
  • Variation due to differences between groups + Variation due to random sampling
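
The decomposition can be checked numerically by computing each sum of squares directly from its definition. The sketch below uses hypothetical data and a hypothetical helper name, `sums_of_squares`; it only confirms that SST and SSTR + SSE agree up to floating-point error.

```python
def sums_of_squares(groups):
    """Return (SST, SSTR, SSE), each computed directly from its definition."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # SST: every observation against the grand mean.
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    # SSTR: every group mean against the grand mean, weighted by group size.
    sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # SSE: every observation against its own group mean.
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    return sst, sstr, sse


# Hypothetical data, not taken from the notes.
data = [[12.0, 15.0, 11.0], [18.0, 20.0, 17.0, 21.0], [9.0, 10.0]]
sst, sstr, sse = sums_of_squares(data)
print(f"SST = {sst:.4f}, SSTR + SSE = {sstr + sse:.4f}")
```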

One-Way ANOVA Table

  • Source of Variation, SS, df, MS (Variance), F ratio
    • Between Groups: SSTR, $c - 1$, $MSTR = \frac{SSTR}{c - 1}$
    • Within Groups: SSE, $n_T - c$, $MSE = \frac{SSE}{n_T - c}$
    • Total: SST, $n_T - 1$
    • F = MSTR/MSE
    • c = number of groups/samples
    • $n_T$ = sum of the sample sizes from all groups
    • df = degrees of freedom

Example: Public Transportation Savings

  • Research analyst Sean Cox looked at study results from a Boston Globe article that claimed commuters there topped the nation in cost savings from public transportation. Using samples drawn from four cities, he wants to know whether the average savings differ significantly among those cities.
  • Assumptions:
    • Populations are normally distributed
    • Populations have equal variances
    • Samples are randomly and independently drawn

Example: Summary Statistics

  • $\bar{x} = \frac{287,760}{24} = 11,990$
  • $SSTR = 5(12622 - 11990)^2 + 8(12585 - 11990)^2 + 6(11720 - 11990)^2 + 5(10730 - 11990)^2 = 13,204,720$
  • $MSTR = \frac{13,204,720}{(4 - 1)} = 4,401,573$
  • $SSE = (5 - 1)(87.79)^2 + (8 - 1)(80.40)^2 + (6 - 1)(83.96)^2 + (5 - 1)(90.62)^2 = 144,180$
  • $MSE = \frac{SSE}{(n_T - c)} = \frac{144,180}{(24 - 4)} = 7,209$
  • $F(3,20) = \frac{4,401,573}{7,209} = 610.57$
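
The arithmetic above can be reproduced from the summary statistics alone. The sketch below takes the sample sizes, sample means, and sample standard deviations from the SSTR and SSE lines. Because those standard deviations are rounded to two decimals, the recomputed SSE comes out near 144,172 rather than 144,180, and F is about 610.6 instead of exactly 610.57.

```python
# Summary statistics from the worked example: (n_i, sample mean, sample std dev).
cities = [
    (5, 12622, 87.79),
    (8, 12585, 80.40),
    (6, 11720, 83.96),
    (5, 10730, 90.62),
]

c = len(cities)                               # number of groups
n_total = sum(n for n, _, _ in cities)        # n_T = 24

# Grand mean: sample means weighted by sample size.
grand_mean = sum(n * m for n, m, _ in cities) / n_total

# Between-group and within-group sums of squares.
sstr = sum(n * (m - grand_mean) ** 2 for n, m, _ in cities)
sse = sum((n - 1) * s ** 2 for n, _, s in cities)

mstr = sstr / (c - 1)
mse = sse / (n_total - c)
f_stat = mstr / mse

print(f"grand mean = {grand_mean:,.0f}")
print(f"SSTR = {sstr:,.0f}, MSTR = {mstr:,.0f}")
print(f"SSE = {sse:,.1f}, MSE = {mse:,.1f}")
print(f"F({c - 1},{n_total - c}) = {f_stat:.2f}")
```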

One-Factor ANOVA Example Solution

  • $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
  • $H_1: \mu_i \text{ not all equal}$
  • $\alpha = 0.05$
  • $df_1 = 3$
  • $df_2 = 20$
  • $F = 610.57$
  • Critical Value: $F_{3,20,.05} = 3.10$
  • Decision: Reject $H_0$ at $\alpha = 0.05$
  • Conclusion: There is evidence that at least one $\mu_i$ differs from the rest and the mean savings are not the same for each city.

ANOVA - Single Factor: Excel Output

  • Excel: Data | Data Analysis | Anova: Single Factor

Chi-Square Test of Independence

  • A test of independence – also called a chi-square test of a contingency table – analyzes the relationship between two qualitative variables.
  • The competing hypotheses:
    • $H_0$: The two classifications are independent
    • $H_1$: The two classifications are dependent
  • A contingency table shows the frequencies for two qualitative variables. Each variable has two or more categories.
  • The test for independence is based on the expected and observed frequencies for each cell in the table.

Test of Independence: Test Statistics

  • $\chi^2 = \sum \sum \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
    • Where:
      • $o_{ij}$ denotes the observed frequency in row i of column j
      • $e_{ij}$ is the expected frequency in row i of column j
      • $e_{ij} = \frac{(\text{Row i total})(\text{Column j total})}{\text{Sample Size}}$
  • To apply a chi-square test of independence, the expected frequency for each cell must be at least 5.
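
The test statistic and the expected frequencies can be sketched in a few lines of Python. The function name `chi_square_statistic` and the 2 x 3 table of counts are hypothetical, chosen only to illustrate the formulas, and the final check applies the at-least-5 rule to the expected frequencies.

```python
def chi_square_statistic(observed):
    """Return (statistic, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)                       # overall sample size

    # e_ij = (row i total)(column j total) / sample size
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Sum (o_ij - e_ij)^2 / e_ij over every cell of the table.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected


# Hypothetical 2 x 3 table of observed counts (not the data from the notes).
table = [[30, 20, 50], [20, 30, 50]]
stat, expected = chi_square_statistic(table)

# The test applies only when every expected frequency is at least 5.
rule_ok = min(e for row in expected for e in row) >= 5
print(f"chi-square = {stat:.2f}, expected-frequency rule satisfied: {rule_ok}")
```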

Test of Independence: Chi-Squared Distribution

  • The test statistic follows the chi-square distribution with $df = (r - 1)(c - 1)$ degrees of freedom
    • Where:
      • r – number of rows
      • c – number of columns
  • The critical value is read from a chi-square distribution table.

Example: Compression Garment Brand and Customer Age

  • Does the brand of compression garment purchased depend on the customer’s age?
  • We use the notation $o_{ij}$ to denote the observed frequency in row i of column j. Similarly, $e_{ij}$ is the expected frequency in row i of column j.
  • Under the independence assumption, the expected frequency per cell is: $e_{ij} = \frac{(\text{Row i total})(\text{Column j total})}{\text{Sample Size}}$

Example: Expected Frequencies

  • For row 1 and column 1, the expected frequency, $e_{11}$, is $\frac{(396)(228)}{600} = 150.48$

Example: Deviations and Squared Deviations

  • The deviations $(o_{ij} - e_{ij})$ are calculated.
  • We square each deviation and divide by the respective expected frequency. These values are then summed to produce the test statistic.

Summarizing the Example: Solution

  • $H_0$: Age and brand name are independent.
  • $H_1$: Age and brand name are dependent.
  • $\chi^2 = 22.53$ with $df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2$
  • Critical value: $\chi^2_{.05} = 5.991$
  • Decision Rule: If $\chi^2 > 5.991$, reject $H_0$; otherwise, do not reject $H_0$.
  • Here, $\chi^2 = 22.53 > 5.991$, so we reject $H_0$ and conclude that brand and age group are dependent.
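
For the special case of 2 degrees of freedom, the right-tail probability of the chi-square distribution has the closed form $e^{-x/2}$, so the critical value and an approximate p-value for this example can be checked with the standard library alone. This shortcut is an assumption that holds only for $df = 2$; other degrees of freedom need a table or a statistics library.

```python
import math

alpha = 0.05
chi_sq = 22.53   # test statistic from the example, df = 2

# For df = 2 only: P(X > x) = exp(-x / 2), so the critical value is -2 * ln(alpha).
critical_value = -2 * math.log(alpha)
p_value = math.exp(-chi_sq / 2)

# Decision rule: reject H0 when the statistic exceeds the critical value.
reject_h0 = chi_sq > critical_value

print(f"critical value = {critical_value:.3f}")
print(f"p-value = {p_value:.2e}")
print("reject H0" if reject_h0 else "do not reject H0")
```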

Chi-square test of Independence in EXCEL

  • The Excel CHISQ.TEST function takes two ranges, the observed frequencies and the expected frequencies, and returns the p-value of the chi-square test: the probability of seeing differences at least this large between the two sets from sampling error alone if the variables are independent.