Statistics Inference Notes

Analysis of Variance (ANOVA) is used to determine whether the means of three or more populations differ.
One-way ANOVA compares population means based on one categorical variable.
Populations are approximately normally distributed.
The population standard deviations (variances) are unknown but assumed equal.
Samples are selected randomly and independently from each population.
Here we compare a total of c populations, rather than just two.

$H_0: \mu_1 = \mu_2 = \dots = \mu_c$
$H_1$: Not all population means are equal
When $H_0$ is true, there is only one sampling distribution, so the sample means tend to be close together.
When $H_0$ is false, the sample means come from different sampling distributions and tend to be farther apart.

$F_{(df_1, df_2)} = \frac{MSTR}{MSE}$
- Where $df_1 = (c - 1)$ and $df_2 = (n_T - c)$, with $n_T$ the total number of observations across all samples
These inferences are based on the F distribution.
$F_{\alpha,(df_1, df_2)}$ denotes the value with area $\alpha$ in the right tail, which corresponds to a left-hand probability of $(1 - \alpha)$.
$F_{\alpha,(df_1, df_2)} = \frac{1}{F_{1-\alpha,(df_2, df_1)}}$
The parameters $df_1$ and $df_2$ are called the numerator and denominator degrees of freedom.
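The critical-value notation can be checked numerically. The following is a minimal sketch using only the Python standard library: it integrates the F density with Simpson's rule and finds the value with right-tail area alpha by bisection. The function names (`f_pdf`, `f_cdf`, `f_upper_critical`) are illustrative, not from any library, and the sketch assumes numerator degrees of freedom greater than 2 so that the density is finite at zero.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom (assumes d1 > 2)."""
    if x <= 0:
        return 0.0
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    log_pdf = (
        (d1 / 2) * math.log(d1 / d2)
        + (d1 / 2 - 1) * math.log(x)
        - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2)
        - log_beta
    )
    return math.exp(log_pdf)

def f_cdf(x, d1, d2, steps=2000):
    """Left-hand probability P(F <= x), approximated with Simpson's rule (steps is even)."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f_pdf(k * h, d1, d2)
    return total * h / 3

def f_upper_critical(alpha, d1, d2):
    """Value with right-tail area alpha, i.e. left-hand probability 1 - alpha."""
    target = 1 - alpha
    lo, hi = 0.0, 1.0
    while f_cdf(hi, d1, d2) < target:   # widen the bracket until it contains the target
        hi *= 2
    for _ in range(40):                 # bisection on the CDF
        mid = (lo + hi) / 2
        if f_cdf(mid, d1, d2) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

upper = f_upper_critical(0.05, 3, 20)        # right-tail area 0.05
lower = f_upper_critical(0.95, 3, 20)        # right-tail area 0.95
reciprocal = 1 / f_upper_critical(0.05, 20, 3)
print(round(upper, 2), round(lower, 4), round(reciprocal, 4))
```

Here `lower` and `reciprocal` should agree up to numerical error, which is the reciprocal relationship between the two tails with the degrees of freedom swapped. In practice a statistics table or package would be used instead of hand-rolled integration.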

MSTR: Mean Square for Treatment (Variance between groups/samples)
MSE: Mean Square Error (Variance within groups/samples)
$\bar{x} = \frac{\sum_{i=1}^{c} n_i \bar{x}_i}{n_T}$ (Grand mean)
$SSTR = \sum_{i=1}^{c} n_i (\bar{x}_i - \bar{x})^2$ (Sum of squares due to treatment)
$MSTR = \frac{SSTR}{(c - 1)}$
$SSE = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 = \sum_{i=1}^{c} (n_i - 1)s_i^2$ (Error sum of squares)
$MSE = \frac{SSE}{(n_T - c)}$

Total Sum of Squares (SST) = Sum of Squares due to Treatment (SSTR) + Error Sum of Squares (SSE)
That is, total variation = variation due to differences between groups + variation due to random sampling within groups.
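The decomposition can be verified directly from raw observations. The sketch below computes the three sums of squares for a small data set; the three samples are hypothetical values invented purely for illustration, and the function name `anova_sums_of_squares` is likewise an assumption.

```python
def anova_sums_of_squares(groups):
    """Return (SST, SSTR, SSE) for a list of samples, each a list of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Total variation: every observation against the grand mean.
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    # Between-group variation: each group mean against the grand mean, weighted by n_i.
    sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group variation: every observation against its own group mean.
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    return sst, sstr, sse

# Hypothetical data, for illustration only.
groups = [[4, 5, 6], [7, 8, 9], [1, 2, 3]]
sst, sstr, sse = anova_sums_of_squares(groups)
print(sst, sstr, sse)                    # SST, SSTR, SSE
print(abs(sst - (sstr + sse)) < 1e-9)    # the decomposition holds
```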

Research analyst Sean Cox looked at study results from a Boston Globe article that claimed commuters there topped the nation in cost savings from public transportation. He wants to know if the average savings significantly differ among these cities. Cox looked at samples drawn from four cities.
Assumptions:
- Populations are normally distributed
- Populations have equal variances
- Samples are randomly and independently drawn

$\bar{x} = \frac{287,760}{24} = 11,990$
$SSTR = 5(12622-11990)^2 + 8(12585-11990)^2 + 6(11720-11990)^2 + 5(10730-11990)^2 = 13,204,720$
$MSTR = \frac{13,204,720}{(4-1)} = 4,401,573$
$SSE = (5 - 1)(87.79)^2 + (8 - 1)(80.40)^2 + (6 - 1)(83.96)^2 + (5 - 1)(90.62)^2 = 144,180$
$MSE = \frac{SSE}{(n_T - c)} = \frac{144,180}{(24 - 4)} = 7,209$
$F(3,20) = \frac{4,401,573}{7,209} = 610.57$
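The calculation above can be reproduced from the summary statistics alone. This is a sketch, assuming the sample sizes, sample means, and sample standard deviations are exactly the rounded values that appear in the SSTR and SSE lines; the function name `one_way_anova_from_summary` is illustrative.

```python
def one_way_anova_from_summary(sizes, means, std_devs):
    """One-way ANOVA quantities from per-group n_i, sample mean, and sample std dev."""
    n_total = sum(sizes)
    c = len(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    sstr = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    sse = sum((n - 1) * s ** 2 for n, s in zip(sizes, std_devs))
    mstr = sstr / (c - 1)          # between-group variance
    mse = sse / (n_total - c)      # within-group variance
    return {
        "grand_mean": grand_mean,
        "SSTR": sstr,
        "SSE": sse,
        "MSTR": mstr,
        "MSE": mse,
        "F": mstr / mse,
        "df1": c - 1,
        "df2": n_total - c,
    }

# Values taken from the worked example (four cities).
sizes = [5, 8, 6, 5]
means = [12622, 12585, 11720, 10730]
std_devs = [87.79, 80.40, 83.96, 90.62]

result = one_way_anova_from_summary(sizes, means, std_devs)
for name, value in result.items():
    print(name, round(value, 2))
```

Because the standard deviations are rounded to two decimals, the SSE computed this way comes out slightly below the 144,180 shown above, and the F statistic differs in the second decimal place. The conclusion is unaffected.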

$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
$H_1$: Not all $\mu_i$ are equal
$\alpha = 0.05$
$df_1 = 3$
$df_2 = 20$
$F = 610.57$
Critical Value: $F_{3,20,.05} = 3.10$
Decision: Reject $H_0$ at $\alpha = 0.05$
Conclusion: There is evidence that at least one $\mu_i$ differs from the rest and the mean savings are not the same for each city.

A test of independence – also called a chi-square test of a contingency table – analyzes the relationship between two qualitative variables.
The competing hypotheses:
- $H_0$ : The two classifications are independent
- $H_1$ : The two classifications are dependent
A contingency table shows the frequencies for two qualitative variables. Each variable has two or more categories.
The test for independence is based on the expected and observed frequencies for each cell in the table.

$\chi^2 = \sum_{i} \sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
- Where:
 - $o_{ij}$ denotes the observed frequency in row i of column j
 - $e_{ij}$ is the expected frequency in row i of column j
 - $e_{ij} = \frac{(\text{Row i total})(\text{Column j total})}{\text{Sample Size}}$
To apply a chi-square test of independence, the expected frequency for each cell must be at least 5.

The test statistic follows the chi-square distribution with $df = (r - 1)(c - 1)$ degrees of freedom.
- Where:
  - r is the number of rows
  - c is the number of columns
The critical value is read from a chi-square distribution table.
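The expected frequencies, test statistic, and degrees of freedom can be computed in a few lines. The sketch below is one way to do it in plain Python; the function name `chi_square_independence` and the 2 x 2 table of counts are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def chi_square_independence(observed):
    """Chi-square statistic, degrees of freedom, and expected counts for a contingency table.

    `observed` is a list of rows, each a list of observed cell frequencies.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # e_ij = (row i total)(column j total) / sample size
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # The test requires every expected frequency to be at least 5.
    if min(min(row) for row in expected) < 5:
        raise ValueError("an expected frequency is below 5")

    statistic = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Hypothetical counts, for illustration only.
observed = [[30, 20],
            [20, 30]]
statistic, df, expected = chi_square_independence(observed)
print(statistic, df, expected)
```

For this table every expected frequency is 25, each cell contributes 1 to the statistic, and there is 1 degree of freedom.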

Does the brand of compression garment purchased depend on the customer’s age?
We use the notation $o_{ij}$ to denote the observed frequency in row i of column j. Similarly, $e_{ij}$ is the expected frequency in row i of column j.
Under the independence assumption, the expected frequency per cell is: $e_{ij} = \frac{(\text{Row i total})(\text{Column j total})}{\text{Sample Size}}$

For row 1 and column 1, the expected frequency, $e_{11}$ , is $\frac{(396)(228)}{600} = 150.48$

The deviations $(o_{ij} - e_{ij})$ are calculated.
We square each deviation and divide by the respective expected frequency. These values are then summed to produce the test statistic.

$H_0$: Age and brand name are independent.
$H_1$: Age and brand name are dependent.
$\chi^2 = 22.53$ with $df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2$
Critical value at $\alpha = 0.05$: $\chi^2_{.05} = 5.991$
Decision Rule: If $\chi^2 > 5.991$, reject $H_0$; otherwise, do not reject $H_0$.
Here, $\chi^2 = 22.53 > 5.991$, so we reject $H_0$ and conclude that brand and age group are dependent.

The Excel CHISQ.TEST function performs the chi-square test on two supplied data sets (of observed and expected frequencies) and returns the p-value of the test: the probability, if the two classifications are independent, that differences at least this large between the sets arise from sampling error alone.
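For the garment example, the critical value and the p-value can be cross-checked by hand, because with exactly 2 degrees of freedom the chi-square right-tail probability has the closed form $e^{-x/2}$. The sketch below relies on that special case only and does not generalize to other degrees of freedom; the function names are illustrative.

```python
import math

def chi2_df2_p_value(statistic):
    """Right-tail probability for a chi-square variable with 2 degrees of freedom."""
    return math.exp(-statistic / 2)

def chi2_df2_critical(alpha):
    """Value with right-tail area alpha for 2 degrees of freedom."""
    return -2 * math.log(alpha)

statistic = 22.53                        # test statistic from the example
critical = chi2_df2_critical(0.05)       # about 5.991
p_value = chi2_df2_p_value(statistic)

print(round(critical, 3))
print(p_value)
print("reject H0" if statistic > critical else "do not reject H0")
```

The two decision rules agree: the statistic exceeds the critical value exactly when the p-value is below $\alpha$.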