BM2014 09-11 Handout Notes

One-Way Analysis of Variance (ANOVA)

  • Used to determine if there are statistically significant differences between the means of three or more independent groups.

  • Compares the means between groups of interest.

  • Tests the null hypothesis: H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k

  • Alternative hypothesis: H_1: At least one of the means is different from the others.

    • \mu: Group mean.

    • k: Number of groups.

  • Null hypothesis states that the population means for all treatment levels are equal.

  • If even one population mean is different, the null hypothesis is rejected.

  • ANOVA is computed using three sums of squares: total, treatment (columns), and error.

    • SS: 'sum of squares'.

    • MS: 'mean square'.

    • SSC: 'sum of squares of columns' (sum of squares between), measures variation between columns or treatments.

    • SSE: 'sum of squares of error' (sum of squares within), measures variation within treatments, unexplained by the treatment.

    • SST: 'total sum of squares', measures all variation in the dependent variable; contains both SSC and SSE.

    • MSC, MSE, and MST are the mean squares of columns, error, and total, respectively.

    • Mean square is an average, computed by dividing a sum of squares by its degrees of freedom.

    • F is a ratio of two variances.

    • F value (ANOVA) is the ratio of the treatment variance (MSC) to the error variance (MSE).

  • Formulas for computing a one-way ANOVA:

    • SSC = \sum_{j=1}^{C} n_j(\bar{x}_j - \bar{x})^2

      • df_C = C - 1

      • MSC = \frac{SSC}{df_C}

    • SSE = \sum_{j=1}^{C} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2

      • df_E = N - C

      • MSE = \frac{SSE}{df_E}

    • SST = \sum_{j=1}^{C} \sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2

      • df_T = N - 1

    • F = \frac{MSC}{MSE}

      • i: A particular member of a treatment level.

      • j: A treatment level.

      • C: The number of treatment levels.

      • n_j: The number of observations in a given treatment level.

      • \bar{x}: The grand mean.

      • \bar{x}_j: The column mean.

      • x_{ij}: The individual value.
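The formulas above can be sketched directly in plain Python. This is a minimal illustration, not the handout's example; the three small groups below are hypothetical data chosen only to exercise the arithmetic.

```python
# One-way ANOVA sums of squares, following the handout's formulas.
# The groups below are hypothetical illustration data.
groups = [[2.0, 3.0, 4.0], [5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]

N = sum(len(g) for g in groups)          # total number of observations
C = len(groups)                          # number of treatment levels
grand_mean = sum(x for g in groups for x in g) / N

# SSC: variation between column (treatment) means
SSC = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# SSE: variation within treatments, unexplained by the treatment
SSE = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
# SST: total variation; note SST = SSC + SSE
SST = sum((x - grand_mean) ** 2 for g in groups for x in g)

MSC = SSC / (C - 1)                      # mean square, columns
MSE = SSE / (N - C)                      # mean square, error
F = MSC / MSE                            # ratio of the two variances
print(round(SSC, 2), round(SSE, 2), round(SST, 2), round(F, 2))
```

The printed values show the identity SST = SSC + SSE, which is a useful arithmetic check when computing a table by hand.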

Reading the F Distribution Table

  • Associated with every F value are two unique degrees of freedom (df) values: degrees of freedom in the numerator (df_C) and degrees of freedom in the denominator (df_E).

    • df_C values are the treatment (column) degrees of freedom, C - 1.

    • df_E values are the error degrees of freedom, N - C.

  • Analysis of variance tests are always one-tailed tests with the rejection region in the upper tail.

  • The decision rule rejects the null hypothesis if the observed F value is greater than the critical F value.
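The table lookup can also be reproduced in code. A minimal sketch, assuming scipy is available: `f.ppf` returns the value that cuts off the upper \alpha tail of the F distribution with (df_C, df_E) degrees of freedom.

```python
# Critical F value lookup, replacing the printed F table.
from scipy.stats import f

alpha, df_C, df_E = 0.01, 2, 12
f_crit = f.ppf(1 - alpha, df_C, df_E)   # upper-tail critical value
print(round(f_crit, 2))                  # about 6.93, matching the table
```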

Example

  • A company has three manufacturing plants, and company officials want to determine whether there is a difference in workers’ average age at the three locations.

  • The data includes the ages of five randomly selected workers at each plant.

    • Use \alpha = 0.01.

Solution

  • STEP 1: Hypothesize:

    • H_0: \mu_1 = \mu_2 = \mu_3

    • H_1: At least one of the means is different from the others

  • STEP 2: Compute the degrees of freedom and find the corresponding critical value in the F table.

    • df_C = C - 1 = 3 - 1 = 2

    • df_E = N - C = 15 - 3 = 12

    • Therefore, critical F value is: F_{0.01, 2, 12} = 6.93

    • Decision rule: reject the null hypothesis if the observed value of F is greater than 6.93.

  • STEP 3: Compute the following:

    • SSC = \sum_{j=1}^{C} n_j(\bar{x}_j - \bar{x})^2 = [5(28.2 - 28.33)^2 + 5(32.0 - 28.33)^2 + 5(24.8 - 28.33)^2] = 129.73

    • SSE = \sum_{j=1}^{C} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 = [(29 - 28.2)^2 + (27 - 28.2)^2 + \cdots + (26 - 24.8)^2] = 19.60

    • SST = \sum_{j=1}^{C} \sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2 = [(29 - 28.33)^2 + (27 - 28.33)^2 + \cdots + (25 - 28.33)^2] = 149.33

    • df_T = N - 1 = 15 - 1 = 14

    • MSC = \frac{SSC}{df_C} = \frac{129.73}{2} = 64.865

    • MSE = \frac{SSE}{df_E} = \frac{19.60}{12} = 1.633

    • F = \frac{MSC}{MSE} = \frac{64.865}{1.633} \approx 39.71

  • STEP 4: Prepare the output table:

    • Source of variance | SS | df | MS | F

      • Between groups | 129.73 | 2 | 64.87 | 39.71

      • Within groups | 19.60 | 12 | 1.63 |

      • Total | 149.33 | 14 |

  • STEP 5: The decision is to reject the null hypothesis

    • Observed value = 39.71 > critical table F value of 6.93.

    • There is a significant difference between the mean ages of workers at the three plants.

    • This difference could have hiring implications; motivation, discipline, and experience may differ with age, calling for different managerial approaches at each plant.
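The whole worked example can be checked with scipy's one-way ANOVA. The raw ages are not fully reproduced in the handout, so the lists below are illustrative values chosen to match the reported group means (28.2, 32.0, 24.8) and n = 5 per plant.

```python
# One-way ANOVA check of the three-plant example with scipy.
# The ages below are hypothetical, chosen to match the reported means.
from scipy.stats import f_oneway

plant1 = [29, 27, 30, 27, 28]   # mean 28.2
plant2 = [32, 33, 31, 34, 30]   # mean 32.0
plant3 = [25, 24, 24, 25, 26]   # mean 24.8

F, p = f_oneway(plant1, plant2, plant3)
print(round(F, 2))               # about 39.71, matching the output table
print(p < 0.01)                  # True: reject H0 at alpha = 0.01
```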

Multiple Comparison Tests

  • When an ANOVA yields an overall significant difference between treatment means, we often need to know which treatment means are responsible for the difference.

  • A posteriori (post hoc) pairwise comparisons are made after the experiment, when the researcher decides to test for significant differences between the sample means based on a significant overall F value.

  • The two multiple comparison tests are Tukey’s HSD test for designs with equal sample sizes and the Tukey–Kramer procedure for situations in which sample sizes are unequal.

Tukey’s Honestly Significant Difference (HSD) Test

  • The case of equal sample sizes.

  • Tukey’s HSD test considers the number of treatment levels, the value of mean square error, and the sample size.

  • Using these values as well as a table value, q, the HSD test determines the critical difference necessary between the means of any two treatment levels for the means to be significantly different.

  • Formula:

    • HSD = q_{\alpha, C, N-C} \sqrt{\frac{MSE}{n}}

      • C: The number of treatment levels.

      • N: The total number of observations.

      • MSE: The mean square error.

      • n: The sample size.

      • q_{\alpha, C, N-C}: The critical value of the studentized range distribution.

  • Using the previous example, Tukey’s HSD test can be used for multiple comparisons between plants 1 and 2, 2 and 3, and 1 and 3.

    • C = 3

    • df_E = N - C = 12

    • q_{0.01, 3, 12} = 5.04

    • HSD = 5.04 \sqrt{\frac{1.63}{5}} = 2.88

  • Any of the pairs of means that differ by more than 2.88 are significantly different at \alpha = 0.01

    • |\bar{x}_1 - \bar{x}_2| = |28.2 - 32.0| = 3.8

    • |\bar{x}_1 - \bar{x}_3| = |28.2 - 24.8| = 3.4

    • |\bar{x}_2 - \bar{x}_3| = |32.0 - 24.8| = 7.2

  • All three comparisons are greater than the value of HSD, which is 2.88.

    • The mean ages between workers at any and all pairs of plants are significantly different.
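The q value and HSD above can be computed in code rather than read from a table. A sketch assuming scipy >= 1.7, which provides the studentized range distribution:

```python
# Tukey's HSD critical difference for the three-plant example.
from math import sqrt
from scipy.stats import studentized_range

alpha, C, df_E = 0.01, 3, 12
MSE, n = 1.63, 5                                 # values from the ANOVA table

q = studentized_range.ppf(1 - alpha, C, df_E)    # about 5.04
HSD = q * sqrt(MSE / n)
print(round(HSD, 2))                             # about 2.88
```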

Tukey-Kramer Procedure

  • The case of unequal sample sizes.

  • Tukey’s HSD was modified to handle situations in which the sample sizes are unequal.

  • Formula:

    • Tukey\text{-}Kramer = q_{\alpha, C, N-C} \sqrt{\frac{MSE}{2} \left(\frac{1}{n_r} + \frac{1}{n_s}\right)}

      • C: The number of treatment levels.

      • N: The total number of observations.

      • MSE: The mean square error.

      • n_r: The sample size for the rth sample.

      • n_s: The sample size for the sth sample.

      • q_{\alpha, C, N-C}: The critical value of the studentized range distribution.
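A sketch of the Tukey-Kramer critical difference for one pair of treatment levels. The handout gives no unequal-sample example, so all numbers below (MSE, N, n_r, n_s) are hypothetical.

```python
# Tukey-Kramer critical difference for unequal sample sizes.
# All inputs below are hypothetical illustration values.
from math import sqrt
from scipy.stats import studentized_range

alpha, C, N = 0.05, 3, 18
MSE = 2.0
n_r, n_s = 5, 7                                  # unequal group sizes

q = studentized_range.ppf(1 - alpha, C, N - C)
critical_diff = q * sqrt((MSE / 2) * (1 / n_r + 1 / n_s))
print(round(critical_diff, 2))
```

Two group means \bar{x}_r and \bar{x}_s would be declared significantly different when |\bar{x}_r - \bar{x}_s| exceeds this critical difference.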

Chi-Square Tests

Chi-Square Goodness-of-Fit Test

  • Used to analyze the probabilities of multinomial distribution trials along a single dimension.

  • Formula:

    • \chi^2 = \sum \frac{(f_0 - f_c)^2}{f_c}

      • df = k - 1

        • f_0: Frequency of observed values.

        • f_c: Frequency of expected values.

        • k: Number of categories.

        • df: Degrees of freedom.

Steps
  • STEP 1: Hypothesize.

    • H_0: The observed distribution is the same as the expected distribution.

    • H_1: The observed distribution is not the same as the expected distribution.

  • STEP 2: The statistical test being used is \chi^2 = \sum \frac{(f_0 - f_c)^2}{f_c}.

  • STEP 3: Let \alpha = 0.05

  • STEP 4: Determine the critical chi-square value.

    • df = k - 1 = 4 - 1 = 3

    • \chi^2_{0.05, 3} = 7.815

    • After the data are analyzed, the null hypothesis (H_0) is rejected if the observed chi-square exceeds 7.815.

  • STEP 5: The observed values gathered in the sample data below sum to 207. Thus, n = 207.

    • The expected proportions are given, but the expected frequencies must be calculated by multiplying each expected proportion by the sample total of the observed frequencies.

  • STEP 6: Compute using the formula.

  • STEP 7: Analyze results.
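The steps above can be sketched with scipy. The handout's data table is not reproduced here, so the observed counts and expected proportions below are hypothetical; only n = 207 and the four categories are taken from the text.

```python
# Chi-square goodness-of-fit test, following Steps 4-6.
# Observed counts and expected proportions are hypothetical.
from scipy.stats import chisquare, chi2

observed = [53, 37, 32, 85]                      # hypothetical, sums to 207
proportions = [0.25, 0.20, 0.15, 0.40]           # hypothetical expected shares
expected = [p * sum(observed) for p in proportions]

stat, pval = chisquare(observed, f_exp=expected)
crit = chi2.ppf(0.95, df=len(observed) - 1)      # 7.815, as in Step 4
print(round(crit, 3), stat > crit)               # reject H0 only if True
```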

Test of Independence

  • Examines the frequencies for two variables at different levels to determine whether one variable is independent of the other or if the variables are related.

  • Formulas:

    • Expected frequencies

      • e_{ij} = \frac{(row \ i \ total)(column \ j \ total)}{n}

        • e_{ij}: The expected frequency in row i, column j

        • n: The total of all the frequencies (i.e., the grand total of the contingency table).

    • Chi-square test of independence

      • \chi^2 = \sum \sum \frac{(f_0 - f_e)^2}{f_e}

        • df = (r - 1)(c - 1)

          • r: The number of rows.

          • c: The number of columns.

Steps
  • STEP 1: Hypothesize.

    • H_0: Gender is independent of age.

    • H_1: Gender is not independent of age.

  • STEP 2: Decide on the type of test; chi-square test of independence is appropriate.

  • STEP 3: Decide on the level of significance and determine the critical value(s) and region(s).

    • Let \alpha = 0.05.

    • Here, there are three rows (r = 3) and two columns (c = 2).

    • The degrees of freedom are (3 - 1)(2 - 1) = 2.

    • The critical value of chi-square for \alpha = 0.05 is 5.9915.

  • STEP 4: Decision

    • If the chi-square value calculated from the data is greater than 5.9915, we reject the null hypothesis and conclude that gender is not independent of age.

  • STEP 5: Select a random sample and do relevant calculations.

    • The process of computing the test statistic begins by calculating the expected frequencies using the formula for e_{ij} given previously.
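scipy computes the expected frequencies and the test statistic in one call. The 3x2 table of age group by gender below is hypothetical illustration data, since the handout's sample is not reproduced here.

```python
# Chi-square test of independence on a hypothetical 3x2 table.
from scipy.stats import chi2_contingency

#             male  female
table = [[60, 50],      # age group 1
         [40, 45],      # age group 2
         [30, 25]]      # age group 3

stat, pval, df, expected = chi2_contingency(table)
print(df)                 # 2 = (3 - 1)(2 - 1), matching Step 3
print(stat > 5.9915)      # reject H0 only if this is True
```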

Regression Analysis

Correlation

  • Correlation research involves using statistical procedures to analyze the relationship between two variables.

  • After doing so, we would end up with a number called the correlation coefficient, which would range from −1.00 to +1.00.

  • The strength of a relationship is indicated by the absolute numerical value of the correlation coefficient.

  • The stronger the correlation, the closer the numerical value will be to 1.00, regardless of the sign.

  • A perfect correlation is indicated by a plus or minus 1 (\pm 1.00), which is the strongest relationship possible.

  • Close to a 0 correlation is also possible, meaning that virtually no relationship exists between two variables.

Pearson Product Moment Correlation
  • Used for examining linear relationships between variables measured on interval or ratio scales.

  • Formula:

    • r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{[N \sum X^2 - (\sum X)^2][N \sum Y^2 - (\sum Y)^2]}}

  • Steps:

    1. Create columns for X, X^2, Y, Y^2, and XY.

    2. Square each X value and place the product in the X^2 column. Do the same for the Y values.

    3. Multiply each X value by its corresponding Y value to obtain XY.

    4. Sum the X, Y, X^2, Y^2, and XY columns.

    5. Plug the values into the formula.
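The five-step column recipe above can be worked in code and checked against scipy's built-in correlation. The X and Y values below are hypothetical illustration data.

```python
# Pearson r from the column sums, checked against scipy.stats.pearsonr.
from math import sqrt
from scipy.stats import pearsonr

X = [1, 2, 3, 4, 5]              # hypothetical data
Y = [2, 4, 5, 4, 5]
N = len(X)

sum_x, sum_y = sum(X), sum(Y)                    # step 4: column sums
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))

# Step 5: plug the sums into the formula.
r = (N * sum_xy - sum_x * sum_y) / sqrt(
    (N * sum_x2 - sum_x ** 2) * (N * sum_y2 - sum_y ** 2))
r_scipy, _ = pearsonr(X, Y)
print(round(r, 4), round(r_scipy, 4))   # the two values agree
```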

Interpretation of Results
  • Coefficient Meaning

    • +1.00 A perfect positive relationship

    • +0.80 A fairly strong positive relationship

    • +0.60 A moderate positive relationship

    • 0.00 No relationship

    • -0.60 A moderate negative relationship

    • -0.80 A fairly strong negative relationship

    • -1.00 A perfect negative relationship

Interpreting the Pearson Correlation
  1. Cause and effect.

    • That a correlation exists between variables does not necessarily mean that one causes the other.

    • Sometimes relationships exist because of the effect of other outside influences.

    • Causal relationships can be demonstrated only through experimentation.

  2. Restricted Range.

    • Caution should also be exercised when interpreting correlations from a restricted range versus a full range of scores.

    • The problem of restricting the range occurs when a sample is used that limits the scores on one or both variables more so than would be found in the population.

Uses of the Pearson Correlation
  1. Validity.

    • Correlations are sometimes used to measure the validity of newly developed assessment instruments.

    • Validity is the degree to which a test measures what it is supposed to measure.

  2. Reliability.

    • Correlation is also used to assess the reliability of testing instruments.

    • Reliability refers to test scores’ consistency or stability upon repeated tests (or on alternate forms of the test).

  3. Prediction.

    • One of the main uses of the Pearson correlation is prediction.

    • If there is a strong relationship between the two variables, then knowing the score on one variable will enable us to predict the other variable’s score.

Regression Analysis

  • The process of constructing a mathematical model or function that can be used to predict or determine one variable by another variable or other variables.

  • The most elementary regression model is simple regression, or bivariate regression, in which one variable is predicted by a single other variable.

  • In simple regression, the variable to be predicted is called the dependent variable and is designated as y.

  • The predictor is called the independent variable, or explanatory variable, and is designated as x.

  • In simple regression analysis, only a straight-line relationship between two variables is examined.

  • \hat{y} = b_0 + b_1 x

    • \hat{y}: The predicted value of y.

    • b_0: The sample intercept.

    • b_1: The sample slope.

  • To determine the regression line equation for a sample of data, the researcher must determine the values for b_0 and b_1.

  • This process is sometimes referred to as least-squares analysis.

Least-Squares Analysis
  • When a regression model is developed by producing the minimum sum of the squared error values.

  • Formula for the slope of the regression line:

    • b_1 = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}

  • Formula for the y-intercept of the regression line:

    • b_0 = \bar{y} - b_1\bar{x} = \frac{\sum y}{n} - b_1\left(\frac{\sum x}{n}\right)