Exam Preparation Notes
Mann-Whitney U Test
- The Mann-Whitney U test is a non-parametric test used to compare two independent groups.
- It is the non-parametric equivalent of the unpaired t-test.
- The procedure involves ranking the observations.
- If a value appears more than once (ties), each tied observation is assigned the mean of the ranks those observations would otherwise occupy.
Oil Price Analysis
Objective: Determine whether the oil price differs between January and June, assuming the January and June price distributions have the same shape.
Significance Level: 5%.
Assumption: Samples are not normally distributed.
Hypotheses:
- H_0: The medians of the two groups (January and June oil prices) are equal.
- H_1: The medians of the two groups are not equal.
Data and Ranks:
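| Oil Price | Month | Actual Rank |
|---|---|---|
| 66.7 | Jan | 1 |
| 68 | Jan | 2 |
| 68.9 | Jan | 3 |
| 69.5 | Jan | 4 |
| 70.3 | Jan | 5 |
| 70.9 | June | 6 |
| 71 | June | 7 |
| 72 | June | 8 |
| 72.1 | Jan | 9 |
| 72.5 | June | 10 |
| 73.1 | June | 11 |
| 75.5 | June | 12 |
Calculations: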
R_1 (Sum of ranks for January) = 1 + 2 + 3 + 4 + 5 + 9 = 24
R_2 (Sum of ranks for June) = 6 + 7 + 8 + 10 + 11 + 12 = 54
n_1 = 6 (Sample size for January)
n_2 = 6 (Sample size for June)
u_1 = R_1 - \frac{n_1(n_1 + 1)}{2} = 24 - \frac{6 \times 7}{2} = 24 - 21 = 3
u_2 = R_2 - \frac{n_2(n_2 + 1)}{2} = 54 - \frac{6 \times 7}{2} = 54 - 21 = 33
u = \min(u_1, u_2) = \min(3, 33) = 3
Critical Value:
- Using tables, the critical value for n_1 = 6, n_2 = 6, and \alpha = 0.05 (two-tailed test) is U_{critical} = 5.
Decision:
- Since u = 3 < U_{critical} = 5, we reject H_0.
Conclusion:
- There is evidence to suggest that the median oil price is different in January and June.
- Median (Jan) = 69.2
- Median (June) = 72.25
- It appears that the median oil price is higher in June than in January.
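The hand calculation above can be cross-checked in R. This is a minimal sketch rather than part of the original notes: the vector names `jan` and `june` are chosen here for illustration, and base R's `wilcox.test` is used as the Mann-Whitney (Wilcoxon rank-sum) implementation, whose reported statistic W corresponds to u_1 for the first sample.

```r
# Oil prices from the table above (variable names are illustrative)
jan  <- c(66.7, 68, 68.9, 69.5, 70.3, 72.1)
june <- c(70.9, 71, 72, 72.5, 73.1, 75.5)

# Rank the pooled sample; rank() assigns mean ranks to ties by default
ranks <- rank(c(jan, june))
R1 <- sum(ranks[seq_along(jan)])    # rank sum for January
R2 <- sum(ranks[-seq_along(jan)])   # rank sum for June

n1 <- length(jan)
n2 <- length(june)
u1 <- R1 - n1 * (n1 + 1) / 2
u2 <- R2 - n2 * (n2 + 1) / 2
u  <- min(u1, u2)

# Built-in two-sided test; W is the U statistic for the first sample (jan)
mw <- wilcox.test(jan, june, alternative = "two.sided")

print(c(R1 = R1, R2 = R2, u1 = u1, u2 = u2, u = u))
print(mw)
```

A p-value below 0.05 from `wilcox.test` agrees with the table-based decision to reject H_0.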
Chi-squared Random Variables
- If Y_1, Y_2, and Y_3 are independent \chi^2 random variables with 3, 7, and 8 degrees of freedom respectively, then the distribution of \sum Y_i is:
- \sum Y_i \sim \chi^2_{3+7+8} = \chi^2_{18}
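The additivity of independent chi-squared variables can be checked informally by simulation. The sketch below is illustrative only: the seed and the number of simulations are arbitrary choices, and agreement is approximate because it relies on random draws.

```r
set.seed(1)        # arbitrary seed so the run is reproducible
n_sim <- 100000    # arbitrary number of simulated sums

# Sum of independent chi-squared draws with 3, 7 and 8 degrees of freedom
total <- rchisq(n_sim, df = 3) + rchisq(n_sim, df = 7) + rchisq(n_sim, df = 8)

sim_mean <- mean(total)                      # should be close to 18
sim_var  <- var(total)                       # should be close to 2 * 18 = 36
sim_q95  <- unname(quantile(total, 0.95))    # simulated 95th percentile
theo_q95 <- qchisq(0.95, df = 18)            # chi-squared(18) 95th percentile

print(c(mean = sim_mean, var = sim_var, q95_sim = sim_q95, q95_theory = theo_q95))
```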
Finding Values from Chi-squared Tables
Find the values of \chi^2_{0.05, 16} and \chi^2_{0.025, 9} using tables.
- \chi^2_{0.05, 16} = 26.296
- \chi^2_{0.025, 9} = 19.023
R code to calculate these values:
qchisq(0.05, 16, lower.tail = FALSE)  # upper-tail 5% point, 16 df
qchisq(0.025, 9, lower.tail = FALSE)  # upper-tail 2.5% point, 9 df
Expected value and Standard Deviation for Chi-squared Distribution
For X \sim \chi^2_v:
E(X) = v
Var(X) = 2v
SD(X) = \sqrt{2v}
If X \sim \chi^2_7:
- E(X) = 7
- Var(X) = 2 \times 7 = 14
- SD(X) = \sqrt{14}
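As a numerical sanity check of E(X) = v and Var(X) = 2v for v = 7, the moments can be approximated by integrating against the chi-squared density. This is a sketch only: it uses base R's `integrate` and `dchisq`, and the results agree with the formulas only up to numerical integration error.

```r
v <- 7  # degrees of freedom from the example above

# E(X): integrate x * f(x) over the support (0, Inf)
mean_x <- integrate(function(x) x * dchisq(x, df = v),
                    lower = 0, upper = Inf)$value

# Var(X): integrate (x - v)^2 * f(x), using E(X) = v
var_x <- integrate(function(x) (x - v)^2 * dchisq(x, df = v),
                   lower = 0, upper = Inf)$value

sd_x <- sqrt(var_x)  # compare with sqrt(14)

print(c(mean = mean_x, var = var_x, sd = sd_x, sqrt_2v = sqrt(2 * v)))
```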
Association Between Inoculation and Flu Infection
Data:
| | Infected | Not Infected | Totals |
|---|---|---|---|
| Inoculated | 5 | 241 | 246 |
| Not Inoculated | 90 | 292 | 382 |
| Totals | 95 | 533 | 628 |
Assumptions:
- Independence: Each person contributes to exactly one cell in the table, since a person cannot be both inoculated and not inoculated, nor both infected and not infected.
Expected Frequencies:
- E_{11} = \frac{246 \times 95}{628} = 37.2
- E_{12} = \frac{246 \times 533}{628} = 208.8
- E_{21} = \frac{382 \times 95}{628} = 57.8
- E_{22} = \frac{382 \times 533}{628} = 324.2
- No expected frequency is less than 5.
Yates' Continuity Correction:
- \chi^2_{corrected} = \sum \frac{(|O_i - E_i| - 0.5)^2}{E_i}
- \chi^2_{corrected} = \frac{(|5 - 37.2| - 0.5)^2}{37.2} + \frac{(|241 - 208.8| - 0.5)^2}{208.8} + \frac{(|90 - 57.8| - 0.5)^2}{57.8} + \frac{(|292 - 324.2| - 0.5)^2}{324.2} = 52.3
Degrees of Freedom:
- (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1
Critical Value:
- \chi^2_{0.05, 1} = 3.841
Decision:
- Since \chi^2_{corrected} = 52.3 > \chi^2_{0.05, 1} = 3.841, we reject H_0.
Conclusion:
- There is evidence to suggest that there is an association between inoculation and flu infection.
Effect Size (Phi Coefficient):
- \phi = \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{52.3}{628}} = 0.29
- Medium effect (close to the conventional benchmark of 0.3).
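The whole calculation can be reproduced in R as a check. This is a sketch, not part of the original notes: the object name `flu` and the dimension labels are chosen here for illustration, while `chisq.test` with `correct = TRUE` applies Yates' continuity correction to a 2 x 2 table.

```r
# Observed counts, filled column by column (Infected, then Not Infected)
flu <- matrix(c(5, 90, 241, 292), nrow = 2,
              dimnames = list(Inoculation = c("Inoculated", "Not Inoculated"),
                              Flu = c("Infected", "Not Infected")))

res <- chisq.test(flu, correct = TRUE)   # Yates-corrected test of association

expected <- res$expected                 # expected frequencies E_ij
chi_sq   <- unname(res$statistic)        # corrected chi-squared statistic
crit     <- qchisq(0.05, df = 1, lower.tail = FALSE)  # 5% critical value
phi      <- sqrt(chi_sq / sum(flu))      # phi coefficient

print(round(expected, 1))
print(c(chi_sq = chi_sq, critical = crit, phi = phi))
```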
Other Chi-squared Test Considerations
If the expected frequencies assumption for using the \chi^2 test of association fails, Fisher's exact test may be used.
R Output Analysis:
Pearson's Chi-squared test
data: Personality Matrix
X-squared = 71.2, df = 3, p-value = 2.362e-15
- Based on this output, since the p-value (2.362e-15) is less than 0.05, there is evidence of an association between Personality and Colour Preference.
Post Hoc Method:
- If there were evidence of an association, the standardized residuals method could be used as a post-hoc analysis.
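A minimal sketch of the standardized residuals method follows. The Personality data are not reproduced in these notes, so the example reuses the inoculation counts from the earlier section purely for illustration; the object name `flu` is an assumption. `chisq.test` returns the standardized residuals in its `stdres` component.

```r
# Inoculation table from the earlier section, used only as an illustration
flu <- matrix(c(5, 90, 241, 292), nrow = 2,
              dimnames = list(Inoculation = c("Inoculated", "Not Inoculated"),
                              Flu = c("Infected", "Not Infected")))

res <- chisq.test(flu)
std_res <- res$stdres   # standardized residual for each cell

# Cells with |residual| > 1.96 differ from their expected count at the 5% level
flagged <- abs(std_res) > 1.96

print(round(std_res, 2))
print(flagged)
```

A negative residual marks a cell with fewer observations than expected under independence, and a positive residual marks a cell with more.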
Ordinal Data:
The linear-by-linear association procedure is used when we have ordinal data in contingency tables.
Generalized Linear Models (GLMs)
- Family of Distributions:
- We consider the exponential family of distributions for generalized linear models.
- Linearity:
- With respect to generalized linear models, "linear" means that the linear predictor must be linear with respect to the coefficients, i.e., \beta_i.
- Canonical Link Function for Gamma Regression:
- The canonical link function for gamma regression is the reciprocal (inverse) function. It is often not used in practice because the mean we want to predict must be positive, whereas the linear predictor can take any value in \mathbb{R}, so the reciprocal can return negative, invalid predictions.
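The problem with the reciprocal link can be seen directly from R's family objects. In this sketch the values in `eta` are made up, and the comparison with the log link is an addition here rather than something taken from the notes.

```r
canonical <- Gamma()   # the default link for Gamma() is "inverse"
print(canonical$link)

eta <- c(-2, -0.5, 0.5, 2)   # illustrative linear predictor values

# Inverse link: mu = 1 / eta, which is negative whenever eta < 0
mu_inverse <- canonical$linkinv(eta)

# Log link: mu = exp(eta), which is positive for every real eta
mu_log <- Gamma(link = "log")$linkinv(eta)

print(rbind(eta = eta, inverse = mu_inverse, log = mu_log))
```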
Poisson Distribution
Typical Use:
- The Poisson distribution is typically used in situations where we count events, e.g., counting the number of calls received by a call center in a given time period.
Probability Calculation:
- The number of errors on a given page of a Mathematics book is said to follow a Poisson distribution with parameter 0.5. Find the probability that there are at least two errors on a given page.
- Let X denote the number of errors on a given page, i.e., X \sim Poisson(0.5).
- P(X \geq 2) = 1 - P(X = 1) - P(X = 0)
- P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}
- P(X=1) = \frac{e^{-0.5} 0.5^1}{1!} = \frac{0.6065 \times 0.5}{1} = 0.303
- P(X=0) = \frac{e^{-0.5} 0.5^0}{0!} = \frac{0.6065 \times 1}{1} = 0.607
- P(X \geq 2) = 1 - 0.303 - 0.607 = 0.09
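The same probability can be computed in R with the built-in Poisson functions. This is a short sketch; the variable names are chosen here for illustration.

```r
lambda <- 0.5   # Poisson parameter from the question

p0 <- dpois(0, lambda)   # P(X = 0)
p1 <- dpois(1, lambda)   # P(X = 1)
p_at_least_2 <- 1 - p0 - p1

# Equivalent one-step calculation: P(X > 1) is the upper tail beyond 1
p_check <- ppois(1, lambda, lower.tail = FALSE)

print(round(c(p0 = p0, p1 = p1, p_at_least_2 = p_at_least_2), 4))
```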
Linear Predictors in GLMs:
- Which of the following are considered to be a linear predictor in the sense of GLMs?
- (a) \beta_0 + \beta_1 x_1 : Yes (linear in \beta_0 and \beta_1)
- (b) \beta_0 + \beta_2 x_2 + \beta_3 x_2^2 : Yes
- (c) \beta_0 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2 x_3 : Yes
- (d) \beta_0 + ln(\beta_1) x_1 : No (not linear in \beta_1)
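One way to see why squared and interaction terms still count as linear is that the predictor can be written as a design matrix multiplied by the coefficient vector. The sketch below uses made-up covariate values and hypothetical coefficients; only `model.matrix` and matrix multiplication from base R are involved.

```r
x2 <- c(1, 2, 3, 4)   # made-up covariate values
x3 <- c(0, 1, 0, 1)

# Option (b): intercept, x2 and x2^2 become three fixed columns
X_b <- model.matrix(~ x2 + I(x2^2))

# Option (c): intercept, x2, x3 and the product x2 * x3 become four columns
X_c <- model.matrix(~ x2 + x3 + x2:x3)

# Hypothetical coefficients for (b); the predictor is the product X %*% beta
beta_b <- c(0.5, 1, -0.2)
eta_b <- as.vector(X_b %*% beta_b)

print(X_b)
print(X_c)
print(eta_b)
```

No such fixed design matrix exists for option (d), because the multiplier of x_1 is ln(\beta_1), a non-linear function of the coefficient.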
Log Link Function for Count Data:
- We use the log link function for count data because it guarantees that the predicted mean count is positive.
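A small simulated example illustrates the point. Everything in this sketch is an assumption made for illustration: the seed, the covariate values, and the coefficients 0.3 and 0.8 used to generate the counts. It compares a Poisson regression with log link against an ordinary linear model when predicting outside the observed range.

```r
set.seed(42)   # arbitrary seed for reproducibility

# Simulated count data (hypothetical generating model)
x <- seq(-2, 2, length.out = 50)
y <- rpois(50, lambda = exp(0.3 + 0.8 * x))

fit_log <- glm(y ~ x, family = poisson(link = "log"))   # Poisson GLM
fit_lm  <- lm(y ~ x)                                     # straight-line fit

new_data <- data.frame(x = -4)   # a point below the observed range

# The GLM prediction is exp(linear predictor), so it cannot be negative
pred_log <- predict(fit_log, newdata = new_data, type = "response")

# The straight-line prediction has no such constraint and can go below zero
pred_lm <- predict(fit_lm, newdata = new_data)

print(c(poisson_log = unname(pred_log), linear = unname(pred_lm)))
```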
R Code for Poisson Regression and Wald Test:
R code for performing a Poisson regression on the Surgery data where the dependent variable is Surgery Visits and the independent variable is Location:
# Fit a Poisson regression of SurgeryVisits on Location, then show the summary
SurgeryPoissonReg <- glm(SurgeryVisits ~ Location, data = Surgery, family = "poisson")
summary(SurgeryPoissonReg)
R code for performing a Wald test in this scenario:
# waldtest() comes from the lmtest package
library(lmtest)
waldtest(SurgeryPoissonReg, test = "Chisq")
Logistic Regression
Distribution of Dependent Variable:
- In logistic regression, the dependent variable must follow a binomial distribution.
Linearity:
- Linearity in this scenario means that each continuous independent variable must be linearly related to the logit of the dependent variable.
Methods of Evaluating a Logistic Regression Model:
- Any two of the following:
- Log-likelihood
- Deviance
- Pseudo-R^2 statistics (Hosmer and Lemeshow, Cox and Snell, Nagelkerke)
- Wald Statistic
- Odds ratio
Odds and Logit Calculation:
- The probability p of a machine producing a non-defective item is 0.85. Calculate the odds of the machine producing a non-defective item. Calculate logit(p).
odds = \frac{p}{1 - p} = \frac{0.85}{0.15} = 5.67
logit(p) = ln(odds) = ln(5.67) = 1.73
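The odds and logit can be verified in R. In this brief sketch the variable names are illustrative; `qlogis` is base R's logit function and serves as a cross-check of the manual calculation.

```r
p <- 0.85   # probability of a non-defective item

odds    <- p / (1 - p)   # odds of a non-defective item
logit_p <- log(odds)     # natural log of the odds

logit_check <- qlogis(p)   # built-in logit, should match logit_p

print(round(c(odds = odds, logit = logit_p), 2))
```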
Logit Equation and Probability Calculation:
- Another company claims it can predict the probability p' of its machine producing a non-defective item by the following equation:
logit(p') = ln(\frac{p'}{1-p'}) = 3 + 2x_1
where x_1 is a variable that measures maintenance. Give the equation for p', i.e., make p' the subject of the above formula.
Let y = logit(p') = ln(\frac{p'}{1-p'})
e^y = \frac{p'}{1-p'} = e^{3 + 2x_1}
p' = (1 - p')e^y
p' = e^y - p'e^y
p' + p'e^y = e^y
p'(1 + e^y) = e^y
p' = \frac{e^y}{1 + e^y} = \frac{e^{3 + 2x_1}}{1 + e^{3 + 2x_1}}
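The rearranged formula can be written as a small R function and compared with base R's inverse logit, `plogis`. This is a sketch: the function name `predict_p` and the maintenance values in `x1` are hypothetical.

```r
# Hypothetical helper implementing p' = e^y / (1 + e^y) with y = 3 + 2 * x1
predict_p <- function(x1) {
  y <- 3 + 2 * x1
  exp(y) / (1 + exp(y))
}

x1 <- c(-3, -1.5, 0, 1)   # made-up maintenance values

p_hat   <- predict_p(x1)
p_check <- plogis(3 + 2 * x1)   # built-in inverse logit for comparison

print(round(p_hat, 4))
```

At x_1 = -1.5 the logit is 0, so the predicted probability is exactly 0.5, and every prediction lies strictly between 0 and 1.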