Exam Preparation Notes

Mann-Whitney U Test

  • The Mann-Whitney U test is a non-parametric test used to compare two independent groups.
  • It is the non-parametric equivalent of the unpaired t-test.
  • The procedure involves ranking the observations.
  • If a value appears more than once (tied observations), each tied observation is assigned the mean of the ranks that those observations would otherwise occupy.

Oil Price Analysis

  • Objective: Determine if the oil price is different in January and June, assuming the distributions of the oil prices in January and June samples are of the same shape.

  • Significance Level: 5%.

  • Assumption: Samples are not normally distributed.

  • Hypotheses:

    • H_0: The medians of the two groups (January and June oil prices) are equal.
    • H_1: The medians of the two groups are not equal.
  • Data and Ranks:

    Oil Price   Month   Actual Rank
    66.7        Jan     1
    68          Jan     2
    68.9        Jan     3
    69.5        Jan     4
    70.3        Jan     5
    70.9        June    6
    71          June    7
    72          June    8
    72.1        Jan     9
    72.5        June    10
    73.1        June    11
    75.5        June    12
  • Calculations:

    • R_1 (Sum of ranks for January) = 1 + 2 + 3 + 4 + 5 + 9 = 24

    • R_2 (Sum of ranks for June) = 6 + 7 + 8 + 10 + 11 + 12 = 54

    • n_1 = 6 (Sample size for January)

    • n_2 = 6 (Sample size for June)

    • u_1 = R_1 - \frac{n_1(n_1 + 1)}{2} = 24 - \frac{6 \times 7}{2} = 24 - 21 = 3

    • u_2 = R_2 - \frac{n_2(n_2 + 1)}{2} = 54 - \frac{6 \times 7}{2} = 54 - 21 = 33

    • u = \min(u_1, u_2) = \min(3, 33) = 3

  • Critical Value:

    • Using tables, the critical value for n_1 = 6, n_2 = 6, and \alpha = 0.05 (two-tailed test) is U_{critical} = 5.
  • Decision:

    • Since u = 3 < U_{critical} = 5, we reject H_0 at the 5% significance level.
  • Conclusion:

    • There is evidence to suggest that the median oil price is different in January and June.
    • Median (Jan) = 69.2
    • Median (June) = 72.25
    • It appears that the median oil price is higher in June than in January.
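
The hand calculation above can be cross-checked in R. This is a minimal sketch, assuming the twelve prices are entered as two vectors (the names `jan` and `june` are illustrative and do not come from the notes). It recomputes the rank sums and U statistics, then calls the built-in `wilcox.test`, whose `W` statistic corresponds to u_1 when the January sample is passed first.

```r
# Oil prices from the table (vector names are illustrative)
jan  <- c(66.7, 68, 68.9, 69.5, 70.3, 72.1)
june <- c(70.9, 71, 72, 72.5, 73.1, 75.5)

# Rank the pooled sample; rank() assigns the mean rank to ties by default
pooled_ranks <- rank(c(jan, june))
n1 <- length(jan)
n2 <- length(june)

R1 <- sum(pooled_ranks[1:n1])       # rank sum for January
R2 <- sum(pooled_ranks[-(1:n1)])    # rank sum for June

u1 <- R1 - n1 * (n1 + 1) / 2
u2 <- R2 - n2 * (n2 + 1) / 2
u  <- min(u1, u2)
print(c(R1 = R1, R2 = R2, u1 = u1, u2 = u2, u = u))

# Built-in Mann-Whitney (Wilcoxon rank-sum) test, two-sided
mw <- wilcox.test(jan, june, alternative = "two.sided")
print(mw)

# Sample medians quoted in the conclusion
print(c(median_jan = median(jan), median_june = median(june)))
```

The p-value reported by `wilcox.test` gives the same decision as the table lookup without needing a critical value.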

Chi-squared Random Variables

  • If Y_1, Y_2, and Y_3 are independent \chi^2 random variables with 3, 7, and 8 degrees of freedom respectively, then the degrees of freedom add, and the distribution of \sum Y_i is:
    • \sum Y_i \sim \chi^2_{3+7+8} = \chi^2_{18}

Finding Values from Chi-squared Tables

  • Find the values of \chi^2_{0.05, 16} and \chi^2_{0.025, 9} using tables.

    • \chi^2_{0.05, 16} = 26.296
    • \chi^2_{0.025, 9} = 19.023
  • R code to calculate these values:

    • qchisq(0.05, 16, lower.tail = FALSE)
    • qchisq(0.025, 9, lower.tail = FALSE)

Expected value and Standard Deviation for Chi-squared Distribution

  • For X \sim \chi^2_v:

    • E(X) = v

    • Var(X) = 2v

    • SD(X) = \sqrt{2v}

    • If X \sim \chi^2_7:

      • E(X) = 7
      • Var(X) = 2 \times 7 = 14
      • SD(X) = \sqrt{14}
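
As a sanity check on these formulas, the moments can be recovered numerically from the density. This is only a sketch, assuming base R's `integrate` and `dchisq`; the variable names are illustrative.

```r
v <- 7  # degrees of freedom from the example

# E(X): integrate x * f(x) over the support of the chi-squared density
mean_x <- integrate(function(x) x * dchisq(x, df = v),
                    lower = 0, upper = Inf)$value

# Var(X): integrate (x - v)^2 * f(x), using the theoretical mean E(X) = v
var_x <- integrate(function(x) (x - v)^2 * dchisq(x, df = v),
                   lower = 0, upper = Inf)$value

sd_x <- sqrt(var_x)
print(c(mean = mean_x, var = var_x, sd = sd_x))
```

The results should agree with v, 2v, and \sqrt{2v} up to numerical integration error.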

Association Between Inoculation and Flu Infection

  • Data:

                      Infected   Not Infected   Totals
    Inoculated        5          241            246
    Not Inoculated    90         292            382
    Totals            95         533            628
  • Assumptions:

    • Independence: A person cannot be both inoculated and not inoculated, and a person cannot be both infected and not infected. Each person contributes to a single cell in the table.
  • Expected Frequencies:

    • E_{11} = \frac{246 \times 95}{628} = 37.2
    • E_{12} = \frac{246 \times 533}{628} = 208.8
    • E_{21} = \frac{382 \times 95}{628} = 57.8
    • E_{22} = \frac{382 \times 533}{628} = 324.2
    • No expected frequency is less than 5.
  • Yates' Continuity Correction:

    • \chi^2_{corrected} = \sum \frac{(|O_i - E_i| - 0.5)^2}{E_i}
    • \chi^2_{corrected} = \frac{(|5 - 37.2| - 0.5)^2}{37.2} + \frac{(|241 - 208.8| - 0.5)^2}{208.8} + \frac{(|90 - 57.8| - 0.5)^2}{57.8} + \frac{(|292 - 324.2| - 0.5)^2}{324.2} = 52.3
  • Degrees of Freedom:

    • (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1
  • Critical Value:

    • \chi^2_{0.05, 1} = 3.841
  • Decision:

    • Since \chi^2_{corrected} = 52.3 > \chi^2_{0.05, 1} = 3.841, we reject H_0.
  • Conclusion:

    • There is evidence to suggest that there is an association between inoculation and flu infection.
  • Effect Size (Phi Coefficient):

    • \phi = \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{52.3}{628}} = 0.29
    • Medium effect.
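
The whole analysis can be reproduced with base R's `chisq.test`. This is a minimal sketch, assuming the observed counts are entered as a 2 x 2 matrix (the object name `flu` and the dimension labels are illustrative). R works with unrounded expected frequencies, so the statistic may differ from the hand calculation in the last digit.

```r
# 2 x 2 table of observed counts from the notes (object name is illustrative)
flu <- matrix(c(5, 241,
                90, 292),
              nrow = 2, byrow = TRUE,
              dimnames = list(Inoculation = c("Inoculated", "Not Inoculated"),
                              Infection   = c("Infected", "Not Infected")))

# correct = TRUE applies Yates' continuity correction (used for 2 x 2 tables)
res <- chisq.test(flu, correct = TRUE)

print(res$expected)                   # expected frequencies E_ij
chi_sq <- unname(res$statistic)       # Yates-corrected chi-squared statistic
crit   <- qchisq(0.05, df = 1, lower.tail = FALSE)

# Phi coefficient computed from the corrected statistic, as in the notes
phi <- sqrt(chi_sq / sum(flu))
print(c(chi_sq = chi_sq, critical = crit, phi = phi))
```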

Other Chi-squared Test Considerations

  • If the expected frequencies assumption for using the \chi^2 test of association fails, Fisher's exact test may be used.

  • R Output Analysis:

    Pearson's Chi-squared test
    data: Personality Matrix
    X-squared = 71.2, df = 3, p-value = 2.362e-15
    
    • Based on this output, since the p-value (2.362e-15) is less than 0.05, there is evidence of an association between Personality and Colour Preference.
  • Post Hoc Method:

    • If there were evidence of an association, the standardized residuals method could be used as a post-hoc analysis.
  • Ordinal Data:

    • The linear-by-linear association procedure is used when we have ordinal data in contingency tables.
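
Fisher's exact test and standardized residuals are both available in base R. This is a minimal sketch that reuses the inoculation table from the earlier section purely as example input, since the Personality data behind the output above is not reproduced in these notes. The object names are illustrative.

```r
# Example input: the inoculation table from the earlier section
flu <- matrix(c(5, 241,
                90, 292),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Inoculated", "Not Inoculated"),
                              c("Infected", "Not Infected")))

# Fisher's exact test: an alternative when expected frequencies are too small
fisher_res <- fisher.test(flu)
print(fisher_res$p.value)

# Standardized residuals as a post hoc analysis: a common rule of thumb is
# that cells with |residual| > 1.96 deviate from independence at about 5%
std_res <- chisq.test(flu)$stdres
print(round(std_res, 2))
```

A negative residual marks a cell with fewer observations than expected under independence, and a positive one marks a cell with more.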

Generalized Linear Models (GLMs)

  • Family of Distributions:
    • We consider the exponential family of distributions for generalized linear models.
  • Linearity:
    • With respect to generalized linear models, "linear" means that the linear predictor must be linear with respect to the coefficients, i.e., \beta_i.
  • Canonical Link Function for Gamma Regression:
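    • The canonical link function for gamma regression is the reciprocal (inverse) function, g(\mu) = 1/\mu. This link function is often not used in practice because the mean \mu (the quantity we want to predict) must be positive, whereas the linear predictor can take any value in \mathbb{R}, including values < 0, so the reciprocal link can produce invalid negative fitted means.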
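
The issue with the reciprocal link can be seen directly from R's family objects. This is a small sketch, assuming an arbitrary negative value of the linear predictor (`eta = -2` is chosen only for illustration); it compares the inverse link with the log link.

```r
# Canonical (default) link of the Gamma family in R
gamma_canonical <- Gamma()
print(gamma_canonical$link)

# Illustrative negative value of the linear predictor
eta <- -2

# Inverse link: fitted mean is 1 / eta, which is negative here
mu_inverse <- gamma_canonical$linkinv(eta)

# Log link: fitted mean is exp(eta), which is always positive
mu_log <- Gamma(link = "log")$linkinv(eta)

print(c(inverse = mu_inverse, log = mu_log))
```

A negative mean is not valid for a gamma-distributed response, which is one reason the log link is often preferred when fitting such models.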

Poisson Distribution

  • Typical Use:

    • The Poisson distribution is typically used in situations where we count events, e.g., counting the number of calls received by a call center in a given time period.
  • Probability Calculation:

    • The number of errors on a given page of a Mathematics book is said to follow a Poisson distribution with parameter 0.5. Find the probability that there are at least two errors on a given page.
    • Let X denote the number of errors on a given page, i.e., X \sim Poisson(0.5).
    • P(X \geq 2) = 1 - P(X = 1) - P(X = 0)
    • P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}
    • P(X=1) = \frac{e^{-0.5} 0.5^1}{1!} = \frac{0.6065 \times 0.5}{1} = 0.303
    • P(X=0) = \frac{e^{-0.5} 0.5^0}{0!} = \frac{0.6065 \times 1}{1} = 0.607
    • P(X \geq 2) = 1 - 0.303 - 0.607 = 0.09
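
The same probability can be checked in R. This is a minimal sketch using the base functions `dpois` and `ppois`; the variable names are illustrative.

```r
lambda <- 0.5

# Individual probabilities from the Poisson mass function
p0 <- dpois(0, lambda)   # P(X = 0)
p1 <- dpois(1, lambda)   # P(X = 1)

# Complement rule: P(X >= 2) = 1 - P(X = 0) - P(X = 1)
p_at_least_2 <- 1 - p0 - p1

# Equivalent upper-tail call: P(X > 1)
p_upper_tail <- ppois(1, lambda, lower.tail = FALSE)

print(round(c(p0 = p0, p1 = p1, p_at_least_2 = p_at_least_2), 3))
```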
  • Linear Predictors in GLMs:

    • Which of the following are considered to be a linear predictor in the sense of GLMs?
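      • (a) \beta_0 + \beta_1 x_1 : Yes
      • (b) \beta_0 + \beta_2 x_2 + \beta_3 x_2^2 : Yes (non-linear in x_2, but linear in the coefficients)
      • (c) \beta_0 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2 x_3 : Yes
      • (d) \beta_0 + ln(\beta_1) x_1 : No (not linear in the coefficient \beta_1)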
  • Log Link Function for Count Data:

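    • We use the log link function for count data because the inverse of the log link is the exponential function, so the predicted mean count is always positive; this avoids obtaining negative counts.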
  • R Code for Poisson Regression and Wald Test:

    • R code for performing a Poisson regression on the Surgery data where the dependent variable is Surgery Visits and the independent variable is Location:

      SurgeryPoissonReg <- glm(SurgeryVisits ~ Location, data = Surgery, family = "poisson")
      summary(SurgeryPoissonReg)
      
    • R code for performing a Wald test in this scenario:

      library(lmtest)
      waldtest(SurgeryPoissonReg, test = "Chisq")
      

Logistic Regression

  • Distribution of Dependent Variable:

    • In logistic regression, the dependent variable must follow a binomial distribution.
  • Linearity:

    • Linearity in this scenario means that each continuous independent variable must be linearly related to the logit of the dependent variable.
  • Methods of Evaluating a Logistic Regression Model:

    • Any two of the following:
      • Log-likelihood
      • Deviance
      • Pseudo-R^2 statistics (Hosmer and Lemeshow, Cox and Snell, Nagelkerke)
      • Wald Statistic
      • Odds ratio
  • Odds and Logit Calculation:

    • The probability p of a machine producing a non-defective item is 0.85. Calculate the odds of the machine producing a non-defective item. Calculate logit(p).

    odds = \frac{p}{1 - p} = \frac{0.85}{0.15} = 5.67

    logit(p) = ln(odds) = ln(5.67) = 1.73
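
A quick check of these two values in R. This is a minimal sketch; `qlogis` is base R's logit function and is used here only to confirm the manual calculation.

```r
p <- 0.85

odds    <- p / (1 - p)    # odds of a non-defective item
logit_p <- log(odds)      # log() is the natural logarithm in R

print(round(c(odds = odds, logit = logit_p), 2))

# Built-in logit for comparison
print(qlogis(p))
```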

  • Logit Equation and Probability Calculation:

    • Another company claims it can predict the probability p' of its machine producing a non-defective item by the following equation:
      logit(p') = ln(\frac{p'}{1-p'}) = 3 + 2x_1 where x_1 is a variable that measures maintenance. Give the equation for p', i.e., make p' the subject of the above formula.

    Let y = logit(p') = ln(\frac{p'}{1-p'})

    e^y = \frac{p'}{1-p'} = e^{3 + 2x_1}

    p' = (1 - p')e^y

    p' = e^y - p'e^y

    p' + p'e^y = e^y

    p'(1 + e^y) = e^y

    p' = \frac{e^y}{1 + e^y} = \frac{e^{3 + 2x_1}}{1 + e^{3 + 2x_1}}
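
The rearranged formula can be verified numerically. This is a sketch under stated assumptions: `predict_p` is a hypothetical helper written for this note, and the maintenance values in `x1` are made up for illustration. Base R's `plogis` (the inverse logit) and `qlogis` (the logit) are used as independent checks.

```r
# Hypothetical helper: p' = e^y / (1 + e^y) with y = 3 + 2 * x1
predict_p <- function(x1) {
  y <- 3 + 2 * x1
  exp(y) / (1 + exp(y))
}

# Illustrative maintenance values (not from the notes)
x1 <- c(-3, -1.5, 0, 1)
p_hat <- predict_p(x1)
print(round(p_hat, 4))

# Check 1: agrees with R's built-in inverse logit
print(all.equal(p_hat, plogis(3 + 2 * x1)))

# Check 2: applying the logit recovers the linear predictor 3 + 2 * x1
print(all.equal(qlogis(p_hat), 3 + 2 * x1))
```

At x_1 = -1.5 the linear predictor is 0, so the predicted probability is exactly 0.5, which is a convenient spot check.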