Lec1_Intro_EF

Course Information

  • Course Name: P8131 Biostatistical Methods II

  • Instructor: Wenpin Hou, Ph.D.

  • Department: Biostatistics, Columbia University

Course Materials

Required Textbooks

  • Dobson, A. J., & Barnett, A. G. (2008). An Introduction to Generalized Linear Models (3rd Ed.). Chapman & Hall.

  • Fitzmaurice, G. M., Laird, N. M. & Ware, J. H. (2011). Applied Longitudinal Analysis (2nd Ed.). Wiley.

  • Hosmer, D. W., Lemeshow, S., & May, S. (2008). Applied Survival Analysis (2nd Ed.). Wiley.

Recommended Textbooks

  • Faraway, J. J. (2016). Extending the Linear Model with R (2nd Ed.). Chapman & Hall.

  • Agresti, A. (2015). Foundations of Linear and Generalized Linear Models (1st Ed.). John Wiley & Sons, Inc.

  • Diggle, P. J., Heagerty, P., Liang, K. Y., & Zeger, S. L. (2013). Analysis of Longitudinal Data (2nd Ed.). Oxford.

  • Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd Ed.). Springer.

  • Casella, G., & Berger, R. L. (2024). Statistical Inference (2nd Ed.). Springer.

Grading Policy

  • Homework: 40% (10 assignments, equal weights, submit via Canvas, late submissions not accepted)

  • Midterm Exam: 30% (Date: March 13, in class, one double-sided A4 reference sheet allowed)

  • Final Exam: 30% (Date: April 29, in class, one double-sided A4 reference sheet allowed)

  • Canvas: Check frequently for homework, materials, and grades

  • Honor Code: Adhere to the Mailman School Student Honor Code; academic integrity violations will be reported.

Office Hours

  • Instructor: Tuesdays, 1:30-2:30 PM

  • Teaching Assistants:

    • Safiya Sirota: Wednesdays, 10-11 AM, ARB R638

    • Ting-Hsuan Chang: Thursdays, 1-2 PM, ARB R638

    • Yuying Lu: Mondays, 1-2 PM, ARB R638

    • Qin Huang: Tuesdays, 3-4 PM, ARB R627 (joint session with Huanyu Chen)

    • Huanyu Chen: Tuesdays, 3-4 PM, ARB R627

  • Classroom Policy: Attendance highly recommended; participation encouraged; do not share course material online without permission. Direct administrative questions to Paul McCullough (pm2692).

Course Overview

  • Continuation: This course continues P8130 (Biostatistical Methods I) and covers key areas including:

    • Generalized Linear Models

    • Longitudinal Data Analysis

    • Survival Analysis

  • Objective: Introduce basic concepts of each topic and demonstrate their application in real-world problems.

  • Software: The R programming language will primarily be used, but students may also use SAS, Matlab, Python, SPSS, or Excel for homework analysis.

Part I: Generalized Linear Model

Outline of Topics

  • Exponential family distributions

  • Generalized linear model basics

  • Logistic regression

  • Nominal and ordinal logistic regression

  • Poisson regression

  • Contingency tables

  • Case study

Examples in Data Analysis

Example I: Kyphosis Data

  • Context: Measurements on 81 children post-corrective spinal surgery

  • Response Variable: Kyphosis (presence/absence of deformity)

  • Covariates:

    • Age of child (months)

    • Number of vertebrae involved

    • Starting point of involved vertebrae

  • Questions: Relationship of covariates to response; patient screening feasibility

Example II: Vehicle Safety Study

  • Participants: 150 men and 150 women

  • Focus: Importance of air conditioning and power steering in car buying

  • Data Representation: Importance ratings by gender and age group

  • Question: Relationship between sex, age, and car preferences

Example III: Cargo Vessel Damage Study

  • Objective: Assess damage risk in ships based on various factors

  • Variables:

    • Ship type (A-E)

    • Year of construction (1960-79)

    • Period of operation

    • Aggregated months of service

    • Number of damage accidents

  • Questions: Impact of these variables on damage occurrences

Exponential Family Overview

  • Definition: A large family of distributions including normal, exponential, Poisson, Bernoulli, binomial, gamma, and beta.

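  • Parameter Representation: A density in this family can be written as f(y; θ) = s(y)t(θ) exp(a(y)b(θ)), where a, b, s, and t are known functions. Equivalently, f(y; θ) = exp[a(y)b(θ) + c(θ) + d(y)], with c(θ) = log t(θ) and d(y) = log s(y).

  • Canonical Form: If a(y) = y, the distribution is in canonical form, and b(θ) is called the natural parameter.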
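
To make the definition concrete, the sketch below checks numerically that the Poisson distribution fits the form exp[a(y)b(θ) + c(θ) + d(y)], where c(θ) = log t(θ) and d(y) = log s(y). The function names and the value lam = 2.5 are illustrative choices, not part of the course material.

```python
import math

def poisson_pmf(y, lam):
    """Poisson pmf in its usual form: lam^y * exp(-lam) / y!."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def poisson_expfam(y, lam):
    """The same pmf written as exp[a(y) b(theta) + c(theta) + d(y)]."""
    a = y                      # a(y) = y, so the form is canonical
    b = math.log(lam)          # natural parameter b(theta) = log(theta)
    c = -lam                   # c(theta) = -theta
    d = -math.lgamma(y + 1)    # d(y) = -log(y!)
    return math.exp(a * b + c + d)

lam = 2.5  # arbitrary illustrative rate
max_gap = max(abs(poisson_pmf(y, lam) - poisson_expfam(y, lam)) for y in range(20))
print(f"largest absolute difference over y = 0..19: {max_gap:.2e}")
```

The two expressions should agree up to floating-point rounding, which suggests that a(y) = y and b(θ) = log θ is a valid canonical representation for the Poisson.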

Properties of Exponential Family Distribution

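  • Integration: The density must integrate to one: ∫ f(y; θ) dy = 1 for a random variable Y (a sum replaces the integral when Y is discrete).

  • Derivatives with respect to θ:

    • d/dθ ∫ f(y; θ) dy = d/dθ [1] = 0

    • d²/dθ² ∫ f(y; θ) dy = 0; exchanging differentiation and integration in these two identities gives the mean and variance of a(Y). With c(θ) = log t(θ): E[a(Y)] = -c'(θ)/b'(θ) and Var[a(Y)] = [b''(θ)c'(θ) - c''(θ)b'(θ)] / [b'(θ)]³.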
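
Differentiating the normalization condition once and twice leads to the standard results E[a(Y)] = -c'(θ)/b'(θ) and Var[a(Y)] = [b''(θ)c'(θ) - c''(θ)b'(θ)] / [b'(θ)]³, where c(θ) = log t(θ). The sketch below is one way to sanity-check these formulas with finite-difference derivatives. The helper names, step sizes, and parameter values are assumptions made for illustration.

```python
import math

def derivative(f, x, h=1e-5):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation to f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def mean_and_variance(b, c, theta):
    """Apply E[a(Y)] = -c'/b' and Var[a(Y)] = (b''c' - c''b') / (b')^3."""
    b1, c1 = derivative(b, theta), derivative(c, theta)
    b2, c2 = second_derivative(b, theta), second_derivative(c, theta)
    return -c1 / b1, (b2 * c1 - c2 * b1) / b1 ** 3

# Poisson(theta): b(theta) = log(theta), c(theta) = -theta
pois_mean, pois_var = mean_and_variance(math.log, lambda t: -t, 3.0)

# Binomial(n, p) with n known: b(p) = log(p / (1 - p)), c(p) = n log(1 - p)
n = 10
bin_mean, bin_var = mean_and_variance(
    lambda p: math.log(p / (1 - p)),
    lambda p: n * math.log(1 - p),
    0.3,
)

print(f"Poisson:  mean ~ {pois_mean:.4f}, variance ~ {pois_var:.4f}")
print(f"Binomial: mean ~ {bin_mean:.4f}, variance ~ {bin_var:.4f}")
```

Up to finite-difference error, the Poisson values should both be close to θ = 3, and the binomial values close to np = 3 and np(1 - p) = 2.1.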

Gaussian Example

  • Normal Distribution: f(y) = (1/√(2πσ²)) exp[-(y - µ)²/(2σ²)]

  • Parameter of Interest: µ (the mean), with σ² treated as a nuisance parameter.

  • Canonical Form: With a(y) = y, the density is exp[y b(µ) + c(µ) + d(y)], where b(µ) = µ/σ², c(µ) = -µ²/(2σ²) - (1/2) log(2πσ²), and d(y) = -y²/(2σ²). The natural parameter is b(µ) = µ/σ².
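
As a worked step, expanding the square in the normal density shows where each term comes from. This is a sketch of the standard derivation, treating σ² as known; the labels a, b, c, d refer to the form exp[a(y)b(µ) + c(µ) + d(y)].

```latex
\begin{aligned}
f(y;\mu)
  &= \frac{1}{\sqrt{2\pi\sigma^{2}}}
     \exp\!\left[-\frac{(y-\mu)^{2}}{2\sigma^{2}}\right] \\
  &= \exp\!\left[
       \underbrace{y}_{a(y)}\,
       \underbrace{\frac{\mu}{\sigma^{2}}}_{b(\mu)}
       \;\underbrace{-\,\frac{\mu^{2}}{2\sigma^{2}}
                     -\frac{1}{2}\log\!\left(2\pi\sigma^{2}\right)}_{c(\mu)}
       \;\underbrace{-\,\frac{y^{2}}{2\sigma^{2}}}_{d(y)}
     \right], \\[4pt]
\mathrm{E}[Y]
  &= -\frac{c'(\mu)}{b'(\mu)}
   = \frac{\mu/\sigma^{2}}{1/\sigma^{2}}
   = \mu .
\end{aligned}
```

Because a(y) = y, the normal distribution with known σ² is in canonical form, and the general mean formula recovers µ as expected.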

Additional Examples of Exponential Family Distributions

  • Bernoulli: f(y) = p^y (1-p)^(1-y)

  • Binomial: f(y; n, p) = (n choose y) * p^y (1-p)^(n-y)

  • Poisson: f(y; λ) = (λ^y e^(-λ))/y!

  • Gamma and Beta: Further members of the exponential family, both continuous distributions.
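
For the binomial with n known, the natural parameter is the log-odds η = log(p/(1-p)), which is why the logit appears in logistic regression. The sketch below checks that the pmf can be rewritten as exp[yη + c + d] with c = -n log(1 + e^η) and d = log(n choose y), and that the inverse logit recovers p. The function names and the values n = 8, p = 0.35 are illustrative assumptions.

```python
import math

def binomial_pmf(y, n, p):
    """Binomial pmf in its usual form."""
    return math.comb(n, y) * p ** y * (1 - p) ** (n - y)

def binomial_canonical(y, n, eta):
    """Binomial pmf written in terms of the natural parameter eta = log(p/(1-p))."""
    c = -n * math.log1p(math.exp(eta))   # n*log(1-p), since 1-p = 1/(1+e^eta)
    d = math.log(math.comb(n, y))        # log of the binomial coefficient
    return math.exp(y * eta + c + d)

n, p = 8, 0.35                           # arbitrary illustrative values
eta = math.log(p / (1 - p))              # logit: natural parameter
p_back = 1 / (1 + math.exp(-eta))        # inverse logit recovers p

worst = max(
    abs(binomial_pmf(y, n, p) - binomial_canonical(y, n, eta))
    for y in range(n + 1)
)
print(f"eta = {eta:.4f}, recovered p = {p_back:.4f}, largest gap = {worst:.2e}")
```

Setting n = 1 gives the Bernoulli case, so the same natural parameter applies to both distributions.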

Summary of Key Relationships and Formulas

  • Moment Generating Functions:

    • Express relationships among parameters and moments of the distributions

  • Formulas: Useful formulas for transformations of distributions to ease computations.
