L4- Distribution of Observations

Learning Objectives

  • By the end of this lecture you should be able to:
    • Describe and identify several statistical distributions (Normal, t, \chi^2, Binomial, Bernoulli).
    • Apply reference ranges to the Normal distribution when appropriate.
    • Calculate the probability that an observation lies between two values using the standard Normal distribution.

Core Definitions & Concepts

  • Population vs Sample
    • Population: Entire group of interest; possesses a fixed, but usually unknown, numerical characteristic (parameter).
    • Sample: Subset drawn from the population, used to estimate population characteristics.
  • Parameter vs Statistic
    • Parameter (Greek letters): Numerical characteristic of a population (e.g., population mean cholesterol of all Asian males in Melbourne).
    • Statistic (Latin letters): Numerical summary derived from a sample (e.g., mean cholesterol of a sample of 50 Asian males in Melbourne) used to infer the parameter.

Notational Conventions

  • Mean
    • Population mean: \mu
    • Sample mean: \bar{x} (spoken “x-bar”).
  • Standard Deviation
    • Population SD: \sigma
    • Sample SD: S or SD
    • Excel command: STDEV(range) computes the sample standard deviation.
  • Proportion
    • Population proportion: \pi
    • Sample proportion: p
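
The sample-side notation above can be illustrated with a minimal Python sketch. The cholesterol readings and the 5.5 cut-off are hypothetical values invented for illustration; only the standard-library statistics module is used, whose stdev function uses the n - 1 denominator, as Excel's STDEV does.

```python
import statistics

# Hypothetical sample of cholesterol readings (mmol/L); the values are
# made up purely to illustrate the notation.
sample = [4.8, 5.2, 5.0, 6.1, 4.5, 5.6, 5.3, 4.9]

x_bar = statistics.mean(sample)   # sample mean, x-bar
s = statistics.stdev(sample)      # sample SD (n - 1 denominator), like Excel STDEV

# Sample proportion p: share of readings above an assumed cut-off of 5.5
p = sum(value > 5.5 for value in sample) / len(sample)

print(f"x-bar = {x_bar:.3f}, s = {s:.3f}, p = {p:.3f}")
```

These statistics estimate the corresponding parameters \mu, \sigma and \pi, which stay unknown unless the whole population is measured.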

Overview of Distributions Covered

  • Bernoulli
    • Used for a single trial (sample size n=1) with two outcomes.
    • Rarely applied in practice; not covered further in the course.
  • Binomial
    • Two mutually exclusive outcomes; fixed number of trials n; constant probability \pi.
    • Concerned with the count of “successes”.
    • Example question: “In 10 events, what is P(X=5)?”
  • Normal (Gaussian) Distribution – Focus of this lecture.
  • t Distribution – Will be addressed in later sessions.
  • Chi-Square (\chi^2) Distribution – Will be addressed in later sessions.
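
The Binomial example question can be worked through with a short sketch. The notes do not give a success probability, so \pi = 0.5 is assumed here; binomial_pmf is a hypothetical helper name, and the formula is the standard Binomial probability.

```python
from math import comb


def binomial_pmf(k: int, n: int, pi: float) -> float:
    """P(X = k) for X ~ Binomial(n, pi): k successes in n independent trials."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


# Assumed success probability; the notes do not specify one.
prob = binomial_pmf(5, 10, 0.5)
print(round(prob, 4))  # 0.2461

# A Bernoulli trial is the special case n = 1.
print(binomial_pmf(1, 1, 0.3))  # 0.3
```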

The Normal (Gaussian) Distribution

  • Also called the Gaussian distribution.
  • Applies to continuous data.
  • Characteristics:
    • Shape: Symmetric, bell-shaped curve.
    • Parameters:
      • Central location: \mu
      • Spread: \sigma
  • Visual: Mean at center, curve spreads outward in units of \sigma.
  • Importance: Foundation for many statistical methods and probability calculations.
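
The role of \mu and \sigma can be seen in a small sketch using Python's standard-library NormalDist. The values \mu = 79 and \sigma = 10 are assumptions chosen for illustration, not figures from the lecture data.

```python
from statistics import NormalDist

# Assumed illustrative parameters (not from the lecture data).
mu, sigma = 79.0, 10.0
dist = NormalDist(mu, sigma)

# Symmetry: the curve has the same height at equal distances either side of mu.
left, right = dist.pdf(mu - sigma), dist.pdf(mu + sigma)

# Area under the curve within 1, 2 and 3 SDs of the mean.
areas = {k: dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"within {k} SD: {area:.4f}")
```

The printed areas (about 0.68, 0.95 and 0.997) do not depend on the particular \mu and \sigma chosen, which is why spread is conveniently measured in units of \sigma.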

Summary Statistics Revisited

  • Provide concise, essential information about data distribution.
  • Two classes:
    • Measures of Central Tendency: Mean, median, mode.
    • Measures of Dispersion: Standard deviation, inter-quartile range (IQR), range.
  • Interpretation of dispersion (illustrated by two curves):
    • Narrow curve ⇒ small variability.
    • Wide curve ⇒ large variability.

Shapes of Distributions

  • Symmetric / Normal: Mean = Median = Mode.
  • Negatively (Left) Skewed:
    • Tail extends to smaller values.
    • Outliers pull mean left: \text{Mean} < \text{Median} < \text{Mode}.
  • Positively (Right) Skewed:
    • Tail extends to larger values.
    • Outliers pull mean right: \text{Mode} < \text{Median} < \text{Mean}.
  • Terminology:
    • Symmetric and bell-shaped ⇒ treated as Normal (symmetry alone does not guarantee normality)
    • Asymmetric ⇒ skewed

Choosing Appropriate Statistics to Report

  • If data ~ Normal ⇒ report mean ± SD.
  • If data skewed ⇒ report median & IQR (mean is distorted by outliers).
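
The reporting rule above can be sketched as a small helper. The function name describe, the skewed flag and the length-of-stay values are all hypothetical; the quartiles use the "inclusive" method of statistics.quantiles, which is one of several common quartile conventions.

```python
import statistics


def describe(values, skewed):
    """Return 'mean ± SD' for roughly Normal data, or 'median (IQR)' if skewed."""
    if skewed:
        q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
        return f"median {med:.1f} (IQR {q1:.1f} to {q3:.1f})"
    return f"mean {statistics.mean(values):.1f} ± {statistics.stdev(values):.1f}"


# Hypothetical lengths of stay (days) with one very long stay.
stays = [2, 3, 3, 4, 4, 5, 6, 30]

print(describe(stays, skewed=True))
print(describe(stays, skewed=False))  # shown only for contrast
```

With these made-up values the single 30-day stay pulls the mean well above the median, so the median and IQR describe a typical stay better.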

Checking for Normality

  1. Graphical Methods
    • Histogram
    • Box-plot
    • Look for symmetry, bell shape, absence of long tails.
  2. Numerical Comparison
    • Compare mean and median:
      • Close ⇒ likely symmetric/normal.
      • Far apart ⇒ likely skewed.
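
The numerical comparison can be sketched as follows. Dividing the mean-median gap by the SD is an assumed heuristic (not part of the lecture) that makes "close" and "far apart" comparable across variables; there is no universal cut-off, so it complements a histogram or box-plot and does not replace them. All data values are hypothetical.

```python
import statistics


def mean_median_gap(values):
    """(mean - median) / SD: near 0 suggests symmetry; the sign gives skew direction."""
    sd = statistics.stdev(values)
    return (statistics.mean(values) - statistics.median(values)) / sd


# Hypothetical samples chosen to show each case.
symmetric = [70, 75, 79, 83, 88]
right_skewed = [1, 2, 2, 3, 3, 4, 15]
left_skewed = [40, 68, 72, 74, 75, 77, 78]

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(f"{name}: gap = {mean_median_gap(data):+.2f}")
```

A positive gap means the mean exceeds the median (right skew), and a negative gap means the mean is below the median (left skew).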

Graphical & Numerical Examples

  • Weight Histogram
    • Visually symmetric; mean = median = 79 ⇒ weight ~ Normal.
  • Age Distribution (Left-Skewed)
    • Most values > 70; tail toward younger ages.
    • Mean < Median ⇒ left skew.
  • Post-Procedure Length of Stay (Right-Skewed)
    • Many short stays, few very long stays.
    • Mean > Median ⇒ right skew.

Clinical Data Example – Cardiac Surgery Database

  • Variables: Pre-operative creatinine (mmol/L) & dialysis status.
  • Overall Creatinine
    • Histogram & box-plot show a long right tail (outliers approaching 2 mmol/L).
  • Grouped by Dialysis Status
    • No Dialysis:
      • Mean = 0.10, Median = 0.09 (similar) ⇒ approximate normality despite long tail.
    • Yes Dialysis:
      • Mean = 0.462, Median = 0.426 (mean noticeably higher) ⇒ right skew.
  • Statistical Reporting Decision
    • Because Yes Dialysis group is skewed, compare groups using median & IQR rather than mean & SD.

Additional Example – Body Mass Index (BMI)

  • Histogram nearly symmetric; a few outliers with BMI up to 68.
  • Mean ≈ Median (red lines overlap).
  • Conclusion: BMI data ~ Normal; mean ± SD appropriate.

Practical Implications & Next Steps

  • Always examine distribution shape before selecting summary statistics or formal tests.
  • Reference ranges (e.g., \mu \pm 2\sigma, covering about 95% of values) are valid only under approximate normality.
  • Standard Normal table or software allows probability calculations once data are standardized.
  • Future lectures will build on these ideas to cover t and \chi^2 distributions, and to perform inferential tests based on Normal theory.
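
The reference-range and standardization steps can be sketched in Python. The mean of 79 echoes the weight example, but \sigma = 10 and the limits 69 and 89 are assumptions for illustration; prob_between is a hypothetical helper, and NormalDist is the standard-library stand-in for a standard Normal table.

```python
from statistics import NormalDist

# Assumed illustrative values: X ~ Normal(mu = 79, sigma = 10).
mu, sigma = 79.0, 10.0

# Approximate 95% reference range: mu ± 2 sigma.
ref_low, ref_high = mu - 2 * sigma, mu + 2 * sigma


def prob_between(a, b, mu, sigma):
    """P(a < X < b) via standardization z = (x - mu) / sigma."""
    z_a, z_b = (a - mu) / sigma, (b - mu) / sigma
    std_normal = NormalDist()  # standard Normal: mean 0, SD 1
    return std_normal.cdf(z_b) - std_normal.cdf(z_a)


print(ref_low, ref_high)                          # 59.0 99.0
print(round(prob_between(69, 89, mu, sigma), 4))  # 0.6827
```

Standardizing first mirrors the table-based method: each limit is converted to a z-score, and the probability is the difference between the two cumulative areas.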