L4- Distribution of Observations
Learning Objectives
- By the end of this lecture you should be able to:
- Describe and identify several statistical distributions (Normal, t, \chi^2, Binomial, Bernoulli).
- Apply reference ranges to the Normal distribution when appropriate.
- Calculate the probability that an observation lies between two values using the standard Normal distribution.
Core Definitions & Concepts
- Population vs Sample
- Population: Entire group of interest; possesses a fixed, but usually unknown, numerical characteristic (parameter).
- Sample: Subset drawn from the population, used to estimate population characteristics.
- Parameter vs Statistic
- Parameter (Greek letters): Numerical characteristic of a population (e.g., population mean cholesterol of all Asian males in Melbourne).
- Statistic (Latin letters): Numerical summary derived from a sample (e.g., mean cholesterol of a sample of 50 Asian males in Melbourne) used to infer the parameter.
Notational Conventions
- Mean
- Population mean: \mu
- Sample mean: \bar{x} (spoken “x-bar”).
- Standard Deviation
- Population SD: \sigma
- Sample SD: S or SD
- Excel command: STDEV(range) computes the sample standard deviation.
- Proportion
- Population proportion: \pi
- Sample proportion: p
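The notation above can be illustrated with a minimal Python sketch. The cholesterol readings and the 5.5 mmol/L cut-off are hypothetical values invented for illustration; `statistics.stdev` divides by n - 1, which is the same convention as the Excel STDEV command.

```python
import statistics

# Hypothetical sample of cholesterol readings (mmol/L), invented for illustration.
sample = [4.8, 5.2, 5.6, 6.1, 4.9, 5.4]

x_bar = statistics.mean(sample)           # sample mean (x-bar), estimates mu
s = statistics.stdev(sample)              # sample SD, divides by n - 1, estimates sigma
pop_formula = statistics.pstdev(sample)   # population formula, divides by n

# Sample proportion p: share of readings above a hypothetical 5.5 mmol/L cut-off.
p = sum(value > 5.5 for value in sample) / len(sample)

print(f"x-bar = {x_bar:.2f}, s = {s:.2f}, p = {p:.2f}")
```

The sample SD is always slightly larger than the value from the population formula on the same data, because it divides by n - 1 rather than n.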
Overview of Distributions Covered
- Bernoulli
- Used for a single trial (sample size n=1) with two outcomes.
- Rarely applied in practice; not covered further in the course.
- Binomial
- Two mutually exclusive outcomes; fixed number of trials n; constant probability \pi.
- Concerned with the count of “successes”.
- Example question: “In 10 events, what is P(X=5)?”
- Normal (Gaussian) Distribution – Focus of this lecture.
- t Distribution – Will be addressed in later sessions.
- Chi-Square (\chi^2) Distribution – Will be addressed in later sessions.
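The Binomial example question can be sketched in Python as follows. The lecture does not state a success probability for the example, so \pi = 0.5 is an assumption made here; the helper name `binomial_pmf` is also hypothetical.

```python
from math import comb

def binomial_pmf(k: int, n: int, pi: float) -> float:
    """P(X = k) for X ~ Binomial(n, pi): k successes in n independent trials."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)

# Assumed success probability; the example question does not give one.
pi = 0.5
prob = binomial_pmf(5, 10, pi)
print(f"P(X=5) = {prob:.4f}")  # 0.2461

# The Bernoulli distribution is the special case n = 1.
bernoulli_success = binomial_pmf(1, 1, pi)
```

Summing `binomial_pmf(k, 10, pi)` over k = 0 to 10 gives 1, which is a quick sanity check on the formula.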
The Normal (Gaussian) Distribution
- Also called the Gaussian distribution.
- Applies to continuous data.
- Characteristics:
- Shape: Symmetric, bell-shaped curve.
- Parameters:
- Central location: \mu
- Spread: \sigma
- Visual: Mean at center, curve spreads outward in units of \sigma.
- Importance: Foundation for many statistical methods and probability calculations.
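As a rough illustration of how the curve spreads in units of \sigma, the sketch below computes the share of a Normal distribution lying within 1, 2 and 3 standard deviations of the mean. The values of `mu` and `sigma` are hypothetical; the resulting proportions are the same for any Normal distribution.

```python
from statistics import NormalDist

# Hypothetical parameters, chosen only for illustration.
mu, sigma = 79.0, 12.0
dist = NormalDist(mu, sigma)

def coverage(k: float) -> float:
    """Proportion of the distribution within k standard deviations of the mean."""
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {coverage(k):.1%}")
```

The three proportions come out at roughly 68%, 95% and 99.7%.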
Summary Statistics Revisited
- Provide concise, essential information about data distribution.
- Two classes:
- Measures of Central Tendency: Mean, median, mode.
- Measures of Dispersion: Standard deviation, inter-quartile range (IQR), range.
- Interpretation of dispersion (illustrated by two curves):
- Narrow curve ⇒ small variability.
- Wide curve ⇒ large variability.
Shapes of Distributions
- Symmetric / Normal: Mean = Median = Mode.
- Negatively (Left) Skewed:
- Tail extends to smaller values.
- Outliers pull mean left: \text{Mean} < \text{Median} < \text{Mode}.
- Positively (Right) Skewed:
- Tail extends to larger values.
- Outliers pull mean right: \text{Mode} < \text{Median} < \text{Mean}.
- Terminology:
- Symmetric and bell-shaped ⇒ normal
- Asymmetric ⇒ skewed
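The ordering of mean, median and mode under skew can be checked on small made-up data sets. Both lists below are hypothetical and were constructed only to show the pattern.

```python
import statistics

# Hypothetical right-skewed data: many small values, a few very large ones.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]
# Hypothetical left-skewed data: most values high, tail toward smaller values.
left_skewed = [40, 62, 70, 72, 74, 75, 75, 76, 76, 76]

def centre(values):
    """Return (mode, median, mean) for a list of numbers."""
    return (statistics.mode(values),
            statistics.median(values),
            statistics.mean(values))

r_mode, r_median, r_mean = centre(right_skewed)
l_mode, l_median, l_mean = centre(left_skewed)

print("right skew:", r_mode, "<", r_median, "<", r_mean)   # mode < median < mean
print("left skew: ", l_mean, "<", l_median, "<", l_mode)   # mean < median < mode
```

In both cases the mean moves furthest toward the tail, which is why it is the measure most distorted by outliers.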
Choosing Appropriate Statistics to Report
- If data ~ Normal ⇒ report mean ± SD.
- If data skewed ⇒ report median & IQR (mean is distorted by outliers).
Checking for Normality
- Graphical Methods
- Histogram
- Box-plot
- Look for symmetry, bell shape, absence of long tails.
- Numerical Comparison
- Compare mean and median:
- Close ⇒ likely symmetric/normal.
- Far apart ⇒ likely skewed.
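The mean-versus-median comparison can be turned into a simple screening rule, sketched below. The function name `summarize` and the 10%-of-one-SD threshold for "close" are assumptions made for this sketch, not a rule from the lecture; a histogram or box-plot should still be inspected alongside it.

```python
import statistics

def summarize(values, tolerance=0.10):
    """Pick a summary based on how far apart the mean and median are.

    'Close' is taken to mean a gap of at most `tolerance` sample SDs,
    which is an arbitrary threshold assumed for this sketch.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)
    if abs(mean - median) <= tolerance * sd:
        return f"mean ± SD: {mean:.1f} ± {sd:.1f}"
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return f"median (IQR): {median:.1f} ({q1:.1f} to {q3:.1f})"

# Hypothetical data sets, invented for illustration.
symmetric = [70, 75, 79, 83, 88]
skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]

print(summarize(symmetric))
print(summarize(skewed))
```

The symmetric list is reported as mean ± SD, while the skewed list falls back to median and IQR.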
Graphical & Numerical Examples
- Weight Histogram
- Visually symmetric; mean = median = 79 ⇒ weight ~ Normal.
- Age Distribution (Left-Skewed)
- Most values > 70; tail toward younger ages.
- Mean < Median ⇒ left skew.
- Post-Procedure Length of Stay (Right-Skewed)
- Many short stays, few very long stays.
- Mean > Median ⇒ right skew.
Clinical Data Example – Cardiac Surgery Database
- Variables: Pre-operative creatinine (mmol/L) & dialysis status.
- Overall Creatinine
- Histogram & box-plot show long right tail (outliers just under 2 mmol/L).
- Grouped by Dialysis Status
- No Dialysis:
- Mean = 0.10, Median = 0.09 (similar) ⇒ approximate normality despite long tail.
- Yes Dialysis:
- Mean = 0.462, Median = 0.426 (mean noticeably higher than median) ⇒ right skew.
- Statistical Reporting Decision
- Because Yes Dialysis group is skewed, compare groups using median & IQR rather than mean & SD.
Additional Example – Body Mass Index (BMI)
- Histogram nearly symmetric; a few outliers, up to a BMI of 68.
- Mean ≈ Median (red lines overlap).
- Conclusion: BMI data ~ Normal; mean ± SD appropriate.
Practical Implications & Next Steps
- Always examine distribution shape before selecting summary statistics or formal tests.
- Reference ranges (e.g., \mu \pm 2\sigma, covering roughly 95% of observations) are valid only under approximate normality.
- Standard Normal table or software allows probability calculations once data are standardized.
- Future lectures will build on these ideas to cover t and \chi^2 distributions, and to perform inferential tests based on Normal theory.
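A reference range and the probability that an observation lies between two values can be sketched with Python's standard library. The mean, SD and the interval 70 to 90 are hypothetical values chosen for illustration, and `prob_between` is a helper name invented here. Each bound is standardized with z = (x - \mu) / \sigma and then looked up in the standard Normal distribution, which plays the role of the printed table.

```python
from statistics import NormalDist

# Hypothetical population parameters, chosen only for illustration.
mu, sigma = 79.0, 12.0

# Reference range mu ± 2 sigma (only meaningful if the data are roughly Normal).
low, high = mu - 2 * sigma, mu + 2 * sigma

std_normal = NormalDist()  # standard Normal: mean 0, SD 1

def prob_between(a: float, b: float, mu: float, sigma: float) -> float:
    """P(a < X < b) for X ~ Normal(mu, sigma), via standardization."""
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    return std_normal.cdf(z_b) - std_normal.cdf(z_a)

p = prob_between(70, 90, mu, sigma)
print(f"reference range: {low:.0f} to {high:.0f}")
print(f"P(70 < X < 90) = {p:.3f}")
```

Calling `prob_between(low, high, mu, sigma)` returns about 0.95, matching the usual interpretation of the reference range.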