Part II: Descriptive Statistics

Chapter 5: The Normal Approximation for Data

Required Reading: All Sections
Author: Shengjie Jiang, Ph.D.
Date: 1/18

The Normal (Probability) Distribution

  • Importance:

    • The normal distribution is widely regarded as the most important of all probability distributions.

  • Common Variables Modeled with Normal Distribution:

    • Health-related characteristics (e.g., heights, weights, cholesterol, blood pressure).

    • Psychological measurements (e.g., intelligence and aptitude test scores).

    • Measurement errors in scientific experiments.

    • Economic measurements and indicators, as well as environmental measurements such as flood levels.

  • Definition:

    • A continuous random variable $X$ is said to be normally distributed with mean $\mu$ and standard deviation $\sigma$ if the density function of $X$ has the following form:
      $f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty$

  • Notation:

    • $X \sim N(\mu, \sigma)$

    • Read as "$X$ follows a Normal Distribution with mean $\mu$ and standard deviation $\sigma$"
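
As a rough numerical check of the definition above, the density can be coded directly from the formula. This is only a sketch: the names `normal_pdf` and `approx_area`, the integration grid, and the example parameters are illustrative choices, not part of the chapter.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, written straight from the formula."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def approx_area(mu, sigma, half_width=8.0, steps=20_000):
    """Midpoint Riemann sum of the density over mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    return sum(normal_pdf(lo + (i + 0.5) * dx, mu, sigma) for i in range(steps)) * dx

# Assumed example parameters (the women's heights used later in the chapter).
total_area = approx_area(mu=64.5, sigma=2.5)
print(round(total_area, 6))
```

The sum should come out very close to 1, in line with the total area of a density function.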

Normal Curve Characteristics

  • Key Features:

    • The curve has a single peak.

    • The total area under the curve is 100% (or 1 when expressed as a proportion).

    • The curve always lies above the horizontal axis.

    • The mean $\mu$ indicates the center of the distribution.

    • The distribution is symmetric about the mean.

    • Inflection Points:

    • Points where the curve's concavity changes.

    • The distance from the mean to either inflection point equals one standard deviation (SD, $\sigma$).

    • Roughly 68% of the area under the curve lies between the inflection points.
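
The inflection-point claims can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper `second_difference`, the step size, and the choice of the standard normal curve are assumptions made for illustration.

```python
from statistics import NormalDist

def second_difference(dist, x, h=1e-3):
    """Central finite-difference estimate of the density's second derivative at x."""
    return (dist.pdf(x + h) - 2.0 * dist.pdf(x) + dist.pdf(x - h)) / h ** 2

mu, sigma = 0.0, 1.0          # illustrative choice of parameters
curve = NormalDist(mu, sigma)

inside = second_difference(curve, mu + 0.9 * sigma)    # just inside mu + sigma
outside = second_difference(curve, mu + 1.1 * sigma)   # just outside mu + sigma
area_between = curve.cdf(mu + sigma) - curve.cdf(mu - sigma)

print(inside < 0, outside > 0)   # sign of the concavity on each side
print(round(area_between, 4))
```

A negative value inside and a positive value outside indicate that the concavity switches near $\mu + \sigma$, and `area_between` is the area between the two inflection points.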

Normal Curve: Mean and Standard Deviation

  • Definition:

    • A normal curve can be entirely described by two parameters: its mean $\mu$ and standard deviation $\sigma$.

  • Density Function:

    • $f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty$

  • Total Area:

    • The total area under the normal curve is equal to 1 (or 100%).

    • The curve serves as a smooth approximation to the histogram of normally distributed data.

    • The area beneath the curve is interpreted as a probability.

Changes in Normal Curve Attributes

  • Increasing Mean:

    • A rise in the mean shifts the curve to the right.

  • Increasing Standard Deviation:

    • An increase in the standard deviation spreads the curve out, making it flatter and wider.

  • Shape Consistency:

    • The fundamental shape remains unchanged, and the area under the normal curve consistently equals 1.

  • Empirical Rule:

    • The empirical rule holds for every normal distribution, whatever its $\mu$ and $\sigma$. This is what allows a single Standard Normal Curve to be used for all of them, which simplifies calculations.
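
To make the "flatter" effect concrete: setting $x = \mu$ in the density formula gives a peak height of $1 / (\sigma\sqrt{2\pi})$, so doubling $\sigma$ halves the peak while the total area stays 1. A minimal sketch, with `peak_height` as an illustrative helper name and arbitrarily chosen values of $\sigma$:

```python
import math

def peak_height(sigma):
    """Height of the normal density at its center, x = mu: 1 / (sigma * sqrt(2*pi))."""
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi))

# Illustrative standard deviations; changing the mean only moves the peak left or right.
for sigma in (1.0, 2.0, 4.0):
    print(sigma, round(peak_height(sigma), 4))
```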

Normal Curve and the Empirical Rule (68-95-99.7)

  • 68% Rule:

    • About 68% of observations fall within the interval $(\mu - \sigma, \mu + \sigma)$.

    • Therefore, a normally distributed variable has about a 68% chance of falling within one standard deviation of the mean.

  • 95% Rule:

    • About 95% of observations lie within $(\mu - 2\sigma, \mu + 2\sigma)$.

    • Thus, a normally distributed variable has about a 95% chance of falling within two standard deviations of the mean.

  • 99.7% Rule:

    • About 99.7% of observations fall within $(\mu - 3\sigma, \mu + 3\sigma)$.

    • A normally distributed variable has about a 99.7% chance of falling within three standard deviations of the mean.
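
The three percentages are rounded values. One way to recover them more precisely is through the error function, since for any normal variable $P(\mu - k\sigma < X < \mu + k\sigma) = \operatorname{erf}(k/\sqrt{2})$. The helper name `within_k_sd` below is an illustrative choice.

```python
import math

def within_k_sd(k):
    """P(mu - k*sigma < X < mu + k*sigma) for normal X; it does not depend on mu or sigma."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(100.0 * within_k_sd(k), 2))
```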

Example: Women’s Heights

  • Normal Distribution Parameters:

    • Women’s heights are modeled using a normal distribution with $\mu = 64.5"$ and $\sigma = 2.5"$.

  • Empirical Rule Application:

    • 68% Range:

    • Between: $64.5 - 2.5 = 62$ inches and $64.5 + 2.5 = 67$ inches.

    • 95% Range:

    • Between: $64.5 - 2 \times 2.5 = 59.5$ inches and $64.5 + 2 \times 2.5 = 69.5$ inches.

    • 99.7% Range:

    • Between: $64.5 - 3 \times 2.5 = 57$ inches and $64.5 + 3 \times 2.5 = 72$ inches.

Questions on Women’s Heights

  1. In which range do the middle 95% of all women lie?

  2. What percentage of women are taller than 67"?

  3. What percentage of women are shorter than 59.5"?

  4. What percentage of women are shorter than 69.5"?

  5. In which range do the top 2.5% of all women lie?
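
One way to reason through the five questions above is by symmetry: the area outside a central interval splits evenly between the two tails. The sketch below uses the rounded empirical-rule percentages (68 and 95); the variable names are illustrative.

```python
mu, sigma = 64.5, 2.5                   # women's heights, in inches
within_1sd, within_2sd = 68.0, 95.0     # rounded empirical-rule percentages

# By symmetry, the area outside a central interval splits evenly between the tails.
tail_1sd = (100.0 - within_1sd) / 2.0   # above mu + sigma (or below mu - sigma)
tail_2sd = (100.0 - within_2sd) / 2.0   # above mu + 2*sigma (or below mu - 2*sigma)

middle_95 = (mu - 2 * sigma, mu + 2 * sigma)   # question 1
pct_taller_67 = tail_1sd                       # question 2: 67 = mu + sigma
pct_shorter_59_5 = tail_2sd                    # question 3: 59.5 = mu - 2*sigma
pct_shorter_69_5 = 100.0 - tail_2sd            # question 4: 69.5 = mu + 2*sigma
top_2_5_cutoff = mu + 2 * sigma                # question 5: heights above this value

print(middle_95, pct_taller_67, pct_shorter_59_5, pct_shorter_69_5, top_2_5_cutoff)
```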

  • Problem Statement:

    • What if the percentages of interest cannot be expressed in terms of 68%, 95%, or 99.7%?

    • For example, what percentage of women are shorter than 68"? Taller than 70"? Or between these values?

Standard Normal Distribution

  • Definition:

    • The standard normal distribution is defined by its mean $\mu = 0$ and standard deviation $\sigma = 1$.

  • Notation:

    • $N(0, 1)$

  • Density Function:

    • $f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right), \quad -\infty < x < \infty$

  • Area:

    • The total area under the standard normal distribution curve equals 1.

  • Shape:

    • The standard normal distribution is bell-shaped and symmetric about its mean.

  • Conversion Process:

    • Any normally distributed variable can be standardized via simple algebra (re-centering and re-scaling).

    • Process:
      $N(\mu, \sigma) \xrightarrow{\text{standardizing}} N(0, 1)$

Standardization of Normal Distribution

  • Conversion Formula:

    • Any normally distributed variable can be converted to a standard normal variable using the formula:
      $X \sim N(\mu, \sigma) \Rightarrow z = \frac{X - \mu}{\sigma} \sim N(0, 1)$

  • Z-Score Definition:

    • A z-score indicates how many standard deviations an observation is above (+) or below (−) the mean.
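
The re-centering and re-scaling can be sketched in a few lines. The data values below are hypothetical, made up purely for illustration, and `z_score` is an assumed helper name. Standardizing with the data's own mean and (population) SD should give z-scores with mean 0 and SD 1.

```python
from statistics import mean, pstdev

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

# Hypothetical data values, made up purely for illustration.
data = [61.0, 63.5, 64.5, 66.0, 70.0]
mu, sigma = mean(data), pstdev(data)

z_values = [z_score(x, mu, sigma) for x in data]
print([round(z, 2) for z in z_values])
print(round(mean(z_values), 6), round(pstdev(z_values), 6))  # mean and SD of the z-scores
```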

Application of Standard Units: Women’s Heights

  • Known Percentages from Z-Scores:

    • Percentages can be determined for z-scores of $0, \pm 1, \pm 2, \pm 3$, related to the Empirical Rule.

  • Challenges with Percentages:

    • What if required percentages are not expressed in terms of 68%, 95%, or 99.7%?

    • Example questions:

    • What percentage of women are taller than 70"?

    • What percentage of women are shorter than 68"?

    • What percentage of women are between 68" and 70"?
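
For heights such as 68" and 70", the z-scores are not whole numbers, so the Empirical Rule no longer applies directly. A possible approach, sketched here with the standard-library `statistics.NormalDist` in place of a printed normal table, is to convert to z-scores and read areas from the standard normal curve. The variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 64.5, 2.5                    # women's heights, in inches
standard_normal = NormalDist(0.0, 1.0)

def z_score(x):
    """Convert a height in inches to standard units."""
    return (x - mu) / sigma

z_68, z_70 = z_score(68.0), z_score(70.0)

pct_shorter_68 = 100.0 * standard_normal.cdf(z_68)            # area left of z_68
pct_taller_70 = 100.0 * (1.0 - standard_normal.cdf(z_70))     # area right of z_70
pct_between = 100.0 * (standard_normal.cdf(z_70) - standard_normal.cdf(z_68))

print(round(z_68, 2), round(z_70, 2))
print(round(pct_shorter_68, 2), round(pct_taller_70, 2), round(pct_between, 2))
```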

Practice Question: Brain Weights

  • Hypothesis:

    • Brain weights of individuals affected by a disease follow a normal distribution with mean 1000 g and standard deviation 100 g.

  • Probability Questions:

    1. Probability that a randomly selected individual's brain weight is less than 850 g.

    2. Probability that the brain weight is above 1250 g.

    3. Probability that the brain weight is between 905 g and 1300 g.
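
As a way to check answers to the three questions above, the probabilities can be computed on the original scale without standardizing by hand. This sketch relies on `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

brain = NormalDist(mu=1000.0, sigma=100.0)   # brain weights, in grams

p_below_850 = brain.cdf(850.0)                       # question 1
p_above_1250 = 1.0 - brain.cdf(1250.0)               # question 2
p_between = brain.cdf(1300.0) - brain.cdf(905.0)     # question 3

print(round(p_below_850, 4), round(p_above_1250, 4), round(p_between, 4))
```

The corresponding z-scores are $-1.5$, $2.5$, and the pair $-0.95$ and $3.0$, so the same answers can be read from a standard normal table.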

Z-Scores and Comparison

  • Z-Score Characteristics:

    • Measures the distance of a value from the mean in standard deviations.

    • A positive z-score signifies a value above the mean.

    • A negative z-score indicates a value below the mean.

    • A z-score close to 0 means the value is near the mean relative to the spread of the data.

    • A z-score that is large in absolute value means the value is far from the mean relative to the spread of the data.

  • Application:

    • Z-scores help compare relative standings across different datasets, even when their units or scales differ.

Relative Performance Example: Alice's Midterms

  • Midterm Scores Comparison:

    • Organic Chemistry (OChem):

    • Mean: 55

    • Standard Deviation: 25

    • Alice's Score: 80

    • Statistics Class:

    • Mean: 50

    • Standard Deviation: 10

    • Alice's Score: 75

  • Questions:

    1. On which test did Alice perform better relative to the rest of the class?

    2. What are her percentiles on both midterms?
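
A sketch of how the two questions might be worked, assuming both sets of midterm scores are roughly normal (the notes do not state this explicitly). The function and variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(0.0, 1.0)

def z_score(score, mean, sd):
    """Score expressed in standard units for its own class."""
    return (score - mean) / sd

z_ochem = z_score(80.0, mean=55.0, sd=25.0)
z_stat = z_score(75.0, mean=50.0, sd=10.0)

# Percentile = percentage of the class scoring below Alice, under the normal assumption.
pct_ochem = 100.0 * standard_normal.cdf(z_ochem)
pct_stat = 100.0 * standard_normal.cdf(z_stat)

print(z_ochem, z_stat)
print(round(pct_ochem, 2), round(pct_stat, 2))
```

Although the raw OChem score is higher, the larger z-score belongs to the statistics midterm, which suggests the stronger relative performance there.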

Applying Context: Olympic Performances

  • Example from Olympics:

    • Dobrynska won a gold medal in the 2008 Olympics with a long jump of 6.63m, which is about 0.5m longer than the average.

    • Fountain claimed the gold in the 200m run with a time of 23.21s, 1.5s faster than average.

  • Data Summary:

    • Long Jump:

    • Mean: 6.11m

    • SD: 0.24m

    • Individual Performance: 6.63m

    • 200m Run:

    • Mean: 24.71s

    • SD: 0.70s

    • Individual Performance: 23.21s

  • Questions:

    • Whose performance was more impressive based on statistical metrics?
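
One possible way to compare the two performances is with z-scores. Because a lower time is better in a race, the sign of the 200m z-score is flipped below so that "more positive" means "more impressive" in both events; that sign convention is a choice made for this sketch.

```python
def z_score(value, mean, sd):
    """Value expressed in standard units."""
    return (value - mean) / sd

z_long_jump = z_score(6.63, mean=6.11, sd=0.24)    # longer is better
z_200m = z_score(23.21, mean=24.71, sd=0.70)       # faster (lower) is better

# Flip the sign for the run so that larger always means better.
impressiveness_jump = z_long_jump
impressiveness_run = -z_200m

print(round(impressiveness_jump, 2), round(impressiveness_run, 2))
```

Under this comparison the two results are close, with the long jump slightly further from its mean in standard units.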

Percentiles in Normal Distribution

  • Backward Normal Calculation:

    • When given the area or percentage, one must find the corresponding value of $x$.

  • Formula for Backward Calculation:

    • $x = \mu + z \cdot \sigma$

  • Example:

    • For Alice’s statistics midterm, determine the score needed to be in the top 10% (at or above the 90th percentile).

  • Statistical Illustration:

    • Model the midterm scores with the normal distribution $N(50, 10)$.
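
The backward calculation can be sketched in two steps: find the z-score with 90% of the area to its left, then undo the standardization with $x = \mu + z \cdot \sigma$. The sketch uses `NormalDist.inv_cdf` from the Python standard library in place of a reverse table lookup; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 50.0, 10.0                     # statistics midterm, N(50, 10)

z_90 = NormalDist(0.0, 1.0).inv_cdf(0.90)  # z-score with 90% of the area to its left
cutoff = mu + z_90 * sigma                 # undo the standardization

print(round(z_90, 2), round(cutoff, 1))
```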

Backward Normal Calculation – Practice Questions

  • Brain Weights Example:

    • Below what weight do the lowest 10% of brain weights fall?

  • Cereal Box Scenario:

    • A manufacturer fills cereal boxes so that their weights follow a Normal model with mean 16.3 ounces and standard deviation 0.2 ounces, while the label states 16.0 ounces.

  • Key Questions:

    • What fraction of boxes will be underweight (less than 16 ounces)?

    • Below what weight do the lightest 5% of cereal boxes fall?
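
A sketch for checking these practice questions, again using `statistics.NormalDist`. The brain-weight parameters (mean 1000 g, SD 100 g) are taken from the earlier practice question; the variable names are illustrative.

```python
from statistics import NormalDist

brain = NormalDist(mu=1000.0, sigma=100.0)   # brain weights, in grams
cereal = NormalDist(mu=16.3, sigma=0.2)      # cereal box weights, in ounces

brain_bottom_10 = brain.inv_cdf(0.10)        # weight with 10% of the area below it
frac_underweight = cereal.cdf(16.0)          # forward calculation: area below 16 oz
cereal_bottom_5 = cereal.inv_cdf(0.05)       # weight with 5% of the area below it

print(round(brain_bottom_10, 1))
print(round(frac_underweight, 4))
print(round(cereal_bottom_5, 2))
```

The first and third lines are backward calculations (area to value), while the second is an ordinary forward calculation (value to area).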