Normal Distribution
Statistical Reasoning: Normal Distribution Study Notes
Density Curve
Density Curve Overview:
Density curves are used in exploratory data analysis (EDA).
A density curve is a smoothed approximation of a histogram, which groups continuous values into discrete bins.
Working with a smooth curve makes calculations easier than working with discrete bins, because proportions can be found by integration.
Purpose of Density Curves:
Density curves serve the same purposes as histograms:
Show overall patterns (shape, center, variability).
Identify striking deviations such as outliers.
Density functions are smoothed versions of histograms that allow the calculation of the area under the curve to represent probabilities.
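To make the area-as-probability idea concrete, here is a minimal Python sketch that approximates the area under a density curve with the trapezoidal rule. The choice of the standard normal density, the integration limits, and the helper name `area_under` are illustrative assumptions, not part of these notes.

```python
from statistics import NormalDist

def area_under(density, a, b, steps=10_000):
    """Approximate the area under `density` between a and b (trapezoidal rule)."""
    width = (b - a) / steps
    total = 0.5 * (density(a) + density(b))
    for i in range(1, steps):
        total += density(a + i * width)
    return total * width

# Assumed example curve: the standard normal density (mean 0, SD 1).
curve = NormalDist(0, 1).pdf

total_area = area_under(curve, -8, 8)   # practically the whole curve
left_half = area_under(curve, -8, 0)    # everything below the center
print(round(total_area, 4), round(left_half, 4))
```

The first number should come out very close to 1 (the whole curve) and the second close to 0.5 (half of the observations).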
Center and Variability of a Density Curve
Areas under a Density Curve:
The area under a density curve represents proportions of the total observations.
Median:
The median of a density curve is the equal-areas point: half of the area under the curve lies to its left and half to its right, so half of the observations fall on either side.
For symmetric density curves, the median is located at the center.
Mean:
The mean is the arithmetic average of all observations; on a density curve it is the balance point of the curve.
For normal distributions, the mean and the median are equal due to symmetry.
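The equality of mean and median for a normal curve can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist`; the mean of 10 and standard deviation of 2 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical normal curve chosen only for illustration.
dist = NormalDist(mu=10, sigma=2)

area_left_of_mean = dist.cdf(dist.mean)   # area under the curve left of the mean
median = dist.inv_cdf(0.5)                # point with half the area to its left

print(area_left_of_mean, median)
```

Half of the area lies to the left of the mean, and the half-area point lands on the mean itself.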
Normal Distribution
Characteristics of Normal Distribution:
A normal distribution is symmetric, single-peaked, and bell-shaped.
A specific normal distribution is fully characterized by its mean and standard deviation.
Changing the mean affects the location along the axis but does not alter the shape.
Changing the standard deviation alters the shape of the curve:
A larger standard deviation results in a wider and flatter curve, indicating greater variability in data.
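One way to see the effect of each parameter is to compare the height of the curve at its center, which for a normal density equals \( 1 / (\sigma \sqrt{2\pi}) \). The three distributions below are hypothetical examples, and the comparison is a sketch rather than a full plot.

```python
from statistics import NormalDist

narrow = NormalDist(mu=0, sigma=1)
shifted = NormalDist(mu=5, sigma=1)   # same SD, different mean
wide = NormalDist(mu=0, sigma=2)      # same mean, larger SD

# Height of each curve at its own center.
peak_narrow = narrow.pdf(narrow.mean)
peak_shifted = shifted.pdf(shifted.mean)
peak_wide = wide.pdf(wide.mean)

print(round(peak_narrow, 4), round(peak_shifted, 4), round(peak_wide, 4))
```

Shifting the mean leaves the peak height unchanged, while doubling the standard deviation halves it. Since the total area is fixed, the curve has to spread out to compensate.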
The 68–95–99.7% Rule
This rule provides a guideline for the distribution of observations in a normal distribution:
Approximately 68% of the observations fall within one standard deviation of the mean \( \text{Mean} \pm 1 \times \text{SD} \).
Approximately 95% of the observations fall within two standard deviations of the mean \( \text{Mean} \pm 2 \times \text{SD} \).
Approximately 99.7% of the observations fall within three standard deviations of the mean \( \text{Mean} \pm 3 \times \text{SD} \).
The total area under any density curve is 1, that is, 100% of the observations.
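The rounded percentages in the rule can be compared with the values given by the standard normal cumulative distribution function. This is a minimal sketch using `statistics.NormalDist`; the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(0, 1)

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

shares = {k: round(share_within(k), 4) for k in (1, 2, 3)}
print(shares)   # {1: 0.6827, 2: 0.9545, 3: 0.9973}
```

The rule's 68%, 95%, and 99.7% are these values rounded for easy recall.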
Case Study: Heights of Young Women
The height distribution for women aged 18 to 24 is approximately normal with:
Mean = 63.7 inches
Standard deviation = 2.5 inches
Application of the 68–95–99.7% Rule:
68% of data: \( 63.7 \pm 2.5 = [61.2, 66.2] \)
95% of data: \( 63.7 \pm (2 \times 2.5) = 63.7 \pm 5 = [58.7, 68.7] \)
99.7% of data: \( 63.7 \pm (3 \times 2.5) = 63.7 \pm 7.5 = [56.2, 71.2] \)
Conclusions from Data Analysis:
50% of all young women are taller than 63.7 inches, which is the mean.
About 34% of young women are between 63.7 and 66.2 inches tall. This interval runs from the mean to one standard deviation above it, [63.7, 63.7 + 2.5], so by symmetry it holds half of the central 68%.
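The case-study numbers can be reproduced with a short sketch. It takes the mean and standard deviation from the notes and assumes the heights follow an exactly normal model, which the notes describe only as approximate.

```python
from statistics import NormalDist

heights = NormalDist(mu=63.7, sigma=2.5)   # values from the case study

# Intervals for 1, 2, and 3 standard deviations around the mean.
intervals = {
    k: (round(heights.mean - k * heights.stdev, 1),
        round(heights.mean + k * heights.stdev, 1))
    for k in (1, 2, 3)
}

taller_than_mean = 1 - heights.cdf(63.7)
mean_to_plus_one_sd = heights.cdf(66.2) - heights.cdf(63.7)

print(intervals)
print(round(taller_than_mean, 2), round(mean_to_plus_one_sd, 2))
```

The intervals match the ones listed above. The two proportions come out at about 0.5 and 0.34.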
Standardized Normal Distribution: Z-Score
Z-Score Definition:
A z-score standardizes values of a variable for comparison, converting the normal distribution to a standard normal distribution with mean = 0 and standard deviation = 1.
The formula for calculating a z-score is:
\( z = \frac{\text{Observation} - \text{Mean}}{\text{Standard Deviation}} \)
A positive z-score indicates an observation above the mean, while a negative z-score indicates an observation below the mean.
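As a small illustration of the formula and its sign, the sketch below standardizes two observations and then converts a z-score back to the original scale. The function names `z_score` and `from_z`, and the distribution with mean 50 and standard deviation 10, are hypothetical.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations an observation lies from the mean."""
    return (observation - mean) / sd

def from_z(z, mean, sd):
    """Inverse: the observation that corresponds to a given z-score."""
    return mean + z * sd

# Hypothetical distribution with mean 50 and SD 10.
above = z_score(65, 50, 10)    # 1.5, above the mean
below = z_score(40, 50, 10)    # -1.0, below the mean
back = from_z(above, 50, 10)   # recovers 65.0
print(above, below, back)
```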
Case Study: ACT versus SAT Scores
Performance Comparison:
Madison scored 600 on the SAT Mathematics exam.
Gabriel scored 21 on the ACT test.
SAT scores are normally distributed with:
Mean = 500
Standard deviation = 100
ACT scores are normally distributed with:
Mean = 18
Standard deviation = 6
Z-Score Calculation:
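Madison's z-score: \( z_{\text{Madison}} = \frac{600 - 500}{100} = 1 \)
Gabriel's z-score: \( z_{\text{Gabriel}} = \frac{21 - 18}{6} = 0.5 \)
Since 1 > 0.5, Madison's score sits higher within the SAT distribution than Gabriel's does within the ACT distribution.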
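The comparison can be carried one step further by converting each z-score to the share of test takers scoring below it. This sketch uses the means and standard deviations given in the case study and assumes both score distributions are exactly normal.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)
act = NormalDist(mu=18, sigma=6)

z_madison = (600 - sat.mean) / sat.stdev   # 1.0
z_gabriel = (21 - act.mean) / act.stdev    # 0.5

# Share of test takers scoring below each student.
pct_madison = sat.cdf(600)
pct_gabriel = act.cdf(21)

print(z_madison, z_gabriel)
print(round(pct_madison, 4), round(pct_gabriel, 4))
```

Under these assumptions Madison scored above roughly 84% of SAT takers and Gabriel above roughly 69% of ACT takers, so Madison did better relative to her peers.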
Percentiles of Normal Distributions
Percentiles Defined:
The median is the 50th percentile.
The first and third quartiles are the 25th and 75th percentiles respectively.
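For a normal distribution these percentiles correspond to fixed z-scores. A minimal sketch, using the inverse cumulative distribution function of the standard normal from Python's standard library:

```python
from statistics import NormalDist

std_normal = NormalDist(0, 1)

# z-scores at the first quartile, median, and third quartile.
q1, median, q3 = (std_normal.inv_cdf(p) for p in (0.25, 0.50, 0.75))
print(round(q1, 3), round(median, 3), round(q3, 3))
```

The quartiles fall about 0.67 standard deviations below and above the mean, and the median falls at z = 0.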
Finding Percentiles:
Percentiles for a specific z-score can be found using statistical tables or software (e.g., R, Python, Excel).
The cth percentile is the value with c percent of the observations below it and the remaining (100 - c) percent above it.
Z-Score Table Examples
Example 1: A z-score of 1.5 is at approximately the 93rd percentile (area 0.9332 to its left).
Example 2: The z-score at the 42nd percentile is approximately -0.2.
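Both table lookups can be reproduced in software. The sketch below uses `statistics.NormalDist` from the Python standard library as one possible tool; R or Excel would serve equally well.

```python
from statistics import NormalDist

std_normal = NormalDist(0, 1)

# Example 1: z-score -> percentile (area to the left of z).
area_below = std_normal.cdf(1.5)

# Example 2: percentile -> z-score (inverse lookup).
z_at_42nd = std_normal.inv_cdf(0.42)

print(round(area_below, 4), round(z_at_42nd, 2))   # 0.9332 -0.2
```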