Normal Distribution

Density Curve Overview:
- Density curves are used in exploratory data analysis (EDA).
- They provide a smoothed approximation of histograms, which are discrete graphs representing ranges of continuous values.
- The primary purpose of density curves is to make calculations easier than they are with discrete bins, particularly calculations done through integration.
Purpose of Density Curves:
- Density curves serve the same purposes as histograms:
  - Show overall patterns (shape, center, variability).
  - Identify striking deviations such as outliers.
- Because a density curve is a smoothed version of a histogram, the area under it over an interval can be calculated and interpreted as a probability.

Areas under a Density Curve:
- The area under a density curve represents proportions of the total observations.
- Median:
  - The median is the point with half of the observations on either side, so half of the area under the curve lies to its left.
  - For symmetric density curves, the median is located at the center.
- Mean:
  - The mean is the arithmetic average of all observations.
  - For normal distributions, the mean and the median are equal due to symmetry.
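
To make the link between area and proportion concrete, here is a minimal sketch that integrates a normal density numerically with the trapezoid rule. The helper names `normal_pdf` and `area_under_curve`, and the mean and standard deviation used, are illustrative assumptions rather than anything given in these notes.

```python
import math

def normal_pdf(x, mean, sd):
    """Height of the normal density curve at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def area_under_curve(a, b, mean, sd, steps=20_000):
    """Approximate the area under the density between a and b (trapezoid rule)."""
    width = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mean, sd) + normal_pdf(b, mean, sd))
    for i in range(1, steps):
        total += normal_pdf(a + i * width, mean, sd)
    return total * width

mean, sd = 10.0, 2.0  # hypothetical values, chosen only for illustration
lower, upper = mean - 8 * sd, mean + 8 * sd  # the tails beyond this are negligible

total_area = area_under_curve(lower, upper, mean, sd)
left_of_mean = area_under_curve(lower, mean, mean, sd)

print(f"total area: {total_area:.4f}")
print(f"area left of the mean: {left_of_mean:.4f}")
```

The total area should come out very close to 1, and the area to the left of the mean very close to 0.5, which is what it means for the mean of a symmetric curve to also be its median.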

Characteristics of Normal Distribution:
- A normal distribution is symmetric, single-peaked, and bell-shaped.
- A specific normal distribution is fully characterized by its mean and standard deviation.
- Changing the mean affects the location along the axis but does not alter the shape.
- Changing the standard deviation alters the shape of the curve:
  - A larger standard deviation results in a wider and flatter curve, indicating greater variability in data.
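
As a rough illustration of these two effects, the sketch below evaluates the normal density at its peak for a few parameter choices. The function name `normal_pdf` and the specific means and standard deviations are assumptions made for the example.

```python
import math

def normal_pdf(x, mean, sd):
    """Height of the normal density curve at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Same SD, different means: the peak moves along the axis but keeps its height.
peak_a = normal_pdf(0.0, mean=0.0, sd=1.0)
peak_b = normal_pdf(5.0, mean=5.0, sd=1.0)

# Same mean, doubled SD: the peak is half as tall, so the curve is wider and flatter.
peak_c = normal_pdf(0.0, mean=0.0, sd=2.0)

print(f"peak at mean 0, SD 1: {peak_a:.4f}")
print(f"peak at mean 5, SD 1: {peak_b:.4f}")
print(f"peak at mean 0, SD 2: {peak_c:.4f}")
```

Because the total area must stay at 1, a curve that spreads out over a wider range has to be lower at its peak.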

The 68–95–99.7 rule provides a guideline for how observations are distributed in a normal distribution:
- Approximately 68% of the observations fall within one standard deviation of the mean ( $\text{Mean} \pm 1 \times \text{SD}$ ).
- Approximately 95% of the observations fall within two standard deviations of the mean ( $\text{Mean} \pm 2 \times \text{SD}$ ).
- Approximately 99.7% of the observations fall within three standard deviations of the mean ( $\text{Mean} \pm 3 \times \text{SD}$ ).
- The total area under any probability distribution curve sums to 100%.
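
The percentages in the rule are rounded. One way to check them, sketched here with Python's standard `math.erf`, uses the fact that the share of a normal distribution within $k$ standard deviations of the mean is $\operatorname{erf}(k / \sqrt{2})$. The helper name `share_within` is an assumption for the example.

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.2%}")
```

The results (about 68.27%, 95.45%, and 99.73%) hold for any mean and standard deviation, which is why a single rule covers every normal distribution.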

The height distribution for women aged 18 to 24 is approximately normal with:
- Mean = 63.7 inches
- Standard deviation = 2.5 inches
Application of the 68–95–99.7% Rule:
- 68% of data: $63.7 \pm 2.5 = [61.2, 66.2]$
- 95% of data: $63.7 \pm (2 \times 2.5) = 63.7 \pm 5 = [58.7, 68.7]$
- 99.7% of data: $63.7 \pm (3 \times 2.5) = 63.7 \pm 7.5 = [56.2, 71.2]$
Conclusions from Data Analysis:
- 50% of all young women are taller than 63.7 inches, which is the mean.
- About 34% of young women (half of the central 68%) are between 63.7 and 66.2 inches tall.
- This is the segment from the mean to one standard deviation above it: $[63.7, 63.7 + 2.5] = [63.7, 66.2]$.
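
The height example can be reproduced with `statistics.NormalDist` from the Python standard library. This is a sketch that assumes the distribution is exactly normal with the stated mean and standard deviation. The variable names are illustrative.

```python
from statistics import NormalDist

# Heights of women aged 18 to 24, in inches (values from the example above)
heights = NormalDist(mu=63.7, sigma=2.5)

# Intervals of 1, 2, and 3 standard deviations around the mean
for k in (1, 2, 3):
    low = heights.mean - k * heights.stdev
    high = heights.mean + k * heights.stdev
    share = heights.cdf(high) - heights.cdf(low)
    print(f"{low:.1f} to {high:.1f} inches: {share:.1%}")

above_mean = 1.0 - heights.cdf(63.7)
mean_to_plus_one_sd = heights.cdf(66.2) - heights.cdf(63.7)

print(f"taller than 63.7 inches: {above_mean:.0%}")
print(f"between 63.7 and 66.2 inches: {mean_to_plus_one_sd:.1%}")
```

The exact proportions differ slightly from the rounded figures in the rule (for instance about 34.1% rather than 34%), which is expected because the rule is an approximation.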

Z-Score Definition:
- A z-score standardizes a value so that it can be compared across variables: it measures how many standard deviations the value lies from the mean, converting a normal distribution to the standard normal distribution with mean = 0 and standard deviation = 1.
- The formula for calculating a z-score is:
- $z = \frac{\text{Observation} - \text{Mean}}{\text{Standard Deviation}}$
- A positive z-score indicates an observation above the mean, while a negative z-score indicates an observation below the mean.
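
A minimal sketch of the formula, together with its inverse, might look like the following. The function names `z_score` and `from_z_score` are assumptions, and the inputs reuse the height example (mean 63.7 inches, standard deviation 2.5 inches).

```python
def z_score(observation, mean, sd):
    """Number of standard deviations the observation lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (observation - mean) / sd

def from_z_score(z, mean, sd):
    """Convert a z-score back to the original scale."""
    return mean + z * sd

tall = z_score(68.7, mean=63.7, sd=2.5)   # above the mean, so positive
short = z_score(61.2, mean=63.7, sd=2.5)  # below the mean, so negative

print(f"z for 68.7 inches: {tall:+.2f}")
print(f"z for 61.2 inches: {short:+.2f}")
print(f"height at z = 2: {from_z_score(2.0, 63.7, 2.5):.1f} inches")
```

The sign of the result carries the direction (above or below the mean) and the magnitude carries the distance in standard deviations.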

Performance Comparison:
- Madison scored 600 on the SAT Mathematics exam.
- Gabriel scored 21 on the ACT test.
- SAT scores are normally distributed with:
  - Mean = 500
  - Standard deviation = 100
- ACT scores are normally distributed with:
  - Mean = 18
  - Standard deviation = 6
Z-Score Calculation:
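- Madison's z-score: $z_{\text{Madison}} = \frac{600 - 500}{100} = 1$
- Gabriel's z-score: $z_{\text{Gabriel}} = \frac{21 - 18}{6} = 0.5$
- Madison's z-score is higher, so her score is better relative to the other test takers on her exam.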
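
The comparison can be sketched in a few lines of Python. The dictionary layout and variable names are illustrative assumptions. The numbers are the ones given in the example.

```python
# Each entry holds a raw score with the mean and SD of its own exam.
results = {
    "Madison": {"score": 600, "mean": 500, "sd": 100},  # SAT Mathematics
    "Gabriel": {"score": 21, "mean": 18, "sd": 6},      # ACT
}

# Standardize each score against its own distribution.
z_scores = {
    name: (r["score"] - r["mean"]) / r["sd"]
    for name, r in results.items()
}

# The higher z-score marks the stronger performance relative to peers.
best = max(z_scores, key=z_scores.get)

for name, z in z_scores.items():
    print(f"{name}: z = {z}")
print(f"stronger relative performance: {best}")
```

Raw scores of 600 and 21 cannot be compared directly because the exams use different scales. Standardizing puts both on the same scale of standard deviations from the mean.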

Percentiles Defined:
- The median is the 50th percentile.
- The first and third quartiles are the 25th and 75th percentiles respectively.
Finding Percentiles:
- Percentiles for a specific z-score can be found using statistical tables or software (e.g., R, Python, Excel).
- The $c$th percentile is the value such that $c$ percent of observations lie below it and the remainder lie above it.
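
As one possible software route, Python's standard `statistics.NormalDist` converts between z-scores and percentiles. The sketch below assumes a standard normal distribution. The helper names `percentile_of` and `z_for_percentile` are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def percentile_of(z):
    """Percent of observations below the given z-score."""
    return 100.0 * standard_normal.cdf(z)

def z_for_percentile(c):
    """z-score that has c percent of observations below it (0 < c < 100)."""
    return standard_normal.inv_cdf(c / 100.0)

# Percentiles for the z-scores from the SAT/ACT comparison
print(f"z = 1.0 -> percentile {percentile_of(1.0):.1f}")
print(f"z = 0.5 -> percentile {percentile_of(0.5):.1f}")

# z-scores of the first quartile, median, and third quartile
for c in (25, 50, 75):
    print(f"percentile {c} -> z = {z_for_percentile(c):.3f}")
```

Under these assumptions a z-score of 1 falls at roughly the 84th percentile and a z-score of 0.5 at roughly the 69th, while the quartiles sit about 0.67 standard deviations on either side of the median.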