Topic 7: The Distribution of Data: Exploring the Normal Distribution (WorkBook Notes)

1. Introduction: Transitioning from Binomial to Normal Distribution

The Normal Distribution is introduced as a more powerful and flexible tool for understanding real-world data compared to the Binomial Distribution. While the Binomial Distribution is effective for modeling processes with a fixed number of trials, two possible outcomes (success or failure), and a constant probability of success, it faces significant limitations:

Scaling Issues: As the number of trials increases, the Binomial Distribution becomes difficult to work with. Calculating probabilities directly involves large factorials and complex expressions.
Interpretability: Even with technology, interpreting results for large datasets becomes less intuitive.
Continuous Data: The Binomial Distribution is restricted to discrete events, whereas the Normal Distribution provides a simpler way to model large datasets and continuous variables.
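
To make the scaling issue concrete, the sketch below computes an exact binomial probability term by term and then shows how quickly the factorials involved grow. The function name `binomial_cdf` and the trial counts are illustrative choices, not part of the workbook.

```python
import math

def binomial_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Small case: 10 trials with success probability 0.5 (illustrative values).
print(binomial_cdf(5, 10, 0.5))

# The factorials behind the formula grow very quickly with the number of trials.
for n in (10, 100, 1000):
    print(n, "trials ->", len(str(math.factorial(n))), "digits in n!")
```

With only a handful of trials the sum is manageable, but the numbers inside it soon become far too large to handle by hand.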

2. What is the Normal Distribution?

The Normal Distribution is defined as a continuous probability distribution. This means it describes outcomes that can take any value within a range rather than being restricted to whole numbers. Common examples of where this distribution applies include:

Heights of people.
Test scores.
Income levels.
Measurement errors.

Graphically, it is known as the "bell curve" because it rises to a peak in the middle and falls away symmetrically on both sides, with the mean ( $\mu$ ) located exactly at the center.

3. Key Features of the Normal Distribution

The distribution is characterized by the following four essential features:

Symmetry: The distribution is perfectly symmetrical around its center.
Equality of Central Tendency: The Mean = Median = Mode. All three measures of central tendency huddle together at the single central point.
Bell-Shaped Curve: Most observations cluster comfortably around the center, while fewer values wander into the thinner, quieter tails at either end.
Total Area Equals 1: The total area under the curve is equal to 1, or $100\%$ . This means every possible outcome is accounted for, leaving no loose ends or missing pieces.
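
The features above can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the mean of 50 and standard deviation of 8 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical parameters chosen only for illustration.
dist = NormalDist(mu=50, sigma=8)

# Symmetry: the density is the same at equal distances either side of the mean.
left_density = dist.pdf(50 - 12)
right_density = dist.pdf(50 + 12)
print(left_density, right_density)

# Mean = median: half of the probability lies below the mean.
below_mean = dist.cdf(50)
print(below_mean)

# Total area: a midpoint Riemann sum of the density over mean +/- 8 sigma.
step = 0.01
n_steps = int(16 * 8 / step)
area = sum(dist.pdf(50 - 64 + (i + 0.5) * step) for i in range(n_steps)) * step
print(round(area, 6))
```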

4. Central Tendency and Dispersion

To apply the Normal Distribution to a range of contexts, two primary questions must be answered regarding the variable: where it is centered and how spread out it is.

Central Tendency: The Mean ( $\mu$ )

The mean determines the center of the distribution.
If the mean ( $\mu$ ) increases, the entire curve shifts to the right.
If the mean ( $\mu$ ) decreases, the entire curve shifts to the left.

Dispersion: Variance and Standard Deviation

Variance ( $\sigma^2$ ): Measures how spread out the values are. A larger variance indicates more dispersed data.
Standard Deviation ( $\sigma$ ): This is the square root of the variance and is considered the most effective measurement for variation because it is expressed in the same units as the data.
Small $\sigma$ : Values are tightly clustered around the mean.
Large $\sigma$ : Values are widely spread.
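
As a rough illustration of centre and spread, the sketch below computes the mean, population variance and population standard deviation of a small made-up dataset, then compares how tightly two normal distributions with the same mean but different $\sigma$ cluster around that mean. All numbers are assumptions for demonstration.

```python
from statistics import NormalDist, mean, pstdev, pvariance

# Hypothetical dataset, used only to illustrate the calculations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(data)             # centre of the data
variance = pvariance(data)  # population variance, in squared units
sigma = pstdev(data)        # population standard deviation, same units as the data
print(mu, variance, sigma)

# Same mean, different spread: share of values within 5 units of the mean.
for s in (2, 10):
    d = NormalDist(mu=100, sigma=s)
    share = d.cdf(105) - d.cdf(95)
    print(f"sigma={s}: {share:.3f} of values lie within 5 units of the mean")
```

The smaller standard deviation places nearly all values inside the 5-unit band, while the larger one leaves most values outside it.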

5. Predictability and the Empirical Rule

The shape of the Normal Distribution is highly predictable, allowing for the use of the "empirical rule," which outlines how data is distributed around the mean based on standard deviations:

$\pm 1\sigma$ : About $68\%$ of values lie within $1$ standard deviation of the mean.
$\pm 2\sigma$ : About $95\%$ of values lie within $2$ standard deviations of the mean.
$\pm 3\sigma$ : About $99.7\%$ of values lie within $3$ standard deviations of the mean.

This rule enables quick estimation of probabilities without needing detailed calculations. Values near the mean are common, while values far from the mean are rare.
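
One way to see the empirical rule at work is to simulate normally distributed data and count how much of it falls within 1, 2 and 3 standard deviations of the mean. This is only a sketch: the mean of 170, the standard deviation of 10, the sample size and the seed are all arbitrary assumptions.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable
mu, sigma = 170.0, 10.0  # hypothetical mean and standard deviation
sample = [random.gauss(mu, sigma) for _ in range(100_000)]

def share_within(values, centre, sd, k):
    """Fraction of values lying within k standard deviations of the centre."""
    inside = sum(1 for v in values if abs(v - centre) <= k * sd)
    return inside / len(values)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(sample, mu, sigma, k):.3f}")
```

The simulated shares should land close to the 68%, 95% and 99.7% figures, with small differences due to sampling.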

6. Standardisation and Z-Scores

A major challenge in data analysis is that datasets use different scales (e.g., test scores out of $100$ vs. house prices in the hundreds of thousands). Standardisation transforms data from its original scale into a common, unit-free scale using the z-score.

The Z-Score Formula

$z = \frac{x - \mu}{\sigma}$

Where:

$z$ = the z-score for the value of interest.
$x$ = the original value.
$\mu$ = the population mean.
$\sigma$ = the population standard deviation.

The z-score answers the question: "How many standard deviations above or below the mean is this value?"
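
Rearranging the formula shows how a z-score maps back to the original scale, which can serve as a quick check on any z-score calculation:

$x = \mu + z\sigma$

For instance, with illustrative values $\mu = 50$ , $\sigma = 5$ and $z = 2$ , the original value is $x = 50 + 2 \times 5 = 60$ .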

7. Interpreting Z-Scores

$z = 0$ : The value is exactly at the mean (completely typical).
$z = +1$ : The value is one standard deviation above the mean (slightly above average).
$z = +2$ : The value is two standard deviations above the mean (unusually high/extreme).
$z = -1$ : The value is one standard deviation below the mean (slightly below average).
$z = -2$ : The value is two standard deviations below the mean (unusually low/extreme).

Z-scores allow for the comparison of values from entirely different contexts, such as comparing a student's exam score to a housing price, by determining how extreme each value is within its own distribution.

8. Questions for Review and Practical Applications

Case Study 1: Exam Scores

Given: Mean ( $\mu$ ) = $70$ , Standard Deviation ( $\sigma$ ) = $10$ , Student Score ( $x$ ) = $85$ .
Calculation: $z = \frac{85 - 70}{10} = 1.5$ .
Interpretation: The student scored $1.5$ standard deviations above the mean.

Case Study 2: House Prices

Given: Mean ( $\mu$ ) = $\$650,000$ , Standard Deviation ( $\sigma$ ) = $\$50,000$ , House Price ( $x$ ) = $\$600,000$ .
Calculation: $z = \frac{600000 - 650000}{50000} = -1$ .
Interpretation: The house is $1$ standard deviation below the mean, meaning it is cheaper than the average property.

Case Study 3: Weekly Income

Given: Mean ( $\mu$ ) = $\$1,000$ , Standard Deviation ( $\sigma$ ) = $\$100$ , Weekly Income ( $x$ ) = $\$1,200$ .
Calculation: $z = \frac{1200 - 1000}{100} = 2$ .
Interpretation: This income is $2$ standard deviations above the mean, indicating it is relatively high compared to the dataset.
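
The three case studies can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `z_score` and the dictionary layout are illustrative choices rather than anything prescribed by the workbook.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

# (value, mean, standard deviation) taken from the three case studies.
cases = {
    "Exam score": (85, 70, 10),
    "House price": (600_000, 650_000, 50_000),
    "Weekly income": (1_200, 1_000, 100),
}

z_scores = {name: z_score(*values) for name, values in cases.items()}
for name, z in z_scores.items():
    print(f"{name}: z = {z:+.1f}")

# The value furthest from its own mean, measured in standard deviations.
most_extreme = max(z_scores, key=lambda name: abs(z_scores[name]))
print("Most extreme:", most_extreme)
```

Because every value is expressed in standard deviations, the exam score, house price and income can be ranked on the same scale even though their original units differ.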

9. Interpreting Probabilities Using Z-Scores

The Normal Distribution acts as a probability map. Areas under the curve correspond to the likelihood of certain outcomes.

Left-hand side area: Represents the probability of values falling below a certain point (cumulative probability).
Right-hand side area: Represents the probability of values falling above a certain point ( $1 - \text{left-side area}$ ).
Between two values: The area between two specific z-scores represents the proportion of observations within that range.

Important Probability Benchmarks

$z = 0$ : $50\%$ of the area lies below. A value at the mean splits the data in half.
$z = +1$ : Approximately $84\%$ of the area lies below. This value is higher than most; only about $16\%$ of values are higher.
$z = -1$ : Approximately $16\%$ of the area lies below, while about $84\%$ lies above.
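
The left, right and between areas can be computed directly from the standard normal distribution. The sketch below uses `statistics.NormalDist` from Python's standard library; the three helper names are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def area_left(z: float) -> float:
    """P(Z <= z): cumulative probability below z."""
    return standard_normal.cdf(z)

def area_right(z: float) -> float:
    """P(Z > z): one minus the left-side area."""
    return 1 - standard_normal.cdf(z)

def area_between(z_low: float, z_high: float) -> float:
    """P(z_low <= Z <= z_high): difference of two left-side areas."""
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

for z in (0, 1, -1):
    print(f"z = {z:+d}: below = {area_left(z):.4f}, above = {area_right(z):.4f}")
print(f"between -1 and +1: {area_between(-1, 1):.4f}")
```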

10. Using the Normal Distribution in Excel

Excel provides specific functions to calculate probabilities and inverse values without manually using tables.

The NORM.DIST Function

Formula: =NORM.DIST(x, mean, standard_dev, cumulative)

x: The value to evaluate.
mean: The average ( $\mu$ ).
standard_dev: The standard deviation ( $\sigma$ ).
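cumulative: Set to TRUE for cumulative probability ( $P(X \leq x)$ ). FALSE returns the height of the curve at x (the probability density), which is not itself a probability.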

Common Probability Calculations in Excel:

Probability up to a value ( $P(X \leq x)$ ): =NORM.DIST(x, mean, sd, TRUE)
Probability above a value ( $P(X > x)$ ): =1 - NORM.DIST(x, mean, sd, TRUE)
Probability between two values ( $P(a \leq X \leq b)$ ): =NORM.DIST(b, mean, sd, TRUE) - NORM.DIST(a, mean, sd, TRUE)
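
For readers working outside Excel, the three calculations above have a close counterpart in Python's standard library. The wrapper `norm_dist` below is a hypothetical helper written to mirror the Excel argument order, and the figures reuse the exam-score parameters from Case Study 1.

```python
from statistics import NormalDist

def norm_dist(x: float, mean: float, sd: float, cumulative: bool = True) -> float:
    """Rough Python counterpart of Excel's NORM.DIST (helper name is hypothetical)."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(x) if cumulative else dist.pdf(x)

mean, sd = 70, 10  # exam-score figures from Case Study 1

p_up_to_85 = norm_dist(85, mean, sd)                            # P(X <= 85)
p_above_85 = 1 - norm_dist(85, mean, sd)                        # P(X > 85)
p_60_to_85 = norm_dist(85, mean, sd) - norm_dist(60, mean, sd)  # P(60 <= X <= 85)

print(f"P(X <= 85)       = {p_up_to_85:.4f}")
print(f"P(X > 85)        = {p_above_85:.4f}")
print(f"P(60 <= X <= 85) = {p_60_to_85:.4f}")
```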

The NORM.INV Function

Formula: =NORM.INV(p, mean, sd)

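This returns the value $x$ below which a given cumulative probability ( $p$ ) of the distribution lies. For example, it can identify the threshold value for the top $10\%$ of a market by setting $p = 0.90$ , since $90\%$ of values fall below that threshold.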
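
A comparable inverse lookup is available in Python through `NormalDist.inv_cdf`. The sketch below assumes the house-price parameters from Case Study 2 and finds the price that separates the top $10\%$ of the market, which means asking for the point with $90\%$ of values below it.

```python
from statistics import NormalDist

# House-price parameters from Case Study 2.
market = NormalDist(mu=650_000, sigma=50_000)

# Top 10% of the market = the price with 90% of values below it.
threshold_top_10 = market.inv_cdf(0.90)
print(f"Top 10% threshold: ${threshold_top_10:,.0f}")

# Round trip: feeding the threshold back into the cumulative function returns 0.90.
print(round(market.cdf(threshold_top_10), 4))
```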

Example Case: House Prices

Given: mean ( $\mu$ ) = $\$600,000$ , standard deviation ( $\sigma$ ) = $\$120,000$ .
Goal: Probability a house sells for $\$750,000$ or less.
Excel Formula: =NORM.DIST(750000, 600000, 120000, TRUE)
Result: $0.8944$ ( $89.44\%$ ). Here $z = \frac{750000 - 600000}{120000} = 1.25$ , so about $89.44\%$ of houses in the market sell for $\$750,000$ or less.
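
The same probability can be cross-checked by standardising first and then reading the area from the standard normal curve, which ties this example back to the z-score formula. This is a sketch using Python's standard library, not part of the Excel workflow.

```python
from statistics import NormalDist

mean, sd, price = 600_000, 120_000, 750_000

# Route 1: evaluate the cumulative probability on the original scale.
p_direct = NormalDist(mu=mean, sigma=sd).cdf(price)

# Route 2: convert to a z-score, then use the standard normal distribution.
z = (price - mean) / sd
p_via_z = NormalDist().cdf(z)

print(f"z = {z}")
print(f"direct: {p_direct:.4f}, via z-score: {p_via_z:.4f}")
```

Both routes give the same area, because standardising only relabels the horizontal axis.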