Lecture 18 - Comprehensive Study Notes on Normal Distribution, Empirical Rules, and Statistical Inference

Review of Normal Distribution and Deductive Logic

  • Conceptual Framework: Deductive logic in statistics involves taking general knowledge about a population (parameters) to determine the probability of specific events (observations).

  • Scenario Background: Observations of year nine javelin throws follow a normal distribution with the following parameters:
     * Population Mean ($\mu$): $24\,m$
     * Standard Deviation ($\sigma$): $4\,m$

  • The Quantitative Algorithm: There is a three-step instruction set for solving normal distribution problems:
     1. Draw it: Sketch the normal distribution, mark the center ($\mu = 24$), define the spread ($\sigma = 4$), and identify the observation ($x = 33$).
     2. Calculate the Z-score: Determine the standardized score to see how many standard deviations the observation is from the mean.
     3. Use the Probability Table: Look up the Z-score in the statistical tables to find the percentile or probability.
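The calculation steps above (Z-score, then table lookup) can be sketched in Python. This is an illustration only, not part of the lecture: the function names `standard_normal_cdf` and `table_cell` are made up for this example, and the table is emulated with the error function from the standard library rather than read from a printed sheet.

```python
import math

def standard_normal_cdf(z):
    """Area to the left of z under the standard normal curve."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def table_cell(z):
    """Split a z-score into the (row, column) used in a printed Z-table."""
    hundredths = round(abs(z) * 100)   # e.g. 2.25 -> 225
    row = (hundredths // 10) / 10      # units and first decimal, e.g. 2.2
    column = (hundredths % 10) / 100   # second decimal, e.g. 0.05
    sign = -1 if z < 0 else 1
    return sign * row, column

mu, sigma, x = 24, 4, 33               # values from the javelin scenario

z = (x - mu) / sigma                   # step 2: standardize the observation
row, column = table_cell(z)            # step 3: where to look in the table
percentile = round(standard_normal_cdf(z), 4)

print(f"z = {z}, row = {row}, column = {column}, table value = {percentile}")
```

Step 1 (drawing the curve) has no code equivalent here, but it remains the check that the answer is on the expected side of 0.5.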

Standardized Calculations for Specific Javelin Throws

  • Point Observation ($33\,m$):
     * Determining the percentile for a throw of $33\,m$ (the probability of throwing $33\,m$ or less).
     * Z-score Calculation:
         $Z = \frac{x - \mu}{\sigma}$
         $Z = \frac{33 - 24}{4} = \frac{9}{4} = 2.25$
     * Interpretation: The throw is exactly $2.25$ standard deviations above the population mean.
     * Table Lookup: Using the positive Z-value table, locating the row for $2.2$ and column for $0.05$ yields a probability of $0.9878$.
     * Result: This throw is in the $98.78^{th}$ percentile.

  • Point Observation ($22\,m$):
     * The speaker's nephew, Nico, throws consistently at $22\,m$.
     * The process remains the same: Draw the curve, compute the negative Z-score ($Z = \frac{22 - 24}{4} = -0.50$), and find the lower percentile in the negative Z-value table.

  • Tail Probability (Greater than Case):
     * Task: Find the probability that a student throws further than $28\,m$.
     * Logic: Since the total area under the normal curve equals $1$, the probability of a value being higher than an observation is calculated as $1 - \text{percentile}$.
     * Z-score Calculation:
         $Z = \frac{28 - 24}{4} = 1.00$
     * Probability: The table value for $Z = 1.00$ is $0.8413$. The final answer is $1 - 0.8413 = 0.1587$.
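As a cross-check on the three javelin cases above, here is a minimal sketch using `statistics.NormalDist` from the Python standard library. Its `cdf` method returns the area to the left of a value, which is the same quantity the printed table gives. The variable names are illustrative.

```python
from statistics import NormalDist

# Parameters from the scenario: mean 24 m, standard deviation 4 m.
throws = NormalDist(mu=24, sigma=4)

def z_score(x, dist):
    """Number of standard deviations x lies from the mean."""
    return (x - dist.mean) / dist.stdev

p_33 = throws.cdf(33)            # percentile of a 33 m throw
p_22 = throws.cdf(22)            # percentile of a 22 m throw
p_over_28 = 1 - throws.cdf(28)   # upper tail: further than 28 m

print(f"33 m: z = {z_score(33, throws):+.2f}, P(X <= 33) = {p_33:.4f}")
print(f"22 m: z = {z_score(22, throws):+.2f}, P(X <= 22) = {p_22:.4f}")
print(f"28 m: z = {z_score(28, throws):+.2f}, P(X > 28)  = {p_over_28:.4f}")
```

A negative Z-score always gives a percentile below 0.5, which is a quick sanity check against reading the wrong table.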

Probability Between Two Observations

  • Problem Statement: What is the probability that a randomly selected student throws between $20\,m$ and $28\,m$?

  • Strategic Approach: Captured by the notation $P(20 < X < 28)$.
     * The goal is to find the area under the curve between two points. This is achieved by taking the probability of the larger value (everything to the left of $28$) and subtracting the probability of the smaller value (everything to the left of $20$).

  • Step-by-Step Execution:
     1. Z-score for 28:
         $Z_1 = \frac{28 - 24}{4} = 1.00$
         Corresponding Probability ($P_1$): $0.8413$
     2. Z-score for 20:
         $Z_2 = \frac{20 - 24}{4} = -1.00$
         Corresponding Probability ($P_2$) from the negative table: $0.1587$
     3. Subtraction:
         $P_1 - P_2 = 0.8413 - 0.1587 = 0.6826$

  • Conclusion: There is a $68.26\%$ probability that a throw falls between $20\,m$ and $28\,m$.
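The subtraction strategy can be sketched with `statistics.NormalDist` from the Python standard library. Note one small difference from the hand calculation: the table method rounds each probability to four decimals before subtracting (giving 0.6826), whereas computing at full precision and rounding at the end gives 0.6827. Both describe the same area.

```python
from statistics import NormalDist

throws = NormalDist(mu=24, sigma=4)   # parameters from the scenario

lower, upper = 20, 28

# Area to the left of each boundary.
p_upper = throws.cdf(upper)
p_lower = throws.cdf(lower)

# Full-precision answer: subtract first, round last.
p_between = p_upper - p_lower

# Table-style answer: round each lookup to 4 decimals, then subtract.
p_between_table = round(p_upper, 4) - round(p_lower, 4)

print(f"P({lower} < X < {upper}) = {p_between:.4f} (full precision)")
print(f"P({lower} < X < {upper}) = {p_between_table:.4f} (table method)")
```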

The Inverse Normal Problem

  • Definition: An inverse problem occurs when the percentile/probability is provided, and the task is to find the corresponding value of the observation ($x$).

  • Example Scenario: If a student is in the $67^{th}$ percentile, how far did they throw the javelin?

  • Algorithm:
     1. Draw it: Identify the $50\%$ mark (the mean) and shade from the left until reaching approximately $0.67$.
     2. Table Search: Look into the body of the probability tables for the value closest to $0.6700$. In the positive table, a probability of approximately $0.67$ corresponds to a Z-score of $0.44$.
     3. Algebraic Rearrangement: Use the Z-score formula solved for $x$:
         $x = (Z \times \sigma) + \mu$
     4. Calculation:
         $x = (0.44 \times 4) + 24$
         $x = 1.76 + 24 = 25.76\,m$

  • Note on Algebra: The formula rearrangement involves multiplying both sides by $\sigma$ and adding $\mu$ to isolate $x$.
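The inverse lookup can be sketched with the `inv_cdf` method of `statistics.NormalDist` (Python standard library), which plays the role of searching the body of the table. The sketch below shows both routes: finding Z first and rearranging, and asking the javelin distribution directly. Variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 24, 4          # parameters from the scenario
percentile = 0.67          # the student is in the 67th percentile

# Route 1: find Z on the standard normal, then rearrange Z = (x - mu) / sigma.
z = NormalDist().inv_cdf(percentile)
x_from_z = z * sigma + mu

# Route 2: invert the javelin distribution directly.
x_direct = NormalDist(mu, sigma).inv_cdf(percentile)

# Table-style answer using Z rounded to two decimals, as in the notes.
x_table = round(z, 2) * sigma + mu

print(f"z = {z:.4f}")
print(f"x via rearrangement = {x_from_z:.2f} m")
print(f"x via direct inverse = {x_direct:.2f} m")
print(f"x via table-rounded z = {x_table:.2f} m")
```

All three routes agree to two decimal places; the table answer differs only through the rounding of Z.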

The Empirical Rule (68-95-99.7 Rule)

  • Core Principle: Because all normal distributions share the same shape, they can all be transformed into a standard normal distribution ($\mu = 0$, $\sigma = 1$). This allows for a constant set of probabilities:
     * Within $1$ Standard Deviation: Approximately $68\%$ ($68.26\%$ from the table) of all values fall within $\mu \pm 1\sigma$.
     * Within $2$ Standard Deviations: Approximately $95\%$ of all values fall within $\mu \pm 2\sigma$.
     * Within $3$ Standard Deviations: Approximately $99.7\%$ of all values fall within $\mu \pm 3\sigma$.

  • Outlier Definition: A measurement is considered an "extreme outlier" if it falls more than three standard deviations above or below the mean (a probability of roughly $3$ in $1,000$).

  • Extreme Ranges: Four standard deviations either side of the mean cover about $99.99\%$ of the distribution, so a value beyond that range is roughly a $1$ in $10,000$ occurrence (closer to $6$ in $100,000$ when computed exactly).
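The empirical rule percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library on the standard normal distribution; the helper name `within` is made up for this example.

```python
from statistics import NormalDist

standard = NormalDist()   # standard normal: mean 0, standard deviation 1

def within(k):
    """Probability that a value lies within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3, 4):
    inside = within(k)
    outside = 1 - inside
    print(f"within {k} sd: {inside:.5f}   outside: {outside:.5f}")
```

Because every normal distribution standardizes to this one, the same percentages apply to the javelin throws: about 68% of throws fall between 20 m and 28 m, about 95% between 16 m and 32 m, and about 99.7% between 12 m and 36 m.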

Introduction to Statistical Inference

  • The Key Goal: Moving from the world of statistics (sample data) to estimating population parameters. This is the "money" in statistical data analysis.

  • Sampling Error: Error that arises inherently because a sample is measured instead of the entire population. It is an unavoidable uncertainty that statistics seeks to quantify.

  • Inference Definition: Methods and procedures for forming judgments about, and estimating, population parameters using sample statistics calculated from random samples.

  • Confidence and Credibility: Statistics does not give a single certain answer; it provides an estimate with a quantified level of confidence or uncertainty.

Fundamental Terminology and Taxonomy

  • Population: The entire set of units for which we measure a trait.     * Example: All New Zealand citizens eligible to vote; all stars in the Milky Way galaxy.

  • Parameter: A characteristic of the population.
     * Assumptions: Parameters are considered fixed and unknown (unless a full census is taken).
     * Example: The actual proportion of all voters choosing the Labour Party.

  • Sample: A collection of units or a subset taken from the population of interest.
     * Example: A Colmar Brunton poll of $1,000$ New Zealand voters; stars visible via binoculars.

  • Statistic: A characteristic or feature of a sample.
     * Nature: Statistics are not fixed; they vary from sample to sample.
     * Example: The percentage in a specific sample of $1,000$ people who vote Labour.
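The difference between a fixed parameter and a varying statistic can be illustrated with a small simulation. Everything numeric here is an assumption for demonstration: the "true" proportion of 0.40 is hypothetical (in practice it is unknown), and the seed is fixed only so the run is repeatable.

```python
import random

TRUE_PROPORTION = 0.40    # hypothetical population parameter (normally unknown)
SAMPLE_SIZE = 1000        # same size as the poll mentioned in the notes
rng = random.Random(18)   # fixed seed so the simulation is repeatable

def poll(n):
    """Simulate one poll of n voters and return the sample proportion."""
    votes = [1 if rng.random() < TRUE_PROPORTION else 0 for _ in range(n)]
    return sum(votes) / n

# Five independent polls drawn from the same population.
estimates = [poll(SAMPLE_SIZE) for _ in range(5)]

for i, p_hat in enumerate(estimates, start=1):
    print(f"poll {i}: sample proportion = {p_hat:.3f}")
print(f"parameter (fixed): {TRUE_PROPORTION:.3f}")
```

Each poll gives a slightly different sample proportion even though the parameter never changes. That spread is the sampling error that inference sets out to quantify.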

Population Parameters: Means and Proportions

  • Notation Convention: Greek letters are used for population parameters (to look "fancy and smart") while Roman letters/hats are used for statistics.

  • Population Mean ($\mu$):
     * Formula for $N$ units: $\mu = \frac{1}{N} \sum_{i=1}^{N} X_i$
     * Calculable only if every measurement in the population is known.

  • Sample Mean ($\bar{x}$):
     * Formula for $n$ units: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} X_i$
     * Used to estimate $\mu$.

  • Population Proportion ($\pi$):
     * Symbol used is the Greek letter pi ($\pi$). Note: This is notation, not the numeric constant $3.14159$.
     * Structurally identical to a mean, where responses are coded as $1$ (yes/success) or $0$ (no/failure).

  • Sample Proportion ($\hat{p}$):
     * Referred to as "P-hat." Used when dealing with categorical data in a sample.
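To show why a proportion is "structurally identical to a mean", the sketch below applies one averaging function to both a numeric sample and a 0/1-coded sample. The data values are invented for illustration and do not come from the lecture.

```python
# Hypothetical sample of javelin throws, in metres.
throws = [22.0, 26.5, 24.0, 19.5, 28.0]

# Hypothetical sample of 10 survey responses: 1 = yes/success, 0 = no/failure.
responses = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

def sample_mean(values):
    """(1/n) * sum of the observations."""
    return sum(values) / len(values)

x_bar = sample_mean(throws)      # sample mean, estimates mu
p_hat = sample_mean(responses)   # sample proportion, estimates pi

print(f"x-bar = {x_bar} m")
print(f"p-hat = {p_hat}")
```

Averaging the 0/1 codes counts the successes and divides by the sample size, which is exactly the sample proportion.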

Questions & Discussion

  • Question: How can I capture the area between two observations on the graph?

  • Response: The strategy is to find the probability of the larger value and subtract the probability of the smaller value. This "takes out" the unshaded chunk you don't need.

  • Question: What is the Z-score for the throw of $28\,m$?

  • Response: $1.00$. The observation ($28$) minus the mean ($24$) equals $4$, and $4$ divided by the standard deviation ($4$) equals $1$. It is important to express this as $1.00$ for table lookup purposes.

  • Question: How do we rearrange the Z-score formula to find an observation?

  • Response: Multiply both sides by the denominator ($\sigma$) to get $Z \times \sigma = x - \mu$. Then add the mean ($\mu$) to both sides to solve for $x$, giving $x = (Z \times \sigma) + \mu$.