Confidence Intervals and the Student's t-Distribution in Biostatistics

Course Overview and Biostatistics Introduction

  • Course Title: BIOL 376 Biostatistics

  • Instructor: Nate Sutter

  • Topic: Confidence Intervals (Lecture 14)

  • Primary Lecture Objectives:
      * Understand and operationalize confidence intervals used for estimating the population mean ($\mu$) from a sample mean ($\bar{y}$).
      * Review the contributions of Sir R. A. Fisher, a founding figure in biostatistics.
      * Evaluate the work of William Gosset and the development of the "Student's" t-test.

Initial Practical Applications and Sampling Distributions

  • Scenario 1: The Button Population:
      * A bowl contains 10,000 buttons.
      * Population Parameters: $\mu = 40$, $\sigma = 7$.
      * Sampling Method: A sample of $n = 100$ buttons is taken 85 times with replacement, and the sample mean ($\bar{y}$) is calculated for each.
      * Sampling Distribution Properties:
          1. Shape: The distribution of sample means will be approximately normal (following the Central Limit Theorem).
          2. Mean: The mean of the sampling distribution is equal to the population mean ($\mu_{\bar{y}} = 40$).
          3. Standard Deviation (Standard Error): Calculated as $SE = \frac{\sigma}{\sqrt{n}}$. In this case, $SE = \frac{7}{\sqrt{100}} = 0.7$.

  • Scenario 2: Alfalfa Aliquots (Problem 6.2.3 from Samuels, p. 182):
      * Data set of five aliquots analyzed for insoluble ash: 10.0, 8.9, 9.1, 11.7, and 7.9.
      * Required Calculations:
          * Sample Mean ($\bar{y}$): The arithmetic average of the observations.
          * Standard Deviation ($s$): A measure of the dispersion of the data around the sample mean.
          * Standard Error ($SE$): Calculated as $SE = \frac{s}{\sqrt{n}}$.

  • Theoretical Limit Trends:
      * As sample size ($n$) increases:
          * The sample mean ($\bar{y}$) approaches the population mean ($\mu$).
          * The sample standard deviation ($s$) approaches the population standard deviation ($\sigma$).
          * The Standard Error ($SE$) approaches zero.
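
The quantities in the two scenarios and the limit trend above can be checked with a few lines of R. This is a minimal sketch rather than part of the lecture; the variable names (`ash`, `se_buttons`, `n_values`, and so on) are illustrative choices, and the sample sizes in `n_values` are arbitrary.

```r
# Scenario 1: standard error of the mean for the button population
sigma <- 7                        # population standard deviation
n <- 100                          # size of each sample
se_buttons <- sigma / sqrt(n)
se_buttons                        # 0.7

# Scenario 2: alfalfa aliquots (insoluble ash)
ash <- c(10.0, 8.9, 9.1, 11.7, 7.9)
y_bar <- mean(ash)                # sample mean
s <- sd(ash)                      # sample standard deviation (n - 1 denominator)
se_ash <- s / sqrt(length(ash))   # standard error of the mean
round(c(mean = y_bar, sd = s, se = se_ash), 3)   # mean 9.520, sd 1.429, se 0.639

# Limit trend: with sigma fixed, SE shrinks toward zero as n grows
n_values <- c(10, 100, 1000, 10000)   # hypothetical sample sizes
se_by_n <- sigma / sqrt(n_values)
round(se_by_n, 3)                 # 2.214 0.700 0.221 0.070
```

Note that `sd()` divides by $n - 1$, which matches the definition of the sample standard deviation $s$ used in these notes.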

Historical Foundations: R. A. Fisher and William S. Gosset

  • R. A. Fisher (1890-1962):
      * Regarded as a brilliant statistician and population geneticist.
      * He was instrumental in bridging the gap between biology and mathematics.
      * Known for developing "Fisher's exact test."
      * Significant Works: Wrote The Genetical Theory of Natural Selection.
      * Historical Context: He was an ardent proponent of eugenics.

  • William S. Gosset ("Student"):
      * Published his work on the use of t-distributions under the pseudonym "Student."
      * He remained on friendly terms with both Pearson and Fisher even during periods of intense rivalry between them.

Standard Deviation vs. Standard Error

  • Standard Deviation (SD):
      * Refers to the dispersion of a sample ($s$) or a population ($\sigma$).
      * Answers: "What is the 'typical' distance of a single observation from the mean?"

  • Standard Error (SE):
      * Measures the unreliability in the sample mean's ability to estimate the population mean ($\mu$).
      * Answers: "What is the reliability of my sample mean as an estimate of $\mu$?"
      * It specifically measures error due to sampling.

  • Cautions Regarding Experimental Error:
      * $SE$ does not account for all errors. Other sources include:
          * Measurement error.
          * Bias resulting from flawed experimental design.
          * Poorly worded survey questions.

Concepts and Logic of Confidence Intervals

  • The Interval Approach:
      * With a random sample, $\bar{y}$ is an estimate of $\mu$. The reliability of this estimate depends on the dispersion in the sample ($s$) and the sample size ($n$).
      * In real-world applications, researchers usually only have one sample, not repeated samples. An interval is calculated around that single $\bar{y}$ to make a confident statement about $\mu$.

  • Interval Size and Confidence Levels:
      * The interval will be wider if a higher level of confidence (e.g., "super confident") is required that the interval contains $\mu$.
      * The interval will be narrower if the requirement for confidence is lower.

Sampling from a Normal Distribution and t-Distributions

  • Standard Normal Reference (Z):
      * For the mean $\bar{y}$ of a random sample from a normal population, the standardized quantity $Z = \frac{\bar{y} - \mu}{\sigma / \sqrt{n}}$ follows the standard normal distribution (mean 0, standard deviation 1).
      * The Standard Error of the Mean ($SEM$) is $\frac{\sigma}{\sqrt{n}}$.
      * An area of approximately 95% is covered by the mean $\pm$ 2 standard deviations.
      * Specifically, the boundaries $Z = -1.96$ and $Z = 1.96$ contain 95% of the area. Each tail ($Z > 1.96$ and $Z < -1.96$) contains an area of 0.025.
      * The Problem: The calculation requires $\sigma$, which is the population standard deviation and usually unknown.

  • The t-Distribution Solution:
      * To compensate for not knowing $\sigma$, researchers use the sample estimator $s$.
      * Instead of the Z-multiplier (1.96), a multiplier from the appropriate t-distribution is used.
      * Characteristics of t-distributions:
          * Symmetric and bell-shaped.
          * Approaches the normal curve as sample size ($n$) goes to infinity.
          * Has "fatter tails" than the normal curve. This indicates a higher spread, which is the "uncertainty price" paid for using $s$ instead of $\sigma$.

  • The t-distribution resolves the problem of not knowing $\sigma$ when sampling from a normal distribution.
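
To make the comparison between the Z-multiplier and the t-multipliers concrete, the following R sketch looks up both. The degrees of freedom in `df_values` are arbitrary illustrative choices, not values from the lecture.

```r
# Standard normal: central 95% area and the matching multiplier
z_crit <- qnorm(0.975)                    # about 1.96
area_196 <- pnorm(1.96) - pnorm(-1.96)    # about 0.95
area_2sd <- pnorm(2) - pnorm(-2)          # about 0.954 (the "mean +/- 2 SD" rule)

# t multipliers for the same 95% level at several degrees of freedom
df_values <- c(4, 13, 30, 1000)           # hypothetical df values for comparison
t_crit <- qt(0.975, df_values)

# Side-by-side table: t is always larger than z, and shrinks toward z as df grows
round(data.frame(df = df_values, t = t_crit, z = z_crit), 3)
```

The larger t multipliers at small degrees of freedom are the "fatter tails" in numeric form: the interval has to stretch further to capture the same 95% of the area.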

Degrees of Freedom (df)

  • For a sample of size $n$, there are $n - 1$ degrees of freedom.
      * Definition: In a sample, $n - 1$ observation values are free to vary once the mean is calculated.
      * The final ($n^{th}$) observation's value is fixed by the requirement of the mean and the existing $n - 1$ values; it is not free to vary.
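
A small R sketch can illustrate why the last observation is not free to vary. It reuses the alfalfa aliquot data from Scenario 2 purely as an example; the variable names (`known`, `last_value`) are illustrative.

```r
ash <- c(10.0, 8.9, 9.1, 11.7, 7.9)
n <- length(ash)
y_bar <- mean(ash)

# Suppose only the mean and the first n - 1 observations are known
known <- ash[1:(n - 1)]

# The n values must sum to n * y_bar, so the last one is forced
last_value <- n * y_bar - sum(known)
last_value    # 7.9, up to floating-point rounding
```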

Case Study: Monarch Butterfly Confidence Interval

  • Data (from Samuels, p. 185):
      * $n = 14$
      * $\bar{y} = 32.81$
      * $s = 2.47$

  • Procedure:
      1. Calculate Degrees of Freedom: $df = 14 - 1 = 13$.
      2. Locate critical value ($t_{0.025}$) in Table 4 for 13 degrees of freedom. The value is 2.16.
      3. General Formula for CI: $\bar{y} \pm t_{0.025} \times \frac{s}{\sqrt{n}}$
      4. Calculated Interval: $32.81 \pm 2.16 \times \frac{2.47}{\sqrt{14}}$
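
The procedure above can be carried through to a numeric interval in R. This is a sketch under the assumption that `qt()` is used in place of the printed table; `qt(0.975, df)` returns the upper critical value that the notes call $t_{0.025}$. Variable names are illustrative.

```r
n <- 14
y_bar <- 32.81
s <- 2.47

df <- n - 1                    # 13 degrees of freedom
t_crit <- qt(0.975, df)        # upper 0.025 critical value, about 2.16
se <- s / sqrt(n)              # standard error of the mean
margin <- t_crit * se

ci <- c(lower = y_bar - margin, upper = y_bar + margin)
round(ci, 2)                   # lower 31.38, upper 34.24
```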

Interpretation of Confidence Intervals

  • Long-Run Frequency: In the long run, 95% of the confidence intervals (approximately 19 out of 20) calculated from repeated samples will contain the true population mean ($\mu$).

  • Case Example: Bill Length ($\mu = 25.4$, $\sigma = 0.08$):
      * Sample (a), $n = 5$:
          * Example 1: $\bar{y} = 25.419$, $s = 0.085$
          * Example 2: $\bar{y} = 25.320$, $s = 0.056$
          * Example 3: $\bar{y} = 25.392$, $s = 0.091$
          * Example 4: $\bar{y} = 25.451$, $s = 0.064$
      * Sample (b), $n = 20$:
          * Example 1: $\bar{y} = 25.384$, $s = 0.088$
          * Example 2: $\bar{y} = 25.376$, $s = 0.077$
          * Example 3: $\bar{y} = 24.413$, $s = 0.067$
          * Example 4: $\bar{y} = 25.392$, $s = 0.083$

  • Crucial Distinction: Any specific individual sample's confidence interval either DOES or DOES NOT contain $\mu$. We state our confidence in the process that generates the interval.
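
The long-run interpretation can be explored with a small simulation in R. This is a sketch under stated assumptions: the population is taken to be normal with the bill-length parameters above, and the seed and the number of repetitions (`reps`) are arbitrary choices that do not come from the lecture.

```r
set.seed(1)          # arbitrary seed so the run is reproducible
mu <- 25.4
sigma <- 0.08
n <- 5
reps <- 10000        # number of simulated samples (an assumption)

# For each simulated sample, build a 95% CI and record whether it contains mu
covers <- replicate(reps, {
  y <- rnorm(n, mean = mu, sd = sigma)
  margin <- qt(0.975, n - 1) * sd(y) / sqrt(n)
  (mean(y) - margin <= mu) && (mu <= mean(y) + margin)
})

coverage <- mean(covers)   # proportion of intervals that contain mu
coverage                   # expected to be close to 0.95
```

Each element of `covers` is simply TRUE or FALSE, which mirrors the distinction above: an individual interval either contains $\mu$ or it does not, and the 95% describes the procedure across many samples.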

Statistical Software Implementation (R)

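  • Finding Area Under a Curve: Use the qt() function to convert a tail area into the corresponding critical value of a t-distribution (pt() goes the other way, from a t-value to the area below it).
      * Example: qt(0.025, 30) returns the t-value with a lower-tail area of 0.025 for 30 degrees of freedom, which is a negative number. The matching upper-tail critical value is qt(0.975, 30).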

  • Performing t-Tests: Use the t.test() function.
      * Specific help can be found using ?t.test.
      * The function can be used to calculate 90%, 95%, or 99% confidence intervals for a given dataset by setting its conf.level argument (0.90, 0.95, or 0.99).
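
As an illustration of both functions, the sketch below uses the alfalfa aliquot data from Scenario 2; the choice of dataset and the variable names are illustrative and not prescribed by the lecture. It also checks the 95% interval from t.test() against the hand formula $\bar{y} \pm t_{0.025} \times \frac{s}{\sqrt{n}}$.

```r
ash <- c(10.0, 8.9, 9.1, 11.7, 7.9)

# Critical values: qt() takes a lower-tail probability by default
t_lower <- qt(0.025, df = 30)                       # about -2.042
t_upper <- qt(0.025, df = 30, lower.tail = FALSE)   # about  2.042

# Confidence intervals for the mean at three confidence levels
ci_90 <- t.test(ash, conf.level = 0.90)$conf.int
ci_95 <- t.test(ash, conf.level = 0.95)$conf.int
ci_99 <- t.test(ash, conf.level = 0.99)$conf.int

# The 95% interval computed by hand, for comparison with ci_95
n <- length(ash)
manual_95 <- mean(ash) + c(-1, 1) * qt(0.975, n - 1) * sd(ash) / sqrt(n)

# Interval widths grow with the confidence level
widths <- c(diff(as.numeric(ci_90)),
            diff(as.numeric(ci_95)),
            diff(as.numeric(ci_99)))
widths
```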