Confidence Intervals and the Student's t-Distribution in Biostatistics

Course Title: BIOL 376 Biostatistics
Instructor: Nate Sutter
Topic: Confidence Intervals (Lecture 14)
Primary Lecture Objectives: * Understand and operationalize confidence intervals used for estimating the population mean ( $\mu$ ) from a sample mean ( $\bar{y}$ ). * Review the contributions of Sir R. A. Fisher, a founding figure in biostatistics. * Evaluate the work of William Gosset and the development of the "Student’s" t-test.

Scenario 1: The Button Population: * A bowl contains $10,000$ buttons. * Population Parameters: $\mu = 40$ , $\sigma = 7$ . * Sampling Method: A sample of $n = 100$ buttons is taken 85 times with replacement, and the sample mean ( $\bar{y}$ ) is calculated for each. * Sampling Distribution Properties: 1. Shape: The distribution of sample means will be approximately normal (following the Central Limit Theorem). 2. Mean: The mean of the sampling distribution is equal to the population mean ( $\mu_{\bar{y}} = 40$ ). 3. Standard Deviation (Standard Error): Calculated as $SE = \frac{\sigma}{\sqrt{n}}$ . In this case, $SE = \frac{7}{\sqrt{100}} = 0.7$ .
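The three sampling-distribution properties above can be checked with a small simulation. The following is a minimal Python sketch, assuming purely for illustration that button values are normally distributed with $\mu = 40$ and $\sigma = 7$ (the notes do not state the population's shape). The seed and variable names are arbitrary choices.

```python
import math
import random
import statistics

MU, SIGMA = 40, 7        # population parameters from Scenario 1
N, REPEATS = 100, 85     # sample size and number of repeated samples

random.seed(376)         # arbitrary seed so the run is reproducible

# Assumption: each button value is a draw from Normal(MU, SIGMA).
sample_means = [
    statistics.mean([random.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(REPEATS)
]

theoretical_se = SIGMA / math.sqrt(N)            # sigma / sqrt(n)
mean_of_means = statistics.mean(sample_means)    # should sit near MU
sd_of_means = statistics.stdev(sample_means)     # should sit near the SE

print(f"theoretical SE       = {theoretical_se:.2f}")
print(f"mean of sample means = {mean_of_means:.2f}")
print(f"SD of sample means   = {sd_of_means:.2f}")
```

With only 85 repeated samples, the simulated values will be close to, but not exactly, 40 and 0.7.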
Scenario 2: Alfalfa Aliquots (Problem 6.2.3 from Samuels, p. 182): * Data set of five aliquots analyzed for insoluble ash: $10.0$ , $8.9$ , $9.1$ , $11.7$ , and $7.9$ . * Required Calculations: * Sample Mean ( $\bar{y}$ ): The arithmetic average of the observations. * Standard Deviation ( $s$ ): A measure of the dispersion of the data around the sample mean. * Standard Error ( $SE$ ): Calculated as $SE = \frac{s}{\sqrt{n}}$ .
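The three required calculations for the alfalfa data can be sketched in Python as follows. The variable names are illustrative. `statistics.stdev` uses the $n - 1$ denominator, which matches the sample standard deviation $s$.

```python
import math
import statistics

ash = [10.0, 8.9, 9.1, 11.7, 7.9]   # the five aliquots from the problem

n = len(ash)
y_bar = statistics.mean(ash)         # sample mean
s = statistics.stdev(ash)            # sample SD (divides by n - 1)
se = s / math.sqrt(n)                # standard error of the mean

print(f"mean = {y_bar:.2f}, s = {s:.3f}, SE = {se:.3f}")
# mean = 9.52, s = 1.429, SE = 0.639
```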
Theoretical Limit Trends: * As sample size ( $n$ ) increases: * The sample mean ( $\bar{y}$ ) approaches the population mean ( $\mu$ ). * The sample standard deviation ( $s$ ) approaches the population standard deviation ( $\sigma$ ). * The Standard Error ( $SE$ ) approaches zero.
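The last trend follows directly from the formula $SE = \frac{\sigma}{\sqrt{n}}$ . As a rough illustration, the sketch below reuses $\sigma = 7$ from Scenario 1 and a hypothetical set of sample sizes to show the standard error shrinking as $n$ grows.

```python
import math

SIGMA = 7                                   # population SD from Scenario 1
sample_sizes = [10, 100, 1_000, 10_000]     # illustrative values of n

# SE = sigma / sqrt(n): a 100-fold increase in n cuts the SE 10-fold.
standard_errors = [SIGMA / math.sqrt(n) for n in sample_sizes]

for n, se in zip(sample_sizes, standard_errors):
    print(f"n = {n:>6}: SE = {se:.3f}")
```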

R. A. Fisher (1890-1962): * Regarded as a brilliant statistician and population geneticist. * He was instrumental in bridging the gap between biology and mathematics. * Known for developing "Fisher’s exact test." * Significant Works: Wrote The Genetical Theory of Natural Selection. * Historical Context: He was an ardent proponent of eugenics.
William S. Gosset ("Student"): * Published his work on the use of t-distributions under the pseudonym "Student." * He remained on friendly terms with both Pearson and Fisher even during periods of intense rivalry between them.

Standard Deviation (SD): * Refers to the dispersion of a sample ( $s$ ) or a population ( $\sigma$ ). * Answers: "What is the 'typical' distance of a single observation from the mean?"
Standard Error (SE): * Measures the unreliability in the sample mean’s ability to estimate the population mean ( $\mu$ ). * Answers: "What is the reliability of my sample mean as an estimate of $\mu$ ?" * It specifically measures error due to sampling.
Cautions Regarding Experimental Error: * $SE$ does not account for all errors. Other sources include: * Measurement error. * Bias resulting from flawed experimental design. * Poorly worded survey questions.

The Interval Approach: * With a random sample, $\bar{y}$ is an estimate of $\mu$ . The reliability of this estimate depends on the dispersion in the sample ( $s$ ) and the sample size ( $n$ ). * In real-world applications, researchers usually only have one sample, not repeated samples. An interval is calculated around that single $\bar{y}$ to make a confident statement about $\mu$ .
Interval Size and Confidence Levels: * The interval will be wider if a higher level of confidence (e.g., "super confident") is required that the interval contains $\mu$ . * The interval will be narrower if the requirement for confidence is lower.
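The trade-off between confidence and interval width can be made concrete. The sketch below is illustrative only: it assumes a standard error of 0.7 (borrowed from Scenario 1) and uses standard normal multipliers as a stand-in, which is appropriate only when $\sigma$ is known.

```python
from statistics import NormalDist

SE = 0.7                       # assumed standard error (from Scenario 1)
half_widths = {}

for level in (0.90, 0.95, 0.99):
    # Multiplier that leaves (1 - level) / 2 of the area in each tail.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_widths[level] = z * SE
    print(f"{level:.0%} confidence: multiplier = {z:.3f}, "
          f"half-width = {half_widths[level]:.3f}")
```

Higher confidence requires a larger multiplier, so the interval around the same $\bar{y}$ must be wider.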

Standard Normal Reference (Z): * The standard normal distribution has $\mu = 0$ and $\sigma = 1$ . A sample mean is standardized as $Z = \frac{\bar{y} - \mu}{\sigma / \sqrt{n}}$ . * The Standard Error of the Mean ( $SEM$ ) is $\frac{\sigma}{\sqrt{n}}$ . * An area of approximately 95% is covered by the mean $\pm$ 2 standard deviations. * Specifically, the boundaries $Z = -1.96$ and $Z = 1.96$ contain 95% of the area. Each tail ( $Z > 1.96$ and $Z < -1.96$ ) contains an area of $0.025$ . * The Problem: The calculation requires $\sigma$ , the population standard deviation, which is usually unknown.
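The areas quoted for the standard normal curve can be verified numerically. This short Python sketch uses `statistics.NormalDist` from the standard library; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)                   # standard normal distribution

central_area = Z.cdf(1.96) - Z.cdf(-1.96)       # area between -1.96 and 1.96
upper_tail = 1 - Z.cdf(1.96)                    # area to the right of 1.96
z_crit = Z.inv_cdf(0.975)                       # value with 2.5% above it

print(f"central area = {central_area:.4f}")
print(f"upper tail   = {upper_tail:.4f}")
print(f"z critical   = {z_crit:.3f}")
```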
The t-Distribution Solution: * To compensate for not knowing $\sigma$ , researchers use the sample estimator $s$ . * Instead of the Z-multiplier (1.96), a multiplier from the appropriate t-distribution is used. * Characteristics of t-distributions: * Symmetric and bell-shaped. * Approaches the normal curve as sample size ( $n$ ) goes to infinity. * Has "fatter tails" than the normal curve. This indicates a higher spread, which is the "uncertainty price" paid for using $s$ instead of $\sigma$ .
The t-distribution solves the problem of sampling from a normal distribution when $\sigma$ is unknown and must be estimated by $s$ .
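The "fatter tails" property can be illustrated by comparing curve heights far from the center. The sketch below implements the textbook density formula for Student's t-distribution (the helper `t_pdf` is written here for illustration, not taken from the lecture) and evaluates it at $x = 3$ for a few hypothetical degrees of freedom. Comparing heights at a single point is only a proxy for comparing tail areas.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # Work on the log scale so large df does not overflow the gamma function.
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))

normal_density = NormalDist().pdf(3.0)
t_densities = {df: t_pdf(3.0, df) for df in (4, 30, 1000)}

for df, density in t_densities.items():
    print(f"t density at x = 3, df = {df:>4}: {density:.5f}")
print(f"normal density at x = 3:        {normal_density:.5f}")
```

The t curve sits above the normal curve in the tail for small degrees of freedom and converges toward it as the degrees of freedom grow.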

For a sample of size $n$ , there are $n - 1$ degrees of freedom. * Definition: In a sample, $n - 1$ observation values are free to vary once the mean is calculated. * The final ( $n^{th}$ ) observation's value is fixed by the requirement of the mean and the existing $n - 1$ values; it is not free to vary.
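The idea that the final observation is not free to vary can be shown with the alfalfa data from Scenario 2. In this minimal sketch, once the mean and the first $n - 1$ values are known, the last value is recovered by arithmetic alone.

```python
ash = [10.0, 8.9, 9.1, 11.7, 7.9]    # alfalfa data from Scenario 2
n = len(ash)
y_bar = sum(ash) / n                 # sample mean

free_values = ash[:-1]               # the n - 1 values that are free to vary
# The n-th value is forced: the total must equal n * y_bar.
last_value = n * y_bar - sum(free_values)

# Equivalent statement: deviations from the mean always sum to zero.
deviations_sum = sum(y - y_bar for y in ash)

print(f"recovered last value = {last_value:.1f}")
print(f"sum of deviations    = {deviations_sum:.10f}")
```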

Data (from Samuels, p. 185): * $n = 14$ * $\bar{y} = 32.81$ * $s = 2.47$
Procedure: 1. Calculate Degrees of Freedom: $df = 14 - 1 = 13$ . 2. Locate critical value ( $t_{0.025}$ ) in Table 4 for 13 degrees of freedom. The value is $2.16$ . 3. General Formula for CI: $\bar{y} \pm t_{0.025} \times \frac{s}{\sqrt{n}}$ 4. Calculated Interval: $32.81 \pm 2.16 \times \frac{2.47}{\sqrt{14}}$
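The procedure above can be carried through to numeric limits. This Python sketch plugs in the summary statistics from the data and the table value $t_{0.025} = 2.16$ ; the variable names are illustrative.

```python
import math

n, y_bar, s = 14, 32.81, 2.47   # summary statistics from the data
t_crit = 2.16                   # t_{0.025} for 13 df, from the table

df = n - 1                      # degrees of freedom
se = s / math.sqrt(n)           # standard error of the mean
margin = t_crit * se            # half-width of the interval

lower, upper = y_bar - margin, y_bar + margin
print(f"SE = {se:.3f}, margin = {margin:.3f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
# 95% CI: (31.38, 34.24)
```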

Long-Run Frequency: In the long run, 95% of the confidence intervals (approximately 19 out of 20) calculated from repeated samples will contain the true population mean ( $\mu$ ).
Case Example: Bill Length ( $\mu = 25.4$ , $\sigma = 0.08$ ): * Sample (a), n=5: * Example 1: $\bar{y} = 25.419$ , $s = 0.085$ * Example 2: $\bar{y} = 25.320$ , $s = 0.056$ * Example 3: $\bar{y} = 25.392$ , $s = 0.091$ * Example 4: $\bar{y} = 25.451$ , $s = 0.064$ * Sample (b), n=20: * Example 1: $\bar{y} = 25.384$ , $s = 0.088$ * Example 2: $\bar{y} = 25.376$ , $s = 0.077$ * Example 3: $\bar{y} = 24.413$ , $s = 0.067$ * Example 4: $\bar{y} = 25.392$ , $s = 0.083$
Crucial Distinction: Any specific individual sample's confidence interval either DOES or DOES NOT contain $\mu$ . We state our confidence in the process that generates the interval.
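The long-run interpretation can be explored by simulation. The sketch below assumes, for illustration, a normal population with the bill-length parameters ( $\mu = 25.4$ , $\sigma = 0.08$ ) and samples of $n = 5$ . The multiplier 2.776 is the standard table value of $t_{0.025}$ for 4 degrees of freedom; it is not given in these notes. The seed and the number of trials are arbitrary.

```python
import math
import random
import statistics

MU, SIGMA = 25.4, 0.08     # population parameters from the case example
N = 5                      # sample size, as in Sample (a)
T_CRIT = 2.776             # t_{0.025} with 4 df (standard table value)
TRIALS = 10_000            # number of simulated samples (arbitrary)

random.seed(14)            # arbitrary seed for reproducibility

covered = 0
for _ in range(TRIALS):
    # Assumption: observations are draws from Normal(MU, SIGMA).
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    y_bar = statistics.mean(sample)
    margin = T_CRIT * statistics.stdev(sample) / math.sqrt(N)
    # Each individual interval either contains MU or it does not.
    if y_bar - margin <= MU <= y_bar + margin:
        covered += 1

coverage = covered / TRIALS
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The fraction should land near 0.95, which is a statement about the procedure across many samples rather than about any single interval.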

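Finding Area Under a Curve: Use the qt() function to identify critical values for a given tail area. * Example: qt(0.025, 30) returns the lower-tail critical value ( $-2.042$ ) for a tail area of 0.025 with 30 degrees of freedom; qt(0.975, 30) returns the corresponding upper-tail value ( $2.042$ ).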
Performing Continuous Tests: Use the t.test() function. * Specific help can be found using ?t.test. * The function can be used to calculate 90%, 95%, or 99% confidence intervals for a given dataset by setting its conf.level argument (e.g., conf.level = 0.99).