Confidence Intervals and the Student's t-Distribution in Biostatistics
Course Overview and Biostatistics Introduction
Course Title: BIOL 376 Biostatistics
Instructor: Nate Sutter
Topic: Confidence Intervals (Lecture 14)
Primary Lecture Objectives:
* Understand and apply confidence intervals used for estimating the population mean ($\mu$) from a sample mean ($\bar{y}$).
* Review the contributions of Sir R. A. Fisher, a founding figure in biostatistics.
* Evaluate the work of William Gosset and the development of "Student’s" t-test.
Initial Practical Applications and Sampling Distributions
Scenario 1: The Button Population:
* A bowl contains buttons.
* Population Parameters: mean $\mu$ and standard deviation $\sigma$.
* Sampling Method: A sample of $n$ buttons is taken 85 times with replacement, and the sample mean ($\bar{y}$) is calculated for each.
* Sampling Distribution Properties:
    1. Shape: The distribution of sample means will be approximately normal (following the Central Limit Theorem).
    2. Mean: The mean of the sampling distribution is equal to the population mean ($\mu$).
    3. Standard Deviation (Standard Error): Calculated as $\sigma_{\bar{y}} = \sigma / \sqrt{n}$.
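The three sampling-distribution properties can be checked with a short simulation. This is a minimal sketch, not the lecture's own exercise: the population values (`mu`, `sigma`) and the sample size (`n`) are hypothetical stand-ins chosen for illustration, and only the 85 repeated samples come from the scenario.

```r
set.seed(376)                      # arbitrary seed so the run is reproducible

# Hypothetical button population (values assumed for illustration only)
mu      <- 5
sigma   <- 2
buttons <- rnorm(10000, mean = mu, sd = sigma)

n      <- 10                       # assumed sample size
n_reps <- 85                       # number of repeated samples, as in the scenario

# Draw n buttons with replacement, 85 times, and keep each sample mean
sample_means <- replicate(n_reps, mean(sample(buttons, n, replace = TRUE)))

# Theoretical standard error: population SD divided by sqrt(n)
# (sd() on 10,000 values is used as a close stand-in for the population SD)
theoretical_se <- sd(buttons) / sqrt(n)

hist(sample_means)                 # shape: roughly bell-shaped
mean(sample_means)                 # centre: close to the population mean
sd(sample_means)                   # spread: close to theoretical_se
theoretical_se
```

With only 85 sample means the agreement is approximate; raising `n_reps` tightens it.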
Scenario 2: Alfalfa Aliquots (Problem 6.2.3 from Samuels, p. 182):
* Data set: insoluble ash measurements for five aliquots ($n = 5$), as listed in the problem.
* Required Calculations:
    * Sample Mean ($\bar{y}$): The arithmetic average of the observations.
    * Standard Deviation ($s$): A measure of the dispersion of the data around the sample mean.
    * Standard Error ($SE_{\bar{y}}$): Calculated as $s / \sqrt{n}$.
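The three required calculations can be sketched in R as follows. The vector `ash` holds hypothetical numbers chosen for easy arithmetic; they are not the textbook measurements.

```r
# Hypothetical measurements for five aliquots (NOT the Samuels data)
ash <- c(10, 12, 9, 11, 13)

n     <- length(ash)
y_bar <- mean(ash)          # sample mean
s     <- sd(ash)            # sample standard deviation (n - 1 denominator)
se    <- s / sqrt(n)        # standard error of the mean

c(mean = y_bar, sd = s, se = se)
```

Note that `sd()` already divides by $n - 1$, so no manual correction is needed.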
Theoretical Limit Trends:
* As sample size ($n$) increases:
    * The sample mean ($\bar{y}$) approaches the population mean ($\mu$).
    * The sample standard deviation ($s$) approaches the population standard deviation ($\sigma$).
    * The Standard Error ($SE_{\bar{y}}$) approaches zero.
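These trends can be seen by drawing samples of increasing size from one population. The sketch below assumes a normal population with hypothetical parameters (`mu = 50`, `sigma = 10`); any other values would show the same pattern.

```r
set.seed(14)                       # arbitrary seed for reproducibility

mu    <- 50                        # hypothetical population mean
sigma <- 10                        # hypothetical population SD
sizes <- c(10, 100, 1000, 10000)

# For each sample size, draw one sample and record its summary statistics
trend <- t(sapply(sizes, function(n) {
  y <- rnorm(n, mean = mu, sd = sigma)
  c(n = n, y_bar = mean(y), s = sd(y), se = sd(y) / sqrt(n))
}))

trend   # y_bar drifts toward 50, s toward 10, and se toward 0
```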
Historical Foundations: R. A. Fisher and William S. Gosset
R. A. Fisher (1890-1962):
* Regarded as a brilliant statistician and population geneticist.
* He was instrumental in bridging the gap between biology and mathematics.
* Known for developing "Fisher’s exact test."
* Significant Works: Wrote The Genetical Theory of Natural Selection.
* Historical Context: He was an ardent proponent of eugenics.
William S. Gosset ("Student"):
* Published his work on the use of t-distributions under the pseudonym "Student."
* He remained on friendly terms with both Pearson and Fisher even during periods of intense rivalry between them.
Standard Deviation vs. Standard Error
Standard Deviation (SD):
* Refers to the dispersion of a sample ($s$) or a population ($\sigma$).
* Answers: "What is the 'typical' distance of a single observation from the mean?"
Standard Error (SE):
* Measures the unreliability in the sample mean’s ability to estimate the population mean ($\mu$).
* Answers: "What is the reliability of my sample mean as an estimate of $\mu$?"
* It specifically measures error due to sampling.
Cautions Regarding Experimental Error:
* The standard error does not account for all errors. Other sources include:
    * Measurement error.
    * Bias resulting from flawed experimental design.
    * Poorly worded survey questions.
Concepts and Logic of Confidence Intervals
The Interval Approach:
* With a random sample, $\bar{y}$ is an estimate of $\mu$. The reliability of this estimate depends on the dispersion in the sample ($s$) and the sample size ($n$).
* In real-world applications, researchers usually only have one sample, not repeated samples. An interval is calculated around that single $\bar{y}$ to make a confident statement about $\mu$.
Interval Size and Confidence Levels:
* The interval will be wider if a higher level of confidence (e.g., "super confident") is required that the interval contains $\mu$.
* The interval will be narrower if the requirement for confidence is lower.
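The link between confidence level and interval width shows up directly in the multiplier used to build the interval. As a small sketch, assuming a hypothetical sample with 13 degrees of freedom, R's `qt()` gives the t multipliers for three common confidence levels:

```r
df          <- 13                       # hypothetical degrees of freedom
conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided interval: put half of (1 - confidence) in each tail
t_mult <- qt(1 - (1 - conf_levels) / 2, df)
names(t_mult) <- c("90%", "95%", "99%")

round(t_mult, 3)   # multipliers grow with the confidence level
```

Because the interval half-width is the multiplier times the standard error, a larger multiplier means a wider interval for the same data.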
Sampling from a Normal Distribution and t-Distributions
Standard Normal Reference (Z):
* The standard normal distribution has $\mu = 0$ and $\sigma = 1$. A sample mean from a normal population is standardized as $Z = (\bar{y} - \mu) / \sigma_{\bar{y}}$.
* The Standard Error of the Mean ($\sigma_{\bar{y}}$) is $\sigma / \sqrt{n}$.
* An area of approximately 95% is covered by the mean $\pm$ 2 standard deviations.
* Specifically, the boundaries $Z = -1.96$ and $Z = 1.96$ contain 95% of the area. Each tail ($Z > 1.96$ and $Z < -1.96$) contains an area of 0.025.
* The Problem: The calculation requires $\sigma$, which is the population standard deviation and usually unknown.
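The standard normal areas quoted above can be confirmed with R's built-in `qnorm()` (quantile) and `pnorm()` (cumulative area) functions. The variable names are only for illustration.

```r
z_crit      <- qnorm(0.975)               # boundary leaving 0.025 in the upper tail
central     <- pnorm(1.96) - pnorm(-1.96) # area between -1.96 and 1.96
lower_tail  <- pnorm(-1.96)               # area below -1.96
within_2_sd <- pnorm(2) - pnorm(-2)       # area within 2 SDs of the mean

round(c(z_crit = z_crit, central = central,
        lower_tail = lower_tail, within_2_sd = within_2_sd), 4)
```

The last value shows why "2 standard deviations" is only an approximation: it covers slightly more than 95% of the area.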
The t-Distribution Solution:
* To compensate for not knowing $\sigma$, researchers use the sample estimator $s$.
* Instead of the Z-multiplier (1.96), a multiplier from the appropriate t-distribution is used.
* Characteristics of t-distributions:
    * Symmetric and bell-shaped.
    * Approaches the normal curve as sample size ($n$) goes to infinity.
    * Has "fatter tails" than the normal curve. This indicates a higher spread, which is the "uncertainty price" paid for using $s$ instead of $\sigma$.
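Both characteristics, the fatter tails and the convergence to the normal curve, can be seen by comparing critical values. The degrees of freedom below are arbitrary choices for illustration.

```r
df_values <- c(2, 5, 13, 30, 100, 1000)

t_crit <- qt(0.975, df_values)   # t multipliers for a 95% interval
z_crit <- qnorm(0.975)           # the normal multiplier, about 1.96

data.frame(df = df_values,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3))

# Fatter tails: area beyond +/- 1.96 under a t curve with 5 df
tail_area_t5 <- 2 * pt(-1.96, df = 5)
tail_area_t5                     # larger than the 0.05 found under the normal curve
```

Every t multiplier exceeds 1.96, and the gap shrinks as the degrees of freedom grow.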
The t-distribution solves the problem of sampling from a normal distribution when $\sigma$ is unknown.
Degrees of Freedom (df)
For a sample of size $n$, there are $n - 1$ degrees of freedom.
* Definition: In a sample, $n - 1$ observation values are free to vary once the mean is calculated.
* The final ($n$th) observation's value is fixed by the requirement of the mean and the existing $n - 1$ values; it is not free to vary.
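A tiny numeric illustration makes the "not free to vary" idea concrete. The four values in `y` are hypothetical; the point is that the last one can be recovered from the mean and the other three.

```r
y     <- c(4, 7, 9, 12)        # hypothetical sample
n     <- length(y)
y_bar <- mean(y)

# Pretend only the mean and the first n - 1 observations are known
known      <- y[1:(n - 1)]
last_value <- n * y_bar - sum(known)   # the final value is forced

last_value                     # matches y[n]
sum(y - y_bar)                 # deviations from the mean always sum to zero
```

The same constraint, deviations summing to zero, is why the sample standard deviation divides by $n - 1$ rather than $n$.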
Case Study: Monarch Butterfly Confidence Interval
Data (from Samuels, p. 185): sample size $n = 14$, sample mean $\bar{y}$, and sample standard deviation $s$.
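Procedure:
1. Calculate Degrees of Freedom: $df = n - 1 = 13$.
2. Locate the critical value ($t_{0.025}$) in Table 4 for 13 degrees of freedom. The value is 2.160.
3. General Formula for CI: $\bar{y} \pm t_{0.025} \cdot s / \sqrt{n}$
4. Calculated Interval: $\bar{y} \pm 2.160 \cdot s / \sqrt{14}$, evaluated with the sample values.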
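The procedure can be wrapped in a small helper that works from summary statistics. This is a sketch under stated assumptions: `ci_mean()` is a hypothetical function name, and the mean and standard deviation passed to it below are placeholder values, not the Monarch data; only the sample size of 14 follows from the 13 degrees of freedom.

```r
# Confidence interval for a mean from summary statistics (hypothetical helper)
ci_mean <- function(y_bar, s, n, conf_level = 0.95) {
  df     <- n - 1                               # step 1: degrees of freedom
  t_crit <- qt(1 - (1 - conf_level) / 2, df)    # step 2: critical value
  margin <- t_crit * s / sqrt(n)                # step 3: t times the standard error
  c(lower = y_bar - margin, upper = y_bar + margin)
}

# Placeholder summary values (assumed for illustration), n = 14 so df = 13
ci <- ci_mean(y_bar = 30, s = 3, n = 14)
round(ci, 2)
```

Using `qt()` in place of a printed table gives the same critical value without rounding to three decimals.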
Interpretation of Confidence Intervals
Long-Run Frequency: In the long run, 95% of the confidence intervals (approximately 19 out of 20) calculated from repeated samples will contain the true population mean ($\mu$).
Case Example: Bill Length (population $\mu$ and $\sigma$): four example samples at each of two sample sizes, (a) $n = 5$ and (b) $n = 20$.
Crucial Distinction: Any specific individual sample's confidence interval either DOES or DOES NOT contain $\mu$. We state our confidence in the process that generates the interval.
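The long-run interpretation can be checked by simulation: build many 95% intervals from repeated samples and count how often they capture $\mu$. The population parameters and sample size below are hypothetical choices, and the population is assumed to be normal.

```r
set.seed(2024)                 # arbitrary seed for reproducibility

mu     <- 20                   # hypothetical population mean
sigma  <- 4                    # hypothetical population SD
n      <- 5                    # sample size for each interval
n_reps <- 10000                # number of repeated samples

# For each sample, build a 95% t interval and record whether it contains mu
covers <- replicate(n_reps, {
  y      <- rnorm(n, mean = mu, sd = sigma)
  margin <- qt(0.975, n - 1) * sd(y) / sqrt(n)
  (mean(y) - margin <= mu) && (mu <= mean(y) + margin)
})

mean(covers)                   # proportion of intervals containing mu
```

Each entry of `covers` is simply `TRUE` or `FALSE`, which mirrors the distinction above: an individual interval either contains $\mu$ or it does not, and the 95% describes the procedure.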
Statistical Software Implementation (R)
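Finding Area Under a Curve: Use the `qt()` function to identify the critical value that cuts off a given tail area (`pt()` goes the other way, returning the area below a given t value).
* Example: `qt(0.025, 30)` provides the t-critical value for a lower-tail area of 0.025 with 30 degrees of freedom (about -2.042); `qt(0.975, 30)` gives the matching upper value (about 2.042).
Performing Continuous Tests: Use the `t.test()` function.
* Specific help can be found using `?t.test`.
* The function can be used to calculate 90%, 95%, or 99% confidence intervals for a given dataset by setting its `conf.level` argument.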
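As a brief usage sketch, the call below runs `t.test()` on a hypothetical data vector and compares its interval with the one built by hand from `qt()`. The data values are made up for illustration.

```r
y <- c(21, 25, 19, 24, 22, 27, 20)   # hypothetical measurements

# 95% confidence interval from t.test()
result <- t.test(y, conf.level = 0.95)
result$conf.int

# The same interval by hand: mean +/- t multiplier * standard error
n       <- length(y)
by_hand <- mean(y) + c(-1, 1) * qt(0.975, n - 1) * sd(y) / sqrt(n)
by_hand

# Other confidence levels only require changing conf.level
ci_90 <- t.test(y, conf.level = 0.90)$conf.int
ci_99 <- t.test(y, conf.level = 0.99)$conf.int
```

`t.test()` also reports a test of the hypothesis that the mean is zero by default; for interval estimation only the `conf.int` component is needed.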