Inferring Population Means

Chapter 9: Inferring Population Means

The primary objectives for this chapter include understanding the t-model, constructing and interpreting confidence intervals for the mean, and performing hypothesis tests for the mean.

The Central Limit Theorem for Sample Means

Definition: If certain conditions are met, the Central Limit Theorem (CLT) assures us that the distribution of sample means follows an approximately Normal distribution no matter what the shape of the population distribution.
Conditions for CLT: When determining whether the CLT can be applied to analyze data, three essential conditions must be checked: 1. Random Sample and Independence: Each observation must be collected randomly from the population, and the observations must be independent of one another. 2. Large Sample: One of two scenarios must be true: either the population distribution itself is Normal, or the sample size is large (typically $n \ge 25$ is considered sufficient). 3. Big Population: If the sample is collected without replacement, the population must be at least $10$ times larger than the sample size ( $N \ge 10n$ ).
Properties of the Sampling Distribution: If the three conditions are met, random samples of size $n$ drawn from a population with mean $\mu$ and standard deviation $\sigma$ produce a sampling distribution of $\bar{x}$ with: - Mean: $\mu$ - Standard Deviation ( $SD(\bar{x})$ ): $\frac{\sigma}{\sqrt{n}}$ - Shape: Approximately Normal. The larger the sample size, the closer the distribution comes to a Normal distribution. - Note on Population Normality: If the population is Normal to begin with, then the sampling distribution is exactly Normal, regardless of the sample size.
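A small simulation can make these properties concrete. The sketch below is only an illustration: the skewed population (exponential with mean 1 and standard deviation 1), the sample size of 40, and the 5,000 repetitions are all assumptions chosen for the demonstration and do not come from the chapter.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical skewed population: exponential with mean 1 and SD 1.
MU, SIGMA = 1.0, 1.0
N = 40        # sample size (meets the n >= 25 guideline)
REPS = 5000   # number of simulated random samples

# Draw REPS independent samples and record each sample mean.
sample_means = []
for _ in range(REPS):
    sample = [random.expovariate(1.0) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = SIGMA / N ** 0.5   # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (CLT predicts {MU})")
print(f"SD of sample means:   {sd_of_means:.3f} (CLT predicts {predicted_sd:.3f})")
```

Even though the population is strongly skewed, the simulated sample means should center near $\mu$ with a spread close to $\frac{\sigma}{\sqrt{n}}$ .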

Named Example: Weight of Angus Cows

Context: The weight of Angus cows is distributed with a population mean $\mu = 1309\,lbs$ and a population standard deviation $\sigma = 157\,lbs$ .
CLT Application: For a random sample of $n = 100$ Angus cows: - The sample means will average $1309\,pounds$ . - The standard deviation of the sample means ( $SD(\bar{x})$ ) is calculated as:
$SD(\bar{x}) = \frac{\sigma}{\sqrt{n}} = \frac{157}{\sqrt{100}} = 15.7\,pounds$ - The CLT states the sampling distribution will be approximately Normal: $N(1309, 15.7)$ .
Distribution Estimates: For the means of all possible random samples: - $68\%$ will fall between $1293.3$ and $1324.7\,lb$ . - $95\%$ will fall between $1277.6$ and $1340.4\,lb$ . - $99.7\%$ will fall between $1261.9$ and $1356.1\,lb$ .
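The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch that uses only the numbers given above; the variable names are illustrative.

```python
import math

mu = 1309.0      # population mean weight in pounds (from the example)
sigma = 157.0    # population standard deviation in pounds (from the example)
n = 100          # sample size

sd_xbar = sigma / math.sqrt(n)   # SD of the sampling distribution of the mean

# Empirical Rule: roughly 68%, 95%, and 99.7% of sample means fall
# within 1, 2, and 3 standard deviations of mu.
ranges = {}
for label, k in (("68%", 1), ("95%", 2), ("99.7%", 3)):
    ranges[label] = (round(mu - k * sd_xbar, 1), round(mu + k * sd_xbar, 1))

print(f"SD(x-bar) = {sd_xbar:.1f} lb")
for label, (low, high) in ranges.items():
    print(f"{label} of sample means: {low} to {high} lb")
```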

The Student’s t-Distribution

The Challenge: While the CLT is powerful, in practice, we almost never know the population standard deviation ( $\sigma$ ).
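Transition from s to t: Because $\sigma$ is unknown, we estimate it with the sample standard deviation ( $s$ ), which gives the standard error $SE(\bar{x}) = \frac{s}{\sqrt{n}}$ . However, $s$ itself varies from sample to sample, and the Normal model does not account for that extra variability, so using it introduces error, especially for small samples.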
Origins: William Gosset developed new models, one for each sample size ( $n$ ), which provide better accuracy when $\sigma$ is unknown. These are known as the Student’s t-distributions.
Characteristics of the t-Distribution: - It is symmetric and bell-shaped. - It has "thicker tails" than the Normal distribution. - Its specific shape depends on the degrees of freedom (df). - If $df$ is small, the tails are thick; as $df$ increases, the tails become thinner and the distribution approaches the Normal distribution.
Degrees of Freedom: For every sample size $n$ , there is a different t-distribution. The degrees of freedom are calculated as: - $df = n - 1$ - This counts the independent pieces of information left for estimating variability once the sample mean has been computed: because the deviations from $\bar{x}$ must sum to zero, only $n - 1$ of them are free to vary.
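To see the "thicker tails" numerically, the sketch below compares the probability of landing more than 2 units from the center under several t-distributions and under the standard Normal model. The cutoff of 2 and the list of $df$ values are arbitrary choices for illustration, and the tail areas are approximated by Simpson's rule on the t density rather than taken from a statistics library.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3          # half the area lies above 0

def two_sided_tail(cutoff, df):
    """Approximate P(|T| > cutoff)."""
    return 2 * t_upper_tail(cutoff, df)

normal_tail = 2 * NormalDist().cdf(-2.0)
tails = {df: two_sided_tail(2.0, df) for df in (1, 2, 5, 10, 30, 100)}

for df, area in tails.items():
    print(f"df = {df:>3}: P(|T| > 2) is about {area:.4f}")
print(f"Normal  : P(|Z| > 2) is about {normal_tail:.4f}")
```

The tail area shrinks as $df$ grows and approaches the Normal value, which is the convergence described above.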

Answering Questions about Population Means

There are two primary approaches for answering questions about a population mean: 1. Confidence Intervals: Used for estimating the value of a parameter. 2. Hypothesis Tests: Used for deciding whether a parameter’s value matches a specific claim.
These methods are modifications of those used for population proportions, adapted for population means.

Confidence Intervals for a Population Mean

Conditions Check: 1. Random, independent sample. 2. Large sample ( $n \ge 25$ or the population is Normally distributed). 3. Big population (If sampling without replacement, population must be at least $10 \times$ sample size).
The Standardized Sample Mean: When conditions are met, the standardized sample mean follows the t-model with $n - 1$ degrees of freedom: $t = \frac{\bar{x} - \mu}{SE(\bar{x})}$
Standard Error (SE): We estimate the standard deviation of the sampling distribution using: $SE(\bar{x}) = \frac{s}{\sqrt{n}}$
One-Sample t-Interval Formula: $\bar{x} \pm t_{n-1}^* \times SE(\bar{x})$ - The critical value $t_{n-1}^*$ depends on the desired confidence level and the degrees of freedom ( $n - 1$ ). - Models with few degrees of freedom have a larger standard deviation than the Normal model, resulting in wider confidence intervals.

Critical Values and Table Usage

Finding t*: Critical values are found using a t-Table in the row for degrees of freedom and the column for the desired confidence level.
Example: For a sample size of $n = 10$ at a $95\%$ confidence level, $df = 10 - 1 = 9$ . Looking at the table, the critical value is $2.262$ .
Missing Degrees of Freedom: If the specific number of degrees of freedom is not listed in the table, use the next smaller number available in the table.
Examples of Finding Critical Values: - For a $90\%$ confidence interval with $df = 17$ , the critical value is $t^* = 1.740$ . - For a $98\%$ confidence interval with $df = 88$ : because $88$ is not listed in the table, use the next smaller listed value, $df = 80$ , which gives $t^* = 2.374$ .
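The "next smaller degrees of freedom" rule can be sketched as a small lookup. The excerpt below covers only the $98\%$ column; the $df = 80$ entry is the value quoted above, while the $df = 60$ and $df = 100$ entries are standard t-table values added here as assumptions so the lookup has neighbors to choose from. The function name `critical_value` is hypothetical.

```python
from bisect import bisect_right

# Excerpt of the 98% confidence column of a t-table: {df: t*}.
# df = 80 comes from the text; df = 60 and df = 100 are standard
# table values included only for illustration.
T_STAR_98 = {60: 2.390, 80: 2.374, 100: 2.364}

def critical_value(column, df):
    """Return (row_used, t_star), falling back to the next smaller listed df."""
    listed = sorted(column)
    i = bisect_right(listed, df)       # number of listed rows with df <= requested
    if i == 0:
        raise ValueError("df is below the smallest row in this excerpt")
    row = listed[i - 1]                # largest listed df not exceeding the request
    return row, column[row]

row, t_star = critical_value(T_STAR_98, 88)
print(f"df = 88 is not listed; using df = {row}, t* = {t_star}")
```

Rounding the degrees of freedom down gives a slightly larger $t^*$ , so the resulting interval errs on the wide side rather than the narrow side.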

Summary Comparison: z vs. t

t-Distribution Characteristics: - Unimodal and symmetric about its mean. - Heavier tails than the Normal distribution. - Converges to the Normal model as the sample size ( $n$ ), and therefore the degrees of freedom, increases.
When to Use: - If $\sigma$ is known, use the Normal model ( $z$ ). - If $\sigma$ is unknown and estimated using $s$ , use the t-model.

Named Example: College Student Sleep

Objective: Build a $90\%$ Confidence Interval for the mean amount of sleep college students get per night based on a random sample of $25$ students.
Given Data: $n = 25$ , $\bar{x} = 6.64$ , $s = 1.075$ .
Condition Check: Sample is random and independent; population is large; $n = 25$ is sufficient.
Parameters: - $df = 25 - 1 = 24$ - $SE(\bar{x}) = \frac{1.075}{\sqrt{25}} = 0.215\,hours$ - Critical value $t_{24}^*$ for $90\%$ confidence is $1.711$ .
Calculation: - Margin of Error ( $ME$ ) = $1.711 \times 0.215 = 0.368\,hours$ - Interval: $6.64 \pm 0.368$ - $90\%$ CI: $(6.272, 7.008)$
Conclusion: We are $90\%$ confident that the true population mean number of hours college students sleep is between $6.272$ and $7.008$ hours.
Technical Note on Interpretation: It is correct to say " $90\%$ of all possible samples will produce intervals that actually do contain the true mean sleep," but the "I am $90\%$ confident" phrasing is more personal and less technical for general readers.
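The sleep interval can be reproduced with a short helper. This is a minimal sketch: the function name `t_interval` is hypothetical, and the critical value $1.711$ is passed in from the table rather than computed.

```python
import math

def t_interval(xbar, s, n, t_star):
    """One-sample t-interval: xbar +/- t_star * s / sqrt(n)."""
    se = s / math.sqrt(n)          # standard error of the mean
    me = t_star * se               # margin of error
    return se, me, (xbar - me, xbar + me)

# Values from the college sleep example; t* is the 90% table value for df = 24.
se, me, (low, high) = t_interval(xbar=6.64, s=1.075, n=25, t_star=1.711)

print(f"SE = {se:.3f} hours, ME = {me:.3f} hours")
print(f"90% CI: ({low:.3f}, {high:.3f})")
```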

Named Example: Highway Speeds

Scenario: A random sample of $30$ cars has a mean speed of $63.3\,mph$ with a standard deviation of $5.23\,mph$ .
Task: Find the $95\%$ confidence interval.
Condition Verification: Random/independent sample; population is large ( $300+$ cars); sample size is at least $25$ .
Calculation (using technology): - Mean: $63.3$ - $SE$ : $0.95486299$ - $df$ : $29$ - Lower limit: $61.347086$ - Upper limit: $65.252914$ - Confidence Interval: $(61.35, 65.25)$
Interpretation: We are $95\%$ confident the mean speed of all cars is between $61.35$ and $65.25\,mph$ .
Plausibility Check: Is it plausible the mean speed is $67\,mph$ ? No, because $67$ is not contained within our confidence interval.
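The technology output and the plausibility check can be approximated by hand. The sketch below assumes $t^* = 2.045$ , the standard table value for $95\%$ confidence with $df = 29$ ; that value is not quoted in the example, so treat it as an assumption. The helper `is_plausible` is a hypothetical name.

```python
import math

xbar, s, n = 63.3, 5.23, 30    # sample mean, sample SD, sample size (mph)
t_star = 2.045                 # assumed table value: 95% confidence, df = 29

se = s / math.sqrt(n)
low, high = xbar - t_star * se, xbar + t_star * se

def is_plausible(value):
    """A claimed mean is plausible if it lies inside the confidence interval."""
    return low <= value <= high

print(f"SE = {se:.4f}, 95% CI: ({low:.2f}, {high:.2f})")
print(f"Is 67 mph plausible? {is_plausible(67)}")
```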

Named Example: Movie Watching Habits

Scenario: Random sample of $35$ students; $\bar{x} = 4.14$ movies, $s = 10.02$ .
Task: Construct a $90\%$ confidence interval.
Results: - $SE$ : $1.6936891$ - $df$ : $34$ - $90\%\,CI$ : $(1.28, 7.00)$ .
Comparative Logic: A $95\%$ confidence interval for the same data would be wider than the $90\%$ interval, because a higher confidence level requires a larger $t^*$ multiplier and therefore a larger margin of error.
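The effect of the confidence level on interval width can be illustrated with the movie data. The two critical values below for $df = 34$ ( $1.691$ for $90\%$ and $2.032$ for $95\%$ ) are standard t-table values that do not appear in the example, so they are assumptions of this sketch.

```python
import math

xbar, s, n = 4.14, 10.02, 35   # values from the movie example
se = s / math.sqrt(n)

# Assumed t* values for df = 34 from a standard t-table.
T_STAR_DF34 = {90: 1.691, 95: 2.032}

intervals = {}
for level, t_star in T_STAR_DF34.items():
    me = t_star * se                       # margin of error at this level
    intervals[level] = (xbar - me, xbar + me)

width = {level: high - low for level, (low, high) in intervals.items()}

for level, (low, high) in intervals.items():
    print(f"{level}% CI: ({low:.2f}, {high:.2f}), width {width[level]:.2f}")
```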

Hypothesis Testing for the Mean

Four-Step Process: 1. Hypothesize: State the null ( $H_0$ ) and alternative ( $H_a$ ) hypotheses about the population parameter. 2. Prepare: Choose a significance level ( $\alpha$ ), select the test statistic, and check conditions/assumptions. 3. Compute to Compare: Calculate the test statistic and the resulting p-value. 4. Interpret: Decide whether to reject the null hypothesis and state the conclusion in context.
Test Statistic for One-Sample t-Test: $t = \frac{\bar{x} - \mu_0}{SE_{\text{EST}}}$ - where $SE_{\text{EST}} = \frac{s}{\sqrt{n}}$ - If conditions hold, this follows a t-distribution with $df = n - 1$ .
One- and Two-sided Alternative Hypotheses: The choice of $H_a$ (either directed, e.g., $\mu > \mu_0$ , or undirected, e.g., $\mu \neq \mu_0$ ) determines how the p-value is calculated (one tail vs. two tails).

Named Example: Nursing Staff Experience

Situation: In $2010$ , mean experience was $14.3\,years$ . A survey of $35$ nurses shows $\bar{x} = 18.37\,years$ and $s = 11.12\,years$ . Have years of experience increased?
Significance Level: $\alpha = 0.05$
Hypotheses: - $H_0: \mu = 14.3$ - $H_a: \mu > 14.3$
Calculation: - $SE_{\text{EST}} = \frac{11.12}{\sqrt{35}} = 1.88$ - $t = \frac{18.37 - 14.3}{1.88} = 2.165$ - $df = 34$
Result: p-value = $0.019$ .
Conclusion: Since the p-value ( $0.019$ ) is less than the significance level ( $0.05$ ), we reject $H_0$ . Evidence suggests the mean experience among nursing staff has indeed increased.
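The test statistic and p-value for this example can be approximated without a statistics package. The sketch below is one possible approach, not the method used to produce the result above: it integrates the t density with Simpson's rule, and the function names (`t_pdf`, `t_upper_tail`, `one_sample_t`) are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of the Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def one_sample_t(xbar, mu0, s, n, alternative="greater"):
    """Return (se, t, df, p_value) for a one-sample t-test."""
    se = s / math.sqrt(n)
    t = (xbar - mu0) / se
    df = n - 1
    upper = t_upper_tail(abs(t), df)        # P(T > |t|)
    if alternative == "two-sided":
        p = 2 * upper
    elif alternative == "greater":
        p = upper if t >= 0 else 1 - upper
    elif alternative == "less":
        p = upper if t <= 0 else 1 - upper
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return se, t, df, p

# Nursing example: H0: mu = 14.3 versus Ha: mu > 14.3, alpha = 0.05.
se, t, df, p = one_sample_t(xbar=18.37, mu0=14.3, s=11.12, n=35)
alpha = 0.05
print(f"SE = {se:.2f}, t = {t:.3f}, df = {df}, p-value is about {p:.3f}")
print("Reject H0" if p < alpha else "Fail to reject H0")
```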

Named Example: Hockey Attendance

Situation: $2010$ average attendance was $17,072$ . A sample of $30$ games in $2014$ shows $\bar{x} = 18,104$ and $s = 1203.5$ . Has attendance changed?
Hypotheses: - $H_0: \mu = 17072$ - $H_a: \mu \neq 17072$
Calculation: - $SE = 219.728$ - $t\text{-Stat} = 4.70$ - $p\text{-value} < 0.0001$
Conclusion: The p-value is smaller than any conventional significance level (for example, $\alpha = 0.05$ ), so we reject $H_0$ . The evidence suggests that mean attendance has changed since $2010$ .
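For a two-sided alternative, the p-value counts both tails, so the upper-tail area beyond $|t|$ is doubled. The sketch below reworks the hockey numbers under that rule; the tail area is approximated by Simpson's rule on the t density, and the choice of $\alpha = 0.05$ is an assumption, since the example does not state a significance level.

```python
import math

def t_pdf(x, df):
    """Density of the Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

xbar, mu0, s, n = 18104, 17072, 1203.5, 30   # values from the hockey example
alpha = 0.05                                 # assumed significance level

se = s / math.sqrt(n)
t_stat = (xbar - mu0) / se
df = n - 1
p_two_sided = 2 * t_upper_tail(abs(t_stat), df)   # both tails count

print(f"SE = {se:.3f}, t = {t_stat:.2f}, df = {df}")
print(f"two-sided p-value is about {p_two_sided:.6f}")
print("Reject H0" if p_two_sided < alpha else "Fail to reject H0")
```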

Named Example: Weight Loss Study

Scenario: $76$ subjects on a low-fat diet for $12\,months$ . $\bar{x} = 2.2\,kg$ loss, $s = 6.1\,kg$ . Is mean weight loss greater than $0$ ?
Hypotheses: - $H_0: \mu = 0$ - $H_a: \mu > 0$
Calculation: - $t = \frac{2.2 - 0}{6.1 / \sqrt{76}} = 3.144$ - $df = 76 - 1 = 75$ .
Finding P-value via Table: - $df = 75$ is not in the table; use $df = 60$ , the next smaller value listed. - Look across that row for the two table values that bracket $t = 3.144$ . These are $2.915$ (upper-tail probability $0.0025$ ) and $3.232$ (upper-tail probability $0.001$ ). - Because $3.144$ lies between them, the one-sided p-value is between $0.001$ and $0.0025$ .
Final Result: Accurate p-value using technology is $p = 0.0012$ . Reject $H_0$ at $\alpha = 0.05$ . Mean weight loss is significantly greater than $0$ .
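The table-bracketing step can be sketched as a small search across one row of the t-table. In the row below, the entries $2.915$ and $3.232$ come from the example; the remaining $df = 60$ entries are standard table values included as assumptions so the row is complete. The function name `bracket_p_value` is hypothetical.

```python
import math

# One row of a t-table for df = 60: (upper-tail probability, t value),
# ordered by increasing t. Only 2.915 and 3.232 are quoted in the text.
ROW_DF_60 = [
    (0.05, 1.671), (0.025, 2.000), (0.01, 2.390), (0.005, 2.660),
    (0.0025, 2.915), (0.001, 3.232), (0.0005, 3.460),
]

def bracket_p_value(t_stat, row):
    """Return (lower, upper) bounds on the one-sided p-value for t_stat >= 0."""
    lower, upper = 0.0, 0.5
    for prob, t_value in row:
        if t_value <= t_stat:
            upper = prob       # table t at or below ours: p is at most this
        else:
            lower = prob       # first table t above ours: p is at least this
            break
    return lower, upper

t_stat = (2.2 - 0) / (6.1 / math.sqrt(76))   # weight loss example
lower, upper = bracket_p_value(t_stat, ROW_DF_60)
print(f"t = {t_stat:.3f}; p-value is between {lower} and {upper}")
```

The technology value $p = 0.0012$ falls inside this bracket, as expected.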