Inferring Population Means

Chapter 9: Inferring Population Means

  • The primary objectives for this chapter include understanding the t-model, constructing and interpreting confidence intervals for the mean, and performing hypothesis tests for the mean.

The Central Limit Theorem for Sample Means

  • Definition: If certain conditions are met, the Central Limit Theorem (CLT) assures us that the distribution of sample means follows an approximately Normal distribution no matter what the shape of the population distribution.

  • Conditions for CLT: When determining whether the CLT can be applied to analyze data, three essential conditions must be checked:
    1. Random Sample and Independence: Each observation must be collected randomly from the population, and the observations must be independent of one another.
    2. Large Sample: One of two scenarios must be true: either the population distribution itself is Normal, or the sample size is large (typically $n \ge 25$ is considered sufficient).
    3. Big Population: If the sample is collected without replacement, the population must be at least 10 times larger than the sample size ($N \ge 10n$).
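The three conditions can be written as a small checklist function. This is a minimal sketch, not a standard routine: the function name and its parameters are hypothetical, and the first condition cannot be computed from data, so the caller has to assert it.

```python
def clt_conditions_met(n, population_size=None, population_is_normal=False,
                       random_and_independent=True, with_replacement=False):
    """Return True when all three CLT conditions listed above hold."""
    # Condition 1 depends on how the data were collected; the caller asserts it.
    if not random_and_independent:
        return False
    # Condition 2: Normal population, or a large sample (n >= 25).
    large_sample = population_is_normal or n >= 25
    # Condition 3: only matters when sampling without replacement.
    # population_size=None is treated here as "effectively unlimited".
    big_population = (with_replacement or population_size is None
                      or population_size >= 10 * n)
    return large_sample and big_population


print(clt_conditions_met(n=30, population_size=300))        # 300 >= 10 * 30
print(clt_conditions_met(n=30, population_size=250))        # 250 < 10 * 30
print(clt_conditions_met(n=10, population_is_normal=True))  # small n, Normal population
```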

  • Properties of the Sampling Distribution: If the three conditions are met, the mean $\bar{x}$ of a random sample of size $n$ drawn from a population with mean $\mu$ and standard deviation $\sigma$ has a sampling distribution with:
    - Mean: $\mu$
    - Standard Deviation: $SD(\bar{x}) = \frac{\sigma}{\sqrt{n}}$
    - Shape: Approximately Normal. The larger the sample size, the closer the distribution becomes to a Normal distribution.
    - Note on Population Normality: If the population is Normal to begin with, then the sampling distribution is exactly Normal, regardless of the sample size.

Named Example: Weight of Angus Cows

  • Context: The weight of Angus cows is distributed with a population mean $\mu = 1309$ lb and a population standard deviation $\sigma = 157$ lb.

  • CLT Application: For a random sample of $n = 100$ Angus cows:
    - The sample means will average 1309 lb.
    - The standard deviation of the sample means is $SD(\bar{x}) = \frac{\sigma}{\sqrt{n}} = \frac{157}{\sqrt{100}} = 15.7$ lb.
    - The CLT states the sampling distribution will be approximately Normal: $N(1309, 15.7)$.

  • Distribution Estimates: By the Empirical Rule (1, 2, and 3 standard deviations from the mean), for the means of all possible random samples of 100 cows:
    - About 68% will fall between 1293.3 and 1324.7 lb.
    - About 95% will fall between 1277.6 and 1340.4 lb.
    - About 99.7% will fall between 1261.9 and 1356.1 lb.
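The Angus cow figures can be reproduced with a few lines of Python. This is a sketch using only the standard library; the function names are illustrative and not part of the original notes.

```python
import math


def sampling_distribution(mu, sigma, n):
    """Mean and standard deviation of the sampling distribution of x-bar."""
    return mu, sigma / math.sqrt(n)


def empirical_rule_ranges(center, sd):
    """Ranges holding about 68%, 95%, and 99.7% of a Normal model."""
    return {label: (round(center - k * sd, 1), round(center + k * sd, 1))
            for label, k in (("68%", 1), ("95%", 2), ("99.7%", 3))}


center, sd = sampling_distribution(mu=1309, sigma=157, n=100)
ranges = empirical_rule_ranges(center, sd)
print(sd)  # 15.7
for label, (low, high) in ranges.items():
    print(label, low, high)
```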

The Student’s t-Distribution

  • The Challenge: While the CLT is powerful, in practice we almost never know the population standard deviation ($\sigma$).

  • Transition from s to t: The sample standard deviation $s$ can stand in for $\sigma$, giving the standard error $SE(\bar{x}) = \frac{s}{\sqrt{n}}$. But $s$ itself varies from sample to sample, so pairing this estimate with a Normal model understates the uncertainty, especially for small samples.

  • Origins: William Gosset developed new models, one for each sample size $n$, which provide better accuracy when $\sigma$ is unknown. These are known as the Student’s t-distributions.

  • Characteristics of the t-Distribution:
    - It is symmetric and bell-shaped.
    - It has "thicker tails" than the Normal distribution.
    - Its specific shape depends on the degrees of freedom ($df$).
    - If $df$ is small, the tails are thick; as $df$ increases, the tails become thinner and the distribution approaches the Normal distribution.

  • Degrees of Freedom: For every sample size $n$, there is a different t-distribution. The degrees of freedom are calculated as:
    - $df = n - 1$
    - This represents the number of independent quantities left after the parameters have been estimated. Once $\bar{x}$ is computed, the $n$ deviations from it must sum to zero, so only $n - 1$ of them are free to vary when estimating $s$.
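To see the thicker tails numerically, the sketch below compares the probability of landing more than 2 units above the center under t-models with different degrees of freedom and under the Normal model. The t tail area is approximated with Simpson's rule on the t density; this is one simple numerical approach chosen for illustration, not how statistical software necessarily computes it, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


for df in (3, 10, 30):
    print("df =", df, "tail beyond 2:", round(t_upper_tail(2.0, df), 4))
print("Normal tail beyond 2:", round(1 - NormalDist().cdf(2.0), 4))
```

The tail area shrinks toward the Normal value as $df$ grows, which is the convergence described in the list above.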

Answering Questions about Population Means

  • There are two primary approaches for answering questions about a population mean:
    1. Confidence Intervals: Used for estimating the value of a parameter.
    2. Hypothesis Tests: Used for deciding whether a parameter’s value matches a specific claim.

  • These methods are modifications of those used for population proportions, adapted for population means.

Confidence Intervals for a Population Mean

  • Conditions Check:
    1. Random, independent sample.
    2. Large sample ($n \ge 25$, or the population is Normally distributed).
    3. Big population (if sampling without replacement, the population must be at least 10 times the sample size).

  • The Standardized Sample Mean: When conditions are met, the standardized sample mean follows the t-model with $n - 1$ degrees of freedom: $t = \frac{\bar{x} - \mu}{SE(\bar{x})}$

  • Standard Error (SE): We estimate the standard deviation of the sampling distribution using: $SE(\bar{x}) = \frac{s}{\sqrt{n}}$

  • One-Sample t-Interval Formula: $\bar{x} \pm t_{n-1}^* \times SE(\bar{x})$
    - The critical value $t_{n-1}^*$ depends on the desired confidence level and the degrees of freedom ($n - 1$).
    - Models with few degrees of freedom have a larger standard deviation than the Normal model, resulting in wider confidence intervals.
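The interval formula follows from the standardized sample mean by rearranging an inequality. In the sketch below, $C$ is an assumed symbol for the confidence level (for example, 0.95); it does not appear in the original notes.

```latex
P\left(-t_{n-1}^* \le \frac{\bar{x} - \mu}{SE(\bar{x})} \le t_{n-1}^*\right) = C
\quad\Longrightarrow\quad
P\left(\bar{x} - t_{n-1}^* \, SE(\bar{x}) \le \mu \le \bar{x} + t_{n-1}^* \, SE(\bar{x})\right) = C
```

Multiplying through by $SE(\bar{x})$ and isolating $\mu$ gives the two endpoints $\bar{x} \pm t_{n-1}^* \times SE(\bar{x})$.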

Critical Values and Table Usage

  • Finding t*: Critical values are found using a t-Table in the row for degrees of freedom and the column for the desired confidence level.

  • Example: For a sample size of $n = 10$ at a 95% confidence level, $df = 10 - 1 = 9$. Looking at the table, the critical value is $t_9^* = 2.262$.

  • Missing Degrees of Freedom: If the specific number of degrees of freedom is not listed in the table, use the next smaller number available in the table.

  • Examples of Finding Critical Values:
    - For a 90% confidence interval with $df = 17$, the critical value is $t^* = 1.740$ (based on provided table segments).
    - For a 98% confidence interval with $df = 88$: since 88 is not listed in the table, use the next smaller listed value, such as $df = 80$. Based on the table provided, the critical value for 98% at $df = 80$ is $t^* = 2.374$.
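The "next smaller df" rule can be sketched as a lookup. The list of printed rows below is an assumption about a typical t-table (check the table actually in use), and the function and variable names are hypothetical; the three critical values are the ones quoted in this section.

```python
# Hypothetical list of the df rows printed in a t-table (an assumption).
LISTED_DF = list(range(1, 31)) + [40, 50, 60, 80, 100, 1000]


def table_row_for(df, listed=LISTED_DF):
    """Return the listed df to use: the exact row, else the next smaller one."""
    candidates = [row for row in listed if row <= df]
    if not candidates:
        raise ValueError("df is smaller than every listed row")
    return max(candidates)


# Critical values quoted in the text, keyed by (listed df, confidence level).
T_STAR = {(9, 0.95): 2.262, (17, 0.90): 1.740, (80, 0.98): 2.374}

row = table_row_for(88)
print(row, T_STAR[(row, 0.98)])                    # 80 2.374
print(table_row_for(17), T_STAR[(17, 0.90)])       # 17 1.74
```

Rounding df down rather than up gives a slightly larger critical value, so the resulting interval errs on the wide side.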

Summary Comparison: z vs. t

  • t-Distribution Characteristics:
    - Unimodal and symmetric about its mean.
    - Thicker tails than the Normal distribution.
    - Converges to the Normal model as the sample size $n$ grows.

  • When to Use:
    - If $\sigma$ is known, use the Normal model ($z$).
    - If $\sigma$ is unknown and estimated using $s$, use the t-model.

Named Example: College Student Sleep

  • Objective: Build a 90% confidence interval for the mean amount of sleep college students get per night, based on a random sample of 25 students.

  • Given Data: $n = 25$, $\bar{x} = 6.64$ hours, $s = 1.075$ hours.

  • Condition Check: Sample is random and independent; population is large; $n = 25$ is sufficient.

  • Parameters:
    - $df = 25 - 1 = 24$
    - $SE(\bar{x}) = \frac{1.075}{\sqrt{25}} = 0.215$ hours
    - Critical value $t_{24}^*$ for 90% confidence is 1.711.

  • Calculation:
    - Margin of Error: $ME = 1.711 \times 0.215 = 0.368$ hours
    - Interval: $6.64 \pm 0.368$
    - 90% CI: $(6.272, 7.008)$

  • Conclusion: We are 90% confident that the true population mean number of hours college students sleep per night is between 6.272 and 7.008 hours.

  • Technical Note on Interpretation: It is correct to say "90% of all possible samples will produce intervals that actually do contain the true mean sleep," but the "I am 90% confident" phrasing is more personal and less technical for general readers.
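The "90% of all possible samples" reading can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is taken to be Normal with made-up values $\mu = 6.6$ and $\sigma = 1.1$ (not from the study), while $n = 25$ and $t^* = 1.711$ come from the example. The function name is hypothetical.

```python
import math
import random


def coverage_rate(mu, sigma, n, t_star, trials=4000, seed=1):
    """Fraction of simulated t-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        se = s / math.sqrt(n)
        if xbar - t_star * se <= mu <= xbar + t_star * se:
            hits += 1
    return hits / trials


# mu and sigma are hypothetical population values chosen for illustration;
# t_star = 1.711 is the 90% critical value for df = 24 used in the example.
rate = coverage_rate(mu=6.6, sigma=1.1, n=25, t_star=1.711)
print(rate)  # should land near 0.90
```

Each simulated sample gives a different interval; roughly 90% of them capture the fixed true mean, which is what the confidence level describes.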

Named Example: Highway Speeds

  • Scenario: A random sample of 30 cars has a mean speed of 63.3 mph with a standard deviation of 5.23 mph.

  • Task: Find the 95% confidence interval.

  • Condition Verification: Random/independent sample; population is large (300+ cars); sample size is at least 25.

  • Calculation (using technology):
    - Mean: 63.3
    - $SE$: 0.95486299
    - $df$: 29
    - Lower limit: 61.347086
    - Upper limit: 65.252914
    - Confidence Interval: $(61.35, 65.25)$
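The notes do not say which software produced these figures. As one way to reproduce them with only the Python standard library, the sketch below approximates the t-model's cumulative probability with Simpson's rule and solves for the critical value by bisection. The helper names are hypothetical, and a statistics package would normally supply the critical value directly.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(t, df, steps=1000):
    """Approximate P(T <= t) for t >= 0 via Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(confidence, df):
    """Solve P(-t* <= T <= t*) = confidence for t* by bisection."""
    target = 0.5 + confidence / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_interval(xbar, s, n, confidence):
    """One-sample t-interval: returns (lower, upper, standard error)."""
    se = s / math.sqrt(n)
    margin = t_critical(confidence, n - 1) * se
    return xbar - margin, xbar + margin, se


lower, upper, se = t_interval(xbar=63.3, s=5.23, n=30, confidence=0.95)
print(round(se, 4), round(lower, 2), round(upper, 2))  # 0.9549 61.35 65.25
```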

  • Interpretation: We are 95% confident the mean speed of all cars is between 61.35 and 65.25 mph.

  • Plausibility Check: Is it plausible the mean speed is 67 mph? No, because 67 is not contained within our confidence interval.

Named Example: Movie Watching Habits

  • Scenario: Random sample of 35 students; $\bar{x} = 4.14$ movies, $s = 10.02$.

  • Task: Construct a 90% confidence interval.

  • Results:
    - $SE$: 1.6936891
    - $df$: 34
    - 90% CI: $(1.28, 7.00)$

  • Comparative Logic: A 95% confidence interval for the same data would be wider than the 90% interval because higher confidence requires a larger $t^*$ multiplier.
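The comparison can be made concrete with a worked calculation. The critical values 1.691 and 2.032 used below are standard t-table values for $df = 34$ at 90% and 95% confidence; they are not quoted in the notes, which report only the software result.

```latex
\begin{aligned}
90\%:\quad ME &= t_{34}^* \times SE(\bar{x}) = 1.691 \times 1.6937 \approx 2.864,
  & 4.14 \pm 2.864 &\approx (1.28,\ 7.00) \\
95\%:\quad ME &= t_{34}^* \times SE(\bar{x}) = 2.032 \times 1.6937 \approx 3.442,
  & 4.14 \pm 3.442 &\approx (0.70,\ 7.58)
\end{aligned}
```

The standard error is the same in both rows; only the multiplier changes, and the 95% interval is wider by about 0.58 movies on each side.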

Hypothesis Testing for the Mean

  • Four-Step Process:
    1. Hypothesize: State the null ($H_0$) and alternative ($H_a$) hypotheses about the population parameter.
    2. Prepare: Choose a significance level ($\alpha$), select the test statistic, and check conditions/assumptions.
    3. Compute to Compare: Calculate the test statistic and the resulting p-value.
    4. Interpret: Decide whether to reject the null hypothesis and state the conclusion in context.

  • Test Statistic for One-Sample t-Test: $t = \frac{\bar{x} - \mu_0}{SE_{\text{EST}}}$
    - where $SE_{\text{EST}} = \frac{s}{\sqrt{n}}$
    - If conditions hold, this follows a t-distribution with $df = n - 1$.

  • One- and Two-sided Alternative Hypotheses: The choice of $H_a$, either one-sided (e.g., $\mu > \mu_0$) or two-sided (e.g., $\mu \neq \mu_0$), determines how the p-value is calculated (one tail vs. two tails).

Named Example: Nursing Staff Experience

  • Situation: In 2010, mean experience was 14.3 years. A survey of 35 nurses shows $\bar{x} = 18.37$ years and $s = 11.12$ years. Have years of experience increased?

  • Significance Level: $\alpha = 0.05$

  • Hypotheses:
    - $H_0: \mu = 14.3$
    - $H_a: \mu > 14.3$

  • Calculation:
    - $SE_{\text{EST}} = \frac{11.12}{\sqrt{35}} = 1.88$
    - $t = \frac{18.37 - 14.3}{1.88} = 2.165$
    - $df = 34$

  • Result: p-value = 0.019.

  • Conclusion: Since the p-value (0.019) is less than the significance level (0.05), we reject $H_0$. Evidence suggests the mean experience among nursing staff has indeed increased.
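The nursing calculation, including the p-value, can be sketched in Python with only the standard library. The tail area is approximated with Simpson's rule on the t density, which is an illustrative choice rather than the method used to produce the notes' result; the function names and the `alternative` parameter are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def one_sample_t_test(xbar, s, n, mu0, alternative="greater"):
    """Return (t statistic, p-value) for a one-sample t-test."""
    se = s / math.sqrt(n)
    t = (xbar - mu0) / se
    tail = t_upper_tail(abs(t), n - 1)  # area beyond |t| in one tail
    if alternative == "two-sided":
        p = 2 * tail
    elif alternative == "greater":
        p = tail if t >= 0 else 1 - tail
    elif alternative == "less":
        p = tail if t <= 0 else 1 - tail
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return t, p


t, p = one_sample_t_test(xbar=18.37, s=11.12, n=35, mu0=14.3)
print(round(t, 3), round(p, 3))  # 2.165 0.019
print("reject H0" if p < 0.05 else "fail to reject H0")
```

Passing `alternative="two-sided"` doubles the one-tail area, which is the one-tail versus two-tail distinction described earlier.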

Named Example: Hockey Attendance

  • Situation: The 2010 average attendance was 17,072. A sample of 30 games in 2014 shows $\bar{x} = 18{,}104$ and $s = 1203.5$. Has attendance changed?

  • Hypotheses:
    - $H_0: \mu = 17072$
    - $H_a: \mu \neq 17072$

  • Calculation:
    - $SE = 219.728$
    - t-statistic: $t = 4.70$
    - p-value $< 0.0001$

  • Conclusion: The p-value is below any conventional significance level (such as 0.05), so we reject $H_0$. Mean attendance has changed since 2010.
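The reported values follow from the same formulas used in the previous example. A worked version, with $df = 30 - 1 = 29$ and a two-tailed p-value because $H_a$ is two-sided:

```latex
\begin{aligned}
SE &= \frac{s}{\sqrt{n}} = \frac{1203.5}{\sqrt{30}} \approx 219.728 \\
t &= \frac{\bar{x} - \mu_0}{SE} = \frac{18104 - 17072}{219.728} = \frac{1032}{219.728} \approx 4.70 \\
p\text{-value} &= 2 \times P(T_{29} > 4.70) < 0.0001
\end{aligned}
```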

Named Example: Weight Loss Study

  • Scenario: 76 subjects followed a low-fat diet for 12 months. $\bar{x} = 2.2$ kg loss, $s = 6.1$ kg. Is mean weight loss greater than 0?

  • Hypotheses:
    - $H_0: \mu = 0$
    - $H_a: \mu > 0$

  • Calculation:
    - $t = \frac{2.2 - 0}{6.1 / \sqrt{76}} = 3.144$
    - $df = 76 - 1 = 75$

  • Finding P-value via Table:
    - $df = 75$ is not in the table; use $df = 60$ (the next smaller listed value).
    - Look across the row for values bracketing 3.144. These are 2.915 (upper-tail probability 0.0025) and 3.232 (upper-tail probability 0.001).
    - The p-value is therefore between 0.001 and 0.0025.

  • Final Result: The more accurate p-value from technology is $p = 0.0012$. Reject $H_0$ at $\alpha = 0.05$. Mean weight loss is significantly greater than 0.
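The table-bracketing step can be sketched as a search along one row of the t-table. Only the entries 2.915 and 3.232 are quoted in the notes; the remaining $df = 60$ entries below are standard table values added for illustration, and the names are hypothetical.

```python
# Row of a t-table for df = 60 as (critical value, upper-tail probability),
# sorted by critical value. Only 2.915 and 3.232 are quoted in the notes.
DF60_ROW = [(1.671, 0.05), (2.000, 0.025), (2.390, 0.01), (2.660, 0.005),
            (2.915, 0.0025), (3.232, 0.001), (3.460, 0.0005)]


def bracket_upper_tail(t, row):
    """Return (low, high) bounds on P(T > t) for t >= 0 from one table row."""
    low, high = 0.0, 0.5  # with t >= 0 the upper-tail area is at most 0.5
    for critical, prob in row:
        if critical <= t:
            high = prob   # t is at least this extreme
        else:
            low = prob    # first table value that exceeds t
            break
    return low, high


print(bracket_upper_tail(3.144, DF60_ROW))  # (0.001, 0.0025)
```

The technology result of 0.0012 falls inside this bracket, as expected.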