Detailed Study Notes for STAT 2000 - Inference for the Mean of a Single Population

Unit 1 – Inference for the Mean of a Single Population

Introduction to Statistical Inference

Inference Definition: Statistical inference is a set of methods used to draw conclusions or make predictions about a larger population based on data obtained from a smaller, representative sample. This process typically involves estimating population parameters (like means or proportions) and testing hypotheses about these parameters.

Probability Distribution

Definition: The probability distribution of a random variable $X$ is a mathematical function that describes all possible values $X$ can take and how probability is assigned to them. For discrete variables, this is a probability mass function (PMF), which gives the probability of each individual value. For continuous variables, it is a probability density function (PDF), where the area under the curve over a range gives the probability of the variable falling within that range.
Notation: If $X \sim N(\mu, \sigma)$ , it indicates that $X$ follows a normal probability distribution. Here, $\mu$ represents the population mean, and $\sigma$ represents the population standard deviation.

Distribution of the Sample Mean

Understanding Sample Means: Rather than analyzing individual data points or their probabilities, statistical inference often focuses on the distribution of sample statistics, particularly the sample mean ( $\bar{x}$ ), calculated from multiple random samples of size $n$ . This helps account for sample-to-sample variability.
Goal: The primary goal is to understand how sample means are distributed, allowing us to calculate the probability that a sample mean $\bar{x}$ falls within a specific range, or to make inferences about the population mean based on a single sample mean.

Motivation for the Sample Mean Distribution

Exploration: If we repeatedly take multiple random samples of the same size $n$ from the same population, and calculate the mean $\bar{x}$ for each sample, these sample means will themselves form a distribution. This distribution of sample means is known as the sampling distribution of the sample mean.

Example with Pepsi Cans

Scenario: The actual fill volumes of Pepsi cans are known to follow a normal distribution with a population mean $\mu = 355 \text{ ml}$ and a population standard deviation $\sigma = 2 \text{ ml}$ . We are interested in the distribution of sample means of these fill volumes.
Sample Process:
1. Use R Code to generate random samples to simulate this scenario:
   ```r
   set.seed(1)               # ensures reproducibility of random sample generation
   x <- rnorm(1000, 355, 2)  # generates 1000 random normal values (mean 355, sd 2)
   ```
Results: If we were to take many samples (e.g., 100 samples of size 100 each) and compute the mean of each sample, the histogram of these sample means would be approximately normal (bell-shaped) and centered near $355 \text{ ml}$.
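The repeated-sampling idea can be sketched in R as follows. This is a minimal simulation under the scenario's assumptions ($\mu = 355$, $\sigma = 2$); the names `n_samples`, `n`, and `sample_means` are illustrative and not part of the original notes.

```r
set.seed(1)  # reproducible draws

n_samples <- 100   # number of repeated samples (illustrative choice)
n <- 100           # cans per sample

# Each rnorm() call draws one sample of n fill volumes; keep only its mean
sample_means <- replicate(n_samples, mean(rnorm(n, mean = 355, sd = 2)))

print(mean(sample_means))  # should land close to 355
print(sd(sample_means))    # should land close to 2 / sqrt(100) = 0.2

# Histogram of the simulated sample means
hist(sample_means,
     main = "Simulated sample means",
     xlab = "Mean fill volume (ml)")
```

Because the draws are random, the printed values will only approximate 355 and 0.2; increasing `n_samples` should bring them closer.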

Sample Means and Distribution Parameters

Observations: When examining the sampling distribution of the sample mean ( $\bar{X}$ ):
1. The distribution of $\bar{X}$ typically remains normal, especially if the underlying population is normal or if the sample size is large (due to the Central Limit Theorem).
2. The mean of the sampling distribution of $\bar{X}$ (denoted as $E(\bar{X})$ or $\mu_{\bar{X}}$ ) is equal to the population mean $\mu = 355$ . This indicates that the sample mean is an unbiased estimator of the population mean.
3. The standard deviation of the sampling distribution of $\bar{X}$ (known as the standard error of the mean and denoted as $\sigma_{\bar{X}}$ or $SE$ ) is calculated by the formula $\frac{\sigma}{\sqrt{n}}$ . For example, if $n=100$ , then $\frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{100}} = \frac{2}{10} = 0.2$ . This value quantifies how far a sample mean typically falls from the population mean.
Understanding Variability: The decrease in standard deviation (compared to the population standard deviation) indicates that sample averages are inherently less variable than individual observations. Larger sample sizes lead to smaller standard errors, meaning sample means cluster more closely around the true population mean.
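To make the effect of sample size concrete, the short R sketch below evaluates $\frac{\sigma}{\sqrt{n}}$ for several sample sizes, using $\sigma = 2$ from the Pepsi example. The particular values of `n` are illustrative choices, not taken from the notes.

```r
sigma <- 2                    # population standard deviation (Pepsi example)
n <- c(1, 4, 25, 100, 400)    # illustrative sample sizes
se <- sigma / sqrt(n)         # standard error of the mean for each n

print(data.frame(n = n, standard_error = se))
```

The standard errors come out as 2, 1, 0.4, 0.2, and 0.1: quadrupling the sample size halves the standard error.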

Sampling Distribution

Notation and Properties: If individual observations $X$ in a population follow a normal distribution, i.e., $X \sim N(\mu, \sigma)$, then the sampling distribution of the sample mean $\bar{X}$ also follows a normal distribution: $\bar{X} \sim N \left(\mu, \frac{\sigma}{\sqrt{n}} \right)$.
Mean: The mean of the sample means, $\mu_{\bar{X}}$ , is always equal to the population mean $\mu$ . This is a crucial property for making accurate inferences.
Standard Error: The standard deviation of the sampling distribution of the sample mean, $\frac{\sigma}{\sqrt{n}}$ , measures the precision of the sample mean as an estimator of the population mean.

Example: Pepsi Revisited

Analysis Question: Given the Pepsi example with $\mu = 355$, $\sigma = 2$, and $n = 100$, what is the probability that a sample of 100 cans will have an average fill volume greater than $356 \text{ ml}$, i.e., $P(\bar{X} > 356)$?
Solution Calculation: We first standardize the sample mean using the Z-score formula for sample means, $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$. The standard error is $\frac{2}{\sqrt{100}} = 0.2$, so $P(\bar{X} > 356) = P \left(Z > \frac{356 - 355}{0.2} \right) = P(Z > 5)$.
- Referring to a standard normal (Z) table or using statistical software, $P(Z > 5)$ is an extremely small value, approximately $0.000000287$, or effectively $0.0000$ when rounded to four decimal places. This means it is highly unlikely to observe a sample mean greater than $356 \text{ ml}$ if the true population mean is $355 \text{ ml}$.
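The same calculation can be checked in R with `pnorm()`. This is a small sketch using the values from the example; the variable names are illustrative.

```r
mu <- 355      # population mean (ml)
sigma <- 2     # population standard deviation (ml)
n <- 100       # sample size

se <- sigma / sqrt(n)        # standard error of the mean
z <- (356 - mu) / se         # standardized value of 356 ml

# Upper-tail probability P(Z > z)
p_upper <- pnorm(z, lower.tail = FALSE)
print(z)
print(p_upper)

# Equivalent call without standardizing by hand
p_direct <- pnorm(356, mean = mu, sd = se, lower.tail = FALSE)
```

Using `lower.tail = FALSE` rather than `1 - pnorm(z)` avoids losing precision when the tail probability is this small.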

Pulse Rate Analysis Example

Pulse Rate Normality: Assume women's pulse rates are normally distributed with a population mean $\mu = 74$ beats per minute and a population standard deviation $\sigma = 12$ beats per minute.
Calculating Probability: Find the probability that the average pulse rate for a random sample of $25$ women falls between 68 and 80 beats per minute. Here, $n=25$ and the standard error is $\frac{12}{\sqrt{25}} = \frac{12}{5} = 2.4$.
- $P(68 < \bar{X} < 80) = P \left(\frac{68-74}{2.4} < Z < \frac{80-74}{2.4} \right) = P(-2.50 < Z < 2.50)$.
- To find this probability, we look up the cumulative probabilities for $Z=2.50$ and $Z=-2.50$ in a Z-table: $P(Z < 2.50) = 0.9938$ and $P(Z < -2.50) = 0.0062$. Therefore, $P(-2.50 < Z < 2.50) = P(Z < 2.50) - P(Z < -2.50) = 0.9938 - 0.0062 = 0.9876$. This implies that about $98.76\%$ of samples of 25 women will have an average pulse rate between 68 and 80 bpm.
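As a cross-check of the table lookup, the interval probability can be computed in R as a difference of two cumulative probabilities. This is a sketch with the example's values; the variable names are illustrative.

```r
mu <- 74       # population mean pulse rate (bpm)
sigma <- 12    # population standard deviation (bpm)
n <- 25        # sample size

se <- sigma / sqrt(n)   # standard error: 12 / 5 = 2.4

# P(68 < X-bar < 80) as a difference of cumulative probabilities
p_between <- pnorm(80, mean = mu, sd = se) - pnorm(68, mean = mu, sd = se)
print(round(p_between, 4))
```

The software result agrees with the Z-table answer of 0.9876 to four decimal places.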

Central Limit Theorem

Theorem Definition: The Central Limit Theorem (CLT) is a fundamental theorem in statistics stating that, even if the population distribution is not normal, the sampling distribution of the sample mean ( $\bar{X}$ ) will approach a normal distribution as the sample size ( $n$ ) becomes sufficiently large. This holds regardless of the original shape of the population distribution (e.g., skewed, uniform), provided the population has a finite variance.
Practical Implication: The CLT is incredibly powerful because it allows us to apply methods developed for normal distributions (like Z-tests and confidence intervals) to infer about the means of populations that are not necessarily normally distributed, provided that the sample size is large enough (a common rule of thumb is $n \geq 30$ ).
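A quick simulation can illustrate the theorem. The sketch below assumes an exponential population with rate 1 (a right-skewed distribution with mean 1 and standard deviation 1), which is an illustrative choice and not part of the notes. The helpers `sample_mean_exp()` and `skewness()` are hypothetical names defined here.

```r
set.seed(42)  # reproducible draws

# Mean of one sample of size n from an exponential(1) population
sample_mean_exp <- function(n) mean(rexp(n, rate = 1))

means_n2  <- replicate(5000, sample_mean_exp(2))    # very small samples
means_n30 <- replicate(5000, sample_mean_exp(30))   # rule-of-thumb size

# Simple sample skewness: 0 for a symmetric distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

print(skewness(means_n2))    # clearly positive: still right-skewed
print(skewness(means_n30))   # much closer to 0: nearer to normal
print(sd(means_n30))         # close to 1 / sqrt(30)
```

Comparing histograms of `means_n2` and `means_n30` shows the same pattern visually: the larger sample size gives a more symmetric, bell-shaped distribution of sample means.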

Statistical Inference Basics

Building Confidence Intervals (CIs): A confidence interval provides a range of values within which the true population parameter (e.g., the population mean $\mu$ ) is likely to lie, based on sample data. For the population mean when the population standard deviation $\sigma$ is known, a CI is given by:
$\left( \bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, \; \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$
Here, $\bar{x}$ is the sample mean, $z_{\alpha/2}$ is the critical Z-value from the standard normal distribution corresponding to the desired confidence level (e.g., for a 95% CI, $z_{\alpha/2} = 1.96$ ), $\sigma$ is the population standard deviation, and $n$ is the sample size.
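Confidence Level (CL): Represents the long-run proportion of intervals that would contain the true population parameter if the sampling process were repeated many times and an interval were computed from each sample. It is a property of the procedure, not of any single computed interval. Common confidence levels are 90%, 95%, and 99%.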
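The interval formula and the repeated-sampling reading of the confidence level can both be sketched in R. The function `z_interval()` is a hypothetical helper written for this illustration, and the data are simulated from the Pepsi scenario ($\mu = 355$, $\sigma = 2$), not real measurements.

```r
# z-interval for a mean when sigma is known (hypothetical helper)
z_interval <- function(x, sigma, conf_level = 0.95) {
  alpha <- 1 - conf_level
  z_crit <- qnorm(1 - alpha / 2)               # about 1.96 for 95%
  margin <- z_crit * sigma / sqrt(length(x))   # margin of error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

set.seed(1)
x <- rnorm(100, mean = 355, sd = 2)   # one simulated sample of 100 cans
print(z_interval(x, sigma = 2))

# Repeat the procedure many times and record how often the interval covers mu
covers <- replicate(2000, {
  ci <- z_interval(rnorm(100, mean = 355, sd = 2), sigma = 2)
  ci["lower"] <= 355 && 355 <= ci["upper"]
})
print(mean(covers))   # should be near 0.95
```

The share of simulated intervals that contain 355 should sit close to the nominal 95%, which is what the confidence level describes.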

Hypothesis Testing

Definition: Hypothesis testing is a formal statistical procedure used to assess claims or assumptions (hypotheses) regarding a population parameter based on evidence from sample statistics. It involves a systematic process of making a decision about a statement concerning a population.
Testing Hypotheses: Hypothesis testing involves setting up two opposing statements:
- Null Hypothesis ( $H_0$ ): This is the statement of no effect, no difference, or no relationship. It often includes an equality (e.g., $\mu = \mu_0$ ).
- Alternative Hypothesis ( $H_a$ ): This is the statement that contradicts the null hypothesis, suggesting the presence of an effect, difference, or relationship. It can be one-sided (e.g., $\mu > \mu_0$ or $\mu < \mu_0$ ) or two-sided (e.g., $\mu \ne \mu_0$ ).
 The goal is to determine if there is enough statistical evidence to reject the null hypothesis in favor of the alternative hypothesis.
Common Techniques:
- Z-tests: Used when the population standard deviation $\sigma$ is known, and the sample size is sufficiently large (or the population is normally distributed). The test statistic is calculated as $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ .
- T-tests: Used when the population standard deviation $\sigma$ is unknown and must be estimated from the sample data (using the sample standard deviation, $s$ ). T-tests are particularly common and are appropriate when the sample size is small, assuming the population is normally distributed. The test statistic is $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with $n-1$ degrees of freedom.
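Both test statistics can be computed directly in R. The sketch below uses a simulated sample (not real data) from the Pepsi scenario and a hypothesized mean `mu0 = 355`, with a two-sided alternative as an illustrative choice. The hand-computed t statistic is compared against R's built-in `t.test()`.

```r
set.seed(2)
x <- rnorm(25, mean = 355, sd = 2)   # simulated sample of 25 cans
mu0 <- 355                           # hypothesized mean under H0
n <- length(x)

# Z-test: sigma = 2 is treated as known
z_stat <- (mean(x) - mu0) / (2 / sqrt(n))
p_z <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)   # two-sided p-value

# t-test: sigma is estimated by the sample standard deviation s
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_t <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Built-in one-sample t-test for comparison
t_builtin <- t.test(x, mu = mu0)

print(c(z = z_stat, p_z = p_z))
print(c(t = t_stat, p_t = p_t))
print(t_builtin)
```

The statistic and p-value reported by `t.test()` should match `t_stat` and `p_t`, with $n - 1 = 24$ degrees of freedom.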

Control for Type I and II Errors

Type I Error ( $\alpha$ ): Occurs when a true null hypothesis ( $H_0$ ) is incorrectly rejected. The probability of making a Type I error is denoted by $\alpha$ , also known as the significance level of the test. A commonly chosen $\alpha$ value is $0.05$ , meaning there is a 5% chance of rejecting a true null hypothesis.
Type II Error ( $\beta$ ): Occurs when a false null hypothesis ( $H_0$ ) is not rejected. The probability of making a Type II error is denoted by $\beta$ . Minimizing $\beta$ is important, as it means we are less likely to miss a real effect.
- Power of a test = $1 - \beta$ : This represents the probability of correctly rejecting a false null hypothesis. A test with higher power is more desirable as it is more likely to detect a true effect when one exists. Researchers aim to design studies with sufficient power.
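Power can be computed explicitly for a simple case. The sketch below assumes an upper-tailed Z-test ( $H_0: \mu = \mu_0$ versus $H_a: \mu > \mu_0$ ) with known $\sigma$, where $H_0$ is rejected when $Z > z_{\alpha}$. The function `power_z_upper()` and the alternative mean of 355.5 ml are hypothetical choices made for illustration.

```r
# Power of an upper-tailed Z-test when the true mean is mu1 (hypothetical helper)
power_z_upper <- function(mu0, mu1, sigma, n, alpha = 0.05) {
  se <- sigma / sqrt(n)         # standard error of the mean
  z_crit <- qnorm(1 - alpha)    # rejection cutoff on the Z scale
  shift <- (mu1 - mu0) / se     # how far the true mean sits from mu0, in SE units
  pnorm(z_crit - shift, lower.tail = FALSE)   # P(reject H0 | true mean = mu1)
}

power_n25  <- power_z_upper(mu0 = 355, mu1 = 355.5, sigma = 2, n = 25)
power_n100 <- power_z_upper(mu0 = 355, mu1 = 355.5, sigma = 2, n = 100)
print(c(n25 = power_n25, n100 = power_n100))
print(1 - power_n100)   # beta, the Type II error probability, at n = 100

# When the true mean equals mu0, the rejection probability is alpha
print(power_z_upper(mu0 = 355, mu1 = 355, sigma = 2, n = 100))
```

Under these assumptions, a larger sample gives higher power (smaller $\beta$) for the same $\alpha$, which is why sample size is a main lever when designing a study.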

Conclusion

Summary of Inference Methods: This unit has provided a foundational understanding of statistical inference, focusing on how to analyze and draw conclusions about a population mean using sample data. Key concepts include understanding sample means, deriving their sampling distributions, applying statistical tests (Z-tests, T-tests) to compare means against hypothesized values, and utilizing estimations and confidence intervals to quantify uncertainty. By mastering these methods, one can effectively leverage data to make informed decisions and draw meaningful conclusions.
Key Formulas for Reference: Students should be familiar with the formulas for the standard error of the mean ( $\frac{\sigma}{\sqrt{n}}$ ), Z-test statistic ( $\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ ), t-test statistic ( $\frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ ), and the construction of confidence intervals for the mean.