Detailed Study Notes for STAT 2000 - Inference for the Mean of a Single Population

Unit 1 – Inference for the Mean of a Single Population
Introduction to Statistical Inference
  • Inference Definition: Statistical inference is a set of methods used to draw conclusions or make predictions about a larger population based on data obtained from a smaller, representative sample. This process typically involves estimating population parameters (like means or proportions) and testing hypotheses about these parameters.

Probability Distribution
  • Definition: The probability distribution of a random variable $X$ is a mathematical function that describes all possible values $X$ can take and the probability associated with those values. For discrete variables, this is a probability mass function (PMF), which gives the probability of each individual value. For continuous variables, it is a probability density function (PDF), where the area under the curve over a range gives the probability that the variable falls within that range.

  • Notation: If $X \sim N(\mu, \sigma)$, then $X$ follows a normal probability distribution. Here, $\mu$ represents the population mean, and $\sigma$ represents the population standard deviation.
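The PMF/PDF distinction can be checked numerically. The following is a minimal sketch, not part of the original notes: the binomial distribution with 10 trials and success probability 0.5, and the standard normal interval from -1 to 1, are illustrative choices only.

```r
# Discrete case: a PMF assigns a probability to each possible value.
# Illustrative assumption: X counts successes in 10 trials with p = 0.5.
pmf <- dbinom(0:10, size = 10, prob = 0.5)
pmf_total <- sum(pmf)   # probabilities over all values sum to 1

# Continuous case: probabilities are areas under the PDF.
# Area under the standard normal density between -1 and 1.
area <- integrate(dnorm, lower = -1, upper = 1)$value

# The same probability from the cumulative distribution function.
area_cdf <- pnorm(1) - pnorm(-1)

pmf_total
area
area_cdf
```

Note that R's normal functions (dnorm, pnorm, rnorm) take the standard deviation, not the variance, as their spread argument, which matches the $N(\mu, \sigma)$ notation used in these notes.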

Distribution of the Sample Mean
  • Understanding Sample Means: Rather than analyzing individual data points or their probabilities, statistical inference often focuses on the distribution of sample statistics, particularly the sample mean ($\bar{x}$), calculated from multiple random samples of size $n$. This helps account for sample-to-sample variability.

  • Goal: The primary goal is to understand how sample means are distributed, allowing us to calculate the probability that a sample mean $\bar{x}$ falls within a specific range, or to make inferences about the population mean based on a single sample mean.

Motivation for the Sample Mean Distribution
  • Exploration: If we repeatedly take random samples of the same size $n$ from the same population and calculate the mean $\bar{x}$ for each sample, these sample means will themselves form a distribution. This distribution of sample means is known as the sampling distribution of the sample mean.

Example with Pepsi Cans
  • Scenario: The actual fill volumes of Pepsi cans are known to follow a normal distribution with a population mean $\mu = 355 \text{ ml}$ and a population standard deviation $\sigma = 2 \text{ ml}$. We are interested in the distribution of sample means of these fill volumes.

  • Sample Process:

    1. Use R code to generate random samples that simulate this scenario:
      set.seed(1)               # Ensures reproducibility of random sample generation
      x <- rnorm(1000, 355, 2)  # Generates 1000 random normal values (mean 355, sd 2)

  • Results: If we were to take many samples (e.g., 100 samples of size 100 each) and compute the mean of each sample, the histogram of these sample means would be approximately bell-shaped (normal) and centered near the population mean of 355 ml.
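The repeated-sampling experiment described above can be sketched in R as follows. This is one possible simulation under the scenario's stated values (mean 355, standard deviation 2, 100 samples of size 100); the variable names are illustrative and not from the original notes.

```r
set.seed(1)  # fixed seed so the simulation is reproducible

mu <- 355         # population mean (ml), from the Pepsi scenario
sigma <- 2        # population standard deviation (ml)
n <- 100          # cans per sample
n_samples <- 100  # number of repeated samples

# Draw n_samples samples of size n and keep the mean of each one.
xbar <- replicate(n_samples, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbar)  # should land close to mu
sd(xbar)    # should land close to sigma / sqrt(n) = 0.2

# Draw the histogram only when running interactively.
if (interactive()) {
  hist(xbar, main = "Simulated sample means",
       xlab = "Sample mean fill volume (ml)")
}
```

Increasing n_samples gives a smoother histogram, while increasing n makes the histogram narrower.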

Sample Means and Distribution Parameters
  • Observations: When examining the sampling distribution of the sample mean ($\bar{X}$):

    1. The distribution of $\bar{X}$ is exactly normal if the underlying population is normal, and approximately normal if the sample size is large (due to the Central Limit Theorem).

    2. The mean of the sampling distribution of $\bar{X}$ (denoted as $E(\bar{X})$ or $\mu_{\bar{X}}$) is equal to the population mean, here $\mu = 355$. This indicates that the sample mean is an unbiased estimator of the population mean.

    3. The standard deviation of the sampling distribution of $\bar{X}$ (known as the standard error of the mean and denoted as $\sigma_{\bar{X}}$ or $SE$) is calculated by the formula $\frac{\sigma}{\sqrt{n}}$. For example, if $n = 100$, then $\frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{100}} = \frac{2}{10} = 0.2$. This value quantifies how far sample means typically fall from the population mean.

  • Understanding Variability: The decrease in standard deviation (compared to the population standard deviation) indicates that sample averages are inherently less variable than individual observations. Larger sample sizes lead to smaller standard errors, meaning sample means cluster more closely around the true population mean.
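To see how the standard error shrinks as the sample size grows, the formula can be evaluated for a few sample sizes. This is a small sketch using the Pepsi value of 2 ml for the population standard deviation; the particular sample sizes are illustrative choices.

```r
sigma <- 2                       # population standard deviation (ml)
n <- c(1, 4, 25, 100, 400)       # illustrative sample sizes

se <- sigma / sqrt(n)            # standard error of the mean for each n

# Each fourfold increase in n halves the standard error.
data.frame(n = n, se = se)       # se column: 2, 1, 0.4, 0.2, 0.1
```

Because the standard error depends on the square root of $n$, halving it requires four times as many observations.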

Sampling Distribution
  • Notation and Properties: If individual observations $X$ in a population follow a normal distribution, i.e., $X \sim N(\mu, \sigma)$, then the sampling distribution of the sample mean $\bar{X}$ also follows a normal distribution: $\bar{X} \sim N \left(\mu, \frac{\sigma}{\sqrt{n}} \right)$.

  • Mean: The mean of the sample means, $\mu_{\bar{X}}$, is always equal to the population mean $\mu$. This is a crucial property for making accurate inferences.

  • Standard Error: The standard deviation of the sampling distribution of the sample mean, $\frac{\sigma}{\sqrt{n}}$, measures the precision of the sample mean as an estimator of the population mean.
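The mean and standard error of the sample mean can be derived in a few lines. The sketch below assumes the observations $X_1, \dots, X_n$ are independent draws from the same population with mean $\mu$ and standard deviation $\sigma$; independence is an assumption stated here, not in the notes above.

```latex
\begin{align*}
E(\bar{X}) &= E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
            = \frac{1}{n}\sum_{i=1}^{n} E(X_i)
            = \frac{1}{n}\, n\mu = \mu, \\
\mathrm{Var}(\bar{X}) &= \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i)
            = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}, \\
\sigma_{\bar{X}} &= \sqrt{\mathrm{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}.
\end{align*}
```

The variance step is where independence is used: without it, covariance terms between observations would also appear.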

Example: Pepsi Revisited
  • Analysis Question: Given the Pepsi example with $\mu = 355$, $\sigma = 2$, and $n = 100$, what is the probability that a sample of 100 cans will have an average fill volume greater than $356 \text{ ml}$, that is, $P(\bar{X} > 356)$?

  • Solution Calculation: We first standardize the sample mean using the Z-score formula for sample means, $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$. The standard error is $\frac{2}{\sqrt{100}} = 0.2$. Then $P(\bar{X} > 356) = P \left(Z > \frac{356 - 355}{0.2} \right) = P(Z > 5)$.

    • Referring to a standard normal (Z) table or using statistical software, $P(Z > 5)$ is an extremely small value, approximately $0.000000287$, or effectively $0.0000$ when rounded to four decimal places. This means it is highly unlikely to observe a sample mean greater than $356 \text{ ml}$ if the true population mean is $355 \text{ ml}$.
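The same probability can be obtained in R instead of a Z-table. This is a minimal sketch using the values from the example; the variable names are illustrative.

```r
mu <- 355     # population mean (ml)
sigma <- 2    # population standard deviation (ml)
n <- 100      # sample size

se <- sigma / sqrt(n)       # standard error: 0.2
z <- (356 - mu) / se        # standardized value: 5

# Upper-tail probability P(Z > 5) from the standard normal distribution.
p_upper <- pnorm(z, lower.tail = FALSE)

# Equivalent call without standardizing by hand.
p_direct <- pnorm(356, mean = mu, sd = se, lower.tail = FALSE)

p_upper   # about 2.87e-07
```

Using lower.tail = FALSE is preferable to computing 1 - pnorm(z) for tail probabilities this small, since it avoids loss of precision from subtraction.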

Pulse Rate Analysis Example
  • Pulse Rate Normality: Assume women's pulse rates are normally distributed with a population mean $\mu = 74$ beats per minute and a population standard deviation $\sigma = 12$ beats per minute.

  • Calculating Probability: Find the probability that the average pulse rate for a random sample of $25$ women falls between $68$ and $80$ beats per minute. Here, $n = 25$ and the standard error is $\frac{12}{\sqrt{25}} = \frac{12}{5} = 2.4$.

    • $P(68 < \bar{X} < 80) = P \left(\frac{68-74}{2.4} < Z < \frac{80-74}{2.4} \right) = P(-2.50 < Z < 2.50)$.

    • To find this probability, we look up the cumulative probabilities for $Z = 2.50$ and $Z = -2.50$ in a Z-table. $P(Z < 2.50) = 0.9938$ and $P(Z < -2.50) = 0.0062$. Therefore, $P(-2.50 < Z < 2.50) = P(Z < 2.50) - P(Z < -2.50) = 0.9938 - 0.0062 = 0.9876$. This implies that about $98.76\%$ of samples of 25 women will have an average pulse rate between 68 and 80 bpm.
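As a check on the table lookup, the interval probability can be computed in R. This is a short sketch with the values from the pulse rate example; the variable names are illustrative.

```r
mu <- 74      # population mean (beats per minute)
sigma <- 12   # population standard deviation (beats per minute)
n <- 25       # sample size

se <- sigma / sqrt(n)   # standard error: 2.4

# P(68 < Xbar < 80) as a difference of two cumulative probabilities.
p_between <- pnorm(80, mean = mu, sd = se) - pnorm(68, mean = mu, sd = se)

p_between   # about 0.9876
```

The result agrees with the Z-table answer to four decimal places.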

Central Limit Theorem
  • Theorem Definition: The Central Limit Theorem (CLT) is a fundamental theorem in statistics stating that, even if the population distribution is not normal, the sampling distribution of the sample mean ($\bar{X}$) will approach a normal distribution as the sample size ($n$) becomes sufficiently large. This holds true regardless of the original shape of the population distribution (e.g., skewed, uniform).

  • Practical Implication: The CLT is incredibly powerful because it allows us to apply methods developed for normal distributions (like Z-tests and confidence intervals) to infer about the means of populations that are not necessarily normally distributed, provided that the sample size is large enough (a common rule of thumb is $n \geq 30$).
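The CLT can be illustrated by simulation. The sketch below assumes a strongly right-skewed population, an exponential distribution with rate 1 (so mean 1 and standard deviation 1), and compares sample means for $n = 2$ and $n = 30$. The population, the sample sizes, and the helper function skewness are illustrative choices, not part of the original notes.

```r
set.seed(1)  # fixed seed so the simulation is reproducible

# Simple moment-based skewness; 0 for a symmetric distribution.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

n_reps <- 5000  # number of simulated samples per sample size

# Sample means from a skewed exponential population (rate = 1).
means_n2  <- replicate(n_reps, mean(rexp(2,  rate = 1)))
means_n30 <- replicate(n_reps, mean(rexp(30, rate = 1)))

skew_n2  <- skewness(means_n2)   # still clearly skewed
skew_n30 <- skewness(means_n30)  # much closer to 0

# The spread of the means should be close to sigma / sqrt(n).
sd(means_n30)
1 / sqrt(30)
```

With $n = 30$ the sample means are far more symmetric than with $n = 2$, although some skewness remains; how large $n$ must be depends on how non-normal the population is.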

Statistical Inference Basics
  • Building Confidence Intervals (CIs): A confidence interval provides a range of values within which the true population parameter (e.g., the population mean $\mu$) is likely to lie, based on sample data. For the population mean when the population standard deviation $\sigma$ is known, a CI is given by:
    $\left( \bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$
    Here, $\bar{x}$ is the sample mean, $z_{\alpha/2}$ is the critical Z-value from the standard normal distribution corresponding to the desired confidence level (e.g., for a 95% CI, $z_{\alpha/2} = 1.96$), $\sigma$ is the population standard deviation, and $n$ is the sample size.

  • Confidence Level (CL): Represents the long-run proportion of intervals that would contain the true population parameter if the sampling process were repeated many times and an interval were computed from each sample. Common confidence levels are 90%, 95%, and 99%.
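The interval formula for a known population standard deviation can be sketched in R. The sample mean of 355.3 ml below is a hypothetical value chosen for illustration; the standard deviation and sample size reuse the Pepsi example.

```r
xbar <- 355.3        # hypothetical observed sample mean (ml)
sigma <- 2           # known population standard deviation (ml)
n <- 100             # sample size
conf_level <- 0.95   # desired confidence level

alpha <- 1 - conf_level
z_crit <- qnorm(1 - alpha / 2)        # critical value, about 1.96
margin <- z_crit * sigma / sqrt(n)    # margin of error

ci <- c(lower = xbar - margin, upper = xbar + margin)
ci
```

Changing conf_level to 0.90 or 0.99 shows the trade-off: a higher confidence level gives a larger critical value and therefore a wider interval.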

Hypothesis Testing
  • Definition: Hypothesis testing is a formal statistical procedure used to assess claims or assumptions (hypotheses) regarding a population parameter based on evidence from sample statistics. It involves a systematic process of making a decision about a statement concerning a population.

  • Testing Hypotheses: Hypothesis testing involves setting up two opposing statements:

    • Null Hypothesis ($H_0$): This is the statement of no effect, no difference, or no relationship. It often includes an equality (e.g., $\mu = \mu_0$).

    • Alternative Hypothesis ($H_a$): This is the statement that contradicts the null hypothesis, suggesting the presence of an effect, difference, or relationship. It can be one-sided (e.g., $\mu > \mu_0$ or $\mu < \mu_0$) or two-sided (e.g., $\mu \ne \mu_0$).
      The goal is to determine if there is enough statistical evidence to reject the null hypothesis in favor of the alternative hypothesis.

  • Common Techniques:

    • Z-tests: Used when the population standard deviation $\sigma$ is known, and the sample size is sufficiently large (or the population is normally distributed). The test statistic is calculated as $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.

    • T-tests: Used when the population standard deviation $\sigma$ is unknown and must be estimated from the sample data (using the sample standard deviation, $s$). T-tests are particularly common and are appropriate when the sample size is small, assuming the population is normally distributed. The test statistic is $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with $n-1$ degrees of freedom.
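Both test statistics can be computed in R. The sketch below runs two-sided tests of $H_0: \mu = 355$ on a hypothetical sample of 25 fill volumes simulated for illustration. The Z-test is computed by hand because base R has no built-in one-sample Z-test; the t-test uses the base R function t.test.

```r
set.seed(1)
x <- rnorm(25, mean = 355, sd = 2)  # hypothetical sample, simulated for illustration

mu0 <- 355        # hypothesized mean under H0
n <- length(x)

# Z-test: population standard deviation treated as known.
sigma <- 2
z_stat <- (mean(x) - mu0) / (sigma / sqrt(n))
p_z <- 2 * pnorm(-abs(z_stat))                 # two-sided p-value

# t-test: standard deviation estimated by the sample value s.
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_t <- 2 * pt(-abs(t_manual), df = n - 1)      # two-sided p-value, n - 1 df

# Built-in t-test; should reproduce t_manual and p_t.
tt <- t.test(x, mu = mu0)
tt$statistic
tt$p.value
```

For a one-sided alternative, t.test accepts alternative = "greater" or alternative = "less".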

Control for Type I and II Errors
  • Type I Error ($\alpha$): Occurs when a true null hypothesis ($H_0$) is incorrectly rejected. The probability of making a Type I error is denoted by $\alpha$, also known as the significance level of the test. A commonly chosen $\alpha$ value is $0.05$, meaning there is a 5% chance of rejecting a true null hypothesis.

  • Type II Error ($\beta$): Occurs when a false null hypothesis ($H_0$) is not rejected. The probability of making a Type II error is denoted by $\beta$. Minimizing $\beta$ is important, as it means we are less likely to miss a real effect.

    • Power of a test = $1 - \beta$: This represents the probability of correctly rejecting a false null hypothesis. A test with higher power is more desirable as it is more likely to detect a true effect when one exists. Researchers aim to design studies with sufficient power.
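The link between $\alpha$, $\beta$, and power can be made concrete for an upper-tailed Z-test of $H_0: \mu = \mu_0$ against $H_a: \mu > \mu_0$. In the sketch below, the function name power_z_upper and the true mean of 355.5 ml are hypothetical choices for illustration; the other values reuse the Pepsi example.

```r
# Power of an upper-tailed Z-test when the true mean is mu1.
# H0 is rejected when Z exceeds the critical value z_alpha.
power_z_upper <- function(mu1, mu0, sigma, n, alpha = 0.05) {
  se <- sigma / sqrt(n)
  z_alpha <- qnorm(1 - alpha)
  # Under mu1, Z is normal with mean (mu1 - mu0) / se and sd 1.
  pnorm(z_alpha - (mu1 - mu0) / se, lower.tail = FALSE)
}

# Hypothetical true mean of 355.5 ml against H0: mu = 355.
power <- power_z_upper(mu1 = 355.5, mu0 = 355, sigma = 2, n = 100)
beta <- 1 - power   # probability of a Type II error

# When H0 is actually true, the rejection probability equals alpha.
type1 <- power_z_upper(mu1 = 355, mu0 = 355, sigma = 2, n = 100)

power   # about 0.80
type1   # 0.05
```

Raising n or alpha increases power, as does a larger gap between the true mean and the hypothesized one.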

Conclusion
  • Summary of Inference Methods: This unit has provided a foundational understanding of statistical inference, focusing on how to analyze and draw conclusions about a population mean using sample data. Key concepts include understanding sample means, deriving their sampling distributions, applying statistical tests (Z-tests, T-tests) to compare means against hypothesized values, and utilizing estimations and confidence intervals to quantify uncertainty. By mastering these methods, one can effectively leverage data to make informed decisions and draw meaningful conclusions.

  • Key Formulas for Reference: Students should be familiar with the formulas for the standard error of the mean ($\frac{\sigma}{\sqrt{n}}$), Z-test statistic ($\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$), t-test statistic ($\frac{\bar{x} - \mu_0}{s/\sqrt{n}}$), and the construction of confidence intervals for the mean.