Inferences from Samples to Populations: Understanding Sampling Distributions

Introduction to Sampling Distributions

Learning Goal: The primary objective is to understand the fundamental ideas underlying sampling distributions. This allows for the analysis of both a distribution of sample means ( $\bar{x}$ ) and a distribution of sample proportions ( $\hat{p}$ ).
Sampling Distribution Definition: The distribution of any sample statistic (such as a mean or proportion) calculated from all possible samples of a particular size is called a sampling distribution.
Technical Distinction: A distribution specifically showing the means of all possible samples is technically called a sampling distribution of sample means.

Fundamental Concepts and the Basic Idea of Sample Means

Small Population Example: Consider a population consisting of only three children with ages $4$ , $5$ , and $9$ .
- Population Mean ( $\mu$ ): The mean age is calculated as $\frac{4 + 5 + 9}{3} = 6.0\, \text{years}$ .
Sample Size $n = 1$ :
- Possible samples: $\{4\}$ , $\{5\}$ , and $\{9\}$ .
- Sample means ( $\bar{x}$ ): $4.0$ , $5.0$ , and $9.0$ .
- Mean of the sample means: $\frac{4 + 5 + 9}{3} = 6.0\, \text{years}$ .
- The mean of the sample means is exactly equal to the population mean ( $\mu$ ).
Sample Size $n = 2$ (Sampling with Replacement):
- Sampling with Replacement: A method where one member is chosen at random, recorded, and then put back into the pool before the next member is chosen. This allows the same individual to be selected multiple times in a single sample.
- Total possible samples for $n = 2$ from a population of $3$ : $3 \times 3 = 9$ possible samples.
- Possible Samples and Means:
  1. Sample $\{4, 4\}$ , Mean = $4.0$
  2. Sample $\{4, 5\}$ , Mean = $4.5$
  3. Sample $\{4, 9\}$ , Mean = $6.5$
  4. Sample $\{5, 4\}$ , Mean = $4.5$
  5. Sample $\{5, 5\}$ , Mean = $5.0$
  6. Sample $\{5, 9\}$ , Mean = $7.0$
  7. Sample $\{9, 4\}$ , Mean = $6.5$
  8. Sample $\{9, 5\}$ , Mean = $7.0$
  9. Sample $\{9, 9\}$ , Mean = $9.0$
Frequency of Sample Means ( $n = 2$ ):
- $4.0$ : Frequency $1$
- $4.5$ : Frequency $2$
- $5.0$ : Frequency $1$
- $6.5$ : Frequency $2$
- $7.0$ : Frequency $2$
- $9.0$ : Frequency $1$
Observations for $n = 2$ :
- The mean of these nine sample means is $\frac{4.0 + 4.5 + 6.5 + 4.5 + 5.0 + 7.0 + 6.5 + 7.0 + 9.0}{9} = 6.0\, \text{years}$ . This remains equal to the population mean.
- The distribution starts to show clustering near the population mean of $6.0$ , appearing "more normal" than the distribution for $n = 1$ .
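
The enumeration above can be reproduced with a short script. This is a minimal sketch that assumes ordered samples drawn with replacement, as in the list of nine samples; the helper name `sample_means` is illustrative and does not come from the source.

```python
from collections import Counter
from itertools import product

ages = [4, 5, 9]  # the three-child population from the example


def sample_means(population, n):
    """Return the mean of every ordered sample of size n drawn with replacement."""
    return [sum(sample) / n for sample in product(population, repeat=n)]


means_n2 = sample_means(ages, 2)
frequencies = Counter(means_n2)          # how often each sample mean occurs
mean_of_means = sum(means_n2) / len(means_n2)

print("number of samples:", len(means_n2))
for value, count in sorted(frequencies.items()):
    print(f"sample mean {value:.1f}: frequency {count}")
print("mean of the sample means:", mean_of_means)
```

Calling `sample_means(ages, 1)` gives the $n = 1$ case in the same way.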

The Impact of Sample Size and the Central Limit Theorem

Influence of Larger Samples: As sample size increases (e.g., $n = 10$ ), the distribution of sample means looks increasingly like a normal distribution.
Central Limit Theorem: This phenomenon, where the distribution of sample means approaches a normal distribution as the sample size increases regardless of the population's distribution shape, is a consequence of the Central Limit Theorem.
Unrealistic Scenarios: Drawing a sample of size $n = 10$ from a population of only $3$ is possible only with replacement, so the same individuals appear several times in each sample. The scenario is still a useful conceptual model of how larger samples narrow the distribution of sample means and bring its shape closer to normal.
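
As a rough illustration, the three-child population is small enough to enumerate every sample of size $10$ exactly ( $3^{10} = 59{,}049$ ordered samples with replacement). The sketch below only shows the sample means concentrating around $\mu$ as $n$ grows; it does not formally test normality, and the "within 1 year" cutoff is an arbitrary choice made for illustration.

```python
from itertools import product
from statistics import pstdev

ages = [4, 5, 9]
mu = sum(ages) / len(ages)  # population mean, 6.0 years


def sample_means(population, n):
    """Mean of every ordered sample of size n drawn with replacement."""
    return [sum(sample) / n for sample in product(population, repeat=n)]


results = {}
for n in (1, 2, 10):
    means = sample_means(ages, n)
    centre = sum(means) / len(means)            # mean of the sample means
    spread = pstdev(means)                      # standard deviation of the sample means
    near = sum(abs(m - mu) <= 1.0 for m in means) / len(means)
    results[n] = (centre, spread, near)
    print(f"n={n:2d}  samples={len(means):5d}  mean={centre:.2f}  "
          f"sd={spread:.3f}  share within 1 year of mu={near:.2f}")
```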

Sampling with Larger Populations

Practical Constraints: In real-world statistics, populations are often too large to survey entirely, making the true population mean ( $\mu$ ) unknown and necessitating the use of sample means ( $\bar{x}$ ) as estimates.
Sampling Error: This is the inherent error introduced by using a random sample to estimate a population parameter rather than the entire population.
- Exclusions: Sampling error does not include errors from biased sampling, poorly worded survey questions, or recording mistakes.
Example: Web Research Hours (Data Set 8.1):
- Population: $400$ students.
- Population Mean ( $\mu$ ): $3.88\, \text{hours}$ .
- Population Standard Deviation ( $\sigma$ ): $2.40\, \text{hours}$ .
- Sample Statistics: A random sample of $n = 32$ students might yield a sample mean of $\bar{x} = 4.38\, \text{hours}$ .
- Multiple samples of the same size will result in different sample means, and typically, no single sample mean exactly matches the true population mean.
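
Sampling error can be made concrete with a small simulation. Data Set 8.1 is not reproduced here, so the sketch below builds a synthetic stand-in population of $400$ values; the uniform distribution, the seed, and the number of samples drawn are all assumptions made purely for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in population of 400 values (NOT the real Data Set 8.1).
population = [random.uniform(0, 8) for _ in range(400)]
mu = mean(population)  # known here only because we built the population ourselves

# Draw five random samples of 32 "students" each, without repeats inside a sample.
sample_means = [mean(random.sample(population, 32)) for _ in range(5)]

print(f"population mean = {mu:.2f}")
for x_bar in sample_means:
    print(f"sample mean = {x_bar:.2f}, sampling error = {x_bar - mu:+.2f}")
```

Each run of `random.sample` produces a different sample mean, and the difference from `mu` is the sampling error for that sample.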

Notation and Characteristics of Sampling Distributions

Notation for Means

Entity	Size	Mean	Standard Deviation
Population	$N$	$\mu$	$\sigma$
Sample	$n$	$\bar{x}$	$s$

Characteristics of the Distribution of Sample Means

Normality: The larger the sample size, the more closely the distribution approximates a normal distribution.
Mean: The mean of the distribution of sample means is equal to the population mean ( $\mu_{\bar{x}} = \mu$ ).
Standard Deviation (Standard Error): The standard deviation of the distribution of sample means is the population standard deviation divided by the square root of the sample size: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ .
Rule of Thumb: A common guideline assumes the distribution of sample means is close to normal if the sample size is greater than $30$ ( $n > 30$ ).
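
The standard error formula can be checked against the three-child population from earlier by enumerating every sample with replacement. This is a sketch rather than a proof; `standard_error` is an illustrative helper name, and the check covers only a few small values of $n$.

```python
from itertools import product
from math import isclose, sqrt
from statistics import pstdev

ages = [4, 5, 9]
sigma = pstdev(ages)  # population standard deviation


def standard_error(sigma, n):
    """Standard deviation of the distribution of sample means: sigma / sqrt(n)."""
    return sigma / sqrt(n)


checks = {}
for n in (1, 2, 3):
    means = [sum(sample) / n for sample in product(ages, repeat=n)]
    observed = pstdev(means)               # spread of all possible sample means
    predicted = standard_error(sigma, n)   # what the formula says it should be
    checks[n] = (observed, predicted)
    print(f"n={n}: observed={observed:.4f}  formula={predicted:.4f}")
```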

Probability and Standard Scores in Sample Means

Standard Score ( $z$ ) for Sample Means: Used to determine how extreme a sample mean is within the sampling distribution.
- Formula: $z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$
Example Application (Web Research):
- Given: $n = 32$ , $\mu = 3.88$ , $\sigma = 2.40$ .
- Standard Error: $\sigma_{\bar{x}} = \frac{2.40}{\sqrt{32}} \approx 0.42$ .
- For a sample mean $\bar{x} = 5.01$ : $z = \frac{5.01 - 3.88}{0.42} \approx 2.69$ .
- Result: A $z$-score of $2.69$ corresponds to the $99.64$th percentile. The probability of selecting a sample with a mean greater than $5.01$ is $1 - 0.9964 = 0.0036$ (or $0.36\%$ ).
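
The same calculation can be sketched with Python's standard library, assuming the sampling distribution is well approximated by a normal curve. The code computes the result twice: once with the standard error rounded to $0.42$ as in the hand calculation, and once at full precision, which gives a $z$ of roughly $2.66$ and a slightly larger tail probability.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 3.88, 2.40, 32
x_bar = 5.01

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Hand-calculation route: round the standard error to two decimals first.
se_rounded = round(sigma / sqrt(n), 2)
z_rounded = (x_bar - mu) / se_rounded
p_rounded = 1 - standard_normal.cdf(z_rounded)

# Full-precision route: no intermediate rounding.
se_exact = sigma / sqrt(n)
z_exact = (x_bar - mu) / se_exact
p_exact = 1 - standard_normal.cdf(z_exact)

print(f"rounded: se={se_rounded:.2f}  z={z_rounded:.2f}  P(mean > {x_bar})={p_rounded:.4f}")
print(f"exact:   se={se_exact:.4f}  z={z_exact:.2f}  P(mean > {x_bar})={p_exact:.4f}")
```

The gap between the two routes comes entirely from rounding the standard error, which is worth keeping in mind when comparing table lookups with software output.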

Case Study: Sampling Texas Farms

Context: Texas has approximately $225{,}000$ farms.
Known Parameters: Population mean $\mu = 582\, \text{acres}$ , population standard deviation $\sigma = 150\, \text{acres}$ .
Problem: Find the probability of a random sample of $n = 100$ farms having a mean size greater than $600\, \text{acres}$ .

Standard Error Calculation: $\sigma_{\bar{x}} = \frac{150}{\sqrt{100}} = 15\, \text{acres}$ .
Standard Score Calculation: $z = \frac{600 - 582}{15} = \frac{18}{15} = 1.2$ .
Probability Determination: A $z$-score of $1.2$ is at the $88.49$th percentile (probability $0.8849$ of being less than $600$ ). Thus, the probability of the mean being greater than $600$ is $1 - 0.8849 = 0.1151$ .
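
An equivalent way to set up this case study is to model the sampling distribution itself as a normal distribution with mean $\mu$ and standard deviation $\sigma_{\bar{x}}$ , then read the tail probability from it directly. The sketch below assumes that normal model holds for $n = 100$ ; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu = 582        # population mean farm size, acres
sigma = 150     # population standard deviation, acres
n = 100         # sample size
threshold = 600

se = sigma / sqrt(n)                        # standard error of the sample mean
sampling_dist = NormalDist(mu=mu, sigma=se)  # normal model of the sample means

z = (threshold - mu) / se
p_less = sampling_dist.cdf(threshold)       # P(sample mean < 600)
p_greater = 1 - p_less                      # P(sample mean > 600)

print(f"standard error = {se:.1f} acres, z = {z:.2f}")
print(f"P(mean < {threshold}) = {p_less:.4f}, P(mean > {threshold}) = {p_greater:.4f}")
```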

Sample Proportions

Population Proportion ( $p$ ): A population parameter representing the exact proportion of a population possessing a specific trait (e.g., car ownership).
- Example: $240$ out of $400$ students own cars, so $p = \frac{240}{400} = 0.6$ .
Sample Proportion ( $\hat{p}$ ): The proportion observed within a sample.
- Example: A sample of $n = 32$ might yield a $\hat{p} = 0.75$ .
Distribution of Sample Proportions: Results from calculating $\hat{p}$ for all possible samples of a given size.

Notation for Proportions

Entity	Size	Proportion
Population	$N$	$p$
Sample	$n$	$\hat{p}$

Characteristics of the Distribution of Sample Proportions

Normality: Becomes more normal as sample size $n$ increases.
Mean: Equal to the population proportion ( $\mu_{\hat{p}} = p$ ).
Standard Deviation: Given by the formula $\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$ .

Case Study: Analyzing Sample Proportion (Car Ownership)

Scenario: Population proportion $p = 0.6$ ; Sample size $n = 32$ .
Selected Sample: Yields a sample proportion $\hat{p} = 0.75$ .
Standard Deviation of Proportions: $\sigma_{\hat{p}} = \sqrt{\frac{0.6(1 - 0.6)}{32}} \approx 0.09$ .
Standard Score Calculation: $z = \frac{0.75 - 0.6}{0.09} \approx 1.67$ .
Probability: A $z$-score of $1.67$ corresponds to the $95.25$th percentile.
- Probability of $\hat{p} < 0.75$ is $0.9525$ .
- Probability of $\hat{p} > 0.75$ is $1 - 0.9525 = 0.0475$ .
Interpretation: If $100$ random samples were taken, only about $5$ of them would be expected to have a proportion of car owners higher than $0.75$ .
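
The proportion calculation can be sketched the same way, again assuming a normal approximation to the distribution of sample proportions. Note that the hand calculation rounds $\sigma_{\hat{p}}$ up to $0.09$ ; without that rounding the standard deviation is about $0.087$ , $z$ is about $1.73$ , and the tail probability comes out near $0.042$ rather than $0.0475$ . Either way, the expected count is roughly $4$ to $5$ samples out of $100$ .

```python
from math import sqrt
from statistics import NormalDist

p = 0.6        # population proportion of car owners
n = 32         # sample size
p_hat = 0.75   # observed sample proportion

se = sqrt(p * (1 - p) / n)          # standard deviation of the sample proportions
z = (p_hat - p) / se                # standard score of the observed proportion
tail = 1 - NormalDist().cdf(z)      # P(sample proportion > 0.75) under the normal model
expected_in_100 = 100 * tail        # expected number of such samples out of 100

print(f"se = {se:.4f}, z = {z:.2f}")
print(f"P(p_hat > {p_hat}) = {tail:.4f}")
print(f"expected samples above {p_hat} out of 100: {expected_in_100:.1f}")
```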

Questions & Discussion

Think About It: Suppose you choose only one sample of size $n = 32$ . According to Figure 8.4, are you more likely to choose a sample with a mean less than $2.5$ or a sample with a mean less than $3.5$ ? Explain.
- Answer: In a normal distribution centered at $\mu = 3.88$ , the value $3.5$ is closer to the mean than $2.5$ . Every sample with a mean below $2.5$ also has a mean below $3.5$ , so the area of the curve to the left of $3.5$ is larger than the area to the left of $2.5$ . A sample mean less than $3.5$ is therefore more likely.
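
The comparison can be quantified with a short sketch. Figure 8.4 is not reproduced in these notes, so the code assumes a normal model built from the web research example ( $\mu = 3.88$ , $\sigma = 2.40$ , $n = 32$ ); the resulting probabilities are approximations under that assumption, not values read from the figure.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 3.88, 2.40, 32
se = sigma / sqrt(n)

# Normal model of the distribution of sample means (an assumption, see lead-in).
sampling_dist = NormalDist(mu=mu, sigma=se)

p_below_25 = sampling_dist.cdf(2.5)   # P(sample mean < 2.5)
p_below_35 = sampling_dist.cdf(3.5)   # P(sample mean < 3.5)

print(f"P(mean < 2.5) = {p_below_25:.4f}")
print(f"P(mean < 3.5) = {p_below_35:.4f}")
print("more likely:", "mean < 3.5" if p_below_35 > p_below_25 else "mean < 2.5")
```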