Chapter 8: Sampling and Sampling Distributions (Notes)

8.1 Introduction

  • In statistical inference, we use random samples to learn about the population from which they are drawn.
  • Information extracted from samples comes in the form of summary statistics: sample mean, sample standard deviation, etc.
  • Sample statistics are treated as estimators of population parameters (e.g., μ, p, etc.).
  • Sampling: the process of selecting a sample from a population; a representative sample is analyzed and used to make inferences about the population.
  • Examples of sampling applications:
    • Political polling to estimate voting proportions.
    • Auditing: sample vouchers to estimate the population mean expenditure.
    • Medical testing: analyzing a few samples to infer disease characteristics of the population.
  • Sampling error: the absolute difference between an unbiased estimate and the corresponding population parameter, e.g. |\hat{p} - p| or |\bar{X} - μ|.
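The notion of sampling error can be sketched in a few lines of Python (the synthetic population below is an assumed toy example, not from the text):

```python
import random
import statistics

# Toy population of 10,000 values with a known mean (illustrative only).
random.seed(42)
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)

# Draw a simple random sample and measure the sampling error |x_bar - mu|.
sample = random.sample(population, 100)
x_bar = statistics.mean(sample)
sampling_error = abs(x_bar - mu)
print(f"population mean {mu:.2f}, sample mean {x_bar:.2f}, "
      f"sampling error {sampling_error:.2f}")
```

Different random samples give different sampling errors, which is why a probabilistic treatment of estimates is needed.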

8.2 Reasons for Sample Survey

  • Census vs. sample survey:
    • Census counts all elements in the population.
    • Reasons to prefer sampling over census:
    • Movement of population elements (e.g., fish, birds, mosquitoes) makes counting all elements impractical.
    • Cost and time: contacting and processing data from the entire population is expensive and slow.
    • Destructive tests: some tests destroy the unit; sampling avoids destroying the entire population.
  • Destructive testing example: applying stress until a manufactured item breaks; testing whole population would destroy all items.

8.3 Types of Bias During Sample Survey

  • Bias occurs when the method used tends to over- or under-estimate the true value.
  • Common biases:
    • Undercoverage Bias: some groups in the population are missing from or under-represented in the sampling frame, so the sample cannot represent the whole population (e.g., surveying railway station passengers may not reflect all commuters).
    • Non-response Bias: only a subset of selected respondents responds; non-respondents may differ systematically.
    • Wording Bias: wording or sequencing of questions influences responses.
  • Wording and design of questions must minimize bias to improve reliability.

8.3.1 Sampling and Non-Sampling Errors

  • Sampling error: arises because a sample may differ from the population; different samples yield different estimates.
  • Non-sampling errors: biases and mistakes that occur in census or sampling surveys (e.g., incorrect enumeration, non-random selection, vague questionnaires, coding/editing errors).
  • Minimizing sampling errors can be achieved by:
    • Clear, precise questions;
    • Careful administration; training interviewers; accurate data processing.
  • Measurement of sampling error uses the standard error of the estimate; precision depends on sample size:
    • Standard error is inversely related to the square root of the sample size: SE \propto \frac{1}{\sqrt{n}}.
  • Figure references (conceptual): larger samples reduce the element of error.
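The inverse square-root relationship between the standard error and the sample size can be checked directly (σ = 10 here is an assumed, illustrative value):

```python
import math

# SE = sigma / sqrt(n): quadrupling the sample size halves the standard error.
sigma = 10.0  # assumed population standard deviation (illustrative)
for n in (25, 100, 400):
    se = sigma / math.sqrt(n)
    print(f"n = {n:4d}  SE = {se:.2f}")
```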

8.4 Population Parameters and Sample Statistics

  • Parameter: an exact but generally unknown measure describing the entire population (e.g., μ, σ², σ, median, proportion p).
  • Notation:
    • Population parameters are typically denoted with lowercase Greek letters: e.g., \mu, \sigma^2, \sigma, etc.
  • Sample statistic: a measure computed from a sample, used to estimate population parameters (e.g., \bar{X}, s^2, s, \hat{p}).
  • Notation:
    • Sample statistics are usually denoted by Roman letters: \bar{X}, s^2, s, \hat{p}, etc.
  • Key concept: the value of a statistic varies across samples; the population parameter is treated as a constant.
  • Estimation framework: probabilities are attached to possible sample outcomes to assess reliability and sampling error (as in Figure 8.2 in the text).

8.5 Principles of Sampling

  • Two important principles:
    • (i) Principle of statistical regularity
    • (ii) Principle of inertia of large numbers

8.5.1 Principle of Statistical Regularity

  • Based on the law of statistical regularity: a moderately large random sample from a large population tends on average to reflect the characteristics of the population.
  • Key factors:
    • (i) Sample Size Should be Large: larger samples are more representative but costlier; balance needed between accuracy and resources.
    • (ii) Samples Must be Drawn Randomly: simple random sampling is the default; ensures each combination of elements has equal probability of selection.

8.5.2 Principle of Inertia of Large Numbers

  • A corollary of statistical regularity; as sample size grows, the statistical inference becomes more accurate and stable under similar conditions.
  • Analogy: in a long run of coin tosses, the relative frequency of heads settles near 1/2, so heads and tails occur in roughly equal proportions.
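The coin-tossing analogy is easy to simulate; as the number of tosses grows, the relative frequency of heads stabilises near 0.5 (a sketch, seeded for reproducibility):

```python
import random

# Principle of inertia of large numbers: the relative frequency of heads
# stabilises as the number of tosses grows.
random.seed(1)
for tosses in (10, 1_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(tosses))
    print(f"{tosses:>7} tosses: relative frequency of heads = {heads / tosses:.4f}")
```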

8.6 Sampling Methods

  • Sampling methods are categorized by representation basis and element selection method.
  • Representation Basis:
    • Probability (Random) sampling
    • Non-probability (Non-random) sampling
  • Element Selection methods:
    • Unrestricted, Restricted

Table: Types of Sampling Methods

  • Probability methods (Random):
    • Simple random sampling
    • Complex random sampling
    • Stratified sampling
    • Cluster sampling
    • Systematic sampling
    • Multi-stage sampling
  • Non-probability methods:
    • Convenience sampling
    • Purposive sampling
    • Quota sampling
    • Judgement sampling

8.6.1 Probability Sampling Methods

  • Simple Random (Unrestricted) Sampling:
    • Every population member has an equal and independent chance of being selected.
    • Requires a complete list of population elements (frame).
    • Frame allows random number generation to pick the sample.
    • Pros: supports standard statistical tests, which assume independent selection; Cons: a complete frame of population elements may be impractical to obtain.
  • Stratified Sampling:
    • Population divided into strata (mutually exclusive groups) that are relevant and meaningful.
    • Draw simple random samples within each stratum; samples can be proportional or non-proportional to respective strata sizes.
    • Increases efficiency by ensuring representation of important segments.
    • Example: employee motivation study by job level; sample sizes by strata may be proportional or disproportionate.
    • Proportional sampling: the number drawn from each stratum is in the same proportion as in the population; Disproportional sampling: sample sizes are not proportional to the strata sizes.
  • Cluster Sampling (Area Sampling):
    • Population divided into clusters; clusters are randomly selected; all elements within chosen clusters are studied.
    • Clusters are internally heterogeneous but homogeneous between clusters. Useful when a sampling frame of individuals is hard to construct; it is easier to enumerate clusters.
    • Examples: households in blocks, flights for customer surveys on a given route.
  • Multistage Sampling:
    • A hierarchical sampling design; sample in stages (regions -> towns -> households, etc.).
    • Useful when population is widely spread and simple random sampling is impractical.
  • Systematic Sampling:
    • Elements are listed in a known order; select a random first element, then every k-th element thereafter, where k is the sampling interval, typically k = N/n.
    • Example: 50 samples from 1000 accounts with interval k = 20.
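The systematic selection rule (random start, then every k-th element) can be sketched as follows; the helper name `systematic_sample` is illustrative, not from the text:

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start in the first interval, then every k-th element."""
    rng = random.Random(seed)
    k = len(frame) // n            # sampling interval k = N // n
    start = rng.randrange(k)       # random first element in [0, k)
    return [frame[start + i * k] for i in range(n)]

# 50 accounts from a frame of 1,000, giving interval k = 20 (as in the example).
accounts = list(range(1, 1001))
sample = systematic_sample(accounts, 50, seed=7)
print(len(sample), sample[:3])
```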

8.6.2 Non-Random Sampling Methods

  • Convenience Sampling:
    • Units selected for ease of access; inexpensive and quick.
    • Representativeness is uncertain; caution needed in inference.
  • Purposive Sampling:
    • Targeted respondents who can provide the needed information are chosen.
  • Judgement Sampling:
    • Respondents chosen based on the investigator’s judgment about who possesses the needed information.
    • Limited cross-sectional generalizability; risk of biased conclusions if judgment is poor.
  • Quota Sampling:
    • A form of stratified sampling in which respondents are selected on a convenience basis to fill predefined quotas for subgroups (e.g., gender, income, age).
    • Criticized because it violates random sampling, making precision unreliable.

8.6.3 Choice of Sampling Methods

  • Decision factors:
    • Nature of the study, population size, required precision, budget/resources, etc.
  • Simple plan (high-level):
    • If representativeness is crucial, use simple random sampling; if structure exists, consider stratified or cluster sampling; if quick information from available strata is needed, use systematic or sequential sampling; if budget is tight, cluster sampling may be appropriate.
  • Guidelines (from Fig. 8.3): flowchart-like guidance on selecting methods depending on aims (representativeness, speed, parameter assessment, etc.).

8.7 Reliability and Errors in Sampling

  • Reliability of a sample can be assessed by:
    • Repeating the sampling process multiple times and comparing results; low variation signals reliability.
    • Taking sub-samples from the main sample and checking consistency.
    • Comparing sample results to theoretical expectations from mathematical properties (e.g., binomial, normal, Poisson) to assess fit.
  • Distinguish sampling vs non-sampling errors:
    • Sampling errors arise from using a sample instead of the entire population.
    • Non-sampling errors arise from biases and mistakes in enumeration, sampling design, questionnaires, editing, and processing.

8.7.1 Standard Error of Statistic

  • A measure of sampling error: how much a statistic varies across repeated samples.
  • For an infinite population (or sampling with replacement), the standard error of the mean is:
    SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}.
  • The standard error indicates the precision of the estimate; smaller SE means more precise.
  • In practice, when σ is unknown, use the sample standard deviation s to estimate SE and switch to the t-distribution when needed.
  • The standard error also informs estimation and confidence intervals (e.g., in estimation procedures discussed in Chapter 9).

Finite Population Correction (FPC)

  • When sampling without replacement from a finite population of size N, the SE of the mean is adjusted:
    SE(\bar{X}) = \sigma \sqrt{\frac{N-n}{N-1}} \cdot \frac{1}{\sqrt{n}}.
  • In practice, ignore the FPC when n/N ≤ 0.05.
  • If σ is unknown, replace σ with s and use the t-distribution with df = n-1.
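These rules can be combined into a small helper; the function name and the handling of the 5% threshold are a sketch of the conventions above:

```python
import math

def se_mean(sigma, n, N=None):
    """SE of the sample mean; applies the FPC when n/N > 0.05."""
    se = sigma / math.sqrt(n)
    if N is not None and n / N > 0.05:
        se *= math.sqrt((N - n) / (N - 1))  # finite population correction
    return se

print(se_mean(40, 25))          # infinite population: 8.0
print(se_mean(40, 25, N=130))   # finite population: FPC shrinks the SE
```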

8.7.2 Distinction: Population, Sample Distributions, and Sampling Distributions

  • Population distribution: distribution of all elements; mean \mu and standard deviation \sigma describe the population.
  • Sample distribution: distribution of the values observed within a single sample; its mean \bar{x} and standard deviation s describe that sample.
  • Sampling distribution: distribution of the statistic across all possible samples of a fixed size drawn from the population; has its own mean (often denoted \mu_{\bar{X}}) and standard deviation (the standard error \sigma_{\bar{X}}).
  • Important facts for the sampling distribution of the mean:
    • The mean of the sampling distribution of the mean equals the population mean: \mu_{\bar{X}} = μ.
    • The standard deviation of the sampling distribution of the mean is: \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}.
    • If the population is normally distributed, the sampling distribution of the mean is normal for all sample sizes; otherwise, for large n it tends toward normal via the Central Limit Theorem.
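The two facts \mu_{\bar{X}} = μ and \sigma_{\bar{X}} = σ/\sqrt{n} can be checked empirically with a short simulation (illustrative parameters; agreement is approximate, not exact):

```python
import math
import random
import statistics

# Draw many samples and look at the distribution of their means.
random.seed(0)
mu, sigma, n = 100.0, 15.0, 36
means = [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(5_000)]

print(f"mean of sample means: {statistics.mean(means):.2f} (theory: {mu})")
print(f"sd of sample means:   {statistics.stdev(means):.2f} "
      f"(theory: {sigma / math.sqrt(n):.2f})")
```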

8.8 Sampling Distribution of Sample Mean

  • The distribution of the sample mean depends on the population distribution and sample size.

8.8.1 Sampling Distribution of Mean When Population Has Non-Normal Distribution

  • Central Limit Theorem (CLT): as n grows, the sampling distribution of the sample mean is approximately normal with mean μ and standard deviation \sigma/\sqrt{n}, regardless of the population shape (finite mean μ and finite standard deviation σ).
  • CLT statement (form):
    • If random samples of size n are drawn from a non-normal population with finite mean μ and standard deviation σ, then as n increases, the distribution of the sample mean is approximately normal with:
    • Mean: E(X̄) = μ
    • Standard deviation: \text{SD}(X̄) = \frac{σ}{\sqrt{n}}
  • Practical guidelines for CLT applicability:
    • If the population is normal: the sampling distribution of the mean is normal for any n.
    • If the population is approximately symmetric: the sampling distribution of the mean is approximately normal for relatively small n.
    • If the population is skewed: a larger sample size is needed (often n ≥ 30) for the sampling distribution of the mean to be approximately normal.
  • Standard normal transform for a single sample mean:
    z = \frac{\bar{X} - μ}{σ/\sqrt{n}}.
  • Implication: for large n, the distribution of sample means is concentrated around μ with reduced spread relative to the population spread.
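The z-transform above can be evaluated without tables by computing the standard normal CDF from the error function (the numeric inputs below are illustrative, not from the text):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_mean(x_bar, mu, sigma, n):
    """z-score of a sample mean."""
    return (x_bar - mu) / (sigma / sqrt(n))

# Illustrative: mu = 50, sigma = 10, n = 100.
z = z_mean(52.0, 50.0, 10.0, 100)
print(f"z = {z:.2f}, P(X_bar <= 52) = {phi(z):.4f}")
```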

8.8.2 Sampling Distribution of Mean When Population Has Normal Distribution

  • If the population is normal and the population standard deviation σ is known, the sampling distribution of the mean is normal with:
    • Mean: E(X̄) = μ
    • Standard deviation: \text{SD}(X̄) = \frac{σ}{\sqrt{n}}
  • If all possible samples of size n are drawn with replacement from a normal population, the sampling distribution of the mean is normal for any n.
  • Z-based probabilities for the mean use: z = \frac{\bar{X} - μ}{σ/\sqrt{n}}.
  • Illustrative benchmarks for the sampling distribution of the mean (when normal):
    • About 68% within \pm σ/\sqrt{n} of the mean.
    • About 95% within \pm 1.960 \cdot \frac{σ}{\sqrt{n}} of the mean.
  • When the population standard deviation is known, one uses the normal model for inference (or t-distribution if σ is unknown).

Finite Population Correction (FPC) (revisited)

  • If the population is finite with N elements and the sample size is n drawn without replacement, then the SE is adjusted by the finite population multiplier:
    SE(\bar{X}) = \frac{σ}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}.
  • As noted, when n/N \le 0.05, the FPC is often ignored.

8.8.3 Sampling Distribution of Difference Between Two Sample Means

  • When comparing two populations (means μ_1, μ_2) using independent samples of sizes n_1, n_2 with population standard deviations σ_1, σ_2, the difference of the sample means \bar{X}_1 - \bar{X}_2 has:
    • Mean: E(\bar{X}_1 - \bar{X}_2) = μ_1 - μ_2.
    • Standard error (independent samples; apply the FPC to each term for finite populations):
      SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{σ_1^2}{n_1} + \frac{σ_2^2}{n_2}}.
  • If population standard deviations are unknown, one uses appropriate estimates (pooled or separate) and possibly a t-based approach depending on assumptions.
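The standard-error formula for the difference of two independent means translates directly into code (the numeric values are assumed for illustration):

```python
from math import sqrt

def se_diff_means(sigma1, n1, sigma2, n2):
    """SE of X_bar_1 - X_bar_2 for independent samples."""
    return sqrt(sigma1**2 / n1 + sigma2**2 / n2)

# Illustrative: sigma1 = 15, n1 = 50; sigma2 = 12, n2 = 60.
print(f"SE = {se_diff_means(15, 50, 12, 60):.4f}")
```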

8.9 Sampling Distribution of Sample Proportion

  • For a binary characteristic (success/failure) with population proportion p, the sample proportion is:
    \hat{p} = \frac{X}{n},
    where X is the number of successes in a sample of size n.
  • The sampling distribution of the sample proportion has:
    • Mean: E(\hat{p}) = p.
    • Standard error (infinite population):
      SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}.
  • Normal approximation criteria (larger samples):
    • If the sample size is large, the distribution of \hat{p} is approximately normal provided
      np \ge 5 \quad \text{and} \quad n(1-p) \ge 5.
  • Finite population correction for proportions: apply the multiplier \sqrt{\frac{N-n}{N-1}} when sampling without replacement from a finite population.
  • For finite populations, the mean and standard deviation expressions are the same as above but with the FPC multiplier applied to the standard error.

8.9.1 Sampling Distribution of the Difference of Two Proportions

  • For two populations with sizes N_1, N_2 and sample sizes n_1, n_2, the distribution of the difference in sample proportions \hat{p}_1 - \hat{p}_2 has:
    • Mean: E(\hat{p}_1 - \hat{p}_2) = p_1 - p_2.
    • Standard error (assuming independence):
      SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}.
  • With large samples, the difference of proportions is well approximated by a normal distribution.
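The proportion formulas, including the np ≥ 5 and n(1−p) ≥ 5 check, can be sketched as follows (the values p = 0.4, n = 100 are illustrative):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def se_prop(p, n):
    """SE of a sample proportion (infinite population)."""
    return sqrt(p * (1 - p) / n)

# Illustrative: p = 0.4, n = 100; the normal approximation is justified
# since np = 40 >= 5 and n(1 - p) = 60 >= 5.
p, n = 0.4, 100
se = se_prop(p, n)
print(f"SE = {se:.4f}, P(p_hat <= 0.45) = {phi((0.45 - p) / se):.4f}")
```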

Example 8.1 to 8.12 (Selected Key Results and Methods)

  • Example 8.1 (Sample mean with known σ):
    • Given: population mean μ = 41.5, population σ = 2.5, sample size n = 50.
    • SE of the mean: SE = \frac{σ}{\sqrt{n}} = \frac{2.5}{\sqrt{50}} = 0.3536.
    • Probability: P(40.5 ≤ X̄ ≤ 42) = Φ(1.4140) - Φ(-2.8281) ≈ 0.9184.
  • Example 8.2 (Normal weights, n = 16):
    • Population mean μ = 800, σ = 300, n = 16.
    • SE of mean: \frac{300}{\sqrt{16}} = 75.
    • (a) P(X̄ > 900) ≈ 0.0918.
    • (b) Middle 95% of sample means: 800 ± 1.96×75 = (653, 947) g.
  • Example 8.3 (Three monitors, independent):
    • Population mean μ = 4300 h, σ = 730 h, n = 3; SE = 730/\sqrt{3} ≈ 421.48.
    • Probability that the three-unit set lasts at least 13,000 h: P(X̄ ≥ 4333.33) ≈ 0.4681.
  • Example 8.4 (Big Bazar, 25 outlets):
    • N = 130, n = 25, σ = 40; SE with FPC: σ/\sqrt{n} × \sqrt{(N-n)/(N-1)} ≈ 13.72.
    • Probability that the sample mean falls within ±30 of the population mean: ≈ 0.9708.
  • Example 8.5 (CEO sampling to bound SE):
    • μ = 8000, σ = 300; want SE ≤ 0.0015×μ = 12.
    • Solve: 300/√n ≤ 12 ⇒ √n ≥ 25 ⇒ n ≥ 625.
  • Example 8.6 (Safal tea consumption):
    • Population σ = 1.50 kg (unknown μ); n = 25; want P(|X̄ − μ| ≤ 0.5 kg) and required sample for 98% confidence.
    • (a) P(|X̄ − μ| ≤ 0.5) = P(|Z| ≤ 0.5/0.3) = P(|Z| ≤ 1.67) ≈ 0.9050.
    • (b) For 98% confidence, z ≈ 2.33; n ≈ (z×σ/0.5)^2 = (2.33×1.5/0.5)^2 ≈ 48.86 ⇒ n ≈ 49.
  • Example 8.7 (Motorcycle efficiency):
    • Population: μ = 90, σ ≈ unknown; sample n = 25; observed x̄ = 87, s = 5.
    • Since σ unknown, use t-distribution with df = n − 1 = 24.
    • P(X̄ ≤ 87) ≈ P(t ≤ -3.00) ≈ 0.0031.
  • Example 8.8 (Stereos A vs B):
    • A: μ1 = 1400 h, σ1 = 200, n1 = 125; B: μ2 = 1200 h, σ2 = 100, n2 = 125.
    • SE for difference: \sqrt{\frac{σ_1^2}{n_1} + \frac{σ_2^2}{n_2}} = \sqrt{\frac{200^2}{125} + \frac{100^2}{125}} = \sqrt{320 + 80} = \sqrt{400} = 20.
    • (a) P(X̄A − X̄B ≥ 160) ≈ 0.9772.
    • (b) P(X̄A − X̄B ≥ 250) ≈ 0.0062.
  • Example 8.9 (Two lots of ball bearings):
    • Each lot: μ = 0.5 kg, σ = 0.02 kg; n = 100 per lot.
    • Difference mean under independence: difference of means ~ N(0, SE^2) with SE = √( (σ^2/n) + (σ^2/n) ) = √(2×(0.02^2/100)) ≈ 0.0028 kg.
    • Probability that the difference exceeds 0.002 kg: ≈ 0.0258.
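Example 8.1 can be reproduced numerically; small differences from the text's figure come from reading rounded values off a z-table rather than computing the CDF exactly:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Example 8.1 data: mu = 41.5, sigma = 2.5, n = 50.
mu, sigma, n = 41.5, 2.5, 50
se = sigma / sqrt(n)
prob = phi((42 - mu) / se) - phi((40.5 - mu) / se)
print(f"SE = {se:.4f}, P(40.5 <= X_bar <= 42) = {prob:.4f}")
```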

Self-Practice Problems and Conceptual Questions (selected highlights)

  • Self-Practice Problems 8A covers:
    • 8.1 Probability that sample mean lies within a symmetric interval around μ with given n and σ.
    • 8.2 Time-between-arrivals problem using CLT for sample mean.
    • 8.3 Comparison of two population means using two samples.
    • 8.4 Interval for mean with known distribution assumptions.
    • 8.5, 8.6, 8.7, 8.8 additional CLT-based problems with finite/infinite populations.
  • Conceptual Questions 8A includes items about:
    • Reason for sampling, population vs sample, sampling vs probability distributions, standard error, finite population correction, etc.

8.9 and 8.x: Additional Practice and Solutions

  • 8.9: Sampling distribution of sample proportion and difference of two proportions with examples:
    • Example 8.10: Probability that proportion defective is between two values for n = 300, p = 0.03.
    • Example 8.11: Finite population sample proportion exceeding 50% acceptance for p = 0.60, N = 100,000, n = 100.
    • Example 8.12: Difference in proportions for two companies with n1 = 250, n2 = 300; probability that difference ≤ 0.02.

8.10 to 8.12: More Examples (summary of results and methods)

  • These examples illustrate the use of normal approximation to proportion, confidence intervals for proportions, and comparing two proportions under large-sample conditions.
  • Key formulas used:
    • For a single proportion: SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}} (infinite population).
    • For a finite population: multiply by the finite population correction factor \sqrt{\frac{N-n}{N-1}}.
    • For the difference of two proportions: SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}.

Concepts Quiz and Review (high-level takeaways)

  • True/False statements test understanding of:
    • The nature of sampling distributions and standard errors.
    • When the Central Limit Theorem applies and why.
    • The distinction between population distribution, sample distribution, and sampling distribution.
    • The role of finite population correction and degrees of freedom in t-distributions.
  • Multiple-choice items reinforce: identifying appropriate sampling methods, recognizing normal approximation criteria, and applying standard error formulas.

Key Formulas to Remember

  • Population mean, population SD: μ, \; σ.

  • Sample mean, sample SD: \bar{X}, \; s; the sampling distribution of the mean has mean \mu_{\bar{X}} = μ and SE \sigma_{\bar{X}} = \frac{σ}{\sqrt{n}} (infinite population).

  • Finite population correction (for sampling without replacement):
    SE(\bar{X}) = \frac{σ}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}.

  • If σ unknown, replace with s and use t-distribution with degrees of freedom df = n-1.

  • Central Limit Theorem (CLT) for non-normal populations:

    • \bar{X} \approx N(μ, σ^2/n) for sufficiently large n.
  • Sampling distribution of the difference of two means (independent samples):

    • E(\bar{X}_1 - \bar{X}_2) = μ_1 - μ_2
    • SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{σ_1^2}{n_1} + \frac{σ_2^2}{n_2}}.
  • Sampling distribution of a proportion: \hat{p} = X/n, \; E(\hat{p}) = p, \; SE(\hat{p}) = \sqrt{ \frac{p(1-p)}{n} }.

  • Difference of two proportions (large samples):
    SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}.

  • Degrees of freedom for t-distribution: df = n - 1.

  • For normal approximations, common z-values:

    • 68%: within \pm 1 standard error of the mean, i.e., within \pm \frac{σ}{\sqrt{n}}.
    • 95%: within \pm 1.960 \cdot \frac{σ}{\sqrt{n}}.