Survey Precision and Statistical Inference Notes

Overview and Institutional Context

These study notes are based on the course materials for "Survey Precision" presented by Dr. Atoui Saida for the academic year 2025-2026 at Setif 1 University - Ferhat ABBAS (جامعة سطيف 1 - فرحات عباس). The primary focus of the material is statistical inference and the determination of precision when analyzing both qualitative and quantitative variables within a population.

Introduction to Statistical Inference

Statistical Inference is defined as the process of inferring unknown characteristics of a population from partial observations. This process inherently includes a margin of error, as it involves making generalizations about a large group based on a subset (sample) of that group.

Precision of a Percentage or Proportion (Qualitative Variables)

When dealing with a qualitative variable, the objective is to determine the precision of a percentage or proportion $P$. The problem is framed as follows:

  • $p_0$: The observed proportion within the sample.
  • $P$: The true population proportion, which is unknown.
  • $n$: The sample size.
  • $x$: The number of subjects in the sample possessing the specific characteristic.
  • The formula for the observed proportion is:
    $p_0 = \frac{x}{n}$

Theoretical Reminders and the Binomial Distribution

For repeated sampling, the count of subjects $x$ presenting a characteristic follows a binomial distribution, denoted $B(n, P)$. With $P$ estimated by $p_0$ and $q_0 = 1 - p_0$, the count $x$ and the observed proportion $p_0 = x/n$ have the following statistical properties:

  • Mean of the count $x$: $np_0$ (the corresponding mean of the proportion is $p_0$)
  • Variance of the proportion: $\frac{p_0 q_0}{n}$ (Note: for counts, the variance is expressed as $np_0q_0$)
  • Standard deviation of the proportion ($s$): $\sqrt{\frac{p_0 q_0}{n}}$

When the sample size $n$ is sufficiently large, the binomial distribution can be approximated by a normal distribution, a concept referred to as the "Normal Approximation."
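As a rough check of the normal approximation, the sketch below computes the mean and standard deviation of a binomial count and then sums exact binomial probabilities to see how often the count falls within $1.96$ standard deviations of its mean. The values $n = 250$ and $P = 0.1$ and the function names are illustrative assumptions, not part of the course material.

```python
import math


def binomial_moments(n, p):
    """Return (mean of count, sd of count, sd of proportion) for B(n, p)."""
    q = 1.0 - p
    mean_count = n * p
    sd_count = math.sqrt(n * p * q)
    sd_proportion = math.sqrt(p * q / n)
    return mean_count, sd_count, sd_proportion


def exact_coverage(n, p, z=1.96):
    """Exact binomial probability that the count lies within z sd of its mean."""
    mean_count, sd_count, _ = binomial_moments(n, p)
    low = max(math.ceil(mean_count - z * sd_count), 0)
    high = min(math.floor(mean_count + z * sd_count), n)
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(low, high + 1)
    )


mean_count, sd_count, sd_proportion = binomial_moments(250, 0.1)
coverage = exact_coverage(250, 0.1)
print(f"mean count = {mean_count:.1f}, sd of count = {sd_count:.3f}")
print(f"sd of proportion = {sd_proportion:.4f}")
print(f"exact probability within 1.96 sd: {coverage:.3f}")
```

If the approximation is reasonable for the chosen $n$ and $P$, the printed probability should be close to $0.95$; because the count is discrete, it will not match exactly.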

The Standard Normal Distribution and Probability (Alpha)

The probability $\alpha$ (alpha) is the probability of obtaining a value outside the interval $(-z, +z)$ defined by the z-scores. It is visualized as the area in the two tails of the distribution ($\alpha/2$ in each tail).

According to the table of the standard normal distribution (loi normale), taken from Fisher and Yates (Statistical Tables for Biological, Agricultural and Medical Research), the tabulated probability $\epsilon$ (the risk $\alpha$ in these notes) is the probability that the absolute value of the reduced deviation exceeds a given value $z$. Notable z-score values include:

  • For $\alpha = 0.05$, the z-score is $1.960$.
  • For $\alpha = 0.01$, the z-score is $2.576$.
  • For $\alpha = 0.10$, the z-score is $1.645$.

Small probability values and their corresponding z-scores:

  • $0.001 \rightarrow 3.29053$
  • $0.0001 \rightarrow 3.89059$
  • $0.00001 \rightarrow 4.41717$
  • $0.000001 \rightarrow 4.89164$
  • $0.0000001 \rightarrow 5.32672$
  • $0.00000001 \rightarrow 5.73073$
  • $0.000000001 \rightarrow 6.10941$
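The tabulated values can be reproduced numerically. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library (version 3.8 or later); the helper name `z_for_alpha` is an assumption introduced for illustration.

```python
from statistics import NormalDist


def z_for_alpha(alpha):
    """Two-sided critical value z such that P(|Z| > z) = alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Work in the lower tail (alpha / 2) to keep precision for tiny alpha.
    return -NormalDist().inv_cdf(alpha / 2.0)


for alpha in (0.10, 0.05, 0.01, 0.001, 1e-9):
    print(f"alpha = {alpha:g} -> z = {z_for_alpha(alpha):.5f}")
```

Each tail receives $\alpha/2$, which is why the quantile is taken at $\alpha/2$ rather than at $\alpha$.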

Confidence Interval for Proportions

The calculation assumes that the true population proportion $P$ is close to the observed proportion $p_0$. The population proportion is estimated to lie within a confidence interval:
$P = p_0 \pm e$, where:

  • $e$: The margin of error.
  • The interval runs from $p_0 - e$ to $p_0 + e$.
  • The total width of the interval is $2e$.
  • The error risk $\alpha$ determines the z-score $z$.
  • Maximum error formula: $e = z \times s$, where $s = \sqrt{\frac{p_0 q_0}{n}}$.
  • Full estimation formula: $P = p_0 \pm z \sqrt{\frac{p_0 q_0}{n}}$

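Crucially, the risk of error $\alpha$ and the width of the confidence interval move in opposite directions: lowering $\alpha$ raises $z$ and therefore widens the interval, while accepting a larger $\alpha$ narrows it.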
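The trade-off between risk and width can be illustrated with a short sketch. The function name `proportion_interval` and the sample figures (40 subjects out of 200) are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def proportion_interval(x, n, alpha=0.05):
    """Normal-approximation interval p0 +/- z * sqrt(p0 * q0 / n)."""
    p0 = x / n
    q0 = 1.0 - p0
    z = -NormalDist().inv_cdf(alpha / 2.0)  # two-sided critical value
    e = z * math.sqrt(p0 * q0 / n)          # margin of error
    return p0 - e, p0 + e


# Hypothetical sample: 40 subjects with the characteristic out of 200.
widths = {}
for alpha in (0.10, 0.05, 0.01):
    low, high = proportion_interval(40, 200, alpha)
    widths[alpha] = high - low
    print(f"alpha = {alpha:.2f}: [{low:.3f}, {high:.3f}], width = {widths[alpha]:.3f}")
```

The printed widths grow as $\alpha$ shrinks, which is the relationship described above.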

Application: Disease Frequency Case Study

Data Provided:

  • Sample size ($n$): $250$
  • Affected subjects ($x$): $25$
  • Observed proportion ($p_0$): $\frac{25}{250} = 0.1$ (or $10\%$)
  • Complement ($q_0$): $1 - 0.1 = 0.9$
  • Risk of error ($\alpha$): $5\%$ ($0.05$), which implies $z = 1.96$

Calculation:

  1. Standard deviation of the proportion: $s = \sqrt{\frac{0.1 \times 0.9}{250}} \approx 0.019$
  2. Margin of error: $e = 1.96 \times 0.019 \approx 0.037$, rounded up here to $0.038$
  3. Interval calculation: $P = 0.1 \pm 0.038$
  4. Result: Lower limit = $0.062$, Upper limit = $0.138$

Check Conditions: To validate the normal approximation, the conditions $np \ge 5$ and $nq \ge 5$ must be met at both interval limits:

  • Lower limit ($0.062$): $250 \times 0.062 = 15.5$; $250 \times 0.938 = 234.5$
  • Upper limit ($0.138$): $250 \times 0.138 = 34.5$; $250 \times 0.862 = 215.5$

All values are $\ge 5$, so the approximation is valid.

Conclusion: The frequency of the disease is estimated at $10\%$ and lies between $6.2\%$ and $13.8\%$ at a risk of error $\alpha = 0.05$ (95% CI: $6.2\%$ to $13.8\%$).
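The case study can be retraced step by step with the sketch below. It keeps full precision instead of rounding at each step, so the margin of error comes out near $0.037$ rather than the $0.038$ used above, and the limits differ slightly in the third decimal. The helper `approximation_ok` is a hypothetical name introduced here.

```python
import math

n, x, z = 250, 25, 1.96       # figures from the case study; z for alpha = 0.05

p0 = x / n                    # observed proportion
q0 = 1.0 - p0
s = math.sqrt(p0 * q0 / n)    # standard deviation of the proportion
e = z * s                     # margin of error
lower, upper = p0 - e, p0 + e


def approximation_ok(p, n, threshold=5):
    """Check the n*p >= 5 and n*q >= 5 conditions for a proportion p."""
    return n * p >= threshold and n * (1.0 - p) >= threshold


valid = approximation_ok(lower, n) and approximation_ok(upper, n)
print(f"s = {s:.4f}, e = {e:.4f}")
print(f"interval = [{lower:.4f}, {upper:.4f}], conditions met: {valid}")
```

The conditions are checked at both limits of the interval, as in the notes, rather than at $p_0$ alone.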

Precision of a Mean (Quantitative Variables)

For quantitative variables, the objective is to estimate the population mean $m$ (unknown) using the sample mean $m_0$.

Statistical Parameters:

  • Sample mean ($m_0$): $\frac{\sum x_i}{n}$
  • Variance of variable $x$ ($s^2$): $\frac{\sum (x_i - m_0)^2}{n}$
  • Standard deviation of variable $x$ ($s$): $\sqrt{\frac{\sum (x_i - m_0)^2}{n}}$
  • Variance of the sample mean ($m_0$): $\frac{s^2}{n}$
  • Standard deviation of the sample mean ($m_0$): $\frac{s}{\sqrt{n}}$
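These parameters can be computed directly from raw observations. The sketch below follows the formulas as written in these notes (divisor $n$; many textbooks divide by $n - 1$ instead, which makes little difference for large samples). The function name `describe` and the nine observations are hypothetical.

```python
import math


def describe(values):
    """Return mean, variance (divisor n), sd, and sd of the sample mean."""
    n = len(values)
    m0 = sum(values) / n
    variance = sum((x - m0) ** 2 for x in values) / n
    s = math.sqrt(variance)
    sd_of_mean = s / math.sqrt(n)
    return m0, variance, s, sd_of_mean


# Hypothetical observations, e.g. number of children in nine families.
sample = [4, 5, 5, 6, 6, 6, 7, 7, 8]
m0, variance, s, sd_of_mean = describe(sample)
print(f"m0 = {m0:.2f}, s^2 = {variance:.4f}, s = {s:.4f}")
print(f"sd of the mean = {sd_of_mean:.4f}")
```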

Application Conditions:

  1. The variable must follow a normal distribution.
  2. The sample size must be at least $30$ ($n \ge 30$).

Confidence Interval for the Mean

The calculation assumes $m$ is close to $m_0$ and falls within the interval $m_0 \pm e$.

  • Range: $m_0 - e$ to $m_0 + e$
  • Width: $2e$
  • Formula: $m = m_0 \pm z \frac{s}{\sqrt{n}}$

Fluctuation of Variable $x$: To find the range in which individual values $x_i$ fluctuate in the population:

  • Formula: $x_i = m_0 \pm e$
  • Error for individual values: $e = z \times s$
  • Full formula: $x_i = m_0 \pm z s$
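The difference between the two intervals is easier to see side by side. In the sketch below, the function names and the summary statistics (mean 50, standard deviation 10, sample of 100) are hypothetical and serve only to show that the fluctuation interval is $\sqrt{n}$ times wider than the interval for the mean.

```python
import math


def mean_interval(m0, s, n, z=1.96):
    """Interval for the population mean: m0 +/- z * s / sqrt(n)."""
    e = z * s / math.sqrt(n)
    return m0 - e, m0 + e


def fluctuation_interval(m0, s, z=1.96):
    """Range for individual values: m0 +/- z * s."""
    e = z * s
    return m0 - e, m0 + e


# Hypothetical summary statistics: mean 50, sd 10, sample of 100.
mean_low, mean_high = mean_interval(50.0, 10.0, 100)
ind_low, ind_high = fluctuation_interval(50.0, 10.0)
ratio = (ind_high - ind_low) / (mean_high - mean_low)
print(f"mean:        [{mean_low:.2f}, {mean_high:.2f}]")
print(f"individuals: [{ind_low:.2f}, {ind_high:.2f}]")
print(f"width ratio = {ratio:.1f}")
```

The interval for the mean narrows as $n$ grows, whereas the fluctuation interval depends only on $s$ and $z$.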

Application: Average Number of Children Case Study

Data Provided:

  • Sample size ($n$): $400$
  • Sample average ($m_0$): $6$ children
  • Standard deviation ($s$): $2$ children
  • Risk of error ($\alpha$): $5\%$ ($0.05$), which implies $z = 1.96$

Condition Check:

  • The variable (number of children) is assumed to follow a normal distribution.
  • $n = 400$, which is greater than $30$.

Calculation of Mean Precision:

  1. Standard deviation of the mean: $s_m = \frac{2}{\sqrt{400}} = 0.1$
  2. Margin of error: $e = 1.96 \times 0.1 = 0.196 \approx 0.2$
  3. Population mean estimate: $m = 6 \pm 0.2$
  4. Result: $m \in [5.8, 6.2]$

Conclusion on Mean: The average number of children in the population lies between $5.8$ and $6.2$ children (risk of error $\alpha = 0.05$; 95% CI: $5.8$ to $6.2$).

Calculation of Variable Fluctuation:

  1. Margin of error for individuals: $e = z \times s = 1.96 \times 2 = 3.92$ (rounded to $4$ children).
  2. Interval: $x_i = 6 \pm 4$
  3. Result: $2$ to $10$ children.

Conclusion on Fluctuation: For $95\%$ of families in the population, the number of children is between $2$ and $10$ children.
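Both calculations from this case study can be retraced with the short sketch below. It reports the unrounded margins ($0.196$ and $3.92$) next to the rounded figures used in the notes; the variable names are assumptions introduced for illustration.

```python
import math

n, m0, s, z = 400, 6.0, 2.0, 1.96   # figures from the case study

s_m = s / math.sqrt(n)              # standard deviation of the mean
e_mean = z * s_m                    # margin of error for the mean
e_individual = z * s                # margin of error for individual values

mean_ci = (m0 - e_mean, m0 + e_mean)
individual_range = (m0 - e_individual, m0 + e_individual)

print(f"s_m = {s_m:.3f}, e (mean) = {e_mean:.3f}, e (individuals) = {e_individual:.2f}")
print(f"mean:        [{mean_ci[0]:.3f}, {mean_ci[1]:.3f}]")
print(f"individuals: [{individual_range[0]:.2f}, {individual_range[1]:.2f}]")
```

Rounded to one decimal for the mean and to whole children for individual values, these agree with the intervals $[5.8, 6.2]$ and $2$ to $10$ given above.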