Statistical Intervals Based on a Single Sample
Basic Properties of Confidence Intervals
- Introduction: Focus on estimating a population mean (\mu). Assumptions:
- Population distribution is normal.
- Population standard deviation \sigma is known (unrealistic in practice).
- Sample Observations: The observations x_1, x_2, …, x_n are the observed values of a random sample X_1, …, X_n from a normal distribution with mean \mu and standard deviation \sigma.
- Distribution of Sample Mean: The sample mean \bar{X} is normally distributed with:
- Expected value: E(\bar{X}) = \mu
- Standard deviation: (\frac{\sigma}{\sqrt{n}})
- Standardizing: Standardizing \bar{X} yields a standard normal variable:
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}
- Standard Normal Curve: The area under the standard normal curve between -1.96 and 1.96 is 0.95.
P(-1.96 < Z < 1.96) = 0.95
P(-1.96 < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < 1.96) = 0.95
- Manipulating Inequalities:
- Multiply by (\frac{\sigma}{\sqrt{n}}):
-1.96 \frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < 1.96 \frac{\sigma}{\sqrt{n}}
- Subtract \bar{X} from each term:
-\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} < -\mu < -\bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}
- Multiply by -1 (reverses inequality direction):
\bar{X} + 1.96 \frac{\sigma}{\sqrt{n}} > \mu > \bar{X} - 1.96 \frac{\sigma}{\sqrt{n}}
that is,
\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}
- Random Interval: The probability statement can be written as:
P(\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}) = 0.95
- Interpretation: Think of a random interval with endpoints \bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} and \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}. In interval notation:
(\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}}, \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}})
- Properties of the Interval:
- The interval is random because its endpoints depend on the random variable \bar{X}.
- It is centered at the sample mean \bar{X}.
- It extends 1.96 \frac{\sigma}{\sqrt{n}} to each side of \bar{X}.
- The width of the interval is 2(1.96) \frac{\sigma}{\sqrt{n}}, which is not random; only the location is random.
- Probability Statement Paraphrased: "The probability is 0.95 that the random interval includes the true value of \mu."
Definition of Confidence Interval
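- Prior to Experiment: Before data collection, the probability is 0.95 that \mu will lie inside the random interval.
- After the Experiment: Substituting the observed sample mean \bar{x} for \bar{X} gives a fixed interval, which is called a 95% confidence interval (CI) for \mu.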
Example 7.2
- Values: Given \sigma = 2.0, n = 31, and \bar{x} = 80.0, the 95% confidence interval (CI) for the true average preferred height is:
(80.0 - 1.96 \frac{2.0}{\sqrt{31}}, 80.0 + 1.96 \frac{2.0}{\sqrt{31}}) = (79.3, 80.7)
- Interpretation: We can be highly confident (95% confidence level) that 79.3 < \mu < 80.7. This narrow interval indicates a precise estimate of \mu.
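The arithmetic of Example 7.2 can be checked with a few lines of Python. This is a minimal sketch; the function name `z_interval_95` is made up for illustration and is not from the text.

```python
import math

def z_interval_95(xbar, sigma, n):
    """95% CI for a normal mean with known sigma: xbar +/- 1.96 * sigma / sqrt(n)."""
    half_width = 1.96 * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Values from Example 7.2
lower, upper = z_interval_95(xbar=80.0, sigma=2.0, n=31)
print(f"({lower:.1f}, {upper:.1f})")  # prints (79.3, 80.7)
```

The half-width here is about 0.70, which is why the interval is narrow relative to the point estimate of 80.0.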
Interpreting a Confidence Level
- Inheritance: The 95% confidence level is inherited from the 0.95 probability of the random interval capturing the true value of \mu.
- Incorrect Conclusion: It is tempting to conclude that \mu lies within the fixed interval (79.3, 80.7) with probability 0.95. This is incorrect: once the observed value \bar{x} = 80.0 is substituted for \bar{X}, nothing random remains, so no probability statement about the fixed interval applies.
- Correct Interpretation: Relies on the long-run relative frequency interpretation of probability.
- Saying event A has probability 0.95 means if the experiment is repeated many times, A will occur 95% of the time.
- Repeated Sampling: Suppose we obtain multiple independent samples of typists’ preferred heights and compute a 95% CI for each sample.
- If A is the event that \bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}, then P(A) = 0.95.
- In the long run, 95% of the computed CIs will contain \mu.
- Example: In a scenario with 100 intervals, approximately 95 of them would contain \mu.
- Focus on Long-Run: The 95% confidence level isn't a statement about a specific interval like (79.3, 80.7) but about what happens if many similar intervals are constructed.
- Classical CI: These intervals are "classical" because their interpretation relies on the classical notion of probability.
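The long-run interpretation can be illustrated by simulation. The sketch below assumes a hypothetical true mean of 80.0 (in practice \mu is unknown) and reuses \sigma = 2.0 and n = 31 from Example 7.2; the function name `coverage_rate` and the seed are illustrative choices.

```python
import math
import random

def coverage_rate(mu, sigma, n, num_samples, seed=0):
    """Fraction of simulated 95% CIs (known sigma) that contain the true mean mu."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # fixed: only the center is random
    hits = 0
    for _ in range(num_samples):
        # Draw one sample of size n from N(mu, sigma) and compute its mean
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / num_samples

# mu = 80.0 is an assumed "true" value used only to run the simulation
rate = coverage_rate(mu=80.0, sigma=2.0, n=31, num_samples=10_000)
print(f"fraction of intervals containing mu: {rate:.3f}")
```

Each simulated interval either contains \mu or it does not; the printed fraction should be close to 0.95, and it describes the procedure rather than any single interval.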
Other Levels of Confidence
- Adjusting for Desired Confidence: Any desired confidence level can be achieved by replacing 1.96 (for 95% CI) with the appropriate standard normal critical value (z-score).
- Generalization: A probability of 1 - \alpha is achieved by using z_{\alpha/2} in place of 1.96.
- CI Formula (7.5): A 100(1 - \alpha)\% CI for the mean \mu of a normal population with known \sigma is:
\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}
Which can be expressed as:
Point estimate of \mu \pm (z critical value) * (standard error of the mean)
Example 7.3
Context: A production process modification for engine control housing units, where hole diameters are normally distributed.
- Prior standard deviation: \sigma = 0.100 mm (assumed unchanged).
- Sample of n = 40 housing units, with mean diameter \bar{x} = 5.426 mm.
Goal: Calculate a 90% confidence interval for the true average hole diameter.
Calculation: A confidence level of 90% means 100(1 - \alpha) = 90, so \alpha = 0.10 and z_{\alpha/2} = z_{0.05} = 1.645.
5.426 \pm 1.645 \frac{0.100}{\sqrt{40}} = (5.400, 5.452)
Conclusion: With high confidence (90%), we can say 5.400 < \mu < 5.452. The interval is narrow due to small variability in hole diameter.
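For an arbitrary confidence level, the critical value z_{\alpha/2} can be computed rather than read from a table. The sketch below uses `statistics.NormalDist().inv_cdf` from the Python standard library; the function name `z_interval` and its parameter names are illustrative assumptions.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """100(1 - alpha)% CI for a normal mean with known sigma."""
    alpha = 1 - confidence
    # z_{alpha/2} leaves area alpha/2 in the upper tail of the standard normal curve
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Values from Example 7.3 with a 90% confidence level
lower, upper = z_interval(xbar=5.426, sigma=0.100, n=40, confidence=0.90)
print(f"({lower:.3f}, {upper:.3f})")  # prints (5.400, 5.452)
```

The computed critical value is about 1.645, matching the table value used in the example.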
Confidence Level, Precision, and Sample Size
- Trade-off: Higher confidence levels result in wider intervals, meaning reduced precision.
- Width Comparison:
- 95% interval width: 2(1.96) \frac{\sigma}{\sqrt{n}} = 3.92 \frac{\sigma}{\sqrt{n}}
- 99% interval width: 2(2.58) \frac{\sigma}{\sqrt{n}} = 5.16 \frac{\sigma}{\sqrt{n}}
- Inverse Relationship: Confidence level (reliability) is inversely related to precision.
- Strategy: Specify desired confidence level and interval width, then determine the necessary sample size.
Example 7.4
- Context: Estimating true average response time \mu for a computer time-sharing system after a new operating system installation. Response times are normally distributed with \sigma = 25 millisec.
- Goal: Determine sample size n to ensure a 95% CI with a width of at most 10.
- Calculation:
2 z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = w
where w is the desired width, in this case 10. So,
10 = 2 (1.96) \frac{25}{\sqrt{n}}
\sqrt{n} = \frac{2 (1.96)(25)}{10} = 9.80
n = (9.80)^2 = 96.04
- Sample Size: Round up to the nearest integer, so n = 97 is required.
General Formula for Sample Size
- Formula: To ensure an interval width w:
n = (\frac{2 z_{\alpha/2} \sigma}{w})^2
- Dependence: n increases as:
- w decreases (smaller width requires larger sample size).
- \sigma increases (more population variability requires larger sample size).
- the confidence level 100(1 - \alpha)\% increases (as \alpha decreases, z_{\alpha/2} increases).
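The sample size formula translates directly into code. This is a sketch only: the function name `sample_size_for_width` is hypothetical, and it computes z_{\alpha/2} from the standard library instead of using the rounded table value 1.96, so the intermediate result differs slightly from 96.04 before rounding up.

```python
import math
from statistics import NormalDist

def sample_size_for_width(sigma, width, confidence=0.95):
    """Smallest n for which the CI width 2 * z * sigma / sqrt(n) is at most `width`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((2 * z * sigma / width) ** 2)

# Values from Example 7.4: sigma = 25, desired width 10, 95% confidence
print(sample_size_for_width(sigma=25, width=10))  # prints 97
```

Halving the width to 5 roughly quadruples the required sample size, since n is proportional to 1/w^2.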
Bound on the Error of Estimation
- Definition: The half-width of the 95% CI (1.96 \frac{\sigma}{\sqrt{n}}) is the bound on the error of estimation.
- Interpretation: With 95% confidence, the point estimate \bar{x} will be no farther than this bound from \mu.
- Objective: Determine a sample size for which a particular value of the bound is achieved.
Estimating Mean to Within a Bound
- Generalization: To estimate \mu to within an amount B with 100(1 - \alpha)\% confidence, the necessary sample size is found by replacing 2/w with 1/B in the sample size formula.
- Sample size formula: n = (z_{\alpha/2} \frac{\sigma}{B})^2
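As a worked check, take the values of Example 7.4 (\sigma = 25, 95% confidence) and assume a bound B = 5, which is half of the width w = 10 used there. The bound formula should then give the same sample size:

```latex
n = \left( z_{\alpha/2} \frac{\sigma}{B} \right)^2
  = \left( 1.96 \cdot \frac{25}{5} \right)^2
  = (9.80)^2
  = 96.04
  \quad\Longrightarrow\quad n = 97
```

This agrees with the width-based calculation because B = w/2.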
Deriving a Confidence Interval
Objective: Construct a CI for a parameter \theta based on a sample X_1, X_2, …, X_n.
Requirements of Random Variable: Find a random variable h(X_1, X_2, …, X_n; \theta) that:
- Depends functionally on both the sample and \theta.
- Has a probability distribution that does not depend on \theta or any other unknown parameters.
Normal Example: If the population is normal with known \sigma and \theta = \mu, then
h(X_1, …, X_n; \mu) = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}
satisfies both properties (its distribution is standard normal, whatever the value of \mu).
Appropriate Estimator: The form of the h function is usually suggested by examining the distribution of an appropriate estimator.
Probability Statement: For any \alpha between 0 and 1, find constants a and b such that
P(a < h(X_1, …, X_n; \theta) < b) = 1 - \alpha
where a and b do not depend on \theta.
Manipulation: Manipulate the inequalities to isolate \theta:
P(l(X_1, X_2, …, X_n) < \theta < u(X_1, X_2, …, X_n)) = 1 - \alpha
Confidence Limits: l(x_1, x_2, …, x_n) and u(x_1, x_2, …, x_n) are the lower and upper confidence limits, respectively, for a 100(1 - \alpha)\% CI.
Example 7.5
Context: Time to breakdown of an insulating fluid between electrodes follows an exponential distribution with parameter \lambda.
Sample: A random sample of n = 10 breakdown times gives observed values x_1, …, x_{10}.
Goal: Find a 95% CI for \lambda and for the true average breakdown time.
Random Variable: Let h(X_1, X_2, …, X_n; \lambda) = 2\lambda \sum X_i. This has a chi-squared distribution with 2n degrees of freedom (df).
Chi-Squared Distribution: The number of df is denoted \nu = 2n; \nu is the distribution's only parameter. For example, n = 10 gives \nu = 20.
Using Chi-Squared Table: With \nu = 20, the value 34.170 captures upper-tail area 0.025, and 9.591 captures lower-tail area 0.025.
Probability: For n = 10: P(9.591 < 2\lambda \sum X_i < 34.170) = 0.95
Isolating Lambda: Division by 2\sum X_i isolates \lambda: P(\frac{9.591}{2\sum X_i} < \lambda < \frac{34.170}{2\sum X_i}) = 0.95
Confidence Limits: The lower limit is \frac{9.591}{2\sum x_i}, and the upper limit is \frac{34.170}{2\sum x_i}.
Calculation: For given data, \sum x_i = 550.87, giving the interval (0.00871, 0.03101).
Expected Value: The expected value of an exponential rv is \mu = \frac{1}{\lambda}. Taking reciprocals reverses the inequalities, so
P(\frac{2\sum X_i}{34.170} < \frac{1}{\lambda} < \frac{2\sum X_i}{9.591}) = 0.95
CI for Mean: The 95% CI for the true average breakdown time is (2\sum x_i / 34.170, 2\sum x_i / 9.591) = (32.24, 114.87). The interval is wide, reflecting the variability in breakdown times and the small sample size.
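Both intervals in Example 7.5 follow from the same two table values. The sketch below hard-codes the chi-squared critical values quoted in the example (20 df, tail areas 0.025) instead of computing them; the function name `exponential_rate_ci` is illustrative.

```python
def exponential_rate_ci(total, chi2_lower, chi2_upper):
    """CI for lambda from the pivot 2 * lambda * sum(X_i), chi-squared with 2n df."""
    return chi2_lower / (2 * total), chi2_upper / (2 * total)

# Values quoted in Example 7.5: sum of the n = 10 observations and table values for 20 df
total = 550.87
lam_lo, lam_hi = exponential_rate_ci(total, chi2_lower=9.591, chi2_upper=34.170)

# The mean is 1 / lambda, so taking reciprocals swaps the roles of the two limits
mean_lo, mean_hi = 1 / lam_hi, 1 / lam_lo

print(f"lambda: ({lam_lo:.5f}, {lam_hi:.5f})")  # prints lambda: (0.00871, 0.03101)
print(f"mean:   ({mean_lo:.2f}, {mean_hi:.2f})")  # prints mean:   (32.24, 114.87)
```

For other sample sizes or confidence levels, the two critical values would have to be looked up again for 2n df.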
Bootstrap Confidence Intervals
- Bootstrap CI for Theta: The bootstrap technique can be applied to obtain a CI for a parameter \theta.
- Estimating Mean Example: Consider again estimating the mean \mu of a normal distribution when \sigma is known.
- Percentile: 1.96 \frac{\sigma}{\sqrt{n}} is the 97.5th percentile of the distribution of \bar{X} - \mu:
P(\bar{X} - \mu < 1.96 \frac{\sigma}{\sqrt{n}}) = P(Z < 1.96) = 0.975
- Symmetry: Similarly, -1.96 \frac{\sigma}{\sqrt{n}} is the 2.5th percentile.
- Interval:
0.95 = P(2.5th \text{ percentile } < \bar{X} - \mu < 97.5th \text{ percentile }) = P(-2.5th \text{ percentile } > \mu - \bar{X} > -97.5th \text{ percentile })
- Bootstrap CI: With
- l = \bar{X} - 97.5th \text{ percentile of } \bar{X} - \mu
- u = \bar{X} - 2.5th \text{ percentile of } \bar{X} - \mu
the CI for \mu is (l, u).
- Bootstrap Samples: Percentiles can be estimated from bootstrap samples.
- Procedure: Given B = 1000 bootstrap samples, calculate the mean \bar{x}^* of each and the differences \bar{x}^* - \bar{x}.
- Estimate for Percentiles: The 25th largest and 25th smallest of these differences estimate the 97.5th and 2.5th percentiles, respectively, of \bar{X} - \mu.
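The bootstrap procedure can be sketched as follows. The notes give no data set for this section, so the sample below is synthetic (31 simulated values from a normal distribution with mean 80 and standard deviation 2) and purely hypothetical; the function name `bootstrap_mean_ci`, the parameter `num_boot`, and the seeds are likewise illustrative.

```python
import random

def bootstrap_mean_ci(data, num_boot=1000, seed=0):
    """95% bootstrap CI for the mean, built from the differences xbar* - xbar."""
    rng = random.Random(seed)
    n = len(data)
    xbar = sum(data) / n
    diffs = []
    for _ in range(num_boot):
        # A bootstrap sample: n draws with replacement from the observed data
        resample = rng.choices(data, k=n)
        diffs.append(sum(resample) / n - xbar)
    diffs.sort()
    k = round(0.025 * num_boot)  # 25 when num_boot = 1000
    p025 = diffs[k - 1]          # 25th smallest: estimates the 2.5th percentile
    p975 = diffs[-k]             # 25th largest: estimates the 97.5th percentile
    # l = xbar - (97.5th percentile), u = xbar - (2.5th percentile)
    return xbar - p975, xbar - p025

# Synthetic data, for illustration only
data_rng = random.Random(42)
data = [data_rng.gauss(80.0, 2.0) for _ in range(31)]

lower, upper = bootstrap_mean_ci(data)
print(f"bootstrap 95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

Unlike the z interval, this construction does not use a known \sigma; the spread of the differences \bar{x}^* - \bar{x} stands in for the unknown distribution of \bar{X} - \mu.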