MAT 162: Statistical Inference I - Week 1: Basic Concepts of Statistical Inference

Statistical Inference: Fundamentals and Definitions

Statistical inference is the process of drawing conclusions about a population based on information obtained from a sample. It bridges the gap between the data actually collected and conclusions about the entire population from which those data came.

Key Statistical Terms and Definitions

To understand the framework of statistical inference, one must define the core units and measurements used in data analysis:

Population: The entire collection of individuals or objects of interest to the researcher. Example: All registered voters in Nigeria, currently estimated at approximately $90$ million people.
Sample: A subset of the population selected for study. Example: $1,000$ voters selected randomly from the total population of voters.
Parameter: A numerical characteristic of a population. A parameter is usually unknown because it is rarely feasible to measure an entire population. Example: The population mean height, denoted by the symbol $\mu$.
Statistic: A numerical characteristic of a sample, calculated directly from the sample data. Example: The sample mean height, denoted by $\bar{x}$.
Sampling Frame: A comprehensive list of all units in the population from which the sample is drawn. Example: The voter registration list provided by the Independent National Electoral Commission (INEC).
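
The difference between a parameter and a statistic can be sketched with a short simulation. The population here is synthetic (heights drawn from an assumed normal distribution with mean 170 cm and standard deviation 10 cm), and the names `population`, `sampling_frame`, and `draw_sample` are hypothetical, chosen only for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Synthetic population: 90,000 heights in cm (assumed values, not real data).
population = [rng.gauss(170, 10) for _ in range(90_000)]

# The sampling frame lists every unit that can be selected (here, indices).
sampling_frame = range(len(population))


def draw_sample(n):
    """Simple random sample of n units, drawn without replacement."""
    chosen = rng.sample(sampling_frame, n)
    return [population[i] for i in chosen]


sample = draw_sample(1_000)

mu = statistics.fmean(population)  # parameter: usually unknown in practice
x_bar = statistics.fmean(sample)   # statistic: computed from the sample

print(f"parameter mu = {mu:.2f}, statistic x_bar = {x_bar:.2f}")
```

In this sketch the parameter can be computed only because the whole population was simulated. In a real study only the statistic is available.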

Comparison of Population and Sample Concepts

Researchers use sample statistics ( $\bar{x}$ , $s$ , $\hat{p}$ ) to estimate population parameters ( $\mu$ , $\sigma$ , $p$ ) because studying an entire population is often impractical for several reasons:

Cost: Measuring every unit in a population is prohibitively expensive.
Time: Collecting data from a whole population takes too long.
Access: Some units in a population may be geographically or logistically difficult to reach.
Destructive Testing: In manufacturing, some tests destroy the item (e.g., testing the lifespan of a lightbulb or the impact resistance of a car). In these cases, testing the whole population would leave nothing to use or sell.
Infinite Populations: Some populations are theoretically infinite, making a full census impossible.

Notation Summary

Description	Population (Parameter)	Sample (Statistic)
Size	$N$	$n$
Mean	$\mu$ (mu)	$\bar{x}$ (x-bar)
Proportion	$p$	$\hat{p}$ (p-hat)
Standard Deviation	$\sigma$ (sigma)	$s$
Variance	$\sigma^2$	$s^2$
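
The population and sample versions of the standard deviation use different denominators: $\sigma$ divides by $N$, while $s$ conventionally divides by $n - 1$. The sketch below uses Python's standard `statistics` module on a small illustrative data set. Which pair of functions is appropriate depends on whether the data are treated as a whole population or as a sample.

```python
import statistics

data = [2, 4, 6]  # small illustrative data set

# Treating the data as a complete population: divide by N.
sigma = statistics.pstdev(data)
sigma_sq = statistics.pvariance(data)

# Treating the same data as a sample: divide by n - 1.
s = statistics.stdev(data)
s_sq = statistics.variance(data)

print(f"sigma = {sigma:.3f}, sigma^2 = {sigma_sq:.3f}")
print(f"s     = {s:.3f}, s^2     = {s_sq:.3f}")
```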

Descriptive vs. Inferential Statistics

Descriptive Statistics: Focuses on describing and summarizing data using tables, graphs, and summary measures. Example: Stating that the average score of a specific group of $50$ students is $65\%$.
Inferential Statistics: Focuses on drawing conclusions beyond the immediate data at hand by using probability and sampling theory. Example: Estimating that the actual university-wide average is $65\% \pm 3\%$.

Estimation Theory: Point and Interval Estimates

One of the primary goals of statistical inference is estimation, which involves using sample data to determine the likely values of population parameters.

Point Estimation

A point estimate is a single numerical value used to estimate a population parameter. While simple and easy to communicate, it provides no measure of uncertainty or reliability.

Population Mean ( $\mu$ ) corresponds to the Sample Mean ( $\bar{x}$ ).
Population Proportion ( $p$ ) corresponds to the Sample Proportion ( $\hat{p}$ ).
Population Variance ( $\sigma^2$ ) corresponds to the Sample Variance ( $s^2$ ).

Example: If a sample of $100$ students produces a mean test score of $72$ , the point estimate for the population mean is $72$ .

Interval Estimation

An interval estimate is a range of values within which the population parameter is expected to lie, accompanied by a specific level of confidence (e.g., $90\%$ , $95\%$ , or $99\%$ ). The structure is:

$\text{Interval Estimate} = \text{Point Estimate} \pm \text{Margin of Error}$

Example: Based on a sample of $100$ students, a researcher might estimate with $95\%$ confidence that the population mean test score is between $68$ and $76$ .
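
One common way to obtain such an interval is the normal approximation, point estimate $\pm\ z \cdot s/\sqrt{n}$. The sketch below assumes that method and a sample standard deviation of $s = 20.4$, which is not given in the example and is chosen only so that the result matches it. The function name `mean_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(x_bar, s, n, confidence=0.95):
    """Normal-approximation interval: x_bar +/- z * s / sqrt(n)."""
    # z is the critical value leaving (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin


# s = 20.4 is an assumed value; the notes do not state it.
low, high = mean_confidence_interval(x_bar=72, s=20.4, n=100)
print(f"95% CI: {low:.1f} to {high:.1f}")
```

Under these assumptions the margin of error is about $4$, giving the interval from $68$ to $76$.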

Comparison Table

Feature	Point Estimate	Interval Estimate
Value	Single number	Range of values
Precision	Unknown	Quantified by the margin of error
Confidence Level	Not provided	Explicitly provided
Likelihood of Accuracy	Very low (rarely exactly correct)	Higher (likely to capture the parameter)

Reliability and Factors Influencing Estimation

The reliability of an estimate depends on three primary factors:

Sample Size ( $n$ ): As the sample size increases, the estimate becomes more reliable and the margin of error decreases.
Variability ( $\sigma$ or $s$ ): Less variability in the data leads to a more reliable estimate.
Sampling Method: Random sampling generally yields more reliable estimates than biased or non-probability sampling methods.

Key takeaway: When the sampling method is sound, increasing the sample size is the most direct way to improve the reliability of a statistical estimate. A larger sample does not correct a biased sampling method.

Sampling Distributions and Variability

A sampling distribution is the probability distribution of a sample statistic (such as $\bar{x}$ ) computed from all possible samples of a fixed size $n$ drawn from a population. This concept is fundamental because statistics vary from sample to sample—a phenomenon called sampling variability.

Illustration of a Sampling Distribution

Consider a population consisting of the set: $\{2, 4, 6\}$.

$N = 3$
$\mu = 4$
$\sigma \approx 1.63$

If we take all possible samples of size $n=2$ (with replacement), the possible samples and their means are:

$(2, 2) \rightarrow \bar{x} = 2$
$(2, 4) \rightarrow \bar{x} = 3$
$(2, 6) \rightarrow \bar{x} = 4$
$(4, 2) \rightarrow \bar{x} = 3$
$(4, 4) \rightarrow \bar{x} = 4$
$(4, 6) \rightarrow \bar{x} = 5$
$(6, 2) \rightarrow \bar{x} = 4$
$(6, 4) \rightarrow \bar{x} = 5$
$(6, 6) \rightarrow \bar{x} = 6$

The mean of these sample means ( $\mu_{\bar{x}}$ ) is $4$ , which is exactly equal to the population mean $\mu$ .
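
The enumeration above can be reproduced in a few lines. This sketch lists every ordered sample of size $2$ drawn with replacement, then compares the standard deviation of the sample means with $\sigma/\sqrt{n}$. The variable names are illustrative only.

```python
from itertools import product
from statistics import fmean, pstdev

population = [2, 4, 6]
n = 2

# All ordered samples of size n drawn with replacement (3**2 = 9 of them).
samples = list(product(population, repeat=n))
sample_means = [fmean(sample) for sample in samples]

mu = fmean(population)
mean_of_means = fmean(sample_means)

# Standard deviation of the sampling distribution, computed two ways.
se_enumerated = pstdev(sample_means)
se_formula = pstdev(population) / n ** 0.5

print(f"mu = {mu}, mean of sample means = {mean_of_means}")
print(f"SE by enumeration = {se_enumerated:.4f}, sigma/sqrt(n) = {se_formula:.4f}")
```

Both calculations of the standard error agree (about $1.15$), which anticipates the formula introduced in the next section.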

The Standard Error of the Mean (SEM)

The Standard Error of the Mean ( $\sigma_{\bar{x}}$ ) measures the variability of sample means around the population mean. It is the standard deviation of the sampling distribution of $\bar{x}$.

Formulas for Standard Error

If the population standard deviation is known: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

If the population standard deviation is unknown, the sample standard deviation is used to estimate the standard error: $s_{\bar{x}} = \frac{s}{\sqrt{n}}$

Effect of Sample Size on Precision ( $\sigma = 10$ )

Sample Size ( $n$ )	Standard Error calculation	Precision Level
$25$	$\frac{10}{\sqrt{25}} = 2.0$	Low precision
$100$	$\frac{10}{\sqrt{100}} = 1.0$	Moderate precision
$400$	$\frac{10}{\sqrt{400}} = 0.5$	High precision

Key takeaway: As sample size increases, the standard error decreases, leading to more precise estimates.
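
The table values follow directly from $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. A minimal sketch, with the helper name `standard_error` chosen here for illustration:

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard error of the mean for a known population sigma."""
    return sigma / sqrt(n)


sigma = 10
for n in (25, 100, 400):
    print(f"n = {n:>3}: standard error = {standard_error(sigma, n)}")
```

Each fourfold increase in $n$ halves the standard error, because $n$ enters the formula through its square root.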

The Central Limit Theorem (CLT)

The Central Limit Theorem holds that for any population with mean $\mu$ and standard deviation $\sigma$ , the sampling distribution of the sample mean $\bar{x}$ will be approximately normally distributed if the sample size $n$ is sufficiently large (typically $n \ge 30$ ), regardless of the shape of the original population distribution.

Sampling Distribution Shapes ( $n \ge 30$ )

If Population is Normal $\rightarrow$ Sampling Distribution is Normal
If Population is Skewed $\rightarrow$ Sampling Distribution is Approximately Normal
If Population is Uniform $\rightarrow$ Sampling Distribution is Approximately Normal
If Population is Bimodal $\rightarrow$ Sampling Distribution is Approximately Normal

Properties of the Sampling Distribution of the Mean

The mean of the distribution equals the population mean: $\mu_{\bar{x}} = \mu$ .
The standard deviation (Standard Error) equals $\frac{\sigma}{\sqrt{n}}$ .
If a population is already normal, the sampling distribution will be normal for any sample size $n$ . Otherwise, the CLT applies as $n$ grows.
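
These properties can be checked empirically with a simulation. The sketch below assumes an exponential population with rate $1$ (strongly right-skewed, with $\mu = 1$ and $\sigma = 1$), a sample size of $n = 30$, and $5{,}000$ repeated samples. All of these choices are illustrative and do not come from the notes.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

rng = random.Random(0)  # fixed seed for reproducibility
n = 30
replications = 5_000
mu, sigma = 1.0, 1.0    # mean and sd of an exponential(rate=1) population

# Draw many samples of size n and record each sample mean.
sample_means = [
    fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(replications)
]

se_theory = sigma / sqrt(n)
mean_of_means = fmean(sample_means)
sd_of_means = pstdev(sample_means)

# Under a normal approximation, about 95% of sample means should fall
# within 1.96 standard errors of mu.
within = sum(abs(m - mu) <= 1.96 * se_theory for m in sample_means)
share_within = within / replications

print(f"mean of sample means: {mean_of_means:.3f} (theory: {mu})")
print(f"sd of sample means:   {sd_of_means:.3f} (theory: {se_theory:.3f})")
print(f"share within 1.96 SE: {share_within:.3f}")
```

Even though the population is far from normal, the simulated sample means centre on $\mu$, spread out by roughly $\sigma/\sqrt{n}$, and fall within $1.96$ standard errors about $95\%$ of the time.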

Practical Exercises and Applications

Exercise 1: Identifying Terms

Scenario 1: A researcher studies the average weight of newborns in Lagos by measuring $500$ babies across $5$ hospitals.
Population: All newborn babies in Lagos.
Sample: The $500$ measured newborns.
Parameter: Average weight of all Lagos newborns.
Statistic: Average weight of the $500$ sampled babies.

Scenario 2: A political poll asks $2,500$ Nigerians about a new policy.
Population: All Nigerians relevant to the policy.
Sample: The $2,500$ interviewed Nigerians.
Parameter: True percentage of all Nigerians supporting the policy.
Statistic: Percentage of the $2,500$ interviewees who support the policy.

Exercise 2: Estimation Problems

Given a sample of $50$ households with an average electricity bill of ₦15,000:
The point estimate for the population mean is ₦15,000.
An interval estimate might reasonably be suggested as ₦14,000 to ₦16,000 (depending on confidence level).
Why increase sample size?: Increasing the sample size ( $n$ ) from $100$ to $400$ quadruples $n$ but only doubles the denominator in the standard error formula ( $\sqrt{400} = 20$ vs $\sqrt{100} = 10$ ), which halves the standard error and doubles the precision of the estimate.
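
Written out as a ratio, and assuming the same $\sigma$ for both sample sizes, the comparison is:

```latex
\frac{\sigma_{\bar{x}}\,(n = 100)}{\sigma_{\bar{x}}\,(n = 400)}
  = \frac{\sigma / \sqrt{100}}{\sigma / \sqrt{400}}
  = \frac{\sqrt{400}}{\sqrt{100}}
  = \frac{20}{10}
  = 2
```

The standard error at $n = 100$ is twice the standard error at $n = 400$, so a fourfold increase in sample size is needed to halve it.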

Exercise 3: Sampling Distribution Calculation

A population has $\mu = 50$ and $\sigma = 12$ . For $n = 36$ :

Mean of Sampling Distribution: $\mu_{\bar{x}} = \mu = 50$ .
Standard Error: $\sigma_{\bar{x}} = \frac{12}{\sqrt{36}} = \frac{12}{6} = 2$ .

Week 1 Assessment and Evaluation

Quiz Questions

The average height of all students in a university is a: Parameter (Answer: b).
A sample statistic used to estimate a population parameter is called a: Point estimate (Answer: b).
As sample size increases, the standard error: Decreases (Answer: b).
The standard deviation of the sampling distribution of the mean is called the: Standard error (Answer: c).
True or False: The mean of the sampling distribution of $\bar{x}$ equals the population mean $\mu$ . (Answer: True).

Homework Assignment

Collect a dataset of at least $20$ values (e.g., prices, heights, or temperatures).
Calculate the sample mean (the point estimate), sample standard deviation ( $s$ ), and standard error of the mean ( $\frac{s}{\sqrt{n}}$ ).
Write a short paragraph explaining what the point estimate means and why it cannot be stated with $100\%$ certainty.
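
A minimal sketch of the calculations, using made-up data: the $20$ temperatures below are hypothetical placeholders and should be replaced with the values actually collected.

```python
from math import sqrt
from statistics import fmean, stdev

# Hypothetical data: 20 daily temperatures in degrees Celsius.
temperatures = [31, 29, 33, 30, 32, 28, 34, 31, 30, 29,
                32, 33, 31, 30, 28, 29, 34, 32, 31, 30]

n = len(temperatures)
x_bar = fmean(temperatures)  # sample mean (point estimate)
s = stdev(temperatures)      # sample standard deviation (n - 1 denominator)
sem = s / sqrt(n)            # standard error of the mean

print(f"n = {n}, x_bar = {x_bar:.2f}, s = {s:.2f}, SEM = {sem:.2f}")
```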