Calculating a sample statistic (e.g., the sample mean) is straightforward once the formula is known.
Different samples lead to different sample statistics.
Extrapolating from Sample to Population:
The main question is how to extrapolate from a sample to the population.
Earnings Data Set Example:
The AD earnings file contains 171 observations for 30-year-old female full-time workers.
The sample mean for earnings was 41,413.
Question: What can be said about the likely range of mean earnings for all 30-year-old female full-time workers in the entire country?
Is the observed sample mean just an artifact of that particular sample?
Introducing Population Concepts:
Introduce the concept of the mean of the population.
Provide a measure of how precisely the sample mean estimates the population mean.
Sampling from a Finite Population
Definition of Population:
Population is defined as all units, people, or objects of interest.
Example: A census that records multiple characteristics for all people in a country.
Population Mean (\mu):
The average of all values in the population.
Population Standard Deviation (\sigma):
The standard deviation for all values in the population.
1880 Census Example:
Population: All US population in 1880 (finite population).
Variable: Age.
Number of observations (N): 50,169,452 people.
Population average age (\mu): 24.13.
Population standard deviation (\sigma): 18.61.
Sample from the Population:
Sample size (n): 25 people randomly selected from the 50 million.
Sample average ($\bar{x}$): 27.84.
Sample standard deviation (s): 20.71.
The histogram for the sample looks different from the population. The mean for the sample is higher than the mean of the population.
Repeated Sampling
Experiment:
Select more samples of size 25 from the population.
For every sample, calculate the sample mean and sample standard deviation.
Repeat this experiment many times.
Using Stata:
Open the file called adhmeans in Stata.
The file contains the results of 100 experiments. In each one, a sample of 25 observations was selected, and the mean and standard deviation of that sample were saved.
Average of the Means:
Calculate the average of the means for all 100 samples.
The average is 23.82, which is much closer to the population mean of 24.13 than the mean of the single sample above (27.84).
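The repeated-sampling experiment can be sketched in Python. This is a minimal sketch, not the procedure used to build the adhmeans file: the population is synthetic (ages drawn uniformly from 0 to 80, a hypothetical stand-in for the census), and the function name, seeds, and population size are illustrative assumptions.

```python
import random
import statistics

def repeated_sample_means(population, sample_size, num_samples, seed=0):
    """Draw num_samples simple random samples and return their means."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]

# Hypothetical population: 100,000 synthetic ages, NOT the 1880 census data.
pop_rng = random.Random(12345)
population = [pop_rng.randint(0, 80) for _ in range(100_000)]

mu = statistics.mean(population)
sigma = statistics.pstdev(population)

# 100 experiments, each a sample of 25 observations, as in the notes.
means = repeated_sample_means(population, sample_size=25, num_samples=100)

print(f"population mean: {mu:.2f}, population sd: {sigma:.2f}")
print(f"average of 100 sample means: {statistics.mean(means):.2f}")
print(f"sd of 100 sample means: {statistics.stdev(means):.2f}")
```

Any one sample mean can land some distance from the population mean, but the average of the 100 means should sit close to it, and the spread of the means should be far smaller than the spread of the individual ages.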
Histogram of Sample Means
Replicating the Histogram in Stata:
Use the command histogram mean, start(15) width(2.5).
Observations:
The histogram is centered around the population mean.
The distribution is roughly symmetric.
The standard deviation of the 100 sample means (3.92) is much smaller than the standard deviation of the individual observations (e.g., s = 20.71 in the single sample above).
Key Observations
Average of Sample Means:
The average of many sample means is close to the population mean.
Variability:
The sample mean is much less variable than the individual underlying observations.
Distribution:
The sample mean is approximately normally distributed, provided the sample is sufficiently large (e.g., at least 30 observations).
Population vs. Sample
Population:
The set of all observations, measurements, or experimental outcomes.
Sample:
A subset selected from the population.
Statistics Convention:
Capitalized letters for random variables (e.g., X).
Lowercase letters for sample realizations (e.g., x).
Example:
The population is {1, 2, 3, 4, 5}.
Suppose a random draw yields the value 4.
The random variable X then has realized value x = 4.
Sample of Size n
Each draw is a realization of a random variable.
Random variable X1 might be the earnings of the first person chosen at random from the population, X2 the earnings of the second person, and so on.
Sample of size n has observed values x1, x2, …, xn that are realizations or outcomes of the random variables X1, X2, …, Xn.
Population Mean
The population mean of random variable X (denoted as \mu) is the probability-weighted average of all values of X in the population.
Expected Value:
\mu is also denoted as the expected value of X (E[X]).
It is the long-run average we would expect if we drew a value of X at random, drew a second value at random, kept drawing in this way, and then averaged all the values drawn.
Finite Population:
For a finite population (e.g., a census), \mu is the average of N population values of X.
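As a small illustration of the probability-weighted average, the sketch below uses the five-value population {1, 2, 3, 4, 5} from the earlier example and assumes each value is equally likely to be drawn. The function name `expected_value` is hypothetical.

```python
from fractions import Fraction

def expected_value(distribution):
    """Probability-weighted average of a {value: probability} mapping."""
    total_probability = sum(distribution.values())
    if total_probability != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(value * prob for value, prob in distribution.items())

# Population {1, 2, 3, 4, 5}; assumption: each value has probability 1/5.
distribution = {value: Fraction(1, 5) for value in range(1, 6)}

mu = expected_value(distribution)
print(mu)  # 3
```

With equal probabilities the weighted average reduces to the ordinary average of the N population values, (1 + 2 + 3 + 4 + 5) / 5 = 3.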
Sample Variance (s^2) is calculated by averaging the squared deviations from the sample mean.
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
The divisor n-1 is called the degrees of freedom.
Sample Standard Deviation (s) is the square root of the variance.
s = \sqrt{s^2}
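The two formulas translate directly into code. The sketch below is illustrative only: the five ages are a hypothetical sample, and `sample_variance` is a made-up helper name. Python's built-in `statistics.variance` uses the same n-1 divisor and can serve as a cross-check.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Hypothetical sample of five ages, for illustration only.
sample = [3, 18, 24, 41, 59]

s2 = sample_variance(sample)
s = math.sqrt(s2)
print(f"s^2 = {s2:.2f}, s = {s:.2f}")

# Cross-check against the standard library (also divides by n - 1).
print(math.isclose(s2, statistics.variance(sample)))
```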
Properties of the Sample Mean
The sample mean, \bar{x}, on average will be very close to the population mean \mu.
The variability of the sample mean is much less than the variability of the individual observations.
The sample mean may be approximately normally distributed.
Population Assumptions
Xi has a common mean \mu: E[Xi] = \mu for all i.
Xi has a common variance \sigma^2: Var(Xi) = \sigma^2 for all i.
Different observations are statistically independent: Xi is statistically independent of Xj, where i ≠ j.
The value of X2 is not influenced by the value taken by X1.
Shorthand Notation: Xi ~ (\mu, \sigma^2), meaning that Xi is distributed with mean \mu and common variance \sigma^2.
These assumptions are met if the data is obtained from simple random sampling.
The first two assumptions state that a common mean and a common standard deviation exist.
Mean of the Sample Mean
Each sample mean, \bar{x}, is a random variable because it is computed from a randomly selected sample.
The population mean of the sample means is the expected value of \bar{x}, E[\bar{x}], and it is equal to the population mean \mu.
E[\bar{x}] = \mu
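This result follows from the linearity of expectation together with the common-mean assumption. A short derivation, treating each draw X_i as a random variable with E[X_i] = \mu:

```latex
E[\bar{x}]
  = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
  = \frac{1}{n}\sum_{i=1}^{n} E[X_i]
  = \frac{1}{n}\cdot n\mu
  = \mu
```

Independence is not needed for this step; only the common mean is used.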
Standard Deviation of the Sample Mean
The population variance of the sample means, \sigma^2_{\bar{x}}, is equal to E[(\bar{x} - \mu_{\bar{x}})^2], and it turns out to be equal to \frac{\sigma^2}{n}.
\sigma^2_{\bar{x}} = \frac{\sigma^2}{n}
The population standard deviation of the sample means is equal to \frac{\sigma}{\sqrt{n}}.
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
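A short derivation of the variance result, using the common-variance and independence assumptions (independence makes every covariance term zero, so the variance of the sum is the sum of the variances):

```latex
\sigma^2_{\bar{x}}
  = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i)
  = \frac{1}{n^2}\cdot n\sigma^2
  = \frac{\sigma^2}{n}
```

Taking the square root gives \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}.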
The sample mean is less variable than the underlying data, as we saw with the census data.
The variability of the sample mean decreases as the sample size increases--larger samples lead to greater precision.
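To see how precision improves with sample size, the sketch below evaluates \frac{\sigma}{\sqrt{n}} for the census value \sigma = 18.61. The function name and the sample sizes other than 25 are illustrative assumptions.

```python
import math

def sd_of_sample_mean(sigma, n):
    """Population standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 18.61  # population standard deviation of age, 1880 census

for n in (25, 100, 400):
    print(f"n = {n:>3}: sd of sample mean = {sd_of_sample_mean(sigma, n):.3f}")
```

With n = 25 the result is 3.722, in line with the standard deviation of 3.92 observed across the 100 sample means. Because of the square root, quadrupling the sample size only halves the standard deviation of the sample mean.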
Standard Error of the Sample Mean
When the population standard deviation (\sigma) is unknown, replace \sigma^2 with the sample variance (s^2) to get the standard error of the sample mean.
The estimated variance of \bar{x} is calculated as s^2_{\bar{x}} = \frac{s^2}{n}.
Taking the square root gives the estimated standard deviation of \bar{x}.
The standard error of \bar{x} is \frac{s}{\sqrt{n}} = \frac{1}{\sqrt{n}} \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}.
In general, the term standard error means estimated standard deviation.
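The standard error can be computed either from summary statistics or from raw data. The sketch below does both: the first calculation uses the census sample values reported earlier (s = 20.71, n = 25), while the raw-data sample and the helper name `standard_error` are hypothetical.

```python
import math
import statistics

def standard_error(sample):
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation, n - 1 divisor
    return s / math.sqrt(n)

# From summary statistics: the census sample had s = 20.71 and n = 25.
se_census_sample = 20.71 / math.sqrt(25)
print(f"{se_census_sample:.3f}")  # 4.142

# From raw data: a hypothetical sample, for illustration only.
sample = [3, 18, 24, 41, 59]
print(f"{standard_error(sample):.3f}")
```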
Normal Distribution and the Central Limit Theorem (CLT)
\bar{x} is distributed with mean \mu, because the mean of the sample means is always equal to the population mean, and with variance \frac{\sigma^2}{n}.
If the sample is selected randomly and the sample size is large (n -> ∞), then the sampling distribution of \bar{x} will be approximately normal.
Define the standardized sample mean:
Z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}
Z has a mean of zero and a standard deviation of one.
The CLT states that if the sample is a simple random sample and n is large, then Z is approximately normally distributed with a mean of zero and a standard deviation of one (N(0,1)).
Often, a sample size of at least 30 observations gives a good approximation.
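A small simulation can illustrate the CLT. This is a sketch under stated assumptions: the population is a hypothetical exponential distribution with mean 1 and standard deviation 1 (chosen because it is strongly skewed), and the sample size of 30, the 5,000 replications, and the function name are illustrative choices.

```python
import math
import random
import statistics

def standardized_means(draw, mu, sigma, n, num_samples, seed=0):
    """Return Z = (x_bar - mu) / (sigma / sqrt(n)) for repeated samples."""
    rng = random.Random(seed)
    z_values = []
    for _ in range(num_samples):
        x_bar = statistics.fmean(draw(rng) for _ in range(n))
        z_values.append((x_bar - mu) / (sigma / math.sqrt(n)))
    return z_values

# Hypothetical skewed population: exponential with mean 1 and sd 1.
z = standardized_means(
    draw=lambda rng: rng.expovariate(1.0),
    mu=1.0, sigma=1.0, n=30, num_samples=5000,
)

share_within = sum(abs(value) <= 1.96 for value in z) / len(z)
print(f"mean of Z: {statistics.fmean(z):.3f}")
print(f"sd of Z: {statistics.stdev(z):.3f}")
print(f"share with |Z| <= 1.96: {share_within:.3f}")
```

If the normal approximation is good, the mean of Z should be near 0, its standard deviation near 1, and roughly 95% of the values should fall within 1.96 of zero, even though the individual observations are far from normal.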