Lecture Notes Flashcards

Sample Mean

Review of Last Time:
- Calculating sample statistics (e.g., sample means) is straightforward with the formula.
- Different samples lead to different sample statistics.
Extrapolating from Sample to Population:
- The main question is how to extrapolate from a sample to the population.
Earnings Data Set Example:
- The AD earnings file contains 171 observations for 30-year-old female full-time workers.
- The sample mean for earnings was 41,413.
- Question: What can be said about the likely range of mean earnings for all 30-year-old female full-time workers in the entire country?
- Is the observed sample mean just an artifact of that particular sample?
Introducing Population Concepts:
- Introduce the concept of the mean of the population.
- Provide a measure of how precisely the sample mean estimates the population mean.

Sampling from a Finite Population

Definition of Population:
- Population is defined as all units, people, or objects of interest.
- Example: A census that records multiple characteristics for all people in a country.
Population Mean (\mu):
- The average of all values in the population.
Population Standard Deviation (\sigma):
- The standard deviation for all values in the population.
1880 Census Example:
- Population: All US population in 1880 (finite population).
- Variable: Age.
- Number of observations (N): 50,169,452 people.
- Population average age (\mu): 24.13.
- Population standard deviation (\sigma): 18.61.
Sample from the Population:
- Sample size (n): 25 people randomly selected from the 50 million.
- Sample average ($\bar{x}$): 27.84.
- Sample standard deviation (s): 20.71.
- The histogram for the sample looks different from the population. The mean for the sample is higher than the mean of the population.

Repeated Sampling

Experiment:
- Select more samples of size 25 from the population.
- For every sample, calculate the sample mean and sample standard deviation.
- Repeat this experiment many times.
Using Stata:
- Open the file called adhmeans in Stata.
- The file contains 100 experiments; in each, a sample of 25 observations was selected and its mean and standard deviation were saved.
Average of the Means:
- Calculate the average of the means for all 100 samples.
- The average is 23.82, which is much closer to the population mean of 24.13 than the mean of the single sample above (27.84).

Histogram of Sample Means

Replicating the Histogram in Stata:
- Use the command histogram mean, start(15) width(2.5).
Observations:
- The histogram is centered around the population mean.
- The distribution is roughly symmetric.
- The standard deviation of the 100 sample means (3.92) is much smaller than the standard deviation of individual observations (e.g., s = 20.71 in the single sample above).
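The repeated-sampling experiment can be imitated without the census file. The sketch below is only an illustration: it builds a synthetic population of ages (an assumption, not the 1880 census data), draws 100 samples of size 25, and compares the spread of the sample means with the spread of the individual values. The names `draw_sample_means` and `population`, the population size, and the seed are all hypothetical.

```python
import random
import statistics

def draw_sample_means(population, n, reps, rng):
    """Draw `reps` simple random samples of size `n` and return their means."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(reps)]

rng = random.Random(12345)  # fixed seed so the run is repeatable

# Synthetic stand-in for a finite population of ages (NOT the census data).
population = [rng.randint(0, 80) for _ in range(100_000)]

mu = statistics.mean(population)       # population mean
sigma = statistics.pstdev(population)  # population standard deviation

means = draw_sample_means(population, n=25, reps=100, rng=rng)
mean_of_means = statistics.mean(means)  # should land near mu
sd_of_means = statistics.stdev(means)   # should be far below sigma

print(f"population mean {mu:.2f}, sd {sigma:.2f}")
print(f"average of 100 sample means {mean_of_means:.2f}")
print(f"sd of 100 sample means {sd_of_means:.2f}")
```

With any reasonable seed, the average of the means should sit close to the population mean, and the standard deviation of the means should be a fraction of the population standard deviation.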

Key Observations

Average of Sample Means:
- The average of many sample means is close to the population mean.
Variability:
- The sample mean is much less variable than the individual underlying observations.
Distribution:
- The sample mean is approximately normally distributed, provided the sample is sufficiently large (e.g., at least 30 observations).

Population vs. Sample

Population:
- The set of all observations, measurements, or experimental outcomes.
Sample:
- A subset selected from the population.
Statistics Convention:
- Capitalized letters for random variables (e.g., X).
- Lowercase letters for sample realizations (e.g., x).
Example:
- Population: 1, 2, 3, 4, and 5.
- Draw one value at random; suppose the value drawn is 4.
- Random variable X has realized value x = 4.

Sample of Size n

Each draw is a realization of a random variable.
Random variable X_1 may be the earnings of the first person chosen at random from the population, X_2 the earnings of the second person, and so on.
A sample of size n has observed values x_1, x_2, …, x_n that are realizations (outcomes) of the random variables X_1, X_2, …, X_n.

Population Mean

The population mean of random variable X (denoted as \mu) is the probability-weighted average of all values of X in the population.
Expected Value:
- \mu is also denoted as the expected value of X (E[X]).
- It is the long-run average we would expect if we drew a value of X at random, drew a second value at random, kept on drawing, and then averaged the values.
Finite Population:
- For a finite population (e.g., a census), \mu is the average of N population values of X.
- \mu = E[X] = \frac{x_1^* + x_2^* + \dots + x_N^*}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i^*
General Case:
- For an experiment leading to possible values x_1^*, x_2^*, … with probabilities P(X = x_1^*), P(X = x_2^*), …
- \mu = E[X] = P(X = x_1^*) \cdot x_1^* + P(X = x_2^*) \cdot x_2^* + \dots = \sum_x P(X = x) \cdot x

Coin Toss Example

X is the number of heads obtained in two coin tosses.
Possible values for X: 0, 1, 2.
Probabilities: P(X=0) = 0.25, P(X=1) = 0.5, P(X=2) = 0.25.
\mu = E[X] = 0 \cdot 0.25 + 1 \cdot 0.5 + 2 \cdot 0.25 = 1
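The probability-weighted average can be checked with a few lines of code. This is a minimal sketch; the helper name `expected_value` and the dictionary layout (value mapped to probability) are illustrative choices, and exact fractions are used so the result carries no rounding error.

```python
from fractions import Fraction

def expected_value(dist):
    """Probability-weighted average of a distribution given as {value: probability}."""
    if sum(dist.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in dist.items())

# Number of heads in two fair coin tosses.
coin = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
mu = expected_value(coin)
print(mu)  # 1
```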

Sample Mean

The sample mean is the average of n sample realizations.
Finite Population:
- Population: 1, 2, 3, 4, and 5. \mu = \frac{1+2+3+4+5}{5} = 3
- Sample of size 3 (with replacement): 2, 5, 4.
- \bar{x} = \frac{2+5+4}{3} = 3.67
In general, the sample mean is not necessarily close to the population mean.
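To see how much the mean of a size-3 sample can move around, the sketch below recomputes the sample mean from the notes and then draws a few more samples with replacement from the same population. The seed and the number of extra samples are arbitrary assumptions.

```python
import random
import statistics

population = [1, 2, 3, 4, 5]
mu = statistics.mean(population)           # population mean, 3

x_bar_notes = statistics.mean([2, 5, 4])   # the sample from the notes, 11/3

rng = random.Random(0)  # arbitrary seed, only for repeatability
# Five more samples of size 3, drawn with replacement.
samples = [rng.choices(population, k=3) for _ in range(5)]
sample_means = [statistics.mean(s) for s in samples]

print("population mean:", mu)
print("sample mean from the notes:", round(x_bar_notes, 2))
for s, m in zip(samples, sample_means):
    print(s, round(m, 2))
```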

Coin Toss Example (Sample)

Population mean: \mu = 1
Toss two coins five times, record the number of heads: 1, 0, 0, 2, 1.
\bar{x} = \frac{1+0+0+2+1}{5} = 0.8
In general, the mean of a single sample need not be close to the population mean.
The sample x_1, x_2, …, x_n is a realization of the random variables X_1, X_2, …, X_n.
Viewed before the sample is drawn, the sample mean is therefore itself a random variable:
\bar{x} = \frac{X_1 + X_2 + \dots + X_n}{n}

Population Variance and Standard Deviation

Population Variance (\sigma^2 or Var(X)) is the probability-weighted average of squared deviations from the mean.
\sigma^2 = E[(X - \mu)^2] = \sum P(X=x) \cdot (x - \mu)^2
Population Standard Deviation (\sigma) is the square root of the variance.

Coin Toss Example (Variance)

\sigma^2 = 0.25 \cdot (0-1)^2 + 0.5 \cdot (1-1)^2 + 0.25 \cdot (2-1)^2 = 0.5
\sigma = \sqrt{0.5} \approx 0.707
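The same arithmetic can be written as a small function. A minimal sketch, assuming the distribution is supplied as a dictionary from value to probability (the name `variance` is illustrative):

```python
import math

def variance(dist):
    """Probability-weighted average of squared deviations from the mean."""
    mu = sum(p * x for x, p in dist.items())
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

# Number of heads in two fair coin tosses.
coin = {0: 0.25, 1: 0.5, 2: 0.25}
sigma2 = variance(coin)
sigma = math.sqrt(sigma2)
print(sigma2, round(sigma, 3))  # 0.5 0.707
```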

Sample Variance and Standard Deviation

Sample Variance (s^2) is calculated by averaging squared deviations from the sample mean, using n-1 rather than n as the divisor.
- s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
- The divisor n-1 is called the degrees of freedom.
Sample Standard Deviation (s) is the square root of the variance.
- s = \sqrt{s^2}
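As a worked check, the sketch below applies the n-1 formula to the five coin-toss counts recorded earlier (1, 0, 0, 2, 1) and compares the result with Python's `statistics.variance`, which uses the same divisor. The helper name `sample_variance` is illustrative.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

heads = [1, 0, 0, 2, 1]   # number of heads in each of five double tosses
s2 = sample_variance(heads)
s = math.sqrt(s2)

print(round(s2, 3), round(s, 3))
print(statistics.variance(heads))  # library value, same n - 1 divisor
```

Here the squared deviations from the sample mean 0.8 sum to 2.8, so s^2 = 2.8 / 4 = 0.7.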

Properties of the Sample Mean

On average, the sample mean \bar{x} equals the population mean \mu.
The variability of the sample mean is much less than the variability of the individual observations.
The sample mean is approximately normally distributed when the sample is sufficiently large.

Population Assumptions

X_i has a common mean \mu: E[X_i] = \mu for all i.
X_i has a common variance \sigma^2: Var(X_i) = \sigma^2 for all i.
Different observations are statistically independent: X_i is statistically independent of X_j for i ≠ j.
- The value of X_2 is not influenced by the value taken by X_1.
Shorthand Notation: X_i \sim (\mu, \sigma^2), i.e., X_i is distributed with mean \mu and variance \sigma^2.
These assumptions are met if the data are obtained by simple random sampling.
The first two assumptions state that a common mean and a common standard deviation exist.

Mean of the Sample Mean

Each sample mean is calculated from randomly drawn values.
Each sample mean, \bar{x}, is therefore itself a realization of a random variable, because it is based on a randomly selected sample.
The population mean of the sample means is the expected value of \bar{x}, E[\bar{x}], and it is equal to the population mean \mu.
E[\bar{x}] = \mu
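This result follows from the linearity of expectation together with the common-mean assumption E[X_i] = \mu. A short derivation, treating \bar{x} as the average of the random variables X_1, …, X_n:

```latex
E[\bar{x}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
           = \frac{1}{n}\sum_{i=1}^{n} E[X_i]
           = \frac{1}{n} \cdot n\mu
           = \mu
```

Independence is not needed for this step; only the common mean is used.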

Standard Deviation of the Sample Mean

The population variance of the sample means, \sigma^2_{\bar{x}}, is equal to E[(\bar{x} - \mu_{\bar{x}})^2], and it turns out to equal \frac{\sigma^2}{n}.
- \sigma^2_{\bar{x}} = \frac{\sigma^2}{n}
The population standard deviation of the sample means is equal to \frac{\sigma}{\sqrt{n}}.
- \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
The sample mean is less variable than the underlying data, as we saw with the census data.
The variability of the sample mean decreases as the sample size increases--larger samples lead to greater precision.
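Plugging the census figures into \frac{\sigma}{\sqrt{n}} gives a quick check against the experiment: with \sigma = 18.61 and n = 25 the formula gives about 3.72, close to the 3.92 observed across the 100 sample means. The sketch below does this arithmetic and repeats it for two larger sample sizes, which are hypothetical values chosen only to show the effect of n.

```python
import math

def sd_of_mean(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 18.61  # population standard deviation of age, from the census example

# n = 25 is the sample size in the notes; 100 and 2500 are illustrative.
for n in (25, 100, 2500):
    print(n, round(sd_of_mean(sigma, n), 3))
```

Quadrupling the sample size halves the standard deviation of the sample mean, because n enters through its square root.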

Standard Error of the Sample Mean

When the population standard deviation (\sigma) is unknown, replace \sigma^2 with the sample variance (s^2) to estimate the variance of the sample mean.
The estimated variance of \bar{x} is:
- s^2_{\bar{x}} = \frac{s^2}{n} = \frac{1}{n} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
Taking the square root gives the estimated standard deviation, called the standard error of \bar{x}:
- SE_{\bar{x}} = \frac{s}{\sqrt{n}} = \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} (x_i - \bar{x})^2}
In general, the term standard error means estimated standard deviation.
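A minimal sketch of the calculation, done two ways: from summary statistics (the census sample above, with s = 20.71 and n = 25) and from raw data (the five coin-toss counts recorded earlier). The helper name `standard_error` is illustrative.

```python
import math
import statistics

def standard_error(xs):
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    return statistics.stdev(xs) / math.sqrt(len(xs))

# From summary statistics: census sample with s = 20.71 and n = 25.
se_census = 20.71 / math.sqrt(25)

# From raw data: number of heads in five double coin tosses.
se_coin = standard_error([1, 0, 0, 2, 1])

print(round(se_census, 3), round(se_coin, 3))
```

For the census sample this gives a standard error of about 4.14 years, compared with 3.72 when the population \sigma = 18.61 is used.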

Normal Distribution and the Central Limit Theorem (CLT)

\bar{x} is distributed with mean \mu (the mean of the sample means always equals the population mean) and variance \frac{\sigma^2}{n}.
If the sample is selected randomly and the sample size is large (n -> ∞), then the sampling distribution of \bar{x} will be approximately normal.
Standardize \bar{x} by subtracting its mean and dividing by its standard deviation:
Z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}
Z has mean zero and standard deviation one for any n.
The CLT states that if the sample is a simple random sample and n is large, then Z is approximately standard normal, N(0,1).
In practice, a sample size of at least 30 observations often gives a good approximation.
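The approximation can be explored by simulation. The sketch below is an illustration under stated assumptions: the population is taken to be Exponential(1), a skewed distribution with \mu = 1 and \sigma = 1 chosen here only for convenience (it does not come from the notes). Samples of size 30 are drawn repeatedly, each sample mean is standardized, and the share of Z values within ±1.96 is compared with the 0.95 expected under N(0,1).

```python
import math
import random
import statistics

def z_scores(n, reps, rng):
    """Standardized sample means from an Exponential(1) population (mu = sigma = 1)."""
    mu, sigma = 1.0, 1.0
    zs = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        x_bar = sum(sample) / n
        zs.append((x_bar - mu) / (sigma / math.sqrt(n)))
    return zs

rng = random.Random(2024)  # arbitrary seed, only for repeatability
zs = z_scores(n=30, reps=5000, rng=rng)

z_mean = statistics.mean(zs)
z_sd = statistics.stdev(zs)
share = sum(abs(z) < 1.96 for z in zs) / len(zs)

print(f"mean of Z {z_mean:.3f}, sd of Z {z_sd:.3f}")
print(f"share of |Z| < 1.96: {share:.3f}")
```

Even though individual draws are strongly skewed, the share should come out near 0.95, which is the sense in which n = 30 gives a good approximation here.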