Lecture Notes Flashcards
Sample Mean
- Review of Last Time:
- Calculating sample statistics (e.g., sample means) is straightforward with the formula.
- Different samples lead to different sample statistics.
- Extrapolating from Sample to Population:
- The main question is how to extrapolate from a sample to the population.
- Earnings Data Set Example:
- The AD earnings file contains 171 observations for 30-year-old female full-time workers.
- The sample mean for earnings was 41,413.
- Question: What can be said about the likely range of mean earnings for all 30-year-old female full-time workers in the entire country?
- Is the observed sample mean just an artifact of that particular sample?
- Introducing Population Concepts:
- Introduce the concept of the mean of the population.
- Provide a measure of how precisely the sample mean estimates the population mean.
Sampling from a Finite Population
- Definition of Population:
- Population is defined as all units, people, or objects of interest.
- Example: A census that records multiple characteristics for all people in a country.
- Population Mean (\mu):
- The average of all values in the population.
- Population Standard Deviation (\sigma):
- The standard deviation for all values in the population.
- 1880 Census Example:
- Population: All US population in 1880 (finite population).
- Variable: Age.
- Number of observations (N): 50,169,452 people.
- Population average age (\mu): 24.13.
- Population standard deviation (\sigma): 18.61.
- Sample from the Population:
- Sample size (n): 25 people randomly selected from the 50 million.
- Sample average (\bar{x}): 27.84.
- Sample standard deviation (s): 20.71.
- The histogram for the sample looks different from that of the population, and the sample mean (27.84) is higher than the population mean (24.13).
Repeated Sampling
- Experiment:
- Select more samples of size 25 from the population.
- For every sample, calculate the sample mean and sample standard deviation.
- Repeat this experiment many times.
- Using Stata:
- Open the file called adhmeans in Stata.
- The file contains the results of 100 experiments: each time, a sample of 25 observations was selected and its mean and standard deviation were saved.
- Average of the Means:
- Calculate the average of the means for all 100 samples.
- The average is 23.82, which is much closer to the population mean of 24.13 than the mean of the single sample drawn earlier (27.84).
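The repeated-sampling experiment can be mimicked outside Stata. The following is a minimal Python sketch, not the course's workflow: the population is a synthetic stand-in (right-skewed values generated here), not the 1880 census, and the names `population` and `draw_sample_means` are hypothetical.

```python
import random
import statistics

def draw_sample_means(population, n, reps, seed=0):
    """Draw `reps` simple random samples of size n and return their means."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(reps)]

# Synthetic stand-in population (NOT the 1880 census): 100,000 right-skewed "ages".
gen = random.Random(42)
population = [min(gen.expovariate(1 / 24), 100) for _ in range(100_000)]

mu = statistics.mean(population)       # population mean
sigma = statistics.pstdev(population)  # population standard deviation (divides by N)

means = draw_sample_means(population, n=25, reps=100)
print(f"population mean      : {mu:.2f}")
print(f"average of 100 means : {statistics.mean(means):.2f}")
print(f"population sd        : {sigma:.2f}")
print(f"sd of 100 means      : {statistics.stdev(means):.2f}")
```

With this setup the average of the 100 means should land near the population mean, and the standard deviation of the means should be far smaller than the population standard deviation.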
Histogram of Sample Means
- Replicating the Histogram in Stata:
- Use the command: histogram mean, start(15) width(2.5)
- Observations:
- The histogram is centered around the population mean.
- The distribution is roughly symmetric.
- The standard deviation of the 100 sample means (3.92) is much smaller than the standard deviation of the individual observations (e.g., 20.71 in the first sample).
Key Observations
- Average of Sample Means:
- The average of many sample means is close to the population mean.
- Variability:
- The sample mean is much less variable than the individual underlying observations.
- Distribution:
- The sample mean is approximately normally distributed, provided the sample is sufficiently large (e.g., at least 30 observations).
Population vs. Sample
- Population:
- The set of all observations, measurements, or experimental outcomes.
- Sample:
- A subset selected from the population.
- Statistics Convention:
- Capitalized letters for random variables (e.g., X).
- Lowercase letters for sample realizations (e.g., x).
- Example:
- The population is 1, 2, 3, 4, and 5.
- Draw a value of four randomly.
- Random variable X has realized value x = 4.
Sample of Size n
- Each draw is a realization of a random variable.
- Random variable X1 may be the earnings of the first person chosen at random from the population, X2 the earnings of the second person, and so on.
- Sample of size n has observed values x1, x2, …, xn that are realizations or outcomes of the random variables X1, X2, …, Xn.
Population Mean
- The population mean of random variable X (denoted as \mu) is the probability-weighted average of all values of X in the population.
- Expected Value:
- \mu is also denoted as the expected value of X (E[X]).
- It is the long-run average we would expect if we drew a value of X at random, then a second value, and kept drawing and averaging the values.
- Finite Population:
- For a finite population (e.g., a census), \mu is the average of N population values of X.
- \mu = E[X] = \frac{x_1^* + x_2^* + … + x_N^*}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i^*
- General Case:
- For an experiment leading to possible values x_1^*, x_2^*, … with probabilities P(X=x_1^*), P(X=x_2^*), …
- \mu = E[X] = P(X=x_1^*) \cdot x_1^* + P(X=x_2^*) \cdot x_2^* + … = \sum P(X=x) \cdot x
Coin Toss Example
- X is the number of heads obtained in two coin tosses.
- Possible values for X: 0, 1, 2.
- Probabilities: P(X=0) = 0.25, P(X=1) = 0.5, P(X=2) = 0.25.
- \mu = E[X] = 0 \cdot 0.25 + 1 \cdot 0.5 + 2 \cdot 0.25 = 1
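The probability-weighted average can be written as a short function. This is an illustrative Python sketch; the function name `expected_value` and the dictionary representation of the probabilities are assumptions made here, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def expected_value(pmf):
    """Probability-weighted average: sum of P(X = x) * x over all x."""
    assert sum(pmf.values()) == 1, "probabilities must sum to 1"
    return sum(p * x for x, p in pmf.items())

# Number of heads in two fair coin tosses.
two_tosses = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
print(expected_value(two_tosses))  # 1

# A finite population 1, 2, 3, 4, 5 is the special case of equal probabilities 1/N.
uniform = {x: Fraction(1, 5) for x in range(1, 6)}
print(expected_value(uniform))  # 3
```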
Sample Mean
- The sample mean is the average of n sample realizations.
- Finite Population:
- Population: 1, 2, 3, 4, and 5. \mu = \frac{1+2+3+4+5}{5} = 3
- Sample of size 3 (with replacement): 2, 5, 4.
- \bar{x} = \frac{2+5+4}{3} \approx 3.67
- In general, the sample mean is not necessarily close to the population mean.
Coin Toss Example (Sample)
- Population mean: \mu = 1
- Toss two coins five times, record the number of heads: 1, 0, 0, 2, 1.
- \bar{x} = \frac{1+0+0+2+1}{5} = 0.8
- In general, the mean of a single sample need not be close to the population mean.
- The sample x1, x2, …, xn is a realization of the random variables X1, X2, …, Xn.
- The sample mean \bar{x} = \frac{x1 + x2 + … + xn}{n} is therefore a realization of the random variable \bar{X} = \frac{X1 + X2 + … + Xn}{n}.
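To see how the sample mean behaves as more realizations are averaged, the coin-toss experiment can be simulated. This is a hedged Python sketch: the helper `heads_in_two_tosses`, the seed, and the sample sizes are choices made here for illustration, and the printed values depend on the seed.

```python
import random

def heads_in_two_tosses(rng):
    """One realization of X: number of heads in two fair coin tosses."""
    return rng.randint(0, 1) + rng.randint(0, 1)

rng = random.Random(1)
for n in (5, 500, 50_000):
    draws = [heads_in_two_tosses(rng) for _ in range(n)]
    # Sample mean of n realizations; compare with the population mean of 1.
    print(n, sum(draws) / n)
```

A mean based on five draws can be well away from 1, while the mean of many draws settles near it.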
Population Variance and Standard Deviation
- Population Variance (\sigma^2 or Var(X)) is the probability-weighted average of squared deviations from the mean.
- \sigma^2 = E[(X - \mu)^2] = \sum P(X=x) \cdot (x - \mu)^2
- Population Standard Deviation (\sigma) is the square root of the variance.
Coin Toss Example (Variance)
- \sigma^2 = 0.25 \cdot (0-1)^2 + 0.5 \cdot (1-1)^2 + 0.25 \cdot (2-1)^2 = 0.5
- \sigma = \sqrt{0.5} \approx 0.707
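The same calculation can be checked numerically. This is a minimal Python sketch; the function name `population_variance` and the dictionary of probabilities are assumptions introduced for illustration.

```python
import math

def population_variance(pmf):
    """Probability-weighted average of squared deviations from the mean."""
    mu = sum(p * x for x, p in pmf.items())
    return sum(p * (x - mu) ** 2 for x, p in pmf.items())

two_tosses = {0: 0.25, 1: 0.5, 2: 0.25}
var = population_variance(two_tosses)
print(var)             # 0.5
print(math.sqrt(var))  # approximately 0.707
```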
Sample Variance and Standard Deviation
- Sample Variance (s^2) is calculated by averaging squared deviations from the sample mean, using n-1 rather than n as the divisor.
- s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2
- The divisor n-1 is called the degrees of freedom.
- Sample Standard Deviation (s) is the square root of the variance.
- s = \sqrt{s^2}
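As an illustration, the formulas can be applied to the sample 2, 5, 4 used in the earlier finite-population example. This is a Python sketch with a hypothetical helper `sample_variance`; the final check against the standard library's `statistics.variance` confirms that both use the n-1 divisor.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

sample = [2, 5, 4]
s2 = sample_variance(sample)
s = math.sqrt(s2)
print(round(s2, 4), round(s, 4))  # 2.3333 1.5275

# The standard library uses the same n - 1 divisor.
assert math.isclose(s2, statistics.variance(sample))
```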
Properties of the Sample Mean
- On average over repeated samples, the sample mean \bar{x} equals the population mean \mu.
- The variability of the sample mean is much less than the variability of the individual observations.
- The sample mean is approximately normally distributed when the sample size is sufficiently large.
Population Assumptions
- Xi has a common mean \mu: E[Xi] = \mu for all i.
- Xi has a common variance \sigma^2: Var(Xi) = \sigma^2 for all i.
- Different observations are statistically independent: Xi is statistically independent of Xj, where i ≠ j.
- The value of X2 is not influenced by the value taken by X1.
- Shorthand Notation: Xi is distributed with mean \mu and common variance \sigma^2, i.e., Xi ~ (\mu, \sigma^2).
- These assumptions are met if the data are obtained by simple random sampling.
- The first two assumptions state that a common mean and a common variance exist.
Mean of the Sample Mean
- Each sample mean, \bar{x}, is a realization of a random variable because it is based on a randomly selected sample.
- The population mean of the sample means is the expected value of \bar{x}, E[\bar{x}], and it is equal to the population mean \mu.
- E[\bar{x}] = \mu
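For a small population this result can be verified exactly by listing every possible sample. The Python sketch below assumes the population 1, 2, 3, 4, 5 and samples of size 3 drawn with replacement, as in the earlier example; the variable names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5]
n = 3

# Every ordered sample of size n drawn with replacement is equally likely.
sample_means = [Fraction(sum(s), n) for s in product(population, repeat=n)]

mu = Fraction(sum(population), len(population))
mean_of_means = sum(sample_means) / len(sample_means)
print(len(sample_means), mu, mean_of_means)  # 125 3 3
```

Individual sample means range from 1 to 5, but their average over all 125 equally likely samples is exactly the population mean.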
Standard Deviation of the Sample Mean
- The population variance of the sample means, \sigma^2_{\bar{x}}, is equal to E[(\bar{x} - \mu_{\bar{x}})^2], where \mu_{\bar{x}} = E[\bar{x}] = \mu. Under the common-variance and independence assumptions, it is equal to \frac{\sigma^2}{n}.
- \sigma^2_{\bar{x}} = \frac{\sigma^2}{n}
- The population standard deviation of the sample means is equal to \frac{\sigma}{\sqrt{n}}.
- \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
- The sample mean is less variable than the underlying data, as we saw with the census data.
- The variability of the sample mean decreases as the sample size increases--larger samples lead to greater precision.
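Plugging in the census value \sigma = 18.61 shows how precision improves with sample size. This is a small Python sketch; the function name `sd_of_sample_mean` and the sample sizes other than 25 are chosen here for illustration.

```python
import math

def sd_of_sample_mean(sigma, n):
    """Population standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 18.61  # population sd of age in the 1880 census example
for n in (25, 100, 625):
    print(n, round(sd_of_sample_mean(sigma, n), 3))
# 25 3.722
# 100 1.861
# 625 0.744
```

For n = 25 the theoretical value of 3.722 is close to the standard deviation of 3.92 observed across the 100 sample means, and quadrupling the sample size halves the standard deviation.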
Standard Error of the Sample Mean
- When the population standard deviation (\sigma) is unknown, replace \sigma^2 with the sample variance (s^2) to get the standard error of the sample mean.
- The estimated variance of \bar{x} is calculated as s^2_{\bar{x}} = \frac{s^2}{n}.
- s^2_{\bar{x}} = \frac{s^2}{n} = \frac{1}{n} \cdot \frac{1}{n-1} \sum (x_i - \bar{x})^2
- Taking the square root gives the estimated standard deviation of \bar{x}.
- The standard error of \bar{x} is this estimated standard deviation, \frac{s}{\sqrt{n}}.
- SE_{\bar{x}} = \frac{s}{\sqrt{n}} = \sqrt{\frac{1}{n(n-1)} \sum (x_i - \bar{x})^2}
- In general, the term standard error means estimated standard deviation.
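The standard error can be computed from raw data or from summary statistics. The Python sketch below uses a hypothetical helper `standard_error`, applies it to the five coin-toss counts from the earlier example, and then repeats the calculation with the reported census sample values (s = 20.71, n = 25).

```python
import math
import statistics

def standard_error(xs):
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    return statistics.stdev(xs) / math.sqrt(len(xs))

heads = [1, 0, 0, 2, 1]  # number of heads in five rounds of two tosses
print(round(standard_error(heads), 4))  # 0.3742

# From summary statistics alone (census sample: s = 20.71, n = 25).
print(round(20.71 / math.sqrt(25), 3))  # 4.142
```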
Normal Distribution and the Central Limit Theorem (CLT)
- \bar{x} is distributed with mean \mu (the mean of the sample means is always equal to the population mean) and variance \frac{\sigma^2}{n}.
- If the sample is selected randomly and the sample size is large (n -> ∞), then the sampling distribution of \bar{x} will be approximately normal.
- If the sample is a simple random sample and the sample size n is large, the CLT states that the standardized sample mean Z has approximately the standard normal distribution.
- Z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}. For any n, Z has mean zero and standard deviation one.
- The CLT says that if n is large, then Z will be approximately normally distributed with mean zero and standard deviation one (N(0,1)).
- As a rule of thumb, a sample size of at least 30 observations often gives a good approximation.
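The CLT can be illustrated by simulation. The Python sketch below makes several assumptions not in the notes: the population is Exponential(1), a skewed distribution with \mu = 1 and \sigma = 1, the sample size is 30, and the function name `standardized_means` is hypothetical. Exact printed values depend on the seed.

```python
import math
import random
import statistics

def standardized_means(n, reps, seed=0):
    """Simulate Z = (xbar - mu) / (sigma / sqrt(n)) for an Exponential(1) population."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and sd of the Exponential(1) distribution
    zs = []
    for _ in range(reps):
        xbar = statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        zs.append((xbar - mu) / (sigma / math.sqrt(n)))
    return zs

zs = standardized_means(n=30, reps=10_000)
share = sum(abs(z) <= 1.96 for z in zs) / len(zs)
print(f"mean of Z: {statistics.fmean(zs):.3f}")
print(f"sd of Z  : {statistics.pstdev(zs):.3f}")
print(f"share with |Z| <= 1.96: {share:.3f}")  # about 0.95 under N(0, 1)
```

Even though individual draws are strongly skewed, the standardized means should have mean near 0, standard deviation near 1, and roughly 95% of values within 1.96 of zero.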