Lecture Notes Flashcards

Sample Mean

  • Review of Last Time:
    • Calculating sample statistics (e.g., sample means) is straightforward with the formula.
    • Different samples lead to different sample statistics.
  • Extrapolating from Sample to Population:
    • The main question is how to extrapolate from a sample to the population.
  • Earnings Data Set Example:
    • The AD earnings file contains 171 observations for 30-year-old female full-time workers.
    • The sample mean for earnings was 41,413.
    • Question: What can be said about the likely range of mean earnings for all 30-year-old female full-time workers in the entire country?
    • Is the observed sample mean just an artifact of that particular sample?
  • Introducing Population Concepts:
    • Introduce the concept of the mean of the population.
    • Provide a measure of how precisely the sample mean estimates the population mean.

Sampling from a Finite Population

  • Definition of Population:
    • Population is defined as all units, people, or objects of interest.
    • Example: A census that records multiple characteristics for all people in a country.
  • Population Mean (\mu):
    • The average of all values in the population.
  • Population Standard Deviation (\sigma):
    • The standard deviation for all values in the population.
  • 1880 Census Example:
    • Population: All US population in 1880 (finite population).
    • Variable: Age.
    • Number of observations (N): 50,169,452 people.
    • Population average age (\mu): 24.13.
    • Population standard deviation (\sigma): 18.61.
  • Sample from the Population:
    • Sample size (n): 25 people randomly selected from the 50 million.
    • Sample average (\bar{x}): 27.84.
    • Sample standard deviation (s): 20.71.
    • The histogram for the sample looks different from the population. The mean for the sample is higher than the mean of the population.

Repeated Sampling

  • Experiment:
    • Select more samples of size 25 from the population.
    • For every sample, calculate the sample mean and sample standard deviation.
    • Repeat this experiment many times.
  • Using Stata:
    • Open the file called adhmeans in Stata.
    • The file contains the results of 100 experiments; in each one, a sample of 25 observations was selected and its mean and standard deviation were saved.
  • Average of the Means:
    • Calculate the average of the means for all 100 samples.
    • The average is 23.82, which is much closer to the population mean of 24.13 than the mean of the single sample drawn earlier (27.84).

Histogram of Sample Means

  • Replicating the Histogram in Stata:
    • Use the command histogram mean, start(15) width(2.5).
  • Observations:
    • The histogram is centered around the population mean.
    • The distribution is roughly symmetric.
    • The standard deviation of the 100 sample means (3.92) is much smaller than the standard deviation of individual observations (e.g., 20.71 in the first sample).

Key Observations

  • Average of Sample Means:
    • The average of many sample means is close to the population mean.
  • Variability:
    • The sample mean is much less variable than the individual underlying observations.
  • Distribution:
    • The sample mean is approximately normally distributed, provided the sample is sufficiently large (e.g., at least 30 observations).
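The three observations above can be checked with a small simulation. The sketch below is only illustrative: it does not use the 1880 census file, but builds a hypothetical synthetic population of right-skewed "ages" (the gamma parameters, population size, and seed are assumptions chosen for this example) and then repeats the experiment of drawing 100 samples of size 25.

```python
import random
import statistics

rng = random.Random(12345)  # fixed seed so the run is repeatable (arbitrary choice)

# Hypothetical synthetic population of right-skewed "ages"; NOT the census data.
population = [rng.gammavariate(1.7, 14.0) for _ in range(50_000)]
mu = statistics.fmean(population)        # population mean
sigma = statistics.pstdev(population)    # population standard deviation (divisor N)

# Repeat the experiment: 100 simple random samples of size 25.
n, experiments = 25, 100
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(experiments)]

avg_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"population mean  = {mu:.2f}, population sd = {sigma:.2f}")
print(f"average of means = {avg_of_means:.2f}")
print(f"sd of means      = {sd_of_means:.2f}  (sigma / sqrt(n) = {sigma / n ** 0.5:.2f})")
```

With these assumed settings, the average of the sample means should land near the population mean, and the spread of the sample means should be far smaller than the spread of the individual values.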

Population vs. Sample

  • Population:
    • The set of all observations, measurements, or experimental outcomes.
  • Sample:
    • A subset selected from the population.
  • Statistics Convention:
    • Capitalized letters for random variables (e.g., X).
    • Lowercase letters for sample realizations (e.g., x).
  • Example:
    • The population is 1, 2, 3, 4, and 5.
    • Draw a value of four randomly.
    • Random variable X has realized value x = 4.

Sample of Size n

  • Each draw is a realization of a random variable.
  • Random variable X1 may be the earnings of the first person chosen at random from the population, X2 the earnings of the second person, and so on.
  • Sample of size n has observed values x1, x2, …, xn that are realizations or outcomes of the random variables X1, X2, …, Xn.

Population Mean

  • The population mean of random variable X (denoted as \mu) is the probability-weighted average of all values of X in the population.
  • Expected Value:
    • \mu is also denoted as the expected value of X (E[X]).
    • It is the long-run average we would expect if we drew a value of X at random, then a second value, kept on drawing, and averaged all the values drawn.
  • Finite Population:
    • For a finite population (e.g., a census), \mu is the average of N population values of X.
    • \mu = E[X] = \frac{x_1^* + x_2^* + … + x_N^*}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i^*
  • General Case:
    • For an experiment leading to possible values x_1^*, x_2^*, … with probabilities P(X=x_1^*), P(X=x_2^*), …
    • \mu = E[X] = P(X=x_1^*) \cdot x_1^* + P(X=x_2^*) \cdot x_2^* + … = \sum P(X=x) \cdot x

Coin Toss Example

  • X is the number of heads obtained in two coin tosses.
  • Possible values for X: 0, 1, 2.
  • Probabilities: P(X=0) = 0.25, P(X=1) = 0.5, P(X=2) = 0.25.
  • \mu = E[X] = 0 \cdot 0.25 + 1 \cdot 0.5 + 2 \cdot 0.25 = 1
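The probability-weighted average in the coin toss calculation above can be written as a short function. This is a minimal sketch; the names `pmf` and `expected_value` are illustrative and not part of the lecture material.

```python
# Probability mass function of X = number of heads in two fair coin tosses.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

def expected_value(pmf):
    """Probability-weighted average of the possible values."""
    return sum(p * x for x, p in pmf.items())

mu = expected_value(pmf)
print(mu)  # 1.0
```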

Sample Mean

  • The sample mean is the average of n sample realizations.
  • Finite Population:
    • Population: 1, 2, 3, 4, and 5. \mu = \frac{1+2+3+4+5}{5} = 3
    • Sample of size 3 (with replacement): 2, 5, 4.
    • \bar{x} = \frac{2+5+4}{3} \approx 3.67
  • In general, the sample mean is not necessarily close to the population mean.

Coin Toss Example (Sample)

  • Population mean: \mu = 1
  • Toss two coins five times, record the number of heads: 1, 0, 0, 2, 1.
  • \bar{x} = \frac{1+0+0+2+1}{5} = 0.8
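  • In general, the sample mean from a single sample will not equal the population mean, and it need not be close to it.
  • The sample x1, x2, …, xn is a realization of the random variables X1, X2, …, Xn.
  • The sample mean \bar{x} = \frac{x_1 + x_2 + … + x_n}{n} is therefore itself a realization of the random variable \frac{X_1 + X_2 + … + X_n}{n}.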
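As a quick check of the two sample means worked out above, the following sketch recomputes them with Python's standard library. The variable names are illustrative only.

```python
import statistics

finite_sample = [2, 5, 4]        # sample of size 3 from the population 1, 2, 3, 4, 5
coin_sample = [1, 0, 0, 2, 1]    # heads counts from five rounds of two tosses

xbar_finite = statistics.mean(finite_sample)   # 11/3, about 3.67 (population mean is 3)
xbar_coin = statistics.mean(coin_sample)       # 4/5 = 0.8 (population mean is 1)

print(round(xbar_finite, 2), xbar_coin)  # 3.67 0.8
```

Neither sample mean equals its population mean, which is the point of the two examples.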

Population Variance and Standard Deviation

  • Population Variance (\sigma^2 or Var(X)) is the probability-weighted average of squared deviations from the mean.
  • \sigma^2 = E[(X - \mu)^2] = \sum P(X=x) \cdot (x - \mu)^2
  • Population Standard Deviation (\sigma) is the square root of the variance.

Coin Toss Example (Variance)

  • \sigma^2 = 0.25 \cdot (0-1)^2 + 0.5 \cdot (1-1)^2 + 0.25 \cdot (2-1)^2 = 0.5
  • \sigma = \sqrt{0.5} \approx 0.707
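The variance calculation above follows directly from the definition. A minimal sketch, with illustrative variable names:

```python
import math

pmf = {0: 0.25, 1: 0.5, 2: 0.25}   # number of heads in two fair coin tosses

mu = sum(p * x for x, p in pmf.items())                    # population mean
variance = sum(p * (x - mu) ** 2 for x, p in pmf.items())  # E[(X - mu)^2]
sigma = math.sqrt(variance)

print(mu, variance, round(sigma, 3))  # 1.0 0.5 0.707
```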

Sample Variance and Standard Deviation

  • Sample Variance (s^2) is calculated by averaging squared deviations from the sample mean \bar{x}, using the divisor n-1 rather than n.
    • s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2
    • The divisor n-1 is called the degrees of freedom.
  • Sample Standard Deviation (s) is the square root of the variance.
    • s = \sqrt{s^2}
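To make the n-1 divisor concrete, the sketch below applies the formula to the sample 2, 5, 4 used earlier in these notes and compares the result with `statistics.variance`, which uses the same convention. Reusing that sample here is a choice made for illustration.

```python
import math
import statistics

sample = [2, 5, 4]
n = len(sample)
xbar = sum(sample) / n

# Sample variance with the n - 1 divisor (degrees of freedom).
s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
s = math.sqrt(s2)

# The standard library uses the same n - 1 convention.
assert math.isclose(s2, statistics.variance(sample))

print(round(s2, 3), round(s, 3))  # 2.333 1.528
```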

Properties of the Sample Mean

  • Across repeated samples, the sample mean \bar{x} is centered on the population mean \mu.
  • The variability of the sample mean is much less than the variability of the individual observations.
  • The sample mean is approximately normally distributed when the sample size is large enough.

Population Assumptions

  • Xi has a common mean \mu: E[Xi] = \mu for all i.
  • Xi has a common variance \sigma^2: Var(Xi) = \sigma^2 for all i.
  • Different observations are statistically independent: Xi is statistically independent of Xj for i ≠ j.
    • The value taken by X2 is not influenced by the value taken by X1.
  • Shorthand Notation: Xi \sim (\mu, \sigma^2), read as "Xi is distributed with mean \mu and variance \sigma^2."
  • These assumptions are met if the data are obtained by simple random sampling.
  • The first two assumptions state that the common mean and the common standard deviation exist.

Mean of the Sample Mean

  • Each sample mean \bar{x} is a realization of a random variable, because it is calculated from a randomly selected sample.
  • The population mean of the sample means is the expected value of \bar{x}, E[\bar{x}], and it is equal to the population mean \mu.
  • E[\bar{x}] = \mu
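One way to see why this holds uses only the common-mean assumption and the linearity of expectation. In this sketch, \bar{X} denotes the sample mean viewed as a random variable (a notational assumption, following the capital-letter convention for random variables):

```latex
\begin{aligned}
E[\bar{X}] &= E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
            = \frac{1}{n}\sum_{i=1}^{n} E[X_i]
            = \frac{1}{n}\cdot n\mu
            = \mu
\end{aligned}
```

Independence of the observations is not needed for this step; it matters for the variance of the sample mean.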

Standard Deviation of the Sample Mean

  • The population variance of the sample means, \sigma^2_{\bar{x}}, is equal to E[(\bar{x} - \mu_{\bar{x}})^2], and it turns out to equal \frac{\sigma^2}{n}.
    • \sigma^2_{\bar{x}} = \frac{\sigma^2}{n}
  • The population standard deviation of the sample means is equal to \frac{\sigma}{\sqrt{n}}.
    • \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
  • The sample mean is less variable than the underlying data, as we saw with the census data.
  • The variability of the sample mean decreases as the sample size increases, so larger samples lead to greater precision.
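For a very small population, both results (mean \mu and variance \frac{\sigma^2}{n} for the sample mean) can be verified exactly by listing every possible sample. The sketch below does this for the population 1, 2, 3, 4, 5 with samples of size 2 drawn with replacement; the sample size of 2 is an assumption chosen to keep the enumeration small, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5]
N = len(population)
n = 2  # sample size (kept small so every sample can be listed)

mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N

# All N**n equally likely ordered samples drawn with replacement.
means = [Fraction(sum(s), n) for s in product(population, repeat=n)]

mean_of_means = sum(means) / len(means)
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / len(means)

print(mu, sigma2)                    # 3 2
print(mean_of_means, var_of_means)   # 3 1
```

The variance of the sample means is exactly \sigma^2 / n = 2 / 2 = 1, half the variance of a single draw.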

Standard Error of the Sample Mean

  • When the population standard deviation (\sigma) is unknown, replace \sigma^2 with the sample variance (s^2) to estimate the variance of \bar{x}.
  • The estimated variance of \bar{x} is s^2_{\bar{x}} = \frac{s^2}{n}.
    • s^2_{\bar{x}} = \frac{s^2}{n} = \frac{1}{n} \cdot \frac{1}{n-1} \sum (x_i - \bar{x})^2
  • Taking the square root gives the estimated standard deviation of \bar{x}.
  • This estimate, \frac{s}{\sqrt{n}}, is the standard error of \bar{x}.
    • SE_{\bar{x}} = \frac{s}{\sqrt{n}} = \sqrt{\frac{1}{n(n-1)} \sum (x_i - \bar{x})^2}
  • In general, the term standard error means estimated standard deviation.
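As a worked example, the sketch below computes the standard error for the sample of 25 people from the 1880 census example in these notes (s = 20.71), and compares it with the true standard deviation of the sample mean, which can be computed here only because \sigma = 18.61 is known for the full census. The function name `standard_error` is illustrative.

```python
import math

def standard_error(s, n):
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)

# Values quoted in the notes for the 1880 census example.
n = 25
s = 20.71        # sample standard deviation
sigma = 18.61    # population standard deviation (known here, usually not)

se = standard_error(s, n)             # uses the sample estimate s
sd_of_mean = sigma / math.sqrt(n)     # true value, available only because sigma is known

print(round(se, 3), round(sd_of_mean, 3))  # 4.142 3.722
```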

Normal Distribution and the Central Limit Theorem (CLT)

  • \bar{x} is distributed with mean \mu (the mean of the sample means always equals the population mean) and variance \frac{\sigma^2}{n}.
  • If the sample is selected randomly and the sample size is large (n -> ∞), then the sampling distribution of \bar{x] will be approximately normal.
  • Standardize \bar{x} by subtracting its mean and dividing by its standard deviation:
  • Z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}, which has a mean of zero and a standard deviation of one.
  • The CLT states that if the sample is a simple random sample and n is large, then Z is approximately standard normal, N(0,1).
  • A sample size of at least 30 observations often gives a good approximation.
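The CLT statement can be explored with a simulation. The sketch below reuses the two-coin-toss variable from these notes (\mu = 1, \sigma = \sqrt{0.5}); the sample size of 30, the 2,000 replications, and the seed are assumptions chosen for illustration. It standardizes each sample mean and reports how often Z falls within ±1.96, which happens about 95% of the time for a standard normal variable.

```python
import math
import random
import statistics

rng = random.Random(2024)   # arbitrary seed for repeatability

mu, sigma = 1.0, math.sqrt(0.5)   # two-coin-toss population mean and sd
n, replications = 30, 2000        # illustrative choices

def heads_in_two_tosses():
    """One draw of X: number of heads in two fair coin tosses."""
    return rng.randint(0, 1) + rng.randint(0, 1)

z_values = []
for _ in range(replications):
    xbar = statistics.fmean(heads_in_two_tosses() for _ in range(n))
    z_values.append((xbar - mu) / (sigma / math.sqrt(n)))

share_within = sum(abs(z) <= 1.96 for z in z_values) / replications

print(f"mean of Z = {statistics.fmean(z_values):.3f}")
print(f"sd of Z   = {statistics.stdev(z_values):.3f}")
print(f"share with |Z| <= 1.96 = {share_within:.3f}  (about 0.95 under N(0,1))")
```

Because X takes only the values 0, 1, and 2, the underlying population is far from normal, yet the standardized sample mean should already behave much like N(0,1) at n = 30.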