
Lecture Notes Flashcards

Population vs. Sample

  • Population: the entire group of interest, e.g., all people in the United States (as in a census).
  • Sample: A small subset of the population.
  • Population parameters:
    • \mu (population mean)
    • \sigma (population standard deviation)
  • Sample statistics:
    • \bar{x} (sample mean)
    • s (sample standard deviation)
  • The sample mean for one sample is likely different from the population mean.

Sampling Distribution

  • If multiple random samples are selected and their means are averaged, the mean of the sample means will be close to the population mean.
  • The standard deviation of the sampling distribution is smaller than the population standard deviation by a factor of \sqrt{n}.
  • \bar{x} is a random variable.
    • The mean of \bar{x} is equal to \mu.
    • The standard deviation (standard error) of \bar{x} is \frac{\sigma}{\sqrt{n}}.
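
A quick numerical illustration with assumed values (not from the lecture): if \sigma = 10 and n = 25, the standard error of \bar{x} is \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{25}} = 2, one fifth of the spread of a single observation.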

Central Limit Theorem (CLT)

  • If the sample size is large enough (n >= 30), the sampling distribution of \bar{x} will be approximately normal.
  • Using \bar{x} to infer \mu requires knowledge of \sigma.
  • \bar{x} is approximately normally distributed with mean \mu and standard deviation \frac{\sigma}{\sqrt{n}}.
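
A minimal sketch with made-up numbers (\mu = 50, \sigma = 10, n = 36): the CLT lets us compute probabilities for \bar{x} using the standard normal CDF, which in Stata is the normal() function.

display 1 - normal((52 - 50)/(10/sqrt(36)))   // P(\bar{x} > 52) under the assumed numbers, about 0.115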

T-Statistic

  • In practice, \sigma is often unknown.
  • If s (sample standard deviation) is used in place of \sigma, it adds noise.
  • The test statistic has a t-distribution, which has fatter tails than the normal distribution.
  • t = \frac{\bar{x} - \mu}{s / \sqrt{n}}
  • t is distributed as t with n-1 degrees of freedom.
  • The sample t-statistic is calculated as \frac{\bar{x} - \mu}{\text{standard error of } \bar{x}}.
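
A minimal sketch with assumed numbers (\bar{x} = 105, hypothesized \mu = 100, s = 15, n = 25); ttail() is Stata's right-tail probability for the t distribution:

display (105 - 100)/(15/sqrt(25))                // t-statistic, about 1.67, with n - 1 = 24 degrees of freedom
display 2*ttail(24, (105 - 100)/(15/sqrt(25)))   // two-sided p-value, about 0.11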

Summary for the Sample Mean

  • Individual x_i have a common mean \mu and variance \sigma^2.
  • The average \bar{x} of n observations x_i has mean \mu and variance \frac{\sigma^2}{n}.
  • The standardized statistic z, calculated as z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}, has a mean of 0 and a variance of 1.
  • z converges to a standard normal distribution as the sample size n goes to infinity, by the Central Limit Theorem (CLT).
  • Replacing unknown \sigma with s leads to a t-distribution with n-1 degrees of freedom.
  • The sample t-statistic, calculated as \frac{\bar{x} - \mu}{s / \sqrt{n}}, is a realization of a random variable that is approximately t-distributed with n-1 degrees of freedom.
  • For normally distributed data, this last result is exact.
  • For non-normal data, the approximation improves as n gets larger (usually n > 30 is sufficient).

Point Estimation

  • Using sample data to make inferences about population parameters.
  • Estimator: A method or formula.
  • Estimate: A single number obtained by applying the estimator.
  • Point Estimate: A single estimate of \mu.

Desired Properties for Estimators

  • Unbiased
  • Consistent
  • Efficient

Unbiased Estimator

  • A statistic whose expected value equals the population parameter.
  • E(\bar{x}) = \mu

Consistent Estimator

  • A statistic that gets arbitrarily close to the population parameter, with probability approaching one, as the sample size gets larger.
  • \bar{x} converges to \mu as n goes to infinity.

Efficient Estimator

  • Among unbiased estimators, the one with the smallest variance.

Analogy: Shooting at a Target

  • Estimating a parameter is like shooting at a target.
  • Each shot represents selecting a sample and calculating a sample statistic.

Unbiased

  • Shots are centered around the bullseye (the true parameter value).
  • E(\bar{x}) = \mu
  • E(s^2) = \sigma^2 (sample variance is an unbiased estimator for population variance).

Biased

  • Shots are off-center.

Consistent

  • As n gets larger, shots get closer to the target.
  • \bar{x} gets very close to \mu when n increases.

Efficient

  • Smaller variance means more efficient.

Example

  • Method 1: Select a sample and use the first observation as the estimate of \mu.
    • E(x_1) = \mu
    • \sigma^2(x_1) = \sigma^2
  • Method 2: Select a sample, calculate the average of that sample.
    • E(\bar{x}) = \mu
    • \sigma^2(\bar{x}) = \frac{\sigma^2}{n}
  • Both are unbiased, but Method 2 is more efficient.
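
A quick illustration with made-up values: if \sigma^2 = 100 and n = 25, Method 1 has variance \sigma^2(x_1) = 100 while Method 2 has variance \sigma^2(\bar{x}) = \frac{100}{25} = 4, so the sample average is far more tightly concentrated around \mu.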

Best Estimators (Best Linear Unbiased Estimators)

  • Unbiased, Consistent, and Efficient.
  • Consistency is generally more important than unbiasedness or efficiency.
  • Consistency: Estimator does a better job with more information.

Representative Samples

  • The first condition is to have a random sample.
  • This is the best way to ensure that the sample is representative.

Example

  • Estimating coffee consumption by standing at Starbucks is not representative.
  • Surveying Golf Digest readers is not representative of all Americans' golfing habits.
  • If every 20th person is sampled and all respond, the sample is likely representative.
  • If only 10% of people respond, the sample is likely not representative.
  • If every person is sampled, but only 10% respond, the sample is not representative due to self-selection.
  • Sample size: Larger is better.

Simulating random sampling in Stata:

  • First, open the do-file editor.
  • Always start the do-file with clear.

The following code simulates 30 coin tosses in Stata:

clear
set obs 30
generate x = runiform() > 0.5
summarize x
return list
  • set obs 30 sets the number of observations in your data to 30.
  • generate x = runiform() > 0.5 creates a variable x by drawing 30 values from the standard uniform distribution (any number between zero and one) and comparing each draw to 0.5. The expression is a logical condition: if the draw is greater than 0.5, Stata stores a one; otherwise it stores a zero. The result is a series of zeros and ones, the same pattern you would expect from a coin toss.
  • return list displays the values saved by the previous command (here, summarize).
  • These values are not saved forever; they are overwritten by the next command, so to keep them inside a program they must be saved with the return scalar command.
capture program drop one_sample      // remove any earlier definition of the program
program one_sample, rclass           // rclass lets the program save results with return
    drop _all
    set obs 30
    generate x = runiform() > 0.5
    summarize x
    return scalar x_bar = r(mean)    // save the sample mean as r(x_bar)
end

simulate x_bar = r(x_bar), seed(10101) reps(5): one_sample
  • The above code defines the program one_sample in the do-file and then, with simulate, runs it five times, collecting r(x_bar) from each replication into a new variable x_bar.
  • program tells Stata that you are defining a program; once the program's code is finished, type end so Stata knows the definition is over.
  • The simulate command repeats the program many, many times; this is known as a Monte Carlo simulation.
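
As a follow-up sketch (not from the lecture), increasing reps() makes the simulated sampling distribution of \bar{x} visible; simulate, summarize, and histogram are standard Stata commands, but the number of replications is just an illustrative choice:

simulate x_bar = r(x_bar), seed(10101) reps(1000): one_sample   // run one_sample 1,000 times
summarize x_bar           // mean should be close to 0.5, sd close to 0.5/sqrt(30), about 0.09
histogram x_bar, normal   // roughly bell-shaped, as the CLT predicts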

Confidence Intervals

  • Alternative to point estimation.

  • Gives a range of values for the population mean.

  • 100(1 - \alpha)% confidence interval for a population mean:

  • Formula: \bar{x} \pm t_{n-1, \frac{\alpha}{2}} * \frac{s}{\sqrt{n}}

  • \bar{x}: Sample mean

  • t_{n-1, \frac{\alpha}{2}}: Value from the t-distribution with n-1 degrees of freedom and alpha/2 in the right tail.

  • \frac{s}{\sqrt{n}}: Standard error of the mean

  • In Stata, the distribution tables are built in, and display simply prints a value on the screen. The ttail() function takes a value on the horizontal axis and gives you the probability in the right tail; here we need to go the other way around, from a probability to the value on the horizontal axis. That inverse is invttail(): ttail() gives a probability, invttail() gives the critical value, i.e., it inverts the t distribution.

display invttail(170, 0.025)

Example using 'earnings' dataset

  • Suppose that it is important to estimate the average earnings for the entire population of 30-year-old full-time employees.

  • Open the 'earnings' dataset in Stata. It contains 171 observations on 30-year-old female full-time workers, a sample drawn from the population of all 30-year-old female full-time workers in the United States.

  • Summarize the earnings variable in Stata.

  • \bar{x} = 41413

  • s = 25527

  • n = 171

  • Standard error = \frac{25527}{\sqrt{171}} = 1952

  • Construct a 95% confidence interval:

    • 100(1 - \alpha) = 95
    • \alpha = 0.05
    • \alpha / 2 = 0.025
  • Find t-value:

    • t(170, 0.025) = 1.974 (using invttail in Stata)
  • Calculate confidence interval:

$\bar{x} \pm t_{170, 0.025} \cdot \frac{s}{\sqrt{n}} = 41,413 \pm 1.974 * 1952$

  • Confidence interval: [37560, 45266]
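
A quick check in Stata, reproducing the hand calculation with the rounded summary numbers above (display and invttail() are built-in):

display 41413 - invttail(170, 0.025)*1952    // lower bound, about 37560
display 41413 + invttail(170, 0.025)*1952    // upper bound, about 45266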

To return the 95% confidence interval by default:

mean earnings

Interpretation

  • We are 95% confident that the average wage of all 30-year-old full-time female employees in the United States is between $37,560 and $45,266.
  • Confidence is in the procedure, not the specific estimate.
  • 95% of confidence intervals constructed using this procedure would include the population mean.
  • 5% of confidence intervals would not include the population mean.