Lecture Notes Flashcards

World Development Indicators Data

  • The World Development Indicators (WDI) database, maintained by the World Bank, contains many time series for individual countries and country groups.

  • Comparing health expenditures in the United States with those in OECD countries requires health expenditure data for both.

  • Countries can be selected individually or as groups (e.g., OECD members).

  • The data includes current health expenditures per capita in US dollars.

  • Data is available from 1990 onwards; the selection can be narrowed, for example to the last 20 years.

Downloading and Preparing Data

  • Data can be downloaded in CSV format.

  • Advanced options allow customization of the download, such as specifying how missing values are represented.

  • Missing values can be set to blank instead of the default two dots to facilitate importing into programs.

  • The variable format can be set to codes only to ensure compatibility with statistical software like Stata.

  • Notes at the end of the CSV file should be deleted as they cannot be read by statistical programs.

Importing and Reshaping Data in Stata

  • The CSV file is imported into Stata using the import delimited command.

  • Data often needs to be reshaped from a wide format (years in columns) to a long format (years in a single column) for analysis.

  • This involves using the reshape long command.

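  • The data is reshaped by specifying the stub of the variables to reshape (YR, the common prefix of the year columns) together with the i() and j() options: i() names the variable that identifies each unit (the country code) and j() names the new variable that will hold the year.

  • The reshaped data has the years in one column (the j() variable) and the corresponding health expenditures in another (YR).

  • The variable YR, which now holds the expenditure values, can be renamed to indicate health expenditure for the United States using the rename command.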
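
The import and reshape steps above can be sketched as follows. This is a minimal sketch, not the lecture's exact code: the file name wdi_us.csv, the identifier CountryCode, and the new names year and hexp_us are assumptions, and the year columns are assumed to arrive as YR2000, YR2001, and so on.

```stata
* Hypothetical file and variable names: adjust them to the actual download.
* case(preserve) keeps the upper-case YR prefix instead of lower-casing names.
import delimited using "wdi_us.csv", varnames(1) case(preserve) clear

* Wide to long: the columns YR2000, YR2001, ... become one variable YR
* plus a new variable year that holds the numeric suffix.
reshape long YR, i(CountryCode) j(year)

* After the reshape, YR holds the expenditure values, so give it a telling name.
rename YR hexp_us
```

If the year columns are imported under different names, the stub passed to reshape long has to be changed to match.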

Converting Year Variable to Date Format

  • The year variable, initially an integer, must be converted to a string before being converted to a date format.

  • The tostring command is used to convert the year to a string variable.

  • The generate command, combined with the date() function, creates a date variable from the year string.

  • The format command then sets the display format of the date variable so that only the year is shown.
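
The three conversion steps might look like this in a do-file. It is a sketch that assumes the year variable is called year; the variable name date follows the notes.

```stata
* year is assumed to be the integer year variable (hypothetical name).
* Step 1: integer to string, for example 2005 becomes "2005".
tostring year, replace

* Step 2: string to Stata daily date; the mask "Y" says the string holds a year only.
generate date = date(year, "Y")

* Step 3: show only the four-digit year when the date is displayed.
format date %tdCCYY
```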

Processing OECD Data and Merging Data Sets

  • The same steps are repeated for OECD countries, reshaping and renaming variables accordingly.

  • The two datasets (US and OECD) are then merged using a one-to-one merge based on the year.

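  • Sorting the data by the merge variable (date) before merging is good practice; the older merge syntax required it, whereas merge 1:1 in current versions of Stata sorts the data automatically if needed.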

  • After merging, the data can be graphed to compare health expenditures in the US and OECD countries over time.
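
A possible version of the merge and the comparison graph is sketched below. The file names us_health and oecd_health and the variable names hexp_us and hexp_oecd are assumptions; each file is taken to hold date plus one expenditure variable.

```stata
* Hypothetical file and variable names.
use us_health, clear
sort date

* One-to-one merge: each date appears at most once in either data set.
merge 1:1 date using oecd_health

* _merge == 3 marks the years found in both data sets.
keep if _merge == 3
drop _merge

* Plot both expenditure series against the year.
twoway line hexp_us hexp_oecd date
```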

Sample vs. Population

  • Population: The entire set of observations of interest.

  • Sample: A subset of the population.

  • When describing a population, random variables are considered, denoted by X, with the population mean denoted by the Greek letter \mu (mu).

  • If X can take discrete values X_1, X_2, … with respective probabilities P(X = X_1), P(X = X_2), …, then the population mean is calculated as:

    \mu = E[X] = \sum_i P(X = X_i) \cdot X_i

  • The sum of probabilities must equal 1.

  • The sample mean, denoted as \bar{X}, is calculated as the sum of all observations in the sample divided by the number of observations:

    \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}

  • Sample variance (S^2) is calculated as the sum of squared deviations from the sample mean, divided by n-1:
    S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}

  • Population variance (\sigma^2) is calculated as the weighted average of squared deviations from the population mean, using probabilities as weights.
    \sigma^2 = \sum_i P(X = X_i) \cdot (X_i - \mu)^2

  • Sample standard deviation (S) is the square root of the sample variance, and population standard deviation (\sigma) is the square root of the population variance.
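
As a small worked illustration of the sample formulas, take a made-up sample of n = 3 observations, 2, 4 and 9 (these numbers are not from the lecture):

```latex
\begin{align*}
\bar{X} &= \frac{2 + 4 + 9}{3} = 5 \\
S^2 &= \frac{(2-5)^2 + (4-5)^2 + (9-5)^2}{3-1} = \frac{9 + 1 + 16}{2} = 13 \\
S &= \sqrt{13} \approx 3.61
\end{align*}
```

Dividing by n instead of n-1 would give 26/3, about 8.67, which is smaller than the sample variance of 13.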

Coin Toss Example

  • Consider an experiment with two coin tosses.

  • Possible outcomes: head-head, head-tail, tail-head, tail-tail.

  • Let X be the number of heads (head = 1, tail = 0); the four outcomes then give X = 2, 1, 1, 0.

  • Distinct values of X: 0, 1, 2

  • Corresponding probabilities for a fair coin: 1/4, 1/2, 1/4

  • The mean is E[X] = (0 \cdot 1/4) + (1 \cdot 1/2) + (2 \cdot 1/4) = 1.

  • The variance is:
    Var(X) = (0-1)^2 \cdot 1/4 + (1-1)^2 \cdot 1/2 + (2-1)^2 \cdot 1/4 = 1/4 + 0 + 1/4 = 1/2
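
Two quick checks on this example, written out as a sketch: the probabilities sum to 1, and the variance can be confirmed with the shortcut Var(X) = E[X^2] - \mu^2, which also gives the standard deviation.

```latex
\begin{align*}
\tfrac{1}{4} + \tfrac{1}{2} + \tfrac{1}{4} &= 1 \\
E[X^2] &= 0^2 \cdot \tfrac{1}{4} + 1^2 \cdot \tfrac{1}{2} + 2^2 \cdot \tfrac{1}{4} = \tfrac{3}{2} \\
Var(X) &= E[X^2] - \mu^2 = \tfrac{3}{2} - 1^2 = \tfrac{1}{2} \\
\sigma &= \sqrt{\tfrac{1}{2}} \approx 0.707
\end{align*}
```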

Sample Mean, Standard Deviation and Central Limit Theorem

  • Census data from 1880, treated as the population, can be used to illustrate the sample mean, its standard deviation, and the central limit theorem.

  • Population mean (\mu) = 24.13 and standard deviation (\sigma) = 18.61.

  • A sample of 25 observations may yield a different mean (\bar{X}), e.g., 27.84, and standard deviation, e.g., 20.71.

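  • The mean of the sampling distribution of the sample mean equals the population mean:
    \mu_{\bar{X}} = \mu

  • The standard deviation of the sample mean (the standard error) equals the population standard deviation divided by the square root of the sample size:
    \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}

  • Central Limit Theorem: The distribution of the sample mean approaches a normal distribution as the sample size increases, even if the population is not normally distributed.

  • The standardized sample mean therefore follows approximately a standard normal distribution when n is large:
    Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}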
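
Plugging in the census figures quoted above (\mu = 24.13, \sigma = 18.61, n = 25 and a sample mean of 27.84) gives a worked example of the standardization:

```latex
\begin{align*}
\sigma_{\bar{X}} &= \frac{18.61}{\sqrt{25}} = \frac{18.61}{5} = 3.722 \\
Z &= \frac{27.84 - 24.13}{3.722} = \frac{3.71}{3.722} \approx 1.00
\end{align*}
```

In other words, this particular sample mean lies about one standard error above the population mean.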

Replicating Coin Toss in Stata

  • To simulate coin tosses, Stata can generate random numbers between 0 and 1 using the uniform() function (called runiform() in current versions of Stata).

  • The command set obs 30 sets the number of observations to 30.

  • scalar true_mean = 0.5 defines a scalar holding the true mean of a fair coin toss.

  • generate x = uniform() > true_mean generates a variable x with values 1 (heads) if the random number is greater than 0.5 and 0 (tails) otherwise.

  • The mean of this sample can be calculated and stored.

  • A forvalues loop repeats a block of code multiple times.
    forvalues i = 1/5 { } runs the code placed between the braces five times.

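  • clear removes the data set from memory.

  • The quietly prefix suppresses the output of the command it precedes.

  • collapse (mean) x collapses the data set to one observation holding the mean of x; with the by() option it produces one observation per group instead.

  • append using coin_toss appends the observations stored in the existing file coin_toss to the data in memory.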

  • save coin_toss, replace saves the updated data, replacing the existing file.
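
Put together, the commands above could be arranged as in the following sketch. It is one possible arrangement, not necessarily the lecture's exact do-file: the check on the loop counter is an added assumption that skips append in the first pass, when the file coin_toss does not yet exist.

```stata
scalar true_mean = 0.5

forvalues i = 1/5 {
    quietly {
        * Drop the data in memory; scalars such as true_mean are kept.
        clear
        * 30 tosses per replication.
        set obs 30
        * 1 = heads, 0 = tails.
        generate x = uniform() > true_mean
        * Reduce the 30 tosses to a single observation: their mean.
        collapse (mean) x
        * From the second pass on, add the means saved so far.
        if `i' > 1 {
            append using coin_toss
        }
        save coin_toss, replace
    }
}

* Inspect the five sample means.
use coin_toss, clear
summarize x
```

Raising the upper limit of the loop gives more sample means, whose distribution can then be compared with the normal distribution predicted by the central limit theorem.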