The World Development Indicators (WDI) data, supported by the World Bank, contains numerous time series for various countries.
Comparing health expenditures in the United States with those in OECD countries requires an expenditure series for each.
Countries can be selected individually or as groups (e.g., OECD members).
The data includes current health expenditures per capita in US dollars.
Data is available from 1990 onwards; the selection can be narrowed, for example to the last 20 years.
Data can be downloaded in CSV format.
Advanced options allow customization of the download, such as specifying how missing values are represented.
Missing values can be set to blank instead of the default two dots to facilitate importing into programs.
The variable format can be set to codes only to ensure compatibility with statistical software like Stata.
Notes at the end of the CSV file should be deleted as they cannot be read by statistical programs.
Data is imported into Stata using the `import delimited` command.
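A minimal sketch of the import step. The file name wdi_example.csv, the column names and the values are all made up for illustration; the example writes its own small CSV so that it runs without a real WDI download.

```stata
* Write a tiny stand-in CSV (hypothetical values, not WDI data)
file open fh using "wdi_example.csv", write text replace
file write fh "countrycode,YR2000,YR2001" _n
file write fh "USA,100,110" _n
file close fh

* Read it back: first row holds the variable names,
* case(preserve) keeps the upper-case YR prefix
import delimited using "wdi_example.csv", varnames(1) case(preserve) clear
describe
```

Without case(preserve), `import delimited` converts variable names to lower case by default, so the year columns would be called yr2000, yr2001, and so on.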
Data often needs to be reshaped from a wide format (years in columns) to a long format (years in a single column) for analysis.
This is done with the `reshape long` command.
The data is reshaped by specifying the stub of the variables to reshape (YR, the prefix shared by the year columns) together with the i() variable (country code), which identifies each unit, and the j() variable (year), which receives the year suffix.
The reshaped data has the years in one column (the j() variable) and the corresponding health expenditures in another (YR).
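The reshape can be sketched on a toy dataset. The country codes, the variable name countrycode and the values below are hypothetical; only the YR prefix comes from the notes above.

```stata
clear
input str3 countrycode YR2000 YR2001 YR2002
"USA" 100 110 120
"CAN"  80  85  90
end

* YR is the stub shared by the year columns; i() identifies each country,
* j() names the new variable that receives the year suffix
reshape long YR, i(countrycode) j(year)

list, sepby(countrycode)
```

After the reshape there is one row per country and year: year holds 2000, 2001, 2002 and YR holds the corresponding value.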
The variable `YR` can then be renamed to indicate health expenditure for the United States using the `rename` command.
The year variable, initially an integer, must be converted to a string before being converted to a date format.
The `tostring` command is used to convert the year to a string variable.
The `generate date` command is used to create a date variable from the year string.
The `format date` command is used to format the date variable to display only the year.
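The rename and date steps might look as follows on a toy series. The variable name health_exp_us and the values are assumptions for illustration, and the mask "Y" and display format %tdCCYY are one possible choice rather than necessarily the ones used in the original session.

```stata
clear
input year YR
2000 100
2001 110
2002 120
end

rename YR health_exp_us          // hypothetical variable name

* Numeric year -> string -> daily date (1 January of that year)
tostring year, replace
generate date = date(year, "Y")
format date %tdCCYY              // display only the four-digit year

list
```

The date variable is stored as a number of days since 1 January 1960; the format only changes how it is displayed.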
The same steps are repeated for OECD countries, reshaping and renaming variables accordingly.
The two datasets (US and OECD) are then merged using a one-to-one merge based on the year.
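The `merge 1:1` command matches observations on the merge variable (date). Older versions of Stata required both datasets to be sorted by that variable before merging; current versions sort the data automatically.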
After merging, the data can be graphed to compare health expenditures in the US and OECD countries over time.
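A sketch of the merge and graph on two toy series. For brevity a numeric year variable stands in for the date variable, and the variable names and values are hypothetical. The temporary file relies on a local macro, so the lines are meant to be run together from a do-file.

```stata
* Toy OECD series saved to a temporary file (hypothetical values)
clear
input year health_exp_oecd
2000 60
2001 65
2002 70
end
tempfile oecd
save `oecd'

* Toy US series held in memory
clear
input year health_exp_us
2000 100
2001 110
2002 120
end

* One-to-one merge: each year appears once in each dataset
merge 1:1 year using `oecd'
assert _merge == 3      // every year was matched in both datasets
drop _merge

* Both series plotted against the year
twoway line health_exp_us health_exp_oecd year
```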
Population: The entire set of observations of interest.
Sample: A subset of the population.
When describing a population, random variables are considered, denoted by X, with the population mean denoted by the Greek letter mu \mu.
If X can take discrete values X_1, X_2, … with respective probabilities P(X_1), P(X_2), …, then the population mean is calculated as:
\mu = E[X] = \sum_i P(X = X_i) \cdot X_i
The sum of probabilities must equal 1.
The sample mean, denoted as \bar{X}, is calculated as the sum of all observations in the sample divided by the number of observations:
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}
- **Sample variance** (S^2) is calculated as the sum of squared deviations from the sample mean, divided by n-1:
S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}
- **Population variance** (\sigma^2) is calculated as the weighted average of squared deviations from the population mean, using probabilities as weights.
\sigma^2 = \sum_i P(X = X_i) \cdot (X_i - \mu)^2
- Sample standard deviation (S) is the square root of the sample variance, and population standard deviation (\sigma) is the square root of the population variance.
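
As a quick check of the sample formulas, consider a made-up sample of three observations, 2, 4 and 6 (illustrative values, not taken from any dataset):

```latex
\begin{align*}
\bar{X} &= \frac{2 + 4 + 6}{3} = 4 \\
S^2 &= \frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3-1} = \frac{4 + 0 + 4}{2} = 4 \\
S &= \sqrt{4} = 2
\end{align*}
```

Dividing by n-1 = 2 rather than n = 3 is what distinguishes the sample variance from a plain average of the squared deviations.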
Consider an experiment with two coin tosses.
Possible outcomes: head-head, head-tail, tail-head, tail-tail.
Counting each head as 1, the random variable X (the number of heads) takes the values 2, 1, 1, 0 for these four outcomes.
Possible values: 0, 1, 2
Probabilities: 1/4, 1/2, 1/4
The mean is E[X] = (0 \cdot 1/4) + (1 \cdot 1/2) + (2 \cdot 1/4) = 1.
The variance is:
Var(X) = (0-1)^2 \cdot 1/4 + (1-1)^2 \cdot 1/2 + (2-1)^2 \cdot 1/4 = 1/4 + 0 + 1/4 = 1/2
Census data from 1880 (population) can be used to illustrate sample mean, standard deviation and central limit theorem.
Population mean (\mu) = 24.13 and standard deviation (\sigma) = 18.61.
A sample of 25 observations may yield a different mean (\bar{X}), e.g., 27.84, and standard deviation, e.g., 20.71.
The mean of the sample mean equals the population mean:
\mu_{\bar{X}} = \mu
The standard deviation of the sample mean equals the population standard deviation divided by the square root of the sample size n:
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
Central Limit Theorem: The distribution of sample means approaches a normal distribution as the sample size increases, even if the population is not normally distributed.
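Standardizing the sample mean gives a statistic that is approximately standard normal when n is large:
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}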
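Plugging in the census figures quoted above (\mu = 24.13, \sigma = 18.61, n = 25, and the example sample mean 27.84) gives a sense of scale:

```latex
\begin{align*}
\sigma_{\bar{X}} &= \frac{18.61}{\sqrt{25}} = \frac{18.61}{5} = 3.722 \\
Z &= \frac{27.84 - 24.13}{3.722} = \frac{3.71}{3.722} \approx 0.997
\end{align*}
```

The example sample mean therefore lies about one standard deviation of the sample mean above the population mean.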
To simulate coin tosses, Stata can generate random numbers between 0 and 1 using the `uniform()` function (called `runiform()` in current versions of Stata).
The command `set obs 30` sets the number of observations to 30.
`scalar true_mean = 0.5` defines a scalar representing the true mean of a fair coin toss.
`generate x = uniform() > true_mean` generates a variable x with values 1 (heads) if the random number is greater than 0.5 and 0 (tails) otherwise.
The mean of this sample can be calculated and stored.
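One way to draw a single sample of 30 tosses and store its mean is sketched below. It uses `runiform()`, the current name of the uniform random-number function; the seed value and the scalar name sample_mean are assumptions added only for illustration and reproducibility.

```stata
clear
set seed 12345                 // assumed seed, only for reproducibility
set obs 30
scalar true_mean = 0.5

* x is 1 (heads) when the draw exceeds 0.5 and 0 (tails) otherwise
generate x = runiform() > true_mean

* summarize leaves the mean in r(mean); copy it into a scalar
quietly summarize x
scalar sample_mean = r(mean)
display "sample mean = " sample_mean
```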
A loop allows repeating code multiple times: `forvalues i = 1/5 { … }` repeats the enclosed code five times.
`clear` clears the data set.
`quietly` suppresses the output of each command.
`collapse (mean) x, by()` collapses the data set to one observation, the mean of x.
`append using coin_toss` appends the observations stored in the existing file coin_toss to the data in memory.
`save coin_toss, replace` saves the updated data, replacing the existing file.
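Put together, the commands above could form a loop like the following sketch. The seed and the check on the loop counter are assumptions added here: on the first pass there is no coin_toss file yet, so the append is skipped. The `collapse` is written without by() because a single overall mean is wanted.

```stata
set seed 12345                       // assumed seed, only for reproducibility

forvalues i = 1/5 {
    quietly {
        clear
        set obs 30
        generate x = runiform() > 0.5    // 30 simulated coin tosses
        collapse (mean) x                // one observation: the sample mean
        if `i' > 1 {
            append using coin_toss       // add the means saved so far
        }
        save coin_toss, replace
    }
}

* coin_toss now holds one sample mean per repetition
use coin_toss, clear
list
```

Raising the upper limit of the loop produces more sample means, whose distribution can then be compared with the normal distribution predicted by the central limit theorem.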