random variable
a numerical summary of a random outcome
outcome
the mutually exclusive result of a random process
variable
a measurable characteristic of a population
sample space
the set of all possible outcomes of a random process
event
a subset of the sample space
estimator
a function of the sample data used to infer the estimand
estimand
the true value in the observable population which is to be estimated
target/structural parameter
the specific unknown population parameter that is to be estimated
central limit theorem
when N is sufficiently large, the sampling distribution of the estimated mean approaches a normal distribution, whatever the population's own distribution. Its variance (σ2/N) therefore shrinks predictably as N grows.
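The cards reference no code, but a minimal pure-Python simulation (made-up uniform data; the sample size N and repetition count are arbitrary) illustrates the theorem:

```python
import random
import statistics

# Draw many samples from a non-normal (uniform) population and record each
# sample mean. The distribution of those means is approximately normal,
# centred on the population mean 0.5, with standard deviation sigma/sqrt(N).
random.seed(0)
N = 100       # observations per sample
reps = 2000   # number of samples
means = [statistics.fmean(random.random() for _ in range(N)) for _ in range(reps)]

print(round(statistics.fmean(means), 2))   # close to the population mean 0.5
# The spread of the means is close to sqrt(1/12)/sqrt(100), i.e. about 0.029:
print(0.02 < statistics.stdev(means) < 0.04)
```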
Gauss-Markov OLS Assumptions
Exogeneity
No multicollinearity
Linear relationship between dependent var & independent var
Homoskedasticity
No autocorrelation
Independent & normally distributed error term
properties of OLS under Gauss-Markov assumptions
B-L-U-E
Best linear unbiased estimator
exogeneity
the x-variables and the error term are not correlated: E(εi | X) = 0. Therefore, neither the error term nor the dependent variable (Y) influence the explanatory variables (X) since they are determined outside of the model.
multicollinearity
correlation between ≥2 explanatory variables, violating the Gauss-Markov assumptions, often because they measure a similar trait
homoskedasticity
the variance (σ2) of the error term (ε) is constant throughout the sample. Therefore, the dispersion of residuals is similar for all X. This can be visually detected through a rectangle-shaped mass of residuals in a scatter plot of residuals (the absence of change in the residuals as X changes)
heteroskedasticity
The variance (σ2) of the error term (ε) is not constant throughout the sample. Therefore, the dispersion of residuals is dissimilar for all X. This violates Gauss-Markov assumptions of OLS, meaning that the standard errors are no longer efficient; however, the sample is still unbiased. In terms of BLUE, the Heteroskedasticity sample is no longer efficient (E). This can be visually detected through a conical or trumpet-shaped pattern in a scatter plot of the residuals.
autocorrelation / serial correlation
correlation between the error term and its own past values. This violation of OLS assumptions often occurs in time series data.
endogeneity
correlation between explanatory vars and the error term, violating OLS assumptions: E(εi | X) ≠ 0. This creates a bilateral causal relationship between the X and Y variables. It is best detected through the Hausman test, run in Stata with the command hausman.
residuals
the differences between observed (actual) values and the estimated values predicted by the model. This is shown in SSR
grounds to reject the null hypothesis (Ho) and propose the alternative hypothesis (Ha) / statistical significance
p-value < significance level (α)
insufficient grounds to reject the null hypothesis / statistical insignificance
p-value > significance level (α)
p-value
the probability of observing a z-stat, t-stat, F-stat, etc. with an absolute value ≥ the observed results (more extreme than the observed results)
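Since the standard normal CDF is available through math.erf, a short sketch (1.96 is the familiar two-sided 5% critical value) shows how a p-value is computed from a z-statistic:

```python
import math

# Two-sided p-value for an observed z-statistic:
# p = 2 * (1 - Phi(|z|)), where Phi(z) = (1 + erf(z / sqrt(2))) / 2
# is the standard normal CDF.
def p_value(z):
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - phi)

print(round(p_value(1.96), 2))   # 0.05: borderline at the 5% level
print(p_value(3.0) < 0.01)       # True: strong evidence against Ho
```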
Triple S method of analysing variable coefficients
sign, size & significance of a variable's estimated coefficient
dummy variable
a numerical var expressed as 0 or 1 to represent categorical data, often gender, race, union membership, etc.
elasticity
the % change in the dependent var resulting from a 1% change in the independent var
linear-linear model (Y = f[X])
ΔY = β × ΔX
linear-log model (Y=f[logX])
ΔY = (β/100) × %ΔX
log-linear model (logY = f[X])
%ΔY = (100 × β) × ΔX
log-log model (logY = f[logX])
%ΔY = β × %ΔX
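A hypothetical arithmetic check of the log-log case (the exponent 1.5 is made up): if Y = X^1.5, the log-log slope is the elasticity, so a 1% rise in X raises Y by about 1.5%.

```python
import math

# If Y = X ** 1.5, then log Y = 1.5 * log X: the log-log slope is 1.5.
x1, x2 = 100.0, 101.0              # a 1% increase in X
y1, y2 = x1 ** 1.5, x2 ** 1.5
slope = (math.log(y2) - math.log(y1)) / (math.log(x2) - math.log(x1))
pct_change_y = (y2 - y1) / y1 * 100

print(round(slope, 3))             # 1.5, the elasticity
print(round(pct_change_y, 2))      # about 1.5 (% change in Y per 1% change in X)
```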
internal validity
a regression that successfully yields inferences applicable to the chosen population
external validity
a regression whose inferences made from a sample can also be applied to other populations
Variance Inflation Factor (VIF)
a method of identifying multicollinearity by quantifying how much correlation between predictor variables inflates the variance of a regression coefficient. For each predictor j, VIFj = 1/(1 − Rj²), where Rj² comes from regressing predictor j on the other predictors; multicollinearity is generally of concern if VIF > 10. To run this in Stata, use the command vif.
Breusch-Pagan test for heteroskedasticity
Ho: constant variance/homoskedasticity (σ1 = σ2, etc.)
Ha: non-constant variance/heteroskedasticity (σ1 ≠ σ2, etc.)
To run this test in Stata, use the command estat hettest
robust standard errors
standard errors adjusted for heteroskedasticity. Add the Stata option , robust after the last independent variable in a regression command line
standard errors
= standard deviation / square root(# of observations), i.e. σ/√N
Type I neoclassical measurement error
the error is uncorrelated with the true value of the variable (eg: independent inaccuracies in reporting one’s weight)
Type II neoclassical measurement error
the error is correlated with the true-value or with other variables (eg: many observations - often self-reported - intentionally misrepresent a characteristic like income, education)
conditions for instrumented regression
relevance: the instrument must correlate with the problematic endogenous variable.
exclusion restriction: the instrument only affects the outcome through the endogenous x-variable
rule of thumb for weak instrument identification
an instrument is considered weak if the first-stage F-statistic for its significance is < 10
Use the Stata command estat firststage
Hausman test
A test to determine if the estimator is consistent & efficient (adheres to BLUE)
Ho: the regressor is exogenous (E(Xiεi) = 0)
Ha: the regressor is endogenous (E(Xiεi) ≠ 0)
linear probability model
an OLS model with a binary (limited) dependent variable, used to estimate the probability of an outcome. These models can suffer from issues like predicted probabilities outside the 0-1 range and heteroskedasticity.
probit model
models the probability of an event's occurrence as the standard normal cumulative distribution function of a linear index of the independent variables; its coefficients are interpreted as z-score changes in the probit index. Use the Stata command probit.
logit model
models the log-odds of an event's occurrence. This model follows the logistic distribution, so its coefficients are interpreted in terms of the "odds" of an event happening. Use the Stata command logit
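A toy comparison (made-up coefficients b0, b1) of why the logistic link avoids the LPM's out-of-range predictions:

```python
import math

b0, b1 = -2.0, 0.5   # made-up coefficients

def lpm(x):
    # Linear probability model: the fitted value can leave [0, 1].
    return b0 + b1 * x

def logit_prob(x):
    # Logistic CDF of the linear index: always strictly inside (0, 1).
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(lpm(20))                  # 8.0 -- an impossible "probability"
print(0 < logit_prob(20) < 1)   # True: the logit prediction stays bounded
```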
latent variable
a variable that cannot be observed, but can be inferred from other observable variables (eg: intelligence as measured through a test score). This type of variable often appears in logit or probit models
maximum likelihood estimation
estimating the parameters of an assumed probability distribution by maximising a likelihood function, so that the observed data are most probable under the assumed statistical model.
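A minimal grid-search sketch of the idea for a Bernoulli probability p (the 0/1 data are made up); the likelihood peaks at the sample proportion:

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # made-up data: k = 7 of n = 10
k, n = sum(data), len(data)

def log_lik(p):
    # Log-likelihood of k successes in n Bernoulli(p) trials.
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Search a grid of candidate p values for the likelihood maximiser.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_lik)
print(p_hat)   # 0.7: the MLE equals the sample proportion k / n
```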
Stata margins command
computes marginal effects after an estimation command
multinomial regressions
regressions for categorical data with no order/ranking that are calculated according to maximum likelihood estimation. This method assumes the independence of irrelevant alternatives and predicts the log odds of an observation being classified as a respective category.
cross-sectional data
data that provides a ‘snapshot’ of multiple observations at a given point in time (time is constant)
time-series data
data for only one variable collected at successive, recurring intervals to capture change over time
panel data
a combination of cross-sectional data and time series data
Chow Test for structural change
A statistical test to determine if the coefficients in two separate regression models are equal, often used in DiD regressions to examine changes between two groups and/or changes before/after an intervention.
Ho: the coefficients are equal across the two groups/periods (no structural break exists)
Ha: the coefficients differ (a structural break exists)
Difference-in-Differences
a causal estimation method using control and treatment groups to compare trends pre- and post-intervention. It addresses biases from pre-existing differences between the two groups and from time trends that would have occurred regardless of the intervention.
This method assumes:
parallel trends
exogeneity
conditional independence
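With made-up group means, the estimator reduces to simple arithmetic:

```python
# Mean outcomes for each group and period (hypothetical numbers).
treat_pre, treat_post = 10.0, 16.0
control_pre, control_post = 9.0, 11.0

# DiD nets out the common time trend (captured by the control group).
did = (treat_post - treat_pre) - (control_post - control_pre)
print(did)   # 4.0: the estimated treatment effect
```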
balanced panel data
panel data with an equal number of observations in each cross-section and time period
unbalanced panel data
panel data with an unequal number of observations in each cross-section and time period
fixed effects (FE) Model
a model for panel data that treats each entity's unobserved characteristics as fixed over time and allows them to be correlated with the explanatory variables
random effects (RE) model
a model for panel data that assumes each entity's unmeasured characteristics are random and uncorrelated with the explanatory variables
Stata command xtset
declares the entity and time variables so panel data commands run properly
error term (ε)
the unexplained portion of a dependent variable's variance that's not accounted for by the independent variables in a model
conditions for proper instrumented regression
instrument correlates with the endogenous variable
the instrument does not correlate with the error term
If these conditions are met, the instrument affects the outcome only through the endogenous variable
1st stage instrumentation
regress the endogenous variable on the instrument(s) and the exogenous controls to obtain fitted values
2nd stage instrumentation
regress the outcome on the 1st-stage fitted values in place of the endogenous variable
methods of addressing Heteroskedasticity
drop the hetsked variable
cluster the standard error
subdivide the hetsked variable into several new variables based on common traits and their residuals
use the log form of the hetsked variable
use hetsked-robust standard errors (Stata option , robust at the end of a regression command)
taking the log of an explanatory variable (ln[X])
a strategy to address heteroskedasticity in an explanatory variable by stabilising its variance (σ2). If this is done, the variable's coefficient must be interpreted in % change terms
F-test / joint F-test
A statistical test assessing the joint significance of several coefficients to determine whether their variables should be included in a final regression model. This test can also compare the goodness-of-fit of several different suggested models. To run it in Stata, use the command test x1 x2 after a regression output.
Ho: B2 = B3 etc. = 0 (the included variables are not significantly different to 0 and therefore do not statistically explain change in Y)
Ha: B2 ≠ B3 etc. ≠ 0 (the included variables are significantly different from 0 and therefore do statistically explain change in Y)
order condition for instrumented regression
there are at least as many instruments as endogenous variables; with exactly one instrument per endogenous variable, the model is just-identified
over-identification
there exist more instruments than endogenous variables
under-identification
there are insufficient instruments compared to endogenous variables
2-stage least squares (2SLS)
an estimation method for instrumented regression: the endogenous variable is first regressed on the instrument(s), then its fitted values replace it in the structural equation. Use the Stata command ivregress 2sls
long & narrow panel data
panel data w/ long time dimension & narrow range of subjects
short & wide panel data
panel data w/ short time dimension & wide range of subjects
long & wide panel data
panel data w/ long time dimension & wide range of subjects
short & narrow panel data
panel data w/ short time dimension & narrow range of subjects
heterogeneity bias
bias resulting from the omission of the unobserved fixed effect