Why are linear models not ideal for use in economic situations?
The residuals are likely to be serially correlated
Ideally, residual terms are unpredictable and uncorrelated
What do you do if the value being modelled grows exponentially instead of linearly?
Take the natural log of the series, fit a linear trend to it, then exponentiate to get back to the original units
ln(y_t) = b0 + b1*t + ε_t, so
y_t = e^(b0 + b1*t + ε_t)
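A minimal sketch of the log-linear idea above, assuming a made-up series that grows 5% per period. The helper `fit_line` and the data are hypothetical, not from the cards:

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with one regressor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical series: starts at 100 and grows 5% per period.
t = list(range(1, 11))
y = [100 * 1.05 ** period for period in t]

# Regress ln(y) on t, then exponentiate to return to the level of y.
b0, b1 = fit_line(t, [math.log(v) for v in y])
growth_rate = math.exp(b1) - 1          # implied growth per period
forecast_t11 = math.exp(b0 + b1 * 11)   # forecast in original units
```

Here the slope on t is the continuously compounded growth rate, so `exp(b1) - 1` recovers the per-period growth.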
When would a linear model be appropriate, as opposed to a log-linear model?
When the series grows by an approximately constant amount each period
When would a log-linear model be appropriate, as opposed to a linear model?
When the series grows at an approximately constant rate (exponential growth)
What are the three requirements for a time series to be covariance stationary?
All of the following must be constant and finite in all periods:
the expected value (mean)
the variance
the covariance with lagged values of the series, for any fixed lag
What happens to time series without covariance stationarity?
Results are economically invalid
regression will lead to spurious results
Estimate of b will be biased
Hypothesis tests will be invalid
What is an autoregressive model?
The independent variables are lagged (past) values of the dependent variable
Within an AR model, what does it mean for the model to be incomplete?
There is information in the data that the model is not capturing
How do you correct an AR model with significant serial correlation (autocorrelation) in its residuals, i.e. an incomplete model?
Increase the number of lags until no significant autocorrelation remains
How do you test for autocorrelation in an AR model?
Test each residual autocorrelation with a t-test
The Durbin-Watson test doesn’t work for AR models (it is otherwise the usual test for serial correlation)
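A rough sketch of the residual autocorrelation t-test, assuming the usual approximation that the standard error of each autocorrelation is 1 / sqrt(T) for T observations. The function names and residuals are made up for illustration:

```python
import math

def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    num = sum((residuals[t] - mean) * (residuals[t - lag] - mean)
              for t in range(lag, n))
    den = sum((e - mean) ** 2 for e in residuals)
    return num / den

def autocorr_t_stat(residuals, lag):
    """t = autocorrelation / (1 / sqrt(T)), with T = number of residuals."""
    return autocorrelation(residuals, lag) * math.sqrt(len(residuals))

# Hypothetical residuals that flip sign every period.
resid = [1, -1] * 10
t_lag1 = autocorr_t_stat(resid, lag=1)
```

A t-statistic this far from zero would suggest the model is missing a lag.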
When is a time series mean-reverting?
It falls when level above mean
It rises when level below mean
How does the formula change for regression for mean-reverting level?
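x_t = b0 + b1 * x_(t-1)
At the mean-reverting level the series is expected to stay where it is (x_t = x_(t-1)), so this would become:
x_t = b0 / (1 - b1)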
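A small sketch of the mean-reverting level for an AR(1) model, using hypothetical estimates b0 = 1.2 and b1 = 0.4 (illustrative numbers only):

```python
def mean_reverting_level(b0, b1):
    """Level at which the AR(1) forecast equals the current value."""
    if abs(b1) >= 1:
        raise ValueError("series is not mean reverting when |b1| >= 1")
    return b0 / (1 - b1)

def next_value(x_prev, b0=1.2, b1=0.4):
    """One-step AR(1) forecast: b0 + b1 * previous value."""
    return b0 + b1 * x_prev

level = mean_reverting_level(1.2, 0.4)   # about 2.0

above = next_value(3.0)   # starts above the level, forecast falls
below = next_value(1.0)   # starts below the level, forecast rises
```

Starting above the level the forecast falls toward it, and starting below it rises, which matches the definition of mean reversion in the earlier card.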
Can you have multiple Dependent or Independent variables?
A multiple regression has a single dependent variable and two or more independent variables. Independent variables are those that can be manipulated, while dependent variables are influenced by changes in the independent variables in a given study
What are the 5 assumptions of regression?
linearity - linear relationship
Homoskedasticity - Unchanging variance
Error Independence - Observations are independent
Normality - Residuals are normally distributed
Variable Independence - no exact linear relationships between two or more independent variables
What would be the 5 violations of regression?
Nonlinearity
Heteroskedasticity
Serial correlation or autocorrelation
non-normality
Multicollinearity
What does the error term represent?
The stochastic or random part of the model, capturing any unexplained variation in the dependent variable due to randomness, measurement errors, or unobserved factors
What do the independent variables represent?
The deterministic part of the model, quantifying the observed relationship between the independent variables and the dependent variable
What is the coefficient of determination and what does it do?
Also known as R-squared, it measures the goodness of fit of an estimated regression to the data. It can also be defined as the ratio of the variation of the dependent variable explained by the independent variables to its total variation.
Quick formula for R-Squared:
R² = SSR/SST, the regression (explained) sum of squares over the total sum of squares (think alphabet)
Do you want more or fewer variables for multiple linear regression?
Usually you want fewer, to avoid overfitting (Less is More). R-Squared never falls as you add more independent variables, even ones with little explanatory power
Why is adjusted R² a little bit better than R²?
Doesn’t automatically go up with the addition of more independent variables
How do you determine what effect the addition of a new variable will have on Adjusted R-Squared?
If the new coefficient's |t-stat| > 1.0, then A.R² will go up
If the new coefficient's |t-stat| < 1.0, then A.R² will go down
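To see why adjusted R² can fall while R² rises, here is a short sketch using the standard adjustment 1 - [(n - 1) / (n - k - 1)] * (1 - R²). The sample size and R² values are hypothetical:

```python
def r_squared(ssr, sst):
    """Share of total variation explained by the regression (SSR / SST)."""
    return ssr / sst

def adjusted_r_squared(r2, n, k):
    """Penalise R² for the number of independent variables k, given n observations."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Hypothetical case: a weak third variable is added to a two-variable model.
n = 50
before = adjusted_r_squared(0.600, n, k=2)
after = adjusted_r_squared(0.602, n, k=3)   # R² rose, but only slightly
```

In this made-up case R² goes from 0.600 to 0.602, yet `after` is lower than `before` because the extra variable costs a degree of freedom.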
What does a lower Akaike’s information criterion (AIC) indicate?
A lower AIC indicates a better-fitting model (you want it to be as low as possible)
What does Bayesian Information Criteria (BIC) indicate?
A lower BIC indicates a better-fitting model
When do we prefer AIC to BIC or vice versa?
We use AIC when the model will be used for prediction; we use BIC if all we’re interested in is the best goodness of fit
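The cards do not give the formulas, so the following is a sketch assuming one common regression form: AIC = n * ln(SSE / n) + 2(k + 1) and BIC = n * ln(SSE / n) + ln(n)(k + 1). The SSE figures are invented to show the two criteria disagreeing:

```python
import math

def aic(sse, n, k):
    """Akaike criterion: fit term plus a penalty of 2 per estimated coefficient."""
    return n * math.log(sse / n) + 2 * (k + 1)

def bic(sse, n, k):
    """Bayesian criterion: same fit term, heavier penalty of ln(n) per coefficient."""
    return n * math.log(sse / n) + math.log(n) * (k + 1)

# Hypothetical models fitted to the same 100 observations.
n = 100
small = {"sse": 50.0, "k": 2}
large = {"sse": 47.0, "k": 5}

aic_prefers_large = aic(large["sse"], n, large["k"]) < aic(small["sse"], n, small["k"])
bic_prefers_small = bic(small["sse"], n, small["k"]) < bic(large["sse"], n, large["k"])
```

Because ln(n) exceeds 2 for any realistic sample, BIC punishes extra variables harder and tends to pick the more parsimonious model.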
In terms of Adjusted R², which data would we want to use?
If only R² and Adjusted R² are given, prefer the model with the highest Adjusted R². The preferred model could change if AIC and BIC values are also given
What is the Coefficient?
The slope of the independent variable, and it represents the expected change in the dependent variable for a 1 unit change in the independent variable (Holding all other variables constant - this is really key to remember)
What does a coefficient of 0 mean?
Independent variable has no significance, and probably can be excluded from the regression
What are the degrees of freedom?
for multiple regression: # of data points - # of regression coefficients
n - (k + 1), where n is the number of observations and k is the number of independent variables
What is the really key thing to remember about the coefficient of independent variables?
This is based on the change that would occur for a 1 unit change HOLDING ALL OTHER VARIABLES CONSTANT
How do we interpret the hypothesis test and rejecting/ not rejecting the null hypothesis?
If the absolute value of the calculated t-statistic is greater than the critical t-value, we can reject the null hypothesis; if it is smaller, we cannot reject the null hypothesis
What is an unrestricted model?
A model that includes ALL the variables in the initial specification
What is a restricted (or nested) model?
Restricts the slope to 0, for one or more independent variables - not all of them are used. It is nested in the unrestricted model
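What is the criterion for the F-test for a joint test of slope coefficients?
Reject the null hypothesis that the tested slope coefficients are jointly zero when the calculated F-statistic exceeds the critical F-value for the selected significance level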
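A sketch of the joint F-statistic comparing a restricted model with the unrestricted one, assuming the standard form F = [(SSE_R - SSE_U) / q] / [SSE_U / (n - k - 1)], where q is the number of excluded variables. All numbers are hypothetical, and the critical value is an approximate table lookup rather than something computed here:

```python
def joint_f_stat(sse_restricted, sse_unrestricted, q, n, k):
    """q = number of excluded variables; k = variables in the unrestricted model."""
    numerator = (sse_restricted - sse_unrestricted) / q
    denominator = sse_unrestricted / (n - k - 1)
    return numerator / denominator

# Hypothetical: dropping 2 of 5 variables raises SSE from 100 to 120 (n = 60).
f_stat = joint_f_stat(120.0, 100.0, q=2, n=60, k=5)

# Approximate 5% critical value for (2, 54) degrees of freedom, from an F-table.
f_critical = 3.17
reject_null = f_stat > f_critical
```

Rejecting the null here would mean the two dropped variables jointly add explanatory power, so the unrestricted model is preferred.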
What is Model Error
Error between a predicted value and the actual value for a dependent variable within the data set
What is sampling error?
errors created by forecasting independent variables for use in forecasting a dependent variable
What is a logistic regression (logit) model?
Represents the dependent variable as the natural logarithm of the odds, p / (1 - p), which confines the predicted probability to a range between 0 and 1
When should a logistic regression (logit) model be used?
When the dependent variable is discrete (i.e. not continuous)
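A minimal sketch of the log-odds transform and its inverse, to show why logit predictions stay between 0 and 1. The function names are illustrative:

```python
import math

def log_odds(p):
    """Natural log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def probability(z):
    """Inverse of the logit: maps any real number z to a value in (0, 1)."""
    return 1 / (1 + math.exp(-z))

round_trip = probability(log_odds(0.8))   # recovers roughly 0.8
low = probability(-10)                    # close to 0, never below it
high = probability(10)                    # close to 1, never above it
```

However large or small the fitted value b0 + b1*X1 + ... becomes, `probability` keeps the implied probability inside (0, 1).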
What is the stochastic part of a model?
The error term
What is the next step after estimating the regression model?
Analyse scatterplots of variables and residuals
What is the next step after analysing the scatterplots of variables and residuals?
Seeing if the regression assumptions are satisfied
What is the next step after seeing if the regression assumptions are satisfied (and they are)
Checking if the goodness of fit is satisfactory/ significant
What is the next step after seeing if the regression assumptions are satisfied (and they are not)
adjust the model
What is the next step after checking if the goodness of fit is satisfactory/ significant? (and it is)
Test with out-of-sample data
What is the next step after checking if the goodness of fit is satisfactory/ significant? (and it is not)
adjust the model
In terms of interpreting (scatterplot) relationships, do we want to have little or no correlation, negative correlation, or positive correlation?
We want little or no correlation between the independent variables, because it suggests low multicollinearity, which is a desirable characteristic. Each variable then provides unique information, leading to more stable and reliable coefficient estimates. It also simplifies model interpretation and enhances performance by avoiding redundancy among predictors
Based on p-values, when would it be correct to reject the null hypothesis?
Reject the null if the p-value for the independent variable is less than the level of significance; do not reject the null if the p-value is greater than the level of significance
How many dummy variables should be used to incorporate qualitative independent variables into a regression model?
n - 1 dummy variables, where n is the number of categories
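A small sketch of n - 1 encoding, using quarters as a hypothetical qualitative variable. The first category is treated as the base case, since including all n dummies alongside an intercept would create an exact linear relationship among the regressors:

```python
def dummy_encode(values, categories):
    """Encode a qualitative variable with len(categories) - 1 dummies.

    The first category is the omitted base case (all dummies equal 0).
    """
    kept = categories[1:]
    return [[1 if v == c else 0 for c in kept] for v in values]

quarters = ["Q1", "Q2", "Q3", "Q4"]
rows = dummy_encode(["Q1", "Q3", "Q4"], quarters)
# rows == [[0, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each dummy coefficient is then read as the difference from the base category (Q1 here).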
If we had a concern that a model might have an artificially large R² and t-statistics that are understated, what regression assumption is likely violated?
Multicollinearity - standard errors for each coefficient become inflated which results in understated t-statistics, which in turn leads to coefficients being incorrectly classified as not statistically significant. It would also have inflated R² and F-statistic values, and seem to be a better fit than it actually is
When does multicollinearity occur?
when at least two independent variables are highly correlated
What does the standard error of the forecast do?
Quantifies the uncertainty around the prediction; it does NOT improve the forecast of the dependent variable
What is Model Specification
The set of variables included in the regression and the regression equation's functional form
What does it mean to have a sound economic basis for your model?
Economic reasoning behind the choice of variables and their interactions
What does parsimony mean?
Less is more - each variable plays an essential role, additional variables don’t add spurious accuracy
What does good in-sample but bad out-of-sample performance mean?
This would be an example of overfitting: an overfit model explains the data used to fit it, but may not work well with data outside the set
What does appropriate functional form mean?
A model should incorporate non-linear forms, if appropriate
What is Homoskedasticity?
(The ONE you want) Constant variance and one assumption for valid regression
What is Heteroskedasticity?
(Not the one you want) Nonconstant variance and violates assumptions
What are the types of heteroskedasticity?
Unconditional (not a problem in linear regression) and Conditional (size of error terms is related to value of the independent variables, and is a problem in linear regression)
How do you detect Heteroskedasticity?
Breusch-Pagan (BP) test - one-tail chi-square test
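A sketch of the BP statistic for a single regressor, assuming the usual form BP = n * R², where R² comes from regressing the squared residuals on the independent variable (chi-square with k degrees of freedom). The residuals below are invented so that their spread widens with x:

```python
def simple_r2(xs, ys):
    """R² of a one-variable regression, i.e. the squared correlation."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

def breusch_pagan_stat(xs, residuals):
    """BP = n * R² from regressing squared residuals on the regressor."""
    squared = [e ** 2 for e in residuals]
    return len(xs) * simple_r2(xs, squared)

# Hypothetical residuals whose size grows with x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
resid = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8]
bp = breusch_pagan_stat(x, resid)

chi2_critical = 3.84   # 5% critical value, chi-square with 1 degree of freedom
conditional_het = bp > chi2_critical
```

The test is one-tailed because only a large BP value (squared residuals explained by the regressor) counts as evidence of conditional heteroskedasticity.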
What is positive serial correlation?
Residuals tend to go in groups which violates assumptions
What is negative serial correlation?
Residuals tend to bounce back and forth which violates assumption
Are coefficient estimates largely affected or unaffected for positive/ negative serial correlation?
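Unaffected in both cases: the coefficient estimates remain consistent (unless an independent variable is a lagged value of the dependent variable). It is the standard errors that are distorted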
In terms of serial correlation, what does an F-stat that is too large indicate? too small?
Too large means positive, too small means negative
Are standard errors too high or too low for positive/ negative serial correlation?
too low for positive, too high for negative
Are there more Type I or Type II errors for positive/ negative serial correlation
More Type I in positive, More Type II in negative
Is False significance/ false insignificance associated with positive/ negative serial correlation
False insignificance is associated with negative serial correlation, false significance with positive serial correlation
How to test for serial correlation?
Durbin-Watson Test and Breusch-Godfrey Test
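A quick sketch of the Durbin-Watson statistic, DW = sum of squared changes in the residuals divided by the sum of squared residuals. Values near 2 suggest no serial correlation, values well below 2 suggest positive and values well above 2 suggest negative serial correlation. The residual series are made up:

```python
def durbin_watson(residuals):
    """Sum of squared first differences over sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

grouped = [1, 1, 1, -1, -1, -1]       # residuals move in runs
alternating = [1, -1, 1, -1, 1, -1]   # residuals flip sign each period

dw_positive = durbin_watson(grouped)       # below 2
dw_negative = durbin_watson(alternating)   # above 2
```

The two series mirror the "go in groups" and "bounce back and forth" descriptions in the cards above.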
Can serial correlation be eliminated?
No, serial correlation cannot be eliminated; the standard errors are simply adjusted to account for it
What is multicollinearity
Two or more independent variables are highly correlated with each other
What are the effects of multicollinearity
model estimates of dependent variable are unaffected
Standard errors of coefficients are too large, so t-stats are too small
How do you detect multicollinearity?
Visually, it will look absolutely fine on the scatter plot.
However, a high R-squared and a significant F-stat combined with insignificant (very low) t-stats for all slope coefficients is evidence of multicollinearity
Multicollinearity may still exist even when the F-stat is insignificant or the t-statistics are significant
What is the Variance Inflation Factor (VIF)?
A VIF exists for each independent variable in a multiple regression:
VIF_j = 1 / (1 - R_j²)
Each independent variable j is regressed against the other independent variables, and R_j² is the R² of that regression.
VIF>5 warrants further investigation of the given independent variable
VIF>10 indicates serious multicollinearity requiring correction
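A short sketch of the VIF calculation and the two rule-of-thumb cutoffs from the card, taking the R² of each auxiliary regression as given. The R² values and verdict labels are hypothetical:

```python
def vif(r2_j):
    """r2_j: R² from regressing variable j on the other independent variables."""
    return 1 / (1 - r2_j)

def vif_verdict(value):
    """Apply the >5 and >10 rules of thumb."""
    if value > 10:
        return "serious multicollinearity"
    if value > 5:
        return "investigate further"
    return "no concern flagged"

verdicts = [vif_verdict(vif(r2)) for r2 in (0.50, 0.85, 0.95)]
```

An auxiliary R² of 0.50 gives a VIF of 2, 0.85 gives roughly 6.7, and 0.95 gives roughly 20, so the three cases fall into the three bands.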
How do you correct for multicollinearity?
Exclude one or more independent variables from the model until multicollinearity is no longer present, or
Use a different proxy for one of the variables
Increase the sample size
Where does serial correlation typically occur?
time-series data sets
What does the Breusch-Godfrey test check for?
Checks the regression for serial correlation
What does Variance Inflation Factor (VIF) test for?
Multicollinearity
What does the Breusch-Pagan test for?
Tests for conditional heteroskedasticity
What is a potential consequence of omitted variables?
Heteroskedasticity or serial correlation
What is a potential consequence of inappropriate variable form?
Heteroskedasticity
What is a potential consequence of inappropriate variable scaling?
Heteroskedasticity or Multicollinearity
What is a potential consequence of inappropriate data pooling?
heteroskedasticity or serial correlation
Can patterns in serially correlated residuals contain information that has the potential to be exploited?
yes
Does conditional or unconditional heteroskedasticity cause errors in statistical inference?
Conditional
What does good out-of-sample performance mean?
Model generalises well (low risk of overfitting or underfitting)
What are the examples of potentially influential data points? (not violations of assumptions)
High-leverage points
Outliers
Influential Observations
What is a high-leverage point?
An extreme value of an independent variable
What is an outlier?
An extreme value of a dependent variable
What is an influential observation?
An observation whose inclusion may significantly alter regression results
What is the Measure of Leverage?
Leverage measures the distance between the value of the i-th observation of that independent variable and the mean value of that variable across all n observations:
0 < Leverage < 1
How do you look for high-leverage points?
Measure of Leverage
What does a high measure of leverage mean? low?
The higher the leverage, the more distant the observation from the mean for the variable
How do we determine if a point has a high measure of leverage?
h > 3((k + 1) / n)
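A sketch of the leverage check for the one-regressor case, assuming the standard simple-regression formula h_i = 1/n + (x_i - mean)² / sum of squared deviations. The data are invented, with one extreme x value:

```python
def leverages(xs):
    """Leverage of each observation in a regression with one independent variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

def high_leverage_flags(xs, k=1):
    """Flag observations whose leverage exceeds 3 * (k + 1) / n."""
    cutoff = 3 * (k + 1) / len(xs)
    return [h > cutoff for h in leverages(xs)]

# Hypothetical independent variable with one extreme value.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
flags = high_leverage_flags(x)
```

Only the last observation is flagged: its leverage is about 0.97 against a cutoff of 0.6. The leverages always sum to k + 1, which is why the cutoff is a multiple of (k + 1) / n.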
In what scenario may multicollinearity not be a major issue
If the goal of the analysis is to predict the dependent variable, rather than to understand the roles of the independent variables
What is my story prompt for remembering the regression process?
Eager Captains ESpecially Study Sailors Guarding The Buried Past
What is a studentized residual?
The residual divided by an estimate of its standard deviation; a form of Student's t-statistic in which the error estimate varies between points
t = e / s, where e is the residual and s is the estimate of its standard deviation
What is Cook’s distance?
A measure of how much the estimated values of the regression change if observation i is deleted from the sample
What does it say about the observation if Cook’s Distance (D) is > 0.5
May be influential and merits further investigation
What does it say about the observation if Cook’s Distance (D) is > 1.0
Highly likely to be an influential data point
What does it say about the observation if Cook’s Distance (D) is > 2 x (k/n)^0.5?
Highly likely to be an influential data point
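The three Cook's distance rules above can be combined in a small helper. This sketch reads the denominator in the third rule as the number of observations n (an assumption), with k the number of independent variables. The verdict labels and inputs are illustrative:

```python
import math

def cooks_d_verdict(d, k, n):
    """Classify an observation by its Cook's distance d."""
    if d > 1.0 or d > 2 * math.sqrt(k / n):
        return "highly likely influential"
    if d > 0.5:
        return "may be influential; investigate"
    return "not flagged"

large_sample = cooks_d_verdict(0.4, k=3, n=100)   # cutoff 2*sqrt(0.03) is about 0.35
small_sample = cooks_d_verdict(0.7, k=4, n=10)    # cutoff 2*sqrt(0.4) is about 1.26
quiet_point = cooks_d_verdict(0.2, k=3, n=100)
```

Note that the sample-size rule can be stricter than the fixed 0.5 and 1.0 thresholds in large samples and looser in small ones.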
Does the measure of Leverage apply to Dependent or Independent variables?
Independent