Quantitative Methods in Cultural Industries
Model Diagnostics: Homoscedasticity Assumption
Model Assumptions:
The error term (\epsilon_i) in a regression model y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + … + \epsilon_i should satisfy several assumptions:
Mean of zero: E(\epsilon_i) = 0.
Constant variance (homoscedasticity): Var(\epsilon_i) = \sigma^2 for all i.
No correlation between error terms.
Normally distributed error terms.
Numerical explanatory variables should not exhibit perfect multicollinearity.
Why Check Model Assumptions?
To ensure that least squares estimators are reliable.
To understand the consequences of violating assumptions.
To detect any assumption violations.
To modify the model to satisfy the assumptions.
Diagnostics for Assumptions on the Errors:
Diagnostics are based on residuals.
Visual inspection of graphs of standardized residuals.
No violation implies no pattern in the graphs.
Homoscedasticity Assumption
Homoscedasticity: The variance of the error term is constant across all levels of the independent variables: Var(\epsilon_i) = \sigma^2.
Consequences of Heteroscedasticity
Increase in the standard error of coefficient estimates.
Low t-values.
Potentially leading to the incorrect conclusion that explanatory variables do not have a significant contribution.
Heteroscedasticity Detection
Under homoscedasticity, standardized residuals and standardized predictions are not correlated.
Check: Plot standardized residuals against standardized predictions.
Interpretation: No pattern implies no violation of homoscedasticity.
Patterns Revealing Heteroscedasticity
Specific patterns in the plot of standardized residuals versus standardized predictions, such as a fan shape, indicate heteroscedasticity.
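As an illustration, here is a minimal Python sketch of this check (statsmodels, scipy, and matplotlib; the data are synthetic and deliberately heteroscedastic):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf
    from scipy.stats import zscore

    # Synthetic data whose error spread grows with x (heteroscedastic).
    rng = np.random.default_rng(0)
    x = rng.uniform(1, 10, 200)
    y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)
    df = pd.DataFrame({"y": y, "x": x})

    model = smf.ols("y ~ x", data=df).fit()
    std_resid = zscore(model.resid)
    std_fitted = zscore(model.fittedvalues)

    # A fan shape (spread widening with predictions) signals heteroscedasticity.
    plt.scatter(std_fitted, std_resid, alpha=0.6)
    plt.axhline(0, linestyle="--", color="grey")
    plt.xlabel("Standardized predictions")
    plt.ylabel("Standardized residuals")
    plt.show()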
Remedies for Heteroscedasticity
Log transformation of the dependent variable.
Log transformation of one or more explanatory variables.
Model 2: Log Transformation Example
Regress the log of the dependent variable (e.g., log of NArtzero) against the same explanatory variables.
Pay attention to the presence of 0 values in the dependent variable, which may require special handling before applying the log transformation.
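A minimal sketch of the transformation step, assuming the data sit in a pandas DataFrame; log(y + 1) is one common workaround for zeros, though not the only one:

    import numpy as np
    import pandas as pd

    # Toy dependent variable with zeros, standing in for NArtzero.
    df = pd.DataFrame({"NArtzero": [0, 2, 5, 0, 12]})

    # log(0) is undefined, so use log(y + 1) via np.log1p; alternatives
    # include dropping the zeros or switching to a count model.
    df["logNArt"] = np.log1p(df["NArtzero"])
    print(df)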
Interpretation of Coefficients with Log Transformation
The coefficient of a numerical independent variable represents the average change in the logarithm of the original dependent variable associated with a unit increase in the covariate, controlling for all other covariates.
Multiplied by 100, this coefficient approximates the average percentage change in the original dependent variable associated with a unit increase in the covariate, when all other covariates are kept constant.
For example, a one-year increase in age is associated with an average change of about 0.2% in the number of visits to an art museum, controlling for other variables.
Model Diagnostics: No Perfect Collinearity
No Perfect Multicollinearity
Definition:
Numerical explanatory variables should not have perfect correlation.
There should be no overlap of information or redundancy among the explanatory variables.
Consequences of Collinearity
Standard errors of the coefficient estimates become large, leading to:
Small t-statistic values.
Coefficient signs may not match prior expectations.
Diagnostics for Multicollinearity
Tolerance
Variance Inflation Factor (VIF)
Partial correlations
Tolerance
Computation:
Regress the considered explanatory variable against all other independent variables.
Compute the R-squared (R^2) of the regression.
Calculate Tolerance as Tolerance = 1 - R^2
Interpretation:
Tolerance represents the proportion of the explanatory variable's variability that is not explained by all other variables.
A greater tolerance indicates a lower inter-correlation.
Variance Inflation Factor (VIF)
Computation:
VIF = \frac{1}{Tolerance}
Interpretation:
VIF measures the amount by which the variance of an explanatory variable's coefficient is increased due to collinearity.
A VIF greater than 10 suggests severe multicollinearity.
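A short sketch computing both Tolerance and VIF with statsmodels (the data are synthetic, with x2 built to be nearly collinear with x1):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Synthetic covariates; x2 is almost a copy of x1.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.1, size=200)
    x3 = rng.normal(size=200)
    X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

    # VIF per covariate (skipping the constant); Tolerance = 1 / VIF.
    for i, name in enumerate(X.columns):
        if name == "const":
            continue
        vif = variance_inflation_factor(X.values, i)
        print(f"{name}: VIF = {vif:.1f}, Tolerance = {1 / vif:.3f}")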
Partial Correlations
Definition: Pearson’s correlation between the dependent variable and the considered explanatory variable, controlling for all other explanatory variables.
Computation:
Regress the dependent variable on all covariates, excluding the considered one.
Regress the considered explanatory variable on all other covariates.
Calculate the residuals of the two regressions.
Partial Correlation: The Pearson’s correlation between the residuals of the two regressions.
Interpretation:
A high partial correlation (in absolute value) indicates a high correlation between the dependent variable and the independent variable, controlling for all other independent variables.
Compare the values of the Pearson’s correlation and the corresponding partial correlation to understand the effect of controlling for other variables.
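A sketch of the residual-based computation (data and names are synthetic):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    x_other = rng.normal(size=(n, 2))                  # all other covariates
    x_focus = x_other @ [0.5, -0.3] + rng.normal(size=n)
    y = 1.0 + 0.8 * x_focus + x_other @ [1.0, 0.5] + rng.normal(size=n)

    Z = sm.add_constant(x_other)

    # Residualize both y and the focal covariate on all other covariates.
    res_y = sm.OLS(y, Z).fit().resid
    res_x = sm.OLS(x_focus, Z).fit().resid

    # Partial correlation = Pearson correlation of the two residual series.
    print(f"Partial correlation: {np.corrcoef(res_y, res_x)[0, 1]:.3f}")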
Remedies for Multicollinearity
Remove one or more of the collinear covariates (e.g., remove Mother Education).
Compute a new variable from the ones affected by collinearity and replace the multicollinear variables with it (see the sketch after this list).
Transform the numerical covariates into categorical variables by grouping the values.
Continue using the same model, keeping in mind that t-tests might be distorted.
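For instance, a minimal sketch of the second remedy, replacing two collinear education variables with their average (the column names are hypothetical):

    import pandas as pd

    # Hypothetical collinear covariates: parents' education in years.
    df = pd.DataFrame({"FatherEdu": [12, 16, 8], "MotherEdu": [11, 16, 10]})

    # Replace the pair with a single composite variable.
    df["ParentsEdu"] = df[["FatherEdu", "MotherEdu"]].mean(axis=1)
    df = df.drop(columns=["FatherEdu", "MotherEdu"])
    print(df)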
Advanced Regression Techniques
Linearity Assumption
Linear Effect of a Numerical Independent Variable:
In the standard regression model, Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + … + \epsilon, numerical variables are entered without mathematical transformations.
This assumes a linear relationship between the dependent variable and each numerical independent variable, with a constant effect of a unit increase of a numerical independent variable.
Non-Linear Effects
To model non-linear effects, add polynomial terms in the considered variable (e.g., a quadratic term): Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon
Quadratic Effect
To check for a quadratic effect of a variable (e.g., Family Income), include the square of the variable (e.g., Income^2) as an extra explanatory variable and check for its significance.
Marginal Impact of Family Income
The marginal impact is obtained by taking the partial derivative of the expected value with respect to Family Income: \partial E(Y) / \partial Income = \beta_1 + 2 \beta_2 \cdot Income. Because of the quadratic term, the effect of a unit increase depends on the income level.
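A sketch of the quadratic specification and its marginal impact (variable names and data are assumptions, not the course dataset):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic hump-shaped relationship between income and visits.
    rng = np.random.default_rng(0)
    income = rng.uniform(10, 100, 300)
    visits = 1 + 0.20 * income - 0.0015 * income**2 + rng.normal(0, 1, 300)
    df = pd.DataFrame({"visits": visits, "income": income})

    # I(income**2) adds the squared term; check its t-test for significance.
    model = smf.ols("visits ~ income + I(income**2)", data=df).fit()
    b1 = model.params["income"]
    b2 = model.params.iloc[-1]   # coefficient on the squared term

    # Marginal impact beta_1 + 2 * beta_2 * income, evaluated at the mean.
    print("Marginal impact at mean income:", b1 + 2 * b2 * income.mean())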
Moderation
Constant Effect: The standard regression model assumes that the marginal contribution of an independent variable does not depend on the value of any other variable.
Moderation Effect: A covariate X_1 has a moderation effect on the relationship between another covariate, X_2, and the dependent variable if X_1 affects the size of the marginal effect of X_2 on the dependent variable. This indicates an interaction between X_1 and X_2.
Consider whether an additional level of education has the same impact on the number of visits to an art museum for those who attended classes in visual art as for those who did not, controlling for all other variables.
Checking for Interaction
Add as an extra explanatory variable the product of the two variables (e.g., ClassvisualXEducation).
Check for the significance of the interaction term.
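A sketch of the interaction check (variable names stand in for Classvisual and Education; the data are synthetic):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic data: education raises visits more for those who attended
    # a visual-art class (classvis = 1), i.e. a moderation effect.
    rng = np.random.default_rng(0)
    n = 300
    educ = rng.integers(8, 21, n)
    classvis = rng.integers(0, 2, n)
    visits = (0.5 + 0.1 * educ + 0.3 * classvis
              + 0.15 * educ * classvis + rng.normal(0, 1, n))
    df = pd.DataFrame({"visits": visits, "educ": educ, "classvis": classvis})

    # educ:classvis is the product term; its t-test checks the interaction.
    model = smf.ols("visits ~ educ + classvis + educ:classvis", data=df).fit()
    print(model.summary().tables[1])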
Mediation Effect
Direct Effect
Each explanatory variable has a direct effect on the dependent variable, measured by its slope.
Mediation Effect
The mediation effect explores whether the effect on the dependent variable of one independent variable is mediated by another independent variable.
Mediation Effect as Causal Effect: Variable X_2 is a mediator in the relationship between X_1 and Y if X_1 causes X_2 and X_2 causes Y.
Path Analysis
Path analysis is used to sketch causal relationships.
Sequence of Regression Models
Regress the dependent variable on all variables but the mediator.
Regress the mediator on the variable with an indirect effect.
Regress the dependent variable on all variables.
Steps in Path Analysis
Step 1: Regress NArtzero on Father Education to estimate the total effect (\beta_{f,total}).
Step 2: Regress Personal Education on Father Education to estimate the effect on the mediator (\beta_f).
Step 3: Regress NArtzero on Father and Personal Education to estimate the direct effect of Father Education (\beta_{f,direct}) and the direct effect of Personal Education (\beta_{p,direct}).
Decomposition of the Total Effect
The total effect can be decomposed into the direct effect and the mediated effect: \beta_{f,total} = \beta_{f,direct} + \beta_{f,mediated}
The mediated effect is calculated as the product of the coefficients for each leg of the path: \beta_{f,mediated} = \beta_f \cdot \beta_p
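The three steps and the decomposition in a Python sketch (synthetic data; FatherEdu, PersonalEdu, and NArt stand in for the source's variables):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic data following the assumed causal chain
    # FatherEdu -> PersonalEdu -> NArt, plus a direct FatherEdu -> NArt path.
    rng = np.random.default_rng(0)
    n = 500
    father = rng.normal(12, 3, n)
    personal = 2 + 0.6 * father + rng.normal(0, 1, n)
    nart = 0.5 + 0.2 * father + 0.4 * personal + rng.normal(0, 1, n)
    df = pd.DataFrame({"FatherEdu": father, "PersonalEdu": personal,
                       "NArt": nart})

    total = smf.ols("NArt ~ FatherEdu", data=df).fit()                 # step 1
    mediator = smf.ols("PersonalEdu ~ FatherEdu", data=df).fit()       # step 2
    direct = smf.ols("NArt ~ FatherEdu + PersonalEdu", data=df).fit()  # step 3

    b_total = total.params["FatherEdu"]
    b_f = mediator.params["FatherEdu"]       # effect on the mediator
    b_direct = direct.params["FatherEdu"]    # direct effect of FatherEdu
    b_p = direct.params["PersonalEdu"]       # direct effect of the mediator

    # Check the decomposition: total = direct + mediated (= b_f * b_p).
    print(f"total {b_total:.3f} = direct {b_direct:.3f}"
          f" + mediated {b_f * b_p:.3f}")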
Sobel Test
The Sobel test is used to test the significance of the mediation effect. The null hypothesis is that the mediated effect is zero: H_0: \beta_{f,mediated} = \beta_f \cdot \beta_p = 0
The test statistic can be computed as z = \frac{\beta_f \cdot \beta_p}{\sqrt{\beta_p^2 \cdot SE(\beta_f)^2 + \beta_f^2 \cdot SE(\beta_p)^2}}, which is approximately standard normal under the null hypothesis.
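Continuing the path-analysis sketch above, the Sobel statistic can be computed from the fitted models:

    import numpy as np

    # Coefficients and standard errors from the models fitted above.
    b_f, se_f = mediator.params["FatherEdu"], mediator.bse["FatherEdu"]
    b_p, se_p = direct.params["PersonalEdu"], direct.bse["PersonalEdu"]

    # Sobel z; |z| > 1.96 rejects H0 of no mediated effect at the 5% level.
    z = (b_f * b_p) / np.sqrt(b_p**2 * se_f**2 + b_f**2 * se_p**2)
    print(f"Sobel z = {z:.2f}")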
Personal Education as a Significant Mediator
If the test statistic is large enough in absolute value (e.g., |z| > 1.96 at the 5% level), it indicates a significant mediation effect of Personal Education on the relationship between Father Education and NArtzero.