Model Assumptions:
The error term ($$\epsilon_i$$) in a regression model $$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \epsilon_i$$ should satisfy several assumptions:
Mean of zero: $$E(\epsilon_i) = 0$$.
Constant variance (homoscedasticity): $$Var(\epsilon_i) = \sigma^2$$ for all $$i$$.
No correlation between error terms.
Normally distributed error terms.
Numerical explanatory variables should not exhibit perfect multicollinearity.
Why Check Model Assumptions?
To ensure that least squares estimators are reliable.
To understand the consequences of violating assumptions.
To detect any assumption violations.
To modify the model to satisfy the assumptions.
Diagnostics for Assumptions on the Errors:
Diagnostics are based on residuals.
Visual inspection of graphs of standardized residuals.
If no assumption is violated, the graphs show no systematic pattern.
Homoscedasticity: The variance of the error term is constant across all levels of the independent variables: $$Var(\epsilon_i) = \sigma^2$$.
Consequences of heteroscedasticity:
Increase in the standard errors of the coefficient estimates.
Low t-values.
Potentially leading to the incorrect conclusion that explanatory variables do not make a significant contribution.
Under homoscedasticity, standardized residuals and standardized predictions are not correlated.
Check: Plot standardized residuals against standardized predictions.
Interpretation: No pattern implies no violation of homoscedasticity.
Specific patterns in the plot of standardized residuals versus standardized predictions, such as a fan shape, indicate heteroscedasticity.
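A minimal sketch of this residual check in Python with statsmodels; the data file survey.csv, the dependent variable NArtzero, and the covariate names are assumptions used only for illustration:

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Hypothetical data: a file with NArtzero and some covariates is assumed for illustration.
df = pd.read_csv("survey.csv")
X = sm.add_constant(df[["Age", "Education", "Income"]])
model = sm.OLS(df["NArtzero"], X).fit()

# Standardize residuals and fitted values, then plot them against each other.
std_resid = (model.resid - model.resid.mean()) / model.resid.std()
std_pred = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()

plt.scatter(std_pred, std_resid, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predictions")
plt.ylabel("Standardized residuals")
plt.title("Homoscedasticity check: a fan shape signals heteroscedasticity")
plt.show()
```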
Possible remedies for heteroscedasticity:
Log transformation of the dependent variable.
Log transformation of one or more explanatory variables.
Regress the log of the dependent variable (e.g., log of NArtzero) against the same explanatory variables.
Pay attention to the presence of 0 values in the dependent variable, which may require special handling before applying the log transformation.
The coefficient of a numerical independent variable represents the average change in the logarithm of the original dependent variable associated with a unit increase in the covariate, controlling for all other covariates.
Multiplied by 100, the coefficient can be interpreted approximately as the average percentage change in the original variable associated with a unit increase in the covariate, when all other covariates are kept constant.
For example, an increase in age is associated with a 0.2% average change in the number of visits to an art museum, controlling for other variables.
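A possible implementation of the log-transformed regression in Python; adding 1 before taking the log is just one common way to handle the zeros in NArtzero, and the file and column names are again illustrative assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("survey.csv")  # hypothetical file and column names
X = sm.add_constant(df[["Age", "Education", "Income"]])

# log(NArtzero + 1) is one simple way to handle the zero counts before the log transform.
log_model = sm.OLS(np.log(df["NArtzero"] + 1), X).fit()

# A coefficient b on Age means one extra year of age is associated with roughly
# a 100*b percent change in the dependent variable; exp(b) - 1 gives the exact factor.
b_age = log_model.params["Age"]
print(f"Approximate % change per extra year of age: {100 * b_age:.2f}%")
print(f"Exact % change per extra year of age: {100 * (np.exp(b_age) - 1):.2f}%")
```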
Multicollinearity:
Definition:
Numerical explanatory variables should not have perfect correlation.
There should be no overlap of information or redundancy among the explanatory variables.
Standard errors of the coefficient estimates become large, leading to:
Small t-statistic values.
Coefficient signs may not match prior expectations.
Diagnostics for multicollinearity:
Tolerance
Variance Inflation Factor (VIF)
Partial correlations
Computation:
Regress the considered explanatory variable against all other independent variables.
Compute the R-squared ($$R^2$$) of the regression.
Calculate Tolerance as $$Tolerance = 1 - R^2$$.
Interpretation:
Tolerance represents the proportion of the explanatory variable's variability that is not explained by all other variables.
A greater tolerance indicates a lower inter-correlation.
Computation:
$$VIF = \frac{1}{Tolerance}$$
Interpretation:
VIF measures the amount by which the variance of an explanatory variable's coefficient is increased due to collinearity.
A VIF greater than 10 suggests severe multicollinearity.
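The auxiliary-regression computation of Tolerance and VIF described above can be sketched as follows (column names are hypothetical):

```python
import pandas as pd
import statsmodels.api as sm

def tolerance_and_vif(covariates: pd.DataFrame, target: str) -> tuple[float, float]:
    """Regress `target` on all other columns and return (Tolerance, VIF)."""
    y = covariates[target]
    X = sm.add_constant(covariates.drop(columns=[target]))
    r2 = sm.OLS(y, X).fit().rsquared
    tolerance = 1 - r2      # share of the target's variance not explained by the others
    return tolerance, 1 / tolerance

# Hypothetical covariates; a VIF above 10 would signal severe multicollinearity.
df = pd.read_csv("survey.csv")
tol, vif = tolerance_and_vif(df[["Education", "FatherEducation", "MotherEducation", "Income"]],
                             "MotherEducation")
print(f"Tolerance = {tol:.3f}, VIF = {vif:.3f}")
```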
Definition: Pearson’s correlation between the dependent variable and the considered explanatory variable, controlling for all other explanatory variables.
Computation:
Regress the dependent variable on all covariates, excluding the considered one.
Regress the considered explanatory variable on all other covariates.
Calculate the residuals of the two regressions.
Partial Correlation: The Pearson’s correlation between the residuals of the two regressions.
Interpretation:
A high partial correlation (in absolute value) indicates a high correlation between the dependent variable and the independent variable, controlling for all other independent variables.
Compare the values of the Pearson’s correlation and the corresponding partial correlation to understand the effect of controlling for other variables.
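A sketch of the residual-based computation of the partial correlation (the variable names are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def partial_correlation(df: pd.DataFrame, y: str, x: str, controls: list[str]) -> float:
    """Pearson correlation between y and x after removing the linear effect of the controls."""
    Z = sm.add_constant(df[controls])
    resid_y = sm.OLS(df[y], Z).fit().resid   # y purged of the other covariates
    resid_x = sm.OLS(df[x], Z).fit().resid   # x purged of the other covariates
    return np.corrcoef(resid_y, resid_x)[0, 1]

# Hypothetical example: partial correlation of NArtzero with Father Education,
# controlling for the remaining covariates.
df = pd.read_csv("survey.csv")
r_partial = partial_correlation(df, "NArtzero", "FatherEducation", ["Age", "Education", "Income"])
print(f"Partial correlation: {r_partial:.3f}")
```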
Remedies for multicollinearity:
Remove one (or more than one) of the collinear covariates (e.g., remove Mother Education).
Compute a new variable from the ones affected by collinearity, and replace the multicollinear variables with this new variable (see the sketch after this list).
Transform the numerical covariates into categorical variables by grouping the values.
Continue using the same model, keeping in mind that t-tests might be distorted.
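A minimal sketch of the composite-variable remedy, assuming parental education is measured by two hypothetical columns:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical file and column names

# One possible composite: replace the collinear parental-education variables with
# their average, then use ParentEducation in place of both in the regression model.
df["ParentEducation"] = df[["FatherEducation", "MotherEducation"]].mean(axis=1)
df = df.drop(columns=["FatherEducation", "MotherEducation"])
```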
Linear Effect of a Numerical Independent Variable:
In the standard regression model, $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \epsilon$$, numerical variables are entered without mathematical transformations.
This assumes a linear relationship between the dependent variable and each numerical independent variable, with a constant effect of a unit increase of a numerical independent variable.
To model non-linear effects, add polynomial terms in the considered variable (e.g., a quadratic term): $$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$$
To check for a quadratic effect of a variable (e.g., Family Income), include as an extra explanatory variable the square of the variable (Income²) and check for its significance.
Marginal Impact of Family Income
The marginal impact is obtained by taking the partial derivative of the model with respect to Family Income. With both a linear and a quadratic term, $$E(Y) = \beta_0 + \beta_1 Income + \beta_2 Income^2 + \dots$$, the marginal impact is $$\frac{\partial E(Y)}{\partial Income} = \beta_1 + 2\beta_2 \cdot Income$$, so the effect of one extra unit of income depends on the current level of income.
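A sketch of fitting the quadratic specification and evaluating the marginal impact, under the same hypothetical data assumptions as before:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("survey.csv")       # hypothetical file and column names
df["Income2"] = df["Income"] ** 2    # quadratic term for Family Income

X = sm.add_constant(df[["Age", "Education", "Income", "Income2"]])
model = sm.OLS(df["NArtzero"], X).fit()
print(model.summary())               # a significant Income2 coefficient signals a non-linear effect

# Marginal impact of income evaluated at a given income level: b1 + 2*b2*Income.
b1, b2 = model.params["Income"], model.params["Income2"]
income_level = df["Income"].median()
print(f"Marginal impact at the median income: {b1 + 2 * b2 * income_level:.4f}")
```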
Constant Effect: The standard regression model assumes that the marginal contribution of an independent variable does not depend on the value of any other variable.
Moderation Effect: A covariate $$X_1$$ has a moderation effect on the relationship between another covariate, $$X_2$$, and the dependent variable if $$X_1$$ affects the size of the marginal effect of $$X_2$$ on the dependent variable. This indicates an interaction between $$X_1$$ and $$X_2$$.
Consider whether an additional level of education has the same impact on the number of visits to an art museum for those who attended classes in visual art and those who did not attend such classes, controlling for all other variables.
Add as an extra explanatory variable the product of the two variables (e.g., ClassvisualXEducation).
Check for the significance of the interaction term.
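A possible implementation of the interaction check using the statsmodels formula interface (variable names such as Classvisual are taken from the example but their exact coding is assumed):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey.csv")  # hypothetical file and column names

# Classvisual * Education expands to both main effects plus their product (the interaction term).
model = smf.ols("NArtzero ~ Age + Income + Classvisual * Education", data=df).fit()

# A significant coefficient on the Classvisual:Education term indicates that the effect of an
# extra level of education differs between those who did and did not attend visual art classes.
print(model.summary())
```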
Each explanatory variable has a direct effect on the dependent variable, measured by its slope.
The mediation effect explores whether the effect on the dependent variable of one independent variable is mediated by another independent variable.
Mediation Effect as Causal Effect: Variable $$X_2$$ is a mediator in the relationship between $$X_1$$ and Y if $$X_1$$ causes $$X_2$$ and $$X_2$$ causes Y.
Path analysis is used to sketch causal relationships.
Regress the dependent variable on all variables but the mediator.
Regress the mediator on the variable with an indirect effect.
Regress the dependent variable on all variables.
Step 1: Regress NArtzero on Father Education to estimate the total effect ($$\beta_{f,total}$$).
Step 2: Regress Personal Education on Father Education to estimate the effect on the mediator ($$\beta_f$$).
Step 3: Regress NArtzero on Father and Personal Education to estimate the direct effect of Father Education ($$\beta_{f,direct}$$) and the direct effect of Personal Education ($$\beta_{p,direct}$$).
The total effect can be decomposed into the direct effect and the mediated effect: $$\beta_{f,total} = \beta_{f,direct} + \beta_{f,mediated}$$
The mediated effect is calculated as the product of the coefficients for each leg of the path: $$\beta_{f,mediated} = \beta_f \cdot \beta_p$$
The Sobel test is used to test the significance of the mediation effect. The null hypothesis is that the mediated effect is zero: $$H_0: \beta_{f,mediated} = \beta_f \cdot \beta_p = 0$$
The test statistic can be computed as $$z = \frac{\beta_f \cdot \beta_p}{\sqrt{\beta_p^2 \, SE(\beta_f)^2 + \beta_f^2 \, SE(\beta_p)^2}}$$, where $$SE(\beta_f)$$ and $$SE(\beta_p)$$ are the standard errors of the estimated coefficients.
If the test statistic value is large enough, it indicates a significant mediation effect of Personal Education on the relationship between Father Education and NArtzero.
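A sketch of the three regressions and the Sobel test in Python; FatherEducation and Education (personal education) are assumed column names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey.csv")  # hypothetical file and column names

# Step 1: total effect of Father Education on NArtzero.
total = smf.ols("NArtzero ~ FatherEducation", data=df).fit()

# Step 2: effect of Father Education on the mediator (personal education).
to_mediator = smf.ols("Education ~ FatherEducation", data=df).fit()

# Step 3: direct effects of Father Education and Personal Education on NArtzero.
direct = smf.ols("NArtzero ~ FatherEducation + Education", data=df).fit()

b_f, se_f = to_mediator.params["FatherEducation"], to_mediator.bse["FatherEducation"]
b_p, se_p = direct.params["Education"], direct.bse["Education"]

# Mediated effect and Sobel z statistic.
mediated = b_f * b_p
sobel_z = mediated / np.sqrt(b_p**2 * se_f**2 + b_f**2 * se_p**2)
print(f"Total effect:    {total.params['FatherEducation']:.4f}")
print(f"Direct effect:   {direct.params['FatherEducation']:.4f}")
print(f"Mediated effect: {mediated:.4f}  (Sobel z = {sobel_z:.2f})")
```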