Quantitative Methods in Cultural Industries

Model Diagnostics: Homoscedasticity Assumption

  • Model Assumptions:

    • The error term ($$\epsilon_i$$) in a regression model $$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \epsilon_i$$ should satisfy several assumptions:

      • Mean of zero: $$E(\epsilon_i) = 0$$.

      • Constant variance (homoscedasticity): $$Var(\epsilon_i) = \sigma^2$$ for all $$i$$.

      • No correlation between error terms.

      • Normally distributed error terms.

    • Numerical explanatory variables should not exhibit perfect multicollinearity.

  • Why Check Model Assumptions?

    • To ensure that least squares estimators are reliable.

    • To understand the consequences of violating assumptions.

    • To detect any assumption violations.

    • To modify the model to satisfy the assumptions.

  • Diagnostics for Assumptions on the Errors:

    • Diagnostics are based on residuals.

    • Visual inspection of graphs of standardized residuals.

    • If no assumption is violated, these graphs should show no systematic pattern.

Homoscedasticity Assumption

  • Homoscedasticity: The variance of the error term is constant across all levels of the independent variables: $$Var(\epsilon_i) = \sigma^2$$.

Consequences of Heteroscedasticity

  • Increase in the standard error of coefficient estimates.

  • Low t-values.

  • Potentially leading to the incorrect conclusion that explanatory variables do not have a significant contribution.

Heteroscedasticity Detection

  • Under homoscedasticity, standardized residuals and standardized predictions are not correlated.

  • Check: Plot standardized residuals against standardized predictions.

  • Interpretation: No pattern implies no violation of homoscedasticity.

Patterns Revealing Heteroscedasticity

  • Specific patterns in the plot of standardized residuals versus standardized predictions, such as a fan shape, indicate heteroscedasticity.
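
  • A minimal sketch of this check in Python (statsmodels and matplotlib on synthetic data; the course dataset and variable names are not reproduced here):

        import numpy as np
        import statsmodels.api as sm
        import matplotlib.pyplot as plt

        # Synthetic data with deliberately heteroscedastic errors:
        # the error spread grows with x, which should produce a fan shape.
        rng = np.random.default_rng(0)
        n = 200
        x = rng.uniform(0, 10, n)
        y = 2 + 0.5 * x + rng.normal(0, 0.3 * x, n)

        model = sm.OLS(y, sm.add_constant(x)).fit()

        # Standardize residuals and predictions (z-scores).
        resid_std = (model.resid - model.resid.mean()) / model.resid.std()
        pred_std = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()

        plt.scatter(pred_std, resid_std, alpha=0.5)
        plt.axhline(0, linestyle="--", color="gray")
        plt.xlabel("Standardized predictions")
        plt.ylabel("Standardized residuals")
        plt.title("Residual plot: a fan shape suggests heteroscedasticity")
        plt.show()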

Remedies for Heteroscedasticity

  • Log transformation of the dependent variable.

  • Log transformation of one or more explanatory variables.

Model 2: Log Transformation Example
  • Regress the log of the dependent variable (e.g., log of NArtzero) against the same explanatory variables.

  • Pay attention to the presence of 0 values in the dependent variable, which may require special handling before applying the log transformation.
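
  • A small sketch of the zero-handling issue in Python (the $$\log(1+y)$$ transform shown here is a common workaround, an assumption rather than necessarily the course's choice):

        import numpy as np

        # NArtzero contains zeros, so log(0) would be undefined.
        y = np.array([0, 2, 5, 0, 12], dtype=float)
        log_y = np.log1p(y)   # log(1 + y); equals 0 exactly where y == 0
        print(log_y)          # approx [0., 1.099, 1.792, 0., 2.565]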

Interpretation of Coefficients with Log Transformation
  • The coefficient of a numerical independent variable represents the average change in the logarithm of the original dependent variable associated with a unit increase in the covariate, controlling for all other covariates.

  • Multiplied by 100, this can be read approximately as the average percentage change in the original variable associated with a unit increase in the covariate, when all other covariates are kept constant.

    • For example, a one-unit increase in age is associated with an average change of about 0.2% in the number of visits to an art museum, controlling for other variables.
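
    • Why the percentage reading works (standard log-linear algebra, added here for completeness): a unit increase in the covariate changes $$\log y$$ by $$\beta$$, so $$y_{new}/y_{old} = e^{\beta} \approx 1 + \beta$$ for small $$\beta$$; a coefficient of 0.002 therefore corresponds to roughly a 0.2% average increase.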

Model Diagnostics: No Perfect Collinearity

No Perfect Multicollinearity

  • Definition:

    • Numerical explanatory variables should not be perfectly correlated; no explanatory variable should be an exact linear combination of the others.

    • There should be no overlap of information or redundancy among the explanatory variables.

Consequences of Collinearity

  • Standard errors of the coefficient estimates become large, leading to:

    • Small t-statistic values.

    • Coefficient signs may not match prior expectations.

Diagnostics for Multicollinearity

  • Tolerance

  • Variance Inflation Factor (VIF)

  • Partial correlations

Tolerance
  • Computation:

    1. Regress the considered explanatory variable against all other independent variables.

    2. Compute the R-squared ($$R^2$$) of that regression.

    3. Calculate tolerance as $$Tolerance = 1 - R^2$$.

  • Interpretation:

    • Tolerance represents the proportion of the explanatory variable's variability that is not explained by all other variables.

    • A greater tolerance indicates lower inter-correlation with the other explanatory variables.

Variance Inflation Factor (VIF)
  • Computation:
    $$VIF = \frac{1}{Tolerance}$$

  • Interpretation:

    • VIF measures the factor by which the variance of a coefficient estimate is inflated due to collinearity with the other explanatory variables.

    • A VIF greater than 10 suggests severe multicollinearity.
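
  • Both diagnostics can be computed directly; a sketch in Python (statsmodels, synthetic data in which x3 is built to be nearly collinear with x1 and x2):

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        # x3 is almost a linear combination of x1 and x2 (near collinearity).
        rng = np.random.default_rng(1)
        n = 300
        x1 = rng.normal(size=n)
        x2 = rng.normal(size=n)
        x3 = x1 + x2 + rng.normal(scale=0.05, size=n)
        X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

        # Tolerance for x3: regress x3 on the other covariates, then 1 - R^2.
        aux = sm.OLS(X["x3"], sm.add_constant(X[["x1", "x2"]])).fit()
        tolerance = 1 - aux.rsquared
        print("Tolerance:", tolerance, "-> VIF:", 1 / tolerance)

        # The same VIF from statsmodels' helper (needs the constant included).
        Xc = sm.add_constant(X)
        print("VIF x3:", variance_inflation_factor(Xc.values, list(Xc.columns).index("x3")))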

Partial Correlations
  • Definition: Pearson’s correlation between the dependent variable and the considered explanatory variable, controlling for all other explanatory variables.

  • Computation:

    1. Regress the dependent variable on all covariates, excluding the considered one.

    2. Regress the considered explanatory variable on all other covariates.

    3. Calculate the residuals of the two regressions.

  • Partial Correlation: The Pearson’s correlation between the residuals of the two regressions.

  • Interpretation:

    • A high partial correlation (in absolute value) indicates a high correlation between the dependent variable and the independent variable, controlling for all other independent variables.

    • Compare the values of the Pearson’s correlation and the corresponding partial correlation to understand the effect of controlling for other variables.
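
  • The residual-based computation above is short enough to show in full; a sketch in Python (statsmodels, synthetic data with two correlated covariates):

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(2)
        n = 500
        x1 = rng.normal(size=n)
        x2 = 0.6 * x1 + rng.normal(size=n)
        y = 1 + 0.5 * x1 + 0.8 * x2 + rng.normal(size=n)

        # 1. Regress y on all covariates except x1 and keep the residuals.
        r_y = sm.OLS(y, sm.add_constant(x2)).fit().resid
        # 2. Regress x1 on all the other covariates and keep the residuals.
        r_x1 = sm.OLS(x1, sm.add_constant(x2)).fit().resid
        # 3. The partial correlation is the Pearson correlation of the residuals.
        partial_r = np.corrcoef(r_y, r_x1)[0, 1]
        print("Partial correlation of y and x1, controlling for x2:", round(partial_r, 3))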

Remedies for Multicollinearity

  • Remove one (or more than one) of the collinear covariates (e.g., remove Mother Education).

  • Compute a new variable from the ones affected by collinearity, and replace the multicollinear variables with this new variable.

  • Transform the numerical covariates into categorical variables by grouping the values.

  • Continue using the same model, keeping in mind that t-tests might be distorted.

Advanced Regression Techniques

Linearity Assumption

  • Linear Effect of a Numerical Independent Variable:

    • In the standard regression model $$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots$$, numerical variables are entered without mathematical transformations.

    • This assumes a linear relationship between the dependent variable and each numerical independent variable, with a constant effect of a unit increase of a numerical independent variable.

Non-Linear Effects
  • To model non-linear effects, add polynomial terms in the considered variable (e.g., a quadratic term): $$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$$

Quadratic Effect
  • To check for a quadratic effect of a variable (e.g., Family Income), include the square of the variable ($$Income^2$$) as an extra explanatory variable and check for its significance.

Marginal Impact of Family Income

  • The marginal impact is obtained by taking the partial derivative of the model with respect to Family Income. With a linear and a quadratic term, $$\partial E(Y) / \partial Income = \beta_1 + 2 \beta_2 \cdot Income$$, so the effect of one extra unit of income depends on the current income level.
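
  • A sketch of fitting a quadratic term and evaluating the marginal effect in Python (statsmodels, synthetic income data; the coefficients below are invented for illustration):

        import numpy as np
        import statsmodels.api as sm

        # Synthetic data with a genuine quadratic effect of income.
        rng = np.random.default_rng(3)
        n = 400
        income = rng.uniform(10, 100, n)
        y = 5 + 0.40 * income - 0.003 * income**2 + rng.normal(size=n)

        X = sm.add_constant(np.column_stack([income, income**2]))
        fit = sm.OLS(y, X).fit()
        b0, b1, b2 = fit.params

        # Marginal impact at a given level: dE(Y)/dIncome = b1 + 2 * b2 * income.
        for level in (20, 50, 80):
            print(f"marginal effect at income={level}: {b1 + 2 * b2 * level:.3f}")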

Moderation
  • Constant Effect: The standard regression model assumes that the marginal contribution of an independent variable does not depend on the value of any other variable.

  • Moderation Effect: A covariate $$X_1$$ has a moderation effect on the relationship between another covariate, $$X_2$$, and the dependent variable if $$X_1$$ affects the size of the marginal effect of $$X_2$$ on the dependent variable. This indicates an interaction between $$X_1$$ and $$X_2$$.

  • Consider whether an additional level of education has the same impact on the number of visits to an art museum for those who attended classes in visual art and those who did not attend such classes, controlling for all other variables.

Checking for Interaction

  • Add the product of the two variables (e.g., ClassvisualXEducation = Classvisual × Education) as an extra explanatory variable.

  • Check for the significance of the interaction term.
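
  • A sketch of the interaction check in Python (statsmodels formula API, synthetic data; the variable names follow the slides' example but the numbers are invented):

        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        # Synthetic data in which Classvisual moderates the effect of Education.
        rng = np.random.default_rng(4)
        n = 400
        df = pd.DataFrame({
            "Education": rng.integers(8, 20, n),
            "Classvisual": rng.integers(0, 2, n),
        })
        df["NArt"] = (0.2 * df["Education"]
                      + 0.5 * df["Classvisual"] * df["Education"]
                      + rng.normal(size=n))

        # The `:` term adds the product Education * Classvisual to the model;
        # its t-test in the output is the significance check described above.
        fit = smf.ols("NArt ~ Education + Classvisual + Education:Classvisual", data=df).fit()
        print(fit.summary().tables[1])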

Mediation Effect

Direct Effect

  • Each explanatory variable has a direct effect on the dependent variable, measured by its slope.

Mediation Effect
  • The mediation effect explores whether the effect on the dependent variable of one independent variable is mediated by another independent variable.

  • Mediation Effect as Causal Effect: Variable $$X_2$$ is a mediator in the relationship between $$X_1$$ and $$Y$$ if $$X_1$$ causes $$X_2$$ and $$X_2$$ causes $$Y$$.

Path Analysis

  • Path analysis is used to sketch causal relationships.

Sequence of Regression Models
  • Regress the dependent variable on all variables but the mediator.

  • Regress the mediator on the variable with an indirect effect.

  • Regress the dependent variable on all variables.

Steps in Path Analysis
  • Step 1: Regress NArtzero on Father Education to estimate the total effect ($$\beta_{f,total}$$).

  • Step 2: Regress Personal Education on Father Education to estimate the effect on the mediator ($$\beta_f$$).

  • Step 3: Regress NArtzero on Father and Personal Education to estimate the direct effect of Father Education ($$\beta_{f,direct}$$) and the direct effect of Personal Education ($$\beta_{p,direct}$$).

Decomposition of the Total Effect
  • The total effect can be decomposed into the direct effect and the mediated effect: $$\beta_{f,total} = \beta_{f,direct} + \beta_{f,mediated}$$

  • The mediated effect is the product of the coefficients along each leg of the path: $$\beta_{f,mediated} = \beta_f \cdot \beta_p$$, where $$\beta_p$$ is the direct effect of Personal Education from Step 3.

Sobel Test
  • The Sobel test is used to test the significance of the mediation effect. The null hypothesis is that the mediated effect is zero: $$H_0: \beta_{f,mediated} = \beta_f \cdot \beta_p = 0$$

  • The test statistic divides the mediated effect by its approximate standard error (the standard Sobel formula): $$z = \frac{\beta_f \cdot \beta_p}{\sqrt{\beta_p^2 \, SE(\beta_f)^2 + \beta_f^2 \, SE(\beta_p)^2}}$$, compared against the standard normal distribution.

Personal Education as a Significant Mediator
  • If the test statistic value is large enough, it indicates a significant mediation effect of Personal Education on the relationship between Father Education and NArtzero.
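
  • The whole path analysis fits in a few lines; a sketch in Python (statsmodels, synthetic data mimicking the slides' Father Education → Personal Education → NArtzero example; all numbers invented):

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(5)
        n = 600
        father_edu = rng.integers(6, 18, n).astype(float)
        personal_edu = 2 + 0.6 * father_edu + rng.normal(size=n)        # mediator
        n_art = 1 + 0.3 * father_edu + 0.5 * personal_edu + rng.normal(size=n)

        # Step 1: total effect of Father Education on NArtzero.
        m1 = sm.OLS(n_art, sm.add_constant(father_edu)).fit()
        beta_total = m1.params[1]

        # Step 2: effect of Father Education on the mediator (beta_f).
        m2 = sm.OLS(personal_edu, sm.add_constant(father_edu)).fit()
        beta_f, se_f = m2.params[1], m2.bse[1]

        # Step 3: direct effects of Father and Personal Education.
        X3 = sm.add_constant(np.column_stack([father_edu, personal_edu]))
        m3 = sm.OLS(n_art, X3).fit()
        beta_direct, beta_p, se_p = m3.params[1], m3.params[2], m3.bse[2]

        # Decomposition: total = direct + mediated (exact for OLS).
        mediated = beta_f * beta_p
        print(f"total {beta_total:.3f} = direct {beta_direct:.3f} + mediated {mediated:.3f}")

        # Sobel z statistic for H0: mediated effect = 0.
        z = mediated / np.sqrt(beta_p**2 * se_f**2 + beta_f**2 * se_p**2)
        print(f"Sobel z = {z:.2f}")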

