Categorical Variables in Regression

Categorical Predictors and Outcomes

Categorical Predictors

  • In previous lectures, the dependent variable was generally continuous, ranging from, say, 1 to 15 or 16.
  • Independent variables could be continuous or dichotomous (e.g., gender coded as 0 or 1).
  • A problem arises when an independent variable has more than two categories. Gender, for example, is increasingly recorded with more than two categories.

Polytomous Predictors

  • A polytomous predictor is a nominal scale of measurement with discrete categories that have no quantitative relationship.
  • For instance, gender categories (male, female, trans) cannot be scaled in a meaningful way, unlike age categories (0-10, 11-20, 21-30), which can be scaled.

Examples of Scalable and Non-Scalable Variables

  • Scalable Variables:
    • Age
    • Income
    • Education level
    • Level of Disapproval (can assign numbers 1, 2, 3, etc., as disapproval increases)
  • Non-Scalable Variables:
    • Ethnicity
    • Religion
    • Nationality

Dealing with Non-Scalable Categorical Variables

  • Collapsing Categories:
    • If possible, collapse categories into two (dichotomous) categories (e.g., Christian or non-Christian).
    • With infectious diseases, data can be classified into infectious and non-infectious (0 and 1).
  • Creating Dummy Variables:
    • If collapsing isn't appropriate, create dummy variables to represent the polytomous variable.
    • If there are K categories, create K - 1 dummy variables.
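The collapsing strategy above can be sketched as a simple recoding function. This is a minimal illustration of the lecture's Christian / non-Christian example; the set of denominations counted as "Christian" is an illustrative assumption, not something specified in the notes.

```python
# Sketch: collapse a polytomous religion variable to a dichotomy (1/0).
# The grouping below is illustrative only.
def collapse_religion(religion):
    """Code 1 for Christian denominations, 0 for everything else."""
    christian = {"Catholic", "Protestant", "Orthodox"}  # assumed grouping
    return 1 if religion in christian else 0
```

The same pattern works for the infectious / non-infectious disease example: any rule that maps each category to 0 or 1 yields a dichotomous predictor usable directly in regression.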

Dummy Variables Example (Religion with 5 Categories)

  • Let K = 5 (number of categories).
  • Create 4 dummy variables: D1, D2, D3, D4.
  • D1: 1 if Catholic, 0 otherwise.
  • D2: 1 if Protestant, 0 otherwise.
  • D3: 1 if Muslim, 0 otherwise.
  • D4: 1 if Other, 0 otherwise.
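The K - 1 coding scheme above can be sketched in a few lines. The category names are the lecture's own; membership in the unnamed fifth category (the reference group) is signalled by all four dummies being zero.

```python
# K = 5 religion categories -> K - 1 = 4 dummy variables (D1..D4).
# The fifth category is the reference group and gets all zeros.
CATEGORIES = ["Catholic", "Protestant", "Muslim", "Other"]

def dummy_code(religion):
    """Return [D1, D2, D3, D4] for one respondent."""
    return [1 if religion == c else 0 for c in CATEGORIES]
```

Note that each respondent can have at most one dummy equal to 1, which is why K dummies would be redundant: the reference category is fully identified by the other K - 1.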

Regression with Dummy Variables

  • Standard regression formula: \hat{Y} = \beta_0 + \beta_1 X + \epsilon, where \hat{Y} is the predicted value of Y, \beta_0 is the constant (intercept), and \beta_1 is the beta (slope) coefficient. Here X is a dummy variable.
  • In multiple regression with the four religion dummies: \hat{Y} = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \beta_4 D_4 + \epsilon, where \beta_0 is the predicted value for the reference category and each \beta_k is the difference between category k and the reference.

Key Points on Polytomous Predictors

  • If a categorical variable has more than two categories and cannot be meaningfully collapsed into two, use a polytomous predictor with the K - 1 principle to create dummy variables for use in regression.

Categorical Outcomes

  • When the outcome (dependent variable) is categorical, we cannot adjust the data to make standard linear regression work; we need a different form of regression.

Example: Skin Cancer and Age

  • Goal: Determine if skin cancer is more likely to occur with age.
  • Data: Collect data with ages of participants and whether they have skin cancer (1) or not (0).
  • Group ages into blocks (e.g., 10-year intervals) and calculate the percentage of people with skin cancer in each group.
  • Observe if the likelihood of skin cancer increases with age.
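The tabulation described above can be sketched as follows. The data here are made up purely for illustration; the point is the mechanics of grouping ages into 10-year blocks and computing the percentage with skin cancer in each block.

```python
from collections import defaultdict

# Illustrative (age, skin cancer 0/1) records -- not real data.
data = [(23, 0), (27, 0), (34, 0), (38, 1), (45, 0), (49, 1), (56, 1), (58, 1)]

def percent_by_decade(records):
    """Group ages into 10-year blocks; return {decade: % with skin cancer}."""
    counts = defaultdict(lambda: [0, 0])  # decade -> [n_cases, n_total]
    for age, cancer in records:
        decade = (age // 10) * 10
        counts[decade][0] += cancer
        counts[decade][1] += 1
    return {d: 100.0 * cases / total for d, (cases, total) in sorted(counts.items())}
```

With the made-up data, the percentage rises across decades, which is the visual check the lecture describes before moving to a formal model.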

Why Linear Regression Isn't Suitable

  • Linear regression assumes a continuous dependent variable.
  • In this case, the dependent variable is binary (yes/no for skin cancer).

Binary Logistic Regression

  • When we have a binary outcome, we use binary logistic regression.
  • If there are more outcomes, other forms of regression would be used.

Assumptions of Linear Regression vs. Logistic Regression

  • Linear Regression:
    • Assumes a continuous dependent variable.
    • Assumes normally distributed errors.
    • Assumes linearity.
    • Assumes equal variance.
  • Logistic Regression:
    • Does not require a linear relationship between the predictors and the outcome (linearity is assumed only on the logit scale).
    • Does not require normal distribution.
    • Does not require homogeneity of variances.

Benefits of Logistic Regression

  • Can estimate the odds of having skin cancer at a certain age.
  • Suitable when the assumptions of linear regression are violated.

Example: Logistic Regression with Age and Sex at Birth

  • Including multiple independent variables (age and sex at birth) to predict disease states (skin cancer or not).
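A model like this can be sketched with the logistic function directly. The coefficients below (b0, b_age, b_sex) are hypothetical placeholders, not fitted values from any real data; the sketch only shows how a fitted binary logistic model turns age and sex at birth into a predicted probability.

```python
import math

def predict_prob(age, male, b0=-6.0, b_age=0.13, b_sex=0.5):
    """Predicted probability of skin cancer from a binary logistic model:
    p = 1 / (1 + exp(-(b0 + b_age*age + b_sex*male))).
    All coefficients here are hypothetical, for illustration only."""
    z = b0 + b_age * age + b_sex * male
    return 1.0 / (1.0 + math.exp(-z))
```

Because the logistic function is bounded between 0 and 1, the prediction is always a valid probability, which is exactly what linear regression fails to guarantee for a binary outcome.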

Assessing Model Fit

  • Chi-Square Test:
    • Determines if the overall model is significantly improved compared to a null (intercept-only) model.
    • A significant p-value (e.g., p < 0.05) indicates that the model improves our ability to categorize outcomes.
  • Hosmer-Lemeshow Test:
    • Used to assess model fit.
    • A p-value greater than 0.05 (p > 0.05) indicates no significant discrepancy between observed and predicted outcomes, i.e., an adequate model fit.

Individual Predictors and Odds Ratio

  • Examine the odds ratio (often written as exponential beta, e^\beta).
  • P-values indicate the significance of each predictor.
  • Example Interpretation:
    • If age has a p-value less than 0.05, it is a significant predictor of skin cancer.
    • For example, if the odds ratio for age is 1.1362 and significant, each one-unit increase in age multiplies the odds of having skin cancer by 1.1362, a 13.62% increase in the odds.
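The odds-ratio arithmetic above is just exponentiation of the coefficient. The beta value below is chosen so that the result matches the lecture's 13.62% example; it is an illustrative number, not a fitted coefficient.

```python
import math

# Odds ratio from a logistic regression coefficient: OR = e^beta.
# beta here is illustrative, picked to reproduce the 13.62% example.
beta = 0.1277
odds_ratio = math.exp(beta)           # about 1.1362
pct_change = (odds_ratio - 1) * 100   # about 13.62% increase in odds per unit
```

This is why software reports both \beta and e^\beta: the coefficient is additive on the log-odds scale, while e^\beta is the multiplicative change in the odds per one-unit increase in the predictor.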

Summary of Logistic Regression Benefits

  • Can use logistic regression to address issues with categorical outcome variables.
  • Can estimate the odds ratio and determine the significance of predictors.
  • Can assess the overall model fit.

Conclusion

  • Categorical variables cannot be used as dependent variables in linear regression.
  • Binary logistic regression can be used when the outcome variable has exactly two levels (e.g., coded 0 and 1).
  • Several types of logistic regressions exist.