Categorical Variables in Regression
Categorical Predictors and Outcomes
Categorical Predictors
- In previous lectures, the dependent variable was generally continuous (e.g., a score ranging from 1 to 16).
- Independent variables could be continuous or dichotomous (e.g., gender coded as 0 or 1).
- A problem arises when an independent variable has more than two categories. Gender, for example, is increasingly recorded with more than two categories.
Polytomous Predictors
- A polytomous predictor is a nominal scale of measurement with discrete categories that have no quantitative relationship.
- For instance, gender categories (male, female, trans) cannot be scaled in a meaningful way, unlike age categories (0-10, 11-20, 21-30), which can be scaled.
Examples of Scalable and Non-Scalable Variables
- Scalable Variables:
- Age
- Income
- Education level
- Level of Disapproval (can assign numbers 1, 2, 3, etc., as disapproval increases)
- Non-Scalable Variables:
- Ethnicity
- Religion
- Nationality
Dealing with Non-Scalable Categorical Variables
- Collapsing Categories:
- If possible, collapse categories into two (dichotomous) categories (e.g., Christian or non-Christian).
- Similarly, diseases can be classified as infectious (1) or non-infectious (0).
- Creating Dummy Variables:
- If collapsing isn't appropriate, create dummy variables to represent the polytomous variable.
- If there are K categories, create K - 1 dummy variables.
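The collapsing approach above can be sketched in a few lines of pandas. This is a minimal illustration with made-up data, using the Christian/non-Christian example from the notes:

```python
import pandas as pd

# Hypothetical sample with a polytomous religion variable
df = pd.DataFrame({"religion": ["Catholic", "Muslim", "Protestant", "Other", "Catholic"]})

# Collapse to a dichotomous variable: 1 if Christian (Catholic or Protestant), 0 otherwise
christian = {"Catholic", "Protestant"}
df["christian"] = df["religion"].isin(christian).astype(int)

print(df["christian"].tolist())  # → [1, 0, 1, 0, 1]
```

The resulting 0/1 column can then be used directly as a dichotomous predictor in regression.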
Dummy Variables Example (Religion with 5 Categories)
- Let K = 5 (number of categories).
- Create 4 dummy variables: D1, D2, D3, D4.
- D1: 1 if Catholic, 0 otherwise.
- D2: 1 if Protestant, 0 otherwise.
- D3: 1 if Muslim, 0 otherwise.
- D4: 1 if Other, 0 otherwise.
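The K − 1 dummy coding above can be produced automatically with pandas. In this sketch the fifth, unnamed category is assumed to be "No religion" and is used as the reference group (the lecture names only four of the five categories):

```python
import pandas as pd

# Hypothetical respondents; "No religion" will serve as the reference category
df = pd.DataFrame({"religion": ["Catholic", "Protestant", "Muslim", "Other", "No religion"]})
order = ["No religion", "Catholic", "Protestant", "Muslim", "Other"]  # K = 5
df["religion"] = pd.Categorical(df["religion"], categories=order)

# drop_first=True implements the K - 1 principle: 5 categories -> 4 dummies,
# matching D1..D4 (Catholic, Protestant, Muslim, Other) from the example
dummies = pd.get_dummies(df["religion"], drop_first=True, dtype=int)
print(list(dummies.columns))  # → ['Catholic', 'Protestant', 'Muslim', 'Other']
```

A respondent in the reference category scores 0 on all four dummies; everyone else scores 1 on exactly one.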
Regression with Dummy Variables
- Standard regression formula: Y = \beta_0 + \beta_1 X + \epsilon, where Y is the predicted outcome, \beta_0 is the constant, and \beta_1 is the slope coefficient. With a categorical predictor, X is a dummy variable.
- In multiple regression with the religion dummies: Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \beta_4 D_4 + \epsilon, where each \beta is the difference between that category's mean and the omitted (reference) category's mean.
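Fitting such a model can be sketched with statsmodels' formula interface, which dummy-codes a categorical variable automatically via `C()`. The data here are simulated purely for illustration, with hypothetical group means:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated data: 40 respondents per religion category, hypothetical group means
religion = np.repeat(["No religion", "Catholic", "Protestant", "Muslim", "Other"], 40)
means = {"No religion": 5.0, "Catholic": 6.0, "Protestant": 6.5, "Muslim": 5.5, "Other": 5.8}
y = np.array([means[r] for r in religion]) + rng.normal(0, 1, len(religion))
df = pd.DataFrame({"religion": religion, "y": y})

# C(...) dummy-codes religion; Treatment(reference=...) picks the omitted category
model = smf.ols("y ~ C(religion, Treatment(reference='No religion'))", data=df).fit()
print(model.params)  # intercept + 4 dummy coefficients
```

Each fitted coefficient estimates how far that category's mean lies from the reference category's mean, mirroring the equation above.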
Key Points on Polytomous Predictors
- If a categorical variable has more than two categories and cannot be meaningfully collapsed into two, use a polytomous predictor with the K - 1 principle to create dummy variables for use in regression.
Categorical Outcomes
- When the outcome (dependent variable) is categorical, we cannot adjust the data to make standard linear regression work; we need a different form of regression.
Example: Skin Cancer and Age
- Goal: Determine if skin cancer is more likely to occur with age.
- Data: Collect data with ages of participants and whether they have skin cancer (1) or not (0).
- Group ages into blocks (e.g., 10-year intervals) and calculate the percentage of people with skin cancer in each group.
- Observe if the likelihood of skin cancer increases with age.
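The grouping step above can be sketched with a pandas groupby. The ages and cancer indicators below are hypothetical numbers chosen only to illustrate the calculation:

```python
import pandas as pd

# Hypothetical data: ages and skin-cancer status (1 = yes, 0 = no)
df = pd.DataFrame({
    "age":    [25, 28, 34, 37, 45, 48, 55, 58, 65, 68],
    "cancer": [ 0,  0,  0,  1,  0,  1,  1,  1,  1,  1],
})

# Group ages into 10-year blocks and compute the proportion with skin cancer
df["age_block"] = (df["age"] // 10) * 10
rates = df.groupby("age_block")["cancer"].mean()
print(rates)  # proportion with cancer per decade: rises from 0.0 (20s) to 1.0 (60s)
```

If these proportions climb across the age blocks, that is the pattern suggesting the likelihood of skin cancer increases with age.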
Why Linear Regression Isn't Suitable
- Linear regression assumes a continuous dependent variable.
- In this case, the dependent variable is binary (yes/no for skin cancer).
Binary Logistic Regression
- When we have a binary outcome, we use binary logistic regression.
- If the outcome has more than two categories, other forms of logistic regression (e.g., multinomial or ordinal) are used.
Assumptions of Linear Regression vs. Logistic Regression
- Linear Regression:
- Assumes a continuous dependent variable.
- Assumes a normal distribution.
- Assumes linearity.
- Assumes equal variance.
- Logistic Regression:
- Does not require linear relationships.
- Does not require normal distribution.
- Does not require homogeneity of variances.
Benefits of Logistic Regression
- Can estimate the odds of having skin cancer at a certain age.
- Suitable when the assumptions of linear regression are violated.
Example: Logistic Regression with Age and Sex at Birth
- Including multiple independent variables (age and sex at birth) to predict disease states (skin cancer or not).
Assessing Model Fit
- Chi-Square Test:
- Determines if the overall model is significantly improved compared to a null model with no predictors.
- A significant p-value (e.g., p < 0.05) indicates that the model improves our ability to categorize outcomes.
- Hosmer-Lemeshow Test:
- Used to assess model fit.
- A p-value greater than 0.05 (p > 0.05) indicates no evidence of poor fit, i.e., an acceptable model fit.
Individual Predictors and Odds Ratio
- Examine the odds ratio (often written as exponential beta, e^\beta).
- P-values indicate the significance of each predictor.
- Example Interpretation:
- If age has a p-value less than 0.05, it is a significant predictor of skin cancer.
- If the odds ratio is greater than one and significant (e.g., e^\beta = 1.1362), every one-unit increase in age is associated with a 13.62% increase in the odds of having skin cancer.
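The percentage interpretation follows from (e^\beta − 1) × 100. Assuming an odds ratio of 1.1362 (the example figure above), the arithmetic is:

```python
import math

# An odds ratio of 1.1362 corresponds to beta = ln(1.1362) ≈ 0.1277
beta = math.log(1.1362)
odds_ratio = math.exp(beta)

# Percent change in the odds per one-unit increase in the predictor
pct_change = (odds_ratio - 1) * 100
print(round(pct_change, 2))  # → 13.62
```

An odds ratio below one would instead indicate a percentage decrease in the odds, computed the same way.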
Summary of Logistic Regression Benefits
- Can use logistic regression to address issues with categorical outcome variables.
- Can estimate the odds ratio and determine the significance of predictors.
- Can assess the overall model fit.
Conclusion
- Categorical variables cannot be used as dependent variables in linear regression.
- Binary logistic regression can be used when there is a category with two levels (e.g., 0 and 1).
- Several types of logistic regressions exist.