Categorical Variables in Regression

Categorical Predictors and Outcomes

Categorical Predictors

  • In previous lectures, the dependent variable was generally continuous, ranging from, say, 1 to 15 or 16.
  • Independent variables could be continuous or dichotomous (e.g., gender coded as 0 or 1).
  • A problem arises when an independent variable has more than two categories. Gender, for example, is increasingly recorded with more than two categories.

Polytomous Predictors

  • A polytomous predictor is a nominal scale of measurement with discrete categories that have no quantitative relationship.
  • For instance, gender categories (male, female, trans) cannot be scaled in a meaningful way, unlike age categories (0-10, 11-20, 21-30), which can be scaled.

Examples of Scalable and Non-Scalable Variables

  • Scalable Variables:
    • Age
    • Income
    • Education level
    • Level of Disapproval (can assign numbers 1, 2, 3, etc., as disapproval increases)
  • Non-Scalable Variables:
    • Ethnicity
    • Religion
    • Nationality

Dealing with Non-Scalable Categorical Variables

  • Collapsing Categories:
    • If possible, collapse categories into two (dichotomous) categories (e.g., Christian or non-Christian).
    • With infectious diseases, data can be classified into infectious and non-infectious (0 and 1).
  • Creating Dummy Variables:
    • If collapsing isn't appropriate, create dummy variables to represent the polytomous variable.
    • If there are K categories, create K - 1 dummy variables.
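The collapsing strategy above can be sketched as a simple recoding function. This is a minimal illustration of the lecture's Christian / non-Christian example; the set of denominations counted as "Christian" is an illustrative assumption, not something specified in the notes.

```python
# Sketch: collapse a polytomous religion variable to a dichotomy (1/0).
# The grouping below is illustrative only.
def collapse_religion(religion):
    """Code 1 for Christian denominations, 0 for everything else."""
    christian = {"Catholic", "Protestant", "Orthodox"}  # assumed grouping
    return 1 if religion in christian else 0
```

The same pattern works for the infectious / non-infectious disease example: any rule that maps each category to 0 or 1 yields a dichotomous predictor usable directly in regression.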

Dummy Variables Example (Religion with 5 Categories)

  • Let K = 5 (number of categories).
  • Create 4 dummy variables: D1, D2, D3, D4.
  • D1: 1 if Catholic, 0 otherwise.
  • D2: 1 if Protestant, 0 otherwise.
  • D3: 1 if Muslim, 0 otherwise.
  • D4: 1 if Other, 0 otherwise.
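The K - 1 coding scheme above can be sketched in a few lines. The category names are the lecture's own; membership in the unnamed fifth category (the reference group) is signalled by all four dummies being zero.

```python
# K = 5 religion categories -> K - 1 = 4 dummy variables (D1..D4).
# The fifth category is the reference group and gets all zeros.
CATEGORIES = ["Catholic", "Protestant", "Muslim", "Other"]

def dummy_code(religion):
    """Return [D1, D2, D3, D4] for one respondent."""
    return [1 if religion == c else 0 for c in CATEGORIES]
```

Note that each respondent can have at most one dummy equal to 1, which is why K dummies would be redundant: the reference category is fully identified by the other K - 1.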

Regression with Dummy Variables

  • Standard regression formula: \hat{Y} = \beta_0 + \beta_1 X + \epsilon, where \hat{Y} is the predicted value of Y, \beta_0 is the constant (intercept), and \beta_1 is the beta (slope) coefficient. Here X is a dummy variable.
  • In multiple regression with the four religion dummies: \hat{Y} = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \beta_4 D_4 + \epsilon, where \beta_0 is the predicted value for the reference category and each \beta_k is the difference between category k and the reference.

Key Points on Polytomous Predictors

  • If a categorical variable has more than two categories and cannot be meaningfully collapsed into two, use a polytomous predictor with the K - 1 principle to create dummy variables for use in regression.

Categorical Outcomes

  • When the outcome (dependent variable) is categorical, we cannot adjust the data to make standard linear regression work; we need a different form of regression.

Example: Skin Cancer and Age

  • Goal: Determine if skin cancer is more likely to occur with age.
  • Data: Collect data with ages of participants and whether they have skin cancer (1) or not (0).
  • Group ages into blocks (e.g., 10-year intervals) and calculate the percentage of people with skin cancer in each group.
  • Observe if the likelihood of skin cancer increases with age.
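The tabulation described above can be sketched as follows. The data here are made up purely for illustration; the point is the mechanics of grouping ages into 10-year blocks and computing the percentage with skin cancer in each block.

```python
from collections import defaultdict

# Illustrative (age, skin cancer 0/1) records -- not real data.
data = [(23, 0), (27, 0), (34, 0), (38, 1), (45, 0), (49, 1), (56, 1), (58, 1)]

def percent_by_decade(records):
    """Group ages into 10-year blocks; return {decade: % with skin cancer}."""
    counts = defaultdict(lambda: [0, 0])  # decade -> [n_cases, n_total]
    for age, cancer in records:
        decade = (age // 10) * 10
        counts[decade][0] += cancer
        counts[decade][1] += 1
    return {d: 100.0 * cases / total for d, (cases, total) in sorted(counts.items())}
```

With the made-up data, the percentage rises across decades, which is the visual check the lecture describes before moving to a formal model.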

Why Linear Regression Isn't Suitable

  • Linear regression assumes a continuous dependent variable.
  • In this case, the dependent variable is binary (yes/no for skin cancer).

Binary Logistic Regression

  • When we have a binary outcome, we use binary logistic regression.
  • If there are more outcomes, other forms of regression would be used.

Assumptions of Linear Regression vs. Logistic Regression

  • Linear Regression:
    • Assumes a continuous dependent variable.
    • Assumes normally distributed errors.
    • Assumes linearity.
    • Assumes equal variance.
  • Logistic Regression:
    • Does not require a linear relationship between the predictors and the outcome (linearity is assumed only on the logit scale).
    • Does not require normal distribution.
    • Does not require homogeneity of variances.

Benefits of Logistic Regression

  • Can estimate the odds of having skin cancer at a certain age.
  • Suitable when the assumptions of linear regression are violated.

Example: Logistic Regression with Age and Sex at Birth

  • Including multiple independent variables (age and sex at birth) to predict disease states (skin cancer or not).
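A model like this can be sketched with the logistic function directly. The coefficients below (b0, b_age, b_sex) are hypothetical placeholders, not fitted values from any real data; the sketch only shows how a fitted binary logistic model turns age and sex at birth into a predicted probability.

```python
import math

def predict_prob(age, male, b0=-6.0, b_age=0.13, b_sex=0.5):
    """Predicted probability of skin cancer from a binary logistic model:
    p = 1 / (1 + exp(-(b0 + b_age*age + b_sex*male))).
    All coefficients here are hypothetical, for illustration only."""
    z = b0 + b_age * age + b_sex * male
    return 1.0 / (1.0 + math.exp(-z))
```

Because the logistic function is bounded between 0 and 1, the prediction is always a valid probability, which is exactly what linear regression fails to guarantee for a binary outcome.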

Assessing Model Fit

  • Chi-Square Test:
    • Determines if the overall model is significantly improved compared to a null (intercept-only) model.
    • A significant p-value (e.g., p < 0.05) indicates that the model improves our ability to categorize outcomes.
  • Hosmer-Lemeshow Test:
    • Used to assess model fit.
    • A p-value greater than 0.05 (p > 0.05) indicates no significant discrepancy between observed and predicted outcomes, i.e., an adequate model fit.

Individual Predictors and Odds Ratio

  • Examine the odds ratio (often written as exponential beta, e^\beta).
  • P-values indicate the significance of each predictor.
  • Example Interpretation:
    • If age has a p-value less than 0.05, it is a significant predictor of skin cancer.
    • For example, if the odds ratio for age is 1.1362 and significant, each one-unit increase in age multiplies the odds of having skin cancer by 1.1362, a 13.62% increase in the odds.
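The odds-ratio arithmetic above is just exponentiation of the coefficient. The beta value below is chosen so that the result matches the lecture's 13.62% example; it is an illustrative number, not a fitted coefficient.

```python
import math

# Odds ratio from a logistic regression coefficient: OR = e^beta.
# beta here is illustrative, picked to reproduce the 13.62% example.
beta = 0.1277
odds_ratio = math.exp(beta)           # about 1.1362
pct_change = (odds_ratio - 1) * 100   # about 13.62% increase in odds per unit
```

This is why software reports both \beta and e^\beta: the coefficient is additive on the log-odds scale, while e^\beta is the multiplicative change in the odds per one-unit increase in the predictor.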

Summary of Logistic Regression Benefits

  • Can use logistic regression to address issues with categorical outcome variables.
  • Can estimate the odds ratio and determine the significance of predictors.
  • Can assess the overall model fit.

Conclusion

  • Categorical variables cannot be used as dependent variables in linear regression.
  • Binary logistic regression can be used when the outcome variable has exactly two levels (e.g., coded 0 and 1).
  • Several types of logistic regressions exist.