Moving from continuous predictors to categorical predictors in regression analysis.
Categories must be assigned numerical values before they can enter a regression equation.
Define binary predictors as those with 2 categories (e.g., males and females).
Example:
Response variable (y): Height in inches
Predictor (x): Sex (categorical with males and females).
Assign numbers (0 and 1) to categories:
Males = 1
Females = 0
Dummy variable: A numeric variable (typically coded 0/1) that stands in for a categorical predictor. The numbers are labels for categories, not measured quantities.
Fitting the model results in estimates:
Beta naught (β0) = 66.1 (average height for females)
Beta one (β1) = 3.8 (estimated difference in average height, males minus females).
Interpretations:
Average height for females is 66.1 inches.
Average height for males is β0 + β1 (66.1 + 3.8 = 69.9 inches).
From the regression equation (average height = β0 + β1·x):
For females (x=0): Average height = β0
For males (x=1): Average height = β0 + β1
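A minimal sketch of the binary case in Python, using made-up heights (not the data behind the 66.1 and 3.8 estimates). It assumes the model average height = β0 + β1·x with males = 1 and females = 0, and checks that the least-squares β0 matches the female average and β1 the male-minus-female difference. All variable names are illustrative.

```python
from statistics import mean

# Hypothetical toy data (not from the source): heights in inches.
heights = [64.0, 66.0, 68.0, 69.0, 70.0, 71.0]
sex = ["F", "F", "F", "M", "M", "M"]

# Dummy coding: males = 1, females = 0 (females are the reference category).
x = [1 if s == "M" else 0 for s in sex]

# Closed-form least-squares estimates for simple linear regression.
x_bar, y_bar = mean(x), mean(heights)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, heights))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar

# Group averages computed directly, for comparison with the coefficients.
female_mean = mean(h for h, xi in zip(heights, x) if xi == 0)
male_mean = mean(h for h, xi in zip(heights, x) if xi == 1)

print(f"beta0 = {beta0:.2f}, female average = {female_mean:.2f}")
print(f"beta0 + beta1 = {beta0 + beta1:.2f}, male average = {male_mean:.2f}")
```

With a single 0/1 predictor, the fitted line passes through both group averages, which is why the coefficients can be read as an average and a difference of averages.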
Which category is coded 1 determines the sign of β1:
With males = 1, a positive β1 indicates males are taller than females on average; a negative β1 indicates males are shorter on average.
It is possible to assign different numbers (not just 0 and 1); however, it complicates interpretation.
Example:
With arbitrary codes (like 2 for males, 17 for females), β0 is no longer a group average and β1 is no longer the difference between the groups.
By keeping the numbers as 0 and 1, interpretations remain simpler.
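A worked version of the 2/17 coding, assuming the fitted group averages stay at 66.1 (females) and 69.9 (males). That assumption holds because any two-value coding still fits each group's average exactly.

```latex
\begin{align*}
\text{females } (x = 17):\quad & \beta_0 + 17\beta_1 = 66.1 \\
\text{males } (x = 2):\quad & \beta_0 + 2\beta_1 = 69.9 \\
\Rightarrow\quad & \beta_1 = \frac{66.1 - 69.9}{17 - 2} = -\frac{3.8}{15} \approx -0.253 \\
& \beta_0 = 69.9 - 2\beta_1 \approx 70.41
\end{align*}
```

The fitted averages are unchanged, but neither β0 ≈ 70.41 nor β1 ≈ -0.253 can be read directly as a group average or a group difference.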
Choosing which category is 0 or 1 is somewhat arbitrary. Both configurations give the same fitted averages; swapping the codes makes β0 the male average (69.9) and flips the sign of β1 (-3.8).
The category assigned a value of 0 is referred to as the reference category or baseline category.
The average of the response variable in the reference category is represented by β0.
Terminology: "factor" is used interchangeably with "categorical variable", and "level" with "category".
Example with 3 categories: Caucasian, African, and Asian:
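Cannot simply assign numbers like 0, 1, and 2: that coding imposes an order on the categories and assumes equal differences between consecutive categories.
Instead, define multiple dummy variables with Caucasian as the reference category (x1 = 1 if African, 0 otherwise; x2 = 1 if Asian, 0 otherwise), representing each category without assuming equal spacing: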
Average height for Caucasians: β0
Average height for Africans: β0 + β1
Average height for Asians: β0 + β2
Average height differences:
African - Caucasian = β1
Asian - Caucasian = β2
African - Asian = β1 - β2
For k categories, k-1 dummy variables are needed.
Each dummy variable captures the difference relative to the reference category.
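A minimal sketch of the three-category case in Python, using made-up heights (the source gives no data for this example). It assumes Caucasian is the reference category, builds the two dummy variables, and fits by least squares through the normal equations. The `solve` helper and all variable names are illustrative.

```python
from statistics import mean

# Hypothetical toy data (not from the source): (group, height in inches).
data = [("Caucasian", 68.0), ("Caucasian", 70.0),
        ("African", 69.0), ("African", 73.0),
        ("Asian", 65.0), ("Asian", 67.0)]

levels = ["Caucasian", "African", "Asian"]  # first level is the reference
others = levels[1:]                         # k = 3 levels -> k - 1 = 2 dummies

# Design matrix rows: [1 (intercept), x1 (African), x2 (Asian)].
X = [[1.0] + [1.0 if g == lvl else 0.0 for lvl in others] for g, _ in data]
y = [h for _, h in data]

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Normal equations: (X'X) beta = X'y.
p = len(X[0])
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
beta0, beta1, beta2 = solve(XtX, Xty)

# Group averages computed directly, for comparison with the coefficients.
group_mean = {lvl: mean(h for g, h in data if g == lvl) for lvl in levels}
print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}, beta2 = {beta2:.2f}")
print(f"African - Asian = {beta1 - beta2:.2f}")
```

Here β0 matches the reference-group average, and β1 and β2 match the differences of the other two groups from it, so any pairwise comparison can be recovered from the coefficients.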
Categories must be mutually exclusive: each observation belongs to exactly one category (e.g., no overlap between demographic groups).
Interpretation of regression coefficients:
Look at the definition of the corresponding dummy variable to derive average differences.
Distinction between nominal and ordinal categorical variables:
Nominal: Categories with no inherent order.
Ordinal: Categories where order matters.
For ordinal predictors, either ignore the order (treat the variable as nominal and use dummy variables) or use more sophisticated models that accommodate ordinal data.
Using categorical variables in regression software requires declaring them as categorical (e.g., with a function like factor), especially when categories are stored as numeric codes.
Example of reading Framingham dataset:
Identify which columns are categorical, fit the model, and read each category's interpretation from the terms the model generates.
Each term corresponds to one of the dummy variables and estimates the difference from the reference category.
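A rough sketch of how model terms line up with dummy variables, written in Python. The `dummy_terms` function, the column name `group`, the codes, and the term-naming scheme (variable name followed by level) are all illustrative assumptions. They are not taken from the Framingham data or from any particular software's output.

```python
def dummy_terms(name, values, reference=None):
    """Return {term_name: 0/1 column} for every level except the reference."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: first sorted level is the baseline
    return {
        f"{name}{lvl}": [1 if v == lvl else 0 for v in values]
        for lvl in levels
        if lvl != reference
    }

# Hypothetical column that stores categories as numeric codes.
codes = [1, 3, 2, 1, 3]
terms = dummy_terms("group", codes)

for term, column in terms.items():
    print(term, column)
```

The reference level (code 1 here) gets no term of its own because its average is absorbed into the intercept. Each remaining term is read as a difference from that reference level.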