DM

Lecture Video W10D1 - Categorical predictors in MLR

Categorical Predictors in Multiple Linear Regression

Introduction to Categorical Predictors

  • Moving from continuous predictors to categorical predictors in regression analysis.

  • Categories must be assigned numerical values before they can enter a regression equation.

Binary Predictors

  • A binary predictor is a categorical predictor with exactly 2 categories (e.g., males and females).

  • Example:

    • Response variable (y): Height in inches

    • Predictor (x): Sex (categorical with males and females).

  • Assign numbers (0 and 1) to categories:

    • Males = 1

    • Females = 0

  • Dummy variable: a numeric stand-in (0 or 1) for a category. It is called a dummy because the numbers only label category membership; they are not measured quantities.

  • Fitting the model results in estimates:

    • Beta naught (β0) = 66.1 inches (estimated average height for females)

    • Beta one (β1) = 3.8 inches (estimated difference in average height, males minus females).

  • Interpretations:

    • Average height for females is 66.1 inches.

    • Average height for males is β0 + β1 (66.1 + 3.8 = 69.9 inches).
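The interpretation above can be checked numerically. The following is a minimal sketch, assuming hypothetical toy heights (not the lecture's data) and a hand-written helper named fit_simple_ols that is not part of the lecture. It fits y = b0 + b1 * x by least squares with males = 1 and females = 0, then compares the estimates with the group averages.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x (hypothetical helper)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical toy data (assumption, not the lecture's data): heights in inches.
sex = ["F", "F", "F", "M", "M", "M", "M"]
height = [64.0, 66.0, 65.0, 70.0, 69.0, 71.0, 68.0]

# Dummy variable: males = 1, females = 0, so females are the reference category.
x = [1 if s == "M" else 0 for s in sex]

b0, b1 = fit_simple_ols(x, height)

female_mean = mean([h for s, h in zip(sex, height) if s == "F"])
male_mean = mean([h for s, h in zip(sex, height) if s == "M"])

print(f"b0 = {b0:.2f}   female average = {female_mean:.2f}")
print(f"b1 = {b1:.2f}   male - female  = {male_mean - female_mean:.2f}")
```

With 0/1 coding, the intercept reproduces the average of the group coded 0 and the slope reproduces the difference in group averages, which is why the two printed pairs agree.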

Expected Value Calculation

  • From the regression equation:

    • For females (x=0): Average height = β0

    • For males (x=1): Average height = β0 + β1

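  • The coding determines how β1 is read:

    • β1 is the average for the category coded 1 minus the average for the category coded 0. With males = 1, a positive β1 indicates that males are taller on average; a negative β1 indicates that females are taller on average.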
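Written out as equations, the calculation in this section is as follows. This is a sketch in standard notation; the conditional-expectation symbols are an assumption about notation, and the numbers are the estimates quoted in these notes.

```latex
\begin{aligned}
E[y \mid x] &= \beta_0 + \beta_1 x \\
E[y \mid x = 0] &= \beta_0 && \text{(females)} \\
E[y \mid x = 1] &= \beta_0 + \beta_1 && \text{(males)} \\
E[y \mid x = 1] - E[y \mid x = 0] &= \beta_1 \\[4pt]
\hat{\beta}_0 = 66.1, \quad \hat{\beta}_1 = 3.8
  \;&\Rightarrow\; \hat{\beta}_0 + \hat{\beta}_1 = 66.1 + 3.8 = 69.9 \text{ inches}
\end{aligned}
```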

Variability in Assigning Dummy Variables

  • It is possible to assign different numbers (not just 0 and 1); however, it complicates interpretation.

  • Example:

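    • With arbitrary numbers (like 2 for males, 17 for females), the fitted group averages do not change, but β0 is no longer the average of either group and β1 becomes the female-minus-male difference in averages divided by 17 - 2 = 15.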

  • By keeping the numbers as 0 and 1, interpretations remain simpler.

  • Choosing which category is 0 and which is 1 is arbitrary: both choices give the same fitted averages. Swapping them only changes the reference category and flips the sign of β1.
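The effect of the coding choice can be illustrated with a small experiment. This is a minimal sketch, assuming hypothetical toy heights (not the lecture's data) and a hand-written least-squares helper named fit_simple_ols. It fits the same data under three codings and prints the coefficients together with the fitted average for each group.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x (hypothetical helper)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical toy data (assumption): heights in inches.
sex = ["F", "F", "F", "M", "M", "M", "M"]
height = [64.0, 66.0, 65.0, 70.0, 69.0, 71.0, 68.0]

# Three ways of turning the two categories into numbers.
codings = {
    "males=1, females=0": {"M": 1, "F": 0},
    "males=0, females=1": {"M": 0, "F": 1},
    "males=2, females=17": {"M": 2, "F": 17},
}

results = {}
for label, code in codings.items():
    x = [code[s] for s in sex]
    b0, b1 = fit_simple_ols(x, height)
    # Fitted average for each group under this coding.
    fitted_m = b0 + b1 * code["M"]
    fitted_f = b0 + b1 * code["F"]
    results[label] = (b0, b1, fitted_m, fitted_f)
    print(f"{label:20s} b0={b0:6.2f} b1={b1:6.2f} "
          f"males={fitted_m:.2f} females={fitted_f:.2f}")
```

The fitted group averages are identical under all three codings; only the meaning of b0 and b1 changes, and only the 0/1 codings make b0 a group average and b1 a plain difference in averages.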

Reference Category Concept

  • The category assigned a value of 0 is referred to as the reference category or baseline category.

  • The average of the response variable in the reference category is represented by β0.

Categorical Predictors with More than Two Categories

  • Terminology: "factor" is used interchangeably with "categorical variable", and "level" is used interchangeably with "category".

  • Example with 3 categories: Caucasian, African, and Asian:

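    • Cannot simply assign numbers like 0, 1, and 2: that imposes an order on the categories and assumes the same difference in average height between consecutive categories.

  • Instead, define two dummy variables: x1 = 1 if African and 0 otherwise, x2 = 1 if Asian and 0 otherwise. Caucasian (x1 = 0, x2 = 0) is the reference category, and no spacing between categories is assumed: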

    • Average height for Caucasians: β0

    • Average height for Africans: β0 + β1

    • Average height for Asians: β0 + β2

    • Average height differences:

      • African - Caucasian = β1

      • Asian - Caucasian = β2

      • African - Asian = β1 - β2
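The averages and differences listed above follow from a single model equation. This is a sketch in standard notation (the conditional-expectation symbols are an assumption about notation), with x1 = 1 for African and x2 = 1 for Asian:

```latex
\begin{aligned}
E[y \mid x_1, x_2] &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
\text{Caucasian } (x_1 = 0,\ x_2 = 0): \quad E[y] &= \beta_0 \\
\text{African } (x_1 = 1,\ x_2 = 0): \quad E[y] &= \beta_0 + \beta_1 \\
\text{Asian } (x_1 = 0,\ x_2 = 1): \quad E[y] &= \beta_0 + \beta_2 \\
\text{African} - \text{Asian}: \quad (\beta_0 + \beta_1) - (\beta_0 + \beta_2) &= \beta_1 - \beta_2
\end{aligned}
```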

Dummy Variable and Interpretation Rules

  • For k categories, k-1 dummy variables are needed.

  • Each dummy variable captures the difference relative to the reference category.

  • Categories must be mutually exclusive: each observation belongs to exactly one category (e.g., no overlap between demographic groups).

  • Interpretation of regression coefficients:

    • Look at how the corresponding dummy variable is defined: its coefficient is the average of the response in the category coded 1 minus the average in the reference category.
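The k-1 rule can be made concrete with a small encoder. This is a minimal sketch, assuming a hypothetical helper named dummy_code and made-up group labels. It builds one 0/1 column per non-reference category, so the reference category is the row of all zeros.

```python
def dummy_code(values, reference):
    """Return (levels, rows): one 0/1 column per non-reference category.

    Hypothetical helper for illustration; 'reference' is the baseline category.
    """
    if reference not in values:
        raise ValueError(f"reference category {reference!r} not found in data")
    # k categories give k - 1 dummy variables: every category except the reference.
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Made-up observations (assumption) with k = 3 categories.
groups = ["Caucasian", "African", "Asian", "African", "Caucasian"]

levels, rows = dummy_code(groups, reference="Caucasian")

print("dummy variables:", levels)
for group, row in zip(groups, rows):
    print(f"{group:10s} -> {row}")
```

Because the categories are mutually exclusive, each row contains at most one 1, and the reference category is identified by all dummy variables being 0.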

Ordinal Predictors

  • Distinction between nominal and ordinal categorical variables:

    • Nominal: Categorical without a specific order.

    • Ordinal: Categories where order matters.

  • For ordinal predictors, either ignore the order and treat them as nominal (using dummy variables), or use more specialized models that account for the ordering.

Application in R

  • In R, convert categorical columns with factor() so that the regression treats them as categories and creates the dummy variables automatically, rather than treating numeric codes as quantities.

  • Example of reading the Framingham dataset:

    • Identify which columns are categorical and convert them before fitting the model.

    • Each term in the model output corresponds to one dummy variable; its coefficient is the estimated average difference between that category and the reference category. The category with no term of its own is the reference category.