DM

Lecture Video W10D2 - Categorical predictors in MLR - part 2

Collecting Midterm Exams

  • Midterm exams have been returned to face-to-face students.

  • Online students should arrange by email to collect their original exams, which avoids the need for scanning.

  • If necessary, exams will be scanned and posted on Canvas.

Categorical Predictors Overview

  • Continues the earlier discussion of categorical predictors.

  • Defining dummy variables correctly is essential.

  • Explains how categorical predictors are coded in a regression model.

Dummy Variables

  • A categorical predictor (e.g., race) with k categories requires k - 1 dummy variables; the remaining category serves as the reference group.

  • Example: Race with 3 categories requires 2 dummy variables, so testing whether race has any effect means testing both dummy coefficients at once.

  • Hypotheses:

    • Null Hypothesis: Both dummy-variable beta coefficients are zero (race has no effect).

    • Alternative Hypothesis: At least one of the two betas is non-zero (race has some effect).
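
The dummy coding described above can be inspected directly in R. The following is a minimal sketch, assuming a hypothetical three-category predictor with made-up values; model.matrix() displays the columns that a regression fit would use.

```r
# Hypothetical three-category predictor; values are made up for illustration.
race <- factor(c("A", "B", "C", "A", "B", "C"))

# model.matrix() builds the design matrix that lm() would use for this formula.
X <- model.matrix(~ race)
print(X)

# One intercept column plus k - 1 = 2 dummy columns.
# "A" (the first level) is the reference category and gets no column.
n_dummies <- ncol(X) - 1
n_dummies
```

Each dummy column equals 1 for observations in that category and 0 otherwise, so the two betas in the hypotheses correspond to the columns raceB and raceC.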

Factor Function in R

  • The factor() function tells R that a variable is categorical:

    • Wrap numerically coded categorical predictors in factor() inside the regression formula.

    • R then creates the required dummy variables automatically.
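
To illustrate why the wrapping matters, here is a small sketch on simulated data (the variable names and values are assumptions, not from the lecture). Without factor(), R fits a single slope for the numeric code; with factor(), it fits one coefficient per non-reference category.

```r
set.seed(3)

# Simulated data: a predictor coded 1, 2, 3 whose group means are not linear in the code.
code <- rep(1:3, each = 10)
y <- c(5, 9, 6)[code] + rnorm(30)

fit_numeric <- lm(y ~ code)          # treats the code as a number: one slope
fit_factor  <- lm(y ~ factor(code))  # treats the code as categories: two dummies

n_coef_numeric <- length(coef(fit_numeric))  # intercept + slope
n_coef_factor  <- length(coef(fit_factor))   # intercept + 2 dummy coefficients

names(coef(fit_factor))
```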

Example of the Factor Function

  • Given a vector of numeric codes (1, 2, 3):

    • Applying factor() makes R treat the values as categories rather than numbers.

    • levels() lists the categories that were defined.
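
A minimal sketch of this example, using a made-up vector of codes:

```r
# Numeric codes; the values are made up for illustration.
x <- c(1, 2, 3, 2, 1, 3)

mean(x)          # R happily averages the codes because x is numeric

# Convert to a factor so the values are treated as category labels.
x_f <- factor(x)

levels(x_f)      # the distinct categories, stored as character labels
nlevels(x_f)     # how many categories there are
```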

Modifying Categorical Variables

  • Use the labels argument of factor() to give the categories descriptive names.

  • Check that the conversion worked with is.numeric() and is.factor().
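
As a sketch of both steps, assuming a hypothetical blood pressure code of 1, 2, 3 and label names chosen only for illustration:

```r
# Hypothetical numeric codes for a blood pressure group.
bp_code <- c(1, 2, 3, 2, 1)

# levels gives the codes in order; labels gives the name attached to each code.
# The label names here are assumptions made for this example.
bp_group <- factor(bp_code,
                   levels = c(1, 2, 3),
                   labels = c("low", "normal", "high"))

bp_group

# Before conversion the variable is numeric, afterwards it is a factor.
is.numeric(bp_code)
is.factor(bp_code)
is.numeric(bp_group)
is.factor(bp_group)
```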

Analyzing Categorical Data

  • The table() function summarizes the frequency of each category.

  • Example with gender: table() shows the counts of males and females.
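
A short sketch with made-up gender values; prop.table() is added here only to show that counts can be turned into proportions.

```r
# Made-up data for illustration.
gender <- factor(c("male", "female", "female", "male", "female"))

# Frequency of each category.
counts <- table(gender)
counts

# Same information as proportions of the total.
prop.table(counts)
```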

Visualization

  • Bar plots display the category frequencies that table() computes.

  • Passing two variables to table() produces a cross-tabulation, i.e., their joint distribution (e.g., gender vs. blood pressure group).
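
Both ideas can be sketched as follows; the data values and category names are assumptions made for this example.

```r
# Made-up data for illustration.
gender   <- factor(c("male", "female", "female", "male", "female", "male"))
bp_group <- factor(c("normal", "high", "normal", "high", "high", "normal"))

# Bar plot of the one-way frequency table.
barplot(table(gender), xlab = "gender", ylab = "count")

# Two-way table: rows are gender, columns are blood pressure group.
cross <- table(gender, bp_group)
cross
```

Each cell of the two-way table counts the observations that fall in that combination of categories.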

Summary Functions

  • summary() reports quartiles and the mean for numeric variables and frequency counts for factors.

  • Make sure categorical variables such as BP group are defined as factors; otherwise summary() averages the numeric codes, which is misleading.
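
The difference is easy to see on a small data frame. This is a sketch with made-up values and hypothetical column names:

```r
# Made-up data: bp_group is stored as a numeric code.
dat <- data.frame(age      = c(34, 51, 46, 29, 62),
                  bp_group = c(1, 2, 3, 2, 1))

# While bp_group is numeric, summary() reports a mean and quartiles of the codes,
# which have no meaningful interpretation.
summary(dat$bp_group)

# After conversion, summary() reports how many observations fall in each category.
dat$bp_group <- factor(dat$bp_group)
bp_counts <- summary(dat$bp_group)
bp_counts

# On the whole data frame, each column is summarized according to its type.
summary(dat)
```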

Box Plots and Residuals

  • Box plots are effective for displaying the distribution of a continuous variable within each level of a categorical variable.

  • They are also useful for checking assumptions after model fitting, e.g., plotting residuals against a categorical predictor.
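
A minimal sketch of such a residual check, using simulated data (all variable names and the data-generating model are assumptions for illustration):

```r
set.seed(1)

# Simulated data: one continuous predictor and one three-level factor.
n     <- 60
group <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
x     <- rnorm(n)
y     <- 1 + 0.5 * x + rnorm(n)

fit <- lm(y ~ x + group)
res <- resid(fit)

# One box of residuals per category. Boxes centred near zero with similar
# spread are consistent with the usual regression assumptions.
boxplot(res ~ group, xlab = "group", ylab = "residuals")
abline(h = 0, lty = 2)
```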

Fitting Models in R

  • Fitting multiple linear regression models:

    • Example: Comparing models using ANOVA (Analysis of Variance).

    • The models being compared differ in which predictors they include; that difference defines the hypothesis being tested.

ANOVA for Nested Models

  • Tests several coefficients at once, such as all dummy variables of one categorical predictor:

    • Null Hypothesis: The extra beta coefficients in the larger model are all zero, so the smaller model is adequate.

    • Framework: Fit the two nested models and compare them, making sure missing values are handled so that both models use the same observations.

Missing Values Handling

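  • Remove incomplete observations so that both models are fitted to the same set of rows; otherwise the ANOVA comparison fails with an error.

  • Build a subset of the data containing only the columns needed, then drop rows with missing values from that subset. This avoids discarding rows because of missing values in unrelated columns.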
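
The workflow above can be sketched as follows. The data are simulated and every column name is hypothetical; the point is the order of the steps: subset the columns, drop incomplete rows, fit both models on the same data, then compare.

```r
set.seed(2)

# Simulated data with a few missing values; column names are hypothetical.
n   <- 80
dat <- data.frame(
  y         = rnorm(n),
  age       = rnorm(n, mean = 50, sd = 10),
  bmi_group = factor(sample(c("normal", "overweight", "obese"), n, replace = TRUE)),
  unused    = rnorm(n)
)
dat$bmi_group[c(3, 10)] <- NA
dat$age[7] <- NA

# Keep only the columns the models need, then drop incomplete rows.
sub <- na.omit(dat[, c("y", "age", "bmi_group")])
nrow(sub)

# Nested models fitted to the same rows.
reduced <- lm(y ~ age, data = sub)
full    <- lm(y ~ age + bmi_group, data = sub)

# F-test of H0: both bmi_group dummy coefficients are zero.
cmp <- anova(reduced, full)
cmp
```

Fitting the reduced model to dat instead of sub would use more rows than the full model, because lm() drops incomplete rows separately for each formula, and anova() would then refuse to compare the two fits.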

Conclusion of Hypothesis Testing

  • The ANOVA p-value determines whether to reject the null hypothesis.

  • Each dummy-variable beta coefficient is the average difference in the response between that category and the reference category (e.g., between BMI groups).

  • These differences must be interpreted with the other variables in the model held constant.
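
To make the interpretation concrete, here is a small sketch with noise-free, made-up data, so the fitted coefficients reproduce the values used to build the response. The variable names and numbers are assumptions for illustration only.

```r
# Made-up, noise-free data: three BMI groups observed at the same four ages.
group <- factor(rep(c("normal", "overweight", "obese"), each = 4),
                levels = c("normal", "overweight", "obese"))
age   <- rep(c(30, 40, 50, 60), times = 3)

# Response built so that, at any fixed age, "overweight" is 2 units above
# "normal" and "obese" is 5 units above "normal".
y <- 20 + 0.1 * age + 2 * (group == "overweight") + 5 * (group == "obese")

fit <- lm(y ~ age + group)
est <- coef(fit)
round(est, 3)
```

The coefficients groupoverweight and groupobese are the differences from the reference level "normal" at the same age, which is what "holding other variables constant" means here.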