Lecture Video W10D2 - Categorical predictors in MLR - part 2
Collecting Midterm Exams
Midterm exams were returned to face-to-face students.
Online students should email to arrange collecting their original exams, so that scanning is not needed.
Exams will be scanned and posted on Canvas only if necessary.
Categorical Predictors Overview
Continues the earlier discussion of categorical predictors.
Emphasizes the importance of defining dummy variables.
Explains the coding mechanism used for categorical predictors.
Dummy Variables
A categorical predictor (e.g., race) generally requires multiple dummy variables, one fewer than the number of categories.
Example: race with 3 categories requires 2 dummy variables, and testing the overall effect of race means testing both coefficients jointly.
Hypotheses:
Null Hypothesis: Both beta coefficients are zero (no effect).
Alternative Hypothesis: At least one beta is non-zero (some effect).
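A minimal sketch of what this dummy coding looks like, using a small made-up race vector (the category names are assumptions for illustration):

```r
# Hypothetical 3-category race variable
race <- factor(c("White", "Black", "Asian", "White", "Asian"))

# model.matrix() shows the coding R would use in a regression:
# an intercept plus one dummy (0/1) column per non-reference category,
# so 3 categories -> 2 dummy variables.
model.matrix(~ race)
```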
Factor Function in R
The factor() function is crucial for defining categorical variables:
Wrap categorical predictors with factor() in regression models.
R then creates the dummy variables automatically (see the sketch below).
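A minimal sketch of this, using simulated data with hypothetical variable names (sbp, age, bmi_grp):

```r
set.seed(1)
dat <- data.frame(
  sbp     = rnorm(100, mean = 120, sd = 15),   # outcome (e.g., systolic BP)
  age     = rnorm(100, mean = 50, sd = 10),
  bmi_grp = sample(1:3, 100, replace = TRUE)   # group coded 1/2/3
)

# Wrapping the coded group in factor() tells R to build the dummy variables
fit <- lm(sbp ~ age + factor(bmi_grp), data = dat)
summary(fit)   # one coefficient per non-reference BMI group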
Example of the Factor Function
Given a vector of the numeric values 1, 2, 3:
Applying factor() makes R treat them as categories instead of numbers.
levels() shows the categories that were defined.
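A short sketch of the idea:

```r
x <- c(1, 2, 3, 2, 1, 3)   # numeric codes
xf <- factor(x)            # now treated as categories, not numbers
levels(xf)                 # "1" "2" "3"
```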
Modifying Categorical Variables
Use the labels argument of factor() to change the category names.
Verify the conversion with is.numeric() and is.factor().
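A sketch using a hypothetical 1/2 gender coding:

```r
gender   <- c(1, 2, 2, 1, 1)
gender_f <- factor(gender, levels = c(1, 2), labels = c("Male", "Female"))

is.numeric(gender)    # TRUE  - the original vector is still numeric
is.factor(gender_f)   # TRUE  - the converted vector is categorical
levels(gender_f)      # "Male" "Female"
```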
Analyzing Categorical Data
The table() function summarizes the frequency of each category.
Example with gender: shows the counts of males and females.
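Continuing the hypothetical gender example:

```r
gender_f <- factor(c(1, 2, 2, 1, 1), labels = c("Male", "Female"))
table(gender_f)   # frequency of each category (here: Male 3, Female 2)
```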
Visualization
Bar plots and table functions provide visual representation of frequencies.
Cross-tabulation with table() gives joint frequency distributions (e.g., gender vs. blood pressure group).
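A sketch of both, with a made-up BP-group vector:

```r
gender_f <- factor(c(1, 2, 2, 1, 1), labels = c("Male", "Female"))
bp_grp   <- factor(c("Normal", "High", "Normal", "High", "Normal"))

barplot(table(gender_f))   # bar heights are the category frequencies
table(gender_f, bp_grp)    # two-way table: joint distribution
```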
Summary Functions
The summary() function provides numeric summaries for quantitative predictors and frequency counts for categorical ones.
Ensure categorical variables such as the BP group are defined as factors to avoid misleading summaries.
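A sketch of the difference, with hypothetical columns:

```r
dat_bp <- data.frame(
  age    = c(34, 51, 45, 62, 58),
  bp_grp = c(1, 2, 1, 2, 2)       # BP group coded 1/2
)
summary(dat_bp)   # bp_grp is summarized as a number (misleading)

dat_bp$bp_grp <- factor(dat_bp$bp_grp, labels = c("Normal", "High"))
summary(dat_bp)   # now reported as category counts
```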
Box Plots and Residuals
Box plots are effective for displaying distributions of a continuous variable against a categorical variable.
Useful in assumption checking after model fitting—e.g., residuals vs. categorical predictors.
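A minimal sketch with simulated data:

```r
set.seed(2)
grp <- factor(sample(c("A", "B", "C"), 60, replace = TRUE))
y   <- rnorm(60, mean = as.numeric(grp))   # outcome that differs by group

fit <- lm(y ~ grp)
boxplot(resid(fit) ~ grp,
        xlab = "Group", ylab = "Residuals")   # check spread and centering by group
```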
Fitting Models in R
Fitting multiple linear regression models:
Example: Model comparisons using ANOVA (Analysis of Variance).
The models being compared differ in which predictors they include, which defines the hypothesis being tested (see the sketch below).
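A sketch of the two-model setup, reusing the hypothetical sbp/age/bmi_grp variables from earlier:

```r
set.seed(3)
dat <- data.frame(
  sbp     = rnorm(120, mean = 120, sd = 15),
  age     = rnorm(120, mean = 50, sd = 10),
  bmi_grp = sample(1:3, 120, replace = TRUE)
)

fit_reduced <- lm(sbp ~ age, data = dat)                    # without BMI group
fit_full    <- lm(sbp ~ age + factor(bmi_grp), data = dat)  # with BMI group dummies
```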
ANOVA for Nested Models
Appropriate for jointly testing the coefficients of several dummy variables:
Null hypothesis: the reduced and full models do not differ (i.e., the additional beta coefficients are zero).
Framework: fit the two nested models and compare them, paying attention to missing values and handling them accordingly.
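Continuing the sketch above, the nested-model comparison is one anova() call:

```r
# H0: the BMI-group coefficients are all zero (the two models do not differ)
anova(fit_reduced, fit_full)   # a small p-value would lead to rejecting H0
```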
Missing Values Handling
Remove incomplete observations so that both models are fit to the same set of observations; otherwise the ANOVA comparison fails.
Work with a sub-dataset containing only the necessary columns, with missing values removed, to simplify the analysis.
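A sketch of one way to do this, assuming a data frame dat with the hypothetical columns used above:

```r
vars   <- c("sbp", "age", "bmi_grp")   # only the columns either model needs
dat_cc <- na.omit(dat[, vars])         # drop rows with missing values in those columns

fit_reduced <- lm(sbp ~ age, data = dat_cc)
fit_full    <- lm(sbp ~ age + factor(bmi_grp), data = dat_cc)
anova(fit_reduced, fit_full)           # both models now use the same observations
```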
Conclusion of Hypothesis Testing
ANOVA p-values determine whether to reject the null hypothesis.
Beta coefficients allow for interpretation of average differences between categorical groups (e.g., BMI groups).
Significance and effect sizes must be interpreted with the other variables in the model held constant.
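Continuing the sketch above, the coefficient table is where these interpretations come from:

```r
# Each factor(bmi_grp) coefficient estimates the average difference in the
# outcome between that BMI group and the reference group, holding age constant.
summary(fit_full)$coefficients
```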