Midterm exams returned for face-to-face students.
Online students should arrange by email to collect their original exams so that scanning can be avoided.
Scanning and posting on Canvas will be done if necessary.
Focus on categorical predictors previously discussed.
Importance of defining dummy variables.
Mechanism explained for coding with categorical predictors.
Categorical predictors (e.g., race) often require multiple dummy variables based on the number of categories.
Example: Race as a predictor with 3 categories requires 2 dummy variables (one category serves as the reference), so a hypothesis test for race involves two beta coefficients.
Hypotheses:
Null Hypothesis: Both beta coefficients are zero (no effect).
Alternative Hypothesis: At least one beta is non-zero (some effect).
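As a minimal sketch of the dummy coding behind these hypotheses (H0: beta1 = beta2 = 0 versus H1: at least one is non-zero), the design matrix for a 3-category predictor can be inspected with model.matrix(). The race values "A", "B", "C" are made up for illustration.

```r
# Hypothetical data: a race variable with three categories.
race <- factor(c("A", "B", "C", "A", "B", "C"))

# model.matrix() shows the design matrix R builds for a model formula.
X <- model.matrix(~ race)
print(X)

# The columns are (Intercept), raceB and raceC.
# Level "A" is the reference category, so only 2 dummy columns are needed;
# beta1 and beta2 in the hypotheses are the coefficients of raceB and raceC.
colnames(X)
```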
Use of the factor() function is crucial for defining categorical variables:
Wrap categorical predictors with factor() in regression models.
This automatically handles dummy variable creation.
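The difference that factor() makes inside a model formula can be sketched as follows. The data frame d, its column names, and the simulated outcome are assumptions for illustration only.

```r
# Hypothetical data: group is stored as the numbers 1, 2, 3.
set.seed(1)
d <- data.frame(
  group = rep(1:3, each = 10),
  y     = rnorm(30)
)

# Without factor(): group is treated as numeric, giving a single slope.
fit_numeric <- lm(y ~ group, data = d)

# With factor(): R creates 2 dummy variables for the 3 categories.
fit_factor <- lm(y ~ factor(group), data = d)

names(coef(fit_numeric))  # "(Intercept)" "group"
names(coef(fit_factor))   # "(Intercept)" "factor(group)2" "factor(group)3"
```

The first model forces equal steps between groups 1, 2 and 3; the second estimates a separate difference from group 1 for each of the other groups.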
Given a vector with numerics (1, 2, 3):
Using factor() helps R treat them as categories instead of numerics.
levels() shows the categories defined.
Use the labels argument in the factor() function to change category names.
Verify the conversion with is.numeric() and is.factor().
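These steps can be sketched on a small vector. The label names "low", "medium" and "high" are made up for illustration.

```r
x <- c(1, 2, 3, 2, 1, 3)
is.numeric(x)   # TRUE

# Convert to a factor so R treats the values as categories.
f <- factor(x)
is.numeric(f)   # FALSE
is.factor(f)    # TRUE
levels(f)       # "1" "2" "3"

# Rename the categories with the labels argument (names are hypothetical).
g <- factor(x, levels = c(1, 2, 3), labels = c("low", "medium", "high"))
levels(g)       # "low" "medium" "high"
```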
The table() function summarizes the frequencies of each category.
Example with gender: Shows counts of males and females.
Bar plots built from table() output provide a visual representation of the frequencies.
Cross-tabulation with table() gives joint distributions (e.g., gender vs. blood pressure group).
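A short sketch of one-way and two-way frequency tables follows. The gender and bp_group vectors are invented data, not the data set used in class.

```r
# Hypothetical data for eight subjects.
gender   <- factor(c("M", "F", "F", "M", "F", "M", "F", "F"))
bp_group <- factor(c("normal", "high", "normal", "high",
                     "normal", "normal", "high", "normal"))

# One-way table: counts of females and males.
counts <- table(gender)
counts

# Bar plot of the same frequencies.
barplot(counts, xlab = "Gender", ylab = "Frequency")

# Two-way table: joint distribution of gender and blood pressure group.
cross <- table(gender, bp_group)
cross
```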
The summary() function reports summary statistics for numeric predictors and frequencies for categorical ones.
Ensure categorical variables like BP groups are properly defined as factors to avoid misleading summaries.
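The effect of a missing factor() conversion on summary() can be sketched as follows, assuming a hypothetical data frame in which the BP group is stored as the codes 1, 2, 3.

```r
# Hypothetical data: bp_group holds category codes, not measurements.
d <- data.frame(
  age      = c(34, 51, 47, 29, 62, 45),
  bp_group = c(1, 2, 3, 2, 1, 2)
)

# Misleading: summary() reports a mean and quartiles for the codes.
summary(d$bp_group)

# After conversion, summary() reports a frequency for each category.
d$bp_group <- factor(d$bp_group)
summary(d$bp_group)

# Whole data frame: numeric summary for age, counts for bp_group.
summary(d)
```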
Box plots are effective for displaying distributions of a continuous variable against a categorical variable.
They are also useful for assumption checking after model fitting, e.g., plotting residuals against a categorical predictor.
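Both uses of box plots can be sketched on simulated data. The variable names sbp, age and bmi_group and the simulated relationship are assumptions for illustration.

```r
# Simulated data: 45 subjects in three BMI groups.
set.seed(2)
d <- data.frame(
  bmi_group = factor(rep(c("normal", "overweight", "obese"), each = 15)),
  age       = round(runif(45, 20, 70))
)
d$sbp <- 100 + 0.4 * d$age + rnorm(45, sd = 8)   # simulated outcome

# Distribution of a continuous variable within each category.
boxplot(sbp ~ bmi_group, data = d, xlab = "BMI group", ylab = "SBP")

# Assumption check: residuals of a fitted model by category.
fit <- lm(sbp ~ age + bmi_group, data = d)
boxplot(resid(fit) ~ d$bmi_group, xlab = "BMI group", ylab = "Residuals")
```

In the residual plot, boxes with a similar spread across groups are consistent with the constant-variance assumption.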
Fitting multiple linear regression models:
Example: Model comparisons using ANOVA (Analysis of Variance).
Models may differ in predictors for hypothesis testing.
This approach tests the effects of several variables at once:
Null Hypothesis: The extra predictors in the larger model have no effect (i.e., their beta coefficients are all zero), so the larger model fits no better than the smaller one.
Framework: Fit two models and compare them, paying attention to missing values and handling them accordingly.
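Remove incomplete observations so that both models are fit to the same set of rows; otherwise anova() stops with an error because the models were fitted to datasets of different sizes.
Create a subset of the data containing only the necessary columns before removing incomplete rows, so that observations are not dropped because of missing values in unrelated columns.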
The p-value of the ANOVA F-test determines whether to reject the null hypothesis (reject when it falls below the chosen significance level).
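The model-comparison workflow above can be sketched as follows. The data are simulated and the column names (y, age, race, note) are hypothetical; the note column stands in for an unrelated column with missing values.

```r
# Simulated data with missing values.
set.seed(3)
n <- 60
d <- data.frame(
  id   = 1:n,
  age  = round(runif(n, 20, 70)),
  race = factor(rep(c("A", "B", "C"), times = 20)),
  note = NA                       # unrelated column, entirely missing
)
d$y <- 50 + 0.3 * d$age + rnorm(n, sd = 5)
d$race[c(4, 17)] <- NA            # two missing values in a predictor

# Keep only the needed columns, then drop incomplete rows.
# (na.omit(d) on the full data would drop every row because of note.)
sub <- na.omit(d[, c("y", "age", "race")])

# Both models are fit to the same 58 complete observations.
reduced <- lm(y ~ age, data = sub)
full    <- lm(y ~ age + race, data = sub)

# F-test of H0: both race coefficients are zero.
cmp <- anova(reduced, full)
cmp
cmp[["Pr(>F)"]][2]   # p-value for the comparison
```

Fitting the two models to d directly would use 60 rows for the reduced model and 58 for the full model, which is the situation that makes anova() fail.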
The beta coefficient of each dummy variable is interpreted as the average difference in the outcome between that category and the reference category (e.g., between BMI groups).
Each coefficient and its significance must be interpreted with the other variables in the model held constant.
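A minimal sketch of reading such coefficients, using simulated data with built-in group differences. The names sbp, age and bmi_group and the shift values are assumptions for illustration.

```r
# Simulated data: "normal" is listed first, so it is the reference level.
set.seed(4)
d <- data.frame(
  bmi_group = factor(rep(c("normal", "overweight", "obese"), each = 20),
                     levels = c("normal", "overweight", "obese")),
  age       = round(runif(60, 20, 70))
)

# Simulated outcome with group shifts of 0, 5 and 10 built in.
shift <- c(normal = 0, overweight = 5, obese = 10)
d$sbp <- 100 + 0.4 * d$age +
  unname(shift[as.character(d$bmi_group)]) + rnorm(60, sd = 3)

fit <- lm(sbp ~ age + bmi_group, data = d)
summary(fit)$coefficients

# bmi_groupoverweight: estimated average difference in sbp between the
# overweight and normal (reference) groups for subjects of the same age.
# bmi_groupobese: the same comparison for obese versus normal.
```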