Comprehensive Study Notes on Confounding Variables and Control Variables
Confounding Variables
Confounding Variables:
Definition: Any variable that distorts the relationship between a main effect (predictor variable) and a response variable.
Importance: They can lead to misconceptions about the relationship between predictor(s) and response if not identified accurately.
Motivation: Confounding Variables
Scenario: Use of the number of reviews of a product to predict total sales.
Question: What kind of relationship exists?
Answer: The relationship tends to be __.
Noteworthy Trend: As the number of reviews increases, total sales tend to .
Is it a causal relationship?
Answer: __.
Insight: Customers write reviews a product; more likely that __.
Key point: Customers who leave reviews are generally more __, which may not impact total sales.
Potential issues: ___ not being considered in this analysis may be driving sales instead.
Question: What variables may be causally related to both number of reviews and total sales?
Answers:
_.
_ (likes, shares, etc.).
Spending on __.
How to determine if the relationship between reviews and sales is influenced by other variables?
Answer: Check for .
Confounding Variable in Depth
Definition: Confounding variable is a common cause of both the predictor (main effect) and the response variable, leading to a misinterpretation of their relationship.
Observed Relationship:
The relationship observed between variables 𝑋 and 𝑌 exists because confounding variable 𝐶 is related to both 𝑋 and 𝑌 but was not included in the initial model.
Notably: 𝐶 confounds the relationship if the effect of 𝑋 differs significantly when 𝐶 is included in the model.
Impact of a Confounding Variable
Scenario: Including advertising spending in the model as a predictor.
Question: What happened to the slope coefficients when including advertising spending with reviews?
Answers:
Reviews: Coefficient shifted from to ___.
Advertising: __ predictor that the model, likely the relationship between reviews and sales.
Example of Confounding Variables
Advertising as a Confounder:
Reasoning:
Sales tend to as reviews increase.
Advertising spending correlates with .
More advertising boosts , attracting customers and potentially leading to more reviews.
With more people , a greater number write reviews after purchases.
After including advertising, the relationship between reviews and sales appears .
Mitigating Impact of a Confounding Variable: Include it in the model as a .
Control Variable
Definition: Control variable is a variable related to both the predictor(s) and response, included in regression to accurately assess the relationship among the variables of interest.
Difference from Confounding Variable:
Confounding Variable: Distorts the relationship between 𝑋 and 𝑌.
A confounding variable that remains unmapped influences the relationship as its impact can’t be computed.
Control Variable: Adjusts for significant relationships and facilitates accurate assessments between 𝑋 and 𝑌.
Control variable that can be measured becomes a control when included in the regression.
Importance of Control Variables
Reasons:
To avoid misleading conclusions about relationships.
Omitting a control variable may mislead to believe a significant correlation exists between predictors and responses, possibly indicating an indirect relationship.
To accurately reflect the effect of the main predictor on the response.
Omitting may result in incorrectly assigning significance to predictors, distorting predictions.
To optimize the model and improve predictive power.
Even if a control variable is not the primary interest, its inclusion enhances model robustness and reliability.
When to Include a Control Variable
Scenario: Let 𝑋 be the primary variable, 𝑌 be the response variable, and 𝐶 be control variable.
Criteria for Inclusion:
Strong Correlation: When there is a strong correlation between 𝑋 and 𝑌 corrected by including 𝐶.
This usually indicates that 𝐶 is confounding.
Effect Modification: When the effect of 𝑋 on 𝑌 is modified (usually increased) by including 𝐶 but remains significant.
Model Improvement: When incorporating 𝐶 refines model predictions and reduces error, even if the impact is minimal.
Examples of Identifying Confounders
Scenario 1: Relationship between years of school (𝑋), income (𝑌, in thousands), and IQ (𝐶).
Is IQ a Confounder? Answer: ____.
Relationship Insight: Education as a predictor of income, IQ correlates highly with both.
Scenario 2: Analyzing study hours (𝑋), GPA (𝑌), and sleep (𝐶).
Is Sleep a Confounder? Answer: ____.
Insight on Correlations:
Study hours are a strong predictor of GPA.
Sleep Hours is somewhat correlated with GPA
Sleep and study hours are only weak correlation measured as only (𝑟 = 0.16 ).
Determining If a Variable is a Confounder
Steps to Assess:
Fit the regression model of 𝑌 = .
Obtain slope estimate .
Calculate a 95% confidence interval for the slope.
Fit the regression model .
Obtain slope estimate with 𝐶 included as predictor.
Compare with the 95% confidence interval for .
Results Interpretation:
If contained in C.I.:
𝐶 is not confounder, and its model inclusion is not imperative but optional.
If not contained in C.I.:
𝐶 is a confounder as and are meaningfully different and necessary for model accuracy.
Scenario Examples for Determining Confounding
Scenario #1: Reviews and Advertising Spending
Existing Regression Equation:
Original slope for reviews: eta1 = _
Confidence interval C.I. = ___ = __.
Inference based on analysis:
Advertising spending ____ the relationship, including __ impacts the __.
Should number of reviews remain in the model? Answer: .
Scenario #2: Years of Education and IQ
Existing Regression Equation:
Original slope for education: eta1 = _
Confidence Interval C.I. = ___ = __.
Scenario #3: Study Hours and GPA
Existing Regression Equation:
Original slope for study hours: eta1 = _
Confidence Interval C.I. = ___ = __.
Comparing Confounders and Control Variables
Venn Diagram Analysis: Controls: ; Confounders:
Not Explicitly Discussed:
Related to both 𝑋 and 𝑌, may influence the slope of the initial predictor.
Not Control Variables:
Difficult to quantify, tough to include in modeling processes.