What is regression a powerful tool for?
epidemiology
What are Directed Acyclic Graphs (DAGs)? What are these graphs made of?
a simple tool to communicate causal relationships; they are made of nodes and edges
nodes
variables we are studying
edges
represent a causal relationship between two variables
What CAN and CANNOT DAGs have?
they can be made arbitrarily complex but cannot contain a cycle
Why can’t DAGs contain a cycle?
You're often modeling something like cause and effect. One event can lead to another, but it can't eventually lead back to the first event, which keeps the flow of data and processes clear.
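The no-cycle rule can be checked mechanically. The sketch below is only an illustration: the helper name `is_acyclic` and the choice of edges (borrowed from the caloric intake example used elsewhere in this set) are assumptions. It relies on Python's standard `graphlib` module (Python 3.9+), which raises `CycleError` when the graph cannot be ordered from causes to effects.

```python
from graphlib import TopologicalSorter, CycleError

def is_acyclic(edges):
    """Return True if the directed graph given as (cause, effect) pairs has no cycle."""
    # TopologicalSorter expects a mapping: node -> set of its predecessors.
    predecessors = {}
    for cause, effect in edges:
        predecessors.setdefault(effect, set()).add(cause)
        predecessors.setdefault(cause, set())
    try:
        # A full cause-before-effect ordering exists only when there is no cycle.
        list(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False

# Illustrative edges (hypothetical choice): intake -> metabolism -> weight.
valid_dag = [("caloric intake", "metabolism"), ("metabolism", "weight")]
# Adding an edge back to the start creates a cycle, so this is no longer a DAG.
with_cycle = valid_dag + [("weight", "caloric intake")]

print(is_acyclic(valid_dag))   # True
print(is_acyclic(with_cycle))  # False
```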
How do we quantify the strength of the causal relationship in DAGs?
regression and the weight coefficient
What is the difference between a regression and DAG?
DAGs help us visualize a simple relationship and a regression model quantifies it
what does the weight coefficient tell us?
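the direction and size of the association between the input and the outcome; whether that association is statistically significant is judged from the coefficient's p-value or confidence interval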
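As a rough illustration of what a weight coefficient is, the sketch below fits a one-variable least-squares line in plain Python. The function name `fit_line` and the toy data are assumptions made for the example; real analyses would normally use a statistics library that also reports standard errors and p-values.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + weight * x (single input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    weight = sxy / sxx
    intercept = mean_y - weight * mean_x
    return intercept, weight

# Toy data (hypothetical) generated exactly from y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

intercept, weight = fit_line(xs, ys)
print(intercept, weight)  # 1.0 2.0
```

Here the weight of 2 says that each one-unit increase in the input goes with a two-unit increase in the outcome.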
main effect
causal relationship that is of primary interest
Covariates
variable that is not the main effect but may have an effect on the outcome (does the link between smoking and lung cancer still exist when we take gender into account?)
Mediators
an intermediate variable that explains the process by which an exposure leads to an outcome, i.e. why and how an effect happens (for example, the impact of caloric intake (X) on weight (Y) happens through metabolism (M))
confounders
third variable that affects both the main variable and the outcome, which can mess up the real connection between them. (if we’re studying whether owning a lighter is linked to lung cancer, smoking is a confounder.)
effect modification
a variable that modifies the causal relationship between the main effect and the outcome; the main effect has a different impact in different circumstances. (we might explore whether education’s impact on income varies by race.)
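One common way to express effect modification in a regression is an interaction term. The symbols below are assumptions for illustration: $Y$ is the outcome, $X$ the main effect, $Z$ the modifying variable, and $\varepsilon$ the random error.

```latex
Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 (X \cdot Z) + \varepsilon
\qquad\Longrightarrow\qquad
\frac{\partial Y}{\partial X} = \beta_1 + \beta_3 Z
```

If $\beta_3 \neq 0$, the effect of $X$ on $Y$ depends on the value of $Z$, which is what effect modification means. Dropping the interaction term leaves $Z$ as an ordinary covariate.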
How does the regression model examine how off our predictions were?
bias (how far off our predictions were)
Variance (how much the model’s predictions change when we use different data sets)
Random error (natural unpredictability)
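These three pieces correspond to the standard decomposition of expected squared prediction error. The notation is an assumption for illustration: $\hat{f}(x)$ is the model's prediction, $f(x)$ the true relationship, and $\sigma^2$ the variance of the random error.

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
= \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{random error}}
```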
Reducing bias ____ variance
increases
high bias leads to
underfitting
underfitting
when the model looks smooth but is too simple to capture true patterns
high variance and low bias leads to
overfitting
overfitting
the model is too complex and fits the training set too closely, including its noise, so it does not generalize to new data (like reading a book but not being able to summarize it: works well on known examples but not on new ones)
How do we handle overfitting and underfitting?
train-test technique
train-test split
splitting the data into two subsets, in proportions that depend on the data set's size and requirements; one subset is used for training and the other is saved for testing
example of train-test split
Mean Squared Error in Linear Regression
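A minimal sketch of that example in plain Python, under stated assumptions: the helper names (`train_test_split`, `fit_line`, `mean_squared_error`), the 80/20 split, the seed, and the synthetic data are all hypothetical choices for illustration, not a prescribed method.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows reproducibly and hold out a fraction for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def fit_line(rows):
    """Least-squares fit of y = intercept + slope * x on (x, y) rows."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_squared_error(rows, intercept, slope):
    """Average squared gap between observed y and predicted y."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in rows) / len(rows)

# Synthetic data (assumption): y = 2x + 1 plus a small alternating disturbance.
data = [(x, 2 * x + 1 + (0.5 if x % 2 else -0.5)) for x in range(20)]

train, test = train_test_split(data)
intercept, slope = fit_line(train)          # fit on the training subset only
train_mse = mean_squared_error(train, intercept, slope)
test_mse = mean_squared_error(test, intercept, slope)  # error on unseen rows
print(f"train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

A test MSE far above the training MSE is a common sign of overfitting; both being high suggests underfitting.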
Limitation of train-test split
may not capture full complexity and diversity of data (especially if data set is small) therefore the model may not generalize well
How do we handle the limitation of the train-test split?
cross-validation
cross-validation
provides a more accurate description of how the model will perform on unseen data than a single train-test split, by repeating the split so that each part of the data is used for testing once
What are the benefits of cross-validation?
reduces variability
maximizes data usage
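A hedged sketch of k-fold cross-validation in plain Python follows. The helper names, the choice of 5 folds, the seed, and the synthetic data are assumptions for illustration. Every row is used for testing exactly once and for training in the other folds, which is where the two benefits above come from.

```python
import random

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

def fit_line(rows):
    """Least-squares fit of y = intercept + slope * x on (x, y) rows."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_squared_error(rows, intercept, slope):
    return sum((y - (intercept + slope * x)) ** 2 for x, y in rows) / len(rows)

# Synthetic data (assumption): y = 2x + 1 plus a small alternating disturbance.
data = [(x, 2 * x + 1 + (0.5 if x % 2 else -0.5)) for x in range(20)]
random.Random(0).shuffle(data)  # shuffle once so folds are not ordered by x

scores = []
for train_idx, test_idx in k_fold_indices(len(data), k=5):
    train = [data[i] for i in train_idx]
    test = [data[i] for i in test_idx]
    intercept, slope = fit_line(train)            # refit on each training part
    scores.append(mean_squared_error(test, intercept, slope))

cv_mse = sum(scores) / len(scores)  # average test error across the folds
print(f"fold MSEs={[round(s, 3) for s in scores]}  mean={cv_mse:.3f}")
```

Averaging over the folds makes the estimate less sensitive to one lucky or unlucky split.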