What is the difference between predictive analytics and statistics?
Predictive analytics predicts future outcomes to aid in decision making, whereas statistics describes, explains, and infers relationships within data
What is overfitting?
The model learns the training data too well - including its noise - and it fails to generalize to new data
What happens if a model is overfitting?
The model will perform great on training data but poorly on test or unseen data (it tries to memorize instead of understanding the patterns)
T/F: Linear Regression = “Good Fit”
True
T/F: Polynomial Regression = "Overfit"
True
Do overfitted models use too many predictors?
Yes, and this causes poor model performance on new data
How do we avoid overfitting initially when creating a model?
We split our data into training, validation, and test sets
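A minimal sketch of that split in Python using scikit-learn's train_test_split; the synthetic data and the 60/20/20 ratio are just illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative synthetic data: 200 rows, 3 predictors, one numeric target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Carve off 40% as a temporary holdout, then split it in half:
# 60% train / 20% validation / 20% test.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
```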
What is simple linear regression?
A model used to predict one variable (y) using another variable (x)
What is the simple linear regression equation?
Y = B0 + B1(X) + e
What does Beta 0 represent in the simple linear regression equation?
It is the intercept, which is the value of y when x = 0
What does Beta 1 represent in the simple linear regression equation?
It is the slope describing how much y changes when x increases by 1 unit
What is the "e" in the simple linear regression equation?
(error) The leftover part that cannot be explained by x. Can be random noise or other factors
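A minimal sketch of estimating B0 and B1 from data with plain NumPy, using the closed-form least-squares formulas; the example x and y values are made up:

```python
import numpy as np

# Made-up example data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares slope and intercept for Y = B0 + B1(X) + e.
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope: change in y per 1-unit increase in x
b0 = y.mean() - b1 * x.mean()                         # intercept: value of y when x = 0

residuals = y - (b0 + b1 * x)                         # e: the leftover part x cannot explain
print(b0, b1)
```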
What does it mean to "fit a line"?
Finding the straight line that best represents the relationship between two variables
When fitting a line, what should you minimize?
The total (squared) vertical distance between each data point and the line
Why do we choose the MSE when fitting a regression line?
Minimizing it makes the prediction errors as small as possible, and squaring the errors keeps positive and negative errors from canceling out
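For reference, a tiny sketch of the MSE calculation; the arrays here are placeholder values:

```python
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0])     # observed values
y_predicted = np.array([2.8, 5.3, 6.9, 9.4])  # values the fitted line predicts

# Mean squared error: average of the squared residuals.
# Squaring keeps positive and negative errors from canceling out.
mse = np.mean((y_actual - y_predicted) ** 2)
print(mse)
```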
What does it mean to measure predictive error?
Focuses on how well the model predicts new data, not just how well it fits the training data
In a good regression model, how should the residuals (errors) look?
They should be symmetrically distributed around zero, meaning the model doesn't systematically under- or over-predict
(True/False) If residuals (errors) are skewed, then the model is considered biased.
True
Can we use regression to help fill in missing values in a data set?
Yes; the variable with the missing values becomes the dependent variable, and the regression equation is built from complete cases and then used to predict the missing entries
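A hedged sketch of that idea (regression imputation) with pandas and scikit-learn; the column names `income` and `age` and the values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data where 'income' has missing values.
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 29],
    "income": [38000, 52000, np.nan, 88000, np.nan, 45000],
})

complete = df.dropna(subset=["income"])              # complete cases build the equation
model = LinearRegression().fit(complete[["age"]], complete["income"])

missing = df["income"].isna()                        # rows to fill in
df.loc[missing, "income"] = model.predict(df.loc[missing, ["age"]])
```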
What happens when an interaction effect is present?
The impact of one factor depends on the level of the other factor (synergy)
What is the formula of a model before adding an interaction term?
Y = B0 + B1X1 + B2X2 + error
- where x1 and x2 each affect y independently
- x2 doesn't change the effect of x1 yet
What is the formula of a model after adding an interaction term?
Y = B0 + B1X1 + B2X2 + B3X1X2 + error
- x1 and x2 interact; their effects on y depend on each other
- changing x2 changes how x1 affects y
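A minimal sketch of fitting both versions with statsmodels' formula interface; the names x1, x2, y and the simulated data are assumptions, and `x1 * x2` in the formula expands to x1 + x2 + x1:x2:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data where the effect of x1 on y depends on x2.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 2 + 1.0 * df["x1"] + 0.5 * df["x2"] + 1.5 * df["x1"] * df["x2"] + rng.normal(size=300)

no_interaction = smf.ols("y ~ x1 + x2", data=df).fit()    # Y = B0 + B1X1 + B2X2 + error
with_interaction = smf.ols("y ~ x1 * x2", data=df).fit()  # adds the B3*X1*X2 term
print(with_interaction.params)                            # includes an 'x1:x2' coefficient
```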
What kind of regression models a non-linear relationship?
Polynomial Regression
Degree 0 (polynomial regression) is a…
constant line
Degree 1 (polynomial regression) is a…
straight line
Degree 3 (polynomial regression) is…
more complex curve (S-shaped)
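A small sketch comparing polynomial fits of different degrees with NumPy; the data here are simulated purely for illustration:

```python
import numpy as np

# Simulated curved relationship.
rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 50)
y = 1 - 0.5 * x + 0.3 * x**3 + rng.normal(scale=1.0, size=x.size)

# np.polyfit returns the coefficients of the best-fitting polynomial of a given degree.
for degree in (0, 1, 2, 3):
    coefs = np.polyfit(x, y, deg=degree)
    y_hat = np.polyval(coefs, x)
    print(degree, np.mean((y - y_hat) ** 2))  # training MSE shrinks as the degree grows
```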
(True/False) Dummy variables are also called "indicator variables."
True
Can regression models handle more than just numbers?
No; regression needs numeric inputs, so we turn non-numerical (categorical) variables into dummy variables
How many fewer dummy variables do we need than the number of categories?
One
(True/False) One group is always the reference group, where all dummy variables equal zero.
True
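A minimal sketch of dummy coding with pandas; the `color` column and its categories are made up, and `drop_first=True` keeps k-1 dummies so the dropped category becomes the reference group:

```python
import pandas as pd

# Hypothetical categorical predictor with three categories.
df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})

# Three categories -> two dummy variables; 'blue' (alphabetically first, so it is
# the one dropped) becomes the reference group where both dummies equal zero.
dummies = pd.get_dummies(df["color"], drop_first=True)
print(dummies)  # columns: 'green' and 'red'
```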
What are the four assumptions of linear regression models?
1. Linearity of Residuals
2. Normal Distribution of Residuals
3. Equal Variance of Residuals
4. Independence of Residuals
What does high variance equal?
Overfitting
What is bias?
The error from making the model too simple (underfitting)
What is an example of bias?
Using a straight line when the real relationship is curved
What is variance?
Error from making the model too complex (overfitting)
What is an example of variance?
Using a wiggly line that fits every training point perfectly but fails on new data
The Tradeoff: If you have a simple model, there is...
high bias, low variance
The Tradeoff: If you have a complex model, there is...
high variance, low bias
What is the "sweet spot" of a model?
A model that is complex enough to capture the real structure, but simple enough to generalize to new data.
What is validity in a regression model?
A model that measures what it's supposed to (ex: a job performance model uses skills or experience and not favorite color)
What are the traits of linearity of residuals?
1. You plot residuals vs predicted values, and they look like a random cloud around zero
2. A U-Shape indicates non-linearity
What are the traits of the Normal Distribution of residuals?
1. Residuals centered around zero (bell-shaped)
2. If the points on a Q-Q plot curve away from the straight line, the residuals are not normal
3. The data itself does not have to be normal, only the residuals
What are the traits of Equal Variance of residuals?
The spread of errors stays the same no matter what the prediction is.
What is homoscedasticity?
The residuals (errors) are evenly scattered across all levels of the predicted values, not getting wider or narrower as predictions change
If residuals form a flat, random band, that is...
homoscedasticity (good)
If residuals fan out or funnel (their spread changes as the predictions change), that is...
heteroscedasticity (bad)
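A quick sketch of the usual visual check, a residuals-vs-fitted plot with matplotlib; the fitted values and residuals are simulated stand-ins for what a real model would produce:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated output from some regression: fitted values and residuals.
rng = np.random.default_rng(3)
fitted = np.linspace(0, 10, 200)
residuals = rng.normal(scale=1.0, size=fitted.size)  # flat, random band -> homoscedastic

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted: a flat random band suggests equal variance")
plt.show()
```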
What are the traits of Independence of residuals?
The residuals should be independent of each other: one observation's error doesn't affect another's
Where is the independence of residuals most likely to be violated?
In time series or repeated-measures data, where values collected from the same source over time tend to be correlated
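One common numeric check for this (not from the cards above, just a hedged sketch) is the Durbin-Watson statistic in statsmodels; values near 2 suggest little autocorrelation, and the residual series here is simulated:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Simulated residual series with autocorrelation, as might appear in time-series data.
rng = np.random.default_rng(4)
noise = rng.normal(size=200)
residuals = np.empty(200)
residuals[0] = noise[0]
for t in range(1, 200):
    residuals[t] = 0.6 * residuals[t - 1] + noise[t]  # each error depends on the previous one

print(durbin_watson(residuals))  # well below 2 here, signaling positive autocorrelation
```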
What is Cook's Distance?
Measures how much a single data point influences the overall regression model (e.g., large values mean that the point has a strong effect on the fitted line)
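A minimal sketch of computing Cook's distance with statsmodels; the data are simulated, with one deliberately influential point added:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data plus one influential outlier.
rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 2 + 3 * x + rng.normal(scale=0.5, size=50)
x[-1], y[-1] = 4.0, -10.0                     # a point far from the overall pattern

X = sm.add_constant(x)                        # adds the intercept column
results = sm.OLS(y, X).fit()

cooks_d = results.get_influence().cooks_distance[0]  # one distance per observation
print(cooks_d.argmax(), cooks_d.max())               # the outlier dominates
```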
What is cross-validation?
1. Tests how well a model performs on unseen data
2. The dataset is split into K parts; the model is trained on K-1 parts and tested on the held-out part
3. The process repeats K times, so every part serves as the test set once
4. Helps avoid overfitting
5. Useful with small datasets
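A minimal sketch of K-fold cross-validation with scikit-learn; K=5 and the simulated data are just example choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Simulated data.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=100)

# 5-fold CV: train on 4 folds, score on the held-out fold, repeat 5 times.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")
print(-scores)         # per-fold MSE
print(-scores.mean())  # average MSE across folds
```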
How do you calculate error?
E = actual - predicted value
Degree 2 (polynomial regression) is a…
curve (U-shaped or inverted U)
What does high bias mean in R?
Your model is underfitting because it is not flexible enough to capture the true pattern
Why do we need to check the assumptions?
To get unbiased estimates, to make sure the model will perform well in prediction, and to keep inferences accurate (hypothesis tests, confidence intervals, etc.)