The most common statistical model is
the linear regression model. This model predicts the average Y-value, E(Y), for a specified value of X (NOT individual Y-values). The true intercept (in the whole population) is denoted β0, while the true slope (in the population) is denoted β1.
Y is called the
dependent or response variable, because the model predicts values of Y. In this sense, Y depends on, or responds to, X.
Since we regress the values of Y on X, Y is also referred to as the regressand. The primary purpose of the regression model is to understand and predict the behavior of Y based on X.
X is called the
independent or explanatory variable, because the model uses it to predict or explain Y. In data science contexts, X is often called the feature, and other times it is called the regressor because it forms the input for the regression model.
How is the equation for the linear regression model calculated?
For a given line, compute the error εi for each data point, then minimize these errors. Simply adding the errors would let positive and negative errors cancel, so to avoid this we square the errors and then add them up. The least-squares line (LSL) minimizes the sum of these squared errors. No other line has a smaller sum of squared errors.
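The idea on this card can be checked directly: a sketch with made-up data (the x and y values here are hypothetical) comparing the sum of squared errors of the line found by `lm()` against nearby alternative lines.

```r
# Hypothetical data; any small numeric vectors would do
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Sum of squared errors for a candidate line y = b0 + b1*x
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# The least-squares line found by lm()
fit <- lm(y ~ x)
b0 <- coef(fit)[1]
b1 <- coef(fit)[2]

# Any other line has a larger sum of squared errors
sse(b0, b1)          # the minimum
sse(b0 + 0.5, b1)    # larger
sse(b0, b1 + 0.1)    # larger
```

Nudging the intercept or slope in either direction always increases the SSE, which is exactly what "least squares" means.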
To make a scatterplot in R
plot(data$y ~ data$x)
To fit a linear model in R
model = lm(data$y ~ data$x)
The name of the model is arbitrary, but we need to give it a name so we can view the summary output and other model components.
model = lm(data$y ~ data$x)
summary(model)
Interpretation of the Coefficients - example
E(Average salary) = 30,030 + 0.807(Annual Cost)
The slope = 0.807 means that for each $1 increase in annual cost (X), average salary (Y) is predicted to increase by $0.807. The slope gives the predicted increase in Y per 1-unit increase in X.
The intercept gives the predicted, hypothetical value of average salary (Y) when cost (X) = 0.
The intercept should only be interpreted IF it has a practical meaning, and IF there is no extrapolation.
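The example equation above can be sketched in R to show that the slope really is the predicted change in Y per $1 of X (the intercept and slope values are taken from the card; the helper function name is made up):

```r
# Coefficients from the example card: E(salary) = 30030 + 0.807 * cost
intercept <- 30030
slope <- 0.807

# Hypothetical helper: predicted average salary at a given annual cost
pred_salary <- function(cost) intercept + slope * cost

# A $1 increase in cost raises the predicted salary by the slope, $0.807
pred_salary(10001) - pred_salary(10000)
```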
Extrapolation
occurs when we try to predict a Y-value for an X outside the range of the observed X-values. This can lead to wrong predictions and be very misleading.
Residuals
The error 𝜖𝑖 between the actual and predicted value is called the residual.
A positive residual implies the actual value lies above its mean, that is, above the trendline.
A negative residual implies the actual value lies below its mean, that is, below the trendline.
Residual = actual − predicted: first calculate the predicted value, then subtract it from the actual value.
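A quick sketch with made-up data, confirming that "actual minus predicted" matches the residuals R stores in the fitted model:

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.7)
model <- lm(y ~ x)

predicted <- fitted(model)          # predicted (fitted) values from the line
residuals_manual <- y - predicted   # actual minus predicted

# These match the residuals stored in the model object
all.equal(unname(residuals_manual), unname(resid(model)))

# For a least-squares line the residuals sum to (essentially) zero
sum(residuals_manual)
```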
The success of the model in predicting Y is measured by
R². A high R² indicates a strong relationship. R² gives the proportion of variation in Y that can be predicted (accounted for) by the model. 1-R² is the proportion of the variation in Y that is accounted for by other factors not in the current model.
Is R² affected by changes in units?
No, it is not, but slope is affected.
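This can be demonstrated with a short sketch (data made up): refitting the model after rescaling X from thousands of dollars to dollars leaves R² unchanged but shrinks the slope by the same factor of 1000.

```r
# Hypothetical data: x in thousands of dollars
x <- c(1, 2, 3, 4, 5)
y <- c(31, 33, 32, 36, 37)

fit_thousands <- lm(y ~ x)
fit_dollars   <- lm(y ~ I(x * 1000))  # same x, rescaled to dollars

# R^2 is unchanged by the change of units...
summary(fit_thousands)$r.squared
summary(fit_dollars)$r.squared

# ...but the slope per dollar is 1000 times smaller than the slope per thousand
coef(fit_thousands)[2] / coef(fit_dollars)[2]
```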
IF R² is 0.2508, what does that imply?
That only about 25% of the variation in Y can be predicted on the basis of X; the remaining 75% is accounted for by other factors not in the model.
A good model will have
small RSE, and a high R²
A weak model will have
large RSE, and a low R²
What conditions need to be in place for the linear model to provide optimal predictions?
Linearity - to fit a linear model, the trend should be linear (straight) rather than nonlinear (curved). The scatterplot should be straight, not curved, and the residual plot should show points randomly scattered around 0 (like a cloud), not curved.
Constant Variance - the variance of the residuals (vertical spread) should be constant and not show a fan shape.
Normality - a histogram of the residuals should show a Normal shape (unimodal and bell-shaped).
After fitting the model, we have to
plot the residuals. Ideally, the residuals would not show any pattern, only random scattered points.
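A minimal sketch of that check in R (data made up), plotting the residuals against the fitted values with a reference line at zero:

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.2, 3.8, 6.1, 7.9, 10.2, 11.8)
model <- lm(y ~ x)

# Residuals vs. fitted values; ideally a patternless cloud around 0
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # dashed reference line at zero
```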
If the residual plot shows a non-linear (curved) pattern, the linearity condition is
not satisfied, and some type of nonlinear curve, e.g. a quadratic, would provide better predictions.
If the residual plot shows a fan-shaped pattern this means
the condition for constant variance is not satisfied. A better model could be obtained by transforming the data, e.g. ln(Y)
QQ Plot in R - What it’s used for & command
This plots the observed values of the residuals vs. the theoretical values from a Normal distribution. If the residuals follow a Normal distribution, the points fall along a straight line (observed residuals ≈ theoretical Normal quantiles).
Commands:
qqnorm(model$residuals)
qqline(model$residuals)
T/F: We can never infer causation on the basis of a strong correlation alone.
TRUE!!
An observational study can show there is a link but
there could be an alternate path or lurking variable driving (causing) the trend
If you’re testing to see if a linear relationship between X and Y really exists..
You test this claim through a hypothesis test, which takes into account the inherent uncertainty from random sampling.
H0: β1 = 0 (no linear relationship exists, no trend, model not useful)
Ha: β1 ≠ 0 (linear relationship exists, trend exists, model useful)
If H0 is true, the line will be horizontal (no trend) and there is no linear relationship between X and Y, meaning the model would not be useful to predict Y.
If Ha is true, there will be a positive or negative trend (slope), and a linear relationship exists between X and Y
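In R, this test comes for free with the model summary: the coefficient table row for the slope reports its estimate, t statistic, and the p-value for H0: β1 = 0. A sketch with made-up, clearly linear data:

```r
# Hypothetical data with an obvious linear trend
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1)
model <- lm(y ~ x)

# The row named "x" holds the slope; its "Pr(>|t|)" entry is the
# p-value for the test of H0: beta1 = 0 against Ha: beta1 != 0
coefs <- summary(model)$coefficients
p_value <- coefs["x", "Pr(>|t|)"]
p_value  # a small p-value means we reject H0: the linear trend is real
```

A p-value below the chosen significance level (e.g. 0.05) leads to rejecting H0, i.e. concluding a linear relationship exists.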