SIMPLE LINEAR REGRESSION
A machine learning technique that predicts the value of a dependent variable from one independent variable
Ex: Predict the wage of an employee based on years of experience
Simple linear regression fits a best-fit line to the data, and the target variable's values are predicted from that line
y = b0 + b1 * x1
Equation of the line
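A minimal sketch of this in Python using scikit-learn's LinearRegression; the years-of-experience and wage figures are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience (x1) vs. wage (y).
X = np.array([[1], [2], [3], [5], [7], [10]])
y = np.array([30000, 35000, 40000, 52000, 61000, 80000])

model = LinearRegression().fit(X, y)

# Coefficients of the best-fit line y = b0 + b1*x1
print("b0 =", model.intercept_, "b1 =", model.coef_[0])

# Predict the wage of an employee with 4 years of experience.
print("predicted wage:", model.predict([[4]])[0])
```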
MULTIPLE LINEAR REGRESSION
Multiple Linear Regression is an extension of Simple Linear Regression in which the model uses more than one independent variable to make predictions.
It allows us to evaluate the relationship between two variables while controlling for the effects of the other variables
y = b0 + b1*x1 + ... + bn*xn
Equation for multiple linear regression
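The same sketch extended to two independent variables; the schooling figures are equally made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience (x1) and years of schooling (x2).
X = np.array([[1, 12], [3, 16], [5, 16], [7, 18], [10, 18]])
y = np.array([30000, 45000, 55000, 70000, 85000])

model = LinearRegression().fit(X, y)

# Coefficients of y = b0 + b1*x1 + b2*x2
print("b0 =", model.intercept_, "b1..bn =", model.coef_)
print("prediction for 4 yrs experience, 16 yrs schooling:", model.predict([[4, 16]])[0])
```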
Overfitting
Adding more independent variables to a multiple regression procedure does not necessarily make the regression better or its predictions more accurate; it can make them worse. ________
MULTICOLLINEARITY
More independent variables create more relationships among them; not only are the independent variables related to the dependent variable, they are also related to each other. _______
Correlations
Scatter plots
Simple regressions
To avoid multicollinearity and overfitting, we can examine the following
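A minimal sketch of those three checks with pandas; the predictor values are invented, and a high pairwise correlation (experience vs. age here) is the kind of signal that flags multicollinearity.

```python
import pandas as pd

# Hypothetical predictors.
df = pd.DataFrame({
    "experience": [1, 3, 5, 7, 10],
    "age":        [22, 25, 27, 30, 34],   # likely correlated with experience
    "education":  [12, 16, 16, 18, 18],
})

# 1. Correlations: pairwise correlation matrix of the predictors.
print(df.corr())

# 2. Scatter plots: visual check of pairwise relationships.
pd.plotting.scatter_matrix(df)

# 3. Simple regressions: regress one predictor on another; here the
#    squared correlation serves as the R^2 of that simple regression.
print("R^2 experience~age:", df["experience"].corr(df["age"]) ** 2)
```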
categorical variables
called dummy variables or indicator variables
Linearity
Homoscedasticity
Multivariate normality
Lack of Multicollinearity
Independence of errors
The 5 assumptions made for multiple linear regression
Lack of Multicollinearity
It is assumed that there is little or no _______ in the data. _______ generally occurs when there are high correlations between two or more predictor variables.
Multivariate normality
Multiple Regression assumes that the residuals are normally distributed.
Homoscedasticity
The variance of the errors should be constant, or at least roughly the same, across observations
Linearity
The relationship between dependent and independent variables should be linear.
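A minimal sketch of how two of these assumptions (normality and homoscedasticity of the residuals) might be checked, using statsmodels and scipy on made-up data.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Hypothetical data: experience vs. wage.
X = sm.add_constant(np.array([[1.0], [3], [5], [7], [10]]))
y = np.array([30000, 45000, 55000, 70000, 85000])

model = sm.OLS(y, X).fit()
residuals = model.resid

# Normality of residuals: a small Shapiro-Wilk p-value suggests
# the residuals are not normally distributed.
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# Homoscedasticity: residuals plotted against fitted values should
# show no pattern and roughly constant spread.
print(list(zip(model.fittedvalues, residuals)))
```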
Dummy Variable Trap
A condition in which two or more dummy variables are highly correlated, so that one can be predicted from the others. To solve the dummy variable trap, drop one of the dummy variables.
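A minimal sketch of dummy encoding in pandas; drop_first=True is one common way to sidestep the trap.

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# Encode the categorical variable as dummy (indicator) variables.
# drop_first=True drops one dummy column: with all three columns kept,
# any one of them is perfectly predictable from the other two, which
# is exactly the dummy variable trap.
dummies = pd.get_dummies(df["color"], drop_first=True)
print(dummies)
```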
LOGISTIC REGRESSION
used to model the relationship between one or more predictor variables and a response variable where the output is categorical.
it is a statistical technique commonly used in machine learning for classification tasks, where the goal is to predict the probability that an observation belongs to a certain class.
Linear regression
used when the response variable is continuous
linear regression attempts to find the line of best fit that minimizes the sum of the squared errors between the predicted and actual values
Logistic regression
response variable is binary (i.e., yes or no, 1 or 0)
logistic regression uses the logistic function to model the probability of a binary outcome.
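A minimal sketch of the logistic (sigmoid) function itself, with made-up coefficients, showing how a linear combination is squashed into a probability between 0 and 1.

```python
import numpy as np

def logistic(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -3.0, 0.8   # hypothetical coefficients
for x1 in [0, 2, 4, 6, 8]:
    print(x1, logistic(b0 + b1 * x1))
```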
Binary
Multinomial
Ordinal
TYPES OF LOGISTIC REGRESSION
Binary Logistic Regression
the response variable has only two possible outcomes, often denoted as 0 or 1
Multinomial Logistic Regression
the response variable has more than two categories but is still nominal (Ex: Red, Blue, Green)
Ordinal Logistic Regression
the response variable is ordinal, which means that it has a natural ordering (Ex: High school, College, Graduate School)
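A minimal sketch of the multinomial case with scikit-learn, which fits a multinomial model automatically when the target has more than two classes; the data are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: two features, three nominal classes.
X = np.array([[1, 2], [2, 1], [5, 6], [6, 5], [9, 1], [8, 2]])
y = np.array(["Red", "Red", "Blue", "Blue", "Green", "Green"])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[5, 5]]))        # most likely class
print(clf.predict_proba([[5, 5]]))  # one probability per class
```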
CONTINGENCY TABLES
a representation of the frequency of observations falling under various categories of two or more variables.
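A minimal sketch with pandas' crosstab on invented observations of two categorical variables.

```python
import pandas as pd

# Hypothetical observations.
df = pd.DataFrame({
    "gender":   ["Male", "Male", "Female", "Female", "Male", "Female"],
    "purchase": ["Yes", "No", "Yes", "Yes", "Yes", "No"],
})

# The contingency table counts how often each combination occurs;
# margins=True adds row and column totals.
print(pd.crosstab(df["gender"], df["purchase"], margins=True))
```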
CONDITIONAL PROBABILITY
defines the probability of a certain event happening, given that a certain related event is true or has already happened.
P(Purchase | Male) = total number of purchases by males / total number of males in the group
The conditional probability of a purchase, given the customer is male, is denoted as follows
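A worked example with made-up counts:

```python
# Hypothetical counts: 40 males in the group, 12 of whom purchased.
males = 40
male_purchases = 12

# P(Purchase | Male) = purchases by males / total males
print(male_purchases / males)  # 0.3
```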
ODDS RATIO
Odds of success for a group are defined as the ratio of the probability of success (true) to the probability of failure (false)
Ex: the ratio of the odds of success (purchase, in this case) between the two groups (male and female, in this case).
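Continuing the made-up numbers from above, a sketch of the odds ratio calculation:

```python
# Hypothetical purchase probabilities per group.
p_male = 0.30     # P(Purchase | Male)
p_female = 0.20   # P(Purchase | Female)

# Odds of success = P(success) / P(failure)
odds_male = p_male / (1 - p_male)         # ~0.429
odds_female = p_female / (1 - p_female)   # 0.25

# The odds ratio compares the odds of success between the two groups.
print(odds_male / odds_female)  # ~1.71
```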
DECISION BOUNDARY
the boundary that separates the two classes, typically represented as a straight line in two-dimensional space or a hyperplane in higher-dimensional space.
determined by the coefficients or weights of the logistic regression model.
the set of points where the model is equally likely to predict a positive or negative outcome.
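A minimal sketch on an invented 2-D dataset: after fitting, the boundary is the line where the linear combination of the coefficients equals zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up 2-D dataset with two classes.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
b0 = clf.intercept_[0]
b1, b2 = clf.coef_[0]

# The decision boundary is the set of points where
# b0 + b1*x1 + b2*x2 = 0, i.e. where the predicted probability is 0.5.
# In two dimensions that is the straight line x2 = -(b0 + b1*x1) / b2.
for x1 in [0, 2, 4, 6]:
    print(x1, -(b0 + b1 * x1) / b2)
```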
INVERSE LOGIT
In a graph of the logit function, the probabilities from 0 to 1 run along the x-axis, but we want the probabilities on the y-axis. We can achieve that by taking the inverse of the logit function
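A small sketch of the logit/inverse-logit pair, showing that the inverse recovers the original probability:

```python
import numpy as np

def logit(p):
    # logit: probability in (0, 1) -> any real number
    return np.log(p / (1 - p))

def inverse_logit(z):
    # inverse logit: any real number -> probability, p = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

p = 0.3
print(inverse_logit(logit(p)))  # recovers 0.3
```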
ESTIMATED REGRESSION EQUATION
The natural logarithm of the odds is equivalent to a linear function of the independent variables. Taking the antilog of the logit function allows us to find the estimated regression equation.
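Written out in the notation of the earlier equations, with b0 through bn as the coefficients:

ln(p / (1 - p)) = b0 + b1*x1 + ... + bn*xn

p = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn))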
Receiver Operating Characteristic (ROC) Curve
measures the performance of a classifier at all possible thresholds
The _____ ______ of a perfect classifier goes from the bottom left to the top left and then from the top left to the top right. On the other hand, if the ROC curve is just the diagonal line, then the model is doing random guessing. Any useful classifier should have an ROC curve falling between these two curves.
useful for comparing the performance of different classifiers over the same dataset. For example, suppose we have three classifiers A, B, and C and want to compare their respective ROC curves, as in the sketch below:
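A minimal sketch of that comparison with scikit-learn and matplotlib; the dataset is synthetic, and A, B, and C are arbitrary stand-ins (a logistic regression, a shallow decision tree, and a random guesser).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

classifiers = {
    "A": LogisticRegression(max_iter=1000),
    "B": DecisionTreeClassifier(max_depth=3, random_state=0),
    "C": DummyClassifier(strategy="uniform", random_state=0),  # random guessing
}

for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, scores)  # one point per threshold
    plt.plot(fpr, tpr, label=name)

plt.plot([0, 1], [0, 1], linestyle="--", label="diagonal (random)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```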
Area Under the Curve (AUC)
A perfect classifier has ________=1.0, and random guessing results in _______=0.5
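A small check of both endpoints with scikit-learn's roc_auc_score, on tiny made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1]
perfect = [0.1, 0.2, 0.8, 0.9]  # ranks every positive above every negative
print(roc_auc_score(y_true, perfect))  # 1.0

coinflip = [0.5, 0.5, 0.5, 0.5]  # uninformative constant score
print(roc_auc_score(y_true, coinflip))  # 0.5
```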