DATA MINING PRELIMS (REGRESSION)


1
New cards

SIMPLE LINEAR REGRESSION

A machine learning technique to predict values from one independent variable

Ex: Predict the wage of an employee based on years of experience

Simple linear regression creates the best-fit line for the data, and the values of the target variable are predicted based on that line

2
New cards

y = b0 + b1 * x1

Equation of the line
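
A minimal sketch of fitting such a line with scikit-learn; the wage and years-of-experience numbers below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: years of experience (x1) and wage (y)
years = np.array([[1], [3], [5], [7], [10]])
wage = np.array([30, 40, 55, 62, 80])

# Fit the best-fit line y = b0 + b1 * x1
model = LinearRegression().fit(years, wage)
print(model.intercept_, model.coef_)  # b0 and b1
print(model.predict([[6]]))           # predicted wage for 6 years of experience
```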

3
New cards

MULTIPLE LINEAR REGRESSION

Multiple Linear Regression is an extension of Simple Linear Regression, where the model depends on more than one independent variable for its predictions.

Allows us to evaluate the relationship between two variables while controlling for the effects of other variables

4
New cards

y = b0 + b1 * x1 + ... + bn * xn

Equation for multiple linear regression
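
The same idea sketched with more than one predictor; both columns and all values are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: columns are years of experience (x1) and certifications (x2)
X = np.array([[1, 0], [3, 1], [5, 1], [7, 2], [10, 3]])
y = np.array([30, 42, 55, 65, 85])  # wage

# Fit y = b0 + b1 * x1 + b2 * x2
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # b0 and [b1, b2]
```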

6
New cards

Overfitting

Adding more independent variables to a multiple regression does not necessarily make the regression better or its predictions more accurate; it can make it worse — ________

7
New cards

MULTICOLLINEARITY

More independent variables create more relationships among them: not only are the independent variables related to the dependent variable, they are also related to each other — _______

8
New cards

Correlations
Scatter plots
Simple regressions

To avoid multicollinearity and overfitting, we can do the following
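
A minimal sketch of the first two checks with pandas, on invented data; high pairwise correlations between predictors are a warning sign of multicollinearity.

```python
import pandas as pd

# Invented data: age is deliberately correlated with experience
df = pd.DataFrame({
    "experience": [1, 3, 5, 7, 10],
    "age":        [22, 25, 27, 30, 34],
    "wage":       [30, 42, 55, 65, 85],
})

print(df.corr())                # correlation matrix of all variable pairs
pd.plotting.scatter_matrix(df)  # scatter plot of every pair (needs matplotlib)
```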

9
New cards

categorical variables

In regression models, these are represented by what are called dummy variables or indicator variables
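
A sketch of dummy encoding with pandas; the color column is an invented example.

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# One 0/1 indicator column per category: color_Blue, color_Green, color_Red
print(pd.get_dummies(df, columns=["color"]))
```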

10
New cards

Linearity
Homoscedasticity
Multivariate normality
Lack of Multicollinearity
Independence of errors

5 assumptions to make for multiple linear regression

11
New cards

Lack of Multicollinearity

It is assumed that there is little or no _______ in the data. _______ generally occurs when there are high correlations between two or more predictor variables.

12
New cards

Multivariate normality

Multiple Regression assumes that the residuals are normally distributed.

13
New cards

Homoscedasticity

The variance of the errors should be constant, or at least roughly the same, across all observations

14
New cards

Linearity

The relationship between dependent and independent variables should be linear.

15
New cards

Dummy Variable Trap

A condition in which two or more dummy variables are highly correlated (one can be predicted from the others). The solution to the dummy variable trap is to drop one of the dummy variables.
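
A sketch of the fix, assuming the pandas encoding shown earlier: drop_first=True keeps k-1 indicator columns for k categories, so no dummy column is a perfect combination of the others.

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# drop_first=True drops one category (here color_Blue), avoiding the trap
print(pd.get_dummies(df, columns=["color"], drop_first=True))
```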

16
New cards

LOGISTIC REGRESSION

used to model the relationship between one or more predictor variables and a response variable, where the output variable is categorical.

a statistical technique commonly used in machine learning for classification tasks, where the goal is to predict the probability that an observation belongs to a certain class.

17
New cards

Linear regression

used when the response variable is continuous

regression attempts to find the line of best fit that minimizes the sum of the squared errors between the predicted and actual values

19
New cards

Logistic regression

response variable is binary (i.e., yes or no, 1 or 0)

logistic regression uses the logistic function to model the probability of a binary outcome.
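
A minimal sketch of binary logistic regression with scikit-learn; the browsing-hours data is invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: hours browsing (x) vs. purchase made, yes (1) / no (0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))  # [P(no purchase), P(purchase)]
print(clf.predict([[3.5]]))        # predicted class label
```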

20
New cards

Binary
Multinomial
Ordinal

TYPES OF LOGISTIC REGRESSION

21
New cards

Binary Logistic Regression

the response variable has only two possible outcomes, often denoted as 0 or 1

22
New cards

Multinomial Logistic Regression

the response variable has more than two categories but is still nominal (Ex: Red, Blue, Green)

23
New cards

Ordinal Logistic Regression

the response variable is ordinal, which means that it has a natural ordering (Ex: High school, College, Graduate School)

24
New cards

CONTINGENCY TABLES

a representation of the frequency of observations falling under various categories of two or more variables.
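
A sketch of a contingency table built with pandas.crosstab from invented observations.

```python
import pandas as pd

df = pd.DataFrame({
    "sex":      ["Male", "Male", "Female", "Female", "Male", "Female"],
    "purchase": ["Yes", "No", "Yes", "Yes", "Yes", "No"],
})

# Rows: sex categories; columns: purchase categories; cells: frequencies
print(pd.crosstab(df["sex"], df["purchase"], margins=True))
```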

25
New cards

CONDITIONAL PROBABILITY

defines the probability of a certain event happening, given that a certain related event is true or has already happened.

26
New cards

P(Purchase | Male) = total number of purchases by males / total number of males in the group

The conditional probability of a purchase, given the customer is male, is denoted as follows
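
A worked example of the formula with invented counts.

```python
# Invented counts from a hypothetical customer group
purchases_by_males = 40
total_males = 100

# P(Purchase | Male) = purchases by males / total males in the group
p_purchase_given_male = purchases_by_males / total_males
print(p_purchase_given_male)  # 0.4
```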

27
New cards

ODDS RATIO

The odds of success for a group are defined as the ratio of the probability of success (true) to the probability of failure (false)

Ex: the ratio of the odds of success (purchase, in this case) between the two groups (male and female, in this case)
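
A worked example with invented group probabilities.

```python
# Invented conditional probabilities of a purchase for each group
p_male = 0.40
p_female = 0.25

# odds = P(success) / P(failure)
odds_male = p_male / (1 - p_male)
odds_female = p_female / (1 - p_female)

# Odds ratio: odds of success for one group relative to the other
print(odds_male / odds_female)  # 2.0
```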

28
New cards

DECISION BOUNDARY

the boundary that separates the two classes, typically represented as a straight line in two-dimensional space or a hyperplane in higher-dimensional space.

determined by the coefficients or weights of the logistic regression model.

the set of points where the model is equally likely to predict a positive or negative outcome.
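
A sketch for the one-predictor case, with invented coefficients: the boundary is where the log-odds b0 + b1 * x equal zero, i.e. where the predicted probability is exactly 0.5.

```python
# Invented logistic regression coefficients
b0, b1 = -3.0, 1.5

# Solve b0 + b1 * x = 0 for the boundary point
boundary_x = -b0 / b1
print(boundary_x)  # 2.0 -- predictions flip between the classes here
```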

29
New cards

INVERSE LOGIT

In a logit function graph, the 0-to-1 range runs along the x-axis, but we want the probabilities on the y-axis. We can achieve that by taking the inverse of the logit function
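
A minimal sketch of the inverse logit, better known as the sigmoid function.

```python
import math

def inverse_logit(z):
    """Map any real number z to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

print(inverse_logit(0))   # 0.5
print(inverse_logit(2))   # ~0.88
print(inverse_logit(-2))  # ~0.12
```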

30
New cards

ESTIMATED REGRESSION EQUATION

The natural logarithm of the odds (the logit) is equivalent to a linear function of the independent variables. Taking the antilog of the logit function allows us to find the estimated regression equation.

31
New cards

Receiver Operating Characteristic (ROC) Curve

measures the performance of a classifier at all possible thresholds

The _____ ______ of a perfect classifier would have a line that goes from bottom left to top left and top left to top right. On the other hand, if the ROC curve is just the diagonal line then the model is just doing random guessing. Any useful classifier should have an ROC curve falling between these two curves.

useful for comparing the performance of different classifiers over the same dataset. For example, three classifiers A, B, and C can be compared by plotting their respective ROC curves on the same axes.

32
New cards

Area Under the Curve (AUC)

A perfect classifier has ________=1.0, and random guessing results in _______=0.5
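
A sketch computing ROC points and AUC with scikit-learn, using invented labels and predicted scores.

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # predicted probabilities

# One (false positive rate, true positive rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)

# AUC: 1.0 for a perfect classifier, 0.5 for random guessing
print(roc_auc_score(y_true, y_score))
```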