Classification Models - Predictive Analytics Final Exam


64 Terms

1
New cards

What is a true positive?

We predicted yes (they have cancer) and in reality they do have cancer

2
New cards

What is a true negative?

We predicted no (they do not have cancer) and in reality they do not have the disease

3
New cards

What are the first four steps of building a classification model?

1. Purpose of Analysis
2. Collect & Prepare the Data
3. Explore the Data
4. Split into training (60-70%), validation (15-20%), and test (15-20%) sets

4
New cards

What is the training set?

A subsection of a dataset from which the machine learning algorithm uncovers or "learns" relationships between the features and the target variable

5
New cards

What is the validation set?

a subset of a dataset to which we apply the machine learning algorithm to see how accurately it identifies relationships between the known outcomes for the target variable and the dataset's other features

6
New cards

What is a test set?

A subset of the dataset that is held out during the training process for the purpose of generating the final estimate.

7
New cards

What is a feature of the logistic regression?

It does not give a yes or no, but a probability value between 0 and 1 that something will happen.

8
New cards

What is the original linear functional form?

P(X) = B0 + B1X + e

9
New cards

Why doesn't the original linear functional form work?

In this form, the function can take on any number (negative or greater than one), but probability values can only be between 0 and 1.

10
New cards

How does logistic regression solve the problem of the original linear functional form?

It squishes values into the range of 0 to 1 for probability to become valid.

11
New cards

What steps does the logistic regression take to fix the probability?

1. First, model the logit (the log of odds). The logit can take on any value from negative infinity to positive infinity, so a linear equation can model it directly.

2. Next, remove the log; the odds themselves are always positive (now we have 0 to positive infinity).

3. Now, we convert the odds back into probability with the logistic formula (giving a value between 0 and 1)
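The three steps above can be sketched in a few lines of Python (the function name is mine, not from the cards):

```python
import math

def logit_to_probability(logit):
    """Convert a logit (log of odds) back to a probability.

    The logit lives on (-inf, +inf); exponentiating removes the log
    and gives the odds on (0, +inf); the logistic formula
    odds / (1 + odds) maps the odds into (0, 1).
    """
    odds = math.exp(logit)       # step 2: remove the log; odds are always positive
    return odds / (1 + odds)     # step 3: logistic formula gives a valid probability

# A logit of 0 means odds of 1, i.e. a 50/50 probability.
print(logit_to_probability(0))   # 0.5
```

Large negative logits map near 0 and large positive logits map near 1, which is exactly the "squishing" described in card 10.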

12
New cards

What is logit?

the log of odds

13
New cards

How do you calculate odds?

= probability / (1 - probability)

14
New cards

How do you calculate probability?

= odds / (1 + odds)
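The two conversion formulas from cards 13 and 14 are inverses of each other; a minimal sketch (function names are mine):

```python
def odds_from_probability(p):
    # odds = probability / (1 - probability)
    return p / (1 - p)

def probability_from_odds(odds):
    # probability = odds / (1 + odds)
    return odds / (1 + odds)

# A probability of 0.8 corresponds to odds of 4 (4-to-1 in favor),
# and converting back recovers the original probability.
print(odds_from_probability(0.8))
print(probability_from_odds(4.0))   # 0.8
```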

15
New cards

(True/False) Odds and probability just express likelihood, while logistic regression moves back and forth between them.

True

16
New cards

When can you use a logistic regression?

when the outcome is binary (only 2 options)

17
New cards

What does a logistic regression assume?

It assumes a linear relationship between the predictors and the logit (log of odds) of the outcome

18
New cards

What is the formula for calculating the logit?

= a constant (B0) + effects of the predictors (B1, B2, ...) + unexplained randomness (error)

19
New cards

How do linear and logistic regression predict differently?

Linear regression predicts "how much," whereas logistic regression predicts "yes/no" through probability

20
New cards

(True/False) Logistic regression turns probability into actual classifications like 0 and 1?

True

21
New cards

What is the "cutoff" rule in terms of probability?

If probability > cutoff → classify as Yes
If probability ≤ cutoff → classify as No
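The cutoff rule can be sketched directly in Python (the probability value here is hypothetical):

```python
def classify(probability, cutoff=0.5):
    # probability > cutoff -> classify as Yes (1)
    # probability <= cutoff -> classify as No (0)
    return 1 if probability > cutoff else 0

p = 0.35
print(classify(p, cutoff=0.5))  # 0: below the default cutoff
print(classify(p, cutoff=0.2))  # 1: a lower cutoff turns more cases into Yes
```

This also previews cards 23 and 24: lowering the cutoff converts more borderline cases into "Yes" predictions.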

22
New cards

(True/False) Logistic regression first calculates the logit, converts it into a probability, and then uses that to classify.

True

23
New cards

If we use a lower cutoff number than 0.5, are there more or fewer false negatives versus false positives?

fewer false negatives but more false positives

24
New cards

If we use a higher cutoff number than 0.5, are there more or fewer false negatives versus false positives?

fewer false positives but more false negatives

25
New cards

What do the coefficients tell you in terms of logit?

Each coefficient tells you how much the log-odds change when that predictor increases by 1.
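A quick numeric illustration (the coefficient value 0.7 is hypothetical): since a coefficient adds to the log-odds, exponentiating it gives the factor by which the odds are multiplied.

```python
import math

b1 = 0.7  # hypothetical fitted coefficient for one predictor

# A one-unit increase in the predictor adds b1 to the log-odds,
# which multiplies the odds by exp(b1).
odds_multiplier = math.exp(b1)
print(round(odds_multiplier, 2))  # roughly doubles the odds
```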

26
New cards

(True/False) Logistic regression is like linear regression, but instead of predicting a continuous number, it predicts a categorical outcome like yes/no or 0/1.

True

27
New cards

In a logistic regression, your predictors can be numerical, but what does the outcome have to be?

Categorical

28
New cards

What is a confusion matrix?

A tabulation of the predicted and actual value counts for each possible class

29
New cards

What is accuracy?

how close a measurement comes to the actual value of whatever is measured

30
New cards

How do you calculate accuracy?

(TP + TN) / ALL

31
New cards

What is precision?

How often I'm right when I say 'yes.'

32
New cards

How do you calculate precision?

TP / (TP + FP)

33
New cards

What is recall (sensitivity)?

How many real 'yes' cases I actually found (also known as the true positive rate)

34
New cards

How do you calculate recall (sensitivity)?

TP / (TP + FN)

35
New cards

What is misclassification?

How often I'm wrong (also known as the error rate)

36
New cards

How do you calculate misclassification?

(FN + FP) / ALL

37
New cards

How else could you calculate accuracy?

1 - misclassification rate
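The four metrics from cards 29-37 can be computed together from the confusion-matrix counts; a sketch with hypothetical counts:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and misclassification from counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,            # how often I'm right overall
        "precision": tp / (tp + fp),              # how often 'yes' predictions are right
        "recall": tp / (tp + fn),                 # how many real 'yes' cases were found
        "misclassification": (fn + fp) / total,   # how often I'm wrong (error rate)
    }

# Hypothetical confusion matrix: 40 TP, 45 TN, 5 FP, 10 FN
m = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
print(m["accuracy"])           # 0.85
print(m["misclassification"])  # 0.15, and accuracy = 1 - misclassification
```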

38
New cards

What is the false positive rate?

We predicted yes (they have cancer) but in reality they do not have cancer (type I error)

39
New cards

How do you calculate the false positive rate?

FP / (TN + FP)

40
New cards

How do you calculate the true negative rate?

TN / (TN + FP)

41
New cards

What is the false negative rate?

We predicted no (they don’t have cancer) but in reality they do have cancer (aka, Type II error)

42
New cards

How do you calculate the false negative rate?

FN / (FN + TP)
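The four rates from cards 38-42 come in complementary pairs, which a short sketch makes visible (counts are hypothetical):

```python
def error_rates(tp, tn, fp, fn):
    return {
        "true_positive_rate": tp / (tp + fn),    # sensitivity (recall)
        "false_negative_rate": fn / (fn + tp),   # Type II error rate
        "true_negative_rate": tn / (tn + fp),    # specificity
        "false_positive_rate": fp / (tn + fp),   # Type I error rate
    }

r = error_rates(tp=40, tn=45, fp=5, fn=10)
# Complementary pairs: TPR + FNR = 1 and TNR + FPR = 1,
# since each pair splits the same actual class.
print(r["true_positive_rate"] + r["false_negative_rate"])  # 1.0
```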

43
New cards

Sensitivity maximizes what?

True positive rates

44
New cards

Specificity maximizes what?

True negative rates

45
New cards

What does K represent in KNN?

the number of neighbors you choose to consider for a new data point

46
New cards

In KNN, what two measurements are used to calculate the closeness of the data points?

Euclidean measure and Manhattan distance

47
New cards

How do you calculate using the Euclidean measure?

D(i,j) = sqrt((xi1 - xj1)^2 + (xi2 - xj2)^2 + ...)
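Both distance measures from card 46 can be sketched in a few lines of Python (function names are mine):

```python
import math

def euclidean(xi, xj):
    # D(i,j) = sqrt((xi1 - xj1)^2 + (xi2 - xj2)^2 + ...)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def manhattan(xi, xj):
    # sum of the absolute differences in each coordinate
    return sum(abs(a - b) for a, b in zip(xi, xj))

# Classic 3-4-5 triangle: straight-line distance 5, city-block distance 7.
print(euclidean([0, 0], [3, 4]))   # 5.0
print(manhattan([0, 0], [3, 4]))   # 7
```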

48
New cards

Why is standardization important?

Different variables can have very different scales, and without standardization, big numbers would dominate and skew results.

49
New cards

What is the formula for the standardization of variables?

Xi = (Xi - average(X)) / std(X)
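The standardization formula in plain Python (a minimal sketch; this version uses the population standard deviation, which is one common convention):

```python
def standardize(values):
    """Center each value at the mean and divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population std
    return [(v - mean) / std for v in values]

# After standardizing, the values have mean 0 and are on a common scale,
# so no single large-scale variable can dominate the distance calculation.
print(standardize([2, 4, 6]))
```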

50
New cards

Is KNN a parametric model?

No. KNN makes no assumptions about how the variables relate; it just looks at neighbors, so it is not parametric.

51
New cards

Does KNN have a training phase?

It is a lazy learner with no training phase; it just stores all the data and uses it when predicting.

52
New cards

Does logistic regression have a training phase?

Yes. It learns its coefficients during training, which is also why its predictions are fast.

53
New cards

(True/False) Does the value of k (number of neighbors) strongly affect performance?

True

54
New cards

If k is too small, what happens to the model?

The model is too sensitive (overfitting)

55
New cards

If k is too big, what happens to the model?

The model is too simple (misses patterns)

56
New cards

Do we know the best K well in advance?

No, therefore, we try different K values on the validation set and see which K gives the best accuracy
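The search over K can be sketched with a tiny from-scratch KNN (all names are mine; this assumes numeric features and uses Euclidean distance and majority vote):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training points."""
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def best_k(train_X, train_y, val_X, val_y, k_values):
    """Try each K on the validation set and keep the most accurate one."""
    def accuracy(k):
        hits = sum(knn_predict(train_X, train_y, x, k) == y
                   for x, y in zip(val_X, val_y))
        return hits / len(val_y)
    return max(k_values, key=accuracy)

# Toy data: one cluster of 0s near the origin, one cluster of 1s near (5, 5).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (0.2, 0.2), k=3))  # 0
```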

57
New cards

What are the traits of KNN?

1. Flexible
2. No assumptions
3. Slow and hard to explain

58
New cards

What are the traits of Logistic Regression?

1. Makes assumptions
2. Faster
3. Easier to interpret

59
New cards

What is a multinomial logistic regression?

a model used if the outcome variable has more than two categories

60
New cards

How many dummy variables are in a multinomial logistic regression?

M - 1 dummy variables

61
New cards

(True/False) Each dummy variable has a value of 1 for its category and 0 for all others.

True

62
New cards

Does the reference category use a dummy variable?

No

63
New cards

How does the multinomial regression work?

1. Builds one equation for each category
2. Each equation shows how the predictors affect the probability of being in that category vs the reference
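The M - 1 dummy coding from cards 60-62 can be sketched as follows (the function name and color categories are hypothetical examples):

```python
def dummy_code(categories, reference):
    """Build M - 1 dummy variables for a categorical column.

    Each non-reference category gets a dummy that is 1 for its own
    category and 0 for all others; the reference category gets no
    dummy of its own (it shows up as all zeros).
    """
    levels = [c for c in sorted(set(categories)) if c != reference]
    return [{level: int(c == level) for level in levels} for c in categories]

# 3 categories -> M - 1 = 2 dummies; "red" is the reference (all zeros).
rows = dummy_code(["red", "green", "blue"], reference="red")
print(rows[0])  # {'blue': 0, 'green': 0}  <- reference category
```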

64
New cards

What is scaling in R?

1. adjusts variables so they are on a common scale
2. centers each variable
3. makes categorical variables a factor