Classification Models - Predictive Analytics Final Exam

64 Terms

New cards

What is a true positive?

We predicted yes (they have cancer), and in reality they do have cancer

New cards

What is the true negative rate?

The proportion of actual "no" cases that we correctly predicted as no. A true negative is a case where we predicted no (they do not have cancer) and in reality they do not have cancer

New cards

What are the first four steps of building a classification model?

1. Purpose of Analysis
2. Collect & Prepare the Data
3. Explore the Data
4. Split into training (60-70%), validation (15-20%), and test (15-20%) sets
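
Step 4 can be sketched in Python as follows. This is a minimal illustration, assuming a 70/15/15 split and a made-up list of 100 row IDs; the function name `split_dataset` and the seed are hypothetical, not part of the course material.

```python
import random

def split_dataset(rows, train_frac=0.70, valid_frac=0.15, seed=42):
    """Shuffle rows and cut them into training, validation, and test sets."""
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever is left over
    return train, valid, test

rows = list(range(100))  # hypothetical data: 100 row IDs
train, valid, test = split_dataset(rows)
print(len(train), len(valid), len(test))
```

Shuffling before cutting keeps the three sets from inheriting any ordering in the original data.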

New cards

What is the training set?

A subset of a dataset from which the machine learning algorithm uncovers or "learns" relationships between the features and the target variable

New cards

What is the validation set?

A subset of a dataset to which we apply the trained model to see how accurately it predicts the known outcomes of the target variable from the dataset's other features; it is used to tune and compare models

New cards

What is a test set?

A subset of the dataset that is held out during the training process for the purpose of generating the final estimate of how well the model performs.

New cards

What is a feature of the logistic regression?

It does not give a yes or no, but a probability value between 0 and 1 that something will happen.

New cards

What is the original linear functional form?

P(X) = B0 + B1(X) + e

New cards

Why doesn't the original linear functional form work?

In this form, the function can take on any number (negative or greater than one), but probability values can only be between 0 and 1.

New cards

How does logistic regression solve the problem of the original linear functional form?

It squishes values into the range of 0 to 1 for probability to become valid.

New cards

What steps does the logistic regression take to fix the probability?

1. First, model the logit (the log of odds). The log of odds can take any value from negative infinity to positive infinity, so a linear equation can model it, which models the probability indirectly.

2. Next, remove the log (exponentiate); the odds themselves are always positive (now we have 0 to positive infinity).

3. Finally, convert the odds back into a probability with the logistic formula, probability = odds / (1 + odds), giving a value between 0 and 1.
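
The three steps can be traced in a few lines of Python. This is a sketch under stated assumptions: the coefficients `b0` and `b1` and the function name `logit_to_probability` are hypothetical values chosen for illustration.

```python
import math

def logit_to_probability(logit):
    """Undo the log to get odds, then turn odds into a probability."""
    odds = math.exp(logit)      # step 2: odds are always positive
    return odds / (1 + odds)    # step 3: probability between 0 and 1

# Hypothetical coefficients for a one-predictor model: logit = b0 + b1 * x
b0, b1 = -3.0, 0.5
for x in (0, 6, 12):
    logit = b0 + b1 * x         # step 1: can be any real number
    print(x, logit, round(logit_to_probability(logit), 3))
```

However negative or positive the logit gets, the returned value stays strictly between 0 and 1.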

New cards

What is logit?

the log of odds

New cards

How do you calculate odds?

= probability / (1 - probability)

New cards

How do you calculate probability?

= odds / (1 + odds)
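
A short round trip between the two formulas, written as a Python sketch. The function names and the starting probability of 0.75 are illustrative assumptions.

```python
def probability_to_odds(p):
    """odds = probability / (1 - probability)"""
    return p / (1 - p)

def odds_to_probability(odds):
    """probability = odds / (1 + odds)"""
    return odds / (1 + odds)

p = 0.75                          # hypothetical probability
odds = probability_to_odds(p)     # 3.0, i.e. odds of 3 to 1
back = odds_to_probability(odds)  # 0.75 again
print(odds, back)
```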

New cards

(True/False) Odds and probability both express likelihood, and logistic regression moves back and forth between them.

True

New cards

When can you use a logistic regression?

when the outcome is binary (only 2 options)

New cards

What does a logistic regression assume?

It assumes a linear relationship between the predictors and the logit (log of odds) of the outcome

New cards

What is the formula for calculating the logit?

= a constant (B0) + effects of predictors (B1, B2, ...) + unexplained randomness (error)

New cards

How do linear and logistic regression predict differently?

Linear regression predicts "how much," whereas logistic regression predicts "yes/no" through probability

New cards

(True/False) Logistic regression turns a probability into an actual classification, such as 0 or 1, by applying a cutoff.

True

New cards

What is the "cutoff" rule in terms of probability?

If probability > cutoff → classify as Yes
If probability ≤ cutoff → classify as No

New cards

(True/False) Logistic regression first calculates the logit, converts it into a probability, and then uses that to classify.

True

New cards

If we use a lower cutoff number than 0.5, are there more or fewer false negatives versus false positives?

fewer false negatives but more false positives

New cards

If we use a higher cutoff number than 0.5, are there more or fewer false negatives versus false positives?

fewer false positives but more false negatives
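
The trade-off between the two kinds of error can be seen by applying the cutoff rule at several cutoffs. The sketch below uses made-up predicted probabilities and outcomes; `count_errors` is a hypothetical helper, not a library function.

```python
def count_errors(probabilities, actual, cutoff):
    """Apply the cutoff rule and count false positives and false negatives."""
    false_pos = false_neg = 0
    for p, y in zip(probabilities, actual):
        predicted = 1 if p > cutoff else 0   # probability > cutoff -> Yes
        if predicted == 1 and y == 0:
            false_pos += 1
        elif predicted == 0 and y == 1:
            false_neg += 1
    return false_pos, false_neg

# Hypothetical predicted probabilities and true outcomes (1 = Yes, 0 = No)
probabilities = [0.10, 0.35, 0.40, 0.55, 0.60, 0.65, 0.80, 0.90]
actual        = [0,    0,    1,    0,    1,    0,    1,    1]

for cutoff in (0.3, 0.5, 0.7):
    fp, fn = count_errors(probabilities, actual, cutoff)
    print(f"cutoff={cutoff}: false positives={fp}, false negatives={fn}")
```

On this toy data the counts go from 3 false positives and 0 false negatives at a cutoff of 0.3 to 0 false positives and 2 false negatives at 0.7.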

New cards

What do the coefficients tell you in terms of logit?

Each coefficient tells you how much the log-odds (logit) change when that predictor increases by one unit, holding the other predictors constant.
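
Because the coefficient acts on the log-odds, a one-unit increase multiplies the odds by exp(coefficient). The Python sketch below checks this numerically; the intercept `b0` and coefficient `b1` are hypothetical values.

```python
import math

b0 = -2.0  # hypothetical intercept
b1 = 0.7   # hypothetical coefficient for one predictor

def odds_at(x):
    """Odds implied by the model at predictor value x: exp(logit)."""
    return math.exp(b0 + b1 * x)

odds_ratio = odds_at(4) / odds_at(3)   # effect of a one-unit increase in x
print(odds_ratio, math.exp(b1))        # the two values agree up to rounding
```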

New cards

(True/False) Logistic regression is like linear regression, but instead of predicting a continuous number, it predicts a categorical outcome like yes/no or 0/1.

True

New cards

In a logistic regression, your predictors can be numerical, but what does the outcome have to be?

Categorical

New cards

What is a confusion matrix?

A tabulation of the predicted and actual value counts for each possible class
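
A confusion matrix can be tabulated by counting each (actual, predicted) pair. This is a minimal sketch with made-up labels; the function name `confusion_matrix` here is a local helper, not a call into any library.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical actual and predicted classes for eight cases
actual    = ["yes", "yes", "no", "no",  "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "yes", "no", "no", "yes"]

cm = confusion_matrix(actual, predicted)
tp = cm[("yes", "yes")]   # true positives
fn = cm[("yes", "no")]    # false negatives
fp = cm[("no", "yes")]    # false positives
tn = cm[("no", "no")]     # true negatives
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```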

New cards

What is accuracy?

The proportion of all predictions, both yes and no, that match the actual outcome (how often I'm right overall)

New cards

How do you calculate accuracy?

(TP + TN) / ALL

New cards

What is precision?

How often I'm right when I say 'yes.'

New cards

How do you calculate precision?

TP / (TP + FP)

New cards

What is recall (sensitivity)?

How many real 'yes' cases I actually found (also known as the true positive rate)

New cards

How do you calculate recall (sensitivity)?

TP / (TP + FN)

New cards

What is misclassification?

How often I'm wrong (also known as the error rate)

New cards

How do you calculate misclassification?

(FN + FP) / ALL

New cards

How else could you calculate accuracy?

1 - misclassification rate
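
The formulas for accuracy, precision, recall, and misclassification can be checked on one set of counts. The confusion-matrix counts below are hypothetical.

```python
# Hypothetical confusion-matrix counts
tp, fp, tn, fn = 40, 10, 45, 5
total = tp + fp + tn + fn              # ALL

accuracy = (tp + tn) / total           # how often I'm right overall
precision = tp / (tp + fp)             # how often I'm right when I say yes
recall = tp / (tp + fn)                # how many real yes cases I found
misclassification = (fn + fp) / total  # how often I'm wrong

print(accuracy, precision, round(recall, 3), misclassification)
print(1 - misclassification)           # same as accuracy, up to rounding
```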

New cards

What is the false positive rate?

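The proportion of actual "no" cases that we wrongly predicted as yes. A false positive is a case where we predicted yes (they have cancer) but in reality they do not have cancer (Type I error)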

New cards

How do you calculate the false positive rate?

FP / (TN + FP)

New cards

How do you calculate the true negative rate?

TN / (TN + FP)

New cards

What is the false negative rate?

The proportion of actual "yes" cases that we wrongly predicted as no. A false negative is a case where we predicted no (they don't have cancer) but in reality they do have cancer (Type II error)

New cards

How do you calculate the false negative rate?

FN / (FN + TP)
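
The rate formulas pair up: the false positive rate and true negative rate share a denominator and sum to 1, and so do the false negative rate and true positive rate. A short Python check, using hypothetical counts:

```python
# Hypothetical confusion-matrix counts
tp, fp, tn, fn = 40, 10, 45, 5

false_positive_rate = fp / (tn + fp)
true_negative_rate = tn / (tn + fp)    # specificity
false_negative_rate = fn / (fn + tp)
true_positive_rate = tp / (tp + fn)    # sensitivity (recall)

print(false_positive_rate + true_negative_rate)   # 1, up to rounding
print(false_negative_rate + true_positive_rate)   # 1, up to rounding
```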

New cards

Sensitivity maximizes what?

True positive rates

New cards

Specificity maximizes what?

True negative rates

New cards

What does K represent in KNN?

the number of neighbors you choose to consider for a new data point

New cards

In KNN, what two measurements are used to calculate the closeness of the data points?

Euclidean distance and Manhattan distance

New cards

How do you calculate the Euclidean distance?

D(i,j) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... )
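
Both distance measures in a small Python sketch. The Manhattan formula (sum of absolute differences) is not written out on the cards and is added here as the standard definition; the two points are made up.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

point_i = (1.0, 2.0)   # hypothetical data points with two features each
point_j = (4.0, 6.0)
print(euclidean(point_i, point_j))   # 5.0
print(manhattan(point_i, point_j))   # 7.0
```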

New cards

Why is standardization important?

Different variables can have very different scales, and without standardization, big numbers would dominate and skew results.

New cards

What is the formula for the standardization of variables?

Zi = (Xi - average(X)) / std(X)
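
A sketch of standardization in Python, assuming the sample standard deviation (dividing by n - 1) is the one intended. The income and age values are made up to show two variables on very different scales.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

incomes = [30000, 45000, 60000, 75000, 90000]   # hypothetical, large scale
ages = [22, 30, 38, 46, 54]                     # hypothetical, small scale

print([round(z, 3) for z in standardize(incomes)])
print([round(z, 3) for z in standardize(ages)])
```

After standardization both variables have mean 0 and standard deviation 1, so neither dominates a distance calculation just because its raw numbers are bigger.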

New cards

Is KNN a parametric model?

No. KNN makes no assumptions about how the variables relate; it just looks at neighbors.

New cards

Does KNN have a training phase?

It is a lazy learner with no training phase; it just stores all the data and uses it when predicting.
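
A minimal KNN classifier makes the "lazy learner" idea concrete: nothing is fitted ahead of time, and all the work happens when a new point arrives. This sketch assumes Euclidean distance and majority voting; the data and the name `knn_predict` are hypothetical, and `math.dist` needs Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    """Classify new_point by majority vote among its k nearest neighbors."""
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])   # closest first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# "Training" is just storing the data (hypothetical, already standardized)
train_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
                (1.0, 1.1), (0.9, 1.0), (1.2, 0.8)]
train_labels = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(train_points, train_labels, (0.1, 0.1), k=3))
print(knn_predict(train_points, train_labels, (1.0, 1.0), k=3))
```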

New cards

Does logistic regression have a training phase?

Yes. It learns its coefficients during training, so its predictions are faster.

New cards

(True/False) The value of k (number of neighbors) strongly affects performance.

True

New cards

If k is too small, what happens to the model?

The model is too sensitive (overfitting)

New cards

If k is too big, what happens to the model?

The model is too simple and misses patterns (underfitting)

New cards

Do we know the best K well in advance?

No. We try different K values on the validation set and keep the K that gives the best accuracy.
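
Trying several K values on a validation set might look like the sketch below. Everything here is an illustrative assumption: a single feature, made-up training and validation data with two deliberately noisy training points, and the helper names `knn_predict` and `validation_accuracy`.

```python
from collections import Counter

def knn_predict(train, new_x, k):
    """train is a list of (x, label) pairs; vote among the k closest x values."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - new_x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def validation_accuracy(train, valid, k):
    """Share of validation cases that KNN with this k classifies correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in valid)
    return correct / len(valid)

# Hypothetical data; the points at x=4 and x=7 are noise in their regions
train = [(0, "no"), (1, "no"), (2, "no"), (3, "no"), (4, "yes"), (5, "no"),
         (6, "yes"), (7, "no"), (8, "yes"), (9, "yes"), (10, "yes")]
valid = [(1.6, "no"), (3.9, "no"), (7.2, "yes"), (9.4, "yes")]

scores = {k: validation_accuracy(train, valid, k) for k in (1, 3, 11)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data K=1 follows the noisy points, K=11 votes with the overall majority class every time, and K=3 scores best on the validation set.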

New cards

What are the traits of KNN?

1. Flexible
2. No assumptions
3. Slow and hard to explain

New cards

What are the traits of Logistic Regression?

1. Makes assumptions
2. Faster
3. Easier to interpret

New cards

What is a multinomial logistic regression?

a model used if the outcome variable has more than two categories

New cards

How many dummy variables are in a multinomial logistic regression?

M - 1 dummy variables, where M is the number of categories

New cards

(True/False) Each dummy variable has a value of 1 for its category and 0 for all others.

True
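
Dummy coding with M - 1 columns can be sketched as follows. The category values, the choice of "blue" as the left-out category, and the helper name `make_dummies` are hypothetical.

```python
def make_dummies(values, reference):
    """Return one 0/1 column per non-reference category (M - 1 columns)."""
    categories = sorted(set(values) - {reference})
    return {
        category: [1 if v == category else 0 for v in values]
        for category in categories
    }

colors = ["red", "green", "blue", "green", "red"]   # hypothetical, M = 3
dummies = make_dummies(colors, reference="blue")
for name, column in dummies.items():
    print(name, column)
```

The third row ("blue") is 0 in both columns, which is how the left-out category is identified without a column of its own.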

New cards

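Does the reference category use a dummy variable?

No. The reference category is the one left out; it is the case where all M - 1 dummy variables equal 0.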

New cards

How does the multinomial regression work?

1. Builds one equation for each category other than the reference (M - 1 equations)
2. Each equation shows how the predictors affect the odds of being in that category vs the reference
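
One way to see how the separate equations combine into class probabilities is the sketch below. It assumes each equation gives the log odds of a category versus the reference; the logit values and category names are made up, and `multinomial_probabilities` is a hypothetical helper.

```python
import math

def multinomial_probabilities(logits_vs_reference):
    """Map each non-reference category's logit (log odds versus the
    reference category) to a probability for every category."""
    exp_logits = {c: math.exp(z) for c, z in logits_vs_reference.items()}
    denominator = 1 + sum(exp_logits.values())   # the 1 stands for the reference
    probabilities = {c: e / denominator for c, e in exp_logits.items()}
    probabilities["reference"] = 1 / denominator
    return probabilities

# Hypothetical logits for a 3-category outcome (two equations plus a reference)
probs = multinomial_probabilities({"car": 0.8, "train": -0.4})
print(probs, sum(probs.values()))
```

A positive logit makes a category more likely than the reference, a negative one less likely, and the probabilities across all categories sum to 1.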

New cards

What is scaling in R?

1. adjusts variables so they are on a common scale
2. centers each variable
3. categorical variables are handled separately by converting them to factors (scale() itself accepts only numeric columns)