1/63
Looks like no tags are added yet.
Name | Mastery | Learn | Test | Matching | Spaced |
|---|
No study sessions yet.
What is a true positive?
We predicted yes (they have cancer) but in reality they do have cancer
What is the true negative rate?
We predicted no (they do not have cancer) and in reality they do not have the disease
What are the first four steps of building a classification model?
1. Purpose of Analysis
2. Collect & Prepare the Data
3. Explore the Data
4. Split into training (60-70%), validation (15-20%), and test (15-20%) sets
What is the training set?
A subsection of a dataset from which the machine learning algorithm uncovers or "learns" relationships between the features and the target variable
What is the validation set?
a subset of a dataset to which we apply the machine learning algorithm to see how accurately it identifies relationships between the known outcomes for the target variable and the dataset's other features
What is a test set?
A subset of the dataset that is held out during the training process for the purpose of generating the final estimate.
What is a feature of the logistic regression?
It does not give a yes or no, but a probability value between 0 and 1 that something will happen.
What is the original linear functional form?
P(X) = B0 + B1(X) + e
Why doesn't the original linear functional form work?
In this form, the function can take on any number (negative or greater than one), but probability values can only be between 0 and 1.
How does logistic regression solve the problem of the original linear functional form?
It squishes values into the range of 0 to 1 for probability to become valid.
What steps does the logistic regression take to fix the probability?
1. First, the logit = log of odds. This is where the log of odds takes on any value from negative infinity to positive infinity, which helps the model's probability directly.
2. Next, remove the log; the odds themselves are always positive (now we have 0 to positive infinity).
3. Now, we convert the odds back into probability with the logistic formula (giving a value between 0 and 1)
What is logit?
the log of odds
How do you calculate odds?
= probability / (1 - probability)
How do you calculate probability?
= odds / (1 + odds)
(True/False) Odds and probability just express likelihood, while logistic regression moves back and forth between them.
True
When can you use a logistic regression?
when the outcome is binary (only 2 options)
What does a logistic regression assume?
It assumes a linear relationship between the predictors and the logit (log of odds) of the outcome
What is the formula for calculating the logit?
= a constant (Beta0) + effects of predictors (B1, B2,...) + unexplained randomness (error)
How do linear and logistic regression predict differently?
Linear regression predicts "how much," whereas logistic regression predicts "yes/no" through probability
(True/False) Logistic regression turns probability into actual classifications like 0 and 1?
True
What is the "cutoff" rule in terms of probability?
If probability > cutoff → classify as Yes
If probability ≤ cutoff → classify as No
(True/False) Logistic regression first calculates the logit, converts it into a probability, and then uses that to classify.
True
If we use a lower cutoff number than 0.5, are there more or fewer false negatives versus false positives?
fewer false negatives but more false positives
If we use a higher cutoff number than 0.5, are there more or fewer false negatives versus false positives?
fewer false positives but more false negatives
What do the coefficients tell you in terms of logit?
Each coefficient tells you how much the log-odds change when that predictor increases by 1.
(True/False) Logistic regression is like linear regression, but instead of predicting a continuous number, it predicts a categorical outcome like yes/no or 0/1.
True
In a logistic regression, your predictors can be numerical, but what does the outcome have to be?
Categorical
What is a confusion matrix?
A tabulation of the predicted and actual value counts for each possible class
What is accuracy?
how close a measurement comes to the actual value of whatever is measured
How do you calculate accuracy?
TP+TN/ ALL
What is precision?
How often I'm right when I say 'yes.'
How do you calculate precision?
TP / (TP + FP)
What is recall (sensitivity)?
How many real 'yes' cases I actually found (also known as the true positive rate)
How do you calculate recall (sensitivity)?
TP / (TP + FN)
What is misclassification?
How often I'm wrong (also known as the error rate)
How do you calculate misclassification?
(FN + FP) / ALL
How else could you calculate accuracy?
1 - misclassification rate
What is the false positive rate?
We predicted yes (they have cancer) but in reality they do not have cancer (type I error)
How do you calculate the false positive rate?
FP / (TN + FP)
How do you calculate the true negative rate?
TN / (TN + FP)
What is the false negative rate?
We predicted no (they don’t have cancer) but in reality they do have cancer (aka, Type II error)
How do you calculate the false negative rate?
FN / (FN + TP)
Sensitivity maximizes what?
True positive rates
Specificity maximizes what?
True negative rates
What does K represent in KNN?
the number of neighbors you choose to consider for a new data point
In KNN, what two measurements are used to calculate the closeness of the data points?
Euclidean measure and Manhattan distance
How do you calculate using the Euclidean measure?
D(i,j) = sqrt of (xi1 - xj1) ^2 + (xi2 - xj2)^2 +...
Why is standardization important?
Different variables can have very different scales, and without standardization, big numbers would dominate and skew results.
What is the formula for the standardization of variables?
Xi = Xi - average(x) / std (X)
Is KNN a parametric model?
KNN makes no assumptions about how the variances relate; it just looks at neighbors, so no, it is not.
Does KNN have a training phase?
It is a lazy learner with no training phase; it just stores all the data and uses it when predicting.
Does logistic regression have a training phase?
It learns during training, so its predictions are faster, so yes, it does.
(True/False) Does the value of k (number of neighbors) strongly affect performance?
True
If k is too small, what happens to the model?
The model is too sensitive (overfitting)
If k is too big, what happens to the model?
The model is too simple (misses patterns)
Do we know the best K well in advance?
No, therefore, we try different K values on the validation set and see which K gives the best accuracy
What are the traits of KNN?
1. Flexible
2. No assumptions
3. Slow and hard to explain
What are the traits of Logistic Regression?
1. Makes assumptions
2. Faster
3. Easier to interpret
What is a multinomial logistic regression?
a model used if the outcome variable has more than two categories
How many dummy variables are in a multinomial logistic regression?
M - 1 dummy variables
(True/False) Each dummy variable has a value of 1 for its category and 0 for all others.
True
Does the reference category use a dummy variable?
No
How does the multinomial regression work?
1. Builds one equation for each category
2. Each equation shows how the predictors affect the probability of being in that category vs the reference
What is scaling in R?
1. adjusts variables so they are on a common scale
2. centers each variable
3. makes categorical variables a factor