CAPSTONE class.
Classification is…
the process of assigning data points to predefined categories or classes
Supervised learning task
the model learns from labeled data where each data point already belongs to a specific class
Goal of Classification
build a model that can then accurately predict the class of new, unseen data points based on their features and characteristics
What are categories?
clearly defined, distinct groups that data points can be assigned to
Examples:
classifying emails as spam or not spam, predicting whether a medical image shows a tumor, or grouping customers into buying segments
What are features?
characteristics or attributes of the data points that the model uses to make predictions
Examples:
features for classifying emails might include keywords, sender info, text length
What is model training?
the model is trained on a set of labeled data points where the true class of each point is known. The model learns from these examples to identify patterns and relationships between features and class labels
What are the predictions?
once trained, the model is used to predict the class of new, unseen data points based on their features. The specific algorithm used for classification varies depending on the type of data and the complexity of the problem
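The train-then-predict workflow can be sketched in a few lines; this is a minimal illustration using scikit-learn (assumed installed), and the spam-detection features and data are invented:

```python
from sklearn.linear_model import LogisticRegression

# Invented labeled training data: features = [spam_keyword_count, exclamation_count]
X_train = [[5, 3], [7, 4], [6, 5], [0, 0], [1, 0], [0, 1]]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

model = LogisticRegression()
model.fit(X_train, y_train)  # learn patterns linking features to class labels

# Predict the class of new, unseen data points from their features
X_new = [[8, 4], [0, 0]]
print(model.predict(X_new))
```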
Logistic Regression
popular algorithm for binary classification problems (two classes)
Examples: customer churn (yes/no), email spam (spam/not spam)
Support Vector Machines (SVMs)
Effective for both binary and multi-class problems; good for handling high-dimensional data
Examples: classifying handwritten digits, text classification
Decision Trees
easy to interpret, good for visualizing the decision-making process, can handle both binary and multi-class problems
Examples: loan approval (income, marital status, debt), medical triage, customer purchase behavior
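A decision tree on a toy loan-approval problem can be sketched with scikit-learn (assumed installed); the applicants and features below are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented loan-approval data: features = [income_in_thousands, has_existing_debt]
X = [[80, 0], [30, 1], [60, 0], [25, 1], [90, 1], [20, 0]]
y = [1, 0, 1, 0, 1, 0]  # 1 = approved

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The fitted tree encodes an interpretable sequence of yes/no questions
print(tree.predict([[70, 0], [22, 1]]))
```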
Random Forests
ensemble method combining multiple decision trees, often leads to improved accuracy and robustness
Examples: employee attrition, credit risk classification
Neural Networks
powerful and flexible for complex problems, especially with large amounts of data
Examples: image classification, speech recognition, deep text classification
Main difference between regression and classification
classification predicts discrete categories, with each data point belonging to a specific predefined group, e.g., spam/not spam, healthy/unhealthy
regression predicts continuous numerical values that can fall anywhere within a given range, like predicting housing prices, stock market trends, or temperature changes
Classification model types
logistic regression, decision trees, SVMs, random forests
Regression model types
linear regression, polynomial regression, ridge regression, lasso regression, neural networks
Classification evaluation metrics
accuracy, precision, recall, F1 score
Regression evaluation metrics
mean squared error (MSE), root mean squared error (RMSE), R-squared coefficient
Model selection for classification (part 1)
understanding the data and problem:
types of features: numerical, categorical, text, images, or mixed?
data quality: missing values, outliers, imbalance, noise
problem complexity: linearly separable or complex relationships
desired interpretability: need to understand model decisions?
Model selection for classification (part 2)
choose candidate models:
start with simpler models: logistic regression, decision trees, Naive Bayes
Consider powerful options: random forests, SVMs, neural networks
match model capabilities to data: tree-based models for mixed data, linear models for numerical data, SVMs for high-dimensional data
Model selection for classification (Part 3)
split data into training and testing sets:
use robust splitting method like stratified sampling to ensure representative class proportions in both sets
Common split proportions: 80% for training, 20% for testing
Model selection for classification (part 4)
train and evaluate models
train each model on the training set, then evaluate each on the held-out test set
What is accuracy?
the proportion of all predictions that are correct
What is precision?
proportion of true positives among predicted positives
What is recall?
proportion of actual positives correctly identified
What is F1-score?
balances precision and recall
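The four metric definitions above can be checked with a short pure-Python calculation; the confusion-matrix counts are invented for illustration:

```python
# Invented confusion-matrix counts
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)          # overall correct predictions
precision = tp / (tp + fp)                          # true positives among predicted positives
recall = tp / (tp + fn)                             # true positives among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean: balances both
print(accuracy, precision, round(recall, 3), round(f1, 3))
```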
What’s an ROC curve?
plots true positive rate vs false positive rate
What’s an AUC?
the area under the ROC curve; a higher AUC indicates better performance
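AUC can be computed directly from predicted probabilities; a minimal sketch with scikit-learn (assumed installed) and invented scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # invented predicted probabilities of class 1

# AUC = chance that a random positive is scored above a random negative;
# here 3 of the 4 positive/negative pairs are ranked correctly
auc = roc_auc_score(y_true, y_score)
print(auc)
```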
Model selection for classification (part 5)
optimize and tune hyperparameters:
adjust model parameters to improve performance (tree depth, regularization strength, kernel type)
Use techniques like grid search or randomized search to explore hyperparameter combinations effectively
Model selection for classification (part 6)
consider ensemble methods:
combine multiple models for potentially better performance:
Random forests: ensemble of decision trees
Gradient boosting: improves models sequentially
Model selection for classification (part 7)
Address Data Issues:
handle missing values: imputation, deletion, or model-specific strategies
balance imbalanced classes: resampling, cost-sensitive learning, or specialized algorithms
feature selection: remove irrelevant or redundant features for efficiency and better generalization
Model selection for classification (part 8)
validate and select the final model:
use cross-validation for more robust evaluation: train and test multiple times on different data folds
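The fold-by-fold training and testing above can be sketched with scikit-learn's `cross_val_score` (library assumed installed); the toy data is invented:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = [[i] for i in range(20)]
y = [0] * 10 + [1] * 10

# Train and test 5 times, each time holding out a different fold
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(len(scores), scores.mean())
```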
assess performance on a separate validation set if available
choose the model with the best performance on unseen data, considering interpretability, computational cost, and deployment requirements.
Why is linear regression usually not appropriate for classification?
Because it can produce predictions below 0 or above 1, and for multiclass problems it imposes an artificial numeric ordering on categories.
What does logistic regression model?
The probability that an observation belongs to a particular class.
Why is logistic regression better than linear regression for binary classification?
Because logistic regression keeps predicted probabilities between 0 and 1
What shape does the logistic function have?
An S-shaped curve
What are the odds in logistic regression?
Odds = p / (1-p), where p is the probability of the event.
What is the logit?
The log-odds, or log(p/(1-p)).
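The odds and logit definitions above can be verified numerically in pure Python; the probability value is arbitrary:

```python
import math

p = 0.8                 # probability of the event (arbitrary example value)
odds = p / (1 - p)      # p/(1-p): about 4, the event is ~4x as likely as not
logit = math.log(odds)  # log-odds, the quantity logistic regression models linearly

# Inverting the logit with the logistic (sigmoid) function recovers p
p_back = 1 / (1 + math.exp(-logit))
print(odds, p_back)
```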
In logistic regression, how is a coefficient interpreted?
A one-unit increase in a predictor changes the log-odds by the coefficient.
How are logistic regression coefficients estimated?
By maximum likelihood, not least squares.
What does a positive logistic regression coefficient mean?
Increasing that predictor increases the probability of the event/class of interest.
What is multiple logistic regression?
Logistic regression using more than one predictor.
Why can a variable’s sign change between simple and multiple logistic regression?
Because of confounding or correlation among predictors.
What is confounding?
When the relationship between a predictor and the response is distorted because another related predictor is left out.
What is the main idea of LDA?
Instead of modeling P(Y|X) directly like logistic regression, LDA models the distribution of predictors within each class and then uses Bayes’ theorem for classification.
When can LDA be better than logistic regression?
When classes are approximately Gaussian, sample size is small, or classes are well separated.
What key assumption does LDA make?
Each class has a normal distribution and all classes share a common covariance matrix.
Why is LDA called “linear”?
Because it produces linear decision boundaries
What is QDA?
Quadratic Discriminant Analysis, a discriminant method like LDA but with more flexibility.
What key assumption differs between LDA and QDA?
LDA assumes a common covariance matrix, while QDA allows each class to have its own covariance matrix.
Why is QDA more flexible than LDA?
Because it can produce quadratic/nonlinear decision boundaries.
What is the tradeoff between LDA and QDA?
LDA has lower variance but can have more bias; QDA has lower bias but higher variance.
When is LDA usually preferred over QDA?
When the training set is small or the common covariance assumption is reasonable.
When is QDA usually preferred over LDA?
When the training set is large and the true boundary is nonlinear or class covariances differ.
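The LDA/QDA contrast can be sketched with scikit-learn's discriminant analysis classes (library assumed installed); the toy data is invented, with class 1 given a much larger spread so QDA's per-class covariance is meaningful:

```python
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Class 0 is tightly clustered; class 1 is widely spread
X = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 5], [9, 9], [5, 9], [9, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared covariance -> linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance -> curved boundary

print(lda.predict([[0.5, 0.5], [7, 7]]))
print(qda.predict([[0.5, 0.5], [7, 7]]))
```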
What is KNN?
K-Nearest Neighbors, a nonparametric classifier that assigns a class based on the majority class among the K closest training observations.
Why is KNN called nonparametric?
Because it does not assume a specific functional form for the decision boundary
When does KNN tend to perform well?
When the true decision boundary is highly nonlinear.
What is one downside of KNN compared with logistic regression or LDA?
It does not give easy coefficient-based interpretation of which predictors matter.
What happens when K in KNN is very small?
The model is more flexible, with low bias and high variance.
What happens when K in KNN is large?
The model is smoother, with higher bias and lower variance.
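The effect of K can be seen in a from-scratch sketch of the majority vote on invented 1-D data; a small K chases a noisy neighbor, a larger K smooths it out:

```python
from collections import Counter

def knn_predict(x, train, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Invented 1-D data: class 0 on the left, class 1 on the right,
# plus one "noisy" class-1 point at 3
train = [(1, 0), (2, 0), (3, 1), (4, 0), (8, 1), (9, 1), (10, 1)]

print(knn_predict(3.1, train, k=1))  # 1: follows the noisy neighbor (high variance)
print(knn_predict(3.1, train, k=3))  # 0: larger K averages the noise away (higher bias)
```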
What is a confusion matrix?
A table comparing predicted classes to true classes.
What does the diagonal of a confusion matrix represent?
Correct classifications.
What is sensitivity?
The proportion of actual positives correctly identified.
What is specificity?
The proportion of actual negatives correctly identified.
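Sensitivity and specificity follow directly from the confusion matrix; a pure-Python check with invented counts:

```python
# Invented 2x2 confusion matrix:
#                 predicted 0   predicted 1
# actual 0 (neg)       45            10
# actual 1 (pos)        5            40
tn, fp, fn, tp = 45, 10, 5, 40

sensitivity = tp / (tp + fn)  # actual positives correctly identified (recall)
specificity = tn / (tn + fp)  # actual negatives correctly identified
print(round(sensitivity, 3), round(specificity, 3))
```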
Why can overall error rate be misleading?
Because a classifier can have low overall error but still perform poorly on the class you care most about, especially with class imbalance.
What happens when you lower the classification threshold?
You usually catch more positives, increasing sensitivity, but also create more false positives.
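The threshold tradeoff can be demonstrated in pure Python; the predicted probabilities and labels below are invented:

```python
# Invented predicted probabilities and true labels
probs = [0.1, 0.42, 0.45, 0.6, 0.8, 0.9]
labels = [0, 1, 0, 1, 1, 1]

def sensitivity_and_false_positives(threshold):
    """Return (sensitivity, false positive count) at a given threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(yhat == 1 and y == 1 for yhat, y in zip(preds, labels))
    fp = sum(yhat == 1 and y == 0 for yhat, y in zip(preds, labels))
    return tp / sum(labels), fp

print(sensitivity_and_false_positives(0.5))  # (0.75, 0): misses one positive
print(sensitivity_and_false_positives(0.4))  # (1.0, 1): catches it, at the cost of a false positive
```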
What kind of decision boundary do logistic regression and LDA usually produce?
Linear decision boundaries
What kind of decision boundary can QDA produce?
Quadratic/nonlinear decision boundaries.
Which methods are easiest to interpret: logistic regression/LDA or KNN?
Logistic regression and LDA are easier to interpret
Which method is more likely to win when the boundary is strongly nonlinear?
KNN or sometimes QDA, depending on the situation.
Which methods tend to do better when data are limited and the boundary is close to linear?
Logistic regression or LDA.
Main comparison to memorize?
Logistic regression and LDA are similar and often good for simpler/linear problems; QDA is more flexible but higher variance; KNN is most flexible and can do well for nonlinear boundaries but is less interpretable.