What are training, validation, and test datasets?
Training set: Used to fit/train the model
Validation set: Used to tune model parameters (e.g., choose k in KNN)
Test set: Used to evaluate final model performance (unbiased)
Why do we split data?
To prevent overfitting and get an unbiased estimate of model performance on new data.
What are the typical split proportions?
Training: ~50–95% of the data (75% is a common default)
Test: the remaining ~5–50%
What is a random seed and why is it important?
A fixed starting point for randomness that ensures reproducibility of data splits and results.
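The course's splitting code uses R (`set.seed()` with `initial_split()`), but the idea is language-agnostic. A minimal Python sketch of a reproducible split — the helper name `split_indices` is hypothetical, for illustration only:

```python
import random

def split_indices(n, prop, seed):
    # Hypothetical helper: shuffle row indices with a fixed seed,
    # then cut into training and test portions.
    idx = list(range(n))
    rng = random.Random(seed)   # fixed seed -> same shuffle every run
    rng.shuffle(idx)
    cut = int(n * prop)
    return idx[:cut], idx[cut:]

train_a, test_a = split_indices(10, 0.75, seed=123)
train_b, test_b = split_indices(10, 0.75, seed=123)
assert train_a == train_b  # same seed gives an identical, reproducible split
```

Rerunning with the same seed always reproduces the same partition, which is why analyses report the seed they used.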
What is accuracy and how is it calculated?
Proportion of correct predictions
Accuracy = (TP+TN) / Total
What is precision and how is it calculated?
Of predicted positives, how many are correct
Precision = TP / (TP + FP)
What is recall and how do you calculate it?
Of actual positives, how many were correctly identified
Recall = TP / (TP + FN)
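The three formulas above can be checked directly from the confusion-matrix counts. A minimal Python sketch (the example counts are made up for illustration):

```python
def accuracy(tp, tn, fp, fn):
    # proportion of all predictions that are correct
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # of predicted positives, how many are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # of actual positives, how many were correctly identified
    return tp / (tp + fn)

# illustrative counts: TP=40, TN=45, FP=5, FN=10 (total = 100)
print(accuracy(40, 45, 5, 10))  # 0.85
print(recall(40, 10))           # 0.8
```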
When is precision the most important measure?
When false positives are costly.
When is recall the most important measure?
When missing true positives is costly.
What is a confusion matrix?
A table comparing predicted vs true labels.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
How do you create a confusion matrix?
conf_mat(data, truth = actual, estimate = predicted)
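`conf_mat()` is the yardstick (R) function the course uses. As a language-agnostic illustration of what it tallies, here is a pure-Python sketch — the helper name `confusion_counts` is hypothetical:

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="yes"):
    # tally (actual, predicted) label pairs into TP/FP/FN/TN
    pairs = Counter(zip(actual, predicted))
    tp = pairs[(positive, positive)]
    tn = sum(n for (a, p), n in pairs.items() if a != positive and p != positive)
    fp = sum(n for (a, p), n in pairs.items() if a != positive and p == positive)
    fn = sum(n for (a, p), n in pairs.items() if a == positive and p != positive)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
print(confusion_counts(actual, predicted))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```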
What is a single validation split?
One split into training + validation to tune model.
The dataset is divided into three parts:
Training set → used to fit the model
Validation set → used to tune parameters (e.g., choose k)
Test set → used once for final evaluation
The validation split is created once and used once as a fixed reference.
What is cross-validation?
A method used to estimate how well a model generalizes to new data by repeatedly splitting the training data into different training and validation subsets.
How do you choose the optimal value of k in K-nearest neighbors?
You choose k by testing multiple values and selecting the one that maximizes predictive performance on unseen data using cross-validation.
Step-by-step process:
Define a range of k values (e.g., 1 to 100).
Perform cross-validation on the training data:
Split data into folds
Train on some folds, validate on others
Compute performance metrics (usually accuracy for classification) for each k
Select the k with the highest average validation accuracy
Refit the model using this k on the full training set
Evaluate once on the test set (final unbiased estimate)
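The course does this tuning with tidymodels (`tune()` plus `vfold_cv()` in R). The steps above can be sketched end-to-end in plain Python on 1-D toy data — the helper names `knn_predict` and `cv_accuracy`, and the toy dataset, are illustrative assumptions:

```python
from collections import Counter

def knn_predict(train, query, k):
    # train: list of (x, label) pairs; distance is |x - query| on 1-D data
    neighbors = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def cv_accuracy(data, k, v=5):
    # v-fold cross-validation: each fold is held out once as validation
    folds = [data[i::v] for i in range(v)]
    accs = []
    for i in range(v):
        val = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        correct = sum(knn_predict(train, x, k) == y for x, y in val)
        accs.append(correct / len(val))
    return sum(accs) / len(accs)  # average validation accuracy for this k

# toy data: class "a" clusters near 0, class "b" near 10
data = [(i * 0.5, "a") for i in range(20)] + [(10 + i * 0.5, "b") for i in range(20)]

# grid-search odd k values, keep the one with the best CV accuracy
best_k = max(range(1, 10, 2), key=lambda k: cv_accuracy(data, k))
```

After choosing `best_k`, you would refit on the full training set and evaluate once on the held-out test set, exactly as the steps above describe.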
What is overfitting?
Model fits training data too closely → poor generalization
KNN: k too small
What is underfitting?
Model too simple → misses patterns
KNN: k too large
Tradeoffs in KNN?
Small k → Low bias because it fits the training data closely. High variance because it's sensitive to noise/outliers (overfitting).
Large k → High bias because it makes the boundary too simple, missing underlying patterns. Low variance because it averages over more points, making predictions stable and less sensitive to noise (underfitting).
What are the advantages of the K-nearest neighbors (KNN) algorithm?
Simple and intuitive:
KNN is easy to understand because it classifies points based on similarity (distance to neighbors), making it highly interpretable.
No training phase:
KNN does not build a model in advance; it simply stores the training data. This makes it fast to set up and avoids errors from incorrectly fitting a model during training.
Flexible / non-parametric:
KNN makes no assumptions about the shape of the data (e.g., linear relationships), allowing it to capture complex, non-linear patterns.
Adaptable to data:
The model structure changes naturally with the data, making it effective when patterns are irregular or unknown.
Works well for small datasets:
Since predictions require comparing to all training points, smaller datasets keep computation manageable and efficient.
What are the disadvantages of the K-nearest neighbors (KNN) algorithm?
Slow at prediction time:
For every new point, KNN must compute distances to all training observations, making it computationally expensive for large datasets.
Sensitive to irrelevant features:
Distance calculations treat all features equally, so irrelevant variables can distort distances and lead to incorrect classifications.
Requires feature scaling:
Variables with larger numerical ranges dominate distance calculations, so data must be standardized or normalized to avoid bias.
Choice of k is critical: Selecting k requires careful tuning (e.g., cross-validation).
Memory intensive:
KNN must store the entire dataset, which can be inefficient with large data.
How does v-fold cross validation work?
Split the training data into v equal parts (folds)
For each fold:
Train the model on v − 1 folds
Validate it on the remaining fold
Repeat this process v times, so each fold is used once as validation
Average the performance metrics (e.g., accuracy) across all folds
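The fold mechanics above (tidymodels' `vfold_cv()` in the course code) can be sketched with plain index lists in Python — the helper name `vfold_indices` is hypothetical:

```python
def vfold_indices(n, v):
    # assign each of n row indices to one of v roughly equal folds,
    # then yield (train, validation) index pairs, one per fold
    folds = [list(range(i, n, v)) for i in range(v)]
    splits = []
    for i in range(v):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, val))
    return splits

for train_idx, val_idx in vfold_indices(10, 5):
    print(len(train_idx), len(val_idx))  # each split: 8 training rows, 2 validation rows
```

Every row appears in exactly one validation fold, so averaging the v fold metrics uses the whole training set without ever validating on a row the model was fit to.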
Complete the code to split data into training and testing sets:
set.seed(123)
data_split <- initial_split(data, prop = ___, strata = ___)
data_train <- training(data_split)
data_test <- testing(data_split)
0.75, Class
Complete the code for 5-fold cross-validation:
data_vfold <- vfold_cv(data_train, v = ___, strata = ___)
5, Class
Fill in the blanks to define a KNN classifier:
knn_spec <- nearest_neighbor(weight_func = ___, neighbors = ___) |>
set_engine("___") |>
set_mode("___")
"rectangular", tune(), "kknn", "classification"
Complete prediction code:
predictions <- predict(___, ___) |>
bind_cols(___)
knn_fit, data_test, data_test
Fill in the blank to compute accuracy:
accuracy(predictions, truth = ___, estimate = ___)
Class, .pred_class
What does tune() indicate in a model specification?
That the parameter (k) will be optimized using validation or cross-validation.