What are training, validation, and test datasets?
Training set: Used to fit/train the model
Validation set: Used to tune model parameters (e.g., choose k in KNN)
Test set: Used to evaluate final model performance (unbiased)
Why do we split data?
To prevent overfitting and get an unbiased estimate of model performance on new data.
What are the typical split proportions?
Training: ~50–95% of the data (75% is a common default)
Test: the remaining ~5–50%
What is a random seed and why is it important?
A fixed starting point for randomness that ensures reproducibility of data splits and results.
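The course's splitting code uses R (`set.seed()` with `initial_split()`), but the idea is language-agnostic. A minimal Python sketch of a reproducible split — the helper name `split_indices` is hypothetical, for illustration only:

```python
import random

def split_indices(n, prop, seed):
    # Hypothetical helper: shuffle row indices with a fixed seed,
    # then cut into training and test portions.
    idx = list(range(n))
    rng = random.Random(seed)   # fixed seed -> same shuffle every run
    rng.shuffle(idx)
    cut = int(n * prop)
    return idx[:cut], idx[cut:]

train_a, test_a = split_indices(10, 0.75, seed=123)
train_b, test_b = split_indices(10, 0.75, seed=123)
assert train_a == train_b  # same seed gives an identical, reproducible split
```

Rerunning with the same seed always reproduces the same partition, which is why analyses report the seed they used.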
What is accuracy and how is it calculated?
Proportion of correct predictions
Accuracy = (TP+TN) / Total
What is precision and how is it calculated?
Of predicted positives, how many are correct
Precision = TP / (TP + FP)
What is recall and how do you calculate it?
Of actual positives, how many were correctly identified
Recall = TP / (TP + FN)
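The three formulas above can be checked directly from the confusion-matrix counts. A minimal Python sketch (the example counts are made up for illustration):

```python
def accuracy(tp, tn, fp, fn):
    # proportion of all predictions that are correct
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # of predicted positives, how many are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # of actual positives, how many were correctly identified
    return tp / (tp + fn)

# illustrative counts: TP=40, TN=45, FP=5, FN=10 (total = 100)
print(accuracy(40, 45, 5, 10))  # 0.85
print(recall(40, 10))           # 0.8
```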
When is precision the most important measure?
When false positives are costly.
When is recall the most important measure?
When missing true positives is costly.
What is a confusion matrix?
A table comparing predicted vs true labels.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
How do you create a confusion matrix?
conf_mat(data, truth = actual, estimate = predicted)
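`conf_mat()` is the yardstick (R) function the course uses. As a language-agnostic illustration of what it tallies, here is a pure-Python sketch — the helper name `confusion_counts` is hypothetical:

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="yes"):
    # tally (actual, predicted) label pairs into TP/FP/FN/TN
    pairs = Counter(zip(actual, predicted))
    tp = pairs[(positive, positive)]
    tn = sum(n for (a, p), n in pairs.items() if a != positive and p != positive)
    fp = sum(n for (a, p), n in pairs.items() if a != positive and p == positive)
    fn = sum(n for (a, p), n in pairs.items() if a == positive and p != positive)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
print(confusion_counts(actual, predicted))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```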
What is a single validation split?
One split into training + validation to tune model.
The dataset is divided into three parts:
Training set → used to fit the model
Validation set → used to tune parameters (e.g., choose k)
Test set → used once for final evaluation
The validation split is created once and used once as a fixed reference.
What is cross-validation?
A method used to estimate how well a model generalizes to new data by repeatedly splitting the training data into different training and validation subsets.
How do you choose the optimal value of k in K-nearest neighbors?
You choose k by testing multiple values and selecting the one that maximizes predictive performance on unseen data using cross-validation.
Step-by-step process:
Define a range of k values (e.g., 1 to 100).
Perform cross-validation on the training data:
Split data into folds
Train on some folds, validate on others
Compute performance metrics (usually accuracy for classification) for each k
Select the k with the highest average validation accuracy
Refit the model using this k on the full training set
Evaluate once on the test set (final unbiased estimate)
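The course does this tuning with tidymodels (`tune()` plus `vfold_cv()` in R). The steps above can be sketched end-to-end in plain Python on 1-D toy data — the helper names `knn_predict` and `cv_accuracy`, and the toy dataset, are illustrative assumptions:

```python
from collections import Counter

def knn_predict(train, query, k):
    # train: list of (x, label) pairs; distance is |x - query| on 1-D data
    neighbors = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def cv_accuracy(data, k, v=5):
    # v-fold cross-validation: each fold is held out once as validation
    folds = [data[i::v] for i in range(v)]
    accs = []
    for i in range(v):
        val = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        correct = sum(knn_predict(train, x, k) == y for x, y in val)
        accs.append(correct / len(val))
    return sum(accs) / len(accs)  # average validation accuracy for this k

# toy data: class "a" clusters near 0, class "b" near 10
data = [(i * 0.5, "a") for i in range(20)] + [(10 + i * 0.5, "b") for i in range(20)]

# grid-search odd k values, keep the one with the best CV accuracy
best_k = max(range(1, 10, 2), key=lambda k: cv_accuracy(data, k))
```

After choosing `best_k`, you would refit on the full training set and evaluate once on the held-out test set, exactly as the steps above describe.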
What is overfitting?
Model fits training data too closely → poor generalization
KNN: k too small
What is underfitting?
Model too simple → misses patterns
KNN: k too large
Tradeoffs in KNN?
Small k → Low bias because it fits the training data closely. High variance because it's sensitive to noise/outliers (overfitting).
Large k → High bias because it makes the boundary too simple, missing underlying patterns. Low variance because it averages over more points, making predictions stable and less sensitive to noise (underfitting).
What are the advantages of the K-nearest neighbors (KNN) algorithm?
Simple and intuitive:
KNN is easy to understand because it classifies points based on similarity (distance to neighbors), making it highly interpretable.
No training phase:
KNN does not build a model in advance; it simply stores the training data. This makes it fast to set up and avoids errors from incorrectly fitting a model during training.
Flexible / non-parametric:
KNN makes no assumptions about the shape of the data (e.g., linear relationships), allowing it to capture complex, non-linear patterns.
Adaptable to data:
The model structure changes naturally with the data, making it effective when patterns are irregular or unknown.
Works well for small datasets:
Since predictions require comparing to all training points, smaller datasets keep computation manageable and efficient.
What are the disadvantages of the K-nearest neighbors (KNN) algorithm?
Slow at prediction time:
For every new point, KNN must compute distances to all training observations, making it computationally expensive for large datasets.
Sensitive to irrelevant features:
Distance calculations treat all features equally, so irrelevant variables can distort distances and lead to incorrect classifications.
Requires feature scaling:
Variables with larger numerical ranges dominate distance calculations, so data must be standardized or normalized to avoid bias.
Choice of k is critical: Selecting k requires careful tuning (e.g., cross-validation).
Memory intensive:
KNN must store the entire dataset, which can be inefficient with large data.
How does v-fold cross validation work?
Split the training data into v equal parts (folds)
For each fold:
Train the model on v − 1 folds
Validate it on the remaining fold
Repeat this process v times, so each fold is used once as validation
Average the performance metrics (e.g., accuracy) across all folds
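The fold mechanics above (tidymodels' `vfold_cv()` in the course code) can be sketched with plain index lists in Python — the helper name `vfold_indices` is hypothetical:

```python
def vfold_indices(n, v):
    # assign each of n row indices to one of v roughly equal folds,
    # then yield (train, validation) index pairs, one per fold
    folds = [list(range(i, n, v)) for i in range(v)]
    splits = []
    for i in range(v):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, val))
    return splits

for train_idx, val_idx in vfold_indices(10, 5):
    print(len(train_idx), len(val_idx))  # each split: 8 training rows, 2 validation rows
```

Every row appears in exactly one validation fold, so averaging the v fold metrics uses the whole training set without ever validating on a row the model was fit to.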
Complete the code to split data into training and testing sets:
set.seed(123)
data_split <- initial_split(data, prop = ___, strata = ___)
data_train <- training(data_split)
data_test <- testing(data_split)
0.75, Class
Complete the code for 5-fold cross-validation:
data_vfold <- vfold_cv(data_train, v = ___, strata = ___)
5, Class
Fill in the blanks to define a KNN classifier:
knn_spec <- nearest_neighbor(weight_func = ___, neighbors = ___) |>
set_engine("___") |>
set_mode("___")
"rectangular", tune(), "kknn", "classification"
Complete prediction code:
predictions <- predict(___, ___) |>
bind_cols(___)
knn_fit, data_test, data_test
Fill in the blank to compute accuracy:
accuracy(predictions, truth = ___, estimate = ___)
Class, .pred_class
What does tune() indicate in a model specification?
That the parameter (k) will be optimized using validation or cross-validation.