Chapter 6 Classification II: evaluation & tuning

Last updated 6:26 AM on 4/21/26

26 Terms

1

What are training, validation, and test datasets?

  • Training set: Used to fit/train the model

  • Validation set: Used to tune model parameters (e.g., choose k in KNN)

  • Test set: Used to evaluate final model performance (unbiased)
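A minimal rsample sketch of producing all three sets by splitting twice (the dataset `iris` and the 60/20/20 proportions are illustrative assumptions, not from the cards):

```r
library(rsample)

set.seed(123)

# First split off the test set (80% train+validation, 20% test)
outer_split <- initial_split(iris, prop = 0.80)
train_val   <- training(outer_split)
data_test   <- testing(outer_split)

# Then split the remainder into training and validation
# (75% of 80% = 60% of the full data for training)
inner_split <- initial_split(train_val, prop = 0.75)
data_train  <- training(inner_split)
data_valid  <- testing(inner_split)
```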

2

Why do we split data?

To prevent overfitting and get an unbiased estimate of model performance on new data.

3

What are the typical split proportions?

  • Training: ~50–95%

  • Test: ~5–50%

4

What is a random seed and why is it important?

A fixed starting point for randomness that ensures reproducibility of data splits and results.

5

What is accuracy and how is it calculated?

Proportion of correct predictions

Accuracy = (TP+TN) / Total

6

What is precision and how is it calculated?

Of predicted positives, how many are correct

Precision = TP / (TP + FP)

7

What is recall and how do you calculate it?

Of actual positives, how many were correctly identified

Recall = TP / (TP + FN)
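The three formulas above, worked through with made-up counts (TP = 40, FP = 10, FN = 5, TN = 45):

```r
TP <- 40; FP <- 10; FN <- 5; TN <- 45

accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # (40 + 45) / 100 = 0.85
precision <- TP / (TP + FP)                   # 40 / 50 = 0.80
recall    <- TP / (TP + FN)                   # 40 / 45 ≈ 0.89
```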

8

When is precision the most important measure?

When false positives are costly.

9

When is recall the most important measure?

When missing true positives is costly.

10

What is a confusion matrix?

A table comparing predicted vs true labels.

                  Predicted Positive   Predicted Negative
Actual Positive   TP                   FN
Actual Negative   FP                   TN

11

How do you create a confusion matrix?

conf_mat(data, truth = actual, estimate = predicted)
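A self-contained sketch of calling `conf_mat()` on toy data (the tibble and its column names are illustrative):

```r
library(tibble)
library(yardstick)

results <- tibble(
  actual    = factor(c("yes", "yes", "no", "no", "yes")),
  predicted = factor(c("yes", "no", "no", "yes", "yes"))
)

# In yardstick's printout, rows are predictions and columns are truth
conf_mat(results, truth = actual, estimate = predicted)
```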

12

What is a single validation split?

A single split of the data into training + validation sets, used once to tune the model.

The dataset is divided into three parts:

  • Training set → used to fit the model

  • Validation set → used to tune parameters (e.g., choose k)

  • Test set → used once for final evaluation

The validation split is created once and used once as a fixed reference.

13

What is cross-validation?

A method used to estimate how well a model generalizes to new data by repeatedly splitting the training data into different training and validation subsets.

14

How do you choose the optimal value of k in K-nearest neighbors?

You choose k by testing multiple values and selecting the one that maximizes predictive performance on unseen data using cross-validation.

Step-by-step process:

  1. Define a range of k values (e.g., 1 to 100).

  2. Perform cross-validation on the training data:

    • Split data into folds

    • Train on some folds, validate on others

  3. Compute performance metrics (usually accuracy for classification) for each k

  4. Select the k with the highest average validation accuracy

  5. Refit the model using this k on the full training set

  6. Evaluate once on the test set (final unbiased estimate)
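The steps above map onto a tidymodels workflow roughly like this (a sketch; `data` and the outcome column `Class` are assumed names, matching the later fill-in-the-blank cards):

```r
library(tidymodels)

set.seed(123)
data_split <- initial_split(data, prop = 0.75, strata = Class)
data_train <- training(data_split)
data_vfold <- vfold_cv(data_train, v = 5, strata = Class)

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

k_grid <- tibble(neighbors = seq(1, 100, by = 2))  # range of k values to try

knn_results <- workflow() |>
  add_recipe(recipe(Class ~ ., data = data_train)) |>
  add_model(knn_spec) |>
  tune_grid(resamples = data_vfold, grid = k_grid) |>
  collect_metrics()

# k with the highest mean cross-validation accuracy
best_k <- knn_results |>
  filter(.metric == "accuracy") |>
  slice_max(mean)
```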

15

What is overfitting?

Model fits training data too closely → poor generalization

  • KNN: k too small

16

What is underfitting?

Model too simple → misses patterns

  • KNN: k too large

17

Tradeoffs in KNN?

  • Small k → Low bias because it fits the training data closely. High variance because it's sensitive to noise/outliers (overfitting).

  • Large k → High bias because it makes the boundary too simple, missing underlying patterns. Low variance because it averages over more points, making predictions stable and less sensitive to noise (underfitting).

18

What are the advantages of the K-nearest neighbors (KNN) algorithm?

  • Simple and intuitive:
    KNN is easy to understand because it classifies points based on similarity (distance to neighbors), making it highly interpretable.

  • No training phase:
    KNN does not build a model in advance; it simply stores the training data. This makes it fast to set up, since there is no fitting step that can go wrong.

  • Flexible / non-parametric:
    KNN makes no assumptions about the shape of the data (e.g., linear relationships), allowing it to capture complex, non-linear patterns.

  • Adaptable to data:
    The model structure changes naturally with the data, making it effective when patterns are irregular or unknown.

  • Works well for small datasets:
    Since predictions require comparing to all training points, smaller datasets keep computation manageable and efficient.

19

What are the disadvantages of the K-nearest neighbors (KNN) algorithm?

  • Slow at prediction time:
    For every new point, KNN must compute distances to all training observations, making it computationally expensive for large datasets.

  • Sensitive to irrelevant features:
    Distance calculations treat all features equally, so irrelevant variables can distort distances and lead to incorrect classifications.

  • Requires feature scaling:
    Variables with larger numerical ranges dominate distance calculations, so data must be standardized or normalized to avoid bias.

  • Choice of k is critical: Selecting k requires careful tuning (e.g., cross-validation).

  • Memory intensive:
    KNN must store the entire dataset, which can be inefficient with large data.

20

How does v-fold cross validation work?

  • Split the training data into v equal parts (folds)

  • For each fold:

    • Train the model on v − 1 folds

    • Validate it on the remaining fold

  • Repeat this process v times, so each fold is used once as validation

  • Average the performance metrics (e.g., accuracy) across all folds
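For a fixed model (no tuning), the same procedure can be sketched with `fit_resamples()`; `data_train` and `Class` are assumed names matching the other cards:

```r
library(tidymodels)

data_vfold <- vfold_cv(data_train, v = 5, strata = Class)

knn_fixed <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

cv_results <- workflow() |>
  add_recipe(recipe(Class ~ ., data = data_train)) |>
  add_model(knn_fixed) |>
  fit_resamples(resamples = data_vfold) |>
  collect_metrics()  # metrics averaged across the 5 folds
```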

21

Complete the code to split data into training and testing sets:

set.seed(123)

data_split <- initial_split(data, prop = ___, strata = ___)

data_train <- training(data_split)

data_test <- testing(data_split)

0.75, Class

22

Complete the code for 5-fold cross-validation:

data_vfold <- vfold_cv(data_train, v = ___, strata = ___)

5, Class

23

Fill in the blanks to define a KNN classifier:

knn_spec <- nearest_neighbor(weight_func = ___, neighbors = ___) |>

set_engine("___") |>

set_mode("___")

"rectangular", tune(), "kknn", "classification"
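Filled in, the specification reads:

```r
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")
```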

24

Complete prediction code:

predictions <- predict(___, ___) |>
bind_cols(___)

knn_fit, data_test, data_test

25

Fill in the blank to compute accuracy:

accuracy(predictions, truth = ___, estimate = ___)

Class, .pred_class

26

What does tune() indicate in a model specification?

That the parameter (k) will be optimized using validation or cross-validation.