Chapter 5 Classification I: training & predicting


Last updated 5:52 AM on 4/23/26

21 Terms

1
New cards

When should you use classification?

When the response variable is categorical (e.g., yes/no, species type) and you want to predict class labels.

2
New cards

What is a training dataset?

A dataset with predictors and known labels, used to teach the model the patterns it needs for making predictions on new data.

3
New cards

What is Euclidean distance?

The straight-line distance between two points in space.

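As a quick illustration (a minimal base-R sketch; the points `p` and `q` are made up for the example):

```r
# Euclidean distance: square root of the sum of squared coordinate differences
p <- c(1, 2)
q <- c(4, 6)
sqrt(sum((p - q)^2))  # a 3-4-5 triangle, so the distance is 5
```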
4
New cards

What is K-nearest neighbors (KNN)?

A classification algorithm that assigns a class based on the majority label among the K closest training points.

5
New cards

Why is scaling important in KNN?

Because distance calculations are sensitive to variable scale; larger-scale variables dominate distances.

6
New cards

Steps in KNN classification workflow

  1. Split data (train/test)

  2. Preprocess (scale, center, clean)

  3. Specify model

  4. Fit model to training data

  5. Predict on new/test data

  6. Evaluate performance
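The steps above can be sketched end to end in tidymodels (the dataset name `cancer_data`, the outcome `Class`, and the choice of 5 neighbors are illustrative assumptions, not from this chapter):

```r
library(tidymodels)

set.seed(1)
data_split <- initial_split(cancer_data, prop = 0.75, strata = Class)  # 1. split
train_data <- training(data_split)
test_data  <- testing(data_split)

knn_recipe <- recipe(Class ~ ., data = train_data) |>                  # 2. preprocess
  step_center(all_predictors()) |>
  step_scale(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular",              # 3. specify
                             neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = train_data)                                               # 4. fit

predictions <- predict(knn_fit, test_data) |>                          # 5. predict
  bind_cols(test_data)

predictions |> metrics(truth = Class, estimate = .pred_class)          # 6. evaluate
```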

7
New cards

What is a recipe in tidymodels?

A set of preprocessing steps applied to data before modeling.

It includes steps such as centering and scaling predictors, handling missing values (imputation), and balancing classes (e.g., upsampling).

8
New cards

Centering vs scaling

  • Centering: subtract mean

  • Scaling: divide by standard deviation

step_center(all_predictors()) |> step_scale(all_predictors()) (recipe steps are chained with the pipe, not +)

9
New cards

What is imputation?

Filling in missing values.

Syntax: step_impute_mean(all_predictors())

10
New cards

What is balancing data?

Adjusting class distribution to avoid bias toward majority class.

Syntax to upsample (step_upsample() comes from the themis package): step_upsample(outcome_variable)

11
New cards

What does bake() do?

Applies the trained recipe to new data
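A minimal sketch of the prep()/bake() pair (the dataset names `train_data` and `test_data` are illustrative):

```r
library(tidymodels)

# prep() estimates the preprocessing parameters (here, column means and
# standard deviations) from the training data; bake() then applies those
# same parameters to any dataset, such as the test set
scaled_recipe <- recipe(Class ~ ., data = train_data) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors()) |>
  prep()

baked_test <- bake(scaled_recipe, new_data = test_data)
```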

12
New cards

KNN model specification in R

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

13
New cards

What does set_mode("classification") do?

Tells the model to predict categorical outcomes.

14
New cards

What is a workflow and why do we use it?

Combines preprocessing (recipe) and model into one object. Ensures consistent application of preprocessing and modeling steps.

15
New cards

glimpse()

Quickly view structure of a dataset

16
New cards

distinct()

Returns unique rows/values — removes duplicate rows

17
New cards

fct_recode()

Is used to rename or relabel the categories (levels) of a factor variable.

For example, if a variable has levels like "M" and "F", you can change them to "Male" and "Female" without altering the data itself.
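That example looks like this in forcats (note that fct_recode() takes pairs of the form new_name = "old_name"):

```r
library(forcats)

sex <- factor(c("M", "F", "F"))
# new level name on the left, old (quoted) level name on the right
fct_recode(sex, Male = "M", Female = "F")
# "M" and "F" are relabeled Male and Female; the data values are unchanged
```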

18
New cards

drop_na()

Is used in a data wrangling pipeline to remove rows that contain missing (NA) values.

data |> drop_na()

19
New cards

bind_rows() vs bind_cols()

  • bind_rows(): stacks datasets vertically

  • bind_cols(): combines datasets horizontally
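A small sketch of the difference (the toy tibbles `a`, `b`, and `d` are made up):

```r
library(dplyr)

a <- tibble(x = 1:2)
b <- tibble(x = 3:4)
bind_rows(a, b)   # stacks vertically: 4 rows, one column x

d <- tibble(y = c("p", "q"))
bind_cols(a, d)   # combines horizontally: 2 rows, columns x and y
```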

20
New cards

When to use drop_na() vs imputation?

  • drop_na(): when few missing values

  • imputation: when you want to keep data

drop_na() removes entire rows that contain missing values, so you lose those observations.

Imputation fills in missing values (e.g., using the mean), allowing you to keep all observations.
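The contrast can be sketched on a toy tibble (the data frame `df` is made up for the example):

```r
library(tidymodels)

df <- tibble(x = c(1, NA, 3), y = c("a", "b", "b"))

df |> drop_na()   # keeps only the complete rows; the NA row is lost

# Imputation keeps all rows: x's NA is filled with the mean of the
# observed x values estimated by prep()
recipe(y ~ x, data = df) |>
  step_impute_mean(x) |>
  prep() |>
  bake(new_data = NULL)
```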

21
New cards

Why and how preprocess data?

  • Ensure variables are comparable and usable

Common steps:

recipe(class ~ ., data = training_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors()) |>
  step_impute_mean(all_predictors()) |>
  step_downsample(class)   # step_downsample() comes from the themis package