Chapter 5 Classification I: training & predicting

Last updated 5:52 AM on 4/23/26

21 Terms

When should you use classification?

When the response variable is categorical (e.g., yes/no, species type) and you want to predict class labels.

What is a training dataset?

A dataset with predictors and known labels used to teach the model patterns for making predictions on new data

What is Euclidean distance?

The straight-line distance between two points in space.
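
For two points a and b with the same number of coordinates, the distance is the square root of the sum of squared coordinate differences. A minimal base R sketch (the function name `euclidean_distance` is made up for illustration):

```r
# Euclidean distance between two numeric vectors of equal length
euclidean_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  sqrt(sum((a - b)^2))
}

p <- c(0, 0)
q <- c(3, 4)
euclidean_distance(p, q)  # 5 (the 3-4-5 right triangle)
```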

What is K-nearest neighbors (KNN)?

A classification algorithm that assigns a class based on the majority label among the K closest training points.
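
To make the majority-vote idea concrete, here is a hand-rolled base R sketch. It is not how the kknn engine is implemented; the function `knn_predict` and the toy data are assumptions for illustration only.

```r
# Predict the class of one new point by majority vote among its k nearest
# training points (Euclidean distance).
knn_predict <- function(train_x, train_y, new_point, k = 3) {
  # Distance from new_point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_point)^2))
  # Labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote (ties go to the first label in table order)
  names(which.max(table(nearest)))
}

# Toy training data: two well-separated groups
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
train_y <- c("benign", "benign", "benign",
             "malignant", "malignant", "malignant")

knn_predict(train_x, train_y, new_point = c(2, 2), k = 3)  # "benign"
```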

Why is scaling important in KNN?

Because distance calculations are sensitive to variable scale; larger-scale variables dominate distances.
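
A small base R illustration with made-up data (income in dollars, age in years): on the raw values the income column decides which point is nearest to `a`, while after standardizing both columns the nearest point changes.

```r
# Hypothetical observations; income has a far larger scale than age
x <- rbind(a = c(income = 50000, age = 25),
           b = c(income = 50500, age = 60),
           c = c(income = 52000, age = 26),
           d = c(income = 80000, age = 40))

d_raw    <- as.matrix(dist(x))         # distances on the original scale
d_scaled <- as.matrix(dist(scale(x)))  # distances after centering and scaling

d_raw["a", c("b", "c")]     # b is closer: the 35-year age gap barely counts
d_scaled["a", c("b", "c")]  # c is closer: age now contributes comparably
```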

Steps in KNN classification workflow

Split data (train/test)
Preprocess (scale, center, clean)
Specify model
Fit model to training data
Predict on new/test data
Evaluate performance

What is a recipe in tidymodels?

A set of preprocessing steps applied to data before modeling.

It includes steps such as centering and scaling predictors, handling missing values (imputation), and balancing classes (e.g., upsampling).

Centering vs scaling

Centering: subtract mean
Scaling: divide by standard deviation

Syntax: step_center(all_predictors()) |> step_scale(all_predictors())
(step_normalize(all_predictors()) does both in one step.)
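
The arithmetic behind these two steps can be checked by hand in base R. This is only a sketch of the calculation on a made-up vector, not the recipes implementation.

```r
x <- c(2, 4, 6, 8)

centered     <- x - mean(x)       # subtract the mean: -3 -1 1 3
standardized <- centered / sd(x)  # then divide by the standard deviation

mean(standardized)  # 0 (up to rounding)
sd(standardized)    # 1
```

Base R's `scale(x)` performs the same two operations in one call.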

What is imputation?

Filling in missing values.

Syntax: step_impute_mean(all_predictors())

What is balancing data?

Adjusting class distribution to avoid bias toward majority class.

Syntax to upsample: step_upsample(outcome_variable), from the themis package
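
The idea of upsampling can be sketched in base R: keep every row, then resample extra rows from each smaller class (with replacement) until all classes match the largest one. The data frame below is made up, and this is an illustration of the concept rather than what step_upsample() does internally.

```r
set.seed(42)

# Hypothetical imbalanced data: 8 benign rows, 2 malignant rows
df <- data.frame(
  class = factor(c(rep("benign", 8), rep("malignant", 2))),
  x = 1:10
)

n_major <- max(table(df$class))  # size of the largest class

upsampled <- do.call(rbind, lapply(split(df, df$class), function(g) {
  # Keep all original rows, then add resampled rows up to n_major
  extra <- sample(nrow(g), n_major - nrow(g), replace = TRUE)
  g[c(seq_len(nrow(g)), extra), ]
}))

table(upsampled$class)  # both classes now have 8 rows
```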

What does bake() do?

Applies a trained recipe (one already estimated with prep()) to a dataset, such as new or test data
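
A minimal sketch of the prep() then bake() sequence, assuming the recipes package is installed. The data frames `train` and `test_data` and their columns are made up for illustration.

```r
library(recipes)

train <- data.frame(class = factor(c("a", "b", "a", "b")),
                    x = c(1, 2, 3, 4))
test_data <- data.frame(class = factor(c("a", "b")),
                        x = c(2.5, 10))

rec <- recipe(class ~ x, data = train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# prep() estimates the mean and sd of x from the training data only
trained <- prep(rec, training = train)

# bake() applies those training-set values to another dataset
baked <- bake(trained, new_data = test_data)
baked
```

Because the training mean of `x` is 2.5, the first test row is centered to 0; the test data never influences the estimated mean or standard deviation.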

KNN model specification in R

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

What does set_mode("classification") do?

Tells the model to predict categorical outcomes.

What is a workflow and why do we use it?

Combines preprocessing (recipe) and model into one object. Ensures consistent application of preprocessing and modeling steps.
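
The sketch below strings the pieces together (split, recipe, model specification, workflow, fit, predict, evaluate). It assumes the tidymodels and kknn packages are installed and uses R's built-in iris data with two arbitrarily chosen predictors; none of these choices come from the chapter itself.

```r
library(tidymodels)

set.seed(1)

# 1. Split data into training and test sets
iris_split <- initial_split(iris, prop = 0.75, strata = Species)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

# 2. Preprocess: center and scale the predictors
iris_recipe <- recipe(Species ~ Petal.Length + Petal.Width, data = iris_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# 3. Specify the model
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# 4. Combine recipe and model in a workflow, then fit on the training data
knn_fit <- workflow() |>
  add_recipe(iris_recipe) |>
  add_model(knn_spec) |>
  fit(data = iris_train)

# 5. Predict on the test data
iris_predictions <- predict(knn_fit, iris_test) |>
  bind_cols(iris_test)

# 6. Evaluate performance
iris_predictions |>
  metrics(truth = Species, estimate = .pred_class)
```

Fitting the workflow preps the recipe on the training data, and predict() applies the same trained preprocessing to the test data automatically.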

glimpse()

Quickly view structure of a dataset

distinct()

Returns the unique rows of a dataset (or the unique combinations of the selected columns), removing duplicates

fct_recode()

Renames or relabels the categories (levels) of a factor variable.

For example, if a variable has levels like "M" and "F", you can change them to "Male" and "Female" without changing which observations belong to each category.
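
A short sketch of the "M"/"F" example, assuming the forcats package is installed; the vector `sex` is made up. Note the argument order: the new label goes on the left, the old level on the right.

```r
library(forcats)

sex <- factor(c("M", "F", "F", "M"))

# new label = old level
sex_relabelled <- fct_recode(sex, "Male" = "M", "Female" = "F")

levels(sex_relabelled)        # "Female" "Male"
as.character(sex_relabelled)  # "Male" "Female" "Female" "Male"
```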

drop_na()

Is used in a data wrangling pipeline to remove rows that contain missing (NA) values

data |> drop_na()

bind_rows() vs bind_cols()

bind_rows(): stacks datasets vertically
bind_cols(): combines datasets horizontally
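
A quick sketch of the difference, assuming dplyr is installed; the small tibbles are made up for illustration.

```r
library(dplyr)

a <- tibble(id = 1:2, x = c(10, 20))
b <- tibble(id = 3:4, x = c(30, 40))
labels <- tibble(label = c("low", "high"))

stacked <- bind_rows(a, b)       # same columns, rows added: 4 rows, 2 columns
widened <- bind_cols(a, labels)  # same rows, columns added: 2 rows, 3 columns

dim(stacked)
dim(widened)
```

bind_rows() matches columns by name, while bind_cols() matches rows by position, so the inputs to bind_cols() must have the same number of rows in the same order.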

When to use drop_na() vs imputation?

drop_na(): when few missing values
imputation: when you want to keep data

drop_na() removes entire rows that contain missing values, so you lose those observations.

Imputation fills in missing values (e.g., using the mean), allowing you to keep all observations.
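
The trade-off can be seen on a made-up data frame, assuming tidyr is installed. The imputation here is done by hand in base R as a sketch of mean imputation, not via step_impute_mean().

```r
library(tidyr)

df <- data.frame(id = 1:5, height = c(10, NA, 14, 12, NA))

# drop_na(): rows with a missing height are removed
dropped <- drop_na(df)
nrow(dropped)  # 3 of the 5 observations remain

# Mean imputation: missing heights are replaced by the mean of the observed ones
imputed <- df
imputed$height <- ifelse(is.na(df$height),
                         mean(df$height, na.rm = TRUE),
                         df$height)
imputed$height  # 10 12 14 12 12
```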

Why and how preprocess data?

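Preprocessing makes variables comparable (same scale) and usable (no missing values, balanced classes).

Common steps (impute first, then center and scale, then balance the classes; step_downsample() comes from the themis package):

recipe(class ~ ., data = training_data) |>
  step_impute_mean(all_predictors()) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors()) |>
  step_downsample(class)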