Lecture Notes on Sentence Embeddings and K-Nearest Neighbors

KNN (K-Nearest Neighbors)

  • KNN is a supervised learning algorithm used for classification.

  • It assumes similar things cluster together.

  • Each data point is represented as a vector of feature values.

  • Uses distance (e.g., Euclidean distance) to measure proximity between points.

  • The parameter k determines how many neighbors are considered.

  • Classification is based on the majority class among the k-nearest neighbors.
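
A minimal from-scratch sketch of this procedure (Euclidean distance plus majority vote); the toy points and labels below are made up for illustration:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3):
        # Euclidean distance from the new point to every training point
        distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
        # indices of the k nearest neighbors
        nearest = np.argsort(distances)[:k]
        # majority vote among the neighbors' labels
        return Counter(y_train[nearest]).most_common(1)[0][0]

    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
    y_train = np.array(["A", "A", "B", "B"])
    print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # prints A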

Outliers

  • Outliers can negatively affect KNN classification by skewing the majority vote.

Class Imbalance

  • Class imbalance can occur when one class has significantly more instances than others.

  • Overlapping points can lead to misclassification.

Choosing K

  • Use trial and error to find the optimal k.

  • There is a rule of thumb for picking k:
    k ≈ √n, where n is the number of training points

  • Choose an odd number for k to avoid ties.
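
One way to run this trial-and-error search, sketched with scikit-learn and the wine dataset from the examples below; the range of k values tried here is an arbitrary choice, not from the lecture:

    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_wine(return_X_y=True)
    print("sqrt(n) starting point:", round(len(X) ** 0.5))  # rule-of-thumb value

    scores = {}
    for k in range(1, 22, 2):  # odd values only, to avoid ties
        model = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(model, X, y, cv=5).mean()

    best_k = max(scores, key=scores.get)
    print("best k:", best_k, "cv accuracy:", round(scores[best_k], 3))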

Weighted KNN

  • Give more weight to closer neighbors by calculating the inverse of the distance.
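
A sketch of inverse-distance weighting; the helper function and the small epsilon are illustrative assumptions, and scikit-learn exposes the same idea through the weights="distance" option:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def weighted_knn_predict(X_train, y_train, x_new, k=3):
        d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distances
        nearest = np.argsort(d)[:k]
        votes = {}
        for i in nearest:
            # closer neighbors contribute larger votes; epsilon avoids division by zero
            votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d[i] + 1e-9)
        return max(votes, key=votes.get)

    # the same idea in scikit-learn:
    model = KNeighborsClassifier(n_neighbors=3, weights="distance")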

When to Use/Not Use KNN

  • KNN is an instance-based learning algorithm with no explicit training phase.

  • The entire training set must be stored and scanned at prediction time in order to classify new points.

  • Not recommended for large datasets due to computational cost.

  • It is important to ensure all features are on the same scale, e.g., by normalizing them to the range 0 to 1 (see the scaling sketch after this list).

  • High-dimensional spaces can reduce the effectiveness of distance measurements (the curse of dimensionality).
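
A minimal min-max scaling sketch using scikit-learn's MinMaxScaler; the feature values are made up, and the scaler is fit on the training data only so the test set reuses the same min/max:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X_train = np.array([[150.0, 0.1], [200.0, 0.4], [120.0, 0.9]])
    X_test = np.array([[180.0, 0.5]])

    scaler = MinMaxScaler()                         # rescales each feature to [0, 1]
    X_train_scaled = scaler.fit_transform(X_train)  # learns min/max from training data
    X_test_scaled = scaler.transform(X_test)        # reuses the same min/max
    print(X_train_scaled)
    print(X_test_scaled)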

Examples

  • Weather Prediction: converts categorical data to numerical labels using scikit-learn's LabelEncoder.

  • Wine Classification: Demonstrates splitting the dataset into training and testing sets to evaluate different k values.

  • The KNeighborsClassifier from sklearn.neighbors is used: fit it on the training data, predict on the test set, and then measure accuracy (as sketched below).
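
An end-to-end sketch of that workflow, assuming scikit-learn; the tiny weather-style table below is invented for illustration and is not the lecture's actual dataset:

    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    weather = ["Sunny", "Rainy", "Overcast", "Sunny", "Rainy", "Overcast",
               "Sunny", "Rainy", "Sunny", "Overcast", "Rainy", "Sunny"]
    temp    = ["Hot", "Mild", "Hot", "Cool", "Cool", "Mild",
               "Mild", "Hot", "Cool", "Cool", "Mild", "Hot"]
    play    = ["No", "Yes", "Yes", "Yes", "No", "Yes",
               "Yes", "No", "Yes", "Yes", "Yes", "No"]

    # LabelEncoder maps each categorical value to an integer code
    weather_enc = LabelEncoder().fit_transform(weather)
    temp_enc    = LabelEncoder().fit_transform(temp)
    y           = LabelEncoder().fit_transform(play)
    X = list(zip(weather_enc, temp_enc))

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)           # "training" just stores the data
    y_pred = model.predict(X_test)
    print("accuracy:", accuracy_score(y_test, y_pred))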