Lecture Notes on Sentence Embeddings and K-Nearest Neighbors
KNN (K-Nearest Neighbors)
KNN is a supervised learning algorithm used for classification.
It assumes that similar data points lie close to each other in feature space.
Each data point is represented as a vector of feature values.
Uses distance (e.g., Euclidean distance) to measure proximity between points.
The parameter k determines how many neighbors are considered.
Classification is based on the majority class among the k-nearest neighbors.
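As a minimal from-scratch sketch of this idea (the function name knn_predict and the toy 2D points are only for illustration), a new point is classified by computing Euclidean distances to every training point and taking a majority vote among the k nearest:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3):
        # Euclidean distance from the new point to every training point
        distances = np.linalg.norm(X_train - x_new, axis=1)
        # Indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # Majority vote among the labels of the k nearest neighbors
        votes = Counter(y_train[nearest])
        return votes.most_common(1)[0][0]

    # Toy example: two classes (0 and 1) in 2D
    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0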
Outliers
Outliers can negatively affect KNN classification by skewing the majority vote.
Class Imbalance
Class imbalance occurs when one class has significantly more instances than others; the larger class can then dominate the majority vote.
Overlapping points can lead to misclassification.
Choosing K
Use trial and error to find the optimal k.
A common rule of thumb is k = √n, where n is the number of training points.
Choose an odd value of k to avoid ties in the vote.
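A rough sketch of both ideas, assuming scikit-learn and its built-in iris dataset (the range of k values tried is arbitrary):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Rule of thumb: start near sqrt(n), rounded to an odd number
    n = len(X)
    k_start = int(np.sqrt(n)) | 1  # bitwise OR with 1 forces an odd value
    print("sqrt(n) suggestion:", k_start)

    # Trial and error: compare cross-validated accuracy for several odd k values
    for k in range(1, 16, 2):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(f"k={k}: mean accuracy = {scores.mean():.3f}")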
Weighted KNN
Give more weight to closer neighbors, typically by weighting each neighbor's vote by the inverse of its distance.
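In scikit-learn this corresponds to the weights='distance' option of KNeighborsClassifier; a minimal sketch, again using the iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # weights='distance' makes each neighbor's vote proportional to 1 / distance,
    # so closer neighbors count more than distant ones (the default is 'uniform')
    model = KNeighborsClassifier(n_neighbors=5, weights='distance')
    model.fit(X, y)
    print(model.predict(X[:3]))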
When to Use/Not Use KNN
KNN is an instance-based learning algorithm with no explicit training phase.
The entire training set must be stored and searched every time a new point is classified.
Not recommended for large datasets due to computational cost.
It is important to ensure all features are on the same scale, for example by normalizing each feature to the range 0 to 1 (see the scaling sketch after this list).
High-dimensional spaces can reduce the effectiveness of distance measurements (the curse of dimensionality).
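A minimal sketch of 0-1 scaling with scikit-learn's MinMaxScaler (the small feature matrix is made up); the scaler is fit on the training data only and then reused for the test data so both end up on the same scale:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 800.0]])
    X_test = np.array([[1.5, 300.0]])

    # Fit on the training data only, then apply the same mapping to the test data
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print(X_train_scaled)
    print(X_test_scaled)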
Examples
Weather Prediction: Converts categorical features to numerical labels using scikit-learn's LabelEncoder.
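A minimal sketch of that encoding step with scikit-learn's LabelEncoder (the weather values are made up for illustration):

    from sklearn.preprocessing import LabelEncoder

    outlook = ['Sunny', 'Rainy', 'Overcast', 'Sunny', 'Rainy']

    # LabelEncoder maps each distinct category to an integer label
    encoder = LabelEncoder()
    outlook_encoded = encoder.fit_transform(outlook)
    print(list(encoder.classes_))   # ['Overcast', 'Rainy', 'Sunny']
    print(outlook_encoded)          # [2 1 0 2 1]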
Wine Classification: Demonstrates splitting the dataset into training and testing sets to evaluate different k values.
scikit-learn's KNeighborsClassifier (from sklearn.neighbors) is used: fit the model on the training data, predict on the test set, and then measure accuracy.
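A minimal sketch of that workflow using scikit-learn's built-in wine dataset (the test_size, random_state, and the k values tried are illustrative choices):

    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.preprocessing import MinMaxScaler

    X, y = load_wine(return_X_y=True)

    # Split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Scale features to 0-1 (important for distance-based methods like KNN)
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Evaluate a few different k values on the held-out test set
    for k in (1, 3, 5, 7):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print(f"k={k}: accuracy = {accuracy_score(y_test, y_pred):.3f}")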