Lecture Notes on Sentence Embeddings and K-Nearest Neighbors

KNN (K-Nearest Neighbors)

  • KNN is a supervised learning algorithm used for classification.

  • It assumes similar things cluster together.

  • Each data point is represented as a vector of feature values.

  • Uses distance (e.g., Euclidean distance) to measure proximity between points.

  • The parameter k determines how many neighbors are considered.

  • Classification is based on the majority class among the k-nearest neighbors.
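
A minimal from-scratch sketch of this procedure (Euclidean distance plus majority vote); the toy points and labels below are made up for illustration:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3):
        # Euclidean distance from the new point to every training point
        distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
        # indices of the k nearest neighbors
        nearest = np.argsort(distances)[:k]
        # majority vote among the neighbors' labels
        return Counter(y_train[nearest]).most_common(1)[0][0]

    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
    y_train = np.array(["A", "A", "B", "B"])
    print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # prints A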

Outliers

  • Outliers can negatively affect KNN classification by skewing the majority vote.

Class Imbalance

  • Class imbalance can occur when one class has significantly more instances than others.

  • Overlapping points can lead to misclassification.

Choosing K

  • Use trial and error to find the optimal k.

  • There is a rule of thumb for picking k:
    k ≈ √n, where n is the number of training points

  • Choose an odd number for k to avoid ties.
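
One way to run this trial-and-error search, sketched with scikit-learn and the wine dataset from the examples below; the range of k values tried here is an arbitrary choice, not from the lecture:

    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_wine(return_X_y=True)
    print("sqrt(n) starting point:", round(len(X) ** 0.5))  # rule-of-thumb value

    scores = {}
    for k in range(1, 22, 2):  # odd values only, to avoid ties
        model = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(model, X, y, cv=5).mean()

    best_k = max(scores, key=scores.get)
    print("best k:", best_k, "cv accuracy:", round(scores[best_k], 3))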

Weighted KNN

  • Give more weight to closer neighbors by calculating the inverse of the distance.
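
A sketch of inverse-distance weighting; the helper function and the small epsilon are illustrative assumptions, and scikit-learn exposes the same idea through the weights="distance" option:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def weighted_knn_predict(X_train, y_train, x_new, k=3):
        d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distances
        nearest = np.argsort(d)[:k]
        votes = {}
        for i in nearest:
            # closer neighbors contribute larger votes; epsilon avoids division by zero
            votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d[i] + 1e-9)
        return max(votes, key=votes.get)

    # the same idea in scikit-learn:
    model = KNeighborsClassifier(n_neighbors=3, weights="distance")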

When to Use/Not Use KNN

  • KNN is an instance-based learning algorithm with no explicit training phase.

  • The entire training set must be stored and scanned at prediction time in order to classify new points.

  • Not recommended for large datasets due to computational cost.

  • It is important to ensure all features are on the same scale, e.g., by normalizing them to the range 0 to 1 (see the scaling sketch after this list).

  • High-dimensional spaces can reduce the effectiveness of distance measurements (the curse of dimensionality).
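
A minimal min-max scaling sketch using scikit-learn's MinMaxScaler; the feature values are made up, and the scaler is fit on the training data only so the test set reuses the same min/max:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X_train = np.array([[150.0, 0.1], [200.0, 0.4], [120.0, 0.9]])
    X_test = np.array([[180.0, 0.5]])

    scaler = MinMaxScaler()                         # rescales each feature to [0, 1]
    X_train_scaled = scaler.fit_transform(X_train)  # learns min/max from training data
    X_test_scaled = scaler.transform(X_test)        # reuses the same min/max
    print(X_train_scaled)
    print(X_test_scaled)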

Examples

  • Weather Prediction: converts categorical data to numerical labels using scikit-learn's LabelEncoder.

  • Wine Classification: Demonstrates splitting the dataset into training and testing sets to evaluate different k values.

  • The KNeighborsClassifier from sklearn.neighbors is used: fit it on the training data, predict on the test set, and then measure accuracy (as sketched below).
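
An end-to-end sketch of that workflow, assuming scikit-learn; the tiny weather-style table below is invented for illustration and is not the lecture's actual dataset:

    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    weather = ["Sunny", "Rainy", "Overcast", "Sunny", "Rainy", "Overcast",
               "Sunny", "Rainy", "Sunny", "Overcast", "Rainy", "Sunny"]
    temp    = ["Hot", "Mild", "Hot", "Cool", "Cool", "Mild",
               "Mild", "Hot", "Cool", "Cool", "Mild", "Hot"]
    play    = ["No", "Yes", "Yes", "Yes", "No", "Yes",
               "Yes", "No", "Yes", "Yes", "Yes", "No"]

    # LabelEncoder maps each categorical value to an integer code
    weather_enc = LabelEncoder().fit_transform(weather)
    temp_enc    = LabelEncoder().fit_transform(temp)
    y           = LabelEncoder().fit_transform(play)
    X = list(zip(weather_enc, temp_enc))

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)           # "training" just stores the data
    y_pred = model.predict(X_test)
    print("accuracy:", accuracy_score(y_test, y_pred))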