Support Vector Machines (SVM)
Introduction
- SVM is a supervised machine learning algorithm used mainly for classification (a regression variant also exists).
- The goal is to find the best line (or hyperplane in higher dimensions) that separates different classes of data points.
- The "best" line is the one that maximizes the margin, which is the widest separation between the classes.
Core Concepts
- Margin: The distance between the separating hyperplane and the closest data points of each class, i.e., the widest region that separates the two classes. A larger margin indicates better separation.
- Support Vectors: The data points that lie on the boundary of the margin. These points are crucial for defining the separating hyperplane. Once these vectors are determined, other data points can be disregarded.
- Optimization Problem: Training an SVM means finding the hyperplane that maximizes this margin, i.e., the distance to the closest points of each class.
- These distance computations reduce to dot products between data points, which is what later makes the kernel trick possible (see the formulation below).
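For reference, the standard textbook form of this optimization (a sketch; w is the hyperplane's normal vector, b its offset, x_i the training points and y_i in {-1, +1} their labels):

```latex
% Maximize the margin 2 / ||w|| by minimizing ||w||^2, while keeping every
% training point on the correct side of the margin (hard-margin case).
\min_{w,\, b} \;\; \frac{1}{2}\, \lVert w \rVert^2
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \;\ge\; 1 \quad \text{for all } i
```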
Linear SVM
- The goal is to find the separating hyperplane with the widest possible margin between the classes.
- With a hard margin, no training point is allowed to fall inside the margin.
- If the data isn't perfectly separable, outliers cause issues.
- A strict (hard) margin hugs those outliers and can misclassify new observations, i.e., it generalizes poorly (a minimal linear example is sketched below).
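A minimal sketch of a linear SVM in scikit-learn; the toy points and the large C value (used to approximate a hard margin) are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: two roughly separated blobs (made-up values for illustration).
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 3.0],   # class 0
              [6.0, 7.0], [6.5, 6.5], [7.0, 8.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel with a very large C approximates a hard margin.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)  # points that define the margin
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("prediction for (3, 3):", clf.predict([[3.0, 3.0]]))
```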
Soft Margin
- Allows for misclassifications during training to improve generalization.
- Trades some errors on the training data for a wider margin.
- Regularization: A technique used to control the number of misclassifications allowed. This helps balance training accuracy with generalization performance.
Regularization
- Controlled by a parameter (C in scikit-learn) that sets how heavily misclassifications of training points are penalized.
- Balances fitting the training data well against penalizing errors so strictly that the model fails to generalize (see the sketch below).
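A rough sketch of the effect of C on overlapping data; the dataset and candidate C values are arbitrary:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs, so some misclassification is unavoidable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Training accuracy tends to rise and the margin to shrink as C grows,
    # but a large C may generalize worse on noisy data.
    print(f"C={C:>6}: train acc={clf.score(X, y):.2f}, "
          f"#support vectors={len(clf.support_)}")
```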
Non-Linear SVM and Kernel Trick
- Non-Linear Transformations: Applying a transformation to data points to make them linearly separable in a higher-dimensional space.
- If data is not linearly separable in its original space, it can be transformed into a higher-dimensional space where it is separable.
- Examples of transformations:
- Mapping a 1D dataset to 2D using x_1^2.
- Using the modulo operator (mod) to separate points.
- The linear boundary found in the higher-dimensional space corresponds to a non-linear boundary in the original space, which gives more accurate classifications (see the sketch below).
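A sketch of the 1D-to-2D idea: points that cannot be split by a single threshold on a line become separable once x² is added as a second coordinate (the numbers below are made up):

```python
import numpy as np
from sklearn.svm import SVC

# 1D data: the negative class sits in the middle, so no single threshold separates it.
x = np.array([-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0])
y = np.array([ 1,    1,    0,   0,   0,   1,   1 ])

# Map each point to (x, x^2); in 2D the classes can be split by a horizontal line.
X2 = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear").fit(X2, y)
print("training accuracy after the transform:", clf.score(X2, y))  # expect 1.0
```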
Kernel Functions
- Transform data to a higher dimension where it becomes linearly separable.
- Challenge: Finding the right transformation function and performing the transformation can be computationally expensive, especially with high-dimensional data (e.g., images).
The Kernel Trick
- A method to compute the dot product (relationship) between points in the transformed space without explicitly transforming the points.
- Avoids the computational cost of direct transformation.
- Kernel functions take vectors from the original dimension and provide the dot product of these vectors in the feature space.
- K(vi, vj) gives the relationship (dot product) between vectors vi and vj in the transformed (feature) space.
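In symbols (a sketch; φ denotes the implicit transformation into the feature space):

```latex
% A kernel returns the feature-space dot product without computing \varphi explicitly.
K(v_i, v_j) \;=\; \varphi(v_i) \cdot \varphi(v_j)

% Example: the degree-2 polynomial kernel
K(v_i, v_j) \;=\; (v_i \cdot v_j + 1)^2
```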
Example: Polynomial Kernel
- The polynomial kernel avoids the expensive calculations needed to transform every point explicitly (a small numerical check is sketched below).
- If the dataset is small enough, it is also possible to transform the data explicitly and apply a linear SVM.
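A small numerical check of that shortcut (a sketch: for scalars, the degree-2 polynomial kernel (ab + 1)² matches the dot product under the explicit feature map φ(x) = (1, √2·x, x²)):

```python
import numpy as np

def poly_kernel(a: float, b: float) -> float:
    """Degree-2 polynomial kernel on scalars: (a*b + 1)^2."""
    return (a * b + 1) ** 2

def phi(x: float) -> np.ndarray:
    """Explicit feature map whose dot product reproduces the kernel."""
    return np.array([1.0, np.sqrt(2) * x, x ** 2])

a, b = 2.0, 3.0
print(poly_kernel(a, b))   # (2*3 + 1)^2 = 49
print(phi(a) @ phi(b))     # 1 + 2*6 + 36 = 49 -> same value, no explicit transform needed
```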
Choosing Points for Comparison
- During optimization, training points are compared pair by pair (through their dot products or kernel values) to find the maximum-margin hyperplane; the dual form below makes this explicit.
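The standard (soft-margin) dual form of the SVM optimization, shown as a sketch, where every pair of training points enters only through a kernel value and C is the regularization parameter:

```latex
\max_{\alpha}\; \sum_i \alpha_i
  \;-\; \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0
```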
Practical Examples and Implementation
Iris Dataset
- A classic dataset with three species of iris flowers (setosa, versicolor, virginica).
- The goal is to classify the flowers based on their features.
Implementation Details
- SVC: Support Vector Classification (used for classification problems).
- SVR: Support Vector Regression (used for regression problems).
- Common Kernel Functions:
- Linear: For linearly separable data.
- RBF (Radial Basis Function): a common default choice for SVMs; radial basis functions are also used in neural networks as activation functions.
- Polynomial.
Gamma Parameter
- Controls the complexity of the decision boundary.
- A higher gamma gives each training point a smaller radius of influence, producing a more rugged, complex boundary that can overfit.
- Finding the right gamma involves experimentation using cross-validation (see the sketch below).
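A rough sketch of picking gamma by cross-validation on the iris data; the candidate values are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try a few candidate gammas and compare 5-fold cross-validation accuracy.
for gamma in (0.001, 0.01, 0.1, 1, 10):
    scores = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5)
    print(f"gamma={gamma:>6}: mean CV accuracy = {scores.mean():.3f}")
```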
Code Example
- Creating SVMs with different kernel functions produces different decision regions (see the sketch below).
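A minimal sketch of that idea on the iris dataset: fit an SVC with each kernel and compare held-out accuracy (the split and default parameters are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Same data, different kernels -> different decision regions and accuracies.
for kernel in ("linear", "rbf", "poly"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(f"{kernel:>6} kernel: test accuracy = {clf.score(X_test, y_test):.3f}")
```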
Faces Dataset
- High-dimensional data set.
- Each image contains nearly 3,000 pixels.
- Dimension reduction techniques are used.
PCA (Principal Component Analysis)
- Reduces the number of dimensions (e.g., from 3,000 pixels to 150 components).
- Speeds up training and can improve results.
Pipeline
- Use a pipeline to reduce dimensions (PCA) before applying the SVM.
- Keep separate training and test sets (see the sketch below).
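A sketch of such a pipeline on scikit-learn's LFW faces dataset; the 150 components come from the notes above, while the other settings (min_faces_per_person, whitening, random seeds) are assumptions:

```python
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Each face image has roughly 3,000 pixels; PCA compresses them to 150 components.
faces = fetch_lfw_people(min_faces_per_person=60)
X_train, X_test, y_train, y_test = train_test_split(
    faces.data, faces.target, random_state=42)

model = make_pipeline(
    PCA(n_components=150, whiten=True, random_state=42),
    SVC(kernel="rbf"),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```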
Model Selection
- Use tools like Grid Search.
- Specify candidate values for the regularization parameter and gamma (see the sketch below).
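A sketch of a grid search over the regularization parameter C and gamma; the candidate values are arbitrary, and `model`, `X_train`, `y_train` are the pipeline and split from the previous sketch:

```python
from sklearn.model_selection import GridSearchCV

# Parameter names are prefixed with the pipeline step name ("svc__").
param_grid = {
    "svc__C": [1, 5, 10, 50],
    "svc__gamma": [0.0001, 0.0005, 0.001, 0.005],
}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)

print("best parameters:", grid.best_params_)
best_model = grid.best_estimator_
```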
Evaluation
- Evaluate the whole process, not just the final classification accuracy.
- Learn how to read the output (e.g., per-class metrics) and judge whether it is good or bad.
- Experiment with the parameters to see if you can get a better result (an evaluation sketch follows).
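One way to read the output beyond a single accuracy number (a sketch, continuing from the fitted `best_model` and data of the previous sketches):

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = best_model.predict(X_test)

# Per-class precision/recall/F1 show which people are confused with each other.
print(classification_report(y_test, y_pred, target_names=faces.target_names))
print(confusion_matrix(y_test, y_pred))
```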
Key Takeaways
- SVMs use the kernel trick to work in high-dimensional spaces without explicitly transforming the data.
- They find maximum-margin boundaries that separate the classes.
- Important parameters to keep in mind:
- Regularization: Allows for misclassification, avoids overfitting, and improves generalization.
- Gamma: Controls the complexity of decision boundaries.