Dimensionality Reduction and Embedding Visualization
Dimensionality Reduction
- In real-world data, datasets often have a high number of features (dimensions).
- Example: An Iris dataset might have features like petal length, petal width, and sepal width.
- Real-world datasets can have hundreds or thousands of features.
Problems with High Dimensionality
- Computational Complexity: Too much data to handle.
- Presence of Noise: Irrelevant features can be present.
- Difficulty in Understanding: Hard to determine which features are truly relevant.
- Visualization: Inability to visualize data beyond 3 dimensions.
Dimensionality Reduction Techniques
- Used to reduce the number of features in a dataset.
- Aims to retain as much important information as possible.
Principal Component Analysis (PCA)
- Published in 1901 by Karl Pearson.
- A linear technique.
- Finds a linear projection of high-dimensional data onto a lower-dimensional subspace.
- Maximizes the variance of the data (preserves as much information as possible).
Basic Idea
- Combines the original features into new variables called principal components (PCs), each a weighted sum of the features.
- Form of: PC = (feat1 * 20) + (feat2 * -2) + (feat3 * 0) + …
Intuition
- Example: Linear relationship between weight (wt) and miles per gallon (mpg).
- If data points fit tightly around a line, it's reasonable to summarize the data with one dimension similar to that line.
Change in Basis
- End up with a new coordinate system where axes are eigenvectors of the covariance matrix.
- Rotate coordinate axes so the covariance matrix looks diagonal (variables are uncorrelated in the new basis).
Process
- PCA fits the dataset to a new coordinate system.
- Most significant variance is found on the first coordinate.
- Each subsequent coordinate is orthogonal to the previous ones and captures less variance.
- In practice:
- Transforms a set of correlated variables into a set of uncorrelated principal components.
Steps to Perform PCA
- Form the data matrix X containing the data.
- Calculate the covariance matrix S based on X.
- Solve S e = λ e for the eigenvectors e and eigenvalues λ.
- Compute P = X E, where E collects the eigenvectors, to obtain the principal components (N PCs).
- Transforms a large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
- Captures redundancy in the dataset.
- Eigenvector: A direction.
- Eigenvalue: Tells us how much variance there is in the data in that direction.
- X is of size K × N (K samples, N features).
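A minimal NumPy sketch of these steps on a toy 2-D dataset (the data matrix and its shape are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data matrix X: 200 samples (rows) x 2 correlated features (columns).
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

# Center the data (PCA assumes zero-mean features).
Xc = X - X.mean(axis=0)

# Covariance matrix S.
S = np.cov(Xc, rowvar=False)

# Solve S e = lambda e for eigenvalues/eigenvectors.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the eigenvectors -> principal components.
P = Xc @ eigvecs

# The PCs are uncorrelated: the off-diagonal covariance is ~0.
print(np.round(np.cov(P, rowvar=False), 6))
```

The printed covariance matrix of the components is (numerically) diagonal, which is exactly the "change of basis" described above.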
Variance Retention
- Amount of variance retained by each principal component is measured by the eigenvalue.
- Useful when variables within the dataset are highly correlated.
- Correlation indicates redundancy in the data.
- PCA reduces original variables into a smaller number of new variables (principal components) explaining most of the variance.
Component Variance Explanation
- PC1 explains more variance than any other component.
- PC2 explains more than every other except PC1.
- PC3 explains more than every other except PC1 and PC2.
- Example: If first 3 components explain 55% (40+13+2) of variance, take those 3 to approximate the dataset.
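With scikit-learn, the per-component variance shares can be read directly from `explained_variance_ratio_` (shown here on the Iris dataset; the 95% threshold is just an example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples x 4 features
pca = PCA().fit(X)

# Fraction of total variance explained by each component, in decreasing order.
print(np.round(pca.explained_variance_ratio_, 3))

# Cumulative variance: how many PCs are needed to retain e.g. 95%?
cum = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cum >= 0.95) + 1
print(n_components)
```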
Advantages of PCA
- Removes multicollinearity (correlation between features).
- Removes noise from data.
- Removes redundant features.
Disadvantages of PCA
- Time complexity is O(nd^2 + d^3), where d is the number of dimensions and n is the number of word embeddings.
- Absence of interpretability for the new dimensions.
- PCA fits well with simple data, being a linear reduction technique, but real-world word embeddings may not be linear.
- Outliers highly affect the results.
- Word embedding matrix is already a low-rank representation.
t-distributed Stochastic Neighbor Embedding (t-SNE)
- GOAL: Given a set of N high-dimensional objects x1, x2, …, xN, t-SNE computes probabilities p_ij that are proportional to the similarity of objects xi and xj.
- It then reconstructs a lower-dimensional representation y1, y2, …, yN that reflects the similarities p_ij as well as possible, using similarities q_ij computed in the lower-dimensional space.
Two Main Steps
- Calculate similarity probabilities in the high-dimensional space:
- For each point, calculate the probability that another point is its neighbor using a Gaussian distribution.
- If two points are close in the original space, they will have a higher probability of being considered neighbors.
- Project into a low-dimensional space (2D or 3D):
- Assign the points of the new representation random initial coordinates.
- Calculate a new similarity probability between the points based on a Student-t distribution (which has heavier tails than the Gaussian, helping to separate the clusters better).
- Iteratively adjust the positions of the points to minimize the difference between the similarity probabilities in the original and the reduced space.
Detailed Steps
Computing Pairwise Similarities in the High-Dimensional Space:
- For each point i, the conditional probability that point j is a neighbor is calculated as:
  p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)
- where σ_i is an adaptive bandwidth, chosen per point to match a target perplexity.
Computing Pairwise Similarities in the Low-Dimensional Space:
- For each point i, the similarity is computed using a Student-t distribution with one degree of freedom:
  q_{ij} = (1 + ‖y_i − y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k − y_l‖²)⁻¹
Minimize the Kullback-Leibler (KL) divergence between p_{ij} and q_{ij}:
  KL(P‖Q) = Σ_{i≠j} p_{ij} log(p_{ij} / q_{ij})
Updating Points via Gradient Descent
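A minimal NumPy sketch of the two similarity computations, using a fixed bandwidth sigma instead of the perplexity-calibrated per-point bandwidth of real t-SNE:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 10))   # 5 points in a 10-D space
Y = rng.normal(size=(5, 2))    # their (random initial) 2-D embedding
sigma = 1.0                    # fixed bandwidth; real t-SNE tunes sigma_i per point

def high_dim_p(X, sigma):
    """Gaussian conditional similarities p_{j|i}, symmetrized to p_ij."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    P = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)                # a point is not its own neighbor
    P /= P.sum(axis=1, keepdims=True)       # p_{j|i}: each row is a distribution
    return (P + P.T) / (2 * len(X))         # symmetric joint p_ij

def low_dim_q(Y):
    """Student-t (1 degree of freedom) similarities q_ij in the embedding."""
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    Q = 1.0 / (1.0 + d2)                    # heavy tails separate clusters better
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

P, Q = high_dim_p(X, sigma), low_dim_q(Y)

# KL divergence that gradient descent minimizes (over off-diagonal pairs).
mask = ~np.eye(len(X), dtype=bool)
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(kl >= 0)  # True: KL divergence is non-negative
```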
Basic Idea
- Convert similarities between data points into probabilities, then map the points into a lower-dimensional space.
- In the full high-dimensional space:
  - Compute pairwise similarities between data points x_i, x_j.
  - Center a Gaussian over each embedding vector x_i.
  - Measure the density of all the other points (e.g. x_j) under this Gaussian.
  - Renormalize to obtain a coherent probability distribution p_{j|i} over pairs of points, where the probability of picking a given pair is proportional to their similarity.
  - For each data point we thus get a distribution representing its similarities to all the other points.
- Randomly generate an embedding space and assign each data point an initial position in the low-dimensional space.
- As in the high-dimensional space, compute similarities between data points in the low-dimensional space.
- Repeat the last two steps, adjusting positions to minimize the divergence between the pairwise similarities in the high-dimensional space and those in the low-dimensional space.
Algorithm
- Simple version of t-Distributed Stochastic Neighbor Embedding.
Data
- Data set X = {x_1, x_2, …, x_n}.
Cost Function Parameters
- Perplexity Perp.
Optimization Parameters
- Number of iterations T, learning rate η, momentum α(t).
Result
- Low-dimensional data representation Y^(T) = {y_1, y_2, …, y_n}.
Process
- Compute pairwise affinities p_{j|i} with perplexity Perp (using Equation 1).
- Set p_{ij} = (p_{j|i} + p_{i|j}) / (2n).
- Sample initial solution Y^(0) = {y_1, y_2, …, y_n} from N(0, 10⁻⁴ I).
- For t = 1 to T do:
  - Compute low-dimensional affinities q_{ij} (using Equation 4).
  - Compute the gradient ∂C/∂Y (using Equation 5).
  - Set Y^(t) = Y^(t−1) + η ∂C/∂Y + α(t) (Y^(t−1) − Y^(t−2)).
- End for.
Parameter Tuning
- The main parameter is perplexity (a measure of uncertainty in the value of a sample from a discrete probability distribution):
- It acts like a guess about the number of close neighbors each point has;
- A higher perplexity makes each point treat more of the surrounding points of similar characteristics as neighbors.
- In general, typical values range between 5 and 50.
- Getting the most from t-SNE may mean analyzing multiple plots with different perplexities.
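A quick way to do this with scikit-learn's `TSNE` (the dataset, subset size, and perplexity values are just for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:100]   # small subset so the example runs quickly

# Each perplexity gives a different view of the same data; comparing the
# resulting scatter plots is part of using t-SNE well.
for perp in (5, 30):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    print(perp, emb.shape)     # 2-D coordinates, ready for plotting
```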
Advantages of t-SNE
- Good at preserving local structure of the data
- Good at visualization
- More robust to outliers
- It captures nonlinear relationships (a non-parametric method that makes no assumptions about the distribution)
Disadvantages of t-SNE
- Still computationally expensive, especially with many data points
- Sensitive to the choice of parameters (see perplexity)
- Non-deterministic
Uniform Manifold Approximation and Projection (UMAP)
- Conceptually very similar to t-SNE but with small improvements:
- A bit quicker than t-SNE;
- Preserves more global structure;
- We can add new data to existing projection
- Instead of the perplexity we have:
- Nearest neighbours: number of expected nearest neighbours (almost as perplexity)
- Minimum distance: how tightly UMAP packs points which are close together
Core Idea
- Imagine you have a large dataset representing words or documents in a high-dimensional space, where each dimension corresponds to a specific feature or aspect of the data.
- UMAP works by preserving both local and global structure in our data, much like t-SNE.
- Think of our word embedding matrix as a complex landscape with hills, valleys, and plateaus.
- Each word embedding in our dataset represents a point in this landscape.
- Goal: Flatten this landscape onto a 2D or 3D map while preserving distances between points as accurately as possible.
- UMAP imagines a rubber sheet stretched over your landscape:
- It tries to represent each point (word embeddings) on this rubber sheet such that nearby points stay close together and distant points remain far apart.
- UMAP uses cross-entropy instead of the KL divergence during the optimization phase, being able to attract and repulse points from each other.
- With this choice of loss function, placing objects that are far apart in the high-dimensional space close together in the low-dimensional space is penalized
Steps
- Computing Local Similarity
  - For each point, UMAP identifies a set of neighbors using a distance metric (e.g., Euclidean).
  - The n_neighbors parameter controls how many neighbors are considered.
- Computing the Probability of Connection
  - For each pair of points i and j, a weight is assigned to their connection based on their distance.
  - This weight follows a function similar to a Gaussian distribution, but adapted for each point.
- Building a Graph
  - All weighted connections are combined into a graph, where probabilities indicate how likely two points are to be close in the original structure.
  - Once the graph is built in the original data space, UMAP constructs a similar graph in the lower-dimensional space and tries to make it as similar as possible to the original.
- Random Initialization of Points in the Reduced Space
  - Points are placed randomly in the new space.
- Optimization via Gradient Descent
  - UMAP applies gradient descent to adjust the positions of points in the new space so that the graph structure remains as similar as possible.
  - The optimization minimizes a cross-entropy loss (rather than the KL divergence used by t-SNE).
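The local-similarity and graph-building steps can be sketched with NumPy. This uses a fixed bandwidth sigma for brevity, whereas UMAP calibrates sigma_i per point based on n_neighbors:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))   # 30 points in a 5-D space
n_neighbors = 5
sigma = 1.0                    # fixed here; UMAP tunes sigma_i per point

# Pairwise Euclidean distances.
d = np.sqrt(np.square(X[:, None, :] - X[None, :, :]).sum(-1))

weights = np.zeros_like(d)
for i in range(len(X)):
    # The n_neighbors nearest points (excluding the point itself).
    nn = np.argsort(d[i])[1:n_neighbors + 1]
    rho = d[i, nn[0]]  # distance to the closest neighbor
    # Exponentially decaying membership strength, shifted by rho so the
    # nearest neighbor always gets weight 1.
    weights[i, nn] = np.exp(-(d[i, nn] - rho) / sigma)

# Symmetrize with a fuzzy union: w_ij = w_j|i + w_i|j - w_j|i * w_i|j.
# W is the weighted graph that the low-dimensional layout tries to match.
W = weights + weights.T - weights * weights.T
print(W.min() >= 0.0, W.max() <= 1.0)  # all edge weights stay in [0, 1]
```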
UMAP Parameters
- Nearest neighbours:
- As n_neighbors increases, UMAP connects more and more neighboring points when constructing the graph representation of the high-dimensional data, which leads to a projection that more accurately reflects the global structure.
- At very low values, any notion of global structure is almost completely lost.
UMAP vs t-SNE
- There is never a single best approach.
- The choice of parameters depends heavily on the data selected.
- UMAP suffers from “containment”
Comparison of PCA, t-SNE, and UMAP
| Description | PCA | t-SNE | UMAP |
|---|---|---|---|
| Linear vs nonlinear | Linear: finds orthogonal axes that explain the maximum variance | Nonlinear: aims to preserve both local and global structure in the data, usually better for complex relationships | Nonlinear: aims to preserve both local and global structure in the data, usually better for complex relationships |
| Global vs local | Preserves more of the global structure of the data, capturing overall variance | Preserves more local structure, emphasizing relationships between nearby data points; generally better at revealing clusters and patterns | Preserves more local structure, emphasizing relationships between nearby data points; generally better at revealing clusters and patterns |
| Time complexity | Fastest of the three when there is a lot of data | More computationally expensive for large datasets | More computationally expensive for large datasets, but designed to be somewhat more scalable |
| Interpretability | Provides interpretable results as linear combinations of the initial features | Produces embeddings in a lower-dimensional space that are not directly interpretable in terms of the original features | Produces embeddings in a lower-dimensional space that are not directly interpretable in terms of the original features |
| Parameter tuning | – | Perplexity, learning rate, iterations | N. neighbors, min. distance, various metrics |
| Commonly used for | Dimensionality reduction, feature selection, and noise reduction | Data exploration and visualization, revealing complex structures and clusters | Data exploration and visualization, revealing complex structures and clusters |
Embedding Visualization
- Embeddings are internal representations of models
- These can be used by models but are difficult for humans to understand
- In general, they are vectors, i.e., lists of numbers that can be represented in Cartesian coordinates
- Example:
- 🙋: “The cat sleeps on the floor”
- 🤖: [1.263, 8.223, 0.198, …, -13.249, 221.149]
- The concept is very similar to a high-dimensional dataset, so dimensionality reduction can be applied!
Advantages of Dimensionality Reduction for Embedding Visualization
- Fewer features mean lower computational requirements when using a model;
- Possibility to visualize the embeddings in the embedding space, by reducing the dimensions to 3 or fewer.
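A minimal sketch of this idea, using random vectors as stand-ins for real embeddings and PCA for the reduction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Stand-in for real sentence embeddings: 50 vectors of dimension 384
# (a common output size for small sentence-embedding models).
embeddings = rng.normal(size=(50, 384))

# Reduce to 2 dimensions for plotting.
coords = PCA(n_components=2).fit_transform(embeddings)
print(coords.shape)  # (50, 2)

# coords[:, 0] and coords[:, 1] can now be fed to any 2-D scatter plot,
# with each point representing one original sentence/word embedding.
```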