Dimensionality Reduction and Embedding Visualization

Dimensionality Reduction

  • In real-world data, datasets often have a high number of features (dimensions).
  • Example: An Iris dataset might have features like petal length, petal width, and sepal width.
  • Real-world datasets can have hundreds or thousands of features.

Problems with High Dimensionality

  • Computational Complexity: Too much data to handle.
  • Presence of Noise: Irrelevant features can be present.
  • Difficulty in Understanding: Hard to determine which features are truly relevant.
  • Visualization: Inability to visualize data beyond 3 dimensions.

Dimensionality Reduction Techniques

  • Used to reduce the number of features in a dataset.
  • Aims to retain as much important information as possible.

Principal Component Analysis (PCA)

  • Published by Karl Pearson in 1901.
  • A linear technique.
  • Finds a linear projection of high-dimensional data onto a lower-dimensional subspace.
  • Maximizes the variance of the data (preserves as much information as possible).

Basic Idea

  • Combines the original features into a small number of new variables, each called a principal component (PC).
  • Each PC is a weighted sum of the features, of the form: PC = (feat1 * 20) + (feat2 * -2) + (feat3 * 0) + …

Intuition

  • Example: Linear relationship between weight (wt) and miles per gallon (mpg).
  • If data points fit tightly around a line, it's reasonable to summarize the data with one dimension similar to that line.

Change in Basis

  • End up with a new coordinate system where axes are eigenvectors of the covariance matrix.
  • Rotate coordinate axes so the covariance matrix looks diagonal (variables are uncorrelated in the new basis).

Process

  • PCA fits the dataset to a new coordinate system.
    • Most significant variance is found on the first coordinate.
    • Each subsequent coordinate is orthogonal to the previous ones and captures less variance.
  • In practice:
    • Transforms a set of correlated variables into a set of uncorrelated principal components.

Steps to Perform PCA

  1. Form the data matrix X (of size K x N: K samples, N features) containing the data.
  2. Calculate the covariance matrix S based on X.
  3. Solve Se = \lambda e for the eigenvectors e and eigenvalues \lambda.
  4. Compute P = Xe to obtain the principal components (N PCs).
  • Transforms a large number of correlated variables into a smaller number of uncorrelated variables called principal components (PCs).
  • Captures redundancy in the dataset.
    • Eigenvector: a direction in feature space.
    • Eigenvalue: how much variance the data has along that direction.
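The four steps above can be sketched directly in NumPy; this is a minimal illustration on synthetic data (the data, sizes, and variable names are all made up for the example), not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Form the data matrix X (K samples x N features), centered per feature
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)

# 2. Calculate the covariance matrix S based on X
S = np.cov(X, rowvar=False)

# 3. Solve Se = lambda*e for eigenvectors e and eigenvalues (eigh: S is symmetric)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Compute P = Xe: the principal components
P = X @ eigvecs

# The PCs are uncorrelated: their covariance matrix is (near-)diagonal,
# with the eigenvalues on the diagonal
cov_P = np.cov(P, rowvar=False)
print(np.allclose(cov_P, np.diag(eigvals)))  # True
```

The diagonal covariance of P is exactly the "change in basis" described earlier: in the eigenvector basis the variables are uncorrelated.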

Variance Retention

  • Amount of variance retained by each principal component is measured by the eigenvalue.
  • Useful when variables within the dataset are highly correlated.
  • Correlation indicates redundancy in the data.
  • PCA reduces original variables into a smaller number of new variables (principal components) explaining most of the variance.

Component Variance Explanation

  • PC1 explains more variance than any other component.
  • PC2 explains more than every other except PC1.
  • PC3 explains more than every other except PC1 and PC2.
  • Example: If first 3 components explain 55% (40+13+2) of variance, take those 3 to approximate the dataset.
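This per-component accounting is what scikit-learn exposes as `explained_variance_ratio_`; a small sketch on synthetic redundant data (the 95% threshold and data shapes are illustrative choices):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with redundancy: 6 observed features built from 2 latent factors
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 6))

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_   # sorted: PC1 >= PC2 >= ...
cumulative = np.cumsum(ratios)

# Keep the smallest number of PCs that explain, say, 95% of the variance
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(k, cumulative[:k])
```

Because the six features come from only two latent factors, the cumulative ratio saturates after very few components.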

Advantages of PCA

  • Removes multicollinearity (correlation between features).
  • Removes noise from data.
  • Removes redundant features.

Disadvantages of PCA

  • Time complexity is O(nd^2 + d^3), where d is the number of dimensions and n is the number of data points (e.g., word embeddings).
  • Absence of interpretability for the new dimensions.
  • PCA fits well with simple data, being a linear reduction technique, but real-world word embeddings may not be linear.
  • Outliers highly affect the results.
  • Word embedding matrix is already a low-rank representation.

t-distributed Stochastic Neighbor Embedding (t-SNE)

  • GOAL: Given a set of N high-dimensional objects x_1, x_2, …, x_N, t-SNE computes probabilities p_{ij} that are proportional to the similarity of objects x_i and x_j.
  • It then constructs a lower-dimensional representation y_1, y_2, …, y_N whose pairwise similarities reflect the p_{ij} as well as possible.

Two Main Steps

  1. Calculate similarity probabilities in the high-dimensional space:
    • For each point, calculate the probability that another point is its neighbor using a Gaussian distribution.
    • If two points are close in the original space, they will have a higher probability of being considered neighbors.
  2. Project into a low-dimensional space (2D or 3D):
    • Assign the points of the new representation random initial coordinates.
    • Calculate a new similarity probability between the points based on a Student-t distribution (which has heavier tails than the Gaussian, helping to separate the clusters better).
    • Iteratively adjust the positions of the points to minimize the difference between the similarity probabilities in the original and the reduced space.
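Both steps are handled internally by scikit-learn's `TSNE`; a minimal usage sketch on synthetic clustered data (the parameter values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters in 20 dimensions
X = np.vstack([rng.normal(0, 1, size=(40, 20)),
               rng.normal(8, 1, size=(40, 20))])

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
Y = tsne.fit_transform(X)   # low-dimensional coordinates, one row per point
print(Y.shape)              # (80, 2)
```

The returned coordinates can be scatter-plotted directly; with clearly separated input clusters, the 2D map shows two groups.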

Detailed Steps

  1. Computing Pairwise Similarities in the High-Dimensional Space:

    • For each point i, the conditional probability that point j is a neighbor is calculated as:
      p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}
    • where \sigma_i is an adaptive bandwidth, chosen per point so that the conditional distribution has a fixed perplexity
  2. Computing Pairwise Similarities in the Low-Dimensional Space:

    • For each point i, the similarity is computed using a Student-t distribution with one degree of freedom:
      q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}
  3. Minimize the Kullback-Leibler (KL) divergence between p_{ij} and q_{ij}:
     C = KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

  4. Updating Points via Gradient Descent

Basic Idea

  • Convert similarities between data points into probabilities, then map the points into a lower-dimensional space.
  1. Compute pairwise similarities between data points x_i, x_j in the full high-dimensional space:
    • The embedding vector x_i lives in the high-dimensional space.
    • We center a Gaussian over x_i.
    • We measure the density of all the other points (e.g. x_j) under this Gaussian.
    • We renormalize to obtain a coherent probability distribution p_{j|i} over pairs of points, where the probability of picking a given pair is proportional to their similarity.
    • For each data point we thus get a distribution representing its similarities to all the other points in the high-dimensional space.
  2. Randomly generate an embedding and assign each data point an initial position in the low-dimensional space.
  3. As in the high-dimensional space, compute similarities between the data points in the low-dimensional space.
  4. Iteratively adjust the positions and recompute the low-dimensional similarities, minimizing the divergence between the pairwise similarities in the high-dim. space and those in the low-dim. space.

Algorithm

  • Simple version of t-Distributed Stochastic Neighbor Embedding.

Data

  • Data set X = {x_1, x_2, …, x_n}.

Cost Function Parameters

  • Perplexity Perp.

Optimization Parameters

  • Number of iterations T, learning rate \eta, momentum \alpha(t).

Result

  • Low-dimensional data representation Y^{(T)} = {y_1, y_2, …, y_n}.

Process

begin
  compute pairwise affinities p_{j|i} with perplexity Perp (using Equation 1)
  set p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
  sample initial solution Y^{(0)} = {y_1, y_2, …, y_n} from \mathcal{N}(0, 10^{-4} I)
  for t = 1 to T do
    compute low-dimensional affinities q_{ij} (using Equation 4)
    compute gradient \frac{\delta C}{\delta Y} (using Equation 5)
    set Y^{(t)} = Y^{(t-1)} + \eta \frac{\delta C}{\delta Y} + \alpha(t) (Y^{(t-1)} - Y^{(t-2)})
  end
end

Parameter Tuning

  • Main parameter is perplexity (a measure of uncertainty in the value of a sample from a discrete probability distribution):
    • It acts as a smooth estimate of the effective number of neighbors of each point;
    • It's like a guess about the number of close neighbors each point has.
    • Typical values range between 5 and 50.
    • Getting the most from t-SNE may mean analyzing multiple plots with different perplexities.
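Comparing plots at several perplexities is easy to script; a sketch on synthetic three-cluster data (the perplexity values and data shapes are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three clusters of 30 points each in 10 dimensions
X = np.vstack([rng.normal(c, 1, size=(30, 10)) for c in (0, 5, 10)])

embeddings = {}
for perp in (5, 30, 50):   # spanning the typical 5-50 range
    tsne = TSNE(n_components=2, perplexity=perp, random_state=0)
    embeddings[perp] = tsne.fit_transform(X)

# Each entry can now be scatter-plotted side by side to compare the layouts
print({p: e.shape for p, e in embeddings.items()})
```

Low perplexities tend to fragment the map into many small groups; higher values emphasize broader structure.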

Advantages of t-SNE

  • Good at preserving local structure of the data
  • Good at visualization
  • More robust to outliers
  • It captures nonlinear relationships (non-parametric method that makes no assumptions about the data distribution)

Disadvantages of t-SNE

  • Still computationally expensive, especially with many data points
  • Sensitive to the choice of parameters (e.g., perplexity)
  • Non-deterministic: different runs can produce different results

Uniform Manifold Approximation and Projection (UMAP)

  • Conceptually very similar to t-SNE but with small improvements:
    • A bit quicker than t-SNE;
    • Preserves more global structure;
    • New data can be added to an existing projection
  • Instead of the perplexity we have:
    • Nearest neighbours: number of expected nearest neighbours (almost as perplexity)
    • Minimum distance: how tightly UMAP packs points which are close together

Core Idea

  • Imagine you have a large dataset representing words or documents in a high-dimensional space, where each dimension corresponds to a specific feature or aspect of the data.
  • UMAP works by preserving both local and global structure in our data, much like t-SNE.
  • Think of our word embedding matrix as a complex landscape with hills, valleys, and plateaus.
  • Each word embedding in our dataset represents a point in this landscape.
  • Goal: Flatten this landscape onto a 2D or 3D map while preserving distances between points as accurately as possible.
  • UMAP imagines a rubber sheet stretched over your landscape:
    • It tries to represent each point (word embeddings) on this rubber sheet such that nearby points stay close together and distant points remain far apart.
  • UMAP uses cross-entropy instead of the KL divergence during the optimization phase, allowing it both to attract and to repel points.
    • With this choice of loss function, placing objects that are far apart in the high-dim. space close together in the low-dim. space is penalized

Steps

  1. Computing Local Similarity
    • For each point, UMAP identifies a set of neighbors using a distance metric (e.g., Euclidean).
      • The n_neighbors parameter controls how many neighbors are considered.
  2. Computing the Probability of Connection
    • For each pair of points i and j, a weight is assigned to their connection based on their distance.
      • This weight follows a function similar to a Gaussian distribution, but adapted for each point.
  3. Building a Graph
    • All weighted connections are combined into a graph, where probabilities indicate how likely two points are to be close in the original structure.
  • Once the graph is built in the original data space, UMAP constructs a similar graph in the lower-dimensional space and tries to make it as similar as possible to the original.
  4. Random Initialization of Points in the Reduced Space
    • Points are placed randomly in the new space.
  5. Optimization via Gradient Descent
    • UMAP applies gradient descent to adjust the positions of the points in the new space so that the graph structure remains as similar as possible.
      • The optimization minimizes a cross-entropy loss between the high- and low-dimensional graphs
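The weighted-graph construction of steps 1–3 can be illustrated in NumPy. This is a simplified sketch of UMAP-style edge weights: it uses a fixed bandwidth sigma instead of UMAP's per-point calibration, and the fuzzy-union symmetrization a + b - a*b:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 8))
n_neighbors = 5   # the n_neighbors parameter from step 1

# Step 1: pairwise Euclidean distances (self-distances excluded)
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)

# rho_i = distance to the nearest neighbor (UMAP's local offset)
rho = d.min(axis=1)

# Step 2: directed weights, Gaussian-like decay beyond the nearest neighbor;
# real UMAP tunes sigma_i per point, here sigma is fixed (simplification)
sigma = 1.0
w = np.exp(-np.maximum(d - rho[:, None], 0.0) / sigma)

# Keep only each point's n_neighbors nearest connections (others -> weight 0)
idx = np.argsort(d, axis=1)[:, n_neighbors:]
w[np.arange(len(X))[:, None], idx] = 0.0

# Step 3: combine directed weights into one symmetric graph (fuzzy union)
W = w + w.T - w * w.T
print(W.min() >= 0 and W.max() <= 1)   # True: edge weights stay in [0, 1]
```

The matrix W is the graph that the low-dimensional layout is then optimized to match.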

UMAP Parameters

  • Nearest neighbours:
    • As n_neighbors increases, UMAP connects more and more neighboring points when constructing the graph representation of the high-dimensional data, which leads to a projection that more accurately reflects the global structure.
    • At very low values, any notion of global structure is almost completely lost.

UMAP vs t-SNE

  • There is never a single best approach.
  • The choice of parameters depends heavily on the data selected.
  • UMAP suffers from “containment”

Comparison of PCA, t-SNE, and UMAP

| Description | PCA | t-SNE | UMAP |
|---|---|---|---|
| Linear vs nonlinear | Linear: finds orthogonal axes that explain the maximum variance | Nonlinear: aims to preserve both local and global structure; usually better for complex relationships | Nonlinear: aims to preserve both local and global structure; usually better for complex relationships |
| Global vs local | Preserves more of the global structure of the data, capturing overall variance | Preserves more local structure, emphasizing relationships between nearby points; generally better at revealing clusters and patterns | Preserves more local structure, emphasizing relationships between nearby points; generally better at revealing clusters and patterns |
| Time complexity | Fastest of the three when there is a lot of data | More computationally expensive for large datasets | More computationally expensive for large datasets, but designed to scale somewhat better |
| Interpretability | Interpretable results as linear combinations of the initial features | Transformed embeddings in a lower-dim. space, not directly interpretable in terms of the original features | Transformed embeddings in a lower-dim. space, not directly interpretable in terms of the original features |
| Parameter tuning | Number of components | Perplexity, learning rate, iterations | N. neighbors, min. distance, various metrics |
| Commonly used for | Dimensionality reduction, feature selection, and noise reduction | Data exploration and visualization, revealing complex structures and clusters | Data exploration and visualization, revealing complex structures and clusters |

Embedding Visualization

  • Embeddings are internal representations of models
  • These can be used by models but are difficult for humans to understand
  • In general, they are vectors, that is, a list of numbers that can be represented in Cartesian coordinates
    • Example:
      • 🙋: “The cat sleeps on the floor”
      • 🤖: [1.263, 8.223, 0.198, …, -13.249, 221.149]
  • The concept is very similar to a high dimensional dataset where dimensionality reduction can be applied!
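Treating a batch of embedding vectors as a dataset, any of the techniques above can project them to 2 or 3 dimensions for plotting; a sketch using PCA (random vectors stand in for real model embeddings, and the 384-dim size is an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for model output: 100 sentence embeddings of dimension 384
embeddings = rng.normal(size=(100, 384))

coords = PCA(n_components=2).fit_transform(embeddings)
print(coords.shape)   # (100, 2): one 2D point per embedding

# These coordinates are ready for a scatter plot, e.g. with matplotlib:
# plt.scatter(coords[:, 0], coords[:, 1])
```

Swapping `PCA` for `TSNE` (or UMAP) changes only the projection, not the plotting workflow.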

Advantages of Dimensionality Reduction for Embedding Visualization

  • Fewer features mean lower computational requirements when using a model;
  • Possibility to visualize the embeddings in the embedding space by reducing the dimensions to 3 or fewer.