Dimensionality Reduction and Embedding Visualization

Dimensionality Reduction

  • In real-world data, datasets often have a high number of features (dimensions).
  • Example: An Iris dataset might have features like petal length, petal width, and sepal width.
  • Real-world datasets can have hundreds or thousands of features.

Problems with High Dimensionality

  • Computational Complexity: Too much data to handle.
  • Presence of Noise: Irrelevant features can be present.
  • Difficulty in Understanding: Hard to determine which features are truly relevant.
  • Visualization: Inability to visualize data beyond 3 dimensions.

Dimensionality Reduction Techniques

  • Used to reduce the number of features in a dataset.
  • Aims to retain as much important information as possible.

Principal Component Analysis (PCA)

  • Published by Karl Pearson in 1901.
  • A linear technique.
  • Finds a linear projection of high-dimensional data onto a lower-dimensional subspace.
  • Maximizes the variance of the data (preserves as much information as possible).

Basic Idea

  • Combines the original features into a small number of new variables, each called a principal component (PC).
  • Each PC is a weighted sum of the features, of the form: PC = (feat1 * 20) + (feat2 * -2) + (feat3 * 0) + …

Intuition

  • Example: Linear relationship between weight (wt) and miles per gallon (mpg).
  • If data points fit tightly around a line, it's reasonable to summarize the data with one dimension similar to that line.

Change in Basis

  • End up with a new coordinate system where axes are eigenvectors of the covariance matrix.
  • Rotate coordinate axes so the covariance matrix looks diagonal (variables are uncorrelated in the new basis).

Process

  • PCA fits the dataset to a new coordinate system.
    • Most significant variance is found on the first coordinate.
    • Each subsequent coordinate is orthogonal to the previous ones and captures less variance.
  • In practice:
    • Transforms a set of correlated variables into a set of uncorrelated principal components.

Steps to Perform PCA

  1. Form the data matrix X (of size K x N: K samples, N features) containing the data.
  2. Calculate the covariance matrix S based on X.
  3. Solve Se = \lambda e for the eigenvectors e and eigenvalues \lambda.
  4. Compute P = Xe to obtain the principal components (N PCs).
  • Transforms a large number of correlated variables into a smaller number of uncorrelated variables called principal components (PCs).
  • Captures redundancy in the dataset.
    • Eigenvector: a direction in feature space.
    • Eigenvalue: how much variance the data has along that direction.
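The four steps above can be sketched directly in NumPy; this is a minimal illustration on synthetic data (the data, sizes, and variable names are all made up for the example), not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Form the data matrix X (K samples x N features), centered per feature
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)

# 2. Calculate the covariance matrix S based on X
S = np.cov(X, rowvar=False)

# 3. Solve Se = lambda*e for eigenvectors e and eigenvalues (eigh: S is symmetric)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Compute P = Xe: the principal components
P = X @ eigvecs

# The PCs are uncorrelated: their covariance matrix is (near-)diagonal,
# with the eigenvalues on the diagonal
cov_P = np.cov(P, rowvar=False)
print(np.allclose(cov_P, np.diag(eigvals)))  # True
```

The diagonal covariance of P is exactly the "change in basis" described earlier: in the eigenvector basis the variables are uncorrelated.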

Variance Retention

  • Amount of variance retained by each principal component is measured by the eigenvalue.
  • Useful when variables within the dataset are highly correlated.
  • Correlation indicates redundancy in the data.
  • PCA reduces original variables into a smaller number of new variables (principal components) explaining most of the variance.

Component Variance Explanation

  • PC1 explains more variance than any other component.
  • PC2 explains more than every other except PC1.
  • PC3 explains more than every other except PC1 and PC2.
  • Example: If first 3 components explain 55% (40+13+2) of variance, take those 3 to approximate the dataset.
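This per-component accounting is what scikit-learn exposes as `explained_variance_ratio_`; a small sketch on synthetic redundant data (the 95% threshold and data shapes are illustrative choices):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with redundancy: 6 observed features built from 2 latent factors
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 6))

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_   # sorted: PC1 >= PC2 >= ...
cumulative = np.cumsum(ratios)

# Keep the smallest number of PCs that explain, say, 95% of the variance
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(k, cumulative[:k])
```

Because the six features come from only two latent factors, the cumulative ratio saturates after very few components.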

Advantages of PCA

  • Removes multicollinearity (correlation between features).
  • Removes noise from data.
  • Removes redundant features.

Disadvantages of PCA

  • Time complexity is O(nd^2 + d^3), where d is the number of dimensions and n is the number of data points (e.g., word embeddings).
  • Absence of interpretability for the new dimensions.
  • PCA fits well with simple data, being a linear reduction technique, but real-world word embeddings may not be linear.
  • Outliers highly affect the results.
  • Word embedding matrix is already a low-rank representation.

t-distributed Stochastic Neighbor Embedding (t-SNE)

  • GOAL: Given a set of N high-dimensional objects x_1, x_2, …, x_N, t-SNE computes probabilities p_{ij} that are proportional to the similarity of objects x_i and x_j.
  • It then constructs a lower-dimensional representation y_1, y_2, …, y_N whose pairwise similarities reflect the p_{ij} as well as possible.

Two Main Steps

  1. Calculate similarity probabilities in the high-dimensional space:
    • For each point, calculate the probability that another point is its neighbor using a Gaussian distribution.
    • If two points are close in the original space, they will have a higher probability of being considered neighbors.
  2. Project into a low-dimensional space (2D or 3D):
    • Assign the points of the new representation random initial coordinates.
    • Calculate a new similarity probability between the points based on a Student-t distribution (which has heavier tails than the Gaussian, helping to separate the clusters better).
    • Iteratively adjust the positions of the points to minimize the difference between the similarity probabilities in the original and the reduced space.
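Both steps are handled internally by scikit-learn's `TSNE`; a minimal usage sketch on synthetic clustered data (the parameter values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters in 20 dimensions
X = np.vstack([rng.normal(0, 1, size=(40, 20)),
               rng.normal(8, 1, size=(40, 20))])

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
Y = tsne.fit_transform(X)   # low-dimensional coordinates, one row per point
print(Y.shape)              # (80, 2)
```

The returned coordinates can be scatter-plotted directly; with clearly separated input clusters, the 2D map shows two groups.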

Detailed Steps

  1. Computing Pairwise Similarities in the High-Dimensional Space:

    • For each point i, the conditional probability that point j is a neighbor is calculated as:
      p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}
    • where \sigma_i is an adaptive bandwidth, chosen per point so that the conditional distribution has a fixed perplexity
  2. Computing Pairwise Similarities in the Low-Dimensional Space:

    • For each point i, the similarity is computed using a Student-t distribution with one degree of freedom:
      q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}
  3. Minimize the Kullback-Leibler (KL) divergence between p_{ij} and q_{ij}:
     C = KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

  4. Updating Points via Gradient Descent

Basic Idea

  • Convert similarities between data points into probabilities, then map the points into a lower-dimensional space.
  1. Compute pairwise similarities between data points x_i, x_j in the full high-dimensional space:
    • The embedding vector x_i lives in the high-dimensional space.
    • We center a Gaussian over x_i.
    • We measure the density of all the other points (e.g. x_j) under this Gaussian.
    • We renormalize to obtain a coherent probability distribution p_{j|i} over pairs of points, where the probability of picking a given pair is proportional to their similarity.
    • For each data point we thus get a distribution representing its similarities to all the other points in the high-dimensional space.
  2. Randomly generate an embedding and assign each data point an initial position in the low-dimensional space.
  3. As in the high-dimensional space, compute similarities between the data points in the low-dimensional space.
  4. Iteratively adjust the positions and recompute the low-dimensional similarities, minimizing the divergence between the pairwise similarities in the high-dim. space and those in the low-dim. space.

Algorithm

  • Simple version of t-Distributed Stochastic Neighbor Embedding.

Data

  • Data set X = {x_1, x_2, …, x_n}.

Cost Function Parameters

  • Perplexity Perp.

Optimization Parameters

  • Number of iterations T, learning rate \eta, momentum \alpha(t).

Result

  • Low-dimensional data representation Y^{(T)} = {y_1, y_2, …, y_n}.

Process

begin
  compute pairwise affinities p_{j|i} with perplexity Perp (using Equation 1)
  set p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
  sample initial solution Y^{(0)} = {y_1, y_2, …, y_n} from \mathcal{N}(0, 10^{-4} I)
  for t = 1 to T do
    compute low-dimensional affinities q_{ij} (using Equation 4)
    compute gradient \frac{\delta C}{\delta Y} (using Equation 5)
    set Y^{(t)} = Y^{(t-1)} + \eta \frac{\delta C}{\delta Y} + \alpha(t) (Y^{(t-1)} - Y^{(t-2)})
  end
end

Parameter Tuning

  • Main parameter is perplexity (a measure of uncertainty in the value of a sample from a discrete probability distribution):
    • It acts as a smooth estimate of the effective number of neighbors of each point;
    • It's like a guess about the number of close neighbors each point has.
    • Typical values range between 5 and 50.
    • Getting the most from t-SNE may mean analyzing multiple plots with different perplexities.
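Comparing plots at several perplexities is easy to script; a sketch on synthetic three-cluster data (the perplexity values and data shapes are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three clusters of 30 points each in 10 dimensions
X = np.vstack([rng.normal(c, 1, size=(30, 10)) for c in (0, 5, 10)])

embeddings = {}
for perp in (5, 30, 50):   # spanning the typical 5-50 range
    tsne = TSNE(n_components=2, perplexity=perp, random_state=0)
    embeddings[perp] = tsne.fit_transform(X)

# Each entry can now be scatter-plotted side by side to compare the layouts
print({p: e.shape for p, e in embeddings.items()})
```

Low perplexities tend to fragment the map into many small groups; higher values emphasize broader structure.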

Advantages of t-SNE

  • Good at preserving local structure of the data
  • Good at visualization
  • More robust to outliers
  • It captures nonlinear relationships (non-parametric method that makes no assumptions about the data distribution)

Disadvantages of t-SNE

  • Still computationally expensive, especially with many data points
  • Sensitive to the choice of parameters (e.g., perplexity)
  • Non-deterministic: different runs can produce different results

Uniform Manifold Approximation and Projection (UMAP)

  • Conceptually very similar to t-SNE but with small improvements:
    • A bit quicker than t-SNE;
    • Preserves more global structure;
    • New data can be added to an existing projection
  • Instead of the perplexity we have:
    • Nearest neighbours: number of expected nearest neighbours (almost as perplexity)
    • Minimum distance: how tightly UMAP packs points which are close together

Core Idea

  • Imagine you have a large dataset representing words or documents in a high-dimensional space, where each dimension corresponds to a specific feature or aspect of the data.
  • UMAP works by preserving both local and global structure in our data, much like t-SNE.
  • Think of our word embedding matrix as a complex landscape with hills, valleys, and plateaus.
  • Each word embedding in our dataset represents a point in this landscape.
  • Goal: Flatten this landscape onto a 2D or 3D map while preserving distances between points as accurately as possible.
  • UMAP imagines a rubber sheet stretched over your landscape:
    • It tries to represent each point (word embeddings) on this rubber sheet such that nearby points stay close together and distant points remain far apart.
  • UMAP uses cross-entropy instead of the KL divergence during the optimization phase, allowing it both to attract and to repel points.
    • With this choice of loss function, placing objects that are far apart in the high-dim. space close together in the low-dim. space is penalized

Steps

  1. Computing Local Similarity
    • For each point, UMAP identifies a set of neighbors using a distance metric (e.g., Euclidean).
      • The n_neighbors parameter controls how many neighbors are considered.
  2. Computing the Probability of Connection
    • For each pair of points i and j, a weight is assigned to their connection based on their distance.
      • This weight follows a function similar to a Gaussian distribution, but adapted for each point.
  3. Building a Graph
    • All weighted connections are combined into a graph, where probabilities indicate how likely two points are to be close in the original structure.
  • Once the graph is built in the original data space, UMAP constructs a similar graph in the lower-dimensional space and tries to make it as similar as possible to the original.
  4. Random Initialization of Points in the Reduced Space
    • Points are placed randomly in the new space.
  5. Optimization via Gradient Descent
    • UMAP applies gradient descent to adjust the positions of the points in the new space so that the graph structure remains as similar as possible.
      • The optimization minimizes a cross-entropy loss between the high- and low-dimensional graphs
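The weighted-graph construction of steps 1–3 can be illustrated in NumPy. This is a simplified sketch of UMAP-style edge weights: it uses a fixed bandwidth sigma instead of UMAP's per-point calibration, and the fuzzy-union symmetrization a + b - a*b:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 8))
n_neighbors = 5   # the n_neighbors parameter from step 1

# Step 1: pairwise Euclidean distances (self-distances excluded)
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)

# rho_i = distance to the nearest neighbor (UMAP's local offset)
rho = d.min(axis=1)

# Step 2: directed weights, Gaussian-like decay beyond the nearest neighbor;
# real UMAP tunes sigma_i per point, here sigma is fixed (simplification)
sigma = 1.0
w = np.exp(-np.maximum(d - rho[:, None], 0.0) / sigma)

# Keep only each point's n_neighbors nearest connections (others -> weight 0)
idx = np.argsort(d, axis=1)[:, n_neighbors:]
w[np.arange(len(X))[:, None], idx] = 0.0

# Step 3: combine directed weights into one symmetric graph (fuzzy union)
W = w + w.T - w * w.T
print(W.min() >= 0 and W.max() <= 1)   # True: edge weights stay in [0, 1]
```

The matrix W is the graph that the low-dimensional layout is then optimized to match.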

UMAP Parameters

  • Nearest neighbours:
    • As n_neighbors increases, UMAP connects more and more neighboring points when constructing the graph representation of the high-dimensional data, which leads to a projection that more accurately reflects the global structure.
    • At very low values, any notion of global structure is almost completely lost.

UMAP vs t-SNE

  • There is never a single best approach.
  • The choice of parameters depends heavily on the data selected.
  • UMAP suffers from “containment”

Comparison of PCA, t-SNE, and UMAP

| Description | PCA | t-SNE | UMAP |
|---|---|---|---|
| Linear vs nonlinear | Linear: finds orthogonal axes that explain the maximum variance | Nonlinear: aims to preserve both local and global structure; usually better for complex relationships | Nonlinear: aims to preserve both local and global structure; usually better for complex relationships |
| Global vs local | Preserves more of the global structure of the data, capturing overall variance | Preserves more local structure, emphasizing relationships between nearby points; generally better at revealing clusters and patterns | Preserves more local structure, emphasizing relationships between nearby points; generally better at revealing clusters and patterns |
| Time complexity | Fastest of the three when there is a lot of data | More computationally expensive for large datasets | More computationally expensive for large datasets, but designed to scale somewhat better |
| Interpretability | Interpretable results as linear combinations of the initial features | Transformed embeddings in a lower-dim. space, not directly interpretable in terms of the original features | Transformed embeddings in a lower-dim. space, not directly interpretable in terms of the original features |
| Parameter tuning | Number of components | Perplexity, learning rate, iterations | N. neighbors, min. distance, various metrics |
| Commonly used for | Dimensionality reduction, feature selection, and noise reduction | Data exploration and visualization, revealing complex structures and clusters | Data exploration and visualization, revealing complex structures and clusters |

Embedding Visualization

  • Embeddings are internal representations of models
  • These can be used by models but are difficult for humans to understand
  • In general, they are vectors, that is, a list of numbers that can be represented in Cartesian coordinates
    • Example:
      • 🙋: “The cat sleeps on the floor”
      • 🤖: [1.263, 8.223, 0.198, …, -13.249, 221.149]
  • The concept is very similar to a high dimensional dataset where dimensionality reduction can be applied!
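Treating a batch of embedding vectors as a dataset, any of the techniques above can project them to 2 or 3 dimensions for plotting; a sketch using PCA (random vectors stand in for real model embeddings, and the 384-dim size is an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for model output: 100 sentence embeddings of dimension 384
embeddings = rng.normal(size=(100, 384))

coords = PCA(n_components=2).fit_transform(embeddings)
print(coords.shape)   # (100, 2): one 2D point per embedding

# These coordinates are ready for a scatter plot, e.g. with matplotlib:
# plt.scatter(coords[:, 0], coords[:, 1])
```

Swapping `PCA` for `TSNE` (or UMAP) changes only the projection, not the plotting workflow.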

Advantages of Dimensionality Reduction for Embedding Visualization

  • Fewer features mean lower computational requirements when using a model;
  • Possibility to visualize the embeddings in the embedding space by reducing the dimensions to 3 or fewer.