L14 – Natural language processing: More on embeddings

Getting Text into Our Model

Text vectorization is the process of converting text data into numerical representations that machine learning models can understand. This typically involves the following steps:

  1. Standardization: The initial step often involves cleaning and standardizing the text data. This includes:

    • Removing diacritics (accents and other marks) from characters.

    • Eliminating punctuation to reduce noise in the data.

    • Converting all text to lowercase to ensure uniformity.

  2. Tokenization: This step involves breaking down the text into smaller units called tokens. Tokens can be:

    • Words: Splitting the text into individual words.

    • Subwords: Breaking words into smaller parts, useful for handling rare words.

    • Groups of words (N-grams): Considering sequences of words as single tokens.

  3. Indexing: After tokenization, each token is assigned a unique integer value. This creates a vocabulary where each token is associated with an index.

  4. Encoding: The final step is to convert the integer indices into numerical vectors that can be used as input to machine learning models. Common encoding methods include:

    • Embeddings: Representing tokens as dense vectors in a continuous, relatively low-dimensional space (dense and compact, in contrast to sparse one-hot vectors).

    • One-Hot Encoding: Creating a binary vector for each token, with a 1 at the index corresponding to the token and 0s elsewhere.
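The four steps above can be sketched in plain Python (a minimal illustration, not a production vectorizer; the corpus and helper names are made up):

```python
import re
import unicodedata

def standardize(text):
    # Remove diacritics, strip punctuation, and lowercase the text.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", "", text)
    return text.lower()

def tokenize(text):
    # Word-level tokenization: split on whitespace.
    return text.split()

corpus = ["Héllo, world!", "Hello there, world."]
tokens_per_doc = [tokenize(standardize(doc)) for doc in corpus]

# Indexing: build a vocabulary mapping each token to a unique integer.
vocab = {tok: i for i, tok in
         enumerate(sorted({t for doc in tokens_per_doc for t in doc}))}

# Encoding: one-hot vectors, one per token.
def one_hot(token):
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

encoded = [[one_hot(t) for t in doc] for doc in tokens_per_doc]
print(vocab)  # {'hello': 0, 'there': 1, 'world': 2}
```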

Tokenization

Tokenization algorithms vary, with different language models often employing unique schemes tailored to their specific requirements. For experimentation and understanding, a tokenizer playground is available.
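As a concrete illustration of two of the schemes listed above, word tokens and word N-grams (a toy sketch; real tokenizers handle punctuation, casing, and subwords far more carefully):

```python
def word_tokens(text):
    # Words: split the text into individual lowercase words.
    return text.lower().split()

def word_ngrams(tokens, n=2):
    # N-grams: treat each sequence of n adjacent words as a single token.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the quick brown fox"
tokens = word_tokens(sentence)
print(tokens)                  # ['the', 'quick', 'brown', 'fox']
print(word_ngrams(tokens, 2))  # ['the quick', 'quick brown', 'brown fox']
```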

Embeddings

Embedding layers serve as a crucial component in converting categorical data, such as words, into continuous vector representations. These layers are conceptually equivalent to one-hot encoding followed by a bias-free linear Dense layer, but with much better performance and memory usage. In Keras, the equivalent model looks like this:

embedding = keras.Sequential([
    keras.layers.StringLookup(output_mode="one_hot"),
    keras.layers.Dense(units=embedding_dim, use_bias=False, activation=None),
])
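The one-hot-plus-Dense equivalence described above can be checked numerically with NumPy (a small sketch; the weight values are arbitrary):

```python
import numpy as np

vocab_size, embedding_dim = 5, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embedding_dim))  # Dense weights = embedding table

token_index = 2

# One-hot encoding followed by a bias-free linear layer: one_hot @ W.
one_hot = np.zeros(vocab_size)
one_hot[token_index] = 1.0
dense_out = one_hot @ W

# Embedding lookup: simply select row `token_index` of the table.
embedding_out = W[token_index]

assert np.allclose(dense_out, embedding_out)
```

The lookup skips the full matrix multiplication, which is why embedding layers are faster and lighter than the explicit one-hot formulation.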

Computing Similarity in Embedding Space

The arrangement of tokens in the embedding space reflects their semantic relationships. Similarity between tokens can be quantified using the following measures:

  1. Euclidean (L2) Norm Distance: This measures the straight-line distance between two points a and b in the embedding space, providing a measure of their dissimilarity.

\|a - b\|_2 = \sqrt{(a_1 - b_1)^2 + \dots + (a_n - b_n)^2}

  2. Cosine Similarity: This measures the cosine of the angle between two vectors a and b, indicating their similarity in orientation, regardless of their magnitude.

\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}
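Both measures are one-liners with NumPy (the two vectors here are made-up example embeddings):

```python
import numpy as np

a = np.array([0.1, 0.02, 0.3])
b = np.array([0.2, 0.00, 0.25])

# Euclidean (L2) distance: straight-line distance between the two points.
l2 = np.linalg.norm(a - b)

# Cosine similarity: angle between the vectors, independent of magnitude.
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(l2, cos)
```

Note the opposite senses: a small L2 distance means the points are close, while a cosine similarity near 1 means the vectors point in the same direction.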

Other Uses of Embeddings

Vector Search: Utilized in Google Search, YouTube, Google Play, and other platforms.

The MatchIt Fast demo showcases the vector similarity search capabilities of the Vertex AI Matching Engine, akin to the technology behind Google Image Search, YouTube, and Google Play.

Typical keyword search retrieves content using keywords, tags, or labels:

SELECT id
FROM content
WHERE tag IN ('movie', 'music' ...)

Vector search identifies relevant content through vector similarity:

movie: (0.1, 0.02, 0.3)
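A brute-force version of this lookup can be sketched in a few lines (real systems such as the Vertex AI Matching Engine use approximate nearest-neighbour indexes, but the idea is the same; the item names and vectors below are made up):

```python
import numpy as np

# Hypothetical item embeddings (values are illustrative only).
items = {
    "movie": np.array([0.1, 0.02, 0.3]),
    "music": np.array([0.0, 0.9, 0.1]),
    "podcast": np.array([0.05, 0.8, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def vector_search(query, items, k=2):
    # Rank items by cosine similarity to the query embedding.
    ranked = sorted(items, key=lambda name: cosine(query, items[name]), reverse=True)
    return ranked[:k]

query = np.array([0.02, 0.85, 0.15])
print(vector_search(query, items))  # ['music', 'podcast']
```

Unlike the keyword query, nothing here depends on tags or labels: relevance comes entirely from proximity in the embedding space.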

Embed Everything

Embeddings are versatile and can represent various data types, including sentences, images, and audio. This enables the development of multi-modal machine learning models that integrate diverse sources of information.

Sentence Embedding

Example: Universal Sentence Encoder

This technique assesses the similarity between entire sentences by mapping them to vector representations.

Example:

How old are you? [0.3, 0.2, …]

What is your age? [0.2, 0.1, …]

My phone is good. [0.9, 0.6, …]

Your cellphone looks great. [0.9, 0.6, …]

Image Embedding

For images, a different approach is needed since individual pixels lack semantic meaning. Convolutional Neural Networks (CNNs) are employed for feature extraction. The flattened output of the last Conv layer can serve as the embedding space, capturing high-level image features.
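Schematically, with NumPy standing in for a real CNN: if the last Conv layer outputs a 4×4 spatial grid with 128 channels, flattening (or pooling) that feature map yields a single vector usable as the image embedding. The shapes below are illustrative:

```python
import numpy as np

# Stand-in for the output of the last Conv layer of a CNN: (height, width, channels).
feature_map = np.random.rand(4, 4, 128)

# Flattening turns the feature map into a single embedding vector.
flat_embedding = feature_map.reshape(-1)          # shape: (2048,)

# Global average pooling is a common, more compact alternative.
pooled_embedding = feature_map.mean(axis=(0, 1))  # shape: (128,)

print(flat_embedding.shape, pooled_embedding.shape)
```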

Audio Embedding

CNNs are also utilized for audio embedding, typically applied to a time-frequency representation of the signal (such as a spectrogram) rather than to the raw waveform.

Example: Whisper (OpenAI)

  • Log-mel spectrogram

  • Sinusoidal Positional Encoding

  • Learned Positional Encoding

  • Multi-task training format

Latent Spaces

The output of embedding layers, convolutional layers, and other feature extraction layers collectively form a latent feature space. This space encapsulates knowledge applicable to various tasks, facilitating transfer learning.

Transfer learning involves pretraining a model on one task and then adapting it to a different task with minimal modifications. This approach is valuable because it allows leveraging models trained on easily accessible tasks to solve more complex and challenging problems, particularly in language modeling.

Pretrained Embeddings

Embedding leaderboards provide performance metrics, including the maximum number of tokens, embedding size, and the number of parameters, aiding in the selection of appropriate pretrained embeddings.

Measuring Embedding Quality

Evaluating semantic textual similarity poses challenges. Benchmark tasks, such as classification, clustering, and information retrieval problems with human-annotated solutions, are used to assess NLP model performance.

from sentence_transformers import SentenceTransformer

# The model name is illustrative; any pretrained sentence-embedding model works.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences1 = ["The new movie is awesome"]
sentences2 = ["The dog plays in the garden", "The new movie is so great", "A woman watches TV"]

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1)
embeddings2 = model.encode(sentences2)

# Compute cosine similarities
similarities = model.similarity(embeddings1, embeddings2)

print(similarities)

Example:

The new movie is awesome - The dog plays in the garden: 0.0543

The new movie is awesome - The new movie is so great: 0.8939

The new movie is awesome - A woman watches TV: -0.0502

Contextualized Word Embeddings

After training, embedding layers are static, mapping each token to a fixed embedding vector. However, a single token can have multiple meanings depending on the context.

  • "You are right about this"

  • "Make a right turn at the intersection"

By considering the input as a sequence, the model can infer the correct meaning from context.
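The limitation of static embeddings is easy to see with a toy lookup table (the vectors are made up): a plain embedding layer returns the identical vector for "right" in both sentences, even though the intended meanings differ. Contextualized models instead compute each token's vector from the whole input sequence.

```python
# Toy static embedding table (illustrative values only).
static_embeddings = {
    "right": [0.4, 0.1],
    "turn": [0.0, 0.7],
    "about": [0.2, 0.2],
}

def embed(tokens):
    # A static layer maps each token to the same fixed vector, regardless of context.
    return [static_embeddings[t] for t in tokens]

v1 = embed(["right"])[0]  # from "You are right about this"        -> sense: correct
v2 = embed(["right"])[0]  # from "Make a right turn ..."           -> sense: direction
print(v1 == v2)  # True: a static embedding cannot distinguish the two senses
```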

Additional resources

Pretrained models on KerasHub, GEMMA, Google AI, Google for Developers