L18: Multimodal Transformers

General Transformers

Transformers, initially designed for sequential language data such as text, have demonstrated a remarkable ability to generalize across various data types due to their minimal assumptions about the input data structure. This flexibility has led to their successful adaptation in numerous fields beyond natural language processing.

Transformers have achieved state-of-the-art results in large-scale models for:

  • Text: Natural language understanding and generation tasks.

  • Images: Computer vision tasks such as image classification, object detection, and semantic segmentation.

  • Video: Video analysis tasks including action recognition and video understanding.

  • Audio: Speech recognition, speech synthesis, and music generation.

  • Point clouds: 3D data processing tasks like object recognition and scene understanding in autonomous driving and robotics.

In the context of Large Language Models (LLMs) and multimodal learning, these different data types are referred to as modalities. LLMs can process and integrate information from multiple modalities to perform more complex tasks that require a comprehensive understanding of the world.

Visual Attention

Before the widespread adoption of the transformer architecture, visual attention mechanisms were explored using combinations of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with attention. These early implementations paved the way for the development of more sophisticated attention mechanisms in transformers.

Visual attention mechanisms can be used as an explanation tool by visualizing the attention matrix. The attention matrix highlights the regions of the input image that the model focuses on when making predictions. However, these mechanisms can sometimes make mistakes, focusing on irrelevant features or failing to capture important contextual information.

Transformers for Computer Vision

A significant challenge in processing non-textual inputs like images is defining a suitable token representation that captures the essential information in the input data.

  • Simplest Approach: Treat each pixel as a token.

    • Problem: The attention matrix is quadratic in the number of tokens n, i.e. O(n²) entries. For large images this results in huge matrices, making the computation infeasible due to memory and computational constraints (see the worked example after this list).

  • Two Common Approaches:

    • Cut the image into patches: Divide the image into smaller, non-overlapping patches and treat each patch as a token. This reduces the number of tokens and the computational cost of the attention mechanism.

    • First apply convolutional layers: Use convolutional layers to extract feature maps from the image and then treat each spatial location of the feature maps as a token. This approach leverages the ability of CNNs to capture spatial hierarchies and reduce the dimensionality of the input.
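
To make the scaling problem concrete, the short calculation below compares the two tokenizations for an illustrative 224 × 224 RGB image with 16 × 16 patches (both sizes are arbitrary choices):

    height, width = 224, 224       # illustrative image size
    patch = 16                     # illustrative patch size

    pixel_tokens = height * width                        # 50,176 tokens (one per pixel)
    patch_tokens = (height // patch) * (width // patch)  # 196 tokens (one per 16 × 16 patch)

    # The attention matrix has O(n²) entries for n tokens:
    print(pixel_tokens ** 2)   # 2,517,630,976 entries per attention head
    print(patch_tokens ** 2)   # 38,416 entries per attention head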

Vision Transformer (ViT) for Classification

The Vision Transformer (ViT) is a pioneering architecture that adapts the transformer model for image classification tasks by processing images as sequences of patches.

  1. Split the image into patches of size N × N pixels.

    • Instead of processing height × width tokens (one per pixel), we represent the image as (height / N) × (width / N) tokens, each a flattened vector of N × N × colour channels values, significantly reducing the sequence length (see the sketch after this list).

  2. For each patch:

    • Flatten it to a 1D vector: Reshape each patch into a flat vector to be processed as input to the transformer.

    • Make embeddings: Project the flattened patch vectors into a high-dimensional embedding space to capture semantic information.

    • Add positional embeddings: Incorporate positional information into the patch embeddings to inform the transformer about the location of each patch in the original image.

    • Treat it as any generic sequence and feed it into a transformer encoder: Process the embedded patch sequences using standard transformer encoder layers.
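
A minimal sketch of steps 1 and 2 using TensorFlow (the 224 × 224 input, 16 × 16 patches, and 256-dimensional embedding are illustrative choices):

    import tensorflow as tf
    from tensorflow import keras

    images = tf.random.uniform([8, 224, 224, 3])    # a batch of 8 RGB images

    # Split each image into non-overlapping 16 × 16 patches and flatten them:
    # (8, 224, 224, 3) -> (8, 14, 14, 16*16*3) -> (8, 196, 768)
    patches = tf.image.extract_patches(
        images,
        sizes=[1, 16, 16, 1], strides=[1, 16, 16, 1], rates=[1, 1, 1, 1],
        padding="VALID",
    )
    tokens = tf.reshape(patches, [8, 14 * 14, 16 * 16 * 3])

    # Project each flattened patch into a 256-dimensional embedding space.
    embeddings = keras.layers.Dense(256)(tokens)    # shape (8, 196, 256)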

Vision Transformer architecture includes:

  • Patch splitting and flattening: Dividing the image into patches and reshaping them into flat vectors.

  • Linear projection of flattened patches to create embeddings: Projecting the patch vectors into a high-dimensional embedding space.

  • Positional embeddings added to patch embeddings: Adding positional information to the patch embeddings.

  • Transformer encoder layers: Processing the embedded patch sequences using standard transformer encoder layers.

  • Classification head for output: A classification layer to predict the image class from the transformer encoder output (a compact sketch of the full pipeline follows this list).
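
Putting the pieces together, the following is a minimal ViT-style classifier sketched in Keras. The hyperparameters (16 × 16 patches, 256-dimensional embeddings, 6 encoder layers, mean pooling instead of a [CLS] token) are illustrative simplifications, not the published ViT configuration; the strided convolution is equivalent to cutting non-overlapping patches and applying a shared linear projection.

    import tensorflow as tf
    from tensorflow import keras

    PATCH, DIM, HEADS, LAYERS, CLASSES, IMG = 16, 256, 8, 6, 1000, 224
    NUM_PATCHES = (IMG // PATCH) ** 2

    class PatchEncoder(keras.layers.Layer):
        """Patch splitting, flattening, linear projection, and positional embedding."""
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            # A convolution with kernel size == stride == patch size cuts the image
            # into non-overlapping patches and linearly projects each one.
            self.patchify = keras.layers.Conv2D(DIM, kernel_size=PATCH, strides=PATCH)
            self.pos_embed = keras.layers.Embedding(NUM_PATCHES, DIM)

        def call(self, images):
            x = self.patchify(images)                    # (batch, IMG/PATCH, IMG/PATCH, DIM)
            x = tf.reshape(x, [-1, NUM_PATCHES, DIM])    # one token per patch
            return x + self.pos_embed(tf.range(NUM_PATCHES))

    inputs = keras.Input(shape=(IMG, IMG, 3))
    x = PatchEncoder()(inputs)

    # Transformer encoder layers: pre-norm self-attention and MLP with residual connections.
    for _ in range(LAYERS):
        h = keras.layers.LayerNormalization()(x)
        h = keras.layers.MultiHeadAttention(num_heads=HEADS, key_dim=DIM // HEADS)(h, h)
        x = x + h
        h = keras.layers.LayerNormalization()(x)
        h = keras.layers.Dense(4 * DIM, activation="gelu")(h)
        x = x + keras.layers.Dense(DIM)(h)

    # Classification head: mean-pool the patch tokens and predict class probabilities.
    x = keras.layers.GlobalAveragePooling1D()(keras.layers.LayerNormalization()(x))
    outputs = keras.layers.Dense(CLASSES, activation="softmax")(x)
    vit = keras.Model(inputs, outputs)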

Positional Embeddings

Positional embeddings are crucial for introducing information about the position of each patch in the image, as the transformer architecture itself is permutation-invariant and does not inherently understand the spatial relationships between patches.

  • Option 1: Handcrafted encodings, like the sinusoidal encoding of the original transformer.

    • Gets a little complicated and generally does not perform very well compared to learned embeddings.

  • Option 2: Learned embeddings

    • Instead of embedding word 1, 2, …, we trivially extend the procedure to embed patch (1,1), (2,1), …

    • keras.layers.Embedding accepts integer indices of any shape, so an N × M grid of patch positions can be embedded just like a 1D sequence (see the sketch below).
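
A minimal sketch of Option 2 with keras.layers.Embedding, assuming 196 patch positions (a 14 × 14 grid flattened to 1D) and 256-dimensional embeddings, both arbitrary choices:

    import tensorflow as tf
    from tensorflow import keras

    num_patches, embed_dim = 196, 256
    patch_embeddings = tf.random.normal([8, num_patches, embed_dim])   # stand-in for projected patches

    # One learned vector per patch position, trained jointly with the rest of the model.
    position_embedding = keras.layers.Embedding(input_dim=num_patches, output_dim=embed_dim)
    positions = tf.range(num_patches)                                  # 0, 1, ..., 195

    tokens = patch_embeddings + position_embedding(positions)          # broadcasts over the batch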

Combined Architecture

CNNs can be effectively combined with transformers for more complex computer vision tasks such as object detection. These hybrid architectures leverage the strengths of both CNNs and transformers to achieve state-of-the-art results.

  • CNN backbone to extract image features: Use a CNN to extract high-level feature maps from the input image. CNNs are good at capturing local spatial hierarchies and are computationally efficient.

  • Transformer encoder to process features: Use a transformer encoder to process the feature maps extracted by the CNN backbone. The transformer can capture long-range dependencies and contextual information in the feature maps.

  • Transformer decoder with object queries for detection: Use a transformer decoder with object queries to predict the bounding boxes and class labels of objects in the image.

This architecture uses:

  • A CNN backbone.

  • A transformer encoder.

  • A transformer decoder.

  • Prediction heads for class labels and bounding boxes (a sketch of the full pipeline follows this list).
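
This design is essentially that of DETR (DEtection TRansformer). Below is a heavily simplified sketch of the idea in Keras; the ResNet50 backbone, single encoder and decoder layers, 100 object queries, and 91 classes are illustrative choices, not the exact published configuration.

    import tensorflow as tf
    from tensorflow import keras

    EMBED, HEADS, QUERIES, CLASSES = 256, 8, 100, 91   # illustrative sizes

    inputs = keras.Input(shape=(224, 224, 3))

    # CNN backbone: extract a grid of high-level image features.
    backbone = keras.applications.ResNet50(include_top=False, weights=None, input_tensor=inputs)
    features = keras.layers.Conv2D(EMBED, kernel_size=1)(backbone.output)   # (batch, 7, 7, EMBED)
    tokens = keras.layers.Reshape((7 * 7, EMBED))(features)                 # one token per feature-map cell

    # Transformer encoder: self-attention over the feature tokens.
    enc = keras.layers.MultiHeadAttention(num_heads=HEADS, key_dim=EMBED // HEADS)(tokens, tokens)
    enc = keras.layers.LayerNormalization()(tokens + enc)

    # Transformer decoder: learned object queries cross-attend to the encoded features.
    class ObjectQueries(keras.layers.Layer):
        def build(self, input_shape):
            self.queries = self.add_weight(shape=(QUERIES, EMBED),
                                           initializer="random_normal", name="queries")
        def call(self, encoded):
            batch = tf.shape(encoded)[0]
            return tf.repeat(self.queries[None], batch, axis=0)

    q = ObjectQueries()(enc)
    dec = keras.layers.MultiHeadAttention(num_heads=HEADS, key_dim=EMBED // HEADS)(q, enc)
    dec = keras.layers.LayerNormalization()(q + dec)

    # Prediction heads: one class label and one bounding box per object query.
    class_logits = keras.layers.Dense(CLASSES + 1)(dec)        # extra class for "no object"
    boxes = keras.layers.Dense(4, activation="sigmoid")(dec)   # normalised (cx, cy, w, h)
    detector = keras.Model(inputs, [class_logits, boxes])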

Inductive Biases

  • Inductive bias: Assumption built into the model architecture.

  • Linear models (y = ax + b): Assume data is linear.

  • CNNs: Assume translational equivariance, hierarchical structure.

  • RNNs: Assume meaningful ordering.

  • Transformers: Assume some relation between input tokens, but that’s about it.

  • Good: General-purpose architecture.

  • Bad: Requires (a lot) more training data than problem-specific architectures with strong inductive biases.

Combining Vision and Text

Vision-language models can perform tasks such as:

  • Object Localization: Identifying the location of objects in an image based on textual descriptions or queries.

  • Zero-shot Segmentation: Segmenting objects in an image without any prior training examples, based on textual descriptions or category names.

  • Zero-shot Visual Question Answering: Answering questions about an image without any prior training on question-answer pairs, based on visual and textual information.

  • One-shot Learning with Instructions: Learning to perform new tasks from a single example, guided by textual instructions.

These models combine visual and language understanding to answer questions, segment images, and perform few-shot learning based on textual instructions. They enable machines to understand and reason about the world in a more human-like way by integrating information from different modalities.

Audio Transformers

Typical tasks:

  • Speech-to-text: Transcribing spoken language into written text.

  • Text-to-speech: Synthesizing speech from written text.

  • Speech-to-speech: Translating spoken language from one language to another.

  • Text-to-music: Generating music from written text or symbolic representations.

Two common ways to process audio data:

  • Work on raw waveform data (Amplitude as function of time): Process the audio signal directly as a sequence of amplitude values.

  • Convert waveforms to spectrograms (Frequency content as function of time): Transform the audio signal into a spectrogram, which represents the frequency content of the audio over time (see the sketch below).
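
A minimal sketch of the second option using TensorFlow's tf.signal (16 kHz sampling, 25 ms frames with a 10 ms hop, and 80 mel bins are illustrative choices):

    import tensorflow as tf

    # One second of mono audio sampled at 16 kHz (random noise as a stand-in).
    waveform = tf.random.normal([16000])

    # Short-time Fourier transform: frequency content as a function of time.
    stft = tf.signal.stft(waveform, frame_length=400, frame_step=160, fft_length=512)
    spectrogram = tf.abs(stft)                          # (time frames, frequency bins) = (98, 257)

    # Optional mel scaling and log compression, as commonly used for speech models.
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=80, num_spectrogram_bins=257, sample_rate=16000)
    log_mel = tf.math.log(tf.matmul(spectrogram, mel_matrix) + 1e-6)   # (98, 80)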

Audio Transformers Architecture

Audio transformers typically utilize an encoder-decoder structure to process audio data and generate corresponding outputs.

Encoder:

  • Processes the input audio (spectrogram or raw waveform): Extracts relevant features from the input audio signal.

  • May include convolutional layers and positional encodings: Uses convolutional layers to capture local patterns and positional encodings to provide information about the temporal structure of the audio.

Decoder:

  • Generates the output sequence (text or audio): Produces the desired output based on the encoded audio representation.

  • Uses cross-attention to attend to the encoder output: Attends to the relevant parts of the encoded audio when generating the output sequence (a compact sketch follows).
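
A compact speech-to-text sketch of this encoder-decoder structure in Keras (layer counts, sizes, and the 8,000-token vocabulary are illustrative, and positional encodings are omitted for brevity):

    import tensorflow as tf
    from tensorflow import keras

    EMBED, HEADS, VOCAB = 256, 4, 8000   # illustrative sizes

    # Encoder: convolutional downsampling of the spectrogram, then self-attention.
    spec_in = keras.Input(shape=(None, 80))             # (time frames, mel bins)
    h = keras.layers.Conv1D(EMBED, 3, strides=2, padding="same", activation="gelu")(spec_in)
    h = keras.layers.Conv1D(EMBED, 3, strides=2, padding="same", activation="gelu")(h)
    enc = keras.layers.MultiHeadAttention(num_heads=HEADS, key_dim=EMBED // HEADS)(h, h)
    enc = keras.layers.LayerNormalization()(h + enc)

    # Decoder: embeds the text generated so far and cross-attends to the audio encoding.
    tok_in = keras.Input(shape=(None,), dtype="int32")
    t = keras.layers.Embedding(VOCAB, EMBED)(tok_in)
    sa = keras.layers.MultiHeadAttention(num_heads=HEADS, key_dim=EMBED // HEADS)(
        t, t, use_causal_mask=True)
    t = keras.layers.LayerNormalization()(t + sa)
    ca = keras.layers.MultiHeadAttention(num_heads=HEADS, key_dim=EMBED // HEADS)(t, enc)  # cross-attention
    t = keras.layers.LayerNormalization()(t + ca)

    # Next-token prediction over the text vocabulary.
    logits = keras.layers.Dense(VOCAB)(t)
    speech_to_text = keras.Model([spec_in, tok_in], logits)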

Audio Transformers for Speech Synthesis

Speech synthesis can use an acoustic prompt that conveys speaking rate, intonation, and articulation to determine the style and tone of the generated speech, allowing for more expressive and natural-sounding output.

The model uses:

  • A text prompt: The text to be synthesized into speech.

  • An acoustic prompt: An audio sample that provides information about the desired speaking style and tone.

  • A transformer: A transformer network that processes the text and acoustic prompts to generate the output speech.

  • A discrete tokenizer: A module that converts the text and audio into discrete tokens that can be processed by the transformer.

  • An audio decoder: Generates the final audio waveform from the transformer output. (A rough sketch of how these components fit together follows this list.)
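
A rough sketch of how these components could fit together, assuming discrete tokenizers that map the text and the acoustic prompt to integer IDs (all names and sizes are illustrative, not a specific published system):

    import tensorflow as tf
    from tensorflow import keras

    EMBED, HEADS, TEXT_VOCAB, AUDIO_VOCAB = 256, 4, 8000, 1024   # illustrative sizes

    text_ids = keras.Input(shape=(None,), dtype="int32")     # tokenized text prompt
    prompt_ids = keras.Input(shape=(None,), dtype="int32")   # discrete tokens of the acoustic prompt
    target_ids = keras.Input(shape=(None,), dtype="int32")   # audio tokens generated so far

    # Embed each stream and concatenate: the acoustic prompt conditions style and tone.
    text = keras.layers.Embedding(TEXT_VOCAB, EMBED)(text_ids)
    audio_embed = keras.layers.Embedding(AUDIO_VOCAB, EMBED)
    context = keras.layers.Concatenate(axis=1)([text, audio_embed(prompt_ids), audio_embed(target_ids)])

    # Causal transformer block predicting the next discrete audio token;
    # the predicted tokens are passed to an audio decoder to produce the waveform.
    h = keras.layers.MultiHeadAttention(num_heads=HEADS, key_dim=EMBED // HEADS)(
        context, context, use_causal_mask=True)
    h = keras.layers.LayerNormalization()(context + h)
    logits = keras.layers.Dense(AUDIO_VOCAB)(h)

    tts = keras.Model([text_ids, prompt_ids, target_ids], logits)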

Transformers for Other Uses - AlphaStar

AlphaStar: Mastering the real-time strategy game StarCraft II, demonstrating the capability of transformers to handle complex sequential decision-making tasks in dynamic environments.

AlphaStar's behavior is generated by a deep neural network that receives input data from the raw game interface (a list of units and their properties) and outputs a sequence of instructions that constitute an action within the game.

The neural network architecture applies a transformer torso to the units (similar to relational deep reinforcement learning), combined with a deep LSTM core, an auto-regressive policy head with a pointer network, and a centralized value baseline.

Multimodal Transformers

The transformer encoder-decoder structure allows a single model to operate on several different types of data, enabling seamless integration and processing of multimodal information.

If a modality can be encoded into an embedding space, it can be used as context for a decoder, allowing the model to leverage information from multiple sources to generate context-aware outputs.