L7 Vision Transformers

25 Terms

1

Visual tokenization

The process of converting image data into sequences of discrete tokens suitable for transformers. Unlike traditional CNNs, which process images in a spatially-aware, hierarchical manner, visual tokenization treats images as a sequence.

Steps:

  1. Patching

  2. Patch embedding

  3. Quantization (optional)

  4. Positional encoding
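A minimal PyTorch sketch of the patching + flattening + linear projection steps above (all sizes are illustrative assumptions):

  import torch

  B, C, H, W, P, D = 2, 3, 224, 224, 16, 768        # batch, channels, image size, patch size, embed dim
  img = torch.randn(B, C, H, W)

  # 1. patching: split the image into non-overlapping P x P patches and flatten each one
  patches = img.unfold(2, P, P).unfold(3, P, P)                          # (B, C, H/P, W/P, P, P)
  patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)  # (B, N, C*P*P)

  # 2. patch embedding: one learned linear projection shared by all patches
  tokens = torch.nn.Linear(C * P * P, D)(patches)                        # (B, N, D)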

2

Patch embedding techniques

  • Learned linear projection layer (after flattening the patches) → uniform for each patch

  • Fixed CNN-backbone
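In practice the learned linear projection is often implemented as a strided convolution with kernel size = stride = patch size, which applies the same projection to every patch; a small sketch (sizes are assumptions):

  import torch
  import torch.nn as nn

  patch, dim = 16, 768
  proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # equivalent to flatten + linear per patch

  img = torch.randn(1, 3, 224, 224)
  tokens = proj(img).flatten(2).transpose(1, 2)               # (1, 196, 768) patch embeddings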

3

Positional embedding types

  • learned (e.g. embedding layer)

  • fixed (e.g. sine/cosine)
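A small sketch of the fixed sine/cosine variant (the formula from the original transformer; dim is assumed even):

  import torch

  def sincos_positional_encoding(n_positions, dim):
      # even columns get sin, odd columns get cos of position-dependent angles
      pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)
      i = torch.arange(0, dim, 2, dtype=torch.float32)
      angles = pos / (10000 ** (i / dim))
      pe = torch.zeros(n_positions, dim)
      pe[:, 0::2] = torch.sin(angles)
      pe[:, 1::2] = torch.cos(angles)
      return pe

  pe = sincos_positional_encoding(196, 768)   # one encoding per patch position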

4

Positional encoding in ViTs

CNNs inherently capture spatial relationships via local receptive fields, while transformers rely on positional encodings to recreate this spatial structure.

Types:

  • additive: e_patch + e_pos(x, y) → the absolute position is used

  • relative: depends on the distance between the query’s and key’s positions in the sequence instead of their absolute positions → computed in each attention layer

  • bias: a learned bias term is added to the attention scores. Sometimes two-phase attention is used: attention is first computed with (Q, e_rel, V), and its result is used as a bias in the original (Q, K, V) attention.

5

ViT Architecture

Vision Transformers (ViT) process an image as a sequence of patches: each patch is embedded, and the embeddings are passed through transformer layers originally designed for NLP tasks.

Unlike CNNs, ViT does not rely on convolutions, allowing it to leverage self-attention mechanisms to model global relationships more directly.
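A compact, hedged sketch of this pipeline using PyTorch's built-in encoder (hyper-parameters are illustrative, not those of the original ViT):

  import torch
  import torch.nn as nn

  class TinyViT(nn.Module):
      def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, classes=10):
          super().__init__()
          self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patch embedding
          n = (img // patch) ** 2
          self.cls = nn.Parameter(torch.zeros(1, 1, dim))                   # classification token
          self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))               # learned positional embedding
          layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True, norm_first=True)
          self.encoder = nn.TransformerEncoder(layer, depth)                # pre-normalized encoder
          self.head = nn.Linear(dim, classes)                               # MLP head on the CLS output

      def forward(self, x):
          x = self.embed(x).flatten(2).transpose(1, 2)                      # (B, N, dim)
          x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
          x = self.encoder(x)
          return self.head(x[:, 0])                                         # classify from the CLS token

  logits = TinyViT()(torch.randn(2, 3, 224, 224))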

6

Difference between ViT’s encoder and normal transformer encoder

1. Normalization is applied before each sub-layer instead of after = pre-normalized transformer encoder → adds extra regularization

2. An MLP head is added: it only uses the first element of the sequence (the CLS token), whose role is to aggregate all the inputs needed for the classification

7

Classification token in ViT

A dedicated token added to patch embeddings in ViT, which has the role of aggregating all the inputs which are needed for the classification, therefore learning a global representation for the entire image. During inference, the transformer output for this token is passed to the MLP head for classification.

=> In CNNs, global representations are achieved through pooling layers, but ViT’s CLS token offers a more flexible, learnable representation.

8

Higher Resolution Processing in ViTs

Adapting the transformer architecture to handle larger image resolutions, allowing the model to capture finer details by processing longer sequences of image patches.

Only the positional encoding is dependent on size → we can interpolate the pre-trained positional encoding values to get more points for a longer input
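A minimal sketch of that interpolation, assuming a pre-trained 14×14 grid of positional embeddings resized for 24×24 patches (the CLS position, if any, would be kept unchanged):

  import torch
  import torch.nn.functional as F

  dim = 768
  pos = torch.randn(1, 14 * 14, dim)                       # pre-trained positional embeddings

  grid = pos.reshape(1, 14, 14, dim).permute(0, 3, 1, 2)   # (1, dim, 14, 14)
  grid = F.interpolate(grid, size=(24, 24), mode="bicubic", align_corners=False)
  new_pos = grid.permute(0, 2, 3, 1).reshape(1, 24 * 24, dim)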

9

SWIN transformer

  • able to process images of arbitrary size using a hierarchical design of multiple resolution levels

  • local attention: limited to non-overlapping windows

  • introduces a window-shifting mechanism for information flow across distant regions

  • tokens are merged (pooled) by a learnable projection at each level
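A small sketch of the window partitioning used for local attention (shapes are assumptions):

  import torch

  def window_partition(x, M):
      # x: (B, H, W, C) feature map -> (num_windows * B, M*M, C) non-overlapping M x M windows
      B, H, W, C = x.shape
      x = x.reshape(B, H // M, M, W // M, M, C)
      return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

  windows = window_partition(torch.randn(1, 56, 56, 96), M=7)   # (64, 49, 96)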

10

Advantage of SWIN over ViT

Dramatically reduces computational complexity from quadratic to linear with respect to image size

→ more scalable to high-resolution images

→ widely used for semantic segmentation and object detection
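A rough worked example (assuming a 56×56 token grid, i.e. N = 3136, and window size M = 7): global attention computes on the order of N² ≈ 9.8 million query-key scores per head, while window attention computes N·M² ≈ 154 thousand, roughly a 64× reduction, and the gap keeps growing quadratically as resolution increases.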

11

Window shifting in SWIN

Shifting attention windows in consecutive layers in a cyclic manner, enabling interactions between adjacent windows and allowing information to propagate across the entire image.
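A minimal sketch of the cyclic shift, assuming window size M = 7 and a (B, H, W, C) layout:

  import torch

  M = 7
  shift = M // 2
  x = torch.randn(1, 56, 56, 96)                                       # (B, H, W, C)

  shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))        # shift before window attention
  restored = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))   # reverse shift afterwards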

12

Learnable relative attention bias in SWIN

  • a bias term is learned for each possible relative position of two tokens within the window
    → (2M-1)² biases per head

  • the bias corresponding to the current query and key is added to the attention score

  • adapts well to different input sizes and can generalize better across various resolutions

  • ←→ Requires additional training to learn optimal bias terms for each window
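A hedged sketch of how the (2M-1)² bias table can be indexed for an M×M window (following the scheme described above; names are illustrative):

  import torch

  M, num_heads = 7, 4
  bias_table = torch.nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))  # one bias per relative offset and head

  # relative-position index for every (query, key) pair inside the window
  coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))  # (2, M, M)
  coords = coords.flatten(1)                               # (2, M*M)
  rel = coords[:, :, None] - coords[:, None, :]            # (2, M*M, M*M), values in [-(M-1), M-1]
  rel = rel.permute(1, 2, 0) + (M - 1)                     # shift to [0, 2M-2]
  rel_index = rel[..., 0] * (2 * M - 1) + rel[..., 1]      # (M*M, M*M)

  # inside attention, the looked-up bias is added to the raw scores:
  bias = bias_table[rel_index].permute(2, 0, 1)            # (num_heads, M*M, M*M)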

13

Knowledge distillation

A training technique where a smaller model (the student) is trained to mimic the outputs of a larger, pre-trained model (the teacher), often by using the teacher’s “soft” probability distributions instead of hard labels.

Advantages:

  • smaller, more efficient models with faster inference

  • can transfer performance from large, complex models to simpler models

Disadvantages:

  • The student model’s performance is limited by the quality of the teacher

  • distillation doesn’t always generalize well to significantly smaller models.
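A classic soft-label distillation loss, sketched in PyTorch (temperature and weighting are assumptions):

  import torch
  import torch.nn.functional as F

  def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
      # soft part: match the teacher's temperature-scaled distribution
      soft = F.kl_div(
          F.log_softmax(student_logits / T, dim=-1),
          F.softmax(teacher_logits / T, dim=-1),
          reduction="batchmean",
      ) * (T * T)
      # hard part: ordinary cross-entropy against the ground-truth labels
      hard = F.cross_entropy(student_logits, labels)
      return alpha * soft + (1 - alpha) * hard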

14

DEIT

= Data Efficient Image Transformer

A ViT variant that incorporates a distillation token in the input sequence as well

→ learns from a teacher model by optimizing the mean of the classification and distillation losses (cross-entropy between the student logits and the ε-smoothed hard teacher labels)

→ the teacher’s prediction is set to 1 − ε, and the rest of the probability mass is distributed among the other classes to get soft labels

→ during inference, the dist. and class. logits are summed before softmax

Use Cases: Applied in scenarios where training data is limited but high-quality CNN models are available for transfer learning.
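A hedged sketch of the objective described above (ε and the 50/50 weighting follow the card; tensor names are illustrative, and applying the same smoothing to the ground-truth labels is an assumption):

  import torch.nn.functional as F

  def deit_loss(cls_logits, dist_logits, teacher_logits, labels, eps=0.1):
      # classification head vs. ground truth, distillation head vs. the teacher's hard prediction,
      # both with eps label smoothing; the two losses are averaged
      teacher_hard = teacher_logits.argmax(dim=-1)
      loss_cls = F.cross_entropy(cls_logits, labels, label_smoothing=eps)
      loss_dist = F.cross_entropy(dist_logits, teacher_hard, label_smoothing=eps)
      return 0.5 * (loss_cls + loss_dist)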

15

SimMIM

= Simple Masked Image Modeling

  • A self-supervised pre-training approach similar to BERT in NLP

  • random image patches are masked, and the model is trained to reconstruct the missing parts using a MASK token

  • L1 loss between predicted and ground-truth patches is used

=> Improves feature learning, reduces model “data hunger,” and enables pre-trained ViTs to generalize well across downstream tasks.
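A minimal sketch of the masking + L1 reconstruction objective (sizes, mask ratio, and the linear prediction head are assumptions):

  import torch
  import torch.nn.functional as F

  B, N, D, P = 8, 196, 768, 16                        # batch, patches, embed dim, patch size
  patch_embeddings = torch.randn(B, N, D)
  target_patches = torch.randn(B, N, 3 * P * P)       # ground-truth pixel patches

  mask_token = torch.nn.Parameter(torch.zeros(D))
  mask = torch.rand(B, N) < 0.6                       # mask a random ~60% of the patches
  x = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patch_embeddings)

  # x would pass through the transformer encoder here; a light head predicts raw pixels
  pred_patches = torch.nn.Linear(D, 3 * P * P)(x)

  loss = F.l1_loss(pred_patches[mask], target_patches[mask])   # L1 only on masked positions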

16

DINO

= self-DIstillation with NO labels

  • Aims to find representations which are stable in the dataset

  • Leverages self-distillation without labels by using an exponential moving average-based teacher model derived from the student’s past states.

  • The teacher predictions are mean-centered and sharpened, guiding the student to learn stable and informative features.

  • Optimizes cross-entropy between the student and the slowly evolving teacher
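A hedged sketch of the loss and the centering/sharpening step (temperatures and momenta are typical values, not prescribed by the card):

  import torch
  import torch.nn.functional as F

  def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
      # teacher: centered, then sharpened with a low temperature, no gradient
      t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
      s = F.log_softmax(student_out / tau_s, dim=-1)
      return -(t * s).sum(dim=-1).mean()

  # the teacher itself is an exponential moving average of the student, e.g.
  #   teacher_param.data = m * teacher_param.data + (1 - m) * student_param.data
  # and the center is an EMA of recent teacher outputs.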

17

DINOv2

Masked modeling is added (only to student), along with data curation

=> achieves extremely good image embeddings

18

CoCA

= Contrastive Captioners

  • a multi-modal model

  • embeds images and text into a joint embedding space for contrastive learning

  • the model can use both modalities for few-shot learning and retrieval tasks → e.g. image captioning, or finding the most relevant images in a database for a given text prompt

  • the image encoder is run only once, while the text decoders are run for each generated token

  • The image encodings serve as the keys and values, while the unimodal text decoder’s current output is the query

19

CoCA training

1. Encoder pre-training (supervised classification)

2. Simultaneous contrastive and reconstruction training

  • Contrastive training to match pooled image embedding and CLS token from text representation (pair-wise dot-product similarity)

  • Reconstruction training to reconstruct the original caption from the encoded image (autoregressive text prediction, using all image embeddings and all text embeddings generated so far)
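A minimal sketch of the contrastive part: pair-wise dot-product similarity between normalized image and text embeddings, with matching pairs on the diagonal (the temperature is an assumption):

  import torch
  import torch.nn.functional as F

  def contrastive_loss(image_emb, text_emb, temperature=0.07):
      image_emb = F.normalize(image_emb, dim=-1)
      text_emb = F.normalize(text_emb, dim=-1)
      logits = image_emb @ text_emb.t() / temperature           # (B, B) similarity matrix
      targets = torch.arange(logits.size(0), device=logits.device)
      # symmetric cross-entropy: image-to-text and text-to-image
      return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))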

20

Large World Models

These models aim to learn generative representations across multiple modalities (e.g., text, image, video), creating a comprehensive model of the world, which can generalize across tasks and domains.

21

Recipe for Large World Models

1. Train a tokenizer (Encoder-Decoder model)

2. Train a transformer with a Masked Modeling or Predict Next Token task in the latent space of this tokenizer (e.g. a VAE).

3. For multiple modalities, train multiple VAEs and require the single transformer to indicate the output modality.

4. Use the corresponding encoder and decoder for each modality in the sequence.

22

Why might discrete latent spaces be beneficial over continuous ones?

Discrete latent spaces enforce a bottleneck that can lead to more meaningful, structured, compact representations.

  • Compact, efficient code.

  • Easy autoregressive modeling (discrete code acts as tokens)

  • Interpretability.

  • Reduces posterior collapse.

23

VQ-VAE (Vector Quantized VAE)

  • encodes the input into a discrete latent space

  • uses hard quantization: assigns each embedded input vector to the nearest vector from a learned codebook of fixed size

  • decodes from the assigned codebook vectors

(It's called variational because it models discrete latent variables with approximate inference, using quantization as a discrete bottleneck instead of a continuous distribution.)
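A minimal sketch of the hard quantization step with a straight-through gradient (codebook size and dimensions are assumptions):

  import torch

  def quantize(z_e, codebook):
      # z_e: (B, N, D) encoder outputs; codebook: (K, D) learned embedding vectors
      dist = torch.cdist(z_e, codebook.unsqueeze(0).expand(z_e.size(0), -1, -1))  # (B, N, K)
      indices = dist.argmin(dim=-1)                   # discrete codes, (B, N)
      z_q = codebook[indices]                         # nearest codebook vectors, (B, N, D)
      z_q = z_e + (z_q - z_e).detach()                # straight-through estimator for the encoder
      return z_q, indices

  z_q, codes = quantize(torch.randn(2, 196, 64), torch.randn(512, 64))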

24

dVAE (discrete VAE)

  • a type of autoencoder that represents input data in a discrete latent space

  • Unlike continuous VAEs, dVAE reduces the continuous latent space to discrete units = greedily samples from a distribution

  • => enhances interpretability and enables tokenization for transformer processing ←→ may lose detail

Unlike VQ-VAE, it does not encode the input into a single latent vector per position; it encodes it into a distribution over the codebook for each position (instead of outputting a continuous vector in latent space, it outputs a soft distribution over discrete codes).

During training, it samples from this distribution to get a stochastic representation for each position and decodes from those samples.

←→ During inference, it typically chooses the codebook vector with the highest probability for deterministic encoding.
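A hedged sketch of the train/inference difference, using the Gumbel-softmax relaxation commonly used to make the sampling differentiable (codebook size and shapes are assumptions):

  import torch
  import torch.nn.functional as F

  codebook = torch.randn(8192, 256)                  # K codes x D dims
  logits = torch.randn(4, 196, 8192)                 # encoder output: a distribution over codes per position

  # training: stochastic, differentiable sample from each per-position distribution
  soft_onehot = F.gumbel_softmax(logits, tau=1.0, hard=False)   # (B, N, K)
  z_train = soft_onehot @ codebook                              # soft mixture of codebook vectors

  # inference: deterministic, pick the most probable code per position
  z_infer = codebook[logits.argmax(dim=-1)]                     # (B, N, D)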

25

VQ-GAN

  • combines the principles of vector quantization and adversarial training

  • the generator is a VQ-VAE => the discriminator ensures the creation of an efficient codebook

  • compared to VQ-VAE, in addition to reconstruction loss, this model includes a perceptual loss → ensures that generated images are not only visually accurate but also perceptually similar to real images.

Advantages:

  • creates sharper images (VQ-VAE’s output is often blurry)

  • perceptual loss ensures semantic consistency
