Convolutional Neural Networks and Transformers

Convolutional Neural Networks (CNNs)

  • Designed for data with a grid-like structure, such as images.

  • Greyscale images are represented as matrices of pixel values, typically ranging from 0 to 1.

  • Example: A 6x6 pixel image.

Dense Layers

  • Each input variable is connected to all hidden units in the subsequent layer.

  • Each connection has a unique weight associated with it.

  • Drawbacks:

    • Captures too much redundant detail without emphasizing important features.

    • Fails to generalize well to unseen data.

Convolutional Layers

  • Sparse interactions: Each neuron in a convolutional layer receives input from only a local region of the input.

  • Parameter sharing: The same filter (set of weights) is used across different locations in the input.

CNN: Sparse Interactions

  • Convolution involves applying a kernel (filter) to an image, computing a weighted average.

  • Flattening transforms the output of convolutional layers into a one-dimensional vector, suitable for input to fully connected layers.

CNN: Parameter Sharing

  • The same set of parameters (filter) is used in multiple locations.

  • Mapping between inputs and hidden units is a convolution between input variables and the filter.

  • Advantages:

    • Translation invariance: CNNs are relatively insensitive to the location of objects in the image.

    • Fewer parameters: e.g., 10 vs. 1332 in a dense layer example.

CNN: Strides

  • Strides reduce the number of hidden units and store only the most important information.

  • The stride determines how many pixels the filter shifts over the image at each step.

  • Example: Shifting by two pixels.
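The sliding-kernel computation with a configurable stride can be sketched in plain Python (a minimal "valid" convolution; the 6x6 checkerboard image and 3x3 averaging kernel are purely illustrative):

```python
def conv2d(image, kernel, stride=1):
    """Valid 2D convolution (cross-correlation) with a configurable stride."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(0, ih - kh + 1, stride):
        row = []
        for c in range(0, iw - kw + 1, stride):
            # Weighted sum of the local region under the kernel
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# 6x6 image, 3x3 averaging kernel
image = [[(r + c) % 2 for c in range(6)] for r in range(6)]
kernel = [[1 / 9] * 3 for _ in range(3)]
print(len(conv2d(image, kernel, stride=1)))  # 4 (output is 4x4)
print(len(conv2d(image, kernel, stride=2)))  # 2 (stride 2 halves each dimension)
```

Note how the same kernel weights are reused at every location (parameter sharing), and each output depends only on a 3x3 region (sparse interactions).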

CNN: Pooling Layer

  • Pooling layers are added after convolutional layers.

  • They operate on a region of pixels.

  • No extra trainable parameters are introduced.

  • Strides are also used to condense information.

  • Types: Average pooling and max pooling.
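Both pooling types can be sketched in a few lines (the 4x4 input values are illustrative; note there are no trainable parameters):

```python
def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling over size x size regions; no trainable parameters."""
    out = []
    for r in range(0, len(x) - size + 1, stride):
        row = []
        for c in range(0, len(x[0]) - size + 1, stride):
            region = [x[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(region) if mode == "max" else sum(region) / len(region))
        out.append(row)
    return out

x = [[1, 3, 2, 0],
     [4, 2, 1, 1],
     [0, 1, 5, 6],
     [2, 3, 7, 8]]
print(pool2d(x, mode="max"))  # [[4, 2], [3, 8]]
print(pool2d(x, mode="avg"))  # [[2.5, 1.0], [1.5, 6.5]]
```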

CNN: Pooling Layer Benefits

  • Pooling layers increase the model's invariance to small input translations.

  • A convolutional layer with stride s = 2 requires roughly four times fewer computations than a stride s = 1 convolutional layer followed by a pooling layer with stride s = 2, while producing a similarly condensed output.

  • In the convolutional layer the kernel shifts in both row and column directions; the pooling layer then condenses each local region into a single value.

CNN: Multiple Channels

  • A single filter may not capture all interesting properties.

  • Multiple filters are used, each with its own set of parameters.

  • Each filter produces its own set of hidden units, forming a channel.

  • Each layer of hidden units in a CNN is organized into a tensor with dimensions (rows x columns x channels).

CNN: Full Architecture

  • Consists of multiple convolutional layers.

  • The number of rows and columns in the hidden layers decreases as it proceeds.

  • The number of channels increases to encode more high-level features.

  • Typically ends with one or more dense layers.

  • For image classification, a softmax layer is often placed at the end to output probabilities in the range [0,1], representing confidence of predictions.
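The final softmax step can be illustrated in a few lines (the logits here are arbitrary toy scores):

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities in [0, 1] that sum to 1."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # highest logit receives the highest probability
print(sum(probs))  # sums to 1 (up to rounding)
```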

Dropout

  • Dropout reduces overfitting by implicitly training many models and averaging their predictions.

  • It combines many neural networks without training them separately.

  • During dropout, some hidden units are randomly removed, creating an ensemble member.

Training with Dropout

  • Stochastic gradient descent is used to train with dropout.

  • A mini-batch of data approximates the gradient in each step.

  • At test time, weights from each unit are multiplied by the probability (r) that the unit was included during training (the weight scaling rule).
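A minimal sketch of the two phases, assuming the weight-scaling rule above (the unit activations and r = 0.5 are illustrative):

```python
import random

def dropout_train(units, r, rng):
    """Training: keep each hidden unit with probability r, zero it otherwise."""
    return [u if rng.random() < r else 0.0 for u in units]

def dropout_test(units, r):
    """Test time: keep all units but scale by r, matching the expected
    activation of the averaged ensemble (weight scaling rule)."""
    return [u * r for u in units]

rng = random.Random(0)
h = [1.0, 2.0, 3.0, 4.0]
print(dropout_train(h, r=0.5, rng=rng))  # some units randomly zeroed
print(dropout_test(h, r=0.5))            # [0.5, 1.0, 1.5, 2.0]
```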

CNN Illustration

  • Convolutional Layer: Core layer with learnable filters that extract features.

  • ReLU Layer: Converts negative pixels to zero, introducing non-linearity.

  • Pooling Layer: Reduces spatial size, computation, and parameters. Max pooling is a common approach.

CNN: Caveats

  • Successful DL representations share weights:

    • CNNs share weights across space.

    • RNNs share weights across time.

    • GNNs share weights across neighborhoods.

  • Repeated convolutions with small kernels are more efficient than a single convolution with a large kernel.

    • Two stacked 3x3 kernels cover the same receptive field as a single 5x5 kernel, with fewer parameters.

  • Deep Residual Networks (ResNets) introduce skip (shortcut) connections that bypass one or more layers.

  • Generative models: GANs etc.
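The parameter savings from stacking small kernels can be checked with simple arithmetic (the channel count C = 64 is an arbitrary illustrative choice; biases are ignored):

```python
def conv_params(kernel, channels_in, channels_out):
    """Number of weights in one conv layer (biases ignored)."""
    return kernel * kernel * channels_in * channels_out

C = 64
two_3x3 = 2 * conv_params(3, C, C)   # two stacked 3x3 convs: 5x5 receptive field
one_5x5 = conv_params(5, C, C)       # single 5x5 conv: same receptive field
print(two_3x3, one_5x5)              # 73728 102400: ~28% fewer weights
```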

Transformers

Transformer: Why?

  • Limits the length of signal paths required to learn long-range dependencies.

  • RNNs (e.g., LSTM) do not allow parallelization within training examples.

Transformer: How?

  • Self-attention: Query-Keys-Values

  • Positional encoding

  • Cross-attention

  • Addresses short-term memory problems and vanishing gradients in RNNs.

NLP Input

  • Input embeddings

  • Tokens (Tokenization)

  • Example: "The cat sat on the mat."
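Tokenization can be sketched with a toy whitespace tokenizer (real models use subword schemes such as BPE; the vocabulary here is built on the fly for illustration):

```python
sentence = "The cat sat on the mat."
# Toy tokenizer: lowercase, split off punctuation, split on whitespace
tokens = sentence.lower().replace(".", " .").split()
# Map each distinct token to an integer id, in order of first appearance
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
ids = [vocab[t] for t in tokens]
print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(ids)     # [0, 1, 2, 3, 0, 4, 5]  (both occurrences of 'the' share id 0)
```

The integer ids would then index into a learned embedding table to produce the input embeddings.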

NLP Prediction using Transformer

  • Autoregressive LLM

  • Predicts future values based on past values.

  • Example: "The cat sat on the ~"

Transformer: Self-Attention

  • Self-attention models intricate relationships between tokens, making them context-aware.

  • Enables models like GPT-3/4 and BERT to generate coherent text and understand nuanced language.

  • Based on the "Attention Is All You Need" paper, 2017.

Transformer: Positional Encoding

  • Example: French "Un chat noir boit de l'eau" → word-for-word gloss "A cat black drinks water": word order carries meaning.

  • Without positional information, the Transformer treats input tokens as a bag of words.

  • The original Transformer uses Sinusoidal Positional Encoding.
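A minimal sketch of the sinusoidal encoding, interleaving sin/cos pairs as in the original formulation (d_model = 8 is an illustrative choice):

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding:
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

# Each position gets a distinct vector, added to that token's embedding
print(positional_encoding(0, 8))  # position 0: [0.0, 1.0, 0.0, 1.0, ...]
print(positional_encoding(1, 8)[:2])
```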

Self-Attention (Cornerstone)

  • Queries--Keys--Values

  • Query (Q): current token's "question".

  • Key (K): what other tokens "offer".

  • Value (V): actual content of other tokens.

  • Example sentence: "The cat sat on the mat because it was tired."

Self-Attention cont.

  • Attention calculation:

Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V

  • Where:

    • Q = Query matrix

    • K = Key matrix

    • V = Value matrix

    • d_k = Dimensionality of the keys

    • W^Q \in R^{D \times d_k} (query projection)

    • W^K \in R^{D \times d_k} (key projection)

    • W^V \in R^{D \times d_v} (value projection)

    • W^O \in R^{h d_v \times D} (output projection)

    • D = d_{model} (embedding dimensionality)

  • softmax(\frac{QK^T}{\sqrt{d_k}}): scaled dot-product attention

  • The softmax normalizes the raw (unnormalized) scores into attention weights (\alpha_0, \alpha_1, \alpha_2, \alpha_3, \alpha_4) that sum to 1.
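The scaled dot-product formula above can be sketched in plain Python for a single head (the 3-token Q = K = V matrices with d_k = d_v = 2 are toy values, standing in for the projected embeddings):

```python
import math

def softmax(row):
    m = max(row)                          # subtract max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = len(K[0])
    scores = matmul(Q, [list(c) for c in zip(*K)])          # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row])   # scale, softmax
               for row in scores]
    return matmul(weights, V), weights

Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(Q, K, V)
print([round(w, 3) for w in weights[0]])  # each row of weights sums to 1
```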

Multi-Head (self-) Attention

  • Split Q, K, V into h smaller vectors (heads).

  • Different heads learn different patterns.

  • Benefits: Long-Range Dependencies, Parallelisation, Interpretability

Transformer: Head Control

  • Ensure it looks only backwards during the prediction (masking).

  • Typical number of Heads: 8–16.

Mask: Padding and Look-Ahead

  • During training, NO peeking into the future is allowed.

  • Padding masks handle input structure.

  • Look-ahead masks enforce temporal causality.

  • In the decoder, both masks are combined during self-attention.
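A look-ahead mask can be sketched as an additive mask applied to the attention scores before the softmax (the scores here are toy values):

```python
import math

def look_ahead_mask(n):
    """Causal mask: position i may attend only to positions <= i.
    Future entries are set to -inf, so the softmax assigns them weight 0."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

scores = [[1.0, 2.0, 3.0]] * 3          # toy attention scores for 3 positions
mask = look_ahead_mask(3)
weights = [softmax([s + m for s, m in zip(sr, mr)])
           for sr, mr in zip(scores, mask)]
print(weights[0])  # [1.0, 0.0, 0.0]: the first token sees only itself
```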

Transformer: Cross-Attention

  • Parallelism, faster training and inference times

  • Queries from the decoder, keys/values from the encoder output.

  • Used only in decoder layers.

  • Captures inter-sequence alignment.

Transformer: Pipeline

  1. Input/Output Paths: Input sequence, Target sequence.

  2. Embedding & Encoding: Tokenize and convert sequences to embeddings with positional encoding.

  3. Encoder: Processes input embeddings through N layers of self-attention and feed-forward networks, outputting context-rich representations of the input.

  4. Decoder: Uses encoder output and target embeddings; each layer includes masked self-attention and cross-attention.

  5. Final Output: Projected to vocabulary size and passed through softmax for token probabilities.

  6. Training: Compute cross-entropy loss between predictions and ground truth.

  7. Inference: Autoregressively generate tokens.

Transformer Architecture

  • Encoder consists of N layers with multi-head attention and feed-forward networks.

  • Decoder consists of N layers with masked multi-head attention, multi-head attention with encoder output, and feed-forward networks.

  • Both encoder and decoder use residual connections and layer normalization.

Generative Models

  • A generative model captures the distribution of the data itself: how likely a given example is.

  • Learns p(x), or the joint distribution p(x, y), rather than only a decision boundary.

  • Based on Bayes' Rule

  • Examples: Bayesian networks, Variational Autoencoders (VAEs), Diffusion models, Generative Adversarial Networks (GANs) etc.

  • Compared to discriminative models

Generative Adversarial Network (GAN)

  • Trains two neural networks that compete, so that ever more authentic new data is generated from a given training dataset.

  1. The generator learns to generate fake data. Generated instances become negative training examples for the discriminator.

  2. The discriminator learns to distinguish the generator's fake data from real data. The discriminator penalizes the generator for producing implausible results.

Schematic: Generator produces samples from random input; the discriminator tries to distinguish real images from generated samples. The Generator and Discriminator compete and each have corresponding losses.

Diffusion Models

  • Core: Progressively adding Gaussian noise to a dataset and then learning to reverse this process.

  • Concept related to non-equilibrium thermodynamics: a system at equilibrium has properties that do not change with time and can be moved to another state only at the expense of effects on other systems.

  • Forward process (Diffusion): Adding noise.

  • Reverse process (Denoising): Removing noise.

  • Analogy: Particles move from high to low concentration, an irreversible process.

Markov Chain in Diffusion Models

  • A Markov chain is a modelling tool used to predict a system's state in the future.

  • The noise-adding process follows a Markov chain.

  • The Markov (memoryless) structure makes the noise-adding process tractable: each step depends only on the previous state.
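One forward noising step, q(x_t | x_{t-1}) = N(sqrt(1 - beta) x_{t-1}, beta I), can be sketched in plain Python (the 3-pixel "image" and the beta value are illustrative; real schedules vary beta over the ~1000 steps):

```python
import math
import random

def forward_diffusion_step(x, beta, rng):
    """One Markov-chain noising step: shrink the signal by sqrt(1 - beta)
    and add Gaussian noise with variance beta."""
    return [math.sqrt(1 - beta) * v + math.sqrt(beta) * rng.gauss(0, 1)
            for v in x]

rng = random.Random(0)
x = [1.0, -0.5, 0.25]           # a tiny "image" of 3 pixels
for t in range(1000):           # after many steps the signal is pure noise
    x = forward_diffusion_step(x, beta=0.02, rng=rng)
print(x)  # values now look like samples from a standard normal
```

Because each step depends only on the previous state, the whole chain is tractable; the reverse (denoising) model is trained to undo one such step at a time.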

Forward & Reverse diffusions

  • Process involves ~1000 steps

  • Diffusion can be described as a discretization of a stochastic differential equation (SDE).

Schematic: An image is transformed into a noisy image step by step.

Forward & Reverse diffusions (Diffusion U-Net)

  • Reverse process achieved, for example, by a CNN.

  • Uses the classic U-Net architecture, originally developed for image segmentation.

U-Net Architecture Steps

  1. Encoder: Progressively downsamples the input; each layer includes convolutional layers, Residual Blocks, and Adaptive Group Normalisation (AdaGN) conditioned on the timestep embedding.

  2. Bottleneck: Processes features at the lowest resolution; uses self-attention to model global dependencies.

  3. Decoder: Upsamples features back to the original resolution; skip connections merge encoder features with decoder features to recover fine details.

  4. Timestep Conditioning: The timestep embedding is injected into every Residual Block via AdaGN.

  5. Output: Predicts the noise.

GANs vs. Diffusion Models

  • GANs limitation:

    • Mode collapse: The generator fails to capture the entire distribution of the training data, resulting in repetitive or limited samples.

    • Achieving a stable equilibrium between the generator and discriminator can be challenging (training instability).

  • Diffusion models limitation:

    • Tend to be computationally intensive and have longer training durations.

    • Fine-tuning to obtain the best samples is more challenging.