L22_More Generative Deep Learning

Autoencoders Revisited

Autoencoders are neural networks trained to reconstruct their input by first encoding (compressing) the data into a lower-dimensional representation called the latent space, and then decoding from this latent space back to the original dimensions. TensorBoard's Embedding Projector can be used to visualize and explore this latent space. By randomly sampling values in the latent space and decoding them, new outputs that resemble the training data can be generated, although not every sampled point corresponds to a meaningful output.
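
The encode-compress-decode idea can be sketched with a purely linear autoencoder in NumPy (a toy setup: the data, dimensions, and training settings here are illustrative assumptions, not code from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 10-D that actually lie on a 2-D subspace,
# so a 2-D latent space can reconstruct them almost perfectly.
basis = rng.normal(size=(2, 10))
X = rng.normal(size=(200, 2)) @ basis

# Linear autoencoder: encoder W_e (10 -> 2), decoder W_d (2 -> 10).
W_e = rng.normal(scale=0.1, size=(10, 2))
W_d = rng.normal(scale=0.1, size=(2, 10))

lr = 0.02
for _ in range(3000):
    Z = X @ W_e                       # encode into the latent space
    X_hat = Z @ W_d                   # decode back to input dimensions
    err = X_hat - X                   # reconstruction error
    # Gradient descent on the squared reconstruction error
    grad_d = Z.T @ err / len(X)
    grad_e = X.T @ (err @ W_d.T) / len(X)
    W_d -= lr * grad_d
    W_e -= lr * grad_e

mse = np.mean((X - (X @ W_e) @ W_d) ** 2)  # should be small after training
```

Sampling points in the 2-D latent space and pushing them through `W_d` is exactly the generative use of the decoder described above.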

Vectors in Latent Space

In language models, specific directions in the latent space often correspond to semantic concepts or word meanings. This concept extends to images as well. For example, when generating faces, vector arithmetic in the latent space can modify features like smiling or the degree of mouth openness, as demonstrated in arXiv:1609.04468. The latent space can be manipulated to control the attributes of the generated outputs.
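
The face-editing idea can be illustrated with made-up latent codes (the 64-D codes and the "smiling"/"neutral" labels below are stand-ins, not real model outputs):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical latent codes for images labelled "smiling" vs "neutral".
z_smiling = rng.normal(loc=0.5, size=(100, 64))
z_neutral = rng.normal(loc=0.0, size=(100, 64))

# An attribute direction: the difference of the class means in latent space.
smile_vec = z_smiling.mean(axis=0) - z_neutral.mean(axis=0)

# Adding the vector edits the attribute; alpha controls the strength.
alpha = 1.0
z_edited = z_neutral[0] + alpha * smile_vec
```

Decoding `z_edited` would then produce the same face with the attribute (here, smiling) strengthened.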

Latent Space Sampling

The concept of latent space sampling is general and can be applied to various model types. Decoders are used to map random points in the latent space back to realistic data. A common sampling strategy is to draw samples from a normal distribution, as detailed in "Deep Learning with Python" by F. Chollet. This method leverages the properties of the normal distribution to ensure a smooth and continuous latent space.

Improving the Autoencoder

Standard autoencoder latent spaces are often not ideal for generating realistic output due to irregularities and discontinuities. To improve the generative capabilities of autoencoders, constraints and requirements can be imposed on the structure of the latent space. Instead of learning a fixed encoding, the model tries to learn a probability distribution, which leads to the variational autoencoder (VAE).

Learning Distributions

In machine learning, models often predict a single best value, which implicitly assumes a deterministic functional relationship between the input features and the target. By instead modeling the distribution from which the data originated, we can relax this assumption. For example, a mixture density network predicts the parameters of a mixture of normal distributions, from which new, realistic data points can be sampled. This approach allows for more flexible and expressive modeling.
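
The sampling side of a mixture density network can be sketched as follows; the mixture parameters below stand in for what the network would predict for one input:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network output for one input: mixture weights,
# component means, and component standard deviations.
pi = np.array([0.2, 0.5, 0.3])       # must sum to 1
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.3])

def sample_mixture(pi, mu, sigma, n, rng):
    """Draw n samples from the predicted mixture of normals."""
    comp = rng.choice(len(pi), size=n, p=pi)    # pick a component...
    return rng.normal(mu[comp], sigma[comp])    # ...then draw from it

samples = sample_mixture(pi, mu, sigma, 5000, rng)
```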

Variational Autoencoder (VAE)

The variational autoencoder (VAE) explicitly models distributions in the latent space. The process involves:

  1. Encoding input data into parameters of a probability distribution (e.g., mean and variance for a normal distribution).

  2. Sampling a point from this distribution.

  3. Decoding this point back to the input format.

Since encoder and decoder layers are nonlinear transformations, complex input distributions can be transformed into a normal (Gaussian) looking latent space and then back to the original data space. To enforce this latent space distribution, the loss function is modified to combine reconstruction error (e.g., MSE) and the difference between the latent distribution and a normal distribution, which is measured using KL divergence. This ensures that the latent space is well-structured and continuous.
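
The three steps and the combined loss can be written out in a few lines of NumPy (batch size, dimensions, and the dummy reconstruction are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (hypothetical encoder output): mean and log-variance per input.
mu = rng.normal(size=(4, 2))               # batch of 4, latent dim 2
log_var = rng.normal(size=(4, 2))

# Step 2: sample via the reparameterization trick so gradients can flow.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# KL divergence between N(mu, sigma^2) and N(0, I) in closed form,
# summed over latent dimensions, averaged over the batch.
kl = -0.5 * np.mean(np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))

# Total VAE loss = reconstruction error (MSE on dummy tensors here) + KL.
x = rng.normal(size=(4, 8))
x_hat = rng.normal(size=(4, 8))            # stand-in for decoder(z)
loss = np.mean((x - x_hat) ** 2) + kl
```

The KL term is what pulls the per-sample distributions toward the standard normal and keeps the latent space well-structured.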

The latent space becomes continuous, allowing sampling along different latent space dimensions to combine or remove concepts. An example is generating MNIST digits with only 2 latent dimensions, where each dimension controls different aspects of the digit's appearance.

Generative Adversarial Networks (GANs)

GANs consist of two models that are trained in tandem:

  • A generator that decodes random values into an image.

  • A discriminator that takes in images and predicts whether they are real or generated by the generator.

The models are trained adversarially so that the generator creates increasingly realistic images that can fool the discriminator. Vector arithmetic can be performed in the latent space (arXiv:1511.06434) to manipulate image features.

The training procedure is complex and sensitive to hyperparameter configurations. GANs are well-studied in research but are often challenging to use in practice due to their instability and sensitivity to the training process.
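
The adversarial loop can be sketched on a deliberately tiny 1-D problem: a linear generator versus a logistic-regression discriminator. All settings here are illustrative, not the convolutional setup from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Real data ~ N(2, 0.5); generator g(z) = w*z + b with noise z ~ N(0, 1);
# discriminator d(x) = sigmoid(a*x + c).
w, b = 1.0, 0.0      # generator parameters
a, c = 0.1, 0.0      # discriminator parameters
lr, n = 0.05, 64

for _ in range(2000):
    real = rng.normal(2.0, 0.5, n)
    z = rng.normal(size=n)
    fake = w * z + b

    # Discriminator ascent step: maximize log d(real) + log(1 - d(fake)).
    d_real, d_fake = sigmoid(a * real + c), sigmoid(a * fake + c)
    a += lr * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator ascent step: maximize log d(fake) (non-saturating loss).
    d_fake = sigmoid(a * (w * z + b) + c)
    w += lr * np.mean((1 - d_fake) * a * z)
    b += lr * np.mean((1 - d_fake) * a)
```

Even in this toy setting the two updates pull against each other: the generator's offset `b` should drift toward the real mean while the discriminator keeps moving the decision boundary, which is the tug-of-war that makes full-scale GAN training so sensitive.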

Diffusion Models

The core idea behind diffusion models is to add random noise to the original image in a step-wise manner. For each step, a model is trained to remove the noise; a model trained this way is known as a denoising diffusion probabilistic model (DDPM). A diffusion model can be conceptualized as a variational autoencoder where the encoder process consists of adding noise.

To generate new images:

  1. Start with an image of random noise.

  2. Run stepwise denoising.

  3. Converge at a realistic image.

Forward Encoder

The noise addition process is the central component of the diffusion model. The forward process transforms an initial, clean image x_0 into a corrupted image x_1. Here, β_1 represents the variance of the noise distribution at step 1, and ε_1 is the noise itself. The transformation is given by:

x_1 = \sqrt{1 - β_1} x_0 + \sqrt{β_1} ε_1

where β_1 < 1.
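
One forward step in NumPy (the toy 28×28 "image" and the value of β_1 are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

x0 = rng.standard_normal((28, 28))       # toy "image" with roughly unit variance
beta_1 = 0.01                            # noise variance at step 1
eps_1 = rng.standard_normal(x0.shape)    # the noise itself

# Scale the image down slightly, then mix in scaled noise.
x1 = np.sqrt(1 - beta_1) * x0 + np.sqrt(beta_1) * eps_1
```

Note that the \sqrt{1 - β} and \sqrt{β} factors keep the overall variance of x_1 at 1 when x_0 has unit variance (a variance-preserving step).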

The book represents this transformation as a probability distribution (eq. 17-5):

q(x_1 | x_0) = N(\sqrt{1 - β_1} x_0, β_1 I)

where N is the normal distribution and I is the identity matrix.

Diffusion Steps

The diffusion step is run T times, with T ≈ 10^3. The distribution for an arbitrary step t is then:

q(x_t | x_{t-1}) = N(\sqrt{1 - β_t} x_{t-1}, β_t I)

Since a linear combination of independent normal random variables is again normal, the Gaussian steps compose, and we can express the entire forward process from the start to step t as (eq. 17-6):

q(x_t | x_0) = N(\sqrt{\bar{α}_t} x_0, (1 - \bar{α}_t) I)

The factor \sqrt{\bar{α}_t} represents the remaining amount of the original image, with

\bar{α}_t = (1 - β_1) · (1 - β_2) · … · (1 - β_t)
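
Because \bar{α}_t is just a running product, eq. 17-6 lets us jump from x_0 to any step t without looping. A sketch (linear β schedule and toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # a simple linear schedule
alpha_bar = np.cumprod(1.0 - betas)      # remaining signal at each step

# Sample x_t directly from x_0 using q(x_t | x_0).
t = 500
x0 = rng.standard_normal((28, 28))
eps = rng.standard_normal(x0.shape)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
```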

Noise (Variance) Scheduling

β_t should start near 0 (almost no noise added) and gradually increase toward 1 (the signal is completely replaced by noise).

  • Naïve approach: Change \, β_t in fixed steps (linear schedule).

  • Better approach: Start adding noise more slowly, which is often achieved using a cosine schedule.
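
Both schedules can be derived from \bar{α}_t; the cosine form below follows the shape proposed by Nichol and Dhariwal (the offset s = 0.008 and the clipping are their conventions, assumed here):

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cosine schedule: alpha_bar decays slowly at first, then faster."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def betas_from_alpha_bar(alpha_bar):
    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped for stability.
    return np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)

linear_betas = np.linspace(1e-4, 0.02, 1000)      # naive fixed steps
cosine_betas = betas_from_alpha_bar(cosine_alpha_bar(1000))
```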

Reverse Decoder

While the distributions of the forward encoder q(x_t | x_{t-1}) are straightforward, computing the reverse distribution q(x_{t-1} | x_t) analytically is nearly impossible. However, we can approximate it using a deep learning model:

Build a model that learns the approximate data distribution p(x_{t-1} | x_t, w), where w represents the parameters of the model.

The original DDPM papers used a U-Net type architecture. The choice of model is independent of the diffusion process itself, provided that:

  • Output dimensions match input dimensions.

  • The model can encode the time step information.
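
A common way to meet the second requirement is a sinusoidal embedding of the step index, which is then added or concatenated to features inside the network (the dimension choices below are illustrative):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of diffusion step t (Transformer-style)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(250, 128)
```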

Generating New Samples

With the trained decoder model in place, new (unseen) images can be generated:

  • Step 0: Draw random pixel values from a normal distribution.

  • Steps 1 to T: Sequentially run the decoder to remove noise, one step at a time.
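
The generation loop can be sketched as follows; `predict_noise` is only a stand-in for the trained denoising network (a real U-Net), so the output here is not a realistic image, only the control flow is:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x, t):
    """Stand-in for the trained model eps(x_t, t); a real U-Net goes here."""
    return np.zeros_like(x)

# Step 0: draw random pixel values from a normal distribution.
x = rng.standard_normal((28, 28))

# Steps T..1: one DDPM denoising update per step.
for t in reversed(range(T)):
    eps_hat = predict_noise(x, t)
    mean = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * noise
```

The T sequential passes through the network are what makes sampling slow compared to a single decoder call in a VAE or GAN.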

Properties of diffusion models:

  • Easy to train compared to GANs.

  • Generate results of superb quality.

  • Relatively slow at generating new samples.

Modern Diffusion Models

The most advanced image generation models are often variants of latent diffusion:

Run the diffusion process in latent space rather than pixel space. To construct the latent space, a variational autoencoder is typically used.

Conditional Generation

Generating data from entirely random noise results in unconditional sampling. While this produces realistic representations of the training data, it lacks control over the output.

Guided diffusion methods condition the generation process on additional input data such as class labels or text prompts.

This conditioning relies on:

  • Encoding the input prompt through, for example, a language model, and concatenating it with the input to the denoising network.

  • Using cross-attention layers that allow the denoising network to attend to relevant prompt tokens.
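
The second mechanism can be shown in a few lines of NumPy: queries come from the image (or latent) features, keys and values from the encoded prompt tokens (all shapes and weights below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 32
img_feats = rng.normal(size=(64, d))     # 64 spatial positions of the image
prompt_toks = rng.normal(size=(10, d))   # 10 encoded prompt tokens

# Separate projections: queries from the image, keys/values from the prompt.
Wq, Wk, Wv = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))
Q, K, V = img_feats @ Wq, prompt_toks @ Wk, prompt_toks @ Wv

attn = softmax(Q @ K.T / np.sqrt(d))     # (64, 10): each position attends to the prompt
out = attn @ V                           # prompt information injected per position
```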