L21: Generative Deep Learning

Introduction

This lecture focuses on generative deep learning methods, specifically how to convert sampled values into realistic data. The methods discussed will include Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), with a brief mention of Diffusion Models. Generative models aim to learn the underlying probability distribution of a dataset, enabling the generation of new samples that resemble the original data. These models have a wide range of applications, including image synthesis, text generation, and anomaly detection.

Text Generation

Previously, we explored generative deep learning for text generation, where the key was to sample instead of deterministically picking the top prediction. This involves using models like Recurrent Neural Networks (RNNs) or Transformers to generate sequences of words. By sampling from the predicted probability distribution over the vocabulary, the model can produce diverse and creative text.
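The sampling idea above can be sketched in a few lines. This is a minimal, hypothetical example: the logits and vocabulary size are made up, and a real model (RNN or Transformer) would produce the logits. A temperature parameter controls how adventurous the sampling is.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token index from model logits instead of taking the argmax."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)      # draw from the distribution

# Hypothetical logits over a 4-word vocabulary.
logits = [2.0, 1.0, 0.5, 0.1]
idx = sample_next_token(logits, temperature=0.8)
```

Lower temperatures concentrate probability on the top predictions; higher temperatures make the output more diverse (and more error-prone).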

Autoencoders Revisited

Autoencoders are models trained to reconstruct their input by encoding (compressing) the data into a lower-dimensional representation known as the latent space. This latent space can be explored with tools such as the TensorBoard Embedding Projector. The process involves:

  1. Encoding: The encoder network compresses the input into a lower-dimensional latent representation that captures its most important features in compact form.

  2. Decoding: The decoder network reconstructs the original input from its latent representation.
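The two steps above can be sketched with plain matrix maps. This is purely illustrative: the weights are random and untrained, and real encoders/decoders are non-linear, multi-layer networks; only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 784, 2          # e.g. flattened 28x28 images -> 2-D latent space

# Untrained, illustrative weights; in practice both maps are learned networks.
W_enc = rng.normal(0, 0.01, (input_dim, latent_dim))
W_dec = rng.normal(0, 0.01, (latent_dim, input_dim))

def encode(x):
    return x @ W_enc                    # 1. compress to the latent space

def decode(z):
    return z @ W_dec                    # 2. reconstruct the input format

x = rng.random((5, input_dim))          # a batch of 5 fake inputs
z = encode(x)                           # shape (5, 2)
x_hat = decode(z)                       # shape (5, 784)
```

Training would then minimize a reconstruction error between `x` and `x_hat`.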

An embedding projector can be used to visualize the latent space. Random values can be sampled from the latent space to generate new outputs, although not all sampled values are meaningful: the quality of the generated samples depends on the structure and properties of the latent space.

Latent Space Vectors

In language models, directions in the latent space are related to concepts or word meanings. This can also be observed in images. For example, in generated faces:

  • Moving in one direction might increase the smile intensity.

  • Moving in another direction might control whether the mouth is open or closed.

Latent Space Sampling

Latent space sampling can be generalized to different model types. In this lecture, we focus on decoders that transform random points in the latent space into realistic data. A common sampling strategy is to draw from a normal distribution. This involves generating random samples from a standard normal distribution and using them as input to the decoder network.
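This strategy is easy to sketch: draw points from a standard normal and push them through the decoder. The decoder below is a hypothetical stand-in (random weights plus a sigmoid to get pixel-like values); any trained decoder would take its place.

```python
import numpy as np

rng = np.random.default_rng(42)
latent_dim = 2

# Stand-in decoder: a trained decoder network would go here.
W_dec = rng.normal(0, 0.1, (latent_dim, 784))
def decode(z):
    return 1 / (1 + np.exp(-(z @ W_dec)))     # sigmoid -> pixel intensities in [0, 1]

# Draw random latent points from a standard normal distribution...
z = rng.standard_normal((16, latent_dim))
# ...and decode them into (here: fake) 28x28 images, flattened to 784 values.
images = decode(z)
```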

Improving Autoencoders

In practice, standard autoencoder latent spaces are not ideal for generating realistic output. To improve this, we can impose requirements on the structure of the latent space. This can be achieved by using techniques like regularization or by modifying the architecture of the autoencoder.

  • Instead of learning a fixed encoding, we aim to learn a distribution. This allows the model to capture the uncertainty and variability in the data.

This leads us to the concept of Variational Autoencoders (VAEs).

Learning Distributions

In machine learning, we often predict a single best value, which requires a functional relationship (one-to-one or many-to-one) between input features and the target.

  • However, we can instead model the distribution from which the data came, estimating the parameters of a probability distribution that best describes it.

Example: Mixture density networks predict the parameters (mixture weights, means, and variances) of a mixture of normal distributions, from which realistic new data points can be sampled. Combining several normal components lets the model capture complex, multimodal distributions.
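Given predicted mixture parameters, sampling works in two stages: pick a component according to the mixture weights, then draw from that component's normal distribution. The parameters below are hypothetical values such a network might output for one input.

```python
import numpy as np

def sample_from_mixture(weights, means, stds, n=1, rng=None):
    """Draw n samples from a 1-D Gaussian mixture with the given parameters."""
    rng = rng or np.random.default_rng()
    comps = rng.choice(len(weights), size=n, p=weights)   # pick a component per sample
    return rng.normal(np.asarray(means)[comps], np.asarray(stds)[comps])

# Hypothetical parameters a mixture density network might predict:
# a bimodal target distribution with two normal components.
weights = [0.7, 0.3]
means   = [-1.0, 2.0]
stds    = [0.2, 0.5]
samples = sample_from_mixture(weights, means, stds, n=1000,
                              rng=np.random.default_rng(0))
```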

Variational Autoencoders (VAEs)

The core idea of VAEs is to model distributions in the latent space:

  1. Encode input data into parameters (mean and variance) of a probability distribution (typically a normal distribution). The encoder network outputs the mean and variance of the latent distribution, which are used to sample latent vectors.

  2. Sample a point from this distribution. This involves using the reparameterization trick to sample from the latent distribution in a way that allows for backpropagation.

  3. Decode this point back to the input format. The decoder network takes the sampled latent vector and reconstructs the original input data.
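Step 2, the reparameterization trick, can be sketched directly. The encoder outputs below are fake placeholders; the point is that the randomness is isolated in `eps`, so gradients can flow through `mu` and `log_var` during backpropagation.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, latent_dim = 4, 2

# Pretend the encoder produced these for a batch of inputs
# (it is common to predict log-variance rather than variance for stability).
mu      = rng.normal(size=(batch, latent_dim))
log_var = rng.normal(size=(batch, latent_dim))

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
# Sampling is moved into eps, so mu and log_var stay differentiable.
eps = rng.standard_normal((batch, latent_dim))
z = mu + np.exp(0.5 * log_var) * eps
```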

VAE Details

Since the encoder and decoder layers are non-linear transformations, they can map complicated input distributions into an approximately normal (Gaussian) latent space and back. To enforce this latent distribution, we modify the loss function to combine:

  • Reconstruction error (e.g., Mean Squared Error, MSE). This measures how well the decoder can reconstruct the original input data from the latent representation.

  • The difference between the latent distribution and a normal distribution, measured by Kullback-Leibler (KL) divergence. This encourages the latent distribution to be close to a standard normal distribution, which improves the smoothness and interpretability of the latent space.
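The two terms combine into one loss. For a diagonal Gaussian encoder, the KL divergence to a standard normal has a closed form, used here; the `beta` weight between the terms is an assumption (often tuned in practice).

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction error (MSE) plus KL divergence to a standard normal."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))            # reconstruction term
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian.
    kl = np.mean(-0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))
    return recon + beta * kl

# Sanity check: perfect reconstruction and a standard-normal latent give zero loss.
x = np.zeros((3, 5)); x_hat = np.zeros((3, 5))
mu = np.zeros((3, 2)); log_var = np.zeros((3, 2))
loss = vae_loss(x, x_hat, mu, log_var)
```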

Latent Space Continuity

The latent space in VAEs becomes continuous. We can sample along different latent space dimensions to combine or remove concepts. An example is shown using MNIST digits with 2 latent dimensions. By interpolating between different points in the latent space, we can generate smooth transitions between different concepts or styles.
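Such interpolation is just a weighted blend of two latent codes. The two codes below are hypothetical 2-D points, e.g. encodings of two different MNIST digits; decoding each row of the path would produce a smooth morph between them.

```python
import numpy as np

def interpolate(z_a, z_b, steps=8):
    """Linearly interpolate between two latent points (rows = intermediate codes)."""
    ts = np.linspace(0.0, 1.0, steps)[:, None]
    return (1 - ts) * z_a + ts * z_b

# Two hypothetical latent codes in a 2-D VAE latent space.
z_a = np.array([-1.5, 0.3])
z_b = np.array([ 0.8, -1.1])
path = interpolate(z_a, z_b, steps=8)   # decode each row for a smooth transition
```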

Generative Adversarial Networks (GANs)

GANs were considered very innovative around 2014 but have been somewhat superseded by diffusion models. They gained popularity through projects like "thispersondoesnotexist.com," which is based on StyleGAN. GANs consist of two neural networks, a generator and a discriminator, that are trained in an adversarial manner. The generator tries to generate realistic samples, while the discriminator tries to distinguish between real and generated samples.

GAN Structure

GANs involve two models working in tandem:

  • A generator that decodes random values (noise) from the latent space into a realistic-looking image.

  • A discriminator that takes an image as input and outputs the probability that it is real rather than produced by the generator.

The training process involves pitting these two networks against each other. The generator tries to fool the discriminator, while the discriminator tries to correctly classify real and generated images. This adversarial process leads to the generator producing increasingly realistic images.
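One step of this adversarial game can be sketched via the losses alone. The discriminator scores below are random stand-ins for real network outputs; the loss used is the standard binary cross-entropy formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def bce(probs, labels):
    """Binary cross-entropy, the usual GAN loss for both networks."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)   # avoid log(0)
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

# Stand-ins for discriminator outputs on a batch of 8 images.
d_real = rng.uniform(0.5, 1.0, 8)   # D's scores on real images
d_fake = rng.uniform(0.0, 0.5, 8)   # D's scores on generated images

# Discriminator objective: push real scores toward 1 and fake scores toward 0.
d_loss = bce(d_real, np.ones(8)) + bce(d_fake, np.zeros(8))
# Generator objective: fool D, i.e. push D's scores on fakes toward 1.
g_loss = bce(d_fake, np.ones(8))
```

In training, the two losses are minimized alternately, each updating only its own network's weights.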

GAN Training

The training procedure for GANs is complex. As with VAEs, vector arithmetic can still be performed in the latent space. GANs are well studied in research but seldom used in practice: they are very sensitive to the configuration of the training procedure and generally difficult to work with. Common failure modes include mode collapse, where the generator produces only a limited variety of samples, and training instability, where the generator and discriminator get stuck in a loop and fail to converge.
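The latent-space vector arithmetic can be illustrated with the classic "smiling man" demo. The three codes below are hypothetical averages of latent vectors for images sharing an attribute; decoding the result of the arithmetic would, ideally, yield a smiling man.

```python
import numpy as np

# Hypothetical average latent codes, each obtained from a set of
# images sharing an attribute (a classic GAN latent-space demo).
smiling_woman = np.array([ 0.9,  0.4, -0.2])
neutral_woman = np.array([ 0.1,  0.4, -0.2])
neutral_man   = np.array([ 0.1, -0.6,  0.5])

# Transfer the "smile" direction onto the man's code;
# decoding the result should produce a smiling man.
smiling_man = smiling_woman - neutral_woman + neutral_man
```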