Notes on Generative Adversarial Nets (GANs)
Adversarial Framework: Generative Adversarial Nets
- Two models are pitted against each other in a minimax game: a generative model G that learns the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than from G.
- Training objective for G: maximize the probability that D makes a mistake on samples produced by G.
- The framework corresponds to a minimax two-player game with a unique solution in the space of arbitrary functions G and D: G recovers the training data distribution and D equals 1/2 everywhere.
- When G and D are both multilayer perceptrons (MLPs), the entire system can be trained with backpropagation; no Markov chains or unrolled inference networks are required during training or generation.
- Experimental results demonstrate qualitative and quantitative evaluation of generated samples, showing potential of the framework.
The Adversarial Training Game
- Special case considered: generative model generates samples by passing random noise through a multilayer perceptron; discriminative model is also a multilayer perceptron.
- In this case, both G and D can be trained using backpropagation and dropout; sampling from G uses forward propagation only; no approximate inference or Markov chains are necessary.
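To make "sampling from G uses forward propagation only" concrete, here is a minimal NumPy sketch of an MLP generator. The layer sizes, random weights, and 784-dimensional output are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny MLP generator: noise z -> ReLU hidden layer -> sigmoid output.
W1 = rng.normal(scale=0.1, size=(10, 32))   # noise dim 10 -> hidden dim 32
b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 784))  # hidden -> 28x28 "image" vector
b2 = np.zeros(784)

def G(z):
    """Sampling is a single forward pass: no Markov chain, no inference."""
    h = np.maximum(z @ W1 + b1, 0.0)              # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # sigmoid output in (0, 1)

z = rng.normal(size=(5, 10))  # minibatch of noise from the prior p_z
x = G(z)
print(x.shape)  # (5, 784)
```

Training would backpropagate through exactly this forward pass; generation needs nothing more.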
- Notation:
- Let p_data(x) be the data distribution.
- Let p_z(z) be the prior over input noise variables.
- G(z; θ_g) maps noise z to data space; G is differentiable with parameters θ_g.
- D(x; θ_d) outputs a single scalar; D(x) is the probability that x came from the data rather than from G.
- Training objective (minimax game):
- Discriminator maximizes the probability of correct labeling for both real and generated samples.
- Generator minimizes log(1 - D(G(z))).
- Value function: V(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
- Game formulation: \min_G \max_D V(D,G)
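In practice the two expectations are estimated by Monte Carlo from minibatches, which is what Algorithm 1 does. Below is a hedged NumPy sketch: a toy 1-D data distribution, a hand-fixed logistic discriminator, and a shift generator are all illustrative choices, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D setup: real data ~ N(2, 1); generator shifts noise by an offset theta.
theta = 0.0
def G(z):
    return z + theta

# A hand-fixed logistic discriminator D(x) = sigmoid(w*x + b); in a real GAN
# w and b would be trained to maximize V.
w, b = 1.5, -1.5
def D(x):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

# Monte Carlo estimate of V(D,G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
m = 100_000
x = rng.normal(loc=2.0, scale=1.0, size=m)  # x ~ p_data
z = rng.normal(size=m)                      # z ~ p_z
V = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
print(V)
```

Since both log-terms are logs of probabilities, the estimate is always negative; its maximum possible value, -log 4, is reached only when p_g matches p_data and D is forced to 1/2.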
- Training procedure (Algorithm 1):
- Alternate between updating D and G; typically k steps updating D for every step updating G.
- Rationale: maintaining D near its optimum for the current G to provide useful gradients to G; if G changes too quickly, D cannot adapt, destabilizing training.
- Intuition (Figure 1):
- As training progresses, p_g (the generator's distribution) moves closer to p_data; D becomes a poorer detector; at convergence, when p_g = p_data, D(x) = 1/2 for all x.
Theoretical Results
- The generator implicitly defines a distribution p_g as the distribution of G(z) when z ~ p_z.
- Theory in a non-parametric setting (i.e., over arbitrary functions G and D):
- Prove that the minimax game has a global optimum at p_g = p_data (Theorem 1).
- Show that Algorithm 1 optimizes the objective and converges to the desired p_g under certain conditions.
- Key results:
- Optimal discriminator for a fixed generator: for all x,
D^{*}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}.
- When D is optimal, the value of the game becomes
C(G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D^{*}(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D^{*}(G(z)))] = -\log(4) + 2\,\mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_g).
- The global minimum of C(G) is achieved if and only if p_g = p_{\mathrm{data}}; at that point, C(G) = -\log 4.
- The Jensen-Shannon divergence (JSD) is defined as
\mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_g) = \tfrac{1}{2}\,\mathrm{KL}(p_{\mathrm{data}} \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(p_g \,\|\, M), \quad M = \tfrac{1}{2}(p_{\mathrm{data}} + p_g).
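These identities are easy to verify numerically for discrete distributions. The sketch below (with arbitrary illustrative probabilities) checks that plugging the optimal discriminator into the value function yields -log 4 + 2·JSD(p_data || p_g):

```python
import numpy as np

# Discrete toy distributions over 4 outcomes (illustrative values).
p_data = np.array([0.1, 0.2, 0.3, 0.4])
p_g    = np.array([0.25, 0.25, 0.25, 0.25])

# Optimal discriminator for a fixed generator: D*(x) = p_data / (p_data + p_g)
d_star = p_data / (p_data + p_g)

# Value of the game at the optimal D:
# C(G) = E_{p_data}[log D*(x)] + E_{p_g}[log(1 - D*(x))]
C = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# Compare against the closed form C(G) = -log 4 + 2 * JSD(p_data || p_g)
def kl(p, q):
    return np.sum(p * np.log(p / q))

M = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, M) + 0.5 * kl(p_g, M)
print(C, -np.log(4.0) + 2.0 * jsd)  # the two values agree
```

Since JSD is non-negative and zero only when the two distributions coincide, C(G) is bounded below by -log 4, attained exactly at p_g = p_data.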
- Convergence of Algorithm 1 (Proposition 2):
- If G and D have enough capacity, the discriminator is allowed to reach its optimum given G, and p_g is updated so as to improve
\mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D^{*}(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D^{*}(G(z)))],
then p_g converges to p_{\mathrm{data}}.
- Important caveat: in practice, G is represented with finite capacity (e.g., an MLP), so the non-parametric convergence guarantees do not strictly apply, but empirical results are strong.
Algorithm 1: Minibatch SGD training of Generative Adversarial Nets
- For each training iteration:
- For k steps do
- Sample a minibatch of m noise samples {z^{(i)}} from p_z(z).
- Sample a minibatch of m real data examples {x^{(i)}} from p_data(x).
- Update the discriminator by ascending its stochastic gradient:
\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^m \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right].
- Sample a minibatch of m noise samples {z^{(i)}} from p_z(z).
- Update the generator by descending its stochastic gradient:
\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^m \log(1 - D(G(z^{(i)}))).
- Practical note: the updates can use any standard gradient-based learning rule; momentum was used in the experiments.
- Practical alternative for G’s objective (to improve gradient signal early in training):
- Instead of minimizing log(1 - D(G(z))), maximize log D(G(z)).
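The whole procedure can be exercised end to end on a toy problem. The sketch below is a minimal, illustrative instance of Algorithm 1 (not the paper's setup): real data are N(4, 1), the generator is a learned shift G(z) = z + theta of unit Gaussian noise, the discriminator is logistic regression, and G uses the practical alternative objective (maximize log D(G(z))) so its gradient does not vanish early.

```python
import numpy as np

rng = np.random.default_rng(2)

# theta shifts the noise; (w, b) parameterize D(x) = sigmoid(w*x + b).
theta, w, b = 0.0, 0.0, 0.0
lr, m, k = 0.02, 128, 1

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for _ in range(3000):
    for _ in range(k):  # k discriminator steps per generator step
        xr = rng.normal(4.0, 1.0, size=m)    # minibatch from p_data
        xf = rng.normal(size=m) + theta      # minibatch from G
        dr, df = sigmoid(w * xr + b), sigmoid(w * xf + b)
        # Ascend the gradient of  mean log D(x) + mean log(1 - D(G(z)))
        w += lr * (np.mean((1 - dr) * xr) - np.mean(df * xf))
        b += lr * (np.mean(1 - dr) - np.mean(df))
    xf = rng.normal(size=m) + theta          # fresh noise for G's update
    df = sigmoid(w * xf + b)
    # Ascend the gradient of  mean log D(G(z))  with respect to theta
    theta += lr * np.mean((1 - df) * w)
print(theta)  # drifts toward 4, where p_g matches p_data
```

The alternating updates mirror Algorithm 1 exactly; only the closed-form gradients of the logistic model replace backpropagation through an MLP.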
Practical Implementation Details
- Model design:
- Generator: mixture of rectified linear units (ReLU) and sigmoid activations.
- Discriminator: uses Maxout activations.
- Dropout applied in training the discriminator.
- Training data and evaluation:
- Datasets: MNIST, Toronto Face Database (TFD), CIFAR-10.
- Parzen window density estimation to evaluate test-set likelihood of samples from G:
- Fit a Gaussian Parzen window to generated samples and compute log-likelihood on the test set.
- Parzen σ parameter selection via cross-validation on the validation set.
- Note: This likelihood estimate has high variance in high dimensions, but is a practical evaluation when exact likelihood is intractable.
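A hedged sketch of this evaluation protocol, with stand-in Gaussian data in place of actual model samples, illustrating both the Parzen estimate and the σ grid search on a validation split:

```python
import numpy as np

rng = np.random.default_rng(3)

def parzen_log_likelihood(samples, xs, sigma):
    """Mean log-likelihood of points `xs` under an isotropic Gaussian
    Parzen window fit to `samples` with bandwidth `sigma`."""
    n, d = samples.shape
    sq = np.sum((xs[:, None, :] - samples[None, :, :]) ** 2, axis=-1)
    a = -sq / (2.0 * sigma ** 2)                 # log kernel values, (n_xs, n)
    amax = a.max(axis=1, keepdims=True)          # for a stable log-sum-exp
    log_mean = amax[:, 0] + np.log(np.mean(np.exp(a - amax), axis=1))
    log_norm = 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return float(np.mean(log_mean - log_norm))

# Illustrative stand-ins: "generated" samples and held-out data, all from N(0, I).
gen = rng.normal(size=(500, 2))
valid, test = rng.normal(size=(200, 2)), rng.normal(size=(200, 2))

# Choose sigma on the validation set, then report test log-likelihood.
grid = [0.05, 0.1, 0.2, 0.5, 1.0]
sigma = max(grid, key=lambda s: parzen_log_likelihood(gen, valid, s))
print(sigma, parzen_log_likelihood(gen, test, sigma))
```

With real image data the samples would come from G and the validation/test points from the dataset splits; the high variance of this estimator in high dimensions is why the paper treats it as a rough proxy rather than a true likelihood.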
- Visual results:
- Figures show samples drawn from the generator after training.
- Samples are competitive with existing generative models and demonstrate the potential of the adversarial framework.
- Observations:
- The framework can generate sharp, even degenerate distributions, unlike some Markov-chain-based methods that require blurriness to mix between modes.
Advantages and Disadvantages
- Disadvantages:
- There is no explicit representation of p_g(x).
- The discriminator must be kept synchronized with the generator during training (e.g., avoid the "Helvetica scenario," in which G collapses many z values to the same x because D is not updated frequently enough).
- Training dynamics can be delicate without proper balancing between G and D.
- Advantages:
- No Markov chains or approximate inference needed; backpropagation suffices for learning gradients.
- No explicit inference required during learning; flexible to incorporate a wide variety of differentiable functions.
- Can represent very sharp distributions and potentially handle multi-modality better than some chain-based methods.
- Comparative perspective (Table 2):
- Deep directed graphical models require inference during training; generative autoencoders involve tradeoffs between mixing and reconstruction.
- Adversarial models avoid MCMC and explicit density representation; rely on a discriminative network to drive the generator.
- Synchronization between D and G is a distinctive challenge in GANs (the Helvetica scenario).
Extensions and Future Work
- The paper outlines several straightforward extensions of the adversarial framework:
1) Conditional generative model: p(x | c) by adding c as input to both G and D.
2) Learned approximate inference: train an auxiliary network to predict z given x (an inference network), similar to the wake-sleep algorithm but trained after the generator is finished training.
3) Conditional modeling for subsets: approximately model p(x_S | x_{\setminus S}) for any subset S of the indices of x by training a family of conditional models that share parameters.
4) Semi-supervised learning: use features from the discriminator or inference network to improve classifiers when labeled data is limited.
5) Efficiency improvements: better coordination between G and D, or smarter distributions for sampling z, to accelerate training.
- Overall claim: GANs open up many research directions and demonstrate the viability of the adversarial modeling framework.
Related Work and Context
- GANs sit among a family of approaches that use discriminative criteria to train generative models, including noise-contrastive estimation (NCE) and contrastive divergence, alongside deep Boltzmann machines and variational autoencoders (VAEs).
- Key distinctions:
- VAEs pair a differentiable generator with a recognition model for approximate inference; GANs pair a generator with a discriminator but do not require explicit inference or a tractable likelihood.
- NCE uses a fixed noise distribution as the negative class; GANs treat the discriminator as a dynamic model that learns to distinguish data from generated samples.
- Adversarial nets differ from adversarial examples in purpose: adversarial nets aim to train a generative model, whereas adversarial examples are inputs designed to fool discriminative models.
Practical Takeaways for Exam Preparation
- Core idea: Generative Adversarial Nets train two models in a minimax game, aligning the generator distribution p_g with the data distribution p_data by training a discriminator to distinguish real from generated data.
- Central mathematics:
- Value function: V(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))].
- Optimal discriminator for fixed G: D^{*}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}.
- Objective under the optimal D: C(G) = -\log(4) + 2\,\mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_g).
- The global optimum occurs iff p_g = p_{\mathrm{data}}.
- Training procedure key points:
- Alternate updating D and G; use k steps of D per G-step (often k=1 in practice).
- If D becomes too good early on, switch to maximizing log D(G(z)) for G’s objective to maintain strong gradients.
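A quick numeric illustration (values chosen arbitrarily) of why the switch preserves gradient signal:

```python
# Early in training, D confidently rejects generated samples, so y = D(G(z))
# is near 0. Gradient magnitudes with respect to y (the chain-rule factor
# through G is the same for both objectives):
#   minimize log(1 - y)  ->  |d/dy| = 1 / (1 - y)
#   maximize log(y)      ->  |d/dy| = 1 / y
ys = (1e-4, 1e-2, 0.5)
grads_original = [1.0 / (1.0 - y) for y in ys]
grads_alternative = [1.0 / y for y in ys]
for y, go, ga in zip(ys, grads_original, grads_alternative):
    print(f"y={y}: original {go:.4f}, alternative {ga:.1f}")
# At y = 1e-4 the original gradient is ~1 while the alternative's is 10000.
```

Both objectives share the fixed point D(G(z)) = 1/2, so the switch changes the gradient dynamics, not the target of training.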
- Practical limitations and benefits:
- No explicit density pg(x) is learned; training stability depends on balancing D and G.
- GANs can model sharp distributions and multi-modality without Markov chains; rely on backpropagation for gradients.
- Important extensions to remember for exams:
- Conditional GANs (p(x|c)) by feeding c to G and D.
- Inference networks that predict z given x, trained after the generator (learned approximate inference).
- Semi-supervised learning and multi-conditional modeling by sharing parameters across conditionals.
Key References and Concepts Mentioned (Context)
- Foundational papers and concepts referenced include backpropagation, dropout, rectified linear units, maxout activations, and Gaussian Parzen window estimation for evaluating sample likelihoods when explicit likelihoods are intractable.
- Notable comparisons include VAEs, NCE, predictability minimization, and adversarial examples as distinct concepts with different objectives.