Notes on Generative Adversarial Nets (GANs)
Adversarial Framework: Generative Adversarial Nets
- Two models are pitted against each other in a minimax game: a generative model G that learns the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than from G.
- Training objective for G: maximize the probability that D makes a mistake on samples produced by G.
- The framework corresponds to a minimax two-player game with a unique solution in the space of arbitrary functions G and D: G recovers the training data distribution and D equals 1/2 everywhere.
- When G and D are both multilayer perceptrons (MLPs), the entire system can be trained with backpropagation; no Markov chains or unrolled inference networks are required during training or generation.
- Experimental results demonstrate qualitative and quantitative evaluation of generated samples, showing potential of the framework.
The Adversarial Training Game
- Special case considered: generative model generates samples by passing random noise through a multilayer perceptron; discriminative model is also a multilayer perceptron.
- In this case, both G and D can be trained using backpropagation and dropout; sampling from G uses forward propagation only; no approximate inference or Markov chains are necessary.
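To make "sampling from G uses forward propagation only" concrete, here is a minimal NumPy sketch of an MLP generator. The layer sizes, random weights, and 784-dimensional output are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny MLP generator: noise z -> ReLU hidden layer -> sigmoid output.
W1 = rng.normal(scale=0.1, size=(10, 32))   # noise dim 10 -> hidden dim 32
b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 784))  # hidden -> 28x28 "image" vector
b2 = np.zeros(784)

def G(z):
    """Sampling is a single forward pass: no Markov chain, no inference."""
    h = np.maximum(z @ W1 + b1, 0.0)              # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # sigmoid output in (0, 1)

z = rng.normal(size=(5, 10))  # minibatch of noise from the prior p_z
x = G(z)
print(x.shape)  # (5, 784)
```

Training would backpropagate through exactly this forward pass; generation needs nothing more.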
- Notation:
- Let p_data(x) be the data distribution.
- Let p_z(z) be the prior over input noise variables.
- G(z; θ_g) maps noise z to data space; G is differentiable with parameters θ_g.
- D(x; θ_d) outputs a single scalar; D(x) is the probability that x came from the data rather than from G.
- Training objective (minimax game):
- Discriminator maximizes the probability of correct labeling for both real and generated samples.
- Generator minimizes log(1 - D(G(z))).
- Value function: V(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
- Game formulation: \min_G \max_D V(D,G)
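In practice the two expectations are estimated by Monte Carlo from minibatches, which is what Algorithm 1 does. Below is a hedged NumPy sketch: a toy 1-D data distribution, a hand-fixed logistic discriminator, and a shift generator are all illustrative choices, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D setup: real data ~ N(2, 1); generator shifts noise by an offset theta.
theta = 0.0
def G(z):
    return z + theta

# A hand-fixed logistic discriminator D(x) = sigmoid(w*x + b); in a real GAN
# w and b would be trained to maximize V.
w, b = 1.5, -1.5
def D(x):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

# Monte Carlo estimate of V(D,G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
m = 100_000
x = rng.normal(loc=2.0, scale=1.0, size=m)  # x ~ p_data
z = rng.normal(size=m)                      # z ~ p_z
V = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
print(V)
```

Since both log-terms are logs of probabilities, the estimate is always negative; its maximum possible value, -log 4, is reached only when p_g matches p_data and D is forced to 1/2.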
- Training procedure (Algorithm 1):
- Alternate between updating D and G; typically k steps updating D for every step updating G.
- Rationale: maintaining D near its optimum for the current G to provide useful gradients to G; if G changes too quickly, D cannot adapt, destabilizing training.
- Intuition (Figure 1):
- As training progresses, p_g (the generator's distribution) moves closer to p_data; D becomes a poorer detector; at convergence, when p_g = p_data, D(x) = 1/2 for all x.
Theoretical Results
- The generator implicitly defines a distribution p_g as the distribution of G(z) when z ~ p_z.
- Theory in a non-parametric setting (i.e., over arbitrary functions G and D):
- Prove that the minimax game has a global optimum at p_g = p_data (Theorem 1).
- Show that Algorithm 1 optimizes the objective and converges to the desired p_g under certain conditions.
- Key results:
- Optimal discriminator for a fixed generator: for all x,
D^{*}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}.
- When D is optimal, the value of the game becomes
C(G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D^{*}(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D^{*}(G(z)))] = -\log(4) + 2\,\mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_g).
- The global minimum of C(G) is achieved if and only if p_g = p_{\mathrm{data}}; at that point, C(G) = -\log 4.
- The Jensen-Shannon divergence (JSD) is defined as
\mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_g) = \tfrac{1}{2}\,\mathrm{KL}(p_{\mathrm{data}} \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(p_g \,\|\, M), \quad M = \tfrac{1}{2}(p_{\mathrm{data}} + p_g).
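These identities are easy to verify numerically for discrete distributions. The sketch below (with arbitrary illustrative probabilities) checks that plugging the optimal discriminator into the value function yields -log 4 + 2·JSD(p_data || p_g):

```python
import numpy as np

# Discrete toy distributions over 4 outcomes (illustrative values).
p_data = np.array([0.1, 0.2, 0.3, 0.4])
p_g    = np.array([0.25, 0.25, 0.25, 0.25])

# Optimal discriminator for a fixed generator: D*(x) = p_data / (p_data + p_g)
d_star = p_data / (p_data + p_g)

# Value of the game at the optimal D:
# C(G) = E_{p_data}[log D*(x)] + E_{p_g}[log(1 - D*(x))]
C = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# Compare against the closed form C(G) = -log 4 + 2 * JSD(p_data || p_g)
def kl(p, q):
    return np.sum(p * np.log(p / q))

M = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, M) + 0.5 * kl(p_g, M)
print(C, -np.log(4.0) + 2.0 * jsd)  # the two values agree
```

Since JSD is non-negative and zero only when the two distributions coincide, C(G) is bounded below by -log 4, attained exactly at p_g = p_data.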
- Convergence of Algorithm 1 (Proposition 2):
- If G and D have enough capacity, the discriminator is allowed to reach its optimum given G, and p_g is updated so as to improve
\mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D^{*}(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D^{*}(G(z)))],
then p_g converges to p_{\mathrm{data}}.
- Important caveat: in practice, G is represented with finite capacity (e.g., an MLP), so the non-parametric convergence guarantees do not strictly apply, but empirical results are strong.
Algorithm 1: Minibatch SGD training of Generative Adversarial Nets
- For each training iteration:
- For k steps do
- Sample a minibatch of m noise samples {z^{(i)}} from p_z(z).
- Sample a minibatch of m real data examples {x^{(i)}} from p_data(x).
- Update the discriminator by ascending its stochastic gradient:
\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^m \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right].
- Sample a minibatch of m noise samples {z^{(i)}} from p_z(z).
- Update the generator by descending its stochastic gradient:
\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^m \log(1 - D(G(z^{(i)}))).
- Practical note: the updates can use any standard gradient-based learning rule; momentum was used in the experiments.
- Practical alternative for G’s objective (to improve gradient signal early in training):
- Instead of minimizing log(1 - D(G(z))), maximize log D(G(z)).
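The whole procedure can be exercised end to end on a toy problem. The sketch below is a minimal, illustrative instance of Algorithm 1 (not the paper's setup): real data are N(4, 1), the generator is a learned shift G(z) = z + theta of unit Gaussian noise, the discriminator is logistic regression, and G uses the practical alternative objective (maximize log D(G(z))) so its gradient does not vanish early.

```python
import numpy as np

rng = np.random.default_rng(2)

# theta shifts the noise; (w, b) parameterize D(x) = sigmoid(w*x + b).
theta, w, b = 0.0, 0.0, 0.0
lr, m, k = 0.02, 128, 1

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for _ in range(3000):
    for _ in range(k):  # k discriminator steps per generator step
        xr = rng.normal(4.0, 1.0, size=m)    # minibatch from p_data
        xf = rng.normal(size=m) + theta      # minibatch from G
        dr, df = sigmoid(w * xr + b), sigmoid(w * xf + b)
        # Ascend the gradient of  mean log D(x) + mean log(1 - D(G(z)))
        w += lr * (np.mean((1 - dr) * xr) - np.mean(df * xf))
        b += lr * (np.mean(1 - dr) - np.mean(df))
    xf = rng.normal(size=m) + theta          # fresh noise for G's update
    df = sigmoid(w * xf + b)
    # Ascend the gradient of  mean log D(G(z))  with respect to theta
    theta += lr * np.mean((1 - df) * w)
print(theta)  # drifts toward 4, where p_g matches p_data
```

The alternating updates mirror Algorithm 1 exactly; only the closed-form gradients of the logistic model replace backpropagation through an MLP.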
Practical Implementation Details
- Model design:
- Generator: mixture of rectified linear units (ReLU) and sigmoid activations.
- Discriminator: uses Maxout activations.
- Dropout applied in training the discriminator.
- Training data and evaluation:
- Datasets: MNIST, Toronto Face Database (TFD), CIFAR-10.
- Parzen window density estimation to evaluate test-set likelihood of samples from G:
- Fit a Gaussian Parzen window to generated samples and compute log-likelihood on the test set.
- Parzen σ parameter selection via cross-validation on the validation set.
- Note: This likelihood estimate has high variance in high dimensions, but is a practical evaluation when exact likelihood is intractable.
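A hedged sketch of this evaluation protocol, with stand-in Gaussian data in place of actual model samples, illustrating both the Parzen estimate and the σ grid search on a validation split:

```python
import numpy as np

rng = np.random.default_rng(3)

def parzen_log_likelihood(samples, xs, sigma):
    """Mean log-likelihood of points `xs` under an isotropic Gaussian
    Parzen window fit to `samples` with bandwidth `sigma`."""
    n, d = samples.shape
    sq = np.sum((xs[:, None, :] - samples[None, :, :]) ** 2, axis=-1)
    a = -sq / (2.0 * sigma ** 2)                 # log kernel values, (n_xs, n)
    amax = a.max(axis=1, keepdims=True)          # for a stable log-sum-exp
    log_mean = amax[:, 0] + np.log(np.mean(np.exp(a - amax), axis=1))
    log_norm = 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return float(np.mean(log_mean - log_norm))

# Illustrative stand-ins: "generated" samples and held-out data, all from N(0, I).
gen = rng.normal(size=(500, 2))
valid, test = rng.normal(size=(200, 2)), rng.normal(size=(200, 2))

# Choose sigma on the validation set, then report test log-likelihood.
grid = [0.05, 0.1, 0.2, 0.5, 1.0]
sigma = max(grid, key=lambda s: parzen_log_likelihood(gen, valid, s))
print(sigma, parzen_log_likelihood(gen, test, sigma))
```

With real image data the samples would come from G and the validation/test points from the dataset splits; the high variance of this estimator in high dimensions is why the paper treats it as a rough proxy rather than a true likelihood.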
- Visual results:
- Figures show samples drawn from the generator after training.
- Samples are competitive with existing generative models and demonstrate the potential of the adversarial framework.
- Observations:
- The framework can generate sharp, even degenerate distributions, unlike some Markov-chain-based methods that require blurriness to mix between modes.
Advantages and Disadvantages
- Disadvantages:
- There is no explicit representation of p_g(x).
- The discriminator must be kept synchronized with the generator during training (e.g., avoid the "Helvetica scenario," in which G collapses many z values to the same x because D is not updated frequently enough).
- Training dynamics can be delicate without proper balancing between G and D.
- Advantages:
- No Markov chains or approximate inference needed; backpropagation suffices for learning gradients.
- No explicit inference required during learning; flexible to incorporate a wide variety of differentiable functions.
- Can represent very sharp distributions and potentially handle multi-modality better than some chain-based methods.
- Comparative perspective (Table 2):
- Deep directed graphical models require inference during training; generative autoencoders involve tradeoffs between mixing and reconstruction.
- Adversarial models avoid MCMC and explicit density representation; rely on a discriminative network to drive the generator.
- Synchronization between D and G is a distinctive challenge in GANs (the Helvetica scenario).
Extensions and Future Work
- The paper outlines several straightforward extensions of the adversarial framework:
1) Conditional generative model: p(x | c) by adding c as input to both G and D.
2) Learned approximate inference: train an auxiliary network to predict z given x (an inference network), similar to the wake-sleep algorithm but trained after the generator is finished training.
3) Conditional modeling for subsets: approximately model p(x_S | x_{\setminus S}) for any subset S of the indices of x by training a family of conditional models that share parameters.
4) Semi-supervised learning: use features from the discriminator or inference network to improve classifiers when labeled data is limited.
5) Efficiency improvements: better coordination between G and D, or smarter distributions for sampling z, to accelerate training.
- Overall claim: GANs open up many research directions and demonstrate the viability of the adversarial modeling framework.
Related Work and Context
- GANs sit among a family of approaches that use discriminative criteria to train generative models, including noise-contrastive estimation (NCE) and contrastive divergence, alongside deep Boltzmann machines and variational autoencoders (VAEs).
- Key distinctions:
- VAEs pair a differentiable generator with a recognition model for approximate inference; GANs pair a generator with a discriminator but do not require explicit inference or a tractable likelihood.
- NCE uses a fixed noise distribution as the negative class; GANs treat the discriminator as a dynamic model that learns to distinguish data from generated samples.
- Adversarial nets differ from adversarial examples in purpose: adversarial nets aim to train a generative model, whereas adversarial examples are inputs designed to fool discriminative models.
Practical Takeaways for Exam Preparation
- Core idea: Generative Adversarial Nets train two models in a minimax game, aligning the generator distribution p_g with the data distribution p_data by training a discriminator to distinguish real from generated data.
- Central mathematics:
- Value function: V(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))].
- Optimal discriminator for fixed G: D^{*}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}.
- Objective under the optimal D: C(G) = -\log(4) + 2\,\mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_g).
- The global optimum occurs iff p_g = p_{\mathrm{data}}.
- Training procedure key points:
- Alternate updating D and G; use k steps of D per G-step (often k=1 in practice).
- If D becomes too good early on, switch to maximizing log D(G(z)) for G’s objective to maintain strong gradients.
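A quick numeric illustration (values chosen arbitrarily) of why the switch preserves gradient signal:

```python
# Early in training, D confidently rejects generated samples, so y = D(G(z))
# is near 0. Gradient magnitudes with respect to y (the chain-rule factor
# through G is the same for both objectives):
#   minimize log(1 - y)  ->  |d/dy| = 1 / (1 - y)
#   maximize log(y)      ->  |d/dy| = 1 / y
ys = (1e-4, 1e-2, 0.5)
grads_original = [1.0 / (1.0 - y) for y in ys]
grads_alternative = [1.0 / y for y in ys]
for y, go, ga in zip(ys, grads_original, grads_alternative):
    print(f"y={y}: original {go:.4f}, alternative {ga:.1f}")
# At y = 1e-4 the original gradient is ~1 while the alternative's is 10000.
```

Both objectives share the fixed point D(G(z)) = 1/2, so the switch changes the gradient dynamics, not the target of training.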
- Practical limitations and benefits:
- No explicit density pg(x) is learned; training stability depends on balancing D and G.
- GANs can model sharp distributions and multi-modality without Markov chains; rely on backpropagation for gradients.
- Important extensions to remember for exams:
- Conditional GANs (p(x|c)) by feeding c to G and D.
- Inference networks that predict z given x, trained after the generator (learned approximate inference).
- Semi-supervised learning and multi-conditional modeling by sharing parameters across conditionals.
Key References and Concepts Mentioned (Context)
- Foundational papers and concepts referenced include backpropagation, dropout, rectified linear units, maxout activations, and Gaussian Parzen window estimation for evaluating sample likelihoods when explicit likelihoods are intractable.
- Notable comparisons include VAEs, NCE, predictability minimization, and adversarial examples as distinct concepts with different objectives.