10. Generative Adversarial & Diffusion Learning

New cards

Explain what Adversarial Inputs are, and provide a concrete example of how they affect image classifiers.

Adversarial inputs are real data vectors that have been modified with slight, intentionally engineered pixel-level noise (e.g., scaled by $\epsilon = 0.007$ ). These changes are imperceptible to humans but push the input across the model's learned decision boundaries, so the prediction changes even though the image looks the same.

Example: Adding structured noise to a picture of a "panda" can cause a model that previously classified it correctly with 57.7% confidence to misclassify the identical-looking image as a "gibbon" with 99.3% confidence.
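One common way to build such a perturbation is the fast gradient sign method (FGSM). The card does not name the attack, so treat FGSM here as an illustrative assumption. The sketch below applies it to a toy logistic-regression classifier rather than an image model; the weights `w`, the input `x`, and the step size are all hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fgsm_perturb(x, y, w, b, eps):
    """Return x + eps * sign(grad_x loss) for a logistic-regression classifier.

    The loss is binary cross-entropy with label y in {0, 1}; its gradient
    with respect to the input is (sigmoid(w.x + b) - y) * w.
    """
    grad_x = (sigmoid(w @ x + b) - y) * w
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w = rng.normal(size=1000)      # hypothetical classifier weights
b = 0.0
x = 0.01 * np.sign(w)          # toy input that the classifier assigns to class 1
y = 1

x_adv = fgsm_perturb(x, y, w, b, eps=0.02)

p_clean = sigmoid(w @ x + b)       # confidence in class 1 before the attack
p_adv = sigmoid(w @ x_adv + b)     # confidence in class 1 after the attack
max_change = np.max(np.abs(x_adv - x))
```

Each coordinate moves by at most `eps`, yet the many small changes add up along the weight vector and flip the prediction, which is the same effect the panda example describes in a high-dimensional image space.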

New cards

Write out the standard GAN Min-Max Objective Function and break down the explicit goal of each network ($G$ and $D$).

$\min_{\theta} \max_{\phi} V(D_\phi, G_\theta) = \frac{1}{2}\mathbb{E}_{x \sim p^*(x)}[\log D_{\phi}(x)] + \frac{1}{2}\mathbb{E}_{z \sim q(z)}[\log(1 - D_{\phi}(G_{\theta}(z)))]$

Discriminator ( $D_\phi$ ) Goal: Maximize the objective. It aims to output a high score ( $D(x) \to 1$ ) for real data points $x$ and a low score ( $D(G(z)) \to 0$ ) for generated fake samples.
Generator ( $G_\theta$ ) Goal: Minimize the objective. It aims to produce samples realistic enough to force the discriminator to misclassify them as real ( $D(G(z)) \to 1$ ), driving $\log(1 - D(G(z)))$ toward negative infinity.
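As a rough illustration, the value function can be estimated from discriminator outputs on a batch. The helper name `gan_value` and the example scores below are assumptions made for this sketch.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) with the 1/2 weights used above.

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return 0.5 * np.mean(np.log(d_real)) + 0.5 * np.mean(np.log(1.0 - d_fake))

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
v_confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator pushes V toward its upper limit of 0.
v_confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator wants the larger value (`v_confident`), while the generator wants to drag the second term down by raising $D(G(z))$.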

New cards

Outline the precise algorithmic training sequence step-by-step for a standard GAN during a single optimization loop.

Train the Discriminator (for $k$ steps):

Sample a minibatch of random latent noise from the prior: $z \sim q(z)$ .
Sample a minibatch of real data instances from the target dataset: $x \sim p^*(x)$ .
Generate fakes $G_\theta(z)$ , pass the real and fake batches through $D_\phi$ , compute the adversarial loss, and update the discriminator parameters $\phi$ via stochastic gradient ascent to maximize $V(D, G)$ .

Train the Generator (one step):

Sample a fresh minibatch of random latent noise: $z \sim q(z)$ .
Pass the noise through $G_\theta$ and then through the frozen $D_\phi$ .
Update the generator parameters $\theta$ via SGD to minimize $V(D,G)$ (by maximizing the discriminator's detection error rate).
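The alternating updates can be sketched on a toy 1D problem with hand-derived gradients. Everything here is an assumption made for illustration: the linear generator `G(z) = a*z + b`, the logistic discriminator `D(x) = sigmoid(w*x + c)`, the target distribution, and the learning rate. The sketch uses one discriminator update and one generator update per iteration and makes no claim about convergence.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
w, c = 0.1, 0.0        # discriminator parameters (phi)
a, b = 1.0, 0.0        # generator parameters (theta)
lr, batch = 0.05, 128

for step in range(200):
    # Discriminator update: gradient ascent on V(D, G).
    z = rng.normal(size=batch)
    x_real = rng.normal(loc=3.0, scale=1.0, size=batch)   # toy target p*(x)
    x_fake = a * z + b
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    # d/ds log(sigmoid(s)) = 1 - sigmoid(s); d/ds log(1 - sigmoid(s)) = -sigmoid(s)
    grad_w = 0.5 * np.mean((1.0 - d_real) * x_real) - 0.5 * np.mean(d_fake * x_fake)
    grad_c = 0.5 * np.mean(1.0 - d_real) - 0.5 * np.mean(d_fake)
    w, c = w + lr * grad_w, c + lr * grad_c

    # Generator update: gradient descent on 0.5 * E[log(1 - D(G(z)))], D held fixed.
    z = rng.normal(size=batch)
    d_fake = sigmoid(w * (a * z + b) + c)
    grad_a = np.mean(-0.5 * d_fake * w * z)
    grad_b = np.mean(-0.5 * d_fake * w)
    a, b = a - lr * grad_a, b - lr * grad_b
```

In a real implementation the gradients would come from automatic differentiation, and the discriminator's parameters would simply be excluded from the generator's optimizer step.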

New cards

Differentiate between Conditional GANs (cGAN), CycleGANs, and StyleGANs regarding their structural operations.

Conditional GAN (cGAN): Feeds auxiliary information (like class labels or domain maps) directly into both $G$ and $D$ to steer generation toward specific targets (e.g., rendering an image from a line drawing).

CycleGAN: Translates features across separate, unpaired image collections by setting up two distinct adversarial networks that map back-and-forth, enforcing a cycle-consistency loss ( $G(F(x)) \approx x$ ).
StyleGAN: Disentangles latent space representations by mapping standard noise vectors $Z$ into a specialized intermediate space $W$ . This intermediate space acts as a style modifier at different layer resolutions, allowing for independent control over attributes like pose, age, or gender.

New cards

Define Vanishing Gradients and Mode Collapse as they apply to GAN training failures.

Vanishing Gradients: Occurs when the Discriminator becomes too skilled too quickly. If $D$ achieves near-perfect classification accuracy, the generator's loss flattens out, and its learning gradients drop to near-zero. This leaves $G$ with no actionable feedback for parameter tuning.
Mode Collapse: A failure mode where the Generator stops exploring the full diversity of the target data distribution. Instead, it discovers a narrow subset of variations that consistently fools the discriminator, leading it to repeatedly output the same limited set of samples.

New cards

Contrast VAEs and GANs on Latent Mapping style and the Perceptual Quality of their outputs.

Variational Autoencoders (VAEs): Optimize an explicit analytical lower bound over data likelihood, mapping inputs to localized continuous distributions. Because they maximize pixel-level probability averages, their generated outputs frequently appear blurry.

Generative Adversarial Networks (GANs): Learn an implicit data distribution through competition rather than an explicit density function. Because the generator is optimized purely against a discriminator's critiques rather than against pixel-level averages, GANs tend to generate sharp, high-fidelity images.

New cards

What is the fundamental operational mechanism of a Flow-Based Model, and how does it differ from a GAN?

Unlike GANs, which generate samples implicitly without computing densities, Flow-Based Models explicitly approximate the data's true probability density function.

They achieve this by taking a simple base distribution (such as a standard 2D Gaussian) and passing it through a sequence of invertible, differentiable (bijective) transformation functions. Applying the change-of-variables formula to each transformation lets the model compute exact log-likelihoods while retaining fast data synthesis capabilities.
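A minimal sketch of the idea, assuming a 1D flow built from affine layers $f_k(u) = a_k u + b_k$ (a deliberately simple choice; practical flows use richer invertible layers). The function names and layer parameters are hypothetical.

```python
import numpy as np

def std_normal_logpdf(z):
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

def affine_flow_logpdf(x, scales, shifts):
    """Exact log p(x) for x = f_K(...f_1(z)) with f_k(u) = scales[k] * u + shifts[k].

    Inverts the layers in reverse order and accumulates the log-determinant
    of each inverse map, which for a 1D affine layer is -log|scale|.
    """
    u = np.asarray(x, dtype=float)
    log_det = np.zeros_like(u)
    for a, b in zip(reversed(scales), reversed(shifts)):
        u = (u - b) / a
        log_det -= np.log(abs(a))
    return std_normal_logpdf(u) + log_det

def affine_flow_sample(n, scales, shifts, rng):
    """Draw samples by pushing base noise forward through every layer."""
    z = rng.normal(size=n)
    for a, b in zip(scales, shifts):
        z = a * z + b
    return z

scales, shifts = [2.0, 1.5], [1.0, -3.0]      # composed map: x = 3z - 1.5
rng = np.random.default_rng(0)
samples = affine_flow_sample(20000, scales, shifts, rng)
log_density = affine_flow_logpdf(np.array([-2.0, 0.0, 1.0]), scales, shifts)
```

Both directions are cheap here: sampling runs the layers forward, and density evaluation runs them backward while summing log-determinants.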

New cards

Describe the Forward and Reverse pathways of a Variational Diffusion Model (VDM) under its Markov Process framework.

Forward Diffusion Process: A fixed, tractable Markov chain that systematically injects small increments of Gaussian noise into clean data ( $x_0$ ) over a series of sequential timesteps ( $x_1, x_2, \dots$ ). No model training occurs here; it terminates when the input becomes completely unstructured noise ( $x_T$ ).
Reverse Diffusion Process: An approximate, learned Markov chain. A trained neural network (typically a U-Net architecture) takes the noisy vector $x_t$ and predicts the exact noise contribution at that timestep, subtracting it to iteratively reconstruct clean data ( $x_0$ ).

New cards

What is the $h$ -space in a Variational Diffusion Model, and how can it be used for image editing?

The $h$ -space is the semantic latent space spanned by the bottleneck activations within the diffusion model's core U-Net architecture. By extracting these activations and performing Principal Component Analysis (PCA) on them, developers can isolate independent direction vectors that correspond to specific physical features. This allows for smooth, semantically isolated editing (e.g., altering age or facial pose) without affecting other image details.

New cards

Contrast the implementation strategies of Imagen, DALL-E 2, and Stable Diffusion for text-to-image synthesis.

Imagen: Passes text prompts through a massive pre-trained language model, then routes the resulting embeddings through a diffusion process directly in the pixel space, using cascaded super-resolution models to upscale the output.

DALL-E 2: Uses a prior network to map a text description (via its CLIP text embedding) to a CLIP image embedding, which then conditions a diffusion decoder that synthesizes the image.
Stable Diffusion (Latent Diffusion): Maximizes efficiency by running its progressive forward-reverse diffusion pipeline entirely inside a compact pre-trained VAE latent space rather than on raw pixel grids. It integrates CLIP and Vision Transformers (ViT) to inject cross-attention textual conditioning.

New cards

Why can traditional semi-supervised approximate inference techniques (like those in VAEs) not be directly applied to standard GANs? How do Semi-Supervised GANs (SS-GANs) bypass this?

Standard GANs suffer from two structural limitations:

They do not learn an inference network/encoder mapping data back to latent space ( $x \to z$ ).
They do not model an explicit probability density function $p(x)$ over the data space.

Because of this, SS-GANs bypass the need for an explicit encoder by directly modifying the architecture of the critic (discriminator) to handle classification and adversarial tasks simultaneously.

New cards

Describe the output layer architecture of the Critic in a Semi-Supervised GAN. How many outputs does it feature, and what do they represent?

The modified critic features $C + 1$ outputs:

The first $C$ outputs correspond to the actual semantic class labels of the dataset (e.g., Cat, Dog, Car, etc.).
The $(C + 1)$ -th output corresponds to a dedicated "fake" class label used for standard adversarial discrimination.

New cards

Explain the expected classification behavior of the SS-GAN critic when processing:

Labeled Real Data
Unlabeled Real Data
Generated (Fake) Data

Labeled Real Data ( $x, y$ ): The critic is optimized to output the exact, true semantic class label $y \in \{1, \dots, C\}$ .

Unlabeled Real Data ( $x$ ): The critic is trained to raise the aggregate probability mass across any of the valid $C$ real classes, while suppressing the probability of the $(C + 1)$ -th "fake" class.
Generated Data ( $G(z)$ ): The critic is trained to classify these outputs explicitly into the $(C + 1)$ -th "fake" class.

New cards

Write out the complete Critic Loss Function ($\mathcal{L}_{\text{critic}}$) for a Semi-Supervised GAN.

$\mathcal{L}_{\text{critic}} = -\mathbb{E}_{x,y \sim p(x,y)} [\log p_\theta(y \mid x)] - \mathbb{E}_{x \sim p(x)} [\log(1 - p_\theta(y = C + 1 \mid x))] - \mathbb{E}_{z \sim p(z)} [\log p_\theta(y = C + 1 \mid G(z))]$

Where $\theta$ represents the parameterized weights of the modified multi-class critic network.

New cards

Break down the structural optimization roles of Term 1, Term 2, and Term 3 within the SS-GAN Critic Loss Function.

$\mathcal{L}_{\text{critic}} = \underbrace{-\mathbb{E}_{x,y} [\log p_\theta(y \mid x)]}_{\text{Term 1: Supervised Learning}} \underbrace{- \mathbb{E}_{x} [\log(1 - p_\theta(y = C + 1 \mid x))]}_{\text{Term 2: Unsupervised Real Learning}} \underbrace{- \mathbb{E}_{z} [\log p_\theta(y = C + 1 \mid G(z))]}_{\text{Term 3: Adversarial Fake Learning}}$

Term 1 (Supervised): Maximizes classification accuracy on available labeled real examples.
Term 2 (Unsupervised Real): Minimizes the probability that unlabeled real data is flagged as fake, forcing the model to distribute probability mass across the $C$ real semantic categories.
Term 3 (Adversarial Fake): Maximizes the probability that fakes synthesized by the generator are correctly routed to the $(C+1)$ -th "fake" slot.

New cards

Write out the Generator Loss Function ( $\mathcal{L}_{\text{generator}}$ ) for a Semi-Supervised GAN and explain its ultimate training goal.

$\mathcal{L}_{\text{generator}} = \mathbb{E}_{z \sim p(z)} [\log p_\theta(y = C + 1 \mid G(z))]$

Goal: This function acts directly as a min-max counterpart to the third term of the critic loss. It minimizes the probability that the critic detects its generated samples as fake, mathematically forcing the critic to accidentally assign the synthetic outputs into one of the $C$ valid real-world semantic categories.
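The critic and generator losses above can be sketched from raw critic logits. The function names are hypothetical, class labels are zero-indexed in the code (real classes are columns `0..C-1`), and the last column plays the role of the $(C+1)$ -th "fake" class.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def ssgan_critic_loss(logits_labeled, labels, logits_unlabeled, logits_fake):
    """Return (term1, term2, term3) of the critic loss; their sum is the loss."""
    lp_lab = log_softmax(logits_labeled)
    lp_unl = log_softmax(logits_unlabeled)
    lp_fake = log_softmax(logits_fake)
    # Term 1: cross-entropy on labeled real data.
    term1 = -np.mean(lp_lab[np.arange(len(labels)), labels])
    # Term 2: -log(1 - p(fake | x)), computed as the log of the summed real-class mass.
    term2 = -np.mean(np.log(np.exp(lp_unl[:, :-1]).sum(axis=1)))
    # Term 3: generated samples should land in the fake class.
    term3 = -np.mean(lp_fake[:, -1])
    return term1, term2, term3

def ssgan_generator_loss(logits_fake):
    """Generator minimizes the log-probability that its samples are called fake."""
    return np.mean(log_softmax(logits_fake)[:, -1])

C = 3
uniform_logits = np.zeros((5, C + 1))         # an untrained critic: all classes equal
labels = np.array([0, 1, 2, 0, 1])
t1, t2, t3 = ssgan_critic_loss(uniform_logits, labels, uniform_logits, uniform_logits)
g_loss = ssgan_generator_loss(uniform_logits)
```

Note that `g_loss` is exactly the negative of `t3` when both are evaluated on the same generated batch, which is the min-max relationship described above.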

New cards

Explain the fundamental architectural asymmetry that motivates the design of a standard Diffusion Model.

It is highly computationally challenging to transform unstructured Gaussian noise into a highly structured data manifold directly in a single step. However, it is analytically simple to destroy data structure by gradually adding noise.

Diffusion models exploit this by setting up two complementary processes: a fixed, multi-step Forward Process that perturbs data into noise, and a learned, deep Reverse Process trained to invert each perturbation step sequentially.

New cards

Compare DDPMs with Variational Autoencoders (VAEs) and Normalizing Flows regarding internal dimensionality and transformation properties.

vs. VAEs: Unlike typical VAE bottlenecks, every single intermediate latent state $x_t$ ( $t = 1 \dots T$ ) in a DDPM maintains the exact same spatial and numerical dimensionality as the initial data input $x_0$ . This completely bypasses low-dimensional posterior collapse.

vs. Normalizing Flows: Unlike flows, the intermediate step transitions in a DDPM are stochastic rather than deterministic, and they do not require mathematically invertible layers or Jacobian determinant calculations.

New cards

Write out the conditional transition equation for a single step of the Forward (Diffusion) Process, and define its variables.

$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t \;{\Big |}\; \sqrt{1 - \beta_t}x_{t-1}, \,\beta_t \mathbf{I}\right)$

$x_t$ : The perturbed latent state at timestep $t$.
$\beta_t \in (0, 1)$ : The noise variance set by a predefined noise schedule at timestep $t$ .
$\mathbf{I}$ : The identity matrix, indicating that the injected Gaussian noise is isotropic and independent across dimensions.

New cards

State the Closed-Form Marginalization equation (the Diffusion Kernel) that allows a DDPM to sample $x_t$ directly from $x_0$ , and define its components.

$q(x_t \mid x_0) = \mathcal{N}\left(x_t \;{\Big |}\; \sqrt{\bar{\alpha}_t}x_0, \,(1 - \bar{\alpha}_t)\mathbf{I}\right)$

Where $\alpha_t \triangleq 1 - \beta_t$ , and $\bar{\alpha}_t \triangleq \prod_{s=1}^t \alpha_s$ . This allows the system to bypass step-by-step sequential simulation during training by scaling $x_0$ deterministically and adding a scaled noise vector.
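A small numerical sketch comparing step-by-step simulation with the closed-form kernel. The linear $\beta_t$ schedule and its endpoints are assumed values, and the function names are hypothetical.

```python
import numpy as np

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and the cumulative product alpha-bar."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ q(x_t | x_{t-1})."""
    noise = rng.normal(size=x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def diffuse_direct(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot; t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule(T=1000)
x0 = np.full(20000, 2.0)          # many copies of the same toy "data point"

x_seq = x0.copy()
for t in range(1, 201):           # simulate 200 sequential noising steps
    x_seq = forward_step(x_seq, betas[t - 1], rng)
x_direct, _ = diffuse_direct(x0, 200, alpha_bars, rng)
```

Up to sampling error, `x_seq` and `x_direct` share the same mean $\sqrt{\bar{\alpha}_t}\,x_0$ and variance $1 - \bar{\alpha}_t$ , but the direct route needs a single draw instead of 200.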

New cards

How do the perceptual effects of the forward process shift across visual media from early timesteps ( $t \to 1$ ) to later timesteps ( $t \to T$ )?

The unconditional marginal step acts as a progressive Gaussian convolution over the image manifold:

Early Steps ( $t \to 1$ ): Wipe out high-frequency spatial details first (e.g., fine clothing textures, sharp object edges, pixel grain).
Later Steps ( $t \to T$ ): Progressively destroy low-frequency structural information (e.g., global semantic shapes, object orientations, color layouts).

New cards

Write out the comprehensive joint distribution function of the Reverse Generative Model decoder.

$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t) \quad \text{where } p(x_T) = \mathcal{N}(0, \mathbf{I})$

And each individual reverse transition is parameterized as a shared Gaussian neural network:

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1} \;{\Big |}\; \mu_\theta(x_t, t), \,\mathbf{\Sigma}_\theta(x_t, t)\right)$

New cards

Provide the structural breakdown of the negative ELBO ( $\mathcal{L}_{\text{VLB}}$ ) used to fit a DDPM, detailing the operational roles of its three components: $\mathcal{L}_T$ , $\mathcal{L}_{t-1}$ , and $\mathcal{L}_0$ .

$\mathcal{L}_{\text{VLB}}(\mathbf{x}_0) = \underbrace{D_{\text{KL}}(q(x_T \mid x_0) \parallel p(x_T))}_{\mathcal{L}_T} + \sum_{t=2}^T \underbrace{\mathbb{E}_{q(x_t \mid x_0)} \left[ D_{\text{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t)) \right]}_{\mathcal{L}_{t-1}} - \underbrace{\mathbb{E}_{q(x_1 \mid x_0)}[\log p_\theta(x_0 \mid x_1)]}_{\mathcal{L}_0}$

$\mathcal{L}_T$ (Prior Loss): Confirms the final forward step matches the standard unlearned Gaussian noise prior. (Evaluates to near-zero given a proper schedule).
$\mathcal{L}_{t-1}$ (Denoising Transitions): Minimizes the KL divergence between the learned reverse step $p_\theta(x_{t-1} \mid x_t)$ and the true, tractable posterior distribution $q(x_{t-1} \mid x_t, x_0)$ .
$\mathcal{L}_0$ (Reconstruction Term): Measures the log-likelihood of producing the clean final image from the first latent step.

New cards

Explain why re-parameterizing a DDPM to predict the Injected Noise Vector ( $\epsilon$ ) rather than the denoised mean ( $\mu_\theta$ ) is mathematically sound.

Because any intermediate state can be expressed as $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon$ , we can solve for $x_0$ and substitute it directly into the true posterior mean equation. This yields:

$\tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\epsilon \right)$

This proves that the only unknown variable at timestep $t$ is the random noise vector $\epsilon$ . Consequently, parameterizing the network as $\epsilon_\theta(x_t, t)$ to predict this noise automatically resolves the calculation of the denoised mean vector.
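The substitution can be checked numerically. The card does not write out the posterior mean in terms of $(x_t, x_0)$ ; the sketch assumes the standard DDPM form $\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}x_t$ and an assumed linear schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500                                    # 1-indexed timestep, 2 <= t <= T
b_t, a_t = betas[t - 1], alphas[t - 1]
ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]

x0 = rng.normal(size=8)
eps = rng.normal(size=8)
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps

# Posterior mean written in terms of (x_t, x_0).
mu_from_x0 = (np.sqrt(ab_prev) * b_t / (1.0 - ab_t)) * x0 \
    + (np.sqrt(a_t) * (1.0 - ab_prev) / (1.0 - ab_t)) * x_t

# The same mean written in terms of (x_t, eps), as in the equation above.
mu_from_eps = (x_t - b_t / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(a_t)
```

The two expressions agree to floating-point precision, so a network that predicts $\epsilon$ determines the posterior mean just as well as one that predicts $x_0$ or $\mu$ directly.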

New cards

Write out the formal Simplified Loss Objective Function ( $\mathcal{L}_{\text{simple}}$ ) used to train modern DDPMs. What is its core structural adjustment compared to $\mathcal{L}_{\text{VLB}}$ ?

$\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{x_0, \,\epsilon \sim \mathcal{N}(0, \mathbf{I}), \,t \sim \text{Unif}(1, T)}\left[ \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon, \,t\right) \right\|^2 \right]$

Adjustment: It drops the timestep-dependent weighting coefficients that arise from the analytical KL calculation, setting them to a constant value ( $\lambda_t = 1$ ). Relative to the VLB weighting, this down-weights the low-noise (small $t$ ) terms, so training concentrates on the harder denoising steps at higher noise levels, which improves perceptual fidelity and training stability.
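A minimal sketch of one Monte Carlo estimate of $\mathcal{L}_{\text{simple}}$ on a minibatch. The `eps_model` callable stands in for the trained network $\epsilon_\theta$ ; the placeholder used here, the linear schedule, and the function names are assumptions for illustration.

```python
import numpy as np

def simple_loss(eps_model, x0_batch, alpha_bars, rng):
    """One Monte Carlo estimate of L_simple over a minibatch."""
    n = x0_batch.shape[0]
    T = len(alpha_bars)
    t = rng.integers(1, T + 1, size=n)                 # t ~ Unif{1, ..., T}
    eps = rng.normal(size=x0_batch.shape)              # eps ~ N(0, I)
    a_bar = alpha_bars[t - 1].reshape(n, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    residual = eps - eps_model(x_t, t)
    return np.mean(np.sum(residual.reshape(n, -1) ** 2, axis=1))

def zero_eps_model(x_t, t):
    # Placeholder "network" that always predicts zero noise.
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # assumed schedule
x0_batch = rng.normal(size=(4096, 16))
loss_value = simple_loss(zero_eps_model, x0_batch, alpha_bars, rng)
```

With the zero predictor the loss is just the expected squared norm of $\epsilon$ , roughly the data dimension (16 here); a trained network would drive it below that baseline.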

New cards

How does a Variational Diffusion Model (VDM) fundamentally differ from a standard DDPM regarding its noise schedule formulation?

Instead of utilizing a hard-coded, static schedule ( $\beta_t$ ), a VDM treats the noise progression parameters as learnable components optimized directly via the ELBO. The model parameterizes the Signal-to-Noise Ratio (SNR) as a monotonically decreasing function of continuous time:

$R(t) = \frac{\alpha_t^2}{\sigma_t^2} = \exp(-\gamma_\phi(t))$

Where $\gamma_\phi(t)$ is parameterized by a monotonic neural network architecture.

New cards

Explain the continuous-time boundary invariance property proved by converting the diffusion loss integral to a function of the signal-to-noise ratio ( $v = R(t)$ ).

Changing the integration variables converts the continuous-time loss into:

$\mathcal{L}_D(x_0) = \frac{1}{2}\mathbb{E}_{\epsilon \sim \mathcal{N}(0,\mathbf{I})}\left[ \int_{R_{\text{min}}}^{R_{\text{max}}} \|x_0 - \tilde{x}_\theta(z_v, v)\|_2^2 \,dv \right]$

This conversion mathematically proves that the exact functional trajectory of the intermediate noise schedule does not impact model fitting, provided the boundary conditions ( $R_{\text{min}}$ and $R_{\text{max}}$ ) remain identical.

New cards

Detail the mechanics of Low-Discrepancy Variance Reduction when sampling timesteps for a training minibatch of size $k$ in a continuous-time VDM.

Standard random Monte Carlo time sampling introduces high statistical variance across minibatches. To minimize estimation error, a deterministic low-discrepancy sampler splits the time domain evenly. A single anchor value $u_0 \sim \text{Unif}(0, 1)$ is sampled, and distributed timesteps are assigned across each element $i$ in the minibatch via:

$t^i = \text{mod}\left(u_0 + \frac{i}{k}, \,1\right)$

This guarantees an even, stratified sampling across the entire noise timeline.
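The sampler is short enough to sketch directly. The function name is hypothetical, and the index $i$ runs over $0, \dots, k-1$ here (running it over $1, \dots, k$ gives the same set of timesteps because of the modulo).

```python
import numpy as np

def low_discrepancy_times(k, rng):
    """Return k timesteps in [0, 1) that share one random offset u0."""
    u0 = rng.uniform(0.0, 1.0)
    i = np.arange(k)
    return np.mod(u0 + i / k, 1.0)

rng = np.random.default_rng(0)
times = low_discrepancy_times(8, rng)
gaps = np.diff(np.sort(times))     # spacing between neighbouring timesteps
```

Every gap equals $1/k$ , so each minibatch covers the whole time axis evenly while the shared offset `u0` still randomizes where the grid sits.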

New cards

Contrast Prescribed (Explicit) Probabilistic Models with Implicit Probabilistic Models on density evaluation and generative execution.

Prescribed Models: Provide a direct, parametric formulation of the log-likelihood function ( $\log p_\theta(x)$ ), allowing explicit pointwise density evaluation (e.g., ARMs, VAEs).

Implicit Models: Do not define an evaluable likelihood function ( $p(x)$ cannot be directly computed). Instead, they instantiate a stochastic simulator or function ( $G_\theta(z)$ ) to transform a latent distribution directly into synthesized observations (e.g., GANs).

New cards

Write out the optimization framework that defines the "Learning by Comparison" methodology for implicit models.

Because implicit models lack an explicit likelihood expression, they are optimized using a two-sample approach that passes expectations through a parameterized critic network ( $D_\phi$ ):

$D(p^*, q_\theta) = \max_\phi \mathcal{F}(D_\phi, p^*, q_\theta) \quad \text{where} \quad \mathcal{F}(D_\phi, p^*, q_\theta) = \mathbb{E}_{x \sim p^*(x)}[f(x, \phi)] + \mathbb{E}_{z \sim q(z)}[g(G_\theta(z), \phi)]$

This allows distances between distributions to be computed entirely via sample-driven Monte Carlo estimation.

New cards

Show how a binary classifier $D(x)$ can be algebraically converted to yield the Density Ratio $r(x) = \frac{p^*(x)}{q_\theta(x)}$ between two distributions.

Assume a classifier is trained with a Bernoulli log-loss to separate real samples ( $y=1$ ) from generated fakes ( $y=0$ ), with both classes equally likely a priori. By Bayes' rule, the optimal classifier is:

$D(x) = p(y=1 \mid x) = \frac{p^*(x)}{p^*(x) + q_\theta(x)}$

Dividing $D(x)$ by its complement $1 - D(x)$ yields:

$\frac{D(x)}{1 - D(x)} = \frac{\frac{p^*(x)}{p^*(x) + q_\theta(x)}}{\frac{q_\theta(x)}{p^*(x) + q_\theta(x)}} = \frac{p^*(x)}{q_\theta(x)} = r(x)$

The density ratio is exactly 1 everywhere if and only if the distributions are identical.

New cards

Prove that maximizing the standard binary cross-entropy discriminator objective yields a value mathematically equivalent to the Jensen-Shannon Divergence (JSD).

1. The analytical optimum for a discriminator maximizing the standard GAN loss is:

$D^*(x) = \frac{p^*(x)}{p^*(x) + q_\theta(x)}$

2. Substitute $D^*(x)$ back into the objective function:

$V(G, D^*) = \mathbb{E}_{x \sim p^*} \left[\log \frac{p^*(x)}{p^*(x)+q_\theta(x)}\right] + \mathbb{E}_{x \sim q_\theta} \left[\log \frac{q_\theta(x)}{p^*(x)+q_\theta(x)}\right]$

3. Divide each denominator by 2 so that it becomes the average distribution, compensating with a $-\log 2$ term for each expectation:

$= \mathbb{E}_{x \sim p^*} \left[\log \frac{p^*(x)}{\frac{p^*(x)+q_\theta(x)}{2}}\right] - \log 2 + \mathbb{E}_{x \sim q_\theta} \left[\log \frac{q_\theta(x)}{\frac{p^*(x)+q_\theta(x)}{2}}\right] - \log 2$

4. Convert the expectations into KL divergence integrals:

$V(G, D^*) = D_{\text{KL}}\left(p^* \;\parallel\; \frac{p^* + q_\theta}{2}\right) + D_{\text{KL}}\left(q_\theta \;\parallel\; \frac{p^* + q_\theta}{2}\right) - 2\log 2 = 2 \cdot \text{JSD}(p^*, q_\theta) - \log 4$
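The identity can be checked numerically on a pair of small discrete distributions. The probability vectors below are arbitrary illustrative values, and the helper names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-probability entries of p."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_star = np.array([0.1, 0.4, 0.3, 0.2])          # stand-in for the data distribution
q_theta = np.array([0.25, 0.25, 0.25, 0.25])     # stand-in for the model distribution

d_opt = p_star / (p_star + q_theta)              # optimal discriminator D*(x)
v_opt = np.sum(p_star * np.log(d_opt)) + np.sum(q_theta * np.log(1.0 - d_opt))
rhs = 2.0 * jsd(p_star, q_theta) - np.log(4.0)
```

`v_opt` and `rhs` coincide, and when the two distributions are equal the JSD vanishes and the value reduces to $-\log 4$ .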

New cards

Map the Brier Score and Hinge Loss rules to the specific alternative divergence constraints they minimize.

Brier Score: Maps directly to minimizing the Pearson $\chi^2$ Divergence (the structural foundation of Least Squares GANs / LS-GAN).

Hinge Loss: Maps directly to minimizing the Total Variation Distance.

New cards

Define an Integral Probability Metric (IPM) expression, and explain how the Wasserstein GAN (WGAN) is derived from it.

IPMs quantify statistical distance via a bounded class of real-valued test/witness functions $\mathcal{F}$ :

$\mathcal{I}_{\mathcal{F}}(p^*, q_\theta) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{x \sim p^*(x)}[f(x)] - \mathbb{E}_{x \sim q_\theta(x)}[f(x)] \right|$

WGAN Derivation: Constraining the function class $\mathcal{F}$ to the set of all 1-Lipschitz functions ( $\|f\|_{\text{Lip}} \le 1$ ) yields the Earth Mover's (Wasserstein-1) distance. Because calculating the exact supremum is intractable, it is approximated using a regularized neural critic $D_\phi$ under a Lipschitz constraint (e.g., via gradient penalties or spectral normalization).

New cards

Write out the kernel-trick expression for squared Maximum Mean Discrepancy (MMD) that allows optimization without computing a variational lower bound.

By constraining the witness function class to the unit ball of a Reproducing Kernel Hilbert Space ( $\|f\|_{\text{RKHS}} \le 1$ ), the supremum has a closed form and the metric can be computed directly using sample kernels:

$\text{MMD}^2(p^*, q_\theta) = \mathbb{E}_{x, x' \sim p^*}[K(x, x')] - 2\mathbb{E}_{x \sim p^*, z \sim q}[K(x, G_\theta(z))] + \mathbb{E}_{z, z' \sim q}[K(G_\theta(z), G_\theta(z'))]$
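A minimal sample-based estimate of this expression, assuming a Gaussian (RBF) kernel with a fixed bandwidth. The kernel choice, bandwidth, and function names are illustrative assumptions, and the estimator is the simple biased version that averages over all pairs.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased estimate of MMD^2 between sample sets x and y (rows are samples)."""
    return (rbf_kernel(x, x, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 2))
fake_close = rng.normal(size=(500, 2))              # same distribution as `real`
fake_far = rng.normal(loc=2.0, size=(500, 2))       # shifted distribution

mmd_close = mmd2_biased(real, fake_close)
mmd_far = mmd2_biased(real, fake_far)
```

No critic network is trained here: the three kernel averages are computed straight from samples, which is the practical appeal of MMD.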

New cards

Contrast Density Ratio Models (e.g., standard JSD GANs) and Density Difference Models (e.g., Wasserstein IPMs) regarding the Non-Overlapping Support Problem.

Density Ratio Models (JSD / KL): Suffer from severe theoretical failures if the real and fake distributions share no overlapping support. In this regime, $D_{\text{KL}} \to \infty$ and $\text{JSD} \to \log 2$ , causing the gradients to drop to zero almost everywhere, halting all generator learning.

Density Difference Models (Wasserstein / MMD): Successfully track distances across disjoint spaces. By enforcing continuous smoothness criteria (Lipschitz or RKHS bounds), they provide clean, stable, and predictable gradients even across non-overlapping data manifolds.
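The contrast shows up even in a toy setting with two point masses on a 1D grid, a hypothetical setup chosen so that the supports never overlap. The helper names are assumptions for this sketch.

```python
import numpy as np

def jsd_discrete(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_discrete_1d(p, q, grid):
    """Wasserstein-1 on a uniform 1D grid: integral of |CDF_p - CDF_q|."""
    spacing = grid[1] - grid[0]
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * spacing

grid = np.arange(100, dtype=float)

def point_mass(index):
    p = np.zeros_like(grid)
    p[index] = 1.0
    return p

real = point_mass(10)
shifts = (5, 20, 60)
jsd_values = [jsd_discrete(real, point_mass(10 + s)) for s in shifts]
w1_values = [w1_discrete_1d(real, point_mass(10 + s), grid) for s in shifts]
```

The JSD is stuck at $\log 2$ for every shift, so it carries no information about how far the generator is from the data, while the Wasserstein-1 distance grows linearly with the shift and therefore still provides a useful direction for improvement.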

New cards

State the optimization caveat regarding the interpretation of the training loss curve in practical GAN implementations.

Because the generator minimizes a lower-bound approximation of an $f$-divergence or the Wasserstein metric during training, a decreasing empirical loss curve provides no mathematical guarantee that the true underlying divergence between the distributions is actually decreasing. This stands in sharp contrast to Variational Autoencoders, which maximize a strict, mathematically rigorous lower bound on data log-likelihood.