9. Variational Generative Models

Last updated 10:06 AM on 6/2/26

33 Terms

New cards

Explain the difference between Observable Data ( $x$ ) and Latent Variables ( $z$ ) using a structural analogy.

Observable Data ( $x \in \mathbb{R}^D$ ): High-dimensional, raw data vectors directly accessible to the system (e.g., raw pixel arrays, raw audio waveforms).

Latent Variables ( $z \in \mathbb{R}^L$ ): Lower-dimensional ( $L < D$ ) explanatory variables that are not directly observed but capture the underlying core mechanisms or factors generating the data.

Analogy: Plato's Myth of the Cave. Observable data ( $x$ ) represents the high-dimensional shadows projected onto a wall, while latent variables ( $z$ ) are the simpler, lower-dimensional structural forms creating those shadows.

New cards

Contrast the operational paradigms and density objectives of Autoregressive Models vs. Variational Autoencoders (VAEs).

Autoregressive Models: Implement exact, deterministic density estimation. They learn the exact conditional probability of each variable given a history of all past variables using deep sequence predictors (e.g., Transformers).

Variational Autoencoders (VAEs): Implement stochastic approximation. They optimize a probabilistic proxy for data likelihood called the Evidence Lower Bound (ELBO) across a probabilistic Encoder-Decoder network.

New cards

Contrast the density tracking and architectural elements of Generative Adversarial Networks (GANs), Flow-based Models, and Diffusion Models.

Generative Adversarial Networks (GANs): Implicit density modeling. They completely avoid explicit density equations, opting for a two-player zero-sum game where a Generator network ( $G(z)$ ) competes directly against an adversarial Discriminator network ( $D(x)$ ).

Flow-based Models: Exact density calculation. They map simple prior distributions directly to complex data spaces through a sequence of invertible, deterministic mathematical functions ( $f(x)$ and $f^{-1}(z)$ ).

Diffusion Models: Iterative stochastic evaluation. They gradually add Gaussian noise to the data via a fixed forward process and train a network to parameterize the reverse process that removes the noise step by step.

New cards

State the core mathematical objective function of $k$ -Means Clustering and its primary structural limitation as a generative model.

$\text{Objective: } J(M,Z)=\sum_{n=1}^{N}||x_n - \mu_{z_n}||^2$

Where $x_n$ is a data vector and $\mu_{z_n}$ is its assigned cluster centroid.

Core Limitation: It enforces a hard assignment mechanism ( $z_n^*$ belongs exclusively to a single cluster). It provides no probabilistic uncertainty, variance bounds, or continuous density tracking, making it incapable of processing or generating realistic data variations.
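
As a minimal sketch of the objective above, the following NumPy snippet evaluates $J(M, Z)$ for a toy dataset. The function names (`hard_assign`, `kmeans_objective`) and the four data points are illustrative assumptions, not part of the original card.

```python
import numpy as np

def hard_assign(X, centroids):
    """Return the index of the nearest centroid for every row of X."""
    # d2[n, k] = squared distance from point n to centroid k
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans_objective(X, centroids, assignments):
    """J(M, Z): sum of squared distances to the assigned centroids."""
    diffs = X - centroids[assignments]
    return float(np.sum(diffs ** 2))

# Toy data (assumed for illustration): two tight groups along the x-axis.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

z = hard_assign(X, centroids)          # one cluster index per point, no uncertainty
J = kmeans_objective(X, centroids, z)  # each point is 1 unit from its centroid
print(z, J)
```

Each point receives exactly one index, which is the hard assignment the card describes: nothing in the output expresses how confident the assignment is.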

New cards

Write out the probability density function for a Gaussian Mixture Model (GMM). What algorithms solve its parameters?

$p(y \mid \theta) = \sum_{k=1}^{K} \pi_k \mathcal{N}(y \mid \mu_k, \Sigma_k)$

Where $\pi_k$ represents categorical mixing weights ( $\sum \pi_k = 1$ ), and $\mu_k, \Sigma_k$ parameterize each separate Gaussian component.

Optimization: Because cluster assignments are hidden, it is solved under Maximum Likelihood Estimation via the Expectation-Maximization (EM) algorithm, where the E-step computes the posterior responsibilities ( $r_{nk}$ ) and the M-step updates the component parameters.
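
The E-step and M-step can be sketched for a one-dimensional, two-component mixture as follows. This is an illustrative sketch only: the synthetic data, the initial values, and the fixed 50 iterations are assumptions, and scalar variances stand in for the covariance matrices $\Sigma_k$.

```python
import numpy as np

def gaussian_pdf(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def e_step(y, pi, mu, var):
    """Responsibilities r[n, k] = posterior probability of component k for y_n."""
    weighted = pi[None, :] * gaussian_pdf(y[:, None], mu[None, :], var[None, :])
    return weighted / weighted.sum(axis=1, keepdims=True)

def m_step(y, r):
    """Re-estimate mixing weights, means, and variances from responsibilities."""
    Nk = r.sum(axis=0)
    pi = Nk / len(y)
    mu = (r * y[:, None]).sum(axis=0) / Nk
    var = (r * (y[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
    return pi, mu, var

# Synthetic data (assumed): two well-separated unit-variance clusters.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
for _ in range(50):
    r = e_step(y, pi, mu, var)
    pi, mu, var = m_step(y, r)
print(pi, mu, var)
```

Unlike $k$-means, every point keeps a soft responsibility for each component, and the rows of `r` sum to one.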

New cards

Explain why a Traditional Autoencoder (AE) fails to generate new samples, and how a Variational Autoencoder (VAE) structurally resolves this.

Traditional AE: Uses a deterministic bottleneck layer to compute a single static point vector. Because no constraints are placed on the latent space, it forms highly disconnected clusters and severe structural discontinuities. Sampling from empty voids yields corrupted, nonsensical outputs.

Variational Autoencoder (VAE): Converts the bottleneck into a continuous, parameterized probability distribution. The encoder outputs the distribution's parameters, a mean vector ( $\mu$ ) and a standard deviation vector ( $\sigma$ , equivalently the variance $\sigma^2$ ), and the latent variable $z$ is sampled from this distribution before entering the decoder.

New cards

Define Continuity and Completeness as they relate to the geometric optimization of a VAE's latent space.

Continuity: The geometric property ensuring that points positioned close together inside the latent representation manifold resolve to highly similar semantic content when mapped through the decoder network.

Completeness: The property ensuring that any arbitrary coordinate vector sampled directly from the fixed prior distribution space ( $\mathcal{N}(0, I)$ ) maps to a valid, realistic, and high-fidelity data generation.

New cards

Why is the true mathematical posterior distribution $p_\theta(z \mid x)$ in a VAE considered completely intractable during standard inference?

Evaluating the true posterior requires computing the marginal likelihood (evidence) of the observed data:

$p_\theta(z \mid x) = \frac{p_\theta(x, z)}{p_\theta(x)} \quad \text{where} \quad p_\theta(x) = \int p_\theta(x, z) dz$

Solving this integral directly means integrating over every possible configuration of the latent variable $z$ . With a nonlinear neural network decoder there is no closed-form solution, and numerical integration scales exponentially with the latent dimension, so the computation is intractable in practice. VAEs bypass this by introducing an encoder network ( $q_\phi(z \mid x)$ ) to approximate the true posterior.

New cards

Provide the step-by-step mathematical derivation of the Evidence Lower Bound (ELBO) starting from the KL Divergence between $q_\phi(z \mid x)$ and $p_\theta(z \mid x)$ .

Start with the analytical KL Divergence definition:

$D_{KL}\big(q_\phi(z \mid x) \parallel p_\theta(z \mid x)\big) = \mathbb{E}_{z \sim q_\phi}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right]$

Substitute $p_\theta(z \mid x) = \frac{p_\theta(x, z)}{p_\theta(x)}$ via Bayes' rule:

$= \mathbb{E}_{z \sim q_\phi}\left[\log \left( \frac{q_\phi(z \mid x) \cdot p_\theta(x)}{p_\theta(x, z)} \right)\right]$

Expand the log of products into addition/subtraction terms:

$= \mathbb{E}_{z \sim q_\phi}\left[\log q_\phi(z \mid x) - \log p_\theta(x, z) + \log p_\theta(x)\right]$

Since $\log p_\theta(x)$ is independent of the latent variable $z$ , extract it from the expectation:

$D_{KL}\big(q_\phi(z \mid x) \parallel p_\theta(z \mid x)\big) = \mathbb{E}_{z \sim q_\phi}\big[\log q_\phi(z \mid x) - \log p_\theta(x, z)\big] + \log p_\theta(x)$

Isolate the marginal log-likelihood ( $\log p_\theta(x)$ ):

$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big] + D_{KL}\big(q_\phi(z \mid x) \parallel p_\theta(z \mid x)\big)$

Decompose the joint distribution $p_\theta(x, z)$ into $p_\theta(x \mid z)p(z)$ :

$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \parallel p(z)\big) + D_{KL}\big(q_\phi(z \mid x) \parallel p_\theta(z \mid x)\big)$

Because $D_{KL}\big(q_\phi(z \mid x) \parallel p_\theta(z \mid x)\big) \ge 0$ , dropping it can only decrease the right-hand side, so what remains is a lower bound on $\log p_\theta(x)$ , known as the ELBO:

$\text{ELBO}(\phi, \theta; x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \parallel p(z)\big)$
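
The bound can be checked numerically on a toy model where everything is available in closed form. The model below is an assumption chosen for illustration (it is not from the card): prior $p(z) = \mathcal{N}(0, 1)$, likelihood $p(x \mid z) = \mathcal{N}(z, 1)$, so that $p(x) = \mathcal{N}(0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$.

```python
import numpy as np

def log_evidence(x):
    """log p(x) for the toy model: x ~ N(0, 2)."""
    return -0.5 * np.log(2.0 * np.pi * 2.0) - x ** 2 / 4.0

def elbo(x, m, s2):
    """ELBO for q(z | x) = N(m, s2) in the toy model."""
    # E_q[log p(x | z)] with p(x | z) = N(x; z, 1)
    recon = -0.5 * np.log(2.0 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL( N(m, s2) || N(0, 1) )
    kl = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
    return recon - kl

x = 1.3
# q equal to the true posterior: the dropped KL term is zero, so the bound is tight.
gap_at_posterior = log_evidence(x) - elbo(x, x / 2.0, 0.5)
# q equal to the prior: the gap equals KL(q || true posterior) > 0.
gap_elsewhere = log_evidence(x) - elbo(x, 0.0, 1.0)
print(gap_at_posterior, gap_elsewhere)
```

The gap between $\log p_\theta(x)$ and the ELBO is exactly the KL term that was dropped, which is why it vanishes only when $q$ matches the true posterior.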

New cards

Break down the specific individual operations and structural roles of Term 1 and Term 2 within the ELBO formula.

$\text{ELBO}(\phi, \theta; x) = \underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{Term 1: Reconstruction Loss}} - \underbrace{D_{KL}\big(q_\phi(z \mid x) \parallel p(z)\big)}_{\text{Term 2: KL Regularization Loss}}$

Term 1 (Reconstruction): Measures how effectively the decoder network converts the latent samples back into high-fidelity reproductions of the training data.
Term 2 (KL Regularization): Acts as a statistical penalty that forces the encoder's inferred distribution to match the simple Gaussian prior distribution $p(z) = \mathcal{N}(0, I)$ , ensuring the latent space remains smooth and continuous.

New cards

Explain the mechanical limitation of direct stochastic sampling in neural graphs, and write the mathematical formula for the Reparameterization Trick.

Limitation: Standard backpropagation requires fully deterministic paths to compute partial derivatives. Sampling a latent code $z$ directly from a stochastic distribution node $\mathcal{N}(\mu, \sigma^2)$ breaks this path, making it impossible to pass gradients back through the sampling layer to train the encoder.
Formula: The reparameterization trick isolates the stochasticity by shifting it to an external noise variable ( $\epsilon$ ):
$z = \mu + \sigma \odot \epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, I)$
This allows $\mu$ and $\sigma$ to remain deterministic nodes in the computational graph, enabling unobstructed backpropagation.
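
A small NumPy sketch can illustrate why the trick works. It uses a stand-in objective $f(z) = z^2$ (an assumption for illustration, not a VAE loss), for which $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the exact gradients are $2\mu$ and $2\sigma$. Because $z$ is a deterministic function of $\mu$, $\sigma$, and $\epsilon$, the chain rule applies sample by sample.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7

# All randomness lives in eps, which does not depend on mu or sigma.
eps = rng.standard_normal(200_000)
z = mu + sigma * eps

# Chain rule through the deterministic path:
#   df/dz = 2z,  dz/dmu = 1,  dz/dsigma = eps
grad_mu = np.mean(2.0 * z * 1.0)
grad_sigma = np.mean(2.0 * z * eps)

print(grad_mu, 2.0 * mu)        # Monte Carlo estimate vs exact value
print(grad_sigma, 2.0 * sigma)
```

In an automatic differentiation framework the same path is followed automatically once $z$ is written as `mu + sigma * eps`.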

New cards

Write out the loss objective function for a $\beta$-VAE and explain how configuring $\beta > 1$ alters latent features.

$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta D_{KL}\big(q_\phi(z \mid x) \parallel p(z)\big)$

Effect of $\beta > 1$ : Heavily penalizing the KL divergence term forces the model to strictly align with an independent diagonal prior distribution. This compresses the bottleneck, forcing the network to discover the most efficient, uncorrelated, and statistically independent latent factors (e.g., separating head rotation cleanly from facial expression).
Trade-off: Excessively high $\beta$ factors can over-compress the bottleneck, hurting reconstruction quality and washing out fine details.

New cards

Detail the multi-modal pipeline of DALL-E 1, explaining how it integrates discrete VAE tokenization with autoregressive text parsing.

1. Discrete VAE Tokenization (VQ-VAE): To circumvent the massive memory footprint of raw pixel spaces, raw images are passed through a Vector Quantized VAE, compressing them into dense grids of discrete codebook tokens.

2. Text Representation: Descriptive language prompt strings are tokenized into contextual representations using standard Byte Pair Encoding (BPE).

3. Autoregressive Integration: Both the text tokens and the discrete image tokens are concatenated into a single continuous sequence. A massive autoregressive Transformer processes this sequence, learning to model their joint distribution to synthesize new images directly from novel textual prompts.

New cards

Define Undercomplete and Overcomplete representations in Autoencoders. How does each regime prevent the network from learning a trivial identity mapping?

Undercomplete ( $L \ll D$ ): Constrains the latent space dimension ( $L$ ) to be significantly smaller than the input dimension ( $D$ ). This forces the data through a narrow structural bottleneck, compelling the network to learn only the most salient features.
Overcomplete ( $L \gg D$ ): Keeps the latent space dimension larger than the input dimension but limits network capacity via structural regularization. This prevents the identity mapping by injecting input noise, penalizing derivatives, or forcing activation sparsity.

New cards

Under what explicit architectural conditions is a Bottleneck Autoencoder mathematically equivalent to Principal Component Analysis (PCA)?

An autoencoder becomes mathematically equivalent to PCA if it satisfies three conditions:

It features a single hidden bottleneck layer ( $L < D$ ).
It utilizes strictly linear activations ( $z = W_1x$ and $\hat{x} = W_2z$ ).
It minimizes the standard squared reconstruction error.

Under these constraints (and with centered data), the combined weight matrix $\hat{W} = W_2W_1$ is forced to learn an orthogonal projection onto the subspace spanned by the first $L$ eigenvectors of the data's empirical covariance matrix.
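
The following sketch does not train an autoencoder. Instead it builds the PCA solution directly and checks the properties the card attributes to $\hat{W}$. The random data, the choice $L = 2$, and the orthonormal choice $W_1 = U^\top$, $W_2 = U$ are assumptions for illustration; a trained linear autoencoder would span the same subspace but need not have orthonormal weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))
X = X - X.mean(axis=0)                   # center the data
L = 2                                    # bottleneck width (L < D = 5)

cov = X.T @ X / len(X)                   # empirical covariance
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
U = eigvecs[:, ::-1][:, :L]              # top-L eigenvectors, shape (D, L)

W1 = U.T                                 # linear encoder: z = W1 x
W2 = U                                   # linear decoder: x_hat = W2 z
W_hat = W2 @ W1                          # combined map

X_hat = X @ W_hat.T
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
discarded = eigvals[::-1][L:].sum()      # variance in the dropped directions
print(mse, discarded)
```

The reconstruction error equals the sum of the discarded eigenvalues, and `W_hat` is symmetric and idempotent, which is what makes it an orthogonal projection.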

New cards

Explain the mathematical relationship between a Denoising Autoencoder's (DAE) residual error and the data distribution as noise variance approaches zero ($\sigma \to 0$).

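When trained with Gaussian corruption of variance $\sigma^2$ and squared error loss, the residual error vector field approximates the score function of the data density, up to a scale factor of $\sigma^2$ :

$\text{Residual Error: } e(x) = r(x) - x \approx \sigma^2 \nabla_x \log p(x)$

Here $r(\cdot)$ is the learned reconstruction function, evaluated at the point $x$ .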

Geometrically, this means the DAE learns a vector field where all error vectors point directly inward toward the nearest region of high probability density along the lower-dimensional data manifold.

New cards

Write out the explicit regularization penalty term for a Contractive Autoencoder (CAE) and explain its variables.

The CAE appends the Frobenius norm of the encoder's Jacobian matrix directly to the loss:

$\Omega(z, x) = \lambda \left\| \frac{\partial f_e(x)}{\partial x} \right\|_F^2 = \lambda \sum_k \|\nabla_x h_k(x)\|_2^2$

$\lambda$ : The regularization scaling hyperparameter.
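$f_e(x)$ : The encoder function, whose Jacobian is taken with respect to the input $x$ .
$h_k(x)$ : The activation value of the $k$ -th hidden latent unit in the encoder (the $k$ -th component of $f_e(x)$ ).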
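
For a concrete case, assume a single-layer sigmoid encoder $h = \text{sigmoid}(Wx + b)$ (an assumption, since the card does not fix the encoder architecture). Row $k$ of the Jacobian is then $h_k(1 - h_k) W_k$, so the penalty has a cheap closed form. The sketch below compares it with a finite-difference Jacobian. All names and the random weights are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encoder(x, W, b):
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b, lam):
    """lam * ||d encoder / d x||_F^2 for the sigmoid encoder."""
    h = encoder(x, W, b)
    # Row k of the Jacobian is h_k (1 - h_k) * W[k], so its squared norm factorizes.
    return lam * np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

def numerical_jacobian(x, W, b, step=1e-6):
    """Central finite differences, used only to check the closed form."""
    J = np.zeros((W.shape[0], x.shape[0]))
    for i in range(x.shape[0]):
        dx = np.zeros_like(x)
        dx[i] = step
        J[:, i] = (encoder(x + dx, W, b) - encoder(x - dx, W, b)) / (2.0 * step)
    return J

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
x = rng.standard_normal(4)
lam = 0.1

penalty = contractive_penalty(x, W, b, lam)
penalty_fd = lam * np.sum(numerical_jacobian(x, W, b) ** 2)
print(penalty, penalty_fd)
```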

New cards

How do the reconstruction loss and the Jacobian penalty interact dynamically to map a data manifold in a Contractive Autoencoder (CAE)?

The Jacobian penalty forces the encoder to become flat or constant, actively minimizing sensitivity to small variations in the input vector (local contraction).
The reconstruction loss counters this by demanding that the model preserve vital data traits.

Together, they force the network to remain highly sensitive only along the specific directional vectors that define the true underlying data manifold, while remaining completely insensitive to variations perpendicular to it.

New cards

Why do Contractive Autoencoders (CAEs) often utilize tied weights ( $W_{\text{decoder}} = W_{\text{encoder}}^T$ ), and what is the architecture's primary computational limitation?

Tied Weights Rationale: Prevents a degenerate solution where the encoder artificially shrinks the Jacobian by multiplying the input by an infinitesimally small scalar $\epsilon$ , while the decoder trivially cancels it out by multiplying by $1/\epsilon$ .

Limitation: CAEs are highly computationally expensive and slow to train because tracking and calculating the full encoder Jacobian matrix during backpropagation scales poorly with layer dimensions.

New cards

Contrast $\ell_1$ Activity Regularization with KL Divergence Frequency Matching as mechanisms for enforcing sparsity in an autoencoder.

$\ell_1$ Activity Regularization ( $\lambda \|z\|_1$ ): Applies an absolute penalty to individual activations. This can be aggressive, often leading to a regime where specific neurons are completely and permanently deactivated across the entire dataset.

KL Divergence Frequency Matching: Tracks the average empirical activation frequency $q_k$ of each hidden unit across a minibatch and penalizes its divergence from a tiny target distribution $p$ (e.g., $p = 0.1$ ):

$\Omega = \lambda \sum_k D_{KL}(p \parallel q_k)$
This ensures that roughly 90% of the neurons are quiet at any given time step, preventing individual units from shutting down permanently across the whole dataset.
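
Both penalties can be sketched in a few lines. The sketch assumes activations lie in $(0, 1)$ (e.g., sigmoid units) and treats $p$ and $q_k$ as Bernoulli parameters, so $D_{KL}(p \parallel q_k) = p \log \frac{p}{q_k} + (1 - p) \log \frac{1 - p}{1 - q_k}$. The function names and the constant toy activations are hypothetical.

```python
import numpy as np

def l1_penalty(Z, lam):
    """lam * ||z||_1, averaged over the minibatch rows of Z."""
    return lam * np.abs(Z).sum(axis=1).mean()

def kl_sparsity_penalty(Z, p, lam, tiny=1e-8):
    """lam * sum_k KL(p || q_k), with q_k the mean activation of unit k."""
    q = np.clip(Z.mean(axis=0), tiny, 1.0 - tiny)
    kl = p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))
    return lam * kl.sum()

# Toy minibatches of 10 examples and 4 hidden units (assumed values).
Z_on_target = np.full((10, 4), 0.1)   # every unit averages exactly p = 0.1
Z_too_active = np.full((10, 4), 0.5)  # every unit fires far too often

on_target = kl_sparsity_penalty(Z_on_target, p=0.1, lam=1.0)
too_active = kl_sparsity_penalty(Z_too_active, p=0.1, lam=1.0)
print(on_target, too_active, l1_penalty(Z_on_target, lam=1.0))
```

The KL penalty is zero when a unit's average activation matches the target and grows in both directions, so it also discourages units from falling silent, whereas the $\ell_1$ penalty is minimized only by zero activation.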

New cards

State the explicit mathematical likelihood distributions used by a VAE Decoder ( $p_\theta(x \mid z)$ ) for both continuous data and binary data.

Continuous Data (Gaussian Likelihood):

$p_\theta(x \mid z) = \mathcal{N}(x \mid f_d(z; \theta), \sigma^2 I)$

Binary Data (Bernoulli Likelihood):
$p_\theta(x \mid z) = \prod_{i=1}^D \text{Ber}(x_i \mid f_d(z; \theta))$
Where $f_d(z; \theta)$ represents the non-linear decoder neural network transformation.

New cards

Use Jensen's Inequality to prove that the VAE Evidence Lower Bound (ELBO) establishes a strict analytical lower bound on the true marginal log-evidence $\log p_\theta(x)$ .

Start with the continuous integral form of the ELBO:

$\mathcal{L}(\theta, \phi \mid x) = \int q_\phi(z \mid x) \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)} dz$

Because the logarithm function is concave, applying Jensen's Inequality allows us to pull the log operation outside the expectation integral:

$\le \log \int q_\phi(z \mid x) \frac{p_\theta(x, z)}{q_\phi(z \mid x)} dz$

Cancel the approximate posterior $q_\phi(z \mid x)$ terms inside the integrand:

$= \log \int p_\theta(x, z) dz$

By definition, integrating out the latent variable $z$ yields the true marginal evidence:

$= \log p_\theta(x)$

New cards

Write out the fast, closed-form analytical equation for the KL Regularization Term when matching an inferred Gaussian distribution against a standard normal prior $p(z) = \mathcal{N}(0, I)$ .

$D_{KL}(q \parallel p) = -\frac{1}{2} \sum_{k=1}^K \left( \log \sigma_k^2 - \sigma_k^2 - \mu_k^2 + 1 \right)$

Where $K$ is the total dimensionality of the latent space bottleneck, and $\mu_k, \sigma_k^2$ are the localized means and variances output by the inference encoder network.
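
The closed form can be sanity-checked against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$. In this sketch the encoder outputs are assumed to be a mean and a log-variance (a common but not universal parameterization), and the particular values of `mu` and `log_var` are made up for illustration.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * np.sum(log_var - np.exp(log_var) - mu ** 2 + 1.0)

mu = np.array([0.5, -1.0])
log_var = np.log(np.array([0.25, 2.0]))
kl_closed = kl_to_standard_normal(mu, log_var)

# Monte Carlo check: average log q(z) - log p(z) over samples z ~ q.
rng = np.random.default_rng(0)
sigma = np.exp(0.5 * log_var)
z = mu + sigma * rng.standard_normal((400_000, 2))
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + log_var + np.log(2.0 * np.pi), axis=1)
log_p = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)
```

The closed form needs no sampling, which is why it is preferred for this term during training.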

New cards

Contrast Directed PGMs, Autoregressive Models (ARMs), and Generative Adversarial Networks (GANs) on Density Evaluation and Sampling Speed.

Directed PGMs: Features Exact, Fast Density Evaluation paired with Fast Sampling Speed over sparse Directed Acyclic Graphs (DAGs).

Autoregressive Models (ARMs): Features Exact, Fast Density Evaluation but suffers from Slow, Sequential Sampling Speed because each token must be generated iteratively.
Generative Adversarial Networks (GANs): Density evaluation is Not Available (implicit modeling), but features Fast, Parallel Sampling Speed via a single forward pass through the generator network.

New cards

Compare VAEs, Normalizing Flows, and Diffusion Models regarding their Latent Space Dimension constraints.

Variational Autoencoders (VAEs): Enforces a Compressed latent space ( $\mathbb{R}^L$ where $L \ll D$ ), squeezing data through a probabilistic bottleneck.

Normalizing Flows: Requires an Uncompressed latent space ( $\mathbb{R}^D$ where $L = D$ ) to maintain full mathematical invertibility across its transformations.
Diffusion Models: Utilizes an Uncompressed latent space ( $\mathbb{R}^D$ where $L = D$ ) matching the exact structural dimensions of the input across its forward-reverse denoising steps.

New cards

Explain the mathematical formulation and structural rationale behind Latent Space Interpolation.

Given two anchor inputs $x_1$ and $x_2$ , their structural latent embeddings are extracted via an encoder: $z_1 = e(x_1)$ and $z_2 = e(x_2)$ . A linear blend is calculated across the latent space:

$z = \lambda z_1 + (1 - \lambda)z_2 \quad \text{where} \quad 0 \le \lambda \le 1$

Decoding this path ( $x' = d(z)$ ) synthesizes a smooth semantic morph between the inputs. This is highly effective because while the data manifold in raw pixel space is highly curved and nonlinear, the learned latent manifold has approximately zero curvature.
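
A minimal sketch of the blending step, operating directly on latent vectors. The encoder and decoder are omitted, and `interpolate_latents` and the two example vectors are hypothetical.

```python
import numpy as np

def interpolate_latents(z1, z2, num_steps):
    """Return num_steps latent codes z = lam * z1 + (1 - lam) * z2, lam from 0 to 1."""
    lams = np.linspace(0.0, 1.0, num_steps)
    return np.stack([lam * z1 + (1.0 - lam) * z2 for lam in lams])

# Stand-ins for e(x_1) and e(x_2).
z1 = np.array([1.0, 0.0])
z2 = np.array([0.0, 2.0])

path = interpolate_latents(z1, z2, num_steps=5)
print(path)  # first row is z2 (lam = 0), last row is z1 (lam = 1)
```

Each row of `path` would then be passed through the decoder to render one frame of the morph.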

New cards

Write out the mathematical derivation for Latent Space Attribute Arithmetic (e.g., adding sunglasses to a face).

1. Compute an attribute offset vector $\Delta$ by isolating the average embeddings of images possessing the target attribute ( $z^+$ ) and subtracting the average embeddings of images lacking it ( $z^-$ ):

$\Delta = \frac{1}{N_+}\sum z^+ - \frac{1}{N_-}\sum z^-$

2. To project this attribute predictably onto a novel unaligned image vector's embedding ( $z_{\text{new}}$ ), scale and add the offset vector before decoding:

$z_{\text{modified}} = z_{\text{new}} + s\Delta$

Where $s$ is a scalar hyperparameter controlling attribute intensity.
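
The two steps translate directly into array operations. In this sketch the embeddings are small made-up arrays standing in for encoder outputs, and the function names are hypothetical.

```python
import numpy as np

def attribute_offset(Z_pos, Z_neg):
    """Delta: mean embedding with the attribute minus mean embedding without it."""
    return Z_pos.mean(axis=0) - Z_neg.mean(axis=0)

def apply_attribute(z_new, delta, s):
    """Shift a latent code along the attribute direction with intensity s."""
    return z_new + s * delta

# Made-up embeddings: rows are images, columns are latent dimensions.
Z_pos = np.array([[1.0, 2.0], [3.0, 2.0]])   # e.g., faces with sunglasses
Z_neg = np.array([[0.0, 1.0], [0.0, 3.0]])   # e.g., faces without sunglasses

delta = attribute_offset(Z_pos, Z_neg)
z_modified = apply_attribute(np.array([5.0, 5.0]), delta, s=0.5)
print(delta, z_modified)
```

Decoding `z_modified` would then produce the edited image, with `s` trading off attribute strength against fidelity to the original.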

New cards

What is the core contradiction when evaluating Continuous Probability Densities on discrete image/audio data, and how does Uniform Dequantization resolve it?

Raw images/audio are stored as discrete integers (e.g., pixel intensities from 0 to 255), yet are typically evaluated using continuous probability density functions ( $p(x) \ge 0$ ). Because continuous density peaks can infinitely exceed 1, the continuous Negative Log-Likelihood (NLL) can anomalously tend toward negative infinity.

Uniform Dequantization resolves this by adding uniform noise $u \sim \mathcal{U}(0, 1)$ directly to the discrete coordinates. Evaluating the density of this dequantized continuous version establishes a mathematically rigorous lower bound on the true discrete model log-likelihood via Jensen's Inequality.

New cards

Provide an algebraic example proving that a high Log-Likelihood Score does not guarantee high-quality sample generation.

Consider a model $q_2(x) = 0.01\, q_0(x) + 0.99\, q_1(x)$ that mixes 1% of an optimal density model $q_0(x)$ with 99% of a white noise model $q_1(x)$ . This model will generate poor, noisy samples 99% of the time, yet its log-likelihood falls short of the perfect model's by at most a small additive constant, which is negligible once spread across the many pixels of an image:

$\log q_2(x) \ge \log [0.01 q_0(x)] = \log q_0(x) - \log(100) \approx \log q_0(x) - 4.6$

The model achieves a stellar likelihood score despite generating visual trash.
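
The bound can be verified numerically in one dimension. The specific densities are assumptions made for illustration: the "perfect" model $q_0$ is $\mathcal{N}(0, 1)$ and the noise model $q_1$ is a very broad $\mathcal{N}(0, 50^2)$.

```python
import numpy as np

def log_normal(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)          # data drawn from the good model q0

log_q0 = log_normal(x, 0.0, 1.0)       # good model
log_q1 = log_normal(x, 0.0, 50.0)      # broad noise model
# log of the mixture 0.01 * q0 + 0.99 * q1, computed stably in log space
log_q2 = np.logaddexp(np.log(0.01) + log_q0, np.log(0.99) + log_q1)

max_drop = np.max(log_q0 - log_q2)     # worst-case loss in log-likelihood
print(max_drop, np.log(100.0))
```

The drop never exceeds $\log 100$ per data point regardless of dimensionality, so for an image with thousands of pixels the per-pixel difference is tiny even though almost every sample comes from the noise component.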

New cards

Write out the mathematical equation for the Inception Score (IS). Explain its two internal entropy objectives.

$\text{IS} = \exp \left( \mathbb{E}_{x \sim p_\theta(x)} \left[ D_{KL}(p_{\text{disc}}(Y \mid x) \parallel p_\theta(Y)) \right] \right)$

Expanding the logs yields: $\log(\text{IS}) = H(p_\theta(Y)) - \mathbb{E}_{x \sim p_\theta(x)} [H(p_{\text{disc}}(Y \mid x))]$

Maximize $H(p_\theta(Y))$ (Marginal Entropy): The model must generate samples evenly balanced across all known classes, ensuring high variety and avoiding label drop.
Minimize $H(p_{\text{disc}}(Y \mid x))$ (Conditional Entropy): Individual generated samples must contain distinct, sharp class features that allow the classifier to identify a single label with high confidence.
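
Given a matrix of classifier outputs, the score is a few lines of NumPy. This sketch assumes `probs[n, c]` holds $p_{\text{disc}}(Y = c \mid x_n)$ for generated samples $x_n$. The two toy matrices are made up to show the extremes, and no real Inception network is involved.

```python
import numpy as np

def inception_score(probs, tiny=1e-12):
    """exp of the mean KL between each p(Y | x) and the marginal p(Y)."""
    marginal = probs.mean(axis=0)
    kl = np.sum(probs * (np.log(probs + tiny) - np.log(marginal + tiny)), axis=1)
    return float(np.exp(kl.mean()))

# Every sample is classified with full confidence, and classes are balanced.
confident_and_diverse = np.eye(4)
# Every sample yields a uniform (uninformative) prediction.
uninformative = np.full((4, 4), 0.25)

best = inception_score(confident_and_diverse)
worst = inception_score(uninformative)
print(best, worst)
```

With four classes the score ranges from 1 (no class information in any sample) up to 4 (sharp predictions spread evenly over the classes), matching the two entropy objectives above.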

New cards

Write out the formula for the Fréchet Inception Distance (FID) and state its primary statistical limitation.

FID treats deep features extracted via an Inception classifier as multi-dimensional Gaussians, calculating the squared Wasserstein-2 (Fréchet) distance between the real data ( $\mu_d, \Sigma_d$ ) and model samples ( $\mu_m, \Sigma_m$ ):

$\text{FID} = \|\mu_m - \mu_d\|_2^2 + \text{tr}\left( \Sigma_d + \Sigma_m - 2(\Sigma_d\Sigma_m)^{1/2} \right)$

Limitation: FID is a biased estimator. Scores shift systematically with the number of samples used to estimate the empirical means and covariances, so values computed from different sample counts are not directly comparable. The Kernel Inception Distance (KID), an estimate of the squared maximum mean discrepancy (MMD) in the same feature space, can be used to avoid this bias.
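
The formula can be sketched on arbitrary feature arrays. Random vectors stand in for Inception features here, which is an assumption for illustration only. To stay within NumPy, the trace of the matrix square root is computed from the eigenvalues of $\Sigma_d \Sigma_m$, which are real and non-negative for covariance matrices.

```python
import numpy as np

def fid_from_features(feats_model, feats_data):
    """FID between two sets of feature vectors (rows are samples)."""
    mu_m, mu_d = feats_model.mean(axis=0), feats_data.mean(axis=0)
    cov_m = np.cov(feats_model, rowvar=False)
    cov_d = np.cov(feats_data, rowvar=False)
    # tr((cov_d cov_m)^{1/2}) is the sum of square roots of the product's eigenvalues.
    eigvals = np.linalg.eigvals(cov_d @ cov_m)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_m - mu_d) ** 2)
    return float(mean_term + np.trace(cov_d) + np.trace(cov_m) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 3))   # stand-in for real-data features

fid_same = fid_from_features(feats, feats)           # identical statistics
fid_shifted = fid_from_features(feats + 3.0, feats)  # same covariance, shifted mean
print(fid_same, fid_shifted)
```

Shifting every feature by 3 in three dimensions leaves the covariance term at zero and contributes $3 \cdot 3^2 = 27$ through the mean term.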

New cards

How do Precision and Recall in generative evaluation decouple sample quality from sample diversity?

Aggregate metrics like FID collapse all failure modes into a single scalar. Precision and recall fix this by using $k$ -nearest neighbor boundaries in a classifier's feature space:

Precision (Quality Metric): Quantifies the fraction of model-generated samples that fall inside the local neighborhood clusters of the true data distribution. High precision means no low-quality or corrupted samples.
Recall (Diversity/Coverage Metric): Quantifies the fraction of real validation samples that fall inside the local neighborhood clusters of the model distribution. High recall means the model has successfully avoided mode collapse.

New cards

Write out the mathematical definitions for Precision and Recall based on the binary $k$-nearest neighbor coverage function $f_k(\phi, \Phi)$.

Given the binary coverage function $f_k$ (which outputs 1 if feature vector $\phi$ falls within the nearest-neighbor radius of the reference set $\Phi$ ), precision and recall are defined as:

$\text{precision}(\Phi_{\text{model}}, \Phi_{\text{data}}) = \frac{1}{|\Phi_{\text{model}}|} \sum_{\phi \in \Phi_{\text{model}}} f_k(\phi, \Phi_{\text{data}})$

$\text{recall}(\Phi_{\text{model}}, \Phi_{\text{data}}) = \frac{1}{|\Phi_{\text{data}}|} \sum_{\phi \in \Phi_{\text{data}}} f_k(\phi, \Phi_{\text{model}})$
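
A brute-force sketch of these definitions follows. It assumes the common construction in which each reference point's radius is the distance to its $k$-th nearest neighbor within the reference set, and $f_k$ returns 1 when the query lies inside at least one such ball. The toy 2-D "features" are made up to mimic a mode-collapsed model.

```python
import numpy as np

def pairwise_dist(A, B):
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def knn_radii(ref, k):
    """Distance from each reference point to its k-th nearest neighbor in ref."""
    d = pairwise_dist(ref, ref)
    return np.sort(d, axis=1)[:, k]   # column 0 is the zero self-distance

def coverage(query, ref, k):
    """Mean of f_k(phi, ref) over all phi in query."""
    radii = knn_radii(ref, k)
    d = pairwise_dist(query, ref)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

def precision(model, data, k):
    return coverage(model, data, k)

def recall(model, data, k):
    return coverage(data, model, k)

# Real data has two modes; the model only covers the one near the origin.
data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                 [10.0, 0.0], [11.0, 0.0], [10.0, 1.0]])
model = np.array([[0.2, 0.2], [0.5, 0.1], [0.1, 0.5]])

prec = precision(model, data, k=1)
rec = recall(model, data, k=1)
print(prec, rec)
```

Every generated point lands inside the real data's neighborhoods (high precision), while most real points, including the entire second mode, fall outside the model's neighborhoods (low recall).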