Explain the difference between Observable Data (x) and Latent Variables (z) using a structural analogy.
Observable Data ($x \in \mathbb{R}^D$): High-dimensional, raw data vectors directly accessible to the system (e.g., raw pixel arrays, raw audio waveforms).
Latent Variables ($z \in \mathbb{R}^L$): Lower-dimensional ($L < D$) explanatory variables that are not directly observed but capture the underlying core mechanisms or factors generating the data.
Analogy: Plato's Myth of the Cave—Observable data (x) represents the high-dimensional shadows projected onto a wall, while latent variables (z) are the simpler, lower-dimensional structural forms creating those shadows.
Contrast the operational paradigms and density objectives of Autoregressive Models vs. Variational Autoencoders (VAEs).
Autoregressive Models: Implement exact, deterministic density estimation. They learn the exact conditional probability of each variable given a history of all past variables using deep sequence predictors (e.g., Transformers).
Variational Autoencoders (VAEs): Implement stochastic approximation. They optimize a probabilistic proxy for data likelihood called the Evidence Lower Bound (ELBO) across a probabilistic Encoder-Decoder network.
Contrast the density tracking and architectural elements of Generative Adversarial Networks (GANs), Flow-based Models, and Diffusion Models.
Generative Adversarial Networks (GANs): Implicit density modeling. They completely avoid explicit density equations, opting for a two-player zero-sum game where a Generator network (G(z)) competes directly against an adversarial Discriminator network (D(x)).
Flow-based Models: Exact density calculation. They map simple prior distributions directly to complex data spaces through a sequence of invertible, deterministic mathematical functions (f(x) and f−1(z)).
Diffusion Models: Iterative stochastic evaluation. They systematically add Gaussian noise to the data via a forward process and train a network to reverse that process step by step.
State the core mathematical objective function of k-Means Clustering and its primary structural limitation as a generative model.
Objective: $J(M, Z) = \sum_{n=1}^{N} \|x_n - \mu_{z_n}\|^2$
Where $x_n$ is a data vector and $\mu_{z_n}$ is its assigned cluster centroid.
Core Limitation: It enforces a hard assignment mechanism (zn∗ belongs exclusively to a single cluster). It provides no probabilistic uncertainty, variance bounds, or continuous density tracking, making it incapable of processing or generating realistic data variations.
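A minimal sketch of the objective and the hard-assignment step, assuming small lists of tuples as data. The function names and the toy points are illustrative, not part of the original card.

```python
def squared_distance(x, mu):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mu))


def hard_assign(points, centroids):
    """Assign each point to the index of its single nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(x, centroids[k]))
        for x in points
    ]


def kmeans_objective(points, centroids, assignments):
    """J(M, Z): sum of squared distances to the assigned centroids."""
    return sum(squared_distance(x, centroids[z]) for x, z in zip(points, assignments))


# Toy data (made up): two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
centroids = [(0.0, 0.5), (5.5, 5.0)]
assignments = hard_assign(points, centroids)
objective = kmeans_objective(points, centroids, assignments)
```

Each point receives exactly one integer label, so nothing in the output expresses how uncertain an assignment is.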
Write out the probability density function for a Gaussian Mixture Model (GMM). What algorithms solve its parameters?
$p(y \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(y \mid \mu_k, \Sigma_k)$
Where $\pi_k$ represents categorical mixing weights ($\sum_k \pi_k = 1$), and $\mu_k, \Sigma_k$ parameterize each separate Gaussian component.
Optimization: Because cluster assignments are hidden, it is solved under Maximum Likelihood Estimation via the Expectation-Maximization (EM) algorithm, where the E-step computes the posterior responsibilities (rnk) and the M-step updates the component parameters.
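As a rough illustration of the density and the E-step, here is a one-dimensional sketch (scalar variances instead of full covariance matrices, which is a simplifying assumption). The parameter values are made up.

```python
import math


def normal_pdf(y, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_density(y, weights, means, variances):
    """p(y | theta) = sum_k pi_k * N(y | mu_k, var_k)."""
    return sum(w * normal_pdf(y, m, v) for w, m, v in zip(weights, means, variances))


def responsibilities(y, weights, means, variances):
    """E-step for one point: posterior probability r_k of each component."""
    joint = [w * normal_pdf(y, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


# Two equally weighted components centred at -2 and +2 (illustrative values).
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
r_middle = responsibilities(0.0, weights, means, variances)  # ambiguous point
r_right = responsibilities(2.0, weights, means, variances)   # clearly component 1
```

Unlike k-means, the point halfway between the components gets a soft 50/50 split rather than a forced hard label.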
Explain why a Traditional Autoencoder (AE) fails to generate new samples, and how a Variational Autoencoder (VAE) structurally resolves this.
Traditional AE: Uses a deterministic bottleneck layer to compute a single static point vector. Because no constraints are placed on the latent space, it forms highly disconnected clusters and severe structural discontinuities. Sampling from empty voids yields corrupted, nonsensical outputs.
Variational Autoencoder (VAE): Converts the bottleneck into a continuous, parameter-controlled probability distribution. The encoder outputs statistical parameters, a mean vector ($\mu$) and a standard-deviation vector ($\sigma$, with variance $\sigma^2$), and the latent variable z is sampled from this distribution before entering the decoder.
Define Continuity and Completeness as they relate to the geometric optimization of a VAE's latent space.
Continuity: The geometric property ensuring that points positioned close together inside the latent representation manifold resolve to highly similar semantic content when mapped through the decoder network.
Completeness: The property ensuring that any arbitrary coordinate vector sampled directly from the fixed prior distribution space (N(0,I)) maps to a valid, realistic, and high-fidelity data generation.
Why is the true mathematical posterior distribution pθ(z∣x) in a VAE considered completely intractable during standard inference?
Evaluating the true posterior requires computing the marginal likelihood of the observed data:
$p_\theta(z \mid x) = \frac{p_\theta(x, z)}{p_\theta(x)}$ where $p_\theta(x) = \int p_\theta(x, z)\, dz$
To solve this integral directly, the model would have to integrate over every configuration of the latent space z. With a nonlinear neural-network decoder the integral has no closed form, and numerical integration scales exponentially with the latent dimension, so the computation is intractable in practice. VAEs bypass this by introducing an encoder network ($q_\phi(z \mid x)$) to approximate the true posterior.
Provide the step-by-step mathematical derivation of the Evidence Lower Bound (ELBO) starting from the KL Divergence between qϕ(z∣x) and pθ(z∣x).
Start with the analytical KL Divergence definition:
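$D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) = \mathbb{E}_{z \sim q_\phi}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right]$
Substitute $p_\theta(z \mid x) = \frac{p_\theta(x, z)}{p_\theta(x)}$ via Bayes' rule:
$= \mathbb{E}_{z \sim q_\phi}\left[\log \frac{q_\phi(z \mid x)\, p_\theta(x)}{p_\theta(x, z)}\right]$
Expand the log of the product and quotient into addition/subtraction terms:
$= \mathbb{E}_{z \sim q_\phi}\left[\log q_\phi(z \mid x) - \log p_\theta(x, z) + \log p_\theta(x)\right]$
Since $\log p_\theta(x)$ is independent of the latent variable z, extract it from the expectation:
$D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) = \mathbb{E}_{z \sim q_\phi}\left[\log q_\phi(z \mid x) - \log p_\theta(x, z)\right] + \log p_\theta(x)$
Isolate the marginal log-likelihood ($\log p_\theta(x)$):
$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi}\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right] + D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$
Decompose the joint distribution $p_\theta(x, z)$ into $p_\theta(x \mid z)\, p(z)$:
$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi}\left[\log p_\theta(x \mid z)\right] - D_{KL}(q_\phi(z \mid x) \,\|\, p(z)) + D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$
Because $D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) \ge 0$, dropping it leaves a lower bound on $\log p_\theta(x)$ known as the ELBO. The bound is tight exactly when $q_\phi(z \mid x)$ equals the true posterior:
$\text{ELBO}(\phi, \theta; x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}(q_\phi(z \mid x) \,\|\, p(z))$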
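The decomposition can be checked numerically on a toy model. This sketch assumes a binary latent variable so that every expectation is an exact two-term sum; all probabilities are made-up illustrative values.

```python
import math

# Toy model with a binary latent z (values are illustrative assumptions).
prior = [0.5, 0.5]        # p(z)
likelihood = [0.9, 0.2]   # p(x | z) for one fixed observation x
q = [0.6, 0.4]            # an arbitrary approximate posterior q(z | x)

# Exact evidence and exact posterior, available only because z is tiny.
evidence = sum(pz * px for pz, px in zip(prior, likelihood))
posterior = [pz * px / evidence for pz, px in zip(prior, likelihood)]


def kl(a, b):
    """KL divergence between two discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


reconstruction = sum(qz * math.log(px) for qz, px in zip(q, likelihood))
elbo = reconstruction - kl(q, prior)   # Term 1 minus Term 2
gap = kl(q, posterior)                 # the KL term that was dropped
log_evidence = math.log(evidence)
```

Here `elbo + gap` reproduces `log_evidence`, and the gap vanishes only if `q` is set to `posterior`.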
Break down the specific individual operations and structural roles of Term 1 and Term 2 within the ELBO formula.
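$\text{ELBO}(\phi, \theta; x) = \underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]}_{\text{Term 1: Reconstruction Loss}} - \underbrace{D_{KL}(q_\phi(z \mid x) \,\|\, p(z))}_{\text{Term 2: KL Regularization Loss}}$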
Term 1 (Reconstruction): Measures how effectively the decoder network converts the latent samples back into high-fidelity reproductions of the training data.
Term 2 (KL Regularization): Acts as a statistical penalty that forces the encoder's inferred distribution to match the simple Gaussian prior distribution p(z)=N(0,I), ensuring the latent space remains smooth and continuous.
Explain the mechanical limitation of direct stochastic sampling in neural graphs, and write the mathematical formula for the Reparameterization Trick.
Limitation: Standard backpropagation requires fully deterministic paths to compute partial derivatives. Sampling a latent code z directly from a stochastic distribution node $\mathcal{N}(\mu, \sigma^2)$ breaks this path, making it impossible to pass gradients back through the sampling layer to train the encoder.
Formula: The reparameterization trick isolates the stochasticity by shifting it to an external noise variable ($\epsilon$):
$z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$
This allows $\mu$ and $\sigma$ to remain deterministic nodes in the computational graph, enabling unobstructed backpropagation.
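A minimal sketch of the sampling step, assuming plain Python lists for the encoder outputs (a real model would use an autodiff framework). The function name and example values are hypothetical.

```python
import random


def reparameterize(mu, sigma, rng):
    """Return z = mu + sigma * eps (elementwise) and the eps that was drawn."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]   # all randomness lives here
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps


rng = random.Random(0)                 # seeded only for reproducibility
mu, sigma = [1.0, -2.0], [0.5, 2.0]    # illustrative encoder outputs
z, eps = reparameterize(mu, sigma, rng)

# With eps held fixed, z is a deterministic function of mu and sigma:
# dz/dmu = 1 and dz/dsigma = eps, so gradients can flow to the encoder.
```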
Write out the loss objective function for a $\beta$-VAE and explain how configuring $\beta > 1$ alters latent features.
$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \beta\, D_{KL}(q_\phi(z \mid x) \,\|\, p(z))$
Effect of $\beta > 1$: Heavily penalizing the KL divergence term pushes the approximate posterior toward the factorized (diagonal-covariance) prior distribution. This compresses the bottleneck, encouraging the network to discover efficient, uncorrelated, and statistically independent latent factors (e.g., separating head rotation cleanly from facial expression).
Trade-off: Excessively high β factors can over-compress the bottleneck, hurting reconstruction quality and washing out fine details.
Detail the multi-modal pipeline of DALL-E 1, explaining how it integrates discrete VAE tokenization with autoregressive text parsing.
1. Discrete VAE Tokenization (dVAE): To circumvent the massive memory footprint of raw pixel spaces, raw images are passed through a discrete VAE (a close relative of the Vector Quantized VAE), compressing them into grids of discrete codebook tokens.
2. Text Representation: Descriptive language prompt strings are tokenized into contextual representations using standard Byte Pair Encoding (BPE).
3. Autoregressive Integration: Both the text tokens and the discrete image tokens are concatenated into a single continuous sequence. A massive autoregressive Transformer processes this sequence, learning to model their joint distribution to synthesize new images directly from novel textual prompts.
Define Undercomplete and Overcomplete representations in Autoencoders. How does each regime prevent the network from learning a trivial identity mapping?
Undercomplete (L≪D): Constrains the latent space dimension (L) to be significantly smaller than the input dimension (D). This forces the data through a narrow structural bottleneck, compelling the network to learn only the most salient features.
Overcomplete (L≫D): Keeps the latent space dimension larger than the input dimension but limits network capacity via structural regularization. This prevents the identity mapping by injecting input noise, penalizing derivatives, or forcing activation sparsity.
Under what explicit architectural conditions is a Bottleneck Autoencoder mathematically equivalent to Principal Component Analysis (PCA)?
An autoencoder becomes mathematically equivalent to PCA if it satisfies three conditions:
It features a single hidden bottleneck layer (L < D).
It utilizes strictly linear activations ($z = W_1 x$ and $\hat{x} = W_2 z$).
It minimizes the standard squared reconstruction error.
Under these constraints, the combined weight matrix $\hat{W} = W_2 W_1$ is forced to learn an orthogonal projection onto the subspace spanned by the first L eigenvectors of the data's empirical covariance matrix.
Explain the mathematical relationship between a Denoising Autoencoder's (DAE) residual error and the data distribution as noise variance approaches zero ($\sigma \to 0$).
When trained with Gaussian corruption and squared error loss, the residual error vector field points in the direction of the score function of the data density, with a magnitude that scales with the noise variance:
Residual Error: $e(x) = r(\tilde{x}) - x \approx \sigma^2 \nabla_x \log p(x)$
Geometrically, this means the DAE learns a vector field where all error vectors point directly inward toward the nearest region of high probability density along the lower-dimensional data manifold.
Write out the explicit regularization penalty term for a Contractive Autoencoder (CAE) and explain its variables.
The CAE appends the Frobenius norm of the encoder's Jacobian matrix directly to the loss:
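$\Omega(z, x) = \lambda \left\| \frac{\partial f_e(x)}{\partial x} \right\|_F^2 = \lambda \sum_k \|\nabla_x h_k(x)\|_2^2$
$\lambda$: The regularization scaling hyperparameter.
$f_e(x)$: The encoder function; $h_k(x)$ is the activation value of its k-th hidden latent unit.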
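For a concrete case, the sketch below assumes a single-layer sigmoid encoder $h = \text{sigmoid}(Wx + b)$, which is an assumption not stated on the card. In that case the Jacobian entries are $h_k(1 - h_k) W_{kj}$, so the Frobenius norm has a cheap closed form.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def encode(x, W, b):
    """Single-layer sigmoid encoder: h_k = sigmoid(W[k] . x + b[k])."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bk) for row, bk in zip(W, b)]


def contractive_penalty(x, W, b, lam=1.0):
    """lam * ||dh/dx||_F^2, using dh_k/dx_j = h_k * (1 - h_k) * W[k][j]."""
    h = encode(x, W, b)
    return lam * sum(
        (hk * (1.0 - hk)) ** 2 * sum(w * w for w in row)
        for hk, row in zip(h, W)
    )


# Illustrative weights and input (made up).
W = [[1.0, 2.0], [-1.0, 0.5]]
b = [0.0, 0.1]
penalty = contractive_penalty([0.3, -0.2], W, b)
```

Saturated units ($h_k$ near 0 or 1) contribute almost nothing to the penalty, which is how the encoder becomes locally flat.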
How do the reconstruction loss and the Jacobian penalty interact dynamically to map a data manifold in a Contractive Autoencoder (CAE)?
The Jacobian penalty forces the encoder to become flat or constant, actively minimizing sensitivity to small variations in the input vector (local contraction).
The reconstruction loss counters this by demanding that the model preserve vital data traits.
Together, they force the network to remain highly sensitive only along the specific directional vectors that define the true underlying data manifold, while remaining completely insensitive to variations perpendicular to it.
Why do Contractive Autoencoders (CAEs) often utilize tied weights (Wdecoder=WencoderT), and what is the architecture's primary computational limitation?
Tied Weights Rationale: Prevents a degenerate solution where the encoder artificially shrinks the Jacobian by multiplying the input by an infinitesimally small scalar ϵ, while the decoder trivially cancels it out by multiplying by 1/ϵ.
Limitation: CAEs are highly computationally expensive and slow to train because tracking and calculating the full encoder Jacobian matrix during backpropagation scales poorly with layer dimensions.
Contrast ℓ1 Activity Regularization with KL Divergence Frequency Matching as mechanisms for enforcing sparsity in an autoencoder.
$\ell_1$ Activity Regularization ($\lambda \|z\|_1$): Applies an absolute penalty to individual activations. This can be aggressive, often leading to a regime where specific neurons are completely and permanently deactivated across the entire dataset.
KL Divergence Frequency Matching: Tracks the average empirical activation frequency $q_k$ of each hidden unit across a minibatch and penalizes its divergence from a tiny target distribution $p$ (e.g., $p = 0.1$):
$\Omega = \lambda \sum_k D_{KL}(p \parallel q_k)$
This ensures that roughly 90% of the neurons are quiet at any given time step, preventing individual units from shutting down permanently across the whole dataset.
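The two penalties can be sketched side by side. This assumes activations in the open interval (0, 1), e.g. sigmoid units, so each unit's mean activation can be read as a Bernoulli frequency; the function names are hypothetical.

```python
import math


def l1_penalty(activations, lam=1.0):
    """lam * ||z||_1, averaged over the examples in the minibatch."""
    return lam * sum(sum(abs(a) for a in row) for row in activations) / len(activations)


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_sparsity_penalty(activations, target=0.1, lam=1.0):
    """lam * sum_k KL(target || q_k), where q_k is unit k's mean activation."""
    n = len(activations)
    n_units = len(activations[0])
    q = [sum(row[k] for row in activations) / n for k in range(n_units)]
    return lam * sum(bernoulli_kl(target, qk) for qk in q)


# Minibatch of 2 examples x 2 hidden units (illustrative values).
on_target = [[0.1, 0.1], [0.1, 0.1]]   # every unit fires at the target rate
too_active = [[0.5, 0.1], [0.5, 0.1]]  # unit 0 fires far too often
```

The KL version penalizes a unit for being too quiet as well as too active, which is why it avoids permanently dead units.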
State the explicit mathematical likelihood distributions used by a VAE Decoder (pθ(x∣z)) for both continuous data and binary data.
Continuous Data (Gaussian Likelihood):
$p_\theta(x \mid z) = \mathcal{N}(x \mid f_d(z; \theta), \sigma^2 I)$
Binary Data (Bernoulli Likelihood):
$p_\theta(x \mid z) = \prod_{i=1}^{D} \text{Ber}(x_i \mid f_d(z; \theta)_i)$
Where fd(z;θ) represents the non-linear decoder neural network transformation.
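In log form these likelihoods become the familiar reconstruction losses. The sketch below takes the decoder output as a plain list (a stand-in for $f_d(z; \theta)$, which is not modelled here).

```python
import math


def gaussian_log_likelihood(x, mean, sigma2):
    """log N(x | mean, sigma2 * I): a scaled squared error plus a constant."""
    d = len(x)
    squared_error = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2.0 * math.pi * sigma2) + squared_error / sigma2)


def bernoulli_log_likelihood(x, probs):
    """sum_i log Ber(x_i | probs_i): the negative binary cross-entropy."""
    return sum(
        xi * math.log(p) + (1 - xi) * math.log(1.0 - p)
        for xi, p in zip(x, probs)
    )


# Illustrative decoder outputs (made up).
ll_gauss = gaussian_log_likelihood([0.2, 0.4], [0.2, 0.4], sigma2=1.0)
ll_bern = bernoulli_log_likelihood([1, 0], [0.5, 0.5])
```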
Use Jensen's Inequality to prove that the VAE Evidence Lower Bound (ELBO) is a lower bound on the true marginal log-evidence $\log p_\theta(x)$.
Start with the continuous integral form of the ELBO:
$\mathcal{L}(\theta, \phi \mid x) = \int q_\phi(z \mid x) \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\, dz$
Because the logarithm function is concave, applying Jensen's Inequality allows us to pull the log operation outside the expectation integral:
$\le \log \int q_\phi(z \mid x)\, \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\, dz$
Cancel the approximate posterior $q_\phi(z \mid x)$ terms inside the integrand:
$= \log \int p_\theta(x, z)\, dz$
By definition, integrating out the latent variable z yields the true marginal evidence:
$= \log p_\theta(x)$
Write out the fast, closed-form analytical equation for the KL Regularization Term when matching an inferred Gaussian distribution against a standard normal prior p(z)=N(0,I).
$D_{KL}(q \,\|\, p) = -\frac{1}{2} \sum_{k=1}^{K} \left(\log \sigma_k^2 - \sigma_k^2 - \mu_k^2 + 1\right)$
Where K is the total dimensionality of the latent space bottleneck, and μk,σk2 are the localized means and variances output by the inference encoder network.
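The closed form translates directly into a few lines. This sketch takes the encoder's means and variances as plain lists; the function name is hypothetical.

```python
import math


def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) via the closed-form sum over dimensions."""
    return -0.5 * sum(math.log(v) - v - m * m + 1.0 for m, v in zip(mu, var))


kl_matched = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])  # q equals the prior
kl_shifted = kl_to_standard_normal([1.0], [1.0])            # mean moved by 1
```

The penalty is zero only when every dimension has zero mean and unit variance, and it grows as the encoder moves away from the prior.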
Contrast Directed PGMs, Autoregressive Models (ARMs), and Generative Adversarial Networks (GANs) on Density Evaluation and Sampling Speed.
Directed PGMs: Features Exact, Fast Density Evaluation paired with Fast Sampling Speed over sparse Directed Acyclic Graphs (DAGs).
Autoregressive Models (ARMs): Features Exact, Fast Density Evaluation but suffers from Slow, Sequential Sampling Speed because each token must be generated iteratively.
Generative Adversarial Networks (GANs): Density evaluation is Not Available (implicit modeling), but features Fast, Parallel Sampling Speed via a single forward pass through the generator network.
Compare VAEs, Normalizing Flows, and Diffusion Models regarding their Latent Space Dimension constraints.
Variational Autoencoders (VAEs): Enforces a Compressed latent space ($\mathbb{R}^L$ where $L \ll D$), squeezing data through a probabilistic bottleneck.
Normalizing Flows: Requires an Uncompressed latent space ($\mathbb{R}^L$ where $L = D$) to maintain full mathematical invertibility across its transformations.
Diffusion Models: Utilizes an Uncompressed latent space ($\mathbb{R}^L$ where $L = D$) matching the exact structural dimensions of the input across its forward-reverse denoising steps.
Explain the mathematical formulation and structural rationale behind Latent Space Interpolation.
Given two anchor inputs $x_1$ and $x_2$, their structural latent embeddings are extracted via an encoder $z_1 = e(x_1)$ and $z_2 = e(x_2)$. A linear blend is calculated across the latent space:
$z = \lambda z_1 + (1 - \lambda) z_2$ where $0 \le \lambda \le 1$
Decoding this path ($x' = d(z)$) synthesizes a smooth semantic morph between the inputs. This is highly effective because while the raw pixel space is highly curved and nonlinear, the learned latent manifold has approximately zero curvature.
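A minimal sketch of the blend, operating on latent codes that are assumed to have been produced by an encoder already (the encoder and decoder themselves are not modelled here).

```python
def interpolate(z1, z2, lam):
    """z = lam * z1 + (1 - lam) * z2, applied elementwise."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(z1, z2)]


def interpolation_path(z1, z2, steps):
    """Evenly spaced blends, running from z2 (lam = 0) to z1 (lam = 1)."""
    return [interpolate(z1, z2, i / (steps - 1)) for i in range(steps)]


# Illustrative 2-D latent codes (made up).
z1, z2 = [2.0, 0.0], [0.0, 4.0]
path = interpolation_path(z1, z2, steps=5)
```

Each entry of `path` would then be passed through the decoder to render one frame of the morph.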
Write out the mathematical derivation for Latent Space Attribute Arithmetic (e.g., adding sunglasses to a face).
1. Compute an attribute offset vector Δ by isolating the average embeddings of images possessing the target attribute (z+) and subtracting the average embeddings of images lacking it (z−):
$\Delta = \frac{1}{N^{+}} \sum z^{+} - \frac{1}{N^{-}} \sum z^{-}$
2. To project this attribute predictably onto a novel unaligned image vector's embedding (znew), scale and add the offset vector before decoding:
zmodified=znew+sΔ
Where s is a scalar hyperparameter controlling attribute intensity.
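The two steps can be sketched on plain lists of latent codes. The embeddings below are made-up numbers standing in for encoder outputs.

```python
def mean_vector(vectors):
    """Elementwise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attribute_offset(with_attribute, without_attribute):
    """Delta = mean(z+) - mean(z-)."""
    mean_pos = mean_vector(with_attribute)
    mean_neg = mean_vector(without_attribute)
    return [p - n for p, n in zip(mean_pos, mean_neg)]


def apply_attribute(z_new, delta, s=1.0):
    """z_modified = z_new + s * Delta."""
    return [z + s * d for z, d in zip(z_new, delta)]


# Illustrative embeddings: e.g. faces with and without sunglasses.
z_with = [[2.0, 0.0], [4.0, 0.0]]
z_without = [[1.0, 0.0], [1.0, 2.0]]
delta = attribute_offset(z_with, z_without)
z_modified = apply_attribute([0.0, 0.0], delta, s=0.5)
```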
What is the core contradiction when evaluating Continuous Probability Densities on discrete image/audio data, and how does Uniform Dequantization resolve it?
Raw images/audio are stored as discrete integers (e.g., pixel intensities from 0 to 255), yet are typically evaluated using continuous probability density functions. Because a continuous density can be arbitrarily large (it is not bounded by 1), a model can place ever-narrower spikes on the discrete values, driving the continuous Negative Log-Likelihood (NLL) toward negative infinity without learning anything meaningful.
Uniform Dequantization resolves this by adding uniform noise u∼U(0,1) directly to the discrete coordinates. Evaluating the density of this dequantized continuous version establishes a mathematically rigorous lower bound on the true discrete model log-likelihood via Jensen's Inequality.
Provide an algebraic example proving that a high Log-Likelihood Score does not guarantee high-quality sample generation.
Consider a model $q_2(x) = 0.01\, q_0(x) + 0.99\, q_1(x)$ that combines 1% of an optimal density model $q_0(x)$ with 99% of a white noise model $q_1(x)$. This model will generate poor, noisy samples 99% of the time, yet its log-likelihood drops by at most a small additive constant compared to the perfect model, which is negligible next to the total log-likelihood of a high-dimensional image:
$\log q_2(x) \ge \log[0.01\, q_0(x)] = \log q_0(x) - \log(100) \approx \log q_0(x) - 4.6$
The model achieves a stellar likelihood score despite generating visual trash.
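The bound is easy to confirm numerically. The two log-likelihood values below are hypothetical magnitudes chosen only to resemble a high-dimensional image; they do not come from any real model.

```python
import math


def mixture_log_density(log_q0, log_q1, w0=0.01):
    """log( w0 * q0 + (1 - w0) * q1 ), computed stably from log densities."""
    a = math.log(w0) + log_q0
    b = math.log(1.0 - w0) + log_q1
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


log_q0 = -3000.0   # hypothetical log-likelihood under the good model (nats)
log_q1 = -6000.0   # hypothetical log-likelihood under the noise model (nats)
log_q2 = mixture_log_density(log_q0, log_q1)

# Cost of diluting the good model down to 1% of the mixture.
penalty = log_q0 - log_q2
```

Against a total of thousands of nats, a loss of about 4.6 nats is invisible in the likelihood score even though 99% of the samples are noise.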
Write out the mathematical equation for the Inception Score (IS). Explain its two internal entropy objectives.
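$\text{IS} = \exp\left(\mathbb{E}_{x \sim p_\theta(x)}\left[D_{KL}(p_{\text{disc}}(Y \mid x) \,\|\, p_\theta(Y))\right]\right)$
Expanding the logs yields: $\log(\text{IS}) = H(p_\theta(Y)) - \mathbb{E}_{x \sim p_\theta(x)}\left[H(p_{\text{disc}}(Y \mid x))\right]$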
Maximize H(pθ(Y)) (Marginal Entropy): The model must generate samples evenly balanced across all known classes, ensuring high variety and avoiding label drop.
Minimize H(pdisc(Y∣x)) (Conditional Entropy): Individual generated samples must contain distinct, sharp class features that allow the classifier to identify a single label with high confidence.
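A small sketch makes both objectives visible. It assumes the classifier outputs are already available as rows of class probabilities, with the marginal $p_\theta(Y)$ estimated as their average; the three toy batches are made up.

```python
import math


def inception_score(cond_probs):
    """cond_probs[i][y] = p_disc(y | x_i) for generated sample x_i."""
    n = len(cond_probs)
    n_classes = len(cond_probs[0])
    # Marginal label distribution, estimated by averaging over samples.
    marginal = [sum(row[y] for row in cond_probs) / n for y in range(n_classes)]
    mean_kl = sum(
        sum(p * math.log(p / marginal[y]) for y, p in enumerate(row) if p > 0)
        for row in cond_probs
    ) / n
    return math.exp(mean_kl)


sharp_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3                 # high conditional entropy
collapsed = [[1.0, 0.0, 0.0]] * 3                    # low marginal entropy
```

Only the first batch scores well: it reaches the maximum of 3 (the number of classes), while the blurry and collapsed batches both fall to the minimum of 1.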
Write out the formula for the Fréchet Inception Distance (FID) and state its primary statistical limitation.
FID treats deep features extracted via an Inception classifier as multi-dimensional Gaussians, calculating the Wasserstein-2 distance between the real data (μd,Σd) and model samples (μm,Σm):
$\text{FID} = \|\mu_m - \mu_d\|_2^2 + \text{tr}\left(\Sigma_d + \Sigma_m - 2(\Sigma_d \Sigma_m)^{1/2}\right)$
Limitation: FID suffers from significant sample bias, meaning scores systematically shift based on the total count of samples used to evaluate the empirical means and covariances. To circumvent this bias, the Kernel Inception Distance (KID) can be used to calculate maximum mean discrepancy.
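The general formula needs a matrix square root. As a simplified sketch, assume both covariance matrices are diagonal (a restriction not made by FID itself); then the trace term reduces to an elementwise expression that needs no linear algebra library.

```python
import math


def fid_diagonal(mu_m, var_m, mu_d, var_d):
    """FID for Gaussians with DIAGONAL covariances (simplifying assumption).

    With diagonal matrices, tr(S_d + S_m - 2 (S_d S_m)^(1/2)) becomes
    sum_i (var_d[i] + var_m[i] - 2 * sqrt(var_d[i] * var_m[i])).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_m, mu_d))
    cov_term = sum(
        vm + vd - 2.0 * math.sqrt(vm * vd) for vm, vd in zip(var_m, var_d)
    )
    return mean_term + cov_term


# Illustrative feature statistics (made up).
same = fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
different = fid_diagonal([1.0, 0.0], [4.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```

In practice the means and covariances are themselves estimated from a finite sample, which is where the sample-count bias described above enters.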
How do Precision and Recall in generative evaluation decouple sample quality from sample diversity?
Aggregate metrics like FID collapse all failure modes into a single scalar. Precision and recall fix this by using k-nearest neighbor boundaries in a classifier's feature space:
Precision (Quality Metric): Quantifies the fraction of model-generated samples that fall inside the local neighborhood clusters of the true data distribution. High precision means no low-quality or corrupted samples.
Recall (Diversity/Coverage Metric): Quantifies the fraction of real validation samples that fall inside the local neighborhood clusters of the model distribution. High recall means the model has successfully avoided mode collapse.
Write out the mathematical definitions for Precision and Recall based on the binary $k$-nearest neighbor coverage function $f_k(\phi, \Phi)$.
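Given the binary coverage function $f_k$ (which outputs 1 if feature vector $\phi$ falls within the nearest-neighbor radius of the reference set $\Phi$, and 0 otherwise), precision and recall are defined as:
$\text{precision}(\Phi_{\text{model}}, \Phi_{\text{data}}) = \frac{1}{|\Phi_{\text{model}}|} \sum_{\phi \in \Phi_{\text{model}}} f_k(\phi, \Phi_{\text{data}})$
$\text{recall}(\Phi_{\text{model}}, \Phi_{\text{data}}) = \frac{1}{|\Phi_{\text{data}}|} \sum_{\phi \in \Phi_{\text{data}}} f_k(\phi, \Phi_{\text{model}})$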
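One way to read $f_k$ is sketched below: a query is covered if it lies inside the ball around any reference point whose radius is that point's distance to its k-th nearest neighbour in the reference set. This interpretation, the function names, and the toy features are assumptions for illustration.

```python
import math


def knn_radius(point, reference, k):
    """Distance from a reference point to its k-th nearest other reference point."""
    distances = sorted(math.dist(point, other) for other in reference)
    return distances[k]   # distances[0] is 0.0, the point itself


def coverage_fraction(queries, reference, k=1):
    """Mean of f_k(phi, reference) over all query feature vectors phi."""
    radii = [knn_radius(ref, reference, k) for ref in reference]
    covered = [
        any(math.dist(q, ref) <= r for ref, r in zip(reference, radii))
        for q in queries
    ]
    return sum(covered) / len(queries)


def precision(model_feats, data_feats, k=1):
    return coverage_fraction(model_feats, data_feats, k)


def recall(model_feats, data_feats, k=1):
    return coverage_fraction(data_feats, model_feats, k)


# Real data has two modes; the toy "model" only produces samples near one.
data_feats = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
model_feats = [(0.1, 0.1), (0.9, 0.0)]
p = precision(model_feats, data_feats)
r = recall(model_feats, data_feats)
```

Here every generated sample is realistic (precision 1.0) while most of the real data is not covered (recall 0.4), which is the mode-collapse signature a single scalar such as FID would blur.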