6. Representation Learning


Last updated 9:33 AM on 6/12/26


53 Terms

New cards

What is the "Curse of Dimensionality" in unsupervised learning and how does it impact data metrics?

High-dimensional spaces lead to extreme data sparsity because the volume of the space scales exponentially. This makes traditional distance metrics (like Euclidean distance) less meaningful because all data points tend to become roughly equidistant from one another.
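The concentration of distances can be checked with a small simulation. This is a rough sketch under assumptions not in the card: points are drawn uniformly from the unit hypercube, and the helper name `relative_contrast` is made up for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one random
    query point to n_points uniform random points in the unit hypercube."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Illustrative dimensions only; the card does not prescribe these values.
contrast_2d = relative_contrast(dim=2)
contrast_1000d = relative_contrast(dim=1000)
print(f"2-D contrast: {contrast_2d:.2f}, 1000-D contrast: {contrast_1000d:.2f}")
```

In low dimensions the nearest neighbor is much closer than the farthest point, so the contrast is large. In high dimensions the contrast shrinks toward zero, which is the "roughly equidistant" effect described above.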

New cards

Map out the 4 "Classic Approaches" to handling large, high-dimensional datasets.

1. Numerosity reduction: Regression, Clustering.

2. Dimensionality reduction: Feature selection.

3. Feature transformation: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA).

4. Data compression: Autoencoders.

New cards

What is the algorithmic step-by-step process to perform Principal Component Analysis (PCA)?

1. Normalize data: Compute the mean vector $\overline{\mathbf{x}}$ and center the data by subtracting it from every sample.

2. Compute Covariance: Generate the $D \times D$ covariance matrix, where $D$ is the number of features.

3. Decompose Matrix: Calculate and sort eigenvalues ( $\lambda_1 \ge \lambda_2 \ge \dots$ ) and their corresponding eigenvectors.

4. Select Subspace: Choose the top $k$ eigenvectors.

5. Project: Project the original data into the new $k$ -dimensional subspace.
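The five steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `pca`, the synthetic data, and the choice of `np.linalg.eigh` (suitable because a covariance matrix is symmetric) are assumptions made for this example.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (N samples x D features) onto the top-k principal axes."""
    x_bar = X.mean(axis=0)                      # step 1: mean vector
    Xc = X - x_bar                              # step 1: center the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)          # step 2: D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # step 3: eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # step 3: re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                          # step 4: top-k eigenvectors as columns
    Z = Xc @ W                                  # step 5: project into the k-dim subspace
    return Z, W, eigvals

# Hypothetical correlated data, only for demonstration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z_demo, W_demo, eigvals_demo = pca(X_demo, k=2)
print(Z_demo.shape)
```

The variance of each projected coordinate equals the corresponding eigenvalue, which is why sorting by eigenvalue selects the directions of greatest variance.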

New cards

Write out the full Covariance Matrix Equation ( $\mathbf{\Sigma}$ ) and define its variables.

$\mathbf{\Sigma} = \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{x}_{n}-\overline{\mathbf{x}})(\mathbf{x}_{n}-\overline{\mathbf{x}})^{T} = \frac{1}{N-1}\mathbf{X}^{T}\mathbf{X}$

$\mathbf{x}_n$ : Individual high-dimensional data sample vector.
$\overline{\mathbf{x}}$ : Computed mean vector across all sample observations ( $\overline{\mathbf{x}} = \frac{1}{N}\sum \mathbf{x}_n$ ).
$\mathbf{X}$ : Mean-centered data matrix where rows correspond to individual observations.
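As a quick sanity check, the sum-of-outer-products form and the matrix form should agree with NumPy's built-in estimator. The data below is random and purely illustrative, and the variable names are chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))          # N = 50 samples, 3 features (rows = observations)
N = data.shape[0]
x_bar = data.mean(axis=0)                # mean vector

# Left-hand form: average of outer products of centered samples.
sigma_sum = sum(np.outer(x - x_bar, x - x_bar) for x in data) / (N - 1)

# Right-hand form: X^T X with the mean-centered data matrix X.
Xc = data - x_bar
sigma_mat = Xc.T @ Xc / (N - 1)

# NumPy's estimator (rowvar=False means rows are observations; default divisor is N - 1).
sigma_np = np.cov(data, rowvar=False)
```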

New cards

Explain the core structural difference between how PCA and Factor Analysis define a "Latent Factor."

PCA: Describes a latent factor simply as a direct linear composite of the observed variables (projecting along axes of greatest variance).

Factor Analysis: Is a probabilistic measurement model that defines observed variables as continuous mixtures of latent factors, explicitly incorporating a linear factor loading matrix ( $\mathbf{W}$ ), a prior over the factors, and unique (per-feature) sensor noise.

New cards

In the Probabilistic Factor Analysis framework, what is the assumed Prior Distribution of the latent factors $\mathbf{z}$ ?

The latent factors $\mathbf{z}$ are assumed to follow a simple standard normal distribution, meaning they are mutually independent and zero-centered:

$p(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$

New cards

Write the generation equation and the Conditional Likelihood function $p(\mathbf{x} \mid \mathbf{z})$ for Factor Analysis.

Generation Equation: $\mathbf{x} = \mathbf{W}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \text{where} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{\Psi})$

( $\mathbf{W}$ is the factor loading matrix; $\mathbf{\Psi}$ is a diagonal matrix containing unique feature variances).

Conditional Likelihood:

$p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{x}; \mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \mathbf{\Psi})$

New cards

What is the Marginal Distribution $p(\mathbf{x})$ in Factor Analysis obtained by integrating out the latent vector $\mathbf{z}$ ?

$p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z}) d\mathbf{z} = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \mathbf{W}\mathbf{W}^T + \mathbf{\Psi})$

This distribution serves as a foundation for deep generative models.
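The marginal covariance $\mathbf{W}\mathbf{W}^T + \mathbf{\Psi}$ can be checked empirically by sampling from the generative process. The dimensions, parameter values, and sample size below are arbitrary assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 4, 2, 200_000                  # observed dim, latent dim, number of samples
W = rng.normal(size=(D, L))              # factor loading matrix (random, for illustration)
mu = rng.normal(size=D)                  # mean offset
psi = rng.uniform(0.1, 0.5, size=D)      # diagonal entries of Psi (unique variances)

z = rng.normal(size=(N, L))                      # z ~ N(0, I)
eps = rng.normal(size=(N, D)) * np.sqrt(psi)     # eps ~ N(0, Psi)
x = z @ W.T + mu + eps                           # x = W z + mu + eps, one sample per row

model_cov = W @ W.T + np.diag(psi)
sample_cov = np.cov(x, rowvar=False)
rel_err = np.abs(sample_cov - model_cov).max() / np.diag(model_cov).max()
print(f"largest covariance error relative to largest variance: {rel_err:.4f}")
```

With enough samples, the empirical mean approaches $\boldsymbol{\mu}$ and the empirical covariance approaches the model covariance, up to sampling noise.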

New cards

Why is Factor Analysis insufficient for data like the "Cocktail Party Effect," and how does Independent Component Analysis (ICA) resolve it?

Factor Analysis assumes a Gaussian prior on the latent factors. Because a standard Gaussian is rotationally symmetric, any rotation of the factors fits the data equally well, so the model cannot identify the individual underlying source directions when waveforms are mixed together. ICA resolves this by explicitly modeling non-Gaussian continuous sources and learning a demixing matrix that minimizes mutual information between latent components.

New cards

How do high-dimensional and low-dimensional similarity metrics differ in t-SNE?

High-Dimensional Space: Computes the probability $p_{j\mid i}$ that point $\mathbf{x}_i$ picks $\mathbf{x}_j$ as its neighbor under a localized Gaussian distribution.

Low-Dimensional Space: Maps coordinates $\mathbf{y}_i$ and $\mathbf{y}_j$ using a heavy-tailed Student-t distribution (1 degree of freedom) to compute the low-dimensional probability $q_{ij}$ .
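The two similarity measures can be sketched as below. Note the simplifying assumption: a single fixed bandwidth `sigma` is used for every point, whereas full t-SNE tunes a per-point bandwidth to hit a target perplexity. The function names are invented for this example.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_conditional(X, sigma=1.0):
    """p_{j|i}: Gaussian neighbor probabilities; row i is a distribution over j != i."""
    logits = -pairwise_sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)                # a point never picks itself
    logits -= logits.max(axis=1, keepdims=True)      # subtract row max for stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def low_dim_joint(Y):
    """q_{ij}: Student-t (1 degree of freedom) similarities, normalized over all pairs."""
    kernel = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum()

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 10))   # hypothetical high-dimensional points
Y_demo = rng.normal(size=(6, 2))    # hypothetical 2-D map coordinates
P_demo = high_dim_conditional(X_demo)
Q_demo = low_dim_joint(Y_demo)
```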

New cards

What is the Crowding Problem in dimensionality reduction and how does t-SNE's Student-t distribution fix it?

In high-dimensional spaces, volume scales exponentially. When projecting down to a $2\text{D}$ space, there isn't enough room to preserve moderate distances, causing points to clump into a dense, uninterpretable cluster. The heavy-tailed Student-t distribution allows moderate distances to be stretched out, separating distinct clusters during optimization.

New cards

Write out the reconstruction loss equations for a baseline Bottleneck Autoencoder versus an $L_1$ Regularized Sparse Autoencoder.

Bottleneck / Undercomplete AE:

$\mathcal{L}_{\text{Reconstruction}}(\phi, \theta) = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{x}_i - g_\theta(f_\phi(\mathbf{x}_i))\|_2^2$

$L_1$ Sparse AE:

$\mathcal{L}_{\text{Sparse}} = \mathcal{L}_{\text{Reconstruction}} + \lambda \sum_{j} |z_j|$

New cards

Write out the exact KL-Divergence penalty equation used to enforce activation sparsity in a Sparse Autoencoder.

$\sum_{j} \text{KL}(\rho \parallel \hat{\rho}_j) = \sum_{j} \left( \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \right)$

Where $\rho$ is the tiny target activation probability and $\hat{\rho}_j$ is the average activation of neuron $j$ over the minibatch.
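A minimal NumPy sketch of this penalty follows. It assumes activations lie in $(0, 1)$ (for example, sigmoid outputs); the clipping constant `eps` and the default target `rho=0.05` are illustrative choices, not values from the card.

```python
import numpy as np

def kl_sparsity_penalty(activations, rho=0.05, eps=1e-8):
    """Sum over hidden units j of KL(rho || rho_hat_j).

    activations: array of shape (batch, hidden) with values in (0, 1).
    rho_hat_j is the mean activation of unit j over the minibatch.
    """
    rho_hat = np.clip(activations.mean(axis=0), eps, 1.0 - eps)  # avoid log(0)
    kl = rho * np.log(rho / rho_hat) + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    return float(kl.sum())

on_target = np.full((8, 3), 0.05)   # every unit fires at exactly the target rate
too_active = np.full((8, 3), 0.5)   # every unit fires far too often
penalty_on_target = kl_sparsity_penalty(on_target)
penalty_too_active = kl_sparsity_penalty(too_active)
```

The penalty vanishes when every unit's average activation matches the target and grows as units become more active than intended.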

New cards

How do advanced architectures like SegNet or U-Net handle hidden representations differently than standard bottleneck autoencoders?

Instead of forcing all information strictly through a single central low-dimensional bottleneck, they use skip connections or pooling indices to link encoder-decoder pairs. This allows latent representations to form across multiple hierarchical levels, which is highly effective for tasks that must preserve fine spatial detail, such as image segmentation.

New cards

Define the Manifold Hypothesis and explain how neural networks utilize it.

The hypothesis states that most naturally occurring high-dimensional data (e.g., 784D MNIST digits) lies on a much lower-dimensional manifold embedded within that space. Neural networks use hidden layers to learn non-linear coordinate transformations that untangle these overlapping manifolds, rendering features linearly separable.

New cards

Contrast Distributed Representations with localist ("one-hot") encodings.

Localist encodings assign one concept strictly to one neuron (like a sparse one-hot vector). A distributed representation uses a many-to-many relationship where multiple features are active simultaneously; each neuron participates in representing many concepts, and each concept is expressed across many neurons.

New cards

What is Disentanglement in latent spaces and what are its key goals?

Disentanglement ensures that distinct, interpretable factors of variation (e.g., a face's hair color, angle, or smile) are cleanly separated along individual, isolated dimensions of the latent space. Key goals include smoothness, linearity, compositionality, coherence, and hierarchical organization.

New cards

Compare the architectures of CBOW and Skip-gram in Word2Vec.

Continuous Bag-of-Words (CBOW): Predicts a missing central target word ( $w_t$ ) given its surrounding context words. It averages the context vectors, making it faster to train and highly accurate for frequent words.

Skip-gram: Predicts the surrounding context words given a single central target word. It weights closer context words more heavily, works well with small datasets, and handles rare words exceptionally well.

New cards

What design problem with Skip-gram led to the creation of GloVe (Global Vectors)?

Skip-gram requires an explicit cross-entropy normalization over the entire vocabulary ( $\mathcal{V}$ ) to compute probabilities, which is incredibly expensive. GloVe solves this by training unsupervised directly on global word-word co-occurrence matrix statistics ( $X$ ), minimizing a robust squared loss function that completely avoids vocabulary-wide normalization steps.

New cards

What is an Autoencoder (AE) and what is its primary objective?

An autoencoder is a neural network trained to replicate its input to its output through a low-dimensional bottleneck layer. Its core goal is data compression and unsupervised feature learning.

New cards

What are the components, mappings, and reconstruction function of a basic Autoencoder?

Encoder ( $f_e$ ): Maps input to a latent representation, $x \to z$ .

Decoder ( $f_d$ ): Maps the latent representation back to an input reconstruction, $z \to \hat{x}$ .
Reconstruction Function:
- $r(x) = f_d(f_e(x))$

New cards

Compare the two main loss functions used in the baseline Autoencoder framework.

Squared Reconstruction Error:

$L(\theta) = \|r(x) - x\|_2^2$

Probabilistic/Generative Framework:
$L(\theta) = -\log p(x|r(x))$

New cards

What is the degenerate "identity solution" in Autoencoders, and how do Undercomplete representations prevent it?

If hidden layers are too wide, the network will simply memorize the data and learn the identity function ( $r(x) = x$ ).

An Undercomplete representation forces the latent dimension ( $L$ ) to be strictly smaller than the input dimension ( $D$ ), creating a narrow bottleneck that compels the network to capture only the most meaningful patterns.

New cards

How does an Overcomplete Autoencoder representation function without collapsing into a trivial identity mapping?

In an Overcomplete Autoencoder, the latent dimension is larger than the input ( $L \gg D$ ). To prevent simple memorization, explicit regularization is imposed—such as adding noise to inputs, forcing activation sparsity, or penalizing derivatives.

New cards

What is a Linear Bottleneck Autoencoder mathematically equivalent to, and why?

It is mathematically equivalent to Principal Component Analysis (PCA).

If the encoder ( $z = W_1x$ ) and decoder ( $\hat{x} = W_2z$ ) use strictly linear mappings and minimize squared reconstruction error, the model acts as an orthogonal projection onto the subspace spanned by the top $L$ eigenvectors (those with the largest eigenvalues) of the data's empirical covariance matrix.

New cards

Why do Convolutional (CNN) architectures typically outperform Multi-Layer Perceptrons (MLPs) on image reconstruction tasks (e.g., Fashion MNIST)?

MLPs flatten images into 1D vectors, destroying structural layout. CNN architectures utilize localized kernels and weight sharing, which explicitly preserves spatial structural information.

New cards

Describe the training mechanism of a Denoising Autoencoder (DAE).

Instead of reconstructing a clean input, a DAE is fed a corrupted input $\tilde{x}$ (distorted via Gaussian noise or Bernoulli dropout) and is trained to output the original, uncorrupted version $x$ . This forces the network to learn manifold structures to "hallucinate" missing details.

New cards

What mathematical vector field is approximated by a Denoising Autoencoder (DAE) as noise variance approaches zero ( $\sigma \to 0$ )?

The residual error $e(x) = r(\tilde{x}) - x$ approximates the score function (the gradient of the log data density):

$e(x) \approx \nabla_x \log p(x)$

This creates a vector field where all vectors point directly toward higher-probability regions on the underlying data manifold.

New cards

How do Contractive Autoencoders (CAEs) force the network to be insensitive to small input variations?

By adding a penalty term based on the Frobenius norm of the encoder's Jacobian matrix:

$\Omega(z, x) = \lambda \left\| \frac{\partial f_e(x)}{\partial x} \right\|_F^2 = \lambda \sum_k \|\nabla_x h_k(x)\|_2^2$

This forces the encoder activations $h_k(x)$ to remain stable under minor perturbations of the input $x$ .
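For a concrete case, assume a single-layer sigmoid encoder $h(x) = \text{sigmoid}(Wx + b)$ (this architecture is an assumption for the sketch, not something the card specifies). Its Jacobian is $\text{diag}(h \odot (1-h))\,W$, so the squared Frobenius norm has a cheap closed form that can be checked against finite differences.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encoder(x, W, b):
    """Hypothetical one-layer sigmoid encoder: h = sigmoid(W x + b)."""
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b, lam=1.0):
    """lam * ||dh/dx||_F^2 using the closed form for a sigmoid layer."""
    h = encoder(x, W, b)
    row_norms_sq = (W ** 2).sum(axis=1)              # ||W_k||^2 for each hidden unit k
    return lam * float(np.sum((h * (1.0 - h)) ** 2 * row_norms_sq))

def numerical_jacobian(x, W, b, step=1e-6):
    """Central finite-difference Jacobian, used only to verify the closed form."""
    J = np.zeros((W.shape[0], x.size))
    for d in range(x.size):
        e = np.zeros_like(x)
        e[d] = step
        J[:, d] = (encoder(x + e, W, b) - encoder(x - e, W, b)) / (2.0 * step)
    return J

rng = np.random.default_rng(0)
W_demo = rng.normal(size=(3, 5))
b_demo = rng.normal(size=3)
x_demo = rng.normal(size=5)
penalty_closed = contractive_penalty(x_demo, W_demo, b_demo)
penalty_numeric = float(np.sum(numerical_jacobian(x_demo, W_demo, b_demo) ** 2))
```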

New cards

Why do Contractive Autoencoders (CAEs) utilize tied weights ( $W_d = W_e^T$ ), and what is their engineering drawback?

Tied Weights: Prevent a trivial solution where the encoder shrinks the input by an infinitesimally small $\epsilon$ to satisfy the penalty while the decoder simply multiplies by $1/\epsilon$ .

Drawback: CAEs are computationally slow to train due to the high overhead of computing the Jacobian matrix at every step.

New cards

How do $\ell_1$ Regularization and KL Divergence penalties differ in how they enforce a Sparse Autoencoder architecture?

$\ell_1$ Regularization ( $\lambda\|z\|_1$ ): Can be overly harsh, permanently turning many neurons completely off across all data samples.

KL Divergence Penalty: Compares average hidden unit activation $q_k$ across a minibatch to a tiny target probability $p$ (e.g., 0.1) via $\lambda \sum_k D_{KL}(p \parallel q_k)$ . This yields a dynamic, biological-like sparse firing pattern (~70% off on average) without creating completely dead neurons.

New cards

What are the probabilistic outputs of a VAE's Encoder and Decoder networks?

Encoder / Inference Network ( $q_\phi(z|x)$ ): Outputs the parameters (mean $\mu$ and diagonal covariance $\Sigma$ ) of a Gaussian distribution: $\mathcal{N}(z|f_{e,\mu}(x), \text{diag}(f_{e,\sigma}(x)))$ .
Decoder / Generative Model ( $p_\theta(x|z)$ ): Outputs a probability distribution over the input space, such as a Gaussian Likelihood for continuous data or a Bernoulli Likelihood for binary data.

New cards

What is Amortized Inference in the context of Variational Autoencoders?

It is the technique of using a single neural network ( $q_\phi$ ) to instantly predict and approximate the optimal latent distribution parameters for any given data sample, entirely replacing slow, iterative per-point optimization algorithms.

New cards

Write out the Evidence Lower Bound (ELBO) equation for VAEs and define what its two components measure.

$\mathcal{L}(\theta, \phi|x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \parallel p(z))$

Term 1 (Expected Log-Likelihood): Measures reconstruction accuracy.
Term 2 (KL Divergence): Regularizes the network by forcing the approximate posterior $q_\phi(z|x)$ to match a simple prior distribution, usually a standard Gaussian $p(z) = \mathcal{N}(0, I)$ .

New cards

What is the closed-form equation for the KL Divergence term in a VAE when matching a Gaussian posterior to a standard Gaussian prior?

$D_{KL}(q \parallel p) = -\frac{1}{2} \sum_{k=1}^K \left( \log \sigma_k^2 - \sigma_k^2 - \mu_k^2 + 1 \right)$
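The closed form translates directly into code. The sketch below assumes the encoder outputs the log-variance $\log \sigma_k^2$, a common parameterization but an assumption here; the function name is made up for this example.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(-0.5 * np.sum(log_var - np.exp(log_var) - mu ** 2 + 1.0))

kl_matching = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # posterior equals prior
kl_shifted = gaussian_kl_to_standard_normal([1.0], [0.0])             # unit variance, mean 1
```

When the posterior equals the prior the divergence is zero. With unit variance and mean 1 in one dimension, the formula reduces to $\mu^2 / 2 = 0.5$.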

New cards

What is the Reparameterization Trick in VAEs and why is it mathematically required?

Gradients cannot backpropagate through a random sampling node ( $z \sim q_\phi(z|x)$ ). To bypass this, the trick isolates the stochasticity by shifting the random sample to an independent noise variable $\epsilon \sim \mathcal{N}(0, I)$ , defining $z$ deterministically as:

$z = \mu + \sigma \odot \epsilon$

This allows gradients to flow smoothly back through the mean ( $\mu$ ) and standard deviation ( $\sigma$ ) networks during backpropagation.
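A small NumPy sketch of the sampling step follows. It does not compute gradients (NumPy has no automatic differentiation); it only illustrates that $z$ is a deterministic function of $\mu$, $\sigma$ and external noise, with $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$. The parameter values are arbitrary.

```python
import numpy as np

def sample_latent(mu, sigma, rng, n_samples):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); returns both z and eps."""
    eps = rng.standard_normal(size=(n_samples, mu.size))  # all randomness lives here
    z = mu + sigma * eps                                  # deterministic in mu and sigma
    return z, eps

rng = np.random.default_rng(0)
mu_demo = np.array([1.0, -2.0])       # illustrative encoder outputs
sigma_demo = np.array([0.5, 2.0])     # standard deviations, not variances
z_demo, eps_demo = sample_latent(mu_demo, sigma_demo, rng, n_samples=100_000)
print(z_demo.mean(axis=0), z_demo.std(axis=0))
```

The samples have approximately the requested mean and standard deviation, so the reparameterized draw follows the same distribution as sampling from $q_\phi(z|x)$ directly.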

New cards

Compare Deterministic AEs and VAEs regarding Latent Space Mapping and Generative Ability.

Latent Space Mapping:
Deterministic AE: Maps an input directly to a single fixed point vector (a delta function in latent space).
VAE: Maps an input to a continuous probability distribution (a Gaussian with mean $\mu$ and standard deviation $\sigma$ ).

Generative Ability:

Deterministic AE: No generative capability. It cannot sample cleanly and has no mathematical framework for handling random latent inputs outside of its exact training points.
VAE: Yes, fully generative. You can easily sample random vectors from a standard Gaussian distribution $z \sim \mathcal{N}(0, I)$ and run them through the decoder to generate entirely new, synthetic data.

New cards

Compare Deterministic AEs and VAEs regarding Latent Space Topology and Latent Interpolation.

Latent Space Topology:
- Deterministic AE: Often non-smooth and fragmented. Distinct classes can end up overlapping or containing empty "dead gaps" where the decoder fails.
- VAE: Continuous and locally smooth, because the KL term pulls each approximate posterior toward the standard Gaussian prior distribution.
Latent Interpolation:
- Deterministic AE: Poorer transitions. Traveling between intermediate points often crosses unmapped or unrealistic regions, causing distorted outputs.
- VAE: Seamless latent space interpolation. Moving along a line between points using $z = \lambda z_1 + (1-\lambda)z_2$ yields a smooth, continuous morphing between items.

New cards

In LSA, how does changing the local sliding window size ( $h$ ) change what type of linguistic information is captured?

Small windows (e.g., $h=2$ ): Capture syntactic behavior (e.g., grouping words with similar grammatical roles, like matching "dog" with "cat").
Large windows (e.g., $h=30$ ): Capture semantic/topical themes (e.g., matching "dog" with "kennel, bark, puppy").

New cards

Write the formulas for Pointwise Mutual Information (PMI) and Positive PMI (PPMI). Why is PPMI preferred over raw counts?

$\text{PMI}(i, j) = \log \frac{p(i, j)}{p(i)p(j)}$

$\text{PPMI}(i, j) = \max(\text{PMI}(i, j), 0)$

Raw co-occurrence counts perform poorly because they over-index frequent, uninformative words. PPMI isolates true context-word relationships and zeros out negative values, which are statistically unreliable in finite corpora.
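A minimal sketch of PPMI computed from a co-occurrence count matrix. One convention is assumed here that the card does not state: cells with zero counts (where PMI would be $-\infty$ or undefined) are mapped to 0.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI from a (words x contexts) co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    p_ij = counts / counts.sum()                  # joint probabilities
    p_i = p_ij.sum(axis=1, keepdims=True)         # word marginals
    p_j = p_ij.sum(axis=0, keepdims=True)         # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0                  # assumed convention for zero counts
    return np.maximum(pmi, 0.0)

# Toy counts: each word co-occurs only with its own context.
ppmi_demo = ppmi([[10, 0], [0, 10]])
```

In the toy matrix each diagonal cell has $p(i,j) = 0.5$ and $p(i)p(j) = 0.25$, so its PPMI is $\log 2$, while the empty cells are 0.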

New cards

Write out the low-rank Singular Value Decomposition (SVD) formula used in Latent Semantic Analysis (LSA) and define what represents the word vs. document embedding.

$C_{ij} \approx \sum_{k=1}^K u_{ik}s_k v_{kj}$

Word embedding: Represented by the row vector $u_i$ .
Document/Context embedding: Represented by $s \odot v_j$ .
It minimizes the Frobenius norm error: $\min \|C - \hat{C}\|_F^2$ .
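The truncated SVD can be sketched with `np.linalg.svd`. The matrix below is random and stands in for a real count or PPMI matrix; the function name `lsa_embeddings` is an assumption for this example.

```python
import numpy as np

def lsa_embeddings(C, K):
    """Rank-K SVD of C: returns word embeddings, context embeddings, and C_hat."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    word_emb = U[:, :K]                           # row i is u_i
    context_emb = (s[:K, None] * Vt[:K, :]).T     # row j is s ⊙ v_j
    C_hat = word_emb @ context_emb.T              # sum_k u_ik s_k v_kj
    return word_emb, context_emb, C_hat

rng = np.random.default_rng(0)
C_demo = rng.uniform(size=(6, 4))                 # hypothetical word-by-context matrix
word_emb, context_emb, C_hat = lsa_embeddings(C_demo, K=2)
squared_error = float(np.sum((C_demo - C_hat) ** 2))
```

The squared Frobenius error of the rank-$K$ approximation equals the sum of the squared singular values that were discarded, which is why keeping the largest singular values gives the best low-rank fit.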

New cards

State the core premise of the Distributional Hypothesis (Harris, 1954; Firth, 1957).

"A word is characterized by the company it keeps." Words are considered semantically similar if they consistently appear within similar contexts. Standard embedding models leverage this by learning mappings from a word’s local or global context to its vector representation.

New cards

Why do sparse one-hot vectors fail to represent semantic relationships between words? Give an example.

One-hot vectors treat all words as completely independent orthogonal categories. For example, the related pair ("man", "woman") and the entirely unrelated pair ("man", "banana") are exactly the same distance apart (any two distinct one-hot vectors have a dot product of 0 and differ in exactly two positions), so the encoding fails to capture semantic proximity.

New cards

Describe the objective and vector calculation mechanism of the Continuous Bag of Words (CBOW) model.

Goal: Predicts a target central word $w_t$ given its surrounding context window $w_{t-m : t+m}$ .
Mechanism: It averages the continuous embeddings of the surrounding context words into a single joint vector $v_t$ :
$v_t = \frac{1}{2m} \sum_{h=1}^m (v_{w_{t+h}} + v_{w_{t-h}})$
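The averaging step can be written in a few lines. The sketch assumes the full window fits inside the sentence (no padding or edge handling), and the embedding matrix and token ids are toy values.

```python
import numpy as np

def cbow_context_vector(V, token_ids, t, m):
    """Average the embeddings of the m words on each side of position t.

    Assumes t - m >= 0 and t + m < len(token_ids); edge handling is omitted.
    """
    window = [token_ids[t + h] for h in range(1, m + 1)]
    window += [token_ids[t - h] for h in range(1, m + 1)]
    return V[window].sum(axis=0) / (2 * m)

V_demo = np.arange(10, dtype=float).reshape(5, 2)   # 5 toy words, 2-D embeddings
sentence = [0, 1, 2, 3, 4]                          # token ids; position 2 is the target
v_t = cbow_context_vector(V_demo, sentence, t=2, m=2)
```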

New cards

Write out the conditional distribution formula for Word2vec's Skip-gram model and explain why each word has two distinct vectors.

$\log p(w_o | w_c) = u_o^T v_c - \log \sum_{i \in V} \exp(u_i^T v_c)$

Each word has two vectors, one for each role it can play: $v_i$ is used when the word acts as the central target word, and $u_i$ is used when it acts as a surrounding context word.
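The log-probability above is a log-softmax over the whole vocabulary. A minimal sketch with random toy embeddings follows; `U` holds the context vectors $u_i$ as rows and `V` holds the center vectors $v_i$ as rows, and the sizes are arbitrary.

```python
import numpy as np

def skipgram_log_prob(U, V, center, context):
    """log p(w_o | w_c) = u_o^T v_c - log sum_i exp(u_i^T v_c)."""
    scores = U @ V[center]                              # u_i^T v_c for every word i
    shift = scores.max()                                # log-sum-exp stabilization
    log_norm = shift + np.log(np.exp(scores - shift).sum())
    return float(scores[context] - log_norm)

rng = np.random.default_rng(0)
vocab_size, dim = 7, 3                                  # toy sizes
U_demo = rng.normal(size=(vocab_size, dim))
V_demo = rng.normal(size=(vocab_size, dim))
probs = np.exp([skipgram_log_prob(U_demo, V_demo, center=0, context=o)
                for o in range(vocab_size)])
```

The normalizer touches every row of `U`, which is the vocabulary-wide cost that makes the full softmax expensive for large vocabularies.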

New cards

How does Skip-gram with Negative Sampling (SGNS) convert a multi-class problem into a binary classification task?

Evaluating a standard softmax requires computing a denominator over the entire vocabulary $V$ . SGNS converts this by treating true pairs $(w_t, w_{t+j})$ as positive targets ( $D=1$ ) while drawing $K$ alternative "noise words" $w_k$ from a custom unigram distribution and labeling them as negative ( $D=0$ ).

New cards

Write out the mathematical objective function for Skip-gram with Negative Sampling (SGNS).

Objective: Maximize the probability of true pairs while minimizing noise pairs:

$p(w_{t+j} | w_t) = \sigma(u_{w_{t+j}}^T v_{w_t}) \prod_{k=1}^K \left[1 - \sigma(u_{w_k}^T v_{w_t})\right]$

Where noise words are sampled from: $p(w) \propto \text{freq}(w)^{3/4}$ to boost the chances of selecting rare words.
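In practice the product is usually handled in log space. The sketch below evaluates the log of the expression above for one positive pair and $K$ noise words, and builds the smoothed noise distribution. Vectors and frequencies are toy values chosen for illustration.

```python
import numpy as np

def log_sigmoid(a):
    """Numerically stable log(sigmoid(a))."""
    return -np.logaddexp(0.0, -a)

def sgns_log_likelihood(v_center, u_pos, U_neg):
    """log[ sigma(u_pos . v) * prod_k (1 - sigma(u_k . v)) ] for one training pair."""
    positive_term = log_sigmoid(u_pos @ v_center)
    negative_term = log_sigmoid(-(U_neg @ v_center)).sum()  # 1 - sigma(a) = sigma(-a)
    return float(positive_term + negative_term)

def noise_distribution(freqs, power=0.75):
    """p(w) proportional to freq(w)^(3/4)."""
    weights = np.asarray(freqs, dtype=float) ** power
    return weights / weights.sum()

rng = np.random.default_rng(0)
v_c = rng.normal(size=4)
u_o = rng.normal(size=4)
U_noise = rng.normal(size=(5, 4))                 # K = 5 noise words
ll_demo = sgns_log_likelihood(v_c, u_o, U_noise)

freqs_demo = np.array([900.0, 90.0, 10.0])        # toy word counts
p_noise = noise_distribution(freqs_demo)
p_raw = freqs_demo / freqs_demo.sum()
```

Comparing `p_noise` with `p_raw` shows the effect of the 3/4 power: the rarest word receives more probability mass than its raw relative frequency, and the most frequent word receives less.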

New cards

What is the final objective function of GloVe, and how is the final word representation calculated when training is complete?

Objective:

$L = \sum_{i \in V} \sum_{j \in V} h(x_{ij}) \left( u_j^T v_i + b_i + c_j - \log x_{ij} \right)^2$

Where $h(x_{ij}) = \min\left(1, (x_{ij}/100)^{0.75}\right)$ balances frequent words. The final word representation is computed by averaging its dual vector assignments:

$\frac{v_i + u_i}{2}$
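A compact sketch of the weighted loss and the final averaging step. Because $h(0) = 0$, pairs that never co-occur contribute nothing; the code masks them explicitly so that $\log 0$ is never evaluated. All arrays below are toy values, and the function names are assumptions for this example.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """h(x) = min(1, (x / x_max)^alpha); h(0) = 0."""
    return np.minimum(1.0, (np.asarray(x, dtype=float) / x_max) ** alpha)

def glove_loss(X, V, U, b, c):
    """Sum over i, j of h(x_ij) * (u_j^T v_i + b_i + c_j - log x_ij)^2."""
    X = np.asarray(X, dtype=float)
    seen = X > 0                                   # zero counts have weight 0
    log_X = np.log(np.where(seen, X, 1.0))         # placeholder 1.0 avoids log(0)
    pred = V @ U.T + b[:, None] + c[None, :]       # pred[i, j] = u_j^T v_i + b_i + c_j
    return float(np.sum(glove_weight(X) * seen * (pred - log_X) ** 2))

def final_embedding(V, U):
    """Average the two vectors learned for each word."""
    return (V + U) / 2.0

# Toy counts that factor exactly as x_ij = a_i * d_j, so biases alone fit log x_ij.
a = np.array([1.0, 2.0, 4.0])
d = np.array([5.0, 10.0, 20.0])
X_demo = np.outer(a, d)
V_zero = np.zeros((3, 2))
U_zero = np.zeros((3, 2))
loss_perfect = glove_loss(X_demo, V_zero, U_zero, np.log(a), np.log(d))
loss_off = glove_loss(X_demo, V_zero, U_zero, np.log(a) + 0.5, np.log(d))
```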

New cards

Explain the mathematical framework used to solve linear word analogies (e.g., $a:b :: c:?$) in an embedding space.

To solve $a:b :: c:?$ , compute the direction vector $\delta = v_b - v_a$ (which acts as a semantic transformation vector, like gender). The target vector $v_d$ is calculated by shifting $v_c$ along that same direction:

$v_d \approx v_c + (v_b - v_a)$
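The vector arithmetic can be demonstrated on hand-built 2-D embeddings. These toy vectors are invented so that the analogy holds exactly; real embeddings only satisfy it approximately. Picking the answer by cosine similarity and excluding the three query words is a common convention, assumed here rather than taken from the card.

```python
import numpy as np

def solve_analogy(emb, a, b, c):
    """Return the word whose vector is closest (by cosine) to v_c + (v_b - v_a)."""
    target = emb[c] + (emb[b] - emb[a])
    best_word, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):                      # skip the query words themselves
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Hypothetical embeddings: first axis ~ "royalty", second axis ~ "gender".
toy_emb = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "banana": np.array([0.0, -2.0]),
}
answer = solve_analogy(toy_emb, "man", "woman", "king")
print(answer)
```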

New cards

What underlying linguistic co-occurrence condition causes linear geometry and analogies to hold true in word vector spaces?

Linear geometry emerges because the underlying ratios of co-occurrence probabilities with arbitrary words $w$ remain stable across related semantic concepts:

$\frac{p(w|\text{man})}{p(w|\text{woman})} \approx \frac{p(w|\text{king})}{p(w|\text{queen})}$

New cards

What are the core assumptions of the RAND-WALK generative model (Arora et al., 2016)?

Text generation is driven by a latent discourse vector $z_t \in \mathbb{R}^D$ moving via a slow Gaussian random walk.
Words are generated via a log-bilinear model: $p(w|z_t) = \frac{\exp(z_t^T v_w)}{Z(z_t)}$ .
High-dimensional properties of isotropic Gaussian priors cause the partition function $Z(z_t) \approx Z$ to behave as a self-normalizing constant.

New cards

What is the major proof result of the RAND-WALK model concerning Word2vec and GloVe?

It shows that, under the model's assumptions, the Pointwise Mutual Information (PMI) of two words is approximately proportional to the dot product of their low-dimensional vectors:
$\text{PMI}(w, w') \approx \frac{v_w^T v_{w'}}{D}$
This suggests that minimizing the Word2vec and GloVe objectives can be interpreted as approximately fitting a frequency-weighted low-rank factorization of the empirical PMI matrix.

New cards

What is the critical limitation of static word embeddings (LSA, Word2vec, GloVe), and how do Contextual Embeddings solve it?

Static embeddings assign a single, fixed vector to each word, making them unable to resolve polysemy or homonyms (e.g., "Apple computer" vs. "eating an apple" share the exact same vector).

Contextual word embeddings (e.g., Transformers) dynamically generate a word's vector representation as a function of its entire surrounding sentence.