2. Decision and Information Theory

Last updated 1:25 PM on 6/5/26

26 Terms

Model Uncertainty Definition

Model Uncertainty: The assumption that a true mapping f(x) exists, but the parameters θ of our model f̂(x; θ) remain uncertain given the observed data D.

Model Parameters (θ) in Linear vs. Logistic Models

Model Parameters (θ): Learned internal variables, such as weights (w) representing feature importance and bias (b) representing the offset.

Linear Model: f(x; θ) = b + wx
Logistic Model: Uses the same parameters but passes the linear output through a sigmoid (binary) or softmax (multi-class) function to produce probabilities.
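
A minimal sketch of the two models for a scalar input, to show that they share the same parameters. The function names and the values of `w` and `b` are illustrative assumptions, not part of the card.

```python
import math

def linear_model(x, w, b):
    """Linear model f(x; theta) = b + w * x for a scalar input x."""
    return b + w * x

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_model(x, w, b):
    """Same parameters as the linear model, passed through a sigmoid."""
    return sigmoid(linear_model(x, w, b))

w, b = 2.0, -1.0                    # illustrative parameter values (assumed)
score = linear_model(0.5, w, b)     # raw linear output
prob = logistic_model(0.5, w, b)    # the same output squashed into (0, 1)
```

The linear model returns an unbounded score, while the logistic model always returns a value that can be read as a probability.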

Likelihood as a Distribution Definition

Likelihood as a Distribution: Modelling the relationship between observations (x ∈ D) and classes (c ∈ C) as a conditional probability distribution: p(y = c | x; θ).

What is the Posterior Predictive Distribution (PPD)? State its goal and its mathematical formula.

Goal: To predict the output y for a new input x by integrating over all possible parameter configurations instead of relying on a single point estimate.
Formula: p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ
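
As a rough illustration, the integral can be replaced by a sum when θ takes only a few values. The sketch below assumes a coin-flip setting with no input x and a made-up discrete posterior; both are simplifications chosen for the example.

```python
# Hypothetical discrete posterior over a coin's bias theta: p(theta | D).
posterior = {0.2: 0.1, 0.5: 0.3, 0.8: 0.6}

def likelihood(y, theta):
    """Bernoulli likelihood p(y | theta) for y in {0, 1}."""
    return theta if y == 1 else 1.0 - theta

def posterior_predictive(y, posterior):
    """p(y | D) = sum over theta of p(y | theta) * p(theta | D)."""
    return sum(likelihood(y, theta) * weight for theta, weight in posterior.items())

p_heads = posterior_predictive(1, posterior)
p_tails = posterior_predictive(0, posterior)
```

Every candidate θ contributes to the prediction in proportion to its posterior weight, which is what distinguishes the PPD from plugging in a single point estimate.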

State Bayes' Theorem formula as applied to parameters θ and data D. Define all 4 of its components.

Formula:

p(θ | D) = ( p(D | θ) * p(θ) ) / p(D)

Components:

Posterior p(θ | D): Our updated belief about the parameters after observing the data.
Likelihood p(D | θ): How well the parameters explain the observed data.
Prior p(θ): Our initial belief about the parameters before seeing any data.
Evidence p(D): The marginal likelihood, computed as ∫ p(D | θ)p(θ)dθ. It serves as a normalising constant to ensure the posterior sums/integrates to 1.
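
The four components can be seen together in a small grid approximation. This sketch assumes a coin with unknown bias θ, a uniform prior, and a grid of candidate values; the function name `grid_posterior` and the data are hypothetical.

```python
def grid_posterior(heads, tails, grid_size=101):
    """Posterior over a coin's bias theta on an evenly spaced grid (uniform prior)."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size                        # p(theta)
    likelihood = [t**heads * (1 - t)**tails for t in thetas]     # p(D | theta)
    unnormalised = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalised)                                 # p(D), a sum instead of an integral
    posterior = [u / evidence for u in unnormalised]             # p(theta | D)
    return thetas, posterior, evidence

thetas, posterior, evidence = grid_posterior(heads=7, tails=3)
most_probable = thetas[posterior.index(max(posterior))]
```

Dividing by the evidence is the only step that makes the posterior sum to 1; it does not change which θ is most probable.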

Why is computing the Evidence p(𝓓) often intractable?

Because it requires solving a complex integral over high-dimensional parameter spaces: ∫ p(𝓓|θ)p(θ)dθ.

What is Maximum Likelihood Estimation (MLE)? State its optimization formula.

Definition: A point estimation method where we choose the specific parameters θ that make the observed training data look as probable as possible. It completely ignores any prior beliefs.
Formula: θ_MLE = argmax_θ p(D | θ)
Connection: Because multiplying many small probabilities can underflow numerically, we typically compute the MLE by minimizing the Negative Log-Likelihood.
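
The numerical issue is easy to reproduce. In this sketch the per-sample probabilities are made-up values; the point is only that their product underflows in double precision while the sum of their logarithms stays finite.

```python
import math

probs = [0.5] * 2000                          # illustrative per-sample likelihoods (assumed)
product = math.prod(probs)                    # underflows to 0.0 in double precision
log_sum = sum(math.log(p) for p in probs)     # stays finite: 2000 * log(0.5)
```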

What is Negative Log-Likelihood (NLL)? State its formula and why it is used.

Definition: The negative logarithm of the likelihood function. It turns the product of probabilities into a sum of log-probabilities and flips the problem from maximization to minimization.
Formula: NLL(θ) ≜ -log p(D | θ)
For i.i.d. observations x_i: θ_MLE = argmin_θ [ - ∑_i log p(x_i | θ) ]
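
A minimal sketch of MLE by NLL minimization, assuming Bernoulli (coin-flip) data and a simple grid search instead of a real optimizer. The data and names are hypothetical.

```python
import math

def bernoulli_nll(theta, data):
    """NLL(theta) = -sum_i log p(x_i | theta) for binary observations x_i."""
    return -sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]            # made-up coin flips: 7 heads, 3 tails
candidates = [i / 100 for i in range(1, 100)]    # skip 0 and 1, where log(0) is undefined
theta_mle = min(candidates, key=lambda t: bernoulli_nll(t, data))
```

For Bernoulli data the minimizer coincides with the observed fraction of heads, which gives a quick sanity check on the grid search.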

What is Maximum A Posteriori (MAP) Estimation? State its optimization formula.

Definition: A point estimation method that chooses the parameters θ that maximize the posterior distribution. It functions as MLE plus an explicit prior distribution.
Formula:
θ_MAP = argmax_θ p(θ | D)
      = argmin_θ [ -log p(D | θ) - log p(θ) ]
Connection: The introduction of the prior term (-log p(θ)) works directly as a regularizer to prevent overfitting.
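
To make the regularizing effect concrete, the sketch below compares MLE and MAP on the same coin-flip counts. It assumes a Beta(3, 3) prior centred on 0.5 and a grid search; the prior choice, the counts, and all names are illustrative.

```python
import math

def neg_log_likelihood(theta, heads, tails):
    """-log p(D | theta) for Bernoulli data, up to a constant."""
    return -(heads * math.log(theta) + tails * math.log(1.0 - theta))

def neg_log_prior(theta, a, b):
    """-log of a Beta(a, b) prior density, dropping the constant normaliser."""
    return -((a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta))

def map_objective(theta, heads, tails, a, b):
    """-log p(D | theta) - log p(theta), the quantity MAP minimizes."""
    return neg_log_likelihood(theta, heads, tails) + neg_log_prior(theta, a, b)

candidates = [i / 1000 for i in range(1, 1000)]
theta_mle = min(candidates, key=lambda t: neg_log_likelihood(t, 7, 3))
theta_map = min(candidates, key=lambda t: map_objective(t, 7, 3, a=3, b=3))
```

With 7 heads and 3 tails the MAP estimate lands between the prior's centre (0.5) and the MLE (0.7): the prior pulls the estimate away from what the data alone would suggest.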

How do MLE and MAP differ regarding priors?

MLE assumes no prior (or a uniform prior), while MAP explicitly utilizes a prior distribution which acts as a regularizer.

What is Mean Squared Error (MSE)? How does it link mathematically to MLE and MAP?

Definition: A loss function that measures the average squared difference between estimated values and the actual true values.
Link to MLE: MSE is derived directly from MLE if we assume our data contains additive Gaussian noise: ε ~ N(0, σ²). Minimizing MSE is identical to maximizing the likelihood under a Gaussian noise assumption.
Link to MAP: If we add a zero-mean Gaussian prior p(θ) over the parameters to an MSE setup, the MAP objective becomes Ridge Regression: Loss = MSE + λ||θ||²₂
Equation:
$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
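
The link to MLE can be checked numerically: with a fixed noise level σ, the Gaussian NLL is a constant plus a positive multiple of the MSE. The sketch below uses made-up targets and predictions; `ridge_loss` is a hypothetical helper showing the MAP-style objective.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def gaussian_nll(y_true, y_pred, sigma):
    """-sum_i log N(y_i | y_pred_i, sigma^2)."""
    n = len(y_true)
    squared = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 0.5 * n * math.log(2 * math.pi * sigma**2) + squared / (2 * sigma**2)

def ridge_loss(y_true, y_pred, theta, lam):
    """MSE plus an L2 penalty on the parameters (the MAP-style objective)."""
    return mse(y_true, y_pred) + lam * sum(t**2 for t in theta)

y_true = [1.0, 2.0, 3.0]          # made-up targets
y_pred = [1.5, 2.0, 2.5]          # made-up predictions
sigma, n = 1.0, len(y_true)
constant = 0.5 * n * math.log(2 * math.pi * sigma**2)
nll = gaussian_nll(y_true, y_pred, sigma)
nll_from_mse = constant + n * mse(y_true, y_pred) / (2 * sigma**2)
```

Because `constant` and the factor in front of the MSE do not depend on the predictions, any parameters that minimize one quantity also minimize the other.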

What loss function results from combining an MLE Gaussian noise assumption with a Gaussian prior (Ridge Regression)?

Loss = MSE + λ||θ||²₂ (MAP estimation).

What is Logistic Regression? Describe its setup and its loss function link.

Definition: A classification model that computes probabilities by applying a sigmoid (or softmax) to a linear function of the input: ŷ = σ(wᵀx).
Connection: Minimizing the Negative Log-Likelihood (NLL) of a Bernoulli distribution under this setup directly yields the standard Cross-Entropy Loss function.
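
A short sketch of that equivalence for a single example, assuming binary labels in {0, 1}. The weights and inputs are made-up values and the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """y_hat = sigma(w^T x) for weight and input vectors of equal length."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def binary_cross_entropy(y, y_hat):
    """Standard cross-entropy loss for one example with label y in {0, 1}."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def bernoulli_nll(y, y_hat):
    """-log p(y | y_hat) under a Bernoulli distribution with parameter y_hat."""
    likelihood = y_hat if y == 1 else 1.0 - y_hat
    return -math.log(likelihood)

y_hat = predict([0.5, -0.25], [2.0, 1.0])   # made-up weights and inputs
losses = [(binary_cross_entropy(y, y_hat), bernoulli_nll(y, y_hat)) for y in (0, 1)]
```

For both possible labels the two functions return the same number, since the cross-entropy expression is just the Bernoulli log-likelihood written without a branch.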

What is Posterior Expected Loss (Risk)?

The sum of the losses of an action a across all possible hidden states h, weighted by their posterior probability: R(a|x) = ∑_{h ∈ H} ℓ(h, a) p(h|x).
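
As an illustration, the risk of each action can be computed from a posterior and a loss table, and the action with the lowest risk chosen. The states, actions, probabilities, and losses below are all invented for the example.

```python
def expected_loss(action, posterior, loss):
    """R(a | x) = sum over h of loss(h, a) * p(h | x)."""
    return sum(loss[(h, action)] * p for h, p in posterior.items())

# Hypothetical posterior over hidden states and a hypothetical loss table.
posterior = {"sick": 0.3, "healthy": 0.7}
loss = {
    ("sick", "treat"): 0.0,
    ("sick", "wait"): 10.0,
    ("healthy", "treat"): 1.0,
    ("healthy", "wait"): 0.0,
}

risks = {a: expected_loss(a, posterior, loss) for a in ("treat", "wait")}
best_action = min(risks, key=risks.get)
```

Here "healthy" is the more probable state, yet treating has the lower risk because a missed illness is penalized much more heavily than an unnecessary treatment.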

What is the Rejection Option in classification?

A strategy where the model outputs "I don't know" or rejects the action if the highest class probability falls below a set confidence threshold.
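
A minimal sketch of this rule, assuming class probabilities are given as a dictionary. The threshold of 0.8 and the "reject" label are arbitrary choices for the example.

```python
def classify_with_reject(class_probs, threshold=0.8):
    """Return the most probable class, or "reject" if its probability is below threshold."""
    label, prob = max(class_probs.items(), key=lambda item: item[1])
    return label if prob >= threshold else "reject"

confident = classify_with_reject({"cat": 0.9, "dog": 0.1})
unsure = classify_with_reject({"cat": 0.55, "dog": 0.45})
```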

Entropy $\mathbb{H}(X)$ Definition and formula

Definition: The expected level of uncertainty, information, or surprise in a system.
$\mathbb{H}(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$
For a binary variable with b = 2:
- H = 1 → highest surprise (both outcomes equally likely)
- 0 < H < 1 (e.g. 0.5) → less surprise
- H = 0 → certain outcome
Core Intuition: Entropy measures how predictable a system is. The more unpredictable the system, the higher the entropy.
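
The formula translates almost directly into code. This sketch treats terms with p = 0 as contributing nothing, which is the usual convention; the example distributions are made up.

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum_i p(x_i) * log_b p(x_i), skipping zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])       # least predictable binary variable
biased_coin = entropy([0.9, 0.1])     # more predictable, so lower entropy
certain = entropy([1.0, 0.0])         # fully predictable
```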

What is Cross-Entropy between distributions p and q, H(p, q)? State its formula and how it connects to parameter estimation.

Definition: Measures the average surprise when you use a predicted distribution ( $q$ ) to describe a true distribution ( $p$ ).
Formula: $\mathbb{H}(p, q) \triangleq -\sum_{k=1}^{K} p_k \log q_k$
Estimation Connection: Minimizing cross-entropy loss is mathematically equivalent to maximizing the likelihood of the parameters θ (MLE).
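
One way to see the connection: the cross-entropy between the empirical label distribution and the model's prediction equals the average NLL of the labels. The sketch assumes three classes, a fixed predicted distribution `q`, and made-up labels.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log q_k, skipping classes with p_k = 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

labels = [0, 2, 2, 1, 2]       # made-up class indices
q = [0.2, 0.3, 0.5]            # hypothetical predicted class probabilities

# Empirical distribution p: fraction of labels falling in each class.
empirical_p = [labels.count(k) / len(labels) for k in range(len(q))]

avg_nll = -sum(math.log(q[y]) for y in labels) / len(labels)
ce = cross_entropy(empirical_p, q)
```

Since the two quantities agree, choosing parameters that minimize cross-entropy also minimizes the NLL, which is the same as maximizing the likelihood.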

Minimizing Cross-Entropy loss is mathematically equivalent to what estimation framework?

Maximizing the parameter likelihood (MLE) under a Bernoulli or Multinomial distribution.

What is Joint Entropy? State its mathematical formula.

Definition: A metric that measures the total combined uncertainty or "surprise" contained within two random variables X and Y evaluated at the same time.
Formula: $\mathbb{H}(X, Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)$
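
A small sketch of the formula, assuming the joint distribution is stored as a dictionary keyed by (x, y) pairs. Both example tables are invented.

```python
import math

def joint_entropy(joint):
    """H(X, Y) = -sum_{x,y} p(x, y) * log2 p(x, y), with joint[(x, y)] = p(x, y)."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

# Two independent fair coins: every pair is equally likely.
two_coins = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

# Hypothetical weather table where the two variables tend to agree.
weather = {("sun", "hot"): 0.4, ("sun", "cold"): 0.1,
           ("rain", "hot"): 0.1, ("rain", "cold"): 0.4}

h_coins = joint_entropy(two_coins)
h_weather = joint_entropy(weather)
```

The dependent pair has lower joint entropy than the independent one, because knowing one variable already tells you something about the other.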

What is Conditional Entropy H(Y|X)? What does it calculate?

Definition: The amount of remaining uncertainty or "independent surprise" left in a random variable Y after the value of another variable X is observed.
Connection: It is the vital component subtracted from base entropy to calculate Information Gain / Mutual Information.
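
A sketch of the computation, using the standard definition H(Y|X) = -∑ p(x, y) log₂ p(y|x), which the card itself does not spell out. The joint tables are made-up examples.

```python
import math

def conditional_entropy(joint):
    """H(Y | X) = -sum_{x,y} p(x, y) * log2 p(y | x), with joint[(x, y)] = p(x, y)."""
    p_x = {}
    for (x, _y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p          # marginal p(x)
    return -sum(p * math.log2(p / p_x[x]) for (x, _y), p in joint.items() if p > 0)

copy = {(0, 0): 0.5, (1, 1): 0.5}                              # Y always equals X
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}   # X says nothing about Y

h_copy = conditional_entropy(copy)
h_independent = conditional_entropy(independent)
```

When Y is a copy of X nothing is left to learn after seeing X, while for independent variables the full one bit of uncertainty in Y remains.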

What is Mutual Information I(X; Y)? How is it interpreted and where is it applied?

Definition: A measure of the shared information or overlap between two variables. It represents the reduction in uncertainty ("surprise killed") about Y after seeing X.
Connection to Decision Trees: Known as Information Gain. Decision Trees use it as a splitting criterion to select the feature that reduces uncertainty about the target the most.
$\text{IG}(X, Y) = \mathbb{H}(Y) - \mathbb{H}(Y \mid X)$
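
The information gain formula can be sketched from a joint table of feature X and label Y. The helper names and both example tables are hypothetical, and logarithms are taken in base 2.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(joint):
    """IG(X, Y) = H(Y) - H(Y | X), with joint[(x, y)] = p(x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    h_y = entropy(p_y.values())
    h_y_given_x = -sum(p * math.log2(p / p_x[x])
                       for (x, _y), p in joint.items() if p > 0)
    return h_y - h_y_given_x

perfect_feature = {(0, "no"): 0.5, (1, "yes"): 0.5}                     # X determines Y
useless_feature = {(x, y): 0.25 for x in (0, 1) for y in ("no", "yes")}  # X independent of Y

ig_perfect = information_gain(perfect_feature)
ig_useless = information_gain(useless_feature)
```

A decision tree comparing these two features would split on the first one, since it removes all uncertainty about the label while the second removes none.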

What is Kullback-Leibler (KL) Divergence? Provide its discrete and continuous equations.

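Definition: A non-symmetric measure (not a true distance metric) of how much an approximate or predicted distribution q diverges from a true baseline distribution p.
Discrete: $D_{\mathbb{KL}}(p \parallel q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}$
Continuous: $D_{\mathbb{KL}}(p \parallel q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$
Relation to entropy: $D_{\mathbb{KL}}(p \parallel q) = \mathbb{H}(p, q) - \mathbb{H}(p)$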
$\text{KL Divergence (Your waste)} = \text{Cross-Entropy (Your total surprise)} - \text{Entropy (Nature's chaos)}$
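
A brief sketch of the discrete case, mainly to show the asymmetry. The two distributions are made-up values, and natural logarithms are used.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k * log(p_k / q_k), skipping terms with p_k = 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)    # p treated as the true distribution
reverse = kl_divergence(q, p)    # roles swapped
same = kl_divergence(p, p)       # a distribution never diverges from itself
```

Swapping the arguments changes the result, which is why KL divergence does not qualify as a distance in the usual sense.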

How do Entropy, Cross-Entropy, and KL Divergence connect mathematically?

The Formula: H(p, q) = H(p) + D_KL(p || q)
Because the entropy H(p) of the true data distribution is fixed by reality, minimizing Cross-Entropy with respect to q is mathematically identical to minimizing KL Divergence. Both push your model distribution (q) to match the real data distribution (p).
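
The identity and its consequence can be checked on a toy example. The true distribution `p` and the candidate models below are invented; the sketch only verifies that both criteria rank the candidates the same way.

```python
import math

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]                                   # hypothetical true distribution
candidates = [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]

# Gap between H(p, q) and H(p) + D_KL(p || q) for each candidate; should be ~0.
gaps = [cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q)) for q in candidates]

best_by_ce = min(candidates, key=lambda q: cross_entropy(p, q))
best_by_kl = min(candidates, key=lambda q: kl_divergence(p, q))
```

Both criteria select the candidate equal to `p`, since the two objectives differ only by the constant H(p).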

Define Posterior Expected Loss (Risk) and the Rejection Option.

Posterior Expected Loss (Risk): The sum of losses resulting from taking a specific action 'a' across all hidden states 'h', scaled by their posterior probability: R(a|x) = ∑_{h ∈ H} ℓ(h, a) p(h|x)
Rejection Option: An action in classification where if the highest posterior class probability fails to meet a designated confidence threshold, the model triggers a "reject" ("I don't know") choice to avoid making errors.

Why are the Posterior and Posterior Predictive Distributions considered "intractable" in high-dimensional spaces? What are the workarounds?

The Problem: The Evidence denominator p(D) = ∫ p(D|θ)p(θ)dθ requires integrating over the entire parameter space. In high-dimensional spaces, this integral generally has no closed-form solution and is too expensive to compute numerically.
Point Estimation Workaround: MLE and MAP "cheat" by avoiding computing the entire distribution. They only locate the single peak of the likelihood (MLE) or the posterior (MAP), though this discards parameter uncertainty.
Approximation Workaround: Deploying advanced inference frameworks like Markov Chain Monte Carlo (MCMC) or Variational Inference (VI) to approximate the distribution.
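
To give a feel for the MCMC idea, here is a minimal random-walk Metropolis sketch. It assumes a one-dimensional coin-bias problem with a uniform prior, which is far simpler than the high-dimensional cases the card refers to; the step size, sample count, burn-in, and names are arbitrary choices.

```python
import math
import random

def log_unnormalised_posterior(theta, heads, tails):
    """log p(D | theta) + log p(theta) for a uniform prior on (0, 1); p(D) is never needed."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def metropolis(heads, tails, n_samples=20000, step=0.2, seed=0):
    """Random-walk Metropolis sampler for the coin-bias posterior."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_unnormalised_posterior(theta, heads, tails)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        proposed = log_unnormalised_posterior(proposal, heads, tails)
        # Accept with probability min(1, posterior ratio); the evidence cancels in the ratio.
        if rng.random() < math.exp(min(0.0, proposed - current)):
            theta, current = proposal, proposed
        samples.append(theta)
    return samples

samples = metropolis(heads=7, tails=3)
kept = samples[1000:]                       # drop an initial burn-in segment
posterior_mean = sum(kept) / len(kept)
```

The sampler only ever evaluates ratios of unnormalised posterior values, so the intractable evidence term drops out. In this toy case the exact posterior is Beta(8, 4) with mean 8/12, which the sample mean should approximate.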

Summarize the hierarchy links between MLE, MAP, Uniform Priors, and Gaussian assumptions.

MLE to MAP Link: MLE is mathematically identical to a MAP estimate that assumes a perfectly flat, Uniform Prior. MAP becomes MLE + Regularization once an informative prior is assigned.
MLE to MSE Link: Minimizing Mean Squared Error (MSE) is equivalent to maximizing Likelihood (MLE) under the assumption that the errors in the data are Gaussian Noise.