Naïve Bayes Classifier: Derivation and Examples

Description and Tags

These flashcards cover the key concepts, terms, and definitions related to the Naïve Bayes Classifier discussed in the lecture notes.

31 Terms

1

What is the definition of Bayes' Theorem?

Bayes' Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. It mathematically expresses how to update the probability for a hypothesis as more evidence or information becomes available.

2

What is the formula for Bayes' Theorem in conditional probability?

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
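
Worked example (a minimal sketch with made-up numbers, not taken from the lecture notes), computing a posterior directly from the formula:

```python
# Bayes' Theorem with illustrative (assumed) numbers:
# A = "email is spam", B = "email contains the word 'free'".
p_A = 0.30          # prior P(A): assume 30% of emails are spam
p_B_given_A = 0.60  # likelihood P(B|A): assume 60% of spam contains 'free'
p_B = 0.25          # evidence P(B): assume 25% of all emails contain 'free'

p_A_given_B = p_B_given_A * p_A / p_B  # posterior P(A|B)
print(p_A_given_B)  # ≈ 0.72
```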

3

In Bayes' Theorem, what does P(A|B) represent?

The posterior probability, which is the probability of hypothesis A being true given that evidence B has been observed. It's the updated probability after considering the evidence.

4

In Bayes' Theorem, what does P(B|A) represent?

The likelihood, which is the probability of observing evidence B given that hypothesis A is true. It quantifies how well the evidence supports the hypothesis.

5

In Bayes' Theorem, what does P(A) represent?

The prior probability, which is the initial probability of hypothesis A being true before any evidence B is observed. It reflects our initial belief.

6

In Bayes' Theorem, what does P(B) represent?

The evidence or marginal likelihood, which is the probability of observing evidence B, irrespective of any specific hypothesis A. It acts as a normalizing constant to ensure the posterior probabilities sum to 1.

7

In Naïve Bayes classification, what does P(C_k | \mathbf{x}) denote?

This is the posterior probability of a specific class C_k (e.g., 'spam' or 'not spam') given an observed feature vector \mathbf{x} (e.g., words in an email). It represents the updated belief in the class C_k after considering the features present in \mathbf{x}.

8

What is the fundamental Naïve Bayes assumption regarding features?

The Naïve Bayes assumption states that all features are conditionally independent of each other given the class variable. Mathematically, for a feature vector \mathbf{x} = (x_1, \dots, x_n) and class C_k, this means P(x_1, \dots, x_n | C_k) = \prod_{i=1}^{n} P(x_i | C_k). This greatly simplifies the likelihood calculation.
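
A minimal sketch of the factorized likelihood (the per-feature probabilities below are invented for illustration):

```python
# Under conditional independence, the class-conditional likelihood factorizes
# into a product of per-feature likelihoods. Values are assumed, not real data.
p_word_given_spam = {"free": 0.40, "money": 0.25, "meeting": 0.05}

observed_words = ["free", "money"]
likelihood = 1.0
for word in observed_words:
    # P(x_1, ..., x_n | spam) = product of P(x_i | spam)
    likelihood *= p_word_given_spam[word]
print(likelihood)  # 0.1
```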

9

What does conditional independence mean in the context of Naïve Bayes?

Conditional independence means that the presence or absence of one feature (x_i) does not affect the probability of any other feature (x_j) occurring, given that we already know the class variable (C_k). For example, if we know an email is 'spam', the probability of the word 'money' appearing is independent of the word 'free' appearing.

10

What is the Maximum A Posteriori (MAP) classification rule in Naïve Bayes?

The MAP rule in Naïve Bayes classifies an instance \mathbf{x} into the class C_k that maximizes the posterior probability P(C_k | \mathbf{x}). Because P(\mathbf{x}) is constant across classes, maximizing the posterior is equivalent to maximizing P(C_k) \prod_{i=1}^{n} P(x_i | C_k), so the rule reduces to finding the class that maximizes this product.

11

What is the formula for the Naïve Bayes classification rule using the MAP principle?

\hat{y} = \arg \max_{k} P(C_k) \prod_{i=1}^{n} P(x_i | C_k). The predicted class \hat{y} is the class C_k that yields the highest value for this product.
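
A minimal Python sketch of the MAP rule (priors and likelihoods are made up for illustration):

```python
import math

# MAP decision rule: pick the class maximizing P(C_k) * prod_i P(x_i | C_k).
# All numbers below are assumed.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {
    "spam": {"free": 0.40, "money": 0.25},
    "ham":  {"free": 0.02, "money": 0.01},
}
observed = ["free", "money"]

scores = {c: priors[c] * math.prod(likelihoods[c][w] for w in observed)
          for c in priors}
y_hat = max(scores, key=scores.get)  # arg max over classes
print(scores)   # ≈ {'spam': 0.03, 'ham': 0.00014}
print(y_hat)    # 'spam'
```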

12

In the Naïve Bayes classification rule, what does P(C_k) signify?

This is the prior probability of class C_k. It represents the overall frequency or proportion of observations belonging to class C_k in the training data, before considering any specific features of the new instance.

13

In the Naïve Bayes classification rule, what does P(x_i | C_k) signify?

This is the likelihood of observing feature x_i given class C_k. It indicates how probable a specific value for feature x_i is, considering that the instance belongs to class C_k. These likelihoods are typically estimated from the training data.

14

Why is log-space transformation commonly used in Naïve Bayes calculations?

Log-space is used to prevent numerical underflow, which occurs when multiplying many very small probability values (e.g., 0.001 \times 0.0001). These products can become so infinitesimally small they round down to zero on a computer, leading to incorrect or undefined results, especially when comparing probabilities.

15

How does log-space help in Naïve Bayes calculations?

By transforming products into sums using logarithms (\log(a \cdot b) = \log(a) + \log(b)), we avoid underflow. Instead of directly calculating \prod P(x_i | C_k), we compute \sum \log(P(x_i | C_k)). This keeps the intermediate and final values within a manageable numerical range, since sums of log-probabilities stay well within floating-point limits, whereas products of many small probabilities underflow to zero.
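
A short demonstration of the underflow problem and the log-space fix (synthetic probabilities):

```python
import math

# 100 small per-feature likelihoods (synthetic). Their true product is 1e-500,
# far below the smallest positive float, so the direct product underflows to 0.0.
probs = [1e-5] * 100

direct = math.prod(probs)                    # 0.0 due to underflow
log_score = sum(math.log(p) for p in probs)  # ≈ -1151.29, still representable

print(direct)     # 0.0
print(log_score)  # ≈ -1151.29
```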

16

What is the purpose of Laplace Smoothing (also known as Add-\alpha smoothing) in Naïve Bayes?

Laplace Smoothing is applied to address the zero-frequency problem, which occurs when a particular feature value has not been observed with a certain class in the training data. Without smoothing, the estimated likelihood for such a combination would be P(x_i | C_k) = 0, causing the entire posterior probability for that class to become zero and preventing proper classification.

17

What is the formula for Laplace Smoothing when estimating P(x_i = v | C_k)?

P(x_i = v | C_k) = \frac{N_{ik}(v) + \alpha}{N_k + \alpha |V_i|}, where \alpha is typically 1 for Laplace smoothing (or other positive values for Add-\alpha smoothing), N_{ik}(v) is the count of feature x_i having value v in class C_k, N_k is the total count of samples in class C_k, and |V_i| is the number of possible unique values for feature x_i.
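
A minimal sketch of the smoothed estimate (the counts are assumed for illustration):

```python
# Add-alpha (Laplace) smoothing with made-up counts.
alpha = 1.0   # Laplace smoothing uses alpha = 1
N_k = 50      # assumed: training samples in class C_k
N_ikv = 0     # assumed: times feature x_i took value v within class C_k (unseen)
V_i = 3       # assumed: number of distinct values feature x_i can take

p_unsmoothed = N_ikv / N_k                          # 0.0 -> zero-frequency problem
p_smoothed = (N_ikv + alpha) / (N_k + alpha * V_i)  # 1/53 ≈ 0.0189
print(p_unsmoothed, p_smoothed)
```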

18

In the Laplace Smoothing formula, what do N_{ik}(v) and N_k represent?

N_{ik}(v) is the count of how many times feature x_i takes on the specific value v within the training examples belonging to class C_k. N_k is the total number of training examples that belong to class C_k.

19

In the Laplace Smoothing formula, what do \alpha and |V_i| represent?

\alpha is a smoothing parameter (typically 1 for Laplace smoothing), which adds a pseudo-count to every possible feature value, effectively 'seeing' each value at least once. |V_i| is the number of unique possible values that the categorical feature x_i can take (e.g., for a binary feature like 'gender', |V_i| = 2).

20

What is Gaussian Naïve Bayes primarily used for?

Gaussian Naïve Bayes is a variant of Naïve Bayes specifically designed to handle continuous features. Instead of using discrete counts for likelihoods, it models the distribution of each continuous feature using a Gaussian (normal) distribution for each class.

21

How is the mean (\mu_{ik}) for a continuous feature x_i in class C_k estimated in Gaussian Naïve Bayes?

The mean \mu_{ik} is estimated separately for each feature x_i and each class C_k. The formula is: \mu_{ik} = \frac{1}{N_k} \sum_{j \in C_k} x_{ji}, where N_k is the number of training samples in class C_k, and the sum is over the values of feature x_i for all training samples j that belong to class C_k.

22

How is the variance (\sigma_{ik}^2) for a continuous feature x_i in class C_k estimated in Gaussian Naïve Bayes?

The variance is estimated for each feature x_i and each class C_k using the sample variance formula: \sigma_{ik}^2 = \frac{1}{N_k - 1} \sum_{j \in C_k} (x_{ji} - \mu_{ik})^2, where N_k is the number of training samples in class C_k, x_{ji} is the value of feature i for sample j, and \mu_{ik} is the estimated mean for feature i in class C_k.
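
A minimal NumPy sketch of these per-class parameter estimates (the tiny dataset is made up):

```python
import numpy as np

# Estimate the Gaussian parameters mu_ik and sigma_ik^2 for each class and feature.
# The data below is invented for illustration (two continuous features).
X = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 3.3], [6.5, 3.0]])
y = np.array(["A", "A", "B", "B"])

for c in np.unique(y):
    X_c = X[y == c]
    mu = X_c.mean(axis=0)           # per-feature mean within class c
    var = X_c.var(axis=0, ddof=1)   # sample variance, divides by N_k - 1
    print(c, mu, var)
```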

23

What is the role and estimation of the prior probability P(C_k) in Naïve Bayes?

The prior probability P(C_k) represents the initial belief or prevalence of class C_k in the dataset before any features are observed. It is typically estimated from the training data as the proportion of samples belonging to class C_k: P(C_k) = \frac{\text{Number of samples in class } C_k}{\text{Total number of training samples}}.

24

What does the likelihood P(\mathbf{x} | C_k) represent in Naïve Bayes classification?

The likelihood P(\mathbf{x} | C_k) represents the probability of observing the entire feature vector \mathbf{x} given that the instance belongs to class C_k. Due to the Naïve Bayes conditional independence assumption, this likelihood is simplified to a product of individual feature likelihoods: \prod_{i=1}^{n} P(x_i | C_k). For categorical features, P(x_i | C_k) involves counts; for continuous features (Gaussian Naïve Bayes), it uses the Gaussian PDF.

25

How is the posterior probability P(C_k | \mathbf{x}) explicitly calculated using Bayes' Theorem in the context of Naïve Bayes?

The posterior probability is calculated as: P(C_k | \mathbf{x}) = \frac{P(\mathbf{x} | C_k) \cdot P(C_k)}{P(\mathbf{x})}. Here P(C_k | \mathbf{x}) is the posterior, P(\mathbf{x} | C_k) is the likelihood, P(C_k) is the prior, and P(\mathbf{x}) is the evidence (or marginal likelihood).

26

In Naïve Bayes, what is the 'evidence' term P(\mathbf{x}), and why is it often ignored during classification?

The evidence term P(\mathbf{x}) is the probability of observing the feature vector \mathbf{x} itself, regardless of the class. It acts as a normalizing constant. In classification, when we want to find \arg \max_k P(C_k | \mathbf{x}), we can ignore P(\mathbf{x}) because it is constant for all classes C_k and therefore does not affect which C_k maximizes the probability (i.e., it does not change the ranking of posterior probabilities across classes).

27

What is the primary objective of the Gaussian Naïve Bayes classification rule?

The objective is to assign an incoming instance with continuous features \mathbf{x} to the class C_k that has the highest posterior probability P(C_k | \mathbf{x}). This is achieved by calculating P(C_k) \cdot \prod_{i=1}^{n} P(x_i | C_k), where each P(x_i | C_k) is computed using the Gaussian Probability Density Function (PDF).

28

Explain the zero-frequency problem and how it impacts Naïve Bayes classification without smoothing.

The zero-frequency problem occurs when a specific combination of a feature value and a class (e.g., a word that never appears in any 'spam' training email) is not present in the training data. Without smoothing, the estimated likelihood P(x_i | C_k) for this combination would be 0. Since the Naïve Bayes likelihood is computed as a product \prod P(x_i | C_k), a single zero term makes the entire likelihood for that class 0, effectively preventing that class from ever being predicted, regardless of other strong evidence.

29

Describe the general structure and working principle of a Naïve Bayes classifier.

A Naïve Bayes classifier builds a probabilistic model by learning the prior probabilities of classes (P(C_k)) and the likelihoods of features given classes (P(x_i | C_k)) from the training data. It then applies Bayes' Theorem with the conditional independence assumption to predict the class of a new instance by calculating the posterior probabilities P(C_k | \mathbf{x}) for all possible classes and selecting the one with the highest value (MAP classification).
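
For a quick end-to-end illustration, here is a minimal sketch using scikit-learn's GaussianNB (assuming scikit-learn is installed; the dataset is made up):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny invented dataset: two continuous features, two classes.
X = np.array([[5.1, 3.5], [4.9, 3.0], [5.0, 3.4],
              [6.3, 3.3], [6.5, 3.0], [6.4, 3.2]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()
model.fit(X, y)                    # learns class priors and per-class means/variances

x_new = np.array([[6.0, 3.1]])
print(model.predict(x_new))        # MAP class prediction
print(model.predict_proba(x_new))  # posterior probabilities P(C_k | x)
```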

30

In Gaussian Naïve Bayes, how does the assumption of Gaussian distribution for continuous features influence the likelihood calculation?

For each continuous feature x_i and each class C_k, Gaussian Naïve Bayes assumes that the values of x_i for instances belonging to class C_k follow a normal (Gaussian) distribution. The likelihood P(x_i | C_k) is then calculated using the Probability Density Function (PDF) of the Gaussian distribution, which is parameterized by the mean (\mu_{ik}) and variance (\sigma_{ik}^2) estimated from the training data for that specific feature and class.

31

What is the Probability Density Function (PDF) for a Gaussian distribution used in Gaussian Naïve Bayes?

The PDF for a Gaussian distribution for feature x_i given class C_k is: P(x_i | C_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right), where \mu_{ik} is the mean and \sigma_{ik}^2 is the variance of feature x_i for class C_k. This function tells us the relative likelihood of observing a particular value for x_i.
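
A minimal Python implementation of this density (the mean and variance values are assumed for illustration):

```python
import math

def gaussian_pdf(x, mu, var):
    """Gaussian density used for the likelihood P(x_i | C_k) in Gaussian Naive Bayes."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Assumed parameters: feature x_i in class C_k has mean 5.0 and variance 0.25.
print(gaussian_pdf(5.0, mu=5.0, var=0.25))  # ≈ 0.798 (density at the mean)
print(gaussian_pdf(6.0, mu=5.0, var=0.25))  # ≈ 0.108
```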