Principles of Machine Learning

Last updated 3:45 AM on 4/14/26

51 Terms

1
New cards
What is the primary purpose of Softmax Regression?
It generalizes Logistic Regression to support direct classification of multiple classes without combining multiple binary classifiers.
2
New cards
What is the alternative name for Softmax Regression?
Multinomial Logistic Regression.
3
New cards
In Softmax Regression, how is the score $s_k(\mathbf{x})$ for a specific class $k$ computed?
It is calculated as the dot product of the instance's feature vector and the class-specific parameter vector, represented as $s_k(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\theta}^{(k)}$.
4
New cards

Where are the parameter vectors $\boldsymbol{\theta}^{(k)}$ for all classes typically stored in a Softmax model?

They are stored as rows within a parameter matrix denoted as $\boldsymbol{\Theta}$.

5
New cards

Define the softmax function (normalized exponential) for estimating the probability $\hat{p}_k$.

$\hat{p}_k = \sigma(\mathbf{s}(\mathbf{x}))_k = \dfrac{\exp\left(s_k(\mathbf{x})\right)}{\sum_{j=1}^{K} \exp\left(s_j(\mathbf{x})\right)}$.
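The softmax definition above can be sketched in a few lines of NumPy. This is an illustrative sketch, not part of the card set; the max-subtraction step is a standard numerical-stability trick I have added.

```python
import numpy as np

def softmax(scores):
    """Normalized exponential: turns raw class scores into probabilities."""
    # Subtracting the max score avoids overflow in exp() without
    # changing the result (exp(a - c) / sum exp(b - c) = exp(a) / sum exp(b)).
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
# probs sums to 1, and the largest score gets the largest probability
```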
6
New cards
In the softmax probability equation, what does the variable $K$ represent?
It represents the total number of classes.
7
New cards
What is the vector $\mathbf{s}(\mathbf{x})$ in the context of Softmax Regression?
A vector containing the raw scores computed for every class for a given instance $\mathbf{x}$.
8
New cards
What term is commonly used to describe the raw scores $s_k(\mathbf{x})$ before they are processed by the softmax function?
Logits (or log-odds).
9
New cards
How does the Softmax Regression classifier determine the predicted class $\hat{y}$ for an instance?
It selects the class $k$ that maximizes the estimated probability (the $\text{argmax}$ of the scores).
10
New cards

Formula: The prediction function for $\hat{y}$ using scores.

$\hat{y} = \text{argmax}_k \sigma(\mathbf{s}(\mathbf{x}))_k = \text{argmax}_k s_k(\mathbf{x})$.

11
New cards
Why is Softmax Regression unsuitable for multioutput classification?
It predicts only one class at a time and assumes that classes are mutually exclusive.
12
New cards
Provide an example of a classification task suitable for Softmax Regression.
Classifying an iris flower into exactly one of three distinct species.
13
New cards
What is the training objective for a Softmax Regression model?
To maximize the estimated probability for the target class while minimizing it for all other classes.
14
New cards

What is the likelihood function $L$ for a single data point in Softmax Regression?

$L = \prod_{k=1}^K p_k^{y_k}$.

15
New cards
Which cost function is minimized during the training of Softmax Regression?
Cross entropy (also known as negative log likelihood).
16
New cards

In the cross entropy cost function $J(\boldsymbol{\Theta})$, what does $y_k^{(i)}$ represent?

The target probability (usually 1 or 0) that the $i$th instance belongs to class $k$.

17
New cards

Equation: Cross entropy cost function $J(\boldsymbol{\Theta})$ for $m$ instances.

$J(\boldsymbol{\Theta}) = -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log(\hat{p}_k^{(i)})$.

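The cross entropy cost above translates directly into NumPy. A hedged sketch: the `eps` guard against $\log(0)$ is my addition, not part of the formula on the card.

```python
import numpy as np

def cross_entropy_cost(Y, P_hat, eps=1e-12):
    """Average cross entropy over m instances.

    Y:     (m, K) one-hot target matrix (the y_k^(i) values)
    P_hat: (m, K) estimated probabilities (the p_hat_k^(i) values)
    """
    m = Y.shape[0]
    # eps guards against log(0) when a predicted probability is exactly zero
    return -np.sum(Y * np.log(P_hat + eps)) / m

Y = np.array([[1.0, 0.0], [0.0, 1.0]])
P_hat = np.array([[0.9, 0.1], [0.2, 0.8]])
J = cross_entropy_cost(Y, P_hat)  # equals -(log 0.9 + log 0.8) / 2
```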
18
New cards
From which field of study did the concept of cross entropy originate?
Information theory.
19
New cards
In information theory, what does cross entropy measure regarding message transmission?
The average number of bits sent per option when the encoding is based on a potentially imperfect probability distribution $q$ instead of the true distribution $p$.
20
New cards
Under what condition is the cross entropy equal to the intrinsic entropy of the data?
When the predicted probability distribution perfectly matches the actual distribution of the data.
21
New cards
Define Kullback-Leibler (KL) divergence in the context of cross entropy.
The extra amount by which cross entropy exceeds entropy due to incorrect assumptions about the probability distribution.
22
New cards
Formula: Cross entropy $H(p, q)$ between two probability distributions.
$H(p, q) = -\sum_x p(x) \log q(x)$.
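A small worked example of $H(p, q)$ in bits (base-2 logs), using distributions I have chosen for illustration; it also shows the KL divergence from card 21 as the gap between cross entropy and entropy.

```python
import numpy as np

def H(p, q):
    """Cross entropy in bits between true distribution p and assumed q."""
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])   # true distribution
q = np.array([0.25, 0.5, 0.25])   # mistaken assumed distribution

entropy = H(p, p)     # 1.5 bits: intrinsic entropy, encoding matched to p
cross = H(p, q)       # 1.75 bits: average cost when encoding assumes q
kl = cross - entropy  # 0.25 bits: the KL divergence penalty
```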
23
New cards

Formula: The gradient vector of the cost function with respect to $\boldsymbol{\theta}^{(k)}$.

$\nabla_{\boldsymbol{\theta}^{(k)}} J(\boldsymbol{\Theta}) = \frac{1}{m} \sum_{i=1}^m (\hat{p}_k^{(i)} - y_k^{(i)}) \mathbf{x}^{(i)}$.

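Combining the gradient above with Gradient Descent (card 24) gives a minimal training loop. This is my sketch under the deck's notation, assuming `X` carries a leading bias column of ones and `Y` is one-hot encoded; the toy data and learning rate are illustrative choices.

```python
import numpy as np

def softmax(S):
    exps = np.exp(S - S.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

def fit_softmax(X, Y, eta=0.5, n_iterations=2000):
    """Batch Gradient Descent on the cross entropy cost.

    X: (m, n) instances, first column all ones (bias term)
    Y: (m, K) one-hot targets
    Returns Theta with one column per class parameter vector theta^(k).
    """
    m, n = X.shape
    Theta = np.zeros((n, Y.shape[1]))
    for _ in range(n_iterations):
        P_hat = softmax(X @ Theta)          # estimated probabilities
        gradients = X.T @ (P_hat - Y) / m   # (1/m) sum (p_hat - y) x
        Theta -= eta * gradients
    return Theta

# Toy 1-D data: class 0 for x < 1.5, class 1 for x > 1.5
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Theta = fit_softmax(X, Y)
preds = softmax(X @ Theta).argmax(axis=1)  # expected: [0, 0, 1, 1]
```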
24
New cards
How is the optimal parameter matrix $\boldsymbol{\Theta}$ found once the gradient is computed?
By using an optimization algorithm such as Gradient Descent.
25
New cards
Which Scikit-Learn class is used to implement Softmax Regression?
`LogisticRegression`.
26
New cards
In Scikit-Learn, which hyperparameter must be set to use Softmax Regression instead of One-versus-Rest (OvR)?
The `multi_class` hyperparameter should be set to `"multinomial"`.
27
New cards
Which solver in Scikit-Learn's `LogisticRegression` is noted for supporting the multinomial hyperparameter?
The `"lbfgs"` solver.
28
New cards
What is the default regularization type applied by Scikit-Learn's `LogisticRegression`?
$\ell_2$ regularization.
29
New cards
In Scikit-Learn, which hyperparameter controls the strength of regularization for Softmax Regression?
The hyperparameter `C`.
30
New cards
What characterizes the decision boundaries between any two classes in Softmax Regression?
The decision boundaries are linear.
31
New cards
In a Softmax Regression visualization, what do the curved contour lines typically represent?
The estimated probabilities for a specific class.
32
New cards
Is it possible for a Softmax Regression model to predict a class even if its probability is below $50\%$?
Yes, it can happen if all other classes have even lower probabilities (e.g., at a point where three classes share $33.3\%$ each).
33
New cards
In the Iris dataset example, if an iris has petals $5\text{ cm}$ long and $2\text{ cm}$ wide, what species does the model predict?
Iris virginica (class 2).
34
New cards

According to the Scikit-Learn example, what probability does the model assign to Iris virginica for a 5 cm by 2 cm petal?

Approximately 94.2%
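The Scikit-Learn cards can be tied together in a short sketch. Hedged notes: `C=10` and the petal-only features follow the classic iris example these cards describe, and recent Scikit-Learn versions fit the multinomial (softmax) model by default with more than two classes, so `multi_class="multinomial"` (which is deprecated in newer releases) is not set here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 2:]   # petal length (cm), petal width (cm)
y = iris.target

# With solver="lbfgs" and more than two classes, recent Scikit-Learn
# versions fit the multinomial (softmax) model automatically; older
# versions required multi_class="multinomial" to be set explicitly.
softmax_reg = LogisticRegression(solver="lbfgs", C=10)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5, 2]]))        # class 2: Iris virginica
print(softmax_reg.predict_proba([[5, 2]]))  # class 2 dominates
```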

35
New cards
How does the Softmax function ensure that the sum of all predicted probabilities equals 1?
By dividing each exponentiated score by the sum of all exponentiated scores for all classes.
36
New cards
The Softmax Regression model first computes a score for each class and then applies the _____ function to estimate probabilities.
Softmax
37
New cards
The Softmax Regression model should only be used with _____ exclusive classes.
mutually
38
New cards
Minimizing the cross entropy cost function penalizes the model when it estimates a _____ probability for a target class.
low
39
New cards
What is the mathematical definition of entropy in terms of unpredictability?
It is the intrinsic unpredictability or average information content of a source.
40
New cards
Concept: Multinomial Logistic Regression
Definition: A classification method that extends binary logistic regression to handle multiple, mutually exclusive classes using a softmax layer.
41
New cards
Concept: Cross Entropy
Definition: A loss function used in classification that measures the difference between two probability distributions. Example: Comparing predicted class probabilities against one-hot encoded labels.
42
New cards

In information theory, how many bits are required to encode 8 equally likely options?

3 bits, because $2^3 = 8$.

43
New cards
How does an efficient encoding strategy change the bit length of a highly probable event?
It uses fewer bits (e.g., 1 bit) for the highly probable event and more bits for less frequent events.
44
New cards
What happens to the probabilities at the exact point where all decision boundaries in a Softmax model meet?
All classes have an equal estimated probability.
45
New cards
Why is the Softmax function called the 'normalized exponential'?
Because it takes the exponential of each score and then normalizes them to sum to one.
46
New cards
When using `LogisticRegression` in Scikit-Learn with more than two classes, what is the default behavior if `multi_class` is not specified?
It uses a one-versus-the-rest (OvR) strategy.
47
New cards
What is the result of applying the $\text{argmax}$ operator to the estimated probabilities in Softmax?
The index $k$ of the class with the highest probability.
48
New cards
Term: Logits
Definition: The raw, unnormalized output scores of the linear layer in a classification model, often used as input to the softmax function.
49
New cards
How does $\ell_2$ regularization affect the parameter matrix $\boldsymbol{\Theta}$?
It penalizes large parameter values to prevent overfitting and improve generalization.
50
New cards
Which mathematical identity represents the fact that $y_k^{(i)}$ acts as a selector in the cross entropy formula?
Since $y_k^{(i)}$ is 1 for the target class and 0 otherwise, only the log-probability of the correct class is considered for that instance.