Principles of Machine Learning

Last updated 3:45 AM on 4/14/26

51 Terms

1
New cards
What is the primary purpose of Softmax Regression?
It generalizes Logistic Regression to support direct classification of multiple classes without combining multiple binary classifiers.
2
New cards
What is the alternative name for Softmax Regression?
Multinomial Logistic Regression.
3
New cards
In Softmax Regression, how is the score $s_k(\mathbf{x})$ for a specific class $k$ computed?
It is calculated as the dot product of the instance's feature vector and the class-specific parameter vector, represented as $s_k(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\theta}^{(k)}$.
4
New cards

Where are the parameter vectors $\boldsymbol{\theta}^{(k)}$ for all classes typically stored in a Softmax model?

They are stored as rows within a parameter matrix denoted as $\boldsymbol{\Theta}$.

5
New cards

Define the softmax function (normalized exponential) for estimating the probability $\hat{p}_k$.

$\hat{p}_k = \sigma(\mathbf{s}(\mathbf{x}))_k = \dfrac{\exp\left(s_k(\mathbf{x})\right)}{\sum_{j=1}^{K} \exp\left(s_j(\mathbf{x})\right)}$.
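The softmax definition above can be sketched in a few lines of NumPy. This is an illustrative sketch, not part of the card set; the max-subtraction step is a standard numerical-stability trick I have added.

```python
import numpy as np

def softmax(scores):
    """Normalized exponential: turns raw class scores into probabilities."""
    # Subtracting the max score avoids overflow in exp() without
    # changing the result (exp(a - c) / sum exp(b - c) = exp(a) / sum exp(b)).
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
# probs sums to 1, and the largest score gets the largest probability
```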
6
New cards
In the softmax probability equation, what does the variable $K$ represent?
It represents the total number of classes.
7
New cards
What is the vector $\mathbf{s}(\mathbf{x})$ in the context of Softmax Regression?
A vector containing the raw scores computed for every class for a given instance $\mathbf{x}$.
8
New cards
What term is commonly used to describe the raw scores $s_k(\mathbf{x})$ before they are processed by the softmax function?
Logits (or log-odds).
9
New cards
How does the Softmax Regression classifier determine the predicted class $\hat{y}$ for an instance?
It selects the class $k$ that maximizes the estimated probability (the $\text{argmax}$ of the scores).
10
New cards

Formula: The prediction function for $\hat{y}$ using scores.

$\hat{y} = \text{argmax}_k \sigma(\mathbf{s}(\mathbf{x}))_k = \text{argmax}_k s_k(\mathbf{x})$.

11
New cards
Why is Softmax Regression unsuitable for multioutput classification?
It predicts only one class at a time and assumes that classes are mutually exclusive.
12
New cards
Provide an example of a classification task suitable for Softmax Regression.
Classifying an iris flower into exactly one of three distinct species.
13
New cards
What is the training objective for a Softmax Regression model?
To maximize the estimated probability for the target class while minimizing it for all other classes.
14
New cards

What is the likelihood function $L$ for a single data point in Softmax Regression?

$L = \prod_{k=1}^K p_k^{y_k}$.

15
New cards
Which cost function is minimized during the training of Softmax Regression?
Cross entropy (also known as negative log likelihood).
16
New cards

In the cross entropy cost function $J(\boldsymbol{\Theta})$, what does $y_k^{(i)}$ represent?

The target probability (usually 1 or 0) that the $i$th instance belongs to class $k$.

17
New cards

Equation: Cross entropy cost function $J(\boldsymbol{\Theta})$ for $m$ instances.

$J(\boldsymbol{\Theta}) = -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log(\hat{p}_k^{(i)})$.

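The cross entropy cost above translates directly into NumPy. A hedged sketch: the `eps` guard against $\log(0)$ is my addition, not part of the formula on the card.

```python
import numpy as np

def cross_entropy_cost(Y, P_hat, eps=1e-12):
    """Average cross entropy over m instances.

    Y:     (m, K) one-hot target matrix (the y_k^(i) values)
    P_hat: (m, K) estimated probabilities (the p_hat_k^(i) values)
    """
    m = Y.shape[0]
    # eps guards against log(0) when a predicted probability is exactly zero
    return -np.sum(Y * np.log(P_hat + eps)) / m

Y = np.array([[1.0, 0.0], [0.0, 1.0]])
P_hat = np.array([[0.9, 0.1], [0.2, 0.8]])
J = cross_entropy_cost(Y, P_hat)  # equals -(log 0.9 + log 0.8) / 2
```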
18
New cards
From which field of study did the concept of cross entropy originate?
Information theory.
19
New cards
In information theory, what does cross entropy measure regarding message transmission?
The average number of bits sent per option when the encoding is based on a potentially imperfect probability distribution $q$ instead of the true distribution $p$.
20
New cards
Under what condition is the cross entropy equal to the intrinsic entropy of the data?
When the predicted probability distribution perfectly matches the actual distribution of the data.
21
New cards
Define Kullback-Leibler (KL) divergence in the context of cross entropy.
The extra amount by which cross entropy exceeds entropy due to incorrect assumptions about the probability distribution.
22
New cards
Formula: Cross entropy $H(p, q)$ between two probability distributions.
$H(p, q) = -\sum_x p(x) \log q(x)$.
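A small worked example of $H(p, q)$ in bits (base-2 logs), using distributions I have chosen for illustration; it also shows the KL divergence from card 21 as the gap between cross entropy and entropy.

```python
import numpy as np

def H(p, q):
    """Cross entropy in bits between true distribution p and assumed q."""
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])   # true distribution
q = np.array([0.25, 0.5, 0.25])   # mistaken assumed distribution

entropy = H(p, p)     # 1.5 bits: intrinsic entropy, encoding matched to p
cross = H(p, q)       # 1.75 bits: average cost when encoding assumes q
kl = cross - entropy  # 0.25 bits: the KL divergence penalty
```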
23
New cards

Formula: The gradient vector of the cost function with respect to $\boldsymbol{\theta}^{(k)}$.

$\nabla_{\boldsymbol{\theta}^{(k)}} J(\boldsymbol{\Theta}) = \frac{1}{m} \sum_{i=1}^m (\hat{p}_k^{(i)} - y_k^{(i)}) \mathbf{x}^{(i)}$.

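Combining the gradient above with Gradient Descent (card 24) gives a minimal training loop. This is my sketch under the deck's notation, assuming `X` carries a leading bias column of ones and `Y` is one-hot encoded; the toy data and learning rate are illustrative choices.

```python
import numpy as np

def softmax(S):
    exps = np.exp(S - S.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

def fit_softmax(X, Y, eta=0.5, n_iterations=2000):
    """Batch Gradient Descent on the cross entropy cost.

    X: (m, n) instances, first column all ones (bias term)
    Y: (m, K) one-hot targets
    Returns Theta with one column per class parameter vector theta^(k).
    """
    m, n = X.shape
    Theta = np.zeros((n, Y.shape[1]))
    for _ in range(n_iterations):
        P_hat = softmax(X @ Theta)          # estimated probabilities
        gradients = X.T @ (P_hat - Y) / m   # (1/m) sum (p_hat - y) x
        Theta -= eta * gradients
    return Theta

# Toy 1-D data: class 0 for x < 1.5, class 1 for x > 1.5
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Theta = fit_softmax(X, Y)
preds = softmax(X @ Theta).argmax(axis=1)  # expected: [0, 0, 1, 1]
```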
24
New cards
How is the optimal parameter matrix $\boldsymbol{\Theta}$ found once the gradient is computed?
By using an optimization algorithm such as Gradient Descent.
25
New cards
Which Scikit-Learn class is used to implement Softmax Regression?
`LogisticRegression`.
26
New cards
In Scikit-Learn, which hyperparameter must be set to use Softmax Regression instead of One-versus-Rest (OvR)?
The `multi_class` hyperparameter should be set to `"multinomial"`.
27
New cards
Which solver in Scikit-Learn's `LogisticRegression` is noted for supporting the multinomial hyperparameter?
The `"lbfgs"` solver.
28
New cards
What is the default regularization type applied by Scikit-Learn's `LogisticRegression`?
$\ell_2$ regularization.
29
New cards
In Scikit-Learn, which hyperparameter controls the strength of regularization for Softmax Regression?
The hyperparameter `C`.
30
New cards
What characterizes the decision boundaries between any two classes in Softmax Regression?
The decision boundaries are linear.
31
New cards
In a Softmax Regression visualization, what do the curved contour lines typically represent?
The estimated probabilities for a specific class.
32
New cards
Is it possible for a Softmax Regression model to predict a class even if its probability is below $50\%$?
Yes, it can happen if all other classes have even lower probabilities (e.g., at a point where three classes share $33.3\%$ each).
33
New cards
In the Iris dataset example, if an iris has petals $5\text{ cm}$ long and $2\text{ cm}$ wide, what species does the model predict?
Iris virginica (class 2).
34
New cards

According to the Scikit-Learn example, what probability does the model assign to Iris virginica for a 5 cm by 2 cm petal?

Approximately 94.2%
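The Scikit-Learn cards can be tied together in a short sketch. Hedged notes: `C=10` and the petal-only features follow the classic iris example these cards describe, and recent Scikit-Learn versions fit the multinomial (softmax) model by default with more than two classes, so `multi_class="multinomial"` (which is deprecated in newer releases) is not set here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 2:]   # petal length (cm), petal width (cm)
y = iris.target

# With solver="lbfgs" and more than two classes, recent Scikit-Learn
# versions fit the multinomial (softmax) model automatically; older
# versions required multi_class="multinomial" to be set explicitly.
softmax_reg = LogisticRegression(solver="lbfgs", C=10)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5, 2]]))        # class 2: Iris virginica
print(softmax_reg.predict_proba([[5, 2]]))  # class 2 dominates
```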

35
New cards
How does the Softmax function ensure that the sum of all predicted probabilities equals 1?
By dividing each exponentiated score by the sum of all exponentiated scores for all classes.
36
New cards
The Softmax Regression model first computes a score for each class and then applies the _____ function to estimate probabilities.
Softmax
37
New cards
The Softmax Regression model should only be used with _____ exclusive classes.
mutually
38
New cards
Minimizing the cross entropy cost function penalizes the model when it estimates a _____ probability for a target class.
low
39
New cards
What is the mathematical definition of entropy in terms of unpredictability?
It is the intrinsic unpredictability or average information content of a source.
40
New cards
Concept: Multinomial Logistic Regression
Definition: A classification method that extends binary logistic regression to handle multiple, mutually exclusive classes using a softmax layer.
41
New cards
Concept: Cross Entropy
Definition: A loss function used in classification that measures the difference between two probability distributions. Example: Comparing predicted class probabilities against one-hot encoded labels.
42
New cards

In information theory, how many bits are required to encode 8 equally likely options?

3 bits, because $2^3 = 8$.

43
New cards
How does an efficient encoding strategy change the bit length of a highly probable event?
It uses fewer bits (e.g., 1 bit) for the highly probable event and more bits for less frequent events.
44
New cards
What happens to the probabilities at the exact point where all decision boundaries in a Softmax model meet?
All classes have an equal estimated probability.
45
New cards
Why is the Softmax function called the 'normalized exponential'?
Because it takes the exponential of each score and then normalizes them to sum to one.
46
New cards
When using `LogisticRegression` in Scikit-Learn with more than two classes, what is the default behavior if `multi_class` is not specified?
It uses a one-versus-the-rest (OvR) strategy.
47
New cards
What is the result of applying the $\text{argmax}$ operator to the estimated probabilities in Softmax?
The index $k$ of the class with the highest probability.
48
New cards
Term: Logits
Definition: The raw, unnormalized output scores of the linear layer in a classification model, often used as input to the softmax function.
49
New cards
How does $\ell_2$ regularization affect the parameter matrix $\boldsymbol{\Theta}$?
It penalizes large parameter values to prevent overfitting and improve generalization.
50
New cards
Which mathematical identity represents the fact that $y_k^{(i)}$ acts as a selector in the cross entropy formula?
Since $y_k^{(i)}$ is 1 for the target class and 0 otherwise, only the log-probability of the correct class is considered for that instance.