logistic regression


1

generative classifiers + discriminative

if distinguishing cat from dog images with a generative classifier:

build a model of what’s in a cat image

  • whiskers, eyes, ears

  • assigns a probability to any image to determine how cat-like that image is

  • similarly, build a model of what’s in a dog image

now, given a new image, run both models + see which fits better

if distinguishing cat from dog images with a discriminative classifier:

  • just try to distinguish dogs from cats. oh look, dogs have collars. let’s ignore everything else.

2

generative vs discriminative classifiers

generative (naive bayes)

  • assume some form of conditional independence

  • estimate parameters of P(D|h), P(h) directly from training data

  • use bayes rule to calculate P(h|D)

why not learn P(h|D) or decision boundary directly?

discriminative (logistic regression)

  • assume some functional form for P(h|D) or for the decision boundary

  • estimate parameters of P(h|D) directly from training data

Naïve Bayes:

  • Y_NB = argmax_h P(D|h) · P(h)

logistic regression:

  • Y_LR = argmax_h P(h|D)

3

learning a LR classifier

given n input-output pairs:

  1. a feature representation of the input: for each input observation x, a vector of features [x1, x2, …, xd]

  2. a classification function that computes ŷ, the estimated class, via P(y|x), using the sigmoid or softmax function

  3. an objective function for learning, e.g. cross-entropy loss

  4. an algorithm for optimising the objective function, e.g. stochastic gradient ascent/descent

4

LR

LR assumes the following functional form for P(y|x):

P(y=1|x) = 1 / (1 + e^(−(∑_j wj·xj + b)))

→ predict y = 1 when ∑_j wj·xj + b > 0 (i.e. when P(y=1|x) > 0.5), so the decision boundary is linear

logistic regression = linear classifier

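a minimal sketch of this functional form in Python (the function names are illustrative, not from the original deck):

```python
import math

def predict_proba(x, w, b):
    """P(y=1|x) = 1 / (1 + e^(-(sum_j wj*xj + b))): the sigmoid of a
    weighted sum of the features plus a bias."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Linear decision rule: predict class 1 exactly when w.x + b > 0,
    which is the same as P(y=1|x) > 0.5."""
    return 1 if predict_proba(x, w, b) > 0.5 else 0
```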
5

LR example - sentiment classification

e.g. doing binary sentiment classification on movie review text + we’d like to know whether to assign the sentiment class +ve = 1 or -ve = 0 to the following review:

Let’s assume for the moment that we’ve already learned a real-valued weight for each of these features, and that the 6 weights corresponding to the 6 features are [2.5, −5.0, −1.2, 0.5, 2.0, 0.7], while b = 0.1.

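the review text and its feature definitions are not reproduced in this card, so the feature values below are assumed purely for illustration (e.g. counts of positive/negative lexicon words); only the 6 weights and the bias b come from the card. a sketch of the computation:

```python
import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]  # the 6 learned weights (from the card)
b = 0.1                                # bias (from the card)

# Hypothetical feature values for one review (assumed, not from the card):
x = [3, 2, 1, 3, 0, 4.19]

z = sum(wj * xj for wj, xj in zip(w, x)) + b   # weighted sum + bias
p_positive = 1.0 / (1.0 + math.exp(-z))        # sigmoid -> P(y=1|x)
print(f"P(+|x) = {p_positive:.2f}")            # ~0.70 here, so classify as positive
```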
6

training LR

we’ll focus on binary classification

  • we parameterise (wj,b) as 𝜽:

p(y | x, θ) = e^(y·θᵀx) / (1 + e^(θᵀx))

  • check: y = 1 gives e^(θᵀx) / (1 + e^(θᵀx)) = 1 / (1 + e^(−θᵀx)), the sigmoid; y = 0 gives 1 / (1 + e^(θᵀx))

7

cross entropy loss

want to know how far the classifier output ŷ is from the true output y; this difference = L(ŷ,y)

with 2 discrete outcomes, we can express the probability as

  • P(y|x) = ŷ^y · (1−ŷ)^(1−y)

goal: maximise the probability of the correct label P(y|x)

maximise: P(y|x) = ŷ^y · (1−ŷ)^(1−y)

maximise: log P(y|x) = log( ŷ^y · (1−ŷ)^(1−y) )

maximise: log P(y|x) = y log ŷ + (1−y) log(1−ŷ)

  • want to minimise the cross-entropy loss = negative of the above; in terms of θᵀx:

L_CE(ŷ,y) = log(1 + e^(θᵀx)) − y·θᵀx

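a minimal sketch of the two equivalent forms of the loss (function names are illustrative); the closed form in z = θᵀx matches the expression above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(y_hat, y):
    """L_CE(y_hat, y) = -(y*log(y_hat) + (1-y)*log(1-y_hat)), y in {0, 1}."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cross_entropy_from_logit(z, y):
    """Same loss written in terms of z = theta^T x:
    L_CE = log(1 + e^z) - y*z."""
    return math.log(1 + math.exp(z)) - y * z

# The two forms agree (up to floating-point error):
z, y = 0.8, 1
assert abs(cross_entropy(sigmoid(z), y) - cross_entropy_from_logit(z, y)) < 1e-9
```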
8

minimising cross entropy loss

min L_CE(ŷ,y)

  • minimising the above = convex optimisation problem

  • convex function = global min → gradient descent

  • concave function = global max → gradient ascent

gradients:

  • gradient of a function = vector pointing in the direction of the greatest increase in the function

  • Gradient Ascent: find the gradient of the function at the current point and move in the same direction.

  • Gradient Descent: find the gradient of the function at the current point and move in the opposite direction.

9

gd for LR

• Let us represent ŷ = f(x; θ)

θ_{t+1} = θ_t − η · x · [1 / (1 + e^(−θᵀx)) − y]

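a sketch of one update step (note the bracketed term 1/(1+e^(−θᵀx)) − y is just ŷ − y; the bias is assumed absorbed into θ, as in the earlier card):

```python
import math

def sgd_step(theta, x, y, eta):
    """One gradient-descent step for logistic regression:
    theta <- theta - eta * x * (y_hat - y)."""
    z = sum(t * xj for t, xj in zip(theta, x))  # theta^T x
    y_hat = 1.0 / (1.0 + math.exp(-z))          # 1 / (1 + e^(-theta^T x))
    return [t - eta * xj * (y_hat - y) for t, xj in zip(theta, x)]
```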
10

learning rate

η is a hyperparameter.

  • Large η ⇒ fast convergence but larger residual error; also, possible oscillations.

  • Small η ⇒ slow convergence but small residual error.

11

Example – Sentiment Classification

12

understanding the sigmoid

large weights can lead to overfitting

penalising large weights can reduce overfitting

13

regularisation

used to avoid overfitting

the feature weights will attempt to fit the details of the training set perfectly, modelling even noisy data that just accidentally correlates with the class = overfitting

good model = generalises well from the training data to the unseen test set; a model that overfits will have poor generalisation

avoid overfitting = add a regularisation term R(θ) to the loss function

min L_reg(ŷ,y) = min( log(1 + e^(θᵀx)) − y·θᵀx + λ·R(θ) )

14

more regularisation

L2 regularisation → ridge regression

  • uses square of the L2 (euclidean) norm of the weight values

R(θ) = ||θ||₂² = ∑_j θj²

min L_reg(ŷ,y) = min( log(1 + e^(θᵀx)) − y·θᵀx + λ·∑_j θj² )

L1 regularisation → lasso regression

  • uses the L1 norm (Manhattan distance) of the weight values

R(θ) = ||θ||₁ = ∑_j |θj|

min L_reg(ŷ,y) = min( log(1 + e^(θᵀx)) − y·θᵀx + λ·∑_j |θj| )

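a sketch of the regularised loss for a single instance, with the penalty switchable between L2 (ridge) and L1 (lasso); names are illustrative:

```python
import math

def regularised_loss(theta, x, y, lam, penalty="l2"):
    """log(1 + e^(theta^T x)) - y*theta^T x + lambda*R(theta)."""
    z = sum(t * xj for t, xj in zip(theta, x))
    ce = math.log(1 + math.exp(z)) - y * z
    if penalty == "l2":
        r = sum(t * t for t in theta)       # R(theta) = sum_j theta_j^2
    else:
        r = sum(abs(t) for t in theta)      # R(theta) = sum_j |theta_j|
    return ce + lam * r
```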
15

batch training

stochastic g.d. chooses a single random example at a time, moving the weights so as to improve performance on that single example

= choppy movements, so it is common to compute the gradient over batches of training instances rather than a single instance

training data: {x_i, y_i}, i = 1…n, where x_i = (x_i1, x_i2, …, x_id), n is the total number of instances in a batch and d is the dimension of an instance

for each weight j: θ_{j,t+1} = θ_{j,t} − (η/n) · ∑_i x_ij · [1 / (1 + e^(−θᵀx_i)) − y_i]

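a sketch of one mini-batch step, averaging the per-example gradients from the card-9 update over the n instances in the batch:

```python
import math

def batch_step(theta, X, Y, eta):
    """theta_j <- theta_j - (eta/n) * sum_i x_ij * (sigmoid(theta^T x_i) - y_i)."""
    n, d = len(X), len(theta)
    grad = [0.0] * d
    for x_i, y_i in zip(X, Y):
        z = sum(t * xj for t, xj in zip(theta, x_i))   # theta^T x_i
        err = 1.0 / (1.0 + math.exp(-z)) - y_i         # y_hat_i - y_i
        for j in range(d):
            grad[j] += x_i[j] * err
    return [t - (eta / n) * g for t, g in zip(theta, grad)]
```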
16

example - spam recognition

17

training phase

1st - calculate e^(−θᵀx) for every example in the data set

2nd - calculate ∑_i x_ij · [1 / (1 + e^(−θᵀx_i)) − y_i] for every weight θ_j, summing over every example in the data set

3rd - compute every θ_{t+1}

18

training phase (cont.)

19

testing phase

20

ROC curve

receiver operating characteristic curve = graphical plot that illustrates the performance of a binary classifier model

  • ROC curve = plot of the true positive rate against the false positive rate

TPR:

  • tp / (tp+fn)

FPR:

  • fp/ (fp+tn)

Classification threshold is used to convert the output of a probabilistic classifier into class labels.

  • The threshold determines the minimum probability required for a positive class.

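a sketch of computing TPR and FPR at one classification threshold (sweeping the threshold from 1 down to 0 traces out the ROC curve); assumes both classes occur in the labels:

```python
def tpr_fpr(probs, labels, threshold=0.5):
    """TPR = tp/(tp+fn), FPR = fp/(fp+tn) after converting probabilities
    into class labels at the given threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0   # minimum probability for positive
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp / (tp + fn), fp / (fp + tn)
```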
21

TPR/FPR

22

AUC

area under the ROC curve provides an aggregate measure of performance across all classification thresholds

between 2 ROC curves plotted from 2 learning models, the model w/ the higher AUC learned better than the other

23

Model Selection

Adopting the best algorithm and model for a specific dataset by assessing and comparing different models to identify the one with the best results.

24

AIC/BIC

akaike info criterion + bayesian info criterion compare diff models to choose the 1 that best fits the data

  • goal of both AIC + BIC = balance the goodness of fit of the model w/ its complexity in order to avoid overfitting or underfitting

both AIC + BIC penalise models w/ a large no. of parameters relative to the size of the data, but BIC penalises more severely

min AIC = 2m − 2 log L

min BIC = m log n − 2 log L

  • where m = no of model parameters, n = no of data points + L = max likelihood of the model
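the two criteria as code, a direct transcription of the formulas above:

```python
import math

def aic(m, log_l):
    """AIC = 2m - 2 log L (m = number of model parameters)."""
    return 2 * m - 2 * log_l

def bic(m, n, log_l):
    """BIC = m log n - 2 log L (n = number of data points)."""
    return m * math.log(n) - 2 * log_l

# Lower is better for both. BIC's m*log(n) penalty exceeds AIC's 2m
# whenever n > e^2 (about 7.4 data points), hence the harsher penalty.
```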

25

Multinomial Logistic Regression

loss function for multinomial LR generalises loss function for binary LR from 2 to K classes

true label y = vector with K elements, each corresponding to a class, with y_c = 1 if the correct class is c and all other elements of y being 0

the classifier will produce an estimate vector ŷ with K elements, each element ŷ_k of which represents the estimated probability P(y_k = 1|x)

L_CE(ŷ,y) = −∑_k y_k log ŷ_k

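a sketch of the softmax classifier output and the multinomial cross-entropy loss (with one-hot y, the sum reduces to −log of the probability of the correct class):

```python
import math

def softmax(scores):
    """Turn K real-valued scores into a probability distribution over K classes."""
    exps = [math.exp(s - max(scores)) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

def multinomial_ce(y_hat, y):
    """L_CE(y_hat, y) = -sum_k y_k * log(y_hat_k)."""
    return -sum(yk * math.log(yhk) for yk, yhk in zip(y, y_hat))

y_hat = softmax([2.0, 1.0, 0.1])         # estimated P(y_k = 1 | x) for K = 3
print(multinomial_ce(y_hat, [1, 0, 0]))  # loss when the correct class is class 0
```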
26

conclusion

Primarily used to estimate the probability of a specific outcome.

  • Is a discriminative learning model.

  • Is a linear classifier.

  • Optimizes by minimizing the cross-entropy loss via gradient descent.

  • Trains parameters:

    • Begins with an initial weight vector.

    • Modifies it iteratively to minimize the loss function.