generative classifiers + discriminative
if distinguishing cat from dog images = generative classifier
build a model of what’s in a cat image
whiskers, eyes, ears
assigns prob to any image to determine how cat-like is that image?
similarly, build model of what’s in dog image
now given new image, run both + see which fits better
if distinguishing cat from dog images = discriminative classifier
just try to distinguish dogs from cats: oh look, dogs have collars - let’s ignore everything else
generative vs discriminative classifiers
generative (naive bayes)
assume some form of conditional independence
estimate parameters of P(D|h), P(h) directly from training data
use bayes rule to calculate P(h|D)
why not learn P(h|D) or decision boundary directly?
discriminative (logistic regression)
assume some functional form for P(h|D) or for the decision boundary
estimate parameters of P(h|D) directly from training data
Naïve Bayes:
Y_NB = argmax_h P(D|h) · P(h)
logistic regression:
- Y_LR = argmax_h P(h|D)
learning a LR classifier
given n input-output pairs:
a feature representation of the input: for each input observation x, a vector of features [x_1, x_2, …, x_d]
a classification function that computes y, the estimated class via P(y|x), using the sigmoid or softmax functions
an objective function for learning, e.g. cross-entropy loss
an algo for optimising the objective function - stochastic gradient ascent/descent
LR
LR assumes the following functional form for P(y|x):
P(y=1|x) = 1 / (1 + e^(−(∑_j w_j x_j + b)))
decision rule: predict y = 1 when P(y=1|x) > 0.5, i.e. when ∑_j w_j x_j + b > 0
logistic regression = linear classifier (the decision boundary is linear in x)
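a minimal numpy sketch of this functional form, assuming a weight vector w and bias b have already been learned (the names here are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """P(y=1|x) = sigmoid(sum_j w_j * x_j + b)."""
    return sigmoid(np.dot(w, x) + b)

def predict_label(x, w, b):
    """With a 0.5 threshold this is equivalent to checking w.x + b > 0."""
    return int(predict_proba(x, w, b) > 0.5)
```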
LR example - sentiment classification
e.g. doing binary sentiment classification on movie review text + we’d like to know whether to assign the sentiment class +ve = 1 or -ve = 0 to a given review:
Let’s assume for the moment that we’ve already learned a real-valued weight for each of these features, and that the 6 weights corresponding to the 6 features are [2.5, −5.0, −1.2, 0.5, 2.0, 0.7], while b = 0.1.
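a worked version of that computation: the weights and bias below are the ones quoted above, while the feature vector x is made up purely to illustrate the arithmetic (the actual feature counts for the review are not recorded in these notes):

```python
import numpy as np

w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])   # weights from the example
b = 0.1                                          # bias from the example
x = np.array([3.0, 2.0, 1.0, 3.0, 0.0, 4.19])    # hypothetical feature values

z = np.dot(w, x) + b                 # weighted sum of features plus bias
p_pos = 1.0 / (1.0 + np.exp(-z))     # sigmoid gives P(y = 1 | x)
print(f"z = {z:.3f}, P(+|x) = {p_pos:.2f}")  # ~0.70 here, so predict positive
```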
training LR
we’ll focus on binary classification
we parameterise (w_j, b) as θ (b can be folded into θ by appending a constant feature 1 to x):
p(y|x, θ) = e^(y·θᵀx) / (1 + e^(θᵀx))
cross entropy loss
we want to know how far the classifier output ŷ is from the true output y; this difference = L(ŷ, y)
with 2 discrete outcomes (y ∈ {0, 1}), the probability can be expressed as
P(y|x) = ŷ^y · (1 − ŷ)^(1−y)
goal: maximise the probability of the correct label P(y|x)
maximise: P(y|x) = ŷ^y · (1 − ŷ)^(1−y)
maximise: log P(y|x) = log(ŷ^y · (1 − ŷ)^(1−y))
maximise: log P(y|x) = y log ŷ + (1 − y) log(1 − ŷ)
equivalently, minimise the cross-entropy loss (the negative log likelihood), which for ŷ = 1/(1 + e^(−θᵀx)) works out to
L_CE(ŷ, y) = log(1 + e^(θᵀx)) − y·θᵀx
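a small sketch of this loss in numpy; np.logaddexp(0, z) computes log(1 + e^z) without overflow, which is the only non-obvious step:

```python
import numpy as np

def cross_entropy_loss(theta, x, y):
    """L_CE(y_hat, y) = log(1 + exp(theta^T x)) - y * theta^T x, for y in {0, 1}."""
    z = np.dot(theta, x)
    return np.logaddexp(0.0, z) - y * z  # stable form of log(1 + e^z) - y*z
```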
minimising cross entropy loss
min L_CE(ŷ, y)
minimising the above = convex optimisation problem
convex function = single global minimum → gradient descent
concave function = single global maximum → gradient ascent
gradients:
gradient of function = vector pointing in direction of the greatest increase in a function
gradient ascent: find the gradient of the function at the current point and move in the same direction
gradient descent: find the gradient of the function at the current point and move in the opposite direction
gd for LR
let us represent ŷ = f(x; θ)
θ_(t+1) = θ_t − η · x [1/(1 + e^(−θᵀx)) − y]
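a sketch of one such update in numpy, assuming the bias is folded into θ via a constant feature; lr plays the role of η:

```python
import numpy as np

def sgd_step(theta, x, y, lr=0.1):
    """One gradient descent update on a single example (x, y in {0, 1})."""
    y_hat = 1.0 / (1.0 + np.exp(-np.dot(theta, x)))  # predicted P(y=1|x)
    return theta - lr * x * (y_hat - y)              # gradient of L_CE is x*(y_hat - y)
```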
learning rate
𝜂 is a hyperparameter.
large η ⇒ fast convergence but larger residual error; also possible oscillations
small η ⇒ slow convergence but smaller residual error
understanding the sigmoid
very large weights are a sign of overfitting
penalising large weights can reduce overfitting
regularisation
used to avoid overfitting
the feature weights will attempt to perfectly fit details of the training set, modelling even noisy data that just accidentally correlates with the class = overfitting
a good model generalises well from the training data to the unseen test set; a model that overfits has poor generalisation
to avoid overfitting, add a regularisation term R(θ) to the loss function:
min L_reg(ŷ, y) = min( log(1 + e^(θᵀx)) − y·θᵀx + λ·R(θ) )
more regularisation
L2 regularisation → ridge regression
uses square of the L2 (euclidean) norm of the weight values
R(θ) = ||θ||₂² = ∑_j θ_j²
min L_reg(ŷ, y) = min( log(1 + e^(θᵀx)) − y·θᵀx + λ·∑_j θ_j² )
L1 regularisation → lasso regression
uses L1 norm (Manhattan distance) of the weight values
R(θ) = ||θ||₁ = ∑_j |θ_j|
min L_reg(ŷ, y) = min( log(1 + e^(θᵀx)) − y·θᵀx + λ·∑_j |θ_j| )
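a sketch of the regularised loss for a single example, with a penalty argument selecting between the two norms above; the λ value and the choice to regularise every component of θ (including any bias term) are illustrative assumptions:

```python
import numpy as np

def regularized_loss(theta, x, y, lam=0.01, penalty="l2"):
    """Cross-entropy loss plus lambda * R(theta) for a single example."""
    z = np.dot(theta, x)
    ce = np.logaddexp(0.0, z) - y * z     # plain cross-entropy loss
    if penalty == "l2":
        r = np.sum(theta ** 2)            # ridge: sum of squared weights
    else:
        r = np.sum(np.abs(theta))         # lasso: sum of absolute weights
    return ce + lam * r
```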
batch training
stochastic g.d. chooses a single random example at a time, moving the weights so as to improve performance on that single example
→ choppy movements, so it is common to compute the gradient over a batch of training instances rather than over a single instance
training data: {x_i, y_i}, i = 1…n, where x_i = (x_i1, x_i2, …, x_id), n is the total number of instances in a batch and d is the dimension of an instance
θ_(t+1) = θ_t − (η/n) · ∑_i x_i [1/(1 + e^(−θᵀx_i)) − y_i]
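a sketch of that batch update, assuming a design matrix X of shape (n, d) and binary labels y:

```python
import numpy as np

def batch_gd_step(theta, X, y, lr=0.1):
    """theta <- theta - (lr / n) * sum_i x_i * (sigmoid(theta^T x_i) - y_i)."""
    n = X.shape[0]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))  # predictions for the whole batch
    grad = X.T @ (y_hat - y) / n                # average gradient over the batch
    return theta - lr * grad
```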
example - spam recognition
training phase
1st - calculate −θᵀx_i for every example in the data set
2nd - calculate ∑_i x_ij [1/(1 + e^(−θᵀx_i)) − y_i] for every parameter θ_j
3rd - compute every θ_(t+1)
testing phase
apply the learned θ to a new email: compute P(spam|x) = 1/(1 + e^(−θᵀx)) and compare it with the classification threshold
ROC curve
receiver operating characteristic curve = graphical plot that illustrates the performance of a binary classifier model
ROC curve = plot of the true positive rate against the false positive rate
TPR = tp / (tp + fn)
FPR = fp / (fp + tn)
a classification threshold is used to convert the output of a probabilistic classifier into class labels
the threshold determines the minimum probability required for the positive class
AUC
area under the ROC curve provides an aggregate measure of performance across all classification thresholds
given the ROC curves of 2 learning models, the model with the higher AUC has learned better than the other
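a self-contained sketch of sweeping the threshold to trace the ROC curve and approximating AUC with the trapezoidal rule; the labels and scores are toy values invented for illustration:

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """Return (FPR, TPR) pairs, one per classification threshold."""
    points = []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return sorted(points)

# Toy labels and classifier scores (made up for illustration).
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])
fpr, tpr = map(np.array, zip(*roc_points(y_true, scores, np.linspace(0, 1, 101))))
auc = np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2)  # trapezoidal rule
print(f"AUC ≈ {auc:.2f}")
```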
Model Selection
Adopting the best algorithm and model for a specific dataset by assessing and comparing different models to identify the one with the best results.
AIC/BIC
akaike info criterion + bayesian info criterion compare different models to choose the one that best fits the data
goal of both AIC + BIC = balance the goodness of fit of the model with its complexity in order to avoid overfitting or underfitting
both AIC + BIC penalise models with a large number of parameters relative to the size of the data, but BIC penalises more severely
minimise AIC = 2m − 2 log L
minimise BIC = m log n − 2 log L
where m = number of model parameters, n = number of data points and L = maximised likelihood of the model
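a tiny sketch of applying these formulas, with purely hypothetical parameter counts and log-likelihoods:

```python
import numpy as np

def aic(m, log_l):
    """AIC = 2m - 2 log L (lower is better)."""
    return 2 * m - 2 * log_l

def bic(m, n, log_l):
    """BIC = m log n - 2 log L (penalises extra parameters more as n grows)."""
    return m * np.log(n) - 2 * log_l

# Hypothetical comparison: a 5-parameter model vs a 20-parameter model on n = 1000 points.
print("AIC:", aic(5, -120.0), "vs", aic(20, -110.0))
print("BIC:", bic(5, 1000, -120.0), "vs", bic(20, 1000, -110.0))
```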
Multinomial Logistic Regression
loss function for multinomial LR generalises loss function for binary LR from 2 to K classes
true label y = vector with K elements, one per class, with y_c = 1 if the correct class is c and all other elements of y being 0
the classifier produces an estimate vector ŷ with K elements, each element ŷ_k of which represents the estimated probability P(y_k = 1|x)
L_CE(ŷ, y) = −∑_k y_k log ŷ_k
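a short numpy sketch of this loss, using the softmax to produce ŷ from raw class scores; the scores and one-hot label are illustrative:

```python
import numpy as np

def softmax(z):
    """Stable softmax: subtract the max score before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def multinomial_ce_loss(y_onehot, z):
    """L_CE(y_hat, y) = -sum_k y_k * log(y_hat_k), with y_hat = softmax(z)."""
    y_hat = softmax(z)
    return -np.sum(y_onehot * np.log(y_hat + 1e-12))  # epsilon avoids log(0)

# Example: 3 classes, correct class is index 1, raw scores z_k = theta_k^T x.
print(multinomial_ce_loss(np.array([0.0, 1.0, 0.0]), np.array([0.5, 2.0, -1.0])))
```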
conclusion
• Primarily used to estimate the probability of a specific outcome.
• Is a discriminative learning model.
• Is a linear classifier.
• Optimizes by minimizing the cross-entropy loss via gradient descent.
• Trains parameters:
- begins with an initial weight vector
- modifies it iteratively to minimize the loss function