Difference between discriminative and generative algorithms?
Discriminative models try to learn p(y|x) directly, or a mapping from the input space to the label space {0, 1}.
Generative models try to model p(x|y) and p(y).
How to get p(y|x) from p(x|y) and p(y)
Bayes' rule, where p(x) is also calculable from what we know:
p(x) = p(x|y=1)p(y=1) + p(x|y=0)p(y=0)
Also, we don't actually need to calculate p(x) when making a prediction, since we can just take the arg max over y of p(x|y)p(y).
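For illustration, a minimal Python sketch of this prediction rule. The 1-D Gaussian class-conditionals and all parameter values are hypothetical, chosen only to show that p(x) never needs to be computed:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # density of a 1-D Normal(mu, sigma^2) at x
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def predict(x, prior=0.5, mu0=-1.0, mu1=1.0, sigma=1.0):
    # p(x) cancels in the comparison, so we only compare p(x|y) p(y)
    score0 = gaussian_pdf(x, mu0, sigma) * (1 - prior)  # p(x|y=0) p(y=0)
    score1 = gaussian_pdf(x, mu1, sigma) * prior        # p(x|y=1) p(y=1)
    return int(score1 > score0)                         # arg max over y in {0, 1}
```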
Motivation for feature maps
We sometimes need a more expressive family of models than linear models.
Feature maps phi allow us to view, for example, a cubic function of x as a linear function over the variables phi(x).
LMS with features
The exact same update rule, but with every x(i) replaced by phi(x(i)).
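A minimal sketch of the resulting stochastic-gradient update, assuming a hypothetical cubic feature map for scalar inputs (names and hyperparameters are illustrative):

```python
import numpy as np

def phi(x):
    # hypothetical cubic feature map for a scalar x
    return np.array([1.0, x, x**2, x**3])

def lms_with_features(xs, ys, alpha=0.01, iters=1000):
    """Ordinary LMS, just run on phi(x) instead of x."""
    theta = np.zeros(len(phi(xs[0])))
    for _ in range(iters):
        for x, y in zip(xs, ys):
            # theta := theta + alpha * (y - theta^T phi(x)) * phi(x)
            theta += alpha * (y - theta @ phi(x)) * phi(x)
    return theta
```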
Benefits of kernel trick
no need to store large theta explicitly
significant runtime improvement
When can theta be represented as a linear combination of the vectors phi(x(1)), … phi(x(n))?
At any point during training, provided theta is initialised to 0; this is shown by induction over the gradient-descent updates.
Explain:
We assume that at some point, theta can be written as theta = beta_1 phi(x(1)) + ... + beta_n phi(x(n)), for some beta_1, ..., beta_n.
We then claim that after the next update, theta is still a linear combination of phi(x(1)), ..., phi(x(n)), and rearrange the update so that theta is represented implicitly by the betas (see the worked equation below).
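A worked equation for that induction step (a sketch, using the beta and K notation from the surrounding cards): one batch gradient-descent update gives

```latex
\theta \;:=\; \theta + \alpha \sum_{i=1}^{n} \bigl(y^{(i)} - \theta^{\top}\phi(x^{(i)})\bigr)\,\phi(x^{(i)})
\;=\; \sum_{i=1}^{n} \Bigl(\beta_i + \alpha\bigl(y^{(i)} - \textstyle\sum_{j=1}^{n}\beta_j K(x^{(j)}, x^{(i)})\bigr)\Bigr)\,\phi(x^{(i)})
```

so each coefficient updates as beta_i := beta_i + alpha (y(i) - sum_j beta_j K(x(j), x(i))), and theta itself never has to be written out.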
Two important properties of using the betas
1) we can pre-compute the pairwise inner products of the phis, for all pairs.
2) computing phi inner products can be efficient and does not require explicitly computing the phi(.) values.
e.g. for the feature map phi(x) containing all monomials of x up to degree 3, the inner product of phi(x) and phi(z) is just 1 + <x, z> + <x, z>^2 + <x, z>^3, where <x, z> is the inner product of x and z.
This takes O(d) time, even though phi(x) itself has O(d^3) entries.
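A small sketch that checks this identity numerically; the explicit monomial map is built only to verify the O(d) kernel computation, and the test values are arbitrary:

```python
import itertools
import numpy as np

def poly3_kernel(x, z):
    """K(x, z) = 1 + <x,z> + <x,z>^2 + <x,z>^3, computed in O(d) time."""
    s = float(np.dot(x, z))
    return 1 + s + s**2 + s**3

def phi_explicit(x):
    """All monomials of x up to degree 3 (O(d^3) entries), for checking only."""
    d = len(x)
    feats = [1.0]
    for degree in (1, 2, 3):
        for idx in itertools.product(range(d), repeat=degree):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
assert np.isclose(poly3_kernel(x, z), phi_explicit(x) @ phi_explicit(z))
```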
Final kernel trick algorithm?
1) compute all values K(x(i), x(j)) for all 1 <= i, j <= n
2) set the initial beta to 0
3) loop the beta update formula (see the sketch below).
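A minimal Python sketch of these three steps, assuming batch updates and a generic kernel function (function names and hyperparameters are illustrative, not a fixed API):

```python
import numpy as np

def kernel_lms(X, y, kernel, alpha=0.01, iters=500):
    """Kernelized LMS: theta = sum_i beta_i phi(x(i)) is never formed explicitly."""
    y = np.asarray(y, dtype=float)
    n = len(X)
    # 1) pre-compute the Gram matrix K[i, j] = K(x(i), x(j))
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    # 2) start with beta = 0 (equivalently, theta = 0)
    beta = np.zeros(n)
    # 3) loop the update beta_i += alpha * (y(i) - sum_j beta_j K(x(i), x(j)))
    for _ in range(iters):
        beta += alpha * (y - K @ beta)
    return beta

def kernel_predict(beta, X_train, kernel, x_new):
    """theta^T phi(x_new) = sum_i beta_i K(x(i), x_new)."""
    return sum(b * kernel(xi, x_new) for b, xi in zip(beta, X_train))
```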
Time complexity for updating beta representation of theta?
O(n) per update of a single beta_i, given the pre-computed kernel values K(x(i), x(j)).
What can kernels kind of be thought of as?
A similarity metric.
if phi(x) and phi(z) are close together, we might expect K(x, z) to be quite large
What 2 things must a kernel (matrix) be?
symmetric
positive semidefinite (z^T K z >= 0 for all vectors z; equivalently, all eigenvalues are >= 0)
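A small numpy sketch of checking those two conditions on a given Gram matrix (the tolerance value is an arbitrary choice):

```python
import numpy as np

def is_valid_kernel_matrix(K, tol=1e-8):
    """Check the two necessary conditions on a Gram matrix K."""
    symmetric = np.allclose(K, K.T, atol=tol)
    if not symmetric:
        return False
    # positive semidefinite: z^T K z >= 0 for all z, i.e. no negative eigenvalues
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))
```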
What is the main assumption of Gaussian Discriminant Analysis GDA?
We assume p(x|y) is distributed multivariate normal
Parameters of GDA?
The class prior phi (the Bernoulli parameter of p(y)), the mean vectors mu_0 and mu_1, and a shared covariance matrix Sigma.
What are the three assumed distributions?
y ~ Bernoulli(phi)
x|y=0 ~ Normal(mu_0, Sigma)
x|y=1 ~ Normal(mu_1, Sigma)
Do we maximise the log likelihood to estimate phi, mu_0, mu_1, and Sigma?
Yes.
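A minimal numpy sketch of those closed-form maximum-likelihood estimates (array shapes and names are assumptions):

```python
import numpy as np

def gda_fit(X, y):
    """Closed-form MLEs for GDA with a shared covariance; X is (n, d), y is binary (n,)."""
    n, d = X.shape
    phi = np.mean(y == 1)                       # Bernoulli parameter of p(y)
    mu0 = X[y == 0].mean(axis=0)                # mean of class-0 examples
    mu1 = X[y == 1].mean(axis=0)                # mean of class-1 examples
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / n           # shared covariance matrix
    return phi, mu0, mu1, Sigma
```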
Compare GDA and logistic regression
Viewing p(y=1|x; phi, mu_0, mu_1, Sigma) as a function of x reveals exactly the form of logistic regression (which is a DISCRIMINATIVE algorithm).
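Sketch of that logistic form, obtained by taking the log-odds of the two Gaussian class-conditionals; theta and theta_0 below are functions of the GDA parameters:

```latex
p(y=1 \mid x;\ \phi, \mu_0, \mu_1, \Sigma) = \frac{1}{1 + \exp\!\bigl(-(\theta^{\top}x + \theta_0)\bigr)},
\qquad
\theta = \Sigma^{-1}(\mu_1 - \mu_0),
\quad
\theta_0 = \tfrac{1}{2}\bigl(\mu_0^{\top}\Sigma^{-1}\mu_0 - \mu_1^{\top}\Sigma^{-1}\mu_1\bigr) + \log\frac{\phi}{1-\phi}
```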
GDA implies log.reg., but the converse is not true. Therefore, GDA makes stronger assumptions about the data, and when these assumptions are correct, it will fit the data better, and need less data to do so.
Because logistic regression makes weaker assumptions, it is more robust and less sensitive to incorrect modelling assumptions. For example, if x|y ~ Poisson(lambda_y), then p(y|x) is still logistic, so log.reg. still works well, whereas GDA's Gaussian assumption is violated.
In practice, log.reg. is used more.