Linear Models

36 Terms

1
New cards

Linear Models

Linear models are parametric models characterized by prediction functions of the form f(x) = wᵀx + b, i.e. linear in the parameters.

2
New cards

Classification with Linear Models

When using a linear model for classification, we are interested in which side of the line f(x) = c the target appears on.

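A minimal sketch of this decision rule in NumPy (the weights w, bias b, and threshold c are placeholder assumptions, not values from this deck):

```python
import numpy as np

def linear_classify(X, w, b, c=0.0):
    """Label each row of X by which side of f(x) = c its linear score falls on."""
    scores = X @ w + b                # f(x) = w^T x + b for every example
    return (scores >= c).astype(int)  # 1 if on/above the boundary, else 0
```
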
3
New cards

Regression with Linear Models

When using a linear model for regression, we simply output the value of y = wᵀx + b.

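A correspondingly small regression sketch (the toy weights and inputs below are made up for illustration):

```python
import numpy as np

def linear_predict(X, w, b):
    """Return y = w^T x + b for each row x of X."""
    return X @ w + b

X = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([0.5, -1.0])
print(linear_predict(X, w, b=0.1))   # [-1.4 -2.4]
```
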
4
New cards

Why linear models?

  • Even if there is no linear relationship (or decision boundary), often possible to project data to a space in which relationship is linear.

  • If you have very little data, you need to make strong assumptions to fit a model. Low variance.

  • Occam’s razor (principle of parsimony). Simpler models that explain the data are better than more complicated ones.

  • Possible to build much more complex models using linear models as a building block. This is the basis of neural networks and deep learning.

5
New cards

Linear Regression (Ordinary Least Squares)

In linear regression, we would like to find parameters {w, b} such that the resulting line f(x) fits the training data well.

6
New cards

Loss (Cost) Function

A loss function measures how wrong a model’s prediction is for a single data point.

7
New cards

Normal Equations

The Normal Equations give a direct solution for finding the best-fit parameters in Linear Regression by minimizing Mean Squared Error.
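
In the usual formulation this closed-form solution is w = (XᵀX)⁻¹Xᵀy (with a column of ones appended to absorb the bias). A minimal NumPy sketch under that assumption:

```python
import numpy as np

def fit_normal_equations(X, y):
    """Solve (X^T X) w = X^T y for the MSE-minimizing weights and bias."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column of ones
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)    # solve rather than invert for stability
    return theta[:-1], theta[-1]                    # (weights, bias)
```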

8
New cards

Gradient Descent

An iterative optimization algorithm used to minimize a loss (cost) function by repeatedly moving in the direction of steepest decrease.

9
New cards

Gradient Descent: The Algorithm

  1. Start with a random guess for the initial parameters θ0

  2. While not converged:

    • Set θt+1 = θt − αt ∇θ L(θt) (see the sketch below)
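
A minimal gradient-descent sketch for linear regression with mean squared error (the learning rate, iteration cap, and tolerance are illustrative defaults):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, max_iters=1000, tol=1e-6):
    """Repeatedly step opposite the MSE gradient until the parameter change is tiny."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0                       # initial guess for the parameters
    for _ in range(max_iters):
        resid = X @ w + b - y                     # full pass over the dataset
        grad_w = 2.0 / n * (X.T @ resid)          # dL/dw for mean squared error
        grad_b = 2.0 / n * resid.sum()            # dL/db
        w, b, w_old, b_old = w - lr * grad_w, b - lr * grad_b, w, b
        if np.abs(w - w_old).max() < tol and abs(b - b_old) < tol:
            break                                 # change in parameters is small
    return w, b
```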

10
New cards

Gradient Descent: When to Decide Convergence

The value of αt is called the learning rate. Decide convergence when:

  • The change in θ is small |θt+1 − θt| < T

  • The maximum number of iterations is reached.

  • Loss stops improving.

11
New cards

Properties of Gradient Descent

The behaviour of gradient descent depends on the learning rate αt:

  • Gradient descent will converge on convex functions as long as the learning rate αt keeps getting smaller over time.

  • Gradient descent may be very slow if αt is too small.

  • Gradient descent may diverge if αt is too large.

Gradient descent requires a full pass over the dataset each time it updates the parameters (weights and biases). Slow for large datasets.

12
New cards

Stochastic Gradient Descent (SGD)

Instead of computing the gradient of the loss for the full dataset, estimate it using a single randomly chosen example and then take step in that direction.
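
A minimal SGD sketch for the same linear-regression loss, updating from one randomly chosen example at a time (hyperparameters are illustrative):

```python
import numpy as np

def sgd(X, y, lr=0.01, n_epochs=10, seed=0):
    """Update the parameters from one shuffled example at a time."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        for i in rng.permutation(n):      # shuffle so gradient estimates stay unbiased
            err = X[i] @ w + b - y[i]     # error on a single example
            w -= lr * 2.0 * err * X[i]    # noisy estimate of the full MSE gradient
            b -= lr * 2.0 * err
    return w, b
```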

13
New cards

Properties of SGD

  • Usually much faster than batch gradient descent, since we update the weights every iteration instead of every epoch.

  • Can operate on large datasets, since we only need to process a single example at a time.

  • Important to shuffle training data to ensure the gradient estimates are unbiased (i.i.d.)

14
New cards

Cons of SGD

  • Gradient updates are noisy, since it is only an estimate of the full gradient.

  • Convergence may be slower as you near the optimum solution.

15
New cards

Assumptions of Linear Regression

  • Linear relationship between the variates and covariates

  • All other variation can be modelled as zero-mean Gaussian noise

If these assumptions are violated, then linear regression may be a bad model.

16
New cards

Implications of Linear Regression

  • Linear regression is sensitive to extreme outliers.

  • Linear regression may not be appropriate when the noise is not Gaussian.

  • Linear regression may not give good results if the relationship between the variates and covariates is not approximately linear.

17
New cards

Is it possible to use our linear regression model to do classification?

Sure, but it probably won’t work very well.

  • The relationship between the variates and covariates is not directly linear

  • We would like our outputs to be binary variables in {0, 1} (or probabilities in [0, 1]), but linear regression produces arbitrary real numbers. We would need to do some post-processing to map the values to the desired range.

  • Assumption of Gaussian noise is not true for binary outputs. So square error is inappropriate.

18
New cards

Logistic Regression

A supervised learning algorithm used for binary classification problems, where the output is either 0 or 1. It works by applying a linear model to the input features and then passing the result through the sigmoid function, which converts the output into a probability between 0 and 1. The model predicts class 1 if the probability is greater than or equal to 0.5, otherwise it predicts class 0.

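A minimal sketch of the prediction path described above (the weights and threshold are assumed placeholders):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real score into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(X, w, b, threshold=0.5):
    """Linear score -> sigmoid probability -> class 1 if the probability >= threshold."""
    p = sigmoid(X @ w + b)
    return (p >= threshold).astype(int), p
```
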
19
New cards

Components needed to specify the logistic regression algorithm?

  1. An activation function (transfer function): the logistic sigmoid

  2. An appropriate loss function (cross-entropy) derived using maximum likelihood

20
New cards

Logistic Regression: Loss Functions

To fit the parameters we would also like to use a more appropriate loss function than square loss:

  • Probability Distribution

  • Binary Cross Entropy Loss (see the sketch below)
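
A minimal binary cross-entropy sketch, assuming y_true holds 0/1 labels and p_pred holds sigmoid outputs:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of the labels under the predicted probabilities."""
    p = np.clip(p_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```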

21
New cards

Gradient Descent for Logistic Regression

Perform gradient descent (or SGD) using the usual update rule to find good parameters.

22
New cards

Linear regression: Fitting Polynomials

It is possible to use linear regression for more than just fitting lines! The function only needs to be linear in the parameters.
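
A small sketch of the idea: build polynomial features and reuse ordinary least squares (the degree and toy data are assumptions for illustration):

```python
import numpy as np

def polynomial_features(x, degree):
    """Map a 1-D input to [x, x^2, ..., x^degree]; the model stays linear in the weights."""
    return np.column_stack([x ** k for k in range(1, degree + 1)])

x = np.linspace(-1.0, 1.0, 50)
y = 0.5 * x**3 - x + 0.1 * np.random.default_rng(0).normal(size=x.shape)
Phi = np.hstack([polynomial_features(x, degree=3), np.ones((len(x), 1))])
coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # same least-squares machinery as for lines
```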

23
New cards

! Feature Mapping Function φ(x)

A function that maps the input x to a new set of features φ(x); the model f(x) = wᵀφ(x) + b is then still linear in the parameters, even if the relationship with the raw inputs is not linear.

24
New cards

Feature Engineering

Good features are often the key to good generalization performance.

  • Low dimensional: less likely for data to be linearly separable

  • High dimensional: possibly more prone to overfitting

Process of designing such functions by hand is called feature engineering, but it is notoriously difficult.

25
New cards

! Feature Learning

Learn features from the data directly.

26
New cards

Overfitting

If the model has too many degrees of freedom, you can end up fitting not only the patterns of interest, but also the noise.

  • This leads to poor generalization: the model fits the training data well but does poorly on unseen data

  • Can happen when the model has too many parameters (and too little data to train on).

27
New cards

Curse of Dimensionality

  • Estimation: more parameters to estimate (risk of overfitting).

  • Sampling: exponential increase in volume of space.

  • Optimization: slower, larger space to search.

  • Distances: everything is far away.

  • Harder to model the data distribution P(x1, x2, ..., xD)

  • Exponentially harder to compute integrals.

  • Geometric intuitions break down.

  • Difficult to visualize

28
New cards

Blessings of Dimensionality

Linear separability: easier to separate things in high-D using hyperplanes

29
New cards

! Dimensionality

30
New cards

! Multiple Regression

31
New cards

Multi-Class Classification

Logistic regression is limited to binary outputs {0, 1}. Often we want to do multi-class classification, where the output is one of K classes {1, 2, ..., K}.

32
New cards

How do we model Multi-Class Classification?

Two approaches:

  1. One-vs-rest (OVR): aka one-vs-all

  2. Softmax regression

33
New cards

! One-vs-Rest (OVR)

Idea: train K separate binary classifiers.
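
A minimal sketch of the OVR prediction step, assuming each binary classifier is given as a (w, b) pair trained on class k vs. the rest:

```python
import numpy as np

def ovr_predict(X, classifiers):
    """Score all K binary classifiers and return the index of the most confident one."""
    scores = np.column_stack([X @ w + b for w, b in classifiers])
    return np.argmax(scores, axis=1)   # predicted class in {0, ..., K-1}
```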

34
New cards

! Softmax regression

Direct extension of logistic regression to the multi-class case.

35
New cards

! Loss function for Softmax regression

categorical cross entropy
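
A minimal sketch of softmax probabilities and the categorical cross-entropy loss (here Z is an assumed matrix of K linear scores per example and Y_onehot the one-hot labels):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax: turn K scores into a probability distribution over K classes."""
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def categorical_cross_entropy(Y_onehot, P, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    return -np.mean(np.sum(Y_onehot * np.log(np.clip(P, eps, 1.0)), axis=1))
```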

36
New cards

Regularisation