MACHINE LEARNING final


17 Terms

1
New cards

What is the goal of supervised learning?

To learn a mapping function f: R^p → R^k from labeled examples (xᵢ, yᵢ)

  • You give the model examples → it learns a rule → it predicts outputs for new, unseen inputs.

2
New cards

What are the 2 types of supervised learning?

  • Classification: predicts a categorical label

    • outputs are target classes/probabilities (disease yes/no, email spam/no spam)

  • Regression: predicts a continuous value

    • outputs are real values (temp, price)

3
New cards

What is linear regression?

Goal: find the best straight line that predicts the output from the input

  • prediction = slope·x + intercept (e.g. with slope 2 and intercept 1, the prediction for x = 3 is 2·3 + 1 = 7)

4
New cards

What is the matrix notation of linear regression?

y = Xw + e (e = error/noise term)

  • each row = one sample

  • each column = one feature

  • the first column is a column of ones (for the bias)

Prediction = weighted sum of features + bias
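
A minimal NumPy sketch (made-up numbers, not from the card) of the design matrix with a leading column of ones and the prediction y ≈ Xw:

```python
import numpy as np

# Made-up data: 3 samples, 2 features each
X_raw = np.array([[2.0, 1.0],
                  [3.0, 5.0],
                  [4.0, 2.0]])

# Design matrix: first column of ones so the first weight acts as the bias
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Hypothetical weights: w[0] = bias, w[1:] = feature weights
w = np.array([0.5, 2.0, -1.0])

# Prediction = weighted sum of features + bias, i.e. y_hat = X @ w
y_hat = X @ w
print(y_hat)  # one prediction per row/sample
```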

5
New cards

What is the connection between the cost function and linear regression?

By minimizing the cost function, you find the best weights w for linear regression

  • the cost function in matrix form, e.g. J(w) = (1/n)·||y - Xw||², measures how wrong the model is; smaller cost = better model (see the short sketch below)
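
A minimal NumPy sketch (made-up data, assumed squared-error cost) showing how the cost measures how wrong the weights are:

```python
import numpy as np

def mse_cost(X, y, w):
    """Squared-error cost in matrix form: J(w) = (1/n) * ||y - Xw||^2."""
    residual = y - X @ w
    return (residual @ residual) / len(y)

# Made-up data generated by y = 2x + 1, with a bias column of ones in X
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

print(mse_cost(X, y, np.array([1.0, 2.0])))  # perfect weights -> cost 0.0
print(mse_cost(X, y, np.array([0.0, 0.0])))  # bad weights -> large cost
```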

6
New cards

Why is the linear regression cost function a convex problem?

  • The curve is shaped like a bowl

  • There is one unique global minimum

  • This minimum gives you the best possible weights

  • No risk of getting stuck in a local minimum (unlike neural networks)

7
New cards

What is the neat exact solution of linear regression?

optimization - the analytical solution

  • this is a formula that gives you the best weights w in one single calculation, without looping → called the normal equation: w = (XᵀX)⁻¹Xᵀy

    • Instead of searching for the best line, this formula directly computes the best line.

When does it work?

Only when:

  1. XᵀX is invertible, and

  2. The number of features p is not huge.

If features are strongly correlated (e.g., height in cm and height in inches), you get multicollinearity, and XᵀX is not invertible.
→ Then you cannot use the formula.

When shouldn’t you use the formula?

  • If you have many features → matrix inverse is very slow

  • If the design matrix is not invertible

So in real ML:

We rarely use the analytical solution.
Instead, we use gradient descent, which works even when the matrix isn’t invertible and when p is huge.
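
A minimal NumPy sketch (made-up data) of the normal equation in action:

```python
import numpy as np

# Made-up data roughly following y = 2x + 1; first column of ones = bias term
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])

# Normal equation: w = (X^T X)^(-1) X^T y
# (only works when X^T X is invertible; np.linalg.pinv or lstsq is safer in practice)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # roughly [1.15, 1.94] = [intercept, slope]
```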

8
New cards

Describe the 4 steps of iterative optimization - gradient descent algorithm

  1. Start with random parameters

  • start somewhere random on the hill

  • initial model with bad weights

  2. Find the direction of steepest descent. To this end, calculate the slope (gradient) of the cost function at the current parameters

  3. Take one step downhill using the update rule w ← w - η·∇J(w), where η is the learning rate (step size/amount you move)

  • if the learning rate is too big = overshooting, too small = moves too slowly

  4. Repeat steps 2-3 until we reach the global minimum

  • works for huge datasets, even when the analytical solution is impossible (see the sketch below)
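
A minimal NumPy sketch (made-up data, an assumed learning rate and iteration count) of these four steps for linear regression:

```python
import numpy as np

# Made-up data following y = 2x + 1; first column of ones = bias term
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
n = len(y)

w = np.zeros(2)   # step 1: start with some (here, zero) parameters
eta = 0.05        # learning rate: step size of each move downhill

for _ in range(2000):                       # step 4: repeat until (near) the minimum
    grad = (2.0 / n) * X.T @ (X @ w - y)    # step 2: slope/gradient of the MSE cost
    w = w - eta * grad                      # step 3: one step downhill

print(w)  # close to [1.0, 2.0] = [intercept, slope]
```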

9
New cards

How can you estimate non-linear functions using linear regression?

Linear regression is only linear in the PARAMETERS (weights), not in the original features, so you can apply non-linear transformations to the inputs (e.g. x², log x) and still fit a linear model on the transformed features.

10
New cards

What is feature engineering?

Transform the original input x into new features that capture non-linear patterns. Then run linear regression on those transformed features.

  • Feature engineering = creating new, meaningful features from existing data so the model can learn patterns more easily.

  • Feature engineering is the process of transforming your raw data into better inputs (features) that make your model perform much better.

🧠 Why do we need feature engineering?

Because many relationships in the real world are non-linear, messy, or hidden. Raw data often does NOT directly show the patterns we want the model to learn. Feature engineering helps us expose those patterns.

11
New cards

What are examples of feature engineering?

  • polynomials: x²,x³

  • interaction terms: x1x2

  • logarithms: log(x)

  • trigonometric functions: sin(x), cos(x)

  • domain-specific features: ratios, counts

Linear Regression + Feature Engineering = very powerful model

  • However, it can result in overfitting to training data.

  • The relation between the complexity of the induced model and underfitting and overfitting is a crucial notion in data mining

Feature engineering increases model complexity: more features = more flexibility = better ability to fit complicated patterns, BUT a higher risk of overfitting (see the sketch below).
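
A minimal scikit-learn sketch (made-up data; assumes scikit-learn is installed) of feature engineering with polynomial features feeding a linear regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up non-linear data: y = x^2 plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.2, size=30)

# Engineered features (x, x^2) let a *linear* model fit a non-linear curve
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

print(model.predict(np.array([[2.0]])))  # close to 4.0
```

Raising the degree keeps lowering the training error, but past some point the extra features start fitting noise, which is exactly the overfitting risk mentioned above.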

12
New cards

What is under and overfitting?

  • underfitting: the induced model is not complex (flexible) enough to model data

    • too simple

    • predictions are bad on training and test data

  • overfitting: the induced model is too complex (flexible) to model data and tries to fit noise

    • weights become very large

    • very good on training data but bad on test data

13
New cards

What is bias-variance trade-off?

Model error can be decomposed into: bias² + variance + noise

  • bias = error from being too simple (underfitting)

  • variance = error from being too sensitive to training data (overfitting)

*high bias = too simple

*too complex = high variance

14
New cards

What is the solution to overfitting?

regularization

  • shrinks weights to prevent overfitting

  • Regularization = adding a penalty to the cost function to prevent the model from having huge coefficients → helps reduce overfitting and improves stability (see the penalized-cost sketch at the end of this card).

What regularization does:

  • Adds a penalty on large coefficients

  • This reduces variance

  • But slightly increases bias

  • Moves the model back toward the “just right” region in the bias–variance trade-off graph

👉 Underfitting is fixed by making the model more complex: add features, reduce regularization, or choose a more flexible model.
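
A minimal NumPy sketch (made-up data, an assumed penalty strength lam) of adding an L2 penalty to the cost:

```python
import numpy as np

def ridge_cost(X, y, w, lam):
    """Squared-error cost plus an L2 penalty: J(w) = (1/n)*||y - Xw||^2 + lam*||w||^2."""
    residual = y - X @ w
    return (residual @ residual) / len(y) + lam * (w @ w)

# Made-up data following y = 2x + 1, with a bias column of ones
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
w = np.array([1.0, 2.0])

# lam = 0 is the ordinary cost; a larger lam punishes large weights more strongly
print(ridge_cost(X, y, w, lam=0.0))  # 0.0: data term only
print(ridge_cost(X, y, w, lam=0.1))  # 0.5: same fit, plus the penalty on the weights
```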

15
New cards

What are the 3 methods for regularization in linear regression?

  1. Ridge (L2 penalization): Shrinks all coefficients smoothly

  2. Lasso (L1 penalization): Encourages sparsity in coefficients

  3. Elastic-net (L1+L2 penalization): Balances between ridge and Lasso

👉 A penalty is an extra cost added to the model when the weights become too large.

  • large weights = overly complicated model = overfitting.

Method      | Penalty | Effect                                 | Best Use
Ridge       | L2      | Shrinks all coefficients smoothly      | Multicollinearity, all features matter
Lasso       | L1      | Sets some coefficients exactly to zero | Feature selection, sparse models
Elastic Net | L1 + L2 | Mix of sparsity + stability            | Correlated features + many features
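
A minimal scikit-learn sketch (made-up data, assumed alpha values) comparing the three penalties:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Made-up data: 50 samples, 5 features, only the first two actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    # Lasso/Elastic Net tend to push the irrelevant coefficients to (near) exactly zero
    print(type(model).__name__, np.round(model.coef_, 2))
```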

16
New cards

What is logistic regression?

  • Logistic regression is a linear classifier.

  • It predicts the probability of a class (in binary: 0 or 1).

  • It uses input features x∈R^p

  • The model has:

    • w = weight vector

    • b = bias

First computes a linear score: z = w^T x + b

Then applies the sigmoid function to convert z → probability (see the sketch below):

  • σ(z) = 1 / (1 + e^(-z))
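
A minimal NumPy sketch (hypothetical weights, bias and input, not from the card) of the two-step computation:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)): squashes any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters and one input with p = 3 features
w = np.array([0.8, -1.2, 0.3])   # weight vector
b = -0.5                         # bias
x = np.array([1.0, 0.5, 2.0])

z = w @ x + b            # step 1: linear score z = w^T x + b
p = sigmoid(z)           # step 2: probability of class 1
print(p, int(p >= 0.5))  # predicted probability and class label
```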

17
New cards