
What is the goal of supervised learning?
To learn a mapping function f: R^p → R^k from labeled examples (x_i, y_i)
You give the model examples → it learns a rule → it predicts outputs for new, unseen inputs.

What are the 2 types of supervised learning?
Classification: predicts a categorical label
outputs are target classes/probabilities (disease yes/no, email spam/not spam)
Regression: predicts a continuous value
outputs are real values (temp, price)

What is linear regression?
Goal: find the best straight line that predicts the output from the input
prediction = slope · x + intercept
What is the matrix notation of linear regression?
y = Xw + ε (ε = noise/error term)
each row = one sample
each column = one feature
the first column is a column of ones (for the bias)
Prediction = weighted sum of features + bias
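A minimal NumPy sketch of this matrix form (the array values are made up for illustration):

```python
import numpy as np

# Toy data: 4 samples, 2 features (made-up numbers)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])

# Prepend a column of ones so the bias becomes part of w
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

w = np.array([0.5, 2.0, -1.0])   # [bias, weight_1, weight_2], arbitrary values

# Prediction for all samples at once: weighted sum of features + bias
y_hat = X @ w
print(y_hat)
```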

What is the connection between the cost function and linear regression?
by minimizing the cost function, you find the best weights w for linear regression
the cost (in matrix form, the mean squared error J(w) = ||y - Xw||² / n) measures how wrong the model is; smaller cost = better model
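A small sketch of this, assuming the mean-squared-error cost and a design matrix X that already contains the column of ones:

```python
import numpy as np

def mse_cost(X, y, w):
    """Mean squared error J(w) = ||y - Xw||^2 / n: how wrong the model is."""
    residuals = y - X @ w
    return residuals @ residuals / len(y)

# Toy check (made-up numbers): better weights give a smaller cost
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # ones column + one feature
y = np.array([2.0, 4.0, 6.0])                        # exactly y = 2x
print(mse_cost(X, y, np.array([0.0, 2.0])))          # 0.0 (perfect fit)
print(mse_cost(X, y, np.array([0.0, 1.0])))          # larger cost (worse weights)
```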

What does it mean that the cost function is convex?
The curve is shaped like a bowl
There is one unique global minimum
This minimum gives you the best possible weights
No risk of getting stuck in a local minimum (unlike neural networks)

What is the neat exact solution of linear regression?
optimization - the analytical solution
this is a formula that gives you the best weights w in one single calculation, without looping → called the normal equation: w = (X^T X)^(-1) X^T y
Instead of searching for the best line, this formula directly computes the best line.
When does it work?
Only when:
X^T X is invertible, and
The number of features p is not huge.
If features are strongly correlated (e.g., height in cm and height in inches), you get multicollinearity, and X^T X becomes singular (or nearly so) and cannot be safely inverted.
→ Then you cannot use the formula.
✔ When shouldn’t you use the formula?
If you have many features → matrix inverse is very slow
If the design matrix is not invertible
So in real ML:
We rarely use the analytical solution.
Instead, we use gradient descent, which works even when the matrix isn’t invertible and when p is huge.
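A minimal normal-equation sketch with NumPy on made-up data; `np.linalg.solve` is used instead of an explicit inverse, which is the usual numerically safer way to evaluate the same formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])   # ones column for the bias
true_w = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)                   # noisy targets

# Normal equation: w = (X^T X)^(-1) X^T y, solved as a linear system
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to true_w

# If X^T X is singular (e.g., duplicated features), solve() fails;
# gradient descent or a pseudo-inverse (np.linalg.pinv) still works.
```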

Describe the 4 steps of iterative optimization - gradient descent algorithm
1. Start with random parameters
start somewhere random on the hill
the initial model has bad weights
2. We need to find the direction of steepest descent. To this end, we calculate the slope (gradient) of the cost function at the current parameters
3. We take one step downhill using the update rule w ← w - η∇J(w), where η is the
learning rate (step size / how far you move)
if the learning rate is too big = overshooting, too small = you move too slowly
4. We repeat steps 2-3 until we reach the global minimum
works for huge datasets, even when analytical solution is impossible
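A minimal gradient-descent sketch for linear regression with the MSE cost (toy data; learning rate and iteration count are arbitrary, not tuned):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 3.0, -2.0]) + 0.1 * rng.normal(size=n)

w = rng.normal(size=p + 1)               # step 1: random starting parameters
eta = 0.1                                # learning rate (step size)

for step in range(500):                  # step 4: repeat until (near) the minimum
    grad = 2 / n * X.T @ (X @ w - y)     # step 2: slope of the MSE cost at current w
    w = w - eta * grad                   # step 3: one step downhill
print(w)                                 # close to [1.0, 3.0, -2.0]
```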

How can you estimate non-linear functions using linear regression?
Linear regression is only linear in the PARAMETERS (weights), not in the original features
What is feature engineering?
Transform the original input x into new features that capture non-linear patterns. Then run linear regression on those transformed features.
Feature engineering = creating new, meaningful features from existing data so the model can learn patterns more easily.
Feature engineering is the process of transforming your raw data into better inputs (features) that make your model perform much better.
🧠 Why do we need feature engineering?
Because many relationships in the real world are non-linear, messy, or hidden. Raw data often does NOT directly show the patterns we want the model to learn. Feature engineering helps us expose those patterns.
What are examples of feature engineering?
polynomials: x²,x³
interaction terms: x1x2
logarithms: log(x)
trigonometric functions: sin(x), cos(x)
domain-specific features: ratios, counts
Linear Regression + Feature Engineering = very powerful model
However, it can result in overfitting to training data.
The relation between the complexity of the induced model and underfitting and overfitting is a crucial notion in data mining
feature engineering increases model complexity, more features = more flexibility = better ability to fit complicated patterns BUT higher risk of overfitting
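A sketch of this idea with scikit-learn, using polynomial features as the engineered features (the data and the degree are made-up choices):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(x).ravel() + 0.1 * rng.normal(size=100)    # non-linear target

# Linear regression on engineered features x, x^2, ..., x^5:
# still linear in the weights, but it can fit a curved function.
model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[1.0], [2.0]])))        # roughly sin(1), sin(2)
```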
What is under and overfitting?
underfitting: the induced model is not complex (flexible) enough to model data
too simple
predictions are bad on training and test data
overfitting: the induced model is too complex (flexible) to model data and tries to fit noise
weights become very large
very good on training data but bad on test data
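A quick sketch of these symptoms, using polynomial degree as the complexity knob and comparing train vs. test error (the degrees and noise level are arbitrary choices):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * x).ravel() + 0.2 * rng.normal(size=60)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

for degree in [1, 4, 15]:   # too simple / about right / very flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(x_tr))
    test_err = mean_squared_error(y_te, model.predict(x_te))
    print(degree, round(train_err, 3), round(test_err, 3))
# Typically: degree 1 has high error on both sets (underfitting),
# while degree 15 gets a very low train error but a worse test error (overfitting).
```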

What is bias-variance trade-off?
Model error can be decomposed into: bias² + variance + noise
bias = error from being too simple (underfitting)
variance = error from being too sensitive to training data (overfitting)
*high bias = too simple
*too complex = high variance

What is the solution to overfitting?
regularization
shrinks weights to prevent overfitting
Regularization = adding a penalty to the cost function to prevent the model from having huge coefficients → helps reduce overfitting and improves stability.
What regularization does:
Adds a penalty on large coefficients
This reduces variance
But slightly increases bias
Moves the model back toward the “just right” region in the bias–variance trade-off graph
👉 Underfitting is fixed by making the model more complex: add features, reduce regularization, or choose a more flexible model.
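A small sketch of what "adding a penalty to the cost function" means, assuming the ridge (L2) penalty; `lam` is the regularization strength:

```python
import numpy as np

def ridge_cost(X, y, w, lam):
    """MSE + L2 penalty: J(w) = ||y - Xw||^2 / n + lam * ||w||^2."""
    residuals = y - X @ w
    return residuals @ residuals / len(y) + lam * (w @ w)

# Larger lam => large weights cost more => the minimizer has smaller weights
# (less variance, slightly more bias). In practice the bias/intercept term is
# usually left out of the penalty.
```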

What are the 3 methods for regularization in linear regression?
Ridge (L2 penalization): Shrinks all coefficients smoothly
Lasso (L1 penalization): Encourages sparsity in coefficients
Elastic-net (L1+L2 penalization): Balances between ridge and Lasso
👉 A penalty is an extra cost added to the model when the weights become too large.
large weights = overly complicated model = overfitting.
| Method | Penalty | Effect | Best Use |
|---|---|---|---|
| Ridge | L2 | Shrinks all coefficients smoothly | Multicollinearity, all features matter |
| Lasso | L1 | Sets some coefficients exactly to zero | Feature selection, sparse models |
| Elastic Net | L1 + L2 | Mix of sparsity + stability | Correlated features + many features |
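A sketch comparing the three penalties with scikit-learn on synthetic data (the alpha values are arbitrary, not tuned):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.datasets import make_regression

# Toy data: 20 features, only 5 of them actually informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for model in [Ridge(alpha=1.0), Lasso(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)]:
    model.fit(X, y)
    n_zero = int(np.sum(np.isclose(model.coef_, 0.0)))
    print(type(model).__name__, "coefficients exactly zero:", n_zero)
# Typically: Ridge only shrinks (no exact zeros), while Lasso and Elastic Net
# drive several coefficients exactly to zero (sparsity / feature selection).
```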


What is logistic regression?
Logistic regression is a linear classifier.
It predicts the probability of a class (in binary: 0 or 1).
It uses input features x∈R^p
The model has:
w = weight vector
b = bias
First computes a linear score: z = w^T x + b
Then applies the sigmoid function to convert z → probability:
σ(z) = 1 / (1 + e^(-z))
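A minimal sketch of the two-step computation (linear score, then sigmoid); the numbers are made up:

```python
import numpy as np

def sigmoid(z):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -2.0])    # weight vector (arbitrary values)
b = 0.3                      # bias
x = np.array([0.8, 0.1])     # one input sample

z = w @ x + b                # linear score z = w^T x + b
p = sigmoid(z)               # probability of class 1
print(z, p, int(p >= 0.5))   # score, probability, predicted class
```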
