Chapter 1 - 2

8 Terms

1
New cards
  • In a matrix, is it common convention to represent each example as a separate row or as a separate column?

  • the iris dataset consists of 150 examples and four features, which can be written as a ____ × ____ matrix

  • the superscript i will refer to what?

  • the subscript j will refer to what?

  • In a matrix, it is common convention to represent each example as a separate row

  • the iris dataset consists of 150 examples and four features, which can be written as a 150 × 4 matrix

  • the superscript i refers to the ith training example

  • the subscript j refers to the jth dimension (feature) of the training dataset

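The convention above can be sketched with a minimal NumPy example; the random 150 × 4 matrix here simply stands in for the iris data (one row per example, one column per feature):

```python
import numpy as np

# Stand-in for the iris data: a 150 x 4 matrix with one row per
# training example and one column per feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

print(X.shape)        # (150, 4)
x_i = X[49]           # the 50th training example: the superscript (i) selects a row
x_j = X[:, 2]         # the 3rd feature of every example: the subscript j selects a column
print(x_i.shape)      # (4,)
print(x_j.shape)      # (150,)
```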
2
New cards
  • lowercase boldface letters refer to what?

  • uppercase boldface letters refer to what?

  • lowercase boldface letters refer to vectors

  • uppercase boldface letters refer to matrices

3
New cards

X^(i)

  • here X is uppercase, so it's referring to a ________

  • what is the superscript referring to about this matrix?

  • X^(i) is referring to a matrix, and the superscript is referring to the ith row of the matrix

in the example, the matrix has 150 rows and 4 columns

4
New cards

x_j

  • here x is lowercase, so it's referring to a vector; the subscript j indicates the jth feature, so x_j is the vector of the jth feature values across all training examples

5
New cards

The learning rate, η (eta), as well as the number of epochs (n_iter), are the so-called what?

Hyperparameters (or tuning parameters)
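As a minimal sketch, hyperparameters are values the practitioner sets before training rather than quantities learned from the data (the class name and defaults below are illustrative, not the book's exact implementation):

```python
# Illustrative perceptron skeleton: eta and n_iter are chosen by the
# practitioner up front, which is what makes them hyperparameters.
class Perceptron:
    def __init__(self, eta=0.01, n_iter=50):
        self.eta = eta        # learning rate (hyperparameter)
        self.n_iter = n_iter  # number of epochs (hyperparameter)

ppn = Perceptron(eta=0.1, n_iter=10)
print(ppn.eta, ppn.n_iter)   # 0.1 10
```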

6
New cards

What might happen if we choose a learning rate that is too large? What happens if the learning rate is too small?

If the learning rate is too large, we overshoot the global minimum; see the figure. The left figure shows a well-chosen learning rate, while the figure on the right shows what happens when we choose a learning rate that is too large.

If the learning rate is too small, the algorithm requires a very large number of epochs to converge to the global minimum.

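Both failure modes can be seen on the simple one-dimensional loss L(w) = w², whose gradient is 2w and whose global minimum is at w = 0 (a toy illustration, not the chapter's perceptron code):

```python
# Plain gradient descent on L(w) = w**2, gradient 2*w.
def gradient_descent(eta, n_iter, w=1.0):
    for _ in range(n_iter):
        w -= eta * 2 * w      # gradient step
    return w

print(gradient_descent(eta=0.3,   n_iter=20))  # well chosen: w shrinks toward 0
print(gradient_descent(eta=1.1,   n_iter=20))  # too large: overshoots, |w| blows up
print(gradient_descent(eta=0.001, n_iter=20))  # too small: w has barely moved
```

With η = 0.3 each step multiplies w by 0.4, so 20 epochs land very near the minimum; with η = 1.1 each step multiplies w by −1.2, so the iterate overshoots to the other side and diverges; with η = 0.001 the factor is 0.998 and w is still close to its starting value after 20 epochs.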
7
New cards

Why are feature scaling in general, and the standardization method in particular, used? How do you apply standardization?

Feature scaling is used to optimize machine learning algorithms; gradient descent is one of the many algorithms that benefit from it.

Standardization is a normalization procedure that helps gradient descent learning converge more quickly, but it does not make the original dataset normally distributed (with most of the values centered around the mean). It only rescales and recenters the data; it does not change the shape of the distribution. Standardization shifts the mean of each feature so that it is centered at zero and gives each feature a standard deviation of 1 (unit variance).

To standardize the jth feature, subtract the sample mean from every training example and divide by the standard deviation.

In the attached image, x_j is a vector consisting of the jth feature values of all n training examples, and this standardization technique is applied to each feature, j, in the dataset.

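The procedure above can be written in one NumPy line per dataset; the random matrix here is a stand-in for real feature data, and the per-column mean/std take the place of μ_j and σ_j for each feature j:

```python
import numpy as np

# Stand-in dataset: 150 examples x 4 features with nonzero mean and
# non-unit variance, so standardization visibly changes both.
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(150, 4))

# Standardize each feature j: subtract its sample mean, divide by its std.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std.mean(axis=0), 0.0))  # True: each feature centered at zero
print(np.allclose(X_std.std(axis=0), 1.0))   # True: each feature has unit variance
```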
8
New cards

Why does standardization help with gradient descent learning?

It makes it easier to find a learning rate that works well for all the weights and the bias. When the features are on vastly different scales, a learning rate that works well for updating one weight might be too large or too small to update another weight equally well. Standardizing the features can stabilize training so that the optimizer has to go through fewer steps to find a good or optimal solution (the global loss minimum).

The attached image shows gradient descent with unscaled features (left) and standardized features (right). The concentric circles represent the loss surface as a function of two model weights in a two-dimensional classification problem.

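A toy numeric illustration of the same idea (not the book's code): model the loss along each weight as a quadratic c·w², where unscaled features give very different curvatures c per weight, while standardized features make them comparable. With a single shared learning rate, the mismatched case needs far more steps:

```python
# Gradient descent on L(w) = sum(c_k * w_k**2); gradient along w_k is 2*c_k*w_k.
# The curvature list stands in for feature scale: [1, 100] mimics unscaled
# features, [1, 1] mimics standardized ones.
def steps_to_converge(curvatures, eta, tol=1e-3, max_iter=10_000):
    w = [1.0 for _ in curvatures]
    for step in range(1, max_iter + 1):
        w = [wi - eta * 2 * c * wi for wi, c in zip(w, curvatures)]
        if max(abs(wi) for wi in w) < tol:
            return step
    return max_iter

print(steps_to_converge([1.0, 100.0], eta=0.009))  # hundreds of steps
print(steps_to_converge([1.0, 1.0],   eta=0.4))    # a handful of steps
```

In the unscaled case the learning rate must stay small enough for the steep direction not to diverge, which makes progress along the shallow direction crawl; after "standardizing", one learning rate serves both weights and the optimizer reaches the minimum in a few steps.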