In a matrix, is it common convention to represent each example as a separate row or as a separate column?
the iris dataset consists of 150 examples and four features, which can be written as a ____ × ____ matrix
the ith superscript will refer to what?
the jth subscript will refer to what?
In a matrix, it is common convention to represent each example as a separate row
the iris dataset consists of 150 examples and four features, which can be written as a 150 × 4 matrix
the ith superscript will refer to the ith training example
the jth subscript will refer to the jth dimension of the training dataset
lowercase bold face letters refer to what?
uppercase bold face letters refer to what?
lowercase bold face letters refer to vectors
uppercase bold face letters refer to matrices
X^(i)
here X is uppercase, so it's referring to a ________
what is the superscript referring to in this notation?
X^(i) is referring to a matrix, and the superscript is referring to the ith row of the matrix
in the example the matrix has 150 rows and 4 columns
x_j
here x is lowercase, so it's referring to a ________
x_j is referring to a vector: the jth feature values (the jth column) across all training examples
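A short NumPy sketch of this notation (a minimal illustration; it assumes scikit-learn's load_iris for the 150 × 4 iris matrix, and the index values are arbitrary):

```python
from sklearn.datasets import load_iris

X = load_iris().data   # iris feature matrix: 150 examples (rows) x 4 features (columns)
print(X.shape)         # (150, 4)

i, j = 49, 2           # 0-based indices in NumPy; the card's notation is 1-based
x_i = X[i]             # "ith superscript": the ith training example (one row of 4 feature values)
x_j = X[:, j]          # "jth subscript": the jth feature across all 150 training examples (one column)

print(x_i.shape)       # (4,)
print(x_j.shape)       # (150,)
```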
The learning rate, η (eta), as well as the number of epochs (n_iter), are the so-called what?
Hyperparameters (or tuning parameters)
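As a minimal sketch (assuming a Raschka-style perceptron; the class and attribute names are illustrative, not the book's exact code), eta and n_iter are passed in by the user before training and are never learned from the data:

```python
import numpy as np

class Perceptron:
    """Minimal perceptron-style learner, just to show where the hyperparameters live."""

    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta                # learning rate: a hyperparameter chosen by the user
        self.n_iter = n_iter          # number of epochs (passes over the training set): also a hyperparameter
        self.random_state = random_state

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])  # weights are learned parameters,
        self.b_ = 0.0                                                # unlike eta and n_iter
        for _ in range(self.n_iter):
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))  # eta scales every weight update
                self.w_ += update * xi
                self.b_ += update
        return self

    def predict(self, X):
        return np.where(X @ self.w_ + self.b_ >= 0.0, 1, 0)
```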
What might happen if we choose a learning rate that is too large? What happens if the learning rate is too small?
If the learning rate is too large, we overshoot the global minimum; see the attached figure. The left plot shows a well-chosen learning rate, while the plot on the right shows what happens when we choose a learning rate that is too large
If the learning rate is too small then the algorithm would require a very large number of epochs to converge to the global minimum
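A tiny numerical illustration of how the choice of eta plays out (the toy loss L(w) = w² is an assumption for illustration, not from the original card):

```python
def gradient_descent(eta, n_iter=20, w0=5.0):
    """Gradient descent on the toy loss L(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(n_iter):
        w -= eta * 2 * w          # update rule: w := w - eta * dL/dw
    return w

print(gradient_descent(eta=0.1))    # well chosen: w shrinks toward the minimum at 0
print(gradient_descent(eta=0.001))  # too small: barely moves after 20 epochs
print(gradient_descent(eta=1.1))    # too large: each step overshoots 0 and |w| grows
```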
Why are standardization and feature scaling in general used? How do you apply standardization?
Feature scaling is used to improve the performance of machine learning algorithms; gradient descent is one of the many algorithms that benefit from feature scaling
Standardization is a normalization procedure that helps gradient descent learning converge more quickly, but it does not make the original dataset normally distributed (most of the values centered around the mean). It only rescales and recenters the data; it does not change the shape of the distribution. Standardization shifts the mean of each feature so that it is centered at zero and gives each feature a standard deviation of 1 (unit variance)
To standardize the jth feature, subtract the sample mean, μ_j, from every training example and divide the result by the standard deviation, σ_j: x'_j = (x_j − μ_j) / σ_j
Here, x_j is a vector consisting of the jth feature values of all n training examples, and this standardization technique is applied to each feature, j, in our dataset
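A minimal NumPy sketch of the same procedure (the toy matrix X is made up for illustration; the 150 × 4 iris matrix would be handled identically, column by column):

```python
import numpy as np

# Toy feature matrix with two features on different scales.
X = np.array([[1.0,  200.0],
              [2.0,  400.0],
              [3.0,  600.0]])

mu = X.mean(axis=0)       # sample mean of each feature (column) j
sigma = X.std(axis=0)     # sample standard deviation of each feature j
X_std = (X - mu) / sigma  # subtract the mean and divide by the standard deviation, feature by feature

print(X_std.mean(axis=0))  # ~0 for every feature: centered at zero
print(X_std.std(axis=0))   # 1 for every feature: unit variance
```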
Why does standardization help with gradient descent learning?
It makes it easier to find a learning rate that works well for all the weights and the bias. When the features are on vastly different scales, a learning rate that works well for updating one weight might be too large or too small to update the other weights equally well. Standardizing the features can stabilize training such that the optimizer has to go through fewer steps to find a good or optimal solution (the global loss minimum)
The attached image shows gradient descent with unscaled features (left) and standardized features (right). The concentric circles represent the loss surface as a function of two model weights in a two-dimensional classification problem.
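A rough numerical sketch of the same effect (the synthetic data, learning rates, and linear model with a mean-squared-error loss are all assumptions for illustration): with unscaled features, the largest eta that does not blow up is dictated by the large-scale feature and is far too small for the other weight and the bias, so the same epoch budget ends at a much higher loss than with standardized features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features on very different scales (synthetic data, for illustration only).
X = np.column_stack([rng.uniform(0, 1, 200),        # feature 1: values around 0..1
                     rng.uniform(0, 1000, 200)])    # feature 2: values around 0..1000
y = 3.0 * X[:, 0] + 0.005 * X[:, 1] + rng.normal(0, 0.1, 200)

def gd_mse(X, y, eta, n_iter=100):
    """Full-batch gradient descent on the mean squared error of a linear model."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        errors = y - (X @ w + b)
        w += eta * X.T @ errors / len(y)   # the same eta is used for every weight
        b += eta * errors.mean()
    return ((y - (X @ w + b)) ** 2).mean()

X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Unscaled: eta must stay tiny or the updates driven by the large feature diverge,
# so the weight of the small feature and the bias barely move in 100 epochs.
print("unscaled loss:    ", gd_mse(X, y, eta=1e-6))
# Standardized: one moderate eta works for both weights and the bias.
print("standardized loss:", gd_mse(X_std, y, eta=0.1))
```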