Consider the following graph from a series of experiments with increased classification model complexity. The accuracy of the model is measured and plotted in two graphs. What are point A, graph B, distance C, and graph D?
[image]
Point A is the sweet spot, where the best test accuracy is achieved.
Graph B is the training results.
Distance C is the generalization error.
Graph D is the test results.
What are “bias” and “variance?”
Bias is the part of the prediction error caused by inaccurate assumptions or an overly simple model.
Variance is the change in the model’s predictions (and hence in its prediction errors) when the dataset is changed.
How does linear regression work, and what are its assumptions?
Linear regression is a supervised learning algorithm used for regression problems. It attempts to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. Then, the equation is used to make predictions on unseen data.
The assumptions of linear regression are as follows:
Linearity: The relationship between the independent and dependent variables is linear.
Independence: The observations are independent of each other.
Homoscedasticity: The variance of the errors is constant for all values of the independent variables.
Normality: The errors are normally distributed.
No multicollinearity: The independent variables are not highly correlated with each other.
Violating these assumptions can lead to biased or inefficient estimates or inaccurate predictions.
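The fitting and prediction steps described above can be sketched with ordinary least squares in NumPy. The toy data, the noise level, and the variable names are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2 + 3x plus small Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=x.shape)

# Design matrix with a column of ones so the model has an intercept.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: choose theta to minimize ||X @ theta - y||^2.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = theta

# Use the fitted linear equation to predict on unseen inputs.
x_new = np.array([11.0, 12.0])
y_pred = intercept + slope * x_new
```

The fitted slope and intercept should land close to the values used to generate the data, since the toy data satisfies the linearity and constant-variance assumptions by construction.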
What causes “high variance” and “high bias”?
Generally, “high variance” indicates overfitting, and “high bias” is due to underfitting.
However, in exceptional cases, high variance could be caused by the statistical nature of the new dataset being tested with the trained model. For example, if the distribution of the new dataset is different from the training set, we could see a high variance. In such a case, we should retrain the model with a mixture of the two datasets. One can say that the initial training was overfitted on the initial dataset.
What is the relationship between the model’s complexity, bias, and variance?
Increasing the model’s complexity reduces the bias but increases the variance. Beyond a certain point, the added variance outweighs the reduced bias and the model generalizes worse.
True or False: Overfitting is more probable when the dataset is small.
True
True or False: Increasing the dataset increases the model’s generalization.
True
What is the primary goal of cross-validation?
To estimate how well the model generalizes to unseen data, which helps detect and avoid overfitting.
What is a “5-fold leave-one-out cross-validation?”
Set aside a part of the dataset for testing. The remaining (training) part is partitioned into 5 folds. In each of 5 rounds, 1 fold is used for validation and the other 4 folds are used for training, so every fold serves as the validation set once.
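The fold rotation can be sketched as follows. The helper `k_fold_indices` is a hypothetical name written for this card, not a library function, and it assumes the test set has already been set aside.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs, one pair per fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        # Fold i is held out for validation; the other k-1 folds train.
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# 20 training samples split into 5 folds of 4 samples each.
splits = list(k_fold_indices(20, k=5))
```

A model would be trained and scored once per pair in `splits`, and the 5 validation scores averaged.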
What is the primary purpose of regularization in machine learning?
A) To increase the complexity of the model
B) To reduce the training error
C) To penalize a large number of coefficients and large magnitudes to avoid overfitting
D) To speed up the training process
C) To penalize a large number of coefficients and large magnitudes to avoid overfitting
Which of the following is a common regularization technique?
A) Gradient Descent
B) L1 Regularization (Lasso)
C) Principal Component Analysis (PCA)
D) K-Means Clustering
B) L1 Regularization (Lasso)
What is the key difference between L1 and L2 regularization?
A) L1 adds the absolute value of coefficients to the loss function, while L2 adds the squared value of coefficients
B) L1 adds the squared value of coefficients to the loss function, while L2 adds the absolute value of coefficients
C) L1 is used for classification, while L2 is used for regression
D) L1 is faster to compute than L2
A) L1 adds the absolute value of coefficients to the loss function, while L2 adds the squared value of coefficients
What happens to the model’s complexity as the regularization parameter (λ) increases?
A) The model becomes more complex
B) The model becomes less complex
C) The model’s complexity remains unchanged
D) The model’s complexity depends on the dataset
B) The model becomes less complex
What are some regularization methods?
A) L1 (Lasso) and L2 (Ridge) regularization, dropout, and early stopping
B) L1 (Lasso) and L2 (Ridge), mini-batch gradient descent (MBGD)
C) Stochastic Gradient Descent, batch normalization
D) Standard scaler, min-max scaler
A) L1 (Lasso) and L2 (Ridge) regularization, dropout, and early stopping
Suppose we are performing regression on N samples. Each sample has one target value (y), and a polynomial of degree M is applied to each point, xi. We want to train a model with M parameters. The following relation is used during the training. What are A, B, C, λ, and θ?
[image]
A is the cost.
B is the loss function that calculates the difference between the predicted value and the target.
C is the L2 regularizer.
λ is the coefficient of regularization.
θ is the set of model parameters.
What is L2 or ridge regularization?
The sum of squared parameters is added to the loss function. It forces the training process to keep the magnitudes of the model’s parameters small; unlike L1, it shrinks parameters toward zero but rarely makes them exactly zero.
What is L1 or lasso regularization?
The sum of the absolute values of parameters is added to the loss function.
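The two penalties can be compared side by side in a small sketch. The function name `regularized_cost` and the choice of a sum-of-squares data loss are assumptions for illustration; in practice the intercept is often left out of the penalty, which this sketch does not do.

```python
import numpy as np

def regularized_cost(theta, X, y, lam, kind="l2"):
    """Sum-of-squares loss plus lam times an L1 or L2 penalty on theta."""
    residuals = X @ theta - y
    loss = np.sum(residuals ** 2)
    if kind == "l2":
        penalty = np.sum(theta ** 2)      # ridge: sum of squared parameters
    elif kind == "l1":
        penalty = np.sum(np.abs(theta))   # lasso: sum of absolute values
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return loss + lam * penalty

# With a perfect fit the loss is zero, so only the penalty remains.
X = np.eye(2)
theta = np.array([1.0, -2.0])
y = np.array([1.0, -2.0])
ridge_cost = regularized_cost(theta, X, y, lam=0.5, kind="l2")  # 0.5 * (1 + 4) = 2.5
lasso_cost = regularized_cost(theta, X, y, lam=0.5, kind="l1")  # 0.5 * (1 + 2) = 1.5
```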
Why do we call L2 regularization “ridge” and L1 “Lasso?”
The term “ridge” refers to the fact that the penalty term acts like a “ridge” that constrains the model’s coefficients to lie within a specific range.
The term “lasso” stands for “Least Absolute Shrinkage and Selection Operator”.
What is the role of the “coefficient of regularization λ”?
Large values of λ increase the effect of the model’s parameters on the loss function. Hence, large values of λ do not allow the training process to create a complex model with many parameters. Large values of λ could result in a simple model and underfitting.
On the other hand, a small regularization coefficient allows the model to become complex and could cause overfitting.
λ is a hyperparameter.
Plot A is the original data. We are performing regression. Which of the two plots, B and C, is regression before, and which one is after regularization?
[image]
B is before regularization.
C is after regularization.
What is gradient descent?
Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. Gradient descent is used to find the values of a function’s parameters (coefficients) that minimize a cost function as far as possible.
Suppose a linear regression with parameters θ0 and θ1 is used for prediction. We have n samples, and each sample i has one attribute xi and a target yi.
a) What is the residual sum of squares loss function J(θ)?
b) What is the gradient of the loss function with respect to θ0?
c) What is the gradient of the loss function with respect to θ1?
d) What does the gradient of the loss function with respect to θ1 show?
a) [image]
b) [image]
c) [image]
d) The gradient with respect to the parameter θ1 shows the relationship between this parameter and the loss function. The goal is to reduce the loss function by choosing better values for the model’s parameters. For example, if the gradient with respect to θ1 is positive, it means that the loss function would increase if we increase this parameter. Hence, we will decrease θ1 to force the loss function to decrease.
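For reference, parts a) to c) can be worked out as below. This assumes the plain residual sum of squares with no 1/2 or 1/n scaling factor; the original figures may use a different convention, which would only change the constant in front of each gradient.

```latex
\begin{aligned}
J(\theta) &= \sum_{i=1}^{n} \bigl(y_i - (\theta_0 + \theta_1 x_i)\bigr)^2 \\
\frac{\partial J}{\partial \theta_0} &= -2 \sum_{i=1}^{n} \bigl(y_i - \theta_0 - \theta_1 x_i\bigr) \\
\frac{\partial J}{\partial \theta_1} &= -2 \sum_{i=1}^{n} x_i \bigl(y_i - \theta_0 - \theta_1 x_i\bigr)
\end{aligned}
```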
What type of regression is performed by the following operation? What are x’s, θ’s, and y(i)?
[image]
This is a polynomial regression of degree d, where x(i) is the ith data point.
θ’s are parameters of the regression model.
y(i) is the predicted value of the ith data point. We are assuming that each data point has only one attribute.
What is the following expression? What are hθ(x(i)), and y(i)? What is the goal of modeling when this expression is calculated?
[image]
It is the cost function for a prediction model (e.g., regression).
hθ(x(i)) is the predicted value and y(i) is the target value.
The goal of modeling is to find a set of modeling parameters (θ) to minimize J(θ).
What would the gradient descent operation do with the following cost function?
[image]
The goal of modeling is to find a set of modeling parameters (θi) to minimize J(θ). For a random value of θi, gradient descent gives us the direction of modifying θi to get closer to the minimum value of J. Hence, we calculate the gradient of the loss function, [image], which is the slope of the cost as a function of θi.
If this slope is positive, we decrease the value of θi. If the slope is negative, we increase θi. If the slope is zero, θi is at a stationary point of J(θ); for a convex cost function this is the minimum, and we have reached our goal.
What is the purpose of the following expression?
[image]
This is the cost function for a linear regression prediction with two parameters, θ0 and θ1. We want to find the θ0 and θ1 that minimize this function. The function measures the average difference between predicted and actual values.
What is the derivative of the following expression with respect to θ0?
[image]
[image]
What is the derivative of the following expression with respect to θ1?
[image]
[image]
Considering the following cost function, how do we update θ0 using gradient descent?
[image]
[image]
What is the purpose of the following expression, and what are A, B, C, and D?
[image]
This expression iteratively updates the parameter θ0 using gradient descent.
A is the next value of θ0
B is the current value of θ0
C is the learning rate
D is the gradient of the cost function with respect to θ0
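The update rule (next value = current value minus learning rate times gradient) can be sketched for a two-parameter linear regression. The cost J = (1/(2n)) Σ (θ0 + θ1·x − y)², the toy data, and the function name `gradient_descent` are assumptions for illustration.

```python
import numpy as np

def gradient_descent(x, y, lr=0.05, n_iters=2000):
    """Fit y ~ theta0 + theta1 * x by repeated gradient steps."""
    theta0, theta1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        error = theta0 + theta1 * x - y      # prediction minus target
        grad0 = error.sum() / n              # dJ/dtheta0
        grad1 = (error * x).sum() / n        # dJ/dtheta1
        # Both parameters are updated from the same (old) error values.
        theta0 = theta0 - lr * grad0
        theta1 = theta1 - lr * grad1
    return theta0, theta1

# Noise-free toy data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
theta0, theta1 = gradient_descent(x, y)
```

On this toy data the parameters approach θ0 = 1 and θ1 = 2, the values used to generate it.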
What is the role of the learning rate in the iterative gradient descent operation?
The learning rate is a hyperparameter. It shows how much we should change a model parameter in response to the estimated error at each iteration.
a) What are the consequences of choosing large or small learning rates (fixed values)? b) What is an adaptive optimization method?
a) Large learning rates could result in fast convergence to the minimum cost. However, there is a chance that a learning rate that is too large causes divergence. Low learning rates could slow down the convergence process.
b) An adaptive optimization method uses gradient descent, but the learning rate value is not fixed. When the magnitude of the gradient becomes small, the value of the learning rate is adaptively reduced.
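The effect of a fixed learning rate in part a) can be seen on a one-parameter toy cost J(θ) = θ², whose gradient is 2θ. The cost function, the three rates, and the name `run_gd` are assumptions picked for illustration.

```python
def run_gd(lr, theta=1.0, n_iters=50):
    """Run gradient descent on J(theta) = theta**2 from theta = 1."""
    for _ in range(n_iters):
        theta = theta - lr * 2.0 * theta   # gradient of theta**2 is 2*theta
    return theta

small = run_gd(0.001)    # barely moves: convergence is slow
moderate = run_gd(0.1)   # shrinks toward the minimum at 0
large = run_gd(1.1)      # overshoots more on every step and diverges
```

Each step multiplies θ by (1 − 2·lr), so the iterate shrinks when that factor has magnitude below 1 and grows when it exceeds 1.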
What is stochastic gradient descent (SGD)?
To train the model, we randomly select a data point. The gradient of the loss is calculated for just this data, and model parameters are adjusted based on the computed gradient. This process is repeated for all data items in the dataset.
What is batch gradient descent (BGD)?
The average loss gradient for all data points is used to modify the model parameters.
What is mini-batch gradient descent?
An optimization algorithm for training a model. Mini-batch is a variation of the stochastic gradient descent (SGD) algorithm and updates model parameters based on a small subset of the training data called a mini-batch rather than a single example at a time. The mini-batch is typically chosen to be a small random subset of the entire training data, with a size greater than one and smaller than the total number of examples. The algorithm cyclically processes each mini-batch, which means that all mini-batches are processed one after another until the model has seen all training examples. During each iteration, the algorithm computes the average gradient of the loss function with respect to the model parameters over the mini-batch. This gradient is then used to update the model parameters in the direction that minimizes the loss function. Unlike SGD, mini-batch gradient descent can lead to more stable updates, resulting in faster convergence and better generalization performance. However, it requires more memory to store the mini-batches and can be slower than SGD if the mini-batch size is too large.
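A minimal sketch of the loop described above, assuming a squared-error loss for a linear model. The function name `minibatch_gd`, the batch size, and the toy data are hypothetical choices for illustration.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=4, lr=0.05, n_epochs=500, seed=0):
    """Linear regression trained with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):    # visit each mini-batch in turn
            idx = order[start:start + batch_size]
            error = X[idx] @ theta - y[idx]
            # Average gradient of the squared error over the mini-batch.
            grad = 2.0 * X[idx].T @ error / len(idx)
            theta -= lr * grad
    return theta

# Noise-free toy data generated from y = 0.5 + 2x.
x = np.linspace(-1.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 0.5 + 2.0 * x
theta = minibatch_gd(X, y)
```

Setting `batch_size=1` gives stochastic gradient descent, and setting it to the full dataset size gives batch gradient descent.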
Let us assume we have the following 6 data points:
[image]
We want to use regression by applying it to the data’s attributes and predicting the target value. The model is ŷ = θ0 + θ1x1 + θ2x2. Let’s randomly choose the initial value of the parameters to be:
θ0 = 0, θ1 = -0.017, θ2 = -0.048
With a learning rate of 0.05, stochastic gradient descent finds the new set of parameters using the first data point (4,1).
[image]
This is stochastic gradient descent. We randomly choose the initial values of the θ parameters. Then, we use the gradient of the loss function for the data point of (x1 = 4, x2 = 1). Using the initial parameters, we predict the target value as -0.116, while it is 2. Hence, the loss is (-0.116 - 2)^2.
After the first run, we find out that θ0 changed from zero to 0.212. We do the same thing for θ1 and θ2. In this example, we are not showing the complete gradient descent process. The process continues, and we perform the same procedure at the following data point. We continue until we get very low errors, meaning the parameters are trained.
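The single update can be checked numerically. The sketch assumes the loss is the squared error (ŷ − y)² with no 1/2 factor, so its gradient is 2(ŷ − y) times each input; this convention reproduces the stated θ0 ≈ 0.212. The variable names are illustrative.

```python
theta = [0.0, -0.017, -0.048]      # initial theta0, theta1, theta2
x1, x2, y = 4.0, 1.0, 2.0          # first data point and its target
lr = 0.05                          # learning rate

# Prediction with the initial parameters: 0 - 0.068 - 0.048 = -0.116
y_hat = theta[0] + theta[1] * x1 + theta[2] * x2
error = y_hat - y                  # -2.116

# Gradient of (y_hat - y)**2 with respect to theta0, theta1, theta2.
grads = [2 * error * 1.0, 2 * error * x1, 2 * error * x2]

# One stochastic gradient descent step.
new_theta = [t - lr * g for t, g in zip(theta, grads)]
# new_theta is approximately [0.2116, 0.8294, 0.1636]
```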
In what fashion do we build a decision tree?
Top-down
What are the different parts of a decision tree?
Root node, branches, internal node (decision node), and leaves.
How do we use entropy to measure the performance of a decision node?
The entropy of each split branch should be as low as possible.
What is entropy gain, and when is it used?
Entropy gain (information gain) is the difference between a decision node’s entropy and the weighted average entropy of its branches, where each branch is weighted by the fraction of samples it receives. We use the attribute that results in the largest entropy gain. A large entropy gain means that the chosen attribute was distinctive and separated the data well into different classes.
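Entropy and entropy gain can be computed with a few lines of Python. The helper names and the tiny label lists are assumptions for illustration; branches are weighted by their share of the samples.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, branches):
    """Parent entropy minus the size-weighted average entropy of the branches."""
    n = len(parent)
    weighted = sum(len(b) / n * entropy(b) for b in branches)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # each branch is pure
useless_split = [["yes", "no"], ["yes", "no"]]   # each branch is as mixed as the parent

gain_perfect = information_gain(parent, perfect_split)   # 1.0
gain_useless = information_gain(parent, useless_split)   # 0.0
```

The attribute producing the perfect split would be chosen first, since it has the larger gain.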
Consider the following dataset and answer the following questions.
[image]
a) Using log2, calculate the entropy of the target, i.e., H(play).
b) Calculate H(play|outlook)
c) calculate H(play|wind)
d) Which feature should be used first? Why?
a) [image]
b) [image]
c) [image]
d) Since H(play|wind) is higher than H(play|outlook), splitting on outlook gives the larger information gain, so we choose outlook.
We will use the dataset below to learn a decision tree.
[image]
a) Calculate the entropy of Y using log2, i.e., H(Y)=?. Notice that log2(3) ≈ 1.6
b) What is H(Y|x1)? c) What is H(Y|x2)? d) Which feature should be first used?
[image]
A decision tree is a:
a) Linear model
b) Non-linear model
c) Clustering algorithm
d) Dimensionality reduction technique
b) Non-linear model
The root node of a decision tree represents:
a) The class labels
b) The most important feature
c) The least important feature
d) A random feature
b) The most important feature
Decision trees are prone to:
a) Underfitting
b) Overfitting
c) Both underfitting and overfitting
d) Neither underfitting nor overfitting
b) Overfitting
Entropy is a measure of:
a) Information gain
b) Impurity or disorder
c) Gini index
d) Variance
b) Impurity or disorder
Information gain is calculated as:
a) Entropy(parent) - Entropy(children)
b) Entropy(children) - Entropy(parent)
c) Entropy(parent) + Entropy(children)
d) Entropy(parent) * Entropy(children)
a) Entropy(parent) - Entropy(children)
Decision trees can only handle categorical features. (True/False)
False
A high entropy value indicates high purity in the data. (True/False)
False
Information gain is used to select the best feature for splitting at each node. (True/False)
True
Pruning a decision tree helps to prevent overfitting. (True/False)
True
Decision trees are sensitive to outliers in the data. (True/False)
False