Review Questions, Week 6


1
New cards

Consider the following graph from a series of experiments with increased classification model complexity. The accuracy of the model is measured and plotted in two graphs. What are point A, graph B, distance C, and graph D?

[image]

  • Point A is the sweet spot where the best test accuracy is achieved.

  • Graph B is the training results.

  • Distance C is the generalization error.

  • Graph D is the test results.

2
New cards

What are “bias” and “variance?”

Bias is the prediction error caused by inaccurate assumptions or an overly simple model.

Variance is the change in prediction errors when the dataset is changed.

3
New cards

How does linear regression work, and what are its assumptions?

Linear regression is a supervised learning algorithm used for regression problems. It attempts to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. Then, the equation is used to make predictions on unseen data.

The assumptions of linear regression are as follows:

  • Linearity: The relationship between the independent and dependent variables is linear.

  • Independence: The observations are independent of each other.

  • Homoscedasticity: The variance of the errors is constant for all values of the independent variables.

  • Normality: The errors are normally distributed.

  • No multicollinearity: The independent variables are not highly correlated with each other.

Violating these assumptions can lead to biased or inefficient estimates or inaccurate predictions.
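
As a minimal, hedged sketch of the fitting step (the synthetic data and variable names below are illustrative assumptions, not part of the course material), ordinary least squares can be fit and used for prediction in a few lines:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Illustrative synthetic data: y depends linearly on two independent variables plus noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=100)

    model = LinearRegression()
    model.fit(X, y)                       # fit a linear equation to the observed data
    print(model.intercept_, model.coef_)  # fitted parameters
    print(model.predict(X[:3]))           # the same call makes predictions on unseen data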

4
New cards

What causes “high variance” and “high bias”?

Generally, “high variance” indicates overfitting, and “high bias” is due to underfitting.

However, in exceptional cases, high variance could be caused by the statistical nature of the new dataset being tested with the trained model. For example, if the distribution of the new dataset is different from the training set, we could see high variance. In such a case, we should retrain the model on a mixture of the two datasets. One can say that the initial training was overfitted to the initial dataset.

5
New cards

What is the relationship between the model’s complexity, bias, and variance?

Increasing the model’s complexity will reduce the bias and reduce the model’s generalization. Low generalization causes high variance.

6
New cards

True or False: Overfitting is more probable when the dataset is small.

True

7
New cards

True or False: Increasing the dataset increases the model’s generalization.

True

8
New cards

What is the primary goal of cross-validation?

Avoid overfitting.

9
New cards

What is a “5-fold leave-one-out cross-validation?”

A part of the dataset is left out for testing. The training part is partitioned into 5 folds; in each round, 1 fold is used for validation and the other 4 folds are used for training.
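
A hedged sketch of that split using scikit-learn (the data and sizes are illustrative assumptions):

    import numpy as np
    from sklearn.model_selection import train_test_split, KFold
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

    # Leave a part of the dataset for testing.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Partition the training part into 5 folds: 4 for training, 1 for validation in each round.
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(X_train):
        model = LinearRegression().fit(X_train[train_idx], y_train[train_idx])
        print(model.score(X_train[val_idx], y_train[val_idx]))  # validation R^2 for this fold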

10
New cards

What is the primary purpose of regularization in machine learning?

A) To increase the complexity of the model

B) To reduce the training error

C) To penalize a large number of coefficients and large magnitudes to avoid overfitting

D) To speed up the training process

C) To penalize a large number of coefficients and large magnitudes to avoid overfitting

11
New cards

Which of the following is a common regularization technique?


A) Gradient Descent

B) L1 Regularization (Lasso)

C) Principal Component Analysis (PCA)

D) K-Means Clustering

B) L1 Regularization (Lasso)

12
New cards

What is the key difference between L1 and L2 regularization?

A) L1 adds the absolute value of coefficients to the loss function, while L2 adds the squared value of coefficients

B) L1 adds the squared value of coefficients to the loss function, while L2 adds the absolute value of coefficients

C) L1 is used for classification, while L2 is used for regression

D) L1 is faster to compute than L2

A) L1 adds the absolute value of coefficients to the loss function, while L2 adds the squared value of coefficients

13
New cards

What happens to the model’s complexity as the regularization parameter (λ) increases?

A) The model becomes more complex

B) The model becomes less complex

C) The model’s complexity remains unchanged

D) The model’s complexity depends on the dataset

B) The model becomes less complex

14
New cards

What are some regularization methods?

A) L1 (Lasso) and L2 (Ridge) regularization, dropout, and early stopping

B) L1 (Lasso) and L2 (Ridge), mini-batch gradient descent (MBGD)

C) Stochastic Gradient Descent, batch normalization

D) Standard scaler, min-max scaler

A) L1 (Lasso) and L2 (Ridge) regularization, dropout, and early stopping

15
New cards

Suppose we are performing regression on N samples. Each sample has one target value (y), and a polynomial of degree M is applied to each point, xi. We want to train a model with M parameters. The following relation is used during the training. What are A, B, C, λ, and θ?

[image]

  • A is the cost.

  • B is the loss function that calculates the difference between the predicted value and the target.

  • C is the L2 regularizer.

  • λ is the coefficient of regularization.

  • θ is the set of model parameters.
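
The bullets above correspond to a cost of the standard L2-regularized form (written here as an assumption; the exact scaling in the image may differ):

    \underbrace{J(\theta)}_{A} \;=\; \underbrace{\sum_{i=1}^{N} \left( \hat{y}^{(i)} - y^{(i)} \right)^2}_{B:\ \text{loss}} \;+\; \underbrace{\lambda \sum_{j=1}^{M} \theta_j^2}_{C:\ \text{L2 regularizer}}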

16
New cards

What is L2 or ridge regularization?

The sum of the squared parameters is added to the loss function. It forces the training process to keep the magnitudes of the model’s parameters small.

17
New cards

What is L1 or lasso regularization?

The sum of the absolute values of parameters is added to the loss function.
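
A brief hedged sketch of the practical difference (the synthetic data and alpha values are illustrative assumptions): the L1 penalty tends to set many coefficients exactly to zero, while the L2 penalty only shrinks them.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=200)  # only 2 informative features

    ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients toward zero
    lasso = Lasso(alpha=0.1).fit(X, y)  # L1: drives many coefficients exactly to zero
    print(np.round(ridge.coef_, 3))
    print(np.round(lasso.coef_, 3))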

18
New cards

Why do we call L2 regularization “ridge” and L1 “Lasso?”

The term “ridge” refers to the fact that the penalty term acts like a “ridge” that constrains the model’s coefficients to lie within a specific range.

The term “lasso” stands for “Least Absolute Shrinkage and Selection Operator”.

19
New cards

What is the role of the “coefficient of regularization λ”?

Large values of λ increase the effect of the model’s parameters on the loss function. Hence, large values of λ do not allow the training process to create a complex model with many parameters. Large values of λ could result in a simple model and underfitting.

On the other hand, a small regularization coefficient allows the model to become complex and could cause overfitting.

λ is a hyperparameter.
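
A small hedged sketch of this trade-off with ridge regression (the data and λ values are illustrative assumptions): as λ grows, the fitted coefficients shrink toward zero and the model becomes simpler.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

    for lam in [0.01, 1.0, 100.0, 10000.0]:
        coef = Ridge(alpha=lam).fit(X, y).coef_
        print(lam, np.round(np.linalg.norm(coef), 3))  # coefficient norm shrinks as lambda grows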

20
New cards

Plot A is the original data. We are performing regression. Which of the two plots, B and C, is regression before, and which one is after regularization?

[image]

  • B is before regularization.

  • C is after regularization.

21
New cards

What is gradient descent?

Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. It is used to find the values of a function’s parameters (coefficients) that minimize a cost function as much as possible.
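
A minimal numerical sketch (the toy function f(θ) = (θ − 3)² is an assumption for illustration): each step moves θ against the derivative until it settles near the minimizer θ = 3.

    theta = 0.0  # initial guess
    lr = 0.1     # learning rate
    for _ in range(100):
        grad = 2.0 * (theta - 3.0)  # derivative of (theta - 3)^2
        theta -= lr * grad          # step against the gradient
    print(theta)                    # approximately 3.0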

22
New cards

Suppose a linear regression with parameters θ0 and θ1 is used for prediction. We have n samples, and each sample i has one attribute xi and a target yi.

a) What is the residual sum of squares loss function J(θ)?

b) What is the gradient of the loss function with respect to θ0?

c) What is the gradient of the loss function with respect to θ1?

d) What does the gradient of the loss function with respect to θ1 show?

a) [image]

b) [image]

c) [image]

d) The gradient with respect to the parameter θ1 shows the relationship between this parameter and the loss function. The goal is to reduce the loss function by choosing better values for the model’s parameters. For example, if the gradient with respect to θ1 is positive, it means that the loss function would increase if we increase this parameter. Hence, we will decrease θ1 to force the loss function to decrease.
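
The images are not reproduced here, but for the model h_θ(x) = θ_0 + θ_1 x the standard forms (up to an optional 1/n or 1/2 scaling, which is an assumption) are:

    J(\theta) = \sum_{i=1}^{n} \left( \theta_0 + \theta_1 x_i - y_i \right)^2

    \frac{\partial J}{\partial \theta_0} = 2 \sum_{i=1}^{n} \left( \theta_0 + \theta_1 x_i - y_i \right),
    \qquad
    \frac{\partial J}{\partial \theta_1} = 2 \sum_{i=1}^{n} \left( \theta_0 + \theta_1 x_i - y_i \right) x_i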

23
New cards

What type of regression is performed by the following operation? What are x’s, θ’s, and y(i)?

[image]

This is a polynomial regression of degree d, where x(i) is the ith data point.

θ’s are parameters of the regression model.

y(i) is the predicted value of the ith data point. We are assuming that each data point has only one attribute.
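
Written out, a degree-d polynomial regression on a single attribute (a standard form assumed to match the image) is:

    y^{(i)} = \theta_0 + \theta_1 x^{(i)} + \theta_2 \left( x^{(i)} \right)^2 + \dots + \theta_d \left( x^{(i)} \right)^d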

24
New cards

What is the following expression? What are hθ(x(i)), and y(i)? What is the goal of modeling when this expression is calculated?

[image]

It is the cost function for a prediction model (e.g., regression).

hθ(x(i)) is the predicted value and y(i) is the target value.

The goal of modeling is to find a set of modeling parameters (θ) to minimize J(θ).

25
New cards

What would the gradient descent operation do with the following cost function?

[image]

The goal of modeling is to find a set of modeling parameters (θi) to minimize J(θ). For a random value of θi, gradient descent gives us the direction of modifying θi to get closer to the minimum value of J. Hence, we calculate the gradient of the loss function, [image], which is the slope of the cost as a function of θi.

If this slope is positive, we decrease the value of θi. If the slope is negative, we increase θi and if the slope is zero, it means θi has created the minimum loss J(θ) and we have reached our goal.

26
New cards

What is the purpose of the following expression?

[image]

This is the cost function for a linear regression prediction with two parameters of θ0 and θ1. We want to find θ0 and θ1 to minimize this function. The function shows the average differences between predicted and actual values.

27
New cards

What is the derivative of the following expression with respect to θ0?

[image]

[image]

28
New cards

What is the derivative of the following expression with respect to θ1?

[image]

[image]

29
New cards

Considering the following cost function, how do we update θ0 using gradient descent?

[image]

[image]

30
New cards

What is the purpose of the following expression, and what are A, B, C, and D?

[image]

This expression iteratively updates the parameter θ0 using gradient descent.

  • A is the next value of θ0

  • B is the current value of θ0

  • C is the learning rate

  • D is the gradient of the cost function with respect to θ0
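
The labeled parts correspond to the standard gradient descent update (written here as an assumption consistent with the answer; α denotes the learning rate):

    \underbrace{\theta_0^{\text{new}}}_{A} = \underbrace{\theta_0^{\text{old}}}_{B} - \underbrace{\alpha}_{C} \, \underbrace{\frac{\partial J(\theta)}{\partial \theta_0}}_{D}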

31
New cards

What is the role of the learning rate in the iterative gradient descent operation?

The learning rate is a hyperparameter. It shows how much we should change a model parameter in response to the estimated error at each iteration.

32
New cards

a) What are the consequences of choosing large or small learning rates (fixed values)? b) What is an adaptive optimization method?

a) Large learning rates could result in fast convergence to the minimum cost. However, there is a chance that a large learning rate causes divergence. Low learning rates could slow down the convergence process.

b) An adaptive optimization method uses gradient descent, but the learning rate value is not fixed. When the magnitude of the gradient becomes small, the value of the learning rate is adaptively reduced.

33
New cards

What is stochastic gradient descent (SGD)?

To train the model, we randomly select a data point. The gradient of the loss is calculated for just this data, and model parameters are adjusted based on the computed gradient. This process is repeated for all data items in the dataset.

34
New cards

What is batch gradient descent (BGD)?

The average loss gradient for all data points is used to modify the model parameters.

35
New cards

What is mini-batch gradient descent?

An optimization algorithm for training a model. Mini-batch gradient descent is a variation of stochastic gradient descent (SGD) that updates model parameters based on a small subset of the training data, called a mini-batch, rather than a single example at a time. The mini-batch is typically a small random subset of the entire training data, with a size greater than one and smaller than the total number of examples.

The algorithm cyclically processes each mini-batch, meaning that all mini-batches are processed one after another until the model has seen all training examples. During each iteration, it computes the average gradient of the loss function with respect to the model parameters over the mini-batch, and this gradient is used to update the parameters in the direction that minimizes the loss function.

Compared with SGD, mini-batch gradient descent can lead to more stable updates, resulting in faster convergence and better generalization performance. However, it requires more memory to store the mini-batches and can be slower than SGD if the mini-batch size is too large.
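
A hedged sketch of one mini-batch training loop for linear regression (the data, batch size, and learning rate are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

    theta = np.zeros(3)
    lr, batch_size = 0.05, 32
    for epoch in range(20):
        order = rng.permutation(len(X))  # shuffle, then walk through the mini-batches
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)  # average gradient over the mini-batch
            theta -= lr * grad                                # update in the direction that lowers the loss
    print(np.round(theta, 3))  # close to the true coefficients [2.0, -1.0, 0.5]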

36
New cards

Let us assume we have the following 6 data points:

[image]

We want to use regression by applying it to the data’s attributes and predicting the target value. The model is ŷ = θ0 + θ1x1 + θ2x2. Let’s randomly choose the initial value of the parameters to be:

θ0 = 0, θ1 = -0.017, θ2 = -0.048

With a learning rate of 0.05, stochastic gradient descent finds the new set of parameters using the first data point (4,1).

[image]

This is stochastic gradient descent. We randomly choose the initial values of the θ parameters. Then, we use the gradient of the loss function for the data point of (x1 = 4, x2 = 1). Using the initial parameters, we predict the target value as -0.116, while it is 2. Hence, the loss is (-0.116 - 2)².

After the first run, we find out that θ0 changed from zero to 0.212. We do the same thing for θ1 and θ2. In this example, we are not showing the complete gradient descent process. The process continues, and we perform the same procedure at the following data point. We continue until we get very low errors, meaning the parameters are trained.
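
The first update for θ0 can be checked with a few lines (assuming a plain squared-error loss, the stated learning rate of 0.05, and target y = 2 for the point (4, 1)):

    theta = [0.0, -0.017, -0.048]  # theta0, theta1, theta2
    x1, x2, y = 4.0, 1.0, 2.0      # first data point and its target
    lr = 0.05

    y_hat = theta[0] + theta[1] * x1 + theta[2] * x2  # -0.116
    grad_theta0 = 2.0 * (y_hat - y)                   # derivative of (y_hat - y)^2 w.r.t. theta0
    theta[0] -= lr * grad_theta0
    print(y_hat, round(theta[0], 3))                  # -0.116  0.212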

37
New cards

In what fashion do we build a decision tree?

Top-down

38
New cards

What are the different parts of a decision tree?

Root node, branches, internal node (decision node), and leaves.

39
New cards

How do we use entropy to measure the performance of a decision node?

The entropy of each split branch should be as low as possible.

40
New cards

What is entropy gain, and when is it used?

Entropy gain (information gain) is the difference between a decision node’s entropy and the weighted average entropy of its branches. We split on the attribute that results in the largest entropy gain. A large entropy gain means that the chosen attribute was distinctive and correctly separated the data into different classes.
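
A small hedged sketch of both quantities in plain Python (the function and variable names are illustrative):

    import math
    from collections import Counter

    def entropy(labels):
        # H = -sum(p * log2(p)) over the class proportions in this branch.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def entropy_gain(parent_labels, branches):
        # Parent entropy minus the weighted average entropy of the split branches.
        n = len(parent_labels)
        weighted = sum(len(b) / n * entropy(b) for b in branches)
        return entropy(parent_labels) - weighted

    # A split that separates the classes perfectly gains the full parent entropy (here 1 bit).
    parent = ["yes", "yes", "no", "no"]
    print(entropy_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0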

41
New cards

Consider the following dataset and answer the following questions.

[image]

a) Using log2, calculate the entropy of the target, i.e., H(play).

b) Calculate H(play|outlook)

c) calculate H(play|wind)

d) Which feature should be used first? Why?

a) [image]

b) [image]

c) [image]

d) Since H(play|wind) is higher than H(play|outlook), splitting on outlook gives the larger information gain, so we choose outlook first.

42
New cards

We will use the dataset below to learn a decision tree.

[image]

a) Calculate the entropy of Y using log2, i.e., H(Y) = ? Notice that log2(3) ≈ 1.6

b) What is H(Y|x1)? c) What is H(Y|x2)? d) Which feature should be first used?

[image]

43
New cards

A decision tree is a:

a) Linear model

b) Non-linear model

c) Clustering algorithm

d) Dimensionality reduction technique

b) Non-linear model

44
New cards

The root node of a decision tree represents:

a) The class labels

b) The most important feature

c) The least important feature

d) A random feature

b) The most important feature

45
New cards

Decision trees are prone to:

a) Underfitting

b) Overfitting

c) Both underfitting and overfitting

d) Neither underfitting nor overfitting

b) Overfitting

46
New cards

Entropy is a measure of:

a) Information gain

b) Impurity or disorder

c) Gini index

d) Variance

b) Impurity or disorder

47
New cards

Information gain is calculated as:

a) Entropy(parent) - Entropy(children)

b) Entropy(children) - Entropy(parent)

c) Entropy(parent) + Entropy(children)

d) Entropy(parent) * Entropy(children)

a) Entropy(parent) - Entropy(children)

48
New cards

Decision trees can only handle categorical features. (True/False)

False

49
New cards

A high entropy value indicates high purity in the data. (True/False)

False

50
New cards

Information gain is used to select the best feature for splitting at each node. (True/False)

True

51
New cards

Pruning a decision tree helps to prevent overfitting. (True/False)

True

52
New cards

Decision trees are sensitive to outliers in the data. (True/False)

False