Review Questions, Week 6

Last updated 4:52 AM on 5/13/25

52 Terms

New cards

Consider the following graph from a series of experiments with increased classification model complexity. The accuracy of the model is measured and plotted in two graphs. What are point A, graph B, distance C, and graph D?

[image]

Point A is the sweet spot, where the best test accuracy is achieved.
Graph B is the training results.
Distance C is the generalization error (the gap between training and test accuracy).
Graph D is the test results.

New cards

What are “bias” and “variance?”

Bias is the systematic prediction error caused by inaccurate assumptions or an overly simple model.

Variance is how much the model’s predictions (and hence its errors) change when the training dataset is changed.

New cards

How does linear regression work, and what are its assumptions?

Linear regression is a supervised learning algorithm used for regression problems. It attempts to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. Then, the equation is used to make predictions on unseen data.

The assumptions of linear regression are as follows:

Linearity: The relationship between the independent and dependent variables is linear.
Independence: The observations are independent of each other.
Homoscedasticity: The variance of the errors is constant for all values of the independent variables.
Normality: The errors are normally distributed.
No multicollinearity: The independent variables are not highly correlated with each other.

Violating these assumptions can lead to biased or inefficient estimates or inaccurate predictions.
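
The fitting step described in this card can be sketched for the single-feature case. This is a minimal illustration, not the course’s implementation: the function name `fit_line` and the toy data are hypothetical, and the closed-form least-squares solution is used in place of any particular library.

```python
def fit_line(xs, ys):
    """Return (theta0, theta1) minimizing the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1


# Hypothetical toy data, roughly y = 1 + 2x with small noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

theta0, theta1 = fit_line(xs, ys)

# Residuals on the training data; inspecting them is a common informal
# check of the linearity and constant-variance assumptions.
residuals = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]

# Prediction for an unseen input.
prediction = theta0 + theta1 * 5.0
print(theta0, theta1, prediction)
```

With an intercept in the model, the least-squares residuals sum to zero, which is a quick sanity check on the fit.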

New cards

What causes “high variance” and “high bias”?

Generally, “high variance” indicates overfitting, and “high bias” is due to underfitting.

However, in exceptional cases, high variance could be caused by the statistical nature of the new dataset being tested with the trained model. For example, if the distribution of the new dataset is different from the training set, we could see a high variance. In such a case, we should retrain the model with a mixture of the two datasets. One can say that the initial training was overfitted on the initial dataset.

New cards

What is the relationship between the model’s complexity, bias, and variance?

Increasing the model’s complexity reduces the bias but increases the variance. Beyond the sweet spot, the extra complexity reduces the model’s generalization, and low generalization shows up as high variance.

New cards

True or False: Overfitting is more probable when the dataset is small.

True

New cards

True or False: Increasing the dataset increases the model’s generalization.

True

New cards

What is the primary goal of cross-validation?

To estimate how well the model generalizes to unseen data, which helps detect and avoid overfitting.

New cards

What is a “5-fold leave-one-out cross-validation?”

Set aside a part of the dataset for testing. The remaining training part is partitioned into 5 folds. In each of 5 rounds, 1 fold is left out for validation and the other 4 folds are used for training, so every fold serves as the validation set exactly once.
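
The fold bookkeeping can be sketched as follows. This is only an illustration under stated assumptions: the helper `k_fold_indices` is hypothetical, the held-out test set is assumed to have been removed already, the folds rotate so that each one is used for validation once (the usual k-fold convention), and shuffling is omitted for clarity.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Ten training samples split into 5 folds of 2 samples each.
splits = list(k_fold_indices(10, k=5))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

A model would be trained and scored once per pair, and the 5 validation scores averaged.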

New cards

What is the primary purpose of regularization in machine learning?

A) To increase the complexity of the model

B) To reduce the training error

C) To penalize a large number of coefficients and large magnitudes to avoid overfitting

D) To speed up the training process

C) To penalize a large number of coefficients and large magnitudes to avoid overfitting

New cards

Which of the following is a common regularization technique?

A) Gradient Descent

B) L1 Regularization (Lasso)

C) Principal Component Analysis (PCA)

D) K-Means Clustering

B) L1 Regularization (Lasso)

New cards

What is the key difference between L1 and L2 regularization?

A) L1 adds the absolute value of coefficients to the loss function, while L2 adds the squared value of coefficients

B) L1 adds the squared value of coefficients to the loss function, while L2 adds the absolute value of coefficients

C) L1 is used for classification, while L2 is used for regression

D) L1 is faster to compute than L2

A) L1 adds the absolute value of coefficients to the loss function, while L2 adds the squared value of coefficients

New cards

What happens to the model’s complexity as the regularization parameter (λ) increases?

A) The model becomes more complex

B) The model becomes less complex

C) The model’s complexity remains unchanged

D) The model’s complexity depends on the dataset

B) The model becomes less complex

New cards

What are some regularization methods?

A) L1 (Lasso) and L2 (Ridge) regularization, dropout, and early stopping

B) L1 (Lasso) and L2 (Ridge), mini-batch gradient descent (MBGD)

C) Stochastic Gradient Descent, batch normalization

D) Standard scaler, min-max scaler

A) L1 (Lasso) and L2 (Ridge) regularization, dropout, and early stopping

New cards

Suppose we are performing regression on N samples. Each sample has one target value (y), and a polynomial of degree M is applied to each point, x_i. We want to train a model with M parameters. The following relation is used during the training. What are A, B, C, λ, and θ?

[image]

A is the cost.
B is the loss function that calculates the difference between the predicted value and the target.
C is the L2 regularizer.
λ is the coefficient of regularization.
θ is the set of model parameters.

New cards

What is L2 or ridge regularization?

The sum of squared parameters is added to the loss function. It pushes the training process to keep the magnitudes of the model’s parameters small. Unlike L1, it shrinks parameters toward zero but rarely sets them exactly to zero.

New cards

What is L1 or lasso regularization?

The sum of the absolute values of parameters is added to the loss function.
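
The two penalties (sum of squares for L2, sum of absolute values for L1) can be sketched side by side. The function names are hypothetical. The sketch assumes a squared-error loss and penalizes every entry of `theta`; whether the intercept is penalized varies between formulations.

```python
def squared_error_loss(predictions, targets):
    """Sum of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


def l2_penalty(theta):
    """Ridge penalty: sum of squared parameters."""
    return sum(t ** 2 for t in theta)


def l1_penalty(theta):
    """Lasso penalty: sum of absolute parameter values."""
    return sum(abs(t) for t in theta)


def regularized_cost(predictions, targets, theta, lam, penalty):
    """Cost = loss + lam * penalty(theta)."""
    return squared_error_loss(predictions, targets) + lam * penalty(theta)


theta = [3.0, -4.0]          # hypothetical parameters
predictions = [1.0, 2.0]     # hypothetical model outputs
targets = [1.5, 2.0]

ridge_cost = regularized_cost(predictions, targets, theta, 0.1, l2_penalty)
lasso_cost = regularized_cost(predictions, targets, theta, 0.1, l1_penalty)
print(ridge_cost, lasso_cost)
```

For the same parameters, the L2 penalty grows faster for large values (25 versus 7 here), which is why it punishes a few large coefficients more heavily than L1 does.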

New cards

Why do we call L2 regularization “ridge” and L1 “Lasso?”

The term “ridge” refers to the fact that the penalty term acts like a “ridge” that constrains the model’s coefficients to lie within a specific range.

The term “lasso” stands for “Least Absolute Shrinkage and Selection Operator”.

New cards

What is the role of the “coefficient of regularization λ”?

Large values of λ increase the contribution of the parameters’ magnitudes to the cost. Hence, large values of λ do not allow the training process to create a complex model with many large parameters. Large values of λ could result in an overly simple model and underfitting.

On the other hand, a small regularization coefficient allows the model to become complex and could cause overfitting.

λ is a hyperparameter.
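
The shrinking effect of λ can be seen in a small sketch. It assumes a one-parameter model ŷ = θx with no intercept and an L2 penalty, for which the minimizer of Σ(y − θx)² + λθ² has the closed form θ = Σxy / (Σx² + λ). The function name and data are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of sum((y - theta*x)**2) + lam * theta**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Larger lam pulls the fitted slope away from 2 and toward 0.
slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 14.0, 1000.0)}
for lam, slope in slopes.items():
    print(lam, slope)
```

With λ = 0 the slope matches the data exactly. With a very large λ it is close to zero, which is the underfitting regime described above.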

New cards

Plot A is the original data. We are performing regression. Which of the two plots, B and C, is regression before, and which one is after regularization?

[image]

B is before regularization.
C is after regularization.

New cards

What is gradient descent?

Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. It is used to find the values of a function’s parameters (coefficients) that minimize a cost function as far as possible.

New cards

Suppose a linear regression with parameters θ₀ and θ₁ is used for prediction. We have n samples, and each sample i has one attribute x_i and a target y_i.

a) What is the residual sum of squares loss function J(θ)?

b) What is the gradient of the loss function with respect to θ₀?

c) What is the gradient of the loss function with respect to θ₁?

d) What does the gradient of the loss function with respect to θ₁ show?

a) [image]

b) [image]

c) [image]

d) The gradient with respect to the parameter θ₁ shows the relationship between this parameter and the loss function. The goal is to reduce the loss function by choosing better values for the model’s parameters. For example, if the gradient with respect to θ₁ is positive, it means that the loss function would increase if we increase this parameter. Hence, we will decrease θ₁ to force the loss function to decrease.
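
For reference, one standard way to write parts a) to c) is sketched below. It assumes the plain residual sum of squares with no 1/2 or 1/n scaling factor; the card’s images may use a scaled variant, which would only change the constant in front of each gradient.

```latex
\begin{aligned}
J(\theta) &= \sum_{i=1}^{n} \left( y_i - \theta_0 - \theta_1 x_i \right)^2 \\
\frac{\partial J}{\partial \theta_0} &= -2 \sum_{i=1}^{n} \left( y_i - \theta_0 - \theta_1 x_i \right) \\
\frac{\partial J}{\partial \theta_1} &= -2 \sum_{i=1}^{n} x_i \left( y_i - \theta_0 - \theta_1 x_i \right)
\end{aligned}
```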

New cards

What type of regression is performed by the following operation? What are x’s, θ’s, and y⁽ⁱ⁾?

[image]

This is a polynomial regression of degree d, where x⁽ⁱ⁾ is the ith data point.

θ’s are parameters of the regression model.

y⁽ⁱ⁾ is the predicted value of the ith data point. We are assuming that each data point has only one attribute.
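
A polynomial model of degree d with a single attribute can be evaluated as in the sketch below. The function name is hypothetical, and the ordering of `theta` (constant term first) is an assumption for illustration.

```python
def polynomial_predict(theta, x):
    """Evaluate theta[0] + theta[1]*x + ... + theta[d]*x**d for one data point."""
    return sum(t * x ** j for j, t in enumerate(theta))


# Degree-2 example: 1 + 2x + 3x^2 evaluated at x = 2.
theta = [1.0, 2.0, 3.0]
y_hat = polynomial_predict(theta, 2.0)
print(y_hat)  # 17.0
```

The model is still linear in the parameters θ, which is why the same least-squares and gradient descent machinery applies.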

New cards

What is the following expression? What are h_θ(x⁽ⁱ⁾), and y⁽ⁱ⁾? What is the goal of modeling when this expression is calculated?

[image]

It is the cost function for a prediction model (e.g., regression).

h_θ(x⁽ⁱ⁾) is the predicted value and y⁽ⁱ⁾ is the target value.

The goal of modeling is to find a set of modeling parameters (θ) to minimize J(θ).

New cards

What would the gradient descent operation do with the following cost function?

[image]

The goal of modeling is to find a set of modeling parameters (θ_i) to minimize J(θ). For a random value of θ_i, gradient descent gives us the direction of modifying θ_i to get closer to the minimum value of J. Hence, we calculate the gradient of the loss function, [image], which is the slope of the cost as a function of θ_i.

If this slope is positive, we decrease the value of θ_i. If the slope is negative, we increase θ_i. If the slope is zero, θ_i is at a (local) minimum of the loss J(θ) and we have reached our goal.

New cards

What is the purpose of the following expression?

[image]

This is the cost function for a linear regression prediction with two parameters of θ₀ and θ₁. We want to find θ₀ and θ₁ to minimize this function. The function shows the average differences between predicted and actual values.

New cards

What is the derivative of the following expression with respect to θ₀?

[image]

New cards

What is the derivative of the following expression with respect to θ₁?

[image]

New cards

Considering the following cost function, how do we update θ₀ using gradient descent?

[image]

New cards

What is the purpose of the following expression, and what are A, B, C, and D?

[image]

This expression iteratively updates the parameter θ₀ using gradient descent.

A is the next value of θ₀
B is the current value of θ₀
C is the learning rate
D is the gradient of the cost function with respect to θ₀

New cards

What is the role of the learning rate in the iterative gradient descent operation?

The learning rate is a hyperparameter. It shows how much we should change a model parameter in response to the estimated error at each iteration.

New cards

a) What are the consequences of choosing large or small learning rates (fixed values)? b) What is an adaptive optimization method?

a) Large learning rates could result in fast convergence to the minimum cost. However, a learning rate that is too large can overshoot the minimum and cause divergence. Small learning rates slow down the convergence process.

b) An adaptive optimization method uses gradient descent, but the learning rate value is not fixed. When the magnitude of the gradient becomes small, the value of the learning rate is adaptively reduced.
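
The effect of a fixed learning rate can be seen on a toy cost. The sketch assumes J(θ) = θ², whose gradient is 2θ and whose minimum is at θ = 0. The function names, starting point, and the three learning rates are hypothetical choices for illustration.

```python
def gradient_descent(grad, theta, learning_rate, steps):
    """Repeat the update theta <- theta - learning_rate * grad(theta)."""
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta


def grad(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2.0 * theta


start = 5.0
small = gradient_descent(grad, start, 0.01, 50)      # slow progress
moderate = gradient_descent(grad, start, 0.1, 50)    # close to the minimum
too_large = gradient_descent(grad, start, 1.1, 50)   # overshoots and diverges
print(small, moderate, too_large)
```

On this cost each step multiplies θ by (1 − 2 × learning rate), so any rate above 1.0 makes the magnitude of θ grow instead of shrink.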

New cards

What is stochastic gradient descent (SGD)?

To train the model, we randomly select a data point. The gradient of the loss is calculated for just this data, and model parameters are adjusted based on the computed gradient. This process is repeated for all data items in the dataset.

New cards

What is batch gradient descent (BGD)?

The average loss gradient for all data points is used to modify the model parameters.
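
One batch update for a two-parameter linear model can be sketched as follows. This is a minimal sketch assuming a squared-error loss averaged over all points; the function name, data, learning rate, and iteration count are hypothetical.

```python
def batch_gradient_step(theta0, theta1, xs, ys, learning_rate):
    """One update using the loss gradient averaged over every data point."""
    n = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # Average gradient of (prediction - target)**2 w.r.t. each parameter.
    grad0 = 2.0 * sum(errors) / n
    grad1 = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
    return theta0 - learning_rate * grad0, theta1 - learning_rate * grad1


# Hypothetical noise-free data on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = 0.0, 0.0
for _ in range(2000):
    theta0, theta1 = batch_gradient_step(theta0, theta1, xs, ys, 0.05)
print(theta0, theta1)
```

Every update touches the whole dataset, which gives a smooth descent but makes each step expensive when the dataset is large.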

New cards

What is mini-batch gradient descent?

An optimization algorithm for training a model. Mini-batch is a variation of the stochastic gradient descent (SGD) algorithm and updates model parameters based on a small subset of the training data called a mini-batch rather than a single example at a time. The mini-batch is typically chosen to be a small random subset of the entire training data, with a size greater than one and smaller than the total number of examples.

The algorithm cyclically processes each mini-batch, which means that all mini-batches are processed one after another until the model has seen all training examples. During each iteration, the algorithm computes the average gradient of the loss function with respect to the model parameters over the mini-batch. This gradient is then used to update the model parameters in the direction that minimizes the loss function.

Unlike SGD, mini-batch gradient descent can lead to more stable updates, resulting in faster convergence and better generalization performance. However, it requires more memory to store the mini-batches and can be slower than SGD if the mini-batch size is too large.
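
The mini-batch loop can be sketched as below, assuming a two-parameter linear model with squared-error loss. The function name, the data, the batch size of 2, and the other settings are hypothetical, and the data is reshuffled once per pass as one common way to form random mini-batches.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate, epochs, seed=0):
    """Mini-batch gradient descent for y_hat = theta0 + theta1 * x."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random mini-batches on every pass
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [theta0 + theta1 * xs[i] - ys[i] for i in batch]
            # Average gradient of the squared error over this mini-batch only.
            grad0 = 2.0 * sum(errors) / len(batch)
            grad1 = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            theta0 -= learning_rate * grad0
            theta1 -= learning_rate * grad1
    return theta0, theta1


# Hypothetical noise-free data on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]

theta0, theta1 = minibatch_gd(xs, ys, batch_size=2, learning_rate=0.02, epochs=2000)
print(theta0, theta1)
```

Setting `batch_size=1` gives one-example stochastic updates, and setting it to `len(xs)` gives batch gradient descent, so the batch size interpolates between the two.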

New cards

Let us assume we have the following 6 data points:

[image]

We want to use regression by applying it to the data’s attributes and predicting the target value. The model is ŷ = θ₀ + θ₁x₁ + θ₂x₂. Let’s randomly choose the initial value of the parameters to be:

θ₀ = 0, θ₁ = -0.017, θ₂ = -0.048

With a learning rate of 0.05, stochastic gradient descent finds the new set of parameters using the first data point (4,1).

[image]

This is stochastic gradient descent. We randomly choose the initial values of the θ parameters. Then, we use the gradient of the loss function for the data point of (x₁ = 4, x₂ = 1). Using the initial parameters, we predict the target value as -0.116, while it is 2. Hence, the loss is (-0.116 - 2)².

After the first run, we find out that θ₀ changed from zero to 0.212. We do the same thing for θ₁ and θ₂. In this example, we are not showing the complete gradient descent process. The process continues, and we perform the same procedure at the following data point. We continue until we get very low errors, meaning the parameters are trained.
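
The single update in this card can be reproduced with a short sketch. It assumes the per-point loss is (ŷ − y)² with no 1/2 factor, which is the choice that reproduces the stated change of θ₀ from 0 to about 0.212. The function name is hypothetical, and the updated θ₁ and θ₂ values are computed by the sketch, not quoted from the card.

```python
def sgd_step(theta, x, y, learning_rate):
    """One stochastic update for y_hat = theta0 + theta1*x1 + theta2*x2."""
    features = [1.0] + list(x)  # the leading 1.0 pairs with theta0
    y_hat = sum(t * f for t, f in zip(theta, features))
    error = y_hat - y
    # The derivative of (y_hat - y)**2 w.r.t. theta_j is 2 * error * feature_j.
    new_theta = [t - learning_rate * 2.0 * error * f
                 for t, f in zip(theta, features)]
    return y_hat, new_theta


initial_theta = [0.0, -0.017, -0.048]
y_hat, theta = sgd_step(initial_theta, x=(4.0, 1.0), y=2.0, learning_rate=0.05)

print(y_hat)   # approximately -0.116
print(theta)   # approximately [0.2116, 0.8294, 0.1636]
```

Calling `sgd_step` again with the returned `theta` and the next data point continues the process described above.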

New cards

In what fashion do we build a decision tree?

Top-down

New cards

What are the different parts of a decision tree?

Root node, branches, internal node (decision node), and leaves.

New cards

How do we use entropy to measure the performance of a decision node?

The entropy of each split branch should be as low as possible.

New cards

What is entropy gain, and when is it used?

Entropy gain (information gain) is the difference between a decision node’s entropy and the weighted average entropy of its branches, where each branch is weighted by the fraction of samples it receives. We split on the attribute that results in the largest entropy gain. A large entropy gain means that the chosen attribute was distinctive and separated the data well into different classes.
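
The calculation can be sketched as follows. Function names and the tiny dataset are hypothetical. Branch entropies are weighted by the fraction of samples in each branch, which is the usual convention and is treated here as an assumption.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(labels, feature_values):
    """Parent entropy minus the weighted average entropy of the branches."""
    n = len(labels)
    branches = {}
    for value, label in zip(feature_values, labels):
        branches.setdefault(value, []).append(label)
    weighted = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - weighted


# Hypothetical toy data: four samples with two candidate features.
target = ["yes", "yes", "no", "no"]
feature_a = ["p", "p", "q", "q"]   # separates the classes perfectly
feature_b = ["p", "q", "p", "q"]   # tells us nothing about the class

gain_a = information_gain(target, feature_a)
gain_b = information_gain(target, feature_b)
print(entropy(target), gain_a, gain_b)
```

Here `feature_a` would be chosen for the split because it removes all of the parent node’s entropy, while `feature_b` removes none.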

New cards

Consider the following dataset and answer the following questions.

[image]

a) Using log2, calculate the entropy of the target, i.e., H(play).

b) Calculate H(play|outlook)

c) Calculate H(play|wind)

d) Which feature should be used first? Why?

a) [image]

b) [image]

c) [image]

d) Since the conditional entropy H(play|wind) is higher than H(play|outlook), outlook gives the larger entropy gain, so we split on outlook first.

New cards

We will use the dataset below to learn a decision tree.

[image]

a) Calculate the entropy of Y using log2, i.e., H(Y)=?. Notice that log₂3 ≈ 1.6

b) What is H(Y|x1)?

c) What is H(Y|x2)?

d) Which feature should be used first?

[image]

New cards

A decision tree is a:

a) Linear model

b) Non-linear model

c) Clustering algorithm

d) Dimensionality reduction technique

b) Non-linear model

New cards

The root node of a decision tree represents:

a) The class labels

b) The most important feature

c) The least important feature

d) A random feature

b) The most important feature

New cards

Decision trees are prone to:

a) Underfitting

b) Overfitting

c) Both underfitting and overfitting

d) Neither underfitting nor overfitting

b) Overfitting

New cards

Entropy is a measure of:

a) Information gain

b) Impurity or disorder

c) Gini index

d) Variance

b) Impurity or disorder

New cards

Information gain is calculated as:

a) Entropy(parent) - Entropy(children)

b) Entropy(children) - Entropy(parent)

c) Entropy(parent) + Entropy(children)

d) Entropy(parent) * Entropy(children)

a) Entropy(parent) - Entropy(children)

New cards

Decision trees can only handle categorical features. (True/False)

False

New cards

A high entropy value indicates high purity in the data. (True/False)

False

New cards

Information gain is used to select the best feature for splitting at each node. (True/False)

True

New cards

Pruning a decision tree helps to prevent overfitting. (True/False)

True

New cards

Decision trees are sensitive to outliers in the data. (True/False)

False
