CAP4612 - Exam 2

142 Terms

1
New cards
Regularization
constraining a model to make it simpler and reduce the risk of overfitting
2
New cards
regularization
If you set the _____________ hyperparameter to a very large value, you will get an almost flat model (a slope close to zero)
3
New cards
overfit
If you set the regularization hyperparameter to a very large value, the learning algorithm will almost certainly not _______ the training data, but it will be less likely to find a good solution.
4
New cards
b.) If lambda is very large, it will make the parameters θ1 to θn close to zero
In linear regression regularization, the regularization term lambda has the following effect on the hypothesis function:

a.) If lambda is very small, it will make the parameters θ1 to θn close to zero

b.) If lambda is very large, it will make the parameters θ1 to θn close to zero

c.) If lambda is very large, it will make the parameters θ1 to θn very large too

d.) If lambda is close to zero, it will make the parameters θ1 to θn close to zero too
5
New cards
polynomial degrees
A simple way to regularize a polynomial model is to reduce the number of ___________ ________ (complexity).
6
New cards
regularization
For linear models, ____________ is typically achieved by constraining the weights of the model.
7
New cards
d.) softmax
Which of the following is not a way of constraining weights for regularization?

a.) ridge regression

b.) lasso regression

c.) elastic net

d.) softmax
8
New cards
d.) All of the above
Which of the following is true for constraining weights for regularization?

a.) The regularization term is added to the cost function during training. Once the model is trained, use the unregularized performance measure to evaluate it

b.) The hyperparameter α controls how much you want to regularize the model. If α is very high, then all weights end up very close to zero and the result is a flat line going through the data’s mean.

c.) Lasso Regression tends to eliminate the weights (set to zero) of the least important features.

d.) All of the above
9
New cards
d.) All of the above
Which of the following is correct for regularization of linear regression?

a.) We should avoid plain linear regression

b.) Ridge regression is a good default

c.) We should use Lasso or Elastic Net if we expect that only a few features are actually useful

d.) All of the above
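
A minimal sketch of the three weight-constraining regularizers from the cards above, assuming scikit-learn; the alpha values are illustrative, not tuned:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = 4 * X[:, 0] + 0.1 * rng.randn(100)  # only the first feature is useful

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))
# Lasso and Elastic Net tend to drive the weights of the useless
# features to zero, which is why they suit sparse problems
```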
10
New cards
decrease
Using Early Stopping for regularization to avoid overfitting:

We identify and stop at the point where the error for the validation set stops decreasing and starts increasing, while the error for the training set continues to ______________.
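
A minimal early-stopping sketch, assuming an incrementally trained SGDRegressor; we snapshot the model at the epoch with the lowest validation error (data and names are illustrative):

```python
from copy import deepcopy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = 3 * X[:, 0] + 0.1 * rng.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, random_state=0)
best_val_error, best_model = float("inf"), None
for epoch in range(500):
    sgd_reg.fit(X_train, y_train)  # warm_start=True: continues where it left off
    val_error = np.mean((sgd_reg.predict(X_val) - y_val) ** 2)
    if val_error < best_val_error:
        best_val_error, best_model = val_error, deepcopy(sgd_reg)
# best_model is the snapshot taken just before the validation error
# stopped decreasing and started to rise
```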
11
New cards
around epoch 300
A learning curve graph is given below. When should we stop training to avoid overfitting?

(hint: where validation set starts to increase)
12
New cards
a
Which of the following corresponds to the logistic regression graph?
13
New cards
c.) Classification
The logistic regression approach is used for:

a.) Regression

b.) Clustering

c.) Classification

d.) Data segmentation
14
New cards
e.) All of the above
In the cost function for logistic regression, h(x) is the prediction and y is the actual value.

Which of the following is true?

a.) The cost of predicting 1 when y=0 is high

b.) The cost of predicting 1 when y=1 is low

c.) The cost of predicting 0 when y=0 is low

d.) The cost of predicting 0 when y=1 is high

e.) All of the above
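
The four cases above follow from the standard log-loss, cost = -y*log(h(x)) - (1-y)*log(1-h(x)); a small illustrative check:

```python
import numpy as np

def log_loss(h, y):
    # cost of predicting probability h when the actual label is y
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

for h, y in [(0.99, 0), (0.99, 1), (0.01, 0), (0.01, 1)]:
    print(f"h(x)={h}, y={y} -> cost={log_loss(h, y):.2f}")
# confident wrong predictions (h~1 when y=0, or h~0 when y=1) cost ~4.6;
# confident correct predictions cost ~0.01
```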
15
New cards
true
(T/F) If a model performs great on the training data but generalizes poorly to new instances, the model is likely overfitting the training data.
16
New cards
a.) underfit the training data, overfit the training data
Excessively simple models __________________ while excessively complex models ________________.

a.) underfit the training data, overfit the training data

b.) overfit the training data, underfit the training data

c.) underfit the training data, optimally fit the training data

d.) optimally fit the training data, underfit the training data
17
New cards
a.) two Logistic Regression classifiers
To classify pictures as outdoor or indoor and daytime or nighttime, we may train:

a.) two Logistic Regression classifiers

b.) two Linear Regression classifiers

c.) four Logistic Regression classifiers

d.) four Linear Regression classifiers
18
New cards
softmax regression
the generalization of logistic regression to support multiple classes directly without having to train and combine multiple binary classifiers
19
New cards
softmax regression
predicts only one class at a time, so it should be used only with mutually exclusive classes (such as different types of plants)
20
New cards
true
(T/F) Softmax regression is multiclass but not multioutput
21
New cards
d.) Softmax is a multioutput classifier
Which of the following is not correct for softmax regression?

a.) It is the generalization of Logistic Regression

b.) Softmax regression does not train and combine multiple binary classifiers

c.) Softmax regression should only be used for mutually exclusive classes

d.) Softmax is a multioutput classifier
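
A minimal softmax-regression sketch, assuming scikit-learn's LogisticRegression, which handles the multiclass (multinomial) case directly in recent versions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
softmax_reg = LogisticRegression(max_iter=1000)  # multinomial for multiclass y
softmax_reg.fit(X, y)
print(softmax_reg.predict(X[:1]))        # exactly one class per instance
print(softmax_reg.predict_proba(X[:1]))  # one probability per class, summing to 1
```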
22
New cards
a.) Continuous values, classes
Linear regression predicts ______________ while logistic regression predicts ______________

a.) Continuous values, classes

b.) Classes, continuous values

c.) Classes, Values close to the mean

d.) Classes, Outlier values away from the mean
23
New cards
d.) All of the above
Which of the following can be done by a Support Vector Machine (SVM) learning model?

a.) Classification

b.) Regression

c.) Outlier detection

d.) All of the above
24
New cards
true
(T/F) The idea of SVM is to create a line or a hyperplane that separates the data into classes.
25
New cards
margin
the distance between the closest examples of 2 classes to the decision boundary
26
New cards
hyperplane
in an n-dimensional Euclidean space, a flat, (n-1)-dimensional subset of that space that divides the space into 2 disconnected parts
27
New cards
support vectors
the samples on the margin
28
New cards
a.) Large margin
SVM Classification is also called _____________ classification.

a.) Large margin

b.) Street

c.) Hyperplane

d.) None of the options listed
29
New cards
hard margin
During SVM classification, if we strictly impose that all instances must be off the street and on the right side, this is called _____ _______ classification.
30
New cards
a and b
Which of the following are true for hard margin classification?

a.) Only works if the data is linearly separable.

b.) Sensitive to outliers and it will probably not generalize as well.

c.) Tries to find a good balance between keeping the street as large as possible and limiting the margin violations.

d.) All of the above
31
New cards
Need to use scaled data for SVM, because otherwise it will tend to neglect features with small scales
We use SVM learning models to solve a ML problem. We see scaled and not scaled data in the figures. Do you recommend using scaled or unscaled data and why?
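
A minimal sketch of that recommendation, assuming scikit-learn: wrap the SVM in a pipeline that scales the features first, so no feature dominates the margin:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
# StandardScaler runs before the SVM at both fit and predict time
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, random_state=42))
svm_clf.fit(X, y)
```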
32
New cards
Doesn’t matter, decision trees don’t need data to be scaled
If we are using a decision tree for this data, should we use scaled or unscaled?
33
New cards
b, c, d
Many datasets are not even close to being linearly separable. Linear SVM would not perform well on these datasets. We need to transform the original space to a higher dimensional space to improve the performance. Certain values can be passed to Scikit Learn Support Vector Machine Classifier (SVC)’s ‘kernel’ parameter to transform the original space to a higher dimension.

Which of the following values can be used for that?

a.) linear

b.) poly

c.) rbf

d.) sigmoid
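
A minimal sketch passing each of those values to SVC's kernel parameter; other hyperparameters are left at illustrative defaults:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.score(X, y))
# the nonlinear kernels (poly, rbf) separate the interleaved moons
# far better than the plain linear kernel
```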
34
New cards
kernel trick
Using a function to transform the original space into a higher dimensional space during the cost function optimization is called ________ _______.
35
New cards
true
(T/F) The SVM is a different type of algorithm as it picks the extreme case which is close to the boundary and uses that to construct its analysis.
36
New cards
rbf
Name the kernel parameter used for the following SVM regression model.
37
New cards
linear
Name the kernel parameter used for the following SVM regression model.
38
New cards
poly
Name the kernel parameter used for the following SVM regression model.
39
New cards
sigmoid
Name the kernel parameter used for the following SVM regression model.
40
New cards
samples
counts how many training instances a decision tree node applies to
41
New cards
value
defines how many training instances of each class the node applies to
42
New cards
gini
measures the impurity of a node
43
New cards
a.) decision trees
Which of the following is a whitebox model?

a.) decision trees

b.) random forests

c.) neural networks

d.) all of the above
44
New cards
c.) predictions are hard to explain
Which of the following is not true for whitebox models?

a.) intuitive models

b.) decisions are easy to interpret

c.) predictions are hard to explain

d.) none of the above
45
New cards
d.) All of the above
Which of the following is true for the CART algorithm used for decision trees?

a.) Searches for an optimum split at the top level, then repeats the process at each subsequent level.

b.) It is a greedy algorithm. It does not check whether or not the split will lead to the lowest possible impurity several levels down.

c.) Produces a solution that’s reasonably good but not guaranteed to be optimal.

d.) All of the above
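
A minimal sketch, assuming scikit-learn's CART-based DecisionTreeClassifier; export_text shows the greedy, level-by-level splits, and plot_tree would also display the samples, value, and gini attributes from the earlier cards:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
print(export_text(tree_clf))  # one split per level, chosen greedily
```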
46
New cards
prediction
The _____________ complexity of a decision tree is O( log2m ) because there is one comparison at each level for log2m levels.
47
New cards
training
The __________ complexity for a decision tree using the CART algorithm is O( n x m x log2m ) because there are n features, m samples, and log2m levels. The algorithm compares all features on all samples at each level.
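
As an illustrative calculation: with m = 1,000,000 training instances, log2(m) ≈ 20, so a single prediction needs only about 20 comparisons, while training compares all n features of all m samples at each of those ~20 levels.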
48
New cards
a and b
Which of the following are algorithms used to form decision trees?

a.) ID3

b.) CART

c.) Entropy

d.) Gini
49
New cards
regularization
Having _______________ for decision trees is important because decision trees make few assumptions about the training data. If left unconstrained, the tree structure will adapt itself to the training data and most likely overfit it.
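
A minimal sketch of such constraints, assuming scikit-learn's usual regularization hyperparameters (the values are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(
    max_depth=4,          # limit the number of levels
    min_samples_leaf=5,   # each leaf must cover at least 5 instances
    max_leaf_nodes=20,    # cap the total number of leaves
)
# without such constraints the tree keeps splitting until it fits the
# training data almost perfectly, i.e. it overfits
```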
50
New cards
d.) When used for regression, decision trees are not prone to overfitting.
Which of the following is not true for decision tree regression?

a.) Instead of predicting a class in each node, it predicts a value.

b.) The predicted value for each region is always the average target value of the instances in that region.

c.) The algorithm splits each region in a way that makes most training instances as close as possible to that predicted value.

d.) When used for regression, decision trees are not prone to overfitting.
51
New cards
ensemble
In _________ learning, we aggregate the predictions of a group of predictors (such as classifiers or regressors), and we often get better predictions than with the best individual predictor.
52
New cards
weak, strong
Even if each classifier used in an ensemble learning model is a ____ learner (slightly better than random guessing), the ensemble can still be a _____ learner (achieving high accuracy).
53
New cards
a and b
For an ensemble learning model to work, which of the following need to be satisfied for learners?

a.) There are a sufficient number of weak learners

b.) They are sufficiently diverse

c.) The learners must be random

d.) Decision trees must be used

e.) Random forest learner must be used
54
New cards
training set
A way to diversify learners in an ensemble learning model is to use different training algorithms with the same _____________ ____
55
New cards
random subsets
A way to diversify learners in an ensemble learning model is to use the same training algorithm for every predictor, but train on different __________ ________ of the training set.
56
New cards
features
A way to diversify learners in an ensemble learning model is to use different random subsets of ________
57
New cards
true
(T/F) The benefit of diversifying learners is that they will make very different types of errors, improving the ensemble’s accuracy.
58
New cards
hard voting
* a type of classifier for ensemble learning
* aggregate the predictions of each classifier and predict the class that gets the most votes
* majority-vote classifier
59
New cards
soft voting
* a type of classifier for ensemble learning
* predict the class with the highest class probability, averaged over all the individual classifiers
60
New cards
d.) All of the above
Which of the following(s) is true for soft voting?

a.) Soft voting classifier achieves better accuracy than hard voting

b.) It gives more weight to highly confident votes

c.) All classifiers used from the Scikit-Learn library must have a predict_proba() method to be able to use soft voting

d.) All of the above
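
A minimal hard/soft voting sketch with three diverse classifiers, assuming scikit-learn; note SVC needs probability=True to expose the predict_proba() required for soft voting:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),  # enables predict_proba
    ],
    voting="soft",  # "hard" would use majority voting instead
)
voting_clf.fit(X, y)
```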
61
New cards
bagging
* AKA bootstrap aggregating
* process where sampling is performed with replacement
* same training instances can be sampled several times for the same predictor
62
New cards
pasting
process where sampling is performed **without** replacement
63
New cards
bagging and pasting
in both processes, same training instances can be sampled several times across multiple predictors
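
A minimal sketch: in scikit-learn the same BaggingClassifier covers both methods, with bootstrap=True for bagging (sampling with replacement) and bootstrap=False for pasting:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True,   # False would give pasting instead
    n_jobs=-1,        # predictors can be trained in parallel
    random_state=42,
)
```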
64
New cards
similar, lower
When bagging is compared to a single predictor trained on the original training set, the ensemble has a ______ bias and a _______ variance.
65
New cards
bottom right
Which one has the lowest bias and highest variance?

(high variance: spread apart, high bias: further from center)
66
New cards
underfitting
This model represents what issue? (low variance and high bias)
67
New cards
overfitting
This model represents what issue? (high variance and low bias)
68
New cards
high, high
This model represents ___ variance and _____ bias
69
New cards
low, low
This model represents ___ variance and _____ bias
70
New cards
right
Which figure has lower variance?
71
New cards
true
(T/F) Bagging and pasting methods enable parallel training and prediction. Training can be done in parallel on different CPU cores or servers. Predictions can be made in parallel.
72
New cards
patches
It’s called random ________ when we sample both training instances and features.
73
New cards
subspaces
It’s called random ___________ when we keep all training instances but sample features.
74
New cards
decision trees
Random forest is an ensemble of _______ ______, generally trained via bagging method.
75
New cards
bagging
Random forest is an ensemble of decision trees, generally trained via __________ method.
76
New cards
true
(T/F) The Random Forest algorithm introduces extra randomness when growing trees. Instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features
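
A minimal random-forest sketch, assuming scikit-learn; max_features controls the random subset of features considered at each split, which is the extra randomness described above:

```python
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(
    n_estimators=500,      # number of bagged decision trees
    max_features="sqrt",   # random subset of features per split
    n_jobs=-1, random_state=42,
)
```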
77
New cards
random patches
subset of features is selected globally once and for all, prior to the construction of the tree
78
New cards
random forest
subsets of features are drawn locally at each node
79
New cards
random forest
* uses bootstrap replicas
* chooses optimum split
80
New cards
extra trees
* uses whole original sample by default
* has optional parameter allowing users to bootstrap replicas
* chooses split randomly
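
A minimal sketch contrasting the two cards above, assuming scikit-learn:

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# random forest: bootstrap replicas, optimum split within a random feature subset
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)

# extra trees: whole original sample by default (bootstrap=False),
# split thresholds chosen randomly rather than searched for
extra_clf = ExtraTreesClassifier(n_estimators=500, bootstrap=False,
                                 n_jobs=-1, random_state=42)
```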
81
New cards
d.) All of the above
Which of the following(s) are true for boosting?

a.) It is another ensemble method that combines several weak learners into a strong learner

b.) Trains predictors sequentially

c.) Each predictor attempts to correct its predecessor

d.) All of the above
82
New cards
gradient descent
The sequential learning technique has some similarities with ______ ______, except that instead of tweaking a single predictor’s parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.
83
New cards
AdaBoost
The sequential learning technique has some similarities with Gradient Descent, except that instead of tweaking a single predictor’s parameters to minimize a cost function, _________ adds predictors to the ensemble, gradually making it better.
84
New cards
d.) All of the above
The figure below shows an ensemble learning system. Which of these could be true for this figure?

a.) It uses boosting

b.) Trains classifiers sequentially

c.) It cannot be scaled

d.) All of the above
85
New cards
gradient boost
attempts to fit the new predictor to the residual errors made by the previous predictor
86
New cards
AdaBoost
tweaks the instance weights and adds a new predictor at every iteration
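
A minimal AdaBoost sketch, assuming scikit-learn; each new stump concentrates on the instances its predecessors got wrong, via the instance reweighting described above:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # a weak learner (decision stump)
    n_estimators=200, learning_rate=0.5, random_state=42,
)
```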
87
New cards
y_predict = svr1.predict(X_new) + svr2.predict(X_new) + svr3.predict(X_new)
How do you find the final prediction using the gradient boosting ensemble developed for the previous question when you receive a new data set given in X_new?
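
A minimal sketch of the scheme that answer assumes: three SVR predictors (svr1 to svr3), each fit to the residual errors left by the ones before it, so the final prediction is their sum (the data here is illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(42)
X = rng.rand(100, 1)
y = 3 * X[:, 0] ** 2 + 0.05 * rng.randn(100)

svr1 = SVR(kernel="rbf").fit(X, y)
svr2 = SVR(kernel="rbf").fit(X, y - svr1.predict(X))  # residuals of svr1
svr3 = SVR(kernel="rbf").fit(X, y - svr1.predict(X) - svr2.predict(X))

X_new = rng.rand(5, 1)
y_predict = svr1.predict(X_new) + svr2.predict(X_new) + svr3.predict(X_new)
```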
88
New cards
2 stage training and early stopping
2 methods to find the optimal number of trees when using Gradient Boosting
89
New cards
2 stage training method
* trains a large number of trees
* measures validation error at each stage in training
* selects tree size with minimum validation error
* trains a new model with the optimal tree size found
90
New cards
early stopping method
* implements incremental learning
* measures validation error for every tree
* stops adding trees when the validation error increases 5 times in a row
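
A minimal sketch of this incremental method, assuming scikit-learn's GradientBoostingRegressor with warm_start=True; the 5-in-a-row stopping rule is the one from the card, and the data is illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(200, 1)
y = 3 * X[:, 0] ** 2 + 0.05 * rng.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)
min_val_error, error_going_up = float("inf"), 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)  # warm_start keeps the trees already grown
    val_error = np.mean((gbrt.predict(X_val) - y_val) ** 2)
    if val_error < min_val_error:
        min_val_error, error_going_up = val_error, 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # stop adding trees: error rose 5 times in a row
```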
91
New cards
around 55
According to the figure, how many decision trees would you use in gradient boost? (hint: minimum validation error)
92
New cards
aggregate
The main idea behind the stacking approach in ensemble learning is to train a model to _____________ the predictions of all predictors in an ensemble.
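
A minimal stacking sketch, assuming scikit-learn's StackingClassifier; the final estimator is the model trained to aggregate the base predictors' outputs:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
stacking_clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC(random_state=42))],
    final_estimator=LogisticRegression(),  # the aggregating model (blender)
)
stacking_clf.fit(X, y)
```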
93
New cards
clustering and anomaly detection
Name 2 main tasks that are achieved using unsupervised learning
94
New cards
clustering
the goal is to group similar instances together into clusters
95
New cards
anomaly detection
learn what “normal” data looks like and use that to detect abnormal instances
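
A minimal sketch of both tasks, assuming k-means: cluster the data, then flag instances far from every centroid as anomalies (the 99th-percentile threshold is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

dist_to_closest = kmeans.transform(X).min(axis=1)  # distance to nearest centroid
anomalies = X[dist_to_closest > np.percentile(dist_to_closest, 99)]
```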
96
New cards
clustering
applications of _____________:

* customer segmentation
* data analysis
* dimensionality reduction
* anomaly/outlier detection
* semi-supervised learning
* image segmentation
* search engines
97
New cards
customer segmentation
useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment
98
New cards
data analysis
analyzing each cluster separately might give further insights
99
New cards
dimensionality reduction
once a dataset has been clustered, it’s usually possible to measure each instance’s affinity with each cluster
100
New cards
anomaly detection
* AKA outlier detection
* any instance that has a low affinity to all the clusters is likely to be an anomaly