Data Mining & Analytics Review

18 Terms

1

Which of the following would be the least effective way to represent a color (e.g., "Pink") in a dataset used in a predictive modeling task?

As an ordinal value based on its rank in an alphanumeric sorting of all colors

2

Consider a dataset with the following structure:

City      State  Date          Temperature
Berkeley  CA     Jan 25, 2018  11

Assume we want to transform this dataset into one with only the features State, Month, and Temperature, where State is represented by the longitude and latitude of the state's capital, Month by a one-hot encoding, and Temperature is left as a numeric value. How many total features (columns) would this dataset have?

(hint: longitude and latitude count as two features)

15
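
The count can be checked with a minimal sketch that builds one transformed row. The helper name `encode_row` and the approximate coordinates used for Sacramento are illustrative assumptions, not part of the original card.

```python
# Hypothetical helper: encode one row as latitude, longitude,
# a 12-way one-hot month vector, and the numeric temperature.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def encode_row(capital_lat, capital_lon, month, temperature):
    one_hot = [1 if m == month else 0 for m in MONTHS]
    return [capital_lat, capital_lon, *one_hot, temperature]

# Approximate coordinates for Sacramento, used only for illustration.
row = encode_row(38.58, -121.49, "Jan", 11)
print(len(row))  # 15 = 2 (lat/lon) + 12 (months) + 1 (temperature)
```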

3

Sum of Squares Error (SSE) can be used with K-means clustering to:

(check all that apply)

K = number of clusters

n = number of data points being clustered

Choose a value of K based on the heuristic of the "elbow" method

Choose between different clusterings (for a fixed K) produced by starting with different random K-means centroids
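
As a rough illustration of the second use, the sketch below computes SSE for two K = 2 clusterings of the same four points and prefers the one with the lower value. The data and the function name `sse` are made up for this example.

```python
def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Two converged clusterings that different random starts could produce.
good = sse(points, [(0.0, 1.0), (10.0, 1.0)], [0, 0, 1, 1])
bad = sse(points, [(5.0, 0.0), (5.0, 2.0)], [0, 1, 0, 1])

print(good, bad)  # 4.0 100.0
```

For the elbow method, the same function would be evaluated across several values of K and plotted, looking for the point where the decrease in SSE levels off.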

4

What is the range of the silhouette score?

[-1,1]

5

How could a data point have a silhouette coefficient of 0?

If the data point is as close to points in its cluster as it is to points in the nearest cluster (not including its own)
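
A minimal sketch of the silhouette coefficient for a single point, assuming Euclidean distance and the usual definition s = (b - a) / max(a, b), where a is the mean distance to the other points in the point's own cluster and b is the mean distance to the points of the nearest other cluster. The function names are illustrative.

```python
import math

def mean_distance(point, others):
    return sum(math.dist(point, o) for o in others) / len(others)

def silhouette(point, own_cluster, other_clusters):
    """own_cluster holds the other members of the point's cluster."""
    a = mean_distance(point, own_cluster)
    b = min(mean_distance(point, c) for c in other_clusters)
    return (b - a) / max(a, b)

# The point is exactly as far from its own cluster as from the nearest other one.
print(silhouette((0.0, 0.0), [(-1.0, 0.0)], [[(1.0, 0.0)]]))  # 0.0
```

Because b - a is divided by the larger of the two, the result always falls in [-1, 1].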

6

How many different assignments of data points to clusters are there given n data points and K clusters? Assume a data point can only belong to a single cluster.

K^n
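
For small inputs the count can be verified by brute force. The sketch below treats clusters as labeled and allows empty clusters, which is the assumption under which the answer is K^n.

```python
from itertools import product

def count_assignments(n, k):
    """Enumerate every way to give each of n points one of k cluster labels."""
    return sum(1 for _ in product(range(k), repeat=n))

print(count_assignments(3, 2))  # 8, which equals 2**3
```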

7

The plot below depicts data points for a dataset of 10 credit card seeking individuals, 6 of whom are considered to be a high credit risk and 4 of whom are considered to be a low credit risk.

[Figure: quiz_image1.png, the credit risk plot (not reproduced here)]

What is the starting Gini impurity (index) of this dataset given credit risk as the target?

[Reminder] Gini impurity (index) formula: $\text{Gini} = 1 - \sum_{i} p_i^2$, where $p_i$ is the fraction of data points belonging to class $i$.

0.48
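
The arithmetic is 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48. A short sketch of the same calculation, with `gini_impurity` as an illustrative function name:

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

labels = ["high"] * 6 + ["low"] * 4
print(round(gini_impurity(labels), 2))  # 0.48
```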

8

If there were equal numbers of low and high credit risk individuals, what would the Gini impurity of the dataset be before any splits?

0.5

9

If you were creating a decision tree based on this dataset using the C4.5 or CART algorithm, the first step would be to choose an attribute and split point that best partitioned the data points by the target value. 

According to the credit risk plot, which attribute and split point would be the best choice among the following options?

Age with a split point of 35

10

Given enough depth (splits), a decision tree can successfully classify any training dataset with 100% accuracy.

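False (two data points with identical feature values but different labels can never be separated by any split)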

11

Assume you are building an image classification neural network to predict an image as either a dog, cat, or turtle. The images are 32x32 pixels and serialized into a vector of 1024 features per image. Assume there is only one hidden layer between the input and output layer. The hidden layer has 10 neurons (nodes). Ignoring bias terms, what is the total number of weights for this network?

10,270
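
The total is 1024 x 10 + 10 x 3 = 10,240 + 30 = 10,270, assuming fully connected layers and one output neuron per class. A small sketch with the illustrative helper `count_weights`:

```python
def count_weights(layer_sizes):
    """Weights in a fully connected network, ignoring bias terms."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Input of 1024 features, one hidden layer of 10 neurons, 3 output classes.
print(count_weights([1024, 10, 3]))  # 10270
```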

12

When a sigmoid is used as the activation function in the output layer for a binary class, what output value would denote the highest uncertainty for a class prediction?

0.5

13

What input value into the sigmoid function would produce the highest uncertainty output value?

0
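
Both sigmoid cards can be checked with a few lines, using the standard logistic function 1 / (1 + e^(-z)):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# An input of 0 maps to 0.5, the midpoint between the two classes.
print(sigmoid(0.0))  # 0.5
```

Inputs far from 0 in either direction push the output toward 0 or 1, which corresponds to a more confident prediction.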

14

A binary classifier needs to answer the question: "Does the patient have lung cancer?" The table below shows the labels and predictions for a validation dataset. Compute the precision of these predictions:

Sample Number  Actual  Predicted
1              Normal  Cancer
2              Cancer  Cancer
3              Cancer  Cancer
4              Normal  Normal
5              Cancer  Normal

Assume "Cancer" represents the positive class, and "Normal" represents the negative class.
Please round your answer to the 2nd decimal place.

[note: precision is a value between 0 and 1]

0.67 (2 true positives out of 3 predicted positives; answers within a margin of 0.01 are accepted)
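
Precision is TP / (TP + FP). The sketch below recomputes it from the five samples in the card; the function name `precision` is illustrative.

```python
actual = ["Normal", "Cancer", "Cancer", "Normal", "Cancer"]
predicted = ["Cancer", "Cancer", "Cancer", "Normal", "Normal"]

def precision(actual, predicted, positive="Cancer"):
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if p == positive and a == positive)
    fp = sum(1 for a, p in pairs if p == positive and a != positive)
    return tp / (tp + fp)

# Samples 2 and 3 are true positives; sample 1 is a false positive.
print(round(precision(actual, predicted), 2))  # 0.67
```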

15

In which of the following prediction scenarios would it be appropriate to apply AUC as the metric? 

When predicting a binary label with a probabilistic prediction
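
One way to see why AUC needs probabilistic (or at least ranked) predictions: it can be computed as the fraction of positive/negative pairs in which the positive example receives the higher score. The sketch below uses that pairwise definition on made-up labels and scores.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative data: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```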

16

Simple aggregation (also known as a simple combiner) differs from bagging in the following ways (check all that apply):

bagging requires bootstrapping and simple aggregation does not

bagging requires that the same algorithm be used for prediction/classification and simple aggregation does not
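
The bootstrapping step that sets bagging apart can be sketched as sampling with replacement from the training data, once per model. The helper names below are illustrative; in simple aggregation each (possibly different) model would instead be trained on the full dataset, and only the vote-combining step would be shared.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement, as bagging does for each model."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine the class predictions of several models."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
data = list(range(10))
sample = bootstrap_sample(data, rng)  # same size as data, duplicates likely
print(majority_vote(["cat", "dog", "cat"]))  # cat
```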

17

What type of ensemble technique uses bootstrapping but modifies the probability of sampling an instance based on how well it was predicted by previously trained models?

Boosting

18

Which ensemble method does not allow for parallel training of the models in the ensemble?

Boosting
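
The two boosting cards above can be illustrated with a simplified reweighting step. This is a sketch under stated assumptions, not the exact update rule of AdaBoost or any specific algorithm: the function name and the fixed `factor` of 2.0 are made up for the example.

```python
def update_sampling_weights(weights, correct, factor=2.0):
    """Raise the sampling probability of instances the last model got wrong."""
    raised = [w if ok else w * factor for w, ok in zip(weights, correct)]
    total = sum(raised)
    return [w / total for w in raised]

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, False, True]  # results from the previously trained model
new_weights = update_sampling_weights(weights, correct)
print(new_weights)  # the misclassified third instance now has the largest weight
```

Because each round's weights depend on the predictions of the model trained in the previous round, the models have to be trained one after another, which is why boosting cannot be parallelized across models.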