Data Mining & Analytics Review

18 Terms

New cards

Which of the following would be the least effective way to represent a color (e.g., "Pink") in a dataset used in a predictive modeling task?

As an ordinal value based on its rank in an alphanumeric sorting of all colors

New cards

Consider a dataset with the following structure:

City	State	Date	Temperature
Berkeley	CA	Jan 25, 2018	11

Suppose we transform this dataset into one with only the features State, Month, and Temperature, with State represented by the longitude and latitude of the state's capital, Month represented by a one-hot encoding, and Temperature left as a numeric value. How many total features (columns) would the transformed dataset have?

(hint: longitude and latitude count as two features)
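
The count can be worked out by listing the columns. This is a minimal sketch that assumes the one-hot encoding uses one column per calendar month (12 in total); the column names are hypothetical and only serve the illustration. Under that assumption the total is 2 + 12 + 1 = 15.

```python
# Column names are illustrative; they do not come from the original dataset.
state_features = ["capital_longitude", "capital_latitude"]  # State -> 2 numeric columns
month_features = [f"month_{m}" for m in range(1, 13)]        # Month -> 12 one-hot columns (assumed)
temperature_features = ["temperature"]                       # kept as one numeric column

all_features = state_features + month_features + temperature_features
print(len(all_features))  # 2 + 12 + 1
```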

New cards

Sum of Squares Error (SSE) can be used with K-means clustering to:

(check all that apply)

K = number of clusters

n = number of data points being clustered

Choose a value of K based on the heuristic of the "elbow" method

Choose between different clusterings (for a fixed K) produced by starting with different random K-means centroids
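
As a rough illustration of the second use, SSE can be computed for two candidate clusterings with the same K and the lower value preferred. The sketch below uses made-up 1-D points and hand-picked assignments rather than a real K-means run; the function name `sse` is an assumption for this example.

```python
def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

# Toy 1-D data (invented for illustration): two obvious groups.
points = [(1.0,), (2.0,), (9.0,), (10.0,)]

# Two candidate clusterings for the same K = 2, each with its cluster means as centroids.
good = sse(points, [0, 0, 1, 1], [(1.5,), (9.5,)])
bad = sse(points, [0, 1, 1, 1], [(1.0,), (7.0,)])
print(good, bad)  # the lower SSE marks the tighter clustering
```

For the elbow heuristic, the same function would be evaluated once per candidate K and the values plotted against K.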

New cards

What is the range of the silhouette score?

[-1,1]

New cards

How could a data point have a silhouette coefficient of 0?

If the data point is as close to points in its cluster as it is to points in the nearest cluster (not including its own)
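
A small sketch of the usual definition, s = (b - a) / max(a, b), where a is the mean distance from the point to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The 1-D points are invented, and the function assumes both clusters passed in are non-empty.

```python
def silhouette(point, own_cluster, nearest_cluster):
    """Silhouette coefficient for one 1-D point.

    own_cluster: the other points in the point's cluster (the point itself excluded).
    nearest_cluster: the points of the closest neighbouring cluster.
    """
    a = sum(abs(point - q) for q in own_cluster) / len(own_cluster)
    b = sum(abs(point - q) for q in nearest_cluster) / len(nearest_cluster)
    return (b - a) / max(a, b)

print(silhouette(5.0, [3.0], [7.0]))        # a == b, so the score is 0
print(silhouette(1.0, [1.5], [9.0, 10.0]))  # a much smaller than b, score near 1
print(silhouette(5.0, [9.0], [4.0]))        # a larger than b, score is negative
```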

New cards

How many different assignments of data points to clusters are there given n data points and K clusters? Assume a data point can only belong to a single cluster.

K^n
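
Each of the n points independently picks one of K cluster labels, which gives K^n. A brute-force check for small values, as a sketch (it counts labelled assignments and allows clusters to be empty, which is the reading that yields K^n):

```python
from itertools import product

def count_assignments(n, k):
    """Enumerate every way to give each of n points one of k cluster labels."""
    return sum(1 for _ in product(range(k), repeat=n))

print(count_assignments(4, 3), 3 ** 4)
```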

New cards

The plot below depicts data points for a dataset of 10 individuals seeking a credit card, 6 of whom are considered a high credit risk and 4 of whom are considered a low credit risk.

What is the starting Gini impurity (index) of this dataset given credit risk as the target?

[Reminder] Gini impurity (index) formula: $G = 1 - \sum_{i} p_i^2$, where $p_i$ is the fraction of data points belonging to class $i$.

0.48

New cards

If there were equal low credit risk as high credit risk individuals, what would the Gini impurity be of the dataset without any splits?

0.5
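
Both Gini values can be checked with a short sketch of the standard definition, 1 minus the sum of squared class proportions. The function name is an assumption for this example.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_i^2) for a list of per-class counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(round(gini_impurity([6, 4]), 2))  # 6 high-risk, 4 low-risk
print(round(gini_impurity([5, 5]), 2))  # equal class sizes
```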

New cards

If you were creating a decision tree based on this dataset using the C4.5 or CART algorithm, the first step would be to choose an attribute and split point that best partitioned the data points by the target value.

According to the credit risk plot, which attribute and split point would be the best choice among the following options?

Age with a split point of 35

New cards

Given enough depth (splits), a decision tree can successfully classify any training dataset with 100% accuracy.

False

New cards

Assume you are building an image classification neural network to predict whether an image is a dog, cat, or turtle. The images are 32x32 pixels and serialized into a vector of 1024 features per image. Assume there is only one hidden layer between the input and output layers. The hidden layer has 10 neurons (nodes). Ignoring bias terms, what is the total number of weights for this network?

10,270
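
The count follows from multiplying the sizes of adjacent layers. The sketch assumes fully connected layers and one output neuron per class (3 outputs), which is what the stated answer implies.

```python
layer_sizes = [32 * 32, 10, 3]  # input pixels, hidden neurons, output classes (assumed one per class)

# A fully connected layer has (inputs x outputs) weights; bias terms are ignored.
weights_per_layer = [n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_weights = sum(weights_per_layer)
print(weights_per_layer, total_weights)
```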

New cards

Using a sigmoid as the output-layer activation function for binary classification, what output value produced by the sigmoid would denote the highest uncertainty in the class prediction?

+0.5

New cards

What input value into the sigmoid function would produce the highest uncertainty output value?
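
A quick numerical check of the sigmoid, 1 / (1 + e^(-z)), as a sketch: an input of 0 gives an output of exactly 0.5, the value furthest from both 0 and 1, while large negative or positive inputs push the output toward 0 or 1.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real input to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3))
```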

New cards

A binary classifier needs to answer the question: "Does the patient have lung cancer?" The table below shows the labels and predictions for a validation dataset. Compute the precision of these predictions:

Sample Number	Actual	Predicted
1	Normal	Cancer
2	Cancer	Cancer
3	Cancer	Cancer
4	Normal	Normal
5	Cancer	Normal

Assume "Cancer" represents the positive class, and "Normal" represents the negative class.
Please round your answer to the 2nd decimal place.

[note: precision is a value between 0 and 1]

0.67 (with margin: 0.01)
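
Precision is true positives divided by all predicted positives. A sketch that recomputes it from the five samples in the table, with "Cancer" as the positive class:

```python
actual = ["Normal", "Cancer", "Cancer", "Normal", "Cancer"]
predicted = ["Cancer", "Cancer", "Cancer", "Normal", "Normal"]
positive = "Cancer"

# Count predicted positives that were right (TP) and wrong (FP).
true_positives = sum(a == positive and p == positive for a, p in zip(actual, predicted))
false_positives = sum(a != positive and p == positive for a, p in zip(actual, predicted))

precision = true_positives / (true_positives + false_positives)
print(round(precision, 2))
```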

New cards

In which of the following prediction scenarios would it be appropriate to apply AUC as the metric?

When predicting a binary label with a probabilistic prediction
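
AUC depends only on how the predicted scores rank positives against negatives, which is why it suits probabilistic predictions of a binary label. One way to see this is the pairwise form below, a sketch with made-up labels and scores: AUC is the fraction of (positive, negative) pairs in which the positive gets the higher score.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0]              # invented binary labels
scores = [0.9, 0.4, 0.6, 0.1]      # invented predicted probabilities
print(auc(labels, scores))
```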

New cards

Simple aggregation (also known as a simple combiner) differs from bagging in the following ways (check all that apply):

bagging requires bootstrapping and simple aggregation does not

bagging requires that the same algorithm be used for prediction/classification and simple aggregation does not
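
The bootstrapping step in bagging can be sketched as sampling the training set with replacement, once per model. The integer rows below are a stand-in for real training instances, and `bootstrap_sample` is a hypothetical helper name.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return rng.choices(rows, k=len(rows))

rng = random.Random(0)           # fixed seed so the sketch is repeatable
training_rows = list(range(10))  # stand-in for real training instances

# Bagging: each model in the ensemble is trained on its own bootstrap sample.
samples = [bootstrap_sample(training_rows, rng) for _ in range(3)]
for sample in samples:
    print(sorted(sample))
```

Because sampling is with replacement, a sample will typically repeat some rows and omit others.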

New cards

What type of ensemble technique uses bootstrapping but modifies the probability of sampling an instance based on how well it was predicted by previously trained models?

Boosting
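
The reweighting idea can be sketched as follows. This is a deliberately simplified update (a fixed multiplicative `factor`, chosen arbitrarily here), not the exact rule of AdaBoost or any other specific boosting algorithm; the results in `was_correct` are made up.

```python
def reweight(weights, was_correct, factor=2.0):
    """Multiply the weight of each misclassified instance by `factor`, then renormalise."""
    raised = [w if ok else w * factor for w, ok in zip(weights, was_correct)]
    total = sum(raised)
    return [w / total for w in raised]

weights = [0.25, 0.25, 0.25, 0.25]       # uniform sampling probabilities to start
was_correct = [True, True, False, True]  # invented results from a previous model
new_weights = reweight(weights, was_correct)
print(new_weights)  # the misclassified instance is now more likely to be sampled
```

Since each round's weights depend on the previous model's errors, the models have to be trained one after another.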

New cards

Which ensemble method does not allow for parallel training of the models in the ensemble?

Boosting