T/F - Hierarchical clustering requires a predetermined number of clusters.
False
T/F - Normalization is essential in clustering to ensure all variables contribute equally to the distance measures.
True
T/F - K-means clustering assigns records to clusters based on probabilities.
False
T/F - Binary similarity measures are preferred when working with continuous data.
False
Which of the following is an unsupervised learning task?
Clustering
Regression
Classification
Reinforcement learning
Clustering
Which distance measure is most commonly used in clustering but has limitations such as sensitivity to outliers?
Cosine similarity
Manhattan distance
Euclidean distance
Jaccard similarity
Euclidean distance
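As a quick illustration of why Euclidean distance is sensitive to outliers, here is a minimal NumPy sketch (the points are made-up values): a single outlying coordinate dominates the sum of squared differences.

```python
import numpy as np

# Two hypothetical points; the values are illustrative only.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 3.0])

# Euclidean distance: square root of the sum of squared differences.
print(np.sqrt(np.sum((a - b) ** 2)))          # ~2.24

# One outlying coordinate dominates the squared differences.
b_outlier = np.array([2.0, 4.0, 30.0])
print(np.sqrt(np.sum((a - b_outlier) ** 2)))  # ~27.09
```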
In clustering, normalization of numerical variables ensures that:
Larger variables dominate the clustering process
All variables contribute equally to the distance measures
Only important variables contribute to distance measures
Clustering becomes faster
All variables contribute equally to the distance measures
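A minimal sketch of z-score normalization (one common choice; min-max scaling is another), using made-up age/income data where the raw scales differ by orders of magnitude:

```python
import numpy as np

# Hypothetical data: income in dollars dwarfs age in years,
# so without normalization income would dominate any distance.
X = np.array([[25, 40_000.0],
              [45, 90_000.0],
              [35, 60_000.0]])

# Z-score normalization: each column ends up with mean 0 and std 1,
# so both variables contribute comparably to distance measures.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm)
```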
In k-means clustering, the number of clusters is:
Randomly chosen after clustering
Automatically determined by the algorithm
Predefined before running the algorithm
Predefined before running the algorithm
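A minimal scikit-learn sketch (assuming sklearn is available; the data is random and purely illustrative) showing that k must be supplied up front:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)  # illustrative random data

# n_clusters must be chosen before fitting; k-means never infers it.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (3, 2)
```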
What does hierarchical clustering require that can make it computationally expensive?
An n x n distance matrix
A small dataset
A random seed
A predetermined number of clusters
An n x n distance matrix
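A short SciPy sketch of the n x n distance matrix that hierarchical clustering needs (illustrative random data); the quadratic growth of this matrix is what makes the method expensive on large datasets:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(500, 4)  # illustrative data

# pdist returns the n*(n-1)/2 condensed pairwise distances;
# squareform expands them into the full n x n matrix, whose
# memory footprint grows quadratically with n.
D = squareform(pdist(X, metric="euclidean"))
print(D.shape)  # (500, 500)
```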
T/F - The k-means algorithm always guarantees the globally optimal clustering solution.
False
T/F - K-means clustering is best suited for datasets with spherical clusters.
True
T/F - K-means is more sensitive to outliers compared to K-medoids clustering.
True
T/F - In centroid linkage clustering, the distance between clusters is calculated by averaging the distances between all points in the two clusters.
False
T/F - The sum of squared errors (SSE) decreases as the number of clusters (k) increases in k-means clustering.
True
T/F - The “elbow method” is used to determine the most appropriate number of clusters in k-means clustering.
True
T/F - K-means clustering can be applied to both continuous and categorical data without modification.
False
What is the main goal of k-means clustering?
Minimize between-cluster variance
Maximize between-cluster variance
Minimize within-cluster variance
Maximize within-cluster variance
Minimize within-cluster variance
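To make "within-cluster variance" concrete, here is a minimal sketch computing the within-cluster sum of squares by hand and checking it against scikit-learn's inertia_ attribute (random illustrative data):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # illustrative data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Within-cluster sum of squares: the objective k-means minimizes.
wcss = sum(np.sum((X[km.labels_ == k] - c) ** 2)
           for k, c in enumerate(km.cluster_centers_))
print(wcss, km.inertia_)  # the two values should agree
```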
Which of the following statements about K-medoids is true?
It only uses squared Euclidean distance to calculate distances
It is less sensitive to outliers compared to k-means
It uses centroids to define clusters
It assumes clusters are spherical
It is less sensitive to outliers compared to k-means
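A tiny NumPy sketch of why medoids resist outliers better than centroids (one-dimensional made-up points; not a full K-medoids/PAM implementation):

```python
import numpy as np

# Hypothetical 1-D cluster with one extreme outlier.
points = np.array([[1.0], [2.0], [3.0], [100.0]])

# The centroid (mean) is dragged toward the outlier...
print(points.mean(axis=0))  # [26.5]

# ...while the medoid, the actual data point with the smallest
# total distance to all other points, stays with the bulk.
dist = np.abs(points - points.T)          # pairwise |xi - xj|
print(points[dist.sum(axis=1).argmin()])  # [2.]
```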
In k-means clustering, what is the purpose of the “elbow method”?
To determine the optimal number of clusters
To minimize outliers
To increase the number of centroids
To reduce the within-cluster variance
To determine the optimal number of clusters
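A minimal elbow-method sketch (scikit-learn, random illustrative data): SSE always falls as k grows, and the "elbow" is where the marginal drop levels off:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)  # illustrative data

# inertia_ (SSE) decreases monotonically with k; look for the
# k after which further increases buy little improvement.
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))
```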
Maximum Coordinate Distance is the same as _________. Both refer to the distance between two points defined as the largest absolute difference across any single dimension.
Euclidean distance
Manhattan distance
Chebyshev distance
City block distance
Chebyshev distance
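A quick SciPy check that Chebyshev distance is exactly the maximum coordinate difference, contrasted with Manhattan (city block) and Euclidean (made-up points):

```python
import numpy as np
from scipy.spatial.distance import chebyshev, cityblock, euclidean

a = np.array([1.0, 5.0, 2.0])
b = np.array([4.0, 6.0, 2.0])

# Chebyshev: only the largest absolute coordinate difference counts.
print(chebyshev(a, b), np.max(np.abs(a - b)))  # 3.0, 3.0
print(cityblock(a, b))                          # 4.0 (sum of diffs)
print(euclidean(a, b))                          # ~3.16
```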
T/F - In agglomerative hierarchical clustering, the clustering process starts with each instance as an individual cluster.
True
T/F - Divisive clustering is a bottom-up approach to hierarchical clustering.
False
T/F - Single linkage clustering calculates the maximum distance between points in two clusters.
False
T/F - Ward’s method minimizes the loss of information at each step by using Error Sum of Squares (ESS).
True
T/F - Dendrograms are tree-like diagrams used to show the order of clustering and the distance between clusters.
True
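A minimal SciPy sketch tying these cards together: agglomerative clustering with Ward's method, visualized as a dendrogram (random illustrative data; assumes matplotlib is available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(20, 2)  # illustrative data

# Ward's method merges the pair of clusters that minimizes the
# increase in the error sum of squares (ESS) at each step.
Z = linkage(X, method="ward")
dendrogram(Z)  # tree diagram of merge order and merge distances
plt.show()
```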
In complete linkage clustering, how is the distance between two clusters measured?
By the average distance between all pairs of points
By the distance between the farthest points
By the distance between centroids
By the distance between the closest points
By the distance between the farthest points
What characteristic is commonly associated with single linkage clustering?
Spherical clusters
Large compact clusters
Globular clusters
Elongated, chain-like clusters
Elongated, chain-like clusters
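A short sketch contrasting single and complete linkage on the same data (SciPy, random illustrative data); single linkage's closest-pair merge rule is what produces chain-like clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 2)  # illustrative data

# single: merge on the closest pair of points (prone to chaining).
# complete: merge on the farthest pair (favors compact clusters).
for method in ("single", "complete"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, np.bincount(labels)[1:])  # cluster sizes
```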
Which of the following instances is most similar to A?
B
C
D
E
B
In the figure, two thresholds are represented by a solid red line and a dotted blue line. How many clusters would be formed at the solid red line threshold and at the dotted blue line threshold, respectively?
5, 5
5, 7
6, 11
7, 7
5, 7
Which of the following is NOT a feature of DBSCAN?
Handles noise and outliers
Discovers clusters of arbitrary shapes
Sensitive to initial cluster centroids
Does not require you to pre-specify the number of clusters
Sensitive to initial cluster centroids
In DBSCAN, a point is classified as a core point if:
It has at least a specified number of points within the epsilon radius
It is on the boundary of the cluster
It does not belong to any cluster
It has a large distance to its nearest neighbors
It has at least a specified number of points within the epsilon radius
What is the main parameter that defines the neighborhood of a point in DBSCAN?
Variance
Eps (epsilon)
k-nearest neighbors
Number of clusters
Eps (epsilon)
Which of the following methods does DBSCAN use to form clusters?
Identifying regions of high data density
Maximizing intra-cluster similarity
Minimizing distances to centroids
Dividing data into equal partitions
Identifying regions of high data density
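A minimal scikit-learn DBSCAN sketch (random illustrative data; the eps and min_samples values are arbitrary choices for demonstration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)  # illustrative data

# eps sets the neighborhood radius; min_samples is the density
# threshold for core points. No cluster count is given up front.
db = DBSCAN(eps=0.1, min_samples=5).fit(X)
print(set(db.labels_))  # label -1 marks noise/outlier points
```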
T/F - In DBSCAN, all border points have more than MinPts neighbors within the Eps radius.
False
T/F - DBSCAN can find clusters of arbitrary shapes, unlike k-means, which assumes spherical clusters.
True
T/F - DBSCAN requires the number of clusters to be specified before clustering begins.
False
T/F - The elbow method helps in determining the optimal number of clusters for k-means clustering.
True
T/F - External indices require ground truth labels to evaluate clustering results.
True
T/F - A higher Dunn Index indicates poor clustering quality.
False
In a GMM, the “soft clustering” property means that:
Each data point is assigned a probability of belonging to each cluster
Each data point is assigned to exactly one cluster
Each data point is assigned a probability of belonging to each cluster
T/F - The Silhouette Score ranges from 0 to 1, where 1 indicates the worst clustering quality.
False
What does the Dunn Index measure?
Cohesion and dispersion
Silhouette coefficient
Intra-cluster compactness and inter-cluster separation
Cluster variance
Intra-cluster compactness and inter-cluster separation
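The Dunn Index has no standard scikit-learn implementation, so here is a hedged from-scratch sketch (the function name dunn_index is my own): minimum between-cluster separation divided by maximum within-cluster diameter, so higher is better:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Min inter-cluster distance / max intra-cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Largest pairwise distance inside any one cluster (compactness).
    max_diam = max(cdist(c, c).max() for c in clusters)
    # Smallest distance between points in different clusters (separation).
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    return min_sep / max_diam
```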
Which of the following is an internal index for cluster evaluation?
Precision
Purity
Accuracy
Silhouette Index
Silhouette Index
What does a high Silhouette Score indicate about clusters?
Clusters overlap significantly
Clusters are well-separated and cohesive
Clusters are poorly defined
Clusters are compact but not well-separated
Clusters are well-separated and cohesive
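A minimal silhouette sketch (scikit-learn, random illustrative data); recall the score actually ranges from -1 to 1, with values near 1 indicating cohesive, well-separated clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)  # illustrative data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette over all points: an internal index, so no
# ground-truth labels are needed.
print(silhouette_score(X, labels))
```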
Which of the following best describes the purpose of a Gaussian Mixture Model?
To perform hierarchical clustering by combining clusters in a tree-like structure
To separate data into clusters using a distance-based metric
To model data with a single Gaussian distribution
To model data with overlapping clusters by representing each cluster as a Gaussian distribution
To model data with overlapping clusters by representing each cluster as a Gaussian distribution
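A minimal GMM sketch (scikit-learn, random illustrative data) showing the soft-clustering property: predict_proba returns each point's membership probability for every Gaussian component:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 2)  # illustrative data

# Each cluster is one Gaussian component; fitting uses EM.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft clustering: a probability per component for each point.
print(gmm.predict_proba(X[:3]).round(3))
```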