data mining exam 2

Description and Tags

clustering


96 Terms

1
New cards

agglomerative hierarchical clustering

where each instance starts as its own cluster and the two closest clusters are merged iteratively until a single cluster remains (bottom-up approach)

2
New cards
3
New cards

key concepts of k-means clustering

measured as the sum of squared distances from each data point to its centroid

4
New cards

what does k-means aim to do?

k-means aims to reduce the total within-cluster dispersion at each step

5
New cards

what approach does k-means use?

k-means uses a heuristic approach, providing quick but not always optimal solutions
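A minimal scikit-learn sketch of this heuristic in practice (the synthetic blobs, k = 3, and the random_state are illustrative assumptions, not from the cards); inertia_ is the within-cluster SSE the algorithm tries to reduce:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated data (assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k must be pre-specified; several n_init restarts soften initialization sensitivity
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("cluster sizes:", np.bincount(km.labels_))
print("SSE (within-cluster sum of squared distances):", km.inertia_)
```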

6
New cards

advantages of k-means clustering?

speed, high scalability, easy to understand

7
New cards

challenges of using k-means clustering?

number of clusters must be pre-specified, sensitivity to initialization, sensitive to outliers

8
New cards

key issues of k-means?

assumes spherical clusters, is sensitive to cluster sizes and assumes they all have the same variance, and results depend on where the centroids land (centroid location bias)

9
New cards

k-medoids clustering objective

uses actual data points as cluster centers

10
New cards

advantages of k-medoids clustering?

less sensitive to outliers compared to k-means, can better handle arbitrary distance metrics

11
New cards

when is k-medoids best used?

best used for clustering data with outliers and noise

12
New cards

what is the method of k-medoids?

minimize the sum of dissimilarities between points and their closest medoids; a more robust method than the mean-based approach
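Scikit-learn has no built-in k-medoids, so here is a minimal NumPy sketch of the alternating (Voronoi-iteration) variant working on a precomputed dissimilarity matrix; the function name k_medoids and the toy Manhattan-distance data are assumptions for illustration:

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Alternating k-medoids on a precomputed dissimilarity matrix D (n x n)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)       # assign each point to its nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # pick the member that minimizes total dissimilarity within the cluster
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

# Toy data with an outlier; Manhattan distances as the arbitrary metric (assumption)
X = np.array([[0.0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [50, 50]])
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
medoids, labels = k_medoids(D, k=2)
print(medoids, labels)
```

Because only a dissimilarity matrix is needed, any distance metric can be plugged in, which is the flexibility these cards refer to.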

13
New cards

k-mode clustering objective

extension of k-means for categorical data

14
New cards

k-mode advantages

handles categorical data by using modes instead of means, efficiently handles data based on the most frequent category

15
New cards

applications of k-mode

used in market segmentation, document classification, other areas involving categorical variables

16
New cards

k-mode method

minimize the mismatches between data points and their cluster mode, using a matching dissimilarity measure for categorical data
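A small NumPy sketch of the two ingredients named on this card, the matching dissimilarity measure and the cluster mode; the toy categorical array is an assumption:

```python
import numpy as np

def matching_dissimilarity(a, b):
    """Count of attribute mismatches between two categorical records."""
    return np.sum(a != b)

def cluster_mode(records):
    """Column-wise mode: the most frequent category of each attribute."""
    mode = []
    for col in records.T:
        values, counts = np.unique(col, return_counts=True)
        mode.append(values[np.argmax(counts)])
    return np.array(mode)

# Toy categorical cluster (illustrative only)
cluster = np.array([["red", "small"], ["red", "large"], ["blue", "small"]])
print(cluster_mode(cluster))                                       # ['red' 'small']
print(matching_dissimilarity(cluster[1], cluster_mode(cluster)))   # 1 mismatch
```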

17
New cards

divisive hierarchical clustering

where all instances start in a single cluster that is split into two clusters iteratively until each instance forms its own cluster (top-down approach)

18
New cards

single linkage cluster

the minimum distance between the nearest pair of records from two clusters

19
New cards

what are characteristics of single linkage clustering?

records that are far apart can be joined into the same cluster at an early stage, creating elongated, sausage-like clusters

20
New cards

complete linkage clustering

the maximum distance between the farthest pair of records in two clusters

21
New cards

what are characteristics of complete linkage clusters?

forms clusters at early stages whose records are within a narrow range of distances, resulting in roughly spherical shapes

22
New cards

average linkage clustering (UPGMA)

based on the average distance between all possible pairs of records in two clusters

23
New cards

what are some characteristics of average linkage clustering?

average linkage relies on actual distances between the records and not just the order, transformation resilience: results are not affected by linear transformations of distances (as long as the order of distances remains unchanged)

24
New cards

what does average linkage clustering (UPGMA) stand for?

unweighted pair group method using averages

25
New cards

centroid linkage clustering

based on centroid distance: each cluster is represented by its mean value for each variable, and the distance between two clusters is the distance between their two mean vectors

26
New cards

average linkage vs centroid linkage

average linkage: the distances between all pairs of points are computed and the average of those distances is taken. centroid linkage: only one distance is calculated, between the mean vectors (centroids) of the two clusters

27
New cards

what is another name for centroid linkage?

(UPGMC) unweighted pair group method using centroids

28
New cards

what is the key feature of centroid linkage?

less computation because only the distance between centroids is calculated rather than all pairs

29
New cards

ward’s method

agglomerative clustering that joins records and clusters to form larger clusters, minimizing the loss of information when clusters are formed

30
New cards

key concept of ward’s method

error sum of squares, measures the loss of information when individual records are replaced by a cluster mean

31
New cards

dendrogram

a tree that shows the order in which clusters are grouped together and the distances between the clusters

32
New cards

what is a clade?

a branch of a dendrogram or a vertical line

33
New cards

what is a link?

a link is a horizontal line that connects two clades, whose height gives the distance between clusters

34
New cards

what is a leaf?

a leaf is the terminal end of each clade in a dendrogram which represents a single instance

35
New cards

single linkage dendrogram

join clusters based on the minimum distance between the records

36
New cards

average linkage

joins clusters based on the average distance between all pairs of records

37
New cards

what happens as the number of clusters decreases?

smaller clusters merge into larger clusters as the number of clusters decreases

38
New cards

what provides a visual representation of the hierarchical clustering process?

dendrograms

39
New cards

what are different ways to measure distances between clusters?

single, average, complete, and centroid linkage

40
New cards

how can the number of clusters be adjusted?

by cutting the dendrogram at different heights
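A minimal SciPy sketch of agglomerative clustering, its dendrogram, and cutting the tree at a chosen height; the toy data, the average-linkage choice, and the cut height t=3.0 are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])  # toy data

Z = linkage(X, method="average")     # also: "single", "complete", "centroid", "ward"
dendrogram(Z)                        # visual representation of the merge order
plt.show()

labels = fcluster(Z, t=3.0, criterion="distance")   # "cut" the tree at height 3.0
print(np.unique(labels))
```

Cutting at a larger height merges more clades, so fewer clusters remain; cutting lower yields more, smaller clusters.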

41
New cards

advantages of hierarchical clustering

no need to pre-specify the number of clusters, dendrograms provide clear visual representation of clustering

42
New cards

limitations of hierarchical clustering

memory and computational cost, no reallocation of records, low stability, sensitivity to distance metrics, sensitivity to outliers

43
New cards

K-Means

centroid based clustering, minimizing the squared distances between data points and their respective cluster centroids, sensitive to initial placement of the centroids, assumes clusters are of roughly similar size and have equal variance

44
New cards

when does k-means work best?

when clusters are well separated, roughly spherical, and have similar sizes. may not perform well if the clusters have irregular shapes and densities

45
New cards

DBSCAN

density based clustering, algorithm that identifies clusters as dense regions separated by areas of lower point density, can identify outliers and noise, and can be used for arbitrary shapes varying in sizes

46
New cards

what does not assume a fixed number of clusters?

DBSCAN

47
New cards

core point

a point with at least the specified number of points (MinPts) within its Eps neighborhood

48
New cards

border point

not a core point, but in the neighborhood of a core point

49
New cards

noise point

a point that is neither a core point nor a border point

50
New cards

DBSCAN algorithm

label points, eliminate noise, connect core points, form clusters, assign border points
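A minimal scikit-learn sketch of DBSCAN; the two-moons data and the parameter values eps=0.2 and min_samples=5 are assumptions, and noise points come back labeled -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Arbitrary-shaped clusters plus noise (illustrative data)
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # only two parameters: Eps and MinPts
labels = db.labels_                          # -1 marks noise points

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))
print("core points:", len(db.core_sample_indices_))
```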

51
New cards

when does DBSCAN not work well

sensitive to its parameters, varying densities (may merge dense clusters or miss sparse ones if they have different densities), struggles with high dimensional data

52
New cards

directly density reachable

a point p is directly density-reachable from q if q is a core point and p lies within its Eps neighborhood; this relation is not symmetric

53
New cards

density connected

two points are density-connected if both are density-reachable from a common core point; this relation is symmetric

54
New cards

DBSCAN pros

robust to outliers, only 2 parameters, no predefined cluster numbers, identifies arbitrary shapes, insensitive to point ordering

55
New cards

DBSCAN cons

cannot differentiate varying density clusters, not deterministic, parameter sensitivity

56
New cards

DBSCAN clustering approach

identifies clusters as regions of high density and labels low-density regions as noise

57
New cards

gaussian mixture model

a probabilistic approach to clustering

58
New cards

assumption

data is generated from a mixture of multiple gaussian distributions

59
New cards

clusters

each gaussian distribution corresponds to a cluster and the model computes the probability of each data point belonging to a cluster

60
New cards

soft clustering

assigns each point a probability of belonging to each cluster, offering a flexible probabilistic approach to cluster assignment

61
New cards

expectation maximization (EM) algorithm

used to find the optimal parameters (means, covariances, and mixing coefficients) for a GMM

62
New cards

what are the steps for the EM algorithm?

E-step (expectation), M-Step (Maximization), iterating between these until the parameters converge
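A minimal scikit-learn sketch of a GMM fitted by EM; the overlapping synthetic blobs and n_components=3 are assumptions. predict_proba shows the soft-clustering output:

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.0, 0.5],
                  random_state=0)            # overlapping blobs (assumption)

# EM iterates E- and M-steps internally until the log-likelihood converges
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard = gmm.predict(X)          # hard assignment: most likely component
soft = gmm.predict_proba(X)    # soft clustering: membership probabilities
print(soft[:3].round(3))       # each row sums to 1 across the 3 components
```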

63
New cards

strengths of GMM

handles overlapping clusters, models clusters of various shapes, provides soft clustering

64
New cards

weaknesses of GMM

sensitive to initialization, requires number of clusters to be predefined

65
New cards

k-means strengths

simple and fast, effective for well separated clusters

66
New cards

k-means weaknesses

assumes clusters are spherical and equal in size, requires predefined number of clusters

67
New cards

DBSCAN strengths

detects clusters of arbitrary shapes, identifies noise and outliers effectively, no need to predefine the number of clusters

68
New cards

DBSCAN weaknesses

sensitive to parameter settings, struggles with varying density clusters

69
New cards

hierarchical clustering strengths

builds a hierarchy of clusters, no need to predefine the number of clusters, works well with small datasets

70
New cards

hierarchical clustering weaknesses

computationally expensive for large datasets, assumes clusters are either merged or split based on the linkage method, does not handle noise or outliers as effectively as DBSCAN

71
New cards

what is best for overlapping clusters or clusters with different shapes and densities? suitable for soft clustering

GMM

72
New cards

what is ideal for quick clustering of well separated, spherical clusters when the number of clusters is known

k-means

73
New cards

what is effective for detecting irregular shapes and handling noise but requires parameter tuning?

DBSCAN

74
New cards

what is used for visualizing cluster hierarchy (dendrogram) and works well for small datasets where computation is manageable?

hierarchical

75
New cards

external cluster evaluation

entropy, measures the disorder or impurity within a cluster
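A small NumPy sketch of per-cluster entropy as an external evaluation measure, assuming the true class labels are known; the function name cluster_entropy and the toy labels are illustrative:

```python
import numpy as np

def cluster_entropy(true_labels, cluster_labels):
    """Entropy of the class distribution inside each cluster (lower = purer)."""
    entropies = {}
    for c in np.unique(cluster_labels):
        _, counts = np.unique(true_labels[cluster_labels == c], return_counts=True)
        p = counts / counts.sum()
        entropies[c] = float(-np.sum(p * np.log2(p)))
    return entropies

# Toy example: cluster 0 is pure (entropy 0), cluster 1 is mixed (entropy 1)
y_true = np.array([0, 0, 0, 1, 1, 0, 0])
y_clust = np.array([0, 0, 0, 1, 1, 1, 1])
print(cluster_entropy(y_true, y_clust))
```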

76
New cards

low entropy

more homogeneous clusters (more points belonging to the same class)

77
New cards

high entropy

more heterogeneous clusters (points distributed among different classes)

78
New cards

low entropy clusters indicate…?

well grouped data points

79
New cards

internal cluster evaluation

measures clustering quality without external information, based on the dataset's inherent features

80
New cards

intra-cluster compactness (cohesion)

measures how close data points are to their cluster centroid (sum of squared errors, SSE)

81
New cards

elbow method

graphical approach to determine optimal clusters by plotting SSE against the number of clusters

82
New cards

elbow curve method

helps balance between complexity and clustering effectiveness by plotting within cluster sum of squares
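A minimal sketch of producing an elbow curve with scikit-learn; the toy data and the range of k values tried are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster SSE")
plt.title("Elbow method: look for the bend in the curve")
plt.show()
```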

83
New cards

drawbacks of the elbow method

determining the elbow point can differ between individuals (subjective), assumes spherical and equally sized clusters which may not be suited for more complex data, not suitable for varying density

84
New cards

dunn index

evaluates intra-cluster compactness and inter-cluster separation

85
New cards

intra cluster distance

how tightly points are clustered around the centroid

86
New cards

inter cluster distance

measures the distance between the centroids of different clusters

87
New cards

what does higher dunn index indicate?

better clustering with compact clusters and well separated clusters
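A small NumPy sketch of a centroid-based Dunn index matching the definitions on these cards (other variants use nearest-point and farthest-point distances instead); the function name dunn_index is illustrative:

```python
import numpy as np

def dunn_index(X, labels):
    """Centroid-based Dunn index: min inter-centroid distance / max intra-cluster spread."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])

    # intra-cluster distance: largest mean distance of points to their own centroid
    intra = max(np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
                for i, c in enumerate(clusters))

    # inter-cluster distance: smallest distance between two centroids
    inter = min(np.linalg.norm(centroids[i] - centroids[j])
                for i in range(len(clusters)) for j in range(i + 1, len(clusters)))

    return inter / intra   # higher = compact, well-separated clusters
```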

88
New cards

Davies Bouldin index

for each cluster, compute the mean distance of its points to the cluster centroid (its scatter); for every pair of clusters, form the ratio of the sum of their scatters to the distance between their centroids; for each cluster keep the maximum of this ratio over all other clusters, and average these maxima across all clusters. lower values indicate better clustering
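Scikit-learn implements this measure as davies_bouldin_score; a minimal sketch, assuming toy blob data and k-means labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(davies_bouldin_score(X, labels))   # lower values indicate better clustering
```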

89
New cards

log ss ratio

compares the sum of squared distances between clusters vs within clusters

90
New cards

what does a larger log ss ratio indicate?

better cluster separation, with low intra-cluster variance and high inter-cluster variance

91
New cards

what does a smaller log ss ratio indicate?

poor separation, with higher variance within clusters than between them

92
New cards

silhouette score coefficient

quantitatively evaluates clustering quality

93
New cards

what does the silhouette score consider?

both cohesion (how close the points are within clusters) and separation (how far clusters are from each other), measures the distinctness of clusters and overall clustering validity

94
New cards

silhouette score range

range from -1 to 1; closer to 1 means better defined clusters, negative values suggest overlapping clusters or poor clustering quality

95
New cards

what is the silhouette coefficient for each data point?

(separation - cohesion) / max(separation, cohesion)

96
New cards

what are the steps for silhouette score?

calculate the cohesion and separation for each point, compute the silhouette coefficient, calculate the average silhouette coefficient across all points
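A minimal scikit-learn sketch of these steps: silhouette_samples returns the per-point coefficient (separation - cohesion) / max(separation, cohesion) and silhouette_score averages it across all points; the toy data and k-means labels are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s_per_point = silhouette_samples(X, labels)   # silhouette coefficient for each point
s_overall = silhouette_score(X, labels)       # average across all points

print(s_per_point[:5].round(3), "overall:", round(float(s_overall), 3))
```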