data mining exam 2


Description and Tags

clustering

67 Terms

1

agglomerative hierarchical clustering

where each instance starts as its own cluster and the two closest clusters are merged iteratively until a single cluster remains (bottom-up approach)
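
As an illustrative Python sketch (not part of the card set), SciPy's linkage performs exactly this bottom-up merging; the toy data X is an assumption for demonstration:

import numpy as np
from scipy.cluster.hierarchy import linkage

# two tight pairs of points, far apart from each other
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])

# each row of Z records one merge: the two cluster ids joined,
# the distance at which they merged, and the new cluster's size
Z = linkage(X, method="single")
print(Z)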

2

key concept of k-means clustering

within-cluster dispersion, measured as the sum of squared distances from each data point to its centroid
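
A minimal NumPy sketch of this quantity (the names X, centroids, and labels are illustrative assumptions, not from the cards):

import numpy as np

def wcss(X, centroids, labels):
    # sum of squared distances from each point to its assigned centroid
    return sum(np.sum((X[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))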

3

what does k-means aim to do?

k-means aims to reduce the total within-cluster dispersion at each step

4

what approach does k-means use?

k-means uses a heuristic approach, providing quick but not always optimal solutions
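
A sketch of one iteration of that heuristic (Lloyd's algorithm); empty clusters are not handled here, and all names are illustrative:

import numpy as np

def kmeans_step(X, centroids):
    # assignment step: each point joins its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: each centroid moves to the mean of its points
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids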

5

advantages of k-means clustering?

speed, high scalability, easy to understand

6

challenges of using k-means clustering?

number of clusters must be pre-specified, sensitivity to initialization, sensitivity to outliers

7

key issues of k-means?

assumes spherical clusters, sensitivity to cluster size (assumes all clusters have the same variance), centroid location bias

9

divisive hierarchical clustering

where all instances start in a single cluster that is split into two clusters iteratively until each instance is its own cluster (top-down approach)
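
One common way to implement a single top-down split is bisecting k-means; a minimal sketch under that assumption (names illustrative):

import numpy as np
from sklearn.cluster import KMeans

def bisect(cluster):
    # split one cluster into two using 2-means
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(cluster)
    return cluster[labels == 0], cluster[labels == 1]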

10

single linkage clustering

the minimum distance between the nearest pair of records from two clusters

11

what are characteristics of single linkage clustering?

can join distant records together at an early stage through chaining, creating elongated, sausage-like clusters

12

complete linkage clustering

the maximum distance between the farthest pair of records in two clusters
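
A sketch contrasting this with single linkage (A and B are assumed to be arrays holding the two clusters' records):

from scipy.spatial.distance import cdist

def single_linkage(A, B):
    return cdist(A, B).min()   # nearest cross-cluster pair

def complete_linkage(A, B):
    return cdist(A, B).max()   # farthest cross-cluster pair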

13

what are characteristics of complete linkage clusters?

forms clusters at early stages in which the records are within a narrow range of distances of one another, resulting in roughly spherical shapes

14

average linkage clustering (UPGMA)

based on the average distance between all possible pairs of records in two clusters

15

what are some characteristics of average linkage clustering?

average linkage relies on the actual distances between records, not just their order; transformation resilience: results are not affected by linear transformations of the distances (as long as the ordering of the distances remains unchanged)

16

what does average linkage clustering (UPGMA) stand for?

unweighted pair group method using averages

17

centroid linkage clustering

based on centroid distance: each cluster is represented by its mean value on each variable, and the distance between two clusters is the distance between their two mean vectors

18

average linkage vs centroid linkage

average linkage: the distances between all pairs of records are computed and their average is taken. centroid linkage: only one distance is calculated, between the mean vectors (centroids) of the two clusters
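
The same contrast as a sketch (A and B again assumed to be arrays of the two clusters' records):

import numpy as np
from scipy.spatial.distance import cdist

def average_linkage(A, B):
    return cdist(A, B).mean()  # average over all cross-cluster pairs

def centroid_linkage(A, B):
    # a single distance, between the two clusters' mean vectors
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))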

19

what is another name for centroid linkage?

(UPGMC) unweighted pair group method using centroids

20

what is the key feature of centroid linkage?

less computation because only the distance between centroids is calculated rather than all pairs

21

ward’s method

agglomerative clustering that joins records and clusters to form larger clusters, minimizing the loss of information when clusters are formed

22

key concept of ward’s method

error sum of squares (ESS), which measures the loss of information when individual records are replaced by a cluster mean
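
A sketch of ESS and of the merge cost that Ward's method minimizes (illustrative NumPy, assuming each cluster is an array of records):

import numpy as np

def ess(cluster):
    # information lost when records are replaced by the cluster mean
    return np.sum((cluster - cluster.mean(axis=0)) ** 2)

def ward_merge_cost(A, B):
    # increase in total ESS if clusters A and B were merged
    return ess(np.vstack([A, B])) - ess(A) - ess(B)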

23

dendrogram

a tree that shows the order in which clusters are grouped together and the distances between the clusters

24

what is a clade?

a branch of a dendrogram, drawn as a vertical line

25

what is a link?

a link is a horizontal line that connects two clades, whose height gives the distance between clusters

26

what is a leaf?

a leaf is the terminal end of each clade in a dendrogram which represents a single instance

27

single linkage dendrogram

joins clusters based on the minimum distance between the records

28

average linkage dendrogram

joins clusters based on the average distance between all pairs of records

29

what happens as the number of clusters decreases?

smaller clusters merge into larger clusters as the number of clusters decreases

30

what provides a visual representation of the hierarchical clustering process?

dendrograms

31

what are different ways to measure distances between clusters?

single, average, complete, and centroid linkage

32

how can the number of clusters be adjusted?

by cutting the dendrogram at different heights
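
A sketch using SciPy's fcluster to cut the tree at a chosen height (toy data and the cut height 2.5 are illustrative assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
Z = linkage(X, method="average")
# cutting at a smaller height yields more, smaller clusters
labels = fcluster(Z, t=2.5, criterion="distance")
print(labels)  # cluster id per record at this cut height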

33

advantages of hierarchical clustering

no need to pre-specify the number of clusters, dendrograms provide clear visual representation of clustering

34

limitations of hierarchical clustering

memory and computational cost, no reallocation of records, low stability, sensitivity to distance metrics, sensitivity to outliers

35

K-Means

centroid based clustering, minimizing the squared distances between data points and their respective cluster centroids, sensitive to initial placement of the centroids, assumes clusters are of roughly similar size and have equal variance

36

when does k-means work best?

when clusters are well separated, roughly spherical, and have similar sizes. may not perform well if the clusters have irregular shapes and densities

37

DBSCAN

density based clustering algorithm that identifies clusters as dense regions separated by areas of lower point density; can identify outliers and noise, and can find clusters of arbitrary shapes and varying sizes

38

what does not assume a fixed number of clusters?

DBSCAN

39

core point

a point that has at least the specified minimum number of points (MinPts) within its Eps neighborhood

40

border point

not a core point, but in the neighborhood of a core point

41

noise point

a point that is neither a core point nor a border point

42

DBSCAN algorithm

label points, eliminate noise, connect core points, form clusters, assign border points
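
A sketch of these steps via scikit-learn's DBSCAN (the data and the eps/min_samples values are illustrative):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.1], [5.2, 4.9], [5.1, 5.0],
              [9.0, 0.0]])                  # last point is isolated
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)                 # -1 marks noise points
print(db.core_sample_indices_)    # indices of core points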

43

when does DBSCAN not work well

sensitive to its parameters, varying densities (may merge dense clusters or miss sparse ones if they have different densities), struggles with high dimensional data

44

directly density reachable

a point p is directly density reachable from a point q if q is a core point and p lies within q's Eps neighborhood; this relation is not symmetric

45

density connected

two points are density connected if both are density reachable from some common point; this relation is symmetric

46

DBSCAN pros

robust to outliers, only 2 parameters, no predefined cluster numbers, identifies arbitrary shapes, insensitive to point ordering

47

DBSCAN cons

cannot differentiate varying density clusters, not deterministic, parameter sensitivity

48

DBSCAN clustering approach

identifies clusters based on regions of high density and identifies low-density regions as noise

49

gaussian mixture model

a probabilistic approach to clustering

50

assumption

data is generated from a mixture of multiple gaussian distributions

51

clusters

each gaussian distribution corresponds to a cluster and the model computes the probability of each data point belonging to a cluster

52

soft clustering

assigns probabilities rather than hard labels, offering flexible, probabilistic cluster assignment

53

expectation maximization (EM) algorithm

used to find the optimal parameters (means, covariances, and mixing coefficients) for a GMM

54

what are the steps for the EM algorithm?

E-step (expectation) and M-step (maximization), iterating between the two until the parameters converge
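
A sketch using scikit-learn's GaussianMixture, which runs EM internally (toy data assumed):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
# the E-step computes responsibilities; the M-step re-estimates
# means, covariances, and mixing coefficients, until convergence
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X))  # soft assignments: P(cluster | point)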

55

strengths of GMM

handles overlapping clusters, models clusters of various shapes, provides soft clustering

56

weaknesses of GMM

sensitive to initialization, requires number of clusters to be predefined

57

k-means strengths

simple and fast, effective for well separated clusters

58

k-means weaknesses

assumes clusters are spherical and equal in size, requires predefined number of clusters

59

DBSCAN strengths

detects clusters of arbitrary shapes, identifies noise and outliers effectively, no need to predefine the number of clusters

60

DBSCAN weaknesses

sensitive to parameter settings, struggles with varying density clusters

61

hierarchical clustering strengths

builds a hierarchy of clusters, no need to predefine the number of clusters, works well with small datasets

62

hierarchical clustering weaknesses

computationally expensive for large datasets, clusters can only be merged or split according to the linkage method (no reallocation), does not handle noise or outliers as effectively as DBSCAN

63

what is best for overlapping clusters or clusters with different shapes and densities, and is suitable for soft clustering?

GMM

64

what is ideal for quick clustering of well separated, spherical clusters when the number of clusters is known?

k-means

65

what is effective for detecting irregular shapes and handling noise but requires parameter tuning?

DBSCAN

66

what is used for visualizing cluster hierarchy (dendrogram) and works well for small datasets where computation is manageable?

hierarchical

