clustering
agglomerative hierarchical clustering
where each instance starts as its own cluster and the two closest clusters are merged iteratively until a single cluster remains (bottom up approach)
key concepts of k-means clustering
within-cluster dispersion, measured as the sum of squared distances from each data point to its centroid
what does k-means aim to do?
k-means aims to reduce the total within-cluster dispersion at each step
what approach does k-means use?
k-means uses a heuristic approach, providing quick but not always optimal solutions
advantages of k-means clustering?
speed, high scalability, easy to understand
challenges of using k-means clustering?
number of clusters must be pre-specified, sensitivity to initialization, sensitive to outliers
key issues of k-means?
assumes spherical clusters, is sensitive to differing cluster sizes and assumes they all have the same variance, and defines each cluster solely by its centroid location
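a minimal sketch of these ideas, assuming scikit-learn and a synthetic dataset from make_blobs (both assumptions, not part of the original notes): inertia_ is the within-cluster dispersion described above, n_clusters must be pre-specified, and the n_init reruns guard against bad initialization.

```python
# hedged sketch: k-means with scikit-learn on toy data
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three well separated, roughly spherical blobs (the setting where k-means works best)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)  # k is pre-specified
labels = km.fit_predict(X)

print("within-cluster dispersion (inertia):", km.inertia_)
print("centroids:\n", km.cluster_centers_)
```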
divisive hierarchical clustering
where all instances start in a single cluster that is split into two clusters iteratively until each cluster contains only one instance (top down approach)
single linkage clustering
the minimum distance between the nearest pair of records from two clusters
what are characteristics of single linkage clustering?
tends to cluster together, at an early stage, records that are distant from each other, creating elongated, sausage-like clusters
complete linkage clustering
the maximum distance between the farthest pair of records in two clusters
what are characteristics of complete linkage clusters?
forms clusters at early stages from records that lie within a narrow range of distances from each other, resulting in roughly spherical shapes
average linkage clustering (UPGMA)
based on the average distance between all possible pairs of records in two clusters
what are some characteristics of average linkage clustering?
average linkage relies on the actual distances between records and not just their order; transformation resilience: results are not affected by linear transformations of distances (as long as the order of distances remains unchanged)
what does average linkage clustering (UPGMA) stand for?
unweighted pair group method using averages
centroid linkage clustering
based on the distance between cluster centroids, where each centroid is the vector of mean values for each variable (the distance between two clusters is the distance between their two mean vectors)
average linkage vs centroid linkage
average linkage: all pairwise distances between records are computed and the average of those distances is taken. centroid linkage: only one distance is calculated, the distance between the mean vectors (centroids) of the two clusters
what is another name for centroid linkage?
(UPGMC) unweighted pair group method using centroids
what is the key feature of centroid linkage?
less computation because only the distance between centroids is calculated rather than all pairs
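a sketch of the four linkage criteria above, assuming SciPy is available (the toy data and variable names are illustrative, not from the original notes); each call builds the same agglomerative hierarchy but measures between-cluster distance differently.

```python
# hedged sketch: the four linkage criteria via scipy's agglomerative clustering
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))  # 10 toy records, 2 variables

Z_single = linkage(X, method="single")      # minimum distance between nearest pair
Z_complete = linkage(X, method="complete")  # maximum distance between farthest pair
Z_average = linkage(X, method="average")    # average of all pairwise distances (UPGMA)
Z_centroid = linkage(X, method="centroid")  # distance between cluster mean vectors (UPGMC)

# each Z is an (n-1) x 4 merge table: the two clusters joined, their distance, and the new size
print(Z_average)
```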
ward’s method
agglomerative clustering that joins records and clusters to form larger clusters, minimizing the loss of information when clusters are formed
key concept of ward’s method
error sum of squares (ESS), which measures the loss of information when individual records are replaced by a cluster mean
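to make the ESS idea concrete, a toy numpy sketch (the data and helper function are illustrative assumptions): merging two clusters raises the total ESS, and ward's method chooses the merge with the smallest increase.

```python
# hedged sketch: ESS increase ("loss of information") when two clusters are merged
import numpy as np

def ess(cluster):
    # sum of squared distances of each record to the cluster mean
    centroid = cluster.mean(axis=0)
    return ((cluster - centroid) ** 2).sum()

a = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]])  # tight cluster A
b = np.array([[5.0, 5.0], [5.1, 4.8]])              # tight cluster B

increase = ess(np.vstack([a, b])) - (ess(a) + ess(b))
print("ESS increase from merging A and B:", increase)
```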
dendrogram
a tree that shows the order in which clusters are grouped together and the distances between the clusters
what is a clade?
a branch of a dendrogram or a vertical line
what is a link?
a link is a horizontal line that connects two clades, whose height gives the distance between clusters
what is a leaf?
a leaf is the terminal end of each clade in a dendrogram which represents a single instance
single linkage dendrogram
joins clusters based on the minimum distance between the records
average linkage dendrogram
joins clusters based on the average distance between all pairs of records
what happens as the number of clusters decreases?
smaller clusters merge into larger clusters as the number of clusters decreases
what provides a visual representation of the hierarchical clustering process?
dendrograms
what are different ways to measure distances between clusters?
single, average, complete, and centroid linkage
how can the number of clusters be adjusted?
by cutting the dendrogram at different heights
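a sketch of building and cutting a dendrogram, assuming SciPy and matplotlib (the two-blob data is an illustrative assumption): cutting lower in the tree yields more, smaller clusters; cutting higher yields fewer, larger ones.

```python
# hedged sketch: dendrogram plus "cutting" it at different heights
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])  # two small blobs

Z = linkage(X, method="average")

dendrogram(Z)  # clades (vertical lines), links (horizontal lines), leaves (single instances)
plt.show()

print(fcluster(Z, t=0.3, criterion="distance"))  # low cut: more, smaller clusters
print(fcluster(Z, t=2.0, criterion="distance"))  # high cut: fewer, larger clusters
```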
advantages of hierarchical clustering
no need to pre-specify the number of clusters, dendrograms provide clear visual representation of clustering
limitations of hierarchical clustering
memory and computational cost, no reallocation of records, low stability, sensitivity to distance metrics, sensitivity to outliers
K-Means
centroid based clustering, minimizing the squared distances between data points and their respective cluster centroids, sensitive to initial placement of the centroids, assumes clusters are of roughly similar size and have equal variance
when does k-means work best?
when clusters are well separated, roughly spherical, and have similar sizes. may not perform well if the clusters have irregular shapes and densities
DBSCAN
density based clustering algorithm that identifies clusters as dense regions separated by areas of lower point density; can identify outliers and noise, and can find clusters of arbitrary shape and varying size
what does not assume a fixed number of clusters?
DBSCAN
core point
a point that has at least the specified number of points (MinPts) within a distance eps of it
border point
not a core point, but in the neighborhood of a core point
noise point
a point that is neither a core point nor a border point
DBSCAN algorithm
label points, eliminate noise, connect core points, form clusters, assign border points
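a sketch of these steps with scikit-learn's DBSCAN (the two-moons data and the eps/min_samples values are illustrative assumptions): noise points get the label -1, core points are listed in core_sample_indices_, and the remaining clustered points are border points.

```python
# hedged sketch: DBSCAN labeling core, border, and noise points
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # arbitrary (non-spherical) shapes

db = DBSCAN(eps=0.2, min_samples=5).fit(X)  # only two parameters

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
noise_mask = db.labels_ == -1
border_mask = ~core_mask & ~noise_mask

print("clusters found:", len(set(db.labels_)) - (1 if noise_mask.any() else 0))
print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", noise_mask.sum())
```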
when does DBSCAN not work well
sensitive to its parameters, varying densities (may merge dense clusters or miss sparse ones if they have different densities), struggles with high dimensional data
directly density reachable
not symmetric
density connected
is symmetric
DBSCAN pros
robust to outliers, only 2 parameters, no predefined cluster numbers, identifies arbitrary shapes, insensitive to point ordering
DBSCAN cons
cannot differentiate varying density clusters, not deterministic, parameter sensitivity
DBSCAN clustering approach
identifies clusters as regions of high density and labels low density regions as noise
gaussian mixture model
a probabilistic approach to clustering
assumption
data is generated from a mixture of multiple gaussian distributions
clusters
each gaussian distribution corresponds to a cluster and the model computes the probability of each data point belonging to a cluster
soft clustering
assigns each data point a probability of belonging to each cluster, offering flexible, probabilistic cluster assignments
expectation-maximization (EM) algorithm
used to find optimal parameters (means, covariances, and mixing coefficients) for a GMM
what are the steps of the EM algorithm?
E-step (expectation), M-step (maximization), iterating between these until the parameters converge
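a sketch of a GMM fit with scikit-learn's GaussianMixture, which runs EM internally (the blob data is an illustrative assumption); predict_proba returns the soft, probabilistic assignments described above.

```python
# hedged sketch: gaussian mixture model fit by EM, with soft cluster assignments
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7)
gmm.fit(X)  # alternates E-steps and M-steps until the parameters converge

print("mixing coefficients:", gmm.weights_)
print("means:\n", gmm.means_)
print("soft assignment of the first point:", gmm.predict_proba(X[:1]))
```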
strengths of GMM
handles overlapping clusters, models clusters of various shapes, provides soft clustering
weaknesses of GMM
sensitive to initialization, requires number of clusters to be predefined
k-means strengths
simple and fast, effective for well separated clusters
k-means weaknesses
assumes clusters are spherical and equal in size, requires predefined number of clusters
DBSCAN strengths
detects clusters of arbitrary shapes, identifies noise and outliers effectively, no need to predefine the number of clusters
DBSCAN weaknesses
sensitive to parameter settings, struggles with varying density clusters
hierarchical clustering strengths
builds a hierarchy of clusters, no need to predefine the number of clusters, works well with small datasets
hierarchical clustering weaknesses
computationally expensive for large datasets, assumes clusters are either merged or split based on the linkage method, does not handle noise or outliers as effectively as DBSCAN
what is best for overlapping clusters or clusters with different shapes and densities, and is suitable for soft clustering?
GMM
what is ideal for quick clustering of well separated, spherical clusters when the number of clusters is known?
k-means
what is effective for detecting irregular shapes and handling noise but requires parameter tuning?
DBSCAN
what is used for visualizing cluster hierarchy (dendrogram) and works well for small datasets where computation is manageable?
hierarchical