Clustering

76 Terms

1
New cards

Clustering Analysis

Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

2
New cards

Cluster

A collection of data objects

3
New cards

Similar (or related)

To one another within the same group

4
New cards

Dissimilar (or unrelated)

to the objects in other groups

5
New cards

Unsupervised Learning

No predefined classes

6
New cards

Stand-alone tool

To get insight into data distribution

7
New cards

Data reduction

  • Summarization

  • Compression

8
New cards

Summarization

Preprocessing for regression, PCA, classification, and association analysis

9
New cards

Compression

Image processing: vector quantization

10
New cards

Applications of Cluster Analysis

  • Data reduction

  • Hypothesis generation and testing

  • Prediction based on groups

  • Finding K-nearest Neighbors

  • Outlier detection

11
New cards

Basic Steps to Develop a Clustering Task

  • Feature selection

  • Proximity measure

  • Clustering criterion

  • Clustering algorithms

  • Validation of the results

  • Interpretation of the results

12
New cards

Feature Selection

  • Select information concerning the task of interest

  • Minimal information redundancy

13
New cards

Proximity measure

The similarity of two feature vectors

14
New cards

Clustering criterion

Expressed via a cost function or some rules

15
New cards

Clustering algorithms

 Choice of algorithms

16
New cards

Validation of the results

Validation test (also, clustering tendency test)

17
New cards

Interpretation of the results

Integration with applications

18
New cards

Good Clustering

A _______ method produces high-quality clusters with high intra-class similarity (cohesive within clusters) and low inter-class similarity (distinctive between clusters).

19
New cards

Quality

The _____ depends on the similarity measure used, its implementation, and the method’s ability to discover hidden patterns.

20
New cards

Dissimilarity/Similarity Metric

Quality of clustering is measured using a ________, often expressed as a distance function (e.g., metric: d(i, j))

21
New cards

Partitioning criteria

Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)

22
New cards

Separation of clusters

Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)

23
New cards

Similarity measure

Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)

24
New cards

Clustering Space

Full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)

25
New cards

Requirements and Challenges

  • Scalability

  • Ability to deal with different types of attributes

  • Constraints-based clustering

  • Interpretability and usability

26
New cards

Scalability

Clustering all the data instead of only samples

27
New cards

Ability to deal with different types of attributes

Numerical, binary, categorical, ordinal, linked, and mixture of these

28
New cards

Constraint-based clustering

  • User may give inputs on constraints

  • Use domain knowledge to determine input parameters

29
New cards

Major Clustering Approaches

  • Partitioning approach

  • Hierarchical approach

  • Density-based approach

  • Grid-based approach

  • Model-based

  • Frequent pattern-based

  • User-guided or constraint-based

  • Link-based clustering

30
New cards

Partitioning approach

  • Construct various _____ and then evaluate them by some criterion, e.g., minimizing the sum of square errors

  • Typical methods: k-means, k-medoids, CLARANS

31
New cards

Hierarchical approach

  • Create a _____ decomposition of the set of data (or objects) using some criterion

  • Typical methods: DIANA, AGNES, BIRCH, CHAMELEON

32
New cards

Density-based approach

  •  Based on connectivity and ____ functions

  • Typical methods: DBSCAN, OPTICS, DenClue

33
New cards

Grid-based approach

  • Based on a multiple-level granularity structure

  • Typical methods: STING, WaveCluster, CLIQUE

34
New cards

Model-based

A ____ is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model

Typical methods: EM, SOM, COBWEB
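
For instance, the EM approach to model-based clustering is available through scikit-learn's GaussianMixture; a minimal sketch, assuming illustrative blob data and 3 mixture components:

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative data

    # Each cluster is hypothesized to be one Gaussian component; EM fits the mixture
    gm = GaussianMixture(n_components=3, random_state=0).fit(X)
    print(gm.predict(X)[:10])  # cluster assignments under the best-fit model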

35
New cards

Frequent pattern-based

  • Based on the analysis of frequent patterns

  • Typical methods: p-Cluster

36
New cards

User-guided or constraint-based

  • Clustering by considering _______ or application-specific constraints

  • Typical methods: COD (obstacles), constrained clustering

37
New cards

Link-based clustering

  • Objects are often linked together in various ways

  • Massive links can be used to cluster objects: SimRank, LinkClus

38
New cards

Weakness of the K-Means method

  • Applicable only to objects in a continuous n-dimensional space

  • Need to specify k, the number of clusters, in advance

  • Sensitive to noisy data and outliers

  • Not suitable to discover clusters with non-convex shapes
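
These weaknesses show up directly in a minimal scikit-learn sketch (the two-blob toy data and k = 2 are illustrative assumptions): k must be supplied in advance, and the fitted model exposes the sum of squared errors it minimizes.

    import numpy as np
    from sklearn.cluster import KMeans

    # Illustrative toy data: two Gaussian blobs in continuous 2-D space
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

    # k (n_clusters) must be specified in advance
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(X)

    print(km.inertia_)  # the sum of squared errors that k-means minimizes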

39
New cards

Hierarchical Clustering

It is an alternative approach to k-means clustering for identifying groups in a data set.

40
New cards

Dendrogram

A tree-based representation used to visualize hierarchical clustering results.

41
New cards

Agglomerative clustering

Commonly referred to as AGNES. It works in a bottom-up manner.

42
New cards

AGNES (Agglomerative Nesting)

It is a hierarchical clustering method introduced by Kaufmann and Rousseeuw in 1990 and implemented in statistical packages like Splus. It uses the single-link method and a dissimilarity matrix to merge nodes with the least dissimilarity, proceeding in a non-descending fashion until all nodes are merged into a single cluster.

43
New cards

Types of Hierarchical Clustering

  • Single Linkage

  • Complete Linkage

44
New cards

Single Linkage

At each step, we merge the two clusters whose two closest members have the smallest distance.

45
New cards

Complete Linkage

At each step, we merge the two clusters whose merger yields the smallest maximum pairwise distance.
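
A minimal SciPy sketch contrasting the two linkage rules (the five 2-D points and the cut into two flat clusters are illustrative assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [10., 0.]])  # illustrative points

    Z_single = linkage(X, method='single')      # merge by smallest minimum pairwise distance
    Z_complete = linkage(X, method='complete')  # merge by smallest maximum pairwise distance

    # Cut each tree into two flat clusters and compare the groupings
    print(fcluster(Z_single, t=2, criterion='maxclust'))
    print(fcluster(Z_complete, t=2, criterion='maxclust'))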

46
New cards

DIANA (Divisive Analysis)

Introduced by Kaufmann and Rousseeuw in 1990, is a hierarchical clustering method implemented in statistical packages like Splus. It follows the inverse order of AGNES, starting with all nodes in one cluster and progressively dividing them until each node forms its own individual cluster.

47
New cards

Distance between Clusters

  • Single Link

  • Complete Link

  • Average

  • Centroid

  • Medoid

48
New cards

Single Link

Smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(dist(tip, tjq)), where tip ∈ Ki and tjq ∈ Kj

49
New cards

Complete link

Largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(dist(tip, tjq)), where tip ∈ Ki and tjq ∈ Kj

50
New cards

Average

Average distance between an element in one cluster and an element in the other

51
New cards

Medoid

A chosen, centrally located object in the cluster

52
New cards

Divisive hierarchical clustering

It is a top-down hierarchical clustering method.

  1. It begins with all observations in a single cluster.

  2. At each step, the most heterogeneous cluster is split into two.

  3. This process continues until each observation forms its own cluster.

53
New cards

Centroid

The middle of a cluster

54
New cards

Radius

The square root of the average squared distance from any point of the cluster to its centroid

55
New cards

Diameter

The square root of the average squared distance between all pairs of points in the cluster
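
A minimal NumPy sketch computing all three quantities for one cluster, following the definitions on these cards (the four sample points are an illustrative assumption):

    import numpy as np

    X = np.array([[1., 2.], [2., 1.], [3., 3.], [2., 2.]])  # one cluster of N points
    N = len(X)

    centroid = X.mean(axis=0)  # the "middle" of the cluster: per-dimension mean
    radius = np.sqrt(((X - centroid) ** 2).sum(axis=1).mean())  # sqrt of mean squared distance to centroid

    pair_sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared distances over all ordered pairs
    diameter = np.sqrt(pair_sq.sum() / (N * (N - 1)))              # sqrt of the mean over N(N-1) pairs

    print(centroid, radius, diameter)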

56
New cards

Density-Based Clustering Methods

  • Clustering based on _____, specifically _____ points.

  • Major features:

    1. Discover clusters of arbitrary shapes.

    2. Handle noise.

    3. One scan required.

    4. Density parameters are needed for termination.

57
New cards

Two Parameters of Density-Based Clustering

  • Eps: Maximum radius of the neighbourhood 

  • MinPts: Minimum number of points in an Eps-neighbourhood of that point

58
New cards

Directly density-reachable

A point p is _______ from a point q with respect to Eps and MinPts if:

  1. p is within the Eps neighborhood of q (i.e., the distance between p and q is less than or equal to Eps).

  2. q has at least MinPts points within its Eps neighborhood (including q itself).

59
New cards

Density-reachable

A point p is _______ from a point q with respect to Eps and MinPts if there is a chain of points p₁, …, pₙ where p₁ = q and pₙ = p, such that each point pᵢ₊₁ is directly density-reachable from pᵢ.

60
New cards

Density-connected

A point p is density-connected to a point q with respect to Eps and MinPts if there exists a point o such that both p and q are density-reachable from o with respect to Eps and MinPts.

61
New cards

DBSCAN

Density-Based Spatial Clustering of Applications with Noise. It relies on a density-based notion of cluster, where a cluster is defined as a maximal set of density-connected points.

62
New cards

Main features of DBSCAN

DBSCAN discovers clusters of arbitrary shape in spatial databases and is capable of handling noise.
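
A minimal scikit-learn sketch, where the eps and min_samples parameters play the roles of Eps and MinPts (the two-moons data and the parameter values are illustrative assumptions):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaving half-circles: arbitrary-shaped clusters k-means cannot separate
    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

    labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)  # eps ~ Eps, min_samples ~ MinPts
    print(set(labels))  # label -1 marks noise points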

63
New cards

OPTICS

Ordering Points To Identify the Clustering Structure. Introduced by Ankerst, Breunig, Kriegel, and Sander in 1999 at SIGMOD.

64
New cards

Special order of the Database

OPTICS produces a ________ with respect to its density-based clustering structure. This cluster-ordering contains information equivalent to density-based clusterings across a wide range of parameter settings.

65
New cards

Key characteristics of grid-based clustering methods

These methods use a multi-resolution grid data structure. Notable methods include:

  • STING (Statistical Information Grid) by Wang, Yang, and Muntz (1997)

  • CLIQUE by Agrawal et al. (SIGMOD’98) — supports both grid-based and subspace clustering

  • WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98) — employs a multi-resolution clustering approach using wavelet methods.

66
New cards

STING (Statistical Information Grid Approach)

Authored by Wang, Yang, and Muntz (VLDB’97).
Key Features:

  • The spatial area is divided into rectangular cells.

  • Multiple levels of cells correspond to different levels of resolution.

67
New cards

Key Features of the STING clustering method

  • Cell Hierarchy

  • Statistical Information

  • Top-Down Approach

68
New cards

STING Advantages

  • Query-independent.

  • Easy to parallelize.

  • Incremental updates possible.

  • Time complexity of O(K), where K is the number of grid cells at the lowest level.

69
New cards

STING Disadvantages

All cluster boundaries are either horizontal or vertical; diagonal boundaries are not detected.

70
New cards

CLIQUE (Clustering In QUEst)

Authored by Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98). Automatically identifies subspaces of high-dimensional data space for improved clustering compared to the original space. Considered both density-based and grid-based.

71
New cards

External Measures

  • Supervised measures that use criteria not inherent to the dataset.

  • Compare a clustering against prior or expert-specified knowledge (ground truth) using specific quality metrics.

72
New cards

Internal Measures

  • Unsupervised measures derived from the data itself.

  • Evaluate clustering quality based on cluster separation and compactness (e.g., Silhouette coefficient).
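
A minimal sketch of an internal measure using scikit-learn's silhouette score (the blob data and k = 3 are illustrative assumptions):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Near +1: compact, well-separated clusters; near 0 or negative: poor clustering
    print(silhouette_score(X, labels))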

73
New cards

Matching-Based Measures

  • Purity

  • Maximum Matching

  • F-Measure
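
Purity, for example, is the fraction of points assigned to the majority ground-truth class of their cluster; a minimal sketch via the contingency table (the toy labelings are illustrative assumptions):

    from sklearn.metrics.cluster import contingency_matrix

    y_true = [0, 0, 0, 1, 1, 1, 2, 2]  # illustrative ground truth
    y_pred = [0, 0, 1, 1, 1, 1, 2, 2]  # illustrative clustering

    cm = contingency_matrix(y_true, y_pred)   # classes x clusters counts
    purity = cm.max(axis=0).sum() / cm.sum()  # majority class per cluster, summed over clusters
    print(purity)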

74
New cards

Entropy-Based Measures

  • Conditional Entropy

  • Normalized Mutual Information (NMI)

  • Variation of Information
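
NMI, for instance, is available directly in scikit-learn; a minimal sketch (the toy labelings are illustrative assumptions):

    from sklearn.metrics import normalized_mutual_info_score

    y_true = [0, 0, 0, 1, 1, 1, 2, 2]  # illustrative ground truth
    y_pred = [0, 0, 1, 1, 1, 1, 2, 2]  # illustrative clustering

    # 1.0 means the clustering and the ground truth share all information
    print(normalized_mutual_info_score(y_true, y_pred))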

75
New cards

Pair-Wise Measures

  • True Positives (TP)

  • False Negatives (FN)

  • False Positives (FP)

  • True Negatives (TN)

  • Jaccard Coefficient

  • Rand Statistic

  • Fowlkes-Mallows Measure
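
These measures treat every pair of points as positive (same cluster) or negative (different clusters); a minimal scikit-learn sketch deriving the Jaccard coefficient from the pair confusion matrix, alongside the built-in Rand and Fowlkes-Mallows scores (the toy labelings are illustrative assumptions):

    from sklearn.metrics import rand_score, fowlkes_mallows_score
    from sklearn.metrics.cluster import pair_confusion_matrix

    y_true = [0, 0, 0, 1, 1, 1, 2, 2]  # illustrative ground truth
    y_pred = [0, 0, 1, 1, 1, 1, 2, 2]  # illustrative clustering

    (tn, fp), (fn, tp) = pair_confusion_matrix(y_true, y_pred)  # counts over point pairs
    print(tp / (tp + fn + fp))                   # Jaccard coefficient: TP / (TP + FN + FP)
    print(rand_score(y_true, y_pred))            # Rand statistic: (TP + TN) / all pairs
    print(fowlkes_mallows_score(y_true, y_pred))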

76
New cards

Correlation Measures

  • Discretized Hubert Statistic

  • Normalized Discretized Hubert Statistic