Ch9: Clustering


13 Terms

1

What is clustering in data mining?

A: An unsupervised learning method used to automatically group data points into clusters based on similarity, without using labeled data.

2

What are common use cases of clustering?

A:

  • Organizing documents

  • Customer segmentation

  • Web search result grouping

  • Image segmentation

  • Shopping pattern discovery

3

How is an instance typically represented in clustering?

A: As a point in n-dimensional space:

x = ⟨ a₁(x), a₂(x), ..., aₙ(x) ⟩
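
As a concrete illustration (a sketch assuming NumPy; the feature values are invented), an instance is simply an n-dimensional vector, and a dataset is a matrix with one row per instance:

  import numpy as np

  # One instance x = ⟨a₁(x), a₂(x), ..., aₙ(x)⟩ as an n-dimensional vector (n = 4 here)
  x = np.array([5.1, 3.5, 1.4, 0.2])

  # A dataset of m instances is an (m, n) array, one row per point
  X = np.array([[5.1, 3.5, 1.4, 0.2],
                [4.9, 3.0, 1.4, 0.2],
                [6.3, 3.3, 6.0, 2.5]])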

4

What is a common distance metric for measuring similarity in clustering?

A: Euclidean distance, defined as:

d(x, y) = √( (a₁(x) − a₁(y))² + (a₂(x) − a₂(y))² + ... + (aₙ(x) − aₙ(y))² )
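
A quick check of this formula in Python (a sketch assuming NumPy; the helper name and example points are purely illustrative):

  import numpy as np

  def euclidean_distance(x, y):
      """Square root of the summed squared per-attribute differences."""
      x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
      return np.sqrt(np.sum((x - y) ** 2))

  print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0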

5

What is the K-means clustering algorithm?

A: An iterative algorithm that partitions data into K clusters by minimizing the distance between data points and their assigned cluster centroids.
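
In practice K-means is usually run through a library; a minimal sketch using scikit-learn's KMeans (the synthetic 2-D blobs and K = 3 are chosen only for illustration):

  import numpy as np
  from sklearn.cluster import KMeans

  rng = np.random.default_rng(0)
  # Three synthetic 2-D blobs of 50 points each
  X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                 for c in ([0, 0], [5, 5], [0, 5])])

  kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
  print(kmeans.cluster_centers_)  # the K cluster centroids
  print(kmeans.labels_[:10])      # cluster index assigned to the first 10 points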

6

What are the steps of the K-means algorithm?

A:

  1. Initialize K cluster centers

  2. Assign each point to the nearest cluster

  3. Update each cluster center to be the mean of its assigned points

  4. Repeat steps 2–3 until assignments no longer change
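
A from-scratch sketch of these four steps (assuming NumPy; initializing the centers by sampling K data points at random is just one common choice, and the function name is illustrative):

  import numpy as np

  def kmeans(X, k, max_iter=100, seed=0):
      """Plain K-means following the four steps above."""
      rng = np.random.default_rng(seed)
      X = np.asarray(X, dtype=float)

      # Step 1: initialize K cluster centers by picking K distinct data points
      centers = X[rng.choice(len(X), size=k, replace=False)].copy()
      labels = np.full(len(X), -1)

      for _ in range(max_iter):
          # Step 2: assign each point to its nearest center (squared Euclidean distance)
          dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
          new_labels = dists.argmin(axis=1)

          # Step 4: stop once assignments no longer change
          if np.array_equal(new_labels, labels):
              break
          labels = new_labels

          # Step 3: move each center to the mean of the points assigned to it
          for j in range(k):
              if np.any(labels == j):      # leave a center alone if its cluster is empty
                  centers[j] = X[labels == j].mean(axis=0)

      return centers, labels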

7

What does K-means minimize during training?

A: The sum of squared distances from each data point to its assigned cluster center (within-cluster variance).
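
Given centers and assignments from a run like the sketch above, this objective (scikit-learn exposes it as inertia_) can be computed directly; the helper below is purely illustrative:

  import numpy as np

  def within_cluster_sse(X, centers, labels):
      """Sum of squared distances from each point to its assigned cluster center."""
      X = np.asarray(X, dtype=float)
      diffs = X - centers[labels]      # each point minus its own cluster's center
      return float((diffs ** 2).sum())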

8

Is K-means guaranteed to converge?

A: Yes — it converges in a finite number of steps, though not necessarily to a global optimum.

9

What is the computational complexity (running time) of K-means?

A:

  • O(k⋅n) per iteration for assigning points to clusters

  • O(n) per iteration for updating the cluster means

10

What distance properties must the similarity measure satisfy?

A:

  1. Symmetry: D(A,B)=D(B,A)

  2. Positivity: D(A,B) ≥ 0, and D(A,B) = 0 if and only if A = B

  3. Triangle inequality: D(A,C)≤D(A,B)+D(B,C)
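
A quick, purely illustrative check that Euclidean distance satisfies all three properties on random points (assuming NumPy):

  import numpy as np

  rng = np.random.default_rng(0)
  A, B, C = rng.normal(size=(3, 4))    # three random 4-dimensional points

  def d(p, q):
      return np.linalg.norm(p - q)     # Euclidean distance

  assert np.isclose(d(A, B), d(B, A))                 # 1. symmetry
  assert d(A, B) >= 0 and np.isclose(d(A, A), 0.0)    # 2. positivity
  assert d(A, C) <= d(A, B) + d(B, C) + 1e-12         # 3. triangle inequality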

11

How is the number of clusters K determined?

A:

  • It is not well defined and often problem-dependent.

  • A common heuristic: look for a "kink" (elbow) in a plot of the objective function value against K (see the sketch below)
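
A sketch of that heuristic with scikit-learn: fit K-means for a range of candidate K values, record the within-cluster objective (inertia_), and look for where it stops dropping sharply (the helper name is illustrative):

  from sklearn.cluster import KMeans

  def inertia_curve(X, k_values):
      """Within-cluster sum of squares for each candidate K; eyeball the elbow."""
      return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
              for k in k_values}

  # e.g. inertia_curve(X, range(1, 11)) → plot the values against K and look for the kink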

12

What are the main advantages of K-means clustering?

A:

  • Simple and intuitive

  • Efficient on large datasets

  • Guaranteed to converge

13

What are the main disadvantages of K-means?

A:

  • Sensitive to outliers and noise

  • Struggles with non-spherical or non-convex cluster shapes

  • May converge to local optimum

  • Requires pre-specifying K