the partitional clustering problem can be framed as the following optimisation problem:
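the objective itself was in the original picture; a standard way to write it (notation mine: Cj denotes the centroid of cluster j):

```latex
% optimisation problem (1): choose the k clusters (and hence their centroids)
% that minimise the within-cluster sum of squares (wcss)
\min_{\text{clusters } 1,\dots,k}\;
\underbrace{\sum_{j=1}^{k} \sum_{x^{(i)} \in \text{cluster } j}
\bigl\| x^{(i)} - C_j \bigr\|_2^2}_{\mathrm{wcss}}
\qquad \text{where } C_j = \text{mean of the examples in cluster } j \tag{1}
```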
k-means
= iterative, greedy descent algorithm - aims to find a (possibly sub-optimal) solution to the optimisation problem in (1)
k-means iteratively alternates between the following 2 steps:
Assignment step: for a given set of k cluster centroids, k-means assigns each example to the cluster with the closest centroid
(see pic: each x(i) goes to the cluster j with the smallest ||x(i) − Cj||²) = equivalent to fixing the centroids + optimising the cluster assignment to ensure low wcss
Refitting step: re-evaluate + update each cluster centroid as the mean of the examples currently assigned to it
= equivalent to fixing the cluster assignment + optimising the centroids to ensure low wcss
consequently, over iterations, wcss continues to decrease
the algorithm: k-means
input: the number k of clusters + N examples (x(1), …, x(N))
select k of the N examples at random as initial centroids + label them as C1, …, Ck
repeat until the cluster centroids don’t change:
(assignment step) form k clusters by assigning each observation to its closest centroid, i.e. x(i) goes to the cluster j with the smallest ||x(i) − Cj||² (see pic)
(refitting step) compute the centroid of each of the k obtained clusters as the mean of its members, i.e. Cj = (1/|cluster j|) Σ x(i) over the x(i) in cluster j (see pic)
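a minimal NumPy sketch of this loop (the function name, signature and empty-cluster guard are my own choices, not from the notes):

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Plain k-means. X: (N, m) array. Returns (centroids, labels, wcss)."""
    rng = np.random.default_rng(rng)
    # select k of the N examples at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # assignment step: squared Euclidean distance of every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # refitting step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid -- an assumption of mine)
        new_centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                  else centroids[j] for j in range(k)])
        # stopping criterion: the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    wcss = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, wcss
```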
points to note
the initial cluster centroids are chosen at random from the data examples
however, centroids found by k-means over the iterations need not be data examples
cluster assignment = based on squared Euclidean distance
stopping criterion: stop when cluster centroids don’t change
illustration
example 1: clustering of medicines
problem: use the k-means algorithm to cluster the following medicines into k=2 clusters
iteration 0: let the initial centroids of cluster 1 and cluster 2 be chosen as Med A and Med B respectively, e.g. C1=(1,1) and C2=(2,1)
iteration 1:
calculate the squared euclidean distance of each point to each centroid to form an object-distance matrix
based on the matrix, assign each medicine to its closest centroid. Thus, medicines B, C and D are assigned to cluster 2
Iteration 1 cont:
update the centroids of the clusters obtained in the previous step → results in the new centroids (see pic - green + symbols)
iteration 2:
calculate the distance of each point to the new cluster centroids → med B is thus moved to cluster 1 - as it is only 1 away from cluster 1 but 5.58 away from cluster 2
update the centroids of the clusters
iteration 3:
repeat the same steps as before
note that the cluster assignments don’t change
algorithm has converged
converged = reached a steady/stable state - further iterations no longer change the cluster assignments or centroids
space + time complexity of k-means
- space requirement = modest → only data observations + centroids are stored
- storage complexity = of order O((N+k)m) where m = no. of feature attributes
- time complexity of k-means = of order O(IkNm) where I = no. of iterations required for convergence
- importantly, time complexity of k-means is linear in N
space complexity = amount of memory required by an algorithm to solve a problem → a function of the input size
storage complexity = amount of memory to store the input data itself as well as additional memory used by the algorithm
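a back-of-envelope check of the storage formula, with sizes invented purely for illustration:

```python
# storage O((N + k) m): the data observations plus the centroids,
# assuming 8-byte floats and hypothetical sizes (not from the notes)
N, k, m = 1_000_000, 100, 10
print(f"{(N + k) * m * 8 / 1e6:.0f} MB")  # ≈ 80 MB for data + centroids
```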
challenges + issues in k-means
does the k-means algorithm always converge?
can it always find optimal clustering?
how should we initialise the algorithm?
how should we choose the no. of clusters?
convergence of k-means
at each iteration, the assignment + refitting steps ensure that the wcss in (1) monotonically decreases
moreover, k-means works over a finite number (at most k^N) of possible partitions of the data
above 2 conditions ensure that the k-means algorithm always converges
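spelled out (notation mine): write the cluster assignment at iteration t as A(t) and the centroids as C(t); each half-step can only lower the objective, and a monotone sequence over finitely many configurations must settle:

```latex
% assignment step (centroids fixed): every point moves to its nearest centroid
\mathrm{wcss}\bigl(A^{(t+1)}, C^{(t)}\bigr) \le \mathrm{wcss}\bigl(A^{(t)}, C^{(t)}\bigr)
% refitting step (assignment fixed): the mean minimises the sum of squared distances
\mathrm{wcss}\bigl(A^{(t+1)}, C^{(t+1)}\bigr) \le \mathrm{wcss}\bigl(A^{(t+1)}, C^{(t)}\bigr)
```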
BUT → the optimisation problem in (1) is non-convex
as such, the k-means algorithm may not converge to the global minimum, but to a local minimum!
solution: escaping local minima: multiple random restarts + choose the clustering with lowest wcss
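a minimal sketch of the restart wrapper, reusing the kmeans() function sketched earlier (wrapper name mine):

```python
import numpy as np

def kmeans_restarts(X, k, n_restarts=10, seed=None):
    """Run k-means from several random initialisations; keep the lowest-wcss run."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        # kmeans() is the sketch from earlier in these notes
        centroids, labels, wcss = kmeans(X, k, rng=rng)
        if best is None or wcss < best[2]:
            best = (centroids, labels, wcss)
    return best  # (centroids, labels, wcss) of the best restart
```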
choice of initial cluster centroids
choosing initial cluster centroids is crucial for k-means algorithm
diff initialisations may lead to convergence to diff local optima
k-means is a non-deterministic algorithm: the outcome is not entirely predictable based solely on the input + the algorithm's logic - it can produce diff results for the same input under diff executions
solutions
run the k-means algorithm multiple times, each starting from randomly chosen cluster centroids - choose the cluster assignment that has the minimum wcss (as sketched above)
specialised initialisation strategies : k-means++
k-means++
choose 1st centroid C1 at random
repeat until k centroids found:
For each data point x, compute the distance d(x) = dEuc(x, c) to its nearest centroid c.
Choose a new data point x randomly with probability proportional to d(x)² as the next centroid.
(thus, a data point x that is far from all the centroids chosen so far is highly likely to be picked as the next centroid)
use the obtained k centroids as the initial centroids for the k-means algorithm, then run k-means as usual
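a sketch of the seeding step (function name mine); ordinary k-means is then started from the returned centroids:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """k-means++ seeding: pick k well-spread initial centroids from the rows of X."""
    rng = np.random.default_rng(seed)
    # 1st centroid: chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # d(x)^2: squared distance of each point to its nearest centroid so far
        d2 = ((X[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # next centroid: sampled with probability proportional to d(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)
```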
choice of the number of k clusters
conventional approach: use prior domain knowledge
example: data segmentation - a company determines the no. of clusters into which its employees must be segmented
a data-based approach for estimating the number of natural clusters in the data set: elbow method
elbow method
run k-means algorithm repeatedly for increasing values of k
evaluate the wcss of the obtained clustering structure in each run of k-means
plot the wcss of the obtained clustering as a function of the number k of clusters
optimal k lies at the elbow of the plot
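a short sketch of the procedure, assuming the kmeans() function from earlier and matplotlib:

```python
import matplotlib.pyplot as plt

# X = (N, m) data array; kmeans() as sketched earlier in these notes
ks = list(range(1, 11))
wcss_values = [kmeans(X, k)[2] for k in ks]
plt.plot(ks, wcss_values, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("wcss")
plt.show()  # look for the 'elbow' where the curve flattens
```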
drawbacks of elbow method:
heuristic concept
not easy to identify the elbow
elbow method cont.:
when k < the no. of distinct clusters (K*), adding more clusters = substantial decrease in wcss - each cluster captures a subset of a true cluster, making clusters tighter + ↓ wcss; once k ≥ K*, extra clusters only split natural clusters further, so the decrease in wcss flattens out - this produces the elbow
other limitations of k-means
problems when outliers are present - outliers can influence the clusters that are found + also ↑ the wcss
k-means has problems when clusters are of differing:
sizes
densities
non-globular shapes
one solution = to find a large no. of clusters such that each of them represents a part of a natural cluster. But these small clusters need to be merged in a post-processing step
application of k-means
k-means algorithm is a key tool in the area of image + signal compression - particularly in vector quantisation
vector quantisation via k-means
break the left image into small blocks of 2×2 pixels
results in 512×512 blocks of 4 numbers each, each block regarded as a (feature) vector in ℝ4
run the k-means algorithm in the resulting space of feature vectors - this clustering process is called the encoding step in VQ + the collection of centroids is called the codebook
centre image uses k=200 while right image uses k=4
to represent the compressed image, approximate each of the 512×512 blocks by its closest cluster centroid, known as its codeword. Then construct the image from the centroids, a process called decoding.
storage requirements:
for each block, the identity of the cluster centroid needs to be stored
this requires log2(k) bits per block, which equals log2(k)/4 bits per pixel
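a sketch of the whole encode/decode pipeline, assuming a grayscale image whose sides are multiples of 2 and using scikit-learn's KMeans (the helper name is mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def vq_compress(img, k, block=2):
    """Vector quantisation: (block x block) patches -> R^(block^2) vectors -> k codewords."""
    h, w = img.shape
    # encoding: split the image into patches, one feature vector per patch
    patches = (img.reshape(h // block, block, w // block, block)
                  .swapaxes(1, 2).reshape(-1, block * block))
    km = KMeans(n_clusters=k, n_init=10).fit(patches)  # codebook = cluster centroids
    codes = km.predict(patches)                        # log2(k) bits per block
    # decoding: replace every patch by its codeword and reassemble the image
    rec = km.cluster_centers_[codes].reshape(h // block, w // block, block, block)
    return rec.swapaxes(1, 2).reshape(h, w)
```

assuming an 8-bit original (8 bits per pixel), k=200 costs log2(200) ≈ 7.6 bits per block ≈ 1.9 bits per pixel, while k=4 costs only 0.5 bits per pixel (codebook storage ignored).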