supervised learning
labelled observations - pairs of data points, each pair consists of a feature vector (x) + its corresponding output label (y) which are related according to an unknown function f(x) = y
during training - use these pairs to learn the relationship between x + y, i.e. find a function (or model) h(x) that best fits the observations
goal = ensure that the learned model h(x) accurately predicts the output label of a previously unseen, test feature input (generalisation)
labels = ‘teachers’ during training + ‘validator’ of results during testing
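a minimal sketch of this train/test idea, assuming numpy; the labelled pairs and the unseen test input are hypothetical toy values, and a least-squares line stands in for the learned model h(x):

```python
import numpy as np

# hypothetical labelled observations (x, y), generated by some unknown f(x) = y
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# training: fit a model h(x) (here just a straight line) to the labelled pairs
slope, intercept = np.polyfit(x_train, y_train, deg=1)
h = lambda x: slope * x + intercept

# testing: predict the label of a previously unseen feature input (generalisation)
x_test = 5.0
print(f"h({x_test}) = {h(x_test):.2f}")
```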
unsupervised learning
unlabelled data set of feature vectors
what can we deduce?
even without labels, can still learn some interesting stuff from this data:
clustering - find subgroups or clusters among feature vectors with similar traits
like a big box of toys + putting similar toys together
dimensionality reduction - find patterns within feature vectors to identify a lower dimensional representation
simplifying a big puzzle by finding patterns on the pieces
CLUSTERING - real world applications
google news
market segmentation
social network analysis
clusters = potential ‘classes’; clustering algorithms automatically find ‘classes’
DIMENSIONALITY REDUCTION - application
image compression
techniques:
principal component analysis - finds main themes in data
non negative matrix factorisation - breaks down complex mixtures
linear discriminant analysis - helps you separate diff groups based on their features
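a minimal sketch of dimensionality reduction with PCA, assuming scikit-learn is available; the random data matrix and the choice of 2 components are placeholder assumptions, not values from the notes:

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical high-dimensional data: 100 observations with 50 feature attributes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# project onto the 2 principal components ("main themes") with the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2): lower dimensional representation
print(pca.explained_variance_ratio_)    # fraction of variance kept by each component
```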
why unsupervised learning?
labelled data = £££ + difficult to collect - unlabelled = cheap + abundant
compressed representation - saves on storage + computation
reduce noise, irrelevant attributes in high dimensional data
pre processing step for supervised learning
often used more in exploratory data analysis
challenges in unsupervised learning
no simple goal as in supervised learning
validation of results is subjective
fundamentals of clustering
goal: find natural groupings among observations/objects/feature vectors
segment observations into clusters/groups such that:
objects within a cluster have high similarity (high intra-cluster similarity)
objects across clusters have low similarity (low inter-cluster similarity)
example 1: clustering of mammals
challenge 1: identifying ‘similar’ feature vectors:
solution = use proximity indices to quantify the strength of a relationship between any 2 feature vectors
depends on data type:
continuous valued feature attributes
e.g. x = (0.1, 11.65, 15, 1.4)
nominal feature attributes
e.g. x = (‘large’, ‘medium’, ‘small’)
must be mapped to discrete values consistently e.g. ‘large’ → 3, ‘medium’ → 2, ‘small’ → 1
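a tiny sketch of applying that mapping consistently in python (the mapping itself is the one given above):

```python
# consistent mapping of nominal feature values to discrete numbers
size_map = {"large": 3, "medium": 2, "small": 1}

x_nominal = ("large", "medium", "small")
x_numeric = tuple(size_map[v] for v in x_nominal)
print(x_numeric)   # (3, 2, 1)
```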
continued
continuous valued features: for 2 continuous valued feature vectors x(1) = (x1(1), …, xm(1)) and x(2) = (x1(2), …, xm(2)), the proximity index can be any of the following distances
euclidean distance
Manhattan distance
chebychev distance
example 1 applied: see pic
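a minimal sketch of the three proximity indices in python, assuming numpy; x1 reuses the example vector from above, while x2 is a hypothetical second vector (not the mammal data from the pic):

```python
import numpy as np

# two continuous-valued feature vectors (x2 is hypothetical)
x1 = np.array([0.1, 11.65, 15.0, 1.4])
x2 = np.array([0.3, 10.20, 12.0, 2.0])

# euclidean distance: square root of the sum of squared differences
d_euc = np.sqrt(np.sum((x1 - x2) ** 2))

# manhattan distance: sum of absolute differences
d_man = np.sum(np.abs(x1 - x2))

# chebychev distance: maximum absolute difference over all attributes
d_che = np.max(np.abs(x1 - x2))

print(d_euc, d_man, d_che)
```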
continued pt 2
properties of distance measures:
distance between 2 points is always non negative
distance between a point to itself is 0
distance measure is symmetric
distance measure satisfies a triangle inequality
main takeaways:
- diff choice of distance functions yields diff measures of similarity
- distance measures implicitly assign more weighting to features with large ranges than those with small ranges
- rule of thumb: when no prior domain knowledge is available → clustering should follow the principle of equal weighting to each attribute
this necessitates feature scaling of vectors
challenge 2: feature scaling
when feature attributes have large differences in the range of values, scaling of features = essential - so that each attribute contributes equally to the distance measure
2 well studied approaches:
min-max normalisation
z score standardisation
minimum maximum normalisation
all feature attributes are rescaled to lie in the [0, 1] range
steps:
consider N feature vectors x(1) = (x1(1), …, xm(1)), x(2) = (x1(2), …, xm(2)), …, x(N) = (x1(N), …, xm(N))
for each jth feature attribute, for j = 1, …, m, compute
max. of jth feature over all N vectors:
xj,max = max xj(i), i = 1, …, N
min. of jth feature over all N vectors:
xj,min = min xj(i), i = 1, …, N
min-max rescaling of jth attribute in each of the i = 1, …, N vectors:
xj(i),new = (xj(i) − xj,min) / (xj,max − xj,min)
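a minimal sketch of these steps in python, assuming numpy; the data matrix values are hypothetical placeholders, not the mammal data:

```python
import numpy as np

# hypothetical data matrix: N = 4 feature vectors (rows), m = 3 attributes (columns)
X = np.array([
    [0.1, 110.0, 5.0],
    [0.4,  90.0, 7.0],
    [0.2, 150.0, 6.0],
    [0.3, 130.0, 9.0],
])

# per-attribute min and max over all N vectors
x_min = X.min(axis=0)
x_max = X.max(axis=0)

# rescale each attribute to [0, 1]: (xj - xj,min) / (xj,max - xj,min)
X_new = (X - x_min) / (x_max - x_min)
print(X_new)
```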
example 2 revisited: normalisation
CLUSTERING ALGORITHMS: 3 types
→ PARTITIONAL CLUSTERING ALGORITHMS
generate a single partition of the data to recover natural clusters
input: feature vectors
examples: k-means, k-medoids
put each data point into 1 group or another based on certain characteristics they share
→ HIERARCHICAL CLUSTERING ALGORITHMS
generates a sequence of nested partitions
input: distance matrix, which summarises the distances between diff feature vectors
example: agglomerative clustering, divisive clustering
instead of just 1 partition, they generate a sequence of nested partitions = organise data points into clusters within clusters
→ MODEL-BASED CLUSTERING ALGORITHMS
assumes that the data is generated independently + identically distributed from a mixture of distributions, each of which determines a diff cluster
example: Expectation-Maximisation (EM)
partitional clustering algorithms: the basics
goal: assign N observations into k clusters, where k < N, such that there is high intra-cluster similarity + low inter-cluster similarity
high intra-cluster similarity = points within a cluster are v. close
low inter-cluster similarity = points in diff clusters are v. far apart
can be formulated as a combinatorial optimisation problem* = requires defining a measure of intra-cluster similarity
where we search for the best arrangement of clusters among all possible arrangements
Notation:
C denote a clustering structure with K clusters
c ∈ C denote a component cluster in the clustering structure C
e ∈ c denote an example in cluster c
measuring intra-cluster similarity
like a ruler to see how close or similar things are within a cluster
inertia of a cluster c is defined as
inertia(c) = Σ_{e ∈ c} d_euc(e, centroid(c))²
where centroid (c) is the average of all examples in the cluster c
inertia determines how compact the cluster is - the smaller the inertia, the more similar the examples are within the cluster
within cluster sum of squares (wcss)
of a clustering structure C consisting of K clusters is defined as
wcss(C) = Σ_{c ∈ C} inertia(c)
i.e. the sum of the inertia of all clusters in the clustering structure
used to evaluate the compactness of clusters in clustering algorithms - measures the total inertia/spread of all clusters in the clustering solution
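a minimal sketch of inertia and wcss as defined above, assuming numpy; the clustering structure below is a hypothetical placeholder:

```python
import numpy as np

def inertia(cluster):
    """Sum of squared euclidean distances from each example to the cluster centroid."""
    cluster = np.asarray(cluster)
    centroid = cluster.mean(axis=0)                        # average of all examples in c
    return np.sum(np.sum((cluster - centroid) ** 2, axis=1))

def wcss(clusters):
    """Within-cluster sum of squares: sum of the inertia of all clusters in C."""
    return sum(inertia(c) for c in clusters)

# hypothetical clustering structure C with K = 2 clusters of 2-d examples
C = [
    [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15]],   # cluster c1
    [[0.9, 0.8], [1.0, 0.9]],                 # cluster c2
]
print(wcss(C))
```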
optimisation problem
find a clustering structure C of K clusters that minimises the following objective: (see pic)
the condition (5) ensures that each example belongs to at least 1 of the K clusters
the condition (6) ensures that the clusters are non-overlapping i.e. no example belongs to more than 1 cluster
in this optimisation problem:
wcss is an unnormalised quantity: larger clusters with high inertia are penalised more than smaller clusters with high inertia
minimising wcss of a clustering structure equivalently maximises the intra-cluster similarity
solving the optimisation problem
- finding exact solution of above problem = prohibitively hard
there are almost k^N ways to partition N observations into k clusters
infeasible when N is large
hence → look for suboptimal approximate solutions via iterative algorithms
- includes algorithms like k-means, k-medoids
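a minimal sketch of one such iterative algorithm (the standard k-means / Lloyd's procedure), assuming numpy; the toy data, the value of k and the random initialisation are placeholder assumptions - it only approximately minimises wcss, it doesn't solve the problem exactly:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Approximately minimise wcss by alternating assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # initialise centroids as k randomly chosen observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: put each observation in the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each centroid as the mean of its assigned observations
        # (empty clusters are not handled - this is only a sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# hypothetical 2-d data with two obvious groups
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
              [0.9, 0.8], [1.0, 0.9], [0.95, 0.85]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```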
revisiting example 2
assume that a partitional clustering algorithm on the mammal data returns two clusters C1 = (Dog, Cow) and C2 = (Badger, Bear, Fox). Using the normalised data matrix, calculate the cluster centroids
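a hedged sketch of the centroid calculation for this exercise, assuming numpy; the normalised values below are hypothetical placeholders, NOT the actual normalised mammal data - substitute the matrix from the notes/pic:

```python
import numpy as np

# rows in the order Dog, Cow, Badger, Bear, Fox
# NOTE: hypothetical placeholder values - replace with the normalised mammal data matrix
X_norm = np.array([
    [0.0, 1.0, 0.5],   # Dog
    [1.0, 0.8, 0.9],   # Cow
    [0.2, 0.1, 0.0],   # Badger
    [0.9, 0.3, 1.0],   # Bear
    [0.1, 0.0, 0.2],   # Fox
])
names = ["Dog", "Cow", "Badger", "Bear", "Fox"]
clusters = {"C1": ["Dog", "Cow"], "C2": ["Badger", "Bear", "Fox"]}

# centroid of a cluster = per-attribute average of its members' (normalised) feature vectors
for label, members in clusters.items():
    rows = [names.index(m) for m in members]
    print(label, X_norm[rows].mean(axis=0))
```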