what is bin packing?
fitting items effectively into a container
what is the bin packing problem?
a combinatorial problem
a number of items n of size x1…xn need to be put into the smalles number of bins of capacity c
what are bin packing algorithms?
algorithms that solve the bin packing problem
what are the applications of bin packing algorithms?
filling recycle bins
tape compilations
tv advertisement time
what is an example of a bin packing algorithm?
first-fit decreasing algorithm
many other algorithms exist
what is the first-fit decreasing algorithm?
n empty bin are created and numbered 1..n
items to be packed are sorted in descending order
starting from largest item, pack item into the first bin it fits into
any empty bins at the end are discarded/ignored
what is data clustering?
the process of arranging objects (as data points) into a number of sets [k] according to a metric [e.g., distance]
what is a pattern?
physical objects
what is a cluster?
a set/ category of objects
what is a feature?
an attribute of an object
objects with similar features are grouped together
data can be grouped based on 1 feature or multiple (eg. group by age vs. group by age and weight)
how do objects relate in data clustering?
each set ideally shares a common trait (eg. similarity or proximity for some defined distance measure)
each set is called a cluster/ group
each set is mutually exclusive (item cannot be in multiple clusters)
objects in 1 cluster are dissimilar to the objects in other clusters
given a set of data, what do we already know about it?
the features of each object in the data set - features represented using a feature vector
what is a feature vector?
in a feature vector there is an object x, and the [ ] shows the values of its features e.g., first 1 has 2 features, 2nd x has d features. compare the feature vectors of objects to cluster the date
what type of data can we use for clustering?
data in a table/matrix X, where xi is the ith row of the table and xij is the jth variable (feature) of row i
how do we interpret data in a matrix to be used for clustering?
columns = features = objects with similar features grouped together. (header shows the actual feature, the y axis shows that object’s value in a certain feature)
rows = the objects
rows need to be clustered together based on how similar the values of their feature are
how can data clustering be visualised?
is using 2 features we can plot data as a graph
cannot do this with hundreds of features
why do we need to clustering?
it is a data analysis technique which helps us understand how objects are related
less complex to model
clustering is used as a pre-processing tool
can give us insight into an object’s unknown properties
what are the applications of data clustering?
retail - identify households via: household income, household size, distance from urban area
sports science - identify players via: points per game, assists per game, steals per game
health science - health companies cluster via: household size, number of doctor visits, average household age
when implementing clustering after data has been collected, what do we need to consider?
cluster representation
data similarity - how can we determine if one data object is similar to another data object - eg. using distance or a different metric
cluster worth - how to evaluate our cluster results and know if they are good or not
clustering method - what clustering method do we use
number of clusters - do we need to define the number of clusters to make ourselves (kmeans = yes vs. hierachal = no)
how are clusters represented?
index 0 is in cluster 1, since there are two 1s in the set C that means there are 2 items in cluster 1, eg. index 1 is in cluster 2
what is data similarity?
the distance between the various data points from the data. different metrics can be used to measure similarity e.g., distance, and different methods are used to measure the different metrics.
how does data similarity affect clustering?
rows are compared to each other
clustering methods use the measure data similarity to place similar rows into the same cluster
a better distance function means better cluster algorithm performance
what methods are there to measure data similarity?
Euclidean distance - used by k means method
correlations
pearson
spearman
kenda
manhattan distance
choose simplest one for given data
what is Euclidean distance
the shortest distance between 2 points
2D - length of the right angle constructed between the 2 points aka Pythagoras's theorem. needed if we have a graph or don’t directly have th exact distances
what is cluster worth?
used to evaluate the goodness of clustering algorithm results - how good the actual clusters created were
different ways to judge the worth of a clustering arrangement so need to choose the right worth metric for success. each metric has different methods to measure it
what type of metrics can be used to measure cluster worth?
what is the sum of squares by cluster metric?
for one cluster, calculate each elements distance from the cluster’s centre, square each distance and sum them
what is the impact of the number of clusters on a clustering method?
Some applications need the number of clusters a solution creates specified and some applications do not eg. hierarchal clustering will cluster into a tree format and not K number of clusters, K-means groups into K clusters
Determining the number of clusters is very difficult
Need a method to find the number clusters and their contents
what is a clustering method/ algorithms?
sequence of events that carries out the data clustering
what are the different types of clustering algorithms?
Centroid-based clustering
K-Means
Hierarchical clustering - categorise data into a tree
Density-based clustering - determine if dense or sparse items
Distribution-based clustering
Optimisation / Search AI
Hill Climbing and Simulated Annealing
what is K-means clustering?
requires the number of clusters K to be defined and maintains each cluster’s means - centre points
judges the worth of a clustering arrangement based on the square of how far each item in the cluster is from the centre
how does the k-means algorithm work?
create k amount of centres (means) and place them in random locations on the matrix. each centre will form a cluster a cluster
for each data point/ object/ row, assign it to the closest centre. (for each row, calc the distance between object and each centre and pick centre with shortest distance)
for each cluster, calculate the mean of all the data points in that cluster (not including the centre). the centre positions become the means
the algorithm terminates when the centres (means) do not change or a fixed number of iterations have been reached
what is the pseudocode for the k-means algorithm?
how can we evaluate the clustering method we used?
obtain insight into how the cluster method performed based on
cluster worth
expert knowledge
run method many times and compare the clustering arrangements
this can helps us see if we selected right method, select correct metric for cluster worth, ensure results are consistent
what metrics can we use to compare 2 clustering arrangements?
Kappa metric
what can we determine about clustering arrangements based on the kappa score?
if the clustering arrangements have a high kappa score, then the clustering arrangements are similar.
if the method produces 2 or more similar clustering arrangements then the method itself is consistent