What are the four groups of common clustering algorithms
center-based partitioning clustering, hierarchical clustering, density-based clustering, model-based clustering
center-based partitioning clustering
aims to establish the center of each cluster (with the number of clusters specified in advance) and assigns group membership based on each point's distance to the nearest cluster center
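A minimal numpy sketch of this idea (Lloyd's algorithm with a naive first-k-points initialization; real implementations use k-means++ and multiple restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal Lloyd's algorithm. Naive init (first k points);
    production code uses k-means++ and several restarts."""
    centers = X[:k].copy()
    for _ in range(n_iter):
        # assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```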
hierarchical clustering
builds a nested hierarchy of clusters (by successive merges or splits), typically visualized as a dendrogram
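With scipy, `linkage` builds the merge tree that a dendrogram visualizes, and `fcluster` cuts it into a chosen number of clusters (toy data for illustration; assumes scipy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
Z = linkage(X, method="ward")                    # agglomerative merge history
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```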
density-based clustering
groups data based on density of data distribution (can identify clusters with random shapes and sizes)
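As an illustration with scikit-learn's DBSCAN (assuming scikit-learn is installed; `eps` and `min_samples` are tuned to this toy data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [10.0, 10.0]])        # the last point is isolated
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
# the dense points form one cluster; the isolated point is labeled -1 (noise)
```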
model-based clustering
assumes the distribution of data is underpinned by latent subgroups
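As an illustration with scikit-learn's `GaussianMixture` (assuming scikit-learn is available; the two-blob data is synthetic):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),   # latent subgroup 1
               rng.normal(5.0, 0.3, size=(20, 2))])  # latent subgroup 2
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)
# unlike k-means, the model also gives soft memberships:
probs = gm.predict_proba(X)   # each row sums to 1
```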
distance measures
quantify the dissimilarity between observations
when can Euclidean distance not be used
when the variables are not continuous or are on different scales; Euclidean distance is also sensitive to outliers
Manhattan distance
sum of the distances along each dimension; faster to compute and slightly more robust to outliers than Euclidean distance, since there are no squared terms
Chebyshev distance
measures the maximum distance along any single dimension; very fast to compute, but can be inaccurate because information from all other dimensions is discarded
the appropriate measure should be chosen according to:
the requirements of the clustering algorithm, the type of data (continuous/ordinal/nominal/binary/etc.), and whether the data contain outliers (e.g., Manhattan distance is more robust to outliers than Euclidean)
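The three measures above differ only in how they combine per-dimension differences; a quick numpy comparison on one pair of points:

```python
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])
diff = np.abs(a - b)                     # per-dimension differences: [3, 4, 0]
euclidean = np.sqrt((diff ** 2).sum())   # sqrt(9 + 16 + 0) = 5.0
manhattan = diff.sum()                   # 3 + 4 + 0 = 7.0
chebyshev = diff.max()                   # 4.0: only the largest dimension counts
```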
what is a good scale-invariant measure
cosine distance
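Cosine distance depends only on the angle between vectors, so rescaling a vector leaves it unchanged; a minimal sketch:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between a and b
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude
# a and b point the same way, so their cosine distance is (numerically) 0
```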
what to use when data are sparse and have common zeroes that don’t represent similarity (for binary data)
Jaccard distance, which ignores 0-0 matches; by contrast, Hamming distance (the Manhattan distance for binary data) counts shared zeroes as agreement, so it overstates similarity on sparse data
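A small numpy illustration of the difference on sparse binary vectors (Jaccard ignores positions where both vectors are 0; Hamming counts them as agreement):

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0, 1])
y = np.array([1, 0, 0, 0, 1, 0])

hamming = np.mean(x != y)                # 2 mismatches / 6 positions
nonzero = np.logical_or(x, y)            # positions where at least one is 1
jaccard = np.sum(x != y) / np.sum(nonzero)   # 2 mismatches / 3 active positions
# the three shared zeroes inflate the apparent similarity under Hamming
```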
k-means limitations
sensitive to outliers, unable to identify non-spherical clusters, and converges only to a local optimum (results depend on initialization)
hierarchical limitations
computationally expensive, especially for large datasets
drawback: high-dimensional, noisy, or sparse data
requires variable selection and dimensionality reduction
drawback: skewed distributions
data normalization is usually needed
drawback: outliers
density-based is good for this
drawback: overlapping boundaries
use fuzzy clustering
drawback: rare events
solution is anomaly detection algorithms
drawback: mixed data
use a distance measure that handles mixed types (e.g., Gower distance) or an algorithm such as k-prototypes
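One standard choice for mixed data is Gower distance; a minimal sketch for a single pair of records (the helper `gower_pair` and the toy range value are illustrative, not from any library):

```python
def gower_pair(x, y, is_numeric, ranges):
    """Gower dissimilarity for one pair of records: average the
    per-variable dissimilarities (range-scaled absolute difference
    for numeric variables, 0/1 mismatch for categorical ones).
    `ranges` holds each numeric variable's range; None for categoricals."""
    parts = []
    for xi, yi, numeric, r in zip(x, y, is_numeric, ranges):
        parts.append(abs(xi - yi) / r if numeric else float(xi != yi))
    return sum(parts) / len(parts)

# a record = (age, colour); the age range in the data is assumed to be 4.0
d = gower_pair((1.0, "a"), (3.0, "a"), (True, False), (4.0, None))
```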
what happens during data pre-processing
data normalization, variable selection and dimensionality reduction
why is min-max normalization good
it rescales every variable to [0, 1] while preserving the relative variability differences between variables (unlike z-score standardization, which forces every variable to unit variance)
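A small numpy sketch: after min-max scaling both variables span [0, 1], yet the one with more spread keeps a larger standard deviation (z-scoring would force both to 1):

```python
import numpy as np

def min_max(x):
    # rescale to [0, 1] without changing the shape of the distribution
    return (x - x.min()) / (x.max() - x.min())

a = np.array([0.0, 1.0, 2.0, 10.0])   # values bunched low, one far out
b = np.array([0.0, 4.0, 6.0, 10.0])   # values spread evenly
sa, sb = min_max(a).std(), min_max(b).std()
# sa != sb: the difference in variability survives the rescaling
```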
what are external criteria
how well the model describes the truth
what are internal criteria
how well the model describes the data
what are relative criteria
which model/modelling parameter produces the best clusters