What are the four groups of common clustering algorithms
center-based partitioning clustering, hierarchical clustering, density-based clustering, model-based clustering
center-based partitioning clustering
aims to establish the center of each cluster (with the number of clusters specified in advance) and assigns group membership based on each point's distance to the nearest cluster center
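A minimal numpy sketch of this idea (Lloyd's algorithm with a naive first-k-points initialization; real implementations use k-means++ and multiple restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal Lloyd's algorithm. Naive init (first k points);
    production code uses k-means++ and several restarts."""
    centers = X[:k].copy()
    for _ in range(n_iter):
        # assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```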
hierarchical clustering
builds a nested hierarchy of clusters (by successive merges or splits), typically visualized as a dendrogram
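With scipy, `linkage` builds the merge tree that a dendrogram visualizes, and `fcluster` cuts it into a chosen number of clusters (toy data for illustration; assumes scipy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
Z = linkage(X, method="ward")                    # agglomerative merge history
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```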
density-based clustering
groups data based on density of data distribution (can identify clusters with random shapes and sizes)
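As an illustration with scikit-learn's DBSCAN (assuming scikit-learn is installed; `eps` and `min_samples` are tuned to this toy data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [10.0, 10.0]])        # the last point is isolated
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
# the dense points form one cluster; the isolated point is labeled -1 (noise)
```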
model-based clustering
assumes the distribution of data is underpinned by latent subgroups
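As an illustration with scikit-learn's `GaussianMixture` (assuming scikit-learn is available; the two-blob data is synthetic):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),   # latent subgroup 1
               rng.normal(5.0, 0.3, size=(20, 2))])  # latent subgroup 2
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)
# unlike k-means, the model also gives soft memberships:
probs = gm.predict_proba(X)   # each row sums to 1
```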
distance measures
quantify the dissimilarity between observations
when can Euclidean distance not be used
when the variables are not continuous or are on different scales; Euclidean distance is also sensitive to outliers
Manhattan distance
sum of the distances along each dimension; faster to compute and slightly more robust to outliers than Euclidean distance, since there are no squared terms
Chebyshev distance
measures the maximum distance along any single dimension; very fast to compute, but can be inaccurate because information from all other dimensions is discarded
the appropriate measure should be chosen according to:
the requirements of the clustering algorithm, the type of data (continuous/ordinal/nominal/binary/etc.), and whether the data contain outliers (e.g., Manhattan distance is more robust to outliers than Euclidean)
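The three measures above differ only in how they combine per-dimension differences; a quick numpy comparison on one pair of points:

```python
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])
diff = np.abs(a - b)                     # per-dimension differences: [3, 4, 0]
euclidean = np.sqrt((diff ** 2).sum())   # sqrt(9 + 16 + 0) = 5.0
manhattan = diff.sum()                   # 3 + 4 + 0 = 7.0
chebyshev = diff.max()                   # 4.0: only the largest dimension counts
```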
what is a good scale-invariant measure
cosine distance
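Cosine distance depends only on the angle between vectors, so rescaling a vector leaves it unchanged; a minimal sketch:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between a and b
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude
# a and b point the same way, so their cosine distance is (numerically) 0
```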
what to use when data are sparse and have common zeroes that don’t represent similarity (for binary data)
Jaccard distance, which ignores 0-0 matches; by contrast, Hamming distance (the Manhattan distance for binary data) counts shared zeroes as agreement, so it overstates similarity on sparse data
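A small numpy illustration of the difference on sparse binary vectors (Jaccard ignores positions where both vectors are 0; Hamming counts them as agreement):

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0, 1])
y = np.array([1, 0, 0, 0, 1, 0])

hamming = np.mean(x != y)                # 2 mismatches / 6 positions
nonzero = np.logical_or(x, y)            # positions where at least one is 1
jaccard = np.sum(x != y) / np.sum(nonzero)   # 2 mismatches / 3 active positions
# the three shared zeroes inflate the apparent similarity under Hamming
```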
k-means limitations
sensitive to outliers, unable to identify non-spherical clusters, and converges only to a local optimum (results depend on initialization)
hierarchical limitations
computationally expensive, especially for large datasets
drawback: high-dimensional, noisy, or sparse data
requires variable selection and dimensionality reduction
drawback: skewed distributions
data normalization is usually needed
drawback: outliers
density-based is good for this
drawback: overlapping boundaries
use fuzzy clustering
drawback: rare events
solution is anomaly detection algorithms
drawback: mixed data
use a distance measure that handles mixed types (e.g., Gower distance) or an algorithm such as k-prototypes
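One standard choice for mixed data is Gower distance; a minimal sketch for a single pair of records (the helper `gower_pair` and the toy range value are illustrative, not from any library):

```python
def gower_pair(x, y, is_numeric, ranges):
    """Gower dissimilarity for one pair of records: average the
    per-variable dissimilarities (range-scaled absolute difference
    for numeric variables, 0/1 mismatch for categorical ones).
    `ranges` holds each numeric variable's range; None for categoricals."""
    parts = []
    for xi, yi, numeric, r in zip(x, y, is_numeric, ranges):
        parts.append(abs(xi - yi) / r if numeric else float(xi != yi))
    return sum(parts) / len(parts)

# a record = (age, colour); the age range in the data is assumed to be 4.0
d = gower_pair((1.0, "a"), (3.0, "a"), (True, False), (4.0, None))
```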
what happens during data pre-processing
data normalization, variable selection and dimensionality reduction
why is min-max normalization good
it rescales every variable to [0, 1] while preserving the relative variability differences between variables (unlike z-score standardization, which forces every variable to unit variance)
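A small numpy sketch: after min-max scaling both variables span [0, 1], yet the one with more spread keeps a larger standard deviation (z-scoring would force both to 1):

```python
import numpy as np

def min_max(x):
    # rescale to [0, 1] without changing the shape of the distribution
    return (x - x.min()) / (x.max() - x.min())

a = np.array([0.0, 1.0, 2.0, 10.0])   # values bunched low, one far out
b = np.array([0.0, 4.0, 6.0, 10.0])   # values spread evenly
sa, sb = min_max(a).std(), min_max(b).std()
# sa != sb: the difference in variability survives the rescaling
```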
what are external criteria
how well the model describes the truth
what are internal criteria
how well the model describes the data
what are relative criteria
which model/modelling parameter produces the best clusters