Data Science Quiz 4

25 Terms

1
New cards

What are the four groups of common clustering algorithms

center-based partitioning clustering, hierarchical clustering, density-based clustering, model-based clustering

2
New cards

center-based partitioning clustering

aims to establish the center of each cluster (with the number of clusters specified in advance) and to assign group membership based on each point's distance to the nearest cluster center
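A minimal sketch of Lloyd's algorithm, the standard center-based partitioning (k-means) procedure. Function names here are hypothetical and the initialization is deliberately crude; real analyses would use a library implementation such as scikit-learn's KMeans.

```python
def kmeans(points, k, iters=20):
    """Minimal Lloyd's algorithm sketch: alternate between assigning
    points to the nearest center and moving each center to the mean
    of its group."""
    # Assumption: deterministic init from the first k points;
    # real implementations use random or k-means++ initialization.
    centers = [tuple(p) for p in points[:k]]

    def nearest(p):
        # index of the closest center by squared Euclidean distance
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

    for _ in range(iters):
        # assignment step: each point joins its nearest center's group
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p)].append(p)
        # update step: each center moves to the mean of its group
        # (an empty group keeps its old center)
        centers = [tuple(sum(d) / len(g) for d in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, [nearest(p) for p in points]
```

Two well-separated blobs end up in two different groups, illustrating the assign/update loop.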

3
New cards

hierarchical clustering

builds a nested hierarchy of clusters, typically visualized as a dendrogram
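A minimal agglomerative (bottom-up) sketch with single linkage; the sequence of merges is what a dendrogram visualizes. Helper names are hypothetical; real work would use scipy.cluster.hierarchy's linkage and dendrogram.

```python
def agglomerative(points, k):
    """Single-linkage agglomerative sketch: start with each point in
    its own cluster and repeatedly merge the two closest clusters
    until k remain."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # squared Euclidean distance between two points
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def linkage(c1, c2):
        # single linkage: distance of the closest pair across clusters
        return min(dist(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        # find the pair of clusters with minimum linkage and merge them
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The all-pairs search each round is why hierarchical clustering is computationally expensive.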

4
New cards

density-based clustering

groups data based on density of data distribution (can identify clusters with random shapes and sizes)
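DBSCAN is the canonical density-based algorithm; this simplified sketch (not a production implementation) shows how clusters grow outward from dense "core" points and how isolated points fall out as noise.

```python
def dbscan(points, eps, min_pts):
    """Simplified DBSCAN sketch: a point with >= min_pts neighbours
    within eps is a core point; clusters expand by connecting core
    points to their neighbours. Unreachable points are noise (-1)."""
    def neighbours(i):
        # indices of all points within eps of point i (including itself)
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)      # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1             # tentatively noise
            continue
        labels[i] = cluster            # i is a core point: start a cluster
        stack = list(nbrs)
        while stack:                   # expand the cluster outward
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster    # border point: reclaim from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:   # j is also core: keep expanding
                stack.extend(more)
        cluster += 1
    return labels
```

Because membership depends on local density rather than distance to a center, the clusters found can have arbitrary shapes and sizes.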

5
New cards

model-based clustering

assumes the distribution of data is underpinned by latent subgroups
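A Gaussian mixture fitted by expectation-maximization (EM) is the classic model-based approach: the latent subgroups are the unknown component memberships. A 1-D, two-component sketch with a deliberately crude initialization (real code would use k-means or random restarts):

```python
import math

def gmm_em_1d(xs, iters=50):
    """EM sketch for a two-component 1-D Gaussian mixture.
    E-step: soft responsibilities of each component for each point.
    M-step: re-estimate weight, mean, variance from them."""
    # Assumption: initialize the two means from the sorted halves of the data.
    xs = sorted(xs)
    half = len(xs) // 2
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    var = [1.0, 1.0]
    weight = [0.5, 0.5]

    def pdf(x, m, v):  # Gaussian density
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            w = [weight[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: weighted re-estimation of the parameters
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-6
    return mu, var, weight
```

On data drawn from two separated groups, the fitted means recover the group centers.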

6
New cards

distance measures

measure the dissimilarity between pairs of observations

7
New cards

when can Euclidean distance not be used

when the data are not continuous or when variables differ in scale; it is also sensitive to outliers

8
New cards

Manhattan distance

sum of the distances in each dimension; faster to calculate and slightly more robust to outliers, but has no squared terms

9
New cards

Chebyshev distance

measures the maximum distance along any single dimension; very fast to calculate, but may be inaccurate because information from the other dimensions is suppressed

10
New cards

the appropriate measure should be chosen according to:

the requirements of the clustering algorithm, the type of data (continuous/ordinal/nominal/binary/etc.), and whether the data contain outliers (e.g. Manhattan distance is more robust to outliers)

11
New cards

what is a good scale-invariant measure

cosine distance

12
New cards

what to use when data are sparse and have common zeroes that don’t represent similarity (for binary data)

Hamming distance (manhattan distance for binary data)
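The distance measures from the cards above (Euclidean, Manhattan, Chebyshev, cosine, Hamming) can be sketched in plain Python; in practice a library such as scipy.spatial.distance provides them all.

```python
import math

def euclidean(a, b):
    # square root of the sum of squared per-dimension differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute per-dimension differences (no squared terms)
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # maximum difference along any single dimension
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between a and b; scale-invariant
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

def hamming(a, b):
    # count of positions that differ (Manhattan distance for binary data)
    return sum(x != y for x, y in zip(a, b))
```

Note that cosine distance is unchanged when a vector is rescaled, which is why it is a good scale-invariant measure.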

13
New cards

k-means limitations

sensitive to outliers, unable to identify non-spherical shapes, only obtains local optimum

14
New cards

hierarchical limitations

computationally expensive, which limits its use on large datasets

15
New cards

drawback: high-dimensional, noisy, or sparse data

requires variable selection and dimensionality reduction

16
New cards

drawback: skewed distribution

data normalization is usually needed

17
New cards

drawback: outliers

density-based is good for this

18
New cards

drawback: overlapping boundaries

use fuzzy clustering

19
New cards

drawback: rare events

solution is anomaly detection algorithms

20
New cards

drawback: mixed data

dimensionality reduction is needed before clustering

21
New cards

what happens during data pre-processing

data normalization, variable selection and dimensionality reduction

22
New cards

why is min-max normalization good

it preserves variability differences between variables
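A minimal min-max sketch in plain Python: the rescaling is linear, so the relative spacing of values within a variable is preserved while the variable is mapped onto a common range.

```python
def min_max(xs, new_min=0.0, new_max=1.0):
    """Linearly rescale a variable to [new_min, new_max]."""
    lo, hi = min(xs), max(xs)
    # assumption: hi > lo (a constant variable has no spread to rescale)
    return [new_min + (x - lo) * (new_max - new_min) / (hi - lo)
            for x in xs]
```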

23
New cards

what are external criteria

how well the model describes the truth

24
New cards

what are internal criteria

how well the model describes the data

25
New cards

what are relative criteria

which model/modelling parameter produces the best clusters