describing data, cleaning data, finding structure in data, k-means clustering

19 Terms

1
variance
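A common definition, assuming the sample version with the n - 1 denominator (the card may have used n instead), for values x_1, ..., x_n with mean x̄:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\operatorname{var}(x) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
```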
2
covariance
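A minimal sketch of the usual sample covariance, assuming the n - 1 denominator; the function names `mean` and `covariance` and the toy data are illustrative only. The variance of x falls out as the covariance of x with itself.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance of two equally long sequences (n - 1 denominator)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
cov_xy = covariance(x, y)  # positive: y grows with x
var_x = covariance(x, x)   # covariance of x with itself is its variance
```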
3
Pearson correlation (r): properties and how to calculate
scale invariant

cor(x,x) = 1

ranges from -1 to +1
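The card lists the properties but not the calculation. As a sketch, r can be computed as the covariance divided by the product of the two standard deviations; the function name `pearson` and the sample values are assumptions for illustration. The last lines check the listed properties for a positive rescaling of x.

```python
import math


def pearson(x, y):
    """Pearson r: co-deviation of x and y divided by the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 1.0, 6.0, 8.0]
r = pearson(x, y)
r_scaled = pearson([10 * a + 5 for a in x], y)  # rescale and shift x: r is unchanged
r_self = pearson(x, x)                          # correlation of x with itself
```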
4
Problems with Pearson correlation
the relationship must be linear, otherwise Pearson correlation cannot detect it

strongly influenced by outliers
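Both problems can be reproduced on toy data. The sketch below (hypothetical values, illustrative function name `pearson`) first uses a perfect but non-linear dependence, y = x², and then adds a single extreme point to data without any linear trend.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Problem 1: y depends perfectly on x, but not linearly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [a * a for a in x]
r_parabola = pearson(x, y)  # no linear trend, so r is zero

# Problem 2: five points without a linear trend, then one outlier is added.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 3.0, 1.0, 2.0]
r_before = pearson(xs, ys)
r_after = pearson(xs + [50.0], ys + [50.0])  # the single outlier dominates r
```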
5
Spearman correlation: how to

1. transform values to ranks
2. compute the Pearson correlation between the ranks
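The two steps can be sketched as follows. Giving tied values their average rank is a common convention assumed here, not something the card states; `ranks`, `pearson` and `spearman` are illustrative names.

```python
import math


def ranks(values):
    """Step 1: rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Step 2: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [a ** 3 for a in x]    # monotone but not linear
rho = spearman(x, y)       # the ranks agree perfectly
r = pearson(x, y)          # stays below 1 because the relation is curved
```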
6
how to deal with unexpected values
remove NA values

replace them (imputation) with:

1. the mean
2. a random value
3. a value predicted from the other variables
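A small sketch of removal and of mean imputation (option 1), assuming NA is represented by Python's `None`; the function names and the data are hypothetical.

```python
def drop_missing(values):
    """Remove NA entries (represented here as None)."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Replace each NA entry with the mean of the observed values."""
    observed = drop_missing(values)
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


data = [2.0, None, 4.0, 6.0, None]
cleaned = drop_missing(data)  # shorter list, NA entries gone
imputed = impute_mean(data)   # same length, NA entries filled with the mean 4.0
```

Note that mean imputation keeps the sample size but shrinks the variance of the variable, which is one reason the other two options exist.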
7
dealing with heavily skewed distributions
log transformation
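As an illustration with made-up values, a base-10 log turns a right-skewed variable spanning several orders of magnitude into evenly spaced values. This sketch assumes all values are strictly positive, since the logarithm is undefined otherwise.

```python
import math

# Hypothetical right-skewed data: each value is ten times the previous one.
raw = [1.0, 10.0, 100.0, 1000.0, 10000.0]

# After the transformation the values are evenly spaced.
logged = [math.log10(v) for v in raw]

# Before: the mean is pulled far above the median by the large values.
raw_mean = sum(raw) / len(raw)
raw_median = sorted(raw)[len(raw) // 2]

# After: mean and median coincide for this symmetric result.
log_mean = sum(logged) / len(logged)
log_median = sorted(logged)[len(logged) // 2]
```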
8
clustering
grouping of similar samples
9
Principal component analysis
finds the directions (linear combinations of variables) along which the data has the highest variance
10
Euclidean distance
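The standard definition, written here for two points x and y with p coordinates each (the symbols are an assumed notation):

```latex
d_{\text{euclid}}(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}
```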
11
manhattan distance
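The standard definition, for two points x and y with p coordinates each (assumed notation), sums the absolute coordinate differences:

```latex
d_{\text{manhattan}}(x, y) = \sum_{j=1}^{p} \lvert x_j - y_j \rvert
```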
12
correlation distance
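One common definition, assumed here, is one minus the Pearson correlation of the two profiles x and y; some texts use 1 - |cor(x, y)| instead, so the exact form on the card may differ.

```latex
d_{\text{cor}}(x, y) = 1 - \operatorname{cor}(x, y)
```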
13
maximum distance
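The standard definition (also called Chebyshev distance), for two points x and y with p coordinates each (assumed notation), takes the largest absolute coordinate difference:

```latex
d_{\max}(x, y) = \max_{1 \le j \le p} \lvert x_j - y_j \rvert
```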
14
k-means clustering
distance = Euclidean metric

1. define k random centers
2. assign each point to the closest of the centers
3. determine the center of gravity of each cluster → new centers
4. reassign each point to the closest new center
5. repeat steps 3 and 4

stop when: no object changes cluster assignment
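The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the `max_iter` safeguard, the rule of keeping the old center when a cluster becomes empty, and the toy points are all assumptions.

```python
import math
import random


def closest(point, centers):
    """Index of the center with the smallest Euclidean distance to point."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))


def centroid(points):
    """Center of gravity: coordinate-wise mean of the points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, start_centers, max_iter=100):
    """Alternate assignment and center update until no assignment changes."""
    centers = list(start_centers)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [closest(p, centers) for p in points]
        if new_assignment == assignment:
            break  # stop: no object changed its cluster
        assignment = new_assignment
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = centroid(members)
    return assignment, centers


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Step 1 with k = 2: pick two of the points at random as starting centers.
start = random.Random(0).sample(points, 2)
labels, final_centers = kmeans(points, start)

# A deliberately poor start (both centers in the same group) for comparison.
labels_fixed, centers_fixed = kmeans(points, [(1.0, 1.0), (1.5, 2.0)])
```

Picking the starting centers from the data points is one simple way to do step 1; other initialisation schemes exist.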
15
interpreting k-means clustering
repeat the process with different starting points; if the results are identical → very strong cluster structure

fluctuating results → no clear cluster structure
16
k-means problem
even if there is no group structure, clusters will be identified
17
choosing k
elbow method or silhouette method
18
elbow method
plot WSS (within-cluster sum of squares) against k

if each point is its own cluster (k = n), WSS = 0

the larger k, the lower WSS

kink in the curve → optimal k
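To make the shape of the curve concrete, the sketch below computes WSS for hand-picked partitions of one-dimensional toy data with two obvious groups. The partitions are written out by hand rather than produced by k-means, and the name `wss` is illustrative.

```python
def wss(clusters):
    """Within-cluster sum of squares: squared distances of points to their cluster mean."""
    total = 0.0
    for members in clusters:
        center = sum(members) / len(members)
        total += sum((v - center) ** 2 for v in members)
    return total


# Hand-picked partitions of the same six values for k = 1, 2, 3 and 6.
k1 = [[1.0, 2.0, 3.0, 11.0, 12.0, 13.0]]
k2 = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
k3 = [[1.0, 2.0], [3.0], [11.0, 12.0, 13.0]]
k6 = [[1.0], [2.0], [3.0], [11.0], [12.0], [13.0]]

curve = [wss(c) for c in (k1, k2, k3, k6)]
# curve is [154.0, 4.0, 2.5, 0.0]: a sharp drop up to k = 2, then almost flat
```

The kink at k = 2 matches the two groups in the data; going beyond it lowers WSS only marginally.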
19
silhouette method
a_i: mean distance from object i to all other members of its own cluster; b_i: smallest mean distance from i to the members of any other cluster

s_i = (b_i - a_i) / max(a_i, b_i)

Interpretation:

s_i ≈ 1 → object very well clustered

s_i ≈ 0 → ambiguous object

s_i < 0 → wrong assignment
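A sketch of the silhouette value for one-dimensional toy data, using the usual formula s_i = (b_i - a_i) / max(a_i, b_i). The function names, the absolute difference as distance, and the convention s_i = 0 for a cluster with a single member are assumptions for this example.

```python
def mean_distance(value, others):
    """Mean absolute distance from value to every element of others."""
    return sum(abs(value - o) for o in others) / len(others)


def silhouette(i, cluster_index, clusters):
    """Silhouette s_i of the i-th member of clusters[cluster_index]."""
    own = clusters[cluster_index]
    value = own[i]
    rest = own[:i] + own[i + 1:]
    if not rest:
        return 0.0  # assumed convention for a singleton cluster
    a = mean_distance(value, rest)  # a_i: cohesion within the own cluster
    b = min(                        # b_i: nearest other cluster
        mean_distance(value, other)
        for c, other in enumerate(clusters)
        if c != cluster_index
    )
    return (b - a) / max(a, b)


# The value 9.0 is placed in the first cluster although it sits next to the second.
clusters = [[1.0, 2.0, 3.0, 9.0], [10.0, 11.0, 12.0]]
s_good = silhouette(0, 0, clusters)  # point 1.0: clearly positive
s_bad = silhouette(3, 0, clusters)   # point 9.0: negative, wrong assignment
```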