Pearson correlation (r) properties and how to calculate
scale invariant
cor(x,x) = 1
ranges from -1 to +1
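
The card asks how to calculate r but does not give the formula. A minimal sketch, assuming the standard definition (covariance divided by the product of the standard deviations); the function name `pearson_r` and the sample data are illustrative only:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: sum of co-deviations divided by the root of the product of squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up sample data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson_r(x, y)
r_self = pearson_r(x, x)                      # cor(x, x) = 1
r_scaled = pearson_r(np.multiply(x, 100), y)  # rescaling x leaves r unchanged
```

The result should agree with `np.corrcoef(x, y)[0, 1]`.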
4
New cards
Problems with Pearson correlation
the relationship must be linear; a nonlinear relationship cannot be detected with it
strongly influenced by outliers
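
Both problems can be illustrated with a small sketch. The data are made up for this example: a perfect quadratic relationship gives r = 0, and a single extreme point turns an uncorrelated sample into a highly correlated one.

```python
import numpy as np

# Nonlinear: y is fully determined by x, yet there is no linear trend.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2
r_nonlinear = np.corrcoef(x, y)[0, 1]

# Outlier: five points without linear trend, then one extreme point added.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([3.0, 1.0, 2.0, 1.0, 3.0])
r_without = np.corrcoef(a, b)[0, 1]
r_with = np.corrcoef(np.append(a, 50.0), np.append(b, 50.0))[0, 1]
```

Here `r_nonlinear` and `r_without` are both 0 (up to rounding), while `r_with` is close to 1.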
Spearman correlation: how to calculate
1. transform the values to ranks 2. compute the Pearson correlation between the ranks
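
The two steps can be sketched as follows. The helper names `ranks` and `spearman` are illustrative, and giving tied values the average of their ranks is an assumption (the usual convention), not something stated on the card:

```python
import numpy as np

def ranks(values):
    """Rank values 1..n; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        r[tied] = r[tied].mean()
    return r

def spearman(x, y):
    """Step 1: ranks. Step 2: Pearson correlation between the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Monotonic but nonlinear relationship (y = x^3)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 8.0, 27.0, 64.0, 125.0]
rho = spearman(x, y)
```

Because the relationship is strictly increasing, the ranks of x and y are identical and `rho` is 1 (up to rounding), although the relationship is not linear.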
how to deal with unexpected values
remove NA values
replace them (imputation) with:
1. the mean 2. a random value 3. a value predicted from the other variables
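
A minimal sketch of removal and of the first two imputation options, using made-up data. Drawing the "random value" from the observed values is an assumption of this example; option 3 (prediction from other variables) is not shown.

```python
import numpy as np

x = np.array([2.0, np.nan, 4.0, 6.0, np.nan])
missing = np.isnan(x)

# Remove NA values
dropped = x[~missing]

# 1. Replace NA with the mean of the observed values
mean_imputed = np.where(missing, np.nanmean(x), x)

# 2. Replace NA with a random draw from the observed values (assumption)
rng = np.random.default_rng(0)
random_imputed = x.copy()
random_imputed[missing] = rng.choice(x[~missing], size=missing.sum())
```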
dealing with heavily skewed distributions
log transformation
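
A small sketch of the effect, with made-up right-skewed data. The `skewness` helper (third standardized moment) is added only to quantify the change; the log is defined for positive values only.

```python
import numpy as np

# Strongly right-skewed, strictly positive values
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 20.0, 100.0, 1000.0])
log_x = np.log10(x)

def skewness(v):
    """Third standardized moment; 0 for a symmetric distribution."""
    d = v - v.mean()
    return float((d ** 3).mean() / (d ** 2).mean() ** 1.5)

skew_before = skewness(x)
skew_after = skewness(log_x)
```

After the transformation the skewness is clearly smaller than before.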
clustering
grouping of similar samples
Principal component analysis
finds the directions (linear combinations of the variables) along which the data show the highest variance
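
A minimal sketch of PCA via the singular value decomposition of the centered data. The simulated data (two strongly correlated variables) and all variable names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Two variables that vary almost entirely along one common direction
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                       # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance = s ** 2 / (len(X) - 1)    # variance along each component
explained_ratio = explained_variance / explained_variance.sum()
scores = Xc @ Vt.T                            # data projected onto the components
```

The rows of `Vt` are the principal directions; here the first component should carry nearly all of the variance.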
Euclidean distance
manhattan distance
correlation distance
maximum distance
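
The cards for the four distance measures (Euclidean, Manhattan, correlation, maximum) carry no definitions. A sketch using the common definitions; taking the correlation distance as 1 minus the Pearson correlation is an assumption, and the vectors are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])
diff = np.abs(x - y)                          # [1, 2, 4]

euclidean = np.sqrt((diff ** 2).sum())        # root of the summed squared differences
manhattan = diff.sum()                        # sum of absolute differences
maximum = diff.max()                          # largest absolute difference
correlation = 1.0 - np.corrcoef(x, y)[0, 1]   # assumed definition: 1 - r
```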
k-means clustering
distance = Euclidean metric
1. choose k random centers 2. assign each point to the closest center 3. compute the center of gravity (centroid) of each cluster → new centers 4. reassign each point to the closest new center 5. repeat from step 3
stop when no object changes its cluster assignment
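
The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation; the function name `kmeans`, its parameters, the choice of random data points as initial centers, and the simulated data are assumptions of this example.

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # 1. use k randomly chosen data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # 2./4. assign each point to the closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # stop when no point changes its cluster assignment
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. move each center to the centroid of its cluster
        for j in range(k):
            if np.any(labels == j):           # keep the old center if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated simulated groups of 20 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])
labels, centers = kmeans(X, k=2)
```

With groups this far apart, the first 20 points should end up in one cluster and the last 20 in the other.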
interpreting k-means clustering
repeat the process with different starting points: identical results → very strong cluster structure
fluctuating results → no clear cluster structure
k-means problem
clusters will be identified even if the data have no group structure
choosing k
elbow method or silhouette method
elbow method
if every point is its own cluster, WSS (within-cluster sum of squares) = 0
the larger k, the lower WSS
kink (elbow) in the curve of WSS against k → optimal k
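
A minimal sketch of the WSS (within-cluster sum of squares) for fixed cluster assignments. The function name `wss`, the toy data, and the hand-picked labelings are assumptions of this example; in practice the labels would come from running k-means for each k.

```python
import numpy as np

def wss(X, labels):
    """Sum of squared distances of each point to the centroid of its cluster."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

# Two obvious groups of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

wss_k1 = wss(X, np.zeros(6, dtype=int))          # one cluster for everything
wss_k2 = wss(X, np.array([0, 0, 0, 1, 1, 1]))    # the two obvious groups
wss_k6 = wss(X, np.arange(6))                    # every point is its own cluster
```

WSS drops sharply from k = 1 to k = 2 and only slightly after that, down to 0 at k = 6; the kink at k = 2 is the elbow.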
silhouette method
a_i: mean distance from point i to all other members of its own cluster; b_i: smallest mean distance from point i to the members of any other cluster
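
The card defines a_i and b_i but not how they are combined. A sketch assuming the usual silhouette value s_i = (b_i - a_i) / max(a_i, b_i), with Euclidean distances; the function name `silhouette_values` and the toy data are illustrative:

```python
import numpy as np

def silhouette_values(X, labels):
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        if not same.any():
            continue                          # singleton cluster: s_i left at 0
        a_i = dist[i, same].mean()
        b_i = min(dist[i, labels == c].mean()
                  for c in np.unique(labels) if c != labels[i])
        s[i] = (b_i - a_i) / max(a_i, b_i)
    return s

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
s = silhouette_values(X, labels)
mean_silhouette = float(s.mean())
```

Values near 1 indicate that a point fits its own cluster much better than any other; to choose k, the mean silhouette is compared across different k and the largest is preferred.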