Data Mining (TU Delft Exam + general knowledge)


12 Terms

1
New cards

In anomaly detection, what are:

  1. point anomalies

  2. contextual anomalies

  3. collective anomalies

  1. Point anomalies = an individual strange data point

  2. Contextual anomalies = a data point that is only strange given a set of other data points as context

  3. Collective anomalies = a set of data points that together are strange

——————

source: https://hackernoon.com/3-types-of-anomalies-in-anomaly-detection

2
New cards

From the following categories, name some models and techniques that identify point anomalies.

  1. Unsupervised models

  2. Distance-based techniques

  3. Data reconstruction techniques

  1. Unsupervised models - they identify isolated data points

    1. isolation forest (IF)

    2. one-class SVM

  2. Distance-based - they identify points that are far apart from others

    1. K-NN

    2. DBSCAN

    3. LOF

  3. Data reconstruction

    1. PCA

    2. Autoencoders
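
A minimal sketch of two of these detectors on synthetic data, assuming scikit-learn is available (the data, contamination value and neighbor count are illustrative only):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(100, 2))          # 100 "normal" 2-D points
X = np.vstack([X, [[8.0, 8.0]]])             # one obvious point anomaly

# Unsupervised, isolation-based: Isolation Forest
iso = IsolationForest(contamination=0.01, random_state=0)
iso_labels = iso.fit_predict(X)              # -1 = anomaly, 1 = normal

# Distance/density-based: Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X)              # -1 = anomaly, 1 = normal

print(np.where(iso_labels == -1)[0])         # index 100 should be flagged
print(np.where(lof_labels == -1)[0])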

3
New cards

What is the goal of DTW (Dynamic Time Warping) ?

To align time-series non-linearly in time, trying to find the best match

4
New cards

What are the steps of PCA ? (if you were to implement it from scratch in code)

Data Matrix X: n × d (n data points, d dimensions)

  1. Normalize the data

    1. mean-shift the data: xi = xi - μx

    2. turn each point into its z-score: xi = (xi - μx)/σx

  2. Calculate the Covariance Matrix

  3. Compute the eigenvectors of the covariance matrix

  4. Order the eigenvectors by the size of their eigenvalues, largest first

  5. Use the transformation matrix K (matrix K = the eigenvectors stacked together as columns) to project the original data matrix into “component space”

  6. Compute the explained variance for each principal component

  7. Reproject the data from component space back to raw data space and compute the pointwise distance to the raw data (reprojection error); see the code sketch below
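
A minimal from-scratch sketch of these steps, assuming NumPy (function and variable names are my own; in practice a library routine such as sklearn.decomposition.PCA would normally be used):

import numpy as np

def pca_from_scratch(X, n_components):
    # 1. Normalize: mean-shift each column and turn it into z-scores
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sigma

    # 2. Covariance matrix of the normalized data (d x d)
    C = np.cov(Z, rowvar=False)

    # 3.-4. Eigendecomposition, ordered by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eigh(C)          # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 5. Transformation matrix K (eigenvectors as columns); project into component space
    K = eigvecs[:, :n_components]
    scores = Z @ K

    # 6. Explained variance ratio per principal component
    explained = eigvals[:n_components] / eigvals.sum()

    # 7. Reproject back to raw data space and compute the pointwise reprojection error
    X_hat = (scores @ K.T) * sigma + mu
    errors = np.linalg.norm(X - X_hat, axis=1)

    return scores, explained, errors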

5
New cards

What is the explained variance in PCA ?

Explained variance = the share of the variance in the original data that is explained by each Principal Component = the eigenvalue of that principal component, divided by the sum of all eigenvalues
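
A tiny illustration of that formula (the eigenvalues here are made up):

import numpy as np

eigenvalues = np.array([4.0, 3.0, 2.0, 1.0])        # made-up eigenvalues, sorted descending
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)                      # [0.4 0.3 0.2 0.1], i.e. PC1 explains 40%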

6
New cards

What is z-score (aka standard score) ?

The z-score, also known as the standard score, is a statistical measure that tells you exactly how many standard deviations a data point lies from the mean of its dataset: z = (x − μ) / σ.
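
As a small example with made-up numbers:

import numpy as np

x = np.array([60.0, 55.0, 65.0, 70.0, 50.0])   # made-up sample, mean 60, std ~7.07
z = (x - x.mean()) / x.std()                   # standard deviations of each point from the mean
print(z)                                       # [ 0.    -0.707  0.707  1.414 -1.414]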

7
New cards

In PCA, what is Cumulative Explained Variance ?

The cumulative explained variance of the first k principal components is the sum of their explained variance ratios, i.e. the fraction of the total variance that is retained when keeping only those k components. It is typically used to decide how many components to keep (for example, enough components to reach a threshold such as 95%).

8
New cards

Why do we need data normalization in PCA ?

Because PCA picks the directions of maximum variance. Without normalization, features measured on larger scales (with larger variances) dominate the covariance matrix, so the principal components would mostly reflect the units of measurement rather than the actual structure of the data.

9
New cards

How do you actually compute the matrix for DTW if you were to write it in code ?

Fill an (n+1) × (m+1) cost matrix D by dynamic programming: set D[0][0] = 0 and the rest of the first row and column to infinity; then for i, j ≥ 1, D[i][j] = dist(x_i, y_j) + min(D[i-1][j], D[i][j-1], D[i-1][j-1]). D[n][m] is the DTW distance, and the optimal alignment is recovered by backtracking through the minima.
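
A minimal NumPy sketch of that recurrence (illustrative; optimized implementations exist in libraries such as dtaidistance and tslearn):

import numpy as np

def dtw_matrix(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])         # local distance between the two samples
            D[i, j] = cost + min(D[i - 1, j],       # insertion
                                 D[i, j - 1],       # deletion
                                 D[i - 1, j - 1])   # match
    return D                                        # D[n, m] is the DTW distance

print(dtw_matrix(np.array([1.0, 2.0, 3.0]), np.array([1.0, 3.0]))[-1, -1])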

10
New cards

What is the difference between normal hashing and LSH (locality sensitive hashing) ?

Normal hash functions try to minimize the probability of collision.

LSH hash functions try to maximize the probability of similar items colliding.
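
A minimal sketch of one LSH family, random-hyperplane hashing for cosine similarity (the dimension, number of bits and data are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 10, 8                                    # data dimension, hash length (arbitrary)
hyperplanes = rng.normal(size=(n_bits, d))           # one random hyperplane per hash bit

def lsh_bucket(v):
    # Each bit records on which side of a random hyperplane the point falls,
    # so nearby (small-angle) points are likely to get the same bucket key.
    bits = (hyperplanes @ v) >= 0
    return tuple(bits.astype(int))

a = rng.normal(size=d)
b = a + 0.01 * rng.normal(size=d)                    # a point very close to a
print(lsh_bucket(a) == lsh_bucket(b))                # usually True for nearby points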

11
New cards

If in LSH 2 distant points are hashed in the same bucket, how can you fix this and make LSH more reliable ?

Use more hash functions in combination with an AND-construction: two points are candidate neighbors only if they collide in all of the query bins (see the combined sketch after card 12).

12
New cards

If in LSH 2 nearby points are hashed in different buckets, how can you fix this and make LSH more reliable ?

OR-constructions: Points are candidate neighbors if they can be found together in any of the bins
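
A sketch combining the two constructions from cards 11 and 12, reusing the random-hyperplane hashing idea (all parameter values are illustrative): within one table, all bits must match (AND); across tables, a collision in any table makes two points candidate neighbors (OR).

import numpy as np

rng = np.random.default_rng(1)
d, n_bits, n_tables = 10, 6, 4                       # illustrative parameters
# OR-construction: several independent tables; AND-construction: n_bits concatenated bits per table
tables = [rng.normal(size=(n_bits, d)) for _ in range(n_tables)]

def bucket(planes, v):
    return tuple(((planes @ v) >= 0).astype(int))    # all bits must match -> AND

def candidate_neighbors(u, v):
    # Candidates if they share a bucket in ANY of the tables -> OR
    return any(bucket(planes, u) == bucket(planes, v) for planes in tables)

a = rng.normal(size=d)
b = a + 0.01 * rng.normal(size=d)                    # nearby point
c = -a                                               # far-away point
print(candidate_neighbors(a, b))                     # usually True
print(candidate_neighbors(a, c))                     # usually False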