Data Mining (TU Delft Exam + general knowledge)


12 Terms

1
New cards

In anomaly detection, what are:

  1. point anomalies

  2. contextual anomalies

  3. collective anomalies

  1. Point anomalies = an individual strange data point

  2. Contextual anomalies = a data point that is only strange given a set of other data points as context

  3. Collective anomalies = a set of data points that together are strange

——————

source: https://hackernoon.com/3-types-of-anomalies-in-anomaly-detection

2
New cards

From the following categories, name some models and techniques that identify point anomalies.

  1. Unsupervised models

  2. Distance-based techniques

  3. Data reconstruction techniques

  1. Unsupervised models - they identify isolated data points

    1. isolation forest (IF)

    2. one-class SVM

  2. Distance-based - they identify points that are far apart from others

    1. K-NN

    2. DBSCAN

    3. LOF

  3. Data reconstruction

    1. PCA

    2. Autoencoders
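
A minimal sketch of two of these detectors on synthetic data, assuming scikit-learn is available (the data, contamination value and neighbor count are illustrative only):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(100, 2))          # 100 "normal" 2-D points
X = np.vstack([X, [[8.0, 8.0]]])             # one obvious point anomaly

# Unsupervised, isolation-based: Isolation Forest
iso = IsolationForest(contamination=0.01, random_state=0)
iso_labels = iso.fit_predict(X)              # -1 = anomaly, 1 = normal

# Distance/density-based: Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X)              # -1 = anomaly, 1 = normal

print(np.where(iso_labels == -1)[0])         # index 100 should be flagged
print(np.where(lof_labels == -1)[0])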

3
New cards

What is the goal of DTW (Dynamic Time Warping) ?

To align time-series non-linearly in time, trying to find the best match

4
New cards

What are the steps of PCA ? (if you were to implement it from scratch in code)

Data Matrix X: n × d (n data points, d dimensions)

  1. Normalize the data

    1. mean-shift the data: xi = xi - μx

    2. turn each point into its z-score: xi = (xi - μx)/σx

  2. Calculate the Covariance Matrix

  3. Compute the eigenvectors of the covariance matrix

  4. Order the eigenvectors by the size of their eigenvalues, largest first

  5. Use the transformation matrix K (matrix K = the eigenvectors stacked together as columns) to project the original data matrix into “component space”

  6. Compute the explained variance for each principal component

  7. Reproject the data from component space back to raw data space and compute the pointwise distance to the raw data (reprojection error); see the code sketch below
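
A minimal from-scratch sketch of these steps, assuming NumPy (function and variable names are my own; in practice a library routine such as sklearn.decomposition.PCA would normally be used):

import numpy as np

def pca_from_scratch(X, n_components):
    # 1. Normalize: mean-shift each column and turn it into z-scores
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sigma

    # 2. Covariance matrix of the normalized data (d x d)
    C = np.cov(Z, rowvar=False)

    # 3.-4. Eigendecomposition, ordered by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eigh(C)          # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 5. Transformation matrix K (eigenvectors as columns); project into component space
    K = eigvecs[:, :n_components]
    scores = Z @ K

    # 6. Explained variance ratio per principal component
    explained = eigvals[:n_components] / eigvals.sum()

    # 7. Reproject back to raw data space and compute the pointwise reprojection error
    X_hat = (scores @ K.T) * sigma + mu
    errors = np.linalg.norm(X - X_hat, axis=1)

    return scores, explained, errors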

5
New cards

What is the explained variance in PCA ?

Explained variance = the share of the variance in the original data that is explained by each Principal Component = the eigenvalue of that principal component, divided by the sum of all eigenvalues
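
A tiny illustration of that formula (the eigenvalues here are made up):

import numpy as np

eigenvalues = np.array([4.0, 3.0, 2.0, 1.0])        # made-up eigenvalues, sorted descending
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)                      # [0.4 0.3 0.2 0.1], i.e. PC1 explains 40%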

6
New cards

What is z-score (aka standard score) ?

The z-score, also known as the standard score, is a statistical measure that tells you exactly how many standard deviations a data point lies from the mean of its dataset: z = (x − μ) / σ.
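
As a small example with made-up numbers:

import numpy as np

x = np.array([60.0, 55.0, 65.0, 70.0, 50.0])   # made-up sample, mean 60, std ~7.07
z = (x - x.mean()) / x.std()                   # standard deviations of each point from the mean
print(z)                                       # [ 0.    -0.707  0.707  1.414 -1.414]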

7
New cards

In PCA, what is Cumulative Explained Variance ?

The cumulative explained variance of the first k principal components is the sum of their explained variance ratios, i.e. the fraction of the total variance that is retained when keeping only those k components. It is typically used to decide how many components to keep (for example, enough components to reach a threshold such as 95%).

8
New cards

Why do we need data normalization in PCA ?

Because PCA picks the directions of maximum variance. Without normalization, features measured on larger scales (with larger variances) dominate the covariance matrix, so the principal components would mostly reflect the units of measurement rather than the actual structure of the data.

9
New cards

How do you actually compute the matrix for DTW if you were to write it in code ?

Fill an (n+1) × (m+1) cost matrix D by dynamic programming: set D[0][0] = 0 and the rest of the first row and column to infinity; then for i, j ≥ 1, D[i][j] = dist(x_i, y_j) + min(D[i-1][j], D[i][j-1], D[i-1][j-1]). D[n][m] is the DTW distance, and the optimal alignment is recovered by backtracking through the minima.
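
A minimal NumPy sketch of that recurrence (illustrative; optimized implementations exist in libraries such as dtaidistance and tslearn):

import numpy as np

def dtw_matrix(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])         # local distance between the two samples
            D[i, j] = cost + min(D[i - 1, j],       # insertion
                                 D[i, j - 1],       # deletion
                                 D[i - 1, j - 1])   # match
    return D                                        # D[n, m] is the DTW distance

print(dtw_matrix(np.array([1.0, 2.0, 3.0]), np.array([1.0, 3.0]))[-1, -1])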

10
New cards

What is the difference between normal hashing and LSH (locality sensitive hashing) ?

Normal hash functions try to minimize the probability of collision.

LSH hash functions try to maximize the probability of similar items colliding.
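
A minimal sketch of one LSH family, random-hyperplane hashing for cosine similarity (the dimension, number of bits and data are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 10, 8                                    # data dimension, hash length (arbitrary)
hyperplanes = rng.normal(size=(n_bits, d))           # one random hyperplane per hash bit

def lsh_bucket(v):
    # Each bit records on which side of a random hyperplane the point falls,
    # so nearby (small-angle) points are likely to get the same bucket key.
    bits = (hyperplanes @ v) >= 0
    return tuple(bits.astype(int))

a = rng.normal(size=d)
b = a + 0.01 * rng.normal(size=d)                    # a point very close to a
print(lsh_bucket(a) == lsh_bucket(b))                # usually True for nearby points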

11
New cards

If in LSH 2 distant points are hashed in the same bucket, how can you fix this and make LSH more reliable ?

Use more hash functions in combination with an AND-construction: two points are candidate neighbors only if they collide in all of the query bins (see the combined sketch after card 12).

12
New cards

If in LSH 2 nearby points are hashed in different buckets, how can you fix this and make LSH more reliable ?

OR-constructions: Points are candidate neighbors if they can be found together in any of the bins
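
A sketch combining the two constructions from cards 11 and 12, reusing the random-hyperplane hashing idea (all parameter values are illustrative): within one table, all bits must match (AND); across tables, a collision in any table makes two points candidate neighbors (OR).

import numpy as np

rng = np.random.default_rng(1)
d, n_bits, n_tables = 10, 6, 4                       # illustrative parameters
# OR-construction: several independent tables; AND-construction: n_bits concatenated bits per table
tables = [rng.normal(size=(n_bits, d)) for _ in range(n_tables)]

def bucket(planes, v):
    return tuple(((planes @ v) >= 0).astype(int))    # all bits must match -> AND

def candidate_neighbors(u, v):
    # Candidates if they share a bucket in ANY of the tables -> OR
    return any(bucket(planes, u) == bucket(planes, v) for planes in tables)

a = rng.normal(size=d)
b = a + 0.01 * rng.normal(size=d)                    # nearby point
c = -a                                               # far-away point
print(candidate_neighbors(a, b))                     # usually True
print(candidate_neighbors(a, c))                     # usually False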