data mining final

56 Terms

1
New cards

clustering

this is a grouping of objects (or data points)

2
New cards

clustering analysis

this is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups

3
New cards

high intra-class similarity and low inter-class similarity

a good clustering method will produce high-quality clusters, which should have these two properties

4
New cards

high intra-class similarity

this property of good clustering means that objects are cohesive within clusters

5
New cards

low inter-class similarity

this property of good clustering means that clusters are distinctive from one another

6
New cards

nominal variable

this includes categories, states, or names of things

Example: Hair_color and Marital_status

7
New cards

ordinal variable

these values have a meaningful order, but the magnitude between successive values is not known.

Example: shirt_size or army rankings

8
New cards

proximity

this refers to either similarity or dissimilarity

9
New cards

similarity

this refers to the measure or similarity function

  • numerical measure of how alike two data objects are

  • value is higher when objects are more alike

  • often falls in the range [0,1]

10
New cards

dissimilarity

this is the measure or distance function

  • numerical measure of how different two data objects are

  • lower when objects are more alike

  • minimum is often 0

  • upper limit varies, e.g. [0, 1] or [0, ∞)

11
New cards

the quality of a clustering method depends on

  • the similarity measure used by the method

  • its implementation

  • its ability to discover some or all of the hidden patterns

12
New cards

supervised learning (classification)

the training data are accompanied by labels indicating the class of each observation.

new data is classified based on the training set

13
New cards

unsupervised learning (clustering)

the class labels of the training data are unknown

given a set of measurements or observations, the aim is to establish the existence of classes or clusters in the data

14
New cards

minkowski distance

a popular distance measure that generalizes the Manhattan (p = 1) and Euclidean (p = 2) distances
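As an illustration, the Minkowski distance of order p between two vectors is the p-th root of the sum of the absolute coordinate differences raised to the power p. The sketch below assumes plain numeric tuples; the function name `minkowski_distance` is chosen here for illustration only.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1 for a valid distance")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


a, b = (0, 0), (3, 4)
manhattan = minkowski_distance(a, b, p=1)  # p = 1 gives the Manhattan distance
euclidean = minkowski_distance(a, b, p=2)  # p = 2 gives the Euclidean distance
```

For the two points above, the Manhattan distance is 7 and the Euclidean distance is 5.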

15
New cards

exclusive clustering

an object can belong to only one cluster

16
New cards

non-exclusive clustering

an object may belong to more than one cluster

17
New cards

partitioning approach

constructs various partitions and then evaluates them by a user-specified criterion

18
New cards

hierarchical approach

creates a hierarchical decomposition of the set of data using a user specified criterion

19
New cards

density-based approach

based on connectivity and density functions

20
New cards

grid-based approach

based on a multiple-level granularity structure

21
New cards

partitioning method

this is ____ a dataset into a set of K clusters, such that the sum of squared distances between objects and their cluster centers is minimized

22
New cards

the K-means clustering method

  1. partition objects into k non-empty subsets

  2. compute seed points as the centroids of the clusters of the current partitioning

  3. assign each object to the cluster with the nearest seed point

  4. go back to step 2; stop when the assignment does not change
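The four steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the helper names, the round-robin initial partition, the `max_iter` cap, and the rule for reseeding an empty cluster are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    """Mean of a non-empty list of points, computed coordinate by coordinate."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: partition the objects into k non-empty subsets by dealing a
    # shuffled order out round-robin (assumes k <= len(points)).
    order = list(range(len(points)))
    rng.shuffle(order)
    assignment = [0] * len(points)
    for rank, idx in enumerate(order):
        assignment[idx] = rank % k
    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the seed points as centroids of the current clusters.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # Assumption: an empty cluster is reseeded with a random point.
            centroids.append(centroid(members) if members else rng.choice(points))
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when the assignment does not change, else repeat.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centers = k_means(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets which label depends on the random initial partition.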

23
New cards

K-medoids method

instead of taking the mean value of the objects in a cluster as a reference point, this method uses a medoid, which is the most centrally located object in the cluster
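To make the difference between a mean and a medoid concrete, here is a small sketch that picks the medoid of a single cluster. It reads "most centrally located" as "smallest total distance to the other members", which is a common interpretation but an assumption here; the function name `medoid` is illustrative.

```python
def medoid(cluster, dist):
    """Return the member of `cluster` with the smallest total distance to all members."""
    return min(
        cluster,
        key=lambda candidate: sum(dist(candidate, other) for other in cluster),
    )


cluster = [1, 2, 3, 4, 100]  # 100 is an outlier
mean_value = sum(cluster) / len(cluster)
central_object = medoid(cluster, dist=lambda a, b: abs(a - b))
```

The mean (22.0) is pulled toward the outlier and is not a member of the cluster, while the medoid (3) is an actual object near the bulk of the data.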

24
New cards

AGNES

uses the single-link method and the dissimilarity matrix

merges nodes that have the least dissimilarity

proceeds iteratively in a non-descending fashion

eventually all nodes belong to the same cluster
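A rough sketch of this agglomerative procedure with single-link merging is shown below. It recomputes pairwise distances on every pass instead of maintaining a dissimilarity matrix, which is a simplification; the function name and the returned merge log are assumptions made for this example.

```python
def agnes_single_link(points, dist):
    """Merge the two least dissimilar clusters until one cluster remains.

    Returns a list of (cluster_a, cluster_b, merge_distance) tuples.
    """
    clusters = [[p] for p in points]  # start with every object in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: smallest distance between members of the two clusters.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


merges = agnes_single_link([1, 2, 6, 7, 15], dist=lambda a, b: abs(a - b))
merge_distances = [d for _, _, d in merges]
```

For these five values the merge distances come out as 1, 1, 4, 8, so they never decrease from one merge to the next.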

25
New cards

DIANA

inverse order of AGNES

eventually each node forms a cluster on its own

26
New cards

single link

smallest distance between an element in one cluster and an element in the other

27
New cards

complete link

largest distance between an element in one cluster and an element in the other

28
New cards

average link

average distance between an element in one cluster and an element in the other
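The three linkage definitions (single, complete, average) differ only in how the pairwise distances between two clusters are summarized. A minimal sketch, with illustrative function names:

```python
from itertools import product


def pairwise_distances(c1, c2, dist):
    """All distances between an element of c1 and an element of c2."""
    return [dist(a, b) for a, b in product(c1, c2)]


def single_link(c1, c2, dist):
    return min(pairwise_distances(c1, c2, dist))  # smallest pairwise distance


def complete_link(c1, c2, dist):
    return max(pairwise_distances(c1, c2, dist))  # largest pairwise distance


def average_link(c1, c2, dist):
    d = pairwise_distances(c1, c2, dist)
    return sum(d) / len(d)  # mean pairwise distance


def absolute_difference(a, b):
    return abs(a - b)


left, right = [1, 2], [5, 9]
single = single_link(left, right, absolute_difference)
complete = complete_link(left, right, absolute_difference)
average = average_link(left, right, absolute_difference)
```

Here the pairwise distances are 4, 8, 3, and 7, giving a single-link distance of 3, a complete-link distance of 8, and an average-link distance of 5.5.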

29
New cards

classification

predicts categorical class labels

classifies new data based on the training set and the corresponding target values

30
New cards

prediction

models continuous-valued functions, i.e., predicts unknown or missing values

31
New cards

model construction

describing a set of predetermined classes

  • each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

  • the set of tuples used for model construction is referred to as the training set

  • the model is represented as classification rules, decision trees, or mathematical equations

32
New cards

model usage

for classifying future or unknown objects

  • estimate accuracy of the model

  • a test set with known labels is compared with the classified result from the model

  • accuracy rate is the percentage of test set samples that are correctly classified by the model
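The accuracy rate described here can be computed by comparing the known test labels with the model's output. A minimal sketch, with an illustrative function name and made-up labels:

```python
def accuracy_rate(true_labels, predicted_labels):
    """Percentage of test samples whose predicted label matches the known label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


known = ["yes", "no", "yes", "yes"]
predicted = ["yes", "no", "no", "yes"]
rate = accuracy_rate(known, predicted)
```

Three of the four test samples are classified correctly, so the accuracy rate is 75%.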

33
New cards

CHAID

a popular decision tree algorithm whose attribute selection measure is based on the chi-square test for independence

34
New cards

C-Sep

performs better than info gain and gini index in certain cases

35
New cards

G-statistic

has a close approximation to the chi-square distribution

36
New cards

overfitting

an induced tree may overfit the training data

  • too many branches, some of which may reflect anomalies due to noise or outliers

  • poor accuracy for unseen samples

37
New cards

prepruning

halt tree construction early

  • do not split a node if this would result in the goodness measure falling below a threshold

38
New cards

postpruning

remove branches from a fully grown tree to get a sequence of progressively pruned trees

  • use a validation set of data to decide which is the best pruned tree

39
New cards

scalability

classifying data sets with millions of examples and hundreds of attributes with reasonable speed

40
New cards

training set

on this set, different parameters of the selected models are tuned, and the best model is selected for performance estimation

41
New cards

testing set

performance estimation is then performed on a test set T.

it is imperative that the test set should be reserved solely for testing throughout a study

42
New cards

holdout method

this is considered to be the simplest form of performance estimation; it partitions the data into two disjoint sets: a training set and a test set
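A holdout split can be sketched as a shuffle followed by a cut. The 30% test fraction, the fixed seed, and the function name are assumptions for illustration; the card itself does not prescribe a ratio.

```python
import random


def holdout_split(samples, test_fraction=0.3, seed=0):
    """Shuffle the samples and cut them into disjoint training and test sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = holdout_split(range(10), test_fraction=0.3)
```

With ten samples this yields seven for training and three for testing, and no sample appears in both sets.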

43
New cards

k-fold cross validation

this is the most prominently used performance estimation technique in data analytics applications
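In the usual formulation of k-fold cross validation, the data are split into k folds, each fold serves once as the test set while the remaining folds form the training set, and the k results are averaged. The sketch below only generates the index splits; it does not shuffle, and the function name is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
test_sizes = [len(test_idx) for _, test_idx in folds]
```

For ten samples and k = 3 the test folds have sizes 4, 3, and 3, and every sample is tested exactly once.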

44
New cards

binary classification algorithm

this maps a sample to one of two classes, denoted C+ and C-

45
New cards

classes

binary classifiers predict only the ____ to which test samples belong

46
New cards

classifier accuracy

percentage of test set tuples that are correctly classified

47
New cards

TPR and FPR

these (the true positive rate and the false positive rate) are the two most important measures of model performance
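Both rates follow from the confusion-matrix counts: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A minimal sketch, assuming labels encoded as 1 (positive) and 0 (negative); the function names and the sample labels are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives, and false negatives."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def tpr_fpr(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # share of actual positives found
    fpr = fp / (fp + tn) if fp + tn else 0.0  # share of actual negatives flagged
    return tpr, fpr


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
```

In this example 3 of 4 positives are found (TPR = 0.75) and 1 of 6 negatives is wrongly flagged (FPR ≈ 0.167).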

48
New cards

ROC-curve

this is a classification evaluation technique that is used to visually compare the performance of classifiers

49
New cards

AUC

this is the area under the ROC curve, a relative measure that ranges from 0 to 1 in the ROC space
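One way to compute the AUC without drawing the curve uses the fact that it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise formulation; the function name and the scores are made up for illustration.

```python
def auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(y_true, scores)
```

Here 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9 (about 0.89); a perfect ranking gives 1.0 and a random one about 0.5.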

50
New cards

precision and recall

these

  • are used to evaluate the retrieval performance of a classifier and

  • are suited to applications that deal with information retrieval

51
New cards

precision

this is the ratio of the number of true positives to the total number of predicted positives

52
New cards

recall

this is the ratio of the number of true positives to the total number of actual positives (true positives plus false negatives)

53
New cards

F-measure

this is the harmonic mean of precision (p) and recall (r); it is high only when both the p and r values are high
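Precision, recall, and the F-measure can all be computed from the counts of true positives, false positives, and false negatives. A minimal sketch with made-up counts; the function name is illustrative, and F is taken here as the balanced harmonic mean 2pr / (p + r).

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # TP over predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # TP over actual positives
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


p, r, f = precision_recall_f(tp=6, fp=2, fn=4)
```

With 6 true positives, 2 false positives, and 4 false negatives, precision is 0.75, recall is 0.6, and the F-measure is about 0.667, which sits closer to the lower of the two values.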

54
New cards

collaborative filtering

find the customers closest to the target customer and recommend based on what those customers bought
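As a toy illustration, "closest" can be measured by the overlap between purchase sets. The sketch below uses Jaccard similarity and only the single nearest customer; both choices, along with the customer names and items, are assumptions made for this example.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases):
    """Recommend items bought by the most similar other customer but not by the target."""
    others = {name: items for name, items in purchases.items() if name != target}
    nearest = max(others, key=lambda name: jaccard(purchases[target], others[name]))
    return sorted(others[nearest] - purchases[target])


purchases = {
    "ann": {"milk", "bread", "eggs"},
    "bob": {"milk", "bread", "jam"},
    "cat": {"tv", "laptop"},
}
suggestions = recommend("ann", purchases)
```

Bob shares two of four distinct items with Ann while Cat shares none, so Ann is recommended the item Bob bought that she has not: jam.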

55
New cards

content-based filtering

see what a customer has bought in the past and use this information to predict what they would like in the future

56
New cards

rule-based approach

identify business rules about what products should be recommended