data mining final

63 Terms

1

clustering

this is a grouping of objects (or data points)

2

clustering analysis

this is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups

3

high intra-class similarity and low inter-class similarity

a good clustering method will produce high quality clusters which should have these 2 things

4

high intra class similarity

this property of good clustering means that objects are cohesive within clusters

5

low inter-class similarity

this property of good clustering means that objects are distinctive between clusters

6

nominal variable

this includes categories, states, or names of things

Example: Hair_color and Marital_status

7

ordinal variable

these values have a meaningful order, but the magnitude between successive values is not known.

Example: shirt_size or army rankings

8

proximity

this refers to either similarity or dissimilarity

9

similarity

this refers to the measure or similarity function

  • numerical measure of how alike two data objects are

  • value is higher when objects are more alike

  • often falls in the range [0,1]

10

dissimilarity

this is the measure or distance function

  • numerical measure of how different two data objects are

  • lower when objects are more alike

  • minimum is often 0

  • upper limit varies: the range may be [0, 1] or [0, ∞)

11

  • the similarity measure used by the method

  • its implementation

  • its ability to discover some or all of the hidden patterns

the quality of a clustering method depends on

12

supervised learning (classification)

the training data are accompanied by labels indicating the class of each observation.

new data is classified based on the training set

13

unsupervised learning (clustering)

the class labels of the training data are unknown

given a set of measurements or observations, the aim is to establish the existence of classes or clusters in the data

14

minkowski distance

a popular distance measure that generalizes the Manhattan and Euclidean distances
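
For reference, the Minkowski distance of order p between two points x and y is (sum over i of |x_i - y_i|^p)^(1/p); p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. A minimal sketch follows; the function name `minkowski_distance` and the sample points are illustrative assumptions, not part of the card set.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1")
    # Sum the p-th powers of the coordinate differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Illustrative points (hypothetical data).
print(minkowski_distance([0, 0], [3, 4], p=1))  # Manhattan distance
print(minkowski_distance([0, 0], [3, 4], p=2))  # Euclidean distance
```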

15

exclusive clustering

an object can belong to only one cluster

16

non-exclusive clustering

an object may belong to more than one cluster

17

partitioning approach

constructs various partitions and then evaluates them by a user-specified criterion

18

hierarchical approach

creates a hierarchical decomposition of the set of data using a user specified criterion

19

density-based approach

based on connectivity and density functions

20

grid-based approach

based on a multiple-level granularity structure

21

partitioning method

this is ____ a dataset into a set of K clusters such that the sum of squared distances from each object to the center (centroid or medoid) of its cluster is minimized
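
The criterion on this card can be made concrete with a small sketch that computes the sum of squared distances for a given partition, assuming the centroid (mean) is used as each cluster's center. The function name `sum_squared_error` and both sample partitions are hypothetical.

```python
def sum_squared_error(clusters):
    """Sum of squared distances from each point to the centroid of its cluster.

    clusters: list of clusters, each a non-empty list of equal-length numeric points.
    """
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centroid = coordinate-wise mean of the cluster's points.
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dim))
    return total


# The same four points partitioned two ways (hypothetical data).
good = [[(1.0, 1.0), (1.0, 3.0)], [(9.0, 9.0), (9.0, 11.0)]]
bad = [[(1.0, 1.0), (9.0, 9.0)], [(1.0, 3.0), (9.0, 11.0)]]
print(sum_squared_error(good), sum_squared_error(bad))
```

A partitioning method prefers the partition with the smaller value, here the one that keeps nearby points together.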

22

the K-means clustering method

  1. partition objects into k non-empty subsets

  2. compute seed points as the centroids of the clusters of the current partitioning

  3. Assign each object to the cluster with the nearest seed point

  4. go back to step 2; stop when the assignment does not change
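
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the initial partition is a simple round-robin split, squared Euclidean distance is used to find the nearest centroid, a cluster that loses all members keeps its previous centroid, and the names `k_means` and `max_iter` are hypothetical.

```python
def k_means(points, k, max_iter=100):
    """Return a list of cluster indices, one per point."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    dim = len(points[0])
    # Step 1: partition the objects into k non-empty subsets (round-robin here).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute each seed point as the centroid of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_labels = [
            min(range(k),
                key=lambda c: sum((p[d] - centroids[c][d]) ** 2 for d in range(dim)))
            for p in points
        ]
        # Step 4: stop when the assignment does not change, else repeat from step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Two visually separated groups (hypothetical data).
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(k_means(data, k=2))
```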

23

K-medoids method

instead of taking the mean value of the objects in a cluster as a reference point, this method uses the medoid, which is the most centrally located object in the cluster
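
A small sketch of the difference between the two reference points, assuming "most centrally located" is taken to mean the object with the smallest total Euclidean distance to the other objects in the cluster. The function name `medoid` and the sample cluster are hypothetical.

```python
def medoid(points):
    """Return the point with the smallest total distance to all points in the cluster."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    return min(points, key=lambda c: sum(dist(c, p) for p in points))


# A cluster with one outlier (hypothetical data).
cluster = [(1, 1), (2, 2), (3, 3), (4, 4), (100, 100)]
mean = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
print("mean:", mean)              # pulled toward the outlier, not an actual object
print("medoid:", medoid(cluster)) # always one of the objects in the cluster
```

Because the medoid must be an actual object, it is less sensitive to outliers than the mean.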

24

AGNES

uses the single-link method and the dissimilarity matrix

merge nodes that have the least dissimilarity

proceeds iteratively in a non-descending fashion

eventually all nodes belong to the same cluster
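
The merging behavior described on this card can be sketched as a naive agglomerative loop over a dissimilarity matrix using the single-link rule. This is an illustrative sketch only; the function name `agnes_single_link`, the returned merge-history format, and the sample matrix are assumptions.

```python
def agnes_single_link(dist):
    """Agglomerative clustering on a symmetric dissimilarity matrix (list of lists).

    Returns the merge history as (cluster_a, cluster_b, dissimilarity) tuples.
    """
    clusters = [[i] for i in range(len(dist))]  # start: each node is its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: smallest dissimilarity between any pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the two clusters with the least dissimilarity.
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Dissimilarity matrix for four objects (hypothetical data).
D = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
for step in agnes_single_link(D):
    print(step)
```

The merge dissimilarities never decrease from one step to the next, and the last merge leaves a single cluster containing every node.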

25

DIANA

inverse order of AGNES

eventually each node forms a cluster on its own

26

single link

smallest distance between an element in one cluster and an element in the other

27

complete link

largest distance between an element in one cluster and an element in the other

28

average link

average distance between an element in one cluster and an element in the other
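
The single, complete, and average link definitions differ only in how the pairwise distances between two clusters are summarized. A minimal sketch, assuming Euclidean distance between elements; all function names and the sample clusters are hypothetical.

```python
from itertools import product


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def pairwise(c1, c2):
    """All distances between an element of c1 and an element of c2."""
    return [euclidean(a, b) for a, b in product(c1, c2)]


def single_link(c1, c2):
    return min(pairwise(c1, c2))      # smallest pairwise distance


def complete_link(c1, c2):
    return max(pairwise(c1, c2))      # largest pairwise distance


def average_link(c1, c2):
    d = pairwise(c1, c2)
    return sum(d) / len(d)            # mean of all pairwise distances


# Two one-dimensional clusters (hypothetical data).
c1 = [(0,), (1,)]
c2 = [(3,), (5,)]
print(single_link(c1, c2), complete_link(c1, c2), average_link(c1, c2))
```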

29

classification

predicts categorical class labels

classifies new data based on the training set and the corresponding target values

30

prediction

models continuous-valued functions, i.e., predicts unknown or missing values

31

model construction

describing a set of predetermined classes

  • each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

  • the set of tuples used for model construction is referred to as the training set

  • the model is represented as classification rules, decision trees, or mathematical equations

32

model usage

for classifying future or unknown objects

  • estimate accuracy of the model

  • the known labels of a test set are compared with the classified results from the model

  • accuracy rate is the percentage of test set samples that are correctly classified by the model
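
The accuracy rate described above can be computed directly from the known test labels and the model's output. A minimal sketch; the function name `accuracy_rate` and the sample labels are hypothetical.

```python
def accuracy_rate(true_labels, predicted_labels):
    """Percentage of test samples whose predicted label matches the known label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Known test labels versus a model's output (hypothetical data).
known = ["yes", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no"]
print(accuracy_rate(known, predicted))
```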

33

CHAID

a popular decision tree algorithm whose attribute selection measure is based on the chi-square test for independence
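
To illustrate the measure this card refers to, the sketch below computes the Pearson chi-square statistic for a contingency table of an attribute against the class label. It shows the statistic only, not the CHAID algorithm itself; the function name `chi_square_statistic` and the sample tables are hypothetical, and the code assumes every expected count is non-zero.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the attribute and the class were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: attribute values; columns: class labels (hypothetical counts).
independent = [[10, 10], [10, 10]]
dependent = [[20, 0], [0, 20]]
print(chi_square_statistic(independent), chi_square_statistic(dependent))
```

A larger statistic indicates a stronger association between the attribute and the class, which makes the attribute a more attractive split candidate.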

34

C-Sep

performs better than info gain and gini index in certain cases

35

G-statistic

closely approximates the chi-square distribution

36

overfitting

an induced tree may overfit the training data

  • too many branches, some of which may reflect anomalies due to noise or outliers

  • poor accuracy for unseen samples

37

prepruning

halt tree construction early

  • do not split a node if this would result in the goodness measure falling below a threshold

38

postpruning

remove branches from a fully grown tree to get a sequence of progressively pruned trees

  • use a validation set of data to decide which is the best pruned tree

39

scalability

classifying data sets with millions of examples and hundreds of attributes with reasonable speed

40

training set

on this set, different parameters of the selected models are tuned, and the best model is then selected for performance estimation

41

testing set

performance estimation is then performed on a test set T.

it is imperative that the test set should be reserved solely for testing throughout a study

42

holdout method

this is considered to be the simplest form of performance estimation; it partitions the data into two disjoint sets: a training set and a test set
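
A minimal sketch of such a split, assuming the samples are shuffled before partitioning. The function name `holdout_split`, the 30% test fraction, and the fixed seed are illustrative assumptions, not values given on the card.

```python
import random


def holdout_split(data, test_fraction=0.3, seed=0):
    """Partition data into two disjoint lists: (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = max(1, int(round(len(data) * test_fraction)))
    test = [data[i] for i in indices[:n_test]]
    train = [data[i] for i in indices[n_test:]]
    return train, test


samples = list(range(10))  # hypothetical sample identifiers
train, test = holdout_split(samples, test_fraction=0.3)
print(len(train), len(test))
```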

43

k-fold cross validation

this is the most prominently used performance estimation technique in data analytics applications
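
In k-fold cross validation the data is divided into k disjoint folds, and each fold serves once as the test set while the remaining folds form the training set. The sketch below generates the index splits only; the function name `k_fold_indices` and the interleaved fold assignment are assumptions made for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        test = folds[i]
        # Training set: every index that is not in the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train_idx, test_idx in k_fold_indices(10, 5):
    print(test_idx, train_idx)
```

The performance estimate is then typically the average of the k per-fold results.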

44

binary classification algorithm

this maps a sample to one of two classes, denoted C+ and C-

