clustering
this is a grouping of objects (or data points)
clustering analysis
this is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups
high intra-class similarity and low inter-class similarity
a good clustering method will produce high-quality clusters, which should have these two properties
high intra-class similarity
this property of good clustering means clusters are cohesive internally
low inter-class similarity
this property of good clustering means clusters are distinctive from one another
nominal variable
this includes categories, states, or names of things
Example: Hair_color and Marital_status
ordinal variable
these values have a meaningful order, but the magnitude between successive values is not known.
Example: shirt_size or army rankings
proximity
this refers to either similarity or dissimilarity
similarity
this refers to the measure or similarity function
numerical measure of how alike two data objects are
value is higher when objects are more alike
often falls in the range [0,1]
dissimilarity
this is the measure or distance function
numerical measure of how different two data objects are
lower when objects are more alike
minimum is often 0
upper limit varies [0,1] or [0, inf]
the quality of a clustering method depends on
the similarity measure used by the method
its implementation
its ability to discover some or all of the hidden patterns
supervised learning (classification)
the training dataset is accompanied by labels indicating the class of each observation.
new data is classified based on the training set
unsupervised learning (clustering)
the class labels of the training data are unknown
given a set of measurements or observations, the aim is to establish the existence of classes or clusters in the data
minkowski distance
a popular distance measure that generalizes Manhattan and Euclidean distance
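A minimal sketch of the Minkowski distance; the sample points and the orders p chosen below are made-up for illustration (p = 1 gives Manhattan distance, p = 2 gives Euclidean):

```python
def minkowski(x, y, p):
    # Minkowski distance of order p between two equal-length points.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (1, 2, 3), (4, 6, 3)
print(minkowski(x, y, 1))  # Manhattan distance: 7.0
print(minkowski(x, y, 2))  # Euclidean distance: 5.0
```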
exclusive clustering
an object can belong to only one cluster
non-exclusive clustering
an object may belong to more than one cluster
partitioning approach
construct various partitions and then evaluate them by a user-specified criterion
hierarchical approach
creates a hierarchical decomposition of the set of data using a user-specified criterion
density-based approach
based on connectivity and density functions
grid-based approach
based on a multiple-level granularity structure
partitioning method
this is ____ a dataset into a set of K clusters, such that the sum of squared distances is minimized
the K-means clustering method
partition objects into k non-empty subsets
compute seed points as the centroids of the clusters of the current partitioning
Assign each object to the cluster with the nearest seed point
go back to step 2, stop when the assignment does not change
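The K-means steps above can be sketched as follows; the `seeds` parameter and the sample points are hypothetical additions so a run is deterministic, not part of the original description:

```python
def kmeans(points, k, seeds, iters=100):
    # Step 1 analogue: start from k given seed points (hypothetical parameter).
    centroids = list(seeds)
    for _ in range(iters):
        # Step 3: assign each object to the cluster with the nearest seed point.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # Step 2: recompute seed points as the centroids of the current partitioning.
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Stop when the assignment (and hence the centroids) no longer changes.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
cents, cls = kmeans(pts, 2, seeds=[(1, 1), (8, 8)])
```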
K-medoids method
instead of taking the mean value of the objects in a cluster as a reference point, this can be used, which is the most centrally located object in the cluster
AGNES
uses the single-link method and the dissimilarity matrix
merge nodes that have the least dissimilarity
proceeds iteratively in a non-descending fashion
eventually all nodes belong to the same cluster
DIANA
inverse order of AGNES
eventually each node forms a cluster on its own
single link
smallest distance between an element in one cluster and an element in the other
complete link
largest distance between an element in one cluster and an element in the other
average link
average distance between an element in one cluster and an element in the other
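The three linkage measures above can be sketched directly; the two small example clusters are made-up data:

```python
def dist(a, b):
    # Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(c1, c2):
    # Smallest distance between an element in one cluster and one in the other.
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # Largest distance between an element in one cluster and one in the other.
    return max(dist(a, b) for a in c1 for b in c2)

def average_link(c1, c2):
    # Average distance over all cross-cluster pairs.
    ds = [dist(a, b) for a in c1 for b in c2]
    return sum(ds) / len(ds)

c1, c2 = [(0, 0), (1, 0)], [(4, 0), (5, 0)]
print(single_link(c1, c2))    # 3.0
print(complete_link(c1, c2))  # 5.0
print(average_link(c1, c2))   # 4.0
```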
classification
predicts categorical class labels
classifies new data based on the training set and the corresponding target values
prediction
models continuous-valued functions, i.e., predicts unknown or missing values
model construction
describing a set of predetermined classes
each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
the set of tuples used for model construction is referred to as the training set
the model is represented as classification rules, decision trees, or mathematical equations
model usage
for classifying future or unknown objects
estimate accuracy of the model
a test set with known labels is compared with the classified result from the model
accuracy rate is the percentage of test set samples that are correctly classified by the model
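The accuracy rate can be computed as a simple percentage; the label lists below are made-up examples:

```python
def accuracy(true_labels, predicted):
    # Percentage of test set samples correctly classified by the model.
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return 100.0 * correct / len(true_labels)

print(accuracy(["+", "-", "+", "+"], ["+", "-", "-", "+"]))  # 75.0
```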
CHAID
a popular decision tree algorithm; its measure is based on the chi-square test for independence
C-Sep
performs better than info gain and gini index in certain cases
G-statistic
has a close approximation to the chi-square distribution
overfitting
an induced tree may overfit the training data
too many branches, some of which may reflect anomalies due to noise or outliers
poor accuracy for unseen samples
prepruning
halt tree construction early
do not split a node if this would result in the goodness measure falling below a threshold
postpruning
remove branches from a fully grown tree to get a sequence of progressively pruned trees
use a validation set of data to decide which is the best pruned tree
scalability
classifying data sets with millions of examples and hundreds of attributes with reasonable speed
training set
different parameters of the selected models are tweaked and the best model is selected for performance estimation
testing set
performance estimation is then performed on a test set T.
it is imperative that the test set should be reserved solely for testing throughout a study
holdout method
this is considered the simplest form of performance estimation; it partitions the data into two disjoint sets, a training set and a test set
k-fold cross validation
this is the most prominently used performance estimation technique in data analytics applications
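A minimal sketch of k-fold cross-validation, assuming each fold serves once as the test set while the remaining folds form the training set; the `train_and_score` callback is a hypothetical stand-in for fitting and evaluating a model:

```python
def kfold_indices(n, k):
    # Split indices 0..n-1 into k roughly equal, disjoint folds.
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_and_score):
    # Each fold is held out once for testing; the rest is used for training.
    scores = []
    for test_idx in kfold_indices(len(data), k):
        held_out = set(test_idx)
        test = [data[i] for i in test_idx]
        train = [data[i] for i in range(len(data)) if i not in held_out]
        scores.append(train_and_score(train, test))
    return sum(scores) / k  # average score over the k folds
```

In practice, `train_and_score` would fit a classifier on `train` and return its accuracy on `test`.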
binary classification algorithm
this maps a sample to one of two classes, denoted C+ and C-