clustering
this is a grouping of objects (or data points)
clustering analysis
this is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups
high intra-class similarity and low inter-class similarity
a good clustering method will produce high-quality clusters, which should have these two properties
high intra-class similarity
this property of good clustering means clusters are cohesive internally
low inter-class similarity
this property of good clustering means clusters are distinctive from one another
nominal variable
this includes categories, states, or names of things
Example: Hair_color and Marital_status
ordinal variable
these values have a meaningful order, but the magnitude between successive values is not known.
Example: shirt_size or army rankings
proximity
this refers to either similarity or dissimilarity
similarity
this refers to the measure or similarity function
numerical measure of how alike two data objects are
value is higher when objects are more alike
often falls in the range [0,1]
dissimilarity
this is the measure or distance function
numerical measure of how different two data objects are
lower when objects are more alike
minimum is often 0
upper limit varies [0,1] or [0, inf]
the quality of a clustering method depends on
the similarity measure used by the method
its implementation
its ability to discover some or all of the hidden patterns
supervised learning (classification)
the training dataset is accompanied by labels indicating the class of each observation.
new data is classified based on the training set
unsupervised learning (clustering)
the class labels of the training data are unknown
given a set of measurements or observations, the aim is to establish the existence of classes or clusters in the data
minkowski distance
a popular distance measure that generalizes Manhattan and Euclidean distance
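A minimal sketch of the Minkowski distance; the sample points and the orders p chosen below are made-up for illustration (p = 1 gives Manhattan distance, p = 2 gives Euclidean):

```python
def minkowski(x, y, p):
    # Minkowski distance of order p between two equal-length points.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (1, 2, 3), (4, 6, 3)
print(minkowski(x, y, 1))  # Manhattan distance: 7.0
print(minkowski(x, y, 2))  # Euclidean distance: 5.0
```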
exclusive clustering
an object can belong to only one cluster
non-exclusive clustering
an object may belong to more than one cluster
partitioning approach
construct various partitions and then evaluate them by a user-specified criterion
hierarchical approach
creates a hierarchical decomposition of the set of data using a user-specified criterion
density-based approach
based on connectivity and density functions
grid-based approach
based on a multiple-level granularity structure
partitioning method
this is ____ a dataset into a set of K clusters, such that the sum of squared distances is minimized
the K-means clustering method
partition objects into k non-empty subsets
compute seed points as the centroids of the clusters of the current partitioning
Assign each object to the cluster with the nearest seed point
go back to step 2, stop when the assignment does not change
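The K-means steps above can be sketched as follows; the `seeds` parameter and the sample points are hypothetical additions so a run is deterministic, not part of the original description:

```python
def kmeans(points, k, seeds, iters=100):
    # Step 1 analogue: start from k given seed points (hypothetical parameter).
    centroids = list(seeds)
    for _ in range(iters):
        # Step 3: assign each object to the cluster with the nearest seed point.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # Step 2: recompute seed points as the centroids of the current partitioning.
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Stop when the assignment (and hence the centroids) no longer changes.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
cents, cls = kmeans(pts, 2, seeds=[(1, 1), (8, 8)])
```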
K-medoids method
instead of taking the mean value of the objects in a cluster as a reference point, this can be used, which is the most centrally located object in the cluster
AGNES
uses the single-link method and the dissimilarity matrix
merge nodes that have the least dissimilarity
proceeds iteratively in a non-descending fashion
eventually all nodes belong to the same cluster
DIANA
inverse order of AGNES
eventually each node forms a cluster on its own
single link
smallest distance between an element in one cluster and an element in the other
complete link
largest distance between an element in one cluster and an element in the other
average link
average distance between an element in one cluster and an element in the other
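The three linkage measures above can be sketched directly; the two small example clusters are made-up data:

```python
def dist(a, b):
    # Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(c1, c2):
    # Smallest distance between an element in one cluster and one in the other.
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # Largest distance between an element in one cluster and one in the other.
    return max(dist(a, b) for a in c1 for b in c2)

def average_link(c1, c2):
    # Average distance over all cross-cluster pairs.
    ds = [dist(a, b) for a in c1 for b in c2]
    return sum(ds) / len(ds)

c1, c2 = [(0, 0), (1, 0)], [(4, 0), (5, 0)]
print(single_link(c1, c2))    # 3.0
print(complete_link(c1, c2))  # 5.0
print(average_link(c1, c2))   # 4.0
```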
classification
predicts categorical class labels
classifies new data based on the training set and the corresponding target values
prediction
models continuous-valued functions, i.e., predicts unknown or missing values
model construction
describing a set of predetermined classes
each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
the set of tuples used for model construction is referred to as the training set
the model is represented as classification rules, decision trees, or mathematical equations
model usage
for classifying future or unknown objects
estimate accuracy of the model
a test set with known labels is compared with the classified result from the model
accuracy rate is the percentage of test set samples that are correctly classified by the model
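The accuracy rate can be computed as a simple percentage; the label lists below are made-up examples:

```python
def accuracy(true_labels, predicted):
    # Percentage of test set samples correctly classified by the model.
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return 100.0 * correct / len(true_labels)

print(accuracy(["+", "-", "+", "+"], ["+", "-", "-", "+"]))  # 75.0
```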
CHAID
a popular decision tree algorithm; its measure is based on the chi-square test for independence
C-Sep
performs better than info gain and gini index in certain cases
G-statistic
has a close approximation to the chi-square distribution
overfitting
an induced tree may overfit the training data
too many branches, some of which may reflect anomalies due to noise or outliers
poor accuracy for unseen samples
prepruning
halt tree construction early
do not split a node if this would result in the goodness measure falling below a threshold
postpruning
remove branches from a fully grown tree to get a sequence of progressively pruned trees
use a validation set of data to decide which is the best pruned tree
scalability
classifying data sets with millions of examples and hundreds of attributes with reasonable speed
training set
different parameters of the selected models are tweaked and the best model is selected for performance estimation
testing set
performance estimation is then performed on a test set T.
it is imperative that the test set should be reserved solely for testing throughout a study
holdout method
this is considered the simplest form of performance estimation; it partitions the data into two disjoint sets, a training set and a test set
k-fold cross validation
this is the most prominently used performance estimation technique in data analytics applications
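A minimal sketch of k-fold cross-validation, assuming each fold serves once as the test set while the remaining folds form the training set; the `train_and_score` callback is a hypothetical stand-in for fitting and evaluating a model:

```python
def kfold_indices(n, k):
    # Split indices 0..n-1 into k roughly equal, disjoint folds.
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_and_score):
    # Each fold is held out once for testing; the rest is used for training.
    scores = []
    for test_idx in kfold_indices(len(data), k):
        held_out = set(test_idx)
        test = [data[i] for i in test_idx]
        train = [data[i] for i in range(len(data)) if i not in held_out]
        scores.append(train_and_score(train, test))
    return sum(scores) / k  # average score over the k folds
```

In practice, `train_and_score` would fit a classifier on `train` and return its accuracy on `test`.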
binary classification algorithm
this maps a sample to one of two classes, denoted C+ and C-