Classification
Task of assigning objects to one of several predefined categories.
Input
Collection of records (instances/examples), each represented by a pair (X, y), where X is the attribute set and y is the class label.
Goal
Learn a target function f that maps attribute sets to class labels.
Descriptive Modeling
Explain differences between classes.
Predictive Modeling
Predict unknown class labels using the model.
General Approach
Use a training set (records with known labels) to build the model, then apply the model to a test set (records whose labels are withheld) to evaluate its performance.
Confusion Matrix
A table that tallies correct and incorrect predictions: entry (i, j) counts records of actual class i predicted as class j.
Accuracy
Accuracy = (Correct Predictions) / (Total Predictions)
Error Rate
Error Rate = 1 - Accuracy
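A quick sketch of how accuracy and error rate are read off a confusion matrix; the counts below are made up for illustration (rows are actual classes, columns are predicted classes).

```python
# Minimal sketch: accuracy and error rate from a 2x2 confusion matrix.
# The layout (rows = actual class, columns = predicted class) is an
# assumption; adjust if your convention differs.
confusion = [
    [50, 10],  # actual class 0: 50 predicted as 0, 10 predicted as 1
    [5, 35],   # actual class 1: 5 predicted as 0, 35 predicted as 1
]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)

accuracy = correct / total   # (50 + 35) / 100 = 0.85
error_rate = 1 - accuracy    # 0.15
print(accuracy, error_rate)
```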
Decision Trees
A tree-structured model with a root node, internal nodes, and leaf nodes, used for classification.
Root Node
No incoming edges; zero or more outgoing edges.
Internal Nodes
One incoming edge, two or more outgoing edges.
Leaf Nodes
No outgoing edges; assigned a class label.
Class Label Prediction
Start from the root and apply the attribute test at each node, following the matching branch until reaching a leaf node; the leaf's class label is the prediction.
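A minimal Python sketch of this prediction procedure on a hand-built tree; the dict-based node layout and the animal attributes are illustrative assumptions, not a standard API.

```python
# Minimal sketch of predicting with a hand-built decision tree.
# Leaves carry a "label"; internal nodes carry an "attribute" test
# and one branch per attribute value.
tree = {
    "attribute": "Give Birth",
    "branches": {
        "yes": {"label": "Mammals"},
        "no": {
            "attribute": "Can Fly",
            "branches": {
                "yes": {"label": "Birds"},
                "no": {"label": "Reptiles"},
            },
        },
    },
}

def predict(node, record):
    # Follow attribute tests from the root until a leaf is reached.
    while "label" not in node:
        node = node["branches"][record[node["attribute"]]]
    return node["label"]

print(predict(tree, {"Give Birth": "no", "Can Fly": "yes"}))  # Birds
```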
Hunt's Algorithm
A recursive procedure for building a decision tree: if all records at a node belong to one class, make it a leaf; otherwise, choose an attribute test, split the records into subsets, and recurse on each subset.
Greedy Strategy
At each step, choose the attribute test that best separates the classes according to a local criterion (e.g., an impurity measure).
Gini Index
Gini(t) = 1 - ∑ [p(i|t)]²
Entropy
Entropy(t) = -∑ p(i|t) log2 p(i|t)
Misclassification Error
Error(t) = 1 - max p(i|t)
Information Gain
Measures the reduction in entropy (or another impurity measure) achieved by a split: Gain = Impurity(parent) - ∑ (nj/n) · Impurity(child j), where child j receives nj of the parent's n records.
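A minimal sketch computing the three impurity measures above and the information gain of a candidate split; the class-count inputs are made up for illustration.

```python
from math import log2

# Impurity measures take a list of class counts at a node.

def proportions(counts):
    n = sum(counts)
    return [c / n for c in counts if c > 0]  # skip zeros to avoid log2(0)

def gini(counts):
    return 1 - sum(p ** 2 for p in proportions(counts))

def entropy(counts):
    return -sum(p * log2(p) for p in proportions(counts))

def misclassification_error(counts):
    return 1 - max(proportions(counts))

def information_gain(parent_counts, children_counts):
    # Parent entropy minus the weighted entropy of the children.
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

parent = [5, 5]                  # 5 records of each class
children = [[4, 1], [1, 4]]      # a candidate binary split
print(gini(parent), entropy(parent), misclassification_error(parent))
print(information_gain(parent, children))  # ~0.278
```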
Rule-Based Classifier
Uses a collection of 'if...then...' rules to perform classification.
Structure of a Rule
Condition (LHS / antecedent): A conjunction of attribute tests; Conclusion (RHS / consequent): A class label.
Example Rule
(Give Birth = no) ∧ (Can Fly = yes) → Birds
Applications of Rule-Based Classifier
Classification of animals based on biological traits and tax fraud prediction based on financial attributes.
Mutually Exclusive Rules
Each record matches at most one rule.
Exhaustive Rules
Every record is matched by at least one rule.
Coverage
The fraction of total records that satisfy the rule's condition.
Accuracy
The fraction of records satisfying the rule's condition that also have the correct class label.
High Coverage and High Accuracy
Both are desirable.
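A minimal sketch computing a rule's coverage and accuracy, using the example rule above on a made-up set of records.

```python
# Rule: (Give Birth = no) AND (Can Fly = yes) -> Birds
records = [
    {"Give Birth": "no", "Can Fly": "yes", "label": "Birds"},
    {"Give Birth": "no", "Can Fly": "yes", "label": "Birds"},
    {"Give Birth": "no", "Can Fly": "yes", "label": "Insects"},
    {"Give Birth": "yes", "Can Fly": "no", "label": "Mammals"},
]

def condition(r):
    return r["Give Birth"] == "no" and r["Can Fly"] == "yes"

matching = [r for r in records if condition(r)]
coverage = len(matching) / len(records)                    # 3/4 = 0.75
accuracy = (sum(r["label"] == "Birds" for r in matching)
            / len(matching))                               # 2/3 ~= 0.67
print(coverage, accuracy)
```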
Converting Trees to Rules
Each path from root to leaf becomes a classification rule.
Ordered Rule Set (Decision List)
Rules are applied in priority order; first matching rule is used for classification.
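A minimal sketch of a decision list: rules are tried in priority order and the first match wins; the rules and the default class are illustrative.

```python
# Each rule is (condition, class label), in priority order.
rules = [
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "yes", "Birds"),
    (lambda r: r["Give Birth"] == "yes", "Mammals"),
]
DEFAULT = "Reptiles"  # fallback when no rule fires (keeps the set exhaustive)

def classify(record):
    for condition, label in rules:
        if condition(record):
            return label  # first matching rule wins
    return DEFAULT

print(classify({"Give Birth": "no", "Can Fly": "yes"}))  # Birds
```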
Unordered Rule Set
Voting schemes (majority rule) may be used if multiple rules match.
Rule-based Ordering
Rank rules by individual quality.
Class-based Ordering
Group and order rules based on the predicted class.
Direct Method
Extract rules directly from data (e.g., RIPPER, CN2, Holte's 1R).
Indirect Method
Extract rules from other models like decision trees (e.g., C4.5rules).
Sequential Covering (Direct Method)
Start with an empty rule set; grow one rule at a time, remove the training records the rule covers, and repeat until a stopping criterion is met.
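A minimal sketch of the sequential-covering loop with a toy rule grower that picks a single attribute test (real learners such as RIPPER grow and prune conjunctions); the data are made up.

```python
def grow_rule(records, target):
    # Toy grower: pick the single attribute=value test whose covered
    # records have the highest accuracy for the target class.
    best, best_acc = None, 0.0
    for r in records:
        for attr, val in r.items():
            if attr == "label":
                continue
            covered = [x for x in records if x.get(attr) == val]
            acc = sum(x["label"] == target for x in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (attr, val), acc
    return best

def sequential_covering(records, target):
    rules, remaining = [], list(records)
    while any(r["label"] == target for r in remaining):
        attr, val = grow_rule(remaining, target)
        rules.append((attr, val, target))
        # Remove the records covered by the new rule, then repeat.
        remaining = [r for r in remaining if r.get(attr) != val]
    return rules

data = [
    {"Can Fly": "yes", "Give Birth": "no", "label": "Birds"},
    {"Can Fly": "no", "Give Birth": "no", "label": "Reptiles"},
    {"Can Fly": "yes", "Give Birth": "no", "label": "Birds"},
]
print(sequential_covering(data, "Birds"))  # [('Can Fly', 'yes', 'Birds')]
```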
Cluster Analysis
Finding groups of objects such that objects within a group are similar to each other and dissimilar to objects in other groups.
Maximize inter-cluster distance
Goal of cluster analysis.
Minimize intra-cluster distance
Goal of cluster analysis.
Partitional Clustering
Divides data into non-overlapping clusters; each point in exactly one cluster.
Hierarchical Clustering
Nested clusters organized as a tree (dendrogram).
Exclusive Clustering
Each point belongs to exactly one cluster.
Overlapping Clustering
Points may belong to multiple clusters (e.g., 'border' points).
Fuzzy Clustering
Points belong to all clusters with varying degrees (weights between 0 and 1, must sum to 1).
Complete Clustering
All data points are clustered.
Partial Clustering
Only a subset of data is clustered.
Well-Separated Clusters
Each point is closer to every point within its cluster than to any point outside.
Prototype-Based Clusters
Points are closer to the cluster's 'center' (centroid or medoid) than to other centers.
Graph-Based Clusters
Based on nearest neighbor chains; points are closer to some other point in the cluster than to any point outside.
Density-Based Clusters
Dense regions separated by sparser regions; useful for irregular shapes and handling noise.
Shared-Property (Conceptual Clusters)
Clusters defined by sharing a common property or concept.
Clustering Algorithms
K-means, Hierarchical, Density-Based.
K-means
Partitional clustering that associates each cluster with a centroid and assigns each point to the closest centroid; the number of clusters, K, must be specified in advance.
Hierarchical Clustering
Produces nested clusters visualized with a dendrogram.
Agglomerative Clustering
A bottom-up merging approach in hierarchical clustering.
Divisive Clustering
A top-down splitting approach in hierarchical clustering.
Density-Based Clustering
Identifies clusters as dense regions separated by low-density areas (e.g., DBSCAN).
Sum of Squared Error (SSE)
SSE = ∑ over clusters i of ∑ over points x in cluster i of dist(x, ci)², the total squared distance from each point to its cluster's center; between two clusterings, the one with lower SSE is preferred.
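A minimal K-means sketch with SSE on made-up 2-D points; random initialization, a fixed iteration count, and plain Euclidean distance are simplifying assumptions (a real run would use a convergence check or a library implementation).

```python
import random

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, iterations=100):
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    # SSE: squared distance from each point to its closest centroid.
    sse = sum(min(dist2(p, c) for c in centroids) for p in points)
    return centroids, sse

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, sse = kmeans(points, k=2)
print(centroids, sse)
```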
Limitations of K-means
Poor performance with clusters of differing sizes, densities, non-globular shapes, sensitive to outliers, and initial centroid placement.
Overcoming Limitations of K-means
Use many small clusters and combine them in a post-processing step; choose initial centroids carefully (e.g., via multiple runs); switch to an alternative clustering method when clusters are non-globular or vary in density.
DBSCAN Core Idea
Density = Number of points within a radius (Eps); core points: ≥ MinPts within Eps; border points: fewer than MinPts but in neighborhood of a core point; noise points: neither core nor border.
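A minimal sketch of DBSCAN's core/border/noise labelling on 1-D points; Eps, MinPts, and the data are illustrative, and the step that grows clusters from core points is omitted.

```python
EPS, MIN_PTS = 1.5, 3
points = [1.0, 1.5, 2.0, 2.4, 3.8, 8.0]

def neighbors(p):
    # Eps-neighborhood; by convention it includes the point itself.
    return [q for q in points if abs(p - q) <= EPS]

# Core: at least MinPts points within Eps.
core = [p for p in points if len(neighbors(p)) >= MIN_PTS]
# Border: not core, but in the neighborhood of a core point.
border = [p for p in points
          if p not in core and any(q in core for q in neighbors(p))]
# Noise: neither core nor border.
noise = [p for p in points if p not in core and p not in border]
print(core, border, noise)  # [1.0, 1.5, 2.0, 2.4] [3.8] [8.0]
```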
Strengths of DBSCAN
Handles noise well and finds clusters of varying shapes and sizes.
Weaknesses of DBSCAN
Struggles with varying densities and is less effective with high-dimensional data.
Cluster Evaluation Purpose
Avoid finding patterns in random noise and compare clustering algorithms or clusterings.
Types of Measures in Cluster Evaluation
External index (compares the clustering to externally supplied class labels), internal index (uses only the data itself, e.g., SSE), and relative index (compares two clusterings or clusters).
Anomaly Detection Definition
An object is considered an anomaly if it is distant from most points in the dataset.
Proximity-based Approaches
Use the distance to the k-nearest neighbor to assess if an object is isolated.
Outlier Score
For the distance-to-k-th-nearest-neighbor score, the lowest possible value is 0 and the highest is the maximum possible distance (can be unbounded/infinite).
Density-based Approaches
Define density as the reciprocal of the average distance to the k nearest neighbors; points with low density are flagged as anomalies.
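A minimal sketch of both ideas on 1-D data: the distance to the k-th nearest neighbor as an outlier score, and density as the reciprocal of the average distance to the k nearest neighbors; the points and k are made up.

```python
points = [1.0, 1.2, 1.4, 2.0, 2.2, 9.0]
K = 2

def knn_distances(i):
    # Sorted distances from point i to all others; keep the k smallest.
    d = sorted(abs(points[i] - points[j])
               for j in range(len(points)) if j != i)
    return d[:K]

for i, p in enumerate(points):
    d = knn_distances(i)
    outlier_score = d[-1]            # distance to the k-th nearest neighbor
    density = 1 / (sum(d) / len(d))  # reciprocal of the average kNN distance
    print(p, round(outlier_score, 2), round(density, 2))
# 9.0 gets by far the largest score (7.0) and the lowest density.
```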
Clustering-based Approaches
First, cluster the data into groups of differing density; points that fall in small clusters are chosen as candidate outliers.
Detection Strategy in Anomaly Detection
Discard small clusters that are far from the larger clusters; this requires thresholds for the minimum cluster size and the minimum distance between a small cluster and the rest.
Cluster-based Outlier
An object is a cluster-based outlier if it does not strongly belong to any cluster.
Example Using K-Means
The outlier score is computed in two ways: the distance from the point to its closest cluster centroid, and the relative distance (that distance divided by the median distance of the cluster's points to the centroid), which adjusts for clusters of differing density.
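A minimal sketch of both scores given a finished K-means run; the centroids and points are made up, and the use of the cluster's median centroid distance for the relative score follows the description above.

```python
from statistics import median

centroids = [(1.0, 1.0), (8.0, 8.0)]
points = [(1.1, 0.9), (0.8, 1.2), (1.5, 1.4),
          (8.2, 7.9), (7.7, 8.3), (5.0, 5.0)]

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def closest(p):
    # Assign each point to its closest centroid, as in K-means.
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))

assignments = [closest(p) for p in points]
# Median distance to the centroid within each cluster.
medians = [median(dist(p, centroids[i])
                  for p, a in zip(points, assignments) if a == i)
           for i in range(len(centroids))]

for p, a in zip(points, assignments):
    d = dist(p, centroids[a])
    # Absolute score, then relative score (distance / cluster median).
    print(p, round(d, 2), round(d / medians[a], 2))
# (5.0, 5.0) stands out on both scores.
```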