Classification
Task of assigning objects to one of several predefined categories.
Input
Collection of records (instances/examples), each represented by a pair (X, y), where X is the attribute set and y is the class label.
Goal
Learn a target function f that maps attribute sets to class labels.
Descriptive Modeling
Explain differences between classes.
Predictive Modeling
Predict unknown class labels using the model.
General Approach
Use a training set (records with known labels) to build the model, then apply the model to a test set (records whose labels are withheld) to evaluate its performance.
Confusion Matrix
A table that tallies correct and incorrect predictions: entry (i, j) counts records of actual class i predicted as class j.
Accuracy
Accuracy = (Correct Predictions) / (Total Predictions)
Error Rate
Error Rate = 1 - Accuracy
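A quick sketch of how accuracy and error rate are read off a confusion matrix; the counts below are made up for illustration (rows are actual classes, columns are predicted classes).

```python
# Minimal sketch: accuracy and error rate from a 2x2 confusion matrix.
# The layout (rows = actual class, columns = predicted class) is an
# assumption; adjust if your convention differs.
confusion = [
    [50, 10],  # actual class 0: 50 predicted as 0, 10 predicted as 1
    [5, 35],   # actual class 1: 5 predicted as 0, 35 predicted as 1
]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)

accuracy = correct / total   # (50 + 35) / 100 = 0.85
error_rate = 1 - accuracy    # 0.15
print(accuracy, error_rate)
```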
Decision Trees
A tree-structured model with a root node, internal nodes, and leaf nodes, used for classification.
Root Node
No incoming edges; zero or more outgoing edges.
Internal Nodes
One incoming edge, two or more outgoing edges.
Leaf Nodes
No outgoing edges; assigned a class label.
Class Label Prediction
Start from the root and apply the attribute test at each node, following the matching branch until reaching a leaf node; the leaf's class label is the prediction.
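A minimal Python sketch of this prediction procedure on a hand-built tree; the dict-based node layout and the animal attributes are illustrative assumptions, not a standard API.

```python
# Minimal sketch of predicting with a hand-built decision tree.
# Leaves carry a "label"; internal nodes carry an "attribute" test
# and one branch per attribute value.
tree = {
    "attribute": "Give Birth",
    "branches": {
        "yes": {"label": "Mammals"},
        "no": {
            "attribute": "Can Fly",
            "branches": {
                "yes": {"label": "Birds"},
                "no": {"label": "Reptiles"},
            },
        },
    },
}

def predict(node, record):
    # Follow attribute tests from the root until a leaf is reached.
    while "label" not in node:
        node = node["branches"][record[node["attribute"]]]
    return node["label"]

print(predict(tree, {"Give Birth": "no", "Can Fly": "yes"}))  # Birds
```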
Hunt's Algorithm
A recursive procedure for building a decision tree: if all records at a node belong to one class, make it a leaf; otherwise, choose an attribute test, split the records into subsets, and recurse on each subset.
Greedy Strategy
At each step, choose the attribute test that best separates the classes according to a local criterion (e.g., an impurity measure).
Gini Index
Gini(t) = 1 - ∑ [p(i|t)]²
Entropy
Entropy(t) = -∑ p(i|t) log2 p(i|t)
Misclassification Error
Error(t) = 1 - max p(i|t)
Information Gain
Measures the reduction in entropy (or another impurity measure) achieved by a split: Gain = Impurity(parent) - ∑ (nj/n) · Impurity(child j), where child j receives nj of the parent's n records.
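A minimal sketch computing the three impurity measures above and the information gain of a candidate split; the class-count inputs are made up for illustration.

```python
from math import log2

# Impurity measures take a list of class counts at a node.

def proportions(counts):
    n = sum(counts)
    return [c / n for c in counts if c > 0]  # skip zeros to avoid log2(0)

def gini(counts):
    return 1 - sum(p ** 2 for p in proportions(counts))

def entropy(counts):
    return -sum(p * log2(p) for p in proportions(counts))

def misclassification_error(counts):
    return 1 - max(proportions(counts))

def information_gain(parent_counts, children_counts):
    # Parent entropy minus the weighted entropy of the children.
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

parent = [5, 5]                  # 5 records of each class
children = [[4, 1], [1, 4]]      # a candidate binary split
print(gini(parent), entropy(parent), misclassification_error(parent))
print(information_gain(parent, children))  # ~0.278
```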
Rule-Based Classifier
Uses a collection of 'if...then...' rules to perform classification.
Structure of a Rule
Condition (LHS / antecedent): A conjunction of attribute tests; Conclusion (RHS / consequent): A class label.
Example Rule
(Give Birth = no) ∧ (Can Fly = yes) → Birds
Applications of Rule-Based Classifier
Classification of animals based on biological traits and tax fraud prediction based on financial attributes.
Mutually Exclusive Rules
Each record matches at most one rule.
Exhaustive Rules
Every record is matched by at least one rule.
Coverage
The fraction of total records that satisfy the rule's condition.
Accuracy
The fraction of records satisfying the rule's condition that also have the correct class label.
High Coverage and High Accuracy
Both are desirable.
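A minimal sketch computing a rule's coverage and accuracy, using the example rule above on a made-up set of records.

```python
# Rule: (Give Birth = no) AND (Can Fly = yes) -> Birds
records = [
    {"Give Birth": "no", "Can Fly": "yes", "label": "Birds"},
    {"Give Birth": "no", "Can Fly": "yes", "label": "Birds"},
    {"Give Birth": "no", "Can Fly": "yes", "label": "Insects"},
    {"Give Birth": "yes", "Can Fly": "no", "label": "Mammals"},
]

def condition(r):
    return r["Give Birth"] == "no" and r["Can Fly"] == "yes"

matching = [r for r in records if condition(r)]
coverage = len(matching) / len(records)                    # 3/4 = 0.75
accuracy = (sum(r["label"] == "Birds" for r in matching)
            / len(matching))                               # 2/3 ~= 0.67
print(coverage, accuracy)
```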
Converting Trees to Rules
Each path from root to leaf becomes a classification rule.
Ordered Rule Set (Decision List)
Rules are applied in priority order; first matching rule is used for classification.
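A minimal sketch of a decision list: rules are tried in priority order and the first match wins; the rules and the default class are illustrative.

```python
# Each rule is (condition, class label), in priority order.
rules = [
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "yes", "Birds"),
    (lambda r: r["Give Birth"] == "yes", "Mammals"),
]
DEFAULT = "Reptiles"  # fallback when no rule fires (keeps the set exhaustive)

def classify(record):
    for condition, label in rules:
        if condition(record):
            return label  # first matching rule wins
    return DEFAULT

print(classify({"Give Birth": "no", "Can Fly": "yes"}))  # Birds
```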
Unordered Rule Set
Voting schemes (majority rule) may be used if multiple rules match.
Rule-based Ordering
Rank rules by individual quality.
Class-based Ordering
Group and order rules based on the predicted class.
Direct Method
Extract rules directly from data (e.g., RIPPER, CN2, Holte's 1R).
Indirect Method
Extract rules from other models like decision trees (e.g., C4.5rules).
Sequential Covering (Direct Method)
Start with an empty rule set; grow one rule at a time, remove the training records the rule covers, and repeat until a stopping criterion is met.
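A minimal sketch of the sequential-covering loop with a toy rule grower that picks a single attribute test (real learners such as RIPPER grow and prune conjunctions); the data are made up.

```python
def grow_rule(records, target):
    # Toy grower: pick the single attribute=value test whose covered
    # records have the highest accuracy for the target class.
    best, best_acc = None, 0.0
    for r in records:
        for attr, val in r.items():
            if attr == "label":
                continue
            covered = [x for x in records if x.get(attr) == val]
            acc = sum(x["label"] == target for x in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (attr, val), acc
    return best

def sequential_covering(records, target):
    rules, remaining = [], list(records)
    while any(r["label"] == target for r in remaining):
        attr, val = grow_rule(remaining, target)
        rules.append((attr, val, target))
        # Remove the records covered by the new rule, then repeat.
        remaining = [r for r in remaining if r.get(attr) != val]
    return rules

data = [
    {"Can Fly": "yes", "Give Birth": "no", "label": "Birds"},
    {"Can Fly": "no", "Give Birth": "no", "label": "Reptiles"},
    {"Can Fly": "yes", "Give Birth": "no", "label": "Birds"},
]
print(sequential_covering(data, "Birds"))  # [('Can Fly', 'yes', 'Birds')]
```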
Cluster Analysis
Finding groups of objects such that objects within a group are similar to each other and dissimilar to objects in other groups.
Maximize inter-cluster distance
Goal of cluster analysis.
Minimize intra-cluster distance
Goal of cluster analysis.
Partitional Clustering
Divides data into non-overlapping clusters; each point in exactly one cluster.
Hierarchical Clustering
Nested clusters organized as a tree (dendrogram).
Exclusive Clustering
Each point belongs to exactly one cluster.
Overlapping Clustering
Points may belong to multiple clusters (e.g., 'border' points).
Fuzzy Clustering
Points belong to all clusters with varying degrees (weights between 0 and 1, must sum to 1).
Complete Clustering
All data points are clustered.
Partial Clustering
Only a subset of data is clustered.
Well-Separated Clusters
Each point is closer to every point within its cluster than to any point outside.
Prototype-Based Clusters
Points are closer to the cluster's 'center' (centroid or medoid) than to other centers.
Graph-Based Clusters
Based on nearest neighbor chains; points are closer to some other point in the cluster than to any point outside.
Density-Based Clusters
Dense regions separated by sparser regions; useful for irregular shapes and handling noise.
Shared-Property (Conceptual Clusters)
Clusters defined by sharing a common property or concept.
Clustering Algorithms
K-means, Hierarchical, Density-Based.
K-means
Partitional clustering that associates each cluster with a centroid and assigns each point to the closest centroid; the number of clusters, K, must be specified in advance.
Hierarchical Clustering
Produces nested clusters visualized with a dendrogram.
Agglomerative Clustering
A bottom-up merging approach in hierarchical clustering.
Divisive Clustering
A top-down splitting approach in hierarchical clustering.
Density-Based Clustering
Identifies clusters as dense regions separated by low-density areas (e.g., DBSCAN).
Sum of Squared Error (SSE)
SSE = ∑ over clusters i of ∑ over points x in cluster i of dist(x, ci)², the total squared distance from each point to its cluster's center; between two clusterings, the one with lower SSE is preferred.
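A minimal K-means sketch with SSE on made-up 2-D points; random initialization, a fixed iteration count, and plain Euclidean distance are simplifying assumptions (a real run would use a convergence check or a library implementation).

```python
import random

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, iterations=100):
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    # SSE: squared distance from each point to its closest centroid.
    sse = sum(min(dist2(p, c) for c in centroids) for p in points)
    return centroids, sse

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, sse = kmeans(points, k=2)
print(centroids, sse)
```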
Limitations of K-means
Poor performance with clusters of differing sizes, densities, non-globular shapes, sensitive to outliers, and initial centroid placement.
Overcoming Limitations of K-means
Use many small clusters and combine them in a post-processing step; choose initial centroids carefully (e.g., via multiple runs); switch to an alternative clustering method when clusters are non-globular or vary in density.
DBSCAN Core Idea
Density = Number of points within a radius (Eps); core points: ≥ MinPts within Eps; border points: fewer than MinPts but in neighborhood of a core point; noise points: neither core nor border.
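A minimal sketch of DBSCAN's core/border/noise labelling on 1-D points; Eps, MinPts, and the data are illustrative, and the step that grows clusters from core points is omitted.

```python
EPS, MIN_PTS = 1.5, 3
points = [1.0, 1.5, 2.0, 2.4, 3.8, 8.0]

def neighbors(p):
    # Eps-neighborhood; by convention it includes the point itself.
    return [q for q in points if abs(p - q) <= EPS]

# Core: at least MinPts points within Eps.
core = [p for p in points if len(neighbors(p)) >= MIN_PTS]
# Border: not core, but in the neighborhood of a core point.
border = [p for p in points
          if p not in core and any(q in core for q in neighbors(p))]
# Noise: neither core nor border.
noise = [p for p in points if p not in core and p not in border]
print(core, border, noise)  # [1.0, 1.5, 2.0, 2.4] [3.8] [8.0]
```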
Strengths of DBSCAN
Handles noise well and finds clusters of varying shapes and sizes.
Weaknesses of DBSCAN
Struggles with varying densities and is less effective with high-dimensional data.
Cluster Evaluation Purpose
Avoid finding patterns in random noise and compare clustering algorithms or clusterings.
Types of Measures in Cluster Evaluation
External index (compares the clustering to externally supplied class labels), internal index (uses only the data itself, e.g., SSE), and relative index (compares two clusterings or clusters).
Anomaly Detection Definition
An object is considered an anomaly if it is distant from most points in the dataset.
Proximity-based Approaches
Use the distance to the k-nearest neighbor to assess if an object is isolated.
Outlier Score
For the distance-to-k-th-nearest-neighbor score, the lowest possible value is 0 and the highest is the maximum possible distance (can be unbounded/infinite).
Density-based Approaches
Define density as the reciprocal of the average distance to the k nearest neighbors; points with low density are flagged as anomalies.
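A minimal sketch of both ideas on 1-D data: the distance to the k-th nearest neighbor as an outlier score, and density as the reciprocal of the average distance to the k nearest neighbors; the points and k are made up.

```python
points = [1.0, 1.2, 1.4, 2.0, 2.2, 9.0]
K = 2

def knn_distances(i):
    # Sorted distances from point i to all others; keep the k smallest.
    d = sorted(abs(points[i] - points[j])
               for j in range(len(points)) if j != i)
    return d[:K]

for i, p in enumerate(points):
    d = knn_distances(i)
    outlier_score = d[-1]            # distance to the k-th nearest neighbor
    density = 1 / (sum(d) / len(d))  # reciprocal of the average kNN distance
    print(p, round(outlier_score, 2), round(density, 2))
# 9.0 gets by far the largest score (7.0) and the lowest density.
```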
Clustering-based Approaches
First, cluster the data into groups of differing density; points that fall in small clusters are chosen as candidate outliers.
Detection Strategy in Anomaly Detection
Discard small clusters that are far from the larger clusters; this requires thresholds for the minimum cluster size and the minimum distance between a small cluster and the rest.
Cluster-based Outlier
An object is a cluster-based outlier if it does not strongly belong to any cluster.
Example Using K-Means
The outlier score is computed in two ways: the distance from the point to its closest cluster centroid, and the relative distance (that distance divided by the median distance of the cluster's points to the centroid), which adjusts for clusters of differing density.
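A minimal sketch of both scores given a finished K-means run; the centroids and points are made up, and the use of the cluster's median centroid distance for the relative score follows the description above.

```python
from statistics import median

centroids = [(1.0, 1.0), (8.0, 8.0)]
points = [(1.1, 0.9), (0.8, 1.2), (1.5, 1.4),
          (8.2, 7.9), (7.7, 8.3), (5.0, 5.0)]

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def closest(p):
    # Assign each point to its closest centroid, as in K-means.
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))

assignments = [closest(p) for p in points]
# Median distance to the centroid within each cluster.
medians = [median(dist(p, centroids[i])
                  for p, a in zip(points, assignments) if a == i)
           for i in range(len(centroids))]

for p, a in zip(points, assignments):
    d = dist(p, centroids[a])
    # Absolute score, then relative score (distance / cluster median).
    print(p, round(d, 2), round(d / medians[a], 2))
# (5.0, 5.0) stands out on both scores.
```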