Data Mining Detailed Study Notes
Unit 4: Classification Techniques and Cluster Analysis
Contents at a Glance
- Part A: Classification
- Decision Trees
- Bayes
- Naïve Bayes
- Bayesian Belief Network (BBN)
- Lazy Learners
- Model Evaluation
- Improving Accuracy
- Part B: Cluster Analysis
- Types of Data
- k-Means
- k-Medoids
- DBSCAN
- STING
- Outlier Analysis
- Part C: Trends & Frontiers
- Web Mining
- Temporal & Spatial Mining
- Statistical Data Mining
- Applications
Part A: Classification Techniques
1. General Approach to Classification
Classification is a supervised learning task, where the objective is to learn a model (classifier) to predict the class of unseen instances based on labeled training examples. The overall approach includes the following components:
- Input: A training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ is a feature vector and $y_i \in \{c_1, \ldots, c_k\}$ is the corresponding class label.
- Output: A function $f : \mathbb{R}^d \rightarrow \{c_1, \ldots, c_k\}$ that minimizes classification error on unseen data.
1.1 Definition and Terminology
The classification task can be formalized into two main phases:
- Phase 1 — Learning (Training): The model is constructed from the training data where the algorithm analyzes the data to induce the model.
- Phase 2 — Classification (Testing): The model is applied to classify new, unseen test instances. The accuracy is evaluated based on a test set where the class labels are known but concealed from the model.
1.2 Model Accuracy
Accuracy is the fraction of test instances classified correctly:
$$\text{Accuracy} = \frac{\text{number of correctly classified test instances}}{\text{total number of test instances}}$$
1.3 Types of Classifiers
There are several types of classifiers, including:
- Decision Trees: Algorithms such as ID3, C4.5, and CART.
- Probabilistic Classifiers: Using models like Naïve Bayes and Bayesian Belief Networks.
- Lazy Learners: Including k-Nearest Neighbour and Case-based Reasoning.
- Neural Networks: Such as Multi-layer Perceptron and Deep Learning.
- Support Vector Machines (SVM): Including Linear SVM and Kernel SVM.
- Ensemble Methods: Techniques like Bagging, Boosting, and Random Forests.
2. Classification by Decision Tree Induction
A decision tree is a tree structure that resembles a flowchart, where:
- Each internal node represents a test on an attribute.
- Each branch indicates the outcome of the test.
- Each leaf node corresponds to a class label.
- The topmost node is called the root node.
2.1 Algorithm: Basic Decision Tree Induction
The steps for inducing a decision tree are outlined in the algorithm below:
Algorithm 1: Generate Decision Tree(Training set D, Attribute list A)
- Create a node N.
- If all tuples in D belong to the same class C, then return N as a leaf node labeled C.
- If A is empty, return N as a leaf node labeled with the majority class in D.
- Select the best splitting attribute $a^*$ from A based on the splitting criterion.
- Label N with $a^*$.
- For each known value $v_j$ of $a^*$:
- Extract the tuples in D where $a^* = v_j$ to form subset $D_j$.
- If $D_j$ is empty, attach a leaf labeled with the majority class of D.
- Otherwise, attach the subtree generated by calling Generate Decision Tree($D_j$, $A \setminus \{a^*\}$).
- Return N.
2.2 Attribute Selection Measures
2.2.1 Information Gain (ID3)
Entropy and Information Gain are defined as follows:
Entropy of a dataset D with m classes:
$$\text{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where $p_i$ is the proportion of class $c_i$ in D. Information Gain for an attribute A that splits D into subsets $D_1, \ldots, D_v$:
$$\text{Gain}(A) = \text{Info}(D) - \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\text{Info}(D_j)$$
The attribute with the highest information gain is chosen for splitting.
Example of Entropy Calculation
Given a dataset with 9 “Yes” and 5 “No” samples:
$$\text{Info}(D) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \text{ bits}$$
The attribute “Outlook” splits the dataset as follows: Sunny (5), Overcast (4), Rain (5); its gain is Info(D) minus the size-weighted sum of the subset entropies.
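These figures can be checked with a few lines of Python. Note that the per-branch class counts for Outlook (Sunny 2 Yes / 3 No, Overcast 4 / 0, Rain 3 / 2) are assumed from the classic play-tennis data, since the notes state only the subset sizes:

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

info_d = entropy([9, 5])                     # overall entropy, about 0.940 bits

# Gain(Outlook), assuming the classic per-branch class counts
# (Sunny: 2 Yes / 3 No, Overcast: 4 / 0, Rain: 3 / 2) -- not stated in the notes.
splits = [[2, 3], [4, 0], [3, 2]]
info_outlook = sum(sum(s) / 14 * entropy(s) for s in splits)
gain_outlook = info_d - info_outlook         # about 0.247
```

Under these assumed counts, Outlook's gain of roughly 0.247 is the largest in the play-tennis data, which is why it is the classic root-node choice.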
2.2.2 Gain Ratio (C4.5)
To alleviate the bias toward attributes with many values, C4.5 adjusts the information gain calculation as:
- Split Information for attribute A:
$$\text{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$$
- Gain Ratio:
$$\text{GainRatio}(A) = \frac{\text{Gain}(A)}{\text{SplitInfo}_A(D)}$$
2.2.3 Gini Index (CART)
The Gini index for a dataset D is defined as:
$$\text{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$$
For a binary split on attribute A into $D_1$ and $D_2$:
$$\text{Gini}_A(D) = \frac{|D_1|}{|D|}\text{Gini}(D_1) + \frac{|D_2|}{|D|}\text{Gini}(D_2)$$
The reduction in impurity is $\Delta\text{Gini}(A) = \text{Gini}(D) - \text{Gini}_A(D)$; the split that maximizes this value (equivalently, minimizes $\text{Gini}_A(D)$) is chosen.
2.3 Tree Pruning
Pruning techniques are used to avoid overfitting, which occurs when the decision tree is overly complex for the training data. Two main types include:
- Pre-pruning (Early Stopping): Cease tree growth based on a statistical significance test showing no reliable improvement (e.g., using chi-squared test, minimum gain threshold).
- Post-pruning: Grow the full tree first, then prune upward from the leaves.
Methods of pruning include:
- Reduced-error Pruning: Replace a subtree with a majority-class leaf if the accuracy does not decrease on validation.
- Cost-complexity Pruning (CART): Minimize the cost $R_\alpha(T) = R(T) + \alpha|T|$, where R(T) is the misclassification cost of tree T, |T| is its number of leaves, and $\alpha \geq 0$ is a complexity parameter.
- Pessimistic Pruning (C4.5): Incorporates a continuity correction on training error.
2.4 Scalability and Enhancements
Several enhancements are included to improve the scalability of decision trees:
- RainForest: Maintains AVC-sets (Attribute-Value-Class) for more scalable tree construction.
- BOAT (Bootstrapped Optimistic Algorithm): Utilizes bootstrapping for distributed tree building.
- Handling Missing Values: Common strategies include replacing a missing value with the attribute's most probable value, or distributing the tuple's weight across branches.
- Continuous Attributes: Find optimal binary split points via entropy or Gini measures.
3. Bayes Classification Methods
3.1 Bayes’ Theorem
Bayes' Theorem relates conditional and marginal probabilities with the formula:
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
Where:
- $P(H)$ is the prior probability of hypothesis H.
- $P(X \mid H)$ is the likelihood of observing X given H.
- $P(H \mid X)$ is the posterior probability of H given evidence X.
- $P(X)$ is the marginal probability of X (a normalizing constant).
The classification rule assigns X to the class $c_i$ that maximizes the posterior:
$$c^* = \arg\max_{c_i} P(c_i \mid X) = \arg\max_{c_i} P(X \mid c_i)\,P(c_i)$$
3.2 Naïve Bayes Classifier
3.2.1 Naïve Bayes Assumption
Naïve Bayes classifiers assume that attributes are conditionally independent given the class:
$$P(X \mid c_i) = \prod_{j=1}^{d} P(x_j \mid c_i)$$
This greatly simplifies calculations and facilitates processing high-dimensional data effectively.
3.2.2 Classification Formula
The classification can be computed as:
$$\hat{y} = \arg\max_{c_i} P(c_i) \prod_{j=1}^{d} P(x_j \mid c_i)$$
For Categorical Attributes:
$$P(x_j \mid c_i) = \frac{|D_{c_i, x_j}|}{|D_{c_i}|}$$
where $|D_{c_i, x_j}|$ counts the training tuples of class $c_i$ with value $x_j$ for attribute j.
For Continuous Attributes (assuming a Gaussian distribution):
$$P(x_j \mid c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{c_i}} \exp\!\left(-\frac{(x_j - \mu_{c_i})^2}{2\sigma_{c_i}^2}\right)$$
To resolve the zero-frequency problem, use Laplace smoothing:
$$P(x_j \mid c_i) = \frac{|D_{c_i, x_j}| + \alpha}{|D_{c_i}| + \alpha\,m_j}$$
where $m_j$ is the number of distinct values of attribute j; setting $\alpha = 1$ provides basic Laplace smoothing.
Example of Naïve Bayes Classification
Given the 14-sample weather dataset (9 “Yes”, 5 “No”), classify X = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong):
- Calculation: $P(\text{Yes}) \prod_j P(x_j \mid \text{Yes}) = \tfrac{9}{14} \cdot \tfrac{2}{9} \cdot \tfrac{3}{9} \cdot \tfrac{3}{9} \cdot \tfrac{3}{9} \approx 0.0053$, while $P(\text{No}) \prod_j P(x_j \mid \text{No}) = \tfrac{5}{14} \cdot \tfrac{3}{5} \cdot \tfrac{1}{5} \cdot \tfrac{4}{5} \cdot \tfrac{3}{5} \approx 0.0206$.
- Decision: Classify as No (since 0.0206 > 0.0053).
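A minimal categorical Naïve Bayes sketch that reproduces the example's scores. The full 14-row play-tennis table is an assumption (the notes give only the 9/5 class totals), and `nb_fit` is a hypothetical helper name:

```python
from collections import defaultdict

def nb_fit(rows, alpha=0):
    """rows: list of (attribute_tuple, label) pairs.
    Returns a score function for the unnormalised posterior and the class list.
    alpha is the Laplace smoothing parameter (alpha=1 gives basic Laplace smoothing)."""
    class_counts = defaultdict(int)
    cond_counts = defaultdict(int)          # (class, attr_index, value) -> count
    values = defaultdict(set)               # attr_index -> observed values
    for x, y in rows:
        class_counts[y] += 1
        for j, v in enumerate(x):
            cond_counts[(y, j, v)] += 1
            values[j].add(v)
    n = len(rows)

    def score(x, c):
        """Unnormalised posterior: P(c) * prod_j P(x_j | c), with smoothing."""
        s = class_counts[c] / n
        for j, v in enumerate(x):
            s *= (cond_counts[(c, j, v)] + alpha) / (class_counts[c] + alpha * len(values[j]))
        return s

    return score, list(class_counts)

# Classic play-tennis data (an assumption; the notes give only the summary counts).
data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),       (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),   (("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"),    (("Rain", "Cool", "Normal", "Strong"), "No"),
    (("Overcast", "Cool", "Normal", "Strong"), "Yes"), (("Sunny", "Mild", "High", "Weak"), "No"),
    (("Sunny", "Cool", "Normal", "Weak"), "Yes"),   (("Rain", "Mild", "Normal", "Weak"), "Yes"),
    (("Sunny", "Mild", "Normal", "Strong"), "Yes"), (("Overcast", "Mild", "High", "Strong"), "Yes"),
    (("Overcast", "Hot", "Normal", "Weak"), "Yes"), (("Rain", "Mild", "High", "Strong"), "No"),
]
score, classes = nb_fit(data)
x = ("Sunny", "Cool", "High", "Strong")
prediction = max(classes, key=lambda c: score(x, c))
```

Passing `alpha=1` to `nb_fit` would apply Laplace smoothing, which matters once a (class, value) pair unseen in training appears in a query.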
3.3 Advantages and Limitations of Naïve Bayes
Advantages:
- Simple and fast to train.
- Performs well with small datasets.
- Naturally handles missing values.
- No complex optimization required.
- Effective for text classification tasks.
Limitations:
- Assumes conditional independence among attributes, which can be violated in practice.
- Cannot capture interactions between attributes naturally.
- Sensitive to attribute correlation.
- Zero frequency issues necessitate smoothing strategies.
- Probability estimates may be unreliable.
4. Model Evaluation and Selection
4.1 Evaluation Metrics
Confusion Matrix: In a binary classifier with classes Positive (P) and Negative (N):

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

Metrics calculated from the confusion matrix:
- Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
- Precision (Positive Predictive Value): $\frac{TP}{TP + FP}$
- Recall (Sensitivity): $\frac{TP}{TP + FN}$
- Specificity: $\frac{TN}{TN + FP}$
- F1 Score: $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- F-beta Score: $\frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$
- Error Rate: $1 - \text{Accuracy} = \frac{FP + FN}{TP + TN + FP + FN}$
4.2 Holdout and Cross-Validation Methods
- Holdout Method: Partition data into training (usually 2/3) and test (1/3) sets.
- k-Fold Cross-Validation: Split data into k disjoint subsets. In each iteration, use one subset for testing and the rest for training.
- Leave-One-Out (LOO): A special case of k-fold where k = n (the number of instances), best suited for smaller datasets.
- Stratified Cross-validation: Ensures each fold maintains the class distribution.
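A small sketch of how k-fold index splits can be generated by hand (in practice scikit-learn's `KFold` does the same job); `k_fold_indices` is a hypothetical helper name:

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    The first n % k folds receive one extra element, so every instance
    is used for testing exactly once across the k iterations."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
```

For stratified cross-validation, the same slicing would be applied per class and the per-class folds merged, so each fold preserves the class distribution.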
4.3 Bootstrap Method
With bootstrap sampling, the probability of an instance never being selected in n draws is:
$$\left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$$
Thus, approximately 63.2% of the original instances appear in the bootstrap sample, with the remaining 36.8% forming the test set.
Final accuracy via the .632 bootstrap, averaged over k bootstrap samples:
$$\text{Acc}(M) = \frac{1}{k} \sum_{i=1}^{k} \bigl(0.632 \cdot \text{Acc}(M_i)_{\text{test}_i} + 0.368 \cdot \text{Acc}(M_i)_{\text{train}}\bigr)$$
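The 0.632/0.368 figures follow directly from the limit $(1 - 1/n)^n \to e^{-1}$, which is easy to confirm numerically:

```python
import math

def prob_never_selected(n):
    """Probability that a fixed instance is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

p = prob_never_selected(1000)   # already very close to e^-1, about 0.368
```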
4.4 ROC Curves and AUC
- ROC Curve: Plots the True Positive Rate $TPR = \frac{TP}{TP + FN}$ against the False Positive Rate $FPR = \frac{FP}{FP + TN}$ as the decision threshold varies.
- Area Under Curve (AUC): A single scalar summarizing the ROC curve; An AUC of 1 indicates perfect performance, while 0.5 represents random performance.
4.5 Comparing Classifiers: Statistical Tests
- A paired t-test can be used for comparing two classifiers on k-fold cross-validation. With $d_i$ the difference in error on fold i, mean $\bar{d}$, and standard deviation $s_d$:
$$t = \frac{\bar{d}}{s_d / \sqrt{k}}$$
Compare t with the critical value from the t-distribution with k − 1 degrees of freedom at the chosen significance level.
5. Techniques to Improve Classification Accuracy
5.1 Ensemble Methods
Ensemble methods use multiple classifiers to build a superior classifier. The intuition is that the errors of individual classifiers can cancel out if they are independent.
5.1.1 Bagging (Bootstrap Aggregating)
Proposed by Breiman in 1996, this technique involves:
- Generate k bootstrap training sets via sampling with replacement.
- Train a base classifier for each sample.
- For classification tasks, utilize majority voting; for regression, compute the average.
This approach reduces variance without increasing bias and suits unstable base classifiers like decision trees.
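The bagging procedure can be sketched with 1-D decision stumps as the unstable base learner; all names and the toy data are illustrative, not from the notes:

```python
import random
from collections import Counter

def stump_fit(sample):
    """Fit a 1-D decision stump: threshold and side minimising errors on the sample."""
    xs = sorted({x for x, _ in sample})
    # candidate thresholds: one below all points, plus midpoints between neighbours
    candidates = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best_t, best_hi, best_err = candidates[0], 1, float("inf")
    for t in candidates:
        for hi in (0, 1):                         # label predicted when x > t
            err = sum((hi if x > t else 1 - hi) != y for x, y in sample)
            if err < best_err:
                best_t, best_hi, best_err = t, hi, err
    return lambda x, t=best_t, hi=best_hi: hi if x > t else 1 - hi

def bagging_fit(data, n_models=15, seed=0):
    """Train n_models stumps on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    models = [stump_fit([rng.choice(data) for _ in data]) for _ in range(n_models)]
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]

data = [(x, 0) for x in (1, 2, 3, 4, 5)] + [(x, 1) for x in (10, 11, 12, 13, 14)]
predict = bagging_fit(data)
```

Each stump sees a different bootstrap sample, so their individual thresholds vary; the vote averages away that variance, which is the point of bagging.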
5.1.2 Boosting (AdaBoost)
- Assign equal weights $w_i = 1/n$ to all instances.
- For iterations $t = 1, \ldots, T$:
- Train classifier $M_t$ on the weight-adjusted data.
- Compute the weighted error: $\varepsilon_t = \sum_{i=1}^{n} w_i \cdot \mathbb{1}[M_t(x_i) \neq y_i]$
- Calculate the classifier weight: $\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$
- Update weights: $w_i \leftarrow w_i \cdot e^{-\alpha_t y_i M_t(x_i)}$ (for labels $y_i \in \{-1, +1\}$), then normalize so the weights sum to 1.
- Final Prediction: $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t M_t(x)\right)$
This method reduces bias and variance but can be sensitive to noisy data and outliers.
5.1.3 Random Forests
Random Forests combine bagging with random feature selection:
- Build T decision trees on bootstrap samples.
- At each split, randomly select m features (typically $m \approx \sqrt{p}$, where p is the total number of features).
- Choose the best split among the m features.
- Aggregate predictions using majority voting.
5.2 Class Imbalance Problem
This issue arises when class distribution is significantly skewed (e.g., 99% vs 1%). Strategies to cope include:
- Oversampling: Duplicate instances of the minority class or apply SMOTE to create new instances.
- Undersampling: Reduce instances of the majority class.
- Cost-sensitive learning: Assign higher misclassification costs to the minority class.
5.3 SMOTE
The Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic instances by interpolating among minority class neighbors.
6. Advanced Classification: Bayesian Belief Networks
6.1 Bayesian Belief Network (BBN)
A BBN is a probabilistic graphical model representing a set of variables and their conditional independence relationships via a Directed Acyclic Graph (DAG). Each node $X_i$ holds a Conditional Probability Table (CPT) detailing $P(X_i \mid \text{Parents}(X_i))$.
6.2 Structure and Joint Probability
Given a DAG with n nodes, the joint probability distribution factorizes as:
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \text{Parents}(X_i))$$
Example: For medical diagnosis, with variables: Smoking, Pollution, Lung Cancer, X-ray, and Dyspnoea.
- CPT for Lung Cancer: given Smoking (S) and Pollution (P):
| S | P | P(LC=T) | P(LC=F) |
|---|---|---|---|
| T | High | 0.05 | 0.95 |
| T | Low | 0.02 | 0.98 |
| F | High | 0.03 | 0.97 |
| F | Low | 0.001 | 0.999 |
6.3 Learning in BBNs
- Structure Learning: Finding the DAG structure
a. Score-based Learning: Optimize a scoring function (e.g., BIC, MDL, BDe).
b. Constraint-based Learning: Utilize conditional independence tests to construct the graph.
- Parameter Learning: Estimate the CPTs using MLE or Bayesian methods.
6.4 BIC Score (Bayesian Information Criterion)
The BIC score trades off fit against model complexity:
$$\text{BIC}(G \mid D) = \log P(D \mid \hat{\theta}, G) - \frac{d}{2} \log n$$
where $\hat{\theta}$ are the maximum-likelihood parameters, d is the number of free parameters, and n is the number of samples.
6.5 Inference in BBNs
- Exact Inference: Techniques like variable elimination and junction tree algorithm.
- Approximate Inference: Methods like Monte Carlo sampling (MCMC), and loopy belief propagation.
Given evidence E = e, compute the posterior over the query variables Q:
$$P(Q \mid E = e) = \frac{P(Q, E = e)}{P(E = e)}$$
7. Lazy Learners
Lazy Learning
Lazy learners (or instance-based learners) delay model generalization until a query arises. They retain all training data and categorize queries based on comparisons with stored instances, contrasting with eager learners like decision trees or Naïve Bayes that develop a global model during training.
7.1 k-Nearest Neighbour (k-NN) Classifier
Algorithm 2: k-Nearest Neighbour Classification
- Store all n training instances.
- Given a query instance $x_q$:
- Identify the k nearest neighbors of $x_q$ using a distance metric.
- Assign the class based on majority vote among the k neighbors.
Distance Metrics:
- Euclidean: $d(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}$
- Manhattan: $d(x, y) = \sum_{j=1}^{d} |x_j - y_j|$
- Minkowski: $d(x, y) = \left(\sum_{j=1}^{d} |x_j - y_j|^q\right)^{1/q}$
- Weighted k-NN: Weight each neighbor's vote by inverse distance, e.g. $w_i = \frac{1}{d(x_q, x_i)^2}$.
Example of k-NN
Given training data with 3 class-A points and 3 class-B points, classify a query point $x_q$:
- With k = 3: if 2 of the 3 nearest neighbors are class-A and 1 is class-B, classify as class-A.
- With k = 1: classify as the class of the single nearest neighbor.
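A compact k-NN sketch (Euclidean distance, majority vote); `knn_predict` and the six training points are illustrative, not from the notes:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (feature_tuple, label) pairs."""
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
```

Note that there is no training step at all, which is exactly the "lazy" behaviour described above: the cost is paid at query time, in the sort over all stored instances.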
7.2 Case-Based Reasoning (CBR)
CBR classifies by retrieving the most similar past cases and adapting their solutions. The CBR cycle involves:
- Retrieve: Find similar past cases.
- Reuse: Adapt retrieved solutions to the current problem.
- Revise: Test and modify solutions as needed.
- Retain: Store new cases for future reference.
7.3 Advantages and Disadvantages of Lazy Learning
Advantages:
- No upfront training time required.
- Naturally accommodates multi-class scenarios.
- Adapts to new training data effortlessly.
- Can model complex decision boundaries.
Disadvantages:
- Classification is time-consuming (search all instances).
- High memory consumption.
- Sensitive to irrelevant features.
- Requires feature normalization.
- Vulnerable to the curse of dimensionality.
Part B: Cluster Analysis
8. Introduction to Cluster Analysis
Clustering is an unsupervised learning task aimed at discovering groups (clusters) in unlabelled data. The goal is to group instances such that:
- Instances within a cluster are similar to each other.
- Instances in one cluster are dissimilar to those in other clusters.
Given a dataset $D = \{x_1, \ldots, x_n\}$, the clustering process divides D into groups $C_1, \ldots, C_k$ such that:
- $C_i \neq \emptyset$ for all i
- $\bigcup_{i=1}^{k} C_i = D$ (the clusters cover the dataset)
- $C_i \cap C_j = \emptyset$ for all $i \neq j$ (no overlap in hard clustering).
8.1 Applications of Clustering
Clustering can be applied in various fields:
- Customer segmentation in marketing.
- Document clustering and topic modeling.
- Image segmentation in computer vision applications.
- Anomaly detection in network security.
- Gene expression analysis in bioinformatics.
- City planning through grouping locations based on value.
9. Types of Data in Cluster Analysis
9.1 Data Types
| Data Type | Description | Example |
|---|---|---|
| Interval-scaled | Continuous real values | Height, temperature |
| Binary | Two states (symmetric/asymmetric) | Gender (M/F), disease presence (Y/N) |
| Nominal | Unordered discrete categories | Colour, city, job |
| Ordinal | Ordered discrete values | Rank, quality (low/medium/high) |
| Ratio-scaled | Positive values with meaningful ratios | Age, salary, weight |
| Mixed types | Combination of different types | Customer records with various features |
9.2 Dissimilarity and Similarity Measures
9.2.1 For Interval-Scaled Attributes
Standardization to achieve uniform scale:
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
where $m_f$ is the mean and $s_f$ is the standard deviation of attribute f.
- Minkowski Distance:
$$d(i, j) = \Bigl(\sum_{f=1}^{p} |x_{if} - x_{jf}|^q\Bigr)^{1/q}$$
with specific values of q defining various distance metrics (q = 1 for Manhattan, q = 2 for Euclidean).
9.2.2 For Binary Attributes
A contingency table defines the occurrences as follows:
| No. of occurrences | Count |
|---|---|
| Both 1 | a |
| 1 and 0 | b |
| 0 and 1 | c |
| Both 0 | d |
The distance measures become:
- Simple Matching: $d(i, j) = \frac{b + c}{a + b + c + d}$
- Jaccard Distance: $d(i, j) = \frac{b + c}{a + b + c}$ (ignores negative matches d; used for asymmetric binary attributes).
9.2.3 For Nominal Attributes
$$d(i, j) = \frac{p - m}{p}$$
where m is the number of matching attributes and p is the total count of attributes.
9.2.4 For Ordinal Attributes
Convert the rank $r_{if} \in \{1, \ldots, M_f\}$ to the 0–1 scale via $z_{if} = \frac{r_{if} - 1}{M_f - 1}$ and apply interval-scaled methods.
9.2.5 Mixed-Type Attributes
The generalized distance measure can be represented as:
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
where $\delta_{ij}^{(f)} = 0$ if attribute f is missing for either object, else 1, and $d_{ij}^{(f)}$ is the per-attribute dissimilarity.
10. Partitioning Methods
In partitioning methods, the goal is to split n objects into k non-overlapping subsets such that each subset includes at least one object and minimizes a criterion, generally the sum of squared distances to cluster centers.
10.1 k-Means Algorithm
Algorithm 3: k-Means Clustering
- Select k objects randomly as initial cluster centroids $\mu_1, \ldots, \mu_k$.
- Repeat the following steps:
- For each object $x_i$: Assign $x_i$ to cluster $C_j$ where $j = \arg\min_l \lVert x_i - \mu_l \rVert^2$.
- For each cluster $C_j$: Update the centroid: $\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$.
- Until there is no change in centroids (or maximum iterations reached).
- Objective to Minimize (within-cluster sum of squares): $E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$
Example of k-Means
Data contains: {2, 4, 10, 12, 3, 20, 30, 11, 25}, with k = 3.
- Initial centroids (chosen for illustration): $m_1 = 2$, $m_2 = 4$, $m_3 = 10$.
- Iteration 1 Assignment: $C_1 = \{2, 3\}$, $C_2 = \{4\}$, $C_3 = \{10, 11, 12, 20, 25, 30\}$.
- Update Centroids: $m_1 = 2.5$, $m_2 = 4$, $m_3 = 18$.
- Iterate until convergence: the process stabilizes at $C_1 = \{2, 3, 4\}$, $C_2 = \{10, 11, 12\}$, $C_3 = \{20, 25, 30\}$ with centroids 3, 11, and 25.
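The worked run can be reproduced with a short 1-D k-means sketch; the initial centroids 2, 4, 10 are an assumption chosen for illustration, and ties go to the lower-indexed centroid:

```python
def k_means_1d(data, centroids, max_iter=100):
    """1-D k-means with given initial centroids; ties go to the lower index."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for x in data:
            # assign x to its nearest centroid
            j = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[j].append(x)
        # recompute centroids as cluster means (keep old centroid if cluster empty)
        new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:            # converged
            break
        centroids = new
    return centroids, clusters

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
centroids, clusters = k_means_1d(data, [2, 4, 10])
```

Running it from these seeds converges to the centroids 3, 11, 25; a different random initialization can land in a different local optimum, which is the sensitivity noted below.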
Limitations of k-Means
- Predefined k: Must specify the number of clusters in advance.
- Sensitivity to Initial Centroid Selection: May lead to local optima.
- Sensitivity to Outliers and Noise: Outliers can skew results.
- Assumes Spherical, Equal-sized Clusters: Performance degrades if these conditions are not met.
- Inefficient for Large Datasets: Complexity scales with O(nkt) where t is the number of iterations.
Variants of k-Means
- k-Means++: Uses smart initialization to reduce sensitivity to initial seeds.
- Mini-Batch k-Means: Applies random batches to enhance scalability.
- Fuzzy c-Means: Offers probabilistic or soft cluster assignments.
11 k-Medoids Clustering
Instead of using the mean (centroid), k-medoids (PAM — Partitioning Around Medoids) chooses an actual data object (the medoid) which minimizes dissimilarity within the cluster. This method is more robust against outliers compared to k-Means.
11.1 PAM Algorithm
Algorithm 4: PAM (Partitioning Around Medoids)
- Randomly select k objects as initial medoids $o_1, \ldots, o_k$.
- Repeat the following:
- Assign each non-medoid object to the nearest medoid's cluster.
- For each medoid $o_j$ and each non-medoid object $o_{\text{random}}$:
- Compute the current total cost: $S_{\text{current}} = \sum_i d(x_i, \text{nearest medoid})$.
- Compute the total cost $S_{\text{swap}}$ that would result from swapping $o_j$ with $o_{\text{random}}$.
- If $S_{\text{swap}} < S_{\text{current}}$, perform the swap (this decreases total cost).
- Continue until no changes occur.
11.2 CLARA (Clustering LARge Applications)
CLARA applies PAM to multiple random samples of a large dataset:
- Draw multiple random samples.
- Apply PAM to each to find medoids.
- Assign objects to the nearest medoid across all samples.
- Select the medoid set minimizing overall cost.
The resulting complexity per iteration is:
$$O(k s^2 + k(n - k))$$
where s is the sample size.
11.3 CLARANS (Clustering Large Applications based on RANdomized Search)
CLARANS combines sampling with graph-search:
- Models the problem as a graph where nodes represent sets of k medoids.
- Randomly explores neighbor swaps until no better neighbor exists within a maximum defined attempts.
- More efficient than CLARA when working with large datasets.
12 Density-Based Methods
Density-based clustering identifies clusters as dense regions surrounded by lower-density areas, effectively handling noise/outliers and discovering clusters of arbitrary shapes.
12.1 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Key Concepts: Given parameters $\varepsilon$ (neighborhood radius) and MinPts (minimum density):
- Neighborhood: $N_\varepsilon(p) = \{q \in D \mid d(p, q) \leq \varepsilon\}$
- Core Point: p is a core point if $|N_\varepsilon(p)| \geq \text{MinPts}$.
- Border Point: Not core but located in a core point's $\varepsilon$-neighborhood.
- Noise Point (Outlier): Neither core nor border point.
- Directly Density-Reachable: A point p is directly density-reachable from q if $p \in N_\varepsilon(q)$ and q is a core point.
- Density-Connected: Points p and q are density-connected if both are density-reachable from some common point o.
Algorithm 5: DBSCAN
- Mark all points as unvisited.
- For every unvisited point p:
- Mark p as visited.
- Compute $N = N_\varepsilon(p)$.
- If $|N| < \text{MinPts}$, mark p as noise.
- Else:
- Create a new cluster C; add p to C.
- Set seeds as N.
- For each point q in seeds:
- If q is unvisited, mark q as visited and compute $N' = N_\varepsilon(q)$.
- If $|N'| \geq \text{MinPts}$, add the points of $N'$ to seeds.
- Add q to cluster C if it is not already assigned to a cluster.
Complexity: O(n log n) with spatial indexing (e.g., R*-tree), O(n^2) without spatial indexing.
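The pseudocode above can be condensed into a short sketch (quadratic neighbor search, so the O(n²) case noted here; label −1 marks noise, and the toy points are illustrative):

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)
    region = lambda i: [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]
    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE                      # may later be claimed as a border point
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):                      # expand the cluster from its core points
            j = seeds[k]; k += 1
            if labels[j] == NOISE:
                labels[j] = cluster                # border point: claim it, do not expand
            elif labels[j] == UNVISITED:
                labels[j] = cluster
                if len(region(j)) >= min_pts:      # j is core: keep expanding
                    seeds.extend(region(j))
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),            # dense square near the origin
       (8, 8), (8, 9), (9, 8), (9, 9),            # second dense square
       (4, 20)]                                    # isolated point -> noise
labels = dbscan(pts, eps=1.5, min_pts=3)
```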
DBSCAN Illustration
Core points have $\varepsilon$-neighborhoods containing at least MinPts points; border points lie within a core point's $\varepsilon$-neighborhood without being core themselves; noise points are neither and remain unassigned.
Advantages of DBSCAN:
- Discovers clusters of arbitrary shape and identifies noise and outliers automatically.
- There's no need to define the number of clusters k.
Limitations:
- DBSCAN is sensitive to the choice of $\varepsilon$ and MinPts.
- It struggles with varying density clusters.
- High-dimensional data can degrade performance due to the curse of dimensionality.
12.2 OPTICS (Ordering Points to Identify Clustering Structure)
OPTICS expands upon DBSCAN to handle varying densities and offers:
- Core Distance: The minimum $\varepsilon'$ that makes point p a core point (undefined if p is not core at the given $\varepsilon$).
- Reachability Distance of p from o: $\max(\text{core-distance}(o),\, d(o, p))$.
- Output: an ordering of the points with their reachability distances, useful for identifying clusters at different densities.
13 Grid-Based Methods
These methods quantize the data space into a finite number of cells enabling clustering operations dependent on the number of cells rather than the number of data objects.
13.1 STING (STatistical Information Grid)
STING is a multiresolution grid-based clustering method:
- The structure divides space into rectangular cells at various resolution levels.
- Each upper-level cell partitions smaller cells at subsequent levels. Each cell stores statistical parameters, e.g., count, mean, and standard deviation.
Query processing steps include:
- Start at a high-level (coarse) layer.
- Compute confidence intervals for relevance using stored statistics.
- Discard irrelevant cells, recursively process relevant cells.
- Return relevant cells as cluster candidates.
Complexity: O(K) where K represents the number of grid cells, independent of n.
Advantages:
- Query-independent preprocessing is beneficial for execution speed (O(n) to build, O(K) to query).
- Supports incremental updates and multiresolution analyses.
Limitations:
- Cluster boundaries can only be vertical or horizontal (no diagonal boundaries).
- Accuracy is influenced by grid granularity and assumes data fits a standard distribution within cells.
13.2 WaveCluster
This approach employs wavelet transforms within grid-based clustering:
- Summarize data in a multi-dimensional grid framework.
- Execute wavelet transformations to locate high-density regions.
- Identify connected components within the transformed grid.
This method is effective at detecting arbitrary shapes and managing noise effectively.
Complexity: O(n).
13.3 CLIQUE (CLustering In QUEst)
CLIQUE integrates grid-based and subspace clustering:
- Automatically identifies high-density clusters within subspaces.
- Utilizes Apriori-like monotonicity assumptions to evaluate cluster density.
14 Outlier Analysis
An outlier is an observation that significantly deviates from other observations,
indicating possible measurement errors or insightful anomalies. Outliers can signify:
- Data entry mistakes.
- Important rare events (e.g., fraud, network intrusions).
- Patterns worth further investigation.
14.1 Statistical Approaches
Outlier detection methods include:
- Distribution-based: Assume the data follows a specific distribution; objects with low probability under it are flagged.
- Z-score Method: Calculate $z_i = \frac{x_i - \mu}{\sigma}$ and treat observations where $|z_i| > 3$ as potential outliers.
- IQR Method (Box Plot): Define fences based on quartiles to flag outliers:
- Lower fence: $Q_1 - 1.5 \cdot \text{IQR}$
- Upper fence: $Q_3 + 1.5 \cdot \text{IQR}$, where $\text{IQR} = Q_3 - Q_1$.
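A quick sketch of both rules on a small made-up sample; note how the extreme value inflates the standard deviation so the z-score rule can miss it (masking), while the IQR fences catch it:

```python
import statistics

data = [10, 12, 11, 13, 12, 11, 14, 13, 12, 50]

# IQR (box-plot) rule
q1, _, q3 = statistics.quantiles(data, n=4)       # default 'exclusive' quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower or x > upper]

# z-score rule: the outlier itself inflates mu and sigma
mu, sigma = statistics.mean(data), statistics.stdev(data)
z_scores = [(x - mu) / sigma for x in data]
z_outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]
```

Here the value 50 sits far outside the IQR fences, yet its z-score stays just below 3 because it drags the standard deviation up with it, which is one reason robust (quartile-based) rules are often preferred.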
14.2 Distance-Based Outlier Detection
DB(p, D)-Outlier: An object o is a DB(p, D)-outlier if at least a fraction p of all objects lie more than distance D away from o, i.e.
$$\frac{|\{o' \in \mathcal{D} \mid d(o, o') \leq D\}|}{|\mathcal{D}|} < 1 - p$$
Efficient detection methods optimize using nested loops or indices (e.g., R-tree) for pruning clearly non-outlier objects. The complexity can be O(n²) or O(n log n) with indexing.
14.3 Density-Based Local Outlier Factor (LOF)
The LOF score compares the local density of an object with that of its neighbors:
- Reachability Distance: $\text{reach-dist}_k(p, o) = \max(k\text{-distance}(o),\, d(p, o))$
- Local Reachability Density: $\text{lrd}_k(p) = \left(\frac{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}{|N_k(p)|}\right)^{-1}$
- LOF: $\text{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}$
Where $\text{LOF}_k(p) \approx 1$ indicates p is a cluster point (density similar to its neighbors), and $\text{LOF}_k(p) \gg 1$ indicates an outlier (density lower than its neighbors).
14.4 Clustering-Based Outlier Detection
- Objects in small or sparse clusters are labeled as outliers.
- Objects failing to belong to any cluster (e.g., noise in DBSCAN) are classified as outliers.
- Assessing distance to the cluster centroid can provide an understanding of outlier characteristics.
Part C: Data Mining Trends & Frontiers
15 Overview of Web Mining
Web mining encompasses applying data mining techniques to unveil meaningful and potentially beneficial patterns from the World Wide Web, focusing on web content, structure, and usage data.
15.1 Types of Web Mining
- Web Content Mining: Utilizes content from web pages (text, images, audio, and video) to extract information.
- Web Structure Mining: Analyzes the hyperlink structure of the web to identify authority pages, hubs, and communities.
- Web Usage Mining: Examines web server logs and user behavior to reveal usage patterns.
15.1.1 Web Content Mining
- Tasks include:
- Web page classification (topic categorization).
- Information extraction (structuring unstructured data).
- Text mining on web documents (natural language processing, sentiment analysis).
- Multimedia content mining.
- Common techniques:
- TF-IDF for term weighting.
- Topic modeling (LDA — Latent Dirichlet Allocation).
- Named Entity Recognition (NER).
- Sentiment analysis.
- TF-IDF Formula:
$$w_{t,d} = tf_{t,d} \times \log \frac{N}{df_t}$$
where:
- $tf_{t,d}$ = frequency of term t in document d,
- N = total number of documents,
- $df_t$ = number of documents containing term t.
15.1.2 Web Structure Mining
- Mining the hyperlink connections of the web to uncover authority pages and hubs.
- PageRank (Google):
$$PR(p) = \frac{1 - d}{N} + d \sum_{q \in B(p)} \frac{PR(q)}{L(q)}$$
Where:
- d is the damping factor (typically 0.85) and N is the total number of pages,
- B(p) represents the set of pages linking to p,
- L(q) is the count of out-links from q.
- HITS Algorithm (Hyperlink-Induced Topic Search):
- Each page has two scores for hubs (h) and authorities (a).
- Calculate the authority score by summing the hub scores of referring pages: $a(p) = \sum_{q \to p} h(q)$
- Calculate the hub score by summing the authority scores of the pages p links to: $h(p) = \sum_{p \to q} a(q)$
- Iterate this procedure until convergence (through the power method applied to the adjacency matrix).
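The PageRank recurrence above can be run as a simple power iteration; the three-page graph is illustrative only:

```python
def pagerank(links, d=0.85, iters=100):
    """links: dict page -> list of pages it links to. Power-iteration PageRank."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1 / n for p in pages}                 # start from the uniform distribution
    for _ in range(iters):
        pr = {p: (1 - d) / n
                 + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
              for p in pages}
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(graph)
```

In this toy graph C collects links from both A and B, so it ends up with the highest score even though A is the only page C links to.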
15.1.3 Web Usage Mining
- Analyzing logs and user behaviors to uncover patterns in internet usage, drawing from web server logs, browser client logs, and proxy server logs.
- Tasks can include sessionization (creating groups of log entries into user sessions) and applying association rule mining on page sequences, sequential pattern mining (e.g., using AprioriAll and SPADE), and collaborative filtering for recommendations.
16 Temporal and Spatial Mining
16.1 Temporal Data Mining
Temporal data refers to information indexed by time, encompassing time-series, event sequences, and periodic patterns.
16.1.1 Time Series Analysis
A time series is a sequence of observations recorded at regular time intervals.
- Decomposition (additive model):
$$Y_t = T_t + S_t + C_t + I_t$$
Where:
- $T_t$: Trend component (long-term direction).
- $S_t$: Seasonal component (periodic patterns).
- $C_t$: Cyclical component (irregular, multi-year patterns).
- $I_t$: Irregular component (noise/residual).
A multiplicative form $Y_t = T_t \times S_t \times C_t \times I_t$ is also common.
16.1.2 Sequential Pattern Mining
Goal: Identify all sequences with support ≥ min support from the database of sequences using approaches like GSP (Generalized Sequential Patterns), SPADE (vertical format), and PrefixSpan (pattern-growth approach).
16.1.3 Temporal Association Rules
Extend traditional association rules to account for temporal constraints, expressed as $X \Rightarrow_T Y$ (event X must be followed by event Y within time interval T). Additionally, calendar-based patterns reveal findings tied to temporal intervals (daily, weekly, monthly).
16.2 Spatial Data Mining
Spatial data manifests through geometric properties such as location, often found in geographic information systems (GIS), remote sensing, medical imaging, and satellite data.
16.2.1 Spatial Data Types
- Point data: GPS coordinates of events.
- Line data: Features like roads, rivers, and fault lines.
- Polygon data: Regional designations for land use.
- Raster data: Satellite images organized as grids.
16.2.2 Spatial Mining Tasks
Common spatial mining tasks include:
- Spatial Classification: Classifying spatial objects (e.g., land use based on satellite imagery).
- Spatial Clustering: Grouping geographically proximate objects (e.g., utilizing DBSCAN for spatial data).
- Spatial Association Rules: Identify rules involving spatial predicates (e.g., identifying high-value properties close to the ocean).
- Spatial Outlier Detection: Detecting anomalies based on spatial context (e.g., unusually low attribute values relative to neighboring objects).
- Spatial Trend Analysis: Discover directional trends in spatial configurations.
16.2.3 Spatial Autocorrelation
The Moran’s I statistic assesses spatial autocorrelation:
$$I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}$$
where $w_{ij}$ is the spatial weight between locations i and j.
- I > 0: indicates positive spatial autocorrelation (similar values cluster together).
- I < 0: suggests negative autocorrelation (dissimilar values adjoin).
- I ≈ 0: indicates a spatially random pattern.
17 Statistical Data Mining
Statistical data mining bridges classical statistics and modern machine learning methodologies to discern patterns, select models, and make inferences.
17.1 Statistical Learning Foundations
17.1.1 Regression Analysis
- Multiple Linear Regression:
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon$$
where $\epsilon \sim N(0, \sigma^2)$ is the error term.
- Logistic Regression (for classification):
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^T x)}}$$
17.1.2 Principal Component Analysis (PCA)
A technique for dimensionality reduction that projects the data onto directions of maximal variance:
$$Z = XW$$
Where:
- W contains the eigenvectors of the covariance matrix $\Sigma = \frac{1}{n} X^T X$ (with X centered), ordered by decreasing eigenvalue $\lambda_1 \geq \lambda_2 \geq \cdots$.
- Proportion of Variance Explained (PVE) by k components: $\sum_{i=1}^{k} \lambda_i \big/ \sum_{j=1}^{p} \lambda_j$.
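For two variables the eigenvalues of the 2×2 covariance matrix have a closed form, so PVE can be computed without a linear-algebra library; the perfectly correlated toy data is an assumption for illustration:

```python
import math

def pca_2d(xs, ys):
    """PVE for two variables via closed-form eigenvalues of the 2x2 covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / n                     # var(x)
    c = sum((y - my) ** 2 for y in ys) / n                     # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n   # cov(x, y)
    # eigenvalues of [[a, b], [b, c]]
    disc = math.sqrt((a - c) ** 2 + 4 * b ** 2)
    lam1, lam2 = (a + c + disc) / 2, (a + c - disc) / 2
    pve1 = lam1 / (lam1 + lam2)
    return lam1, lam2, pve1

xs = list(range(1, 11))
ys = [2 * x for x in xs]          # perfectly correlated: one component explains everything
lam1, lam2, pve1 = pca_2d(xs, ys)
```

Because y is an exact linear function of x here, the second eigenvalue collapses to zero and the first component explains 100% of the variance; real data would split the variance across both.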
17.1.3 Hypothesis Testing in Data Mining
Common statistical tests include:
- Chi-Squared Test: For independence of categorical attributes.
- t-Test: Used in comparing means of two groups.
- ANOVA (Analysis of Variance): Compares means across multiple groups.
- Kolmogorov-Smirnov Test: Evaluates sample distribution goodness.
- Consider corrections for multiple comparisons, namely the Bonferroni correction and controlling the False Discovery Rate (FDR).
17.2 Minimum Description Length (MDL)
The MDL principle asserts that the best model is the one that achieves the best compression of model plus data:
$$H_{\text{best}} = \arg\min_H \bigl[L(H) + L(D \mid H)\bigr]$$
where L(H) is the number of bits necessary to describe model H and L(D | H) the number of bits needed to describe the data given model H.
17.3 EM Algorithm (Expectation-Maximization)
This algorithm is utilized for learning probabilistic models incorporating hidden variables:
- E-step (Expectation): $Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}\bigl[\log L(\theta; X, Z)\bigr]$
- M-step (Maximization): $\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$
- Applications are found in:
- Gaussian Mixture Models (GMM) for soft clustering.
- Learning Hidden Markov Models (HMM) for sequential data.
- Filling in missing data.
18 Data Mining Applications
18.1 Business and Finance
| Application | Description |
|---|---|
| Customer segmentation | Grouping customers for targeted marketing; RFM (Recency, Frequency, Monetary) analysis. |
| Fraud detection | Classifying transactions as fraudulent leveraging anomaly detection and supervised learning. |
| Credit scoring | Assessing creditworthiness through regression, SVM, and decision trees. |
| Market basket analysis | Leveraging frequent pattern mining (e.g., Apriori, FP-Growth) for strategies like cross-selling. |
| Stock prediction | Applying time series analysis via LSTM, ARIMA for price forecasting. |
| Churn prediction | Identification of customers at risk of leaving through classification strategies. |
18.2 Healthcare and Biomedical
| Application | Description |
|---|---|
| Disease prediction | Utilizing classification techniques for early diagnosis (e.g., diabetes, cancer). |
| Gene expression analysis | Clustering applications based on expression patterns in microarray data. |
| Drug discovery | Mining associations between various compounds and biological activities. |
| Medical image analysis | Classification and segmentation driven by CNN-based methodologies. |
| EHR Analysis | Temporal pattern mining for discernment of treatment pathways. |
18.3 Telecommunications
| Application | Description |
|---|---|
| Network intrusion detection | Employing anomaly detection. |
| Customer churn analysis | Assessing and enhancing customer retention. |
| CDR analysis | Classification of call data records. |
| Network fault diagnosis | Utilizing classification techniques for diagnosis purposes. |
18.4 Retail and E-Commerce
| Application | Description |
|---|---|
| Recommendation systems | Utilizing collaborative and content-based filtering. |
| Inventory management | Demand forecasting leveraging time series methodologies. |
| Price optimization | Using regression methods and simulations for price setting. |
| Supply chain optimization | Clustering methods combined with outlier detection in logistics. |
18.5 Social Network Analysis
| Application | Description |
|---|---|
| Community detection | Employing graph clustering utilizing methods like modularity optimization. |
| Influence Analysis | Identifying key influencers using centrality measures. |
| Link prediction | Predicting future connections based on graph-derived features. |
| Sentiment analysis | Mining opinion metrics from social media text. |
18.6 Scientific Data Mining
| Application | Description |
|---|---|
| Astronomy | Classifying celestial bodies and detecting phenomena like supernovae from surveys. |
| Climate science | Identifying climate patterns and anomalies while predicting extreme events. |
| Particle physics | Event classification processes at CERN. |
| Seismology | Techniques for predicting earthquakes leveraging sensor data. |
19 Current Trends and Research Frontiers
19.1 Big Data and Scalable Mining
- MapReduce/Hadoop: Deploying data mining on commodity hardware to scale operations for larger datasets.
- Apache Spark: In-memory distributed computing benefiting iterative machine learning algorithms.
- Stream Mining: Mining real-time data streams (using techniques like VFDT and SAMOA).
- Online Learning: Incrementally updating models with new data as it becomes available.
Hoeffding Bound (for stream mining)
$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
This implies that the true mean of a random variable with range R differs from its sample mean over n observations by at most $\epsilon$ with probability $1 - \delta$.
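Plugging numbers into the bound shows how the tolerance shrinks as more stream examples arrive (R = 1 for a quantity bounded in [0, 1], and $\delta = 10^{-7}$ as a typical small failure probability; both choices are illustrative):

```python
import math

def hoeffding_epsilon(R, delta, n):
    """epsilon such that |true mean - sample mean| <= epsilon with prob. 1 - delta."""
    return math.sqrt(R ** 2 * math.log(1 / delta) / (2 * n))

eps_1000 = hoeffding_epsilon(1, 1e-7, 1000)   # about 0.09
eps_4000 = hoeffding_epsilon(1, 1e-7, 4000)   # half of eps_1000
```

A stream decision-tree learner like VFDT splits a node once the observed gain difference between the two best attributes exceeds this epsilon, so the required evidence shrinks as $1/\sqrt{n}$.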
19.2 Deep Learning and Data Mining
- CNNs: Used for image and video mining along with feature extraction.
- RNNs/LSTMs: Directed toward temporal and sequential data mining.
- Graph Neural Networks (GNNs): Focused on mining data structured as graphs.
- Transformers: Utilized in natural language processing tasks, including BERT and GPT for text mining.
- Autoencoders: Applied in anomaly detection and dimensionality reduction processes.
19.3 Privacy-Preserving Data Mining
- k-Anonymity: Each record is indistinguishable from at least k - 1 other records.
- Differential Privacy: Introduces calibrated noise to outputs so that $\Pr[M(D) \in S] \leq e^{\epsilon} \Pr[M(D') \in S]$ for neighboring datasets D, D′ and any output set S.
- Federated Learning: Trains models across devices without requiring raw data sharing.
19.4 Explainable AI (XAI) and Interpretable Mining
- LIME: Provides explanations for individual predictions using local linear approximations.
- SHAP: Attribute importance understood within the framework of game theory.
- Counterfactual explanations: Addresses what minimal changes would impact model decisions.
20 Summary: Key Formulas Quick Reference
The formulas are summarized as follows:
Classification
- Entropy: $\text{Info}(D) = -\sum_i p_i \log_2 p_i$;
- Information Gain: $\text{Gain}(A) = \text{Info}(D) - \sum_j \frac{|D_j|}{|D|}\,\text{Info}(D_j)$;
- Gini Index: $\text{Gini}(D) = 1 - \sum_i p_i^2$;
- Bayes: $P(H \mid X) = P(X \mid H)\,P(H)/P(X)$;
- Accuracy: $(TP + TN)/(TP + TN + FP + FN)$; F1: $2PR/(P + R)$
Clustering
- k-Means objective: $E = \sum_j \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$; Minkowski distance: $d(i, j) = \bigl(\sum_f |x_{if} - x_{jf}|^q\bigr)^{1/q}$
Web Mining
- TF-IDF: $w_{t,d} = tf_{t,d} \times \log(N/df_t)$;
- PageRank: $PR(p) = (1 - d)/N + d \sum_{q \in B(p)} PR(q)/L(q)$
Statistics
- MDL: minimize $L(H) + L(D \mid H)$;
- Hoeffding bound: $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$
End of Notes
These structured notes serve as a comprehensive reference for advanced concepts in Data Mining, especially emphasizing Classification Techniques, Cluster Analysis, and related methodologies.