
Data Mining Detailed Study Notes

Unit 4 Classification Techniques and Cluster Analysis

Contents at a Glance

  • Part A: Classification
    • Decision Trees
    • Bayes
    • Naïve Bayes
    • Bayesian Belief Network (BBN)
    • Lazy Learners
    • Model Evaluation
    • Improving Accuracy
  • Part B: Cluster Analysis
    • Types of Data
    • k-Means
    • k-Medoids
    • DBSCAN
    • STING
    • Outlier Analysis
  • Part C: Trends & Frontiers
    • Web Mining
    • Temporal & Spatial Mining
    • Statistical Data Mining
    • Applications

Part A: Classification Techniques

1. General Approach to Classification

Classification is a supervised learning task, where the objective is to learn a model (classifier) to predict the class of unseen instances based on labeled training examples. The overall approach includes the following components:

  • Input: A training set D = \{(x_1, y_1), \ldots, (x_n, y_n)\}, where x_i \in \mathbb{R}^d is a feature vector and y_i \in \{c_1, \ldots, c_k\} is the corresponding class label.
  • Output: A function f : \mathbb{R}^d \rightarrow \{c_1, \ldots, c_k\} that minimizes classification error on unseen data.
1.1 Definition and Terminology

The classification task can be formalized into two main phases:

  • Phase 1 — Learning (Training): The model is constructed from the training data where the algorithm analyzes the data to induce the model.
  • Phase 2 — Classification (Testing): The model is applied to classify new, unseen test instances. The accuracy is evaluated based on a test set where the class labels are known but concealed from the model.
1.2 Model Accuracy

Accuracy can be calculated using the formula:
\text{Accuracy} = \frac{\text{Number of correctly classified test instances}}{\text{Total number of test instances}}

1.3 Types of Classifiers

There are several types of classifiers, including:

  • Decision Trees: Algorithms such as ID3, C4.5, and CART.
  • Probabilistic Classifiers: Using models like Naïve Bayes and Bayesian Belief Networks.
  • Lazy Learners: Including k-Nearest Neighbour and Case-based Reasoning.
  • Neural Networks: Such as Multi-layer Perceptron and Deep Learning.
  • Support Vector Machines (SVM): Including Linear SVM and Kernel SVM.
  • Ensemble Methods: Techniques like Bagging, Boosting, and Random Forests.

2. Classification by Decision Tree Induction

A decision tree is a tree structure that resembles a flowchart, where:

  • Each internal node represents a test on an attribute.
  • Each branch indicates the outcome of the test.
  • Each leaf node corresponds to a class label.
  • The topmost node is called the root node.
2.1 Algorithm: Basic Decision Tree Induction

The steps for inducing a decision tree are outlined in the algorithm below:

Algorithm 1: Generate Decision Tree(Training set D, Attribute list A)

  1. Create a node N.
  2. If all tuples in D belong to the same class C, then return N as a leaf node labeled C.
  3. If A is empty, return N as a leaf node labeled with majority class in D.
  4. Select the best attribute a^* based on the splitting criterion.
  5. Label N with a^*.
  6. For each known value v_i of a^*:
    • Extract the tuples with a^* = v_i to form the subset D_i.
    • If D_i is empty, attach a leaf labeled with the majority class of D.
    • Otherwise, attach the subtree generated by calling Generate Decision Tree(D_i, A \setminus \{a^*\}).
  7. Return N.
2.2 Attribute Selection Measures
2.2.1 Information Gain (ID3)

Entropy and Information Gain are defined as follows:

  • Entropy of a dataset D with m classes is defined as:
    H(D) = -\sum_{i=1}^{m} p_i \log_2 p_i
    where p_i = \frac{|C_i|}{|D|} is the proportion of class C_i in D.

  • Information Gain for an attribute A:
    \text{Gain}(A) = H(D) - \sum_{v \in \text{vals}(A)} \frac{|D_v|}{|D|}\, H(D_v)

The attribute with the highest information gain is chosen for splitting.

Example of Entropy Calculation

Given a dataset with 9 “Yes” and 5 “No” samples:
H(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \text{ bits}

The attribute “Outlook” splits the dataset as follows: Sunny(5), Overcast(4), Rain(5).

  • H(\text{Sunny}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971
  • H(\text{Overcast}) = 0 \text{ (all Yes)}
  • H(\text{Rain}) = 0.971
  • \text{Gain(Outlook)} = 0.940 - \left(\frac{5}{14}\cdot 0.971 + \frac{4}{14}\cdot 0 + \frac{5}{14}\cdot 0.971\right) = 0.246 \text{ bits}
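The calculation above can be reproduced with a short Python sketch (a minimal illustration; the two lists below simply encode the 9-Yes/5-No weather example from the text):

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_i p_i log2 p_i over the class proportions in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = H(D) - sum_v |D_v|/|D| H(D_v) for one categorical attribute,
    given as a list of attribute values parallel to the labels."""
    n = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

outlook = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5
play = (["Yes", "Yes", "No", "No", "No"]       # Sunny: 2 Yes, 3 No
        + ["Yes"] * 4                          # Overcast: all Yes
        + ["Yes", "Yes", "Yes", "No", "No"])   # Rain: 3 Yes, 2 No
print(round(entropy(play), 3))             # 0.94 bits
print(round(info_gain(outlook, play), 3))  # ≈ 0.247 (0.246 in the text, rounding)
```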
2.2.2 Gain Ratio (C4.5)

To alleviate the bias toward attributes with many values, C4.5 adjusts the information gain calculation as:

  • Split Information for attribute A:
    \text{SplitInfo}(A) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}
  • Gain Ratio:
    \text{GainRatio}(A) = \frac{\text{Gain}(A)}{\text{SplitInfo}(A)}
2.2.3 Gini Index (CART)

The Gini index for a dataset D is defined as:

  • \text{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2
    For a binary split on attribute A into D_1 and D_2:
  • \text{Gini}_A(D) = \frac{|D_1|}{|D|}\,\text{Gini}(D_1) + \frac{|D_2|}{|D|}\,\text{Gini}(D_2)
  • CART chooses the split maximizing the impurity reduction:
    \Delta\text{Gini}(A) = \text{Gini}(D) - \text{Gini}_A(D)
2.3 Tree Pruning

Pruning techniques are used to avoid overfitting, which occurs when the decision tree is overly complex for the training data. Two main types include:

  • Pre-pruning (Early Stopping): Halt tree growth early when a statistical test indicates no reliable improvement (e.g., a chi-squared test or a minimum gain threshold).
  • Post-pruning: Grow the full tree first, then prune upward from the leaves.

Methods of pruning include:

  • Reduced-error Pruning: Replace a subtree with a majority-class leaf if the accuracy does not decrease on validation.
  • Cost-complexity Pruning (CART): Minimize \text{Cost}(T) + \alpha \cdot |\text{leaves}(T)|, where \alpha is a complexity parameter.
  • Pessimistic Pruning (C4.5): Incorporates a continuity correction on training error.
2.4 Scalability and Enhancements

Several enhancements are included to improve the scalability of decision trees:

  • RainForest: Maintains AVC-sets (Attribute-Value, Class-label counts) so trees can be built on data that does not fit in main memory.
  • BOAT (Bootstrapped Optimistic Algorithm for Tree construction): Uses bootstrap samples to construct the tree with very few scans of the data.
  • Handling Missing Values: e.g., substitute the most probable value, or distribute the instance's weight across branches.
  • Continuous Attributes: Find optimal binary split points via entropy or Gini measures.

3. Bayes Classification Methods

3.1 Bayes’ Theorem

Bayes' Theorem relates conditional and marginal probabilities with the formula:
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}
Where:

  • P(H) is the prior probability of hypothesis H.
  • P(X \mid H) is the likelihood of observing X given H.
  • P(H \mid X) is the posterior probability of H given evidence X.
  • P(X) is the marginal probability (normalizing constant).

The classification rule assigns X to the class C_i that maximizes the posterior:
\hat{C} = \arg\max_{C_i} P(C_i \mid X) = \arg\max_{C_i} P(X \mid C_i)\, P(C_i)

3.2 Naïve Bayes Classifier
3.2.1 Naïve Bayes Assumption

Naïve Bayes classifiers assume that attributes are conditionally independent given the class:
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
This greatly simplifies calculations and facilitates processing high-dimensional data effectively.

3.2.2 Classification Formula

The classification can be computed as:
\hat{C} = \arg\max_{C_i} P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)

For Categorical Attributes:
P(x_k = v \mid C_i) = \frac{|\{t \in D \mid t.x_k = v \land t.C = C_i\}|}{|C_i|}

For Continuous Attributes (assuming a Gaussian distribution):
P(x_k \mid C_i) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left(-\frac{(x_k - \mu_{ik})^2}{2\sigma_{ik}^2}\right)

To resolve the zero-frequency problem, use Laplace smoothing:
P(x_k = v \mid C_i) = \frac{\text{count}(x_k = v, C_i) + \lambda}{|C_i| + \lambda \cdot |\text{vals}(x_k)|}
Setting \lambda = 1 gives basic Laplace smoothing.

Example of Naïve Bayes Classification

Given:
Classify: X = (\text{Outlook=Sunny}, \text{Temp=Cool}, \text{Humidity=High}, \text{Wind=Strong})

  • P(\text{Yes}) = \frac{9}{14}, \quad P(\text{No}) = \frac{5}{14}
  • P(\text{Sunny} \mid \text{Yes}) = \frac{2}{9}, \; P(\text{Cool} \mid \text{Yes}) = \frac{3}{9}, \; P(\text{High} \mid \text{Yes}) = \frac{3}{9}, \; P(\text{Strong} \mid \text{Yes}) = \frac{3}{9}
  • P(\text{Sunny} \mid \text{No}) = \frac{3}{5}, \; P(\text{Cool} \mid \text{No}) = \frac{1}{5}, \; P(\text{High} \mid \text{No}) = \frac{4}{5}, \; P(\text{Strong} \mid \text{No}) = \frac{3}{5}
  • Calculation:
    P(X \mid \text{Yes}) \times P(\text{Yes}) = \frac{2}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{9}{14} \approx 0.0053
    P(X \mid \text{No}) \times P(\text{No}) = \frac{3}{5}\cdot\frac{1}{5}\cdot\frac{4}{5}\cdot\frac{3}{5}\cdot\frac{5}{14} \approx 0.0206
  • Decision: Classify as No (since 0.0206 > 0.0053).
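A compact Python sketch of a categorical Naïve Bayes trainer with optional Laplace smoothing; the four-row dataset below is hypothetical, and the two product lines at the end simply verify the worked example's arithmetic:

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels, lam=0.0):
    """Estimate P(C_i) and P(x_k = v | C_i) from categorical rows.
    lam > 0 applies the Laplace smoothing of Section 3.2.2."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: k / n for c, k in class_counts.items()}
    counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    values = defaultdict(set)       # attribute index -> observed values
    for row, y in zip(rows, labels):
        for k, v in enumerate(row):
            counts[(k, y)][v] += 1
            values[k].add(v)
    def likelihood(k, v, c):
        return (counts[(k, c)][v] + lam) / (class_counts[c] + lam * len(values[k]))
    def classify(x):
        return max(priors, key=lambda c: priors[c] *
                   math.prod(likelihood(k, v, c) for k, v in enumerate(x)))
    return classify

# Toy data (hypothetical):
rows = [("Sunny", "Hot"), ("Sunny", "Cool"), ("Rain", "Cool"), ("Rain", "Hot")]
labels = ["No", "No", "Yes", "Yes"]
classify = train_nb(rows, labels, lam=1.0)
print(classify(("Rain", "Cool")))                 # -> "Yes"

# Direct check of the worked example above:
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)    # ≈ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)    # ≈ 0.0206
print("No" if p_no > p_yes else "Yes")            # -> "No"
```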
3.3 Advantages and Limitations of Naïve Bayes
  • Advantages:

    • Simple and fast to train.
    • Performs well with small datasets.
    • Naturally handles missing values.
    • No complex optimization required.
    • Effective for text classification tasks.
  • Limitations:

    • Assumes conditional independence among attributes, which can be violated in practice.
    • Cannot capture interactions between attributes naturally.
    • Sensitive to attribute correlation.
    • Zero frequency issues necessitate smoothing strategies.
    • Probability estimates may be unreliable.

4. Model Evaluation and Selection

4.1 Evaluation Metrics
  • Confusion Matrix: In a binary classifier with classes Positive (P) and Negative (N):

    |                 | Predicted Positive  | Predicted Negative  |
    |-----------------|---------------------|---------------------|
    | Actual Positive | True Positive (TP)  | False Negative (FN) |
    | Actual Negative | False Positive (FP) | True Negative (TN)  |
  • Metrics calculated from the confusion matrix:

    • Accuracy:
      \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
    • Precision (Positive Predictive Value):
      \text{Precision} = \frac{TP}{TP + FP}
    • Recall (Sensitivity):
      \text{Recall} = \frac{TP}{TP + FN}
    • Specificity:
      \text{Specificity} = \frac{TN}{TN + FP}
    • F1 Score:
      F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    • F-beta Score:
      F_\beta = \frac{(1+\beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}
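These metrics are mechanical to compute from the four counts; a small Python helper (the counts in the call are hypothetical) illustrating all of them at once:

```python
def binary_metrics(tp, fn, fp, tn, beta=1.0):
    """Compute the Section 4.1 metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                        # sensitivity / TPR
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_beta = ((1 + beta**2) * precision * recall
              / (beta**2 * precision + recall))    # beta = 1 gives F1
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f": f_beta}

print(binary_metrics(tp=40, fn=10, fp=5, tn=45))   # hypothetical counts
```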
4.2 Holdout and Cross-Validation Methods
  • Holdout Method: Partition data into training (usually 2/3) and test (1/3) sets.
  • k-Fold Cross-Validation: Split data into k disjoint subsets. In each iteration, use one subset for testing and the rest for training.
  • Leave-One-Out (LOO): A special case of k-fold where k = n (the number of instances), best suited for smaller datasets.
  • Stratified Cross-validation: Ensures each fold maintains the class distribution.
4.3 Bootstrap Method

With bootstrap sampling, the probability of an instance not being selected is:
P(\text{instance not selected}) = \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368
Thus, approximately 63.2% of original samples appear in the bootstrap sample, with the remaining 36.8% forming the test set.

The .632 bootstrap combines the two accuracy estimates:
Acc_{0.632} = 0.632 \cdot Acc_{\text{test}} + 0.368 \cdot Acc_{\text{train}}
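A tiny simulation (pure Python, illustrative only) confirming the ~63.2% figure:

```python
import random

def bootstrap_split(n, seed=0):
    """One bootstrap sample: n draws with replacement; the indices never
    drawn (~36.8% of them for large n) form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    test = sorted(set(range(n)) - set(train))
    return train, test

train, test = bootstrap_split(10_000)
print(len(set(train)) / 10_000)   # ≈ 0.632
print(len(test) / 10_000)         # ≈ 0.368
```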

4.4 ROC Curves and AUC
  • ROC Curve: Plots True Positive Rate (TPR) against False Positive Rate (FPR):
    \text{TPR} = \frac{TP}{TP + FN} \quad (y\text{-axis})
    \text{FPR} = \frac{FP}{FP + TN} \quad (x\text{-axis})
  • Area Under Curve (AUC): A single scalar summarizing the ROC curve; An AUC of 1 indicates perfect performance, while 0.5 represents random performance.
4.5 Comparing Classifiers: Statistical Tests
  • A paired t-test can be used to compare two classifiers over k-fold cross-validation:
    t = \frac{\bar{d}}{s / \sqrt{k}}, \quad \text{where} \quad d_i = e_{A,i} - e_{B,i}, \quad \bar{d} = \frac{1}{k}\sum_{i=1}^{k} d_i, \quad s^2 = \frac{1}{k-1}\sum_{i=1}^{k} (d_i - \bar{d})^2
    Compare |t| with the critical value t_{k-1,\,\alpha/2}.

5. Techniques to Improve Classification Accuracy

5.1 Ensemble Methods

Ensemble methods use multiple classifiers to build a superior classifier. The intuition is that the errors of individual classifiers can cancel out if they are independent.

5.1.1 Bagging (Bootstrap Aggregating)

Proposed by Breiman in 1996, this technique involves:

  1. Generate k bootstrap training sets via sampling with replacement.
  2. Train a base classifier for each sample.
  3. For classification tasks, utilize majority voting; for regression, compute the average.
    f_{\text{bag}}(x) = \text{majority}\{f_1(x), f_2(x), \ldots, f_k(x)\}
    This approach reduces variance without increasing bias and suits unstable base classifiers like decision trees.
5.1.2 Boosting (AdaBoost)
  1. Assign equal weights w_i = \frac{1}{n} to all instances.
  2. For iterations t = 1, \ldots, T:
    • Train classifier C_t on the weighted data.
    • Compute the weighted error: \epsilon_t = \sum_{i : C_t(x_i) \neq y_i} w_i
    • Calculate the classifier weight: \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)
    • Update weights (then normalize):
      w_i \leftarrow w_i \cdot e^{-\alpha_t y_i C_t(x_i)}
  3. Final Prediction (a sketch follows below):
    f(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t C_t(x)\right)
    This method reduces bias and variance but can be sensitive to noisy data and outliers.
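A self-contained sketch of AdaBoost with 1-D decision stumps (the data at the bottom is hypothetical; labels must be ±1):

```python
import math

def adaboost(xs, ys, T=10):
    """AdaBoost with 1-D threshold stumps h(x) = s * sign(x - theta);
    ys must be in {-1, +1}. Follows the update rules listed above."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []                      # list of (alpha, theta, s)
    for _ in range(T):
        best = None                    # (weighted error, theta, s, predictions)
        for theta in sorted(set(xs)):  # exhaustive stump search
            for s in (+1, -1):
                pred = [s if x > theta else -s for x in xs]
                err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, theta, s, pred)
        err, theta, s, pred = best
        if err == 0 or err >= 0.5:     # perfect or useless stump: stop early
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, theta, s))
        # Increase weights of misclassified instances, decrease the rest.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, pred)]
        z = sum(w)
        w = [wi / z for wi in w]       # renormalize to sum to 1
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x > theta else -s) for a, theta, s in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, -1, -1, -1, 1, 1, 1]       # not separable by any single stump
ens = adaboost(xs, ys, T=10)
print(sum(predict(ens, x) == y for x, y in zip(xs, ys)) / len(xs))
# Training accuracy rises as stumps are added, illustrating bias reduction.
```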
5.1.3 Random Forests

Random Forests combine bagging with random feature selection:

  1. Build T decision trees on bootstrap samples.
  2. At each split, randomly select m features (typically m = \sqrt{p}, where p is the total number of features).
  3. Choose the best split among the m features.
  4. Aggregate predictions using majority voting.
5.2 Class Imbalance Problem

This issue arises when class distribution is significantly skewed (e.g., 99% vs 1%). Strategies to cope include:

  • Oversampling: Duplicate instances of the minority class or apply SMOTE to create new instances.
  • Undersampling: Reduce instances of the majority class.
  • Cost-sensitive learning: Assign higher misclassification costs to the minority class.
5.3 SMOTE

The Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic instances by interpolating among minority class neighbors.

6. Advanced Classification: Bayesian Belief Networks

6.1 Bayesian Belief Network (BBN)

A BBN is a probabilistic graphical model representing a set of variables and their conditional independence relationships via a Directed Acyclic Graph (DAG). Each node X_i holds a Conditional Probability Table (CPT) specifying P(X_i \mid \text{Parents}(X_i)).

6.2 Structure and Joint Probability

Given a DAG with n nodes, the joint probability distribution can be articulated as:
P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i))
Example: For medical diagnosis, with variables: Smoking, Pollution, Lung Cancer, X-ray, and Dyspnoea.


  • CPT for Lung Cancer: given Smoking (S) and Pollution (P):

| S | P    | P(LC = T) | P(LC = F) |
|---|------|-----------|-----------|
| T | High | 0.05      | 0.95      |
| T | Low  | 0.02      | 0.98      |
| F | High | 0.03      | 0.97      |
| F | Low  | 0.001     | 0.999     |
6.3 Learning in BBNs
  1. Structure Learning: Finding the DAG structure
    a. Score-based Learning: Optimize a scoring function (e.g., BIC, MDL, BDe).
    b. Constraint-based Learning: Utilize conditional independence tests to create the graph.
  2. Parameter Learning: Estimate the CPTs utilizing MLE or Bayesian methods.
6.4 BIC Score (Bayesian Information Criterion)

The score is defined as:
\text{BIC}(G, D) = \log P(D \mid \hat{\theta}, G) - \frac{|G|}{2} \log n
where |G| is the number of free parameters of the network and n is the number of data points.

6.5 Inference in BBNs
  • Exact Inference: Techniques like variable elimination and junction tree algorithm.
  • Approximate Inference: Methods like Monte Carlo sampling (MCMC), and loopy belief propagation.
    Given evidence E = e, compute:
    P(X \mid E = e) = \frac{P(X, E = e)}{P(E = e)} = \frac{\sum_{Y \setminus \{X, E\}} P(Y)}{\sum_{Y \setminus E} P(Y)}
    (a ratio of sums: marginalize the joint over all variables except X and E in the numerator, and over all variables except E in the denominator).

7. Lazy Learners

Lazy Learning

Lazy learners (or instance-based learners) delay model generalization until a query arises. They retain all training data and categorize queries based on comparisons with stored instances, contrasting with eager learners like decision trees or Naïve Bayes that develop a global model during training.

7.1 k-Nearest Neighbour (k-NN) Classifier

Algorithm 2: k-Nearest Neighbour Classification

  1. Store all n training instances.
  2. Given a query instance x_q:
  3. Identify the k nearest neighbors of x_q using a distance metric.
  4. Assign the class by majority vote among the k neighbors.

Distance Metrics:

  • Euclidean:
    d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}
  • Manhattan:
    d(x, y) = \sum_{i=1}^{d} |x_i - y_i|
  • Minkowski:
    d(x, y) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}
  • Weighted k-NN: Weight neighbors by inverse squared distance:
    w_i = \frac{1}{d(x_q, x_i)^2}, \quad \hat{C} = \arg\max_c \sum_{i : C_i = c} w_i
Example of k-NN

Given training data with 3 class-A points and 3 class-B points, classify the query point x_q = (5, 3):

  • With k = 3: if 2 of the 3 nearest neighbors are class A and 1 is class B, classify as class A.
  • With k = 1: classify as the class of the single nearest neighbor.
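A minimal Python sketch of (unweighted) k-NN; the six training points are hypothetical, chosen so the vote around x_q = (5, 3) matches the description above:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (point, label); majority vote among the k points
    nearest to query under Euclidean distance."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((4, 3), "A"), ((5, 4), "A"), ((1, 1), "A"),
         ((6, 2), "B"), ((9, 9), "B"), ((8, 8), "B")]
print(knn_classify(train, (5, 3), k=3))   # 2 of 3 nearest are A -> "A"
print(knn_classify(train, (5, 3), k=1))   # the single nearest neighbour decides
```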
7.2 Case-Based Reasoning (CBR)

CBR classifies by retrieving the most similar past cases and adapting their solutions. The CBR cycle involves:

  1. Retrieve: Find similar past cases.
  2. Reuse: Adapt retrieved solutions to the current problem.
  3. Revise: Test and modify solutions as needed.
  4. Retain: Store new cases for future reference.
7.3 Advantages and Disadvantages of Lazy Learning
  • Advantages:

    • No upfront training time required.
    • Naturally accommodates multi-class scenarios.
    • Adapts to new training data effortlessly.
    • Can model complex decision boundaries.
  • Disadvantages:

    • Classification is time-consuming (search all instances).
    • High memory consumption.
    • Sensitive to irrelevant features.
    • Requires feature normalization.
    • Vulnerable to the curse of dimensionality.

Part B: Cluster Analysis

8. Introduction to Cluster Analysis

Clustering is an unsupervised learning task aimed at discovering groups (clusters) in unlabelled data. The goal is to group instances such that:

  • Instances within a cluster are similar to each other.
  • Instances in one cluster are dissimilar to those in other clusters.
    Given a dataset D = \{x_1, x_2, \ldots, x_n\}, clustering partitions D into k groups C_1, C_2, \ldots, C_k such that:
  • C_i \neq \emptyset for all i
  • \bigcup_{i=1}^{k} C_i = D (the clusters together cover the entire dataset)
  • C_i \cap C_j = \emptyset for all i \neq j (no overlap in hard clustering).
8.1 Applications of Clustering

Clustering can be applied in various fields:

  • Customer segmentation in marketing.
  • Document clustering and topic modeling.
  • Image segmentation in computer vision applications.
  • Anomaly detection in network security.
  • Gene expression analysis in bioinformatics.
  • City planning through grouping locations based on value.
9. Types of Data in Cluster Analysis
9.1 Data Types
| Data Type       | Description                            | Example                                |
|-----------------|----------------------------------------|----------------------------------------|
| Interval-scaled | Continuous real values                 | Height, temperature                    |
| Binary          | Two states (symmetric/asymmetric)      | Gender (M/F), disease presence (Y/N)   |
| Nominal         | Unordered discrete categories          | Colour, city, job                      |
| Ordinal         | Ordered discrete values                | Rank, quality (low/medium/high)        |
| Ratio-scaled    | Positive values with meaningful ratios | Age, salary, weight                    |
| Mixed types     | Combination of different types         | Customer records with various features |
9.2 Dissimilarity and Similarity Measures
9.2.1 For Interval-Scaled Attributes

Standardization brings attributes to a uniform scale:
z_f = \frac{x_{if} - m_f}{s_f}
where m_f is the mean and s_f is the standard deviation of attribute f.

  • Minkowski Distance:
    d(i, j) = \left(\sum_{f=1}^{p} |x_{if} - x_{jf}|^q\right)^{1/q}
    Specific values of q yield familiar metrics (q = 1: Manhattan; q = 2: Euclidean).
9.2.2 For Binary Attributes

A contingency table defines the occurrences as follows:

| Occurrence (i, j) | Count |
|-------------------|-------|
| Both 1            | a     |
| 1 and 0           | b     |
| 0 and 1           | c     |
| Both 0            | d     |

The distance measures become:

  • Simple Matching (symmetric binary attributes):
    d_{\text{simple matching}} = \frac{b + c}{a + b + c + d}
  • Jaccard Distance (asymmetric binary attributes):
    d_{\text{Jaccard}} = \frac{b + c}{a + b + c}
9.2.3 For Nominal Attributes

d(i, j) = \frac{p - m}{p}, where m is the number of matching attributes and p is the total number of attributes.

9.2.4 For Ordinal Attributes

Convert ranks to a numerical scale from 0 to 1 and apply interval-scale methods.

9.2.5 Mixed-Type Attributes

The generalized distance measure can be represented as:
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
where \delta_{ij}^{(f)} = 0 if attribute f is missing for i or j (and 1 otherwise), and d_{ij}^{(f)} is the attribute-appropriate distance for f.

10. Partitioning Methods

In partitioning methods, the goal is to split n objects into k non-overlapping subsets such that each subset includes at least one object and minimizes a criterion, generally the sum of squared distances to cluster centers.

10.1 k-Means Algorithm

Algorithm 3: k-Means Clustering

  1. Select k objects randomly as initial cluster centroids \mu_1, \ldots, \mu_k.
  2. Repeat the following steps:
    • For each object x_i: assign x_i to the cluster C_j with j = \arg\min_j d(x_i, \mu_j).
    • For each cluster C_j: update the centroid \mu_j = \frac{1}{|C_j|}\sum_{x \in C_j} x.
  3. Until the centroids no longer change (or a maximum number of iterations is reached).
  • Objective to minimize:
    J = \sum_{j=1}^{k} \sum_{x \in C_j} \|x - \mu_j\|^2
Example of k-Means

Data contains: {2, 4, 10, 12, 3, 20, 30, 11, 25}, with k=3.

  • Initial centroids: \mu_1 = 2, \mu_2 = 4, \mu_3 = 10
  • Iteration 1 assignment: C_1 = \{2, 3\}, C_2 = \{4\}, C_3 = \{10, 12, 11, 20, 25, 30\}
  • Updated centroids: \mu_1 = 2.5, \mu_2 = 4, \mu_3 = 18
  • Iterate until convergence.
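A minimal 1-D Lloyd's-algorithm sketch reproducing this example (ties go to the lower-indexed centroid):

```python
def kmeans_1d(data, centroids, max_iter=100):
    """Plain Lloyd's algorithm (Algorithm 3) on 1-D data."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for x in data:
            j = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[j].append(x)           # assign to the nearest centroid
        new = [sum(c) / len(c) if c else m  # recompute centroids as means
               for c, m in zip(clusters, centroids)]
        if new == centroids:                # converged: assignments stable
            break
        centroids = new
    return clusters, centroids

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, centroids = kmeans_1d(data, centroids=[2, 4, 10])
print(clusters)    # [[2, 4, 3], [10, 12, 11], [20, 30, 25]]
print(centroids)   # [3.0, 11.0, 25.0]
```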
Limitations of k-Means
  • Predefined k: Must specify the number of clusters in advance.
  • Sensitivity to Initial Centroid Selection: May lead to local optima.
  • Sensitivity to Outliers and Noise: Outliers can skew results.
  • Assumes Spherical, Equal-sized Clusters: Performance degrades if these conditions are not met.
  • Cost on Large Datasets: Complexity is O(nkt), where t is the number of iterations; repeated passes over very large datasets can be expensive.
Variants of k-Means
  • k-Means++: Uses smart initialization to reduce sensitivity to initial seeds.
  • Mini-Batch k-Means: Applies random batches to enhance scalability.
  • Fuzzy c-Means: Offers probabilistic or soft cluster assignments.
11 k-Medoids Clustering

Instead of using the mean (centroid), k-medoids (PAM — Partitioning Around Medoids) chooses an actual data object (the medoid) which minimizes dissimilarity within the cluster. This method is more robust against outliers compared to k-Means.

11.1 PAM Algorithm

Algorithm 4: PAM (Partitioning Around Medoids)

  1. Randomly select k objects as initial medoids m_1, \ldots, m_k.
  2. Repeat the following:
    • Assign each non-medoid object to the nearest medoid’s cluster.
    • For each medoid m_i and each non-medoid object o_h:
      • Compute the total swap cost:
        TC_{ih} = \sum_{j \notin \text{medoids}} \big( d(j, \text{nearest medoid after swapping } m_i \text{ and } o_h) - d(j, \text{nearest medoid before the swap}) \big)
    • If TC_{ih} < 0, perform the swap m_i \leftarrow o_h (it decreases the total cost).
  3. Continue until no changes occur.
11.2 CLARA (Clustering LARge Applications)

CLARA applies PAM to multiple random samples of a large dataset:

  1. Draw multiple random samples.
  2. Apply PAM to each to find medoids.
  3. Assign objects to the nearest medoid across all samples.
  4. Select the medoid set minimizing the overall cost. The complexity per sample is
    O(ks^2 + k(n - k)), where s is the sample size.
11.3 CLARANS (Clustering Large Applications based on RANdomized Search)

CLARANS combines sampling with graph-search:

  • Models the problem as a graph where nodes represent sets of k medoids.
  • Randomly explores neighbor swaps, stopping when no better neighbor is found within a fixed maximum number of attempts.
  • More efficient than CLARA when working with large datasets.
12 Density-Based Methods

Density-based clustering identifies clusters as dense regions surrounded by lower-density areas, effectively handling noise/outliers and discovering clusters of arbitrary shapes.

12.1 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Key Concepts: Given parameters ϵ\epsilon (neighborhood radius) and MinPts (minimum density):

  • Neighbor Definition:
    N_\epsilon(p) = \{ q \in D \mid d(p, q) \leq \epsilon \}
  • Core Point: p is a core point if |N_\epsilon(p)| \geq \text{MinPts}.
  • Border Point: Not a core point, but located within a core point's neighborhood.
  • Noise Point (Outlier): Neither a core point nor a border point.
  • Density-Reachable: q is directly density-reachable from p if q \in N_\epsilon(p) and p is a core point; density-reachability is the transitive closure of this relation.
  • Density-Connected: p and q are density-connected if both are density-reachable from some common point o.

Algorithm 5: DBSCAN

  1. Mark all points as unvisited.
  2. For every unvisited point p:
    • Mark p as visited.
    • Compute N \leftarrow N_\epsilon(p).
  3. If |N| < \text{MinPts}, mark p as noise.
  4. Else:
    • Create a new cluster C and add p to C.
    • Set seeds \leftarrow N \setminus \{p\}.
  5. For each point q \in seeds:
    • If q is unvisited, mark q as visited.
    • Compute N' \leftarrow N_\epsilon(q).
    • If |N'| \geq \text{MinPts}, add the points of N' to the seeds.
    • Add q to cluster C if it is not yet assigned to any cluster.

Complexity: O(n log n) with spatial indexing (e.g., R*-tree), O(n^2) without spatial indexing.
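A brute-force Python sketch of Algorithm 5 (no spatial index, so O(n²); the point set below is hypothetical):

```python
import math

def dbscan(points, eps, min_pts, dist=math.dist):
    """Returns {point index: cluster id}, with -1 meaning noise."""
    labels = {}
    def neighbours(i):
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]
    cid = 0
    for i in range(len(points)):
        if i in labels:
            continue
        n_i = neighbours(i)
        if len(n_i) < min_pts:
            labels[i] = -1             # noise (may be relabelled as border)
            continue
        labels[i] = cid                # i is core: start a new cluster
        queue = [j for j in n_i if j != i]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:    # previously noise -> border point
                labels[q] = cid
            if q in labels:            # already assigned; skip re-expansion
                continue
            labels[q] = cid
            n_q = neighbours(q)
            if len(n_q) >= min_pts:    # q is core: expand through it
                queue.extend(n_q)
        cid += 1
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (25, 25)]
print(dbscan(pts, eps=2.0, min_pts=3))
# Two dense groups get cluster ids 0 and 1; the isolated point is noise (-1).
```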

DBSCAN Illustration

In an illustration, core points (marked in blue) have neighborhoods exceeding MinPts while border points lie within a core’s ε-neighborhood. Noise points (marked in orange) are isolated.

Advantages of DBSCAN:

  • Discovers clusters of arbitrary shape and identifies noise and outliers automatically.
  • There's no need to define the number of clusters k.

Limitations:

  • DBSCAN is sensitive to the choice of \epsilon and MinPts.
  • It struggles with varying density clusters.
  • High-dimensional data can degrade performance due to the curse of dimensionality.
12.2 OPTICS (Ordering Points to Identify Clustering Structure)

OPTICS expands upon DBSCAN by addressing dense variations and offers:

  • Core Distance: The minimum \epsilon that makes point p a core point.
  • Reachability Distance: \max(\text{core-dist}(o), d(o, p)), i.e., the larger of o's core distance and the distance from o to p.
  • Output: An ordering of the points from which clusters at different density levels can be extracted.
13 Grid-Based Methods

These methods quantize the data space into a finite number of cells enabling clustering operations dependent on the number of cells rather than the number of data objects.

13.1 STING (STatistical Information Grid)

STING is a multiresolution grid-based clustering method:

  • The structure divides space into rectangular cells at various resolution levels.
  • Each upper-level cell partitions smaller cells at subsequent levels. Each cell stores statistical parameters, e.g., count, mean, and standard deviation.

Query processing steps include:

  1. Start at a high-level (coarse) layer.
  2. Compute confidence intervals for relevance using stored statistics.
  3. Discard irrelevant cells, recursively process relevant cells.
  4. Return relevant cells as cluster candidates.

Complexity: O(K) where K represents the number of grid cells, independent of n.

Advantages:

  • Query-independent preprocessing is beneficial for execution speed (O(n) to build, O(K) to query).
  • Supports incremental updates and multiresolution analyses.

Limitations:

  • Cluster boundaries are axis-parallel (horizontal or vertical only; no diagonal boundaries).
  • Accuracy depends on the grid granularity, and the method assumes the data within each cell follows a known statistical distribution.
13.2 WaveCluster

This approach employs wavelet transforms within grid-based clustering:

  1. Summarize data in a multi-dimensional grid framework.
  2. Execute wavelet transformations to locate high-density regions.
  3. Identify connected components within the transformed grid.
    This method detects arbitrarily shaped clusters and handles noise well.

Complexity: O(n).

13.3 CLIQUE (CLustering In QUEst)

CLIQUE integrates grid-based and subspace clustering:

  • Automatically identifies high-density clusters within subspaces.
  • Utilizes Apriori-like monotonicity assumptions to evaluate cluster density.
14 Outlier Analysis

An outlier is an observation that significantly deviates from other observations,
indicating possible measurement errors or insightful anomalies. Outliers can signify:

  • Data entry mistakes.
  • Important rare events (e.g., fraud, network intrusions).
  • Patterns worth further investigation.
14.1 Statistical Approaches

Outlier detection methods include:

  • Distribution-based: Assume the data follows a specific distribution; objects with low probability under it are flagged.
  • Z-score Method: Compute z_i = \frac{x_i - \bar{x}}{s} and treat observations with |z_i| > 3 as potential outliers.
  • IQR Method (Box Plot): Define fences based on quartiles to flag outliers:
    • Lower fence: Q_1 - 1.5 \times \text{IQR}
    • Upper fence: Q_3 + 1.5 \times \text{IQR}, where \text{IQR} = Q_3 - Q_1.
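A small Python sketch of both rules (the data list is hypothetical); note how the extreme value inflates s enough that the z-score rule can miss it while the IQR fences catch it:

```python
import statistics

def zscore_outliers(xs, threshold=3.0):
    mean, s = statistics.mean(xs), statistics.stdev(xs)
    return [x for x in xs if abs((x - mean) / s) > threshold]

def iqr_outliers(xs):
    q1, _, q3 = statistics.quantiles(xs, n=4)   # Q1, median, Q3
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [x for x in xs if x < lo or x > hi]

data = [12, 13, 12, 14, 13, 12, 15, 13, 98]
print(zscore_outliers(data))   # [] -- 98 inflates s, masking itself
print(iqr_outliers(data))      # [98] -- quartile fences are more robust
```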
14.2 Distance-Based Outlier Detection

DB(p, D)-Outlier: An object o is a DB(p, D)-outlier if at least a fraction p of all objects lie at distance greater than D from o; equivalently,
\frac{|\{o' \in D \mid d(o, o') \leq D\}|}{|D|} < 1 - p
Efficient detection uses nested loops or spatial indices (e.g., R-tree) to prune objects that are clearly not outliers; complexity is O(n^2) naively, or roughly O(n \log n) with indexing.

14.3 Density-Based Local Outlier Factor (LOF)

The LOF score compares the local density of an object with the local densities of its neighbors:

  • Reachability Distance:
    \text{reach-dist}_k(p, o) = \max(\text{k-dist}(o), d(p, o))
  • Local Reachability Density:
    \text{lrd}_k(p) = \left( \frac{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}{|N_k(p)|} \right)^{-1}
  • LOF:
    \text{LOF}_k(p) = \frac{\sum_{o \in N_k(p)} \text{lrd}_k(o)}{|N_k(p)| \cdot \text{lrd}_k(p)}
  • Interpretation: \text{LOF}_k(p) \approx 1 means p sits inside a cluster (density similar to its neighbors); \text{LOF}_k(p) \gg 1 marks an outlier (density much lower than its neighbors).
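A brute-force LOF sketch following these three definitions (assumes distinct points; the five points below are hypothetical):

```python
import math

def lof(points, k=2, dist=math.dist):
    """Brute-force O(n^2) LOF; returns one score per point."""
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    def knn(i):
        order = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])
        kth = d[i][order[k - 1]]                 # k-dist(i)
        return [j for j in order if d[i][j] <= kth], kth
    nk, kdist = zip(*(knn(i) for i in range(n)))
    def lrd(i):
        reach = [max(kdist[o], d[i][o]) for o in nk[i]]  # reach-dist_k(i, o)
        return len(nk[i]) / sum(reach)
    return [sum(lrd(o) for o in nk[i]) / (len(nk[i]) * lrd(i))
            for i in range(n)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print([round(v, 2) for v in lof(pts, k=2)])
# The four corner points score ~1; the distant point scores >> 1 (outlier).
```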
14.4 Clustering-Based Outlier Detection
  • Objects in small or sparse clusters are labeled as outliers.
  • Objects failing to belong to any cluster (e.g., noise in DBSCAN) are classified as outliers.
  • The distance of an object to its nearest cluster centroid can serve directly as an outlier score.

Part C: Data Mining Trends & Frontiers

15 Overview of Web Mining

Web mining encompasses applying data mining techniques to unveil meaningful and potentially beneficial patterns from the World Wide Web, focusing on web content, structure, and usage data.

15.1 Types of Web Mining
  • Web Content Mining: Utilizes content from web pages (text, images, audio, and video) to extract information.
  • Web Structure Mining: Analyzes the hyperlink structure of the web to identify authority pages, hubs, and communities.
  • Web Usage Mining: Examines web server logs and user behavior to reveal usage patterns.
15.1.1 Web Content Mining
  • Tasks include:
    • Web page classification (topic categorization).
    • Information extraction (structuring unstructured data).
    • Text mining on web documents (natural language processing, sentiment analysis).
    • Multimedia content mining.
  • Common techniques:
    • TF-IDF for term weighting.
    • Topic modeling (LDA — Latent Dirichlet Allocation).
    • Named Entity Recognition (NER).
    • Sentiment analysis.
  • TF-IDF Formula:
    \text{TF-IDF}(t, d) = tf(t, d) \cdot \log\left( \frac{N}{df(t)} \right)
    where:
  • tf(t, d) = frequency of term t in document d,
  • N = total number of documents,
  • df(t) = number of documents containing term t.
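A minimal TF-IDF sketch over pre-tokenized documents (the three toy documents are hypothetical):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["data", "mining", "rocks"],
        ["data", "clustering"],
        ["web", "mining", "data"]]
print(tf_idf(docs)[0])   # "data" scores 0: it appears in every document
```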
15.1.2 Web Structure Mining
  • Mining the hyperlink connections of the web to uncover authority pages and hubs.
  • PageRank (Google):
    PR(p) = \frac{1 - d}{N} + d \sum_{q \in B_p} \frac{PR(q)}{L(q)}
    Where:
  • d \approx 0.85 is the damping factor,
  • B_p is the set of pages linking to p,
  • L(q) is the number of out-links from q.
  • HITS Algorithm (Hyperlink-Induced Topic Search):
    • Each page has two scores for hubs (h) and authorities (a).
    • Calculate authority score by summing hub scores of referring pages:
      a(p) = \sum_{q \rightarrow p} h(q)
    • Calculate hub score by summing authority scores of links:
      h(p) = \sum_{p \rightarrow q} a(q)
    • Iterate this procedure until convergence (through the power method applied to the adjacency matrix).
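A power-iteration sketch of the PageRank formula above (the three-page link graph is hypothetical; dangling pages are simply skipped in this simplification):

```python
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}. Plain power iteration of
    PR(p) = (1 - d)/N + d * sum over q in B_p of PR(q)/L(q)."""
    N = len(links)
    pr = {p: 1.0 / N for p in links}
    for _ in range(iters):
        new = {p: (1 - d) / N for p in links}
        for q, outs in links.items():
            for p in outs:                 # q passes rank to its out-links
                new[p] += d * pr[q] / len(outs)
        pr = new
    return pr

# A <-> B, both link to C, C links back to A:
print(pagerank({"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}))
```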
15.1.3 Web Usage Mining
  • Analyzing logs and user behaviors to uncover patterns in internet usage, drawing from web server logs, browser client logs, and proxy server logs.
  • Tasks include sessionization (grouping log entries into user sessions), association rule mining on page sets, sequential pattern mining on page sequences (e.g., using AprioriAll and SPADE), and collaborative filtering for recommendations.

16 Temporal and Spatial Mining

16.1 Temporal Data Mining

Temporal data refers to information indexed by time, encompassing time-series, event sequences, and periodic patterns.

16.1.1 Time Series Analysis

A time series is a sequence of observations x_1, x_2, \ldots, x_T recorded at regular time intervals.

  • Decomposition:
    x_t = T_t + S_t + C_t + I_t
    Where:
  • T_t: Trend component (long-term direction).
  • S_t: Seasonal component (periodic patterns).
  • C_t: Cyclical component (irregular, multi-year patterns).
  • I_t: Irregular component (noise/residual).
16.1.2 Sequential Pattern Mining

Goal: Identify all sequences with support ≥ min support from the database of sequences using approaches like GSP (Generalized Sequential Patterns), SPADE (vertical format), and PrefixSpan (pattern-growth approach).

16.1.3 Temporal Association Rules

Extend traditional association rules with temporal constraints, expressed as X \Rightarrow_{\Delta t} Y (event X is followed by event Y within time interval \Delta t). Additionally, calendar-based patterns reveal rules tied to temporal intervals (daily, weekly, monthly).

16.2 Spatial Data Mining

Spatial data manifests through geometric properties such as location, often found in geographic information systems (GIS), remote sensing, medical imaging, and satellite data.

16.2.1 Spatial Data Types
  • Point data: GPS coordinates of events.
  • Line data: Features like roads, rivers, and fault lines.
  • Polygon data: Regional designations for land use.
  • Raster data: Satellite images organized as grids.
16.2.2 Spatial Mining Tasks

Common spatial mining tasks include:

  • Spatial Classification: Classifying spatial objects (e.g., land use based on satellite imagery).
  • Spatial Clustering: Grouping geographically proximate objects (e.g., utilizing DBSCAN for spatial data).
  • Spatial Association Rules: Identify rules involving spatial predicates (e.g., identifying high-value properties close to the ocean).
  • Spatial Outlier Detection: Detecting anomalies based on spatial context (e.g., unusually low attribute values relative to neighboring objects).
  • Spatial Trend Analysis: Discover directional trends in spatial configurations.
16.2.3 Spatial Autocorrelation

The Moran’s I statistic assesses spatial autocorrelation:
I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}

  • Where:
    • I > 0: positive spatial autocorrelation (similar values cluster together).
    • I < 0: negative spatial autocorrelation.
    • I \approx 0: a spatially random pattern.
17 Statistical Data Mining

Statistical data mining bridges classical statistics and modern machine learning methodologies to discern patterns, select models, and make inferences.

17.1 Statistical Learning Foundations
17.1.1 Regression Analysis
  • Multiple Linear Regression:
    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon, \quad \text{with least-squares solution } \hat{\beta} = (X^T X)^{-1} X^T y
  • Logistic Regression (for classification):
    P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^T x)}} = \sigma(\beta_0 + \beta^T x)
17.1.2 Principal Component Analysis (PCA)

A dimensionality-reduction technique that projects the data onto the directions of maximal variance:
Z = XW
Where:

  • W contains the top eigenvectors of X^T X (the covariance matrix of the centred data).
  • Proportion of Variance Explained (PVE) by the first k components:
    \text{PVE}_k = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{p} \lambda_j}
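A NumPy sketch of PCA via the eigendecomposition above (the synthetic data is generated purely for illustration):

```python
import numpy as np

def pca(X, k):
    """Project X onto the top-k eigenvectors of its covariance matrix;
    also return the proportion of variance explained (PVE_k)."""
    Xc = X - X.mean(axis=0)                  # centre the data first
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues descending
    W = eigvecs[:, order[:k]]
    pve = eigvals[order[:k]].sum() / eigvals.sum()
    return Xc @ W, pve                       # scores Z = XW, and PVE_k

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Z, pve = pca(X, k=1)
print(round(pve, 2))   # the first component captures most of the variance
```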
17.1.3 Hypothesis Testing in Data Mining

Common statistical tests include:

  • Chi-Squared Test: For independence of categorical attributes.
  • t-Test: Used in comparing means of two groups.
  • ANOVA (Analysis of Variance): Compares means across multiple groups.
  • Kolmogorov-Smirnov Test: Tests whether a sample follows a reference distribution (goodness of fit).
  • Consider corrections for multiple comparisons, namely the Bonferroni correction and controlling the False Discovery Rate (FDR).
17.2 Minimum Description Length (MDL)

The MDL principle asserts that the best model is one that achieves optimal data compression. Formally represented as:
\text{MDL}(H, D) = L(H) + L(D \mid H)

  • Where L(H) refers to bits necessary to describe model H and L(D|H) accounts for bits describing data given model H.
17.3 EM Algorithm (Expectation-Maximization)

This algorithm is utilized for learning probabilistic models incorporating hidden variables:

  • E-step (Expectation):
    Q(\theta \mid \theta^{(t)}) = E_{Z \mid X, \theta^{(t)}}\left[\log P(X, Z \mid \theta)\right]
  • M-step (Maximization):
    \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})
  • Applications are found in:
    • Gaussian Mixture Models (GMM) for soft clustering.
    • Learning Hidden Markov Models (HMM) for sequential data.
    • Filling in missing data.
18 Data Mining Applications
18.1 Business and Finance
| Application            | Description                                                                              |
|------------------------|------------------------------------------------------------------------------------------|
| Customer segmentation  | Grouping customers for targeted marketing; RFM (Recency, Frequency, Monetary) analysis. |
| Fraud detection        | Classifying transactions as fraudulent using anomaly detection and supervised learning. |
| Credit scoring         | Assessing creditworthiness through regression, SVM, and decision trees.                 |
| Market basket analysis | Frequent pattern mining (e.g., Apriori, FP-Growth) for strategies like cross-selling.   |
| Stock prediction       | Time series analysis (e.g., LSTM, ARIMA) for price forecasting.                         |
| Churn prediction       | Identifying customers at risk of leaving via classification.                            |
18.2 Healthcare and Biomedical
| Application              | Description                                                         |
|--------------------------|---------------------------------------------------------------------|
| Disease prediction       | Classification for early diagnosis (e.g., diabetes, cancer).        |
| Gene expression analysis | Clustering genes/samples by expression patterns in microarray data. |
| Drug discovery           | Mining associations between compounds and biological activities.    |
| Medical image analysis   | CNN-based classification and segmentation.                          |
| EHR analysis             | Temporal pattern mining to discern treatment pathways.              |
18.3 Telecommunications
| Application                 | Description                                    |
|-----------------------------|------------------------------------------------|
| Network intrusion detection | Anomaly detection on network traffic.          |
| Customer churn analysis     | Assessing and improving customer retention.    |
| CDR analysis                | Classification of call data records.           |
| Network fault diagnosis     | Classification techniques for fault diagnosis. |
18.4 Retail and E-Commerce
| Application               | Description                                               |
|---------------------------|-----------------------------------------------------------|
| Recommendation systems    | Collaborative and content-based filtering.                |
| Inventory management      | Demand forecasting with time series methods.              |
| Price optimization        | Regression and simulation for price setting.              |
| Supply chain optimization | Clustering combined with outlier detection in logistics.  |
18.5 Social Network Analysis
| Application         | Description                                                 |
|---------------------|-------------------------------------------------------------|
| Community detection | Graph clustering, e.g., via modularity optimization.        |
| Influence analysis  | Identifying key influencers using centrality measures.      |
| Link prediction     | Predicting future connections from graph-derived features.  |
| Sentiment analysis  | Mining opinions from social media text.                     |
18.6 Scientific Data Mining
| Application      | Description                                                                              |
|------------------|------------------------------------------------------------------------------------------|
| Astronomy        | Classifying celestial bodies and detecting phenomena such as supernovae in sky surveys. |
| Climate science  | Identifying climate patterns and anomalies; predicting extreme events.                  |
| Particle physics | Event classification at CERN.                                                           |
| Seismology       | Earthquake prediction from sensor data.                                                 |
19 Current Trends and Research Frontiers
19.1 Big Data and Scalable Mining
  • MapReduce/Hadoop: Deploying data mining on commodity hardware to scale operations for larger datasets.
  • Apache Spark: In-memory distributed computing benefiting iterative machine learning algorithms.
  • Stream Mining: Mining real-time data streams (using techniques like VFDT and SAMOA).
  • Online Learning: Incrementally updating models with new data as it becomes available.
Hoeffding Bound (for stream mining)

\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}
With probability 1 - \delta, the true mean differs from the sample mean by at most \epsilon, where R is the range of the variable and n is the number of observations.

19.2 Deep Learning and Data Mining
  • CNNs: Used for image and video mining along with feature extraction.
  • RNNs/LSTMs: Directed toward temporal and sequential data mining.
  • Graph Neural Networks (GNNs): Focused on mining data structured as graphs.
  • Transformers: Utilized in natural language processing tasks, including BERT and GPT for text mining.
  • Autoencoders: Applied in anomaly detection and dimensionality reduction processes.
19.3 Privacy-Preserving Data Mining
  • k-Anonymity: Each record is indistinguishable from at least k - 1 other records.
  • Differential Privacy: Adds calibrated noise to outputs; a mechanism M is \epsilon-differentially private if P[M(D) \in S] \leq e^{\epsilon}\, P[M(D') \in S] for all neighboring datasets D, D' and output sets S.
  • Federated Learning: Trains models across devices without requiring raw data sharing.
19.4 Explainable AI (XAI) and Interpretable Mining
  • LIME: Provides explanations for individual predictions using local linear approximations.
  • SHAP: Attributes each prediction to input features using Shapley values from cooperative game theory.
  • Counterfactual explanations: Identify the minimal input changes that would alter the model's decision.
20 Summary: Key Formulas Quick Reference

The formulas are summarized as follows:

Classification

H(D) = -\sum_i p_i \log_2 p_i; \quad
\text{Gain}(A) = H(D) - \sum_v \frac{|D_v|}{|D|}\, H(D_v); \quad
\text{GainRatio}(A) = \frac{\text{Gain}(A)}{\text{SplitInfo}(A)}; \quad
\text{Gini}(D) = 1 - \sum_i p_i^2; \quad
P(C_i \mid X) \propto P(C_i) \prod_k P(x_k \mid C_i) \; \text{(Naïve Bayes)}

Clustering

J_{k\text{-means}} = \sum_{j=1}^{k} \sum_{x \in C_j} \|x - \mu_j\|^2; \quad
\text{reach-dist}_k(p, o) = \max(\text{k-dist}(o), d(p, o)); \quad
\text{LOF}_k(p) = \frac{\sum_{o \in N_k(p)} \text{lrd}_k(o)}{|N_k(p)| \cdot \text{lrd}_k(p)}

Web Mining

\text{TF-IDF}(t, d) = tf(t, d) \cdot \log\left( \frac{N}{df(t)} \right); \quad
PR(p) = \frac{1 - d}{N} + d \sum_{q \in B_p} \frac{PR(q)}{L(q)}

Statistics

\hat{\beta} = (X^T X)^{-1} X^T y; \quad
\sigma(z) = \frac{1}{1 + e^{-z}}; \quad
\text{MDL}(H, D) = L(H) + L(D \mid H)

End of Notes

These structured notes serve as a comprehensive reference for advanced concepts in Data Mining, especially emphasizing Classification Techniques, Cluster Analysis, and related methodologies.