
Data Mining Detailed Study Notes

Unit 4 Classification Techniques and Cluster Analysis

Contents at a Glance

  • Part A: Classification
    • Decision Trees
    • Bayes
    • Naïve Bayes
    • Bayesian Belief Network (BBN)
    • Lazy Learners
    • Model Evaluation
    • Improving Accuracy
  • Part B: Cluster Analysis
    • Types of Data
    • k-Means
    • k-Medoids
    • DBSCAN
    • STING
    • Outlier Analysis
  • Part C: Trends & Frontiers
    • Web Mining
    • Temporal & Spatial Mining
    • Statistical Data Mining
    • Applications

Part A: Classification Techniques

1. General Approach to Classification

Classification is a supervised learning task, where the objective is to learn a model (classifier) to predict the class of unseen instances based on labeled training examples. The overall approach includes the following components:

  • Input: A training set D = \{(x_1, y_1), \ldots, (x_n, y_n)\}, where x_i \in \mathbb{R}^d is a feature vector and y_i \in \{c_1, \ldots, c_k\} is the corresponding class label.
  • Output: A function f : \mathbb{R}^d \rightarrow \{c_1, \ldots, c_k\} that minimizes classification error on unseen data.
1.1 Definition and Terminology

The classification task can be formalized into two main phases:

  • Phase 1 — Learning (Training): The model is constructed from the training data where the algorithm analyzes the data to induce the model.
  • Phase 2 — Classification (Testing): The model is applied to classify new, unseen test instances. The accuracy is evaluated based on a test set where the class labels are known but concealed from the model.
1.2 Model Accuracy

Accuracy can be calculated using the formula:
\text{Accuracy} = \frac{\text{Number of correctly classified test instances}}{\text{Total number of test instances}}

1.3 Types of Classifiers

There are several types of classifiers, including:

  • Decision Trees: Algorithms such as ID3, C4.5, and CART.
  • Probabilistic Classifiers: Using models like Naïve Bayes and Bayesian Belief Networks.
  • Lazy Learners: Including k-Nearest Neighbour and Case-based Reasoning.
  • Neural Networks: Such as Multi-layer Perceptron and Deep Learning.
  • Support Vector Machines (SVM): Including Linear SVM and Kernel SVM.
  • Ensemble Methods: Techniques like Bagging, Boosting, and Random Forests.

2. Classification by Decision Tree Induction

A decision tree is a tree structure that resembles a flowchart, where:

  • Each internal node represents a test on an attribute.
  • Each branch indicates the outcome of the test.
  • Each leaf node corresponds to a class label.
  • The topmost node is called the root node.
2.1 Algorithm: Basic Decision Tree Induction

The steps for inducing a decision tree are outlined in the algorithm below:

Algorithm 1: Generate Decision Tree(Training set D, Attribute list A)

  1. Create a node N.
  2. If all tuples in D belong to the same class C, then return N as a leaf node labeled C.
  3. If A is empty, return N as a leaf node labeled with majority class in D.
  4. Select the best attribute a^* based on the splitting criterion.
  5. Label N with a^*.
  6. For each known value v_i of a^*:
    • Extract the tuples with a^* = v_i to form the subset D_i.
    • If D_i is empty, attach a leaf labeled with the majority class of D.
    • Otherwise, attach the subtree generated by calling Generate Decision Tree(D_i, A \setminus \{a^*\}).
  7. Return N.
2.2 Attribute Selection Measures
2.2.1 Information Gain (ID3)

Entropy and Information Gain are defined as follows:

  • Entropy of a dataset D with m classes is defined as:
    H(D) = -\sum_{i=1}^{m} p_i \log_2 p_i
    where p_i = \frac{|C_i|}{|D|} is the proportion of class C_i in D.

  • Information Gain for an attribute A:
    \text{Gain}(A) = H(D) - \sum_{v \in \text{vals}(A)} \frac{|D_v|}{|D|}\, H(D_v)

The attribute with the highest information gain is chosen for splitting.

Example of Entropy Calculation

Given a dataset with 9 “Yes” and 5 “No” samples:
H(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \text{ bits}

The attribute “Outlook” splits the dataset as follows: Sunny(5), Overcast(4), Rain(5).

  • H(\text{Sunny}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971
  • H(\text{Overcast}) = 0 \text{ (all Yes)}
  • H(\text{Rain}) = 0.971
  • \text{Gain(Outlook)} = 0.940 - \left(\frac{5}{14}\cdot 0.971 + \frac{4}{14}\cdot 0 + \frac{5}{14}\cdot 0.971\right) = 0.246 \text{ bits}
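The calculation above can be reproduced with a short Python sketch (a minimal illustration; the two lists below simply encode the 9-Yes/5-No weather example from the text):

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_i p_i log2 p_i over the class proportions in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = H(D) - sum_v |D_v|/|D| H(D_v) for one categorical attribute,
    given as a list of attribute values parallel to the labels."""
    n = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

outlook = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5
play = (["Yes", "Yes", "No", "No", "No"]       # Sunny: 2 Yes, 3 No
        + ["Yes"] * 4                          # Overcast: all Yes
        + ["Yes", "Yes", "Yes", "No", "No"])   # Rain: 3 Yes, 2 No
print(round(entropy(play), 3))             # 0.94 bits
print(round(info_gain(outlook, play), 3))  # ≈ 0.247 (0.246 in the text, rounding)
```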
2.2.2 Gain Ratio (C4.5)

To alleviate the bias toward attributes with many values, C4.5 adjusts the information gain calculation as:

  • Split Information for attribute A:
    \text{SplitInfo}(A) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}
  • Gain Ratio:
    \text{GainRatio}(A) = \frac{\text{Gain}(A)}{\text{SplitInfo}(A)}
2.2.3 Gini Index (CART)

The Gini index for a dataset D is defined as:

  • \text{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2
    For a binary split on attribute A into D_1 and D_2:
  • \text{Gini}_A(D) = \frac{|D_1|}{|D|}\,\text{Gini}(D_1) + \frac{|D_2|}{|D|}\,\text{Gini}(D_2)
  • CART chooses the split maximizing the impurity reduction:
    \Delta\text{Gini}(A) = \text{Gini}(D) - \text{Gini}_A(D)
2.3 Tree Pruning

Pruning techniques are used to avoid overfitting, which occurs when the decision tree is overly complex for the training data. Two main types include:

  • Pre-pruning (Early Stopping): Halt tree growth early when a statistical test indicates no reliable improvement (e.g., a chi-squared test or a minimum gain threshold).
  • Post-pruning: Grow the full tree first, then prune upward from the leaves.

Methods of pruning include:

  • Reduced-error Pruning: Replace a subtree with a majority-class leaf if the accuracy does not decrease on validation.
  • Cost-complexity Pruning (CART): Minimize \text{Cost}(T) + \alpha \cdot |\text{leaves}(T)|, where \alpha is a complexity parameter.
  • Pessimistic Pruning (C4.5): Incorporates a continuity correction on training error.
2.4 Scalability and Enhancements

Several enhancements are included to improve the scalability of decision trees:

  • RainForest: Maintains AVC-sets (Attribute-Value, Class-label counts) so trees can be built on data that does not fit in main memory.
  • BOAT (Bootstrapped Optimistic Algorithm for Tree construction): Uses bootstrap samples to construct the tree with very few scans of the data.
  • Handling Missing Values: e.g., substitute the most probable value, or distribute the instance's weight across branches.
  • Continuous Attributes: Find optimal binary split points via entropy or Gini measures.

3. Bayes Classification Methods

3.1 Bayes’ Theorem

Bayes' Theorem relates conditional and marginal probabilities with the formula:
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}
Where:

  • P(H) is the prior probability of hypothesis H.
  • P(X \mid H) is the likelihood of observing X given H.
  • P(H \mid X) is the posterior probability of H given evidence X.
  • P(X) is the marginal probability (normalizing constant).

The classification rule assigns X to the class C_i that maximizes the posterior:
\hat{C} = \arg\max_{C_i} P(C_i \mid X) = \arg\max_{C_i} P(X \mid C_i)\, P(C_i)

3.2 Naïve Bayes Classifier
3.2.1 Naïve Bayes Assumption

Naïve Bayes classifiers assume that attributes are conditionally independent given the class:
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
This greatly simplifies calculations and facilitates processing high-dimensional data effectively.

3.2.2 Classification Formula

The classification can be computed as:
\hat{C} = \arg\max_{C_i} P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)

For Categorical Attributes:
P(x_k = v \mid C_i) = \frac{|\{t \in D \mid t.x_k = v \land t.C = C_i\}|}{|C_i|}

For Continuous Attributes (assuming a Gaussian distribution):
P(x_k \mid C_i) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left(-\frac{(x_k - \mu_{ik})^2}{2\sigma_{ik}^2}\right)

To resolve the zero-frequency problem, use Laplace smoothing:
P(x_k = v \mid C_i) = \frac{\text{count}(x_k = v, C_i) + \lambda}{|C_i| + \lambda \cdot |\text{vals}(x_k)|}
Setting \lambda = 1 gives basic Laplace smoothing.

Example of Naïve Bayes Classification

Given:
Classify: X = (\text{Outlook=Sunny}, \text{Temp=Cool}, \text{Humidity=High}, \text{Wind=Strong})

  • P(\text{Yes}) = \frac{9}{14}, \quad P(\text{No}) = \frac{5}{14}
  • P(\text{Sunny} \mid \text{Yes}) = \frac{2}{9}, \; P(\text{Cool} \mid \text{Yes}) = \frac{3}{9}, \; P(\text{High} \mid \text{Yes}) = \frac{3}{9}, \; P(\text{Strong} \mid \text{Yes}) = \frac{3}{9}
  • P(\text{Sunny} \mid \text{No}) = \frac{3}{5}, \; P(\text{Cool} \mid \text{No}) = \frac{1}{5}, \; P(\text{High} \mid \text{No}) = \frac{4}{5}, \; P(\text{Strong} \mid \text{No}) = \frac{3}{5}
  • Calculation:
    P(X \mid \text{Yes}) \times P(\text{Yes}) = \frac{2}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{9}{14} \approx 0.0053
    P(X \mid \text{No}) \times P(\text{No}) = \frac{3}{5}\cdot\frac{1}{5}\cdot\frac{4}{5}\cdot\frac{3}{5}\cdot\frac{5}{14} \approx 0.0206
  • Decision: Classify as No (since 0.0206 > 0.0053).
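A compact Python sketch of a categorical Naïve Bayes trainer with optional Laplace smoothing; the four-row dataset below is hypothetical, and the two product lines at the end simply verify the worked example's arithmetic:

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels, lam=0.0):
    """Estimate P(C_i) and P(x_k = v | C_i) from categorical rows.
    lam > 0 applies the Laplace smoothing of Section 3.2.2."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: k / n for c, k in class_counts.items()}
    counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    values = defaultdict(set)       # attribute index -> observed values
    for row, y in zip(rows, labels):
        for k, v in enumerate(row):
            counts[(k, y)][v] += 1
            values[k].add(v)
    def likelihood(k, v, c):
        return (counts[(k, c)][v] + lam) / (class_counts[c] + lam * len(values[k]))
    def classify(x):
        return max(priors, key=lambda c: priors[c] *
                   math.prod(likelihood(k, v, c) for k, v in enumerate(x)))
    return classify

# Toy data (hypothetical):
rows = [("Sunny", "Hot"), ("Sunny", "Cool"), ("Rain", "Cool"), ("Rain", "Hot")]
labels = ["No", "No", "Yes", "Yes"]
classify = train_nb(rows, labels, lam=1.0)
print(classify(("Rain", "Cool")))                 # -> "Yes"

# Direct check of the worked example above:
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)    # ≈ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)    # ≈ 0.0206
print("No" if p_no > p_yes else "Yes")            # -> "No"
```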
3.3 Advantages and Limitations of Naïve Bayes
  • Advantages:

    • Simple and fast to train.
    • Performs well with small datasets.
    • Naturally handles missing values.
    • No complex optimization required.
    • Effective for text classification tasks.
  • Limitations:

    • Assumes conditional independence among attributes, which can be violated in practice.
    • Cannot capture interactions between attributes naturally.
    • Sensitive to attribute correlation.
    • Zero frequency issues necessitate smoothing strategies.
    • Probability estimates may be unreliable.

4. Model Evaluation and Selection

4.1 Evaluation Metrics
  • Confusion Matrix: In a binary classifier with classes Positive (P) and Negative (N):

    |                 | Predicted Positive  | Predicted Negative  |
    |-----------------|---------------------|---------------------|
    | Actual Positive | True Positive (TP)  | False Negative (FN) |
    | Actual Negative | False Positive (FP) | True Negative (TN)  |
  • Metrics calculated from the confusion matrix:

    • Accuracy:
      \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
    • Precision (Positive Predictive Value):
      \text{Precision} = \frac{TP}{TP + FP}
    • Recall (Sensitivity):
      \text{Recall} = \frac{TP}{TP + FN}
    • Specificity:
      \text{Specificity} = \frac{TN}{TN + FP}
    • F1 Score:
      F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    • F-beta Score:
      F_\beta = \frac{(1+\beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}
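These metrics are mechanical to compute from the four counts; a small Python helper (the counts in the call are hypothetical) illustrating all of them at once:

```python
def binary_metrics(tp, fn, fp, tn, beta=1.0):
    """Compute the Section 4.1 metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                        # sensitivity / TPR
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_beta = ((1 + beta**2) * precision * recall
              / (beta**2 * precision + recall))    # beta = 1 gives F1
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f": f_beta}

print(binary_metrics(tp=40, fn=10, fp=5, tn=45))   # hypothetical counts
```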
4.2 Holdout and Cross-Validation Methods
  • Holdout Method: Partition data into training (usually 2/3) and test (1/3) sets.
  • k-Fold Cross-Validation: Split data into k disjoint subsets. In each iteration, use one subset for testing and the rest for training.
  • Leave-One-Out (LOO): A special case of k-fold where k = n (the number of instances), best suited for smaller datasets.
  • Stratified Cross-validation: Ensures each fold maintains the class distribution.
4.3 Bootstrap Method

With bootstrap sampling, the probability of an instance not being selected is:
P(\text{instance not selected}) = \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368
Thus, approximately 63.2% of original samples appear in the bootstrap sample, with the remaining 36.8% forming the test set.

The .632 bootstrap combines the two accuracy estimates:
Acc_{0.632} = 0.632 \cdot Acc_{\text{test}} + 0.368 \cdot Acc_{\text{train}}
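A tiny simulation (pure Python, illustrative only) confirming the ~63.2% figure:

```python
import random

def bootstrap_split(n, seed=0):
    """One bootstrap sample: n draws with replacement; the indices never
    drawn (~36.8% of them for large n) form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    test = sorted(set(range(n)) - set(train))
    return train, test

train, test = bootstrap_split(10_000)
print(len(set(train)) / 10_000)   # ≈ 0.632
print(len(test) / 10_000)         # ≈ 0.368
```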

4.4 ROC Curves and AUC
  • ROC Curve: Plots True Positive Rate (TPR) against False Positive Rate (FPR):
    \text{TPR} = \frac{TP}{TP + FN} \quad (y\text{-axis})
    \text{FPR} = \frac{FP}{FP + TN} \quad (x\text{-axis})
  • Area Under Curve (AUC): A single scalar summarizing the ROC curve; An AUC of 1 indicates perfect performance, while 0.5 represents random performance.
4.5 Comparing Classifiers: Statistical Tests
  • A paired t-test can be used to compare two classifiers over k-fold cross-validation:
    t = \frac{\bar{d}}{s / \sqrt{k}}, \quad \text{where} \quad d_i = e_{A,i} - e_{B,i}, \quad \bar{d} = \frac{1}{k}\sum_{i=1}^{k} d_i, \quad s^2 = \frac{1}{k-1}\sum_{i=1}^{k} (d_i - \bar{d})^2
    Compare |t| with the critical value t_{k-1,\,\alpha/2}.

5. Techniques to Improve Classification Accuracy

5.1 Ensemble Methods

Ensemble methods use multiple classifiers to build a superior classifier. The intuition is that the errors of individual classifiers can cancel out if they are independent.

5.1.1 Bagging (Bootstrap Aggregating)

Proposed by Breiman in 1996, this technique involves:

  1. Generate k bootstrap training sets via sampling with replacement.
  2. Train a base classifier for each sample.
  3. For classification tasks, utilize majority voting; for regression, compute the average.
    f_{\text{bag}}(x) = \text{majority}\{f_1(x), f_2(x), \ldots, f_k(x)\}
    This approach reduces variance without increasing bias and suits unstable base classifiers like decision trees.
5.1.2 Boosting (AdaBoost)
  1. Assign equal weights w_i = \frac{1}{n} to all instances.
  2. For iterations t = 1, \ldots, T:
    • Train classifier C_t on the weighted data.
    • Compute the weighted error: \epsilon_t = \sum_{i : C_t(x_i) \neq y_i} w_i
    • Calculate the classifier weight: \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)
    • Update weights (then normalize):
      w_i \leftarrow w_i \cdot e^{-\alpha_t y_i C_t(x_i)}
  3. Final Prediction (a sketch follows below):
    f(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t C_t(x)\right)
    This method reduces bias and variance but can be sensitive to noisy data and outliers.
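A self-contained sketch of AdaBoost with 1-D decision stumps (the data at the bottom is hypothetical; labels must be ±1):

```python
import math

def adaboost(xs, ys, T=10):
    """AdaBoost with 1-D threshold stumps h(x) = s * sign(x - theta);
    ys must be in {-1, +1}. Follows the update rules listed above."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []                      # list of (alpha, theta, s)
    for _ in range(T):
        best = None                    # (weighted error, theta, s, predictions)
        for theta in sorted(set(xs)):  # exhaustive stump search
            for s in (+1, -1):
                pred = [s if x > theta else -s for x in xs]
                err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, theta, s, pred)
        err, theta, s, pred = best
        if err == 0 or err >= 0.5:     # perfect or useless stump: stop early
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, theta, s))
        # Increase weights of misclassified instances, decrease the rest.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, pred)]
        z = sum(w)
        w = [wi / z for wi in w]       # renormalize to sum to 1
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x > theta else -s) for a, theta, s in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, -1, -1, -1, 1, 1, 1]       # not separable by any single stump
ens = adaboost(xs, ys, T=10)
print(sum(predict(ens, x) == y for x, y in zip(xs, ys)) / len(xs))
# Training accuracy rises as stumps are added, illustrating bias reduction.
```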
5.1.3 Random Forests

Random Forests combine bagging with random feature selection:

  1. Build T decision trees on bootstrap samples.
  2. At each split, randomly select m features (typically m = \sqrt{p}, where p is the total number of features).
  3. Choose the best split among the m features.
  4. Aggregate predictions using majority voting.
5.2 Class Imbalance Problem

This issue arises when class distribution is significantly skewed (e.g., 99% vs 1%). Strategies to cope include:

  • Oversampling: Duplicate instances of the minority class or apply SMOTE to create new instances.
  • Undersampling: Reduce instances of the majority class.
  • Cost-sensitive learning: Assign higher misclassification costs to the minority class.
5.3 SMOTE

The Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic instances by interpolating among minority class neighbors.

6. Advanced Classification: Bayesian Belief Networks

6.1 Bayesian Belief Network (BBN)

A BBN is a probabilistic graphical model representing a set of variables and their conditional independence relationships via a Directed Acyclic Graph (DAG). Each node X_i holds a Conditional Probability Table (CPT) specifying P(X_i \mid \text{Parents}(X_i)).

6.2 Structure and Joint Probability

Given a DAG with n nodes, the joint probability distribution can be articulated as:
P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i))
Example: For medical diagnosis, with variables: Smoking, Pollution, Lung Cancer, X-ray, and Dyspnoea.


  • CPT for Lung Cancer: given Smoking (S) and Pollution (P):

| S | P    | P(LC = T) | P(LC = F) |
|---|------|-----------|-----------|
| T | High | 0.05      | 0.95      |
| T | Low  | 0.02      | 0.98      |
| F | High | 0.03      | 0.97      |
| F | Low  | 0.001     | 0.999     |
6.3 Learning in BBNs
  1. Structure Learning: Finding the DAG structure
    a. Score-based Learning: Optimize a scoring function (e.g., BIC, MDL, BDe).
    b. Constraint-based Learning: Utilize conditional independence tests to create the graph.
  2. Parameter Learning: Estimate the CPTs utilizing MLE or Bayesian methods.
6.4 BIC Score (Bayesian Information Criterion)

The score is defined as:
\text{BIC}(G, D) = \log P(D \mid \hat{\theta}, G) - \frac{|G|}{2} \log n
where |G| is the number of free parameters of the network and n is the number of data points.

6.5 Inference in BBNs
  • Exact Inference: Techniques like variable elimination and junction tree algorithm.
  • Approximate Inference: Methods like Monte Carlo sampling (MCMC), and loopy belief propagation.
    Given evidence E = e, compute:
    P(X \mid E = e) = \frac{P(X, E = e)}{P(E = e)} = \frac{\sum_{Y \setminus \{X, E\}} P(Y)}{\sum_{Y \setminus E} P(Y)}
    (a ratio of sums: marginalize the joint over all variables except X and E in the numerator, and over all variables except E in the denominator).

7. Lazy Learners

Lazy Learning

Lazy learners (or instance-based learners) delay model generalization until a query arises. They retain all training data and categorize queries based on comparisons with stored instances, contrasting with eager learners like decision trees or Naïve Bayes that develop a global model during training.

7.1 k-Nearest Neighbour (k-NN) Classifier

Algorithm 2: k-Nearest Neighbour Classification

  1. Store all n training instances.
  2. Given a query instance x_q:
  3. Identify the k nearest neighbors of x_q using a distance metric.
  4. Assign the class by majority vote among the k neighbors.

Distance Metrics:

  • Euclidean:
    d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}
  • Manhattan:
    d(x, y) = \sum_{i=1}^{d} |x_i - y_i|
  • Minkowski:
    d(x, y) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}
  • Weighted k-NN: Weight neighbors by inverse squared distance:
    w_i = \frac{1}{d(x_q, x_i)^2}, \quad \hat{C} = \arg\max_c \sum_{i : C_i = c} w_i
Example of k-NN

Given training data with 3 class-A points and 3 class-B points, classify the query point x_q = (5, 3):

  • With k = 3: if 2 of the 3 nearest neighbors are class A and 1 is class B, classify as class A.
  • With k = 1: classify as the class of the single nearest neighbor.
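A minimal Python sketch of (unweighted) k-NN; the six training points are hypothetical, chosen so the vote around x_q = (5, 3) matches the description above:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (point, label); majority vote among the k points
    nearest to query under Euclidean distance."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((4, 3), "A"), ((5, 4), "A"), ((1, 1), "A"),
         ((6, 2), "B"), ((9, 9), "B"), ((8, 8), "B")]
print(knn_classify(train, (5, 3), k=3))   # 2 of 3 nearest are A -> "A"
print(knn_classify(train, (5, 3), k=1))   # the single nearest neighbour decides
```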
7.2 Case-Based Reasoning (CBR)

CBR classifies by retrieving the most similar past cases and adapting their solutions. The CBR cycle involves:

  1. Retrieve: Find similar past cases.
  2. Reuse: Adapt retrieved solutions to the current problem.
  3. Revise: Test and modify solutions as needed.
  4. Retain: Store new cases for future reference.
7.3 Advantages and Disadvantages of Lazy Learning
  • Advantages:

    • No upfront training time required.
    • Naturally accommodates multi-class scenarios.
    • Adapts to new training data effortlessly.
    • Can model complex decision boundaries.
  • Disadvantages:

    • Classification is time-consuming (search all instances).
    • High memory consumption.
    • Sensitive to irrelevant features.
    • Requires feature normalization.
    • Vulnerable to the curse of dimensionality.

Part B: Cluster Analysis

8. Introduction to Cluster Analysis

Clustering is an unsupervised learning task aimed at discovering groups (clusters) in unlabelled data. The goal is to group instances such that:

  • Instances within a cluster are similar to each other.
  • Instances in one cluster are dissimilar to those in other clusters.
    Given a dataset D = \{x_1, x_2, \ldots, x_n\}, clustering partitions D into k groups C_1, C_2, \ldots, C_k such that:
  • C_i \neq \emptyset for all i
  • \bigcup_{i=1}^{k} C_i = D (the clusters together cover the entire dataset)
  • C_i \cap C_j = \emptyset for all i \neq j (no overlap in hard clustering).
8.1 Applications of Clustering

Clustering can be applied in various fields:

  • Customer segmentation in marketing.
  • Document clustering and topic modeling.
  • Image segmentation in computer vision applications.
  • Anomaly detection in network security.
  • Gene expression analysis in bioinformatics.
  • City planning through grouping locations based on value.
9. Types of Data in Cluster Analysis
9.1 Data Types
| Data Type       | Description                            | Example                                |
|-----------------|----------------------------------------|----------------------------------------|
| Interval-scaled | Continuous real values                 | Height, temperature                    |
| Binary          | Two states (symmetric/asymmetric)      | Gender (M/F), disease presence (Y/N)   |
| Nominal         | Unordered discrete categories          | Colour, city, job                      |
| Ordinal         | Ordered discrete values                | Rank, quality (low/medium/high)        |
| Ratio-scaled    | Positive values with meaningful ratios | Age, salary, weight                    |
| Mixed types     | Combination of different types         | Customer records with various features |
9.2 Dissimilarity and Similarity Measures
9.2.1 For Interval-Scaled Attributes

Standardization brings attributes to a uniform scale:
z_f = \frac{x_{if} - m_f}{s_f}
where m_f is the mean and s_f is the standard deviation of attribute f.

  • Minkowski Distance:
    d(i, j) = \left(\sum_{f=1}^{p} |x_{if} - x_{jf}|^q\right)^{1/q}
    Specific values of q yield familiar metrics (q = 1: Manhattan; q = 2: Euclidean).
9.2.2 For Binary Attributes

A contingency table defines the occurrences as follows:

| Occurrence (i, j) | Count |
|-------------------|-------|
| Both 1            | a     |
| 1 and 0           | b     |
| 0 and 1           | c     |
| Both 0            | d     |

The distance measures become:

  • Simple Matching (symmetric binary attributes):
    d_{\text{simple matching}} = \frac{b + c}{a + b + c + d}
  • Jaccard Distance (asymmetric binary attributes):
    d_{\text{Jaccard}} = \frac{b + c}{a + b + c}
9.2.3 For Nominal Attributes

d(i, j) = \frac{p - m}{p}, where m is the number of matching attributes and p is the total number of attributes.

9.2.4 For Ordinal Attributes

Convert ranks to a numerical scale from 0 to 1 and apply interval-scale methods.

9.2.5 Mixed-Type Attributes

The generalized distance measure can be represented as:
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
where \delta_{ij}^{(f)} = 0 if attribute f is missing for i or j (and 1 otherwise), and d_{ij}^{(f)} is the attribute-appropriate distance for f.

10. Partitioning Methods

In partitioning methods, the goal is to split n objects into k non-overlapping subsets such that each subset includes at least one object and minimizes a criterion, generally the sum of squared distances to cluster centers.

10.1 k-Means Algorithm

Algorithm 3: k-Means Clustering

  1. Select k objects randomly as initial cluster centroids \mu_1, \ldots, \mu_k.
  2. Repeat the following steps:
    • For each object x_i: assign x_i to the cluster C_j with j = \arg\min_j d(x_i, \mu_j).
    • For each cluster C_j: update the centroid \mu_j = \frac{1}{|C_j|}\sum_{x \in C_j} x.
  3. Until the centroids no longer change (or a maximum number of iterations is reached).
  • Objective to minimize:
    J = \sum_{j=1}^{k} \sum_{x \in C_j} \|x - \mu_j\|^2
Example of k-Means

Data contains: {2, 4, 10, 12, 3, 20, 30, 11, 25}, with k=3.

  • Initial centroids: \mu_1 = 2, \mu_2 = 4, \mu_3 = 10
  • Iteration 1 assignment: C_1 = \{2, 3\}, C_2 = \{4\}, C_3 = \{10, 12, 11, 20, 25, 30\}
  • Updated centroids: \mu_1 = 2.5, \mu_2 = 4, \mu_3 = 18
  • Iterate until convergence.
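A minimal 1-D Lloyd's-algorithm sketch reproducing this example (ties go to the lower-indexed centroid):

```python
def kmeans_1d(data, centroids, max_iter=100):
    """Plain Lloyd's algorithm (Algorithm 3) on 1-D data."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for x in data:
            j = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[j].append(x)           # assign to the nearest centroid
        new = [sum(c) / len(c) if c else m  # recompute centroids as means
               for c, m in zip(clusters, centroids)]
        if new == centroids:                # converged: assignments stable
            break
        centroids = new
    return clusters, centroids

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, centroids = kmeans_1d(data, centroids=[2, 4, 10])
print(clusters)    # [[2, 4, 3], [10, 12, 11], [20, 30, 25]]
print(centroids)   # [3.0, 11.0, 25.0]
```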
Limitations of k-Means
  • Predefined k: Must specify the number of clusters in advance.
  • Sensitivity to Initial Centroid Selection: May lead to local optima.
  • Sensitivity to Outliers and Noise: Outliers can skew results.
  • Assumes Spherical, Equal-sized Clusters: Performance degrades if these conditions are not met.
  • Cost on Large Datasets: Complexity is O(nkt), where t is the number of iterations; repeated passes over very large datasets can be expensive.
Variants of k-Means
  • k-Means++: Uses smart initialization to reduce sensitivity to initial seeds.
  • Mini-Batch k-Means: Applies random batches to enhance scalability.
  • Fuzzy c-Means: Offers probabilistic or soft cluster assignments.
11 k-Medoids Clustering

Instead of using the mean (centroid), k-medoids (PAM — Partitioning Around Medoids) chooses an actual data object (the medoid) which minimizes dissimilarity within the cluster. This method is more robust against outliers compared to k-Means.

11.1 PAM Algorithm

Algorithm 4: PAM (Partitioning Around Medoids)

  1. Randomly select k objects as initial medoids m_1, \ldots, m_k.
  2. Repeat the following:
    • Assign each non-medoid object to the nearest medoid’s cluster.
    • For each medoid m_i and each non-medoid object o_h:
      • Compute the total swap cost:
        TC_{ih} = \sum_{j \notin \text{medoids}} \big( d(j, \text{nearest medoid after swapping } m_i \text{ and } o_h) - d(j, \text{nearest medoid before the swap}) \big)
    • If TC_{ih} < 0, perform the swap m_i \leftarrow o_h (it decreases the total cost).
  3. Continue until no changes occur.
11.2 CLARA (Clustering LARge Applications)

CLARA applies PAM to multiple random samples of a large dataset:

  1. Draw multiple random samples.
  2. Apply PAM to each to find medoids.
  3. Assign objects to the nearest medoid across all samples.
  4. Select the medoid set minimizing the overall cost. The complexity per sample is
    O(ks^2 + k(n - k)), where s is the sample size.
11.3 CLARANS (Clustering Large Applications based on RANdomized Search)

CLARANS combines sampling with graph-search:

  • Models the problem as a graph where nodes represent sets of k medoids.
  • Randomly explores neighbor swaps, stopping when no better neighbor is found within a fixed maximum number of attempts.
  • More efficient than CLARA when working with large datasets.
12 Density-Based Methods

Density-based clustering identifies clusters as dense regions surrounded by lower-density areas, effectively handling noise/outliers and discovering clusters of arbitrary shapes.

12.1 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Key Concepts: Given parameters ϵ\epsilon (neighborhood radius) and MinPts (minimum density):

  • Neighbor Definition:
    N_\epsilon(p) = \{ q \in D \mid d(p, q) \leq \epsilon \}
  • Core Point: p is a core point if |N_\epsilon(p)| \geq \text{MinPts}.
  • Border Point: Not a core point, but located within a core point's neighborhood.
  • Noise Point (Outlier): Neither a core point nor a border point.
  • Density-Reachable: q is directly density-reachable from p if q \in N_\epsilon(p) and p is a core point; density-reachability is the transitive closure of this relation.
  • Density-Connected: p and q are density-connected if both are density-reachable from some common point o.

Algorithm 5: DBSCAN

  1. Mark all points as unvisited.
  2. For every unvisited point p:
    • Mark p as visited.
    • Compute N \leftarrow N_\epsilon(p).
  3. If |N| < \text{MinPts}, mark p as noise.
  4. Else:
    • Create a new cluster C and add p to C.
    • Set seeds \leftarrow N \setminus \{p\}.
  5. For each point q \in seeds:
    • If q is unvisited, mark q as visited.
    • Compute N' \leftarrow N_\epsilon(q).
    • If |N'| \geq \text{MinPts}, add the points of N' to the seeds.
    • Add q to cluster C if it is not yet assigned to any cluster.

Complexity: O(n log n) with spatial indexing (e.g., R*-tree), O(n^2) without spatial indexing.
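A brute-force Python sketch of Algorithm 5 (no spatial index, so O(n²); the point set below is hypothetical):

```python
import math

def dbscan(points, eps, min_pts, dist=math.dist):
    """Returns {point index: cluster id}, with -1 meaning noise."""
    labels = {}
    def neighbours(i):
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]
    cid = 0
    for i in range(len(points)):
        if i in labels:
            continue
        n_i = neighbours(i)
        if len(n_i) < min_pts:
            labels[i] = -1             # noise (may be relabelled as border)
            continue
        labels[i] = cid                # i is core: start a new cluster
        queue = [j for j in n_i if j != i]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:    # previously noise -> border point
                labels[q] = cid
            if q in labels:            # already assigned; skip re-expansion
                continue
            labels[q] = cid
            n_q = neighbours(q)
            if len(n_q) >= min_pts:    # q is core: expand through it
                queue.extend(n_q)
        cid += 1
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (25, 25)]
print(dbscan(pts, eps=2.0, min_pts=3))
# Two dense groups get cluster ids 0 and 1; the isolated point is noise (-1).
```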

DBSCAN Illustration

In an illustration, core points (marked in blue) have neighborhoods exceeding MinPts while border points lie within a core’s ε-neighborhood. Noise points (marked in orange) are isolated.

Advantages of DBSCAN:

  • Discovers clusters of arbitrary shape and identifies noise and outliers automatically.
  • There's no need to define the number of clusters k.

Limitations:

  • DBSCAN is sensitive to the choice of \epsilon and MinPts.
  • It struggles with varying density clusters.
  • High-dimensional data can degrade performance due to the curse of dimensionality.
12.2 OPTICS (Ordering Points to Identify Clustering Structure)

OPTICS expands upon DBSCAN by addressing dense variations and offers:

  • Core Distance: The minimum \epsilon that makes point p a core point.
  • Reachability Distance: \max(\text{core-dist}(o), d(o, p)), i.e., the larger of o's core distance and the distance from o to p.
  • Output: An ordering of the points from which clusters at different density levels can be extracted.
13 Grid-Based Methods

These methods quantize the data space into a finite number of cells enabling clustering operations dependent on the number of cells rather than the number of data objects.

13.1 STING (STatistical Information Grid)

STING is a multiresolution grid-based clustering method:

  • The structure divides space into rectangular cells at various resolution levels.
  • Each upper-level cell partitions smaller cells at subsequent levels. Each cell stores statistical parameters, e.g., count, mean, and standard deviation.

Query processing steps include:

  1. Start at a high-level (coarse) layer.
  2. Compute confidence intervals for relevance using stored statistics.
  3. Discard irrelevant cells, recursively process relevant cells.
  4. Return relevant cells as cluster candidates.

Complexity: O(K) where K represents the number of grid cells, independent of n.

Advantages:

  • Query-independent preprocessing is beneficial for execution speed (O(n) to build, O(K) to query).
  • Supports incremental updates and multiresolution analyses.

Limitations:

  • Cluster boundaries are axis-parallel (horizontal or vertical only; no diagonal boundaries).
  • Accuracy depends on the grid granularity, and the method assumes the data within each cell follows a known statistical distribution.
13.2 WaveCluster

This approach employs wavelet transforms within grid-based clustering:

  1. Summarize data in a multi-dimensional grid framework.
  2. Execute wavelet transformations to locate high-density regions.
  3. Identify connected components within the transformed grid.
    This method detects arbitrarily shaped clusters and handles noise well.

Complexity: O(n).

13.3 CLIQUE (CLustering In QUEst)

CLIQUE integrates grid-based and subspace clustering:

  • Automatically identifies high-density clusters within subspaces.
  • Utilizes Apriori-like monotonicity assumptions to evaluate cluster density.
14 Outlier Analysis

An outlier is an observation that significantly deviates from other observations,
indicating possible measurement errors or insightful anomalies. Outliers can signify:

  • Data entry mistakes.
  • Important rare events (e.g., fraud, network intrusions).
  • Patterns worth further investigation.
14.1 Statistical Approaches

Outlier detection methods include:

  • Distribution-based: Assume the data follows a specific distribution; objects with low probability under it are flagged.
  • Z-score Method: Compute z_i = \frac{x_i - \bar{x}}{s} and treat observations with |z_i| > 3 as potential outliers.
  • IQR Method (Box Plot): Define fences based on quartiles to flag outliers:
    • Lower fence: Q_1 - 1.5 \times \text{IQR}
    • Upper fence: Q_3 + 1.5 \times \text{IQR}, where \text{IQR} = Q_3 - Q_1.
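A small Python sketch of both rules (the data list is hypothetical); note how the extreme value inflates s enough that the z-score rule can miss it while the IQR fences catch it:

```python
import statistics

def zscore_outliers(xs, threshold=3.0):
    mean, s = statistics.mean(xs), statistics.stdev(xs)
    return [x for x in xs if abs((x - mean) / s) > threshold]

def iqr_outliers(xs):
    q1, _, q3 = statistics.quantiles(xs, n=4)   # Q1, median, Q3
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [x for x in xs if x < lo or x > hi]

data = [12, 13, 12, 14, 13, 12, 15, 13, 98]
print(zscore_outliers(data))   # [] -- 98 inflates s, masking itself
print(iqr_outliers(data))      # [98] -- quartile fences are more robust
```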
14.2 Distance-Based Outlier Detection

DB(p, D)-Outlier: An object o is a DB(p, D)-outlier if at least a fraction p of all objects lie at distance greater than D from o; equivalently,
\frac{|\{o' \in D \mid d(o, o') \leq D\}|}{|D|} < 1 - p
Efficient detection uses nested loops or spatial indices (e.g., R-tree) to prune objects that are clearly not outliers; complexity is O(n^2) naively, or roughly O(n \log n) with indexing.

14.3 Density-Based Local Outlier Factor (LOF)

The LOF score compares the local density of an object with the local densities of its neighbors:

  • Reachability Distance:
    \text{reach-dist}_k(p, o) = \max(\text{k-dist}(o), d(p, o))
  • Local Reachability Density:
    \text{lrd}_k(p) = \left( \frac{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}{|N_k(p)|} \right)^{-1}
  • LOF:
    \text{LOF}_k(p) = \frac{\sum_{o \in N_k(p)} \text{lrd}_k(o)}{|N_k(p)| \cdot \text{lrd}_k(p)}
  • Interpretation: \text{LOF}_k(p) \approx 1 means p sits inside a cluster (density similar to its neighbors); \text{LOF}_k(p) \gg 1 marks an outlier (density much lower than its neighbors).
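A brute-force LOF sketch following these three definitions (assumes distinct points; the five points below are hypothetical):

```python
import math

def lof(points, k=2, dist=math.dist):
    """Brute-force O(n^2) LOF; returns one score per point."""
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    def knn(i):
        order = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])
        kth = d[i][order[k - 1]]                 # k-dist(i)
        return [j for j in order if d[i][j] <= kth], kth
    nk, kdist = zip(*(knn(i) for i in range(n)))
    def lrd(i):
        reach = [max(kdist[o], d[i][o]) for o in nk[i]]  # reach-dist_k(i, o)
        return len(nk[i]) / sum(reach)
    return [sum(lrd(o) for o in nk[i]) / (len(nk[i]) * lrd(i))
            for i in range(n)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print([round(v, 2) for v in lof(pts, k=2)])
# The four corner points score ~1; the distant point scores >> 1 (outlier).
```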
14.4 Clustering-Based Outlier Detection
  • Objects in small or sparse clusters are labeled as outliers.
  • Objects failing to belong to any cluster (e.g., noise in DBSCAN) are classified as outliers.
  • The distance of an object to its nearest cluster centroid can serve directly as an outlier score.

Part C: Data Mining Trends & Frontiers

15 Overview of Web Mining

Web mining encompasses applying data mining techniques to unveil meaningful and potentially beneficial patterns from the World Wide Web, focusing on web content, structure, and usage data.

15.1 Types of Web Mining
  • Web Content Mining: Utilizes content from web pages (text, images, audio, and video) to extract information.
  • Web Structure Mining: Analyzes the hyperlink structure of the web to identify authority pages, hubs, and communities.
  • Web Usage Mining: Examines web server logs and user behavior to reveal usage patterns.
15.1.1 Web Content Mining
  • Tasks include:
    • Web page classification (topic categorization).
    • Information extraction (structuring unstructured data).
    • Text mining on web documents (natural language processing, sentiment analysis).
    • Multimedia content mining.
  • Common techniques:
    • TF-IDF for term weighting.
    • Topic modeling (LDA — Latent Dirichlet Allocation).
    • Named Entity Recognition (NER).
    • Sentiment analysis.
  • TF-IDF Formula:
    \text{TF-IDF}(t, d) = tf(t, d) \cdot \log\left( \frac{N}{df(t)} \right)
    where:
  • tf(t, d) = frequency of term t in document d,
  • N = total number of documents,
  • df(t) = number of documents containing term t.
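A minimal TF-IDF sketch over pre-tokenized documents (the three toy documents are hypothetical):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["data", "mining", "rocks"],
        ["data", "clustering"],
        ["web", "mining", "data"]]
print(tf_idf(docs)[0])   # "data" scores 0: it appears in every document
```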
15.1.2 Web Structure Mining
  • Mining the hyperlink connections of the web to uncover authority pages and hubs.
  • PageRank (Google):
    PR(p) = \frac{1 - d}{N} + d \sum_{q \in B_p} \frac{PR(q)}{L(q)}
    Where:
  • d \approx 0.85 is the damping factor,
  • B_p is the set of pages linking to p,
  • L(q) is the number of out-links from q.
  • HITS Algorithm (Hyperlink-Induced Topic Search):
    • Each page has two scores for hubs (h) and authorities (a).
    • Calculate authority score by summing hub scores of referring pages:
      a(p) = \sum_{q \rightarrow p} h(q)
    • Calculate hub score by summing authority scores of links:
      h(p) = \sum_{p \rightarrow q} a(q)
    • Iterate this procedure until convergence (through the power method applied to the adjacency matrix).
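A power-iteration sketch of the PageRank formula above (the three-page link graph is hypothetical; dangling pages are simply skipped in this simplification):

```python
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}. Plain power iteration of
    PR(p) = (1 - d)/N + d * sum over q in B_p of PR(q)/L(q)."""
    N = len(links)
    pr = {p: 1.0 / N for p in links}
    for _ in range(iters):
        new = {p: (1 - d) / N for p in links}
        for q, outs in links.items():
            for p in outs:                 # q passes rank to its out-links
                new[p] += d * pr[q] / len(outs)
        pr = new
    return pr

# A <-> B, both link to C, C links back to A:
print(pagerank({"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}))
```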
15.1.3 Web Usage Mining
  • Analyzing logs and user behaviors to uncover patterns in internet usage, drawing from web server logs, browser client logs, and proxy server logs.
  • Tasks include sessionization (grouping log entries into user sessions), association rule mining on page sets, sequential pattern mining on page sequences (e.g., using AprioriAll and SPADE), and collaborative filtering for recommendations.

16 Temporal and Spatial Mining

16.1 Temporal Data Mining

Temporal data refers to information indexed by time, encompassing time-series, event sequences, and periodic patterns.

16.1.1 Time Series Analysis

A time series is a sequence of observations x_1, x_2, \ldots, x_T recorded at regular time intervals.

  • Decomposition:
    x_t = T_t + S_t + C_t + I_t
    Where:
  • T_t: Trend component (long-term direction).
  • S_t: Seasonal component (periodic patterns).
  • C_t: Cyclical component (irregular, multi-year patterns).
  • I_t: Irregular component (noise/residual).
16.1.2 Sequential Pattern Mining

Goal: Identify all sequences with support ≥ min support from the database of sequences using approaches like GSP (Generalized Sequential Patterns), SPADE (vertical format), and PrefixSpan (pattern-growth approach).

16.1.3 Temporal Association Rules

Extend traditional association rules with temporal constraints, expressed as X \Rightarrow_{\Delta t} Y (event X is followed by event Y within time interval \Delta t). Additionally, calendar-based patterns reveal rules tied to temporal intervals (daily, weekly, monthly).

16.2 Spatial Data Mining

Spatial data manifests through geometric properties such as location, often found in geographic information systems (GIS), remote sensing, medical imaging, and satellite data.

16.2.1 Spatial Data Types
  • Point data: GPS coordinates of events.
  • Line data: Features like roads, rivers, and fault lines.
  • Polygon data: Regional designations for land use.
  • Raster data: Satellite images organized as grids.
16.2.2 Spatial Mining Tasks

Common spatial mining tasks include:

  • Spatial Classification: Classifying spatial objects (e.g., land use based on satellite imagery).
  • Spatial Clustering: Grouping geographically proximate objects (e.g., utilizing DBSCAN for spatial data).
  • Spatial Association Rules: Identify rules involving spatial predicates (e.g., identifying high-value properties close to the ocean).
  • Spatial Outlier Detection: Detecting anomalies based on spatial context (e.g., unusually low attribute values relative to neighboring objects).
  • Spatial Trend Analysis: Discover directional trends in spatial configurations.
16.2.3 Spatial Autocorrelation

The Moran’s I statistic assesses spatial autocorrelation:
I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}

  • Where:
    • I > 0: positive spatial autocorrelation (similar values cluster together).
    • I < 0: negative spatial autocorrelation.
    • I \approx 0: a spatially random pattern.
17 Statistical Data Mining

Statistical data mining bridges classical statistics and modern machine learning methodologies to discern patterns, select models, and make inferences.

17.1 Statistical Learning Foundations
17.1.1 Regression Analysis
  • Multiple Linear Regression:
    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon, \quad \text{with least-squares solution } \hat{\beta} = (X^T X)^{-1} X^T y
  • Logistic Regression (for classification):
    P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^T x)}} = \sigma(\beta_0 + \beta^T x)
17.1.2 Principal Component Analysis (PCA)

A dimensionality-reduction technique that projects the data onto the directions of maximal variance:
Z = XW
Where:

  • W contains the top eigenvectors of X^T X (the covariance matrix of the centred data).
  • Proportion of Variance Explained (PVE) by the first k components:
    \text{PVE}_k = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{p} \lambda_j}
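A NumPy sketch of PCA via the eigendecomposition above (the synthetic data is generated purely for illustration):

```python
import numpy as np

def pca(X, k):
    """Project X onto the top-k eigenvectors of its covariance matrix;
    also return the proportion of variance explained (PVE_k)."""
    Xc = X - X.mean(axis=0)                  # centre the data first
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues descending
    W = eigvecs[:, order[:k]]
    pve = eigvals[order[:k]].sum() / eigvals.sum()
    return Xc @ W, pve                       # scores Z = XW, and PVE_k

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Z, pve = pca(X, k=1)
print(round(pve, 2))   # the first component captures most of the variance
```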
17.1.3 Hypothesis Testing in Data Mining

Common statistical tests include:

  • Chi-Squared Test: For independence of categorical attributes.
  • t-Test: Used in comparing means of two groups.
  • ANOVA (Analysis of Variance): Compares means across multiple groups.
  • Kolmogorov-Smirnov Test: Tests whether a sample follows a reference distribution (goodness of fit).
  • Consider corrections for multiple comparisons, namely the Bonferroni correction and controlling the False Discovery Rate (FDR).
17.2 Minimum Description Length (MDL)

The MDL principle asserts that the best model is one that achieves optimal data compression. Formally represented as:
\text{MDL}(H, D) = L(H) + L(D \mid H)

  • Where L(H) refers to bits necessary to describe model H and L(D|H) accounts for bits describing data given model H.
17.3 EM Algorithm (Expectation-Maximization)

This algorithm is utilized for learning probabilistic models incorporating hidden variables:

  • E-step (Expectation):
    Q(\theta \mid \theta^{(t)}) = E_{Z \mid X, \theta^{(t)}}\left[\log P(X, Z \mid \theta)\right]
  • M-step (Maximization):
    \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})
  • Applications are found in:
    • Gaussian Mixture Models (GMM) for soft clustering.
    • Learning Hidden Markov Models (HMM) for sequential data.
    • Filling in missing data.
18 Data Mining Applications
18.1 Business and Finance
| Application            | Description                                                                              |
|------------------------|------------------------------------------------------------------------------------------|
| Customer segmentation  | Grouping customers for targeted marketing; RFM (Recency, Frequency, Monetary) analysis. |
| Fraud detection        | Classifying transactions as fraudulent using anomaly detection and supervised learning. |
| Credit scoring         | Assessing creditworthiness through regression, SVM, and decision trees.                 |
| Market basket analysis | Frequent pattern mining (e.g., Apriori, FP-Growth) for strategies like cross-selling.   |
| Stock prediction       | Time series analysis (e.g., LSTM, ARIMA) for price forecasting.                         |
| Churn prediction       | Identifying customers at risk of leaving via classification.                            |
18.2 Healthcare and Biomedical
| Application              | Description                                                         |
|--------------------------|---------------------------------------------------------------------|
| Disease prediction       | Classification for early diagnosis (e.g., diabetes, cancer).        |
| Gene expression analysis | Clustering genes/samples by expression patterns in microarray data. |
| Drug discovery           | Mining associations between compounds and biological activities.    |
| Medical image analysis   | CNN-based classification and segmentation.                          |
| EHR analysis             | Temporal pattern mining to discern treatment pathways.              |
18.3 Telecommunications
| Application                 | Description                                    |
|-----------------------------|------------------------------------------------|
| Network intrusion detection | Anomaly detection on network traffic.          |
| Customer churn analysis     | Assessing and improving customer retention.    |
| CDR analysis                | Classification of call data records.           |
| Network fault diagnosis     | Classification techniques for fault diagnosis. |
18.4 Retail and E-Commerce
| Application               | Description                                               |
|---------------------------|-----------------------------------------------------------|
| Recommendation systems    | Collaborative and content-based filtering.                |
| Inventory management      | Demand forecasting with time series methods.              |
| Price optimization        | Regression and simulation for price setting.              |
| Supply chain optimization | Clustering combined with outlier detection in logistics.  |
18.5 Social Network Analysis
| Application         | Description                                                 |
|---------------------|-------------------------------------------------------------|
| Community detection | Graph clustering, e.g., via modularity optimization.        |
| Influence analysis  | Identifying key influencers using centrality measures.      |
| Link prediction     | Predicting future connections from graph-derived features.  |
| Sentiment analysis  | Mining opinions from social media text.                     |
18.6 Scientific Data Mining
| Application      | Description                                                                              |
|------------------|------------------------------------------------------------------------------------------|
| Astronomy        | Classifying celestial bodies and detecting phenomena such as supernovae in sky surveys. |
| Climate science  | Identifying climate patterns and anomalies; predicting extreme events.                  |
| Particle physics | Event classification at CERN.                                                           |
| Seismology       | Earthquake prediction from sensor data.                                                 |
19 Current Trends and Research Frontiers
19.1 Big Data and Scalable Mining
  • MapReduce/Hadoop: Deploying data mining on commodity hardware to scale operations for larger datasets.
  • Apache Spark: In-memory distributed computing benefiting iterative machine learning algorithms.
  • Stream Mining: Mining real-time data streams (using techniques like VFDT and SAMOA).
  • Online Learning: Incrementally updating models with new data as it becomes available.
Hoeffding Bound (for stream mining)

\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}
With probability 1 - \delta, the true mean differs from the sample mean by at most \epsilon, where R is the range of the variable and n is the number of observations.

19.2 Deep Learning and Data Mining
  • CNNs: Used for image and video mining along with feature extraction.
  • RNNs/LSTMs: Directed toward temporal and sequential data mining.
  • Graph Neural Networks (GNNs): Focused on mining data structured as graphs.
  • Transformers: Utilized in natural language processing tasks, including BERT and GPT for text mining.
  • Autoencoders: Applied in anomaly detection and dimensionality reduction processes.
19.3 Privacy-Preserving Data Mining
  • k-Anonymity: Each record is indistinguishable from at least k - 1 other records.
  • Differential Privacy: Adds calibrated noise to outputs; a mechanism M is \epsilon-differentially private if P[M(D) \in S] \leq e^{\epsilon}\, P[M(D') \in S] for all neighboring datasets D, D' and output sets S.
  • Federated Learning: Trains models across devices without requiring raw data sharing.
19.4 Explainable AI (XAI) and Interpretable Mining
  • LIME: Provides explanations for individual predictions using local linear approximations.
  • SHAP: Attributes each prediction to input features using Shapley values from cooperative game theory.
  • Counterfactual explanations: Identify the minimal input changes that would alter the model's decision.
20 Summary: Key Formulas Quick Reference

The formulas are summarized as follows:

Classification

H(D) = -\sum_i p_i \log_2 p_i; \quad
\text{Gain}(A) = H(D) - \sum_v \frac{|D_v|}{|D|}\, H(D_v); \quad
\text{GainRatio}(A) = \frac{\text{Gain}(A)}{\text{SplitInfo}(A)}; \quad
\text{Gini}(D) = 1 - \sum_i p_i^2; \quad
P(C_i \mid X) \propto P(C_i) \prod_k P(x_k \mid C_i) \; \text{(Naïve Bayes)}

Clustering

J_{k\text{-means}} = \sum_{j=1}^{k} \sum_{x \in C_j} \|x - \mu_j\|^2; \quad
\text{reach-dist}_k(p, o) = \max(\text{k-dist}(o), d(p, o)); \quad
\text{LOF}_k(p) = \frac{\sum_{o \in N_k(p)} \text{lrd}_k(o)}{|N_k(p)| \cdot \text{lrd}_k(p)}

Web Mining

\text{TF-IDF}(t, d) = tf(t, d) \cdot \log\left( \frac{N}{df(t)} \right); \quad
PR(p) = \frac{1 - d}{N} + d \sum_{q \in B_p} \frac{PR(q)}{L(q)}

Statistics

\hat{\beta} = (X^T X)^{-1} X^T y; \quad
\sigma(z) = \frac{1}{1 + e^{-z}}; \quad
\text{MDL}(H, D) = L(H) + L(D \mid H)

End of Notes

These structured notes serve as a comprehensive reference for advanced concepts in Data Mining, especially emphasizing Classification Techniques, Cluster Analysis, and related methodologies.