Data Mining 1: Preliminaries

Introduction to Data

  • Context: Ludwig-Maximilians-Universität München, Chair for Database Systems and Data Mining, Prof. Dr. Thomas Seidl, Data Mining Algorithms 1 (Knowledge Discovery and Data Mining 1), Winter Semester 2025/26.

Preliminaries: Data

Definition of Data

  • Data is a representation of real or artificial objects, situations, or processes.

  • Measured by physical sensors, examples include:

    • Temperature

    • Humidity

    • Car traffic

    • Speed

    • Color

  • Recorded from digital systems, such as:

    • Bank transfers

    • Web browsing activities

  • Generated by simulations, such as:

    • Weather forecasts

    • Digital mockups

  • Stored and provided by computers (e.g., local disk or remote server).

Representation of Data

  • Data Types:

    • Numerical data types: Represent quantitative values and can include:

    • Integer

    • Float

    • Double

    • Categorical data types: Represent qualitative values and can include:

    • Strings (e.g., names, categories)

  • Similarity models: Used to facilitate pattern mining from data.

  • Data reduction techniques: Aim to enhance efficiency in processing data.

Presentation of Data

  • Visualization techniques: Important for effective interpretation and understanding of data.

  • Privacy aspects: Considerations in presenting sensitive data.

Agenda Overview

  1. Introduction

  2. Preliminaries: Data

    • 2.1 Data Representation

    • 2.2 Visualization

    • 2.3 Data Reduction

  3. Supervised Learning

  4. Unsupervised Learning

  5. Process Mining

  6. Further Topics

Data Types – Algebraic View

Ingredients of Data Types

  • Identifiers: Data types have specific identifiers (e.g., int, long, float).

  • Domains (D): Each type has a set of possible values:

    • For example: {1, 2, ...}, {true, false}, or string sets like {'hello', 'hi', 'howdy'}.

  • Operations: Types dictate the operations applicable to them, such as:

    • Arithmetic: +, -, mod, div

    • Logical: and, or, not

    • Concatenation for strings.

  • Equality tests: Defined by:

    • $=,
      eq$: A function $D imes D
      ightarrow ext{true, false}$.

  • Distance functions: Defined as:

    • $d : D imes D
      ightarrow ext{R}^+_0$.

    • Examples include:

      • Euclidean distance

      • Maximum distance

      • Difference value.

Data Handling is Expensive – Implementation View

Storage Requirements

  • Single values occupy specific bytes based on their types:

    • 1 Byte = 8 Bits

    • 4 Bytes = 32 Bits

    • 8 Bytes = 64 Bits

Data Sizes in Big Data Context

  • Kilobytes $(10^3)$

  • Megabytes $(10^6)$

  • Gigabytes $(10^9)$

  • Terabytes $(10^{12})$

  • Petabytes $(10^{15})$

  • Exabytes $(10^{18})$

  • Zettabytes $(10^{21})$.

Computational Complexity

  • Operations require significant computation time:

    • Calculating the Euclidean distance of two objects from $ ext{R}^d$ requires $O(d)$ time.

    • Mean of $n$ objects from $ ext{R}^d$ needs $O(nd)$ time.

    • Covariance matrix of $n$ objects $x_i ext{ in } ext{R}^d$ needs $O(nd^2)$ time.

Overview of Data Types

  • Simple Data Types:

    • Numerical data (quantitative) and categorical data (qualitative).

  • Composed Data Types: Include structured forms like vectors, sequences, sets, and relations.

  • Complex Data Types: Include:

    • Multimedia data (images, videos, audio, text).

    • Spatial data (geographic shapes, trajectories).

    • Structures like graphs, networks, trees, etc.

Numeric Data

Simple Numeric Data

  • Comprises various numeric types:

    • Natural numbers, integers, rational numbers, real numbers.

    • Domain examples:

    • Age

    • Income

    • Shoe size

    • (Dis-)similarity metric: Difference value defined as:
      $d(x, y) = |x - y|$.

  • Notable example: 3 is closer in similarity to 30 than 30 is to 3,000.

Composed Numeric Data

  • Vectors defined as:

    • $x ext{ in } ext{R}^d = R imes R imes … imes R$ (d times).

    • Domain examples:

    • Geolocation $( ext{lat}, ext{lon}) ext{ in } ext{R}^2$

    • Feature vector $x = (x1, x2, ext{…}, x_d) ext{ in } ext{R}^d$.

  • Comparison metrics:

    • Euclidean distance

    • Manhattan distance

    • Earth Movers’ distance.

Comparisons of Vectors by Lp-Distance Functions

  • Given two vectors $o, q ext{ in } ext{R}^n$:

    • Distances based on $p$ norms:
      dp(o, q) = igg( extstyle orall i = 1 ext{ to } n igg) ext{ } |oi - q_i|^p.

    • Specific formulations:

    • Euclidean distance:
      d2(o, q) = igg( extstyle orall i = 1 ext{ to } n igg) (oi - q_i)^2.

    • Manhattan distance (city blocks):
      d1(o, q) = igg( extstyle orall i = 1 ext{ to } n igg) |oi - q_i|.

    • Maximum distance: As $p o ext{∞}$,
      d{ ext{max}}(o, q) = ext{max} igg( |oi - q_i| igg), ext{ for } i = 1..n.

    • Weighted p-distances:
      d{p,w}(o, q) = igg( extstyle orall i = 1 ext{ to } n igg) wi |oi - qi|^p.

  • Visualization example: Draw circles around center $c$:
    ext{Circle} = ig ext{ }{p ext{ in } ext{R}^n | d(p, c) = ext{ε}} ig ext{ }.

Generalization: Metric Data

Metric Space Definition

  • A metric space $(O, d)$ consists of:

    • An object set $O$.

    • A metric distance function $d : O imes O
      ightarrow ext{R}^+_0$ which fulfills:

    • Symmetry: $ orall p, q ext{ in } O : d(p,q) = d(q,p)$.

    • Identity of Indiscernibles: $ orall p, q ext{ in } O : d(p,q) = 0 ext{ if and only if } p = q$.

    • Triangle Inequality: $ orall p, q, o ext{ in } O : d(p,q) ext{ ≤ } d(p,o) + d(o,q)$.

  • Examples: Points in 2D space or $ ext{R}^n$ with Euclidean distance.

  • Counterexamples: Consider certain real-world scenarios:

    • Non-symmetric distances: For example, one-way roads between $p$ and $q$.

    • Non-definition: Equivalent objects may not always be identical.

    • Triangle measures: Some detours may be preferred over straight lines.

Categorical Data

Overview

  • Represents qualitative symbols, essentially just identifiers.

  • Examples include:

    • Subjects: $ ext{subjects} = ext{physics, biology, math, music, literature}$.

    • Occupations: $ ext{occupation} = ext{butcher, hairdresser, physicist, physician}$.

Similarity Measures

  • Defined to compare values:

    • Example: Trivial metric,
      d(p, q) = egin{cases}0 & ext{if } p = q \ 1 & ext{otherwise} ext{.}
      ext{ } \ ext{ } \ ext{(d(p,q) : O} imes O
      ightarrow ext{R}_{0}^{+}).

Explicit Similarity Matrix


  • An O × O mapping to similarity values between distinct items:

    Car

    Bike

    Trolley


    Car

    1

    0.3

    0.1


    Bike

    0.3

    1

    0.2


    Trolley

    0.1

    0.2

    1

    Hierarchical Path Length Example

    • Represents relationships and associations between subjects across various categories:

      • Biology $
        ightarrow$ Physics $
        ightarrow$ Science

      • Math $
        ightarrow$ Music $
        ightarrow$ Arts $
        ightarrow$ Literature

      • Engineering $
        ightarrow$ Civil Engineering $
        ightarrow$ Mechanical Engineering

    Simple Data Types: Ordinal Characteristics

    Order Definition

    • There exists a total order $ ext{≤}$ on the set of possible data values $O$.

    • Properties of order include:

      • Transitivity: If $p ext{ ≤ } q$ and $q ext{ ≤ } o$, then $p ext{ ≤ } o$.

      • Antisymmetry: If $p ext{ ≤ } q$ and $q ext{ ≤ } p$, then $p = q$.

      • Totality: For any $p, q ext{ in } O$, either $p ext{ ≤ } q$ or $q ext{ ≤ } p$.

      • Examples:

    • Lexicographic ordering of words, sizes, and numerical sequences.

    Composed Data Types: Sequences and Vectors

    Definition

    • Given a domain $D$, a sequence $s$ of length $n$ is a mapping $I_n ightarrow D$, where:

      • $I_n = ext{Index set} = ext{{1, 2, …, n}}$ into $D$.

    • The sequence concatenates $n$ values from $D$ where order is relevant.

    Examples

    • Distance metrics,
      dp(o, q) = igg( extstyle orall i = 1 ext{ to } n igg) ext{ } |oi - q_i|^p.

    Comparison of Composed Data Types: Sets

    Characteristics

    • Represent an unordered collection of values.

      • Examples:

    • Skills = $ ext{Skills} = ext{Java, C, Python}$

    • Hobbies = $ ext{Hobbies} = ext{skiing, diving, hiking}$.

    Comparison and Operations

    • Symmetric Set Difference:
      R riangle S = (R - S) igcup (S - R) = (R igcup S) - (R igcap S).

    • Jaccard Distance:
      d(R, S) = rac{|R riangle S|}{|R igcup S|}.

    Bitvector Representation

    Definition

    • Using an ordered base set $B = (b1, …, bn)$, a set $S$ creates a binary vector $r ext{ in } ext{ {0, 1}}^n$ such that:
      ri = egin{cases} 1 & ext{if } bi ext{ in } S \ 0 & ext{otherwise} ext{.} \ ext{Hamming distance} ext{ : Counts differing entries (sum of different entries).}

    • Example with Hamming Distance:

      • Base Set $B = ( ext{Math, Physics, Chemistry, Biology, Music, Arts, English})$.

      • Sets $S ext{ and } R$ yielding:

      • $S = ext{Math, Music, English} = (1, 0, 0, 0, 1, 0, 1)$,

      • $R = ext{Math, Physics, Arts, English} = (1, 1, 0, 0, 0, 1, 1)$,

      • Hamming Distance Calculation:
        ext{Hamming}(R,S) = 3 ext{ as three entries differ.}

    Complex Data Types and Similarity Models

    Components of Complex Data - Exemplar

    Types include:
    • Structures: Graphs, networks, trees.

    • Geometry: Shapes, contours, trajectories.

    • Multimedia: Images, audio, text.

    Approaches to Similarity Models

    • Direct Measures: Highly dependent on data type.

    • Feature Extraction / Engineering: Craft explicit vector space embeddings using hand-crafted features.

    • Kernel Trick: Implicit vector space embedding methodologies.

    • Feature Learning: Explicit vector space embedding by deep learning methodologies, e.g., neural networks.

    Specific Examples by Data Type

    • Graphs:

      • Similarity approaches:

      • Structural Alignment

      • Degree Histograms

      • Label Sequence Kernel

      • Node embeddings via Spectral Neural Networks.

    • Geometry:

      • Similarity measures:

      • Hausdorff Distance

      • Shape Histograms

      • Spatial Pyramid Kernel

      • Convolutional Neural Networks (CNN).

    • Sequences:

      • Edit Distance

      • Symbol Histograms.

      • Cosine Distance

      • Recurrent Neural Networks (RNN).

    Objects and Attributes in Conceptual Modeling

    Modeling Structures

    Examples of representations:


    • Entity-Relationship Diagram (ER Diagram) including entities like students with attributes:

      • Name

      • Semester

      • Major

      • Skills.


    • UML Class Diagram with equivalent attributes.


    • Relational Model Data Tables to represent students:

      Name

      Semester

      Major

      Skills


      Ann

      3

      CS

      Java, C, R


      Bob

      1

      CS

      Java, PHP


      Charly

      4

      History

      Piano


      Debra

      2

      Arts

      Painting

      Feature Extraction Process

      Process Description

      • Objects from a database (DB) are mapped to feature vectors.

      • Feature vector space where:

        • Points represent objects.

        • Distances correspond to (dis-)similarity.

      Similarity Queries

      Basic Operations in (Multimedia) Databases

      Definitions:
      • Given:

        • Universe function $O$

        • Database $DB ext{ subset of } O$

        • Distance function $d$

        • Query object $q ext{ in } O$.

      Types of Queries:
      • Range Query:

        • Defined as:
          ext{range}(DB, q, d, ext{ε}) = ig ext{o ext{ in } DB | d(o, q) ext{ ≤ } ε} ig ext{ }.

      • Nearest Neighbor Query:

        • Defined as:
          ext{NN}(DB, q, d) = ig ext{o ext{ in } DB | orall o' ext{ in } DB : d(o, q) ext{ ≤ } d(o', q) ig ext{ }.

      k-Nearest Neighbor Queries

      • Definition: For a given parameter $k ext{ in } ext{N}$:
        ext{NN}(DB, q, d, k) ext{ ⊆ } DB ext{ with } |NN(DB, q,d,k)| = k ext{ and } orall o ext{ in } NN(DB, q,d,k), o' ext{ in } DB - NN(DB, q,d,k) : d(o, q) ext{ ≤ } d(o', q).

      • Ranking Query: Related to partial sorting: `` “Get next” functionality for selecting database objects in the order of increasing distance toq`:

        • For any indices $i ext{ and } j$ where $i ext{ ≤ } j$:
          d(q,rank{DB} q,d(i)) ≤ d(q,rank{DB} q,d(j)).

      Similarity Search Techniques

      Example: Range Query

      • Defined as:
        ext{range}(DB, q, d, ε) = ig ext{o ext{ in } DB | d(o, q) ext{ ≤ } ε} ig ext{ }.

      • Naive Search: Sequential scanning of entries, taking $O(n)$ time.

      • Individual distance checks likewise require $O(n)$ time.

      • Fast Search Techniques:

        • Utilizing database techniques like:

        • Filter-refine architecture.

        • Using indexing structures to eliminate sequential scans; analysis for database objects can be $O(n)$, $O( ext{log } n)$ or even $O(1)$.

      Filter-Refine Architecture Visualization

      Core Principle of Multi-Step Searches

      1. Fast filter step generates a candidate set $C ext{ ⊆ } DB ext{ (via approximate distance function } d')$.

      2. Exact distance function $d$ applied only to candidates in $C$.

      Example: Dimensionality Reduction

      • ICES Criteria for Filter Quality:

        • Indexable: Index enabled

        • Complete: No false dismissals

        • Efficient: Quick individual calculations

        • Selective: Maintain a small candidate set.

      Indexing in Databases

      Telephone Book Example

      • Indexed by the alphabetical order of participants to facilitate quicker access:

        • Instead of performing a sequential search, estimate the query object's range (interlocutor).

        • Extract correct branches and subsequently utilize the next identifier to hone in on the query.

      Indexing Principles

      • Importance in data organization for effective retrieval of objects, emphasizing efficient pruning.

      • R-Tree Indexing Example:

        • Organizes minimum bounding rectangles and disregards irrelevant subtrees during queries.

      Data Visualization Techniques

      Importance of Visualization

      • Critical for perceiving patterns in extensive data sets, which may be difficult to discern from tabular numeric formats.

      • Visualization Transformation: Enables data presentation in visually comprehensible depictions (the saying "a picture is worth a thousand words" underscores this).

      Key Capabilities Combined

      • Computers: Excellent at data processing and producing graphical representations.

      • Humans: More adept at visual pattern detection.

      Monthly Average Temperature Example

      • Visualization of temperature data for various cities.

      • Illustrated data includes average temperatures for cities across the months in Celsius.

      Data Visualization Techniques Classification

      • Geometric Visualization: Transformation and projection visualizations.

        • Examples include scatterplots, parallel coordinates, and icon-based methods such as Chernoff Faces.

      • Pixel-Oriented Visualization: Each attribute value represented by a colored pixel.

        • Various recursive patterns implemented for arrangement.

      Limitations of Current Visualization Techniques

      Scatterplot Characteristics

      • Works well primarily in two-dimensional data settings.

      • Scatterplot matrices provide pairwise analysis, which may become cluttered when analyzing many dimensions.

      Alternatives: Polygonal Plots

      • Utilize curves rather than points for depicting data objects,

        • Examples: Spiderweb plots and parallel coordinates.

      Data Reduction Techniques

      Purpose

      • Enhances pattern perception and eases interpretability:

        • Raw data representation can be complex and unmanageable; reduction allows for better evaluation of data patterns.

        • High computational complexity of big datasets extends runtime for mining.

        • Simplified datasets yield improved analytical results.

      Methods of Data Reduction

      • Data Aggregation: Using basic statistics to combine data.

      • Data Generalization: Abstraction at higher levels to simplify.

      Data Reduction Strategies Overview

      • Numerosity Reduction: Reduce the number of objects.

      • Dimensionality Reduction: Eliminate attributes.

      • Quantization or Discretization: Decrease values per domain.

      Quantization and Aggregation Techniques

      • Includes sampling and aggregation focusing on model parameters to aggregate comprehensive data sets.

      Basic Aggregate Types

      Central Tendency
      • Inquiry around data centralization:

        • Mean, median, and mode are examples of central tendency indicators.

      • Spread Measures: Variance and standard deviation characterize spread.

      Additional Aggregate Measures

      Distributive Aggregate Measures

      Examples illustrating the sum from divisions of dataset results:

      • Combination of values must fit basic aggregate functions status.

      Algebraic Aggregate Measures

      • Measures that involve various bound arguments calculated from distributions.

        • Example of average calculation pushing beyond partitions using distributive measures.

      Holistic Measures

      • Measures lacking constant storage size, impacting computation needs:

        • Unique examples: median, mode, and rank statistics.

      Central Tendency Measurement Methods

      Mean

      • Explains how to compute the arithmetic mean highlighting excellent centrality.

      • Special cases arise when considering categories over numerics but run into categorical issues.

      Median

      • Computation reflecting the mid-value of sorted data; recognizes both odd and even counts distinctly.

      Mode

      • Measures the most frequently occurring value and denotes unique cases like unimodal, bimodal, or multi-modal distributions.

      • Primarily tailored for categorical data representation.

      Variability Measures

      Variance and Standard Deviation

      • Variance reflects data dispersion around the mean:
        σ^2 = rac{1}{n - 1} igg( extstyle orall i=1 ext{ to } n igg)(x_i - ar{x})^2.

      • Single-pass method offers a faster yet less numerically robust calculation.

      Boxplot Analysis

      • Derives a five-number summary of a dataset:

        • Minimum, first quartile, median, third quartile, and maximum outcomes represent quantiles.

        • Illustrates data spread and extreme outliers interpretation.

      Example Utilizing the Iris Dataset

      • Depiction and analysis of sepal length across various species.

      Generalization Techniques in Data

      Definition of Quantization and Generalization

      • Quantization represents subdivision of data ranges into categories.

      • Generalization may span abstract categories based on framework contexts:

        • Examples include general age ranges and associated clustering outcomes.

      Binning Techniques with Histograms

      1. Equi-width Binning: Focusing on uniform grid intervals reducing categorical dimensions.

        • Advantages and disadvantages highlighted with effective examples.

      2. Equi-height Binning: Concentrating on managed data frequency balancing.

        • Provides insights into representation and robustness against outliers.

      Concept Hierarchies in Data Structuring

      Structuring Methods and Examples

      • Includes hierarchical frameworks for efficient OLAP operations: Data summarization with dimensional decrease definitions.

      • Specific structures based around scientific categories provide illustrative examples of network and classification:

        • Detailed comparisons across operative contexts demonstrate intelligence in structuring.

      Basic OLAP Operations Overview

      1. Roll-up: Summarizes data advancing to abstraction levels.

      2. Drill-down: Expands to more detailed information.

      3. Slice and Dice: Focuses on sectioning dimensions for clearer inspections.

      4. Pivoting: Changes data visualization orientation.

      Examples of SQL Commands Executing OLAP Operations

      • Queries specified for understanding the transformations in grouped data.

        • Roll-up and drill-down are vital for achieving comprehensive oversight on queries.

      Final Thoughts

      Efficient Implementation and Limitations

      • Discusses OLAP implementations focusing on aggregate measurements, potential generalizations about operations,

        • Data specification through automated or confirmed workflows enhances situational relevance.

      • Acknowledges limitations surrounding intelligent analysis capacity and suggests further exploration remains key to optimizing organizational queries.