Data Mining 1: Preliminaries

Introduction to Data

Context: Ludwig-Maximilians-Universität München, Chair for Database Systems and Data Mining, Prof. Dr. Thomas Seidl, Data Mining Algorithms 1 (Knowledge Discovery and Data Mining 1), Winter Semester 2025/26.

Preliminaries: Data

Definition of Data

Data is a representation of real or artificial objects, situations, or processes.
Measured by physical sensors, examples include:
- Temperature
- Humidity
- Car traffic
- Speed
- Color
Recorded from digital systems, such as:
- Bank transfers
- Web browsing activities
Generated by simulations, such as:
- Weather forecasts
- Digital mockups
Stored and provided by computers (e.g., local disk or remote server).

Representation of Data

Data Types:
- Numerical data types: Represent quantitative values and can include:
- Integer
- Float
- Double
- Categorical data types: Represent qualitative values and can include:
- Strings (e.g., names, categories)
Similarity models: Used to facilitate pattern mining from data.
Data reduction techniques: Aim to enhance efficiency in processing data.

Presentation of Data

Visualization techniques: Important for effective interpretation and understanding of data.
Privacy aspects: Considerations in presenting sensitive data.

Agenda Overview

Introduction
Preliminaries: Data
- 2.1 Data Representation
- 2.2 Visualization
- 2.3 Data Reduction
Supervised Learning
Unsupervised Learning
Process Mining
Further Topics

Data Types – Algebraic View

Ingredients of Data Types

Identifiers: Data types have specific identifiers (e.g., int, long, float).
Domains (D): Each type has a set of possible values:
- For example: {1, 2, ...}, {true, false}, or string sets like {'hello', 'hi', 'howdy'}.
Operations: Types dictate the operations applicable to them, such as:
- Arithmetic: +, -, mod, div
- Logical: and, or, not
- Concatenation for strings.
Equality tests: Defined by:
- $=,
  eq$: A function $D imes D
  ightarrow ext{true, false}$.
Distance functions: Defined as:
- $d : D imes D
  ightarrow ext{R}^+_0$.
- Examples include:
  - Euclidean distance
  - Maximum distance
  - Difference value.

Data Handling is Expensive – Implementation View

Storage Requirements

Single values occupy specific bytes based on their types:
- 1 Byte = 8 Bits
- 4 Bytes = 32 Bits
- 8 Bytes = 64 Bits

Data Sizes in Big Data Context

Kilobytes $(10^3)$
Megabytes $(10^6)$
Gigabytes $(10^9)$
Terabytes $(10^{12})$
Petabytes $(10^{15})$
Exabytes $(10^{18})$
Zettabytes $(10^{21})$.

Computational Complexity

Operations require significant computation time:
- Calculating the Euclidean distance of two objects from $ ext{R}^d$ requires $O(d)$ time.
- Mean of $n$ objects from $ ext{R}^d$ needs $O(nd)$ time.
- Covariance matrix of $n$ objects $x_i ext{ in } ext{R}^d$ needs $O(nd^2)$ time.

Overview of Data Types

Simple Data Types:
- Numerical data (quantitative) and categorical data (qualitative).
Composed Data Types: Include structured forms like vectors, sequences, sets, and relations.
Complex Data Types: Include:
- Multimedia data (images, videos, audio, text).
- Spatial data (geographic shapes, trajectories).
- Structures like graphs, networks, trees, etc.

Numeric Data

Simple Numeric Data

Comprises various numeric types:
- Natural numbers, integers, rational numbers, real numbers.
- Domain examples:
- Age
- Income
- Shoe size
- (Dis-)similarity metric: Difference value defined as:
  $d(x, y) = |x - y|$.
Notable example: 3 is closer in similarity to 30 than 30 is to 3,000.

Composed Numeric Data

Vectors defined as:
- $x ext{ in } ext{R}^d = R imes R imes … imes R$ (d times).
- Domain examples:
- Geolocation $( ext{lat}, ext{lon}) ext{ in } ext{R}^2$
- Feature vector $x = (x1, x2, ext{…}, x_d) ext{ in } ext{R}^d$.
Comparison metrics:
- Euclidean distance
- Manhattan distance
- Earth Movers’ distance.

Comparisons of Vectors by Lp-Distance Functions

Given two vectors $o, q ext{ in } ext{R}^n$:
- Distances based on $p$ norms:
 $dp(o, q) = \bigg( extstyle orall i = 1 ext{ to } n \bigg) ext{ } |oi - q_i|^p.$
- Specific formulations:
- Euclidean distance:
 $d2(o, q) = \bigg( extstyle orall i = 1 ext{ to } n \bigg) (oi - q_i)^2.$
- Manhattan distance (city blocks):
 $d1(o, q) = \bigg( extstyle orall i = 1 ext{ to } n \bigg) |oi - q_i|.$
- Maximum distance: As $p o ext{∞}$,
 $d{ ext{max}}(o, q) = ext{max} \bigg( |oi - q_i| \bigg), ext{ for } i = 1..n.$
- Weighted p-distances:
 $d{p,w}(o, q) = \bigg( extstyle orall i = 1 ext{ to } n \bigg) wi |oi - qi|^p.$
Visualization example: Draw circles around center $c$:
ext{Circle} = ig ext{ }{p ext{ in } ext{R}^n | d(p, c) = ext{ε}} ig ext{ }.

Generalization: Metric Data

Metric Space Definition

A metric space $(O, d)$ consists of:
- An object set $O$.
- A metric distance function $d : O imes O
  ightarrow ext{R}^+_0$ which fulfills:
- Symmetry: $ orall p, q ext{ in } O : d(p,q) = d(q,p)$.
- Identity of Indiscernibles: $ orall p, q ext{ in } O : d(p,q) = 0 ext{ if and only if } p = q$.
- Triangle Inequality: $ orall p, q, o ext{ in } O : d(p,q) ext{ ≤ } d(p,o) + d(o,q)$.
Examples: Points in 2D space or $ ext{R}^n$ with Euclidean distance.
Counterexamples: Consider certain real-world scenarios:
- Non-symmetric distances: For example, one-way roads between $p$ and $q$.
- Non-definition: Equivalent objects may not always be identical.
- Triangle measures: Some detours may be preferred over straight lines.

Categorical Data

Overview

Represents qualitative symbols, essentially just identifiers.
Examples include:
- Subjects: $ ext{subjects} = ext{physics, biology, math, music, literature}$.
- Occupations: $ ext{occupation} = ext{butcher, hairdresser, physicist, physician}$.

Similarity Measures

Defined to compare values:
- Example: Trivial metric,
  d(p, q) = egin{cases}0 & ext{if } p = q \ 1 & ext{otherwise} ext{.}
  ext{ } \ ext{ } \ ext{(d(p,q) : O} imes O
  ightarrow ext{R}_{0}^{+}).

Explicit Similarity Matrix

An O × O mapping to similarity values between distinct items:
Car
Bike
Trolley

Car
1
0.3
0.1

Bike
0.3
1
0.2

Trolley
0.1
0.2
1
Hierarchical Path Length Example
- Represents relationships and associations between subjects across various categories:
 - Biology $
 ightarrow$ Physics $
 ightarrow$ Science
 - Math $
 ightarrow$ Music $
 ightarrow$ Arts $
 ightarrow$ Literature
 - Engineering $
 ightarrow$ Civil Engineering $
 ightarrow$ Mechanical Engineering
Simple Data Types: Ordinal Characteristics
Order Definition
- There exists a total order $ ext{≤}$ on the set of possible data values $O$.
- Properties of order include:
 - Transitivity: If $p ext{ ≤ } q$ and $q ext{ ≤ } o$, then $p ext{ ≤ } o$.
 - Antisymmetry: If $p ext{ ≤ } q$ and $q ext{ ≤ } p$, then $p = q$.
 - Totality: For any $p, q ext{ in } O$, either $p ext{ ≤ } q$ or $q ext{ ≤ } p$.
 - Examples:
- Lexicographic ordering of words, sizes, and numerical sequences.
Composed Data Types: Sequences and Vectors
Definition
- Given a domain $D$, a sequence $s$ of length $n$ is a mapping $I_n ightarrow D$, where:
 - $I_n = ext{Index set} = ext{{1, 2, …, n}}$ into $D$.
- The sequence concatenates $n$ values from $D$ where order is relevant.
Examples
- Distance metrics,
 $dp(o, q) = \bigg( extstyle orall i = 1 ext{ to } n \bigg) ext{ } |oi - q_i|^p.$
Comparison of Composed Data Types: Sets
Characteristics
- Represent an unordered collection of values.
 - Examples:
- Skills = $ ext{Skills} = ext{Java, C, Python}$
- Hobbies = $ ext{Hobbies} = ext{skiing, diving, hiking}$.
Comparison and Operations
- Symmetric Set Difference:
 $R riangle S = (R - S) \bigcup (S - R) = (R \bigcup S) - (R \bigcap S).$
- Jaccard Distance:
 $d(R, S) = rac{|R riangle S|}{|R \bigcup S|}.$
Bitvector Representation
Definition
- Using an ordered base set $B = (b1, …, bn)$, a set $S$ creates a binary vector $r ext{ in } ext{ {0, 1}}^n$ such that:
 ri = egin{cases} 1 & ext{if } bi ext{ in } S \ 0 & ext{otherwise} ext{.} \ ext{Hamming distance} ext{ : Counts differing entries (sum of different entries).}
- Example with Hamming Distance:
 - Base Set $B = ( ext{Math, Physics, Chemistry, Biology, Music, Arts, English})$.
 - Sets $S ext{ and } R$ yielding:
 - $S = ext{Math, Music, English} = (1, 0, 0, 0, 1, 0, 1)$,
 - $R = ext{Math, Physics, Arts, English} = (1, 1, 0, 0, 0, 1, 1)$,
 - Hamming Distance Calculation:
 $ext{Hamming}(R,S) = 3 ext{ as three entries differ.}$
Complex Data Types and Similarity Models
Components of Complex Data - Exemplar
Types include:
- Structures: Graphs, networks, trees.
- Geometry: Shapes, contours, trajectories.
- Multimedia: Images, audio, text.
Approaches to Similarity Models
- Direct Measures: Highly dependent on data type.
- Feature Extraction / Engineering: Craft explicit vector space embeddings using hand-crafted features.
- Kernel Trick: Implicit vector space embedding methodologies.
- Feature Learning: Explicit vector space embedding by deep learning methodologies, e.g., neural networks.
Specific Examples by Data Type
- Graphs:
 - Similarity approaches:
 - Structural Alignment
 - Degree Histograms
 - Label Sequence Kernel
 - Node embeddings via Spectral Neural Networks.
- Geometry:
 - Similarity measures:
 - Hausdorff Distance
 - Shape Histograms
 - Spatial Pyramid Kernel
 - Convolutional Neural Networks (CNN).
- Sequences:
 - Edit Distance
 - Symbol Histograms.
 - Cosine Distance
 - Recurrent Neural Networks (RNN).
Objects and Attributes in Conceptual Modeling
Modeling Structures
Examples of representations:
- Entity-Relationship Diagram (ER Diagram) including entities like students with attributes:
 - Name
 - Semester
 - Major
 - Skills.
- UML Class Diagram with equivalent attributes.
- Relational Model Data Tables to represent students:
 Name
 Semester
 Major
 Skills
 
 Ann
 3
 CS
 Java, C, R
 
 Bob
 1
 CS
 Java, PHP
 
 Charly
 4
 History
 Piano
 
 Debra
 2
 Arts
 Painting
 Feature Extraction Process
 Process Description
 - Objects from a database (DB) are mapped to feature vectors.
 - Feature vector space where:
 - Points represent objects.
 - Distances correspond to (dis-)similarity.
 Similarity Queries
 Basic Operations in (Multimedia) Databases
 Definitions:
 - Given:
 - Universe function $O$
 - Database $DB ext{ subset of } O$
 - Distance function $d$
 - Query object $q ext{ in } O$.
 Types of Queries:
 - Range Query:
 - Defined as:
 ext{range}(DB, q, d, ext{ε}) = ig ext{o ext{ in } DB | d(o, q) ext{ ≤ } ε} ig ext{ }.
 - Nearest Neighbor Query:
 - Defined as:
 ext{NN}(DB, q, d) = ig ext{o ext{ in } DB | orall o' ext{ in } DB : d(o, q) ext{ ≤ } d(o', q) ig ext{ }.
 k-Nearest Neighbor Queries
 - Definition: For a given parameter $k ext{ in } ext{N}$:
 $ext{NN}(DB, q, d, k) ext{ ⊆ } DB ext{ with } |NN(DB, q,d,k)| = k ext{ and } orall o ext{ in } NN(DB, q,d,k), o' ext{ in } DB - NN(DB, q,d,k) : d(o, q) ext{ ≤ } d(o', q).$
 - Ranking Query: Related to partial sorting: `` “Get next” functionality for selecting database objects in the order of increasing distance toq`:
 - For any indices $i ext{ and } j$ where $i ext{ ≤ } j$:
 $d(q,rank{DB} q,d(i)) ≤ d(q,rank{DB} q,d(j)).$
 Similarity Search Techniques
 Example: Range Query
 - Defined as:
 ext{range}(DB, q, d, ε) = ig ext{o ext{ in } DB | d(o, q) ext{ ≤ } ε} ig ext{ }.
 - Naive Search: Sequential scanning of entries, taking $O(n)$ time.
 - Individual distance checks likewise require $O(n)$ time.
 - Fast Search Techniques:
 - Utilizing database techniques like:
 - Filter-refine architecture.
 - Using indexing structures to eliminate sequential scans; analysis for database objects can be $O(n)$, $O( ext{log } n)$ or even $O(1)$.
 Filter-Refine Architecture Visualization
 Core Principle of Multi-Step Searches
 1. Fast filter step generates a candidate set $C ext{ ⊆ } DB ext{ (via approximate distance function } d')$.
 2. Exact distance function $d$ applied only to candidates in $C$.
 Example: Dimensionality Reduction
 - ICES Criteria for Filter Quality:
 - Indexable: Index enabled
 - Complete: No false dismissals
 - Efficient: Quick individual calculations
 - Selective: Maintain a small candidate set.
 Indexing in Databases
 Telephone Book Example
 - Indexed by the alphabetical order of participants to facilitate quicker access:
 - Instead of performing a sequential search, estimate the query object's range (interlocutor).
 - Extract correct branches and subsequently utilize the next identifier to hone in on the query.
 Indexing Principles
 - Importance in data organization for effective retrieval of objects, emphasizing efficient pruning.
 - R-Tree Indexing Example:
 - Organizes minimum bounding rectangles and disregards irrelevant subtrees during queries.
 Data Visualization Techniques
 Importance of Visualization
 - Critical for perceiving patterns in extensive data sets, which may be difficult to discern from tabular numeric formats.
 - Visualization Transformation: Enables data presentation in visually comprehensible depictions (the saying "a picture is worth a thousand words" underscores this).
 Key Capabilities Combined
 - Computers: Excellent at data processing and producing graphical representations.
 - Humans: More adept at visual pattern detection.
 Monthly Average Temperature Example
 - Visualization of temperature data for various cities.
 - Illustrated data includes average temperatures for cities across the months in Celsius.
 Data Visualization Techniques Classification
 - Geometric Visualization: Transformation and projection visualizations.
 - Examples include scatterplots, parallel coordinates, and icon-based methods such as Chernoff Faces.
 - Pixel-Oriented Visualization: Each attribute value represented by a colored pixel.
 - Various recursive patterns implemented for arrangement.
 Limitations of Current Visualization Techniques
 Scatterplot Characteristics
 - Works well primarily in two-dimensional data settings.
 - Scatterplot matrices provide pairwise analysis, which may become cluttered when analyzing many dimensions.
 Alternatives: Polygonal Plots
 - Utilize curves rather than points for depicting data objects,
 - Examples: Spiderweb plots and parallel coordinates.
 Data Reduction Techniques
 Purpose
 - Enhances pattern perception and eases interpretability:
 - Raw data representation can be complex and unmanageable; reduction allows for better evaluation of data patterns.
 - High computational complexity of big datasets extends runtime for mining.
 - Simplified datasets yield improved analytical results.
 Methods of Data Reduction
 - Data Aggregation: Using basic statistics to combine data.
 - Data Generalization: Abstraction at higher levels to simplify.
 Data Reduction Strategies Overview
 - Numerosity Reduction: Reduce the number of objects.
 - Dimensionality Reduction: Eliminate attributes.
 - Quantization or Discretization: Decrease values per domain.
 Quantization and Aggregation Techniques
 - Includes sampling and aggregation focusing on model parameters to aggregate comprehensive data sets.
 Basic Aggregate Types
 Central Tendency
 - Inquiry around data centralization:
 - Mean, median, and mode are examples of central tendency indicators.
 - Spread Measures: Variance and standard deviation characterize spread.
 Additional Aggregate Measures
 Distributive Aggregate Measures
 Examples illustrating the sum from divisions of dataset results:
 - Combination of values must fit basic aggregate functions status.
 Algebraic Aggregate Measures
 - Measures that involve various bound arguments calculated from distributions.
 - Example of average calculation pushing beyond partitions using distributive measures.
 Holistic Measures
 - Measures lacking constant storage size, impacting computation needs:
 - Unique examples: median, mode, and rank statistics.
 Central Tendency Measurement Methods
 Mean
 - Explains how to compute the arithmetic mean highlighting excellent centrality.
 - Special cases arise when considering categories over numerics but run into categorical issues.
 Median
 - Computation reflecting the mid-value of sorted data; recognizes both odd and even counts distinctly.
 Mode
 - Measures the most frequently occurring value and denotes unique cases like unimodal, bimodal, or multi-modal distributions.
 - Primarily tailored for categorical data representation.
 Variability Measures
 Variance and Standard Deviation
 - Variance reflects data dispersion around the mean:
 $σ^2 = rac{1}{n - 1} \bigg( extstyle orall i=1 ext{ to } n \bigg)(x_i - \bar{x})^2.$
 - Single-pass method offers a faster yet less numerically robust calculation.
 Boxplot Analysis
 - Derives a five-number summary of a dataset:
 - Minimum, first quartile, median, third quartile, and maximum outcomes represent quantiles.
 - Illustrates data spread and extreme outliers interpretation.
 Example Utilizing the Iris Dataset
 - Depiction and analysis of sepal length across various species.
 Generalization Techniques in Data
 Definition of Quantization and Generalization
 - Quantization represents subdivision of data ranges into categories.
 - Generalization may span abstract categories based on framework contexts:
 - Examples include general age ranges and associated clustering outcomes.
 Binning Techniques with Histograms
 1. Equi-width Binning: Focusing on uniform grid intervals reducing categorical dimensions.
 - Advantages and disadvantages highlighted with effective examples.
 2. Equi-height Binning: Concentrating on managed data frequency balancing.
 - Provides insights into representation and robustness against outliers.
 Concept Hierarchies in Data Structuring
 Structuring Methods and Examples
 - Includes hierarchical frameworks for efficient OLAP operations: Data summarization with dimensional decrease definitions.
 - Specific structures based around scientific categories provide illustrative examples of network and classification:
 - Detailed comparisons across operative contexts demonstrate intelligence in structuring.
 Basic OLAP Operations Overview
 1. Roll-up: Summarizes data advancing to abstraction levels.
 2. Drill-down: Expands to more detailed information.
 3. Slice and Dice: Focuses on sectioning dimensions for clearer inspections.
 4. Pivoting: Changes data visualization orientation.
 Examples of SQL Commands Executing OLAP Operations
 - Queries specified for understanding the transformations in grouped data.
 - Roll-up and drill-down are vital for achieving comprehensive oversight on queries.
 Final Thoughts
 Efficient Implementation and Limitations
 - Discusses OLAP implementations focusing on aggregate measurements, potential generalizations about operations,
 - Data specification through automated or confirmed workflows enhances situational relevance.
 - Acknowledges limitations surrounding intelligent analysis capacity and suggests further exploration remains key to optimizing organizational queries.

	Car	Bike	Trolley
Car	1	0.3	0.1
Bike	0.3	1	0.2
Trolley	0.1	0.2	1

Hierarchical Path Length Example

Name	Semester	Major	Skills
Ann	3	CS	Java, C, R
Bob	1	CS	Java, PHP
Charly	4	History	Piano
Debra	2	Arts	Painting

Feature Extraction Process

Data Mining 1: Preliminaries

Introduction to Data

Preliminaries: Data

Definition of Data

Representation of Data

Presentation of Data

Agenda Overview

Data Types – Algebraic View

Ingredients of Data Types

Data Handling is Expensive – Implementation View

Storage Requirements

Data Sizes in Big Data Context

Computational Complexity

Overview of Data Types

Numeric Data

Simple Numeric Data

Composed Numeric Data

Comparisons of Vectors by Lp-Distance Functions

Generalization: Metric Data

Metric Space Definition

Categorical Data

Overview

Similarity Measures

Explicit Similarity Matrix

Hierarchical Path Length Example

Simple Data Types: Ordinal Characteristics

Order Definition

Composed Data Types: Sequences and Vectors

Definition

Examples

Comparison of Composed Data Types: Sets

Characteristics

Comparison and Operations

Bitvector Representation

Definition

Complex Data Types and Similarity Models

Components of Complex Data - Exemplar

Types include:

Approaches to Similarity Models

Specific Examples by Data Type

Objects and Attributes in Conceptual Modeling

Modeling Structures

Examples of representations:

Feature Extraction Process

Process Description

Similarity Queries

Basic Operations in (Multimedia) Databases

Definitions:

Types of Queries:

k-Nearest Neighbor Queries

Similarity Search Techniques

Example: Range Query

Filter-Refine Architecture Visualization

Core Principle of Multi-Step Searches

Example: Dimensionality Reduction

Indexing in Databases

Telephone Book Example

Indexing Principles

Data Visualization Techniques

Importance of Visualization

Key Capabilities Combined

Monthly Average Temperature Example

Data Visualization Techniques Classification

Limitations of Current Visualization Techniques

Scatterplot Characteristics

Alternatives: Polygonal Plots

Data Reduction Techniques

Purpose

Methods of Data Reduction

Data Reduction Strategies Overview

Quantization and Aggregation Techniques

Basic Aggregate Types

Central Tendency

Additional Aggregate Measures

Distributive Aggregate Measures

Algebraic Aggregate Measures

Holistic Measures

Central Tendency Measurement Methods

Mean

Median