Data Mining 1: Preliminaries
Introduction to Data
Context: Ludwig-Maximilians-Universität München, Chair for Database Systems and Data Mining, Prof. Dr. Thomas Seidl, Data Mining Algorithms 1 (Knowledge Discovery and Data Mining 1), Winter Semester 2025/26.
Preliminaries: Data
Definition of Data
Data is a representation of real or artificial objects, situations, or processes.
Measured by physical sensors, examples include:
Temperature
Humidity
Car traffic
Speed
Color
Recorded from digital systems, such as:
Bank transfers
Web browsing activities
Generated by simulations, such as:
Weather forecasts
Digital mockups
Stored and provided by computers (e.g., local disk or remote server).
Representation of Data
Data Types:
Numerical data types: Represent quantitative values and can include:
Integer
Float
Double
Categorical data types: Represent qualitative values and can include:
Strings (e.g., names, categories)
Similarity models: Used to facilitate pattern mining from data.
Data reduction techniques: Aim to enhance efficiency in processing data.
Presentation of Data
Visualization techniques: Important for effective interpretation and understanding of data.
Privacy aspects: Considerations in presenting sensitive data.
Agenda Overview
Introduction
Preliminaries: Data
2.1 Data Representation
2.2 Visualization
2.3 Data Reduction
Supervised Learning
Unsupervised Learning
Process Mining
Further Topics
Data Types – Algebraic View
Ingredients of Data Types
Identifiers: Data types have specific identifiers (e.g.,
int,long,float).Domains (D): Each type has a set of possible values:
For example:
{1, 2, ...},{true, false}, or string sets like{'hello', 'hi', 'howdy'}.
Operations: Types dictate the operations applicable to them, such as:
Arithmetic:
+,-,mod,divLogical:
and,or,notConcatenation for strings.
Equality tests: Defined by:
$=,
eq$: A function $D imes D
ightarrow ext{true, false}$.
Distance functions: Defined as:
$d : D imes D
ightarrow ext{R}^+_0$.Examples include:
Euclidean distance
Maximum distance
Difference value.
Data Handling is Expensive – Implementation View
Storage Requirements
Single values occupy specific bytes based on their types:
1 Byte = 8 Bits
4 Bytes = 32 Bits
8 Bytes = 64 Bits
Data Sizes in Big Data Context
Kilobytes $(10^3)$
Megabytes $(10^6)$
Gigabytes $(10^9)$
Terabytes $(10^{12})$
Petabytes $(10^{15})$
Exabytes $(10^{18})$
Zettabytes $(10^{21})$.
Computational Complexity
Operations require significant computation time:
Calculating the Euclidean distance of two objects from $ ext{R}^d$ requires $O(d)$ time.
Mean of $n$ objects from $ ext{R}^d$ needs $O(nd)$ time.
Covariance matrix of $n$ objects $x_i ext{ in } ext{R}^d$ needs $O(nd^2)$ time.
Overview of Data Types
Simple Data Types:
Numerical data (quantitative) and categorical data (qualitative).
Composed Data Types: Include structured forms like vectors, sequences, sets, and relations.
Complex Data Types: Include:
Multimedia data (images, videos, audio, text).
Spatial data (geographic shapes, trajectories).
Structures like graphs, networks, trees, etc.
Numeric Data
Simple Numeric Data
Comprises various numeric types:
Natural numbers, integers, rational numbers, real numbers.
Domain examples:
Age
Income
Shoe size
(Dis-)similarity metric: Difference value defined as:
$d(x, y) = |x - y|$.
Notable example: 3 is closer in similarity to 30 than 30 is to 3,000.
Composed Numeric Data
Vectors defined as:
$x ext{ in } ext{R}^d = R imes R imes … imes R$ (d times).
Domain examples:
Geolocation $( ext{lat}, ext{lon}) ext{ in } ext{R}^2$
Feature vector $x = (x1, x2, ext{…}, x_d) ext{ in } ext{R}^d$.
Comparison metrics:
Euclidean distance
Manhattan distance
Earth Movers’ distance.
Comparisons of Vectors by Lp-Distance Functions
Given two vectors $o, q ext{ in } ext{R}^n$:
Distances based on $p$ norms:
dp(o, q) = igg( extstyle orall i = 1 ext{ to } n igg) ext{ } |oi - q_i|^p.Specific formulations:
Euclidean distance:
d2(o, q) = igg( extstyle orall i = 1 ext{ to } n igg) (oi - q_i)^2.Manhattan distance (city blocks):
d1(o, q) = igg( extstyle orall i = 1 ext{ to } n igg) |oi - q_i|.Maximum distance: As $p o ext{∞}$,
d{ ext{max}}(o, q) = ext{max} igg( |oi - q_i| igg), ext{ for } i = 1..n.Weighted p-distances:
d{p,w}(o, q) = igg( extstyle orall i = 1 ext{ to } n igg) wi |oi - qi|^p.
Visualization example: Draw circles around center $c$:
ext{Circle} = ig ext{ }{p ext{ in } ext{R}^n | d(p, c) = ext{ε}} ig ext{ }.
Generalization: Metric Data
Metric Space Definition
A metric space $(O, d)$ consists of:
An object set $O$.
A metric distance function $d : O imes O
ightarrow ext{R}^+_0$ which fulfills:Symmetry: $ orall p, q ext{ in } O : d(p,q) = d(q,p)$.
Identity of Indiscernibles: $ orall p, q ext{ in } O : d(p,q) = 0 ext{ if and only if } p = q$.
Triangle Inequality: $ orall p, q, o ext{ in } O : d(p,q) ext{ ≤ } d(p,o) + d(o,q)$.
Examples: Points in 2D space or $ ext{R}^n$ with Euclidean distance.
Counterexamples: Consider certain real-world scenarios:
Non-symmetric distances: For example, one-way roads between $p$ and $q$.
Non-definition: Equivalent objects may not always be identical.
Triangle measures: Some detours may be preferred over straight lines.
Categorical Data
Overview
Represents qualitative symbols, essentially just identifiers.
Examples include:
Subjects: $ ext{subjects} = ext{physics, biology, math, music, literature}$.
Occupations: $ ext{occupation} = ext{butcher, hairdresser, physicist, physician}$.
Similarity Measures
Defined to compare values:
Example: Trivial metric,
d(p, q) = egin{cases}0 & ext{if } p = q \ 1 & ext{otherwise} ext{.}
ext{ } \ ext{ } \ ext{(d(p,q) : O} imes O
ightarrow ext{R}_{0}^{+}).
Explicit Similarity Matrix
An O × O mapping to similarity values between distinct items:
Car
Bike
Trolley
Car
1
0.3
0.1
Bike
0.3
1
0.2
Trolley
0.1
0.2
1
Hierarchical Path Length Example
Represents relationships and associations between subjects across various categories:
Biology $
ightarrow$ Physics $
ightarrow$ ScienceMath $
ightarrow$ Music $
ightarrow$ Arts $
ightarrow$ LiteratureEngineering $
ightarrow$ Civil Engineering $
ightarrow$ Mechanical Engineering
Simple Data Types: Ordinal Characteristics
Order Definition
There exists a total order $ ext{≤}$ on the set of possible data values $O$.
Properties of order include:
Transitivity: If $p ext{ ≤ } q$ and $q ext{ ≤ } o$, then $p ext{ ≤ } o$.
Antisymmetry: If $p ext{ ≤ } q$ and $q ext{ ≤ } p$, then $p = q$.
Totality: For any $p, q ext{ in } O$, either $p ext{ ≤ } q$ or $q ext{ ≤ } p$.
Examples:
Lexicographic ordering of words, sizes, and numerical sequences.
Composed Data Types: Sequences and Vectors
Definition
Given a domain $D$, a sequence $s$ of length $n$ is a mapping $I_n ightarrow D$, where:
$I_n = ext{Index set} = ext{{1, 2, …, n}}$ into $D$.
The sequence concatenates $n$ values from $D$ where order is relevant.
Examples
Distance metrics,
dp(o, q) = igg( extstyle orall i = 1 ext{ to } n igg) ext{ } |oi - q_i|^p.
Comparison of Composed Data Types: Sets
Characteristics
Represent an unordered collection of values.
Examples:
Skills = $ ext{Skills} = ext{Java, C, Python}$
Hobbies = $ ext{Hobbies} = ext{skiing, diving, hiking}$.
Comparison and Operations
Symmetric Set Difference:
R riangle S = (R - S) igcup (S - R) = (R igcup S) - (R igcap S).Jaccard Distance:
d(R, S) = rac{|R riangle S|}{|R igcup S|}.
Bitvector Representation
Definition
Using an ordered base set $B = (b1, …, bn)$, a set $S$ creates a binary vector $r ext{ in } ext{ {0, 1}}^n$ such that:
ri = egin{cases} 1 & ext{if } bi ext{ in } S \ 0 & ext{otherwise} ext{.} \ ext{Hamming distance} ext{ : Counts differing entries (sum of different entries).}Example with Hamming Distance:
Base Set $B = ( ext{Math, Physics, Chemistry, Biology, Music, Arts, English})$.
Sets $S ext{ and } R$ yielding:
$S = ext{Math, Music, English} = (1, 0, 0, 0, 1, 0, 1)$,
$R = ext{Math, Physics, Arts, English} = (1, 1, 0, 0, 0, 1, 1)$,
Hamming Distance Calculation:
ext{Hamming}(R,S) = 3 ext{ as three entries differ.}
Complex Data Types and Similarity Models
Components of Complex Data - Exemplar
Types include:
Structures: Graphs, networks, trees.
Geometry: Shapes, contours, trajectories.
Multimedia: Images, audio, text.
Approaches to Similarity Models
Direct Measures: Highly dependent on data type.
Feature Extraction / Engineering: Craft explicit vector space embeddings using hand-crafted features.
Kernel Trick: Implicit vector space embedding methodologies.
Feature Learning: Explicit vector space embedding by deep learning methodologies, e.g., neural networks.
Specific Examples by Data Type
Graphs:
Similarity approaches:
Structural Alignment
Degree Histograms
Label Sequence Kernel
Node embeddings via Spectral Neural Networks.
Geometry:
Similarity measures:
Hausdorff Distance
Shape Histograms
Spatial Pyramid Kernel
Convolutional Neural Networks (CNN).
Sequences:
Edit Distance
Symbol Histograms.
Cosine Distance
Recurrent Neural Networks (RNN).
Objects and Attributes in Conceptual Modeling
Modeling Structures
Examples of representations:
Entity-Relationship Diagram (ER Diagram) including entities like students with attributes:
Name
Semester
Major
Skills.
UML Class Diagram with equivalent attributes.
Relational Model Data Tables to represent students:
Name
Semester
Major
Skills
Ann
3
CS
Java, C, R
Bob
1
CS
Java, PHP
Charly
4
History
Piano
Debra
2
Arts
Painting
Feature Extraction Process
Process Description
Objects from a database (DB) are mapped to feature vectors.
Feature vector space where:
Points represent objects.
Distances correspond to (dis-)similarity.
Similarity Queries
Basic Operations in (Multimedia) Databases
Definitions:
Given:
Universe function $O$
Database $DB ext{ subset of } O$
Distance function $d$
Query object $q ext{ in } O$.
Types of Queries:
Range Query:
Defined as:
ext{range}(DB, q, d, ext{ε}) = ig ext{o ext{ in } DB | d(o, q) ext{ ≤ } ε} ig ext{ }.
Nearest Neighbor Query:
Defined as:
ext{NN}(DB, q, d) = ig ext{o ext{ in } DB | orall o' ext{ in } DB : d(o, q) ext{ ≤ } d(o', q) ig ext{ }.
k-Nearest Neighbor Queries
Definition: For a given parameter $k ext{ in } ext{N}$:
ext{NN}(DB, q, d, k) ext{ ⊆ } DB ext{ with } |NN(DB, q,d,k)| = k ext{ and } orall o ext{ in } NN(DB, q,d,k), o' ext{ in } DB - NN(DB, q,d,k) : d(o, q) ext{ ≤ } d(o', q).Ranking Query: Related to partial sorting:
`` “Get next” functionality for selecting database objects in the order of increasing distance toq`:For any indices $i ext{ and } j$ where $i ext{ ≤ } j$:
d(q,rank{DB} q,d(i)) ≤ d(q,rank{DB} q,d(j)).
Similarity Search Techniques
Example: Range Query
Defined as:
ext{range}(DB, q, d, ε) = ig ext{o ext{ in } DB | d(o, q) ext{ ≤ } ε} ig ext{ }.Naive Search: Sequential scanning of entries, taking $O(n)$ time.
Individual distance checks likewise require $O(n)$ time.
Fast Search Techniques:
Utilizing database techniques like:
Filter-refine architecture.
Using indexing structures to eliminate sequential scans; analysis for database objects can be $O(n)$, $O( ext{log } n)$ or even $O(1)$.
Filter-Refine Architecture Visualization
Core Principle of Multi-Step Searches
Fast filter step generates a candidate set $C ext{ ⊆ } DB ext{ (via approximate distance function } d')$.
Exact distance function $d$ applied only to candidates in $C$.
Example: Dimensionality Reduction
ICES Criteria for Filter Quality:
Indexable: Index enabled
Complete: No false dismissals
Efficient: Quick individual calculations
Selective: Maintain a small candidate set.
Indexing in Databases
Telephone Book Example
Indexed by the alphabetical order of participants to facilitate quicker access:
Instead of performing a sequential search, estimate the query object's range (interlocutor).
Extract correct branches and subsequently utilize the next identifier to hone in on the query.
Indexing Principles
Importance in data organization for effective retrieval of objects, emphasizing efficient pruning.
R-Tree Indexing Example:
Organizes minimum bounding rectangles and disregards irrelevant subtrees during queries.
Data Visualization Techniques
Importance of Visualization
Critical for perceiving patterns in extensive data sets, which may be difficult to discern from tabular numeric formats.
Visualization Transformation: Enables data presentation in visually comprehensible depictions (the saying "a picture is worth a thousand words" underscores this).
Key Capabilities Combined
Computers: Excellent at data processing and producing graphical representations.
Humans: More adept at visual pattern detection.
Monthly Average Temperature Example
Visualization of temperature data for various cities.
Illustrated data includes average temperatures for cities across the months in Celsius.
Data Visualization Techniques Classification
Geometric Visualization: Transformation and projection visualizations.
Examples include scatterplots, parallel coordinates, and icon-based methods such as Chernoff Faces.
Pixel-Oriented Visualization: Each attribute value represented by a colored pixel.
Various recursive patterns implemented for arrangement.
Limitations of Current Visualization Techniques
Scatterplot Characteristics
Works well primarily in two-dimensional data settings.
Scatterplot matrices provide pairwise analysis, which may become cluttered when analyzing many dimensions.
Alternatives: Polygonal Plots
Utilize curves rather than points for depicting data objects,
Examples: Spiderweb plots and parallel coordinates.
Data Reduction Techniques
Purpose
Enhances pattern perception and eases interpretability:
Raw data representation can be complex and unmanageable; reduction allows for better evaluation of data patterns.
High computational complexity of big datasets extends runtime for mining.
Simplified datasets yield improved analytical results.
Methods of Data Reduction
Data Aggregation: Using basic statistics to combine data.
Data Generalization: Abstraction at higher levels to simplify.
Data Reduction Strategies Overview
Numerosity Reduction: Reduce the number of objects.
Dimensionality Reduction: Eliminate attributes.
Quantization or Discretization: Decrease values per domain.
Quantization and Aggregation Techniques
Includes sampling and aggregation focusing on model parameters to aggregate comprehensive data sets.
Basic Aggregate Types
Central Tendency
Inquiry around data centralization:
Mean, median, and mode are examples of central tendency indicators.
Spread Measures: Variance and standard deviation characterize spread.
Additional Aggregate Measures
Distributive Aggregate Measures
Examples illustrating the sum from divisions of dataset results:
Combination of values must fit basic aggregate functions status.
Algebraic Aggregate Measures
Measures that involve various bound arguments calculated from distributions.
Example of average calculation pushing beyond partitions using distributive measures.
Holistic Measures
Measures lacking constant storage size, impacting computation needs:
Unique examples: median, mode, and rank statistics.
Central Tendency Measurement Methods
Mean
Explains how to compute the arithmetic mean highlighting excellent centrality.
Special cases arise when considering categories over numerics but run into categorical issues.
Median
Computation reflecting the mid-value of sorted data; recognizes both odd and even counts distinctly.
Mode
Measures the most frequently occurring value and denotes unique cases like unimodal, bimodal, or multi-modal distributions.
Primarily tailored for categorical data representation.
Variability Measures
Variance and Standard Deviation
Variance reflects data dispersion around the mean:
σ^2 = rac{1}{n - 1} igg( extstyle orall i=1 ext{ to } n igg)(x_i - ar{x})^2.Single-pass method offers a faster yet less numerically robust calculation.
Boxplot Analysis
Derives a five-number summary of a dataset:
Minimum, first quartile, median, third quartile, and maximum outcomes represent quantiles.
Illustrates data spread and extreme outliers interpretation.
Example Utilizing the Iris Dataset
Depiction and analysis of sepal length across various species.
Generalization Techniques in Data
Definition of Quantization and Generalization
Quantization represents subdivision of data ranges into categories.
Generalization may span abstract categories based on framework contexts:
Examples include general age ranges and associated clustering outcomes.
Binning Techniques with Histograms
Equi-width Binning: Focusing on uniform grid intervals reducing categorical dimensions.
Advantages and disadvantages highlighted with effective examples.
Equi-height Binning: Concentrating on managed data frequency balancing.
Provides insights into representation and robustness against outliers.
Concept Hierarchies in Data Structuring
Structuring Methods and Examples
Includes hierarchical frameworks for efficient OLAP operations: Data summarization with dimensional decrease definitions.
Specific structures based around scientific categories provide illustrative examples of network and classification:
Detailed comparisons across operative contexts demonstrate intelligence in structuring.
Basic OLAP Operations Overview
Roll-up: Summarizes data advancing to abstraction levels.
Drill-down: Expands to more detailed information.
Slice and Dice: Focuses on sectioning dimensions for clearer inspections.
Pivoting: Changes data visualization orientation.
Examples of SQL Commands Executing OLAP Operations
Queries specified for understanding the transformations in grouped data.
Roll-up and drill-down are vital for achieving comprehensive oversight on queries.
Final Thoughts
Efficient Implementation and Limitations
Discusses OLAP implementations focusing on aggregate measurements, potential generalizations about operations,
Data specification through automated or confirmed workflows enhances situational relevance.
Acknowledges limitations surrounding intelligent analysis capacity and suggests further exploration remains key to optimizing organizational queries.