Data Modeling for Data Engineers

Types of Data

  • Non-Structured Data: Data that lacks a predefined model and is not organized (e.g., audio files, video files, NoSQL databases).
  • Semi-Structured Data: Data that does not have a predefined model but possesses some inherent structure, often self-describing through tags, markers, or hierarchies (e.g., JSON, XML).
  • Structured Data: Data organized in a tabular format with relationships between rows and columns, reliant on a data model such as relational databases or Excel files.

Data Sources

  • Devices
  • Databases
  • Files (JSON, XML, CSV)
  • Videos

Structured Data

  • Organized in tabular form (rows and columns).
  • Relies on a data model.
  • Examples: Relational Databases, Excel Files.

Non-Structured Data

  • Lacks a predefined model.
  • Not organized in a conventional manner.
  • Experiences large growth in volume.
  • Examples: Audio Files, Video Files, NoSQL Databases.

Semi-Structured Data

  • Lacks a predefined model.
  • Contains some form of structure, often self-describing.
  • Utilizes tags, markers, or hierarchies between elements.
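A minimal sketch of the self-describing property, using a JSON document (the record and field names are illustrative):

```python
import json

# A semi-structured record: no fixed schema, but the keys act as
# self-describing tags and the nesting expresses hierarchy.
raw = '{"name": "Alice", "contact": {"email": "alice@example.com"}, "tags": ["vip"]}'
doc = json.loads(raw)

# The structure is discoverable from the data itself.
print(doc["contact"]["email"])
```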

Definition of Data Modeling

  • Data modeling is the process of creating a visual representation of how data is structured, stored, and related within a system.
  • It defines data elements, their attributes, and the relationships between them.
  • It serves as a blueprint for organizing and managing data.

Key Components of Data Models

  • Entities: Represent objects or concepts (e.g., Customer, Order, Product).
  • Attributes: Characteristics or properties of entities (e.g., name, email, date, total).
  • Relationships: Connections between entities (e.g., Customer places Order, Order contains Product).

Data Model Types

  • Conceptual Data Model:
    • Purpose: High-level overview.
    • Target Audience: Business stakeholders.
    • Platform: Agnostic (independent of specific platforms).
  • Logical Data Model:
    • Purpose: Logical definition of data structures.
    • Target Audience: Data engineers, data architects, data analysts.
    • Platform: Agnostic.
  • Physical Data Model:
    • Purpose: Low-level detail of physical data design.
    • Target Audience: Database administrators, developers.
    • Platform: Specific (tailored to a particular database management system).

Conceptual Data Model Details

  • High-level representation of organizational data.
  • Focuses on what data is needed rather than how it will be implemented.

Creating a Conceptual Data Model (Example)

  • Entities: Stadium, Customer, Event, Attendee, Ticket
  • Relationships: A stadium organizes events; a customer interacts with a stadium; an attendee attends events; an event can be held at a stadium; a customer buys a ticket; a ticket is for an event.

Benefits of a Conceptual Data Model

  • Clear communication with stakeholders.
  • Early validation of requirements.
  • Improved planning.
  • Focus on business needs.
  • Technology independence.
  • Enhanced collaboration.
  • Supports documentation & training.

Steps to Create a Logical Data Model

  • Identify entities.
  • Identify candidate keys.
  • Choose primary keys.
  • Apply normalization.
  • Set relationships.
  • Identify relationship cardinality.
  • Iterate to fine-tune.
  • Still independent of technology (not tied to a specific DBMS).

Transformation from Conceptual to Logical Data Model (Example)

  • Conceptual Model: Stadium interacts with Customer; Stadium organizes Event; Attendee attends Event; Customer buys Ticket
  • Logical Model:
    • Stadium: StadiumID (PK), Stadium Name, Stadium Country, Stadium City
    • Customer: CustomerID (PK), Customer Name, Customer Country, Customer Address
    • Event: EventID (PK), Event Name, Event Date
    • Ticket: TicketID (PK), Ticket Price, Ticket Category

Benefits of a Logical Data Model

  • Clarity and structure.
  • Improved communication.
  • Early validation (iteration and fine-tuning), supporting an agile approach.
  • Foundation for the physical model (blueprint).
  • Technology neutral.
  • Supports normalization (avoid redundancy and anomalies).

Normalization

  • Normalization is the process of organizing data in a database.
  • It reduces data redundancy and improves data integrity.
  • It involves dividing large, complex tables into smaller, related tables.
  • Ensures each table contains only data directly related to its primary key.
  • Achieved through a series of normal forms (1NF, 2NF, 3NF, etc.).
  • Each normal form builds on the previous one to systematically eliminate anomalies in data insertion, update, and deletion.

Normalization Checks

  • 1NF (First Normal Form):
    • Atomic values only (each cell contains a single value).
    • No repeating groups.
    • Each record has a unique identifier.
  • 2NF (Second Normal Form):
    • Already in 1NF.
    • No partial dependency on a subset of the primary key (applies to composite keys).
  • 3NF (Third Normal Form):
    • Already in 2NF.
    • No transitive dependency between non-key attributes.
    • All non-key attributes depend only on the primary key.

ER Diagram of Registration

  • Students --Enrolls In--> StudentCourses <-- Courses
  • Tables: Students (StudentID), StudentCourses (StudentID, CourseID), Courses (CourseID, CourseName, InstructorName)
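The split behind this diagram can be sketched in Python with hypothetical enrollment data: course attributes depend only on CourseID, so they move out of the enrollment rows into their own table.

```python
# Denormalized enrollment rows: course name and instructor repeat
# for every enrolled student (illustrative data).
flat = [
    {"student_id": 1, "course_id": "C1", "course_name": "SQL", "instructor": "Kim"},
    {"student_id": 2, "course_id": "C1", "course_name": "SQL", "instructor": "Kim"},
    {"student_id": 1, "course_id": "C2", "course_name": "NoSQL", "instructor": "Lee"},
]

# 3NF split: course attributes depend only on CourseID, so they move to a
# Courses table; StudentCourses keeps just the key pairs.
courses = {row["course_id"]: (row["course_name"], row["instructor"]) for row in flat}
student_courses = [(row["student_id"], row["course_id"]) for row in flat]
```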

3rd Normal Form Explained

  • Builds upon 1NF: Requires atomic values in each cell and no repeating groups.
  • Builds upon 2NF: Removes partial dependencies on a composite primary key.
  • Removes transitive dependencies.

Entity Relationship Diagrams

  • Conceptual: Entities (Nouns), Relationships (Verbs)
  • Logical: Includes data types and constraints (e.g., Students (StudentID INT, Name VARCHAR(100) NN))

Tools to Draw E-R Diagrams

  • Desktop Apps:
    • MySQL Workbench: Free, built-in ERD tool for MySQL databases.
    • DBeaver: Open-source database IDE with ERD support for many DBMSs.
    • Oracle SQL Developer Data Modeler: Advanced ERD generation from schemas (Oracle DB focused).
    • pgModeler: Open-source PostgreSQL database modeler.
  • Web-Based Tools:
    • dbdiagram.io: Simple ERD syntax, supports SQL reverse-engineering.
    • Lucidchart: Drag-and-drop interface, templates, limited free tier.
    • Draw.io / diagrams.net: Free, offline/online, highly customizable.
    • QuickDBD: Type to diagram instantly for ER modeling.
    • SqlDBM: Online database modeling for SQL Server, MySQL, PostgreSQL, Snowflake.

E-R Diagrams Main Concepts

  • Entities: Real-world objects or concepts.
    • Strong: Can exist independently (e.g., Customer).
    • Weak: Depends on another entity for its identification (e.g., an Order Line, which cannot exist without its Order).
  • Attributes: Characteristics or properties of an entity.
    • Simple, Composite, Derived, Multivalued, Key.
  • Constraints: Rules that limit the type of data that can be stored in a table to ensure accuracy, consistency, and integrity.
  • Relationships: Associations between two or more entities.
    • One-to-One (1:1)
    • One-to-Many (1:N)
    • Many-to-Many (M:N)
  • Keys:
    • Primary Key: Uniquely identifies each instance.
    • Foreign Key: References the PK of another (or even the same) entity.

Physical Data Modeling

  • Choose the platform.
  • Translate logical entities into tables.
  • Establish relationships.
  • Apply Normalization/Denormalization.
  • Apply table Constraints.
  • Create Indexes and/or partitions.
  • Extend with Functions, Procedures, and Triggers.

Indexing Strategy

  • Workload type:
    • Row store for OLTP, column store for OLAP.
  • Filtering conditions:
    • Columns used in WHERE, HAVING, JOIN.
  • Column data type:
    • Use integers over text when possible.
  • Data selectivity:
    • Create index on a more selective column.
  • Column order:
    • The order of columns in a composite index matters: lead with the columns most often used to filter.
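A quick sketch of why column order matters, using SQLite's query planner (table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
# Composite index: region is the leading column.
conn.execute("CREATE INDEX idx_region_product ON sales (region, product)")

# Filtering on the leading column can use the index.
plan_lead = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE region = 'EU'"
).fetchone()[-1]

# Filtering only on the trailing column generally falls back to a full scan.
plan_trail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product = 'USB Hub'"
).fetchone()[-1]
```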

Table Partitioning

  • Dividing a table into smaller, more manageable parts, often based on date or other criteria.
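The routing idea behind date-based partitioning can be sketched as a function from a row's date key to a monthly partition (the partition naming scheme is illustrative):

```python
from datetime import date

# Range partitioning by month: each row lands in the partition
# covering its date key.
def partition_for(order_date: date) -> str:
    return f"orders_{order_date.year}_{order_date.month:02d}"

print(partition_for(date(2025, 5, 1)))  # orders_2025_05
```

Queries that filter on the partitioning key can then skip every partition outside the requested range.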

Data Modeling: SQL vs NoSQL Databases

  • SQL: Relational model with structured data in tables.
  • NoSQL: Non-relational models with flexible, schema-less structures (e.g., JSON documents).

Key Characteristics of NoSQL Databases

  • Does not use traditional relational models (i.e., tables and SQL).
  • Uses flexible, non-tabular structures to store and manage data.
  • Optimized for scalability, speed, and unstructured or semi-structured data.
    • Schema-less
    • Horizontally Scalable
    • High performance
  • Designed for big data & modern apps (real-time analytics, mobile, IoT, and cloud-native applications).

Types of NoSQL Databases

  • Document:
    • Description: Stores data as JSON-like documents.
    • Example: MongoDB, Couchbase, Firebase.
  • Key-Value:
    • Description: Simple key → value pairs.
    • Example: Redis, Amazon DynamoDB.
  • Column-Family:
    • Description: Data stored in column families (wide rows).
    • Example: Apache Cassandra, HBase.
  • Graph:
    • Description: Nodes and relationships (edges) for connected data.
    • Example: Neo4j, ArangoDB.

Document Type: Sample Order document in MongoDB

  • Illustrates the structure of an order document with nested customer, order status, and order lines, each containing product details.

Common use cases of Document Databases

  • CMS / Blogs: Flexible structure, nested metadata.
  • E-commerce products: Varying attributes, one document per item.
  • User profiles / accounts: Custom fields per user, embedded settings.
  • Real-time chat/feeds: Fast document reads/writes.
  • Shopping carts / orders: Self-contained, easily serialized.
  • Mobile/IoT data: Schema-less and sync-ready.
  • Rapid development: Easy to evolve without migrations.

Key-value Type: Sample Session document in Redis

  • Illustrates storing session data and shopping cart information as key-value pairs.
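A minimal stand-in for the same idea, using a plain Python dict in place of Redis (keys and field names are illustrative):

```python
# A key-value store maps an opaque key to a value blob; a dict stands in here.
store = {}

# Session data keyed by session ID.
store["session:abc123"] = {"user_id": 42, "logged_in": True}
# The shopping cart lives under its own key.
store["cart:abc123"] = [{"product_id": 12, "qty": 3}]

# Lookups are by exact key only; there are no joins or secondary filters.
session = store["session:abc123"]
```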

Column-Family Table Type: Example of user_activity document in Cassandra

  • Shows how user activity data is stored with row keys and clustering keys for efficient querying.

Common use cases of Column-Family

  • Time-series metrics: Wide rows with timestamped columns.
  • Logging and monitoring: High-throughput writes, fast range queries.
  • Real-time analytics: Scales horizontally for large-scale data.
  • Financial/event records: Write-optimized and append-friendly.

Graph Type: example (Neo4j-style)

  • Describes relationships between nodes (entities) using edges (relationships), with properties and labels to define attributes and categories.

Common use cases of Graph Databases

  • Social networks: Fast friend-of-friend lookups (e.g., User → FRIENDS_WITH → User).
  • Recommendations: Related items/users through multi-step paths (e.g., User → RATED → Movie).
  • Fraud detection: Pattern detection across accounts/transactions (e.g., Account → TRANSFERS → Account).
  • IT/network analysis: Topology, dependencies, and impact tracing.
  • Knowledge graphs: Rich, semantically connected data.
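The friend-of-friend lookup above can be sketched with an adjacency-set graph (users and edges are illustrative):

```python
# Adjacency sets: each user maps to the set of direct friends.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}

def friends_of_friends(user):
    """Users exactly two hops away: neighbors of neighbors,
    minus the user and their direct friends."""
    direct = friends.get(user, set())
    two_hop = set()
    for f in direct:
        two_hop |= friends.get(f, set())
    return two_hop - direct - {user}
```

Graph databases optimize exactly this kind of traversal, so multi-hop queries stay fast as the network grows.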

Advantages/Disadvantages of NoSQL Structures

  • Advantages:
    • Fast reads: All order data is in one document (no joins).
    • Loose or no relationships.
    • More scalable and flexible for unstructured or nested data.
  • Disadvantages:
    • Weaker data integrity.
    • No standard query language.
    • Complex joins are difficult or impossible.
    • No constraints / or harder to enforce.

Relational to Document Model: Model for Workloads!

  • User Requirements to understand the use of the system and the queries.
  • Identify and Quantify Entities.
  • Attributes.
  • Identify Reads and Writes (to decide which data will be stored together).
  • Quantify R/W (Frequency).

Referencing

  • Separate documents linked using a common key.
  • Documents may come from different collections.

Relationships

  • Embedding: Related data is stored together in a single document.
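The two options can be sketched with Python dicts standing in for documents (field names are illustrative):

```python
# Embedding: all review data lives inside the book document.
book_embedded = {
    "_id": "book1",
    "title": "Data Modeling 101",
    "reviews": [{"rating": 5, "text": "Great"}, {"rating": 4, "text": "Useful"}],
}

# Referencing: reviews are separate documents linked to the book by a common key.
book = {"_id": "book1", "title": "Data Modeling 101"}
reviews = [
    {"_id": "r1", "book_id": "book1", "rating": 5},
    {"_id": "r2", "book_id": "book1", "rating": 4},
]
# Resolving a reference is an extra lookup the embedded form avoids.
book_reviews = [r for r in reviews if r["book_id"] == book["_id"]]
```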

Bi-directional one-to-one references

  • Example of how to reference Publisher data from a Headquarters document and vice versa.

One-to-Many Embedded

  • Shows how reviews can be embedded within a book document, so all review data is stored directly inside it.

One-to-many references

  • Illustrates how to reference review documents from a book document, maintaining separate collections for books and reviews.

One-to-Many Bi-directional

  • Demonstrates a bi-directional relationship where books reference reviews and reviews reference books.

Many-to-Many Embedding

  • Shows how to embed author information within book documents when multiple authors can write a book.

Many-To-Many Referencing

  • Illustrates how books and authors can reference each other using arrays of IDs.

Many-to-many Reversed

  • Shows the same relationship modeled from the reverse direction, with each author holding an array of book IDs.

Ask Questions

  • Guidelines and results based on considerations such as Simplicity, Query Atomicity, and Cardinality.

Schema Validation

  • Allows you to enforce rules about the structure of documents.
  • Define them when you create a collection.
  • Or add them to an existing collection.
  • Validations have:
    • Rules
    • Levels
      • Strict (validates all inserts and updates)
      • Moderate (validates inserts, and updates to documents that already pass validation)
    • Actions
      • Error (reject the operation)
      • Warn (allow it to proceed, but log the violation)
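A sketch of such a rule as a MongoDB `$jsonSchema` validator, written as the Python dict you would pass to pymongo's `create_collection` (collection and field names are illustrative; applying it requires a running MongoDB):

```python
# Validator: documents must carry an integer order_id and a non-negative total.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["order_id", "total"],
        "properties": {
            "order_id": {"bsonType": "int"},
            "total": {"bsonType": "double", "minimum": 0},
        },
    }
}

# db.create_collection("orders", validator=validator,
#                      validationLevel="moderate",  # new docs + existing valid docs
#                      validationAction="error")    # reject invalid operations
```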

What is OLAP?

  • OLAP (Online Analytical Processing) is a category of database systems and techniques designed to support complex data analysis and reporting, typically for business intelligence.
  • Allows users to analyze large volumes of data quickly from multiple perspectives.
  • Focuses on aggregations, trends, comparisons, and historical analysis, rather than real-time transactions. Think of it as “data exploration”, not data entry.

Key Characteristics of OLAP

  • Multidimensional analysis: Data is modeled as cubes with dimensions (e.g., time, region, product).
  • Aggregations: Performs summaries like SUM, AVG, COUNT efficiently.
  • Slice-and-dice: View data from different angles (e.g., sales by product, then by region).
  • Drill-down / drill-up: Explore data at different levels of granularity.
  • Read-intensive: Optimized for querying, not for frequent updates.
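Slice-and-dice and drill-down can be sketched in pandas on a tiny sales cube (the data is illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "product": ["Hub", "Cable", "Hub", "Cable"],
    "amount": [100, 50, 200, 80],
})

# Slice-and-dice: aggregate along one dimension, then another.
by_region = sales.groupby("region")["amount"].sum()
by_product = sales.groupby("product")["amount"].sum()

# Drill-down: from region totals to region x product granularity.
cube = sales.pivot_table(index="region", columns="product",
                         values="amount", aggfunc="sum")
```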

Databases and Tools that use OLAP:

  • Databases:
    • Microsoft SQL Server Analysis Services (SSAS)
    • Oracle OLAP
    • IBM Cognos
    • SAP BW/4HANA
  • Tools:
    • Tableau
    • Power BI (as front ends using OLAP cubes or tabular models)

OLAP Workflow

  • Data Ingestion -> Data Transformation -> Data Analysis

Data Ingestion

  • Capture raw data from multiple different sources into a repository.

Data Analysis

  • Query data and visualize it using tools like Microsoft Power BI, Tableau, Qlik, Looker.

OLTP vs OLAP

  • OLTP
    • A large number of small transactions
    • WRITE operations (Insert, Update)
    • Industry-specific (retail, banking)
    • Transactions as source
    • Increase productivity of end-users
    • Normalized databases for efficiency
  • OLAP
    • Large volumes of data
    • READ operations (Select)
    • Subject-specific (sales, marketing)
    • Aggregated data from transactions
    • Increase productivity of analysts and executives
    • Denormalized databases for analysis

Choose an Optimal Data Modeling Strategy

  • Analyze key characteristics of your data
  • Understand the data structure
    • Schema-on-read vs. Schema-on-write
  • Understand data architecture requirements
    • Centralized or decentralized
    • Batch-oriented vs. Real-time processing
  • Determine business needs
    • Key requirements and constraints
    • Reporting needs of stakeholders
  • Understand technology stack
    • Incorporate into the existing stack?
  • Choose the appropriate modeling approach
    • Relational vs. Non-relational vs. Dimensional…Or a combination?

Data Warehouse Benefits for Business Intelligence

  • Data consolidation
  • Improved data quality and consistency
  • Historical data analysis
  • Performance and speed
  • Support for complex queries
  • Enhanced decision-making
  • Security
  • Scalability

Types of Analysis Tools

  • Reporting Tools
  • Data Mining Tools
  • OLAP Tools
  • Predictive Analytics Tools
  • Dashboards

Data Warehouse Design

  • Dimensional Modeling
    • Star Schemas
    • Snowflake Schemas

Dimensional Modeling

  • Facts (numeric, transactional)
  • Dimensions (descriptive attributes)

Star Schema

  • A fact table surrounded by dimension tables.

Snowflake Schema Example

  • A refinement of the star schema where dimensional tables are normalized into multiple related tables.
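The typical analytical query against either schema joins the fact table to its dimensions and then aggregates; a pandas sketch with illustrative data:

```python
import pandas as pd

# Tiny star schema: one fact table plus two dimension tables.
dim_product = pd.DataFrame({"product_id": [1, 2],
                            "category": ["Electronics", "Books"]})
dim_date = pd.DataFrame({"date_id": ["2025-05-01"], "year": [2025]})
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "date_id": ["2025-05-01", "2025-05-01", "2025-05-01"],
    "total": [10.0, 20.0, 5.0],
})

# Join the fact to its dimensions, then aggregate by dimension attributes.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_date, on="date_id")
          .groupby(["year", "category"])["total"].sum())
```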

Python Transformation OLTP to OLAP

# Python Transformation OLTP to OLAP, 1
# pip install pymysql psycopg2 pandas
import pymysql
import psycopg2
import pandas as pd

# MySQL (OLTP source)
mysql_conn = pymysql.connect(
    host='localhost',
    user='root',
    password='your_mysql_password',
    db='oltp_db'
)

# PostgreSQL (OLAP target)
pg_conn = psycopg2.connect(
    host='localhost',
    user='postgres',
    password='your_pg_password',
    dbname='olap_db'
)

pg_cursor = pg_conn.cursor()

# Extract data from the OLTP database
orders_df = pd.read_sql("SELECT * FROM orders", mysql_conn)
customers_df = pd.read_sql("SELECT * FROM customers", mysql_conn)
products_df = pd.read_sql("SELECT * FROM products", mysql_conn)


# Python Transformation OLTP to OLAP, 2
# Load dimension tables
def load_dim_customer():
    for _, row in customers_df.iterrows():
        pg_cursor.execute("""
            INSERT INTO dim_customer (customer_id, name, email)
            VALUES (%s, %s, %s)
            ON CONFLICT (customer_id) DO NOTHING;
        """, (row['customer_id'], row['name'], row['email']))

def load_dim_product():
    for _, row in products_df.iterrows():
        pg_cursor.execute("""
            INSERT INTO dim_product (product_id, name, category)
            VALUES (%s, %s, %s)
            ON CONFLICT (product_id) DO NOTHING;
        """, (row['product_id'], row['name'], row['category']))

def load_dim_date():
    dates = orders_df['order_date'].drop_duplicates()
    for date in dates:
        dt = pd.to_datetime(date)
        pg_cursor.execute("""
            INSERT INTO dim_date (date_id, day, month, year, quarter)
            VALUES (%s, %s, %s, %s, %s)
            ON CONFLICT (date_id) DO NOTHING;
        """, (
            dt.strftime('%Y-%m-%d'), dt.day, dt.month, dt.year, (dt.month - 1) // 3 + 1
        ))


# Python Transformation OLTP to OLAP, 3
# Load fact table
def load_fact_sales():
    for _, row in orders_df.iterrows():
        date_id = pd.to_datetime(row['order_date']).strftime('%Y-%m-%d')
        total_price = row['qty'] * row['price']
        pg_cursor.execute("""
            INSERT INTO fact_sales (order_id, date_id, customer_id, product_id,
                                    quantity, total_price)
            VALUES (%s, %s, %s, %s, %s, %s)
            ON CONFLICT (order_id) DO NOTHING;
        """, (
            row['order_id'], date_id, row['customer_id'], row['product_id'],
            row['qty'], total_price
        ))

# Run ETL steps
load_dim_customer()
load_dim_product()
load_dim_date()
load_fact_sales()

# Commit & close
pg_conn.commit()
pg_cursor.close()
pg_conn.close()
mysql_conn.close()

print("ETL completed successfully.")

MySQL Schema

CREATE TABLE customers (
 customer_id INT PRIMARY KEY,
 name VARCHAR(100),
 email VARCHAR(100)
);

CREATE TABLE products (
 product_id INT PRIMARY KEY,
 name VARCHAR(50),
 category VARCHAR(50)
);

CREATE TABLE orders (
 order_id INT PRIMARY KEY,
 customer_id INT,
 product_id INT,
 order_date DATE,
 qty INT,
 price DECIMAL(10,2),
 FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
 FOREIGN KEY (product_id) REFERENCES products(product_id)
);

PostgreSQL schema

CREATE TABLE dim_customer (
 customer_id INT PRIMARY KEY,
 name TEXT,
 email TEXT
);

CREATE TABLE dim_product (
 product_id INT PRIMARY KEY,
 name TEXT,
 category TEXT
);

CREATE TABLE dim_date (
 date_id DATE PRIMARY KEY,
 day INT, month INT, year INT, quarter INT
);

CREATE TABLE fact_sales (
 order_id INT PRIMARY KEY,
 date_id DATE REFERENCES dim_date(date_id),
 customer_id INT REFERENCES dim_customer(customer_id),
 product_id INT REFERENCES dim_product(product_id),
 quantity INT,
 total_price NUMERIC
);

MongoDB Embedded Schema

{
  "order_id": 1001,
  "date": "2025-05-01",
  "customer": {
    "customer_id": 42,
    "name": "Alice",
    "email": "alice@example.com"
  },
  "product": {
    "product_id": 12,
    "name": "USB Hub",
    "category": "Electronics"
  },
  "qty": 3,
  "price": 19.99,
  "total": 59.97
}

mongoimport --db your_db --collection fact_orders --file mongodb_star_schema_orders.json --jsonArray

db.fact_orders.aggregate([
  { $group: {
      _id: "$product.product_id",
      total_sales: { $sum: "$total" },
      total_quantity: { $sum: "$qty" }
  }},
  { $sort: { total_sales: -1 } }
]);

db.fact_orders.createIndex({ "product.product_id": 1 });