Data Modeling for Data Engineers

Types of Data

  • Non-Structured Data: Data that lacks a predefined model and is not organized (e.g., audio files, video files, NoSQL databases).
  • Semi-Structured Data: Data that does not have a predefined model but possesses some inherent structure, often self-describing through tags, markers, or hierarchies (e.g., JSON, XML).
  • Structured Data: Data organized in a tabular format with relationships between rows and columns, reliant on a data model such as relational databases or Excel files.

Data Sources

  • Devices
  • Databases
  • Files (JSON, XML, CSV)
  • Videos

Structured Data

  • Organized in tabular form (rows and columns).
  • Relies on a data model.
  • Examples: Relational Databases, Excel Files.

Non-Structured Data

  • Lacks a predefined model.
  • Not organized in a conventional manner.
  • Experiences large growth in volume.
  • Examples: Audio Files, Video Files, NoSQL Databases.

Semi-Structured Data

  • Lacks a predefined model.
  • Contains some form of structure, often self-describing.
  • Utilizes tags, markers, or hierarchies between elements.
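A minimal sketch of the self-describing property, using a JSON document (the record and field names are illustrative):

```python
import json

# A semi-structured record: no fixed schema, but the keys act as
# self-describing tags and the nesting expresses hierarchy.
raw = '{"name": "Alice", "contact": {"email": "alice@example.com"}, "tags": ["vip"]}'
doc = json.loads(raw)

# The structure is discoverable from the data itself.
print(doc["contact"]["email"])
```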

Definition of Data Modeling

  • Data modeling is the process of creating a visual representation of how data is structured, stored, and related within a system.
  • It defines data elements, their attributes, and the relationships between them.
  • It serves as a blueprint for organizing and managing data.

Key Components of Data Models

  • Entities: Represent objects or concepts (e.g., Customer, Order, Product).
  • Attributes: Characteristics or properties of entities (e.g., name, email, date, total).
  • Relationships: Connections between entities (e.g., Customer places Order, Order contains Product).

Data Model Types

  • Conceptual Data Model:
    • Purpose: High-level overview.
    • Target Audience: Business stakeholders.
    • Platform: Agnostic (independent of specific platforms).
  • Logical Data Model:
    • Purpose: Logical definition of data structures.
    • Target Audience: Data engineers, data architects, data analysts.
    • Platform: Agnostic.
  • Physical Data Model:
    • Purpose: Low-level detail of physical data design.
    • Target Audience: Database administrators, developers.
    • Platform: Specific (tailored to a particular database management system).

Conceptual Data Model Details

  • High-level representation of organizational data.
  • Focuses on what data is needed rather than how it will be implemented.

Creating a Conceptual Data Model (Example)

  • Entities: Stadium, Customer, Event, Attendee, Ticket
  • Relationships: A stadium organizes events; a customer interacts with a stadium; an attendee attends events; an event can be held at a stadium; a customer buys a ticket; a ticket is for an event.

Benefits of a Conceptual Data Model

  • Clear communication with stakeholders.
  • Early validation of requirements.
  • Improved planning.
  • Focus on business needs.
  • Technology independence.
  • Enhanced collaboration.
  • Supports documentation & training.

Steps to Create a Logical Data Model

  • Identify entities.
  • Identify candidate keys.
  • Choose primary keys.
  • Apply normalization.
  • Set relationships.
  • Identify relationship cardinality.
  • Iterate to fine-tune.
  • Still independent of technology (not tied to a specific DBMS).

Transformation from Conceptual to Logical Data Model (Example)

  • Conceptual Model: Stadium interacts with Customer; Stadium organizes Event; Attendee attends Event; Customer buys Ticket
  • Logical Model:
    • Stadium: StadiumID (PK), Stadium Name, Stadium Country, Stadium City
    • Customer: CustomerID (PK), Customer Name, Customer Country, Customer Address
    • Event: EventID (PK), Event Name, Event Date
    • Ticket: TicketID (PK), Ticket Price, Ticket Category

Benefits of a Logical Data Model

  • Clarity and structure.
  • Improved communication.
  • Early validation (iteration and fine-tuning), supporting an agile approach.
  • Foundation for the physical model (blueprint).
  • Technology neutral.
  • Supports normalization (avoid redundancy and anomalies).

Normalization

  • Normalization is the process of organizing data in a database.
  • It reduces data redundancy and improves data integrity.
  • It involves dividing large, complex tables into smaller, related tables.
  • Ensures each table contains only data directly related to its primary key.
  • Achieved through a series of normal forms (1NF, 2NF, 3NF, etc.).
  • Each normal form builds on the previous one to systematically eliminate anomalies in data insertion, update, and deletion.

Normalization Checks

  • 1NF (First Normal Form):
    • Atomic values only (each cell contains a single value).
    • No repeating groups.
    • Each record has a unique identifier.
  • 2NF (Second Normal Form):
    • Already in 1NF.
    • No partial dependency on a subset of the primary key (applies to composite keys).
  • 3NF (Third Normal Form):
    • Already in 2NF.
    • No transitive dependency between non-key attributes.
    • All non-key attributes depend only on the primary key.

ER Diagram of Registration

  • Students --Enrolls In--> StudentCourses <-- Courses
  • Tables: Students (StudentID), StudentCourses (StudentID, CourseID), Courses (CourseID, CourseName, InstructorName)
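The split behind this diagram can be sketched in Python with hypothetical enrollment data: course attributes depend only on CourseID, so they move out of the enrollment rows into their own table.

```python
# Denormalized enrollment rows: course name and instructor repeat
# for every enrolled student (illustrative data).
flat = [
    {"student_id": 1, "course_id": "C1", "course_name": "SQL", "instructor": "Kim"},
    {"student_id": 2, "course_id": "C1", "course_name": "SQL", "instructor": "Kim"},
    {"student_id": 1, "course_id": "C2", "course_name": "NoSQL", "instructor": "Lee"},
]

# 3NF split: course attributes depend only on CourseID, so they move to a
# Courses table; StudentCourses keeps just the key pairs.
courses = {row["course_id"]: (row["course_name"], row["instructor"]) for row in flat}
student_courses = [(row["student_id"], row["course_id"]) for row in flat]
```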

3rd Normal Form Explained

  • Builds upon 1NF: Requires atomic values in each cell and no repeating groups.
  • Builds upon 2NF: Removes partial dependencies on a composite primary key.
  • Removes transitive dependencies.

Entity Relationship Diagrams

  • Conceptual: Entities (Nouns), Relationships (Verbs)
  • Logical: Includes data types and constraints (e.g., Students (StudentID INT, Name VARCHAR(100) NN))

Tools to Draw E-R Diagrams

  • Desktop Apps:
    • MySQL Workbench: Free, built-in ERD tool for MySQL databases.
    • DBeaver: Open-source database IDE with ERD support for many DBMSs.
    • Oracle SQL Developer Data Modeler: Advanced ERD generation from schemas (Oracle DB focused).
    • pgModeler: Open-source PostgreSQL database modeler.
  • Web-Based Tools:
    • dbdiagram.io: Simple ERD syntax, supports SQL reverse-engineering.
    • Lucidchart: Drag-and-drop interface, templates, limited free tier.
    • Draw.io / diagrams.net: Free, offline/online, highly customizable.
    • QuickDBD: Type to diagram instantly for ER modeling.
    • SqlDBM: Online database modeling for SQL Server, MySQL, PostgreSQL, Snowflake.

E-R Diagrams Main Concepts

  • Entities: Real-world objects or concepts.
    • Strong: Can exist independently (e.g., Customer).
    • Weak: Depends on another entity for its identification (e.g., an Order Line, which cannot exist without its Order).
  • Attributes: Characteristics or properties of an entity.
    • Simple, Composite, Derived, Multivalued, Key.
  • Constraints: Rules that limit the type of data that can be stored in a table to ensure accuracy, consistency, and integrity.
  • Relationships: Associations between two or more entities.
    • One-to-One (1:1)
    • One-to-Many (1:N)
    • Many-to-Many (M:N)
  • Keys:
    • Primary Key: Uniquely identifies each instance.
    • Foreign Key: References the PK of another (or even the same) entity.

Physical Data Modeling

  • Choose the platform.
  • Translate logical entities into tables.
  • Establish relationships.
  • Apply Normalization/Denormalization.
  • Apply table Constraints.
  • Create Indexes and/or partitions.
  • Extend with Functions, Procedures, and Triggers.

Indexing Strategy

  • Workload type:
    • Row store for OLTP, column store for OLAP.
  • Filtering conditions:
    • Columns used in WHERE, HAVING, JOIN.
  • Column data type:
    • Use integers over text when possible.
  • Data selectivity:
    • Create index on a more selective column.
  • Column order:
    • The order of columns in a composite index matters: lead with the columns most often used to filter.
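A quick sketch of why column order matters, using SQLite's query planner (table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
# Composite index: region is the leading column.
conn.execute("CREATE INDEX idx_region_product ON sales (region, product)")

# Filtering on the leading column can use the index.
plan_lead = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE region = 'EU'"
).fetchone()[-1]

# Filtering only on the trailing column generally falls back to a full scan.
plan_trail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product = 'USB Hub'"
).fetchone()[-1]
```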

Table Partitioning

  • Dividing a table into smaller, more manageable parts, often based on date or other criteria.
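The routing idea behind date-based partitioning can be sketched as a function from a row's date key to a monthly partition (the partition naming scheme is illustrative):

```python
from datetime import date

# Range partitioning by month: each row lands in the partition
# covering its date key.
def partition_for(order_date: date) -> str:
    return f"orders_{order_date.year}_{order_date.month:02d}"

print(partition_for(date(2025, 5, 1)))  # orders_2025_05
```

Queries that filter on the partitioning key can then skip every partition outside the requested range.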

Data Modeling: SQL vs NoSQL Databases

  • SQL: Relational model with structured data in tables.
  • NoSQL: Non-relational models with flexible, schema-less structures (e.g., JSON documents).

Key Characteristics of NoSQL Databases

  • Does not use traditional relational models (i.e., tables and SQL).
  • Uses flexible, non-tabular structures to store and manage data.
  • Optimized for scalability, speed, and unstructured or semi-structured data.
    • Schema-less
    • Horizontally Scalable
    • High performance
  • Designed for big data & modern apps (real-time analytics, mobile, IoT, and cloud-native applications).

Types of NoSQL Databases

  • Document:
    • Description: Stores data as JSON-like documents.
    • Example: MongoDB, Couchbase, Firebase.
  • Key-Value:
    • Description: Simple key → value pairs.
    • Example: Redis, Amazon DynamoDB.
  • Column-Family:
    • Description: Data stored in column families (wide rows).
    • Example: Apache Cassandra, HBase.
  • Graph:
    • Description: Nodes and relationships (edges) for connected data.
    • Example: Neo4j, ArangoDB.

Document Type: Sample Order document in MongoDB

  • Illustrates the structure of an order document with nested customer, order status, and order lines, each containing product details.

Common use cases of Document Databases

  • CMS / Blogs: Flexible structure, nested metadata.
  • E-commerce products: Varying attributes, one document per item.
  • User profiles / accounts: Custom fields per user, embedded settings.
  • Real-time chat/feeds: Fast document reads/writes.
  • Shopping carts / orders: Self-contained, easily serialized.
  • Mobile/IoT data: Schema-less and sync-ready.
  • Rapid development: Easy to evolve without migrations.

Key-value Type: Sample Session document in Redis

  • Illustrates storing session data and shopping cart information as key-value pairs.
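A minimal stand-in for the same idea, using a plain Python dict in place of Redis (keys and field names are illustrative):

```python
# A key-value store maps an opaque key to a value blob; a dict stands in here.
store = {}

# Session data keyed by session ID.
store["session:abc123"] = {"user_id": 42, "logged_in": True}
# The shopping cart lives under its own key.
store["cart:abc123"] = [{"product_id": 12, "qty": 3}]

# Lookups are by exact key only; there are no joins or secondary filters.
session = store["session:abc123"]
```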

Column-Family Table Type: Example of user_activity document in Cassandra

  • Shows how user activity data is stored with row keys and clustering keys for efficient querying.

Common use cases of Column-Family

  • Time-series metrics: Wide rows with timestamped columns.
  • Logging and monitoring: High-throughput writes, fast range queries.
  • Real-time analytics: Scales horizontally for large-scale data.
  • Financial/event records: Write-optimized and append-friendly.

Graph Type: example (Neo4j-style)

  • Describes relationships between nodes (entities) using edges (relationships), with properties and labels to define attributes and categories.

Common use cases of Graph Databases

  • Social networks: Fast friend-of-friend lookups (e.g., User → FRIENDS_WITH → User).
  • Recommendations: Related items/users through multi-step paths (e.g., User → RATED → Movie).
  • Fraud detection: Pattern detection across accounts/transactions (e.g., Account → TRANSFERS → Account).
  • IT/network analysis: Topology, dependencies, and impact tracing.
  • Knowledge graphs: Rich, semantically connected data.
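The friend-of-friend lookup above can be sketched with an adjacency-set graph (users and edges are illustrative):

```python
# Adjacency sets: each user maps to the set of direct friends.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}

def friends_of_friends(user):
    """Users exactly two hops away: neighbors of neighbors,
    minus the user and their direct friends."""
    direct = friends.get(user, set())
    two_hop = set()
    for f in direct:
        two_hop |= friends.get(f, set())
    return two_hop - direct - {user}
```

Graph databases optimize exactly this kind of traversal, so multi-hop queries stay fast as the network grows.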

Advantages/Disadvantages of NoSQL Structures

  • Advantages:
    • Fast reads: All order data is in one document (no joins).
    • Loose or no relationships.
    • More scalable and flexible for unstructured or nested data.
  • Disadvantages:
    • Weaker data integrity.
    • No standard query language.
    • Complex joins are difficult or impossible.
    • No constraints / or harder to enforce.

Relational to Document Model: Model for Workloads!

  • User Requirements to understand the use of the system and the queries.
  • Identify and Quantify Entities.
  • Attributes.
  • Identify Reads and Writes (to decide which data will be stored together).
  • Quantify R/W (Frequency).

Referencing

  • Separate documents linked using a common key.
  • Documents may come from different collections.

Relationships

  • Embedding: Related data is stored together in a single document.
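The two options can be sketched with Python dicts standing in for documents (field names are illustrative):

```python
# Embedding: all review data lives inside the book document.
book_embedded = {
    "_id": "book1",
    "title": "Data Modeling 101",
    "reviews": [{"rating": 5, "text": "Great"}, {"rating": 4, "text": "Useful"}],
}

# Referencing: reviews are separate documents linked to the book by a common key.
book = {"_id": "book1", "title": "Data Modeling 101"}
reviews = [
    {"_id": "r1", "book_id": "book1", "rating": 5},
    {"_id": "r2", "book_id": "book1", "rating": 4},
]
# Resolving a reference is an extra lookup the embedded form avoids.
book_reviews = [r for r in reviews if r["book_id"] == book["_id"]]
```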

Bi-directional one-to-one references

  • Example of how to reference Publisher data from a Headquarters document and vice versa.

One-to-Many Embedded

  • Shows how reviews can be embedded within a book document, so all review data is stored directly inside it.

One-to-many references

  • Illustrates how to reference review documents from a book document, maintaining separate collections for books and reviews.

One-to-Many Bi-directional

  • Demonstrates a bi-directional relationship where books reference reviews and reviews reference books.

Many-to-Many Embedding

  • Shows how to embed author information within book documents when multiple authors can write a book.

Many-To-Many Referencing

  • Illustrates how books and authors can reference each other using arrays of IDs.

Many-to-many Reversed

  • Shows the same relationship modeled from the reverse direction, with each author holding an array of book IDs.

Ask Questions

  • Guidelines and results based on considerations such as Simplicity, Query Atomicity, and Cardinality.

Schema Validation

  • Allows you to enforce rules about the structure of documents.
  • Define them when you create a collection.
  • Or add them to an existing collection.
  • Validations have:
    • Rules
    • Levels
      • Strict (validates all inserts and updates)
      • Moderate (validates inserts, and updates to documents that already pass validation)
    • Actions
      • Error (reject the operation)
      • Warn (allow it to proceed, but log the violation)
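A sketch of such a rule as a MongoDB `$jsonSchema` validator, written as the Python dict you would pass to pymongo's `create_collection` (collection and field names are illustrative; applying it requires a running MongoDB):

```python
# Validator: documents must carry an integer order_id and a non-negative total.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["order_id", "total"],
        "properties": {
            "order_id": {"bsonType": "int"},
            "total": {"bsonType": "double", "minimum": 0},
        },
    }
}

# db.create_collection("orders", validator=validator,
#                      validationLevel="moderate",  # new docs + existing valid docs
#                      validationAction="error")    # reject invalid operations
```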

What is OLAP?

  • OLAP (Online Analytical Processing) is a category of database systems and techniques designed to support complex data analysis and reporting, typically for business intelligence.
  • Allows users to analyze large volumes of data quickly from multiple perspectives.
  • Focuses on aggregations, trends, comparisons, and historical analysis, rather than real-time transactions. Think of it as “data exploration”, not data entry.

Key Characteristics of OLAP

  • Multidimensional analysis: Data is modeled as cubes with dimensions (e.g., time, region, product).
  • Aggregations: Performs summaries like SUM, AVG, COUNT efficiently.
  • Slice-and-dice: View data from different angles (e.g., sales by product, then by region).
  • Drill-down / drill-up: Explore data at different levels of granularity.
  • Read-intensive: Optimized for querying, not for frequent updates.
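Slice-and-dice and drill-down can be sketched in pandas on a tiny sales cube (the data is illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "product": ["Hub", "Cable", "Hub", "Cable"],
    "amount": [100, 50, 200, 80],
})

# Slice-and-dice: aggregate along one dimension, then another.
by_region = sales.groupby("region")["amount"].sum()
by_product = sales.groupby("product")["amount"].sum()

# Drill-down: from region totals to region x product granularity.
cube = sales.pivot_table(index="region", columns="product",
                         values="amount", aggfunc="sum")
```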

Databases and Tools that use OLAP:

  • Databases:
    • Microsoft SQL Server Analysis Services (SSAS)
    • Oracle OLAP
    • IBM Cognos
    • SAP BW/4HANA
  • Tools:
    • Tableau
    • Power BI (as front ends using OLAP cubes or tabular models)

OLAP Workflow

  • Data Ingestion -> Data Transformation -> Data Analysis

Data Ingestion

  • Capture raw data from multiple different sources into a repository.

Data Analysis

  • Query data and visualize it using tools like Microsoft Power BI, Tableau, Qlik, Looker.

OLTP vs OLAP

  • OLTP
    • A large number of small transactions
    • WRITE operations (Insert, Update)
    • Industry-specific (retail, banking)
    • Transactions as source
    • Increase productivity of end-users
    • Normalized databases for efficiency
  • OLAP
    • Large volumes of data
    • READ operations (Select)
    • Subject-specific (sales, marketing)
    • Aggregated data from transactions
    • Increase productivity of analysts and executives
    • Denormalized databases for analysis

Choose an Optimal Data Modeling Strategy

  • Analyze key characteristics of your data
  • Understand the data structure
    • Schema-on-read vs. Schema-on-write
  • Understand data architecture requirements
    • Centralized or decentralized
    • Batch-oriented vs. Real-time processing
  • Determine business needs
    • Key requirements and constraints
    • Reporting needs of stakeholders
  • Understand technology stack
    • Incorporate into the existing stack?
  • Choose the appropriate modeling approach
    • Relational vs. Non-relational vs. Dimensional…Or a combination?

Data Warehouse Benefits for Business Intelligence

  • Data consolidation
  • Improved data quality and consistency
  • Historical data analysis
  • Performance and speed
  • Support for complex queries
  • Enhanced decision-making
  • Security
  • Scalability

Types of Analysis Tools

  • Reporting Tools
  • Data Mining Tools
  • OLAP Tools
  • Predictive Analytics Tools
  • Dashboards

Data Warehouse Design

  • Dimensional Modeling
    • Star Schemas
    • Snowflake Schemas

Dimensional Modeling

  • Facts (numeric, transactional)
  • Dimensions (descriptive attributes)

Star Schema

  • A fact table surrounded by dimension tables.

Snowflake Schema Example

  • A refinement of the star schema where dimensional tables are normalized into multiple related tables.
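The typical analytical query against either schema joins the fact table to its dimensions and then aggregates; a pandas sketch with illustrative data:

```python
import pandas as pd

# Tiny star schema: one fact table plus two dimension tables.
dim_product = pd.DataFrame({"product_id": [1, 2],
                            "category": ["Electronics", "Books"]})
dim_date = pd.DataFrame({"date_id": ["2025-05-01"], "year": [2025]})
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "date_id": ["2025-05-01", "2025-05-01", "2025-05-01"],
    "total": [10.0, 20.0, 5.0],
})

# Join the fact to its dimensions, then aggregate by dimension attributes.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_date, on="date_id")
          .groupby(["year", "category"])["total"].sum())
```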

Python Transformation OLTP to OLAP

# Python Transformation OLTP to OLAP, 1
# pip install pymysql psycopg2 pandas
import pymysql
import psycopg2
import pandas as pd

# MySQL (OLTP source)
mysql_conn = pymysql.connect(
    host='localhost',
    user='root',
    password='your_mysql_password',
    db='oltp_db'
)

# PostgreSQL (OLAP target)
pg_conn = psycopg2.connect(
    host='localhost',
    user='postgres',
    password='your_pg_password',
    dbname='olap_db'
)

pg_cursor = pg_conn.cursor()

# Extract data from the OLTP database
orders_df = pd.read_sql("SELECT * FROM orders", mysql_conn)
customers_df = pd.read_sql("SELECT * FROM customers", mysql_conn)
products_df = pd.read_sql("SELECT * FROM products", mysql_conn)


# Python Transformation OLTP to OLAP, 2
# Load dimension tables
def load_dim_customer():
    for _, row in customers_df.iterrows():
        pg_cursor.execute("""
            INSERT INTO dim_customer (customer_id, name, email)
            VALUES (%s, %s, %s)
            ON CONFLICT (customer_id) DO NOTHING;
        """, (row['customer_id'], row['name'], row['email']))

def load_dim_product():
    for _, row in products_df.iterrows():
        pg_cursor.execute("""
            INSERT INTO dim_product (product_id, name, category)
            VALUES (%s, %s, %s)
            ON CONFLICT (product_id) DO NOTHING;
        """, (row['product_id'], row['name'], row['category']))

def load_dim_date():
    dates = orders_df['order_date'].drop_duplicates()
    for date in dates:
        dt = pd.to_datetime(date)
        pg_cursor.execute("""
            INSERT INTO dim_date (date_id, day, month, year, quarter)
            VALUES (%s, %s, %s, %s, %s)
            ON CONFLICT (date_id) DO NOTHING;
        """, (
            dt.strftime('%Y-%m-%d'), dt.day, dt.month, dt.year, (dt.month - 1) // 3 + 1
        ))


# Python Transformation OLTP to OLAP, 3
# Load fact table
def load_fact_sales():
    for _, row in orders_df.iterrows():
        date_id = pd.to_datetime(row['order_date']).strftime('%Y-%m-%d')
        total_price = row['qty'] * row['price']
        pg_cursor.execute("""
            INSERT INTO fact_sales (order_id, date_id, customer_id, product_id,
                                    quantity, total_price)
            VALUES (%s, %s, %s, %s, %s, %s)
            ON CONFLICT (order_id) DO NOTHING;
        """, (
            row['order_id'], date_id, row['customer_id'], row['product_id'],
            row['qty'], total_price
        ))

# Run ETL steps
load_dim_customer()
load_dim_product()
load_dim_date()
load_fact_sales()

# Commit & close
pg_conn.commit()
pg_cursor.close()
pg_conn.close()
mysql_conn.close()

print("ETL completed successfully.")

MySQL Schema

CREATE TABLE customers (
 customer_id INT PRIMARY KEY,
 name VARCHAR(100),
 email VARCHAR(100)
);

CREATE TABLE products (
 product_id INT PRIMARY KEY,
 name VARCHAR(50),
 category VARCHAR(50)
);

CREATE TABLE orders (
 order_id INT PRIMARY KEY,
 customer_id INT,
 product_id INT,
 order_date DATE,
 qty INT,
 price DECIMAL(10,2),
 FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
 FOREIGN KEY (product_id) REFERENCES products(product_id)
);

PostgreSQL schema

CREATE TABLE dim_customer (
 customer_id INT PRIMARY KEY,
 name TEXT,
 email TEXT
);

CREATE TABLE dim_product (
 product_id INT PRIMARY KEY,
 name TEXT,
 category TEXT
);

CREATE TABLE dim_date (
 date_id DATE PRIMARY KEY,
 day INT, month INT, year INT, quarter INT
);

CREATE TABLE fact_sales (
 order_id INT PRIMARY KEY,
 date_id DATE REFERENCES dim_date(date_id),
 customer_id INT REFERENCES dim_customer(customer_id),
 product_id INT REFERENCES dim_product(product_id),
 quantity INT,
 total_price NUMERIC
);

MongoDB Embedded Schema

{
  "order_id": 1001,
  "date": "2025-05-01",
  "customer": {
    "customer_id": 42,
    "name": "Alice",
    "email": "alice@example.com"
  },
  "product": {
    "product_id": 12,
    "name": "USB Hub",
    "category": "Electronics"
  },
  "qty": 3,
  "price": 19.99,
  "total": 59.97
}

mongoimport --db your_db --collection fact_orders --file mongodb_star_schema_orders.json --jsonArray

db.fact_orders.aggregate([
  { $group: {
      _id: "$product.product_id",
      total_sales: { $sum: "$total" },
      total_quantity: { $sum: "$qty" }
  }},
  { $sort: { total_sales: -1 } }
]);

db.fact_orders.createIndex({ "product.product_id": 1 });