Types of Data
- Non-Structured Data: Data that lacks a predefined model and is not organized (e.g., audio files, video files, NoSQL databases).
- Semi-Structured Data: Data that does not have a predefined model but possesses some inherent structure, often self-describing through tags, markers, or hierarchies (e.g., JSON, XML).
- Structured Data: Data organized in a tabular format with relationships between rows and columns, reliant on a data model such as relational databases or Excel files.
Data Sources
- Devices
- Databases
- Files (JSON, XML, CSV)
- Videos
Structured Data
- Organized in tabular form (rows and columns).
- Relies on a data model.
- Examples: Relational Databases, Excel Files.
Non-Structured Data
- Lacks a predefined model.
- Not organized in a conventional manner.
- Experiences large growth in volume.
- Examples: Audio Files, Video Files, NoSQL Databases.
Semi-Structured Data
- Lacks a predefined model.
- Contains some form of structure, often self-describing.
- Utilizes tags, markers, or hierarchies between elements.
Definition of Data Modeling
- Data modeling is the process of creating a visual representation of how data is structured, stored, and related within a system.
- It defines data elements, their attributes, and the relationships between them.
- It serves as a blueprint for organizing and managing data.
Key Components of Data Models
- Entities: Represent objects or concepts (e.g., Customer, Order, Product).
- Attributes: Characteristics or properties of entities (e.g., name, email, date, total).
- Relationships: Connections between entities (e.g., Customer places Order, Order contains Product).
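The three components map directly onto code; a minimal Python sketch using the Customer/Order/Product names from the examples above (all values invented):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Product:                  # Entity
    name: str                   # Attribute
    price: float                # Attribute

@dataclass
class Customer:                 # Entity
    name: str
    email: str

@dataclass
class Order:                    # Entity
    customer: Customer          # Relationship: Customer places Order
    products: List[Product] = field(default_factory=list)  # Order contains Product

    def total(self) -> float:
        return sum(p.price for p in self.products)

alice = Customer("Alice", "alice@example.com")
order = Order(alice, [Product("USB Hub", 19.99), Product("Mouse", 9.99)])
print(round(order.total(), 2))  # 29.98
```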
Data Model Types
- Conceptual Data Model:
- Purpose: High-level overview.
- Target Audience: Business stakeholders.
- Platform: Agnostic (independent of specific platforms).
- Logical Data Model:
- Purpose: Logical definition of data structures.
- Target Audience: Data engineers, data architects, data analysts.
- Platform: Agnostic.
- Physical Data Model:
- Purpose: Low-level detail of physical data design.
- Target Audience: Database administrators, developers.
- Platform: Specific (tailored to a particular database management system).
Conceptual Data Model Details
- High-level representation of organizational data.
- Focuses on what data is needed rather than how it will be implemented.
Creating a Conceptual Data Model (Example)
- Entities: Stadium, Customer, Event, Attendee, Ticket
- Relationships: A stadium organizes events; a customer interacts with a stadium; an attendee attends events; an event can be held at a stadium; a customer buys a ticket; a ticket contains an event.
Benefits of a Conceptual Data Model
- Clear communication with stakeholders.
- Early validation of requirements.
- Improved planning.
- Focus on business needs.
- Technology independence.
- Enhanced collaboration.
- Supports documentation & training.
Steps to Create a Logical Data Model
- Identify entities.
- Identify candidate keys.
- Choose primary keys.
- Apply normalization.
- Set relationships.
- Identify relationship cardinality.
- Iterate to fine-tune.
- Still independent of technology (not tied to a specific DBMS).
- Conceptual Model: Stadium organizes Event; Customer interacts with Stadium; Attendee attends Event; Customer buys Ticket; Ticket contains Event
- Logical Model:
- Stadium: StadiumID (PK), Stadium Name, Stadium Country, Stadium City
- Customer: CustomerID (PK), Customer Name, Customer Country, Customer Address
- Event: EventID (PK), Event Name, Event Date
- Ticket: TicketID (PK), Ticket Price, Ticket Category
Benefits of a Logical Data Model
- Clarity and structure.
- Improved communication.
- Early validation (iteration and fine-tuning), supporting an agile approach.
- Foundation for the physical model (blueprint).
- Technology neutral.
- Supports normalization (avoid redundancy and anomalies).
Normalization
- Normalization is the process of organizing data in a database.
- It reduces data redundancy and improves data integrity.
- It involves dividing large, complex tables into smaller, related tables.
- Ensures each table contains only data directly related to its primary key.
- Achieved through a series of normal forms (1NF, 2NF, 3NF, etc.).
- Each normal form builds on the previous one to systematically eliminate anomalies in data insertion, update, and deletion.
Normalization Checks
- 1NF (First Normal Form):
- Atomic values only (each cell contains a single value).
- No repeating groups.
- Each record has a unique identifier.
- 2NF (Second Normal Form):
- Already in 1NF.
- No partial dependency on a subset of the primary key (applies to composite keys).
- 3NF (Third Normal Form):
- Already in 2NF.
- No transitive dependency between non-key attributes.
- All non-key attributes depend only on the primary key.
ER Diagram of Registration
- Students --Enrolls In--> StudentCourses <-- Courses
- Students (StudentID)
- StudentCourses (StudentID, CourseID)
- Courses (CourseID, CourseName, InstructorName)
- Builds upon 1NF: Requires atomic values in each cell and no repeating groups.
- Builds upon 2NF: Removes partial dependencies on a composite primary key.
- Removes transitive dependencies.
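The registration schema above can be exercised directly; a sketch using Python's built-in sqlite3, with table and column names following the diagram (the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Courses  (CourseID INTEGER PRIMARY KEY, CourseName TEXT, InstructorName TEXT);
-- Junction table resolves the many-to-many Students <-> Courses relationship
CREATE TABLE StudentCourses (
    StudentID INTEGER REFERENCES Students(StudentID),
    CourseID  INTEGER REFERENCES Courses(CourseID),
    PRIMARY KEY (StudentID, CourseID)   -- composite key, no partial dependencies
);
INSERT INTO Students VALUES (1, 'Alice');
INSERT INTO Courses  VALUES (10, 'Databases', 'Dr. Smith');
INSERT INTO StudentCourses VALUES (1, 10);
""")
rows = cur.execute("""
    SELECT s.Name, c.CourseName
    FROM StudentCourses sc
    JOIN Students s ON s.StudentID = sc.StudentID
    JOIN Courses  c ON c.CourseID  = sc.CourseID
""").fetchall()
print(rows)  # [('Alice', 'Databases')]
```

Note that InstructorName lives only in Courses, so no non-key attribute depends transitively on another non-key attribute.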
Entity Relationship Diagrams
- Conceptual: Entities (Nouns), Relationships (Verbs)
- Logical: Includes data types and constraints (e.g., Students (StudentID INT, Name VARCHAR(100) NN))
- Desktop Apps:
- MySQL Workbench: Free, built-in ERD tool for MySQL databases.
- DBeaver: Open-source database IDE with ERD support for many DBMSs.
- Oracle SQL Developer Data Modeler: Advanced ERD generation from schemas (Oracle DB focused).
- pgModeler: Open-source PostgreSQL database modeler.
- Web-Based Tools:
- dbdiagram.io: Simple ERD syntax, supports SQL reverse-engineering.
- Lucidchart: Drag-and-drop interface, templates, limited free tier.
- Draw.io / diagrams.net: Free, offline/online, highly customizable.
- QuickDBD: Generates ER diagrams instantly from typed text.
- SqlDBM: Online database modeling for SQL Server, MySQL, PostgreSQL, Snowflake.
E-R Diagrams Main Concepts
- Entities: Real-world objects or concepts.
- Strong: Can exist independently (e.g., Customer).
- Weak: Depends on another entity for its identity (e.g., an Order Line, which cannot exist without its Order).
- Attributes: Characteristics or properties of an entity.
- Simple, Composite, Derived, Multivalued, Key.
- Constraints: Rules that limit the type of data that can be stored in a table to ensure accuracy, consistency, and integrity.
- Relationships: Associations between two or more entities.
- One-to-One (1:1)
- One-to-Many (1:N)
- Many-to-Many (M:N)
- Keys:
- Primary Key: Uniquely identifies each instance.
- Foreign Key: References the PK of another (or even the same) entity.
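A foreign key referencing its own table (the "or the same" case above) is how hierarchies are modeled; a hypothetical employee/manager sketch in sqlite3 (names and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Employee (
    EmployeeID INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL,
    ManagerID  INTEGER REFERENCES Employee(EmployeeID)  -- FK into the same table
);
INSERT INTO Employee VALUES (1, 'Carol', NULL);   -- top of the hierarchy
INSERT INTO Employee VALUES (2, 'Dave', 1);       -- reports to Carol
""")
pairs = cur.execute("""
    SELECT e.Name, m.Name
    FROM Employee e
    LEFT JOIN Employee m ON e.ManagerID = m.EmployeeID
    ORDER BY e.EmployeeID
""").fetchall()
print(pairs)  # [('Carol', None), ('Dave', 'Carol')]
```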
Physical Data Modeling
- Choose the platform.
- Translate logical entities into tables.
- Establish relationships.
- Apply Normalization/Denormalization.
- Apply table Constraints.
- Create Indexes and/or partitions.
- Extend with Functions, Procedures, and Triggers.
Indexing Strategy
- Workload type:
- Row store for OLTP, column store for OLAP.
- Filtering conditions:
- Columns used in WHERE, HAVING, JOIN.
- Column data type:
- Use integers over text when possible.
- Data selectivity:
- Create index on a more selective column.
- Column order:
- In composite indexes, place the most frequently filtered and most selective columns first; queries can only use a leftmost prefix of the index.
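These considerations can be checked empirically; a sqlite3 sketch (table and data invented; other engines expose similar EXPLAIN output) showing that a filter on the indexed, selective column uses the index rather than a full scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(i, i % 1000, 'open' if i % 2 else 'closed') for i in range(10000)])

# Index the selective filtering column (customer_id: ~1000 distinct values),
# not the low-selectivity one (status: only 2 distinct values).
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

plan = cur.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall()
print(plan)  # the plan mentions idx_orders_customer, i.e. an index search
```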
Table Partitioning
- Dividing a table into smaller, more manageable parts, often based on date or other criteria.
Data Modeling: SQL vs NoSQL Databases
- SQL: Relational model with structured data in tables.
- NoSQL: Non-relational models with a flexible, schema-less structure (e.g., JSON documents).
Key Characteristics of NoSQL Databases
- Does not use traditional relational models (i.e., tables and SQL).
- Uses flexible, non-tabular structures to store and manage data.
- Optimized for scalability, speed, and unstructured or semi-structured data.
- Schema-less
- Horizontally scalable
- High performance
- Designed for big data & modern apps (real-time analytics, mobile, IoT, and cloud-native applications).
Types of NoSQL Databases
- Document:
- Description: Stores data as JSON-like documents.
- Example: MongoDB, Couchbase, Firebase.
- Key-Value:
- Description: Simple key → value pairs.
- Example: Redis, Amazon DynamoDB.
- Column-Family:
- Description: Data stored in column families (wide rows).
- Example: Apache Cassandra, HBase.
- Graph:
- Description: Nodes and relationships (edges) for connected data.
- Example: Neo4j, ArangoDB.
Document Type: Sample Order document in MongoDB
- Illustrates the structure of an order document with nested customer, order status, and order lines, each containing product details.
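Such a document maps naturally onto a nested Python dict; a hypothetical sketch (field names and values invented, not taken from the slide):

```python
order = {
    "order_id": 1001,
    "status": "shipped",
    "customer": {"customer_id": 42, "name": "Alice"},
    "order_lines": [  # each order line embeds its product details
        {"product": {"product_id": 12, "name": "USB Hub"}, "qty": 3, "price": 19.99},
        {"product": {"product_id": 7, "name": "Mouse"}, "qty": 1, "price": 9.99},
    ],
}
# The whole order is read (or written) in one document access -- no joins
order["total"] = round(sum(l["qty"] * l["price"] for l in order["order_lines"]), 2)
print(order["total"])  # 69.96
```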
Common use cases of Document Databases
- CMS / Blogs: Flexible structure, nested metadata.
- E-commerce products: Varying attributes, one document per item.
- User profiles / accounts: Custom fields per user, embedded settings.
- Real-time chat/feeds: Fast document reads/writes.
- Shopping carts / orders: Self-contained, easily serialized.
- Mobile/IoT data: Schema-less and sync-ready.
- Rapid development: Easy to evolve without migrations.
Key-value Type: Sample Session document in Redis
- Illustrates storing session data and shopping cart information as key-value pairs.
Column-Family Table Type: Example of user_activity document in Cassandra
- Shows how user activity data is stored with row keys and clustering keys for efficient querying.
Common use cases of Column-Family
- Time-series metrics: Wide rows with timestamped columns.
- Logging and monitoring: High-throughput writes, fast range queries.
- Real-time analytics: Scales horizontally for large-scale data.
- Financial/event records: Write-optimized and append-friendly.
Graph Type: example (Neo4j-style)
- Describes relationships between nodes (entities) using edges (relationships), with properties and labels to define attributes and categories.
Common use cases of Graph Databases
- Social networks: Fast friend-of-friend lookups (e.g., User → FRIENDS_WITH → User).
- Recommendations: Related items/users through multi-step paths (e.g., User → RATED → Movie).
- Fraud detection: Pattern detection across accounts/transactions (e.g., Account → TRANSFERS → Account).
- IT/network analysis: Topology, dependencies, and impact tracing.
- Knowledge graphs: Rich, semantically connected data.
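The friend-of-friend lookup above can be illustrated without a graph database; a minimal adjacency-dict sketch in Python (users and edges invented):

```python
# FRIENDS_WITH edges stored as an adjacency dict
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dan"},
    "carol": {"alice"},
    "dan":   {"bob"},
}

def friends_of_friends(user):
    # Two-hop neighbours, excluding the user and their direct friends
    direct = friends[user]
    return {fof for f in direct for fof in friends[f]} - direct - {user}

print(friends_of_friends("alice"))  # {'dan'}
```

A graph database makes such traversals fast at depth because edges are stored as direct links, rather than being recomputed through joins.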
Advantages/Disadvantages of NoSQL Structures
- Advantages:
- Fast reads: All order data is in one document (no joins).
- Loose or no relationships.
- More scalable and flexible for unstructured or nested data.
- Disadvantages:
- Weaker data integrity.
- No standard query language.
- Complex joins are difficult or impossible.
- Constraints are absent or harder to enforce.
Relational to Document Model: Model for Workloads!
- Gather user requirements to understand how the system is used and which queries it must serve.
- Identify and quantify entities.
- Identify attributes.
- Identify reads and writes (to decide which data should be stored together).
- Quantify read/write frequency.
Relationships
- Embedding: Related data is stored together in a single document.
- Referencing: Separate documents are linked by a common key; the documents may live in different collections.
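The embedding/referencing choice can be sketched with plain Python dicts; a hypothetical book/reviews example (all field names and values invented):

```python
# Embedding: reviews live inside the book document
book_embedded = {
    "_id": "book1",
    "title": "Data Modeling 101",
    "reviews": [
        {"user": "alice", "rating": 5},
        {"user": "bob", "rating": 4},
    ],
}

# Referencing: reviews are separate documents linked by a common key (book_id)
book = {"_id": "book1", "title": "Data Modeling 101"}
reviews = [
    {"_id": "r1", "book_id": "book1", "user": "alice", "rating": 5},
    {"_id": "r2", "book_id": "book1", "user": "bob", "rating": 4},
]

# Re-assembling the referenced form needs an application-side "join"
linked = [r for r in reviews if r["book_id"] == book["_id"]]
print(len(linked))  # 2
```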
Bi-directional one-to-one references
- Example of how to reference Publisher data from a Headquarters document and vice versa.
One-to-Many Embedded
- Shows how reviews can be embedded within a book document, storing all review data directly within the book document.
One-to-many references
- Illustrates how to reference review documents from a book document, maintaining separate collections for books and reviews.
One-to-Many Bi-directional
- Demonstrates a bi-directional relationship where books reference reviews and reviews reference books.
Many-to-Many Embedding
- Shows how to embed author information within book documents when multiple authors can write a book.
Many-To-Many Referencing
- Illustrates how books and authors can reference each other using arrays of IDs.
Many-to-Many Reversed
- Shows the same many-to-many relationship modeled from the opposite direction, with the reference arrays held by the other collection.
Ask Questions
- Choose between embedding and referencing by asking questions about considerations such as Simplicity, Query Atomicity, and Cardinality.
Schema Validation
- Lets you enforce rules about the structure of documents
- Define them when you create a collection
- Add to existing collection
- Validations have:
- Rules
- Levels
- Strict (validates all inserts and updates)
- Moderate (validates inserts, and updates only to documents that already match the schema)
- Actions
- Error (reject operation)
- Warn (allow to proceed, log action)
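In MongoDB these rules are usually expressed as a $jsonSchema document supplied when the collection is created; a sketch of such a validator as a Python dict, with hypothetical field names (with pymongo it would be passed via db.create_collection("orders", validator=..., validationLevel=..., validationAction=...)):

```python
# Hypothetical $jsonSchema validator for an "orders" collection
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["order_id", "total"],
        "properties": {
            "order_id": {"bsonType": "int"},
            "total": {"bsonType": "double", "minimum": 0},
        },
    }
}
# Levels and actions map onto collection options:
#   validationLevel:  "strict" or "moderate"
#   validationAction: "error" or "warn"
options = {
    "validator": validator,
    "validationLevel": "strict",
    "validationAction": "error",
}
print(sorted(options))  # ['validationAction', 'validationLevel', 'validator']
```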
What is OLAP?
- OLAP (Online Analytical Processing) is a category of database systems and techniques designed to support complex data analysis and reporting, typically for business intelligence.
- Allows users to analyze large volumes of data quickly from multiple perspectives.
- Focuses on aggregations, trends, comparisons, and historical analysis, rather than real-time transactions. Think of it as “data exploration”, not data entry.
Key Characteristics of OLAP
- Multidimensional analysis: Data is modeled as cubes with dimensions (e.g., time, region, product).
- Aggregations: Performs summaries such as SUM, AVG, and COUNT efficiently.
- Slice-and-dice: View data from different angles (e.g., sales by product, then by region).
- Drill-down/drill-up: Explore data at different levels of granularity.
- Read-intensive: Optimized for querying, not for frequent updates.
- Databases:
- Microsoft SQL Server Analysis Services (SSAS)
- Oracle OLAP
- IBM Cognos
- SAP BW/4HANA
- Tools:
- Tableau
- Power BI (as front ends using OLAP cubes or tabular models)
OLAP Workflow
- Data Ingestion -> Data Transformation -> Data Analysis
Data Ingestion
- Capture raw data from multiple different sources into a repository.
Data Transformation
- Clean, join, and aggregate the raw data into analysis-ready structures (e.g., star schemas).
Data Analysis
- Query data and visualize it using tools like Microsoft Power BI, Tableau, Qlik, Looker.
OLTP vs OLAP
- OLTP
- A large number of small transactions
- WRITE operations (Insert, Update)
- Industry-specific (retail, banking)
- Transactions as source
- Increase productivity of end-users
- Normalized databases for efficiency
- OLAP
- Large volumes of data
- READ operations (Select)
- Subject-specific (sales, marketing)
- Aggregated data from transactions
- Increase productivity of analysts and executives
- Denormalized databases for analysis
Choose Optimal Data Modelling Strategy
- Analyze key characteristics of your data
- Understand the data structure
- Schema-on-read vs. Schema-on-write
- Understand data architecture requirements
- Centralized or decentralized
- Batch-oriented vs. Real-time processing
- Determine business needs
- Key requirements and constraints
- Reporting needs of stakeholders
- Understand technology stack
- Incorporate into the existing stack?
- Choose the appropriate modeling approach
- Relational vs. Non-relational vs. Dimensional…Or a combination?
Data Warehouse Benefits for Business Intelligence
- Data consolidation
- Improved data quality and consistency
- Historical data analysis
- Performance and speed
- Support for complex queries
- Enhanced decision-making
- Security
- Scalability
- Reporting Tools
- Data Mining Tools
- OLAP Tools
- Predictive Analytics Tools
- Dashboards
Data Warehouse Design
- Dimensional Modeling
- Star Schemas
- Snowflake Schemas
Dimensional Modeling
- Facts (numeric, transactional)
- Dimensions (descriptive attributes)
Star Schema
- A fact table surrounded by dimension tables.
Snowflake Schema Example
- A refinement of the star schema where dimensional tables are normalized into multiple related tables.
# Python Transformation OLTP to OLAP, 1
# Prerequisites: pip install pymysql psycopg2 pandas
import pymysql
import psycopg2
import pandas as pd

# MySQL (OLTP source)
mysql_conn = pymysql.connect(
    host='localhost',
    user='root',
    password='your_mysql_password',
    db='oltp_db'
)

# PostgreSQL (OLAP target)
pg_conn = psycopg2.connect(
    host='localhost',
    user='postgres',
    password='your_pg_password',
    dbname='olap_db'
)
pg_cursor = pg_conn.cursor()

# Extract data from OLTP
orders_df = pd.read_sql("SELECT * FROM orders", mysql_conn)
customers_df = pd.read_sql("SELECT * FROM customers", mysql_conn)
products_df = pd.read_sql("SELECT * FROM products", mysql_conn)

# Python Transformation OLTP to OLAP, 2
# Load dimension tables
def load_dim_customer():
    for _, row in customers_df.iterrows():
        pg_cursor.execute("""
            INSERT INTO dim_customer (customer_id, name, email)
            VALUES (%s, %s, %s)
            ON CONFLICT (customer_id) DO NOTHING;
        """, (row['customer_id'], row['name'], row['email']))

def load_dim_product():
    for _, row in products_df.iterrows():
        pg_cursor.execute("""
            INSERT INTO dim_product (product_id, name, category)
            VALUES (%s, %s, %s)
            ON CONFLICT (product_id) DO NOTHING;
        """, (row['product_id'], row['name'], row['category']))

def load_dim_date():
    dates = orders_df['order_date'].drop_duplicates()
    for date in dates:
        dt = pd.to_datetime(date)
        pg_cursor.execute("""
            INSERT INTO dim_date (date_id, day, month, year, quarter)
            VALUES (%s, %s, %s, %s, %s)
            ON CONFLICT (date_id) DO NOTHING;
        """, (
            dt.strftime('%Y-%m-%d'), dt.day, dt.month, dt.year, (dt.month - 1) // 3 + 1
        ))

# Python Transformation OLTP to OLAP, 3
# Load fact table
def load_fact_sales():
    for _, row in orders_df.iterrows():
        date_id = pd.to_datetime(row['order_date']).strftime('%Y-%m-%d')
        total_price = row['qty'] * row['price']
        pg_cursor.execute("""
            INSERT INTO fact_sales (order_id, date_id, customer_id, product_id,
                                    quantity, total_price)
            VALUES (%s, %s, %s, %s, %s, %s)
            ON CONFLICT (order_id) DO NOTHING;
        """, (
            row['order_id'], date_id, row['customer_id'], row['product_id'],
            row['qty'], total_price
        ))

# Run ETL steps
load_dim_customer()
load_dim_product()
load_dim_date()
load_fact_sales()

# Commit & close
pg_conn.commit()
pg_cursor.close()
pg_conn.close()
mysql_conn.close()
print("ETL completed successfully.")
MySQL Schema
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100)
);
CREATE TABLE products (
product_id INT PRIMARY KEY,
name VARCHAR(50),
category VARCHAR(50)
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
product_id INT,
order_date DATE,
qty INT,
price DECIMAL(10,2),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
FOREIGN KEY (product_id) REFERENCES products(product_id)
);
PostgreSQL schema
CREATE TABLE dim_customer (
customer_id INT PRIMARY KEY,
name TEXT,
email TEXT
);
CREATE TABLE dim_product (
product_id INT PRIMARY KEY,
name TEXT,
category TEXT
);
CREATE TABLE dim_date (
date_id DATE PRIMARY KEY,
day INT, month INT, year INT, quarter INT
);
CREATE TABLE fact_sales (
order_id INT PRIMARY KEY,
date_id DATE REFERENCES dim_date(date_id),
customer_id INT REFERENCES dim_customer(customer_id),
product_id INT REFERENCES dim_product(product_id),
quantity INT,
total_price NUMERIC
);
MongoDB Embedded Schema
{
"order_id": 1001,
"date": "2025-05-01",
"customer": {
"customer_id": 42,
"name": "Alice",
"email": "alice@example.com"
},
"product": {
"product_id": 12,
"name": "USB Hub",
"category": "Electronics"
},
"qty": 3,
"price": 19.99,
"total": 59.97
}
mongoimport --db your_db --collection fact_orders --file mongodb_star_schema_orders.json --jsonArray
db.fact_orders.aggregate([
{ $group: {
_id: "$product.product_id",
total_sales: { $sum: "$total" },
total_quantity: { $sum: "$qty" }
}},
{ $sort: { total_sales: -1 } }
]);
db.fact_orders.createIndex({ "product.product_id": 1 });
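For comparison, the same aggregation over the embedded order documents can be reproduced in pandas, which the ETL section above already uses (the sample rows here are invented):

```python
import pandas as pd

# One row per order, flattened from the embedded documents
orders = pd.DataFrame([
    {"product_id": 12, "qty": 3, "total": 59.97},
    {"product_id": 12, "qty": 1, "total": 19.99},
    {"product_id": 7,  "qty": 2, "total": 19.98},
])

# Equivalent of the $group stage (sum per product) plus $sort descending
summary = (orders.groupby("product_id")
                 .agg(total_sales=("total", "sum"), total_quantity=("qty", "sum"))
                 .sort_values("total_sales", ascending=False))
print(summary)
```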