Data Lakes & Data Modelling
Data Lake – Big-Picture Definition
- Centralised, highly-scalable repository into which all data from multiple sources is ingested “as-is”.
- Accepts every format:
- Structured (relational tables, CSVs, TSVs)
- Semi-structured (JSON, XML, Parquet, Avro)
- Un-structured (text documents, images, audio, video)
- Primary use-cases: exploratory analytics, data science, machine-learning, long-term low-cost archival.
- Works on the principle of “store now, understand / model later.”
Why Data Lakes Emerged
- Traditional OLTP databases optimise for transaction processing and only handle structured data.
- Data Warehouses (OLAP):
- Support large, historical, cleansed, structured data.
- High performance but expensive (licensing, compute, storage)
- Impose an up-front modelling / ETL cost (schema-on-write).
- Organisations needed a cheaper, schema-on-read, unlimited-volume store → Data Lake.
Core Characteristics & Capabilities
- Schema-on-read: data is interpreted only when queried, not when stored.
- Unlimited scale: underlying object storage can grow petabytes without reconfiguration.
- Object storage: each file/object carries two technical components:
- A unique identifier (key / URI)
- A metadata tag set (name–value pairs describing owner, size, ingestion date, etc.)
- Cost optimisation tiers: infrequently accessed objects can be moved to colder / cheaper tiers automatically.
- Security & Governance:
- IAM / ACL level permissions per bucket, folder, or object.
- Encryption in-flight (TLS) & at-rest (KMS-managed keys).
- Data catalog & lineage tools integrated (e.g., AWS Glue, Azure Purview).
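The schema-on-read idea above can be sketched in a few lines of Python. The object keys and record fields below are invented for illustration: raw bytes are landed untouched, and a schema (field selection and type coercion) is applied only when a consumer queries them.

```python
import json

# Hypothetical raw objects landed in the lake "as-is".
# Schema-on-read: nothing is validated or typed at ingest time.
RAW_OBJECTS = {
    "landing/2024/events-001.json": b'{"user": "a", "amount": "19.99"}',
    "landing/2024/events-002.json": b'{"user": "b"}',  # partial row, kept anyway
}

def query_amounts(objects):
    """Apply a schema at read time: parse JSON and coerce types,
    tolerating records that do not match the expected shape."""
    total = 0.0
    for key, blob in objects.items():
        record = json.loads(blob)             # interpretation happens here, at query time
        if "amount" in record:
            total += float(record["amount"])  # type coercion also deferred to the read
    return total

print(query_amounts(RAW_OBJECTS))  # 19.99
```

A schema-on-write store would instead have rejected or fixed the partial record at ingest; the lake keeps it and lets each reader decide.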
Typical Data Zones inside a Lake
- Raw / Landing zone – 100 % untouched source files.
- Cleansed / Curated zone – duplicates removed, data quality checks applied.
- Query / Trusted zone – optimised formats (Parquet/ORC), partitioned for analytics engines.
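The three zones above are usually nothing more than key prefixes in object storage. A minimal sketch, assuming an illustrative `zone/source/filename` layout (zone names vary by organisation):

```python
# The three-zone layout expressed as object-store key prefixes.
# Zone names and layout are illustrative, not a standard.
ZONES = ("raw", "curated", "trusted")

def zone_key(zone: str, source: str, filename: str) -> str:
    """Build an object key like 'raw/crm/2024-05-01.csv'."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{source}/{filename}"

print(zone_key("raw", "crm", "2024-05-01.csv"))       # raw/crm/2024-05-01.csv
print(zone_key("trusted", "crm", "part-000.parquet"))  # trusted/crm/part-000.parquet
```

Promotion between zones is then just a transformation job that reads from one prefix and writes to the next.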
Workflow & Consumer Patterns
- Sources → Ingest (batch, streaming, change-data-capture).
- Storage inside lake (raw).
- Optional transformation / cleansing jobs.
- Consumption layers:
- SQL-on-lake engines (e.g., Amazon Athena, Google BigQuery external tables, Azure Synapse Serverless)
- ML notebooks (Python / R reading directly from object storage)
- Downstream Data Marts or Warehouses (schema-on-write if needed for BI dashboards).
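The workflow above can be sketched end to end in a few functions. The in-memory dicts stand in for object storage, and the record shape is invented for illustration:

```python
# Ingest -> raw storage -> cleansing -> consumption, in miniature.
lake = {"raw": [], "curated": []}

def ingest(records):
    # Land everything untouched, at full fidelity.
    lake["raw"].extend(records)

def cleanse():
    # Promote to the curated zone: drop rows failing a simple
    # quality check (here, a missing id).
    lake["curated"] = [r for r in lake["raw"] if r.get("id") is not None]

def consume():
    # A "SQL-on-lake"-style aggregation over the curated zone.
    return sum(r["value"] for r in lake["curated"])

ingest([{"id": 1, "value": 10}, {"id": None, "value": 99}, {"id": 2, "value": 5}])
cleanse()
print(consume())  # 15
```

Note that the raw zone still holds all three rows; cleansing produces a new copy rather than mutating the landed data.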
Cloud Implementations (Examples)
- AWS: Amazon S3 (+ Lake Formation, Glue, Athena).
- Microsoft Azure: Azure Data Lake Storage Gen2 (ADLS Gen2), built on Blob Storage.
- Google Cloud Platform: Cloud Storage (+ BigLake, Dataplex).
- All provide: versioning, lifecycle, cross-region replication, fine-grained access policies.
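The lifecycle/tiering features above are configured declaratively. Below is a sketch of an S3-style lifecycle rule (the rule ID, prefix, and day counts are illustrative, not recommendations) that transitions raw-zone objects to a colder tier after 90 days and deletes them after two years; Azure and GCP expose equivalent policies:

```python
import json

# Illustrative lifecycle rule for the raw zone.
lifecycle = {
    "Rules": [
        {
            "ID": "tier-down-raw-zone",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 730},
        }
    ]
}
# This dict matches the JSON shape S3 accepts for a bucket
# lifecycle configuration (e.g., via put_bucket_lifecycle_configuration).
print(json.dumps(lifecycle, indent=2))
```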
Raw Data vs. Processed Data – Clarification
- Raw Data
- Immediately after ingest; no cleansing, standardisation, enrichment.
- May contain nulls, partial rows, “junk” values.
- Analogy: unedited three-hour camera footage.
- Processed (Cleansed) Data
- Transformed, validated, business-ready.
- Example: final edited movie, or HR survey results aggregated into KPIs.
- Survey Example from class:
- 100 employees submit answers → raw dataset (100 rows).
- HR parses responses, computes summary insights → processed dataset.
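The survey example as code: 100 raw rows in, one processed summary out. The question name and scoring below are made up for illustration:

```python
# Raw dataset: one row per employee, exactly as submitted.
raw_responses = [{"employee": i, "satisfaction": 3 + (i % 3)} for i in range(100)]

def to_kpis(rows):
    """Cleanse + aggregate: produce the 'processed' dataset HR reports."""
    scores = [r["satisfaction"] for r in rows if r.get("satisfaction") is not None]
    return {
        "responses": len(scores),
        "avg_satisfaction": round(sum(scores) / len(scores), 2),
    }

print(to_kpis(raw_responses))
```

The raw rows stay in the lake untouched; only the aggregated KPIs move on to dashboards.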
Data Modeling – Purpose & Importance
- Goal: visually/semantically describe how data elements relate within a system so developers, analysts, and architects speak the same language.
- Enables easier navigation across tens or hundreds of tables, promotes consistency, prevents redundancy.
- Produced before or during database / warehouse design; lives as part of design docs.
ER (Entity–Relationship) Diagrams – Building Blocks
- Entity → typically a table or object (rectangle in diagram).
- Attribute → column/field of an entity (listed inside rectangle).
- Relationship → line + symbol set denoting cardinality.
- One-to-One (1:1)
- One-to-Many (1:N)
- Many-to-Many (M:N)
- One-to-Zero-or-Many (1:0..N) – used when a table can exist with no matching rows in related table.
Cardinality Symbols (quick legend)
- Crow’s foot notation used in slides:
- |—| 1 and only 1
- |—< 1 to many
- O—< 0 to many
- O—| 0 or 1
Three Levels of Data Models
- Conceptual Data Model (CDM)
- High-level, technology-agnostic.
- Shows entities + relationships only.
- Audience: business stakeholders, initial scoping.
- Logical Data Model (LDM)
- Adds detail: primary keys, foreign keys, optionality, data types.
- Ignores specific DBMS naming rules; field names may still be in business language.
- Physical Data Model (PDM)
- Exact blueprint of deployed database.
- Column names match the physical implementation, e.g. STUDENT_ID, COURSE_ID.
- Includes indexes, constraints, partition keys, and performance-oriented configuration.
Example Walk-Through – Student Enrollment
- Conceptual
- Entities: STUDENT, COURSE, ENROLLMENT
- Relationships: STUDENT 1..N ENROLLMENT, COURSE 1..N ENROLLMENT
- Logical
- STUDENT (StudentID PK, Name varchar, DOB date …)
- COURSE (CourseID PK, Title varchar …)
- ENROLLMENT (EnrollID PK, StudentID FK, CourseID FK, EnrollmentDate date)
- Physical (PostgreSQL example)
- Table: student ( student_id bigint PRIMARY KEY, name text, dob date, … )
- Foreign keys explicitly scripted; naming matches the actual DDL.
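The physical model above can be executed as real DDL. The sketch below uses Python's built-in sqlite3 so it runs anywhere; the PostgreSQL version named in the slide would use bigint/text column types but the same key structure:

```python
import sqlite3

# Physical model for the student-enrollment example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        dob        DATE
    );
    CREATE TABLE course (
        course_id  INTEGER PRIMARY KEY,
        title      TEXT NOT NULL
    );
    CREATE TABLE enrollment (
        enroll_id       INTEGER PRIMARY KEY,
        student_id      INTEGER NOT NULL REFERENCES student(student_id),
        course_id       INTEGER NOT NULL REFERENCES course(course_id),
        enrollment_date DATE
    );
""")
# Sample rows exercising the 1:N relationships via ENROLLMENT.
conn.execute("INSERT INTO student VALUES (1, 'Ada', '2000-01-01')")
conn.execute("INSERT INTO course VALUES (10, 'Databases')")
conn.execute("INSERT INTO enrollment VALUES (100, 1, 10, '2024-09-01')")
row = conn.execute("""
    SELECT s.name, c.title
    FROM enrollment e
    JOIN student s ON s.student_id = e.student_id
    JOIN course  c ON c.course_id  = e.course_id
""").fetchone()
print(row)  # ('Ada', 'Databases')
```

ENROLLMENT is the classic resolution of the M:N relationship between STUDENT and COURSE into two 1:N relationships.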
One-to-Zero versus One-to-Many Clarified (Q&A Recap)
- 1:0 – a parent row may have no dependent rows yet; e.g., a reference table created but not yet used.
- 1:N – each parent row appears in one or more child rows; e.g., a department referenced by many employees.
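The department/employee example can be made concrete with a LEFT JOIN, which surfaces the "zero" case (a parent with no children yet). A runnable sketch with sqlite3; the sample data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        dept_id INTEGER REFERENCES department(dept_id)
    );
    INSERT INTO department VALUES (1, 'Engineering'), (2, 'Legal');
    INSERT INTO employee VALUES (10, 'Ana', 1), (11, 'Bo', 1);
    -- 'Legal' exists but has zero employees: the 1:0 case.
""")
rows = conn.execute("""
    SELECT d.name, COUNT(e.emp_id)
    FROM department d
    LEFT JOIN employee e ON e.dept_id = d.dept_id
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
print(rows)  # [('Engineering', 2), ('Legal', 0)]
```

An INNER JOIN would silently drop 'Legal'; the LEFT JOIN is what makes the 1:0..N cardinality visible.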
Data-Modeling Tools
- Oracle SQL Developer Data Modeler
- ER/Studio, Erwin, Lucidchart, Draw.io, dbdiagram.io
- Most modern IDEs auto-generate ER diagrams from existing schema.
Practical Takeaways & Best Practices
- Always ingest at full fidelity; you can’t go back in time for missing columns.
- Partition & compress (e.g., Parquet + gzip) in the curated zone to cut query scan costs.
- Maintain a data catalog (Glue/Azure Purview) so scientists know what’s in the lake.
- Keep CDM/LDM/PDM under version control; treat them as living documentation.
- For exam questions, remember:
- Data Lake ≠ Data Warehouse (schema-on-read vs schema-on-write).
- Conceptual vs Logical vs Physical – differentiate by level of detail & naming conventions.