Data Lakes & Data Modelling

Data Lake – Big-Picture Definition

  • Centralised, highly scalable repository into which all data from multiple sources is ingested “as-is”.
  • Accepts every format:
    • Structured (relational tables, CSVs, TSVs)
    • Semi-structured (JSON, XML, Parquet, Avro)
    • Unstructured (text documents, images, audio, video)
  • Primary use-cases: exploratory analytics, data science, machine learning, long-term low-cost archival.
  • Works on the principle of “store now, understand / model later.”

Why Data Lakes Emerged

  • Traditional OLTP databases are optimised for transaction processing and handle only structured data.
  • Data Warehouses (OLAP):
    • Support large, historical, cleansed, structured data.
    • High performance but expensive (licensing, compute, storage).
    • Impose an up-front modelling / ETL cost (schema-on-write).
  • Organisations needed a cheaper, schema-on-read, unlimited-volume store → Data Lake.

Core Characteristics & Capabilities

  • Schema-on-read: data is interpreted only when queried, not when stored.
  • Unlimited scale: underlying object storage can grow to petabytes without reconfiguration.
  • Object storage: each file/object carries two technical components:
    • A unique identifier (key / URI)
    • A metadata tag set (name–value pairs describing owner, size, ingestion date, etc.)
  • Cost optimisation tiers: infrequently accessed objects can be moved to colder / cheaper tiers automatically.
  • Security & Governance:
    • IAM / ACL level permissions per bucket, folder, or object.
    • Encryption in-flight (TLS) & at-rest (KMS-managed keys).
    • Data catalog & lineage tools integrated (e.g., AWS Glue, Azure Purview).
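The object-storage and schema-on-read ideas above can be sketched with a toy in-memory store (function and key names are illustrative; real lakes use S3/ADLS/GCS client libraries):

```python
import json

# Toy object store: each object = unique key + opaque bytes + metadata tags.
store = {}

def put_object(key, payload_bytes, **tags):
    """Store data as-is, attaching name-value metadata tags."""
    store[key] = {"bytes": payload_bytes, "metadata": tags}

def read_with_schema(key, schema):
    """Schema-on-read: interpret the raw bytes only at query time."""
    rows = [json.loads(line) for line in store[key]["bytes"].decode().splitlines()]
    # Project each record onto the caller's schema, coercing types now, not at write time.
    return [{col: cast(rec.get(col)) for col, cast in schema.items()} for rec in rows]

# Ingest raw JSON lines "as-is" - no schema enforced on write.
raw = b'{"id": "1", "amount": "9.50"}\n{"id": "2", "amount": "3.25"}'
put_object("raw/sales/2024-01-01.jsonl", raw, owner="sales", ingested="2024-01-01")

# A consumer supplies its own schema when it reads.
rows = read_with_schema("raw/sales/2024-01-01.jsonl", {"id": int, "amount": float})
print(rows)  # [{'id': 1, 'amount': 9.5}, {'id': 2, 'amount': 3.25}]
```

Note that two consumers could read the same object with different schemas; nothing was fixed at write time.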

Typical Data Zones inside a Lake

  • Raw / Landing zone – 100 % untouched source files.
  • Cleansed / Curated zone – duplicates removed, data quality checks applied.
  • Query / Trusted zone – optimised formats (Parquet/ORC), partitioned for analytics engines.
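Promotion from the raw zone to the cleansed zone can be sketched as a dedupe-plus-quality-check job (the rule and function names here are illustrative, not a standard API):

```python
import csv, io

# Raw zone: untouched source file, complete with a duplicate and a junk row.
raw_csv = """id,name,dept
1,Alice,Eng
1,Alice,Eng
2,,Sales
3,Carol,Eng
"""

def promote_to_cleansed(raw_text, required=("id", "name")):
    """Raw -> Cleansed: drop exact duplicates and rows failing quality checks."""
    seen, cleansed = set(), []
    for row in csv.DictReader(io.StringIO(raw_text)):
        key = tuple(row.values())
        if key in seen:                            # de-duplication
            continue
        if any(not row[col] for col in required):  # quality check: no empty required fields
            continue
        seen.add(key)
        cleansed.append(row)
    return cleansed

clean = promote_to_cleansed(raw_csv)
print([r["name"] for r in clean])  # ['Alice', 'Carol']
```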

Workflow & Consumer Patterns

  1. Sources → Ingest (batch, streaming, change-data-capture).
  2. Storage inside lake (raw).
  3. Optional transformation / cleansing jobs.
  4. Consumption layers:
    • SQL-on-lake engines (e.g., Amazon Athena, Google BigQuery external tables, Azure Synapse Serverless)
    • ML notebooks (Python / R reading directly from object storage)
    • Downstream Data Marts or Warehouses (schema-on-write if needed for BI dashboards).
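Step 4 can be mimicked locally, with sqlite3 standing in for a serverless SQL-on-lake engine (the table and values are invented for illustration):

```python
import sqlite3

# Curated-zone records; in a real lake these would be Parquet files on object storage.
curated = [("2024-01-01", "widgets", 120), ("2024-01-01", "gadgets", 80),
           ("2024-01-02", "widgets", 95)]

# A SQL-on-lake engine (Athena / Synapse Serverless / BigQuery) projects a schema
# over the files at query time; an in-memory sqlite3 database plays that role here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, qty INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", curated)

total_by_product = con.execute(
    "SELECT product, SUM(qty) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(total_by_product)  # [('gadgets', 80), ('widgets', 215)]
```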

Cloud Implementations (Examples)

  • AWS: Amazon S3 (+ Lake Formation, Glue, Athena).
  • Microsoft Azure: Azure Data Lake Storage Gen2 (ADLS Gen2), built on Blob Storage.
  • Google Cloud Platform: Cloud Storage (+ BigLake, Dataplex).
  • All provide: versioning, lifecycle, cross-region replication, fine-grained access policies.

Raw Data vs. Processed Data – Clarification

  • Raw Data
    • Immediately after ingest; no cleansing, standardisation, or enrichment.
    • May contain nulls, partial rows, “junk” values.
    • Analogy: unedited three-hour camera footage.
  • Processed (Cleansed) Data
    • Transformed, validated, business-ready.
    • Example: final edited movie, or HR survey results aggregated into KPIs.
  • Survey Example from class:
    1. 100 employees submit answers → raw dataset (100 rows).
    2. HR parses responses, computes summary insights → processed dataset.
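The survey example above reduces to a few lines of code: raw rows in, KPI-style summary out (column names are invented for illustration, and only 5 of the 100 rows are shown):

```python
from statistics import mean

# Raw dataset: one row per employee response, exactly as submitted.
raw_responses = [
    {"employee": "e01", "satisfaction": 4, "would_recommend": True},
    {"employee": "e02", "satisfaction": 2, "would_recommend": False},
    {"employee": "e03", "satisfaction": 5, "would_recommend": True},
    {"employee": "e04", "satisfaction": 3, "would_recommend": True},
    {"employee": "e05", "satisfaction": 4, "would_recommend": True},
]

# Processed dataset: HR's business-ready KPIs derived from the raw rows.
kpis = {
    "responses": len(raw_responses),
    "avg_satisfaction": round(mean(r["satisfaction"] for r in raw_responses), 2),
    "pct_recommend": 100 * sum(r["would_recommend"] for r in raw_responses) // len(raw_responses),
}
print(kpis)  # {'responses': 5, 'avg_satisfaction': 3.6, 'pct_recommend': 80}
```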

Data Modeling – Purpose & Importance

  • Goal: visually/semantically describe how data elements relate within a system so developers, analysts, and architects speak the same language.
  • Enables easier navigation across tens or hundreds of tables, promotes consistency, prevents redundancy.
  • Produced before or during database / warehouse design; lives as part of design docs.

ER (Entity–Relationship) Diagrams – Building Blocks

  • Entity → typically a table or object (rectangle in diagram).
  • Attribute → column/field of an entity (listed inside rectangle).
  • Relationship → line + symbol set denoting cardinality.
    • One-to-One (1:1)
    • One-to-Many (1:N)
    • Many-to-Many (M:N)
    • One-to-Zero-or-Many (1:0..N) – used when a parent row can exist with no matching rows in the related table.

Cardinality Symbols (quick legend)

  • Crow’s foot notation used in slides:
    • |—| 1 and only 1
    • |—< 1 to many
    • O—< 0 to many
    • O—| 0 or 1

Three Levels of Data Models

  1. Conceptual Data Model (CDM)
    • High-level, technology-agnostic.
    • Shows entities + relationships only.
    • Audience: business stakeholders, initial scoping.
  2. Logical Data Model (LDM)
    • Adds detail: primary keys, foreign keys, optionality, data types.
    • Ignores specific DBMS naming rules; field names may still be in business language.
  3. Physical Data Model (PDM)
    • Exact blueprint of deployed database.
    • Column names match physical implementation, e.g. STUDENT_ID, COURSE_ID.
    • Includes indexes, constraints, partition keys, and performance-oriented configs.

Example Walk-Through – Student Enrollment

  • Conceptual
    • Entities: STUDENT, COURSE, ENROLLMENT
    • Relationships: STUDENT 1..N ENROLLMENT, COURSE 1..N ENROLLMENT
  • Logical
    • STUDENT (StudentID PK, Name varchar, DOB date …)
    • COURSE (CourseID PK, Title varchar …)
    • ENROLLMENT (EnrollID PK, StudentID FK, CourseID FK, EnrollmentDate date)
  • Physical (PostgreSQL example)
    • Table student ( student_id bigint PRIMARY KEY, name text, dob date, … )
    • Foreign keys explicitly scripted, naming matches actual DDL.
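The physical model above can be scripted end-to-end; sqlite3 is used here as a stand-in for PostgreSQL so the sketch is runnable as-is:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # sqlite only enforces FKs when this is set

# Physical model: exact column names, keys, and constraints, as deployed.
con.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    dob        DATE
);
CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL
);
CREATE TABLE enrollment (
    enroll_id       INTEGER PRIMARY KEY,
    student_id      INTEGER NOT NULL REFERENCES student(student_id),
    course_id       INTEGER NOT NULL REFERENCES course(course_id),
    enrollment_date DATE
);
""")

con.execute("INSERT INTO student VALUES (1, 'Alice', '2001-05-04')")
con.execute("INSERT INTO course  VALUES (10, 'Databases')")
con.execute("INSERT INTO enrollment VALUES (100, 1, 10, '2024-09-01')")

# The M:N STUDENT-COURSE relationship resolves into two 1:N links via ENROLLMENT.
row = con.execute("""
    SELECT s.name, c.title
    FROM enrollment e
    JOIN student s ON s.student_id = e.student_id
    JOIN course  c ON c.course_id  = e.course_id
""").fetchone()
print(row)  # ('Alice', 'Databases')
```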

One-to-Zero-or-Many versus One-to-Many Clarified (Q&A Recap)

  • 1:0..N – a parent row may have zero, one, or many child rows; e.g., a department table created before any employees reference it.
  • 1:N – each parent row appears in one or more child rows; e.g., a department referenced by many employees.

Tools Mentioned / Common Modelling Software

  • Oracle SQL Developer Data Modeler
  • ER/Studio, Erwin, Lucidchart, Draw.io, dbdiagram.io
  • Most modern IDEs auto-generate ER diagrams from existing schema.

Practical Takeaways & Best Practices

  • Always ingest at full fidelity; you can’t go back in time for missing columns.
  • Partition & format (e.g., Parquet + gzip) in the curated zone to reduce query cost.
  • Maintain a data catalog (Glue/Azure Purview) so scientists know what’s in the lake.
  • Keep CDM/LDM/PDM under version control; treat them as living documentation.
  • For exam questions, remember:
    • Data Lake ≠ Data Warehouse (schema-on-read vs schema-on-write).
    • Conceptual vs Logical vs Physical – differentiate by level of detail & naming conventions.
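The partitioning tip above can be sketched in miniature: group rows by a partition key so a query touching one day scans only one "folder" (paths and data are illustrative):

```python
from collections import defaultdict

# Curated-zone rows; on disk these would live as Parquet files under
# partitioned paths such as .../sales/dt=2024-01-01/ (paths illustrative).
rows = [
    {"dt": "2024-01-01", "product": "widgets", "qty": 120},
    {"dt": "2024-01-01", "product": "gadgets", "qty": 80},
    {"dt": "2024-01-02", "product": "widgets", "qty": 95},
]

# Partitioning: group rows by key, one bucket per "folder".
partitions = defaultdict(list)
for r in rows:
    partitions[f"dt={r['dt']}"].append(r)

# Partition pruning: a query filtered on dt reads only the matching partition,
# cutting scanned bytes - which is what pay-per-scan engines bill for.
scanned = partitions["dt=2024-01-02"]
print(len(scanned), sum(r["qty"] for r in scanned))  # 1 95
```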