Data Lakes & Data Modelling
Data Lake – Big-Picture Definition
- Centralised, highly-scalable repository into which all data from multiple sources is ingested “as-is”.
- Accepts every format:
- Structured (relational tables, CSVs, TSVs)
- Semi-structured (JSON, XML, Parquet, Avro)
- Un-structured (text documents, images, audio, video)
- Primary use-cases: exploratory analytics, data science, machine-learning, long-term low-cost archival.
- Works on the principle of “store now, understand / model later.”
Why Data Lakes Emerged
- Traditional OLTP databases optimise for transaction processing and only handle structured data.
- Data Warehouses (OLAP):
- Support large, historical, cleansed, structured data.
- High performance but expensive (licensing, compute, storage)
- Impose an up-front modelling / ETL cost (schema-on-write).
- Organisations needed a cheaper, schema-on-read, unlimited-volume store → Data Lake.
Core Characteristics & Capabilities
- Schema-on-read: data is interpreted only when queried, not when stored.
- Unlimited scale: underlying object storage can grow petabytes without reconfiguration.
- Object storage: each file/object carries two technical components:
- A unique identifier (key / URI)
- A metadata tag set (name–value pairs describing owner, size, ingestion date, etc.)
- Cost optimisation tiers: infrequently accessed objects can be moved to colder / cheaper tiers automatically.
- Security & Governance:
- IAM / ACL level permissions per bucket, folder, or object.
- Encryption in-flight (TLS) & at-rest (KMS-managed keys).
- Data catalog & lineage tools integrated (e.g., AWS Glue, Azure Purview).
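The schema-on-read idea above can be sketched in a few lines of Python. The object keys and record fields below are invented for illustration: raw bytes are landed untouched, and a schema (field selection and type coercion) is applied only when a consumer queries them.

```python
import json

# Hypothetical raw objects landed in the lake "as-is".
# Schema-on-read: nothing is validated or typed at ingest time.
RAW_OBJECTS = {
    "landing/2024/events-001.json": b'{"user": "a", "amount": "19.99"}',
    "landing/2024/events-002.json": b'{"user": "b"}',  # partial row, kept anyway
}

def query_amounts(objects):
    """Apply a schema at read time: parse JSON and coerce types,
    tolerating records that do not match the expected shape."""
    total = 0.0
    for key, blob in objects.items():
        record = json.loads(blob)             # interpretation happens here, at query time
        if "amount" in record:
            total += float(record["amount"])  # type coercion also deferred to the read
    return total

print(query_amounts(RAW_OBJECTS))  # 19.99
```

A schema-on-write store would instead have rejected or fixed the partial record at ingest; the lake keeps it and lets each reader decide.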
Typical Data Zones inside a Lake
- Raw / Landing zone – 100 % untouched source files.
- Cleansed / Curated zone – duplicates removed, data quality checks applied.
- Query / Trusted zone – optimised formats (Parquet/ORC), partitioned for analytics engines.
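The three zones above are usually nothing more than key prefixes in object storage. A minimal sketch, assuming an illustrative `zone/source/filename` layout (zone names vary by organisation):

```python
# The three-zone layout expressed as object-store key prefixes.
# Zone names and layout are illustrative, not a standard.
ZONES = ("raw", "curated", "trusted")

def zone_key(zone: str, source: str, filename: str) -> str:
    """Build an object key like 'raw/crm/2024-05-01.csv'."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{source}/{filename}"

print(zone_key("raw", "crm", "2024-05-01.csv"))       # raw/crm/2024-05-01.csv
print(zone_key("trusted", "crm", "part-000.parquet"))  # trusted/crm/part-000.parquet
```

Promotion between zones is then just a transformation job that reads from one prefix and writes to the next.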
Workflow & Consumer Patterns
- Sources → Ingest (batch, streaming, change-data-capture).
- Storage inside lake (raw).
- Optional transformation / cleansing jobs.
- Consumption layers:
- SQL-on-lake engines (e.g., Amazon Athena, Google BigQuery external tables, Azure Synapse Serverless)
- ML notebooks (Python / R reading directly from object storage)
- Downstream Data Marts or Warehouses (schema-on-write if needed for BI dashboards).
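The workflow above can be sketched end to end in a few functions. The in-memory dicts stand in for object storage, and the record shape is invented for illustration:

```python
# Ingest -> raw storage -> cleansing -> consumption, in miniature.
lake = {"raw": [], "curated": []}

def ingest(records):
    # Land everything untouched, at full fidelity.
    lake["raw"].extend(records)

def cleanse():
    # Promote to the curated zone: drop rows failing a simple
    # quality check (here, a missing id).
    lake["curated"] = [r for r in lake["raw"] if r.get("id") is not None]

def consume():
    # A "SQL-on-lake"-style aggregation over the curated zone.
    return sum(r["value"] for r in lake["curated"])

ingest([{"id": 1, "value": 10}, {"id": None, "value": 99}, {"id": 2, "value": 5}])
cleanse()
print(consume())  # 15
```

Note that the raw zone still holds all three rows; cleansing produces a new copy rather than mutating the landed data.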
Cloud Implementations (Examples)
- AWS: Amazon S3 (+ Lake Formation, Glue, Athena).
- Microsoft Azure: Azure Data Lake Storage Gen2 (ADLS Gen2), built on Blob Storage.
- Google Cloud Platform: Cloud Storage (+ BigLake, Dataplex).
- All provide: versioning, lifecycle, cross-region replication, fine-grained access policies.
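The lifecycle/tiering features above are configured declaratively. Below is a sketch of an S3-style lifecycle rule (the rule ID, prefix, and day counts are illustrative, not recommendations) that transitions raw-zone objects to a colder tier after 90 days and deletes them after two years; Azure and GCP expose equivalent policies:

```python
import json

# Illustrative lifecycle rule for the raw zone.
lifecycle = {
    "Rules": [
        {
            "ID": "tier-down-raw-zone",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 730},
        }
    ]
}
# This dict matches the JSON shape S3 accepts for a bucket
# lifecycle configuration (e.g., via put_bucket_lifecycle_configuration).
print(json.dumps(lifecycle, indent=2))
```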
Raw Data vs. Processed Data – Clarification
- Raw Data
- Immediately after ingest; no cleansing, standardisation, enrichment.
- May contain nulls, partial rows, “junk” values.
- Analogy: unedited three-hour camera footage.
- Processed (Cleansed) Data
- Transformed, validated, business-ready.
- Example: final edited movie, or HR survey results aggregated into KPIs.
- Survey Example from class:
- 100 employees submit answers → raw dataset (100 rows).
- HR parses responses, computes summary insights → processed dataset.
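The survey example as code: 100 raw rows in, one processed summary out. The question name and scoring below are made up for illustration:

```python
# Raw dataset: one row per employee, exactly as submitted.
raw_responses = [{"employee": i, "satisfaction": 3 + (i % 3)} for i in range(100)]

def to_kpis(rows):
    """Cleanse + aggregate: produce the 'processed' dataset HR reports."""
    scores = [r["satisfaction"] for r in rows if r.get("satisfaction") is not None]
    return {
        "responses": len(scores),
        "avg_satisfaction": round(sum(scores) / len(scores), 2),
    }

print(to_kpis(raw_responses))
```

The raw rows stay in the lake untouched; only the aggregated KPIs move on to dashboards.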
Data Modeling – Purpose & Importance
- Goal: visually/semantically describe how data elements relate within a system so developers, analysts, and architects speak the same language.
- Enables easier navigation across tens or hundreds of tables, promotes consistency, prevents redundancy.
- Produced before or during database / warehouse design; lives as part of design docs.
ER (Entity–Relationship) Diagrams – Building Blocks
- Entity → typically a table or object (rectangle in diagram).
- Attribute → column/field of an entity (listed inside rectangle).
- Relationship → line + symbol set denoting cardinality.
- One-to-One (1:1)
- One-to-Many (1:N)
- Many-to-Many (M:N)
- One-to-Zero-or-Many (1:0..N) – used when a table can exist with no matching rows in related table.
Cardinality Symbols (quick legend)
- Crow’s foot notation used in slides:
- |—| 1 and only 1
- |—< 1 to many
- O—< 0 to many
- O—| 0 or 1
Three Levels of Data Models
- Conceptual Data Model (CDM)
- High-level, technology-agnostic.
- Shows entities + relationships only.
- Audience: business stakeholders, initial scoping.
- Logical Data Model (LDM)
- Adds detail: primary keys, foreign keys, optionality, data types.
- Ignores specific DBMS naming rules; field names may still be in business language.
- Physical Data Model (PDM)
- Exact blueprint of deployed database.
- Column names match the physical implementation, e.g. STUDENT_ID, COURSE_ID.
- Includes indexes, constraints, partition keys, and performance-oriented configuration.
Example Walk-Through – Student Enrollment
- Conceptual
- Entities: STUDENT, COURSE, ENROLLMENT
- Relationships: STUDENT 1..N ENROLLMENT, COURSE 1..N ENROLLMENT
- Logical
- STUDENT (StudentID PK, Name varchar, DOB date …)
- COURSE (CourseID PK, Title varchar …)
- ENROLLMENT (EnrollID PK, StudentID FK, CourseID FK, EnrollmentDate date)
- Physical (PostgreSQL example)
- Table: student ( student_id bigint PRIMARY KEY, name text, dob date, … )
- Foreign keys explicitly scripted; naming matches the actual DDL.
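The physical model above can be executed as real DDL. The sketch below uses Python's built-in sqlite3 so it runs anywhere; the PostgreSQL version named in the slide would use bigint/text column types but the same key structure:

```python
import sqlite3

# Physical model for the student-enrollment example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        dob        DATE
    );
    CREATE TABLE course (
        course_id  INTEGER PRIMARY KEY,
        title      TEXT NOT NULL
    );
    CREATE TABLE enrollment (
        enroll_id       INTEGER PRIMARY KEY,
        student_id      INTEGER NOT NULL REFERENCES student(student_id),
        course_id       INTEGER NOT NULL REFERENCES course(course_id),
        enrollment_date DATE
    );
""")
# Sample rows exercising the 1:N relationships via ENROLLMENT.
conn.execute("INSERT INTO student VALUES (1, 'Ada', '2000-01-01')")
conn.execute("INSERT INTO course VALUES (10, 'Databases')")
conn.execute("INSERT INTO enrollment VALUES (100, 1, 10, '2024-09-01')")
row = conn.execute("""
    SELECT s.name, c.title
    FROM enrollment e
    JOIN student s ON s.student_id = e.student_id
    JOIN course  c ON c.course_id  = e.course_id
""").fetchone()
print(row)  # ('Ada', 'Databases')
```

ENROLLMENT is the classic resolution of the M:N relationship between STUDENT and COURSE into two 1:N relationships.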
One-to-Zero versus One-to-Many Clarified (Q&A Recap)
- 1:0 – a parent row may have no dependent rows yet; e.g., a reference table created but not yet used.
- 1:N – each parent row appears in one or more child rows; e.g., a department referenced by many employees.
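The department/employee example can be made concrete with a LEFT JOIN, which surfaces the "zero" case (a parent with no children yet). A runnable sketch with sqlite3; the sample data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        dept_id INTEGER REFERENCES department(dept_id)
    );
    INSERT INTO department VALUES (1, 'Engineering'), (2, 'Legal');
    INSERT INTO employee VALUES (10, 'Ana', 1), (11, 'Bo', 1);
    -- 'Legal' exists but has zero employees: the 1:0 case.
""")
rows = conn.execute("""
    SELECT d.name, COUNT(e.emp_id)
    FROM department d
    LEFT JOIN employee e ON e.dept_id = d.dept_id
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
print(rows)  # [('Engineering', 2), ('Legal', 0)]
```

An INNER JOIN would silently drop 'Legal'; the LEFT JOIN is what makes the 1:0..N cardinality visible.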
Data-Modeling Tools
- Oracle SQL Developer Data Modeler
- ER/Studio, Erwin, Lucidchart, Draw.io, dbdiagram.io
- Most modern IDEs auto-generate ER diagrams from existing schema.
Practical Takeaways & Best Practices
- Always ingest at full fidelity; you can’t go back in time for missing columns.
- Partition & compress (e.g., Parquet + gzip) in the curated zone to cut query scan costs.
- Maintain a data catalog (Glue/Azure Purview) so scientists know what’s in the lake.
- Keep CDM/LDM/PDM under version control; treat them as living documentation.
- For exam questions, remember:
- Data Lake ≠ Data Warehouse (schema-on-read vs schema-on-write).
- Conceptual vs Logical vs Physical – differentiate by level of detail & naming conventions.