Delta Lake Review

Delta Lake provides open, reliable, and scalable data management for the Lakehouse. It lets you ingest data from external sources and manage it across Bronze (raw), Silver (cleaned), and Gold (curated) layers, with ACID transactions, time travel, schema enforcement, and support for both batch and streaming workloads.

The goal is to ingest files from external sources like cloud storage into Delta Lake as Delta tables.

At its core, Delta Lake is an open-source protocol for reading and writing table data as files in cloud storage.

Delta tables offer an open table format that supports Lakehouse architecture.

Within Delta Lake you will work with Delta tables. Under the hood, a Delta table stores its data in a directory. Within that directory, the data is stored as Parquet files; what Delta adds is a transaction log, written as JSON files in a _delta_log subdirectory next to the Parquet files. The log records every transaction on the data (the Parquet files) and the resulting table versions.

The transaction log provides much of a Delta table's functionality. It introduces the concept of table states: whenever you insert, delete, or update data, Delta adds a new entry (a log file) recording the change, and the table's current state is whatever results from applying those entries in order. Because every version is recorded, you can get a consistent view of your data and even query the table as it existed at an earlier version. In other words, you can travel back in time!
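To make the idea of table states concrete, here is a minimal sketch in plain Python. It is a toy model, not the Delta Lake API: the commit list, the file names, and the helper files_at_version are all made up for illustration, and real log entries carry more information than the "add" and "remove" actions shown here. The point is only that replaying the log up to a given version tells you which data files make up the table at that version.

```python
# Toy model of a transaction log (illustrative only, not the Delta Lake API).
# Each commit is a list of actions: "add" registers a data file with the table,
# "remove" retires one. File names are hypothetical.
commits = [
    [{"add": {"path": "part-0000.parquet"}}],      # version 0: initial insert
    [{"add": {"path": "part-0001.parquet"}}],      # version 1: append
    [{"remove": {"path": "part-0000.parquet"}},    # version 2: an update rewrites
     {"add": {"path": "part-0002.parquet"}}],      #            the affected file
]


def files_at_version(commits, version):
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


latest = files_at_version(commits, len(commits) - 1)  # current table state
as_of_v1 = files_at_version(commits, 1)               # "time travel" to version 1

print(sorted(latest))
print(sorted(as_of_v1))
```

Note that the update in version 2 does not change the old file in place: it retires it in the log and adds a new one, which is why the earlier version can still be reconstructed.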

Delta Tables provide a variety of key features in our cloud data lake:

ACID transactions (atomicity, consistency, isolation, and durability) for all operations, so multiple users can read and write data concurrently without corrupting the table.
Data Manipulation Language (DML) operations such as INSERT, UPDATE, DELETE, and MERGE, enabling flexible data management.
Time travel, which lets users query and revert to previous versions of data, facilitating auditing and recovery.
Schema enforcement for data integrity, together with schema evolution, which allows changes to the structure without breaking existing workflows.
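
The difference between schema enforcement and schema evolution can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Delta Lake code: the function append_rows and its merge_schema flag are hypothetical names chosen for this example. A write with an unexpected column is rejected as a whole unless the caller explicitly opts in to evolving the schema.

```python
def append_rows(schema, table, rows, merge_schema=False):
    """Append rows to table, enforcing schema (hypothetical helper).

    Unknown columns or wrong types raise ValueError unless merge_schema is
    True, in which case new columns are added to the schema.
    """
    new_schema = dict(schema)  # work on a copy so a failed write changes nothing
    for row in rows:
        for column, value in row.items():
            if column not in new_schema:
                if not merge_schema:
                    raise ValueError(f"unexpected column: {column}")
                new_schema[column] = type(value)
            elif not isinstance(value, new_schema[column]):
                raise ValueError(f"wrong type for column: {column}")
    # Every row validated: apply the schema change and the append together.
    schema.update(new_schema)
    table.extend(rows)


schema = {"id": int, "name": str}
table = []
append_rows(schema, table, [{"id": 1, "name": "a"}])

new_row = {"id": 2, "name": "b", "email": "b@example.com"}
try:
    append_rows(schema, table, [new_row])      # enforcement: write is rejected
    rejected = False
except ValueError:
    rejected = True

append_rows(schema, table, [new_row], merge_schema=True)  # evolution: opt in
print(rejected, len(table), sorted(schema))
```

Validating every row before touching the table also gives a small taste of atomicity: the rejected write leaves both the schema and the data exactly as they were.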

Delta Lake also provides many other features like unified batch and streaming processing, optimization and performance, and scalability. Delta Lake is also open source!

Medallion (Multi-hop) Architecture

As you ingest data into your Delta Lake through batch or streaming methods, or both, you can begin processing and transforming your data in Databricks. The goal is to incrementally and progressively improve the structure and quality of data as it moves through each layer: Bronze, Silver and Gold.

It begins with the Bronze layer, the raw data ingestion layer. This layer ingests raw, unprocessed data from various sources 'as is', serving as the foundational storage for all data.

In the Silver layer, the data is cleaned, transformed, and enriched, providing a more refined dataset that is suitable for analysis.

Lastly, the Gold layer contains curated, aggregated, and high-quality data, optimized for reporting and advanced analytics, and often used for business intelligence applications.
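
The three layers can be illustrated with a small, self-contained sketch in plain Python. It is only a toy: in Databricks each layer would normally be a Delta table, whereas here each one is an in-memory list or dict, and the sample records and the function names to_bronze, to_silver, and to_gold are invented for this example.

```python
# Hypothetical raw input, as it might arrive from an external source.
raw_events = [
    {"order_id": "1", "country": "us", "amount": "10.50"},
    {"order_id": "2", "country": "DE", "amount": "7.25"},
    {"order_id": "2", "country": "DE", "amount": "7.25"},  # duplicate record
    {"order_id": "3", "country": "us", "amount": "oops"},  # malformed amount
]


def to_bronze(events):
    """Bronze: keep every record exactly as received."""
    return [dict(event) for event in events]


def to_silver(bronze):
    """Silver: cast types, normalize values, drop malformed rows and duplicates."""
    seen, silver = set(), []
    for row in bronze:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # skip duplicates of an order already kept
        seen.add(order_id)
        silver.append(
            {"order_id": order_id, "country": row["country"].upper(), "amount": amount}
        )
    return silver


def to_gold(silver):
    """Gold: aggregate revenue per country for reporting."""
    totals = {}
    for row in silver:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


bronze = to_bronze(raw_events)
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Bronze keeps all four records, including the bad ones, so nothing from the source is lost; the cleaning rules live entirely in the Silver step and can be changed and re-run later without re-ingesting.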