3 Databricks Architecture, Spark Runtime & Azure Demo – Study Notes
Databricks Platform Overview
Databricks markets itself as a “Data-Intelligence Platform” that unifies data engineering, analytics, AI/ML and governance on top of lake-house storage.
Core idea: decouple where you control the system (Control Plane) from where your data & compute actually live (Data Plane) so that you retain cloud ownership of sensitive assets.
Runs natively on the three hyperscalers (Azure, AWS, GCP). When the instructor chooses Azure the service is called Azure Databricks.
Account, Workspace & Control-Plane
Step-1: Sign up for a Databricks Account – gives you access to the vendor-managed Control Plane (browser UI, REST APIs, billing console).
Step-2: Within that account you create one or more Workspaces.
A Workspace = a web application that acts as your “mission-control” (create clusters, SQL warehouses, notebooks, jobs, Unity-Catalog objects, etc.).
No data or compute actually resides here; it only holds metadata & configuration.
Think of the control plane as a SaaS front-end; everything beneath it is provisioned in your own cloud subscription.
Data Storage & Compute Separation (Data Plane)
Your files (CSV, Parquet, Delta, etc.) stay inside your own cloud storage account (Azure Blob/ADLS, S3, GCS).
Your Compute (clusters, VMs, SQL Warehouses) is spun up in your subscription when you click “Create Cluster/Compute” in the workspace UI.
Benefit: security boundary – Databricks never takes custody of your raw data; you inherit all existing cloud policies.
Unity Catalog & Governance Layer
Unity Catalog (UC) is Databricks’ centralized governance service.
Manages catalogs → schemas → tables/volumes with fine-grained ACLs.
Supports lineage, audit, masking, row/column-level security.
UC objects:
Catalog (top level) – synonymous with a database catalog.
Schema (aka database) – logical group of tables/volumes.
Table – structured, Delta/Parquet-backed, ACID-enabled.
Volume – unstructured or semi-structured file container.
UC stores its metadata as files in a storage location you specify (example in demo: kubix-dev-db-storage/UnityCatalog/…).
Spark Runtime & Execution Engine
Every compute resource is launched with a Databricks Runtime (DBR) that bundles a specific Apache Spark version + curated libraries.
Spark = unified distributed processing engine for batch, streaming, SQL, ML, graph workloads.
Splits a job → stages → tasks executed in parallel across CPU cores.
Utilises in-memory processing where possible for speed.
Languages supported out-of-the-box: Additional abstraction layers:
High-level APIs: DataFrame, Spark SQL, pandas-API-on-Spark.
Advanced libraries: Structured Streaming, MLlib, GraphX.
Internally Spark still represents datasets as RDDs (Resilient Distributed Datasets) – low-level immutable partitions – but users rarely code against them directly.
Low- vs High-Level APIs in Spark
Low Level: RDD transformations/actions written in Scala/Java; gives fine control but verbose.
High Level: DataFrame/Dataset API & Spark SQL; automatically compiles to RDD ops, optimised by Catalyst/Whole-Stage Codegen.
Databricks notebooks encourage high-level APIs; magic commands allow mixing languages in a single notebook.
Databricks Deployment Options & Azure Databricks
You can attach Databricks workspaces to AWS, Azure, or GCP.
Azure Databricks = Databricks offered as a first-party service in Azure Marketplace.
Provisioned through Azure Portal / ARM / Bicep / Terraform.
All billing rolled into normal Azure invoice → single vendor.
Single-Sign-On (SSO) via Azure AD (now Entra ID).
Azure Integration Benefits (Why many customers choose Azure)
Tight coupling with Azure ecosystem:
Azure Data Factory pipelines, Power BI DirectQuery, Azure DevOps CI/CD, Event Hub/Kafka, Azure Functions.
Native SSO & Conditional Access with Entra ID.
Private Link, V-Net injection, managed VNet for network hardening.
Simpler procurement & consolidated billing.
Walk-through Demo 1 – Small Local CSV (<)
Use-case: “I have a local CSV, want to analyse it quickly.”
Steps performed live:
Open Azure Portal → Azure Databricks workspace (SSO via Entra ID).
Start a Serverless SQL Warehouse (provides on-demand compute for SQL UI).
Navigate: Catalog → existing catalog “dev” → create new Schema
demo.In that schema create Volume
my_volume→ auto-creates folder hierarchy in storage account (volume ID ).Upload file via UI (limit per file).
Instructor sample: simple Sales CSV (order date, customer, city, country, product, price, qty, amount, ship date).
UI → “Create Table from File.” Auto-detects column names & types (dates, integers, strings).
Table
salesnow visible in UC. Physically saved as Delta/Parquet files in storage location.
Query via SQL Editor:
SELECT * FROM dev.demo.sales;Aggregation example:
SELECT customer_name, SUM(amount) FROM dev.demo.sales GROUP BY 1;Added simple bar-chart visual → can be pinned to dashboards.
Key takeaways:
Table abstraction even though underlying is just files.
Quick ad-hoc analytics possible without clusters (serverless warehouse).
Walk-through Demo 2 – Large CSV (≈ rows, >)
Problem: UI upload limit . Solution: upload via cloud storage tools.
Steps:
Use Azure Storage Explorer to connect to storage account
kubix-dev-db-storage→ containerUnityCatalog→ folders:catalogs/dev/volumes/my_volume/input/.Upload big file
schools.csv(~30 lakh rows, 56.8 MB but assumed >5 GB scenario).* Each row describes an Indian school (state, toilets, teachers, boys/girls counts, etc.).In Databricks UI refresh the Volume – file now visible.
Click Create Table → catalog
dev, schemademo, tableschools.Launch an All-Purpose Cluster (aka Personal Compute) – heavier weight than serverless SQL warehouse, supports Notebooks & multi-language.
Create Notebook “My First Notebook”:
Default language set to Python but demonstrate Magic Commands:
%sql→ execute Spark SQL.%scala,%python(optional explicit),%r.
Attach the notebook to the Personal Compute cluster.
Queries executed:
%sql SELECT * FROM dev.demo.schools LIMIT 100;(shows sample grid, execution time ≈ ).Count schools per state:
sql %sql SELECT state, COUNT(*) AS num_schools FROM dev.demo.schools GROUP BY state ORDER BY num_schools DESC;Result: Uttar Pradesh highest, followed by Madhya Pradesh, etc.
Observations:
Even multi-million-row CSV handled interactively once ingested as Delta/Parquet.
Same analysis could be done directly on file without table using
READ_CSV(...)pattern, e.g.sql SELECT * FROM csv.`/Volumes/dev/demo/my_volume/input/sales.csv`;Demonstrates schema-on-read vs managed table creation.
Notebooks, Languages & Magic Commands
Notebook = interactive document composed of cells (Markdown, code, visualisations).
Attach to a cluster/compute once; every command runs there.
Magic Syntax
%<lang>at top of cell overrides notebook default.Supports version control, job scheduling, and parameterisation (not covered in demo yet).
SQL Warehouse vs All-Purpose Compute
SQL Warehouse (Serverless or Classic): optimized for purely SQL analytics; auto-scales; quicker start-up; no Scala/Python.
All-Purpose Cluster: full Spark runtime, multi-language, for data engineering/ML; costlier; requires manual start/stop or auto-termination.
Performance & Spark Execution Model (touched briefly)
Every SQL/DataFrame command becomes a Spark Job.
Job ➔ Stages ➔ Tasks ➔ executed per core.
Spark relies on parallelism; splitting data into partitions accelerates large-scale processing.
Work happens in-memory; spills to disk when partitions exceed executor memory.
Practical Limits & Best Practices Mentioned
UI file upload hard-limit: / file, files / operation.
Prefer external tools (AzCopy, Azure Storage Explorer, AWS CLI, gsutil) for big data loads.
Always reference tables via 3-part name
catalog.schema.tablefor clarity & cross-catalog safety.Start small (serverless warehouse) for ad-hoc SQL; spin up clusters only when Python/Scala/ML needed.
Ethical, Philosophical, Real-World Implications
Data remains in customer’s cloud tenant ➔ satisfies regulatory requirements (GDPR, HIPAA, etc.).
Unified billing & SSO reduce operational overhead → encourages governed self-service analytics.
Demonstrated governance (Unity Catalog) ensures principle-of-least-privilege can be enforced even in multi-tenant lakehouse.
Upcoming Demos / Course Roadmap (as hinted)
Two additional demos still to come – will further familiarise students with UI & features before deep-dive theory.
Subsequent lectures will cover:
Detailed cluster sizing & cost optimisation.
Delta Lake ACID internals.
Auto-Loader & Streaming pipelines.
CI/CD with Azure DevOps.
Advanced ML & graph analytics libraries.
End of high-level study notes for the recording.