3 Databricks Architecture, Spark Runtime & Azure Demo – Study Notes

Databricks Platform Overview

Databricks markets itself as a “Data-Intelligence Platform” that unifies data engineering, analytics, AI/ML and governance on top of lake-house storage.
Core idea: decouple where you control the system (Control Plane) from where your data & compute actually live (Data Plane) so that you retain cloud ownership of sensitive assets.
Runs natively on the three hyperscalers (Azure, AWS, GCP). When the instructor chooses Azure the service is called Azure Databricks.

Account, Workspace & Control-Plane

Step-1: Sign up for a Databricks Account – gives you access to the vendor-managed Control Plane (browser UI, REST APIs, billing console).
Step-2: Within that account you create one or more Workspaces.
- A Workspace = a web application that acts as your “mission-control” (create clusters, SQL warehouses, notebooks, jobs, Unity-Catalog objects, etc.).
- No data or compute actually resides here; it only holds metadata & configuration.
Think of the control plane as a SaaS front-end; everything beneath it is provisioned in your own cloud subscription.

Data Storage & Compute Separation (Data Plane)

Your files (CSV, Parquet, Delta, etc.) stay inside your own cloud storage account (Azure Blob/ADLS, S3, GCS).
Your Compute (clusters, VMs, SQL Warehouses) is spun up in your subscription when you click “Create Cluster/Compute” in the workspace UI.
Benefit: security boundary – Databricks never takes custody of your raw data; you inherit all existing cloud policies.

Unity Catalog & Governance Layer

Unity Catalog (UC) is Databricks’ centralized governance service.
- Manages catalogs → schemas → tables/volumes with fine-grained ACLs.
- Supports lineage, audit, masking, row/column-level security.
UC objects:
- Catalog (top level) – synonymous with a database catalog.
- Schema (aka database) – logical group of tables/volumes.
- Table – structured, Delta/Parquet-backed, ACID-enabled.
- Volume – unstructured or semi-structured file container.
UC stores its metadata as files in a storage location you specify (example in demo: kubix-dev-db-storage/UnityCatalog/…).

Spark Runtime & Execution Engine

Every compute resource is launched with a Databricks Runtime (DBR) that bundles a specific Apache Spark version + curated libraries.
Spark = unified distributed processing engine for batch, streaming, SQL, ML, graph workloads.
- Splits a job → stages → tasks executed in parallel across CPU cores.
- Utilises in-memory processing where possible for speed.
Languages supported out-of-the-box: $\text{Scala\ (native)},\;Python\ (PySpark),\;SQL,\;Java,\;R.$ Additional abstraction layers:
- High-level APIs: DataFrame, Spark SQL, pandas-API-on-Spark.
- Advanced libraries: Structured Streaming, MLlib, GraphX.
Internally Spark still represents datasets as RDDs (Resilient Distributed Datasets) – low-level immutable partitions – but users rarely code against them directly.

Low- vs High-Level APIs in Spark

Low Level: RDD transformations/actions written in Scala/Java; gives fine control but verbose.
High Level: DataFrame/Dataset API & Spark SQL; automatically compiles to RDD ops, optimised by Catalyst/Whole-Stage Codegen.
Databricks notebooks encourage high-level APIs; magic commands allow mixing languages in a single notebook.

Databricks Deployment Options & Azure Databricks

You can attach Databricks workspaces to AWS, Azure, or GCP.
Azure Databricks = Databricks offered as a first-party service in Azure Marketplace.
- Provisioned through Azure Portal / ARM / Bicep / Terraform.
- All billing rolled into normal Azure invoice → single vendor.
- Single-Sign-On (SSO) via Azure AD (now Entra ID).

Azure Integration Benefits (Why many customers choose Azure)

Tight coupling with Azure ecosystem:
- Azure Data Factory pipelines, Power BI DirectQuery, Azure DevOps CI/CD, Event Hub/Kafka, Azure Functions.
Native SSO & Conditional Access with Entra ID.
Private Link, V-Net injection, managed VNet for network hardening.
Simpler procurement & consolidated billing.

Walk-through Demo 1 – Small Local CSV (< $5\,\text{GB}$ )

Use-case: “I have a local CSV, want to analyse it quickly.”
Steps performed live:
- Open Azure Portal → Azure Databricks workspace (SSO via Entra ID).
- Start a Serverless SQL Warehouse (provides on-demand compute for SQL UI).
- Navigate: Catalog → existing catalog “dev” → create new Schema demo.
- In that schema create Volume my_volume → auto-creates folder hierarchy in storage account (volume ID $\approx 7F65DC…$ ).
- Upload file via UI (limit $5\,\text{GB}$ per file).
  - Instructor sample: simple Sales CSV (order date, customer, city, country, product, price, qty, amount, ship date).
- UI → “Create Table from File.” Auto-detects column names & types (dates, integers, strings).
- Table sales now visible in UC. Physically saved as Delta/Parquet files in storage location.
Query via SQL Editor:
- SELECT * FROM dev.demo.sales;
- Aggregation example: SELECT customer_name, SUM(amount) FROM dev.demo.sales GROUP BY 1;
- Added simple bar-chart visual → can be pinned to dashboards.
Key takeaways:
- Table abstraction even though underlying is just files.
- Quick ad-hoc analytics possible without clusters (serverless warehouse).

Walk-through Demo 2 – Large CSV (≈ $30\,\text{lakh} = 3\,000\,000$ rows, > $5\,\text{GB}$ )

Problem: UI upload limit $5\,\text{GB}$ . Solution: upload via cloud storage tools.

Steps:

Use Azure Storage Explorer to connect to storage account kubix-dev-db-storage → container UnityCatalog → folders: catalogs/dev/volumes/my_volume/input/.
Upload big file schools.csv (~30 lakh rows, 56.8 MB but assumed >5 GB scenario).* Each row describes an Indian school (state, toilets, teachers, boys/girls counts, etc.).
In Databricks UI refresh the Volume – file now visible.
Click Create Table → catalog dev, schema demo, table schools.
Launch an All-Purpose Cluster (aka Personal Compute) – heavier weight than serverless SQL warehouse, supports Notebooks & multi-language.
Create Notebook “My First Notebook”:
- Default language set to Python but demonstrate Magic Commands:
  - %sql → execute Spark SQL.
  - %scala, %python (optional explicit), %r.
- Attach the notebook to the Personal Compute cluster.
Queries executed:
- %sql SELECT * FROM dev.demo.schools LIMIT 100; (shows sample grid, execution time ≈ $11.92\,\text{s}$ ).
- Count schools per state:
  sql %sql SELECT state, COUNT(*) AS num_schools FROM dev.demo.schools GROUP BY state ORDER BY num_schools DESC;
- Result: Uttar Pradesh highest, followed by Madhya Pradesh, etc.
Observations:
- Even multi-million-row CSV handled interactively once ingested as Delta/Parquet.
- Same analysis could be done directly on file without table using READ_CSV(...) pattern, e.g.
  sql SELECT * FROM csv.`/Volumes/dev/demo/my_volume/input/sales.csv`;
- Demonstrates schema-on-read vs managed table creation.

Notebooks, Languages & Magic Commands

Notebook = interactive document composed of cells (Markdown, code, visualisations).
Attach to a cluster/compute once; every command runs there.
Magic Syntax %<lang> at top of cell overrides notebook default.
Supports version control, job scheduling, and parameterisation (not covered in demo yet).

SQL Warehouse vs All-Purpose Compute

SQL Warehouse (Serverless or Classic): optimized for purely SQL analytics; auto-scales; quicker start-up; no Scala/Python.
All-Purpose Cluster: full Spark runtime, multi-language, for data engineering/ML; costlier; requires manual start/stop or auto-termination.

Performance & Spark Execution Model (touched briefly)

Every SQL/DataFrame command becomes a Spark Job.
- Job ➔ Stages ➔ Tasks ➔ executed per core.
Spark relies on parallelism; splitting data into partitions accelerates large-scale processing.
Work happens in-memory; spills to disk when partitions exceed executor memory.

Practical Limits & Best Practices Mentioned

UI file upload hard-limit: $5\,\text{GB}$ / file, $100$ files / operation.
Prefer external tools (AzCopy, Azure Storage Explorer, AWS CLI, gsutil) for big data loads.
Always reference tables via 3-part name catalog.schema.table for clarity & cross-catalog safety.
Start small (serverless warehouse) for ad-hoc SQL; spin up clusters only when Python/Scala/ML needed.

Ethical, Philosophical, Real-World Implications

Data remains in customer’s cloud tenant ➔ satisfies regulatory requirements (GDPR, HIPAA, etc.).
Unified billing & SSO reduce operational overhead → encourages governed self-service analytics.
Demonstrated governance (Unity Catalog) ensures principle-of-least-privilege can be enforced even in multi-tenant lakehouse.

Upcoming Demos / Course Roadmap (as hinted)

Two additional demos still to come – will further familiarise students with UI & features before deep-dive theory.
Subsequent lectures will cover:
- Detailed cluster sizing & cost optimisation.
- Delta Lake ACID internals.
- Auto-Loader & Streaming pipelines.
- CI/CD with Azure DevOps.
- Advanced ML & graph analytics libraries.

End of high-level study notes for the recording.

3 Databricks Architecture, Spark Runtime & Azure Demo – Study Notes

Databricks Platform Overview

Account, Workspace & Control-Plane

Data Storage & Compute Separation (Data Plane)

Unity Catalog & Governance Layer

Spark Runtime & Execution Engine

Low- vs High-Level APIs in Spark

Databricks Deployment Options & Azure Databricks

Azure Integration Benefits (Why many customers choose Azure)

Walk-through Demo 1 – Small Local CSV (<5 GB5\,\text{GB}5GB)

Walk-through Demo 2 – Large CSV (≈30 lakh=3 000 00030\,\text{lakh} = 3\,000\,00030lakh=3000000 rows, >5 GB5\,\text{GB}5GB)

Notebooks, Languages & Magic Commands

SQL Warehouse vs All-Purpose Compute

Performance & Spark Execution Model (touched briefly)

Practical Limits & Best Practices Mentioned

Ethical, Philosophical, Real-World Implications

Upcoming Demos / Course Roadmap (as hinted)

Walk-through Demo 1 – Small Local CSV (< $5\,\text{GB}$ )

Walk-through Demo 2 – Large CSV (≈ $30\,\text{lakh} = 3\,000\,000$ rows, > $5\,\text{GB}$ )