3 Databricks Architecture, Spark Runtime & Azure Demo – Study Notes

Databricks Platform Overview

  • Databricks markets itself as a “Data-Intelligence Platform” that unifies data engineering, analytics, AI/ML and governance on top of lake-house storage.

  • Core idea: decouple where you control the system (Control Plane) from where your data & compute actually live (Data Plane) so that you retain cloud ownership of sensitive assets.

  • Runs natively on the three hyperscalers (Azure, AWS, GCP). When the instructor chooses Azure the service is called Azure Databricks.

Account, Workspace & Control-Plane

  • Step-1: Sign up for a Databricks Account – gives you access to the vendor-managed Control Plane (browser UI, REST APIs, billing console).

  • Step-2: Within that account you create one or more Workspaces.

    • A Workspace = a web application that acts as your “mission-control” (create clusters, SQL warehouses, notebooks, jobs, Unity-Catalog objects, etc.).

    • No data or compute actually resides here; it only holds metadata & configuration.

  • Think of the control plane as a SaaS front-end; everything beneath it is provisioned in your own cloud subscription.

Data Storage & Compute Separation (Data Plane)

  • Your files (CSV, Parquet, Delta, etc.) stay inside your own cloud storage account (Azure Blob/ADLS, S3, GCS).

  • Your Compute (clusters, VMs, SQL Warehouses) is spun up in your subscription when you click “Create Cluster/Compute” in the workspace UI.

  • Benefit: security boundary – Databricks never takes custody of your raw data; you inherit all existing cloud policies.

Unity Catalog & Governance Layer

  • Unity Catalog (UC) is Databricks’ centralized governance service.

    • Manages catalogs → schemas → tables/volumes with fine-grained ACLs.

    • Supports lineage, audit, masking, row/column-level security.

  • UC objects:

    • Catalog (top level) – synonymous with a database catalog.

    • Schema (aka database) – logical group of tables/volumes.

    • Table – structured, Delta/Parquet-backed, ACID-enabled.

    • Volume – unstructured or semi-structured file container.

  • UC stores its metadata as files in a storage location you specify (example in demo: kubix-dev-db-storage/UnityCatalog/…).

Spark Runtime & Execution Engine

  • Every compute resource is launched with a Databricks Runtime (DBR) that bundles a specific Apache Spark version + curated libraries.

  • Spark = unified distributed processing engine for batch, streaming, SQL, ML, graph workloads.

    • Splits a job → stages → tasks executed in parallel across CPU cores.

    • Utilises in-memory processing where possible for speed.

  • Languages supported out-of-the-box: Scala (native),  Python (PySpark),  SQL,  Java,  R.\text{Scala\ (native)},\;Python\ (PySpark),\;SQL,\;Java,\;R. Additional abstraction layers:

    • High-level APIs: DataFrame, Spark SQL, pandas-API-on-Spark.

    • Advanced libraries: Structured Streaming, MLlib, GraphX.

  • Internally Spark still represents datasets as RDDs (Resilient Distributed Datasets) – low-level immutable partitions – but users rarely code against them directly.

Low- vs High-Level APIs in Spark

  • Low Level: RDD transformations/actions written in Scala/Java; gives fine control but verbose.

  • High Level: DataFrame/Dataset API & Spark SQL; automatically compiles to RDD ops, optimised by Catalyst/Whole-Stage Codegen.

  • Databricks notebooks encourage high-level APIs; magic commands allow mixing languages in a single notebook.

Databricks Deployment Options & Azure Databricks

  • You can attach Databricks workspaces to AWS, Azure, or GCP.

  • Azure Databricks = Databricks offered as a first-party service in Azure Marketplace.

    • Provisioned through Azure Portal / ARM / Bicep / Terraform.

    • All billing rolled into normal Azure invoice → single vendor.

    • Single-Sign-On (SSO) via Azure AD (now Entra ID).

Azure Integration Benefits (Why many customers choose Azure)

  • Tight coupling with Azure ecosystem:

    • Azure Data Factory pipelines, Power BI DirectQuery, Azure DevOps CI/CD, Event Hub/Kafka, Azure Functions.

  • Native SSO & Conditional Access with Entra ID.

  • Private Link, V-Net injection, managed VNet for network hardening.

  • Simpler procurement & consolidated billing.

Walk-through Demo 1 – Small Local CSV (<5GB5\,\text{GB})

  1. Use-case: “I have a local CSV, want to analyse it quickly.”

  2. Steps performed live:

    • Open Azure Portal → Azure Databricks workspace (SSO via Entra ID).

    • Start a Serverless SQL Warehouse (provides on-demand compute for SQL UI).

    • Navigate: Catalog → existing catalog “dev” → create new Schema demo.

    • In that schema create Volume my_volume → auto-creates folder hierarchy in storage account (volume ID 7F65DC\approx 7F65DC…).

    • Upload file via UI (limit 5GB5\,\text{GB} per file).

      • Instructor sample: simple Sales CSV (order date, customer, city, country, product, price, qty, amount, ship date).

    • UI → “Create Table from File.” Auto-detects column names & types (dates, integers, strings).

    • Table sales now visible in UC. Physically saved as Delta/Parquet files in storage location.

  3. Query via SQL Editor:

    • SELECT * FROM dev.demo.sales;

    • Aggregation example: SELECT customer_name, SUM(amount) FROM dev.demo.sales GROUP BY 1;

    • Added simple bar-chart visual → can be pinned to dashboards.

  4. Key takeaways:

    • Table abstraction even though underlying is just files.

    • Quick ad-hoc analytics possible without clusters (serverless warehouse).

Walk-through Demo 2 – Large CSV (≈30lakh=300000030\,\text{lakh} = 3\,000\,000 rows, >5GB5\,\text{GB})

Problem: UI upload limit 5GB5\,\text{GB}. Solution: upload via cloud storage tools.

Steps:

  1. Use Azure Storage Explorer to connect to storage account kubix-dev-db-storage → container UnityCatalog → folders: catalogs/dev/volumes/my_volume/input/.

  2. Upload big file schools.csv (~30 lakh rows, 56.8 MB but assumed >5 GB scenario).* Each row describes an Indian school (state, toilets, teachers, boys/girls counts, etc.).

  3. In Databricks UI refresh the Volume – file now visible.

  4. Click Create Table → catalog dev, schema demo, table schools.

  5. Launch an All-Purpose Cluster (aka Personal Compute) – heavier weight than serverless SQL warehouse, supports Notebooks & multi-language.

  6. Create Notebook “My First Notebook”:

    • Default language set to Python but demonstrate Magic Commands:

      • %sql → execute Spark SQL.

      • %scala, %python (optional explicit), %r.

    • Attach the notebook to the Personal Compute cluster.

  7. Queries executed:

    • %sql SELECT * FROM dev.demo.schools LIMIT 100; (shows sample grid, execution time ≈ 11.92s11.92\,\text{s}).

    • Count schools per state:
      sql %sql SELECT state, COUNT(*) AS num_schools FROM dev.demo.schools GROUP BY state ORDER BY num_schools DESC;

    • Result: Uttar Pradesh highest, followed by Madhya Pradesh, etc.

  8. Observations:

    • Even multi-million-row CSV handled interactively once ingested as Delta/Parquet.

    • Same analysis could be done directly on file without table using READ_CSV(...) pattern, e.g.
      sql SELECT * FROM csv.`/Volumes/dev/demo/my_volume/input/sales.csv`;

    • Demonstrates schema-on-read vs managed table creation.

Notebooks, Languages & Magic Commands

  • Notebook = interactive document composed of cells (Markdown, code, visualisations).

  • Attach to a cluster/compute once; every command runs there.

  • Magic Syntax %<lang> at top of cell overrides notebook default.

  • Supports version control, job scheduling, and parameterisation (not covered in demo yet).

SQL Warehouse vs All-Purpose Compute

  • SQL Warehouse (Serverless or Classic): optimized for purely SQL analytics; auto-scales; quicker start-up; no Scala/Python.

  • All-Purpose Cluster: full Spark runtime, multi-language, for data engineering/ML; costlier; requires manual start/stop or auto-termination.

Performance & Spark Execution Model (touched briefly)

  • Every SQL/DataFrame command becomes a Spark Job.

    • Job ➔ Stages ➔ Tasks ➔ executed per core.

  • Spark relies on parallelism; splitting data into partitions accelerates large-scale processing.

  • Work happens in-memory; spills to disk when partitions exceed executor memory.

Practical Limits & Best Practices Mentioned

  • UI file upload hard-limit: 5GB5\,\text{GB} / file, 100100 files / operation.

  • Prefer external tools (AzCopy, Azure Storage Explorer, AWS CLI, gsutil) for big data loads.

  • Always reference tables via 3-part name catalog.schema.table for clarity & cross-catalog safety.

  • Start small (serverless warehouse) for ad-hoc SQL; spin up clusters only when Python/Scala/ML needed.

Ethical, Philosophical, Real-World Implications

  • Data remains in customer’s cloud tenant ➔ satisfies regulatory requirements (GDPR, HIPAA, etc.).

  • Unified billing & SSO reduce operational overhead → encourages governed self-service analytics.

  • Demonstrated governance (Unity Catalog) ensures principle-of-least-privilege can be enforced even in multi-tenant lakehouse.

Upcoming Demos / Course Roadmap (as hinted)

  • Two additional demos still to come – will further familiarise students with UI & features before deep-dive theory.

  • Subsequent lectures will cover:

    • Detailed cluster sizing & cost optimisation.

    • Delta Lake ACID internals.

    • Auto-Loader & Streaming pipelines.

    • CI/CD with Azure DevOps.

    • Advanced ML & graph analytics libraries.


End of high-level study notes for the recording.