Databricks Use Case - 1
Overview of Catalog Objects in Databricks
Previous class topics included:
Understanding catalog objects such as tables, volumes, and schemas.
Introduction to Unity Catalog.
Objective of Current Class
Focus on real-time scenarios using a PostgreSQL database as an external data source.
Three datasets utilized within PostgreSQL:
Customers dataset
Product master dataset
Sales order dataset (item-level data that can include multiple products per order)
Description of sales order data: includes fields such as item, date, price, and quantity.
Setup of Database Environment
Use case involves building a catalog in Databricks with:
A schema encompassing necessary tables: customer, product, and sales.
Views to facilitate querying and dashboards for data visualization.
Creation of notebooks for data handling and a properly structured dashboard.
Connecting to PostgreSQL Database:
Utilizing a JDBC connector from Databricks to PostgreSQL, assuming a public IP is available.
The source could equally be an SAP system instead of PostgreSQL; the overall approach stays the same.
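The JDBC connection can be sketched as below; the host, database, table, and credentials are placeholders, not values from the class:

```python
# Build the JDBC URL and connection properties for PostgreSQL.
# Host, user, and password are placeholders (assumptions).
host, port, database = "20.0.0.1", 5432, "postgres"
jdbc_url = f"jdbc:postgresql://{host}:{port}/{database}"
connection_props = {
    "user": "admin",          # POSTGRES_USER set on the container
    "password": "secret",     # POSTGRES_PASSWORD set on the container
    "driver": "org.postgresql.Driver",
}
# Inside a Databricks notebook, `spark` is predefined:
# customers_df = spark.read.jdbc(jdbc_url, "public.customers", properties=connection_props)
```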
Creating PostgreSQL Database via Docker
Access Docker Hub:
Search for PostgreSQL official image.
Utilize the image for container setup on Azure.
Setting up on Azure (creating the Docker container):
Select "Container Instance" under Azure services.
Specify resource group and name (e.g., DB_example).
Choose region (e.g., Central India).
Optionally adjust container size and specifications, including networking.
Set public access and TCP port 5432 for PostgreSQL access.
Environment Variables:
Required variables to set:
POSTGRES_PASSWORD (database password)
POSTGRES_USER (database username)
Illustrates how to start the PostgreSQL container from the official image on Docker Hub.
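For local testing, the same image can be started with Docker directly; a sketch mirroring the Azure Container Instance settings (container name, username, and password are placeholders):

```python
import subprocess  # used only if you uncomment the run() call below

# `docker run` arguments for the official postgres image
cmd = [
    "docker", "run", "-d",
    "--name", "db-example",
    "-p", "5432:5432",                 # expose PostgreSQL's default TCP port
    "-e", "POSTGRES_USER=admin",       # database username (placeholder)
    "-e", "POSTGRES_PASSWORD=secret",  # database password (placeholder)
    "postgres",                        # official image from Docker Hub
]
# subprocess.run(cmd, check=True)  # uncomment to actually start the container
```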
Connecting to PostgreSQL
Start the PostgreSQL container.
Obtain IP address of the running container.
Connect using a database client such as DBeaver.
Required connection parameters:
Host: the container's IP address
Port: TCP 5432
Database name: postgres
Username and password set up earlier.
Data Generation in Databricks
Objectives:
Use Python notebooks to dynamically generate datasets for the three tables: customer, product, and sales orders.
Data generation options adapted for both full data load and delta load.
Delta loading allows adding new records without overwriting the existing dataset.
Implementation:
Created individual notebooks for:
Customer data
Product data
Sales order data
An additional notebook consolidates data, allowing creation and analysis of views in the Databricks workspace.
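The generation logic can be sketched with the standard library alone; the field names and the full/delta switch below are illustrative, not the actual notebook code:

```python
import random

def generate_customers(n, start_id=1):
    """Generate n synthetic customer rows with sequential IDs."""
    cities = ["Mumbai", "Delhi", "Chennai", "Pune"]
    return [
        {"customer_id": i, "name": f"Customer {i}", "city": random.choice(cities)}
        for i in range(start_id, start_id + n)
    ]

# Full load: regenerate the whole dataset from scratch.
customers = generate_customers(100)

# Delta load: append new records after the current max ID,
# leaving the existing records untouched.
new_rows = generate_customers(10, start_id=len(customers) + 1)
customers.extend(new_rows)
```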
Notebook Automation and Scheduling
Notebooks require scheduling for automation in real-world scenarios.
Databricks Workflows is used for job scheduling to execute the notebooks automatically.
Job Creation Steps:
Define tasks for each notebook to load data (customer master, product master, sales orders).
Manage dependencies to ensure correct execution order.
Parameterize notebooks to accept interactive inputs when loading data.
Set up notifications for both success and failure of each task.
Data Loading Process and Transformations
Transformation through Delta Lake enables:
Insert, update, delete capabilities.
Use of "MERGE" command to handle new data efficiently based on unique identifiers (e.g., order ID).
Executing SQL Queries:
Identify new rows added to the sales order table.
Aggregate and process them to create the final datasets.
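The delta handling can be sketched as a Delta Lake MERGE keyed on the order ID; the table and column names are assumptions:

```python
# Upsert staged sales orders into the target Delta table.
# Rows with a known order_id are updated; new order_ids are inserted.
merge_sql = """
MERGE INTO sales_orders AS target
USING sales_orders_staging AS source
  ON target.order_id = source.order_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
"""
# In a Databricks notebook: spark.sql(merge_sql)
```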
Building Visualizations and Dashboards in Databricks
Visualizations serve to analyze data trends, such as:
Customer revenue trends per year.
Product sales dynamics over specified time periods.
Creation of dashboards using:
Traditional notebook visualizations aggregated into a single dashboard.
Dashboards created directly from SQL queries for an improved user experience.
Key Concepts in Data Analysis
Importance of making complex data analyses accessible to non-technical stakeholders.
Ensuring data accuracy by processing fresh datasets regularly.
Discussion of performance considerations for datasets spanning billions of rows.
Summary of Scheduling and Monitoring Jobs
Comprehensive job monitoring view to track execution statuses in Databricks.
Jobs should be tracked and monitored for timely execution and troubleshooting.
Scheduling involves:
Time-based automation.
File arrival triggers for dynamic data loading.
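Both trigger styles map to job settings; a sketch in which the cron expression and the monitored storage path are placeholders:

```python
# Time-based trigger: Quartz cron, e.g. every day at 06:00 IST
schedule = {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "Asia/Kolkata",
}

# File-arrival trigger: fire when new files land in a monitored location
file_trigger = {
    "file_arrival": {"url": "/Volumes/main/sales/landing/"}
}
```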
Questions and Challenges
Participants are continuously engaged on project parameters and code snippets shared by a senior developer colleague.
Integrate learnings by discussing the implications of various design choices in real-time projects.