In-Depth Notes on Data Ingestion Pipeline and SCD Strategies
Overview of Data Engineering Pipeline
- The project ingests data from an on-premises SQL Server into Azure Data Lake Storage (Gen2).
- The pipeline extracts, transforms, and loads (ETL) the data efficiently.
Control Table Implementation
Control Table Purpose:
- To maintain metadata about data ingestion.
- Contains information such as:
- Tables to ingest.
- Columns involved.
- Load type (incremental/full).
Lookup Activity:
- Connects to the control table to extract ingestion details for further processing.
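The control table and Lookup activity can be sketched in plain Python. This is a minimal illustration, not the project's actual schema: column names such as `table_name`, `load_type`, and `watermark` are assumptions.

```python
# Hypothetical control-table rows; real column names vary by project.
control_table = [
    {"table_name": "dbo.Orders",   "columns": "order_id,amount,order_date",
     "load_type": "incremental", "watermark": "2024-01-01"},
    {"table_name": "dbo.Products", "columns": "*",
     "load_type": "full",        "watermark": None},
]

def lookup_ingestion_details(rows):
    """Mimic the Lookup activity: return the entries the ForEach will iterate."""
    return [r for r in rows if r["load_type"] in ("incremental", "full")]

details = lookup_ingestion_details(control_table)
print([d["table_name"] for d in details])  # ['dbo.Orders', 'dbo.Products']
```

In Azure Data Factory the Lookup activity returns these rows as an array, which is then passed to the ForEach activity described below.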
Data Ingestion Activities
ForEach Activity:
- Iterates over the datasets specified in the control table.
- Contains multiple activities:
- Copy Activity
- Notebook Activity
- Stored Procedure Activity
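One iteration of the ForEach can be sketched as a sequence of the three activities above. This is an illustrative simulation (the step strings and field names are assumptions), not ADF's actual execution model:

```python
def run_pipeline_iteration(entry):
    """One ForEach iteration: copy the table, transform it in a notebook,
    then record the run in the control table (names are illustrative)."""
    steps = []
    steps.append(f"copy {entry['table_name']} ({entry['load_type']} load)")
    steps.append(f"notebook transform for {entry['table_name']}")
    steps.append(f"stored procedure: update control table for {entry['table_name']}")
    return steps

plan = run_pipeline_iteration({"table_name": "dbo.Orders", "load_type": "incremental"})
```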
Copy Activity Details:
- Source: On-premises SQL Server.
- Sink: Azure Data Lake as JSON (real-world pipelines typically use Parquet for better performance).
- Uses Self-Hosted Integration Runtime to establish a connection to on-prem SQL Server.
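For incremental loads, the Copy activity's source query is usually built dynamically from the control-table entry. A minimal sketch, assuming a watermark column and a last-run date (both names are hypothetical):

```python
def build_source_query(table, columns, load_type,
                       watermark_column=None, last_run=None):
    """Build the Copy-activity source query from control-table metadata.
    Incremental loads filter on rows newer than the last run."""
    base = f"SELECT {columns} FROM {table}"
    if load_type == "incremental" and watermark_column and last_run:
        return f"{base} WHERE {watermark_column} > '{last_run}'"
    return base

q = build_source_query("dbo.Orders", "order_id,amount", "incremental",
                       watermark_column="modified_date", last_run="2024-01-01")
```

In ADF this query string would typically be assembled with pipeline expressions rather than Python, but the logic is the same.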
Transformation Process
Bronze to Silver Transformation:
- Transformations are applied as data moves from the Bronze zone to the Silver zone.
- Tasks include:
- Adding new columns.
- Data transformations (CASE expressions, window functions).
- Using SCD Type 1 and Type 2 strategies for managing historical data.
SCD Overview:
- Type 1: Updates existing records (latest values only).
- Type 2: Maintains history of changes (historical versions of records).
Example of Merge Statement for SCD Type 1
MERGE INTO target_table AS target
USING stage_table AS stage
ON stage.product_id = target.product_id
WHEN MATCHED THEN
  UPDATE SET target.price = stage.price
WHEN NOT MATCHED THEN
  INSERT (product_id, price) VALUES (stage.product_id, stage.price);
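The MERGE above overwrites values (Type 1). The Type 2 strategy instead expires the old row and appends a new version. A minimal Python sketch of that logic, using illustrative column names (`start_date`, `end_date`, `is_current`) rather than the project's actual schema:

```python
from datetime import date

def scd2_merge(dimension, staged, today=None):
    """Apply SCD Type 2: expire changed rows and append new versions,
    so the full history of each product's price is preserved."""
    today = today or date.today().isoformat()
    current = {r["product_id"]: r for r in dimension if r["is_current"]}
    for row in staged:
        old = current.get(row["product_id"])
        if old is None:
            # New product: insert its first version.
            dimension.append({**row, "start_date": today,
                              "end_date": None, "is_current": True})
        elif old["price"] != row["price"]:
            # Changed: close the old version, append a new current one.
            old["end_date"] = today
            old["is_current"] = False
            dimension.append({**row, "start_date": today,
                              "end_date": None, "is_current": True})
        # Unchanged rows are left untouched.
    return dimension

dim = [{"product_id": 1, "price": 10, "start_date": "2024-01-01",
        "end_date": None, "is_current": True}]
out = scd2_merge(dim, [{"product_id": 1, "price": 12},
                       {"product_id": 2, "price": 5}], today="2024-06-01")
```

In Databricks this is usually written as a Delta Lake MERGE with two WHEN MATCHED branches; the row-by-row version above just makes the expire-and-append pattern explicit.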
Execution Steps
- Data is first inserted into the Bronze table.
- After initial loading, the Stored Procedure Activity updates the control table with new run dates.
- The Notebook Activity handles post-ingestion processing, such as schema handling and transformations.
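The control-table update in the second step can be sketched as follows. This mimics what the Stored Procedure activity does; the column name `last_run_date` is an assumption:

```python
def update_run_date(control_rows, table_name, run_date):
    """Record the latest successful load time in the control table,
    so the next incremental run only reads rows newer than it."""
    for row in control_rows:
        if row["table_name"] == table_name:
            row["last_run_date"] = run_date
            return row
    raise KeyError(f"{table_name} not found in control table")

rows = [{"table_name": "dbo.Orders", "last_run_date": "2024-01-01"}]
updated = update_run_date(rows, "dbo.Orders", "2024-06-01")
```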
Debugging Pipeline Failures
- Check the Monitoring tab for failed runs.
- Use Web Activities to send email notifications on failure:
- Connect them to pipeline activities via Success/Failure dependency conditions.
- Configure the Web Activity to trigger the alert accordingly.
Project Presentation and Interview Preparation
Self Introduction Example:
Structure your intro based on years of experience and role specifics (data engineering with Azure tools).
Project Explanation:
- Built pipelines in Azure Data Factory for incremental data loading using control tables.
- Followed a medallion architecture (Bronze, Silver, Gold zones) for data transformation and analytics.
Be ready to discuss key components of the project, tools used (Azure Data Factory, Databricks), and ETL processes handled.
Common Interview Questions:
- Explain your role in the project.
- Discuss technologies/frameworks applied during the project.
- Talk about handling data processing failures and debugging.
Additional Tips:
- Use practical examples from your experience when answering questions.
- Ensure clarity and conciseness in your answers while aligning with the interviewer’s questions.
- Utilize modern tools like ChatGPT for grammar and clarity in your presentation.