Kafka Submission Pipeline Overview
Introduction to Kafka Submissions
The video walks through a Kafka-based submission process, focusing on reliability, performance, and avoiding lost submissions. It outlines the structure of the submission pipeline and contrasts the raw and filtered submission topics in Kafka, explaining what each is used for.
Kafka Topics and Workers
Form Submissions Topic: This is where form submissions enter the pipeline for processing. The unfiltered (raw) submission topic is used to sync data to Athena, while the filtered topic is the one actively consumed for processing.
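The split between the two topics can be sketched as a small routing step. This is a minimal in-memory illustration, not the pipeline's actual code: the topic names, the `is_spam` field, and the `route` function are hypothetical, plain lists stand in for Kafka topics, and it assumes the filtered topic is a filtered copy of the raw one.

```python
from collections import defaultdict

# Hypothetical topic names; the real names are not given in the video summary.
RAW_TOPIC = "form-submissions-raw"
FILTERED_TOPIC = "form-submissions-filtered"


def route(submissions, passes_filter):
    """Write every submission to the raw topic and only accepted ones to the filtered topic."""
    topics = defaultdict(list)  # in-memory stand-in for Kafka topics
    for sub in submissions:
        topics[RAW_TOPIC].append(sub)  # complete record, e.g. for syncing to Athena
        if passes_filter(sub):
            topics[FILTERED_TOPIC].append(sub)  # what downstream workers consume
    return topics


topics = route(
    [{"id": 1, "is_spam": False}, {"id": 2, "is_spam": True}],
    passes_filter=lambda sub: not sub["is_spam"],
)
```

Keeping the raw topic complete means the analytics copy never depends on the filter being correct.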
Submission Pipeline: The pipeline includes several Kafka workers, each handling a different stage of a submission. Together these workers form a dependency graph, which is crucial for tracking a submission's flow from raw data to processed results.
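One way to reason about such a dependency graph is to write it down and derive a valid processing order from it. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the stage names are hypothetical placeholders, since the summary does not name the individual workers.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names; each key maps to the set of stages it depends on.
dependencies = {
    "convert_to_protobuf": {"raw_submission"},
    "upload_files": {"convert_to_protobuf"},
    "create_submission": {"upload_files"},
    "notify": {"create_submission"},
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
```

For this linear chain there is only one valid order, running from `raw_submission` to `notify`.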
Submission Processing and Protobufs
Raw Submission to Protobuf: The goal is to convert raw submissions into Protobuf format, which is structured and more efficient to serialize than typical Java models.
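Models: Two main types of models are discussed: immutable models and Protobufs. Protobufs provide advantages for data transport efficiency but come with limitations. The video cites a lack of support for maps, although current Protobuf versions do offer map fields with restrictions (for example, a map field cannot be repeated).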
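To make the modelling trade-off concrete, here is a minimal sketch of turning a raw submission into an immutable model whose key/value data is stored as a sequence of entries rather than a map. It uses plain frozen dataclasses, not Protobuf itself, and the class names, field names, and the shape of the raw dictionary are all assumptions for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class FieldEntry:
    """One key/value pair; a list of these stands in for a map."""
    name: str
    value: str


@dataclass(frozen=True)
class Submission:
    form_id: str
    entries: tuple  # tuple of FieldEntry, analogous to a repeated message field


def from_raw(raw: dict) -> Submission:
    """Convert a raw dict (hypothetical shape) into the immutable model."""
    entries = tuple(
        FieldEntry(name, str(value)) for name, value in sorted(raw["fields"].items())
    )
    return Submission(form_id=raw["form_id"], entries=entries)


sub = from_raw({"form_id": "f-1", "fields": {"email": "a@example.com", "age": 30}})
```

Because both classes are frozen, any attempt to reassign an attribute after construction raises `FrozenInstanceError`.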
File Management and Idempotency Checks
File Uploads: Files attached to submissions are sent to a file manager, which handles the uploads. The system implements idempotency checks to ensure that duplicate submissions are not processed more than once.
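An idempotency check of this kind can be sketched as follows. The summary does not say how duplicates are identified, so this example assumes a key derived from a hash of the submission content, and the `IdempotencyChecker` class is hypothetical. An in-memory set stands in for whatever shared store a real deployment would need.

```python
import hashlib
import json


class IdempotencyChecker:
    """Remembers which submissions were already seen (hypothetical helper)."""

    def __init__(self):
        self._seen = set()  # assumption: a real system would use a persistent, shared store

    def key_for(self, submission: dict) -> str:
        # sort_keys makes the hash independent of dictionary key order.
        payload = json.dumps(submission, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def first_time(self, submission: dict) -> bool:
        """Return True only the first time a given submission is presented."""
        key = self.key_for(submission)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


checker = IdempotencyChecker()
first = checker.first_time({"form_id": "f-1", "email": "a@example.com"})
# Same content with keys in a different order counts as a duplicate.
second = checker.first_time({"email": "a@example.com", "form_id": "f-1"})
```

A consumer would skip processing whenever `first_time` returns `False`, which keeps redelivered Kafka messages from creating duplicate work.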
Quarantine Process: A quarantine mechanism allows problematic submissions to be isolated and handled separately.
Submission Creation and Race Conditions
Submission Create Consumer: This consumer is responsible for creating contact records and managing submissions. The video outlines potential issues with race conditions due to how the submission and contact creation processes interact.
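The summary does not describe the exact race, but a common form is check-then-create: two consumers both see that a contact is missing and both create it. The sketch below shows one way to avoid that in a single process by making the check and the insert atomic. The `ContactStore` class is hypothetical, and a real distributed system would more likely rely on a unique constraint or key-based partitioning than on an in-process lock.

```python
import threading


class ContactStore:
    """In-memory stand-in for the contact system (hypothetical)."""

    def __init__(self):
        self._contacts = {}
        self._lock = threading.Lock()
        self.created = 0  # counts how many contacts were actually created

    def get_or_create(self, email: str) -> dict:
        # Holding the lock makes "look up, then insert if missing" one atomic step.
        with self._lock:
            contact = self._contacts.get(email)
            if contact is None:
                contact = {"id": len(self._contacts) + 1, "email": email}
                self._contacts[email] = contact
                self.created += 1
            return contact


store = ContactStore()
threads = [
    threading.Thread(target=store.get_or_create, args=("a@example.com",))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Even with eight concurrent callers for the same email address, only one contact is created.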
Error Handling: The video touches on error handling mechanisms, including the use of a problematic queue for failed submissions and an overflow mechanism to manage excessive requests.
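The two mechanisms mentioned here can be sketched together. This is an illustration under stated assumptions, not the pipeline's implementation: the retry budget, the queue capacity, the `enqueue` and `drain` functions, and the use of `ValueError` to signal a failed submission are all hypothetical, and deques stand in for Kafka topics.

```python
from collections import deque

MAX_ATTEMPTS = 3   # hypothetical retry budget per submission
MAIN_CAPACITY = 2  # hypothetical bound on the main queue


def enqueue(sub, main, overflow):
    """Put work on the main queue, or on the overflow queue when main is full."""
    target = main if len(main) < MAIN_CAPACITY else overflow
    target.append(sub)


def drain(main, problem, handler):
    """Process the main queue; submissions that keep failing go to the problem queue."""
    done = []
    while main:
        sub = main.popleft()
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                done.append(handler(sub))
                break
            except ValueError:
                if attempt == MAX_ATTEMPTS:
                    problem.append(sub)  # park it instead of blocking the queue
    return done


def handler(sub):
    if sub.get("bad"):
        raise ValueError("cannot process submission")
    return sub["id"]


main, overflow, problem = deque(), deque(), deque()
for sub in [{"id": 1}, {"id": 2, "bad": True}, {"id": 3}]:
    enqueue(sub, main, overflow)
done = drain(main, problem, handler)
```

Here submission 1 succeeds, submission 2 exhausts its retries and lands in the problem queue, and submission 3 waits in the overflow queue because the main queue was full when it arrived.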
Rate Limiting and Blocking Submissions
Blocking Lists: Submission block lists are implemented to prevent spam or unwanted entries. Different attributes can be blocked, including IPs and email addresses, to safeguard the system from abuse.
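A block-list check along these lines might look like the sketch below. The `BlockList` class, its fields, and the `ip` and `email` keys on the submission are assumptions for illustration; the addresses come from ranges reserved for documentation.

```python
from dataclasses import dataclass, field


@dataclass
class BlockList:
    """Blocked attributes, keyed by type (hypothetical structure)."""
    ips: set = field(default_factory=set)
    emails: set = field(default_factory=set)

    def is_blocked(self, submission: dict) -> bool:
        # Compare emails case-insensitively and without surrounding whitespace.
        email = (submission.get("email") or "").strip().lower()
        return submission.get("ip") in self.ips or email in self.emails


block_list = BlockList(ips={"203.0.113.7"}, emails={"spam@example.com"})
blocked_by_ip = block_list.is_blocked({"ip": "203.0.113.7", "email": "ok@example.com"})
blocked_by_email = block_list.is_blocked({"ip": "198.51.100.1", "email": " Spam@Example.com "})
allowed = not block_list.is_blocked({"ip": "198.51.100.1", "email": "ok@example.com"})
```

Normalizing the email before the lookup keeps trivial variations in case or spacing from bypassing the list.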
Normalization: Efforts are made to normalize data and truncate values based on specific criteria, ensuring data integrity and consistency.
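As a rough illustration of normalizing and truncating a value, consider the sketch below. The specific rules (Unicode NFC normalization, whitespace collapsing) and the 20-character limit are assumptions, since the summary does not state the actual criteria.

```python
import unicodedata

MAX_LEN = 20  # hypothetical limit; the real criteria are not given


def normalize(value: str, max_len: int = MAX_LEN) -> str:
    """Normalize Unicode, collapse whitespace, then truncate to max_len characters."""
    composed = unicodedata.normalize("NFC", value)
    collapsed = " ".join(composed.split())  # trims ends and collapses inner runs
    return collapsed[:max_len]


clean = normalize("  hello   world ")
truncated = normalize("x" * 50)
```

Truncating after normalization means the length limit applies to the cleaned value rather than to incidental whitespace.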
Interaction with Other Systems
Integration with Other Teams: The pipeline's output interfaces with various other systems, impacting contacts, tickets, and feedback mechanisms. The dependency on different structures highlights the complexity of submission processing.
Final Outputs: Each completed submission results in multiple outputs, including notifications and other system integrations, creating a comprehensive data flow across the organization.
Conclusion
The presentation outlines a complex Kafka-based submission pipeline designed to manage user data reliably and performantly while integrating with many other systems. It also addresses common pitfalls such as race conditions and failed submissions. The detailed structure encourages revisiting and refining submission processes as needed.