Sender-Service Failure, Exception-Handling Hot-Fix & Event-Driven Redesign
Hosted Service / Task.Run Error Handling
- Current structure
  - SenderService is implemented as a Hosted Service.
  - Inside ExecuteAsync the “main loop” is launched with Task.Run.
  - The loop is a while (true) that:
    - Queries the DB for executable rows.
    - Filters by cluster (two Kafka clusters: Netherlands, France).
    - Checks whether the associated Kafka topic already exists in the local cluster.
    - Sends / processes messages accordingly.
- Original bug
  - No surrounding try–catch in the Task body.
  - Any exception (DB outage, Kafka error, business-logic bug) bubbles out → the Task faults silently.
  - ❌ No logging, ❌ no restart, ❌ no Nomad alert → all Dutch instances died and stopped processing.
  - Root cause still unknown because nothing was logged.
- Hot-fix
  - Wrapped the whole loop in try–catch.
  - On catch:
    - Log the exception (standard logger + centralised logs).
    - Continue the loop so the service stays alive.
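The hot-fix pattern can be sketched as below. Only SenderService, ExecuteAsync, Task.Run and the try–catch-around-the-loop idea come from the notes; the logger, the 30-second delay, and ProcessPendingExecutionsAsync are illustrative assumptions, not the production code.

```csharp
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Sketch only: names other than SenderService/ExecuteAsync are assumed.
public class SenderService : BackgroundService
{
    private readonly ILogger<SenderService> _logger;

    public SenderService(ILogger<SenderService> logger) => _logger = logger;

    protected override Task ExecuteAsync(CancellationToken stoppingToken) =>
        Task.Run(async () =>
        {
            while (!stoppingToken.IsCancellationRequested)
            {
                try
                {
                    await ProcessPendingExecutionsAsync(stoppingToken);
                }
                catch (Exception ex)
                {
                    // Before the fix this exception bubbled out of the Task,
                    // which faulted silently: no log, no restart, no alert.
                    _logger.LogError(ex, "Sender main loop iteration failed");
                }

                await Task.Delay(TimeSpan.FromSeconds(30), stoppingToken);
            }
        }, stoppingToken);

    private Task ProcessPendingExecutionsAsync(CancellationToken ct) =>
        Task.CompletedTask; // placeholder for the real DB poll + Kafka send
}
```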
- Future-proofing ideas
  - Introduce a threshold counter: after N = 5 consecutive exceptions, deliberately throw so the container crashes and Nomad restarts it.
  - Re-examine all usages of Task.Run across the code-base (Execution Initialiser, Dispatcher, …) to ensure identical protection.
Restart / Nomad Policy
- General company practice: "fail-fast & let the orchestrator restart".
- Open questions
- How many consecutive crashes before Nomad gives up? (Unclear / needs confirmation.)
- Would repeated restarts hide real problems or overload on-call? Balance with the threshold approach.
- Possible implementation
  - Global utility wrapper, e.g. RunWithRestartPolicy(Func<Task>), used everywhere instead of raw Task.Run.
  - Emits a Slack alert every time it enters the catch; after ≥ 5 consecutive failures → Environment.Exit(1).
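A minimal sketch of such a wrapper, assuming the name RunWithRestartPolicy and the ≥ 5 threshold from the notes; the alert delegate (e.g. a Slack poster) and the reset-on-success behaviour are assumptions that would need to be agreed in review.

```csharp
// Sketch only: combines the threshold idea and fail-fast restart policy
// from the notes. The alert callback is a hypothetical Slack integration.
public static class GuardedTasks
{
    public static Task RunWithRestartPolicy(
        Func<Task> body,
        Action<Exception> alert,          // e.g. posts to Slack
        int maxConsecutiveFailures = 5,
        CancellationToken ct = default) =>
        Task.Run(async () =>
        {
            var consecutiveFailures = 0;
            while (!ct.IsCancellationRequested)
            {
                try
                {
                    await body();
                    consecutiveFailures = 0; // a healthy iteration resets the counter
                }
                catch (Exception ex)
                {
                    consecutiveFailures++;
                    alert(ex);
                    if (consecutiveFailures >= maxConsecutiveFailures)
                    {
                        // Fail fast so the orchestrator (Nomad) restarts the container.
                        Environment.Exit(1);
                    }
                }
            }
        }, ct);
}
```

Callers would replace `Task.Run(Loop)` with `GuardedTasks.RunWithRestartPolicy(Loop, PostToSlack)`, giving every background loop identical protection.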
Execution Processing Pipeline (Today)
- Dispatcher
  - Detects eligible executions after preparation_time.
  - Publishes an execution-dispatch message.
- Data Preparer
  - Receives the dispatch.
  - Flushes data rows into a per-execution Kafka topic exec_{id} (creates the topic on first flush).
  - For some types (Message-Center) finishes the job itself without involving Sender.
- Sender
  - Performs periodic DB polling:
    - Selects executions whose delivery_date <= NOW().
    - Ensures the per-execution topic exists in its cluster.
    - Starts consuming / sending.
Second Incident: Missed Executions (Yesterday)
- Symptom: executions were skipped, no messages sent.
- Deep dive
  - Sender’s SQL filters with WHERE delivery_date <= NOW().
  - The failed executions were inserted after their delivery_date had already passed → never returned by the query.
  - Even when the record is visible, the Sender also requires the topic to exist. The topic is created only when Data-Preparer first flushes data; under heavy load this can lag → the execution is briefly invisible to the Sender and gets skipped.
- Quick mitigation agreed
  - Expand the SQL window by +1 hour (i.e. accept delivery_date <= NOW() + INTERVAL '1 hour').
  - Reduces but does not eliminate the risk.
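For illustration, the widened query might look as follows; only the delivery_date column and the +1 hour window come from the notes — the table name, the status column, and the SELECT list are placeholders.

```sql
-- Illustrative shape of the widened poll; table/column names are assumed.
SELECT id, delivery_date, cluster
FROM executions
WHERE delivery_date <= NOW() + INTERVAL '1 hour'
  AND status = 'pending';
```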
Structural Proposal: Event-Driven Signalling
- Replace DB polling with event signalling:
  - Data-Preparer, once it is certain the Kafka topic is created, publishes an “execution-ready” message to a new lightweight topic, e.g. exec.ready.
  - Sender consumes that topic and starts processing immediately.
- Guarantees / design points
  - Needs exactly-once delivery (easy: a single message per execution, low volume, can enable Kafka transactions).
  - Must ensure the topic truly exists before emitting the ready-message (check via ListTopics, or create-if-missing then confirm).
  - Concern about long-running consumers not committing offsets quickly: the belief is that as long as heartbeats continue, Kafka is fine. (Precedent: the abandoned-cart consumer tolerates 45-minute gaps.)
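The “create-if-missing then confirm, only then signal” step could be sketched with the Confluent.Kafka .NET client as below. The exec.ready topic and exec_{id} naming come from the notes; the partition/replication counts, method name, and message key/value layout are assumptions.

```csharp
using System.Linq;
using Confluent.Kafka;
using Confluent.Kafka.Admin;

// Sketch only: Data-Preparer side of the proposed event-driven signalling.
public static async Task SignalExecutionReadyAsync(
    IAdminClient admin, IProducer<string, string> producer, long executionId)
{
    var dataTopic = $"exec_{executionId}";

    try
    {
        // Create-if-missing; partition/replication values are placeholders.
        await admin.CreateTopicsAsync(new[]
        {
            new TopicSpecification { Name = dataTopic, NumPartitions = 3, ReplicationFactor = 3 }
        });
    }
    catch (CreateTopicsException e)
        when (e.Results.All(r => r.Error.Code == ErrorCode.TopicAlreadyExists))
    {
        // Already created by an earlier flush — fine.
    }

    // Confirm the broker actually reports the topic before signalling.
    var metadata = admin.GetMetadata(dataTopic, TimeSpan.FromSeconds(10));
    if (metadata.Topics.Single().Error.IsError)
        throw new InvalidOperationException($"Topic {dataTopic} not visible yet");

    // Only now tell the Sender it can start consuming.
    await producer.ProduceAsync("exec.ready",
        new Message<string, string> { Key = dataTopic, Value = executionId.ToString() });
}
```

Because exactly one ready-message is emitted per execution and volume is low, wrapping the final ProduceAsync in a Kafka transaction (as suggested in the notes) would be cheap.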
- Benefits
- Eliminates heavy DB polling (performance, scalability).
- Removes race condition between record insertion and topic creation.
- Clearer, event-driven flow; easier to reason about.
- Challenges / To-dos
- Cross-team design session (need Nidal, Batteries, etc.).
- Refactor Data-Preparer / Sender; add transactional publish; add fallback when Sender is down for hours (spam-protection, TTL, etc.).
Implementation Details Already Noted
- Topic creation logic today lives in DataPreparer.Flush():
  - Calls TryCreate(topicName, partitionCount, retention) on every flush.
- Alternative spot considered: Dispatcher could pre-create topics earlier, but that increases coupling.
- Any change must still allow Sender to pull extra metadata from DB (templates, lists), so DB access cannot be fully removed.
Interim Action Plan
- [x] Add try–catch around the Sender main loop + logging (DONE).
- [x] Deploy the change immediately.
- [ ] Add the 1-hour window to the delivery_date SQL and redeploy (Maxim, today).
- [ ] Implement threshold-based crash/restart logic.
- [ ] Code scan for unguarded Task.Run usage and patch.
- [ ] Monitor logs tomorrow; on-call (Oscar) expects better visibility.
- [ ] Schedule architecture workshop on event-driven signalling vs. DB polling.
Miscellaneous Notes
- Netherlands & France clusters: topic-existence check is cluster-specific.
- Nomad behaviour (continuous crash loop vs. back-off) still to be confirmed.
- Vacation schedule: Maxim off tomorrow afternoon; can start the refactor but may hand over to Sergei.
- Breakfast interruption caused 10 min audio drop – no technical content lost.