Sender-Service Failure, Exception-Handling Hot-Fix & Event-Driven Redesign

Hosted Service / Task.Run Error Handling

  • Current structure
    • SenderService is implemented as a Hosted Service.
    • Inside ExecuteAsync the service launches the “main loop” with Task.Run.
    • The loop is a while (true) that:
      • Queries the DB for executable rows.
      • Filters by cluster (two Kafka clusters: Netherlands, France).
      • Checks whether the associated Kafka topic already exists in the local cluster.
      • Sends / processes messages accordingly.
  • Original bug
    • No surrounding try–catch in the Task body.
    • Any exception (DB outage, Kafka error, business-logic bug) bubbles out → the Task faults silently.
    • ❌ No logging, ❌ no restart, ❌ no Nomad alert → all Dutch instances died and stopped processing.
    • Root cause still unknown because nothing was logged.
  • Hot-fix
    • Wrapped the whole loop with try–catch.
    • On catch:
    • Log the exception (to standard logger + centralised logs).
    • Continue the loop so the service stays alive.
  • Future-proofing ideas
    • Introduce a threshold counter. After N = 5 consecutive exceptions → deliberately throw so that the container crashes and Nomad restarts it.
    • Re-examine all usages of Task.Run across the code base (Execution Initialiser, Dispatcher, …) to ensure identical protection.
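
The hot-fix plus the threshold idea could look roughly like the sketch below. This is illustrative only: ProcessPendingExecutionsAsync, the logger field, and the constant name are placeholders, not the real SenderService code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Sketch of the guarded main loop; helper names are placeholders.
public sealed class SenderService : BackgroundService
{
    private const int MaxConsecutiveFailures = 5;
    private readonly ILogger<SenderService> _logger;

    public SenderService(ILogger<SenderService> logger) => _logger = logger;

    protected override Task ExecuteAsync(CancellationToken stoppingToken) =>
        Task.Run(async () =>
        {
            var consecutiveFailures = 0;
            while (!stoppingToken.IsCancellationRequested)
            {
                try
                {
                    await ProcessPendingExecutionsAsync(stoppingToken);
                    consecutiveFailures = 0; // a healthy pass resets the counter
                }
                catch (Exception ex)
                {
                    consecutiveFailures++;
                    _logger.LogError(ex, "Sender loop failed ({Count} consecutive)",
                        consecutiveFailures);

                    if (consecutiveFailures >= MaxConsecutiveFailures)
                        throw; // fault the task so the host stops and Nomad restarts the container
                }
            }
        }, stoppingToken);

    // Placeholder for the real query/filter/send pass described above.
    private Task ProcessPendingExecutionsAsync(CancellationToken ct) => Task.CompletedTask;
}
```

Note that in .NET 6+ the default BackgroundServiceExceptionBehavior is StopHost, so the rethrow after the fifth failure actually takes the process down, which is exactly the fail-fast behaviour wanted here.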

Restart / Nomad Policy

  • General company practice: "fail-fast & let the orchestrator restart".
  • Open questions
    • How many consecutive crashes before Nomad gives up? (Unclear / needs confirmation.)
    • Would repeated restarts hide real problems or overload on-call? Balance with the threshold approach.
  • Possible implementation
    • Global utility wrapper e.g. RunWithRestartPolicy(Func<Task>) used everywhere instead of raw Task.Run.
    • Emits a Slack alert every time it enters the catch; after ≥ 5 consecutive failures, calls Environment.Exit(1).
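
A possible shape for that wrapper, as a sketch: RunWithRestartPolicy is the name from the note above, but the signature, the SlackNotifier helper, and the defaults are all assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// Hypothetical shared wrapper to replace raw Task.Run calls everywhere.
public static class GuardedTasks
{
    public static Task RunWithRestartPolicy(
        Func<CancellationToken, Task> body,
        ILogger logger,
        int maxConsecutiveFailures = 5,
        CancellationToken token = default) =>
        Task.Run(async () =>
        {
            var failures = 0;
            while (!token.IsCancellationRequested)
            {
                try
                {
                    await body(token);
                    failures = 0;
                }
                catch (Exception ex)
                {
                    failures++;
                    logger.LogError(ex, "Guarded task failed ({Failures} consecutive)", failures);
                    await SlackNotifier.SendAsync($"Guarded task failed: {ex.Message}"); // placeholder alerting client

                    if (failures >= maxConsecutiveFailures)
                        Environment.Exit(1); // fail fast; let Nomad restart the allocation
                }
            }
        }, token);
}
```

Callers would then write `GuardedTasks.RunWithRestartPolicy(ct => MainLoopAsync(ct), logger)` instead of `Task.Run(MainLoopAsync)`, so the policy lives in one place.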

Execution Processing Pipeline (Today)

  1. Dispatcher
    • Detects eligible executions after preparation_time.
    • Publishes an execution-dispatch message.
  2. Data Preparer
    • Receives dispatch.
    • Flushes data rows into per-execution Kafka topic exec_{id} (creates the topic on first flush).
    • For some types (Message-Center) finishes the job itself without involving Sender.
  3. Sender
    • Performs periodic DB polling:
    1. Selects executions whose delivery_date <= NOW().
    2. Ensures the per-execution topic exists in its cluster.
    3. Starts consuming / sending.
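
In code, the Sender's polling pass amounts to roughly the following. This is a sketch only: the schema, the Dapper-style QueryAsync call, and the helper names TopicExistsAsync / ConsumeAndSendAsync are assumptions, not the real implementation.

```csharp
// Illustrative Sender polling pass; schema and helper names are assumed.
var dueExecutions = await db.QueryAsync<Execution>(
    "SELECT * FROM executions WHERE delivery_date <= NOW() AND cluster = @Cluster",
    new { Cluster = localCluster });

foreach (var execution in dueExecutions)
{
    var topic = $"exec_{execution.Id}";

    // Cluster-specific metadata lookup; the topic only exists once the
    // Data-Preparer has done its first flush for this execution.
    if (!await TopicExistsAsync(topic))
        continue; // not ready yet; picked up again on the next poll

    await ConsumeAndSendAsync(topic, execution);
}
```

The `continue` branch is the race condition discussed below: if the topic check keeps failing (or the row never matches the WHERE clause), the execution is silently skipped.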

Second Incident: Missed Executions (Yesterday)

  • Symptom: executions were skipped, no messages sent.
  • Deep dive
    1. Sender’s SQL has WHERE delivery_date <= NOW().
    2. The failed executions were inserted after their delivery_date (already in the past) → never returned by the query.
    3. Even if the record exists, the Sender also requires the topic to exist; the topic is created only when the Data-Preparer first flushes data. Under heavy load this can lag → the execution is briefly invisible to the Sender and gets skipped.
  • Quick mitigation agreed
    • Expand the SQL window by +1 hour (i.e. accept delivery_date <= NOW() + INTERVAL '1 hour').
    • Reduces but does not eliminate risk.

Structural Proposal: Event-Driven Signalling

  • Replace DB-polling with event signalling:
    • Data-Preparer, once it is certain the Kafka topic is created, publishes an “execution-ready” message to a new lightweight topic, e.g. exec.ready.
    • Sender consumes that topic and starts processing immediately.
  • Guarantees / design points
    • Needs exactly-once delivery (easy here: a single message per execution, low volume; Kafka transactions can be enabled).
    • Must ensure topic truly exists before emitting ready-message (check via ListTopics, or create-if-missing then confirm).
    • Concern about long-running consumers not committing offsets quickly: belief is that as long as heartbeats continue, Kafka is fine; note that max.poll.interval.ms also bounds the time between polls. (Precedent: the abandoned-cart consumer tolerates 45-min gaps.)
  • Benefits
    • Eliminates heavy DB polling (performance, scalability).
    • Removes race condition between record insertion and topic creation.
    • Clearer, event-driven flow; easier to reason about.
  • Challenges / To-dos
    • Cross-team design session (need Nidal, Batteries, etc.).
    • Refactor Data-Preparer / Sender; add transactional publish; add fallback when Sender is down for hours (spam-protection, TTL, etc.).
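
The ready-signal itself is tiny. A sketch of the Data-Preparer side, assuming the Confluent.Kafka client; only the topic name exec.ready comes from the discussion, and the key/payload shape is invented for illustration:

```csharp
using System.Threading.Tasks;
using Confluent.Kafka;

public static class ReadySignal
{
    // Called by the Data-Preparer once the exec_{id} topic is confirmed to exist.
    public static async Task PublishReadyAsync(
        IProducer<string, string> producer, long executionId)
    {
        await producer.ProduceAsync("exec.ready", new Message<string, string>
        {
            Key = executionId.ToString(),             // keyed so retries land in the same partition
            Value = $"{{\"executionId\":{executionId}}}" // minimal payload; Sender reads the rest from the DB
        });
    }
}
```

The Sender then subscribes to exec.ready and starts processing on receipt, instead of discovering executions via the polling query.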

Implementation Details Already Noted

  • Topic creation logic today lives in DataPreparer.Flush():
    • Call TryCreate(topicName, partitionCount, retention) on every flush.
  • Alternative spots considered:
    • Dispatcher could pre-create topics earlier, but increases coupling.
  • Any change must still allow Sender to pull extra metadata from DB (templates, lists), so DB access cannot be fully removed.
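
A sketch of what an idempotent TryCreate could look like with the Confluent.Kafka admin client. The signature mirrors the note above; the retention config and error handling are assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Confluent.Kafka;
using Confluent.Kafka.Admin;

public static class TopicHelper
{
    // Create-if-missing so Flush() can call this on every pass.
    public static async Task TryCreateAsync(
        IAdminClient admin, string topicName, int partitionCount, TimeSpan retention)
    {
        try
        {
            await admin.CreateTopicsAsync(new[]
            {
                new TopicSpecification
                {
                    Name = topicName,
                    NumPartitions = partitionCount,
                    Configs = new Dictionary<string, string>
                    {
                        ["retention.ms"] = ((long)retention.TotalMilliseconds).ToString()
                    }
                }
            });
        }
        catch (CreateTopicsException e)
            when (e.Results.All(r => r.Error.Code == ErrorCode.TopicAlreadyExists))
        {
            // Topic already exists: treat as success so repeated flushes stay idempotent.
        }
    }
}
```

The same helper would also serve the event-driven proposal: once this call returns without throwing, the topic is confirmed and the ready-message can safely be emitted.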

Interim Action Plan

  • [x] Add try–catch around Sender main loop + logging (DONE).
  • [x] Deploy change immediately.
  • [ ] Add 1-hour window to delivery_date SQL and redeploy (Maxim today).
  • [ ] Implement threshold-based crash/restart logic.
  • [ ] Code scan for unguarded Task.Run usage and patch.
  • [ ] Monitor logs tomorrow; on-call (Oscar) expects better visibility.
  • [ ] Schedule architecture workshop on event-driven signalling vs. DB polling.

Miscellaneous Notes

  • Netherlands & France clusters: topic-existence check is cluster-specific.
  • Nomad behaviour (continuous crash loop vs. back-off) still to be confirmed.
  • Vacation schedule: Maxim off tomorrow afternoon; can start the refactor but may hand over to Sergei.
  • Breakfast interruption caused a 10-minute audio drop; no technical content lost.