Sender-Service Failure, Exception-Handling Hot-Fix & Event-Driven Redesign

Hosted Service / Task.Run Error Handling

  • Current structure
    • SenderService is implemented as a Hosted Service.
    • Inside ExecuteAsync the service launches the “main loop” with Task.Run.
    • The loop is a while (true) that:
      • Queries the DB for executable rows.
      • Filters by cluster (two Kafka clusters: Netherlands, France).
      • Checks whether the associated Kafka topic already exists in the local cluster.
      • Sends / processes messages accordingly.
  • Original bug
    • No surrounding try–catch in the Task body.
    • Any exception (DB outage, Kafka error, business-logic bug) bubbles out → the Task faults silently.
    • ❌ No logging, ❌ no restart, ❌ no Nomad alert → all Dutch instances died and stopped processing.
    • Root cause still unknown because nothing was logged.
  • Hot-fix
    • Wrapped the whole loop with try–catch.
    • On catch:
    • Log the exception (to standard logger + centralised logs).
    • Continue the loop so the service stays alive.
  • Future-proofing ideas
    • Introduce a threshold counter. After N = 5 consecutive exceptions → deliberately throw so that the container crashes and Nomad restarts it.
    • Re-examine all usages of Task.Run across the code base (Execution Initialiser, Dispatcher, …) to ensure identical protection.
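
The hot-fix plus the threshold idea could look roughly like the sketch below. This is illustrative only: ProcessPendingExecutionsAsync, the logger field, and the constant name are placeholders, not the real SenderService code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Sketch of the guarded main loop; helper names are placeholders.
public sealed class SenderService : BackgroundService
{
    private const int MaxConsecutiveFailures = 5;
    private readonly ILogger<SenderService> _logger;

    public SenderService(ILogger<SenderService> logger) => _logger = logger;

    protected override Task ExecuteAsync(CancellationToken stoppingToken) =>
        Task.Run(async () =>
        {
            var consecutiveFailures = 0;
            while (!stoppingToken.IsCancellationRequested)
            {
                try
                {
                    await ProcessPendingExecutionsAsync(stoppingToken);
                    consecutiveFailures = 0; // a healthy pass resets the counter
                }
                catch (Exception ex)
                {
                    consecutiveFailures++;
                    _logger.LogError(ex, "Sender loop failed ({Count} consecutive)",
                        consecutiveFailures);

                    if (consecutiveFailures >= MaxConsecutiveFailures)
                        throw; // fault the task so the host stops and Nomad restarts the container
                }
            }
        }, stoppingToken);

    // Placeholder for the real query/filter/send pass described above.
    private Task ProcessPendingExecutionsAsync(CancellationToken ct) => Task.CompletedTask;
}
```

Note that in .NET 6+ the default BackgroundServiceExceptionBehavior is StopHost, so the rethrow after the fifth failure actually takes the process down, which is exactly the fail-fast behaviour wanted here.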

Restart / Nomad Policy

  • General company practice: "fail-fast & let the orchestrator restart".
  • Open questions
    • How many consecutive crashes before Nomad gives up? (Unclear / needs confirmation.)
    • Would repeated restarts hide real problems or overload on-call? Balance with the threshold approach.
  • Possible implementation
    • Global utility wrapper e.g. RunWithRestartPolicy(Func<Task>) used everywhere instead of raw Task.Run.
    • Emits a Slack alert every time it enters the catch; after ≥ 5 consecutive failures, calls Environment.Exit(1).
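
A possible shape for that wrapper, as a sketch: RunWithRestartPolicy is the name from the note above, but the signature, the SlackNotifier helper, and the defaults are all assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// Hypothetical shared wrapper to replace raw Task.Run calls everywhere.
public static class GuardedTasks
{
    public static Task RunWithRestartPolicy(
        Func<CancellationToken, Task> body,
        ILogger logger,
        int maxConsecutiveFailures = 5,
        CancellationToken token = default) =>
        Task.Run(async () =>
        {
            var failures = 0;
            while (!token.IsCancellationRequested)
            {
                try
                {
                    await body(token);
                    failures = 0;
                }
                catch (Exception ex)
                {
                    failures++;
                    logger.LogError(ex, "Guarded task failed ({Failures} consecutive)", failures);
                    await SlackNotifier.SendAsync($"Guarded task failed: {ex.Message}"); // placeholder alerting client

                    if (failures >= maxConsecutiveFailures)
                        Environment.Exit(1); // fail fast; let Nomad restart the allocation
                }
            }
        }, token);
}
```

Callers would then write `GuardedTasks.RunWithRestartPolicy(ct => MainLoopAsync(ct), logger)` instead of `Task.Run(MainLoopAsync)`, so the policy lives in one place.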

Execution Processing Pipeline (Today)

  1. Dispatcher
    • Detects eligible executions after preparation_time.
    • Publishes an execution-dispatch message.
  2. Data Preparer
    • Receives dispatch.
    • Flushes data rows into per-execution Kafka topic exec_{id} (creates the topic on first flush).
    • For some types (Message-Center) finishes the job itself without involving Sender.
  3. Sender
    • Performs periodic DB polling:
    1. Selects executions whose delivery_date <= NOW().
    2. Ensures the per-execution topic exists in its cluster.
    3. Starts consuming / sending.
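
In code, the Sender's polling pass amounts to roughly the following. This is a sketch only: the schema, the Dapper-style QueryAsync call, and the helper names TopicExistsAsync / ConsumeAndSendAsync are assumptions, not the real implementation.

```csharp
// Illustrative Sender polling pass; schema and helper names are assumed.
var dueExecutions = await db.QueryAsync<Execution>(
    "SELECT * FROM executions WHERE delivery_date <= NOW() AND cluster = @Cluster",
    new { Cluster = localCluster });

foreach (var execution in dueExecutions)
{
    var topic = $"exec_{execution.Id}";

    // Cluster-specific metadata lookup; the topic only exists once the
    // Data-Preparer has done its first flush for this execution.
    if (!await TopicExistsAsync(topic))
        continue; // not ready yet; picked up again on the next poll

    await ConsumeAndSendAsync(topic, execution);
}
```

The `continue` branch is the race condition discussed below: if the topic check keeps failing (or the row never matches the WHERE clause), the execution is silently skipped.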

Second Incident: Missed Executions (Yesterday)

  • Symptom: executions were skipped, no messages sent.
  • Deep dive
    1. Sender’s SQL has WHERE delivery_date <= NOW().
    2. The failed executions were inserted after their delivery_date (already in the past) → never returned by the query.
    3. Even if the record exists, the Sender also requires the topic to exist; the topic is created only when the Data-Preparer first flushes data. Under heavy load this can lag → the execution is briefly invisible to the Sender and gets skipped.
  • Quick mitigation agreed
    • Expand the SQL window by +1 hour (i.e. accept delivery_date <= NOW() + INTERVAL '1 hour').
    • Reduces but does not eliminate risk.

Structural Proposal: Event-Driven Signalling

  • Replace DB-polling with event signalling:
    • Data-Preparer, once it is certain the Kafka topic is created, publishes an “execution-ready” message to a new lightweight topic, e.g. exec.ready.
    • Sender consumes that topic and starts processing immediately.
  • Guarantees / design points
    • Needs exactly-once delivery (easy here: a single message per execution, low volume; Kafka transactions can be enabled).
    • Must ensure topic truly exists before emitting ready-message (check via ListTopics, or create-if-missing then confirm).
    • Concern about long-running consumers not committing offsets quickly: belief is that as long as heartbeats continue, Kafka is fine; note that max.poll.interval.ms also bounds the time between polls. (Precedent: the abandoned-cart consumer tolerates 45-min gaps.)
  • Benefits
    • Eliminates heavy DB polling (performance, scalability).
    • Removes race condition between record insertion and topic creation.
    • Clearer, event-driven flow; easier to reason about.
  • Challenges / To-dos
    • Cross-team design session (need Nidal, Batteries, etc.).
    • Refactor Data-Preparer / Sender; add transactional publish; add fallback when Sender is down for hours (spam-protection, TTL, etc.).
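
The ready-signal itself is tiny. A sketch of the Data-Preparer side, assuming the Confluent.Kafka client; only the topic name exec.ready comes from the discussion, and the key/payload shape is invented for illustration:

```csharp
using System.Threading.Tasks;
using Confluent.Kafka;

public static class ReadySignal
{
    // Called by the Data-Preparer once the exec_{id} topic is confirmed to exist.
    public static async Task PublishReadyAsync(
        IProducer<string, string> producer, long executionId)
    {
        await producer.ProduceAsync("exec.ready", new Message<string, string>
        {
            Key = executionId.ToString(),             // keyed so retries land in the same partition
            Value = $"{{\"executionId\":{executionId}}}" // minimal payload; Sender reads the rest from the DB
        });
    }
}
```

The Sender then subscribes to exec.ready and starts processing on receipt, instead of discovering executions via the polling query.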

Implementation Details Already Noted

  • Topic creation logic today lives in DataPreparer.Flush():
    • Call TryCreate(topicName, partitionCount, retention) on every flush.
  • Alternative spots considered:
    • Dispatcher could pre-create topics earlier, but increases coupling.
  • Any change must still allow Sender to pull extra metadata from DB (templates, lists), so DB access cannot be fully removed.
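
A sketch of what an idempotent TryCreate could look like with the Confluent.Kafka admin client. The signature mirrors the note above; the retention config and error handling are assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Confluent.Kafka;
using Confluent.Kafka.Admin;

public static class TopicHelper
{
    // Create-if-missing so Flush() can call this on every pass.
    public static async Task TryCreateAsync(
        IAdminClient admin, string topicName, int partitionCount, TimeSpan retention)
    {
        try
        {
            await admin.CreateTopicsAsync(new[]
            {
                new TopicSpecification
                {
                    Name = topicName,
                    NumPartitions = partitionCount,
                    Configs = new Dictionary<string, string>
                    {
                        ["retention.ms"] = ((long)retention.TotalMilliseconds).ToString()
                    }
                }
            });
        }
        catch (CreateTopicsException e)
            when (e.Results.All(r => r.Error.Code == ErrorCode.TopicAlreadyExists))
        {
            // Topic already exists: treat as success so repeated flushes stay idempotent.
        }
    }
}
```

The same helper would also serve the event-driven proposal: once this call returns without throwing, the topic is confirmed and the ready-message can safely be emitted.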

Interim Action Plan

  • [x] Add try–catch around Sender main loop + logging (DONE).
  • [x] Deploy change immediately.
  • [ ] Add 1-hour window to delivery_date SQL and redeploy (Maxim today).
  • [ ] Implement threshold-based crash/restart logic.
  • [ ] Code scan for unguarded Task.Run usage and patch.
  • [ ] Monitor logs tomorrow; on-call (Oscar) expects better visibility.
  • [ ] Schedule architecture workshop on event-driven signalling vs. DB polling.

Miscellaneous Notes

  • Netherlands & France clusters: topic-existence check is cluster-specific.
  • Nomad behaviour (continuous crash loop vs. back-off) still to be confirmed.
  • Vacation schedule: Maxim off tomorrow afternoon; can start the refactor but may hand over to Sergei.
  • Breakfast interruption caused a 10-minute audio drop; no technical content lost.