Have you ever come across "resilience" in a system design book? It is shorthand for asking "What happens to our system when things go wrong?", and it is a good question because, I promise you, things will go wrong.

The project I am referencing: Hospital Information System

The Problem with Event-Driven Systems

If you are working with an Event-Driven Design (EDD), you have to make sure every event is processed successfully. But what if it is not? What if your consumer crashes mid-processing, or the database is temporarily locked?

In a naive implementation, you have two bad options:

  • Retry forever and block the entire pipeline.
  • Drop the message and pretend it never happened (it is not a crime unless it is deemed a crime, right?).

In a production-grade system, neither is acceptable. A dropped lab result or a missed billing event is not just a bug; it is a liability.

This is exactly the problem that Dead Letter Queues (DLQ) solve.

What is a Dead Letter Queue?

A DLQ is a separate Kafka topic that acts as a "safety net" (or a bunker). When a message fails to be processed after a defined number of retries, instead of crashing your consumer or silently discarding the message, you route it to this special topic. The message is not lost; it is parked, waiting for someone to handle it.

How I Used It

In my project, the billing-service listens to multiple Kafka topics simultaneously (bed charges, lab orders, inventory consumption, and patient discharges). Any of these can fail for various reasons: a transient database lock, a malformed payload, or a downstream service being temporarily unavailable.

Without a DLQ, a single bad PATIENT_DISCHARGED event could poison the entire consumer group and freeze billing for everyone.
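For context, the consumer side looks roughly like this. The topic names, payload type, and the billingProcessor helper are illustrative, not the project's actual code:

```java
// Hypothetical listener in billing-service; topic names are illustrative.
@KafkaListener(
        topics = {"bed-charges", "lab-orders", "inventory-consumption", "patient-discharges"},
        groupId = "billing-service")
public void onBillingEvent(ConsumerRecord<String, Object> record) {
    // All four topics share one consumer group, so a poison message on any of
    // them can stall processing for the whole group.
    billingProcessor.process(record); // billingProcessor: hypothetical domain service
}
```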

The Strategy: Retry-then-DLT

The strategy I adopted is called "Retry-then-DLT" (Dead Letter Topic). The flow is straightforward:

1. Attempt

Attempt to process the message normally.

2. Retry

On failure, wait and retry up to 3 times with exponential backoff.

3. Route to DLT

If all retries are exhausted, route the message to a *.DLT topic.

4. Alert & Monitor

Alert or monitor the DLT topic for manual intervention.
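Stripped of the Kafka specifics, the four steps above boil down to a small retry loop. Here is a framework-free sketch where a plain list stands in for the *.DLT topic; the names are mine, not the project's:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class RetryThenDlt {
    private final int maxRetries;
    private final long initialBackOffMs;
    final List<String> deadLetters = new ArrayList<>(); // stands in for the *.DLT topic

    RetryThenDlt(int maxRetries, long initialBackOffMs) {
        this.maxRetries = maxRetries;
        this.initialBackOffMs = initialBackOffMs;
    }

    // Attempt, retry with exponential backoff, then park the message
    // instead of dropping it or blocking forever.
    void handle(String message, Consumer<String> processor) throws InterruptedException {
        long backOffMs = initialBackOffMs;
        for (int attempt = 0; ; attempt++) {
            try {
                processor.accept(message);
                return; // processed successfully
            } catch (RuntimeException e) {
                if (attempt == maxRetries) {
                    deadLetters.add(message); // retries exhausted: route to DLT
                    return;
                }
                Thread.sleep(backOffMs); // wait before retrying
                backOffMs *= 2;          // exponential backoff
            }
        }
    }
}
```

With maxRetries = 3 this gives one initial attempt plus three retries, matching the configuration below.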

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> kafkaTemplate) {
    // Route to *.DLT topic after all retries are exhausted
    DeadLetterPublishingRecoverer recoverer = 
        new DeadLetterPublishingRecoverer(kafkaTemplate);

    // Exponential backoff: 1s -> 2s -> 4s (3 retries, 4 delivery attempts in total)
    ExponentialBackOffWithMaxRetries backOff = 
        new ExponentialBackOffWithMaxRetries(3);
    backOff.setInitialInterval(1_000);
    backOff.setMultiplier(2.0);

    return new DefaultErrorHandler(recoverer, backOff);
}
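If you are not relying on Spring Boot's auto-configuration to pick the handler up, it can be attached to the listener container factory explicitly. A minimal wiring sketch, assuming Spring Kafka 2.8+ (bean and type names follow the usual Spring Kafka conventions):

```java
@Bean
public ConcurrentKafkaListenerContainerFactory<String, Object> kafkaListenerContainerFactory(
        ConsumerFactory<String, Object> consumerFactory,
        DefaultErrorHandler errorHandler) {
    ConcurrentKafkaListenerContainerFactory<String, Object> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    factory.setCommonErrorHandler(errorHandler); // retries, then routes to *.DLT
    return factory;
}
```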

The Trade-offs

As with everything in software engineering, using a DLQ introduces some trade-offs:

Operational Overhead

A DLQ is useless if nobody monitors it. You now have the operational burden of setting up alerts for the DLT, inspecting the failed messages, fixing the underlying issue, and manually replaying the events.
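As a starting point for that monitoring, a listener on the DLT can at least log each failure together with the exception that caused it. A sketch, assuming Spring Kafka's default ".DLT" naming and its DLT exception headers; the topic and group names are illustrative:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.KafkaHeaders;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.stereotype.Component;

@Component
class DltMonitor {

    private static final Logger log = LoggerFactory.getLogger(DltMonitor.class);

    // Topic name is illustrative; DeadLetterPublishingRecoverer appends ".DLT"
    // to the original topic name by default.
    @KafkaListener(topics = "patient-discharges.DLT", groupId = "billing-dlt-monitor")
    public void onDeadLetter(ConsumerRecord<String, Object> record,
                             @Header(name = KafkaHeaders.DLT_EXCEPTION_MESSAGE, required = false) String error) {
        log.error("Dead letter from {} (partition {}, offset {}): {}",
                record.topic(), record.partition(), record.offset(), error);
        // Swap the log call for paging/alerting in real operations.
    }
}
```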

Loss of Strict Ordering

If a message fails and is sent to the DLQ while the next message in the same partition succeeds, strict ordering within that partition is broken. If your domain logic requires strictly sequential processing, you will need a different approach, such as stopping the consumer on failure instead of skipping past the bad record.
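For that case, Spring Kafka ships a stop-the-world alternative; a sketch assuming Spring Kafka 2.8+, trading liveness for ordering:

```java
@Bean
public CommonContainerStoppingErrorHandler orderingPreservingErrorHandler() {
    // Stops the whole listener container on failure instead of routing past
    // the bad record, so no later message is processed out of order.
    return new CommonContainerStoppingErrorHandler();
}
```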

And there we go! Now your events know what to do if anything goes wrong.