Have you ever come across "resilience" in a system design book? It is shorthand for "What happens to our system when things go wrong?", and it is a good question, because I promise you: things will go wrong.
The project I am referencing: Hospital Information System
The Problem with Event-Driven Systems
If you are working with an Event-Driven Design (EDD), you have to make sure that every event is eventually processed. But what if it isn't? What if your consumer crashes mid-processing, or the database is temporarily locked?
In a naive implementation, you have two bad options:
- Retry forever and block the entire pipeline.
- Drop the message and pretend it never happened (it is not a crime unless it is deemed a crime, right?).
This is exactly the problem that Dead Letter Queues (DLQ) solve.
What is a Dead Letter Queue?
In Kafka, a DLQ is a separate topic that acts as a "safety net" (or a bunker). When a message still fails after a defined number of retries, you route it to this special topic instead of crashing your consumer or silently discarding it. The message is not lost; it is parked, waiting for someone to handle it.
How I Used It
In my project, the billing-service listens to multiple Kafka topics simultaneously (bed charges, lab orders, inventory consumption, and patient discharges). Any of these can fail for various reasons: a transient database lock, a malformed payload, or a downstream service being temporarily unavailable.
Without a DLQ, a single bad PATIENT_DISCHARGED event could act as a poison pill: the consumer keeps failing on it, and every event queued behind it in that partition stops being billed.
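The failure causes listed above do not all benefit from a retry: a database lock or an unavailable service may clear on its own, while a malformed payload will fail the same way every time. Here is a minimal plain-Java sketch of that distinction. The names FailureClassifier and MalformedPayloadException are hypothetical and not taken from the project.

```java
import java.util.Set;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: decide whether a failure is worth retrying.
final class FailureClassifier {

    // Hypothetical exception for a payload that cannot be parsed.
    static final class MalformedPayloadException extends RuntimeException {
        MalformedPayloadException(String message) {
            super(message);
        }
    }

    // Failures that will not go away on their own; retrying them only wastes time.
    private static final Set<Class<? extends Exception>> NOT_RETRYABLE =
            Set.of(MalformedPayloadException.class, IllegalArgumentException.class);

    // True if another attempt might succeed (for example after a lock or a timeout).
    static boolean isRetryable(Exception failure) {
        return NOT_RETRYABLE.stream().noneMatch(type -> type.isInstance(failure));
    }

    public static void main(String[] args) {
        System.out.println(isRetryable(new TimeoutException("database lock")));
        System.out.println(isRetryable(new MalformedPayloadException("missing patientId")));
    }
}
```

If the consumer is built on Spring Kafka's DefaultErrorHandler, its addNotRetryableExceptions(...) method serves the same purpose: listed exception types skip the retries and go straight to the recoverer.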
The Strategy: Retry-then-DLT
The strategy I adopted is called "Retry-then-DLT" (Dead Letter Topic). The flow is straightforward:
1. Attempt
Attempt to process the message normally.
2. Retry
On failure, wait and retry up to 3 times with exponential backoff.
3. Route to DLT
If all retries are exhausted, route the message to a *.DLT topic.
4. Alert & Monitor
Alert or monitor the DLT topic for manual intervention.
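With Spring Kafka, steps 1 to 3 can be configured in a single error handler bean:
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.kafka.support.ExponentialBackOffWithMaxRetries;

@Configuration
public class KafkaErrorHandlingConfig { // class name is illustrative
    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> kafkaTemplate) {
        // Once all retries are exhausted, publish the record to <original-topic>.DLT
        DeadLetterPublishingRecoverer recoverer =
                new DeadLetterPublishingRecoverer(kafkaTemplate);
        // Exponential backoff: 3 retries after the first attempt, waiting 1s -> 2s -> 4s
        ExponentialBackOffWithMaxRetries backOff =
                new ExponentialBackOffWithMaxRetries(3);
        backOff.setInitialInterval(1_000);
        backOff.setMultiplier(2.0);
        return new DefaultErrorHandler(recoverer, backOff);
    }
}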
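To make the handler's behaviour concrete, here is a plain-Java sketch of the same retry-then-DLT flow with no Kafka involved. It is an illustration under assumptions, not what Spring Kafka does internally: RetryThenDlt, deadLetters, and the sleeper parameter are hypothetical names, and the dead letter topic is modelled as an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongConsumer;

// Hypothetical in-memory model of retry-then-DLT; not Spring Kafka's implementation.
final class RetryThenDlt {
    private final int maxRetries;
    private final long initialIntervalMs;
    private final double multiplier;
    private final LongConsumer sleeper; // injected so a demo or test need not really wait
    final List<String> deadLetters = new ArrayList<>(); // stands in for the *.DLT topic

    RetryThenDlt(int maxRetries, long initialIntervalMs, double multiplier, LongConsumer sleeper) {
        this.maxRetries = maxRetries;
        this.initialIntervalMs = initialIntervalMs;
        this.multiplier = multiplier;
        this.sleeper = sleeper;
    }

    // Returns true if the message was processed, false if it was parked.
    boolean handle(String message, Consumer<String> processor) {
        long delayMs = initialIntervalMs;
        // One initial attempt plus up to maxRetries retries.
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                processor.accept(message);
                return true;
            } catch (RuntimeException e) {
                if (attempt == maxRetries) {
                    break; // retries exhausted
                }
                sleeper.accept(delayMs);                 // wait before the next attempt
                delayMs = (long) (delayMs * multiplier); // grow the wait for the next failure
            }
        }
        deadLetters.add(message); // park the message instead of dropping it or blocking
        return false;
    }

    public static void main(String[] args) {
        List<Long> waits = new ArrayList<>();
        RetryThenDlt handler = new RetryThenDlt(3, 1_000, 2.0, ms -> waits.add(ms));
        handler.handle("PATIENT_DISCHARGED#42", msg -> {
            throw new IllegalStateException("database locked");
        });
        System.out.println("waits=" + waits + " parked=" + handler.deadLetters);
    }
}
```

With 3 retries, an always-failing message is delivered four times in total (the first attempt plus three retries) before it is parked.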
The Trade-offs
As with everything in software engineering, using a DLQ introduces some trade-offs:
Operational Overhead
A DLQ is useless if nobody monitors it. You now have the operational burden of setting up alerts for the DLT, inspecting the failed messages, fixing the underlying issue, and manually replaying the events.
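The replay step can be as simple as reading each parked message and sending it back to the topic it came from. Below is a minimal sketch, assuming the naming convention from the strategy above (original topic name plus a .DLT suffix). DltReplayer, ParkedMessage, the publisher callback, and the topic name lab-orders are hypothetical and stand in for a real Kafka consumer and producer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical replay helper; a real one would read from and write to Kafka.
final class DltReplayer {
    static final String DLT_SUFFIX = ".DLT";

    // A message parked on a dead letter topic.
    static final class ParkedMessage {
        final String dltTopic;
        final String payload;

        ParkedMessage(String dltTopic, String payload) {
            this.dltTopic = dltTopic;
            this.payload = payload;
        }
    }

    // Strips the suffix to recover the topic the message originally came from.
    static String originalTopic(String dltTopic) {
        if (!dltTopic.endsWith(DLT_SUFFIX)) {
            throw new IllegalArgumentException("Not a dead letter topic: " + dltTopic);
        }
        return dltTopic.substring(0, dltTopic.length() - DLT_SUFFIX.length());
    }

    // Publishes every parked message back to its original topic; returns the number replayed.
    static int replay(List<ParkedMessage> parked, BiConsumer<String, String> publisher) {
        int replayed = 0;
        for (ParkedMessage message : parked) {
            publisher.accept(originalTopic(message.dltTopic), message.payload);
            replayed++;
        }
        return replayed;
    }

    public static void main(String[] args) {
        List<ParkedMessage> parked = new ArrayList<>();
        parked.add(new ParkedMessage("lab-orders.DLT", "{\"orderId\": 7}"));
        int count = replay(parked, (topic, payload) ->
                System.out.println("replaying to " + topic + ": " + payload));
        System.out.println("replayed " + count + " message(s)");
    }
}
```

Replaying only makes sense after the underlying issue is fixed; otherwise the same messages fail again and land right back on the DLT.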
Loss of Strict Ordering
If a message fails and is sent to the DLQ, but the next message in the partition succeeds, strict message ordering is broken: the later message has been applied while the earlier one is still parked. If your domain logic requires strict sequential processing, you will need a more complex approach, such as halting the partition until the failed message is resolved.
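The ordering issue is easy to reproduce with a toy model. The sketch below processes one partition's messages under two policies: park-and-continue (what a DLT does) and halt-on-failure. Everything in it (OrderingDemo, the policy names, the sample events) is hypothetical and only illustrates the trade-off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical toy model of one partition; no Kafka involved.
final class OrderingDemo {
    enum Policy { PARK_AND_CONTINUE, HALT_ON_FAILURE }

    // Returns the messages that were applied, in the order they were applied.
    static List<String> run(List<String> partition, Predicate<String> succeeds,
                            Policy policy, List<String> deadLetters) {
        List<String> applied = new ArrayList<>();
        for (String message : partition) {
            if (succeeds.test(message)) {
                applied.add(message);
            } else if (policy == Policy.PARK_AND_CONTINUE) {
                deadLetters.add(message); // later messages now overtake this one
            } else {
                break; // order is preserved, but nothing after the failure is processed
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        List<String> partition = List.of("admit", "charge-bed", "discharge");
        Predicate<String> succeeds = message -> !message.equals("charge-bed");

        List<String> parked = new ArrayList<>();
        System.out.println("park-and-continue applied: "
                + run(partition, succeeds, Policy.PARK_AND_CONTINUE, parked)
                + ", parked: " + parked);
        System.out.println("halt-on-failure applied: "
                + run(partition, succeeds, Policy.HALT_ON_FAILURE, new ArrayList<>()));
    }
}
```

In the park-and-continue run, "discharge" is applied even though "charge-bed" never was, which is exactly the gap a billing flow has to tolerate or guard against. The halt-on-failure run keeps the order intact at the cost of stalling everything behind the failed message.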
And there we go! Now your events know what to do if anything goes wrong.