In this article we investigate the use of Dead Letter Topics (DLTs) in the context of Apache Kafka, although the points discussed are also relevant to other distributed messaging systems built on different brokers.
Distributed messaging systems are a great way to decouple your application into smaller processing units, or services. However, there are complications that need to be addressed, one of which is how to handle an error that prevents the successful processing of a message.
Development often focuses initially on the happy-path flow, which is a perfectly understandable and reasonable approach. But even when concentrating on the happy path, the question soon arises: “but what should we do if something goes wrong?”, to which someone usually replies, “Simple! We’ll just dead-letter it and replay it later.”
On the face of it this sounds ideal: it’s a simple thing to implement, and the developers can concentrate on getting the system working, pacifying their managers’ and stakeholders’ desire to meet deadlines and budgets.
Once a DLT has been implemented, whether it is the correct solution for failed messages rarely seems to get challenged. Specific error scenarios whose requirements are well understood will be handled explicitly, but the general catch-all error handling of dead-lettering the message will remain.
Dead-lettering can unblock the system by preventing a poison-pill situation, where all messages on a partition are blocked behind one that constantly fails. Beyond that, though, it doesn’t really fix anything: the failed message is still there, and what to do with it still needs to be determined. Dead-lettering failed messages and replaying them is easy to say but difficult to achieve in practice, because there are a number of hidden complexities that require additional coding effort and/or infrastructure, leading to additional costs and time that will eventually impact budgets and timescales.
So let’s consider some specifics of dead-lettering.
Firstly, dead-lettering a message requires additional infrastructure, which in turn needs maintenance and monitoring, increasing the burden and cost on the organisation. The additional infrastructure could range from a single dead-letter topic for the whole system to a dead-letter topic for every topic in the system. A single dead-letter topic holding all the dead-lettered messages would be tricky to work with, because having multiple message types on one topic makes determining the redelivery target more involved. Alternatively, each topic could have its own dead-letter topic configured, with all the overhead that entails in configuring, managing and monitoring those topics.
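To make the single-topic complication concrete, here is a minimal sketch of the kind of bookkeeping a shared DLT forces on you: the source topic has to travel with the message, otherwise a replay tool can’t work out where to send it back to. The topic name and header name here are invented for the example, not a standard.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class SharedDltRecords {

    // With one shared DLT for the whole system, the original topic must be
    // recorded on the message itself so a replayer can route it back later.
    public static ProducerRecord<String, String> toSharedDlt(ConsumerRecord<String, String> failed) {
        ProducerRecord<String, String> dltRecord =
                new ProducerRecord<>("shared.DLT", failed.key(), failed.value());
        dltRecord.headers().add("dlt.original.topic",
                failed.topic().getBytes(StandardCharsets.UTF_8));
        return dltRecord;
    }
}
```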
In Apache Kafka, dead-lettering a message is not a function of the broker; it is a function of the consumer of the topic. The problem here is that there could be multiple consumers of the topic, with only one of them having failed to process the original message. Before the message can be replayed, therefore, every consumer that will see it again must be protected against duplicate message processing, which means keeping a record of which messages have already been processed. This is good practice anyway, as there is always the possibility of consuming duplicate messages, but if it hasn’t already been considered then it means additional development effort.
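A minimal sketch of that consumer-side responsibility is shown below: the consumer, not the broker, writes the failed message to the DLT, and an idempotency guard makes later replays safe. The topic names, the process method and the key-based duplicate check (which assumes every record carries a non-null business key, and which in production would be a durable store rather than an in-memory set) are all assumptions for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeadLetteringConsumer {

    // Illustrative only: a real system would track processed messages in a
    // durable store (database table, compacted topic), not in memory.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    private final KafkaConsumer<String, String> consumer;
    private final KafkaProducer<String, String> producer;

    public DeadLetteringConsumer(KafkaConsumer<String, String> consumer,
                                 KafkaProducer<String, String> producer) {
        this.consumer = consumer;
        this.producer = producer;
    }

    public void run() {
        consumer.subscribe(List.of("orders"));
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                // Idempotency guard: skip anything already processed, which is
                // also what makes a later replay of this message safe.
                if (!processedIds.add(record.key())) {
                    continue;
                }
                try {
                    process(record); // business logic (assumed)
                } catch (Exception e) {
                    // The broker knows nothing about dead-lettering: it is this
                    // consumer that writes the failed message to the DLT.
                    producer.send(new ProducerRecord<>("orders.DLT", record.key(), record.value()));
                }
            }
            consumer.commitSync();
        }
    }

    private void process(ConsumerRecord<String, String> record) { /* ... */ }
}
```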
As distributed messaging systems grow and become more complex over time, it becomes harder to know for certain what the impact of replaying a message is. Do we still know which systems are consuming messages from the topic? Are we still certain that all consumers have idempotent message processing in place? Are we certain that no time or data constraints have been broken: has too much time elapsed, or has the data moved on? Without knowing for certain what the impact of replaying a message is, replaying it puts the system at significant risk.
Before a message can be replayed, the underlying problem needs to be fully understood and resolved in a way that will allow the message to succeed when replayed. It’s imperative to understand what went wrong and why, and in message-driven systems this is not always straightforward: determining why something failed can depend on the state of the system at that moment in time. Fixing the failure could mean deploying a code fix or repairing the data before moving the message back to the originating topic.
If message processing is failing because a downstream system is unavailable, then dead-lettering the message to unblock the topic is a pointless exercise, as the next message on the topic will hit the same problem. This could result in a flood of messages on the DLT, whereas leaving them on the original topic and retrying until the downstream system becomes available is a better solution. Dead-lettering is for those exceptional situations where there is a bug in the system, a message is broken, or the state of the system prevents the message from ever being processed without some intervention. The error scenario will require investigation, remediation and ultimately resolution.
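One way to express that distinction in code is to classify failures before deciding what to do with the record. The sketch below is illustrative, not prescriptive: the exception type and the downstream call are invented, and the transient case simply rewinds the consumer to the failed offset so the record is retried on the next poll rather than flooding the DLT.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class FailureClassifier {

    /** Hypothetical marker for “downstream is temporarily unavailable”. */
    static class TransientDownstreamException extends RuntimeException {}

    void handle(KafkaConsumer<String, String> consumer,
                KafkaProducer<String, String> producer,
                ConsumerRecord<String, String> record) {
        try {
            callDownstream(record); // assumed business call
        } catch (TransientDownstreamException e) {
            // Transient failure: every following message would fail too, so keep
            // the record on the topic by seeking back to its offset and retrying.
            consumer.seek(new TopicPartition(record.topic(), record.partition()), record.offset());
        } catch (Exception e) {
            // Exceptional failure (bug, broken message): dead-letter it so a
            // human can investigate, remediate and resolve.
            producer.send(new ProducerRecord<>(record.topic() + ".DLT", record.key(), record.value()));
        }
    }

    private void callDownstream(ConsumerRecord<String, String> record) { /* ... */ }
}
```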
If message ordering is important in your system, then dead-lettering a message will break the ordering guarantee Kafka offers, because the dead-lettered message is written to a different topic from its associated messages. In Kafka, to enforce the ordering of associated messages, those messages must be written to the same partition. Consideration must therefore be given to the importance of message ordering in the system and how to handle out-of-order messages. This is true regardless of whether a DLT is used to handle processing failures where order is important, but it is called out here to highlight that “we’ll just dead-letter it” doesn’t actually offer a straightforward solution where message order is required, and that it’s important to work through the use cases early to understand the impact on timescales, resourcing and budgets.
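A short sketch of why the guarantee breaks, using invented topic names, keys and event values: Kafka only orders records within a partition, and records sharing a key are hashed to the same partition, so moving one of the pair to a different topic removes that guarantee.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderingExample {

    // Records that share a key land on the same partition, which is the only
    // ordering guarantee Kafka provides.
    static void emitOrderEvents(KafkaProducer<String, String> producer) {
        producer.send(new ProducerRecord<>("orders", "customer-42", "order-created"));
        producer.send(new ProducerRecord<>("orders", "customer-42", "order-paid"));
        // If "order-created" fails and is moved to "orders.DLT", it now lives on
        // a different topic (and partition) from "order-paid", so nothing
        // guarantees it will be seen first when replayed.
    }
}
```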
Another consideration when using a DLT is message retention. A DLT is just another Kafka topic, and as such its messages have a retention period after which they expire, so a suitable retention period is required to ensure there is sufficient time for a dead-lettered message to be investigated and processed before it expires. Longer retention periods can lead to increased resource costs.
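For instance, a DLT could be created with a longer retention than the broker default (typically seven days). The sketch below uses the Kafka AdminClient; the broker address, topic name, partition and replica counts, and the thirty-day figure are all assumptions for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class DltTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Thirty days of retention: long enough to investigate and act on a
            // dead-lettered message before it expires, at the cost of storage.
            NewTopic dlt = new NewTopic("orders.DLT", 1, (short) 3)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG,
                                    String.valueOf(30L * 24 * 60 * 60 * 1000)));
            admin.createTopics(List.of(dlt)).all().get();
        }
    }
}
```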
What if we just write failed messages to a DLT to record them for later diagnosis, with no intention of redelivering them? If the message is not being replayed, business processes will be required to get the system back to the state it would be in had the message been successfully processed. If new business processes are required for this, they should be fully secured and audited, and additional coding time will be needed to implement them. When writing a failed message to the DLT, it’s good practice to include some metadata about the failure (e.g. error code, message and state information), recorded as additional message headers. This metadata can be used to aid diagnosis of the problem, and the messages in the DLT can then be interrogated to triage the issue and work on a fix. Additional functionality may be needed to monitor the DLT and to allow the messages and headers to be viewed to ascertain the best way to fix the issue, which may mean additional coding effort to surface the details of the failed message.
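A minimal sketch of attaching that failure metadata as headers is shown below. The header names are illustrative rather than any standard; the important thing is to pick a convention and apply it consistently across services.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class DltHeaders {

    // Wraps a failed record for the DLT, carrying enough diagnostic context
    // (exception details, original offset, failure time) to triage it later.
    public static ProducerRecord<String, String> withFailureMetadata(
            ConsumerRecord<String, String> failed, Exception cause) {
        ProducerRecord<String, String> dlt =
                new ProducerRecord<>(failed.topic() + ".DLT", failed.key(), failed.value());
        dlt.headers()
           .add("dlt.exception.class",
                cause.getClass().getName().getBytes(StandardCharsets.UTF_8))
           .add("dlt.exception.message",
                String.valueOf(cause.getMessage()).getBytes(StandardCharsets.UTF_8))
           .add("dlt.original.offset",
                String.valueOf(failed.offset()).getBytes(StandardCharsets.UTF_8))
           .add("dlt.failed.at",
                String.valueOf(System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8));
        return dlt;
    }
}
```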
If we’re only using the DLT to store the failed message, with no intention of replaying it, then a simple log entry detailing the failed message would achieve the same goal. Utilising standard log monitoring / observability products to surface the details of a message failure would eliminate the additional coding effort required for surfacing DLT entries. Additional business processes would still be required to get the system into the correct state, but this approach would reduce the overhead of implementing, maintaining and monitoring DLTs.
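As a rough illustration, the same diagnostic detail the DLT record above carries can be emitted through whatever logging stack the organisation already runs. SLF4J is assumed here purely as a common choice of logging facade.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class FailureLogging {

    private static final Logger log = LoggerFactory.getLogger(FailureLogging.class);

    // Records the failed message's coordinates and cause, leaving alerting and
    // searchability to existing log monitoring / observability tooling.
    public static void logFailure(ConsumerRecord<String, String> failed, Exception cause) {
        log.error("Message processing failed: topic={} partition={} offset={} key={}",
                failed.topic(), failed.partition(), failed.offset(), failed.key(), cause);
    }
}
```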
In this article we have explored the implications of using dead letter topics for failed messages, and pointed out some of the potential issues and risks of replaying dead-lettered messages. Simply implementing a DLT doesn’t eliminate the need to give due diligence and thought to how those dead-lettered messages will subsequently be used and processed by the business, and to what that means in additional development and infrastructure in terms of resourcing, budgeting and timescales.
“We’ll just dead-letter it” is not a one-size-fits-all solution to handling message processing failures. Handling error scenarios is not a coding exercise; it’s about business processes and what is required to get the system back into the correct state to resume processing.