Kafka Poison Pill

Lydtech

Introduction

In Apache Kafka, a Poison Pill is defined as:

      “a record that has been produced to a Kafka topic and always fails when consumed, no matter how many times it is attempted.” - Confluent.io

Poison Pill scenarios are often overlooked and can easily occur if they are not considered up front. Failing to address them can have devastating consequences for the smooth running of an event-driven system.

Typical causes of a Poison Pill

Incompatible Producers and Consumers

A common cause of a poison pill is a producer and consumer configured with incompatible serializers and deserializers. The producer serializes an object and sends the data to the topic, but the consumer is unable to deserialize the message. With Spring this used to be a real issue, as the error occurred before poll() returned, making it difficult to handle. However, the introduction of the ErrorHandlingDeserializer has addressed this situation. Full details can be found in the Spring for Apache Kafka documentation.
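
The general idea of handling the failure inside the deserialization step, rather than letting it escape from the poll, can be illustrated with a minimal Python sketch. This is not the Spring API: the names DeserializationFailure and safe_deserialize are hypothetical, and JSON simply stands in for whatever format the producer and consumer were meant to share.

```python
import json
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class DeserializationFailure:
    """Hypothetical marker that carries the raw bytes and the error text."""
    raw: bytes
    error: str


def safe_deserialize(raw: bytes) -> Union[Any, DeserializationFailure]:
    """Decode a JSON payload, returning a failure marker on bad input.

    The exception is captured as data instead of being raised, so the
    caller's fetch loop keeps working and can decide what to do with
    the undecodable record.
    """
    try:
        return json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        return DeserializationFailure(raw=raw, error=str(exc))


# The second record is not valid JSON, as if written by an incompatible producer.
records = [b'{"id": 1}', b"\x00\x01not-json", b'{"id": 2}']
results = [safe_deserialize(r) for r in records]
good = [r for r in results if not isinstance(r, DeserializationFailure)]
bad = [r for r in results if isinstance(r, DeserializationFailure)]
```

The bad record ends up in the bad list with its original bytes intact, while the records either side of it are still decoded.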

Code bugs / unexpected behaviour

Another common cause of a poison pill is where the processing of a message fails because of a bug in the code or an unforeseen non-transient error, causing an exception to be thrown. The message is then not marked as consumed and is re-polled. As the problem is non-transient, the re-polled message will fail in exactly the same way, resulting in an endless number of attempts to process the message (or at least until the message expires).

Effects of a Kafka Poison Pill

In Kafka, a poison pill message will block the partition for the affected consumer group, preventing any subsequent messages from being processed. This in turn places a heavy burden on system resources as the processing of the message is repeatedly attempted, probably very rapidly.
This rapid retrying will likely cause other side effects, such as excessive disk space consumption as log files are flooded with error messages and stack traces. This can lead to further knock-on effects if log aggregation is being used.
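
The blocking behaviour can be reproduced without a broker. The following is a minimal sketch, assuming a consumer that only advances its offset after a message has been processed successfully. The partition is modelled as a plain list, and consume and handler are hypothetical names.

```python
from typing import Callable, Dict, List


def consume(partition: List[str], handler: Callable[[str], None],
            max_polls: int) -> Dict[str, object]:
    """Simulate a consumer that only advances its offset on success."""
    offset = 0  # next offset to read; stands in for the committed offset
    failures = 0
    processed: List[str] = []
    for _ in range(max_polls):
        if offset >= len(partition):
            break
        message = partition[offset]
        try:
            handler(message)
        except Exception:
            # The offset is not advanced, so the same message is re-polled.
            failures += 1
            continue
        processed.append(message)
        offset += 1
    return {"offset": offset, "failures": failures, "processed": processed}


def handler(message: str) -> None:
    # Hypothetical non-transient bug: this input fails every time.
    if message == "poison":
        raise ValueError("cannot process message")


result = consume(["a", "poison", "b", "c"], handler, max_polls=1000)
```

After 1000 polls the offset is still stuck at the poison pill: only "a" has been processed, 999 attempts have failed, and "b" and "c" have never been seen.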

How to minimise the chance of a Poison Pill

When designing a system it’s all too easy to focus on the happy path, where everything works perfectly. Things will go wrong in expected and unexpected ways, so error handling and resolution should be considered and addressed from the outset. Care should also be taken with the error handling itself, as an error within the error handling routine could result in an uncaught exception being thrown, resulting in a Poison Pill situation.

The following could help to limit the possibility of a poison pill situation:

  • Standardise the schema used by producers and consumers, maintaining backward compatibility where possible (products such as Apache Avro may be of use here).

  • Implement a well-tested messaging library, including a full suite of happy and unhappy path tests that cover poison pill scenarios. This library can then be used across all services so that message handling is standardised and the risk of a poison pill scenario is reduced.

  • Consider the importance of message loss/order versus throughput/availability/response times when determining the error handling requirements. Can a failing message simply be discarded, or should it be written to a DLQ, a message hospital or a “retry” topic?

  • Carefully consider the retry policy required to satisfy the needs of the system with regard to message loss and throughput. Retrying for an excessive length of time is not a good idea. Instead, retry for a limited period and then error the message.

  • Restrict who can produce messages on any given topic to reduce the chance of bad messages being produced, either accidentally or deliberately, which would disrupt the system.

  • Prevent unexpected exceptions from propagating out of the consumer without being caught and explicitly handled, so that only “known” exceptions are thrown (remembering that unexpected exceptions can also be thrown from the exception handler).

  • If Spring Kafka is being used, utilise the ErrorHandlingDeserializer.

  • Test the system for poison pill scenarios before it goes into production.
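
The retry policy point above, retrying for a limited period and then erroring the message, might look something like the following sketch. It assumes the application can tell transient failures from non-transient ones. TransientError, process_with_retry and the two example handlers are hypothetical names, and the attempt limit and backoff are illustrative values only.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for errors that may succeed on a later attempt."""


def process_with_retry(handler: Callable[[str], None], message: str,
                       max_attempts: int = 3,
                       backoff_seconds: float = 0.0) -> str:
    """Return 'ok', or 'failed' once the message should be errored."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return "ok"
        except TransientError:
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
        except Exception:
            # Non-transient: retrying cannot help, so give up immediately.
            return "failed"
    return "failed"


calls = {"flaky": 0, "broken": 0}


def flaky(message: str) -> None:
    # Fails twice with a transient error, then succeeds.
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientError("downstream unavailable")


def broken(message: str) -> None:
    # Always fails with a non-transient error.
    calls["broken"] += 1
    raise ValueError("bad payload")


flaky_outcome = process_with_retry(flaky, "m1")
broken_outcome = process_with_retry(broken, "m2")
```

The flaky handler succeeds on its third attempt, while the broken handler is tried once and then reported as failed, leaving the caller free to discard the message or route it elsewhere.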

Recovering from a Poison Pill situation

How to recover the system from a poison pill situation depends on the requirements of the system.

Where message loss is unacceptable, a code fix will be required to specifically handle the poison pill message.

If message loss is acceptable there are some quick work-arounds that can be employed to get the system back on an even keel. These are by no means a long-term solution to the problem and will not prevent the same situation occurring in the future. The idea is to get the system to skip the poison pill message by causing the current offset to be updated to a point beyond the failing message. This can be achieved by:

  • resetting the consumer group's offset to the end of the partition (the latest offset), so that the poison pill and any other messages already on the partition are skipped.
  • manually updating the partition offset to beyond the poison pill message
  • allowing the re-polling to continue until the message expires. This is likely to only be feasible where the retention policy is very short and would result in the poison pill message being lost and potentially other messages on the partition too.

Updating partition offsets as outlined above will not address the root cause of the poison pill and is not a solution for preventing it from occurring again. It is simply a mechanism for recovering from the effects of the poison pill and getting the system back into a state where subsequent messages can be processed. The root cause should always be established and a code fix implemented to prevent it from occurring again in the future.
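
The arithmetic behind skipping a single message is worth spelling out. In Kafka the committed offset is the offset of the next record to be read, so moving past a poison pill at offset N means committing N + 1. The sketch below models committed offsets as a plain dictionary. The skip_message function and the "orders" topic are hypothetical, and no real Kafka client is involved.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]


def skip_message(committed: Dict[TopicPartition, int],
                 partition: TopicPartition,
                 poison_offset: int) -> Dict[TopicPartition, int]:
    """Return new committed offsets positioned just past poison_offset.

    The committed offset is the next record to read, so skipping one
    record means committing poison_offset + 1. Any records between the
    current committed offset and poison_offset are skipped as well.
    """
    current = committed.get(partition, 0)
    if poison_offset < current:
        raise ValueError("poison pill is already behind the committed offset")
    updated = dict(committed)  # leave the caller's mapping untouched
    updated[partition] = poison_offset + 1
    return updated


# The consumer group is stuck on offset 41 of partition 0.
committed = {("orders", 0): 41, ("orders", 1): 7}
updated = skip_message(committed, ("orders", 0), 41)
```

Only the blocked partition moves, from 41 to 42, and the other partition is left where it was.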

Preventing Poison Pills

Preventing a potential poison pill involves ensuring that non-transient exceptions are caught and handled gracefully, so that the exception does not propagate up the stack to the consumer and cause the message to be re-polled. Typically a poison pill message would be handled by:

  • sending the poison pill message to a Dead Letter Queue (DLQ) or retry topic
  • implementing a message hospital, where the message can be assessed and resolved
  • logging the poison pill message

Ensuring that all non-transient exceptions are caught and handled in an appropriate manner means that every message completes its processing (via either a happy path or an unhappy path flow) and that the partition offsets are then successfully updated, preventing the partition from being blocked by a poison pill message.
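
As a minimal sketch of this flow, the following simulation routes a failing message to an in-memory dead letter list and still advances the offset. It assumes, for simplicity, that every exception raised by the handler is non-transient. The partition is a plain list, and consume_with_dead_letter and handler are hypothetical names.

```python
from typing import Callable, List, Tuple


def consume_with_dead_letter(
    partition: List[str], handler: Callable[[str], None]
) -> Tuple[int, List[str], List[Tuple[str, str]]]:
    """Process every message once, dead-lettering the ones that fail."""
    offset = 0
    processed: List[str] = []
    dead_letters: List[Tuple[str, str]] = []
    while offset < len(partition):
        message = partition[offset]
        try:
            handler(message)
            processed.append(message)
        except Exception as exc:
            # Unhappy path: keep the message and the reason for later
            # inspection. This sketch treats all exceptions as non-transient.
            dead_letters.append((message, str(exc)))
        # Both paths complete, so the offset always moves forward.
        offset += 1
    return offset, processed, dead_letters


def handler(message: str) -> None:
    # Hypothetical non-transient bug: this input fails every time.
    if message == "poison":
        raise ValueError("cannot process message")


offset, processed, dead_letters = consume_with_dead_letter(
    ["a", "poison", "b", "c"], handler
)
```

Here the offset reaches the end of the partition: "a", "b" and "c" are processed, and the poison pill is held in the dead letter list together with its error message.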

In Conclusion

A Kafka Poison Pill can have devastating consequences for a system, but with careful consideration and appropriate preventative steps their incidence can be minimised, if not eliminated.

