Kafka Chaos Testing With Conduktor Gateway

Introduction

Chaos testing validates that applications behave correctly under unexpected error conditions. Being able to simulate these error conditions with Kafka is a difficult problem as Kafka is by its nature a resilient messaging broker. This article looks at how Conduktor Gateway solves this problem, enabling the developer or tester to write automated chaos tests that validate the application handles these broker errors as required.

The source code for the accompanying Spring Boot application is available here.

Conduktor Gateway

Conduktor Gateway is a proxy application that sits between the application and Kafka. Instead of connecting directly to Kafka, the application instead connects to the Gateway. This means that the Gateway can intercept requests made by the application and perform a number of operations on them.

Figure 1: Conduktor Gateway proxying requests to Kafka

Features of the Gateway include support for multi-tenancy on a single Kafka instance. Clusters are kept isolated from one another, meaning different users can safely use the same Kafka instance, keeping costs down, with Gateway making this separation transparent to users.

Gateway can help meet compliance requirements, enforcing encryption on messages. It can be used to enforce best practices around configuration, naming conventions and schema usage. And as demonstrated in this article, it enables chaos testing, testing whether errors that could be thrown by the broker are handled by the application.

For the full range of features that Conduktor Gateway provides, visit Conducktor Gateway.

What Is Chaos Testing?

Chaos testing refers to the type of testing whereby unexpected errors are simulated in an environment to observe how an application responds. In the case of chaos testing with Kafka, this can mean simulating errors occurring when an application attempts to write messages to the broker. Conduktor Gateway is able to perform this role, as it proxies the write requests from the application to Kafka. Instead of performing the write, it can instead return different types of error response that reflect the real errors thrown by Kafka. The application can then be observed to see if transient errors from the producer writes are retried if configured to do so, and non-transient errors are correctly handled by the application. Importantly the Kafka cluster itself is not disrupted in order to facilitate this testing.

Figure 2: Conduktor Gateway intercepting a request and returning an error

Manual vs Automated Testing

Conduktor Gateway can be spun up in a local or remote environment with the application connecting to it rather than Kafka directly. The Gateway is primed via Conduktor Console or REST APIs to simulate the required error scenarios, and the application can then be manually tested to observe its behaviour. The Conduktor Gateway code examples cover this in detail, and the Spring Boot demo application ReadMe also steps through such testing.

Real value can then be added by automating these tests. The Dev and/or QA teams can develop and run a set of chaos tests both locally and automated in the CI pipeline. These ensure that any application or configuration changes do not impact the resilience of the application. The remainder of this article focuses on the automating of these tests.

Automating Chaos Tests

Lydtech's component-test-framework can be used to orchestrate the Testcontainers testing library to bring up Conduktor Gateway in a docker container, along with Kafka, Zookeeper and the application under test in their own containers. A suite of chaos tests are written which can run on every build, merge or release as deemed most suitable.

The tests are written as standard JUnit 5 tests, annotated with the extension that hooks into the component test framework in order to spin up the containers. Gateway provides a REST API that each test then calls at the outset to prime the proxy with the required behaviour. So for example, configure the proxy to simulate broker errors for a certain percentage of requests such as there not being enough replicas available to satisfy the replication configuration when the application producer attempts to send a new record. In this scenario the producer can be configured to retry its send, so the test can assert that the writes are eventually successful. Meanwhile the test can then assert that non-retryable errors, such as an invalid acks value being received by the broker, are also handled. If the message was produced as a result of a message being consumed then this might be to assert that the consumed message was dead lettered, or if triggered by a REST request that the expected error response was returned.

Spring Boot Demo

Overview

A Spring Boot application has been developed to demonstrate using Conduktor Gateway for chaos testing. The application exposes a REST endpoint that can be called to trigger sending the requested number of events to a topic on Kafka.

Figure 3: Demo Spring Boot application

Chaos Testing

For the chaos test, the application is configured to connect to the Gateway rather than Kafka directly, so the producer requests are received by the Gateway. Depending on how the Gateway has been primed it will allow some producer requests through to Kafka and respond with simulated Kafka errors for a certain percentage of others.

Figure 4: Chaos testing with Conduktor Gateway

Figure 4: Chaos testing with Conduktor Gateway

Step 1 The test begins by priming the Gateway with the Kafka error condition to simulate.

Step 2 The test calls the application REST endpoint to trigger sending events to Kafka, via the Gateway.

Step 3 The test asserts the REST response from the application. This will either expect a success as any errors should have been retried until completion, or an error if the producer is not able to retry the given exception.

Step 4 If the messages should still have been successfully written to Kafka due to producer retries, then the test consumes the events from the topic to assert they have all been written eventually.

Test Configuration

The demo component test ResilienceCT uses the component-test-framework to spin up the Docker containers using the following annotation on the test class:

@ExtendWith(TestContainersSetupExtension.class)
public class ResilienceCT {

Kafka and Conduktor Gateway are enabled for the test configuration in the pom.xml:

<kafka.enabled>true</kafka.enabled>
<conduktor.gateway.enabled>true</conduktor.gateway.enabled>

The application-component-test.yml overrides the Kafka bootstrap URL to connect to Conduktor Gateway. The proxy host is conduktorgateway, and the port is configured as 6969 (which can be overridden in the pom.xml configuration):

kafka:
    bootstrap-servers: conduktorgateway:6969

Test Execution

The Spring Boot application jar and docker container are first built:

mvn clean install
docker build -t ct/kafka-chaos-testing:latest .

The chaos tests are then executed using the component maven profile:

mvn test -Pcomponent

The chaos test class ResilienceCT defines a number of test cases that cover different broker failure scenarios. It also defines a happy path test where the Gateway simply proxies requests directly through to Kafka, and a test that simulates the broker being slow to respond.

Each test begins by priming the Conduktor Gateway with the Kafka error scenario that it should simulate, using the ConduktorGatewayClient provided by the component test framework. Between tests it resets the Gateway.

Each test then calls the application to trigger producing events, and asserts on the results. When retryable errors are simulated as being thrown by Kafka, where the producer should be able to retry a request and it succeed on a subsequent attempt, the events should eventually all be written to Kafka. In this case the test will assert that it has been able to consume all the outbound events within a given timeframe. An example test is shown in full here:

@Test
public void testResilience_NotEnoughReplicas() throws Exception {
	Integer NUMBER_OF_EVENTS = 250;
	Integer FAILURE_RATE_PERCENTAGE = 20;

	conduktorGatewayClient.simulateBrokenBroker(FAILURE_RATE_PERCENTAGE, BrokenBrokerErrorType.NOT_ENOUGH_REPLICAS);

	TriggerEventsRequest request = TriggerEventsRequest.builder()
		.numberOfEvents(NUMBER_OF_EVENTS)
		.build();

	RestAssured.given()
		.spec(ServiceClient.getInstance().getRequestSpecification())
		.contentType("application/json")
		.body(JsonMapper.writeToJson(request))
		.post("/v1/demo/trigger?async=true")
		.then()
		.statusCode(202);

	KafkaClient.getInstance().consumeAndAssert("NotEnoughReplicas", consumer, NUMBER_OF_EVENTS, 3);
}

For non-retryable errors the producer should not attempt to retry the send and the exception should be returned in the REST response.

Test Results

As each test runs the Spring Boot application logs can be observed using the following command:

docker logs -f ct-kafka-chaos-testing-1

For the test shown above the logs show the error being returned after every few send attempts (as expected, with a 20% failure rate):

WARN  o.a.k.c.p.i.Sender - [Producer clientId=producer-1] Got error produce response with correlation id 699 on topic-partition demo-outbound-topic-2, retrying (2147483646 attempts left). Error: NOT_ENOUGH_REPLICAS

However all the messages are eventually sent successfully as the write is retried every time.

For tests expecting a non-retryable exception to be thrown and a 500 response returned, this can then be asserted for in the test:

RestAssured.given()
	.spec(ServiceClient.getInstance().getRequestSpecification())
	.contentType("application/json")
	.body(JsonMapper.writeToJson(request))
	.post("/v1/demo/trigger?async=false")
	.then()
	.statusCode(500)
	.and()
	.body(containsString("org.apache.kafka.common.errors.InvalidRequiredAcksException: Produce request specified an invalid value for required acks."));

Error Scenarios

The following table shows the error scenarios supported by the component-test-framework enabling automation of their testing, and demonstrated in ReslienceCT. For the full set of failure scenarios supported by Conduktor Gateway, view the documentation at Conduktor.

Test	Exception Thrown	Cause	Is Retryable?
Not Enough Replicas	NotEnoughReplicasException	An insufficient number of in-sync replica partitions are available to satisfy the configured min-isr requirement	Yes - on a subsequent attempt there may be enough in-sync replicas
Corrupt Message	CorruptRecordException	The record being written fails the internal cyclic redundancy check indicating network or disk corruption	Yes - the write could succeed on a subsequent attempt
Invalid Required Acks	InvalidRequiredAcksException	An invalid value has been supplied for the produce request required acks parameter	No - as the required acks is invalid it will never be accepted
Unknown Server Error	UnknownServerException	An unknown error occurred on the server for which there is no error code	No - as the error is unknown this is considered not retryable
Leader Election	NotLeaderOrFollowerException	A new partition leader is being elected as the current leader became unavailable. Writes cannot be accepted at this time	Yes - once the leader election has completed a write being retried can succeed

The ResilienceCT test also demonstrates configuring an interceptor on the Gateway that adds a variable amount of latency to producer requests:

conduktorGatewayClient.simulateSlowBroker(RATE_IN_PERCENTAGE, MIN_LATENCY, MAX_LATENCY);

This can be used to verify that when requests are encountering high amounts of latency, such as when the broker is under heavy load, that this does not impact the integrity and consistency of data in the system.

Testing Configuration Changes

With a suite of chaos tests in place testing various error scenarios, the application configuration can be tuned and the impact observed. For example, the default producer retries configured in the application.yml (and in the application-component-test.yml with overrides for the component tests) can be overridden. If set to 0 we observe that the producer no longer retries the Kafka retryable exceptions, and instead such an exception is percolated up, in this case resulting in a 500 Internal Server Error response returned to the caller.

Kafka:
	producer:
		# Default: 2147483647
		# Change to 0 to stop retry behaviour
		retries: 0

Conclusion

Chaos testing is an important but often overlooked facet of a comprehensive test suite. Moving from manual to automated chaos testing has traditionally been a step harder still. By utilising Conduktor Gateway within the component test framework to simulate Kafka broker errors it becomes straightforward to add this type of test for applications that use Kafka as their messaging broker.

Source Code

The source code for the accompanying Spring Boot demo application is available here:

https://github.com/lydtechconsulting/kafka-chaos-testing/tree/v1.0.0

More On Kafka & Conduktor

Conduktor Platform & Kafka Component Testing: using the Conduktor Platform as an aid to component testing a Kafka based application.
Kafka Monitoring & Metrics: With Conduktor: monitoring the Kafka topic and broker metrics using the Conduktor Platform.

Introduction to Kafka with Spring Boot

Lydtech's Udemy course Introduction to Kafka with Spring Boot covers everything from the core concepts of messaging and Kafka through to step by step code walkthroughs to build a fully functional Spring Boot application that integrates with Kafka. Put together by our team of Kafka and Spring experts, this course is the perfect introduction to using Kafka with Spring Boot.

View this article on our Medium Publication.