Failover and Circuit Breaker with Resilience4j

Lydtech
Overview

Resilience4j is an open source library offering many features for managing fault tolerance in an application. It is viewed as the recommended choice and natural successor to the now end-of-life Spring Cloud Hystrix libraries (developed by Netflix, and libraries I used to good effect on a previous project). It supports a number of resilience patterns including failover, circuit breaker, retry, bulkhead, cache, and time limiter. It also publishes metrics, enabling real-time monitoring. It is straightforward to integrate with Spring Boot, as a Resilience4j Spring Boot library is available.

As with the adoption of any new tool or library, it is usual to spike its usage to gain a good understanding of it and to prove whether it is fit for purpose and meets the requirements of the project. To that end I looked at using Resilience4j for two aspects of fault tolerance, namely failover and the circuit breaker, within a Spring Boot application.

Demo Application

For this demo, a Spring Boot application has been developed that illustrates these patterns. A REST API provides an endpoint which, when called, performs a lookup against a banking system to retrieve the bank account details for the provided request parameters: iban, country and currency. The application can be viewed in full on GitHub, and is referenced throughout this blog.

Resilience4j integrates with Spring Boot through its resilience4j-spring-boot2 library.

Included in the project are integration tests that are used to demonstrate the functionality. WireMock is used to mock the third party provided (3PP) services, allowing simulation of the behaviour that triggers failover and the circuit breaker state changes.

The application uses the Lombok library for the API request and response POJOs, removing the need for boilerplate constructors, getters and setters. It is also used for its logging annotation support.

Failover

Failover and circuit breaker are part and parcel of the same pattern; however, it is useful to first describe the behaviour of failover.

The demo application calls a 3PP service to look up account details for the given parameters. If this lookup fails, then the call fails over to a second 3PP service to attempt the lookup.

This is the flow:

Figure 1 - Failover Sequence Diagram

The fallback behaviour is implemented in AccountLookupRouter, which then routes to a lookup service to perform the external call to a banking application.

The two 3PP services are injected into this component.

The lookupAccount(..) method is annotated with the @CircuitBreaker annotation. The name annotation property is important as it links to the associated configuration in the application properties. More on this below in the ‘Circuit Breaker’ section. The fallbackMethod annotation property specifies which method to call (failover to) should this primary call fail.

@Slf4j
@Component
public class AccountLookupRouter {
    @Autowired
    private BankOneAccountLookupService bankOneService;

    @Autowired
    private BankTwoAccountLookupService bankTwoService;

    /**
     * Primary lookup with circuit breaker.
     */
    @CircuitBreaker(name = "lookupAccount", fallbackMethod = "lookupAccountFallback")
    public AccountLookupResponse lookupAccount(final String iban, final String country, final String currency) {
        return bankOneService.lookupAccount(iban, country, currency);
    }

    /**
     * Fallback method for 4xx exceptions, just percolate the exception back.
     */
    private AccountLookupResponse lookupAccountFallback(final String iban, final String country, final String currency, final HttpClientErrorException e) {
        log.debug("Account lookup request resulted in a client exception with status {}", e.getStatusCode());
        throw e;
    }

    /**
     * Fallback method for all other exception types.
     */
    private AccountLookupResponse lookupAccountFallback(final String iban, final String country, final String currency, final Throwable t) {
        log.error("Primary lookup request failed, failing over to Bank Two.  Error was: " + t.getMessage());
        log.debug("Fallback: routing account lookup request to Bank Two {}", iban);
        return bankTwoService.lookupAccount(iban, country, currency);
    }
}

Here we see there are two overloaded fallback methods. In fact any number can be provided, with the last argument, the exception type, being the differentiator. This provides the flexibility to perform different fallback actions based on the exception thrown. In this example the default behaviour, using the ‘catch all’ Throwable type, is to attempt the lookup against the secondary 3PP. However, if the exception thrown is an HttpClientErrorException, then the fallback method that takes this exception type as its last argument handles it, and the exception is simply re-thrown.

Why might we want to re-throw an exception rather than attempt the secondary lookup? Well, Spring’s HttpClientErrorException represents 4xx responses, such as Bad Request and Not Found. These indicate that the client needs to correct their request, as otherwise any retry or secondary lookup will also fail. Meanwhile 5xx responses, represented by Spring’s HttpServerErrorException, cover problems such as Service Unavailable and Bad Gateway. It is for these 5xx issues that a retry or secondary lookup is the behaviour we are after.
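
The dispatch rule can be illustrated without the library. The following is a minimal plain-Java sketch, not Resilience4j's implementation: ClientError, ServerError and FallbackSketch are hypothetical names standing in for Spring's exception types and the demo's router. It mimics the two fallbacks, re-throwing a 4xx-style error and failing over for anything else.

```java
import java.util.function.Supplier;

// Hypothetical stand-ins for Spring's HttpClientErrorException (4xx)
// and HttpServerErrorException (5xx).
class ClientError extends RuntimeException {
    ClientError(String message) { super(message); }
}

class ServerError extends RuntimeException {
    ServerError(String message) { super(message); }
}

class FallbackSketch {
    // Calls the primary; on failure picks a fallback by exception type,
    // most specific type first, mirroring the two overloaded fallback methods.
    static String lookup(Supplier<String> primary, Supplier<String> secondary) {
        try {
            return primary.get();
        } catch (ClientError e) {
            // 4xx: the request itself is wrong, so the secondary would fail too.
            throw e;
        } catch (RuntimeException e) {
            // Everything else: fail over to the secondary provider.
            return secondary.get();
        }
    }
}
```

Calling FallbackSketch.lookup with a primary that throws ServerError returns the secondary's result, while a primary that throws ClientError propagates that exception to the caller.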

For this to work of course the necessary exception handling must be implemented on the call to the 3PP. Digging into the demo application further we see the Router calls the BankOneService for the primary call, which in turn calls the BankOneGateway. It is this gateway component that makes the call to the external 3PP:

public BankOneAccountResponse accountLookup(final String iban, final String country, final String currency) {
    try {
        URI uri = UriComponentsBuilder
                .fromUri(config.getBaseUrl().toURI())
                .path(config.getAccountLookupPath())
                .queryParam("iban", iban)
                .queryParam("country", country)
                .queryParam("currency", currency)
                .build(new HashMap<>());

        ResponseEntity<BankOneAccountResponse> response = restTemplate.exchange(uri, GET, new HttpEntity<>(headers()), BankOneAccountResponse.class);
        return response.getBody();
    } catch (HttpClientErrorException e) {
        // Covers 4xx exceptions.
        log.error("HttpClientErrorException exception thrown in Bank One.", e);
        throw e;
    } catch (Exception e) {
        throw new RuntimeException(String.format("Error searching Bank One for account [%s, %s]", iban, country), e);
    }
}

We see here that if an HttpClientErrorException is thrown on the API call then this will be re-thrown, enabling the correct overloaded fallback method to be called. All other exceptions are caught and re-thrown as a RuntimeException, so these will be handled by the Throwable fallback method.

In this demo application, if the secondary 3PP lookup fails, then that exception will be returned to the client of this service. It is however possible to chain the fallback methods, such that if any fallback method fails, then a further fallback method to that one can be called. Simply configure the fallback method(s) to use by applying the same @CircuitBreaker annotation. Perhaps in this situation a fallback method to the secondary call could be added that checks a cache for the last known account details for the request parameters, and if present return these.

The failover behaviour is demonstrated in the AccountLookupFailoverIntegrationTest. It shows:

  1. Successful primary 3PP lookup.
  2. Primary 3PP lookup fails with account not found (404), which is returned to the client.
  3. Primary 3PP lookup fails with service unavailable (503), prompting failover to secondary 3PP which succeeds.
  4. Both primary and secondary 3PP lookups fail with 5xx responses, so the latter response is returned to the client.

We will look at the third of these scenarios (primary 3PP fails, secondary 3PP succeeds) here:

/**
 * A 5xx such as a SERVICE UNAVAILABLE (503) should result in a failover to BANK TWO.
 */
@Test
public void lookupUpAccount_BankOneUnavailable_FailoverToBankTwo_Test() {
    primeBankOneForFailure(503);
    primeBankTwoForSuccess(IBAN, COUNTRY, CURRENCY);

    ResponseEntity<AccountLookupResponse> response = callLookupAccount(IBAN, COUNTRY, CURRENCY);

    assertThat(response.getStatusCode(), equalTo(HttpStatus.OK));
    assertThat(response.getBody().getAccountLookupProvider(), equalTo(BANK_TWO_NAME));
    assertThat(response.getBody().getRoutingNumber(), equalTo(ROUTING_NUMBER));
}

It is worth noting that the parent test class, BaseAccountLookupIntegrationTest, which declares the @SpringBootTest annotation, defines the methods for configuring the 3PP WireMock behaviours.

Use the @AutoConfigureWireMock(port=0) annotation to ensure WireMock is assigned an available port. The endpoint URLs are then overridden in the application-test.yml, to ensure that WireMock is hit on localhost and the assigned port:

account-lookup:
    endpoints:
        bankOne:
            baseUrl: http://localhost:${wiremock.server.port}/bankone
            accountLookupPath: /api/account
        bankTwo:
            baseUrl: http://localhost:${wiremock.server.port}/banktwo
            accountLookupPath: /api/account

The WireMock stub for the primary 3PP bank service can be configured to return an error response via this method call:

protected void primeBankOneForFailure(int responseCode) {
    stubFor(get(urlPathEqualTo("/bankone/api/account"))
            .willReturn(aResponse()
            .withHeader("Content-Type", "application/json")
            .withStatus(responseCode)));
}

A similar call is made to configure the secondary 3PP bank service to return a success (200). The test then simply asserts that, when the service is called, the response is decorated with the secondary 3PP bank name, proving that failover was successful.

Circuit Breaker

The application is configured to use the circuit breaker pattern. If the failure rate of calls to the primary 3PP, measured over the most recent calls, reaches the configured threshold, then the circuit is opened, and subsequent calls are directed straight to the fallback 3PP service. After a configurable period the circuit breaker moves to half open. Requests are again routed to the primary 3PP service, and if sufficient requests are successfully handled then the circuit is closed, and subsequent requests all go to the primary 3PP. However, if too many of the requests fail while the circuit is half open, then it is fully opened, and subsequent requests are again routed directly to the secondary 3PP service.
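
These transitions can be sketched as a small state machine. This is a simplified plain-Java illustration under stated assumptions, not Resilience4j's code: SketchBreaker and its members are hypothetical names, the counters use a tumbling window rather than a ring buffer, and the clock is injected so that the wait period can be simulated. The circuit opens when the failure rate is equal to or greater than the threshold.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

// Simplified count-based circuit breaker. Illustration only: the counters use a
// tumbling window rather than Resilience4j's ring buffer.
class SketchBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int closedBufferSize;     // cf. ringBufferSizeInClosedState
    private final int halfOpenBufferSize;   // cf. ringBufferSizeInHalfOpenState
    private final double thresholdPercent;  // cf. failureRateThreshold
    private final long waitNanos;           // cf. waitDurationInOpenState
    private final LongSupplier clock;       // injectable so a test can simulate the wait

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    SketchBreaker(int closedBufferSize, int halfOpenBufferSize, double thresholdPercent,
                  Duration waitInOpenState, LongSupplier clock) {
        this.closedBufferSize = closedBufferSize;
        this.halfOpenBufferSize = halfOpenBufferSize;
        this.thresholdPercent = thresholdPercent;
        this.waitNanos = waitInOpenState.toNanos();
        this.clock = clock;
    }

    State state() { return state; }

    // Returns true if the primary may be called. An OPEN breaker moves to
    // HALF_OPEN on the first attempt made after the wait has elapsed.
    boolean allowCall() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= waitNanos) {
            transitionTo(State.HALF_OPEN);
        }
        return state != State.OPEN;
    }

    // Records the outcome of a permitted call. The failure rate is only
    // evaluated once the buffer for the current state is full.
    void record(boolean success) {
        if (state == State.OPEN) {
            return; // calls are short-circuited while OPEN, nothing to record
        }
        calls++;
        if (!success) {
            failures++;
        }
        int bufferSize = (state == State.CLOSED) ? closedBufferSize : halfOpenBufferSize;
        if (calls < bufferSize) {
            return;
        }
        double failureRate = 100.0 * failures / calls;
        // Transitioning (even CLOSED -> CLOSED) resets the counters.
        transitionTo(failureRate >= thresholdPercent ? State.OPEN : State.CLOSED);
    }

    private void transitionTo(State next) {
        state = next;
        calls = 0;
        failures = 0;
        if (next == State.OPEN) {
            openedAt = clock.getAsLong();
        }
    }
}
```

A caller would check allowCall() before each primary call, route to the secondary when it returns false, and pass the outcome of each permitted call to record(..).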

The circuit breaker state transitions are shown here, in a diagram taken from the Resilience4j website:

Figure 2 - Circuit Breaker State Transitions

The following sequence diagram shows the flow and these circuit transitions being triggered in relation to the demo application:

Figure 3 - Circuit Breaker Sequence Diagram

As covered in the ‘Failover’ section, the Resilience4j @CircuitBreaker annotation on the router method is all that is required in the code to enable the circuit breaker behaviour, with the annotation name property linking to the configuration in the application properties.

    /**
     * Primary lookup with circuit breaker.
     */
    @CircuitBreaker(name = "lookupAccount", fallbackMethod = "lookupAccountFallback")
    public AccountLookupResponse lookupAccount(final String iban, final String country, final String currency) {
        return bankOneService.lookupAccount(iban, country, currency);
    }

These are the circuit breaker configuration properties as defined in the application.yml. Properties can be defaulted and overridden by different instances (e.g. lookupAccount / updateAccount etc) as required. In this case all the properties are defaulted, and the lookupAccount instance (that tallies with the @CircuitBreaker annotation name property) uses those as its base config:

resilience4j.circuitbreaker:
    configs:
        default:
            registerHealthIndicator: true
            # Failure rate threshold percentage
            failureRateThreshold: 10
            # Minimum number of call attempts before rate threshold percentage is checked.
            ringBufferSizeInClosedState: 10
            # How long to wait until switching to half open.
            waitDurationInOpenState: 3s
            # Number of successful requests before moving back to closed from half open.
            ringBufferSizeInHalfOpenState: 5
            # Exceptions that do not count towards opening the circuit.
            ignoreExceptions:
                # Ignore 4xx exceptions.
                - org.springframework.web.client.HttpClientErrorException
    instances:
        lookupAccount:
            baseConfig: default

Each property has a comment as to its effect, and more detail is of course available in the Resilience4j documentation. Of particular note is the ignoreExceptions property, which lists the exception types that should not count towards opening the circuit. In the demo application we do not want the circuit to open if 4xx responses are returned from the 3PP service. These are instead returned to the user, as the expectation is that the client request needs to be fixed. Hence opening the circuit is not appropriate in these circumstances.
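
The effect of an ignore list can be sketched in a few lines of plain Java. This is an illustration only: IgnoreSketch and countsAsFailure are hypothetical names, and the sketch assumes that an exception matching or inheriting from a listed type is not counted.

```java
import java.util.List;

class IgnoreSketch {
    // Returns true if the throwable should count as a failure, i.e. it is not
    // an instance of any ignored type (subclasses of an ignored type are ignored too).
    static boolean countsAsFailure(Throwable t, List<Class<? extends Throwable>> ignored) {
        return ignored.stream().noneMatch(type -> type.isInstance(t));
    }
}
```

With IllegalArgumentException on the ignore list, for example, a NumberFormatException (a subclass) would not count as a failure, whereas an IllegalStateException would.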

The AccountLookupCircuitBreakerIntegrationTest Spring Boot integration test demonstrates the circuit breaker behaviour in action, driven by these configuration properties. The test runs through a full flow where the circuit breaker starts in a CLOSED state, with all requests being routed to the primary 3PP, but failing. They fail over to the secondary 3PP, and the circuit breaker moves to OPEN once the number of recorded calls reaches ringBufferSizeInClosedState and the failure rate meets the failureRateThreshold. Requests are now routed directly to the secondary 3PP.

After the period configured by the waitDurationInOpenState property elapses, the circuit moves to HALF_OPEN, and the next requests are routed back to the primary 3PP. But those requests still fail, and once the number of requests reaches the limit configured by ringBufferSizeInHalfOpenState, the circuit moves back to OPEN. Requests are again routed to the secondary 3PP, while in the meantime the primary 3PP ‘recovers’. As before, once the waitDurationInOpenState period elapses the circuit moves to HALF_OPEN. Now the requests are routed back to the primary 3PP, which is able to successfully handle them. After the configured ringBufferSizeInHalfOpenState number of requests are successfully handled by the primary 3PP, the circuit moves to CLOSED.

@Test
public void lookupUpAccount_CircuitBreakerStateTransitions_Test() throws Exception {
    primeBankOneForFailure(503);
    primeBankTwoForSuccess(IBAN, COUNTRY, CURRENCY);

    // First 9 calls fail over, but the circuit breaker remains CLOSED (IntStream.range excludes its upper bound).
    IntStream.range(1, 10).forEach($ -> {
        ResponseEntity response = callLookupAccount(IBAN, COUNTRY, CURRENCY);
        log.debug("Iteration: {} - Circuit breaker state: {}", $, getCircuitBreakerStatus());
        performAssertions(response, HttpStatus.OK, BANK_TWO_NAME, CircuitBreaker.State.CLOSED);
    });

    // The 10th call fills the ring buffer, so the circuit breaker OPENs on that failed request.  See: ringBufferSizeInClosedState
    // This loop makes 9 calls, all of which leave the circuit breaker OPEN.
    IntStream.range(11, 20).forEach($ -> {
        ResponseEntity response = callLookupAccount(IBAN, COUNTRY, CURRENCY);
        log.debug("Iteration: {} - Circuit breaker state: {}", $, getCircuitBreakerStatus());
        performAssertions(response, HttpStatus.OK, BANK_TWO_NAME, CircuitBreaker.State.OPEN);
    });

    // Now wait for 3 seconds before the next request.  At this point the circuit breaker moves to HALF_OPEN.  See: waitDurationInOpenState
    TimeUnit.SECONDS.sleep(3);
    // 4 calls happen with the circuit breaker at HALF_OPEN.
    IntStream.range(21, 25).forEach($ -> {
        ResponseEntity response = callLookupAccount(IBAN, COUNTRY, CURRENCY);
        log.debug("Iteration: {} - Circuit breaker state: {}", $, getCircuitBreakerStatus());
        performAssertions(response, HttpStatus.OK, BANK_TWO_NAME, CircuitBreaker.State.HALF_OPEN);
    });

    // The 5th half-open call also fails, filling the half-open ring buffer, so the circuit breaker OPENs again.  See: ringBufferSizeInHalfOpenState
    IntStream.range(26, 27).forEach($ -> {
        ResponseEntity response = callLookupAccount(IBAN, COUNTRY, CURRENCY);
        log.debug("Iteration: {} - Circuit breaker state: {}", $, getCircuitBreakerStatus());
        performAssertions(response, HttpStatus.OK, BANK_TWO_NAME, CircuitBreaker.State.OPEN);
    });

    // Now BANK_ONE is able to successfully respond to the request.  But while the circuit is OPEN, it will not be hit.
    primeBankOneForSuccess(IBAN, COUNTRY, CURRENCY);
    IntStream.range(28, 30).forEach($ -> {
        ResponseEntity response = callLookupAccount(IBAN, COUNTRY, CURRENCY);
        log.debug("Iteration: {} - Circuit breaker state: {}", $, getCircuitBreakerStatus());
        performAssertions(response, HttpStatus.OK, BANK_TWO_NAME, CircuitBreaker.State.OPEN);
    });

    // Now wait for 3 seconds before the next request.  At this point the circuit breaker moves to HALF_OPEN.  See: waitDurationInOpenState
    TimeUnit.SECONDS.sleep(3);
    IntStream.range(31, 35).forEach($ -> {
        ResponseEntity response = callLookupAccount(IBAN, COUNTRY, CURRENCY);
        log.debug("Iteration: {} - Circuit breaker state: {}", $, getCircuitBreakerStatus());
        performAssertions(response, HttpStatus.OK, BANK_ONE_NAME, CircuitBreaker.State.HALF_OPEN);
    });

    // The 5th half-open call succeeds, filling the half-open ring buffer, so the circuit breaker CLOSEs; the remaining calls stay CLOSED.  See: ringBufferSizeInHalfOpenState
    IntStream.range(36, 40).forEach($ -> {
        ResponseEntity response = callLookupAccount(IBAN, COUNTRY, CURRENCY);
        log.debug("Iteration: {} - Circuit breaker state: {}", $, getCircuitBreakerStatus());
        performAssertions(response, HttpStatus.OK, BANK_ONE_NAME, CircuitBreaker.State.CLOSED);
    });
}

private void performAssertions(ResponseEntity<AccountLookupResponse> response,
                              HttpStatus httpStatus,
                              String lookupProvider,
                              CircuitBreaker.State circuitBreaker) {
    assertThat(response.getStatusCode(), equalTo(httpStatus));
    assertThat(response.getBody().getAccountLookupProvider(), equalTo(lookupProvider));
    assertThat(getCircuitBreakerStatus(), equalTo(circuitBreaker));
}
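
When reading the loop bounds in this test, it helps to remember that IntStream.range excludes its upper bound, so range(1, 10) makes nine calls rather than ten. A small plain-Java check, where RangeSketch is a hypothetical helper and not part of the demo project:

```java
import java.util.stream.IntStream;

class RangeSketch {
    // Number of iterations performed by the half-open range [from, to).
    static long iterations(int from, int to) {
        return IntStream.range(from, to).count();
    }

    // Number of iterations performed by the closed range [from, to].
    static long iterationsClosed(int from, int to) {
        return IntStream.rangeClosed(from, to).count();
    }
}
```

Here iterations(1, 10) is 9 and iterations(26, 27) is 1, whereas iterationsClosed(1, 10) is 10, which is why the tenth failing call in the test falls in the second loop.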

Conclusion

Resilience4j has proven straightforward to integrate in a Spring Boot application, providing failover and circuit breaker with little boilerplate code required. Using WireMock to test the application proved the behaviour, and provides a useful means to tweak the configuration and observe the results.

Viewing The Source

The source code is available on GitHub

