Implementing a circuit breaker in a Java project using the Resilience4j framework


So far in this series, we have learned about Resilience4j and its Retry, RateLimiter, TimeLimiter, and Bulkhead modules. In this article, we will explore the CircuitBreaker module. We'll learn when and how to use it, and look at a few examples.

Code example

A working code example for this article is available on GitHub.

What is Resilience4j?

Please refer to the description in the previous article for a quick introduction to how Resilience4j works in general.

What is a circuit breaker?

The idea behind a circuit breaker is to block calls to a remote service if we know the call is likely to fail or time out. We do this so that we don't unnecessarily waste critical resources, both in our service and in the remote service. Backing off like this also gives the remote service some time to recover.

How do we know a call is likely to fail? By keeping track of the results of previous requests made to the remote service. For example, if 8 out of the previous 10 calls resulted in a failure or a timeout, the next call will likely also fail.

A circuit breaker keeps track of the responses by wrapping calls to the remote service. During normal operation, when the remote service responds successfully, we say the circuit breaker is in a "closed" state. When in the closed state, the circuit breaker passes requests through to the remote service as normal.

When the remote service returns an error or times out, the circuit breaker increments an internal counter. If the count of errors exceeds a configured threshold, the circuit breaker switches to an "open" state. When in the open state, the circuit breaker immediately returns an error to the caller without even attempting the remote call.

After some configured time, the circuit breaker switches from the open state to a "half-open" state. In this state, it lets a few requests pass through to the remote service to check if it's still unavailable or slow. If the failure rate or slow call rate is above the configured threshold, it switches back to the open state. However, if the failure rate or slow call rate is below the configured threshold, it switches to the closed state to resume normal operation.
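The state transitions described above can be sketched as a minimal state machine. This is an illustration of the concept only, not the actual Resilience4j implementation; all names here are made up:

```java
// Minimal sketch of the circuit-breaker state machine described above.
// Illustrative only -- not the Resilience4j implementation.
enum State { CLOSED, OPEN, HALF_OPEN }

class SimpleCircuitBreaker {
    private final int failureThreshold; // failures that trip the breaker
    private State state = State.CLOSED;
    private int failureCount = 0;

    SimpleCircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    State state() { return state; }

    boolean allowsCall() { return state != State.OPEN; }

    void recordFailure() {
        failureCount++;
        if (failureCount >= failureThreshold) state = State.OPEN;
    }

    void recordSuccess() {
        // A successful trial call in HALF_OPEN closes the circuit again
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;
            failureCount = 0;
        }
    }

    // Called when the configured wait time in the OPEN state has elapsed
    void waitDurationElapsed() {
        if (state == State.OPEN) state = State.HALF_OPEN;
    }
}
```

A real implementation would additionally track slow calls, use a sliding window rather than a plain counter, and limit the number of trial calls in the half-open state.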

Types of circuit breakers

Circuit breakers can be count-based or time-based. A count-based circuit breaker switches state from closed to open if the last N calls failed or were slow. A time-based circuit breaker switches state if the responses in the last N seconds failed or were slow. In both circuit breakers, we can also specify the threshold for failed or slow calls.

For example, we can configure a count-based circuit breaker to "open the circuit" if 70% of the last 25 calls failed or took more than 2 seconds to complete. Similarly, we could tell a time-based circuit breaker to open the circuit if 80% of the calls in the last 30 seconds failed or took more than 5 seconds.
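As a rough sketch of how a count-based window evaluates its threshold (illustrative only, not the library's code), given the outcomes of the last N calls:

```java
import java.util.List;

// Sketch: decide whether a count-based breaker should open, given the
// outcomes of the last N calls. true = the call failed or was slow.
class WindowEvaluator {
    static boolean shouldOpen(List<Boolean> lastCalls, float thresholdPercent) {
        long bad = lastCalls.stream().filter(b -> b).count();
        float rate = 100.0f * bad / lastCalls.size();
        return rate >= thresholdPercent;
    }
}
```

With a 70% threshold, 7 bad outcomes out of the last 10 calls would open the circuit, while 2 out of 10 would not.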

Resilience4j CircuitBreaker concepts

resilience4j-circuitbreaker works like the other Resilience4j modules. We provide it the code we want to execute as a functional construct (a lambda expression that makes a remote call, or a Supplier of a value retrieved from the remote service, etc.) and the circuit breaker decorates it with code that tracks responses and switches states when required.

Resilience4j supports both count based and time-based circuit breakers.

We use the slidingWindowType() configuration to specify the type of circuit breaker. This configuration can take one of two values: SlidingWindowType.COUNT_BASED or SlidingWindowType.TIME_BASED.

failureRateThreshold() and slowCallRateThreshold() configure the failure rate threshold and the slow call rate threshold as a percentage.

slowCallDurationThreshold() configures the duration beyond which a call is considered slow.

We can specify minimumNumberOfCalls(), the number of calls that are required before the circuit breaker can calculate the error rate or slow call rate.

As mentioned earlier, the circuit breaker switches from the open state to the half-open state after a certain time to check how the remote service is doing. waitDurationInOpenState() specifies the time the circuit breaker should wait before switching to the half-open state.

permittedNumberOfCallsInHalfOpenState() configures the number of calls allowed in the half-open state, and maxWaitDurationInHalfOpenState() determines how long the circuit breaker can remain in the half-open state before switching back to the open state.

The default value of 0 for this configuration means the circuit breaker waits indefinitely until all permittedNumberOfCallsInHalfOpenState() calls have completed.
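Putting the half-open settings together, a builder call might look like this (the values here are illustrative assumptions, not recommendations):

```java
CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .waitDurationInOpenState(Duration.ofSeconds(30))        // stay open for 30s before trying half-open
  .permittedNumberOfCallsInHalfOpenState(5)               // allow 5 trial calls in half-open
  .maxWaitDurationInHalfOpenState(Duration.ofSeconds(60)) // after 60s in half-open, go back to open
  .build();
```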

By default, a circuit breaker treats any exception as a failure. But we can tweak this with the recordExceptions() configuration to specify a list of exceptions that should be treated as failures, and the ignoreExceptions() configuration to specify which ones should be ignored.

If we want finer control over whether an exception should be treated as a failure or ignored, we can provide a Predicate<Throwable> via the recordException() or ignoreException() configurations.
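For example, a predicate might record only server-side errors as failures. The exception type and its status-code field below are made up for illustration:

```java
import java.util.function.Predicate;

// Sketch: treat only certain exceptions as recorded failures.
// HypotheticalServiceException and its statusCode field are made-up names.
class FailurePredicates {
    static class HypotheticalServiceException extends RuntimeException {
        final int statusCode;
        HypotheticalServiceException(int statusCode) { this.statusCode = statusCode; }
    }

    // Record as a failure only if it is a service exception with a 5xx status
    static final Predicate<Throwable> RECORD_AS_FAILURE = t ->
        t instanceof HypotheticalServiceException
            && ((HypotheticalServiceException) t).statusCode >= 500;
}
```

A predicate like this could then be passed to recordException() on the config builder.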

When the circuit breaker rejects a call while in the open state, it throws a CallNotPermittedException. We can control the amount of information in the stack trace of a CallNotPermittedException using the writableStackTraceEnabled() configuration.

Using Resilience4j CircuitBreaker module

Let's see how to use the various features available in the resilience4j-circuitbreaker module.

We will use the same example as the previous articles in this series. Suppose we are building a website for an airline that lets its customers search for and book flights. Our service talks to a remote service encapsulated by the class FlightSearchService.

When using Resilience4j circuit breakers, CircuitBreakerRegistry, CircuitBreakerConfig, and CircuitBreaker are the main abstractions we use.

CircuitBreakerRegistry is a factory for creating and managing CircuitBreaker objects.

CircuitBreakerConfig encapsulates all the configurations in the previous section. Each CircuitBreaker object is associated with a CircuitBreakerConfig.

The first step is to create a CircuitBreakerConfig:

CircuitBreakerConfig config = CircuitBreakerConfig.ofDefaults();

This creates a CircuitBreakerConfig with the following default values:

Configuration                              Default value
slidingWindowType                          COUNT_BASED
failureRateThreshold                       50%
slowCallRateThreshold                      100%
slowCallDurationThreshold                  60s
minimumNumberOfCalls                       100
permittedNumberOfCallsInHalfOpenState      10
maxWaitDurationInHalfOpenState             0s

Count-based circuit breaker

Suppose we want the circuit breaker to open when 70% of the last 10 calls fail:

CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.COUNT_BASED)
  .slidingWindowSize(10)
  .failureRateThreshold(70.0f)
  .build();

Then we use this configuration to create a CircuitBreaker:

CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
CircuitBreaker circuitBreaker = registry.circuitBreaker("flightSearchService");

Now let's express our code to run a flight search as a Supplier and decorate it with a circuit breaker:

Supplier<List<Flight>> flightsSupplier =
  () -> service.searchFlights(request);
Supplier<List<Flight>> decoratedFlightsSupplier =
  circuitBreaker.decorateSupplier(flightsSupplier);

Finally, let's call the decorated operation a few times to understand how the circuit breaker works. We can use CompletableFuture to simulate concurrent flight search requests from users:

for (int i = 0; i < 20; i++) {
  try {
    System.out.println(decoratedFlightsSupplier.get());
  } catch (Exception e) {
    // exception handling
  }
}

The output shows the first three flight searches succeeding, followed by seven failures. At that point, the circuit breaker opens and throws a CallNotPermittedException for subsequent calls:

Searching for flights; current time = 12:01:12 884
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 12:01:12 954
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 12:01:12 957
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 12:01:12 958
io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Error occurred during flight search
... stack trace omitted ...
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
... other lines omitted ...
io.reflectoring.resilience4j.circuitbreaker.Examples.countBasedSlidingWindow_FailedCalls(Examples.java:56)
  at io.reflectoring.resilience4j.circuitbreaker.Examples.main(Examples.java:229)

Now, suppose we want the circuit breaker to open if 70% of the last 10 calls took 2 seconds or more to complete:

CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.COUNT_BASED)
  .slidingWindowSize(10)
  .slowCallRateThreshold(70.0f)
  .slowCallDurationThreshold(Duration.ofSeconds(2))
  .build();

The timestamps in the sample output show requests consistently taking 2 seconds to complete. After 7 slow responses, the circuit breaker opens and does not permit further calls:

Searching for flights; current time = 12:26:27 901
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 12:26:29 953
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 12:26:31 957
Flight search successful
... other lines omitted ...
Searching for flights; current time = 12:26:43 966
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
... stack trace omitted ...
        at io.reflectoring.resilience4j.circuitbreaker.Examples.main(Examples.java:231)
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
... stack trace omitted ...
        at io.reflectoring.resilience4j.circuitbreaker.Examples.main(Examples.java:231)

Usually, we would configure a single circuit breaker with both failure rate and slow call rate thresholds:

CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.COUNT_BASED)
  .slidingWindowSize(10)
  .failureRateThreshold(70.0f)
  .slowCallRateThreshold(70.0f)
  .slowCallDurationThreshold(Duration.ofSeconds(2))
  .build();

Time-based circuit breaker

Suppose we want the circuit breaker to open when 70% of the requests in the past 10 seconds failed:

CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.TIME_BASED)
  .slidingWindowSize(10)
  .minimumNumberOfCalls(10)
  .failureRateThreshold(70.0f)
  .build();

We create the CircuitBreaker, express the flight search call as a Supplier<List<Flight>>, and decorate it using the CircuitBreaker just as we did in the previous section.

Here is sample output after calling the decorated operation a few times:

Start time: 18:51:01 552
Searching for flights; current time = 18:51:01 582
Flight search successful
[Flight{flightNumber='XY 765', ... }]
... other lines omitted ...
Searching for flights; current time = 18:51:01 631
io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Error occurred during flight search
... stack trace omitted ...
Searching for flights; current time = 18:51:01 632
io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Error occurred during flight search
... stack trace omitted ...
Searching for flights; current time = 18:51:01 633
... other lines omitted ...
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
... other lines omitted ...

The first three requests succeeded and the next seven failed. At that point, the circuit breaker opened and subsequent requests failed with a CallNotPermittedException.

Now, suppose we want the circuit breaker to open if 70% of the calls in the past 10 seconds took 1 second or more to complete:

CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.TIME_BASED)
  .minimumNumberOfCalls(10)
  .slidingWindowSize(10)
  .slowCallRateThreshold(70.0f)
  .slowCallDurationThreshold(Duration.ofSeconds(1))
  .build();

The timestamps in the sample output show requests consistently taking 1 second to complete. After 10 requests (minimumNumberOfCalls), the circuit breaker determines that 70% of the previous requests took 1 second or more, and it opens the circuit:

Start time: 19:06:37 957
Searching for flights; current time = 19:06:37 979
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 19:06:39 066
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 19:06:40 070
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 19:06:41 070
... other lines omitted ...
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
... stack trace omitted ...

Usually, we would configure a time-based circuit breaker with both failure rate and slow call rate thresholds.
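Mirroring the combined count-based example earlier, such a time-based configuration might look like this (the values are illustrative):

```java
CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.TIME_BASED)
  .slidingWindowSize(10)                              // window of 10 seconds
  .minimumNumberOfCalls(10)
  .failureRateThreshold(70.0f)                        // open on 70% failures...
  .slowCallRateThreshold(70.0f)                       // ...or 70% slow calls
  .slowCallDurationThreshold(Duration.ofSeconds(2))
  .build();
```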

Specifying the wait duration in the open state

Suppose we want the circuit breaker to wait 10 seconds when it is in the open state, then transition to the half-open state and let a few requests through to the remote service:

CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.TIME_BASED)
  .slidingWindowSize(10)
  .minimumNumberOfCalls(10)
  .failureRateThreshold(70.0f)
  .slowCallRateThreshold(70.0f)
  .slowCallDurationThreshold(Duration.ofSeconds(2))
  .waitDurationInOpenState(Duration.ofSeconds(10))
  .build();

The timestamps in the sample output show the circuit breaker transitioning to the open state initially, blocking a few calls for the next 10 seconds, and then changing to the half-open state. Later, consistently successful responses in the half-open state switch it back to the closed state:

Searching for flights; current time = 20:55:58 735
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 20:55:59 812
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 20:56:00 816
... other lines omitted ...
io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Flight search failed
    at
... stack trace omitted ...
2020-12-13T20:56:03.850115+05:30: CircuitBreaker 'flightSearchService' changed state from CLOSED to OPEN
2020-12-13T20:56:04.851700+05:30: CircuitBreaker 'flightSearchService' recorded a call which was not permitted.
2020-12-13T20:56:05.852220+05:30: CircuitBreaker 'flightSearchService' recorded a call which was not permitted.
2020-12-13T20:56:06.855338+05:30: CircuitBreaker 'flightSearchService' recorded a call which was not permitted.
... other similar lines omitted ...
2020-12-13T20:56:12.862362+05:30: CircuitBreaker 'flightSearchService' recorded a call which was not permitted.
2020-12-13T20:56:13.865436+05:30: CircuitBreaker 'flightSearchService' changed state from OPEN to HALF_OPEN
Searching for flights; current time = 20:56:13 865
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
... other similar lines omitted ...
2020-12-13T20:56:16.877230+05:30: CircuitBreaker 'flightSearchService' changed state from HALF_OPEN to CLOSED
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 20:56:17 879
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
... other similar lines omitted ...

Specifying a fallback method

A common pattern when using circuit breakers is to specify a fallback method to be called when the circuit is open. The fallback method can provide some default value or behavior for the remote call that was not permitted.

We can set this up using the Decorators utility class. Decorators is a builder from the resilience4j-all module with methods like withCircuitBreaker(), withRetry(), and withRateLimiter() to help apply multiple Resilience4j decorators to a Supplier, Function, etc.

We will use its withFallback() method to return flight search results from a local cache when the circuit breaker is open and throws a CallNotPermittedException:

Supplier<List<Flight>> flightsSupplier = () -> service.searchFlights(request);
Supplier<List<Flight>> decorated = Decorators
  .ofSupplier(flightsSupplier)
  .withCircuitBreaker(circuitBreaker)
  .withFallback(Arrays.asList(CallNotPermittedException.class),
                e -> this.getFlightSearchResultsFromCache(request))
  .decorate();

The following sample output shows search results being returned from the cache after the circuit breaker opens:

Searching for flights; current time = 22:08:29 735
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 22:08:29 854
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 22:08:29 855
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 22:08:29 855
2020-12-13T22:08:29.856277+05:30: CircuitBreaker 'flightSearchService' recorded an error: 'io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Error occurred during flight search'. Elapsed time: 0 ms
Searching for flights; current time = 22:08:29 912
... other lines omitted ...
2020-12-13T22:08:29.926691+05:30: CircuitBreaker 'flightSearchService' changed state from CLOSED to OPEN
Returning flight search results from cache
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Returning flight search results from cache
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
... other lines omitted ...

Reducing the information in the stack trace

Whenever the circuit breaker is open, it throws a CallNotPermittedException:

io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
    at io.github.resilience4j.circuitbreaker.CallNotPermittedException.createCallNotPermittedException(CallNotPermittedException.java:48)
... other lines in stack trace omitted ...
at io.reflectoring.resilience4j.circuitbreaker.Examples.timeBasedSlidingWindow_SlowCalls(Examples.java:169)
    at io.reflectoring.resilience4j.circuitbreaker.Examples.main(Examples.java:263)

Apart from the first line, the other lines in the stack trace do not add much value. If the CallNotPermittedException occurs multiple times, these stack trace lines will repeat in our log files.

We can reduce the amount of information generated in the stack trace by setting the writableStackTraceEnabled() configuration to false:

CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.COUNT_BASED)
  .slidingWindowSize(10)
  .failureRateThreshold(70.0f)
  .writableStackTraceEnabled(false)
  .build();

Now, when CallNotPermittedException occurs, there is only one line in the stack trace:

Searching for flights; current time = 20:29:24 476
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 20:29:24 540
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
... other lines omitted ...
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
...

Other useful methods

Similar to the Retry module, CircuitBreaker also has methods like ignoreExceptions() and recordExceptions() that let us specify which exceptions the CircuitBreaker should ignore and which ones it should consider when tracking the results of calls.

For example, we may want to ignore a "seats unavailable" exception from the remote flight service; in this case, we don't really want to open the circuit.
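As a sketch, ignoring a business exception while recording infrastructure failures might look like this (SeatsUnavailableException is a hypothetical name for the airline's business exception):

```java
CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .ignoreExceptions(SeatsUnavailableException.class)           // hypothetical business exception: not a failure
  .recordExceptions(IOException.class, TimeoutException.class) // treat these as failures
  .build();
```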

Similar to the other Resilience4j modules we have seen, the CircuitBreaker also provides additional methods like decorateCheckedSupplier(), decorateCompletionStage(), decorateRunnable(), and decorateConsumer(), so we can provide our code in constructs other than a Supplier.

Circuit breaker events

CircuitBreaker has an EventPublisher that can generate the following types of events:

  • CircuitBreakerOnSuccessEvent,
  • CircuitBreakerOnErrorEvent,
  • CircuitBreakerOnStateTransitionEvent,
  • CircuitBreakerOnResetEvent,
  • CircuitBreakerOnIgnoredErrorEvent,
  • CircuitBreakerOnCallNotPermittedEvent,
  • CircuitBreakerOnFailureRateExceededEvent, and
  • CircuitBreakerOnSlowCallRateExceededEvent.

We can listen to these events and record them, for example:

circuitBreaker.getEventPublisher()
  .onCallNotPermitted(e -> System.out.println(e.toString()));
circuitBreaker.getEventPublisher()
  .onError(e -> System.out.println(e.toString()));
circuitBreaker.getEventPublisher()
  .onFailureRateExceeded(e -> System.out.println(e.toString()));
circuitBreaker.getEventPublisher().onStateTransition(e -> System.out.println(e.toString()));

The following is the log output of the example:

2020-12-13T22:25:52.972943+05:30: CircuitBreaker 'flightSearchService' recorded an error: 'io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Error occurred during flight search'. Elapsed time: 0 ms
Searching for flights; current time = 22:25:52 973
... other lines omitted ...
2020-12-13T22:25:52.974448+05:30: CircuitBreaker 'flightSearchService' exceeded failure rate threshold. Current failure rate: 70.0
2020-12-13T22:25:52.984300+05:30: CircuitBreaker 'flightSearchService' changed state from CLOSED to OPEN
2020-12-13T22:25:52.985057+05:30: CircuitBreaker 'flightSearchService' recorded a call which was not permitted.
... other lines omitted ...

CircuitBreaker metrics

CircuitBreaker exposes many metrics. These are some of the important ones:

  • Total number of successful, failed, or ignored calls (resilience4j.circuitbreaker.calls)
  • Circuit breaker state (resilience4j.circuitbreaker.state)
  • Circuit breaker failure rate (resilience4j.circuitbreaker.failure.rate)
  • Total number of calls that were not permitted (resilience4j.circuitbreaker.not.permitted.calls)
  • Circuit breaker slow call rate (resilience4j.circuitbreaker.slow.call.rate)

First, we create CircuitBreakerConfig, CircuitBreakerRegistry, and CircuitBreaker as usual. Then, we create a MeterRegistry and bind the CircuitBreakerRegistry to it:

MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(registry)
  .bindTo(meterRegistry);

After running a few decorated operations, we display the captured metrics. Here is some sample output:

The number of slow failed calls which were slower than a certain threshold - resilience4j.circuitbreaker.slow.calls: 0.0
The states of the circuit breaker - resilience4j.circuitbreaker.state: 0.0, state: metrics_only
Total number of not permitted calls - resilience4j.circuitbreaker.not.permitted.calls: 0.0
The slow call of the circuit breaker - resilience4j.circuitbreaker.slow.call.rate: -1.0
The states of the circuit breaker - resilience4j.circuitbreaker.state: 0.0, state: half_open
Total number of successful calls - resilience4j.circuitbreaker.calls: 0.0, kind: successful
The failure rate of the circuit breaker - resilience4j.circuitbreaker.failure.rate: -1.0

In a real application, we would export the data to a monitoring system periodically and analyze it on a dashboard.

Conclusion

In this article, we learned how to use the circuit breaker module of Resilience4j to pause making a request to a remote service when it returns an error. We learned why this is important and saw some practical examples of how to configure it.

A complete application illustrating these ideas is available on GitHub.

This article is translated from: Implementing a Circuit Breaker with Resilience4j - Reflectoring


Posted on Wed, 01 Dec 2021 22:50:16 -0500 by bodge