Fault isolation using Resilience4j framework in Java projects

So far in this series, we have learned about Resilience4j and its [Retry](https://icodewalker.com/blog/...), [RateLimiter](https://icodewalker.com/blog/...), and [TimeLimiter](https://icodewalker.com/blog/...) modules. In this article, we will explore the Bulkhead module. We'll see what problems it solves, when and how to use it, and look at some examples.

Code example

This article is accompanied by a working code example [on GitHub](https://github.com/thombergs/...).

What is Resilience4j?

Please refer to the description in the previous article for a quick introduction to how [Resilience4j](https://icodewalker.com/blog/...) works in general.

What is fault isolation?

A few years ago, we encountered a production problem. One of the servers stopped responding to the health check and the load balancer took the server out of the pool.

Just as we began to investigate this problem, there was a second alert - another server had stopped responding to health checks and was removed from the pool.

After a few minutes, every server had stopped responding to health checks and our service was completely down.

We were using Redis to cache some data for several features supported by the application. As we found out later, the Redis cluster was having problems at the same time and had stopped accepting new connections. We were using the Jedis library to connect to Redis, and the default behavior of that library was to block the calling thread indefinitely until a connection was established.

Our service was hosted on Tomcat, whose default request-handling thread pool size is 200 threads. So every request that went through a code path connecting to Redis ended up blocking a thread indefinitely.

Within minutes, all 2000 threads across the cluster were blocked indefinitely. There were not even any free threads left to respond to the load balancer's health checks.

The service itself supported several features, and not all of them needed the Redis cache. But when a problem occurred in this one area, it ended up affecting the entire service.

This is the problem that fault isolation, also known as the bulkhead pattern, solves: it prevents a problem in one area of a service from affecting the whole service.

Although what happened to our service was an extreme example, it shows how a slow upstream dependency can affect unrelated areas of the calling service.

If we had set a limit of, say, 20 concurrent requests to Redis on each server instance, only those threads would have been affected when the Redis connection problems occurred. The remaining request-handling threads could have continued to serve other requests.

The idea behind fault isolation is to set a limit on the number of concurrent calls we make to a remote service. We treat calls to different remote services as separate, isolated pools, each with its own limit on the number of calls that can be made at the same time.
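
The idea of isolated pools can be sketched with nothing but the JDK. The following is a minimal illustration and not how Resilience4j is implemented: the class name IsolatedCallLimiter, the service names, and the fail-fast behavior are all assumptions made for this example. Each remote service gets its own semaphore, so running out of permits for one service does not affect calls to another.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative only: one semaphore per remote dependency, so that using up the
// permits for one dependency leaves calls to the others unaffected.
final class IsolatedCallLimiter {
  private final Map<String, Semaphore> permitsByService = new ConcurrentHashMap<>();
  private final int maxConcurrentCallsPerService;

  IsolatedCallLimiter(int maxConcurrentCallsPerService) {
    this.maxConcurrentCallsPerService = maxConcurrentCallsPerService;
  }

  // Runs the call if a permit for the named service is free; otherwise fails fast.
  <T> T call(String serviceName, Supplier<T> remoteCall) {
    Semaphore permits = permitsByService.computeIfAbsent(
        serviceName, name -> new Semaphore(maxConcurrentCallsPerService));
    if (!permits.tryAcquire()) {
      throw new IllegalStateException("Too many concurrent calls to " + serviceName);
    }
    try {
      return remoteCall.get();
    } finally {
      permits.release(); // always hand the permit back, even if the call throws
    }
  }

  public static void main(String[] args) {
    IsolatedCallLimiter limiter = new IsolatedCallLimiter(1);
    // While the only "redis" permit is held, another "redis" call is rejected,
    // but a call to "flights" still goes through.
    String outcome = limiter.call("redis", () -> {
      String flights = limiter.call("flights", () -> "flights ok");
      try {
        limiter.call("redis", () -> "not reached");
        return flights + ", redis unexpectedly allowed";
      } catch (IllegalStateException e) {
        return flights + ", redis rejected";
      }
    });
    System.out.println(outcome); // prints "flights ok, redis rejected"
  }
}
```

The nested call in main is only a deterministic way to hold a permit while trying to take another one. In a real service the competing calls would come from different request threads.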

The term bulkhead itself comes from its use in ships where the bottom of the ship is divided into separate parts. If there is a crack and water begins to flow in, only that part will be filled with water. This can prevent the whole ship from sinking.

Resilience4j bulkhead concepts

resilience4j-bulkhead works similarly to the other Resilience4j modules. We provide it with the code we want to execute as a functional construct, such as a lambda expression that makes a remote call or a Supplier of some value retrieved from a remote service, and the bulkhead decorates it with code that controls the number of concurrent calls.

Resilience4j provides two types of bulkheads: SemaphoreBulkhead and ThreadPoolBulkhead.

SemaphoreBulkhead internally uses java.util.concurrent.Semaphore to control the number of concurrent calls and executes our code on the current thread.

ThreadPoolBulkhead uses a thread from a thread pool to execute our code. It internally uses a java.util.concurrent.ArrayBlockingQueue and a java.util.concurrent.ThreadPoolExecutor to control the number of concurrent calls.

SemaphoreBulkhead

Let's look at the configurations associated with the semaphore bulkhead and what they mean.

maxConcurrentCalls determines the maximum number of concurrent calls we can make to the remote service. We can think of this value as the number of permits the semaphore is initialized with.

Any thread that attempts to call the remote service beyond this limit either gets a BulkheadFullException immediately or waits for some time for another thread to release a permit. This is determined by the maxWaitDuration value.

When multiple threads are waiting for permits, the fairCallHandlingEnabled configuration determines whether the waiting threads acquire permits in first-in, first-out order.
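
To make these three settings concrete, here is a rough JDK-only sketch of how a maximum number of calls, a maximum wait, and fairness map onto java.util.concurrent.Semaphore. TimedSemaphoreLimiter is a hypothetical class written for this article, and it throws IllegalStateException where the real module would throw BulkheadFullException. It is not the library's implementation.

```java
import java.time.Duration;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class TimedSemaphoreLimiter {
  private final Semaphore permits;
  private final Duration maxWait;

  TimedSemaphoreLimiter(int maxConcurrentCalls, Duration maxWait, boolean fair) {
    // With fair = true, waiting threads receive permits in first-in, first-out order.
    this.permits = new Semaphore(maxConcurrentCalls, fair);
    this.maxWait = maxWait;
  }

  <T> T call(Supplier<T> remoteCall) {
    boolean acquired;
    try {
      // Waits up to maxWait for a permit; with a zero duration it does not wait at all.
      acquired = permits.tryAcquire(maxWait.toMillis(), TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for a permit", e);
    }
    if (!acquired) {
      throw new IllegalStateException("No permit available within " + maxWait);
    }
    try {
      return remoteCall.get();
    } finally {
      permits.release();
    }
  }

  int availablePermits() {
    return permits.availablePermits();
  }

  public static void main(String[] args) {
    TimedSemaphoreLimiter limiter =
        new TimedSemaphoreLimiter(1, Duration.ofMillis(100), true);
    limiter.call(() -> {
      try {
        limiter.call(() -> "not reached"); // the only permit is already taken
      } catch (IllegalStateException e) {
        System.out.println("Rejected: " + e.getMessage());
      }
      return null;
    });
    System.out.println("Permits free afterwards: " + limiter.availablePermits());
  }
}
```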

Finally, the writableStackTraceEnabled configuration lets us reduce the amount of information in the stack trace when a BulkheadFullException occurs. This is useful because otherwise our logs could fill up with many near-identical stack traces when the exception occurs repeatedly. When reading logs, it is usually enough to know that a BulkheadFullException occurred.
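
The effect of such a flag can be reproduced with the JDK alone, using the four-argument RuntimeException constructor that lets a subclass turn off stack trace recording. QuietRejectionException below is a hypothetical class for illustration. Whether Resilience4j relies on this same constructor is an assumption here, not something this article confirms.

```java
final class QuietRejectionException extends RuntimeException {
  QuietRejectionException(String message, boolean writableStackTrace) {
    // Arguments: message, cause, enableSuppression, writableStackTrace.
    super(message, null, false, writableStackTrace);
  }

  public static void main(String[] args) {
    RuntimeException quiet = new QuietRejectionException("bulkhead full", false);
    RuntimeException verbose = new QuietRejectionException("bulkhead full", true);
    // With writableStackTrace = false, no stack frames are recorded at all.
    System.out.println(quiet.getStackTrace().length);
    // With writableStackTrace = true, the usual frames are captured.
    System.out.println(verbose.getStackTrace().length > 0);
  }
}
```

When such an exception is logged, only its class name and message appear, which is the single-line behavior described above.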

ThreadPoolBulkhead

coreThreadPoolSize, maxThreadPoolSize, keepAliveDuration, and queueCapacity are the main configurations associated with ThreadPoolBulkhead. ThreadPoolBulkhead uses these configurations internally to construct a ThreadPoolExecutor.

The internal ThreadPoolExecutor executes incoming tasks using one of the available free threads. If no thread is free to execute an incoming task, the task is queued and executed later when a thread becomes available. If the queueCapacity has been reached, the remote call is rejected with a BulkheadFullException.
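
The queue-then-reject behavior can be observed directly on a plain ThreadPoolExecutor with a bounded ArrayBlockingQueue. The sketch below uses only JDK classes. BoundedPoolDemo and its parameters are names invented for this example, and the JDK signals rejection with RejectedExecutionException rather than BulkheadFullException.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPoolDemo {
  // Submits tasks that all block until released and returns how many were rejected.
  static int countRejections(int coreThreads, int maxThreads, int queueCapacity, int tasks) {
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
        coreThreads, maxThreads, 20, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueCapacity));
    CountDownLatch release = new CountDownLatch(1);
    int rejected = 0;
    try {
      for (int i = 0; i < tasks; i++) {
        try {
          executor.execute(() -> awaitQuietly(release));
        } catch (RejectedExecutionException e) {
          rejected++; // all threads are busy and the queue is full
        }
      }
    } finally {
      release.countDown(); // let the blocked tasks finish
      executor.shutdown();
    }
    return rejected;
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    // 2 threads plus 1 queue slot can hold 3 blocked tasks; the 4th is rejected.
    System.out.println(countRejections(1, 2, 1, 4)); // prints 1
  }
}
```

One detail worth knowing: a ThreadPoolExecutor starts threads beyond the core size only once the queue is full, so with a core size of 1 the second task is queued before a second thread is created.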

ThreadPoolBulkhead also has a writableStackTraceEnabled configuration to control the amount of information in the stack trace of BulkheadFullException.

Using Resilience4j bulkhead module

Let's see how to use the various features available in the resilience4j-bulkhead module.

We will use the same example as the previous articles in this series. Suppose we are setting up a website for an airline to allow its customers to search and book flights. Our service talks to a remote service encapsulated by the FlightSearchService class.

SemaphoreBulkhead

When using the semaphore-based bulkhead, BulkheadRegistry, BulkheadConfig, and Bulkhead are the main abstractions we work with.

BulkheadRegistry is a factory for creating and managing Bulkhead objects.

BulkheadConfig encapsulates the maxConcurrentCalls, maxWaitDuration, writableStackTraceEnabled, and fairCallHandlingEnabled configurations. Each Bulkhead object is associated with a BulkheadConfig.

The first step is to create a BulkheadConfig:

BulkheadConfig config = BulkheadConfig.ofDefaults();

This creates a BulkheadConfig with default values of maxConcurrentCalls(25), maxWaitDuration(0s), writableStackTraceEnabled(true), and fairCallHandlingEnabled(true).

Suppose we want to limit the number of concurrent calls to 2, and we are willing to let a thread wait up to 2 seconds to acquire a permit:

BulkheadConfig config = BulkheadConfig.custom()
  .maxConcurrentCalls(2)
  .maxWaitDuration(Duration.ofSeconds(2))
  .build();

Then we create a Bulkhead:

BulkheadRegistry registry = BulkheadRegistry.of(config);

Bulkhead bulkhead = registry.bulkhead("flightSearchService");

Now let's express our code to run a flight search as a Supplier and decorate it using the bulkhead:

Supplier<List<Flight>> flightsSupplier =
  () -> service.searchFlightsTakingOneSecond(request);
Supplier<List<Flight>> decoratedFlightsSupplier =
  Bulkhead.decorateSupplier(bulkhead, flightsSupplier);

Finally, let's call the decorated operation a few times to understand how the bulkhead works. We can use CompletableFuture to simulate concurrent flight search requests from users:

for (int i=0; i<4; i++) {
  CompletableFuture
    .supplyAsync(decoratedFlightsSupplier)
    .thenAccept(flights -> System.out.println("Received results"));
}

The timestamp and thread name in the output show that among the four concurrent requests, the first two requests pass immediately:

Searching for flights; current time = 11:42:13 187; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 11:42:13 187; current thread = ForkJoinPool.commonPool-worker-5
Flight search successful at 11:42:13 226
Flight search successful at 11:42:13 226
Received results
Received results
Searching for flights; current time = 11:42:14 239; current thread = ForkJoinPool.commonPool-worker-9
Searching for flights; current time = 11:42:14 239; current thread = ForkJoinPool.commonPool-worker-7
Flight search successful at 11:42:14 239
Flight search successful at 11:42:14 239
Received results
Received results

The third and fourth requests were able to acquire permits only 1 second later, after the previous requests had completed.

If a thread is not able to acquire a permit within the 2s maxWaitDuration we specified, a BulkheadFullException is thrown:

Caused by: io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
    at io.github.resilience4j.bulkhead.BulkheadFullException.createBulkheadFullException(BulkheadFullException.java:49)
    at io.github.resilience4j.bulkhead.internal.SemaphoreBulkhead.acquirePermission(SemaphoreBulkhead.java:164)
    at io.github.resilience4j.bulkhead.Bulkhead.lambda$decorateSupplier$5(Bulkhead.java:194)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    ... 6 more

Apart from the first line, the other lines in the stack trace do not add much value. If the BulkheadFullException occurs multiple times, these stack trace lines are repeated in our log files.

We can reduce the amount of information generated in the stack trace by setting the writableStackTraceEnabled configuration to false:

BulkheadConfig config = BulkheadConfig.custom()
  .maxConcurrentCalls(2)
  .maxWaitDuration(Duration.ofSeconds(1))
  .writableStackTraceEnabled(false)
  .build();

Now, when the BulkheadFullException occurs, there is only one line in the stack trace:

Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-5
io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
Flight search successful at 12:27:58 699
Flight search successful at 12:27:58 699
Received results
Received results

As with the other Resilience4j modules we have seen, Bulkhead also provides additional methods such as decorateCheckedSupplier(), decorateCompletionStage(), decorateRunnable(), and decorateConsumer(), so we can provide our code in constructs other than a Supplier.

ThreadPoolBulkhead

When using the thread pool-based bulkhead, ThreadPoolBulkheadRegistry, ThreadPoolBulkheadConfig, and ThreadPoolBulkhead are the main abstractions we work with.

ThreadPoolBulkheadRegistry is a factory for creating and managing ThreadPoolBulkhead objects.

ThreadPoolBulkheadConfig encapsulates the coreThreadPoolSize, maxThreadPoolSize, keepAliveDuration, and queueCapacity configurations. Each ThreadPoolBulkhead object is associated with a ThreadPoolBulkheadConfig.

The first step is to create a ThreadPoolBulkheadConfig:

ThreadPoolBulkheadConfig config =
  ThreadPoolBulkheadConfig.ofDefaults();

This creates a ThreadPoolBulkheadConfig with default values of coreThreadPoolSize (number of available processors - 1), maxThreadPoolSize (number of available processors), keepAliveDuration (20ms), and queueCapacity (100).

Suppose we want to limit the number of concurrent calls to 2:

ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .build();

Then we create a ThreadPoolBulkhead:

ThreadPoolBulkheadRegistry registry = ThreadPoolBulkheadRegistry.of(config);
ThreadPoolBulkhead bulkhead = registry.bulkhead("flightSearchService");

Now let's express our code to run the flight search as a Supplier and decorate it with bulkhead:

Supplier<List<Flight>> flightsSupplier =
  () -> service.searchFlightsTakingOneSecond(request);
Supplier<CompletionStage<List<Flight>>> decoratedFlightsSupplier =
  ThreadPoolBulkhead.decorateSupplier(bulkhead, flightsSupplier);

Unlike the decorateSupplier() method of the semaphore-based bulkhead, which returns a Supplier<List<Flight>>, ThreadPoolBulkhead.decorateSupplier() returns a Supplier<CompletionStage<List<Flight>>>. This is because ThreadPoolBulkhead does not execute the code synchronously on the current thread.

Finally, let's call the decorated operation a few times to understand how the bulkhead works:

for (int i=0; i<3; i++) {
  decoratedFlightsSupplier
    .get()
    .whenComplete((r,t) -> {
      if (r != null) {
        System.out.println("Received results");
      }
      if (t != null) {
        t.printStackTrace();
      }
    });
}

The timestamp and thread name in the output show that although the first two requests are executed immediately, the third request has been queued and will be executed later by one of the released threads:

Searching for flights; current time = 16:15:00 097; current thread = bulkhead-flightSearchService-1
Searching for flights; current time = 16:15:00 097; current thread = bulkhead-flightSearchService-2
Flight search successful at 16:15:00 136
Flight search successful at 16:15:00 135
Received results
Received results
Searching for flights; current time = 16:15:01 151; current thread = bulkhead-flightSearchService-2
Flight search successful at 16:15:01 151
Received results

If there are no free threads and no capacity left in the queue, a BulkheadFullException is thrown:

Exception in thread "main" io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
 at io.github.resilience4j.bulkhead.BulkheadFullException.createBulkheadFullException(BulkheadFullException.java:64)
 at io.github.resilience4j.bulkhead.internal.FixedThreadPoolBulkhead.submit(FixedThreadPoolBulkhead.java:157)
... other lines omitted ...

We can use the writableStackTraceEnabled configuration to reduce the amount of information generated in the stack trace:

ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .writableStackTraceEnabled(false)
  .build();

Now, when the BulkheadFullException occurs, there is only one line in the stack trace:

Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-5
io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
Flight search successful at 12:27:58 699
Flight search successful at 12:27:58 699
Received results
Received results

Context propagation

Sometimes we store data in a ThreadLocal variable and read it in different areas of the code. We do this to avoid explicitly passing data as parameters between method chains, especially when the value is not directly related to the core business logic we are implementing.

For example, we might want to record the current user ID or transaction ID or a request tracking ID in each log statement to make it easier to search the log. Using ThreadLocal is a useful technique for such scenarios.

When using ThreadPoolBulkhead, because our code is not executed on the current thread, the data we store in the ThreadLocal variable will not be available in other threads.

Let's look at an example to understand this problem. First, we define a RequestTrackingIdHolder class, a wrapper class around ThreadLocal:

class RequestTrackingIdHolder {
  static ThreadLocal<String> threadLocal = new ThreadLocal<>();

  static String getRequestTrackingId() {
    return threadLocal.get();
  }

  static void setRequestTrackingId(String id) {
    if (threadLocal.get() != null) {
      threadLocal.set(null);
      threadLocal.remove();
    }
    threadLocal.set(id);
  }

  static void clear() {
    threadLocal.set(null);
    threadLocal.remove();
  }
}

The static methods make it easy to set and get the value stored on the ThreadLocal. Next, we set a request tracking ID before calling the bulkhead-decorated flight search operation:

for (int i=0; i<2; i++) {
  String trackingId = UUID.randomUUID().toString();
  System.out.println("Setting trackingId " + trackingId + " on parent, main thread before calling flight search");
  RequestTrackingIdHolder.setRequestTrackingId(trackingId);
  decoratedFlightsSupplier
    .get()
    .whenComplete((r,t) -> {
        // other lines omitted
    });
}

The sample output shows that this value is not available in the threads managed by the bulkhead:

Setting trackingId 98ff99df-466a-47f7-88f7-5e31fc8fcb6b on parent, main thread before calling flight search
Setting trackingId 6b98d73c-a590-4a20-b19d-c85fea783caf on parent, main thread before calling flight search
Searching for flights; current time = 19:53:53 799; current thread = bulkhead-flightSearchService-1; Request Tracking Id = null
Flight search successful at 19:53:53 824
Received results
Searching for flights; current time = 19:53:54 836; current thread = bulkhead-flightSearchService-1; Request Tracking Id = null
Flight search successful at 19:53:54 836
Received results

To solve this problem, ThreadPoolBulkhead provides a ContextPropagator. ContextPropagator is an abstraction for retrieving, copying, and cleaning up values across thread boundaries. It defines an interface with methods to get a value from the current thread (retrieve()), copy it to the new executing thread (copy()), and finally clean up on the executing thread (clear()).
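
Before turning to the Resilience4j interface, the three steps can be illustrated with plain JDK classes. This is a conceptual sketch only: ContextCopyingDemo, TRACKING_ID, and the helper methods are hypothetical names, and the wrapper shown here is not how ThreadPoolBulkhead wires propagators internally.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class ContextCopyingDemo {
  static final ThreadLocal<String> TRACKING_ID = new ThreadLocal<>();

  // Wraps a task so that the caller's tracking ID travels with it.
  static <T> Supplier<T> propagating(Supplier<T> task) {
    String captured = TRACKING_ID.get();   // retrieve: runs on the calling thread
    return () -> {
      TRACKING_ID.set(captured);           // copy: runs on the executing thread
      try {
        return task.get();
      } finally {
        TRACKING_ID.remove();              // clear: runs on the executing thread
      }
    };
  }

  // Returns the tracking ID seen on a worker thread, with or without propagation.
  static String readOnWorker(String id, boolean propagate) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    TRACKING_ID.set(id);
    try {
      Supplier<String> read = TRACKING_ID::get;
      Supplier<String> task = propagate ? propagating(read) : read;
      return CompletableFuture.supplyAsync(task, pool).join();
    } finally {
      TRACKING_ID.remove();
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(readOnWorker("abc-123", false)); // prints null
    System.out.println(readOnWorker("abc-123", true));  // prints abc-123
  }
}
```

Clearing the value in a finally block matters because pool threads are reused, and a leftover value would otherwise leak into the next task that runs on the same thread.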

Let's implement a RequestTrackingIdPropagator:

class RequestTrackingIdPropagator implements ContextPropagator {
  @Override
  public Supplier<Optional> retrieve() {
    System.out.println("Getting request tracking id from thread: " + Thread.currentThread().getName());
    // ofNullable avoids a NullPointerException when no tracking ID has been set
    return () -> Optional.ofNullable(RequestTrackingIdHolder.getRequestTrackingId());
  }

  @Override
  public Consumer<Optional> copy() {
    return optional -> {
      // Only log and copy when a value was actually retrieved
      optional.ifPresent(s -> {
        System.out.println("Setting request tracking id " + s + " on thread: " + Thread.currentThread().getName());
        RequestTrackingIdHolder.setRequestTrackingId(s.toString());
      });
    };
  }

  @Override
  public Consumer<Optional> clear() {
    return optional -> {
      System.out.println("Clearing request tracking id on thread: " + Thread.currentThread().getName());
      optional.ifPresent(s -> RequestTrackingIdHolder.clear());
    };
  }
}

We provide the ContextPropagator to the ThreadPoolBulkhead by setting it on the ThreadPoolBulkheadConfig:

ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .contextPropagator(new RequestTrackingIdPropagator())
  .build();

Now, the sample output shows that the request tracking ID is available in the thread managed by the bulkhead:

Setting trackingId 71d44cb8-dab6-4222-8945-e7fd023528ba on parent, main thread before calling flight search
Getting request tracking id from thread: main
Setting trackingId 5f9dd084-f2cb-4a20-804b-038828abc161 on parent, main thread before calling flight search
Getting request tracking id from thread: main
Setting request tracking id 71d44cb8-dab6-4222-8945-e7fd023528ba on thread: bulkhead-flightSearchService-1
Searching for flights; current time = 20:07:56 508; current thread = bulkhead-flightSearchService-1; Request Tracking Id = 71d44cb8-dab6-4222-8945-e7fd023528ba
Flight search successful at 20:07:56 538
Clearing request tracking id on thread: bulkhead-flightSearchService-1
Received results
Setting request tracking id 5f9dd084-f2cb-4a20-804b-038828abc161 on thread: bulkhead-flightSearchService-1
Searching for flights; current time = 20:07:57 542; current thread = bulkhead-flightSearchService-1; Request Tracking Id = 5f9dd084-f2cb-4a20-804b-038828abc161
Flight search successful at 20:07:57 542
Clearing request tracking id on thread: bulkhead-flightSearchService-1
Received results

Bulkhead events

Both Bulkhead and ThreadPoolBulkhead have an EventPublisher that generates events of the following types:

  • BulkheadOnCallPermittedEvent,
  • BulkheadOnCallRejectedEvent, and
  • BulkheadOnCallFinishedEvent.

We can listen for these events and log them, for example:

Bulkhead bulkhead = registry.bulkhead("flightSearchService");
bulkhead.getEventPublisher().onCallPermitted(e -> System.out.println(e.toString()));
bulkhead.getEventPublisher().onCallFinished(e -> System.out.println(e.toString()));
bulkhead.getEventPublisher().onCallRejected(e -> System.out.println(e.toString()));

The sample output shows what is logged:

2020-08-26T12:27:39.790435: Bulkhead 'flightSearch' permitted a call.
... other lines omitted ...
2020-08-26T12:27:40.290987: Bulkhead 'flightSearch' rejected a call.
... other lines omitted ...
2020-08-26T12:27:41.094866: Bulkhead 'flightSearch' has finished a call.

Bulkhead metrics

SemaphoreBulkhead

Bulkhead exposes two metrics:

  • the maximum number of available permissions (resilience4j.bulkhead.max.allowed.concurrent.calls), and
  • the number of permissions currently available (resilience4j.bulkhead.available.concurrent.calls).

The max.allowed metric is the same as the maxConcurrentCalls value that we configure on the BulkheadConfig, while the available metric reports how many of those permits are free at the moment.

First, we create the BulkheadConfig, BulkheadRegistry, and Bulkhead as before. Then, we create a MeterRegistry and bind the BulkheadRegistry to it:

MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedBulkheadMetrics.ofBulkheadRegistry(registry)
  .bindTo(meterRegistry);

After running the bulkhead-decorated operation a few times, we display the captured metrics:

Consumer<Meter> meterConsumer = meter -> {
  String desc = meter.getId().getDescription();
  String metricName = meter.getId().getName();
  Double metricValue = StreamSupport.stream(meter.measure().spliterator(), false)
    .filter(m -> m.getStatistic().name().equals("VALUE"))
    .findFirst()
    .map(m -> m.getValue())
    .orElse(0.0);
  System.out.println(desc + " - " + metricName + ": " + metricValue);
};
meterRegistry.forEachMeter(meterConsumer);

Here are some sample outputs:

The maximum number of available permissions - resilience4j.bulkhead.max.allowed.concurrent.calls: 8.0
The number of available permissions - resilience4j.bulkhead.available.concurrent.calls: 3.0

ThreadPoolBulkhead

ThreadPoolBulkhead exposes five metrics:

  • the current length of the queue (resilience4j.bulkhead.queue.depth),
  • the current size of the thread pool (resilience4j.bulkhead.thread.pool.size),
  • the core and maximum sizes of the thread pool (resilience4j.bulkhead.core.thread.pool.size and resilience4j.bulkhead.max.thread.pool.size), and
  • the capacity of the queue (resilience4j.bulkhead.queue.capacity).

First, we create the ThreadPoolBulkheadConfig, ThreadPoolBulkheadRegistry, and ThreadPoolBulkhead as before. Then, we create a MeterRegistry and bind the ThreadPoolBulkheadRegistry to it:

MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedThreadPoolBulkheadMetrics.ofThreadPoolBulkheadRegistry(registry).bindTo(meterRegistry);

After running the bulkhead-decorated operation a few times, we display the captured metrics:

The queue capacity - resilience4j.bulkhead.queue.capacity: 5.0
The queue depth - resilience4j.bulkhead.queue.depth: 1.0
The thread pool size - resilience4j.bulkhead.thread.pool.size: 5.0
The maximum thread pool size - resilience4j.bulkhead.max.thread.pool.size: 5.0
The core thread pool size - resilience4j.bulkhead.core.thread.pool.size: 3.0

In practical application, we will regularly export the data to the monitoring system and analyze it on the dashboard.

Gotchas and good practices when implementing bulkheads

Make the bulkhead a singleton

All calls to a given remote service should go through the same Bulkhead instance. For a given remote service, the Bulkhead must be a singleton.

If we don't enforce this, some areas of our codebase may call the remote service directly, bypassing the Bulkhead. To prevent this, the actual call to the remote service should live in a core, internal layer, and other areas should use the bulkhead decorator exposed by that internal layer.

How can we ensure that a new developer understands this intent in the future? Check out Tom's article, which shows one way of solving such problems by organizing the package structure to make such intents clear. It also shows how to enforce this by codifying the intent in ArchUnit tests.

Combine with other Resilience4j modules

Bulkheads are more effective when used in combination with one or more of the other Resilience4j modules, such as Retry and RateLimiter. For example, if a BulkheadFullException occurs, we may want to retry after some delay.
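
The retry-after-rejection idea can be sketched without any library. The example below is a stand-in, not the Resilience4j Retry module covered earlier in this series: RetryOnRejection, RejectedCallException, and callWithRetry are hypothetical names, and RejectedCallException only plays the role that BulkheadFullException would play in real code.

```java
import java.time.Duration;
import java.util.function.Supplier;

final class RetryOnRejection {
  // Stand-in for a "bulkhead is full" exception; not a Resilience4j class.
  static final class RejectedCallException extends RuntimeException {
    RejectedCallException(String message) {
      super(message);
    }
  }

  // Retries only when the call was rejected, pausing between attempts.
  static <T> T callWithRetry(Supplier<T> guardedCall, int maxAttempts, Duration delay) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        return guardedCall.get();
      } catch (RejectedCallException e) {
        if (attempt >= maxAttempts) {
          throw e; // give up after the last attempt
        }
        pause(delay);
      }
    }
  }

  private static void pause(Duration delay) {
    try {
      Thread.sleep(delay.toMillis());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting to retry", e);
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulates a guarded call that is rejected twice before a permit frees up.
    String result = callWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new RejectedCallException("full");
      }
      return "flights";
    }, 3, Duration.ofMillis(10));
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

Retrying only on the rejection exception keeps genuine failures of the remote call from being retried by accident.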

Conclusion

In this article, we learned how to use the Bulkhead module of Resilience4j to set limits on our concurrent calls to remote services. We learned why this is important and saw some practical examples of how to configure it.

You can play around with a complete application illustrating these ideas using the code [on GitHub](https://github.com/thombergs/...).

This article is translated from: Implementing Bulkhead with Resilience4j - Reflectoring

Tags: Java

Posted on Fri, 26 Nov 2021 04:55:25 -0500 by uluru75