FastThreadLocal beats ThreadLocal: why can it be so fast?

1 Background and principle of FastThreadLocal

Since the JDK already has ThreadLocal, why did netty create FastThreadLocal? What makes FastThreadLocal fast?

This needs to start with jdk ThreadLocal itself. As shown below:

In java threads, each thread has a ThreadLocalMap instance variable (if ThreadLocal is not used, this Map will not be created. It will be created only when a thread accesses a ThreadLocal variable for the first time).

This map resolves hash collisions with linear probing: if the target slot is occupied, it keeps trying the following slots until it finds a free one and inserts the entry there. When collisions are frequent, this hurts efficiency.
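To make the cost of probing concrete, here is a minimal sketch of an open-addressing table with linear probing. It is not the JDK's ThreadLocalMap (which also resizes and cleans up stale entries); the class name LinearProbeTable and its methods are hypothetical, and put() returns the number of occupied slots it had to skip.

```java
// Illustrative open-addressing table with linear probing; not the JDK's ThreadLocalMap.
final class LinearProbeTable {
    private final Object[] keys;
    private final Object[] values;

    LinearProbeTable(int capacity) {
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    /** Inserts or updates a key and returns how many occupied slots were skipped. */
    int put(Object key, Object value) {
        int slot = Math.floorMod(key.hashCode(), keys.length);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[slot] == null || keys[slot].equals(key)) {
                keys[slot] = key;
                values[slot] = value;
                return probes;
            }
            slot = (slot + 1) % keys.length; // collision: try the next slot
        }
        throw new IllegalStateException("table is full");
    }

    /** Looks a key up by walking the same probe sequence that put() used. */
    Object get(Object key) {
        int slot = Math.floorMod(key.hashCode(), keys.length);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[slot] == null) {
                return null; // reached a free slot: the key is absent
            }
            if (keys[slot].equals(key)) {
                return values[slot];
            }
            slot = (slot + 1) % keys.length;
        }
        return null;
    }
}
```

With a capacity of 8, the Integer keys 1, 9 and 17 all map to slot 1, so inserting them in that order skips 0, 1 and 2 occupied slots respectively. Every later lookup of 17 pays the same two extra probes.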

FastThreadLocal (hereinafter ftl) uses a plain array and thereby avoids hash collisions. Specifically, each FastThreadLocal instance is assigned an array index when it is created. The index is allocated from an AtomicInteger, so every FastThreadLocal gets a unique index.
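The idea can be sketched in a few lines of plain Java. This is a simplified model, not netty's code: the class IndexedSlot and its methods are hypothetical, and the per-thread array is passed in explicitly so the example stays self-contained.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the idea: one global index counter, one plain array per thread.
final class IndexedSlot<T> {
    private static final AtomicInteger NEXT_INDEX = new AtomicInteger();

    // Each instance takes the next index exactly once, so no two instances share a slot.
    private final int index = NEXT_INDEX.getAndIncrement();

    int index() {
        return index;
    }

    /** Reads this slot from the given per-thread table: no hashing, no probing. */
    @SuppressWarnings("unchecked")
    T get(Object[] table) {
        return index < table.length ? (T) table[index] : null;
    }

    /** Writes this slot, returning a larger table if the index was out of range. */
    Object[] set(Object[] table, T value) {
        if (index >= table.length) {
            table = Arrays.copyOf(table, index + 1);
        }
        table[index] = value;
        return table;
    }
}
```

Because the index is fixed at construction time, a read is a bounds check plus one array access, regardless of how many other variables exist.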

When the ftl.get() method is called to obtain the value, it is directly obtained from the array and returned, such as return array[index], as shown in the following figure:

2 Source code analysis of the implementation

According to the above diagram, the implementation of ftl involves several classes: InternalThreadLocalMap, FastThreadLocalThread and FastThreadLocal. From the bottom up, let's start with InternalThreadLocalMap.

The inheritance diagram of InternalThreadLocalMap class is as follows:

2.1 Main properties of UnpaddedInternalThreadLocalMap

static final ThreadLocal<InternalThreadLocalMap> slowThreadLocalMap = new ThreadLocal<InternalThreadLocalMap>();
static final AtomicInteger nextIndex = new AtomicInteger();
Object[] indexedVariables;

The indexedVariables array stores the ftl values and is accessed directly by index. nextIndex assigns an index to each ftl instance when it is created. slowThreadLocalMap is used when the current thread is not a FastThreadLocalThread (hereinafter ftlt).

2.2 InternalThreadLocalMap analysis

Main properties of InternalThreadLocalMap:

// Marks an array slot that has not been used yet
public static final Object UNSET = new Object();

/**
 * Records whether a cleaner has been registered for each ftl variable.
 * Brief principle of BitSet:
 * By default a BitSet is backed by a long[] array. Initially its length is 1, that is, there is only long[0], and a long has 64 bits.
 * BitSet.set(1) sets the second bit of long[0] to true, that is, 0000...0010 (64 bits), so long[0] == 2.
 * BitSet.get(1) returns true if the second bit is 1 and false if it is 0.
 * BitSet.set(64) sets bit 65. long[0] is no longer enough, so the array is expanded and long[1] is used.
 * The BitSet stores key-value pairs of the form {index: boolean} so that the cleanup thread is not started more than once for the same FastThreadLocal.
 * A true bit at position index means the cleanup thread has already been started for that FastThreadLocal in this InternalThreadLocalMap.
 */
private BitSet cleanerFlags;

private InternalThreadLocalMap() {
    super(newIndexedVariableTable());
}

private static Object[] newIndexedVariableTable() {
    Object[] array = new Object[32];
    Arrays.fill(array, UNSET);
    return array;
}

This is relatively simple: newIndexedVariableTable() creates an array of length 32, fills it with UNSET, and passes it to the parent class constructor. From then on, ftl values are stored in this array.

Note that the value itself is stored here, not an entry, which differs from the JDK ThreadLocal. That is enough about InternalThreadLocalMap for now; its other methods are covered together with ftl below.
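The BitSet behaviour described in the cleanerFlags comment can be checked directly against java.util.BitSet. The helper below is a small illustration (the class name BitSetWords is hypothetical); it uses BitSet.toLongArray() to expose the underlying words.

```java
import java.util.BitSet;

final class BitSetWords {
    /** Returns the long words backing a BitSet that has exactly the given bits set. */
    static long[] wordsAfterSetting(int... bits) {
        BitSet flags = new BitSet();
        for (int bit : bits) {
            flags.set(bit);
        }
        return flags.toLongArray();
    }
}
```

wordsAfterSetting(1) yields a single word equal to 2, and wordsAfterSetting(1, 64) yields two words, 2 and 1, matching the description of long[0] and long[1].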

2.3 Implementation analysis of ftlt

To give full play to the performance advantages of ftl, it must be used in combination with ftlt, otherwise it will degenerate to the ThreadLocal of jdk. ftlt is relatively simple. The key codes are as follows:

public class FastThreadLocalThread extends Thread {
    // This will be set to true if we have a chance to wrap the Runnable.
    private final boolean cleanupFastThreadLocals;

    private InternalThreadLocalMap threadLocalMap;

    // ... constructors omitted ...

    public final InternalThreadLocalMap threadLocalMap() {
        return threadLocalMap;
    }

    public final void setThreadLocalMap(InternalThreadLocalMap threadLocalMap) {
        this.threadLocalMap = threadLocalMap;
    }
}

The key to ftlt is the threadLocalMap field: ftlt extends java.lang.Thread and holds its own InternalThreadLocalMap. When an ftl variable is accessed later on an ftlt thread, its value is read directly from this InternalThreadLocalMap.

2.4 ftl implementation analysis

The ftl analysis is based on netty-4.1.34. The version is stated explicitly because in this version the call to ObjectCleaner in the cleanup code has been commented out, which differs from earlier versions.

2.4.1 Attributes and instantiation of ftl

private final int index;

public FastThreadLocal() {
    index = InternalThreadLocalMap.nextVariableIndex();
}

The constructor simply assigns a value to the index field. The static method that allocates it is in InternalThreadLocalMap:

public static int nextVariableIndex() {
    int index = nextIndex.getAndIncrement();
    if (index < 0) {
        throw new IllegalStateException("too many thread-local indexed variables");
    }
    return index;
}

It can be seen that each ftl instance obtains the index value in an increasing sequence with step size of 1, which ensures that the length of the array in the InternalThreadLocalMap will not increase abruptly.

2.4.2 get() method implementation analysis

public final V get() {
    InternalThreadLocalMap threadLocalMap = InternalThreadLocalMap.get(); // 1
    Object v = threadLocalMap.indexedVariable(index); // 2
    if (v != InternalThreadLocalMap.UNSET) {
        return (V) v;
    }

    V value = initialize(threadLocalMap); // 3
    registerCleaner(threadLocalMap);  // 4
    return value;
}

1. Let's take a look at how the InternalThreadLocalMap.get() method obtains threadLocalMap:

public static InternalThreadLocalMap get() {
    Thread thread = Thread.currentThread();
    if (thread instanceof FastThreadLocalThread) {
        return fastGet((FastThreadLocalThread) thread);
    } else {
        return slowGet();
    }
}

private static InternalThreadLocalMap fastGet(FastThreadLocalThread thread) {
    InternalThreadLocalMap threadLocalMap = thread.threadLocalMap();
    if (threadLocalMap == null) {
        thread.setThreadLocalMap(threadLocalMap = new InternalThreadLocalMap());
    }
    return threadLocalMap;
}

Because the performance advantages of FastThreadLocal can only be brought into play when combined with FastThreadLocalThread, we mainly focus on the fastGet method. This method directly obtains the threadLocalMap from the ftlt thread. If not, create an InternalThreadLocalMap instance, set it in, and then return.

2. threadLocalMap.indexedVariable(index) is simple: it reads the value directly from the array and returns it:

public Object indexedVariable(int index) {
    Object[] lookup = indexedVariables;
    return index < lookup.length ? lookup[index] : UNSET;
}

3. If the obtained value is not UNSET, it is a valid value and is returned directly. If UNSET, initialize.

initialize(threadLocalMap) method:

private V initialize(InternalThreadLocalMap threadLocalMap) {
    V v = null;
    try {
        v = initialValue();
    } catch (Exception e) {
        PlatformDependent.throwException(e);
    }

    threadLocalMap.setIndexedVariable(index, v); // 3-1
    addToVariablesToRemove(threadLocalMap, this); // 3-2
    return v;
}

3.1. Get the initial value of ftl and save it to the array in the InternalThreadLocalMap. If the array is too short, it is expanded first and the value is then saved; the expansion is not covered in detail here.

3.2. addToVariablesToRemove(threadLocalMap, this) saves the ftl instance in a Set stored in element 0 of the threadLocalMap internal array.

No code is pasted here, as shown below:

4. Implementation of registerCleaner(threadLocalMap), from the netty-4.1.34 source:

private void registerCleaner(final InternalThreadLocalMap threadLocalMap) {
    Thread current = Thread.currentThread();
    if (FastThreadLocalThread.willCleanupFastThreadLocals(current) || threadLocalMap.isCleanerFlagSet(index)) {
        return;
    }

    threadLocalMap.setCleanerFlag(index);

    // TODO: We need to find a better way to handle this.
    /*
    // We will need to ensure we will trigger remove(InternalThreadLocalMap) so everything will be released
    // and FastThreadLocal.onRemoval(...) will be called.
    ObjectCleaner.register(current, new Runnable() {
        public void run() {
            remove(threadLocalMap);

            // It's fine to not call InternalThreadLocalMap.remove() here as this will only be triggered once
            // the Thread is collected by GC. In this case the ThreadLocal will be gone away already.
        }
    });
    */
}

Since the code ObjectCleaner.register has been commented out in this version, and the remaining logic is relatively simple, no analysis will be performed.
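Step 3.1 above mentions that the array is expanded when the index does not fit. The following is a minimal sketch of such a store-with-growth step, not netty's implementation: the class GrowableSlots is hypothetical, and the growth policy (the smallest power of two strictly greater than the index) is an assumption made for illustration. It also shows why a dedicated UNSET sentinel is useful: null remains a legal stored value.

```java
import java.util.Arrays;

final class GrowableSlots {
    static final Object UNSET = new Object();

    private Object[] slots;

    GrowableSlots(int initialCapacity) {
        slots = new Object[initialCapacity];
        Arrays.fill(slots, UNSET);
    }

    int capacity() {
        return slots.length;
    }

    /** Out-of-range indexes simply read as UNSET, mirroring the bounds check in indexedVariable. */
    Object get(int index) {
        return index < slots.length ? slots[index] : UNSET;
    }

    /** Stores the value and returns true if the slot was previously UNSET. */
    boolean set(int index, Object value) {
        if (index >= slots.length) {
            // Assumed growth policy: smallest power of two strictly greater than index.
            // Overflow for indexes of 2^30 and above is ignored in this sketch.
            int newCapacity = Math.max(1, Integer.highestOneBit(index) << 1);
            int oldCapacity = slots.length;
            slots = Arrays.copyOf(slots, newCapacity);
            Arrays.fill(slots, oldCapacity, newCapacity, UNSET); // new slots start out unused
        }
        Object old = slots[index];
        slots[index] = value;
        return old == UNSET;
    }
}
```

Starting from a capacity of 32, set(40, "v") grows the array to 64 and returns true; a second set(40, "w") returns false because the slot is already in use.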

2.5 Performance degradation of ordinary threads using ftl

With the get() method analyzed, how set(value) works follows naturally. For reasons of space it is not analyzed separately.

As mentioned earlier, ftl should be combined with ftlt to maximize its performance. If it is other ordinary threads, it will degenerate to the ThreadLocal of jdk, because ordinary threads do not contain data structures such as InternalThreadLocalMap. Let's see how to degenerate.

From the get() method of InternalThreadLocalMap:

public static InternalThreadLocalMap get() {
    Thread thread = Thread.currentThread();
    if (thread instanceof FastThreadLocalThread) {
        return fastGet((FastThreadLocalThread) thread);
    } else {
        return slowGet();
    }
}

private static InternalThreadLocalMap slowGet() {
    // slowThreadLocalMap is a static JDK ThreadLocal declared in the parent class; the InternalThreadLocalMap is read from it
    ThreadLocal<InternalThreadLocalMap> slowThreadLocalMap = UnpaddedInternalThreadLocalMap.slowThreadLocalMap;
    InternalThreadLocalMap ret = slowThreadLocalMap.get();
    if (ret == null) {
        ret = new InternalThreadLocalMap();
        slowThreadLocalMap.set(ret);
    }
    return ret;
}

From the perspective of ftl, the whole process of degradation operation is to obtain InternalThreadLocalMap from the ThreadLocal variable of a jdk, and then obtain the value of the subscript of the specified array from InternalThreadLocalMap. The object relationship diagram is as follows:
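The two lookup paths can be put side by side in a self-contained sketch. The names SlotMap, MapCarryingThread and SlotMaps are hypothetical stand-ins for InternalThreadLocalMap, FastThreadLocalThread and the static get() logic; this only models the dispatch and is not netty's code.

```java
// Hypothetical stand-in for InternalThreadLocalMap.
final class SlotMap {
    final Object[] slots = new Object[32];
}

// Hypothetical stand-in for FastThreadLocalThread.
class MapCarryingThread extends Thread {
    SlotMap map; // plain field: reading it needs no hash lookup

    MapCarryingThread(Runnable task) {
        super(task);
    }
}

final class SlotMaps {
    // Fallback for ordinary threads: the map itself lives in a JDK ThreadLocal.
    private static final ThreadLocal<SlotMap> FALLBACK = new ThreadLocal<>();

    /** Returns the calling thread's map, creating it lazily on first access. */
    static SlotMap current() {
        Thread thread = Thread.currentThread();
        if (thread instanceof MapCarryingThread) {
            // Fast path: one field read on the thread object.
            MapCarryingThread fast = (MapCarryingThread) thread;
            if (fast.map == null) {
                fast.map = new SlotMap();
            }
            return fast.map;
        }
        // Slow path: one JDK ThreadLocal lookup, then the same array access as before.
        SlotMap map = FALLBACK.get();
        if (map == null) {
            map = new SlotMap();
            FALLBACK.set(map);
        }
        return map;
    }
}
```

On an ordinary thread the cost of the JDK ThreadLocal lookup is paid once per access to reach the map, after which the value is still read by array index. That is why ftl on ordinary threads is no faster than the JDK ThreadLocal.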

3 ftl resource recovery mechanism

In netty, there are three recycling mechanisms for ftl:

Automatic: use ftlt to execute a Runnable task wrapped by FastThreadLocalRunnable. After the task finishes, ftl is cleaned up automatically.

Manual: both ftl and InternalThreadLocalMap provide a remove method. When appropriate, the user can call it explicitly (and sometimes must, for example when a thread pool of ordinary threads uses ftl).

Automatic: register a Cleaner for each ftl of the current thread. When the thread object is no longer strongly reachable, the Cleaner thread reclaims that thread's ftl values. (netty recommends not using this method if either of the other two is available, because it needs an extra thread, which consumes resources, and multithreading causes some resource contention. In netty-4.1.34, the code calling ObjectCleaner has been commented out.)
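The first two mechanisms boil down to the same pattern: run the cleanup explicitly, ideally in a finally block so it happens even when the task throws. Below is a plain-Java sketch of that pattern. PerThreadState and CleaningRunnable are hypothetical names that only mirror the roles of the manual remove call and of a wrapping Runnable; they are not netty classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-thread state, standing in for the values held by thread-local variables.
final class PerThreadState {
    private static final ThreadLocal<Map<String, Object>> STATE =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, Object value) {
        STATE.get().put(key, value);
    }

    static int size() {
        return STATE.get().size();
    }

    /** Manual cleanup: drops everything the calling thread has stored. */
    static void removeAll() {
        STATE.remove();
    }
}

/** Automatic cleanup: wraps a task so that removeAll() runs even if the task throws. */
final class CleaningRunnable implements Runnable {
    private final Runnable task;

    CleaningRunnable(Runnable task) {
        this.task = task;
    }

    @Override
    public void run() {
        try {
            task.run();
        } finally {
            PerThreadState.removeAll();
        }
    }
}
```

In a thread pool of ordinary threads, wrapping every submitted task this way (or calling the cleanup by hand at the end of the task) keeps values from one task from leaking into the next task that reuses the thread.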

4 Use of ftl in netty

The most important use of ftl in netty is to allocate ByteBuf. The basic method is: each thread allocates a piece of memory (PoolArena). When ByteBuf needs to be allocated, the thread first allocates it from the PoolArena it holds. If it cannot allocate it, it then adopts global allocation.

Because memory is limited, several threads may still end up holding the same PoolArena. Even so, this approach minimizes contention between threads and improves efficiency.
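One way to spread threads over a limited number of arenas is to bind each new thread to the arena that currently has the fewest threads. The sketch below illustrates that idea only. Arena and ArenaChooser are hypothetical, and the counter-based selection is an assumption about how such a choice could work, not a copy of netty's code.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an arena; it only tracks how many threads are bound to it.
final class Arena {
    final String name;
    final AtomicInteger boundThreads = new AtomicInteger();

    Arena(String name) {
        this.name = name;
    }
}

final class ArenaChooser {
    /** Picks the arena with the fewest bound threads and binds the caller to it. */
    static synchronized Arena bindLeastUsed(List<Arena> arenas) {
        if (arenas.isEmpty()) {
            return null;
        }
        Arena best = arenas.get(0);
        for (Arena arena : arenas) {
            if (arena.boundThreads.get() < best.boundThreads.get()) {
                best = arena;
            }
        }
        best.boundThreads.incrementAndGet();
        return best;
    }
}
```

With two arenas, three successive calls bind to the first, the second, and then the first again, so the third thread shares an arena with the first one.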

The specific code is in PoolThreadLocalCache, an inner class of PooledByteBufAllocator:

final class PoolThreadLocalCache extends FastThreadLocal<PoolThreadCache> {

    protected synchronized PoolThreadCache initialValue() {
        final PoolArena<byte[]> heapArena = leastUsedArena(heapArenas);
        final PoolArena<ByteBuffer> directArena = leastUsedArena(directArenas);

        Thread current = Thread.currentThread();
        if (useCacheForAllThreads || current instanceof FastThreadLocalThread) {
            // PoolThreadCache encapsulates the memory blocks held by each thread
            return new PoolThreadCache(
                    heapArena, directArena, tinyCacheSize, smallCacheSize, normalCacheSize,
                    DEFAULT_MAX_CACHED_BUFFER_CAPACITY, DEFAULT_CACHE_TRIM_INTERVAL);
        }
        // No caching so just use 0 as sizes.
        return new PoolThreadCache(heapArena, directArena, 0, 0, 0, 0, 0);
    }
}

Posted on Wed, 27 Oct 2021 20:34:31 -0400 by -entropyman