The art of multiprocessor programming - 3. Spin lock and contention

This series contains my reading notes on The Art of Multiprocessor Programming. It follows the original book's ideas, implemented against the code of OpenJDK 11 or above, and adds some personal insights from my own research for readers who want a deeper understanding.

Spin lock and contention

1. Further discussion on spin locking of TAS and TTAS

In the previous chapter we implemented the TASLock and TTASLock spin locks. Because compareAndSet broadcasts on the interconnect, it delays all threads, including those not waiting for the lock. Worse, a compareAndSet call forces other processors to discard the copies in their caches, so each spinning thread encounters a cache miss almost every time and must fetch the new value over the bus. Worse still, when the thread holding the lock tries to release it, the release itself may be delayed because the interconnect may be monopolized by the spinning threads. These are the reasons why TASLock performs so poorly.
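For context, a TASLock along these lines spins directly on the atomic operation, so every iteration generates bus traffic. This is my own minimal reconstruction using AtomicBoolean, not necessarily the previous chapter's exact code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        //Every spin iteration performs an atomic getAndSet, so each attempt hits the bus
        while (state.getAndSet(true)) {
            Thread.onSpinWait();
        }
    }

    public void unlock() {
        state.set(false);
    }
}
```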

Now consider the behavior of TTASLock while the lock is held by thread A. The first time thread B reads the lock it takes a cache miss and must wait while the value is loaded into its cache. As long as A holds the lock, B keeps rereading the value and hits its own cache every time. Thus while A holds the lock, no bus traffic is generated and other threads' memory accesses are not slowed down. Moreover, A's release of the lock is not delayed by the spinning threads.

However, releasing the lock causes a bus storm: thread A writes false to the lock variable, and that write immediately invalidates the cached copies of all the spinning threads.
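Since the benchmark later in this post also exercises TTASLock, here is a minimal reconstruction along the lines of the previous chapter. It is a sketch using a VarHandle; the original also implements the same simple Lock interface, which I omit here to keep the snippet self-contained:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class TTASLock {
    private boolean locked = false;
    private static final VarHandle LOCKED;

    static {
        try {
            LOCKED = MethodHandles.lookup().findVarHandle(TTASLock.class, "locked", boolean.class);
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    public void lock() {
        while (true) {
            //First spin on a plain read, which hits the local cache while the lock is held
            while ((boolean) LOCKED.get(this)) {
                Thread.onSpinWait();
            }
            //Only when the lock looks free do we attempt the expensive compareAndSet
            if (LOCKED.compareAndSet(this, false, true)) {
                return;
            }
        }
    }

    public void unlock() {
        LOCKED.setVolatile(this, false);
    }
}
```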

2. Exponential Backoff

We often see the term backoff in the design of microservice systems, typically when a call to another service fails. Instead of retrying immediately, the caller waits for an interval before retrying, and that interval is usually not fixed: for the same request, it grows with the number of retries, most commonly as an exponential function of the retry count.
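That microservice-style retry pattern might be sketched like this. The retry helper, its parameters, and the Supplier-based call are my own illustration, not from the book:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class RetryWithBackoff {
    //Retry a failing call with a randomized, exponentially growing delay between attempts
    static <T> T retry(Supplier<T> call, int maxRetries,
                       long minDelayMs, long maxDelayMs) throws InterruptedException {
        long bound = minDelayMs;
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                //Sleep a random time up to the current bound, then double the bound up to maxDelayMs
                Thread.sleep(ThreadLocalRandom.current().nextLong(1, bound + 1));
                bound = Math.min(bound * 2, maxDelayMs);
            }
        }
        throw last;
    }
}
```

The randomization matters: if all callers retried after exactly the same interval, they would collide again on each retry.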

This design actually comes from software written with the underlying hardware in mind. First, let's clarify the concept of contention: multiple threads competing for the same resource, here a lock. High contention means many threads compete for the same lock; low contention means the opposite.

In the TTASLock we implemented earlier, lock consists of two main steps: repeatedly reading the lock state, and attempting to acquire the lock when it appears free. If a thread goes through the whole sequence but fails to acquire the lock because another thread got it first, the lock is probably under high contention. Attempting to acquire a highly contended resource should be avoided: the probability of success is small, while the bus traffic generated is large. It is more efficient to have the thread back off for a period of time and stay out of the competition.

How long should the thread back off before retrying? A good heuristic is to make the backoff time grow with the number of retries, because the more retries there have been, the more likely contention is high. Here is a simple scheme:

  1. Read the lock state.
  2. When it reads as free, attempt to acquire the lock.
  3. If the acquisition fails, back off for a random period of time.
  4. Repeat steps 1 to 3, doubling the backoff upper bound of step 3 on each failure, up to a fixed maximum maxDelay.

Let's implement this lock:

import java.util.concurrent.ThreadLocalRandom;

public class Backoff {
    private final long minDelay;
    private final long maxDelay;
    private long current;

    public Backoff(long minDelay, long maxDelay) {
        this.minDelay = minDelay;
        this.maxDelay = maxDelay;
        //The initial random upper bound is minDelay
        this.current = minDelay;
    }

    public void backoff() {
        //Use ThreadLocalRandom so concurrent callers do not contend on a shared Random
        long delay = ThreadLocalRandom.current().nextLong(1, current + 1);
        //Double the upper bound on each backoff, capped at maxDelay
        current = Math.min(current * 2L, maxDelay);
        try {
            //Sleep for the randomly chosen delay (in milliseconds)
            Thread.sleep(delay);
        } catch (InterruptedException e) {
            //Restore the interrupt status instead of swallowing it
            Thread.currentThread().interrupt();
        }
    }
}
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class TTASWithBackoffLock implements Lock {
    private boolean locked = false;
    private final Backoff backoff = new Backoff(10L, 100L);
    //Handle used to operate on the locked field
    private static final VarHandle LOCKED;

    static {
        try {
            //Initialize the handle
            LOCKED = MethodHandles.lookup().findVarHandle(TTASWithBackoffLock.class, "locked", boolean.class);
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    @Override
    public void lock() {
        while (true) {
            //Plain read of locked; spin as long as the lock is held
            while ((boolean) LOCKED.get(this)) {
                //Thread.onSpinWait() is currently the best way to spin: when threads far outnumber CPUs it behaves better than Thread.yield(), and it reacts much faster than Thread.sleep()
                Thread.onSpinWait();
            }
            //A successful CAS means the lock was acquired
            if (LOCKED.compareAndSet(this, false, true)) {
                return;
            } else {
                //Back off on failure
                backoff.backoff();
            }
        }
    }

    @Override
    public void unlock() {
        LOCKED.setVolatile(this, false);
    }
}
After that, we use JMH to measure the performance difference between TTASWithBackoffLock and the previously implemented TTASLock:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

//The metric is the time of a single call
@BenchmarkMode(Mode.SingleShotTime)
//Warm up to eliminate the impact of JIT compilation and JVM metric collection; since each invocation already loops many times, one warmup iteration is enough
@Warmup(iterations = 1)
//A single benchmark thread and a single fork are enough, since the test spawns its own threads
@Threads(1)
@Fork(1)
//Measure 10 iterations
@Measurement(iterations = 10)
//All benchmark threads share one instance of the class
@State(value = Scope.Benchmark)
public class LockTest {
    private static class ValueHolder {
        int count = 0;
    }

    //Test with different thread counts
    @Param(value = {"1", "2", "5", "10", "20", "50", "100"})
    private int threadsCount;

    @Benchmark
    public void testTTASWithBackoffLock(Blackhole blackhole) throws InterruptedException {
        test(new TTASWithBackoffLock());
    }

    @Benchmark
    public void testTTASLock(Blackhole blackhole) throws InterruptedException {
        test(new TTASLock());
    }

    private void test(Lock lock) throws InterruptedException {
        ValueHolder valueHolder = new ValueHolder();
        Thread[] threads = new Thread[threadsCount];
        //Increment the counter 5000000 times in total, split across the threads
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 5000000 / threads.length; j++) {
                    lock.lock();
                    try {
                        valueHolder.count++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            threads[i].start();
        }
        for (int i = 0; i < threads.length; i++) {
            threads[i].join();
        }
        if (valueHolder.count != 5000000) {
            throw new RuntimeException("something wrong in lock implementation");
        }
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(LockTest.class.getSimpleName()).build();
        new Runner(opt).run();
    }
}

The result is:

Benchmark                         (threadsCount)  Mode  Cnt  Score   Error  Units
LockTest.testTTASLock                          1    ss   10  0.064 ± 0.005   s/op
LockTest.testTTASLock                          2    ss   10  0.138 ± 0.044   s/op
LockTest.testTTASLock                          5    ss   10  0.426 ± 0.100   s/op
LockTest.testTTASLock                         10    ss   10  0.699 ± 0.128   s/op
LockTest.testTTASLock                         20    ss   10  0.932 ± 0.241   s/op
LockTest.testTTASLock                         50    ss   10  1.162 ± 0.542   s/op
LockTest.testTTASLock                        100    ss   10  1.379 ± 0.939   s/op
LockTest.testTTASWithBackoffLock               1    ss   10  0.068 ± 0.008   s/op
LockTest.testTTASWithBackoffLock               2    ss   10  0.080 ± 0.023   s/op
LockTest.testTTASWithBackoffLock               5    ss   10  0.135 ± 0.037   s/op
LockTest.testTTASWithBackoffLock              10    ss   10  0.187 ± 0.072   s/op
LockTest.testTTASWithBackoffLock              20    ss   10  0.200 ± 0.063   s/op
LockTest.testTTASWithBackoffLock              50    ss   10  0.239 ± 0.052   s/op
LockTest.testTTASWithBackoffLock             100    ss   10  0.261 ± 0.042   s/op

Process finished with exit code 0

The results show that the backoff version performs much better, and the gap widens as the number of threads grows.

Although the backoff-based lock implementation is simple, it clearly improves performance. However, it is difficult to find general minDelay and maxDelay values that suit different machines and different configurations.

Tags: Java Concurrent Programming

Posted on Thu, 04 Nov 2021 20:55:36 -0400 by JustinMs66