How to choose between LongAdder and AtomicLong in different concurrency scenarios

| Preface

This article does not go straight to why LongAdder outperforms AtomicLong. It first introduces volatile, for two reasons: it lets me organize what I have learned recently, and AtomicLong exists to cover the scenarios where volatile alone is not enough, so volatile makes a natural lead-in. AtomicLong comes next, followed by LongAdder and a performance comparison of the two. If you only want the explanation, jump to the end of the article: Reasons for performance differences.

| volatile

The volatile keyword can be understood as a lightweight synchronized. Using it does not cause thread context switches or scheduling, so it costs less than synchronized. However, volatile only guarantees visibility: when one thread modifies a volatile variable, the new value is immediately visible to other threads. It does not make compound operations atomic, so it is not suitable for computations such as i++, where the result depends on the variable's current value. Take an example: VolatileTest.java.

public class VolatileTest {
    private static final int THREAD_COUNT = 20;

    private static volatile int race = 0;

    public static void increase() {
        race++;
    }

    public static void main(String[] args) {
        Thread[] threads = new Thread[THREAD_COUNT];
        for (int i = 0; i < THREAD_COUNT; i++) {
            threads[i] = new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < 1000; i++) {
                        increase();
                    }
                }
            });
            threads[i].start();
        }

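        // Wait until all accumulation threads end. join() is used instead of
        // polling Thread.activeCount(), which never drops to 1 when the
        // environment (for example an IDE) starts extra threads.
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }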

        System.out.println("race: " + race);
    }
}

The program is simple: each of the 20 threads increments race 1000 times, so the expected result is 20 * 1000 = 20000. In practice, however, the printed result is almost always less than 20000 and varies from run to run.

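The reason lies in the increase method. Although it contains only one line of source code, decompiling the class (for example with javap -c) shows that race++ is compiled into four bytecode instructions: getstatic, iconst_1, iadd and putstatic. volatile guarantees that getstatic reads the latest value, but another thread may update race before this thread reaches putstatic. The thread then writes back a result computed from a stale value, and one increment is lost.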
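
One way to make the read-add-write sequence atomic is to guard it with a lock. The following is a minimal sketch of that idea, not code from the original test: the class name SynchronizedCounterSketch and the helper countWith are hypothetical, and the counter is an instance field so that the helper can be called repeatedly.

```java
// Illustrative sketch only: names are hypothetical, not from the article's test code.
class SynchronizedCounterSketch {
    // No volatile needed: the lock provides both atomicity and visibility.
    private int race = 0;

    // Only one thread at a time may run the read-add-write sequence.
    synchronized void increase() {
        race++;
    }

    synchronized int get() {
        return race;
    }

    // Starts the worker threads, waits for them with join(), returns the final count.
    static int countWith(int threadCount, int incrementsPerThread) {
        SynchronizedCounterSketch counter = new SynchronizedCounterSketch();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.increase();
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("race: " + countWith(20, 1000)); // prints "race: 20000"
    }
}
```

With the lock in place no increment is lost, at the price of letting only one thread into increase() at a time.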

| AtomicLong

Locking the increase method guarantees a correct result, but synchronized and ReentrantLock are mutual-exclusion locks: only one thread may execute at a time while the others wait, so execution efficiency suffers. Fortunately, the JDK provides atomic classes for this kind of operation. Change the volatile int race variable to an AtomicLong. The code is as follows: AtomicLongTest.java.

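import java.util.concurrent.atomic.AtomicLong;

public class AtomicLongTest {
    private static final int THREAD_COUNT = 20;

    // The reference never changes, so final is enough; volatile is unnecessary here
    private static final AtomicLong race = new AtomicLong(0);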

    public static void increase() {
        race.getAndIncrement();
    }

    public static void main(String[] args) {
        Thread[] threads = new Thread[THREAD_COUNT];
        for (int i = 0; i < THREAD_COUNT; i++) {
            threads[i] = new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < 1000; i++) {
                        increase();
                    }
                }
            });
            threads[i].start();
        }

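        // Wait until all accumulation threads end. join() is used instead of
        // polling Thread.activeCount(), which never drops to 1 when the
        // environment (for example an IDE) starts extra threads.
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }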

        System.out.println("race: " + race);
    }
}

The result is 20000, as expected.

Although AtomicLong guarantees a correct result, it performs poorly under high concurrency. To solve this performance problem, LongAdder was introduced in JDK 1.8.

| LongAdder

LongAdder is used in much the same way as AtomicLong. Change AtomicLong in the code above to LongAdder. The test code is as follows:

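import java.util.concurrent.atomic.LongAdder;

public class LongAdderTest {
    private static final int THREAD_COUNT = 20;

    // The initial sum is 0. The reference never changes, so final is enough;
    // volatile is unnecessary here
    private static final LongAdder race = new LongAdder();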

    public static void increase() {
        race.increment();
    }

    public static void main(String[] args) {
        Thread[] threads = new Thread[THREAD_COUNT];
        for (int i = 0; i < THREAD_COUNT; i++) {
            threads[i] = new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < 1000; i++) {
                        increase();
                    }
                }
            });
            threads[i].start();
        }

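        // Wait until all accumulation threads end. join() is used instead of
        // polling Thread.activeCount(), which never drops to 1 when the
        // environment (for example an IDE) starts extra threads.
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }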

        System.out.println("race: " + race);
    }
}

The result is again 20000, as expected.
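
Printing a LongAdder goes through toString(), which returns the string form of sum(). As a small sketch of the read side of the API (the class name LongAdderSumSketch is hypothetical), the following single-threaded example uses only methods documented on java.util.concurrent.atomic.LongAdder:

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only: the class name is hypothetical.
class LongAdderSumSketch {

    // Returns {sum after two updates, value drained by sumThenReset, sum after reset}.
    static long[] demo() {
        LongAdder adder = new LongAdder();
        adder.increment();                    // +1
        adder.add(41);                        // +41
        long total = adder.sum();             // 42: no concurrent updates here
        long drained = adder.sumThenReset();  // returns 42 and resets the adder to 0
        long afterReset = adder.sum();        // 0
        return new long[] { total, drained, afterReset };
    }

    public static void main(String[] args) {
        long[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints "42 42 0"
    }
}
```

According to the LongAdder documentation, sum() is not an atomic snapshot: updates that happen while the sum is being computed may not be included. The tests above read the value only after all worker threads have finished, so the printed result is exact.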

| Performance comparison between AtomicLong and LongAdder

Having covered the volatile keyword, AtomicLong and LongAdder, let's compare the performance of AtomicLong and LongAdder. The two offer similar functionality, so how should we choose? Let the data speak. The benchmark uses JMH, and the test code is as follows:

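import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.Throughput)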
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class PerformaceTest {
    private static AtomicLong atomicLong = new AtomicLong();
    private static LongAdder longAdder = new LongAdder();

    @Benchmark
    @Threads(10)
    public void atomicLongAdd() {
        atomicLong.getAndIncrement();
    }

    @Benchmark
    @Threads(10)
    public void longAdderAdd() {
        longAdder.increment();
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder().include(PerformaceTest.class.getSimpleName()).build();
        new Runner(options).run();
    }
}

Explanation:
  • @BenchmarkMode(Mode.Throughput) => measure throughput
  • @OutputTimeUnit(TimeUnit.MILLISECONDS) => time unit of the reported results
  • @Threads(10) => number of threads that run the benchmark method concurrently in each fork (the results below use 1, 5 and 10)

Test results:

Number of threads is 1:
Benchmark                      Mode  Cnt       Score      Error   Units
PerformaceTest.atomicLongAdd  thrpt  200  153824.699 ±  137.947  ops/ms
PerformaceTest.longAdderAdd   thrpt  200  124087.220 ±   81.015  ops/ms

Number of threads is 5:
Benchmark                      Mode  Cnt       Score      Error   Units
PerformaceTest.atomicLongAdd  thrpt  200   56392.136 ± 1165.361  ops/ms
PerformaceTest.longAdderAdd   thrpt  200  605501.870 ± 4140.190  ops/ms

Number of threads is 10:
Benchmark                      Mode  Cnt       Score      Error   Units
PerformaceTest.atomicLongAdd  thrpt  200   53286.334 ±  957.765  ops/ms
PerformaceTest.longAdderAdd   thrpt  200  713884.602 ± 3950.884  ops/ms

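The results show that with a single thread AtomicLong is slightly faster (about 1.24 times the throughput of LongAdder). With 5 threads LongAdder already outperforms AtomicLong by a factor of about 10.7, and with 10 threads by a factor of about 13.4.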

| Reasons for performance differences

To understand the performance difference, we have to dig into the source code. First, look at AtomicLong's getAndIncrement method.

AtomicLong#getAndIncrement method analysis
//AtomicLong#getAndIncrement
public final long getAndIncrement() {
    return unsafe.getAndAddLong(this, valueOffset, 1L);
}

//Unsafe#getAndAddLong
public final long getAndAddLong(Object var1, long var2, long var4) {
    long var6;
    do {
        var6 = this.getLongVolatile(var1, var2);
    } while(!this.compareAndSwapLong(var1, var2, var6, var6 + var4));

    return var6;
}


Under the hood this is a CAS (compare-and-swap) loop; on x86 processors the JVM implements the CAS operation with the CMPXCHG instruction provided by the processor. The basic idea of spin CAS is to retry the CAS operation in a loop until it succeeds, and this is exactly what causes performance problems under high concurrency. When N threads update the same variable at the same time, only one CAS can succeed in each round; the others fail and must re-read the value and retry, so a large share of the processor's work is wasted on failed attempts. This is why, in the test above, LongAdder outperforms AtomicLong as the number of threads grows.
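
To make the retry behaviour visible, the loop above can be re-created with the public AtomicLong API. This is a rough sketch under stated assumptions, not the JDK's implementation: the class CasRetrySketch and its methods are hypothetical, and compareAndSet stands in for Unsafe#compareAndSwapLong. It counts how many CAS attempts fail while several threads increment one counter.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: class and method names are hypothetical.
class CasRetrySketch {

    // Increments the counter with an explicit CAS loop and returns the number
    // of failed attempts that preceded the successful one.
    static int incrementCountingRetries(AtomicLong counter) {
        int failedAttempts = 0;
        for (;;) {
            long current = counter.get();
            if (counter.compareAndSet(current, current + 1)) {
                return failedAttempts;
            }
            failedAttempts++; // another thread changed the value first; retry
        }
    }

    // Returns {final counter value, total number of failed CAS attempts}.
    static long[] run(int threadCount, int incrementsPerThread) {
        AtomicLong counter = new AtomicLong();
        AtomicLong totalFailures = new AtomicLong();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> {
                long failures = 0;
                for (int j = 0; j < incrementsPerThread; j++) {
                    failures += incrementCountingRetries(counter);
                }
                totalFailures.addAndGet(failures);
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return new long[] { counter.get(), totalFailures.get() };
    }

    public static void main(String[] args) {
        long[] result = run(10, 100_000);
        System.out.println("value: " + result[0] + ", failed CAS attempts: " + result[1]);
    }
}
```

The final value is always threadCount * incrementsPerThread. The number of failed attempts differs from run to run and from machine to machine, so no particular figure should be read into it.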

Analysis of LongAdder#increment method
public void increment() {
    add(1L);
}

public void add(long x) {
    Cell[] as; long b, v; int m; Cell a;
    if ((as = cells) != null || !casBase(b = base, b + x)) {
        boolean uncontended = true;
        if (as == null || (m = as.length - 1) < 0 ||
            (a = as[getProbe() & m]) == null ||
            !(uncontended = a.cas(v = a.value, v + x)))
            longAccumulate(x, null, uncontended);
    }
}

final void longAccumulate(long x, LongBinaryOperator fn,
                              boolean wasUncontended) {
    int h;
    if ((h = getProbe()) == 0) {
        ThreadLocalRandom.current(); // force initialization
        h = getProbe();
        wasUncontended = true;
    }
    boolean collide = false;                // True if last slot nonempty
    for (;;) {
        Cell[] as; Cell a; int n; long v;
        if ((as = cells) != null && (n = as.length) > 0) {
            if ((a = as[(n - 1) & h]) == null) {
                if (cellsBusy == 0) {       // Try to attach new Cell
                    Cell r = new Cell(x);   // Optimistically create
                    if (cellsBusy == 0 && casCellsBusy()) {
                        boolean created = false;
                        try {               // Recheck under lock
                            Cell[] rs; int m, j;
                            if ((rs = cells) != null &&
                                (m = rs.length) > 0 &&
                                rs[j = (m - 1) & h] == null) {
                                rs[j] = r;
                                created = true;
                            }
                        } finally {
                            cellsBusy = 0;
                        }
                        if (created)
                            break;
                        continue;           // Slot is now non-empty
                    }
                }
                collide = false;
            }
            else if (!wasUncontended)       // CAS already known to fail
                wasUncontended = true;      // Continue after rehash
            else if (a.cas(v = a.value, ((fn == null) ? v + x :
                                             fn.applyAsLong(v, x))))
                break;
            else if (n >= NCPU || cells != as)
                collide = false;            // At max size or stale
            else if (!collide)
                collide = true;
            else if (cellsBusy == 0 && casCellsBusy()) {
                try {
                    if (cells == as) {      // Expand table unless stale
                        Cell[] rs = new Cell[n << 1];
                        for (int i = 0; i < n; ++i)
                            rs[i] = as[i];
                        cells = rs;
                    }
                } finally {
                    cellsBusy = 0;
                }
                collide = false;
                continue;                   // Retry with expanded table
            }
            h = advanceProbe(h);
        }
        else if (cellsBusy == 0 && cells == as && casCellsBusy()) {
            boolean init = false;
            try {                           // Initialize table
                if (cells == as) {
                    Cell[] rs = new Cell[2];
                    rs[h & 1] = new Cell(x);
                    cells = rs;
                    init = true;
                }
            } finally {
                cellsBusy = 0;
            }
            if (init)
                break;
        }
        else if (casBase(v = base, ((fn == null) ? v + x :
                                        fn.applyAsLong(v, x))))
            break;                          // Fall back on using base
    }
}

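The code is long. Its control flow can be summarized as follows:
  • add: if the cells array has not been created yet, try to CAS the new value into base; if that succeeds, the call is done.
  • add: otherwise (cells exists, or the CAS on base failed), pick the Cell at index getProbe() & m and try to CAS the delta into it; if cells is missing, the slot is empty, or the CAS fails, call longAccumulate.
  • longAccumulate: if cells does not exist, take the cellsBusy lock, create an array of 2 cells and store x in a new Cell; if the lock is held by another thread, fall back to a CAS on base.
  • longAccumulate: if the thread's slot is empty, create a new Cell(x) and attach it under the cellsBusy lock.
  • longAccumulate: if the CAS on an existing Cell keeps failing, rehash the probe with advanceProbe and retry; after repeated collisions, double the array, as long as its length is still below NCPU.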

LongAdder is fast because it trades space for time: the Cell array spreads updates over several variables so that threads no longer all compete for a single shared variable. Internally, LongAdder keeps a long value in the base field. When there is no contention, updates are applied to base with CAS. When there is contention, a thread whose CAS on base fails adds its delta to an element of the Cell array instead, for example cell[i] += 1 for increment. When the count is read, the sum of all cell[i] plus base is the final result. The sum code is as follows:

public long sum() {
    Cell[] as = cells; Cell a;
    long sum = base;
    if (as != null) {
        for (int i = 0; i < as.length; ++i) {
            if ((a = as[i]) != null)
                sum += a.value;
        }
    }
    return sum;
}
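
The idea of base plus cells can be illustrated with a much simplified counter. The sketch below is an assumption-laden toy, not the JDK's algorithm: StripedCounterSketch is a hypothetical class, the cell array has a fixed size instead of growing lazily up to NCPU, the slot is picked from the thread's hash code rather than the thread probe, and the cells are not padded against false sharing.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative sketch only: a toy version of the base + cells idea.
class StripedCounterSketch {
    private final AtomicLong base = new AtomicLong();
    private final AtomicLongArray cells; // fixed size here; LongAdder grows its table on demand
    private final int mask;

    StripedCounterSketch(int cellCount) {
        if (cellCount <= 0 || Integer.bitCount(cellCount) != 1) {
            throw new IllegalArgumentException("cellCount must be a power of two");
        }
        cells = new AtomicLongArray(cellCount);
        mask = cellCount - 1;
    }

    void add(long x) {
        long b = base.get();
        // Fast path: a single CAS on base, which succeeds when nobody else interferes.
        if (!base.compareAndSet(b, b + x)) {
            // Contended path: move the update to a per-thread slot instead of retrying on base.
            int index = Thread.currentThread().hashCode() & mask;
            cells.addAndGet(index, x);
        }
    }

    // Same shape as LongAdder#sum: base plus every cell.
    long sum() {
        long sum = base.get();
        for (int i = 0; i < cells.length(); i++) {
            sum += cells.get(i);
        }
        return sum;
    }

    // Runs the given number of threads and returns the sum after all of them finished.
    static long run(int threadCount, int incrementsPerThread) {
        StripedCounterSketch counter = new StripedCounterSketch(8);
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.add(1L);
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println("sum: " + run(20, 1000)); // prints "sum: 20000"
    }
}
```

Every update lands either in base or in exactly one cell, so the total is preserved however the updates are distributed. As in LongAdder, sum() is exact only when no updates are in flight, which is why run() joins all threads before reading it.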

| AtomicLong and LongAdder selection

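Choose LongAdder for high-concurrency scenarios and AtomicLong for low-concurrency scenarios. In the benchmark above, AtomicLong was somewhat faster with a single thread, while LongAdder was more than 10 times faster with 5 and 10 threads.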
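
The method signatures shown earlier suggest one more consideration: LongAdder#increment returns void, while AtomicLong#getAndIncrement returns the value it updated. A possible way to apply this, sketched with a hypothetical class CounterChoiceSketch, is to keep AtomicLong where each caller needs the result of its own update and to use LongAdder for statistics that are written often and read rarely.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only: the class and its methods are hypothetical.
class CounterChoiceSketch {
    // Every caller needs its own unique value, so the update must return a result.
    private final AtomicLong nextId = new AtomicLong();

    // Only the aggregate matters, and it is read far less often than it is written.
    private final LongAdder requestCount = new LongAdder();

    // Records one request and returns a unique id for it.
    long handleRequest() {
        requestCount.increment();
        return nextId.incrementAndGet();
    }

    // Total requests seen so far; exact when no requests are in flight.
    long requestsSoFar() {
        return requestCount.sum();
    }

    public static void main(String[] args) {
        CounterChoiceSketch service = new CounterChoiceSketch();
        long first = service.handleRequest();
        long second = service.handleRequest();
        System.out.println(first + " " + second + " " + service.requestsSoFar()); // prints "1 2 2"
    }
}
```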

Posted on Thu, 18 Nov 2021 21:52:14 -0500 by OMorchoe