Loop unrolling: a JVM optimization

In the previous articles in this series on JVM internals, we looked at the just-in-time (JIT) compilation technology of Java's HotSpot virtual machine, including escape analysis and lock elision. In this article we will discuss another automatic optimization, known as loop unrolling. The JIT compiler uses this technique to make loops (such as Java's for or while loops) execute more efficiently.

Since we will be digging into the JVM's internal mechanisms, you will see C code and even assembly language used in the explanations from time to time. Hold on tight!

Let's start with the following C code, which allocates space for 1 million longs and then fills it with 1 million random long values.

int main(int argc, char** argv) {
    int MAX = 1000000;
    long* data = (long*)calloc(MAX, sizeof(long));
    for (int i = 0; i < MAX; i++) {
        data[i] = randomLong();
    }
}

C is considered a high-level language, but is that really the case? On an Apple Mac, compiling the previous code with the Clang compiler (with the -S option turned on to emit Intel-format assembly language) produces the following output:

_main:                       ## @main
## BB#0:
    pushq   %rbp
    movq    %rsp, %rbp
    subq    $48, %rsp
    movl    $8, %eax
    movl    %eax, %ecx
    movl    $0, -4(%rbp)
    movl    %edi, -8(%rbp)
    movq    %rsi, -16(%rbp)
    movl    $1000000, -20(%rbp)   ## imm = 0xF4240
    movslq  -20(%rbp), %rdi
    movq    %rcx, %rsi
    callq   _calloc
    movq    %rax, -32(%rbp)
    movl    $0, -36(%rbp)
LBB1_1:                       ## =>This Inner Loop Header: Depth=1
    movl    -36(%rbp), %eax
    cmpl    -20(%rbp), %eax
    jge LBB1_4
## BB#2:                      ##   in Loop: Header=BB1_1 Depth=1
    callq   _randomLong
    movslq  -36(%rbp), %rcx
    movq    -32(%rbp), %rdx
    movq    %rax, (%rdx,%rcx,8)
## BB#3:                      ##   in Loop: Header=BB1_1 Depth=1
    movl    -36(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -36(%rbp)
    jmp LBB1_1
LBB1_4:
    movl    -4(%rbp), %eax
    addq    $48, %rsp
    popq    %rbp
    retq

Looking at this code, you can see a call to the calloc function at the beginning and a single call to the randomLong() function (inside the loop). There are also two jumps. The machine code is essentially the same as what the following variant C code would produce:

int main(int argc, char** argv) {
    int MAX = 1000000;
    long* data = (long*)calloc(MAX, sizeof(long));
    int i = 0;
LOOP:
    if (i >= MAX)
        goto END;
    data[i] = randomLong();
    ++i;
    goto LOOP;
END:
    return 0;
}

The equivalent code in Java looks like this:

public class LoopUnroll {
    public static void main(String[] args) {
        int MAX = 1000000;
        long[] data = new long[MAX];
        java.util.Random random = new java.util.Random();
        for (int i = 0; i < MAX; i++) {
            data[i] = random.nextLong();
        }
    }
}

If it is compiled into bytecode, it will be like this:

public static void main(java.lang.String[]);
    Code:
 0: ldc                        #2       // int 1000000
 2: istore_1
 3: iload_1
 4: newarray       long
 6: astore_2
 7: new                       #3       // class java/util/Random
10: dup
11: invokespecial             #4       //  Method java/util/Random."<init>":()V
14: astore_3
15: iconst_0
16: istore        4
18: iload         4
20: iload_1
21: if_icmpge     38
24: aload_2
25: iload         4
27: aload_3
28: invokevirtual             #5       //  Method java/util/Random.nextLong:()J
31: lastore
32: iinc          4, 1
35: goto          18
38: return

These programs are very similar in structure: they all operate on the array data inside a loop. A real processor has an instruction pipeline. As long as the program executes linearly, it can take full advantage of the pipeline, because the next instruction to execute is always ready.

However, once a jump instruction is taken, the advantage of the instruction pipeline is usually lost, because the contents of the pipeline must be discarded and new opcodes must be loaded from the jump target in main memory. The performance penalty is similar to a cache miss: both require an extra load from main memory.

For a back branch (a jump backwards to a previously executed point in the code, as in the for loop above), the impact on performance depends on how accurate the branch prediction algorithm provided by the CPU is. Section 3.4.1 of the Intel 64 and IA-32 Architectures Optimization Reference Manual [PDF] describes the branch prediction algorithm of a specific chip in detail.

However, thanks to HotSpot's JIT compiler, Java programs have more possibilities. The JIT compiler optimizes the code it emits, and the compiled code can look very different under different circumstances.

In particular, the JIT compiler heavily optimizes counted loops that use an int, short, or char variable as the loop counter. It unrolls the loop body, replacing it with several copies of the original body placed one after another. This restructuring of the loop reduces the number of back branches required. Compared with the assembly code generated from the C code above, performance improves greatly, because the instruction pipeline is flushed far less often.

Let's use a few simple methods to test the differences between loop variants. This lets us examine the assembly code after unrolling and see how several iterations of the original loop are completed in a single pass.

Before starting our journey into assembly code, we need to make some small modifications to the earlier Java code to get the JIT compiler to work, because the HotSpot virtual machine only compiles whole method bodies. Moreover, a method must be executed a certain number of times in interpreted mode before the compiler will consider compiling it (typically, fully optimized compilation kicks in after 10,000 executions). With only a standalone main method as before, the JIT compiler would never be triggered, and there would be no optimization to observe.

The following Java method is essentially similar to the original example, and you can use it for testing (MAX and data are fields of the enclosing class):

private long intStride1()
{
    long sum = 0;
    for (int i = 0; i < MAX; i += 1)
    {
        sum += data[i];
    }
    return sum;
}

This method walks the array sequentially, accumulates the values, and returns the result. This is similar to the earlier example, but we return the result so that the JIT compiler cannot combine loop unrolling with escape analysis for further optimization, which would make the actual effect of loop unrolling hard to isolate.
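Because HotSpot compiles only whole method bodies, and only after roughly 10,000 interpreted executions, a small driver is needed to get intStride1 compiled at all. A minimal sketch might look like this (the class name, field setup, and the 20,000 repetition count are illustrative assumptions, not the article's exact harness):

```java
public class LoopUnrolling {
    static final int MAX = 1_000_000;
    final long[] data = new long[MAX];

    long intStride1() {
        long sum = 0;
        for (int i = 0; i < MAX; i += 1) {
            sum += data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        LoopUnrolling lu = new LoopUnrolling();
        java.util.Random random = new java.util.Random();
        for (int i = 0; i < MAX; i++) {
            lu.data[i] = random.nextLong();
        }
        // Call the method well past the compile threshold (~10,000
        // invocations) so the JIT compiler produces optimized code.
        long sum = 0;
        for (int run = 0; run < 20_000; run++) {
            sum += lu.intStride1();
        }
        System.out.println(sum);
    }
}
```

Running this class with the VM flags shown later will then print the disassembled code for intStride1 once it has been compiled.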

We can identify a key access pattern in the assembly language that makes it easier to understand what the code is doing: the triple [base, index, offset] composed of registers and an offset constant.

  • The base register stores the starting address of the array

  • The index register stores the counter (this is multiplied by the size of the data type)

  • The offset constant records the position within the unrolled loop body

The actual assembly language looks like this:

add rbx, QWORD PTR [base register + index register * size + offset]
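As a concrete illustration of this addressing form: for a long array the element size is 8, and in the listings below the constant 0x18 is the offset of the first element past the array header (a detail of 64-bit HotSpot with compressed oops disabled). The helper below is plain arithmetic, not real pointer access, and its names are ours:

```java
public class AddressSketch {
    // Effective address of data[index] in the [base, index, offset]
    // form seen in the assembly: base + index * elementSize + offset.
    static long elementAddress(long base, long index, long elementSize, long offset) {
        return base + index * elementSize + offset;
    }

    public static void main(String[] args) {
        // With base = 0x1000, 8-byte longs and a 0x18-byte array header,
        // element 3 lives at 0x1000 + 3*8 + 0x18 = 0x1030.
        System.out.println(Long.toHexString(elementAddress(0x1000, 3, 8, 0x18)));
    }
}
```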

Assuming that the array type is long, let's look at the conditions under which loop unrolling is triggered. Note that unrolling behavior differs between versions of the HotSpot virtual machine and also depends on the specific CPU architecture, but the overall concepts are the same.

To see the disassembled native code generated by the JIT compiler, you also need a disassembly library (usually hsdis, the HotSpot Disassembler), which should be installed in the jre/lib directory under your Java installation.

hsdis can be built from the OpenJDK source code; see the JITWatch wiki for instructions. Alternatively, Oracle's GraalVM project distributes hsdis as a downloadable binary: you can copy it from the GraalVM installation directory into your Java installation.

After installing hsdis, you need to configure the virtual machine to output the compiled assembly code of the method. This requires some VM startup flags, including -XX:+PrintAssembly.

Note that when the JIT thread compiles a method with these flags, it immediately disassembles the generated native code into readable assembly. This is an expensive operation that affects the performance of the application, so it should not be used in production.

Execute the program with the following VM options, and you can see the disassembled assembly language of the specified method:

java -XX:+UnlockDiagnosticVMOptions \
     -XX:-UseCompressedOops         \
     -XX:PrintAssemblyOptions=intel \
     -XX:CompileCommand=print,javamag.lu.LoopUnrolling::intStride1 \
     javamag.lu.LoopUnrolling

This command generates the assembly code corresponding to an int counted loop with a fixed stride of 1.

It is worth noting that -XX:-UseCompressedOops is used here only to turn off pointer compression and simplify the generated assembly. On a 64-bit JVM, compressed pointers save memory, so we don't recommend disabling them in ordinary usage. You can learn more about compressed ordinary object pointers (oops) in the OpenJDK wiki.

The running long sum is held in the 64-bit register rbx. Each add instruction loads the next value from the data array and adds it to rbx. Between loads, the constant offset increases by 8 (the size in bytes of Java's long primitive type).

When the unrolled portion branches back to the top of the main loop, the index register is incremented by the amount of data processed in that iteration:

//==============================
// setup code
//==============================
// Assign the address of the data array to rcx
0x00007f475d1109f7: mov rcx,QWORD PTR [rbp+0x18]  ;*getfield data
// Assign array size to   edx
0x00007f475d1109fb: mov edx,DWORD PTR [rcx+0x10]
// Assign MAX to   r8d
0x00007f475d1109fe: mov r8d,DWORD PTR [rbp+0x10]  ;*getfield MAX
// The loop counter is r13d, which is compared with MAX
0x00007f475d110a02: cmp r13d,r8d
// If counter >= MAX, jump to the exit
0x00007f475d110a05: jge L0006
0x00007f475d110a0b: mov r11d,r13d
0x00007f475d110a0e: inc r11d
0x00007f475d110a11: xor r9d,r9d
0x00007f475d110a14: cmp r11d,r9d
0x00007f475d110a17: cmovl r11d,r9d
0x00007f475d110a1b: cmp r11d,r8d
0x00007f475d110a1e: cmovg r11d,r8d
 
//==============================
// Pre-loop
//==============================
// Array boundary check
             L0000: cmp r13d,edx
0x00007f475d110a25: jae L0007
// Perform addition
0x00007f475d110a2b: add rbx,QWORD PTR [rcx+r13*8+0x18]  ;*ladd
// Increment the counter
0x00007f475d110a30: mov r9d,r13d
0x00007f475d110a33: inc r9d  ;*iinc
// If PRE-LOOP has been completed, jump to main loop
0x00007f475d110a36: cmp r9d,r11d
0x00007f475d110a39: jge L0001
// Check the loop counter; if not done, jump back (to L0000)
0x00007f475d110a3b: mov r13d,r9d
0x00007f475d110a3e: jmp L0000
//==============================
// Main loop initialization
//==============================
             L0001: cmp r8d,edx
0x00007f475d110a43: mov r10d,r8d
0x00007f475d110a46: cmovg r10d,edx
0x00007f475d110a4a: mov esi,r10d
0x00007f475d110a4d: add esi,0xfffffff9
0x00007f475d110a50: mov edi,0x80000000
0x00007f475d110a55: cmp r10d,esi
0x00007f475d110a58: cmovl esi,edi
0x00007f475d110a5b: cmp r9d,esi
0x00007f475d110a5e: jge L000a
0x00007f475d110a64: jmp L0003
0x00007f475d110a66: data16 nop WORD PTR [rax+rax*1+0x0]
//==============================
// Main loop (unrolled)
// Performs 8 adds per iteration
//==============================
             L0002: mov r9d,r13d
             L0003: add rbx,QWORD PTR [rcx+r9*8+0x18]   ;*ladd
0x00007f475d110a78: movsxd r10,r9d
0x00007f475d110a7b: add rbx,QWORD PTR [rcx+r10*8+0x20]  ;*ladd
0x00007f475d110a80: add rbx,QWORD PTR [rcx+r10*8+0x28]  ;*ladd
0x00007f475d110a85: add rbx,QWORD PTR [rcx+r10*8+0x30]  ;*ladd
0x00007f475d110a8a: add rbx,QWORD PTR [rcx+r10*8+0x38]  ;*ladd
0x00007f475d110a8f: add rbx,QWORD PTR [rcx+r10*8+0x40]  ;*ladd
0x00007f475d110a94: add rbx,QWORD PTR [rcx+r10*8+0x48]  ;*ladd
0x00007f475d110a99: add rbx,QWORD PTR [rcx+r10*8+0x50]  ;*ladd
// Increment the loop counter by 8
0x00007f475d110a9e: mov r13d,r9d
0x00007f475d110aa1: add r13d,0x8  ;*iinc
// Check the loop counter; if not done, jump back (to L0002)
0x00007f475d110aa5: cmp r13d,esi
0x00007f475d110aa8: jl L0002
//==============================
0x00007f475d110aaa: add r9d,0x7  ;*iinc

// If the loop counter >= MAX, jump to the exit
             L0004: cmp r13d,r8d
0x00007f475d110ab1: jge L0009

0x00007f475d110ab3: nop

//==============================
// Post-loop
//==============================
// Array boundary check
             L0005: cmp r13d,edx
0x00007f475d110ab7: jae L0007
// Perform an addition
0x00007f475d110ab9: add rbx,QWORD PTR [rcx+r13*8+0x18];*ladd

// Increment the loop counter
0x00007f475d110abe: inc r13d  ;*iinc

// Check the loop counter; if not done, jump back (to L0005)
0x00007f475d110ac1: cmp r13d,r8d

0x00007f475d110ac4: jl L0005

//==============================

(To make the listing easier to follow, we have added comments to the assembly code so that each section is clearly delimited. For brevity, only one exit block is kept; there are usually several exit blocks in the generated code to handle the various ways the method can end. The setup section will also be compared with other variants later in this article.)

When accessing an array in a loop, the HotSpot virtual machine splits the loop into three parts to eliminate array bounds checks:

  • Pre-loop: performs the initial iterations, with bounds checks.

  • Main loop: runs the maximum number of iterations that can execute without bounds checks, computed from the loop stride (the amount the counter increases on each iteration).

  • Post-loop: performs the remaining iterations, with bounds checks.
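The three-way split can be sketched in plain Java as follows. This is illustrative only: the split points, the unroll factor of 8, and the method name are our assumptions, and the real compiler performs the transformation on its internal representation, having proved the main-loop indices in range so that no per-access bounds check is emitted there:

```java
public class LoopSplitSketch {
    // Rough source-level picture of HotSpot's pre/main/post split for
    // "for (int i = 0; i < max; i++) sum += data[i];".
    static long sumSplit(long[] data, int max) {
        long sum = 0;
        int i = 0;
        // Pre-loop: a few initial iterations, with normal bounds checks.
        int preEnd = Math.min(max, 1);
        for (; i < preEnd; i++) {
            sum += data[i];
        }
        // Main loop: unrolled by 8; the compiler has proved indices
        // i .. i+7 are in range, so the machine code needs no
        // per-access bounds check here.
        int mainEnd = Math.min(max, data.length);
        for (; i + 8 <= mainEnd; i += 8) {
            sum += data[i]     + data[i + 1] + data[i + 2] + data[i + 3]
                 + data[i + 4] + data[i + 5] + data[i + 6] + data[i + 7];
        }
        // Post-loop: the leftover iterations, with bounds checks again.
        for (; i < max; i++) {
            sum += data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[20];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(sumSplit(data, data.length)); // 0+1+...+19 = 190
    }
}
```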

Calculate the ratio of add operations to jump operations and you can see the actual effect of this optimization. In the unoptimized C version we examined earlier, this ratio is 1:1, while the JIT compiler of the HotSpot virtual machine raises it to 8:1, cutting the number of jumps by 87%. Broadly speaking, the cost of a jump can run from a couple of CPU cycles up to around 300, spent waiting for code to be reloaded from main memory, so the improvement is significant. (If you want to know how the HotSpot virtual machine eliminates bounds checks in array loops, see this online document.)

The HotSpot virtual machine can also unroll int counted loops with a constant stride of 2 or 4. With a stride of 4, for example, the loop is still unrolled 8 times, and the address offset between successive operations in the unrolled body becomes 0x20 (32) bytes. The compiler can also unroll loops counted by short, byte, or char variables, but not loops counted by long, which we discuss in the next section.

Safepoints

The Java code of a method that uses a long loop counter looks very similar to the int version:

private long longStride1()
{
    long sum = 0;
    for (long l = 0; l < MAX; l++)
    {
        sum += data[(int) l];
    }
    return sum;
}

However, with a long counter, the setup portion of the generated assembly is completely different from the listing above, and even with a constant stride of 1, the loop is not unrolled:

// Assign the array length to r9d
0x00007fefb0a4bb7b: mov    r9d,DWORD PTR [r11+0x10]
// Jump to the loop-end check below
0x00007fefb0a4bb7f: jmp    0x00007fefb0a4bb90
// (Target of the back branch) - accumulate the sum in r14
0x00007fefb0a4bb81: add    r14,QWORD PTR [r11+r10*8+0x18]
// Increment the loop counter (rbx)
0x00007fefb0a4bb86: add    rbx,0x1
// Safepoint check
0x00007fefb0a4bb8a: test   DWORD PTR [rip+0x9f39470],eax
// If the loop counter >= 1_000_000, jump to the exit
0x00007fefb0a4bb90: cmp    rbx,0xf4240
0x00007fefb0a4bb97: jge    0x00007fefb0a4bbc9
// Assign the lower 32 bits of the loop counter to r10d
0x00007fefb0a4bb99: mov    r10d,ebx
// Array bounds check, then jump back to the top of the loop
0x00007fefb0a4bb9c: cmp    r10d,r9d
0x00007fefb0a4bb9f: jb     0x00007fefb0a4bb81

Now there is only one add instruction per loop iteration: the ratio of add to jump instructions is back to 1:1, and the benefit of loop unrolling is gone. On top of that, there is now a safepoint check inside the loop.

Safepoints are special positions in the code where an executing thread is known to have completed all its in-flight modifications to internal data structures (such as objects in the heap). They are good moments to check whether the JVM needs to pause all threads executing Java code. By checking for a safepoint and suspending themselves, application threads give the JVM a chance to perform operations that may modify the memory layout or internal data structures, such as a stop-the-world (STW) garbage collection.

During interpreted execution, there is a natural moment for a safepoint check: right after one bytecode has been executed and before the next one begins.

Safepoint checks between bytecodes work well for the interpreter, but for JIT-compiled methods, the checks must be woven into the code emitted by the compiler.

If these checks were missing, one thread could keep running while almost every other application thread was already suspended at a safepoint, which would leave the virtual machine in an inconsistent state.

HotSpot uses several heuristics to insert safepoint checks into compiled code. The two most common places are just before a back branch (as in this example) and just before the method exits, when control has not yet returned to the caller.

However, the appearance of safepoint checks in the long-counted example also exposes another feature of int counted loops: they have no safepoint checks. That is, no safepoint check occurs during the execution of the entire int counted loop (with constant stride), which in extreme cases can take a long time.

However, if the loop has an int counter but a variable stride, for example one that may change on every call to the method:

private long intStrideVariable(int stride)
{
    long sum = 0;
    for (int i = 0; i < MAX; i += stride)
    {
        sum += data[i];
    }
    return sum;
}

then the JIT compiler will generate a safepoint check at the back branch.

A long-running int counted loop can therefore make other threads wait at their safepoints until it finishes. If you are sensitive to the latency and pauses this causes, you can use the startup flag -XX:+UseCountedLoopSafepoints. This option adds a safepoint check at the back branch of the unrolled loop body, so in the assembly code generated in the earlier example, a safepoint check would occur once per 8 addition operations.

Do not enable this option unless performance testing clearly shows that it makes a significant improvement; the same principle applies to any other performance-related command-line flag. Few programs benefit from it, so don't turn it on blindly. Java 10 introduced a more advanced technique, called loop strip mining, to better balance the impact of safepoint checks on throughput and latency.

We can use JMH to measure the performance difference between int counting and long counting over the same array. As explained above, the loop with a long counter is not unrolled and carries a safepoint check on every iteration.

package optjava.jmh;

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
public class LoopUnrollingCounter
{
    private static final int MAX = 1_000_000;

    private long[] data = new long[MAX];

    @Setup
    public void createData()
    {
        java.util.Random random = new java.util.Random();
        for (int i = 0; i < MAX; i++)
        {
            data[i] = random.nextLong();
        }
    }

    @Benchmark
    public long intStride1()
    {
        long sum = 0;
        for (int i = 0; i < MAX; i++)
        {
            sum += data[i];
        }
        return sum;
    }

    @Benchmark
    public long longStride1()
    {
        long sum = 0;
        for (long l = 0; l < MAX; l++)
        {
            sum += data[(int) l];
        }
        return sum;
    }
}

The final output results are as follows:

Benchmark                         Mode  Cnt     Score   Error  Units
LoopUnrollingCounter.intStride1   thrpt  200  2423.818 ± 2.547  ops/s
LoopUnrollingCounter.longStride1  thrpt  200  1469.833 ± 0.721  ops/s

In other words, the loop using an int counter performs about 64% more operations per second.
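That figure follows directly from the Score values in the table; a quick sanity check (the class and variable names here are ours):

```java
public class ThroughputRatio {
    public static void main(String[] args) {
        double intOps = 2423.818;   // intStride1 score, ops/s
        double longOps = 1469.833;  // longStride1 score, ops/s
        // (2423.818 / 1469.833 - 1) * 100 is roughly 64.9%
        double gainPercent = (intOps / longOps - 1) * 100;
        System.out.printf("int counted loop: %.1f%% more ops/s%n", gainPercent);
    }
}
```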

Conclusion

The HotSpot virtual machine can perform even more sophisticated loop unrolling optimizations -- for example, when a loop contains multiple exit points. In that case, the loop is still unrolled, and the termination condition is checked once per unrolled iteration.

As a virtual machine, HotSpot uses loop unrolling to reduce or eliminate the performance cost of back branches. Most Java developers never need to know about this capability: it is just another performance optimization that the runtime provides transparently.


Posted on Sun, 21 Nov 2021 19:14:11 -0500 by jcornett