[Linux optimization] How to understand "average load"

Concept of average load

Average load is the average number of processes in the runnable or uninterruptible state per unit time, that is, the average number of active processes. It is not directly related to CPU utilization.

  • Runnable state: a process that is using the CPU or waiting for the CPU, that is, a process shown in the R state (Running or Runnable) by the ps command.
  • Uninterruptible state: a process in the middle of a critical kernel-mode code path that cannot be interrupted. The most common case is a process waiting for an I/O response from a hardware device, shown in the D state (Uninterruptible Sleep, also known as Disk Sleep) by the ps command.

Put simply, the average load is the average number of active processes. Intuitively that is the number of active processes per unit time, but what the system actually computes is an exponentially decaying average of the active-process count. You do not need the details of this "exponentially decaying average"; it is just a faster way for the system to compute the figure, and you can treat it as an ordinary average of the number of active processes.
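As a rough illustration of the "exponentially decaying average", here is a minimal sketch. It assumes one sample every 5 seconds and a 60-second time constant for the 1-minute figure; the article does not state these values, and the real kernel uses fixed-point arithmetic rather than floating point. decay_average is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch of an exponentially decaying average of the active-process count.
# Assumptions: one sample every 5 s, 60 s time constant (1-minute figure).

# usage: decay_average START_VALUE ACTIVE_PROCESSES STEPS
decay_average() {
    awk -v lavg="$1" -v active="$2" -v steps="$3" 'BEGIN {
        decay = exp(-5 / 60)        # share of the old value kept per sample
        for (i = 0; i < steps; i++)
            lavg = lavg * decay + active * (1 - decay)
        printf "%.2f\n", lavg
    }'
}

# Idle system, then 2 processes stay active for one minute (12 samples).
decay_average 0 2 12    # prints 1.26
```

After one minute the figure has covered only about 63% of the distance to 2, which is why the 1-minute value lags behind a sudden change in the number of active processes.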

Since the average load is the average number of active processes, the ideal case is exactly one process running on each CPU, so that every CPU is fully used. For example, what does an average load of 2 mean?

  • On a system with 2 CPUs, all CPUs are exactly fully occupied;
  • On a system with 4 CPUs, the CPUs are 50% idle;
  • On a system with only 1 CPU, half of the processes are left waiting for CPU time.
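The three cases above are the same division with a different CPU count. A minimal sketch, where load_per_cpu is a hypothetical helper name and the CPU count is assumed to be non-zero:

```shell
#!/bin/sh
# Express an average load as a percentage of the available CPU capacity.

# usage: load_per_cpu LOAD CPUS
load_per_cpu() {
    awk -v lavg="$1" -v cpus="$2" 'BEGIN { printf "%.0f%%\n", lavg / cpus * 100 }'
}

# The article's example: an average load of 2 on 2, 4 and 1 CPUs.
for cpus in 2 4 1; do
    printf '%s CPU(s): ' "$cpus"
    load_per_cpu 2 "$cpus"
done
# prints 100%, 50% and 200% respectively
```

A value under 100% means spare capacity, and a value over 100% means processes are queuing.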

What is a reasonable average load

Ideally, the average load equals the number of CPUs. So before judging the average load, first find out how many CPUs the system has, which you can read from the top command or from the file /proc/cpuinfo, for example:

$ grep 'model name' /proc/cpuinfo | wc -l
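Not every architecture prints a "model name" line for each CPU, so counting the "processor" lines can be a more portable variant. The sketch below assumes /proc/cpuinfo has one line starting with "processor" per logical CPU; count_cpus is a hypothetical helper name. Where GNU coreutils is installed, the nproc command reports a similar number.

```shell
#!/bin/sh
# Count logical CPUs from a cpuinfo-style file.

# usage: count_cpus [CPUINFO_FILE]   (defaults to /proc/cpuinfo)
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Only run against the live file where it exists (Linux).
if [ -r /proc/cpuinfo ]; then
    count_cpus
fi
```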

Use the uptime command to obtain the system's average load:

$ uptime
12:26:56 up 1 day, 14:20, 1 user, load average: 0.43, 0.72, 0.89

The last three numbers are the average load in the past 1 minute, 5 minutes and 15 minutes.

  • If the 1-minute, 5-minute and 15-minute values are the same or differ only slightly, the system load is stable.
  • If the 1-minute value is much lower than the 15-minute value, the load has been falling over the last minute, but it was high at some point in the past 15 minutes.
  • Conversely, if the 1-minute value is much higher than the 15-minute value, the load has been rising over the last minute. The rise may be temporary or it may continue, so keep watching it. Once the 1-minute average load approaches or exceeds the number of CPUs, the system is overloaded, and it is time to find out what is causing the load and how to reduce it.
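The comparison described above can be sketched as a small script. The 20% margin for "much lower" or "much higher" is an arbitrary assumption, parse_uptime and load_trend are hypothetical helper names, and the parsing assumes uptime prints decimals with a dot (for example under LC_ALL=C).

```shell
#!/bin/sh
# Classify the load trend from the 1-minute and 15-minute averages.

# Reads one line of uptime output on stdin, prints "ONE FIVE FIFTEEN".
parse_uptime() {
    sed -n 's/.*load average: *\([0-9.]*\), *\([0-9.]*\), *\([0-9.]*\).*/\1 \2 \3/p'
}

# usage: load_trend ONE_MIN FIFTEEN_MIN [MARGIN]
load_trend() {
    awk -v one="$1" -v fifteen="$2" -v margin="${3:-0.2}" 'BEGIN {
        if (one > fifteen * (1 + margin))      print "rising"
        else if (one < fifteen * (1 - margin)) print "falling"
        else                                   print "stable"
    }'
}

line='12:26:56 up 1 day, 14:20, 1 user, load average: 0.43, 0.72, 0.89'
values=$(printf '%s\n' "$line" | parse_uptime)
load_trend "${values%% *}" "${values##* }"    # prints falling
```

On a live system the same pipeline would start from `LC_ALL=C uptime | parse_uptime` instead of a fixed sample line.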

As another example, suppose the average load on a single-CPU system is 1.73, 0.60 and 7.98. The system was overloaded by 73% over the past minute and by 698% over the past 15 minutes. Overall, the load is trending down.
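The overload percentages in this example come from (load - CPUs) / CPUs. A minimal sketch of that arithmetic, with overload_pct as a hypothetical helper name and values at or below capacity reported as 0%:

```shell
#!/bin/sh
# Percentage by which an average load exceeds the number of CPUs.

# usage: overload_pct LOAD CPUS
overload_pct() {
    awk -v lavg="$1" -v cpus="$2" 'BEGIN {
        pct = (lavg - cpus) / cpus * 100
        if (pct < 0) pct = 0            # below capacity: no overload
        printf "%.0f%%\n", pct
    }'
}

# The article's single-CPU example: 1.73, 0.60 and 7.98.
for value in 1.73 0.60 7.98; do
    overload_pct "$value" 1
done
# prints 73%, 0% and 698%
```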


Average load and CPU utilization

Let's return to the meaning of average load: the average number of processes in the runnable or uninterruptible state per unit time. It therefore counts not only processes using the CPU, but also processes waiting for the CPU and processes waiting for I/O.

CPU utilization measures how busy the CPU is per unit time, and it does not necessarily track the average load.

For example:

  • For CPU-intensive processes, heavy CPU use raises the average load, and the two figures agree;
  • For I/O-intensive processes, waiting for I/O also raises the average load, but CPU utilization is not necessarily high;
  • A large number of processes waiting to be scheduled also raises the average load, and CPU utilization is relatively high as well.

How to analyze a high average load

Two tools are useful here, mpstat and pidstat:

  • mpstat is a commonly used multi-core CPU performance analysis tool, which is used to view the performance indicators of each CPU and the average indicators of all CPUs in real time;
  • pidstat is a commonly used process performance analysis tool, which is used to view the process performance indicators such as CPU, memory, I/O and context switching in real time;

① First, use mpstat to check CPU usage. If iowait is high, the high average load is probably caused by I/O;
② Then use pidstat to check the processes. If there are fewer busy processes than CPUs, or the %wait value of each process is very small, the high average load is probably caused by CPU-intensive processes;
③ If pidstat shows a very high %wait value for the processes, the high average load is most likely caused by many processes competing for the CPUs.
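The three steps can be sketched as a simple decision function. It assumes the load is already known to be high, and the 50% threshold is an arbitrary value chosen for illustration, not something mpstat or pidstat define. triage_load is a hypothetical helper name; its inputs are the highest per-CPU %iowait from mpstat and the highest per-process %wait from pidstat.

```shell
#!/bin/sh
# Rough triage of a high average load from two observed percentages.

# usage: triage_load MAX_IOWAIT_PCT MAX_PROCESS_WAIT_PCT [THRESHOLD]
triage_load() {
    awk -v iowait="$1" -v pwait="$2" -v limit="${3:-50}" 'BEGIN {
        if (iowait > limit)     print "io"          # step 1: I/O is the cause
        else if (pwait > limit) print "contention"  # step 3: processes compete for CPUs
        else                    print "cpu"         # step 2: CPU-intensive processes
    }'
}

triage_load 0.00 0.00     # values from scenario 1, prints cpu
triage_load 77.53 3.99    # values from scenario 2, prints io
triage_load 0.00 80.20    # values from scenario 3, prints contention
```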

Scenario 1: CPU intensive process

Run mpstat to check the change of CPU utilization:

# -P ALL monitors all CPUs; the trailing 2 prints a report every 2 seconds
$ mpstat -P ALL 2
Linux 5.32.0 (ubuntu) (2 CPU)
14:21:20     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14:21:26     all   50.05    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.95
14:21:26       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
14:21:26       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

One CPU is at 100% utilization while its iowait is 0, which shows that the rise in average load comes from that 100% CPU usage.

Which process is driving the CPU to 100%? Use pidstat to find out:
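Reading the mpstat output above by eye works for 2 CPUs but gets tedious on larger machines. The sketch below filters per-CPU rows by how little idle time they have. It assumes the column names shown above (CPU, %iowait, %idle) and finds their positions from the header row; nonidle_cpus is a hypothetical helper name and the 90% default is arbitrary.

```shell
#!/bin/sh
# List CPUs whose non-idle time (100 - %idle) reaches a threshold.

# usage: mpstat -P ALL 2 1 | nonidle_cpus [MIN_NONIDLE_PCT]
nonidle_cpus() {
    awk -v limit="${1:-90}" '
        /%idle/ {                       # header row: record column positions
            for (i = 1; i <= NF; i++) {
                if ($i == "CPU")     cpu = i
                if ($i == "%iowait") iow = i
                if ($i == "%idle")   idl = i
            }
            next
        }
        # per-CPU rows only: skip the "all" row and the trailing Average: rows
        cpu && $1 != "Average:" && $cpu ~ /^[0-9]+$/ {
            nonidle = 100 - $idl
            if (nonidle >= limit)
                printf "cpu=%s nonidle=%.2f iowait=%.2f\n", $cpu, nonidle, $iow
        }'
}

# Sample taken from the mpstat output above.
nonidle_cpus 90 <<'EOF'
14:21:20     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14:21:26     all   50.05    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.95
14:21:26       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
14:21:26       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
EOF
# prints: cpu=1 nonidle=100.00 iowait=0.00
```

Printing %iowait next to the non-idle figure makes it easy to tell a CPU that is computing from one that is mostly waiting on I/O.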

# Output a set of data after an interval of 2 seconds
$ pidstat -u 2 1
14:22:07      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
14:22:12        0      5684  100.00    0.00    0.00    0.00  100.00     1  tcp_test

Scenario 2: I/O intensive processes

Run mpstat to check the change of CPU utilization:

# Show the metrics for all CPUs and print one report after a 2-second interval
$ mpstat -P ALL 2 1
Linux 5.32.0 (ubuntu) (2 CPU)
14:25:48     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14:25:52     all    0.21    0.00   22.07   22.67    0.00    0.21    0.00    0.00    0.00   54.84
14:25:52       0    0.43    0.00   13.87   77.53    0.00    0.43    0.00    0.00    0.00    7.74
14:25:52       1    0.00    0.00    0.81    0.20    0.00    0.00    0.00    0.00    0.00   98.99

On one CPU, system CPU usage rose to 13.87% while iowait reached 77.53%, which shows that the rise in average load comes from iowait.

So which process is causing such high iowait? Again, use pidstat to find out:

# Print one report after a 5-second interval; -u selects CPU metrics
$ pidstat -u 5 1
Linux 5.32.0 (ubuntu) (2 CPU)
14:27:12      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
14:27:13        0       300    0.00    3.39    0.00    0.00    3.39     1  kworker/1:1H
14:27:13        0       301    0.00    0.40    0.00    0.00    0.40     0  kworker/0:1H
14:27:13        0      4526    2.00   35.53    0.00    3.99   37.52     1  tcp_test
14:27:13        0      6521    0.00    0.40    0.00    0.00    0.40     0  pidstat

Scenario 3: a large number of processes

Run pidstat to look at the processes:

# Print one report after a 5-second interval
$ pidstat -u 5 1
14:45:25      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
14:45:30        0      5960   30.00    0.00    0.00   79.80   25.00     0  tcp_test
14:45:30        0      5961   30.00    0.00    0.00   80.20   25.00     0  tcp_test
14:45:30        0      5962   30.00    0.00    0.00   79.80   25.00     1  tcp_test
14:45:30        0      5963   30.00    0.00    0.00   80.00   25.00     1  tcp_test
14:45:30        0      5964   19.80    0.00    0.00   79.60   24.80     0  tcp_test
14:45:30        0      5965   19.80    0.00    0.00   80.00   24.80     0  tcp_test
14:45:30        0      5966   19.80    0.00    0.00   79.60   24.80     1  tcp_test
14:45:30        0      5967   19.80    0.00    0.00   79.80   24.80     1  tcp_test
14:45:30        0      6512    0.00    0.20    0.00    0.20    0.20     0  pidstat

Eight processes are competing for two CPUs, and each process spends up to 80% of its time waiting for the CPU (the %wait column above). With more runnable processes than the CPUs can serve, the system ends up overloaded.
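Counting the waiting processes can be sketched with awk as well. This assumes a pidstat version that prints a %wait column, as in the output above, and locates the PID and %wait columns from the header row. waiting_processes is a hypothetical helper name and the 50% default is arbitrary.

```shell
#!/bin/sh
# List processes whose %wait reaches a threshold and count them.

# usage: pidstat -u 5 1 | waiting_processes [MIN_WAIT_PCT]
waiting_processes() {
    awk -v limit="${1:-50}" '
        /%wait/ {                       # header row: record column positions
            for (i = 1; i <= NF; i++) {
                if ($i == "PID")   pid = i
                if ($i == "%wait") wt = i
            }
            next
        }
        pid && $1 != "Average:" && $wt + 0 >= limit {
            n++
            print $pid, $NF, $wt        # PID, command name, %wait
        }
        END { printf "%d processes waiting\n", n }'
}

# Three rows taken from the pidstat output above.
waiting_processes 50 <<'EOF'
14:45:25      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
14:45:30        0      5960   30.00    0.00    0.00   79.80   25.00     0  tcp_test
14:45:30        0      5961   30.00    0.00    0.00   80.20   25.00     0  tcp_test
14:45:30        0      6512    0.00    0.20    0.00    0.20    0.20     0  pidstat
EOF
```

Comparing the final count with the number of CPUs gives a quick sense of how oversubscribed the machine is.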


Summary

  • A high average load may be caused by CPU-intensive processes;
  • A high average load does not necessarily mean high CPU utilization; it may mean that I/O is busier;
  • When the load is high, use tools such as mpstat and pidstat to find its source.

Tags: Linux Operation & Maintenance Load Balance

Posted on Sat, 20 Nov 2021 11:20:08 -0500 by alcapone