A summary of common Linux system monitoring commands, well worth bookmarking!

1. CPU

cat /proc/cpuinfo
# Number of physical CPUs (sockets)
grep 'physical id' /proc/cpuinfo | sort -u | wc -l
# Number of cores per CPU
grep 'core id' /proc/cpuinfo | sort -u | wc -l
# Number of logical CPUs
grep -c '^processor' /proc/cpuinfo
# Per-CPU utilization with mpstat: sample every 2 seconds, 10 times
mpstat 2 10
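
The three counts above can also be gathered in a single pass. The following is only a sketch: the function name cpu_topology is made up for this example, and it assumes the x86 layout of /proc/cpuinfo, where each processor block lists physical id before core id. Other architectures may omit those fields, in which case the socket and core counts come out as 0.

```shell
# cpu_topology [FILE]: summarize a cpuinfo-style file (default: /proc/cpuinfo).
# Hypothetical helper for illustration; assumes x86-style "physical id" and "core id" fields.
cpu_topology() {
    awk -F: '
        /^physical id/ { gsub(/[ \t]/, "", $2); socket = $2; sockets[socket] = 1 }
        /^core id/     { gsub(/[ \t]/, "", $2); cores[socket "," $2] = 1 }
        /^processor/   { logical++ }
        END {
            for (s in sockets) ns++
            for (c in cores) nc++
            # cores is the total number of distinct (socket, core) pairs
            printf "sockets=%d cores=%d logical=%d\n", ns, nc, logical
        }
    ' "${1:-/proc/cpuinfo}"
}
```

Calling cpu_topology with no argument reads the live /proc/cpuinfo. Unlike the per-CPU core count above, the cores figure here is the total across all sockets.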

2. Memory

cat /proc/meminfo
free -gt
df -hT
du -csh ./*

Operating system IPC shared memory / queue:

ipcs # shared memory segments, message queues, semaphores

Memory usage is something we need to check often. The commonly used commands are free, vmstat, top, dstat -m, etc.

2.1 free

> free -h
             total       used       free     shared    buffers     cached
Mem:          7.7G       6.2G       1.5G        17M        33M       184M
-/+ buffers/cache:       6.0G       1.7G
Swap:          24G       581M        23G

Meaning of each line

First line, Mem:

  • total: total physical memory, 7.7G. This is the actual amount of memory installed in the machine
  • used: memory in use, 6.2G. This value includes buffers and cached as well as the memory actually used by applications
  • free: unused memory, 1.5G
  • shared: shared memory, 17M
  • buffers: memory occupied by buffers, 33M
  • cached: memory occupied by the cache, 184M

These values satisfy:

total = used + free

The second line, -/+ buffers/cache, shows memory usage from the application's point of view:

  • The first value is used - buffers - cached, the memory actually used by applications
  • The second value is free + buffers + cached, the memory that is available in theory

These two values also add up to total
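
The arithmetic behind the -/+ buffers/cache line can be reproduced from /proc/meminfo. This is a minimal sketch, not how free itself is implemented: the function name mem_breakdown and the output labels are invented for this example, and it uses only the MemTotal, MemFree, Buffers and Cached fields.

```shell
# mem_breakdown [FILE]: recompute the "-/+ buffers/cache" figures in MiB
# from a meminfo-style file (default: /proc/meminfo). Hypothetical helper.
mem_breakdown() {
    awk '
        $1 == "MemTotal:" { total = $2 }     # values are in kB
        $1 == "MemFree:"  { free = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached = $2 }
        END {
            used = total - free              # total = used + free
            printf "app_used=%dM reclaimable_free=%dM\n",
                (used - buffers - cached) / 1024,
                (free + buffers + cached) / 1024
        }
    ' "${1:-/proc/meminfo}"
}
```

With the sample output above, this corresponds to 6.2G - 33M - 184M ≈ 6.0G used by applications and 1.5G + 33M + 184M ≈ 1.7G available in theory.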

The third line, Swap, shows the usage of the swap partition: total, used, and free

Cache

cached is the memory used to cache file data. When the system reads a file, it must first load the data from the hard disk into memory. Because the hard disk is much slower than memory, this is very time-consuming.

To improve efficiency, Linux keeps the files it has read cached in memory (the locality principle). Even after the program exits, the cache is not released automatically. This is why memory utilization rises after a program reads a large number of files.

When other programs need memory, Linux releases these unused caches to them according to its own cache strategy (such as LRU). The cache can also be released manually (this requires root):

echo 1 > /proc/sys/vm/drop_caches

Buffer

Consider writing data from memory to the hard disk. Because the hard disk is so slow, a program that had to wait for each write to finish before continuing would be very inefficient and would run slowly. This is what the buffer is for.

When a program writes data to the hard disk, the data is placed in the buffer first. Writing to the buffer is fast, so the program can continue with other work while the data in the buffer is written to the hard disk in the background, which improves read and write efficiency.

For example, when copying a very large file to a USB flash drive, the copy sometimes appears to have finished while the system still reports that the drive is in use. The buffer is the reason: the copying program has put the data into the buffer, but not all of it has been written to the USB flash drive yet.

Similarly, you can use the sync command to manually flush the contents of the buffer:

> sync --help

Usage: sync [OPTION] [FILE]...
Synchronize cached writes to persistent storage

If one or more files are specified, sync only them,
or their containing file systems.

  -d, --data             sync only file data, no unneeded metadata
  -f, --file-system      sync the file systems that contain the files
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <>
Full documentation at: <>
or available locally via: info '(coreutils) sync invocation'
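
To see the effect of sync, you can watch how much data is still waiting to be written. A small sketch, assuming the Dirty and Writeback fields of /proc/meminfo; the function name dirty_kb is made up for this example.

```shell
# dirty_kb [FILE]: print the kB of data not yet written to disk
# (Dirty + Writeback) from a meminfo-style file (default: /proc/meminfo).
dirty_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { sum += $2 }
         END { print sum + 0 }' "${1:-/proc/meminfo}"
}
```

Running dirty_kb, then sync, then dirty_kb again should normally show the number dropping towards 0, although other processes may dirty new pages in the meantime.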

Swap partition

The swap partition is an important part of how virtual memory is implemented. Swap uses part of the hard disk as if it were memory. Running programs use physical memory, and memory that is not in use is moved to the hard disk, which is called swap out. Moving memory from the swap partition back into physical memory is called swap in.

A swap partition logically expands the memory space, but it also slows the system down because hard disk reads and writes are very slow. Linux moves memory that is not often used into the swap partition.
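
To find out which processes have memory sitting in swap, one option is to read the VmSwap field of each /proc/<PID>/status file. The sketch below is only an illustration: the function name swap_by_process and its output format are invented here, and kernel threads (which have no VmSwap line) are simply skipped.

```shell
# swap_by_process [PROCROOT]: print "kB pid name" for every process with
# VmSwap > 0, largest first. PROCROOT defaults to /proc. Hypothetical helper.
swap_by_process() {
    for status in "${1:-/proc}"/[0-9]*/status; do
        [ -r "$status" ] || continue
        awk '
            $1 == "Name:"   { name = $2 }
            $1 == "Pid:"    { pid = $2 }
            $1 == "VmSwap:" { swap = $2 }
            END { if (swap > 0) print swap, pid, name }
        ' "$status" 2>/dev/null
    done | sort -rn
}
```

Running it as root shows all processes; as an ordinary user, some status files may be unreadable and are skipped.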

The difference between cache and buffer

  • Cache: memory used by the page cache, which is the cache of the file system. File-level data is cached in the page cache
  • Buffer: memory used by the buffer cache, which is the cache of disk blocks. Data read from or written to the disk directly is cached in the buffer cache

Simply put, the page cache caches file data and the buffer cache caches disk data. When files are accessed through a file system, the data is cached in the page cache. When tools such as dd read from or write to the disk directly, the data is cached in the buffer cache.

2.2 vmstat

vmstat (virtual memory statistics) reports on the overall state of the system, including statistics on processes, virtual memory, disks, interrupts and CPU activity:

> vmstat --help

 vmstat [options] [delay [count]]

 -a, --active           active/inactive memory
 -f, --forks            number of forks since boot
 -m, --slabs            slabinfo
 -n, --one-header       do not redisplay header
 -s, --stats            event counter statistics
 -d, --disk             disk statistics
 -D, --disk-sum         summarize disk statistics
 -p, --partition <dev>  partition specific statistics
 -S, --unit <char>      define display unit
 -w, --wide             wide output
 -t, --timestamp        show timestamp

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see vmstat(8).

> vmstat -SM 1 100 # 1 indicates refresh interval (seconds), 100 indicates printing times, in MB

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0    470    188   1154    0    0     0     4    3    0  0  0 99  0  0
 0  0      0    470    188   1154    0    0     0     0  112  231  1  1 98  0  0
 0  0      0    470    188   1154    0    0     0     0   91  176  0  0 100  0  0
 0  0      0    470    188   1154    0    0     0     0  118  229  1  0 99  0  0
 0  0      0    470    188   1154    0    0     0     0   78  156  0  0 100  0  0
 0  0      0    470    188   1154    0    0     0    64   84  186  0  1 97  2  0

  • r column: the number of processes running or waiting for a CPU time slice. If this value stays above the number of CPUs for a long time, CPU resources are insufficient and you can consider adding CPUs
  • b column: the number of processes blocked waiting for resources, such as I/O or memory swapping
  • swpd column: the amount of swap space in use. If swpd is non-zero or even fairly large, but si and so stay at 0 for a long time, system performance is not affected for the time being
  • free column: the current amount of free physical memory
  • buff column: the size of the buffer cache. Reads and writes to block devices generally go through buffers
  • cache column: the size of the page cache, generally used as the file system cache. Frequently accessed files are cached. A large cache value means many files are cached, and if bi under io is small at the same time, the file system is working efficiently
  • si column: swap in, the amount of memory moved from the swap partition back into physical memory per second
  • so column: swap out, the amount of unused memory moved from physical memory to the swap partition per second
  • bi column: the amount of data read from block devices (disk reads), in blocks per second (1 block is 1 KB)
  • bo column: the amount of data written to block devices (disk writes), in blocks per second (1 block is 1 KB)

The reference value used here for bi+bo is 1000. If it exceeds 1000 and the wa value is also relatively high, disk I/O is a performance bottleneck

  • in column: the number of device interrupts per second observed during the interval
  • cs column: the number of context switches per second

The larger these two values are, the more CPU time the kernel consumes

  • us column: the percentage of CPU time consumed by user processes. A high us value means user processes are consuming a lot of CPU time. If it stays above 50% for a long time, consider optimizing the program
  • sy column: the percentage of CPU time consumed by the kernel. A high sy value means the kernel is consuming a lot of CPU time. If us+sy exceeds 80%, CPU resources are insufficient
  • id column: the percentage of time the CPU is idle
  • wa column: the percentage of CPU time spent waiting for I/O. The higher the wa value, the more severe the I/O wait. A wa value above 20% indicates serious I/O wait
  • st column: CPU steal time, relevant for virtual machines
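
The rules of thumb above (r greater than the number of CPUs, us+sy over 80, wa over 20, bi+bo over 1000 together with high wa) can be applied mechanically to vmstat output. The following is a rough sketch, with the function name vmstat_check and the message format invented for this example. It assumes the default 17-column layout shown above and flags individual samples, so deciding whether a condition has held "for a long time" is still up to you.

```shell
# vmstat_check [NCPU]: read default vmstat output on stdin and report samples
# that exceed the rule-of-thumb thresholds from this article. Hypothetical helper.
# Column positions: r=$1 bi=$9 bo=$10 us=$13 sy=$14 wa=$16
vmstat_check() {
    awk -v ncpu="${1:-1}" '
        $1 !~ /^[0-9]+$/ { next }            # skip the two header lines
        {
            n++
            msg = ""
            if ($1 > ncpu)                    msg = msg " run-queue(r=" $1 ")"
            if ($13 + $14 > 80)               msg = msg " cpu(us+sy=" ($13 + $14) ")"
            if ($16 > 20)                     msg = msg " iowait(wa=" $16 ")"
            if ($9 + $10 > 1000 && $16 > 20)  msg = msg " disk(bi+bo=" ($9 + $10) ")"
            if (msg != "") print "sample " n ":" msg
        }
    '
}
```

A typical invocation would be: vmstat 1 10 | vmstat_check "$(nproc)". Keep in mind that the first sample vmstat prints is an average since boot, not a measurement of the last interval.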

3. Network

3.1 interface


3.2 ports

# port
netstat -ntlp # TCP
netstat -nulp # UDP
netstat -nxlp # UNIX
netstat -nalp # Show not only the listening port, but also the connections in other stages
lsof -p <PID> -P
lsof -i :5900
sar -n DEV 1  # network flow
ss -s
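
When ss -s is not detailed enough, counting connections per TCP state is a common next step. A small sketch, assuming the usual ss -tan layout where the first line is a header and the first column is the state; the function name tcp_states is made up for this example.

```shell
# tcp_states: read `ss -tan` output on stdin and print "count state",
# most frequent state first. Hypothetical helper for illustration.
tcp_states() {
    awk 'NR > 1 { count[$1]++ }
         END { for (s in count) print count[s], s }' | sort -rn
}
```

Usage: ss -tan | tcp_states. A large number of TIME-WAIT or CLOSE-WAIT entries is usually the first thing to look for.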

3.3 tcpdump

sudo tcpdump -i any -XNnvvv
sudo tcpdump -i any -XNnvvv udp
sudo tcpdump -i any -XNnvvv udp port 20112
# Quote the filter so the shell does not try to expand the brackets
sudo tcpdump -i any -XNnvvv 'udp port 20112 and ip[0x1f:02]=0x4e91'

3.4 nethogs

Monitor the network traffic of each process


4. I/O performance

iostat -kx 2
vmstat -SM
vmstat 2 10
dstat --top-io --top-bio

5. Process

top -H
ps auxf
ps -eLf # Show threads
ls /proc/<PID>/task
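
Each entry under /proc/<PID>/task is one thread, so counting the entries gives the thread count of a process. A minimal sketch; the function name thread_count and the optional second argument (an alternative proc root, handy for testing) are assumptions of this example.

```shell
# thread_count PID [PROCROOT]: number of threads of PID, counted as the
# entries under PROCROOT/PID/task (PROCROOT defaults to /proc).
# Prints 0 if the process does not exist. Hypothetical helper.
thread_count() {
    ls "${2:-/proc}/$1/task" 2>/dev/null | wc -l | tr -d ' '
}
```

For example, thread_count $$ prints the thread count of the current shell. With procps, ps -o nlwp= -p <PID> should report the same number.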

5.1 top

The most commonly used command is top. Its interactive help screen (press h inside top) lists the available keys:

Help for Interactive Commands - procps version 3.2.8
Window 1:Def: Cumulative mode Off.  System: Delay 3.0 secs; Secure mode Off.

  Z,B       Global: 'Z' change color mappings; 'B' disable/enable bold
  l,t,m     Toggle Summaries: 'l' load avg; 't' task/cpu stats; 'm' mem info
  1,I       Toggle SMP view: '1' single/separate states; 'I' Irix/Solaris mode

  f,o     . Fields/Columns: 'f' add or remove; 'o' change display order
  F or O  . Select sort field
  <,>     . Move sort field: '<' next col left; '>' next col right
  R,H     . Toggle: 'R' normal/reverse sort; 'H' show threads
  c,i,S   . Toggle: 'c' cmd name/line; 'i' idle tasks; 'S' cumulative time
  x,y     . Toggle highlights: 'x' sort field; 'y' running tasks
  z,b     . Toggle: 'z' color/mono; 'b' bold/reverse (only if 'x' or 'y')
  u       . Show specific user only
  n or #  . Set maximum tasks displayed

  k,r       Manipulate tasks: 'k' kill; 'r' renice
  d or s    Set update interval
  W         Write configuration file
  q         Quit
          ( commands shown with '.' require a visible task display window ) 
Press 'h' or '?' for help with Windows,
any other key to continue

Commonly used keys:

  • 1: Show the usage of each CPU separately
  • c: Toggle between the command name and the full command line
  • H: Show threads
  • P: Sort by CPU usage
  • M: Sort by memory usage
  • R: Reverse the sort order
  • Z: Change color mappings
  • B: Disable/enable bold
  • l: Toggle load avg
  • t: Toggle task/cpu stats
  • m: Toggle mem info

Meaning of the CPU state fields in the summary area:

us - Time spent in user space
sy - Time spent in kernel space
ni - Time spent running niced user processes (user-defined priority)
id - Time spent idle
wa - Time spent waiting on I/O peripherals (e.g. disk)
hi - Time spent handling hardware interrupt routines (whenever a peripheral wants attention from the CPU, it literally pulls a line to signal the CPU to service it)
si - Time spent handling software interrupt routines (a piece of code calls an interrupt routine)
st - Time spent in involuntary wait by the virtual CPU while the hypervisor is servicing another processor (time stolen from a virtual machine)
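
These percentages can be derived from the counters on the first line of /proc/stat, whose fields are, in order: user, nice, system, idle, iowait, irq, softirq, steal. The sketch below shows the idea of taking two snapshots and dividing each delta by the total delta. It is not top's actual implementation, and the function name cpu_percent is invented for this example.

```shell
# cpu_percent "PREV" "CURR": each argument is a "cpu ..." line from /proc/stat.
# Prints the share of each state between the two snapshots. Hypothetical helper.
cpu_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) prev[i] = $i; next }
        {
            # fields 2..9: user nice system idle iowait irq softirq steal
            for (i = 2; i <= 9; i++) { d[i] = $i - prev[i]; total += d[i] }
            if (total == 0) total = 1        # avoid division by zero
            printf "us=%.1f ni=%.1f sy=%.1f id=%.1f wa=%.1f hi=%.1f si=%.1f st=%.1f\n",
                100 * d[2] / total, 100 * d[3] / total, 100 * d[4] / total,
                100 * d[5] / total, 100 * d[6] / total, 100 * d[7] / total,
                100 * d[8] / total, 100 * d[9] / total
        }'
}
```

Usage: a=$(head -n1 /proc/stat); sleep 1; b=$(head -n1 /proc/stat); cpu_percent "$a" "$b"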

5.2 lsof

lsof -P -p 123

6. Performance test

stress --cpu 8 \
       --io 4  \
       --vm 2  \
       --vm-bytes 128M \
       --timeout 60s

time command

7. Users


8. System status


9. Hardware equipment

lsblk -fm # Display file system, permissions
lshw -c display

10. File system

# mount 
cat /etc/fstab
df -hT

11. Kernel and interrupt

cat /proc/modules
sysctl -a | grep ...
cat /proc/interrupts

12. System log and kernel log

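less /var/log/messages  # general system log (RHEL/CentOS)
less /var/log/secure    # authentication log (RHEL/CentOS)
less /var/log/auth.log  # authentication log (Debian/Ubuntu)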

13. cron scheduled tasks

crontab -l
crontab -l -u nobody
# View cron jobs for all users
sudo find /var/spool/cron/ -type f -exec cat {} +

14. Debugging tools

14.1 perf

14.2 strace

The strace command is used to print system calls and signals:

strace -p <PID>
strace -p 5191 -f
strace -e trace=signal -p 5191

-e trace=open
-e trace=file
-e trace=process
-e trace=network
-e trace=signal
-e trace=ipc
-e trace=desc
-e trace=memory

14.3 ltrace

The ltrace command is used to trace calls into dynamic libraries:

ltrace -p <PID>
ltrace -S # also show system calls

15. Scenario cases

Scenario 1: after connecting to the server

w       # Show who is logged in, their login IP, and what they are running
last    # Show recent logins and server reboot times
uptime  # Show uptime, number of logged-in users, and load average
history # Show command history

Scenario 2: what information does the /proc directory contain

cat /proc/...
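
As one illustration of what lives under /proc, the sketch below reads two well-known files, /proc/uptime (seconds since boot) and /proc/loadavg (1, 5 and 15 minute load averages). It is far from a complete tour of /proc, and the function name proc_overview and its output labels are invented for this example.

```shell
# proc_overview [PROCROOT]: print uptime in hours and the three load averages,
# read from PROCROOT/uptime and PROCROOT/loadavg (PROCROOT defaults to /proc).
# Hypothetical helper for illustration.
proc_overview() {
    root=${1:-/proc}
    awk '{ printf "uptime_hours=%.1f\n", $1 / 3600 }' "$root/uptime"
    awk '{ printf "load1=%s load5=%s load15=%s\n", $1, $2, $3 }' "$root/loadavg"
}
```

The same information is what uptime prints in a friendlier format.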


Scenario 3: executing commands in the background

nohup <command> &>[some.log] &
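
Note that &> is a bash shorthand; in other shells the portable spelling is > file 2>&1. A small wrapper along these lines also records the PID so the job can be checked or killed later. This is just a sketch, and the function name run_bg is made up for this example.

```shell
# run_bg LOGFILE COMMAND [ARGS...]: start COMMAND under nohup in the background,
# send stdout and stderr to LOGFILE, and print the PID of the background job.
# Hypothetical helper for illustration.
run_bg() {
    bg_log=$1
    shift
    nohup "$@" > "$bg_log" 2>&1 &
    echo "$!"
}
```

Usage: run_bg /tmp/job.log sleep 600 prints the PID, which can later be passed to kill or checked with ps -p.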

Some commands

# comprehensive
dstat & sar
# performance analysis 
# process
pstree -p
Ctrl+z & jobs & fg
# network
# Disk I/O
# virtual machine
# user
# Running time
# disk
# permissions
# service
systemctl list-unit-files
# location
# performance testing 

Tags: Linux Operation & Maintenance

Posted on Mon, 01 Nov 2021 21:56:35 -0400 by TreColl