GPU server troubleshooting

1. GPU log collection

On a system with the GPU driver installed, execute the following command in any directory: nvidia-bug-report.sh

After the command finishes, a compressed log package named nvidia-bug-report.log.gz is generated in the current directory.
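
For example (a minimal sketch; the root prompt matches the examples later in this article, and the listing output is illustrative):

[root@zj ~]# nvidia-bug-report.sh
[root@zj ~]# ls nvidia-bug-report.log.gz
nvidia-bug-report.log.gz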

2. GPU basic state detection

For GPU servers, the following configuration is recommended:

  • Maintain a recent, correct GPU driver version
  • Disable the nouveau module
  • Enable GPU driver persistence mode and configure it to start automatically at boot

When handling a GPU server failure, whenever shutting down the server is involved, it is recommended to check the basic GPU status, including:

whether the nouveau module is disabled, GPU recognition, GPU driver persistence mode, GPU bandwidth, GPU ECC errors, GPU ERR! errors, and GPU NVLink status.
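
The individual commands are explained in the following subsections; as a convenience, a minimal shell sketch that runs them in one pass might look like this (it only prints raw output, and pass/fail thresholds are left to the operator):

#!/bin/bash
# Basic GPU status checks described in this section (a sketch, run as root)

echo "== nouveau module (no output means disabled) =="
lsmod | grep -i nouveau

echo "== GPUs seen by lspci =="
lspci | grep -i nvidia

echo "== GPUs, persistence mode, and ERR! entries seen by nvidia-smi =="
nvidia-smi

echo "== Link width (rated vs. current) =="
nvidia-smi -q | grep -i -A 2 'Link width'

echo "== Retired pages / ECC =="
nvidia-smi -q -d PAGE_RETIREMENT | grep -iE 'single bit ecc|double bit ecc|pending'

echo "== NVLink status (only meaningful on NVLink-capable GPUs) =="
nvidia-smi nvlink -s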

2.1 nouveau module disable check

Nouveau is an open-source driver for NVIDIA graphics cards developed by the community. It conflicts with NVIDIA's official GPU driver, so the nouveau module must be disabled on the system.

# No output from the following command indicates that the nouveau module is disabled
[root@zj ~]# lsmod | grep -i nouveau

# The following output indicates that the nouveau module is not disabled
[root@zj ~]# lsmod | grep -i nouveau
nouveau              1662531  0
mxm_wmi                13021  1 nouveau
wmi                    19086  2 mxm_wmi,nouveau
i2c_algo_bit           13413  1 nouveau
video                  24538  1 nouveau
drm_kms_helper        176920  2 nouveau,vmwgfx
ttm                    99555  2 nouveau,vmwgfx
drm                   397988  6 ttm,drm_kms_helper,nouveau,vmwgfx
i2c_core               63151  5 drm,i2c_piix4,drm_kms_helper,i2c_algo_bit,nouveau

The nouveau module can be disabled as follows:

  • CentOS 7:
# Edit or create a new blacklist-nouveau.conf file
[root@zj ~]# vim /usr/lib/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# Rebuild the initramfs and reboot for the change to take effect
[root@zj ~]# dracut --force
[root@zj ~]# shutdown -r now

2.2 GPU driver persistence mode

Enabling GPU driver persistence mode (memory-resident mode) helps avoid problems such as GPUs dropping off the bus, reduced GPU bandwidth, and GPU temperature monitoring failures. It is recommended to enable persistence mode and configure it to start automatically at boot.

Common ways to check whether GPU driver persistence mode is enabled:

  • The Persistence-M column in the nvidia-smi output shows On

  • In nvidia-bug-report.log, Persistence Mode is Enabled

nvidia-bug-report.log excerpt:

GPU 00000000:3B:00.0
    Product Name                    : Tesla P40
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
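
If only the compressed bug report is at hand, the same field can be checked without unpacking it (a sketch):

# Check the persistence mode recorded in the compressed bug report
zgrep -i 'Persistence Mode' nvidia-bug-report.log.gz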

Please make sure that on the on-site server:

  • GPU driver persistence mode is enabled

  • It is configured to start automatically at boot

To enable GPU driver persistence mode, execute:

nvidia-smi -pm 1

# Or, the following command works on newer GPU drivers
nvidia-persistenced --persistence-mode

Example of boot configuration:

# Edit /etc/rc.d/rc.local
vim /etc/rc.d/rc.local
# Add the following line to the file
nvidia-smi -pm 1
# Give the /etc/rc.d/rc.local file executable permission
chmod +x /etc/rc.d/rc.local
# Reboot the system to verify
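
On systemd-based systems, an alternative is to enable the persistence daemon service shipped with the NVIDIA driver (an assumption: whether the nvidia-persistenced unit is present depends on how the driver was installed):

# Enable and start the persistence daemon at boot, if the unit is installed
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced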

2.3 GPU recognition check

When checking GPU recognition, first ensure that the lspci command recognizes all GPUs, and then ensure that the nvidia-smi command recognizes all GPUs. A quick count comparison is sketched after the example output below.

  • lspci check GPU recognition

    In the output of the lspci | grep -i nvidia command, ensure that all GPUs are recognized and that each GPU line ends with (rev a1).

    A line ending in (rev ff) indicates that the GPU is in an abnormal state.

  # In the following output, GPUs ending in (rev a1) are in a normal state
  # The GPU at 41:00.0 ends in (rev ff), indicating that its status is abnormal
  ~]# lspci | grep -i nvidia
  3e:00.0 3D controller: NVIDIA Corporation Device 1db8 (rev a1)
  3f:00.0 3D controller: NVIDIA Corporation Device 1db8 (rev a1)
  40:00.0 3D controller: NVIDIA Corporation Device 1db8 (rev a1)
  41:00.0 3D controller: NVIDIA Corporation Device 1db8 (rev ff)
  • nvidia-smi check GPU recognition
    # nvidia-smi
    Thu Dec 26 09:53:57 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM3...  On   | 00000000:3E:00.0 Off |                    0 |
    | N/A   42C    P0    54W / 350W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM3...  On   | 00000000:3F:00.0 Off |                    0 |
    | N/A   40C    P0    48W / 350W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM3...  On   | 00000000:40:00.0 Off |                    0 |
    | N/A   40C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM3...  On   | 00000000:41:00.0 Off |                    0 |
    | N/A   43C    P0    54W / 350W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
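
As a quick cross-check (a sketch that assumes the GPUs appear as 3D controller devices, as in the lspci output above), compare the number of GPUs seen on the PCI bus with the number seen by the driver:

# GPUs seen on the PCI bus
lspci | grep -i nvidia | grep -ci '3d controller'

# GPUs seen by the driver; the two counts should match
nvidia-smi -L | wc -l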

2.4 GPU bandwidth check

Ensure that the current GPU link width is consistent with the rated link width; in general, x16 indicates a normal state.

You can check the GPU link width with either the lspci command or the nvidia-smi command.

# lspci check
# Rated link width:
lspci -vvd 10de: | grep -i Lnkcap:
# Current link width:
lspci -vvd 10de: | grep -i Lnksta:

# nvidia-smi check
nvidia-smi -q | grep -i -A 2 'Link width'
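
To see the rated and current link status side by side for every NVIDIA device, a small loop over the PCI addresses can help (a sketch; run as root so lspci can read the capability registers):

# Print LnkCap (rated) and LnkSta (current) for each NVIDIA PCI device
for dev in $(lspci -d 10de: | awk '{print $1}'); do
    echo "== $dev =="
    lspci -vvs "$dev" | grep -iE 'lnkcap:|lnksta:'
done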

2.5 GPU ECC count check

The GPU ECC count can be checked as follows.

In the output below, Pending is No, which means that all memory pages with ECC errors have already been retired; those address spaces will not be used by software again and will not affect program operation.

Pending : No

Pending : Yes indicates that there are ECC error addresses still waiting to be retired; the system must be rebooted or the GPU reset before the flag returns to No.

# Use the -i parameter to specify a GPU id and query the ECC count of that GPU
# nvidia-smi -i <target gpu> -q -d PAGE_RETIREMENT
    ...
    Retired pages
    Single Bit ECC             : 2
    Double Bit ECC             : 0
    Pending                    : No

# Query the ECC count of all GPUs by omitting the -i parameter
# nvidia-smi -q -d PAGE_RETIREMENT

You can also check it with the nvidia-smi -q | grep -i 'bit ecc' command.

For GPUs with an ECC count, replace the GPU according to your company's threshold requirements. In addition, make sure that the error address space of any GPU with an ECC count has been retired, that is, Pending : No.
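
For a compact per-GPU summary, the retired-page counters can also be read as query fields (a sketch; the field names below assume a reasonably recent driver):

# Retired-page counts and pending status for every GPU, in CSV form
nvidia-smi --query-gpu=index,serial,retired_pages.sbe,retired_pages.dbe,retired_pages.pending --format=csv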

2.6 GPU ERR error detection

During GPU operation, fan ERR! and power ERR! errors can appear. You can detect them by checking whether the nvidia-smi output contains ERR!.

# nvidia-smi
Thu Dec 26 09:53:57 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:3E:00.0 Off |                    0 |
| ERR!  44C     P0   ERR!/ 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
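
A simple way to automate this check (a sketch): count the ERR! occurrences in the nvidia-smi output; a non-zero count means at least one GPU is reporting an error.

# Count ERR! occurrences; 0 means no ERR! error is currently reported
nvidia-smi | grep -c 'ERR!'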

2.7 GPU serial number query

# nvidia-smi -q | grep -i serial
    Serial Number                   : 0324018045603
    Serial Number                   : 0324018044864
    Serial Number                   : 0324018027716
    Serial Number                   : 0323918059881

# Query the serial number of a GPU with a specific id using nvidia-smi -q -i <id>
# nvidia-smi -q -i 0 | grep -i serial
    Serial Number                   : 0324018045603
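
When deciding which physical GPU to replace, it helps to map GPU index, PCI bus ID, and serial number in one listing (a sketch; the query fields assume a reasonably recent driver):

# Map GPU index, PCI bus ID, and serial number
nvidia-smi --query-gpu=index,pci.bus_id,serial --format=csv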

3. GPU fault diagnosis process

The following sections give flow charts for common GPU fault inspection.

3.1 GPU basic status check

3.2 GPU not recognized by lspci

![02 lspci GPU not recognized](https://gitee.com/Gavin_zj/blog/raw/master/blog_img/02-lspci GPU does not recognize.svg)

3.3 GPU not recognized by nvidia-smi

![03 nvidia-smi GPU not recognized](https://gitee.com/Gavin_zj/blog/raw/master/blog_img/03-nvidia-smi GPU does not recognize.svg)

3.4 GPU ERR error

![04-GPU ERR](https://gitee.com/Gavin_zj/blog/raw/master/blog_img/04-GPU ERR.svg)

3.5 GPU ECC error detection

![05-GPU ECC error 2](https://gitee.com/Gavin_zj/blog/raw/master/blog_img/05-GPU ECC error 2.svg)

3.6 GPU bandwidth check

3.7 GPU backplane cannot be powered up

3.8 GPU driver version check
