GPU server troubleshooting

1. GPU log collection

On a system with the GPU driver installed, execute the following command in any directory: nvidia-bug-report.sh

After the command finishes, a compressed log package named nvidia-bug-report.log.gz is generated in the current directory.
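
For example (a minimal sketch; the root prompt matches the examples later in this article, and the listing output is illustrative):

[root@zj ~]# nvidia-bug-report.sh
[root@zj ~]# ls nvidia-bug-report.log.gz
nvidia-bug-report.log.gz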

2. GPU basic state detection

For GPU servers, the following configuration is recommended:

  • Maintain a recent, correct GPU driver version
  • Disable the nouveau module
  • Enable GPU driver persistence mode and configure it to start automatically at boot

When handling a GPU server failure, whenever shutting down the server is involved, it is recommended to check the basic GPU status, including:

whether the nouveau module is disabled, GPU recognition, GPU driver persistence mode, GPU bandwidth, GPU ECC errors, GPU ERR! errors, and GPU NVLink status.
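
The individual commands are explained in the following subsections; as a convenience, a minimal shell sketch that runs them in one pass might look like this (it only prints raw output, and pass/fail thresholds are left to the operator):

#!/bin/bash
# Basic GPU status checks described in this section (a sketch, run as root)

echo "== nouveau module (no output means disabled) =="
lsmod | grep -i nouveau

echo "== GPUs seen by lspci =="
lspci | grep -i nvidia

echo "== GPUs, persistence mode, and ERR! entries seen by nvidia-smi =="
nvidia-smi

echo "== Link width (rated vs. current) =="
nvidia-smi -q | grep -i -A 2 'Link width'

echo "== Retired pages / ECC =="
nvidia-smi -q -d PAGE_RETIREMENT | grep -iE 'single bit ecc|double bit ecc|pending'

echo "== NVLink status (only meaningful on NVLink-capable GPUs) =="
nvidia-smi nvlink -s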

2.1 nouveau module disable check

Nouveau is an open-source driver for NVIDIA graphics cards developed by the community. It conflicts with NVIDIA's official GPU driver, so the nouveau module must be disabled on the system.

# No output from the following command indicates that the nouveau module is disabled
[root@zj ~]# lsmod | grep -i nouveau

# The following output indicates that the nouveau module is not disabled
[root@zj ~]# lsmod | grep -i nouveau
nouveau              1662531  0
mxm_wmi                13021  1 nouveau
wmi                    19086  2 mxm_wmi,nouveau
i2c_algo_bit           13413  1 nouveau
video                  24538  1 nouveau
drm_kms_helper        176920  2 nouveau,vmwgfx
ttm                    99555  2 nouveau,vmwgfx
drm                   397988  6 ttm,drm_kms_helper,nouveau,vmwgfx
i2c_core               63151  5 drm,i2c_piix4,drm_kms_helper,i2c_algo_bit,nouveau

The nouveau module can be disabled as follows:

  • CentOS 7:
# Edit or create a new blacklist-nouveau.conf file
[root@zj ~]# vim /usr/lib/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# Rebuild the initramfs and reboot for the change to take effect
[root@zj ~]# dracut --force
[root@zj ~]# shutdown -r now

2.2 GPU driver persistence mode

Enabling GPU driver persistence mode (memory-resident mode) helps avoid problems such as GPUs dropping off the bus, reduced GPU bandwidth, and GPU temperature monitoring failures. It is recommended to enable persistence mode and configure it to start automatically at boot.

Common ways to check whether GPU driver persistence mode is enabled:

  • The Persistence-M column in the nvidia-smi output shows On

  • In nvidia-bug-report.log, Persistence Mode is Enabled

nvidia-bug-report.log excerpt:

GPU 00000000:3B:00.0
    Product Name                    : Tesla P40
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
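
If only the compressed bug report is at hand, the same field can be checked without unpacking it (a sketch):

# Check the persistence mode recorded in the compressed bug report
zgrep -i 'Persistence Mode' nvidia-bug-report.log.gz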

Please make sure that on the on-site server:

  • GPU driver persistence mode is enabled

  • It is configured to start automatically at boot

To enable GPU driver persistence mode, execute:

nvidia-smi -pm 1

# Or, the following command works on newer GPU drivers
nvidia-persistenced --persistence-mode

Example of boot configuration:

# Edit /etc/rc.d/rc.local
vim /etc/rc.d/rc.local
# Add the following line to the file
nvidia-smi -pm 1
# Give the /etc/rc.d/rc.local file executable permission
chmod +x /etc/rc.d/rc.local
# Reboot the system to verify
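
On systemd-based systems, an alternative is to enable the persistence daemon service shipped with the NVIDIA driver (an assumption: whether the nvidia-persistenced unit is present depends on how the driver was installed):

# Enable and start the persistence daemon at boot, if the unit is installed
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced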

2.3 GPU recognition check

When checking GPU recognition, first ensure that the lspci command recognizes all GPUs, and then ensure that the nvidia-smi command recognizes all GPUs. A quick count comparison is sketched after the example output below.

  • lspci check GPU recognition

    In the output of the lspci | grep -i nvidia command, ensure that all GPUs are recognized and that each GPU line ends with (rev a1).

    A line ending in (rev ff) indicates that the GPU is in an abnormal state.

  # In the following output, GPUs ending in (rev a1) are in a normal state
  # The GPU at 41:00.0 ends in (rev ff), indicating that its status is abnormal
  ~]# lspci | grep -i nvidia
  3e:00.0 3D controller: NVIDIA Corporation Device 1db8 (rev a1)
  3f:00.0 3D controller: NVIDIA Corporation Device 1db8 (rev a1)
  40:00.0 3D controller: NVIDIA Corporation Device 1db8 (rev a1)
  41:00.0 3D controller: NVIDIA Corporation Device 1db8 (rev ff)
  • nvidia-smi check GPU recognition
    # nvidia-smi
    Thu Dec 26 09:53:57 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM3...  On   | 00000000:3E:00.0 Off |                    0 |
    | N/A   42C    P0    54W / 350W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM3...  On   | 00000000:3F:00.0 Off |                    0 |
    | N/A   40C    P0    48W / 350W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM3...  On   | 00000000:40:00.0 Off |                    0 |
    | N/A   40C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM3...  On   | 00000000:41:00.0 Off |                    0 |
    | N/A   43C    P0    54W / 350W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
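
As a quick cross-check (a sketch that assumes the GPUs appear as 3D controller devices, as in the lspci output above), compare the number of GPUs seen on the PCI bus with the number seen by the driver:

# GPUs seen on the PCI bus
lspci | grep -i nvidia | grep -ci '3d controller'

# GPUs seen by the driver; the two counts should match
nvidia-smi -L | wc -l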

2.4 GPU bandwidth check

Ensure that the current GPU link width is consistent with the rated link width; in general, x16 indicates a normal state.

You can check the GPU link width with either the lspci command or the nvidia-smi command.

# lspci check
# Rated link width:
lspci -vvd 10de: | grep -i Lnkcap:
# Current link width:
lspci -vvd 10de: | grep -i Lnksta:

# nvidia-smi check
nvidia-smi -q | grep -i -A 2 'Link width'
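
To see the rated and current link status side by side for every NVIDIA device, a small loop over the PCI addresses can help (a sketch; run as root so lspci can read the capability registers):

# Print LnkCap (rated) and LnkSta (current) for each NVIDIA PCI device
for dev in $(lspci -d 10de: | awk '{print $1}'); do
    echo "== $dev =="
    lspci -vvs "$dev" | grep -iE 'lnkcap:|lnksta:'
done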

2.5 GPU ECC count check

The GPU ECC count can be checked as follows.

In the output below, Pending is No, which means that all memory pages with ECC errors have already been retired; those address spaces will not be used by software again and will not affect program operation.

Pending : No

Pending : Yes indicates that there are ECC error addresses still waiting to be retired; the system must be rebooted or the GPU reset before the flag returns to No.

# Use the -i parameter to specify a GPU id and query the ECC count of that GPU
# nvidia-smi -i <target gpu> -q -d PAGE_RETIREMENT
    ...
    Retired pages
    Single Bit ECC             : 2
    Double Bit ECC             : 0
    Pending                    : No

# Query the ECC count of all GPUs by omitting the -i parameter
# nvidia-smi -q -d PAGE_RETIREMENT

You can also check it with the nvidia-smi -q | grep -i 'bit ecc' command.

For GPUs with an ECC count, replace the GPU according to your company's threshold requirements. In addition, make sure that the error address space of any GPU with an ECC count has been retired, that is, Pending : No.
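
For a compact per-GPU summary, the retired-page counters can also be read as query fields (a sketch; the field names below assume a reasonably recent driver):

# Retired-page counts and pending status for every GPU, in CSV form
nvidia-smi --query-gpu=index,serial,retired_pages.sbe,retired_pages.dbe,retired_pages.pending --format=csv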

2.6 GPU ERR error detection

During GPU operation, fan ERR! and power ERR! errors can appear. You can detect them by checking whether the nvidia-smi output contains ERR!.

# nvidia-smi
Thu Dec 26 09:53:57 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:3E:00.0 Off |                    0 |
| ERR!  44C     P0   ERR!/ 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
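
A simple way to automate this check (a sketch): count the ERR! occurrences in the nvidia-smi output; a non-zero count means at least one GPU is reporting an error.

# Count ERR! occurrences; 0 means no ERR! error is currently reported
nvidia-smi | grep -c 'ERR!'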

2.7 GPU serial number query

# nvidia-smi -q | grep -i serial
    Serial Number                   : 0324018045603
    Serial Number                   : 0324018044864
    Serial Number                   : 0324018027716
    Serial Number                   : 0323918059881

# Query the serial number of a GPU with a specific id using nvidia-smi -q -i <id>
# nvidia-smi -q -i 0 | grep -i serial
    Serial Number                   : 0324018045603
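
When deciding which physical GPU to replace, it helps to map GPU index, PCI bus ID, and serial number in one listing (a sketch; the query fields assume a reasonably recent driver):

# Map GPU index, PCI bus ID, and serial number
nvidia-smi --query-gpu=index,pci.bus_id,serial --format=csv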

3. GPU fault diagnosis process

The following sections give flow charts for common GPU fault inspection.

3.1 GPU basic status check

3.2 GPU not recognized by lspci

![02 lspci GPU not recognized](https://gitee.com/Gavin_zj/blog/raw/master/blog_img/02-lspci GPU does not recognize.svg)

3.3 GPU not recognized by nvidia-smi

![03 nvidia-smi GPU not recognized](https://gitee.com/Gavin_zj/blog/raw/master/blog_img/03-nvidia-smi GPU does not recognize.svg)

3.4 GPU ERR error

![04-GPU ERR](https://gitee.com/Gavin_zj/blog/raw/master/blog_img/04-GPU ERR.svg)

3.5 GPU ECC error detection

![05-GPU ECC error 2](https://gitee.com/Gavin_zj/blog/raw/master/blog_img/05-GPU ECC error 2.svg)

3.6 GPU bandwidth check

3.7 GPU backplane cannot be powered up

3.8 GPU driver version check
