MVSNet code reproduction problems solved: cuBlas call failed status=13

Brief introduction

This article documents some problems encountered in reproducing MVSNet.

I recently reproduced MVSNet. The code accompanies a 2018 paper and requires the GPU build of tensorflow, version >= 1.5.

The source code recommends cuda9.0 and cudnn7.0. Going by the matching versions of tensorflow and cuda, that leaves only tensorflow 1.5 ~ 1.12. In the end I chose to install the GPU version of tensorflow 1.5.
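
The version reasoning above can be written down as a small lookup. This is only a sketch: the ranges are my reading of TensorFlow's published "tested build configurations" for Linux GPU builds, and the names TF1_CUDA_TABLE and required_cuda are made up for illustration, so check the official table before relying on it.

```python
# Sketch: CUDA version each TensorFlow 1.x GPU release was built against.
# Assumption: ranges follow TensorFlow's "tested build configurations" table.
TF1_CUDA_TABLE = [
    ((1, 0), (1, 4), "8.0"),
    ((1, 5), (1, 12), "9.0"),
    ((1, 13), (1, 15), "10.0"),
]


def required_cuda(tf_version):
    """Return the CUDA version string for a TensorFlow 1.x version such as '1.5.0'."""
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    for low, high, cuda in TF1_CUDA_TABLE:
        if low <= (major, minor) <= high:
            return cuda
    raise ValueError("no entry for TensorFlow " + tf_version)


if __name__ == "__main__":
    for version in ("1.5.0", "1.12.0", "1.13.1"):
        print(version, "-> CUDA", required_cuda(version))
```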

The CUDA version on the server is 11.2, so a separate CUDA9.0 has to be installed. Because the server is shared, I installed CUDA9.0 in my personal directory rather than system-wide. The active CUDA version can then be switched at will by changing environment variables.
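
As a rough sketch of what "changing the environment variables" amounts to, the helper below builds an environment that points at a user-local CUDA install. The function name cuda_env and the path ~/cuda-9.0 are hypothetical; the bin and lib64 sub-directories are the usual layout of a CUDA toolkit on Linux.

```python
import os


def cuda_env(cuda_home, base_env=None):
    """Return a copy of the environment that points at the CUDA install in cuda_home."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_HOME"] = cuda_home
    bin_dir = os.path.join(cuda_home, "bin")
    lib_dir = os.path.join(cuda_home, "lib64")
    # Prepend so this CUDA wins over any system-wide install.
    env["PATH"] = os.pathsep.join(p for p in (bin_dir, env.get("PATH", "")) if p)
    env["LD_LIBRARY_PATH"] = os.pathsep.join(
        p for p in (lib_dir, env.get("LD_LIBRARY_PATH", "")) if p
    )
    return env


if __name__ == "__main__":
    # Hypothetical location of a user-local CUDA 9.0 install.
    env = cuda_env(os.path.expanduser("~/cuda-9.0"))
    print(env["PATH"].split(os.pathsep)[0])
```

The returned dictionary could be passed as the env argument of subprocess.run when launching a training script, so that the choice of CUDA version only affects that one process.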

To install multiple versions of CUDA and cuDNN as a non-root user, refer to the following article, which is a very detailed step-by-step tutorial:

https://blog.csdn.net/hizengbiao/article/details/88625044

Minor code changes:

After installing tensorflow, cuda, cudnn and the other packages, running the code still reports a lot of errors.

These errors occur because some APIs differ between tensorflow versions: function names and return values may have changed. The problematic code therefore needs the following changes.

1) tf.compat.v1.logging -> tf.logging

# previous
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
# changed
tf.logging.set_verbosity(tf.logging.ERROR)

2) tf.random.shuffle -> tf.random_shuffle

# previous
indices = tf.random.shuffle(indices)
# changed
indices = tf.random_shuffle(indices)

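3) train.py: in train(), the images tensor is five-dimensional, and each per-view slice is still four-dimensional ([batch, height, width, channels]) after the first squeeze. In tensorflow 1.5, tf.image.per_image_standardization expects a single three-dimensional image, so squeeze the batch dimension before the call and restore it with tf.expand_dims afterwards. Note that tf.squeeze(image, axis=0) only works when the batch size is 1.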

for view in range(0, FLAGS.view_num):
    image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, 3]), axis=1)
    if FLAGS.online_augmentation:
        image = tf.map_fn(online_augmentation, image, back_prop=False)
    # >>> add begin
    image = tf.squeeze(image, axis=0)
    # <<< add end
    
    image = tf.image.per_image_standardization(image)
    
    # >>> add begin
    image = tf.expand_dims(image, 0)
    # <<< add end
    arg_images.append(image)
images = tf.stack(arg_images, axis=1)

4) validate.py: validate_mvsnet() has the same problem as in 3) above

for view in range(0, FLAGS.view_num):
    image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, 3]), axis=1)
    # >>> add begin
    image = tf.squeeze(image, axis=0)
    # <<< add end
    
    image = tf.image.per_image_standardization(image)
    
    # >>> add begin
    image = tf.expand_dims(image, 0)
    # <<< add end
    normalized_images.append(image)
images = tf.stack(normalized_images, axis=1)
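
The shape handling in 3) and 4) can be checked without tensorflow. The following is a NumPy sketch, not the MVSNet code: per_image_standardization here is my stand-in for the tensorflow function (subtract the mean, divide by the standard deviation with a lower bound of 1/sqrt(number of elements)), and standardize_views is a hypothetical name.

```python
import numpy as np


def per_image_standardization(image):
    """NumPy stand-in for tf.image.per_image_standardization on one [H, W, C] image."""
    if image.ndim != 3:
        raise ValueError("expected a 3-D image, got %d dimensions" % image.ndim)
    image = image.astype(np.float32)
    # The divisor is bounded below by 1/sqrt(num_elements) so that a
    # uniform image does not cause a division by zero.
    adjusted_std = max(float(image.std()), 1.0 / np.sqrt(image.size))
    return (image - image.mean()) / adjusted_std


def standardize_views(images):
    """images: [1, view_num, H, W, 3]; returns an array of the same shape."""
    out = []
    for view in range(images.shape[1]):
        image = images[:, view]            # [1, H, W, 3], like slice + squeeze(axis=1)
        image = np.squeeze(image, axis=0)  # [H, W, 3]; raises unless batch size is 1
        image = per_image_standardization(image)
        image = np.expand_dims(image, 0)   # back to [1, H, W, 3]
        out.append(image)
    return np.stack(out, axis=1)


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    batch = rng.uniform(0, 255, size=(1, 3, 4, 5, 3))
    print(standardize_views(batch).shape)
```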

5) photometric_augmentation.py: motion_blur(), tf.numpy_function -> tf.py_func

# previous
blurred = tf.numpy_function(_py_motion_blur, [image], tf.float32)
# changed
blurred = tf.py_func(_py_motion_blur, [image], tf.float32)
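
Instead of editing every call site for the renames in 1), 2) and 5), one could look up whichever spelling exists in the installed version. This is a sketch of that idea only, not something I used for MVSNet: first_available is a hypothetical helper, and the demo runs against a fake module object rather than tensorflow itself.

```python
from types import SimpleNamespace


def first_available(root, *dotted_names):
    """Return the first attribute path that exists on root, e.g. 'random.shuffle'."""
    for dotted in dotted_names:
        obj = root
        try:
            for part in dotted.split("."):
                obj = getattr(obj, part)
        except AttributeError:
            continue  # this spelling does not exist in the installed version
        return obj
    raise AttributeError("none of %r found" % (dotted_names,))


if __name__ == "__main__":
    # Stand-in for an old tensorflow module that only has the 1.5-era name.
    fake_tf = SimpleNamespace(random_shuffle=lambda x: x)
    shuffle = first_available(fake_tf, "random.shuffle", "random_shuffle")
    print(shuffle is fake_tf.random_shuffle)
```

With a real installation the call would look like first_available(tf, "random.shuffle", "random_shuffle"), and the same pattern applies to the logging and py_func names.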

CUDA and the graphics card are incompatible

After making all the code changes, run:

CUDA_VISIBLE_DEVICES=2,3 python train.py --regularization '3DCNNs' --train_dtu --max_w 640 --max_h 512 --max_d 128 --dtu_data_root "../data/dtu/training/dtu" --log_folder "../log/dtu" --model_folder "../model/dtu" --num_gpus 2

The program starts, but then reports an error that is not a code error. I forgot to take a screenshot of it; the gist of the error is:

cuBlas call failed status=13

Searching online for this error shows that it means the CUDA version does not match the graphics card.

References:
https://www.zhihu.com/question/424656505
https://tieba.baidu.com/p/7045399988

There are several solutions:
1) Switch to an older graphics card, that is, one that supports CUDA9.0
2) Port the code to tensorflow 2.x, choosing a version built against a CUDA release that the new graphics card supports (not recommended)
3) Use docker

The graphics card in the server I use is a 3090, which only supports CUDA versions above 11.0, and even 11.0 is not very stable on it. The MVSNet code is old: the highest tensorflow version it can run on corresponds to CUDA 10.0. So there was no way around it. I had to switch to an old server and run the experiments on a 1080ti.
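
The mismatch can be summarised as a compatibility check. This is a sketch under stated assumptions: the compute capabilities and minimum CUDA versions below are what I recall from NVIDIA's public documentation (6.1 for the 1080 Ti, 8.6 for the 3090, with native 8.6 support arriving in CUDA 11.1), and cuda_can_target is a made-up name.

```python
# Sketch: can a given CUDA toolkit natively target a given GPU?
# Assumption: values below are recalled from NVIDIA's documentation.
GPU_COMPUTE_CAPABILITY = {
    "GTX 1080 Ti": (6, 1),  # Pascal
    "RTX 3090": (8, 6),     # Ampere
}
MIN_CUDA_FOR_CAPABILITY = {
    (6, 1): (8, 0),
    (8, 6): (11, 1),
}


def cuda_can_target(gpu_name, cuda_version):
    """Return True if cuda_version (e.g. '9.0') is new enough for gpu_name."""
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    capability = GPU_COMPUTE_CAPABILITY[gpu_name]
    return (major, minor) >= MIN_CUDA_FOR_CAPABILITY[capability]


if __name__ == "__main__":
    for gpu in ("GTX 1080 Ti", "RTX 3090"):
        print(gpu, "with CUDA 9.0:", cuda_can_target(gpu, "9.0"))
```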

I also found another approach online: running tf1.5 code on a 3090 by using docker. I haven't tried it. If you have time, you can try it by following this article: https://zhuanlan.zhihu.com/p/341969571

MVSNet post-processing: problems generating point clouds with fusibile

After switching to the old server and installing CUDA 9.0 and tensorflow 1.5, the MVSNet code runs through.

MVSNet generates depth maps for multiple views, so some post-processing is needed to turn them into point clouds. The author uses the fusibile code to generate the point cloud. That code is written in C++ and has to be compiled into an executable file; the path of the executable is then passed to MVSNet's depthfusion.py to complete the post-processing.

Compiling the fusibile code has only two prerequisites: 1) CUDA >= 6.0; 2) opencv

CUDA9.0 was already installed at this point. During compilation I was told that opencv is required. Installing OpenCV on the old server went smoothly without any errors, and cmake then succeeded. However, make failed with a message saying that GNU compiler versions later than 6 are not supported. The gcc and g++ on the old server are version 7.4, while fusibile is code from a 2015 paper, so gcc and g++ no newer than version 6 also have to be installed.
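
Before running make, the compiler version can be checked up front. The sketch below parses the first line of `gcc --version` output passed in as a string; the limit of 6 is taken from the error message described above, the function names are hypothetical, and the sample string is only illustrative.

```python
import re

MAX_GCC_MAJOR = 6  # assumption: newest gcc major version the CUDA 9.0 toolchain accepts


def gcc_major(version_output):
    """Extract the major version from the first line of `gcc --version` output."""
    first_line = version_output.strip().splitlines()[0]
    # The version number is the last token on the first line.
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?\s*$", first_line)
    if match is None:
        raise ValueError("could not find a version in: " + first_line)
    return int(match.group(1))


def gcc_ok_for_cuda9(version_output):
    """Return True if this gcc is old enough to compile CUDA 9.0 code."""
    return gcc_major(version_output) <= MAX_GCC_MAJOR


if __name__ == "__main__":
    sample = "gcc (GCC) 7.4.0"  # illustrative string, not captured from the server
    print(gcc_major(sample), gcc_ok_for_cuda9(sample))
```

In practice the string would come from something like subprocess.check_output(["gcc", "--version"]).decode().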

Installing gcc caused me a lot of trouble, too much to cover in this article. I'll write it up in a separate article later, which will cover:
1) opencv installation process and troubleshooting
2) gcc installation process and troubleshooting
3) fusibile code reproduction process and error resolution
4) Troubleshooting of various small problems
That article hasn't been written yet, so here is a placeholder for the link: opencv and gcc installation

Tags: Linux TensorFlow CUDA nvidia

Posted on Sat, 20 Nov 2021 20:34:08 -0500 by burgessm