CUDA: accelerating applications with CUDA C/C++ (NVIDIA course)

The foundation of C/C++ accelerated computing

With the techniques and tools you now have, you are nearly ready to start accelerating your own real applications. This section covers the following:

a. Set up your own CUDA environment
b. How best to continue learning to develop accelerated applications
c. Solve another problem for practice
d. Other useful resources

1. Set up an environment with CUDA

The two easiest ways to set up a CUDA environment for your own work are:

1. Through cloud service providers

2. Install CUDA on your own system with an NVIDIA GPU

1. Through cloud service providers

All major cloud service providers offer instances with NVIDIA GPUs. Simply search the Internet for "NVIDIA GPU <cloud service provider of your choice>" and you will easily find instructions for configuring such an instance in the cloud of your choice. These instances come with the CUDA Toolkit installed; you only need to SSH into them to start working.

2. Your own system

If you have access to a system with an NVIDIA GPU but the CUDA Toolkit is not installed, simply follow the instructions for your specific operating system to download and install CUDA.

2. Continue developing your accelerated computing applications

After setting up your own accelerated system, the best way to further develop your skills as an accelerated computing programmer is to accelerate your own applications. In addition:

1. Strive to speed up your own applications

You have learned how to accelerate applications iteratively, guided by profiling results, so:

1. Take baseline performance measurements of the compute-intensive applications you use
2. Form hypotheses about where acceleration would help
3. Make simple, straightforward changes to those parts of the code
4. Measure performance again and repeat the process

2. Read and apply the techniques described in the CUDA C Best Practices Guide

Although you've been able to improve the performance of applications that were originally run only on the CPU, you can take the acceleration one step further.

The CUDA C Best Practices Guide is an essential resource for CUDA programming. After you have accelerated your application using the techniques you have learned, start exploring this document and applying the techniques it describes to further improve your application's performance.

3. Practical applications of GPU acceleration

The best practice is to accelerate your own applications, but if you do not yet have a real use case, try accelerating the Mandelbrot Set Simulator below. As usual, take an iterative, profiling-driven approach.

Mandelbrot Set Simulator: this C++ simulator includes a link to a detailed description of the application and lets you see the impact of GPU acceleration visually.
Useful resources
Many talented programmers have used CUDA to create highly optimized libraries that accelerate computation. There are many scenarios where you will need to write your own CUDA code, but there are also many where someone has already written the code you need.

Browse the GPU-accelerated computing libraries to find out where you can use highly optimized CUDA libraries for tasks such as basic linear algebra (BLAS), graph analytics, fast Fourier transforms (FFT), random number generation (RNG), image and signal processing, and more.
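
To give a feel for how little code such a library call requires, here is a minimal sketch (not part of the course materials) that uses cuBLAS to compute a SAXPY (y = alpha * x + y) on the GPU. It assumes the cuBLAS library is available and would be compiled with something like nvcc -arch=sm_70 -o saxpy saxpy.cu -lcublas:

#include <stdio.h>
#include <cublas_v2.h>

int main()
{
  int N = 1 << 10;
  float alpha = 2.0f;
  float *x, *y;

  // Managed memory is accessible from both the host and the cuBLAS routine on the device.
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));
  for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

  cublasHandle_t handle;
  cublasCreate(&handle);

  // y = alpha * x + y, computed on the GPU by the library.
  cublasSaxpy(handle, N, &alpha, x, 1, y, 1);
  cudaDeviceSynchronize();

  printf("y[0] = %f (expected 4.0)\n", y[0]);

  cublasDestroy(handle);
  cudaFree(x);
  cudaFree(y);
}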

4. Environment Quick Start

This is a quick-start for users who just want to get going.

Use the NVIDIA AMI on AWS (10 minutes):
https://github.com/NVIDIA/nvidia-docker/wiki/Deploy-on-Amazon-EC2

Get started with nvidia-docker (5 minutes):
https://github.com/NVIDIA/nvidia-docker

Get started with the CUDA development image (5 minutes):
"docker pull nvidia/cuda:9.1-devel"
https://hub.docker.com/r/nvidia/cuda/

Accelerating Applications with CUDA C/C++

Introduction

Accelerated computing is replacing CPU-only computing as best practice. The stream of breakthroughs enabled by accelerated computing, the growing demand for accelerated applications, programming conventions that make acceleration easy, and continual improvements in the hardware that supports it are all driving the transition to accelerated computing.

In both performance and ease of use, the CUDA compute platform excels at accelerated computing. CUDA provides a coding paradigm that extends languages such as C, C++, Python, and Fortran so that they can run massively parallel accelerated code on NVIDIA GPUs, the world's most powerful parallel processors. CUDA accelerates applications with little effort, offers an ecosystem of highly optimized libraries for DNN, BLAS, graph analytics, FFT, and more, and ships with powerful command-line and visual profilers.

CUDA powers many, if not most, of the world's most demanding applications in areas such as computational fluid dynamics, molecular dynamics, quantum chemistry, physics, and high performance computing (HPC).

Learning CUDA will let you accelerate your own applications. Accelerated applications run far faster than their CPU-only counterparts and can perform computations that are impractical when limited by CPU performance. In this lab you will learn how to program accelerated applications with CUDA C/C++, which is enough to start accelerating your own CPU-only applications for performance gains and to move you into a new field of computing.

To get the most out of this lab you should already be able to:

  • Declare variables in C, write loops, and use if/else statements.
  • Define and call functions in C.
  • Allocate arrays in C.

No prior CUDA knowledge is required.

Objectives

When you have completed this lab, you will be able to:

  • Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
  • Use the execution configuration to control the parallel thread hierarchy.
  • Refactor serial loops to execute their iterations in parallel on the GPU.
  • Allocate and free memory available to both CPUs and GPUs.
  • Handle errors generated by CUDA code.
  • Accelerate CPU-only applications.

Accelerated Systems

Accelerated systems, also called heterogeneous systems, consist of both CPUs and GPUs. Accelerated systems run CPU programs, which in turn launch functions that benefit from the massive parallelism of the GPU. This lab environment is an accelerated system that includes an NVIDIA GPU. You can query information about this GPU with the nvidia-smi (Systems Management Interface) command-line tool. Issue the nvidia-smi command now in the code execution cell below by pressing CTRL + ENTER; you will find such cells throughout the lab wherever code needs to be executed. Once the code runs, its output is printed directly below the cell. After running the cell below, locate and note the name of the GPU in the output.

!nvidia-smi

GPU-accelerated Vs. CPU-only Applications

The following slides give a visual, high-level overview of the upcoming material. Click through the slides before moving on to the more detailed coverage in the sections that follow.

%%HTML

<div align="center"><iframe src="https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-01-V1/AC_CUDA_C-zh/AC_CUDA_C_1-zh.pptx" frameborder="0" width="900" height="550" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></div>

Comparison of GPU-accelerated and CPU-only applications: in a CPU-only application, data is allocated on the CPU and all work is executed on the CPU. In an accelerated application, data can be allocated with cudaMallocManaged(), where it is accessible to the CPU and is automatically migrated to the GPU, which performs the parallel work. The GPU performs its work asynchronously while the CPU can continue with its own work. CPU code can synchronize with the asynchronous GPU work by calling cudaDeviceSynchronize() and waiting for it to complete. Data that is then accessed by the CPU migrates back automatically.
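
The pattern just described can be summarized in a minimal sketch; the kernel name work and its body are hypothetical placeholders:

__global__ void work(int *data, int N)
{
  // ... parallel work on data ...
}

int main()
{
  int N = 1000;
  int *data;

  cudaMallocManaged(&data, N * sizeof(int)); // accessible from both CPU and GPU

  // ... CPU code may initialize data here ...

  work<<<4, 256>>>(data, N);                 // the GPU executes this asynchronously
  cudaDeviceSynchronize();                   // the CPU waits for the GPU to finish

  // ... CPU code may read data here; it migrates back automatically ...

  cudaFree(data);
}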

Writing Application Code for the GPU

CUDA provides extensions for many common programming languages; in this lab we use the C/C++ extensions. These language extensions make it easy for developers to run functions from their source code on a GPU.

Below is a .cu file (.cu is the file extension for CUDA-accelerated programs). It contains two functions: the first will run on the CPU, the second on the GPU. Take a moment to work out the differences between the two functions, both in how they are defined and in how they are called.

#include <stdio.h>

void CPUFunction()
{
  printf("This function is defined to run on the CPU.\n");
}

__global__ void GPUFunction()
{
  printf("This function is defined to run on the GPU.\n");
}

int main()
{
  CPUFunction();

  GPUFunction<<<1, 1>>>();
  cudaDeviceSynchronize();
}

Here are some important lines of code to pay special attention to, along with some other common terms used in accelerated computing:

__global__ void GPUFunction()

  • The __global__ keyword indicates that the function will run on the GPU and can be invoked globally, which in this context means by either the CPU or the GPU.
  • Code executed on the CPU is commonly called host code, and code running on the GPU is called device code.
  • Note that the return type is void. Functions defined with the __global__ keyword must return void.

GPUFunction<<<1, 1>>>();

  • When a function is called to run on the GPU, we say that a kernel is launched.
  • When launching a kernel, you must provide an execution configuration, using the <<< ... >>> syntax just before passing the kernel any expected arguments.
  • At a high level, the execution configuration lets the programmer specify the thread hierarchy for a kernel launch: the number of thread groups (called thread blocks) and the number of threads to execute in each block. We will explore execution configurations later in this lab; for now, note that the kernel is being launched with 1 thread block (the first execution configuration argument) that contains 1 thread (the second argument).

cudaDeviceSynchronize();

  • Unlike much C/C++ code, launching kernels is asynchronous: the CPU code continues executing without waiting for the kernel launch to complete.
  • Calling cudaDeviceSynchronize, a function provided by the CUDA runtime, causes the host (CPU) code to wait until the device (GPU) code has finished, and only then resume execution on the CPU.

Exercise: Write a Hello GPU Kernel

01-hello-gpu.cu (shown below) contains a program that already works. It contains two functions, both of which print messages from the CPU. Your goal is to refactor the helloGPU function in the source file so that it actually runs on the GPU and prints a message indicating that it did.

#include <stdio.h>

void helloCPU()
{
  printf("Hello from the CPU.\n");
}

/*
 * Refactor the `helloGPU` definition to be a kernel
 * that can be launched on the GPU. Update its message
 * to read "Hello from the GPU!"
 */

void helloGPU()
{
  printf("Hello also from the CPU.\n");
}

int main()
{

  helloCPU();

  /*
   * Refactor this call to `helloGPU` so that it launches
   * as a kernel on the GPU.
   */

  helloGPU();

  /*
   * Add code below to synchronize on the completion of the
   * `helloGPU` kernel completion before continuing the CPU
   * thread.
   */
}

Modified code

#include <stdio.h>

void helloCPU()
{
  printf("Hello from the CPU.\n");
}

/*
 * Refactor the `helloGPU` definition to be a kernel
 * that can be launched on the GPU. Update its message
 * to read "Hello from the GPU!"
 */

__global__ void helloGPU()
{
  printf("Hello from the GPU.\n");
}

int main()
{

  helloCPU();

  /*
   * Refactor this call to `helloGPU` so that it launches
   * as a kernel on the GPU.
   */
 //launch two blocks
  helloGPU<<<2, 1>>>();

  /*
   * Add code below to synchronize on the completion of the
   * `helloGPU` kernel completion before continuing the CPU
   * thread.
   */
   cudaDeviceSynchronize();
}
  • First refactor the application, then compile and run it with the nvcc command below (remember that you can execute the contents of a code cell with CTRL + ENTER). The comments in 01-hello-gpu.cu will help you along. If you run into trouble or want to check your work, see the Solution. Do not forget to save your changes to the file before compiling and running with the command below.
!nvcc -arch=sm_70 -o hello-gpu 01-hello/01-hello-gpu.cu -run


After successfully refactoring 01-hello-gpu.cu, make the following changes and try compiling and running the application after each one (using CTRL + ENTER in the code cell above). If errors occur, take the time to read them carefully: familiarity with them will be very helpful when you start writing your own accelerated code.

  • Remove the __global__ keyword from the kernel definition. Note the line number in the error: what do you think "configured" means in that error? When you are finished, restore __global__.
  • Remove the execution configuration: is your understanding of "configured" still reasonable? When finished, restore the execution configuration.
  • Remove the call to cudaDeviceSynchronize. Before compiling and running, guess what will happen, recalling that kernels are launched asynchronously and that cudaDeviceSynchronize makes the host wait until the kernel has finished executing. When finished, restore the call to cudaDeviceSynchronize.
  • Refactor 01-hello-gpu.cu so that Hello from the GPU prints before Hello from the CPU.
  • Refactor 01-hello-gpu.cu so that Hello from the GPU prints twice, one before Hello from the CPU and the other after Hello from the CPU.

Compiling and Running Accelerated CUDA Code

This section contains details about the nvcc command used above to compile and run .cu programs.

The CUDA platform ships with the NVIDIA CUDA Compiler, nvcc, which can compile CUDA-accelerated applications containing both host and device code. For this lab, the discussion of nvcc is limited to what we immediately need. After completing the lab, anyone who wants to dig deeper into nvcc can start with the documentation.

nvcc will feel very familiar to anyone who has used gcc. Compiling the file some-CUDA.cu, for example, is simple:

nvcc -arch=sm_70 -o out some-CUDA.cu -run

  • nvcc is the command-line command for using the nvcc compiler.

  • some-CUDA.cu is passed as the file to compile.

  • The -o flag specifies the compiler's output file.

  • The -arch flag indicates which architecture the files must be compiled for. In this example, sm_70 targets the Volta GPUs this lab runs on specifically, but users who wish to go deeper can consult the documentation on the -arch flag, virtual architecture features (http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html), and GPU features.

  • As a matter of convenience, providing the -run flag executes the successfully compiled binary.

CUDA Thread Hierarchy

The following slides give a visual, high-level overview of the upcoming material. Click through the slides before moving on to the more detailed coverage in the sections that follow.

%%HTML

<div align="center"><iframe src="https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-01-V1/AC_CUDA_C-zh/AC_CUDA_C_2-zh.pptx" frameborder="0" width="900" height="550" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></div>

Thread hierarchy: GPUs perform work in parallel, and that work is performed in threads, many of which run at the same time. A collection of threads is called a block, and there can be many blocks. The set of blocks associated with a given kernel launch is called the grid. GPU functions are called kernels; kernels are launched with an execution configuration, which defines the number of blocks in the grid and the number of threads in each block. Every block in the grid contains the same number of threads.

Launching Parallel Kernels

The execution configuration lets the programmer specify details about how a kernel runs in parallel across multiple GPU threads. More precisely, it specifies the number of thread groups (called thread blocks, or blocks for short) and the number of threads each thread block should contain. The syntax of the execution configuration is:

<<< NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>

When a kernel is launched, its code is executed by every thread in every configured thread block.

Therefore, assuming a kernel named someKernel has been defined, the following is true (a minimal sketch follows this list):

  • someKernel<<<1, 1>>>() is configured to run once, in a single thread block that has a single thread.
  • someKernel<<<1, 10>>>() is configured to run 10 times, in a single thread block that has 10 threads.
  • someKernel<<<10, 1>>>() is configured to run 10 times, in 10 thread blocks that each have a single thread.
  • someKernel<<<10, 10>>>() is configured to run 100 times, in 10 thread blocks that each have 10 threads.
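
As a quick check of the last case, here is a minimal sketch (with a trivial someKernel assumed for illustration) that launches 10 blocks of 10 threads and therefore prints its message 100 times:

#include <stdio.h>

__global__ void someKernel()
{
  printf("This line prints once per thread.\n");
}

int main()
{
  someKernel<<<10, 10>>>(); // 10 blocks x 10 threads: the message prints 100 times
  cudaDeviceSynchronize();
}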

Exercise: Launch Parallel Kernels

01-first-parallel.cu currently makes a very basic function call that prints the message "This should be running in parallel."

#include <stdio.h>

/*
 * Refactor firstParallel so that it can run on the GPU.
 */

void firstParallel()
{
  printf("This should be running in parallel.\n");
}

int main()
{
  /*
   * Refactor this call to firstParallel to execute in parallel
   * on the GPU.
   */

  firstParallel();

  /*
   * Some code is needed below so that the CPU will wait
   * for the GPU kernels to complete before proceeding.
   */

}

Follow the steps below to refactor it first to run on the GPU, then to run in parallel within a single thread block, and finally to run in multiple thread blocks. If you have trouble, see the Solution.

  • Refactor the firstParallel function to launch as a CUDA kernel on the GPU. You should still see the function's output after compiling and running 01-basic-parallel.cu with the nvcc command below.
  • Refactor the firstParallel kernel to execute in parallel on 5 threads, all within a single thread block. After compiling and running the code, you should see the output message printed 5 times.
  • Refactor the firstParallel kernel once more, this time to execute in parallel across 5 thread blocks, each containing 5 threads. After compiling and running, you should see the output message printed 25 times.
#include <stdio.h>

/*
 * Refactor firstParallel so that it can run on the GPU.
 */

__global__ void firstParallel()
{
  printf("This should be running in parallel.\n");
}

int main()
{
  /*
   * Refactor this call to firstParallel to execute in parallel
   * on the GPU.
   */

  firstParallel<<<5, 5>>>();

  /*
   * Some code is needed below so that the CPU will wait
   * for the GPU kernels to complete before proceeding.
   */
   
   cudaDeviceSynchronize();

}

!nvcc -arch=sm_70 -o basic-parallel 02-first-parallel/01-basic-parallel.cu -run


CUDA-Provided Thread Hierarchy Variables

The following slides give a visual, high-level overview of the upcoming material. Click through the slides before moving on to the more detailed coverage in the sections that follow.

%%HTML

<div align="center"><iframe src="https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-01-V1/AC_CUDA_C-zh/AC_CUDA_C_3-zh.pptx" frameborder="0" width="900" height="550" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></div>

Thread hierarchy variables provided by CUDA: inside a kernel definition, CUDA-provided variables describe the thread, block, and grid in which the kernel is executing. gridDim.x is the number of blocks in the grid, and blockIdx.x is the index of the current block within the grid. Inside a kernel, blockDim.x describes the number of threads in a block; all blocks in a grid contain the same number of threads. threadIdx.x describes the index of the current thread within its block.

Thread and Block Indices

Each thread is assigned an index within its thread block, starting at 0. In addition, each thread block is assigned an index, starting at 0. Just as threads form thread blocks, thread blocks form a grid, the highest-level entity in the CUDA thread hierarchy. In short, a CUDA kernel executes in a grid of one or more thread blocks, each containing the same number of threads.

CUDA kernels have access to special variables identifying both of these indices: the index of the thread (within its block) that is executing the kernel, and the index of the block (within the grid) that the thread is in. These variables are threadIdx.x and blockIdx.x, respectively.
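
For example, a minimal sketch (not one of the exercise files) that prints both variables from every thread could look like this:

#include <stdio.h>

__global__ void printIndices()
{
  printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
  printIndices<<<2, 4>>>(); // 2 blocks of 4 threads: blockIdx.x is 0-1, threadIdx.x is 0-3
  cudaDeviceSynchronize();
}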

Exercise: Use Specific Thread and Block Indices

Currently, 01-thread-and-block-idx.cu contains a kernel that prints a failure message when executed.

#include <stdio.h>

__global__ void printSuccessForCorrectExecutionConfiguration()
{

  if(threadIdx.x == 1023 && blockIdx.x == 255)
  {
    printf("Success!\n");
  } else {
    printf("Failure. Update the execution configuration as necessary.\n");
  }
}

int main()
{
  /*
   * Update the execution configuration so that the kernel
   * will print `"Success!"`.
   */

  printSuccessForCorrectExecutionConfiguration<<<1, 1>>>();
}

Open the file to learn how to update the execution configuration to print a success message. After refactoring, use the following code execution unit to compile and run the code to confirm your work. If you have problems, see Solution.

#include <stdio.h>

__global__ void printSuccessForCorrectExecutionConfiguration()
{

  if(threadIdx.x == 1023 && blockIdx.x == 255)
  {
    printf("Success!\n");
  } 
  //else {
    //printf("Failure. Update the execution configuration as necessary.\n");
  //}
}

int main()
{
  /*
   * Update the execution configuration so that the kernel
   * will print `"Success!"`.
   */

  printSuccessForCorrectExecutionConfiguration<<<256, 1024>>>();
  cudaDeviceSynchronize();
}

!nvcc -arch=sm_70 -o thread-and-block-idx 03-indices/01-thread-and-block-idx.cu -run

Accelerating For Loops

It is time to accelerate loops in CPU applications: instead of running each iteration of the loop sequentially, let each iteration run in parallel in its own thread. Consider the following for loop; simple as it is, notice that it controls how many times the loop executes and defines what happens in each iteration:

int N = 2<<20;
for (int i = 0; i < N; ++i)
{
  printf("%d\n", i);
}

To parallelize this loop, you must perform the following two steps (a minimal sketch follows the list):

  • Write a kernel that does the work of a single iteration of the loop.
  • Because the kernel is unaware of any other running kernel instances, the execution configuration must make the kernel execute the correct number of times, i.e., the number of loop iterations.
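
Applied to the loop above, a minimal sketch of this transformation might look as follows; the kernel name loopIteration is chosen here for illustration, and N is kept small so that a single block of N threads suffices:

#include <stdio.h>

__global__ void loopIteration()
{
  // Each thread does the work of one iteration of the original loop.
  printf("%d\n", threadIdx.x);
}

int main()
{
  int N = 10;
  loopIteration<<<1, N>>>(); // the execution configuration supplies the "iteration count"
  cudaDeviceSynchronize();
}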

Exercise: Accelerating a For Loop with a Single Block of Threads

Currently, the loop function in 01-single-block-loop.cu runs a for loop that prints the numbers 0 through 9 sequentially.

#include <stdio.h>

/*
 * Refactor `loop` to be a CUDA Kernel. The new kernel should
 * only do the work of 1 iteration of the original loop.
 */

void loop(int N)
{
  for (int i = 0; i < N; ++i)
  {
    printf("This is iteration number %d\n", i);
  }
}

int main()
{
  /*
   * When refactoring `loop` to launch as a kernel, be sure
   * to use the execution configuration to control how many
   * "iterations" to perform.
   *
   * For this exercise, only use 1 block of threads.
   */

  int N = 10;
  loop(N);
}

Refactor the loop function into a CUDA kernel that executes its N iterations in parallel when launched. After a successful refactor, the numbers 0 through 9 should still all be printed.

#include <stdio.h>

/*
 * Refactor `loop` to be a CUDA Kernel. The new kernel should
 * only do the work of 1 iteration of the original loop.
 */

void loop(int N)
{
  for (int i = 0; i < N; ++i)
  {
    printf("This is iteration number %d\n", i);
  }
}

__global__ void loop_gpu(){ 

    int time_loop = threadIdx.x;

    printf("This is gpu iteration number %d\n", time_loop);
 }
int main()
{
  /*
   * When refactoring `loop` to launch as a kernel, be sure
   * to use the execution configuration to control how many
   * "iterations" to perform.
   *
   * For this exercise, only use 1 block of threads.
   */

  int N = 10;
  loop(N);
  loop_gpu<<<1, N>>>();
  cudaDeviceSynchronize();
  
}

!nvcc -arch=sm_70 -o single-block-loop 04-loops/01-single-block-loop.cu -run


Coordinating Parallel Threads

The following slides give a visual, high-level overview of the upcoming material. Click through the slides before moving on to the more detailed coverage in the sections that follow.

%%HTML

<div align="center"><iframe src="https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-01-V1/AC_CUDA_C-zh/AC_CUDA_C_4-zh.pptx" frameborder="0" width="900" height="550" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></div>

Coordinating parallel threads: suppose the data lives in a vector indexed from 0. Each thread must somehow be mapped to the element of the vector it will process. The formula threadIdx.x + blockIdx.x * blockDim.x maps each thread to one element of the vector.

Using Block Dimensions for More Parallelization

There is a limit to the number of threads a thread block can contain: 1024, to be exact. To increase the amount of parallelism in accelerated applications, we must be able to coordinate across multiple thread blocks.

CUDA kernels have access to a special variable that gives the number of threads in a block: blockDim.x. Combining this variable with blockIdx.x and threadIdx.x in the idiomatic expression threadIdx.x + blockIdx.x * blockDim.x organizes parallel execution across multiple thread blocks that each contain multiple threads, increasing the amount of parallelism. Here is a detailed example.

The execution configuration <<<10, 10>>> launches a grid with a total of 100 threads, contained in 10 thread blocks of 10 threads each. We would therefore like each of these threads (numbered 0 through 99) to be able to compute an index unique to itself (a short code sketch follows the list below).

  • If the thread block blockIdx.x is equal to 0, blockIdx.x * blockDim.x is 0. Add the possible threadIdx.x values (0 to 9) to 0, and then you can generate indexes 0 to 9 in a grid of 100 threads.
  • If the thread block blockIdx.x is equal to 1, blockIdx.x * blockDim.x is 10. Add the possible threadIdx.x values (0 to 9) to 10, and you can then generate indexes 10 to 19 in a grid of 100 threads.
  • If the thread block blockIdx.x equals 5, blockIdx.x * blockDim.x is 50. Add the possible threadIdx.x values (0 to 9) to 50, and you can then generate indexes 50 to 59 in a grid of 100 threads.
  • If the thread block blockIdx.x equals 9, blockIdx.x * blockDim.x is 90. Add the possible threadIdx.x value (0 to 9) to 90, after which indexes 90 to 99 can be generated in a grid of 100 threads.
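
Putting the idiom into code, here is a minimal sketch (hypothetical kernel name printGlobalIndex) in which each of the 100 threads launched by <<<10, 10>>> computes and prints its unique index:

#include <stdio.h>

__global__ void printGlobalIndex()
{
  int i = threadIdx.x + blockIdx.x * blockDim.x; // unique value from 0 to 99
  printf("%d\n", i);
}

int main()
{
  printGlobalIndex<<<10, 10>>>();
  cudaDeviceSynchronize();
}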

Exercise: Accelerating a For Loop with Multiple Blocks of Threads

Currently, the loop function in 02-multi-block-loop.cu runs a for loop that prints the numbers 0 through 9 sequentially. Refactor the loop function into a CUDA kernel that executes its N iterations in parallel when launched. After a successful refactor, the numbers 0 through 9 should still all be printed. As an additional constraint for this exercise, use an execution configuration that launches at least 2 thread blocks.

#include <stdio.h>

/*
 * Refactor `loop` to be a CUDA Kernel. The new kernel should
 * only do the work of 1 iteration of the original loop.
 */

void loop(int N)
{
  for (int i = 0; i < N; ++i)
  {
    printf("This is iteration number %d\n", i);
  }
}

__global__ void loop_gpu(){ 

    int time_loop = blockIdx.x;

    printf("This is gpu iteration number %d\n", time_loop);
 }

int main()
{
  /*
   * When refactoring `loop` to launch as a kernel, be sure
   * to use the execution configuration to control how many
   * "iterations" to perform.
   *
   * For this exercise, be sure to use more than 1 block in
   * the execution configuration.
   */

  int N = 10;
  loop(N);
    loop_gpu<<<N, 1>>>();
  cudaDeviceSynchronize();
  
}

If you have problems, see Solution.

!nvcc -arch=sm_70 -o multi-block-loop 04-loops/02-multi-block-loop.cu -run


Allocating Memory to be accessed on the GPU and the CPU

Recent versions of CUDA (version 6 and later) make it easy to allocate memory that is available to the CPU host and to any number of GPU devices. Although there are many intermediate and advanced memory-management techniques that support the best possible performance in accelerated applications, the basic CUDA memory-management technique introduced here already supports performance far beyond that of CPU-only applications, with almost no developer overhead.

To allocate and free memory and get pointers that can be referenced in host and device code, use cudaMallocManaged and cudaFree instead of calling malloc and free, as shown in the following example:

// CPU-only

int N = 2<<20;
size_t size = N * sizeof(int);

int *a;
a = (int *)malloc(size);

// Use `a` in CPU-only program.

free(a);
// Accelerated

int N = 2<<20;
size_t size = N * sizeof(int);

int *a;
// Note the address of `a` is passed as first argument.
cudaMallocManaged(&a, size);

// Use `a` on the CPU and/or on any GPU in the accelerated system.

cudaFree(a);

Exercise: Array Manipulation on both the Host and Device

The 01-double-elements.cu program allocates an array, initializes it with integer values on the host, attempts to double each value in parallel on the GPU, and then confirms on the host that the doubling succeeded. Currently the program cannot work: it interacts with the array pointed to by a on both the host and the device, but the array (allocated with malloc) is accessible only on the host.

#include <stdio.h>

/*
 * Initialize array values on the host.
 */

void init(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    a[i] = i;
  }
}

/*
 * Double elements in parallel on the GPU.
 */

__global__
void doubleElements(int *a, int N)
{
  int i;
  i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N)
  {
    a[i] *= 2;
  }
}

/*
 * Check all elements have been doubled on the host.
 */

bool checkElementsAreDoubled(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    if (a[i] != i*2) return false;
  }
  return true;
}

int main()
{
  int N = 100;
  int *a;

  size_t size = N * sizeof(int);

  /*
   * Refactor this memory allocation to provide a pointer
   * `a` that can be used on both the host and the device.
   */

  a = (int *)malloc(size);

  init(a, N);

  size_t threads_per_block = 10;
  size_t number_of_blocks = 10;

  /*
   * This launch will not work until the pointer `a` is also
   * available to the device.
   */

  doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);
  cudaDeviceSynchronize();

  bool areDoubled = checkElementsAreDoubled(a, N);
  printf("All elements were doubled? %s\n", areDoubled ? "TRUE" : "FALSE");

  /*
   * Refactor to free memory that has been allocated to be
   * accessed by both the host and the device.
   */

  free(a);
}

Refactor the application to meet the following conditions:

  • Pointer a should be available to both host and device code.
  • The memory pointed to by a should be properly freed.
#include <stdio.h>

void init(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    a[i] = i;
  }
}

__global__
void doubleElements(int *a, int N)
{
  int i;
  i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N)
  {
    a[i] *= 2;
  }
}

bool checkElementsAreDoubled(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    if (a[i] != i*2) return false;
  }
  return true;
}

int main()
{
  int N = 1000;
  int *a;

  size_t size = N * sizeof(int);

  /*
   * Use `cudaMallocManaged` to allocate pointer `a` available
   * on both the host and the device.
   */

  cudaMallocManaged(&a, size);

  init(a, N);

  size_t threads_per_block = 256;
  size_t number_of_blocks = (N + threads_per_block - 1) / threads_per_block;

  doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);
  cudaDeviceSynchronize();

  bool areDoubled = checkElementsAreDoubled(a, N);
  printf("All elements were doubled? %s\n", areDoubled ? "TRUE" : "FALSE");

  /*
   * Use `cudaFree` to free memory allocated
   * with `cudaMallocManaged`.
   */

  cudaFree(a);
}

!nvcc -arch=sm_70 -o double-elements 05-allocate/01-double-elements.cu -run

Grid Size Work Amount Mismatch

The following slides give a visual, high-level overview of the upcoming material. Click through the slides before moving on to the more detailed coverage in the sections that follow.

%%HTML

<div align="center"><iframe src="https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-01-V1/AC_CUDA_C-zh/AC_CUDA_C_5-zh.pptx" frameborder="0" width="900" height="550" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></div>

Grid size and workload mismatch: in the previous scenarios, the number of threads in the grid exactly matched the number of elements. What if the number of threads exceeds the amount of work to be done? Attempting to access a non-existent element causes a runtime error, so the code must check that the data index computed with threadIdx.x + blockIdx.x * blockDim.x is less than N, the number of data elements.

Handling Block Configuration Mismatches to Number of Needed Threads

There may be situations where it is not possible to represent an execution configuration that will create the exact number of threads required for a parallel loop.

A common example involves choosing the ideal thread block size. Given the hardware characteristics of GPUs, thread blocks whose size is a multiple of 32 are desirable for performance reasons. If we launch thread blocks containing 256 threads each (a multiple of 32) and need to run 1000 parallel tasks (a trivially small number used here for illustration), then no number of thread blocks produces exactly 1000 threads in the grid, because 1000 is not an exact multiple of 256.

This can be easily addressed by:

  • Write the execution configuration so that it creates more threads than are required to perform the assigned work.
  • Pass a value N to the kernel as a parameter, representing the total size of the dataset to be processed, or the total number of threads needed to complete the work.
  • After computing the thread's index within the grid (using threadIdx.x + blockIdx.x * blockDim.x), check whether that index exceeds N, and perform the kernel's work only if it does not.

Here is a common idiom for writing the execution configuration when both N and the number of threads per block are known, but an exact match between the number of threads in the grid and N cannot be guaranteed. It ensures that the grid always contains at least the N threads needed, with at most one extra block's worth of surplus threads:

// Assume `N` is known
int N = 100000;

// Assume we have a desire to set `threads_per_block` exactly to `256`
size_t threads_per_block = 256;

// Ensure there are at least `N` threads in the grid, but only 1 block's worth extra
size_t number_of_blocks = (N + threads_per_block - 1) / threads_per_block;

some_kernel<<<number_of_blocks, threads_per_block>>>(N);

Because the execution configuration above may create more threads in the grid than N, care must be taken inside the definition of some_kernel so that it does not attempt to access out-of-range data elements when executed by one of the "extra" threads:

__global__ some_kernel(int N)
{
  int idx = threadIdx.x + blockIdx.x * blockDim.x;

  if (idx < N) // Check to make sure `idx` maps to some value within `N`
  {
    // Only do work if it does
  }
}

Exercise: Accelerating a For Loop with a Mismatched Execution Configuration

The program in 02-mismatched-config-loop.cu uses cudaMallocManaged to allocate memory for an integer array of 1000 elements, and then attempts to initialize all the values in the array in parallel using a CUDA kernel.

#include <stdio.h>

/*
 * Currently, `initializeElementsTo`, if executed in a thread whose
 * `i` is calculated to be greater than `N`, will try to access a value
 * outside the range of `a`.
 *
 * Refactor the kernel definition to prevent out of range accesses.
 */

__global__ void initializeElementsTo(int initialValue, int *a, int N)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  a[i] = initialValue;
}

int main()
{
  /*
   * Do not modify `N`.
   */

  int N = 1000;

  int *a;
  size_t size = N * sizeof(int);

  cudaMallocManaged(&a, size);

  /*
   * Assume we have reason to want the number of threads
   * fixed at `256`: do not modify `threads_per_block`.
   */

  size_t threads_per_block = 256;

  /*
   * Assign a value to `number_of_blocks` that will
   * allow for a working execution configuration given
   * the fixed values for `N` and `threads_per_block`.
   */

  size_t number_of_blocks = 0;

  int initialValue = 6;

  initializeElementsTo<<<number_of_blocks, threads_per_block>>>(initialValue, a, N);
  cudaDeviceSynchronize();

  /*
   * Check to make sure all values in `a`, were initialized.
   */

  for (int i = 0; i < N; ++i)
  {
    if(a[i] != initialValue)
    {
      printf("FAILURE: target value: %d\t a[%d]: %d\n", initialValue, i, a[i]);
      exit(1);
    }
  }
  printf("SUCCESS!\n");

  cudaFree(a);
}

This program assumes that both N and the number of threads per block are known. Your task is to accomplish the following two goals:

  • Assign a value to number_of_blocks that ensures there are at least as many threads as there are elements accessible through pointer a.
  • Update the initializeElementsTo kernel so that it does not attempt to access out-of-range data elements.
#include <stdio.h>

/*
 * Currently, `initializeElementsTo`, if executed in a thread whose
 * `i` is calculated to be greater than `N`, will try to access a value
 * outside the range of `a`.
 *
 * Refactor the kernel definition to prevent out of range accesses.
 */

__global__ void initializeElementsTo(int initialValue, int *a, int N)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if(i<N){
      a[i] = initialValue;
  }
  
}

int main()
{
  /*
   * Do not modify `N`.
   */

  int N = 1000;

  int *a;
  size_t size = N * sizeof(int);

  cudaMallocManaged(&a, size);

  /*
   * Assume we have reason to want the number of threads
   * fixed at `256`: do not modify `threads_per_block`.
   */

  size_t threads_per_block = 256;

  /*
   * Assign a value to `number_of_blocks` that will
   * allow for a working execution configuration given
   * the fixed values for `N` and `threads_per_block`.
   */

  size_t number_of_blocks = (N + threads_per_block - 1) / threads_per_block;

  int initialValue = 6;

  initializeElementsTo<<<number_of_blocks, threads_per_block>>>(initialValue, a, N);
  cudaDeviceSynchronize();

  /*
   * Check to make sure all values in `a`, were initialized.
   */

  for (int i = 0; i < N; ++i)
  {
    if(a[i] != initialValue)
    {
      printf("FAILURE: target value: %d\t a[%d]: %d\n", initialValue, i, a[i]);
      exit(1);
    }
  }
  printf("SUCCESS!\n");

  cudaFree(a);
}

!nvcc -arch=sm_70 -o mismatched-config-loop 05-allocate/02-mismatched-config-loop.cu -run

The result is: SUCCESS!

Grid-Stride Loops

The following slides give a visual, high-level overview of the upcoming material. Click through the slides before moving on to the more detailed coverage in the sections that follow.

%%HTML

<div align="center"><iframe src="https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-01-V1/AC_CUDA_C-zh/AC_CUDA_C_6-zh.pptx" frameborder="0" width="900" height="550" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></div>

Grid-stride loops: the number of data elements is often greater than the number of threads in the grid. In that case, each thread cannot process just one element, or work would be left undone. One way to address this programmatically is with a grid-stride loop. In a grid-stride loop, each thread's first element is still computed with threadIdx.x + blockIdx.x * blockDim.x; the thread then strides forward by the total number of threads in the grid (blockDim.x * gridDim.x) and repeats until its index passes the end of the data.

Data Sets Larger than the Grid

Either by choice, often to create the highest-performing execution configuration, or out of necessity, the number of threads in a grid may be smaller than the size of the dataset. Consider an array of 1000 elements and a grid of 250 threads (small sizes are used here for illustration). Each thread in the grid must then be used four times. A common way to achieve this is to use a grid-stride loop inside the kernel.

In a grid-stride loop, each thread computes its unique index within the grid using threadIdx.x + blockIdx.x * blockDim.x, performs the operation on the element at that index in the array, then adds the total number of threads in the grid to its index and repeats until it moves beyond the range of the array. For example, for an array of 500 elements and a grid of 250 threads, the thread with index 20 in the grid would:

  • Perform an operation on element 20 of an array of 500 elements
  • Increase its index by 250, the size of the grid, to 270
  • Perform an operation on element 270 of an array containing 500 elements
  • Increase its index by 250, the size of the grid, to 520
  • Since 520 is now outside the array range, the thread will stop working

CUDA provides a special variable giving the number of thread blocks in the grid: gridDim.x. The total number of threads in the grid is then the number of blocks multiplied by the number of threads per block: gridDim.x * blockDim.x. With this in mind, here is a detailed example of a grid-stride loop in a kernel:

__global__ void kernel(int *a, int N)
{
  int indexWithinTheGrid = threadIdx.x + blockIdx.x * blockDim.x;
  int gridStride = gridDim.x * blockDim.x;

  for (int i = indexWithinTheGrid; i < N; i += gridStride)
  {
    // do work on a[i];
  }
}

Exercise: Use a Grid-Stride Loop to Manipulate an Array Larger than the Grid

Refactor 03-grid-stride-double.cu

#include <stdio.h>

void init(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    a[i] = i;
  }
}

/*
 * In the current application, `N` is larger than the grid.
 * Refactor this kernel to use a grid-stride loop in order that
 * each parallel thread work on more than one element of the array.
 */

__global__
void doubleElements(int *a, int N)
{
  int i;
  i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N)
  {
    a[i] *= 2;
  }
}

bool checkElementsAreDoubled(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    if (a[i] != i*2) return false;
  }
  return true;
}

int main()
{
  /*
   * `N` is greater than the size of the grid (see below).
   */

  int N = 10000;
  int *a;

  size_t size = N * sizeof(int);
  cudaMallocManaged(&a, size);

  init(a, N);

  /*
   * The size of this grid is 256*32 = 8192.
   */

  size_t threads_per_block = 256;
  size_t number_of_blocks = 32;

  doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);
  cudaDeviceSynchronize();

  bool areDoubled = checkElementsAreDoubled(a, N);
  printf("All elements were doubled? %s\n", areDoubled ? "TRUE" : "FALSE");

  cudaFree(a);
}

Use a grid-stride loop in the doubleElements kernel so that the grid, which is smaller than N, can reuse threads to cover every element in the array. The program prints whether every element in the array was doubled; currently it correctly prints FALSE.

#include <stdio.h>

void init(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    a[i] = i;
  }
}

/*
 * In the current application, `N` is larger than the grid.
 * Refactor this kernel to use a grid-stride loop in order that
 * each parallel thread work on more than one element of the array.
 */

__global__
void doubleElements(int *a, int N)
{
  int i;
  i = blockIdx.x * blockDim.x + threadIdx.x;
  int gridStride = gridDim.x * blockDim.x;
  
  while(i<N){
       a[i] *= 2;
       i = i + gridStride;
  }
}

bool checkElementsAreDoubled(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    if (a[i] != i*2) return false;
  }
  return true;
}

int main()
{
  /*
   * `N` is greater than the size of the grid (see below).
   */

  int N = 10000;
  int *a;

  size_t size = N * sizeof(int);
  cudaMallocManaged(&a, size);

  init(a, N);

  /*
   * The size of this grid is 256*32 = 8192.
   */

  size_t threads_per_block = 256;
  size_t number_of_blocks = 32;

  doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);
  cudaDeviceSynchronize();

  bool areDoubled = checkElementsAreDoubled(a, N);
  printf("All elements were doubled? %s\n", areDoubled ? "TRUE" : "FALSE");

  cudaFree(a);
}
!nvcc -arch=sm_70 -o grid-stride-double 05-allocate/03-grid-stride-double.cu -run

Run result: All elements were doubled? TRUE

Error Handling

As in any application, error handling in accelerated CUDA code is essential. Many, if not most, CUDA functions (for example, the memory-management functions) return a value of type cudaError_t, which can be used to check whether an error occurred during the call. Here is an example of error handling for a call to cudaMallocManaged:

cudaError_t err;
err = cudaMallocManaged(&a, N);                   // Assume the existence of `a` and `N`.

if (err != cudaSuccess)                           // `cudaSuccess` is provided by CUDA.
{
  printf("Error: %s\n", cudaGetErrorString(err)); // `cudaGetErrorString` is provided by CUDA.
}

Launching a kernel, which is defined to return void, does not return a value of type cudaError_t. To check for errors that occur at kernel launch time (for example, an invalid execution configuration), CUDA provides cudaGetLastError, which returns a value of type cudaError_t.

/*
 * This launch should cause an error, but the kernel itself
 * cannot return it.
 */

someKernel<<<1, -1>>>();  // -1 is not a valid number of threads.

cudaError_t err;
err = cudaGetLastError(); // `cudaGetLastError` will return the error from above.
if (err != cudaSuccess)
{
  printf("Error: %s\n", cudaGetErrorString(err));
}

Finally, to catch asynchronous errors (for example, errors occurring during the execution of an asynchronous kernel), check the status returned by a subsequent synchronizing CUDA runtime API call, such as cudaDeviceSynchronize; it will return an error if one of the previously launched kernels failed.
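
For example, the same pattern shown above for cudaMallocManaged can be applied to the value returned by cudaDeviceSynchronize:

cudaError_t asyncErr;
asyncErr = cudaDeviceSynchronize(); // returns errors from previously launched, asynchronously executed kernels

if (asyncErr != cudaSuccess)
{
  printf("Error: %s\n", cudaGetErrorString(asyncErr));
}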

Exercise: Add Error Handling

Currently, 01-add-error-handling.cu compiles, runs, and prints a message that the array elements were not successfully doubled. The program does not, however, indicate whether any errors occurred.

#include <stdio.h>

void init(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    a[i] = i;
  }
}

__global__
void doubleElements(int *a, int N)
{

  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;

  for (int i = idx; i < N + stride; i += stride)
  {
    a[i] *= 2;
  }
}

bool checkElementsAreDoubled(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    if (a[i] != i*2) return false;
  }
  return true;
}

int main()
{
  /*
   * Add error handling to this source code to learn what errors
   * exist, and then correct them. Googling error messages may be
   * of service if actions for resolving them are not clear to you.
   */

  int N = 10000;
  int *a;

  size_t size = N * sizeof(int);
  cudaMallocManaged(&a, size);

  init(a, N);

  size_t threads_per_block = 2048;
  size_t number_of_blocks = 32;

  doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);
  cudaDeviceSynchronize();

  bool areDoubled = checkElementsAreDoubled(a, N);
  printf("All elements were doubled? %s\n", areDoubled ? "TRUE" : "FALSE");

  cudaFree(a);
}

Refactor the application to handle CUDA errors so that you can understand what is wrong and debug it effectively. You will need to handle both synchronous errors, which may occur when calling CUDA functions, and asynchronous errors, which may occur while a CUDA kernel is executing.

#include <stdio.h>

void init(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    a[i] = i;
  }
}

__global__
void doubleElements(int *a, int N)
{

  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;

  /*
   * The previous code (now commented out) attempted
   * to access an element outside the range of `a`.
   */

  // for (int i = idx; i < N + stride; i += stride)
  for (int i = idx; i < N; i += stride)
  {
    a[i] *= 2;
  }
}

bool checkElementsAreDoubled(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    if (a[i] != i*2) return false;
  }
  return true;
}

int main()
{
  int N = 10000;
  int *a;

  size_t size = N * sizeof(int);
  cudaMallocManaged(&a, size);

  init(a, N);

  /*
   * The previous code (now commented out) attempted to launch
   * the kernel with more than the maximum number of threads per
   * block, which is 1024.
   */

  size_t threads_per_block = 1024;
  /* size_t threads_per_block = 2048; */
  size_t number_of_blocks = 32;

  cudaError_t syncErr, asyncErr;

  doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);

  /*
   * Catch errors for both the kernel launch above and any
   * errors that occur during the asynchronous `doubleElements`
   * kernel execution.
   */

  syncErr = cudaGetLastError();
  asyncErr = cudaDeviceSynchronize();

  /*
   * Print errors should they exist.
   */

  if (syncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(syncErr));
  if (asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));

  bool areDoubled = checkElementsAreDoubled(a, N);
  printf("All elements were doubled? %s\n", areDoubled ? "TRUE" : "FALSE");

  cudaFree(a);
}

!nvcc -arch=sm_70 -o add-error-handling 06-errors/01-add-error-handling.cu -run

CUDA Error Handling Function

Creating a helper that wraps CUDA function calls is useful for checking errors. Below is an example of such a helper, written as an inline function, which you can use throughout the remaining exercises:

#include <stdio.h>
#include <assert.h>

inline cudaError_t checkCuda(cudaError_t result)
{
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
  return result;
}

int main()
{

/*
 * The macro can be wrapped around any function returning
 * a value of type `cudaError_t`.
 */

  checkCuda( cudaDeviceSynchronize() );
}
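
Because kernel launches do not themselves return a cudaError_t, a common follow-up pattern (used in the final exercise solution below) is to wrap cudaGetLastError and cudaDeviceSynchronize with the same helper right after a launch:

someKernel<<<number_of_blocks, threads_per_block>>>();
checkCuda( cudaGetLastError() );      // catches launch/configuration errors
checkCuda( cudaDeviceSynchronize() ); // catches errors from the asynchronous kernel execution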

Summary

At this point, you have accomplished all of the lab learning objectives listed below:

  • Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
  • Use the execution configuration to control the parallel thread hierarchy.
  • Refactor serial loops to execute their iterations in parallel on the GPU.
  • Allocate and free memory available to both CPUs and GPUs.
  • Handle errors generated by CUDA code.

Now you will complete the final objective of the lab:

  • Accelerate CPU-only applications.

Final Exercise: Accelerate Vector Addition Application

The following challenge gives you a chance to apply everything you have learned in this lab: accelerating a CPU-only vector addition program. Although it is not very complex, it lets you focus on using what you have learned to accelerate applications with CUDA. After completing this exercise, if you have the time and interest to dig deeper, continue to the Advanced Content section for challenges involving more complex code bases.

01-vector-add.cu contains a working CPU vector addition application.

#include <stdio.h>

void initWith(float num, float *a, int N)
{
  for(int i = 0; i < N; ++i)
  {
    a[i] = num;
  }
}

void addVectorsInto(float *result, float *a, float *b, int N)
{
  for(int i = 0; i < N; ++i)
  {
    result[i] = a[i] + b[i];
  }
}

void checkElementsAre(float target, float *array, int N)
{
  for(int i = 0; i < N; i++)
  {
    if(array[i] != target)
    {
      printf("FAIL: array[%d] - %0.0f does not equal %0.0f\n", i, array[i], target);
      exit(1);
    }
  }
  printf("SUCCESS! All values added correctly.\n");
}

int main()
{
  const int N = 2<<20;
  size_t size = N * sizeof(float);

  float *a;
  float *b;
  float *c;

  a = (float *)malloc(size);
  b = (float *)malloc(size);
  c = (float *)malloc(size);

  initWith(3, a, N);
  initWith(4, b, N);
  initWith(0, c, N);

  addVectorsInto(c, a, b, N);

  checkElementsAre(7, c, N);

  free(a);
  free(b);
  free(c);
}

Accelerate its addVectorsInto function so that it runs as a CUDA kernel on the GPU and performs its work in parallel. The following actions are required:

  • Augment the addVectorsInto definition so that it is a CUDA kernel.
  • Choose and use a working execution configuration so that addVectorsInto launches as a CUDA kernel.
  • Update memory allocation and deallocation to reflect that the three vectors a, b, and result need to be accessed by host and device code.
  • Refactor the body of addVectorsInto: it will be launched inside a single thread and only needs to do one thread's worth of work on the input vectors. Make sure the thread never attempts to access elements outside the range of the input vectors, and consider whether the thread needs to work on more than one element of the input vectors.
  • Add error handling where CUDA code might otherwise fail silently.
#include <stdio.h>
#include <assert.h>

inline cudaError_t checkCuda(cudaError_t result)
{
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
  return result;
}

void initWith(float num, float *a, int N)
{
  for(int i = 0; i < N; ++i)
  {
    a[i] = num;
  }
}

__global__
void addVectorsInto(float *result, float *a, float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
  {
    result[i] = a[i] + b[i];
  }
}

void checkElementsAre(float target, float *array, int N)
{
  for(int i = 0; i < N; i++)
  {
    if(array[i] != target)
    {
      printf("FAIL: array[%d] - %0.0f does not equal %0.0f\n", i, array[i], target);
      exit(1);
    }
  }
  printf("SUCCESS! All values added correctly.\n");
}

int main()
{
  const int N = 2<<20;
  size_t size = N * sizeof(float);

  float *a;
  float *b;
  float *c;

  checkCuda( cudaMallocManaged(&a, size) );
  checkCuda( cudaMallocManaged(&b, size) );
  checkCuda( cudaMallocManaged(&c, size) );

  initWith(3, a, N);
  initWith(4, b, N);
  initWith(0, c, N);

  size_t threadsPerBlock;
  size_t numberOfBlocks;

  threadsPerBlock = 256;
  numberOfBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;

  addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);

  checkCuda( cudaGetLastError() );
  checkCuda( cudaDeviceSynchronize() );

  checkElementsAre(7, c, N);

  checkCuda( cudaFree(a) );
  checkCuda( cudaFree(b) );
  checkCuda( cudaFree(c) );
}

!nvcc -arch=sm_70 -o vector-add 07-vector-add/01-vector-add.cu -run

Run result: SUCCESS! All values added correctly.

Advanced Content

The following exercises provide an additional challenge for learners with time and an interest in further study. They require the use of more advanced techniques and are presented with less supporting guidance, so they are not easy to complete, but you will make significant progress in the process.

Grids and Blocks of 2 and 3 Dimensions

You can define grids and thread blocks with up to three dimensions. Defining grids and thread blocks with multiple dimensions does not affect their performance in any way, but it can be very useful when dealing with data that has multiple dimensions, such as a 2D matrix. To define a 2D or 3D grid or thread block, use CUDA's dim3 type, as follows:

dim3 threads_per_block(16, 16, 1);
dim3 number_of_blocks(16, 16, 1);
someKernel<<<number_of_blocks, threads_per_block>>>();

Given the above example, the variables gridDim.x, gridDim.y, blockDim.x, and blockDim.y within someKernel will all equal 16.
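
As an illustration only (the kernel name someKernel2D and its width/height parameters are assumptions for this sketch, not part of the exercise files), a kernel launched with the configuration above might compute a unique 2D coordinate per thread like this:

__global__ void someKernel2D(int width, int height)
{
  // Each thread derives a unique (col, row) pair from its block and thread indices.
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;

  // Guard against threads that fall outside the data, just as in the 1D case.
  if (col < width && row < height)
  {
    // ... work on element (row, col) of a 2D data set ...
  }
}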

Exercise: Accelerate 2D Matrix Multiply Application

The file 01-matrix-multiply-2d.cu contains a full-featured host function, matrixMulCPU.

#include <stdio.h>

#define N  64

__global__ void matrixMulGPU( int * a, int * b, int * c )
{
  /*
   * Build out this kernel.
   */
}

/*
 * This CPU function already works, and will run to create a solution matrix
 * against which to verify your work building out the matrixMulGPU kernel.
 */

void matrixMulCPU( int * a, int * b, int * c )
{
  int val = 0;

  for( int row = 0; row < N; ++row )
    for( int col = 0; col < N; ++col )
    {
      val = 0;
      for ( int k = 0; k < N; ++k )
        val += a[row * N + k] * b[k * N + col];
      c[row * N + col] = val;
    }
}

int main()
{
  int *a, *b, *c_cpu, *c_gpu; // Allocate a solution matrix for both the CPU and the GPU operations

  int size = N * N * sizeof (int); // Number of bytes of an N x N matrix

  // Allocate memory
  cudaMallocManaged (&a, size);
  cudaMallocManaged (&b, size);
  cudaMallocManaged (&c_cpu, size);
  cudaMallocManaged (&c_gpu, size);

  // Initialize memory; create 2D matrices
  for( int row = 0; row < N; ++row )
    for( int col = 0; col < N; ++col )
    {
      a[row*N + col] = row;
      b[row*N + col] = col+2;
      c_cpu[row*N + col] = 0;
      c_gpu[row*N + col] = 0;
    }

  /*
   * Assign `threads_per_block` and `number_of_blocks` 2D values
   * that can be used in matrixMulGPU above.
   */

  dim3 threads_per_block;
  dim3 number_of_blocks;

  matrixMulGPU <<< number_of_blocks, threads_per_block >>> ( a, b, c_gpu );

  cudaDeviceSynchronize();

  // Call the CPU version to check our work
  matrixMulCPU( a, b, c_cpu );

  // Compare the two answers to make sure they are equal
  bool error = false;
  for( int row = 0; row < N && !error; ++row )
    for( int col = 0; col < N && !error; ++col )
      if (c_cpu[row * N + col] != c_gpu[row * N + col])
      {
        printf("FOUND ERROR at c[%d][%d]\n", row, col);
        error = true;
        break;
      }
  if (!error)
    printf("Success!\n");

  // Free all our allocated memory
  cudaFree(a); cudaFree(b);
  cudaFree( c_cpu ); cudaFree( c_gpu );
}

Your task is to build out the matrixMulGPU CUDA kernel. The source code will use these two functions to perform matrix multiplication and compare their answers to verify that the CUDA kernel you wrote is correct.

  • You will need to create an execution configuration whose dim3 parameters have x and y dimensions greater than 1.
  • Inside the body of the kernel, you will need to establish the running thread's unique index within the grid as usual, but you should establish two indices for the thread: one for the x axis of the grid and one for the y axis of the grid.
#include <stdio.h>

#define N  64

__global__ void matrixMulGPU( int * a, int * b, int * c )
{
  int val = 0;

  int row = blockIdx.x * blockDim.x + threadIdx.x;
  int col = blockIdx.y * blockDim.y + threadIdx.y;

  if (row < N && col < N)
  {
    for ( int k = 0; k < N; ++k )
      val += a[row * N + k] * b[k * N + col];
    c[row * N + col] = val;
  }
}

void matrixMulCPU( int * a, int * b, int * c )
{
  int val = 0;

  for( int row = 0; row < N; ++row )
    for( int col = 0; col < N; ++col )
    {
      val = 0;
      for ( int k = 0; k < N; ++k )
        val += a[row * N + k] * b[k * N + col];
      c[row * N + col] = val;
    }
}

int main()
{
  int *a, *b, *c_cpu, *c_gpu;

  int size = N * N * sizeof (int); // Number of bytes of an N x N matrix

  // Allocate memory
  cudaMallocManaged (&a, size);
  cudaMallocManaged (&b, size);
  cudaMallocManaged (&c_cpu, size);
  cudaMallocManaged (&c_gpu, size);

  // Initialize memory
  for( int row = 0; row < N; ++row )
    for( int col = 0; col < N; ++col )
    {
      a[row*N + col] = row;
      b[row*N + col] = col+2;
      c_cpu[row*N + col] = 0;
      c_gpu[row*N + col] = 0;
    }

  dim3 threads_per_block (16, 16, 1); // A 16 x 16 block threads
  dim3 number_of_blocks ((N / threads_per_block.x) + 1, (N / threads_per_block.y) + 1, 1);

  matrixMulGPU <<< number_of_blocks, threads_per_block >>> ( a, b, c_gpu );

  cudaDeviceSynchronize(); // Wait for the GPU to finish before proceeding

  // Call the CPU version to check our work
  matrixMulCPU( a, b, c_cpu );

  // Compare the two answers to make sure they are equal
  bool error = false;
  for( int row = 0; row < N && !error; ++row )
    for( int col = 0; col < N && !error; ++col )
      if (c_cpu[row * N + col] != c_gpu[row * N + col])
      {
        printf("FOUND ERROR at c[%d][%d]\n", row, col);
        error = true;
        break;
      }
  if (!error)
    printf("Success!\n");

  // Free all our allocated memory
  cudaFree(a); cudaFree(b);
  cudaFree( c_cpu ); cudaFree( c_gpu );
}

!nvcc -arch=sm_70 -o matrix-multiply-2d 08-matrix-multiply/01-matrix-multiply-2d.cu -run

Exercise: Accelerate A Thermal Conductivity Application

In the following exercise, you will perform an acceleration operation for an application that simulates two-dimensional heat conduction of metallic silver.

In 01-heat-conduction.cu:

#include <stdio.h>
#include <math.h>

// Simple define to index into a 1D array from 2D space
#define I2D(num, c, r) ((r)*(num)+(c))

/*
 * `step_kernel_mod` is currently a direct copy of the CPU reference solution
 * `step_kernel_ref` below. Accelerate it to run as a CUDA kernel.
 */

void step_kernel_mod(int ni, int nj, float fact, float* temp_in, float* temp_out)
{
  int i00, im10, ip10, i0m1, i0p1;
  float d2tdx2, d2tdy2;


  // loop over all points in domain (except boundary)
  for ( int j=1; j < nj-1; j++ ) {
    for ( int i=1; i < ni-1; i++ ) {
      // find indices into linear memory
      // for central point and neighbours
      i00 = I2D(ni, i, j);
      im10 = I2D(ni, i-1, j);
      ip10 = I2D(ni, i+1, j);
      i0m1 = I2D(ni, i, j-1);
      i0p1 = I2D(ni, i, j+1);

      // evaluate derivatives
      d2tdx2 = temp_in[im10]-2*temp_in[i00]+temp_in[ip10];
      d2tdy2 = temp_in[i0m1]-2*temp_in[i00]+temp_in[i0p1];

      // update temperatures
      temp_out[i00] = temp_in[i00]+fact*(d2tdx2 + d2tdy2);
    }
  }
}

void step_kernel_ref(int ni, int nj, float fact, float* temp_in, float* temp_out)
{
  int i00, im10, ip10, i0m1, i0p1;
  float d2tdx2, d2tdy2;


  // loop over all points in domain (except boundary)
  for ( int j=1; j < nj-1; j++ ) {
    for ( int i=1; i < ni-1; i++ ) {
      // find indices into linear memory
      // for central point and neighbours
      i00 = I2D(ni, i, j);
      im10 = I2D(ni, i-1, j);
      ip10 = I2D(ni, i+1, j);
      i0m1 = I2D(ni, i, j-1);
      i0p1 = I2D(ni, i, j+1);

      // evaluate derivatives
      d2tdx2 = temp_in[im10]-2*temp_in[i00]+temp_in[ip10];
      d2tdy2 = temp_in[i0m1]-2*temp_in[i00]+temp_in[i0p1];

      // update temperatures
      temp_out[i00] = temp_in[i00]+fact*(d2tdx2 + d2tdy2);
    }
  }
}

int main()
{
  int istep;
  int nstep = 200; // number of time steps

  // Specify our 2D dimensions
  const int ni = 200;
  const int nj = 100;
  float tfac = 8.418e-5; // thermal diffusivity of silver

  float *temp1_ref, *temp2_ref, *temp1, *temp2, *temp_tmp;

  const int size = ni * nj * sizeof(float);

  temp1_ref = (float*)malloc(size);
  temp2_ref = (float*)malloc(size);
  temp1 = (float*)malloc(size);
  temp2 = (float*)malloc(size);

  // Initialize with random data
  for( int i = 0; i < ni*nj; ++i) {
    temp1_ref[i] = temp2_ref[i] = temp1[i] = temp2[i] = (float)rand()/(float)(RAND_MAX/100.0f);
  }

  // Execute the CPU-only reference version
  for (istep=0; istep < nstep; istep++) {
    step_kernel_ref(ni, nj, tfac, temp1_ref, temp2_ref);

    // swap the temperature pointers
    temp_tmp = temp1_ref;
    temp1_ref = temp2_ref;
    temp2_ref= temp_tmp;
  }

  // Execute the modified version using same data
  for (istep=0; istep < nstep; istep++) {
    step_kernel_mod(ni, nj, tfac, temp1, temp2);

    // swap the temperature pointers
    temp_tmp = temp1;
    temp1 = temp2;
    temp2= temp_tmp;
  }

  float maxError = 0;
  // Output should always be stored in the temp1 and temp1_ref at this point
  for( int i = 0; i < ni*nj; ++i ) {
    if (abs(temp1[i]-temp1_ref[i]) > maxError) { maxError = abs(temp1[i]-temp1_ref[i]); }
  }

  // Check and see if our maxError is greater than an error bound
  if (maxError > 0.0005f)
    printf("Problem! The Max Error of %.5f is NOT within acceptable bounds.\n", maxError);
  else
    printf("The Max Error of %.5f is within acceptable bounds.\n", maxError);

  free( temp1_ref );
  free( temp2_ref );
  free( temp1 );
  free( temp2 );

  return 0;
}

Convert the step_kernel_mod function so that it executes on the GPU, and modify the main function to allocate the data properly for use on both the CPU and GPU. The step_kernel_ref function runs on the CPU and is used for error checking. Because this code involves floating-point computation, different processors, or even simple reorderings of operations on the same processor, may produce slightly different results. For this reason, the error-checking code uses an error threshold instead of looking for an exact match.

#include <stdio.h>
#include <math.h>

// Simple define to index into a 1D array from 2D space
#define I2D(num, c, r) ((r)*(num)+(c))

__global__
void step_kernel_mod(int ni, int nj, float fact, float* temp_in, float* temp_out)
{
  int i00, im10, ip10, i0m1, i0p1;
  float d2tdx2, d2tdy2;

  int j = blockIdx.x * blockDim.x + threadIdx.x;
  int i = blockIdx.y * blockDim.y + threadIdx.y;

  // loop over all points in domain (except boundary)
  if (j > 0 && i > 0 && j < nj-1 && i < ni-1) {
    // find indices into linear memory
    // for central point and neighbours
    i00 = I2D(ni, i, j);
    im10 = I2D(ni, i-1, j);
    ip10 = I2D(ni, i+1, j);
    i0m1 = I2D(ni, i, j-1);
    i0p1 = I2D(ni, i, j+1);

    // evaluate derivatives
    d2tdx2 = temp_in[im10]-2*temp_in[i00]+temp_in[ip10];
    d2tdy2 = temp_in[i0m1]-2*temp_in[i00]+temp_in[i0p1];

    // update temperatures
    temp_out[i00] = temp_in[i00]+fact*(d2tdx2 + d2tdy2);
  }
}

void step_kernel_ref(int ni, int nj, float fact, float* temp_in, float* temp_out)
{
  int i00, im10, ip10, i0m1, i0p1;
  float d2tdx2, d2tdy2;


  // loop over all points in domain (except boundary)
  for ( int j=1; j < nj-1; j++ ) {
    for ( int i=1; i < ni-1; i++ ) {
      // find indices into linear memory
      // for central point and neighbours
      i00 = I2D(ni, i, j);
      im10 = I2D(ni, i-1, j);
      ip10 = I2D(ni, i+1, j);
      i0m1 = I2D(ni, i, j-1);
      i0p1 = I2D(ni, i, j+1);

      // evaluate derivatives
      d2tdx2 = temp_in[im10]-2*temp_in[i00]+temp_in[ip10];
      d2tdy2 = temp_in[i0m1]-2*temp_in[i00]+temp_in[i0p1];

      // update temperatures
      temp_out[i00] = temp_in[i00]+fact*(d2tdx2 + d2tdy2);
    }
  }
}

int main()
{
  int istep;
  int nstep = 200; // number of time steps

  // Specify our 2D dimensions
  const int ni = 200;
  const int nj = 100;
  float tfac = 8.418e-5; // thermal diffusivity of silver

  float *temp1_ref, *temp2_ref, *temp1, *temp2, *temp_tmp;

  const int size = ni * nj * sizeof(float);

  temp1_ref = (float*)malloc(size);
  temp2_ref = (float*)malloc(size);
  cudaMallocManaged(&temp1, size);
  cudaMallocManaged(&temp2, size);

  // Initialize with random data
  for( int i = 0; i < ni*nj; ++i) {
    temp1_ref[i] = temp2_ref[i] = temp1[i] = temp2[i] = (float)rand()/(float)(RAND_MAX/100.0f);
  }

  // Execute the CPU-only reference version
  for (istep=0; istep < nstep; istep++) {
    step_kernel_ref(ni, nj, tfac, temp1_ref, temp2_ref);

    // swap the temperature pointers
    temp_tmp = temp1_ref;
    temp1_ref = temp2_ref;
    temp2_ref= temp_tmp;
  }

  dim3 tblocks(32, 16, 1);
  dim3 grid((nj/tblocks.x)+1, (ni/tblocks.y)+1, 1);
  cudaError_t ierrSync, ierrAsync;

  // Execute the modified version using same data
  for (istep=0; istep < nstep; istep++) {
    step_kernel_mod<<< grid, tblocks >>>(ni, nj, tfac, temp1, temp2);

    ierrSync = cudaGetLastError();
    ierrAsync = cudaDeviceSynchronize(); // Wait for the GPU to finish
    if (ierrSync != cudaSuccess) { printf("Sync error: %s\n", cudaGetErrorString(ierrSync)); }
    if (ierrAsync != cudaSuccess) { printf("Async error: %s\n", cudaGetErrorString(ierrAsync)); }

    // swap the temperature pointers
    temp_tmp = temp1;
    temp1 = temp2;
    temp2= temp_tmp;
  }

  float maxError = 0;
  // Output should always be stored in the temp1 and temp1_ref at this point
  for( int i = 0; i < ni*nj; ++i ) {
    if (abs(temp1[i]-temp1_ref[i]) > maxError) { maxError = abs(temp1[i]-temp1_ref[i]); }
  }

  // Check and see if our maxError is greater than an error bound
  if (maxError > 0.0005f)
    printf("Problem! The Max Error of %.5f is NOT within acceptable bounds.\n", maxError);
  else
    printf("The Max Error of %.5f is within acceptable bounds.\n", maxError);

  free( temp1_ref );
  free( temp2_ref );
  cudaFree( temp1 );
  cudaFree( temp2 );

  return 0;
}

!nvcc -arch=sm_70 -o heat-conduction 09-heat/01-heat-conduction.cu -run

The original heat conduction CPU source code in this exercise comes from the University of Houston article An OpenACC Example Code for a C-based heat conduction code.

Managing Accelerated Application Memory with CUDA C/C++ Unified Memory and nvprof

For this lab and the other CUDA fundamentals labs, we strongly recommend that you follow the CUDA Best Practices Guide, which recommends a design cycle called APOD: Assess, Parallelize, Optimize, Deploy. In short, APOD defines an iterative design process in which developers incrementally improve the performance of their accelerated applications and release code. As their CUDA programming ability grows, developers become able to apply more advanced optimization techniques to their accelerated code bases.

This lab supports this iterative development style. You will use the NVIDIA Command Line Profiler to qualitatively measure application performance and identify optimization opportunities, then apply incremental improvements, and finally learn new techniques and repeat the cycle. Note that many of the techniques you will learn and apply in this lab involve the details of how CUDA's unified memory works. Understanding unified memory behavior is a fundamental skill for CUDA developers, and a prerequisite for many more advanced memory management techniques.

Prerequisites

To get the most out of this lab, you should already be able to:

  • Write, compile, and run C/C++ programs that call CPU functions and launch GPU kernels.
  • Use the execution configuration to control the parallel thread hierarchy.
  • Refactor the serial loop to perform its iterations in parallel on the GPU.
  • Allocate and free unified memory.

Objectives

When you have completed this experiment, you will be able to:

  • Use the NVIDIA Command Line Profiler (nvprof) to profile the performance of accelerated applications.
  • Optimize execution configurations using an understanding of streaming multiprocessors.
  • Understand the behavior of page faulting and data migration in unified memory.
  • Use asynchronous memory prefetching to reduce page faults and data migrations and improve performance.
  • Adopt an iterative development cycle to quickly accelerate and deploy applications.

Iterative Optimizations with the NVIDIA Command Line Profiler

The only way to be sure that your attempts to optimize an accelerated code base are actually successful is to profile the application for quantitative information about its performance. nvprof is the NVIDIA command line profiler. It is included in the CUDA toolkit and provides powerful capabilities for profiling accelerated applications.

nvprof is simple to use; in its most basic use, you pass it the path of an executable compiled with nvcc. nvprof then runs the application and prints a summary of the application's GPU activity, CUDA API calls, and unified memory activity. We will cover this topic in more detail later in this lab.

When accelerating an application, or optimizing an already accelerated application, take a scientific, iterative approach: profile the application after making changes, and record the impact that any refactoring has on performance. Making these observations early and often usually makes it easy to gain enough performance improvement to help you ship your accelerated application. In addition, profiling frequently teaches you how specific changes to a CUDA code base affect its real performance, which is hard to learn if you only profile after making many changes.

Exercise: Profile an Application with nvprof

01-vector-add.cu (you can click this file link, or any source file link in this lab, to open and edit the file) is a simple accelerated vector addition program. Use the two code execution cells below (hold down CTRL and click). The first cell compiles (and runs) the vector addition program. The second cell profiles the newly compiled executable with nvprof.

After the application is analyzed, use the information displayed in the analysis output to answer the following questions:

  • What is the name of the only CUDA kernel called in this application?
  • How long did this kernel take to run? Record this time somewhere: you will be optimizing this application and will want to know how much faster you can make it.
!nvcc -arch=sm_70 -o single-thread-vector-add 01-vector-add/01-vector-add.cu -run
!nvprof ./single-thread-vector-add

Exercise: Optimize and Profile

Please take a minute or two to update 01-vector-add.cu so that it runs on many threads within a single thread block. Recompile it and then profile it with nvprof using the code execution cells below. Use the profiling output to check the kernel's runtime. How much of a speedup did this optimization produce? Be sure to record your results somewhere.
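
For example, one possible minimal update (assuming the grid-stride kernel from the earlier solution code) is to launch a single block of 256 threads:

addVectorsInto<<<1, 256>>>(c, a, b, N); // 1 block of 256 threads; the grid-stride loop still covers all N elements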

!nvcc -arch=sm_70 -o multi-thread-vector-add 01-vector-add/01-vector-add.cu -run
!nvprof ./multi-thread-vector-add

Exercise: Optimize Iteratively

In this exercise, you will go through several cycles of editing 01-vector-add.cu, recompiling it, and profiling it to see the impact. Follow these guidelines as you work:

  • Start by listing 3 to 5 different ways you will update the execution configuration, making sure to cover a range of different grid and thread block size combinations.
  • Edit the 01-vector-add.cu program using one of the listed methods.
  • Compile and profile the updated code using the two code execution cells below.
  • Record the kernel runtime as it is given in the profiling output.
  • Repeat the edit/profile/record cycle for each possible optimization you listed above.

Which of the execution configurations you tried proved to be the fastest?

!nvcc -arch=sm_70 -o iteratively-optimized-vector-add 01-vector-add/01-vector-add.cu -run
!nvprof ./iteratively-optimized-vector-add

Streaming Multiprocessors and Querying the Device

This section explores how understanding specific features of the GPU hardware can promote optimization. After learning about streaming multiprocessors, you will attempt to further optimize the accelerated vector addition program you have been working on.

The following slides give a visual overview of the upcoming material. Click through the slides before moving on to the topics in the sections that follow.

%%HTML

<div align="center"><iframe src="https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-01-V1/AC_UM_NVPROF-zh/NVPROF_UM_1-zh.pptx" frameborder="0" width="900" height="550" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></div>

Streaming Multiprocessors and Warps

GPUs that run CUDA applications have processing units called streaming multiprocessors (SMs). During kernel execution, thread blocks are handed to SMs for execution. To support the GPU in performing as many parallel operations as possible, performance can often be improved by choosing a grid size whose number of thread blocks is a multiple of the number of SMs on the given GPU.

In addition, SMs create, manage, schedule, and execute threads in groups of 32 called warps within a thread block. This course does not explore SMs and warps in depth, but it is worth noting that performance gains can also be had by choosing a thread block size that is a multiple of 32.

Programmatically Querying GPU Device Properties

Because the number of SMs on a GPU varies depending on the specific GPU used, you should not hard-code the number of SMs into the code base; to keep the code portable, obtain this information programmatically instead.

The following shows how, in CUDA C/C++, to obtain a C struct that contains many properties of the currently active GPU device, including its number of SMs:

int deviceId;
cudaGetDevice(&deviceId);                  // `deviceId` now points to the id of the currently active GPU.

cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId); // `props` now has many useful properties about
                                           // the active GPU device.
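
As an illustrative sketch only (the commented-out kernel launch is hypothetical, and the specific multipliers are just one common heuristic rather than a prescription from the course), these properties could then be used to size an execution configuration; multiProcessorCount and warpSize are fields of cudaDeviceProp:

int deviceId;
cudaGetDevice(&deviceId);

cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId);

int numberOfSMs = props.multiProcessorCount; // number of SMs on the active GPU
int warpSize = props.warpSize;               // 32 on current NVIDIA GPUs

// One possible heuristic: a block size that is a multiple of the warp size,
// and a grid whose block count is a multiple of the number of SMs.
size_t threadsPerBlock = 8 * warpSize;       // 256 threads per block
size_t numberOfBlocks = 32 * numberOfSMs;

// someKernel<<<numberOfBlocks, threadsPerBlock>>>(...); // hypothetical launch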

Exercise: Query the Device

Currently, 01-get-device-properties.cu contains many unassigned variables and will print a bunch of useless information that is intended to describe details about the currently active GPU device.

Build out 01-get-device-properties.cu so that it prints the actual values of the desired device properties indicated in the source code. For support, and as an introduction, refer to the CUDA Runtime documentation to help identify the relevant fields in the device properties struct. If you have problems, see the Solution.

!nvcc -arch=sm_70 -o get-device-properties 04-device-properties/01-get-device-properties.cu -run

Exercise: Optimize Vector Add with Grids Sized to Number of SMs

Refactor the addVectorsInto kernel launch in the 01-vector-add.cu program you have been working on so that, by querying the device's number of SMs, the grid it launches with contains a number of thread blocks that is a multiple of the number of SMs on the device.

Depending on other specifics in your code, this refactoring may or may not improve, or significantly change, the performance of your kernel. Therefore, it is important to always use nvprof so that you can evaluate performance changes quantitatively. Record the current results, along with any other findings, based on the profiling output.

!nvcc -arch=sm_70 -o sm-optimized-vector-add 01-vector-add/01-vector-add.cu -run
!nvprof ./sm-optimized-vector-add

Unified Memory Details

You have been using cudaMallocManaged to allocate memory intended for use by either host or device code, and you have been enjoying the convenience of this method, namely automatic memory migration and simplified programming, without needing a deep understanding of how the unified memory (UM) allocated by cudaMallocManaged actually works. nvprof provides detailed information about UM management in an accelerated application, and combining this information with a deeper understanding of how UM works creates additional opportunities for optimizing accelerated applications.

The following slides give a visual overview of the upcoming material. Click through the slides before moving on to the topics in the sections that follow.

%%HTML

<div align="center"><iframe src="https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-01-V1/AC_UM_NVPROF-zh/NVPROF_UM_2-zh.pptx" frameborder="0" width="900" height="550" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></div>

Unified Memory Migration

When UM is allocated, the memory does not yet reside on either the host or the device. When the host or a device first attempts to access the memory, a page fault occurs, and the host or device migrates the needed data in batches. Likewise, whenever the CPU, or any GPU in the accelerated system, attempts to access memory that does not yet reside on it, a page fault occurs and triggers a migration.

The ability to page fault and migrate memory on demand is a great help in simplifying the development of accelerated applications. In addition, when working with data that exhibits sparse access patterns (for example, when it is impossible to know which data will need to be processed before the application actually runs), and in accelerated systems with multiple GPUs where the data may be accessed by several GPU devices, on-demand memory migration brings significant advantages.

In some cases, however (for example, when the data needs are known before runtime and large contiguous blocks of memory are required), the overhead of page faulting and migrating data on demand can, and should, be avoided.

Much of the remainder of this lab focuses on understanding on-demand migration and how to identify it in the profiler output. With this knowledge, you can enjoy the advantages of on-demand migration while reducing its overhead.
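
As a purely illustrative sketch (the kernel name doubleElements and the sizes used are assumptions for this example, not part of the exercise files), the following access pattern would be expected to produce CPU page faults during initialization and GPU page faults plus host-to-device migrations during the kernel, all visible in nvprof's Unified Memory section:

__global__ void doubleElements(int *a, int N) // hypothetical kernel, for illustration only
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
  for (; i < N; i += stride)
  {
    a[i] *= 2;
  }
}

int main()
{
  int N = 2<<20;
  size_t size = N * sizeof(int);

  int *a;
  cudaMallocManaged(&a, size);

  for (int i = 0; i < N; ++i) // First touch on the CPU: expect CPU page faults here,
  {                           // as pages are populated in host memory.
    a[i] = 1;
  }

  doubleElements<<<32, 256>>>(a, N); // First access on the GPU: expect GPU page faults
  cudaDeviceSynchronize();           // and host-to-device migrations on demand.

  cudaFree(a);
}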

Exercise: Explore UM Page Faulting

nvprof provides output describing the UM behavior of the profiled application. In this exercise, you will make several changes to a simple application and use the Unified Memory section of the nvprof output to explore the UM data migration behavior after each change.

01-page-faults.cu contains a hostFunction and a gpuKernel, either of which could be used to initialize the elements of a 2<<24-element vector with the number 1. Currently, neither the host function nor the GPU kernel is being used.

For each of the following 4 questions, first hypothesize, based on your understanding of UM behavior, about what kind of page faulting should occur, then edit 01-page-faults.cu, using one or both of the two functions provided in the code base, to create a scenario that lets you test your hypothesis.

To test your hypotheses, compile and profile your code using the code execution cells below. Be sure to record your hypotheses, as well as the results from the nvprof output, especially the CPU and GPU page faults, for each of the four experiments. If you run into problems, you can click the links below to obtain reference solutions for each of the four experiments.

  • What happens when the unified memory is accessed only by the CPU? ( Solution)
  • What happens when unified memory is accessed only by GPU? ( Solution)
  • What happens when the unified memory is accessed first by CPU and then by GPU? ( Solution)
  • What happens when the unified memory is accessed first by GPU and then by CPU? ( Solution)
!nvcc -arch=sm_70 -o page-faults 06-unified-memory-page-faults/01-page-faults.cu -run
!nvprof ./page-faults

Exercise: Revisit UM Behavior for Vector Add Program

Return to the 01-vector-add.cu program you have been working on in this lab, review the code base in its current state, and hypothesize about what kinds of page faults you expect to occur. Look at the profiling output for your last refactor (either by scrolling up to find the output or by executing the code execution cell below), and observe the Unified Memory section of the profiler output. Can you explain the page faulting descriptions based on the contents of the code base?

!nvprof ./sm-optimized-vector-add

Exercise: Initialize Vector in Kernel

When nvprof reports the execution time of a kernel, the host-to-device page faults and data migrations that occur during the kernel's execution are included in the reported execution time.

With this in mind, refactor the initWith host function in the 01-vector-add.cu program into a CUDA kernel that initializes the allocated vectors in parallel on the GPU. After successfully compiling and running the refactored application, but before profiling it, hypothesize about the following:

  • How do you expect the refactor to affect UM page-fault behavior?
  • How do you expect the refactor to affect the reported runtime of addVectorsInto?

Please record the results again. If you have problems, see Solution.
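
If you would like a starting point before checking the Solution, one possible shape for such an initialization kernel is sketched below (a minimal sketch using the same grid-stride pattern as the addVectorsInto kernel shown earlier; it is not the lab's reference solution):

__global__
void initWith(float num, float *a, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
  {
    a[i] = num;
  }
}

// Launched in place of the host-side initialization calls, for example:
// initWith<<<numberOfBlocks, threadsPerBlock>>>(3, a, N);
// initWith<<<numberOfBlocks, threadsPerBlock>>>(4, b, N);
// initWith<<<numberOfBlocks, threadsPerBlock>>>(0, c, N);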

!nvcc -arch=sm_70 -o initialize-in-kernel 01-vector-add/01-vector-add.cu -run
!nvprof ./initialize-in-kernel

Asynchronous Memory Prefetching

A powerful technique for reducing the overhead of page faulting and on-demand memory migration, in both host-to-device and device-to-host transfers, is called asynchronous memory prefetching. With this technique, programmers can asynchronously migrate unified memory (UM) to any CPU or GPU device in the system, in the background, before it is used by application code. Doing so reduces the cost of page faults and on-demand data migration, and in turn improves the performance of GPU kernels and CPU functions.

In addition, prefetching tends to migrate data in larger chunks, so it requires fewer migration operations than on-demand migration. This technique is well suited to cases where the data access needs are known before runtime and the data access patterns are not sparse.

CUDA's cudaMemPrefetchAsync function makes it easy to asynchronously prefetch managed memory to either a GPU device or the CPU. The following shows how to use it to prefetch data to the currently active GPU device, and then to the CPU:

int deviceId;
cudaGetDevice(&deviceId);                                         // The ID of the currently active GPU device.

cudaMemPrefetchAsync(pointerToSomeUMData, size, deviceId);        // Prefetch to GPU device.
cudaMemPrefetchAsync(pointerToSomeUMData, size, cudaCpuDeviceId); // Prefetch to host. `cudaCpuDeviceId` is a
                                                                  // built-in CUDA variable.
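
For instance, here is a hedged sketch of how prefetching might be woven into the vector addition program from earlier (variable names follow that solution code; the exact placement of the calls is one reasonable choice, not the only one):

int deviceId;
cudaGetDevice(&deviceId);

// Prefetch the vectors to the GPU before they are first touched there,
// so the kernels do not page fault and migrate memory on demand.
cudaMemPrefetchAsync(a, size, deviceId);
cudaMemPrefetchAsync(b, size, deviceId);
cudaMemPrefetchAsync(c, size, deviceId);

addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);
checkCuda( cudaGetLastError() );
checkCuda( cudaDeviceSynchronize() );

// Prefetch the result back to the CPU before it is verified on the host.
cudaMemPrefetchAsync(c, size, cudaCpuDeviceId);

checkElementsAre(7, c, N);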

Exercise: Prefetch Memory

At this point, your 01-vector-add.cu program should not only launch a CUDA kernel that adds 2 vectors into a third solution vector (with all vectors allocated by cudaMallocManaged), it should also initialize each of the vectors in parallel in a CUDA kernel. If for some reason your application does not do either of the above, refer to the Reference application and update your own code base to reflect its current functionality.

In the 01-vector-add.cu application, use the cudaMemPrefetchAsync function to conduct 3 experiments that explore its impact on page faulting and memory migration.

  • What happens when you prefetch one of the initialized vectors to the device?
  • What happens when you prefetch two of the initialized vectors to the device?
  • What happens when you prefetch all three of the initialized vectors to the device?

Before conducting each experiment, first hypothesize about the UM behavior (especially with regard to page faults) and about the impact on the reported runtime of the initialization kernel, then run nvprof to verify. If you have problems, see the Solution.

!nvcc -arch=sm_70 -o prefetch-to-gpu 01-vector-add/01-vector-add.cu -run
!nvprof ./prefetch-to-gpu

Exercise: Prefetch Memory Back to the CPU

For the function that verifies the correctness of the addVectorsInto kernel, add an additional prefetch of the memory, this time back to the CPU. Again hypothesize about the impact on UM, then profile with nvprof to confirm. If you have problems, see the Solution.

!nvcc -arch=sm_70 -o prefetch-to-cpu 01-vector-add/01-vector-add.cu -run
!nvprof ./prefetch-to-cpu

Summary

At this point in the lab, you are able to:

  • Use the NVIDIA Command Line Profiler (nvprof) to profile the performance of accelerated applications.
  • Optimize execution configurations using an understanding of streaming multiprocessors.
  • Understand the behavior of page faulting and data migration in unified memory.
  • Use asynchronous memory prefetching to reduce page faults and data migrations and improve performance.
  • Adopt an iterative development cycle to quickly accelerate and deploy applications.

To consolidate your learning and reinforce your ability to iteratively accelerate, optimize, and deploy applications, continue with the final exercise of this lab. After completing it, learners with enough time and an interest in further study can continue on to the advanced content.

Final Exercise: Iteratively Optimize an Accelerated SAXPY Application

A basic accelerated SAXPY application is provided here. It currently contains several bugs that you need to find and fix before you can successfully compile it, run it, and profile it with nvprof.

After fixing the errors and profiling the application, record the runtime of the saxpy kernel, then optimize the application iteratively, profiling with nvprof after each iteration to verify and understand the impact of your code changes on kernel performance and UM behavior.

Use the techniques presented throughout this lab to support your work. For learning support, rely on effortful recall of those techniques rather than rushing back to the material from earlier in this course.

Your ultimate goal is to profile an accurate saxpy kernel, without modifying N, that runs in under 50us. If you have problems, see the Solution, which you can compile and profile at any time.

!nvcc -arch=sm_70 -o saxpy 09-saxpy/01-saxpy.cu -run
!nvprof ./saxpy