[Experience sharing] Using memory padding in GPU CUDA to avoid bank conflicts


  This article explains how to use memory padding in GPU CUDA programming to avoid shared-memory bank conflicts.

1. Shared memory

  Shared memory is a small, low-latency, on-chip memory that is far faster (on the order of 100x) than uncached global memory, and it can be used as a programmable cache. Its main functions are:

  • An intra-block communication channel: threads in the same block can exchange data through it;
  • A program-managed cache for global memory data;
  • Scratchpad memory for reshaping data: it reduces the number of global memory accesses and improves access patterns.

    Shared memory can be allocated statically or dynamically, and it can be declared either inside a kernel or at file scope, using the following qualifier:

__shared__     // qualifier that places a variable in shared memory

__shared__ float tile[_y][_x];     // static declaration: a 2D float array (_y and _x must be compile-time constants)

// dynamic declaration inside the kernel: the size is not known at compile time
extern __shared__ int tile[];

// the size in bytes is supplied as the third launch-configuration parameter
kernel<<<grid, block, isize * sizeof(int)>>>(...);
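
  As a minimal, self-contained sketch of the dynamic form (the kernel reverseBlock and its argument names are illustrative assumptions, not from the original post):

#include <cuda_runtime.h>

// Reverse an array inside one block using dynamically sized shared memory.
__global__ void reverseBlock(int *data, int n)
{
    extern __shared__ int tile[];     // sized by the launch configuration below
    int t = threadIdx.x;
    if (t < n) tile[t] = data[t];
    __syncthreads();                  // make all writes visible before any reads
    if (t < n) data[t] = tile[n - 1 - t];
}

// launch: the third configuration parameter is the shared-memory size in bytes
// reverseBlock<<<1, n, n * sizeof(int)>>>(d_data, n);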

  In order to obtain high bandwidth, shared memory is divided into 32 memory banks of equal size, which can be accessed simultaneously, one by each thread of a warp.
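
  As a rough mental model (assuming the default 4-byte bank width of current NVIDIA GPUs, which the original post does not state), the bank that holds a given 4-byte word can be computed like this:

// successive 32-bit words map to successive banks, wrapping around every 32 words
__host__ __device__ inline unsigned bankOf(size_t byteAddress)
{
    return (byteAddress / 4) % 32;
}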

2. Use memory padding to avoid bank conflict

  If there is no bank conflict, shared memory is about as fast as registers.

  Fast cases:

  • All threads in a warp access different banks: no conflict;
  • All threads in a warp read the same address: the broadcast mechanism is triggered, so there is no conflict.
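
  Both fast cases can be seen in one warp-sized kernel (a minimal sketch; the kernel name and output buffer are my own illustrative assumptions):

__global__ void fastCases(float *out)
{
    __shared__ float s[32];
    s[threadIdx.x] = (float)threadIdx.x;   // 32 threads hit 32 distinct banks: no conflict
    __syncthreads();
    out[threadIdx.x] = s[0];               // all threads read the same address: broadcast, no conflict
}

// launch with a single warp, e.g. fastCases<<<1, 32>>>(d_out);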

  Slow cases:

  • Bank conflict: multiple threads in a warp access different addresses that fall in the same bank;
  • The conflicting accesses must be serialized into separate transactions;
  • The cost is determined by the maximum number of threads accessing the same bank at the same time: the hardware replays the access that many times.
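
  For example (a hypothetical access pattern, not taken from the original post), a stride-2 read inside a kernel produces a two-way conflict, so each shared-memory access of the warp is replayed twice:

__shared__ float s[64];
// threads t and t + 16 both touch bank (2 * t) % 32, so every bank
// is hit by exactly two threads of the warp: a 2-way conflict
float v = s[2 * threadIdx.x];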

  As an example of a bank conflict, consider a BLOCKSIZE x BLOCKSIZE tile staged through shared memory (assume BLOCKSIZE is 32, the warp size).

  Without bank conflict:

int x_id = blockDim.x * blockIdx.x + threadIdx.x;     // column coordinate
int y_id = blockDim.y * blockIdx.y + threadIdx.y;     // row coordinate
int index = y_id * col + x_id;

__shared__ float sData[BLOCKSIZE][BLOCKSIZE];

if (x_id < col && y_id < row)
    sData[threadIdx.y][threadIdx.x] = matrix[index];   // a warp accesses 32 consecutive floats of one row: 32 different banks

__syncthreads();   // barrier moved outside the branch so every thread of the block reaches it

if (x_id < col && y_id < row)
    matrixTest[index] = sData[threadIdx.y][threadIdx.x];

    In the case of a bank conflict:

int x_id = blockDim.x * blockIdx.x + threadIdx.x;     // column coordinate
int y_id = blockDim.y * blockIdx.y + threadIdx.y;     // row coordinate
int index = y_id * col + x_id;

__shared__ float sData[BLOCKSIZE][BLOCKSIZE];

if (x_id < col && y_id < row)
    sData[threadIdx.x][threadIdx.y] = matrix[index];   // a warp accesses one column: elements 32 words apart, all in the same bank

__syncthreads();   // barrier kept outside the branch so every thread reaches it

if (x_id < col && y_id < row)
    matrixTest[index] = sData[threadIdx.x][threadIdx.y];

     In the example above, swapping the two shared-memory indices is the only change, yet it turns the conflict-free access into a 32-way conflict: the 32 threads of a warp now access one column, whose elements are exactly BLOCKSIZE (= 32) words apart and therefore all fall in the same bank. Let's see how to resolve this.

  Continuing with this example, the bank conflict can be avoided by simple memory padding: add one unused element to the end of each row of the shared-memory array, so that successive rows start in different banks.

  In code, memory padding removes the bank conflict as follows:

int x_id = blockDim.x * blockIdx.x + threadIdx.x;     // column coordinate
int y_id = blockDim.y * blockIdx.y + threadIdx.y;     // row coordinate
int index = y_id * col + x_id;

__shared__ float sData[BLOCKSIZE][BLOCKSIZE + 1];     // memory padding: the row stride becomes 33 words

if (x_id < col && y_id < row)
    sData[threadIdx.x][threadIdx.y] = matrix[index];   // a column's elements are now 33 words apart: 32 different banks

__syncthreads();   // barrier outside the branch so every thread reaches it

if (x_id < col && y_id < row)
    matrixTest[index] = sData[threadIdx.x][threadIdx.y];
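
  To see why one extra element per row is enough (taking BLOCKSIZE == 32 and the 4-byte, 32-bank layout assumed above), work through the bank index of a column access:

// without padding, element (i, j) sits at word offset i*32 + j, so its bank is
//     (i*32 + j) % 32 = j
// and a whole column shares one bank: a 32-way conflict.
//
// with padding, element (i, j) sits at word offset i*33 + j, so its bank is
//     (i*33 + j) % 32 = (i + j) % 32
// and as i runs from 0 to 31 within a warp, the column covers 32 distinct banks.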

  The above shared the method of using memory padding in GPU CUDA programming to avoid bank conflicts. I hope this sharing is of some help to your learning.


Tags: Cache gpu nvidia Memory

Posted on Mon, 22 Nov 2021 15:12:15 -0500 by abhic