# Parallel programming and performance optimization of N-Body problem based on CUDA

Contents

- N-body problem
  - Principle
  - Serial code
- CUDA parallel programming
  - Basic idea of parallelization
  - Detailed parallel design
    - Step 1: allocate CPU and GPU memory, initialize, and copy data
    - Step 2: design the bodyForce function
- Optimization ideas
  - Optimization 1 - BLOCK_STEP introduction and shared memory
  - Optimization 2 - Calculation consolidation
  - Optimization 3 - Compilation Optimization
  - Optimization 4 - Other optimization directions
- Effect comparison
- Other ideas & source code resources

# N-body problem

## Principle

The N-body problem is a common physical-simulation problem. In an N-body system, every particle interacts with all of the other particles (the form of the interaction depends on the specific problem), giving rise to the corresponding physical phenomena.

## Serial code

First, the position and velocity of each body are initialized randomly.

```c
void randomizeBodies(float *data, int n) { // initialization function
  for (int i = 0; i < n; i++) {
    data[i] = 2.0f * (rand() / (float)RAND_MAX) - 1.0f; // uniform in [-1, 1]
  }
}
```

Then the mutual forces are computed from the positions, and the velocities are updated accordingly. This is collected into a bodyForce function.

```c
void bodyForce(Body *p, float dt, int n) { // compute interaction forces and update velocities
  for (int i = 0; i < n; ++i) {
    float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;

    for (int j = 0; j < n; j++) {
      float dx = p[j].x - p[i].x;
      float dy = p[j].y - p[i].y;
      float dz = p[j].z - p[i].z;
      float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
      float invDist = rsqrtf(distSqr);
      float invDist3 = invDist * invDist * invDist;
      Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3;
    }

    p[i].vx += dt*Fx; p[i].vy += dt*Fy; p[i].vz += dt*Fz;
  }
}
```

Finally, the position of each body is updated from its new velocity, and the next iteration begins.

```c
for (int i = 0; i < nBodies; i++) { // advance positions using the new velocities
  p[i].x += p[i].vx*dt;
  p[i].y += p[i].vy*dt;
  p[i].z += p[i].vz*dt;
}
```

# CUDA parallel programming

## Basic idea of parallelization

Since the force on each body is computed independently of the others, we can exploit the GPU's high concurrency: n threads compute the velocity changes concurrently, and then n threads update the positions concurrently. This greatly improves the efficiency of the code.

## Detailed parallel design

### Step 1: allocate CPU and GPU memory, initialize, and copy data

```c
// allocate space and initialize
int bytes = nBodies * sizeof(Body);
float *buf;
cudaMallocHost((void **)&buf, bytes); // pinned host memory for fast transfers
randomizeBodies(buf, 6 * nBodies);    // 6 floats per body: x, y, z, vx, vy, vz
float *d_buf;
cudaMalloc((void **)&d_buf, bytes);
Body *d_p = (Body *)d_buf;
cudaMemcpy(d_buf, buf, bytes, cudaMemcpyHostToDevice);
```

### Step 2: design the bodyForce function

Each thread computes its global index i into p from threadIdx.x and blockIdx.x, accumulates the force on body i, and then updates that body's velocity in global memory.

```cuda
// parallel bodyForce kernel
__global__ void bodyForce(Body *p, float dt, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x; // global body index
  if (i < n) {
    float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;
    for (int j = 0; j < n; j++) {
      float dx = p[j].x - p[i].x;
      float dy = p[j].y - p[i].y;
      float dz = p[j].z - p[i].z;
      float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
      float invDist = rsqrtf(distSqr);
      float invDist3 = invDist * invDist * invDist;
      Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3;
    }
    p[i].vx += dt*Fx; p[i].vy += dt*Fy; p[i].vz += dt*Fz;
  }
}
```

Likewise, each thread computes its index into p from threadIdx.x and blockIdx.x and updates that body's position.

```cuda
__global__ void integrate_position(Body *p, float dt, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x; // global body index
  if (i < n) {
    // integrate position
    p[i].x += p[i].vx*dt;
    p[i].y += p[i].vy*dt;
    p[i].z += p[i].vz*dt;
  }
}
```
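With these two kernels, the per-iteration driver on the host side might look roughly like this. It is a sketch: BLOCK_SIZE and nIters are illustrative names not given in the article, and both launches go to the default stream, so they execute in order:

```cuda
#define BLOCK_SIZE 128 // illustrative block size

int nBlocks = (nBodies + BLOCK_SIZE - 1) / BLOCK_SIZE; // round up so every body gets a thread
for (int iter = 0; iter < nIters; iter++) {
  bodyForce<<<nBlocks, BLOCK_SIZE>>>(d_p, dt, nBodies);          // velocities first
  integrate_position<<<nBlocks, BLOCK_SIZE>>>(d_p, dt, nBodies); // then positions
}
cudaMemcpy(buf, d_buf, bytes, cudaMemcpyDeviceToHost); // copy the results back
```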

# Optimization ideas

## Optimization 1 - BLOCK_STEP introduction and shared memory

Analyzing the basic parallel program shows that its main cost is in bodyForce: each thread performs n force computations, and the frequent accesses to global memory carry a large time overhead. We can therefore trade space for time: launching BLOCK_STEP times as many threads reduces the amount of computation per thread, and the shared memory within each thread block caches the most recently accessed data.
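The scheme can be sketched as the kernel below. This is not necessarily the author's exact implementation: the BLOCK_SIZE value, the mapping of the BLOCK_STEP slices onto blockIdx.y, and the use of atomicAdd to combine partial velocity updates are all illustrative choices, and n is assumed divisible by both BLOCK_SIZE and BLOCK_STEP:

```cuda
#define BLOCK_SIZE 128
#define BLOCK_STEP 8 // BLOCK_STEP blocks cooperate on each group of bodies

// Launched as bodyForceTiled<<<dim3(n / BLOCK_SIZE, BLOCK_STEP), BLOCK_SIZE>>>(...):
// blockIdx.x selects which bodies this block accumulates for, and
// blockIdx.y selects which 1/BLOCK_STEP slice of partner bodies it processes.
__global__ void bodyForceTiled(Body *p, float dt, int n) {
  int i = blockIdx.x * BLOCK_SIZE + threadIdx.x; // body this thread accumulates for
  int jStart = blockIdx.y * (n / BLOCK_STEP);    // this block's slice of partners
  float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;

  __shared__ float3 tile[BLOCK_SIZE]; // staged partner positions

  for (int t = jStart; t < jStart + n / BLOCK_STEP; t += BLOCK_SIZE) {
    Body b = p[t + threadIdx.x];      // cooperative load of one tile
    tile[threadIdx.x] = make_float3(b.x, b.y, b.z);
    __syncthreads();
    for (int j = 0; j < BLOCK_SIZE; j++) {
      float dx = tile[j].x - p[i].x;
      float dy = tile[j].y - p[i].y;
      float dz = tile[j].z - p[i].z;
      float invDist = rsqrtf(dx*dx + dy*dy + dz*dz + SOFTENING);
      float invDist3 = invDist * invDist * invDist;
      Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3;
    }
    __syncthreads();
  }
  // each of the BLOCK_STEP cooperating blocks adds its partial contribution
  atomicAdd(&p[i].vx, dt * Fx);
  atomicAdd(&p[i].vy, dt * Fy);
  atomicAdd(&p[i].vz, dt * Fz);
}
```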

Spatial optimization:

- For each body, every force computation accesses global data, which causes a lot of overhead.
- Within a thread block there is shared memory; each block can stage a portion of the data there for the current round of computation, reducing memory-access time.

Trading space for time:

- Introduce a BLOCK_STEP variable so that each thread computes only part of the force on one body.
- Each thread then executes only 1/BLOCK_STEP of the inner loop (n = 4096 in our tests). The figure below shows the principle of the division and conveys the optimization idea well: each block of n holds BLOCK_SIZE bodies' data, split across BLOCK_STEP slices. Pay attention to the relationship between the data partitioning and the threads.

## Optimization 2 - Calculation consolidation

Per-thread computation times are not uniform. In the logic above, we must wait for all threads to finish computing the velocity changes before globally updating the positions. Instead, we can fold the position update into the velocity computation and use a counter to decide whether a body's force has been fully accumulated: once it is complete, the position is updated immediately, with no global wait. A flag array records the progress. After each thread finishes its slice, it atomically decrements flag[i]; the thread whose decrement brings it to 0 updates the position immediately and resets flag[i]. Note that the check-and-reset must itself be atomic.

```cuda
// atomicSub returns the old value, so the thread whose decrement takes
// flag[i] from 1 to 0 is the unique thread that integrates body i's position
if (atomicSub(&flag[i], 1) == 1) {
  // integrate position
  p[i].x += p[i].vx * dt;
  p[i].y += p[i].vy * dt;
  p[i].z += p[i].vz * dt;
  atomicExch(&flag[i], BLOCK_STEP); // reset the counter for the next iteration
}
```
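## Optimization 3 - Compilation Optimization

Compiler settings are another lever. A typical sketch (the exact flags used are an assumption here) is to enable nvcc's optimization and fast-math options, trading a little floating-point accuracy for speed:

```shell
# Hypothetical build command; the author's exact flags are not shown.
# --use_fast_math maps math calls such as rsqrtf onto fast hardware
# approximations, and -O3 enables aggressive host-side optimization.
nvcc -O3 --use_fast_math -o nbody nbody.cu
```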

## Optimization 4 - Other optimization directions

Considering how the GPU schedules thread blocks, the last group of blocks to run is also the last to finish the computation, so the position update can be performed directly in that last group, avoiding the counter altogether.
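Assuming the BLOCK_STEP slices are mapped to blockIdx.y (a hypothetical mapping used here only for illustration), this idea could be sketched as a kernel tail like the following. It relies on block-scheduling order, which CUDA does not guarantee, so it is an experiment rather than a portable technique:

```cuda
// End of the force-accumulation kernel: only the block handling the last
// slice integrates the position, on the assumption that it finishes last.
if (blockIdx.y == BLOCK_STEP - 1) {
  p[i].x += p[i].vx * dt;
  p[i].y += p[i].vy * dt;
  p[i].z += p[i].vz * dt;
}
```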

# Effect comparison

Comparing the serial code, the unoptimized parallel version, and the optimized parallel version, overall efficiency improved by a factor of 1771; the optimization is very successful.

# Other ideas & source code resources

Parts of the code could be improved further by tuning the generated assembly.

You could also use multiple cores for the computation, gaining almost another order of magnitude; the improvement becomes more pronounced as the amount of data increases.
https://github.com/nightmare2423/cuda

Posted on Mon, 08 Nov 2021 11:48:14 -0500 by evan18h