Overview
The objective of this assignment is to implement an optimized reduction and an optimized scan kernel and analyze basic architectural performance properties.
CUDA
Please join Lab 2 by clicking the following link: https://classroom.github.com/a/uOjWtYbF .
In the first step, you will implement a naive reduction kernel with unoptimized thread indexing (see the Reduction slides, page 11, for code snippets). Recall that the naive reduction kernel suffers from significant warp divergence due to its naive thread indexing.
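As a reference point, a minimal sketch of the naive kernel is shown below. It follows the interleaved-addressing pattern discussed in the slides; the names naiveReduction, in, out, and BLOCK_SIZE are placeholders, so match them to the signatures and definitions already provided in kernel.cu.

#define BLOCK_SIZE 512   // placeholder; use the value defined in kernel.cu

__global__ void naiveReduction(float *out, const float *in, unsigned int n) {
    __shared__ float partialSum[2 * BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;

    // Each thread loads two elements into shared memory (0 if out of range).
    partialSum[t] = (start + t < n) ? in[start + t] : 0.0f;
    partialSum[blockDim.x + t] =
        (start + blockDim.x + t < n) ? in[start + blockDim.x + t] : 0.0f;

    // Interleaved addressing: as the stride grows, the active threads become
    // scattered across each warp, which is what causes the divergence.
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % stride == 0)
            partialSum[2 * t] += partialSum[2 * t + stride];
    }

    // Thread 0 writes this block's partial sum.
    if (t == 0)
        out[blockIdx.x] = partialSum[0];
}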
In the second step, modify the kernel so that it implements an optimized reduction. The goal is to have the thread indexing behave as shown on slide 20 of the Reduction slides. Implement the change and verify that the code works.
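A sketch of the optimized kernel, under the same naming assumptions as above: the data loading and the final write stay the same, and only the reduction loop changes so that the active threads remain contiguous.

__global__ void optimizedReduction(float *out, const float *in, unsigned int n) {
    __shared__ float partialSum[2 * BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;

    partialSum[t] = (start + t < n) ? in[start + t] : 0.0f;
    partialSum[blockDim.x + t] =
        (start + blockDim.x + t < n) ? in[start + blockDim.x + t] : 0.0f;

    // Decreasing stride: threads 0..stride-1 do the work, so whole warps are
    // either fully active or fully idle until the stride drops below 32.
    for (unsigned int stride = blockDim.x; stride >= 1; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    if (t == 0)
        out[blockIdx.x] = partialSum[0];
}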
Finally, you need to complete the scan (prefix sum) kernel (see the Prefix Sum slides).
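Which formulation you use depends on the Prefix Sum slides; as one possibility, the sketch below uses the work-efficient (Brent-Kung) pattern to scan a single block's section of 2 * BLOCK_SIZE elements. The names scanKernel, in, out, and XY are again placeholders.

__global__ void scanKernel(float *out, const float *in, unsigned int n) {
    __shared__ float XY[2 * BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;

    // Each thread loads two elements of this block's section.
    XY[t] = (start + t < n) ? in[start + t] : 0.0f;
    XY[blockDim.x + t] =
        (start + blockDim.x + t < n) ? in[start + blockDim.x + t] : 0.0f;

    // Reduction (up-sweep) phase.
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        unsigned int index = (t + 1) * 2 * stride - 1;
        if (index < 2 * BLOCK_SIZE)
            XY[index] += XY[index - stride];
    }

    // Post-reduction (down-sweep) phase distributes the partial results.
    for (unsigned int stride = BLOCK_SIZE / 2; stride >= 1; stride >>= 1) {
        __syncthreads();
        unsigned int index = (t + 1) * 2 * stride - 1;
        if (index + stride < 2 * BLOCK_SIZE)
            XY[index + stride] += XY[index];
    }
    __syncthreads();

    // Write back the inclusive scan of this section.
    if (start + t < n) out[start + t] = XY[t];
    if (start + blockDim.x + t < n) out[start + blockDim.x + t] = XY[blockDim.x + t];
}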
In this lab, you only need to modify “kernel.cu”. There are 6 functions in this file, 2 for each algorithm: a GPU kernel and the CPU function that launches it.
You should complete each GPU kernel and set the grid and block dimensions in the corresponding CPU function.
The size of the input array is a command-line argument; if none is given, the default size is 1 million. Note that you only need to accumulate the partial sums into an array, which is copied back to the host and verified against the final answer. To ensure consistency when grading, do not change the srand seed value.
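On the CPU side, the launch configuration could look roughly like the following sketch, assuming each block processes 2 * BLOCK_SIZE elements as in the kernel sketches above. The function name reductionOnDevice and the pointer names are hypothetical, so adapt them to the signatures already present in kernel.cu.

void reductionOnDevice(float *out_d, float *in_d, unsigned int numElements) {
    // Each block reduces 2 * BLOCK_SIZE input elements, so round up.
    unsigned int numBlocks = (numElements - 1) / (2 * BLOCK_SIZE) + 1;

    dim3 dimGrid(numBlocks, 1, 1);
    dim3 dimBlock(BLOCK_SIZE, 1, 1);

    // out_d holds one partial sum per block; the host sums them afterwards.
    optimizedReduction<<<dimGrid, dimBlock>>>(out_d, in_d, numElements);
}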
Bonus Part: You have learned to implement an efficient algorithm, but only within a single block; the final answer is then computed on the CPU. There are 10 bonus points (5 for Reduction and 5 for Scan) if you also implement this last step on the GPU.
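One possible way to approach the reduction bonus (a sketch of the idea, not the required method) is to launch a second, single-block kernel over the per-block partial sums so that the final answer is produced entirely on the GPU; a hierarchical variant of the same idea (scanning the block sums and adding them back to each section) can be used for Scan. The names finalReduce, partial_d, and numBlocks below are hypothetical.

__global__ void finalReduce(float *partial_d, unsigned int numBlocks) {
    __shared__ float sum[BLOCK_SIZE];
    unsigned int t = threadIdx.x;

    // Each thread first accumulates a strided subset of the partial sums.
    float acc = 0.0f;
    for (unsigned int i = t; i < numBlocks; i += blockDim.x)
        acc += partial_d[i];
    sum[t] = acc;

    // Then a standard decreasing-stride reduction within the single block.
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            sum[t] += sum[t + stride];
    }

    // The final answer ends up in partial_d[0], e.g. when launched as
    // finalReduce<<<1, BLOCK_SIZE>>>(partial_d, numBlocks);
    if (t == 0)
        partial_d[0] = sum[0];
}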
GPGPU-Sim analysis
We will now analyze the behavior of the naive and optimized reduction kernels using GPGPU-Sim. To aid in analyzing the microarchitectural properties of these programs, it may help to save the output of the GPGPU-Sim run to a file.
Since this lab focuses on performance and warp divergence, we will concentrate mainly on the related statistics.
At the end of the simulation run, many different statistics are printed out. The following are general performance statistics:
gpu_sim_cycle      - Number of cycles the last kernel took
gpu_sim_insn       - Number of instructions in the last kernel
gpu_ipc            - The IPC of the last kernel
gpu_tot_sim_cycle  - Number of cycles for the entire application (an application can consist of multiple kernels)
gpu_tot_sim_insn   - Number of instructions for the entire application
gpu_tot_ipc        - The IPC of the entire application run
Another aspect we are concerned with is warp divergence. In the output, there is a section that looks similar to this:
Warp Occupancy Distribution:
Stall:564   W0_Idle:3037   W0_Scoreboard:2257
W1:0    W2:0    W3:0    W4:0    W5:0    W6:0    W7:0    W8:0
W9:0    W10:0   W11:0   W12:0   W13:0   W14:0   W15:0   W16:310
W17:0   W18:0   W19:0   W20:0   W21:0   W22:0   W23:0   W24:0
W25:0   W26:0   W27:0   W28:1   W29:0   W30:0   W31:0   W32:512
This gives the distribution of warps by number of active threads. For example, W32 means that all 32 threads are active (i.e., no warp divergence). In this example, we also see a significant number of warps with only half of their threads active: W16:310. The simulator defines WX (where X = 1 to 32) as the number of cycles during which a warp with X active threads is scheduled into the pipeline.
Report
Answer the following questions (assume we run reduction with an input size of 1,000,000):
For the naive reduction kernel, how many steps execute without divergence? How many steps execute with divergence?
For the optimized reduction kernel, how many steps execute without divergence? How many steps execute with divergence?
Which kernel performed better? (for both real GPUs and GPGPU-Sim)
How does the warp occupancy distribution compare between the two Reduction implementations?
Why do GPGPUs suffer from warp divergence?
Submission
Push your final code to GitHub and check that your code works correctly.
Answer the previous questions in a PDF file and include your GitHub username in your report. In addition:
Do not forget to upload your report on iLearn.
Do not forget to push your report to GitHub.