Lab 1 Matrix Multiply Solution

Overview



The objective of this assignment is to implement a tiled matrix multiplication kernel that supports arbitrarily sized matrices.




CUDA



Please join Lab 1 by clicking the following link: https://classroom.github.com/a/_UVh2NcI




Edit the source files kernel.cu and main.cu to complete the functionality of matrix multiplication on the device. The two input matrices may be of any size, but on GPGPU-Sim we will not test your code with an output matrix larger than 65,536 elements (for example, the product of two 256 x 256 input matrices).
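To illustrate how a tiled multiplication can handle dimensions that are not multiples of the tile size, here is a host-side C++ emulation of the per-block logic. It is a sketch under assumptions, not the lab's kernel.cu: the function name `tiledMatMul`, the row-major layout, and the zero-padding strategy are illustrative choices. Each (by, bx) pair plays the role of a thread block, each (ty, tx) pair the role of a thread, and the two local arrays stand in for shared-memory tiles.

```cpp
#include <cstddef>
#include <vector>

constexpr int TILE_SIZE = 16;  // assumed tile width (16 is the lab's default)

// Host-side emulation of a tiled multiply: C (m x n) = A (m x k) * B (k x n).
// All matrices are row-major. Hypothetical helper, not part of the lab code.
std::vector<float> tiledMatMul(const std::vector<float>& A,
                               const std::vector<float>& B,
                               int m, int k, int n) {
    std::vector<float> C(static_cast<std::size_t>(m) * n, 0.0f);
    // Ceiling division so partial tiles at the matrix edges are still covered.
    const int tilesM = (m + TILE_SIZE - 1) / TILE_SIZE;
    const int tilesN = (n + TILE_SIZE - 1) / TILE_SIZE;
    const int tilesK = (k + TILE_SIZE - 1) / TILE_SIZE;

    for (int by = 0; by < tilesM; ++by) {
        for (int bx = 0; bx < tilesN; ++bx) {
            float acc[TILE_SIZE][TILE_SIZE] = {};  // one accumulator per "thread"
            for (int p = 0; p < tilesK; ++p) {
                float tileA[TILE_SIZE][TILE_SIZE];  // stand-in for shared memory
                float tileB[TILE_SIZE][TILE_SIZE];
                for (int ty = 0; ty < TILE_SIZE; ++ty) {
                    for (int tx = 0; tx < TILE_SIZE; ++tx) {
                        const int row = by * TILE_SIZE + ty;
                        const int col = bx * TILE_SIZE + tx;
                        const int aCol = p * TILE_SIZE + tx;
                        const int bRow = p * TILE_SIZE + ty;
                        // Out-of-range positions load 0 so they contribute nothing.
                        tileA[ty][tx] = (row < m && aCol < k) ? A[row * k + aCol] : 0.0f;
                        tileB[ty][tx] = (bRow < k && col < n) ? B[bRow * n + col] : 0.0f;
                    }
                }
                // On a GPU, a barrier would separate the load and compute steps.
                for (int ty = 0; ty < TILE_SIZE; ++ty)
                    for (int tx = 0; tx < TILE_SIZE; ++tx)
                        for (int i = 0; i < TILE_SIZE; ++i)
                            acc[ty][tx] += tileA[ty][i] * tileB[i][tx];
            }
            // Guarded write-back: only "threads" inside the output matrix store.
            for (int ty = 0; ty < TILE_SIZE; ++ty) {
                for (int tx = 0; tx < TILE_SIZE; ++tx) {
                    const int row = by * TILE_SIZE + ty;
                    const int col = bx * TILE_SIZE + tx;
                    if (row < m && col < n) C[row * n + col] = acc[ty][tx];
                }
            }
        }
    }
    return C;
}
```

The two boundary checks on the loads and the one on the store are the parts that would carry over to a device kernel. The loop structure itself is only a sequential stand-in for the grid of blocks and threads.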




There are three modes of operation for the application. Check “main.cu” for a description of the modes (repeated below). You will support each of these modes using a tiled matrix multiplication implementation.




No arguments: The application will create two randomly initialized matrices of size 1000 x 1000 to multiply. After the device multiplication is invoked, it will compute the correct solution matrix on the CPU and compare that solution with the device-computed solution. If they match (within a certain tolerance), it will print "Test PASSED" to the screen before exiting.



One argument: The application will use random initialization to create the input matrices (size m x m, where m is the argument). Start your testing with small matrices.



Three arguments m, k, and n: The application will initialize the two input matrices with random values. The A matrix will be of size m x k and the B matrix of size k x n, producing a C matrix of size m x n.
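The mapping from command-line arguments to matrix dimensions in the three modes above can be sketched as follows. The `Dims` struct and the `parseDims` function are hypothetical names used for illustration; main.cu may organize this logic differently.

```cpp
#include <cstdlib>
#include <stdexcept>

// A is m x k, B is k x n, C is m x n.
struct Dims { int m; int k; int n; };

// Map argc/argv to matrix dimensions for the three modes.
// Hypothetical helper; not taken from the lab's main.cu.
Dims parseDims(int argc, char* argv[]) {
    Dims d{};
    switch (argc) {
        case 1:  // no arguments: 1000 x 1000 square matrices
            d = {1000, 1000, 1000};
            break;
        case 2: {  // one argument: m x m square matrices
            const int m = std::atoi(argv[1]);
            d = {m, m, m};
            break;
        }
        case 4:  // three arguments: m, k, n
            d = {std::atoi(argv[1]), std::atoi(argv[2]), std::atoi(argv[3])};
            break;
        default:
            throw std::invalid_argument("expected 0, 1, or 3 arguments");
    }
    // std::atoi returns 0 for non-numeric input, so this also rejects garbage.
    if (d.m <= 0 || d.k <= 0 || d.n <= 0)
        throw std::invalid_argument("matrix dimensions must be positive integers");
    return d;
}
```

Rejecting any other argument count keeps the three required modes unambiguous if an optional file-based mode is added later.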



Note that if you wish, you may add a mode to accept input matrices from files, or to dump input and output matrices to files to facilitate testing. The first three modes must remain untouched.
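The CPU check "within a certain tolerance" mentioned for the no-argument mode could look roughly like the sketch below. The function names, the relative-error criterion, and the default tolerance of 1e-3 are all assumptions; the `verify()` function in main.cu may use a different rule.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference: C (m x n) = A (m x k) * B (k x n), all row-major.
std::vector<float> cpuMatMul(const std::vector<float>& A,
                             const std::vector<float>& B,
                             int m, int k, int n) {
    std::vector<float> C(static_cast<std::size_t>(m) * n, 0.0f);
    for (int row = 0; row < m; ++row)
        for (int col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (int i = 0; i < k; ++i)
                sum += A[row * k + i] * B[i * n + col];
            C[row * n + col] = sum;
        }
    return C;
}

// Element-wise comparison with a relative tolerance (assumed value).
// The scale is clamped to 1 so values near zero are compared absolutely.
bool matches(const std::vector<float>& ref, const std::vector<float>& got,
             float relTol = 1e-3f) {
    if (ref.size() != got.size()) return false;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float diff = std::fabs(ref[i] - got[i]);
        const float scale = std::max(1.0f, std::fabs(ref[i]));
        if (diff > relTol * scale) return false;
    }
    return true;
}
```

A tolerance is needed because the device and the CPU may accumulate the floating-point sums in a different order, so bitwise equality is not expected in general.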



At the end, commit and push your completed tiled matrix multiplication code to the private repository. (Don’t forget to check the correctness of your code on GitHub.)




GPGPU-Sim analysis



In this lab, we will analyze the memory behavior of tiled matrix multiplication using GPGPU-Sim.




To aid in analyzing the microarchitectural properties of these programs, it may help to save the output of the GPGPU-Sim run into a file. Do NOT commit any output files to the git repository.




Since the focus of this lab is on memory performance, we will focus mainly on these memory statistics:




gpgpu_n_load_insn: Number of global/local load instructions executed.

gpgpu_n_store_insn: Number of global/local store instructions executed.

gpgpu_n_shmem_insn: Number of shared memory instructions executed.
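If the simulator output has been saved to a file, the statistics above can be pulled out programmatically. The sketch below assumes that each statistic appears on its own line in the form `name = value` and that a statistic may be printed more than once, in which case the last occurrence is kept. Both points are assumptions about the log format, and `lastStatValue` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>

// Return the last value reported for `name` in the given log text,
// or -1 if the statistic does not appear. Assumes "name = value" lines.
long long lastStatValue(const std::string& log, const std::string& name) {
    std::istringstream in(log);
    std::string line;
    long long value = -1;
    while (std::getline(in, line)) {
        // The statistic name must start the line (after optional whitespace).
        const std::size_t start = line.find_first_not_of(" \t");
        if (start == std::string::npos) continue;
        if (line.compare(start, name.size(), name) != 0) continue;
        // Only whitespace may sit between the name and '=', which avoids
        // matching a longer statistic that merely shares this prefix.
        const std::size_t afterName = start + name.size();
        const std::size_t eq = line.find('=', afterName);
        if (eq == std::string::npos) continue;
        if (line.find_first_not_of(" \t", afterName) != eq) continue;
        try {
            value = std::stoll(line.substr(eq + 1));
        } catch (const std::exception&) {
            // Not a number after '='; ignore this line.
        }
    }
    return value;
}
```

Reading the saved file into a string (for example with `std::ifstream` and a `std::stringstream`) and calling the helper once per statistic is enough to fill in a results table by hand.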






UC Riverside CS/EE 217







Report



Please include the output of your code and of GPGPU-Sim in the report.




Also, answer the following questions:




On Bender, compare the execution time of a 256 x 256 square matrix multiplication with that of a 1024 x 64 and 64 x 1024 rectangular matrix multiplication. All input matrices have about 65k (65,536) entries. What do you observe? Which is faster? Can you explain the observed behavior? Tip: You may want to comment out the “verify()” function in “main.cu” when timing this question.



Conceptual Question: For a 64 x 64 square tiled matrix multiplication, how many times is each element of the input matrices loaded from global memory? Assume 16 x 16 tiles.



Conceptual Question: For a 64 x 64 square non-tiled matrix multiplication, how many times is each element of the input matrices loaded from global memory?
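One way to reason about the two conceptual questions is to count the loads directly on the host. The sketch below tallies how often each element of A is read from global memory, assuming one thread per output element, tiles that divide the matrix size evenly, and no hardware caching. The function names are hypothetical. By symmetry the same counts apply to B.

```cpp
#include <cstddef>
#include <vector>

// Tiled case: block (by, bx) reads tile (by, p) of A once in each phase p,
// so every element of that tile is loaded once per block that needs it.
std::vector<int> loadCountsTiled(int n, int tile) {
    std::vector<int> loads(static_cast<std::size_t>(n) * n, 0);
    const int tiles = n / tile;  // assumes tile divides n
    for (int by = 0; by < tiles; ++by)
        for (int bx = 0; bx < tiles; ++bx)
            for (int p = 0; p < tiles; ++p)
                for (int ty = 0; ty < tile; ++ty)
                    for (int tx = 0; tx < tile; ++tx)
                        ++loads[(by * tile + ty) * n + (p * tile + tx)];
    return loads;
}

// Non-tiled case: the thread for C[row][col] reads A[row][i] for every i.
std::vector<int> loadCountsNaive(int n) {
    std::vector<int> loads(static_cast<std::size_t>(n) * n, 0);
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col)
            for (int i = 0; i < n; ++i)
                ++loads[row * n + i];
    return loads;
}
```

Under these assumptions, `loadCountsTiled(64, 16)` gives 4 for every element (64 / 16 blocks share each tile row) and `loadCountsNaive(64)` gives 64, so tiling reduces global loads per element by a factor equal to the tile width.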



GPGPU-Sim related question: In this part, we will compare the execution of a 128x128 square tiled matrix multiplication across different tile sizes. Run ./sgemm-tiled 128 in GPGPU-Sim with TILE_SIZE of 8, 16 (default), and 32. Fill the following table:



Tile size          | 8 | 16 | 32 | Note
-------------------|---|----|----|--------------------------------
gpu_tot_sim_cycle  |   |    |    | Total cycles
gpu_tot_ipc        |   |    |    | Instructions per cycle
gpgpu_n_load_insn  |   |    |    | Total loads to global memory
gpgpu_n_store_insn |   |    |    | Total stores to global memory
gpgpu_n_shmem_insn |   |    |    | Total accesses to shared memory






Which tile size resulted in the fewest accesses to global memory? Which tile size resulted in the most accesses to global memory? What is the reasoning behind this observation?



Which tile size performed the fastest, and which performed the slowest? Why do you think that is?
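When thinking about the tile-size questions above, it can help to tabulate what each tile size implies per block. The sketch below assumes one thread per output element, two float tiles in shared memory per block, and one global load per thread per input matrix in each phase. The struct and function names are hypothetical, and the figures are properties of the algorithm, not predictions of the simulator's counters.

```cpp
// Resource and traffic figures implied by a tile size for an n x n multiply.
struct TileCost {
    int threadsPerBlock;       // tile x tile threads
    int sharedBytesPerBlock;   // two float tiles (one for A, one for B)
    int blocksInGrid;          // blocks needed to cover the n x n output
    int globalLoadsPerThread;  // one A and one B element per phase
};

TileCost tileCost(int n, int tile) {
    const int tilesPerSide = (n + tile - 1) / tile;  // ceiling division
    TileCost c;
    c.threadsPerBlock = tile * tile;
    c.sharedBytesPerBlock = 2 * tile * tile * static_cast<int>(sizeof(float));
    c.blocksInGrid = tilesPerSide * tilesPerSide;
    c.globalLoadsPerThread = 2 * tilesPerSide;
    return c;
}
```

For n = 128 and 4-byte floats, a tile size of 8 gives 64 threads and 512 bytes of shared memory per block with 32 global loads per thread; 16 gives 256 threads, 2,048 bytes, and 16 loads; 32 gives 1,024 threads, 8,192 bytes, and 8 loads. Larger tiles cut global traffic but leave fewer, heavier blocks to schedule, which is one possible source of the trade-off the questions ask about.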






Submission



Push your final code to GitHub and check that it works correctly.




Answer the questions above in a PDF file, and include your GitHub username in your report. In addition:




Do not forget to upload your report on iLearn.




Do not forget to push your report to GitHub.















