Submit responses to all tasks which don’t specify a file name to Canvas in a file called assignment4.{txt, docx, pdf, rtf, odt} (choose one of the formats). All plots should also be submitted on Canvas. All source files should be submitted in the HW04 subdirectory on the master branch of your homework git repo with no subdirectories.
All commands or code must work on Euler with only the cuda module loaded unless specified otherwise. Commands and/or code may behave differently on your computer, so be sure to test on Euler before you submit.
Please submit clean code. Consider using a formatter like clang-format.
• Before you begin, copy the provided files from HW04 of the ME759-2020 repo. Do not change any of the provided files because we will write clean copies over them when grading.
1. (a) Implement in a file called matmul.cu the matmul and matmul_kernel functions as declared and described in matmul.cuh.
(b) Write a program task1.cu which does the following:
• Creates matrices (as 1D row-major arrays) A and B of size n*n in managed (aka unified) memory.
• Fills those matrices however you like.
• Calls your matmul function.
• Prints the last element of the resulting matrix.
• Prints the time taken to perform the multiplication in milliseconds using CUDA events.
• Compile: nvcc task1.cu matmul.cu -Xcompiler -O3 -Xcompiler -Wall -Xptxas -O3 -o task1
• Run (where n and threads_per_block are positive integers): ./task1 n threads_per_block
• Example expected output: 11.36 1.23
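One way to sanity-check the last element printed by task1 is to compare it against a plain CPU computation on a small n. The sketch below is only an illustration: the name matmul_reference and the float element type are assumptions, not taken from matmul.cuh, and it is not the required GPU implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper (not part of matmul.cuh): computes C = A * B
// for n x n matrices stored as 1D row-major arrays, so element (row, col)
// lives at index row * n + col. The float element type is an assumption.
std::vector<float> matmul_reference(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t k = 0; k < n; ++k) {
            const float a = A[row * n + k];  // A(row, k)
            for (std::size_t col = 0; col < n; ++col) {
                // C(row, col) += A(row, k) * B(k, col)
                C[row * n + col] += a * B[k * n + col];
            }
        }
    }
    return C;
}
```

The last element of the result is C[n * n - 1]. Because floating-point sums depend on the order of accumulation, a GPU result may differ from this reference by a small rounding error for large n.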
(c) On an Euler compute node, run task1 for each value n = 2^5, 2^6, ..., 2^15 and generate a plot task1.pdf which plots the time taken by your algorithm as a function of n when threads_per_block = 1024. Overlay another plot which plots the same relationship with a different choice of threads_per_block.
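When varying threads_per_block, the number of blocks usually has to be rounded up so that every output element is covered. The sketch below assumes a 1D launch with one thread per output element, which matmul.cuh may or may not prescribe; blocks_needed and sweep_size are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest number of blocks such that
// blocks * threads_per_block >= total_threads (ceiling division).
std::size_t blocks_needed(std::size_t total_threads,
                          std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (total_threads + threads_per_block - 1) / threads_per_block;
}

// Hypothetical helper: problem size n = 2^exponent for the timing sweep.
std::size_t sweep_size(unsigned exponent) {
    assert(exponent < 8 * sizeof(std::size_t));
    return std::size_t{1} << exponent;
}
```

Under that assumption, n = 2^5 gives n*n = 1024 elements, which fits in a single block of 1024 threads. Any count that is not a multiple of threads_per_block needs one extra, partially used block, so the kernel would have to guard against out-of-range indices.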
2. (a) Implement in a file called stencil.cu the stencil and stencil_kernel functions as declared and described in stencil.cuh. These functions should produce the 1D convolution of image and mask:
$$\text{output}[i] = \sum_{j=-R}^{R} \text{image}[i + j] \cdot \text{mask}[j + R], \qquad i = 0, \dots, n - 1$$
Assume that image[i] = 0 when i < 0 or i > n - 1. Pay close attention to what data you are asked to store and compute in shared memory.
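The convolution and its zero-padding rule can be illustrated with a CPU reference, which may help when checking GPU output on small inputs. This is a sketch under assumptions: stencil_reference is a hypothetical name, the float element type is assumed rather than taken from stencil.cuh, and it does not use or model shared memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper (not part of stencil.cuh):
// output[i] = sum over j in [-R, R] of image[i + j] * mask[j + R],
// treating image[k] as 0 whenever k < 0 or k > n - 1.
std::vector<float> stencil_reference(const std::vector<float>& image,
                                     const std::vector<float>& mask,
                                     std::size_t R) {
    assert(mask.size() == 2 * R + 1);
    const long long n = static_cast<long long>(image.size());
    const long long r = static_cast<long long>(R);
    std::vector<float> output(image.size(), 0.0f);
    for (long long i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (long long j = -r; j <= r; ++j) {
            const long long idx = i + j;  // signed, so it can go below 0
            const float pixel = (idx < 0 || idx > n - 1)
                                    ? 0.0f
                                    : image[static_cast<std::size_t>(idx)];
            sum += pixel * mask[static_cast<std::size_t>(j + r)];
        }
        output[static_cast<std::size_t>(i)] = sum;
    }
    return output;
}
```

For image = {1, 2, 3, 4}, mask = {1, 2, 3} and R = 1, the first output is 0*1 + 1*2 + 2*3 = 8 and the last is 3*1 + 4*2 + 0*3 = 11, which shows the padding at both ends.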
(b) Write a program task2.cu which does the following:
• Creates arrays image (length n), output (length n), and mask (length 2 * R + 1) all in managed memory.
• Fills those arrays however you like.
• Calls your stencil function.
• Prints the last element of the resulting array.
• Prints the time taken to perform the convolution in milliseconds using CUDA events.
• Compile: nvcc task2.cu stencil.cu -Xcompiler -O3 -Xcompiler -Wall -Xptxas -O3 -o task2
• Run (where n, R, and threads_per_block are positive integers): ./task2 n R threads_per_block
• Example expected output: 11.36 1.23
(c) On an Euler compute node, run task2 for each value n = 2^10, 2^11, ..., 2^31 and generate a plot task2.pdf which plots the time taken by your algorithm as a function of n when threads_per_block = 1024 and R = 128. Overlay another plot which plots the same relationship with a different choice of threads_per_block.
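The upper end of this sweep is worth a quick size check, since n = 2^31 is one more than the largest 32-bit signed integer. The sketch below uses hypothetical helper names and assumes 4-byte elements when estimating memory; the actual element and index types are whatever stencil.cuh declares.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Largest n in the sweep, computed in 64-bit arithmetic to avoid overflow.
std::uint64_t largest_n() { return std::uint64_t{1} << 31; }

// True if n can be represented in a 32-bit signed integer.
bool fits_in_int32(std::uint64_t n) {
    return n <= static_cast<std::uint64_t>(
                    std::numeric_limits<std::int32_t>::max());
}

// Bytes needed for an array of n elements of element_size bytes each.
// The element size is supplied by the caller; 4 (float) is an assumption.
std::uint64_t array_bytes(std::uint64_t n, std::uint64_t element_size) {
    assert(element_size > 0);
    return n * element_size;
}
```

With 4-byte elements, image and output would each take 8 GiB at n = 2^31. Parsing n with a function that returns int, such as std::atoi, would not represent this value; a wider parse such as std::strtoull is one option.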