1. In HW05, you implemented a reduction sequentially (no parallel execution). In this task, you will compare the performance of Thrust and CUB against that sequential implementation by performing a scaling analysis for a reduction problem.
a) Implement the Thrust version of the reduction in a file called task1_thrust.cu. It is expected to do the following (some details about copying between host and device are not spelled out here but should be implemented in your code when necessary; a sketch of one possible implementation follows the expected output below):
- Create and fill (with int-type numbers) however you like a thrust::host_vector of length n, where n is the first command line argument as below.
- Use the built-in function in Thrust to copy the thrust::host_vector into a thrust::device_vector.
- Call the thrust::reduce function to perform the reduction. Print the result of the reduction.
- Print the time taken to run the thrust::reduce function in milliseconds using CUDA events.
Compile: nvcc task1_thrust.cu -Xcompiler -O3 -Xcompiler -Wall -Xptxas -O3 -o task1_thrust
Run (where n is a positive integer): ./task1_thrust n
Example expected output:
3141
0.012
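For reference, here is a minimal sketch of what task1_thrust.cu might look like. The all-ones fill and the printf formatting are placeholder choices, not requirements:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main(int argc, char** argv) {
    const std::size_t n = std::strtoull(argv[1], nullptr, 10);

    // Fill the host vector however you like; all-ones is just a placeholder.
    thrust::host_vector<int> h(n, 1);

    // Assignment from a host_vector performs the host-to-device copy.
    thrust::device_vector<int> d = h;

    // Time only the reduce call with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    const int result = thrust::reduce(d.begin(), d.end());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    std::printf("%d\n%f\n", result, ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}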
b) Implement the CUB version of the reduction in a file called task1_cub.cu, based on the code example sent by Dan in an email (deviceReduce.cu). A sketch of one possible implementation follows the expected output below.
- Stick with the same device memory allocation pattern as the code example (DeviceAllocate() and cudaMemcpy()). Do not use unified memory.
- Modify the example program so that the host array h_in has length n, where n is the first command line argument as below, then fill h_in with int-type numbers however you like.
- Call the DeviceReduce::Sum function, which writes the reduction result to the output array.
- Print the reduction result.
- Print the time taken to run the DeviceReduce::Sum function (the actual reduction call, not the one used to find the size of the temporary storage needed) in milliseconds using CUDA events.
Compile: nvcc task1_cub.cu -Xcompiler -O3 -Xcompiler -Wall -Xptxas -O3 -o task1_cub
Run (where n is a positive integer): ./task1_cub n
Example expected output:
3141
0.012
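Dan's deviceReduce.cu is not reproduced here, so the sketch below assumes it follows the usual CUB example pattern built on cub::CachingDeviceAllocator (the source of the DeviceAllocate() call mentioned above); adjust it to match the actual example:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cub/util_allocator.cuh>
#include <cub/device/device_reduce.cuh>

cub::CachingDeviceAllocator g_allocator(true);  // allocation pattern assumed from CUB examples

int main(int argc, char** argv) {
    const std::size_t n = std::strtoull(argv[1], nullptr, 10);

    // Host input; the fill pattern is a placeholder.
    int* h_in = new int[n];
    for (std::size_t i = 0; i < n; ++i) h_in[i] = 1;

    int* d_in = nullptr;
    int* d_out = nullptr;
    g_allocator.DeviceAllocate((void**)&d_in, sizeof(int) * n);
    g_allocator.DeviceAllocate((void**)&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(int) * n, cudaMemcpyHostToDevice);

    // First call only queries the temporary storage size.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    g_allocator.DeviceAllocate(&d_temp, temp_bytes);

    // Time only the second (actual) reduction call.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("%d\n%f\n", result, ms);

    g_allocator.DeviceFree(d_in);
    g_allocator.DeviceFree(d_out);
    g_allocator.DeviceFree(d_temp);
    delete[] h_in;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}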
c) On an Euler compute node:
- Run task1_thrust for n = 2^10, 2^11, ..., 2^30 and generate a plot of time vs. n on a log2-log2 scale.
- Run task1_cub for n = 2^10, 2^11, ..., 2^30 and generate a plot of time vs. n on a log2-log2 scale.
- Overlay the above two curves on top of the plot you generated for HW05 task1 in task1.pdf.
2. In HW06, you implemented the scan function using the Hillis-Steele algorithm. In this task, you will implement an exclusive scan using the Thrust library and the CUB library and compare their efficiency with the previous implementation through a scaling analysis.
a) Implement the Thrust version of the exclusive scan in a file called task2_thrust.cu. It is expected to do the following (some details about copying between host and device are not spelled out here but should be implemented in your code when necessary; a sketch of one possible implementation follows the expected output below):
- Create and fill (with float-type numbers) however you like a thrust::host_vector of length n, where n is the first command line argument as below.
- Use the built-in function in Thrust to copy the thrust::host_vector into a thrust::device_vector.
- Call the thrust::exclusive_scan function to fill another thrust::device_vector with the result of the exclusive scan.
- Print the last element of the output array.
- Print the time taken to run the thrust::exclusive_scan function in milliseconds using CUDA events.
Compile: nvcc task2_thrust.cu -Xcompiler -O3 -Xcompiler -Wall -Xptxas -O3 -o task2_thrust
Run (where n is a positive integer): ./task2_thrust n
Example expected output:
1560.3
0.012
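A minimal sketch of task2_thrust.cu, analogous to the task1 Thrust sketch; the fill pattern and output formatting are again placeholders:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main(int argc, char** argv) {
    const std::size_t n = std::strtoull(argv[1], nullptr, 10);

    // Fill pattern is a placeholder.
    thrust::host_vector<float> h(n, 1.0f);
    thrust::device_vector<float> d_in = h;  // host-to-device copy
    thrust::device_vector<float> d_out(n);  // scan result goes here

    // Time only the scan call with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    thrust::exclusive_scan(d_in.begin(), d_in.end(), d_out.begin());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // back() copies the last element of the device vector to the host.
    std::printf("%f\n%f\n", (float)d_out.back(), ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}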
b) Implement the CUB version of the exclusive scan in a file called task2_cub.cu, based on the code example sent by Dan in an email. A sketch of one possible implementation follows the expected output below.
- Stick with the same device memory allocation pattern as the code example (DeviceAllocate() and cudaMemcpy()). Do not use unified memory.
- Modify the example program so that the host array h_in has length n, where n is the first command line argument as below, then fill h_in with float-type numbers however you like.
- Call the DeviceScan::ExclusiveSum function, which fills the output array with the result of the exclusive scan.
- Print the last element of the output array.
- Print the time taken to run the DeviceScan::ExclusiveSum function (the actual scan call, not the one used to find the size of the temporary storage needed) in milliseconds using CUDA events.
Compile: nvcc task2_cub.cu -Xcompiler -O3 -Xcompiler -Wall -Xptxas -O3 -o task2_cub
Run (where n is a positive integer): ./task2_cub n
Example expected output:
1560.3
0.012
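A sketch of task2_cub.cu under the same CachingDeviceAllocator assumption as the task1 CUB sketch; note the output array now has length n and the last element is read back with a cudaMemcpy:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cub/util_allocator.cuh>
#include <cub/device/device_scan.cuh>

cub::CachingDeviceAllocator g_allocator(true);  // allocation pattern assumed from CUB examples

int main(int argc, char** argv) {
    const std::size_t n = std::strtoull(argv[1], nullptr, 10);

    // Host input; the fill pattern is a placeholder.
    float* h_in = new float[n];
    for (std::size_t i = 0; i < n; ++i) h_in[i] = 1.0f;

    float* d_in = nullptr;
    float* d_out = nullptr;
    g_allocator.DeviceAllocate((void**)&d_in, sizeof(float) * n);
    g_allocator.DeviceAllocate((void**)&d_out, sizeof(float) * n);
    cudaMemcpy(d_in, h_in, sizeof(float) * n, cudaMemcpyHostToDevice);

    // First call only queries the temporary storage size.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, n);
    g_allocator.DeviceAllocate(&d_temp, temp_bytes);

    // Time only the second (actual) scan call.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // The last element of the scan lives at d_out + n - 1.
    float last = 0.0f;
    cudaMemcpy(&last, d_out + n - 1, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("%f\n%f\n", last, ms);

    g_allocator.DeviceFree(d_in);
    g_allocator.DeviceFree(d_out);
    g_allocator.DeviceFree(d_temp);
    delete[] h_in;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}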
c) On an Euler compute node:
- Run task2_thrust for n = 2^10, 2^11, ..., 2^29 and generate a plot of time vs. n on a log2-log2 scale.
- Run task2_cub for n = 2^10, 2^11, ..., 2^29 and generate a plot of time vs. n on a log2-log2 scale.
- Overlay the above two curves on top of the plot you generated for HW06 task2 in task2.pdf.
3. (a) Implement in a file called count.cu the function count as declared and described in count.cuh. Your count function should take a thrust::device_vector (for instance, named d_in, filled with integers), fill the output values array with the unique integers that appear in d_in in ascending order, and fill the output counts array with the corresponding number of occurrences of each integer. A brief example is shown below:
Example input: d_in = [3, 5, 1, 2, 3, 1]
Expected output: values = [1, 2, 3, 5]
Expected output: counts = [2, 1, 2, 1]
Hints:
- Since the lengths of values and counts may not equal the length of d_in, you may want to use thrust::inner_product to find the number of "jumps" (positions where a[i-1] != a[i]) as you step through the sorted array (the input array is not sorted, so you would first have to sort it using a Thrust built-in function). See Lecture 18 slide 56 for an example. There are other valid options as well, for instance, thrust::unique.
- thrust::reduce_by_key could be helpful; one possible approach along these lines is sketched after the hints.
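Since count.cuh is not reproduced here, the signature below is an assumption; match whatever the header actually declares. This sketch follows the sort + reduce_by_key route from the second hint:

#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

// Assumed signature; adjust to match count.cuh.
void count(const thrust::device_vector<int>& d_in,
           thrust::device_vector<int>& values,
           thrust::device_vector<int>& counts) {
    // Sort a copy so equal integers become adjacent runs.
    thrust::device_vector<int> sorted = d_in;
    thrust::sort(sorted.begin(), sorted.end());

    // Size the outputs for the worst case (all elements unique).
    values.resize(sorted.size());
    counts.resize(sorted.size());

    // reduce_by_key collapses each run of equal keys, summing a stream of 1s
    // to produce each unique value alongside its number of occurrences.
    auto ends = thrust::reduce_by_key(sorted.begin(), sorted.end(),
                                      thrust::make_constant_iterator(1),
                                      values.begin(), counts.begin());

    // Shrink the outputs to the actual number of unique values found.
    values.resize(ends.first - values.begin());
    counts.resize(ends.second - counts.begin());
}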
(b) Write a test program task3.cu which does the following (a sketch of one possible driver follows the expected output below):
- Create and fill (with int-type numbers) however you like a thrust::host_vector of length n, where n is the first command line argument as below.
- Use the built-in function in Thrust to copy the thrust::host_vector into a thrust::device_vector as the input of your count function.
- Use your count function to fill another two arrays with the result of the count.
- Print the last element of the values array.
- Print the last element of the counts array.
- Print the time taken to run the count function in milliseconds using CUDA events.
Compile: nvcc task3.cu count.cu -Xcompiler -O3 -Xcompiler -Wall -Xptxas -O3 -o task3
Run (where n is a positive integer): ./task3 n
Example expected output:
370
23
0.13
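A sketch of what task3.cu might look like, again assuming the hypothetical count signature from part (a); the random fill is a placeholder:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include "count.cuh"

int main(int argc, char** argv) {
    const std::size_t n = std::strtoull(argv[1], nullptr, 10);

    // Fill pattern is a placeholder; any int fill is acceptable.
    thrust::host_vector<int> h(n);
    for (std::size_t i = 0; i < n; ++i) h[i] = std::rand() % 100;

    thrust::device_vector<int> d_in = h;  // host-to-device copy
    thrust::device_vector<int> values, counts;

    // Time only the count call with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    count(d_in, values, counts);  // assumed signature, see part (a)
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // back() copies the last element of each device vector to the host.
    std::printf("%d\n%d\n%f\n", (int)values.back(), (int)counts.back(), ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}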
(c) On an Euler compute node, run task3 for n = 2^5, 2^6, ..., 2^24 and generate a plot of time vs. n on a log2-log2 scale in a file called task3.pdf.