Lab 4: CUDA 101 Solution

CUDA "Hello World Vector"

Goal

The purpose of this assignment:

To make sure that you can properly access school resources for GPU and CUDA.

To get you started with the most basic CUDA: Learn how to create a basic CUDA program, to compile it and run it.

We will be using C/C++.

Description

In this assignment we will implement a very basic linear algebra operation: vector addition (sum two vectors element-to-element).

Visit http://mathworld.wolfram.com/VectorAddition.html for the definition of vector addition.
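As a concrete illustration of the definition, here is a minimal host-only sketch that computes c[i] = a[i] + b[i] on the CPU. The function name addOnHost and the sample values are made up for this example; a loop like this can also serve later as a reference for checking GPU results.

```cuda
#include <cstdio>

// CPU reference for vector addition: c[i] = a[i] + b[i] for every index i.
// addOnHost is a hypothetical helper name, not part of CUDA.
void addOnHost(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 4;
    int a[n] = {1, 2, 3, 4};
    int b[n] = {10, 20, 30, 40};
    int c[n];

    addOnHost(a, b, c, n);

    // Print each element-wise sum.
    for (int i = 0; i < n; ++i) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }
    return 0;
}
```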

If you are able to use your own equipment, download and install CUDA (see the Resources section).

Getting Started

READ: Follow these really nice CUDA Basics slides. You only need to read up to the "Blocks" or "Threads" slides.

Visit Nvidia's CUDA Toolkit documentation for more detail.

Remember, the purpose of this assignment is just to familiarize you with creating a CUDA program whose code runs on the GPU (device). You don't need to do anything fancy yet; just make sure you can structure and compile your code.

Equipment

You can use your own equipment if you have access to an NVIDIA GPU.

Use the Linux lab machines UW1-320-**p. Make sure that you are using a machine with a name ending in p, and note that those machines are not accessible remotely. NOTE: Try this right away; if for some reason CUDA is not available, let us know ASAP.

For remote access, you can use cssgpu01.uwb.edu. Everybody in the class will get remote access to this machine.

○ Your home directory is not mounted on these machines as they are research systems. You will have to move your files manually to these machines.

https://uwnetid-my.sharepoint.com/:o:/r/personal/efuente_uw_edu/_layouts/15/WopiFrame.aspx?sourcedoc=%7B7b59849d-651c-4ad4-9bb5-4de9005f18ca%7D&acti
OneNote Online

Deliverables

Implementation

Create two vectors of random integers and implement basic vector addition using CUDA. Store the result in a new vector.

Follow the example in the CUDA Basics slides mentioned above. You can use either the Blocks or the Threads implementation.
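The overall structure might look like the following sketch of the Blocks variant (one block per element, one thread per block). The names addBlocks and fillRandom, the vector size, and the 0 to 99 value range are assumptions chosen for illustration, and error handling is kept to a single check after the launch. Compile with something like nvcc vecadd.cu -o vecadd (the file name is hypothetical).

```cuda
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>

// "Blocks" variant: each block handles the element at its own block index.
__global__ void addBlocks(const int *a, const int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

// Fill a host vector with random integers in [0, 100) (range is arbitrary).
void fillRandom(int *v, int n) {
    for (int i = 0; i < n; ++i) v[i] = rand() % 100;
}

int main() {
    const int n = 1 << 12;                 // 2^12 elements
    const size_t bytes = n * sizeof(int);

    // Host vectors: two inputs and one result.
    int *a = (int *)malloc(bytes);
    int *b = (int *)malloc(bytes);
    int *c = (int *)malloc(bytes);
    srand((unsigned)time(NULL));
    fillRandom(a, n);
    fillRandom(b, n);

    // Device copies of the three vectors.
    int *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    // Launch n blocks of 1 thread each.
    addBlocks<<<n, 1>>>(dA, dB, dC);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // This copy waits for the kernel to finish before transferring the result.
    cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);

    // Check the GPU result against a CPU computation.
    int errors = 0;
    for (int i = 0; i < n; ++i) {
        if (c[i] != a[i] + b[i]) ++errors;
    }
    printf("%d elements, %d mismatches\n", n, errors);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(a); free(b); free(c);
    return errors == 0 ? 0 : 1;
}
```

The Threads variant would instead launch a single block and index with threadIdx.x, which only works while the vector length fits in one block.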

Time how long it takes for the computer to perform the operation. You can use time.h components like clock_t and clock().

Hint: you can call cudaThreadSynchronize() after your vector addition, e.g., use the following structure:

clock()

Your vector addition

cudaThreadSynchronize();

clock()

Otherwise you may not get an accurate time measurement, because a kernel launch returns to the host before the GPU has finished. Also, you don't have to use clock_t; there are other ways of timing.

Note: cudaThreadSynchronize() is deprecated and cudaDeviceSynchronize() can be used instead. For this assignment you can use either as they are very similar and the former one will still work. We will discuss more about synchronization when we start on CUDA topics.
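A minimal sketch of this timing structure, using cudaDeviceSynchronize() and clock(), is shown here. The helper elapsedMs and the kernel name add are illustrative names, and the device buffers are simply zeroed because their contents do not affect the measurement. Keep in mind that clock() reports processor time used by the program, so treat the numbers as rough; CUDA events (cudaEventRecord, cudaEventElapsedTime) are one of the other timing options.

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// One block per element, as in the "Blocks" approach.
__global__ void add(const int *a, const int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

// Convert a clock() interval to milliseconds (hypothetical helper).
double elapsedMs(clock_t start, clock_t stop) {
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC;
}

int main() {
    const int n = 1 << 15;                 // 2^15 elements
    const size_t bytes = n * sizeof(int);

    int *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemset(dA, 0, bytes);              // values are irrelevant for timing
    cudaMemset(dB, 0, bytes);

    clock_t start = clock();
    add<<<n, 1>>>(dA, dB, dC);             // the launch itself returns immediately
    cudaDeviceSynchronize();               // wait until the kernel has finished
    clock_t stop = clock();

    printf("n = %d: %.3f ms\n", n, elapsedMs(start, stop));

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Without the synchronize call, the second clock() would typically be reached while the kernel is still running, so the interval would cover little more than the launch.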

Try your program with vectors of a few different sizes, say, 2^7, 2^9, 2^12 and 2^15 elements. Feel free to try other sizes. Hint: consider changing the number of blocks or threads (from 512 to something else; powers of two are better) for different vector sizes. We'll talk about blocks in detail in future lectures; for now, you can just try a couple of your favourite sizes.
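One possible way to experiment with sizes is sketched here: it combines blocks and threads, derives the block count from the vector length by rounding up, and guards against threads that fall past the end of the vector. The names addIndexed and blocksFor, the thread counts 128 and 512, and the test pattern (a[i] = i, b[i] = 2i, used instead of random values so the check is easy) are all assumptions for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Combined blocks and threads: each thread handles one global index.
__global__ void addIndexed(const int *a, const int *b, int *c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) c[i] = a[i] + b[i];         // skip threads beyond the last element
}

// Blocks needed to cover n elements, rounding up (hypothetical helper).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int sizes[] = {1 << 7, 1 << 9, 1 << 12, 1 << 15};
    const int threadChoices[] = {128, 512};

    for (int s = 0; s < 4; ++s) {
        const int n = sizes[s];
        const size_t bytes = n * sizeof(int);

        int *a = (int *)malloc(bytes);
        int *b = (int *)malloc(bytes);
        int *c = (int *)malloc(bytes);
        for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }

        int *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);
        cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

        for (int t = 0; t < 2; ++t) {
            const int threads = threadChoices[t];
            const int blocks = blocksFor(n, threads);

            cudaMemset(dC, 0, bytes);      // clear results from the previous run
            addIndexed<<<blocks, threads>>>(dA, dB, dC, n);
            cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);

            int errors = 0;
            for (int i = 0; i < n; ++i) {
                if (c[i] != 3 * i) ++errors;
            }
            printf("n = %5d, %3d blocks x %3d threads: %d mismatches\n",
                   n, blocks, threads, errors);
        }

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(a); free(b); free(c);
    }
    return 0;
}
```

The bounds check matters whenever the vector length is not a multiple of the thread count, for example 2^7 elements with 512 threads per block.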

Submit your source code files in a ZIP file. Do not include any executable but make sure that your code compiles.

In your source files, use comments to include information on how to compile your code as well as the type of machine / OS you used.

Output

Submit a screenshot of your output, pasted into a DOCX or PDF document.