Program 4 Matrix Multiplication on GPU Solution

UNDER CONSTRUCTION




Keep an eye out for updates to this description






















For this assignment, you can opt to work in teams of at most 2 students.







Goals




Parallelize a problem using CUDA threads



Practice the CUDA software and execution model on a more complex problem



Use the Matrix-Matrix Multiply operation as a means to understand memory layout



Measure performance and understand factors affecting it






Software and Hardware




Before we start: we will use CUDA for this assignment.







Hardware




You can use your own equipment for this assignment if you have access to an NVIDIA GPU and CUDA. Just make sure you have all the required software installed.



You can use the machines in the Linux lab UW1-320 (machine pool UW1-320-00p through UW1-320-15p; these cannot be accessed remotely!) for development and benchmarking



You can also use the machine cssgpu01 for development. However, for benchmarking you should make sure that you are the only user on the machine, so it is recommended that you either use the lab machines or that we come up with a schedule to ensure everybody gets alone time on this server for benchmarking.






Note: CUDA is working on the lab machines. You can use either your own equipment or the machine pool in the lab for benchmarking. If you are going to use cssgpu01, make sure there are no other users at the time you are collecting time measurements. Don't forget that code may behave very differently on different hardware. Stick to one architecture for testing!














Software




The required implementations will use C or C++ and CUDA.




CUDA



C/C++ compiler (may need it depending on your implementation)



Visual Studio (if you are using Windows, e.g., on your own machine)



Like in Program 1, you will need to generate random floating-point numbers to fill your vectors and matrices. Don't use zeros or ones in your data, as that may give misleading timing behavior.
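One way to fill a matrix with non-trivial random data is the standard `<random>` facilities. This is only a sketch; the helper name `random_matrix` and the value range are our own choices, and the matrix is stored flattened in row-major order:

```cpp
#include <random>
#include <vector>

// Fill a flattened rows x cols matrix (row-major) with uniform random floats.
// The range (0.5, 2.0) avoids all-zero or all-one data, which could make
// timing results misleading.
std::vector<float> random_matrix(int rows, int cols, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.5f, 2.0f);
    std::vector<float> m(static_cast<size_t>(rows) * cols);
    for (auto &x : m) x = dist(gen);
    return m;
}
```

Seeding explicitly (rather than with the clock) makes runs reproducible, which helps when comparing timings across versions.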







Description




The main task is to implement several parallel versions of matrix multiplication using CUDA. The first version will be a "naïve" implementation. For the other versions, you will vary parameters that can affect performance; there are many factors that impact the performance of this operation.







Part 1: Understand the underlying hardware architecture




Investigate and understand the technical specifications for the GPU hardware you are going to be using. This will be part of your report. NVIDIA has plenty of technical documents; however, for starters, you can find other "simpler" specification summaries around the web (e.g., the TechPowerUp GPU DB).
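Besides the published spec sheets, you can cross-check several of the numbers your report needs directly from the CUDA runtime. A minimal sketch (assuming the CUDA toolkit is installed; note that the CUDA-core count is not reported directly, only the SM count):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query device 0 and print some of the specs the report asks for.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Name: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Global memory: %zu MB\n", prop.totalGlobalMem >> 20);
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock >> 10);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```

The compute capability tells you how many CUDA cores each SM has (look the ratio up in NVIDIA's documentation for your architecture).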






Part 2: Implement Naïve parallel version of MM multiplication using CUDA




You can do this implementation using a basic matrix multiply algorithm or refer to the generic algorithm discussed for the CPU. You can also use the version provided in the CUDA samples as a reference.



Time the execution of your program using different sizes of matrices



You can use square matrices only; rectangular matrices are optional



Use 1D execution configurations so that each thread computes a whole row of the result (at this point, you don't need to focus on fitting the underlying hardware; the idea is that this is naïve.)



Four different sizes of matrices are enough, but try to experiment with more to make it easier to see how different parameters affect performance



Try a couple of block sizes
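A sketch of what the naïve 1D kernel could look like, assuming one thread per output row and square row-major matrices (the kernel name and launch parameters are placeholders, not required names):

```cuda
// Naive 1D version: each thread computes one whole row of C = A * B.
// All matrices are N x N, row-major, in global memory.
__global__ void matmul_naive_1d(const float *A, const float *B,
                                float *C, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;
    for (int col = 0; col < N; ++col) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Launched with a 1D grid; try a couple of block sizes as asked, e.g.:
//   int threads = 128;  // also try 64, 256, ...
//   int blocks  = (N + threads - 1) / threads;
//   matmul_naive_1d<<<blocks, threads>>>(dA, dB, dC, N);
```

Note that with one thread per row, only N threads run in total, which is part of why this version is naïve.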






Part 3: Implement optimized parallel versions of MM multiplication using CUDA, varying parameters




Implement a naïve version of the algorithm using a 2D grid or 2D blocks; choose and vary parameters



Vary the size of your computational grid: change number of CUDA threads and blocks



Try different sizes of matrices (same guidance as in Part 2)



Try to optimize your naïve 2D implementation:


Tiling using 2D grid and 2D blocks: tuning for locality, e.g., block size to fit in the cache using the tiled algorithm for matrix multiplication
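The two kernels described above could be sketched as follows. This is one possible shape, not the required one: the naïve 2D version maps one thread to one element of C, and the tiled version stages sub-blocks of A and B in shared memory (here assuming N is a multiple of the tile size, which a real implementation should handle or document):

```cuda
#define TILE 16  // tune so two TILE x TILE tiles fit in shared memory

// Naive 2D version: one thread per element of C.
__global__ void matmul_naive_2d(const float *A, const float *B,
                                float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Tiled version: each block cooperatively loads TILE x TILE tiles of A and B
// into shared memory, so each global-memory element is read N/TILE times
// instead of N times.
__global__ void matmul_tiled(const float *A, const float *B,
                             float *C, int N) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {   // assumes N % TILE == 0
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                   // tiles fully loaded
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                   // done with these tiles
    }
    C[row * N + col] = sum;
}

// Launch both with a 2D configuration, e.g.:
//   dim3 block(TILE, TILE);
//   dim3 grid(N / TILE, N / TILE);
//   matmul_tiled<<<grid, block>>>(dA, dB, dC, N);
```

Varying TILE (and, for the naïve kernel, the block dimensions) gives you the parameter sweep the assignment asks for.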






Part 4: Compare performance of your different implementations




Time the execution of each version of your program using different sizes of matrices



Like in Program 2, collect two timings:



Including data transfers



Time for computations only



Report and plot your results in FLOPS (not time units)
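One common way to collect both timings is CUDA events. The fragment below is a sketch, assuming `dA`, `dB`, `dC` are already allocated on the device and `grid`/`block` are defined; for the "with transfers" number, move the first `cudaEventRecord` before the host-to-device copies:

```cuda
// Computation-only timing with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matmul_tiled<<<grid, block>>>(dA, dB, dC, N);  // any of your kernels
cudaEventRecord(stop);
cudaEventSynchronize(stop);                    // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds

// An N x N matrix multiply performs about 2*N^3 floating-point operations
// (N multiplies and N-1 adds per output element), so:
double gflops = 2.0 * N * N * N / (ms * 1.0e6);
printf("N=%d  time=%.3f ms  rate=%.2f GFLOPS\n", N, ms, gflops);
```

Reporting the rate (FLOPS) rather than raw time makes the curves for different matrix sizes directly comparable, as the assignment requires.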






Deliverables




Submit code and report separately:




Submit source code in a zip file.



Submit report separately as a pdf or docx file.









Report (50%)




Report and analyze your results; report and discuss some plots of your performance data using various matrix sizes.







Verify that you are computing the correct results in each case; you may want to read this reference.
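A simple verification approach is to compare the GPU output against a CPU reference, element-wise, with a tolerance: floating-point sums on the GPU may be computed in a different order, so exact equality is the wrong test. A sketch (helper names are our own):

```cpp
#include <cmath>
#include <vector>

// Plain CPU reference to check the GPU output against.
std::vector<float> matmul_cpu(const std::vector<float> &A,
                              const std::vector<float> &B, int N) {
    std::vector<float> C(static_cast<size_t>(N) * N, 0.0f);
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)      // i-k-j order for better cache use
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
    return C;
}

// Element-wise comparison within a tolerance.
bool all_close(const std::vector<float> &X, const std::vector<float> &Y,
               float tol = 1e-3f) {
    if (X.size() != Y.size()) return false;
    for (size_t i = 0; i < X.size(); ++i)
        if (std::fabs(X[i] - Y[i]) > tol) return false;
    return true;
}
```

Checking a couple of the small sizes this way is usually enough; the CPU reference becomes very slow for the largest matrices.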



Report what hardware you used and how you obtained the performance results. Describe the GPU specifications in detail (how much memory, how many CUDA cores, how many SMs, etc.)






Analyze your timing results by plotting execution rate (operations per second) as the size of the problem changes.






Include a few screenshots of the different runs from your implementations (make these user-friendly, not just a couple of numbers showing on the screen).






If you are using *any* external reference, you must cite it in your report in a "References" section.






Submit your report separately from the code, i.e., if you created a zip file containing your source code, don't include the report in it; submit it separately as a second file in Canvas. Your report should be submitted as an MS Word or PDF document.






Code (50%)








(Please note the requirements for code submission)







Put the different implementations in a single file. Name your file:



Program4.cu




If you have any helper files archive all your source code files into a single zip file. Name your file Program4.zip




Verify that the zip file you submitted is correct and contains all the files necessary for your code to compile and run. (You can download what you submitted, unzip it, and make sure it is correct.) If the graders cannot unzip or figure out what you submitted, you will receive no points.



If you had already started with multiple files (as in the original instructions), don't spend time restructuring your code; submit your files with instructions for compiling and running.






All the different implementations and kernels should be commented, and there should be a clear separation both in the code (e.g., using comments) and in the output, indicating which version and kernel is currently running. For instance, use separators like "-------------------" or "**************".







Your program must not take any input arguments or request any input from the user at any point. You may do this for your own testing and experimentation, but when you submit:



Hardcode a couple of different execution configuration values for each kernel to run with a couple of matrix sizes.



The purpose of this is to help automation and the graders: minimize user interaction and make your code user-friendly.






Include clear instructions on how to compile your files, both in the code (as comments) and in the report.






Comment your code properly: note at the top of each file or function which modification you are applying.






Remember, the goal of this assignment is not just to write software, but to look at the performance of each of your implementations and try to explain why you are getting the performance you see and how different strategies can affect it. It is the analysis of what your program is doing that matters the most.




