Running MPI jobs on the HPC cluster:
https://hpc.oit.uci.edu/running-jobs#_mpi_using_two_or_more_nodes

OpenMPI documentation:
http://www.open-mpi.org/doc/v1.8
Snailspeed Ltd. is actively trying to enter new markets. Your boss is starting a new project with the goal of developing Mandelbrot sets.
With the revolution in parallel computing, and since Snailspeed owns stock in the graphics company, your manager, Peter Sloanie, wants you to develop fractals that are parallelized and run super fast, so he can impress his competitors. Therefore, the code should run as fast as possible while making the most use of the hardware’s capabilities.
As added incentive to write good code, when you are done, the project is going to be passed to your neighbor, Bob Halfbit. Bob is the sort of coworker who leaves the coffee pot empty, occasionally eats someone else's lunch, and listens to Michael Bolton just loud enough that you can't not be distracted by it, but not loud enough for you to say something. If he doesn't grok your code, you're going to be paired with him for the next year and a half to work on it. For the love of Pete (your manager), make your code readable and well-documented so you can continue to work on interesting projects.
The lab has two parts. In the first part, you will run a "ping-pong" micro-benchmark and use it to come up with a cost model for communication on the class cluster. You will not need to do any programming in this part. In part 2, you will implement a distributed algorithm for Mandelbrot computation. This part is an offline part that you will do at home.
You may work in teams of two. To simplify grading of your assignments, each team should submit one copy of the assignment. Be sure to indicate your assignment partner by creating a README file as part of your submission.
Part 1 (in class): Ping-pong
The demo implements a ping-pong microbenchmark. The benchmark uses two nodes, which we will call Node 0 and Node 1. First, Node 0 sends a message to Node 1. Then, as soon as Node 1 receives the message, it immediately returns the message to Node 0. To get accurate timing data, the benchmark actually measures the time to complete 1000 ping-pong volleys, and then divides by 1000 to get the average time of a single volley.
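For reference, here is a minimal sketch of the kind of ping-pong loop the benchmark performs. This is an illustration only; the provided pingpong.c is the authoritative version and may differ in details such as the set of message sizes, warm-up rounds, and output format.

// pingpong_sketch.cc -- illustrative only; run with at least 2 MPI processes
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int NUM_VOLLEYS = 1000;  // average over many volleys for accurate timing
    const int MSG_BYTES = 1024;    // one example size; the real benchmark sweeps many sizes
    std::vector<char> buf(MSG_BYTES);

    double t_start = MPI_Wtime();
    for (int i = 0; i < NUM_VOLLEYS; ++i) {
        if (rank == 0) {
            // Node 0: send to Node 1, then wait for the echo
            MPI_Send(buf.data(), MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            // Node 1: receive and immediately return the message
            MPI_Recv(buf.data(), MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t_elapsed = MPI_Wtime() - t_start;

    if (rank == 0) {
        // each volley is a round trip, so halve the per-volley time for a one-way estimate
        std::printf("%d bytes: %g s one-way\n", MSG_BYTES, t_elapsed / NUM_VOLLEYS / 2.0);
    }
    MPI_Finalize();
    return 0;
}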
Now let's download, compile, and run a demo. We will use the results of this demo to analyze the performance of the network on the HPC cluster.
Getting the scaffolding code
Use an ssh client and your UCInetID login/password to log into the cluster. Since we want to display the ping-pong benchmark output and the final Mandelbrot fractal, we will enable X11 forwarding when logging into the cluster:

$ ssh -X <UCInetID>@hpc.oit.uci.edu
The baseline code for this assignment is available at this URL: https://github.com/EECS-UCI/hw2.git

To get a local copy of the repository for your work, you need to use git to clone it. So, let's make a copy on the HPC cluster and modify it there. To do that, run the following command on the cluster:

$ git clone https://github.com/EECS-UCI/hw2.git

This will create a new directory called hw2.
Compiling and running your code
To compile the benchmark, we will use mpicc, which is an MPI-aware wrapper script around the default GNU compiler, gcc. The following commands will turn pingpong.c into a binary executable named pingpong:

$ module load mpich-3.0.4/gcc-4.8.3
$ mpicc -o pingpong pingpong.c

As mpicc is just a wrapper around gcc, it accepts the usual compiler command-line flags for optimization, preprocessing, and linking.
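For example, an optimized build of the benchmark (the -O2 flag here is just an illustration, not something the assignment requires) would be:

$ mpicc -O2 -o pingpong pingpong.c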
Running the benchmark: Batch jobs
The HPC cluster is a shared computer. When you log in to hpc.oit.uci.edu, you are using the login node. You should limit your use of the login node to light tasks, such as file editing, compiling, and small test runs of, say, just a few seconds. When you are ready to do a timing or performance run, then you submit a job request to a grid engine (GE) scheduler. There are two steps:

1. Create a batch job script, which tells GE what machine resources you want and how to run your program.
2. Submit this job script to GE, using a command called qsub.
A batch job script is a shell script file containing two parts: (i) the commands needed to run your program; and (ii) metadata describing your job's resource needs, which appears as comments at the top of the script. We have provided an example for the pingpong benchmark; this script is pingpong.sh in the part1 subdirectory.

This example (a) asks the scheduler for two nodes, (b) includes the MPI command needed to launch the pingpong program, and (c) includes some additional post-processing commands (e.g., the call to gnuplot).
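For orientation, a rough sketch of what such a Grid Engine batch script looks like is given below. The provided pingpong.sh is the authoritative version; the queue name, parallel-environment name, slot count, and gnuplot script name here are placeholders, not the actual values used on the class cluster.

#!/bin/bash
# Sketch of a Grid Engine batch script (pingpong.sh in part1 is the real one).
#$ -N pingpong          # job name
#$ -q <class_queue>     # queue to submit to (placeholder)
#$ -pe <mpi_pe> 16      # parallel environment and slot count (placeholders)
#$ -cwd                 # run the job from the current working directory

module load mpich-3.0.4/gcc-4.8.3   # same MPI module used for compiling

mpirun -np 2 ./pingpong             # launch the benchmark on two processes
gnuplot <netplot_script>            # post-process results.dat into netplot.png (placeholder)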
To submit this job request, use the qsub command:

$ qsub pingpong.sh
If successful, this command will register your job request and return a job identification number (job ID). Your job is first entered into a central queue, and depending on the current cluster load, might not run right away. As such, you can check on the status of your job requests by running:

$ qstat -u <UCInetID>
When your job eventually runs, any output to standard output or standard error (e.g., as produced by print statements) will be automatically redirected into output files (.o### and .e###, labeled by the job ID ###).
Analyzing the results
If all went well above, you should see two new output files: results.dat and netplot.png.
The results.dat file contains the raw benchmark results; inspect this now. The first column of this tab-delimited file is the message length in bytes; the second column is the average measured one-way time (in seconds) required to send the message.
The netplot.png file is a plot of the above data, with message size on the x-axis and time on the y-axis. To view this file, run the following command. Note that the axes use a log10 scale.

$ display netplot.png
Assuming you were able to complete the steps above, here is what you need to turn in. (Each member of the team should do their own submission but may submit identical work. Just be sure to say with whom you worked.) Submit your results.dat and netplot.png files. In addition, answer the following questions:
1. Take a look at netplot.png. What is the minimum time to send a message of any size?
2. When the message size is large, estimate the slope of the line. What does the slope represent in this case?
3. Using your answers to the above questions, come up with a model (i.e., formula) that would allow someone to estimate T(m), the time to send a message of size m bytes.
4. Take a look at the function pingpong() in pingpong.c; for your convenience, the code for this function appears below. Go line-by-line and do your best to explain what the code is doing. Make a note of anything you don't understand as well.
Part 2 (after class): Mandelbrot set
The Mandelbrot set is a famous example of a fractal in mathematics. It is named after mathematician Benoit Mandelbrot. It is important for chaos theory.
The Mandelbrot set can be explained with the equation

z_{n+1} = z_n^2 + c

where c and z_n are complex numbers and n is zero or a positive integer (natural number). The iteration starts from z_0 = 0, and a point c belongs to the Mandelbrot set if the sequence of iterates remains bounded.

Since we are calculating an infinite series, we need to approximate the result by cutting off the calculation at some point. If a point is inside the Mandelbrot set, we can keep iterating infinitely. To avoid this, we set a limit on the maximum number of iterations we will ever do. In the code given to you, the maximum number of iterations is set to 511.
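Concretely, the per-point computation looks roughly like the following sketch. The provided serial code contains its own version of this routine (and its own mapping from pixel coordinates to the complex plane), so treat this only as an illustration of how the iteration cap is applied.

// Count iterations of z <- z^2 + c for c = cx + i*cy, starting from z_0 = 0.
// Illustrative sketch; the routine in the provided code may differ in details.
int mandelbrot_iterations(double cx, double cy) {
    const int MAX_ITER = 511;      // iteration cap used in the provided code
    double zx = 0.0, zy = 0.0;     // z_0 = 0
    int it = 0;
    // stop once |z| > 2 (the point has escaped) or the cap is reached
    while (zx * zx + zy * zy <= 4.0 && it < MAX_ITER) {
        double zx_new = zx * zx - zy * zy + cx;   // real part of z^2 + c
        zy = 2.0 * zx * zy + cy;                  // imaginary part of z^2 + c
        zx = zx_new;
        ++it;
    }
    return it;   // later mapped to a pixel color when rendering
}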
The Mandelbrot set above is embarrassingly parallel, since the computation of a pixel of the image does not depend on any other pixel. Hence, we can distribute the work across MPI processes in many different ways. In this assignment, you will think conceptually about load balancing this computation and implement a general strategy that works for a wide range of problems.
Compiling and running your code
We have provided a small program, broken up into modules (separate C++ files and headers), that computes the Mandelbrot set sequentially and renders the image. We have also provided a Makefile for compiling your program. To use it, just run make.
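For example, from the directory containing the provided Makefile:

$ make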
Run mandelbrot_serial on an input image size of 1000 x 1000 as follows:

$ ./mandelbrot_serial 1000 1000

To visualize the fractal, enter the following command:

$ display mandelbrot.png
If your network connection is slow, this may take a couple of minutes. Be patient.
Running batch jobs on the cluster
We have provided a sample job script, mandelbrot.sh, for running the serial Mandelbrot program you just compiled on a 1000 x 1000 image on a compute node of the cluster.
Go ahead and try this by entering the following commands:

$ qsub mandelbrot.sh
$ qstat -u <UCInetID>
Note that the scaffolding code given to you is serial and runs on one core. When you finish parallelizing your code, to run with multiple MPI processes, follow the instructions at http://hpc.oit.uci.edu/running-jobs#_mpi_using_two_or_more_nodes very carefully.
Load balancing strategies
Since your boss is so eager to see results, you hire two interns to help you with the parallelization. Intern Susie Cyclic implements the above computation with P MPI processes. Her strategy is to make process p compute all of the (valid) rows p + nP for n = 0, 1, 2, ... and use an MPI gather operation to collect all of the values at the root process to render the fractal.

Intern Joe Block implements the above computation with P MPI processes as well. His strategy is to make process p compute all of the (valid) rows pN, pN+1, pN+2, ..., pN + (N-1), where N = height/P, and then use MPI gather to collect all of the values at the root process for rendering the fractal.
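To make the two strategies concrete, the following sketch computes the set of rows each process p would own; the function names are illustrative and are not taken from the scaffolding code.

#include <vector>

// Susie's cyclic strategy: process p owns rows p, p+P, p+2P, ...
std::vector<int> susie_rows(int p, int P, int height) {
    std::vector<int> rows;
    for (int r = p; r < height; r += P) rows.push_back(r);
    return rows;
}

// Joe's block strategy: process p owns rows pN, pN+1, ..., pN+(N-1), with N = height/P
// (clamped to height in case P does not divide height evenly).
std::vector<int> joe_rows(int p, int P, int height) {
    std::vector<int> rows;
    int N = height / P;
    for (int r = p * N; r < p * N + N && r < height; ++r) rows.push_back(r);
    return rows;
}

In both cases, each process computes its rows locally, and the results are then collected at the root process (e.g., with MPI_Gather or MPI_Gatherv) before the image is rendered.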
Which do you think is better? Why? Which intern do you offer a full-time job?
HINT: The Mandelbrot function can require anywhere from 0 to 511 iterations per row.
Parallelization
Implement Susie's and Joe's approaches in the respective files called mandelbrot_susie.cc and mandelbrot_joe.cc. Analyze the speedup and efficiency of the two approaches against the provided serial version. Use at least up to the maximum of 64 processes on the class queue (and more on the public queues if you wish) and an image size of your choice. Submit the plots and a discussion of your results.
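For the analysis, the standard definitions apply: if T_1 is the running time of the provided serial version and T_P is the running time with P MPI processes, then

speedup:    S(P) = T_1 / T_P
efficiency: E(P) = S(P) / P = T_1 / (P * T_P)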
In general, you may not know beforehand how best to distribute the tasks. In the worst case, you could get a list of jobs that makes any specific distribution the worst possible. Another option, when the number of jobs is much greater than the number of processes, is to let each process request a job whenever it has finished the previous one it was given. This is the master/slave model. The master is responsible for giving each process a unit of work, receiving the result from any slave that completes its job, and sending slaves new units of work. A slave process is responsible for receiving a unit of work, completing the unit of work, sending the result back to the master, and repeating until there is no work left. Implement the Mandelbrot image computation using a master/slave MPI strategy in mandelbrot_ms.cc, where a job is defined as computing a row of the image. Communicate as little as possible. Compare the master/slave strategy with Susie's and Joe's implementations. Which do you think will scale to very large image sizes? Why?
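One possible shape for the master/slave loops is sketched below. The tags, the buffer layout (row index followed by the row's iteration counts), and the assumption of at least two processes are illustrative choices, not requirements of the assignment; fill in the row computation and image assembly with your own code.

// Master/slave skeleton (illustrative). A "job" is one row of the image.
#include <mpi.h>
#include <vector>

const int TAG_WORK = 1, TAG_RESULT = 2, TAG_STOP = 3;

void master(int P, int height, int width) {
    int next_row = 0, active = 0;
    // seed every slave with one row of work
    for (int s = 1; s < P && next_row < height; ++s, ++next_row, ++active)
        MPI_Send(&next_row, 1, MPI_INT, s, TAG_WORK, MPI_COMM_WORLD);

    std::vector<int> result(width + 1);   // [0] = row index, [1..width] = iteration counts
    while (active > 0) {
        MPI_Status st;
        MPI_Recv(result.data(), width + 1, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                 MPI_COMM_WORLD, &st);
        // ... copy result[1..width] into image row result[0] ...
        if (next_row < height) {
            MPI_Send(&next_row, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
            ++next_row;
        } else {
            int dummy = 0;   // zero-length stop message
            MPI_Send(&dummy, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
            --active;
        }
    }
}

void slave(int width) {
    std::vector<int> result(width + 1);
    while (true) {
        int row;
        MPI_Status st;
        MPI_Recv(&row, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP) break;
        result[0] = row;
        // ... fill result[1..width] with iteration counts for this row ...
        MPI_Send(result.data(), width + 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
    }
}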
Submission
When you've written up answers to all of the above questions, turn in your write-up and a zip/tarball of your code by uploading it to Canvas.
Good luck, and have fun!