Part I. Computational Complexity and Flop/s Performance
In this part, you will perform small hands-on exercises to get a feel for computational complexity and flop/s (floating-point operations per second) performance of a computer. Submit your answers to the following two questions.
I-1. Measuring Computational Complexity
Download the data file, MDtime.out, from the course home page for assignment 1. In the two-column file, the left column is the number of atoms, N, simulated by the md.c program, whereas the right column is the corresponding running time, T, of the program in seconds. Make a log-log plot of T vs. N. Perform a linear fit of log T vs. log N, i.e., log T = a log N + b, where a and b are fitting parameters. Note that the coefficient a signifies the power with which the runtime scales as a function of problem size N: T ∝ N^a. Submit the plot and your fitted value of a.
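If you prefer to do the fit programmatically rather than with a plotting tool, the following minimal C sketch (an illustration only, not a required part of the assignment) performs the least-squares fit of log T vs. log N, assuming MDtime.out contains the two whitespace-separated columns described above:

#include <math.h>
#include <stdio.h>

int main(void) {
  /* Least-squares fit of log T = a*log N + b from the two-column data file. */
  FILE *fp = fopen("MDtime.out", "r");
  if (fp == NULL) { perror("MDtime.out"); return 1; }
  double N, T, sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
  int n = 0;
  while (fscanf(fp, "%lf %lf", &N, &T) == 2) {
    double x = log(N), y = log(T);  /* natural log; the slope a is base-independent */
    sx += x; sy += y; sxx += x*x; sxy += x*y;
    n++;
  }
  fclose(fp);
  if (n < 2) { fprintf(stderr, "need at least 2 data points\n"); return 1; }
  double a = (n*sxy - sx*sy)/(n*sxx - sx*sx);  /* slope: scaling exponent */
  double b = (sy - a*sx)/n;                    /* intercept */
  printf("a = %g, b = %g (fit over %d points)\n", a, b, n);
  return 0;
}

Compile with the math library (e.g., cc fit.c -lm); the file name fit.c is just an example.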
I-2. Theoretical Flop/s Performance
Suppose you have a quad-core processor (a processor equipped with 4 processing cores) operating at a clock speed of 3.0 GHz (i.e., the clock ticks 3×10^9 times per second), in which each core can perform 1 multiplication and 1 addition per clock cycle (e.g., using a special fused multiply-add (FMA) circuit). Assume that each multiply or add operation is performed on vector registers, each holding 4 double-precision operands (Fig. 1).* What is the theoretical peak performance of your computer in Gflop/s (gigaflop/s, or 10^9 flop/s)?
* See FLOPS in Wikipedia [https://en.wikipedia.org/wiki/FLOPS].
Fig. 1: Vector multiplier (VMUL) loads data from two vector registers, R1 and R3, each holding 4 double-precision numbers, concurrently performs 4 multiplications, and stores the results on vector register R3.
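As a hint on how to organize the calculation (a general relation added here as a reminder, not text from the original handout):

peak flop/s = (number of cores) × (clock frequency) × (floating-point operations completed per core per clock cycle),

where the last factor accounts for both the FMA capability (one multiply and one add per cycle) and the vector width.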
Part II. Implementing Your Own Global Summation with Message Passing Interface
In this part, you will write your own global summation program (equivalent to MPI_Allreduce) using MPI_Send and MPI_Recv. Your program should run on P = 2^l processors (l = 0, 1, ...). Each process contributes a partial value, and at the end, all the processes will have the globally summed value of these partial contributions.
Your program will use a communication structure called butterfly, which is structured as a series of pairwise exchanges (see Fig. 2, where messages are denoted by arrows). This structure allows a global reduction among P processes to be performed in log2 P steps.
At each level l, a process exchanges messages with a partner whose rank differs only at the l-th bit position in the binary representation (Fig. 2).
Fig. 2: Butterfly network used in hypercube algorithms. For P = 8 processes, the network computes the global sum a000 + a001 + a010 + a011 + a100 + a101 + a110 + a111, effectively grouped as ((a000 + a001) + (a010 + a011)) + ((a100 + a101) + (a110 + a111)), where the subscripts are process ranks in binary.
HYPERCUBE TEMPLATE
We can use the following template to perform a global reduction using any associative operator OP, i.e., one satisfying (a OP b) OP c = a OP (b OP c) (such as multiplication or maximum).
procedure hypercube(myid, input, logP, output)
begin
  mydone := input;
  for l := 0 to logP-1 do
  begin
    partner := myid XOR 2^l;
    send mydone to partner;
    receive hisdone from partner;
    mydone := mydone OP hisdone
  end;
  output := mydone
end
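To make the data flow of the template concrete, here is a minimal serial sketch in C (an illustration only, not the MPI program you are asked to write) that simulates the hypercube template for P = 8 with OP = +, holding each simulated process's value in an array element:

#include <stdio.h>

#define LOGP 3
#define P (1 << LOGP)   /* P = 2^3 = 8 simulated processes */

int main(void) {
  double mydone[P], hisdone[P];
  for (int id = 0; id < P; id++)
    mydone[id] = (double) id;               /* each "process" contributes its rank */

  for (int l = 0; l < LOGP; l++) {          /* log2(P) butterfly levels */
    for (int id = 0; id < P; id++)
      hisdone[id ^ (1 << l)] = mydone[id];  /* "send" to the partner (bit l flipped) */
    for (int id = 0; id < P; id++)
      mydone[id] += hisdone[id];            /* "receive" and combine with OP = + */
  }

  for (int id = 0; id < P; id++)            /* every rank ends with 0+1+...+7 = 28 */
    printf("rank %d: global sum = %g\n", id, mydone[id]);
  return 0;
}

In the real MPI version, the "send" and "receive" steps become an MPI_Send/MPI_Recv exchange between a process and its partner.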
USE OF BITWISE LOGICAL XOR
Note that
0 XOR 0 = 1 XOR 1 = 0;
0 XOR 1 = 1 XOR 0 = 1,
so that a XOR 1 flips the bit a, i.e.,
a XOR 1 = ā,   a XOR 0 = a,
where ā is the complement of a (ā = 1 for a = 0, ā = 0 for a = 1). In particular, myid XOR 2^l reverses the l-th bit of the rank of this process, myid:
abcdefg XOR 0000100 = abcdēfg
Note that the XOR operator is ^ (caret symbol) in the C programming language.
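For example, the following small stand-alone C program (an illustration only) prints the partner of every rank at every level for P = 8; its output reproduces the exchange pattern of Fig. 2:

#include <stdio.h>

int main(void) {
  int P = 8;                                /* P = 2^3 processes */
  for (int l = 0; l < 3; l++)               /* levels 0, 1, 2 = log2(P) steps */
    for (int myid = 0; myid < P; myid++)
      printf("level %d: rank %d <-> rank %d\n",
             l, myid, myid ^ (1 << l));     /* 1 << l has only bit l set */
  return 0;
}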
ASSIGNMENT
Complete the following program by implementing the function global_sum, using the MPI_Send and MPI_Recv functions and the hypercube template given above.
Submit the source code as well as the printouts from test runs on 4 processors and on 8 processors.
#include "mpi.h"
#include <stdio.h>
int nprocs; /* Number of processors */
int myid; /* My rank */
double global_sum(double partial) {
/* Implement your own global summation here */
}
int main(int argc, char *argv[]) {
double partial, sum, avg;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
partial = (double) myid;
printf("Node %d has %le\n", myid, partial);
sum = global_sum(partial);
if (myid == 0) {
  avg = sum/nprocs;
  printf("Global average = %le\n", avg);
}
MPI_Finalize();
return 0;
}
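To test, compile with your system's MPI wrapper compiler and launch with its process manager; the exact commands depend on the installation, but typically something like mpicc -o global_sum global_sum.c followed by mpirun -np 4 ./global_sum and mpirun -np 8 ./global_sum (the file name global_sum.c is just an example; some systems use mpiexec -n instead of mpirun -np).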