Program 3: Parallel Fractal Generation Using ISPC Solution

Now that you're comfortable with SIMD execution, we'll return to parallel Mandelbrot fractal generation (as in Program 1). Like Program 1, Program 3 computes a Mandelbrot fractal image, but it achieves even greater speedups by utilizing both the CPU's four cores and the SIMD execution units within each core.

In Program 1, you parallelized image generation by creating one thread for each processing core in the system. Then, you assigned parts of the computation to each of these concurrently executing threads. (Since threads were one-to-one with processing cores in Program 1, you effectively assigned work explicitly to cores.) Instead of specifying a specific mapping of computations to concurrently executing threads, Program 3 uses ISPC language constructs to describe *independent computations*. These computations may be executed in parallel without violating program correctness (and indeed they will!). In the case of the Mandelbrot image, computing the value of each pixel is an independent computation. With this information, the ISPC compiler and runtime system take on the responsibility of generating a program that utilizes the CPU's collection of parallel execution resources as efficiently as possible.

You will make a simple fix to Program 3, which is written in a combination of C++ and ISPC (the error causes a performance problem, not a correctness one). With the correct fix, you should observe performance that is over forty times greater than that of the original sequential Mandelbrot implementation from `mandelbrotSerial()`.

# Program 3, Part 1: A Few ISPC Basics (10 of 20 points)

When reading ISPC code, you must keep in mind that although the code appears much like C/C++ code, the ISPC execution model differs from that of standard C/C++. In contrast to C, multiple program instances of an ISPC program are always executed in parallel on the CPU's SIMD execution units. The number of program instances executed simultaneously is determined by the compiler (and chosen specifically for the underlying machine). This number of concurrent instances is available to the ISPC programmer via the built-in variable `programCount`. ISPC code can reference its own program instance identifier via the built-in `programIndex`. Thus, a call from C code to an ISPC function can be thought of as spawning a group of concurrent ISPC program instances (referred to in the ISPC documentation as a gang). The gang of instances runs to completion, then control returns back to the calling C code.

__Stop. This is your friendly instructor. Please read the preceding paragraph again. Trust me.__
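
One way to internalize the paragraph above is to emulate a gang in ordinary C++. The sketch below is only a scalar model under stated assumptions: `emulate_gang_call` is a hypothetical helper (not part of ISPC), the gang width is passed in as a parameter rather than chosen by a compiler, and the loop stands in for instances that ISPC would run together in SIMD lanes.

```cpp
#include <cassert>
#include <vector>

// Scalar model of one call from C++ into an ISPC function.
// Every program instance executes the same body; the only thing that
// differs between instances is the value of programIndex.
std::vector<int> emulate_gang_call(int programCount) {
    assert(programCount > 0);
    std::vector<int> ranBy(programCount, -1);
    for (int programIndex = 0; programIndex < programCount; ++programIndex) {
        // "Body of the ISPC function": record which instance ran here.
        ranBy[programIndex] = programIndex;
    }
    // The whole gang has finished; only now does control return to the caller.
    return ranBy;
}
```

Calling `emulate_gang_call(8)` models a single ISPC call on an 8-wide target: eight instances run the same body, and the caller resumes only after all eight are done.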

As an example, the following program uses a combination of regular C code and ISPC code to add two 1024-element vectors. As we discussed in class, since each instance in a gang is independent and performs the exact same program logic, execution can be accelerated using SIMD instructions.

The C code below calls the ISPC function that follows it:

------------------------------------------------------------------------

C program code: myprogram.cpp

------------------------------------------------------------------------

    const int TOTAL_VALUES = 1024;
    float a[TOTAL_VALUES];
    float b[TOTAL_VALUES];
    float c[TOTAL_VALUES];

    // Initialize arrays a and b here.

    sum(TOTAL_VALUES, a, b, c);

    // Upon return from sum, the result of a + b is stored in c.

The corresponding ISPC code:

------------------------------------------------------------------------

ISPC code: myprogram.ispc

------------------------------------------------------------------------

    export void sum(uniform int N, uniform float* a, uniform float* b, uniform float* c)
    {
        // Assumption: programCount divides N evenly.
        // The loop counter is uniform: all instances in the gang share one value of i.
        for (uniform int i=0; i<N; i+=programCount)
        {
            c[programIndex + i] = a[programIndex + i] + b[programIndex + i];
        }
    }

The ISPC program code above interleaves the processing of array elements among program instances. Note the similarity to Program 1, where you statically assigned parts of the image to threads.
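
To make the interleaving concrete, here is a minimal scalar sketch in C++ that performs the same arithmetic as `sum` and also records which program instance produced each output element. The name `sum_interleaved` and the `owner` vector are illustrative assumptions, and the inner loop replaces what ISPC would issue as a single SIMD operation.

```cpp
#include <cassert>
#include <vector>

// Scalar emulation of the ISPC `sum` function for a gang of programCount
// instances. Returns, for each element of c, the programIndex that wrote it.
std::vector<int> sum_interleaved(int N, int programCount,
                                 const float* a, const float* b, float* c) {
    // Same assumption as `sum`: programCount divides N evenly.
    assert(programCount > 0 && N >= 0 && N % programCount == 0);
    std::vector<int> owner(N, -1);
    for (int i = 0; i < N; i += programCount) {
        // In ISPC, these programCount additions execute together in SIMD lanes.
        for (int programIndex = 0; programIndex < programCount; ++programIndex) {
            c[programIndex + i] = a[programIndex + i] + b[programIndex + i];
            owner[programIndex + i] = programIndex;
        }
    }
    return owner;
}
```

With `N = 16` and `programCount = 8`, instance 0 writes elements 0 and 8, instance 1 writes elements 1 and 9, and so on. The mapping of work to instances is spelled out by the code itself.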

However, rather than thinking about how to divide work among program instances (that is, how work is mapped to execution units), it is often more convenient, and more powerful, to instead focus only on the partitioning of a problem into independent parts. ISPC's `foreach` construct provides a mechanism to express problem decomposition. Below, the `foreach` loop in the ISPC function `sum2` defines an iteration space where all iterations are independent and therefore can be carried out in any order. ISPC handles the assignment of loop iterations to concurrent program instances. The difference between `sum` and `sum2` below is subtle, but very important. `sum` is imperative: it describes how to map work to concurrent instances. The example below is declarative: it specifies only the set of work to be performed.

-------------------------------------------------------------------------

ISPC code:

-------------------------------------------------------------------------

    export void sum2(uniform int N, uniform float* a, uniform float* b, uniform float* c)
    {
        foreach (i = 0 ... N)
        {
            c[i] = a[i] + b[i];
        }
    }
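
The claim that `foreach` iterations "can be carried out in any order" can be checked with a small scalar experiment. The C++ sketch below is an illustration only: `sum2_in_order` and `order_does_not_matter` are hypothetical names, and visiting indices forward and then in reverse is just one pair of orders among the many ISPC is free to choose.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Scalar emulation of `sum2`: visit the iteration space in a caller-supplied
// order. Each iteration reads and writes only its own index i.
void sum2_in_order(const std::vector<int>& order,
                   const float* a, const float* b, float* c) {
    for (int i : order) {
        c[i] = a[i] + b[i];
    }
}

// Returns true if forward and reversed visiting orders give identical results.
bool order_does_not_matter(int N) {
    assert(N >= 0);
    std::vector<float> a(N), b(N), forward(N), backward(N);
    for (int i = 0; i < N; ++i) { a[i] = 0.5f * i; b[i] = 2.0f; }

    std::vector<int> order(N);
    std::iota(order.begin(), order.end(), 0);      // 0, 1, ..., N-1
    sum2_in_order(order, a.data(), b.data(), forward.data());

    std::reverse(order.begin(), order.end());      // N-1, ..., 1, 0
    sum2_in_order(order, a.data(), b.data(), backward.data());

    return forward == backward;
}
```

Because no iteration depends on another, `order_does_not_matter(1024)` returns `true`. That independence is the property that lets the system decide how iterations are assigned to program instances.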

Before proceeding, you are encouraged to familiarize yourself with ISPC language constructs by reading through the ISPC walkthrough available at <http://ispc.github.com/example.html>. The example program in the walkthrough is almost exactly the same as Program 3's implementation of `mandelbrot_ispc()` in `mandelbrot.ispc`. In the assignment code, we have changed the bounds of the foreach loop to yield a more straightforward implementation.

**What you need to do:**

1. Compile and run the program `mandelbrot_ispc`. __The ISPC compiler is currently configured to emit 8-wide AVX2 vector instructions.__ What is the maximum speedup you expect given what you know about these CPUs? Why might the number you observe be less than this ideal? (Hint: Consider the characteristics of the computation you are performing. Describe the parts of the image that present challenges for SIMD execution. Comparing the performance of rendering the different views of the Mandelbrot set may help confirm your hypothesis.)

We remind you that for the code described in this subsection, the ISPC compiler maps gangs of program instances to SIMD instructions executed on a single core. This parallelization scheme differs from that of Program 1, where speedup was achieved by running threads on multiple cores.
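
When reasoning about single-core SIMD speedup, a rough cost model can help frame your measurements. The C++ sketch below is one simplified model, not a description of what the ISPC compiler does: it assumes each gang of `width` adjacent pixels must keep stepping until its slowest lane finishes, and it ignores memory traffic and instruction overhead. The function name and its inputs are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Estimated speedup of width-wide SIMD execution over scalar execution,
// given the number of loop iterations each pixel needs.
// Scalar cost: the sum of all per-pixel iteration counts.
// SIMD cost:   per gang, the largest count in that gang (other lanes wait).
double estimated_simd_speedup(const std::vector<int>& iters, int width) {
    assert(width > 0 && iters.size() % width == 0);
    long long scalarSteps = 0;
    long long simdSteps = 0;
    for (std::size_t g = 0; g < iters.size(); g += width) {
        int slowest = 0;
        for (int lane = 0; lane < width; ++lane) {
            scalarSteps += iters[g + lane];
            slowest = std::max(slowest, iters[g + lane]);
        }
        simdSteps += slowest;
    }
    if (simdSteps == 0) return 1.0;  // no work at all: report no speedup
    return static_cast<double>(scalarSteps) / static_cast<double>(simdSteps);
}
```

Under this model, eight pixels that each need 256 iterations give a speedup of exactly 8, while a gang in which one pixel needs 256 iterations and the other seven need 1 gives a speedup only slightly above 1.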

# Program 3, Part 2: ISPC Tasks (10 of 20 points)

ISPC's SPMD execution model and mechanisms like `foreach` facilitate the creation of programs that utilize SIMD processing. The language also provides an additional mechanism for utilizing multiple cores in an ISPC computation. This mechanism is launching _ISPC tasks_.

See the `launch[2]` command in the function `mandelbrot_ispc_withtasks`. This command launches two tasks. Each task defines a computation that will be executed by a gang of ISPC program instances. As given by the function `mandelbrot_ispc_task`, each task computes a region of the final image. Similar to how the `foreach` construct defines loop iterations that can be carried out in any order (and in parallel by ISPC program instances), the tasks created by this launch operation can be processed in any order (and in parallel on different CPU cores).
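
The idea that each task owns a region of the image can be sketched in plain C++. The helper below is hypothetical and is not the assignment's `mandelbrot_ispc_task`: it assumes regions are contiguous bands of rows, and the image height used in the usage note is an illustrative value.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Split `height` image rows into `taskCount` contiguous bands, one per task.
// Returns the half-open row range [startRow, endRow) for each task index.
std::vector<std::pair<int, int>> task_row_ranges(int height, int taskCount) {
    assert(height >= 0 && taskCount > 0);
    // Round up so every row is covered; trailing tasks may get fewer rows.
    int rowsPerTask = (height + taskCount - 1) / taskCount;
    std::vector<std::pair<int, int>> ranges;
    for (int taskIndex = 0; taskIndex < taskCount; ++taskIndex) {
        int start = std::min(taskIndex * rowsPerTask, height);
        int end = std::min(start + rowsPerTask, height);
        ranges.emplace_back(start, end);
    }
    return ranges;
}
```

For an assumed 1200-row image, `task_row_ranges(1200, 2)` yields rows 0 to 599 for task 0 and rows 600 to 1199 for task 1. Since the bands do not overlap, they can be computed in any order or at the same time.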

**What you need to do:**

1. Run `mandelbrot_ispc` with the parameter `--tasks`. What speedup do you observe on view 1? What is the speedup over the version of `mandelbrot_ispc` that does not partition that computation into tasks?

2. There is a simple way to improve the performance of `mandelbrot_ispc --tasks` by changing the number of tasks the code creates. By only changing code in the function `mandelbrot_ispc_withtasks()`, you should be able to achieve performance that exceeds the sequential version of the code by over 40 times! How did you determine how many tasks to create? Why does the number you chose work best?

3. _Extra Credit: (2 points)_ What are the differences between the thread abstraction (used in Program 1) and the ISPC task abstraction? There are some obvious differences in semantics between the (create/join) and (launch/sync) mechanisms, but the implications of these differences are more subtle. Here's a thought experiment to guide your answer: what happens when you launch 10,000 ISPC tasks? What happens when you launch 10,000 threads?

_The smart-thinking student's question_: Hey wait! Why are there two different mechanisms (`foreach` and `launch`) for expressing independent, parallelizable work to the ISPC system? Couldn't the system just partition the many iterations of `foreach` across all cores and also emit the appropriate SIMD code for the cores?

_Answer_: Great question! And there are a lot of possible answers. Come to office hours.