Program 2: Vectorizing Code Using SIMD Intrinsics

Take a look at the function `clampedExpSerial` in `prog2_vecintrin/main.cpp` of the Assignment 1 code base. The `clampedExpSerial()` function raises `values[i]` to the power given by `exponents[i]` for all elements of the input array and clamps the resulting values at 9.999999. In program 2, your job is to vectorize this piece of code so it can be run on a machine with SIMD vector instructions.
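For reference, the behavior described above can be sketched in scalar C++ as follows. This is a minimal sketch rather than the code in `main.cpp`: the name `clampedExpSketch` and the `std::vector` interface are hypothetical, and it assumes every exponent is non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar sketch of the behavior described in the text; it is
// not the assignment's own clampedExpSerial. Assumes non-negative exponents.
std::vector<float> clampedExpSketch(const std::vector<float>& values,
                                    const std::vector<int>& exponents) {
    assert(values.size() == exponents.size());
    const float kClamp = 9.999999f;  // clamp threshold stated in the text
    std::vector<float> output(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        float result = 1.0f;
        // Multiply by values[i] exactly exponents[i] times.
        for (int count = exponents[i]; count > 0; --count) {
            result *= values[i];
        }
        // Clamp the result at the threshold.
        output[i] = (result > kClamp) ? kClamp : result;
    }
    return output;
}
```

Note that the inner loop runs a different number of times for each element. That per-element variation is what makes the vectorized version need masks.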

However, to make things a little easier, rather than crafting an implementation using SSE or AVX2 vector intrinsics that map to real SIMD vector instructions on modern CPUs, we're asking you to implement your version using CS149's "fake vector intrinsics" defined in `CS149intrin.h`. The `CS149intrin.h` library provides you with a set of vector instructions that operate on vector values and/or vector masks. (These functions don't translate to real CPU vector instructions; instead, we simulate these operations for you in our library and provide feedback that makes for easier debugging.) As an example of using the CS149 intrinsics, a vectorized version of the `abs()` function is given in `main.cpp`. This example contains some basic vector loads and stores and manipulates mask registers. Note that the `abs()` example is only a simple example, and in fact the code does not correctly handle all inputs! (We will let you figure out why!) You may wish to read through all the comments and function definitions in `CS149intrin.h` to know what operations are available to you.

Here are a few hints to help you in your implementation:

- Every vector instruction is subject to an optional mask parameter. The mask parameter defines which lanes have their output "masked" for this operation. A 0 in the mask indicates a lane is masked, so its value will not be overwritten by the results of the vector operation. If no mask is specified in the operation, no lanes are masked. (Note that this is equivalent to providing a mask of all ones.)

  *Hint:* Your solution will need to use multiple mask registers and various mask operations provided in the library.

- *Hint:* You may find the `_cs149_cntbits` function helpful in this problem.

- Consider what might happen if the total number of loop iterations is not a multiple of the SIMD vector width. We suggest you test your code with `./myexp -s 3`. *Hint:* You might find `_cs149_init_ones` helpful.

- *Hint:* Use `./myexp -l` to print a log of executed vector instructions at the end. Use the function `addUserLog()` to add customized debug information to the log. Feel free to add additional `CS149Logger.printLog()` calls to help you debug.
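The mask semantics in the hints above, along with the ideas of counting enabled lanes and enabling only the leading lanes of a partial vector, can be illustrated with a small lane-by-lane model in plain C++. This is only a sketch: `firstLanesOn`, `countOn`, `maskedMul`, and `powLanes` are hypothetical stand-ins written for illustration, not functions from `CS149intrin.h`. The model assumes a fixed width of 4 and non-negative exponents, and it omits the clamping step.

```cpp
#include <array>
#include <cassert>

// Plain C++ model of masked vector lanes. Every name here is a hypothetical
// stand-in for illustration, not the API declared in CS149intrin.h.
constexpr int kWidth = 4;
using Lanes = std::array<float, kWidth>;
using Mask = std::array<bool, kWidth>;

// Mask with the first `first` lanes enabled (1) and the rest disabled (0).
Mask firstLanesOn(int first) {
    assert(first >= 0 && first <= kWidth);
    Mask m{};
    for (int i = 0; i < kWidth; ++i) m[i] = (i < first);
    return m;
}

// Number of enabled lanes in a mask.
int countOn(const Mask& m) {
    int n = 0;
    for (bool bit : m) n += bit ? 1 : 0;
    return n;
}

// Masked multiply: dst[i] = a[i] * b[i] only where mask[i] is 1.
// Lanes whose mask bit is 0 keep their previous value in dst.
void maskedMul(Lanes& dst, const Lanes& a, const Lanes& b, const Mask& mask) {
    for (int i = 0; i < kWidth; ++i) {
        if (mask[i]) dst[i] = a[i] * b[i];
    }
}

// Raise each base lane to its exponent, touching only the first `valid`
// lanes. A lane leaves the active mask once its remaining count reaches 0,
// and the loop ends when no lane is enabled.
Lanes powLanes(const Lanes& base, const std::array<int, kWidth>& exps, int valid) {
    Lanes result;
    result.fill(1.0f);
    std::array<int, kWidth> remaining = exps;
    const Mask tail = firstLanesOn(valid);
    Mask active{};
    for (int i = 0; i < kWidth; ++i) active[i] = tail[i] && remaining[i] > 0;
    while (countOn(active) > 0) {
        maskedMul(result, result, base, active);
        for (int i = 0; i < kWidth; ++i) {
            if (active[i]) --remaining[i];
            active[i] = active[i] && remaining[i] > 0;
        }
    }
    return result;
}
```

With bases `{2, 3, 5, 7}`, exponents `{3, 0, 2, 5}`, and `valid = 3`, this model yields `{8, 1, 25, 1}`. The last lane stays at its initial value because the tail mask never enables it, which mirrors the case where the array size is not a multiple of the vector width.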

The output of the program will tell you if your implementation generates correct output. If there are incorrect results, the program will print the first one it finds and print out a table of function inputs and outputs. Your function's output is after "output = ", which should match the results after "gold = ".

The program also prints out a list of statistics describing utilization of the CS149 fake vector units. You should consider the performance of your implementation to be the value "Total Vector Instructions". (You can assume every CS149 fake vector instruction takes one cycle on the CS149 fake SIMD CPU.) "Vector Utilization" shows the percentage of vector lanes that are enabled.
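To make the two statistics concrete, here is a rough model of how they relate. It assumes that utilization is the number of enabled lanes divided by the total number of lanes across all executed vector instructions; the library's exact accounting may differ. `VectorStats` and `summarize` are hypothetical names, not part of the CS149 library.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of the two statistics. Each entry of activeLanes is the
// number of enabled lanes for one executed vector instruction.
struct VectorStats {
    int totalInstructions;      // one entry per executed vector instruction
    double utilizationPercent;  // enabled lanes / total lanes, as a percentage
};

VectorStats summarize(const std::vector<int>& activeLanes, int vectorWidth) {
    assert(vectorWidth > 0);
    long enabled = 0;
    for (int lanes : activeLanes) {
        assert(lanes >= 0 && lanes <= vectorWidth);
        enabled += lanes;
    }
    const long total = static_cast<long>(activeLanes.size()) * vectorWidth;
    const double pct = (total == 0) ? 0.0 : 100.0 * enabled / total;
    return {static_cast<int>(activeLanes.size()), pct};
}
```

For example, four instructions at width 4 with 4, 4, 2, and 2 enabled lanes give 12 enabled lanes out of 16, or 75% utilization under this model.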

**What you need to do:**

1. Implement a vectorized version of `clampedExpSerial` in `clampedExpVector`. Your implementation should work with any combination of input array size (`N`) and vector width (`VECTOR_WIDTH`).

2. Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. You can do this by changing the `#define VECTOR_WIDTH` value in `CS149intrin.h`. Does the vector utilization increase, decrease or stay the same as `VECTOR_WIDTH` changes? Why?

3. _Extra credit: (1 point)_ Implement a vectorized version of `arraySumSerial` in `arraySumVector`. Your implementation may assume that `VECTOR_WIDTH` is a factor of the input array size `N`. Whereas the serial implementation has `O(N)` span, your implementation should have at most `O(N / VECTOR_WIDTH + log2(VECTOR_WIDTH))` span. You may find the `hadd` and `interleave` operations useful.
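The span bound in the extra-credit item can be illustrated with a two-phase reduction, sketched here in plain C++. The sketch assumes that `hadd` sums adjacent pairs, so `[a b c d]` becomes `[a+b a+b c+d c+d]`, and that `interleave` moves even-indexed lanes to the front half and odd-indexed lanes to the back half, so `[a b c d]` becomes `[a c b d]`. Check these semantics against the comments in `CS149intrin.h`. `haddLanes`, `interleaveLanes`, and `arraySumSketch` are hypothetical model functions, not the library's intrinsics, and the width is fixed at 8.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kSumWidth = 8;  // model vector width; must be a power of two
using SumLanes = std::array<float, kSumWidth>;

// Assumed hadd semantics: adjacent-pair add, [a b c d] -> [a+b a+b c+d c+d].
SumLanes haddLanes(const SumLanes& v) {
    SumLanes r{};
    for (int i = 0; i < kSumWidth; i += 2) {
        r[i] = r[i + 1] = v[i] + v[i + 1];
    }
    return r;
}

// Assumed interleave semantics: even lanes to the front half, odd lanes to
// the back half, [a b c d] -> [a c b d].
SumLanes interleaveLanes(const SumLanes& v) {
    SumLanes r{};
    for (int i = 0; i < kSumWidth / 2; ++i) {
        r[i] = v[2 * i];
        r[i + kSumWidth / 2] = v[2 * i + 1];
    }
    return r;
}

float arraySumSketch(const std::vector<float>& values) {
    assert(values.size() % kSumWidth == 0);
    SumLanes acc{};
    // Phase 1: N / width lane-wise adds into a vector of partial sums.
    for (std::size_t i = 0; i < values.size(); i += kSumWidth) {
        for (int lane = 0; lane < kSumWidth; ++lane) {
            acc[lane] += values[i + lane];
        }
    }
    // Phase 2: log2(width) rounds of hadd followed by interleave. After the
    // last round every lane holds the full sum.
    for (int span = 1; span < kSumWidth; span *= 2) {
        acc = interleaveLanes(haddLanes(acc));
    }
    return acc[0];
}
```

Each round of phase 2 halves the number of distinct partial sums, so the reduction of one vector takes `log2(VECTOR_WIDTH)` rounds instead of `VECTOR_WIDTH - 1` sequential adds.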