Program 5: BLAS `saxpy` Solution

Program 5 is an implementation of the `saxpy` routine in the BLAS (Basic Linear Algebra Subprograms) library that is widely used (and heavily optimized) on many systems. `saxpy` computes the simple operation `result = scale*X+Y`, where `X`, `Y`, and `result` are vectors of `N` elements (in Program 5, `N` = 20 million) and `scale` is a scalar. Note that `saxpy` performs two math operations (one multiply, one add) for every three elements used. `saxpy` is a *trivially parallelizable computation* and features predictable, regular data access and predictable execution cost.
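To make the operation concrete, here is a minimal serial sketch of `result = scale*X+Y` in C++. The function name `saxpy_reference` and the use of `std::vector` are assumptions made for illustration; the code shipped with Program 5 may be organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for result = scale * X + Y.
// Hypothetical helper for illustration; not taken from the assignment code.
std::vector<float> saxpy_reference(float scale,
                                   const std::vector<float>& X,
                                   const std::vector<float>& Y) {
    // Both input vectors must have the same number of elements.
    assert(X.size() == Y.size());

    std::vector<float> result(X.size());
    for (std::size_t i = 0; i < X.size(); ++i) {
        // One multiply and one add per iteration, touching three
        // elements: X[i], Y[i], and result[i].
        result[i] = scale * X[i] + Y[i];
    }
    return result;
}
```

Each iteration is independent of every other, which is why the computation is trivially parallelizable: any partition of the index range can be processed by a separate lane, task, or core.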




**What you need to do:**




1. Compile and run `saxpy`. The program will report the performance of ISPC (without tasks) and ISPC (with tasks) implementations of `saxpy`. What speedup from using ISPC with tasks do you observe? Explain the performance of this program. Do you think it can be substantially improved? (For example, could you rewrite the code to achieve near-linear speedup? Yes or no? Please justify your answer.)

2. __Extra Credit:__ (1 point) Note that `main.cpp` computes the total memory traffic of the computation as `TOTAL_BYTES = 4 * N * sizeof(float);`. Even though `saxpy` loads one element from `X`, loads one element from `Y`, and writes one element to `result`, the multiplier of 4 is correct. Why is this the case? (Hint: think about how CPU caches work.)

3. __Extra Credit:__ (points handled on a case-by-case basis) Improve the performance of `saxpy`. We're looking for a significant speedup here, not just a few percentage points. If successful, describe how you did it and what a best-possible implementation on these systems might achieve.
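When interpreting the timings you measure, it can help to turn the `TOTAL_BYTES` expression from item 2 into an effective bandwidth and a ratio of math to memory traffic. The sketch below does that arithmetic. The helper names are hypothetical, the 1 GB = 10^9 bytes convention is an assumption (`main.cpp` may use a different one), and the 0.02 s figure in the usage note is a made-up number, not a measurement.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of memory traffic attributed to one saxpy pass over N elements,
// following the handout's expression TOTAL_BYTES = 4 * N * sizeof(float).
double total_bytes(std::size_t n) {
    return 4.0 * static_cast<double>(n) * sizeof(float);
}

// Effective bandwidth in GB/s for a run that took `seconds`.
// Assumes 1 GB = 1e9 bytes; hypothetical helper, not part of main.cpp.
double bandwidth_gbps(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    return total_bytes(n) / seconds / 1e9;
}

// Math operations per byte of traffic: one multiply and one add
// per element, divided by the bytes moved for those elements.
double ops_per_byte(std::size_t n) {
    return (2.0 * static_cast<double>(n)) / total_bytes(n);
}
```

For `N` = 20 million and 4-byte floats, `total_bytes` gives 320 million bytes, so a hypothetical run of 0.02 s would correspond to 16 GB/s, and `ops_per_byte` gives 0.125. Comparing a measured bandwidth against what the machine's memory system can sustain is one way to frame your answers to items 1 and 3.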




Notes: Some students have gotten hung up on this question (thinking too hard) in the past. We expect a simple answer, but the results from running this problem might trigger more questions in your head. Feel free to come talk to the staff.
