Starting from:
$30

$24

Machine problem 5: Cache simulation & optimization Solution

 Overview

This lab is to help understand the impact that cache memories can
have on the performance of your C programs.

The lab consists of two parts. The first part is writing a small C
program that simulates the behavior of a cache memory. The second part will
optimize a small matrix transpose function, with the goal of minimizing the
number of cache misses.


 Reference Trace Files

The `traces` subdirectory contains a collection of *reference trace files* that
we will use to evaluate the correctness of the cache simulator you write in Part
A. The trace files are generated by a Linux program called `valgrind`. For
example, typing

    linux> valgrind --log-fd=1 --tool=lackey -v --trace-mem=yes ls -l

on the command line runs the executable program `ls -l`, captures a trace of
each of its memory accesses in the order they occur, and prints them on
`stdout`.

<span>Valgrind</span> memory traces have the following form:

    I 0400d7d4,8
     M 0421c7f0,4
     L 04f6b868,8
     S 7ff0005c8,8

Each line denotes one or two memory accesses. The format of each line is

    [space]operation address,size

The <span>*operation*</span> field denotes the type of memory access:
`I` denotes an instruction load, `L` a data load, `S` a data store, and
`M` a data modify (i.e., a data load followed by a data store). There is
never a space before each `I`. There is always a space before each `M`,
`L`, and `S`. The <span>*address*</span> field specifies a 64-bit
hexadecimal memory address. The <span>*size*</span> field specifies the
number of bytes accessed by the operation.

 Part A: Writing a Cache Simulator

In Part A, a cache simulator in "csim.c" takes a <span>valgrind</span> 
memory trace as input, simulates the hit/miss behavior of a cache memory 
on this trace, and outputs the total number of hits, misses, and evictions.

We have provided the binary executable of a <span>*reference
cache simulator*</span>, called `csim-ref`, that simulates the behavior
of a cache with arbitrary size and associativity on a <span>
valgrind</span> trace file. It uses the LRU (least-recently used)
replacement policy when choosing which cache line to evict.

The reference simulator takes the following command-line arguments:

    Usage: ./csim-ref [-hv] -s <s> -E <E> -b <b> -t <tracefile>

-   `-h`: Optional help flag that prints usage info

-   `-v`: Optional verbose flag that displays trace info

-   `-s <s>`: Number of set index bits (*S = 2<sup>s</sup>* is the number of sets)

-   `-E <E>`: Associativity (number of lines per set)

-   `-b <b>`: Number of block bits (*B = 2<sup>b</sup>* is the block size)

-   `-t <tracefile>`: Name of the `valgrind`
    trace to replay

The command-line arguments are based on the notation (*s*, *E*, and *b*) from
CS:APP. For example:

        linux> ./csim-ref -s 4 -E 1 -b 4 -t traces/yi.trace
        hits:4 misses:5 evictions:3

The same example in verbose mode:

        linux> ./csim-ref -v -s 4 -E 1 -b 4 -t traces/yi.trace
        L 10,1 miss 
        M 20,1 miss hit 
        L 22,1 hit 
        S 18,1 hit 
        L 110,1 miss eviction 
        L 210,1 miss eviction 
        M 12,1 miss eviction hit 
        hits:4 misses:5 evictions:3

Task for Part A is to fill in the "csim.c" file so that
it takes the same command line arguments and produces the identical
output as the reference simulator. Notice that this file is almost
completely empty, will be written from scratch.

 Programming Rules for Part A

-   <span>csim.c</span> file must compile without warnings in order
    to receive credit.

-   Simulator must work correctly for arbitrary *s*, *E*, and *b*.  This
    means that it will need to dynamically allocate storage for the
    simulator’s data structures using the `malloc` function.

-   For this lab, we are interested only in data cache performance, so
    the simulator should ignore all instruction cache accesses (lines
    starting with `I`). Recall that <span>valgrind</span> always puts
    `I` in the first column (with no preceding space), and `M`, `L`, and
    `S` in the second column (with a preceding space).

-   To receive credit for Part A, must call the function
    <span>printSummary</span>, with the total number of hits, misses,
    and evictions, at the end of the <span>main</span> function:

            printSummary(hit_count, miss_count, eviction_count);

-   For this this lab, assume that memory accesses are
    aligned properly, such that a single memory access never crosses
    block boundaries. By making this assumption, ignore the
    request sizes in the <span>valgrind</span> traces.

 Part B: Optimizing Matrix Transpose

In Part B, write a transpose function in <span>trans.c</span>
that causes as few cache misses as possible.

Let *A* denote a matrix, and *A<sub>ij</sub>* denote the component on the ith
row and jth column. The <span>*transpose*</span> of *A*, denoted *A<sup>T</sup>*,
is a matrix such that *A<sub>ij</sub>=A<sup>T</sup><sub>ji</sub>*.

An example transpose function in "trans.c" that computes the transpose of
*N &times; M* matrix *A* and stores the results in *M &times; N* matrix *B*:

        char trans_desc[] = "Simple row-wise scan transpose";
        void trans(int M, int N, int A[N][M], int B[M][N])

The example transpose function is correct, but it is inefficient because
the access pattern results in relatively many cache misses.

Task in Part B is to write a similar function, called
`transpose_submit`, that minimizes the number of cache misses across
different sized matrices:

        char transpose_submit_desc[] = "Transpose submission";
        void transpose_submit(int M, int N, int A[N][M], int B[M][N]);

Do <span>*not*</span> change the description string
(“`Transpose submission`”) for the `transpose_submit` function. The
autograder searches for this string to determine which transpose
function to evaluate for credit.

 Programming Rules for Part B

-   Code in <span>trans.c</span> must compile without warnings to
    receive credit.

-   Allowed to define at most 12 local variables of type `int`
    per transpose function.

-   Not allowed to side-step the previous rule by using any
    variables of type `long` or by using any bit tricks to store more
    than one value to a single variable.

-   The transpose function may not use recursion.

-   If choosing to use helper functions, may not have more than 12
    local variables on the stack at a time between the helper functions
    and the top level transpose function. For example, if the
    transpose declares 8 variables, and then call a function which
    uses 4 variables, which calls another function which uses 2, it
    will have 14 variables on the stack, and it will be in violation of
    the rule.

-   The transpose function may not modify array A. It may, however, do
    whatever with the contents of array B.

-   NOT allowed to define any arrays in code or to use any
    variant of <span>malloc</span>.

Evaluation
----------

This section describes how the work will be evaluated. The full score
for this lab is 53 points:

-   Part A: 27 Points

-   Part B: 26 Points

 Evaluation for Part A

For Part A, we will run cache simulator using different cache
parameters and traces. There are eight test cases, each worth 3 points,
except for the last case, which is worth 6 points:

      linux> ./csim -s 1 -E 1 -b 1 -t traces/yi2.trace
      linux> ./csim -s 4 -E 2 -b 4 -t traces/yi.trace
      linux> ./csim -s 2 -E 1 -b 4 -t traces/dave.trace
      linux> ./csim -s 2 -E 1 -b 3 -t traces/trans.trace
      linux> ./csim -s 2 -E 2 -b 3 -t traces/trans.trace
      linux> ./csim -s 2 -E 4 -b 3 -t traces/trans.trace
      linux> ./csim -s 5 -E 1 -b 5 -t traces/trans.trace
      linux> ./csim -s 5 -E 1 -b 5 -t traces/long.trace

Can use the reference simulator `csim-ref` to obtain the correct
answer for each of these test cases. During debugging, use the
<span>-v</span> option for a detailed record of each hit and miss.

For each test case, outputting the correct number of cache hits, misses
and evictions will give the full credit for that test case. Each of the
reported number of hits, misses and evictions is worth 1/3 of the credit
for that test case. That is, if a particular test case is worth 3
points, and the simulator outputs the correct number of hits and
misses, but reports the wrong number of evictions, then it will earn 2
points.

 Evaluation for Part B

For Part B, we will evaluate the correctness and performance of the
`transpose_submit` function on three different-sized output matrices:

-   *32 &times; 32* (*M=32*, *N=32*)

-   *64 &times; 64* (*M=64*, *N=64*)

-   *61 &times; 67* (*M=61*, *N=67*)

# Performance (26 pts)

For each matrix size, the performance of the `transpose_submit`
function is evaluated by using <span>valgrind</span> to extract the
address trace for the function, and then using the reference simulator
to replay this trace on a cache with parameters (*s=5*, *E=1*, *b=5*).

The performance score for each matrix size scales linearly with the
number of misses, *m*, up to some threshold:

-   *32 &times; 32*: 8 points if *m < 300*, 0 points if *m > 600*

-   *64 &times; 64*: 8 points if *m < 1,300*, 0 points if *m > 2,000*

-   *61 &times; 67*: 10 points if *m < 2,000*, 0 points if *m > 3,000*

The code must be correct and adhere to the programming rules to receive any
performance points for a particular size. The code only needs to be correct for
these three cases and it can optimize it specifically for these three cases. In
particular, it is perfectly OK for the function to explicitly check for the
input sizes and implement separate code optimized for each case.

Working on the Lab
------------------

 Working on Part A

We have provided an autograding program, called `test-csim`,
that tests the correctness of the cache simulator on the reference
traces. Be sure to compile the simulator before running the test:

    linux> make
    linux> ./test-csim
                            Your simulator     Reference simulator
    Points (s,E,b)    Hits  Misses  Evicts    Hits  Misses  Evicts
         3 (1,1,1)       9       8       6       9       8       6  traces/yi2.trace
         3 (4,2,4)       4       5       2       4       5       2  traces/yi.trace
         3 (2,1,4)       2       3       1       2       3       1  traces/dave.trace
         3 (2,1,3)     167      71      67     167      71      67  traces/trans.trace
         3 (2,2,3)     201      37      29     201      37      29  traces/trans.trace
         3 (2,4,3)     212      26      10     212      26      10  traces/trans.trace
         3 (5,1,5)     231       7       0     231       7       0  traces/trans.trace
         6 (5,1,5)  265189   21775   21743  265189   21775   21743  traces/long.trace
        27

For each test, it shows the number of points earned, the cache
parameters, the input trace file, and a comparison of the results from
the simulator and the reference simulator.

Here are some hints and suggestions for working on Part A:

-   Do the initial debugging on the small traces, such as
    `traces/dave.trace`.

-   The reference simulator takes an optional `-v` argument that enables
    verbose output, displaying the hits, misses, and evictions that
    occur as a result of each memory access. Is is not required to
    implement this feature in the <span>csim.c</span> code, but we
    strongly recommend that it is done so. It will help to debug by
    allowing to directly compare the behavior of your simulator with
    the reference simulator on the reference trace files.

-   We recommend using the <span>getopt</span> function to parse
    the command line arguments. You’ll need the following header files:

            #include <getopt.h> 
            #include <stdlib.h> 
            #include <unistd.h> 

    See “<span>man 3 getopt</span>” for details.

-   Each data load (L) or store (S) operation can cause at most one cache miss
    (assume that the latter uses a write-allocate policy). The data modify
    operation (M) is treated as a load followed by a store to the same
    address. Thus, an M operation can result in two cache hits, or a miss and a
    hit plus a possible eviction.

 Working on Part B

We have provided an autograding program, called <span>
test-trans.c</span>, that tests the correctness and performance of each
of the transpose functions that is registered with the autograder.

It can register up to 100 versions of the transpose function in the
<span>trans.c</span> file. Each transpose version has the following
form:

        /* Header comment */
        char trans_simple_desc[] = "A simple transpose";
        void trans_simple(int M, int N, int A[N][M], int B[M][N])
        {
            /* your transpose code here */
        }

Register a particular transpose function with the autograder by making a
call of the form:

        registerTransFunction(trans_simple, trans_simple_desc);

in the `registerFunctions` routine in <span>"trans.c"</span>.  At runtime, the
autograder will evaluate each registered transpose function and print the
results. Of course, one of the registered functions must be the
`transpose_submit` function that is being submitting for credit:

        registerTransFunction(transpose_submit, transpose_submit_desc);

See the default <span>trans.c</span> function for an example of how this
works.

The autograder takes the matrix size as input. It uses
<span>valgrind</span> to generate a trace of each registered transpose
function. It then evaluates each trace by running the reference
simulator on a cache with parameters (*s=5*, *E=1*, *b=5*).

For example, to test the registered transpose functions on a *32
&times; 32* matrix, rebuild `test-trans`, and then run it with the
appropriate values for *M* and *N*:

    linux> make
    linux> ./test-trans -M 32 -N 32
    Step 1: Evaluating registered transpose funcs for correctness:
    func 0 (Transpose submission): correctness: 1
    func 1 (Simple row-wise scan transpose): correctness: 1
    func 2 (column-wise scan transpose): correctness: 1
    func 3 (using a zig-zag access pattern): correctness: 1

    Step 2: Generating memory traces for registered transpose funcs.

    Step 3: Evaluating performance of registered transpose funcs (s=5, E=1, b=5)
    func 0 (Transpose submission): hits:1766, misses:287, evictions:255
    func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151
    func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151
    func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945

    Summary for official submission (func 0): correctness=1 misses=287

In this example, we have registered four different transpose functions
in <span>trans.c</span>. The `test-trans` program tests each of the
registered functions, displays the results for each, and extracts the
results for the official submission.

Here are some hints and suggestions for working on Part B.

-   The <span>test-trans</span> program saves the trace for function *i*
    in file "trace.f*i*". These trace files are
    invaluable debugging tools that can help to understand exactly
    where the hits and misses for each transpose function are
    coming from. To debug a particular function, simply run its trace
    through the reference simulator with the verbose option:

        linux> ./csim-ref -v -s 5 -E 1 -b 5 -t trace.f0
        S 68312c,1 miss 
        L 683140,8 miss 
        L 683124,4 hit 
        L 683120,4 hit 
        L 603124,4 miss eviction 
        S 6431a0,4 miss 
        ...

-   Since the transpose function is being evaluated on a direct-mapped
    cache, conflict misses are a potential problem. Think about the
    potential for conflict misses in the code, especially along
    the diagonal. Try to think of access patterns that will decrease the
    number of these conflict misses.

-   Blocking is a useful technique for reducing cache misses. See
    <http://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf> for more
    information.

 Putting it all Together

We have provided a <span>*driver program*</span>, called
`./driver.py`, that performs a complete evaluation of the simulator and
transpose code. This is the same program the instructor uses to
evaluate handins. The driver uses <span>test-csim</span> to
evaluate the simulator, and it uses <span>test-trans</span> to evaluate
the submitted transpose function on the three matrix sizes. Then it
prints a summary of the results and the points earned.

To run the driver, type:

    linux> ./driver.py

More products