# Assignment 1
## Overview
In this assignment, we're exploring the design space for basic spinlocks, and
in so doing developing some implementations that we can use as building blocks
later on. We'll measure how these implementations perform at various levels of
concurrency and bus contention, and think along the way about why each idea we
introduce results in the behavior we observe.
The steps in this assignment build on one another, so you'll need to
do them in order to get things working. There are numbered questions
throughout the assignment; we recommend thinking through these questions
carefully and writing down short concrete answers before moving on to the next
section. If you're confused by a question or don't know where to start, you're
free to discuss or puzzle through them with other students, and Ted and Jon are
also available to help. The point of these assignments is to develop your
understanding of the material and give you some hands-on intuition for
concurrent programming, and the questions are there to help you understand
what's going on. If they don't, we'd really like to know, so we can write
some better ones :)
Finally, this assignment is brand new, so there may be gaps in the
documentation and bugs in the code. Let us know about any issues you find and
we'll resolve them as quickly as possible.
If you have questions and are comfortable sharing them, send them to the
[mailing list](mailto:cs510walpole@cecs.pdx.edu). If you would like Ted to
privately answer questions or take a look at your code, you can get in touch
with him via [email](mailto:theod@pdx.edu). He will also make an effort to
lurk on IRC in `#cs510walpole` on irc.cat.pdx.edu (see below for details).
## Submission instructions
```bash
cd CS510-Advanced-Topics-in-Concurrency
pushd Assignment_1
pushd data
# remove all data and graphs BESIDES the final ones you'd like me to look at
rm <out-of-date data and graphs>
popd
make clean
popd
# where NAME is your name separated by underscores
tar czf Assignment_1-NAME.tar.gz Assignment_1/
```
Email the archive to [theod@pdx.edu](mailto:theod@pdx.edu) with subject
`Assignment_1-NAME` and any questions and comments you'd like to share.
## What is a spinlock?
We can use a lock to make sure that one thread has mutually exclusive access to a
resource. To do so, we need to ensure that when thread A acquires a lock, no other
thread also acquires that lock until after thread A has finished accessing the resource
and releases the lock.
We can use a word `lock` in shared memory to represent a lock, which has the
value `UNLOCKED` when it is free, and the value `LOCKED` when it is held by
some thread. Then, acquiring the lock is as simple as observing `lock ==
UNLOCKED`, and then writing `LOCKED` to `lock`, right? WRONG! Let's say that threads
A and B both observe `lock == UNLOCKED`, and then concurrently write `LOCKED`
to `lock`. Both would observe that `lock == LOCKED`, and each might conclude
that it could safely access the resource `lock` protects. This is just what a
lock is supposed to prevent, right?
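To make the race concrete, here's a minimal sketch of the broken "check, then set" approach (not code from the assignment; the `LOCKED`/`UNLOCKED` values are stand-ins for the assignment's own definitions):
```c
#include <stdint.h>

/* Stand-in values; the assignment's own definitions may differ. */
#define UNLOCKED 0
#define LOCKED   1

/* BROKEN: the naive "check, then set" approach, for illustration only. */
void broken_lock(volatile uint64_t *lock)
{
    while (*lock != UNLOCKED)
        ;  /* wait until the lock looks free */

    /* Another thread can run the same check right here, also observe
     * UNLOCKED, and fall through to the store below. */
    *lock = LOCKED;  /* now both threads believe they hold the lock */
}
```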
We could try identifying the lock holder in `lock`, with thread A writing
`LOCKED_A` and thread B writing `LOCKED_B`. Then if thread A observed `lock ==
LOCKED_A`, it would know that B hadn't come along and written `LOCKED_B` to
`lock`, right? WRONG! What if B just hadn't gotten around to it yet? A observes
`lock == LOCKED_A`, then B writes `LOCKED_B` to `lock` and observes `lock ==
LOCKED_B`, and once again the threads concurrently conclude that each has
exclusive access to the resource. That doesn't work either! Let's abandon the
"lock tells you who has it" train of thought for now; we'll come back to it in
a later assignment.
Let's again consider the case where threads A and B are racing to write
`LOCKED` to `lock`. If each thread could tell the computer to only write
`LOCKED` to `lock` if nobody else has snuck in and done it yet, and the
computer lets the thread know whether the attempt succeeded or failed, then the
thread whose attempt succeeded would know that it had exclusive access to the
resource, while the thread whose attempt failed would know that it didn't.
## Primitives
Modern multicore computers offer atomic compare and exchange instructions to
enable this way of using words in memory to implement locks and other
concurrent objects. In particular, the `x86_64` instruction set offers a
"`lock; cmpxchgq`" instruction that we've exposed as
a macro you can use in C code `lockcmpxchgq(ptr, old, new)` that has the
following behavior:
* If `*ptr == old`, atomically store new to `*ptr` and return `old`.
* Otherwise, return the value of `*ptr` without changing it. This is the
primitive we'll be using to implement our spinlocks!
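For intuition, here's roughly what `lockcmpxchgq()` behaves like, written as plain C. This is only a semantics sketch: the real macro compiles down to a single atomic `lock; cmpxchgq` instruction, whereas the C below is not atomic.
```c
#include <stdint.h>

/* Semantics sketch only -- NOT atomic, unlike the real macro. */
uint64_t lockcmpxchgq_semantics(volatile uint64_t *ptr, uint64_t old, uint64_t new)
{
    uint64_t observed = *ptr;   /* the value currently in memory */
    if (observed == old)
        *ptr = new;             /* install the new value */
    return observed;            /* equals `old` exactly when the store happened */
}
```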
The `q` stands for "quad word", which is Intel's backward-looking way of saying
"four 2-byte words long, i.e. 8 bytes (64 bits)". Going forward, this is what
we mean when we say "word". We'll use the C type `uint64_t` to represent
these words.
NOTE: There are a variety of interesting solutions to the critical section
problem that don't use atomic instructions, such as
[Peterson's Algorithm](https://en.wikipedia.org/wiki/Peterson%27s_algorithm)
and
[Simpson's 4-slot Algorithm](http://deploy-eprints.ecs.soton.ac.uk/116/1/sld.ch7.conc.pdf)
(slides 7 and on); we just aren't exploring them in this assignment.
To keep the compiler from rearranging our code, e.g. during optimizations, we
need to use some other macros: If another thread might be loading from or
storing to `x`, write `_CMM_STORE_SHARED(x,v);` instead of `x = v;`, and
`CMM_LOAD_SHARED(x)` instead of `x` whenever we load `x`. For example, if we
were looping on the value of `x`, we would write `while(CMM_LOAD_SHARED(x)) {
... }` instead of `while(x) { ... }`, and if we were loading `x` into a local
variable `y`, we would write `y = CMM_LOAD_SHARED(x)`.
Note that storing an aligned `uint64_t` with `_CMM_STORE_SHARED()` is atomic on
`x86_64`. It doesn't prevent the hardware from reordering other accesses across
the store at runtime (on `x86_64`, only later loads can be reordered before an
earlier store), but if you don't care about that, you can safely use this macro
to unconditionally and atomically update a word that other threads may access.
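Here's a small sketch of the macros in use; the flag name `done` is made up for illustration, and the macros themselves come from the headers the assignment already includes.
```c
#include <stdint.h>
/* CMM_LOAD_SHARED()/_CMM_STORE_SHARED() are provided by the assignment's headers. */

static volatile uint64_t done = 0;   /* hypothetical shared flag */

void waiter(void)
{
    /* The macro keeps the compiler from hoisting the load out of the loop. */
    while (!CMM_LOAD_SHARED(done))
        ;  /* spin until another thread sets the flag */
}

void signaller(void)
{
    /* Store to the shared word without letting the compiler elide or move it. */
    _CMM_STORE_SHARED(done, 1);
}
```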
## Building Spinlocks
### `spin_try_lock()`
First, we'll write some code that tries to acquire a spinlock, returning `1` if
it succeeds and `0` if it fails. Head over to `worker.c` and look for an empty
function definition for `int spin_try_lock(volatile uint64_t *lock)`. Fill in
the body of the function, using `lockcmpxchgq()` to attempt to install the
value `LOCKED` in the word that `lock` points to, which we expect to contain
`UNLOCKED`, returning `1` if we succeeded in replacing `UNLOCKED`, and `0` if
somebody else beat us to it and stored `LOCKED` at `*lock`.
Next, we need a way for a thread, having completed its critical section, to
release a lock it holds. `void spin_unlock(volatile uint64_t *lock)` will
serve this purpose, and all it needs to do on `x86_64` is store `UNLOCKED` to
`*lock` and ensure that the compiler doesn't reorder memory accesses across
this store. Fill in the body of the function to accomplish this.
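If it helps to see the shape of these functions, here's one possible sketch, assuming the `LOCKED`/`UNLOCKED` values and the primitives described above; your own versions may reasonably differ.
```c
/* Sketch only -- assumes LOCKED/UNLOCKED and the primitives described above. */
int spin_try_lock(volatile uint64_t *lock)
{
    /* We "win" only if the word still held UNLOCKED when we swapped. */
    return lockcmpxchgq(lock, UNLOCKED, LOCKED) == UNLOCKED;
}

void spin_unlock(volatile uint64_t *lock)
{
    /* A plain aligned store is atomic on x86_64; the macro keeps the
     * compiler from reordering accesses across it. */
    _CMM_STORE_SHARED(*lock, UNLOCKED);
}
```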
### Testing your `spin_try_lock()` and `spin_unlock()` implementations
To test your implementations, build and run the test harness from a shell:
```bash
make
./run 24 10000 1
```
`run` takes 3 parameters, `MAX_THREADS` (how many threads the harness tests up
to), `N_OPS` (the total number of operations to run), and `N_RUNS` (how many
times to repeat the experiment). It prints diagnostic messages from the parent
thread, as well as comma-separated data for each test attempt (more on this later).
With our code in its current state, the invocation above will run the test
called `spin_try_lock_correctness_test_nograph`, which tests `spin_try_lock()`
and `spin_unlock()` on a spinlock `my_spinlock`, with 1 worker thread running
them 10000 times, then 2 worker threads running them 5000 times apiece, then 3
worker threads running them about 3333 times apiece (one does 3334), and so on
until 24 worker threads run them about 416 times apiece (some do 417 to cover the remainder). This
way we're doing the same work every time, just changing how many workers are
splitting the job up. Every time a worker calls `spin_try_lock()`, it checks
the return value to determine whether the lock has been successfully acquired.
If it has, the worker increments a counter `my_spinlock_shared_counter` that it
shares with other workers, and records the fact that it has done so in a
thread-local counter, `my_spinlock_counter`. Finally, having successfully
acquired the lock, the worker releases it with `spin_unlock()`.
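Schematically, each test operation looks roughly like this (names follow the description above; the real harness code differs in its details):
```c
/* Rough per-operation logic of a worker, per the description above. */
if (spin_try_lock(&my_spinlock)) {
    my_spinlock_shared_counter++;   /* shared with every worker */
    my_spinlock_counter++;          /* this worker's thread-local tally */
    spin_unlock(&my_spinlock);
}
```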
If `spin_try_lock()` is working properly (that is, ensuring that no other
thread thinks it has acquired `my_spinlock` when `spin_try_lock(&my_spinlock)`
returns `1`), then the sum of all thread-local counters should equal the final
value of `my_spinlock_shared_counter`. On the other hand, if for example two
threads think that they both hold `my_spinlock` at once, then they may both
observe `my_spinlock_shared_counter` to have some value `5`, and both write `6`
to `my_spinlock_shared_counter`, but both also increment their local counters.
Now `my_spinlock_shared_counter` has increased by `1`, while the sum of local
counters has increased by `2`, and the sum of local counters no longer equals
the value of the shared counter.
Every time the worker threads complete a set of 10000 operations, the parent
thread checks that `my_spinlock_shared_counter` equals the sum of the workers'
local counters. If they don't match, the program prints a message something
like this before exiting with status `-1`:
```
-1: spin_try_lock_correctness_nograph failed for nt 2,
my_spinlock_shared_counter: 5208, sum: 10000, n_ops: 10000. If the lock were
working correctly, counter and sum would be nonzero and the same.
```
All of the correctness tests we'll use in this assignment work more or less the
same way. Feel free to experiment with adjusting the values of `MAX_THREADS`,
`N_OPS`, and `N_RUNS`.
### `spin_lock()`
Once you've convinced yourself that `spin_try_lock()` and `spin_unlock()` are
correct by running some experiments, you'll have seen some messages like
```
-1: spin_try_lock_correctness_nograph succeeded for nt 24,
my_spinlock_shared_counter: 535, sum: 535, n_ops: 10000. Values are nonzero
and the same, so the lock is working correctly.
```
If only 535 increments actually occurred, then the other 9465 attempts must
have resulted in `spin_try_lock()` returning `0`, that is, a thread observing
that some other thread currently holds `my_spinlock` and therefore failing to
acquire it and perform an increment. This isn't too surprising, since we have
24 threads competing for `my_spinlock` in this case (note `for nt 24` above).
What we really need is a lock acquisition function that *always works*. Well,
we could just keep spamming `spin_try_lock()` until it returns `1`, right?
Let's give that a try. Head over to `void spin_lock(volatile uint64_t *lock)`
in `worker.c`, and replace the `TODO` with some code that calls
`spin_try_lock()` in a tight loop until it returns `1`. If you'd prefer to use
`lockcmpxchgq()` directly instead of `spin_try_lock()`, that's fine too.
This code doesn't return until the lock is acquired; it just keeps trying to get it,
or in other words, it *busy-waits*.
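A minimal sketch of this, assuming the `spin_try_lock()` you wrote in the previous step:
```c
/* Sketch: busy-wait by spamming spin_try_lock() until it succeeds. */
void spin_lock(volatile uint64_t *lock)
{
    while (!spin_try_lock(lock))
        ;  /* keep trying */
}
```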
We don't need to change `spin_unlock()` to work with this implementation, or any
subsequent implementation in this assignment.
### Testing your `spin_lock()` implementation
Open `tests.c` and take a look at the array `test_names`, which enumerates the
tests that are available, and the array `test_on`, which determines whether the
test at the same index in `test_names` will be run. Right now, the test
harness is set up to only run the first test,
`"spin_try_lock_correctness_nograph"`. The `"correctness"` part indicates that
it's a test that checks whether the `spin_try_lock()` implementation is
correct, not a benchmark that measures the performance of `spin_try_lock()`.
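For orientation, the arrays look something like the following; the exact entries and their order in your copy of `tests.c` may differ, so treat this as illustrative only.
```c
/* Illustrative only -- check tests.c for the real names and order. */
char *test_names[] = {
    "spin_try_lock_correctness_nograph",
    "spin_lock_correctness_nograph",
    /* ... */
};

int test_on[] = {
    1,   /* spin_try_lock_correctness_nograph is enabled */
    0,   /* flip this to 1 to enable spin_lock_correctness_nograph */
    /* ... */
};
```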
To test the correctness of `spin_lock()`, we'll need to change the `0` in `test_on`
corresponding to `spin_lock_correctness_nograph` to a `1`, rebuild the test
harness with `make`, and run it as before with `run`. When your implementation
is working correctly, you should see a message like
```
-1: spin_lock_correctness_nograph succeeded for nt 24,
my_spinlock_shared_counter: 10000, sum: 10000, n_ops: 10000. Values are
nonzero and the same, so the lock is working correctly.
```
Now the number of increments equals the number of operations attempted: each
call to `spin_lock()` busy-waits until it has successfully acquired `lock`, so
every operation eventually acquires the lock and increments both the local and
shared counters.
### Benchmarking your `spin_lock()` implementation
Now we know that `spin_lock()` is correct, but we don't know how efficient it is
compared to other forms of synchronization. To get an idea of how we're doing,
we'll use the spinlock implementation from pthreads, `pthread_spin_lock`, as a baseline.
We need to turn on two benchmarking tests in `test_on` in `tests.c`:
* `pthread_spin_lock`
* `spin_lock`
We aren't including a benchmark for `spin_try_lock()` here since it would be an
apples-to-oranges comparison: `spin_lock()` and `pthread_spin_lock()` both
busy-wait until they acquire the lock, while `spin_try_lock()` does not.
Now, rebuild the test harness and use `bench` to run benchmarks.
```bash
make
./bench 24 300000 1
```
You should see output that ends with some messages about generated graphs:
```
[1] "generating data/bench_t24_2018-01-09_12-27-07.csv_ticks_0accesses.pdf"
[1] "generating data/bench_t24_2018-01-09_12-27-07.csv_ticks_8accesses.pdf"
[1] "generating data/bench_t24_2018-01-09_12-27-07.csv_ticks_80accesses.pdf"
```
Check out the graphs; they should have one line for `spin_lock` and another for
`pthread_spin_lock`. How do the implementations compare?
### Bus contention
Note that 3 different graphs are generated, one labeled to represent test runs
with "no critical section accesses" `_0accesses.pdf`, another with "1 cache
line updated during critical section" `_8accesses.pdf`, and another with "10
cache lines updated during critical section" `_80accesses.pdf`. The latter 2
tests simulate what threads usually do when they hold a lock, which is update
some other memory protected by the lock. We simulate this by doing racy
increments to words in a shared circular buffer. Because a given word in the
buffer is updated by multiple threads, increments to that word by different
threads require communication between those threads over the system bus. This
is the case whenever multiple threads try to update the same word (among other
situations that trigger bus traffic).
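As a rough picture of what "1 cache line" versus "10 cache lines" means here: 8 consecutive 8-byte words span one 64-byte cache line, so the simulated critical-section work looks something like the sketch below (names and sizes are illustrative, not the harness's actual code).
```c
#include <stdint.h>

#define BUF_WORDS 1024                            /* illustrative size */
static volatile uint64_t shared_buf[BUF_WORDS];   /* shared by all workers */

/* Racy increments to consecutive words of a shared circular buffer:
 * 8 accesses touch roughly 1 cache line, 80 accesses roughly 10. */
static void critical_section_work(uint64_t *pos, int n_accesses)
{
    for (int i = 0; i < n_accesses; i++)
        shared_buf[(*pos)++ % BUF_WORDS]++;
}
```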
### Questions
1. Does calling `spin_try_lock(&my_spinlock)` over and over again (even after
we've just observed that `my_spinlock` is held) every time we need to get
`my_spinlock` affect how long it takes to finish a set of operations? Does
each call require threads to communicate, and if so, what shared resources
does that communication use?
2. How does this implementation scale as we increase the number of threads?
Does passing the lock around and busy-waiting for it affect the time it
takes for all threads to crunch through the set of operations? If so, is the
effect different for different numbers of threads?
3. As we add bus traffic, first by updating a single cache line and then by
updating 10 cache lines while holding the lock, how do `spin_lock` and
`pthread_spin_lock` respond to this change, and why?
### A note on resource sharing, measurement accuracy, and interference
Since we'll all be using `babbage` to run these benchmarks, it's possible that another
student could be running benchmarks at the same time. This will at the very least
make benchmarking take longer, and could make your graphs look a little strange.
To check whether this is going on, watch `htop` in another terminal to see if another
user is running a bunch of `./test` threads too. If so, get in touch with that user
in one of the following ways to negotiate a way to share benchmarking time:
1. Via the `#cs510walpole` IRC channel on `irc.cat.pdx.edu`.
See the [CAT documentation](https://cat.pdx.edu/services/irc/)
for more information about getting connected and using IRC.
2. Using the `write` command to make a message appear in their terminal:
```bash
$ write theod
hello friend, will you be finished benchmarking soon?
(CTRL+d to end)
```
3. Getting in touch via email. CAT and ODIN usernames are usually the same, so
`username@pdx.edu` will probably work, but you could also use the mailing list
to broadcast to everybody in the course (as a last resort) at
[`cs510walpole@cecs.pdx.edu`](mailto:cs510walpole@cecs.pdx.edu).
We've tuned the `./run` and `./bench` invocations in this assignment to take
a few minutes or less, so the chance of collisions between users is relatively low.
Also, note that the correctness tests work no matter what, although they may take
longer if another user is running tests.
Finally, it's important to note that `babbage` is heavily used, and is a virtual
machine that shares a host with other heavily used virtual machines. It's an
environment subject to frequent interference and weird behavior resulting from
virtualization and hyperthreading on the host, and benchmark results often
fluctuate. Try running a given benchmark multiple times and looking at the
graphs that are produced to get an idea of the range of results, and if you
produce a particular graph with results that just don't make sense, try running
`./bench` again before you get too invested in being confused :)
### `spin_wait_lock()`
Based on our observations in the last section, it seems that spamming
`spin_try_lock()` as quickly as possible does not scale particularly well as we
increase the number of threads, especially in the presence of other bus
traffic. Let's see what happens when we build a delay into our lock
acquisition function; we can write this modified version of `spin_lock()` in
`void spin_wait_lock(volatile uint64_t *lock)` in `worker.c`. To be clear, the
only goal here is to poll the spinlock less frequently. For example, it is
sufficient to just add a loop that, say, decrements an `int` 500 times, and
crank through the loop in between calls to `spin_try_lock()`.
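One possible shape of this, where the 500-iteration delay is just the example value suggested above:
```c
/* Sketch: poll less often by burning a little time between attempts. */
void spin_wait_lock(volatile uint64_t *lock)
{
    while (!spin_try_lock(lock)) {
        /* Crude delay loop; `volatile` keeps the compiler from optimizing it away. */
        for (volatile int i = 500; i > 0; i--)
            ;
    }
}
```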
Once you've implemented `spin_wait_lock()`, enable
`spin_wait_lock_correctness_nograph` and `spin_wait_lock` in `test_on` in
`tests.c`, `make` again, `run` as above until you've passed the correctness
tests, and then `bench` again as above and take a look at the generated graphs.
There should be a `spin_wait_lock` line now in addition to the lines from
previous graphs.
### Questions
4. What result did you expect? Are these graphs consistent with your expectation?
5. Does `spin_wait_lock()` result in more or less communication than
`spin_lock()`? Do the results shed any light on the relative merits of more
and less communication?
6. Communication always comes at a cost, so we need to think carefully about how
much bus traffic our implementations generate. How can we minimize communication in
our spinlock implementations? Is there more than one way?
7. If we have just observed `*lock == LOCKED`, how likely is it that
`spin_try_lock()` will succeed (returning `1`)? Do you think that
`spin_try_lock(lock)`, which can trigger an update, requires the same
amount of communication as just loading the value of `my_spinlock`? Why? Do
you think that these operations require unidirectional or bidirectional
communication, and why?
### `spin_read_lock()`
Let's explore another way to minimize communication. Instead of introducing a
delay while polling with `spin_try_lock()`, let's poll by *loading* `*lock` in
a tight loop until we observe `*lock == UNLOCKED` before each attempt at
`spin_try_lock()`. Implement this approach in `void spin_read_lock(volatile
uint64_t *lock)` in `worker.c`. We'd like to compare the effect of delaying
with the effect of polling with loads instead of `lockcmpxchgq()`, so don't
include the delay loop from `spin_wait_lock()` in this implementation.
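A sketch of this approach, again assuming your `spin_try_lock()` and the load macro described earlier:
```c
/* Sketch: wait with plain loads, and only attempt the atomic
 * compare-and-exchange once the lock looks free. */
void spin_read_lock(volatile uint64_t *lock)
{
    do {
        while (CMM_LOAD_SHARED(*lock) != UNLOCKED)
            ;  /* poll with loads only */
    } while (!spin_try_lock(lock));   /* someone may still beat us to it */
}
```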
Once you've implemented `spin_read_lock()`, enable
`spin_read_lock_correctness_nograph` and `spin_read_lock` in `test_on` in
`tests.c`, `make` again, `run` as above until you've passed the correctness
tests, and then `bench` again as above and take a look at the generated graphs.
There should be a `spin_read_lock` line now in addition to the lines from
previous graphs.
### Questions
8. What result did you expect? Are these graphs consistent with your expectation?
9. Does `spin_try_lock(lock)` always return `1` after we observe `*lock ==
UNLOCKED`? Why or why not? If not, what does your implementation do in
this situation?
10. Do the results suggest that one way of minimizing communication is better
than the other?
### (Optional) `spin_experimental_lock()`
Finally, we offer a chance for intrepid souls to explore combining the
techniques we've tried above, and anything else you can imagine or read about
in the realm of spinlock implementation. How close can you get to `pthread_spin_lock`?
How is `pthread_spin_lock` [implemented](https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_spin_lock.c;h=58c794b5da9aaeac9f0082ea6fdd6d10cabc2622;hb=HEAD) anyway?
You may find the `xchgq()` primitive implemented in `util.h` helpful in this exploration.
Play and trial and error are the substance of experiential learning, so go crazy.
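If you do reach for `xchgq()`, here's a semantics sketch of what an atomic exchange typically does; check `util.h` for the actual signature, and note that this C version, unlike the real primitive, is not atomic.
```c
#include <stdint.h>

/* Semantics sketch only -- the real xchgq() is a single atomic x86_64
 * xchg instruction; this C version is not atomic. */
uint64_t xchgq_semantics(volatile uint64_t *ptr, uint64_t val)
{
    uint64_t observed = *ptr;   /* value currently in memory */
    *ptr = val;                 /* unconditionally install the new value */
    return observed;            /* what was there before the exchange */
}
```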
In keeping with previous steps, implement `spin_experimental_lock()` in `worker.c`,
and turn on `spin_experimental_lock_correctness_nograph` and `spin_experimental_lock` in
`test_on` in `tests.c`, then `make`, `run`, `bench` and repeat as usual.