Assignment 2: FIFO Spinlocks Solution

Overview




In this assignment, we're exploring the design space for spinlocks that provide

a fairness property (in addition to solving the critical section problem):

If thread A's acquisition attempt takes effect before thread B's acquisition attempt,

thread A will get the lock before thread B. In all the implementations we'll explore,

this is achieved by tying the order in which threads acquire the lock to the coherence

order for a word in memory.




The steps in this assignment build on one another only conceptually,

so you can complete the implementations in any order to get things working (just

compare all of them once you're done). There are numbered questions

throughout the assignment; we recommend thinking through these questions

carefully and writing down short concrete answers before moving on to the next

section. If you're confused by a question or don't know where to start, you're

free to discuss or puzzle through them with other students, and Ted and Jon are

also available to help. The point of these assignments is to develop your

understanding of the material and give you some hands-on intuition for

concurrent programming, and the questions are there to help you understand

what's going on. If they don't, we'd really like to know, so we can write

some better ones :)




Finally, this assignment is brand new, so there may be gaps in the

documentation and bugs in the code. Let us know about any issues you find and

we'll resolve them as quickly as possible.




If you have questions and are comfortable sharing them, send them to the

[mailing list](mailto:cs510walpole@cecs.pdx.edu). If you would like Ted to

privately answer questions or take a look at your code, you can get in touch

with him via [email](mailto:theod@pdx.edu). He will also make an effort to

lurk on IRC in `#cs510walpole` on irc.cat.pdx.edu (see below for details).




Submission instructions




```bash

cd CS510-Advanced-Topics-in-Concurrency

pushd Assignment_2

pushd data

# remove all data and graphs BESIDES the final ones you'd like me to look at

rm <out-of-date data and graphs>

popd

make clean

popd

# where NAME is your name separated by underscores

tar czf Assignment_2-NAME.tar.gz Assignment_2/

```

 

Email the archive to [theod@pdx.edu](mailto:theod@pdx.edu) with subject

`Assignment_2-NAME` and any questions and comments you'd like to share.

Include your answers to the numbered questions in the assignment in the text of

the email.




Why FIFO?




If it's possible for thread A to acquire a lock many times while thread B

continues to wait to acquire the lock, then thread A may end up getting its

work done much sooner, at the expense of thread B's progress. This lack of

fairness can in extreme cases lead to complete *starvation*, e.g. thread B

continually failing to make any progress. People who write concurrent programs

observe problems resulting from unfairness in practice. For example,

Nick Piggin (a Linux contributor) cites some anecdotal evidence of this

[here](https://lwn.net/Articles/267968/):




> On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticable,
> with a userspace test having a difference of up to 2x runtime per thread, and
> some threads are starved or "unfairly" granted the lock up to 1 000 000 (!)
> times.




All of the spinlocks we've implemented so far make no guarantees about

fairness, since they're all based on some mechanism where threads race on some

atomic instruction, and the "winner" acquires the lock while other threads

retry. If we use a "talking stick" as an analogy for a lock, this is like the

holder throwing the talking stick on the ground once they've finished talking,

and letting everybody who wants to talk dive for it and fight over it. Once

the dust settles, one winner (and nothing else) is determined, so there's

nothing preventing some people from winning over and over again while others

lose over and over again. In this assignment, we'll explore a series of

approaches roughly analogous to folks forming a line behind the talking stick

holder, who then (once finished talking) passes the talking stick to the next

person in line.




As you work through the steps, think about what fairness buys us, and when

it is and isn't worthwhile.




Primitives




To complete some of the following implementations, you'll need a new primitive

that atomically adds to a word in memory and returns its previous value. We've

provided `x86_64`'s `lock; xaddq` instruction to serve this purpose; the `x` is

for "exchange" (as in `xchgq`). This primitive is exposed as a macro

`lockxaddq(ptr,val)`, which atomically adds `val` to `*ptr` and returns the

pre-add value of `*ptr`. Note that unlike the actual `lock; xaddq` instruction, this

macro does not store the old value of `*ptr` in `val`, so you need to catch the

return value.
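As a concrete illustration, here's a sketch of those fetch-and-add semantics. A GCC `__atomic` builtin stands in for the real `lockxaddq()` macro (which lives in `util.h` and is implemented with `lock; xaddq`); the name `fetch_and_add` is ours, not the harness's:

```c
#include <stdint.h>

/* Stand-in for lockxaddq(ptr, val): atomically add val to *ptr and
 * return the PRE-add value.  The harness implements this with the
 * lock; xaddq instruction; here a GCC builtin models the semantics. */
static inline uint64_t fetch_and_add(uint64_t *ptr, uint64_t val)
{
    return __atomic_fetch_add(ptr, val, __ATOMIC_SEQ_CST);
}
```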




All previously introduced primitives remain available (and will come in handy),

and are implemented and documented in `util.h`.




We'll need to use the `old = xchg(ptr, new)` primitive, defined in `util.h`.

This primitive is an atomic unconditional swap, that is, it atomically swaps

the current value in `*ptr` (returned as `old`) for `new`.
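The swap semantics can be modeled the same way (again, a builtin stands in for the harness's `xchg()` macro, and `swap_u64` is our name for the sketch):

```c
#include <stdint.h>

/* Stand-in for old = xchg(ptr, new): atomically install newval in *ptr
 * and return the value that was there before. */
static inline uint64_t swap_u64(uint64_t *ptr, uint64_t newval)
{
    return __atomic_exchange_n(ptr, newval, __ATOMIC_SEQ_CST);
}
```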




To keep the compiler from rearranging our code, e.g. during optimizations, we

need to use some other macros: If another thread might be loading from or

storing to `x`, write `_CMM_STORE_SHARED(x,v);` instead of `x = v;`, and

`CMM_LOAD_SHARED(x)` instead of `x` whenever we load `x`. For example, if we

were looping on the value of `x`, we would write `while(CMM_LOAD_SHARED(x)) {

... }` instead of `while(x) { ... }`, and if we were loading `x` into a local

variable `y`, we would write `y = CMM_LOAD_SHARED(x)`.




Note that storing an aligned `uint64_t` with `_CMM_STORE_SHARED()` is atomic on

x86_64. It doesn't prevent accesses that would normally be reordered across a

store (only subsequent loads on x86_64) from being reordered across it at

runtime, but if you don't care about that, you can safely use this macro to

unconditionally and atomically update a word that other threads may access.
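If you want to experiment with these macros outside the harness, volatile accesses approximate what they do. These stand-ins are illustrations only; the real definitions in `util.h` come from liburcu:

```c
#include <stdint.h>

/* Rough stand-ins for the util.h macros: force the compiler to emit
 * exactly one load or store and never cache the value in a register. */
#define CMM_LOAD_SHARED(x)      (*(volatile __typeof__(x) *)&(x))
#define _CMM_STORE_SHARED(x, v) do { CMM_LOAD_SHARED(x) = (v); } while (0)

static uint64_t shared_word;

static void producer(void)
{
    _CMM_STORE_SHARED(shared_word, 1);
}

static uint64_t consumer(void)
{
    /* a spin loop would be: while (!CMM_LOAD_SHARED(shared_word)) ; */
    return CMM_LOAD_SHARED(shared_word);
}
```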




Building FIFO Spinlocks




`ticket_lock()` and `ticket_unlock()`




You can find an overview of ticket locks in

[this Linux Weekly News article](https://lwn.net/Articles/267968/)

and a description and example implementation in Section 2.2 and Figure 2, respectively, of

[Mellor-Crummey and Scott 1991](https://cis.temple.edu/~qzeng/cis5512-fall2015/papers/p21-mellor-crummey.pdf).




Basically, a ticket lock is like when you take a number at the DMV, except

there's only one desk, and it's your job to update the number on display when

you've met with the employee at the desk and finished your business. It

consists of two numbers that can be separately updated atomically:

* `next`, which is the next number that will be handed out to somebody in line

* `owner`, which is the number of the ticket that belongs to the current lock

holder.




Head over to `// define a type for ticket` in `tests.h`, and familiarize

yourself with the type `ticket_state` and the global declaration of

`my_ticket_lock`. As you implement the operations below, consider whether

`next` and `owner` should be on the same cache line or not.




Next, head over to `ticket_lock()` in `worker.c`. To acquire the lock, you

atomically increment `next`, grabbing its previous value as your number, and

then spin until another thread updates `owner` to match your number (if there's

nobody ahead of you in line, this will already be the case). Now it's your

turn to execute your critical section!




Finally, head over to `ticket_unlock()` in `worker.c`. To release the lock, you

simply increment `owner`, to let your successor know that it's their turn.

Note that only the lock holder updates `owner`, so no other thread will race

you here.
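Putting both halves together, one possible sketch looks like this. We use `__atomic` builtins where the harness would use `lockxaddq()` and the `CMM_*` macros, and leave the field-layout question (same cache line or not?) as the tuning exercise above:

```c
#include <stdint.h>

typedef struct {
    uint64_t next;   /* next ticket number to hand out */
    uint64_t owner;  /* ticket number of the current holder */
} ticket_state;

static void ticket_lock(ticket_state *lk)
{
    /* take a number: the pre-increment value of next is our ticket */
    uint64_t me = __atomic_fetch_add(&lk->next, 1, __ATOMIC_ACQUIRE);
    /* spin until the display shows our number */
    while (__atomic_load_n(&lk->owner, __ATOMIC_ACQUIRE) != me)
        ;
}

static void ticket_unlock(ticket_state *lk)
{
    /* only the holder writes owner, so no atomic RMW is needed */
    __atomic_store_n(&lk->owner, lk->owner + 1, __ATOMIC_RELEASE);
}
```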




Testing your `ticket_lock()` and `ticket_unlock()` implementations




As usual, turn on `ticket_correctness_test` to test this implementation

and `ticket_lock()` to benchmark it.




Build and run the test harness from a shell:

```bash

make

./run 24 10000 1

```

Note that the correctness test for these implementations (and the others in

this assignment) is only partial: it doesn't test fairness in any way, just

whether or not the implementation solves the critical section problem. The

benchmarks may shed some light on fairness, since the times indicate how long

it takes _all threads_ to finish their work, so if some threads have been

starved and take longer than others to finish, the overall time goes up.

However, there isn't any way to distinguish between time wasted because of

unfairness and time wasted by e.g. atomic instruction overhead.




Once you're passing this correctness test, it may be helpful to compare the

performance of this implementation to some of your unfair spinlocks from the

last assignment. In particular, `spin_lock()` and `spin_read_lock()` (or

`spin_experimental_lock()`) (just paste them into `worker.c` from your previous

submission) establish helpful reference points, and `pthread_spin_lock()` is

still a good baseline.




We've observed slightly smoother data with a smaller set of operations and

multiple runs, like so




```bash

./bench 24 3000 80

```

Questions




1. How does `ticket_lock()` compare to its unfair counterparts?

2. In particular, why is there a gap between `spin_lock()` and `ticket_lock()`?

3. What memory location(s) are threads contending to update in this implementation?

4. What memory location(s) are threads spinning on in this implementation?

5. What communication (to whom) is triggered when `ticket_unlock()` writes to `owner`?




`abql_(no)sharing_lock()` and `abql_(no)sharing_unlock()`




Let's explore a queueing lock that works roughly the same way as `ticket_lock()`

on the acquisition side, but spreads out spinning and successor notifications

to per-thread words in memory on the release side, rather than the single word

`owner` used by all waiting threads in `ticket_lock()`. You've read about this

Array-Based Queueing Lock (ABQL) in

[Anderson 1990](http://www.cs.pdx.edu/~walpole/class/cs510/papers/02.pdf).

See Table V for Anderson's ABQL implementation. NOTE: Anderson's

`ReadAndIncrement` corresponds to our `lockxaddq`.




Preparing data structures




Head over to `// define a type for abql_sharing` in `tests.h` and familiarize

yourself with the `flag_sharing` type and the global declaration of `flags_sharing`.




Next, head over to `// TODO declare and initialize data for abql_sharing` in

`tests.c` and declare and initialize an array of `n_threads` `flag_sharing`s,

where the first element has `val` set to `HAS_LOCK` and each subsequent element

has `val` set to `MUST_WAIT`. Point `flags_sharing` to this array. Now

`flags_sharing` will be supplied as `flags` to each call to

`abql_sharing_(un)lock()`. Additionally, note that `queue_last_sharing` is

declared in `tests.h`, initialized to `0` in `tests.c`, and will be supplied as

`queue_last` to each call.




`abql_sharing_lock()`




Head over to `abql_sharing_lock()` in `worker.c`. Instead of atomically

incrementing `next` and then spinning on `owner` (along with all other threads

waiting for the lock) as in `ticket_lock()`, this lock acquisition operation will

atomically increment `queue_last`, and use the value handed back from the

atomic increment (the previous value of `queue_last`, uniquely returned to this

thread) as an index into `flags`, indicating which element in the array this

thread should spin on. This way, each thread in the queue ends up spinning on

a different `flags[i].val`, so updates to e.g. your predecessor's

`flags[j].val` won't require your cached copy of `flags[i].val` to be

invalidated (which takes time). Right? When you index into `flags`, though,

note that `queue_last` just keeps growing as this algorithm proceeds,

while there are only `n_threads` elements in `flags`, so you'll need to wrap

around.




Store this index in `*my_place`, because we'll need it to persist from this

call to `abql_sharing_lock()` to the corresponding call to

`abql_sharing_unlock()`. Next, wait for your flag to no longer be `MUST_WAIT`.

When this happens, it's because your predecessor in the queue is letting you

know that it's your turn, so you can just set the flag back to `MUST_WAIT` for

the next thread who ends up here, and proceed on your merry way.




`abql_sharing_unlock()`




Head over to `abql_sharing_unlock()`. We still have `*my_place` from the call

to `abql_sharing_lock()`, so we know where our flag is. The thread who entered

the queue after us got back the next value of `queue_last` as its... place, and

that is just `*my_place + 1` (we're atomically _incrementing_), so we know

where our successor's flag is, too. All that remains, then, is to tap our

successor on the shoulder and let them know it's their turn by writing

`HAS_LOCK` to their flag.
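Here's one way the pair could look, sketched with `__atomic` builtins, a fixed hypothetical `N_THREADS`, and file-scope variables standing in for the `flags` and `queue_last` arguments the harness passes in:

```c
#include <stdint.h>

#define MUST_WAIT 0
#define HAS_LOCK  1
#define N_THREADS 4   /* hypothetical; the harness uses n_threads */

typedef struct { uint64_t val; } flag_sharing;

/* flags[0].val starts as HAS_LOCK, the rest as MUST_WAIT, as in tests.c */
static flag_sharing flags[N_THREADS] = { { HAS_LOCK } };
static uint64_t queue_last = 0;

static void abql_sharing_lock(uint64_t *my_place)
{
    /* take the next slot; wrap, since queue_last grows without bound */
    *my_place = __atomic_fetch_add(&queue_last, 1, __ATOMIC_ACQUIRE)
                % N_THREADS;
    /* spin on our own flag, not on a word shared by all waiters */
    while (__atomic_load_n(&flags[*my_place].val, __ATOMIC_ACQUIRE)
           == MUST_WAIT)
        ;
    /* reset for the next thread that lands on this slot */
    flags[*my_place].val = MUST_WAIT;
}

static void abql_sharing_unlock(uint64_t *my_place)
{
    /* our successor got queue_last = *my_place + 1, wrapped */
    __atomic_store_n(&flags[(*my_place + 1) % N_THREADS].val,
                     HAS_LOCK, __ATOMIC_RELEASE);
}
```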




Checking for (partial) correctness




Before implementing the `nosharing` versions, it will simplify things to check

`abql_sharing_(un)lock()` by turning on `abql_sharing_correctness_nograph` in

`tests.c` and verifying that your implementations correctly solve the critical

section problem.




`abql_nosharing_(un)lock()`




Head over to `// TODO define a type for abql_nosharing` in `tests.h` and

define a type `flag_nosharing` that still contains a single `uint64_t` `val`,

but also contains enough padding to occupy an entire 64-byte cache line (that
is, 8 `uint64_t` words in total).
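A minimal padded type might look like this, assuming the 64-byte line size stated above:

```c
#include <stdint.h>

/* One uint64_t of payload plus seven of padding: 64 bytes total,
 * so adjacent array elements never share a cache line. */
typedef struct {
    uint64_t val;
    uint64_t pad[7];
} flag_nosharing;
```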




Next, head over to `// TODO declare and initialize data for abql_nosharing` in

`tests.c` and declare and initialize an array of `n_threads` `flag_nosharing`s

(which is itself aligned on a cache line boundary), where the elements are

initialized as before. Point `flags_nosharing` to this array.




`nosharing` versions of all our familiar variables will be passed to

`abql_nosharing_(un)lock()` in the same way, which is convenient, because it

means we can just copy the text of the `sharing` version of each operation and

replace `sharing` with `nosharing`.




Testing your implementations




Turn on `abql_nosharing_correctness_nograph`, `abql_sharing_lock()`,

`abql_nosharing_lock()` (and leave `ticket_lock()` on) and `bench` as before.




Questions

6. How does `abql_sharing_lock()` compare to `abql_nosharing_lock()`? Why?

7. What memory location(s) are threads contending to update in this implementation?

8. What memory location(s) are threads spinning on in this implementation?

9. What communication (to whom) is triggered when `abql_nosharing_unlock()`

writes to `flags[successor_index].val`? How does this compare

to the communication triggered by `ticket_unlock()`?

10. How does `abql_nosharing_lock()` compare to `ticket_lock()` and why?




(Optional) `mcs_(no)sharing_lock()` and `mcs_(no)sharing_unlock()`




Another queueing lock is the MCS lock, described accessibly [on LWN](https://lwn.net/Articles/590243/) and originally introduced in

[Mellor-Crummey and Scott 1991](https://cis.temple.edu/~qzeng/cis5512-fall2015/papers/p21-mellor-crummey.pdf)

(see Figures 5 and 7). NOTE: This paper uses alternate names for our

primitives. `fetch_and_store` corresponds to `xchgq`, while `compare_and_swap`

corresponds to `lockcmpxchgq`.




The MCS lock, named after its inventors, is basically the linked list to ABQL's

array. To hand off the lock, the lock holder still notifies its successor that

the successor can proceed, much like how the lock holder updates its

successor's `flags_(no)sharing` value in `abql_(no)sharing_unlock()`. Instead

of an array of flags, we have a local pair of values `next` and `locked` for

each thread (`local_lock`), plus another global pair that new threads use to

join the queue (`global_lock`).




If `global_lock->next` is null, there is no queue (and so no tail of the
queue), and the lock is not held. This is initially the case (otherwise,
nobody could initially acquire it); see the initializations of
`mcs_global_(no)sharing` in `tests.c`. Otherwise, `global_lock->next` points to
the tail of the queue of local locks, each corresponding to a thread waiting in
line to acquire the lock (besides the head, which corresponds to the lock
holder).




The holder's local lock is at the head of the queue, with its
`local_lock->next` pointing to the holder's successor's lock, starting a chain
that ends at the queue tail. If `local_lock->next == local_lock` or
`local_lock->next` is null, then the holder has no successor, and is alone in
the queue (at least until one comes along, which could happen at any time).




If `CMM_LOAD_SHARED(local_lock[i]->locked) == LOCKED`, then thread `i` is
waiting for its predecessor to update `local_lock[i]->locked` to `UNLOCKED`.
This is how the predecessor hands off the lock to thread `i`. Otherwise,
`local_lock[i]->locked` is always `UNLOCKED`.




Preparing data structures




To implement this, it will help to define a type for these locks. For now, we

won't worry about false sharing, and just use the definition for the type

`mcs_sharing` in `tests.h` where there it says `// define a type for

mcs_sharing`. Giving its members type `uint64_t` will necessitate some casting

back and forth between `mcs_sharing*` and `uint64_t` in the implementation, but

makes it easier to work with our primitive operations, which expect

`uint64_t`s. Also note the global declarations of `mcs_global_sharing` and

`mcss_sharing` below.
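A plausible shape for the type, with `next` as a `uint64_t` holding a cast pointer as suggested above (the `LOCKED`/`UNLOCKED` constants are repeated here so the sketch stands alone):

```c
#include <stdint.h>

#define UNLOCKED 0
#define LOCKED   1

typedef struct {
    uint64_t next;    /* successor's mcs_sharing*, cast to uint64_t; 0 = none */
    uint64_t locked;  /* LOCKED while waiting for the predecessor's handoff */
} mcs_sharing;
```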




Next, head over to `// TODO declare and initialize data for mcs_sharing`

in `tests.c` and declare and initialize an array of `n_threads` `mcs_sharing`s,

each with `next` set to `0` and `locked` set to `UNLOCKED`. Point

`mcss_sharing` to this array. Now `&mcss_sharing[i]` will be supplied as

`local_lock` to each call to `mcs_sharing_(un)lock()`, where `i` is the thread

number (that is, the `i` field in `ti_data_in`) of the caller. `global_lock`

will contain another instance of `mcs_sharing`, initialized in the same way,

indicating that the lock is initially not held.




`mcs_sharing_lock()`




Now we're ready to implement the operations for this lock. Head over to the

definition of `mcs_sharing_lock()` in `worker.c`. To acquire the lock, clear

your `local_lock->next`, then `xchgq` `global_lock->next` for your `local_lock`

to install your lock as the new tail of the queue.




If it comes back that there was no old tail, you're the only thread in the

queue right now, and you're done. Otherwise, you need to put your lock after

the old tail in the queue. Since you'll be waiting for the old tail to finish,

mark `local_lock->locked` as `LOCKED`, and then let the thread corresponding to

the old tail know that you're its successor by updating the old tail's `next`

field to point to your `local_lock`. Now that your lock is linked into the

queue, all you need to do is wait for your predecessor to give you the go-ahead

by updating `local_lock->locked` to `UNLOCKED`. Once that happens, you've

successfully acquired the lock.
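One possible shape for the acquire path, with `__atomic` builtins standing in for `xchgq` and the `CMM_*` macros (type and constants repeated so the sketch is self-contained):

```c
#include <stdint.h>

#define UNLOCKED 0
#define LOCKED   1

typedef struct mcs_sharing {
    uint64_t next;    /* successor's lock, cast to uint64_t; 0 = none */
    uint64_t locked;
} mcs_sharing;

static void mcs_sharing_lock(mcs_sharing *global_lock, mcs_sharing *local_lock)
{
    local_lock->next = 0;
    /* atomically install ourselves as the new tail of the queue */
    uint64_t old_tail = __atomic_exchange_n(&global_lock->next,
                                            (uint64_t)(uintptr_t)local_lock,
                                            __ATOMIC_ACQ_REL);
    if (old_tail == 0)
        return;               /* queue was empty: lock acquired */
    /* mark ourselves waiting BEFORE the old tail can see us... */
    local_lock->locked = LOCKED;
    /* ...then link in behind the old tail */
    __atomic_store_n(&((mcs_sharing *)(uintptr_t)old_tail)->next,
                     (uint64_t)(uintptr_t)local_lock, __ATOMIC_RELEASE);
    /* wait for our predecessor's handoff */
    while (__atomic_load_n(&local_lock->locked, __ATOMIC_ACQUIRE) == LOCKED)
        ;
}
```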




`mcs_sharing_unlock()`




Head over to `mcs_sharing_unlock()` in `worker.c`. To release the lock, first

check whether you have a successor (that is, whether `local_lock->next` is

non-null).




If you have a successor, all you need to do is update your successor's lock's

`locked` field to `UNLOCKED`. Now you've handed the lock off to your

successor, and are done.




If you don't have a successor, things are a little more complicated. You've
observed no successor, so `global_lock->next` _should_ already be pointing to
your `local_lock` since there is nobody else in the queue, but you need to
allow for the possibility that other threads snuck in and added themselves to
the queue after you observed `local_lock->next` to be null. You can use
`lockcmpxchgq()` here to update `global_lock->next` to `0` if it still holds
`local_lock` (nobody snuck in). In this case, you've removed yourself as the
last remaining queue member, the lock is no longer held, and you're done. If
`global_lock->next` now points to a different lock (that is, `lockcmpxchgq()`
gave back a value `!= local_lock`), then some other thread has concurrently
added itself to the queue, and thinks you're its predecessor. This other
thread will eventually update your `local_lock->next` to point to its own lock
(i.e. get behind you in line), so you need to wait for it to do that (to find
out who the successor is), then update the successor's lock's `locked` field to
`UNLOCKED`. Now you've handed the lock off to your successor, and are done.
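A matching sketch of the release path (again self-contained, with `__atomic_compare_exchange_n` standing in for `lockcmpxchgq()`):

```c
#include <stdint.h>

#define UNLOCKED 0
#define LOCKED   1

typedef struct mcs_sharing {
    uint64_t next;    /* successor's lock, cast to uint64_t; 0 = none */
    uint64_t locked;
} mcs_sharing;

static void mcs_sharing_unlock(mcs_sharing *global_lock,
                               mcs_sharing *local_lock)
{
    if (__atomic_load_n(&local_lock->next, __ATOMIC_ACQUIRE) == 0) {
        /* no successor observed: try to swing the tail back to empty */
        uint64_t expected = (uint64_t)(uintptr_t)local_lock;
        if (__atomic_compare_exchange_n(&global_lock->next, &expected, 0,
                                        0, __ATOMIC_ACQ_REL,
                                        __ATOMIC_ACQUIRE))
            return;           /* nobody snuck in: lock released */
        /* somebody joined after our check; wait for them to link in */
        while (__atomic_load_n(&local_lock->next, __ATOMIC_ACQUIRE) == 0)
            ;
    }
    /* hand the lock to our successor */
    mcs_sharing *succ = (mcs_sharing *)(uintptr_t)local_lock->next;
    __atomic_store_n(&succ->locked, UNLOCKED, __ATOMIC_RELEASE);
}
```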




`mcs_nosharing_(un)lock()`




Let's see what the impact of false sharing is on MCS performance. Fill in

`// TODO define a type for mcs_nosharing` in `tests.h` with a version that is

padded out to fill a cache line (64 bytes, or 8 `uint64_t`s).




Fill in `// TODO declare and initialize data for mcs_nosharing` in

`tests.c` with code that initializes and assigns to `mcss_nosharing` as before,

but guarantees that the array `mcss_nosharing` is aligned on cache line

boundaries (you can always declare a variable with

`__attribute__((aligned (64)))`, for instance).
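For instance, with a hypothetical fixed `N_THREADS` (the harness sizes the array from `n_threads` at runtime):

```c
#include <stdint.h>

#define N_THREADS 8   /* hypothetical; the harness uses n_threads */

typedef struct {
    uint64_t next;
    uint64_t locked;
    uint64_t pad[6];  /* pad the 16 bytes of payload out to 64 */
} mcs_nosharing;

/* align the array itself so element 0 starts on a line boundary,
 * which (with the padded type) keeps every element on its own line */
static mcs_nosharing mcss_nosharing_storage[N_THREADS]
    __attribute__((aligned(64)));
```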




Head over to `worker.c`, copy your implementations of `mcs_sharing_lock()` and

`mcs_sharing_unlock()` and replace `sharing` with `nosharing` within the copies

to obtain your `mcs_nosharing_(un)lock()` implementations. Wouldn't

language-level support for specializing polymorphic code to multiple concrete

types come in handy about now?




Alternative `mcs_(no)sharing_unlock()`




BONUS: Mellor-Crummey and Scott propose an alternative `unlock` implementation that uses

an unconditional exchange operation (such as `xchgq`) instead of

`lockcmpxchgq`, in Figure 7 of

[Mellor-Crummey and Scott 1991](https://cis.temple.edu/~qzeng/cis5512-fall2015/papers/p21-mellor-crummey.pdf)

and provide a detailed explanation (including an example) in the latter half of

Section 2.4. Implement this variant and compare your graphs.




Testing your implementations




Turn on `mcs_(no)sharing_correctness_nograph` and `mcs_(no)sharing_lock` and

rebuild to check and benchmark your implementations as usual.




Questions

11. How do the `sharing` and `nosharing` variants of MCS compare? Do they

differ as much as `abql_sharing_lock()` and `abql_nosharing_lock()`? Why?

12. How does MCS compare to ABQL? Does one consistently beat the other? If not,

when does MCS win? When does ABQL win? Why?

13. Compare Figures 15 through 17 with Figure 18 in

[Mellor-Crummey and Scott 1991](https://cis.temple.edu/~qzeng/cis5512-fall2015/papers/p21-mellor-crummey.pdf).

Do `mcs` and `anderson` (ABQL) compare in these figures the same way that

MCS and ABQL compare in your graphs?

14. Why do you think Mellor-Crummey and Scott included 3 graphs for tests with

"empty" critical sections and only 1 with a nonempty ("small") critical

section?

15. BONUS: How do the `lockcmpxchgq`-based and alternative `unlock`

implementations' performance compare? Why?

16. BONUS: Explain how the alternative `unlock` implementation breaks fairness.


