Saturday, June 15, 2013

Memory Ordering and Atomics in C++11

I recommend watching these two videos: atomic<> Weapons and Threads & Shared Variables. Below, you can read my summary notes from these sources and others.


Compiler optimizations, out-of-order execution in the processor, and the cache/memory hierarchy can all cause the sequence of memory operations a processor performs to differ from the order specified in the source code.

For example, below you can see an optimization a compiler might apply:
From this:                        To this:
for (p=q; q!=0; p=p->next)        r1 = count;
  if (p->data > 0) ++count;       for (p=q; q!=0; p=p->next)
                                      if (p->data > 0) ++r1;
                                  count = r1;

In the above example, the compiler loads the global variable 'count' into a register before the loop and writes the register back to 'count' after the loop. This transformation is not safe with more than one thread because it introduces a data race. Assuming one thread loops over data that is all '0' while the other thread has some positive data values, the first thread still executes 'count = r1' after its loop and can overwrite the increments made by the other thread with its stale value of r1.

Data race: Two threads access the same memory location (variable) simultaneously, and at least one of the accesses is a write.

In the example below, assume x and y are global variables initially set to '0'. One possible interleaved execution is x=1, y=1, r1=y, r2=x, with r1 and r2 both set to '1'. A different interleaved execution could leave one of the r values set to '1' and the other set to '0'. Those would be sequentially consistent executions. However, in reality both r1 and r2 could read '0' -- the initial values. This is because the result of a store instruction is not made immediately visible to the other processor. This is not the behaviour a programmer would expect, i.e. it is not sequentially consistent. The result is that we cannot predetermine which operations are actually executed, or the order in which any particular thread executes them.

Thread 1                Thread 2
x = 1;                  y = 1;
r1 = y;                 r2 = x;

Sequential consistency (SC): Executing the program you wrote. Defined as “the result of any execution is the same as if the reads and writes occurred in some order, and the operations of each individual processor appear in this sequence in the order specified by its program”.

The goal is a correctly synchronized program (no data races), written with the help of atomics and mutexes, that behaves as if:
  • memory operations are actually executed in an order that appears equivalent to some sequentially consistent inter-leaved execution of the memory operations of each thread in your source code;
  • including that each write appears to be atomic and globally visible simultaneously to all processors.
By correctly synchronizing your program (no data races), “the system” will provide the illusion of executing the program you wrote.

Mutex & Atomics:

Locks (mut_x is a mutex protecting x):
{ lock_guard<mutex> hold(mut_x); // enter critical region (lock “acquire”)
   ... read/write x ...
   x = 42;
} // exit critical region (lock “release”)

Ordered atomics (whose_turn is a std::atomic<> variable protecting x):
while( whose_turn != me ) { } // enter critical region (atomic read “acquires” value)
  ... read/write x ...
  x = 42;
whose_turn = someone_else; // exit critical region (atomic write “release”)

Note:
read atomic variable <=> enter critical section (acquire)
write atomic variable <=> exit critical section (release)

Declaring a variable as atomic, in addition to ensuring that the variable is accessed indivisibly, prevents both the compiler and the hardware from reordering memory accesses in ways that are visible to the program. As far as optimization is concerned, the compiler cannot move code out of a critical section, but it can move code into one. Atomics provide one-way barriers: an “acquire barrier” and a “release barrier.” A release store makes its prior accesses visible to a thread performing an acquire load that sees (pairs with) that store.
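
To make the one-way-barrier rule concrete, here is a small sketch (my own illustration, not from the talks above) annotated with which code motions are and are not allowed around a critical section:

#include <mutex>

int a, b, c;
std::mutex m;

void f()
{
  a = 1;                                   // may sink below the acquire, into the region
  {
    std::lock_guard<std::mutex> hold(m);   // acquire: nothing below may move above this
    b = 2;                                 // may not be hoisted out of the region
  }                                        // release: nothing above may move below this
  c = 3;                                   // may hoist above the release, into the region
}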

This explains why the following example has no data race. Although x and done are unrelated variables, the memory model specifies that the assert cannot fail. The store to 'x' happens-before the store to 'done' in thread 1, and the load of 'done' in thread 2 synchronizes-with the store of 'done' in thread 1. Therefore the load of 'x' in thread 2 sees the value stored by thread 1: thread 2 must see all operations that happened before the store to 'done' in thread 1, even unrelated ones. That also means the optimizer is not free to reorder the two stores in thread 1, since any thread that sees the store to 'done' must see the store to 'x' as well. The program behaves as if there were a sequentially consistent interleaving of the steps in which x = 42 is executed before r = x.

int x; // initially zero
atomic<bool> done(false); // initially false

Thread 1                            Thread 2
x = 42;                             while (!done) {}
done = true;                        r = x; // assert(x==42)
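
For completeness, here is a self-contained version of this example that compiles with C++11 (the main() wrapper and thread objects are my own):

#include <atomic>
#include <cassert>
#include <thread>

int x = 0;                      // initially zero
std::atomic<bool> done(false);  // initially false

int main()
{
  std::thread t1([] { x = 42; done = true; });
  std::thread t2([] { while (!done) {} int r = x; assert(r == 42); });
  t1.join();
  t2.join();
}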

Transitivity/causality: x and y are std::atomic, g is an ordinary variable, and all are initially zero. With the default sequentially consistent atomics it is impossible for the assertion to fail.

Thread 1                            Thread 2                      Thread 3
g = 1;                                 if( x == 1 )                   if( y == 1 )
x = 1;                                    y = 1;                         assert( g == 1 );
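
A runnable version of the transitivity example (the thread wrapping is mine):

#include <atomic>
#include <cassert>
#include <thread>

int g = 0;                     // ordinary variable
std::atomic<int> x(0), y(0);   // sequentially consistent by default

int main()
{
  std::thread t1([] { g = 1; x = 1; });
  std::thread t2([] { if (x == 1) y = 1; });
  std::thread t3([] { if (y == 1) assert(g == 1); });
  t1.join(); t2.join(); t3.join();
}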

Below is a single-producer/single-consumer queue. Using a dummy node to represent an empty queue is a common technique. The idea is that the tail pointer is used only by the producer thread, except for the empty-queue check in the consumer thread, which is also what synchronizes the two threads; the head pointer is used only by the consumer thread.

The invariant is that the tail pointer always points to the dummy node. The store to tail_ in the push thread makes the stores to memory locations 3 and 4 (value & next) visible to the loads of the same memory locations at 1 and 2. Since the tail_ store in push and the tail_ load in pop synchronize, 3 and 4 happen-before 1 and 2.
#include <atomic>

class spsc_queue
{
private:
  struct node
  {
    node* next_;
    int value_;
    node() : next_(0), value_(0) {}
    node(node* next, int value) : next_(next), value_(value) {}
  };

  std::atomic<node*> tail_; // tail of the queue (always the dummy node)
  std::atomic<node*> head_; // head of the queue

public:
  spsc_queue() : tail_(new node), head_(tail_.load()) {} // create the dummy node

  bool pop(int& v)
  {
    if (head_.load() == tail_.load()) {   // <-- acquire load of tail
      return false;  // queue is empty
    }
    node* const old_head = head_.load();
    head_.store(old_head->next_);         // 1
    v = old_head->value_;                 // 2
    delete old_head;
    return true;
  }

  void push(int new_value)
  {
    node* p = new node;                   // the new dummy node
    node* old_tail = tail_.load();        // only the producer writes tail_
    old_tail->value_ = new_value;         // 3
    old_tail->next_ = p;                  // 4
    tail_.store(p);                       // <-- release store to tail
  }
};
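
A minimal usage sketch (my own, assuming the spsc_queue class above) with one producer and one consumer thread:

#include <thread>

int main()
{
  spsc_queue q;
  std::thread producer([&q] {
    for (int i = 1; i <= 100; ++i) q.push(i);
  });
  std::thread consumer([&q] {
    int v = 0, received = 0;
    while (received < 100) {
      if (q.pop(v)) ++received;   // spin until an item is available
    }
  });
  producer.join();
  consumer.join();
}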

Atomics support various memory_order specifications. In the above examples, the default is used, i.e. memory_order_seq_cst. Since the queue is essentially thread-to-thread communication through a shared flag (tail_), memory_order_acquire (on the loads) and memory_order_release (on the stores) could have been used to provide the required visibility guarantees. This allows a relaxation of the synchronization required between independent reads of independent writes.
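
As a sketch (my own variant, assuming the spsc_queue class above), the tail_ handshake could be written with explicit memory orders; these are drop-in replacements for the members shown earlier:

bool pop(int& v)
{
  if (head_.load(std::memory_order_relaxed) ==
      tail_.load(std::memory_order_acquire)) {       // pairs with the release store in push
    return false;  // queue is empty
  }
  node* const old_head = head_.load(std::memory_order_relaxed);  // head_ is consumer-only
  head_.store(old_head->next_, std::memory_order_relaxed);
  v = old_head->value_;
  delete old_head;
  return true;
}

void push(int new_value)
{
  node* p = new node;
  node* old_tail = tail_.load(std::memory_order_relaxed);  // only the producer writes tail_
  old_tail->value_ = new_value;
  old_tail->next_ = p;
  tail_.store(p, std::memory_order_release);               // publishes the two stores above
}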

Assuming 'x' and 'y' are initially 0:
Thread 1 y.store (20, memory_order_release);
Thread 2 x.store (10, memory_order_release);
Thread 3 assert(y.load(memory_order_acquire)==20 && x.load(memory_order_acquire)==0)
Thread 4 assert(y.load(memory_order_acquire)==0 && x.load(memory_order_acquire)==10)

Both of these asserts can pass since there is no ordering imposed between the stores in thread 1 and thread 2. If this example were written using the sequentially consistent model, i.e. the default (memory_order_seq_cst), then one of the stores would have to appear before the other in a single total order (although which one isn't determined until run-time), all threads would observe the stores in that same order, and if one assert passes the other must therefore fail.

Some other atomic operations:

T atomic<T>::exchange( T desired ) { 
    T oldval = this->value; this->value = desired; return oldval; }
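
For example (my own sketch, not from the sources above), exchange() is enough to build a simple spinlock:

#include <atomic>

class spinlock
{
  std::atomic<bool> locked_;
public:
  spinlock() : locked_(false) {}
  void lock()   { while (locked_.exchange(true)) { /* spin */ } }  // saw 'false': we own the lock
  void unlock() { locked_.store(false); }
};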

bool atomic<T>::compare_exchange_strong( T& expected, T desired ) {
    if( this->value == expected ) { this->value = desired; return true; }
    expected = this->value; return false;
}
Pronounced: “Am I the one who gets to change val from expected to desired?”

_weak vs. _strong compare exchange:
  • _weak allows spurious failures.
  • Prefer _weak when you’re going to write a CAS loop anyway.
  • Almost always want _strong when doing a single test (see the sketch below).
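
A sketch of the single-test case (my own example, names are hypothetical): electing exactly one winner among several threads.

#include <atomic>

std::atomic<int> owner(-1);  // -1 means "unclaimed"

// Returns true for exactly one caller: the one that changes owner from -1 to id.
bool try_claim(int id)
{
  int expected = -1;
  return owner.compare_exchange_strong(expected, id);
}
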
The following code, from Boost, is a multi-producer, single-consumer queue:
template<typename T>
class waitfree_queue {
public:
  struct node {
    T data;
    node* next;
  };
  void push(const T &data)
  {
    node* n = new node;
    n->data = data;
    node* stale_head = head_.load(std::memory_order_relaxed);
    do {
      n->next = stale_head;
    } while (
        !head_.compare_exchange_weak(
                stale_head, n, std::memory_order_release));
  }

  // pop in reverse order
  node* pop_all(void)
  {
    return head_.exchange(0, std::memory_order_acquire);
  }

  waitfree_queue() : head_(0) {}

private:
  std::atomic<node *> head_;
};
If, for a particular thread, 'stale_head' no longer equals 'head_' by the time compare_exchange_weak is called, then 'stale_head' is updated to the current 'head_' and the thread keeps looping until 'n' can be made the new head by the compare_exchange_weak call. Note the use of memory_order_acquire and memory_order_release.
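
A usage sketch (my own, assuming the waitfree_queue template above): the single consumer drains the list returned by pop_all, and the nodes come back newest-first because push prepends at the head.

#include <iostream>

int main()
{
  waitfree_queue<int> q;
  q.push(1);
  q.push(2);
  q.push(3);

  // Prints 3, 2, 1 -- reverse push order.
  for (waitfree_queue<int>::node* n = q.pop_all(); n != 0; ) {
    waitfree_queue<int>::node* next = n->next;
    std::cout << n->data << '\n';
    delete n;
    n = next;
  }
}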