Friday, December 27, 2013

Writing OpenCL Kernels in C++

Download: https://github.com/dimitrs/cpp-opencl/tree/first_blog_post

In this post I would like to present my C++ to OpenCL C source transformation tool. Source code that previously had to be written in the OpenCL C language can now be written in C++ instead (including C++11). It is even possible to execute parts of the C++ standard library on the GPU.

The tool compiles the C++ source into LLVM bitcode, and uses a modified version of the LLVM 'C' back-end to disassemble the bitcode into OpenCL 'C'. You can see how it is used by looking at the tests provided in the project.

This code using std::find_if and a lambda function is executed on the GPU.

#include <algorithm>

extern "C" void _Kernel_find_if(int* arg, int len, int* out) {
    int* it = std::find_if (arg, arg+len, [] (int i) { return ((i%2)==1); } );
    out[0] = *it;
}

This code using C++11's std::enable_if is executed on the GPU.

#include <type_traits>

template<class T>
T foo(T t, typename std::enable_if<std::is_integral<T>::value >::type* = 0)
{
    return 1;
}

template<class T>
T foo(T t, typename std::enable_if<std::is_floating_point<T>::value >::type* = 0)
{
    return 0;
}

extern "C" void _Kernel_enable_if_int_argument(int* arg0, int* out)
{
    out[0] = foo(arg0[0]);
}

Please note that this project is experimental; it would require a lot more work to become a production-quality tool. One limitation is that containers such as std::vector are not supported (without a stack-based allocator) because OpenCL does not support dynamic memory allocation. The tool does not support atomics, which rules out std::string. It also has no support for memcpy and memset, which means that some standard algorithms cannot be used; std::copy's underlying implementation, for example, sometimes uses memcpy.
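To give a rough idea of what the stack-based allocator mentioned above could look like, here is a minimal sketch of my own (it is not part of the tool, and a real version would also need bounds checks and comparison operators):

#include <cstddef>

// A minimal fixed-buffer ("stack-based") allocator sketch: all storage comes
// from a buffer embedded in the allocator object, so nothing is allocated
// dynamically. For brevity there is no overflow check and deallocate() is a
// no-op.
template <class T, std::size_t N>
class stack_allocator {
public:
    typedef T value_type;

    template <class U> struct rebind { typedef stack_allocator<U, N> other; };

    stack_allocator() : used_(0) {}

    T* allocate(std::size_t n) {
        T* p = reinterpret_cast<T*>(buffer_) + used_; // bump-pointer allocation
        used_ += n;
        return p;
    }

    void deallocate(T*, std::size_t) {} // storage dies with the allocator

private:
    alignas(T) char buffer_[N * sizeof(T)];
    std::size_t used_;
};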

Thursday, December 26, 2013

C++ Generic Programming: Implementing Runge-Kutta

Download: https://github.com/dimitrs/odes

Below, Listing 1 is an implementation of a generic algorithm: the Runge-Kutta method, which is well known for its use in approximating solutions of ordinary differential equations. Although the Runge-Kutta method is well known and relatively simple, the generic implementation below is difficult to read. I won't attempt to explain how it works; I wrote it several months ago and I don't remember all the details. I list it here to compare it with the non-generic implementation of the algorithm in Listing 2.

Listing 2 is far easier to read and understand. It is almost a carbon copy of the mathematical equations for the Runge-Kutta 4 method (see Listing 3). The non-generic implementation is easy to read (assuming you are familiar with the equations) and it was very easy to write. I was done in about 30 minutes, whereas the generic version took me a couple of days to complete. So why make the effort? The generic version allows you to use any container that supports forward iterators, e.g. std::array, std::vector, C arrays, boost::ublas vectors, or Blitz++. With the non-generic version you are tied to boost::ublas vectors.

As far as performance is concerned, the generic version using boost::ublas vectors is about 20% slower than the non-generic ublas version. However, if I substitute a std::array for the container, the generic algorithm is 40% faster.

Runge-Kutta 4 performance of the generic algorithm using different containers, compared to the non-generic ublas version:

std::vector: 25% faster
Blitz++:     about the same
ublas:       20% slower
std::array:  40% faster

#include <algorithm> // std::copy, std::transform
#include <iterator>  // std::distance, std::iterator_traits
#include <utility>   // std::make_pair
#include <vector>

template <class InputIterator, class ForwardIterator, class T, class Time=double>
std::pair<std::vector<Time>, ForwardIterator> __runge_kutta4(
    InputIterator inBegin, InputIterator inEnd, 
    Time x0, Time x1, int nr_steps, T f, ForwardIterator yout)
{
  typedef typename std::iterator_traits<InputIterator>::value_type 
      value_type;
  auto delta_t = static_cast<Time>((x1-x0)/nr_steps);
  auto nr_equations = std::distance(inBegin, inEnd);
  ForwardIterator youtP = yout;
  yout = std::copy(inBegin, inEnd, yout);
  std::vector<Time> ytime(nr_steps);

  for (auto ii=0; ii< nr_steps; ++ii) {
    typedef typename std::vector<value_type> vec;

    // K1 = f(x, u(x))
    vec v1(nr_equations);
    typename vec::iterator k1 = f(yout-nr_equations, x0, begin(v1));

    // K2 = f(x+deltaX/2, u(x)+K1*deltaX/2)
    vec v2(nr_equations);
    vec y2(nr_equations);
    std::transform(yout-nr_equations, yout, k1, begin(y2), 
      [delta_t](value_type& y,value_type& k) {
          return y+0.5*delta_t*k;
      });
    typename vec::iterator k2 = 
        f(begin(y2), 0.5*delta_t+x0, begin(v2));

    // K3 = f(x+deltaX/2, u(x)+K2*deltaX/2)
    vec v3(nr_equations);
    vec y3(nr_equations);
    std::transform(yout-nr_equations, yout, k2, begin(y3), 
      [delta_t](value_type& y,value_type& k){
          return y+0.5*delta_t*k;
      });
    typename vec::iterator k3 = 
        f(begin(y3), 0.5*delta_t+x0, begin(v3));

    // K4 = f(x+deltaX, u(x)+K3*deltaX)
    vec v4(nr_equations);
    vec y4(nr_equations);
    std::transform(yout-nr_equations, yout, k3, begin(y4), 
      [delta_t](value_type& y, value_type& k){ return y+delta_t*k; });
    typename vec::iterator k4
        = f(begin(y4), delta_t+x0, begin(v4));

    // u(x+deltaX) = u(x) + (1/6) (K1 + 2 K2 + 2 K3 + K4) * deltaX
    for (auto i=0; i<nr_equations; ++i,++yout) {
      *yout = *(yout-nr_equations) + 
        (1.0/6.0) * (k1[i] + 2.0*k2[i] + 2.0*k3[i] + k4[i]) * delta_t;
    }
    ytime[ii] = x0;
    x0 += delta_t;
  }

  return std::make_pair(ytime, yout);
}

Listing 1: Generic Implementation of Runge-Kutta 4
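For illustration, here is a hypothetical usage sketch of the generic version; the ODE, the Decay functor, and the step count are my own example and are not from the repository. It solves dy/dx = -y with y(0) = 1:

#include <vector>

// Hypothetical right-hand side functor for dy/dx = -y. The algorithm calls
// f(inBegin, x, outBegin) and expects the iterator past the last derivative.
struct Decay {
    std::vector<double>::iterator operator()(std::vector<double>::iterator in,
                                             double /*x*/,
                                             std::vector<double>::iterator out) const
    {
        *out = -(*in); // one equation: y' = -y
        return out + 1;
    }
};

void solve_decay() {
    int nr_steps = 100;
    std::vector<double> y0(1, 1.0);          // y(0) = 1
    std::vector<double> yout(nr_steps + 1);  // initial state plus one per step
    auto result = __runge_kutta4(y0.begin(), y0.end(),
                                 0.0, 1.0, nr_steps, Decay(), yout.begin());
    // yout.back() now approximates exp(-1).
    (void)result;
}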

#include <boost/numeric/ublas/vector.hpp>

template <class T, class Vector=boost::numeric::ublas::vector<double> >
Vector rk4(const Vector& init, double x0, double x1, int nr_steps, T f)
{
 Vector y = init;
 auto h = static_cast<double>((x1-x0)/nr_steps);
 double xn = x0;

 for (auto ii=0; ii< nr_steps; ++ii) {
   Vector k1 = f(y, xn);
   Vector k2 = f(y+0.5*h*k1, 0.5*h+xn);
   Vector k3 = f(y+0.5*h*k2, 0.5*h+xn);
   Vector k4 = f(y+h*k3, h+xn);

   xn += h;
   y = y + (1.0/6.0) * (k1 + 2.0*k2 + 2.0*k3 + k4) * h;
 }
 return y;
}

Listing 2: Non-Generic Implementation of Runge-Kutta 4 (using ublas)

for n = 0, 1, 2, 3, ..., using
 k1 = f(x_n, y_n)
 k2 = f(x_n + h/2, y_n + (h/2) k1)
 k3 = f(x_n + h/2, y_n + (h/2) k2)
 k4 = f(x_n + h, y_n + h k3)

 x_{n+1} = x_n + h
 y_{n+1} = y_n + (h/6)(k1 + 2 k2 + 2 k3 + k4)

Listing 3: Runge-Kutta 4 equations (written to match the code in Listing 2, where the k's do not include the factor h).

Saturday, June 15, 2013

Memory Ordering and Atomics in C++11

I recommend watching these two videos: atomic<> Weapons and Threads & Shared Variables. Below, you can read my summary notes from these sources and others.


Compiler optimizations, processor out-of-order execution, and caching can cause the sequence of operations that a processor executes to differ from the order specified by the source code.

For example, below you can see an optimization a compiler might apply.

From this:

for (p=q; q!=0; p=p->next)
  if (p->data > 0) ++count;

To this:

r1 = count;
for (p=q; q!=0; p=p->next)
  if (p->data > 0) ++r1;
count = r1;

In the above example, the compiler loads the global variable 'count' into a register before the loop and writes the register back to the global variable after the loop. This transformation is not safe with more than one thread because it introduces a data race: assuming one thread loops through data that is all '0' while another thread counts some positive data values, the final count can be overwritten by the thread holding '0' in r1.
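As a sketch of one possible fix (my own example, not from the talks), declaring the counter as std::atomic makes each increment an indivisible read-modify-write that the compiler may not cache in a register across the loop:

#include <atomic>

struct node { int data; node* next; };

std::atomic<int> count(0);

void count_positives(node* q) {
    for (node* p = q; p != 0; p = p->next)
        if (p->data > 0)
            ++count; // atomic fetch-and-add: no data race, no cached copy
}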

Data race: a memory location (variable) is accessed simultaneously by two threads, and at least one of the accesses is a write.

In the example below, assume x and y are global variables initially set to '0'. One possible interleaved execution is x=1, y=1, r1=y, r2=x, leaving both r1 and r2 set to '1'. A different interleaved execution could leave one of the r values set to '1' and the other set to '0'. Those would be sequentially consistent executions. However, in reality both r1 and r2 could read '0', the initial values, because the result of a store instruction is not made immediately visible to other processors. This is not the behaviour a programmer would expect; it is not sequentially consistent. The result is that we cannot predetermine which operations are actually executed, or in which order, by any particular thread.

Thread 1                Thread 2
x = 1;                  y = 1;
r1 = y;                 r2 = x;

Sequential consistency (SC): executing the program you wrote. Defined as “the result of any execution is the same as if the reads and writes occurred in some order, and the operations of each individual processor appear in this sequence in the order specified by its program”.

The goal is a correctly synchronized program (no data races), written with the help of atomics and mutexes, that behaves as if:
  • memory operations are actually executed in an order that appears equivalent to some sequentially consistent interleaved execution of the memory operations of each thread in your source code;
  • each write appears to be atomic and globally visible to all processors simultaneously.
By correctly synchronizing your program, “the system” provides the illusion of executing the program you wrote.

Mutex & Atomics:

Locks (mut_x is a mutex protecting x):
{ lock_guard<mutex> hold(mut_x); // enter critical region (lock “acquire”)
   ... read/write x ...
   x = 42;
} // exit critical region (lock “release”)

Ordered atomics (whose_turn is a std::atomic<> variable protecting x):
while( whose_turn != me ) { } // enter critical region (atomic read “acquires” value)
  ... read/write x ...
  x = 42;
whose_turn = someone_else; // exit critical region (atomic write “release”)

Note:
read atomic variable  <=> enter critical section (acquire)
write atomic variable <=> exit critical section (release)

Declaring a variable as atomic, in addition to ensuring that the variable is accessed indivisibly, prevents both the compiler and the hardware from reordering memory accesses in ways that are visible to the program. As far as optimization is concerned, the compiler cannot move code out of a critical section but can move code into one. Atomics provide one-way barriers: an “acquire barrier” and a “release barrier”. A release store makes its prior accesses visible to a thread performing an acquire load that sees (pairs with) that store.

This explains why the following example has no data race. Although x and done are unrelated variables, the memory model specifies that the assert cannot fail. The store to 'x' happens-before the store to 'done' in thread 1, and the load of 'done' in thread 2 synchronizes with the store of 'done' in thread 1. Therefore the load of 'x' in thread 2 gets the result of the store that happened in thread 1; it must see all operations that happened before the store in thread 1, even unrelated ones. That also means the optimizer is not free to reorder the two stores in thread 1, since thread 2 must see the store to 'done' as well. Every allowed interleaving of the steps executes x = 42 before r = x; the execution is sequentially consistent.

int x;                   // initially zero
atomic<bool> done;       // initially false

Thread 1                 Thread 2
x = 42;                  while (!done) {}
done = true;             r = x; // assert(r==42)
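For reference, here is the same example written out as a minimal compilable program (my sketch) using std::thread:

#include <atomic>
#include <cassert>
#include <thread>

int x = 0;                      // initially zero
std::atomic<bool> done(false);  // initially false

int main() {
    std::thread t1([] { x = 42; done = true; });  // release store to done
    std::thread t2([] {
        while (!done) {}  // acquire load: spin until the store is seen
        assert(x == 42);  // cannot fail: x = 42 happens-before done = true
    });
    t1.join();
    t2.join();
}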

Transitivity/causality: x and y are std::atomic, all variables initially zero. It is impossible for the assertion to fail in a sequentially consistent program.

Thread 1                 Thread 2                 Thread 3
g = 1;                   if( x == 1 )             if( y == 1 )
x = 1;                       y = 1;                   assert( g == 1 );

Below is a single-producer/single-consumer queue. Using a dummy node to represent the empty queue is a common technique. The idea is that only the producer thread updates the tail pointer; the consumer thread only reads it, to check whether the queue is empty, and that read is also what synchronizes the two threads. The head pointer is used only by the consumer thread.

The tail pointer always points to the dummy node. The store to tail in push makes the stores to memory locations 3 and 4 (value and next) visible to the loads of the same memory locations at 1 and 2. Since the tail store in push synchronizes with the tail load in pop, 3 and 4 happen-before 1 and 2.

#include <atomic>

class spsc_queue
{
private:
  struct node
  {
    node* next_;
    int value_;
    node() : next_(0), value_(0) {}
    node(node* next, int value) : next_(next), value_(value) {}
  };

  // head_ must be declared before tail_: members are initialized in
  // declaration order, and the constructor reads head_ to set tail_.
  std::atomic<node*> head_; // head of the queue
  std::atomic<node*> tail_; // tail of the queue

public:
  spsc_queue() : head_(new node), tail_(head_.load()) {} // create dummy node

  bool pop(int& v)
  {
    if (head_.load() == tail_.load()) { // acquire load of tail
      return false; // queue is empty
    }
    node* const old_head = head_.load();
    head_.store(old_head->next_); // 1
    v = old_head->value_;         // 2
    delete old_head;
    return true;
  }

  void push(int new_value)
  {
    node* p = new node;               // the new dummy node
    tail_.load()->value_ = new_value; // 3: fill in the current dummy
    tail_.load()->next_ = p;          // 4: link the new dummy
    tail_.store(p); // release store to tail
  }
};

Atomics support various memory_order specifications. In the above example the default, memory_order_seq_cst, is used. Since the queue is essentially thread-to-thread communication through a shared flag (tail_), memory_order_acquire (for the reads) and memory_order_release (for the writes) would have been enough to provide the required visibility guarantees. This relaxes the ordering otherwise imposed between independent reads of independent writes, as the following example shows.

Assuming 'x' and 'y' are initially 0:
Thread 1 y.store (20, memory_order_release);
Thread 2 x.store (10, memory_order_release);
Thread 3 assert(y.load(memory_order_acquire)==20 && x.load(memory_order_acquire)==0)
Thread 4 assert(y.load(memory_order_acquire)==0 && x.load(memory_order_acquire)==10)

Both of these asserts can pass, since no ordering is imposed between the stores in thread 1 and thread 2. If this example were written with the sequentially consistent model, i.e. the default (memory_order_seq_cst), then one of the stores would have to happen-before the other (although the order isn't determined until run-time), the values would be synchronized between threads, and if one assert passed the other would have to fail.
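Applied to the spsc_queue above, the seq_cst defaults could be relaxed like this (a sketch showing only the two lines that change):

// In pop(): the load of tail_ becomes an explicit acquire, pairing with
// the release store in push().
if (head_.load(std::memory_order_relaxed) ==
    tail_.load(std::memory_order_acquire)) {
  return false; // queue is empty
}

// In push(): the store to tail_ becomes an explicit release, publishing
// the preceding writes to value_ and next_.
tail_.store(p, std::memory_order_release);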

Some other atomic operations, shown as pseudocode (the real operations execute atomically):

T atomic<T>::exchange( T desired ) { 
    T oldval = this->value; this->value = desired; return oldval; }

bool atomic<T>::compare_exchange_strong( T& expected, T desired ) {
    if( this->value == expected ) { this->value = desired; return true; }
    expected = this->value; return false;
}
Pronounced: “Am I the one who gets to change val from expected to desired?”

_weak vs. _strong compare exchange:
  • _weak allows spurious failures.
  • Prefer _weak when you’re going to write a CAS loop anyway.
  • Almost always want _strong when doing a single test.   
The following code, adapted from the Boost.Atomic examples, is a multi-producer, single-consumer queue:
template<typename T>
class waitfree_queue {
public:
  struct node {
    T data;
    node* next;
  };
  void push(const T &data)
  {
    node* n = new node;
    n->data = data;
    node* stale_head = head_.load(std::memory_order_relaxed);
    do {
      n->next = stale_head;
    } while (
        !head_.compare_exchange_weak(
                stale_head, n, std::memory_order_release));
  }

  // pop in reverse order
  node* pop_all(void)
  {
    return head_.exchange(0, std::memory_order_acquire);
  }

  waitfree_queue() : head_(0) {}

private:
  std::atomic<node *> head_;
};
If, for a particular thread, 'stale_head' is no longer equal to 'head_' by the time compare_exchange_weak is called, then 'stale_head' is updated to the current head and the thread keeps looping until 'n' can be made the new head by the compare_exchange_weak call. Note the use of memory_order_acquire and memory_order_release.
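A hypothetical consumer for this queue might look as follows: pop_all() detaches the entire list with one atomic exchange, and since the nodes come back newest-first, the consumer walks (or first reverses) the chain before freeing each node. process() is a placeholder, not part of the queue:

void process(int value); // placeholder for whatever the consumer does

void drain(waitfree_queue<int>& q) {
    waitfree_queue<int>::node* n = q.pop_all();
    while (n) {
        process(n->data);
        waitfree_queue<int>::node* next = n->next;
        delete n;
        n = next;
    }
}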

Tuesday, May 14, 2013

Sorting Large Objects


As far as I know, the standard library's stable sort is implemented with a merge-sort algorithm, which has an O(n log(n)) worst-case running time, while std::sort is typically an introsort, a quick-sort variant with an O(n log(n)) average running time. There is an important detail to remember when sorting large objects: 'std::swap' is called a lot by both merge sort and quick sort. Below is a possible implementation of quick sort. Note the use of "std::swap" in the partition function.

#include <cstdlib>   // rand
#include <iterator>  // std::distance, std::advance
#include <utility>   // std::swap

template <class ForwardIterator, typename _Compare>
ForwardIterator __partition(
       ForwardIterator first, ForwardIterator last, _Compare __comp);

template <class ForwardIterator, typename _Compare>
void __quick_sort(ForwardIterator first, ForwardIterator last, _Compare __comp)
{
    if (std::distance(first, last) <= 1) return;

    ForwardIterator mid = __partition(first, last, __comp);
    ForwardIterator first1 = first;
    ForwardIterator last1 = mid;
    ForwardIterator first2 = mid;
    ForwardIterator last2 = last;
    std::advance(first2, 1);    

    __quick_sort(first1, last1, __comp);
    __quick_sort(first2, last2, __comp);
}

template <class ForwardIterator, typename _Compare>
ForwardIterator __partition(
       ForwardIterator first, ForwardIterator last, _Compare __comp)
{
    ForwardIterator pivot = first;  // randomly choose pivot
    auto idx = rand() % std::distance(first, last);
    std::advance(pivot, idx);
    std::swap(*pivot, *first);
    pivot = first;
    
    ForwardIterator i = first;
    ForwardIterator j = first;
    std::advance(i, 1);
    std::advance(j, 1);
    ForwardIterator i_prev = first; // pivot's final position if nothing is smaller
    for ( ;j!=last; ++j) {
        if (__comp(*j, *pivot)) {
            i_prev = i;
            std::swap(*j, *i);
            std::advance(i, 1);
        }
    }
    std::swap(*pivot, *i_prev);
    return i_prev;
}

Each call to 'std::swap' can potentially mean three copies. This can make sorting large objects stored by value an expensive task. See the std::swap function below.

template <typename T>
void swap( T& x, T& y )
{
    T tmp( std::move(x) ); // std::move casts to an rvalue so T's move constructor is used if it exists
    x = std::move(y);
    y = std::move(tmp);
}

The compiler will use the move constructor (or move assignment) if possible and fall back to the copy constructor otherwise; for example, if class T does not have a usable move constructor, the copy constructor is used. According to my measurements, copying large objects throughout the sort adds at least two orders of magnitude of overhead to the sorting algorithm. I compared sorting large objects by value against sorting large objects by pointer. Swapping pointers is done in constant time; no move or copy constructors are called. If you do not want to deal with raw pointers, you can use std::unique_ptr, which gives similar results to ordinary pointers (std::swap is overloaded for unique_ptr).

        by ptr       :     2155 us
        by unique_ptr:     2732 us
        by value     :   791678 us

Below, you can see the "large object" and the test. The large object is implemented in terms of a std::array, whose move is element-wise and therefore no cheaper than a copy, so the full array is copied each time std::swap is called.

#include <array>
#include <chrono>
#include <cstdlib>
#include <memory>
#include <vector>

struct LargeArrayObject {
    LargeArrayObject(int id) : id(id) {}
    int id;
    std::array<int,10000> array;
};

std::vector<LargeArrayObject> byval;
std::vector<LargeArrayObject*> byptr;
std::vector<std::unique_ptr<LargeArrayObject>> byuniqptr;

for (int i = 0; i < 10000; ++i) {
   int id = rand() % 100000;
   byval.push_back(LargeArrayObject(id));
   byptr.push_back(new LargeArrayObject(id));
   byuniqptr.push_back(std::unique_ptr<LargeArrayObject>(new LargeArrayObject(id)));
}

auto start = std::chrono::high_resolution_clock::now();
__quick_sort(byval.begin(), byval.end(),
   [](const LargeArrayObject& a, const LargeArrayObject& b) { return a.id < b.id; });
//__quick_sort(byptr.begin(), byptr.end(),
//   [](const LargeArrayObject* a, const LargeArrayObject* b) { return a->id < b->id; });
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = end - start;

An alternative to using pointers to speed up swap operations is to implement the large-object class in terms of dynamically allocated memory, effectively turning it into a small object. One could use a std::vector instead of the std::array:

struct LargeVectorObject {
    LargeVectorObject(int id) : id(id), array(10000, 0) {}
    int id;

    // Swapping two vectors exchanges their internal buffers in constant time.
    std::vector<int> array;
};
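To see why this helps, note that swapping two vectors merely exchanges their internal buffer pointers. A small sketch of my own to illustrate:

#include <cassert>
#include <utility>
#include <vector>

void demonstrate_constant_time_swap() {
    std::vector<int> a(10000, 1), b(10000, 2);
    const int* pa = a.data();

    std::swap(a, b);        // O(1): the two vectors exchange internal buffers

    assert(b.data() == pa); // a's old buffer now belongs to b; no elements copied
}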

Below, you can see the measurements I made using LargeVectorObject instead of LargeArrayObject:

        by ptr       :     2155 us
        by unique_ptr:     2732 us
        by value     :     1500 us