Friday, December 27, 2013

Writing OpenCL Kernels in C++

Download: https://github.com/dimitrs/cpp-opencl/tree/first_blog_post

In this post I would like to present my C++ to OpenCL C source transformation tool. Source code written in the OpenCL C language can now be written in C++ instead (even C++11). It is even possible to execute some of the C++ standard libraries on the GPU.

The tool compiles the C++ source into LLVM byte-code, and uses a modified version of the LLVM 'C' back-end to disassemble the byte-code into OpenCL 'C'. You can see how it is used by looking at the tests provided in the project.

This code using std::find_if and a lambda function is executed on the GPU.

#include <algorithm>

extern "C" void _Kernel_find_if(int* arg, int len, int* out) {
    int* it = std::find_if (arg, arg+len, [] (int i) { return ((i%2)==1); } );
    out[0] = *it;
}

This code using C++11's std::enable_if is executed on the GPU.

#include <type_traits>

template<class T>
T foo(T t, typename std::enable_if<std::is_integral<T>::value >::type* = 0)
{
    return 1;
}

template<class T>
T foo(T t, typename std::enable_if<std::is_floating_point<T>::value >::type* = 0)
{
    return 0;
}

extern "C" void _Kernel_enable_if_int_argument(int* arg0, int* out)
{
    out[0] = foo(arg0[0]);
}

Please not that this project is experimental. It would require a lot more work to make it a production quality tool. One limitation of the tool is that containers such as std::vector are not supported (without a stack-based allocator) because OpenCL does not support dynamically or statically allocated memory. The tool does not support atomics which disallows the use of std::string. It does not have support for memcpy and memset which means that some standard algorithms can not be used; std::copy's underlying implementation sometimes uses memcpy.

Thursday, December 26, 2013

C++ Generic Programming: Implementing Runga-kutta

Download: https://github.com/dimitrs/odes

Below, listing 1. is an implementation of a generic algorithm – the Runga-kutta method which is well known for its use in the approximation of solutions of ordinary differential equations. Although the Runga-kutta method is well known and relatively simple, the generic algorithm implementation below is difficult to read. I won't attempt to explain how it works; I wrote it several months ago and I don't remember all the details. The reason why I list it here is to be able to compare it with a non-generic implementation of the algorithm I wrote in listing 2. 

Listing 2. is far easier to read and understand. It is almost a carbon copy of the mathematical equations for the Runga-kutta4 method (see listing 3). The non-generic implementation is easy to read (assuming you are familiar with the equations) and it was very easy to write. I was done in about 30 minutes, whereas the generic version took me a couple of days to complete. So why make the effort ? The generic version allows you to use any container that supports forward iterators e.g. std::array, std::vector, C array, boost::ublas vectors, Blitz++. With the non-generic version you are tied into using only boost::ublas vectors. 

As far as performance is concerned, the generic version using boost::ublas vectors is about 20% slower than Ublass non-generic version. However, if I substitute the container for a std::array, the generic-algorithm is 40% faster. 

Runga-kutta4 Performance of the generic-algorithm using different containers compared to the non-generic version: 

std::vector: 25% faster
Blitz: about the same
ublas:       20% slower
std::array: 40% faster

template <class InputIterator, class ForwardIterator, class T, class Time=double>
std::pair<std::vector<Time>, ForwardIterator> __runge_kutta4(
    InputIterator inBegin, InputIterator inEnd, 
    Time x0, Time x1, int nr_steps, T f, ForwardIterator yout)
{
  typedef typename std::iterator_traits<InputIterator>::value_type 
      value_type;
  auto delta_t = static_cast<Time>((x1-x0)/nr_steps);
  auto nr_equations = std::distance(inBegin, inEnd);
  ForwardIterator youtP = yout;
  yout = std::copy(inBegin, inEnd, yout);
  std::vector<Time> ytime(nr_steps);

  for (auto ii=0; ii< nr_steps; ++ii) {
    typedef typename std::vector<value_type> vec;

    // K1 = f(x, u(x))
    vec v1(nr_equations);
    typename vec::iterator k1 = f(yout-nr_equations, x0, begin(v1));

    // K2 = f(x+deltaX/2, u(x)+K1*deltaX/2)
    vec v2(nr_equations);
    vec y2(nr_equations);
    std::transform(yout-nr_equations, yout, k1, begin(y2), 
      [delta_t](value_type& y,value_type& k) {
          return y+0.5*delta_t*k;
      });
    typename vec::iterator k2 = 
        f(begin(y2), 0.5*delta_t+x0, begin(v2));

    // K3 = f(x+deltaX/2, u(x)+K2*deltaX/2)
    vec v3(nr_equations);
    vec y3(nr_equations);
    std::transform(yout-nr_equations, yout, k2, begin(y3), 
      [delta_t](value_type& y,value_type& k){
          return y+0.5*delta_t*k;
      });
    typename vec::iterator k3 = 
        f(begin(y3), 0.5*delta_t+x0, begin(v3));

    // K4 = f(x+deltaX, u(x)+K2*deltaX)
    vec v4(nr_equations);
    vec y4(nr_equations);
    std::transform(yout-nr_equations, yout, k3, begin(y4), 
      [delta_t](value_type& y, value_type& k){ return y+delta_t*k; });
    typename vec::const_iterator k4 
        = f(begin(y4), delta_t+x0, begin(v4));

    // u(x+deltaX) = u(x) + (1/6) (K1 + 2 K2 + 2 K3 + K4) * deltaX
    for (auto i=0; i<nr_equations; ++i,++yout) {
      *yout = *(yout-nr_equations) + 
        (1.0/6.0) * (k1[i] + 2.0*k2[i] + 2.0*k3[i] + k4[i]) * delta_t;
    }
    ytime[ii] = x0;
    x0 += delta_t;
  }

  return std::make_pair(ytime, yout);
}

Listing 1: Generic Implementation of Runga-kutta4

template <class T, class Vector=boost::numeric::ublas::vector<double> >
Vector rk4(const Vector& init, double x0, double x1, int nr_steps, T f)
{
 Vector y = init;
 auto h = static_cast<double>((x1-x0)/nr_steps);
 double xn = x0;

 for (auto ii=0; ii< nr_steps; ++ii) {
   Vector k1 = f(y, xn);
   Vector k2 = f(y+0.5*h*k1, 0.5*h+xn);
   Vector k3 = f(y+0.5*h*k2, 0.5*h+xn);
   Vector k4 = f(y+h*k3, h+xn);

   xn += h;
   y = u + (1.0/6.0) * (k1 + 2.0*k2 + 2.0*k3 + k4) * h;
 }
 return y;
}

Listing 2: Non-Generic Implementation of Runga-kutta4 (using ublas)

for n = 0,1,2,3,..., using
 k1 = h f(xn, yn)
 k2 = h f(xn + h/2, yn + k1/2)
 k3 = h f(xn + h/2, yn + k2/2)
 k4 = h f(xn + h, yn + k3)

 xn+1 = xn + h
 yn+1 = yn + h(1/6)(k1 + 2k2 + 2k3 + k4)

Listing 3: Runga-kutta4 equations.