Wednesday, July 27, 2011

Python Bindings for Sally (a machine learning tool)

Download:   sally-0.6.1-with-bindings.tar.gz

One of the tools I have used recently in my machine-learning projects is Sally. As Sally’s web page describes it: “There are many applications for Sally, for example, in the areas of natural language processing, bioinformatics, information retrieval and computer security”. You can look at the example page to see more details. It is written in C, which makes it fast, but as is usually the case, using a tool like this directly from Python would make life easier: prototyping and system development go faster. Since I expect to use it repeatedly in the future, I gave the library Python bindings. In this post I would like to outline the technique I use to create a Python module from a C library. I use Swig for the bindings, and you will need some familiarity with Swig to follow the rest of this post.

As input, SWIG takes a special "interface file" (usually given an .i suffix) containing ANSI C/C++ declarations. At its simplest, an interface file looks something like the one below, which creates a module called "example" and makes all C/C++ functions and variables declared in example.h available from Python.
%module example

%{
#include "example.h"
%}

%include "example.h"
 
Unfortunately, interface files are not usually that simple. There are limitations to what Swig will parse correctly. For example, complex declarations such as function pointers and arrays are problematic.

Sally uses libconfig for its configuration management, so one would need to include libconfig in the interface file. Take a look at the interface file below. Libconfig's config_lookup_string function is problematic: it returns the string through a const char** output parameter, and Swig cannot deal with that without extra work from us. I created a function called config_lookup_string_2 that wraps config_lookup_string, and with the help of the cstring.i library it becomes usable from Python. Unfortunately, this is quite typical: it often becomes a time-consuming process to check every function and structure you want to provide bindings for, and to look for ways of making the problematic ones work correctly from Python.
%module pysally

%{
#include <libconfig.h>
%}

%include <cstring.i>
%cstring_output_allocate(char **out1, free(*$1));

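%inline %{
#include <stdlib.h>
#include <string.h>

/* Wrapper that hands Python a private copy of the looked-up string. */
void config_lookup_string_2(
    const config_t *config, const char *path, char **out1)
{
    const char *value = NULL;

    /* libconfig keeps ownership of the string it returns, so copy it;
       cstring.i frees the copy after converting it to a Python string. */
    if (config_lookup_string(config, path, &value) != CONFIG_TRUE || !value)
        value = "";
    *out1 = strdup(value);
}

%}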
%include <libconfig.h>

The above interface file can quickly grow into a bit of a nightmare in terms of development time and complexity as you add more functions that Swig can’t deal with transparently. I take an alternative route: I create a facade over the API I want to use from Python. The facade consists of one or more C++ classes, and it is the facade for which I provide bindings. The facade may be as complex as Swig's parser allows, as long as it does not require complex Swig directives in the interface file.

Getting back to Sally. Essentially the library does three things:
1) Read a config file.
2) Read text from a file or files and process them.
3) Write features to an output file.

As far as Sally's configuration processing is concerned, one could provide some getter and setter member functions. It’s not strictly necessary to access the configuration from Python, because the facade takes care of the configuration details in the load_config and init methods. All I have to do from Python is pass the name of the configuration file Sally is to use. Below is my initial attempt at creating a facade and its interface file. You pass the input and output paths in the constructor and the configuration path in load_config. As you can see, I make use of std::string, because Swig deals with it semi-transparently once %include "std_string.i" is added to the interface file.

swig.i
%module pysally

%{
#include "pysally.h"
%}

%include "std_string.i"
%include "pysally.h"

pysally.h
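#include <string>
#include <libconfig.h>

class Sally
{
public:

    // Members are initialised in declaration order, including verbose_.
    Sally(int verbose, std::string in, std::string out) :
        verbose_(verbose), entries_(0), input_(in), output_(out) {}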

    ~Sally();

    /// Load the configuration of Sally
    void load_config(const std::string& config_file);

    /// Init the Sally tool
    void init();

    /// Main processing routine of Sally.
    /// This function processes chunks of strings.
    void process();

    /// Get/Set configuration
    std::string getConfigAttribute(std::string name);

    void setConfigAttribute(std::string name, std::string value);

    // etc
    // ...
    // ...

private:
    config_t cfg_;
    int verbose_;
    long entries_;
    std::string input_;
    std::string output_;
};

From Python you would use it like this:
from pysally import Sally

verbose = 0
input_path = "/tmp/input"     # "in" is a reserved word in Python
output_path = "/tmp/output"
config = "/tmp/sally.cfg"

s = Sally(verbose, input_path, output_path)
s.load_config(config)
s.init()
s.process()

As a result, I can now use Sally from Python, which is nice, but it doesn’t really provide anything that I can’t already do with the executable Sally provides. The Sally library allows you to configure its output for a specified format, such as plain text, LibSVM or Matlab. Even though it’s not too difficult to add C code for other formats, it is even easier to do from Python. I provide two additional C++ classes, Reader and Writer (see the code below). The Reader and Writer facades use the underlying Sally library to read and write files using the format specified in the configuration file, just as the original Sally binary does. But by extending these classes in Python, one could override the default behaviour: read and write other formats, read from and write to a database instead, or even write Sally's output directly to another machine-learning module or read its input directly from a web-scraping Python module instead of a file.

Below, you can see the final interface file, the C++ Reader/Writer classes that provide the default implementation, and the Python Reader/Writer extension classes. The interface file is still very simple. The only new additions are the director directives: directors="1" on %module and one %feature("director") line per class. Directors allow C++ classes to be extended in Python, and from C++ these extensions look exactly like native C++ classes. Neither C++ code nor Python code needs to know where a particular method is implemented.

swig.i
%module(directors="1") pysally
%{
#include "pysally.h"
%}

%feature("director") Reader;        
%feature("director") Writer;       

%include "std_string.i"
%include "pysally.h"

pysally.h
class Writer
{
public:

    Writer(std::string out);

    virtual ~Writer();
   
    virtual void init();   
   
    virtual const std::string getName();

    virtual int write(const output_list& output, int len);
   
private:   
    config_t& cfg_;   
    std::string output_;
    bool hasout_;
};

class Reader
{
public:

    Reader(std::string in);

    virtual ~Reader();
   
    virtual void init();   
   
    virtual const std::string getName();
   
    virtual long getNrEntries();

    virtual int read(string_list& strs, int len);

private:   
    config_t& cfg_;   
    std::string input_; 
    long entries_;   
};

run.py
from pysally import Sally, Reader, Writer

class MyReader(Reader):
    def __init__(self, input):
        super(MyReader, self).__init__(input)

    def read(self, strings, len):       
        return super(MyReader, self).read(strings, len)

    def init(self):
        super(MyReader, self).init()

    def getNrEntries(self):
        return super(MyReader, self).getNrEntries()


class MyWriter(Writer):
    def __init__(self, output):
        super(MyWriter, self).__init__(output)

    def init(self):        
        pass   

    def write(self, fvec, len):       
        for j in range(len):           
            print "l:", fvec.getFeaturesLabel(j),           
            for i in range(fvec.getListLength(j)):
                print fvec.getDimension(j, i), fvec.getValue(j, i),
            print fvec.getFeaturesSource(j)           
            print       
        return 1

input = "/home/edimchr/reuters.zip"
output = "/home/edimchr/tmp/pyreuters.libsvm"
verbose = 0
r = MyReader(input)
w = MyWriter(output)
#r = Reader(input)
#w = Writer(output)

s = Sally(verbose, r, w)
s.load_config("./example.cfg")
s.init()
s.process()

From Python, then, you can extend the Reader and/or Writer classes defined in C++. MyReader and MyWriter are passed to the Sally facade via its constructor, and from then on the underlying C++ code uses the derived Python implementations. MyReader simply defers to its base class, Reader, and MyWriter prints the output information Sally generated.
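To illustrate the kind of formatting a custom write method could do instead of printing, here is a minimal sketch that renders each feature vector as a LibSVM-style line (label followed by dim:value pairs). It relies only on the accessor names used in MyWriter above. FakeOutputList is a hypothetical pure-Python stand-in so the snippet runs without the compiled module, and format_libsvm is an illustrative name, not part of pysally.

```python
class FakeOutputList(object):
    """Hypothetical stand-in that mimics the accessors used by MyWriter."""

    def __init__(self, vectors):
        # vectors: list of (label, [(dimension, value), ...], source)
        self._vectors = vectors

    def getFeaturesLabel(self, j):
        return self._vectors[j][0]

    def getListLength(self, j):
        return len(self._vectors[j][1])

    def getDimension(self, j, i):
        return self._vectors[j][1][i][0]

    def getValue(self, j, i):
        return self._vectors[j][1][i][1]

    def getFeaturesSource(self, j):
        return self._vectors[j][2]


def format_libsvm(fvec, count):
    """Return one 'label dim:value ...' string per feature vector."""
    lines = []
    for j in range(count):
        pairs = ["%d:%g" % (fvec.getDimension(j, i), fvec.getValue(j, i))
                 for i in range(fvec.getListLength(j))]
        lines.append(" ".join(["%g" % fvec.getFeaturesLabel(j)] + pairs))
    return lines


# Two made-up feature vectors, purely for demonstration.
demo = FakeOutputList([(1.0, [(3, 0.5), (7, 0.25)], "a.txt"),
                       (-1.0, [(2, 1.0)], "b.txt")])
for line in format_libsvm(demo, 2):
    print(line)
```

Inside a real write override, the same helper could be called with the fvec and len arguments and its lines sent to a file, a socket or a database instead of being printed.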

You may have noticed that the Reader class defines the member function:

    virtual int read(string_list& strs, int len);

And Writer defines the member function:

    virtual int write(const output_list& output, int len);

What are string_list and output_list? Sally defines a couple of structures that it uses to store the text it reads (string_t) and the output features it calculates (fvec_t). These two structures are especially problematic for Swig. As a result, I created a facade over each one, called string_list and output_list respectively.
class string_list
{   
private:
    string_t* str_;
   
public:   
    string_list(string_t* str) :
        str_(str) {}
  
    /// Length for element i
    void setStringLength(int i, int len) 
      { str_[i].len = len ; }

    /// String data for element i
    void setStringData(int i, char* data) 
      { str_[i].str = strdup(data); } 
   
    /// Optional label of string
    void setLabel(int i, float label) 
      { str_[i].label = label; }
       
    /// Optional description of source
    void setSource(int i, char* src) 
      { str_[i].src = strdup(src); } 
   
    string_t* getString() const { return str_; }
};

class output_list
{   
private:
    fvec_t** vec_;
   
public:   
    output_list(fvec_t** vec) :
        vec_(vec) {}
  
    /// Length for element i
    unsigned long getListLength(int i) const 
      { return vec_[i]->len; }

    /// Nr of features for element i
    unsigned long getTotalFeatures(int i) const 
      { return vec_[i]->total; }
   
    /// Label for element i
    float getFeaturesLabel(int i) const 
      { return vec_[i]->label; }
   
    /// Dimension j of element i
    unsigned long getDimension(int i, int j) const
      { return vec_[i]->dim[j]; }
   
    /// Value j of element i
    float getValue(int i, int j) const
      { return vec_[i]->val[j]; }   
   
    char* getFeaturesSource(int i) const 
      { return vec_[i]->src; } 
   
    fvec_t** getFvec() const { return vec_; }
};
Because the facade is simple C++, Swig can parse the interface file without difficulties. In general, a facade can safely use std::string, std::vector and std::map, since Swig ships library interface files for them (std_string.i, std_vector.i and std_map.i).
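To make the reader side concrete, here is a minimal sketch of how a Python read override might fill a string_list from in-memory documents rather than from a file. FakeStringList is a hypothetical stand-in that mirrors the setter names of the facade above, and fill_strings is an illustrative helper. Two things are assumptions: that the value returned from read is the number of strings supplied, and that string data crosses into C as UTF-8 bytes.

```python
class FakeStringList(object):
    """Hypothetical stand-in that mirrors the string_list setters."""

    def __init__(self, capacity):
        self.items = [{"len": 0, "str": None, "label": 0.0, "src": None}
                      for _ in range(capacity)]

    def setStringLength(self, i, length):
        self.items[i]["len"] = length

    def setStringData(self, i, data):
        self.items[i]["str"] = data

    def setLabel(self, i, label):
        self.items[i]["label"] = label

    def setSource(self, i, src):
        self.items[i]["src"] = src


def fill_strings(strs, docs, capacity):
    """Copy up to `capacity` (text, label, source) tuples into strs.

    Returns the number of entries filled (assumed meaning of read's result).
    """
    count = min(len(docs), capacity)
    for i in range(count):
        text, label, source = docs[i]
        strs.setStringData(i, text)
        # Assumption: the C side sees UTF-8 bytes, so report the byte length.
        strs.setStringLength(i, len(text.encode("utf-8")))
        strs.setLabel(i, label)
        strs.setSource(i, source)
    return count


# Made-up documents, e.g. as they might come from a scraper.
docs = [(u"hello world", 1.0, "doc-1"),
        (u"caf\u00e9", -1.0, "doc-2"),
        (u"unused", 0.0, "doc-3")]
strs = FakeStringList(2)
filled = fill_strings(strs, docs, 2)
print(filled)
```

A MyReader.read(self, strings, len) override could delegate to such a helper, passing its own strings and len arguments along with whatever documents it has collected.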


Building the Python module

Sally is a C library and is built with Autotools.

1) You need additional Autoconf macros to enable SWIG and Python support. I added ac_pkg_swig.m4, ax_pkg_swig.m4 and ax_python_devel.m4 to the m4 subdirectory.
 
    sally-0.6.1/
        m4/
            ac_pkg_swig.m4
            ax_pkg_swig.m4
            ax_python_devel.m4
        pysally/
            Makefile.am
            swig.i
        src/
            Makefile.am
        Makefile.am
        configure.in
 
2) Add pysally to sally-0.6.1/Makefile.am
    ……
    SUBDIRS = src doc tests contrib pysally
    ……
    ……
 
3) Add the following to sally-0.6.1/configure.in
    AC_PROG_CXX
    AC_DISABLE_STATIC
    AC_PROG_LIBTOOL
    AX_PYTHON_DEVEL(>= '2.3')
    AM_PATH_PYTHON
    AC_PROG_SWIG(1.3.21)
    SWIG_ENABLE_CXX
    SWIG_PYTHON
 
4) Add pysally/Makefile to AC_CONFIG_FILES in sally-0.6.1/configure.in

    AC_CONFIG_FILES([
       Makefile \
       src/Makefile \
       src/input/Makefile \
       src/output/Makefile \
       src/fvec/Makefile \
       doc/Makefile \
       tests/Makefile \
       contrib/Makefile \
       pysally/Makefile \
    ])

5) Create the pysally subdirectory and add the files
    Makefile.am
    swig.i       <-- interface file
    globals.h 
    pysally.h    <-- wrapper facades
    pysally.cpp
    run.py       <-- example code to use the module

6) To build:
    cd sally-0.6.1
    ./autogen.sh
    ./configure --prefix=/home/yourhome/sally_install/ --enable-libarchive
    make
    make install
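With a custom --prefix, the installed module is usually not on Python's default search path. For %module pysally, Swig generates a proxy file named pysally.py next to the compiled extension, so one way to locate the install directory is to search the prefix for that file. The helper below is a minimal sketch of that idea; find_module_dirs is a hypothetical name, and the exact directory layout under the prefix depends on your Python and Automake setup.

```python
import os


def find_module_dirs(prefix, module="pysally"):
    """Return the directories under prefix that contain <module>.py."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(prefix):
        if module + ".py" in filenames:
            hits.append(dirpath)
    return sorted(hits)


# Example use (the path is the placeholder prefix from step 6):
#   import sys
#   sys.path.extend(find_module_dirs("/home/yourhome/sally_install"))
#   import pysally
```

Setting PYTHONPATH to the directory this reports achieves the same thing without touching sys.path in your scripts.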