Saturday, December 25, 2010

Efficcient MCMC in Python

I have recently come across a very interesting blog. That is, interesting for people like me, who are enthusiasts of Statistics, programming and their combination expressed preferably in the Python language. One recent post of that blog, caught my attention, as it offered a comparison between MCMC implementations of a simple Gibbs sampler in multiple languages, namely: R, Python, Java and C.

After presenting the Python version the author (Darren Wilkinson) said that in an ideal world, he would always program in Python, but in reality, most of the time he had to use Java or C to obtain acceptable performance. That statement got me thinking that the very success of Python as a programming language is its ability of taking us to this ideal world, and keeping us there. Even in the performance domain, the Python community has developed a number of elegant solutions to minimize the need to resort to lower level languages: we have the ability to call directly into C code with Ctypes, the amazing progress of JIT (just in time compiling) capabilities in Pypy, which (we hope) will one day be incorporated into the reference implementation of Python, just to cite a few of the tools. But to solve the particular problem of optimizing numerical code to near C performance, my favorite tool these days is Cython.

So in this post, I set out to demonstrate how far we can go in terms of performance improvement with Cython, for the example code provided by Darren. Here is the original code in pure Python with a few modification of my own (basically the return of the samples):


import random,math

def gibbs(N=20000,thin=500):
x=0
y=0
samples = []
for i in range(N):
for j in range(thin):
x=random.gammavariate(3,1.0/(y*y+4))
y=random.gauss(1.0/(x+1),1.0/math.sqrt(x+1))
samples.append((x,y))
return samples

smp = gibbs()

Well, we can see by the look of it, that it's going to be slow in Python, two nested loops plus two calls to two non trivial Python functions in their core. In this kind of code (MCMC) there is no real way around heavily iterative structures. That's why it's such a good example to illustrate the powers of Cython. This code took 1480 seconds to run in my computer. This is going to be the baseline against which we will compare the performance of the Cython implementations. Here is the first one:

import random,math

def gibbs(int N=20000,int thin=500):
cdef double x=0
cdef double y=0
cdef int i, j
samples = []
for i in range(N):
for j in range(thin):
x=random.gammavariate(3,1.0/(y*y+4))
y=random.gauss(1.0/(x+1),1.0/math.sqrt(x+1))
samples.append((x,y))
return samples

smp = gibbs()
For those who are unfamiliar with Cython, Cython is just Python plus (C style) type declarations. You then save the file with a .pyx extension (gibbs.pyx) and compile it to straight C with these commands in a python shell:
>>> import pyximport; pyximport.install()
>>> import gibbs

This will compile your code automatically, and import it executing the function. This naturally assumes you are on a linux machine with Cython installed. I actually ran these experiments from a Sage Worksheet, a marvelous tool for those doing scientific computing. In Sage all you have to do to have your Cython code automatically compiled is to add a %cython line at the top of your cell containing cython code. This implementation took 127 seconds to run, an 11.6 fold improvement in performance. My next implementation included replacing the random number generators in the core of the loops for the equivalent ones from numpy.random, and that shaved a few extra seconds off the time (see the sage worksheet). The real bottleneck here is that the two function calls are still Python function calls which have the overhead of checking the types of the variables at runtime. So for my definitive implementation I decided to use GSL's random number generators. That involved declaring the parts of Gnu Scientific Library that I'd be using. But Cython makes it very easy:

#declaring external GSL functions to be used
cdef extern from "math.h":
double sqrt(double)

cdef double Sqrt(double n):
return sqrt(n)

cdef extern from "gsl/gsl_rng.h":
ctypedef struct gsl_rng_type:
pass
ctypedef struct gsl_rng:
pass
gsl_rng_type *gsl_rng_mt19937
gsl_rng *gsl_rng_alloc(gsl_rng_type * T)

cdef gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937)

cdef extern from "gsl/gsl_randist.h":
double gamma "gsl_ran_gamma"(gsl_rng * r,double,double)
double gaussian "gsl_ran_gaussian"(gsl_rng * r,double)

# original Cython code
def gibbs(int N=20000,int thin=500):
cdef double x=0
cdef double y=0
cdef int i, j
samples = []
#print "Iter x y"
for i in range(N):
for j in range(thin):
x = gamma(r,3,1.0/(y*y+4))
y = gaussian(r,1.0/Sqrt(x+1))
samples.append([x,y])
return samples

smp = gibbs()
This implementation took 5 seconds to run a 296 fold improvement in performance. At this point it felt good to be in my Ideal (Real) Python world.



7 comments:

victorg said...

Have you tried running this outside of Sage? I get such weird numbers on standalone Python that I wanted to check...

First, I find it weird that my ancient hardware runs the pure Python version 100x faster than yours. Further, if I create a main() function on the pure Python and simple Cython versions, I get no speedup with Cython:

def main():
. from time import time
. start = time()
. for x in xrange(10):
. . smp = gibbs()
. print time() - start
. return smp


python -c "import mcmc; mcmc.main()"

python -c "import cmcmc; cmcmc.main()"

Where cmcmc is a Cython module built using the instructions from http://docs.cython.org/src/quickstart/build.html#building-a-cython-module-using-distutils and the Cython without GSL code.

victorg said...

Oops, make that "my ancient hardware runs the pure Python version 10x faster than yours. Which, obviously, negates the 10x speedup via simple Cython.

FWIW, PyPy gives a 10x speedup over pure Python for me.

NIC1138 said...

Very nice, but isn't it possible to do that with scipy/numpy?

NIC1138 said...

Here it is, a scipy/numpy implementation. Just by using it I got a 12x speedup. I do believe you can attain even more using cython, but that shows how simply using numpy instead of the "normal" python math stuff, and before using things like Cython, you can already get great improvements.


def gibbs2(N=20000,thin=500):
y=scipy.zeros(N)
for j in range(thin):
x=scipy.random.gamma(3,1.0/(y**2+4))
y=scipy.random.normal(1.0/(x+1),1.0/scipy.sqrt(x+1))
return pylab.c_[x,y]

Flavio Coelho said...

I have addressed your comments in a follow-up post and posted the code on gist where you can fork at will.

thanks everyone for the comments!

Alex said...

Running this on PyPy seems to be within 20% of the Cython code, not bad:


alex@alex-laptop:/tmp$ cat test.py
import math
import random


def gibbs(N=20000, thin=500):
x = 0
y = 0
samples = []
for i in xrange(N):
for j in xrange(thin):
x = random.gammavariate(3, 1.0 / (y * y + 4))
y = random.gauss(1.0 / (x + 1), 1.0 / math.sqrt(x + 1))
samples.append((x, y))
return samples

smp = gibbs()

alex@alex-laptop:/tmp$ time ~/Desktop/pypy-1.4.1-linux/bin/pypy test.py

real 0m7.779s
user 0m7.648s
sys 0m0.036s

Flavio Coelho said...

@Alex: Very interesting result Alex, Thanks!

Anyone volunteers to improve the pure python version to use GSL via ctypes?

ccp

Amazon