Python in Science: 2008

Thursday, October 30, 2008

Fast(?) Gillespies Direct Algorithm in Python

Today I took the day off to implement the Gillespie SSA algorithm. For those of you who have never heard of it is a solver for stochastic equations. There is no implementation of it in Python (to my knowledge) and the only other easily accessible implementation I found was the one on the GillespieSSA package for R.

To help the unfamiliar to understand the application and relevance of this algorithm, consider the following dynamical system derived from an epidemiological problem:

Suppose an infected individual arrives in his home town after his/her vacation carrying an infectious disease entirely new to his fellow citizens. This means no one in town has antibodies for that disease. Let this set of people be called the "susceptibles" and let S be the number of people in this group. Our Infected individual will initially be the only member of a group called the "infectious" which size is denoted by I. The rate of encounters between susceptibles and infectious is proportional to their numbers (S*I). Everytime an infectious meet a susceptible there is a chance b that the disease is transmitted, so the probability of a transmission event is b*S*I. This is called an individual state transition, since the individual went from a susceptible state to an infectious one.
The dynamical system used as an example here contains this and other state transitions described in the code below:


"""
Example of an SEIR model with two Infectious classes: subclinical(Is) and clinical(Ic)
        Is
       /  \
S -> E     R
       \  /
        Ic
        
States:
S: Susceptible
E: Exposed
Is: Infectious subclinical
Ic: Infectious clinical
R: Recovered

Transition rates:
b,ks,kc,rs,rc = (0.001, 0.1, 0.1, 0.01, .01)
Transitions:
S -> E : b*S*(Is+Ic)
E -> Is : ks*E
E -> Ic : kc*E
Is -> R : rs*Is
Ic -> R : rc*Ic

"""
from gillespie import Model
import time
from numpy import array
vars = ['S','E','Is','Ic','R']
#rates: b,ks,kc,rs,rc

r = (0.001, 0.1, 0.1, 0.01, .01)
ini = (490,0,10,0,0)
prop = (lambda r,ini:r[0]*ini[0]*(ini[2]+ini[3]),
        lambda r,ini:r[1]*ini[1],
        lambda r,ini:r[2]*ini[1],
        lambda r,ini:r[3]*ini[2],
        lambda r,ini:r[4]*ini[3]
        )

tmat = array([[-1,0,0,0,0],
            [1,-1,-1,0,0],
            [0,1,0,-1,0],
            [0,0,1,0,-1],
            [0,0,0,1,1]
            ])
#for e in prop:
#    print e()
M=Model(vnames=vars,rates = r,inits=ini,tmat=tmat,propensity=prop)
t0 = time.time()
M.run(tmax=80,reps=100)
print 'total time: ',time.time()-t0
t,series,steps = M.getStats()
print steps,'steps'
from pylab import plot , show, legend, errorbar
plot(t,series.mean(axis=2),'-o')
legend(vars,loc=0)
show()

As can be seen there are five different states in this model with the last one R, being an absorbing state, since all individuals converge to that state with time.

the code that simulates this stochastic dynamic model is in the gillespie module:


from numpy.random import uniform, multinomial, exponential,random
from numpy import arange, array, empty,zeros,log
#from math import log
import time
import multiprocessing



import psyco
psyco.full()

#global ini
#ini=[500,1,0]

class Model:
    def __init__(self,vnames,rates,inits, tmat,propensity):
        '''
         * vnames: list of strings
         * rates: list of fixed rate parameters
         * inits: list of initial values of variables
         * propensity: list of lambda functions of the form: 
            lambda r,ini: some function of rates ans inits.
        '''
        self.vn = vnames
        self.rates = rates
        self.inits = inits
        self.tm = tmat
        self.pv = propensity#[compile(eq,'errmsg','eval') for eq in propensity]
        self.pvl = len(self.pv) #length of propensity vector
        self.nvars = len(self.inits) #number of variables
        self.time=None
        self.series=None
        self.steps=0
    
    def getStats(self):
        return self.time,self.series,self.steps
    
    def run(self,method='SSA', tmax=10, reps=1):
        self.res = zeros((tmax,self.nvars,reps),dtype=float)
        tvec = arange(tmax, dtype=int)
        if method =='SSA':
            for i in xrange(reps):
                steps = self.GSSA(tmax,i)
            print steps,' steps'
        elif method == 'SSAct':
            pass
        self.time=tvec
        self.series=self.res
        self.steps=steps
    def GSSA(self, tmax=50,round=0):
        '''
        Gillespie Direct algorithm
        '''
        ini = self.inits
        r = self.rates
        pvi = self.pv
        l=self.pvl
        pv = zeros(l,dtype=float)
        tm = self.tm
        
        tc = 0
        steps = 0
        self.res[0,:,round]= ini
        a0=1
        for tim in xrange(1,tmax):
            while tc < tim:
                for i in xrange(l):
                    pv[i] = pvi[i](r,ini)
                #pv = abs(array([eq() for eq in pvi]))# #propensity vector
                a0 = pv.sum() #sum of all transition probabilities
#                print tim, pv, a0
                tau = (-1/a0)*log(random())
                event = multinomial(1,pv/a0) # event which will happen on this iteration
                ini += tm[:,event.nonzero()[0][0]]
                #print tc, ini
                tc += tau
                steps +=1
                if a0 == 0: break
            self.res[tim,:,round] = ini
            if a0 == 0: break
#        tvec = tvec[:tim]
#        self.res = res[:tim,:,round]
        return steps

    def CR(self,pv):
        """
        Composition reaction algorithm
        """
        pass
        

def main():
    vars = ['s','i','r']
    ini= [500,1,0]
    rates = [.001,.1]
    tm = array([[-1,0],[1,-1],[0,1]])
    
    prop=[lambda r, ini:r[0]*ini[0]*ini[1],lambda r,ini:r[0]*ini[1]]
    M = Model(vnames = vars,rates = rates,inits=ini, tmat=tm,propensity=prop)
    t0=time.time()
    M.run(tmax=80,reps=1000)
    print 'total time: ',time.time()-t0
    #print res

#    from pylab import plot , show, legend
#    plot(t,res,'-.')
#    legend(M.vn,loc=0)
#    show()
    

if __name__=="__main__":
    import cProfile
    cProfile.run('main()',sort=1)
    main()

The Gillespie SSA is a very simple algorithm, so I thought it would be nice to blog about it and have other Pythonistas take a look at my code and possibly find a way to optimize it further. Having the core code run as fast as possible is very important for this application, because you normally have to run thousands of replicates of a every run due to its stochastic nature. This "need for speed", made me try Cython, but due to my inexperience with it, I couldn't squeeze much performance out of it:


from numpy.random import uniform, multinomial, exponential
#from numpy import arange, array, empty,zeros
import numpy as np
cimport numpy as np
import time

DTYPE = np.double
ctypedef np.double_t DTYPE_t
ctypedef np.int_t INT_t

cdef class Model:
    cdef object vn,rates,inits,pv
    cdef np.ndarray tm,res,time,series
    cdef int pvl,nvars,steps
    cdef object ts
    def __init__(self,vnames,rates,inits, tmat,propensity):
        '''
         * vnames: list of strings
         * rates: list of fixed rate parameters
         * inits: list of initial values of variables
         * propensity: list of lambda functions of the form: 
            lambda r,ini: some function of rates ans inits.
        '''
        self.vn = vnames
        self.rates = rates
        self.inits = inits
        self.tm = tmat
        self.pv = propensity#[compile(eq,'errmsg','eval') for eq in propensity]
        self.pvl = len(self.pv) #length of propensity vector
        self.nvars = len(self.inits) #number of variables
        self.time = np.zeros(1)
        self.series = np.zeros(1)
        self.steps = 0

    def run(self, method='SSA', int tmax=10, int reps=1):
        cdef np.ndarray[DTYPE_t,ndim=3] res = np.zeros((tmax,self.nvars,reps),dtype=float)
        tvec = np.arange(tmax)
        self.res = res
        cdef int i, steps
        if method =='SSA':
            for i from 0 <= i<reps:
                steps = self.GSSA(tmax,i)
            print steps,' steps'
        elif method == 'SSAct':
            pass
        self.time=tvec
        self.series=self.res
        self.steps=steps
    
    def getStats(self):
        return self.time,self.series,self.steps
    
    cpdef int GSSA(self, int tmax=50,int round=0):
        '''
        Gillespie Direct algorithm
        '''
        ini = self.inits
        r = self.rates
        pvi = self.pv
        cdef int l,steps,i,tim
        cdef double a0,tc, tau
        #cdef np.ndarray[INT_t] tvec
        cdef np.ndarray[DTYPE_t] pv
        l=self.pvl
        pv = np.zeros(l, dtype=float)
        tm = self.tm
        #tvec = np.arange(tmax,dtype=int)
        tc = 0
        steps = 0
        self.res[0,:,round]= ini
        a0=1.
        for tim from 1<= tim <tmax:
            while tc < tim:
                for i from 0 <= i <l:
                    pv[i] = pvi[i](r,ini)
                #pv = abs(array([eq() for eq in pvi]))# #propensity vector
                a0 = a_sum(pv,l) #sum of all transition probabilities
                #print ini#,tim, pv, a0
                tau = (-1/a0)*np.log(np.random.random())
                event = multinomial(1,(pv/a0)) # event which will happen on this iteration
                ini += tm[:,event.nonzero()[0][0]]
                #print tc, ini
                tc += tau
                steps +=1
                if a0 == 0: break
            self.res[tim,:,round] = ini
            if a0 == 0: break
#        tvec = tvec[:tim]
#        self.res = res[:tim,:,round]
        return steps

    def CR(self,pv):
        """
        Composition reaction algorithm
        """
        pass
        

def main():
    vars = ['s','i','r']
    cdef np.ndarray ini= np.array([500,1,0],dtype = int)
    cdef np.ndarray rates = np.array([.001,.1],dtype=float)
    cdef np.ndarray tm = np.array([[-1,0],[1,-1],[0,1]])
    
    prop = [l1,l2]
    M = Model(vnames = vars,rates = rates,inits=ini, tmat=tm,propensity=prop)
    t0=time.time()
    M.run(tmax=80,reps=1000)
    print 'total time: ',time.time()-t0

    
cdef double a_sum(np.ndarray a, int len):
    cdef double s
    cdef int i
    s=0
    for i from 0 <=i <len:
        s+=a[i]
    return s
        
def l1(np.ndarray r,np.ndarray ini):
    return r[0]*ini[0]*ini[1]
def l2(np.ndarray r,np.ndarray ini):
    return r[1]*ini[1]

That's it. I hope this code will serve to inspire other scientists which need to use this type of simulations. In the next few days I will organize it in a package and make it available somewhere.

I would appreciate any suggestions to speed up the gillespie module, both the Python and the Cython versions.

Sunday, October 5, 2008

Further Bayesian adventures

I am quite busy at the moment. But I recently got some inspiration for pushing forward with the development of my object-oriented Bayesian-inference package. The inspiration came in the form of the recently released R-packages rv By Kerman and Gelman. A technical report by the same authors provided me with the initial idea for starting this package. The rv package implements part of what intend to implement in my Bayesian package, and I am inclined to borrow some Ideas from it as I improve my Python package. I want to extend what is already available in rv with support for posterior derivation using natural conjugate priors, support for approximate bayes computation methods, etc. Let's just see If I can find the time...

Tuesday, September 30, 2008

Python Multi-processing

In this era of multiple cores everywhere, it kind of makes me nervous to see one of my computer cores siting idle as the other crunches away at some numerical simulation. Running heavy numerical simulations is the bread and butter of my work, so I am always on the lookout for ways to extract as much computational juice as I can from my CPUs.

Over last couple of years I have played different approaches available to python programmers, from tools available in the standard library such as forking processes and the threading module, to external packages such as Parallel Python and Ipython1. All of them have their pros and cons, and in many occasions, I found myself wasting valuable computable time trying to get my simulations to run under the parallelization models inherent to each of the solutions listed above.

I will not go into details about what I lked and disliked about each of them but rather I will focus on the future of parallel processing in the Pythonsphere: the already available and soon to be part of the standard library, Processing module (renamed multiprocessing for the standard library).

It can be installed with a simple "easy_install processing". For those who don't know yet, the processing/multiprocessing module is a multi-processing (duh!) module using the same API as the standard library's threading module.

The processing module takes a lot of the pain out of the process of writing parallel code when compared to other methods. By using multiple processes, it saves you from having to deal with problems associated with having a shared memory between tasks. This means you can elegantly bypass the GIL, with the same code you would write for multithreaded application minus the boilerplate code you'd have to write to handle racing conditions and whatnot. This is the meaning of sharing the same API with the threading module. Moreover, with processing, your code runs on Windows just as well as on Linux, which is something you couldn't do with fork.

Before processing, the (IMHO) best tool for "simple" multi-processing was Parallel-Python, but I found extremely painful having to manually declare global variables and modules which each process would have to have access to.

I must say that so far, my experience with processing is quite limited. However, I benefit from the point of view of having implemented the same exact (simple) code on all of the said platforms except for Ipython1, and I can attest that for simple parallelizable problems, processing makes the task it is as simple as it can get.

In conclusion, if you can benefit from parallel processing in your application, I strongly suggest trying out the processing module.

Thursday, August 7, 2008

AppEngine and Google Base

My experiments with AppEngine keep expanding. One thing they got right
was the incentive to use other google services through google data
APIs also known as GData.
The first step towards that was forcing us to use Authsub by default
in appEngine. Noticing that, I am thinking about taking advantage of
that to aleviate the usage of the tiny storage quota offered by Google
at AppEngine.
I'll post more about this as my experiments mature.

--
FLávio Codeço Coelho, PhD

Tuesday, August 5, 2008

Editra: a Great New Python Editor

Editra is a relatively new (at least for me) text editor written in Python/GTK.

I have been working extensively with it recently, and although I am not ready to leave behind my currently favorite Python IDE, Eric, Editra doesn't make me miss it too much. One thing that it does that Eric still does not, even though it could, is to provide a code browser for a source file which is not part of a project. I know, many editors do that, but Eric only enables the browser for files in a project, silly thing isn't it?.

Editra has an interesting plugin system, through which it provides most of its goodies, such as the code browser, comment browser, file browser, support for projects, a python shell, themes etc. There is also a plugin for an Ipython Shell that is apparently broken.

Its project support includes source control integration for Git, Bazaar, CVS and Subversion, Unfortunately notmy favorite: Mercurial.

One thing that annoys me is the lack of support for the middle-mouse-button paste, typical of Linux, which I use a lot (on other editors, naturally). I couldn't find the convenient run button as well... too bad.

Overall it is an excellent editor, and I wish it continues to develop well. One last wish: Good code completion, if it can do that it make an excellent alternative to commercial editors such as Komodo. Kudos to its developers!

Monday, August 4, 2008

Google AppEngine

I have spent quite a lot of time lately developing for google appengine. I must admit that it is really a great package, especially considering that is still in beta.

The development cycle is a breeze. One thing I would add though, would be a decent source code management system. This could take the form of an integration with google code. The appEngine dashboard does a good job keeping track of versions of your application, but if by accident you loose the code for your application (like having your laptop stolen for instance) you' re basically screwed.
Another ridiculous aspect of it is the 500MB of storage! why would a company that gives you 6+GB of email space not give you at least the same for your web app? Too bad they don't really have competition. Yes you can write your app in Django and all that, but Django, although more powerful as a development platform than the current version on GAE, does not offer such an efficient development and deployment cycle. Someone should work at creating an open source version of the GAE dashboard to help the deployment of Django apps. It would be a huge success.

My last wish is that I hope they will upgrade their internal version of Django to 1.0 as soon as the real Django 1.0 becomes production ready.

Wednesday, July 23, 2008

Metamodellers

I am rather busy at the moment pushing my new company forward. Yes, for those of you who don't know yet, I have just started up my own company.

It has been very rewarding so far, even considering all the hassle of tasks such as fund-seeking, registering the company, creating a website, develop the products, and making sure that everything work. Wow!, it's been a great challenge.

The focus of this company will be to build socially-enabled platforms to try to tackle two major modern-day challenges: Aggregating value to dispersed information and data (knowledge-building), and complex systems simulations and analysis. This two goals are a bit two complex to explain in a blog, but I hope that as soon as Metamodellers start to deliver its first products, the roadmap will become clear.

Wish me luck, and stay tuned for more news on this front.

Monday, May 12, 2008

Object-Oriented Bayesian Inference II

A while ago, I blogged about a little piece of code I had written as an example for my book.

Back then, it was a fun exercise for me, but it was clearly unfinished (and it still is) but worth as an example of how to model a mathematical object, a Bayesian random variate, as a python class.

Despite its simplicity and incompleteness, that example, still attracts attention, and people contact me asking for the code. In order to better serve those interested in this problem, I decided to create a project on google code to host the code.

I am still interested in taking this code further, extending its applicability to real problems. So if anyone is willing to help, the project site will hopefully be a good channel for that.

Sunday, April 13, 2008

New Job

It's been a while since i last posted about python in science. I apologize to my readers, but it has been a very busy couple of months. As a matter of fact it still is (i am writing this on my cell phone while I comute), but let's get to the point.

The reason for this crazyer than usual months is that I switched jobs. I have been offered a job in a private research company called Intelekto. It is a very interesting company, it's business model is based on selling scientific expertise to companies willing to outsource their research and development needs, to an agile, highly focused company that can deliver solutions in a fraction of the time usually taken by regular companies just to mobilize a research team.

I was hired to manage the reserch team: a small core team of highly trained scientists, ready to crunch through the clients' scientific problems. Whenever the needs of a client falls outside the joint expertise of our core team, we hire other scientists, on a per project basis.

Most of the projects we get, involve the development of software. As a good pythonista, I am starting to build the Python skills of my team and am actively hiring good Python developers. I should say that our company is based in Brazil, but we already have global clients, and are interested in expanding our collaboration network with other global players.

Even though I've been at the head of the research team for less than a month, we are already working with Python tools in a couple of our projetcs: Pyro Robotics, to prototype a robot navigation system, and with NetworkX and Igraph for a graph visualization module.

Well, that pretty much sums up the most recent news. I hope to post about our experiences with Pyro robotics and Igraph soon.

Monday, March 3, 2008

Incredible Video of intracellular molecular dynamics

I just found this incredible video (animation) depicting what goes on continually within our own cells. Reminds me of why I became a biologist.
We should never cease to let ourselves be amazed by the great machine that is the living cell and the organisms they are part of. Enjoy!

Python in Science