Friday, February 27, 2009

Real-time Plotting from Numerical Simulations

Complex numerical simulations usually take a long time to run while consuming full CPU(s). If something goes wrong during such a run, we generally only find out at the end, when the traces of what went wrong are no longer available. Debugging such codes is not an easy task either: the code gets even slower when run through a debugger, and when we don't know where it is going to break down, the process can become painstaking.

So any way to monitor such a running code without slowing it down is always welcome. As the title of this article suggests, if your code is a numerical simulation, real-time plots of its progress are an extremely useful thing to have. However, traditional Python plotting tools such as Matplotlib, which I have been using for many years, are not a viable solution, since they are not very fast and do not support real-time plotting very well.

Recently, while going crazy debugging one such simulation, I decided to come up with a solution for real-time plotting in Python. I examined many candidates, which I won't mention here in order to keep this story simple. I finally settled on an old but still very good solution: Gnuplot!

Since Gnuplot is a stand-alone program, implemented in C, it is very fast at drawing plots, and with the help of python-gnuplot I was able to write a class with methods which would send data to Gnuplot and return immediately, without slowing down my running Python code! Gnuplot, for its part, plotted whatever data I threw at it very fast, giving me my much-needed real-time scope into my simulations. Gnuplot is a real example of the Unix philosophy: do one thing, and do it well!
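A minimal sketch of the idea, assuming the gnuplot binary is on the PATH. The class and function names here are my own illustrations, not the python-gnuplot API; the key point is that writing to the pipe returns immediately while gnuplot draws in its own process:

```python
# -*- coding: utf-8 -*-
# Sketch: pipe data to a separate gnuplot process so the
# simulation is never blocked waiting on the plot.
# Assumes a `gnuplot` executable on the PATH; names are illustrative.
import subprocess


def inline_data(ys):
    # Format a sequence as gnuplot inline data, terminated by 'e'
    return "\n".join(str(y) for y in ys) + "\ne\n"


class RealTimePlot(object):
    def __init__(self):
        # gnuplot runs as a child process; writes to its stdin
        # return immediately, so the caller is not slowed down
        self.proc = subprocess.Popen(["gnuplot"],
                                     stdin=subprocess.PIPE,
                                     universal_newlines=True)

    def plot(self, ys):
        # plot '-' tells gnuplot to read the data inline from stdin
        self.proc.stdin.write("plot '-' with lines\n")
        self.proc.stdin.write(inline_data(ys))
        self.proc.stdin.flush()

# Usage inside a simulation loop (not run here):
#   rtp = RealTimePlot()
#   rtp.plot(state_vector)   # returns at once; gnuplot redraws
```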

Usage pattern for multiprocessing

I have found myself using multiprocessing more and more these days, especially since it has been backported to Python 2.5.x.
I can no longer remember the last computer I owned that had a single core/processor. Multiprocessing makes me much happier, as I know I can milk my hardware for all it has to offer. I hope numpy, scipy and related packages find ways to build parallelization into their libraries as soon as possible.

The goal of this article is to share the simple usage pattern I have adopted to parallelize my code, which works on both single- and multi-core systems. As a disclaimer, much of my day-to-day code involves repeated calculations of some sort, which lend themselves to being distributed asynchronously. This explains the pattern I am going to present here.

I start by setting up a process pool, as close as possible to the code which will be distributed, preferably in the same local namespace. This way I don't keep processes floating around after my parallel computation is done:

# -*- coding: utf-8 -*-
from numpy import random, linalg
from multiprocessing import Pool

counter = 0

def cb(r):
    # Callback, run in the parent process each time a task finishes
    global counter
    print counter, r
    counter += 1

def det(M):
    # The work to be distributed: the determinant of a matrix
    return linalg.det(M)

po = Pool()
for i in xrange(1, 300):
    j = random.normal(1, 1, (100, 100))
    # Queue the task and return immediately; cb receives the result
    po.apply_async(det, (j,), callback=cb)
po.close()
po.join()
print counter

The call Pool() returns a pool of as many processes as there are cores/cpus available. This makes the code perform optimally on any number of cores, even on single core machines.
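You can check that default for yourself; Pool() sizes itself from multiprocessing.cpu_count():

```python
from multiprocessing import cpu_count

# Pool() with no arguments creates one worker process
# per core/CPU reported here
print(cpu_count())
```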

The use of a callback function allows for true asynchronicity, since my loop does not have to wait for apply_async to return.

Finally, the po.close() and po.join() calls are essential to make sure that all of the tasks which have been queued finish execution and that the worker processes are terminated. This also eliminates any footprints from your parallel execution, such as zombie processes left behind.
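When you need the results in order and don't need a per-result callback, the same pool can be driven with map, which blocks until every task is done. A sketch, using the builtin abs as a stand-in for a real task so the example stays self-contained:

```python
from multiprocessing import Pool

po = Pool()
# map splits the iterable across the workers and returns the
# results in input order, blocking until every task has finished
results = po.map(abs, range(-3, 4))
po.close()
po.join()
print(results)  # [3, 2, 1, 0, 1, 2, 3]
```

The trade-off is that map gives up the asynchronicity of apply_async plus a callback: the calling process waits for the whole batch.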

So this is my main pattern! What is yours? Please share it, along with any comments you may have on mine.

Thursday, February 19, 2009

Computer Supported Collaborative Science

The title of this post is intentionally the same as this one by Greg Wilson. His post brings up the important issue of the lack of efficient collaborative science tools, and asks for opinions on the subject. Here is my personal view:

Though science invented collaborative open-source development centuries ago, it is currently not the best example of such practices, being surpassed on many fronts by the OSS (Open Source Software) community. The blame for this situation can be partly attributed to commercial scientific publishers, the current methods of evaluating scientific productivity, etc. But the goal of this article is not to discuss that.

The OSS community has matured a series of tools and practices to maximize the rate of collaborative production of good-quality software. By good quality we mean not only bug-free working software, but software which meets criteria such as efficiency, desirability, readability (you can't form a developer community around unreadable code), modularity, etc.

Science currently fails to meet even the most basic criterion it sets for itself: reproducibility. Most papers do not include sufficient information for their results to be replicated independently. You can compare a scientific paper to the compiled binary version of a piece of software: it shows its purpose but does not help those who would like to re-create it independently. In OSS, however, binary files always carry information about where their complete source code can be found and downloaded freely. This closes the circle of reproducibility.

When it comes to collaborating with potentially hundreds of peers in developing code, the OSS community has perfected tools such as distributed version control systems (DVCS), bug trackers, wikis and what not, which have proven indispensable to the production and maintenance of serious OSS projects. Last but not least, OSS projects are never done, which is also a fundamental rule for science but does not apply to scientific papers. Unfinished papers in science are almost worthless (with the notable exceptions of working papers and pre-prints).

So, heading back to the focus of this article, what would be the desirable features of a productive Computer Supported Collaborative Science (CSCS) tool?
  • Free-software
  • Web based interface
  • DVCS for code and manuscripts
  • Wikis for outlining of ideas and hypotheses
  • Bug tracking for reporting of errors in the analysis
  • Database browsing capabilities for uploading experimental data and interactively exploring it
  • Simple visualization tools to explore data. Could be based on Google graph/visualization APIs.
  • For my research area at least: Integrated Sage system to foster interactive/collaborative development of computational analytical methods.
  • Your wish here....
This is my take on the issue, Greg. I even have some grant money to help realize this; the hard part has been finding like-minded collaborators who believe in the idea.

If you read this and think this has already been accomplished by some OSS project, PLEASE let me know.