Monday, December 14, 2009

LaTeX Tikz examples

I am currently working on a presentation where I have to include a lot of mathematical graphs, so instead of plotting using matplotlib and inserting the result as a figure, I am considering using specialized LaTeX packages to create the graphs. While searching for documentation, I found this great repository of examples for the Tikz package.

referente a: http://www.texample.net/tikz/examples/ (ver no Google Sidewiki)

Thursday, November 19, 2009

Developing a Wave Robot in Python

I have been so busy lately that I almost forgot to write about my lightning venture into writing a Google Wave robot: Trendy.
For those of you living under a rock in the last several months, Google Wave offers an API for the development of robot, which are... well, robots, which when added to waves, do automated tasks with its contents. There are robots for translating, do syntax highlighting on codde snippets and so on.

My idea for a robot was to create one which would allow for easy insertion of Google trends results into a conversation. Why? because everyone's favorite kind of online conversation is centered around what's hot and what's not... thus the need to quickly check the latests trends in mindshare (as proxied by search volume).

My initial approach was to somehow fetch raw data about trends from google and generate a small plot with the results (using google charts). But I stumbled into the fact that Google has yet to release an API for access to trend data. After a couple of days frustrated, which made me almost give up, I stumbled into a solution, which was both easier to code and faster on the server -- Robots use AppEngine for hosting, but for a robot, you must strive to keep latency as low as possible.

The solution involved embedding Google's own trend gadget, into the wave, replacing of the markup. Did I mention the markup? after fiddling with various possibilities, and decided on this markup: ~~topic1,topic2,topic3,topic4.

Whenever the robot finds such strings, it tries inserts a trends graph for those trends in the blip. make sure you dont forget the period at the end of the topic list, and add some text after the end of the markup, otherwise the robot gets confused... (I'll fix this bug at some point).

Enough talk, If you want to know how I did it, go read the source. If you're curious of how it works and have a wave account, add it to a wave and start playing: trendy-robot@appspot.com

Thursday, October 22, 2009

ASCII Histograms

I have recently come across an interestng problem while working on my random variable implementation for BIP. Any type in Python is expected to have a __str__(self) method which returns an adequate and expressive string representation of the object. Well, as far as I could think, the most straightforward representation of a random variable is its probability distribution. Probability distributions most often depicted graphically by a continuous density function, or a histogram. So my challenge was how to bring the information conveyed by a histogram to a concise ascii string, suitable to be the output of a print statement?

I immediately rejected the boring solution of representing the distribution by its moments (mean, variance, skewness, etc.). I wanted a full histogram in as few ascii characters as possible. So I set out to implement my own ASCII histogram generator. I can anticipate that it was a very simple task given the handy histogram function in Numpy and how easy it is to do string formatting in Python. It was nevertheless a fun couple of hours of programming. I ended up implementing a horizontal and a vertical histogram. The ascii histogram proved to be very useful since it helped enormously in debugging code involving probability calculations with simple print statements. Probabilistic simulations are extremely hard to test because the results of a given operation are never strictly the same. However, they should have the same probability distribution, so by looking at the rough shape of the histogram, you tell you if your calculations are going in the right direction.

Curiously, such a simple and expressive representation for probability distributions is not available in any package I knew, so I decided to share the code with the scientific Python community so that people that may put it to good use. The code below is part of BIP and consequently under GPL license. Any suggestions of improvements are welcome.

# -*- coding: utf-8 -*-
class Histogram(object):
    """
    Ascii histogram
    """
    def __init__(self, data, bins=10):
        """
        Class constructor
        
        :Parameters:
            - `data`: array like object
        """
        self.data = data
        self.bins = bins
        self.h = histogram(self.data, bins=self.bins)
    def horizontal(self, height=4, character ='|'):
        """Returns a multiline string containing a
        a horizontal histogram representation of self.data
        :Parameters:
            - `height`: Height of the histogram in characters
            - `character`: Character to use
        >>> d = normal(size=1000)
        >>> h = Histogram(d,bins=25)
        >>> print h.horizontal(5,'|')
        106            |||
                      |||||
                      |||||||
                    ||||||||||
                   |||||||||||||
        -3.42                         3.09
        """
        his = """"""
        bars = self.h[0]/max(self.h[0])*height
        for l in reversed(range(1,height+1)):
            line = ""
            if l == height:
                line = '%s '%max(self.h[0]) #histogram top count
            else:
                line = ' '*(len(str(max(self.h[0])))+1) #add leading spaces
            for c in bars:
                if c >= ceil(l):
                    line += character
                else:
                    line += ' '
            line +='\n'
            his += line
        his += '%.2f'%self.h[1][0] + ' '*(self.bins) +'%.2f'%self.h[1][-1] + '\n'
        return his
    def vertical(self,height=20, character ='|'):
        """
        Returns a Multi-line string containing a
        a vertical histogram representation of self.data
        :Parameters:
            - `height`: Height of the histogram in characters
            - `character`: Character to use
        >>> d = normal(size=1000)
        >>> Histogram(d,bins=10)
        >>> print h.vertical(15,'*')
                              236
        -3.42:
        -2.78:
        -2.14: ***
        -1.51: *********
        -0.87: *************
        -0.23: ***************
        0.41 : ***********
        1.04 : ********
        1.68 : *
        2.32 :
        """
        his = """"""
        xl = ['%.2f'%n for n in self.h[1]]
        lxl = [len(l) for l in xl]
        bars = self.h[0]/max(self.h[0])*height
        his += ' '*(max(bars)+2+max(lxl))+'%s\n'%max(self.h[0])
        for i,c in enumerate(bars):
            line = xl[i] +' '*(max(lxl)-lxl[i])+': '+ character*c+'\n'
            his += line
        return his
            
if __name__ == "__main__":
    from numpy.random import normal
    d = normal(size=1000)
    h = Histogram(d,bins=10)
    print h.vertical(15)
    print h.horizontal(5)

Tuesday, September 29, 2009

The Internet Manifesto

This document is a must read (see link at the end).

I want to add my own items to it:

1. Net Neutrality is not only protecting Internet content provider corporations' profits

We need to defend net neutrality in a way which goes beyond what is currently done: We need to establish fixed ip addresss for every private individual so that publishing rights don't have to be gatekeeped by large corporations such as Google and the like.

2. Copyright should not be sellable item.

The source of most of the confusion about whether copyright is a good thing or not in the information age, is the fact that most of the commercial exploration of copyrighted materials is not done by the original authors, but by large publishing houses which either coerce authors to give them the commercial right to their work in exchange for pennies, or exploit materials which should be in the public domain for as much as 70 years after the authors death.

3. Network Infrastructure ownership should not be a priviledge of large corporations.

The right to form open "ad hoc" wireless networks should be guaranteed globally, like we have with amateur radio for decades. This is the only way to assure the basic freedom of association and expression.

referente a: Internet-Manifesto (ver no Google Sidewiki)

Monday, September 28, 2009

Python(x,y) A Scientific Python Distribution

I recently came across, this interesting, opensource Python Scientific Distribution for Windows. I normally don't pay too much attention to windows tools, but it's good to have something to recommend to windows users when you want them to try out some Python code.

For Linux users, it's not really relevant, because we all have powerfull package managers to help us get most Python packages installed very easily.

The dowside is that it is a really big download, 300+ MB, and the only mirror available was giving me only about 15 Kbps today (I am on a 18MBps connection). It is so big that it includes Enthought Tool Suite in it and much more.

For users of more civilized Operating systems, like Linux, It's worth checking out one of the editors bundled, spyder, which is available through pypi ("easy_install spyder"). It's a reincarnation of pydee, and despite its beta status, it is very good already.

referente a: python(x,y) - Python for Scientists (ver no Google Sidewiki)

Thursday, September 24, 2009

Scientific Python Group at LinkedIn

I have just created a Scientific Python Group at LinkedIn. I was actually surprised that there wasn't one already.

Python is quickly becoming a major tool in scientific computing and we should all do our best to advertise its capabilities.

referente a: Scientific Python Group | LinkedIn (ver no Google Sidewiki)

Thursday, September 17, 2009

Violin Plot with Matplotlib


One of the things I sorely missed from matplotlib for a very long time, was a violin plot implementation. Many a time, I thought about implementing one myself, but never found the time.

Today, browsing through Matplotlib's documentation, I found the recently added fill_betweenx function. Finally it seemed to have become a piece of cake to implement a violin plot. I Googled for violin plot and Python, to no avail. So I decided to write it myself.

Violin Plots are very similar to Box and whiskers plots, however they offer a more detailed view of a dataset's variability. It's frequently a good idea to combine them on the same plot. So here is what I came up with:


# -*- coding: utf-8 -*-
from matplotlib.pyplot import figure, show
from scipy.stats import gaussian_kde
from numpy.random import normal
from numpy import arange


def violin_plot(ax,data,pos, bp=False):
'''
create violin plots on an axis
'''
dist = max(pos)-min(pos)
w = min(0.15*max(dist,1.0),0.5)
for d,p in zip(data,pos):
k = gaussian_kde(d) #calculates the kernel density
m = k.dataset.min() #lower bound of violin
M = k.dataset.max() #upper bound of violin
x = arange(m,M,(M-m)/100.) # support for violin
v = k.evaluate(x) #violin profile (density curve)
v = v/v.max()*w #scaling the violin to the available space
ax.fill_betweenx(x,p,v+p,facecolor='y',alpha=0.3)
ax.fill_betweenx(x,p,-v+p,facecolor='y',alpha=0.3)
if bp:
ax.boxplot(data,notch=1,positions=pos,vert=1)

if __name__=="__main__":
pos = range(5)
data = [normal(size=100) for i in pos]
fig=figure()
ax = fig.add_subplot(111)
violin_plot(ax,data,pos,bp=1)
show()


The next step now is to contribute this plot to Matplotlib, but before I do that, I'd like to get some comments on this particular implementation. Moreover, I don't know if it'd be acceptable for Matplotlib to add Scipy as a dependency. But since re-implementing kernel density estimation for a simple plot would be overkill, maybe the destiny of this implementation will be to live on as an example for others to adapt and use.

WARNING: This code requires maplotlib 0.99 (maybe 0.99.1rc1) to work because of the fill_betweenx function.

Friday, September 11, 2009

Why Python

I recently came across a repository of jewels in the form of the unpublished manuscripts of E.W. Dijkstra. I just finished reading one entitled "Some Meditations on Advanced Programming". It is amazingly well written and still so relevant to present-day computer science, that I recommend anyone with at least a passing interest on the subject to read it.

I was pleasantly surprised to find in the last remark of the manuscript, something which can be considered, IMHO, the most fundamental design principle of Python and the my main reason to love it. I quote:

"As my very last remark I should like to stress that the tool as a whole should have still another quality. It is a much more subtle one; whether we appreciate it or not depends much more on our personal taste and education and I shall not even try to defined it. The tool should be charming, it should be elegant, it should be worthy of our love. This is no joke, I am terribly serious about this. In this respect the programmer does not differ from any other craftsman: unless he loves his tools it is highly improbable that he will create something of superior quality.

At the same time these considerations tell us the greatest virtues a program can show: Elegance and Beauty."

Wednesday, September 2, 2009

New-style String Formatting and LaTeX

The recommended new way (since 2.6) of doing string formating in Python is to use the format method of string objects instead of the % operator and %s (and variants) placemarks.

I decided to check it out to generate some LaTeX tables programatically. Bad Idea. the method expects placemarks like this: "some string {key}".format(key=123) and that is a big problem when formatting LaTeX strings. Curly braces are a very significant character in LaTeX, and therefore you will get a Key Errors all the time. The only solution I see is that format would silently ignore Key Errors. But that is not currently an option.

Has anyone else devised a solution to this? The scary part of this is that the official documentation says that the % operator for string formatting will eventually be removed (?!?) from the language.

Friday, July 10, 2009

Chrome OS and Google World Domination

The web platform is starting to get annoying in its attempt to re-invent and re-create everything that previously existed in the Desktop space.

Finally, Desktop developers are beginning to wake up to this problem.

For years, browser-developing companies have tried to turn the Browser into an operating-system-like autonomous system. While the rest of us were trying to figure out how to get our favorite tools (Python) into the browser space.

No one seems to have fully realized yet, that it is much easier for us, Desktop denizens to conquer the world than for the browser people. They miss almost everything from the kernel up, while all we need to do is integrate better the web into the desktop.

We don't even need browsers. Minimize your browser (and all other windows you may have open) an look at that beautiful picture in you desktop. When was the last time you had a productive relationship with your desktop? Mine is used mainly to hang a pretty picture and place a bunch of useless files stashed to one side so they don't mess up the view. But even with my beautiful wallpaper unblocked by the icons on the desktop, I rarely look at it, because my browser is always blocking the view.

IMHO, the web should take the place of the walpaper and desktop apps should be gradually adapted to be more and more web aware, so that we don't need thousands of browser extensions duplicating the functionalities of our desktop apps which can't deal with the web.

My extension loaded Firefox, currently takes almost as long to start as Ubuntu to boot. Why do I need this mamooth when I could have the web on my desktop and integrated with all of my operating system? Why on earth do we need yet another OS?

Chrome OS according to recently released information, is a linux kernel with a light-weight graphics layer (not X) dedicate to a browser (chrome in this case). They are doing exactly what I proposed above, but I am not prepared to give away my desktop for that. What we need to do now, is to prove that the browser is the problem, not the operating system.

Friday, April 17, 2009

Parallel version of the Gillespie solver

A few months back I blogged about my implementation of the Gillespie SSA algorithm for solving stochastic differential equations. I have now improved the code allowing for parallel execution on SMP systems. If you don't know what this is all about, please refer the the first post for a quick intro to what a stochastic differential equation is and what it means to numerically solve a system of them.

The code of the solver is in this module. There is an example in the package which is also worth running. On a two core machine the simulation runs about twice as fast, a expected. The parallelization was implemented using the multiprocessing module, and ammounts to roughly 4 lines of code. I love multiprocessing. and it's also compatible with psyco which makes the simulations run twice as fast (on top of the parallelization gains). Probably my next shot at improving its performance will be using scipy.weave and migrate the core of the algorithm to C. Simpler optimizations schemes such as cython (my quick port), runs twice as slow as my python code. Cython can be quite tricky to optimize beyond Python performance, and generates horribly verbose C code (if you don't believe me, try running cython on a print "hello world", you will get a 1000+ lines hello.c !!!).

It is worth noting that the code for the solver and examples are being maintained as a subpackage of BIP, my Bayesian inference package. Even though it is a bit unrelated, Since I frequently have to do Bayesian inference on mathematical models (both deterministic and stochastic), I thought it would be a good idea to keep the solver to those models in the same package as the analysis tools.

Thursday, March 5, 2009

Fixing Rkward startup problems

Rkward is a Nice IDE project for R based on standard KDE components such as Kate, etc.

However, most people wanting to try it (version 0.5b) have come across a startup bug, where Rkward says it can't find some plugins, resulting in a useless interface since some plugins are required for proper operation.

After spending quite some time looking for the solution to this problem, and finding it, I decided to document it here in three simple steps:

  1. The first task is to find the file 'all.pluginmap': just type "locate all.pluginmap" at the console. Mine was located on "/usr/share/apps/rkward/all.pluginmap"
  2. Open "~/.kde/share/config/rkwardrc" with your favorite editor and fix the path to "all.pluginmap" you have just found out in the variable "Plugin Maps" under "[Plugin Settings]".
  3. The if you start Rkward again, you will stumble on another bug (derived again from a hardcoded path) which tells you that some php file cannot be found. To fix this one you have to make a link on your home directory to an Rkward instalation folder. Open a terminal and type "ln -s /usr/share/apps/rkward ~/rkward".
That should get you a working Rkward installation. Enjoy it!

Friday, February 27, 2009

Real-time Plotting from Numerical Simulations

Complex numerical simulations usually take a long time to run while using full CPU(s). If something goes wrong during such a run, we generally only find out in the end of the run when the traces of what went wrong are no longer available, debugging such codes is also not an easy task, because the code gets even slower running through the debugger, and when we don't know where the code is going to break down, it can become a painstaking process.

So any way to monitor such a running code without slowing it down is always welcome. As the title of this article points out, if your code is a numerical simulation, real-time plots of its progress is an extremely useful thing to have. However, traditional Python plotting tools such as Matplotlib, which I have been using for many years, is not a viable solution since it is not very fast and don't support very well real-time plotting.

Recently, while going crazy debugging one such simulation I decide to come up with a solution for real-time plotting in Python. I examine many candidates which I won't mention here, in order to keep this story simple. I finally settled down on a old but still very good solution: Gnuplot!

Since Gnuplot is a stand-alone program, implemented in C, it is very fast at drawing plots and with the help of python-gnuplot, I was able write a class with methods which would send data to Gnuplot and returned immediately without slowing down my running Python code! Gnuplot, on it's side plotted whatever data I throwed at it very fastly, giving me my much needed real-time scope into my simulations. Gnuplot is a real example of the Unix philosophy: Do one thing, and do it well!!

Usage pattern for multiprocessing

I have found myself using multiprocessing more and more these days, especially since it has been backported to Python 2.5.x.
I can no longer remember what was the last computer I owned, which had a single core/processor. Multiprocessing make me much happier as I know I can milk my hardware for all it has to offer. I numpy, scipy and related packages find ways to build parallelization into their libraries as soon as possible.

This goal of this article is to share the simple usage pattern I have adopted to parallelize my code which is both simple and works in both single and multi-core systems. As a disclaimer, much of my day-to-day code, involves repeated calculations of some sort, which are prone to be asynchronously distributed. This explains the pattern I am going to present here.

I start setting up a process pool, as close as possible to the code which will be distributed, preferably in the same local namespace. This way I don't keep processes floating around after my parallel computation is done:


# -*- coding: utf-8 -*-
from numpy import arange,sqrt, random, linalg
from multiprocessing import Pool

global counter
counter = 0
def cb(r):
global counter
print counter, r
counter +=1

def det(M):
return linalg.det(M)

po = Pool()
for i in xrange(1,300):
j = random.normal(1,1,(100,100))
po.apply_async(det,(j,),callback=cb)
po.close()
po.join()
print counter


The call Pool() returns a pool of as many processes as there are cores/cpus available. This makes the code perform optimally on any number of cores, even on single core machines.

The use of a callback function allows for true asynchronicity, since my loop does not have to wait for apply_async to return.

finally, the po.close() and po.join() calls are essential to make sure that all 300 processes which have been fired up finish execution and are terminated. This also eliminates any footprints from your parallel execution, such as zombie processes left behind.

So this is my main pattern! what is yours? please share it and any comments you may have on mine.

Thursday, February 19, 2009

Computer Supported Collaborative Science

The title of this post is intentionally the same as this one by Greg Wilson. His post brings up the important issue of the lack of efficient collaborative science tools, and asks for opinions on the subject. Here is my personal view:

Though science has invented collaborative open-source development centuries ago, it is currently not the best example of such practices, being surpassed on many fronts by the OSS (Open Source Software) community. The blame for this situation can be partly attributed to commercial scientific publishers, the current method of evaluation of scientific productivity etc. But the goal of this article is not to discuss this.

The OSS community have matured a series of tools and practices to maximize the rate of collaborative production of good quality software. By good quality we mean not only bug-free working software, but software which meets criteria such as: efficiency, desirability, readability (you can't form a developer community around unreadable code), modularity,etc.

Science currently fails to even meet the most basic criterion it sets for itself: reproducibility. Most papers do not include sufficient information for its results to be replicated independently. You can compare a scientific paper, to the binary compiled version of a software, it shows its purpose but does not help those which would like to re-create it independently. However in OSS, Binary files always carry information about where its complete source code can be found and downloaded freely. This closes the circle of reproducibility.

When it comes to collaborating with potentially hundreds of peers in developing code, The OSS community have perfected tools such as distributed version control systems(DVCS), bug trackers, wikis and what not, which have been proven indispensable to the production and maintenance of serious OSS projects. Last but not least, OSS projects are never done, which is also a fundamental rule for science, but does not applay to scentific papers. Unfinished papers in science are almost worthless(with the notable exceptions of workng papers and pre-prints).

So, heading back to the focus of this article, what would be the desirable fatures of a productive Computer Supported Collaborative Science (CSCS) tool?
  • Free-software
  • Web based interface
  • DVCS for code and manuscripts
  • Wikis for outlining of ideas and hypotheses
  • Bug tracking for reporting of errors in the analysis
  • Database browsing capabilities for uploading experimental data and interactively exploring it
  • Simple visualizations tools to explore data. Could be based on Google graph/visualization APIs.
  • For my research area at least: Integrated Sage system to foster interactive/collaborative development of computational analytical methods.
  • Your wish here....
This is my take on the issue Greg, I even have some grant money to help realize this, the hard part has been to find like-minded collaborators which believe in the idea.

If you read this and think this has already been accomplished by some OSS project, PLEASE let me know.

ccp

Amazon