Friday, August 24, 2007

First Course of Scientific Python was a success!

I just gave the first course based on my book, and it was a success. It ran for five days, four hours per day.

The students were colleagues of mine and some graduate students. Since the audience had different levels of knowledge about Python and programming in general, the first day was spent bringing everybody to the same level in Python. Then we proceeded to explore, through a series of examples, the potential of Python to improve the productivity of scientists. Given the limited amount of time, we explored the topics that were of most interest to everybody:
  • Manipulating data stored in text files;
  • Interacting with databases;
  • Constructing a simple epidemiological model and implementing it using multiple threads;
  • A bit of graph theory using NetworkX;
  • A bit of bioinformatics using Biopython;
  • Integration of Python programs with C and Fortran (we didn't have time to explore Jython);
  • Plus many other bits and pieces such as basic NumPy, Pylab, GUI design using wxGlade, etc.

One thing that surprised me was the excitement that Crunchy caused among the students. I used Crunchy mostly to help me explain code snippets found on the web, but the students demanded to know how to install Crunchy on their computers so they could use it on their own.

I very much enjoyed giving this course. If anyone wants to sponsor a similar course at their institution, just contact me. I'll be glad to give it again, in Portuguese, English or Spanish.

Monday, August 6, 2007

Set implementation performance

I recently blogged about my concerns over CPython being replaced by IronPython in the browser platform. The concerns in that post were mainly political. But now, while investigating the performance of set operations in Python for a project, I decided to compare the set implementations of CPython and IronPython.

So here is my simple code:

# Set implementations benchmark (Python 2 syntax)
import random, time

# Two sets of up to 10000 random integers drawn from 0..100000
seta = set([random.randint(0, 100000) for n in xrange(10000)])
setb = set([random.randint(0, 100000) for n in xrange(10000)])

t0 = time.clock()
for i in xrange(1000):
    seta & setb   # intersection
    seta | setb   # union
    seta ^ setb   # symmetric difference
print "Time: %s seconds" % (time.clock() - t0)

and here are the timings:

$ python
Time: 9.45 seconds
$ ipy
Time: 141.460593000 seconds

CPython is simply 15 times faster than IronPython!

I always like to have an external tool for comparison. So I converted my little Python script to C++ with ShedSkin, compiled it and ran it:

$ ./set_bench
Time: 30.66 seconds

CPython was still more than 3 times faster than the C++ generated by ShedSkin (0.0.21)!!

For reference: I used IronPython 1.0.2467 on .NET 2.0.50727.42 on an Ubuntu machine. It would be nice if someone could re-run this on a Windows box.

If anyone knows of a faster solution for determining the intersection between two sets in Python (perhaps using dictionaries?), I would be very interested to know.
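One way to look for a faster intersection is to time a few candidates side by side with the standard `timeit` module. The sketch below is only an illustration. It is written in Python 3 syntax rather than for the 2007 interpreters benchmarked above, and the helper names `make_sets`, `dict_intersection` and `time_candidates` are made up for this example. Timings depend on the machine and interpreter, so no numbers are quoted here.

```python
import random
import timeit


def make_sets(size=10000, upper=100000, seed=42):
    """Build two sets of random integers, seeded so runs are repeatable."""
    rng = random.Random(seed)
    a = set(rng.randint(0, upper) for _ in range(size))
    b = set(rng.randint(0, upper) for _ in range(size))
    return a, b


def dict_intersection(a, b):
    """Hypothetical dictionary-based alternative: membership tests on dict keys."""
    lookup = dict.fromkeys(b)
    return set(x for x in a if x in lookup)


def time_candidates(a, b, repeats=100):
    """Return a {name: seconds} mapping for each intersection candidate."""
    candidates = {
        "a & b": lambda: a & b,
        "a.intersection(b)": lambda: a.intersection(b),
        "dict_intersection": lambda: dict_intersection(a, b),
    }
    return {name: timeit.timeit(fn, number=repeats)
            for name, fn in candidates.items()}


if __name__ == "__main__":
    seta, setb = make_sets()
    for name, seconds in time_candidates(seta, setb).items():
        print("%-20s %.4f seconds" % (name, seconds))
```

All three candidates return the same set, so only the elapsed time differs between them.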

Weird set behavior

I have been playing with set operations lately, and came across a kind of surprising result, given that it is not mentioned in the standard Python tutorial:

With Python sets, intersections and unions are supposed to be done like this:

In [7]:set('casa') & set('porca')
Out[7]:set(['a', 'c'])

In [8]:set('casa') | set('porca')
Out[8]:set(['a', 'c', 'o', 'p', 's', 'r'])

and they work correctly. What is confusing is what happens if you do:

In [5]:set('casa') and set('porca')
Out[5]:set(['a', 'p', 'c', 'r', 'o'])

In [6]:set('casa') or set('porca')
Out[6]:set(['a', 'c', 's'])

The results are not what you would expect from an AND or OR operation, from the mathematical point of view! Apparently the "and" operation is returning the second set, and the "or" operation is returning the first.
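This matches how Python defines its boolean operators: `x and y` evaluates to `x` if `x` is false and to `y` otherwise, while `x or y` evaluates to `x` if `x` is true and to `y` otherwise. Neither operator builds a new object. A small sketch with the same two sets (Python 3 syntax; the variable names are only for illustration):

```python
casa = set('casa')
porca = set('porca')

# `and` returns its first operand if that operand is falsy, otherwise the second.
and_result = casa and porca
# `or` returns its first operand if that operand is truthy, otherwise the second.
or_result = casa or porca

print(and_result is porca)  # True: the very same object, not a copy
print(or_result is casa)    # True

# An empty set is falsy, so it changes which operand comes back.
empty = set()
print((empty and porca) is empty)  # True
print((empty or porca) is porca)   # True

# The set operators are the ones that compute intersection and union.
print(sorted(casa & porca))  # ['a', 'c']
print(sorted(casa | porca))  # ['a', 'c', 'o', 'p', 'r', 's']
```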

If the Python developers wanted these operations to reflect the traditional (Python) truth value for data structures (False for empty data structures and True otherwise), why not simply return True or False?

So my question is: why has this been implemented in this way? My answer, as many readers have also pointed out, is that sets behave like this so that they can be used in the "and/or trick" that stands in for a ternary operator. But this is very confusing for users who are thinking about sets in a mathematical way, where "AND" means intersection and "OR" means union. I can see this confusing many newbies...
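For readers who have not met it, the "and/or trick" relies on `and` and `or` returning one of their operands. Here is a minimal sketch, with the hypothetical function names `pick_trick` and `pick_conditional`, set next to the conditional expression that Python has offered since version 2.5. It also shows the known weakness of the trick, which bites precisely when the "true" value is an empty set.

```python
def pick_trick(condition, if_true, if_false):
    # The old idiom: fails when if_true is itself falsy (e.g. an empty set).
    return condition and if_true or if_false


def pick_conditional(condition, if_true, if_false):
    # Conditional expression: always honours the condition.
    return if_true if condition else if_false


print(pick_trick(True, "yes", "no"))   # yes
print(pick_trick(False, "yes", "no"))  # no

# With an empty set as the "true" value, the trick falls through to the other branch.
print(pick_trick(True, set(), set('ab')) == set('ab'))    # True (wrong branch)
print(pick_conditional(True, set(), set('ab')) == set())  # True (right branch)
```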