Python in Science: July 2007

Monday, July 30, 2007

Updating all your Eggs

Lately, I have used setuptools (easy_install) more and more to manage third-party Python modules and packages on my computer. Using easy_install generally gives you a more up-to-date package than distribution packages generally do (I am talking about Linux here, though my solution can help windows users as well). The problem is that package management systems of most distributions have one advantage over setuptools: they will warn you when a new version of a given package is available.

As I like to live on the cutting-edge of things, I tend to stick with easy_install most of the time, and recently, I began to suspect that I had a more easy_installed packages in my system than I could recall by name. This thought led to another: "How in the world can I keep track of all those packages, let alone update them when needed?"

So I decided to post my concerns to the distutils list and afters a quick and enlightening exchange of emails I had a solution to my problem:

With this script I could not only find out how many eggs I had installed under easy_install (210), but also I could update them all!

This is obviously not the best solution to the problem, but it helped me. If you have suggestions or a better solution, I would like to hear about.

You can get the script here.

Thursday, July 26, 2007

My book is out!

If you use or teach Python and happen to speak portuguese, you may want to take a look at my recently published book: "Computação Científica com Python".

Thursday, July 12, 2007

On the "multiplatformness" of Python

I will allow my self to stray a little bit away from my normal focus of posting about science-related pythonic stuff, to raise a question that has been on my mind lately: the future of Python. No, it's not about Py3K!

One of the strong selling points of Python, in my opinion, is its support for multiple platforms. If you write a pure Python program you can run it on just about any OS/hardware combination. Nowadays, however the concept of a "platform" is shifting away from its usual meaning of an OS+CPU combination. The web, i. e., the browser is becoming the "platform of choice" by projects that, maybe 5 years ago, would be done in Python as a desktop application.

What worries me, is that while Micro$oft(silverlight), Adobe(Air), and others are investing heavily in adding support for their development languages and tools to the "new platform", the Browser apparently does not figure in Python's plans for the future.

I think that is a pity, because Python is the best suited language, in my opinion, to unleash the power of the browser platform, with its "batteries included" and easy-to-learn nature. I shiver at the prospect of having to depend on Jython or IronPython to be able to have my python code run from within a browser. Python should be aiming for the place Javascript occupies right now.

So, thats my lament... I hope that Python developers, after reading this post tell me I am wrong. That projects like MozPython should have never been abandoned, and that we can hope that exciting projects, like Crunchy, will one day, be runnable from within the browser.

Monday, July 9, 2007

Fastest Python Database Interface

Writing data to databases is something that I have to do quite often. In the last few years I have used many different databases and their python interfaces: Pysqlite, Pysqlite2, MySQldb, PsycoPg and PsycoPG2, SQlObject and SQLAlchemy. In many cases, when the data is to be used only by a Python program, I have used Pickle and cPickle.

Up until now, my criteria to pick a database and a given inteface, where mainly based on convenience or the need to use a particular database backend. However, recently, after having chosen SQLObject for a project based only on how neatly organized the resulting code looked like, I stumbled upon a major performance issue with SQLObject: it was extremely slow for a large number of inserts.

After searching all over and asking on SQLObject mailing list, I found out that trying to improve its insert speed was a lost cause. Apparently, something in the core design of SQLObject, impossibilitates it to do the equivalent of an executemany() when doing inserts. The executemany() method, which we find in all(?) other database interfaces, gives us a tremendous boost in insertion speed.

So I finally decided to do what I must have done years ago: write a simple benchmark script for all the interfaces I cared about.
I simply created a table, inserted 100000 short records into the database, and selected them back to check if the insertion operation had been completed. Then I dropped the table. I repeated this operation for all interfaces.

Since I have never before come across a straightforward comparison like the one I did, I decide to blog about it and save other people from learning it the hard way, like me.

So, without further ado, lets see the results:

Starting benchmarking with 100000 records (inserting and selecting)

Inserting 100000 records into SQLite(Memory) with SQLObject
Number of records selected: 100000
Time for SQLObject with db in memory: 68.07 seconds

Inserting 100000 records into SQLite(Memory) with sqlite3 module
Number of records selected: 100000
Time for Sqlite3 with db in memory: 2.87 seconds

Inserting 100000 records into PostgreSQL with PsycoPG2
Number of records selected: 100000
Time for PsycoPG2: 19.49 seconds

Inserting 100000 records into SQLite(Disk) with sqlite3 module
Number of records selected: 100000
Time for Sqlite3 with db on Disk: 3.02 seconds

Inserting 100000 records into SQLite(Disk) with SQLAlchemy
Number of records selected: 100000
Time for SQLAlchemy with SQlite db in Disk: 18.37 seconds

Inserting 100000 records into SQLite(memory) with SQLAlchemy
Number of records selected: 100000

Time for SQLAlchemy with SQlite db in Memory: 18.02 seconds

Inserting a 100000 records into a Pickle file with cPickle
Number of records loaded: 100000
Time for cPickle: 0.39 seconds

Inserting 100000 records into MySQL with MySQLdb
Number of records selected: 100000
Time for MySQL: 1.28 seconds

I was surprised at a few things (ok, at just about everything!):

How slow SQLObject really was (not really, I guess I already expected that...)
SQLite was not substantially faster with the DB in memory when compared to an "on disk" DB.
SQLite is pretty darn fast!
MySQL is even Faster!!!
PsycoPg2 is not as fast as people say...
SQLAlchemy does an excellent job at doing everything it does and not sacrificing performance in the process!

Before Running the script, make sure you have MySQL and PostgreSQL running, and adjust user names and passwords in the code. You also need to create a database "People" in both these servers.

One Last observation: for sqlite, I am using the sqlite3 module from Python 2.5's standard library. I don't know if it is based on Pysqlite or not. I wonder if Pysqlite2 would give different performance results.

I would appreciate any comments, and suggestions to extract more performance from each interface. The source code for the benchmarking is here.

Python in Science