Sunday, December 9, 2007
Free ZODB!
I think ZODB is a great product, the best Object-Oriented database for Python. It's a shame it does not have an active and vocal user community. Maybe ZODB has been eclipsed by Zope for too long. Zope is a cool project, but ZODB could appeal to a much wider audience than just web application developers.
Every time I start a new application based on ZODB, I think: "what if the ZODB developers stop developing it...", "what if I find a serious bug or limitation in it, and can't motivate (or contact) the developers to fix it...".
I think many people abandon the idea of using ZODB in their projects because they can't identify a strong community interested in issues beyond serving Zope applications.
Yet, ZODB has been alive for such a long time, feeding on the leftovers from the Zope community...
Isn't it about time ZODB is emancipated into a project standing on its own merits (which are many) and attracting its own community?
Monday, November 5, 2007
New release of Epigrass coming soon!
So here are the things I have done so far and some I expect to get done before the new release:
- Port to Qt4: Done. Not as hard as I suspected it would be. Now it will be possible to install it on Windows. I also made a lot of improvements to the GUI, such as fixing bugs in the layouts and figuring out how to update the GUI (progress display) while the model is running.
- Wrote a GUI-based .epg editor. Models are defined in Epigrass through a configuration file which is parsed by the ConfigParser module. To minimize the chances of users messing up the syntax of the file, I cooked up a configuration editor from scratch, which can be run independently of Epigrass to create and edit .epg files.
- Real-time visualization window: Partially written. I am trying to put together a map animation of the simulation which would run on a separate thread using Pyglet's OpenGL interface. I am currently trying to figure out how to do OpenGL viewport auto-resize, so that I can load different maps and have them fit the window (my current attempt is sketched after this list). If anyone knows how to do that properly, please tell me (I am learning OpenGL as I go).
- Update Epigrass's KML output to generate time animations of the map: Planning stage. This will require generating multiple layers of the map, one for each time step. I will probably have to generate a KMZ (a zipped KML) instead of a KML in order to keep the file size from becoming very large. Most of the code is done, since I will just re-use the KML-generation code and call the layer generator multiple times, changing the colors of the polygons at each step.
- Re-write the simulation scheduler to allow parallelization of the simulation via Parallel Python (PP): Planning stage. The coding part will be easy, but I expect a lot of debugging to deal with all the dependency code which needs to be declared when creating the separate jobs with PP. Moreover, I have no idea if I will stumble on unpicklable stuff. I just hope that if I do, I can find a way around the problem.
- Revamp the setup.py to conform to the latest version of setuptools. It's been broken for a while...
- Maybe try PyInstaller to release single-file executables and avoid dependency hell at install time. Many dependencies are not easy_installable yet...
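About the OpenGL viewport question above: the direction I am currently exploring looks roughly like the sketch below, assuming a pyglet window and a flat orthographic 2D projection (the map extent values are placeholders). Corrections are welcome, since I am learning OpenGL as I go.

from pyglet import window
from pyglet.gl import *

map_width, map_height = 1000.0, 800.0  # hypothetical extent of the currently loaded map

class MapWindow(window.Window):
    def on_resize(self, width, height):
        # keep the GL viewport in sync with the window size
        glViewport(0, 0, width, max(1, height))
        # rebuild an orthographic projection so the whole map always fits the window
        glMatrixMode(GL_PROJECTION)
        glLoadIdentity()
        glOrtho(0, map_width, 0, map_height, -1, 1)
        glMatrixMode(GL_MODELVIEW)

win = MapWindow(resizable=True)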
Friday, September 21, 2007
The Wonders of Pyglet
The most important feature of Pyglet is that it's being designed from the ground up to be OS independent (Linux, Win, OSX) without external dependencies. For that, it uses the standard OpenGL implementation of each of these platforms via ctypes. This makes it my last best hope for a multi-platform graphical interface kit.
It already has a growing widget library, support for laying out HTML documents, importing of 3D models created with Wings3D, a 2D scene module with support for sprites and collision detection, and a lot more. Most of this functionality is available only in the SVN version; the (stable) release is somewhat more limited.
Go check it out! It is (IMHO) one of the few truly exciting graphical libraries in the Python scene.
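To give an idea of how little boilerplate is involved, here is a minimal sketch of the classic pyglet main loop (written against the 1.0-style API; the window size and colors are arbitrary):

from pyglet import window
from pyglet.gl import glClearColor

win = window.Window(width=640, height=480, caption="pyglet test")
glClearColor(0.2, 0.2, 0.2, 1.0)  # dark grey background

while not win.has_exit:
    win.dispatch_events()  # process keyboard, mouse and window events
    win.clear()            # clear the color buffer
    # drawing calls would go here
    win.flip()             # show the back buffer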
Monday, September 17, 2007
Parallel Processing in CPython
Before I go on, the usual disclaimer: I know that PP does multi-processing, not multi-threading, which is what the GIL won't let you do. But PP offers a very simple and intuitive API that can be used both for multi-core CPUs and for clusters. If, after seeing what PP can do for you, you still believe you need threads, use Jython!!
All examples here were run on a Xeon quad core with 4GB of RAM, running Ubuntu Feisty. The Python interpreters used were CPython 2.5.1, Jython 2.1 on Java 1.6.0, and IronPython 1.0.2467.
Let's start with Parallel Python: I am using an example taken straight from PP's web site. Here is the code:
#!/usr/bin/python
# File: dynamic_ncpus.py
# Author: Vitalii Vanovschi
# Desc: This program demonstrates parallel computations with pp module
# and dynamic cpu allocation feature.
# Program calculates the partial sum 1-1/2+1/3-1/4+1/5-1/6+... (in the limit it is ln(2))
# Parallel Python Software: http://www.parallelpython.com
import math, sys, md5, time
import pp
def part_sum(start, end):
    """Calculates partial sum"""
    sum = 0
    for x in xrange(start, end):
        if x % 2 == 0:
            sum -= 1.0 / x
        else:
            sum += 1.0 / x
    return sum
print """Using Parallel Python"""
start = 1
end = 20000000
# Divide the task into 64 subtasks
parts = 64
step = (end - start) / parts + 1
# Create jobserver
job_server = pp.Server()
# Execute the same task with different amount of active workers and measure the time
for ncpus in (1, 2, 4, 8, 16, 1):
    job_server.set_ncpus(ncpus)
    jobs = []
    start_time = time.time()
    print "Starting ", job_server.get_ncpus(), " workers"
    for index in xrange(parts):
        starti = start+index*step
        endi = min(start+(index+1)*step, end)
        # Submit a job which will calculate partial sum
        # part_sum - the function
        # (starti, endi) - tuple with arguments for part_sum
        # () - tuple with functions on which function part_sum depends
        # () - tuple with module names which must be imported before part_sum execution
        jobs.append(job_server.submit(part_sum, (starti, endi)))
    # Retrieve all the results and calculate their sum
    part_sum1 = sum([job() for job in jobs])
    # Print the partial sum
    print "Partial sum is", part_sum1, "| diff =", math.log(2) - part_sum1
    print "Time elapsed: ", time.time() - start_time, "s"
job_server.print_stats()
and here are the results:
Using Parallel Python
Starting 1 workers
Partial sum is 0.69314720556 | diff = -2.50000421476e-08
Time elapsed: 7.85552501678 s
Starting 2 workers
Partial sum is 0.69314720556 | diff = -2.50000421476e-08
Time elapsed: 4.37666606903 s
Starting 4 workers
Partial sum is 0.69314720556 | diff = -2.50000421476e-08
Time elapsed: 2.11173796654 s
Starting 8 workers
Partial sum is 0.69314720556 | diff = -2.50000421476e-08
Time elapsed: 2.06818294525 s
Starting 16 workers
Partial sum is 0.69314720556 | diff = -2.50000421476e-08
Time elapsed: 2.06896090508 s
Starting 1 workers
Partial sum is 0.69314720556 | diff = -2.50000421476e-08
Time elapsed: 8.11736106873 s
Job execution statistics:
job count | % of all jobs | job time sum | time per job | job server
      384 |        100.00 |      67.1039 |     0.174750 | local
Time elapsed since server creation 27.0066168308
In order to compare it with threading code, I had to adapt the example to use threads. Before I fed the new code to Jython, I ran it through CPython to illustrate the fact that, under the GIL, threads are not executed in parallel but one at a time. This first run also serves as a baseline to compare the Jython results against.
The code is below. Since Jython 2.1 does not have the built-in sum function, I implemented it with reduce (there was no perceptible performance difference compared with the built-in sum).
#jython threads
import math, sys, time
import threading
global psums
def part_sum(start, end):
    """Calculates partial sum"""
    sum = 0
    for x in xrange(start, end):
        if x % 2 == 0:
            sum -= 1.0 / x
        else:
            sum += 1.0 / x
    psums.append(sum)

def sum(seq):
    # no sum in Jython 2.1, we will use reduce
    return reduce(lambda x,y:x+y,seq)
print """Using: jython with threading module"""
start = 1
end = 20000000
# Divide the task into 64 subtasks
parts = 64
step = (end - start) / parts + 1
for ncpus in (1, 2, 4, 8, 16, 1):
    # Divide the task into n subtasks
    psums = []
    parts = ncpus
    step = (end - start) / parts + 1
    jobs = []
    start_time = time.time()
    print "Starting ", ncpus, " workers"
    for index in xrange(parts):
        starti = start+index*step
        endi = min(start+(index+1)*step, end)
        # Start a thread which will calculate a partial sum
        # part_sum - the target function
        # (starti, endi) - tuple with arguments for part_sum
        t = threading.Thread(target=part_sum, name="", args=(starti, endi))
        t.start()
        jobs.append(t)
    # wait for threads to finish
    [job.join() for job in jobs]
    # Retrieve all the results and calculate their sum
    part_sum1 = sum(psums)
    # Print the partial sum
    print "Partial sum is", part_sum1, "| diff =", math.log(2) - part_sum1
    print "Time elapsed: ", time.time() - start_time, "s"
and here are the results for CPython:
Using: CPython with threading module
Starting 1 workers
Partial sum is 0.69314720556 | diff = -2.50001152002e-08
Time elapsed: 8.17702198029 s
Starting 2 workers
Partial sum is 0.69314720556 | diff = -2.50001570556e-08
Time elapsed: 10.2990288734 s
Starting 4 workers
Partial sum is 0.69314720556 | diff = -2.50001127577e-08
Time elapsed: 11.1099839211 s
Starting 8 workers
Partial sum is 0.69314720556 | diff = -2.50001097601e-08
Time elapsed: 11.6850161552 s
Starting 16 workers
Partial sum is 0.69314720556 | diff = -2.50000701252e-08
Time elapsed: 11.8062999249 s
Starting 1 workers
Partial sum is 0.69314720556 | diff = -2.50001152002e-08
Time elapsed: 11.0002980232 s
Here are the results for Jython:
Using: jython with threading module
Starting 1 workers
Partial sum is 0.6931472055600734 | diff = -2.500012807882257E-8
Time elapsed: 4.14300012588501 s
Starting 2 workers
Partial sum is 0.6931472055601045 | diff = -2.500015916506726E-8
Time elapsed: 2.0239999294281006 s
Starting 4 workers
Partial sum is 0.6931472055600582 | diff = -2.5000112868767133E-8
Time elapsed: 2.1430001258850098 s
Starting 8 workers
Partial sum is 0.6931472055600544 | diff = -2.500010909400885E-8
Time elapsed: 1.6349999904632568 s
Starting 16 workers
Partial sum is 0.6931472055600159 | diff = -2.5000070569269894E-8
Time elapsed: 1.2360000610351562 s
Starting 1 workers
Partial sum is 0.6931472055600734 | diff = -2.500012807882257E-8
Time elapsed: 2.4539999961853027 s
And lastly, the results for IronPython:
Using: IronPython with threading module
Starting 1 workers
Partial sum is 0.6931472055601 | diff = -2.50001280788e-008
Time elapsed: 13.6127243042 s
Starting 2 workers
Partial sum is 0.6931472055601 | diff = -2.50001591651e-008
Time elapsed: 7.60165405273 s
Starting 4 workers
Partial sum is 0.6931472055601 | diff = -2.50001128688e-008
Time elapsed: 8.14302062988 s
Starting 8 workers
Partial sum is 0.6931472055601 | diff = -2.5000109205e-008
Time elapsed: 8.32349395752 s
Starting 16 workers
Partial sum is 0.6931472055600 | diff = -2.50000707913e-008
Time elapsed: 8.37589263916 s
Starting 1 workers
Partial sum is 0.6931472055601 | diff = -2.50001280788e-008
Time elapsed: 10.3567276001 s
Now on to some final considerations. The quality of a parallelization tool should be measured not by how fast it is, but by how well it scales. The attentive reader may have noticed that Jython's threads were twice as fast as PP. But is that performance related to the threads? No, since Jython was already faster than CPython (with threading or with PP) for a single thread. PP scaled better up to the available number of cores, consistently halving the time when doubling the number of cores used. Jython halved the time when it went from one to two threads, but failed to halve it again when going to four threads. I'll give it a break here, since it recovered at 8 and 16 threads.
Threads alone are not the answer if they are not well implemented. Look at the results for IronPython: it seems unable to take advantage of more than two threads on a four-core system. Can anyone explain this? I'd be curious to know why.
Wednesday, September 12, 2007
ZODB vs Durus
Despite the simplicity of my test code, there was one surprising result: both databases used files as their storage, but the file size for Durus was 3.7MB for a million records, while the ZODB file was 23.7MB!
Both database systems offer the option of packing their stores to reduce size, but this feature was not used here. Besides, packing a ZODB storage file requires an amount of free disk space equal to the file itself, which only makes matters worse for ZODB. Please also check Michael's blog for a very interesting benchmark of Durus vs cPickle.
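For reference, packing is a one-liner in both libraries; it was deliberately left out of the benchmark. A sketch, assuming the db and conndurus handles from the code below (the Durus call is from memory, so double-check it):

db.pack()          # ZODB: rewrite the FileStorage, dropping old object revisions
conndurus.pack()   # Durus: ask the connection to pack its underlying storage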
Here is the code:
import time, os, glob
import ZODB
from ZODB import FileStorage, DB
import pylab as P
from durus.file_storage import FileStorage as FS
from durus.connection import Connection
def zInserts(n):
    print "Inserting %s records into ZODB"%n
    for i in xrange(n):
        dbroot[i] = {'name':'John Doe','sex':1,'age':35}
    connection.transaction_manager.commit()

def DurusInserts(n):
    print "Inserting %s records into Durus"%n
    for i in xrange(n):
        Droot[i] = {'name':'John Doe','sex':1,'age':35}
    conndurus.commit()
recsize = [1000,5000,10000,50000,100000,200000,400000,600000,800000,1000000]
zperf = []
durusperf =[]
for n in recsize:
    # remove old databases
    if os.path.exists('testdb.fs'):
        [os.remove(i) for i in glob.glob('testdb.fs*')]
    if os.path.exists('test.durus'):
        os.remove('test.durus')
    # setup ZODB storage
    dbpath = 'testdb.fs'
    storage = FileStorage.FileStorage(dbpath)
    db = DB(storage)
    connection = db.open()
    dbroot = connection.root()
    # setting up the Durus database
    conndurus = Connection(FS("test.durus"))
    Droot = conndurus.get_root()
    # begin tests
    t0 = time.clock()
    zInserts(n)
    t1 = time.clock()
    # closing and reopening ZODB's database to make sure
    # we are reading from file and not from some memory cache
    connection.close()
    db.close()
    storage = FileStorage.FileStorage(dbpath)
    db = DB(storage)
    connection = db.open()
    dbroot = connection.root()
    t2 = time.clock()
    print "Number of records read from ZODB: %s"%len(dbroot.items())
    t3 = time.clock()
    ztime = (t1-t0)+(t3-t2)
    zperf.append(ztime)
    print 'Time for ZODB: %s seconds\n'%ztime
    t4 = time.clock()
    DurusInserts(n)
    t5 = time.clock()
    conndurus = Connection(FS("test.durus"))
    Droot = conndurus.get_root()
    t6 = time.clock()
    print "Number of records read from Durus: %s"%len(Droot.items())
    t7 = time.clock()
    Dtime = (t5-t4)+(t7-t6)
    durusperf.append(Dtime)
    print 'Time for Durus with db on Disk: %s seconds\n'%Dtime
P.plot(recsize,zperf,'-v',recsize,durusperf,'-^')
P.legend(['ZODB','Durus'])
P.xlabel('inserts')
P.ylabel('time(s)')
P.show()
Tuesday, September 11, 2007
ZODB vs Relational Database: a simple benchmark
Since this blog is about Python, I soon felt bad about not including ZODB in that comparison. At the time I justified the omission by telling myself that ZODB cannot be compared to standard DBs because it is an object database. Subconsciously, I thought ZODB would lose so badly in a race against relational databases that I feared for its reputation. Silly me.
The truth is: object databases such as ZODB can be a perfect replacement for relational databases in a large portion (if not the majority) of database-driven applications. Had I stopped to look more carefully at ZODB before, I would have saved countless hours of struggle with ORMs.
As you can see in the figure above, for up to 100,000 inserts per transaction ZODB's performance is comparable to SQLite3, and since ZODB allows you to store arbitrarily complex objects, you don't have to cook up complex SQL queries to get at the data you need: the relation between each datum is given by the design of the object you are storing. In some apps of mine, I have to write code to extract the data from my Python objects and put it in table format (to store in a relational db), and then, when I read it back, I need more code to put it back where it belongs. With ZODB, none of that is necessary.
ZODB stores your data in a file, like SQLite, but it also supports other storage types; see this table for a comparison of storage types.
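Swapping storages is a one-line change. For example (a sketch only; MappingStorage is the in-memory storage that ships with ZODB, and the ZEO address is hypothetical and assumes a running ZEO server):

# keep everything in RAM (handy for tests)
from ZODB.MappingStorage import MappingStorage
storage = MappingStorage()

# or talk to a ZEO server shared by several clients
from ZEO.ClientStorage import ClientStorage
storage = ClientStorage(('localhost', 8100))

db = DB(storage)  # the rest of the code stays the same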
ZODB is certainly one of the hidden jewels of Zope. Due to the lack of good documentation (an exception, though somewhat outdated), many Python programmers either don't know that ZODB can be used outside of Zope or don't know how to get started with it.
The goal of this post is not to serve as a tutorial on ZODB, since I am hardly an expert on the subject, but to spark interest in adopting ZODB for mundane applications outside Zope.
Let's get to the code:
import time, os, glob
import sqlite3
import ZODB
from ZODB import FileStorage, DB
import pylab as P
def zInserts(n):
    print "Inserting %s records into ZODB"%n
    for i in xrange(n):
        dbroot[i] = {'name':'John Doe','sex':1,'age':35}
    connection.transaction_manager.commit()

def zInserts2(n):
    print "Inserting %s records into ZODB"%n
    dbroot['employees'] = [{'name':'John Doe','sex':1,'age':35} for i in xrange(n)]
    connection.transaction_manager.commit()

def testSqlite3Disk(n):
    print "Inserting %s records into SQLite(Disk) with sqlite3 module"%n
    conn = sqlite3.connect('dbsql')
    c = conn.cursor()
    # Create table
    c.execute('''create table Person(name text, sex integer, age integer)''')
    persons = [('john doe', 1, 35) for i in xrange(n)]
    c.executemany("insert into Person(name, sex, age) values (?,?,?)", persons)
    c.execute('select * from Person')
    print "Number of records selected: %s"%len(c.fetchall())
    c.execute('drop table Person')
recsize = [1000,5000,10000,50000,100000,200000,400000,600000,800000,1000000]
zperf = []
sqlperf =[]
for n in recsize:
    # remove old databases
    if os.path.exists('testdb.fs'):
        [os.remove(i) for i in glob.glob('testdb.fs*')]
    if os.path.exists('dbsql'):
        os.remove('dbsql')
    # setup ZODB storage
    dbpath = 'testdb.fs'
    storage = FileStorage.FileStorage(dbpath)
    db = DB(storage)
    connection = db.open()
    dbroot = connection.root()
    # begin tests
    t0 = time.clock()
    zInserts(n)
    t1 = time.clock()
    # closing and reopening ZODB's database to make sure
    # we are reading from file and not from some memory cache
    connection.close()
    db.close()
    storage = FileStorage.FileStorage(dbpath)
    db = DB(storage)
    connection = db.open()
    dbroot = connection.root()
    t2 = time.clock()
    print "Number of records read from ZODB: %s"%len(dbroot.items())
    t3 = time.clock()
    ztime = (t1-t0)+(t3-t2)
    zperf.append(ztime)
    print 'Time for ZODB: %s seconds\n'%ztime
    t4 = time.clock()
    testSqlite3Disk(n)
    t5 = time.clock()
    stime = (t5-t4)
    sqlperf.append(stime)
    print 'Time for Sqlite3 with db on Disk: %s seconds\n'%stime
P.plot(recsize,zperf,'-v',recsize,sqlperf,'-^')
P.legend(['ZODB','SQLite3'])
P.xlabel('inserts')
P.ylabel('time(s)')
P.show()
As you can see in this very simple example, using ZODB is no harder than using a dictionary, and it performs better than all the ORMs I know! Below are the numeric results for the beginning of the plot above.
ZODB allows for much more sophisticated usage than the one shown here. I chose to do it this way to make the insert operations on ZODB and SQLite as similar as possible. I hope the ZODB gurus out there will get together to write an up-to-date, detailed tutorial on ZODB for Python programmers. ZODB deserves it. And so do we!
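Just to hint at what that more sophisticated usage looks like: instead of plain dicts keyed directly on the root, one would normally store instances of persistent classes inside a BTree, so that each record can be loaded and written independently. Below is a minimal sketch using the persistent and BTrees packages that ship with ZODB; the Employee class and file name are made up for illustration.

import transaction
from persistent import Persistent
from BTrees.OOBTree import OOBTree
from ZODB import FileStorage, DB

class Employee(Persistent):
    """A made-up persistent record class."""
    def __init__(self, name, sex, age):
        self.name = name
        self.sex = sex
        self.age = age

storage = FileStorage.FileStorage('company.fs')
db = DB(storage)
connection = db.open()
root = connection.root()

if 'employees' not in root:
    root['employees'] = OOBTree()  # scales to many records
root['employees']['john'] = Employee('John Doe', 1, 35)
transaction.commit()               # module-level transaction API
connection.close()
db.close()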
Monday, September 3, 2007
PyconBrasil[03]
Last week I had the pleasure of attending, for the first time, Brasil's largest meeting of Python users: PyconBrasil[03]. My impression of the community couldn't be better: everyone was very nice and open, and the talks were awesome. I will make specific posts about the talks that impressed me most, which is not to say that the talks I don't mention were not great as well; I just can't make any relevant comments on talks about business solutions, e-government, etc. If you are interested in those topics, I recommend watching the videos of the talks on Google Video (most of them are in Portuguese).
The first thing that impressed me positively was the number of science-related talks, which were of a very high level. My own talk was only mildly scientific, since I had planned it to preach about the importance of expanding the Python academic community. It turns out that the existing community is already highly attuned to the scientific possibilities of Python: at the event, I met many full-time scientists among the "Pythonistas", and it was nice to notice that a large number of members of the community were involved with science as well. A good example is Fabiano Weimar, one of the exponents of the Brazilian Python scene, who is working towards his doctoral degree on speech recognition using Hidden Markov Models, if I understood it correctly. It would be nice to see a good Python implementation of HMMs, though I am not sure if that is in his plans. The funny thing is that I believed the last chapter of my book, about stochastic methods, would find almost no echo in the Python community, due to its drier scientific language and focus. Apparently I was wrong, which is great!
Even though PyconBrasil is in its third iteration, the Brazilian Python association, a non-profit organized to promote Python in Brasil, was celebrating only three months of existence. I met their staff and found them very nice and open. I wish them all the success they deserve!
I want to close this post with big thanks to the Python community as a whole for receiving me and my book so well, and by letting them know that I will keep doing everything in my reach to help the community grow and become known in the scientific world.
Friday, August 24, 2007
First Course of Scientific Python was a success!
The students were colleagues of mine and some graduate students. Since the audience had different levels of knowledge about Python and programming in general, the first day was spent getting everybody at the same level about Python. Then we proceeded to explore the potential of Python to improve the productivity of scientists, through a series of examples. Given the limited amount of time, we explored topics which were of most interest to everybody:
- Manipulating data stored in text files;
- Interacting with databases;
- Constructing a simple epidemiological model and implementing it using multiple threads (a minimal sketch of this kind of model appears after this list);
- A bit of graph theory using NetworkX;
- A bit of bioinformatics using Bio-Python;
- Integration of Python programs with C and Fortran (we didn't have time to explore Jython);
- Plus many other bits and pieces such as basic numpy, Pylab, GUI design using wxGlade, etc.
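As promised above, here is a minimal sketch of the kind of epidemiological example we built in class: a discrete-time SIR model whose scenarios are run in separate threads (all parameter values below are made up for illustration):

import threading

def sir(beta, gamma, S0=999.0, I0=1.0, R0=0.0, steps=100):
    """Discrete-time SIR model; returns the series of infected counts."""
    N = S0 + I0 + R0
    S, I, R = S0, I0, R0
    infected = []
    for t in xrange(steps):
        new_inf = beta * S * I / N   # new infections in this time step
        new_rec = gamma * I          # new recoveries in this time step
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
        infected.append(I)
    return infected

results = {}
def run_scenario(name, beta, gamma):
    results[name] = sir(beta, gamma)

# run two hypothetical scenarios in parallel threads
threads = [threading.Thread(target=run_scenario, args=('mild', 0.3, 0.1)),
           threading.Thread(target=run_scenario, args=('severe', 0.6, 0.1))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print "Peak infected (mild): %.1f" % max(results['mild'])
print "Peak infected (severe): %.1f" % max(results['severe'])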
One thing that surprised me was the excitement that Crunchy caused in everybody. I used Crunchy mostly to facilitate my explanation of code snippets found on the web, but the students demanded to know how to install Crunchy on their computers so they could use it on their own.
I enjoyed giving this course very much. If anyone wants to sponsor a similar course at their institution, just contact me; I'll be glad to give it again, in Portuguese, English or Spanish.
Monday, August 6, 2007
Set implementation performance
So here is my simple code:
#Set implementations benchmark
import random,time
seta = set([random.randint(0,100000) for n in xrange(10000)])
setb = set([random.randint(0,100000) for n in xrange(10000)])
t0 = time.clock()
for i in xrange(1000):
    seta & setb
    seta | setb
    seta ^ setb
print "Time: %s seconds"%(time.clock()-t0)
and here are the timings:
$ python set_bench.py
Time: 9.45 seconds
$ ipy set_bench.py
Time: 141.460593000 seconds
CPython is simply 15 times faster than IronPython!
I always like to have an external tool for comparison, so I converted my little Python script to C++ with ShedSkin, compiled it, and ran it:
$ ./set_bench
Time: 30.66 seconds
CPython was still more than 3 times faster than the C++ generated by ShedSkin (0.0.21)!!
For reference: I used IronPython 1.0.2467 on .NET 2.0.50727.42 on an Ubuntu machine. It would be nice if someone could re-run this on a Windows box.
If anyone knows of a faster solution for determining the intersection of two sets in Python (perhaps using dictionaries?), I would be very interested to hear about it.
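For the record, here is one way to phrase the dictionary-based idea (a sketch only; I have not benchmarked it carefully, so take it as a starting point rather than a recommendation):

import random, time

dicta = dict([(random.randint(0, 100000), None) for n in xrange(10000)])
dictb = dict([(random.randint(0, 100000), None) for n in xrange(10000)])

t0 = time.clock()
for i in xrange(1000):
    # intersection: keys present in both dictionaries
    inter = [k for k in dicta if k in dictb]
print "Time: %s seconds" % (time.clock() - t0)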