Tuesday, September 11, 2007

ZODB vs Relational Database: a simple benchmark

Recently, I posted about relational database performance. In that experiment, I found SQLite3, a database that ships with the Python standard library, to be the second fastest database backend available for multiple inserts.

Since this blog is about Python, I soon felt bad about not including ZODB in that comparison. At the time I justified the omission by telling myself that ZODB cannot be compared to standard DBs because it is an object database. Subconsciously, I thought ZODB would lose so badly in a race against relational databases that I feared for its reputation. Silly me.

The truth is: object databases such as ZODB can be a perfect replacement for relational databases in a large portion (if not the majority) of database-driven applications. Had I stopped to look more carefully at ZODB before, I would have saved countless hours of struggle with ORMs.

As you can see in the figure above, for up to 100,000 inserts per transaction, ZODB's performance is comparable to SQLite3's. And since ZODB lets you store arbitrarily complex objects, you don't have to cook up complex SQL queries to get at the data you need: the relation between each datum is given by the design of the object you are storing. In some apps of mine, I have to write code to extract the data from my Python objects and put it in table format (to store in a relational DB), and then, when I read it back, I need more code to put everything back where it belongs. With ZODB, none of that is necessary.
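To make that concrete, here is a minimal sketch (the employee record, its fields and the file name are invented for the example) of storing a nested structure directly in the ZODB root; the same data would need two or three tables and a join in a relational schema:

import ZODB
from ZODB import FileStorage, DB

# a nested record that would have to be flattened into several tables in SQL
employee = {'name': 'John Doe',
            'age': 35,
            'addresses': [{'city': 'Rio de Janeiro', 'zip': '22000'},
                          {'city': 'Recife', 'zip': '50000'}]}

storage = FileStorage.FileStorage('example.fs')
db = DB(storage)
connection = db.open()
root = connection.root()

root['employee'] = employee                      # store the whole object graph as-is
connection.transaction_manager.commit()

print root['employee']['addresses'][0]['city']   # read it back, no joins needed

connection.close()
db.close()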

Like SQLite, ZODB stores your data in a file by default, but it also supports other storage types; see this table for a comparison of the available storages.
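Switching storages does not change the rest of your code. As a hedged sketch (MappingStorage is the in-memory storage that ships with ZODB; the file name is made up), you simply build the database on top of whichever storage suits you:

from ZODB import FileStorage, DB
from ZODB.MappingStorage import MappingStorage

# on-disk storage: one data file plus its index/lock/tmp companions
disk_db = DB(FileStorage.FileStorage('data.fs'))

# in-memory storage: same API, nothing written to disk (handy for tests)
mem_db = DB(MappingStorage())

connection = mem_db.open()
root = connection.root()
root['greeting'] = 'hello'
connection.transaction_manager.commit()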

ZODB is certainly one of the hidden jewels of Zope. Due to the lack of good documentation (an exception, though somewhat outdated), many Python programmers either don't know that ZODB can be used outside of Zope or don't know how to get started with it.

The goal of this post is not to serve as a tutorial on ZODB, since I am hardly an expert on the subject, but to spark interest in adopting ZODB for mundane applications outside Zope.

Let's get to the code:


import time, os, glob
import sqlite3
import ZODB
from ZODB import FileStorage, DB
import pylab as P

def zInserts(n):
    print "Inserting %s records into ZODB" % n
    for i in xrange(n):
        dbroot[i] = {'name': 'John Doe', 'sex': 1, 'age': 35}
    connection.transaction_manager.commit()

def zInserts2(n):
    # alternative strategy: store all records as a single list under one key
    # (defined for comparison, not used in the timings below)
    print "Inserting %s records into ZODB" % n
    dbroot['employees'] = [{'name': 'John Doe', 'sex': 1, 'age': 35} for i in xrange(n)]
    connection.transaction_manager.commit()

def testSqlite3Disk(n):
    print "Inserting %s records into SQLite(Disk) with sqlite3 module" % n
    conn = sqlite3.connect('dbsql')
    c = conn.cursor()
    # create table
    c.execute('''create table Person(name text, sex integer, age integer)''')
    persons = [('john doe', 1, 35) for i in xrange(n)]
    c.executemany("insert into Person(name, sex, age) values (?,?,?)", persons)
    c.execute('select * from Person')
    print "Number of records selected: %s" % len(c.fetchall())
    c.execute('drop table Person')
    conn.commit()
    conn.close()

recsize = [1000, 5000, 10000, 50000, 100000, 200000, 400000, 600000, 800000, 1000000]
zperf = []
sqlperf = []
for n in recsize:
    # remove old databases
    if os.path.exists('testdb.fs'):
        [os.remove(i) for i in glob.glob('testdb.fs*')]
    if os.path.exists('dbsql'):
        os.remove('dbsql')
    # set up ZODB storage
    dbpath = 'testdb.fs'
    storage = FileStorage.FileStorage(dbpath)
    db = DB(storage)
    connection = db.open()
    dbroot = connection.root()
    # begin tests
    t0 = time.clock()
    zInserts(n)
    t1 = time.clock()
    # close and reopen ZODB's database to make sure
    # we are reading from file and not from some memory cache
    connection.close()
    db.close()
    storage = FileStorage.FileStorage(dbpath)
    db = DB(storage)
    connection = db.open()
    dbroot = connection.root()
    t2 = time.clock()
    print "Number of records read from ZODB: %s" % len(dbroot.items())
    t3 = time.clock()
    ztime = (t1 - t0) + (t3 - t2)
    zperf.append(ztime)
    print 'Time for ZODB: %s seconds\n' % ztime
    t4 = time.clock()
    testSqlite3Disk(n)
    t5 = time.clock()
    stime = t5 - t4
    sqlperf.append(stime)
    print 'Time for Sqlite3 with db on Disk: %s seconds\n' % stime
    # close the ZODB connection so its files can be removed on the next iteration
    connection.close()
    db.close()

P.plot(recsize, zperf, '-v', recsize, sqlperf, '-^')
P.legend(['ZODB', 'SQLite3'])
P.xlabel('inserts')
P.ylabel('time(s)')
P.show()

As you can see in this very simple example, using ZODB is no harder than using a dictionary, and it performs better than any ORM I know! Below are the numeric results for the beginning of the plot above.

ZODB allows for a much more sophisticated usage than the one shown here. I chose to do it this way to make the insert operations on ZODB and SQLite as similar as possible. I hope the ZODB gurus out there will get together to write an up-to-date detailed tutorial on ZODB for Python programmers. ZODB deserves it. And so do we!
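As a taste of that more sophisticated usage, the common pattern is to subclass persistent.Persistent and keep collections in BTrees, so that objects are loaded and written individually instead of the whole root mapping at once. The class, keys and file name below are made up for illustration:

import transaction
from persistent import Persistent
from BTrees.OOBTree import OOBTree
from ZODB import FileStorage, DB

class Employee(Persistent):
    # Persistent subclasses mark themselves as changed when attributes are set
    def __init__(self, name, sex, age):
        self.name = name
        self.sex = sex
        self.age = age

db = DB(FileStorage.FileStorage('company.fs'))
connection = db.open()
root = connection.root()

if 'employees' not in root:
    root['employees'] = OOBTree()   # scales better than a plain dict in the root

root['employees']['jdoe'] = Employee('John Doe', 1, 35)
transaction.commit()

connection.close()
db.close()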

7 comments:

Anonymous said...

Flávio - It's nice to see you do this comparison - object databases really deserve more attention than they currently get. ZODB and Durus (which I wrote to you about back in July) are both very capable engines. If I were to adopt Zope I'd have no problem using ZODB, but since I am not tied to Zope I prefer Durus because, while it offers the same core features, it is simpler and even I can understand the code underneath it.

In addition to the raw performance you've looked at, one other issue needs to be considered -- the practical maximum number of objects per database.

The base "storages" (File / Shelf) for Durus essentially chew up a certain amount of ram per object (to represent the "oid" object id in the lookup mechanism). Since this "index" is maintained in ram, there are practical implications for start up time as well as overall RAM usage, regardless of the complexity or lack thereof for objects.

I am not sure if this is true for ZODB -- likely is -- but for Durus it is the case.

This is quite different than most (not all) SQL databases, where row index optimization has been worked on by various product teams for many years. A Postgresql table with 5 million rows can be queried without loading all 5 million keys (unless a full table scan is required by the query); 5 million objects in a Durus DB means 5 million keys (oid's) in RAM.

However the simplicity of Durus allows for some fun, and in a few days I'm going to publish an alternate back end for Durus based on Postgresql which removes that limitation completely. Similar work has been done using sqlite as a backend, too. Another bonus with the Postgresql backend for Durus is that bulk inserts (such as an initial data load) can be done very efficiently using its COPY statement. Millions of inserts can be done very quickly indeed with this.

Having said that, unless your application really will have many millions of objects in it, overthinking this problem can lead one away from using a very practical, very easy to use, and as you've seen, very fast data persistence solution ideal for Python programmers - ZODB or Durus (and other related solutions that have appeared on the scene).

Unknown said...

Hi Michael,

my apologies for completely forgetting about Durus. I recall your first comment in July, but back then I was not ready to perceive the importance of object databases.

I promise to blog about Durus soon, preferably with a comparison to ZODB.

Thanks for the reminder.... once again!
;-)

Anonymous said...

A complementary approach, as long as you can express your data in dictionary-style key/value pairs, is shove:

http://pypi.python.org/pypi/shove

It exposes a variety of backends, including both ZODB and Durus, through a dictionary API, much as shelve does. It's not a good choice for applications which require access to specialized storage backend features such as ZODB transactions. However, for applications without those requirements, it might be an interesting choice.

Unknown said...

Another option is to try cPickle and bsddb. It will provide much faster performance than ZODB.
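For reference, a rough sketch of what that combination looks like (untested here; bsddb ships with Python 2.x and stores only string keys and values, hence the pickling; file name and record are made up):

import cPickle
import bsddb

db = bsddb.hashopen('people.db', 'c')      # Berkeley DB hash file
for i in xrange(100000):
    db[str(i)] = cPickle.dumps({'name': 'John Doe', 'sex': 1, 'age': 35}, 2)
db.sync()

record = cPickle.loads(db['42'])           # read one record back
db.close()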

I have a few comments on your code.

-When doing performance testing you should not include print statements in the code being timed. Just store the values in variables and print them out after the timing. With the number of iterations in these tests the print statements are unlikely to have much of an impact, but in other test cases you write in the future they may bias the results.
-In the sqlite test you are timing the connection, obtaining the cursor object, the table creation and the table drop, but in the ZODB test you make the connection and obtain the dbroot object outside the timing. It would be better to time only the inserts in both cases (something like the sketch below).
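For example, with the connections and cursors already set up, only the insert/commit step would sit between the clock calls (reusing the variable names from the post):

# ZODB: time only the inserts and the commit
t0 = time.clock()
for i in xrange(n):
    dbroot[i] = {'name': 'John Doe', 'sex': 1, 'age': 35}
connection.transaction_manager.commit()
zperf.append(time.clock() - t0)

# SQLite: time only the executemany and the commit
persons = [('john doe', 1, 35) for i in xrange(n)]
t0 = time.clock()
c.executemany("insert into Person(name, sex, age) values (?,?,?)", persons)
conn.commit()
sqlperf.append(time.clock() - t0)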

code43 said...

hi Flavio, here's a 2009 update for your article: http://yserial.sourceforge.net

"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL."

Extremely easy to implement (a couple of lines of code to replicate your results), and well documented.

The execution speed is also consistent with your findings on 100,000 inserts across various databases. We extended that to a million inserts without any problem by using generator functions and executemany -- that also includes compression of any Python object, especially nice for scientific data.

Your feedback would be greatly appreciated. Thanks very much for your blog.

cool-RR said...

Great post, I was just curious about using ZODB, both in scientific applications and non-scientific applications, and this was an interesting overview!
