Monday, July 9, 2007

Fastest Python Database Interface

Writing data to databases is something that I have to do quite often. In the last few years I have used many different databases and their Python interfaces: Pysqlite, Pysqlite2, MySQLdb, Psycopg and Psycopg2, SQLObject and SQLAlchemy. In many cases, when the data is to be used only by a Python program, I have used Pickle and cPickle.

Up until now, my criteria for picking a database and a given interface were mainly based on convenience or the need to use a particular database backend. However, recently, after choosing SQLObject for a project based only on how neatly organized the resulting code looked, I stumbled upon a major performance issue with SQLObject: it was extremely slow for a large number of inserts.

After searching all over and asking on the SQLObject mailing list, I found out that trying to improve its insert speed was a lost cause. Apparently, something in the core design of SQLObject makes it impossible to do the equivalent of an executemany() when doing inserts. The executemany() method, which we find in all(?) other database interfaces, gives us a tremendous boost in insertion speed.

So I finally decided to do what I should have done years ago: write a simple benchmark script for all the interfaces I cared about.
I simply created a table, inserted 100000 short records into the database, and selected them back to check that the insertion had completed. Then I dropped the table. I repeated this operation for all interfaces.
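
For reference, here is roughly what each leg of the benchmark does, sketched with the sqlite3 module (the actual table name, columns and record contents differ; the real script is linked at the end of the post):

import time
import sqlite3

N = 100000
records = [("name%d" % i, i, 1.5 * i) for i in xrange(N)]

conn = sqlite3.connect(":memory:")  # or a filename for the on-disk test
cur = conn.cursor()
cur.execute("CREATE TABLE person (name TEXT, age INTEGER, weight REAL)")

start = time.time()
# executemany() sends all the records in a single call
cur.executemany("INSERT INTO person VALUES (?, ?, ?)", records)
conn.commit()

cur.execute("SELECT * FROM person")
print "Number of records selected:", len(cur.fetchall())
print "Time: %.2f seconds" % (time.time() - start)

cur.execute("DROP TABLE person")
conn.close()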

Since I had never before come across a straightforward comparison like this one, I decided to blog about it and save other people from learning it the hard way, as I did.

So, without further ado, let's see the results:

Starting benchmarking with 100000 records (inserting and selecting)

Inserting 100000 records into SQLite(Memory) with SQLObject
Number of records selected: 100000
Time for SQLObject with db in memory: 68.07 seconds

Inserting 100000 records into SQLite(Memory) with sqlite3 module
Number of records selected: 100000
Time for Sqlite3 with db in memory: 2.87 seconds

Inserting 100000 records into PostgreSQL with PsycoPG2
Number of records selected: 100000
Time for PsycoPG2: 19.49 seconds

Inserting 100000 records into SQLite(Disk) with sqlite3 module
Number of records selected: 100000
Time for Sqlite3 with db on Disk: 3.02 seconds

Inserting 100000 records into SQLite(Disk) with SQLAlchemy
Number of records selected: 100000
Time for SQLAlchemy with SQLite db on Disk: 18.37 seconds

Inserting 100000 records into SQLite(Memory) with SQLAlchemy
Number of records selected: 100000
Time for SQLAlchemy with SQLite db in Memory: 18.02 seconds

Inserting 100000 records into a Pickle file with cPickle
Number of records loaded: 100000
Time for cPickle: 0.39 seconds

Inserting 100000 records into MySQL with MySQLdb
Number of records selected: 100000
Time for MySQL: 1.28 seconds


I was surprised at a few things (ok, at just about everything!):
  • How slow SQLObject really was (not really, I guess I already expected that...)
  • SQLite was not substantially faster with the DB in memory when compared to an "on disk" DB.
  • SQLite is pretty darn fast!
  • MySQL is even faster!!!
  • Psycopg2 is not as fast as people say...
  • SQLAlchemy does an excellent job at doing everything it does and not sacrificing performance in the process!
Before running the script, make sure you have MySQL and PostgreSQL running, and adjust user names and passwords in the code. You also need to create a database called "People" on both servers.
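
The connection setup amounts to something like this (host, user and password here are placeholders to adjust):

import MySQLdb
import psycopg2

# both servers must already have a database called "People"
my_conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="People")
pg_conn = psycopg2.connect("dbname=People user=user password=secret host=localhost")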

One last observation: for SQLite, I am using the sqlite3 module from Python 2.5's standard library. I don't know whether it is based on Pysqlite or not. I wonder if Pysqlite2 would give different performance results.

I would appreciate any comments, and suggestions to extract more performance from each interface. The source code for the benchmarking is here.

27 comments:

Nick Gerner said...

Regarding your SQLite perf to disk vs in memory:

SQLite has poor disk sync behavior when you're inserting records without a transaction, so it does a sync to disk on every insert (for 10k inserts this sucks!).

Try a begin/commit around your inserts and re-run your SQLite test

for more info see:
http://www.sqlite.org/cvstrac/wiki?p=PerformanceConsiderations
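
Something like this (sqlite3 opened in autocommit mode so the BEGIN/COMMIT are explicit; the table is just for illustration):

import sqlite3

# isolation_level=None puts the module in autocommit mode,
# so we control the transaction boundaries ourselves
conn = sqlite3.connect("people.db", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE person (name TEXT, age INTEGER)")

cur.execute("BEGIN")  # one transaction -> one sync at commit time
for i in xrange(100000):
    cur.execute("INSERT INTO person VALUES (?, ?)", ("name%d" % i, i))
cur.execute("COMMIT")
conn.close()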

Unknown said...

Hi Nick,

Thanks for the tip!

I actually think that SQLite (when not being slowed down by SQLObject) is very fast...

Regarding your tip, according to its documentation, the sqlite3 module automatically opens transactions around operations like inserts.

thanks,

Flávio

Marius Gedminas said...

There should really be a graph to visualize these numbers.

I wonder how the newly-announced Storm object-relational mapper (https://storm.canonical.com/) would fare in this. Its API is very nice.

Justin said...

SQLite is pretty fast for simple inserts.

BUT once it gets to selects with joins, you will have to optimize your queries by hand, since the SQL query optimizer is pretty poor.

Unknown said...

I thought about doing a benchmark of Storm, but its documentation is still sparse regarding multiple inserts...

so maybe later....

Anonymous said...

Hi!

You could try apsw to access SQLite. For me it was faster than pysqlite most of the time.

http://www.initd.org/tracker/pysqlite/wiki/APSW
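
A rough sketch of the same kind of insert with apsw (it does no transaction management of its own, so the BEGIN/COMMIT are explicit; the table is made up):

import apsw

conn = apsw.Connection("people.db")
cur = conn.cursor()
cur.execute("CREATE TABLE person (name TEXT, age INTEGER)")

records = [("name%d" % i, i) for i in xrange(100000)]
cur.execute("BEGIN")
cur.executemany("INSERT INTO person VALUES (?, ?)", records)
cur.execute("COMMIT")
conn.close()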

mike bayer said...

the sqlalchemy benchmark would certainly be a lot faster, almost as fast as the pure sqlite3 benchmark, if you used executemany() semantics:

engine.execute(table.insert(), {params}, {params}, {params}, ..etc )

the 18 seconds of overhead is due to recompiling the insert() object over and over as well as some overhead SA adds to statement executions.

if the records need to be streamed in and out of memory as they come through, you can try "chunking" the insert statements; such as one execute() for every 1000 records.
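
concretely, something along these lines (table and column names made up to match the benchmark):

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
person = Table("person", metadata,
               Column("name", String(40)),
               Column("age", Integer))
metadata.create_all(engine)

records = [{"name": "name%d" % i, "age": i} for i in xrange(100000)]

# a single execute() with a list of dicts compiles the INSERT once and
# runs it as one cursor.executemany() at the DB-API level
engine.execute(person.insert(), records)

# to "chunk" instead, slice the list and call execute() once per slice,
# e.g. one call per 1000 records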

mp said...

A couple of things:

Is MySQL configured to use InnoDB or MyISAM by default? MyISAM is much faster for bulk inserts, but doesn't support transactions.

When performing large inserts into PostgreSQL, you need to wrap things in "BEGIN"/"COMMIT" blocks or pgsql will flush after every write, slowing inserts down terribly. This behaviour is configurable, but I forget where.
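
With psycopg2 that boils down to something like this (connection parameters are placeholders; the driver already opens a transaction implicitly, so the important part is the single commit() at the end):

import psycopg2

conn = psycopg2.connect("dbname=People user=user password=secret host=localhost")
cur = conn.cursor()
cur.execute("CREATE TABLE person (name TEXT, age INTEGER)")

records = [("name%d" % i, i) for i in xrange(100000)]
cur.executemany("INSERT INTO person VALUES (%s, %s)", records)
conn.commit()  # one commit for the whole batch, not one per insert
conn.close()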

I'd be interested to see how long the inserts/selects took if using SQLAlchemy's ORM rather than their query code - from working with it here we've found it to be excruciatingly slow.

mike bayer said...

SQLAlchemy's ORM is not designed for bulk insert operations. it issues every INSERT distinctly and also must look at cursor.lastrowid in some cases, and also in some cases will post-fetch information about the row, like defaults that fired off. this is all to deliver rich ORM behavior where your mapped objects remain consistent with what's in the database.

therefore for bulk insert, use insert() statements directly.

mike bayer said...

also, if you're having general ORM slowness issues, feel free to post on the list what you're observing. this is the first time i've heard anyone use the term "excruciatingly slow". if you're having select slowness in particular, there are things that can be looked at to speed things up.

mp said...

Yeah, after a while we found that we were square-peg-round-hole-ing with SQLAlchemy, and that we were doing things with it that it wasn't designed to do.

So while the slowdown was unexpected, the reason for it was fairly reasonable after investigation.

Unfortunately, the bulk selects/inserts we were performing as part of our initial data imports are semi-representative of the kind of workloads that we're going to be putting SA under. When we actually do get to that point I'll be compiling a couple of demonstration unit tests and bringing them to the SQLAlchemy mailing list.

Unknown said...

Just for the record: I used executemany() for every DB interface except SQLObject, which does not offer that capability. It is up to each one to execute the multiple inserts as efficiently as it can (since it gets all the records in a single call).

Gerhard Häring said...

Ok, after reading your article I really could not explain why MySQLdb was faster than pysqlite. I implemented a performance patch for pysqlite that should improve performance by ca. 30%: http://initd.org/tracker/pysqlite/changeset/414

mike bayer said...

usagi:

yes, my apologies.

the reason for the SA slowdown is that the various String/Integer types perform "type processing" on each bind parameter. This is the feature that allows SQLAlchemy to encode unicode strings on the fly and to convert datetime() objects into strings for SQLite, for example.

so yes, in this case, SQLAlchemy still adds a lot of overhead to a bulk insert of that size. we can either add some kind of flag somewhere to "disable" the bind parameter processing, or you can use the raw connection provided by SQLAlchemy's engine to bypass this processing (i.e. engine.connect().connection.executemany()). a third option would be more complicated optimization, which would involve checking just the first set of bind params for any types that need converting; if none found, then the rest of the result goes in raw. feel free to add a ticket for this case to trac.
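
the raw-connection route looks roughly like this (a sketch only; assumes a sqlite backend and a made-up person table, hence the ? placeholders):

from sqlalchemy import create_engine

engine = create_engine("sqlite:///people.db")

# drop down to the underlying DB-API connection, skipping SQLAlchemy's
# bind parameter processing entirely
raw = engine.connect().connection
cur = raw.cursor()
cur.execute("CREATE TABLE person (name TEXT, age INTEGER)")
cur.executemany("INSERT INTO person (name, age) VALUES (?, ?)",
                [("name%d" % i, i) for i in xrange(100000)])
raw.commit()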

Anonymous said...

I think you'll find that cPickle is an order of magnitude slower (at least) if you were to save 100,000 new Object() instances instead of the 100,000 simple tuples (str, int, int) you have, simply due to the overhead of object instantiation (creating and recreating them). That's something to consider when you look at ORMs vs straight SQL result manipulation, too.

I cooked up an example following along the lines of your BenchDB and did two runs with Pickle - one with Tuples, and one with a simple Person class with three attributes. Note the significant difference:

Inserting a 100000 Tuples into a Pickle file with cPickle
testPickle completed 100000 operations in: 0.291534900665 seconds.

Inserting a 100000 Objects into a Pickle file with cPickle
testPickleObjects completed 100000 operations in: 10.1858320236 seconds.

When you are looking at SQLObject and other such ORMs keep that in mind.
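
The test is essentially this (class name and fields are invented; the real code is linked below):

import time
import cPickle

class Person(object):
    def __init__(self, name, age, weight):
        self.name = name
        self.age = age
        self.weight = weight

N = 100000
tuples = [("name%d" % i, i, 1.5 * i) for i in xrange(N)]
objects = [Person("name%d" % i, i, 1.5 * i) for i in xrange(N)]

for label, data in [("tuples", tuples), ("objects", objects)]:
    start = time.time()
    out = open("people_%s.pkl" % label, "wb")
    cPickle.dump(data, out, 2)  # dumping 100k objects is much slower than 100k tuples
    out.close()
    print "%s: %.2f seconds" % (label, time.time() - start)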

Carrying along the same theme, I also benchmarked a Python object database, Durus, which is similar to ZODB, with tuples and persistent objects, using memory, file, and client server storage:

% python performance.py
Inserting 100000 python Tuples into Durus FileStorage
testDurus completed 100000 operations in: 0.269452095032 seconds.

Inserting 100000 python PersistentObject into Durus FileStorage
testDurusObjects completed 100000 operations in: 26.3141629696 seconds.

Inserting 100000 python Tuples into Durus MemoryStorage
testDurusMemoryStorage completed 100000 operations in: 0.0707190036774 seconds.

Inserting 100000 python PersistentObject into Durus MemoryStorage
testDurusMemoryStorageObjects completed 100000 operations in: 1.26428484917 seconds.

Inserting 100000 python Tuples into Durus ClientStorage
testDurusClientServer completed 100000 operations in: 0.274677038193 seconds.

Inserting 100000 python PersistentObject into Durus ClientStorage
testDurusClientServerObjects completed 100000 operations in: 28.3248071671 seconds.

The code for the above can be found here:

http://64.21.147.49/durus/performance.py

Durus - won't pollute your system if you install it - can be found here:

http://www.mems-exchange.org/software/durus/

And you might be interested in the readme:

http://www.mems-exchange.org/software/durus/Durus-3.7.tar.gz/Durus-3.7/README.txt

Unknown said...

Hi Michael Watkins,

Thanks for the demonstration of Durus! I had heard of it but hadn't really played with it yet.

It sounds really cool, and fast! It is quite acceptable that it is a little slower than cPickle for disk storage, since it must be much more flexible.

I am certainly going to check it out and consider using it for my next projects.

Thanks,

Flávio

Anonymous said...

actually, that's brilliant. Thank you. I'm going to pass that on to a couple of people.

Unknown said...

Hey Flávio, a couple of us electrical engineers in Minneapolis Minnesota have been working on a new Python Object Database Management System called pod:

http://code.google.com/p/pickled-object-database/

We're built on sqlite, so we're not going to be any faster. But, because we're an abstract object layer on top of sqlite, we wanted to see how fast our system was -- so we did our own speed trials, and for 100,000 inserts we were 6x slower than using sqlite directly:

http://code.google.com/p/pickled-object-database/wiki/discussion_the_pod_speed_trials#Insert_Test

We would be honored to get your feedback on how our system fares -- we've really tried to beat up performance while maintaining a very abstract, Pythonic API.

Cheers and great post,

Andrew Carter

pedro velasquez said...

Fri, Feb 6, 2009 at 5:19 AM, M.-A. Lemburg wrote:
I'm writing a small application for detecting source code plagiarism that currently relies on a database to store lines of code. The application has two primary functions: adding a new file to the database and comparing a file to those that are already stored in the database. I started out using sqlite3, but was not satisfied with the performance results. I then tried using psycopg2 with a local postgresql server, and the performance got even worse. My simple benchmarks show that sqlite3 is an average of 3.5 times faster at inserting a file, and on average less than a tenth of a second slower than psycopg2 at matching a file.
I expected postgresql to be a lot faster. Is there some peculiarity in psycopg2 that could be causing the slowdown? Are these performance results typical? Any suggestions on what to try from here? I don't think my queries are inherently slow, but I'm not a DBA or a very accomplished Python developer, so I could be wrong.

Unknown said...

I would disagree on SQLAlchemy. I am doing some comparative genomics research, and after profiling the API I have been using, it appears that SQLAlchemy is the runtime bottleneck.

I can do the same work using Perl DBI + DBD::mysql in 2 hours in comparison to 8 hours with SQLAlchemy :-(

ccp
