Wednesday, December 26, 2012

Compressing GeoJSON with Python

GeoJSON can get quite big when you need to represent complex maps with lots of polygons. Even though it compresses well, being text,
it can still be quite a hassle to store and upload to a website in uncompressed form.

Recently, while working on a web app which requires the upload of maps in GeoJSON format, I stumbled upon Google App Engine's
limitation of 32MB for POST requests. At that point I realized I'd have to compress the data before uploading, but rather than
just asking users to gzip their files first, I decided to look into ways to make a GeoJSON map lighter by eliminating redundancies
and perhaps reducing the level of detail a little bit. You see, a single polygon (representing a state or a county) may be composed
of thousands of points, each one represented by an array of two floating point numbers. That's **a lot** of bytes!

Soon my search took me to TopoJSON by Mike Bostock, which is great but is not compatible with GeoJSON, so it was not what I was
looking for. But reading about TopoJSON led me to LilJSON and this paper:

After looking at those resources, it was time to get my fingers typing. I forked LilJSON, which already achieved some compression by
reducing the precision of the floating point coordinates in the GeoJSON. After a while I had a mildly improved version of it, which I
contributed back via a pull request. I then set out to implement Visvalingam's algorithm, which simplifies polygonal lines while trying
not to alter the original area (and shape) of the polygon too much. A rough sketch of both ideas is shown below.
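Just to illustrate the two techniques (coordinate rounding and Visvalingam-style simplification), here is a minimal sketch. This is not the code from my repository; the function names, the fixed precision and the area threshold are assumptions made for the example:

def round_coords(coords, precision=5):
    # Recursively round nested GeoJSON coordinate arrays to `precision` decimals.
    if isinstance(coords[0], (int, float)):
        return [round(c, precision) for c in coords]
    return [round_coords(c, precision) for c in coords]

def triangle_area(a, b, c):
    # Area of the triangle formed by three (x, y) points.
    return abs((b[0] - a[0]) * (c[1] - a[1]) -
               (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def simplify_ring(points, threshold):
    # Visvalingam-style simplification: repeatedly drop the interior point whose
    # triangle with its two neighbours has the smallest area, until every
    # remaining point contributes more than `threshold`.
    pts = list(points)
    while len(pts) > 3:
        areas = [triangle_area(pts[i - 1], pts[i], pts[i + 1])
                 for i in range(1, len(pts) - 1)]
        smallest = min(range(len(areas)), key=areas.__getitem__)
        if areas[smallest] >= threshold:
            break
        del pts[smallest + 1]
    return pts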

I was very surprised to find out that the two techniques combined, the reduction of coordinate precision and the simplification
of lines, yielded a very nice compression of my test GeoJSON: it went from 62MB to a mere 5.1MB! And all of this could still be reduced by
a factor of ten by gzipping it.
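For reference, gzipping the result from Python takes only a couple of lines (the file names here are just placeholders):

import gzip, shutil

# Compress the simplified GeoJSON before uploading it.
with open("map.geojson", "rb") as src, gzip.open("map.geojson.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)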

Just to check how well my compressed version stood up to the original, I opened both in Quantum GIS and was blown away: the differences were tiny! Only by magnifying A LOT was I able to see the differences. I am not including an image here, to encourage you to try it for yourself. ;-)

If you want to give it a try, the source code is here.

Sunday, April 1, 2012

Benchmarking NLTK under Pypy

Natural Language Processing (NLP). A marvelous world of possibilities! Fortunately, it is also a great example of another application domain for which Python is wonderfully well equipped.

I have been playing with Python and NLP for a couple of years now, integrating its tools into a reasonably large project. I hope to demo this project really soon, but it is not the topic of this post.

The topic of this post is to demonstrate the performance gains that are possible on typical NLP problems simply by using PyPy. PyPy has come a long way recently, and can now be used as a drop-in replacement for CPython in many applications, with large performance gains.

I'll start by showing how you can start using PyPy for your day-to-day development needs with very little effort. First you need to install two very powerful Python tools: virtualenv and virtualenvwrapper. These can be easily installed with easy_install or pip.
sudo easy_install -U virtualenv virtualenvwrapper
Follow the post-install configuration for virtualenvwrapper. Then download the most recent stable release tarball from the PyPy page, and extract it somewhere on your system:
wget https://bitbucket.org/pypy/pypy/downloads/pypy-1.8-linux64.tar.bz2
tar xvjf pypy-1.8-linux64.tar.bz2
Now you have to create your own virtualenv to work with PyPy instead of the standard CPython installation on your system:
mkvirtualenv -p pypy-1.8/bin/pypy pypyEnv
From this point on, whenever you want to use Pypy all you need to do is type:
workon pypyEnv
anywhere on your system.

Now that we've got all this environment setup out of the way, we can focus on testing NLTK with PyPy and comparing it to CPython. By the way, NLTK can be installed in the same way as virtualenv.

Since PyPy has a very extensive benchmarking system, I decided to keep all my benchmarking code visible, so that if the project devs want to take advantage of it to further improve PyPy, they can. The code is on GitHub.

The benchmarks (see the GitHub page) indicate big gains on some operations, and not so big ones on others. In a couple of cases PyPy is slower, though I didn't investigate why.

The main purpose of having such a benchmark is to provide some experimental grounds for the improvement of PyPy. Its core developers say that "if it is not faster than CPython, then it's a bug". But I also wanted to let fellow developers know that NLTK seems to be fully compatible with PyPy (as far as I have tested it), and can benefit from performance improvements when run with PyPy.

Naturally, this benchmark can be vastly improved and extended. I count on YOU to help me with this: just add a quick function to the benchmark.py module and send me a pull request. The snippet below gives an idea of what such a function could look like.
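As an illustration only (this is not code from the actual benchmark.py; the function name, tokenizer choice and repetition count are my own assumptions), a contributed benchmark can be as simple as timing one NLTK operation:

import time
from nltk.tokenize import WordPunctTokenizer

def bench_wordpunct_tokenize(text, repetitions=100):
    # Hypothetical benchmark: time NLTK's regex-based word tokenizer
    # over `repetitions` runs and return the elapsed time in seconds.
    tokenizer = WordPunctTokenizer()
    start = time.time()
    for _ in range(repetitions):
        tokenizer.tokenize(text)
    return time.time() - start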

Monday, October 10, 2011

Scrapy, Google Scholar and MongoDB

I have been deeply involved in text mining problems lately. If you watched my presentation at EuroSciPy this year (http://www.slideshare.net/fccoelho/mining-legal-texts-with-python), you have an idea of some of the things I am up to.

Well, part of the problem of text mining is getting hold of the text you want to analyze in the first place. For many projects I am involved with, I already have mountains of text to analyze. For some, however, I have to go get it. On the web...

This requires a technique (or should I say an art?) known as web scraping. Of course, there are tons of sites which are happy to provide you with an API for all your data-feeding needs. Some "evil" sites, however, unnecessarily withhold even public data. To be explicit, I am talking about Google here, which, despite a massive number of requests, has yet to provide an API for the consumption of its database of scientific literature. This database powers one of their search services called Google Scholar, which you are not likely to have heard about unless you are a researcher or a graduate student.

I am pretty sure that this database was built by scraping other literature-indexing sites and publishers' websites. None of the information in it is subject to copyright, since it is composed mainly of references to articles, books and similar resources hosted elsewhere on the web. Nevertheless, they insist on making the life of their fellow scrapers harder than it needs to be. But enough ranting about Google's evil policies.

My research involves analyzing the evolution of the scientific literature on a given subject, and to achieve this goal I need to download thousands of articles and analyze their content by means of natural language processing techniques (NLTK).

So I wrote a Google Scholar scraping tool (https://bitbucket.org/fccoelho/scholarscrap - pun intended) which is available to others facing the same challenge. It is a challenge indeed, given the number of stealth techniques necessary to avoid early detection by Google's bot watchdogs. My crawler is still being detected, meaning I can only make a limited number of requests each day before being banned. I welcome new ideas to further evade detection.
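To give an idea of the general approach, here is a bare-bones Scrapy spider sketch. It is not the actual scholarscrap code; the query URL, the CSS selectors and the download delay are illustrative assumptions, and a real crawler needs much more care (throttling, user-agent rotation, etc.) to avoid being banned:

import scrapy

class ScholarSpider(scrapy.Spider):
    # Illustrative only: not the scholarscrap implementation.
    name = "scholar"
    start_urls = ["https://scholar.google.com/scholar?q=epidemiology"]
    custom_settings = {"DOWNLOAD_DELAY": 10}  # be gentle, to reduce the chance of a ban

    def parse(self, response):
        # The CSS classes below are assumptions about the results page markup.
        for result in response.css("div.gs_ri"):
            yield {
                "title": " ".join(result.css("h3 a ::text").extract()),
                "snippet": " ".join(result.css("div.gs_rs ::text").extract()),
            }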

I wish Google would give something back to us researchers, after making so much money from our work.





