Wednesday, December 26, 2012

Compressing GeoJSON with Python

GeoJSON can get quite big when you need to represent complex maps with lots of polygons. Even though it compresses well, being text,
it can still be quite a hassle to store and upload to a website in uncompressed form.

Recently, while working on a web app which requires the upload of maps in GeoJSON format, I stumbled upon Google App Engine's
limitation of 32MB for POST requests. At that point I realized I'd have to compress the data before uploading, but rather than
just asking users to gzip their files first, I decided to look into ways to make a GeoJSON map lighter by eliminating redundancies
and perhaps reducing the level of detail a little bit. You see, a single polygon (representing a state or a county) may be composed
of thousands of points, each one represented by an array of two floating point numbers. That's **a lot** of bytes!

Soon my search took me to TopoJSON by Mike Bostock, which is great but is not compatible with GeoJSON, so it was not what I was
looking for. But reading about TopoJSON led me to LilJSON and this paper:

After looking at those resources, it was time to get my fingers typing. I forked LilJSON, which already achieved some compression by
reducing the precision of the floating point coordinates in the GeoJSON. After a while I had a mildly improved version of it, which I
contributed back via a pull request. I then set out to implement Visvalingam's algorithm, which simplifies polygonal lines while trying
not to alter the original area (and shape) of the polygon too much. A rough sketch of both ideas is shown below.
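Just to illustrate the two techniques (coordinate rounding and Visvalingam-style simplification), here is a minimal sketch. This is not the code from my repository; the function names, the fixed precision and the area threshold are assumptions made for the example:

def round_coords(coords, precision=5):
    # Recursively round nested GeoJSON coordinate arrays to `precision` decimals.
    if isinstance(coords[0], (int, float)):
        return [round(c, precision) for c in coords]
    return [round_coords(c, precision) for c in coords]

def triangle_area(a, b, c):
    # Area of the triangle formed by three (x, y) points.
    return abs((b[0] - a[0]) * (c[1] - a[1]) -
               (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def simplify_ring(points, threshold):
    # Visvalingam-style simplification: repeatedly drop the interior point whose
    # triangle with its two neighbours has the smallest area, until every
    # remaining point contributes more than `threshold`.
    pts = list(points)
    while len(pts) > 3:
        areas = [triangle_area(pts[i - 1], pts[i], pts[i + 1])
                 for i in range(1, len(pts) - 1)]
        smallest = min(range(len(areas)), key=areas.__getitem__)
        if areas[smallest] >= threshold:
            break
        del pts[smallest + 1]
    return pts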

I was very surprised to find out that the two techniques combined, the reduction of coordinate precision and the simplification
of lines, yielded a very nice compression of my test GeoJSON: it went from 62MB to a mere 5.1MB! And all of this could still be reduced by
a factor of ten by gzipping it.
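For reference, gzipping the result from Python takes only a couple of lines (the file names here are just placeholders):

import gzip, shutil

# Compress the simplified GeoJSON before uploading it.
with open("map.geojson", "rb") as src, gzip.open("map.geojson.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)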

Just to check how well my compressed version stood up to the original, I opened both in Quantum GIS and was blown away: the differences were tiny! Only by magnifying A LOT was I able to see the differences. I am not including an image here, to encourage you to try it for yourself. ;-)

If you want to give it a try, the source code is here.

Sunday, April 1, 2012

Benchmarking NLTK under Pypy

Natural Language Processing (NLP). A marvelous world of possibilities! Fortunately, it is also a great example of another application domain for which Python is wonderfully well equipped.

I have been playing with Python and NLP for a couple of years now, integrating its tools into a reasonably large project. I hope to demo this project really soon, but it is not the topic of this post.

The topic of this post is to demonstrate the performance gains that are possible on typical NLP problems simply by using PyPy. PyPy has come a long way recently, and can now be used as a drop-in replacement for CPython in many applications, with large performance gains.

I'll start by showing how you can start using PyPy for your day-to-day development needs with very little effort. First you need to install two very powerful Python tools: virtualenv and virtualenvwrapper. These can be easily installed with easy_install or pip.
sudo easy_install -U virtualenv virtualenvwrapper
Follow the post-install configuration for virtualenvwrapper. Then download the most recent stable release tarball from the PyPy page, and extract it somewhere on your system:
wget https://bitbucket.org/pypy/pypy/downloads/pypy-1.8-linux64.tar.bz2
tar xvjf pypy-1.8-linux64.tar.bz2
Now you have to create your own virtualenv to work with PyPy instead of the standard CPython installation on your system:
mkvirtualenv -p pypy-1.8/bin/pypy pypyEnv
From this point on, whenever you want to use Pypy all you need to do is type:
workon pypyEnv
anywhere on your system.

Now that we've got all this environment setup out of the way, we can focus on testing NLTK with PyPy and comparing it to CPython. By the way, NLTK can be installed in the same way as virtualenv.

Since PyPy has a very extensive benchmarking system, I decided to keep all my benchmarking code visible, so that if the project devs want to take advantage of it to further improve PyPy, they can. The code is on GitHub.

The benchmarks (see the GitHub page) indicate big gains on some operations, and not so big ones on others. In a couple of cases PyPy is slower, though I didn't investigate why.

The main purpose of having such a benchmark is to provide some experimental grounds for the improvement of PyPy. Its core developers say that "if it is not faster than CPython, then it's a bug". But I also wanted to let fellow developers know that NLTK seems to be fully compatible with PyPy (as far as I have tested it), and can benefit from performance improvements when run with PyPy.

Naturally, this benchmark can be vastly improved and extended. I count on YOU to help me with this: just add a quick function to the benchmark.py module and send me a pull request. The snippet below gives an idea of what such a function could look like.
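As an illustration only (this is not code from the actual benchmark.py; the function name, tokenizer choice and repetition count are my own assumptions), a contributed benchmark can be as simple as timing one NLTK operation:

import time
from nltk.tokenize import WordPunctTokenizer

def bench_wordpunct_tokenize(text, repetitions=100):
    # Hypothetical benchmark: time NLTK's regex-based word tokenizer
    # over `repetitions` runs and return the elapsed time in seconds.
    tokenizer = WordPunctTokenizer()
    start = time.time()
    for _ in range(repetitions):
        tokenizer.tokenize(text)
    return time.time() - start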

Monday, October 10, 2011

Scrapy, Google Scholar and MongoDB

I have been deeply involved in text mining problems lately. If you watched my presentation at EuroSciPy this year (http://www.slideshare.net/fccoelho/mining-legal-texts-with-python), you have an idea of some of the things I am up to.

Well, part of the problem of text mining is getting hold of the text you want to analyze in the first place. For many projects I am involved with, I already have mountains of text to analyze. For some, however, I have to go get it. On the web...

This requires a technique (or should I say an art?) known as web scraping. Of course, there are tons of sites which are happy to provide you with an API for all your data-feeding needs. Some "evil" sites, however, unnecessarily withhold even public data. To be explicit, I am talking about Google here, which, despite a massive number of requests, has yet to provide an API for the consumption of its database of scientific literature. This database powers one of their search services called Google Scholar, which you are not likely to have heard about unless you are a researcher or a graduate student.

I am pretty sure that this database was built by scraping other literature-indexing sites and publishers' websites. None of the information in it is subject to copyright, since it is composed mainly of references to articles, books and similar resources hosted elsewhere on the web. Nevertheless, they insist on making the life of their fellow scrapers harder than it needs to be. But enough ranting about Google's evil policies.

My research involves analyzing the evolution of the scientific literature on a given subject, and to achieve this goal I need to download thousands of articles and analyze their content by means of natural language processing techniques (NLTK).

So I wrote a Google Scholar scraping tool (https://bitbucket.org/fccoelho/scholarscrap - pun intended) which is available to others facing the same challenge. It is a challenge indeed, given the number of stealth techniques necessary to avoid early detection by Google's bot watchdogs. My crawler is still being detected, meaning I can only make a limited number of requests each day before being banned. I welcome new ideas to further evade detection.
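To give an idea of the general approach, here is a bare-bones Scrapy spider sketch. It is not the actual scholarscrap code; the query URL, the CSS selectors and the download delay are illustrative assumptions, and a real crawler needs much more care (throttling, user-agent rotation, etc.) to avoid being banned:

import scrapy

class ScholarSpider(scrapy.Spider):
    # Illustrative only: not the scholarscrap implementation.
    name = "scholar"
    start_urls = ["https://scholar.google.com/scholar?q=epidemiology"]
    custom_settings = {"DOWNLOAD_DELAY": 10}  # be gentle, to reduce the chance of a ban

    def parse(self, response):
        # The CSS classes below are assumptions about the results page markup.
        for result in response.css("div.gs_ri"):
            yield {
                "title": " ".join(result.css("h3 a ::text").extract()),
                "snippet": " ".join(result.css("div.gs_rs ::text").extract()),
            }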

I wish Google would give something back to us researchers, after making so much money from our work.





