Monday, October 10, 2011

Scrapy, Google Scholar and MongoDB

I have been deeply involved in text mining problems lately. If you watched my presentation at EuroSciPy this year, you have an idea of some of the things I am up to.

Well, part of the problem with text mining is getting hold of the text you want to analyze in the first place. For many projects I am involved with, I already have mountains of text to analyze. For some, however, I have to go get it. On the web...

This requires a technique (or should I say an art?) known as web scraping. Of course, there are tons of sites which are happy to provide you with an API for all your data-feeding needs. Some "evil" sites, however, unnecessarily withhold even public data. To be explicit, I am talking about Google here, which, despite a massive number of requests, has yet to provide an API for the consumption of its database of scientific literature. This database powers one of their search services, Google Scholar, which you are not likely to have heard about unless you are a researcher or a graduate student.

I am pretty sure that this database was built by scraping other literature-indexing sites and publishers' websites. None of the information in it is subject to copyright, since it consists mainly of references to articles, books and similar resources hosted elsewhere on the web. Nevertheless, they insist on making the lives of their fellow scrapers harder than they need to be. But enough ranting about Google's evil policies.

My research involves analyzing the evolution of the scientific literature on a given subject. To achieve this goal, I need to download thousands of articles and analyze their content by means of natural language processing techniques (using NLTK).
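Just to make the analysis side concrete, here is a minimal sketch of the kind of processing I mean: using NLTK to tokenize a text and count its content words. The sample input is obviously a placeholder, not my actual pipeline:

    import nltk
    from nltk.corpus import stopwords

    # One-time downloads of the tokenizer model and the stopword list
    nltk.download('punkt')
    nltk.download('stopwords')

    def term_frequencies(text):
        """Return a frequency distribution of the content words in text."""
        words = nltk.word_tokenize(text.lower())
        stop = set(stopwords.words('english'))
        content = [w for w in words if w.isalpha() and w not in stop]
        return nltk.FreqDist(content)

    # Placeholder input; in practice this would be a downloaded article
    freqs = term_frequencies("The evolution of epidemic models in the literature...")
    print(freqs.most_common(10))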

So I wrote a Google Scholar scraping tool (pun intended) which is available to others facing the same challenge. It is a challenge indeed, given the number of stealth techniques necessary to avoid early detection by Google's bot watchdogs. My crawler is still being detected, meaning I can only make a limited number of requests each day before being banned. I welcome new ideas to further evade detection.
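For the record, the basic throttling knobs that Scrapy offers, plus a simple item pipeline that stores each scraped record in MongoDB (which is where the MongoDB of the title comes in), look roughly like the sketch below. The database and collection names are placeholders, and this is not the actual source of my tool:

    # settings.py -- throttle the crawl so it looks less bot-like
    DOWNLOAD_DELAY = 10               # seconds between requests
    RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay (0.5x to 1.5x)
    CONCURRENT_REQUESTS = 1
    USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) '
                  'AppleWebKit/535.1 (KHTML, like Gecko)')

    # pipelines.py -- store every scraped reference in MongoDB
    import pymongo

    class MongoPipeline(object):
        def open_spider(self, spider):
            self.client = pymongo.MongoClient('localhost', 27017)
            # database and collection names are placeholders
            self.collection = self.client['scholar']['articles']

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.collection.insert_one(dict(item))
            return item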

I wish Google would give something back to us researchers after making so much money from our work.


Monday, September 5, 2011

Generating Qt resource files from whole directory trees

Yesterday I had to scratch an annoying little itch: I had to embed a javascript library (MathJax) into a PyQt project of mine (Model-Builder). Embedding static files into Qt projects is done via resource files. I was surprised to find that the Qt Designer resource editor has no way to recursively scan directories; you have to add files manually.

That was not an option for me, since MathJax is somewhat large. Naturally, this is a trivial thing to do in Python, but I could not find any tool for it in a Google search, even though I found many mailing-list messages asking for one. So I decided to solve this problem not only for myself but for the other suffering souls out there.

For those who are not familiar with resource files: they are a clever way to encode data as a source file (a Python module, for instance). This greatly facilitates the distribution of your project, as you don't have to worry about static files; they become indistinguishable from source code. A single resource file (.qrc) may contain any number of files, preserving their full paths, so that your program can access them via URLs that start with "qrc:///...". Resource files are defined as simple XML files; a resource compiler (in PyQt4's case, pyrcc4) then compiles the static files listed in the XML into a Python .py file.
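To make that concrete, a .qrc file is just XML listing the files to embed; the file names below are placeholders:

    <RCC>
      <qresource prefix="/">
        <file>MathJax/MathJax.js</file>
        <file>MathJax/config/default.js</file>
      </qresource>
    </RCC>

Compiling it with "pyrcc4 -o mathjax_rc.py mathjax.qrc" produces a Python module; importing that module registers the embedded files with Qt, after which they are reachable at, for example, qrc:///MathJax/MathJax.js.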

The script I created, qrcgen, takes a directory and a prefix, recursively scans the directory, and generates a .qrc file with the same name as the directory scanned. It has solved my problem, and I hope it can help others. It is also available on PyPI: just "easy_install qrcgen".
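The core idea is just an os.walk over the directory tree. Here is a rough sketch of it (not the exact qrcgen source):

    import os
    import sys

    def generate_qrc(directory, prefix='/'):
        """Write <directory>.qrc listing every file under directory."""
        lines = ['<RCC>', '  <qresource prefix="%s">' % prefix]
        for root, dirs, files in os.walk(directory):
            for name in files:
                # Note: file names containing XML special characters
                # would need escaping here.
                lines.append('    <file>%s</file>' % os.path.join(root, name))
        lines.append('  </qresource>')
        lines.append('</RCC>')
        with open(directory.rstrip('/') + '.qrc', 'w') as qrc:
            qrc.write('\n'.join(lines))

    if __name__ == '__main__':
        generate_qrc(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else '/')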