Monday, October 10, 2011

Scrapy, Google Scholar and MongoDB

I have been deeply involved in text mining problems lately. If you watched my presentation at EuroSciPy this year (http://www.slideshare.net/fccoelho/mining-legal-texts-with-python), you have an idea of some of the things I am up to.

Part of the problem of text mining is getting hold of the text you want to analyze in the first place. For many of the projects I am involved with, I already have mountains of text to analyze. For some, however, I have to go and get it. On the web...

This requires a technique (or should I say an art?) known as web scraping. Of course, there are tons of sites that are happy to provide you with an API for all your data feeding needs. Some "evil" sites, however, needlessly withhold even public data. To be explicit, I am talking about Google here, which, despite a massive number of requests, has yet to provide an API for its database of scientific literature. This database powers one of its search services, Google Scholar, which you are not likely to have heard about unless you are a researcher or a graduate student.

I am pretty sure that this database was built by scraping other literature indexing sites and publishers' websites. None of the information in it is subject to copyright, since it is composed mainly of references to articles, books and similar resources hosted elsewhere on the web. Nevertheless, they insist on making the life of their fellow scrapers harder than it needs to be. But enough ranting about Google's evil policies.

My research involves analyzing the evolution of the scientific literature on a given subject. To achieve this goal, I need to download thousands of articles and analyze their content with natural language processing techniques, using NLTK.
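To make "evolution of the literature" a bit more concrete, here is a minimal sketch (not the code used in my project) of one simple measure: the share of all words in each year's abstracts taken up by a given term. The records are invented placeholder data, the names RECORDS, tokenize and term_share_by_year are hypothetical, and a plain regular expression stands in for a proper NLTK tokenizer so that the example runs with the standard library alone.

```python
import re
from collections import Counter

# Hypothetical placeholder records: (publication year, abstract text).
# Real input would come from the scraped articles.
RECORDS = [
    (2009, "Dengue transmission models and vector control."),
    (2010, "A network model of dengue spread in cities."),
    (2010, "Influenza surveillance using web queries."),
    (2011, "Dengue forecasting with climate data. Dengue incidence maps."),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_share_by_year(records, term):
    """Return {year: fraction of that year's tokens equal to `term`}."""
    term_counts = Counter()
    total_counts = Counter()
    for year, text in records:
        tokens = tokenize(text)
        total_counts[year] += len(tokens)
        term_counts[year] += tokens.count(term.lower())
    # Years with no tokens at all are skipped to avoid dividing by zero.
    return {
        year: term_counts[year] / total_counts[year]
        for year in sorted(total_counts)
        if total_counts[year] > 0
    }


if __name__ == "__main__":
    for year, share in term_share_by_year(RECORDS, "dengue").items():
        print(year, round(share, 3))
```

Normalizing by the total number of tokens per year keeps a year with many articles from dominating simply because more text was collected for it.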

So I wrote a Google Scholar scraping tool (https://bitbucket.org/fccoelho/scholarscrap, pun intended), which is available to others facing the same challenge. It is a challenge indeed, given the number of stealth techniques necessary to avoid early detection by Google's bot watchdogs. My crawler is still being detected, meaning I can only make a limited number of requests each day before being banned. I welcome new ideas to further evade detection.

I hope Google will give something back to us researchers after making so much money from our work.







7 comments:

Jonathan Street said...

I've recently been doing something similar. I've just completed a PhD, during which I collected hundreds to thousands of PDF documents. I thought it would be useful to have an automated means of grouping related documents. I gave a talk on it at a recent Python Northwest group meeting (http://jonathanstreet.com/blog/full-text-visualisation/) and have been working on it since. I'm getting some meaningful results, but this is very much an area in which I lack experience.

Going through the linked presentation was very interesting - quite a few tools I'm going to look into further.

Flavio Coelho said...

@Jonathan: Thanks for the comments. I took a look at your blog and found it very informative. We are trying to adopt some method of document clustering as well, but have not decided which similarity measure to use. tf-idf is appealing and is already implemented (as tf_idf) in NLTK's TextCollection class.

Marcel Pinheiro Caraciolo said...

Great work, Flavio! You should take a look at the gensim project; it's a great project for computing the similarity between documents. By the way, a good exercise for clustering documents is the tutorial provided by the scikit-learn toolkit. I met the core developers of this project and they are really engaged in this open-source project: http://scikit-learn.sourceforge.net/auto_examples/document_clustering.html

I am terribly sorry that I couldn't manage to meet you during my stay in Rio. I will maybe come back this month, and I really want to meet you and your team! Jayron is a great guy!

Regards,
Marcel

Dogwalker said...

I'd like to refer you to Mendeley, a reference manager. They have an open API:
http://dev.mendeley.com/
which means you can use their database of publications.

Pedro Feitosa said...

Professor,

I would like to know whether you will be responsible for the Introduction to Mathematical Analysis course in the summer term at EMAP-FGV. If so, I am interested in the duration of the course and in its syllabus.

Thank you in advance,

Pedro Feitosa de Lucena
Undergraduate student in Economics, UnB

e-mail: pedroflucena@gmail.com

Pat said...

Hello,

I have built a similar tool (scholarScape) to crawl Google Scholar.
Python + Scrapy + MongoDB
It also checks for duplicates and generates a graph of cited-by relationships.
I think you'll be very interested.
If you have an Ubuntu machine, the install process is very straightforward.
In the end, scholarScape is a server that provides a web interface for starting your crawls and exporting the graphs.

scholarScape does not download the documents, but it remembers their URLs, so you will be able to download the documents after the crawl (with curl, for example).

If you have any questions, please send me an e-mail.

Pat said...

I forgot the URL of scholarScape:
http://www.github.com/medialab/scholarScape
