Creating an Elasticsearch index of Congress Bills using Pig


Recently, Mortar worked to get CPython UDF support committed into the Apache Pig trunk. This makes it possible to take advantage of Hadoop with real Python: you focus on just the logic you need, and streaming Python takes care of all the plumbing.

Shortly thereafter, Elasticsearch announced integration with Hadoop. “Using Elasticsearch in Hadoop has never been easier. Thanks to the deep API integration, interacting with Elasticsearch is similar to that of HDFS resources. And since Hadoop is more than just vanilla Map/Reduce, in elasticsearch-hadoop one will find support for Apache Hive, Apache Pig and Cascading in addition to plain Map/Reduce.”

Elasticsearch published the first milestone of elasticsearch-hadoop (1.3.0.M1) based on the new codebase that has been in the works for the last few months.

My initial attempt at testing Mortar with Elasticsearch didn’t work. Working with the great team at Mortar and costinl at Elasticsearch, they were able to update the Mortar platform so it could write out to Elasticsearch at scale.

Test Case
To test this out, I decided to process congressional bill data from the past several Congresses. The process: read in the JSON files, process them with Pig, use NLTK to find the top 5 bigrams, and then write the data out to an Elasticsearch index.

The Data
GovTrack.us, a tool by Civic Impulse, LLC, is one of the world’s most visited government transparency websites. The site helps ordinary citizens find and track bills in the U.S. Congress and understand their representatives’ legislative record.

The bulk data is a deep directory structure of flat XML and JSON files, organized into three main directories.

Getting the Data

GovTrack supports fetching the data via rsync, a common Unix/Mac tool for efficiently fetching files and keeping them updated as they change. The root of the rsync tree is govtrack.us::govtrackdata, which corresponds exactly to what you see at http://www.govtrack.us/data/.

To download bill data for the 113th Congress into a local directory named bills, run:

rsync -avz --delete --delete-excluded --exclude **/text-versions/ \
		govtrack.us::govtrackdata/congress/113/bills .

(Note the double colons in the middle and the period at the end. This is a long command. I’ve indicated the line continuation with a backslash.)

Directories

The following code loops through a directory of bills and converts each of the .json files into a single-line .json file.
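A minimal sketch of that conversion, assuming the rsync'd bills directory as input and a flat bills_oneline output directory (these directory names are just placeholders):

import json
import os

def flatten_json_files(in_dir, out_dir):
    # Walk in_dir, re-serialize every .json file onto a single line,
    # and write the results as a flat set of files under out_dir.
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    for dirpath, _, filenames in os.walk(in_dir):
        for name in filenames:
            if not name.endswith('.json'):
                continue
            src = os.path.join(dirpath, name)
            with open(src) as f:
                record = json.load(f)
            # Build a flat, unique output name from the original path,
            # e.g. hr/hr1/data.json -> hr_hr1_data.json
            rel = os.path.relpath(src, in_dir)
            dst = os.path.join(out_dir, rel.replace(os.sep, '_'))
            with open(dst, 'w') as out:
                # json.dumps without an indent argument produces a single line
                out.write(json.dumps(record) + '\n')

if __name__ == '__main__':
    flatten_json_files('bills', 'bills_oneline')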

The following Pig code reads all of the single-line .json files, pulls out some of the fields, calls a Python UDF to find the top 5 bigrams, and then writes the data into an Elasticsearch index.

The important steps are to:
a) register the elasticsearch-hadoop jar file
b) define the storage that points to the Elasticsearch index
c) write out the data using the defined storage
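Here is a rough sketch of such a script. The jar path, the loader, the field names, and the govtrack/bills index/type are all placeholders, and the EsStorage class and its options should be checked against the elasticsearch-hadoop version you register:

-- a) register the elasticsearch-hadoop jar (path and version here are illustrative)
REGISTER /path/to/elasticsearch-hadoop-1.3.0.M1.jar;

-- register the CPython streaming UDF (file and module names are placeholders;
-- the UDF itself is sketched further down)
REGISTER 'bigram_udfs.py' USING streaming_python AS bigram_udfs;

-- b) define the storage for the Elasticsearch index; connection settings such as
-- the cluster host and port can be passed as constructor arguments
DEFINE ES org.elasticsearch.hadoop.pig.EsStorage();

-- read the single-line JSON files into one map per record
-- (elephant-bird's JsonLoader is one way to do this; register its jars if they
-- aren't already on the classpath)
bills_json = LOAD 'bills_oneline' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- pull out a few fields (the field names are guesses at the GovTrack layout)
-- and compute the top 5 bigrams with the Python UDF
bills = FOREACH bills_json GENERATE
            (chararray) $0#'bill_id'        AS bill_id,
            (chararray) $0#'official_title' AS title,
            bigram_udfs.top_5_bigrams((chararray) $0#'official_title') AS top_bigrams;

-- c) write out the data using the defined storage
STORE bills INTO 'govtrack/bills' USING ES();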

If you are using the Mortar framework, NLTK isn’t installed by default. Here’s how you can install it:

# From your project's root directory, switch to the Mortar local virtualenv
source .mortar-local/pythonenv/bin/activate

# Install NLTK (http://nltk.org/install.html) into the virtualenv
pip install -U pyyaml nltk

For the bigrams, I re-used some sample Mortar code from Doug Daniels.
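A minimal sketch of a streaming Python UDF along those lines (the function name, output schema, and scoring measure are my choices rather than Doug's exact code) would live in the bigram_udfs.py file registered in the Pig script above:

from pig_util import outputSchema
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

@outputSchema('top_bigrams:bag{t:(bigram:chararray)}')
def top_5_bigrams(text):
    # Guard against null/empty input coming from Pig
    if not text:
        return []
    # Simple whitespace tokenization keeps the UDF free of extra NLTK corpora
    tokens = text.lower().strip().split()
    # Rank bigrams by likelihood ratio and keep the top 5
    finder = BigramCollocationFinder.from_words(tokens)
    top_5 = finder.nbest(BigramAssocMeasures().likelihood_ratio, 5)
    # Return a Pig bag of single-field tuples
    return [('%s %s' % (first, second),) for (first, second) in top_5]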

Results
The Pig job loaded 58,624 files, processed them, and created the Elasticsearch index in 53 seconds. The NLTK Python UDF finished in another 34 seconds, for a total time of 87 seconds.


You can see the working Elasticsearch index in the following screenshot:

[Screenshot: the Elasticsearch index]

One thing of note
The elasticsearch-hadoop connector doesn’t handle geo-coordinates quite yet, so you can’t create an index with latitude/longitude. That should be coming soon.
