Recently, Mortar worked with the Apache Pig community to get CPython UDF support committed into the Pig trunk. This makes it possible to take advantage of Hadoop with real Python: users focus only on the logic they need, and streaming Python takes care of all the plumbing.
Shortly thereafter, Elasticsearch announced integration with Hadoop: “Using Elasticsearch in Hadoop has never been easier. Thanks to the deep API integration, interacting with Elasticsearch is similar to that of HDFS resources. And since Hadoop is more than just vanilla Map/Reduce, in elasticsearch-hadoop one will find support for Apache Hive, Apache Pig and Cascading in addition to plain Map/Reduce.”
Elasticsearch published the first elasticsearch-hadoop milestone (1.3.0.M1), based on a new code base that had been in the works for the last few months.
The initial attempt at testing Mortar and Elasticsearch together didn’t work. After working with the great team at Mortar and costinl at Elasticsearch, though, Mortar was able to update their platform to write out to Elasticsearch at scale.
Test Case
To test this out, I decided to process congressional bill data from the past several congresses. The process is to read in the JSON files, process them using Pig, use NLTK to find the top 5 bigrams, and then write the data out to an Elasticsearch index.
The Data
GovTrack.us, a tool by Civic Impulse, LLC, is one of the world’s most visited government transparency websites. The site helps ordinary citizens find and track bills in the U.S. Congress and understand their representatives’ legislative record.
The bulk data is a deep directory structure of flat XML and JSON files. The directory layout is described below.
The files are in two main directories:
- /data/congress-legislators/
Information on Members of Congress from 1789-present, presidents and vice presidents, Congressional committees, and current committee assignments. This data is a mirror of the files in github:unitedstates/congress-legislators.
- /data/congress/ (i.e. http://www.govtrack.us/data/congress/)
Bill status and other legislative data from 2013 (113th Congress) and forward. This data is the output of the scrapers developed by the github:unitedstates/congress project.
Getting the Data
To fetch the data, GovTrack supports rsync, a common Unix/Mac tool for efficiently fetching files and keeping them updated as they change. The root of the rsync tree is govtrack.us::govtrackdata, which corresponds exactly to what you see at http://www.govtrack.us/data/.
To download bill data for the 113th Congress into a local directory named bills, run:
rsync -avz --delete --delete-excluded --exclude **/text-versions/ \
    govtrack.us::govtrackdata/congress/113/bills .
(Note the double colons in the middle and the period at the end. This is a long command. I’ve indicated the line continuation with a backslash.)
Directories
- /data/congress/113/bills/[bill_type]/[bill_type][bill_number]/data.[json|xml]
Bill and resolution status for bills in the 113th Congress. See the github:unitedstates/congress project documentation for details of the JSON format. The XML format is backwards-compatible with GovTrack's legacy bill XML files.
The following code loops through a directory of bills and converts all of the .json files into single-line .json files.
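That conversion script isn't embedded in this post, so here is a minimal sketch of the idea in Python. It assumes the bills were rsynced into a local bills/ directory and writes the flattened copies into a hypothetical bills_oneline/ directory that mirrors the original layout; the actual script may have differed.

```python
# Sketch: walk the bills/ directory and rewrite each .json file as a
# single-line JSON file under bills_oneline/ (both paths are assumptions).
import json
import os

INPUT_DIR = "bills"            # local directory created by the rsync command above
OUTPUT_DIR = "bills_oneline"   # hypothetical output directory for the flattened files

for root, _, files in os.walk(INPUT_DIR):
    for name in files:
        if not name.endswith(".json"):
            continue
        src = os.path.join(root, name)
        with open(src) as f:
            bill = json.load(f)
        # Mirror the source directory structure under OUTPUT_DIR
        dst = os.path.join(OUTPUT_DIR, os.path.relpath(src, INPUT_DIR))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        with open(dst, "w") as f:
            # json.dumps without indentation emits the whole document on one line
            f.write(json.dumps(bill))
```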
The following Pig code reads all of the single-line .json files, pulls out some of the fields, calls a Python UDF to find the top 5 bigrams, and then writes the data into an Elasticsearch index.
The important steps are to:
a) register the jar file
b) define the storage for the Elasticsearch index
c) write out the data using the defined storage
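The full Pig script isn't embedded in this post, so here is a minimal sketch of how those steps might fit together. It assumes elephant-bird's JsonLoader for reading the single-line JSON, a streaming_python UDF file named bigram_udf.py exposing a top_5_bigrams function, and an index/type of govtrack/bills; the jar path, field names, and storage configuration are placeholders that follow the current elasticsearch-hadoop documentation rather than the exact milestone used here.

```pig
-- Sketch only: paths, field names, and the index name are assumptions.
REGISTER /path/to/elasticsearch-hadoop-1.3.0.M1.jar;

-- CPython streaming UDF that returns the top 5 bigrams for a piece of text
REGISTER 'bigram_udf.py' USING streaming_python AS bigram_udf;

-- Load the single-line JSON bill files as maps
bills = LOAD 'bills_oneline/*/*/data.json'
        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
        AS (json: map[]);

-- Pull out a few illustrative fields and compute the bigrams
projected = FOREACH bills GENERATE
    (chararray) json#'bill_id'        AS bill_id,
    (chararray) json#'official_title' AS official_title,
    bigram_udf.top_5_bigrams((chararray) json#'official_title') AS top_bigrams;

-- Write the result into Elasticsearch using the registered storage
-- (connection settings such as es.nodes can be passed as constructor arguments)
STORE projected INTO 'govtrack/bills'
    USING org.elasticsearch.hadoop.pig.EsStorage();
```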
If you are using the Mortar framework, NLTK isn’t installed by default. Here’s how you can install it:
# From your project's root directory
# Switch to the Mortar local virtualenv
source .mortar-local/pythonenv/bin/activate

# Install nltk (http://nltk.org/install.html)
sudo pip install -U pyyaml nltk
For the bigrams, I re-used some sample Mortar code from Doug Daniels, shown below:
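That sample isn't reproduced here; what follows is a rough sketch of an NLTK-based bigram UDF for Pig's streaming_python, with the top_5_bigrams name and output schema chosen to match the Pig sketch above rather than taken from Doug Daniels' original.

```python
# Sketch of an NLTK bigram UDF for Pig's streaming_python
# (function name and schema are assumptions chosen to match the Pig sketch above).
from pig_util import outputSchema

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

@outputSchema('top_bigrams:bag{t:tuple(bigram:chararray)}')
def top_5_bigrams(text):
    """Return the 5 highest-scoring bigrams in the text as a bag of tuples."""
    if not text:
        return []
    # nltk.word_tokenize needs the 'punkt' tokenizer data (nltk.download('punkt'))
    words = nltk.word_tokenize(text.lower())
    finder = BigramCollocationFinder.from_words(words)
    # Rank candidate bigrams by pointwise mutual information and keep the top 5
    best = finder.nbest(bigram_measures.pmi, 5)
    return [(' '.join(pair),) for pair in best]
```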
Results
The Pig job loaded 58,624 files, processed them, and created the Elasticsearch index in 53 seconds. The NLTK Python UDF finished in another 34 seconds, resulting in a total time of 87 seconds.
You can see the populated Elasticsearch index in the following screenshot:
One thing of note
The elasticsearch-hadoop connector doesn’t handle geo-coordinates quite yet, so you can’t create an index with latitude/longitude. That should be coming soon.