In preparation for my talk at GraphConnect, I’ve added onto the talk by graphing out congressional sponsorship of bills using Neo4J. This post will specifically talk about data acquisition, data preparation, data loading into Neo4J and simple cypher queries against the data. Follow-on posts will look at processing this same data in Hadoop, processing this data in data-flows with Pig and Python to expose new and interesting properties, and visually graphing the data using d3.js.
In this project we will download a spreadsheet of all members of congress as well as all bills/resolutions from the 112th session of congress. Govtrack.us has made an xml file of everyone that has ever served in Congress, and U.S. presidents, with their party affiliation, terms in Congress, birthdays, etc. Additionally, they have provided all bills and resolutions for the current Congress and similarly named directories for other Congresses.
GovTrack has provided easy to use instructions on obtaining their data. To download GovTrack’s raw data files in bulk, or if you plan to regularly update the files, you should use the rsync tool. Rsync is good for selecting which directory of data you want and keeping your files up to date by only downloading changes on each update.
rsync -avz --delete --delete-excluded govtrack.us::govtrackdata/us/112/bills .
Complete code for this example will be made available on github.
Using Excel, you can load in the Congress xml file and it will be transformed into a spreadsheet. By filtering on the active field, you can quickly create a spreadsheet of 535 congressmen. This file is then saved out to a tab-delimited format and then using TextWrangler converted the tabs to pipe delimiters.
Each of the bills is downloaded as an xml file. A sample is shown below:
REFERRED A resolution expressing the sense of the Senate that the primary safeguard for the well-being and protection of children is the family, and that the primary safeguards for the legal rights of children in the United States are the Constitutions of the United States and the several States, and that, because the use of international treaties to govern policy in the United States on families and children is contrary to principles of self-government and federalism, and that, because the United Nations Convention on the Rights of the Child undermines traditional principles of law in the United States regarding parents and children, the President should not transmit the Convention to the Senate for its advice and consent. Referred to the Committee on Foreign Relations. <summary> 3/10/2011--Introduced. Expresses the sense of the Senate that: (1) the United Nations Convention on the Rights of the Child is incompatible with the Constitution, the laws, and the traditions of the United States; (2) the Convention would undermine presumptions of freedom and independence for U.S. families; (3) the Convention would interfere with U.S. principles of sovereignty, independence, and self-government that preclude the necessity or propriety of adopting international law to govern domestic matters; and (4) the President should not transmit the Convention to the Senate for its advice and consent. </summary>
Using a quickly written java application, I was able to create several delimited files that I would need to support data loading into Neo4J. Each bill has a sponsor, may have 1 or more co-sponsors, has one or more topic or subject that it describes and has activity as it moves through congress. You can see that modeled below:
Data Loading into Neo4J
Using the BatchInserter framework from Max De Marzi and Michael Hunger, the data can be quickly loaded into Neo4J (less than 5 minutes). There are 50,889 nodes, 121,049 properties, and 266,197 relationships.
Using Cypher, you can quickly see who sponsored the most bills:
START n=node:congress('lastName:*') MATCH n-[r:SPONSORS]->c RETURN n.lastname, n.firstname, count(*) ORDER BY count(*) desc LIMIT 5;
Casey, Robert 228 Menéndez, Robert 207 Schumer, Charles 146 Brown, Sherrod 118 Hagan, Kay 116
And those that COSPONSOR:
START n=node:congress('lastName:*') MATCH n-[r:COSPONSORS]->c RETURN n.lastname, n.firstname, count(*) ORDER BY count(*) desc LIMIT 5;
Grijalva, Raul 752 Filner, Bob 704 Rangel, Charles 687 Norton, Eleanor 681 Lee, Barbara 605
We’ve seen how we can quickly transform data in XML files into Neo4J and begin to look at the data. The next post will look at further analyzing of the data.