This post is inspired by Marko Rodriguez’ excellent post on a Graph-Based Movie Recommendation engine. I will use many of the same concepts that he describes in his post in order to load the data into Neo4J and then begin to analyze the data. This post will focus on the data loading. Follow-on posts will look at further analysis based on the relationships.
Background
The Federal Election Commission has made campaign contribution data publicly available for download here. The FEC has provided campaign finance maps on its home page. The Sunlight Foundation has created the Influence Explorer to provide similar analysis.
This post and follow-on posts will look at analyzing the Campaign Data using the graph database Neo4j, and the graph traversal language Gremlin. This post will go about showing the data preparation, the data modeling and then loading into Neo4J.
The FEC Data
The FEC data is available for download from the FEC website via FTP. It is composed of three main files which are the Campaign Committees, Campaign Candidates and the Individual Contributors. As of this post, there were approximately 10,875 committees, 3,600 candidates, and 455,000 unique contributions. Each of the data sets has a data description as well as frequency counts. The 2011-2012 data can be found here.
Gremlin and Neo4J
Gremlin 1.3 is available for download at this location. Neo4J 1.5M01 is available for download at this location. For this demonstration, we will be running the community edition of Neo4J in a Windows Virtual Machine.
Data Preparation
The FEC data is in formatted, fixed-length fields. This makes it a little bit harder to prepare for import into Neo4J with my limited skills and abilities. To work around that, I was able to load the data into Oracle using SQL Loader and then I wrote a simple PHP program to query the database and format the data into a delimited file. If interested in those files, feel free to contact me.
The FEC Data Graph
The FEC data is represented in the following graph. Each committee supports a candidate. Some candidates may be independent from a committee. Individuals contribute 1 or more times to a committee. For this demonstration, we’ve haven’t separated out city/state/zip and created a common location.
A couple of notes on the data. Some of the committees did not have a treasurer so I added in a value of “No Treasurer”. Some of the candidates were referenced to non-existent committees. In this case, I’ve created entries for those committees in order to load the data and create the links. Additionally, the individual contribution file has overpunch characters to different amounts or negative amounts. Those values were adjusted in the database so the data could be loaded as an integer value.
Loading Data
The data will be inserted into the graph database Neo4j. The Gremlin/Groovy code below creates a new Neo4j graph, removes an unneeded default edge index, and sets the transaction buffer to 2500 mutations per commit.
g = new Neo4jGraph('/tmp/FEC') g.dropIndex("edges") g.setMaxBufferSize(2500)
Loading Committee Data
The committee data contains information about the different election committees. In our case, it has seven columns.
C00000059::HALLMARK CARDS PAC::UNK::KANSAS CITY::MO::64108::GREG SWARENS C00000422::AMERICAN MEDICAL ASSOCIATION POLITICAL ACTION COMMITTEE::::WASHINGTON::DC::20001::WALKER, KEVIN C00000489::D R I V E POLITICAL FUND, TEAMSTERS LOCAL UNION CHAPTER 886::::OKLAHOMA CITY::OK::73107::TOM RITTER C00000547::KANSAS MEDICAL SOCIETY POLITICAL ACTION COMMITTEE::UNK::TOPEKA::KS::66612::C. RICHARD BONEBRAKE, M.D. C00000638::INDIANA STATE MEDICAL ASSOCIATION POLITICAL ACTION COMMITTEE::::INDIANAPOLIS::IN::46202::KORA, M.D., VIDYA C00000729::AMERICAN DENTAL ASSOCIATION POLITICAL ACTION COMMITTEE::UNK::WASHINGTON::DC::20005::CONNOR, FRANCIS DR
The code needed to parse this data is below:
new File('committee.dat').eachLine {def line -> def components = line.split('::'); def committeeVertex = g.addVertex(['type':'Committee','committeeId':components[0], 'name':components[1], 'party':components[2],'city':components[3],'state':components[4],'zip':components[5],'treasurer':components[6]]); }
Parsing Candidate Data
The candidate data contains information about the various candidates. In our case, it has nine columns. A sample of the data is below:
H0AL00016::BOZEMAN, MARTHA RENEE::BIRMINGHAM::AL::35201::UNK::P::10::07 H0AL01030::GOUNARES, PETER HUNTER::ORANGE BEACH::AL::36561::REP::P::10::01 H0AL01048::WALTER, DAVID MARSH::FOLEY::AL::36535::CON::P::10::01 H0AL02087::ROBY, MARTHA::MONTGOMERY::AL::36106::REP::C::12::02 H0AL05049::CRAMER, ROBERT E "BUD" JR::HUNTSVILLE::AL::35804::DEM::P::08::05 H0AL05155::PHILLIP, LESTER S::MADISON::AL::35758::REP::P::10::05 H0AL05163::BROOKS, MO::HUNTSVILLE::AL::35802::REP::C::12::05 H0AL05189::SHEPARD, TAZEWELL::HUNTSVILLE::AL::35801::DEM::P::10::05 H0AL05197::RABY, STEPHEN WALKER::TOREY::AL::35773::DEM::P::10::05 H0AL06088::COOKE, STANLEY KYLE::KIMBERLY::AL::35091::REP::P::10::06 H0AL06096::LAMBERT, PAUL ANTHONY::MAYLENE::AL::35114::REP::N::10::06
The code to parse the candidate file is:
new File('candidate.dat').eachLine {def line -> def components = line.split('::'); def candVertex = g.addVertex(['type':'Candidate','candId':components[0], 'candName':components[2], 'candCity':components[3], 'candState':components[4],'candZip':components[5],'candParty':components[6],'candStatus':components[7],'candYear':components[8],'candDistrict':components[9]]); def supportedEdge = g.addEdge(g.idx(T.v)[[committeeId:components[1]]].next(), candVertex, 'supports'); }
Loading the Individual Contributors File
The individual contributors file contains all of the contributions made to different committees.
The sample data is:
C00000422::0009951::Helm, Douglas Alan MD::PERINATAL ASSOCIATES/Physician::Fresno::CA::93701::01::11::11::20::0000500::M2 C00000422::0009952::Karasek, Dennis Edward MD::SELF-EMPLOYED/Physician::San Antonio::TX::78231::01::11::11::20::0002000::M2 C00000422::0009953::Kilgore, Shannon M MD::VA PALO ALTO HCS/Physician::Palo Alto::CA::94304::01::11::11::20::0000500::M2 C00000422::0009954::Matthews, George Philip MD::VISION QUEST/Physician::Arlington::TX::76006::01::11::11::20::0000500::M2 C00000422::0009955::Kimball, Daniel B Jr. MD::N/A/Retired Physician::Reading::PA::19611::01::15::11::20::0001000::M2 C00000422::0009956::Mehling, Brian Macdermott MD::MEHLING ORTHOPAEDIC/Physician::West Islip::NY::11795::01::14::11::20::0000291::M2
Given that there are about a half a million contributors, parsing this data and loading will take a couple of minutes.
new File('indiv.dat').eachLine {def line -> def components = line.split('::'); def indivVertex = g.addVertex(['type':'Individual','indivId':components[1], 'indivName':components[2], 'indivOccupation':components[3],'indivCity':components[4], 'indivState':components[5],'indivZip':components[6],'transDate':components[7] + components[8] +components[9],'amount':components[11],'transactionType':components[12]]); }
To commit any data left over in the transaction buffer, successfully stop the current transaction. Now the data is persisted to disk. If you plan on leaving the Gremlin console, be sure to g.shutdown() the graph first.
g.stopTransaction(TransactionalGraph.Conclusion.SUCCESS)
Validating the Data
gremlin> g.V.count() ==>462915 gremlin> g.E.count() ==>441262 gremlin> g.V[[type:'Committee']].count() ==>21653 gremlin> g.V[[type:'Candidate']].count() ==>3536 gremlin> g.V[[type:'Individual']].count() ==>437726
Let’s look at some distributions
What is the distribution of contributions among states?
gremlin> g.V[[type:'Individual']].indivState.groupCount(m) >> -1 ==>null gremlin> m.sort{a,b -> b.value<=>a.value} ==>CA=52767 ==>NY=34742 ==>TX=34521 ==>FL=25758 ==>VA=22660 ==>IL=19075 ==>PA=15494 ==>MA=14134 ==>DC=14108 ==>OH=13425 ==>MI=11938 ==>MD=11647 ==>NJ=11523 ==>CT=11165 ==>WA=9410 ==>GA=9195 ==>MN=8339 ==>TN=8112 ==>MO=7738 ==>NC=7380 ==>AZ=6921 ==>CO=6876 ==>IN=6529 ==>WI=6002 ==>LA=5020 ==>NV=4118 ==>NM=3956 ==>KS=3862 ==>OR=3758 ==>IA=3593 ==>AL=3383 ==>SC=3310 ==>OK=3227 ==>KY=3218 ==>NE=3147 ==>MT=2417 ==>UT=2388 ==>AR=2269 ==>NH=2051 ==>MS=1866 ==>RI=1840 ==>HI=1629 ==>ME=1605 ==>ND=1578 ==>WV=1492 ==>SD=1442 ==>DE=1288 ==>VT=1133 ==>ID=980 ==>AK=965 ==>WY=920 ==>=800 ==>PR=611 ==>VI=160 ==>ZZ=141 ==>GU=83 ==>MP=14 ==>AS=3 gremlin>
What about the average contribution?
gremlin> g.V[[type:'Individual']].amount.mean() ==>1138.58
Are there any treasurers supporting multiple committees?
gremlin> m=[:] gremlin> g.V[[type:'Committee']].treasurer.groupCount(m) >> -1 ==>null gremlin> m.sort{a,b -> b.value <=> a.value}[0..19] ==>No Chair=1716 ==>LAROSE, JOSUE=122 ==>DURKEE, KINDE=90 ==>KINDE DURKEE=84 ==>LISA LISKER=80 ==>JUDITH ZAMORE=68 ==>KEITH A DAVIS=66 ==>LISKER, LISA=62 ==>CUSHMAN, NANCY=56 ==>NANCY H WATKINS=56 ==>NO TREASURER=43 ==>CABELL HOBBS=40 ==>KILGORE, PAUL=34 ==>KELLY LAWLER=34 ==>PAUL KILGORE=32 ==>KELLEY, MEREDITH=30 ==>WATKINS, NANCY H.=30 ==>MACKENZIE, SCOTT B=28 ==>ADRIANE RUMMEL=28 ==>BAUER, DAVID=22
No chair and no treasurer indicate that the treasurer value was empty. However, there are several treasurers supporting multiple committees.
Next Steps
The next steps will be to look at some of the relationships between contributors and committees and see if there are treasurers serving on multiple committees.
Additionally, because each contribution is counted individually, there are several duplicate donors/campaign contributors. In order to address that, I will separate out the donors and their address as a separate table and link them to the contributions.
If you have questions about this post, feel free to email me.