Federal Election Commission Campaign Data Analysis

This post is inspired by Marko Rodriguez’s excellent post on a graph-based movie recommendation engine. I will use many of the same concepts he describes in his post to load the data into Neo4j and then begin to analyze it. This post focuses on the data loading. Follow-on posts will look at further analysis based on the relationships.

The Federal Election Commission has made campaign contribution data publicly available for download here. The FEC has provided campaign finance maps on its home page. The Sunlight Foundation has created the Influence Explorer to provide similar analysis.

This post and follow-on posts will analyze the campaign data using the graph database Neo4j and the graph traversal language Gremlin. This post covers the data preparation, the data modeling, and the loading into Neo4j.

The FEC Data
The FEC data is available for download from the FEC website via FTP. It is composed of three main files: the campaign committees, the campaign candidates, and the individual contributions. As of this post, there were approximately 10,875 committees, 3,600 candidates, and 455,000 unique contributions. Each of the data sets has a data description as well as frequency counts. The 2011-2012 data can be found here.

Gremlin and Neo4J
Gremlin 1.3 is available for download at this location. Neo4j 1.5M01 is available for download at this location. For this demonstration, we will be running the community edition of Neo4j in a Windows virtual machine.

Data Preparation
The FEC data is in formatted, fixed-length fields. That makes it a little harder to prepare for import into Neo4j with my limited skills and abilities. To work around that, I loaded the data into Oracle using SQL*Loader and then wrote a simple PHP program to query the database and format the data into a delimited file. If you are interested in those files, feel free to contact me.
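The same fixed-width-to-delimited conversion could also be sketched in a few lines of Python instead of the Oracle/PHP route. The column offsets below are invented for illustration; the real offsets come from the FEC data descriptions mentioned above.

```python
# Hypothetical sketch: convert one fixed-width record into a '::'-delimited line.
# The (start, end) offsets are made up for illustration, not the real FEC layout.
FIELDS = [(0, 9), (9, 47), (47, 65), (65, 67), (67, 76)]  # e.g. id, name, city, state, zip

def to_delimited(record, fields=FIELDS, sep='::'):
    """Slice a fixed-width record and join the trimmed fields with sep."""
    return sep.join(record[start:end].strip() for start, end in fields)

# Example record built to match the assumed widths above:
record = ("H0AL01048"
          + "WALTER, DAVID MARSH".ljust(38)
          + "FOLEY".ljust(18)
          + "AL"
          + "36535".ljust(9))
line = to_delimited(record)
# → "H0AL01048::WALTER, DAVID MARSH::FOLEY::AL::36535"
```

Each output line can then be fed straight to the Gremlin loading scripts shown below.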

The FEC Data Graph
The FEC data is represented in the following graph. Each committee supports a candidate. Some candidates may be independent from a committee. Individuals contribute one or more times to a committee. For this demonstration, we haven’t separated out city/state/zip and created a common location.

A couple of notes on the data. Some of the committees did not have a treasurer, so I added a value of “No Treasurer”. Some of the candidates referenced non-existent committees; in those cases, I created entries for the missing committees in order to load the data and create the links. Additionally, the individual contribution file uses overpunch characters for adjusted or negative amounts. Those values were adjusted in the database so the data could be loaded as an integer value.
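For illustration, here is a sketch of how those overpunch amounts could be decoded programmatically, assuming the common COBOL/EBCDIC signed-overpunch convention where the last character carries the final digit plus the sign ('{' = +0, 'A'-'I' = +1..+9, '}' = -0, 'J'-'R' = -1..-9); the FEC files may use a variant.

```python
# Sketch of signed-overpunch decoding, assuming the common COBOL/EBCDIC
# convention: the last character encodes both the final digit and the sign.
POSITIVE = {'{': 0, **{chr(ord('A') + i): i + 1 for i in range(9)}}
NEGATIVE = {'}': 0, **{chr(ord('J') + i): i + 1 for i in range(9)}}

def decode_overpunch(field):
    """Decode a numeric field whose last character may be an overpunch."""
    last = field[-1]
    if last.isdigit():
        return int(field)          # no overpunch, plain positive number
    if last in POSITIVE:
        return int(field[:-1] + str(POSITIVE[last]))
    if last in NEGATIVE:
        return -int(field[:-1] + str(NEGATIVE[last]))
    raise ValueError("unrecognized overpunch character: %r" % last)

decode_overpunch("0000500")   # → 500
decode_overpunch("000050}")   # → -500
```

Running every amount field through a decoder like this is what allows the values to be stored as plain integers.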

Loading Data
The data will be inserted into the graph database Neo4j. The Gremlin/Groovy code below creates a new Neo4j graph, removes an unneeded default edge index, and sets the transaction buffer to 2500 mutations per commit.

g = new Neo4jGraph('/tmp/FEC')
g.dropIndex("edges")
g.setMaxBufferSize(2500)

Loading Committee Data
The committee data contains information about the different election committees. In our case, it has seven columns.


The code needed to parse this data is below:

new File('committee.dat').eachLine {def line ->
  def components = line.split('::');
  def committeeVertex = g.addVertex(['type':'Committee','committeeId':components[0], 'name':components[1], 'party':components[2],'city':components[3],'state':components[4],'zip':components[5],'treasurer':components[6]]);
}

Parsing Candidate Data
The candidate data contains information about the various candidates. In our case, it has nine columns. A sample of the data is below:

H0AL01048::WALTER, DAVID MARSH::FOLEY::AL::36535::CON::P::10::01
H0AL02087::ROBY, MARTHA::MONTGOMERY::AL::36106::REP::C::12::02
H0AL05049::CRAMER, ROBERT E "BUD" JR::HUNTSVILLE::AL::35804::DEM::P::08::05
H0AL05155::PHILLIP, LESTER S::MADISON::AL::35758::REP::P::10::05
H0AL05163::BROOKS, MO::HUNTSVILLE::AL::35802::REP::C::12::05
H0AL05197::RABY, STEPHEN WALKER::TOREY::AL::35773::DEM::P::10::05
H0AL06088::COOKE, STANLEY KYLE::KIMBERLY::AL::35091::REP::P::10::06
H0AL06096::LAMBERT, PAUL ANTHONY::MAYLENE::AL::35114::REP::N::10::06

The code to parse the candidate file is:

new File('candidate.dat').eachLine {def line ->
  def components = line.split('::');
  def candVertex = g.addVertex(['type':'Candidate','candId':components[0], 'candName':components[2], 'candCity':components[3], 'candState':components[4],'candZip':components[5],'candParty':components[6],'candStatus':components[7],'candYear':components[8],'candDistrict':components[9]]);
  def supportedEdge = g.addEdge(g.idx(T.v)[[committeeId:components[1]]].next(), candVertex, 'supports');
}

Loading the Individual Contributors File
The individual contributors file contains all of the contributions made to different committees.

The sample data is:

C00000422::0009951::Helm, Douglas Alan MD::PERINATAL ASSOCIATES/Physician::Fresno::CA::93701::01::11::11::20::0000500::M2
C00000422::0009952::Karasek, Dennis Edward MD::SELF-EMPLOYED/Physician::San Antonio::TX::78231::01::11::11::20::0002000::M2
C00000422::0009953::Kilgore, Shannon M MD::VA PALO ALTO HCS/Physician::Palo Alto::CA::94304::01::11::11::20::0000500::M2
C00000422::0009954::Matthews, George Philip MD::VISION QUEST/Physician::Arlington::TX::76006::01::11::11::20::0000500::M2
C00000422::0009955::Kimball, Daniel B Jr. MD::N/A/Retired Physician::Reading::PA::19611::01::15::11::20::0001000::M2
C00000422::0009956::Mehling, Brian Macdermott MD::MEHLING ORTHOPAEDIC/Physician::West Islip::NY::11795::01::14::11::20::0000291::M2

Given that there are about half a million contributions, parsing and loading this data will take a couple of minutes.

new File('indiv.dat').eachLine {def line ->
  def components = line.split('::');
  def indivVertex = g.addVertex(['type':'Individual','indivId':components[1], 'indivName':components[2], 'indivOccupation':components[3],'indivCity':components[4], 'indivState':components[5],'indivZip':components[6],'transDate':components[7] + components[8] +components[9],'amount':components[11],'transactionType':components[12]]);
}

To commit any data left over in the transaction buffer, stop the current transaction with a success conclusion: g.stopTransaction(TransactionalGraph.Conclusion.SUCCESS). The data is now persisted to disk. If you plan on leaving the Gremlin console, be sure to g.shutdown() the graph first.


Validating the Data

gremlin> g.V.count()
gremlin> g.E.count()
gremlin> g.V[[type:'Committee']].count()
gremlin> g.V[[type:'Candidate']].count()
gremlin> g.V[[type:'Individual']].count()

Let’s look at some distributions
What is the distribution of contributions among states?

gremlin> m=[:]
gremlin> g.V[[type:'Individual']].indivState.groupCount(m) >> -1
gremlin> m.sort{a,b -> b.value <=> a.value}

What about the average contribution?

gremlin> g.V[[type:'Individual']].amount.mean()

Are there any treasurers supporting multiple committees?

gremlin> m=[:]
gremlin> g.V[[type:'Committee']].treasurer.groupCount(m) >> -1
gremlin> m.sort{a,b -> b.value <=> a.value}[0..19]
==>No Chair=1716

“No Chair” and “No Treasurer” indicate that the treasurer value was empty. However, there are several treasurers supporting multiple committees.

Next Steps
The next steps will be to look at some of the relationships between contributors and committees and see if there are treasurers serving on multiple committees.

Additionally, because each contribution is counted individually, there are several duplicate donors/campaign contributors. In order to address that, I will separate out the donors and their address as a separate table and link them to the contributions.
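As a rough sketch of that planned de-duplication step (with field positions assumed from the sample contribution lines above), donors could be keyed on a normalized name plus address, so each donor becomes a single record linked to many contributions:

```python
# Sketch of de-duplicating donors: collapse contribution rows into unique
# donors keyed on normalized name + city + state + zip. Field positions
# are assumed from the sample '::'-delimited lines shown earlier.

def donor_key(name, city, state, zipcode):
    """Normalize the identifying fields into a single dedup key."""
    return '|'.join(part.strip().upper() for part in (name, city, state, zipcode))

def unique_donors(rows):
    """Map each dedup key to the first donor name seen with that key."""
    donors = {}
    for row in rows:
        fields = row.split('::')
        key = donor_key(fields[2], fields[4], fields[5], fields[6])
        donors.setdefault(key, fields[2])
    return donors
```

In the graph, each key would become one donor vertex with a "contributed" edge per original row.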

If you have questions about this post, feel free to email me.

i2 Report File – Palantir Plugin (Update)

Since the initial posting, I’ve made some updates to the Palantir import helper. It now allows a user to select the report file and then import the file.

Once the user clicks on Import, a list of i2 types are presented to the user (both links and entities). The user can map each of the i2 types to Palantir objects or links.

Finally, a summary of the number of entities and links processed is presented to the user, and the entities and links are added to the chart.

i2 Report File – Palantir Plugin

i2 ANB allows users to export chart information about entities, links, attributes and cards to a report. This is useful if you want to create a report containing the information in all or part of your chart. This report is created as a text file which can then be used in other applications.

i2 ANB allows users to define the items you want to include in your report using a report specification. A report specification is a series of settings that tells ANB what kind of report to create and what you want to include in it. Report specifications enable you to define the items, content and destination of your report.

For our usage, we’ve modified a default report template. For i2 ANB entities, we want access to the entity type, identity, label, description, date and the attributes. For the attributes, we print out the attribute name and attribute value, separating attributes with tabs and using ]] as a delimiter between each attribute name and its value.

For links, we print out the link type, label, link1 and link2, as well as the Ends. We’ve added the link1 and link2 values because it isn’t always possible to parse the Ends value properly.
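As a hypothetical sketch, an attribute field using the ]] delimiter described above could be parsed like this (the sample line is invented for illustration):

```python
# Sketch of parsing one 'Name]]Value' attribute field from the exported
# report, using the ]] delimiter described above. The sample input is
# invented; real reports come from the i2 ANB report specification.

def parse_attribute(field):
    """Split 'Name]]Value' into a (name, value) pair, trimming whitespace."""
    name, _, value = field.partition(']]')
    return name.strip(), value.strip()

parse_attribute("Date of Birth]]1970-01-01")
# → ("Date of Birth", "1970-01-01")
```

The import helper applies this kind of split to each tab-separated attribute before mapping the results onto Palantir properties.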

The report configuration is shown in the following two screenshots.

i2 Report Setup

i2 Report Screen Capture 2

As shown below, the Palantir import helper allows a user to select the report file and then import the file. A summary of the number of entities and links processed is presented to the user, and the entities and links are added to the chart. Users can customize the mappings between i2 entity types and Palantir entity types by modifying an XML file that is part of the Palantir helpers.

Here’s the final import into a Palantir graph.

Palantir Graph

Questions about the Palantir helper can be sent to dsfauth_at_gmail.com

Quick Links on APIs

Some quick links that have popped up over the last few days:

Your API Sucks: Why Developers Hang Up and How to Stop That. An article from Apigee about how APIs don’t need to suck for developers.

Get free admission to Strata and a chance to showcase to investors. Thanks to Pete Warden, here’s a chance for big data startups to get to Strata and in front of VCs.

Enterprise 2.0 RESTful APIs made easy with PHP FRAPI. FRAPI is a high-level API framework that puts the “rest” back into RESTful. Use it to power your web apps, mobile services, and legacy systems.

Best Practices for API Development. The founder of the Lokad API, a sales forecasting service, recently summarized some of her tips for API design.

2011 Data Conferences

A few notable conferences for 2011.

Government Big Data Forum 2011 – Big data is a challenge not only in the commercial space but also in the Federal Government. In what should be an interesting forum held in Washington, DC, panels include whether ETL still works, de-duplication of data, and sensemaking of data. – Held January 26, 2011

O’Reilly Strata Conference – Making Data Work
Big Data is here. Turning data into decisions. This will be held February 1-3, 2011 in Santa Clara, CA.

Glue Conference – As the “cloud” becomes a common platform, web applications still live in a “stovepipe” world. It’s not a question of “should we move to the cloud?” It’s a question of once some, or most, or all of our web applications live in the cloud, how do we handle the problems of scalability, security, identity, storage, integration and interoperability? What was the problem of “enterprise application integration” in the late 90s is now the Cambrian explosion of web-based applications that will demand similar levels of integration. The problem, put simply, is how to “glue” all of these apps, data, people, work-flows, and networks together. – Held May 25-26 in Broomfield, CO

Defrag Conference. November 9-10, 2011 in Broomfield, CO.

Short Links

Taking a page from Pete Warden, I’ve decided to start off with some short links. In between, I’ll mix it up with some longer posts, but the intent of the short links is to highlight interesting pages/links/sites that I’ve found over the past few days.

Government Big Data Forum 2011 – Big data is a challenge not only in the commercial space but also in the Federal Government. In what should be an interesting forum held in Washington, DC, panels include whether ETL still works, de-duplication of data, and sensemaking of data.

Social Network Visualization – A great collection of papers related to social network visualization from UC Davis. Social networks are visual in nature. Visualization techniques have been applied in social analysis since the field began. We aim to develop interactive visual analytic tools for complex social networks.

RIM In Talks to acquire Gist – As a Gist user and not a Blackberry user, I’m closely watching this news.

Data Scientists – As more and more data is made available, people are needed to make sense of it. Companies such as bit.ly, LinkedIn and Foursquare are hiring. If I were going back to school, this is a career that I would target.