DocGraph Analysis using Hadoop and D3.JS

Visualizing the DocGraph for Wyoming Medicare Providers

I have been participating in the DocGraph MedStartr project. After hearing about the project at GraphConnect 2012, I wanted to use this data to investigate additional capabilities of Hadoop and BigData processing. You can read some excellent work already being done on this data here courtesy of Janos. Ryan Weald has some great work on visualizing geographic connections between doctors here as well.

You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

One great thing about this data is that you can combine the DocGraph data with with other data sets. For example, we can combine NPPES data with the DocGraph data. The NPPES is the federal registry for NPI numbers and associated provider information. Additionally you can bring in other data such as Census data and Virginia State information.

In this example, I want to use MortarData (Hadoop in the cloud) to combine Census Data, DocGraph Data, NPEES database and the National Uniform Claim Committee (NUCC) provider taxonomy codes. The desired outcome is to compare the referrals between taxonomy codes for the entire State of Virginia and the areas of Virginia with a population of less that 25,000.

Mortar Data
Mortar is Hadoop in the cloud—an on-demand, wickedly scalable platform
for big data. Start your work in the browser—zero install. Or if you need more control, use Mortar from your own machine, in your own development environment.

Mortar is listed in GigaOM’s 12 big data tools you need to know and one of the “10 Coolest Big Data Products Of 2012

Using Hadoop and Pig, I am going to use the following approach:

1. Load up the four data sets.
2. Filter the NPI data from NPPES by the provider’s state.
3. Filter the State Data by the desired population.
4. Join both the primary and the referring doctors to the NPI/NPPES/Census data.
5. Carve out the Primary doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
6. Carve out the Referring doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
7. Carve out the primary and referring doctors, count the number of primary referrals and then link the taxonomy codes to both the primary and referring doctors.
8. Export the data out for future visualization.

Why Mortar Data and Hadoop
Using Hadoop, Pig and Mortar’s platform, I have several advantages:
1. I can store all of the data files as flat files in an Amazon S3 store. I don’t need a dedicated server.
2. I can spin up as many Hadoop clusters as I need in a short time.
3. I can write Pig code to do data processing, joins, filters, etc. that work on the data.
4. I can add in Python libraries and code to supplement the Pig.
5. I can add parameters and change the state and population on the fly.

You can see the Mortar Web interface here:
Screen shot 2013-02-19 at 10.14.07 PM

I plan on using the D3.js library to create some visualizations. One example visualization I am working on is a Hierarchical Edge Bundling chart. You can see the initial prototype here. I still need to fill in all of the links.

Campaign Data Analysis Video

As a wrap-up on the Campaign Analysis that I presented at GraphConnect, I decided to make a video showing the usage of Mortar Data, Neo4J and D3 JS.

Mortar Data
Mortar is Hadoop in the cloud—an on-demand, wickedly scalable platform
for big data. Start your work in the browser—zero install. Or if you need more control, use Mortar from your own machine, in your own development environment.

Mortar is listed in GigaOM’s 12 big data tools you need to know and one of the “10 Coolest Big Data Products Of 2012

Neo4j is an open-source, high-performance, NOSQL graph database, optimized for superfast graph traversals. With Neo4J, you can easily model and change the full complexity of your system.

Neo4J was listed as a “big data vendor to watch in 2013” by Infoworld.

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.

GraphConnect – Graphies

GraphConnect was held two weeks ago (Nov 5-6) in San Francisco. GraphConnect was put together by Neo4J as an initial conference on graph databases.

Part of the conference was a contest on building innovative graphs. The “Graphies” recognized and celebrated individuals and teams that were developing innovative and impressive graph database applications. Additionally, there was a twitter contest called #ShowMeYourGraph. The contest was to tap into your creative side and show us your most creative graph using any online format: graph visualization, photo, video, blog, etc.

Two of my submissions were SuperPac contributions and SuperPac contributions with labels. The rest of this post will talk about how I created the two charts.

SuperPac Contributions
The Sunlight Foundation provided the SuperPac expenditure data for the 2012 campaign. The data was in a csv format.

For this example, I wanted to look at contributions from September 1 through October 2012 as well as look only at the Presidential candidates. This would show which SuperPacs were funding or opposing candidates. Using some crazy Excel skills, I was able to identify the unique SuperPacs, their support or opposition, and the candidate they were supporting or opposing.

Visualization using D3.js

After finding a good example of a force node graph in D3.js, I was able to create the JSON file for the data. In order to create the different colors, I needed to figure out different groupings. The groups for Barack Obama and Mitt Romney were unique groups. Additional groups were created for SuperPacs that supported a candidate, opposed a candidate, supported one candidate while opposing another candidate (i.e. Support Obama and Oppose Romney and vice versa) and those that did all three.

In this instance, Mitt Romney is purple and Barack Obama is the peach node. The red nodes supported Mitt Romney and the blue opposed. The orange nodes opposed Barack Obama and the light blue supported Barack Obama. The green supported Romney and opposed Obama while the light orange supported Obama and opposed Romney.

With labels, the graph gets a lot more messy. You can see it here:

The initial graph was chosen as the winner of the Most Innovative Educational Graph Application as well as a winning entry in the #ShowMeYourGraph contest. Thanks to Neo4J for the conference and for the competition.

Graphing Congressional Sponsorship of Bills – Part 2

This is a follow-up to the previous post.

Note:The data and java code is available at Github.

The Data
I decided to use six years of data (Congress sessions 107-112). Since there is significant turnover in Congress (re-election, sex scandals, resignation, and death among other reasons), I had to expand the number of congressional members to 928. In addition, over the six years of bills, there are now 5,870 unique topics or subjects of the bills (i.e. Arts, Boy Scouts, Clergy, Welfare, Whaling to mention a few).

Using the BatchInserter framework from Max De Marzi and Michael Hunger, the data can be quickly loaded into Neo4J (less than 5 minutes). There are 340,619 nodes, 846,697 properties, and 2,514,565 relationships.

Data Exploration
We’ll start out simple using Cypher queries and build from there. Let’s see how many bills discuss Whaling.

START n=node:subjects('subject:Whaling')
RETURN c.Session, count(c.Session)

Our output:

"110"	5
"109"	7
"108"	11
"107"	13

To see who is sponsoring or cosponsoring the most bills related to “whaling”, we run the following query:

START n=node:subjects('subject:Whaling')
RETURN z.firstname, z.lastname, count(z) As popCoSponsors
ORDER BY popCoSponsors DESC

Our output is:

z.firstname  z.lastname  popCoSponsors
"Joseph"     "Lieberman"  9
"John"	     "Kerry"	  8
"Susan"	     "Collins"	  8
"Edward"     "Kennedy"	  8
"Barbara"    "Boxer"	  8
"Ron"        "Wyden"	  7
"Daniel"     "Akaka"	  7
"Christopher" "Dodd"	  6
"Olympia"     "Snowe"	  6
"John"	      "Reed"	  6
"Carl"	      "Levin"	  6
"Russell"     "Feingold"  6
"Wayne"	      "Gilchrest" 6
"John"	      "McCain"	  6
"Joseph"      "Biden"	  5

If I want to see who usually co-sponsors bills sponsored by Olympia Snowe, I would run this query:

START n=node:subjects('subject:Whaling'), z=node:congress('lastName:Snowe')
RETURN x.firstname, x.lastname, count(x) As popCoSponsors
ORDER BY popCoSponsors DESC

The output for this is:

"Ron"	        "Wyden"	        4
"Christopher"	"Dodd"	        4
"John"	        "Reed"	        4
"Daniel"	"Akaka"	        4
"Susan"	        "Collins"	4
"Joseph"	"Biden"	        4
"Edward"	"Kennedy"	4
"Barbara"	"Boxer"	        4
"John"	        "McCain"	4
"Joseph"	"Lieberman"	4

Paths between Congressmen
To find the shortest path between two congressmen, we can run this query:

START ryanvp = node(6233), obama = node(6323)
MATCH p=shortestPath(ryanvp-[*..10]-obama) return p;

This returns the following path:

Obama -> Bill SR 97 110 Session -> Topic of Congressional Tributes -> HR716 112 Session -> Paul Ryan

Graphing Congressional Sponsorship of Bills


In preparation for my talk at GraphConnect, I’ve added onto the talk by graphing out congressional sponsorship of bills using Neo4J. This post will specifically talk about data acquisition, data preparation, data loading into Neo4J and simple cypher queries against the data. Follow-on posts will look at processing this same data in Hadoop, processing this data in data-flows with Pig and Python to expose new and interesting properties, and visually graphing the data using d3.js.

The Data:

In this project we will download a spreadsheet of all members of congress as well as all bills/resolutions from the 112th session of congress. has made an xml file of everyone that has ever served in Congress, and U.S. presidents, with their party affiliation, terms in Congress, birthdays, etc. Additionally, they have provided all bills and resolutions for the current Congress and similarly named directories for other Congresses.

GovTrack has provided easy to use instructions on obtaining their data. To download GovTrack’s raw data files in bulk, or if you plan to regularly update the files, you should use the rsync tool. Rsync is good for selecting which directory of data you want and keeping your files up to date by only downloading changes on each update.

rsync -avz --delete --delete-excluded .

Complete code for this example will be made available on github.

Data Processing:

Using Excel, you can load in the Congress xml file and it will be transformed into a spreadsheet. By filtering on the active field, you can quickly create a spreadsheet of 535 congressmen. This file is then saved out to a tab-delimited format and then using TextWrangler converted the tabs to pipe delimiters.

Each of the bills is downloaded as an xml file. A sample is shown below:


 A resolution expressing the sense of the Senate that the primary safeguard for the well-being and protection of children is the family, and that the primary safeguards for the legal rights of children in the United States are the Constitutions of the United States and the several States, and that, because the use of international treaties to govern policy in the United States on families and children is contrary to principles of self-government and federalism, and that, because the United Nations Convention on the Rights of the Child undermines traditional principles of law in the United States regarding parents and children, the President should not transmit the Convention to the Senate for its advice and consent.

 Referred to the Committee on Foreign Relations.
<summary> 3/10/2011--Introduced.
Expresses the sense of the Senate that: (1) the United Nations Convention on the Rights of the Child is incompatible with the Constitution, the laws, and the traditions of the United States; (2) the Convention would undermine presumptions of freedom and independence for U.S. families; (3) the Convention would interfere with U.S. principles of sovereignty, independence, and self-government that preclude the necessity or propriety of adopting international law to govern domestic matters; and (4) the President should not transmit the Convention to the Senate for its advice and consent.

Using a quickly written java application, I was able to create several delimited files that I would need to support data loading into Neo4J. Each bill has a sponsor, may have 1 or more co-sponsors, has one or more topic or subject that it describes and has activity as it moves through congress. You can see that modeled below:

Data Loading into Neo4J
Using the BatchInserter framework from Max De Marzi and Michael Hunger, the data can be quickly loaded into Neo4J (less than 5 minutes). There are 50,889 nodes, 121,049 properties, and 266,197 relationships.

Quick Analysis

Using Cypher, you can quickly see who sponsored the most bills:

START n=node:congress('lastName:*')
RETURN n.lastname, n.firstname, count(*)
ORDER BY count(*) desc LIMIT 5;

Casey, Robert 228
Menéndez, Robert 207
Schumer, Charles 146
Brown, Sherrod 118
Hagan, Kay 116

And those that COSPONSOR:

START n=node:congress('lastName:*')
RETURN n.lastname, n.firstname, count(*)
ORDER BY count(*) desc LIMIT 5;
Grijalva, Raul 752
Filner, Bob 704
Rangel, Charles 687
Norton, Eleanor 681
Lee, Barbara 605

We’ve seen how we can quickly transform data in XML files into Neo4J and begin to look at the data. The next post will look at further analyzing of the data.

Analysis of Cash Pouring into State Campaigns

In today’s Star Tribune, there was an article discussing the amount of cash that is flowing from outside of the state of Minnesota into various candidate’s campaign war chests. The authors used FEC individual campaign contribution data to perform the analysis described in the article.

Using a similar approach outlined earlier, I decided to see how easy it would be to use Datameer to perform similar analysis. In this case, I filtered out candidates so only Minnesota candidates would show up as shown below.

After joining this set of records to campaign contributions, I then filtered on individuals who listed Minnesota as their address and also filtered on non-Minnesota contributors. After grouping the sums by candidate and joining the sheets, I had a worksheet that I could then use to graph results. That sheet is below:

Using the Infographic capability, I created a quick bar graph from these results.

I have to look at the data set again as the numbers don’t quite match to the Star Tribune article. However, it shows that tools like Datameer 2.0 can be used for multi-purpose data analysis and tasks that we would normally think only a data scientist would be capable of performing.

Graph Connect 2012

Inaugural Graph Database Conference

The NOSQL movement has taken the world by storm, bringing a new coherency and meaning to connected data, and giving developers and technical leads the power to manage modern data at a new speed and size.

GraphConnect 2012 is the first conference to focus on graph databases and the applications making sense of connected data. Graph thought leaders from many disciplines and organizations will come to explore new ideas, share innovations in graph technology, and make connections with researchers and developers from around the globe.

Early registration is available at The call for papers expires at the end of July.

Register for GraphConnect 2012 (size 728 X 90)

Datameer 2.0 – Analysis of Campaign Contributions

In the previous post, we looked at using Datameer 2.0 to analyze some Federal Election Commission data. In this post, I will show some of the analysis performed on the data specifically using the table join, filter, sort, a couple of functions and then a couple of infographics.

Problem Statement
The problem statement for this exercise is to determine the top 15 Virginia candidates in terms of total campaign contributions and then show the source of the campaign contributions (in state vs. out-of-state).

As mentioned previously, I created multiple workbooks for each of the data sets (committees, candidates and individual contributors). In the candidate data, I used the filter feature and created a worksheet of only Virginia candidates.

I created a new worksheet in order to join the Virginia candidates with the committees that are supporting them. I imported the filtered candidate data as well as the committee data. Using Datameer’s join capability, I was able to easily join the two data sets in order to create a new worksheet. The join feature allows the user to select the columns to join and also select which columns to include in the joined result. If you forget what type of join you want to use, a link to a wiki describing the joins is available.

One of the impressive things about Datameer is that it shows you the sheet dependencies or where the data came from. For Virginia Contributions, I joined the worksheet of candidates/committees with the individual campaign contributions. I then created new filtered worksheets by first filtering on contributors with an address listing VA as their state and a second worksheet for all other states. An example of the dependencies is below:

Once that worksheet was created, I created a new worksheet and then used the Datameer groupby and groupbysum functions to total up the contributions to each candidate.
After that, another join of the last two worksheets followed by a filtering of the Top 15 by total contribution left me with this worksheet.

Infographics or Cool Charts
Once the data was available, it was time to create a bar chart of the data. Datameer’s extensive library of widgets includes tables, graphs, charts, diagrams, maps, and tag clouds which enables users to create simple dashboards or stunning business infographics and visualizations. Datameer’s WYSIWYG Editor speeds creation of compelling and insightful business infographics. The Editor includes a graphics Inspector, a simple but powerful tool for configuring graphic and text elements including colors, fonts, etc. The Editor provides the realtime view of all graphic and text elements as they are created and edited so that the final visualizations are rendered exactly as intended.

For this case, I chose a rather simple bar chart. Once placed on the WYSIWYG editor, you simply drag the fields from the list of worksheets and the chart takes shape. There are widget settings for the each type of display allowing you to customize the look and layout of the final product. You can see the stacked bar chart below:

In this graph, we are looking at what the breakdown is of contributions from Virginia addresses versus non-Virginia addresses. In the graph, the total amount is Blue. The amount from Virginia is Orange, and the amount outside of Virginia is Green. You can see that Cantor and Kaine have received a majority of campaign contributions from outside of Virginia.

Geographical Analysis
One of the last charts I used was a map. Data with latitude and longitude coordinates are automatically plotted on the map. In this case, I filtered out the contribution for one Virginia candidate (Tim Kaine) and geocoded a portion of the data. That information was then loaded into a worksheet and plotted on a map infographic shown below.

The data did not include any Virginia contributions but at a glance you can see a large number of contributions to the Kaine campaign from the southern US.

Datameer 2.0 opens up data analytics to more people making it relatively easy. I didn’t need to stand up a Hadoop cluster in order to get started. With a little bit of knowledge and asking some questions, I was able to easily drill down into the data for answers.

Datameer 2.0 – Initial Impressions

Today was the first day I had a chance to use Datameer 2.0 on some actual data. I’ve decided to use some of the FEC campaign contribution data and see how Datameer works with the data.

The Federal Election Commission has made campaign contribution data publicly available for download here. The FEC has provided campaign finance maps on its home page. The Sunlight Foundation has created the Influence Explorer to provide similar analysis. For the most recent dataset, there are 5,257 candidates, 12,588 committees and 1,505,580 individual campaign contributions.

The data I am working with consists of a list of candidates, committees supporting the candidates and individual campaign contributions. Each of these files can be downloaded from the FEC website listed above. The FEC is changing the layout of the files from fixed length fields to pipe delimited. You can read more about that here. Since the candidates and committee fields weren’t yet converted, I loaded those files into Oracle and wrote a quick PHP script to output them in delimited format.

Data Access
One of the pluses about Datameer 2.0 is that it can access data in a variety of formats and locations. File Link TypesTo test out these capabilities, I uploaded the three data files to my Amazon S3 instance. Connecting to the S3 instance was a simple process. I selected S3 as the data store and then filled in the connection details shown below. S3 Data Store

After saving the data store, you can see the data store in the Datameer Browser tab. I simply called mine DataStore.

Once the DataStore was set up, I created a new data link to each of the three files that I had stored in my S3 instance. The data link steps the users through selecting the data store and the file type (log file, delimited file, fixed width, twitter data, and XML are some of the file types). It then asks you for the file name and whether the file has column header information in the first row.

One of the great features of Datameer 2.0 is that it is able to make smart choices about the data fields and types based on the data. The user can rename column names and modify data types before the data is loaded. This is a real easy interface to use.

Once all three data links were set up, you could see them in the browser tab.

Data Analysis
Using the familiarity of the spreadsheet, Datameer 2.0 provides the user the capability to do powerful analysis of the data. Even the most complex nested joins of a large number of datasets can be performed using an interactive dialog. Mix and match analytics and data transformations in unlimited number of data processing pipelines leaving the raw data untouched. The image below is a sample of the candidates file.

In the next post, we’ll look at doing some of the joins between the data and begin looking at some of the built-in graphical presentation capabilities of Datameer 2.0.