Hadoop to Neo4J

Leading up to GraphConnect NY, I was distracting myself from working on my talk by trying to determine whether there was any way to import data directly from Hadoop into a graph database, specifically Neo4J. Previously, I had written some Pig jobs to output the data into various files and then used the Neo4J batchinserter to load the data. This process works great, and others have written about it: for example, this approach also uses the batchinserter, while this approach uses some Java UDFs to write the Neo4J files directly.

Both of these approaches work great, but I was wondering if I could use a Python UDF and create the Neo4J database directly. To test this out, I decided to resurrect some work I had done on the congressional bill data from GovTrack. You can read about the data and the Java code I used to convert the files into single-line JSON files here. It’s also a good time to read up on how to create an Elasticsearch index using Hadoop. Now that you’re back from reading that link, let’s look at an approach for going from Hadoop directly into Neo4J. From the previous article, you’ll remember that Mortar recently worked with Pig and CPython to get streaming CPython support committed into the Apache Pig trunk. This now allows you to take advantage of Hadoop with real Python. Users get to focus just on the logic they need, and streaming Python takes care of all the plumbing.

Nigel Small had written Py2Neo, a simple and pragmatic Python library that provides access to the popular graph database Neo4J via its RESTful web service interface. That sounded awesome and worth trying out. Py2Neo is easy to install using pip or easy_install; installation instructions are located here.

The model that I was trying to create looks something like this:

Bills Layout

The approach taken was to use Pig with a streaming Python UDF that writes to the Neo4J database through its RESTful web service. I tested this out with Neo4J 2.0 M6. I attempted to use Neo4J 2.0 RC1 but ran into several errors relating to missing nodes. The example code is below:

-- run the Counter UDF to create a Node ID
keyNodeList = FOREACH c GENERATE keyValue, utility_udfs.auto_increment_id() AS my_id:int, nodeType;

Since Neo4J uses an incrementing counter for each node, we have to create an id for each keyValue (node name) that we are creating. The keyValues are the congressional session, the name of the congresswoman or congressman, the bill ID, or the subject. Below is the simple Python UDF that creates that ID.

from pig_util import outputSchema

COUNT = 0

@outputSchema('auto_increment_id:int')
def auto_increment_id():
    global COUNT
    COUNT += 1
    return COUNT

Once we have the id, we can use Py2Neo to create the nodes and relationships.

from pig_util import outputSchema

from py2neo import neo4j
from py2neo import node, rel

@outputSchema('nodeCreated:int')
def createNode(nodeValue, sLabel):
    if nodeValue:
        # Bind to the Neo4J REST endpoint (a new binding on every call)
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        # Create the node, with its name and label values, in a single batch request
        batch = neo4j.WriteBatch(graph_db)
        batch.create(node(name=nodeValue, label=sLabel))
        batch.submit()
        return 1
    else:
        return 0

@outputSchema('nodeCreated:int')
def createRelationship(fromNode, toNode, sRelationship):
    if fromNode:
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        # Look up the two nodes by id and create the typed relationship between them
        ref_node = graph_db.node(fromNode)
        to_node = graph_db.node(toNode)
        graph_db.create(rel(ref_node, sRelationship, to_node))
        return 1
    else:
        return 0

#myudf.py
@outputSchema('nodeCreated:int')
def createBillNode(nodeValue, sLabel, sTitle, sUpdated, sBillType, sBillNumber, sIntroducedAt, sStatus, sStatusAt):
    if nodeValue:
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        # Create the bill node, add its label, and set its properties
        foundNode, = graph_db.create(node(name=nodeValue))
        foundNode.add_labels(sLabel)
        foundNode["title"] = sTitle
        foundNode["updateDate"] = sUpdated
        foundNode["billType"] = sBillType
        foundNode["billNumber"] = sBillNumber
        foundNode["introducedAt"] = sIntroducedAt
        foundNode["status"] = sStatus
        foundNode["statusDate"] = sStatusAt
        return 1
    else:
        return 0

#myudf.py
@outputSchema('nodeUpdated:int')
def updateBillNode(nodeID, sTitle, sUpdated, sBillType, sBillNumber, sIntroducedAt, sStatus, sStatusAt):
    if nodeID:
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        # Look up an existing bill node by its id and update its properties
        foundNode = graph_db.node(nodeID)
        foundNode["title"] = sTitle
        foundNode["updateDate"] = sUpdated
        foundNode["billType"] = sBillType
        foundNode["billNumber"] = sBillNumber
        foundNode["introducedAt"] = sIntroducedAt
        foundNode["status"] = sStatus
        foundNode["statusDate"] = sStatusAt
        return 1
    else:
        return 0

Of note is the ability to create the node and add the label in the createNode function. To create a relationship, we pass in the two node ids and the relationship type; this is sent via the REST API and the relationship is created.

Performance – Performance wasn’t what I thought it would be. Py2Neo interacts with Neo4J via its REST API interface, so every interaction requires a separate HTTP request to be sent. This, along with logging, made the load much slower than I anticipated. Overall, it took about 40 minutes on my MacBook Pro with 16GB of RAM and an SSD to create the Neo4J database.

Py2Neo Batches – Batches allow multiple requests to be grouped and sent together, cutting down on network traffic and latency. Such requests also have the advantage of being executed within a single transaction. For the second run, I added some Py2Neo batches. This really didn’t make a huge difference, as the log files were still being written.
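As a rough sketch of how several creates can be grouped into a single Py2Neo WriteBatch (and therefore a single HTTP request); the key values below are placeholders:

from py2neo import neo4j
from py2neo import node

graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")

# Queue several node creations and submit them as one REST call
batch = neo4j.WriteBatch(graph_db)
for keyValue in ["113", "hr1234-113", "Health"]:  # placeholder keyValues
    batch.create(node(name=keyValue))
createdNodes = batch.submit()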

Overall, it still took about 60 minutes on my MacBook Pro with 16GB of RAM and an SSD to create the Neo4J database.

Next Steps
Hmmm… I should have known that the RESTful service performance wasn’t going to be anywhere near as fast as the batchinserter performance due to logging. You could see the log files grow and grow as the data was added. I’m going to go back to the drawing board and see if a Java UDF could work better. The worst case is that I just go back to writing out files and writing a custom batchinserter each time.

Creating an Elasticsearch index of Congress Bills using Pig


Recently, Mortar worked with Pig and CPython to get streaming CPython support committed into the Apache Pig trunk. This now allows you to take advantage of Hadoop with real Python. Users get to focus just on the logic they need, and streaming Python takes care of all the plumbing.

Shortly thereafter, Elasticsearch announced integration with Hadoop: “Using Elasticsearch in Hadoop has never been easier. Thanks to the deep API integration, interacting with Elasticsearch is similar to that of HDFS resources. And since Hadoop is more than just vanilla Map/Reduce, in elasticsearch-hadoop one will find support for Apache Hive, Apache Pig and Cascading in addition to plain Map/Reduce.”

Elasticsearch published the first milestone (1.3.0.M1) based on the new code-base that has been in the works for the last few months.

The initial attempt at testing out Mortar and Elasticsearch didn’t work. Working with the great team at Mortar and costinl at Elasticsearch, the Mortar team was able to update their platform to allow writing out to Elasticsearch at scale.

Test Case
To test this out, I decided to process congressional bill data from the past several congresses. The process is to read in the JSON files, process them using Pig, use NLTK to find the top 5 bigrams, and then write the data out to an Elasticsearch index.

The Data
GovTrack.us, a tool by Civic Impulse, LLC, is one of the world’s most visited government transparency websites. The site helps ordinary citizens find and track bills in the U.S. Congress and understand their representatives’ legislative record.

The bulk data is a deep directory structure of flat XML and JSON files. The directory layout is described below.

Our files are in three main directories:

Getting the Data

To fetch the data we support rsync, a common Unix/Mac tool for efficiently fetching files and keeping them updated as they change. The root of our rsync tree is govtrack.us::govtrackdata, and this corresponds exactly to what you see at http://www.govtrack.us/data/.

To download bill data for the 113th Congress into a local directory named bills, run:

rsync -avz --delete --delete-excluded --exclude **/text-versions/ \
		govtrack.us::govtrackdata/congress/113/bills .

(Note the double colons in the middle and the period at the end. This is a long command. I’ve indicated the line continuation with a backslash.)

Directories

The conversion step loops through a directory of bills and converts all of the .json files into single-line .json files.
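The original conversion code isn’t reproduced here; a minimal Python sketch of the idea, with hypothetical directory names, might look like this:

import json
import os

# Rewrite each pretty-printed GovTrack .json bill file as a single-line .json file.
# The directory names are hypothetical stand-ins; output is flattened into one directory.
src_dir = "bills"
out_dir = "bills_single_line"

if not os.path.exists(out_dir):
    os.makedirs(out_dir)

for root, dirs, files in os.walk(src_dir):
    for name in files:
        if name.endswith(".json"):
            with open(os.path.join(root, name)) as infile:
                bill = json.load(infile)
            with open(os.path.join(out_dir, name), "w") as outfile:
                # dumping without indentation writes the whole bill on one line
                outfile.write(json.dumps(bill) + "\n")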

The Pig script then reads all of the single-line .json files, pulls out some of the fields, calls a Python UDF to find the top 5 bigrams, and writes the data into an Elasticsearch index.

The important steps are to:
a) register the elasticsearch-hadoop jar file
b) define the storage for the Elasticsearch index
c) write out the data using the defined storage

If you are using the mortar framework, nltk isn’t installed by default. Here’s how you can install it:

# From your project's root directory - Switch to the mortar local virtualenv
source .mortar-local/pythonenv/bin/activate

#Install nltk (http://nltk.org/install.html)
sudo pip install -U pyyaml nltk

For the bi-grams, I re-used some sample Mortar code from Doug Daniels.
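That sample isn’t reproduced here; a minimal sketch of a streaming Python UDF that returns the top 5 bigrams of a bill’s text using NLTK’s collocation finder might look like this (the function name and the semicolon-delimited output format are my own assumptions):

from pig_util import outputSchema
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

@outputSchema('top_bigrams:chararray')
def top_5_bigrams(text):
    # Return the top 5 bigrams in the text as a semicolon-delimited string
    if not text:
        return None
    # simple whitespace tokenization avoids extra NLTK data downloads
    tokens = text.lower().split()
    finder = BigramCollocationFinder.from_words(tokens)
    best = finder.nbest(bigram_measures.pmi, 5)
    return ';'.join(' '.join(pair) for pair in best)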

Results
The Pig job loaded 58,624 files, processed them, and created the Elasticsearch index in 53 seconds. The NLTK Python UDF finished in another 34 seconds, resulting in a total time of 87 seconds.


You can see the working Elasticsearch index in the following screenshot:

[Screenshot: Elasticsearch index]

One thing of note
The elasticsearch-hadoop connector doesn’t handle geo-coordinates quite yet, so you can’t create an index with latitude/longitude. That support should be coming soon.

Health Insurance Marketplace Costs

Data.Healthcare.Gov released QHP cost information for various health care plans for states in the Federally-Facilitated and State-Partnership Marketplaces. The data is available in a variety of formats and lays out costs for the various levels of health care plans (Gold, Silver, Bronze and Catastrophic) for different rating scenarios.

Premium Information
Premium amounts do not include tax credits that will lower premiums for the majority of those applying, specifically those with income up to 400 percent of the federal poverty level. The document shows premiums for the following example rating scenarios:

  • Adult Individual Age 27 = one adult age 27
  • Adult Individual Age 50 = one adult age 50
  • Family = two adults age 30, two children
  • Single Parent Family = one adult age 30, two children
  • Couple = two adults age 40, no children
  • Child = one child any age

Cost Comparisons
Looking at the information, I wanted to do some comparisons across the various plans and rating scenarios to see where the highest costs were, which states had the largest variance, and what the standard deviation looked like across states and plans.

While I could have run this in Excel or R, I decided to write a simple Pig job to determine the maximum, minimum and average costs by plan for each state. I also then calculated the variance and standard deviations.
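The Pig script itself isn’t shown here, but as a rough illustration of the same aggregation logic in Python (not the original Pig; the file and column names are hypothetical):

import csv
from collections import defaultdict
from statistics import mean, pvariance, pstdev

# Group premium costs by (state, plan level)
costs = defaultdict(list)
with open("qhp_costs.csv") as f:
    for row in csv.DictReader(f):
        costs[(row["State"], row["Plan Level"])].append(float(row["Premium"]))

for (state, plan), values in sorted(costs.items()):
    print(state, plan,
          "max=%.2f" % max(values),
          "min=%.2f" % min(values),
          "avg=%.2f" % mean(values),
          "var=%.2f" % pvariance(values),
          "stddev=%.2f" % pstdev(values))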

Initial Cost Analysis
There is a wide range in costs across the states, with Virginia consistently having the highest average plan costs. Looking at the catastrophic plan costs, Virginia plans are five times (5x) more expensive than those in Kansas or Alabama.

Catastrophic Plan Costs

For Gold plans, insurance in Virginia is again two to three times (2-3x) more expensive.
Gold Plan costs

Variance and Standard Deviation
It comes as no surprise that Virginia has the largest variance and standard deviation for the cost data by a large margin. Virginia’s variance on the Gold plans is 2742 times that of Alabama. New Hampshire, Alaska, Delaware and Utah all have small variances and are consistent across the rating scenarios.

Gold Plan Variance

Again, Virginia’s variance on the Bronze plans is way out of balance compared to other states.
Bronze Plan Variance

However, for Platinum plans, Virginia has the ninth smallest variation across all rating scenarios. New Jersey, Michigan and Wisconsin have the largest variations.
Platinum Plan Variance

Code and Data
The code and data are on GitHub. If you have questions, you can reach me at dsfauth at gmail dot com.

Part 2 – Building an Enhanced DocGraph Dataset using Mortar (Hadoop) and Neo4J

In the last post, I talked about creating the enhanced DocGraph dataset using Mortar and Neo4J. Our data model looks like the following:

Nodes
Organizations
Specialties
Providers
Locations
CountiesZip
Census

Relationships
* Organizations -[:PARENT_OF]- Providers -[:SPECIALTY]- Specialties
* Providers -[:LOCATED_IN]- Locations
* Providers -[:REFERRED]- Providers
* Counties -[:INCOME_IN]- CountiesZip
* Locations -[:LOCATED_IN]- Locations

Each of the nodes has several properties associated with it. For example, an Organization has a name, and a Location has a city, state and postal code.

Data
The data we are going to use is the initial DocGraph set, the Health Care Provider Taxonomy Code (NUCC) set located here, the National Plan and Provider Enumeration System (NPPES) Downloadable File here, a zip-code-to-state file, and income per zip code downloaded from the US Census. These files were loaded to an Amazon S3 bucket for processing.

Mortar Project
To build the Neo4J graph database, we will need to generate several files to be loaded into Neo4J. To generate those files, we are going to create a Mortar Project and use the Pig script that we created in the last post.

Create Mortar Project
In order to fully leverage the Mortar Project framework, I created a Mortar project, which makes it available in GitHub. This creates a new project skeleton and registers it with Mortar. The project has folders for commonly used items, such as pigscripts, macros, and UDFs.

cd mortar-examples
mortar projects:create docGraphNeo4J

Pig Code
Any Pig code that you want to run with Mortar should be put in the pigscripts directory in your project. I replaced the example pigscript in that directory called my-sample-project.pig with my docGraphNeo4J.pig script.

Illustrate
Illustrate is the best tool to check what you’ve written so far. Illustrate will check your Pig syntax, and then show a small subset of data flowing through each alias in your pigscript.

To get the fastest results, use the local:illustrate command.

mortar local:illustrate pigscripts/my-sample-project.pig

Once the illustrate result is ready, a web browser tab will open to show the results:

[Screenshot: Illustrate results]

Mortar Watchtower
Mortar Watchtower is the fastest way to develop with Pig. Rather than waiting for a local or remote Pig run, you can validate that your scripts work simply by saving. Watchtower sits in the background analyzing your script, showing you your data flowing through the script instantly.

After installing Mortar Watchtower, I was able to do near realtime analysis of the data simply by typing in:

mortar watch ./pigscripts/docGraphNeo4J2.pig

Once I type that into my console window, I see:
[Screenshot: Watchtower console output]

A browser window then pops up:
[Screenshot: Watchtower viewer in the browser]

As you can see, the Watchtower viewer redisplays your script with example data embedded inline with each alias. You can click on the header of an inline table to toggle between different numbers of example rows. You can also click on any table cell to see the complete data, including anything that was truncated.

Full Run on Mortar
Once the code was ready, it was time to run it on a full Hadoop cluster. To specify the cluster size for your run, use the --clustersize option:

$ mortar jobs:run pigscripts/docGraphNeo4J.pig --clustersize 4

When I ran this job on the full Hadoop cluster, it completed in about 16 minutes. It wrote the following records to my Amazon S3 buckets:

Input(s):
Successfully read 3998551 records from: "s3n://NPIData/npidata_20050523-20130512.csv"
Successfully read 830 records from: "s3n://NUCC-Taxonomy/nucc_taxonomy_130.txt"
Successfully read 49685587 records from: "s3n://medgraph/refer.2011.csv"

Output(s):
Successfully stored 3998551 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/providerList"
Successfully stored 3998551 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/locations"
Successfully stored 77896 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/parentOfLink"
Successfully stored 33212 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/addrList"
Successfully stored 830 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/specialties"
Successfully stored 4746915 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/specialtiesProviders"
Successfully stored 33212 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/uniquelocations"
Successfully stored 694221 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/organizations"
Successfully stored 49685587 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/docGraphProviders"
Successfully stored 1826823 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/uniqueDoctorList"

Summary
In summary, I was able to take the three raw data files, write a Pig script to process the data, run the Pig job on a Hadoop cluster, and create the multiple files that I will need to populate the Neo4J instance.

Why did I choose Mortar?
Mortar is fast, open and free. As Mortar says, using a Mortar project provides you with the following advantages:

* Pig and Hadoop on Your Computer: When you create a Mortar Project, you get a local installation of Pig and Hadoop ready to use, without needing to install anything yourself. That means faster development, and better testing.
* Version Control and Code Sharing: Mortar Projects are backed by source control, either through Mortar or your own system, so you can collaborate with team members on a project.
* 1-Button Deployment: When you’re ready to run your project on a Hadoop cluster, a single command is all that’s needed to deploy and run in the cloud.

Using Mortar’s Watchtower, I was able to get an instant sampling of my data, complete file watching, instant schema validation and instant error catching.

For me, Mortar was easy, fast and a great tool to get the data ready for loading into Neo4J.

Next Steps
In the next post, I’ll write about how to move the data from the data files and load them into Neo4J.

Building an Enhanced DocGraph Dataset using Mortar (Hadoop) and Neo4J

“The average doctor has likely never heard of Fred Trotter, but he has some provocative ideas about using physician data to change how healthcare gets delivered.” This was from a recent Gigaom article. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

You can read some excellent work already being done on this data here courtesy of Janos. Ryan Weald has some great work on visualizing geographic connections between doctors here as well.

The current DocGraph social graph was built in Neo4J. With the new enhancements in Neo4J 2.0 (primarily labels), it was a good time to rebuild the social graph and add in data about each doctor, their specialties and their locations. Finally, I’ve added in some census income data at the zip code level. Researchers could look at economic indicators to see if there are discernible economic patterns in the referrals.

In this series of blog posts, I will walk through the process of building the updated DocGraph in Neo4J using Hadoop, followed by the Neo4J batch inserter.

Building the import documents.

One of the goals of the project was to learn Pig in combination with Hadoop to process the large files. I could easily have worked in MySQL or Oracle, but I also wanted an easy way to run jobs on large data sets.

My friends at Mortar have a great platform for leveraging Hadoop, Pig and Python. Mortar is the fastest and easiest way to work with Pig and Python on Hadoop. Mortar’s platform is for everything from joining and cleansing large data sets to machine learning and building recommender systems.
Mortar makes it easy for developers and data scientists to do powerful work with Hadoop. The main advantages of Mortar are:

  • Zero Setup Time: Mortar takes only minutes to set up (or no time at all on the web), and you can start running Pig jobs immediately. No need for painful installation or configuration.
  • Powerful Tooling: Mortar provides a rich suite of tools to aid in Pig development, including the ability to Illustrate a script before running it, and an extremely fast and free local development mode.
  • Elastic Clusters: We spin up Hadoop clusters as you need them, so you don’t have to predict your needs in advance, and you don’t pay for machines you don’t use.
  • Solid Support: Whether the issue is in your script or in Hadoop, we’ll help you figure out a solution.

Data Sets

One great thing about this data is that you can combine the DocGraph data with other data sets. For example, we can combine NPPES data with the DocGraph data. The NPPES is the federal registry for NPI numbers and associated provider information.

To create the data sets for ingest into Neo4J, we are going to combine Census data, DocGraph data, the NPPES database, and the National Uniform Claim Committee (NUCC) provider taxonomy codes.

Pig Scripts
Using Pig scripts, I was able to create several data files that could then be loaded into Neo4J.

Running the Pig Code in Mortar
In the next post, we will look at using Mortar’s framework to run the Pig jobs.

Recommender Tips, Mortar and DocGraph

Jonathan Packer wrote on Mortar’s blog about flexible recommender models. Jonathan articulates that “from a business perspective the two most salient advantages of graph-based models: flexibility and simplicity.”

Some of the salient points made in the article are:

  • Graph-based models are modular and transparent.
  • A simple graph-based model will allow you to build a viable recommender system for your product without delaying its time-to-market.
  • Graphs can be visualized, explained, discussed, and debugged collaboratively in a way that sophisticated machine learning techniques cannot.

Jonathan ends with “My opinion is that the next big advances to be made in recommender systems will be made by combining automated tools with human—possibly crowdsourced—editorial judgement and writing talent. They will be made in finding more engaging ways to present recommendations to users than cloying sidebars and endlessly scrolling lists.”

DocGraph
“The average doctor has likely never heard of Fred Trotter, but he has some provocative ideas about using physician data to change how healthcare gets delivered.” This was from a recent Gigaom article. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

The current DocGraph social graph was built in Neo4J. With the new enhancements in Neo4J 2.0 (primarily labels), it was a good time to rebuild the social graph and add in data about each doctor, their specialties and their locations. Finally, I’ve added in some census income data at the zip code level. Researchers could look at economic indicators to see if there are discernible economic patterns in the referrals.

Recommendation Engine
The combination of the Neo4J social graph, the medical data and the capability to build a recommendation engine in Mortar makes a compelling use case. I believe that this use case will address Jonathan’s premise that the new engaging recommendation engines can be built to help give patients a sense of which doctors are most respected by their peers. Additionally, the graph data could help hospitals understand the referral patterns associated with poor care coordination, and provide health IT startups with a map of the most plugged-in doctors in each city.

Next steps
Over the next couple of weeks, I’ll be writing on how I used Mortar, Pig and Neo4J to build the updated DocGraph data set.

Chicago Sacred Heart Hospital – Medicare Kickback Scheme

According to an April 16, 2013 FBI press release, Chicago Sacred Heart Hospital Owner, Executive, and Four Doctors Arrested in Alleged Medicare Referral Kickback Conspiracy.

From the press release:

CHICAGO—The owner and another senior executive of Sacred Heart Hospital and four physicians affiliated with the west side facility were arrested today for allegedly conspiring to pay and receive illegal kickbacks, including more than $225,000 in cash, along with other forms of payment, in exchange for the referral of patients insured by Medicare and Medicaid to the hospital, announced U.S. Attorney for the Northern District of Illinois Gary S. Shapiro.

Arrested were Edward J. Novak, 58, of Park Ridge, Sacred Heart’s owner and chief executive officer since the late 1990s; Roy M. Payawal, 64, of Burr Ridge, executive vice president and chief financial officer since the early 2000s; and Drs. Venkateswara R. “V.R.” Kuchipudi, 66, of Oak Brook, Percy Conrad May, Jr., 75, of Chicago, Subir Maitra, 73, of Chicago, and Shanin Moshiri, 57, of Chicago.

DocGraph Data
I wanted to see what the graph of these doctors looked like in the DocGraph dataset. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

Hadoop Data Processing
Using Mortar for online Hadoop processing and Amazon S3 for storage and access to the data, I wrote up a Pig script that filters the DocGraph data to the rows where any of the accused were the referring doctors, joins them to the National Provider registry, and writes the data out to an S3 bucket.

medGraphData = LOAD 's3n://medgraph/refer.2011.csv' USING PigStorage(',') AS
(primaryProvider:chararray,
referredDoctor: chararray,
qtyReferred:chararray);

nucc_codes = LOAD 's3n://NUCC-Taxonomy/nucc_taxonomy_130.txt' USING PigStorage('\t') AS
(nuccCode:chararray,
nuccType:chararray,
nuccClassification:chararray,
nuccSpecialty:chararray);

-- Load NPI Data
npiData = LOAD 's3n://NPIData/npidata_20050523-20130113.csv' USING PigStorage(',') AS
(NPICode:chararray,
f2:chararray,
f3:chararray,
f4:chararray,
f5:chararray,
f6:chararray,
f7:chararray,
f8:chararray,
f9:chararray,
f10:chararray,
f11:chararray,
f12:chararray,
f13:chararray,
f14:chararray,
f15:chararray,
f16:chararray,
f17:chararray,
f18:chararray,
f19:chararray,
f20:chararray,
f21:chararray,
f22:chararray,
f23:chararray,
f24:chararray,
f25:chararray,
f26:chararray,
f27:chararray,
f28:chararray,
f29:chararray,
f30:chararray,
f31:chararray,
f32:chararray,
f33:chararray,
f34:chararray,
f35:chararray,
f36:chararray,
f37:chararray,
f38:chararray,
f39:chararray,
f40:chararray,
f41:chararray,
f42:chararray,
f43:chararray,
f44:chararray,
f45:chararray,
f46:chararray,
f47:chararray,
f48:chararray,
f49:chararray);

chicagoSacredHeartHosp = FILTER medGraphData BY (referredDoctor == '1003163122' OR referredDoctor == '1760730063');

chicagoSacredHeartHospPrimary = FILTER medGraphData BY (primaryProvider == '1003163122' OR primaryProvider == '1760730063');

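-- Referrals where one of the accused physicians was the referring (primary) provider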
docFraud = FILTER medGraphData BY (primaryProvider == '1598896086' OR primaryProvider == '1003450178' OR primaryProvider == '1255463576' OR primaryProvider == '1588694343' OR primaryProvider == '1588694343' OR primaryProvider == '1265492128');

--chicagoDocs = FILTER npiData BY ((f23 == '"CHICAGO"' OR f31 == '"CHICAGO"' ) AND f29 matches '.*3240.*');
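-- Strip the quote characters from the NPI registry fields and give them readable names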
out = FOREACH npiData GENERATE REPLACE(NPICode,'\\"','') as newNPICode, 
REPLACE(f5, '\\"','') as orgName,
REPLACE(f6, '\\"','') as orgLastName,
REPLACE(f7, '\\"', '') as firstName, 
REPLACE(f21, '\\"','') as docAddra1,
REPLACE(f22, '\\"','') as docAddra2,
REPLACE(f23, '\\"','') as docCity1,
REPLACE(f29, '\\"','') as docAddr1,
REPLACE(f30, '\\"','') as docAddr2,
REPLACE(f31, '\\"','') as docCity,
REPLACE(f32, '\\"','') as docState,
REPLACE(f33, '\\"','') as docPostalCode,
REPLACE(f48, '\\"','') as taxonomyCode;

docFraudSacredHeart = JOIN docFraud BY (referredDoctor), out BY newNPICode;

rmf s3n://DataOut/DocGraph/ChicagoDocs;
rmf s3n://DataOut/DocGraph/ChicagoMedicareFraud;
rmf s3n://DataOut/DocGraph/docFraud;
rmf s3n://DataOut/DocGraph/docFraudSacredHeart;

--STORE sampleJoinedVADoc INTO 's3n://DataOut/DocGraph/DocHosp' USING PigStorage('|');
--STORE out INTO 's3n://DataOut/DocGraph/ChicagoDocs' USING PigStorage('|');
STORE chicagoSacredHeartHospPrimary INTO 's3n://DataOut/DocGraph/ChicagoMedicareFraud' USING PigStorage('|');
STORE docFraud INTO 's3n://DataOut/DocGraph/docFraud' USING PigStorage('|');
STORE docFraudSacredHeart INTO 's3n://DataOut/DocGraph/docFraudSacredHeart' USING PigStorage('|');

Data Results
Looking at the data results, three of the doctors made referrals to Sacred Heart.

Doctor          NPI           Hospital NPI    Nbr Referrals
Dr. Maitra      1598896086    1558367656      2495
Dr. Kuchipudi   1265492128    1558367656      1171
Dr. May         1588694343    1558367656       417

Visualization
Using Gephi, I was able to visualize the referrals for these three doctors.

[Gephi visualization of the referrals]

While this doesn’t provide a detailed look into the fraud, it does show there were referrals made to Sacred Heart.

DocGraph Analysis using Hadoop and D3.JS

Visualizing the DocGraph for Wyoming Medicare Providers

I have been participating in the DocGraph MedStartr project. After hearing about the project at GraphConnect 2012, I wanted to use this data to investigate additional capabilities of Hadoop and BigData processing. You can read some excellent work already being done on this data here courtesy of Janos. Ryan Weald has some great work on visualizing geographic connections between doctors here as well.

You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

One great thing about this data is that you can combine the DocGraph data with other data sets. For example, we can combine NPPES data with the DocGraph data. The NPPES is the federal registry for NPI numbers and associated provider information. Additionally, you can bring in other data such as Census data and Virginia state information.

In this example, I want to use MortarData (Hadoop in the cloud) to combine Census data, DocGraph data, the NPPES database, and the National Uniform Claim Committee (NUCC) provider taxonomy codes. The desired outcome is to compare the referrals between taxonomy codes for the entire state of Virginia and for the areas of Virginia with a population of less than 25,000.

Mortar Data
Mortar is Hadoop in the cloud—an on-demand, wickedly scalable platform
for big data. Start your work in the browser—zero install. Or if you need more control, use Mortar from your own machine, in your own development environment.

Mortar is listed in GigaOM’s “12 big data tools you need to know” and one of the “10 Coolest Big Data Products Of 2012.”

Approach
Using Hadoop and Pig, I am going to use the following approach:

1. Load up the four data sets.
2. Filter the NPI data from NPPES by the provider’s state.
3. Filter the State Data by the desired population.
4. Join both the primary and the referring doctors to the NPI/NPPES/Census data.
5. Carve out the Primary doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
6. Carve out the Referring doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
7. Carve out the primary and referring doctors, count the number of primary referrals and then link the taxonomy codes to both the primary and referring doctors.
8. Export the data out for future visualization.

Why Mortar Data and Hadoop
Using Hadoop, Pig and Mortar’s platform, I have several advantages:
1. I can store all of the data files as flat files in an Amazon S3 store. I don’t need a dedicated server.
2. I can spin up as many Hadoop clusters as I need in a short time.
3. I can write Pig code to do data processing, joins, filters, etc. that work on the data.
4. I can add in Python libraries and code to supplement the Pig.
5. I can add parameters and change the state and population on the fly.

You can see the Mortar Web interface here:
[Screenshot: Mortar web interface]

Visualization
I plan on using the D3.js library to create some visualizations. One example visualization I am working on is a Hierarchical Edge Bundling chart. You can see the initial prototype here. I still need to fill in all of the links.

Campaign Data Analysis Video

As a wrap-up on the Campaign Analysis that I presented at GraphConnect, I decided to make a video showing the usage of Mortar Data, Neo4J and D3 JS.

Mortar Data
Mortar is Hadoop in the cloud—an on-demand, wickedly scalable platform
for big data. Start your work in the browser—zero install. Or if you need more control, use Mortar from your own machine, in your own development environment.

Mortar is listed in GigaOM’s “12 big data tools you need to know” and one of the “10 Coolest Big Data Products Of 2012.”

Neo4J
Neo4j is an open-source, high-performance, NOSQL graph database, optimized for superfast graph traversals. With Neo4J, you can easily model and change the full complexity of your system.

Neo4J was listed as a “big data vendor to watch in 2013” by Infoworld.

D3.JS
D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.