Chicago Sacred Heart Hospital – Medicare Kickback Scheme

According to an April 16, 2013 FBI press release, "Chicago Sacred Heart Hospital Owner, Executive, and Four Doctors Arrested in Alleged Medicare Referral Kickback Conspiracy."

From the press release:

CHICAGO—The owner and another senior executive of Sacred Heart Hospital and four physicians affiliated with the west side facility were arrested today for allegedly conspiring to pay and receive illegal kickbacks, including more than $225,000 in cash, along with other forms of payment, in exchange for the referral of patients insured by Medicare and Medicaid to the hospital, announced U.S. Attorney for the Northern District of Illinois Gary S. Shapiro.

Arrested were Edward J. Novak, 58, of Park Ridge, Sacred Heart’s owner and chief executive officer since the late 1990s; Roy M. Payawal, 64, of Burr Ridge, executive vice president and chief financial officer since the early 2000s; and Drs. Venkateswara R. “V.R.” Kuchipudi, 66, of Oak Brook, Percy Conrad May, Jr., 75, of Chicago, Subir Maitra, 73, of Chicago, and Shanin Moshiri, 57, of Chicago.

DocGraph Data

I wanted to see what the graph of these doctors looked like in the DocGraph dataset. You can read more details about DocGraph in Fred Trotter’s post. The basic data set is just three columns: two NPI numbers (National Provider Identifiers) and a weight, which is the number of shared Medicare patients in a 30-day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.
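To make the three-column layout concrete, here is a minimal Python sketch of the kind of filter the Hadoop job performs; the rows and the set of accused NPIs below are illustrative, not the real data:

```python
import csv
import io

# Illustrative DocGraph-style rows: referring NPI, referred-to NPI, shared patients
sample = io.StringIO(
    "1598896086,1558367656,2495\n"
    "1265492128,1558367656,1171\n"
    "9999999999,1111111111,42\n"
)

accused = {"1598896086", "1265492128", "1588694343"}

referrals = [
    row for row in csv.reader(sample)
    if row[0] in accused  # keep edges where the referring doctor is one of the accused
]
```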

Hadoop Data Processing

Using Mortar for online Hadoop processing and Amazon S3 for storage and access to the data, I wrote a Hadoop script that filters the DocGraph data to rows where any of the accused were the referring doctors, joins them to the National Provider registry, and writes the data out to an S3 bucket.

-- Load the DocGraph referral data (schema completed here; the original listing truncated it)
medGraphData = LOAD 's3n://medgraph/refer.2011.csv' USING PigStorage(',') AS (
    referredDoctor: chararray,
    primaryProvider: chararray,
    sharedPatients: int);

-- Load the NUCC taxonomy codes (field names assumed from the NUCC file layout)
nucc_codes = LOAD 's3n://NUCC-Taxonomy/nucc_taxonomy_130.txt' USING PigStorage('\t') AS (
    code: chararray,
    grouping: chararray,
    classification: chararray,
    specialization: chararray);

-- Load NPI Data (the full NPPES schema runs to hundreds of columns and is omitted here;
-- the fields referenced below are NPICode and f5 through f48)
npiData = LOAD 's3n://NPIData/npidata_20050523-20130113.csv' USING PigStorage(',') AS

chicagoSacredHeartHosp = FILTER medGraphData BY (referredDoctor == '1003163122' OR referredDoctor == '1760730063');

chicagoSacredHeartHospPrimary = FILTER medGraphData BY (primaryProvider == '1003163122' OR primaryProvider == '1760730063');

docFraud = FILTER medGraphData BY (primaryProvider == '1598896086' OR primaryProvider == '1003450178' OR primaryProvider == '1255463576' OR primaryProvider == '1588694343' OR primaryProvider == '1265492128');

--chicagoDocs = FILTER npiData BY ((f23 == '"CHICAGO"' OR f31 == '"CHICAGO"' ) AND f29 matches '.*3240.*');
out = FOREACH npiData GENERATE REPLACE(NPICode,'\\"','') as newNPICode, 
REPLACE(f5, '\\"','') as orgName,
REPLACE(f6, '\\"','') as orgLastName,
REPLACE(f7, '\\"', '') as firstName, 
REPLACE(f21, '\\"','') as docAddra1,
REPLACE(f22, '\\"','') as docAddra2,
REPLACE(f23, '\\"','') as docCity1,
REPLACE(f29, '\\"','') as docAddr1,
REPLACE(f30, '\\"','') as docAddr2,
REPLACE(f31, '\\"','') as docCity,
REPLACE(f32, '\\"','') as docState,
REPLACE(f33, '\\"','') as docPostalCode,
REPLACE(f48, '\\"','') as taxonomyCode;

docFraudSacredHeart = JOIN docFraud BY (referredDoctor), out BY newNPICode;

rmf s3n://DataOut/DocGraph/ChicagoDocs;
rmf s3n://DataOut/DocGraph/ChicagoMedicareFraud;
rmf s3n://DataOut/DocGraph/docFraud;
rmf s3n://DataOut/DocGraph/docFraudSacredHeart;

--STORE sampleJoinedVADoc INTO 's3n://DataOut/DocGraph/DocHosp' USING PigStorage('|');
--STORE out INTO 's3n://DataOut/DocGraph/ChicagoDocs' USING PigStorage('|');
STORE chicagoSacredHeartHospPrimary INTO 's3n://DataOut/DocGraph/ChicagoMedicareFraud' USING PigStorage('|');
STORE docFraud INTO 's3n://DataOut/DocGraph/docFraud' USING PigStorage('|');
STORE docFraudSacredHeart INTO 's3n://DataOut/DocGraph/docFraudSacredHeart' USING PigStorage('|');

Data Results
Looking at the data results, three of the doctors made referrals to Sacred Heart.

Doctor          NPI          Hospital NPI   Nbr Referrals
Dr. Maitra      1598896086   1558367656     2495
Dr. Kuchipudi   1265492128   1558367656     1171
Dr. May         1588694343   1558367656      417

Visualization

Using Gephi, I was able to visualize the referrals for these three doctors.


While this doesn’t provide a detailed look into the fraud, it does show there were referrals made to Sacred Heart.

DocGraph Analysis using Hadoop and D3.JS

Visualizing the DocGraph for Wyoming Medicare Providers

I have been participating in the DocGraph MedStartr project. After hearing about the project at GraphConnect 2012, I wanted to use this data to investigate additional capabilities of Hadoop and big data processing. You can read some excellent work already being done on this data here, courtesy of Janos. Ryan Weald has some great work on visualizing geographic connections between doctors here as well.

You can read more details about DocGraph in Fred Trotter’s post. The basic data set is just three columns: two NPI numbers (National Provider Identifiers) and a weight, which is the number of shared Medicare patients in a 30-day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

One great thing about this data is that you can combine the DocGraph data with other data sets. For example, we can combine NPPES data with the DocGraph data. The NPPES is the federal registry for NPI numbers and associated provider information. Additionally, you can bring in other data such as Census data and Virginia state information.

In this example, I want to use Mortar Data (Hadoop in the cloud) to combine Census data, DocGraph data, the NPPES database, and the National Uniform Claim Committee (NUCC) provider taxonomy codes. The desired outcome is to compare the referrals between taxonomy codes for the entire state of Virginia against the areas of Virginia with a population of less than 25,000.

Mortar Data
Mortar is Hadoop in the cloud—an on-demand, wickedly scalable platform
for big data. Start your work in the browser—zero install. Or if you need more control, use Mortar from your own machine, in your own development environment.

Mortar is listed in GigaOM’s "12 big data tools you need to know" and as one of the "10 Coolest Big Data Products of 2012."

Using Hadoop and Pig, I am going to use the following approach:

1. Load up the four data sets.
2. Filter the NPI data from NPPES by the provider’s state.
3. Filter the State Data by the desired population.
4. Join both the primary and the referring doctors to the NPI/NPPES/Census data.
5. Carve out the Primary doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
6. Carve out the Referring doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
7. Carve out the primary and referring doctors, count the number of primary referrals and then link the taxonomy codes to both the primary and referring doctors.
8. Export the data out for future visualization.
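The steps above can be sketched in plain Python on toy data to show the shape of the filter-join-group pipeline (all NPIs, ZIP codes, populations, and taxonomy codes below are made up for illustration):

```python
from collections import Counter

# Toy stand-ins for the four inputs
npi = {
    "100": ("VA", "24450", "207Q00000X"),  # NPI -> (state, ZIP, NUCC taxonomy)
    "200": ("VA", "24450", "207R00000X"),
    "300": ("MD", "21201", "207Q00000X"),
}
zip_population = {"24450": 7000, "21201": 400000}   # census stand-in
docgraph = [("100", "200", 12), ("100", "300", 5)]  # (referrer, referee, weight)

def keep(npi_id):
    """Steps 2-3: Virginia providers located in areas under 25,000 people."""
    state, zip5, _taxonomy = npi[npi_id]
    return state == "VA" and zip_population[zip5] < 25000

# Step 4: join both ends of each referral edge to the provider data
va_small_edges = [(a, b, w) for a, b, w in docgraph if keep(a) and keep(b)]

# Steps 5-7: count referrals between taxonomy-code pairs
pair_counts = Counter((npi[a][2], npi[b][2]) for a, b, _ in va_small_edges)
```

In the real job, each of these steps is a Pig FILTER, JOIN, or GROUP over the S3-hosted files rather than in-memory dictionaries.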

Why Mortar Data and Hadoop
Using Hadoop, Pig and Mortar’s platform, I have several advantages:
1. I can store all of the data files as flat files in an Amazon S3 store. I don’t need a dedicated server.
2. I can spin up as many Hadoop clusters as I need in a short time.
3. I can write Pig code to do data processing, joins, filters, etc. that work on the data.
4. I can add in Python libraries and code to supplement the Pig.
5. I can add parameters and change the state and population on the fly.

You can see the Mortar Web interface here:

I plan on using the D3.js library to create some visualizations. One example visualization I am working on is a Hierarchical Edge Bundling chart. You can see the initial prototype here. I still need to fill in all of the links.

Campaign Data Analysis Video

As a wrap-up of the campaign analysis that I presented at GraphConnect, I decided to make a video showing the usage of Mortar Data, Neo4j, and D3.js.

Mortar Data
Mortar is Hadoop in the cloud—an on-demand, wickedly scalable platform
for big data. Start your work in the browser—zero install. Or if you need more control, use Mortar from your own machine, in your own development environment.

Mortar is listed in GigaOM’s "12 big data tools you need to know" and as one of the "10 Coolest Big Data Products of 2012."

Neo4j is an open-source, high-performance, NOSQL graph database, optimized for superfast graph traversals. With Neo4J, you can easily model and change the full complexity of your system.

Neo4J was listed as a “big data vendor to watch in 2013” by Infoworld.

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.

GraphConnect – Graphies

GraphConnect was held two weeks ago (November 5-6) in San Francisco. The conference was put together by Neo4j as an inaugural conference on graph databases.

Part of the conference was a contest on building innovative graphs. The "Graphies" recognized and celebrated individuals and teams developing innovative and impressive graph database applications. Additionally, there was a Twitter contest called #ShowMeYourGraph, which challenged entrants to tap into their creative side and show their most creative graph using any online format: graph visualization, photo, video, blog, etc.

Two of my submissions were SuperPac contributions and SuperPac contributions with labels. The rest of this post will talk about how I created the two charts.

SuperPac Contributions
The Sunlight Foundation provided the SuperPac expenditure data for the 2012 campaign. The data was in a csv format.

For this example, I wanted to look at contributions from September 1 through October 2012, restricted to the presidential candidates. This would show which SuperPacs were funding or opposing each candidate. Using some crazy Excel skills, I was able to identify the unique SuperPacs, their support or opposition, and the candidate they were supporting or opposing.
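The Excel step boils down to deduplicating (SuperPac, stance, candidate) triples; a minimal Python equivalent on made-up rows:

```python
# Illustrative expenditure rows: (superpac, support_or_oppose, candidate)
rows = [
    ("Restore Our Future", "Support", "Mitt Romney"),
    ("Restore Our Future", "Support", "Mitt Romney"),   # duplicate expenditure line
    ("Priorities USA Action", "Oppose", "Mitt Romney"),
]

# Collapse repeated expenditure lines down to the unique relationships
unique_edges = sorted(set(rows))
```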

Visualization using D3.js

After finding a good example of a force node graph in D3.js, I was able to create the JSON file for the data. In order to create the different colors, I needed to figure out different groupings. The groups for Barack Obama and Mitt Romney were unique groups. Additional groups were created for SuperPacs that supported a candidate, opposed a candidate, supported one candidate while opposing another candidate (i.e. Support Obama and Oppose Romney and vice versa) and those that did all three.
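One way to build that JSON is to assign each SuperPac node a group derived from its combination of stances. This Python sketch shows the idea; the node names and grouping labels are illustrative, not the exact code behind the chart:

```python
import json

# Hypothetical stance sets derived from the deduplicated expenditure data
edges = {
    "Restore Our Future": {("Support", "Mitt Romney")},
    "Priorities USA Action": {("Support", "Barack Obama"), ("Oppose", "Mitt Romney")},
}

def group_for(stances):
    """Map a SuperPac's combination of stances to a color group (scheme illustrative)."""
    key = frozenset(stances)
    if key == {("Support", "Barack Obama"), ("Oppose", "Mitt Romney")}:
        return "support-obama-oppose-romney"
    if key == {("Support", "Mitt Romney")}:
        return "support-romney"
    return "other"

# D3 force layouts expect {"nodes": [...], "links": [...]}; links omitted for brevity
nodes = [{"name": pac, "group": group_for(st)} for pac, st in sorted(edges.items())]
graph = json.dumps({"nodes": nodes})
```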

In this instance, Mitt Romney is purple and Barack Obama is the peach node. The red nodes supported Mitt Romney and the blue opposed. The orange nodes opposed Barack Obama and the light blue supported Barack Obama. The green supported Romney and opposed Obama while the light orange supported Obama and opposed Romney.

With labels, the graph gets a lot more messy. You can see it here:

The initial graph was chosen as the winner of the Most Innovative Educational Graph Application as well as a winning entry in the #ShowMeYourGraph contest. Thanks to Neo4J for the conference and for the competition.

Graphing Congressional Sponsorship of Bills – Part 2

This is a follow-up to the previous post.

Note: The data and Java code are available on GitHub.

The Data
I decided to use six sessions of data (the 107th through 112th Congresses). Since there is significant turnover in Congress (lost re-elections, scandals, resignations, and deaths, among other reasons), I had to expand the number of congressional members to 928. In addition, over those sessions of bills, there are now 5,870 unique topics or subjects (e.g., Arts, Boy Scouts, Clergy, Welfare, and Whaling, to mention a few).

Using the BatchInserter framework from Max De Marzi and Michael Hunger, the data can be quickly loaded into Neo4J (less than 5 minutes). There are 340,619 nodes, 846,697 properties, and 2,514,565 relationships.

Data Exploration
We’ll start out simple using Cypher queries and build from there. Let’s see how many bills discuss Whaling.

START n=node:subjects('subject:Whaling')
MATCH n<-[:SUBJECT]-c          // relationship name assumed; the MATCH was dropped from the original listing
RETURN c.Session, count(c.Session)

Our output:

"110"	5
"109"	7
"108"	11
"107"	13

To see who is sponsoring or cosponsoring the most bills related to “whaling”, we run the following query:

START n=node:subjects('subject:Whaling')
MATCH n<-[:SUBJECT]-bill-[:SPONSORED_BY|COSPONSORED_BY]->z   // relationship names assumed
RETURN z.firstname, z.lastname, count(z) AS popCoSponsors
ORDER BY popCoSponsors DESC

Our output is:

z.firstname    z.lastname    popCoSponsors
"Joseph"       "Lieberman"   9
"John"         "Kerry"       8
"Susan"        "Collins"     8
"Edward"       "Kennedy"     8
"Barbara"      "Boxer"       8
"Ron"          "Wyden"       7
"Daniel"       "Akaka"       7
"Christopher"  "Dodd"        6
"Olympia"      "Snowe"       6
"John"         "Reed"        6
"Carl"         "Levin"       6
"Russell"      "Feingold"    6
"Wayne"        "Gilchrest"   6
"John"         "McCain"      6
"Joseph"       "Biden"       5

If I want to see who usually co-sponsors Whaling bills sponsored by Olympia Snowe, I would run this query:

START n=node:subjects('subject:Whaling'), z=node:congress('lastName:Snowe')
MATCH n<-[:SUBJECT]-bill-[:SPONSORED_BY]->z, bill-[:COSPONSORED_BY]->x   // relationship names assumed
RETURN x.firstname, x.lastname, count(x) AS popCoSponsors
ORDER BY popCoSponsors DESC

The output for this is:

"Ron"          "Wyden"       4
"Christopher"  "Dodd"        4
"John"         "Reed"        4
"Daniel"       "Akaka"       4
"Susan"        "Collins"     4
"Joseph"       "Biden"       4
"Edward"       "Kennedy"     4
"Barbara"      "Boxer"       4
"John"         "McCain"      4
"Joseph"       "Lieberman"   4

Paths between Congressmen
To find the shortest path between two congressmen, we can run this query:

START ryanvp = node(6233), obama = node(6323)
MATCH p = shortestPath(ryanvp-[*..10]-obama)
RETURN p;

This returns the following path:

Obama -> Bill SR 97 110 Session -> Topic of Congressional Tributes -> HR716 112 Session -> Paul Ryan

Graphing Congressional Sponsorship of Bills


In preparation for my talk at GraphConnect, I’ve extended the material by graphing congressional sponsorship of bills using Neo4j. This post will specifically talk about data acquisition, data preparation, data loading into Neo4J and simple Cypher queries against the data. Follow-on posts will look at processing this same data in Hadoop, processing this data in data-flows with Pig and Python to expose new and interesting properties, and visually graphing the data using d3.js.

The Data:

In this project, we will download a spreadsheet of all members of Congress as well as all bills/resolutions from the 112th session of Congress. GovTrack has made an XML file of everyone that has ever served in Congress, plus U.S. presidents, with their party affiliations, terms in Congress, birthdays, etc. Additionally, they have provided all bills and resolutions for the current Congress and similarly named directories for other Congresses.

GovTrack has provided easy to use instructions on obtaining their data. To download GovTrack’s raw data files in bulk, or if you plan to regularly update the files, you should use the rsync tool. Rsync is good for selecting which directory of data you want and keeping your files up to date by only downloading changes on each update.

rsync -avz --delete --delete-excluded .

Complete code for this example will be made available on github.

Data Processing:

Using Excel, you can load the Congress XML file, and it will be transformed into a spreadsheet. By filtering on the active field, you can quickly create a spreadsheet of the 535 current members. This file is then saved out to a tab-delimited format, and TextWrangler is used to convert the tabs to pipe delimiters.
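The tab-to-pipe conversion is a one-line transform; a Python equivalent of the TextWrangler step, on a made-up row:

```python
def tabs_to_pipes(line: str) -> str:
    """Replace tab delimiters with pipes, as done manually in TextWrangler."""
    return line.rstrip("\n").replace("\t", "|")

# Illustrative tab-delimited member record
row = "Olympia\tSnowe\tME\tsenator\n"
converted = tabs_to_pipes(row)
```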

Each of the bills is downloaded as an xml file. A sample is shown below:


Title: A resolution expressing the sense of the Senate that the primary safeguard for the well-being and protection of children is the family, and that the primary safeguards for the legal rights of children in the United States are the Constitutions of the United States and the several States, and that, because the use of international treaties to govern policy in the United States on families and children is contrary to principles of self-government and federalism, and that, because the United Nations Convention on the Rights of the Child undermines traditional principles of law in the United States regarding parents and children, the President should not transmit the Convention to the Senate for its advice and consent.

Status: Referred to the Committee on Foreign Relations.

Summary (3/10/2011, Introduced): Expresses the sense of the Senate that: (1) the United Nations Convention on the Rights of the Child is incompatible with the Constitution, the laws, and the traditions of the United States; (2) the Convention would undermine presumptions of freedom and independence for U.S. families; (3) the Convention would interfere with U.S. principles of sovereignty, independence, and self-government that preclude the necessity or propriety of adopting international law to govern domestic matters; and (4) the President should not transmit the Convention to the Senate for its advice and consent.

Using a quickly written Java application, I was able to create several delimited files to support data loading into Neo4J. Each bill has a sponsor, may have one or more co-sponsors, covers one or more topics or subjects, and has activity recorded as it moves through Congress. You can see that modeled below:
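The extraction step can be sketched in Python; the XML below is a trimmed, hypothetical bill record, not the exact GovTrack schema, and the row formats are illustrative:

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical bill record
bill_xml = """
<bill session="112" type="sr" number="99">
  <sponsor id="400001"/>
  <cosponsors><cosponsor id="400002"/><cosponsor id="400003"/></cosponsors>
  <subjects><term name="Foreign policy"/><term name="Children"/></subjects>
</bill>
"""

root = ET.fromstring(bill_xml)
bill_id = "{}-{}{}".format(root.get("session"), root.get("type"), root.get("number"))

# Emit pipe-delimited rows for each relationship the graph model needs
sponsor_row = "|".join([bill_id, root.find("sponsor").get("id")])
cosponsor_rows = ["|".join([bill_id, c.get("id")]) for c in root.iter("cosponsor")]
subject_rows = ["|".join([bill_id, t.get("name")]) for t in root.iter("term")]
```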

Data Loading into Neo4J
Using the BatchInserter framework from Max De Marzi and Michael Hunger, the data can be quickly loaded into Neo4J (less than 5 minutes). There are 50,889 nodes, 121,049 properties, and 266,197 relationships.

Quick Analysis

Using Cypher, you can quickly see who sponsored the most bills:

START n=node:congress('lastName:*')
MATCH n<-[:SPONSORED_BY]-bill      // relationship name assumed; the MATCH was dropped from the original listing
RETURN n.lastname, n.firstname, count(*)
ORDER BY count(*) DESC LIMIT 5;

Casey, Robert 228
Menéndez, Robert 207
Schumer, Charles 146
Brown, Sherrod 118
Hagan, Kay 116

And those that COSPONSOR:

START n=node:congress('lastName:*')
MATCH n<-[:COSPONSORED_BY]-bill    // relationship name assumed
RETURN n.lastname, n.firstname, count(*)
ORDER BY count(*) DESC LIMIT 5;
Grijalva, Raul 752
Filner, Bob 704
Rangel, Charles 687
Norton, Eleanor 681
Lee, Barbara 605

We’ve seen how we can quickly transform data from XML files into Neo4J and begin to look at it. The next post will look at further analysis of the data.

Analysis of Cash Pouring into State Campaigns

In today’s Star Tribune, there was an article discussing the amount of cash flowing from outside the state of Minnesota into various candidates’ campaign war chests. The authors used FEC individual campaign contribution data to perform the analysis described in the article.

Using a similar approach to the one outlined earlier, I decided to see how easy it would be to use Datameer to perform a similar analysis. In this case, I filtered the candidates so only Minnesota candidates would show up, as shown below.

After joining this set of records to the campaign contributions, I created one filter on individuals who listed Minnesota as their address and another on non-Minnesota contributors. After grouping the sums by candidate and joining the sheets, I had a worksheet that I could use to graph the results. That sheet is below:
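The grouping step amounts to a two-bucket sum per candidate; a Python sketch on made-up contribution rows:

```python
from collections import defaultdict

# Illustrative contribution rows: (candidate, contributor_state, amount)
contribs = [
    ("Candidate A", "MN", 500.0),
    ("Candidate A", "CA", 1200.0),
    ("Candidate B", "MN", 300.0),
]

# Sum each candidate's contributions into in-state vs. out-of-state buckets
totals = defaultdict(lambda: {"in_state": 0.0, "out_of_state": 0.0})
for cand, state, amount in contribs:
    bucket = "in_state" if state == "MN" else "out_of_state"
    totals[cand][bucket] += amount
```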

Using the Infographic capability, I created a quick bar graph from these results.

I have to look at the data set again, as the numbers don’t quite match the Star Tribune article. However, it shows that tools like Datameer 2.0 can be used for multi-purpose data analysis and tasks that we would normally think only a data scientist would be capable of performing.

Graph Connect 2012

Inaugural Graph Database Conference

The NOSQL movement has taken the world by storm, bringing a new coherency and meaning to connected data, and giving developers and technical leads the power to manage modern data at a new speed and size.

GraphConnect 2012 is the first conference to focus on graph databases and the applications making sense of connected data. Graph thought leaders from many disciplines and organizations will come to explore new ideas, share innovations in graph technology, and make connections with researchers and developers from around the globe.

Early registration is now available. The call for papers expires at the end of July.

Register for GraphConnect 2012

Datameer 2.0 – Analysis of Campaign Contributions

In the previous post, we looked at using Datameer 2.0 to analyze some Federal Election Commission data. In this post, I will show some of the analysis performed on the data specifically using the table join, filter, sort, a couple of functions and then a couple of infographics.

Problem Statement
The problem statement for this exercise is to determine the top 15 Virginia candidates in terms of total campaign contributions and then show the source of the campaign contributions (in state vs. out-of-state).

As mentioned previously, I created multiple workbooks for each of the data sets (committees, candidates and individual contributors). In the candidate data, I used the filter feature and created a worksheet of only Virginia candidates.

I created a new worksheet in order to join the Virginia candidates with the committees that are supporting them. I imported the filtered candidate data as well as the committee data. Using Datameer’s join capability, I was able to easily join the two data sets in order to create a new worksheet. The join feature allows the user to select the columns to join and also select which columns to include in the joined result. If you forget what type of join you want to use, a link to a wiki describing the joins is available.

One of the impressive things about Datameer is that it shows you the sheet dependencies or where the data came from. For Virginia Contributions, I joined the worksheet of candidates/committees with the individual campaign contributions. I then created new filtered worksheets by first filtering on contributors with an address listing VA as their state and a second worksheet for all other states. An example of the dependencies is below:

Once that worksheet was created, I created a new worksheet and then used the Datameer groupby and groupbysum functions to total up the contributions to each candidate.
After that, another join of the last two worksheets followed by a filtering of the Top 15 by total contribution left me with this worksheet.
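The final sort-and-top-N step looks like this in Python (the candidate totals are made up, and N is shortened to 2 for the example):

```python
# Illustrative candidate totals after the join-and-sum steps
totals = {"Kaine": 2_500_000, "Cantor": 1_900_000, "Allen": 1_200_000, "Goode": 90_000}

# Equivalent of the Datameer sort + Top-N filter
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:2]
```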

Infographics or Cool Charts
Once the data was available, it was time to create a bar chart. Datameer’s extensive library of widgets includes tables, graphs, charts, diagrams, maps, and tag clouds, enabling users to create simple dashboards or stunning business infographics and visualizations. Datameer’s WYSIWYG editor speeds the creation of compelling and insightful business infographics. The editor includes a graphics Inspector, a simple but powerful tool for configuring graphic and text elements, including colors, fonts, etc., and it provides a real-time view of all graphic and text elements as they are created and edited so that the final visualizations are rendered exactly as intended.

For this case, I chose a rather simple bar chart. Once it is placed on the WYSIWYG editor, you simply drag the fields from the list of worksheets and the chart takes shape. There are widget settings for each type of display, allowing you to customize the look and layout of the final product. You can see the stacked bar chart below:

In this graph, we are looking at the breakdown of contributions from Virginia addresses versus non-Virginia addresses. The total amount is blue, the amount from Virginia is orange, and the amount from outside of Virginia is green. You can see that Cantor and Kaine have received a majority of their campaign contributions from outside of Virginia.

Geographical Analysis
One of the last charts I used was a map. Data with latitude and longitude coordinates is automatically plotted on the map. In this case, I filtered out the contributions for one Virginia candidate (Tim Kaine) and geocoded a portion of the data. That information was then loaded into a worksheet and plotted on the map infographic shown below.

The data did not include any Virginia contributions but at a glance you can see a large number of contributions to the Kaine campaign from the southern US.

Datameer 2.0 opens up data analytics to more people by making it relatively easy. I didn’t need to stand up a Hadoop cluster to get started. With a little bit of knowledge and by asking some questions, I was able to easily drill down into the data for answers.