Hadoop to Neo4J

Leading up to GraphConnect NY, I was distracting myself from working on my talk by trying to determine whether there was any way to import data directly from Hadoop into a graph database, specifically Neo4J. Previously, I had written some Pig jobs to output the data into various files and then used the Neo4J batchinserter to load the data. This process works great, and others have written about it: for example, this approach also uses the batchinserter, while this approach uses some Java UDFs to write the Neo4J files directly.

Both of these approaches work great, but I was wondering if I could use a Python UDF and create the Neo4J database directly. To test this out, I decided to resurrect some work I had done on the congressional bill data from GovTrack. You can read about the data and the Java code I used to convert the files into single-line JSON files here. It’s also a good time to read up on how to create an Elasticsearch index using Hadoop. Now that you’re back from reading that link, let’s look at an approach for going from Hadoop directly into Neo4J. From the previous article, you’ll remember that Mortar recently worked with Pig and CPython to get streaming CPython support committed into the Apache Pig trunk. This now allows you to take advantage of Hadoop with real Python. Users get to focus just on the logic they need, and streaming Python takes care of all the plumbing.

Nigel Small had written Py2Neo, a simple and pragmatic Python library that provides access to the popular graph database Neo4J via its RESTful web service interface. That sounded awesome and worth trying out. Py2Neo is easy to install using pip or easy_install; installation instructions are located here.

The model that I was trying to create looks something like this:

Bills Layout

The approach taken was to use Pig with a streaming Python UDF that writes to the Neo4J database through its RESTful web service. I tested this out with Neo4J 2.0 M6. I attempted to use Neo4J 2.0 RC1 but ran into several errors relating to missing nodes. The example code is below:

-- run the Counter UDF to create a Node ID
keyNodeList = FOREACH c GENERATE keyValue, utility_udfs.auto_increment_id() AS my_id:int, nodeType;

Since Neo4J uses an incrementing counter for each node, we have to create an id for each keyValue (node name) that we are creating. The keyValues are the congressional session, the name of the congresswoman or congressman, the bill ID, or the subject. Below is the simple Python UDF that creates that ID.

from pig_util import outputSchema

COUNT = 0

@outputSchema('auto_increment_id:int')
def auto_increment_id():
    global COUNT
    COUNT += 1
    return COUNT

Once we have the id, we can use Py2Neo to create the nodes and relationships.

from pig_util import outputSchema

from py2neo import neo4j
from py2neo import node, rel

@outputSchema('nodeCreated:int')
def createNode(nodeValue, sLabel):
    if nodeValue:
        # Bind to the Neo4J REST endpoint (a new binding on every call)
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        # Create the node, with its name and label values, in a single batch request
        batch = neo4j.WriteBatch(graph_db)
        batch.create(node(name=nodeValue, label=sLabel))
        batch.submit()
        return 1
    else:
        return 0

@outputSchema('nodeCreated:int')
def createRelationship(fromNode, toNode, sRelationship):
    if fromNode:
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        # Look up the two nodes by id and create the typed relationship between them
        ref_node = graph_db.node(fromNode)
        to_node = graph_db.node(toNode)
        graph_db.create(rel(ref_node, sRelationship, to_node))
        return 1
    else:
        return 0

#myudf.py
@outputSchema('nodeCreated:int')
def createBillNode(nodeValue, sLabel, sTitle, sUpdated, sBillType, sBillNumber, sIntroducedAt, sStatus, sStatusAt):
    if nodeValue:
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        # Create the bill node, add its label, and set its properties
        foundNode, = graph_db.create(node(name=nodeValue))
        foundNode.add_labels(sLabel)
        foundNode["title"] = sTitle
        foundNode["updateDate"] = sUpdated
        foundNode["billType"] = sBillType
        foundNode["billNumber"] = sBillNumber
        foundNode["introducedAt"] = sIntroducedAt
        foundNode["status"] = sStatus
        foundNode["statusDate"] = sStatusAt
        return 1
    else:
        return 0

#myudf.py
@outputSchema('nodeUpdated:int')
def updateBillNode(nodeID, sTitle, sUpdated, sBillType, sBillNumber, sIntroducedAt, sStatus, sStatusAt):
    if nodeID:
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        # Look up an existing bill node by its id and update its properties
        foundNode = graph_db.node(nodeID)
        foundNode["title"] = sTitle
        foundNode["updateDate"] = sUpdated
        foundNode["billType"] = sBillType
        foundNode["billNumber"] = sBillNumber
        foundNode["introducedAt"] = sIntroducedAt
        foundNode["status"] = sStatus
        foundNode["statusDate"] = sStatusAt
        return 1
    else:
        return 0

Of note is the ability to create the node and add the label in the createNode function. To create a relationship, we pass in the two node ids and the relationship type; this is sent via the REST API and the relationship is created.

Performance – Performance wasn’t what I thought it would be. Py2Neo interacts with Neo4J via its REST API interface, so every interaction requires a separate HTTP request to be sent. This, along with logging, made the load much slower than I anticipated. Overall, it took about 40 minutes on my MacBook Pro with 16GB of RAM and an SSD to create the Neo4J database.

Py2Neo Batches – Batches allow multiple requests to be grouped and sent together, cutting down on network traffic and latency. Such requests also have the advantage of being executed within a single transaction. For the second run, I added some Py2Neo batches. This really didn’t make a huge difference, as the log files were still being written.
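As a rough sketch of how several creates can be grouped into a single Py2Neo WriteBatch (and therefore a single HTTP request); the key values below are placeholders:

from py2neo import neo4j
from py2neo import node

graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")

# Queue several node creations and submit them as one REST call
batch = neo4j.WriteBatch(graph_db)
for keyValue in ["113", "hr1234-113", "Health"]:  # placeholder keyValues
    batch.create(node(name=keyValue))
createdNodes = batch.submit()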

Overall, it still took about 60 minutes on my MacBook Pro with 16GB of RAM and an SSD to create the Neo4J database.

Next Steps
Hmmm… I should have known that the RESTful service performance wasn’t going to be anywhere near as fast as the batchinserter performance due to logging. You could see the log files grow and grow as the data was added. I’m going to go back to the drawing board and see if a Java UDF could work better. The worst case is that I just go back to writing out files and writing a custom batchinserter each time.

Creating an Elasticsearch index of Congress Bills using Pig


Recently, Mortar worked with Pig and CPython to get streaming CPython support committed into the Apache Pig trunk. This now allows you to take advantage of Hadoop with real Python. Users get to focus just on the logic they need, and streaming Python takes care of all the plumbing.

Shortly thereafter, Elasticsearch announced integration with Hadoop: “Using Elasticsearch in Hadoop has never been easier. Thanks to the deep API integration, interacting with Elasticsearch is similar to that of HDFS resources. And since Hadoop is more than just vanilla Map/Reduce, in elasticsearch-hadoop one will find support for Apache Hive, Apache Pig and Cascading in addition to plain Map/Reduce.”

Elasticsearch published the first milestone (1.3.0.M1) based on the new code-base that has been in the works for the last few months.

The initial attempt at testing out Mortar and Elasticsearch didn’t work. Working with the great team at Mortar and costinl at Elasticsearch, the Mortar team was able to update their platform to allow writing out to Elasticsearch at scale.

Test Case
To test this out, I decided to process congressional bill data from the past several congresses. The process is to read in the JSON files, process them using Pig, use NLTK to find the top 5 bigrams, and then write the data out to an Elasticsearch index.

The Data
GovTrack.us, a tool by Civic Impulse, LLC, is one of the world’s most visited government transparency websites. The site helps ordinary citizens find and track bills in the U.S. Congress and understand their representatives’ legislative record.

The bulk data is a deep directory structure of flat XML and JSON files. The directory layout is described below.

Our files are in three main directories:

Getting the Data

To fetch the data we support rsync, a common Unix/Mac tool for efficiently fetching files and keeping them updated as they change. The root of our rsync tree is govtrack.us::govtrackdata, and this corresponds exactly to what you see at http://www.govtrack.us/data/.

To download bill data for the 113th Congress into a local directory named bills, run:

rsync -avz --delete --delete-excluded --exclude **/text-versions/ \
		govtrack.us::govtrackdata/congress/113/bills .

(Note the double colons in the middle and the period at the end. This is a long command. I’ve indicated the line continuation with a backslash.)

Directories

The conversion step loops through a directory of bills and converts all of the .json files into single-line .json files.
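The original conversion code isn’t reproduced here; a minimal Python sketch of the idea, with hypothetical directory names, might look like this:

import json
import os

# Rewrite each pretty-printed GovTrack .json bill file as a single-line .json file.
# The directory names are hypothetical stand-ins; output is flattened into one directory.
src_dir = "bills"
out_dir = "bills_single_line"

if not os.path.exists(out_dir):
    os.makedirs(out_dir)

for root, dirs, files in os.walk(src_dir):
    for name in files:
        if name.endswith(".json"):
            with open(os.path.join(root, name)) as infile:
                bill = json.load(infile)
            with open(os.path.join(out_dir, name), "w") as outfile:
                # dumping without indentation writes the whole bill on one line
                outfile.write(json.dumps(bill) + "\n")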

The Pig script then reads all of the single-line .json files, pulls out some of the fields, calls a Python UDF to find the top 5 bigrams, and writes the data into an Elasticsearch index.

The important steps are to:
a) register the elasticsearch-hadoop jar file
b) define the storage for the Elasticsearch index
c) write out the data using the defined storage

If you are using the mortar framework, nltk isn’t installed by default. Here’s how you can install it:

# From your project's root directory - Switch to the mortar local virtualenv
source .mortar-local/pythonenv/bin/activate

#Install nltk (http://nltk.org/install.html)
sudo pip install -U pyyaml nltk

For the bi-grams, I re-used some sample Mortar code from Doug Daniels.
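That sample isn’t reproduced here; a minimal sketch of a streaming Python UDF that returns the top 5 bigrams of a bill’s text using NLTK’s collocation finder might look like this (the function name and the semicolon-delimited output format are my own assumptions):

from pig_util import outputSchema
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

@outputSchema('top_bigrams:chararray')
def top_5_bigrams(text):
    # Return the top 5 bigrams in the text as a semicolon-delimited string
    if not text:
        return None
    # simple whitespace tokenization avoids extra NLTK data downloads
    tokens = text.lower().split()
    finder = BigramCollocationFinder.from_words(tokens)
    best = finder.nbest(bigram_measures.pmi, 5)
    return ';'.join(' '.join(pair) for pair in best)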

Results
The Pig job loaded 58,624 files, processed them, and created the Elasticsearch index in 53 seconds. The NLTK Python UDF finished in another 34 seconds, resulting in a total time of 87 seconds.


You can see the working Elasticsearch index in the following screenshot:

[Screenshot: Elasticsearch index]

One thing of note
The elasticsearch-hadoop connector doesn’t handle geo-coordinates quite yet, so you can’t create an index with latitude/longitude. That support should be coming soon.

Health Insurance Marketplace Costs

Data.Healthcare.Gov released QHP cost information for various health care plans for states in the Federally-Facilitated and State-Partnership Marketplaces. The data is available in a variety of formats and lays out costs for the various levels of health care plans (Gold, Silver, Bronze and Catastrophic) for different rating scenarios.

Premium Information
Premium amounts do not include tax credits that will lower premiums for the majority of those applying, specifically those with income up to 400 percent of the federal poverty level. The document shows premiums for the following example rating scenarios:

  • Adult Individual Age 27 = one adult age 27
  • Adult Individual Age 50 = one adult age 50
  • Family = two adults age 30, two children
  • Single Parent Family = one adult age 30, two children
  • Couple = two adults age 40, no children
  • Child = one child any age

Cost Comparisons
Looking at the information, I wanted to do some comparisons across the various plans and rating scenarios to see where the highest costs were, which states had the largest variance, and what the standard deviation looked like across states and plans.

While I could have run this in Excel or R, I decided to write a simple Pig job to determine the maximum, minimum and average costs by plan for each state. I also then calculated the variance and standard deviations.
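The Pig script itself isn’t shown here, but as a rough illustration of the same aggregation logic in Python (not the original Pig; the file and column names are hypothetical):

import csv
from collections import defaultdict
from statistics import mean, pvariance, pstdev

# Group premium costs by (state, plan level)
costs = defaultdict(list)
with open("qhp_costs.csv") as f:
    for row in csv.DictReader(f):
        costs[(row["State"], row["Plan Level"])].append(float(row["Premium"]))

for (state, plan), values in sorted(costs.items()):
    print(state, plan,
          "max=%.2f" % max(values),
          "min=%.2f" % min(values),
          "avg=%.2f" % mean(values),
          "var=%.2f" % pvariance(values),
          "stddev=%.2f" % pstdev(values))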

Initial Cost Analysis
There is a wide range in costs across the states, with Virginia consistently having the highest average plan costs. Looking at the catastrophic plan costs, Virginia plans are five times (5x) more expensive than those in Kansas or Alabama.

Catastrophic Plan Costs

For Gold plans, insurance in Virginia is again two to three times (2-3x) more expensive.
Gold Plan costs

Variance and Standard Deviation
It comes as no surprise that Virginia has the largest variance and standard deviation for the cost data by a large margin. Virginia’s variance on the Gold plans is 2742 times that of Alabama. New Hampshire, Alaska, Delaware and Utah all have small variances and are consistent across the rating scenarios.

Gold Plan Variance

Again, Virginia’s variance on the Bronze plans is way out of balance compared to other states.
Bronze Plan Variance

However, for Platinum plans, Virginia has the ninth smallest variation across all rating scenarios. New Jersey, Michigan and Wisconsin have the largest variations.
Platinum Plan Variance

Code and Data
The code and data are on GitHub. If you have questions, you can reach me at dsfauth at gmail dot com.

Part 2 – Building an Enhanced DocGraph Dataset using Mortar (Hadoop) and Neo4J

In the last post, I talked about creating the enhanced DocGraph dataset using Mortar and Neo4J. Our data model looks like the following:

Nodes
Organizations
Specialties
Providers
Locations
CountiesZip
Census

Relationships
* Organizations -[:PARENT_OF]- Providers -[:SPECIALTY]- Specialties
* Providers -[:LOCATED_IN]- Locations
* Providers -[:REFERRED]- Providers
* Counties -[:INCOME_IN]- CountiesZip
* Locations -[:LOCATED_IN]- Locations

Each of the nodes has several properties associated with it. For example, an Organization has a name, and a Location has a city, state and postal code.

Data
The data we are going to use is the initial DocGraph set, the Health Care Provider Taxonomy Code (NUCC) set located here, the National Plan and Provider Enumeration System (NPPES) Downloadable File here, a zip-code-to-state file, and income per zip code downloaded from the US Census. These files were loaded to an Amazon S3 bucket for processing.

Mortar Project
To build the Neo4J graph database, we will need to generate several files to be loaded into Neo4J. To generate those files, we are going to create a Mortar Project and use the Pig script that we created in the last post.

Create Mortar Project
In order to fully leverage the Mortar Project framework, I created a Mortar project, which makes it available in GitHub. This creates a new project skeleton and registers it with Mortar. The project has folders for commonly used items, such as pigscripts, macros, and UDFs.

cd mortar-examples
mortar projects:create docGraphNeo4J

Pig Code
Any Pig code that you want to run with Mortar should be put in the pigscripts directory in your project. I replaced the example pigscript in that directory called my-sample-project.pig with my docGraphNeo4J.pig script.

Illustrate
Illustrate is the best tool to check what you’ve written so far. Illustrate will check your Pig syntax, and then show a small subset of data flowing through each alias in your pigscript.

To get the fastest results, use the local:illustrate command.

mortar local:illustrate pigscripts/my-sample-project.pig

Once the illustrate result is ready, a web browser tab will open to show the results:

[Screenshot: Illustrate results]

Mortar Watchtower
Mortar Watchtower is the fastest way to develop with Pig. Rather than waiting for a local or remote Pig run, you can validate that your scripts work simply by saving. Watchtower sits in the background analyzing your script, showing you your data flowing through the script instantly.

After installing Mortar Watchtower, I was able to do near realtime analysis of the data simply by typing in:

mortar watch ./pigscripts/docGraphNeo4J2.pig

Once I type that into my console window, I see:
[Screenshot: Watchtower console output]

A browser window then pops up:
[Screenshot: Watchtower viewer in the browser]

As you can see, the Watchtower viewer redisplays your script with example data embedded inline with each alias. You can click on the header of an inline table to toggle between different numbers of example rows. You can also click on any table cell to see the complete data, including anything that was truncated.

Full Run on Mortar
Once the code was ready, it was time to run it on a full Hadoop cluster. To specify the cluster size for your run, use the --clustersize option:

$ mortar jobs:run pigscripts/docGraphNeo4J.pig --clustersize 4

When I ran this job on the full Hadoop cluster, it completed in about 16 minutes. It wrote the following records to my Amazon S3 buckets:

Input(s):
Successfully read 3998551 records from: "s3n://NPIData/npidata_20050523-20130512.csv"
Successfully read 830 records from: "s3n://NUCC-Taxonomy/nucc_taxonomy_130.txt"
Successfully read 49685587 records from: "s3n://medgraph/refer.2011.csv"

Output(s):
Successfully stored 3998551 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/providerList"
Successfully stored 3998551 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/locations"
Successfully stored 77896 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/parentOfLink"
Successfully stored 33212 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/addrList"
Successfully stored 830 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/specialties"
Successfully stored 4746915 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/specialtiesProviders"
Successfully stored 33212 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/uniquelocations"
Successfully stored 694221 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/organizations"
Successfully stored 49685587 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/docGraphProviders"
Successfully stored 1826823 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/uniqueDoctorList"

Summary
In summary, I was able to take the three raw data files, write a Pig script to process the data, run the Pig job on a Hadoop cluster, and create the multiple files that I will need to populate the Neo4J instance.

Why did I choose Mortar?
Mortar is fast, open and free. As Mortar says, using a Mortar project provides you with the following advantages:

* Pig and Hadoop on Your Computer: When you create a Mortar Project, you get a local installation of Pig and Hadoop ready to use, without needing to install anything yourself. That means faster development, and better testing.
* Version Control and Code Sharing: Mortar Projects are backed by source control, either through Mortar or your own system, so you can collaborate with team members on a project.
* 1-Button Deployment: When you’re ready to run your project on a Hadoop cluster, a single command is all that’s needed to deploy and run in the cloud.

Using Mortar’s Watchtower, I was able to get an instant sampling of my data, complete file watching, instant schema validation and instant error catching.

For me, Mortar was easy, fast and a great tool to get the data ready for loading into Neo4J.

Next Steps
In the next post, I’ll write about how to move the data from the data files and load them into Neo4J.

Building an Enhanced DocGraph Dataset using Mortar (Hadoop) and Neo4J

“The average doctor has likely never heard of Fred Trotter, but he has some provocative ideas about using physician data to change how healthcare gets delivered.” This was from a recent Gigaom article. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

You can read some excellent work already being done on this data here courtesy of Janos. Ryan Weald has some great work on visualizing geographic connections between doctors here as well.

The current DocGraph social graph was built in Neo4J. With the new enhancements in Neo4J 2.0 (primarily labels), it was a good time to rebuild the social graph and add in data about each doctor, their specialties and their locations. Finally, I’ve added in some census income data at the zip code level. Researchers could look at economic indicators to see if there are discernible economic patterns in the referrals.

In this series of blog posts, I will walk through the process of building the updated DocGraph in Neo4J using Hadoop, followed by the Neo4J batch inserter.

Building the import documents.

One of the goals of the project was to learn Pig in combination with Hadoop to process the large files. I could easily have worked in MySQL or Oracle, but I also wanted an easy way to run jobs on large data sets.

My friends at Mortar have a great platform for leveraging Hadoop, Pig and Python. Mortar is the fastest and easiest way to work with Pig and Python on Hadoop. Mortar’s platform is for everything from joining and cleansing large data sets to machine learning and building recommender systems.
Mortar makes it easy for developers and data scientists to do powerful work with Hadoop. The main advantages of Mortar are:

  • Zero Setup Time: Mortar takes only minutes to set up (or no time at all on the web), and you can start running Pig jobs immediately. No need for painful installation or configuration.
  • Powerful Tooling: Mortar provides a rich suite of tools to aid in Pig development, including the ability to Illustrate a script before running it, and an extremely fast and free local development mode.
  • Elastic Clusters: We spin up Hadoop clusters as you need them, so you don’t have to predict your needs in advance, and you don’t pay for machines you don’t use.
  • Solid Support: Whether the issue is in your script or in Hadoop, we’ll help you figure out a solution.

Data Sets

One great thing about this data is that you can combine the DocGraph data with other data sets. For example, we can combine NPPES data with the DocGraph data. The NPPES is the federal registry for NPI numbers and associated provider information.

To create the data sets for ingest into Neo4J, we are going to combine Census data, DocGraph data, the NPPES database, and the National Uniform Claim Committee (NUCC) provider taxonomy codes.

Pig Scripts
Using Pig scripts, I was able to create several data files that could then be loaded into Neo4J.

Running the Pig Code in Mortar
In the next post, we will look at using Mortar’s framework to run the Pig jobs.

Recommender Tips, Mortar and DocGraph

Jonathan Packer wrote on Mortar’s blog about flexible recommender models. Jonathan articulates that “from a business perspective the two most salient advantages of graph-based models: flexibility and simplicity.”

Some of the salient points made in the article are:

  • Graph-based models are modular and transparent.
  • A simple graph-based model will allow you to build a viable recommender system for your product without delaying its time-to-market.
  • Graphs can be visualized, explained, discussed, and debugged collaboratively in a way that sophisticated machine learning techniques cannot.

Jonathan ends with “My opinion is that the next big advances to be made in recommender systems will be made by combining automated tools with human—possibly crowdsourced—editorial judgement and writing talent. They will be made in finding more engaging ways to present recommendations to users than cloying sidebars and endlessly scrolling lists.”

DocGraph
“The average doctor has likely never heard of Fred Trotter, but he has some provocative ideas about using physician data to change how healthcare gets delivered.” This was from a recent Gigaom article. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

The current DocGraph social graph was built in Neo4J. With the new enhancements in Neo4J 2.0 (primarily labels), it was a good time to rebuild the social graph and add in data about each doctor, their specialties and their locations. Finally, I’ve added in some census income data at the zip code level. Researchers could look at economic indicators to see if there are discernible economic patterns in the referrals.

Recommendation Engine
The combination of the Neo4J social graph, the medical data and the capability to build a recommendation engine in Mortar makes a compelling use case. I believe that this use case will address Jonathan’s premise that the new engaging recommendation engines can be built to help give patients a sense of which doctors are most respected by their peers. Additionally, the graph data could help hospitals understand the referral patterns associated with poor care coordination, and provide health IT startups with a map of the most plugged-in doctors in each city.

Next steps
Over the next couple of weeks, I’ll be writing on how I used Mortar, Pig and Neo4J to build the updated DocGraph data set.

Chicago Sacred Heart Hospital – Medicare Kickback Scheme

According to an April 16, 2013 FBI press release, Chicago Sacred Heart Hospital Owner, Executive, and Four Doctors Arrested in Alleged Medicare Referral Kickback Conspiracy.

From the press release:

CHICAGO—The owner and another senior executive of Sacred Heart Hospital and four physicians affiliated with the west side facility were arrested today for allegedly conspiring to pay and receive illegal kickbacks, including more than $225,000 in cash, along with other forms of payment, in exchange for the referral of patients insured by Medicare and Medicaid to the hospital, announced U.S. Attorney for the Northern District of Illinois Gary S. Shapiro.

Arrested were Edward J. Novak, 58, of Park Ridge, Sacred Heart’s owner and chief executive officer since the late 1990s; Roy M. Payawal, 64, of Burr Ridge, executive vice president and chief financial officer since the early 2000s; and Drs. Venkateswara R. “V.R.” Kuchipudi, 66, of Oak Brook, Percy Conrad May, Jr., 75, of Chicago, Subir Maitra, 73, of Chicago, and Shanin Moshiri, 57, of Chicago.

DocGraph Data
I wanted to see what the graph of these doctors looked like in the DocGraph dataset. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

Hadoop Data Processing
Using Mortar for online Hadoop processing and Amazon S3 for storage and access to the data, I wrote up a Pig script that filters the DocGraph data to the rows where any of the accused were the referring doctors, joins them to the National Provider registry, and writes the data out to an S3 bucket.

medGraphData = LOAD 's3n://medgraph/refer.2011.csv' USING PigStorage(',') AS
(primaryProvider:chararray,
referredDoctor: chararray,
qtyReferred:chararray);

nucc_codes = LOAD 's3n://NUCC-Taxonomy/nucc_taxonomy_130.txt' USING PigStorage('\t') AS
(nuccCode:chararray,
nuccType:chararray,
nuccClassification:chararray,
nuccSpecialty:chararray);

-- Load NPI Data
npiData = LOAD 's3n://NPIData/npidata_20050523-20130113.csv' USING PigStorage(',') AS
(NPICode:chararray,
f2:chararray,
f3:chararray,
f4:chararray,
f5:chararray,
f6:chararray,
f7:chararray,
f8:chararray,
f9:chararray,
f10:chararray,
f11:chararray,
f12:chararray,
f13:chararray,
f14:chararray,
f15:chararray,
f16:chararray,
f17:chararray,
f18:chararray,
f19:chararray,
f20:chararray,
f21:chararray,
f22:chararray,
f23:chararray,
f24:chararray,
f25:chararray,
f26:chararray,
f27:chararray,
f28:chararray,
f29:chararray,
f30:chararray,
f31:chararray,
f32:chararray,
f33:chararray,
f34:chararray,
f35:chararray,
f36:chararray,
f37:chararray,
f38:chararray,
f39:chararray,
f40:chararray,
f41:chararray,
f42:chararray,
f43:chararray,
f44:chararray,
f45:chararray,
f46:chararray,
f47:chararray,
f48:chararray,
f49:chararray);

chicagoSacredHeartHosp = FILTER medGraphData BY (referredDoctor == '1003163122' OR referredDoctor == '1760730063');

chicagoSacredHeartHospPrimary = FILTER medGraphData BY (primaryProvider == '1003163122' OR primaryProvider == '1760730063');

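-- Referrals where one of the accused physicians was the referring (primary) provider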
docFraud = FILTER medGraphData BY (primaryProvider == '1598896086' OR primaryProvider == '1003450178' OR primaryProvider == '1255463576' OR primaryProvider == '1588694343' OR primaryProvider == '1588694343' OR primaryProvider == '1265492128');

--chicagoDocs = FILTER npiData BY ((f23 == '"CHICAGO"' OR f31 == '"CHICAGO"' ) AND f29 matches '.*3240.*');
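-- Strip the quote characters from the NPI registry fields and give them readable names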
out = FOREACH npiData GENERATE REPLACE(NPICode,'\\"','') as newNPICode, 
REPLACE(f5, '\\"','') as orgName,
REPLACE(f6, '\\"','') as orgLastName,
REPLACE(f7, '\\"', '') as firstName, 
REPLACE(f21, '\\"','') as docAddra1,
REPLACE(f22, '\\"','') as docAddra2,
REPLACE(f23, '\\"','') as docCity1,
REPLACE(f29, '\\"','') as docAddr1,
REPLACE(f30, '\\"','') as docAddr2,
REPLACE(f31, '\\"','') as docCity,
REPLACE(f32, '\\"','') as docState,
REPLACE(f33, '\\"','') as docPostalCode,
REPLACE(f48, '\\"','') as taxonomyCode;

docFraudSacredHeart = JOIN docFraud BY (referredDoctor), out BY newNPICode;

rmf s3n://DataOut/DocGraph/ChicagoDocs;
rmf s3n://DataOut/DocGraph/ChicagoMedicareFraud;
rmf s3n://DataOut/DocGraph/docFraud;
rmf s3n://DataOut/DocGraph/docFraudSacredHeart;

--STORE sampleJoinedVADoc INTO 's3n://DataOut/DocGraph/DocHosp' USING PigStorage('|');
--STORE out INTO 's3n://DataOut/DocGraph/ChicagoDocs' USING PigStorage('|');
STORE chicagoSacredHeartHospPrimary INTO 's3n://DataOut/DocGraph/ChicagoMedicareFraud' USING PigStorage('|');
STORE docFraud INTO 's3n://DataOut/DocGraph/docFraud' USING PigStorage('|');
STORE docFraudSacredHeart INTO 's3n://DataOut/DocGraph/docFraudSacredHeart' USING PigStorage('|');

Data Results
Looking at the data results, three of the doctors made referrals to Sacred Heart.

Doctor          NPI           Hospital NPI    Nbr Referrals
Dr. Maitra      1598896086    1558367656      2495
Dr. Kuchipudi   1265492128    1558367656      1171
Dr. May         1588694343    1558367656       417

Visualization
Using Gephi, I was able to visualize the referrals for these three doctors.

[Gephi visualization of the referrals]

While this doesn’t provide a detailed look into the fraud, it does show there were referrals made to Sacred Heart.

DocGraph Analysis using Hadoop and D3.JS

Visualizing the DocGraph for Wyoming Medicare Providers

I have been participating in the DocGraph MedStartr project. After hearing about the project at GraphConnect 2012, I wanted to use this data to investigate additional capabilities of Hadoop and BigData processing. You can read some excellent work already being done on this data here courtesy of Janos. Ryan Weald has some great work on visualizing geographic connections between doctors here as well.

You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

One great thing about this data is that you can combine the DocGraph data with other data sets. For example, we can combine NPPES data with the DocGraph data. The NPPES is the federal registry for NPI numbers and associated provider information. Additionally, you can bring in other data such as Census data and Virginia state information.

In this example, I want to use MortarData (Hadoop in the cloud) to combine Census data, DocGraph data, the NPPES database, and the National Uniform Claim Committee (NUCC) provider taxonomy codes. The desired outcome is to compare the referrals between taxonomy codes for the entire state of Virginia and for the areas of Virginia with a population of less than 25,000.

Mortar Data
Mortar is Hadoop in the cloud—an on-demand, wickedly scalable platform
for big data. Start your work in the browser—zero install. Or if you need more control, use Mortar from your own machine, in your own development environment.

Mortar is listed in GigaOM’s “12 big data tools you need to know” and one of the “10 Coolest Big Data Products Of 2012.”

Approach
Using Hadoop and Pig, I am going to use the following approach:

1. Load up the four data sets.
2. Filter the NPI data from NPPES by the provider’s state.
3. Filter the State Data by the desired population.
4. Join both the primary and the referring doctors to the NPI/NPPES/Census data.
5. Carve out the Primary doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
6. Carve out the Referring doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
7. Carve out the primary and referring doctors, count the number of primary referrals and then link the taxonomy codes to both the primary and referring doctors.
8. Export the data out for future visualization.

Why Mortar Data and Hadoop
Using Hadoop, Pig and Mortar’s platform, I have several advantages:
1. I can store all of the data files as flat files in an Amazon S3 store. I don’t need a dedicated server.
2. I can spin up as many Hadoop clusters as I need in a short time.
3. I can write Pig code to do data processing, joins, filters, etc. that work on the data.
4. I can add in Python libraries and code to supplement the Pig.
5. I can add parameters and change the state and population on the fly.

You can see the Mortar Web interface here:
[Screenshot: Mortar web interface]

Visualization
I plan on using the D3.js library to create some visualizations. One example visualization I am working on is a Hierarchical Edge Bundling chart. You can see the initial prototype here. I still need to fill in all of the links.

Campaign Data Analysis Video

As a wrap-up on the Campaign Analysis that I presented at GraphConnect, I decided to make a video showing the usage of Mortar Data, Neo4J and D3 JS.

Mortar Data
Mortar is Hadoop in the cloud—an on-demand, wickedly scalable platform
for big data. Start your work in the browser—zero install. Or if you need more control, use Mortar from your own machine, in your own development environment.

Mortar is listed in GigaOM’s “12 big data tools you need to know” and one of the “10 Coolest Big Data Products Of 2012.”

Neo4J
Neo4j is an open-source, high-performance, NOSQL graph database, optimized for superfast graph traversals. With Neo4J, you can easily model and change the full complexity of your system.

Neo4J was listed as a “big data vendor to watch in 2013” by Infoworld.

D3.JS
D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.