Intelliwareness - Blog on Big Data, Data Analytics and Other IT

Hadoop to Neo4J

Leading up to GraphConnect NY, I was distracting myself from working on my talk by seeing whether there was any way to import data directly from Hadoop into a graph database, specifically Neo4j. Previously, I had written some Pig jobs to output the data into various files and then used the Neo4J batchinserter to load the data. This process works great and others have written about it. For example, this approach also uses the batchinserter while this approach uses some Java UDFs to write the Neo4J files directly.

Both of these approaches work great but I was wondering if I could use a Python UDF and create the Neo4J database directly. To test this out, I decided to resurrect some work I had done on the congressional bill data from Govtrack. You can read about the data and the Java code I used to convert the files into single-line JSON files here. It’s also a good time to read up on how to create an Elasticsearch index using Hadoop. Now that you’re back from reading that link, let’s look at the approach to try and go from Hadoop directly into Neo4J. From the previous article, you remember that Mortar recently worked to get CPython support for Pig committed into the Apache Pig trunk. This now lets you take advantage of Hadoop with real Python. You get to focus on just the logic you need, and streaming Python takes care of all the plumbing.

Nigel Small had written Py2Neo which is a simple and pragmatic Python library that provides access to the popular graph database Neo4j via its RESTful web service interface. That sounded awesome and something worth trying out. Py2Neo is easy to install using pip or easy_install. Installation instructions are located here.

The model that I was trying to create looks something like this:

The approach taken was to use Pig with a streaming Python UDF to write to the Neo4J database using its RESTful web service. I tested this out with Neo4J 2.0M6. I attempted to use Neo4J 2.0RC1 but ran into several errors relating to missing nodes. The example code is below:

-- 'Document' is the delimiter
-- 'event, gathering' is the tag list
%default OUTPUT_PATH '/Users/davidfauth/MortarBillsData'
%default S3_OUTPUT_PATH 's3n://df-bills-project'
%default S3_INPUT_PATH 's3n://df-bills-data'
%default INPUT_PATH '/Users/davidfauth/MortarNeoTestData'
%default BULK_INPUT_PATH '/Users/davidfauth/MortarTestDataBulk'

REGISTER '/Users/davidfauth/mortarProjects/billsProject/udfs/python/billsProject.py' USING streaming_python AS nltk_udfs;
REGISTER '/Users/davidfauth/mortarProjects/billsProject/udfs/python/utilities.py' USING streaming_python AS utility_udfs;
REGISTER '/Users/davidfauth/mortarProjects/billsProject/udfs/python/neo4JUtility.py' USING streaming_python AS neo4j_udfs;

rmf $OUTPUT_PATH;
--rmf $S3_OUTPUT_PATH;

bills = LOAD '$BULK_INPUT_PATH' USING org.apache.pig.piggybank.storage.JsonLoader(
    'bill_id:chararray, congress:chararray, official_title:chararray, updated_at:chararray, subjects_top_term:chararray,summary:map[], sponsor:map[], subjects:map[],cosponsors:map[], bill_type:chararray, number:chararray,introduced_at:chararray,status:chararray,status_at:chararray');
data = LOAD '$BULK_INPUT_PATH' USING org.apache.pig.piggybank.storage.JsonLoader();
billNodes = LOAD '$OUTPUT_PATH/logs/keyNodeList' USING PigStorage('\t') AS (keyValue:chararray, nodeID:int, nodeType:chararray);

-- get unique list of bills, subjects, sponsors and cosponsors to create nodes
billList = FOREACH bills GENERATE bill_id as keyValue, 'bill' as nodeType;
congressList = FOREACH bills GENERATE congress as keyValue, 'congress' as nodeType;
congressBillList = FOREACH bills GENERATE congress as congressID, bill_id;
sponsors = FOREACH bills GENERATE bill_id,
    sponsor#'name' AS sponsorName:chararray,
    sponsor#'state' AS sponsorState:chararray,
    sponsor#'district' AS sponsorDistrict:chararray,
    CONCAT(CONCAT(sponsor#'name',' '),sponsor#'state') as keyValue:chararray;
--sponsorNameKey = FOREACH sponsors GENERATE CONCAT(CONCAT(sponsorName,' '),sponsorState) as keyValue:chararray, 'MemberOfCongress' as nodeType;
listSponsors = FOREACH sponsors GENERATE keyValue;
sponsorNameKey = FOREACH sponsors GENERATE keyValue, 'sponsor' as nodeType;

cs = FOREACH data GENERATE flatten(object#'bill_id') as billid, flatten(object#'cosponsors') AS cosponsors:map[];
names = FOREACH cs GENERATE billid as bill_id,
    flatten(cosponsors#'name') as coSponsorName:chararray,
    flatten(cosponsors#'state') as coSponsorState:chararray,
    flatten(cosponsors#'district') as coSponsorDistrict:chararray,
    CONCAT(CONCAT(cosponsors#'name',' '),cosponsors#'state') as keyValue:chararray;
cosponsorNameKey = FOREACH names GENERATE CONCAT(CONCAT(coSponsorName,' '),coSponsorState) as keyValue:chararray, 'MemberOfCongress' as nodeType;
listCoSponsors = FOREACH names GENERATE keyValue;

-- create list of distinct sponsors/cosponsors
unionSponsorCoSponsors = UNION listSponsors, listCoSponsors;
bUnion = GROUP unionSponsorCoSponsors BY 1;
cUsCS = FOREACH bUnion GENERATE flatten(unionSponsorCoSponsors);
listdistinctSponsorsCosponsors = DISTINCT cUsCS;
uniqueCongressList = DISTINCT congressList;
uniquebillList = DISTINCT billList;
uniqueSponsors = DISTINCT sponsorNameKey;
uniqueCoSponsors = DISTINCT cosponsorNameKey;

-- create the subject List
-- for some reason, it needs to be written out to file and brought back in
subjectList = FOREACH data GENERATE object#'bill_id' as bill_id:chararray, flatten(object#'subjects') AS keyValue:chararray;
STORE subjectList INTO '/Users/davidfauth/MortarBillsData/subjects' USING PigStorage('\t');
subjectData = LOAD '/Users/davidfauth/MortarBillsData/subjects' USING PigStorage('\t') as (bill_id:chararray, keyValue:chararray);
tmpSubjectList = FOREACH subjectData GENERATE keyValue;
uniqueSubjectList = DISTINCT tmpSubjectList;
keySubjectList = FOREACH uniqueSubjectList GENERATE keyValue, 'subject' as nodeType;
ordereduniqueSubjList = ORDER keySubjectList by keyValue ASC;
ordereduniqueBillListValues = ORDER uniquebillList BY keyValue;
orderedUniqueSponsors = ORDER uniqueSponsors BY keyValue;
orderedUniqueCoSponsors = ORDER uniqueCoSponsors BY keyValue;
orderedUniqueSCoS = ORDER listdistinctSponsorsCosponsors By keyValue;

-- create the key values (list of nodes) that Neo4J will use
unionKeys = UNION uniqueCongressList, ordereduniqueSubjList, ordereduniqueBillListValues, orderedUniqueSponsors, orderedUniqueCoSponsors;
--unionKeys = UNION uniqueCongressList, ordereduniqueSubjList, ordereduniqueBillListValues, orderedUniqueSCoS;
b = GROUP unionKeys BY 1;
c = FOREACH b GENERATE flatten(unionKeys);

-- run the counter UDF inside the single reducer
--numBillKeyValue = FOREACH ordereduniqueBillListValues GENERATE keyValue, utility_udfs.auto_increment_id() AS my_id:int;
--numSponsorsKeyValue = FOREACH orderedUniqueSponsors GENERATE keyValue, utility_udfs.auto_increment_id() AS my_id:int;
--numCoSponsorsKeyValue = FOREACH orderedUniqueCoSponsors GENERATE keyValue, utility_udfs.auto_increment_id() AS my_id:int;
--numSubjectsKeyValue = FOREACH ordereduniqueSubjList GENERATE keyValue, utility_udfs.auto_increment_id() AS my_id:int;

-- run the Counter UDF to create a Node ID
keyNodeList = FOREACH c GENERATE keyValue, utility_udfs.auto_increment_id() AS my_id:int, nodeType;

--Create Nodes (can I group and create a tuple/values)
nodeValue = FOREACH keyNodeList GENERATE neo4j_udfs.createNode(keyValue, nodeType) as nodeCreated;
--nodeSponsorValue = FOREACH numSponsorsKeyValue GENERATE neo4j_udfs.createNode(keyValue, 'sponsor') as nodeCreated;
--nodeCoSponsorValue = FOREACH numCoSponsorsKeyValue GENERATE neo4j_udfs.createNode(keyValue, 'cosponsor') as nodeCreated;
--nodeSubjectValue = FOREACH numSubjectsKeyValue GENERATE neo4j_udfs.createNode(keyValue, 'subject') as nodeCreated;

-- Update Bill nodes with additional details
updatedBillNodes = JOIN keyNodeList by keyValue, bills by bill_id;

-- Create bills to subjects relationships
billRelBillID = JOIN keyNodeList BY keyValue, subjectData by bill_id;
billRel = JOIN billRelBillID by subjectData::keyValue, keyNodeList BY keyValue;
--relValue = JOIN billRel by keyValue, subjectData by bill_id;

--Create bills to sponsors relationships
billRelSponsorID = JOIN keyNodeList BY keyValue, sponsors by bill_id;
billSponsorRel = JOIN billRelSponsorID by sponsors::keyValue, keyNodeList BY keyValue;

--Create bills to cosponsors relationships
billCoRelSponsorID = JOIN keyNodeList BY keyValue, names by bill_id;
billCoSponsorRel = JOIN billCoRelSponsorID by names::keyValue, keyNodeList BY keyValue;

--Create Congress to Sponsor relationships
congressRelSponsorID = JOIN keyNodeList BY keyValue, congressBillList by congressID;
congressSponsorRel = JOIN congressRelSponsorID by congressBillList::bill_id, sponsors by bill_id;
congressSponsorNodes = JOIN congressSponsorRel by sponsors::keyValue, keyNodeList BY keyValue;

--Create Congress to CoSponsor relationships
congressRelCoSponsorID = JOIN keyNodeList BY keyValue, congressBillList by congressID;
congressCoSponsorRel = JOIN congressRelCoSponsorID by congressBillList::bill_id, names by bill_id;
congressCoSponsorNodes = JOIN congressCoSponsorRel by names::keyValue, keyNodeList BY keyValue;

--Create Relationships
relBillValue = FOREACH billRel GENERATE neo4j_udfs.createRelationship(billRelBillID::keyNodeList::my_id,keyNodeList::my_id,'subject_of');
relBillSponsor = FOREACH billSponsorRel GENERATE neo4j_udfs.createRelationship(keyNodeList::my_id,billRelSponsorID::keyNodeList::my_id,'sponsor_of');
relBillCoSponsor = FOREACH billCoSponsorRel GENERATE neo4j_udfs.createRelationship(keyNodeList::my_id,billCoRelSponsorID::keyNodeList::my_id,'cosponsor_of');
relCongressSponsor = FOREACH congressSponsorNodes GENERATE neo4j_udfs.createRelationship(keyNodeList::my_id,congressSponsorRel::congressRelSponsorID::keyNodeList::my_id,'member_of');
relCongressCoSponsor = FOREACH congressCoSponsorNodes GENERATE neo4j_udfs.createRelationship(keyNodeList::my_id,congressCoSponsorRel::congressRelCoSponsorID::keyNodeList::my_id,'member_of');
nodeBillDetails = FOREACH updatedBillNodes GENERATE neo4j_udfs.updateBillNode(keyNodeList::my_id, bills::official_title,bills::updated_at,bills::bill_type, bills::number, bills::introduced_at, bills::status, bills::status_at);

-- Log nodeCreation
STORE nodeBillDetails INTO '$OUTPUT_PATH/logs/nodeBillDetails' USING PigStorage('\t');
STORE nodeValue INTO '$OUTPUT_PATH/logs/bills' USING PigStorage('\t');
STORE billRel INTO '$OUTPUT_PATH/logs/billsRel' USING PigStorage('\t');
STORE keyNodeList INTO '$OUTPUT_PATH/logs/keyNodeList' USING PigStorage('\t');
STORE billRelBillID INTO '$OUTPUT_PATH/logs/billRelID' USING PigStorage('\t');
STORE relBillValue INTO '$OUTPUT_PATH/logs/billRelValues' USING PigStorage('\t');
STORE relBillSponsor INTO '$OUTPUT_PATH/logs/billRelSponsorValues' USING PigStorage('\t');
STORE relBillCoSponsor INTO '$OUTPUT_PATH/logs/billRelCoSponsorValues' USING PigStorage('\t');
STORE relCongressSponsor INTO '$OUTPUT_PATH/logs/relCongressSponsor' USING PigStorage('\t');
STORE relCongressCoSponsor INTO '$OUTPUT_PATH/logs/relCongressCoSponsor' USING PigStorage('\t');
STORE congressSponsorNodes INTO '$OUTPUT_PATH/logs/congressSponsorlRel' USING PigStorage('\t');
STORE updatedBillNodes INTO '$OUTPUT_PATH/logs/nodeBillUpdateDetails' USING PigStorage('\t');
STORE c INTO '$OUTPUT_PATH/logs/unionValues' USING PigStorage('\t');

-- run the Counter UDF to create a Node ID
keyNodeList = FOREACH c GENERATE keyValue, utility_udfs.auto_increment_id() AS my_id:int, nodeType;

Since Neo4J uses an incrementing counter for each node, we have to create an id for each keyValue (node name) that we are creating. The keyValues are the congressional session, the name of the congresswoman or congressman, the billID or the subject. Below is some simple Python code that creates that ID.

from pig_util import outputSchema

COUNT = 0

@outputSchema('auto_increment_id:int')
def auto_increment_id():
    global COUNT
    COUNT += 1
    return COUNT
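
The counter only hands out unique ids if every key passes through one Python process, which is presumably why the Pig script groups everything with GROUP ... BY 1 before calling it. Here is a minimal sketch of that behaviour outside Pig. The `make_counter` helper is hypothetical and stands in for one streaming Python process with its own module-level COUNT; it assumes Python 3 for `nonlocal`.

```python
def make_counter():
    """Return a function that mimics auto_increment_id() for one UDF process."""
    count = 0

    def auto_increment_id():
        nonlocal count
        count += 1
        return count

    return auto_increment_id


# One process (a single reducer): ids are unique and sequential.
single = make_counter()
ids_single = [single() for _ in range(5)]

# Two independent processes (two reducers): each starts again at 1,
# so the same id is handed out more than once.
worker_a, worker_b = make_counter(), make_counter()
ids_parallel = [worker_a(), worker_a(), worker_b(), worker_b()]

print(ids_single)
print(ids_parallel)
```

If the job were ever spread over several reducers, the ids would collide, so the single-group step is doing real work here.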

Once we have the id, we can use Py2Neo to create the nodes and relationships.

from pig_util import outputSchema

from py2neo import neo4j
from py2neo import node, rel

@outputSchema('nodeCreated:int')
def createNode(nodeValue, sLabel):
    if nodeValue:
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        batch = neo4j.WriteBatch(graph_db)
        alice=batch.create(node(name=nodeValue,label=sLabel))
        results=batch.submit()
        return 1
    else:
        return 0

@outputSchema('nodeCreated:int')
def createRelationship(fromNode, toNode, sRelationship):
    if fromNode:
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        ref_node = graph_db.node(fromNode)
        to_node = graph_db.node(toNode)
        aliceRel=graph_db.create(rel(ref_node,sRelationship,to_node))
        return 1
    else:
        return 0   

#myudf.py
@outputSchema('nodeCreated:int')
def createBillNode(nodeValue, sLabel, sTitle, sUpdated, sBillType,sBillNumber,sIntroducedAt,sStatus,sStatusAt):
    if nodeValue:
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        foundNode,=graph_db.create(node(name=nodeValue))
        foundNode.add_labels(sLabel)
        foundNode["title"]=sTitle
        foundNode["updateDate"]=sUpdated
        foundNode["billType"]=sBillType
        foundNode["billNumber"]=sBillNumber
        foundNode["introducedAt"]=sIntroducedAt
        foundNode["status"]=sStatus
        foundNode["statusDate"]=sStatusAt
        return 1
    else:
        return 0

#myudf.py
@outputSchema('nodeUpdated:int')
def updateBillNode(nodeID, sTitle, sUpdated, sBillType,sBillNumber,sIntroducedAt,sStatus,sStatusAt):
    if nodeID:
        graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
        foundNode= graph_db.node(nodeID)
        foundNode["title"]=sTitle
        foundNode["updateDate"]=sUpdated
        foundNode["billType"]=sBillType
        foundNode["billNumber"]=sBillNumber
        foundNode["introducedAt"]=sIntroducedAt
        foundNode["status"]=sStatus
        foundNode["statusDate"]=sStatusAt
        return 1
    else:
        return 0

Of note is the ability to create the node and add the label in the createNode function. To create the relationship, we pass in the two node ids and the relationship type. This is passed via the REST API interface and the relationship is created.
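
To make the REST call a little more concrete, here is a rough sketch of what a create-relationship request looks like. It assumes the Neo4j 2.0-era REST layout (POST to a node's `relationships` resource with a `to` URL and a `type`); the exact request Py2Neo builds may differ, and the `relationship_request` helper and the node ids 12 and 345 are made up for illustration. Nothing is sent over the network.

```python
import json

BASE_URL = "http://localhost:7474/db/data"  # same server root used in the UDFs above


def relationship_request(from_id, to_id, rel_type, base_url=BASE_URL):
    """Build the (url, body) pair for a create-relationship call (not sent)."""
    url = "%s/node/%d/relationships" % (base_url, from_id)
    body = json.dumps(
        {"to": "%s/node/%d" % (base_url, to_id), "type": rel_type},
        sort_keys=True,
    )
    return url, body


# Hypothetical ids: node 12 is a sponsor, node 345 is a bill.
url, body = relationship_request(12, 345, "sponsor_of")
print("POST", url)
print(body)
```

Every relationship in the Pig script ends up as one such request, which is worth keeping in mind when reading the performance numbers.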

Performance – Performance wasn’t what I thought it would be. Py2Neo interacts with Neo4j via its REST API, so every interaction requires a separate HTTP request to be sent. This approach, along with logging, made this much slower than I anticipated. Overall, it took about 40 minutes on my MacBook Pro with 16 GB of RAM and an SSD to create the Neo4J database.

Py2Neo Batches – Batches allow multiple requests to be grouped and sent together, cutting down on network traffic and latency. Such requests also have the advantage of being executed within a single transaction. The second run was done by adding some Py2Neo batches. This really didn’t make a huge difference as the log files were still being written.
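
A batch only saves round trips when it carries more than one item. As a sketch of the arithmetic (plain Python, no Py2Neo involved; `chunked`, `count_requests` and the key names are hypothetical), here is how the number of HTTP requests falls as the batch size grows. If each UDF call builds and submits its own batch, as createNode above appears to, the batch size is effectively 1 and the request count does not change, which may be part of the story in addition to the logging.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_requests(n_items, batch_size):
    """Number of HTTP round trips needed to send n_items in batches."""
    return sum(1 for _ in chunked(list(range(n_items)), batch_size))


keys = ["bill-%d" % i for i in range(10)]  # hypothetical node keys
batches = list(chunked(keys, 4))

print([len(batch) for batch in batches])
print(count_requests(10, 1), count_requests(10, 4))
```

Getting real batches out of a Pig UDF would mean handing it a bag of rows at a time instead of one row per call.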

Overall, it still took about 60 minutes on my MacBook Pro with 16 GB of RAM and an SSD to create the Neo4J database.

Next Steps
Hmmm….I should have known that the RESTful service performance wasn’t going to be anywhere near as fast as the batchinserter performance due to logging. You could see the log files grow and grow as the data was added. I’m going to go back to the drawing board and see if a Java UDF could work better. The worst case is I just go back to writing out files and writing a custom batchinserter each time.

Creating an Elasticsearch index of Congress Bills using Pig

Post author By dave fauth
Post date October 24, 2013

Recently Mortar worked to get CPython support for Pig committed into the Apache Pig trunk. This now lets you take advantage of Hadoop with real Python. You get to focus on just the logic you need, and streaming Python takes care of all the plumbing.

Shortly thereafter, Elasticsearch announced integration with Hadoop. “Using Elasticsearch in Hadoop has never been easier. Thanks to the deep API integration, interacting with Elasticsearch is similar to that of HDFS resources. And since Hadoop is more then just vanilla Map/Reduce, in elasticsearch-hadoop one will find support for Apache Hive, Apache Pig and Cascading in addition to plain Map/Reduce.”

Elasticsearch published the first milestone (1.3.0.M1) based on the new code-base that has been in the works for the last few months.

The initial attempt at testing out Mortar and Elasticsearch didn’t work. Working with the great team at Mortar and costinl at Elasticsearch, Mortar was able to update their platform so that it can write out to Elasticsearch at scale.

Test Case
To test this out, I decided to process congressional bill data from the past several congresses. The process will be to read in the JSON files, process them using Pig, use NLTK to find the top 5 bigrams and then write the data out to an Elasticsearch index.

The Data
GovTrack.us, a tool by Civic Impulse, LLC, is one of the world’s most visited government transparency websites. The site helps ordinary citizens find and track bills in the U.S. Congress and understand their representatives’ legislative record.

The bulk data is a deep directory structure of flat XML and JSON files. The directory layout is described below.

Our files are in three main directories:

/data/congress-legislators/
Information on Members of Congress from 1789-present, presidents and vice presidents, Congressional committees, and current committee assignments. This data is a mirror of the files in github:unitedstates/congress-legislators.
/data/congress/ (i.e. http://www.govtrack.us/data/congress/)
Bill status and other legislative data from 2013 (113th Congress) and forward. This data is the output of the scrapers developed by the github:unitedstates/congress project.

Getting the Data

To fetch the data we support rsync, a common Unix/Mac tool for efficiently fetching files and keeping them updated as they change. The root of our rsync tree is govtrack.us::govtrackdata, and this corresponds exactly to what you see at http://www.govtrack.us/data/.

To download bill data for the 113th Congress into a local directory named bills, run:

rsync -avz --delete --delete-excluded --exclude **/text-versions/ \
		govtrack.us::govtrackdata/congress/113/bills .

(Note the double colons in the middle and the period at the end. This is a long command. I’ve indicated the line continuation with a backslash.)

Directories

/data/congress/113/bills/[bill_type]/[bill_type][bill_number]/data.[json|xml]
Bill and resolution status for bills in the 113th Congress. See the github:unitedstates/congress project documentation for details of the JSON format. The XML format is backwards-compatible with our legacy bill XML files (documentation).

The following code loops through a directory of bills and converts all of the .json files into single line .json files.

package jsonFormatter;

import java.io.*;
import java.nio.file.FileVisitResult;
import java.nio.file.FileVisitor;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public final class importer {

  public static void main(String... aArgs) throws IOException {
    String ROOT = "/Users/davidfauth/bills/109";
    FileVisitor<Path> fileProcessor = new ProcessFile();
    Files.walkFileTree(Paths.get(ROOT), fileProcessor);
  }

  private static final class ProcessFile extends SimpleFileVisitor<Path> {

    @Override
    public FileVisitResult visitFile(Path aFile, BasicFileAttributes aAttrs) throws IOException {
      FileWriter fileWriter = null;
      String absolutePath = "";
      String filePath = "";
      String outFilePath = "";
      File f = new File(aFile.toString());
      System.out.println(f.getName());
      absolutePath = f.getAbsolutePath();
      filePath = absolutePath.substring(0, absolutePath.lastIndexOf(File.separator));
      // System.out.println("file path is: " + filePath);
      // System.out.println("new file path is: " + filePath.substring(17));
      outFilePath = "/Users/davidfauth/MortarData/" + filePath.substring(18).replace("/", "_");
      System.out.println("new file is " + outFilePath);
      if (f.getName().toString().equals("data.json")) {
        System.out.println("Processing file:" + aFile);
        BufferedReader br = new BufferedReader(new FileReader(aFile.toString()));
        try {
          // Join every line of the file with a space so the JSON ends up on one line
          StringBuilder sb = new StringBuilder();
          String line = br.readLine();
          while (line != null) {
            sb.append(line);
            sb.append(" ");
            line = br.readLine();
          }
          File newTextFile = new File(outFilePath + "_data1.json");
          fileWriter = new FileWriter(newTextFile);
          fileWriter.write(sb.toString());
          fileWriter.close();
        } finally {
          br.close();
        }
      }
      return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult preVisitDirectory(Path aDir, BasicFileAttributes aAttrs) throws IOException {
      // System.out.println("Processing directory:" + aDir);
      return FileVisitResult.CONTINUE;
    }
  }
}
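
For comparison, here is a rough Python sketch of the same single-line conversion. It is not what I ran: instead of joining the lines with spaces as the Java code does, it parses the file and re-serializes it compactly, which also catches malformed JSON. The `to_single_line` helper, the temporary file names and the sample bill fields are all made up for the demonstration.

```python
import json
import os
import tempfile


def to_single_line(src_path, dst_path):
    """Rewrite a pretty-printed JSON file as one line of compact JSON."""
    with open(src_path, "r", encoding="utf-8") as src:
        document = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(json.dumps(document, separators=(",", ":")))
        dst.write("\n")


# Demonstration with a throwaway file; the bill fields are sample data only.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "data.json")
dst = os.path.join(workdir, "data1.json")
with open(src, "w", encoding="utf-8") as handle:
    handle.write('{\n  "bill_id": "hr1-113",\n  "congress": "113"\n}\n')

to_single_line(src, dst)
with open(dst, encoding="utf-8") as handle:
    result = handle.read()
print(result)
```

Either way, the point is that Pig's JsonLoader gets one complete JSON document per line.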

The following Pig code reads all of the single-line .json files, pulls out some of the fields, calls a Python UDF to find the top 5 bigrams and then writes the data into an Elasticsearch index.

The important steps are to:
a) register the jar file
b) define the storage to the elasticsearch index
c) write out the data using the defined storage

-- 'Document' is the delimiter
-- 'event, gathering' is the tag list
%default OUTPUT_PATH '/Users/davidfauth/MortarBillsData'

REGISTER '/Users/davidfauth/mortarProjects/billsProject/udfs/python/billsProject.py' USING streaming_python AS nltk_udfs;
REGISTER '/Users/davidfauth/Downloads/elasticsearch-hadoop-1.3.0.M1.jar';
define ESStorage org.elasticsearch.hadoop.pig.ESStorage('es.resource=govtrack/bills');

bills = LOAD '/Users/davidfauth/MortarData/' USING org.apache.pig.piggybank.storage.JsonLoader(
    'bill_id:chararray, congress:chararray, official_title:chararray, updated_at:chararray, subjects_top_term:chararray,summary:map[], sponsor:map[], subjects');

billDetails = FOREACH bills GENERATE bill_id, congress, official_title, updated_at, subjects_top_term,
    sponsor#'name' as sponsorName:chararray,
    sponsor#'state' as sponsorState:chararray,
    subjects AS subjectList: {t: (subjects: chararray)},
    summary#'text' AS billText:chararray;

billSearch = FOREACH bills GENERATE bill_id, congress, official_title, updated_at, subjects_top_term,
    sponsor#'name' as sponsorName:chararray,
    sponsor#'state' as sponsorState:chararray,
    summary#'text' AS billText:chararray;

-- Group the bills by top subject term and use a CPython UDF to find the top 5 bigrams
-- for each of these subjects.
bigrams_by_place = FOREACH (GROUP billDetails BY subjects_top_term) GENERATE
    group AS subjects_top_term:chararray,
    nltk_udfs.top_5_bigrams(billDetails.official_title),
    COUNT(billDetails) AS sample_size;

top_100_places = LIMIT (ORDER bigrams_by_place BY sample_size DESC) 100;

STORE billSearch INTO 'govtrack/bills' USING org.elasticsearch.hadoop.pig.ESStorage();

rmf $OUTPUT_PATH;
STORE top_100_places INTO '/Users/davidfauth/MortarBillsData' USING PigStorage('\t');

If you are using the Mortar framework, NLTK isn’t installed by default. Here’s how you can install it:

# From your project's root directory - Switch to the mortar local virtualenv
source .mortar-local/pythonenv/bin/activate

#Install nltk (http://nltk.org/install.html)
sudo pip install -U pyyaml nltk

For the bi-grams, I re-used some sample Mortar code from Doug Daniels shown below:

from pig_util import outputSchema
import nltk

@outputSchema("top_five:bag{t:(bigram:chararray)}")
def top_5_bigrams(tweets):
    tokenized_tweets = [ nltk.tokenize.WhitespaceTokenizer().tokenize(t[0]) for t in tweets ]
    bgm = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_documents(tokenized_tweets)
    top_5 = finder.nbest(bgm.likelihood_ratio, 5)
    return [ ("%s %s" % (s[0], s[1]),) for s in top_5 ]
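
If you just want a feel for what the UDF produces without installing NLTK, here is a small stand-in using only the standard library. Note the swap: it ranks bigrams by raw frequency, not by the likelihood ratio the NLTK code uses, so the rankings will not match on real data. The `top_bigrams` function and the sample titles are made up for illustration.

```python
from collections import Counter


def top_bigrams(titles, n=5):
    """Return the n most frequent adjacent word pairs across all titles."""
    counts = Counter()
    for title in titles:
        tokens = title.split()                  # simple whitespace tokenization
        counts.update(zip(tokens, tokens[1:]))  # pairs never span two titles
    return ["%s %s" % pair for pair, _ in counts.most_common(n)]


titles = [  # made-up titles for illustration
    "To amend the Internal Revenue Code",
    "To amend the Clean Air Act",
    "To amend the Internal Revenue Code of 1986",
]
print(top_bigrams(titles, 3))
```

Frequency alone tends to surface boilerplate such as "To amend", which is one reason an association measure like the likelihood ratio is the more interesting choice for bill titles.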

Results
The Pig job loaded 58,624 files, processed them and created the Elasticsearch index in 53 seconds. The NLTK Python UDF finished in another 34 seconds, resulting in a total time of 87 seconds.

You can see the working Elasticsearch index in the following screenshot:

One thing of note
The elasticsearch-hadoop connector doesn’t handle geo-coordinates quite yet, so you can’t create an index with latitude/longitude. That should be coming soon.

Health Insurance Marketplace Costs

Post author By dave fauth
Post date October 7, 2013

Data.Healthcare.Gov released QHP cost information for various health care plans for states in the Federally-Facilitated and State-Partnership Marketplaces. The data is available in a variety of formats and lays out costs for various levels of health care plans (Gold, Silver, Bronze and Catastrophic) for different categories.

Premium Information
Premium amounts do not include tax credits that will lower premiums for the majority of those applying, specifically those with income up to 400 percent of the federal poverty level. The document shows premiums for the following example rating scenarios:

Adult Individual Age 27 = one adult age 27
Adult Individual Age 50 = one adult age 50
Family = two adults age 30, two children
Single Parent Family = one adult age 30, two children
Couple = two adults age 40, no children
Child = one child any age

Cost Comparisons
Looking at the information, I wanted to do some comparisons across the various plans and rating scenarios to see where the highest costs were, which states had the largest variance, and what the standard deviation looked like across states and plans.

While I could have run this in Excel or R, I decided to write a simple Pig job to determine the maximum, minimum and average costs by plan for each state. I also then calculated the variance and standard deviations.

/**
 * healthcareCosts
 */
 
/** 
 * Parameters - set default values here; you can override with -p on the command-line.
 */
 
%default INPUT_PATH '/Users/davidfauth/healthcareCosts/QHP_Individual_Medical_Landscape.csv'
%default OUTPUT_PATH '/Users/davidfauth/MortarHealthCareCostsOut'
%default OUTPUT_PATH_PLANDELTAS '/Users/davidfauth/MortarHealthCareCostsOut/PlanDeltas'
%default OUTPUT_PATH_PLANSTANDARD '/Users/davidfauth/MortarHealthCareCostsOut/PlanStandard'

/**
 * User-Defined Functions (UDFs)
 */

-- Load the input data from the CSV file
raw_data = LOAD '$INPUT_PATH' 
          USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
             AS (State:chararray,
				County:chararray,
				MetalLevel:chararray,
				IssuerName:chararray,
				PlanMarketingName:chararray,
				PlanType:chararray,
				RatingArea:chararray,
				PremiumAdultIndividualAge27:double,
				PremiumAdultIndividualAge50:double,
				PremiumFamily:double,
				PremiumSingleParentFamily:double,
				PremiumCouple:double,
				PremiumChild:double);
				
-- Keep only the columns needed for the analysis
	
noCurrencySymbol = FOREACH raw_data GENERATE State as newState,
County as newCounty,
MetalLevel as newMetalLevel,
PremiumAdultIndividualAge27 as newPremiumAdultIndividualAge27,
PremiumAdultIndividualAge50 as newPremiumAdultIndividualAge50,
PremiumFamily as newPremiumFamily,
PremiumSingleParentFamily as newPremiumSingleParentFamily,
PremiumCouple as newPremiumCouple,
PremiumChild as newPremiumChild; 
	
/* Group the plans by metal level and state */
PlansByMetalLevel = GROUP noCurrencySymbol BY (newMetalLevel, newState);

--calculate Max, Min and Avg costs
costsByPlansPAA27 = FOREACH PlansByMetalLevel GENERATE FLATTEN(group), 
	COUNT(noCurrencySymbol) as planCount,
    AVG(noCurrencySymbol.newPremiumAdultIndividualAge27) as avgPremiumAdultAge27,
	MIN(noCurrencySymbol.newPremiumAdultIndividualAge27) as minPremiumAdultAge27,
	MAX(noCurrencySymbol.newPremiumAdultIndividualAge27) as maxPremiumAdultAge27,
	AVG(noCurrencySymbol.newPremiumAdultIndividualAge50) as avgPremiumAdultAge50,
	MIN(noCurrencySymbol.newPremiumAdultIndividualAge50) as minPremiumAdultAge50,
	MAX(noCurrencySymbol.newPremiumAdultIndividualAge50) as maxPremiumAdultAge50,
	AVG(noCurrencySymbol.newPremiumFamily) as avgPremiumAdultAgePFAM,
	MIN(noCurrencySymbol.newPremiumFamily) as minPremiumAdultAgePFAM,
	MAX(noCurrencySymbol.newPremiumFamily) as maxPremiumAdultAgePFAM,
    AVG(noCurrencySymbol.newPremiumSingleParentFamily) as avgPremiumAdultAgePSPFAM,
    MIN(noCurrencySymbol.newPremiumSingleParentFamily) as minPremiumAdultAgePSPFAM,
	MAX(noCurrencySymbol.newPremiumSingleParentFamily) as maxPremiumAdultAgePSPFAM,
    AVG(noCurrencySymbol.newPremiumCouple) as avgPremiumAdultAgePC,
    MIN(noCurrencySymbol.newPremiumCouple) as minPremiumAdultAgePC,
	MAX(noCurrencySymbol.newPremiumCouple) as maxPremiumAdultAgePC,
    AVG(noCurrencySymbol.newPremiumChild) as avgPremiumAdultAgePCh,
    MIN(noCurrencySymbol.newPremiumChild) as minPremiumAdultAgePCh,
	MAX(noCurrencySymbol.newPremiumChild) as maxPremiumAdultAgePCh;

	--calculate deltas
deltasByPlan = FOREACH costsByPlansPAA27 GENERATE newMetalLevel, newState,
avgPremiumAdultAge27 - minPremiumAdultAge27 as deltaMinAvgAge27,
avgPremiumAdultAge50 - minPremiumAdultAge50 as deltaMinAvgAge50,
avgPremiumAdultAgePFAM - minPremiumAdultAgePFAM as deltaMinAvgAgePFAM,
avgPremiumAdultAgePSPFAM - minPremiumAdultAgePSPFAM as deltaMinAvgAgePSPFAM,
avgPremiumAdultAgePC - minPremiumAdultAgePC as deltaMinAvgAgePC,
avgPremiumAdultAgePCh- minPremiumAdultAgePCh as deltaMinAvgAgePCh;

-- calculate variance and Standard Deviations
mean = foreach PlansByMetalLevel {
        sum27 = SUM(noCurrencySymbol.newPremiumAdultIndividualAge27);
        sum50 = SUM(noCurrencySymbol.newPremiumAdultIndividualAge50);
        sumPFAM = SUM(noCurrencySymbol.newPremiumFamily);
        sumPSPFAM = SUM(noCurrencySymbol.newPremiumSingleParentFamily);
        sumPC = SUM(noCurrencySymbol.newPremiumCouple);
        sumPCh = SUM(noCurrencySymbol.newPremiumChild);
		count = COUNT(noCurrencySymbol);
        generate flatten(noCurrencySymbol), sum27/count as avg27, sum50/count as avg50, 
        sumPFAM/count as avgPFAM, sumPSPFAM/count as avgPSPFAM, sumPC/count as avgPC, 
		sumPCh/count as avgPCh, count as count;
};

tmp = foreach mean {
    dif27 = (newPremiumAdultIndividualAge27 - avg27) * (newPremiumAdultIndividualAge27 - avg27) ;
    dif50 = (newPremiumAdultIndividualAge50 - avg50) * (newPremiumAdultIndividualAge50 - avg50) ;
    difPFAM = (newPremiumFamily - avgPFAM) * (newPremiumFamily - avgPFAM) ;
    difPSPFAM = (newPremiumSingleParentFamily - avgPSPFAM) * (newPremiumSingleParentFamily - avgPSPFAM) ;
    difPC = (newPremiumCouple - avgPC) * (newPremiumCouple - avgPC) ;
    difPCh = (newPremiumChild - avgPCh) * (newPremiumChild - avgPCh) ;
     generate newMetalLevel, newState, count, dif27 as dif27,
	dif50 as dif50, difPFAM as difPFAM, difPSPFAM as difPSPFAM, difPC as difPC, difPCh as difPCh;
};


grp = group tmp by (newMetalLevel, newState);
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif27) as sqr_sum27, SUM(tmp.dif50) as sqr_sum50,
	SUM(tmp.difPFAM) as sqr_sumPFAM, SUM(tmp.difPSPFAM) as sqr_sumPSPFAM, SUM(tmp.difPC) as sqr_sumPC, 
	SUM(tmp.difPCh) as sqr_sumPCh; 
	
standard = foreach standard_tmp generate newState, newMetalLevel,
sqr_sum27 / count as variance27, SQRT(sqr_sum27 / count) as standard27,
sqr_sum50 / count as variance50, SQRT(sqr_sum50 / count) as standard50,
sqr_sumPFAM / count as variancePFAM, SQRT(sqr_sumPFAM / count) as standardPFAM,
sqr_sumPSPFAM / count as variancePSPFAM, SQRT(sqr_sumPSPFAM / count) as standardPSPFAM,
sqr_sumPC / count as variancePC, SQRT(sqr_sumPC / count) as standardPC,
sqr_sumPCh / count as variancePCh, SQRT(sqr_sumPCh / count) as standardPCh;

distinctStandard = DISTINCT standard;

-- remove any existing data
rmf $OUTPUT_PATH;

-- store the results
STORE costsByPlansPAA27 INTO '$OUTPUT_PATH' USING PigStorage('|');
STORE deltasByPlan INTO '$OUTPUT_PATH_PLANDELTAS' USING PigStorage('|');
STORE distinctStandard INTO '$OUTPUT_PATH_PLANSTANDARD' USING PigStorage('|');
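
The variance step in the Pig script above can be sketched in plain Python as a sanity check. This is a minimal sketch, not the code used for the analysis: the function name `variance_by_group` and the sample premiums are hypothetical. Like the Pig script, it divides the sum of squared differences by the full count (population variance) and takes the square root for the standard deviation.

```python
import math
from collections import defaultdict


def variance_by_group(rows):
    """Return {(metal_level, state): (variance, standard_deviation)}.

    rows is an iterable of (metal_level, state, premium) tuples.
    """
    groups = defaultdict(list)
    for metal_level, state, premium in rows:
        groups[(metal_level, state)].append(premium)

    result = {}
    for key, premiums in groups.items():
        count = len(premiums)
        mean = sum(premiums) / count
        # Sum of squared differences from the group mean (sqr_sum in the Pig script)
        sqr_sum = sum((p - mean) ** 2 for p in premiums)
        variance = sqr_sum / count
        result[key] = (variance, math.sqrt(variance))
    return result


# Hypothetical premiums, for illustration only
sample = [
    ("Gold", "VA", 300.0),
    ("Gold", "VA", 500.0),
    ("Gold", "AL", 250.0),
    ("Gold", "AL", 250.0),
]
stats = variance_by_group(sample)
```

A group whose premiums are all identical, like the hypothetical Alabama rows here, has a variance of zero.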

Initial Cost Analysis
Costs vary widely across the states, with Virginia consistently having the highest average plan cost. For catastrophic plans, Virginia is five times (5x) more expensive than Kansas or Alabama.

For Gold plans, insurance is again two to three times (2-3x) more expensive in Virginia.

Variance and Standard Deviation
It comes as no surprise that Virginia has the largest variance and standard deviation for the cost data by a large margin. Virginia’s variance on the Gold plans is 2742 times that of Alabama. New Hampshire, Alaska, Delaware and Utah all have small variances and are consistent across the rating scenarios.

Again, Virginia’s variance on the Bronze plans is far out of line with that of other states.

However, for a Platinum plan, Virginia has the ninth smallest variation across all rating scenarios. New Jersey, Michigan and Wisconsin have the largest variations.

Code and Data
The code and data are on GitHub. If you have questions, you can reach me at dsfauth at gmail dot com.

Part 2 – Building an Enhanced DocGraph Dataset using Mortar (Hadoop) and Neo4J

By dave fauth
August 26, 2013

In the last post, I talked about creating the enhanced DocGraph dataset using Mortar and Neo4J. Our data model looks like the following:

Nodes
Organizations
Specialties
Providers
Locations
CountiesZip
Census

Relationships
* Organizations -[:PARENT_OF]- Providers -[:SPECIALTY]- Specialties
* Providers -[:LOCATED_IN]- Locations
* Providers -[:REFERRED]- Providers
* Counties -[:INCOME_IN]- CountiesZip
* Locations -[:LOCATED_IN]- Locations

Each node has several properties associated with it. For example, each Organization has a name, and each Location has a city, state and postal code.

Data
The data we are going to use is the initial DocGraph set, the Health Care Provider Taxonomy Code (NUCC) set located here, the National Plan and Provider Enumeration System (NPPES) Downloadable File here, a zip-code-to-state file, and the income per zip code downloaded from the US Census. These files were loaded to an Amazon S3 bucket for processing.

Mortar Project
To create the Neo4J graph database, we need to generate several files to load into Neo4J. To generate them, we are going to create a Mortar Project and use the Pig file that we created in the last post.

Create Mortar Project
To fully leverage the Mortar Project framework, I created a Mortar project, which also makes the code available on GitHub. The commands below create a new project skeleton and register it with Mortar. The project comes with folders for commonly used items, such as pigscripts, macros, and UDFs.

cd mortar-examples
mortar projects:create docGraphNeo4J

Pig Code
Any Pig code that you want to run with Mortar should be put in the pigscripts directory in your project. I replaced the example pigscript in that directory called my-sample-project.pig with my docGraphNeo4J.pig script.

Illustrate
Illustrate is the best tool to check what you’ve written so far. Illustrate will check your Pig syntax, and then show a small subset of data flowing through each alias in your pigscript.

To get the fastest results, use the local:illustrate command.

mortar local:illustrate pigscripts/my-sample-project.pig

Once the illustrate result is ready, a web browser tab will open to show the results:

Mortar Watchtower
Mortar Watchtower is the fastest way to develop with Pig. Rather than waiting for a local or remote Pig run, you can validate that your scripts work simply by saving. Watchtower sits in the background analyzing your script, instantly showing you your data flowing through it.

After installing Mortar Watchtower, I was able to do near realtime analysis of the data simply by typing in:

mortar watch ./pigscripts/docGraphNeo4J2.pig

Once I type that into my console window, I see:

A browser window then pops up:

As you can see, the Watchtower Viewer redisplays your script with example data embedded inline with each alias. You can click on the header of this inline table to toggle between different numbers of example rows. You can also click on any given table cell to see the complete data, including any truncated values.

Full Run on Mortar
Once the code was ready, it was time to run it on a full Hadoop cluster. To specify the cluster size for your run, use the --clustersize option:

$ mortar jobs:run pigscripts/docGraphNeo4J.pig --clustersize 4

When I ran this job on the full Hadoop cluster, it finished in about 16 minutes and wrote the following records to my Amazon S3 buckets:

Input(s):
Successfully read 3998551 records from: "s3n://NPIData/npidata_20050523-20130512.csv"
Successfully read 830 records from: "s3n://NUCC-Taxonomy/nucc_taxonomy_130.txt"
Successfully read 49685587 records from: "s3n://medgraph/refer.2011.csv"

Output(s):
Successfully stored 3998551 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/providerList"
Successfully stored 3998551 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/locations"
Successfully stored 77896 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/parentOfLink"
Successfully stored 33212 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/addrList"
Successfully stored 830 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/specialties"
Successfully stored 4746915 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/specialtiesProviders"
Successfully stored 33212 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/uniquelocations"
Successfully stored 694221 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/organizations"
Successfully stored 49685587 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/docGraphProviders"
Successfully stored 1826823 records in: "s3n://DataOut/DocGraph/DocGraphNeo4J/uniqueDoctorList"

Summary
In summary, I was able to take the three raw data files, write a Pig script to process the data, run the Pig job on a Hadoop cluster and create the multiple files that I will need to populate the Neo4J instance.

Why did I choose Mortar?
Mortar is fast, open and free. As Mortar says, using a Mortar project provides you with the following advantages:

* Pig and Hadoop on Your Computer: When you create a Mortar Project, you get a local installation of Pig and Hadoop ready to use, without needing to install anything yourself. That means faster development, and better testing.
* Version Control and Code Sharing: Mortar Projects are backed by source control, either through Mortar or your own system, so you can collaborate with team members on a project.
* 1-Button Deployment: When you’re ready to run your project on a Hadoop cluster, a single command is all that’s needed to deploy and run in the cloud.

Using Mortar’s Watchtower, I was able to get an instant sampling of my data, complete file watching, instant schema validation and instant error catching.

For me, Mortar was easy, fast and a great tool to get the data ready for loading into Neo4J.

Next Steps
In the next post, I’ll write about how to take the data from these files and load it into Neo4J.

Building an Enhanced DocGraph Dataset using Mortar (Hadoop) and Neo4J

By dave fauth
August 19, 2013

“The average doctor has likely never heard of Fred Trotter, but he has some provocative ideas about using physician data to change how healthcare gets delivered.” This was from a recent Gigaom article. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight, which is the shared number of Medicare patients in a 30-day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.
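
To make the three-column layout concrete, here is a minimal Python sketch that reads referral rows and counts the relationships and the distinct providers. It assumes comma-separated rows with no header; the function name `summarize_referrals` and the NPI values in the sample are hypothetical, made up for illustration.

```python
import csv
import io


def summarize_referrals(lines):
    """Return (relationship_count, distinct_provider_count).

    Each row is expected to hold: first NPI, second NPI, shared-patient weight.
    """
    providers = set()
    relationships = 0
    for first_npi, second_npi, weight in csv.reader(lines):
        int(weight)  # raises ValueError if the weight is not a whole number
        # A provider can appear in either column, so collect both into one set
        providers.update((first_npi, second_npi))
        relationships += 1
    return relationships, len(providers)


# Hypothetical rows, for illustration only
sample = io.StringIO(
    "1000000001,1000000002,45\n"
    "1000000002,1000000003,12\n"
    "1000000001,1000000003,30\n"
)
relationships, provider_count = summarize_referrals(sample)
```

Collecting both columns into a single set is what keeps a provider who appears on both sides of a relationship from being counted twice.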

You can read some excellent work already being done on this data here courtesy of Janos. Ryan Weald has some great work on visualizing geographic connections between doctors here as well.

The current DocGraph social graph was built in Neo4J. With new enhancements in Neo4J 2.0 (primarily labels), now was a good time to rebuild the social graph, add in data about each doctor, their specialties and their locations. Finally, I’ve added in some census income data at the zip code level. Researchers could look at economic indicators to see if there are discernable economic patterns in the referrals.

In this series of blog posts, I will walk through the process of building the updated DocGraph in Neo4J using Hadoop followed by the Neo4J batch inserter.

Building the import documents.

One of the goals of the project was to learn Pig in combination with Hadoop to process the large files. I could easily have worked in MySQL or Oracle, but I also wanted an easy way to run jobs on large data sets.

My friends at Mortar have a great platform for leveraging Hadoop, Pig and Python. Mortar is the fastest and easiest way to work with Pig and Python on Hadoop. Mortar’s platform is for everything from joining and cleansing large data sets to machine learning and building recommender systems.
Mortar makes it easy for developers and data scientists to do powerful work with Hadoop. The main advantages of Mortar are:

* Zero Setup Time: Mortar takes only minutes to set up (or no time at all on the web), and you can start running Pig jobs immediately. No need for painful installation or configuration.
* Powerful Tooling: Mortar provides a rich suite of tools to aid in Pig development, including the ability to Illustrate a script before running it, and an extremely fast and free local development mode.
* Elastic Clusters: We spin up Hadoop clusters as you need them, so you don’t have to predict your needs in advance, and you don’t pay for machines you don’t use.
* Solid Support: Whether the issue is in your script or in Hadoop, we’ll help you figure out a solution.

Data Sets

One great thing about this data is that you can combine the DocGraph data with other data sets. For example, we can combine NPPES data with the DocGraph data. The NPPES is the federal registry for NPI numbers and associated provider information.

To create the data sets for ingest into Neo4J, we are going to combine Census data, DocGraph data, the NPPES database and the National Uniform Claim Committee (NUCC) provider taxonomy codes.

Pig Scripts
Using Pig scripts, I was able to create several data files that could then be loaded into Neo4J.


-- Load the DocGraph referral data
medGraphData = LOAD 's3n://medgraph/refer.2011.csv' USING PigStorage(',') AS
(primaryProvider:chararray,
referredDoctor: chararray,
qtyReferred:chararray);

-- Load the Classification/Specialty Codes
nucc_codes = LOAD 's3n://NUCC-Taxonomy/nucc_taxonomy_130.txt' USING PigStorage('\t') AS
(nuccCode:chararray,
nuccType:chararray,
nuccClassification:chararray,
nuccSpecialty:chararray);

-- Load NPI Data
npiData = LOAD 's3n://NPIData/npidata_20050523-20130512.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS
(NPI:chararray,
Entity_Type_Code:chararray,
Replacement_NPI:chararray,
Employer_Identification_Number:chararray,
Provider_Organization_Name:chararray,
Provider_Last_Name:chararray,
Provider_First_Name:chararray,
Provider_Middle_Name:chararray,
Provider_Name_Prefix_Text:chararray,
Provider_Name_Suffix_Text:chararray,
Provider_Credential_Text:chararray,
Provider_Other_Organization_Name:chararray,
Provider_Other_Organization_Name_Type_Code:chararray,
Provider_Other_Last_Name:chararray,
Provider_Other_First_Name:chararray,
Provider_Other_Middle_Name:chararray,
Provider_Other_Name_Prefix_Text:chararray,
Provider_Other_Name_Suffix_Text:chararray,
Provider_Other_Credential_Text:chararray,
Provider_Other_Last_Name_Type_Code:chararray,
Provider_First_Line_Business_Mailing_Address:chararray,
Provider_Second_Line_Business_Mailing_Address:chararray,
Provider_Business_Mailing_Address_City_Name:chararray,
Provider_Business_Mailing_Address_State_Name:chararray,
Provider_Business_Mailing_Address_Postal_Code:chararray,
Provider_Business_Mailing_Address_Country_Code:chararray,
Provider_Business_Mailing_Address_Telephone_Number:chararray,
Provider_Business_Mailing_Address_Fax_Number:chararray,
Provider_First_Line_Business_Practice_Location_Address:chararray,
Provider_Second_Line_Business_Practice_Location_Address:chararray,
Provider_Business_Practice_Location_Address_City_Name:chararray,
Provider_Business_Practice_Location_Address_State_Name:chararray,
Provider_Business_Practice_Location_Address_Postal_Code:chararray,
Provider_Business_Practice_Location_Address_Country_Code:chararray,
Provider_Business_Practice_Location_Address_Telephone_Number:chararray,
Provider_Business_Practice_Location_Address_Fax_Number:chararray,
Provider_Enumeration_Date:chararray,
Last_Update_Date:chararray,
NPI_Deactivation_Reason_Code:chararray,
NPI_Deactivation_Date:chararray,
NPI_Reactivation_Date:chararray,
Provider_Gender_Code:chararray,
Authorized_Official_Last_Name:chararray,
Authorized_Official_First_Name:chararray,
Authorized_Official_Middle_Name:chararray,
Authorized_Official_Title_or_Position:chararray,
Authorized_Official_Telephone_Number:chararray,
Healthcare_Provider_Taxonomy_Code_1:chararray,
Provider_License_Number_1:chararray,
Provider_License_Number_State_Code_1:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_1:chararray,
Healthcare_Provider_Taxonomy_Code_2:chararray,
Provider_License_Number_2:chararray,
Provider_License_Number_State_Code_2:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_2:chararray,
Healthcare_Provider_Taxonomy_Code_3:chararray,
Provider_License_Number_3:chararray,
Provider_License_Number_State_Code_3:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_3:chararray,
Healthcare_Provider_Taxonomy_Code_4:chararray,
Provider_License_Number_4:chararray,
Provider_License_Number_State_Code_4:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_4:chararray,
Healthcare_Provider_Taxonomy_Code_5:chararray,
Provider_License_Number_5:chararray,
Provider_License_Number_State_Code_5:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_5:chararray,
Healthcare_Provider_Taxonomy_Code_6:chararray,
Provider_License_Number_6:chararray,
Provider_License_Number_State_Code_6:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_6:chararray,
Healthcare_Provider_Taxonomy_Code_7:chararray,
Provider_License_Number_7:chararray,
Provider_License_Number_State_Code_7:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_7:chararray,
Healthcare_Provider_Taxonomy_Code_8:chararray,
Provider_License_Number_8:chararray,
Provider_License_Number_State_Code_8:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_8:chararray,
Healthcare_Provider_Taxonomy_Code_9:chararray,
Provider_License_Number_9:chararray,
Provider_License_Number_State_Code_9:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_9:chararray,
Healthcare_Provider_Taxonomy_Code_10:chararray,
Provider_License_Number_10:chararray,
Provider_License_Number_State_Code_10:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_10:chararray,
Healthcare_Provider_Taxonomy_Code_11:chararray,
Provider_License_Number_11:chararray,
Provider_License_Number_State_Code_11:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_11:chararray,
Healthcare_Provider_Taxonomy_Code_12:chararray,
Provider_License_Number_12:chararray,
Provider_License_Number_State_Code_12:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_12:chararray,
Healthcare_Provider_Taxonomy_Code_13:chararray,
Provider_License_Number_13:chararray,
Provider_License_Number_State_Code_13:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_13:chararray,
Healthcare_Provider_Taxonomy_Code_14:chararray,
Provider_License_Number_14:chararray,
Provider_License_Number_State_Code_14:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_14:chararray,
Healthcare_Provider_Taxonomy_Code_15:chararray,
Provider_License_Number_15:chararray,
Provider_License_Number_State_Code_15:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_15:chararray,
Other_Provider_Identifier_1:chararray,
Other_Provider_Identifier_Type_Code_1:chararray,
Other_Provider_Identifier_State_1:chararray,
Other_Provider_Identifier_Issuer_1:chararray,
Other_Provider_Identifier_2:chararray,
Other_Provider_Identifier_Type_Code_2:chararray,
Other_Provider_Identifier_State_2:chararray,
Other_Provider_Identifier_Issuer_2:chararray,
Other_Provider_Identifier_3:chararray,
Other_Provider_Identifier_Type_Code_3:chararray,
Other_Provider_Identifier_State_3:chararray,
Other_Provider_Identifier_Issuer_3:chararray,
Other_Provider_Identifier_4:chararray,
Other_Provider_Identifier_Type_Code_4:chararray,
Other_Provider_Identifier_State_4:chararray,
Other_Provider_Identifier_Issuer_4:chararray,
Other_Provider_Identifier_5:chararray,
Other_Provider_Identifier_Type_Code_5:chararray,
Other_Provider_Identifier_State_5:chararray,
Other_Provider_Identifier_Issuer_5:chararray,
Other_Provider_Identifier_6:chararray,
Other_Provider_Identifier_Type_Code_6:chararray,
Other_Provider_Identifier_State_6:chararray,
Other_Provider_Identifier_Issuer_6:chararray,
Other_Provider_Identifier_7:chararray,
Other_Provider_Identifier_Type_Code_7:chararray,
Other_Provider_Identifier_State_7:chararray,
Other_Provider_Identifier_Issuer_7:chararray,
Other_Provider_Identifier_8:chararray,
Other_Provider_Identifier_Type_Code_8:chararray,
Other_Provider_Identifier_State_8:chararray,
Other_Provider_Identifier_Issuer_8:chararray,
Other_Provider_Identifier_9:chararray,
Other_Provider_Identifier_Type_Code_9:chararray,
Other_Provider_Identifier_State_9:chararray,
Other_Provider_Identifier_Issuer_9:chararray,
Other_Provider_Identifier_10:chararray,
Other_Provider_Identifier_Type_Code_10:chararray,
Other_Provider_Identifier_State_10:chararray,
Other_Provider_Identifier_Issuer_10:chararray,
Other_Provider_Identifier_11:chararray,
Other_Provider_Identifier_Type_Code_11:chararray,
Other_Provider_Identifier_State_11:chararray,
Other_Provider_Identifier_Issuer_11:chararray,
Other_Provider_Identifier_12:chararray,
Other_Provider_Identifier_Type_Code_12:chararray,
Other_Provider_Identifier_State_12:chararray,
Other_Provider_Identifier_Issuer_12:chararray,
Other_Provider_Identifier_13:chararray,
Other_Provider_Identifier_Type_Code_13:chararray,
Other_Provider_Identifier_State_13:chararray,
Other_Provider_Identifier_Issuer_13:chararray,
Other_Provider_Identifier_14:chararray,
Other_Provider_Identifier_Type_Code_14:chararray,
Other_Provider_Identifier_State_14:chararray,
Other_Provider_Identifier_Issuer_14:chararray,
Other_Provider_Identifier_15:chararray,
Other_Provider_Identifier_Type_Code_15:chararray,
Other_Provider_Identifier_State_15:chararray,
Other_Provider_Identifier_Issuer_15:chararray,
Other_Provider_Identifier_16:chararray,
Other_Provider_Identifier_Type_Code_16:chararray,
Other_Provider_Identifier_State_16:chararray,
Other_Provider_Identifier_Issuer_16:chararray,
Other_Provider_Identifier_17:chararray,
Other_Provider_Identifier_Type_Code_17:chararray,
Other_Provider_Identifier_State_17:chararray,
Other_Provider_Identifier_Issuer_17:chararray,
Other_Provider_Identifier_18:chararray,
Other_Provider_Identifier_Type_Code_18:chararray,
Other_Provider_Identifier_State_18:chararray,
Other_Provider_Identifier_Issuer_18:chararray,
Other_Provider_Identifier_19:chararray,
Other_Provider_Identifier_Type_Code_19:chararray,
Other_Provider_Identifier_State_19:chararray,
Other_Provider_Identifier_Issuer_19:chararray,
Other_Provider_Identifier_20:chararray,
Other_Provider_Identifier_Type_Code_20:chararray,
Other_Provider_Identifier_State_20:chararray,
Other_Provider_Identifier_Issuer_20:chararray,
Other_Provider_Identifier_21:chararray,
Other_Provider_Identifier_Type_Code_21:chararray,
Other_Provider_Identifier_State_21:chararray,
Other_Provider_Identifier_Issuer_21:chararray,
Other_Provider_Identifier_22:chararray,
Other_Provider_Identifier_Type_Code_22:chararray,
Other_Provider_Identifier_State_22:chararray,
Other_Provider_Identifier_Issuer_22:chararray,
Other_Provider_Identifier_23:chararray,
Other_Provider_Identifier_Type_Code_23:chararray,
Other_Provider_Identifier_State_23:chararray,
Other_Provider_Identifier_Issuer_23:chararray,
Other_Provider_Identifier_24:chararray,
Other_Provider_Identifier_Type_Code_24:chararray,
Other_Provider_Identifier_State_24:chararray,
Other_Provider_Identifier_Issuer_24:chararray,
Other_Provider_Identifier_25:chararray,
Other_Provider_Identifier_Type_Code_25:chararray,
Other_Provider_Identifier_State_25:chararray,
Other_Provider_Identifier_Issuer_25:chararray,
Other_Provider_Identifier_26:chararray,
Other_Provider_Identifier_Type_Code_26:chararray,
Other_Provider_Identifier_State_26:chararray,
Other_Provider_Identifier_Issuer_26:chararray,
Other_Provider_Identifier_27:chararray,
Other_Provider_Identifier_Type_Code_27:chararray,
Other_Provider_Identifier_State_27:chararray,
Other_Provider_Identifier_Issuer_27:chararray,
Other_Provider_Identifier_28:chararray,
Other_Provider_Identifier_Type_Code_28:chararray,
Other_Provider_Identifier_State_28:chararray,
Other_Provider_Identifier_Issuer_28:chararray,
Other_Provider_Identifier_29:chararray,
Other_Provider_Identifier_Type_Code_29:chararray,
Other_Provider_Identifier_State_29:chararray,
Other_Provider_Identifier_Issuer_29:chararray,
Other_Provider_Identifier_30:chararray,
Other_Provider_Identifier_Type_Code_30:chararray,
Other_Provider_Identifier_State_30:chararray,
Other_Provider_Identifier_Issuer_30:chararray,
Other_Provider_Identifier_31:chararray,
Other_Provider_Identifier_Type_Code_31:chararray,
Other_Provider_Identifier_State_31:chararray,
Other_Provider_Identifier_Issuer_31:chararray,
Other_Provider_Identifier_32:chararray,
Other_Provider_Identifier_Type_Code_32:chararray,
Other_Provider_Identifier_State_32:chararray,
Other_Provider_Identifier_Issuer_32:chararray,
Other_Provider_Identifier_33:chararray,
Other_Provider_Identifier_Type_Code_33:chararray,
Other_Provider_Identifier_State_33:chararray,
Other_Provider_Identifier_Issuer_33:chararray,
Other_Provider_Identifier_34:chararray,
Other_Provider_Identifier_Type_Code_34:chararray,
Other_Provider_Identifier_State_34:chararray,
Other_Provider_Identifier_Issuer_34:chararray,
Other_Provider_Identifier_35:chararray,
Other_Provider_Identifier_Type_Code_35:chararray,
Other_Provider_Identifier_State_35:chararray,
Other_Provider_Identifier_Issuer_35:chararray,
Other_Provider_Identifier_36:chararray,
Other_Provider_Identifier_Type_Code_36:chararray,
Other_Provider_Identifier_State_36:chararray,
Other_Provider_Identifier_Issuer_36:chararray,
Other_Provider_Identifier_37:chararray,
Other_Provider_Identifier_Type_Code_37:chararray,
Other_Provider_Identifier_State_37:chararray,
Other_Provider_Identifier_Issuer_37:chararray,
Other_Provider_Identifier_38:chararray,
Other_Provider_Identifier_Type_Code_38:chararray,
Other_Provider_Identifier_State_38:chararray,
Other_Provider_Identifier_Issuer_38:chararray,
Other_Provider_Identifier_39:chararray,
Other_Provider_Identifier_Type_Code_39:chararray,
Other_Provider_Identifier_State_39:chararray,
Other_Provider_Identifier_Issuer_39:chararray,
Other_Provider_Identifier_40:chararray,
Other_Provider_Identifier_Type_Code_40:chararray,
Other_Provider_Identifier_State_40:chararray,
Other_Provider_Identifier_Issuer_40:chararray,
Other_Provider_Identifier_41:chararray,
Other_Provider_Identifier_Type_Code_41:chararray,
Other_Provider_Identifier_State_41:chararray,
Other_Provider_Identifier_Issuer_41:chararray,
Other_Provider_Identifier_42:chararray,
Other_Provider_Identifier_Type_Code_42:chararray,
Other_Provider_Identifier_State_42:chararray,
Other_Provider_Identifier_Issuer_42:chararray,
Other_Provider_Identifier_43:chararray,
Other_Provider_Identifier_Type_Code_43:chararray,
Other_Provider_Identifier_State_43:chararray,
Other_Provider_Identifier_Issuer_43:chararray,
Other_Provider_Identifier_44:chararray,
Other_Provider_Identifier_Type_Code_44:chararray,
Other_Provider_Identifier_State_44:chararray,
Other_Provider_Identifier_Issuer_44:chararray,
Other_Provider_Identifier_45:chararray,
Other_Provider_Identifier_Type_Code_45:chararray,
Other_Provider_Identifier_State_45:chararray,
Other_Provider_Identifier_Issuer_45:chararray,
Other_Provider_Identifier_46:chararray,
Other_Provider_Identifier_Type_Code_46:chararray,
Other_Provider_Identifier_State_46:chararray,
Other_Provider_Identifier_Issuer_46:chararray,
Other_Provider_Identifier_47:chararray,
Other_Provider_Identifier_Type_Code_47:chararray,
Other_Provider_Identifier_State_47:chararray,
Other_Provider_Identifier_Issuer_47:chararray,
Other_Provider_Identifier_48:chararray,
Other_Provider_Identifier_Type_Code_48:chararray,
Other_Provider_Identifier_State_48:chararray,
Other_Provider_Identifier_Issuer_48:chararray,
Other_Provider_Identifier_49:chararray,
Other_Provider_Identifier_Type_Code_49:chararray,
Other_Provider_Identifier_State_49:chararray,
Other_Provider_Identifier_Issuer_49:chararray,
Other_Provider_Identifier_50:chararray,
Other_Provider_Identifier_Type_Code_50:chararray,
Other_Provider_Identifier_State_50:chararray,
Other_Provider_Identifier_Issuer_50:chararray,
Is_Sole_Proprietor:chararray,
Is_Organization_Subpart:chararray,
Parent_Organization_LBN:chararray,
Parent_Organization_TIN:chararray,
Authorized_Official_Name_Prefix_Text:chararray,
Authorized_Official_Name_Suffix_Text:chararray,
Authorized_Official_Credential_Text:chararray,
Healthcare_Provider_Taxonomy_Group_1:chararray,
Healthcare_Provider_Taxonomy_Group_2:chararray,
Healthcare_Provider_Taxonomy_Group_3:chararray,
Healthcare_Provider_Taxonomy_Group_4:chararray,
Healthcare_Provider_Taxonomy_Group_5:chararray,
Healthcare_Provider_Taxonomy_Group_6:chararray,
Healthcare_Provider_Taxonomy_Group_7:chararray,
Healthcare_Provider_Taxonomy_Group_8:chararray,
Healthcare_Provider_Taxonomy_Group_9:chararray,
Healthcare_Provider_Taxonomy_Group_10:chararray,
Healthcare_Provider_Taxonomy_Group_11:chararray,
Healthcare_Provider_Taxonomy_Group_12:chararray,
Healthcare_Provider_Taxonomy_Group_13:chararray,
Healthcare_Provider_Taxonomy_Group_14:chararray,
Healthcare_Provider_Taxonomy_Group_15:chararray
);


-- generate a Provider List and replace any quotes
providerList = foreach npiData generate REPLACE(NPI, '\\"', '') AS npiCode,
       REPLACE(Entity_Type_Code, '\\"','') AS entity_type,
       REPLACE(Provider_First_Line_Business_Practice_Location_Address, '\\"','') AS address_first_line,
       REPLACE(Provider_Second_Line_Business_Practice_Location_Address, '\\"','') AS address_second_line,
       REPLACE(Provider_Business_Practice_Location_Address_City_Name, '\\"','') AS address_city_name,
       REPLACE(Provider_Business_Practice_Location_Address_State_Name, '\\"','') AS address_state_name,
       REPLACE(Provider_Business_Practice_Location_Address_Postal_Code, '\\"','') AS address_postal_code,
       REPLACE(Provider_Business_Practice_Location_Address_Country_Code, '\\"','') AS address_country_code,
       REPLACE(Provider_Business_Practice_Location_Address_Telephone_Number, '\\"','') AS telephone_number,
       REPLACE(Provider_Business_Practice_Location_Address_Fax_Number, '\\"','') AS fax_number,
       REPLACE(Provider_Gender_Code, '\\"','') AS gender,
       REPLACE(Provider_Organization_Name, '\\"','') AS ProviderOrgName,
       REPLACE(Provider_Name_Prefix_Text, '\\"', '') AS ProviderPrefix,
       REPLACE(Provider_First_Name, '\\"', '') AS ProviderFirstName,
       REPLACE(Provider_Middle_Name, '\\"', '') AS ProviderMiddleName,
       REPLACE(Provider_Last_Name, '\\"', '') AS ProviderLastName,
       REPLACE(Provider_Name_Suffix_Text, '\\"', '') AS ProviderSuffix,
       REPLACE(Provider_Credential_Text, '\\"', '') AS ProviderCredential;

-- create list of NPI codes to Parent Organization
parentOrgLBNList = foreach npiData generate REPLACE(Parent_Organization_LBN, '\\"','') as newParentOrgLBN;
hasParentOrgLBNValue = filter parentOrgLBNList by newParentOrgLBN != '';
distinctParentOrgLBNList = distinct hasParentOrgLBNValue;
childHasParentLBN = join npiData by (REPLACE(Parent_Organization_LBN, '\\"','')), distinctParentOrgLBNList BY (newParentOrgLBN);
npiLBN = foreach childHasParentLBN generate NPI as childNPI, Parent_Organization_LBN as newParentOrgLBN;

-- generate unique list of Provider Organization Names
providerSubList = foreach providerList generate ProviderOrgName;
hasProviderOrgName = filter providerSubList by ProviderOrgName != '';
distinctProvider = distinct hasProviderOrgName;
--hasParentOrgLBNValue = filter hasParentOrgLBN by newParentOrgLBN matches '.';
--grpd = group hasParentOrgLBNValue by NPI;
--parentGroupOut = foreach grpd generate group, COUNT(hasParentOrgLBNValue);


--address list
addressList = foreach providerList generate address_city_name, address_state_name, address_country_code;
uniqueLocations = distinct addressList;
grpdAddressList = group addressList by (address_city_name, address_state_name, address_country_code);
addressListCnt = foreach grpdAddressList generate group, COUNT(addressList) as countAddressList;
addressListOrdered = order addressListCnt BY countAddressList;
addressListFlat = foreach addressListOrdered GENERATE FLATTEN(group) as (address_city_name, address_state_name, address_country_code), countAddressList;

--located in
locatedIn = foreach providerList generate npiCode,address_city_name, address_state_name, address_postal_code, address_country_code;

-- doctors taxonomy listing (some doctors may have multiple taxonomies)
joinedNPITax1 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_1, '\\"', '')), nucc_codes BY (nuccCode);
joinedNPITax2 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_2, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax3 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_3, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax4 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_4, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax5 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_5, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax6 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_6, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax7 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_7, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax8 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_8, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax9 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_9, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax10 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_10, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax11 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_11, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax12 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_12, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax13 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_13, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax14 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_14, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax15 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_15, '\\"', '')) , nucc_codes BY (nuccCode);

joinedNPINUCC = UNION joinedNPITax1,joinedNPITax2,joinedNPITax3,joinedNPITax4,joinedNPITax5,joinedNPITax6,
joinedNPITax7,joinedNPITax8,joinedNPITax9,joinedNPITax10,joinedNPITax11,joinedNPITax12,joinedNPITax13,joinedNPITax14,joinedNPITax15;

simpleJoinedNPINUCC = foreach joinedNPINUCC GENERATE REPLACE(NPI, '\\"', '') AS npiCode, nuccCode, nuccType, nuccClassification, nuccSpecialty;

-- unique Specialties
uniqueNUCCCodes = distinct nucc_codes;

-- unique NPICodes
primaryDoc = foreach medGraphData generate primaryProvider;
referredDoc = foreach medGraphData generate referredDoctor;
uniquePrimaryDoc = distinct primaryDoc;
uniqueReferredDoc = distinct referredDoc;
uniqueDocList = union uniquePrimaryDoc, uniqueReferredDoc;

rmf s3n://DataOut/DocGraph/DocGraphNeo4J;
--STORE sampleJoinedVADoc INTO 's3n://DataOut/DocGraph/DocHosp' USING PigStorage('|');
--STORE cnt INTO 's3n://DataOut/DocGraph/DocGraphNeo4J' USING PigStorage('|');
--s3 Work
STORE npiLBN INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/parentOfLink' USING PigStorage('|');
STORE distinctProvider INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/organizations' USING PigStorage('|');
STORE providerList INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/providerList' USING PigStorage('|');
STORE addressListFlat INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/addrList' USING PigStorage('|');
STORE locatedIn INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/locations' USING PigStorage('|');
STORE uniqueLocations INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/uniquelocations' USING PigStorage('|');
STORE uniqueNUCCCodes INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/specialties' USING PigStorage('|');
STORE simpleJoinedNPINUCC INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/specialtiesProviders' USING PigStorage('|');
STORE medGraphData INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/docGraphProviders' USING PigStorage('|');
STORE uniqueDocList INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/uniqueDoctorList' USING PigStorage('|');

--local Development
--STORE parentGroupOut INTO '../DataOut/DocGraph/DocGraphNeo4J' USING PigStorage('|');
--STORE addressListOrdered INTO '../DataOut/DocGraph/DocGraphNeo4J' USING PigStorage('|');
--STORE locatedIn INTO '../DataOut/DocGraph/DocGraphNeo4J' USING PigStorage('|');
--STORE simpleJoinedNPINUCC INTO '../DataOut/DocGraph/DocGraphNeo4J' USING PigStorage('|');
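
The fifteen JOIN statements and the UNION in the script above amount to "for each provider, look at every taxonomy slot and keep the ones that match a NUCC code". Here is a minimal Python sketch of that same logic, handy for checking results on a small sample. The function names, the dictionary layout and the sample rows are assumptions made for illustration, and the sketch assumes each NUCC code appears only once in the lookup.

```python
TAXONOMY_SLOTS = 15  # the script joins on taxonomy code columns 1 through 15


def strip_quotes(value):
    # Drop double quotes, like the REPLACE calls in the Pig script.
    return value.replace('"', '')


def join_npi_to_nucc(npi_rows, nucc_by_code):
    # Inner join: emit one tuple per taxonomy slot that matches a NUCC code.
    for row in npi_rows:
        npi = strip_quotes(row["NPI"])
        for slot in range(1, TAXONOMY_SLOTS + 1):
            raw = row.get("Healthcare_Provider_Taxonomy_Code_%d" % slot, "")
            code = strip_quotes(raw)
            if code in nucc_by_code:
                yield (npi, code) + nucc_by_code[code]


# Illustrative sample values only: (type, classification, specialty) per code.
NUCC_BY_CODE = {
    "207Q00000X": ("Allopathic & Osteopathic Physicians", "Family Medicine", ""),
    "207R00000X": ("Allopathic & Osteopathic Physicians", "Internal Medicine", ""),
}

# Hypothetical NPI rows; the real file has many more columns.
NPI_ROWS = [
    {"NPI": '"1000000001"',
     "Healthcare_Provider_Taxonomy_Code_1": '"207Q00000X"',
     "Healthcare_Provider_Taxonomy_Code_2": '"207R00000X"'},
    {"NPI": '"1000000002"',
     "Healthcare_Provider_Taxonomy_Code_1": '"999999999X"'},
]

joined = list(join_npi_to_nucc(NPI_ROWS, NUCC_BY_CODE))
```

The rows come out grouped by provider rather than by taxonomy slot, which does not matter because the Pig UNION is unordered as well.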

Running the Pig Code in Mortar
In the next post, we will look at using Mortar’s framework to run the Pig jobs.

Recommender Tips, Mortar and DocGraph

Post author By dave fauth
Post date August 14, 2013

Jonathan Packer wrote on Mortar’s blog about flexible recommender models. Jonathan names what he sees as, “from a business perspective the two most salient advantages of graph-based models: flexibility and simplicity.”

Some of the salient points made in the article are:

Graph-based models are modular and transparent.
A simple graph-based model will allow you to build a viable recommender system for your product without delaying its time-to-market.
Graphs can be visualized, explained, discussed, and debugged collaboratively in a way that sophisticated machine learning techniques cannot.

Jonathan ends with “My opinion is that the next big advances to be made in recommender systems will be made by combining automated tools with human—possibly crowdsourced—editorial judgement and writing talent. They will be made in finding more engaging ways to present recommendations to users than cloying sidebars and endlessly scrolling lists.”

DocGraph
“The average doctor has likely never heard of Fred Trotter, but he has some provocative ideas about using physician data to change how healthcare gets delivered.” This was from a recent Gigaom article. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight, which is the number of Medicare patients the two providers shared in a 30-day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.
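
To make the three-column layout concrete, here is a minimal sketch that reads rows of that shape and counts the relationships and the distinct providers. The sample NPIs and weights are made up for illustration, and the function name is my own.

```python
import csv
import io

# Made-up sample in the described layout: first NPI, second NPI, shared-patient weight.
SAMPLE = (
    "1000000001,1000000002,25\n"
    "1000000002,1000000003,40\n"
    "1000000001,1000000003,12\n"
)


def summarize(handle):
    # Count rows (relationships) and the distinct NPIs seen in either column.
    relationships = 0
    providers = set()
    for first_npi, second_npi, _weight in csv.reader(handle):
        relationships += 1
        providers.update((first_npi, second_npi))
    return relationships, len(providers)


summary = summarize(io.StringIO(SAMPLE))
```

Run against the full file instead of the sample, the same loop should reproduce the relationship and provider counts quoted above, memory permitting.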

Recommendation Engine
The combination of the Neo4J social graph, the medical data and the capability to build a recommendation engine in Mortar makes a compelling use case. I believe this use case speaks to Jonathan’s premise: new, more engaging recommendation engines can be built, in this case to help give patients a sense of which doctors are most respected by their peers. Additionally, the graph data could help hospitals understand the referral patterns associated with poor care coordination, and provide health IT startups with a map of the most plugged-in doctors in each city.
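
As a rough illustration of what "most respected by their peers" could mean on this graph, the sketch below ranks providers by the total weight of referrals pointing at them. Using incoming weight as a stand-in for peer respect is my assumption rather than an established measure, and the edge list is made up.

```python
from collections import defaultdict


def rank_by_incoming_weight(edges, top_n=3):
    # Sum the weight of every edge arriving at each provider.
    totals = defaultdict(int)
    for _source, target, weight in edges:
        totals[target] += weight
    # Highest total first; ties broken by provider id for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# Made-up (referring, referred-to, shared patients) edges.
EDGES = [
    ("A", "B", 30),
    ("C", "B", 20),
    ("A", "D", 45),
    ("B", "D", 10),
    ("C", "A", 5),
]

ranking = rank_by_incoming_weight(EDGES)
```

A real recommender would likely narrow the candidates by specialty and location first, and only then rank them.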

Next steps
Over the next couple of weeks, I’ll be writing on how I used Mortar, Pig and Neo4J to build the updated DocGraph data set.

Chicago Sacred Heart Hospital – Medicare Kickback Scheme

Post author By dave fauth
Post date April 23, 2013

An April 16, 2013 FBI press release carries the headline “Chicago Sacred Heart Hospital Owner, Executive, and Four Doctors Arrested in Alleged Medicare Referral Kickback Conspiracy.”

From the press release:

CHICAGO—The owner and another senior executive of Sacred Heart Hospital and four physicians affiliated with the west side facility were arrested today for allegedly conspiring to pay and receive illegal kickbacks, including more than $225,000 in cash, along with other forms of payment, in exchange for the referral of patients insured by Medicare and Medicaid to the hospital, announced U.S. Attorney for the Northern District of Illinois Gary S. Shapiro.
…
Arrested were Edward J. Novak, 58, of Park Ridge, Sacred Heart’s owner and chief executive officer since the late 1990s; Roy M. Payawal, 64, of Burr Ridge, executive vice president and chief financial officer since the early 2000s; and Drs. Venkateswara R. “V.R.” Kuchipudi, 66, of Oak Brook, Percy Conrad May, Jr., 75, of Chicago, Subir Maitra, 73, of Chicago, and Shanin Moshiri, 57, of Chicago.

DocGraph Data
I wanted to see what the graph of these doctors looked like in the DocGraph dataset. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight, which is the number of Medicare patients the two providers shared in a 30-day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

Hadoop Data Processing
Using Mortar for online Hadoop processing and Amazon S3 for storage and access to the data, I wrote a Pig script that filters the DocGraph data to the rows where any of the accused were the referring doctors, joins them to the National Provider registry and writes the data out to an S3 bucket.

medGraphData = LOAD 's3n://medgraph/refer.2011.csv' USING PigStorage(',') AS
(primaryProvider:chararray,
referredDoctor: chararray,
qtyReferred:chararray);

nucc_codes = LOAD 's3n://NUCC-Taxonomy/nucc_taxonomy_130.txt' USING PigStorage('\t') AS
(nuccCode:chararray,
nuccType:chararray,
nuccClassification:chararray,
nuccSpecialty:chararray);

-- Load NPI Data
npiData = LOAD 's3n://NPIData/npidata_20050523-20130113.csv' USING PigStorage(',') AS
(NPICode:chararray,
f2:chararray,
f3:chararray,
f4:chararray,
f5:chararray,
f6:chararray,
f7:chararray,
f8:chararray,
f9:chararray,
f10:chararray,
f11:chararray,
f12:chararray,
f13:chararray,
f14:chararray,
f15:chararray,
f16:chararray,
f17:chararray,
f18:chararray,
f19:chararray,
f20:chararray,
f21:chararray,
f22:chararray,
f23:chararray,
f24:chararray,
f25:chararray,
f26:chararray,
f27:chararray,
f28:chararray,
f29:chararray,
f30:chararray,
f31:chararray,
f32:chararray,
f33:chararray,
f34:chararray,
f35:chararray,
f36:chararray,
f37:chararray,
f38:chararray,
f39:chararray,
f40:chararray,
f41:chararray,
f42:chararray,
f43:chararray,
f44:chararray,
f45:chararray,
f46:chararray,
f47:chararray,
f48:chararray,
f49:chararray);

chicagoSacredHeartHosp = FILTER medGraphData BY (referredDoctor == '1003163122' OR referredDoctor == '1760730063');

chicagoSacredHeartHospPrimary = FILTER medGraphData BY (primaryProvider == '1003163122' OR primaryProvider == '1760730063');

docFraud = FILTER medGraphData BY (primaryProvider == '1598896086' OR primaryProvider == '1003450178' OR primaryProvider == '1255463576' OR primaryProvider == '1588694343' OR primaryProvider == '1265492128');

--chicagoDocs = FILTER npiData BY ((f23 == '"CHICAGO"' OR f31 == '"CHICAGO"' ) AND f29 matches '.*3240.*');
out = FOREACH npiData GENERATE REPLACE(NPICode,'\\"','') as newNPICode, 
REPLACE(f5, '\\"','') as orgName,
REPLACE(f6, '\\"','') as orgLastName,
REPLACE(f7, '\\"', '') as firstName, 
REPLACE(f21, '\\"','') as docAddra1,
REPLACE(f22, '\\"','') as docAddra2,
REPLACE(f23, '\\"','') as docCity1,
REPLACE(f29, '\\"','') as docAddr1,
REPLACE(f30, '\\"','') as docAddr2,
REPLACE(f31, '\\"','') as docCity,
REPLACE(f32, '\\"','') as docState,
REPLACE(f33, '\\"','') as docPostalCode,
REPLACE(f48, '\\"','') as taxonomyCode;

docFraudSacredHeart = JOIN docFraud BY (referredDoctor), out BY newNPICode;

rmf s3n://DataOut/DocGraph/ChicagoDocs;
rmf s3n://DataOut/DocGraph/ChicagoMedicareFraud;
rmf s3n://DataOut/DocGraph/docFraud;
rmf s3n://DataOut/DocGraph/docFraudSacredHeart;

--STORE sampleJoinedVADoc INTO 's3n://DataOut/DocGraph/DocHosp' USING PigStorage('|');
--STORE out INTO 's3n://DataOut/DocGraph/ChicagoDocs' USING PigStorage('|');
STORE chicagoSacredHeartHospPrimary INTO 's3n://DataOut/DocGraph/ChicagoMedicareFraud' USING PigStorage('|');
STORE docFraud INTO 's3n://DataOut/DocGraph/docFraud' USING PigStorage('|');
STORE docFraudSacredHeart INTO 's3n://DataOut/DocGraph/docFraudSacredHeart' USING PigStorage('|');
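
For readers who do not use Pig, the core of the script above (filter the referral rows to a set of NPIs, then join the other side of each referral to the provider registry) can be sketched in a few lines of Python. The function name, the second sample row and the registry entry are placeholders of mine. The NPI set is copied from the docFraud filter.

```python
# NPIs taken from the docFraud FILTER in the Pig script.
FILTER_NPIS = {"1598896086", "1003450178", "1255463576", "1588694343", "1265492128"}


def referrals_with_names(referral_rows, registry):
    # Keep rows whose first NPI is in the filter set, then inner-join the
    # second NPI against the registry lookup (NPI -> name).
    for primary, referred, qty in referral_rows:
        if primary in FILTER_NPIS and referred in registry:
            yield (primary, referred, qty, registry[referred])


# First row matches the results reported in this post; the second is made up.
ROWS = [
    ("1598896086", "1558367656", "2495"),
    ("1000000001", "1558367656", "12"),
]

# Placeholder name; the real value would come from the NPI registry file.
REGISTRY = {"1558367656": "HOSPITAL NAME PLACEHOLDER"}

matches = list(referrals_with_names(ROWS, REGISTRY))
```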

Data Results
Looking at the data results, three of the doctors made referrals to Sacred Heart.

Doctor          Doctor NPI   Hospital NPI   Number of Referrals
Dr. Maitra      1598896086   1558367656     2495
Dr. Kuchipudi   1265492128   1558367656     1171
Dr. May         1588694343   1558367656      417

Visualization
Using Gephi, I was able to visualize the referrals for these three doctors.

While this doesn’t provide a detailed look into the fraud, it does show there were referrals made to Sacred Heart.
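
One simple way to get referral rows like these into Gephi is an edge list CSV with Source, Target and Weight columns, which Gephi's spreadsheet import understands. The sketch below builds that file content from the three rows in the results table. It is not necessarily how the original visualization was prepared, and the function name is my own.

```python
import csv
import io

# (doctor NPI, hospital NPI, number of referrals) from the results table.
REFERRALS = [
    ("1598896086", "1558367656", 2495),
    ("1265492128", "1558367656", 1171),
    ("1588694343", "1558367656", 417),
]


def to_gephi_edge_csv(edges):
    # Write a header row plus one line per edge and return the CSV text.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()


edge_csv = to_gephi_edge_csv(REFERRALS)
```

Saving `edge_csv` to a file and importing it as an edges table gives a weighted graph with the hospital as the shared target node.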

DocGraph Analysis using Hadoop and D3.JS

Post author By dave fauth
Post date February 19, 2013

Visualizing the DocGraph for Wyoming Medicare Providers

I have been participating in the DocGraph MedStartr project. After hearing about the project at GraphConnect 2012, I wanted to use this data to investigate additional capabilities of Hadoop and BigData processing. You can read some excellent work already being done on this data here courtesy of Janos. Ryan Weald has some great work on visualizing geographic connections between doctors here as well.

You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight, which is the number of Medicare patients the two providers shared in a 30-day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

In this example, I want to use MortarData (Hadoop in the cloud) to combine Census data, DocGraph data, the NPPES database and the National Uniform Claim Committee (NUCC) provider taxonomy codes. The desired outcome is to compare the referrals between taxonomy codes for the entire State of Virginia and the areas of Virginia with a population of less than 25,000.

Mortar Data
Mortar is Hadoop in the cloud: an on-demand, wickedly scalable platform for big data. Start your work in the browser with zero install. Or, if you need more control, use Mortar from your own machine, in your own development environment.

Mortar is listed in GigaOM’s “12 big data tools you need to know” and was named one of the “10 Coolest Big Data Products Of 2012.”

Approach
Using Hadoop and Pig, I am going to use the following approach:

1. Load up the four data sets.
2. Filter the NPI data from NPPES by the provider’s state.
3. Filter the State Data by the desired population.
4. Join both the primary and the referring doctors to the NPI/NPPES/Census data.
5. Carve out the Primary doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
6. Carve out the Referring doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
7. Carve out the primary and referring doctors, count the number of primary referrals and then link the taxonomy codes to both the primary and referring doctors.
8. Export the data out for future visualization.
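
The middle steps of this approach (filter providers by state and population, then group by NUCC code and count) can be sketched in plain Python before writing the Pig. Everything below is hypothetical: the provider records, place names, population figures and function names are made up to show the shape of the computation.

```python
from collections import Counter


def select_providers(providers, populations, state, max_population):
    # Steps 2-3: keep providers in the given state whose place is below the cutoff.
    return {
        npi: info
        for npi, info in providers.items()
        if info["state"] == state
        and info["place"] in populations
        and populations[info["place"]] < max_population
    }


def taxonomy_counts(referrals, selected, side):
    # Steps 5-6: count NUCC codes on one side of the referral
    # (side 0 = primary doctor, side 1 = referring doctor).
    return Counter(
        selected[row[side]]["taxonomy"] for row in referrals if row[side] in selected
    )


# Made-up inputs.
PROVIDERS = {
    "1": {"state": "WY", "place": "Smalltown", "taxonomy": "207Q00000X"},
    "2": {"state": "WY", "place": "Bigcity", "taxonomy": "207R00000X"},
    "3": {"state": "WY", "place": "Smalltown", "taxonomy": "207R00000X"},
    "4": {"state": "VA", "place": "Othertown", "taxonomy": "207Q00000X"},
}
POPULATIONS = {"Smalltown": 7500, "Bigcity": 60000, "Othertown": 7000}
REFERRALS = [("1", "3", 10), ("1", "2", 5), ("3", "1", 7), ("4", "1", 2)]

selected = select_providers(PROVIDERS, POPULATIONS, state="WY", max_population=25000)
primary_counts = taxonomy_counts(REFERRALS, selected, side=0)
referring_counts = taxonomy_counts(REFERRALS, selected, side=1)
```

Because the state and the population cutoff are plain arguments here, changing them is a one-line edit.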

Why Mortar Data and Hadoop
Using Hadoop, Pig and Mortar’s platform, I have several advantages:
1. I can store all of the data files as flat files in an Amazon S3 store. I don’t need a dedicated server.
2. I can spin up as many Hadoop clusters as I need in a short time.
3. I can write Pig code to do data processing, joins, filters, etc. that work on the data.
4. I can add in Python libraries and code to supplement the Pig.
5. I can add parameters and change the state and population on the fly.

You can see the Mortar Web interface here:

Visualization
I plan on using the D3.js library to create some visualizations. One example visualization I am working on is a Hierarchical Edge Bundling chart. You can see the initial prototype here. I still need to fill in all of the links.
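
For reference, the widely circulated D3 hierarchical edge bundling example reads a JSON list in which each node has a dotted "name" (the dots define the hierarchy) and an "imports" list of the nodes it links to. Assuming the prototype follows that convention, here is a small sketch that turns taxonomy-to-taxonomy links into that shape. The node names are invented for illustration.

```python
import json


def to_edge_bundling_nodes(links):
    # Collect outgoing links per node; nodes that only receive links still appear.
    imports = {}
    for source, target in links:
        imports.setdefault(source, set()).add(target)
        imports.setdefault(target, set())
    return [
        {"name": name, "imports": sorted(imports[name])}
        for name in sorted(imports)
    ]


# Invented dotted names: group.subgroup.leaf
LINKS = [
    ("nucc.physician.family_medicine", "nucc.physician.cardiology"),
    ("nucc.physician.family_medicine", "nucc.hospital.general_acute_care"),
    ("nucc.physician.cardiology", "nucc.hospital.general_acute_care"),
]

nodes = to_edge_bundling_nodes(LINKS)
nodes_json = json.dumps(nodes, indent=2)
```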

Campaign Data Analysis Video

Post author By dave fauth
Post date December 20, 2012

As a wrap-up on the Campaign Analysis that I presented at GraphConnect, I decided to make a video showing the use of Mortar Data, Neo4J and D3.js.

Mortar is listed in GigaOM’s “12 big data tools you need to know” and was named one of the “10 Coolest Big Data Products Of 2012.”

Neo4J
Neo4j is an open-source, high-performance NoSQL graph database, optimized for superfast graph traversals. With Neo4J, you can easily model and change the full complexity of your system.

Neo4J was listed as a “big data vendor to watch in 2013” by Infoworld.

D3.JS
D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.

Graph Connect Presentation on Federal Election Campaign Data

Post author By dave fauth
Post date December 7, 2012

The video of my GraphConnect presentation on the Federal Election Campaign Data is located here.