Intelliwareness

Blog on Big Data, Data Analytics and Other IT


Building an Enhanced DocGraph Dataset using Mortar (Hadoop) and Neo4J

August 19, 2013 by dave fauth


“The average doctor has likely never heard of Fred Trotter, but he has some provocative ideas about using physician data to change how healthcare gets delivered.” This was from a recent Gigaom article. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.
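To get a feel for the shape of the data, here is a minimal Python sketch (the local sample file name is hypothetical) that totals outgoing referral weight per provider:

```python
import csv
from collections import Counter

# Each row of the DocGraph file has three columns: referring NPI,
# referred-to NPI, and the shared Medicare patient count in a
# 30-day forward window. "refer.2011.sample.csv" is a made-up
# local sample path, not the actual dataset location.
def top_referrers(path, n=5):
    """Return the n NPIs with the highest total outgoing referral weight."""
    totals = Counter()
    with open(path, newline="") as f:
        for primary, referred, weight in csv.reader(f):
            totals[primary] += int(weight)
    return totals.most_common(n)
```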

You can read some excellent work already being done on this data here courtesy of Janos. Ryan Weald has some great work on visualizing geographic connections between doctors here as well.

The current DocGraph social graph was built in Neo4J. With the new enhancements in Neo4J 2.0 (primarily labels), it was a good time to rebuild the social graph and add data about each doctor, their specialties, and their locations. Finally, I’ve added some census income data at the zip code level, so researchers could look at economic indicators to see whether there are discernible economic patterns in the referrals.
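One wrinkle in the zip-level join is worth sketching (function and field names below are my own, not from the actual pipeline): NPPES practice postal codes are often nine-digit ZIP+4 strings, while census income tables are keyed on five-digit codes, so the code must be truncated before the lookup.

```python
# Hypothetical sketch of attaching zip-level census income to providers.
# providers: iterable of (npi, postal_code); income_by_zip: 5-digit zip -> income.
def attach_income(providers, income_by_zip):
    enriched = []
    for npi, postal_code in providers:
        zip5 = postal_code[:5]  # collapse ZIP+4 to the 5-digit code
        enriched.append((npi, zip5, income_by_zip.get(zip5)))
    return enriched
```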

In this series of blog posts, I will walk through the process of building the updated DocGraph graph in Neo4J using Hadoop, followed by the Neo4J batch inserter.

Building the Import Documents

One of the goals of the project was to learn Pig in combination with Hadoop to process the large files. I could easily have worked in MySQL or Oracle, but I also wanted an easy way to run jobs on large data sets.

My friends at Mortar have a great platform for leveraging Hadoop, Pig and Python. Mortar is the fastest and easiest way to work with Pig and Python on Hadoop. Mortar’s platform is for everything from joining and cleansing large data sets to machine learning and building recommender systems.
Mortar makes it easy for developers and data scientists to do powerful work with Hadoop. The main advantages of Mortar are:

  • Zero Setup Time: Mortar takes only minutes to set up (or no time at all on the web), and you can start running Pig jobs immediately. No need for painful installation or configuration.
  • Powerful Tooling: Mortar provides a rich suite of tools to aid in Pig development, including the ability to Illustrate a script before running it, and an extremely fast and free local development mode.
  • Elastic Clusters: We spin up Hadoop clusters as you need them, so you don’t have to predict your needs in advance, and you don’t pay for machines you don’t use.
  • Solid Support: Whether the issue is in your script or in Hadoop, we’ll help you figure out a solution.

Data Sets

One great thing about this data is that you can combine the DocGraph data with other data sets. For example, we can combine NPPES data with the DocGraph data. The NPPES is the federal registry for NPI numbers and associated provider information.

To create the data sets for ingest into Neo4J, we are going to combine Census data, DocGraph data, the NPPES database, and the National Uniform Claim Committee (NUCC) provider taxonomy codes.

Pig Scripts
Using Pig scripts, I was able to create several data files that could then be loaded into Neo4J.


-- Load the DocGraph referral data
medGraphData = LOAD 's3n://medgraph/refer.2011.csv' USING PigStorage(',') AS
(primaryProvider:chararray,
referredDoctor: chararray,
qtyReferred:chararray);

-- Load the Classification/Specialty Codes
nucc_codes = LOAD 's3n://NUCC-Taxonomy/nucc_taxonomy_130.txt' USING PigStorage('\t') AS
(nuccCode:chararray,
nuccType:chararray,
nuccClassification:chararray,
nuccSpecialty:chararray);

-- Load NPI Data
npiData = LOAD 's3n://NPIData/npidata_20050523-20130512.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS
(NPI:chararray,
Entity_Type_Code:chararray,
Replacement_NPI:chararray,
Employer_Identification_Number:chararray,
Provider_Organization_Name:chararray,
Provider_Last_Name:chararray,
Provider_First_Name:chararray,
Provider_Middle_Name:chararray,
Provider_Name_Prefix_Text:chararray,
Provider_Name_Suffix_Text:chararray,
Provider_Credential_Text:chararray,
Provider_Other_Organization_Name:chararray,
Provider_Other_Organization_Name_Type_Code:chararray,
Provider_Other_Last_Name:chararray,
Provider_Other_First_Name:chararray,
Provider_Other_Middle_Name:chararray,
Provider_Other_Name_Prefix_Text:chararray,
Provider_Other_Name_Suffix_Text:chararray,
Provider_Other_Credential_Text:chararray,
Provider_Other_Last_Name_Type_Code:chararray,
Provider_First_Line_Business_Mailing_Address:chararray,
Provider_Second_Line_Business_Mailing_Address:chararray,
Provider_Business_Mailing_Address_City_Name:chararray,
Provider_Business_Mailing_Address_State_Name:chararray,
Provider_Business_Mailing_Address_Postal_Code:chararray,
Provider_Business_Mailing_Address_Country_Code:chararray,
Provider_Business_Mailing_Address_Telephone_Number:chararray,
Provider_Business_Mailing_Address_Fax_Number:chararray,
Provider_First_Line_Business_Practice_Location_Address:chararray,
Provider_Second_Line_Business_Practice_Location_Address:chararray,
Provider_Business_Practice_Location_Address_City_Name:chararray,
Provider_Business_Practice_Location_Address_State_Name:chararray,
Provider_Business_Practice_Location_Address_Postal_Code:chararray,
Provider_Business_Practice_Location_Address_Country_Code:chararray,
Provider_Business_Practice_Location_Address_Telephone_Number:chararray,
Provider_Business_Practice_Location_Address_Fax_Number:chararray,
Provider_Enumeration_Date:chararray,
Last_Update_Date:chararray,
NPI_Deactivation_Reason_Code:chararray,
NPI_Deactivation_Date:chararray,
NPI_Reactivation_Date:chararray,
Provider_Gender_Code:chararray,
Authorized_Official_Last_Name:chararray,
Authorized_Official_First_Name:chararray,
Authorized_Official_Middle_Name:chararray,
Authorized_Official_Title_or_Position:chararray,
Authorized_Official_Telephone_Number:chararray,
Healthcare_Provider_Taxonomy_Code_1:chararray,
Provider_License_Number_1:chararray,
Provider_License_Number_State_Code_1:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_1:chararray,
Healthcare_Provider_Taxonomy_Code_2:chararray,
Provider_License_Number_2:chararray,
Provider_License_Number_State_Code_2:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_2:chararray,
Healthcare_Provider_Taxonomy_Code_3:chararray,
Provider_License_Number_3:chararray,
Provider_License_Number_State_Code_3:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_3:chararray,
Healthcare_Provider_Taxonomy_Code_4:chararray,
Provider_License_Number_4:chararray,
Provider_License_Number_State_Code_4:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_4:chararray,
Healthcare_Provider_Taxonomy_Code_5:chararray,
Provider_License_Number_5:chararray,
Provider_License_Number_State_Code_5:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_5:chararray,
Healthcare_Provider_Taxonomy_Code_6:chararray,
Provider_License_Number_6:chararray,
Provider_License_Number_State_Code_6:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_6:chararray,
Healthcare_Provider_Taxonomy_Code_7:chararray,
Provider_License_Number_7:chararray,
Provider_License_Number_State_Code_7:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_7:chararray,
Healthcare_Provider_Taxonomy_Code_8:chararray,
Provider_License_Number_8:chararray,
Provider_License_Number_State_Code_8:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_8:chararray,
Healthcare_Provider_Taxonomy_Code_9:chararray,
Provider_License_Number_9:chararray,
Provider_License_Number_State_Code_9:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_9:chararray,
Healthcare_Provider_Taxonomy_Code_10:chararray,
Provider_License_Number_10:chararray,
Provider_License_Number_State_Code_10:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_10:chararray,
Healthcare_Provider_Taxonomy_Code_11:chararray,
Provider_License_Number_11:chararray,
Provider_License_Number_State_Code_11:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_11:chararray,
Healthcare_Provider_Taxonomy_Code_12:chararray,
Provider_License_Number_12:chararray,
Provider_License_Number_State_Code_12:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_12:chararray,
Healthcare_Provider_Taxonomy_Code_13:chararray,
Provider_License_Number_13:chararray,
Provider_License_Number_State_Code_13:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_13:chararray,
Healthcare_Provider_Taxonomy_Code_14:chararray,
Provider_License_Number_14:chararray,
Provider_License_Number_State_Code_14:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_14:chararray,
Healthcare_Provider_Taxonomy_Code_15:chararray,
Provider_License_Number_15:chararray,
Provider_License_Number_State_Code_15:chararray,
Healthcare_Provider_Primary_Taxonomy_Switch_15:chararray,
Other_Provider_Identifier_1:chararray,
Other_Provider_Identifier_Type_Code_1:chararray,
Other_Provider_Identifier_State_1:chararray,
Other_Provider_Identifier_Issuer_1:chararray,
Other_Provider_Identifier_2:chararray,
Other_Provider_Identifier_Type_Code_2:chararray,
Other_Provider_Identifier_State_2:chararray,
Other_Provider_Identifier_Issuer_2:chararray,
Other_Provider_Identifier_3:chararray,
Other_Provider_Identifier_Type_Code_3:chararray,
Other_Provider_Identifier_State_3:chararray,
Other_Provider_Identifier_Issuer_3:chararray,
Other_Provider_Identifier_4:chararray,
Other_Provider_Identifier_Type_Code_4:chararray,
Other_Provider_Identifier_State_4:chararray,
Other_Provider_Identifier_Issuer_4:chararray,
Other_Provider_Identifier_5:chararray,
Other_Provider_Identifier_Type_Code_5:chararray,
Other_Provider_Identifier_State_5:chararray,
Other_Provider_Identifier_Issuer_5:chararray,
Other_Provider_Identifier_6:chararray,
Other_Provider_Identifier_Type_Code_6:chararray,
Other_Provider_Identifier_State_6:chararray,
Other_Provider_Identifier_Issuer_6:chararray,
Other_Provider_Identifier_7:chararray,
Other_Provider_Identifier_Type_Code_7:chararray,
Other_Provider_Identifier_State_7:chararray,
Other_Provider_Identifier_Issuer_7:chararray,
Other_Provider_Identifier_8:chararray,
Other_Provider_Identifier_Type_Code_8:chararray,
Other_Provider_Identifier_State_8:chararray,
Other_Provider_Identifier_Issuer_8:chararray,
Other_Provider_Identifier_9:chararray,
Other_Provider_Identifier_Type_Code_9:chararray,
Other_Provider_Identifier_State_9:chararray,
Other_Provider_Identifier_Issuer_9:chararray,
Other_Provider_Identifier_10:chararray,
Other_Provider_Identifier_Type_Code_10:chararray,
Other_Provider_Identifier_State_10:chararray,
Other_Provider_Identifier_Issuer_10:chararray,
Other_Provider_Identifier_11:chararray,
Other_Provider_Identifier_Type_Code_11:chararray,
Other_Provider_Identifier_State_11:chararray,
Other_Provider_Identifier_Issuer_11:chararray,
Other_Provider_Identifier_12:chararray,
Other_Provider_Identifier_Type_Code_12:chararray,
Other_Provider_Identifier_State_12:chararray,
Other_Provider_Identifier_Issuer_12:chararray,
Other_Provider_Identifier_13:chararray,
Other_Provider_Identifier_Type_Code_13:chararray,
Other_Provider_Identifier_State_13:chararray,
Other_Provider_Identifier_Issuer_13:chararray,
Other_Provider_Identifier_14:chararray,
Other_Provider_Identifier_Type_Code_14:chararray,
Other_Provider_Identifier_State_14:chararray,
Other_Provider_Identifier_Issuer_14:chararray,
Other_Provider_Identifier_15:chararray,
Other_Provider_Identifier_Type_Code_15:chararray,
Other_Provider_Identifier_State_15:chararray,
Other_Provider_Identifier_Issuer_15:chararray,
Other_Provider_Identifier_16:chararray,
Other_Provider_Identifier_Type_Code_16:chararray,
Other_Provider_Identifier_State_16:chararray,
Other_Provider_Identifier_Issuer_16:chararray,
Other_Provider_Identifier_17:chararray,
Other_Provider_Identifier_Type_Code_17:chararray,
Other_Provider_Identifier_State_17:chararray,
Other_Provider_Identifier_Issuer_17:chararray,
Other_Provider_Identifier_18:chararray,
Other_Provider_Identifier_Type_Code_18:chararray,
Other_Provider_Identifier_State_18:chararray,
Other_Provider_Identifier_Issuer_18:chararray,
Other_Provider_Identifier_19:chararray,
Other_Provider_Identifier_Type_Code_19:chararray,
Other_Provider_Identifier_State_19:chararray,
Other_Provider_Identifier_Issuer_19:chararray,
Other_Provider_Identifier_20:chararray,
Other_Provider_Identifier_Type_Code_20:chararray,
Other_Provider_Identifier_State_20:chararray,
Other_Provider_Identifier_Issuer_20:chararray,
Other_Provider_Identifier_21:chararray,
Other_Provider_Identifier_Type_Code_21:chararray,
Other_Provider_Identifier_State_21:chararray,
Other_Provider_Identifier_Issuer_21:chararray,
Other_Provider_Identifier_22:chararray,
Other_Provider_Identifier_Type_Code_22:chararray,
Other_Provider_Identifier_State_22:chararray,
Other_Provider_Identifier_Issuer_22:chararray,
Other_Provider_Identifier_23:chararray,
Other_Provider_Identifier_Type_Code_23:chararray,
Other_Provider_Identifier_State_23:chararray,
Other_Provider_Identifier_Issuer_23:chararray,
Other_Provider_Identifier_24:chararray,
Other_Provider_Identifier_Type_Code_24:chararray,
Other_Provider_Identifier_State_24:chararray,
Other_Provider_Identifier_Issuer_24:chararray,
Other_Provider_Identifier_25:chararray,
Other_Provider_Identifier_Type_Code_25:chararray,
Other_Provider_Identifier_State_25:chararray,
Other_Provider_Identifier_Issuer_25:chararray,
Other_Provider_Identifier_26:chararray,
Other_Provider_Identifier_Type_Code_26:chararray,
Other_Provider_Identifier_State_26:chararray,
Other_Provider_Identifier_Issuer_26:chararray,
Other_Provider_Identifier_27:chararray,
Other_Provider_Identifier_Type_Code_27:chararray,
Other_Provider_Identifier_State_27:chararray,
Other_Provider_Identifier_Issuer_27:chararray,
Other_Provider_Identifier_28:chararray,
Other_Provider_Identifier_Type_Code_28:chararray,
Other_Provider_Identifier_State_28:chararray,
Other_Provider_Identifier_Issuer_28:chararray,
Other_Provider_Identifier_29:chararray,
Other_Provider_Identifier_Type_Code_29:chararray,
Other_Provider_Identifier_State_29:chararray,
Other_Provider_Identifier_Issuer_29:chararray,
Other_Provider_Identifier_30:chararray,
Other_Provider_Identifier_Type_Code_30:chararray,
Other_Provider_Identifier_State_30:chararray,
Other_Provider_Identifier_Issuer_30:chararray,
Other_Provider_Identifier_31:chararray,
Other_Provider_Identifier_Type_Code_31:chararray,
Other_Provider_Identifier_State_31:chararray,
Other_Provider_Identifier_Issuer_31:chararray,
Other_Provider_Identifier_32:chararray,
Other_Provider_Identifier_Type_Code_32:chararray,
Other_Provider_Identifier_State_32:chararray,
Other_Provider_Identifier_Issuer_32:chararray,
Other_Provider_Identifier_33:chararray,
Other_Provider_Identifier_Type_Code_33:chararray,
Other_Provider_Identifier_State_33:chararray,
Other_Provider_Identifier_Issuer_33:chararray,
Other_Provider_Identifier_34:chararray,
Other_Provider_Identifier_Type_Code_34:chararray,
Other_Provider_Identifier_State_34:chararray,
Other_Provider_Identifier_Issuer_34:chararray,
Other_Provider_Identifier_35:chararray,
Other_Provider_Identifier_Type_Code_35:chararray,
Other_Provider_Identifier_State_35:chararray,
Other_Provider_Identifier_Issuer_35:chararray,
Other_Provider_Identifier_36:chararray,
Other_Provider_Identifier_Type_Code_36:chararray,
Other_Provider_Identifier_State_36:chararray,
Other_Provider_Identifier_Issuer_36:chararray,
Other_Provider_Identifier_37:chararray,
Other_Provider_Identifier_Type_Code_37:chararray,
Other_Provider_Identifier_State_37:chararray,
Other_Provider_Identifier_Issuer_37:chararray,
Other_Provider_Identifier_38:chararray,
Other_Provider_Identifier_Type_Code_38:chararray,
Other_Provider_Identifier_State_38:chararray,
Other_Provider_Identifier_Issuer_38:chararray,
Other_Provider_Identifier_39:chararray,
Other_Provider_Identifier_Type_Code_39:chararray,
Other_Provider_Identifier_State_39:chararray,
Other_Provider_Identifier_Issuer_39:chararray,
Other_Provider_Identifier_40:chararray,
Other_Provider_Identifier_Type_Code_40:chararray,
Other_Provider_Identifier_State_40:chararray,
Other_Provider_Identifier_Issuer_40:chararray,
Other_Provider_Identifier_41:chararray,
Other_Provider_Identifier_Type_Code_41:chararray,
Other_Provider_Identifier_State_41:chararray,
Other_Provider_Identifier_Issuer_41:chararray,
Other_Provider_Identifier_42:chararray,
Other_Provider_Identifier_Type_Code_42:chararray,
Other_Provider_Identifier_State_42:chararray,
Other_Provider_Identifier_Issuer_42:chararray,
Other_Provider_Identifier_43:chararray,
Other_Provider_Identifier_Type_Code_43:chararray,
Other_Provider_Identifier_State_43:chararray,
Other_Provider_Identifier_Issuer_43:chararray,
Other_Provider_Identifier_44:chararray,
Other_Provider_Identifier_Type_Code_44:chararray,
Other_Provider_Identifier_State_44:chararray,
Other_Provider_Identifier_Issuer_44:chararray,
Other_Provider_Identifier_45:chararray,
Other_Provider_Identifier_Type_Code_45:chararray,
Other_Provider_Identifier_State_45:chararray,
Other_Provider_Identifier_Issuer_45:chararray,
Other_Provider_Identifier_46:chararray,
Other_Provider_Identifier_Type_Code_46:chararray,
Other_Provider_Identifier_State_46:chararray,
Other_Provider_Identifier_Issuer_46:chararray,
Other_Provider_Identifier_47:chararray,
Other_Provider_Identifier_Type_Code_47:chararray,
Other_Provider_Identifier_State_47:chararray,
Other_Provider_Identifier_Issuer_47:chararray,
Other_Provider_Identifier_48:chararray,
Other_Provider_Identifier_Type_Code_48:chararray,
Other_Provider_Identifier_State_48:chararray,
Other_Provider_Identifier_Issuer_48:chararray,
Other_Provider_Identifier_49:chararray,
Other_Provider_Identifier_Type_Code_49:chararray,
Other_Provider_Identifier_State_49:chararray,
Other_Provider_Identifier_Issuer_49:chararray,
Other_Provider_Identifier_50:chararray,
Other_Provider_Identifier_Type_Code_50:chararray,
Other_Provider_Identifier_State_50:chararray,
Other_Provider_Identifier_Issuer_50:chararray,
Is_Sole_Proprietor:chararray,
Is_Organization_Subpart:chararray,
Parent_Organization_LBN:chararray,
Parent_Organization_TIN:chararray,
Authorized_Official_Name_Prefix_Text:chararray,
Authorized_Official_Name_Suffix_Text:chararray,
Authorized_Official_Credential_Text:chararray,
Healthcare_Provider_Taxonomy_Group_1:chararray,
Healthcare_Provider_Taxonomy_Group_2:chararray,
Healthcare_Provider_Taxonomy_Group_3:chararray,
Healthcare_Provider_Taxonomy_Group_4:chararray,
Healthcare_Provider_Taxonomy_Group_5:chararray,
Healthcare_Provider_Taxonomy_Group_6:chararray,
Healthcare_Provider_Taxonomy_Group_7:chararray,
Healthcare_Provider_Taxonomy_Group_8:chararray,
Healthcare_Provider_Taxonomy_Group_9:chararray,
Healthcare_Provider_Taxonomy_Group_10:chararray,
Healthcare_Provider_Taxonomy_Group_11:chararray,
Healthcare_Provider_Taxonomy_Group_12:chararray,
Healthcare_Provider_Taxonomy_Group_13:chararray,
Healthcare_Provider_Taxonomy_Group_14:chararray,
Healthcare_Provider_Taxonomy_Group_15:chararray
);


-- generate a Provider List and replace any quotes
providerList = foreach npiData generate REPLACE(NPI, '\\"', '') AS npiCode,
       REPLACE(Entity_Type_Code, '\\"','') AS entity_type,
	   REPLACE(Provider_First_Line_Business_Practice_Location_Address, '\\"','') AS address_first_line,
	   REPLACE(Provider_Second_Line_Business_Practice_Location_Address, '\\"','') AS address_second_line,
	   REPLACE(Provider_Business_Practice_Location_Address_City_Name, '\\"','') AS address_city_name,
	   REPLACE(Provider_Business_Practice_Location_Address_State_Name, '\\"','') AS address_state_name,
	   REPLACE(Provider_Business_Practice_Location_Address_Postal_Code, '\\"','') AS address_postal_code,
	   REPLACE(Provider_Business_Practice_Location_Address_Country_Code, '\\"','') AS address_country_code,
	   REPLACE(Provider_Business_Practice_Location_Address_Telephone_Number, '\\"','') AS telephone_number,
	   REPLACE(Provider_Business_Practice_Location_Address_Fax_Number, '\\"','') AS fax_number,
       REPLACE(Provider_Gender_Code, '\\"','') AS gender,
       REPLACE(Provider_Organization_Name, '\\"','') AS ProviderOrgName,
	   REPLACE(Provider_Name_Prefix_Text, '\\"', '') AS ProviderPrefix,
	   REPLACE(Provider_First_Name, '\\"', '') AS ProviderFirstName,
	   REPLACE(Provider_Middle_Name, '\\"', '') AS ProviderMiddleName,
	   REPLACE(Provider_Last_Name, '\\"', '') AS ProviderLastName,
	   REPLACE(Provider_Name_Suffix_Text, '\\"', '') AS ProviderSuffix,
 	   REPLACE(Provider_Credential_Text, '\\"', '') AS ProviderCredential;

-- create list of NPI codes to Parent Organization
parentOrgLBNList = foreach npiData generate REPLACE(Parent_Organization_LBN, '\\"','') as newParentOrgLBN;
hasParentOrgLBNValue = filter parentOrgLBNList by newParentOrgLBN != '';
distinctParentOrgLBNList = distinct hasParentOrgLBNValue;
childHasParentLBN = join npiData by (REPLACE(Parent_Organization_LBN, '\\"','')), distinctParentOrgLBNList BY (newParentOrgLBN);
npiLBN = foreach childHasParentLBN generate NPI as childNPI, Parent_Organization_LBN as newParentOrgLBN;

-- generate unique list of Provider Organization Names
providerSubList = foreach providerList generate ProviderOrgName;
hasProviderOrgName = filter providerSubList by ProviderOrgName != '';
distinctProvider = distinct hasProviderOrgName;
--hasParentOrgLBNValue = filter hasParentOrgLBN by newParentOrgLBN matches '.';
--grpd = group hasParentOrgLBNValue by NPI;
--parentGroupOut = foreach grpd generate group, COUNT(hasParentOrgLBNValue);


--address list
addressList = foreach providerList generate address_city_name, address_state_name, address_country_code;
uniqueLocations = distinct addressList;
grpdAddressList = group addressList by (address_city_name, address_state_name, address_country_code);
addressListCnt = foreach grpdAddressList generate group, COUNT(addressList) as countAddressList;
addressListOrdered = order addressListCnt BY countAddressList;
addressListFlat = foreach addressListOrdered GENERATE FLATTEN(group) as (address_city_name, address_state_name, address_country_code), countAddressList;

--located in
locatedIn = foreach providerList generate npiCode,address_city_name, address_state_name, address_postal_code, address_country_code;

-- doctors' taxonomy listing (some doctors may have multiple taxonomies)
joinedNPITax1 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_1, '\\"', '')), nucc_codes BY (nuccCode);
joinedNPITax2 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_2, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax3 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_3, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax4 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_4, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax5 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_5, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax6 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_6, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax7 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_7, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax8 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_8, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax9 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_9, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax10 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_10, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax11 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_11, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax12 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_12, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax13 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_13, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax14 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_14, '\\"', '')) , nucc_codes BY (nuccCode);
joinedNPITax15 = JOIN npiData BY (REPLACE(Healthcare_Provider_Taxonomy_Code_15, '\\"', '')) , nucc_codes BY (nuccCode);

joinedNPINUCC = UNION joinedNPITax1,joinedNPITax2,joinedNPITax3,joinedNPITax4,joinedNPITax5,joinedNPITax6,
joinedNPITax7,joinedNPITax8,joinedNPITax9,joinedNPITax10,joinedNPITax11,joinedNPITax12,joinedNPITax13,joinedNPITax14,joinedNPITax15;

simpleJoinedNPINUCC = foreach joinedNPINUCC GENERATE REPLACE(NPI, '\\"', '') AS npiCode, nuccCode, nuccType, nuccClassification, nuccSpecialty;

-- unique Specialties
uniqueNUCCCodes = distinct nucc_codes;

-- unique NPICodes
primaryDoc = foreach medGraphData generate primaryProvider;
referredDoc = foreach medGraphData generate referredDoctor;
uniquePrimaryDoc = distinct primaryDoc;
uniqueReferredDoc = distinct referredDoc;
uniqueDocList = union uniquePrimaryDoc, uniqueReferredDoc;

rmf s3n://DataOut/DocGraph/DocGraphNeo4J;
--STORE sampleJoinedVADoc INTO 's3n://DataOut/DocGraph/DocHosp' USING PigStorage('|');
--STORE cnt INTO 's3n://DataOut/DocGraph/DocGraphNeo4J' USING PigStorage('|');
--s3 Work
STORE npiLBN INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/parentOfLink' USING PigStorage('|');
STORE distinctProvider INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/organizations' USING PigStorage('|');
STORE providerList INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/providerList' USING PigStorage('|');
STORE addressListFlat INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/addrList' USING PigStorage('|');
STORE locatedIn INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/locations' USING PigStorage('|');
STORE uniqueLocations INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/uniquelocations' USING PigStorage('|');
STORE uniqueNUCCCodes INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/specialties' USING PigStorage('|');
STORE simpleJoinedNPINUCC INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/specialtiesProviders' USING PigStorage('|');
STORE medGraphData INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/docGraphProviders' USING PigStorage('|');
STORE uniqueDocList INTO 's3n://DataOut/DocGraph/DocGraphNeo4J/uniqueDoctorList' USING PigStorage('|');

--local Development
--STORE parentGroupOut INTO '../DataOut/DocGraph/DocGraphNeo4J' USING PigStorage('|');
--STORE addressListOrdered INTO '../DataOut/DocGraph/DocGraphNeo4J' USING PigStorage('|');
--STORE locatedIn INTO '../DataOut/DocGraph/DocGraphNeo4J' USING PigStorage('|');
--STORE simpleJoinedNPINUCC INTO '../DataOut/DocGraph/DocGraphNeo4J' USING PigStorage('|');
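The STORE statements above write pipe-delimited part files (for example, under providerList/). A small Python sketch, with assumed field names, of how such a file could be read back into dicts for the Neo4J batch inserter:

```python
# Read a pipe-delimited Pig output file into a list of dicts.
# field_names must match the order of the stored relation's columns;
# the names used in the test are illustrative, not the full schema.
def read_pig_output(path, field_names):
    rows = []
    with open(path) as f:
        for line in f:
            values = line.rstrip("\n").split("|")
            rows.append(dict(zip(field_names, values)))
    return rows
```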

Running the Pig Code in Mortar
In the next post, we will look at using Mortar’s framework to run the Pig jobs.

Filed Under: Uncategorized

Recommender Tips, Mortar and DocGraph

August 14, 2013 by dave fauth


Jonathan Packer wrote on Mortar’s blog about flexible recommender models. Jonathan argues that, from a business perspective, the two most salient advantages of graph-based models are flexibility and simplicity.

Some of the salient points made in the article are:

  • Graph-based models are modular and transparent.
  • A simple graph-based model will allow you to build a viable recommender system for your product without delaying its time-to-market.
  • Graphs can be visualized, explained, discussed, and debugged collaboratively in a way that sophisticated machine learning techniques cannot.

Jonathan ends with “My opinion is that the next big advances to be made in recommender systems will be made by combining automated tools with human—possibly crowdsourced—editorial judgement and writing talent. They will be made in finding more engaging ways to present recommendations to users than cloying sidebars and endlessly scrolling lists.”

DocGraph
“The average doctor has likely never heard of Fred Trotter, but he has some provocative ideas about using physician data to change how healthcare gets delivered.” This was from a recent Gigaom article. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

The current DocGraph social graph was built in Neo4J. With the new enhancements in Neo4J 2.0 (primarily labels), it was a good time to rebuild the social graph and add data about each doctor, their specialties, and their locations. Finally, I’ve added some census income data at the zip code level, so researchers could look at economic indicators to see whether there are discernible economic patterns in the referrals.

Recommendation Engine
The combination of the Neo4J social graph, the medical data, and the capability to build a recommendation engine in Mortar makes a compelling use case. I believe this use case supports Jonathan’s premise: engaging recommendation engines can be built to give patients a sense of which doctors are most respected by their peers. Additionally, the graph data could help hospitals understand the referral patterns associated with poor care coordination, and provide health IT startups with a map of the most plugged-in doctors in each city.
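As a toy illustration of the graph-based approach (the NPIs and weights below are made up), a first-cut recommender could simply rank the providers a primary doctor refers to most:

```python
# Minimal graph-based recommender sketch: given a primary doctor's NPI,
# suggest the providers that doctor refers to most often, weighted by
# shared Medicare patient counts. Edges are (referring_npi,
# referred_npi, weight) tuples; all values are illustrative.
def recommend_specialists(edges, primary_npi, n=3):
    candidates = [(weight, dst) for src, dst, weight in edges if src == primary_npi]
    candidates.sort(reverse=True)  # heaviest shared-patient counts first
    return [npi for _, npi in candidates[:n]]
```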

Next steps
Over the next couple of weeks, I’ll be writing on how I used Mortar, Pig and Neo4J to build the updated DocGraph data set.


Chicago Sacred Heart Hospital – Medicare Kickback Scheme

April 23, 2013 by dave fauth


According to an April 16, 2013 FBI press release, Chicago Sacred Heart Hospital Owner, Executive, and Four Doctors Arrested in Alleged Medicare Referral Kickback Conspiracy.

From the press release:

CHICAGO—The owner and another senior executive of Sacred Heart Hospital and four physicians affiliated with the west side facility were arrested today for allegedly conspiring to pay and receive illegal kickbacks, including more than $225,000 in cash, along with other forms of payment, in exchange for the referral of patients insured by Medicare and Medicaid to the hospital, announced U.S. Attorney for the Northern District of Illinois Gary S. Shapiro.
…
Arrested were Edward J. Novak, 58, of Park Ridge, Sacred Heart’s owner and chief executive officer since the late 1990s; Roy M. Payawal, 64, of Burr Ridge, executive vice president and chief financial officer since the early 2000s; and Drs. Venkateswara R. “V.R.” Kuchipudi, 66, of Oak Brook, Percy Conrad May, Jr., 75, of Chicago, Subir Maitra, 73, of Chicago, and Shanin Moshiri, 57, of Chicago.

DocGraph Data
I wanted to see what the graph of these doctors looked like in the DocGraph dataset. You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

Hadoop Data Processing
Using Mortar for online Hadoop processing and Amazon S3 for storage and access to the data, I wrote a Pig script that filters the DocGraph data to the rows where any of the accused were the referring doctors, joins them to the National Provider registry, and writes the data out to an S3 bucket.

medGraphData = LOAD 's3n://medgraph/refer.2011.csv' USING PigStorage(',') AS
(primaryProvider:chararray,
referredDoctor: chararray,
qtyReferred:chararray);

nucc_codes = LOAD 's3n://NUCC-Taxonomy/nucc_taxonomy_130.txt' USING PigStorage('\t') AS
(nuccCode:chararray,
nuccType:chararray,
nuccClassification:chararray,
nuccSpecialty:chararray);

-- Load NPI Data
npiData = LOAD 's3n://NPIData/npidata_20050523-20130113.csv' USING PigStorage(',') AS
(NPICode:chararray,
f2:chararray,
f3:chararray,
f4:chararray,
f5:chararray,
f6:chararray,
f7:chararray,
f8:chararray,
f9:chararray,
f10:chararray,
f11:chararray,
f12:chararray,
f13:chararray,
f14:chararray,
f15:chararray,
f16:chararray,
f17:chararray,
f18:chararray,
f19:chararray,
f20:chararray,
f21:chararray,
f22:chararray,
f23:chararray,
f24:chararray,
f25:chararray,
f26:chararray,
f27:chararray,
f28:chararray,
f29:chararray,
f30:chararray,
f31:chararray,
f32:chararray,
f33:chararray,
f34:chararray,
f35:chararray,
f36:chararray,
f37:chararray,
f38:chararray,
f39:chararray,
f40:chararray,
f41:chararray,
f42:chararray,
f43:chararray,
f44:chararray,
f45:chararray,
f46:chararray,
f47:chararray,
f48:chararray,
f49:chararray);

chicagoSacredHeartHosp = FILTER medGraphData BY (referredDoctor == '1003163122' OR referredDoctor == '1760730063');

chicagoSacredHeartHospPrimary = FILTER medGraphData BY (primaryProvider == '1003163122' OR primaryProvider == '1760730063');

docFraud = FILTER medGraphData BY (primaryProvider == '1598896086' OR primaryProvider == '1003450178' OR primaryProvider == '1255463576' OR primaryProvider == '1588694343' OR primaryProvider == '1265492128');

--chicagoDocs = FILTER npiData BY ((f23 == '"CHICAGO"' OR f31 == '"CHICAGO"' ) AND f29 matches '.*3240.*');
out = FOREACH npiData GENERATE REPLACE(NPICode,'\\"','') as newNPICode, 
REPLACE(f5, '\\"','') as orgName,
REPLACE(f6, '\\"','') as orgLastName,
REPLACE(f7, '\\"', '') as firstName, 
REPLACE(f21, '\\"','') as docAddra1,
REPLACE(f22, '\\"','') as docAddra2,
REPLACE(f23, '\\"','') as docCity1,
REPLACE(f29, '\\"','') as docAddr1,
REPLACE(f30, '\\"','') as docAddr2,
REPLACE(f31, '\\"','') as docCity,
REPLACE(f32, '\\"','') as docState,
REPLACE(f33, '\\"','') as docPostalCode,
REPLACE(f48, '\\"','') as taxonomyCode;

docFraudSacredHeart = JOIN docFraud BY (referredDoctor), out BY newNPICode;

rmf s3n://DataOut/DocGraph/ChicagoDocs;
rmf s3n://DataOut/DocGraph/ChicagoMedicareFraud;
rmf s3n://DataOut/DocGraph/docFraud;
rmf s3n://DataOut/DocGraph/docFraudSacredHeart;

--STORE sampleJoinedVADoc INTO 's3n://DataOut/DocGraph/DocHosp' USING PigStorage('|');
--STORE out INTO 's3n://DataOut/DocGraph/ChicagoDocs' USING PigStorage('|');
STORE chicagoSacredHeartHospPrimary INTO 's3n://DataOut/DocGraph/ChicagoMedicareFraud' USING PigStorage('|');
STORE docFraud INTO 's3n://DataOut/DocGraph/docFraud' USING PigStorage('|');
STORE docFraudSacredHeart INTO 's3n://DataOut/DocGraph/docFraudSacredHeart' USING PigStorage('|');
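The pipe-delimited output from the STORE statements can be spot-checked locally before any further analysis. Here is a minimal Python sketch of that check; the sample rows are stand-ins for the first three fields of the docFraudSacredHeart output, using the referral counts reported below:

```python
import csv
from collections import Counter

# Stand-ins for the first three fields of the pipe-delimited output
# (primaryProvider|referredDoctor|qtyReferred|...).
rows = [
    "1598896086|1558367656|2495",
    "1265492128|1558367656|1171",
    "1588694343|1558367656|417",
]

# Tally referral volume per referring doctor.
totals = Counter()
for primary, referred, qty in csv.reader(rows, delimiter="|"):
    totals[primary] += int(qty)

for npi, qty in totals.most_common():
    print(npi, qty)
```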

Data Results
Looking at the data results, three of the doctors made referrals to Sacred Heart.

Doctor          NPI           Hospital NPI   Nbr Referrals
Dr. Maitra      1598896086    1558367656     2495
Dr. Kuchipudi   1265492128    1558367656     1171
Dr. May         1588694343    1558367656      417

Visualization
Using Gephi, I was able to visualize the referrals for these three doctors.


While this doesn’t provide a detailed look into the fraud, it does show there were referrals made to Sacred Heart.

Filed Under: Uncategorized

DocGraph Analysis using Hadoop and D3.JS

February 19, 2013 by dave fauth Leave a Comment


Visualizing the DocGraph for Wyoming Medicare Providers

I have been participating in the DocGraph MedStartr project. After hearing about the project at GraphConnect 2012, I wanted to use this data to investigate additional capabilities of Hadoop and BigData processing. You can read some excellent work already being done on this data here courtesy of Janos. Ryan Weald has some great work on visualizing geographic connections between doctors here as well.

You can read more details about DocGraph from Fred Trotter’s post. The basic data set is just three columns: two separate NPI numbers (National Provider Identifier) and a weight which is the shared number of Medicare patients in a 30 day forward window. The data is from calendar year 2011 and contains 49,685,810 relationships between 940,492 different Medicare providers.

One great thing about this data is that you can combine the DocGraph data with other data sets. For example, we can combine NPPES data with the DocGraph data. The NPPES is the federal registry for NPI numbers and associated provider information. Additionally, you can bring in other data such as Census data and Virginia state information.

In this example, I want to use Mortar Data (Hadoop in the cloud) to combine Census data, DocGraph data, the NPPES database and the National Uniform Claim Committee (NUCC) provider taxonomy codes. The desired outcome is to compare the referrals between taxonomy codes for the entire state of Virginia against the areas of Virginia with a population of less than 25,000.
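At the core, this mashup is a series of joins on NPI number and taxonomy code. A toy Python sketch of that join follows; every record below is fabricated for illustration, not a real provider:

```python
# Minimal sketch of the NPI join at the heart of this data mashup.
# All records below are fabricated examples, not real providers.
docgraph = [
    ("1111111111", "2222222222", 42),  # (primary NPI, referred NPI, shared patients)
]
nppes = {
    "1111111111": {"state": "VA", "taxonomy": "207Q00000X"},
    "2222222222": {"state": "VA", "taxonomy": "282N00000X"},
}
nucc = {"207Q00000X": "Family Medicine", "282N00000X": "General Acute Care Hospital"}

# Join each referral's two NPIs to NPPES, then resolve taxonomy codes via NUCC.
joined = [
    (nucc[nppes[p]["taxonomy"]], nucc[nppes[r]["taxonomy"]], w)
    for (p, r, w) in docgraph
    if p in nppes and r in nppes
]
print(joined)  # [('Family Medicine', 'General Acute Care Hospital', 42)]
```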

Mortar Data
Mortar is Hadoop in the cloud—an on-demand, wickedly scalable platform
for big data. Start your work in the browser—zero install. Or if you need more control, use Mortar from your own machine, in your own development environment.

Mortar is listed in GigaOM’s “12 big data tools you need to know” and among the “10 Coolest Big Data Products of 2012”.

Approach
Using Hadoop and Pig, I am going to use the following approach:

1. Load up the four data sets.
2. Filter the NPI data from NPPES by the provider’s state.
3. Filter the State Data by the desired population.
4. Join both the primary and the referring doctors to the NPI/NPPES/Census data.
5. Carve out the Primary doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
6. Carve out the Referring doctors. Group by the NUCC code and count the number of each NUCC taxonomy code.
7. Carve out the primary and referring doctors, count the number of primary referrals and then link the taxonomy codes to both the primary and referring doctors.
8. Export the data out for future visualization.

Why Mortar Data and Hadoop
Using Hadoop, Pig and Mortar’s platform, I have several advantages:
1. I can store all of the data files as flat files in an Amazon S3 store. I don’t need a dedicated server.
2. I can spin up as many Hadoop clusters as I need in a short time.
3. I can write Pig code to do data processing, joins, filters, etc. that work on the data.
4. I can add in Python libraries and code to supplement the Pig.
5. I can add parameters and change the state and population on the fly.

You can see the Mortar web interface here:
(screenshot: Mortar web interface, Feb 19, 2013)

Visualization
I plan on using the D3.js library to create some visualizations. One example visualization I am working on is a Hierarchical Edge Bundling chart. You can see the initial prototype here. I still need to fill in all of the links.


Campaign Data Analysis Video

December 20, 2012 by dave fauth Leave a Comment


As a wrap-up on the Campaign Analysis that I presented at GraphConnect, I decided to make a video showing the usage of Mortar Data, Neo4J and D3 JS.

Mortar Data
Mortar is Hadoop in the cloud—an on-demand, wickedly scalable platform
for big data. Start your work in the browser—zero install. Or if you need more control, use Mortar from your own machine, in your own development environment.

Mortar is listed in GigaOM’s “12 big data tools you need to know” and among the “10 Coolest Big Data Products of 2012”.

Neo4J
Neo4j is an open-source, high-performance, NoSQL graph database, optimized for superfast graph traversals. With Neo4J, you can easily model and change the full complexity of your system.

Neo4J was listed as a “big data vendor to watch in 2013” by Infoworld.

D3.JS
D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.


Graph Connect Presentation on Federal Election Campaign Data

December 7, 2012 by dave fauth Leave a Comment


The video of my GraphConnect presentation on the Federal Election Campaign Data is located here.


GraphConnect – Graphies

November 18, 2012 by dave fauth Leave a Comment


GraphConnect was held two weeks ago (Nov 5-6) in San Francisco. GraphConnect was put together by Neo4J as an initial conference on graph databases.

Part of the conference was a contest on building innovative graphs. The “Graphies” recognized and celebrated individuals and teams developing innovative and impressive graph database applications. Additionally, there was a Twitter contest called #ShowMeYourGraph, which challenged entrants to tap into their creative side and show their most creative graph using any online format: graph visualization, photo, video, blog, etc.

Two of my submissions were SuperPac contributions and SuperPac contributions with labels. The rest of this post will talk about how I created the two charts.

SuperPac Contributions
The Sunlight Foundation provided the SuperPac expenditure data for the 2012 campaign. The data was in a csv format.

For this example, I wanted to look at contributions from September 1 through October 2012 as well as look only at the Presidential candidates. This would show which SuperPacs were funding or opposing candidates. Using some crazy Excel skills, I was able to identify the unique SuperPacs, their support or opposition, and the candidate they were supporting or opposing.

Visualization using D3.js

After finding a good example of a force node graph in D3.js, I was able to create the JSON file for the data. In order to create the different colors, I needed to figure out different groupings. The groups for Barack Obama and Mitt Romney were unique groups. Additional groups were created for SuperPacs that supported a candidate, opposed a candidate, supported one candidate while opposing another candidate (i.e. Support Obama and Oppose Romney and vice versa) and those that did all three.

In this instance, Mitt Romney is purple and Barack Obama is the peach node. The red nodes supported Mitt Romney and the blue opposed. The orange nodes opposed Barack Obama and the light blue supported Barack Obama. The green supported Romney and opposed Obama while the light orange supported Obama and opposed Romney.

With labels, the graph gets much messier. You can see it here:

Winner
The initial graph was chosen as the winner of the Most Innovative Educational Graph Application as well as a winning entry in the #ShowMeYourGraph contest. Thanks to Neo4J for the conference and for the competition.


Graphing Congressional Sponsorship of Bills – Part 2

October 29, 2012 by dave fauth Leave a Comment


Update
This is a follow-up to the previous post.

Note: The data and Java code are available on GitHub.

The Data
I decided to use six years of data (Congress sessions 107-112). Since there is significant turnover in Congress (re-election, sex scandals, resignation, and death among other reasons), I had to expand the number of congressional members to 928. In addition, over the six years of bills, there are now 5,870 unique topics or subjects of the bills (i.e. Arts, Boy Scouts, Clergy, Welfare, Whaling to mention a few).

Using the BatchInserter framework from Max De Marzi and Michael Hunger, the data can be quickly loaded into Neo4J (less than 5 minutes). There are 340,619 nodes, 846,697 properties, and 2,514,565 relationships.

Data Exploration
We’ll start out simple using Cypher queries and build from there. Let’s see how many bills discuss Whaling.

START n=node:subjects('subject:Whaling')
MATCH n-[r:REFLECTS]-c
RETURN c.Session, count(c.Session)
ORDER BY c.Session DESC

Our output:

"110"	5
"109"	7
"108"	11
"107"	13

To see who is sponsoring or cosponsoring the most bills related to “whaling”, we run the following query:

START n=node:subjects('subject:Whaling')
MATCH n-[r:REFLECTS]-c-[s:COSPONSORS|SPONSORS]-z
RETURN z.firstname, z.lastname, count(z) As popCoSponsors
ORDER BY popCoSponsors DESC
LIMIT 15

Our output is:

z.firstname    z.lastname    popCoSponsors
"Joseph"       "Lieberman"   9
"John"         "Kerry"       8
"Susan"        "Collins"     8
"Edward"       "Kennedy"     8
"Barbara"      "Boxer"       8
"Ron"          "Wyden"       7
"Daniel"       "Akaka"       7
"Christopher"  "Dodd"        6
"Olympia"      "Snowe"       6
"John"         "Reed"        6
"Carl"         "Levin"       6
"Russell"      "Feingold"    6
"Wayne"        "Gilchrest"   6
"John"         "McCain"      6
"Joseph"       "Biden"       5

If I want to see who usually co-sponsors bills sponsored by Olympia Snowe, I would run this query:

START n=node:subjects('subject:Whaling'), z=node:congress('lastName:Snowe')
MATCH n-[r:REFLECTS]-c-[s:COSPONSORS]-x
WHERE c-[:SPONSORS]-z
RETURN x.firstname, x.lastname, count(x) As popCoSponsors
ORDER BY popCoSponsors DESC
LIMIT 10

The output for this is:

x.firstname    x.lastname    popCoSponsors
"Ron"          "Wyden"       4
"Christopher"  "Dodd"        4
"John"         "Reed"        4
"Daniel"       "Akaka"       4
"Susan"        "Collins"     4
"Joseph"       "Biden"       4
"Edward"       "Kennedy"     4
"Barbara"      "Boxer"       4
"John"         "McCain"      4
"Joseph"       "Lieberman"   4

Paths between Congressmen
To find the shortest path between two congressmen, we can run this query:

START ryanvp = node(6233), obama = node(6323)
MATCH p=shortestPath(ryanvp-[*..10]-obama) return p;

This returns the following path:

Obama -> Bill SR 97 110 Session -> Topic of Congressional Tributes -> HR716 112 Session -> Paul Ryan


Graphing Congressional Sponsorship of Bills

October 17, 2012 by dave fauth Leave a Comment


Introduction:

In preparation for my talk at GraphConnect, I’ve added onto the talk by graphing out congressional sponsorship of bills using Neo4J. This post will specifically talk about data acquisition, data preparation, data loading into Neo4J and simple cypher queries against the data. Follow-on posts will look at processing this same data in Hadoop, processing this data in data-flows with Pig and Python to expose new and interesting properties, and visually graphing the data using d3.js.

The Data:

In this project we will download a spreadsheet of all members of congress as well as all bills/resolutions from the 112th session of congress. Govtrack.us has made an xml file of everyone that has ever served in Congress, and U.S. presidents, with their party affiliation, terms in Congress, birthdays, etc. Additionally, they have provided all bills and resolutions for the current Congress and similarly named directories for other Congresses.

GovTrack has provided easy to use instructions on obtaining their data. To download GovTrack’s raw data files in bulk, or if you plan to regularly update the files, you should use the rsync tool. Rsync is good for selecting which directory of data you want and keeping your files up to date by only downloading changes on each update.

rsync -avz --delete --delete-excluded govtrack.us::govtrackdata/us/112/bills .

Complete code for this example will be made available on github.

Data Processing:

Using Excel, you can load in the Congress XML file and it will be transformed into a spreadsheet. By filtering on the active field, you can quickly create a spreadsheet of 535 congressmen. This file is then saved in a tab-delimited format, and TextWrangler was used to convert the tabs to pipe delimiters.
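The tab-to-pipe conversion done here in TextWrangler is also a one-liner in Python; the file names you would apply it to are whatever your export produced:

```python
# Convert a tab-delimited line to the pipe-delimited form used for loading.
def tabs_to_pipes(line: str) -> str:
    return line.rstrip("\n").replace("\t", "|")

sample = "Olympia\tSnowe\tME\tsenator"
print(tabs_to_pipes(sample))  # Olympia|Snowe|ME|senator
```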

Each of the bills is downloaded as an xml file. A sample is shown below:


<state>REFERRED</state>

<title type="official">A resolution expressing the sense of the Senate that the primary safeguard for the well-being and protection of children is the family, and that the primary safeguards for the legal rights of children in the United States are the Constitutions of the United States and the several States, and that, because the use of international treaties to govern policy in the United States on families and children is contrary to principles of self-government and federalism, and that, because the United Nations Convention on the Rights of the Child undermines traditional principles of law in the United States regarding parents and children, the President should not transmit the Convention to the Senate for its advice and consent.</title>

<action><text>Referred to the Committee on Foreign Relations.</text></action>

<summary> 3/10/2011--Introduced.
Expresses the sense of the Senate that: (1) the United Nations Convention on the Rights of the Child is incompatible with the Constitution, the laws, and the traditions of the United States; (2) the Convention would undermine presumptions of freedom and independence for U.S. families; (3) the Convention would interfere with U.S. principles of sovereignty, independence, and self-government that preclude the necessity or propriety of adopting international law to govern domestic matters; and (4) the President should not transmit the Convention to the Senate for its advice and consent.
</summary>

Using a quickly written Java application, I was able to create the several delimited files needed to support data loading into Neo4J. Each bill has a sponsor, may have one or more co-sponsors, addresses one or more topics or subjects, and has activity as it moves through Congress. You can see that modeled below:
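The core of that Java application (pulling sponsor, cosponsor, and subject fields out of each bill's XML and emitting delimited rows) looks roughly like this in Python. The element and attribute names below are illustrative guesses, not the exact GovTrack schema:

```python
import xml.etree.ElementTree as ET

# Illustrative bill record; element and attribute names are guesses
# at the GovTrack layout, not its exact schema.
bill_xml = """
<bill session="112" number="97">
  <sponsor id="400619"/>
  <cosponsor id="300025"/>
  <cosponsor id="300043"/>
  <subject>Whaling</subject>
  <subject>Marine mammals</subject>
</bill>
"""

bill = ET.fromstring(bill_xml)
key = f"{bill.get('session')}-{bill.get('number')}"

# Emit one pipe-delimited row per relationship the graph will need.
rows = [f"{bill.find('sponsor').get('id')}|SPONSORS|{key}"]
rows += [f"{c.get('id')}|COSPONSORS|{key}" for c in bill.findall("cosponsor")]
rows += [f"{key}|REFLECTS|{s.text}" for s in bill.findall("subject")]
for row in rows:
    print(row)
```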

Data Loading into Neo4J
Using the BatchInserter framework from Max De Marzi and Michael Hunger, the data can be quickly loaded into Neo4J (less than 5 minutes). There are 50,889 nodes, 121,049 properties, and 266,197 relationships.

Quick Analysis

Using Cypher, you can quickly see who sponsored the most bills:


START n=node:congress('lastName:*')
MATCH n-[r:SPONSORS]->c
RETURN n.lastname, n.firstname, count(*)
ORDER BY count(*) desc LIMIT 5;

Casey, Robert 228
Menéndez, Robert 207
Schumer, Charles 146
Brown, Sherrod 118
Hagan, Kay 116

And those that COSPONSOR:

START n=node:congress('lastName:*')
MATCH n-[r:COSPONSORS]->c
RETURN n.lastname, n.firstname, count(*)
ORDER BY count(*) desc LIMIT 5;
Grijalva, Raul 752
Filner, Bob 704
Rangel, Charles 687
Norton, Eleanor 681
Lee, Barbara 605

Conclusion
We’ve seen how we can quickly transform data in XML files into Neo4J and begin to look at the data. The next post will look at further analysis of the data.


Analysis of Cash Pouring into State Campaigns

July 17, 2012 by dave fauth Leave a Comment


In today’s Star Tribune, there was an article discussing the amount of cash flowing from outside the state of Minnesota into various candidates’ campaign war chests. The authors used FEC individual campaign contribution data to perform the analysis described in the article.

Using a similar approach outlined earlier, I decided to see how easy it would be to use Datameer to perform similar analysis. In this case, I filtered out candidates so only Minnesota candidates would show up as shown below.

After joining this set of records to campaign contributions, I then filtered on individuals who listed Minnesota as their address and also filtered on non-Minnesota contributors. After grouping the sums by candidate and joining the sheets, I had a worksheet that I could then use to graph results. That sheet is below:
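The same in-state versus out-of-state split can be sketched in a few lines of Python; the contribution records below are invented for illustration:

```python
from collections import defaultdict

# Invented contribution records: (candidate, contributor_state, amount).
contributions = [
    ("Candidate A", "MN", 500),
    ("Candidate A", "CA", 2500),
    ("Candidate B", "MN", 1000),
    ("Candidate B", "NY", 250),
]

# Group contribution sums by candidate, split by contributor location.
totals = defaultdict(lambda: {"in_state": 0, "out_of_state": 0})
for candidate, state, amount in contributions:
    bucket = "in_state" if state == "MN" else "out_of_state"
    totals[candidate][bucket] += amount

for candidate, split in sorted(totals.items()):
    print(candidate, split)
```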

Using the Infographic capability, I created a quick bar graph from these results.

I have to look at the data set again, as the numbers don’t quite match the Star Tribune article. However, it shows that tools like Datameer 2.0 can be used for multi-purpose data analysis, handling tasks that we would normally think only a data scientist could perform.

Filed Under: Uncategorized
