Neo4j recently announced the 2.2 Milestone 1 release. Among its exciting features is the improved and fully integrated “Superfast Batch Loader”. This utility, (unsurprisingly) called neo4j-import, now supports large-scale, non-transactional initial loads (of 10M to 10B+ elements) with sustained throughput of around 1M records (nodes, relationships, or properties) per second. neo4j-import is available from the command line on both Windows and Unix.
In this post, I’ll walk through how to set up the data files, cover some of the command-line options, and then document the performance of importing a medium-sized data set.
The data set that we will use is the Medicare Provider Utilization and Payment Data.
The Physician and Other Supplier PUF contains information on utilization, payment (allowed amount and Medicare payment), and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and place of service. This PUF is based on information from CMS’s National Claims History Standard Analytic Files. The data in the Physician and Other Supplier PUF covers calendar year 2012 and contains 100% final-action physician/supplier Part B non-institutional line items for the Medicare fee-for-service population.
For the data model, I created doctor nodes, address nodes, procedure nodes and procedure detail nodes.
A (procedure)-[:CONTAINS]->(procedure_details)
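To make the mapping from model to import files more concrete, here is a minimal sketch of what the node and relationship CSV files for the procedure part of the model might look like. It assumes the header conventions documented for the Neo4j 2.2 import tool (`:ID(...)`, `:LABEL`, `:START_ID(...)`, `:END_ID(...)`, `:TYPE`), which may differ slightly in the milestone build. The column names (`hcpcs`, `detailId`, `placeOfService`), the ID groups, the sample values, and the direction of `CONTAINS` (procedure to procedure detail) are all assumptions for illustration and are not taken from the actual data files.

```python
import csv
import io


def to_csv(rows):
    """Render a list of rows (lists of strings) as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical columns and values; the real PUF fields are not reproduced here.
procedure_rows = [
    ["hcpcs:ID(Procedure)", "description", ":LABEL"],
    ["P1", "Example procedure", "Procedure"],
]
detail_rows = [
    ["detailId:ID(ProcedureDetail)", "placeOfService", ":LABEL"],
    ["D1", "O", "ProcedureDetail"],
]
# Assumed direction: (procedure)-[:CONTAINS]->(procedure_details)
contains_rows = [
    [":START_ID(Procedure)", ":END_ID(ProcedureDetail)", ":TYPE"],
    ["P1", "D1", "CONTAINS"],
]

procedures_csv = to_csv(procedure_rows)
details_csv = to_csv(detail_rows)
contains_csv = to_csv(contains_rows)

print(procedures_csv)
print(details_csv)
print(contains_csv)
```

Each node file carries an ID column scoped to an ID group, and the relationship file refers back to those IDs, so one file per node type and one per relationship type is a natural layout for this model.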
The model is shown below: