Leading up to GraphConnect NY, I was distracting myself from working on my talk by figuring out whether there was any way to import data directly from Hadoop into a graph database, specifically Neo4j. Previously, I had written some Pig jobs to output the data into various files and then used the Neo4j BatchInserter to load the data. That process works great, and others have written about it: for example, this approach also uses the BatchInserter, while this approach uses some Java UDFs to write the Neo4j files directly.
Both of these approaches work great, but I was wondering if I could use a Python UDF and create the Neo4j database directly. To test this out, I decided to resurrect some work I had done on the congressional bill data from Govtrack. You can read about the data and the Java code I used to convert the files into single-line JSON files here. It's also a good time to read up on how to create an Elasticsearch index using Hadoop. Now that you're back from reading that link, let's look at the approach to go from Hadoop directly into Neo4j. From the previous article, you'll remember that Mortar recently worked with the Pig community to get CPython support committed into the Apache Pig trunk. This now allows users to take advantage of Hadoop with real Python: you focus on just the logic you need, and streaming Python takes care of all the plumbing.
Nigel Small has written Py2Neo, a simple and pragmatic Python library that provides access to the popular graph database Neo4j via its RESTful web service interface. That sounded awesome and worth trying out. Py2Neo is easy to install using pip or easy_install; installation instructions are located here.
The model that I was trying to create looks something like this:
The approach taken was to use Pig with a streaming Python UDF that writes to the Neo4j database through its RESTful web service. I tested this out with Neo4j 2.0 M06. I attempted to use Neo4j 2.0 RC1 but ran into several errors relating to missing nodes. The example code is below:
-- run the Counter UDF to create a Node ID
keyNodeList = FOREACH c GENERATE keyValue, utility_udfs.auto_increment_id() AS my_id:int, nodeType;
Since Neo4j uses an incrementing counter for each node, we have to create an ID for each keyValue (node name) that we are creating. The keyValues are the congressional session, the name of the congresswoman or congressman, the billID, or the subject. Below is the simple Python UDF that creates that ID:
from pig_util import outputSchema

COUNT = 0

@outputSchema('auto_increment_id:int')
def auto_increment_id():
    # module-level counter: each call returns the next sequential ID
    global COUNT
    COUNT += 1
    return COUNT
Once we have the id, we can use Py2Neo to create the nodes and relationships.
Of note is the ability to create the node and add the label in a single createNode function. To create a relationship, we pass in the two node IDs and the relationship type; the request goes over the REST API and the relationship is created.
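The UDF code itself isn't shown here, but since Py2Neo is a wrapper around Neo4j's REST endpoints, the requests it issues look roughly like the sketch below. The endpoint paths are taken from the Neo4j 2.0 REST API, and the build_* helper names are my own illustration, not the original UDFs:

```python
import json

# Rough sketch of the JSON requests behind node/relationship creation.
# Endpoint paths assumed from the Neo4j 2.0 REST API docs; helper
# names are illustrative only.

def build_node_request(properties):
    # POST /db/data/node creates a node with the given properties
    return {"method": "POST", "to": "/db/data/node",
            "body": properties}

def build_label_request(node_id, label):
    # POST /db/data/node/{id}/labels adds a label to an existing node
    return {"method": "POST",
            "to": "/db/data/node/%d/labels" % node_id,
            "body": label}

def build_relationship_request(from_id, to_id, rel_type):
    # POST /db/data/node/{id}/relationships links two nodes
    return {"method": "POST",
            "to": "/db/data/node/%d/relationships" % from_id,
            "body": {"to": "/db/data/node/%d" % to_id,
                     "type": rel_type}}

req = build_relationship_request(1, 2, "SPONSORED")
print(json.dumps(req["body"]))
```

Every one of these becomes its own round trip to the server, which matters for the performance numbers below.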
Performance – Performance wasn't what I thought it would be. Py2Neo interacts with Neo4j via its REST API, so every interaction requires a separate HTTP request. This, along with logging, made things much slower than I anticipated. Overall, it took about 40 minutes on my MacBook Pro (16 GB of RAM, SSD) to create the Neo4j database.
Py2Neo Batches – Batches allow multiple requests to be grouped and sent together, cutting down on network traffic and latency. Such requests also have the advantage of being executed within a single transaction. The second run was done by adding some Py2Neo batches. This really didn’t make a huge difference as the log files were still being written.
Overall, it still took about 60 minutes on my MacBook Pro (16 GB of RAM, SSD) to create the Neo4j database.
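Under the hood, Py2Neo's batches map onto Neo4j's /db/data/batch endpoint, which accepts a list of operations in a single POST. A stdlib-only sketch of what a batched payload looks like (structure assumed from the Neo4j 2.0 REST docs; the helper is my illustration, not the author's code):

```python
import json

def build_batch_payload(key_values):
    # One entry per operation; the "id" field lets later entries in the
    # same batch reference earlier ones (e.g. a relationship pointing
    # at a node created two entries up).
    ops = []
    for i, kv in enumerate(key_values):
        ops.append({"method": "POST", "to": "/node",
                    "body": {"keyValue": kv}, "id": i})
    return ops

payload = build_batch_payload(["113", "hr-1234"])
print(json.dumps(payload))
```

This trims the per-request HTTP overhead, but as noted above, it does nothing about the server-side transaction logging.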
Hmmm… I should have known that the RESTful service wasn't going to be anywhere near as fast as the BatchInserter, due to all of the logging. You could see the log files grow and grow as the data was added. I'm going to go back to the drawing board and see if a Java UDF could work better. Worst case, I just go back to writing out files and writing a custom BatchInserter each time.