Datameer 2.0 – Update

As of yesterday, June 28, I received approval to download a 30-day trial of Datameer 2.0. The download and installation on my MacBook Pro were simple and straightforward. After unpacking the zip file, I simply copied the Datameer application into my Applications directory and launched Datameer 2.0. Datameer took about two minutes to start and then asked to generate a 30-day trial key.

I’ll generate that key next week and get started looking at Datameer 2.0 using the Federal Election Commission detailed financial disclosure files located here.

Datameer 2.0

Datameer has announced version 2.0, putting business intelligence into the hands of the average user. It provides a single application that requires no ETL and no static schemas, and puts powerful analytics and data visualization directly in the hands of any user.

After watching CEO Stefan Groschupf announce the release of Datameer 2.0 at the 2012 Hadoop Summit, I’m excited to get my hands on the personal version to put it through its paces. Datameer Personal runs on a single desktop with a data limit of 100GB/yr. Datameer Workgroup runs on a single server with a data limit of 100GB/yr. Datameer Enterprise can scale to thousands of nodes with unlimited data.

Some of the key technologies utilized in version 2.0 are Hadoop, HTML5, a REST API, and an SDK. There are over 25 built-in data connectors (JSON, Amazon S3, Oracle, DB2, MS SQL, MySQL, HBase, XML, and native connectors to Twitter and Facebook), with the ability to build additional data connectors through the SDK. As part of the analytical suite, there are over 200 built-in functions, including data mining functions. During the announcement, it was mentioned that there is built-in entity and location extraction, features that I want to investigate further.

Datameer 2.0 provides a Business Infographics Designer, giving a user complete graphics and visualization control. Datameer’s extensive library of widgets includes tables, graphs, charts, diagrams, maps, and tag clouds, enabling users to create simple dashboards or stunning business infographics and visualizations. For someone graphically challenged, this should provide an easy way to create meaningful representations of the data.

To gain access to Datameer 2.0, apply here. Datameer is slowly rolling out access to the 2.0 product.

About Datameer
Datameer offers the first data analytics solution that helps end users access, analyze and visualize data of any type, size, or source. Founded by Hadoop veterans in 2009, Datameer provides unparalleled access to data with minimal IT resources. Datameer scales from a laptop to thousands of nodes and is available for all major Hadoop distributions including Apache, Cloudera, EMC, Hortonworks, IBM, MapR, Yahoo!, Amazon and Microsoft Azure. Datameer is based in San Mateo, Calif. For more information on Datameer, please visit and follow them on Twitter @Datameer.

Things I learned while skiing

Last week, my son (@dsfauthii) and I went to Copper Mountain (@CopperMtn) for 3 days of skiing. As I was skiing, a few thoughts came to mind related to business and life in general.

1. Have fun. I’m not the best skier but had a great time skiing most of the mountain. Colorado skiing is so different from East Coast skiing. The mountains were much higher, the trails longer and more challenging. Still, we had a great time all three days.

2. Take on new challenges. I had never skied anything like Copper Mountain. The trails were more challenging and faster than I was used to. It took about half of the first day before I was ready to tackle the blue trails. After some success, I was more confident and knew I could tackle these courses. By the last day, I even took on a couple of black diamonds. Sure, I fell a couple of times, but I was confident when I was done with those trails. The end result was worth the risk and challenges.

3. Focus. As I’m not the best skier, I had to tell myself to focus all of the time. I needed to know who was behind me, who was gaining on me, what was ahead, who I was overtaking and where the trail was going. If I wasn’t focusing, it wasn’t long before I was struggling to stay up.

4. Enjoy the experience. This was a great trip for me and my son because we had a shared experience. I’d rather enjoy the experience with someone than by myself.

Java SSL Certificate

This post is meant to remind me how to implement SSL certificates within Java. It was definitely a learning experience digging into truststores and keystores.

Installation of client certificates in a Java client environment

This section describes the steps required to install the provided certificates in a Java client environment. In general you will create a new Java keystore and truststore using the files and password we have provided. Here are the steps to follow:

1. Make sure you have access to a Java 6 installation. You only need this for the keytool utility. The files you create with Java 6 are fully compatible with Java 5, but the keytool utility in Java 5 does not support importing PKCS #12 files.
2. Import the provided PKCS #12 file into a new keystore by issuing the following command (use the password provided by the CLEAR Administrator at all password prompts):
keytool -importkeystore -v -srckeystore clientcert.p12 -srcstoretype PKCS12 -destkeystore newstore.ks
3. Next, create a truststore that includes the CA certificate (you can select your own password):
keytool -import -v -keystore newtrust.ks -file cacertfile.pem

4. Finally, set the Java system properties when running your client to ensure that the proper certificate is selected during SSL negotiation. The properties are: javax.net.ssl.keyStore, javax.net.ssl.keyStorePassword, javax.net.ssl.trustStore, and javax.net.ssl.trustStorePassword.
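These properties can also be set programmatically before any connection is opened. A minimal sketch follows; the store file names match the keytool steps above, but the password is a placeholder you would substitute with your own:

```java
public class SslProperties {
    public static void main(String[] args) {
        // Equivalent to passing -Djavax.net.ssl.* flags on the java command line.
        // The password value below is a placeholder.
        System.setProperty("javax.net.ssl.keyStore", "newstore.ks");
        System.setProperty("javax.net.ssl.keyStorePassword", "changeit");
        System.setProperty("javax.net.ssl.trustStore", "newtrust.ks");
        System.setProperty("javax.net.ssl.trustStorePassword", "changeit");
    }
}
```

Setting the properties in code is handy when you cannot control the launch command, but they must be set before the first SSL connection is made.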

For keytool commands, I referred to this site:

A good site for troubleshooting is:

I ended up using the file from the Atlassian site to troubleshoot the SSL connection. This really helped me understand connection issues.

Sample code within Palantir

Within Palantir, I was able to use the following code to successfully connect to the SSL endpoint.

StringBuffer sb = new StringBuffer();
String strGetURL = strURL;
try {
    // Load the client certificate keystore created with keytool above.
    KeyManagerFactory keyManagerFactory = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
    KeyStore keyStore = KeyStore.getInstance(KeyStore.getDefaultType());
    InputStream keyInput = this.getClass().getResourceAsStream("/newstore.ks");
    keyStore.load(keyInput, "certificatepwd".toCharArray());
    keyManagerFactory.init(keyStore, "certificatepwd".toCharArray());

    // Load the truststore holding the CA certificate and initialize the
    // trust manager factory with it.
    TrustManagerFactory trustManagerFactory = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
    KeyStore trustStore = KeyStore.getInstance(KeyStore.getDefaultType());
    InputStream trustInput = this.getClass().getResourceAsStream("/newtrust.ks");
    trustStore.load(trustInput, "certificatepwd".toCharArray());
    trustManagerFactory.init(trustStore);

    // Build an SSLContext from our key and trust managers.
    SSLContext sct = SSLContext.getInstance("SSL");
    sct.init(keyManagerFactory.getKeyManagers(), trustManagerFactory.getTrustManagers(), new SecureRandom());
    SSLSocketFactory sslsocketfactory = sct.getSocketFactory();

    // Build the HTTP Basic authentication header; DatatypeConverter is a
    // supported alternative to the internal sun.misc.BASE64Encoder.
    String username = "username:password";
    String encoding = javax.xml.bind.DatatypeConverter.printBase64Binary(username.getBytes());

    URL url = new URL(strGetURL);
    HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
    conn.setSSLSocketFactory(sslsocketfactory); // use our context, not the JVM default
    conn.setRequestProperty("Authorization", "Basic " + encoding);
    conn.setRequestProperty("Content-Type", "application/xml");

    // Read the response body into the buffer.
    BufferedReader bufferedreader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String string;
    while ((string = bufferedreader.readLine()) != null) {
        sb.append(string);
    }
    bufferedreader.close();
} catch (Exception exception) {
    exception.printStackTrace();
}
return sb.toString();

As I mentioned earlier, this is mostly for my own use in future deployments. If someone else finds it useful, I’m glad it helped.

FEC Data – Further Analysis

In the previous post we showed how Federal Election Commission data could be loaded into Neo4J and manipulated using Gremlin. In this follow-up posting, we’ll modify the data structure and do some further analysis of the data.

The FEC Data Graph
The FEC data is represented in the following graph. Each committee supports a candidate. Some candidates may be independent from a committee. Individuals contribute one or more times to a committee. For this demonstration, we haven’t separated out city/state/zip into a common location.

A couple of notes on the data. Some of the committees did not have a treasurer, so I added in a value of “No Treasurer”. Some of the candidates referenced non-existent committees; in those cases, I created entries for the committees in order to load the data and create the links. Additionally, the individual contribution file uses overpunch characters to denote differing or negative amounts. Those values were adjusted in the database so the data could be loaded as an integer value.

In this design, we see that a contributor (an individual making a contribution) can make several contributions over time. These contributions are given to a committee in support of a candidate. Additionally, we’ve added a data set that summarizes all contributions for which detailed donor reporting is not required because the donor has not given more than $200.

Just to give people an idea of the volume of contributions, in September when I downloaded the data, there were 437,726 contributions. When I downloaded the latest file on November 3, there were 598,306 contributions. That’s about a 37 percent increase.
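As a quick sanity check on that growth figure, here is the percentage-increase arithmetic using the two counts quoted above:

```java
public class ContributionGrowth {
    // Percentage increase from an earlier count to a later count.
    static double percentIncrease(long before, long after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        // September vs. November 3 contribution counts quoted above.
        System.out.printf("%.1f%%%n", percentIncrease(437_726L, 598_306L)); // prints 36.7%
    }
}
```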

Let’s look at a candidate
We’ll use the Gremlin language to perform some graph manipulation and analyze giving to certain candidates. In this case, I’ve decided to look at Mr. Herman Cain.

 v = g.idx(T.v)[[candName:'CAIN, HERMAN']] >> 1
==>candName=CAIN, HERMAN

We’ll be using Gremlin pipes (for a great description, see this post). To check on the committee supporting Herman Cain, we run these commands:

gremlin>  v.inE('supports')[0..20]
gremlin> v=g.idx(T.v)[[name:'FRIENDS OF HERMAN CAIN INC']] >> 1
==>treasurer=MARK J BLOCK

In this instance, there is only a single committee, FRIENDS OF HERMAN CAIN INC, that is supporting Herman Cain.

Our pipe looks like this:

Campaign Contributions
Let’s take a look and see how many campaign contributions have been made to Herman Cain. The data is as of mid-September. I haven’t downloaded an updated data set.

gremlin> v.outE('receives').count()

We see that FRIENDS OF HERMAN CAIN INC has received 1,201 individual contributions. The average contribution is determined using the following Gremlin command:

gremlin> v.outE('receives').inV.amount.mean()

In the next analysis, we’ll use some filter steps to remove objects from the flow of computation. In this example, we see that there are three contributions above $4500.

gremlin>  v.out('receives').filter{it.amount>4500}.amount

To see who the people are, we’ll use a more complicated pipe that starts with the committee, keeps only the contributions greater than $4500, and then passes those results on to find out who made those contributions.

gremlin> v.out('receives').filter{it.amount>4500}.inE('makes').outV.contName[0..10]
==>Fox, Saul
==>Weidner, William
==>Jones, William

Our next analysis will be to see who is contributing multiple times to the Cain campaign. We use the following code to see who has contributed. In order to remove the reflexive path, we add in the filter; without it, we would see double the number of contributions.

gremlin> m=[:]
gremlin> v.out('receives').inE('makes').outV.filter{it != v}.contName.groupCount(m) >> -1
gremlin>  m.sort{a,b -> b.value <=> a.value}[0..39]
==>Waddle, Julie=6
==>Rogers, Michael=5
==>Anderson, Neil=4
==>Watkins, Walter=4
==>Tribble, James=4
==>Burton, James=4
==>laseau, mary=4
==>Russell, Daniel=4
==>Harris, Dudney L.=4
==>Fox, Saul=3
==>Weidner, William=3
==>Ratliff, Robert=3
==>Ellis, Marty=3
==>Bucciero, Kimberly=3
==>Kincaid, Elizabeth=3
==>Harkins, Gerry=3
==>Adams, Archie=3
==>Frankovitch, Joseph=3
==>Holten, James=3
==>Lindenfeld, Malaise=3
==>Richardson, Scott=3
==>Irvin, David L=3
==>Ward, Thomas=3
==>Buchanan, douglas=3
==>samuels, philip=3
==>Thompson, James=3
==>Koch, Tina=3
==>clements, john=3
==>Fowler, Jan=2
==>Blackwell, Diane=2
==>Koch, Richard=2
==>Gingrich, William=2
==>Robson, Roger=2
==>Anderson Jr, Taz=2
==>Hatfield, Edward=2
==>Ramey, Valerie=2
==>Parham, Charles=2
==>Shaw, Terry=2
==>Eidson, Robert=2
==>Keown, Karie=2

Who is Julie Waddle? Using additional commands, we find out she is a homemaker from Herman Cain’s hometown.

gremlin>   v = g.idx(T.v)[[contName:'Waddle, Julie']] >> 1
==>contName=Waddle, Julie

To show the contributions, we will run this command:

gremlin> v.outE('makes').inV.amount

Let’s do a little more digging and see what the top reported occupations were among contributors to the Cain campaign. We’ll use the following command and get the following results, which show retirees, homemakers, physicians and lawyers are the top contributors:

gremlin> m=[:]
gremlin> v.out('receives').inE('makes').outV.filter{it !=v}.contOccupation.groupCount(m) >> -1
gremlin>  m.sort{a,b -> b.value <=> a.value}[0..39]
==>self/Real Estate=7
==>Technical Director Custoemr Enginee=5
==>University of West GA/Professor=4
==>self/Human Resources=4
==>Fox Paine & Co./Chief Exec.=3
==>self/small business owner=3
==>Self/Owner/Concert Merchandise Comp=3
==>Goodman Networks/Project Manger=3
==>James C Kincaid DDS/Secretary=3
==>Hybrid Concrete Structures/Construc=3
==>self employed/Consultant=3
==>Kingsley Associates/Database Admini=3
==>Holten Meat Inc/CEO=3
==>North  Georgia Foods Inc/Business O=3
==>renze display/President=3
==>universal sewing supply/Executive=3
==>GA Solar Lighting/contractor=3
==>Teradata Corp./VP=3
==>TRG Inc./President=3
==>Amsell LLC/Sales=2
==>Smith Gambrell & Russell/Legal Secr=2

Next Steps
The next steps will be to reload the graph with updated data and look at different groupings of data (occupation, location, time series, etc).

Federal Election Commission Campaign Data Analysis

This post is inspired by Marko Rodriguez’ excellent post on a Graph-Based Movie Recommendation engine. I will use many of the same concepts that he describes in his post in order to load the data into Neo4J and then begin to analyze the data. This post will focus on the data loading. Follow-on posts will look at further analysis based on the relationships.

The Federal Election Commission has made campaign contribution data publicly available for download here. The FEC has provided campaign finance maps on its home page. The Sunlight Foundation has created the Influence Explorer to provide similar analysis.

This post and follow-on posts will look at analyzing the Campaign Data using the graph database Neo4j, and the graph traversal language Gremlin. This post will go about showing the data preparation, the data modeling and then loading into Neo4J.

The FEC Data
The FEC data is available for download from the FEC website via FTP. It is composed of three main files: Campaign Committees, Campaign Candidates, and Individual Contributors. As of this post, there were approximately 10,875 committees, 3,600 candidates, and 455,000 unique contributions. Each of the data sets has a data description as well as frequency counts. The 2011-2012 data can be found here.

Gremlin and Neo4J
Gremlin 1.3 is available for download at this location. Neo4J 1.5M01 is available for download at this location. For this demonstration, we will be running the community edition of Neo4J in a Windows Virtual Machine.

Data Preparation
The FEC data is in formatted, fixed-length fields. This makes it a little harder to prepare for import into Neo4J with my limited skills and abilities. To work around that, I loaded the data into Oracle using SQL Loader and then wrote a simple PHP program to query the database and format the data into a delimited file. If you are interested in those files, feel free to contact me.
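The same fixed-width-to-delimited conversion could also be done directly in code. Here is a small sketch in Java; the field offsets below are invented for illustration (the real offsets come from the FEC data descriptions), and the output uses the '::' delimiter the loaders below split on:

```java
import java.util.ArrayList;
import java.util.List;

public class FixedWidthToDelimited {
    // Hypothetical (start, end) offsets of three fields in a fixed-width record.
    // The real FEC layouts are published in the FEC data descriptions.
    static final int[][] FIELDS = { {0, 9}, {9, 12}, {12, 14} };

    // Cut one fixed-width record into trimmed fields joined with '::'.
    static String toDelimited(String record) {
        List<String> parts = new ArrayList<>();
        for (int[] f : FIELDS) {
            parts.add(record.substring(f[0], f[1]).trim());
        }
        return String.join("::", parts);
    }

    public static void main(String[] args) {
        System.out.println(toDelimited("H0AL01048ALA35")); // prints H0AL01048::ALA::35
    }
}
```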

The FEC Data Graph
The FEC data is represented in the following graph. Each committee supports a candidate. Some candidates may be independent from a committee. Individuals contribute one or more times to a committee. For this demonstration, we haven’t separated out city/state/zip into a common location.

A couple of notes on the data. Some of the committees did not have a treasurer, so I added in a value of “No Treasurer”. Some of the candidates referenced non-existent committees; in those cases, I created entries for the committees in order to load the data and create the links. Additionally, the individual contribution file uses overpunch characters to denote differing or negative amounts. Those values were adjusted in the database so the data could be loaded as an integer value.

Loading Data
The data will be inserted into the graph database Neo4j. The Gremlin/Groovy code below creates a new Neo4j graph, removes an unneeded default edge index, and sets the transaction buffer to 2500 mutations per commit.

g = new Neo4jGraph('/tmp/FEC')
g.dropIndex(Index.EDGES)
g.setMaxBufferSize(2500)

Loading Committee Data
The committee data contains information about the different election committees. In our case, it has seven columns.


The code needed to parse this data is below:

new File('committee.dat').eachLine {def line ->
  def components = line.split('::');
  def committeeVertex = g.addVertex(['type':'Committee','committeeId':components[0], 'name':components[1], 'party':components[2],'city':components[3],'state':components[4],'zip':components[5],'treasurer':components[6]]);
}

Parsing Candidate Data
The candidate data contains information about the various candidates. In our case, it has nine columns. A sample of the data is below:

H0AL01048::WALTER, DAVID MARSH::FOLEY::AL::36535::CON::P::10::01
H0AL02087::ROBY, MARTHA::MONTGOMERY::AL::36106::REP::C::12::02
H0AL05049::CRAMER, ROBERT E "BUD" JR::HUNTSVILLE::AL::35804::DEM::P::08::05
H0AL05155::PHILLIP, LESTER S::MADISON::AL::35758::REP::P::10::05
H0AL05163::BROOKS, MO::HUNTSVILLE::AL::35802::REP::C::12::05
H0AL05197::RABY, STEPHEN WALKER::TOREY::AL::35773::DEM::P::10::05
H0AL06088::COOKE, STANLEY KYLE::KIMBERLY::AL::35091::REP::P::10::06
H0AL06096::LAMBERT, PAUL ANTHONY::MAYLENE::AL::35114::REP::N::10::06

The code to parse the candidate file is:

new File('candidate.dat').eachLine {def line ->
  def components = line.split('::');
  def candVertex = g.addVertex(['type':'Candidate','candId':components[0], 'candName':components[2], 'candCity':components[3], 'candState':components[4],'candZip':components[5],'candParty':components[6],'candStatus':components[7],'candYear':components[8],'candDistrict':components[9]]);
  def supportedEdge = g.addEdge(g.idx(T.v)[[committeeId:components[1]]].next(), candVertex, 'supports');
}

Loading the Individual Contributors File
The individual contributors file contains all of the contributions made to different committees.

The sample data is:

C00000422::0009951::Helm, Douglas Alan MD::PERINATAL ASSOCIATES/Physician::Fresno::CA::93701::01::11::11::20::0000500::M2
C00000422::0009952::Karasek, Dennis Edward MD::SELF-EMPLOYED/Physician::San Antonio::TX::78231::01::11::11::20::0002000::M2
C00000422::0009953::Kilgore, Shannon M MD::VA PALO ALTO HCS/Physician::Palo Alto::CA::94304::01::11::11::20::0000500::M2
C00000422::0009954::Matthews, George Philip MD::VISION QUEST/Physician::Arlington::TX::76006::01::11::11::20::0000500::M2
C00000422::0009955::Kimball, Daniel B Jr. MD::N/A/Retired Physician::Reading::PA::19611::01::15::11::20::0001000::M2
C00000422::0009956::Mehling, Brian Macdermott MD::MEHLING ORTHOPAEDIC/Physician::West Islip::NY::11795::01::14::11::20::0000291::M2

Given that there are about a half a million contributors, parsing this data and loading will take a couple of minutes.

new File('indiv.dat').eachLine {def line ->
  def components = line.split('::');
  def indivVertex = g.addVertex(['type':'Individual','indivId':components[1], 'indivName':components[2], 'indivOccupation':components[3],'indivCity':components[4], 'indivState':components[5],'indivZip':components[6],'transDate':components[7] + components[8] +components[9],'amount':components[11],'transactionType':components[12]]);
}

To commit any data left over in the transaction buffer, stop the current transaction with a successful conclusion. The data is then persisted to disk. If you plan on leaving the Gremlin console, be sure to g.shutdown() the graph first.


Validating the Data

gremlin> g.V.count()
gremlin> g.E.count()
gremlin> g.V[[type:'Committee']].count()
gremlin> g.V[[type:'Candidate']].count()
gremlin> g.V[[type:'Individual']].count()

Let’s look at some distributions
What is the distribution of contributions among states?

gremlin> m=[:]
gremlin> g.V[[type:'Individual']].indivState.groupCount(m) >> -1
gremlin> m.sort{a,b -> b.value<=>a.value}

What about the average contribution?

gremlin> g.V[[type:'Individual']].amount.mean()

Are there any treasurers supporting multiple committees?

gremlin> m=[:]
gremlin> g.V[[type:'Committee']].treasurer.groupCount(m) >> -1
gremlin>  m.sort{a,b -> b.value <=> a.value}[0..19]
==>No Chair=1716

No chair and no treasurer indicate that the treasurer value was empty. However, there are several treasurers supporting multiple committees.

Next Steps
The next steps will be to look at some of the relationships between contributors and committees and see if there are treasurers serving on multiple committees.

Additionally, because each contribution is counted individually, there are several duplicate donors/campaign contributors. In order to address that, I will separate out the donors and their address as a separate table and link them to the contributions.

If you have questions about this post, feel free to email me.

i2 Report File – Palantir Plugin (Update)

Since the initial posting, I’ve made some updates to the Palantir import helper, which allows a user to select the report file and then import it.

Once the user clicks on Import, a list of i2 types are presented to the user (both links and entities). The user can map each of the i2 types to Palantir objects or links.

Finally, a summary of the number of entities and links processed is presented to the user. The entities and links are added to the chart.

i2 Report File – Palantir Plugin

i2 ANB allows users to export chart information about entities, links, attributes and cards to a report. This is useful if you want to create a report containing the information in all or part of your chart. This report is created as a text file which can then be used in other applications.

i2 ANB allows users to define the items to include in a report using a report specification. A report specification is a series of settings that tell ANB what kind of report to create and what to include in it. Report specifications enable you to define the items, content and destination of your report.

For our usage, we’ve modified a default report template. For i2 ANB entities, we want access to the entity type, identity, label, description, date and the attributes. For the attributes, we print out the attribute name and attribute value, with tabs between them. We also use ]] as a delimiter between attribute name and value.

For links, we print out the link type, label, link 1 and link 2 as well as the Ends. We’ve added link1 and link2 values as it isn’t always possible to parse the Ends value properly.
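To make the attribute parsing concrete, here is a small sketch of splitting one exported attribute on the ]] delimiter described above. The sample line is invented for illustration; real lines depend on the report specification in use:

```java
public class AttributeLine {
    // Split an exported attribute line on the "]]" delimiter into
    // an attribute name and an attribute value.
    static String[] parseAttribute(String line) {
        return line.trim().split("\\]\\]", 2);
    }

    public static void main(String[] args) {
        String[] nameValue = parseAttribute("Date of Birth]]1970-01-01");
        System.out.println(nameValue[0] + " = " + nameValue[1]); // prints Date of Birth = 1970-01-01
    }
}
```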

The report configuration is shown in the following two screenshots.

i2 Report Setup

i2 Report Screen Capture 2

As shown below, the Palantir import helper allows a user to select the report file and then import it. A summary of the number of entities and links processed is presented to the user. The entities and links are added to the chart. Users can customize the mappings between i2 entity types and Palantir entity types by modifying an XML file that is part of the Palantir helpers.

Here’s the final import into a Palantir graph.

Palantir Graph

Questions about the Palantir helper can be sent to

Quick Links on APIs

Some quick links that have popped up over the last few days:

Your API Sucks: Why Developers Hang Up and How to Stop That. An article from Apigee about how APIs don’t need to suck for developers.

Get free admission to Strata and a chance to showcase to investors. Thanks to Pete Warden, here’s a chance for big data startups to get to Strata and in front of VCs.

Enterprise 2.0 RESTful APIs made easy with PHP FRAPI. FRAPI is a high-level API framework that puts the “REST” back into RESTful. Use it to power your web apps, mobile services, and legacy systems.

Best Practices for API Development. The founder of the Lokad API, a sales forecasting service, recently summarized some of her tips for API design.