Intelliwareness

Blog on Big Data, Data Analytics and Other IT


Reading and Writing Parquet files with Mortar

May 21, 2014 by dave fauth

Using Mortar to read/write Parquet files
You can load Parquet formatted files into Mortar, or tell Mortar to output data in Parquet format using the Hadoop connectors that can be built from here or downloaded from here.

Last year, Cloudera, in collaboration with Twitter and others, released a new Apache Hadoop-friendly, binary, columnar file format called Parquet. Parquet is an open source serialization format that stores data in a binary, column-oriented fashion. In a row-oriented format, all of the columns for a row are stored together, followed by the next row (again with its columns stored next to each other). Parquet turns this on its head: it takes a group of records (a row group) and stores the values of the first column together for the entire row group, followed by the values of the second column, and so on. For example, for a row group containing the rows (a1,b1), (a2,b2), (a3,b3), a row-oriented file lays out a1 b1 a2 b2 a3 b3, while Parquet lays out a1 a2 a3 followed by b1 b2 b3. Parquet has optimizations for scanning individual columns, so it doesn't have to read the entire row group if you are only interested in a subset of columns.

Installation:

In order to use the Parquet Pig support, the parquet-pig-bundle jar needs to be on Pig's classpath. There are various ways of making that happen, though typically the REGISTER command is used:

REGISTER /path/parquet-pig-bundle-1.5.0.jar;

The command expects a proper URI that can be found either on the local file system or remotely. Typically it's best to reference the jar on a distributed file system (like HDFS or Amazon S3), since the script might be executed on various machines.
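For example, a REGISTER statement pointing at a copy of the jar kept on S3 might look like the following sketch (the bucket and path are placeholders):

-- keep the jar on a distributed file system so every node running the script can reach it
REGISTER s3n://my-s3-bucket/jars/parquet-pig-bundle-1.5.0.jar;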

Loading Data from a Parquet file

As you would expect, loading the data is straightforward:

data = LOAD 's3n://my-s3-bucket/path/to/parquet'
	USING parquet.pig.ParquetLoader 
	AS (a:chararray, b:chararray, c:chararray);

Writing data to a Parquet file

You can store the output of a Pig script in Parquet format using the jar file as well.

STORE result INTO 's3n://my-s3-bucket/path/to/output' USING parquet.pig.ParquetStorer;

Compression Options
You can select the compression to use when writing data with the parquet.compression property. For example:

set parquet.compression=SNAPPY

The valid options for compression are:

  • UNCOMPRESSED
  • GZIP
  • SNAPPY

The default is SNAPPY.

Simple examples of Mortar to Parquet and back
One of the advantages is that Parquet can compress data files. In this example, I downloaded the CMS Medicare Procedure data set. The file is a 1.79GB CSV containing 9,153,274 rows. I wrote a simple Pig script that loads the data, orders it by the provider's state, and then writes it back out both as a pipe-delimited text file and as a Parquet file.

orderDocs = ORDER docs BY nppes_provider_state ASC PARALLEL 1;

STORE orderDocs INTO '$OUTPUT_PATH' USING PigStorage('|');
STORE orderDocs INTO '$PARQUET_OUTPUT_PATH' USING parquet.pig.ParquetStorer;
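For reference, the load that precedes this snippet might look roughly like the following sketch; the path is a placeholder and the schema is heavily abbreviated (the CMS file carries many more columns than shown):

-- assumed load: CSVExcelStorage handles the quoted CSV input; schema truncated for brevity
REGISTER /path/parquet-pig-bundle-1.5.0.jar;
docs = LOAD '$INPUT_PATH' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (npi:chararray, nppes_provider_state:chararray);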

With the regular pipe-delimited output, the file is still 1.79GB. With Parquet, the file is 838.6MB (a 53% reduction in size).

Another advantage is that you can take the resulting output and copy it into an Impala directory (/user/hive/warehouse/mortarP) for an existing Parquet-formatted table. Once you refresh the table (REFRESH), the data is immediately available.

Finally, you can copy the parquet file from an Impala table to your local drive or to an S3 bucket, connect to it with Mortar and work directly with the data.

parquetData = LOAD '/Users/davidfauth/vabusinesses/part-m-00000.gz.parquet' USING parquet.pig.ParquetLoader AS 
   (corp_status_date:chararray,
    corp_ra_city:chararray,
    corp_name:chararray);

If we illustrate this job, we see the resulting data sets.

Using Mortar and Parquet provides additional flexibility when dealing with large data sets.

Filed Under: Hadoop, Impala, Mortar, Parquet, Uncategorized

Extracting Insight from FBO Data Using Mortar and Elasticsearch – Part 4

February 5, 2014 by dave fauth

Earlier this year, the Sunlight Foundation filed a lawsuit under the Freedom of Information Act. The lawsuit requested solicitation and award notices from FBO.gov. In November, Sunlight received over a decade's worth of information and posted it online for public downloading. I want to say a big thanks to Ginger McCall and Kaitlin Devine for the work that went into making this data available.

In the first part of this series, I looked at the data and munged it into a workable set. Once I had the data in a workable set, I created some heatmap charts looking at agencies and who they awarded contracts to. In part two of this series, I created some bubble charts looking at awards by agency and also the most popular awardees. In the third part of the series, I looked at awards by date, displayed that information in a calendar view, and then examined the types of awards as well as some bi-grams in the descriptions of the awards.

Those efforts were time consuming and took a lot of manual work to create the visualizations. Maybe there is a simpler and easier way to look at the data. For this post, I wanted to see if Elasticsearch and their updated dashboard (Kibana) could help out.

MortarData to Elasticsearch
Around October of last year, Elasticsearch announced integration with Hadoop. “Using Elasticsearch in Hadoop has never been easier. Thanks to the deep API integration, interacting with Elasticsearch is similar to that of HDFS resources. And since Hadoop is more then just vanilla Map/Reduce, in elasticsearch-hadoop one will find support for Apache Hive, Apache Pig and Cascading in addition to plain Map/Reduce.”

Elasticsearch published the first milestone (1.3.0.M1) based on the new code-base that has been in the works for the last few months.

I decided to use MortarData to output the fields that I want to search, visualize and dynamically drill down on. Back in October, Mortar updated their platform to allow writing out to Elasticsearch at scale. Using the Mortar platform, I wrote a Pig script that was able to read in the FBO data and manipulate it as needed. I did need to modify the output of the Posted Date and Contract Award Date to ensure I had a date/time format that looked like '2014-02-01T12:30:00-05:00'. I then wrote the data directly to the Elasticsearch index. A sample of the code is shown below:

register '/Users/davidfauth/Downloads/elasticsearch-hadoop-1.3.0.M1.jar';
define ESStorage org.elasticsearch.hadoop.pig.ESStorage('es.resource=fbo/awards');
 
-- More code here
 
B = FOREACH joinedActiveDetails GENERATE 
CONCAT(SUBSTRING(postedDate,0,10),'T12:30:00-05:00') as searchPostedDate,
classCode as searchClassCode,
naicsCode as searchNaicsCode,
agencyName as searchAgencyName,
pocName as searchPOCName,
solicitationNumber as searchSolicatationNumber,
pocEmail as searchPOCEmail,
setAside as searchSetAside,
popCity as searchPopCity,
popZip as searchPopZip,
popCountry as searchPopCountry,
noticeType as searchNoticeType,
contractAwardNumber as searchContractAwardNumber,
contractAwardAmount as searchContractAwardAmount,
CONCAT(contractAwardDate,'T12:30:00-05:00') as searchContractAwardDate,
awardee as searchAwardee,
contractorAwardedDuns as searchContractorAwardedDuns,
noticeID as searchNoticeID,
nltk_udfs.stripHTML(description) AS fboDescr;
 
-- store into ES
STORE B INTO 'fbo/awards' USING ESStorage();

For the GovTrack bills data, it was a similar approach. I ensured the bill's 'Introduction Date' was in the proper format and then wrote the data out to an Elasticsearch index. To ensure the output was correct, a quick illustration showed the proper date format.
Illustrating Pig Script
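A minimal sketch of that GovTrack step, assuming an 'introduceddate' field and a 'govtrack/bills' resource (the relation, field and index names here are illustrative, not taken from the original script), might look like:

-- assumed field and resource names; rawBills stands in for the loaded GovTrack bill records
define ESGovStorage org.elasticsearch.hadoop.pig.ESStorage('es.resource=govtrack/bills');

bills = FOREACH rawBills GENERATE
    CONCAT(SUBSTRING(introducedDate,0,10),'T12:30:00-05:00') AS introduceddate,
    sponsorName AS sponsorname,
    sponsorState AS sponsorstate;

STORE bills INTO 'govtrack/bills' USING ESGovStorage();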

After illustrating the Pig script to verify it worked, I ran it on my laptop, where it took about five minutes to process the FBO data. It took 54 seconds to process the GovTrack data.

Marvel
Elasticsearch just released Marvel. From the blog post,

“Marvel is a plugin for Elasticsearch that hooks into the heart of Elasticsearch clusters and immediately starts shipping statistics and change events. By default, these events are stored in the very same Elasticsearch cluster. However, you can send them to any other Elasticsearch cluster of your choice.

Once data is extracted and stored, the second aspect of Marvel kicks in – a set of dedicated dashboards built specifically to give both a clear overview of cluster status and to supply the tools needed to deep dive into the darkest corners of Elasticsearch.”

I had Marvel running while I loaded the GovTrack data. Let’s look at some screen captures to show the index being created, documents added, and then search request rate.

Before adding an index
This is a look at the Elasticsearch cluster before adding a new index. As you can see, we have two indexes created.

As the Pig job runs in Mortar, we see a third index ("govtrack") created, the document count edge up, and the indexing rate shoot up.

The Pig job has finished and we see the uptick in documents indexed, along with the indexing rate.

This last screen shot shows some later work. I had to drop and recreate an index, hence the small dip in documents and in the indexing rate. You can also see some searches that I ran using Kibana.

In summary, Marvel is a great tool to see what your cluster is doing through the dashboards.

Kibana
Elasticsearch works seamlessly with Kibana to let you see and interact with your data. Specifically, Kibana allows you to create ticker-like comparisons of queries across a time range and compare across days or a rolling view of average change. It also helps you make sense of your data by easily creating bar, line and scatter plots, or pie charts and maps.

Kibana sounded great for visualizing and interactively drilling down into the FBO data set. Installation and configuration are simple: download, unzip, modify a single config.js file, and open the URL (as long as you unzipped it somewhere your web server can serve it).

Some of the advantages of Kibana are:

  • You can get answers in real time. In my case, this isn't as important as it would be for log file analysis.
  • You can visualize trends through bar, line and scatter plots. Kibana also provides pie charts and maps.
  • You can easily create dashboards.
  • The search syntax is simple.
  • Finally, it runs in the browser so set-up is simple.

Kibana Dashboard for GovTrack Data
Using Kibana, I'm going to create a sample dashboard to analyze the GovTrack bills data. You can read more about the dataset here. In a nutshell, I wanted to see if Kibana could let me drill down into the data and easily look at this data set.

In my dashboard, I’ve set up multiple search terms. I’ve chosen some topics and sponsors of bills. We have Health, Immigration, Guns, Obama, Clinton and Daschle. I’ve added in some filters to limit the search index to the bills and set up a date range. The date range is from 2005 through 2012 even though I only have a couple of years worth of data indexed. We are shown a dataset of 9,428 bills.

Let’s look at an easy way to see when the term “Affordable Care Act” showed up in various bills. This is easily done by adding this as a filter.

In order to see this over time, we need to add a row and a Histogram panel. In setting up the panel, we set the time field to "introduceddate", adjusted the chart settings to show the line, legend, and x and y axis legends, and then chose an interval; I chose an 8-week interval. Once this is added, the histogram shows the bills mentioning the term "Affordable Care Act" in relation to the other search terms. In our example, we see the first mention in early 2010 and various bills introduced over the next two years. We also see that the term "immigration" shows up in 8 bills and none of the other search terms appear at all.

Down below, we add a table panel that lets us see details from the raw records. We can look at the sponsor, bill text, and other values. Kibana allows us to expand each record, look at the data in raw, JSON or table format, and select which columns to view. We can also run a quick histogram on any of the fields, as shown below. Here I clicked on the bill sponsor to see who the most common sponsor is.

We'll add one other panel: the Popular Terms panel. This shows us, by count, the most popular terms in the filtered result set. Kibana allows you to exclude terms and set the display as either a bar chart, pie chart or table.

I created another quick dashboard that queries for the term "Patient Protection and Affordable Care Act". I added a row to the dashboard and added a map panel. The map panel allows you to choose between a world map, US map or European map. Linking the US map to the 'sponsorstate' field, I can quickly see which states sponsored the most bills discussing the "Patient Protection and Affordable Care Act". I can also see that Oregon, North Dakota and Indiana had no sponsors. That dashboard is below:

Map panel for bill sponsorship

Kibana Dashboard for FBO data
Kibana allows you to create different dashboards for different data sets. I’ll create a dashboard for the FBO data and do some similar queries. The first step is to create the queries for “sole source”, “small business”, “research and development”, “medical care” and “medicare”. I created a time filter on the contract award date and then set the agency name to the “Bureau of Prisons”. Finally, I added in a histogram to show when these terms showed up in the contract description. “Small business” is by far the most popular of those terms.

One of the neat things is that you can easily modify the histogram date range. In this case, I'm using an 8-week window but could easily drill in or out. You can also draw a box within the histogram to drill into a specific date range. It is much easier and more interactive; there is no need to re-run a job for new criteria or a new date range.

Kibana allows you to page through the results. You can easily modify the number of results and the number of results per page. In this case, I’ve set up 10 results per page with a maximum of 20 pages. You can open each result and see each field’s data in a table, json or raw format.

The so-what
Using Elasticsearch and Kibana with both the FBO and the GovTrack data worked well since both have timestamped data. While the Elasticsearch stack is normally thought of in terms of ELK (Elasticsearch, Logstash, and Kibana), using non-Logstash data worked out great. Kibana is straightforward to use and provides the ability to drill down into data easily. You don't have to configure a data stream or set up a JavaScript visualization library; all of the heavy lifting is done for you. It is definitely worth your time to check it out and experiment with it.

Filed Under: Elasticsearch, Hadoop, Mortar, opendata, Visualization Tagged With: elasticsearch, hadoop, kibana, mortar, mortardata

Extracting Insights from FBO.Gov data – Part 3

January 21, 2014 by dave fauth

Earlier this year, the Sunlight Foundation filed a lawsuit under the Freedom of Information Act. The lawsuit requested solicitation and award notices from FBO.gov. In November, Sunlight received over a decade's worth of information and posted it online for public downloading. I want to say a big thanks to Ginger McCall and Kaitlin Devine for the work that went into making this data available.

In the first part of this series, I looked at the data and munged it into a workable set. Once I had the data in a workable set, I created some heatmap charts looking at agencies and who they awarded contracts to. In part two of this series, I created some bubble charts looking at awards by agency and also the most popular awardees.

In the third part of the series, I am going to look at awards by date and then display that information in a calendar view. Then we will look at the types of awards.

For the date analysis, we are going to use all of the data going back to 2000. We have six data files that we will join together, filter on the ‘Notice Type’ field, and then calculate the counts by date for the awards. The goal is to see when awards are being made.

The Data Munging
The CSV files contain things like HTML codes and carriage returns within cells. During initial runs, the Piggybank function CSVExcelStorage wasn't able to correctly parse the CSV files: other values would end up in the date field and produce inaccurate results. To work around that problem, I found the SuperCSV Java classes. Using their sample code, I was able to write a custom cell processor that reads in the CSV file, strips out carriage returns, and writes the information back out in a tab-delimited format.

My friends at Mortar are taking a look at the data and the CSVExcelStorage function to see if my issue can be addressed.

Pig Script
The Pig script is really simple. I read in all of the metadata files and filter them to records where there is an award and it isn't null. All six files are then joined, and a new data set is generated containing just the award date. These dates are then grouped and counted. No real rocket science there.
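A minimal sketch of that script, with placeholder file names and an abbreviated schema (the real metadata files carry many more columns), might look like this:

-- sketch only: load the cleaned tab-delimited metadata, keep awarded notices, count awards per date
metadata = LOAD '$INPUT_PATH/fbo_metadata_*' USING PigStorage('\t')
    AS (noticeType:chararray, contractAwardDate:chararray);
awards = FILTER metadata BY noticeType == 'Award Notice' AND contractAwardDate IS NOT NULL;
awardDates = FOREACH awards GENERATE contractAwardDate AS awardDate;
groupedDates = GROUP awardDates BY awardDate;
awardCounts = FOREACH groupedDates GENERATE group AS awardDate, COUNT(awardDates) AS numAwards;
STORE awardCounts INTO '$OUTPUT_PATH/awards_by_date' USING PigStorage('\t');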

Here is a sample of the results:

Date Count
2012-09-28 1190
2009-09-30 1187
2010-09-30 1052
2010-09-29 1027
2009-06-04 1002

Visualization
For visualization of this data, I used an adaptation of Mike Bostock’s D3.js calendar example. I used the same colors but made a few tweaks to the scale. A snapshot calendar is below:

In this visualization, larger award counts are green while lower award counts are in red. Anyone familiar with government contracting will notice that “end of the fiscal year” awards are always made in September. Contracting officers need to spend fiscal year money before it expires. In the calendar, it is easy to see this for each of the fiscal years.

Data by Categories
Let’s look at the data by categories. There are 103 unique categories of award types in the data sets. You can see that list here. There are over 9600 combinations of agencies awarding contracts in a category. You can see the CSV file here.

One of the curious categories is the “Live Animals” category. The agency, category and number of awards are listed below:

Agency Category Number of Awards
Customs and Border Protection 88 — Live animals 46
National Institutes of Health 88 — Live animals 24
National Park Service 88 — Live animals 9
Centers for Disease Control and Prevention 88 — Live animals 7
Army Contracting Command, MICC 88 — Live animals 6
Office of the Chief Procurement Officer 88 — Live animals 4
United States Marshals Service 88 — Live animals 4
Agricultural Research Service 88 — Live animals 3
Animal and Plant Health Inspection Service 88 — Live animals 3
Food and Drug Administration 88 — Live animals 3
Air Force Materiel Command 88 — Live animals 2
Bureau of Alcohol, Tobacco and Firearms (ATF) 88 — Live animals 2
Forest Service 88 — Live animals 2
Public Buildings Service (PBS) 88 — Live animals 2
TRICARE Management Activity 88 — Live animals 2
U.S. Special Operations Command 88 — Live animals 2
United States Marine Corps 88 — Live animals 2
United States Secret Service (USSS) 88 — Live animals 2
Army Contracting Command 88 — Live animals 1
Bureau of Indian Affairs 88 — Live animals 1
Bureau of Medicine and Surgery 88 — Live animals 1
Direct Reporting Unit – Air Force District of Washington 88 — Live animals 1
Direct Reporting Units 88 — Live animals 1
Farm Service Agency 88 — Live animals 1
FedBid 88 — Live animals 1
Fish and Wildlife Service 88 — Live animals 1
Fresno VAMC 88 — Live animals 1
Office of Acquisitions 88 — Live animals 1
Pacific Air Forces 88 — Live animals 1
San Francisco VAMC 88 — Live animals 1
U.S. Army Corps of Engineers 88 — Live animals 1
VA Connecticut Health Care System 88 — Live animals 1
Washington Headquarters Services 88 — Live animals 1

Bi-Grams in the Description
I decided to take a look at the descriptions associated with the awards and see what bi-grams appeared when running them through NLTK. I created a simple Pig script to filter the awards, join them to the descriptions, and then run the descriptions through Python/NLTK.

The Python/NLTK piece is below:

# imports needed when this UDF lives in its own file (in Mortar, outputSchema comes from pig_util)
from pig_util import outputSchema
import nltk

@outputSchema("top_five:bag{t:(bigram:chararray)}")
def top5_bigrams(textDescription):
    # split the description into sentences, then tokenize each sentence into words
    sentences = nltk.tokenize.sent_tokenize(textDescription)
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]

    # score bigram collocations by likelihood ratio and keep the top five
    bgm    = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_documents(tokens)
    top_5  = finder.nbest(bgm.likelihood_ratio, 5)

    # return a bag of (bigram) tuples, matching the declared output schema
    return [ ("%s %s" % (s[0], s[1]),) for s in top_5 ]
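For context, a rough sketch of the Pig side that registers this UDF and applies it to the joined award descriptions might look like the following; the relation, file and path names here are placeholders rather than the original script:

-- register the Python UDF file under the nltk_udfs namespace (Mortar's CPython support)
REGISTER 'nltk_udfs.py' USING streaming_python AS nltk_udfs;

-- awardDescriptions is assumed to already hold (noticeID, description) pairs after the filter and join
bigramsByAward = FOREACH awardDescriptions GENERATE
    noticeID,
    nltk_udfs.top5_bigrams(description) AS top_five;

STORE bigramsByAward INTO '$OUTPUT_PATH/award_bigrams' USING PigStorage('\t');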

Results
A majority (109 out of 192) consisted of “No description provided”. Thus the bi-gram results were “{(description provided),(no description)}”. Of the remaining awards, there were some interesting results. For example:

{(, boarding),(10 paso),(5 potential),(and general),(boarding and)} was from a US Marshals Service Contract for “serives to include the transportation, boarding and general care of 10 paso fino horses”.

{(country of),(of japan),(japan .),(the country),(& removal)} was from a Pacific Air Forces contract for “this is feral pig control & removal service at tama service annex. ”

and then there was this one:
{(for the),(farm hands),(on april),(a contract),(ms. crundwell)}
{(for the),(the horses),(on april),(care for),(ms. crundwell)}

contract was awarded under far 6.302-2, unusual and compelling urgency. on april 17, 2012, rita crundwell, comptroller for the city of dixon, illinois, was arrested by the federal bureau of investigation (fbi) for wire fraud. ms. crundwell was accused of embezzling $53 million from the city of dixon. on april 18, 2012, the u.s. marshals service was notified by the fbi that ms. crundwell’s bank accounts had been frozen. the defendant was using embezzled money for the care and maintenance of over 200 horses located in dixon, illinois and beliot, wisconsin and several additional locations. since her accounts were frozen, the farm/ranch did not have the means to care for the horses. farm hands continued to work after this date even though they were unsure if they would receive payment. if the farm hands were to walk off the job, there would be no means to care for the horses. in addition, weekly deliveries of hay and grain would cease. the usms issued purchase orders for hay and grain deliveries to continue until a contract award could be made for the management and care of the horses. the government recognized that there was an immediate need to care for the animals. if a contract was not awarded immediately, the lives of the horses would be at risk.

You can read about Ms. Crundwell here. There were two contracts awarded for $625,840 and for $302,850 to ensure the horses’ safety.

In summary, I looked at the award dates for patterns and then examined the text descriptions for interesting data combinations.

Filed Under: Hadoop, Mortar, opendata Tagged With: analysis, d3js, FBO, hadoop, mortar, visualization

Extracting Insights from FBO.Gov data – Part 2

January 5, 2014 by dave fauth

Earlier this year, the Sunlight Foundation filed a lawsuit under the Freedom of Information Act. The lawsuit requested solicitation and award notices from FBO.gov. In November, Sunlight received over a decade's worth of information and posted it online for public downloading. I want to say a big thanks to Ginger McCall and Kaitlin Devine for the work that went into making this data available.

In the first part of this series, we looked at the data and munged it into a workable set. Once I had the data in a workable set, I created some heatmap charts looking at agencies and who they awarded contracts to. Those didn't really work out all that well. Actually, they sucked. They didn't represent the data well at all: the data was too sparsely related, so I was forced to show more data than really made sense. In part two of this series, we will create some bubble charts looking at awards by agency and also the most popular awardees.

The Data
For this analysis, we are going to use all of the data going back to 2000. We have six data files that we will join together, filter on the ‘Notice Type’ field and then calculate the sum and count values by the awarding agency and by the awardee.

Pig Script
Again, I created a Mortar project to process the data. The Pig script loads the six tab-delimited files, filters them by '(noticeType == 'Award Notice' AND contractAwardAmount IS NOT NULL)', does a union, and then calculates the count and sum values. I used a small piece of Python code to clean up the award amount field.
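As a rough sketch of the agency-side aggregation (placeholder paths, abbreviated schema, and assuming the award amount has already been cleaned to a number by the Python helper), the script does something along these lines:

-- sketch only: the real script unions six yearly files and also aggregates by awardee
metadata = LOAD '$INPUT_PATH/fbo_metadata_*' USING PigStorage('\t')
    AS (agencyName:chararray, noticeType:chararray, awardee:chararray, contractAwardAmount:double);
awards = FILTER metadata BY noticeType == 'Award Notice' AND contractAwardAmount IS NOT NULL;
byAgency = GROUP awards BY agencyName;
agencyTotals = FOREACH byAgency GENERATE
    group AS agencyName,
    COUNT(awards) AS numAwards,
    SUM(awards.contractAwardAmount) AS totalAwarded;
STORE agencyTotals INTO '$OUTPUT_PATH/agency_totals' USING PigStorage('\t');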

The results are written out to a delimited file. Looking at the results, there were 465 agencies that awarded a total of 428,937 contracts. Again, this is for all of the available data (2000-present). For awardees, there were over 192,000 unique names receiving awards.

Visualization
For this visualization, I wanted to try a bubble chart. The bubble chart lets me visualize the number of awards and the size of the awards, with each bubble sized relative to the number of awards.

For the bubble chart, I am using HighCharts, a javascript library for creating charts. Since this is a personal, non-commercial website, there is no charge to use the product.

Here is a sample of the results for agency awards:

Agency Number of Awards Sum of Award Amount
DLA Acquisition Locations 119231 34758490026.0626
Naval Supply Systems Command 42013 1330565313323460
Army Contracting Command 40464 162850429349.624
Air Force Materiel Command 37267 16473214961095.8
U.S. Army Corps of Engineers 17032 211563631159.529

And here’s a sample of the results for awardees:

Awardee Number of Awards Sum of Award Amount
KAMPI COMPONENTS CO., INC.-7Z016 KAMPI COMPONENTS CO., INC. 88 CANAL RD FAIRLESS HILLS PA 19030-4302 US 1312 89038206.8847656
OSHKOSH CORPORATION-45152 OSHKOSH CORPORATION 2307 OREGON ST OSHKOSH WI 54902-7062 US 1017 159693165.185546
PIONEER INDUSTRIES, INC.-66200 PIONEER INDUSTRIES, INC. 155 MARINE ST FARMINGDALE NY 11735-5609 US 975 58782020.125
BELL BOEING JOINT PROJECT OFFICE, AMARILLO, TX 79120-3160 940 465747753

Let’s look at a simple bar chart of the Agency Awards:

Now, let’s look at the bubble chart of the same data:

Here is a simple bar chart of award recipients:

And a bubble chart showing awards by recipients:

Analysis

There really isn't a lot of surprise in which agencies are awarding contracts: 8 of the top 10 are in the defense industry. Similarly, the top recipient of contracts was Kampi Components, which supplies factory replacement spare parts to the United States military. Number two was Oshkosh Corporation, a manufacturer of severe heavy-duty all-wheel-drive defense and military trucks, and aircraft rescue and firefighting vehicles. The one outlier is NIH, which didn't award a large number of contracts but did award large ones.

Next steps will be to look at the types of awards and see how to display those in a meaningful fashion.

Filed Under: Hadoop, Mortar, opendata, Uncategorized, Visualization Tagged With: FBO, hadoop, mortar, OpenData, visualization

Extracting Insights from FBO.Gov data – Part 1

December 30, 2013 by dave fauth

Earlier this year, the Sunlight Foundation filed a lawsuit under the Freedom of Information Act. The lawsuit requested solicitation and award notices from FBO.gov. In November, Sunlight received over a decade's worth of information and posted it online for public downloading. I want to say a big thanks to Ginger McCall and Kaitlin Devine for the work that went into making this data available.

From the Sunlight page linked above:
“The notices are broken out into two parts. One file has records with the unique id for each notice, and almost all of the related data fields. The other file with the _desc suffix has the unique id for each notice and a longer prose field that includes the description of the notice. Additionally, the numbers at the end of each file name indicate the year range.” So what we have are two files for each year: one file has the metadata about the solicitation and award, and the other related file has the verbiage about the notice.

The metadata consists of the following fields:

  • Posted Date
  • Class Code
  • Office Address Text
  • Agency Name
  • POC Name
  • POC Text
  • Solicitation Number
  • Response Deadline
  • Archive Date
  • Additional Info Link
  • POC Email
  • Set Aside
  • Notice Type
  • Contract Award Number
  • Contract Award Amount
  • Set Aside
  • Contract Line Item Number
  • Contract Award Date
  • Awardee
  • Contractor Awarded DUNS
  • JIA Statutory Authority
  • JA Mod Number
  • Fair Opportunity JA
  • Delivery Order Number
  • Notice ID

The metadata file is a CSV approximately 250-300MB in size. It is a relatively large file that isn't easily manipulated in Excel. Additionally, it has embedded quotes, commas, HTML code, and fields containing embedded line returns. For file manipulation, I decided to use Vertascale. Vertascale is designed to provide real-time insights for data professionals on data stored in S3, Hadoop or on your desktop. It can easily handle text, JSON, zip, and a few other formats.

The decision to use Vertascale was driven by the size of the file, Vertascale's ability to parse the file correctly, and its ability to transform the data into a format that I could use with other tools. Let's run through how Vertascale can do this quickly and easily:

A) Opening & converting raw data into a more useful format

Step 1: Open the Vertascale application and navigate to the file on my local machine.

Step 2: Next, use Vertascale's built-in parsing tools to convert the file into a more user-friendly view.

Step 3: Export the columns into a file that we can use later.

B) Getting a feel for the data

Now that the data is loaded in Vertascale, we can explore and analyze it (which can be an overwhelming task at times). We will look at the Browse and Count features of Vertascale to understand the data.

1) Browse
I used Vertascale's built-in data browser to browse through the dataset and see what data was available. With the slider and "next 50" options, I could quickly go to any spot in the file and see the data there.

2) Count
I wanted to look at the "Set aside" column to see what the distribution of set-aside values was in the file. Using the "Count Distinct" function, I was able to see how those values were distributed across the entire file.

Next Steps
At this point, we have retrieved the FBO data files, used Vertascale to get a feel for what is in them, and extracted columns to do some initial analysis. In the next post, we'll dive in a little more and look at some ways of analyzing and visualizing this data.

Filed Under: Hadoop, Mortar, Vertascale, Visualization Tagged With: analysis, hadoop, mortar, mortardata, visualization
