Don't fear the elephant

Photo credit: https://www.flickr.com/photos/rcrhee/10434611754/


Cambridge, MA based Burstworks is an Internet advertising technology company that focuses on enabling web applications to monetize their users with a better, less jarring native advertising experience. A key differentiator for Burstworks has been their ability to analyze user data to drive product and strategy decisions.


Burstworks approached us with a time-sensitive problem: the volume of data they were generating had outgrown their MySQL reporting systems and they needed a more scalable data analytics infrastructure ASAP.

At a high level, Burstworks was collecting "click stream" data about their users, ingesting it into a MySQL database, and then running nightly batched reports. As the volume of the data increased, though, this solution broke down. Even on large servers MySQL was unable to complete their queries in a reasonable amount of time. Additionally, this setup’s heavy reliance on MySQL indexes made ad-hoc querying practically impossible.

Burstworks knew they needed to move off MySQL and needed some help in getting it done fast.


  • Effectively research and evaluate various solutions - taking into account cost, features, and future expansion
  • Support business goals by allowing both batched and ad-hoc queries
  • Potentially face issues deploying a closed source or proprietary solution
  • Mitigate engineering risk associated with developing new code inside Burstworks' existing codebase


After evaluating several solutions including Vertica, MongoDB, and Google BigQuery we decided that Apache Hadoop was the best fit for Burstworks' use case. After we'd selected Hadoop, we surveyed the available distributions and settled on Cloudera because it offered the best mix of user facing tools, a strong community, and extensive documentation.

Once we'd settled on Cloudera, we worked closely with the Burstworks team to get the cluster operational as fast as possible. Practically, that meant setting up and configuring the cluster, ingesting the data, and then running batch queries against the data. Because of the time pressure, we moved quickly and were able to get to an operational cluster in under a month and had ad-hoc queries spinning within two.

Moving to Hadoop removed a limiting constraint by making their data processing pipeline horizontally scalable. Have more data? Just add more machines.

"Working with Setfive to setup Hadoop freed up our engineers to work on our other client-facing initiatives, and as a startup that's critical. Having a local team where in person meetings are easy was incredibly helpful to stay aligned throughout the project."