Vijay holds a PhD in distributed computing and has over 15 years of experience building enterprise software, with recent work on re-engineering machine learning algorithms to run on big data and in real time using Storm, Spark, Hadoop, NoSQL databases such as Cassandra and MongoDB, and distributed SQL stores. He currently heads the data services teams at SapientNitro as Director of Technology.
We were dealing with a class of big data analytics from the campaign management media domain that required non-trivial distributed computations. This article summarizes our experience implementing these computations on common distributed analytical stores and benchmarking them.
The nature of the computations can be understood through the analogy of a distributed merge tree. Users are identified by unique identifiers called pel_ids and are associated with domains such as facebook.com. Domains are grouped into sub-categories, which are themselves grouped into categories: social is an example of a category, with blog or chat as sub-categories; similarly, entertainment could be a category with radio or news as sub-categories. The computation involves a merge at each level of the tree above the leaves, where a merge may involve a string sort and removal of duplicates. The overall computation merges records all the way up to the root of the tree; the count of records at each node divided by the overall count gives the reachability of that node, in domain parlance.
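To make the computation concrete, the sketch below walks a small two-level tree, merging sorted pel_id lists with duplicate removal at each node and deriving reachability as the node count divided by the root count. This is purely illustrative, not our production code; the domain-to-sub-category grouping and all identifiers are assumptions made for the example.

```python
# Minimal sketch of the distributed merge-tree computation described above.
# The grouping of domains under sub-categories and the pel_ids are assumed
# for illustration only.
from heapq import merge


def merge_dedup(sorted_lists):
    """Merge already-sorted pel_id lists and drop duplicates (string sort + dedup)."""
    merged = []
    previous = None
    for pel_id in merge(*sorted_lists):
        if pel_id != previous:
            merged.append(pel_id)
            previous = pel_id
    return merged


# Leaf level: sorted pel_ids observed per domain (hypothetical data).
domains = {
    "facebook.com": ["p001", "p002", "p007"],
    "someblog.com": ["p002", "p005"],
    "somechat.com": ["p001", "p009"],
}

# Sub-categories merge their domains; categories merge their sub-categories.
sub_categories = {
    "blog": merge_dedup([domains["someblog.com"]]),
    "chat": merge_dedup([domains["somechat.com"], domains["facebook.com"]]),
}
categories = {"social": merge_dedup(list(sub_categories.values()))}
root = merge_dedup(list(categories.values()))

# Reachability of a node = count of pel_ids at the node / count at the root.
for name, pel_ids in {**sub_categories, **categories}.items():
    print(name, len(pel_ids) / len(root))
```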
We have implemented the same computations on top of state-of-the-art analytical stores such as Apache HAWQ and Druid. Druid is emerging as an important analytical store that helps in performing ad-hoc queries on large data sets, while HAWQ is a distributed SQL engine over the Hadoop Distributed File System (HDFS).
Apache HAWQ and Druid Custom Architectures
It must be noted that while HAWQ does not mandate a minimum number of nodes for a cluster, Druid recommends at least four (one node hosting the Broker and Coordinator, two data/historical nodes, and one node for ZooKeeper). Our HAWQ architecture is quite specialized and is shown in Figure 1.
We formed three HAWQ clusters, each holding a partition of the data and each partitioning its data internally using a multi-dimensional partitioning scheme. While it could be asked why we did not run three Druid clusters instead, it must be kept in mind that Druid requires several different node types within a single cluster, such as the Broker, Coordinator, ZooKeeper, and historical nodes.
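As a rough illustration of this layout, the sketch below routes each record to one of the three clusters and then to a partition within that cluster by hashing two dimensions. It is not our actual loader; the dimension names `category` and `program_id` and the partition count are assumptions for the example.

```python
import hashlib

NUM_CLUSTERS = 3            # three independent HAWQ clusters, each holding a slice of the data
PARTITIONS_PER_CLUSTER = 8  # assumed number of intra-cluster partitions


def _bucket(value: str, buckets: int) -> int:
    """Stable hash of a string into a fixed number of buckets."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def route(record: dict) -> tuple:
    """Return (cluster, partition) for a record.

    The cluster is chosen by hashing one dimension; the partition within the
    cluster hashes a second dimension, giving a simple multi-dimensional scheme.
    """
    cluster = _bucket(record["category"], NUM_CLUSTERS)
    partition = _bucket(record["program_id"], PARTITIONS_PER_CLUSTER)
    return cluster, partition


print(route({"category": "social", "program_id": "prog_42"}))
```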
Table 1 captures one set of performance benchmarks comparing HAWQ and Druid. A few definitions make the table clearer: pel_ids are unique ids generated for each user in the audience, and all demographic attributes, actions, and behaviors of a user are linked to this id. Action_ids represent the set of high-value actions performed by users, such as watching a program or visiting a website. Program_id refers to the unique id assigned to a program/series. The computations described above took about 20 seconds on a 3-node HAWQ cluster on AWS, while they took about a minute on a 4-node Druid cluster on AWS.
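For readers who want to see how these ids relate, a minimal schema sketch is given below. Only pel_id, action_id, and program_id come from the definitions above; the remaining field and all sample values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserEvent:
    """One high-value action tied back to an audience member.

    pel_id: unique id generated for each user in the audience.
    action_id: the high-value action performed (e.g., watching a program, visiting a website).
    program_id: unique id of the program/series, when the action is a viewing event.
    domain: the domain on which the action occurred (assumed field, not from the article).
    """
    pel_id: str
    action_id: str
    program_id: Optional[str]
    domain: str


event = UserEvent(pel_id="p001", action_id="watch", program_id="prog_42", domain="facebook.com")
print(event)
```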