February 23, 2014

Low Latency Query Frameworks for Hadoop

Strata Conference is about much more than Hadoop. However, a common thread across this year's themes was that HDFS and YARN (Yet Another Resource Negotiator) have been accepted as the standard base frameworks for building a big data platform. The Hadoop community is now focusing on building a strong ecosystem of tools to accelerate enterprise adoption of Hadoop.

The Hadoop ecosystem needs mature tools for fast, interactive querying to match the performance of MySQL or Oracle. Streaming, user-friendly data integration, workflow scheduling and machine learning are the other areas where the community is now pushing to mature the ecosystem. The very innovation that drives the open source community, a significant contributor to the Hadoop ecosystem, also makes large enterprises nervous about adopting new open source tools, because they cannot tell what is here to stay and what is just a passing fad.

Four initiatives, Impala, Apache Tez, Shark and Apache Tajo, promise to bring SQL-friendly, interactive Hadoop to enterprises and the Hadoop community. They are at various stages of maturity.

Impala is backed by Cloudera. Its code is published under the Apache license, but development is driven by Cloudera rather than through the Apache Software Foundation. Having a big vendor such as Cloudera behind it makes it unlikely to be abandoned. However, the open source community is quite vocal in setting the direction of the Hadoop ecosystem, and I therefore doubt that it will embrace Impala with open arms. Impala is already available.

Apache Tez started as a Hortonworks initiative and is now in incubation at the Apache Software Foundation. It does not deviate from the core Hadoop philosophy and offers Hive a plugin to run faster on Hadoop on top of YARN. Hortonworks reports a 40x SQL speedup with Hive 0.12.

Shark is a project from Berkeley. It runs on Apache Spark, which is a project under incubation at the Apache Software Foundation (ASF). Apache Spark's relatively large following in the open source community is very promising. Its vision of a uniform set of tools for machine learning, querying, streaming and graph processing on top of HDFS and YARN is very appealing. Shark sits as a layer between Hive and Spark to speed up queries, an approach similar to that of Apache Tez.

Apache Tajo is another project that provides low-latency query access to Hadoop in support of data warehousing. Its team members come from organizations as varied as LinkedIn, Hortonworks and Korea University.

When it comes to selecting a query tool for Hadoop, it is better to stay with Hive, which seems to enjoy good support from the community. For faster queries on top of Hive, I would suggest evaluating Apache Tez and Shark. Tez is open source software with strong backing from Hortonworks and is likely to graduate soon from incubation to a top-level ASF project. Shark benefits from the strong following of Apache Spark in the open source community and is therefore likely to see more features and faster acceptance.

Despite the ever-growing list of xxxDB databases for big data, Hadoop continues to be the central building block of any big data platform. The focus has now shifted to building a powerful ecosystem of technologies that make Hadoop acceptable to enterprises, which are less tech-savvy than companies such as Facebook and Yahoo, where Hadoop was successfully deployed to handle massive datasets. Vendors and the community will also have to address the migration of legacy databases to Hadoop, a problem unheard of in the tech startups where Hadoop found its early adopters.
