Big Data Analytics: BigQuery, Impala, and Drill

Last updated on Nov 07,2024 3.7K Views

Big Data Analytics: BigQuery, Impala, and Drill

edureka.co

In previous post, we discussed Apache Hive, which first brought SQL to Hadoop. There are actually several SQL on Hadoop solutions competing with Hive head-to-head. Today, we will look into Google BigQuery, Cloudera Impala and Apache Drill, which all have a root to Google Dremel that was designed for interactive analysis of web-scale datasets. In a nutshell, they are native massively parallel processing query engine on read-only data.

Google BigQuery is the public implementation of Dremel. BigQuery provides the core set of features available in Dremel to third party developers via a REST API. Impala is Cloudera’s open source SQL query engine that runs on Hadoop. It is modeled after Dremel and is Apache-licensed. Impala became generally available in May 2013. Drill is another open source project inspired by Dremel and is still incubating at Apache. Both Impala and Drill can query Hive tables directly. Impala actually uses Hive’s metastore.

Hive is basically a front end to parse SQL statements, generate and optimize logical plans, translate them into physical plans that are finally executed by a backend such as MapReduce or Tez. Dremel and its derivatives are different as they execute queries natively without translating them into MapReduce jobs. For example, the core Impala component is a daemon process that runs on each node of the cluster as the query planner, coordinator, and execution engine. Each node can accept queries. The planner turns a request into collections of parallel plan fragments. The coordinator initiates execution on remote nodes in the cluster. The execution engine reads and writes to data files, and transmits intermediate query results back to the coordinator node.

The two core technologies of Dremel are columnar storage for nested data and the tree architecture for query execution:

Columnar Storage

Data is stored in a columnar storage fashion to achieve very high compression ratio and scan throughput.

Tree Architecture

The architecture forms a massively parallel distributed multi-level serving tree for pushing down a query to the tree and then aggregating the results from the leaves.

These are good ideas and have been adopted by other systems. For example, Hive 0.13 has the ORC file for columnar storage and can use Tez as the execution engine that structures the computation as a directed acyclic graph. Both (and other innovations) help a lot to improve the performance of Hive. However, the benchmark from Cloudera (the vendor of Impala) and the benchmark by AMPLab show that Impala still has the performance lead over Hive. It is well known that benchmarks are often biased due to the hardware setting, software tweaks, queries in testing, etc. But it is still meaningful to find out what possible design choice and implementation details cause this performance difference. And it may help both communities improve the offerings in the future. What follows is a list of possible reasons:

As you see, some of these reasons are actually about the MapReduce or Tez. With the continuous improvements of MapReduce and Tez, Hive may avoid these problems in the future. Besides, the last two are the features of Dremel and it is not clear if Impala implements them.

In summary, Dremel and its derivatives provide us an inexpensive way to do interactive big data analytics. The Hadoop ecosystem is now a real threat to the traditional relational MPP data warehouse systems. The benchmark by AMPLab shows that Amazon Redshift (based on ParAccel by Actian) still has the performance lead over Impala but the gap is small. With continuous improvements (e.g. both Hive and Impala are working on cost based plan optimizer), we can expect SQL on Hadoop/HDFS at higher level in near feature.

Also, Edureka has a specially curated Data Analyst Course that will make you proficient in tools and systems used by Data Analytics Professionals. It includes in-depth training on Statistics, Data Analytics with R, SAS, and Tableau. The curriculum has been determined by extensive research on 5000+ job descriptions across the globe.

Got a question for us? Please mention it in the comments section and we will get back to you.

Related Posts:

10 Hottest Tech Skills to Master in 2016

Upcoming Batches For Data Analyst Certification Course
Course NameDateDetails
Data Analyst Certification Course

Class Starts on 28th December,2024

28th December

SAT&SUN (Weekend Batch)
View Details
BROWSE COURSES
REGISTER FOR FREE WEBINAR Data Science Applications for Real-World Analysis