Big Data Analytics: BigQuery, Impala, and Drill

In the previous post, we discussed Apache Hive, which first brought SQL to Hadoop. There are actually several SQL-on-Hadoop solutions competing head-to-head with Hive. Today, we will look into Google BigQuery, Cloudera Impala and Apache Drill, all of which trace their roots to Google Dremel, a system designed for interactive analysis of web-scale datasets. In a nutshell, they are native massively parallel processing (MPP) query engines over read-only data.

Google BigQuery is the public implementation of Dremel. BigQuery provides the core set of features available in Dremel to third-party developers via a REST API. Impala is Cloudera’s open source SQL query engine that runs on Hadoop. It is modeled after Dremel and is Apache-licensed. Impala became generally available in May 2013. Drill is another open source project inspired by Dremel and is still incubating at Apache. Both Impala and Drill can query Hive tables directly; Impala actually uses Hive’s metastore.

Hive is basically a front end that parses SQL statements, generates and optimizes logical plans, and translates them into physical plans that are finally executed by a backend such as MapReduce or Tez. Dremel and its derivatives are different in that they execute queries natively without translating them into MapReduce jobs. For example, the core Impala component is a daemon process that runs on each node of the cluster and acts as query planner, coordinator, and execution engine. Each node can accept queries. The planner turns a request into collections of parallel plan fragments. The coordinator initiates execution on remote nodes in the cluster. The execution engine reads and writes data files, and transmits intermediate query results back to the coordinator node.

The two core technologies of Dremel are columnar storage for nested data and the tree architecture for query execution:

Columnar Storage

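Data is stored column by column rather than row by row, which yields very high compression ratios and scan throughput: the values within a column share a type and often repeat, and a query reads only the columns it references.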
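
To make the idea concrete, here is a minimal sketch of a columnar layout. It assumes flat records (Dremel's encoding of nested data is not covered), and the sample rows and the helper names to_columns and run_length_encode are hypothetical, chosen only for illustration.

```python
from itertools import groupby

# Hypothetical flat log records; real Dremel data is nested, which this sketch ignores.
rows = [
    {"country": "US", "status": 200, "bytes": 512},
    {"country": "US", "status": 200, "bytes": 1024},
    {"country": "US", "status": 404, "bytes": 0},
    {"country": "DE", "status": 200, "bytes": 256},
    {"country": "DE", "status": 200, "bytes": 256},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Repeated values sit next to each other in a column, so they compress well.
encoded_country = run_length_encode(columns["country"])

# A query such as SUM(bytes) scans a single column and never touches the others.
total_bytes = sum(columns["bytes"])
```

Run-length encoding is only one of several encodings a columnar format might apply; it is used here because it shows the compression benefit in a few lines.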

Tree Architecture

Queries execute on a massively parallel, distributed, multi-level serving tree: a query is pushed down the tree, and the results are aggregated from the leaves back up to the root.
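
The push-down and aggregate-up pattern can be sketched as follows. This is a single-process illustration under simplifying assumptions, not how Dremel or Impala is implemented: the Node class, its fields, and query_average are hypothetical names, and each leaf simply holds a small list of numbers standing in for a tablet.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A node in a toy serving tree; only leaves hold data."""
    children: List["Node"] = field(default_factory=list)
    tablet: List[float] = field(default_factory=list)

    def execute(self) -> Tuple[int, float]:
        """Return the partial aggregate (count, total) for this subtree."""
        if not self.children:
            # Leaf: scan the local tablet.
            return len(self.tablet), sum(self.tablet)
        # Inner node: push the query down, then merge the children's partials.
        partials = [child.execute() for child in self.children]
        return sum(c for c, _ in partials), sum(t for _, t in partials)


def query_average(root: Node) -> Optional[float]:
    """The root turns the merged partial aggregate into the final answer."""
    count, total = root.execute()
    return total / count if count else None


root = Node(children=[
    Node(children=[Node(tablet=[1, 2, 3]), Node(tablet=[4, 5])]),
    Node(children=[Node(tablet=[6]), Node(tablet=[7, 8, 9, 10])]),
])
average = query_average(root)  # 55 / 10
```

Passing (count, total) pairs upward instead of raw rows is what keeps the traffic between levels small; an average cannot be merged directly, but its two components can.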

These are good ideas and have been adopted by other systems. For example, Hive 0.13 has the ORC file format for columnar storage and can use Tez as an execution engine that structures the computation as a directed acyclic graph. Both (along with other innovations) improve Hive's performance considerably. However, a benchmark from Cloudera (the vendor of Impala) and a benchmark by AMPLab show that Impala still has the performance lead over Hive. It is well known that benchmarks are often biased by hardware settings, software tweaks, the queries tested, and so on. It is still worthwhile to find out which design choices and implementation details cause this performance difference, as this may help both communities improve their offerings in the future. What follows is a list of possible reasons:

As a native query engine, Impala avoids the startup overhead of MapReduce/Tez jobs. It is well known that MapReduce programs take some time before all nodes are running at full capacity. In Hive, every query suffers this “cold start” problem. In contrast, Impala daemon processes are started at boot time, and thus are always ready to execute a query.
Hadoop reuses JVM instances to partially reduce the startup overhead. However, this introduces another problem. The nodes in the Cloudera benchmark have 384 GB of memory. Such a big heap is a real challenge for the garbage collector of the reused JVM instances, and stop-the-world GC pauses may add high latency to queries. Impala, on the other hand, benefits from such large memory.
Impala processes are multithreaded. Importantly, the scanning portion of plan fragments is multithreaded on SSD and also makes use of SSE4.2 instructions. The I/O and network systems are highly multithreaded as well. Therefore, each single Impala node runs more efficiently thanks to a high level of local parallelism.
Impala’s query execution is pipelined as much as possible. In the case of aggregation, the coordinator starts the final aggregation as soon as the pre-aggregation fragments have started to return results. In contrast, in MapReduce, sort and reduce can only start once all the mappers are done. Tez does not support pipelined execution yet.
MapReduce materializes all intermediate results. This enables better scalability and fault tolerance, but it also slows down data processing significantly. In contrast, Impala streams intermediate results between executors (at the cost of some scalability, of course). Tez allows different types of Input/Output, including file, TCP, etc., but it seems that Hive doesn’t yet use this feature to avoid unnecessary disk writes.
The reducer of MapReduce employs a pull model to get Map output partitions. For sorted output, Tez makes use of the MapReduce ShuffleHandler, which requires downstream Inputs to pull data over HTTP. With multiple reducers (or downstream Inputs) running simultaneously, it is highly likely that some of them will attempt to read from the same map node at the same time, inducing a large number of disk seeks and slowing the effective disk transfer rate.
Hive’s query expressions are generated at compile time, while Impala does run-time code generation for “big loops” using LLVM, which can produce more optimized code.
Tez allows complete control over the processing, e.g. stopping processing when limits are met. It is very useful for top-k calculation and straggler handling. Unfortunately, this feature is not used by Hive currently. BTW, Dremel calculates approximate results for top-k and count-distinct using one-pass algorithms. It is not clear if Impala does the same.
During query execution, Dremel computes a histogram of tablet processing time. If a tablet takes a disproportionately long time to process, it is rescheduled to another server. If trading accuracy for speed is acceptable, Dremel can return the results before scanning all the data, which may reduce the response time significantly, as a small fraction of the tablets often takes much longer than the rest. It is not clear whether Impala implements a similar mechanism, although straggler handling was stated on the roadmap.
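
The straggler handling described in the last item can be sketched roughly as below. The source only says that Dremel builds a histogram of tablet processing times and reschedules disproportionately slow tablets; the concrete rule used here (flag anything slower than a multiple of the median) and the names find_stragglers, can_return_early, factor, and min_fraction are assumptions made for illustration.

```python
from statistics import median


def find_stragglers(elapsed_seconds, factor=3.0):
    """Return ids of tablets whose elapsed time exceeds factor x the median.

    The median-multiple rule is an assumed policy, not Dremel's actual one.
    """
    typical = median(elapsed_seconds.values())
    return sorted(
        tablet for tablet, seconds in elapsed_seconds.items()
        if seconds > factor * typical
    )


def can_return_early(tablets_done, tablets_total, min_fraction=0.98):
    """Accept a slightly approximate answer once enough tablets are scanned."""
    return tablets_done / tablets_total >= min_fraction


# Hypothetical processing times in seconds, keyed by tablet id.
elapsed = {"t1": 1.1, "t2": 0.9, "t3": 1.0, "t4": 9.5, "t5": 1.2}
stragglers = find_stragglers(elapsed)  # candidates to reschedule on another server
```

In a real engine the flagged tablets would be dispatched to another server while the rest of the query keeps running; here the function only identifies them.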

As you can see, some of these reasons are really about MapReduce or Tez rather than Hive itself. With the continuous improvement of MapReduce and Tez, Hive may avoid these problems in the future. Besides, the last two are features of Dremel, and it is not clear whether Impala implements them.

In summary, Dremel and its derivatives provide an inexpensive way to do interactive big data analytics. The Hadoop ecosystem is now a real threat to traditional relational MPP data warehouse systems. The benchmark by AMPLab shows that Amazon Redshift (based on ParAccel by Actian) still has the performance lead over Impala, but the gap is small. With continuous improvements (e.g. both Hive and Impala are working on cost-based plan optimizers), we can expect SQL on Hadoop/HDFS to reach a higher level in the near future.
