Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarise Big Data and makes querying and analyzing easy.
A little history about Apache Hive will help you understand why it came into existence. When Facebook started gathering data and ingesting it into Hadoop, the data was coming in at the rate of tens of GBs per day back in 2006. Then, in 2007, it grew to 1TB/day and within a few years increased to around 15TBs/day. Initially, Python scripts were written to ingest the data in Oracle databases, but with the increasing data rate and also the diversity in the sources/types of incoming data, this was becoming difficult. The Oracle instances were getting filled pretty fast and it was time to develop a new kind of system that handled large amounts of data. It was Facebook that first built Hive, so that most people who had SQL skills could use the new system with minimal changes, compared to what was required with other RDBMs.
The main features of Hive are:
- It stores schema in a database and processes data into HDFS which is why its named as data warehouse tool.
- It is designed for OLAP.
- It provides an SQL-type language for querying, called HiveQL or HQL.
- It is familiar, fast, scalable and extensible.