Although Apache Hive and Spark SQL perform the same task, retrieving data, each goes about it in a different way. Hive is designed as a convenient SQL-like interface for querying data stored in HDFS, while Spark SQL is Spark's module for working with structured data.
Apache Hive:
Apache Hive is an open source data warehouse system built on top of Hadoop. It helps analyze and query large datasets stored in Hadoop files. Without Hive, we would have to write complex MapReduce jobs; with Hive, we only need to submit SQL-like queries. Hive is mainly targeted at users who are already comfortable with SQL.
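A minimal sketch of submitting a SQL query to Hive instead of writing a MapReduce job. This assumes a running HiveServer2 endpoint on localhost:10000 and a hypothetical `page_views` table, and uses the third-party PyHive client as one possible way to connect; none of these specifics come from the article itself.

```python
# Sketch: query Hive through HiveServer2 with the PyHive client.
# The host/port and the `page_views` table are assumptions for illustration.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# Hive translates this SQL-like query into the underlying execution jobs,
# so we never write MapReduce code ourselves.
cursor.execute(
    "SELECT country, COUNT(*) AS views "
    "FROM page_views "
    "GROUP BY country"
)
for country, views in cursor.fetchall():
    print(country, views)

cursor.close()
conn.close()
```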
Spark SQL:
In Spark, we use Spark SQL for structured data processing. Spark SQL gives Spark more information about the structure of the data and about the computations being performed, and Spark can use this extra information to apply additional optimizations. We can interact with Spark SQL in several ways, such as SQL queries, the DataFrame API, and the Dataset API.
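A small sketch of the two interaction styles mentioned above, using PySpark. The data and column names are made up for illustration; note that in Python the structured API is the DataFrame (the typed Dataset API is available only in Scala and Java).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# Build a tiny DataFrame with illustrative data.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)], ["name", "age"]
)

# Register it as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")

# The same query expressed through SQL and through the DataFrame API;
# both go through the same optimizer.
spark.sql("SELECT name FROM people WHERE age > 40").show()
people.filter(people.age > 40).select("name").show()

spark.stop()
```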
Usage
Apache Hive:
- Schema flexibility and evolution.
- Tables in Apache Hive can be partitioned and bucketed (as sketched after this list).
- JDBC/ODBC drivers are available, so Hive can be used from standard client and BI tools.
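A sketch of what the partitioning and bucketing mentioned above looks like as Hive DDL, submitted over the same kind of PyHive connection as before. The table, columns, partition key, and bucket count are all hypothetical.

```python
# Sketch: create a partitioned, bucketed Hive table via HiveServer2.
# Table name, columns, and bucket count are assumptions for illustration.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# Partition by day (one directory per dt value) and hash-bucket rows
# by user_id into 16 buckets, stored as ORC files.
cursor.execute(
    "CREATE TABLE IF NOT EXISTS page_views ("
    "  user_id BIGINT,"
    "  url STRING"
    ") "
    "PARTITIONED BY (dt STRING) "
    "CLUSTERED BY (user_id) INTO 16 BUCKETS "
    "STORED AS ORC"
)

cursor.close()
conn.close()
```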
Spark SQL:
- It performs SQL queries on structured data.
- Through Spark SQL, it is possible to read data from an existing Hive installation.
- When Spark SQL is used from another programming language, the result comes back as a Dataset/DataFrame (see the sketch after this list).
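A minimal sketch of the last two points: querying an existing Hive installation from Spark SQL and getting the result back as a DataFrame. It assumes Spark was built with Hive support and can reach the Hive metastore, and the `page_views` table is again hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-sql-on-hive")
    .enableHiveSupport()   # connect to the existing Hive metastore
    .getOrCreate()
)

# spark.sql returns a DataFrame, which can be transformed further
# with the DataFrame API or collected to the driver.
views_by_day = spark.sql(
    "SELECT dt, COUNT(*) AS views FROM page_views GROUP BY dt"
)
views_by_day.show()

spark.stop()
```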
Limitations
Apache Hive:
- It does not offer real-time queries or row-level updates.
- It provides acceptable, but not optimal, latency for interactive data browsing.
- Hive does not support online transaction processing (OLTP).
- The latency of Hive queries is generally very high.
Spark SQL:
- It does not support the union type.
- No error is raised for oversized values of the varchar type.
- It does not support transactional tables.
- It does not support the char type.
- It does not support timestamps in Avro tables.
Conclusion
Hence, we cannot say that Spark SQL is a replacement for Hive, nor the other way around. We have seen that Spark SQL is more Spark-API and developer friendly, and that SQL makes programming in Spark easier, while Hive's ability to switch execution engines makes it efficient for querying huge datasets. In the end, which one to use depends entirely on our goals. Apart from that, we have discussed the usage as well as the limitations of both above.