Microsoft Certified Azure Data Engineer Assoc ...
- 14k Enrolled Learners
- Weekend
- Live Class
Looking out for Apache Hive Interview Questions that are frequently asked by employers? Here is the blog on Apache Hive interview questions in Hadoop Interview Questions series. I hope you must not have missed the earlier blogs of our Hadoop Interview Question series.
After going through this Apache Hive interview questions blog, you will get an in-depth knowledge of questions that are frequently asked by employers in Hadoop interviews related to Apache Hive. To learn each and every nuance of Hive & Hadoop Framework you can take a look at our Hadoop online course.
In case you have attended any Hadoop interview previously, we encourage you to add the Apache Hive questions which you came across here in the comments tab. We will be happy to answer them, and spread the word to the community of fellow job seekers.
Apache Hive is a data warehouse system built on top of Hadoop and is used for analyzing structured and semi-structured data. It provides a mechanism to project structure onto the data and perform queries written in HQL (Hive Query Language) that are similar to SQL statements. Internally, these queries or HQL gets converted to map reduce jobs by the Hive compiler.
Today, many companies consider Apache Hive as a de facto to perform analytics on large data sets. Also, since it supports SQL like query statements, it is very much popular among people who are from a non – programming background and wish to take advantage of Hadoop MapReduce framework.
Now, let us have a look at the rising Apache Hive job trends over the past few years:
Source: indeed.com
The above image clearly shows the vast demand for Apache Hive professionals in the industry. Therefore, it is high time to prepare yourself and seize this very opportunity.
I would suggest you to go through a dedicated blog on Apache Hive Tutorial to revise your concepts before proceeding in this Apache Hive Interview Questions blog.
Here is the comprehensive list of the most frequently asked Apache Hive Interview Questions that have been framed after deep research and discussion with the industry experts.
HBase | Hive |
1. HBase is built on the top of HDFS | 1. It is a data warehousing infrastructure |
2. HBase operations run in a real-time on its database rather | 2. Hive queries are executed as MapReduce jobs internally |
3. Provides low latency to single rows from huge datasets | 3. Provides high latency for huge datasets |
4. Provides random access to data | 4. Provides random access to data |
Hive supports all those client applications that are written in:
by exposing its Thrift server.
By default, the Hive table is stored in an HDFS directory – /user/hive/warehouse. One can change it by specifying the desired directory in hive.metastore.warehouse.dir configuration parameter present in the hive-site.xml.
Metastore in Hive stores the meta data information using RDBMS and an open source ORM (Object Relational Model) layer called Data Nucleus which converts the object representation into relational schema and vice versa.
Hive stores metadata information in the metastore using RDBMS instead of HDFS. The reason for choosing RDBMS is to achieve low latency as HDFS read/write operations are time consuming processes.
Local Metastore:
In local metastore configuration, the metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
Remote Metastore:
In the remote metastore configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM. Other processes communicate with the metastore server using Thrift Network APIs. You can have one or more metastore servers in this case to provide more availability.
By default, Hive provides an embedded Derby database instance backed by the local disk for the metastore. This is called the embedded metastore configuration.
Suppose I have installed Apache Hive on top of my Hadoop cluster using default metastore configuration. Then, what will happen if we have multiple clients trying to access Hive at the same time?
The default metastore configuration allows only one Hive session to be opened at a time for accessing the metastore. Therefore, if multiple clients try to access the metastore at the same time, they will get an error. One has to use a standalone metastore, i.e. Local or remote metastore configuration in Apache Hive for allowing access to multiple clients concurrently.
Following are the steps to configure MySQL database as the local metastore in Apache Hive:
Here is the key difference between an external table and managed table:
Note: I would suggest you to go through the blog on Hive Tutorial to learn more about Managed Table and External Table in Hive.
Yes, it is possible to change the default location of a managed table. It can be achieved by using the clause – LOCATION ‘<hdfs_path>’.
We should use SORT BY instead of ORDER BY when we have to sort huge datasets because SORT BY clause sorts the data using multiple reducers whereas ORDER BY sorts all of the data together using a single reducer. Therefore, using ORDER BY against a large number of inputs will take a lot of time to execute.
Hive organizes tables into partitions for grouping similar type of data together based on a column or partition key. Each Table can have one or more partition keys to identify a particular partition. Physically, a partition is nothing but a sub-directory in the table directory.
Partitioning provides granularity in a Hive table and therefore, reduces the query latency by scanning only relevant partitioned data instead of the whole data set.
For example, we can partition a transaction log of an e – commerce website based on month like Jan, February, etc. So, any analytics regarding a particular month, say Jan, will have to scan the Jan partition (sub – directory) only instead of the whole table data.
In dynamic partitioning values for partition columns are known in the runtime, i.e. It is known during loading of the data into a Hive table.
One may use dynamic partition in following two cases:
Suppose, I create a table that contains details of all the transactions done by the customers of year 2016: CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ ;
Now, after inserting 50,000 tuples in this table, I want to know the total revenue generated for each month. But, Hive is taking too much time in processing this query. How will you solve this problem and list the steps that I will be taking in order to do so?
We can solve this problem of query latency by partitioning the table according to each month. So, for each month we will be scanning only the partitioned data instead of whole data sets.
As we know, we can’t partition an existing non-partitioned table directly. So, we will be taking following steps to solve the very problem:
CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ ;
2. Enable dynamic partitioning in Hive:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
3. Transfer the data from the non – partitioned table into the newly created partitioned table:
INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id, amount, country, month FROM transaction_details;
Now, we can perform the query using each partition and therefore, decrease the query time.
For adding a new partition in the above table partitioned_transaction, we will issue the command give below:
ALTER TABLE partitioned_transaction ADD PARTITION (month=’Dec’) LOCATION ‘/partitioned_transaction’;
Note: I suggest you to go through the dedicated blog on Hive Commands where all the commands present in Apache Hive have been explained with an example.
The best way to become a Data Engineer is by getting the Data Engineering Certification in Delhi.
By default the number of maximum partition that can be created by a mapper or reducer is set to 100. One can change it by issuing the following command:
SET hive.exec.max.dynamic.partitions.pernode = <value>
Note: You can set the total number of dynamic partitions that can be created by one statement by using: SET hive.exec.max.dynamic.partitions = <value>
I am inserting data into a table based on partitions dynamically. But, I received an error – FAILED ERROR IN SEMANTIC ANALYSIS: Dynamic partition strict mode requires at least one static partition column. How will you remove this error?
To remove this error one has to execute following commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Things to Remember:
There are two main reasons for performing bucketing to a partition:
Hive determines the bucket number for a row by using the formula: hash_function (bucketing_column) modulo (num_of_buckets). Here, hash_function depends on the column data type. For integer data type, the hash_function will be:
hash_function (int_type_column)= value of int_type_column
The command: ‘SET hive.enforce.bucketing=true;’ allows one to have the correct number of reducer while using ‘CLUSTER BY’ clause for bucketing a column. In case it’s not done, one may find the number of files that will be generated in the table directory to be not equal to the number of buckets. As an alternative, one may also set the number of reducer equal to the number of buckets by using set mapred.reduce.task = num_bucket.
One of the Hive query optimization methods is Hive index. Hive index is used to speed up the access of a column or set of columns in a Hive database because with the use of index the database system does not need to read all rows in the table to find the data that one has selected.
Suppose, I have a CSV file – ‘sample.csv’ present in ‘/temp’ directory with the following entries:
id first_name last_name email gender ip_address
1 Hugh Jackman hughjackman@cam.ac.uk Male 136.90.241.52
2 David Lawrence dlawrence1@gmail.com Male 101.177.15.130
3 Andy Hall andyhall2@yahoo.com Female 114.123.153.64
4 Samuel Jackson samjackson231@sun.com Male 89.60.227.31
5 Emily Rose rose.emily4@surveymonkey.com Female 119.92.21.19
How will you consume this CSV file into the Hive warehouse using built SerDe?
SerDe stands for serializer/deserializer. A SerDe allows us to convert the unstructured bytes into a record that we can process using Hive. SerDes are implemented using Java. Hive comes with several built-in SerDes and many other third-party SerDes are also available.
Hive provides a specific SerDe for working with CSV files. We can use this SerDe for the sample.csv by issuing following commands:
CREATE EXTERNAL TABLE sample
(id int, first_name string,
last_name string, email string,
gender string, ip_address string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.OpenCSVSerde’
STORED AS TEXTFILE LOCATION ‘/temp’;
Now, we can perform any query on the table ‘sample’:
SELECT first_name FROM sample WHERE gender = ‘male’;
Suppose, I have a lot of small CSV files present in /input directory in HDFS and I want to create a single Hive table corresponding to these files. The data in these files are in the format: {id, name, e-mail, country}. Now, as we know, Hadoop performance degrades when we use lots of small files.
So, how will you solve this problem where we want to create a single Hive table for lots of small files without degrading the performance of the system?
You can get a better understanding with the Azure Data Engineer Certification.
One can use the SequenceFile format which will group these small files together to form a single sequence file. The steps that will be followed in doing so are as follows:
CREATE TABLE temp_table (id INT, name STRING, e-mail STRING, country STRING)
ROW FORMAT FIELDS DELIMITED TERMINATED BY ‘,’ STORED AS TEXTFILE;
LOAD DATA INPATH ‘/input’ INTO TABLE temp_table;
CREATE TABLE sample_seqfile (id INT, name STRING, e-mail STRING, country STRING)
ROW FORMAT FIELDS DELIMITED TERMINATED BY ‘,’ STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE sample SELECT * FROM temp_table;
Hence, a single SequenceFile is generated which contains the data present in all of the input files and therefore, the problem of having lots of small files is finally eliminated.
I hope you find this blog on Apache Hive Interview Questions to be informative and helpful. You are welcome to check out our other interview question blogs as well that covers all the components present in Hadoop framework. Kindly, refer to the links given below and enjoy the reading:
Got a question for us? Please mention it in the comments section of this Apache Hive Interview Questions and we will get back to you.
edureka.co
How can we make fume high available ?
How will you do the sentinment analysis by using Hive instead MapReducer
Hi Team,
I am posting below question which I faced in interview. Can you please provide answer to the same.
Question: Why Hive store metadata information in RDBMS? Can Hbase be used to store Hive metadata information? Please explain answer with valid reasons.
Hive stores metadata information in RDBMS because it is based on tabular abstraction of objects in HDFS which means all file names and directory paths are contained in a table.
Hi Team,
Recently i attended one interview .i posted the question here.please provide me the answers.
1.How to recover the hive table if we deleted by mistake.?
2.how to pass argument to hive from shell? and from hive to shell?
1) In case of internal/ managed tables you can recover the data from .TRASH derectory(Same as recycle bin in Windows), metadata needs to created. In case of External table the data is not deleted and you can again point to same data from that external location, Metadata need to be created again.
2 question answer
—
hive -e “select * from table name” //pass argument to hive from shell (use hive -e ,then any sql query )
! Mkdir //from hive to shell (use exclamation mark and then any commands )
why did we create a temp table before creating a table to store the data in seqFile format? why not directly create a table to store in seqFile format rather than overwriting?
Thanks in advance
If we directly insert data from the csv files into sequence files then number of inserts suppose x will be equal to number of csv files y. For Ex: 10 csv files we will need to insert 10 times sequentially into the Final table and the number of sequence file will be created will also be 10 (That’s of no use). So to avoid this repeating inserts we first collect all the csv data into a temp table and then finally copy the data into sample_seqfile table, stored as sequence file format.
Thanks