What is the difference between partitioning and bucketing a table in Hive?


Hi Team,

I am new to Hive and a bit confused about partitioning versus bucketing a table. Can anyone explain the difference between the two?

Dec 20, 2020 in Big Data Hadoop by akhtar


Partitioning is often used to distribute load horizontally. It has performance benefits and helps organize data logically. For example, suppose we have a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department. For faster query response, the Hive table can be declared with PARTITIONED BY (country STRING, dept STRING). Partitioning changes how Hive structures the data storage: Hive creates subdirectories that reflect the partitioning structure, such as

.../employees/country=ABC/dept=XYZ

If a query filters for employees from country=ABC, Hive scans only the contents of the country=ABC directory. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning is very useful in Hive; however, a design that creates too many partitions may optimize some queries while being detrimental to other important ones. Another drawback of too many partitions is the large number of Hadoop files and directories created unnecessarily, which adds overhead on the NameNode because it must keep all file system metadata in memory.
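To make the directory idea concrete, here is a small Python sketch that models the layout, not Hive itself. The warehouse path, the sample rows, and the helper names (partition_path, load, prune) are all made up for illustration. It groups rows by partition directory and then shows which directories a filter on country would need to read.

```python
from collections import defaultdict

def partition_path(table_dir, country, dept):
    """Build the subdirectory name for one (country, dept) partition."""
    return f"{table_dir}/country={country}/dept={dept}"

def load(rows, table_dir="/warehouse/employees"):
    """Group rows by partition directory (table_dir is a hypothetical path)."""
    layout = defaultdict(list)
    for row in rows:
        path = partition_path(table_dir, row["country"], row["dept"])
        layout[path].append(row)
    return layout

def prune(layout, country):
    """Keep only the directories a filter on country would have to read."""
    marker = f"/country={country}/"
    return {path: rows for path, rows in layout.items() if marker in path}

# Hypothetical sample data
rows = [
    {"name": "Ann", "country": "US", "dept": "HR"},
    {"name": "Bob", "country": "US", "dept": "ENG"},
    {"name": "Chen", "country": "CN", "dept": "ENG"},
]

layout = load(rows)
scanned = prune(layout, "US")
for path in sorted(scanned):
    print(path)
```

Every distinct (country, dept) pair gets its own directory, which is also why a high-cardinality partition column quickly produces a very large number of small directories.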

Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table that uses a date as the top-level partition and employee_id as the second-level partition ends up with too many small partitions. If we instead bucket the employee table and use employee_id as the bucketing column, the value of this column is hashed into a user-defined number of buckets. Records with the same employee_id are always stored in the same bucket. Assuming the number of distinct employee_id values is much greater than the number of buckets, each bucket holds many employee_id values. When creating the table you can specify CLUSTERED BY (employee_id) INTO XX BUCKETS, where XX is the number of buckets. Bucketing has several advantages. The number of buckets is fixed, so it does not fluctuate with the data. Buckets are well suited to sampling: if two tables are bucketed by employee_id, Hive can create a logically correct sample, because the same employee_id hashes the same way in both tables. Bucketing also helps Hive perform efficient map-side joins.
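The bucket assignment can be sketched in Python as well. This is a simplified model: it uses employee_id modulo the bucket count as a stand-in for Hive's hash function, and NUM_BUCKETS = 4 and the id range are arbitrary values chosen for the example.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical; plays the role of XX in INTO XX BUCKETS

def bucket_for(employee_id, num_buckets=NUM_BUCKETS):
    """Simplified stand-in for Hive's hashing: integer id modulo bucket count."""
    return employee_id % num_buckets

def bucketize(employee_ids, num_buckets=NUM_BUCKETS):
    """Assign each employee_id to a bucket; the bucket count never changes."""
    buckets = defaultdict(set)
    for eid in employee_ids:
        buckets[bucket_for(eid, num_buckets)].add(eid)
    return buckets

# Twenty hypothetical employee ids spread over four buckets
buckets = bucketize(range(1000, 1020))
for b in sorted(buckets):
    print(b, sorted(buckets[b]))
```

However many employee ids arrive, the number of buckets stays at NUM_BUCKETS, and a given id always maps to the same bucket. That stable mapping is what sampling and map-side joins on the bucketing column rely on.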

answered Aug 11, 2021 by Kirtesh Thakre

edited Mar 5
