Hive Data Models: Exploring the Power of Data Manipulation

Partitions:

Partitioning means dividing a table into coarse-grained parts based on the value of a partition column, such as 'date'. This makes queries on slices of the data faster, because only the relevant partitions need to be read.

So, what is the function of a partition? The partition keys determine how the data is stored: each unique value of the partition key defines one partition of the table, and Hive keeps each partition in its own subdirectory. Partitions are often named after dates for convenience. The idea is loosely similar to 'Block Splitting' in HDFS.
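As a minimal sketch of the idea above, the HiveQL below creates a table partitioned by date and then queries a single slice. The table name `page_views`, its columns, the partition column `dt` and the file path are all hypothetical, chosen only for illustration.

```sql
-- Hypothetical table, partitioned by a string date column named dt.
-- The partition column appears only in PARTITIONED BY, not in the column list.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load a local file into one partition (the path is a placeholder).
LOAD DATA LOCAL INPATH '/tmp/page_views_2017-01-15.csv'
INTO TABLE page_views PARTITION (dt = '2017-01-15');

-- The filter on dt lets Hive read only the matching partition directory.
SELECT COUNT(*) FROM page_views WHERE dt = '2017-01-15';

-- List the partitions that currently exist for the table.
SHOW PARTITIONS page_views;
```

Each distinct `dt` value ends up as a subdirectory such as `dt=2017-01-15` under the table's directory, which is why a filter on the partition column can skip the rest of the data.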

Buckets:

Buckets give extra structure to the data that may be used for more efficient queries. A join of two tables that are bucketed on the same columns, including the join column, can be implemented as a Map-Side Join. Bucketing by user ID means we can quickly evaluate a user-based query by running it on a randomized sample of the total set of users.
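The following is a minimal sketch of bucketing by user ID and sampling one bucket, assuming a hypothetical source table `users_raw` with `user_id` and `name` columns. The table names and the bucket count of 32 are illustrative choices, not values from this post.

```sql
-- Hypothetical bucketed table: rows are assigned to one of 32 buckets
-- by hashing user_id.
CREATE TABLE users_bucketed (
  user_id BIGINT,
  name    STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS TEXTFILE;

-- Note: on Hive versions before 2.0, run
--   SET hive.enforce.bucketing = true;
-- first, so that the insert below honors the bucket definition.
INSERT OVERWRITE TABLE users_bucketed
SELECT user_id, name FROM users_raw;  -- users_raw is an assumed staging table

-- Evaluate a query on a sample of users by reading a single bucket.
SELECT * FROM users_bucketed
TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
```

Because the sample is taken on the same column the table is clustered by, Hive can read just one bucket file instead of scanning the whole table. For joins between two tables bucketed on the join column, the `hive.optimize.bucketmapjoin` setting is the usual switch for enabling the map-side variant.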

Got a question for us? Please mention it in the comments section and we will get back to you.

Comments

Hareesh@Disqus says:
Jan 15, 2017 at 3:01 pm GMT
Consider the data for the year 2000.
test.csv
country_code,product_code,rpt_period
us,crd,2000
us,pcl,2000
us,mtg,2000
in,crd,2000
in,pcl,2000
in,mtg,2000
Now I am appending newly generated 2001 records to test.csv. After appending the new data, my file looks like this:
append.csv
country_code,product_code,rpt_period
us,crd,2000
us,pcl,2000
us,mtg,2000
in,crd,2000
in,pcl,2000
in,mtg,2000
us,crd,2001
us,pcl,2001
us,mtg,2001
in,crd,2001
in,pcl,2001
in,mtg,2001
Are the scenarios below possible in Hive? If yes, please answer these questions.
1. How do I create a partitioned table schema for this data? I want country_code and product_code as the partition columns.
2. For instance, I want to load only the year 2000 records from test.csv into table Foo. How do I load them?
3. How do I load only the 2001 records from append.csv into table Foo?
Thanks.
- EdurekaSupport says:
  Jan 20, 2017 at 12:43 pm GMT
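  Hey Hareesh, thanks for checking out our blog. Here's the explanation to your query:
  1. How to create a partitioned table schema for this data, with country_code and product_code as the partition columns?
  First create a non-partitioned staging table and load the file into it:
  create table Foo1(country_code STRING, product_code STRING, rpt_period INT)
  row format delimited
  fields terminated by ','
  stored as textfile;
  load data local inpath '/home/cloudera/filename' into table Foo1;
  Then create the partitioned table. Partition columns are declared only in the partitioned by clause, not in the regular column list:
  create table txnrecsByCatstate(rpt_period INT)
  partitioned by (country_code STRING, product_code STRING)
  row format delimited
  fields terminated by ','
  stored as textfile;
  2. For instance, I want to load only the year 2000 records from test.csv into table Foo. How do I load them?
  Enable dynamic partitioning, then insert only that part of the data from Foo1 into txnrecsByCatstate, selecting the partition columns last:
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  INSERT OVERWRITE TABLE txnrecsByCatstate PARTITION (country_code, product_code) SELECT rpt_period, country_code, product_code FROM Foo1 WHERE rpt_period = 2000;
  3. How to load append.csv (only 2001 records) to table Foo?
  Hive can append data. INSERT INTO adds rows to what is already in the table, whereas INSERT OVERWRITE replaces it. Row-level UPDATE is a separate feature that needs a transactional (ACID) table.
  Reload the staging table from append.csv, then insert only the 2001 records:
  load data local inpath '/home/cloudera/filename' overwrite into table Foo1;
  INSERT INTO TABLE txnrecsByCatstate PARTITION (country_code, product_code) SELECT rpt_period, country_code, product_code FROM Foo1 WHERE rpt_period = 2001;
  Hope this helps. Cheers!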
