Output Splitting problem in Hadoop


I ran the following script with two files as input, and the output was split into two files, part-m-00000 and part-m-00001. I don't understand why; please help. Note: each input file is only 8.2 MB.

REGISTER PIG/PigUDF.jar;

A = LOAD 'PIG/HealthCare/Input/healthcare_Sample_dataset1.csv' USING PigStorage(',') AS (patientID:int, name:chararray, date:chararray, phoneNumber:chararray, eMail:chararray, SSN:chararray, gender:chararray, disease:chararray, age:chararray);

B = LOAD 'PIG/HealthCare/Input/healthcare_Sample_dataset2.csv' USING PigStorage(',') AS (patientID:int, name:chararray, date:chararray, phoneNumber:chararray, eMail:chararray, SSN:chararray, gender:chararray, disease:chararray, age:chararray);

C = UNION A, B;

D = FOREACH C GENERATE patientID, com.kamran.pig.udf.encryptField(name,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(date,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(phoneNumber,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(eMail,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(SSN,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(gender,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(disease,'12345678abcdefgh'), age;

STORE D INTO 'PIG/HealthCare/Output/HealthCareOutput.csv';
Jul 16, 2019 in Big Data Hadoop by Rasheed


The number of part files in the output matches the number of map tasks that wrote them. When you load two different files, each file is read by at least one separate map task, because input splits never span multiple files. Since this script is map-only (UNION and FOREACH don't require a reduce phase), each mapper writes its own part-m file, so two input files produce two part files regardless of how small they are.
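As a side note (my own addition, not part of the original answer, and version-dependent), Pig can combine small input splits into fewer map tasks. This behavior is controlled by two properties that can be set at the top of the script:

```pig
-- Split combination is on by default since Pig 0.8; combined splits
-- are capped at pig.maxCombinedSplitSize bytes.
SET pig.splitCombination true;
SET pig.maxCombinedSplitSize 134217728; -- 128 MB; two 8.2 MB files could share one mapper
```

Whether splits from two separate LOAD statements actually get combined depends on the Pig version and the loader, so treat this as something to experiment with.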

You can verify this by loading a single small file into Pig and processing it; that should produce a single part file in the output.

Refer below:

A = load 'weatherPIG.txt' using TextLoader as (date:chararray);

AF = foreach A generate TRIM(SUBSTRING(date, 6, 14)), TRIM(SUBSTRING(date, 46, 53)), TRIM(SUBSTRING(date, 38, 45));

store AF into 'pigudf32' using PigStorage(',');

Check the pigudf32 output directory; it should contain a single part file.
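If a single output file is actually required, one common approach (not part of the original answer, so treat it as a sketch) is to force a reduce phase with a parallelism of 1 before the STORE, for example with an ORDER BY; alternatively, the part files can be merged afterwards with `hadoop fs -getmerge`:

```pig
-- An ORDER BY introduces a reduce phase; PARALLEL 1 forces a single
-- reducer, so STORE writes exactly one part-r-00000 file.
E = ORDER D BY patientID PARALLEL 1;
STORE E INTO 'PIG/HealthCare/Output/SingleFileOutput';
```

Note that this sorts the data and funnels everything through one reducer, so it is only sensible for small outputs.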

answered Jul 16, 2019 by Sayni
