Pig Programming: Apache Pig Script with UDF in HDFS Mode

Last updated on May 22,2019 17.3K Views

Pig Programming: Apache Pig Script with UDF in HDFS Mode

edureka.co

In the previous blog posts we saw how to start with Pig Programming and Scripting.  We have seen the steps to write a Pig Script in HDFS Mode and Pig Script Local Mode without UDF. In the third part of this series we will review the steps to write a Pig script with UDF in HDFS Mode.

We have explained how to implement Pig UDF by creating built-in functions to explain the functionality of Pig built-in function. For better explanation, we have taken two built-in functions. We have done this with the help of a pig script.

Here, we have taken one example and we have used both the UDF (user defined functions) i.e. making a string in upper case and taking a value & raising its power.

The dataset is depicted below which we are going to use in this example:

Our aim is to make 1st column letter in upper case and raising the power of the 2nd column with the value of 3rd column.

Let’s start with writing the java code for each UDF. Also we have to configure 4 JARs in our java project to avoid the compilation errors.
First, we will create java programs, both are given below:

Upper.java

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

@SuppressWarnings("deprecation")
public class Upper extends EvalFunc<String> {

public String exec(Tuple input) throws IOException {

if (input == null || input.size() == 0)

return null;

try {

String str = (String) input.get(0);

str=str.toUpperCase();

return str;

}
catch (Exception e) {

throw WrappedIOException.wrap("Caught exception processing input row ", e);

}
}
}

Power.java

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;

public class Pow extends EvalFunc<Long> {

public Long exec(Tuple input) throws IOException {
try {

int base = (Integer)input.get(0);
int exponent = (Integer)input.get(1);
long result = 1;

/* Probably not the most efficient method...*/
for (int i = 0; i < exponent; i++) {
long preresult = result;
result *= base;
if (preresult > result) {
// We overflowed. Give a warning, but do not throw an
// exception.
warn("Overflow!", PigWarning.TOO_LARGE_FOR_INT);
// Returning null will indicate to Pig that we failed but
// we want to continue execution.
return null;
}
}
return result;
} catch (Exception e) {
// Throwing an exception will cause the task to fail.
throw new IOException("Something bad happened!", e);
}
}
}

To remove compilation errors, we have to configure 4 JARs in our java project.


Click on the Download button to download the JARs

[buttonleads form_title=”Download Code” redirect_url=https://edureka.wistia.com/medias/wtboe1hmkr/download?media_file_id=76900193 course_id=166 button_text=”Download JARs”]

Now, we export JAR files for both the java codes. Please check the below steps for JAR creation.

Here, we have shown for one program, proceed in the same way in the next program as well.

After creating the JARs and text files, we have moved all the data to HDFS cluster, which is depicted by the following images:

In our dataset, fields are comma (,) separated.

After moving the file, we have created script with .pig extension and put all the commands in that script file.

Now in terminal, type PIG followed by the name of the script file which is shown in the following image:

Here, this is the output for running the pig script.

Got a question for us? Please mention them in the comments section and we will get back to you. 

Related Posts:

Steps to create UDF in Apache Pig

Introduction to Apache Hive

Get Started with Big Data and Hadoop

BROWSE COURSES