Looking for Apache Pig interview questions that employers frequently ask? This is the fifth blog in our Hadoop Interview Questions series, and it covers Apache Pig interview questions. The list has been carefully put together after much research and under the strict guidance of certified Big Data Hadoop experts who have been working actively in the industry for several years. I hope you have not missed the earlier blogs in our Hadoop Interview Questions series.
After going through these Pig interview questions, you will have a solid grasp of the questions that employers frequently ask in Hadoop interviews.
In case you have attended Pig interviews previously, we encourage you to add your questions in the comments tab. We will be happy to answer them, and spread the word to the community of fellow job seekers.
♦ Apache Pig is a platform used to analyze large data sets by representing them as data flows. It is designed to provide an abstraction over MapReduce, reducing the complexity of writing a MapReduce task in Java. We can perform data manipulation operations very easily in Hadoop using Apache Pig. From this Big Data Course, you will learn more about Pig, Hive, Flume, etc.
♦ Apache Pig has two main components – the Pig Latin language and the Pig Run-time Environment, in which Pig Latin programs are executed.
♦ Apache Pig follows the ETL (Extract, Transform, Load) process. It can handle inconsistent schemas (as in the case of unstructured data).
♦ Apache Pig automatically optimizes the tasks before execution, i.e. automatic optimization. Apache Pig handles all kinds of data.
♦ Pig allows programmers to write custom functions for functionality that is unavailable in Pig. User Defined Functions (UDFs) can be written in different languages like Java, Python, Ruby, etc. and embedded in a Pig script.
♦ Pig Latin provides various built-in operators like join, sort, filter, etc. to read, write, and process large data sets.
♣ Tip: Before going through these Apache Pig interview questions, I would suggest you go through the Apache Pig Tutorial to revise your Pig concepts.
Now moving on, let us look at the Apache Pig interview questions.
♣ Tip: In this question, you should explain the problems with MapReduce that led to the development of Apache Pig by Yahoo.
| MapReduce | Apache Pig |
| --- | --- |
| 1. It is a low-level data processing paradigm | 1. It is a high-level data flow platform |
| 2. Complex Java implementations | 2. No complex Java implementations |
| 3. Does not provide nested data types | 3. Provides nested data types like tuples, bags, and maps |
| 4. Performing data operations is a humongous task | 4. Provides many built-in operators to support data operations |
Apache Pig is used for analyzing and performing tasks involving ad-hoc processing. Apache Pig is used for:
♣ Tip: Approach this question by explaining when the logical and physical plans are created.
A Pig Latin script goes through several steps when the compiler converts it into MapReduce jobs. Logical and physical plans are created during the execution of a Pig script.
After performing the basic parsing and semantic checking, the parser produces a logical plan. No data processing takes place during the creation of a logical plan. The logical plan describes the logical operators that Pig has to execute. For each line in the Pig script, a syntax check is performed for the operators and a logical plan is created. If an error is encountered, an exception is thrown and the program execution ends.
A logical plan contains a collection of operators in the script, but does not contain the edges between the operators.
After the logical plan is generated, script execution moves to the physical plan, which describes the physical operators Apache Pig will use to execute the script. A physical plan is like a series of MapReduce jobs, but it does not contain any reference to how it will be executed in MapReduce.
Pig is a high-level platform that makes many Hadoop data analysis tasks easier to execute. Pig Latin is a data flow language, and a program written in it needs an execution engine to run. So, when a program is written in Pig Latin, the Pig compiler converts it into MapReduce jobs.
The components of Apache Pig Execution Environment are:
There are three ways to execute the Pig script:
Pig Latin can handle both atomic data types like int, float, long, double etc. and complex data types like tuple, bag and map.
Atomic or scalar data types are the basic data types that hold a single value. In Pig Latin these include int, long, float, double, chararray (a string), and bytearray (a byte array). They are also called primitive data types.
The complex data types supported by Pig Latin are:
♣ Tip: Complex Data Types of Pig Latin are very important to understand, so you can go through Apache Pig Tutorial blog and understand them in-depth.
A bag is one of the data models present in Pig. It is an unordered collection of tuples with possible duplicates. Bags are used to store collections of tuples while grouping. A bag does not have to fit completely into memory: when it grows too large, Pig spills it to the local disk and keeps only part of it in memory. The size of a bag is therefore limited by the size of the local disk. We represent bags with “{}”.
♣ Tip: You can also explain the two types of bag in Pig Latin, i.e. outer bag and inner bag, which may impress your employers.
An outer bag, or relation, is nothing but a bag of tuples. Relations here are similar to relations in relational databases. For example:
{(Linkin Park, California), (Metallica, Los Angeles), (Mega Death, Los Angeles)}
An inner bag contains a bag inside a tuple. For Example:
(Los Angeles, {(Metallica, Los Angeles), (Mega Death, Los Angeles)})
(California, {(Linkin Park, California)})
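The inner bags in this example are what grouping the outer bag by its second field produces. Here is a minimal plain-Python sketch of that relationship. It is not Pig code: the list-of-tuples representation and the helper name `group_by_field` are assumptions made purely for illustration.

```python
from collections import defaultdict

# Outer bag (relation): a collection of tuples, duplicates allowed.
# A Python list stands in for the bag; Pig itself guarantees no ordering.
bands = [
    ("Linkin Park", "California"),
    ("Metallica", "Los Angeles"),
    ("Mega Death", "Los Angeles"),
]


def group_by_field(relation, index):
    """Model of grouping: one (key, inner_bag) tuple per distinct key."""
    groups = defaultdict(list)
    for record in relation:
        groups[record[index]].append(record)
    return [(key, inner_bag) for key, inner_bag in groups.items()]


grouped = group_by_field(bands, 1)
for key, inner_bag in grouped:
    # Each output tuple holds the key plus an inner bag of matching tuples.
    print(key, inner_bag)
```

Each element of `grouped` is a tuple whose second field is itself a bag, which is exactly the "bag inside a tuple" shape described above.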
♣ Tip: Apache Pig deals with both schema and schema-less data. Thus, this is an important question to focus on.
Apache Pig handles both data with a schema and schema-less data.
Using Grunt i.e. Apache Pig’s interactive shell, users can interact with HDFS or the local file system.
To start Grunt in local mode, use the command pig -x local. This command opens the Grunt shell prompt. To exit from the Grunt shell, press CTRL+D or type quit.
If some functionality is unavailable in the built-in operators, we can programmatically create User Defined Functions (UDFs) in other languages like Java, Python, Ruby, etc. and embed them in the Pig Latin script file.
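As a rough illustration, a Python UDF might look like the sketch below. The function name `to_upper`, the field name `name`, the file name `my_udfs.py`, and the namespace `myfuncs` are all hypothetical. The `outputSchema` decorator is the mechanism Pig's Python UDF support uses to declare a return schema; the fallback definition here is an assumption added only so the file can also be run and tested outside Pig.

```python
try:
    # Available when Pig runs this file as a Python UDF script.
    from pig_util import outputSchema
except ImportError:
    # Stand-in (assumption) so the file also runs outside Pig:
    # it just records the schema string on the function.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("name:chararray")
def to_upper(name):
    """Return the upper-cased input, or None for a null field."""
    if name is None:
        return None
    return name.upper()


print(to_upper("metallica"))
```

In a Pig script, a file like this would typically be registered with a statement along the lines of `REGISTER 'my_udfs.py' USING jython AS myfuncs;` and then called as `myfuncs.to_upper(name)` inside a `FOREACH ... GENERATE`.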
♣ Tip: To understand how to create and work with UDF, go through this blog – creating UDF in Apache Pig.
♣ Tip: Important points about UDF to focus on:
Pig supports a number of diagnostic operators that you can use to debug Pig scripts.
♣ Tip: Go through this blog on diagnostic operators, to understand them and see their implementations.
No, ILLUSTRATE does not launch any MapReduce job; it works on a small internal sample of the data. It shows the output of each stage of the script rather than only the final output.
ILLUSTRATE operator is used to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE command is your best friend when it comes to debugging a script. This command alone might be a good reason for choosing Pig over something else.
Syntax: illustrate relation_name;
Executing Pig scripts on large data sets usually takes a long time. To tackle this, developers run Pig scripts on sample data, but there is a possibility that the selected sample will not exercise the script properly. For instance, if the script has a join operator, there should be at least a few records in the sample data that have the same key; otherwise the join operation will not return any results.
ILLUSTRATE is used to tackle this kind of issue. It takes a sample of the data, and whenever it comes across operators like join or filter that remove data, it ensures that some records pass through and some do not, by modifying records so that they meet the condition. ILLUSTRATE just shows the output of each stage but does not run any MapReduce task.
All Pig Latin statements operate on relations (and operators are called relational operators). Different relational operators in Pig Latin are:
♣ Tip: Go through this blog on relational operators, to understand them and see their implementations.
Yes, the keyword ‘DEFINE’ is like a function name.
DEFINE statement is used to assign a name (alias) to a UDF function or to a streaming command.
18. What is the function of co-group in Pig?
COGROUP takes the tuples of different relations, matches them on a common field, and for each distinct value of that field produces one record containing a separate bag of matching tuples from each relation. The co-group operation can also be applied to a single data set, in which case it simply groups that data set.
It groups the elements by their common field and then returns a set of records containing two separate bags. The first bag holds the records of the first data set that share the common field value, and the second bag holds the records of the second data set that share it.
Co-group is a grouping of data sets. When there is more than one data set, co-group groups all of them and joins them based on the common field. Hence, we can say that co-group is a group of more than one data set and join of that data set as well.
The GROUP and COGROUP operators are identical. For readability, GROUP is used in statements involving one relation and COGROUP is used in statements involving two or more relations. The GROUP operator collects all records with the same key. COGROUP is a combination of group and join: it generalizes GROUP so that, instead of collecting the records of one input based on a key, it collects the records of n inputs based on a key. We can COGROUP up to 127 relations at a time.
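To make the "one bag per input" behaviour concrete, here is a small plain-Python model of co-grouping. It is a sketch of the semantics only, not how Pig implements it; the function name `cogroup` and the sample relations `owners` and `friends` are made up for illustration.

```python
def cogroup(*inputs):
    """Model of COGROUP.

    Each input is a (relation, key_index) pair. The result maps every key
    to a list holding one bag per input, in input order. A bag is empty
    when that input has no record with the key.
    """
    result = {}
    for position, (relation, key_index) in enumerate(inputs):
        for record in relation:
            bags = result.setdefault(record[key_index], [[] for _ in inputs])
            bags[position].append(record)
    return result


# Hypothetical sample data: (owner, pet) and (friend, owner).
owners = [("Alice", "turtle"), ("Alice", "goldfish"), ("Bob", "cat")]
friends = [("Cindy", "Alice"), ("Mark", "Alice"), ("Paul", "Bob"), ("Paul", "Jane")]

grouped = cogroup((owners, 0), (friends, 1))
for key, bags in grouped.items():
    print(key, bags)

# With a single input the same function behaves like GROUP.
print(cogroup((owners, 0)))
```

Note that the key `Jane` still appears in the result with an empty first bag. Unlike an inner join, this model keeps keys that are present in only one of the inputs.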
To get only 5 records out of 100, we use the LIMIT operator.
First load the data in Pig:
personal_data = LOAD '/personal_data.txt' USING PigStorage(',') AS (parameter1, parameter2, …);
Then limit the data to 5 records:
limit_data = LIMIT personal_data 5;
MapFile is a class that provides a file-based map from keys to values.
A map is a directory containing two files, the data file, containing all keys and values in the map, and a smaller index file, containing a fraction of the keys. The fraction is determined by MapFile.Writer.getIndexInterval().
The index file is read entirely into memory. Thus, key implementations should try to keep themselves small. Map files are created by adding entries in-order.
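The data-file-plus-sparse-index layout can be modelled in a few lines of plain Python. This is an in-memory toy, not the Hadoop class: the name `SparseIndexedMap`, its methods, and the interval of 4 are assumptions chosen for illustration, and the toy additionally rejects duplicate keys to keep the lookup simple.

```python
import bisect


class SparseIndexedMap:
    """Toy model of the layout: sorted data plus an index of every N-th key."""

    def __init__(self, index_interval=4):
        self.index_interval = index_interval
        self.data = []        # stands in for the data file: (key, value) pairs
        self.index_keys = []  # stands in for the index file: sampled keys
        self.index_pos = []   # position in `data` of each sampled key

    def append(self, key, value):
        # Entries must be added in key order.
        if self.data and key <= self.data[-1][0]:
            raise ValueError("keys must be appended in increasing order")
        if len(self.data) % self.index_interval == 0:
            self.index_keys.append(key)
            self.index_pos.append(len(self.data))
        self.data.append((key, value))

    def get(self, key):
        # Binary-search the small index, then scan one short run of the data.
        slot = bisect.bisect_right(self.index_keys, key) - 1
        if slot < 0:
            return None
        start = self.index_pos[slot]
        for k, v in self.data[start:start + self.index_interval]:
            if k == key:
                return v
        return None


store = SparseIndexedMap(index_interval=4)
for i in range(10):
    store.append("key%02d" % i, i * i)

print(len(store.index_keys))  # the index holds only a fraction of the keys
print(store.get("key07"))
```

Only the small index has to live in memory; a lookup then touches at most one interval's worth of entries in the data, which is why keeping keys small matters.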
The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile. BloomMapFile uses dynamic Bloom filters to provide a quick membership test for the keys. It is used in the HBase table format.
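The idea of a Bloom-filter membership test can be sketched in plain Python as follows. This is a simple fixed-size filter, not the dynamic variant that BloomMapFile uses, and the class name, the default sizes, and the SHA-256 based hashing are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(("%d:%s" % (seed, key)).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for key in ("key01", "key02", "key03"):
    bloom.add(key)

print(bloom.might_contain("key02"))  # True: added keys are never reported missing
```

A reader can consult such a filter before touching the file on disk and skip the lookup entirely whenever the filter answers "definitely absent".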
The execution modes in Apache Pig are:
♣ Tip: Explain both aspects of Apache Pig, i.e. the case-sensitive as well as the case-insensitive aspect.
Pig script is both case sensitive and case insensitive, so it is difficult to give a one-word answer.
User defined functions, field names, and relation names are case sensitive, i.e. EMPLOYEE is not the same as employee, and M = LOAD 'data' is not the same as M = LOAD 'Data'.
Pig script keywords, on the other hand, are case insensitive, i.e. LOAD is the same as load.
Sometimes data sits inside a tuple or a bag, and we want to remove that level of nesting. The FLATTEN modifier in Pig can be used for this. FLATTEN un-nests bags and tuples. For tuples, FLATTEN substitutes the fields of the tuple in place of the tuple, whereas un-nesting bags is a little more complex because it requires creating new tuples.
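The difference between the two cases can be shown with a small plain-Python model. It is a sketch of the semantics only; the helper names `flatten_tuple` and `flatten_bag` are made up, and Python tuples and lists stand in for Pig tuples and bags.

```python
def flatten_tuple(record, index):
    """FLATTEN on a tuple field: splice its fields into the enclosing tuple."""
    return record[:index] + tuple(record[index]) + record[index + 1:]


def flatten_bag(record, index):
    """FLATTEN on a bag field: emit one new tuple per tuple in the bag."""
    return [
        record[:index] + tuple(inner) + record[index + 1:]
        for inner in record[index]
    ]


# (a, (b, c)) becomes the single tuple (a, b, c).
print(flatten_tuple(("a", ("b", "c")), 1))

# (a, {(b, c), (d, e)}) becomes two tuples: (a, b, c) and (a, d, e).
print(flatten_bag(("a", [("b", "c"), ("d", "e")]), 1))
```

Flattening a tuple keeps the record count unchanged, while flattening a bag multiplies the record by the number of tuples in the bag. In this model an empty bag yields no output records at all.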
Pig Statistics is a framework for collecting and storing script-level statistics for Pig Latin. Characteristics of Pig Latin scripts and the resulting MapReduce jobs are collected while the script is executed. These statistics are then available for Pig users and tools using Pig (such as Oozie) to retrieve after the job is completed.
The stats classes are in the package org.apache.pig.tools.pigstats:
Limitations of the Apache Pig are:
I hope these Apache Pig interview questions were helpful for you. I would suggest you go through the whole series to get in-depth knowledge of Hadoop interview questions. Learn Hadoop from industry experts while working with real-life use cases.
Got a question for us? Mention it in the comments section and we will get back to you.
Can we visualize data in graphs, charts, and plot diagrams by using Apache Pig?
Hey Uday, thanks for checking out our blog.
No. The data in Pig can be turned into a graphical representation (for example, charts) only by using a visualisation tool. There are data visualisation tools available in the market, for example Tableau, Wolfram Alpha, Excel, Many Eyes, and Talend.
Hope this helps. Cheers!
thanks for Edureka…..giving a solutions for my query
You’re welcome, Uday. :) Do follow our blog to stay posted. Cheers!
sure:-
There is a mistake in answers to one of the questions –
What co-group does in Pig?
Co-group joins the data set by grouping one particular data set only.
The answer has to be one or more than one dataset
You might have missed reading the whole paragraph on co-group. It mentions at the end that “co-group is a group of more than one data set and join of that data set as well.”