Brief Introduction to Oozie

Last updated on Feb 09, 2021


Oozie is a workflow scheduler system for managing Apache Hadoop jobs. It is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs, such as Java MapReduce, Streaming MapReduce, Pig, Hive and Sqoop. Oozie is a scalable, reliable and extensible system; it is used in production at Yahoo!, running more than 200,000 jobs every day.

Features:

  • Execute and monitor workflows in Hadoop
  • Periodic scheduling of workflows
  • Trigger execution on data availability (see the coordinator sketch after this list)
  • HTTP and command-line interfaces, plus a web console
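
Periodic scheduling and data-availability triggers are handled by Oozie coordinator applications, which wrap a workflow and decide when to run it. The snippet below is a minimal, hypothetical coordinator sketch (the name, dates and application path are assumptions, not taken from this article) that would run a word-count workflow once a day:

<!-- Hypothetical coordinator sketch: name, dates and app-path are illustrative -->
<coordinator-app name="wordcount-coord"
                 frequency="${coord:days(1)}"
                 start="2021-01-01T00:00Z" end="2021-12-31T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- HDFS directory that holds the workflow.xml to run -->
      <app-path>hdfs://bar.com:9000/usr/abc/wordcount</app-path>
    </workflow>
  </action>
</coordinator-app>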

Workflow – Directed Acyclic Graph of Jobs:

Workflow Example:

<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>foo.com:9001</job-tracker>
      <name-node>hdfs://bar.com:9000</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <kill name="kill">
    <message>Word count workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>

Workflow Definition:

A workflow definition is a DAG of control flow nodes and action nodes, connected by transition arrows.

Control Flow Nodes:     

Control flow nodes define the workflow's execution path. Flow control within a workflow application is expressed through the following nodes (a short sketch follows the list):

  • Start/end/kill
  • Decision
  • Fork/join
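
As a rough illustration (the node names, transition targets and EL expressions below are assumptions, not part of the word-count example), a decision node and a fork/join pair look like this inside a workflow definition:

<!-- Decision node: picks one outgoing path based on an EL predicate (illustrative names) -->
<decision name="check-input">
  <switch>
    <case to="wordcount">${fs:exists(inputDir)}</case>
    <default to="kill"/>
  </switch>
</decision>

<!-- Fork/join pair: runs two actions in parallel, then waits for both (illustrative names) -->
<fork name="parallel-steps">
  <path start="step-a"/>
  <path start="step-b"/>
</fork>
<join name="after-parallel" to="end"/>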

Action Nodes:

  • Map-reduce
  • Pig
  • HDFS
  • Sub-workflow
  • Java – Run custom Java code
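
For example, a Pig action node follows the same pattern as the map-reduce action shown above; the script name and parameters below are illustrative assumptions:

<!-- Illustrative Pig action; script name and parameters are assumptions -->
<action name="pig-wordcount">
  <pig>
    <job-tracker>foo.com:9001</job-tracker>
    <name-node>hdfs://bar.com:9000</name-node>
    <script>wordcount.pig</script>
    <param>INPUT=${inputDir}</param>
    <param>OUTPUT=${outputDir}</param>
  </pig>
  <ok to="end"/>
  <error to="kill"/>
</action>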

Workflow Application:

A workflow application bundles the workflow definition (workflow.xml) with the files needed to run all its actions; it is deployed to HDFS as a directory. It contains the following files (a typical layout is sketched after the list):

  • Configuration file – config-default.xml
  • App files – lib/ directory with JAR and SO files
  • Pig scripts
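
Assuming the word-count example above, the application directory might look like this before deployment (every file name other than workflow.xml and config-default.xml is an illustrative assumption):

wordcount-wf/
 ├── workflow.xml          <- the workflow definition
 ├── config-default.xml    <- default configuration
 ├── wordcount.pig         <- Pig script (illustrative name)
 └── lib/
      ├── wordcount.jar    <- application JAR (illustrative name)
      └── native-lib.so    <- native library (illustrative name)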

Application Deployment:

$ hadoop fs -put wordcount-wf hdfs://bar.com:9000/usr/abc/wordcount

Workflow Job Parameters:

$ cat job.properties
oozie.wf.application.path=hdfs://bar.com:9000/usr/abc/wordcount
inputDir=/usr/abc/input-data
outputDir=/usr/abc/output-data

Job Execution:

$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-W
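
The job ID returned above can then be used to check the job's status from the same command-line client; for example (the Oozie server URL is an assumption):

$ oozie job -oozie http://localhost:11000/oozie -info 1-20090525161321-oozie-xyz-W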

Got a question for us? Mention it in the comments section and we will get back to you.


Comments
5 Comments
  • Rajiv says:

    Sir, how to schedule a job using crontab?

    • EdurekaSupport says:

      Hey Rajiv, thanks for checking out our blog. Please refer to the steps given below to set up a cron job:
      1. Prepare the SQL to be run using CRON.
      2. See below for an example of the code that needs to be added to the SQL code for a cron job:
      .logon server/user_id, Teradata password
      For example :
      .logon Mozart/akatarni,Welcome1
      ADD THE SQL CODE HERE
      .logoff
      .quit
      .exit
      3. WinSCP – this is the file transfer application that is used to transfer the .SQL code file to the server.
      a. Open “WinSCP”, Server name = phximdsas02.phx.ebay.com
      b. Give login id and SAS password
      c. Copy the code from your system to the server window; in the attached snapshot we have copied “ask_lstg.sql” from genpact (personal system) to the server window.
      i. The left window shows your personal computer and the right one is the server.
      4. Open “PuTTY”. Use the server phximdsas02.phx.ebay.com.
      https://uploads.disquscdn.com/images/78d54b229f0ce485a72b7984a886306720904f6e58179446069905356f639f94.png
      5. At the prompt, enter your SAS credentials. After entering the password, you will see the attached window.
      https://uploads.disquscdn.com/images/d89991fa7351bb7a80085c13e9ec8028f333f79c3ab3e6da6c406491ceb53dde.png
      6. To open the editor :
      a. Type export EDITOR=vi <hit enter>
      b. Type crontab -e <hit enter>
      i. This command edits your crontab file, or creates one if it doesn’t already exist.
      c. Press “i” to start typing
      d. Press <ESC> to get out of insert mode
      7. Then make the cron job entry:
      A crontab entry has five fields (minute, hour, day of month, month, day of week) followed by the command to be run at that interval.
      00 06 * * * /usr/bin/bteq <fake_lstg.sql> fake_lstg.LOG 2>&1
      The above will run the code at 06:00 hours every day
      In the above example, “fake_lstg.sql” is the SQL file and “fake_lstg.LOG” is the log file where the results will appear.
      15 20 * * 0 /usr/bin/bteq <fake_lstg.sql> fake_lstg.LOG 2>&1
      The above will run the code at 20:15 hours every Sunday
      https://uploads.disquscdn.com/images/4b910f3e7b76f35e76b9e9a338c5f547932c7c2e897d8b8a109c15c224fa0e01.png
      8. Keep adding lines to the crontab file to schedule more jobs.
      a. The easiest way to add a line is to be at the first character in the file, then in ESC mode,
      press <shift> + O (case sensitive). This adds a new line above the current one.
      9. To move around the file, in ESC mode
      “l” – move right
      “h” – move left
      “j” – move down
      “k” – move up
      10. To save the crontab file and exit, press <ESC>, then :wq
      a. To exit the file WITHOUT saving, press <ESC>, then :q!
      11. Type exit at the Unix prompt to exit PuTTY.
      12. The cron job should run at the specified time.
      13. Check the *.LOG file to make sure the code ran successfully.
      Hope this helps. Cheers!

      • Rajiv says:

        Sir, thanks for answering my question. It’s helpful for me, a good and clear description. Thanks to you, sir.

  • Sankalp Tomar says:

    Hi,

    Suppose we want to use the output of a Hive job as an input to a MapReduce job. How can we achieve this?

    • EdurekaSupport says:

      Hey Sankalp, thanks for checking out our blog. With regard to your query, we can first store the output of Hive in HDFS and then use it as the input for the MapReduce code.
      Storing the output of Hive:
      INSERT OVERWRITE DIRECTORY '/path/to/output/dir'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      SELECT books FROM table;
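      For illustration only (the JAR name, driver class and output path below are hypothetical), the exported directory can then be passed straight to the MapReduce job as its input path:
      $ hadoop jar myjob.jar MyMapReduceJob /path/to/output/dir /path/to/mr-output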
      Hope this helps. Cheers!
