
Talend Tutorial – Future Of Data Integration

Last updated on Nov 29, 2021

Swatee Chand
Sr Research Analyst at Edureka. A techno freak who likes to explore different technologies. Likes to follow the technology trends in market and write...

In today’s data-driven world, organizations, machines, and gadgets of every size generate a huge amount of data. Your mobile phone, for example, generates some data each time you browse the web. Did you know that a commercial plane can generate up to 500GB of data per hour? Now you can imagine how large this data is, which is why it is known as Big Data. But all of this data is pretty much useless unless you perform ETL (Extract, Transform, Load) operations on it, and that is certainly not an easy task. Moreover, the real-time, fast-paced nature of today’s business adds to the need for a tool that can integrate systems quickly and easily. This is where Talend comes to the rescue. Through this blog on Talend Tutorial, I will explain how Talend helps you build, test, deploy, schedule and monitor data integration Jobs.

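But before I proceed, let me list the topics I will be discussing today:

  • What Is Talend?
  • Introduction To Talend Open Studio (TOS)
  • TOS Installation
  • TOS GUI
  • Talend Job
  • Talend Components And Connectors
  • Metadata
  • Context Variables
  • First Job In Talend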


What Is Talend? – Talend Tutorial

Talend is an open source software integration vendor that offers data integration and data management solutions. The company provides integration software and services for big data, cloud storage, data integration, data management, master data management, data quality, data preparation, and enterprise applications. Its headquarters are located in Redwood City, California.

Following are some of the major features of Talend:

[Figure: Talend features]

Talend is considered a next-generation leader in cloud and big data integration software. Its software helps companies become data driven by making data more accessible, improving its quality and quickly moving it to where it is needed for real-time decision making. You can think of Talend as critical infrastructure for this data-driven world. Its open source approach breaks away from the traditional proprietary model while still providing powerful software solutions, and it offers the flexibility to meet the needs of organizations of all kinds. Being open source, it is backed by a huge community of developers. Talend publishes the code of its core modules under the GNU General Public License or the Apache License. Developers within the community can then make changes and enhance the products, which in turn benefits other Talend users.

Various products offered by Talend are:

[Figure: Talend products]

Among the Talend products, Talend Open Studio (TOS) is the most widely used. In this Talend tutorial blog, I will be explaining how you can use Talend Open Studio for Data Integration.

Introduction To Talend Open Studio (TOS) – Talend Tutorial

Talend Open Studio is an open source project based on Eclipse RCP. It supports ETL-oriented implementations and is generally provided for on-premises deployment. It is extensively used for integration between operational systems, ETL processes and data migration. Talend Open Studio for Data Integration is designed so that it can easily combine, convert and update data present at various locations across an organization. It acts as a code generator that produces data transformation scripts and underlying programs in Java. It provides an interactive and user-friendly GUI that gives you access to a metadata repository containing the definitions and configurations for each process performed in Talend. Below is the basic architecture of Talend Open Studio.

[Figure: Talend Open Studio architecture]

Let’s now download and install Talend Open Studio on CentOS.

TOS Installation – Talend Tutorial

STEP 1: Go to https://www.talend.com/download.

STEP 2: Click on ‘Download Free Tool’.

STEP 3: Click on ‘Download Free Tool’ again to get the zip file.

STEP 4: Extract the zip file.

STEP 5: Go into the extracted folder and double-click on the TOS_DI-linux-gtk-x86_64 file.

STEP 6: Let the installation finish.

STEP 7: Click on ‘Create a new project’ and specify a meaningful name for your project.

STEP 8: Click on ‘Finish’ to go to the Open Studio GUI.

STEP 9: Right-click on the Welcome tab and select ‘Close’.

STEP 10: You should now be able to see the TOS main page.

TOS GUI – Talend Tutorial

Now that you have downloaded and installed Talend Open Studio, let me give you a walkthrough of its GUI. Talend Open Studio consists of four major parts:

  1. Repository

    The Repository collects all the technical items which can be used either to describe business models or design Jobs within Talend and displays them in a tree structure. From the Repository, you can access various Business Models, Job Designs, reusable routines, documentation as well as database connections. In other words, the Repository acts as a central store for all the elements which are necessary for any Job design or business modelling within a project. 

  2. Design Window

    This window further consists of the following parts:

    1. Workspace: Here you can lay down the designs of your Jobs as well as the business models.
    2. Designer Tab:  This tab opens by default when you create a Job which displays the Job in a graphical mode.
    3. Code Tab: This tab lets you view the generated code and highlights possible language errors.
  3. Palette

    Component Palette is docked at the top of the design workspace to help you draw the model corresponding to your workflow needs. Depending on your Job or the business model, you can drag and drop various technical components or shapes into your design workspace. There are more than 800 components available for you to choose from.

  4. Configuration Tab

    The configuration tabs are present in the lower half of the design window. There are various configuration tabs available in TOS. Each of these tabs opens a view which displays the properties of the current element in the workspace. The most frequently used configuration tabs are:

    1. Job Tab:

      The Job tab provides various information about the current Job in the designer window including name, version, creation date and time etc.

    2. Context Tab

      The Context tab is used to set context variables and the different contexts in which they will be used.

    3. Component Tab

      The Component tab displays all the parameters that are required to configure a component. Basically, it collects all the information related to the graphical element selected in the design workspace.

    4. Run Tab

      The Run tab displays the progress of the execution of a Job. The logs shown here include any start, end and error messages.

At this point you might ask ‘what is a Job?’, as I have already used this term quite a few times. So, before diving any deeper, let me first give you a brief introduction to Talend Jobs.

Talend Job – Talend Tutorial

A ‘Job’ in Talend is basically a customer requirement converted into a technical process. Technically, it is the basic executable unit of any process built using Talend. As you already know, TOS converts everything into Java code at the backend, and each Job is converted into a single Java class. Let me show you how you can create a Job in Talend.

Steps:

  1. Right-click on ‘Job Designs’ in the Repository and select ‘Create job’.
  2. Specify a meaningful name for your Job along with its purpose and description, and click on ‘Finish’.
  3. Once you finish creating a Job, you will get access to the components present in the palette. Now you can drag any component you need from the palette and drop it in the workspace.
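
Since TOS turns each Job into a single Java class, it can help to picture what such a class might look like. The following is a hand-written sketch of that idea only, not the code Talend actually generates. The class name `DemoJob`, the methods `readRows`, `logRows` and `runJob`, and the exit-code convention are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified stand-in for a generated Job class.
// Real generated code is much longer and relies on Talend's own runtime classes.
class DemoJob {

    // Plays the role of an input component: produces rows.
    static List<String> readRows() {
        List<String> rows = new ArrayList<>();
        rows.add("1;Alice");
        rows.add("2;Bob");
        return rows;
    }

    // Plays the role of a logging component: prints each row.
    static void logRows(List<String> rows) {
        for (String row : rows) {
            System.out.println(row);
        }
    }

    // Runs the whole Job and returns an exit code (0 = success, assumed convention).
    static int runJob() {
        try {
            logRows(readRows());
            return 0;
        } catch (RuntimeException e) {
            System.err.println("Job failed: " + e.getMessage());
            return 1;
        }
    }

    public static void main(String[] args) {
        System.out.println("exit code: " + runJob());
    }
}
```

The point of the sketch is the shape: one class per Job, with each component contributing a section of code that the Job runs in order.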

But in order to add a component to a Job, you first need to know what exactly components are, and how you can use multiple components together and connect them. So in the next part of this Talend tutorial, I will introduce you to various components and connectors available in Talend.

Talend Components And Connectors – Talend Tutorial

Let’s start with Components.

A component is a functional piece used to perform a single operation in Talend. Everything you see on the palette is a graphical representation of a component, and you can use components with a simple drag and drop. At the backend, a component is a snippet of Java code that is generated as a part of a Job (which is basically a Java class). This Java code is automatically compiled when the Job is saved. A Talend Job may include one or more components depending on the requirement. Talend provides more than 800 components to choose from. For ease of access, all these components are grouped into a few families. In this Talend tutorial blog, I will introduce you to some of the most important and frequently used components of each family.

  • Databases

    This family provides Talend components which cover various needs like opening connections, reading and writing tables, committing transactions, performing rollback for error handling etc. More than 40 databases are supported by Talend, including MySQL, MS SQL Server, Hive, Amazon, Azure etc. Following are some of the most widely used MySQL components:

    • tMysqlConnection: This component opens a new connection to the database for a current transaction.
    • tMysqlInput: This component reads a database and extracts fields based on the query.
    • tMysqlOutput: This component writes, updates, modifies or deletes entries in a database.
    • tMysqlClose: This component closes the transaction committed in the connected database.
  • File

    This family groups together various components which read and write data in all types of files like Delimited, Positional, XML, Excel etc. It also provides a number of components which help in performing various tasks like unarchiving, deleting, copying, comparing etc. This family is further divided into subfamilies like Input, Output, and Management. A few widely used components of this family are:

    • tFileInputDelimited: This component reads a given file row by row with fields separated using some specified character.
    • tFileInputExcel: This component reads an Excel file (.xls or .xlsx) and extracts data line by line.
    • tFileOutputXML: This component outputs the data to an XML file.
    • tFileList: This component retrieves a set of files or folders based on a filemask pattern and iterates over them.
    • tFileArchive: This component zips one or more files according to the parameters defined and places the archive created in the selected directory.
  • Internet

    This family includes all of the components that help in accessing information from the Internet through various means like Web services, RSS flows, SCP, MOM, Emails, FTP etc. A few widely used components of this family are:

    • tFTPGet: This component helps in retrieving the specified files via an FTP connection.
    • tFTPPut: This component copies the selected files via an FTP connection.
    • tHttpRequest: This component sends an HTTP request to the server end and receives the corresponding response from the server end.
    • tSendMail: This component is used to send emails and attachments to the defined recipients.
  • Logs & Errors

    This family groups together all the components dedicated to catching log information and handling Job errors. Following are the most widely used components of this family:

    • tLogRow: This component allows you to write row data into the Job log file, or to the console window.
    • tLogCatcher: This component collects the log data and encapsulates it to pass it on to the defined output.
    • tWarn: This component triggers a warning, often caught by the tLogCatcher component, for an exhaustive log.
    • tDie: This component sends a message to a tLogCatcher and terminates the Job with a specified exit code.
  • Misc

    This family gathers different miscellaneous components covering various needs like the creation of sets of dummy data rows, buffering data, loading context variables etc. A few important components of this family are:

    • tMsgBox: This component opens a dialogue box with a clickable OK button.
    • tRowGenerator: This component is used to generate as many rows and fields as are required using random values which are taken from a list.
  • Orchestration

    This family includes various components which help to sequence or orchestrate tasks, Jobs and SubJobs. The most widely used components from this family are:

    • tLoop: This component helps in executing a task or a Job automatically, based on a loop with the specified number of iterations.
    • tPrejob: This component helps in triggering a task required before the execution of a Job.
    • tPostjob: This component helps in triggering a task required after the execution of a Job.
    • tSleep: This component helps in implementing a pause within a Job execution.
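
To make the component descriptions above a little more concrete, here is a rough Java sketch of what a delimited-file input followed by a log step does, in the spirit of tFileInputDelimited and tLogRow. It is not Talend’s generated code. The class `DelimitedReaderSketch`, the method `readDelimited`, its parameters and the sample data are all made up for illustration, and the input is passed as a string rather than read from a real file to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DelimitedReaderSketch {

    // Walks the content row by row and splits each row on the separator,
    // skipping the given number of header rows and any blank lines.
    static List<String[]> readDelimited(String content, String separator, int headerRows) {
        List<String[]> rows = new ArrayList<>();
        String[] lines = content.split("\\R"); // split on any line break
        for (int i = headerRows; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                continue;
            }
            // Pattern.quote treats the separator literally; -1 keeps trailing empty fields.
            rows.add(lines[i].split(Pattern.quote(separator), -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        String content = "id;name\n1;Alice\n2;Bob\n";
        // Log step: print every row to the console with a visible field separator.
        for (String[] row : readDelimited(content, ";", 1)) {
            System.out.println(String.join("|", row));
        }
    }
}
```

In a real Job the two steps would be separate components joined by a Row connection, which is covered in the next part.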

Now that you know the components, let’s quickly take a look at the connectors or the links which help in connecting these components together in a Job.

Talend provides various types of connections to enable the communication between the components:

  1. Row

    The Row connection deals with the actual data flow. Following are the types of Row connections supported by Talend:

    • Main
    • Lookup
    • Filter
    • Rejects
    • ErrorRejects
    • Output
    • Uniques/Duplicates
    • Multiple Input/Output
  2. Iterate

    The Iterate connection is used to perform a loop on files contained in a directory, on rows contained in a file or on the database entries. Unlike other types of connections, the name of this Iterate link is read-only.

  3. Trigger

    The Trigger connection is used to create a dependency between Jobs or SubJobs which are triggered one after the other according to the trigger’s nature. Trigger connections are grouped into two categories:

    1. Subjob Triggers

      • OnSubjobOK
      • OnSubjobError
      • Run if
    2. Component Triggers

      • OnComponentOK
      • OnComponentError
      • Run if
  4. Link

    The Link connection can be used only with the ELT components. It is used to transfer the table schema information to the ELT mapper component in order to be used in specific DB query statements.
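
The behaviour of the OnSubjobOK and OnSubjobError triggers listed above can be pictured with a small Java sketch. This is only an analogy under simple assumptions, not how Talend implements triggers: the class `TriggerSketch` and the method `runWithTriggers` are hypothetical, and a subjob is modelled as a plain `Runnable` that signals failure by throwing an exception.

```java
class TriggerSketch {

    // Runs a subjob, then follows exactly one of the two trigger links:
    // the error link if the subjob threw, the OK link otherwise.
    // Returns the name of the trigger that fired.
    static String runWithTriggers(Runnable subjob, Runnable onSubjobOk, Runnable onSubjobError) {
        try {
            subjob.run();
        } catch (RuntimeException e) {
            onSubjobError.run();
            return "OnSubjobError";
        }
        onSubjobOk.run();
        return "OnSubjobOK";
    }

    public static void main(String[] args) {
        String fired = runWithTriggers(
                () -> System.out.println("subjob: load data"),
                () -> System.out.println("next subjob: write report"),
                () -> System.out.println("error handler: send alert"));
        System.out.println("trigger fired: " + fired);
    }
}
```

The component-level triggers follow the same idea, except that the dependency hangs off a single component rather than a whole SubJob.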

Metadata – Talend Tutorial

Metadata in Talend is definitional data that provides information about the other data managed within Talend Studio. You can find the Metadata in the Repository area of TOS. In the Repository Metadata, you can store metadata about the various data sources that you may use. This comes in handy while developing any project, as you can use these data sources later in your Jobs just by dragging an object from the Repository and dropping it in the workspace.

In the Repository, you can store metadata for various data sources like delimited files, positional files, XML files, databases, FTP, Azure, Salesforce etc.

 

Context Variables – Talend Tutorial

Context variables are user-defined parameters that are passed into a Job at runtime. Their values may change as the Job is promoted from the Development environment to Test and Production. Once these variables are set correctly for each environment, you can easily execute the Job in any of them. Another use of context variables is to define values that are commonly used within a project. You can create context variables in three ways:

  1. Embedded Context Variables

    These context variables are embedded in the Job and are configured much like any other component parameters in the Context Tab below the Job Designer.

  2. Repository Context Variables

    These are created when context variables are used or needed in more than one Job. They are maintained centrally in the Repository, which makes them accessible to every Job that needs them.

  3. External Context Variables

    External context variables are held in an external file and loaded into the Job at runtime.
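
The idea of external context variables can be sketched in plain Java. The example below assumes the external file holds simple key=value lines and that the keys are named `db_host`, `db_port` and `db_name`. Those names, the class `ContextSketch` and both methods are hypothetical, and this is not the mechanism Talend itself uses to load contexts.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class ContextSketch {

    // Parses key=value lines, the way an external context file might be laid out.
    static Properties loadContext(String fileContent) {
        Properties context = new Properties();
        try {
            context.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return context;
    }

    // Builds a MySQL JDBC URL from the (hypothetical) context keys.
    static String jdbcUrl(Properties context) {
        return "jdbc:mysql://" + context.getProperty("db_host")
                + ":" + context.getProperty("db_port")
                + "/" + context.getProperty("db_name");
    }

    public static void main(String[] args) {
        // Same Job logic, different values per environment.
        String dev = "db_host=localhost\ndb_port=3306\ndb_name=demo_dev\n";
        String prod = "db_host=db.example.com\ndb_port=3306\ndb_name=demo\n";
        System.out.println(jdbcUrl(loadContext(dev)));
        System.out.println(jdbcUrl(loadContext(prod)));
    }
}
```

Keeping the values outside the Job is what lets the same Job move between Development, Test and Production without being edited.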

Now, I think you are ready to design your first Job in Talend.

In the next section of this Talend tutorial blog, I will show you a step by step demonstration of a simple Talend Job which you can easily execute.

First Job In Talend – Talend Tutorial

The following demo first establishes a connection with the database, reads data from two different external Excel files, merges them and inserts the result into a database table. It then writes the contents of the new table to a new Excel file and finally closes the connection once the transfer is complete.

Let’s see how to execute it, step by step:

STEP 1: In this demo, I am using an external context file for the database details. To do so, you first need to create a context file with all the necessary database details.

[Figure: external context file]

STEP 2: Create a new Job. Go to its ‘Contexts’ tab and add the following details:

[Figure: context details]

STEP 3: Now add a ‘tPrejob’ and a ‘tMysqlConnection’ component to the workspace and link them together as shown below. This will establish the connection with the database before the actual Job is executed. Then go to the ‘Component’ tab of the ‘tMysqlConnection’ component and add the necessary details:

[Figure: database connection]

STEP 4: Add two ‘tFileInputExcel’ components and a ‘tMap’ component to the workspace and link them as shown.

[Figure: adding the Excel file components]

STEP 5: Now go to the ‘Repository’ and expand the ‘Metadata’ section. Right-click on ‘File Excel’, select ‘Create File Excel’ and then provide the necessary details as shown below. Once done, click on ‘Next’.

[Figure: Excel metadata]

STEP 6: Provide the source file path and click on ‘Next’.

[Figure: Excel metadata, source file path]

STEP 7: Check ‘Header’ to skip the header row (if applicable). Click on ‘Next’.

[Figure: Excel metadata, header setting]

STEP 8: Finally, provide a name for the ‘schema’ and click on ‘Finish’.

STEP 9: Go to the ‘Component’ tab of the ‘tFileInputExcel’ component. Select the ‘Property Type’ as ‘Repository’ and select the metadata you just created.

[Figure: Excel input details]

STEP 10: Repeat the same for the other input file.

STEP 11: Double-click on the ‘tMap’ component and map the input and output tables as shown:

[Figure: tMap configuration]

STEP 12: Add ‘tMysqlOutput’ and ‘tFileOutputExcel’ components and link them as shown:

[Figure: MySQL output and Excel output]

STEP 13: Go to the ‘Component’ tab of ‘tMysqlOutput’ and enter the details as shown:

[Figure: tMysqlOutput details]

STEP 14: Go to the ‘Component’ tab of ‘tFileOutputExcel’ and provide the details as shown:

[Figure: tFileOutputExcel details]

STEP 15: Finally, to finish the Job, add a ‘tPostjob’ and a ‘tMysqlClose’ component as shown.

[Figure: tPostjob and tMysqlClose]

STEP 16: Go to the ‘Component’ tab of the ‘tMysqlClose’ component and select the connection you need to close.

[Figure: tMysqlClose component]
STEP 17: Now go to the ‘Run’ tab and execute the job.
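
The merge performed by the ‘tMap’ component in this demo can be pictured as a join between a main flow and a lookup flow. The sketch below assumes the two Excel files share a key column and that unmatched rows are dropped (an inner join). The demo itself does not state either of these, and the class `JoinSketch`, the method `join` and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JoinSketch {

    // Inner join of a main flow with a lookup flow on a shared integer key.
    // Each output row is "key,mainValue,lookupValue".
    static List<String> join(Map<Integer, String> main, Map<Integer, String> lookup) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> row : main.entrySet()) {
            String match = lookup.get(row.getKey());
            if (match != null) { // rows without a lookup match are dropped
                out.add(row.getKey() + "," + row.getValue() + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> names = new LinkedHashMap<>();
        names.put(1, "Alice");
        names.put(2, "Bob");
        Map<Integer, String> cities = new LinkedHashMap<>();
        cities.put(1, "Pune");
        cities.put(3, "Delhi");
        for (String row : join(names, cities)) {
            System.out.println(row);
        }
    }
}
```

In the Job, the joined rows are what flow on to ‘tMysqlOutput’ and ‘tFileOutputExcel’.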
So, this brings us to the end of the blog on Talend Tutorial. I tried my best to keep the concepts short and clear. Hope it helped you in understanding Talend and its various features. Regarding the demo, if you need the datasets for the practice, all you need to do is drop a comment.
If you found this Talend tutorial blog relevant, check out the Talend Certification Training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka Talend for DI and Big Data Certification Training course helps you master Talend and the Big Data Integration Platform and easily integrate all your data with your Data Warehouse and Applications, or synchronize data between systems.
Got a question for us? Please mention it in the comments section and we will get back to you.