Apache Flume Tutorial for Beginners | Twitter Data Streaming

In this Apache Flume tutorial blog, we will understand how Flume helps in streaming data from various sources. But before that, let us understand the importance of data ingestion. Data ingestion is the initial and important step in processing and analysing data and then deriving business value from it. There are multiple sources from which data is gathered in an organization.

Let's talk about another important reason why Flume became so popular. You may be familiar with Apache Hadoop, which is widely used in the industry because it can store all kinds of data. Flume integrates easily with Hadoop and can dump unstructured as well as semi-structured data on HDFS, complementing the power of Hadoop. This is why Apache Flume is an important part of the Hadoop ecosystem.

We will begin this Flume tutorial by discussing what Apache Flume is. Then, moving ahead, we will understand the advantages of using Flume.

Apache Flume Tutorial: Introduction to Apache Flume

Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates and transports large amounts of streaming data, such as log files and events, from various sources like network traffic, social media and email messages to HDFS. Flume is highly reliable and distributed.

The main idea behind Flume's design is to capture streaming data from various web servers and deliver it to HDFS. It has a simple and flexible architecture based on streaming data flows. It is fault-tolerant and provides reliability mechanisms for failure recovery.

Now that we understand what Flume is, let us advance in this Flume tutorial blog and understand the benefits of Apache Flume. Then, moving ahead, we will look at the architecture of Flume and try to understand how it works fundamentally.

Apache Flume Tutorial: Advantages of Apache Flume

There are several advantages of Apache Flume which make it a better choice over others. The advantages are:

It is the architecture that gives Apache Flume these benefits. Now that we know the advantages of Apache Flume, let's move ahead and understand the Apache Flume architecture.

Apache Flume Tutorial: Flume Architecture

Figure: Apache Flume architecture

A Flume agent ingests the streaming data from various data sources into HDFS. In the diagram, the web server represents the data source. Twitter is one of the best-known sources of streaming data.

Now that we know how Apache Flume works, let us take a look at a practical exercise in which we will stream Twitter data and store it in HDFS.

Apache Flume Tutorial: Streaming Twitter Data

In this practical, we will stream data from Twitter using Flume and then store the data in HDFS, as shown in the image below.

The first step is to create a Twitter application. For this, go to https://apps.twitter.com/ and sign in to your Twitter account. Go to the create application tab, as shown in the image below.

After creating this application, you will find the consumer key and the access token, each with its secret. Copy all four values. We will pass these tokens in our Flume configuration file to connect to this application.

Now create a flume.conf file in Flume's root directory, as shown in the image below. As discussed in the Flume architecture section, we will configure our Source, Sink and Channel. Our Source is Twitter, from where we are streaming the data, and our Sink is HDFS, where we are writing the data.
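
The wiring between the three components can be sketched as a configuration fragment. This is a minimal sketch rather than the file used in this blog: the agent name `TwitterAgent` and the component names `Twitter`, `MemChannel` and `HDFS` are hypothetical labels chosen for illustration, and the file is written to the current directory (override it with `CONF_FILE`) rather than to Flume's root directory.

```shell
# Write the component wiring to a Flume properties file.
CONF_FILE="${CONF_FILE:-flume.conf}"

cat > "$CONF_FILE" <<'EOF'
# Hypothetical agent name: TwitterAgent. Component names are illustrative.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# A source can feed several channels (plural key); a sink drains exactly one (singular key).
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
EOF

echo "Wrote component wiring to $CONF_FILE"
```

Every property key starts with the agent name, so the same name has to be used consistently throughout the file.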

In the source configuration, we set the Twitter source type to org.apache.flume.source.twitter.TwitterSource. Then we pass all four tokens which we received from Twitter. Finally, we pass the keywords for which we want to fetch tweets.
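
A source section along these lines might look like the fragment below, which is appended to a local flume.conf. It is only a sketch under assumptions: the agent name `TwitterAgent` and source name `Twitter` are hypothetical, the four `YOUR_...` values are placeholders for the tokens copied from your Twitter application, and the example keywords are made up. The `keywords` key follows the description above; it is not confirmed here that this source type filters tweets on it, so treat that line as an assumption.

```shell
# Append a Twitter source section to a Flume properties file.
CONF_FILE="${CONF_FILE:-flume.conf}"

cat >> "$CONF_FILE" <<'EOF'
# Source section (agent name TwitterAgent and source name Twitter are hypothetical).
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
# Placeholders: substitute the four values copied from your Twitter application.
TwitterAgent.sources.Twitter.consumerKey = YOUR_CONSUMER_KEY
TwitterAgent.sources.Twitter.consumerSecret = YOUR_CONSUMER_SECRET
TwitterAgent.sources.Twitter.accessToken = YOUR_ACCESS_TOKEN
TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
# Example keywords only; whether this source honours the key is an assumption.
TwitterAgent.sources.Twitter.keywords = hadoop, bigdata
EOF

echo "Appended source section to $CONF_FILE"
```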

In the Sink configuration, we configure the HDFS properties. We set the HDFS path, write format, file type, batch size, etc. Finally, we set up a memory channel, as shown in the image below.
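
The sink and channel sections could be sketched as follows, again appended to a local flume.conf. The names `TwitterAgent`, `HDFS` and `MemChannel` are hypothetical, the NameNode address and directory in `hdfs.path` are placeholders for your own cluster, and the numeric values are illustrative rather than the ones used in this blog.

```shell
# Append HDFS sink and memory channel sections to a Flume properties file.
CONF_FILE="${CONF_FILE:-flume.conf}"

cat >> "$CONF_FILE" <<'EOF'
# Sink section (names TwitterAgent and HDFS are hypothetical).
TwitterAgent.sinks.HDFS.type = hdfs
# Hypothetical NameNode address and directory; use your cluster's values.
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000

# Memory channel section (name MemChannel is hypothetical; sizes are illustrative).
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
EOF

echo "Appended sink and channel sections to $CONF_FILE"
```

In this sketch the channel's `transactionCapacity` is kept at least as large as the sink's `batchSize`, so that a full batch fits into a single channel transaction.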

Now we are all set for execution. Let us go ahead and execute this command:

$FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f $FLUME_HOME/flume.conf
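
Note that `flume-ng agent` also expects the name of the agent to start, passed with `--name` (or `-n`) and matching the prefix used for the property keys in the configuration file. A hedged sketch of the full invocation follows; the agent name `TwitterAgent` and the fallback install path `/opt/flume` are assumptions, and the logger option simply sends log output to the console.

```shell
# Hypothetical agent name; it must match the prefix used in flume.conf.
AGENT_NAME="${AGENT_NAME:-TwitterAgent}"
# Assumed install location, used only if FLUME_HOME is not already set.
FLUME_HOME="${FLUME_HOME:-/opt/flume}"

# Build the argument list once so it can be either run or just printed.
set -- "$FLUME_HOME/bin/flume-ng" agent \
  --conf "$FLUME_HOME/conf" \
  --conf-file "$FLUME_HOME/flume.conf" \
  --name "$AGENT_NAME" \
  -Dflume.root.logger=INFO,console

if [ -x "$1" ]; then
  "$@"    # keeps running until interrupted with CTRL+C
else
  echo "flume-ng not found; would run: $*"
fi
```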

Let this command run for a while, then stop the agent using CTRL+C. Then go to your Hadoop directory and check the configured path to see whether the file has been created.
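
One way to make this check from the command line is with the HDFS file system shell. The sketch below assumes a hypothetical target directory `/user/flume/tweets`; replace it with the path you set in the sink configuration.

```shell
# Hypothetical directory; use the value of hdfs.path from your configuration.
HDFS_PATH="${HDFS_PATH:-/user/flume/tweets}"

if command -v hdfs >/dev/null 2>&1; then
  # List the files the sink has written so far.
  hdfs dfs -ls "$HDFS_PATH"
  CHECK_STATUS="listed"
else
  CHECK_STATUS="skipped"
  echo "hdfs command not found; skipping listing of $HDFS_PATH"
fi
```

If the listing shows files, `hdfs dfs -get` can copy one of them to the local file system for inspection.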

Download the file and open it. You will get something like what is shown in the image below.

I hope this blog was informative and added value for you. If you are interested in learning more, you can go through this Hadoop Tutorial Series, which tells you about Big Data and how Hadoop is solving challenges related to Big Data.

Comments

diva says:
Jan 23, 2018 at 10:35 am GMT
Hi, I have done my configuration for Twitter data in the same way, but I am not getting tweet data based on the keywords passed. Even in your screenshot, the data is not relevant to the keywords that you have passed in the configuration file. Can you please explain how we can get data based on some keywords? Also, please tell me if we can pass keywords dynamically instead of hard-coding them in the configuration file?
- kavya says:
  Mar 29, 2018 at 5:21 am GMT
  Hi diva, you need to design the agent.config file. I got this information from the Apache Flume website; you can also refer to it.
