HBase Architecture: HBase Data Model & HBase Read/Write Mechanism

Last updated on Nov 18,2022 72.1K Views
Shubham Sinha is a Big Data and Hadoop expert working as a Research Analyst at Edureka. He is keen to work with Big Data...

edureka.co

HBase Architecture

In my previous blog on HBase Tutorial, I explained what HBase is and described its features. I also discussed the Facebook Messenger case study to help you connect the concepts better. Moving ahead, I will now explain the data model of HBase and the HBase architecture. Before you move on, you should also know that HBase is an integral part of the course curriculum for Big Data Hadoop Certification.

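The important topics that I will take you through in this HBase architecture blog are the HBase data model, the components of the HBase architecture, the search, write and read mechanisms, compaction, region split, and crash and data recovery.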

Let us first understand the data model of HBase, which helps HBase deliver faster reads, writes and searches.

HBase Architecture: HBase Data Model

As we know, HBase is a column-oriented NoSQL database. Although it looks similar to a relational database, with rows and columns, it is not a relational database. Relational databases are row-oriented, while HBase is column-oriented. So, let us first understand the difference between column-oriented and row-oriented databases:

Row-oriented vs column-oriented Databases:

To better understand it, let us take an example and consider the table below. 

If this table is stored in a row-oriented database, the records are stored as shown below:

1, Paul Walker, US, 231, Gallardo

2, Vin Diesel, Brazil, 520, Mustang

In row-oriented databases, data is stored on the basis of rows or tuples, as you can see above.

Column-oriented databases, in contrast, store this data as:

1, 2

Paul Walker, Vin Diesel

US, Brazil

231, 520

Gallardo, Mustang

In a column-oriented database, all the values of a column are stored together: the values of the first column are stored together, then the values of the second column, and the data in the remaining columns is stored in the same manner.
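The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the ordering of values, using the two records from the example above; the function names `to_row_layout` and `to_column_layout` are made up for this sketch and say nothing about how HBase stores bytes on disk.

```python
# Illustrative only: the same two records laid out row-wise and column-wise.
rows = [
    (1, "Paul Walker", "US", 231, "Gallardo"),
    (2, "Vin Diesel", "Brazil", 520, "Mustang"),
]


def to_row_layout(records):
    """Row-oriented: the fields of each record are stored next to each other."""
    return [value for record in records for value in record]


def to_column_layout(records):
    """Column-oriented: the values of each column are stored next to each other."""
    return [value for column in zip(*records) for value in column]


row_layout = to_row_layout(rows)
column_layout = to_column_layout(rows)

print(row_layout)
print(column_layout)
```

Reading one column for every record touches a contiguous run of values in the column layout, whereas the row layout forces a scan across every full record.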

HBase tables have the following components, shown in the image below:

Put more simply, we can say that HBase consists of:

Now that you know about the HBase Data Model, let us see how this data model falls in line with the HBase Architecture and makes it suitable for large storage and faster processing.

HBase Architecture: Components of HBase Architecture

HBase has three major server components, i.e., HMaster Server, HBase Region Server and ZooKeeper, which together manage the Regions.

The figure below explains the hierarchy of the HBase Architecture. We will talk about each component individually.

Before going to the HMaster, we will first understand Regions, because all these servers (HMaster, Region Server, ZooKeeper) exist to coordinate and manage Regions and to perform various operations inside them. So you may be curious to know what regions are and why they are so important.

HBase Architecture: Region

A region contains all the rows between the start key and the end key assigned to that region. HBase tables can be divided into a number of regions in such a way that all the columns of a column family are stored in one region. Each region holds its rows in sorted order.

Many regions are assigned to a Region Server, which is responsible for handling, managing and executing read and write operations on that set of regions.
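Because regions cover sorted, non-overlapping key ranges, finding the region for a row key is a binary search over the region start keys. Here is a minimal sketch of that idea. The start keys and the server names (`rs1`, `rs2`, `rs3`) are hypothetical, and the sketch assumes each region covers the range from its start key (inclusive) up to the next region's start key (exclusive).

```python
from bisect import bisect_right

# Hypothetical table split into four regions, identified by their start keys.
# The empty string is the start key of the first region.
region_start_keys = ["", "g", "n", "t"]
# Hypothetical assignment of each region to a Region Server.
region_servers = ["rs1", "rs1", "rs2", "rs3"]


def find_region(row_key):
    """Return (region index, Region Server) responsible for row_key."""
    # bisect_right counts the start keys that are <= row_key;
    # the last of those belongs to the region that contains the key.
    index = bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]


print(find_region("apple"))
print(find_region("mango"))
print(find_region("zebra"))
```

Note that one Region Server (`rs1` here) can serve several regions, as described above.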

So, to summarize more simply:

Now, starting from the top of the hierarchy, I would first like to explain the HMaster Server, which acts similarly to a NameNode in HDFS. Then, moving down the hierarchy, I will take you through ZooKeeper and the Region Server.

HBase Architecture: HMaster

As you can see in the image below, the HMaster handles a collection of Region Servers, which reside on DataNodes. Let us understand how HMaster does that.

HBase runs in a huge distributed environment where HMaster alone is not sufficient to manage everything. So, you may be wondering what helps HMaster manage this huge environment. That’s where ZooKeeper comes into the picture. Now that we have seen how HMaster manages the HBase environment, we will look at how ZooKeeper helps HMaster in managing it.

HBase Architecture: ZooKeeper – The Coordinator

The image below explains ZooKeeper’s coordination mechanism.

Since I mentioned the .META Server, let me first explain what the .META Server is, so that you can easily relate the work of ZooKeeper and the .META Server. Later, when I explain the HBase search mechanism in this blog, I will explain how these two work in collaboration.

HBase Architecture: Meta Table

As I already discussed the Region Server and its functions while explaining Regions, we now move down the hierarchy and focus on the Region Server’s components and their functions. Later, I will discuss the mechanisms of searching, reading and writing, and how all these components work together.

HBase Architecture: Components of Region Server

The image below shows the components of a Region Server. I will now discuss them separately.

A Region Server maintains various regions running on top of HDFS. The components of a Region Server are:

Now that we know the major and minor components of the HBase Architecture, I will explain the mechanisms through which they work together. Whether it’s reading or writing, we first need to find where to read from or where to write to. So, let’s understand this search process, as it is one of the mechanisms that makes HBase very popular.

HBase Architecture: How Search Initializes in HBase?

As you know, ZooKeeper stores the META table location. Whenever a client approaches HBase with a read or write request, the following operations occur:

  1. The client retrieves the location of the META table from ZooKeeper.
  2. The client then queries the META table for the location of the Region Server that holds the corresponding row key. The client caches this information along with the location of the META table.
  3. Then it gets the row by sending its request to the corresponding Region Server.

For future references, the client uses its cache to retrieve the location of the META table and the Region Server of previously read row keys. The client does not refer to the META table again unless there is a miss because the region has been moved. In that case, it queries the META table again and updates the cache.
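The lookup-and-cache flow above can be sketched as a toy client. Everything here is a stand-in: `FakeCluster` plays the role of ZooKeeper plus the META table, the row keys and server names are invented, and the cache is keyed by row key for simplicity (an assumption of this sketch, not a statement about the real HBase client).

```python
class FakeCluster:
    """Hypothetical in-memory stand-in for ZooKeeper and the META table."""

    def __init__(self):
        self.meta_location = "rs-meta"
        self.meta_table = {"row-1": "rs1", "row-2": "rs2"}
        self.zookeeper_lookups = 0
        self.meta_lookups = 0

    def get_meta_location(self):
        self.zookeeper_lookups += 1
        return self.meta_location

    def lookup_region_server(self, row_key):
        self.meta_lookups += 1
        return self.meta_table[row_key]


class Client:
    def __init__(self, cluster):
        self.cluster = cluster
        self.meta_location = None  # cached location of the META table
        self.region_cache = {}     # cached row key -> Region Server

    def locate(self, row_key):
        # Cache hit: neither ZooKeeper nor the META table is contacted.
        if row_key in self.region_cache:
            return self.region_cache[row_key]
        # Step 1: ask ZooKeeper where META lives (only the first time).
        if self.meta_location is None:
            self.meta_location = self.cluster.get_meta_location()
        # Step 2: ask META for the Region Server and cache the answer.
        server = self.cluster.lookup_region_server(row_key)
        self.region_cache[row_key] = server
        return server

    def invalidate(self, row_key):
        """Called on a miss, e.g. after the region has moved."""
        self.region_cache.pop(row_key, None)


cluster = FakeCluster()
client = Client(cluster)
client.locate("row-1")
client.locate("row-1")  # served from the cache
print(cluster.zookeeper_lookups, cluster.meta_lookups)
```

After the second call the counters still show a single ZooKeeper lookup and a single META lookup, which is the saving the cache provides.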

Because clients do not spend time retrieving the Region Server location from the META table on every request, the search process becomes faster. Now, let me tell you how writing takes place in HBase, which components are involved and how.

HBase Architecture: HBase Write Mechanism

The image below explains the write mechanism in HBase.

The write mechanism goes through the following steps in sequence (refer to the above image):

Step 1: Whenever the client has a write request, the client writes the data to the WAL (Write Ahead Log). 

Step 2: Once data is written to the WAL, then it is copied to the MemStore.

Step 3: Once the data is placed in MemStore, then the client receives the acknowledgment.

Step 4: When the MemStore reaches the threshold, it dumps or commits the data into an HFile.
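The four steps can be modelled with a small in-memory sketch. The class name `RegionStore`, the `flush_threshold` parameter and the use of an entry count as the threshold are assumptions made for illustration. In this single-threaded toy the flush also runs inline before the acknowledgment is returned, which is a simplification of the steps listed above.

```python
class RegionStore:
    """Toy model of the write path: WAL -> MemStore -> HFile."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log of every write
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # flushed, immutable, sorted files
        # Assumption: the threshold is a number of entries, for simplicity.
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))   # Step 1: write to the WAL
        self.memstore[row_key] = value      # Step 2: copy to the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                    # Step 4: threshold reached
        return "ack"                        # Step 3: acknowledge the client

    def flush(self):
        # The MemStore contents are written out sorted by row key.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}


store = RegionStore(flush_threshold=2)
store.put("row-b", "1")
store.put("row-a", "2")
print(store.hfiles)
```

Because the WAL records every write before the MemStore does, the log still holds both entries after the flush has emptied the MemStore.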

Now let us take a deep dive and understand how MemStore contributes to the writing process and what its functions are.

HBase Write Mechanism- MemStore

As I have mentioned several times, HFile is the main persistent storage in the HBase architecture. In the end, all data is committed to HFiles, which are the permanent storage of HBase. Hence, let us look at the properties of HFile that make searches faster while reading and writing.

HBase Architecture: HBase Write Mechanism- HFile

Now that you know the write mechanism and the role of the various components in making writes and searches faster, I will explain how the read mechanism works inside the HBase architecture. Then we will move on to the mechanisms that increase HBase performance, such as compaction, region split and recovery.

HBase Architecture: Read Mechanism

As discussed in the search mechanism, the client first retrieves the location of the Region Server from the .META Server if it does not already have it in its cache. It then goes through the following sequential steps:

So far, I have discussed the search, read and write mechanisms of HBase. Now we will look at the mechanisms that make search, read and write quick in HBase. First, we will understand compaction, which is one of those mechanisms.

HBase Architecture: Compaction

HBase combines HFiles to reduce the storage and reduce the number of disk seeks needed for a read. This process is called compaction. Compaction chooses some HFiles from a region and combines them. There are two types of compaction as you can see in the above image.

  1. Minor Compaction: HBase automatically picks smaller HFiles and recommits them into bigger HFiles, as shown in the above image. This is called Minor Compaction. It performs a merge sort to combine the smaller HFiles into bigger ones. This helps in storage space optimization.
  2. Major Compaction: As illustrated in the above image, in Major Compaction, HBase merges and recommits the smaller HFiles of a region into a new HFile. In this process, the same column families are placed together in the new HFile. It drops deleted and expired cells in this process. It increases read performance.

But during this process, disk I/O and network traffic might get congested. This is known as write amplification. So, it is generally scheduled during low-load periods.
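The merge step behind both kinds of compaction can be sketched with `heapq.merge`, which merges already-sorted inputs. The cell format `(row_key, timestamp, value)`, the `TOMBSTONE` marker for deleted cells and the function names are assumptions of this sketch. The major compaction here keeps only the newest version of each row, which is a simplification.

```python
import heapq

# Hypothetical marker for a deleted cell.
TOMBSTONE = None


def sort_key(cell):
    """Order cells by row key, then by timestamp."""
    return (cell[0], cell[1])


def minor_compact(hfiles):
    """Merge several sorted HFiles into one sorted HFile, keeping every cell."""
    return list(heapq.merge(*hfiles, key=sort_key))


def major_compact(hfiles):
    """Merge all HFiles, keep the newest cell per row, drop deleted rows."""
    latest = {}
    for row_key, timestamp, value in heapq.merge(*hfiles, key=sort_key):
        # Cells arrive ordered by (row_key, timestamp), so a later cell
        # for the same row always replaces an older one.
        latest[row_key] = (timestamp, value)
    return [
        (row_key, timestamp, value)
        for row_key, (timestamp, value) in latest.items()
        if value is not TOMBSTONE
    ]


hfile_a = [("row1", 1, "a"), ("row3", 1, "c")]
hfile_b = [("row1", 2, TOMBSTONE), ("row2", 1, "b")]

print(minor_compact([hfile_a, hfile_b]))
print(major_compact([hfile_a, hfile_b]))
```

In this example the minor compaction still carries both versions of `row1`, while the major compaction removes `row1` entirely because its newest cell is a delete marker.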

Now another performance optimization process which I will discuss is Region Split. This is very important for load balancing.

HBase Architecture: Region Split

The figure below illustrates the Region Split mechanism.

Whenever a region becomes large, it is divided into two child regions, as shown in the above figure. Each child region represents exactly half of the parent region. The split is then reported to the HMaster. Both child regions are served by the same Region Server until the HMaster assigns them to a new Region Server for load balancing.
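A region split can be pictured as cutting the sorted row keys of the parent region at a middle key. The sketch below assumes a hypothetical `max_rows` limit and splits at the median row key; how a real deployment decides when and where to split depends on its configuration, so treat both choices as illustrative.

```python
def split_region(sorted_row_keys):
    """Split a region's sorted row keys into two halves at the middle key."""
    mid = len(sorted_row_keys) // 2
    split_key = sorted_row_keys[mid]
    # The first child keeps keys below split_key, the second the rest.
    return sorted_row_keys[:mid], sorted_row_keys[mid:], split_key


def maybe_split(sorted_row_keys, max_rows):
    """Return the child regions if the region is too large, else the region itself."""
    if len(sorted_row_keys) <= max_rows:
        return [sorted_row_keys]
    first, second, _ = split_region(sorted_row_keys)
    return [first, second]


parent = ["a", "b", "c", "d"]
print(split_region(parent))
print(maybe_split(parent, max_rows=3))
```

Since the parent's rows are already sorted, both children are sorted as well, and the split key becomes the start key of the second child.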

Moving down the line, last but not least, I will explain how HBase recovers data after a failure. Failure recovery is a very important feature of HBase, so let us see how it works.

HBase Architecture: HBase Crash and Data Recovery

I hope this blog has helped you in understanding the HBase Data Model and the HBase Architecture, and that you enjoyed it. You can now relate the features of HBase (which I explained in my previous HBase Tutorial blog) to the HBase Architecture and understand how it works internally. Now that you know the theoretical part of HBase, you should move on to the practical part. Keeping this in mind, the next blog in our Hadoop Tutorial Series will explain a sample HBase POC.
