The Apache Hadoop project includes four key modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
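The MapReduce model behind the last module can be illustrated with a toy word count. This is a hedged sketch in plain Python showing the map, shuffle, and reduce phases conceptually; it is not the Hadoop API, which runs these phases distributed across a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data on hadoop", "hadoop stores big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# e.g. counts["hadoop"] == 2
```

In real Hadoop MapReduce, the same three phases run in parallel over HDFS blocks, with YARN scheduling the map and reduce tasks across the cluster.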
HBase is a scalable, distributed database that supports structured data storage for large tables. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
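The Bigtable-style data model HBase implements is essentially a sparse, sorted, multidimensional map: row key → column family → qualifier → value. A minimal sketch in Python for intuition only (the real client API is the Java client or the HBase shell; real HBase also versions every cell by timestamp, which is omitted here):

```python
# Toy model of an HBase table: a map of
#   row key -> {column family -> {qualifier -> value}}
table = {}

def put(row, family, qualifier, value):
    # Insert or overwrite a single cell.
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    # Fetch one cell, or None if the row/column is absent (tables are sparse).
    return table.get(row, {}).get(family, {}).get(qualifier)

put("user1", "info", "name", "Alice")
put("user1", "info", "email", "alice@example.com")
put("user2", "info", "name", "Bob")  # rows may carry different columns

get("user1", "info", "name")  # "Alice"
```

The sparseness matters: unlike a relational table, a row stores only the columns it actually has, so two rows in the same table can have completely different column sets.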
When to use HBase:
- If your application has a variable schema, where each row is slightly different.
- If your data is stored in collections that are all keyed on the same value.
- If you need random, real-time read/write access to your Big Data.
- If you need key-based access to data when storing or retrieving.
- If you have a huge amount of data and an existing Hadoop cluster.
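Because HBase keeps rows physically sorted by row key, the key-based access described above extends naturally to range scans over contiguous keys. A toy illustration in plain Python (the row keys and data are hypothetical; a real scan would go through the HBase client's Scan operation):

```python
from bisect import bisect_left

# Rows kept sorted by row key, as HBase stores them.
rows = {
    "user001": {"name": "Alice"},
    "user002": {"name": "Bob"},
    "user007": {"name": "Carol"},
    "user042": {"name": "Dave"},
}

def scan(start, stop):
    # Return all rows with start <= key < stop,
    # analogous to a start-row/stop-row scan in HBase.
    keys = sorted(rows)
    i = bisect_left(keys, start)
    result = []
    while i < len(keys) and keys[i] < stop:
        result.append((keys[i], rows[keys[i]]))
        i += 1
    return result

scan("user001", "user010")  # matches user001, user002, user007
```

This is why row-key design matters so much in HBase: keys that sort together are stored together, so related data can be read with one cheap sequential scan.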
However, HBase has some limitations:
- It can't be used for classic transactional applications or even relational analytics.
- It is not a complete substitute for HDFS when doing large batch MapReduce jobs.
- It doesn't speak SQL, have a query optimizer, or support cross-record transactions.
- It can't handle complicated access patterns (such as joins).