Apache HBase is an open source, distributed key-value database built on top of HDFS and modeled after Google's Bigtable. It is a good fit for applications that need fast random access to very large amounts of data (billions of rows by millions of columns). Key features:

  • Linear and modular scalability
  • Strongly consistent reads and writes, because each region is served by exactly one RegionServer at a time (ZooKeeper handles cluster coordination)
  • Automatic and configurable sharding of tables
  • Automatic failover support between RegionServers

Refer to HFile to see how HBase supports random read/write.

HBase Data Model Terminology

  • Table - consists of multiple rows
  • Row - consists of a row key and one or more columns, grouped into column families, with values associated with them. Rows are stored sorted lexicographically (by byte order) by row key
  • Column - consists of a column family and a column qualifier, which are delimited by a : (colon) character
  • Column Family - groups a set of columns and their values. Column families are defined at table creation time. Each column family has its own storage properties, such as data compression, row key encoding, and caching
  • Column Qualifier - identifies a specific piece of data within a column family. Qualifiers are not fixed at table creation time and can differ from row to row
  • Cell - is a unique combination of row, column family, and column qualifier. It contains a value and a timestamp.
  • Timestamp - Values written in a cell are versioned with the timestamp. A timestamp is written alongside each value, and is the identifier for a given version of a value. By default, the timestamp represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell.
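
To make the terms above concrete, here is a minimal sketch of the data model as nested maps. It is a toy in-memory model written in plain Python, not the HBase client API; the `Table` class and its `put`, `get`, and `scan` methods are hypothetical names chosen for illustration.

```python
import time


class Table:
    """Toy model: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        self.rows = {}

    def put(self, row, column, value, ts=None):
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if ts is None:
            # Default timestamp: write time in milliseconds.
            ts = int(time.time() * 1000)
        self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get(self, row, column, ts=None):
        versions = self.rows[row][column]
        if ts is None:
            ts = max(versions)  # newest version by default
        return versions[ts]

    def scan(self):
        # Row keys come back ordered by their raw bytes.
        return sorted(self.rows)


t = Table(families=["info"])
t.put(b"row10", "info:name", b"alice", ts=1)
t.put(b"row10", "info:name", b"alicia", ts=2)  # second version of the same cell
t.put(b"row2", "info:name", b"bob", ts=1)

print(t.get(b"row10", "info:name"))        # b'alicia' (latest version)
print(t.get(b"row10", "info:name", ts=1))  # b'alice'  (explicit version)
print(t.scan())                            # [b'row10', b'row2']
```

Note that `row10` sorts before `row2`: the ordering is by bytes, not by numeric value, which matters when designing row keys.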

HBase Storage Internals

  • Regions - A table is split into regions by row key range, following the lexicographical order of the row keys. All rows whose keys fall between a region's start key and end key are stored in that region
  • RegionServers - Regions are served by RegionServers. Regions are distributed across the cluster, and each RegionServer process handles client requests for its regions independently
  • Store - A Store corresponds to a column family for a table for a given region. A Store hosts a MemStore and 0 or more StoreFiles (HFiles). The MemStore holds in-memory modifications to the Store. When the MemStore reaches a given size (hbase.hregion.memstore.flush.size), it flushes its contents to a StoreFile. StoreFiles are where your data lives
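
The two ideas above, locating a region by row key and flushing a MemStore once it reaches a size threshold, can be sketched roughly as follows. This is a simplified model in plain Python, not HBase code: the region boundaries in `START_KEYS`, the `locate_region` function, and the `Store` class are assumptions made for illustration, and the size accounting is deliberately naive.

```python
import bisect

# Hypothetical region boundaries: region i covers [START_KEYS[i], START_KEYS[i + 1]).
# The first region starts at the empty key; the last one has no upper bound.
START_KEYS = [b"", b"g", b"p"]


def locate_region(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(START_KEYS, row_key) - 1


class Store:
    """Toy Store: one MemStore plus zero or more immutable, sorted StoreFiles."""

    def __init__(self, flush_size):
        # Stands in for hbase.hregion.memstore.flush.size.
        self.flush_size = flush_size
        self.memstore = {}
        self.memstore_bytes = 0
        self.store_files = []

    def put(self, row_key, value):
        self.memstore[row_key] = value
        # Naive size estimate; overwrites are counted again.
        self.memstore_bytes += len(row_key) + len(value)
        if self.memstore_bytes >= self.flush_size:
            self.flush()

    def flush(self):
        # Write the MemStore out as a sorted file and start a fresh MemStore.
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.memstore_bytes = 0

    def get(self, row_key):
        if row_key in self.memstore:
            return self.memstore[row_key]
        for store_file in reversed(self.store_files):  # newest file first
            for key, value in store_file:
                if key == row_key:
                    return value
        return None
```

With these boundaries, `locate_region(b"apple")` returns 0 and `locate_region(b"zebra")` returns 2. A `Store(flush_size=10)` that receives two 5-byte puts flushes on the second one, after which reads are answered from the StoreFile instead of the MemStore.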

[Figure: HBase architecture]

Think about

  • Row key structure
  • Read and write access patterns, because tables are sorted by row key
  • How many column families are required, and what will you store in each one?
  • Which column qualifiers will you need? Qualifiers are not required at table creation time, but it's good to think about them beforehand
  • How many columns are required?
  • What will you store in each cell?
  • Do you need versioning? If yes, roughly how many versions should you store?
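
As an illustration of how row key structure follows from access patterns, here is one possible sketch: a composite key made of a user id and a reversed timestamp, so that the newest events for a user sort first. The scenario, the `make_row_key` helper, and the `#` separator are all assumptions for this example (it also assumes user ids never contain `#`); other workloads will call for different key layouts.

```python
import struct

MAX_TS = 2**63 - 1  # largest signed 64-bit value


def make_row_key(user_id: str, ts_millis: int) -> bytes:
    """Build '<user_id>#<reversed timestamp>' so newer events sort first per user."""
    reversed_ts = MAX_TS - ts_millis
    # Fixed-width big-endian encoding keeps byte order equal to numeric order.
    return user_id.encode("utf-8") + b"#" + struct.pack(">Q", reversed_ts)


older = make_row_key("u1", 1000)
newer = make_row_key("u1", 2000)
print(newer < older)  # True: the newer event sorts ahead of the older one
```

Because rows are sorted by key bytes, a scan that starts at the prefix `u1#` would return that user's most recent events first under this layout.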