HBase Optimization Guide

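The earlier article "HBase 2.x CRUD, Scala edition" introduced the CRUD API for HBase 1.2.x, but knowing the API is not enough: each read or write workload needs its own tuning before it can meet business requirements. This article first explains HBase's caching mechanism, then covers tuning on the server side and on the client side.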

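HBase consists of a master and regionservers. The master manages the regionservers and balances load across them; each regionserver manages the regions on its node and serves client read and write requests.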

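A regionserver holds n memstores and one blockcache: the memstore is the write cache and the blockcache is the read cache. The regionserver allocates a memstore to every region (more precisely, one per column family of each region). A write is first appended to the log (when the WAL is enabled) and then written to the memstore. When a memstore reaches its threshold (hbase.hregion.memstore.flush.size), a flush spills it to a storefile. When the regionserver-wide threshold (heapsize * hbase.regionserver.global.memstore.upperLimit) is reached, flushes are forced, starting with the largest memstore. When the number of storefiles in a region exceeds its threshold (hbase.hstore.compaction.min), a compaction merges small storefiles into larger ones. As a region keeps growing and reaches hbase.hregion.max.filesize, it splits automatically.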
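The two flush triggers described above can be modeled in a few lines. This is only a toy model of the decision, not HBase code: `FlushModel`, `decide` and the `Decision` types are hypothetical names, and memstore sizes are passed in as plain byte counts.

```scala
object FlushModel {
  sealed trait Decision
  case object NoFlush extends Decision
  // One memstore reached hbase.hregion.memstore.flush.size.
  final case class FlushRegion(index: Int) extends Decision
  // The regionserver-wide limit was reached: flush the largest memstore first.
  final case class ForcedFlush(index: Int) extends Decision

  def decide(memstores: Seq[Long], flushSize: Long, heapBytes: Long, upperLimit: Double): Decision = {
    // Global trigger: heapsize * hbase.regionserver.global.memstore.upperLimit
    val globalLimit = (heapBytes * upperLimit).toLong
    if (memstores.nonEmpty && memstores.sum >= globalLimit)
      ForcedFlush(memstores.indexOf(memstores.max))
    else
      memstores.indexWhere(_ >= flushSize) match {
        case -1 => NoFlush
        case i  => FlushRegion(i)
      }
  }
}
```

The global check comes first because a forced flush takes priority over the per-memstore threshold in this model.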

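A read first looks in the memstore, then in the blockcache, and finally on disk; whatever is read from disk is placed in the blockcache. Because the blockcache is an LRU cache, eviction starts once it reaches its limit (heapsize * hfile.block.cache.size), and the least recently used batch of data is dropped.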
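The lookup order and the LRU eviction can be illustrated with a toy model built on the JDK's access-ordered `LinkedHashMap`. This is a sketch of the idea only: `LruCache` and `ReadPath` are hypothetical names, the real blockcache stores HFile blocks rather than individual values, and it evicts in batches rather than one entry at a time.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Toy LRU map: an access-ordered LinkedHashMap that drops the eldest entry when full.
class LruCache[K, V](capacity: Int) extends JLinkedHashMap[K, V](16, 0.75f, true) {
  override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > capacity
}

// Toy read path: memstore first, then the cache, then "disk".
class ReadPath(memstore: Map[String, String], disk: Map[String, String], cacheCapacity: Int) {
  val blockCache = new LruCache[String, String](cacheCapacity)

  def get(key: String): Option[String] =
    memstore.get(key)
      .orElse(Option(blockCache.get(key)))
      .orElse {
        val fromDisk = disk.get(key)
        fromDisk.foreach(v => blockCache.put(key, v)) // cache what was read from disk
        fromDisk
      }
}
```

With a capacity of 2, reading keys a, b, a, c from disk leaves a and c cached: b is the least recently used entry when c arrives.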

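hbase.hstore.blockingStoreFiles: when the number of storefiles in an HStore exceeds this value, a memstore flush is blocked until a split or compaction has run, unless the time configured in hbase.hstore.blockingWaitTime elapses first. The default cited here is 7 (it varies by release) and can be raised.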

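hbase.regionserver.handler.count: the number of threads that handle RPCs. The default cited here is 10; depending on concurrency it can be set to a value in the 80 to 120 range.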

hfile.block.cache.size: the fraction of the regionserver heap given to the block cache, 0.25 by default. For read-heavy workloads it can be raised. Note that hbase.regionserver.global.memstore.upperLimit and hfile.block.cache.size together must stay below 0.8.
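The 0.8 constraint is easy to check before changing either setting. The sketch below is a hypothetical helper, not part of HBase; it applies the rule as stated above and converts the block cache fraction into bytes for a given heap size.

```scala
object HeapBudget {
  // Upper bound on memstore fraction + block cache fraction, as stated in the text.
  val MaxCombinedFraction = 0.8

  def isValid(memstoreUpperLimit: Double, blockCacheSize: Double): Boolean =
    memstoreUpperLimit + blockCacheSize < MaxCombinedFraction

  // Block cache limit in bytes: heapsize * hfile.block.cache.size
  def blockCacheBytes(heapBytes: Long, blockCacheSize: Double): Long =
    (heapBytes * blockCacheSize).toLong
}
```

For example, a memstore limit of 0.4 with a block cache of 0.25 passes the check, while 0.5 with 0.4 does not.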

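2.1.1 Pre-splitting regions

When a region's data reaches a certain size, HBase starts splitting it automatically. The region goes offline briefly during the split, which can cause requests to time out. Creating a suitable number of regions in advance avoids these automatic splits.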

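Once the regions have been created in advance, rowkeys should be dispersed so that data is spread as evenly as possible across the regions. HBase's built-in split algorithms (such as HexStringSplit) can create the regions. In the hbase shell, run the following, replacing <family> with the column family name and <n> with the number of regions:

create 'test_lee', {NAME => '<family>'}, {NUMREGIONS => <n>, SPLITALGO => 'HexStringSplit'}, OWNER => 'lee'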
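To see what evenly spaced split points look like for rowkeys that start with an 8-character hex prefix, here is a small sketch. It is not the implementation of HBase's built-in algorithm, and `PreSplit.hexSplitKeys` is a hypothetical helper; it only divides the range 00000000 to ffffffff into equal parts.

```scala
object PreSplit {
  // Returns numRegions - 1 split points, evenly spaced over the 8-digit hex key space.
  def hexSplitKeys(numRegions: Int): Seq[String] = {
    require(numRegions >= 2, "need at least two regions to have a split point")
    val space = 1L << 32 // number of distinct 8-digit hex prefixes
    (1 until numRegions).map(i => f"${space * i / numRegions}%08x")
  }
}
```

Converted to bytes, keys like these could be passed as the split keys when creating a table through the Java client instead of the shell.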

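2.1.2 Rowkey length

Shorter is better; keep it under 16 bytes. In HBase, a value is located by its row key, its column (column family:column) and its timestamp. HBase's index exists to speed up random access, and it is built on "row key + column family:column + timestamp + value". If the row key and column family are large, possibly larger than the value itself, the index grows accordingly. Because HBase tables usually hold a very large number of records, repeated row keys and column names not only inflate the index but also add load on the system.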
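A rough calculation makes the cost of long keys concrete. The sketch assumes the classic KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type) and ignores tags, block encoding and compression; `CellSize` and its methods are hypothetical names.

```scala
object CellSize {
  // Fixed bytes per cell in the assumed layout:
  // key length (4) + value length (4) + row length (2) + family length (1) + timestamp (8) + type (1)
  val FixedOverhead = 20

  def serializedBytes(rowLen: Int, familyLen: Int, qualifierLen: Int, valueLen: Int): Int =
    FixedOverhead + rowLen + familyLen + qualifierLen + valueLen

  // Share of the cell that is not the value itself.
  def overheadRatio(rowLen: Int, familyLen: Int, qualifierLen: Int, valueLen: Int): Double = {
    val total = serializedBytes(rowLen, familyLen, qualifierLen, valueLen)
    (total - valueLen).toDouble / total
  }
}
```

Under these assumptions, a 4-byte value stored with a 16-byte row key and one-byte family and qualifier names occupies 42 bytes, of which 38 are key and bookkeeping, and that cost is paid again for every cell in the row.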

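2.1.3 Dispersing rowkeys

// Hash the whole rowkey with MD5
val rowkeyHash = org.apache.hadoop.hbase.util.MD5Hash.getMD5AsHex(Bytes.toBytes(rowkey))
// or prefix the original rowkey with the first 8 hex characters of the hash
val rowkeyHash1 = rowkeyHash.substring(0, 8) + rowkey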
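The same prefixing scheme can be tried without an HBase dependency by using the JDK's `MessageDigest`. This is a sketch for experimentation: `RowkeySalt`, `md5Hex` and `salted` are hypothetical names, and the rowkey is assumed to be UTF-8 text.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object RowkeySalt {
  // Lowercase hex MD5 of a string (32 characters).
  def md5Hex(s: String): String =
    MessageDigest.getInstance("MD5")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Prefix the rowkey with the first prefixLen hex characters of its MD5.
  def salted(rowkey: String, prefixLen: Int = 8): String =
    md5Hex(rowkey).take(prefixLen) + rowkey
}
```

Keeping the original rowkey after the prefix means a row can still be found by recomputing the hash, but rows that were adjacent before hashing are no longer adjacent, so range scans over the original key order stop being efficient.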

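2.2.1 Setting the maximum number of versions

A large maximum number of versions is not recommended: it makes storefiles bigger and uses more resources. In the command below, replace <family> with the column family name and <n> with the number of versions to keep.

create 'test_lee', {NAME => '<family>', VERSIONS => <n>}, OWNER => 'lee'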

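2.2.2 Setting the time to live (TTL)

With a time to live set, expired data is removed automatically. In the command below, replace <family> with the column family name and <seconds> with the TTL in seconds.

create 'test_lee', {NAME => '<family>', TTL => <seconds>}, OWNER => 'lee'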

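2.2.3 Setting the compression format

With the HBase API, it looks like this:

def createTable(tableName: String, family: Array[String], liveTime: Int): Unit = {
  val name = TableName.valueOf(tableName)
  if (!admin.tableExists(name)) {
    // Build the descriptor here: one column family per entry in `family`, with liveTime as its TTL and its compression type set (HColumnDescriptor.setTimeToLive / setCompressionType in the 1.x API), then call admin.createTable.
  }
}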

Autoflush is enabled by default on the client. It can be turned off so that writes are submitted in batches.

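API before HBase 1.x:

htable.setAutoFlushTo(false)

API from HBase 1.x onward:

// Set the client-side write buffer size; the buffer is flushed when this threshold is reached
val params = new BufferedMutatorParams(name).writeBufferSize(4 * 1024 * 1024)
val mutator = connection.getBufferedMutator(params)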

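By default, data is written to the write-ahead log (WAL) first. The WAL contains every edit that has been written to a memstore but not yet flushed to an HFile. Data in the memstore is not yet persistent, so when a regionserver goes down the WAL is used to recover it.

The WAL can be disabled on the server by setting hbase.regionserver.hlog.enabled to false in hbase-site.xml (this applies to all tables), but that is not recommended. It is usually controlled from the client instead.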

API before HBase 1.x:

put.setWriteToWAL(false)

API from HBase 1.x onward:

put.setDurability(Durability.SKIP_WAL)

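This can be changed on the client as well as on the server, and HBase 1.x and HBase 2.x do not differ here, as shown below.

For batch reads and writes, the API before HBase 1.x is: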

htable.put(list)
htable.get(list)

The API from HBase 1.x onward is:

// BufferedMutator only accepts mutations, so batched gets go through Table.get
table.get(gets)
mutator.mutate(puts)

// Keep the blocks read by this scan in the blockcache
scan.setCacheBlocks(true)
// Fetch 500 rows per RPC
scan.setCaching(500)
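The effect of the scan caching value can be estimated with simple arithmetic, since it sets how many rows come back per round trip. The helper below is a hypothetical sketch that ignores size-based limits and the calls needed to open and close the scanner.

```scala
object ScanCost {
  // Rough number of round trips needed to return totalRows rows.
  def roundTrips(totalRows: Long, caching: Int): Long = {
    require(caching > 0, "caching must be positive")
    (totalRows + caching - 1) / caching // ceiling division
  }
}
```

By this estimate, scanning 10,000 rows takes 20 round trips with a caching value of 500, compared with 10,000 when rows are fetched one at a time. Larger values trade client and server memory for fewer round trips.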
