HBase rowkey design and pre-splitting

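This project uses Hive together with HBase, and the Hive tables need to be loaded into HBase. Every Hive table has already been tagged and contains 9 fields. Per the business requirements, the routermac field (type string) of the Hive tables is used as the rowkey of the HBase table. Hive has one table per day, while HBase has one table per month. The first step is pre-splitting: the cluster has 20 RegionServers, so the table is created with 40 regions.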

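(1) Collect the distinct routermac values per month from the Hive tables

After deduplication there are roughly 4 million to over 7 million routermac values per month.

First, deduplicate routermac in each daily Hive table, insert the results into a single table, and deduplicate once more to produce the mac_09 table, which holds all distinct routermac values for September. Then sort them into a new table with an added sequence-number column named rank, as follows: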

create table hbase.temp as select row_number() over (order by routermac) rank,* from mac_09;

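A count shows that the temp table has 4 million rows in total, which keeps things simple. Because routermac is already sorted, a split point can be placed every 100,000 routermac values; the 39 split points at ranks 100000 through 3900000 yield 40 regions. When the September Hive tables are then inserted into the HBase table keyed by mac, the load across regions should stay roughly balanced.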

Extract the rowkeys used for pre-splitting as follows:

hive -e "select routermac from hbase.temp where rank=100000 or .... or rank=3900000;" >> ~/var/lib/hadoop-hdfs/hbase/split_09.txt
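The list of ranks in the query above does not have to be typed by hand. The following is a minimal sketch, assuming the 4,000,000-row count and the 40-region target stated earlier, that generates the split ranks and builds an equivalent query using `rank in (...)`. The output path `./split_09.txt` and the `OUT_FILE` variable are placeholders for illustration, not the author's path, and the query only runs if the `hive` CLI is installed.

```shell
# Sketch: derive the split ranks for 40 regions from a 4,000,000-row sorted table.
total_rows=4000000                    # row count of hbase.temp reported in the text
regions=40                            # target number of regions
step=$(( total_rows / regions ))      # 100000 rows per region

# N regions need N-1 split points, so stop one step short of total_rows.
ranks=""
split_count=0
r=$step
while [ "$r" -lt "$total_rows" ]; do
  ranks="${ranks:+$ranks,}$r"
  split_count=$(( split_count + 1 ))
  r=$(( r + step ))
done

query="select routermac from hbase.temp where rank in (${ranks}) order by rank;"
echo "split points: ${split_count}"

# OUT_FILE is a hypothetical override; adjust it to the real split file location.
out_file="${OUT_FILE:-./split_09.txt}"
if command -v hive >/dev/null 2>&1; then
  hive -e "$query" > "$out_file"
fi
```

Deriving the step from the row count means that a month with 7 million distinct routermac values only requires changing `total_rows`.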

(2) Create the HBase table and pre-split it using the split_09.txt file

The HBase master node should now show that the tags:router09 table has 40 regions.
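The creation step might look like the sketch below, which relies on the HBase shell's `SPLITS_FILE` option to read one split rowkey per line from a local file. The column family name `cf`, the script file name, and the `SPLIT_FILE` variable are assumptions for illustration, and the `tags` namespace is assumed to exist already.

```shell
# Sketch: write an HBase shell script that creates tags:router09 pre-split
# by the keys in the split file, then run it if the hbase CLI is installed.
split_file="${SPLIT_FILE:-./split_09.txt}"   # hypothetical local path, one rowkey per line
script_file="./create_router09.hbase"        # hypothetical script name

# 'cf' is an assumed column family; the source does not name one.
cat > "$script_file" <<EOF
create 'tags:router09', 'cf', SPLITS_FILE => '${split_file}'
exit
EOF

if command -v hbase >/dev/null 2>&1; then
  hbase shell "$script_file"
fi
```

With 39 keys in the split file, the table is created with 40 regions from the start, so the first bulk insert does not land on a single region.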

(3) Create an external table in Hive's hbase database and map it to the tags:router09 table in HBase
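A mapping of this kind is usually declared with Hive's `HBaseStorageHandler`. The sketch below is an assumption-laden illustration: only routermac is named in the text, so `tag_a` and `tag_b` are placeholder columns standing in for the other eight fields, and `cf` is an assumed column family.

```shell
# Sketch: generate the DDL for a Hive external table backed by tags:router09.
hql_file="./create_tags_router09.hql"   # hypothetical file name

cat > "$hql_file" <<'EOF'
-- tag_a and tag_b are placeholder columns; cf is an assumed column family.
-- The first Hive column maps to the HBase rowkey through the :key entry.
CREATE EXTERNAL TABLE IF NOT EXISTS hbase.tags_router09 (
  routermac STRING,
  tag_a STRING,
  tag_b STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:tag_a,cf:tag_b")
TBLPROPERTIES ("hbase.table.name" = "tags:router09");
EOF

# Run the DDL only where the hive CLI is available.
if command -v hive >/dev/null 2>&1; then
  hive -f "$hql_file"
fi
```

Because the table is external, dropping it in Hive leaves the underlying HBase table and its regions in place.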

(4) Insert the Hive data into the tags_router09 table

Inserting data into the tags_router09 table writes it into the tags:router09 table in HBase.
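Since Hive holds one table per day and HBase one table per month, the load amounts to one insert per daily table. The sketch below generates those statements for September 2017, assuming the daily tables follow the phitagorcYYYYMMDD naming seen in phitagorc20170923 and that all 30 exist. The columns `tag_a` and `tag_b` are placeholders for the real fields, and the file name is hypothetical.

```shell
# Sketch: generate one INSERT per daily Hive table for September 2017.
insert_hql="./insert_tags_router09.hql"   # hypothetical file name
: > "$insert_hql"                          # start from an empty file

day=1
while [ "$day" -le 30 ]; do
  # Table naming is inferred from phitagorc20170923; tag_a/tag_b are placeholders.
  table=$(printf 'phitagorc201709%02d' "$day")
  printf 'INSERT INTO TABLE hbase.tags_router09 SELECT routermac, tag_a, tag_b FROM %s;\n' \
    "$table" >> "$insert_hql"
  day=$(( day + 1 ))
done

stmt_count=$(grep -c '^INSERT' "$insert_hql")
echo "generated ${stmt_count} statements"

# Run the statements only where the hive CLI is available.
if command -v hive >/dev/null 2>&1; then
  hive -f "$insert_hql"
fi
```

Rows that share a routermac across days map to the same rowkey, so later inserts write new versions of the same HBase row rather than adding rows.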

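(5) If the insert is slow, build an index on the Hive table

HBase inserts are actually very fast, but adding where routermac<>'null' to the insert statement makes it very slow. One option is to build an index on the routermac field; the other is to clean the phitagorc20170923 table by removing the rows where routermac='null' and then insert the cleaned table into HBase.

The index creation statement is as follows: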

create index mac_index on table phitagorc20170923(routermac) as
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild in table index_phitagorc20170923;

Load the index data:

alter index mac_index on phitagorc20170923 rebuild;
