Hive, Part 19: Bucketing

Create table statement

create table user_info_bucketed(user_id bigint, firstname string, lastname string)
comment 'a bucketed copy of user_info'
partitioned by(ds string)
clustered by(user_id) into 256 buckets;

The table above is bucketed on the user_id column. The following shows how to load data into a bucketed table.

Inserting data

set hive.enforce.bucketing = true;
from user_id
insert overwrite table user_info_bucketed
partition (ds='2009-02-25')
select userid, firstname, lastname where ds='2009-02-25';

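With hive.enforce.bucketing = true, the insert automatically sets the number of reducers (256) from the table's bucket count (256) and its bucketing column (user_id). Otherwise, you have to write the bucketed load by hand, as in the following statement: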

-- one reducer per bucket, set by hand
set mapred.reduce.tasks = 256;
from user_id
insert overwrite table user_info_bucketed
partition (ds='2009-02-25')
select userid, firstname, lastname where ds='2009-02-25' cluster by userid;
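Either way, the load sends each row to the bucket chosen by hashing the bucketing column modulo the bucket count. The Python sketch below models that assignment. It rests on a simplifying assumption: the hash of an integer key is taken to be the key itself, which may not match Hive's real hash function for large bigint values. The names bucket_for and distribute are hypothetical helpers for illustration only, not part of Hive.

```python
from collections import defaultdict

NUM_BUCKETS = 256  # from "into 256 buckets" in the create statement


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Return the 0-based bucket index for a user_id.

    Assumption: the hash of an integer key is the key itself;
    Hive's actual hash for bigint may differ for large values.
    """
    return (user_id & 0x7FFFFFFF) % num_buckets


def distribute(rows, num_buckets=NUM_BUCKETS):
    """Group rows (user_id, firstname, lastname) by bucket index."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[0], num_buckets)].append(row)
    return buckets


rows = [(1, "a", "b"), (257, "c", "d"), (2, "e", "f")]
buckets = distribute(rows)
# user_id 1 and 257 share a bucket because 257 % 256 == 1
```

Note that this sketch numbers buckets from 0, while the x in tablesample(bucket x out of y) counts from 1.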

Bucketed and sorted table

create table page_view(viewtime int, userid bigint,
    page_url string, referrer_url string,
    ip string comment 'ip address of the user')
comment 'this is the page view table'
partitioned by(dt string, country string)
clustered by(userid) sorted by(viewtime) into 32 buckets
row format delimited
    fields terminated by '\001'
    collection items terminated by '\002'
    map keys terminated by '\003'
stored as sequencefile;

Sampling a bucketed table

table_sample: tablesample (bucket x out of y [on colname])

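Notes:

The tablesample clause goes after the table in a select … from … statement.

The bucket number x is counted from 1.

colname is the column on which the table is sampled. It can be any column of the original table, or rand(). rand() samples on the entire row rather than on a single column.

Note: tablesample is the sampling clause. Syntax: tablesample(bucket x out of y)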

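y must be a multiple or a factor of the table's total number of buckets. Hive uses y to determine the sampling ratio. For example, if the table has 64 buckets, y=32 samples (64/32=) 2 buckets of data, and y=128 samples (64/128=) 1/2 of a bucket. x is the bucket at which sampling starts. For example, with 32 buckets in total, tablesample(bucket 3 out of 16) samples (32/16=) 2 buckets of data: bucket 3 and bucket (3+16=) 19.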
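The arithmetic above can be checked with a short Python sketch. It is only a model of the rule as stated in this section, not Hive's implementation. The function name sample_plan is hypothetical, and the handling of y larger than the bucket count (a fraction of one physical bucket) is an assumption that follows from the 64/128 example.

```python
from fractions import Fraction


def sample_plan(total_buckets, x, y):
    """Model tablesample(bucket x out of y) on a table with total_buckets buckets.

    Returns (amount, buckets): amount is how many buckets' worth of data
    is sampled, buckets is the list of 1-based bucket numbers touched.
    """
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if total_buckets % y != 0 and y % total_buckets != 0:
        raise ValueError("y must be a multiple or a factor of the bucket count")
    amount = Fraction(total_buckets, y)
    if y <= total_buckets:
        # Whole buckets: x, x + y, x + 2y, ...
        buckets = list(range(x, total_buckets + 1, y))
    else:
        # Less than one bucket: a fraction of a single physical bucket
        buckets = [(x - 1) % total_buckets + 1]
    return amount, buckets


print(sample_plan(32, 3, 16))   # 2 buckets: numbers 3 and 19
print(sample_plan(64, 1, 128))  # half of one bucket
```

The first call reproduces the example from the text: 32 buckets sampled with bucket 3 out of 16 touches buckets 3 and 19.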

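Create table statement / Query / Effect

Create table statement: create table () clustered by (user_id) into 32 buckets

Query: select * from .. tablesample(bucket 3 out of 16 on user_id)

Query: select * from .. tablesample(bucket 3 out of 64 on user_id)

Query: select * from .. tablesample(bucket 3 out of 32 on user_id)
Effect: same as above

Query: select * from .. tablesample(bucket 3 out of 32 on user_name)

Query: select * from .. tablesample(bucket 3 out of 32 on rand())
Effect: the sample column in the query does not match the clustered by column from the create statement, so Hive has to scan the whole table and then pick out the sampled rows. With rand(), sampling is based on the entire record.

Create table statement: create table ()

Query: select * from .. tablesample(bucket 3 out of 32 on id)
Effect: the table was not bucketed when it was created, so the query requires a full table scan.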
