Hive data sampling methods

Block sampling

Hive has built-in sampling through the tablesample clause, which can draw a given number of rows, a percentage, or a byte size. For example:

-- Three alternatives; run only one, since each creates the same table iteblog.
create table iteblog as select * from iteblog1 tablesample(1000 rows);
create table iteblog as select * from iteblog1 tablesample(20 percent);
create table iteblog as select * from iteblog1 tablesample(1m);

block_sample: tablesample (n percent)

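This clause takes at least n% of the data as query input, measured by data size and not by row count. It works only with CombineHiveInputFormat and cannot handle some special compression formats. If sampling fails, the input of the MapReduce job is the whole table or partition. Because sampling happens at the HDFS block level, the sampling granularity is the block size. For example, if the block size is 256 MB, you still get 256 MB of data even when n% of the input is only 100 MB.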

select * from source tablesample(0.1 percent) s;
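
The block-granularity effect described above can be sketched in a few lines of Python. This is a simplified model and not Hive's actual implementation: it assumes the sample is rounded up to whole blocks, and the function name block_sample_bytes is made up for illustration.

```python
import math

MB = 1024 * 1024

def block_sample_bytes(table_bytes, percent, block_bytes):
    """Approximate bytes read by tablesample(n percent) in a whole-block model.

    Assumption: the requested size is rounded up to a whole number of blocks,
    and the result can never exceed the table size.
    """
    target = table_bytes * percent / 100.0
    blocks = max(1, math.ceil(target / block_bytes))
    return min(table_bytes, blocks * block_bytes)

# A 1000 MB table sampled at 10 percent asks for 100 MB,
# but a full 256 MB block is read.
print(block_sample_bytes(1000 * MB, 10, 256 * MB) // MB)  # prints 256
```

Under this model the sample size only changes in steps of one block, which is why small percentages on large blocks return more data than requested.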

To sample the same amount of data from different blocks, change the following parameter (it takes an integer seed):

set hive.sample.seednumber=;

Alternatively, you can specify the total length to read. This has the same limitations as percent sampling (available as of Hive 0.10.0):

block_sample: tablesample (bytelengthliteral)

bytelengthliteral : (digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')

In the following example, 100 MB or more of input data is used for the query:

select * from source tablesample(100m) s;

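Hive also supports limiting the input by row count, but this behaves differently from the two forms above. First, it does not require CombineHiveInputFormat, so it can be used on non-native tables. Second, the row count given by the user is applied to each InputSplit, so the total number of rows also depends on the number of InputSplits (a different number of splits yields a different total). Available as of Hive 0.10.0: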

block_sample: tablesample (n rows)

For example, the following query takes the first 10 rows from each InputSplit:

select * from source tablesample(10 rows);

So with 20 InputSplits, the query returns 200 records.
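
The per-split arithmetic can be checked with a short Python sketch. The helper name rows_returned and the split sizes are hypothetical; the only assumption beyond the text is that a split holding fewer than n rows contributes all of its rows.

```python
def rows_returned(split_row_counts, n):
    """Total rows from tablesample(n rows): up to n rows from every split."""
    return sum(min(n, rows) for rows in split_row_counts)

# 20 splits of 1000 rows each, 10 rows taken from each split.
print(rows_returned([1000] * 20, 10))  # prints 200
```

The same query on the same data can therefore return a different number of rows whenever the number of splits changes.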

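Drawback: not random. This method returns data in file order. On a partitioned table it reads from the beginning, so the sample may contain data from only the first few partitions.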

Advantage: fast.

Sampling bucketized tables

This approach distributes rows randomly into several buckets and then extracts one chosen bucket.

Syntax:

table_sample: tablesample (bucket x out of y [on colname])

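The tablesample clause lets users write queries against a sample of the data instead of the whole table. It can be added to any table in the from clause. Buckets are numbered starting from 1. colname names the column on which each row of the table is sampled. It can be any non-partition column of the table, or rand() can be used to sample on the entire row instead of a single column. The rows of the table are bucketed on colname into y buckets numbered 1 through y, and the rows belonging to bucket x are returned. The following example returns the rows in the 3rd of 32 buckets; s is the table alias: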

select * from source tablesample(bucket 3 out of 32 on rand()) s;
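
As a rough illustration of what bucketing on rand() achieves, the Python sketch below assigns every row to one of y buckets at random and keeps bucket x. It is a model of the idea, not of Hive's internals; sample_bucket and the fixed seed are assumptions made so the example is repeatable.

```python
import random

def sample_bucket(rows, x, y, seed=0):
    """Keep the rows that a random assignment places in bucket x of 1..y."""
    rng = random.Random(seed)
    # randrange(y) yields 0..y-1, so add 1 to match Hive's 1-based buckets.
    return [row for row in rows if rng.randrange(y) + 1 == x]

rows = list(range(100_000))
sample = sample_bucket(rows, x=3, y=32)
# The fraction kept should be close to 1/32.
print(len(sample) / len(rows))
```

Each row has the same 1/y chance of landing in the chosen bucket, so the sample size is approximately, not exactly, 1/y of the table.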

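Ordinarily, tablesample scans the entire table and then extracts the sample, which is not efficient. Instead, the table can be created with a clustered by clause, which declares that the table is hash-partitioned (clustered) on a set of its columns. If the columns named in the tablesample clause match the columns in the clustered by clause, tablesample scans only the required hash partitions of the table.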

So, continuing the example above, suppose table source was created with clustered by id into 32 buckets (the data is split into 32 buckets by id). Then:

tablesample(bucket 3 out of 16 on id)

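returns the 3rd and 19th clusters, because each bucket is made up of (32/16) = 2 clusters (the table was created with 32 buckets, which correspond to 32 clusters). Why 3 and 19? The query asks for the 3rd bucket, and each bucket combines 2 of the original clusters. Since 3 % 16 = 3 and 19 % 16 = 3, the 3rd bucket consists of the original 3rd and 19th clusters.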

Another example:

tablesample(bucket 3 out of 64 on id)

returns half of the 3rd cluster, because each bucket is made up of (32/64) = 1/2 of a cluster.
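
Both examples follow the same modular arithmetic, sketched below in Python. The function clusters_scanned is hypothetical and covers only the two cases discussed here: y divides the cluster count, or y is a multiple of it. It compares (c - 1) % y with x - 1 so that the last bucket (x = y) also works with 1-based numbering.

```python
from fractions import Fraction

def clusters_scanned(x, y, num_clusters):
    """Return (cluster numbers, fraction of each cluster) for bucket x out of y."""
    if num_clusters % y == 0:
        # Each sample bucket is made of num_clusters / y whole clusters.
        clusters = [c for c in range(1, num_clusters + 1)
                    if (c - 1) % y == x - 1]
        return clusters, Fraction(1)
    if y % num_clusters == 0:
        # Each sample bucket is a fraction num_clusters / y of one cluster.
        cluster = (x - 1) % num_clusters + 1
        return [cluster], Fraction(num_clusters, y)
    raise ValueError("this sketch only handles y dividing, or divisible by, "
                     "the cluster count")

print(clusters_scanned(3, 16, 32))  # clusters 3 and 19, read in full
print(clusters_scanned(3, 64, 32))  # cluster 3, half of it
```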

One more example: distribute rows randomly into 10 buckets and extract the first bucket.

create table iteblog as select * from iteblog1 tablesample (bucket 1 out of 10 on rand());

Advantages: random. In testing, it was also faster than the third approach, random sampling with rand().

Random sampling

Principle: sample with the rand() function, which returns a double between 0 and 1.

Method 1

create table iteblog as
select * from iteblog1
order by rand()
limit 10000;

This gives a truly random sample, but it requires a total sort in a single reducer, so it is slow.
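
The reason order by rand() limit k is a uniform sample can be seen in a small Python sketch: give every row a random key, sort all rows by that key, and keep the first k. The function name and the seed parameter are assumptions for illustration.

```python
import random

def order_by_rand_limit(rows, k, seed=None):
    """Mimic 'order by rand() limit k': global sort on a random key, take k."""
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]
    # The global sort is the expensive step that Hive runs in one reducer.
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed[:k]]

picked = order_by_rand_limit(list(range(1000)), 10, seed=1)
print(len(picked))  # prints 10
```

Every row is equally likely to receive one of the k smallest keys, which is what makes the result unbiased, and the full sort over all rows is what makes it slow.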

Method 2

create table iteblog as
select * from iteblog1
sort by rand()
limit 10000;

Hive provides sort by, which sorts rows only within each reducer and does not guarantee a global order, so the statement above does not guarantee a random sample.

Method 3

create table iteblog as
select * from iteblog1
where rand()<0.002
distribute by rand()
sort by rand()
limit 10000;

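The where condition first filters on the map side, which reduces the amount of data the reducers must process and speeds up the query. distribute by then spreads the rows randomly across reducers, each reducer sorts its rows randomly, and finally 10000 rows are taken. If too few rows survive the filter, raise the rand() threshold in the where condition.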

Drawback: slow.
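
Whether the where rand()<0.002 filter leaves enough rows for limit 10000 depends on the table size, which a quick Python calculation makes concrete. Both helpers are hypothetical, and the safety factor of 2 is an assumption of this sketch, not a Hive setting.

```python
def expected_rows_after_filter(total_rows, threshold):
    """Expected number of rows that pass 'where rand() < threshold'."""
    return total_rows * threshold

def min_threshold(total_rows, limit, safety=2.0):
    """Threshold that keeps about safety * limit rows, capped at 1.0."""
    return min(1.0, safety * limit / total_rows)

# With 10 million rows, a threshold of 0.002 keeps about 20000 rows,
# comfortably more than the 10000 requested by the limit.
print(expected_rows_after_filter(10_000_000, 0.002))
print(min_threshold(10_000_000, 10_000))
```

For a table much smaller than this, the same threshold would keep fewer than 10000 rows, which is the case where the threshold has to be raised.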

Method 4

create table iteblog as
select * from iteblog1
where rand()<0.002
cluster by rand()
limit 10000;

cluster by combines distribute by and sort by. distribute by rand() sort by rand() randomizes twice, whereas cluster by rand() randomizes only once, so this method is faster than the previous one.

Keeping partitions in the random sample

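The methods above lose partition information. Dynamic partitioning can be combined with sampling to carry the partition column into the result set, as follows: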

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table iteblog partition(thedate) 
select * from iteblog1 tablesample (bucket 1 out of 10 on rand());
