Random Sampling in Hive (with Detailed Explanation)

select *
from tab
order by rand()
limit 1000;

select *
from (
  select e.*, cast(rand() * 100000 as int) as idx
  from e
) t
order by t.idx
limit 1000;

Table e is an ordinary table that holds data. We want to draw 1000 rows at random from e to use as a data sample.

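The rand() function returns a random number in [0, 1), so cast(rand() * 100000 as int) as idx is a random integer from 0 to 99999. Ordering by idx therefore shuffles the rows, and limit 1000 keeps the first 1000 of them.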
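To make the mechanics concrete, here is a minimal Python sketch of what the second query does: tag every row with a random integer key, sort by that key, and keep the first 1000 rows. The function name sample_by_random_key and the in-memory list standing in for table e are assumptions made for illustration; they are not part of Hive.

```python
import random

def sample_by_random_key(rows, n, key_range=100000, seed=None):
    """Mimic: order by cast(rand() * key_range as int) limit n."""
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so each key is an integer in [0, key_range - 1].
    keyed = [(int(rng.random() * key_range), row) for row in rows]
    # Sort on the key only; rows that share a key keep their input order.
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed[:n]]

rows = list(range(50000))  # hypothetical stand-in for table e
sample = sample_by_random_key(rows, 1000, seed=42)
print(len(sample))
```

Because the key has only 100000 distinct values, a large table will produce ties, and rows that share a key are then ordered arbitrarily rather than randomly. Sorting by rand() directly, as in the first query, avoids this.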

Sample n% of the data, based on the input size.

For example, with an input size of 1 GB, tablesample (50 percent) samples about 512 MB of data.

The following SQL samples 50% of the data in table1 and creates a new table, table_new:

create table table_new as
select * from table1 tablesample (50 percent);
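The 512 MB figure is plain arithmetic on the input size. The small Python sketch below reproduces it; approx_sample_bytes is a hypothetical helper, and the amount Hive actually reads is only approximately this value, since the percentage applies to data size rather than to row counts.

```python
def approx_sample_bytes(input_bytes, percent):
    """Approximate number of bytes selected by tablesample (<percent> percent)."""
    return input_bytes * percent // 100

one_gb = 1024 ** 3  # 1 GB of input, counted in bytes
sampled = approx_sample_bytes(one_gb, 50)
print(sampled // 1024 ** 2, "MB")  # prints "512 MB"
```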

Specify the size of the sample directly, in megabytes (the m suffix).

The following SQL samples 30 MB of data from table1:

create table table_new as
select * from table1 tablesample (30m);

You can also sample by row count. Note that the row count given here applies to each InputSplit, which means every map task samples n rows.

select count(1) from (select * from table1 tablesample (200 rows)) t;

If there are 5 map tasks (InputSplits) and each one samples 200 rows, 1000 rows are sampled in total.
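The per-split behaviour can be sketched in Python as follows. The splits here are hypothetical in-memory lists, and the sketch simply takes the first n rows of each split; which rows Hive picks inside a split is not modelled.

```python
def sample_rows_per_split(splits, n):
    """Take up to n rows from every split, as each map task would."""
    sampled = []
    for split in splits:
        # A split with fewer than n rows contributes all of its rows.
        sampled.extend(split[:n])
    return sampled

# Five hypothetical splits of 1000 rows each.
splits = [list(range(i * 1000, (i + 1) * 1000)) for i in range(5)]
print(len(sample_rows_per_split(splits, 200)))  # prints 1000
```

The total therefore depends on the number of splits, not only on the number in the tablesample clause.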

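A bucketed table (bucket table) in Hive places each row into one of a fixed number of buckets by hashing a column and taking the result modulo the bucket count. For example, if table1 is split into 100 buckets on id, the rule is hash(id) % 100: rows with hash(id) % 100 = 0 go into the first bucket, rows with hash(id) % 100 = 1 go into the second bucket, and so on.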
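A minimal Python sketch of the hash(id) % 100 rule is shown below. It uses zlib.crc32 as a stand-in stable hash, which is an assumption made for illustration; Hive applies its own hash function, so the actual bucket numbers will differ.

```python
import zlib

def bucket_index(value, num_buckets):
    """Return the 0-based bucket for value: hash(value) % num_buckets."""
    # zlib.crc32 is a stand-in stable hash, not Hive's actual hash function.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

buckets = {}
for row_id in range(10000):  # hypothetical id values
    buckets.setdefault(bucket_index(row_id, 100), []).append(row_id)

print(len(buckets))  # number of non-empty buckets
```

Index 0 here corresponds to the first bucket in the text, index 1 to the second, and so on.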

Syntax for sampling by bucket:

table_sample: tablesample (bucket x out of y [on colname])

Here x is the number of the bucket to sample (bucket numbers start at 1), colname is the column to sample on, and y is the number of buckets.

select count(1)
from table1 tablesample (bucket 1 out of 10 on rand());

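This statement randomly splits table1 into 10 buckets and samples the first one, so the result is roughly one tenth of the original table.

Note: the result is different on every run, because rows are assigned to buckets by a random number.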
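The effect of bucketing on rand() can be simulated in Python as a rough sketch: each row is assigned to a random bucket from 1 to y, and only bucket x is kept. The function name and the seed parameter are assumptions for illustration; the seeds only make the two runs reproducible here.

```python
import random

def sample_random_bucket(rows, x, y, seed=None):
    """Mimic tablesample (bucket x out of y on rand()); x is 1-based."""
    rng = random.Random(seed)
    # Each row lands in a random bucket 1..y; keep only bucket x.
    return [row for row in rows if rng.randrange(y) + 1 == x]

rows = list(range(100000))  # hypothetical stand-in for table1
first_run = sample_random_bucket(rows, 1, 10, seed=1)
second_run = sample_random_bucket(rows, 1, 10, seed=2)
print(len(first_run), len(second_run))
```

Both runs return close to 10000 rows, but not the same rows, which matches the note above.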

Sampling from a table that is already bucketed is more efficient.

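Run the following SQL to create a bucketed table (a bucketed table is declared with the clustered by clause when it is created) and insert data into it: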

create table table_bucketed (id string)
clustered by(id) into 10 buckets;
insert overwrite table table_bucketed
select id from table1;

The table table_bucketed is split into 10 buckets on the id column. The following statement samples the first of those 10 buckets:

select count(1) from table_bucketed tablesample(bucket 1 out of 10 on id);

The result is roughly 1/10 of the rows in the source table.

Bucket sampling directly on the source table gives the same effect, for example:

select count(1) from table1 tablesample(bucket 1 out of 10 on id);

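The difference is that when sampling from an already bucketed table, the query scans only the data in the matching bucket, whereas sampling from an unbucketed table has to scan the whole table, bucket the rows first, and only then sample.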
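The difference in scanned data can be illustrated with a simplified Python model. The hash function (zlib.crc32), the helper names, and the in-memory lists standing in for bucket files are all assumptions for illustration, and the model only covers the case where y equals the table's bucket count.

```python
import zlib

def stable_hash(value):
    # Stand-in hash for illustration; not Hive's actual hash function.
    return zlib.crc32(str(value).encode("utf-8"))

def sample_unbucketed(rows, x, y):
    """Scan every row, then keep those hashing to bucket x (1-based)."""
    scanned = len(rows)
    kept = [r for r in rows if stable_hash(r) % y == x - 1]
    return kept, scanned

def sample_bucketed(bucket_files, x):
    """Read only the stored bucket x (1-based); other buckets are untouched."""
    kept = bucket_files[x - 1]
    return kept, len(kept)

rows = list(range(10000))  # hypothetical ids in table1
# Pre-bucket once, as the insert into table_bucketed would.
bucket_files = [[r for r in rows if stable_hash(r) % 10 == b] for b in range(10)]

kept_a, scanned_a = sample_unbucketed(rows, 1, 10)
kept_b, scanned_b = sample_bucketed(bucket_files, 1)
print(kept_a == kept_b, scanned_a, scanned_b)
```

Both paths return the same rows, but the unbucketed path touches all 10000 rows while the bucketed path touches only the rows stored in the first bucket.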

