Hive Group By, Count, and Cartesian Products

1) Parameter settings for enabling map-side aggregation

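(1) Whether to aggregate on the map side (default: true)
set hive.map.aggr = true;

(2) Number of rows over which map-side aggregation is performed
set hive.groupby.mapaggr.checkinterval = 100000;

(3) Load-balance when the data is skewed (default: false)
set hive.groupby.skewindata = true;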

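When hive.groupby.skewindata is set to true, the generated query plan contains two MR jobs. In the first MR job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits its result. Rows with the same group by key may therefore land on different reducers, which is what balances the load. The second MR job then distributes the pre-aggregated results to the reducers by group by key (this step guarantees that rows with the same group by key reach the same reducer) and completes the final aggregation.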
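To make the two-stage idea concrete, the same pattern can be written by hand as a "salted" aggregation. This is only an illustrative sketch, not what Hive executes internally when the option is on: the salt column, the choice of 10 buckets, and grouping on keyword are assumptions made for the example, and bigtable is the table created in the environment setup further down.

```sql
-- Hand-written analogue of the two-job plan (illustration only).
-- Inner query: attach a random salt (0-9, an arbitrary choice) to every row,
-- so rows sharing one hot keyword are spread over up to 10 groups.
-- Middle query: partial aggregation per (keyword, salt).
-- Outer query: final aggregation per keyword, summing the partial counts.
select keyword, sum(partial_cnt) as cnt
from (
    select keyword, salt, count(*) as partial_cnt
    from (
        select keyword, cast(floor(rand() * 10) as int) as salt
        from bigtable
    ) t1
    group by keyword, salt
) t2
group by keyword;
```

The result should match a plain select keyword, count(*) from bigtable group by keyword; only the way the work is spread across reducers differs.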

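With a small data volume this does not matter. With a large data volume, a count distinct has to be completed by a single reduce task; that one reducer has too much data to process, and the whole job struggles to finish. The usual remedy is to replace count distinct with a group by followed by a count: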

Environment setup:

hive> create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
hive> load data local inpath '/home/admin/softwares/data/100萬條大表資料（id除以10取整）/bigtable' into table bigtable;
hive> set hive.exec.reducers.bytes.per.reducer=32123456;
hive> select count(distinct id) from bigtable;

Result:
_c0
10000
Time taken: 35.49 seconds, Fetched: 1 row(s)

This can be rewritten as:

hive> set hive.exec.reducers.bytes.per.reducer=32123456;
hive> select count(id) from (select id from bigtable group by id) a;

Result:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 13.07 sec HDFS Read: 120749896 HDFS Write: 464 SUCCESS
Stage-Stage-2: Map: 3 Reduce: 1 Cumulative CPU: 5.14 sec HDFS Read: 8987 HDFS Write: 7 SUCCESS
_c0
10000
Time taken: 51.202 seconds, Fetched: 1 row(s)

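The rewrite needs one extra job, and on this small test table it is actually slower (51.2 seconds versus 35.49 seconds), because the overhead of the second job outweighs the gain. On a large data volume, where a single reducer would otherwise have to deduplicate everything, the extra job is well worth it.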
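One way to see the difference in job structure without running the queries is to compare their plans with explain. A minimal sketch, assuming the bigtable defined above; the stages printed depend on the Hive version and execution engine, so no expected output is shown here.

```sql
-- Plan for the single-reducer version.
explain select count(distinct id) from bigtable;

-- Plan for the rewrite: the group by can run on several reducers,
-- and a second stage counts the already-deduplicated ids.
explain select count(id) from (select id from bigtable group by id) a;
```

Note that the outer query uses count(id) rather than count(*). The group by produces one group for NULL ids, and count(id) skips it, which keeps the result identical to count(distinct id).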

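Avoid Cartesian products wherever possible, that is, avoid a join with no on condition or with an ineffective on condition. Hive can use only 1 reducer to compute a Cartesian product.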
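The contrast can be sketched as follows. smalltable is a hypothetical second table with an id column, introduced only for illustration. The strict-check property in the last statement is assumed to be available in your release (Hive 2.x and later; older releases used hive.mapred.mode = strict instead), so verify it against your version before relying on it.

```sql
-- Cartesian product: no join condition, so every row of bigtable
-- is paired with every row of smalltable (hypothetical table).
select b.id, s.id
from bigtable b
join smalltable s;

-- Equi-join: the on condition lets Hive partition both sides by id
-- and spread the work over several reducers.
select b.id, s.id
from bigtable b
join smalltable s on b.id = s.id;

-- Optional guard (assumed available, see note above):
-- make Hive reject queries that would produce a Cartesian product.
set hive.strict.checks.cartesian.product = true;
```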
