Problems encountered with Hive joins

I ran into a problem while joining tables:

insert overwrite table bf_evt_crd_crt_trad2
select bf_evt_crd_crt_trad.*, jjkdjk.cust_no,bf_agt_crd_crt.out_crd_instn_cd
from bf_agt_crd_crt join jjkdjk on (bf_agt_crd_crt.cust_no=jjkdjk.pcust_no) join bf_evt_crd_crt_trad on (bf_evt_crd_crt_trad.crd_no= bf_agt_crd_crt.crd_no);

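Suppose the large table in this statement has 3 billion rows while the small table has only 100, and the large table is badly skewed, with 1.5 billion rows on a single key. The job then runs extremely slowly, and the reduce phase fails with out-of-memory errors.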
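To confirm this kind of skew before changing the join, one option is to count rows per join key. This is a minimal sketch, assuming crd_no is the skewed key on the large table bf_evt_crd_crt_trad (the text does not say which key it is); the alias cnt and the limit of 10 are arbitrary.

```sql
-- Top 10 join-key values by row count; one dominant value indicates skew.
select crd_no, count(*) as cnt
from bf_evt_crd_crt_trad
group by crd_no
order by cnt desc
limit 10;
```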

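Consider how a map join works:

A map join loads the whole small table into memory and, during the map phase, matches rows of the other table directly against that in-memory table. Because the join is done in the map phase, the reduce phase is skipped and execution is much faster.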

Approach:

table                  count(*)
bf_agt_crd_crt         4031974
jjkdjk                 3912676
bf_evt_crd_crt_trad    251512826

Use a hint to force the map join, for example:

select f.a,f.b from a t join b f  on ( f.a=t.a and f.ftime=20110802)  
becomes
select /*+ mapjoin(a)*/ f.a,f.b from a t join b f on ( f.a=t.a and f.ftime=20110802)

insert overwrite table bf_evt_crd_crt_trad2
select /*+ mapjoin(bf_agt_crd_crt)*/bf_evt_crd_crt_trad.*, jjkdjk.cust_no,bf_agt_crd_crt.out_crd_instn_cd
from bf_agt_crd_crt join jjkdjk on (bf_agt_crd_crt.cust_no=jjkdjk.pcust_no) join bf_evt_crd_crt_trad on (bf_evt_crd_crt_trad.crd_no= bf_agt_crd_crt.crd_no);

But it still failed.

total mapreduce jobs = 4
2014-10-22 05:45:06 starting to launch local task to process map join; maximum memory = 1065484288
2014-10-22 05:45:42 processing rows: 200000 hashtable size: 199999 memory usage: 82761296 percentage: 0.078
2014-10-22 05:45:45 processing rows: 300000 hashtable size: 299999 memory usage: 114515648 percentage: 0.107
2014-10-22 05:45:47 processing rows: 400000 hashtable size: 399999 memory usage: 148324312 percentage: 0.139
.......
2014-10-22 05:46:37 processing rows: 2400000 hashtable size: 2399999 memory usage: 851355056 percentage: 0.799
2014-10-22 05:46:46 processing rows: 2500000 hashtable size: 2499999 memory usage: 888876848 percentage: 0.834
2014-10-22 05:46:47 processing rows: 2600000 hashtable size: 2599999 memory usage: 934695048 percentage: 0.877
2014-10-22 05:46:48 processing rows: 2700000 hashtable size: 2699999 memory usage: 973416544 percentage: 0.914
execution failed with exit status: 3
obtaining error information
task failed!
task id: stage-12
logs: /tmp/root/hive.log

failed: execution error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.mapredlocaltask
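The numbers in the log are consistent with how the local map-join task guards its memory. As far as I know, the local task aborts once the hash table uses more than hive.mapjoin.localtask.max.memory.usage of the heap, and that setting defaults to 0.90; the last percentage printed before the failure is 0.914. The snippet below is only a sketch of that arithmetic using values copied from the log. The column aliases are made up, and a select without a from clause assumes Hive 0.13 or later.

```sql
-- Print the current abort threshold for the local map-join task.
set hive.mapjoin.localtask.max.memory.usage;

-- Last reported usage divided by the maximum heap: about 0.914.
-- Bytes per hashed row at that point: about 360.
select 973416544 / 1065484288 as used_fraction,
       973416544 / 2700000    as bytes_per_row;
```

At roughly 360 bytes per row, all 4031974 rows of bf_agt_crd_crt would need about 1.45 GB, more than the 1065484288-byte maximum, so the hinted table could not have fit in the local task.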

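Analysis of the cause:

Memory overflowed when the job automatically converted the join into a map join. The fix is to turn off automatic conversion. The default was false before Hive 0.11 and is true from 0.11 onward.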

So Hive's default setting here is set hive.auto.convert.join = true;

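With this setting, Hive first loads the smaller table into memory and automatically chooses between a common join and a map join based on the SQL. As a result, the number of MapReduce jobs and the memory are sized for the small table only, and the large table cannot be processed at all.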

The default value of hive.mapjoin.smalltable.filesize is 25 MB.
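To see why a table does or does not qualify for automatic conversion, it can help to print the two settings and compare the threshold with the table's size on disk. This is a sketch, assuming table statistics are available so that describe formatted reports totalSize.

```sql
-- Print the current values of the two settings discussed above.
set hive.auto.convert.join;
set hive.mapjoin.smalltable.filesize;

-- Table Parameters in the output include totalSize (bytes) when stats exist.
describe formatted bf_agt_crd_crt;
```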

set mapreduce.map.memory.mb=2049;
set mapreduce.reduce.memory.mb=20495;
set hive.auto.convert.join=false;
insert overwrite table bf_evt_crd_crt_trad2
select bf_evt_crd_crt_trad.*, jjkdjk.cust_no,bf_agt_crd_crt.out_crd_instn_cd
from bf_agt_crd_crt join jjkdjk on (bf_agt_crd_crt.cust_no=jjkdjk.pcust_no) join bf_evt_crd_crt_trad on (bf_evt_crd_crt_trad.crd_no= bf_agt_crd_crt.crd_no);
