Hive: several Hive optimization methods

1. Use explain or explain extended to view the execution plan.

explain select * from u3;
-- execution result
------------------------------------------
stage dependencies:
  stage-0 is a root stage

stage plans:
  stage: stage-0
    fetch operator
      limit: -1
      processor tree:
        tablescan
          alias: u3
          statistics: num rows: 1 data size: 43 basic stats: complete column stats: none
          select operator
            expressions: id (type: bigint), name (type: string), *** (type: tinyint)
            outputcolumnnames: _col0, _col1, _col2
            statistics: num rows: 1 data size: 43 basic stats: complete column stats: none
            listsink

time taken: 0.457 seconds, fetched: 17 row(s)

With extended added:

explain extended select * from u3;
---------------------------
abstract syntax tree:
  tok_query
    tok_from
      tok_tabref
        tok_tabname
          u3
    tok_insert
      tok_destination
        tok_dir
          tok_tmp_file
      tok_select
        tok_selexpr
          tok_allcolref

stage dependencies:
  stage-0 is a root stage

stage plans:
  stage: stage-0
    fetch operator
      limit: -1
      processor tree:
        tablescan
          alias: u3
          statistics: num rows: 1 data size: 43 basic stats: complete column stats: none
          gatherstats: false
          select operator
            expressions: id (type: bigint), name (type: string), *** (type: tinyint)
            outputcolumnnames: _col0, _col1, _col2
            statistics: num rows: 1 data size: 43 basic stats: complete column stats: none
            listsink

time taken: 0.263 seconds, fetched: 34 row(s)

Both commands show the execution plan; the only difference is that extended also prints the statement's abstract syntax tree.

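stage:

(1) A stage roughly corresponds to one MapReduce job (it can be a subquery, a sampling step, a merge, or a limit).

(2) By default Hive executes only one stage at a time, but stages that do not depend on each other can be run in parallel.

(3) A Hive HQL statement consists of one or more stages. The more complex the dependencies between them, the more complex the job and the lower its execution efficiency tends to be.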

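2. limit optimization

-- whether the limit optimization is enabled (set to true to enable)

set hive.limit.optimize.enable=false;

-- size guaranteed for each row when sampling a subset of the data

set hive.limit.row.max.size=10000;

-- maximum number of files to sample

set hive.limit.optimize.limit.file=10;

-- maximum number of rows a fetch query may retrieve

set hive.limit.optimize.fetch.max=50000;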
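
As a minimal sketch of how these settings might be used together, the session below turns the optimization on and then runs a simple limit query against the u3 table from section 1. The values are only illustrative. Because the optimization samples a subset of the input, some rows may never be considered.

```sql
-- Enable sampling for simple LIMIT queries (illustrative values, not recommendations)
set hive.limit.optimize.enable=true;
set hive.limit.optimize.limit.file=10;
set hive.limit.optimize.fetch.max=50000;

-- A simple limit query that can be answered from a sample of the input files
select * from u3 limit 10;
```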

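3. join settings

Always let the small table drive the large table.

Mark the large table with a hint: /*+ streamtable(br) */, where br is the alias of the large table.

Enable map-side join.

In Hive versions before 2.2.0 the on clause of a join supports only equality conditions. The two fields compared in on should have the same data type wherever possible.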
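
The join advice above can be sketched as follows. The tables big_records (alias br) and small_dim (alias sd) and their columns are hypothetical and used only for illustration. hive.auto.convert.join lets Hive turn the join into a map-side join when the small table is small enough.

```sql
-- Let Hive convert the join to a map-side join when one side is small enough
set hive.auto.convert.join=true;

-- Hypothetical tables: small_dim is small, big_records is large.
-- The streamtable hint tells Hive to stream br instead of buffering it.
select /*+ streamtable(br) */ br.id, sd.name
from small_dim sd
join big_records br
  on sd.id = br.dim_id;  -- equality condition on columns of the same type
```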

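4. local mode

Hive queries still rely on Hadoop to run; for small inputs, local mode runs the job on a single machine instead of submitting it to the cluster.

-- whether to enable local mode automatically (set to true to enable)

set hive.exec.mode.local.auto=false;

-- maximum total input size in bytes for local mode (134217728 = 128 MB)

set hive.exec.mode.local.auto.inputbytes.max=134217728;

-- maximum number of input files for local mode

set hive.exec.mode.local.auto.input.files.max=4;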
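
A minimal sketch of local mode in use, assuming the small u3 table from section 1. With automatic local mode enabled, Hive may run a query locally when its input stays under both thresholds.

```sql
-- Allow Hive to choose local mode for small inputs
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128 MB
set hive.exec.mode.local.auto.input.files.max=4;

-- u3 is tiny, so this aggregation is a candidate for local execution
select name, count(*) as cnt
from u3
group by name;
```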

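5. parallel execution settings

Hive tasks that do not depend on each other can be executed in parallel.

-- whether to enable parallel execution (set to true to enable)

set hive.exec.parallel=false;

-- number of threads used for parallel execution

set hive.exec.parallel.thread.number=8;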
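
As an illustration, the two branches of the union all below do not depend on each other, so with parallel execution enabled Hive can schedule their stages at the same time. The tables orders and refunds are hypothetical.

```sql
-- Allow independent stages to run concurrently
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

-- Each branch is aggregated in its own stage; neither needs the other's result
select src, cnt
from (
  select 'orders' as src, count(*) as cnt from orders
  union all
  select 'refunds' as src, count(*) as cnt from refunds
) t;
```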

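6. JVM reuse

-- number of tasks to run per JVM (1 means no reuse, -1 means no limit)

set mapreduce.job.jvm.numtasks=1;

-- older name of the same setting: number of tasks allowed to reuse a JVM

set mapred.job.reuse.jvm.num.tasks=1;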

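7. data skew

Data skew is what happens when keys are unevenly distributed, so that most of the data ends up on a few nodes.

Causes of data skew:

The data itself is skewed.

HQL statements:

join and group by are the usual culprits.

Resolving data skew:

Find the key that causes the skew. Then either:

compute that key separately and merge the result back in with union; or

append a random number to the key so that its rows are scattered across different nodes.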
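
The random-number approach can be sketched as a two-step aggregation. This is only an illustration: the table logs, its column user_id, and the choice of 10 salt buckets are all assumptions. The inner query spreads each key over several groups, and the outer query combines the partial counts.

```sql
-- Hypothetical table: logs(user_id string, ...), with a few very hot user_id values
select user_id, sum(cnt) as cnt
from (
  -- step 1: aggregate per (key, salt), so a hot key is split into up to 10 groups
  select user_id, salt, count(*) as cnt
  from (
    select user_id, cast(floor(rand() * 10) as int) as salt
    from logs
  ) salted
  group by user_id, salt
) pre_agg
-- step 2: merge the partial counts for each key
group by user_id;
```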

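Properties to set:

See the Hive configuration documentation for an explanation of each one.

-- recommended to enable: partial aggregation on the map side

set hive.map.aggr=true;

-- set to true to enable the skew join optimization

set hive.optimize.skewjoin=false;

-- set to true when the group by data is skewed

set hive.groupby.skewindata=false;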

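8. number of jobs

In general a query, a subquery, a limit, and so on each produce one job (although not every such statement does).

The number of jobs can be controlled by how the statement is written.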
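
One example of controlling the job count through the statement: when every join clause uses the same column of a table, Hive can merge the joins into a single MapReduce job. The tables a, b, and c below are hypothetical.

```sql
-- Both joins use b.key1, so Hive can run them as one map/reduce job
select a.val, b.val, c.val
from a
join b on (a.key = b.key1)
join c on (c.key = b.key1);

-- Here the second join uses b.key2 instead, so two jobs are needed
select a.val, b.val, c.val
from a
join b on (a.key = b.key1)
join c on (c.key = b.key2);
```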
