Hive Tips and Tuning

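Hive is an important component of the Hadoop ecosystem and plays a key role in data analysis and processing. Because it usually handles very large volumes of data, the right optimization techniques can make a substantial difference to execution efficiency.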

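1. Filtering duplicate records

This problem comes up often in practice. Typically the same record has been inserted several times, or one id maps to several records when only one of them is needed.

(1) If the rows are exact duplicates, the distinct keyword does the job. If the rows differ but any value of each field under the same id is acceptable, group by combined with max/min works: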

select id,max(c1) as c1,max(c2) as c2 from test_table group by id

(2) In another case, any one record may be kept, but all of its fields must come from the same record. This can be done with row_number + join:

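-- Tag every row with a unique sequence number rn.
create table test_table2 as
select *, row_number() over(order by id) as rn from test_table1;
-- For each id, keep the single row that carries the largest rn.
select t2.* from
(select id,max(rn) as rn from test_table2 group by id) t1 inner join test_table2 t2
on t1.id=t2.id and t1.rn=t2.rn;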
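
The same goal can usually be reached in a single statement by numbering the rows inside each id group and keeping the first one. The following is only a minimal sketch, not necessarily how the original author would write it; it reuses test_table1 from above, and the aliases rn and ranked are illustrative names.

```sql
-- Number the rows inside each id group, then keep only row 1,
-- so every returned field comes from the same source record.
select *
from (
  select t.*, row_number() over(partition by id order by id) as rn
  from test_table1 t
) ranked
where rn = 1;
```

Ordering by id inside a group keyed on id leaves the choice of row arbitrary. Ordering by a timestamp or version column instead, if the table has one, makes the choice deterministic.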

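(3) A more unusual approach relies on Hive's collection aggregate functions (row-to-column). This method is rarely used and is not recommended, not least because the order of the elements returned by collect_list is not guaranteed.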

select id,collect_list(c1)[0] as c1,collect_list(c2)[0] as c2 from test_table group by id

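2. Merging multiple tables

Merging multiple tables is another very common scenario. For example, suppose there are three tables holding students' scores for three courses, each with a student id and a score, and they need to be merged into one table with four columns: id and the three course scores. The usual approach is a three-table join, but a join can be expensive in both resources and execution time. The same task can be done without a join by using union all + group by: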

select id,sum(score_1) as score_1,sum(score_2) as score_2,sum(score_3) as score_3 from
(select id,score_1,0 as score_2,0 as score_3 from test_table1
union all
select id,0 as score_1,score_2,0 as score_3 from test_table2
union all
select id,0 as score_1,0 as score_2,score_3 from test_table3) t1
group by id;

This method is not tied to the scenario above; it is simply one way of approaching the problem. When using it, pay attention to how null is handled.
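
One possible way to deal with null is sketched below. With a 0 placeholder, a student who is missing from one table ends up with a score of 0, which cannot be told apart from a real 0. A typed null placeholder keeps the missing score as null, because sum ignores null inputs and returns null when every input is null. The column names score_1, score_2, score_3 and the double type are assumptions made for illustration.

```sql
select id,
       sum(score_1) as score_1,
       sum(score_2) as score_2,
       sum(score_3) as score_3
from (
  -- cast(null as double) gives the placeholder a concrete type for union all
  select id, score_1, cast(null as double) as score_2, cast(null as double) as score_3 from test_table1
  union all
  select id, cast(null as double) as score_1, score_2, cast(null as double) as score_3 from test_table2
  union all
  select id, cast(null as double) as score_1, cast(null as double) as score_2, score_3 from test_table3
) t1
group by id;
```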

3. join

Joins are a perennial topic in database performance tuning. Common points to watch are as follows:

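4. row_number sorting

row_number is commonly used to add a ranking column. For sorting within partitions, the recommendation is to use row_number() over(distribute by c1 sort by c2 desc) rather than row_number() over(partition by c1 order by c2).

order by: suited to global sorting; its drawback is that only one reduce task can be used.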
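
At the query level, the difference between a global sort and a per-reducer sort can be sketched as follows. The names test_table, c1 and c2 are placeholders.

```sql
-- Global sort: one totally ordered result, but a single reducer does all the sorting.
select * from test_table order by c2 desc;

-- Per-group sort: rows with the same c1 go to the same reducer,
-- and each reducer sorts only its own rows, so several reducers can work in parallel.
select * from test_table distribute by c1 sort by c1, c2 desc;
```

The second query does not produce a globally ordered result. It only guarantees the order within each reducer's output, which is often enough when the rows are processed group by group.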

5. Compressed storage

Set the ORC format when creating the table:

create table table_name( id bigint, ...) …
row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored as inputformat 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
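
Hive also accepts the shorthand stored as orc, which selects the ORC serde and input/output formats without spelling out the class names. A minimal sketch follows; the table name, the columns and the choice of SNAPPY compression are illustrative assumptions.

```sql
create table orc_demo (
  id   bigint,
  name string
)
stored as orc
-- orc.compress selects the codec for the ORC files of this table
tblproperties ("orc.compress"="SNAPPY");
```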

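6. Parallel execution

By default, Hive does not enable parallel execution. Take two unrelated subqueries combined as a join b or a union all b: because a and b are independent, running them in parallel is the most efficient choice when resources allow. By default, however, Hive finishes one before starting the other, which is inefficient. Parallel mode can be turned on as follows: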

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=12;
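
As an illustration, the two branches of the union all below read different tables and do not depend on each other, so with parallel mode enabled Hive is free to run their jobs at the same time. The table names orders_2023 and orders_2024 are hypothetical.

```sql
set hive.exec.parallel=true;

-- The two inner selects are independent stages that can be scheduled concurrently.
select src, cnt
from (
  select 'orders_2023' as src, count(1) as cnt from orders_2023
  union all
  select 'orders_2024' as src, count(1) as cnt from orders_2024
) t;
```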

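7. Distinct counts with count

The usual way to count distinct values is count(distinct column_a), but this is computationally inefficient. Using a subquery with group by plus an outer count is recommended instead. In addition, count(1) is recommended over count(*) when counting rows.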
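
The rewrite described above can be sketched as follows; test_table and column_a are placeholder names.

```sql
-- Original form: the de-duplication is funneled through very few reducers.
select count(distinct column_a) as cnt from test_table;

-- Rewritten form: group by spreads the de-duplication across reducers,
-- and the outer count only counts the already-distinct rows.
select count(1) as cnt
from (
  select column_a
  from test_table
  where column_a is not null   -- count(distinct ...) ignores null, so drop it here too
  group by column_a
) t;
```

Without the is not null filter, the group by would emit one extra row for the null group and the two queries could differ by one.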

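8. with

In Hive queries, a subquery often needs to be used more than once. The options are to nest the subquery and give it an alias, or to write the subquery twice. Either way is more cumbersome than with. A with clause predefines a query (much like a variable) that can be referenced further down. This simplifies the SQL and, when the with clause is evaluated only once, can also improve performance.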

with t1 as (
select *
from carinfo
), t2 as (
select *
from car_blacklist
)
select * from t1 inner join t2 on t1.id=t2.id;
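
The query above references each with clause only once. A sketch of the case described in the text, where one subquery is used more than once, might look like this; the brand column is a hypothetical addition to carinfo.

```sql
with t1 as (
  select id, brand
  from carinfo
)
-- t1 is referenced twice: once for the detail rows and once for the per-brand totals.
select a.id, a.brand, b.cnt
from t1 a
inner join (
  select brand, count(1) as cnt
  from t1
  group by brand
) b
on a.brand = b.brand;
```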
