Common Hive SQL Techniques

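This article shows how to write Hive SQL for several common data-analysis scenarios: range (bucket) analysis, conditional data conversion, turning columns into rows, counting consecutive days, and taking the top n per group after sorting.

Merging many rows into one is commonly used for range statistics: by defining a set of amount ranges, hundreds of millions of records are reduced to one count per range. In short, it is a many-to-one mapping. Typical scenario: from users' daily transaction records, compute the number of transactions in each amount range per day.

Given the users' daily transaction table t_transfer_info, we need to count the transactions whose amount falls in each of the ranges 0-100, 100-200, 200-300, and above 300: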

-- map each transaction to a range id in a temporary view
create view t_deal_tmp_view_1 as
select
    case
        when rcv_amount <= 100 then 1
        when rcv_amount <= 200 then 2
        when rcv_amount <= 300 then 3
        else 4
    end as amount_range,
    receiver
from t_transfer_info;

-- count the transactions in each range
select
    amount_range,
    count(receiver) as cnt
from t_deal_tmp_view_1
group by amount_range;

drop view t_deal_tmp_view_1;

Why not write it as follows?

select
    case
        when rcv_amount <= 100 then 1
        when rcv_amount <= 200 then 2
        when rcv_amount <= 300 then 3
        else 4
    end as amount_range,
    count(receiver)
from t_transfer_info
group by
    case
        when rcv_amount <= 100 then 1
        when rcv_amount <= 200 then 2
        when rcv_amount <= 300 then 3
        else null
    end

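This version fails with an "Expression not in GROUP BY key" error. In Hive, when group by is used, any column that is not part of the group by must be wrapped in an aggregate function; only the group by columns can be selected as-is.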

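The main reason is that a case when expression placed after group by cannot be given a column name, so the select list has to repeat it exactly. In the query above the two copies differ in their else branch (4 versus null), so Hive does not treat them as the same expression. A temporary view sidesteps the problem by turning the range into a named column.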
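Another way to name the range column is an inline subquery instead of a view. The sketch below is only an illustration: it runs the query against SQLite through Python's sqlite3 module so that it is self-contained, SQLite stands in for Hive, and the sample rows are invented. Hive syntax may differ in detail.

```python
import sqlite3

# SQLite is only a stand-in engine here; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("create table t_transfer_info (receiver text, rcv_amount real)")
conn.executemany(
    "insert into t_transfer_info values (?, ?)",
    [("a", 50), ("b", 100), ("c", 150), ("d", 250), ("e", 300), ("f", 999)],
)

# The case when lives in the inner query, so the outer query can
# group by the named column amount_range.
bucket_sql = """
select amount_range, count(receiver) as cnt
from (
    select
        case
            when rcv_amount <= 100 then 1
            when rcv_amount <= 200 then 2
            when rcv_amount <= 300 then 3
            else 4
        end as amount_range,
        receiver
    from t_transfer_info
) t
group by amount_range
order by amount_range
"""
bucket_counts = conn.execute(bucket_sql).fetchall()
print(bucket_counts)
```

Each tuple is (amount_range, cnt); because the upper bounds are inclusive, an amount of exactly 100 lands in range 1.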

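In Hive tables some fields may be null. Arithmetic or logical tests applied directly to such a value do not give the expected result (arithmetic with null yields null), so null can be converted to 0 before further processing. Hive's built-in nvl function can do this conversion; case when is used here to show how conditional expressions are written in Hive SQL.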

select
    t1.uin,
    t1.income + case when t2.income is null then 0 else t2.income end as income,
    t1.expend + case when t2.expend is null then 0 else t2.expend end as expend
from (
    select uin, income, expend
    from t_user_trans_inf_day
    where statis_day = 20180812
) t1
left join (
    select uin, income, expend
    from t_user_trans_inf_day
    where statis_day = 20180811
) t2
on (t1.uin = t2.uin)
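To see why the conversion matters, the sketch below compares a plain sum with a null-safe one after a left join. It is an illustration under assumptions: SQLite (through Python's sqlite3) stands in for Hive, the rows are invented, and coalesce is used because SQLite has no nvl; Hive provides coalesce as well.

```python
import sqlite3

# SQLite is only a stand-in engine here; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t_user_trans_inf_day "
    "(uin integer, income integer, expend integer, statis_day integer)"
)
conn.executemany(
    "insert into t_user_trans_inf_day values (?, ?, ?, ?)",
    [
        (1, 10, 5, 20180812),
        (1, 7, 3, 20180811),
        (2, 20, 8, 20180812),  # user 2 has no row for 20180811
    ],
)

sum_sql = """
select
    t1.uin,
    t1.income + t2.income as naive_income,          -- null when t2 has no match
    t1.income + coalesce(t2.income, 0) as income    -- null replaced by 0 first
from (select uin, income from t_user_trans_inf_day where statis_day = 20180812) t1
left join (select uin, income from t_user_trans_inf_day where statis_day = 20180811) t2
    on t1.uin = t2.uin
order by t1.uin
"""
rows = conn.execute(sum_sql).fetchall()
print(rows)
```

For user 2 the naive sum comes back as None (null), while the coalesce version keeps the 20180812 income.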

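Suppose table a records each user's spending with one column per spending category, and we need to turn those columns into rows, as in table b, so that each original column becomes its own row.

There are two ways to do this: union and posexplode.

Method 1: union

The union approach selects each column separately and then merges the results. The SQL is as follows.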

select uin, 1 as type, of_amt from t_user_trans
union all
select uin, 2 as type, lf_amt from t_user_trans
union all
select uin, 3 as type, on_amt from t_user_trans
union all
select uin, 4 as type, cr_amt from t_user_trans

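Method 2: posexplode

explode is a built-in function that supports two usages:

explode(array): generates one row for each element of the array.

explode(map): generates one row for each key-value pair of the map, with the key in one column and the value in another.

With explode(array) there is no type column, so the resulting rows cannot be mapped back to the original columns. posexplode can be used instead: posexplode(array) also returns each element's position in the array, so the position can be output as a column that identifies the source column.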

select
    uin,
    t.pos + 1 as type,   -- pos starts at 0, so add 1 to match the union version
    t.value as amount
from t_user_trans
lateral view posexplode(array(of_amt, lf_amt, on_amt, cr_amt)) t as pos, value
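To make the position-to-type mapping concrete, here is a small sketch in plain Python rather than Hive. enumerate plays the role of posexplode by yielding each value together with its zero-based position; the rows are invented sample data.

```python
# Each tuple is (uin, of_amt, lf_amt, on_amt, cr_amt); the values are invented.
wide_rows = [
    (1001, 10, 20, 30, 40),
    (1002, 5, 0, 15, 25),
]

long_rows = []
for uin, *amounts in wide_rows:
    # enumerate yields (pos, value) pairs, much like posexplode over array(...)
    for pos, value in enumerate(amounts):
        long_rows.append((uin, pos + 1, value))  # pos + 1 becomes the type column

for row in long_rows:
    print(row)
```

Each input row with four amount columns produces four output rows of the form (uin, type, amount).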

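Given a user login log table, we need to compute each user's consecutive login days. The approach is to number each user's login days in order with row_number, then group by uin plus the date minus that row number. Consecutive days all map to the same value and are therefore aggregated together, and an aggregate function gives the final result.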

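select
    uin,
    count(uin) as continuity_days
from (
    select
        uin,
        statis_day,
        row_number() over (partition by uin order by statis_day asc) as rn
    from (
        -- one row per user per day, so repeated logins on one day do not break a streak
        select distinct
            uin,
            statis_day
        from t_user_login_log
        where statis_day >= 20170101
          and statis_day <= 20180809
    ) t1
) t2
-- assumes statis_day is stored as yyyyMMdd; date_sub expects a yyyy-MM-dd date
group by
    uin,
    date_sub(from_unixtime(unix_timestamp(cast(statis_day as string), 'yyyyMMdd'), 'yyyy-MM-dd'), cast(rn as int))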
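The date-minus-row-number trick is easier to follow on a tiny example. The sketch below is plain Python, not Hive, with invented login dates: it computes the same grouping key for each login day and counts how many days share it.

```python
from collections import Counter
from datetime import date, timedelta

# Invented sample data: user 1001 logs in on Aug 1-3 and again on Aug 7-8.
logins = {
    1001: [date(2018, 8, 1), date(2018, 8, 2), date(2018, 8, 3),
           date(2018, 8, 7), date(2018, 8, 8)],
}

streaks = Counter()
for uin, days in logins.items():
    # rn is the row number within the user's distinct login days, ordered by date
    for rn, day in enumerate(sorted(set(days)), start=1):
        # consecutive days share the same (day - rn), so they fall into one group
        group_key = day - timedelta(days=rn)
        streaks[(uin, group_key)] += 1

for (uin, group_key), continuity_days in sorted(streaks.items()):
    print(uin, group_key, continuity_days)
```

Aug 1-3 all map to the key 2018-07-31 and Aug 7-8 both map to 2018-08-03, so the user ends up with one streak of 3 days and one of 2.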

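Suppose t_user_score records every student's score in every subject, and we need the subject in which each student scored highest. This mainly relies on the row_number() function.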

select
    uin,
    course
from (
    select
        uin,
        course,
        -- order by score desc so that rn = 1 is the highest score
        row_number() over (partition by uin order by score desc) as rn
    from t_user_score
) t
where rn = 1
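The same pattern extends from top 1 to top n by changing the filter on rn. The sketch below shows the logic in plain Python rather than Hive; the scores and the top_n value are invented for illustration.

```python
from collections import defaultdict

# Invented sample data: (uin, course, score).
scores = [
    (1001, "math", 90), (1001, "english", 85), (1001, "physics", 70),
    (1002, "math", 60), (1002, "english", 95),
]
top_n = 2  # hypothetical cutoff; 1 reproduces the "highest subject" case

# partition by uin
by_student = defaultdict(list)
for uin, course, score in scores:
    by_student[uin].append((course, score))

top_rows = []
for uin, rows in sorted(by_student.items()):
    # order by score desc, then number the rows starting at 1
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    for rn, (course, score) in enumerate(ranked, start=1):
        if rn <= top_n:  # same role as "where rn <= n"
            top_rows.append((uin, course, score, rn))

for row in top_rows:
    print(row)
```

Unlike this sketch, row_number() in SQL breaks ties between equal scores in an unspecified order, so a secondary sort column is worth adding when the result must be deterministic.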
