Hive Data Processing

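Two weeks into my new job I have been using Hive heavily, so here is a summary of the problems I ran into and the mistakes that are easy to make.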

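Example 1. Extract the 12345678 part from a uri value, and similar requirements.

At first glance this looks like a regular-expression problem, and a regex function solves it in one line: regexp_extract(uri,'(.*)/(\\d+)',2). In practice, though, it becomes very slow once the data volume is large, because regex matching is a CPU-intensive operation. Looking for another approach, I examined the data more closely and found that the strings are regular: the 12345678 part has a fixed length in the uri field, and the prefix in front of it has a fixed length too. Switching to substr(a,b,c) made the query several times faster.

Summary: whenever a requirement seems to call for a regex, first ask whether there is an alternative, and then get to know your data. If you can confirm the data is regular, plain string functions can replace the regex entirely. In short, regexes are better suited to irregular strings, while string functions handle regular strings very efficiently.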
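The equivalence between the two approaches can be sketched outside Hive. The snippet below is only an illustration in Python: the prefix "/item/detail/" and the sample values are hypothetical (the original uri is not shown in the post), and Python slicing stands in for Hive's substr.

```python
import re

# Hypothetical sample values: a fixed-length prefix followed by an 8-digit id.
PREFIX = "/item/detail/"
uris = [PREFIX + "12345678", PREFIX + "87654321"]

# Same pattern as in the post; group 2 is the trailing run of digits.
pattern = re.compile(r"(.*)/(\d+)")


def extract_with_regex(uri):
    """Regex-based extraction, analogous to taking group 2 of the pattern."""
    match = pattern.search(uri)
    return match.group(2) if match else None


def extract_with_slice(uri, start=len(PREFIX), length=8):
    """Fixed-offset extraction, analogous to substr with a known start and length.

    Python offsets are 0-based, whereas Hive's substr counts from 1.
    """
    return uri[start:start + length]


regex_ids = [extract_with_regex(u) for u in uris]
slice_ids = [extract_with_slice(u) for u in uris]
print(regex_ids, slice_ids)
```

Both functions return the same ids only as long as the prefix length and id length really are fixed, which is exactly the property of the data that has to be verified before dropping the regex.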

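Example 2.

table1 --> order table. Main fields: order_id, total_price

table2 --> order-goods table. Main fields: order_id, goods_id, price, amount

table3 --> goods table. Main fields: goods_id, price, name

Requirement: from these three tables, compute the sales revenue of each product.

At first glance this is a simple join query, so: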

select sum(total_price) from table1 join table2 on table1.order_id=table2.order_id join table3 on table2.goods_id=table3.goods_id group by name;

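Maybe I am just slow, but I made this mistake straight away.

Mistake 1: I had not worked out the relationship between the order table and the order-goods table. If it is one-to-many, a single order record turns into several rows as soon as it is joined with the order-goods table, and summing total_price over those rows is clearly wrong: the order total is counted once per order line.

Mistake 2: I had not checked whether name in table3 is unique. If it is not, distinct products must be identified by goods_id, not by name.

With these points cleared up, the correct SQL is: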

select table2.goods_id,table3.name,sum(table2.price*table2.amount) from table1 join table2 on table1.order_id=table2.order_id join table3 on table2.goods_id=table3.goods_id group by table2.goods_id,table3.name;

Summary: when writing SQL you must understand how the tables relate to each other and what each field means. Only then can you write correct SQL.
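Both mistakes can be reproduced on a tiny data set. The sketch below uses Python's built-in sqlite3 module in place of Hive, and the rows are made up for illustration: one order with a total of 50.0 that contains two different goods sharing the name 'pen'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table table1 (order_id integer, total_price real);
create table table2 (order_id integer, goods_id integer, price real, amount integer);
create table table3 (goods_id integer, price real, name text);
-- hypothetical data: one order, two order lines, two goods with the same name
insert into table1 values (1, 50.0);
insert into table2 values (1, 101, 10.0, 2), (1, 102, 30.0, 1);
insert into table3 values (101, 10.0, 'pen'), (102, 30.0, 'pen');
""")

JOINS = """
from table1
join table2 on table1.order_id = table2.order_id
join table3 on table2.goods_id = table3.goods_id
"""

# Wrong: total_price is repeated once per order line, and the two
# different goods are merged because they share a name.
wrong = conn.execute(
    "select name, sum(total_price) " + JOINS + " group by name"
).fetchall()

# Right: revenue comes from the order lines and is grouped by goods_id.
right = conn.execute(
    "select table2.goods_id, table3.name, sum(table2.price * table2.amount) "
    + JOINS
    + " group by table2.goods_id, table3.name order by table2.goods_id"
).fetchall()

print(wrong)
print(right)
```

The wrong query reports 100.0 for 'pen' even though the only order is worth 50.0, while the corrected query reports 20.0 and 30.0 for the two goods separately.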

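Example 3. Using grouping sets(). Suppose there is a goods table in which a product can sit under level-1, level-2, and level-3 categories. If a product belongs to a level-3 category, it also belongs to the corresponding level-1 and level-2 categories.

Requirement: count the number of products in each level-1, level-2, and level-3 category.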

select catalog1,null as catalog2,null as catalog3,count(1) as goods_amount from test group by catalog1

union all

select catalog1,catalog2,null as catalog3,count(1) as goods_amount from test group by catalog1,catalog2

union all

select catalog1,catalog2,catalog3,count(1) as goods_amount from test group by catalog1,catalog2,catalog3;

This SQL is long and highly redundant, so it can be replaced with something more elegant:

select grouping__id,catalog1,catalog2,catalog3,count(1) as goods_amount from test group by catalog1,catalog2,catalog3 grouping sets(catalog1,(catalog1,catalog2),(catalog1,catalog2,catalog3));

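It is also nearly equivalent to select grouping__id,catalog1,catalog2,catalog3,count(1) as goods_amount from test group by catalog1,catalog2,catalog3 with rollup; the difference is that with rollup additionally produces a grand-total row for the empty grouping set ().

grouping sets((...),(....)); means that the rows are aggregated once for each listed grouping set, where every set is built from the columns named in group by. grouping__id is provided by Hive itself; it returns a number identifying the grouping set each row belongs to.

There is one more form: select grouping__id,catalog1,catalog2,catalog3,count(1) as goods_amount from test group by catalog1,catalog2,catalog3 with cube;

group by catalog1,catalog2,catalog3 with cube; is equivalent to group by catalog1,catalog2,catalog3 grouping sets((catalog1),(catalog2),(catalog3),(catalog1,catalog2),(catalog2,catalog3),(catalog1,catalog3),(catalog1,catalog2,catalog3),()); in other words, with cube returns every combination of catalog1, catalog2, catalog3, that is, all subsets of the column list including the empty set.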
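To make the relationship between grouping sets, rollup, and cube concrete, here is a small Python sketch that emulates them on in-memory rows. It is not how Hive executes these queries; the rows and the helper name group_count are made up for illustration, and None plays the role of NULL for columns outside the current grouping set.

```python
from collections import Counter
from itertools import combinations

COLUMNS = ("catalog1", "catalog2", "catalog3")

# Hypothetical rows of the test table: (catalog1, catalog2, catalog3)
rows = [
    ("food", "fruit", "apple"),
    ("food", "fruit", "pear"),
    ("food", "drink", "tea"),
]


def group_count(rows, grouping_set):
    """count(1) per key; columns not in the grouping set become None."""
    return Counter(
        tuple(value if col in grouping_set else None
              for col, value in zip(COLUMNS, row))
        for row in rows
    )


# The grouping sets from the example: one pass per category level.
example_sets = [COLUMNS[:1], COLUMNS[:2], COLUMNS[:3]]

# rollup: every prefix of the column list, down to the empty set.
rollup_sets = [COLUMNS[:n] for n in range(len(COLUMNS), -1, -1)]

# cube: every subset of the column list, including the empty set.
cube_sets = [subset
             for n in range(len(COLUMNS) + 1)
             for subset in combinations(COLUMNS, n)]

for grouping_set in example_sets:
    print(grouping_set, dict(group_count(rows, grouping_set)))
```

With three columns, rollup expands to 4 grouping sets and cube to 8, which is why cube can become expensive as more columns are added.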

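Example 4. Top-N within groups.

For instance: for each department and each year, find the names of the three employees with the highest annual salary.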

select t.department,t.year,t.name,t.year_salary from (select department,year,name,sum(salary) as year_salary,row_number() over(distribute by department,year sort by sum(salary) desc) as linenumber from employee group by department,year,name) t where t.linenumber<=3;

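This mainly relies on the window function row_number() together with the over() clause, which defines the partitioning and ordering the function runs over. How it works underneath, and what else it can do, is something I still need to look into.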
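The top-N pattern can be tried locally with Python's sqlite3 module, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions). This is only a sketch: the employee rows are invented, the aggregation is moved into an inner subquery for readability, and the standard partition by / order by spelling is used because SQLite does not accept distribute by / sort by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table employee (department text, year integer, name text, salary real);
-- hypothetical salary payments; ann is paid twice in the same year
insert into employee values
  ('dev', 2020, 'ann', 10), ('dev', 2020, 'ann', 10),
  ('dev', 2020, 'bob', 15),
  ('dev', 2020, 'cat', 30),
  ('dev', 2020, 'dan', 5),
  ('ops', 2020, 'eve', 8);
""")

top3 = conn.execute("""
select department, year, name, year_salary
from (
  -- number the employees inside each (department, year) by annual salary
  select department, year, name, year_salary,
         row_number() over (partition by department, year
                            order by year_salary desc) as linenumber
  from (
    -- annual salary per employee
    select department, year, name, sum(salary) as year_salary
    from employee
    group by department, year, name
  ) yearly
) ranked
where linenumber <= 3
order by department, year, linenumber
""").fetchall()

for row in top3:
    print(row)
```

Note that row_number() breaks ties arbitrarily; if employees with equal salaries should share a position, rank() or dense_rank() would be the functions to look at instead.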

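Summary: there are many more complicated cases that are not covered here. In short, data processing cannot focus on the data alone. Where the data comes from, how it should finally be presented, the business logic, and everything else tied to the data also need to be understood; otherwise you end up with wrong data or with data that does not communicate clearly.

From logging and collection through storage, processing, and presentation, the pipeline is complex, and a mistake at any stage leads to a wrong conclusion. As a data analyst there is still a lot to do and to care about. Keep at it!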
