A Best Practice for Processing OSS Data with Hive

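This article describes how to use Hive to process source data stored on OSS, run the computation on E-MapReduce, and store the final results back on OSS, with the Hive partition data scheduled automatically every day.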

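Source data: we assume the data on OSS is stored under a fixed directory layout, for example by date, in year/month/day form such as 2016/06/01. The raw content is unformatted data that has not been processed at all.

A record looks like this: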

123|service control exceed 100. others content|192.168.0.1|2016-05-31

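Result data: we need to process the data in each directory and write it to a result directory on OSS laid out like 2016/06/01.

First, create an external partitioned table in Hive that points at the data on OSS: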

create external table logoss (logcontent string) partitioned by (year string, month string, day string) stored as textfile location 'oss:';

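After this step we have a partitioned Hive table. Hive has only recorded the table's information in its metastore; no data has been processed yet, and the data itself still sits on OSS.

Next, add the required partitions to the table. Here we assume there are several of them: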

alter table logoss add partition (year='2016', month='05', day='31') location 'oss:/2016/05/31' partition (year='2016', month='06', day='01') location 'oss:/2016/06/01' partition (year='2016', month='06', day='02') location 'oss:/2016/06/02' partition (year='2016', month='06', day='03') location 'oss:/2016/06/03';
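Writing a long partition list by hand is error-prone. As a rough sketch, the statement above could be generated for a date range with a small Python helper. The function name `add_partition_sql` and its parameters are hypothetical, and the location prefix is passed in exactly as it appears in the article.

```python
from datetime import date, timedelta


def add_partition_sql(table, base_location, start, end):
    """Build one ALTER TABLE ... ADD PARTITION statement for start..end (inclusive)."""
    if start > end:
        raise ValueError("start must not be after end")
    clauses = []
    day = start
    while day <= end:
        y, m, d = day.strftime("%Y"), day.strftime("%m"), day.strftime("%d")
        # One partition clause per day, pointing at the year/month/day directory
        clauses.append(
            f"partition (year='{y}', month='{m}', day='{d}') "
            f"location '{base_location}/{y}/{m}/{d}'"
        )
        day += timedelta(days=1)
    return f"alter table {table} add " + " ".join(clauses) + ";"


if __name__ == "__main__":
    # 'oss:' is the location prefix as written in the article
    print(add_partition_sql("logoss", "oss:", date(2016, 5, 31), date(2016, 6, 3)))
```

The generated text can be pasted into the Hive script or written to a file before the job runs.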

Now select some data to take a look:

select * from logoss limit 100;

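This shows the contents of our partitions.

Next we process the raw data on OSS and write it into a table on HDFS, then use that HDFS table for all subsequent processing. Keeping every intermediate step on HDFS makes things much faster.

First create a Hive table backed by HDFS. It is still empty at this point: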

create table loghdfs (id string, content string, ip string, oridate string) partitioned by (year string, month string, day string) stored as textfile;

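Then process the OSS data and write it into the HDFS table. We use if not exists here so that an existing partition is not overwritten; if you want the data overwritten directly, remove that condition.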

insert overwrite table loghdfs partition (year='2016', month='05', day='31') if not exists select split(logcontent,'\\|')[0] as id, split(logcontent,'\\|')[1] as content, split(logcontent,'\\|')[2] as ip, split(logcontent,'\\|')[3] as oridate from logoss where year='2016' and month='05' and day='31';
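To make the column mapping concrete, here is a minimal Python sketch of what the four split expressions do to the sample record shown earlier. Hive's split takes a regular expression, which is why the pipe is escaped as '\\|' in the statement; Python's str.split takes a literal separator, so no escaping is needed. The function name `parse_log_line` is hypothetical, and filling missing fields with None is an assumption of this sketch.

```python
def parse_log_line(line):
    """Split a raw log line on '|' into the four columns of loghdfs."""
    fields = line.split("|")
    # Assumption of this sketch: fields missing from a short line become None
    padded = fields + [None] * (4 - len(fields))
    return {
        "id": padded[0],
        "content": padded[1],
        "ip": padded[2],
        "oridate": padded[3],
    }


if __name__ == "__main__":
    sample = "123|service control exceed 100. others content|192.168.0.1|2016-05-31"
    print(parse_log_line(sample))
```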

At this point we have a table on HDFS and can run any further processing on it,

for example grouping by IP and looking at the total count for each:

create table userip as select ip, count(id) as count from loghdfs group by ip;
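For readers less familiar with SQL aggregation, the grouping step can be sketched in plain Python. This only illustrates the semantics and is not how Hive executes the query; the function name `count_by_ip` and the row dictionaries are hypothetical. As in SQL, count(id) ignores rows whose id is NULL, modelled here as None.

```python
from collections import Counter


def count_by_ip(rows):
    """Model of: select ip, count(id) from loghdfs group by ip."""
    counts = Counter()
    for row in rows:
        # Every ip gets a group; only rows with a non-null id add to its count
        counts[row["ip"]] += 0 if row["id"] is None else 1
    return counts


if __name__ == "__main__":
    # Made-up rows, for illustration only
    rows = [
        {"id": "1", "ip": "192.168.0.1"},
        {"id": "2", "ip": "192.168.0.1"},
        {"id": "3", "ip": "10.0.0.2"},
    ]
    for ip, total in count_by_ip(rows).items():
        print(ip, total)
```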

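Any number of similar operations can go in between, depending on your business needs.

Once all the processing is done and you want to write the data to OSS, we come to the final step.

First create a Hive table mapped to an OSS path, much like in the first step: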

create external table resultoss (ip string, count int) partitioned by (year string, month string, day string) stored as textfile location 'oss:';

Finally, write the business data into the corresponding partition:

insert overwrite table resultoss partition (year='2016', month='05', day='31') if not exists select ip, count from userip;

The result data is then written to the corresponding directory on OSS, under a path like this:

/path/year=2016/month=05/day=31/
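Note that this layout differs from the 2016/05/31 layout of the input: when a partition is written without an explicit location, Hive names each directory level as key=value under the table location. A small sketch of how such a path is formed, using a hypothetical helper `partition_dir`:

```python
from datetime import date


def partition_dir(base, day):
    """Return the key=value style directory Hive uses for a year/month/day partition."""
    return (
        f"{base.rstrip('/')}/year={day:%Y}/month={day:%m}/day={day:%d}/"
    )


if __name__ == "__main__":
    # '/path' stands for the table location, as in the article's example
    print(partition_dir("/path", date(2016, 5, 31)))
```

This is handy when a downstream job needs to locate the output of a given day.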

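Looking back at the process above, the partition date has to be written in by hand, which is tedious and makes it impossible to run automatically. So let's take it a step further.

First, when editing the job in the E-MapReduce console, use hivevar to specify the date variables: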

-hivevar year='2016' -hivevar month='05' -hivevar day='31' -f ossref://mypath/job.hql
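To see what these arguments look like for an arbitrary run date, for instance when backfilling a past day by hand, the string can be produced with a short Python sketch. This is not how E-MapReduce generates the values; the function name `hive_args` is hypothetical.

```python
from datetime import date


def hive_args(run_date, script):
    """Build the -hivevar argument string for one run date."""
    return (
        f"-hivevar year='{run_date:%Y}' "
        f"-hivevar month='{run_date:%m}' "
        f"-hivevar day='{run_date:%d}' "
        f"-f {script}"
    )


if __name__ == "__main__":
    print(hive_args(date(2016, 5, 31), "ossref://mypath/job.hql"))
```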

Next, these constants need to become dates that change automatically every day, using the time variables that E-MapReduce provides:

-hivevar year=' $' -hivevar month=' $' -hivevar day=' $' -f ossref://mypath/job.hql

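For details on configuring the time variables, see the E-MapReduce documentation.

Now let's look at the complete code after the changes, with the partition dates replaced by variables: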

create external table logoss (logcontent string) partitioned by (year string, month string, day string) stored as textfile location 'oss:/';
alter table logoss add partition (year='$', month='$', day='$') location 'oss:/$/$/$';
create table loghdfs (id string, content string, ip string, oridate string) partitioned by (year string, month string, day string) stored as textfile;
insert overwrite table loghdfs partition (year='$', month='$', day='$') if not exists select split(logcontent,'\\|')[0] as id, split(logcontent,'\\|')[1] as content, split(logcontent,'\\|')[2] as ip, split(logcontent,'\\|')[3] as oridate from logoss where year='$' and month='$' and day='$';
create table userip as select ip, count(id) as count from loghdfs group by ip;
create external table resultoss (ip string, count int) partitioned by (year string, month string, day string) stored as textfile location 'oss:';
insert overwrite table resultoss partition (year='$', month='$', day='$') if not exists select ip, count from userip;
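Before scheduling the job, it can help to check what the script will look like once the variables are filled in. The sketch below is a rough Python model of that substitution, assuming the script refers to the variables in Hive's `${hivevar:name}` form. The function name `substitute_hivevars` is hypothetical, and raising an error on an undefined variable is a choice made in this sketch, not Hive's behaviour.

```python
import re


def substitute_hivevars(script, variables):
    """Replace each ${hivevar:name} in script with its value from variables."""

    def repl(match):
        name = match.group(1)
        if name not in variables:
            # Sketch-only behaviour: fail loudly on a missing variable
            raise KeyError(f"undefined hivevar: {name}")
        return variables[name]

    return re.sub(r"\$\{hivevar:(\w+)\}", repl, script)


if __name__ == "__main__":
    line = "partition (year='${hivevar:year}', month='${hivevar:month}', day='${hivevar:day}')"
    print(substitute_hivevars(line, {"year": "2016", "month": "05", "day": "31"}))
```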

You can then add this job to a periodic execution plan that runs once a day, and the data will be processed automatically every day.
