Hive Log Analysis in Practice (Part 2)

Counting new users by channel on a game platform

The log format is as follows:

jul 23 0:00:47  [info] gjzq2013072300004785493108s1360wan-2j-reg58.240.209.78

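The key to the problem is to identify the new users first.

New user: a user who has logged in to the platform only in July (2013-07), with no login in any earlier month.

Following the map/reduce way of thinking, new users can be found by first de-duplicating logins per user per month and then counting the number of months in which each user logged in (section 2 below).

Find each new user's channel

Channel: a new user may log in to the platform several times during 201307, so we need the channel of that user's earliest login.

This is done in two steps:

1. Data preparation

1) Create the table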

create table if not exists glogin_daily (year int,month int,day int,hour int,logintime string,qid int,gkey string,skey string,loginip string,registsrc string,loginfrom string) partitioned by (dt string);

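The table is built from the log contents and the fields we care about, and is partitioned by day.

2) Import the data

Because the log files are scattered across several locations, we first gather them into one temporary directory, create a temporary external table over that directory to bring the data into Hive, and then split out the individual fields with a regular expression. (An external table whose location is the directory picks up every file in it, so no per-file load is needed.)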

echo "==== load data into tmp table $tmp_table ==="
$/hive/bin/hive -e "create external table $tmp_table (info string) location '$';"
echo "==== m/r ==="
curr_year=`echo $curr_doing|cut -b 1-4`
curr_month=`echo $curr_doing|cut -b 5-6`
curr_day=`echo $curr_doing|cut -b 7-8`
dt="${curr_year}-${curr_month}-${curr_day}"
$/hive/bin/hive -e "add file $/$;set hive.exec.dynamic.partition=true;insert overwrite table glogin_daily partition (dt='${dt}') select transform (t.i) using '$map_script_parser ./$' as (y,m,d,h,t,q,g,s,ip,src,f) from (select info as i from ${tmp_table}) t;"

where filter_login.php is:

$fr=fopen("php://stdin","r");
$month_dict = array(
'jan' => 1,
'feb' => 2,
'mar' => 3,
'apr' => 4,
'may' => 5,
'jun' => 6,
'jul' => 7,
'aug' => 8,
'sep' => 9,
'oct' => 10,
'nov' => 11,
'dec' => 12,
);
while(!feof($fr))
{
// sample input line:
// xxj20130723000000245396389s9iwan-ng-mnsgcl-reg-xxj0if221.5.67.136
if(preg_match("/([^ ]+) +(\d+) (\d+):.*\([^\(\d+)\(\d+)\([^\([^\(([^\([^\)?/",$input,$matches))
{
// ... (body not preserved in the source)
}
}
fclose ($fr);
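
Since the body of the parsing script is incomplete, the following is only a rough sketch, in Python rather than PHP, of how a TRANSFORM parser for this step might look. It assumes the payload after `[info]` is tab-separated, which is a guess because the delimiters are not visible in the sample line. The names `parse_line` and `run` are hypothetical, and the sketch emits only month, day and hour followed by the raw payload fields; it does not derive the year.

```python
import io
import re
import sys

# Month abbreviation -> month number, mirroring $month_dict in the PHP script.
MONTHS = {name: i + 1 for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

# Matches a prefix such as "jul 23 0:00:47  [info] " and captures the payload.
PREFIX = re.compile(r"^(\w{3}) +(\d+) +(\d+):\d+:\d+ +\[\w+\] +(.*)$")


def parse_line(line, sep="\t"):
    """Return [month, day, hour, *payload_fields], or None if the line does not match.

    ASSUMPTION: payload fields are separated by `sep` (tab by default).
    """
    m = PREFIX.match(line.rstrip("\r\n"))
    if m is None:
        return None
    mon, day, hour, payload = m.groups()
    month = MONTHS.get(mon.lower())
    if month is None:
        return None
    return [str(month), str(int(day)), str(int(hour))] + payload.split(sep)


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read raw log lines and write tab-separated rows, skipping lines that do not parse."""
    for line in stdin:
        row = parse_line(line)
        if row is not None:
            stdout.write("\t".join(row) + "\n")


# Small self-check on made-up input.
demo = io.StringIO(
    "jul 23 0:00:47  [info] gjzq\t20130723000047\t85493108\n"
    "not a log line\n")
out = io.StringIO()
run(demo, out)
print(out.getvalue(), end="")
```

When used from Hive, the script would call `run()` with its defaults, since TRANSFORM feeds rows on standard input and reads tab-separated columns from standard output.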

2. Find the new users

1) De-duplicate each user's platform logins by month

create table distinct_login_monthly_tmp_07 as select qid,year,month from glogin_daily group by qid,year,month;

2) Count the number of months in which each user logged in

create table login_stat_monthly_tmp_07 as select qid,count(1) as c from distinct_login_monthly_tmp_07 where year<2013 or (year=2013 and month<=7) group by qid;
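
To make the two aggregation steps concrete, here is a minimal in-memory sketch of the same logic in Python (not Hive) on made-up records. The sample rows and the function name `count_login_months` are hypothetical; only the rule itself (distinct login months up to and including 2013-07) comes from the queries above.

```python
from collections import defaultdict


def count_login_months(logins, cutoff=(2013, 7)):
    """logins: iterable of (qid, year, month) tuples.

    Returns {qid: number of distinct login months up to and including cutoff}.
    """
    months = defaultdict(set)
    for qid, year, month in logins:
        # Tuple comparison is equivalent to: year<2013 or (year=2013 and month<=7)
        if (year, month) <= cutoff:
            # The set plays the role of GROUP BY qid, year, month (de-duplication).
            months[qid].add((year, month))
    return {qid: len(ms) for qid, ms in months.items()}


# Hypothetical sample data
logins = [
    (1001, 2013, 7), (1001, 2013, 7),   # two logins, both in July
    (1002, 2013, 6), (1002, 2013, 7),   # June and July
    (1003, 2013, 8),                    # after the cutoff, ignored
]
counts = count_login_months(logins)
print(counts)
```

A count of 1 only says the user was active in a single month; it is the later join against the July 2013 logins that confirms that month is 2013-07.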

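Platform-level new users:

1) Find the users whose login-month count is 1.

2) Check whether these users appear in July; if they do, collect every src they logged in from.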

create table new_player_monthly_07 as select distinct a.qid,b.src,b.logintime from (select qid from login_stat_monthly_tmp_07 where c=1) a join (select qid,loginfrom as src,logintime from glogin_daily where month=7 and year=2013) b on a.qid=b.qid;

Find the src of the earliest login:

add file /home/game/lvbenwei/load_login/get_player_src.php;
create table new_player_src_07 as select transform (t.qid,t.src,t.logintime) using 'php ./get_player_src.php' as (qid,src,logintime) from (select * from new_player_monthly_07 order by qid,logintime) t;

where get_player_src.php is:

$fr=fopen("php://stdin","r");
$curr_qid = null;
$curr_src = null;
$curr_logintime=null;
while(!feof($fr))
{
// ... (loop body not preserved in the source)
}
fclose ($fr);
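
Because the loop body of the script is missing, the following is a hedged sketch, in Python rather than PHP, of one way such a script could pick the earliest row per user. It assumes the rows arrive tab-separated as (qid, src, logintime), grouped by qid and sorted by logintime within each qid. The names `read_rows`, `first_src_per_user` and `run` are hypothetical, and this is not necessarily how the original script worked.

```python
import io
import sys


def read_rows(stream):
    """Yield (qid, src, logintime) tuples from tab-separated lines, skipping malformed ones."""
    for line in stream:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield tuple(parts)


def first_src_per_user(rows):
    """Yield the first row seen for each qid.

    ASSUMPTION: rows are grouped by qid and sorted by logintime inside each
    group, so the first row of a group is the earliest login.
    """
    curr_qid = None
    for qid, src, logintime in rows:
        if qid != curr_qid:
            curr_qid = qid
            yield qid, src, logintime


def run(stdin=sys.stdin, stdout=sys.stdout):
    for qid, src, logintime in first_src_per_user(read_rows(stdin)):
        stdout.write(f"{qid}\t{src}\t{logintime}\n")


# Small self-check on made-up input.
demo = io.StringIO(
    "1001\t360wan\t20130723000047\n"
    "1001\tiwan\t20130725101500\n"
    "1002\tiwan\t20130701120000\n")
out = io.StringIO()
run(demo, out)
print(out.getvalue(), end="")
```

The script is only correct if all rows for a qid reach the same process in order. In Hive that guarantee is commonly obtained with DISTRIBUTE BY qid SORT BY qid, logintime.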

Number of platform-level new users:

select count(*) from new_player_src_07;

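Platform-level summary of new users per channel (counted over new_player_src_07, which holds one earliest-login row per user, so that a user with several July logins is counted once):

create table new_player_src_stat_07 as select src,count(*) as c from new_player_src_07 group by src;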
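
A per-channel count only equals the number of new users when each user contributes a single row, namely the earliest-login row. As a minimal illustration in Python, with made-up qid and src values, the aggregation amounts to:

```python
from collections import Counter

# Hypothetical result of the earliest-login step: one src per new user.
first_src = {1001: "360wan", 1002: "iwan", 1003: "360wan"}

# Equivalent of: select src, count(*) ... group by src
per_channel = Counter(first_src.values())
print(per_channel.most_common())
```

The counts sum to the number of new users, which gives a quick consistency check against the total from `select count(*) from new_player_src_07`.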
