Hive Log Analysis in Practice (Part 2)

Counting New-User Channels on a Game Platform

The log format is as follows:

    jul 23 0:00:47 [info] gjzq2013072300004785493108s1360wan-2j-reg58.240.209.78

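The key to the problem is to identify the new users first.

New user: a user who has logged in to the platform only in July.

Following the map/reduce approach, new users can be found as follows: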
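As a rough illustration of that idea (a sketch, not the actual Hive job), the snippet below treats login records as `qid year month` lines on standard input. The map step turns each login into a (qid, month) pair, duplicates are dropped, and the reduce step counts distinct months per qid and keeps the users whose only month is the target month. The function name `find_new_users`, the whitespace-separated input layout, and the sample rows are assumptions made for this example.

```shell
#!/bin/sh
# find_new_users TARGET_YYYYMM
# Reads "qid year month" lines on stdin and prints each qid whose only
# login month up to TARGET_YYYYMM is TARGET_YYYYMM itself.
# Pipeline: map to "qid yyyymm", dedupe with sort -u, then count months per qid.
# Input layout and function name are assumptions for this sketch.
find_new_users() {
  target="$1"
  awk -v target="$target" '{
        ym = sprintf("%04d%02d", $2, $3)
        if ((ym + 0) <= (target + 0)) print $1, ym
      }' |
    sort -u |
    awk -v target="$target" '{
          months[$1]++          # distinct months seen for this qid
          last[$1] = $2         # with a count of 1 this is the only month
        }
        END {
          for (q in months)
            if (months[q] == 1 && (last[q] + 0) == (target + 0)) print q
        }' |
    sort
}

# Example: user 1001 logged in only in July 2013, user 1002 also in June.
printf '%s\n' \
  '1001 2013 7' \
  '1001 2013 7' \
  '1002 2013 6' \
  '1002 2013 7' |
  find_new_users 201307
```

With the sample rows, only 1001 is printed, because 1002 has two distinct login months.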

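Find each new user's channel

Channel: a new user may log in to the platform several times during 201307, so we need the channel of the earliest login.

This is done in two steps: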

1. Data preparation

1) Create the table

    create table if not exists glogin_daily (
      year int, month int, day int, hour int,
      logintime string, qid int, gkey string, skey string,
      loginip string, registsrc string, loginfrom string
    ) partitioned by (dt string);

The table is defined from the log contents and the fields we care about, with one partition per day.

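2) Import the data

Because the log files are spread across several locations, we first gather them into one temporary directory, create a temporary external table over that directory so that Hive can read the data, and then split each line into fields with a regular expression. (Pointing an external table at the directory makes every file in it readable at once, with no per-file load step.)

    echo "==== load data into tmp table $tmp_table ==="
    $/hive/bin/hive -e "create external table $tmp_table (info string) location '$';"
    echo "==== m/r ==="
    curr_year=$(echo "$curr_doing" | cut -b 1-4)
    curr_month=$(echo "$curr_doing" | cut -b 5-6)
    curr_day=$(echo "$curr_doing" | cut -b 7-8)
    dt="${curr_year}-${curr_month}-${curr_day}"
    $/hive/bin/hive -e "add file $/$;set hive.exec.dynamic.partition=true;insert overwrite table glogin_daily partition (dt='${dt}') select transform (t.i) using '$map_script_parser ./$' as (y,m,d,h,t,q,g,s,ip,src,f) from (select info as i from ${tmp_table}) t;"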
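The date handling in that script can be tried in isolation. This sketch wraps the same `cut -b` byte ranges in a helper and assumes `curr_doing` is an 8-digit `YYYYMMDD` string such as 20130723. The helper name `to_partition_date` is made up for this example.

```shell
#!/bin/sh
# to_partition_date YYYYMMDD -> prints YYYY-MM-DD
# Helper name is hypothetical; the byte ranges mirror the import script.
to_partition_date() {
  curr_doing="$1"
  curr_year=$(printf '%s\n' "$curr_doing" | cut -b 1-4)    # bytes 1-4: year
  curr_month=$(printf '%s\n' "$curr_doing" | cut -b 5-6)   # bytes 5-6: month
  curr_day=$(printf '%s\n' "$curr_doing" | cut -b 7-8)     # bytes 7-8: day
  printf '%s-%s-%s\n' "$curr_year" "$curr_month" "$curr_day"
}

to_partition_date 20130723
```

The result is the value used for the `dt` partition key, so each day's logs land in their own partition.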

Here filter_login.php is:

    $fr = fopen("php://stdin", "r");
    $month_dict = array(
        'jan' => 1, 'feb' => 2, 'mar' => 3, 'apr' => 4,
        'may' => 5, 'jun' => 6, 'jul' => 7, 'aug' => 8,
        'sep' => 9, 'oct' => 10, 'nov' => 11, 'dec' => 12,
    );
    while(!feof($fr))
    xxj20130723000000245396389s9iwan-ng-mnsgcl-reg-xxj0if221.5.67.136
    if(preg_match("/([^ ]+) +(\d+) (\d+):.*$[^\(\d+)\(\d+)\([^\([^\(([^\([^$?/",$input,$matches))
    } fclose ($fr);
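The start of the pattern captures the month name, the day, and the hour from the line prefix (`jul 23 0:00:47 ...`), and the month name is then mapped to a number through `$month_dict`. A small sketch of just that step follows, written with awk instead of PHP. The helper name `parse_prefix` is hypothetical, and splitting the remaining fields is not covered here.

```shell
#!/bin/sh
# parse_prefix: reads log lines on stdin, prints "month day hour" as numbers.
# Only the "jul 23 0:00:47" prefix is handled; helper name is hypothetical.
parse_prefix() {
  awk '
    BEGIN {
      # build the month-name lookup, same idea as $month_dict
      n = split("jan feb mar apr may jun jul aug sep oct nov dec", names, " ")
      for (i = 1; i <= n; i++) month_dict[names[i]] = i
    }
    {
      mon = tolower($1)
      if (!(mon in month_dict)) next      # skip lines without a month prefix
      split($3, hms, ":")                 # $3 looks like 0:00:47
      print month_dict[mon], $2 + 0, hms[1] + 0
    }'
}

printf '%s\n' 'jul 23 0:00:47 [info] rest-of-line' | parse_prefix
```

For the sample line this prints `7 23 0`, which corresponds to the `month`, `day`, and `hour` columns of `glogin_daily`.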

2. Find the new users

1) Deduplicate user logins by month

    create table distinct_login_monthly_tmp_07 as
    select qid, year, month
    from glogin_daily
    group by qid, year, month;

2) Count the months in which each user logged in

    create table login_stat_monthly_tmp_07 as
    select qid, count(1) as c
    from distinct_login_monthly_tmp_07
    where year < 2013 or (year = 2013 and month <= 7)
    group by qid;

Platform-level new users:

1) Find the users whose login-month count is 1.

2) Check whether these users appear in July; if they do, list every src they logged in from.

    create table new_player_monthly_07 as
    select distinct a.qid, b.src, b.logintime
    from (select qid from login_stat_monthly_tmp_07 where c = 1) a
    join (select qid, loginfrom as src, logintime
          from glogin_daily
          where month = 7 and year = 2013) b
      on a.qid = b.qid;

Find the src of the earliest login:

    add file /home/game/lvbenwei/load_login/get_player_src.php;
    create table new_player_src_07 as
    select transform (t.qid, t.src, t.logintime) using 'php ./get_player_src.php'
    as (qid, src, logintime)
    from (select * from new_player_monthly_07 order by qid, logintime) t;

Here get_player_src.php is:

    $fr = fopen("php://stdin", "r");
    $curr_qid = null;
    $curr_src = null;
    $curr_logintime = null;
    while(!feof($fr))
    } fclose ($fr);
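Because the rows reach the script sorted by qid and then logintime, the script only has to emit the first row it sees for each qid. The sketch below shows that streaming logic with awk rather than PHP. It assumes tab-separated `qid`, `src`, `logintime` columns (Hive's default delimiter for TRANSFORM input); the helper name `first_src_per_qid` and the sample rows are invented for illustration.

```shell
#!/bin/sh
# first_src_per_qid: stdin is tab-separated "qid src logintime" rows,
# already sorted by qid and then logintime. Prints the first row per qid,
# i.e. the earliest login. Helper name is hypothetical.
first_src_per_qid() {
  awk -F '\t' '
    NR == 1 || $1 != curr_qid {   # a new qid starts: its first row is the earliest
      print
      curr_qid = $1
    }'
}

# Sample rows (made up): 1001 logs in twice, 1002 once.
printf '1001\tchan_a\t20130701080000\n1001\tchan_b\t20130705090000\n1002\tchan_b\t20130702100000\n' |
  first_src_per_qid
```

For the sample input, the second 1001 row is dropped, leaving one row per user with the earliest channel.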

Number of platform-level new users:

    select count(*) from new_player_src_07;

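Platform-level new users per channel:

    -- Read from new_player_src_07 (one row per user, earliest src) so that
    -- each new user is counted once, under the channel of the first login.
    create table new_player_src_stat_07 as
    select src, count(*) as new_users
    from new_player_src_07
    group by src;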
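The grouping can be checked on a small sample outside Hive. This sketch counts rows per `src` in tab-separated `qid`, `src`, `logintime` input that holds one row per user. The helper name `count_by_src` and the sample data are assumptions for illustration.

```shell
#!/bin/sh
# count_by_src: stdin is tab-separated "qid src logintime", one row per user.
# Prints "src<TAB>count" sorted by src. Helper name is hypothetical.
count_by_src() {
  awk -F '\t' '
    { users[$2]++ }                                   # tally users per channel
    END { for (s in users) printf "%s\t%d\n", s, users[s] }' |
    sort
}

# Sample rows (made up): two users from chan_a, one from chan_b.
printf '1001\tchan_a\t20130701080000\n1002\tchan_b\t20130702100000\n1003\tchan_a\t20130703110000\n' |
  count_by_src
```

The per-channel counts add up to the total number of input rows, which should match the platform-level new-user count when the input has one row per user.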
