Big Data Project 3

2021-10-16 17:19:52

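GMV: the total amount of all orders submitted today, whether or not they have been paid.

Site-wide PV: page views. Opening a page once counts as one PV, and refreshing it counts as another PV.

Site-wide UV: the deduplicated total number of visitors.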
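To make the three definitions concrete, here is a minimal Python sketch. The sample records and field names (`amount`, `paid`, `guid`, `page`) are made up for illustration and are not from the project's data.

```python
# Hypothetical sample data; field names are assumptions for illustration only.
orders = [
    {"order_id": "o1", "amount": 120.0, "paid": True},
    {"order_id": "o2", "amount": 80.0, "paid": False},  # unpaid orders still count toward GMV
    {"order_id": "o3", "amount": 50.5, "paid": True},
]
page_views = [
    {"guid": "u1", "page": "/home"},
    {"guid": "u1", "page": "/home"},  # a refresh is one more PV
    {"guid": "u2", "page": "/item/1"},
    {"guid": "u3", "page": "/home"},
    {"guid": "u2", "page": "/cart"},
]


def gmv(orders):
    """Sum of all submitted order amounts, paid or not."""
    return sum(o["amount"] for o in orders)


def pv(views):
    """Every page view counts, including repeats."""
    return len(views)


def uv(views):
    """Number of distinct visitors."""
    return len({v["guid"] for v in views})


print(gmv(orders), pv(page_views), uv(page_views))
```

Note that UV deduplicates on the visitor id only, so `u1` viewing the same page twice adds two PV but one UV.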

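-- Run MapReduce locally
set mapreduce.framework.name=local;
-- Run MapReduce on YARN
set mapreduce.framework.name=yarn;
-- Enable vectorized execution
set hive.vectorized.execution.enabled=true;
-- Disable vectorized execution
set hive.vectorized.execution.enabled=false;

-- For the query below: local mode, vectorized execution off
set mapreduce.framework.name=local;
set hive.vectorized.execution.enabled=false;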

-- The with clause works like giving this subquery an alias (temp)
with temp as
(select
  guid,
  -- new session id (a new session starts after a gap of more than 30 minutes)
  newsessionid as session_id,
  -- start time
  `timestamp` as ts,
  -- new/returning visitor flag
  isnew as isnew,
  -- event id
  eventid as eventid,
  -- first_value takes the first row of each partition.
  -- Open a window over guid and newsessionid and take the page of the earliest (first) pageview event.
  -- A null sort key would sort first, so non-pageview events get a very large key instead:
  -- they sort last and are never picked.
  first_value(properties['pageid']) over (partition by guid, newsessionid order by if(eventid = 'pageview', `timestamp`, 20000000000000)) as start_page,
  -- Same idea, but ordered desc so the first row is the latest pageview (the maximum);
  -- non-pageview events get key 0 and sort last.
  first_value(properties['pageid']) over (partition by guid, newsessionid order by if(eventid = 'pageview', `timestamp`, 0) desc) as end_page,
  -- first province
  first_value(province) over (partition by guid, newsessionid order by `timestamp`) as province,
  -- first country
  first_value(country) over (partition by guid, newsessionid order by `timestamp`) as country,
  -- first city
  first_value(city) over (partition by guid, newsessionid order by `timestamp`) as city,
  -- first district
  first_value(region) over (partition by guid, newsessionid order by `timestamp`) as region,
  -- first device type
  first_value(devicetype) over (partition by guid, newsessionid order by `timestamp`) as device_type,
  -- first operating system
  first_value(osname) over (partition by guid, newsessionid order by `timestamp`) as os_name,
  -- first release channel
  first_value(releasechannel) over (partition by guid, newsessionid order by `timestamp`) as release_ch
from
where
  dt = '2021-01-10'
)(dt = '2021-01-10');

select
  guid,
  session_id,
  min(ts) as start_time,
  max(ts) as end_time,
  min(start_page) as start_page,
  min(end_page) as end_page,
  -- count only pageview events
  count(if(eventid = 'pageview', 1, null)) as pv_cnt,
  min(isnew) as isnew,
  -- Dropping the last three digits of the timestamp (milliseconds to seconds) lets from_unixtime
  -- turn it into a datetime such as 2005-03-18 01:58:31; hour() then extracts the hour, 01.
  hour(from_unixtime(min(cast(ts / 1000 as bigint)))) as hour_range,
  min(country) as country,
  min(province) as province,
  min(city) as city,
  min(region) as region,
  min(device_type) as device_type,
  min(os_name) as os_name,
  min(release_ch) as release_ch
from temp
group by guid, session_id;
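The sentinel ordering behind start_page and end_page in the query above can be tried out on a tiny example. This is only a sketch in Python, not the Hive implementation: the event list and the helper name `entry_and_exit_page` are made up, while the sentinel values 20000000000000 and 0 are the ones used in the query.

```python
# Same sentinel as the query: larger than any realistic millisecond timestamp.
BIG = 20_000_000_000_000


def entry_and_exit_page(events):
    """Return (start_page, end_page) for one session.

    events: list of dicts with 'eventid', 'timestamp' (ms) and 'pageid'.
    """
    # Ascending order: pageviews keep their timestamp, everything else gets BIG
    # and therefore sorts last, so the minimum is the earliest pageview.
    first = min(
        events,
        key=lambda e: e["timestamp"] if e["eventid"] == "pageview" else BIG,
    )
    # Descending order: non-pageviews get 0 and sort last,
    # so the maximum is the latest pageview.
    last = max(
        events,
        key=lambda e: e["timestamp"] if e["eventid"] == "pageview" else 0,
    )
    return first["pageid"], last["pageid"]


# Hypothetical session: the first and last events are not pageviews.
session = [
    {"eventid": "launch", "timestamp": 1000, "pageid": None},
    {"eventid": "pageview", "timestamp": 2000, "pageid": "home"},
    {"eventid": "click", "timestamp": 2500, "pageid": None},
    {"eventid": "pageview", "timestamp": 3000, "pageid": "detail"},
    {"eventid": "pageview", "timestamp": 4000, "pageid": "cart"},
    {"eventid": "exit", "timestamp": 5000, "pageid": None},
]

print(entry_and_exit_page(session))
```

Without the sentinels, the "launch" and "exit" events would win the ordering and both pages would come back empty. If a session contains no pageview at all, every row gets the same sentinel key and the result is whichever row happens to come first.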

(
  guid string,
  session_id string, -- session id
  start_time bigint,
  end_time bigint,
  start_page string, -- entry page
  end_page string, -- exit page
  pv_cnt int, -- number of pages visited
  isnew string, -- new user
  hour_range int, -- hour of day
  country string,
  province string,
  city string,
  region string,
  device_type string, -- phone model
  os_name string, -- phone brand
  releasechannel string
)
partitioned by (dt string)
stored as parquet
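As a rough check of what one row of this table holds, the sketch below builds the time-related columns (start_time, end_time, pv_cnt, hour_range) for a single session in Python. The events and the helper name `summarize_session` are hypothetical. The hour is computed in UTC here, which is an assumption: Hive's from_unixtime uses the cluster's time zone, so the hour may differ there.

```python
from datetime import datetime, timezone


def summarize_session(events):
    """Aggregate one session's events into the time-related columns."""
    timestamps = [e["timestamp"] for e in events]  # milliseconds
    start_ms = min(timestamps)
    # Milliseconds to seconds (drop the last three digits), then take the hour.
    hour_range = datetime.fromtimestamp(start_ms // 1000, tz=timezone.utc).hour
    return {
        "start_time": start_ms,
        "end_time": max(timestamps),
        "pv_cnt": sum(1 for e in events if e["eventid"] == "pageview"),
        "hour_range": hour_range,
    }


# Hypothetical session starting at 2021-01-10 01:58:31 UTC.
events = [
    {"eventid": "pageview", "timestamp": 1610243911000},
    {"eventid": "click", "timestamp": 1610243971000},
    {"eventid": "pageview", "timestamp": 1610244031000},
]

summary = summarize_session(events)
print(summary)
```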
