Hive Project in Practice, Part 2

2021-08-30 15:29:09

Creating the tables

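Four tables are needed here, even though there are only two data files. The reason is that the final tables are stored as ORC rather than in the default textfile format. LOAD DATA only moves files into the table directory and does not convert them, so plain-text files cannot be loaded directly into an ORC table. The data has to be written with INSERT ... SELECT, that is, by inserting rows queried from another table. This gives four tables: two textfile tables, user and video, which receive the raw files, and two ORC tables, user_orc and video_orc, which are filled from them.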

1. First create the textfile tables

-- Video data. category and relatedid are "&"-separated lists.
-- Hive requires an element type for array columns; string is used here.
create table video(
    videoid string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedid array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as textfile;

-- Uploader data. user is a reserved word in Hive 1.2 and later, hence the backticks.
create table `user`(
    uploader string,
    videos int,
    friends int)
row format delimited
fields terminated by "\t"
stored as textfile;

Load the data files from HDFS into the two tables:

load data inpath 'HDFS path of the video data file' into table video;
load data inpath 'HDFS path of the user data file' into table `user`;

2. Create the two ORC tables

create table video_orc(
    videoid string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedid array<string>)
clustered by (uploader) into 8 buckets
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as orc;

create table user_orc(
    uploader string,
    videos int,
    friends int)
clustered by (uploader) into 24 buckets
row format delimited
fields terminated by "\t"
stored as orc;

Fill the two ORC tables from the textfile tables:

insert into table video_orc select * from video;
insert into table user_orc select * from `user`;

The data is now loaded into both ORC tables and can be checked with a simple query:

select * from video_orc limit 10;
select * from user_orc limit 10;
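Both ORC tables are declared with clustered by (uploader) into N buckets. On Hive 1.x and earlier, an insert only honours the bucket definition when hive.enforce.bucketing is switched on in the session before the insert ... select statements run. From Hive 2.0 on, the property was removed and bucketing is always enforced. The following is a minimal sketch, assuming an older Hive version; which version the original post used is not stated.

```sql
-- Assumption: Hive 1.x or earlier. On Hive 2.0+ this property no longer exists
-- and bucketing is always enforced, so this line should be dropped there.
set hive.enforce.bucketing = true;

-- After the insert ... select statements have run, inspect the tables.
-- describe formatted lists the InputFormat/OutputFormat, the SerDe,
-- the number of buckets and the bucket columns.
describe formatted video_orc;
describe formatted user_orc;
```

If the output shows the ORC input and output formats together with the expected bucket counts (8 and 24), the tables were created as intended.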

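Top 20 categories by video count

a. Explode the category array so that each (video, category) pair becomes one row; call the result t1

select videoid, category_name
from video_orc lateral view explode(category) t_category as category_name;

b. Group by category and count (t1 stands for the result of step a)

select category_name, count(1) as cnt from t1 group by category_name order by cnt desc limit 20;

Putting the two steps together, with t1 as a subquery:

select
    category_name as category,
    count(1) as cnt
from
    (select videoid, category_name from video_orc lateral view explode(category) t_category as category_name) t1
group by
    category_name
order by
    cnt desc
limit 20;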
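To see what lateral view explode does on its own, it can help to run it against a single hand-written row instead of the real table. The sketch below assumes Hive 0.13 or later (which allows a select without a from clause); the row values 'v1', 'Music' and 'Comedy' and the alias demo are made up for illustration.

```sql
-- One hypothetical video that belongs to two categories.
select videoid, category_name
from (select 'v1' as videoid, array('Music', 'Comedy') as category) demo
lateral view explode(category) t_category as category_name;
-- Produces one row per array element: (v1, Music) and (v1, Comedy).
```

Because every video is repeated once per category, a later count(1) grouped by category_name counts videos per category rather than rows of the original table.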

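Categories of the 20 most-viewed videos

a. Take the 20 most-viewed videos together with their categories; call the result t1

select videoid, category from video_orc order by views desc limit 20;

b. Explode the categories of t1

select videoid, category_name from t1 lateral view explode(category) t_category as category_name;

Combined into one query that counts how many of the top 20 videos fall into each category:

select category_name, count(*) as cnt
from (
    select category_name
    from (select videoid, category from video_orc order by views desc limit 20) t1
    lateral view explode(category) t_category as category_name
) t2
group by category_name
order by cnt;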
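The intermediate results named t1 and t2 in the steps above can also be written as common table expressions, which keeps each step readable on its own. This is only an alternative sketch, assuming Hive 0.13 or later (where the with clause is available); the descending sort at the end is a choice made here so the largest category comes first.

```sql
with t1 as (
    -- step a: the 20 most-viewed videos with their category arrays
    select videoid, category
    from video_orc
    order by views desc
    limit 20
),
t2 as (
    -- step b: one row per (video, category) pair
    select category_name
    from t1
    lateral view explode(category) t_category as category_name
)
select category_name, count(*) as cnt
from t2
group by category_name
order by cnt desc;
```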

Top 10 videos in each category

a. Explode the categories; call the result t1

select videoid, category_name, views
from video_orc lateral view explode(category) t_category as category_name;

b. Use the row_number() window function to number the videos within each category by views, then keep numbers 1 to 10

select t2.*
from (
    select category_name, views, videoid,
           row_number() over (partition by category_name order by views desc) as rk
    from (select videoid, category_name, views from video_orc lateral view explode(category) t_category as category_name) t1
) t2
where t2.rk <= 10;
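row_number() always returns exactly one row per position, even when two videos have the same view count. Hive also provides rank() and dense_rank(), which treat ties differently. The sketch below puts the three side by side so the difference can be inspected on the real data; the column aliases rn, rnk and drnk are chosen here for illustration.

```sql
select category_name, videoid, views,
       -- 1, 2, 3, 4 ... regardless of ties
       row_number() over (partition by category_name order by views desc) as rn,
       -- tied rows share a rank and the following rank is skipped: 1, 2, 2, 4
       rank()       over (partition by category_name order by views desc) as rnk,
       -- tied rows share a rank and nothing is skipped: 1, 2, 2, 3
       dense_rank() over (partition by category_name order by views desc) as drnk
from (
    select videoid, category_name, views
    from video_orc lateral view explode(category) t_category as category_name
) t1;
```

Filtering on rn gives at most 10 rows per category, while filtering on rnk or drnk can return more than 10 when view counts tie.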

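Top 10 videos in the animals category

a. Explode the categories; call the result t1

select videoid, category_name, views
from video_orc lateral view explode(category) t_category as category_name;

b. Filter t1 on the category name and take the 10 most-viewed videos

-- The category name is a string literal and must be quoted.
-- lower() makes the match independent of how the category is capitalised in the data.
select videoid, views
from (select videoid, category_name, views from video_orc lateral view explode(category) t_category as category_name) t1
where lower(t1.category_name) = 'animals'
order by views desc
limit 10;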
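When only a single category is of interest, the explode step is not strictly needed: Hive's array_contains function can test the category array directly. This is a sketch of that alternative, not the approach used in the post. The literal 'Animals' is an assumption about how the category is spelled in the data, and the comparison is case-sensitive.

```sql
-- Assumption: the category is stored exactly as 'Animals'; adjust to the real spelling.
select videoid, views
from video_orc
where array_contains(category, 'Animals')
order by views desc
limit 10;
```

This avoids multiplying rows before filtering, at the cost of requiring an exact match on the stored value.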
