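Hive Knowledge Review

1. order by, sort by, distribute by, cluster by

Background table structure

The explanations below share one running example, so we need a scenario with a matching table structure and sample data. The table has 3 fields: personid identifies a person, company is the name of a company, and money is that company's annual profit (unit: 10,000 RMB).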

personid	company	money
p1	Company1	100
p2	Company2	200
p1	Company3	150
p3	Company4	300

Create the table and load the data:

create table company_info(
personid string,
company string,
money float
)
row format delimited fields terminated by "\t";

load data local inpath 'company_info.txt' into table company_info;

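1. order by

order by in Hive sorts the whole result set, so the output is globally ordered; to guarantee this, all rows pass through a single reducer.

Example: sort by money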

select * from company_info order by money desc;

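2. sort by

The sort by clause in Hive sorts data locally within each reducer: the rows handled by each reducer are ordered, but the overall result is not guaranteed to be globally ordered.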
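The local ordering described above can be sketched as follows. This is a minimal example that assumes the company_info table from the background section; the reducer count of 3 is an arbitrary value chosen for illustration, and mapreduce.job.reduces is the Hadoop property commonly used to set it.

```sql
-- Assumption: force 3 reducers so that local vs. global ordering is visible.
set mapreduce.job.reduces=3;

-- Each reducer's output is sorted by money in descending order,
-- but rows from different reducers are not ordered relative to each other.
select * from company_info sort by money desc;
```

With only 4 sample rows the effect is small, but on a larger table each output file would be sorted on its own while the files together are not.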

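3. distribute by

distribute by in Hive controls how rows are divided among reducers: rows with the same value in the distribute by column are sent to the same reducer. It is typically combined with sort by.

Example: group rows by person (personid) and sort each group by money.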

select * from company_info distribute by personid sort by personid, money desc;

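4. cluster by

cluster by in Hive is equivalent to distribute by plus sort by when both use the same column. In addition, the column named in cluster by can only be sorted in ascending order, the default; asc or desc cannot be specified.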

Example: a query that is equivalent to distribute by combined with sort by

select * from company_info distribute by personid sort by personid;

is equivalent to

select * from company_info cluster by personid;
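Because cluster by takes no asc or desc keyword, a descending sort on the same column has to be written in the longer form. A minimal sketch, again assuming the company_info table above:

```sql
-- cluster by personid sorts personid in ascending order only.
-- For descending order, spell out distribute by and sort by separately:
select * from company_info distribute by personid sort by personid desc;
```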

2. Rows to columns and columns to rows (UDAF and UDTF)

1. Rows to columns

Table structure:

name	constellation	blood_type
孫悟空	Aries	a
大海	Sagittarius	a
宋宋	Aries	b
豬八戒	Aries	a
鳳姐	Sagittarius	a

Create the table and load the data:

create table person_info(
name string,
constellation string,
blood_type string)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/person_info.tsv' into table person_info;

Example: group together people who share the same constellation and blood type

select
t1.base,concat_ws('|', collect_set(t1.name)) name
from
(select
name,concat(constellation, ",", blood_type) base
from
person_info) t1
group by t1.base;
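The same grouping can probably be written without the subquery by grouping on the two source columns directly. This is a sketch, not a replacement for the query above; it assumes neither column contains a comma, so that grouping by the two columns and grouping by the concatenated string give the same groups.

```sql
-- Group on the raw columns and build the combined key in the select list.
select
  concat(constellation, ',', blood_type) as base,
  concat_ws('|', collect_set(name)) as name
from person_info
group by constellation, blood_type;
```

collect_set is the UDAF here: it gathers the names of each group into an array without duplicates, and concat_ws joins that array into one string.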

2. Columns to rows

Table structure:

movie	category
《Person of Interest》	suspense,action,sci-fi,drama
《Lie to Me》	suspense,crime,action,psychological,drama
《Wolf Warrior 2》	war,action,disaster

Create the table and load the data:

create table movie_info(
movie string,
category array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";

load data local inpath '/opt/module/datas/movie_info.tsv' into table movie_info;

Example: expand the array of categories for each movie into separate rows

select
movie,category_name
from
movie_info lateral view explode(category) table_tmp as category_name;
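Once the array is exploded, the generated rows can be aggregated like any other column. A minimal sketch, assuming the movie_info table above; the alias movie_count is introduced here for illustration.

```sql
-- Count how many movies fall into each category.
select
  category_name,
  count(*) as movie_count
from
  movie_info lateral view explode(category) table_tmp as category_name
group by category_name;
```

explode is the UDTF: it emits one row per array element, and lateral view joins those rows back to the movie they came from.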

3. Array operations

"fields terminated by": the delimiter between fields.

"collection items terminated by": the delimiter between the elements (items) inside a single field.
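After a column has been declared as an array with these delimiters, its elements can be addressed directly. The sketch below assumes the movie_info table from the previous section and uses only built-in functions; the column aliases are illustrative.

```sql
select
  movie,
  category[0] as first_category,           -- array indexes start at 0
  size(category) as category_count,        -- number of elements in the array
  sort_array(category) as sorted_category  -- elements in ascending order
from movie_info;
```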

4. ORC storage

ORC, the Optimized Row Columnar file format, evolved from RCFile. It provides an efficient way to store data in Hive and improves the performance of reading, writing, and processing data.
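A table is stored as ORC by adding stored as orc to its definition. The sketch below assumes the company_info text table from section 1; the table name company_info_orc is hypothetical.

```sql
-- Hypothetical ORC-backed copy of company_info.
create table company_info_orc(
  personid string,
  company string,
  money float
)
stored as orc;

-- ORC files are written by Hive itself, so populate the table with a query
-- rather than loading the raw text file.
insert into table company_info_orc
select personid, company, money from company_info;
```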

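5. Hive bucketing

Hive can further organize a table, or a partition of a table, into buckets in order to:

1. make data sampling more efficient

2. make data processing more efficient

Bucketing works by hashing a specified column: the data under that column is split into a set of buckets, and each bucket corresponds to one storage file.

1. Bucketing a table directly

Before starting, set the hive.enforce.bucketing property to true so that Hive enforces bucketing when data is inserted.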

create table music(
id int,
name string,
size float)
clustered by (id) into 4 buckets
row format delimited fields terminated by "\t";

This statement splits the music table into 4 buckets by id. When data is inserted, 4 reduce tasks run and 4 files are written.
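The sampling benefit mentioned earlier can be sketched with tablesample. This assumes the music table has been populated so that its 4 buckets actually exist.

```sql
-- Read only bucket 1 of the 4 buckets on id, roughly a quarter of the table.
select * from music tablesample(bucket 1 out of 4 on id);
```

Because the sample clause names the same column and bucket count as the table definition, Hive can in principle read a single bucket file instead of scanning the whole table.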

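2. Bucketing within partitions

When the data volume is so large that a huge number of partitions would be needed, consider buckets: too many partitions can overwhelm the file system, and bucketed data can be queried more efficiently in some cases. The bucket a row ends up in is determined by the hash of the value in the clustered by column, modulo the number of buckets. Hashing spreads the rows out, but it does not guarantee that every bucket holds the same amount of data.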
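The hash-and-modulo rule can be approximated in a query. This is only an illustration that assumes the music table defined earlier: hash() and pmod() are built-in functions, but the hash Hive uses internally for bucketing may differ in detail, so bucket_no should not be taken as the exact file a row is written to.

```sql
-- Approximate bucket number for each row (4 buckets on id).
select id, pmod(hash(id), 4) as bucket_no from music;

-- Rows per bucket: the counts depend on the id values and may be uneven.
select pmod(hash(id), 4) as bucket_no, count(*) as row_count
from music
group by pmod(hash(id), 4);
```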

create table music2(
id int,
name string,
size float)
partitioned by (`date` string)
clustered by (id) sorted by (size) into 4 buckets
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/music.txt' into table music2 partition(`date`='2017-08-30');
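One caveat: load data only moves the file into the partition directory and does not redistribute rows into buckets. A common way to get properly bucketed files is to load into a plain staging table first and then insert from it. The sketch below assumes a hypothetical staging table named music_staging.

```sql
-- Hypothetical plain-text staging table with the same columns as music2.
create table music_staging(
  id int,
  name string,
  size float
)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/music.txt' into table music_staging;

-- Needed on older Hive versions; newer versions are understood to
-- enforce bucketing by default.
set hive.enforce.bucketing=true;

-- The insert runs reducers that hash id and write one file per bucket.
insert overwrite table music2 partition(`date`='2017-08-30')
select id, name, size from music_staging;
```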
