Hive Basics (Part 2): Data Manipulation and Queries

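Hive does not verify that the data being loaded matches the table's schema (you have to check that yourself), but it does check that the file format matches the table definition: if the table was created as sequencefile, the files loaded into it must also be in sequencefile format.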

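Load data into a table from the local file system

load data local inpath 'path' into table table_name;

Load data from the local file system, using overwrite to replace the table's existing data

load data local inpath 'path' overwrite into table table_name;

Load data from the local file system, using overwrite and targeting a date partition

load data local inpath 'path' overwrite into table table_name partition (dt='2019-11-11');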

Load data into a table from HDFS

load data inpath 'path' into table table_name;

Load data from HDFS, using overwrite to replace the table's existing data

load data inpath 'path' overwrite into table table_name;

Load data from HDFS, using overwrite and targeting a date partition

load data inpath 'path' overwrite into table table_name partition (dt='2019-11-11');

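Note that with the local keyword the data is copied to the target location, whereas without local the data is moved there. Hive assumes by default that users do not want multiple duplicate copies of the same file in the distributed file system.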

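When inserting from a query, the partition keyword specifies the partition to create

-- if table2 is itself partitioned by dt, list its non-partition columns instead of *
insert overwrite table table1
partition (dt='2019-11-11')
select * from table2
where dt='2019-11-11';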

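Dynamic partition inserts

When there are many partitions, specifying them one at a time is tedious; a dynamic partition insert handles them in a single statement.

Dynamic partitioning has to be enabled first

set hive.exec.dynamic.partition = true;

set hive.exec.dynamic.partition.mode = nonstrict;

Using a dynamic partition insert:

Insert the user_id values from table2 for 11-01 through 11-11 into table1, partitioned by date

insert overwrite table table1 partition (dt)
select user_id, dt from table2
where dt between '2019-11-01' and '2019-11-11';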
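To make the dynamic partition insert easier to try end to end, here is a minimal sketch. The table names events_src and events_by_day and their single-column schemas are assumptions made for illustration; only user_id and dt come from the text above.

```sql
-- Hypothetical source and target tables, both partitioned by dt.
CREATE TABLE IF NOT EXISTS events_src (user_id STRING) PARTITIONED BY (dt STRING);
CREATE TABLE IF NOT EXISTS events_by_day (user_id STRING) PARTITIONED BY (dt STRING);

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The dynamic partition column (dt) goes last in the SELECT list:
-- Hive matches it to PARTITION (dt) by position, not by name.
INSERT OVERWRITE TABLE events_by_day PARTITION (dt)
SELECT user_id, dt
FROM events_src
WHERE dt BETWEEN '2019-11-01' AND '2019-11-11';

-- List the partitions that the insert created.
SHOW PARTITIONS events_by_day;
```

One partition is created for each distinct dt value returned by the query, so the date filter also bounds how many partitions a single statement can create.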

A table can also be created directly from a query; in practice this is often used to build temporary tables

create table tmp as
select user_id, dt, hour
from table1;

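If the data is already in the format you need, simply copy the files from HDFS.

If it is not, follow the example below; Hive serializes every field to a string and writes it to the output files.

insert overwrite local directory 'yourpath'
select user_id, name, dt, hour
from yourtable;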
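By default the exported fields are separated by Hive's default delimiter (the Ctrl-A character), which is awkward to read in other tools. The sketch below shows one way to choose the delimiter explicitly. The directory /tmp/hive_export is a hypothetical path, and the ROW FORMAT clause on a directory insert assumes Hive 0.11 or later.

```sql
-- /tmp/hive_export is a hypothetical path. OVERWRITE replaces whatever
-- is already in that directory, so point it at a scratch location.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive_export'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT user_id, name, dt, hour
FROM yourtable;
```

The result is one or more plain-text files in the directory, typically one per task that wrote output.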

array

Array indexes start at 0 and use the array[index] syntax; referencing an element that does not exist returns null

select user_info[0] from user_detail;

map

Uses the same [...] syntax as array, but with the key instead of an index

select user_info["location"] from user_detail;

struct

Use the dot ( . ) notation

select address.city from user_detail;
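The three access styles can be seen side by side in the sketch below. The table user_detail_demo and its columns are assumptions made for illustration, with one column per collection type so that each accessor has something to act on.

```sql
-- Hypothetical table with one column of each collection type.
CREATE TABLE IF NOT EXISTS user_detail_demo (
  user_id STRING,
  tags    ARRAY<STRING>,
  attrs   MAP<STRING, STRING>,
  address STRUCT<city:STRING, zip:STRING>
);

-- tags[0] is NULL when the array has no first element,
-- attrs['location'] is NULL when the key is absent,
-- address.city reads one field of the struct.
SELECT
  tags[0]           AS first_tag,
  attrs['location'] AS location,
  address.city      AS city
FROM user_detail_demo;
```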

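To guard against numeric overflow or underflow:

1) Use a data type with a wider range, at the cost of more storage.

2) Scale the values, dividing by 10, 100, 1000 and so on, or compute with their logarithms instead.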
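Both options can be sketched in a single query. The table orders and its columns quantity and unit_price_cents (both INT) are hypothetical and serve only to show where the widening and the scaling would go.

```sql
-- Hypothetical table: orders(quantity INT, unit_price_cents INT).
-- total_cents: widen one operand first so the product is computed as BIGINT.
-- price_thousands: scale the value down by a constant factor.
-- log_price: compress the range with the natural logarithm.
SELECT
  CAST(quantity AS BIGINT) * unit_price_cents AS total_cents,
  unit_price_cents / 1000                     AS price_thousands,
  ln(unit_price_cents)                        AS log_price
FROM orders;
```

Widening only one operand is enough, because Hive promotes the other operand to the wider type before multiplying.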

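floor, round and ceil take a double as input. floor and ceil return a bigint, while round returns a double in current Hive releases. These functions are the preferred way to convert floating-point values to integers.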
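A small comparison query makes the differences between these functions and a plain cast visible. The table payments and its column amount (DOUBLE) are assumptions for illustration.

```sql
-- Hypothetical table: payments(amount DOUBLE).
-- floor: largest integer not greater than amount.
-- ceil:  smallest integer not less than amount.
-- round: nearest integer value.
-- cast:  drops the fractional part.
SELECT
  amount,
  floor(amount)          AS amount_floor,
  ceil(amount)           AS amount_ceil,
  round(amount)          AS amount_round,
  CAST(amount AS BIGINT) AS amount_cast
FROM payments;
```

For negative values floor and cast give different results, which is one reason to choose the rounding function explicitly rather than rely on a cast.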

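1) Local mode: a query such as select * from table does not launch a MapReduce job; Hive reads the files under the table's storage directory directly and prints the formatted rows.

2) When the where clause filters only on partition columns, no MapReduce job is launched either.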
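The two cases above look like this in practice. The names your_table and dt are placeholders, and the defaults and exact behaviour of the two settings shown vary between Hive versions, so treat this as a sketch rather than a recommended configuration.

```sql
-- Simple fetches like these can be answered without a MapReduce job.
SELECT * FROM your_table LIMIT 10;
SELECT * FROM your_table WHERE dt = '2019-11-11';

-- Lets more kinds of simple queries run as a direct fetch.
SET hive.fetch.task.conversion = more;

-- Lets Hive decide to run small jobs in local mode automatically.
SET hive.exec.mode.local.auto = true;
```

Prefixing a query with EXPLAIN shows whether the plan contains a MapReduce stage or only a fetch.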

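In most cases Hive launches one MapReduce job for each pair of joined tables, but when three or more tables are joined and the on conditions all use the same join key, only one MapReduce job is launched.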
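The difference is easiest to see by comparing two three-table joins. The tables a, b and c and their columns are hypothetical, and the job counts in the comments assume ordinary reduce-side joins with no map-join conversion.

```sql
-- Both ON clauses join on b.user_id, so the two joins can share one job.
SELECT a.user_id, b.name, c.city
FROM a
JOIN b ON a.user_id = b.user_id
JOIN c ON c.user_id = b.user_id;

-- The second join uses a different key of b (city_id), so a second job is needed.
SELECT a.user_id, b.name, c.city
FROM a
JOIN b ON a.user_id = b.user_id
JOIN c ON c.city_id = b.city_id;
```

Running each statement with EXPLAIN in front shows how many stages Hive actually plans.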

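order by: sorts the result globally. All the data goes through a single reducer, so it can run for a very long time on large data sets.

sort by: sorts only within each reducer, i.e. a local sort. Each reducer's output is guaranteed to be ordered, but the value ranges of different reducers' outputs may overlap.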
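The contrast between the two clauses shows up only when more than one reducer is used. In the sketch below your_table, user_id and client_event_time are placeholder names, and the reducer count of 3 is an arbitrary choice.

```sql
-- Ask for several reducers so that the difference is visible.
SET mapred.reduce.tasks = 3;

-- Global order: a single reducer sorts all rows.
SELECT user_id, client_event_time
FROM your_table
ORDER BY client_event_time;

-- Local order: each reducer sorts only the rows it received.
SELECT user_id, client_event_time
FROM your_table
SORT BY client_event_time;
```

With sort by, each output file is ordered on its own, but concatenating the files does not in general give an ordered result.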

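distribute by controls how the map output is divided among the reducers. Rows with the same value of the distribute by expression are sent to the same reducer, much as with group by. sort by can then be applied to sort the rows inside each reducer.

Distribute by user ID, then sort by client event time

select * from your_table
distribute by user_id
sort by client_event_time;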

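When distribute by and sort by use the same columns, cluster by can replace them with the same effect. cluster by keeps the parallelism of sort by, since it still runs on multiple reducers, but it does not accept asc or desc and always sorts in ascending order. Every row with a given key ends up in one reducer and each reducer's output is sorted; because rows are assigned to reducers by hash, the output files taken together are not necessarily in global order, so use order by when a total order is required.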
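The equivalence can be written out directly. As before, your_table and user_id are placeholder names.

```sql
-- These two queries distribute and sort on the same column, so they are equivalent.
SELECT * FROM your_table
DISTRIBUTE BY user_id
SORT BY user_id;

SELECT * FROM your_table
CLUSTER BY user_id;

-- For a descending sort within each reducer, spell out both clauses.
SELECT * FROM your_table
DISTRIBUTE BY user_id
SORT BY user_id DESC;
```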
