A Comprehensive, In-Depth Look at Hive

Overview

Architecture overview

Installation steps

Development

1: Overview

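Hive is a data warehouse infrastructure built on top of Hadoop. Put simply, it is a data management tool: by writing HQL queries, which resemble ordinary SQL, you can query and process large-scale data, while the submitted statements are executed underneath by MapReduce. In other words, much of what Hive does could also be achieved by writing your own MapReduce programs. Data storage is handled by HDFS.

Hive's core function is its SQL parsing engine. Its tables are in fact just HDFS directories and files, with one directory per table name.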
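The table-to-directory mapping can be sketched in a few lines of Python. This is only an illustration: `table_location` is a hypothetical helper, and `/user/hive/warehouse` is assumed as the warehouse root (it is the default value of `hive.metastore.warehouse.dir`, but a real cluster may configure something else).

```python
import posixpath

# Assumption: the default warehouse root; real clusters may override it.
WAREHOUSE_ROOT = "/user/hive/warehouse"


def table_location(table, database="default", root=WAREHOUSE_ROOT):
    """Return the HDFS directory where a managed table's files would live."""
    table = table.lower()        # Hive table names are case-insensitive
    database = database.lower()
    if database == "default":
        # Tables of the default database sit directly under the root.
        return posixpath.join(root, table)
    # Other databases get their own "<name>.db" directory.
    return posixpath.join(root, database + ".db", table)


print(table_location("pokes"))
print(table_location("student", database="school"))
```

The `school` database is made up for the example. External tables do not follow this layout, since their location is given explicitly when they are created.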

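2: Architecture overview

Architecturally, Hive acts as a SQL engine: it parses the SQL statements submitted by users into MapReduce programs, which the underlying Hadoop cluster then executes.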
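To make the translation from SQL to MapReduce more concrete, here is a minimal, purely conceptual Python sketch of how a query such as `select bar, count(*) from invites where foo > 0 group by bar` breaks down into map, shuffle and reduce steps. The function names and the sample rows are assumptions for illustration; the plan Hive actually generates is more elaborate.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: apply the WHERE filter and emit (group key, 1) pairs."""
    for row in rows:
        if row["foo"] > 0:
            yield row["bar"], 1


def shuffle(pairs):
    """Shuffle: collect all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values of each key, here count(*)."""
    return {key: sum(values) for key, values in groups.items()}


def run_query(rows):
    return reduce_phase(shuffle(map_phase(rows)))


# Hypothetical sample data standing in for the invites table.
invites = [
    {"foo": 1, "bar": "a"},
    {"foo": 2, "bar": "b"},
    {"foo": 0, "bar": "a"},
    {"foo": 5, "bar": "a"},
]
print(run_query(invites))  # {'a': 2, 'b': 1}
```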

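Users can reach Hive through three kinds of interface:

cli: the command line.

jdbc/odbc: access through a database driver, much as you would connect to MySQL.

web gui: access through HTTP pages.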

3: Data types supported by Hive

int

tinyint / smallint / bigint

boolean

float

double

string

binary

timestamp

array, map, struct, union

decimal

4: Some common HQL statements

Load a local file into a table: load data local inpath './home/test.txt' overwrite into table pokes;

Load an HDFS file into a specific partition of a table: load data inpath '/wordcount/output.txt' overwrite into table pokes partition(times = 5);

Query statements

Query the rows that match a condition: select * from student a where a.id = 10;

Query a single column of those rows: select a.first_col from student a where a.id = 10;

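Write query results to an HDFS directory: insert overwrite directory '/home/tmp' select a.* from student a where a.id > 10 and a.ds = 'shanghai'; a partition can be specified in the where clause, as ds is here.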

Read data from one table and insert it into another: from invites a insert overwrite table events select a.bar, count(*) where a.foo > 0 group by a.bar;

Join:

from pokes t1 join invites t2 on (t1.bar = t2.bar) insert overwrite table events select t1.bar, t1.foo, t2.foo; this form is powerful and quite flexible to use.

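Create a new table (1): create table student (name string, age int, *** int) row format delimited fields terminated by ',' stored as textfile;

Create a new table (2): create table student (name string, age int, *** int) row format delimited fields terminated by ';' lines terminated by '\n'; by default records are split by line, so the line terminator need not be given unless you have special requirements. The field delimiter within each row, on the other hand, should be specified whenever the data does not use Hive's default delimiter (Ctrl-A, '\001').

Create a table with comments: create external table student (name string comment 'the name of student', age int comment 'the age of student', *** int) comment 'this is a student basic information table' row format delimited fields terminated by ',' stored as textfile location 'hdfs://localhost:9000';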

Modifying Hive tables

Add a new column: alter table pokes add columns(address string);

Add a column together with a comment: alter table pokes add columns(address string comment 'address comment');

Drop a table: drop table pokes;

Rename a table: alter table pokes rename to another;

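Partitioning and bucketing in Hive

Create a partitioned table: create table tb_partition(id string, name string) partitioned by(month string) row format delimited fields terminated by ';'; this creates a table partitioned by the value of month. Queries can also name a partition, which narrows the range of data scanned and improves performance.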

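Load data into a partitioned table:

load data local inpath '/home/files/dealedlog.out' overwrite into table tb_partition partition(month = '201802'); here local inpath says that a local file is being loaded, overwrite means the data replaces whatever is already in the target partition of tb_partition, and month names the partition, month='201802', that receives the file.

The insert select form: insert into table tb_partition partition(month = '201802') select id, name from name; this statement reads two columns from the name table and inserts them into tb_partition, storing the data in the month='201802' partition.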

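Create multi-level partitions: create table tb_mul_partition (id int, name string) partitioned by(month string, code string) row format delimited fields terminated by ';'; this creates a table with two partition columns, partitioned first by month and then by code. The partitions must also be given when loading data: load data local inpath '/home/files/nameinfo.txt' into table tb_mul_partition partition(month = '201802', code = '1000'); note that when a table declares several partition columns, a load must specify every one of them, otherwise Hive reports an error.

Note: a partition column is not a column of the actual data. It behaves more like a pseudo-column, which is why its type has to be declared separately. At the HDFS level, each partition is a directory, and the partition's files are stored inside it. A query that names a partition therefore only reads that one directory, which makes it much faster.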
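The directory layout behind partitions can be sketched as follows. `partition_path` is a hypothetical helper written for this article, and the table directory under `/user/hive/warehouse` is an assumption (the default warehouse location). It also mirrors the rule that a load must name every partition column.

```python
import posixpath


def partition_path(table_dir, partition_keys, spec):
    """Build the directory of one partition, e.g. .../month=201802/code=1000.

    partition_keys: partition column names in the order they were declared.
    spec: dict mapping every partition column to its value.
    """
    missing = [key for key in partition_keys if key not in spec]
    if missing:
        # Every declared partition column must be given a value.
        raise ValueError("missing partition columns: " + ", ".join(missing))
    # One nested "column=value" directory per partition level.
    parts = ["{}={}".format(key, spec[key]) for key in partition_keys]
    return posixpath.join(table_dir, *parts)


table_dir = "/user/hive/warehouse/tb_mul_partition"  # assumed location
print(partition_path(table_dir, ["month", "code"],
                     {"month": "201802", "code": "1000"}))
```

Calling it with only `{"month": "201802"}` raises a `ValueError`, in the same spirit as the load error described above.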

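Bucketing, by contrast, relies on a column that really exists in the data. Rows are assigned to files by hashing that column and taking the result modulo the number of buckets, so bucketing is finer-grained than partitioning. With three buckets, for example, one dataset is stored as three files.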
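The hash-and-modulo idea can be illustrated with a short Python sketch. It is deliberately simplified: an integer key is used as its own hash value, which is an assumption made for the example, and the hash Hive really applies depends on the column type and the Hive version.

```python
from collections import defaultdict


def bucket_for(key, num_buckets):
    """Pick a bucket by taking the key's hash modulo the bucket count.

    Assumption: the integer key stands in for its own hash value.
    """
    return key % num_buckets


def split_into_buckets(rows, column, num_buckets):
    """Distribute rows over num_buckets lists, one list per bucket file."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    # Return every bucket, including empty ones, in bucket order.
    return {b: buckets[b] for b in range(num_buckets)}


# Hypothetical rows with ids 1..7, split into three buckets.
rows = [{"id": i, "name": "user%d" % i} for i in range(1, 8)]
for bucket, members in split_into_buckets(rows, "id", 3).items():
    print(bucket, [r["id"] for r in members])
# 0 [3, 6]
# 1 [1, 4, 7]
# 2 [2, 5]
```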

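Join operations

Join is one of the most common operations in a traditional database. Hive, as a data warehouse, supports it as well, but only as an equi-join: the join condition must be an equality, and non-equality join conditions are not supported.

Within each MapReduce job of a join, the logic is as follows: the reducer buffers the records of every table except the last one, then streams the last table through and writes the results to the file system. The practical consequence is to put the small tables first and the largest table last, which avoids running out of memory and also improves efficiency.

An example: select a.* from visit_config a join visit_config_new b on (a.id = b.id);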
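The buffering behaviour described above can be sketched in Python. This is a simplified model, not Hive's implementation: `reduce_side_join` is a hypothetical function that buffers the whole earlier table in memory, whereas a real reducer only buffers the rows of the join keys it is responsible for. The sample rows are made up.

```python
from collections import defaultdict


def reduce_side_join(buffered_table, streamed_table, key):
    """Equi-join two lists of dict rows on `key`.

    buffered_table: the earlier (ideally smaller) table, held in memory.
    streamed_table: the last table, read row by row and never buffered.
    """
    buffer = defaultdict(list)
    for row in buffered_table:
        buffer[row[key]].append(row)
    for right in streamed_table:
        # Only the buffered rows with the same key can match.
        for left in buffer.get(right[key], []):
            yield left, right


# Hypothetical contents of visit_config (a) and visit_config_new (b).
visit_config = [{"id": 1, "v": "x"}, {"id": 2, "v": "y"}]
visit_config_new = [{"id": 2, "w": "p"}, {"id": 3, "w": "q"}, {"id": 2, "w": "r"}]

# Equivalent of: select a.* from visit_config a join visit_config_new b on (a.id = b.id)
for a, b in reduce_side_join(visit_config, visit_config_new, "id"):
    print(a)
```

Memory use grows with the buffered table only, which is why the small table goes first.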

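From SQL to HQL

The two are much alike and the basic syntax is very similar. Hive, however, supports embedded MapReduce programs, and its early versions supported neither insert into, update nor delete. This removed the need for a complex locking mechanism and improved efficiency; for the same reason, transactions were not supported either. Later releases added insert into, as used in the partition example above, and row-level update and delete on transactional tables.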
