Impala: Introduction and Basic Usage

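Impala

Impala provides high-performance, low-latency interactive SQL queries over data stored in HDFS and HBase.

It builds on Hive (it reuses the Hive metastore) and computes in memory, which makes it suitable for real-time queries, batch processing, and highly concurrent workloads.

It is a real-time query and analysis engine designed for PB-scale data.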

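Advantages:

1) Computation happens in memory, so intermediate results do not have to be written to disk, which saves a large amount of I/O.

2) Queries are not translated into MapReduce jobs; Impala reads the data in HDFS and HBase directly and schedules the work itself.

3) Its I/O scheduling is aware of data locality: it tries to place the computation on the machine that holds the data, which reduces network overhead.

4) It supports several file formats, such as TextFile, SequenceFile, RCFile, and Parquet.

5) It can access the Hive metastore and therefore analyze Hive data directly.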

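Disadvantages:

1) It depends heavily on memory, and it relies on Hive (its metastore) for table metadata.

2) In practice, performance degrades severely once a table has more than about 10,000 partitions.

3) It can only read the file formats it supports (text and the formats listed above); it cannot read custom binary files directly.

4) Whenever new records or files are added to a table's data directory in HDFS, the table has to be refreshed.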

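Impala SQL statements:

Create an external table (the data lives in HDFS; location names the directory that holds the data files):

create external table tb1
(id int, name string)
row format delimited fields terminated by ','  -- fields in each record are separated by ','
stored as textfile  -- row format delimited applies to text files, not to Parquet
location 'hdfs://ip:8020/encdata/impala/storage/originald/order2';  -- where the table's data is stored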

Create a partitioned table:

create table tb1
(id int, name string)
partitioned by (age int)  -- partition columns must not repeat columns from the column list
stored as parquet
location 'hdfs://ip:8020/encdata/impala/storage/originald/order2';
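To make the effect of partitioned by more concrete: Impala follows the Hive convention of keeping each partition's files in a subdirectory named column=value under the table location. The sketch below only builds such a path in Python. The helper name partition_dir and the sample values are hypothetical and are not part of Impala.

```python
def partition_dir(table_location, partition_spec):
    """Return the directory of one partition in the Hive-style layout.

    partition_spec is a list of (column, value) pairs in the same order
    as the columns in the table's "partitioned by" clause.
    """
    parts = [f"{col}={val}" for col, val in partition_spec]
    return "/".join([table_location.rstrip("/")] + parts)


if __name__ == "__main__":
    # Hypothetical table location and partition value, for illustration only.
    print(partition_dir("/encdata/impala/tb1", [("age", 30)]))
```

Every distinct combination of partition values becomes its own directory, which is one reason very large partition counts become expensive to manage.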

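Insert data into partitions

Static partition (each insert ... values statement writes a separate file, so inserting rows one at a time fills HDFS with small files and can overload the NameNode, in the worst case bringing it down):

insert into tablename partition(year='2013', month='12') values('foo', 'foo');  -- the two values are the non-partition columns of the table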
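One way to soften the small-file problem for static-partition inserts is to send many rows in a single statement, since a values clause can list several rows. The following is a minimal sketch that builds such a statement as a string. The helper names sql_literal and build_batched_insert are hypothetical, and the literal handling is deliberately limited to integers and simple strings. For real bulk loads, insert ... select or load data is usually the better fit.

```python
def sql_literal(value):
    """Render an int or a simple string as a SQL literal."""
    if isinstance(value, bool):
        raise ValueError("booleans are not handled in this sketch")
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        # Refuse quotes and backslashes rather than guess at escaping rules.
        if "'" in value or "\\" in value:
            raise ValueError("string needs escaping; not handled in this sketch")
        return f"'{value}'"
    raise ValueError(f"unsupported type: {type(value).__name__}")


def build_batched_insert(table, partition_spec, rows):
    """Build one static-partition INSERT that carries all rows at once."""
    if not rows:
        raise ValueError("at least one row is required")
    partition = ", ".join(f"{col}={sql_literal(val)}" for col, val in partition_spec)
    values = ", ".join(
        "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    return f"insert into {table} partition({partition}) values {values}"


if __name__ == "__main__":
    print(build_batched_insert(
        "tablename",
        [("year", "2013"), ("month", "12")],
        [("foo", "foo"), ("bar", "bar")],
    ))
```

The table and column names mirror the placeholder names used in the statement above; the second row is made up for illustration.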

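Dynamic partition:

insert into tablename partition(year) select id, year from tmptable;  -- the selected columns must follow the column order of the target table, with the partition columns last; if there are several partition columns, list them in the same order as in partition()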
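The ordering rule for dynamic-partition inserts is easy to get wrong when statements are written by hand, so here is a small sketch that assembles the statement with the partition columns always placed last and in partition() order. The function build_dynamic_insert is a hypothetical helper, not an Impala API, and it assumes the caller passes plain, trusted identifiers.

```python
def build_dynamic_insert(target, source, columns, partition_columns):
    """Build a dynamic-partition INSERT ... SELECT statement.

    columns: regular (non-partition) columns, in target-table order.
    partition_columns: partition columns, in the order used in partition().
    """
    overlap = set(columns) & set(partition_columns)
    if overlap:
        raise ValueError(f"columns listed twice: {sorted(overlap)}")
    # Regular columns first, partition columns at the end of the select list.
    select_list = list(columns) + list(partition_columns)
    return (
        f"insert into {target} partition({', '.join(partition_columns)}) "
        f"select {', '.join(select_list)} from {source}"
    )


if __name__ == "__main__":
    print(build_dynamic_insert("tablename", "tmptable", ["id"], ["year"]))
```

With one regular column and one partition column, this reproduces the statement shown above.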

Rename a table

alter table old_name rename to new_name;

Change where the data files are stored

alter table tablename set location 'hdfs_directory';

Modify columns

alter table tablename add columns (column_defs);  -- add columns

alter table tablename replace columns (column_defs);  -- replace the whole column list

alter table tablename drop column column_name;  -- drop a column

alter table tablename change column_name new_name new_type;  -- rename a column and/or change its type

Load data into a table

load data inpath '/usr/encdata11.txt' into table tablename;  -- the source file must already be in HDFS

Update metadata (the metadata is stored in the Hive metastore)

invalidate metadata;
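A bare invalidate metadata discards the cached metadata of every table, which is expensive on a large catalog. Impala also has a lighter refresh statement for a table it already knows about, for example after new data files have been added to its directory. The sketch below encodes that choice as a rule of thumb. The function metadata_statement and its parameters are hypothetical, and the rule reflects my reading of the Impala documentation rather than a complete decision procedure.

```python
def metadata_statement(table=None, table_known_to_impala=True):
    """Pick a metadata statement, preferring the cheapest one that applies.

    table: table name, or None to reset the metadata of all tables.
    table_known_to_impala: False when the table was created outside Impala
    (for example through Hive) and Impala has not seen it yet.
    """
    if table is None:
        # Most expensive option: drops cached metadata for every table.
        return "invalidate metadata"
    if table_known_to_impala:
        # Typical case after new files were added to an existing table.
        return f"refresh {table}"
    return f"invalidate metadata {table}"


if __name__ == "__main__":
    print(metadata_statement("tb1"))
    print(metadata_statement("tb1", table_known_to_impala=False))
```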

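Check the Impala version: select version();

Impala does not support delete or update on tables stored as plain HDFS files (storage engines such as Kudu are the exception).

Data types supported by Impala (cast converts between types, e.g. cast('111' as int)):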

bigint/smallint/int

boolean

double/real

float

string

timestamp

Creating Impala tables on a Hadoop cluster with high availability (HA)
