Creating and Using External Tables in SparkSQL

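At work we often have to exchange data with peripheral systems. Because those systems are not in the same Hadoop cluster as ours and have no permission to access our system, large-volume interfaces are almost always delivered as files. How can such files be loaded into Spark quickly and conveniently?

Creating an external table in SparkSQL meets this requirement well.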

Note that when the directory to be created in HDFS has several levels, the -p option must be added to the mkdir command.
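As a minimal sketch of that step, the directory used by the table could be created as shown here. The path is taken from the put command and the LOCATION clause in this article; the check for the hdfs client and the follow-up listing are additions for illustration and assume the script runs on a node where the Hadoop client is configured for the target cluster.

```shell
# Target directory; taken from the put command and LOCATION clause in this article.
TABLE_DIR=/hupeng/data/ods_user_base

if command -v hdfs >/dev/null 2>&1; then
    # -p creates the missing parent directories (/hupeng, /hupeng/data) as well
    # and does not fail if the directory already exists.
    hdfs dfs -mkdir -p "$TABLE_DIR"
    # -d lists the directory entry itself rather than its contents.
    hdfs dfs -ls -d "$TABLE_DIR"
else
    echo "hdfs client not found; run this on a node with the Hadoop client installed" >&2
fi
```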

create external table ods_user_base (
    acc_nbr      string comment 'user number',
    product_type string comment 'product type',
    cust_id      string comment 'customer id',
    prd_inst_id  string comment 'product instance id',
    latn_id      string comment 'city',
    area_id      string comment 'district or county'
)
row format delimited
    fields terminated by '|'
    lines terminated by '\n'
stored as textfile
location 'hdfs://streamcluster/hupeng/data/ods_user_base';

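To test whether the external table raises an error when some fields are missing from the data file, I deliberately prepared data with the following content: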
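The original sample is not reproduced in this text, so the following is only a hypothetical stand-in: a small '|'-delimited file named ods_user_data.txt (the name used by the put command in this article) in which the second and third rows have fewer than six fields. All values are invented for illustration. The awk call simply reports how many fields each row carries.

```shell
# Hypothetical sample rows; column order follows the table definition:
# acc_nbr|product_type|cust_id|prd_inst_id|latn_id|area_id
cat > ods_user_data.txt <<'EOF'
13300000001|mobile|C001|P001|0571|057101
13300000002|mobile|C002|P002
13300000003|broadband|C003
EOF

# Report the number of '|'-separated fields on each line.
awk -F'|' '{ print "line " NR ": " NF " fields" }' ods_user_data.txt
```

Rows two and three are the ones that exercise the missing-field case.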

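Note that if a data file already exists in the external table's HDFS directory and needs to be replaced, the -f option can be added to the put command.

The query result shows that when some field values are missing from the data file, the external table displays them as NULL and does not raise an error.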

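1) Data file

This time some rows contain extra values, that is, more fields than the six columns defined for the table.

2) Put the data file into HDFS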

hdfs dfs -put -f ./ods_user_data.txt /hupeng/data/ods_user_base

The put raised no error.

3) View the external table data
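One way to run this check from the command line is sketched here, assuming the spark-sql CLI is installed and connected to the metastore that holds the table. The table name comes from the CREATE EXTERNAL TABLE statement in this article; the LIMIT value is arbitrary.

```shell
# Table name from the CREATE EXTERNAL TABLE statement in this article.
QUERY='SELECT * FROM ods_user_base LIMIT 20'

if command -v spark-sql >/dev/null 2>&1; then
    # -e runs a single SQL statement and exits.
    spark-sql -e "$QUERY"
else
    echo "spark-sql not found; run the query from any SparkSQL session instead" >&2
fi
```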

There was no error here either; the extra values were simply not displayed.
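Both observations (missing trailing fields shown as NULL, surplus fields dropped) can be previewed locally before a file is uploaded. The sketch below is only an awk emulation of that behaviour, assuming the six-column, '|'-delimited layout of the table definition. It is not how Spark or Hive parse rows internally, and preview_six_cols is a hypothetical helper name.

```shell
# Hypothetical helper: print each '|'-delimited row as exactly six columns.
# Missing trailing fields are shown as NULL; fields beyond the sixth are dropped.
preview_six_cols() {
    awk -F'|' -v OFS=',' -v ncols=6 '{
        line = ""; sep = ""
        for (i = 1; i <= ncols; i++) {
            val = (i <= NF) ? $i : "NULL"
            line = line sep val
            sep = OFS
        }
        print line
    }' "$@"
}

# One short row and one row with two surplus fields.
printf '%s\n' 'a|b|c|d' 'a|b|c|d|e|f|g|h' | preview_six_cols
```

The helper reads standard input when called without arguments, or a file when given one, for example preview_six_cols ./ods_user_data.txt.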
