Recursion in Hive: Common Hive Problems

1 Optimizing limit statements

e.g. select * from table_name limit 100;

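In Hive, tables tend to be large, so statements like the one above are optimized by default (set hive.fetch.task.conversion = none turns this optimization off and forces an mr job; the default value is more). With the optimization on, the query is answered by scanning and filtering the files directly instead of submitting an mr job to YARN, which is visible in the execution log. When the table or partition is small and the filter matches many rows, this returns quickly. But when the table or partition is huge, or so few rows match that the limit is never reached, the file scan continues until the limit is satisfied or every file has been read, and the query ends up slower.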

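With a large table or partition: if many rows match, the file scan fills the limit after reading only a few files and returns quickly; if few rows match, dropping the limit lets an mr job scan all the data in parallel, which can also return quickly.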

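If the filter in the where clause has a low hit rate, avoid the limit clause (but always include the partition filter when querying a partitioned table!).

If you must use limit, add: set hive.fetch.task.conversion = none;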

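Fetch task:

A simple query here means one with no functions, sorting, or similar operations. With fetch task conversion enabled, such a query does not generate a MapReduce job; Hive uses a FetchTask to read the data directly from HDFS and output it, which is more efficient.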
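The advice above can be sketched as a per-session toggle. The table name events, the partition column dt with its value, and the filter column user_id are hypothetical; only hive.fetch.task.conversion and its values none and more come from the text.

```sql
-- Print the current value for this session.
set hive.fetch.task.conversion;

-- Selective filter plus limit on a large partition: force a regular job
-- so the scan is not carried out by a single fetch task.
set hive.fetch.task.conversion = none;

select *
from events                 -- hypothetical table
where dt = '2018-08-07'     -- always restrict the partition
  and user_id = 'u_12345'   -- hypothetical low-hit-rate filter
limit 100;

-- Restore the default so simple queries are served by a fetch task again.
set hive.fetch.task.conversion = more;
```

Running explain on the query before and after changing the setting should show whether the plan contains only a fetch stage or a full job stage; the exact plan text varies by Hive version.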

2 case when statement errors

The expressions after when must have consistent types.
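A minimal sketch of the kind of mismatch that tends to trigger this error, and one way to align the types. The table orders and the columns status (assumed string) and amount (assumed numeric) are hypothetical, and whether the commented-out statement actually fails depends on the Hive version.

```sql
-- Mixed types: the when value is an int while status is a string,
-- and the then/else results mix int and string.
-- select case status when 1 then 0 else 'unknown' end from orders;

-- Consistent types: compare string with string, return string in every branch.
select case status
         when '1' then '0'
         else 'unknown'
       end as status_label
from orders;

-- Searched form, with an explicit cast so both branches return string.
select case
         when amount > 0 then cast(amount as string)
         else 'none'
       end as amount_label
from orders;
```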

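3 Fixing misaligned rows in tables created with CTAS

A direct select may look fine, but the problem appears when creating a new table:

create table table_name_new as select ,,, from table_name where dt='';

Querying table_name_new may then show shifted rows or columns. The fix:

create table table_name_new stored as parquet as select ,,, from table_name where dt='';

Reason: table_name_new is stored as textfile, whose default row delimiter is \n and default column delimiter is \001. Any newline inside the data is therefore treated as a row break when the file is parsed. To parse data containing such special characters correctly, store the table as parquet or another format Hive supports; fields with special characters will then no longer break rows or columns.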
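A concrete instance of the fix might look like the sketch below. The column names id and comment_text and the dt value are placeholders chosen for illustration; the row-count comparison is an added sanity check, not something the original describes.

```sql
-- Hypothetical columns and partition value.
create table table_name_new
stored as parquet
as
select id, comment_text
from table_name
where dt = '2018-08-07';

-- The two counts should match. A larger count in a textfile copy of the
-- same data would suggest that embedded newlines split rows.
select count(*) from table_name where dt = '2018-08-07';
select count(*) from table_name_new;

-- Confirm which storage format and serde the new table actually uses.
describe formatted table_name_new;
```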

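4 File format errors when querying a table

Error message:

org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable cannot be cast to org.apache.hadoop.io.BinaryComparable

or org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.BinaryComparable

This happens because the table is stored as rcfile or orcfile, but the serialization/deserialization class set in the table's serde does not match that format. Fix it with the statements below. For a partitioned table, the partitions also need to be rebuilt with alter or msck (Hive's metastore table sds stores the serde and related properties for both tables and partitions).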

-- example where only the serde differs
alter table $.$ set serde 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe';

-- example where field.delim etc. also differ
alter table $.$ set serde 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
with serdeproperties ('field.delim' = '\001', 'serialization.format' = '\001');
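For a partitioned table, the same change can be applied per partition. The sketch below assumes a hypothetical db_name.table_name stored as rcfile and partitioned by dt; describe formatted is used only to inspect what the metastore currently records.

```sql
-- Inspect the serde and input/output format recorded for the table.
describe formatted db_name.table_name;

-- Inspect one partition; it can carry a different serde than the table.
describe formatted db_name.table_name partition (dt = '2018-08-07');

-- Apply the serde to an existing partition.
alter table db_name.table_name partition (dt = '2018-08-07')
set serde 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe';
```

LazyBinaryColumnarSerDe fits an rcfile table; for an orcfile table the serde would instead be the ORC one (org.apache.hadoop.hive.ql.io.orc.OrcSerde in stock Hive). Rebuilding partitions by dropping them and running msck is not shown, because dropping a partition of a managed table also deletes its data.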

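5 Recursive subdirectories when querying a table

When the directory of a (non-partitioned) table contains subdirectories, the mr engine may return no data or only part of it, because by default mr does not read data in subdirectories of the table directory. Example: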

hive> dfs -ls hdfs://xx/detail_tmp1;
Found 5 items
drwxr-xr-x - hive hadoop 0 2018-08-07 14:46 hdfs://xx/detail_tmp1/1
drwxr-xr-x - hive hadoop 0 2018-08-02 15:05 hdfs://xx/detail_tmp1/2
drwxr-xr-x - hive hadoop 0 2018-08-02 15:06 hdfs://xx/detail_tmp1/3
drwxr-xr-x - hive hadoop 0 2018-08-02 15:05 hdfs://xx/detail_tmp1/4
drwxr-xr-x - hive hadoop 0 2018-08-02 15:06 hdfs://xx/detail_tmp1/5

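The table directory contains no files, only subdirectories, and the mr engine does not recurse into subdirectories by default, so the query returns nothing.

Switching the engine with set hive.execution.engine = tez makes the query return data. This is because, under the tez engine, Hive sets mapred.input.dir.recursive and mapreduce.input.fileinputformat.input.dir.recursive to true, so queries recurse into the directories.

If you then switch the engine back to mr in the same session, the query also returns data, because running a statement under tez has already set mapred.input.dir.recursive and mapreduce.input.fileinputformat.input.dir.recursive to true.

Some engines set the two properties mapred.input.dir.recursive and mapreduce.input.fileinputformat.input.dir.recursive to true by default.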
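Rather than relying on the side effect of switching engines, the two properties can be set explicitly while staying on mr. This is a sketch under the assumption that the table backed by hdfs://xx/detail_tmp1 is named detail_tmp1; depending on the Hive version, hive.mapred.supports.subdirectories may also need to be true.

```sql
-- Stay on the mr engine but enable recursive reads explicitly.
set hive.execution.engine = mr;
set mapred.input.dir.recursive = true;
set mapreduce.input.fileinputformat.input.dir.recursive = true;

-- Print the values to confirm what the session is using.
set mapred.input.dir.recursive;
set mapreduce.input.fileinputformat.input.dir.recursive;

-- Assumed table name for the directory in the listing above.
select count(*) from detail_tmp1;
```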
