Hive SQL: a summary of everyday usage

Some notes on the functions and techniques I use most often when doing analysis with Hive SQL.

select uid from dw.today where tunittype like '%wew.%'

select uid from dw.today where tunittype rlike '.*(you|me).*'

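The dot (.) matches any single character, the asterisk (*) repeats the element to its left zero or more times, and (x|y) matches either x or y. These rules apply to rlike; in like, only % and _ are wildcards.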

select uid from dw.today where not tunittype like '%wew.%'
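To make the difference between like and rlike concrete, here is a minimal Python sketch of how the two operators decide a match. The helper names like and rlike are my own, and Python's re module only approximates the Java regular expressions that Hive uses, so treat it as an illustration and not as Hive's implementation.

```python
import re


def rlike(value: str, pattern: str) -> bool:
    """Approximate Hive's rlike: true when any substring of value matches."""
    return re.search(pattern, value) is not None


def like(value: str, pattern: str) -> bool:
    """Approximate SQL like: % is any run of characters, _ is one character,
    and every other character (including the dot) is taken literally."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None


print(rlike("thank you", ".*(you|me).*"))  # True
print(rlike("thank you", "(you|me)"))      # True
print(like("a.wew.b", "%wew.%"))           # True
print(like("a.wewxb", "%wew.%"))           # False
```

Because rlike succeeds as soon as any substring matches, the pattern '(you|me)' selects the same rows as '.*(you|me).*'. In like the dot is an ordinary character, so '%wew.%' only matches values that contain the literal text "wew.".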

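At work I often need to count users within a time interval, so the timestamps have to be rounded into buckets first to get the answer quickly.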

select
distinct from_unixtime(60*30*cast(unix_timestamp("2017-11-11 13:23:23")/(60*30) as bigint), 'yyyy-MM-dd HH:mm:ss')
from test_table

This rounds the time down to 13:00:00, so the row is counted in the bucket that covers 13:00:00 to 13:30:00.

select
distinct from_unixtime(60*10*cast(unix_timestamp("2017-11-11 13:23:23")/(60*10) as bigint), 'yyyy-MM-dd HH:mm:ss')
from test_table

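Summary: an hour has 60 minutes and a minute has 60 seconds, so express the bucket width in seconds and round the timestamp down to a multiple of that width.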
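The same rounding can be checked outside Hive. The sketch below uses a hypothetical helper, floor_to_bucket, to round a timestamp down to a multiple of the bucket width. It assumes the time is interpreted as UTC, whereas Hive's unix_timestamp uses the session time zone.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"


def floor_to_bucket(ts: str, bucket_seconds: int) -> str:
    """Round a 'YYYY-MM-DD HH:MM:SS' string down to the start of its bucket."""
    # Assumption: the string is a UTC time.
    epoch = int(datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc).timestamp())
    # Integer division plays the role of cast(... as bigint) in the Hive query.
    floored = bucket_seconds * (epoch // bucket_seconds)
    return datetime.fromtimestamp(floored, tz=timezone.utc).strftime(FMT)


print(floor_to_bucket("2017-11-11 13:23:23", 60 * 30))  # 2017-11-11 13:00:00
print(floor_to_bucket("2017-11-11 13:23:23", 60 * 10))  # 2017-11-11 13:20:00
```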

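Syntax:

row_number() over (partition by column_a order by expr_b desc) rank

rank is the alias given to the row number; partition by splits the rows into groups, much like partitions in a Hive table; order by sorts them, ascending by default and descending with desc. Here the rows are partitioned by column a and sorted by expression b in descending order.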
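For readers who think more easily in code, this is a small Python sketch of what the window function computes. The function name row_number and the dictionary-based rows are illustrative assumptions, not anything Hive exposes.

```python
from itertools import groupby
from operator import itemgetter


def row_number(rows, partition_key, order_key, descending=True):
    """Number the rows 1, 2, 3, ... inside each partition, following the sort order."""
    # Sort by the ordering expression first, then stably by partition,
    # so the ordering inside each partition is preserved.
    ordered = sorted(rows, key=itemgetter(order_key), reverse=descending)
    ordered.sort(key=itemgetter(partition_key))
    numbered = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        for rank, row in enumerate(group, start=1):
            numbered.append({**row, "rank": rank})
    return numbered


rows = [{"a": "x", "b": 1}, {"a": "x", "b": 3}, {"a": "y", "b": 2}]
for row in row_number(rows, "a", "b"):
    print(row)
```

Keeping only the rows with rank equal to 1 then gives the top row of each partition, which is a common reason for reaching for row_number in the first place.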

Example:

select from_unixtime(unix_timestamp())

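The cast() function converts a string to an integer or a double-precision floating-point number, or performs the reverse conversion.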

select cast(a as double) from table

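That's right: case when is the SQL construct I use most often when labelling data for machine learning. It mainly transforms the values of a single column in the query result.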

create table if not exists dw.huodong_uid_label as
select uid, case when action=0 then 0 else 1 end as label
from zhangxiang.huodong_action_0_2

select sum(a+b+c) as total from table group by uid

Columns a, b and c must all be numeric here.

select uid,
sum(if(hour in (6,7,8,9,10), cast(num_rate_tgi as double), 0))
from table
where pt_dt = '2018-07-18'
group by uid

This computes, for each uid, the cumulative TGI sum from 06:00 up to but not including 11:00.
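The sum(if(...)) pattern is easy to misread, so here is a minimal Python sketch of the same logic. The function name tgi_sum_by_uid and the tuple layout (uid, hour, num_rate_tgi) are assumptions made for the illustration.

```python
from collections import defaultdict


def tgi_sum_by_uid(rows, hours=(6, 7, 8, 9, 10)):
    """Sum num_rate_tgi per uid, counting only rows whose hour is in `hours`."""
    totals = defaultdict(float)
    for uid, hour, num_rate_tgi in rows:
        # Rows outside the hour range contribute 0, like the else branch of if().
        totals[uid] += float(num_rate_tgi) if hour in hours else 0.0
    return dict(totals)


rows = [("u1", 6, "1.5"), ("u1", 11, "2.0"), ("u1", 10, "0.5"), ("u2", 5, "3.0")]
print(tgi_sum_by_uid(rows))  # {'u1': 2.0, 'u2': 0.0}
```

A uid with no rows in the hour range still appears with a sum of 0, because the condition lives inside the aggregate and not in the where clause.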

I tried bucket sampling, but it reported that partitioned tables are not supported.

Option 1:

select * from data.next
where pt_dt='2018-06-04'
and label = 0
order by rand() limit 88000

From what I found online, this option is inefficient because it has to scan the whole table every time.

Option 2:

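select * from (
select *, row_number() over(order by rand()) as rn from data.next
where pt_dt='2018-06-04'
and label = 0
) t
-- keep the first 88000 rows of the random ordering (same sample size as option 1)
where rn <= 88000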

Hive has two functions for computing percentiles, percentile and percentile_approx. For example:

percentile(col,array(0.01,0.05,0.1))

Note: each p must satisfy p ∈ (0, 1).
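As a rough illustration of what an exact percentile over a small column looks like, here is a Python sketch. It assumes linear interpolation between the two nearest ranks, which is my assumption about the method and is not stated above; the function name percentiles only mirrors the Hive call with an array argument.

```python
import math


def percentiles(values, ps):
    """Return one percentile per p in ps, interpolating between nearest ranks."""
    data = sorted(values)
    if not data:
        raise ValueError("values must not be empty")
    results = []
    for p in ps:
        if not 0 < p < 1:
            raise ValueError("each p must lie in (0, 1)")
        pos = p * (len(data) - 1)          # fractional position in the sorted data
        lo, hi = math.floor(pos), math.ceil(pos)
        if lo == hi:
            results.append(float(data[lo]))
        else:
            # Weighted average of the two neighbouring values.
            results.append((hi - pos) * data[lo] + (pos - lo) * data[hi])
    return results


print(percentiles([1, 2, 3, 4], [0.5]))  # [2.5]
```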

a regexp b

This is equivalent to rlike.

select count(*) from olap_b_dw_hotelorder_f where create_date_wid not regexp '\\d'

is equivalent to

select count(*) from olap_b_dw_hotelorder_f where create_date_wid not rlike '\\d'

Syntax:

regexp_extract(string subject, string pattern, int index)

Example: extract 10001614 from [189][0]10001614-30以上-3

Option 1

select regexp_extract('[189][0]10001614-30以上-3','\\[0](.*?)(-)',1);

Option 2

select regexp_extract('[189][0]10001614-30以上-3','\\[[0-9]+]\\[[0-9]+]([0-9]*)-',1);

Option 3

select regexp_extract('[189][0]10001614-30以上-3','(\\[.*\\])([0-9]+)(.*)',2);
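The capture-group behaviour can be tried out quickly in Python before running the query. The helper below is a hypothetical stand-in for regexp_extract built on re.search; returning an empty string when nothing matches is a choice made for this sketch, and Python's regex dialect is only close to, not identical with, the Java one Hive uses. Python raw strings need a single backslash where the Hive string literal needs two.

```python
import re


def regexp_extract(subject: str, pattern: str, index: int) -> str:
    """Return capture group `index` of the first match, or '' if nothing matches."""
    match = re.search(pattern, subject)
    return match.group(index) if match else ""


s = "[189][0]10001614-30以上-3"

# Lazy (.*?) stops at the first hyphen after "[0]".
print(regexp_extract(s, r"\[0](.*?)(-)", 1))                 # 10001614

# Two bracketed numbers, then a run of digits up to the hyphen.
print(regexp_extract(s, r"\[[0-9]+]\[[0-9]+]([0-9]*)-", 1))  # 10001614
```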

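At work I often use Spark together with Hive for data analysis, and sometimes the results need to be inserted back into Hive so that they are stored durably.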

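// hiveContext is an existing HiveContext; its implicits enable toDF()
import hiveContext.implicits._
// expose the data as a temporary table, then insert it into a Hive partition
data.toDF().registerTempTable("table1")
hiveContext.sql("insert into table2 partition(date='2018-07-24') select name,col1,col2 from table1")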

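First save the data as a file, for example in CSV format. This approach is not suitable for very large data sets, because writing the data out as CSV or a similar format can easily crash the service.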

hive -e "insert overwrite directory '/user/local/data_export.csv' row format delimited fields terminated by '\t'
select * from locl.data limit 20;"

You can run this in the Hive CLI from Xshell, or from a shell.

hive -e "sql**" >> log.txt

Format: run a SQL file in Hive

hive -f data.hql >> log.txt

#!/bin/bash
source /exportfs/home/test/.bash_profile
echo "sql** ; " > data.hql
hive -f data.hql 2>log.txt

# a scheduled ** can be placed here

Returning the day of the week, and so on

To be continued...
