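Hive provides the row_number() over (partition by ...) window function, which produces a per-group ranking in a single SQL statement. ClickHouse offers several ways to get the same result, and this article walks through a few of them.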
1. Ranking with row_number
Hive version:
-- `table` stands in for the real table name
select number,
row_number() over (partition by number order by time desc) as rank
from table a
ClickHouse version:
select number,
groupArray(time) AS arr_val,
arrayEnumerate(arr_val) as row_number
from (select distinct orderid as number,
toDate(operatetime) as time
from table
order by time desc
) a
GROUP BY number
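To make the shape of that ClickHouse result concrete, here is a minimal Python sketch that imitates what the query does: sort by time descending, collect the times of each number into an array (the groupArray step), then number the array positions from 1 (the arrayEnumerate step). The function name row_numbers and the sample rows are invented for illustration. Note that the result is one row per number holding two parallel arrays, not one row per original record as in Hive.

```python
from collections import defaultdict

def row_numbers(rows):
    """Imitate: ORDER BY time DESC, then groupArray(time) and arrayEnumerate per number."""
    # order by time desc (hypothetical sample data uses ISO date strings, which sort correctly)
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    groups = defaultdict(list)
    for number, time in ordered:
        groups[number].append(time)  # groupArray(time)
    # arrayEnumerate(arr_val): positions 1..n of each array
    return {
        number: (times, list(range(1, len(times) + 1)))
        for number, times in groups.items()
    }

rows = [(1, "2020-01-01"), (1, "2020-01-03"), (2, "2020-01-02"), (1, "2020-01-02")]
result = row_numbers(rows)
```

In this sketch, result[1] holds the three dates of number 1 from newest to oldest next to the positions [1, 2, 3]. Getting back to one row per record in ClickHouse would need an extra step that unpacks the arrays.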
2. Keeping only the rank = 1 row after ranking with row_number
Hive version:
select orderid
from (select orderid,
row_number() over(partition by orderid order by datachange_lasttime desc) as row_num
from ods_htl_OrderDB.ord_orders
where d = '${CurrentDate}'
) a
where row_num = 1;
ClickHouse version:
Method 1: use groupArray
select orderid,
groupArray(1)(datachange_lasttime) as dates
from (select orderid,
datachange_lasttime
from olap_htlmaindb.tmp_ord_orders_status_s_pre
ORDER BY orderid, datachange_lasttime desc
) a
group by orderid
Method 2: use max() to get the first row in descending order; for ascending order, use min() instead
select orderid,
max(datachange_lasttime) as datachange_lasttime
from olap_htlmaindb.tmp_ord_orders_status_s_pre
group by orderid
Method 3: use the rowNumberInAllBlocks function
select orderid, status
from (select orderid, status, rowNumberInAllBlocks() as rank
from (select orderid, status, datachange_lasttime
from olap_htlmaindb.tmp_ord_orders_status_s_pre
order by orderid, datachange_lasttime desc
) a
) b LIMIT 1 BY orderid
Method 4: use the arrayEnumerate function
select orderid, dt
from (select orderid,
groupArray(datachange_lasttime) AS arr_val,
arrayEnumerate(arr_val) as row_nums
from (select orderid, datachange_lasttime
from olap_htlmaindb.tmp_ord_orders_status_s_pre
order by datachange_lasttime desc
) a
GROUP BY orderid
) b
-- unpack both arrays in parallel so each element gets its own row and row number
ARRAY JOIN arr_val AS dt, row_nums AS row_num
where row_num = 1
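3. A special case
Requirement:
In the following scenario, rows must be grouped by orderid and sorted by date in descending order, keeping only the newest row. If several rows share the same date, any one of them may be returned.
Hive version: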
select orderid
from (select orderid,
status,
row_number() over(partition by orderid order by datachange_lasttime desc) as row_num
from ods_htl_OrderDB.ord_orders
where d = '${CurrentDate}'
) as b
where row_num = 1
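ClickHouse version:
Given the examples above, the obvious idea is to treat that result as a subquery and join it back to the original table. One arbitrary way of writing such a join: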
select a.orderid as orderid_a, a.status as status
from olap_htlmaindb.tmp_ord_orders_status_s_pre a
inner join (select orderid, groupArray(1)(datachange_lasttime) as dates
from (select orderid, datachange_lasttime
from olap_htlmaindb.tmp_ord_orders_status_s_pre
ORDER BY orderid, datachange_lasttime desc
) a
group by orderid) b
on a.orderid = b.orderid
and cast(a.datachange_lasttime as String) = cast(b.dates[1] as String)
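Here we first pull out the qualifying orderid and timestamp, then join back to the original table to fetch the other columns we need. The join is necessary because the functions above share one drawback: they can only carry the partition by column and the sort column, not any other column. But when several rows of the same orderid share the latest timestamp, the join matches all of them, so none of the four methods above, even combined with an inner join to the original table, solves this case.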
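The tie problem is easy to reproduce outside the database. The following Python sketch imitates the join-back approach under stated assumptions: the function name latest_via_join and the sample rows are made up for illustration, and the timestamps are plain strings that compare correctly in lexical order.

```python
def latest_via_join(rows):
    """Imitate: newest timestamp per orderid, then inner join back on (orderid, timestamp).

    rows are (orderid, status, datachange_lasttime) tuples.
    """
    latest = {}
    for orderid, _status, ts in rows:
        if orderid not in latest or ts > latest[orderid]:
            latest[orderid] = ts  # same value groupArray(1) after a desc sort, or max(), would give
    # the inner join keeps every row whose timestamp equals the newest one
    return [(orderid, status) for orderid, status, ts in rows if latest[orderid] == ts]

# hypothetical data: order 100 has two rows sharing the newest timestamp
rows = [
    (100, "paid", "2020-01-02 10:00:00"),
    (100, "cancelled", "2020-01-02 10:00:00"),
    (100, "created", "2020-01-01 09:00:00"),
    (200, "created", "2020-01-01 08:00:00"),
]
joined = latest_via_join(rows)
```

With this data, joined contains two rows for order 100, whereas the requirement asks for exactly one row per orderid.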
This is where LIMIT 1 BY comes in. It is actually the most effective approach:
select orderid,
status,
datachange_lasttime
from olap_htlmaindb.tmp_ord_orders_status_s_pre
order by orderid, datachange_lasttime desc
LIMIT 1 BY orderid
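As a rough mental model of the query above, LIMIT 1 BY orderid walks the sorted rows and keeps the first row it sees for each orderid, carrying all selected columns along. The Python sketch below imitates that behaviour. The helper name limit_1_by and the sample rows are hypothetical, and unlike this deterministic sketch, which of several tied rows ClickHouse returns should be treated as unspecified.

```python
def limit_1_by(rows, key):
    """Keep the first row seen for each key, preserving input order."""
    seen = set()
    picked = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            picked.append(row)
    return picked

# hypothetical data: (orderid, status, datachange_lasttime)
rows = [
    (100, "created", "2020-01-01 09:00:00"),
    (100, "paid", "2020-01-02 10:00:00"),
    (100, "cancelled", "2020-01-02 10:00:00"),
    (200, "created", "2020-01-01 08:00:00"),
]
# order by orderid, datachange_lasttime desc (two stable sorts, minor key first)
ordered = sorted(rows, key=lambda r: r[2], reverse=True)
ordered = sorted(ordered, key=lambda r: r[0])
picked = limit_1_by(ordered, key=lambda r: r[0])
```

Each orderid appears exactly once in picked, with its status column intact, even though order 100 has two rows with the same newest timestamp.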