Three ways to deduplicate in Hive: DISTINCT, GROUP BY, and the ROW_NUMBER() window function

Original

2020-02-29 17:23

I. How to use DISTINCT, GROUP BY, and the ROW_NUMBER() window function

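1. DISTINCT: deduplicates on the combination of all columns that follow SELECT; it cannot deduplicate on just one of them.

(1) When DISTINCT is applied to several columns, it must come first in the select list. It covers every column after it, not only the one right next to it.

(2) DISTINCT does not filter out NULL: the result still contains rows with NULL values.

(3) DISTINCT inside an aggregate function, such as COUNT(DISTINCT col), does skip rows where the value is NULL.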
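The three points above can be checked on a few inline rows. This is only a sketch: the CTE `t` and its columns are made up for illustration, and it assumes a Hive version that accepts CTEs and SELECT without FROM.

```sql
-- Hypothetical inline data: five rows, three distinct (id, gender) pairs,
-- one of which has a NULL gender.
with t as (
  select 1 as id, 'F' as gender
  union all select 1 as id, 'F' as gender
  union all select 1 as id, 'M' as gender
  union all select 2 as id, cast(null as string) as gender
  union all select 2 as id, cast(null as string) as gender
)
-- DISTINCT works on the whole (id, gender) pair and keeps the NULL pair: 3 rows.
select 'distinct (id, gender) rows' as metric, count(*) as n
from (select distinct id, gender from t) d
union all
-- COUNT(DISTINCT gender) skips NULL and counts only 'F' and 'M': 2.
select 'count(distinct gender)' as metric, count(distinct gender) as n
from t;
```

Note that id 1 still appears twice in the DISTINCT result, once with 'F' and once with 'M', which is the "cannot deduplicate on one column" behaviour described in point (1).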

2. GROUP BY: deduplicates on the combination of all columns listed after GROUP BY; it cannot deduplicate on just one of them.
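A small sketch of the same behaviour with GROUP BY, again on made-up inline rows (the CTE `t` and its columns are hypothetical). Unlike DISTINCT, GROUP BY can also carry an aggregate such as a row count per group.

```sql
-- Hypothetical inline data, same shape as a table with id and gender columns.
with t as (
  select 1 as id, 'F' as gender
  union all select 1 as id, 'F' as gender
  union all select 1 as id, 'M' as gender
  union all select 2 as id, cast(null as string) as gender
  union all select 2 as id, cast(null as string) as gender
)
-- One output row per (id, gender) pair; NULL genders form a single group.
-- Groups: (1, 'F') with cnt 2, (1, 'M') with cnt 1, (2, NULL) with cnt 2.
select id, gender, count(*) as cnt
from t
group by id, gender;
```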

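3. The ROW_NUMBER() OVER() window function

Note: ROW_Number() over (partition by id order by time DESC) adds a column that numbers the rows of each id in descending time order. Keeping only the rows where that number is 1 (rank = 1 in the query below) leaves exactly one row per id.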

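select m.id, m.gender, m.age, m.rank
from (
    -- Ordering by the partition key leaves the choice of row within each id arbitrary;
    -- order by a timestamp column instead to control which row gets rank 1.
    select id, gender, age,
           ROW_Number() over (partition by id order by id) rank
    from temp.control_201804to201806
    -- AND binds tighter than OR; the parentheses assume the intent was
    -- "id is not 'NA', and gender or age is non-empty".
    where id != 'NA' and (gender != '' or age != '')
) m
where m.rank = 1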
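To make the "latest row per id" pattern from the note concrete, here is a minimal sketch on inline rows. The CTE `t`, the `event_time` column, and the `rn` alias are assumptions made for illustration; they are not part of the table used above.

```sql
-- Hypothetical inline data: id 'a' has two rows with different timestamps.
with t as (
  select 'a' as id, 'F' as gender, 30 as age, '2018-04-01' as event_time
  union all select 'a' as id, 'F' as gender, 31 as age, '2018-06-01' as event_time
  union all select 'b' as id, 'M' as gender, 25 as age, '2018-05-01' as event_time
)
select m.id, m.gender, m.age
from (
  select id, gender, age,
         -- Number the rows of each id from newest to oldest.
         row_number() over (partition by id order by event_time desc) as rn
  from t
) m
-- Keep the newest row per id: ('a', 'F', 31) and ('b', 'M', 25).
where m.rn = 1;
```

Because the window is ordered by a column that differs between the rows of one id, the row that survives is well defined.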

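II. Example:

1. The table has two columns, id and superid. Select the first 100 distinct ids in descending superid order, as follows:

Option 1:

The subquery deduplicates on (id, superid) together. One id can map to several superid values, so the id column of the subquery can still contain repeats, while the result needs a single column of distinct ids. The outer query therefore deduplicates id again, which means the 100 rows kept by the subquery may collapse to fewer than 100 ids. Use this approach only when the exact number of ids does not matter.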

%jdbc(hive)
create table temp.match_relation_3M_active_v5 as
select a.id
from (
    -- LIMIT applies to distinct (id, superid) pairs, not to distinct ids
    select distinct id, superid
    from temp.match_relation_3M_activ
    order by superid desc
    limit 100
) a
group by a.id

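Note: the outer deduplication on id can be written with GROUP BY id or with DISTINCT id, and both return the same set of ids. Neither preserves the ORDER BY of the subquery. The rows are regrouped by id, so the output often comes back ordered by id rather than by superid, and no order is guaranteed without an explicit ORDER BY in the outer query. The ids are the same ones the subquery selected; only their order differs.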
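The shortfall of Option 1 is easy to see on a toy table. In this sketch the CTE `t` and its four rows are made up, and LIMIT 3 stands in for LIMIT 100.

```sql
-- Hypothetical inline data: id 'a' owns the two largest superid values.
with t as (
  select 'a' as id, 9 as superid
  union all select 'a' as id, 8 as superid
  union all select 'b' as id, 7 as superid
  union all select 'c' as id, 6 as superid
)
select a.id
from (
  -- Keeps the pairs ('a', 9), ('a', 8), ('b', 7).
  select distinct id, superid
  from t
  order by superid desc
  limit 3
) a
-- Collapses to two ids, 'a' and 'b', although three were requested.
group by a.id;
```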

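Option 2:

The ids must be selected in descending superid order, and one id may map to several superid values of different sizes, so it is enough to keep the largest superid for each id. Likewise, for ascending superid order, keep the smallest superid for each id.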

%jdbc(hive)
create table temp.match_relation_3M_active_v7 as
select a.id
from (
    -- one row per id, carrying its largest superid
    select id, max(superid) as superid
    from temp.match_relation_3M_active
    group by id
    order by superid desc
    limit 100
) a
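The ascending variant mentioned above swaps max for min and reverses the sort. A minimal sketch on made-up rows, where the CTE `t` and the alias `min_superid` are hypothetical and LIMIT 2 stands in for LIMIT 100:

```sql
-- Hypothetical inline data: id 'a' has two superid values.
with t as (
  select 'a' as id, 9 as superid
  union all select 'a' as id, 8 as superid
  union all select 'b' as id, 7 as superid
  union all select 'c' as id, 6 as superid
)
select a.id
from (
  -- Smallest superid per id: a -> 8, b -> 7, c -> 6.
  select id, min(superid) as min_superid
  from t
  group by id
  order by min_superid asc
  limit 2
) a;
-- Keeps the two ids with the smallest superid: 'c' and 'b'.
```

Since each id appears once after the GROUP BY, the LIMIT counts ids directly, which avoids the shortfall of Option 1.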

Option 3:

First use the ROW_NUMBER() OVER() window function to deduplicate on the id column alone. DISTINCT or GROUP BY on (id, superid) together cannot do this.

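%jdbc(hive)
create table temp.match_relation_3M_active_v11 as
select n.id
from (
    select m.id, superid
    from (
        -- ordered by superid desc so that rank 1 is the largest superid of each id
        -- (ordering by id here would keep an arbitrary row per id)
        select id, superid,
               ROW_Number() over (partition by id order by superid desc) rank
        from temp.match_relation_3M_active
    ) m
    where m.rank = 1
    order by superid desc
    limit 100
) n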
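Option 3 can be tried on a toy table as well. This sketch assumes the window is ordered by superid desc, so that the row kept for each id is the one with its largest superid; the CTE `t`, the alias `rn`, and LIMIT 2 are illustrative only.

```sql
-- Hypothetical inline data: id 'a' has two superid values.
with t as (
  select 'a' as id, 9 as superid
  union all select 'a' as id, 8 as superid
  union all select 'b' as id, 7 as superid
  union all select 'c' as id, 6 as superid
)
select n.id, n.superid
from (
  select m.id, m.superid
  from (
    select id, superid,
           -- rn = 1 marks the largest superid of each id.
           row_number() over (partition by id order by superid desc) as rn
    from t
  ) m
  where m.rn = 1
  -- Sort and cut in the same query block, after the deduplication.
  order by superid desc
  limit 2
) n;
-- Keeps ('a', 9) and ('b', 7).
```

One advantage over Option 2 is that the inner query can carry any other column of the surviving row, not only the aggregated one.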

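Note: the following query does not return the ids in descending superid order. The ORDER BY superid desc sits inside the subquery, while the rank filter, the GROUP BY, and the LIMIT all run in the outer query, which does not preserve the subquery's ordering. As a result, LIMIT 100 keeps 100 ids that need not be the ones with the largest superid.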

%jdbc(hive)
create table temp.match_relation_3M_active_v9 as
select m.id
from (
    select id, superid,
           ROW_Number() over (partition by id order by id) rank
    from temp.match_relation_3M
    -- this ordering is lost once the outer query groups by id
    order by superid desc
) m
where m.rank = 1
group by m.id
limit 100
