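concat, concat_ws, and group_concat functions
Differences between the concat, concat_ws, and group_concat functions in Hive SQL
CONCAT() function
The CONCAT() function joins multiple strings into a single string.
The table info is used as the example, where
SELECT id,name FROM info LIMIT 1; returns
+----+--------+
| id | name |
+----+--------+
| 1 | BioCyc |
+----+--------+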
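1.1 Syntax and usage notes:
CONCAT(str1,str2,…)
Returns the string produced by concatenating the arguments. If any argument is NULL, the result is NULL. The function takes one or more arguments.
Usage examples:
SELECT CONCAT(id, ',', name) AS con FROM info LIMIT 1; returns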
+----------+
| con |
+----------+
| 1,BioCyc |
+----------+
SELECT CONCAT('My', NULL, 'QL'); returns
+--------------------------+
| CONCAT('My', NULL, 'QL') |
+--------------------------+
| NULL |
+--------------------------+
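CONCAT_WS function
To specify a separator between the arguments, use CONCAT_WS(). Syntax: CONCAT_WS(separator,str1,str2,…)
CONCAT_WS() stands for CONCAT With Separator and is a special form of CONCAT(). The first argument is the separator for the remaining arguments and is placed between each pair of strings being joined. The separator can be a string, as can the other arguments. If the separator is NULL, the result is NULL. NULL values after the separator argument are skipped, but empty strings are not.
For example, SELECT CONCAT_WS('_',id,name) AS con_ws FROM info LIMIT 1; returns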
+----------+
| con_ws |
+----------+
| 1_BioCyc |
+----------+
SELECT CONCAT_WS(',','First name',NULL,'Last Name'); returns
+----------------------------------------------+
| CONCAT_WS(',','First name',NULL,'Last Name') |
+----------------------------------------------+
| First name,Last Name |
+----------------------------------------------+
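The NULL rules of the two functions are easy to mix up, so here is a minimal Python sketch that mimics them, with None standing in for NULL. The helper names concat and concat_ws are written for this illustration only; they are not part of Hive or MySQL, and the sketch ignores type-specific formatting.

```python
from typing import Optional


def concat(*args: Optional[object]) -> Optional[str]:
    """Mimic CONCAT: a single NULL (None) argument makes the whole result NULL."""
    if any(a is None for a in args):
        return None
    return "".join(str(a) for a in args)


def concat_ws(separator: Optional[str], *args: Optional[object]) -> Optional[str]:
    """Mimic CONCAT_WS: a NULL separator gives NULL; NULL arguments are skipped."""
    if separator is None:
        return None
    # Empty strings are kept; only None values are dropped.
    return separator.join(str(a) for a in args if a is not None)


print(concat(1, ",", "BioCyc"))                         # 1,BioCyc
print(concat("My", None, "QL"))                         # None
print(concat_ws(",", "First name", None, "Last Name"))  # First name,Last Name
print(concat_ws(",", "a", "", "b"))                     # a,,b
```

The last call shows that an empty string still produces a separator on both sides, while a NULL argument leaves no trace.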
GROUP_CONCAT() function
GROUP_CONCAT returns a single string built by concatenating the values in a group.
The table info is used as the example, where the statement SELECT locus,id,journal FROM info WHERE locus IN('AB086827','AF040764'); returns
+----------+----+--------------------------+
| locus | id | journal |
+----------+----+--------------------------+
| AB086827 | 1 | Unpublished |
| AB086827 | 2 | Submitted (20-JUN-2002) |
| AF040764 | 23 | Unpublished |
| AF040764 | 24 | Submitted (31-DEC-1997) |
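+----------+----+--------------------------+
1. Syntax and usage notes:
GROUP_CONCAT([DISTINCT] expr [,expr …]
[ORDER BY {unsigned_integer | col_name | formula} [ASC | DESC] [,col …]]
[SEPARATOR str_val])
In MySQL, this returns the concatenated values of the expression for each group. DISTINCT removes duplicate values. To sort the values within the result, use the ORDER BY clause.
SEPARATOR is a string value inserted between the values in the result. The default is a comma (","); specifying SEPARATOR '' removes the separator entirely.
The variable group_concat_max_len sets a maximum length for the result. The syntax to set it at runtime is: SET [SESSION | GLOBAL] group_concat_max_len = unsigned_integer;
When a maximum length is set, the result is truncated to that length. If the concatenated values for a group are too long, raise the system variable, for example: SET @@global.group_concat_max_len=40000;
2. Usage examples:
The statement SELECT locus,GROUP_CONCAT(id) FROM info WHERE locus IN('AB086827','AF040764') GROUP BY locus; returns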
+----------+------------------+
| locus | GROUP_CONCAT(id) |
+----------+------------------+
| AB086827 | 1,2 |
| AF040764 | 23,24 |
+----------+------------------+
The statement SELECT locus,GROUP_CONCAT(distinct id ORDER BY id DESC SEPARATOR '_') FROM info WHERE locus IN('AB086827','AF040764') GROUP BY locus; returns
+----------+----------------------------------------------------------+
| locus | GROUP_CONCAT(distinct id ORDER BY id DESC SEPARATOR '_') |
+----------+----------------------------------------------------------+
| AB086827 | 2_1 |
| AF040764 | 24_23 |
+----------+----------------------------------------------------------+
The statement SELECT locus,GROUP_CONCAT(concat_ws(', ',id,journal) ORDER BY id DESC SEPARATOR '. ') FROM info WHERE locus IN('AB086827','AF040764') GROUP BY locus; returns
+----------+--------------------------------------------------------------------------+
| locus | GROUP_CONCAT(concat_ws(', ',id,journal) ORDER BY id DESC SEPARATOR '. ') |
+----------+--------------------------------------------------------------------------+
| AB086827 | 2, Submitted (20-JUN-2002). 1, Unpublished |
| AF040764 | 24, Submitted (31-DEC-1997) . 23, Unpublished |
+----------+--------------------------------------------------------------------------+
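To make the order of operations in the three statements explicit (group, sort, optionally de-duplicate, then join), here is a rough Python sketch over the same four sample rows. The function group_concat and its parameters are hypothetical names for this illustration, not a database API. One assumption to note: when no ordering is requested the sketch keeps input order, whereas a database does not guarantee any order without ORDER BY.

```python
from collections import defaultdict

# Sample rows (locus, id, journal) copied from the info table shown above.
rows = [
    ("AB086827", 1, "Unpublished"),
    ("AB086827", 2, "Submitted (20-JUN-2002)"),
    ("AF040764", 23, "Unpublished"),
    ("AF040764", 24, "Submitted (31-DEC-1997)"),
]


def group_concat(rows, key, value, order_by=None, descending=False,
                 distinct=False, separator=","):
    """Mimic GROUP_CONCAT(value ORDER BY ... SEPARATOR ...) with GROUP BY key."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    result = {}
    for group_key, members in groups.items():
        if order_by is not None:
            members = sorted(members, key=order_by, reverse=descending)
        values = [str(value(r)) for r in members]
        if distinct:
            values = list(dict.fromkeys(values))  # drop duplicates, keep order
        result[group_key] = separator.join(values)
    return result


def locus(row):
    return row[0]


def row_id(row):
    return row[1]


plain = group_concat(rows, locus, row_id)
ordered = group_concat(rows, locus, row_id, order_by=row_id,
                       descending=True, distinct=True, separator="_")
combined = group_concat(rows, locus, lambda r: f"{r[1]}, {r[2]}",
                        order_by=row_id, descending=True, separator=". ")

print(plain["AB086827"])     # 1,2
print(ordered["AF040764"])   # 24_23
print(combined["AB086827"])  # 2, Submitted (20-JUN-2002). 1, Unpublished
```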
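Window functions: row_number() over() and sum() over()
Using row_number() over():
Suppose we have the data below and need the two oldest people of each sex. How can this be done?
Grouping is the first thing that comes to mind, but grouping only gives the top 1 of each group. How do we get the top 2 or top 3? It would help if, within each group, we could sort by age and attach a sequence number to every row.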
id age name sex
1,18,xiaoli,male
2,19,wang,male
3,22,liu,female
4,16,dawei,male
5,30,erbao,male
6,26,xiao,female
7,18,chengua,male
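This is what the very useful function row_number() over() does: it partitions the rows, sorts each partition, and numbers the rows.
For the top 2 by age within each sex described above, we can proceed as follows:
Create the table and load the data: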
create table rownumber(id string,age int,name string,sex string)
row format delimited
fields terminated by ',';
load data local inpath 'xxx' into table rownumber;
select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rownumber
from rownumber;
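Here row_number() over(partition by sex order by age desc) as rownumber
adds a column of sequence numbers: inside over(), partition by sex groups the rows by sex, order by age desc sorts each group by age in descending order, and row_number() then numbers the rows within each group starting from 1.
To keep only the top 2 of each group, filter on that column in an outer query: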
select id,age,name,sex
from
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rownumber
from rownumber ) temp
where rownumber<3;
This gives the top N per group, which is very convenient!
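For readers who want to check the expected result without a Hive session, the following Python sketch walks through the same steps on the sample data: sort by partition and age, number the rows inside each partition, then filter. The function number_rows is a hypothetical helper written for this illustration. Rows with equal ages keep their input order here, which is an assumption of the sketch; Hive does not promise a particular order for ties.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows copied from the data above.
people = [
    {"id": "1", "age": 18, "name": "xiaoli", "sex": "male"},
    {"id": "2", "age": 19, "name": "wang", "sex": "male"},
    {"id": "3", "age": 22, "name": "liu", "sex": "female"},
    {"id": "4", "age": 16, "name": "dawei", "sex": "male"},
    {"id": "5", "age": 30, "name": "erbao", "sex": "male"},
    {"id": "6", "age": 26, "name": "xiao", "sex": "female"},
    {"id": "7", "age": 18, "name": "chengua", "sex": "male"},
]


def number_rows(rows, partition_key, order_key):
    """Mimic row_number() over(partition by partition_key order by order_key desc)."""
    # Sort by partition first, then by the ordering column in descending order.
    ordered = sorted(rows, key=lambda r: (r[partition_key], -r[order_key]))
    numbered = []
    for _, partition in groupby(ordered, key=itemgetter(partition_key)):
        for row_number, row in enumerate(partition, start=1):
            numbered.append({**row, "rownumber": row_number})
    return numbered


numbered = number_rows(people, "sex", "age")
top2 = [r for r in numbered if r["rownumber"] < 3]  # same filter as the outer query

for r in top2:
    print(r["sex"], r["rownumber"], r["name"], r["age"])
# female 1 xiao 26
# female 2 liu 22
# male 1 erbao 30
# male 2 wang 19
```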
sum() over()
Suppose we have the following data: the first column is name, the second is the month mon, and the third is the amount jine.
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
C,2015-01,10
C,2015-01,20
A,2015-02,4
A,2015-02,6
C,2015-02,30
C,2015-02,10
B,2015-02,10
B,2015-02,5
A,2015-03,14
A,2015-03,6
B,2015-03,20
B,2015-03,25
C,2015-03,10
C,2015-03,20
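We need each person's total for each month and the running total up to the current month.
The traditional approach is cumbersome: first build a monthly totals table (name,mon,amount), then self-join it, filter out the months after the current month, and finally sum.
With sum() over() this is easy. sum() sums, as we all know; adding over() makes it sum over a window. Which window? That is determined by the clauses inside over().
Implementation:
First load the data into a table, then compute the monthly totals and store them in another table.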
create table monsum(name string,mon string,jine string)
row format delimited
fields terminated by ',';
load data local inpath '/root/mytest/sumreport.dat' into table monsum;
-- compute the monthly totals
create table monamount
as
select name,mon,sum(jine) as amount
from monsum
group by name,mon;
Then use a window function to compute the running total up to the current month:
select name,mon,amount,
sum(amount) over(partition by name order by mon rows between unbounded preceding and current row) as account
from monamount;
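sum(amount) sums over the window defined by the over() clause that follows it.
Inside over(), partition by name order by mon sorts the rows of each name by month.
rows between unbounded preceding and current row makes the frame run from the first row of the partition to the current row, so the sum covers the current row and all rows before it: for March, sum(amount) is the total of months 1, 2, and 3; for February, it is the total of months 1 and 2.
unbounded means without limit, preceding means before, and current row is the current row.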
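The two steps (monthly total, then running total per name) can be reproduced on the sample data with a short Python sketch, which also gives the figures to expect from the query. The function running_totals is a hypothetical helper for this illustration, and it assumes jine holds integers.

```python
from collections import defaultdict
from itertools import accumulate

# Sample rows (name, mon, jine) copied from the data above.
records = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 5), ("A", "2015-01", 8),
    ("B", "2015-01", 25), ("A", "2015-01", 5), ("C", "2015-01", 10), ("C", "2015-01", 20),
    ("A", "2015-02", 4), ("A", "2015-02", 6), ("C", "2015-02", 30), ("C", "2015-02", 10),
    ("B", "2015-02", 10), ("B", "2015-02", 5), ("A", "2015-03", 14), ("A", "2015-03", 6),
    ("B", "2015-03", 20), ("B", "2015-03", 25), ("C", "2015-03", 10), ("C", "2015-03", 20),
]


def running_totals(records):
    """Return (name, mon, amount, running_total) rows, ordered by name and mon."""
    # Step 1: monthly totals, like "group by name,mon" with sum(jine).
    monthly = defaultdict(int)
    for name, mon, jine in records:
        monthly[(name, mon)] += jine

    # Step 2: partition by name, order by mon.
    by_name = defaultdict(list)
    for (name, mon), amount in sorted(monthly.items()):
        by_name[name].append((mon, amount))

    # Step 3: sum from the first row of each partition up to the current row.
    report = []
    for name, months in by_name.items():
        totals = accumulate(amount for _, amount in months)
        for (mon, amount), total in zip(months, totals):
            report.append((name, mon, amount, total))
    return report


report = running_totals(records)
for row in report:
    print(*row)
# A 2015-01 33 33
# A 2015-02 10 43
# A 2015-03 20 63
# B 2015-01 30 30
# B 2015-02 15 45
# B 2015-03 45 90
# C 2015-01 30 30
# C 2015-02 40 70
# C 2015-03 30 100
```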
Hive: column to row and row to column
Column to row
Test data
hive> select * from col_lie limit 10;
OK
col_lie.user_id col_lie.order_id
104399 1715131
104399 2105395
104399 1758844
104399 981085
104399 2444143
104399 1458638
104399 968412
104400 1609001
104400 2986088
104400 1795054
Collapse the order_id values of each user_id into one comma-separated row
select user_id,
concat_ws(',',collect_list(order_id)) as order_value
from col_lie
group by user_id
limit 10;
// Result (abbreviated)
user_id order_value
104399 1715131,2105395,1758844,981085,2444143
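Summary
Functions used: concat_ws(',',collect_list(column)) or concat_ws(',',collect_set(column))
Note: collect_list keeps duplicates, collect_set removes them. The column must be of type string; cast other types first, for example cast(order_id as string).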
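The difference between the two collectors only shows up when a group contains duplicates, so the Python sketch below adds one repeated order_id to a few of the sample rows. That repeated row is made up for the illustration and is not in the test data, and collect_and_join is a hypothetical helper name. The sketch keeps values in input order, which is an assumption; Hive does not guarantee the element order of either collector.

```python
from collections import defaultdict

# (user_id, order_id) pairs; the first three rows come from the test data above,
# the duplicate of 1715131 is added here only to show de-duplication.
rows = [
    (104399, 1715131),
    (104399, 2105395),
    (104400, 1609001),
    (104399, 1715131),  # hypothetical duplicate
]


def collect_and_join(rows, dedupe=False, separator=","):
    """Mimic concat_ws(separator, collect_list(col)) or, with dedupe, collect_set(col)."""
    groups = defaultdict(list)
    for user_id, order_id in rows:
        # concat_ws works on strings, so convert first (the cast step in SQL).
        groups[user_id].append(str(order_id))
    joined = {}
    for user_id, order_ids in groups.items():
        if dedupe:
            order_ids = list(dict.fromkeys(order_ids))  # drop repeats, keep order
        joined[user_id] = separator.join(order_ids)
    return joined


as_list = collect_and_join(rows)
as_set = collect_and_join(rows, dedupe=True)

print(as_list[104399])  # 1715131,2105395,1715131
print(as_set[104399])   # 1715131,2105395
```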
Row to column
Test data
hive> select * from lie_col;
OK
lie_col.user_id lie_col.order_value
104408 2909888,2662805,2922438,674972,2877863,190237
104407 2982655,814964,1484250,2323912,2689723,2034331,1692373,677498,156562,2862492,338128
104406 1463273,2351480,1958037,2606570,3226561,3239512,990271,1436056,2262338,2858678
104405 153023,2076625,1734614,2796812,1633995,2298856,2833641,3286778,2402946,2944051,181577,464232
104404 1815641,108556,3110738,2536910,1977293,424564
104403 253936,2917434,2345879,235401,2268252,2149562,2910478,375109,932923,1989353
104402 3373196,1908678,291757,1603657,1807247,573497,1050134,3402420
104401 814760,213922,2008045,3305934,2130994,1602245,419609,2502539,3040058,2828163,3063469
104400 1609001,2986088,1795054,429550,1812893
104399 1715131,2105395,1758844,981085,2444143,1458638,968412
Time taken: 0.065 seconds, Fetched: 10 row(s)
Split each order_value record into individual elements
select user_id,order_value,order_id
from lie_col
lateral view explode(split(order_value,',')) num as order_id
limit 10;
// Result
user_id order_value order_id
104408 2909888,2662805,2922438,674972,2877863,190237 2909888
104408 2909888,2662805,2922438,674972,2877863,190237 2662805
104408 2909888,2662805,2922438,674972,2877863,190237 2922438
104408 2909888,2662805,2922438,674972,2877863,190237 674972
104408 2909888,2662805,2922438,674972,2877863,190237 2877863
104408 2909888,2662805,2922438,674972,2877863,190237 190237
104407 2982655,814964,1484250,2323912,2689723,2034331,1692373,677498,156562,2862492,338128 2982655
104407 2982655,814964,1484250,2323912,2689723,2034331,1692373,677498,156562,2862492,338128 814964
104407 2982655,814964,1484250,2323912,2689723,2034331,1692373,677498,156562,2862492,338128 1484250
104407 2982655,814964,1484250,2323912,2689723,2034331,1692373,677498,156562,2862492,338128 2323912
Time taken: 0.096 seconds, Fetched: 10 row(s)
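What lateral view explode(split(...)) does to each input row can be sketched in a few lines of Python: the original columns are repeated once for every element produced by the split. The function lateral_view_explode is a hypothetical name for this illustration. Note that Python's str.split takes a literal separator, while Hive's split takes a regular expression; for a plain comma the two behave the same.

```python
# Two rows (user_id, order_value) copied from the lie_col test data above.
rows = [
    (104408, "2909888,2662805,2922438,674972,2877863,190237"),
    (104404, "1815641,108556,3110738,2536910,1977293,424564"),
]


def lateral_view_explode(rows, separator=","):
    """Yield (user_id, order_value, order_id): one output row per split element."""
    for user_id, order_value in rows:
        for order_id in order_value.split(separator):
            yield (user_id, order_value, order_id)


exploded = list(lateral_view_explode(rows))

for user_id, _, order_id in exploded[:3]:
    print(user_id, order_id)
# 104408 2909888
# 104408 2662805
# 104408 2922438
```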