函數查詢:
https://spark.apache.org/docs/2.3.0/api/sql/index.html#map
concat_ws: 用指定的字符連接字符串
例如:
連接字符串:
concat_ws("_", field1, field2),輸出結果將會是:“field1_field2”。
數組元素連接:
concat_ws("_", [a,b,c]),輸出結果將會是:"a_b_c"。
collect_set: 把聚合的數據組合成一個數組,一般搭配group by 使用。
例如有下表T_course;
id name course
1 zhang san Chinese
2 zhang san Math
3 zhang san English
spark.sql("select name, collect_set(course) as course_set from T_course group by name");
結果是:
name course_set
zhang san [Chinese,Math,English]
RANK, DENSE_RANK, ROW_NUMBER都是把表中的行按分區內的排序標上序號,但有一點差別:
RANK:可以生成不連續的序號,比如按分數排序,第一第二都是100分,第三名98分,那第一第二就會顯示序號1,第三名顯示序號3。
DENSE_RANK: 生成連續的序號,在上一例子中,第一第二並列顯示序號1,第三名會顯示序號2。
ROW_NUMBER: 顧名思義就是行的數值,在上一例子中,第一第二第三將會顯示序號爲1,2,3。
下面的例子幫助理解,按年級分組,分數降序排列,分別新建列RANK, DENSE_RANK, ROW_NUMBER:
考試成績排名
姓名 年級 分數 RANK DENSE_RANK ROW_NUMBER
張三 一年級 100 1 1 1
李四 一年級 100 1 1 2
王五 一年級 98 3 2 3
小明 二年級 100 1 1 1
小芳 二年級 95 2 2 2
小民 二年級 90 3 3 3
sparkSession.sql("SELECT * , " +
"RANK() OVER (PARTITION BY grade ORDER BY score DESC) AS rank, " +
"DENSE_RANK() OVER (PARTITION BY grade ORDER BY score DESC) AS dense_rank, " +
"ROW_NUMBER() OVER (PARTITION BY grade ORDER BY score DESC) AS row_number " +
"FROM ScoreDetail").show()
cast: select new column as null string in spark,默認null 作爲新一列
select cast(null as string) as newcol from db.table
Hive分析窗口函數(一) SUM,AVG,MIN,MAX 用於實現分組內所有和連續累積的統計,實現累加和累乘
https://blog.csdn.net/abc200941410128/article/details/78408942
數據準備:
CREATE EXTERNAL TABLE lxw1234 (
cookieid string,
createtime string, --day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile location '/tmp/lxw11/';
DESC lxw1234;
cookieid STRING
createtime STRING
pv INT
hive> select * from lxw1234;
OK
cookie1 2015-04-10 1
cookie1 2015-04-11 5
cookie1 2015-04-12 7
cookie1 2015-04-13 3
cookie1 2015-04-14 2
cookie1 2015-04-15 4
cookie1 2015-04-16 4
SUM — 注意,結果和ORDER BY相關,默認爲升序
SELECT cookieid,
createtime,
pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默認爲從起點到當前行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --從起點到當前行,結果同pv1
SUM(pv) OVER(PARTITION BY cookieid) AS pv3, --分組內所有行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --當前行+往前3行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --當前行+往前3行+往後1行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---當前行+往後所有行
FROM lxw1234;
cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10 1 1 1 26 1 6 26
cookie1 2015-04-11 5 6 6 26 6 13 25
cookie1 2015-04-12 7 13 13 26 13 16 20
cookie1 2015-04-13 3 16 16 26 16 18 13
cookie1 2015-04-14 2 18 18 26 17 21 10
cookie1 2015-04-15 4 22 22 26 16 20 8
cookie1 2015-04-16 4 26 26 26 13 13 4
如果不指定ROWS BETWEEN,默認爲從起點到當前行;
如果不指定ORDER BY,則將分組內所有值累加;
關鍵是理解ROWS BETWEEN含義,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往後
CURRENT ROW:當前行
UNBOUNDED:起點,UNBOUNDED PRECEDING 表示從前面的起點, UNBOUNDED FOLLOWING:表示到後面的終點
–其他AVG,MIN,MAX,和SUM用法一樣。
--AVG
SELECT cookieid,
createtime,
pv,
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默認爲從起點到當前行
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --從起點到當前行,結果同pv1
AVG(pv) OVER(PARTITION BY cookieid) AS pv3, --分組內所有行
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --當前行+往前3行
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --當前行+往前3行+往後1行
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---當前行+往後所有行
FROM lxw1234;
cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10 1 1.0 1.0 3.7142857142857144 1.0 3.0 3.7142857142857144
cookie1 2015-04-11 5 3.0 3.0 3.7142857142857144 3.0 4.333333333333333 4.166666666666667
cookie1 2015-04-12 7 4.333333333333333 4.333333333333333 3.7142857142857144 4.333333333333333 4.0 4.0
cookie1 2015-04-13 3 4.0 4.0 3.7142857142857144 4.0 3.6 3.25
cookie1 2015-04-14 2 3.6 3.6 3.7142857142857144 4.25 4.2 3.3333333333333335
cookie1 2015-04-15 4 3.6666666666666665 3.6666666666666665 3.7142857142857144 4.0 4.0 4.0
cookie1 2015-04-16 4 3.7142857142857144 3.7142857142857144 3.7142857142857144 3.25 3.25 4.0
--MIN
SELECT cookieid,
createtime,
pv,
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默認爲從起點到當前行
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --從起點到當前行,結果同pv1
MIN(pv) OVER(PARTITION BY cookieid) AS pv3, --分組內所有行
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --當前行+往前3行
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --當前行+往前3行+往後1行
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---當前行+往後所有行
FROM lxw1234;
cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10 1 1 1 1 1 1 1
cookie1 2015-04-11 5 1 1 1 1 1 2
cookie1 2015-04-12 7 1 1 1 1 1 2
cookie1 2015-04-13 3 1 1 1 1 1 2
cookie1 2015-04-14 2 1 1 1 2 2 2
cookie1 2015-04-15 4 1 1 1 2 2 4
cookie1 2015-04-16 4 1 1 1 2 2 4
分組相加
select sum(col) from table group by
分組相乘
在乘過程中,由於進行了log轉換,存在較小精度損失,用round()進行處理四捨五入處理;
select round(power(10, sum(log(10, col)) from table group by
NTILE :統計一個cookie,pv數最多的前1/3的天
NTILE(n),用於將分組數據按照順序切分成n片,返回當前切片值
NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
如果切片不均勻,默認增加第一個切片的分佈
SELECT
cookieid,
createtime,
pv,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
FROM lxw1234;
--rn = 1 的記錄,就是我們想要的結果
cookieid day pv rn
----------------------------------
cookie1 2015-04-12 7 1
cookie1 2015-04-11 5 1
cookie1 2015-04-15 4 1
cookie1 2015-04-16 4 2
cookie1 2015-04-13 3 2
cookie1 2015-04-14 2 3
cookie1 2015-04-10 1 3
cookie2 2015-04-15 9 1
cookie2 2015-04-16 7 1
cookie2 2015-04-13 6 1
cookie2 2015-04-12 5 2
cookie2 2015-04-14 3 2
cookie2 2015-04-11 3 3
cookie2 2015-04-10 2 3
FIRST_VALUE
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM lxw1234;
cookieid createtime url rn first1
---------------------------------------------------------
cookie1 2015-04-10 10:00:00 url1 1 url1
cookie1 2015-04-10 10:00:02 url2 2 url1
cookie1 2015-04-10 10:03:04 1url3 3 url1
cookie1 2015-04-10 10:10:00 url4 4 url1
cookie1 2015-04-10 10:50:01 url5 5 url1
cookie1 2015-04-10 10:50:05 url6 6 url1
cookie1 2015-04-10 11:00:00 url7 7 url1
cookie2 2015-04-10 10:00:00 url11 1 url11
cookie2 2015-04-10 10:00:02 url22 2 url11
cookie2 2015-04-10 10:03:04 1url33 3 url11
cookie2 2015-04-10 10:10:00 url44 4 url11
cookie2 2015-04-10 10:50:01 url55 5 url11
cookie2 2015-04-10 10:50:05 url66 6 url11
cookie2 2015-04-10 11:00:00 url77 7 url11