Spark SQL: Using Functions (1)

Function reference:

https://spark.apache.org/docs/2.3.0/api/sql/index.html#map

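concat_ws: joins strings using the specified separator.

For example:

Joining strings:

concat_ws("_", field1, field2) returns the values of the two fields joined by "_", i.e. "field1_field2".

Joining array elements:

concat_ws("_", array("a", "b", "c")) returns "a_b_c".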
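The two cases above can be mimicked in plain Python to make the behaviour concrete. This is only a sketch of the semantics, not Spark code: the helper below is a hypothetical stand-in that happens to share the name concat_ws, and it assumes (as Spark SQL does) that NULL arguments are skipped and array arguments contribute each of their elements. The sample values "zhang" and "san" are made up for illustration.

```python
def concat_ws(sep, *args):
    """Plain-Python stand-in for the semantics of Spark SQL's concat_ws."""
    parts = []
    for arg in args:
        if arg is None:
            continue  # NULL arguments are skipped
        if isinstance(arg, (list, tuple)):
            # array arguments contribute each non-NULL element
            parts.extend(x for x in arg if x is not None)
        else:
            parts.append(arg)
    return sep.join(parts)


joined_fields = concat_ws("_", "zhang", "san")
joined_array = concat_ws("_", ["a", "b", "c"])
print(joined_fields)  # zhang_san
print(joined_array)   # a_b_c
```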

 

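collect_set: collects the aggregated values into an array with duplicates removed; usually used together with GROUP BY. The order of the elements in the result is not guaranteed.

For example, given the following table T_course: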

id    name    course
1    zhang san    Chinese
2    zhang san    Math
3    zhang san    English
spark.sql("select name, collect_set(course) as course_set from T_course group by name");

The result is:

name    course_set
zhang san    [Chinese,Math,English]
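As a rough illustration of what collect_set does per group, the sketch below groups the same rows in plain Python. It is not Spark code; the fourth row is a hypothetical duplicate added only to show that repeated values are dropped, and the result is sorted purely to make the printed output stable.

```python
from collections import defaultdict

# Rows of the T_course table above, plus one duplicate to show deduplication.
rows = [
    (1, "zhang san", "Chinese"),
    (2, "zhang san", "Math"),
    (3, "zhang san", "English"),
    (4, "zhang san", "Math"),  # hypothetical extra row, not in the original table
]

course_sets = defaultdict(set)
for _id, name, course in rows:
    course_sets[name].add(course)  # a set keeps each course only once

# Sorting is only for a stable printout; collect_set itself promises no order.
result = {name: sorted(courses) for name, courses in course_sets.items()}
print(result)  # {'zhang san': ['Chinese', 'English', 'Math']}
```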

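RANK, DENSE_RANK, and ROW_NUMBER all number the rows within each partition according to the partition's sort order, but they differ in how ties are handled.

RANK: may produce gaps in the numbering. For example, when sorting by score, if the top two rows both have 100 and the third has 98, the first two get rank 1 and the third gets rank 3.

DENSE_RANK: produces consecutive numbers. In the same example, the first two rows tie at 1 and the third gets 2.

ROW_NUMBER: as the name suggests, the position of the row. In the same example, the three rows get 1, 2, and 3; the order between tied rows is arbitrary.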

 

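The following example may help. Rows are partitioned by grade and sorted by score in descending order, and RANK, DENSE_RANK, and ROW_NUMBER are added as new columns:
Exam score ranking
name         grade      score    RANK    DENSE_RANK    ROW_NUMBER
Zhang San    Grade 1    100      1       1             1
Li Si        Grade 1    100      1       1             2
Wang Wu      Grade 1    98       3       2             3
Xiao Ming    Grade 2    100      1       1             1
Xiao Fang    Grade 2    95       2       2             2
Xiao Min     Grade 2    90       3       3             3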
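To see how the three numbers in the table arise, the ranking rules can be written out in plain Python. This is a sketch of the semantics only, not how Spark evaluates window functions; the helper name rank_partition is hypothetical, and the rows mirror the exam table above with names transliterated.

```python
from itertools import groupby

# (name, grade, score) rows mirroring the exam table above
rows = [
    ("Zhang San", "Grade 1", 100),
    ("Li Si", "Grade 1", 100),
    ("Wang Wu", "Grade 1", 98),
    ("Xiao Ming", "Grade 2", 100),
    ("Xiao Fang", "Grade 2", 95),
    ("Xiao Min", "Grade 2", 90),
]


def rank_partition(scores_desc):
    """Return (rank, dense_rank, row_number) for scores sorted descending."""
    out = []
    rank = dense = 0
    prev = None
    for row_number, score in enumerate(scores_desc, start=1):
        if score != prev:
            rank = row_number  # RANK jumps to the row position after ties
            dense += 1         # DENSE_RANK just moves to the next integer
            prev = score
        out.append((rank, dense, row_number))
    return out


ranked = []
rows_by_grade = sorted(rows, key=lambda r: r[1])  # PARTITION BY grade
for grade, group in groupby(rows_by_grade, key=lambda r: r[1]):
    part = sorted(group, key=lambda r: -r[2])  # ORDER BY score DESC (stable for ties)
    for row, nums in zip(part, rank_partition([r[2] for r in part])):
        ranked.append(row + nums)

for r in ranked:
    print(r)
```

Because Python's sort is stable, tied rows keep their input order here; in Spark the order between tied rows is not guaranteed, so ROW_NUMBER may assign 1 and 2 to the two 100-point rows either way round.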
 

sparkSession.sql("SELECT * , " +
      "RANK() OVER (PARTITION BY grade ORDER BY score DESC) AS rank, " +
      "DENSE_RANK() OVER (PARTITION BY grade ORDER BY score DESC) AS dense_rank, " +
      "ROW_NUMBER() OVER (PARTITION BY grade ORDER BY score DESC) AS row_number " +
      "FROM ScoreDetail").show()

 

cast: add a new column filled with NULL, typed as string, in Spark:

select cast(null as string) as newcol from db.table

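Hive analytic window functions (1): SUM, AVG, MIN, and MAX compute statistics either over all rows in a partition or cumulatively over consecutive rows, which makes running sums and running products possible.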

https://blog.csdn.net/abc200941410128/article/details/78408942

Data preparation:

CREATE EXTERNAL TABLE lxw1234 (  
cookieid string,  
createtime string,   --day   
pv INT  
) ROW FORMAT DELIMITED   
FIELDS TERMINATED BY ','   
stored as textfile location '/tmp/lxw11/';  
   
DESC lxw1234;  
cookieid                STRING   
createtime              STRING   
pv INT   
   
hive> select * from lxw1234;  
OK  
cookie1 2015-04-10      1  
cookie1 2015-04-11      5  
cookie1 2015-04-12      7  
cookie1 2015-04-13      3  
cookie1 2015-04-14      2  
cookie1 2015-04-15      4  
cookie1 2015-04-16      4  

SUM: note that the result depends on ORDER BY, which sorts in ascending order by default.

SELECT cookieid,
    createtime,
    pv,
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default frame: from the start of the partition to the current row
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start to the current row; same result as pv1
    SUM(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
    FROM lxw1234;
     
    cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6 
    -----------------------------------------------------------------------------
    cookie1  2015-04-10      1       1       1       26      1       6       26
    cookie1  2015-04-11      5       6       6       26      6       13      25
    cookie1  2015-04-12      7       13      13      26      13      16      20
    cookie1  2015-04-13      3       16      16      26      16      18      13
    cookie1  2015-04-14      2       18      18      26      17      21      10
    cookie1  2015-04-15      4       22      22      26      16      20      8
    cookie1  2015-04-16      4       26      26      26      13      13      4

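If ROWS BETWEEN is not specified, the frame defaults to everything from the start of the partition through the current row (strictly, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which gives the same result as the ROWS form here because the createtime values are unique).
If ORDER BY is not specified, all values in the partition are aggregated.
The key is to understand ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no limit; UNBOUNDED PRECEDING means from the start of the partition, UNBOUNDED FOLLOWING means to the end of the partition

AVG, MIN, and MAX are used the same way as SUM.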
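The frame rules above can be checked by hand with a small Python sketch. It is not Hive or Spark code: over_rows is a hypothetical helper that applies an aggregate over a ROWS BETWEEN frame, with None standing for UNBOUNDED and 0 for CURRENT ROW. The input is cookie1's pv column in createtime order.

```python
pv = [1, 5, 7, 3, 2, 4, 4]  # cookie1's pv values, ordered by createtime


def over_rows(values, agg, preceding, following):
    """Apply agg over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    None stands for UNBOUNDED; 0 stands for CURRENT ROW.
    """
    n = len(values)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n - 1 if following is None else min(n - 1, i + following)
        out.append(agg(values[start:end + 1]))
    return out


pv2 = over_rows(pv, sum, None, 0)     # UNBOUNDED PRECEDING .. CURRENT ROW
pv3 = over_rows(pv, sum, None, None)  # whole partition
pv4 = over_rows(pv, sum, 3, 0)        # 3 PRECEDING .. CURRENT ROW
pv5 = over_rows(pv, sum, 3, 1)        # 3 PRECEDING .. 1 FOLLOWING
pv6 = over_rows(pv, sum, 0, None)     # CURRENT ROW .. UNBOUNDED FOLLOWING
min4 = over_rows(pv, min, 3, 0)       # same frame as pv4, but with MIN

print(pv2)   # [1, 6, 13, 16, 18, 22, 26]
print(pv4)   # [1, 6, 13, 16, 17, 16, 13]
print(pv5)   # [6, 13, 16, 18, 21, 20, 13]
print(pv6)   # [26, 25, 20, 13, 10, 8, 4]
print(min4)  # [1, 1, 1, 1, 2, 2, 2]
```

Passing a different aggregate (min, max, or an averaging function) reuses the same frame logic, which is why AVG, MIN, and MAX behave like SUM with respect to ROWS BETWEEN.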

--AVG
    SELECT cookieid,
    createtime,
    pv,
    AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default frame: from the start of the partition to the current row
    AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start to the current row; same result as pv1
    AVG(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
    AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows
    AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row
    AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
    FROM lxw1234; 
    cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6 
    -----------------------------------------------------------------------------
    cookie1 2015-04-10      1       1.0     1.0     3.7142857142857144      1.0     3.0     3.7142857142857144
    cookie1 2015-04-11      5       3.0     3.0     3.7142857142857144      3.0     4.333333333333333       4.166666666666667
    cookie1 2015-04-12      7       4.333333333333333       4.333333333333333       3.7142857142857144      4.333333333333333       4.0     4.0
    cookie1 2015-04-13      3       4.0     4.0     3.7142857142857144      4.0     3.6     3.25
    cookie1 2015-04-14      2       3.6     3.6     3.7142857142857144      4.25    4.2     3.3333333333333335
    cookie1 2015-04-15      4       3.6666666666666665      3.6666666666666665      3.7142857142857144      4.0     4.0     4.0
    cookie1 2015-04-16      4       3.7142857142857144      3.7142857142857144      3.7142857142857144      3.25    3.25    4.0
 --MIN
    SELECT cookieid,
    createtime,
    pv,
    MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default frame: from the start of the partition to the current row
    MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start to the current row; same result as pv1
    MIN(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
    MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows
    MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row
    MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
    FROM lxw1234;
     
    cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6 
    -----------------------------------------------------------------------------
    cookie1 2015-04-10      1       1       1       1       1       1       1
    cookie1 2015-04-11      5       1       1       1       1       1       2
    cookie1 2015-04-12      7       1       1       1       1       1       2
    cookie1 2015-04-13      3       1       1       1       1       1       2
    cookie1 2015-04-14      2       1       1       1       2       2       2
    cookie1 2015-04-15      4       1       1       1       2       2       4
    cookie1 2015-04-16      4       1       1       1       2       2       4

 

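Sum per group

select  sum(col) from  table  group by 

Product per group

The product is obtained by summing logarithms and then exponentiating, since log(a*b) = log(a) + log(b). This only works when every value is positive. The log transform introduces a small loss of precision, so round() is applied to the result.

select  round(power(10, sum(log(10, col))))   from  table  group by 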
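The arithmetic behind the log trick can be tried out in plain Python. This is a sketch with made-up values for a single group, not a query; it mirrors SUM(LOG(10, col)), POWER(10, ...), and ROUND(...) step by step.

```python
import math

values = [2, 3, 4]  # hypothetical positive values in one group

# SUM(LOG(10, col)): add up the base-10 logarithms
log_sum = sum(math.log10(v) for v in values)

# POWER(10, ...): undo the logarithm to get the product
raw_product = 10 ** log_sum

# ROUND(...): remove the small floating-point error
product = round(raw_product)
print(product)  # 24
```

The intermediate raw_product is only approximately 24, which is why the rounding step is needed. Zero or negative values cannot be handled this way, because their logarithm is undefined.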

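NTILE: for example, finding for each cookie the top third of days by pv.

NTILE(n) splits the ordered rows of a partition into n buckets and returns the bucket number of the current row.
NTILE does not support ROWS BETWEEN; for example, NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed.
If the rows cannot be divided evenly, the extra rows go to the earliest buckets, starting with the first.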

SELECT 
    cookieid,
    createtime,
    pv,
    NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn 
    FROM lxw1234;
     
    -- rows with rn = 1 are the result we want
     
    cookieid createtime    pv       rn
    ----------------------------------
    cookie1 2015-04-12      7       1
    cookie1 2015-04-11      5       1
    cookie1 2015-04-15      4       1
    cookie1 2015-04-16      4       2
    cookie1 2015-04-13      3       2
    cookie1 2015-04-14      2       3
    cookie1 2015-04-10      1       3
    cookie2 2015-04-15      9       1
    cookie2 2015-04-16      7       1
    cookie2 2015-04-13      6       1
    cookie2 2015-04-12      5       2
    cookie2 2015-04-14      3       2
    cookie2 2015-04-11      3       3
    cookie2 2015-04-10      2       3
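How the bucket numbers in the result above come about can be sketched in plain Python. The function below is a hypothetical stand-in for the semantics of NTILE, not Hive or Spark code: it splits row_count ordered rows into n buckets and gives any leftover rows to the earliest buckets.

```python
def ntile(n, row_count):
    """Return the bucket number (1..n) for each of row_count ordered rows."""
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)  # earlier buckets get the extra rows
        buckets.extend([bucket] * size)
    return buckets


pv_desc = [7, 5, 4, 4, 3, 2, 1]  # cookie1's pv values, ORDER BY pv DESC
rn = ntile(3, len(pv_desc))
print(rn)  # [1, 1, 1, 2, 2, 3, 3]

top_third = [pv for pv, bucket in zip(pv_desc, rn) if bucket == 1]
print(top_third)  # [7, 5, 4]
```

With 7 rows and 3 buckets the sizes are 3, 2, and 2, which matches the rn column for cookie1.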

FIRST_VALUE: returns the first value in the partition, after sorting, up to the current row.

SELECT cookieid,  
createtime,  
url,  
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,  
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1   
FROM lxw1234;  
   
cookieid  createtime            url     rn      first1  
---------------------------------------------------------  
cookie1 2015-04-10 10:00:00     url1    1       url1  
cookie1 2015-04-10 10:00:02     url2    2       url1  
cookie1 2015-04-10 10:03:04     1url3   3       url1  
cookie1 2015-04-10 10:10:00     url4    4       url1  
cookie1 2015-04-10 10:50:01     url5    5       url1  
cookie1 2015-04-10 10:50:05     url6    6       url1  
cookie1 2015-04-10 11:00:00     url7    7       url1  
cookie2 2015-04-10 10:00:00     url11   1       url11  
cookie2 2015-04-10 10:00:02     url22   2       url11  
cookie2 2015-04-10 10:03:04     1url33  3       url11  
cookie2 2015-04-10 10:10:00     url44   4       url11  
cookie2 2015-04-10 10:50:01     url55   5       url11  
cookie2 2015-04-10 10:50:05     url66   6       url11  
cookie2 2015-04-10 11:00:00     url77   7       url11 
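The query above can be mirrored in plain Python to show why first1 is the same for every row of a cookie. This is a sketch of the semantics only; it uses a handful of rows from the result table, deliberately shuffled, and assumes the default frame that starts at the beginning of the partition.

```python
from itertools import groupby

# (cookieid, createtime, url) rows taken from the result table above, shuffled
rows = [
    ("cookie1", "2015-04-10 10:00:02", "url2"),
    ("cookie1", "2015-04-10 10:00:00", "url1"),
    ("cookie2", "2015-04-10 10:00:02", "url22"),
    ("cookie2", "2015-04-10 10:00:00", "url11"),
    ("cookie1", "2015-04-10 10:03:04", "1url3"),
]

result = []
# PARTITION BY cookieid ORDER BY createtime
rows_sorted = sorted(rows, key=lambda r: (r[0], r[1]))
for cookieid, group in groupby(rows_sorted, key=lambda r: r[0]):
    part = list(group)
    first_url = part[0][2]  # first value in the ordered partition
    for rn, (cid, createtime, url) in enumerate(part, start=1):
        result.append((cid, createtime, url, rn, first_url))

for r in result:
    print(r)
```

Since the frame always begins at the first row of the partition, every row of cookie1 sees url1 and every row of cookie2 sees url11.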