Hive: Getting the Last Element of an Array

Original

osc_xxp9voom

2021-01-30 09:52

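Introduction:

How do you split a field into an array with split and then fetch the element at the last index in Hive? The approaches below are tried one by one to see which of them actually work.

Field format shopList : productA,productB,productC

Table name shopTable : shopListTable

1. split + size - fails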

 hive -e "
 select split(shopList,',')[size(split(shopList,','))-1]
 from shopTable;
 "

FAILED: SemanticException 2:25 Non-constant expressions for array indexes not supported. Error encountered near token '1'

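The first instinct is to turn the string into an array with split and then use size - 1 as the index of the last element. However, the Hive version used here does not allow a non-constant expression as an array index, so the query fails.

2. split + size + cast - fails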

hive -e "
select array[index] as product
from
( select split(shopList,',') array, cast((size(split(shopList,',')) - 1) as int) index from shopTable) tmp
";

FAILED: SemanticException 2:12 Non-constant expressions for array indexes not supported. Error encountered near token 'index'

Same problem as above, which is a bit of a trap. Moving on.

3. regexp_extract - runs, but the result is wrong

hive -e "
select regexp_extract(shopList,'(\,[^\,]+)',1) 
from shopTable
";

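split and regexp_extract serve a similar purpose, but the regular expression here does not do what is needed: as written, it matches the first comma followed by non-comma characters, so it would return something like ,productB rather than the last element. This approach is set aside for now.

4. reverse + split + reverse - works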
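One way the regular expression could be repaired is to anchor it to the end of the string, so that it captures only the run of non-comma characters that closes the value. Hive's regexp_extract follows Java regex syntax, so the idea can be sketched in plain Java. The class and method names below are hypothetical, and the pattern has not been run against Hive here; presumably it would be passed as regexp_extract(shopList, '([^,]+)$', 1).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
public final class RegexLastElement {
    // One or more non-comma characters, anchored to the end of the input.
    private static final Pattern LAST = Pattern.compile("([^,]+)$");

    /** Returns the text after the final comma, or null if nothing matches. */
    static String lastElement(String s) {
        if (s == null) { return null; }
        Matcher m = LAST.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastElement("productA,productB,productC"));
    }
}
```

Note that a value ending in a comma has no trailing element under this pattern, so the sketch returns null for it.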

hive -e "
select reverse(split(reverse(shopList), ',')[0]) 
from shopTable";

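This is how others online do it. In my testing it works, but it is costly in both performance and resources.

5. Custom UDF - works (recommended 👍)

So far, of the four approaches above, only the reverse one works, and its efficiency is still a concern. The heavy-handed option is to write the function yourself.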
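To see which index of the reversed split holds the last element, the trick can be replayed in plain Java. This is only an illustration of the string manipulation, not Hive code, and the class and method names are hypothetical. Like Hive arrays, Java arrays are zero-based.

```java
// Hypothetical helper, not part of the original post.
public final class ReverseLastElement {
    /** Reverses s, splits on commas, and returns the reversed piece at the given index. */
    static String pieceOfReversed(String s, int index) {
        String reversed = new StringBuilder(s).reverse().toString();
        // Limit -1 keeps trailing empty strings, so the array is never empty.
        String[] parts = reversed.split(",", -1);
        return new StringBuilder(parts[index]).reverse().toString();
    }

    public static void main(String[] args) {
        String shopList = "productA,productB,productC";
        System.out.println(pieceOfReversed(shopList, 0)); // last element
        System.out.println(pieceOfReversed(shopList, 1)); // second to last
    }
}
```

Index 0 of the reversed split corresponds to the last element of the original string, and index 1 to the second to last. The two full-string reversals per row are extra work compared with a single split.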

java :

package com.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

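/**
 * @title: SplitString
 * @Date: 2021-01-21 14:21
 * @Version 1.0
 *
 * Returns the last comma-separated element of the input string.
 */
public final class SplitString extends UDF {
    public Text evaluate(final Text s) {
        // NULL in, NULL out
        if (s == null) { return null; }
        // Java's split drops trailing empty strings, so "a,b," yields ["a", "b"]
        String[] arr = s.toString().split(",");
        // An input made up only of commas leaves nothing to return
        if (arr.length == 0) { return null; }
        return new Text(arr[arr.length - 1]);
    }
}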

sh :

hive -e "
add jar ~/your-class-1.0-SNAPSHOT.jar;
create temporary function my_split as 'com.hive.udf.SplitString';
select my_split(shopList) from shopTable group by my_split(shopList);
"

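Package the Java code and upload the jar to the path referenced in the sh script. The custom function then does the split + index work in a single call. The approach is not limited to the last element: the UDF shown returns only the last one, but the same idea extends to any position counted from the end. In my testing it performed better than reverse. Well worth having!

Besides UDFs (one in, one out), there are also UDAFs (many in, one out) and UDTFs (one in, many out). If you are interested, try implementing those yourself.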
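As a rough sketch of how "any position from the end" might look, the core logic can take a second argument n, where 1 means the last element. This is plain Java without the Hive classes so that it runs on its own; the class name, method name, and the choice to return null for out-of-range positions are all assumptions, not the author's code. In a real UDF this would presumably become a second evaluate parameter.

```java
// Hypothetical helper, not part of the original post.
public final class NthFromEnd {
    /**
     * Returns the n-th comma-separated element counted from the end
     * (n = 1 is the last element), or null if it does not exist.
     */
    static String fromEnd(String s, int n) {
        if (s == null || n < 1) { return null; }
        // Same split as the UDF: trailing empty strings are dropped.
        String[] parts = s.split(",");
        if (n > parts.length) { return null; }
        return parts[parts.length - n];
    }

    public static void main(String[] args) {
        String shopList = "productA,productB,productC";
        System.out.println(fromEnd(shopList, 1));
        System.out.println(fromEnd(shopList, 2));
    }
}
```

Returning null for a position that does not exist mirrors the null handling already present in the UDF; raising an error instead would be an equally defensible choice.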
