Processing JSON Data in Hive

Original

Jackie_ZHF

2019-06-14 10:48

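Two approaches
1. Load each JSON document into a Hive table as a single string, then parse the data already in Hive with UDFs, for example LATERAL VIEW json_tuple, to extract the columns you need.

2. Split the JSON into individual fields at load time, so that the data in the Hive table is already parsed. This requires a third-party SerDe.

Test data
The test data is a set of Sina Weibo comments in the following format: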

{
"appCode": "weibo",
"dataType": "comment",
"pageToken": null,
"data": [
{
"rating": null,
"commenterId": "2235850235",
"tags": null,
"commenterScreenName": "-快樂的豬頭-",
"publishDateStr": "2017-05-22T02:27:52",
"publishDate": 1495420072,
"likeCount": null,
"commentCount": null,
"source": "iPhone 6",
"url": null,
"referId": "4110146290137390",
"content": "盲道上都是共享單車了，管一管吧",
"imageUrls": null,
"id": "4110152040668671"
},
{
"rating": null,
"commenterId": "1457994444",
"tags": null,
"commenterScreenName": "彳拓",
"publishDateStr": "2017-05-22T02:06:26",
"publishDate": 1495418786,
"likeCount": null,
"commentCount": null,
"source": "iPhone 6 Plus",
"url": null,
"referId": "4110146290137390",
"content": "如何界定那車是殘疾人的？",
"imageUrls": null,
"id": "4110146646971555"
}
]
}

The data is stored in JSON format.

Approach 1:
Load the data

CREATE TABLE IF NOT EXISTS tmp_json_test (
json string
)
STORED AS textfile ;

load data local inpath '/opt/datas/weibotest.json' overwrite into table tmp_json_test;

Parse the data:

select get_json_object(t.json,'$.id'), get_json_object(t.json,'$.total_number') from tmp_json_test t ;

select t2.* from tmp_json_test t1 lateral view json_tuple(t1.json, 'id', 'total_number') t2 as c1, c2;
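To sanity-check what the two queries above should return for a given line, the top-level key lookup can be approximated locally. This is only a rough Python sketch, not Hive's implementation: the helper name `json_tuple_like` and the sample record are hypothetical, and it assumes that extracted values come back as strings and that missing keys or unparseable input yield NULL (None here).

```python
import json
from typing import List, Optional


def json_tuple_like(json_text: str, *keys: str) -> List[Optional[str]]:
    """Approximate a top-level key lookup: one value per requested key."""
    try:
        document = json.loads(json_text)
    except json.JSONDecodeError:
        # Assumption: unparseable input gives NULL for every requested key.
        return [None] * len(keys)
    if not isinstance(document, dict):
        return [None] * len(keys)

    values: List[Optional[str]] = []
    for key in keys:
        value = document.get(key)
        if value is None:
            values.append(None)            # missing key or JSON null
        elif isinstance(value, str):
            values.append(value)           # strings are returned unquoted
        else:
            # Numbers, arrays and objects are returned as their JSON text.
            values.append(json.dumps(value, ensure_ascii=False))
    return values


# Hypothetical record with the keys used in the queries above.
record = '{"id": "7", "ids": ["2813165271"], "total_number": 3}'
row = json_tuple_like(record, "id", "total_number")
missing = json_tuple_like(record, "no_such_key")
broken = json_tuple_like("{not valid json", "id")
print(row, missing, broken)
```

Comparing a few lines this way before running the Hive queries makes it easier to tell a wrong key name apart from a malformed record.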

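Approach 2:
The second approach is more flexible and more general than the first. The key requirement is that every line must be one complete JSON document; a single JSON document cannot span multiple lines.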
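Pretty-printed JSON such as the test data shown earlier spans many lines, so it has to be collapsed before loading. A minimal Python sketch of that step follows; the function name `to_single_line` and the shortened sample document are illustrative assumptions, not part of the original workflow.

```python
import json


def to_single_line(pretty_json_text: str) -> str:
    """Collapse one pretty-printed JSON document into a single line."""
    document = json.loads(pretty_json_text)
    # With no indent, json.dumps emits no raw newlines; newlines inside
    # string values are escaped as \n. ensure_ascii=False keeps non-ASCII
    # text (for example Chinese comment content) readable.
    return json.dumps(document, ensure_ascii=False, separators=(",", ":"))


# Shortened, hypothetical version of the Weibo test document.
sample = """{
  "appCode": "weibo",
  "dataType": "comment",
  "pageToken": null,
  "data": [
    {"commenterId": "2235850235", "source": "iPhone 6"}
  ]
}"""

line = to_single_line(sample)
print(line)
```

Writing one such line per document to a text file gives input in the shape this approach expects.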

Download the JAR before using the SerDe:

http://www.congiu.net/hive-json-serde/
To use JsonSerde in Hive, add the JAR to the Hive classpath:

add jar json-serde-1.3.7-jar-with-dependencies.jar;

Create the table

create table if not exists temp_db.test_json_weibo(
appCode string
,dataType string
,pageToken string
,data string
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"ignore.malformed.json"="true"
)
STORED AS TEXTFILE;

load data local inpath '/home/hadoop/test_json_weibo.txt' into table temp_db.test_json_weibo;

Query the data

select * from temp_db.test_json_weibo limit 1;

Once the data is loaded, you can query it freely:

select * from tmp_json_array where array_contains(ids,'2813165271') or array_contains(ids,'1419789200');

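Note that if your data contains lines that are not valid JSON, running a query will throw an exception.

For testing, you can add a setting that skips the bad records:

ALTER TABLE weibo_json SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
Queries then run without errors, but the bad records become NULL.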
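Because ignored records silently turn into NULL rows, it can help to find the offending lines before loading. The following is a minimal Python sketch under stated assumptions: `find_malformed_lines` is a hypothetical helper, the sample lines are invented, and Python's JSON parser may disagree with the SerDe's parser on edge cases.

```python
import json
from typing import Iterable, List, Tuple


def find_malformed_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, error_message) for every line that is not valid JSON."""
    problems: List[Tuple[int, str]] = []
    for number, raw in enumerate(lines, start=1):
        try:
            json.loads(raw.strip())
        except json.JSONDecodeError as exc:
            # Blank lines also end up here, since an empty string is not JSON.
            problems.append((number, exc.msg))
    return problems


# Hypothetical input: one good record, one truncated record, one non-JSON line.
sample_lines = [
    '{"id": "1", "ids": ["2813165271"], "total_number": 1}',
    '{"id": "2", "ids": ["1419789200"], "total_number": ',
    'not json at all',
]

bad = find_malformed_lines(sample_lines)
for number, message in bad:
    print(f"line {number}: {message}")
```

For a real file, pass the open file object instead of the list, for example `find_malformed_lines(open(path, encoding="utf-8"))`, where `path` is whatever file you intend to load.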

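Finally, when a key in your JSON data is a Hive keyword, the loaded data will be wrong. In that case you can use a SerDe property to map a Hive column to a JSON attribute with a different name.

If ids were a Hive keyword, the CREATE TABLE statement would change as follows: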

CREATE TABLE tmp_json_array (
id string,
ids_alias array<string>,
total_number int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.ids_alias"="ids")
STORED AS TEXTFILE;
---------------------
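Author: cuteximi_1995
Source: CSDN
Original article: https://blog.csdn.net/qq_31975963/article/details/88657709
Copyright notice: This is the blogger's original article. Please include a link to the post when reposting.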
