The trouble with timestamps when Hive and Impala work on Parquet files

Original

bsf5521

2020-06-23 10:05

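Preface: We planned to use Hive as our data warehouse. For legacy reasons, all existing data processing had been done in Impala, and the data files are Parquet files. Because the cluster has few resources and the files to process are large, the plan was to run the analysis offline in Hive and push the resulting small files to a database or to Impala for display.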

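Preparation: Set up CDH 5.9 and migrate the existing data from the old cluster to the new one. The data is dynamically partitioned by day, and the partitioned data is still stored in Parquet format.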

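Problem: Because the partition field is of type timestamp, I noticed something strange by chance: the times returned by Hive queries were 8 hours later than the times returned by Impala queries. Comparing both against the original data showed that the timestamps as processed by Hive were the wrong ones.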
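The 8-hour gap can be reproduced with a small sketch. It rests on an assumption the post does not state: that Hive treats the stored value as UTC and renders it in the local time zone when reading, while Impala returns the stored wall-clock value unchanged. The function names and the sample value are hypothetical, and the local zone is assumed to be UTC+8.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the cluster's local time zone is UTC+8, matching the 8-hour gap.
LOCAL_TZ = timezone(timedelta(hours=8))


def impala_style_read(stored: datetime) -> datetime:
    """Return the stored wall-clock value as is (no time zone adjustment)."""
    return stored


def hive_style_read(stored: datetime) -> datetime:
    """Treat the stored value as UTC and render it in the local zone (assumed behavior)."""
    as_utc = stored.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(LOCAL_TZ).replace(tzinfo=None)


stored = datetime(2020, 6, 23, 10, 5, 0)  # hypothetical value written by Impala
print("Impala-style read:", impala_style_read(stored))
print("Hive-style read:  ", hive_style_read(stored))
print("Difference:       ", hive_style_read(stored) - impala_style_read(stored))
```

Under these assumptions the two readers disagree by exactly the local UTC offset, which is consistent with the 8 hours observed above.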

Based on this discussion it seems that when support for saving timestamps in Parquet was added to Hive, the primary goal was to be compatible with Impala's implementation, which probably predates the addition of the timestamp_millis type to the Parquet specification.

Impala's timestamp representation maps to the int96 Parquet type (4 bytes for the date, 8 bytes for the time, details in the linked discussion).

So no, storing a Hive timestamp in Parquet does not use the timestamp_millis type, but Impala's int96 timestamp representation instead.
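To make the int96 layout in the quoted passage concrete, here is a minimal sketch that packs and unpacks such a value. The quote only gives the sizes (4 bytes for the date, 8 bytes for the time); the rest is an assumption on my part: the 8 bytes hold nanoseconds within the day and come first, the 4 bytes hold a Julian day number, and both are little-endian. The helper names are hypothetical.

```python
import struct
from datetime import datetime, timedelta

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
UNIX_EPOCH = datetime(1970, 1, 1)


def encode_int96(ts: datetime) -> bytes:
    """Pack a naive timestamp into 12 bytes: nanos-of-day (8) then Julian day (4)."""
    delta = ts - UNIX_EPOCH
    julian_day = JULIAN_DAY_OF_UNIX_EPOCH + delta.days
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1_000
    # "<" selects little-endian with no padding, so the result is exactly 12 bytes.
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96(raw: bytes) -> datetime:
    """Unpack 12 bytes back into a naive timestamp (microsecond precision)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return UNIX_EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1_000)


sample = datetime(2020, 6, 23, 10, 5, 0)  # hypothetical value
raw = encode_int96(sample)
print(len(raw), raw.hex())
print(decode_int96(raw))
```

Note that nothing in these 12 bytes records a time zone, so a reader and a writer that disagree about whether the value is UTC or local time will disagree by the local offset.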

The above is the cause of the problem as I found it. I have left it in the original English, since it is not hard to follow.

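Now for my fix. Since I plan to use Hive rather than Impala over the long term, I converted the timestamp data with to_utc_timestamp(insert_time, 'GMT+8') (look the function up if it is unfamiliar), then repartitioned the data and stored it in ORC file format. In brief, ORC is a columnar storage format whose data files take up little space.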
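As a rough illustration of what the conversion does to a single value, the sketch below mimics to_utc_timestamp with a fixed offset: it interprets a naive timestamp as local time at UTC+8 and returns the UTC equivalent, which moves the value back by 8 hours. This is a Python stand-in rather than Hive's implementation; the function name, the integer offset parameter, and the sample value are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc_timestamp_sketch(ts: datetime, utc_offset_hours: int) -> datetime:
    """Interpret a naive timestamp as local time at a fixed offset; return naive UTC."""
    local = ts.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc).replace(tzinfo=None)


shifted = datetime(2020, 6, 23, 18, 5, 0)  # hypothetical value as Hive displayed it
corrected = to_utc_timestamp_sketch(shifted, 8)
print(corrected)  # 2020-06-23 10:05:00
```

A fixed offset is enough here because the post uses 'GMT+8'; a region with daylight saving time would need a real time zone database instead.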

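Unfortunately, Impala does not support ORC data files, so I had to settle for a compromise: large data files are processed offline in Hive, and the results are pushed to Impala or a database, saved in a format that Impala supports.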

This post is dedicated to the brain cells that died solving this problem!
