Analysis of the AVRO File Structure


2020-06-16 05:14


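After studying the AVRO specification, I drew a diagram that shows how the contents of a file are laid out. It is for reference only. The detailed explanation follows the diagram.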

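A binary file produced by standard AVRO serialization is made up of three kinds of parts overall: the file header (Header), data blocks (Data Block), and synchronization markers (Synchronization marker).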

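The file header is the large cyan box labeled Header.
The data block is the gray Data Block part directly below the file header.
The synchronization marker is the orange Synchronization marker part immediately below the data block.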

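AVRO uses synchronization markers to split a large body of data into smaller blocks that are stored back to back in the same file. This makes concurrent processing easier: different threads can work on different data blocks at the same time without affecting one another. For this reason, the bottom data block in the diagram above may be followed by further synchronization markers and data blocks, depending on how much data there is.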
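The block layout described above can be sketched in a few lines of Python. This is only an illustration built on assumptions: it follows the block layout given in the Avro specification (object count, payload size in bytes, payload, then the 16-byte sync marker), it builds a tiny synthetic file body in memory instead of reading a real file, and the helper names `write_long`, `read_long`, `make_block`, `iter_blocks` and `next_block_start` are made up for this example, not part of any Avro library. The sync marker is a fixed stand-in here; in a real file it is generated randomly.

```python
SYNC_SIZE = 16  # the sync marker is always 16 bytes


def write_long(value):
    """Encode an int as an Avro long: zigzag, then base-128 varint."""
    n = (value << 1) ^ (value >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_long(buf, pos):
    """Decode one Avro long starting at pos; return (value, new_pos)."""
    shift = 0
    accum = 0
    while True:
        byte = buf[pos]
        pos += 1
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1), pos


def make_block(records, sync):
    """Build one data block: count, payload size, payload, sync marker."""
    # Each record is written as an Avro "bytes" value (length + raw bytes).
    payload = b"".join(write_long(len(r)) + r for r in records)
    return write_long(len(records)) + write_long(len(payload)) + payload + sync


def iter_blocks(buf, sync, pos=0):
    """Yield (object_count, payload) per block, jumping ahead by block size."""
    while pos < len(buf):
        count, pos = read_long(buf, pos)
        size, pos = read_long(buf, pos)
        payload = buf[pos:pos + size]
        pos += size
        if buf[pos:pos + SYNC_SIZE] != sync:
            raise ValueError("sync marker mismatch; the data may be corrupt")
        pos += SYNC_SIZE
        yield count, payload


def next_block_start(buf, sync, offset):
    """Return the position right after the first sync marker at or after offset."""
    found = buf.find(sync, offset)
    return -1 if found < 0 else found + SYNC_SIZE


sync = bytes(range(SYNC_SIZE))  # fixed stand-in for the random marker
body = make_block([b"a", b"bb"], sync) + make_block([b"ccc"], sync)
blocks = list(iter_blocks(body, sync))

# A worker handed an arbitrary offset can align itself to a block boundary.
start = next_block_start(body, sync, 3)
tail = list(iter_blocks(body, sync, start))
```

`iter_blocks` shows sequential reading, where the size field lets the reader skip from block to block. `next_block_start` shows the case where a worker starts somewhere in the middle of the file and has to find the next block boundary by looking for the marker.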

The AVRO file header consists of three parts, as shown in the diagram above.

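The header starts with four bytes: 'O', 'b', 'j', followed by the byte value 1. These four bytes are usually called the magic.
Immediately after the magic comes AVRO's Meta Data, which includes the schema.
The header ends with the 16-byte synchronization marker.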
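The three header parts listed above can be read back with a short parser. The following is a minimal sketch, not the reference implementation. It assumes the encoding given in the Avro specification: the metadata is an Avro map from string to bytes, written as blocks of key/value pairs and terminated by a zero count, and the sync marker is 16 bytes. The functions `build_header` and `parse_header` are hypothetical helpers written for this example, and the header is built in memory so the snippet runs without a real file.

```python
MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then the byte value 1
SYNC_SIZE = 16


def write_long(value):
    """Encode an int as an Avro long: zigzag, then base-128 varint."""
    n = (value << 1) ^ (value >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_long(buf, pos):
    """Decode one Avro long starting at pos; return (value, new_pos)."""
    shift = 0
    accum = 0
    while True:
        byte = buf[pos]
        pos += 1
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1), pos


def read_bytes(buf, pos):
    """Read a length-prefixed byte string; return (data, new_pos)."""
    length, pos = read_long(buf, pos)
    return buf[pos:pos + length], pos + length


def build_header(meta, sync):
    """Assemble magic + metadata map + sync marker (for the demo only)."""
    out = bytearray(MAGIC)
    if meta:
        out += write_long(len(meta))
        for key, value in meta.items():
            raw_key = key.encode("utf-8")
            out += write_long(len(raw_key)) + raw_key
            out += write_long(len(value)) + value
    out += write_long(0)  # a zero count ends the map
    out += sync
    return bytes(out)


def parse_header(buf):
    """Return (metadata dict, sync marker, offset of the first data block)."""
    if buf[:4] != MAGIC:
        raise ValueError("missing 'Obj' + 1 magic; not an Avro container file")
    pos = 4
    meta = {}
    while True:
        count, pos = read_long(buf, pos)
        if count == 0:
            break
        if count < 0:
            # A negative count is followed by the block's size in bytes.
            count = -count
            _, pos = read_long(buf, pos)
        for _ in range(count):
            raw_key, pos = read_bytes(buf, pos)
            value, pos = read_bytes(buf, pos)
            meta[raw_key.decode("utf-8")] = value
    sync = buf[pos:pos + SYNC_SIZE]
    if len(sync) != SYNC_SIZE:
        raise ValueError("truncated header: sync marker is incomplete")
    return meta, sync, pos + SYNC_SIZE


sync = bytes(range(SYNC_SIZE))  # fixed stand-in for the random marker
header = build_header({"avro.schema": b'"string"', "avro.codec": b"null"}, sync)
meta, found_sync, data_start = parse_header(header)
```

The marker returned by `parse_header` is the same 16 bytes that close every data block further down in the file, so a reader has to keep it around after the header has been parsed.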

----------------------------------Question divider------------------------------

what is “sync marker” used for in avro format

I have been struggling with the "sync marker" part in avro. The doc says it's used for splitting files. Not sure what it really means. Some questions:

1 How does it use this part to split files? Does it scan the whole file, find such parts and split there? If yes, wouldn't it be more efficient to just read the size of each data block, jump ahead to the next block and do the same thing again?

2 When the data block is compressed, is the sync marker compressed too?

3 Why does it have multiple data blocks rather than putting everything into one single data block? Is it because the size of the block is of long type, which limits the length it can hold?

4 Is the data block a logical view? Will all the data blocks still be in a single file in the filesystem?

Thanks for information for any point above.

Link: http://stackoverflow.com/questions/27360727/what-is-sync-marker-used-for-in-avro-format
