AVRO File Structure Analysis


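After studying the AVRO specification, I drew a diagram that shows how the contents of an AVRO file are laid out. It is for reference only. A detailed explanation follows below the diagram.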

 

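A binary file produced by standard AVRO serialization is made up of three kinds of parts: the file header (Header), the data block (Data Block), and the synchronization marker (Synchronization marker).
  • The file header is the large cyan box labeled Header.
  • The data block is the gray Data Block part directly below the file header.
  • The synchronization marker is the orange Synchronization marker part that follows directly below the data block.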


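AVRO uses the synchronization marker to divide a large body of data into smaller blocks that are stored back to back in the same file. This makes concurrent processing easier: different threads can work on different data blocks at the same time without affecting one another. So after the last data block shown at the bottom of the diagram above, there may be more data blocks, each followed by a synchronization marker, depending on how much data the file holds.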
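The block layout described above can be walked with a few lines of Python. This is a minimal sketch based on the object container file layout in the AVRO specification, where each data block is a long object count, a long byte size, the serialized objects, and then the 16-byte synchronization marker. The names `read_long` and `iter_blocks` are hypothetical helpers written for this illustration, not part of any AVRO library. The sketch assumes a seekable stream that is already positioned just past the file header, and it returns each payload undecoded (it may still be compressed by the file's codec).

```python
import io

SYNC_SIZE = 16  # a synchronization marker is always 16 bytes


def read_long(stream):
    """Decode one Avro long: variable-length bytes, zig-zag encoded."""
    result = 0
    shift = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a long")
        result |= (chunk[0] & 0x7F) << shift
        if not (chunk[0] & 0x80):  # high bit clear: this was the last byte
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo the zig-zag mapping


def iter_blocks(stream, sync_marker):
    """Yield (object_count, raw_payload) for every data block.

    Hypothetical helper: `stream` must be seekable and positioned right
    after the header; `sync_marker` is the 16 bytes read from the header.
    """
    while True:
        if not stream.read(1):  # clean end of file, no more blocks
            return
        stream.seek(-1, io.SEEK_CUR)  # put the peeked byte back
        count = read_long(stream)  # number of objects in this block
        size = read_long(stream)  # byte length of the serialized objects
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("data block is truncated")
        if stream.read(SYNC_SIZE) != sync_marker:
            raise ValueError("synchronization marker mismatch")
        yield count, payload
```

Because every block states its own byte size, a reader that starts at a block boundary can jump from block to block without decoding the objects. The synchronization marker matters when a reader starts at an arbitrary offset, or when it needs to detect that a block is corrupt.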

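The AVRO file header consists of three parts, as shown in the diagram above.
  • The header begins with four bytes: the ASCII characters 'O', 'b', 'j' followed by the byte value 1. These four bytes are usually called the magic.
  • Immediately after the magic comes the AVRO meta data.
  • The header ends with a synchronization marker.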
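The three header parts listed above can be read as in the following sketch. It relies on details from the AVRO specification that the list does not spell out: the meta data is encoded as an AVRO map from string keys to byte values, and the synchronization marker is 16 bytes long. The function names `read_long`, `read_bytes`, and `read_header` are hypothetical helpers written for this illustration, not the API of any AVRO library.

```python
import io

MAGIC = b"Obj\x01"  # 'O', 'b', 'j' followed by the byte value 1
SYNC_SIZE = 16  # length of the synchronization marker in bytes


def read_long(stream):
    """Decode one Avro long: variable-length bytes, zig-zag encoded."""
    result = 0
    shift = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a long")
        result |= (chunk[0] & 0x7F) << shift
        if not (chunk[0] & 0x80):  # high bit clear: this was the last byte
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo the zig-zag mapping


def read_bytes(stream):
    """Read a length-prefixed byte string (also used for string keys)."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside a byte string")
    return data


def read_header(stream):
    """Return (meta, sync_marker) from the start of an AVRO file.

    Hypothetical helper for illustration only.
    """
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not an AVRO object container file")
    meta = {}
    while True:
        count = read_long(stream)  # number of map entries in this chunk
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a byte size follows, skip it
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync_marker = stream.read(SYNC_SIZE)
    if len(sync_marker) != SYNC_SIZE:
        raise EOFError("header is truncated")
    return meta, sync_marker
```

In a real file the meta data normally carries the `avro.schema` entry (the schema as JSON) and, optionally, `avro.codec`. The returned marker is the value that must follow every data block in the same file.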
----------------------------------Question divider------------------------------

what is “sync marker” used for in avro format

I have been struggling with the "sync marker" part in avro. The doc says it's used for splitting files. Not sure what it really means. Some questions:

1 How does it use this part to split files? Does it scan the whole file, find such parts, and split? If yes, won't it be more efficient if it just gets the size of each data block, jumps ahead to the next block, and does the same thing?

2 when the data block is compressed, is the sync-marker compressed?

3 why does it have multiple data blocks rather than putting everything into one single data block? Is it because the size of the block is of long type, which has a limit on the length it can hold?

4 the data block is logical view? all the data blocks will still be in a single file in filesystem?

Thanks for information for any point above.

link:http://stackoverflow.com/questions/27360727/what-is-sync-marker-used-for-in-avro-format