Google Protocol Buffer Study Notes

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Introduction"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Google Protocol Buffer (Protobuf for short) is Google's internal standard for exchanging data across programming languages."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Protocol Buffers is a lightweight, efficient format for storing structured data, in other words for serializing it. It is language-neutral, platform-neutral, and extensible, which makes it a good fit for communication protocols, data storage, and RPC data exchange. Its initial open-source release provided APIs for three languages: C++, Java, and Python."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Protobuf Performance"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/57/57c38131cdc5caf61004c539a65cb2ba.png","alt":null,"title":"Packing and unpacking speed comparison (source: the web)","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Advantages of Protobuf"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Protobuf is similar to XML, but smaller, faster, and simpler."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It has one particularly nice property: good backward compatibility."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Protobuf has clearer semantics and needs nothing like an XML parser, because the Protobuf compiler compiles the .proto file into data access classes that serialize and deserialize Protobuf data."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using Protobuf 
requires no complex document object model to learn; its programming model is friendly and easy to pick up."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Drawbacks of Protobuf"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Its feature set is simple, so it cannot express complex concepts."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Protobuf grew out of a tool used inside Google and still falls well short in general-purpose adoption."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Protobuf is a poor fit for modeling text-based markup documents such as HTML. It is not self-describing and cannot be read or edited directly by a person: the data is stored in binary, so without the .proto definition you cannot meaningfully read the contents of a Protobuf message."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Protobuf Encoding"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The binary messages Protobuf produces are very compact, thanks to the clever encoding scheme it uses."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before looking at the message structure, we first need to introduce a term: "},{"type":"codeinline","content":[{"type":"text","text":"Varint"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Varint is a compact way of representing a number. "},{"type":"codeinline","content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}}],"text":"It uses one or more bytes per number, and the smaller the value, the fewer bytes it uses, which reduces the space spent on numbers"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An int32, for example, normally takes 4 bytes. With Varint, a very small int32 value can be stored in 1 
byte. Everything has a downside, though: with Varint a large int32 value needs 5 bytes. Statistically, it is unlikely that every number in a message is large, so in most cases Varint represents numeric data in fewer bytes. The details of "},{"type":"codeinline","content":[{"type":"text","text":"Varint"}]},{"type":"text","text":" follow."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Varint"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The most significant bit of each byte in a Varint has a special meaning: "},{"type":"codeinline","content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}}],"text":"if it is 1, the next byte is also part of the number; if it is 0, this is the last byte"}]},{"type":"text","text":". The other 7 bits carry the number itself. Any number smaller than 128 therefore fits in a single byte. A number of 128 or more needs several bytes; 300, for example, takes two: 1010 1100 0000 0010"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows how Google Protocol Buffer parses these two bytes. Note that the two bytes are swapped before the final computation, because Google Protocol Buffer "},{"type":"codeinline","content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}}],"text":"uses little-endian"},{"type":"text","text":" byte order"}]},{"type":"text","text":". Dropping the continuation bit of each byte leaves 010 1100 and 000 0010; swapped and concatenated they give 1 0010 1100 in binary, which is 300."}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cd/cd0eae9c45efd414d38c01a9a226539c.jpeg","alt":null,"title":"Varint encoding","style":[{"key":"width","value":"50%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After serialization a message becomes a binary stream whose contents are a series of Key-Value 
pairs, as shown in the figure below:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ea/ea5f43669b1c1532afa5b392da8895fc.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With this Key-Value layout no separators are needed between fields. For an optional field, if the message does not set it, the field simply does not appear in the final message buffer. Both properties help keep the message small."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Take the message in Listing 1 as an example. Suppose we build a message Test1 as follows:"}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"Listing 1 \n\nTest1.id = 10; \nTest1.str = \"hello\";"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The final message buffer then contains two Key-Value pairs, one for the message's id and one for str."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Key identifies a specific field. When unpacking, Protocol Buffer uses the Key to decide which field of the message the corresponding Value belongs to."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Key is defined as:"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"`(field_number << 3) | 
wire_type`"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Key has two parts. The first is the field_number; for example, field id in message lm.helloworld has field_number 1. The second is the wire_type, which indicates how the Value is encoded on the wire."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Table 1. Wire Type"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| Type | Meaning | Used For |"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| ---- | ---- | ---- |"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| 0 | Varint | int32, int64, uint32, uint64, sint32, sint64, bool, enum |"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| 1 | 64-bit | fixed64, sfixed64, double |"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| 2 | Length-delimited | string, bytes, embedded messages, packed repeated fields |"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| 3 | Start group | Groups (deprecated) |"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| 4 | End group | Groups (deprecated) |"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| 5 | 32-bit | fixed32, 
sfixed32, float |"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In our example, field id has type int32, so its wire type is 0 and, with field_number 1, its Key is (1 << 3) | 0 = 0x08. Attentive readers may notice that Type 0 covers two very similar data types, "},{"type":"codeinline","content":[{"type":"text","text":"int32"}]},{"type":"text","text":" and "},{"type":"codeinline","content":[{"type":"text","text":"sint32"}]},{"type":"text","text":". Protobuf distinguishes them mainly to reduce the number of bytes after encoding."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Inside the computer, "},{"type":"text","marks":[{"type":"strong"}],"text":"a negative number is usually represented as a very large integer, because the sign bit is the most significant bit. Encoded as a plain Varint, a negative number therefore always takes the maximum length: 10 bytes, since Protobuf sign-extends a negative int32 to 64 bits"},{"type":"text","text":". For this reason Google Protocol Buffer defines the sint32 type, which uses zigzag encoding."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Zigzag"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Zigzag encoding maps signed numbers to unsigned numbers so that positive and negative values alternate, which is where the name zigzag comes from. For example, 0 maps to 0, -1 to 1, 1 to 2, -2 to 3, and so on."}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f1/f12f8fa5b2cee893fad445b9baee573b.png","alt":null,"title":"Zigzag encoding","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With zigzag encoding, a number with a small absolute value takes few bytes whether it is positive or negative, "},{"type":"text","marks":[{"type":"strong"}],"text":"which makes full use of the Varint technique"},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Other data types, such as strings, use a representation like a database varchar 
column: "},{"type":"codeinline","content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}}],"text":"a varint holds the length, and the payload follows immediately after it"}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}}],"text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From this overview of the protobuf encoding, you have probably already noticed that protobuf messages are small and well suited to network transfer. If the technical details do not hold your interest, the simple, direct comparison below should leave a stronger impression."}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/42/4227f7bb5b8a13b2827ea402d4407e9e.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Packing and Unpacking Speed"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, consider how XML is packed and unpacked. An XML parser reads the text from a file and converts it into an XML document object model. The program then reads the string of a given node from that model and finally converts the string into a variable of the required type. This process is complex, and building the document object model typically involves lexical and syntactic analysis, which is CPU-intensive."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Protobuf, by contrast, only has to read a binary sequence into the corresponding C++ structure according to the defined format. As the previous section showed, decoding a message comes down to expressions built from a few shift operations, which is very fast."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To show that this is not just an offhand claim, let us briefly walk through the code path of 
Protobuf unpacking."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Take the Reader in the code below as an example. The program first calls ParseFromIstream on msg1; this method parses the binary stream read from the file and assigns the parsed values to the corresponding data members of the helloworld class."}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"#include \"lm.helloworld.pb.h\"\n…\n// Print every field of the decoded message.\nvoid ListMsg(const lm::helloworld & msg) {\n  cout << msg.id() << endl;\n  cout << msg.str() << endl;\n}\n\nint main(int argc, char* argv[]) {\n  lm::helloworld msg1;\n\n  {\n    // Read the serialized message from disk in binary mode.\n    fstream input(\"./log\", ios::in | ios::binary);\n    if (!msg1.ParseFromIstream(&input)) {\n      cerr << \"Failed to parse message.\" << endl;\n      return -1;\n    }\n  }\n\n  ListMsg(msg1);\n  …\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This process can be pictured as follows:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7f/7f5510e14a37a2a06aa72172730c64d9.jpeg","alt":null,"title":"Unpacking flow","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The whole parse is carried out jointly by Protobuf's own framework code and the code generated by the Protobuf compiler. Protobuf provides the base classes Message and MessageLite as the common framework, while classes such as CodedInputStream and WireFormatLite decode the binary data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As shown above, Protobuf decoding needs only a few simple arithmetic operations and no complex lexical or syntactic analysis, so methods such as ReadTag() are very fast. The other classes and methods on this call path are equally simple, and interested readers can read them on their own. Compared with XML parsing, the flow above is very simple, which is the second reason for 
Protobuf's efficiency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Protobuf is a language-neutral, platform-neutral, extensible format for serializing structured data, suited to communication protocols, data storage, and similar uses. It supports multiple languages and, thanks to its simple encoding and fast decoding, is quick in practice. That same simplicity and speed make it hard for people to read, and for now its general adoption remains limited."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"References"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://blog.csdn.net/qq_21383435/article/details/81035852","title":""},"content":[{"type":"text","text":"Installing protobuf on Mac"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://juejin.im/post/5d7e40036fb9a06b122f6bbf","title":""},"content":[{"type":"text","text":"Protobuf Language Guide"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://colobu.com/2019/10/03/protobuf-ultimate-tutorial-in-go/","title":""},"content":[{"type":"text","text":"The Ultimate Protobuf Tutorial"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://www.ibm.com/developerworks/cn/linux/l-cn-gpb/index.html","title":""},"content":[{"type":"text","text":"Using Google Protocol Buffer and How It Works"}]}]},{"type":"paragraph","attrs":{"indent":8,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":8,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
"}]}]}