Protocol Buffers

1. Everybody Loves Protocol Buffers

1.1 What Are Protocol Buffers (PB)?

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages – Java, C++, or Python. You can even update your data structure without breaking deployed programs that are compiled against the “old” format. (Quoted from the official PB website.)

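For readers less comfortable with English (which is well worth improving, since the newest PB documentation is always in English), here is my own paraphrase. Protocol buffers are a method provided by Google for serializing and deserializing structured data. Their strengths are language neutrality, platform neutrality, and good extensibility, and they are used heavily inside Google for data storage, communication protocols, and similar purposes. Functionally PB resembles XML, but the serialized data is smaller, parsing is faster, and it is simpler to use. You define the structure of your data in a .proto file using the proto syntax, and the tool that ships with PB (protoc) generates the code that handles the data; with that code a program can conveniently read and write the data through a variety of data streams. At the time of writing PB supports three languages: Java, C++, and Python. PB also offers good compatibility in both directions: programs built against an old version can process data in the new format, and programs built against the new version can process data in the old format.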

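1.2 How Do You Use Protocol Buffers?

This section uses the address book example from the official Tutorial to walk through the conventional way of using PB. The less conventional uses are covered one by one in the chapters that follow.
1. Define the format of the address book messages in addressbook.proto. An AddressBook consists of repeated Person entries. A Person has two required fields, name and id, an optional email field, and repeated PhoneNumber entries. A PhoneNumber consists of a number and a type.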


message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

message AddressBook {
  repeated Person person = 1;
}

2. Use the protoc tool that ships with PB to generate the message-handling code from the .proto file.

protoc -I=$SRC_DIR --cpp_out=$DST_DIR $SRC_DIR/addressbook.proto

This generates the following two files in $DST_DIR:

addressbook.pb.h
addressbook.pb.cc

3. The program uses the generated code to read and write (serialize, deserialize) and to manipulate (get, set) messages.

// Save the address book.
fstream output(argv[1], ios::out | ios::trunc | ios::binary);
address_book.SerializeToOstream(&output);

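1.3 Why Write This Article

Most articles about PB on the web cover only what was described above, but PB can do a lot more. This article uses PB's built-in support to implement self-describing messages, dynamic messages, and the combination of the two, dynamic self-describing messages, and then offers some performance figures and recommendations. The rest of the article is aimed at readers who already have some experience with PB. If you are interested, I strongly recommend reviewing the Tutorial once more before reading on, because I use the Tutorial's AddressBook example to demonstrate the implementation of self-describing messages.
Because my knowledge is limited, this article covers C++ only. Readers using Java or Python can follow along and consult the API Reference on the official site. That should not be too hard, since it is how I worked things out myself.
To make the discussion below easier, let us first define two roles, the producer and the consumer.
Producer: creates a message, fills in its content, and serializes and saves it.
Consumer: reads the data, deserializes it to obtain the message, and uses the message.
In our examples the producer and the consumer are independent programs, and serialized messages are saved in a file. Network communication works similarly and is left to the reader.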

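2. Self-Describing Messages

2.1 Analysis

The usage described in the Tutorial requires the producer and the consumer to fix the message format (the .proto file) at compile time, so the two are tightly coupled on the message format. When the format changes, the consumer must be recompiled before it can understand the new format. Is it possible to remove this coupling so that the consumer adapts to format changes at run time? In principle, yes. The producer serializes the type information that defines the message format (derived from the .proto file) together with the message itself as one complete message. I call that complete message the wrapper message and the original message the payload message. The consumer deserializes the wrapper message, first obtains the payload message's type information, then uses it to decode the payload message, and finally works with the message through the reflection mechanism. This way the consumer only needs to know the format of the single wrapper message in order to handle any payload message format. This is also the solution given on the official PB site: Self-describing Messages.
The wrapper message is defined below. The first field holds the payload message's type information (a FileDescriptorSet is used because a message can nest other messages and a .proto file can import other .proto files). The second field is the payload message's type name as a string. The third field is the serialized payload message.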


message SelfDescribingMessage {
  // Set of .proto files which define the type.
  required FileDescriptorSet proto_files = 1;

  // Name of the message type.  Must be defined by one of the files in proto_files.
  required string type_name = 2;

  // The message data.
  required bytes message_data = 3;
}
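
To make the layout of the wrapper concrete, here is a minimal sketch, using only the C++ standard library, of how its three fields would be laid out on the wire. It assumes the standard protobuf encoding for length-delimited fields: a tag of (field_number << 3) | 2, a varint length, then the raw bytes. The helper names AppendVarint, AppendLengthDelimited, and EncodeWrapper are hypothetical and are not part of the PB API; in a real program the generated SelfDescribingMessage class does this work.

```cpp
#include <cstdint>
#include <string>

// Append v as a base-128 varint: 7 bits per byte, high bit set while more bytes follow.
void AppendVarint(std::string* out, uint64_t v) {
  while (v >= 0x80) {
    out->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<char>(v));
}

// Append one length-delimited field: tag, varint length, raw bytes.
// Wire type 2 means "length-delimited"; the tag is (field_number << 3) | 2.
void AppendLengthDelimited(std::string* out, uint32_t field_number,
                           const std::string& bytes) {
  AppendVarint(out, (static_cast<uint64_t>(field_number) << 3) | 2);
  AppendVarint(out, bytes.size());
  out->append(bytes);
}

// Hypothetical helper: lay out the three wrapper fields by hand.
std::string EncodeWrapper(const std::string& descriptor_set_bytes,
                          const std::string& type_name,
                          const std::string& payload_bytes) {
  std::string out;
  AppendLengthDelimited(&out, 1, descriptor_set_bytes);  // proto_files
  AppendLengthDelimited(&out, 2, type_name);             // type_name
  AppendLengthDelimited(&out, 3, payload_bytes);         // message_data
  return out;
}
```

For example, EncodeWrapper("AB", "T", "xyz") yields the 12 bytes 0A 02 'A' 'B' 12 01 'T' 1A 03 'x' 'y' 'z'. The consumer never needs to know the payload type at compile time, because everything it needs arrives in the first two fields.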

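2.2 Implementation

The following reworks the Tutorial example programs to show how self-describing messages can be implemented.

Producer: add_person.cc

1. When generating code with protoc, add the --descriptor_set_out option to write the type information (the content of the first field of SelfDescribingMessage) to a file, assumed here to be named desc.set. If the .proto file imports other files, also pass --include_imports so that their descriptors are included in the set.
protoc --cpp_out=. --descriptor_set_out=desc.set addressbook.proto
2. The payload message is used exactly as before.
tutorial::AddressBook address_book;
PromptForAddress(address_book.add_person());  // This function needs no changes.
3. When saving, fill the first field of SelfDescribingMessage with the contents of desc.set, the second field with the full name of AddressBook, and the third field with the serialized AddressBook. Finally, serialize the SelfDescribingMessage and save it to a file.

tutorial::SelfDescribingMessage sdmessage;

// argv[2] is the descriptor set file (desc.set) written by protoc.
fstream desc(argv[2], ios::in | ios::binary);
sdmessage.mutable_proto_files()->ParseFromIstream(&desc);
sdmessage.set_type_name(address_book.GetDescriptor()->full_name());
sdmessage.clear_message_data();
address_book.SerializeToString(sdmessage.mutable_message_data());

fstream output(argv[1], ios::out | ios::trunc | ios::binary);
sdmessage.SerializeToOstream(&output);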

Consumer: list_people.cc

list_people.cc needs to know SelfDescribingMessage at compile time but not AddressBook. At run time it can still work with AddressBook messages normally.
1. First, deserialize the SelfDescribingMessage.

tutorial::SelfDescribingMessage sdmessage;
fstream input(argv[1], ios::in | ios::binary);
sdmessage.ParseFromIstream(&input);

2. Get the FileDescriptorSet from the first field and the message type name from the second field, then use a DescriptorPool to obtain the payload message's type information, a Descriptor.

SimpleDescriptorDatabase db;
for (int i = 0; i < sdmessage.proto_files().file_size(); i++) {
  db.Add(sdmessage.proto_files().file(i));
}
DescriptorPool pool(&db);
const Descriptor *descriptor = pool.FindMessageTypeByName(sdmessage.type_name());

3. Use DynamicMessageFactory to create an empty object of this type, then deserialize the third field into it to recover the original message object.

DynamicMessageFactory factory(&pool);
Message *msg = factory.GetPrototype(descriptor)->New();
msg->ParseFromString(sdmessage.message_data());

4. Operate on the message's fields through the reflection interface of Message.

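3. Dynamic Messages

3.1 Analysis

Self-describing messages free the consumer, but what about the producer? Can the message format be decided at run time and messages be generated dynamically? In principle this is also possible. The consumer of a self-describing message reads the format information from a file, so if we build the same information at run time we get dynamic messages. The code below shows how. The content of this chapter was contributed by Jianhao (劍豪).
The dynamically generated message format is defined as follows:

message Pair {
  required string key = 1;
  required uint32 value = 2;
}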

3.2 Implementation

1. Define the message dynamically and generate the type information.

FileDescriptorProto file_proto;
file_proto.set_name("foo.proto");

// Create a dynamic message proto named "Pair".
DescriptorProto *message_proto = file_proto.add_message_type();
message_proto->set_name("Pair");

FieldDescriptorProto *field_proto = NULL;

field_proto = message_proto->add_field();
field_proto->set_name("key");
field_proto->set_type(FieldDescriptorProto::TYPE_STRING);
field_proto->set_number(1);
field_proto->set_label(FieldDescriptorProto::LABEL_REQUIRED);

field_proto = message_proto->add_field();
field_proto->set_name("value");
field_proto->set_type(FieldDescriptorProto::TYPE_UINT32);
field_proto->set_number(2);
field_proto->set_label(FieldDescriptorProto::LABEL_REQUIRED);

DescriptorPool pool;
const FileDescriptor *file_descriptor = pool.BuildFile(file_proto);
const Descriptor *descriptor = file_descriptor->FindMessageTypeByName("Pair");

2. Use DynamicMessageFactory with the type information to create an empty object of this type.

// Build a dynamic message prototype from the "Pair" descriptor.
DynamicMessageFactory factory(&pool);
const Message *message = factory.GetPrototype(descriptor);

// Create a real instance of "Pair".
Message *pair = message->New();

3. Operate on the message's fields through the reflection interface of Message.

// Write the "Pair" instance through reflection.
const Reflection *reflection = pair->GetReflection();
const FieldDescriptor *field = NULL;

field = descriptor->FindFieldByName("key");
reflection->SetString(pair, field, "my key");

field = descriptor->FindFieldByName("value");
reflection->SetUInt32(pair, field, 1234);

At this point the dynamically generated pair object contains:

key: "my key"
value: 1234

3.3 Code

The complete code is short, so here it is in full:

#include <iostream>
#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <google/protobuf/dynamic_message.h>

using namespace std;
using namespace google::protobuf;

int main(int argc, const char *argv[])
{
    FileDescriptorProto file_proto;
    file_proto.set_name("foo.proto");

    // Create a dynamic message proto named "Pair".
    DescriptorProto *message_proto = file_proto.add_message_type();
    message_proto->set_name("Pair");

    FieldDescriptorProto *field_proto = NULL;

    field_proto = message_proto->add_field();
    field_proto->set_name("key");
    field_proto->set_type(FieldDescriptorProto::TYPE_STRING);
    field_proto->set_number(1);
    field_proto->set_label(FieldDescriptorProto::LABEL_REQUIRED);

    field_proto = message_proto->add_field();
    field_proto->set_name("value");
    field_proto->set_type(FieldDescriptorProto::TYPE_UINT32);
    field_proto->set_number(2);
    field_proto->set_label(FieldDescriptorProto::LABEL_REQUIRED);

    // Build the file proto, which now contains the "Pair" message proto.
    DescriptorPool pool;
    const FileDescriptor *file_descriptor = pool.BuildFile(file_proto);
    const Descriptor *descriptor = file_descriptor->FindMessageTypeByName("Pair");
    cout << descriptor->DebugString();

    // Build a dynamic message prototype from the "Pair" descriptor.
    DynamicMessageFactory factory(&pool);
    const Message *message = factory.GetPrototype(descriptor);

    // Create a real instance of "Pair".
    Message *pair = message->New();

    // Write the "Pair" instance through reflection.
    const Reflection *reflection = pair->GetReflection();
    const FieldDescriptor *field = NULL;

    field = descriptor->FindFieldByName("key");
    reflection->SetString(pair, field, "my key");

    field = descriptor->FindFieldByName("value");
    reflection->SetUInt32(pair, field, 1234);

    cout << pair->DebugString();

    delete pair;
    return 0;
}

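3.4 Another Approach: Dynamic Compilation

The above is one way to build dynamic messages. We can also use the google::protobuf::compiler package provided by PB to compile a given .proto file at run time and use the messages it defines. Dynamic messages can then be changed by editing the .proto file, somewhat like a configuration file. The main class for this job is Importer, defined in importer.h.
The content of foo.proto is:

message Pair {
    required string key = 1;
    required uint32 value = 2;
}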

The following code implements the same dynamic message:

#include <iostream>
#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <google/protobuf/dynamic_message.h>
#include <google/protobuf/compiler/importer.h>

using namespace std;
using namespace google::protobuf;
using namespace google::protobuf::compiler;

int main(int argc, const char *argv[])
{
    DiskSourceTree sourceTree;
    // Look up .proto files in the current directory.
    sourceTree.MapPath("", "./");

    Importer importer(&sourceTree, NULL);
    // Compile foo.proto at run time; Import returns NULL on failure.
    if (importer.Import("foo.proto") == NULL) {
        cerr << "Failed to import foo.proto" << endl;
        return 1;
    }

    const Descriptor *descriptor = importer.pool()->FindMessageTypeByName("Pair");
    cout << descriptor->DebugString();

    // Build a dynamic message prototype from the "Pair" descriptor.
    DynamicMessageFactory factory;
    const Message *message = factory.GetPrototype(descriptor);

    // Create a real instance of "Pair".
    Message *pair = message->New();

    // Write the "Pair" instance through reflection.
    const Reflection *reflection = pair->GetReflection();
    const FieldDescriptor *field = NULL;

    field = descriptor->FindFieldByName("key");
    reflection->SetString(pair, field, "my key");

    field = descriptor->FindFieldByName("value");
    reflection->SetUInt32(pair, field, 1111);

    cout << pair->DebugString();

    delete pair;
    return 0;
}

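4. Dynamic Self-Describing Messages

4.1 Analysis

So far we can free the consumer with self-describing messages and free the producer with dynamic messages. The final and most powerful tool is the combination of the two: dynamic self-describing messages, which free both the producer and the consumer.
We again use the message from above as the example:

message Pair {
  required string key = 1;
  required uint32 value = 2;
}

This time we do not use the wrapper message approach from Chapter 2. Instead, self-description is achieved through an agreed file format, and a network protocol can follow the same approach.
The producer and the consumer agree on the following file layout, in order: MAGIC_NUM, the size of the serialized FileDescriptorProto, the serialized FileDescriptorProto, the size of the serialized payload message, and the serialized payload message.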

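4.2 Implementation

Producer

1. Define the message dynamically and generate the type information; create an empty message object from the type information; set the message's fields through Message reflection. This is identical to the dynamic message handling above and is not repeated here.
2. Write the file with CodedOutputStream, saving the following in order:
a) MAGIC_NUM, which the consumer can use to check that the file format matches and is not corrupted.
b) The size of the serialized FileDescriptorProto.
c) The serialized FileDescriptorProto data.
d) The size of the serialized payload message.
e) The serialized payload message data.
The code is as follows:

const unsigned int MAGIC_NUM = 2988;

// O_TRUNC discards any previous content so that no stale bytes are left behind.
int fd = open("dpb.msg", O_WRONLY | O_CREAT | O_TRUNC, 0666);
ZeroCopyOutputStream *raw_output = new FileOutputStream(fd);
CodedOutputStream *coded_output = new CodedOutputStream(raw_output);

coded_output->WriteLittleEndian32(MAGIC_NUM);

// Type information: size, then the serialized FileDescriptorProto.
string data;
file_proto.SerializeToString(&data);
coded_output->WriteVarint32(data.size());
coded_output->WriteString(data);

// Payload: size, then the serialized message.
data.clear();
pair->SerializeToString(&data);
coded_output->WriteVarint32(data.size());
coded_output->WriteString(data);

// Delete the coded stream before the underlying stream it writes to.
delete coded_output;
delete raw_output;
close(fd);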
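
The layout written by the producer can also be sketched without the PB library, which may help when checking a file by hand or producing the same format elsewhere. This is only an illustration: it assumes the magic number is stored as four bytes, least significant first, and that each size is a base-128 varint, matching what WriteLittleEndian32 and WriteVarint32 emit. BuildFrame and the other helper names are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Append a 32-bit value as four bytes, least significant byte first.
void AppendLittleEndian32(std::string* out, uint32_t v) {
  for (int i = 0; i < 4; ++i) {
    out->push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
  }
}

// Append a 32-bit value as a base-128 varint (1 to 5 bytes).
void AppendVarint32(std::string* out, uint32_t v) {
  while (v >= 0x80) {
    out->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<char>(v));
}

// Hypothetical helper: magic, then size + descriptor bytes, then size + payload bytes.
std::string BuildFrame(uint32_t magic, const std::string& descriptor_bytes,
                       const std::string& payload_bytes) {
  std::string out;
  AppendLittleEndian32(&out, magic);
  AppendVarint32(&out, static_cast<uint32_t>(descriptor_bytes.size()));
  out.append(descriptor_bytes);
  AppendVarint32(&out, static_cast<uint32_t>(payload_bytes.size()));
  out.append(payload_bytes);
  return out;
}
```

With the magic number 2988 (0x0BAC), BuildFrame(2988, "D", "PQ") produces the nine bytes AC 0B 00 00 01 'D' 02 'P' 'Q'. A block longer than 127 bytes needs a second length byte, so a 300-byte descriptor is prefixed with AC 02.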

Consumer

1. Read the file with CodedInputStream. First use MAGIC_NUM to check that the file format is correct, then deserialize the FileDescriptorProto to obtain the payload message's type information.

// MAGIC_NUM must match the value used by the producer (2988).
FileDescriptorProto file_proto;

int fd = open("dpb.msg", O_RDONLY);
ZeroCopyInputStream *raw_input = new FileInputStream(fd);
CodedInputStream *coded_input = new CodedInputStream(raw_input);

uint32_t magic_number;
coded_input->ReadLittleEndian32(&magic_number);
if (magic_number != MAGIC_NUM) {
    cerr << "File not in expected format." << endl;
    return 1;
}

uint32_t size;
coded_input->ReadVarint32(&size);
char *text = new char[size];
coded_input->ReadRaw(text, size);
// Parse with an explicit length: serialized data may contain zero bytes,
// so it must not be treated as a NUL-terminated C string.
file_proto.ParseFromArray(text, size);
delete[] text;

DescriptorPool pool;
const FileDescriptor *file_descriptor = pool.BuildFile(file_proto);
const Descriptor *descriptor = file_descriptor->FindMessageTypeByName("Pair");

2. Use DynamicMessageFactory to create an empty object of this type, then read the message data from the file and deserialize it to recover the original message.

DynamicMessageFactory factory(&pool);
const Message *message = factory.GetPrototype(descriptor);

// Create a real instance of "Pair".
Message *pair = message->New();

coded_input->ReadVarint32(&size);
text = new char[size];
coded_input->ReadRaw(text, size);
pair->ParseFromArray(text, size);
delete[] text;

3. The message's fields can now be operated on through the reflection interface of Message.
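
The reading side of the agreed layout can be sketched in the same library-free way. The version below is an assumption-laden illustration rather than the article's implementation: it works on an in-memory buffer, checks bounds on every read, and keeps each block in a std::string with an explicit length so that zero bytes inside serialized data are preserved. ParseFrame and its helpers are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Read four bytes, least significant first. *pos never exceeds buf.size().
bool ReadLittleEndian32(const std::string& buf, std::size_t* pos, uint32_t* out) {
  if (buf.size() - *pos < 4) return false;
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) {
    v |= static_cast<uint32_t>(static_cast<unsigned char>(buf[*pos + i])) << (8 * i);
  }
  *pos += 4;
  *out = v;
  return true;
}

// Read a base-128 varint of at most five bytes.
bool ReadVarint32(const std::string& buf, std::size_t* pos, uint32_t* out) {
  uint32_t v = 0;
  for (int shift = 0; shift < 35; shift += 7) {
    if (*pos >= buf.size()) return false;
    unsigned char b = static_cast<unsigned char>(buf[(*pos)++]);
    v |= static_cast<uint32_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) {
      *out = v;
      return true;
    }
  }
  return false;  // continuation bit still set after five bytes: malformed
}

// Read one "size + data" block.
bool ReadBlock(const std::string& buf, std::size_t* pos, std::string* out) {
  uint32_t size = 0;
  if (!ReadVarint32(buf, pos, &size)) return false;
  if (buf.size() - *pos < size) return false;  // truncated block
  out->assign(buf, *pos, size);
  *pos += size;
  return true;
}

// Hypothetical helper: validate the magic number, then extract both blocks.
bool ParseFrame(const std::string& buf, uint32_t expected_magic,
                std::string* descriptor_bytes, std::string* payload_bytes) {
  std::size_t pos = 0;
  uint32_t magic = 0;
  if (!ReadLittleEndian32(buf, &pos, &magic) || magic != expected_magic) return false;
  return ReadBlock(buf, &pos, descriptor_bytes) && ReadBlock(buf, &pos, payload_bytes);
}
```

In a real consumer the two extracted strings would then be handed to FileDescriptorProto::ParseFromString and to the dynamic message's ParseFromString, as in the steps above.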

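5. There Is No Free Lunch

The flexibility gained from self-description and dynamic generation is not free. Below we use the examples from this article to analyze how dynamic self-describing messages compare with static messages in space and time.
1. Space: since PB is mainly used for data storage and communication protocols, the two are analyzed separately.

For the data storage scenario, take the AddressBook from the Tutorial and add the following two records:

Person ID: 1
  Name: Peter
  E-mail address: [email protected]
  Home phone #: 13777777777
  Work phone #: 13788888888
  Mobile phone #: 13799999999
Person ID: 2
  Name: Tom
  E-mail address: [email protected]
  Mobile phone #: 13888888888

Usage	Content	Bytes
Static message	AddressBook	120
Self-describing message (Chapter 2)	FileDescriptorSet (3+302), type_name (2+20), message_data (2+120)	449

Note that although the data volume appears to grow by 274%, the increase is an essentially fixed 329 bytes (only the length prefix of message_data grows by a byte or two for larger payloads), so this overhead does not grow as the file gets larger.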
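
The arithmetic behind the table can be checked with a small sketch. It assumes the usual protobuf encoding of a length-delimited field with a field number below 16: one tag byte, a varint length prefix, then the content. The function names are hypothetical, and the content sizes (302, 20, 120) are taken from the table above.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes a value occupies as a base-128 varint.
std::size_t VarintSize(uint64_t v) {
  std::size_t n = 1;
  while (v >= 0x80) {
    v >>= 7;
    ++n;
  }
  return n;
}

// One length-delimited field with a one-byte tag: tag + length prefix + content.
std::size_t LengthDelimitedFieldSize(std::size_t content_size) {
  return 1 + VarintSize(content_size) + content_size;
}

// Total size of the wrapper message for the given content sizes of its three fields.
std::size_t WrapperSize(std::size_t descriptor_set_size, std::size_t type_name_size,
                        std::size_t payload_size) {
  return LengthDelimitedFieldSize(descriptor_set_size) +
         LengthDelimitedFieldSize(type_name_size) +
         LengthDelimitedFieldSize(payload_size);
}

// Bytes added on top of the raw payload.
std::size_t WrapperOverhead(std::size_t descriptor_set_size, std::size_t type_name_size,
                            std::size_t payload_size) {
  return WrapperSize(descriptor_set_size, type_name_size, payload_size) - payload_size;
}
```

WrapperSize(302, 20, 120) gives 449 and WrapperOverhead(302, 20, 120) gives 329, matching the table. For a payload of 120000 bytes the overhead is 331 bytes, which is about 0.3% instead of 274%.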

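For the communication protocol scenario, take the dynamic self-describing message from Chapter 4.

The pair message contains:

key: "jianhao"
value: 8888

Usage	Content	Bytes
Static message	Pair	12
Dynamic self-describing message	MAGIC_NUM, FileDescriptorProto length, FileDescriptorProto, message length, Pair	64

Note: in network communication the complete type information has to be sent with every exchange, so the larger the message, the better the trade-off.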
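
The 12 bytes of the static Pair message can be reproduced by hand. The sketch below assumes the standard protobuf wire format, where each field starts with a tag of (field_number << 3) | wire_type, strings are length-delimited (wire type 2), and uint32 values are varints (wire type 0). EncodePair is a hypothetical helper, not generated code.

```cpp
#include <cstdint>
#include <string>

// Append v as a base-128 varint: 7 bits per byte, high bit set while more bytes follow.
void AppendVarint(std::string* out, uint64_t v) {
  while (v >= 0x80) {
    out->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<char>(v));
}

// Encode Pair { key = 1 (string), value = 2 (uint32) } by hand.
std::string EncodePair(const std::string& key, uint32_t value) {
  std::string out;
  out.push_back(0x0A);  // tag: field 1, wire type 2 (length-delimited)
  AppendVarint(&out, key.size());
  out.append(key);
  out.push_back(0x10);  // tag: field 2, wire type 0 (varint)
  AppendVarint(&out, value);
  return out;
}
```

EncodePair("jianhao", 8888) yields 0A 07 "jianhao" 10 B8 45: two bytes of tag and length, seven bytes of key, one tag byte, and a two-byte varint for 8888, for 12 bytes in total.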
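2. Time: the tests below compare the efficiency of static messages and dynamic self-describing messages in everyday usage.
The message type used in the tests is:

message Pair {
    required string key = 1;
    required uint32 value = 2;
}

Producer:
Static message usage:

pair.set_key("my key");
pair.set_value(i);
pair.SerializeToArray(buffer, 100);

Dynamic message usage:

const Reflection *reflection = pair->GetReflection();
const FieldDescriptor *field = NULL;

field = descriptor->FindFieldByName("key");
reflection->SetString(pair, field, "my key");

field = descriptor->FindFieldByName("value");
reflection->SetUInt32(pair, field, i);

pair->SerializeToArray(buffer, 100);

Usage	Time for 1M iterations	Time for 10M iterations
Static message	0.37s	3.64s
Dynamic message	1.65s	16.51s

Absolute times depend on the machine, so the relative values are more meaningful. The tests above show that setting fields and serializing a dynamic message takes roughly 4.5 times as long as for a static message (1.65/0.37 ≈ 4.5 and 16.51/3.64 ≈ 4.5).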

Consumer:
Static message usage:

pair.ParseFromArray(buffer, 100);
key = pair.key();
value = pair.value() + i;

Dynamic self-describing messages can be used in two ways:
1. Deserialize and operate on the payload message only, which is common in data storage.

pair->ParseFromArray(buffer, 100);

const Reflection *reflection = pair->GetReflection();
const FieldDescriptor *field = NULL;

field = descriptor->FindFieldByName("key");
key = reflection->GetString(*pair, field);

field = descriptor->FindFieldByName("value");
value = reflection->GetUInt32(*pair, field) + i;

2. First deserialize the payload message's type information, then dynamically create an empty object of that type, then deserialize and operate on that object, which is common in communication protocols.

FileDescriptorProto file_proto;
file_proto.ParseFromArray(descbuffer, 300);

DescriptorPool pool;
const FileDescriptor *file_descriptor = pool.BuildFile(file_proto);
const Descriptor *descriptor = file_descriptor->FindMessageTypeByName("Pair");

// Build a dynamic message prototype from the "Pair" descriptor.
DynamicMessageFactory factory;
const Message *message = factory.GetPrototype(descriptor);
Message *pair = message->New();

pair->ParseFromArray(buffer, 100);

const Reflection *reflection = pair->GetReflection();
const FieldDescriptor *field = NULL;

field = descriptor->FindFieldByName("key");
key = reflection->GetString(*pair, field);

field = descriptor->FindFieldByName("value");
value = reflection->GetUInt32(*pair, field) + i;

Usage	Time for 1M iterations	Time for 10M iterations
Static message	0.48s	4.85s
Dynamic self-describing message (storage style)	2.01s	17.28s
Dynamic self-describing message (communication style)	28.24s	283.98s

The tests above show that deserializing and accessing a dynamic self-describing message takes roughly 4 times as long as for a static message (2.01/0.48 ≈ 4.2 and 17.28/4.85 ≈ 3.6). If deserializing the type information is included as well, performance drops sharply, to nearly 60 times the cost of a static message (28.24/0.48 ≈ 59 and 283.98/4.85 ≈ 59).

6. References

https://developers.google.com/protocol-buffers/?hl=zh-CN

http://www.ibm.com/developerworks/cn/linux/l-cn-gpb/?ca=drs-tp4608

"玩轉Protocol Buffers" (Playing with Protocol Buffers)