Analysis of the lightLDA dump_binary format

Original comment:
/*
* Output file format:
* 1, the first 4 byte indicates the number of docs in this block
* 2, the 4 * (doc_num + 1) bytes indicate the offset of reach doc
* an example
* 3 // there are 3 docs in this block
* 0 // the offset of the 1-st doc
* 10 // the offset of the 2-nd doc, with this we know the length of the 1-st doc is 5 = 10/2
* 16 // the offset of the 3-rd doc, with this we know the length of the 2-nd doc is 3 = (16-10)/2
* 24 // with this, we know the length of the 3-rd doc is 4 = (24 - 16)/2
* w11 t11 w12 t12 w13 t13 w14 t14 w15 t15 // the token-topic list of the 1-st doc
* w21 t21 w22 t22 w23 t23 // the token-topic list of the 2-nd doc
* w31 t31 w32 t32 w33 t33 w34 t34 // the token-topic list of the 3-rd doc

 * the class block_stream helps generate such binary format file, usage:
 * int doc_num = 3;
 * int64_t* offset_buf = new int64_t[doc_num + 1];
 *
 * block_stream bs;
 * bs.open("block");
 * bs.write_empty_header(offset_buf, doc_num);
 * ...
 * // update offset_buf and doc_num...

 * bs.write_doc(doc_buf, doc_idx);
 * ...
 * bs.write_real_header(offset_buf, doc_num);
 * bs.close();
 */
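
The offset arithmetic in the comment's example can be sketched as follows. This is a minimal illustration, not lightLDA code: doc_lengths is a hypothetical helper, and it assumes, as the comment does, that offsets count 32-bit values and that each token occupies two of them (word id, topic).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of lightLDA): derive per-document token
// counts from an offset table laid out as in the comment, where offsets[i]
// is the start of document i and offsets.back() is the end of the last one.
std::vector<int64_t> doc_lengths(const std::vector<int64_t>& offsets) {
    std::vector<int64_t> lengths;
    for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
        const int64_t span = offsets[i + 1] - offsets[i];
        if (span < 0 || span % 2 != 0) {
            throw std::runtime_error("offsets do not describe word/topic pairs");
        }
        lengths.push_back(span / 2);  // each token takes two values
    }
    return lengths;
}
```

For the example offsets 0, 10, 16, 24 this yields the lengths 5, 3 and 4 given in the comment.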

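Analysis:
With two documents in the block, the file contents look like this (the values appear to be printed as 32-bit integers):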
2, 0, 0, 0, 59, 0, 130, 0, 0, 2270, 0, 2865, 0, 6357, 0, 7962, 0, 8110, 0, 8627, 0, 8760, 0, 8934, 0, 9104, 0, 9723, 0, 11089, 0, 11766, 0, 12608, 0, 12750, 0, 14119, 0, 17061, 0, 27641, 0, 45843, 0, 54110, 0, 66203, 0, 145784, 0, 187091, 0, 187631, 0, 187631, 0, 189015, 0, 189015, 0, 189015, 0, 1566513, 0, 3683883, 0, 0, 2270, 0, 2865, 0, 6357, 0, 6357, 0, 7962, 0, 8110, 0, 8627, 0, 8760, 0, 8934, 0, 9104, 0, 9723, 0, 11089, 0, 11766, 0, 12608, 0, 12750, 0, 14119, 0, 17061, 0, 27641, 0, 45843, 0, 54110, 0, 66203, 0, 145784, 0, 187091, 0, 187631, 0, 187631, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 1566513, 0, 3683883, 0

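Note that every number is followed by a 0. For the header values this is consistent with 64-bit fields (the usage above declares offset_buf as int64_t) printed as two 32-bit values. In the document bodies, the 0 sits where the comment's layout puts the topic of each word/topic pair.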
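
One plausible reading of the 0 that follows each header value is that the header fields are 64 bits wide and the dump prints them as two 32-bit words. The sketch below shows what a small 64-bit value looks like under that reading. The function name little_endian_words is made up for illustration, and little-endian byte order in the file is an assumption.

```cpp
#include <cstdint>
#include <utility>

// Split a 64-bit value into the two 32-bit words a little-endian file
// would store, in file order: low word first, then high word.
std::pair<uint32_t, uint32_t> little_endian_words(int64_t value) {
    const uint64_t bits = static_cast<uint64_t>(value);
    return {static_cast<uint32_t>(bits & 0xFFFFFFFFu),
            static_cast<uint32_t>(bits >> 32)};
}
```

Any value below 2^32, such as the offsets 59 and 130, comes out as the value itself followed by 0.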

2, 0, ------------- 2 documents
0, 0, 59, 0, 130, 0, ----------- document 1 spans offsets 0 to 59, document 2 spans offsets 59 to 130
0, ------------- each document starts with a single 0
2270, 0, 2865, 0, 6357, 0, 7962, 0, 8110, 0, 8627, 0, 8760, 0, 8934, 0, 9104, 0, 9723, 0, 11089, 0, 11766, 0, 12608, 0, 12750, 0, 14119, 0, 17061, 0, 27641, 0, 45843, 0, 54110, 0, 66203, 0, 145784, 0, 187091, 0, 187631, 0, 187631, 0, 189015, 0, 189015, 0, 189015, 0, 1566513, 0, 3683883, 0, --------- each word id in the document is followed by a 0; these 29 pairs plus the leading 0 make up the 59 values of document 1
0, ------------- the second document starts with a 0 as well
2270, 0, 2865, 0, 6357, 0, 6357, 0, 7962, 0, 8110, 0, 8627, 0, 8760, 0, 8934, 0, 9104, 0, 9723, 0, 11089, 0, 11766, 0, 12608, 0, 12750, 0, 14119, 0, 17061, 0, 27641, 0, 45843, 0, 54110, 0, 66203, 0, 145784, 0, 187091, 0, 187631, 0, 187631, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 189015, 0, 1566513, 0, 3683883, 0
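
The layout read off above can be turned into a small parser. This is a sketch under the assumptions of this analysis, not the lightLDA reader: the names Doc, read_i64 and parse_dump are made up for illustration, the input is the dump already split into 32-bit values, header fields are taken to be 64-bit with the low word first, offsets count 32-bit values from the end of the header, and the leading value of each document is kept as an opaque number because the comment does not say what it is.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

struct Doc {
    int32_t leading;                                  // the single value that opens each document
    std::vector<std::pair<int32_t, int32_t>> tokens;  // (word id, value that follows it)
};

// Combine two consecutive 32-bit values (low word first) into one 64-bit value.
static int64_t read_i64(const std::vector<int32_t>& v, std::size_t pos) {
    if (pos + 1 >= v.size()) throw std::runtime_error("truncated header");
    const uint64_t lo = static_cast<uint32_t>(v[pos]);
    const uint64_t hi = static_cast<uint32_t>(v[pos + 1]);
    return static_cast<int64_t>((hi << 32) | lo);
}

std::vector<Doc> parse_dump(const std::vector<int32_t>& v) {
    const int64_t doc_num = read_i64(v, 0);
    if (doc_num < 0) throw std::runtime_error("negative document count");
    // Header: one count plus (doc_num + 1) offsets, each two 32-bit values wide.
    const std::size_t body = 2 * (static_cast<std::size_t>(doc_num) + 2);
    std::vector<Doc> docs;
    for (int64_t d = 0; d < doc_num; ++d) {
        const int64_t begin = read_i64(v, static_cast<std::size_t>(2 * (d + 1)));
        const int64_t end = read_i64(v, static_cast<std::size_t>(2 * (d + 2)));
        // A document is one leading value plus whole (word, value) pairs.
        if (begin < 0 || end <= begin || (end - begin) % 2 != 1 ||
            body + static_cast<std::size_t>(end) > v.size()) {
            throw std::runtime_error("inconsistent offsets");
        }
        Doc doc;
        doc.leading = v[body + begin];
        for (int64_t i = begin + 1; i < end; i += 2) {
            doc.tokens.emplace_back(v[body + i], v[body + i + 1]);
        }
        docs.push_back(std::move(doc));
    }
    return docs;
}
```

With the offsets 0, 59 and 130 from the dump above, the same arithmetic gives (59 - 1) / 2 = 29 pairs for the first document and (130 - 59 - 1) / 2 = 35 pairs for the second.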
