Boost tokenizer pitfalls

Original

2018-09-01 14:00

Today I found a spot where boost tokenizer is easy to misuse, so I'm writing it down.

#include <iostream>
#include <string>
#include <vector>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main() {
    // Control experiment with a vector iterator (see the text below):
//    vector<string> str_vec;
//    str_vec.push_back("hello");
//    str_vec.push_back("world");
//    auto itr = str_vec.begin();
//    const string& str = *itr;
//    cout << str << endl;
//    ++itr;
//    cout << str << endl;

    string src("hello world");
    tokenizer<> outer_tokens(src);
    tokenizer<>::iterator outer_tok_itr = outer_tokens.begin();
    const string& str_1 = *outer_tok_itr;  // held by reference, not copied
    cout << str_1 << endl;
    ++outer_tok_itr;
    cout << str_1 << endl;  // same reference, printed again after the increment
}

Output:

hello
world

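The problem shows up when the token is held by reference:

const string& str_1 = *outer_tok_itr;

The ++outer_tok_itr operation makes the two prints of str_1 produce different results.

That nearly scared me to tears, so I tried the same thing with a vector iterator to calm down (the commented-out part). Both prints gave hello, which shows my understanding of iterators wasn't completely off.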

    string src("hello world");
    tokenizer<> outer_tokens(src);
    tokenizer<>::iterator outer_tok_itr = outer_tokens.begin();
    const string& str_1 = *outer_tok_itr;
    cout << str_1 << endl;
//    ++outer_tok_itr;
    tokenizer<>::iterator new_itr = outer_tok_itr;  // copy taken before advancing
    std::advance(outer_tok_itr, 1);
    cout << str_1 << endl;

    cout << *new_itr << endl;
    cout << *outer_tok_itr << endl;

Output:

hello
world
hello
world

new_itr does print the expected result.

Noting this pitfall for now; I'll dig into the cause when I have time :(

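Update:

Today I ran into another problem, so I'm recording it too.

In production I found a tokenizer iterator that had clearly gone past the end, yet the code was still dereferencing it. That should be an obvious bug, but production was running fine. How could that be?

I wrote a simple test program and found that dereferencing an out-of-range iterator did fail.

I checked several times that the test code closely matched the production code, compiled it on the same dev machine, and ran it on the same machine. Everything was the same, so why did they behave differently?

Then it occurred to me that the difference was in the compiler flags: I built the test program with plain g++ and default options, whereas our project is built with the blade build tool.

After going through the settings, I found that blade defines NDEBUG by default, which turns off assertions.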

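OK, that explains why production didn't report an error. But dereferencing an out-of-range iterator still shouldn't print a correct-looking result, should it?

Normally you'd expect a segmentation fault (I tried dereferencing an out-of-range vector iterator, and it did segfault).

Strictly speaking, once the bug is located there's no need to care how the program behaves in this abnormal case; undefined behavior can show up in any way.

Still, I wanted to understand as much as I could...

This is where boost tokenizer is peculiar.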

typedef boost::tokenizer<boost::char_separator<char> > Tokenizer;
boost::char_separator<char> outer_sep(" ");
string src("hello world hahah");
Tokenizer outer_tokens(src, outer_sep);
Tokenizer::iterator outer_tok_itr = outer_tokens.begin();
while(outer_tok_itr != outer_tokens.end()) {
  cout << "not end yet" << endl;
  ++outer_tok_itr;
}
cout << *outer_tok_itr << endl;                         // output hahah
cout << *outer_tokens.end() << endl;                    // output ""(nothing)

++outer_tok_itr;
cout << *outer_tok_itr << endl;                         // output hahah

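(The code above was compiled with NDEBUG defined; otherwise it definitely hits an assert.)

When the while loop exits, itr already compares equal to end(), yet dereferencing the two gives different results.

The output is shown in the comments. Also, even after incrementing itr again, it still prints hahah.

This is actually fairly easy to understand. Looking at the source, when the increment runs the TokenizerFunc (char_separator here), it checks whether the current position has already reached end().

As a result, the iterator itself does not actually advance.

To really understand it, I still need to work out exactly how this step is done...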
