Boost Library Basics: Strings and Text Processing (tokenizer)

Original

2020-06-16 06:07

tokenizer

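The tokenizer library is a string-processing library dedicated to splitting text into tokens. It resembles the split algorithms in the string_algo library but offers more variations. To use it, include the following header: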

#include <boost/tokenizer.hpp>
using namespace boost;

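The tokenizer class template is the core of the tokenizer library. It takes three template type parameters:

TokenizerFunc: a function object specific to the tokenizer library that performs the splitting; the default splits on whitespace and punctuation;
Iterator: the iterator type of the character sequence;
Type: the type used to store each resulting token;

All three parameters have defaults. In practice only the first two are varied; the third is normally either std::string or std::wstring.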

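The tokenizer constructor takes the string to be tokenized, either as an iterator range or as a container that provides begin() and end() member functions.

The assign() function points the tokenizer at a new string, so that the same tokenizer object can be reused.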

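Usage:

#include <cstdio>
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

using namespace boost;
using namespace std;

int main()
{
	string str("Link raise the master-sword.");

	// Create the tokenizer with the default template arguments
	tokenizer<> tok(str);

	// Iterate over the tokens
	for (const auto& x : tok)
	{
		cout << "[" << x << "]";
	}
	cout << endl;

	getchar();	// wait for a key press so the console window stays open
	return 0;
}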

Output:

[Link][raise][the][master][sword]

By default, tokenizer treats all whitespace and punctuation characters as delimiters.

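Tokenizer function objects

The first template parameter of tokenizer, TokenizerFunc, is a function object that determines how the input is split into tokens. The following types are available: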

char_separator

Constructor declaration:

char_separator(const Char* dropped_delims,
               const Char* kept_delims = 0,
               empty_token_policy empty_tokens = drop_empty_tokens)

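Usage:

#include <cstdio>
#include <cstring>
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

using namespace boost;
using namespace std;

// Print every token in brackets on one line
template<typename T>
void print(T &tok)
{
	for (const auto& x : tok)
	{
		cout << "[" << x << "]";
	}
	cout << endl;
}

int main()
{
	char str[] = "Link  ; ; <master-sword> zelda";

	// Default-constructed separator: whitespace is dropped, punctuation is kept as tokens
	char_separator<char> sep;

	// Tokenize a char array through a pair of char* iterators
	tokenizer<char_separator<char>, char *> tok(str, str + strlen(str), sep);
	print(tok);

	// Re-tokenize: drop space, ';' and '-', keep '<' and '>' as tokens
	tok.assign(str, str + strlen(str), char_separator<char>(" ;-", "<>"));
	print(tok);

	// Re-tokenize: drop every delimiter but keep the empty tokens between them
	tok.assign(str, str + strlen(str), char_separator<char>(" ;-<>", "", keep_empty_tokens));
	print(tok);

	getchar();	// wait for a key press so the console window stays open
	return 0;
}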

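Output:

[Link][;][;][<][master][-][sword][>][zelda]
[Link][<][master][sword][>][zelda]
[Link][][][][][][][master][sword][][zelda]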

escaped_list_separator

A tokenizer function object designed for the CSV (comma-separated values) format. Its constructor is declared as follows:

escaped_list_separator(Char e = '\\', Char c = ',', Char q = '\"')

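Usage:

#include <cstdio>
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

using namespace boost;
using namespace std;

// Print every token in brackets on one line
template<typename T>
void print(T &tok)
{
	for (const auto& x : tok)
	{
		cout << "[" << x << "]";
	}
	cout << endl;
}

int main()
{
	// A line in CSV format
	string str = "id,100,name,\"mario\"";

	// Tokenizer function object with the default escape, separator and quote characters
	escaped_list_separator<char> sep;

	tokenizer<escaped_list_separator<char>> tok(str, sep);
	print(tok);

	getchar();	// wait for a key press so the console window stays open
	return 0;
}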

Output:

[id][100][name][mario]

offset_separator

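offset_separator splits by offsets rather than by delimiter characters, which is useful for text that uses fixed field widths instead of delimiters. Its constructor is declared as follows:

offset_separator(Iter begin, Iter end, bool wrap_offsets = true,
                 bool return_partial_last = true)

The constructor takes two iterators (plain array pointers also work), begin and end, that delimit a sequence of integer offsets. Each element of the sequence is the width of one field.

wrap_offsets: whether to start again from the first offset and keep tokenizing once all offsets have been used.

return_partial_last: whether to return the final token when the input ends before the field is full.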

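Usage:

#include <cstdio>
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

using namespace boost;
using namespace std;

// Print every token in brackets on one line
template<typename T>
void print(T &tok)
{
	for (const auto& x : tok)
	{
		cout << "[" << x << "]";
	}
	cout << endl;
}

int main()
{
	string str = "2233344445";

	// Field widths: a group of 2, a group of 3, a group of 4
	int offsets[] = { 2,3,4 };

	// Wrap around the offsets, do not return a partial last token
	offset_separator sep(offsets, offsets + 3, true, false);
	tokenizer<offset_separator> tok(str, sep);
	print(tok);

	// Do not wrap: stop once the three offsets have been used
	tok.assign(str, offset_separator(offsets, offsets + 3, false));
	print(tok);

	// Longer input, wrapping again
	str += "56667";
	tok.assign(str, offset_separator(offsets, offsets + 3, true, false));
	print(tok);

	getchar();	// wait for a key press so the console window stays open
	return 0;
}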

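Output: all three lines start with [22][333][4444]. The third line continues with [55][666] because the offsets wrap around, and the trailing 7 is dropped since it cannot fill a 4-character field.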
