xsplit PHP擴展

xsplit是一個PHP擴展，提供基於MMSEG算法的分詞功能。目前只在linux下測試並部署過，希望有朋友可以幫忙編譯提供windows下的dll。

xsplit只處理UTF8編碼格式，如果是其他編碼格式，請在使用前自行轉換

xsplit主要有以下幾個函數：

bool xs_build ( array $words, string $dict_file )

resource xs_open (string $dict_file [, bool $persistent])

array xs_split ( string $text [, int $split_method = 1 ［, resource $dictionary_identifier ］ ] )

mixed xs_search ( string $text ［, int $search_method [, resource $dictionary_identifer ] ］ )

string xs_simhash( string $text [, bool $isBinary] )

int xs_hdist( string $simhash1, string $simhash2 )

安裝過程與一般的PHP擴展安裝一樣

$phpize

$./configure --with-php-config=/path/to/php-config

$make

$make install

php.ini中可以設置以下參數：

xsplit.allow_persisten = On

xsplit.max_dicts = 5

xsplit.max_persistent = 3

xsplit.default_dict_file = /home/xdict

xsplit.allow_persistent 是否允許加載持久詞典

xsplit.max_dicts 允許同時打開的最大詞典數目

xsplit.max_persistent 允許同時打開的最大持久詞典數目

xsplit.default_dict_file 默認的詞典，沒有指定詞典時會調用此詞典

源碼中有一個utils目錄，包含

make_dict.php 提供命令行方式創建詞典

xsplit.php 一個簡單的示例文件

xdict_example.txt 一個文本詞庫的格式示例

make_dict.php的使用例子如下：

$php make_dict.php ./xdict_example.txt ./xdict.db

文本詞庫的格式請參考xdict_example.txt

bool xs_build (array $words, string $dict_file)

從$words數組建立名稱爲$dict_file的詞典，若成功則返回true。$words數組的格式請參考示例，key爲詞語，value爲詞頻。

例子如下：

<?php

$dict_file='dict.db';

$dwords['美麗']=100;

$dwords['蝴蝶']=100;

$dwords['永遠']=100;

$dwords['心中']=100;

$dwords['翩翩']=100;

$dwords['飛舞']=100;

$dwords['翩翩飛舞']=10;

if(!xs_build($dwords, $dict_file)) {

die('建立詞典失敗！');

}

resource xs_open (string $dict_file [, bool $persistent])

打開一個詞典文件，並返回一個resource類型的identifier。$persistent可以指定是否是持久化詞典，持久化詞典在這裏可以理解爲詞典資源生命週期的不同，一般情況下$persistent=true或者默認缺省即可。在進行分詞的時候，可以指定不同的詞典。

$dict_file_1 = 'xdcit.db';

$dict_file_2 = 'mydict.db';

$dict1 = xs_open($dict_file);

xs_open($dict_file);

array xs_split ( string $text [, int $split_method = 1 ［, resource $dictionary_identifier ］ ] )

對文本進行分詞，可以指定分詞方法和詞典。分詞方法目前有兩種，一個是MMSEG算法（默認），一個是正向最大匹配，分別用常量XS_SPLIT_MMSEG和XS_SPLIT_MMFWD表示。返回值是一個數組，包含所有切分好的詞語。如果不指定詞典，最後一次打開的詞典將被使用。

<?php

$text="那隻美麗的蝴蝶永遠在我心中翩翩飛舞着。";

$dict_file = 'xdict.db';

$dict_res = xs_open($dict_file);

$words = xs_split($text); /* 此處沒有指定詞典資源，默認使用最後一次打開的詞典 */

$words1 = xs_split($text, XS_SPLIT_MMSEG, $dict_res);

mixed xs_search ( string $text ［, int $search_method [, $dictionary_identifer ] ］ ) 基於雙數組trie樹提供的一些功能，$search_method有四個常量表示：

XS_SEARCH_CP : darts的commonPrefixSearch封裝，如果沒有找到，返回false。

XS_SEARCH_EM : darts的exactMatchSearch封裝，如果沒有找到，返回false。

XS_SEARCH_ALL_SIMPLE : 按照詞典返回所有詞語詞頻總和，一個INT型數值。

XS_SEARCH_ALL_DETAIL : 按照詞典返回所有詞典的詞頻，並以數組形式返回每一個詞語的詳細統計。

如果不指定詞典，最後一次打開的詞典將被使用。

<?php

xs_open($dict_file);

$text="那隻美麗的蝴蝶永遠在我心中翩翩飛舞着。";

$word='翩翩飛舞';

$result=xs_search($word, XS_SEARCH_CP); /* common prefix search */

var_dump($result);

$result=xs_search($word, XS_SEARCH_EM); /* exact match search */

var_dump($result);

$result=xs_search($text, XS_SEARCH_ALL_SIMPLE);

var_dump($result);

$result=xs_search($text, XS_SEARCH_ALL_DETAIL);

var_dump($result);

string xs_simhash( array $tokens [, bool $rawoutput] )

計算simhash。這裏所有token權重都是1，$tokens的例子如array('在', '這個', '世界')。$rawoput默認爲0，即返回simhash的hex string形式，如md5， sha1函數一樣；如過$rawoput爲真，返回一個8字節的字符串，這個字符串實際上是一個64 bits的整型數，uint64_t，在一些特殊情況下可以用到。

int xs_hdist（ string $simhash1, $string $simhash2)

計算漢明距離。

<?php

xs_open('xdict');

$text1="那隻美麗的蝴蝶永遠在我心中翩翩飛舞着。";

$text2="那隻美麗的蝴蝶永遠在我心中翩翩飛舞。";

$tokens1=xs_search($text1, XS_SEARCH_ALL_INDICT); /* 去掉標點等特殊符號，經過實驗，計算simhash時，一些標點、換行、特殊符號等對效果影響較大 */

$tokens2=xs_search($text2, XS_SEARCH_ALL_INDICT);

$simhash1=xs_simhash($tokens1);

$simhash2=xs_simhash($tokens2);

echo "simhash1 is {$simhash1}\n";

echo "simhash2 is {$simhash2}\n";

$hamming_dist=xs_hdist($simhash1, $simhash2);

echo "bit-wise format:\n";

echo decbin(hexdec($simhash1)), "\n";

echo decbin(hexdec($simhash2)), "\n";

echo "hamming distance is {$hamming_dist}\n";

TDengine docker安裝方法

vue項目獲取富文本編輯器wangEditor內容導出爲word（html轉word格式並下載）

dotnet C# 創建 X11 應用時設置窗口背景顏色

vue3組件通信與props

sapui5

Alpine Linux apk add DNS lookup error

部分JDK版本的發佈時間

工作中用到的腳本合集

合併代碼時Beyond Compare設置

Navicat安裝與激活教程

nginx核心模塊內置變量

svn忽略文件或目錄

利用CURL實現抓取外域內容

不允許某些賬戶遠程登錄

CentOS6 啓動流程圖文解剖 + 引導文件損壞處理方法

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結