Reverse Maximum Matching Algorithm

Original

iceshirley

2020-02-21 07:46

1. Definition

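Reverse MM algorithm: let L be the number of characters in the longest entry in the dictionary. Take L characters from the string being analyzed and look them up in the dictionary. If there is no match, drop the last character and look up again. Repeat until a match is found or only a single character remains.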
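The procedure can be sketched on a plain String, outside Lucene. This is a minimal sketch, not the author's tokenizer: the class name MaxMatchSketch, the method segment, and the two-entry dictionary are assumptions made for illustration. It follows the steps exactly as described above, taking the window from the current position and trimming its last character. The conventional reverse variant instead takes the window from the end of the string and trims its first character.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of the original post.
class MaxMatchSketch {
    // Takes up to maxLen characters at the current position and drops the
    // last one until the candidate is in the dictionary or has length 1.
    static List<String> segment(String text, Set<String> dic, int maxLen) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(pos + maxLen, text.length());
            while (end - pos > 1 && !dic.contains(text.substring(pos, end))) {
                end--; // no match: drop the last character and retry
            }
            words.add(text.substring(pos, end));
            pos = end; // continue right after the accepted word
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dic = new HashSet<>();
        dic.add("中華人民共和國");
        dic.add("成立");
        // prints [中華人民共和國, 成立]
        System.out.println(segment("中華人民共和國成立", dic, 7));
    }
}
```

A character that never matches any entry falls through to the length-1 case, so the loop always advances by at least one character.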

2. Implementation

Create an MMChineseAnalyzer class that extends org.apache.lucene.analysis.Analyzer; it must implement the public TokenStream tokenStream(String field, Reader reader) method. Then create an MMChineseTokenizer class that extends org.apache.lucene.analysis.Tokenizer. The constructor of MMChineseAnalyzer is:

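public MMChineseAnalyzer(){
  dic=new HashSet<String>();
  loadStopword();
  loadDictionary();
}

Because insertion, deletion, and lookup in a hash table take constant time on average, this data structure is used here. dic holds the dictionary entries. loadStopword() loads the stopwords, and loadDictionary() loads the dictionary file.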
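The post does not show loadDictionary(), so the following is only a sketch of what such a loader might look like. It assumes a dictionary with one word per line, and the names DictionaryLoaderSketch, load, and maxWordLength are hypothetical. It also derives the longest entry length from the data, which is the L used in the definition.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; the one-word-per-line format is an assumption.
class DictionaryLoaderSketch {
    // Reads every non-blank line of the source into a HashSet.
    static Set<String> load(Reader source) {
        Set<String> dic = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    dic.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return dic;
    }

    // Length of the longest entry, i.e. the L of the definition.
    static int maxWordLength(Set<String> dic) {
        int max = 0;
        for (String w : dic) {
            max = Math.max(max, w.length());
        }
        return max;
    }

    public static void main(String[] args) {
        Set<String> dic = load(new StringReader("今天\n中華人民共和國\n\n成立\n"));
        // prints 3 entries, longest = 7
        System.out.println(dic.size() + " entries, longest = " + maxWordLength(dic));
    }
}
```

Computing the maximum length at load time avoids hard-coding it in the tokenizer when the dictionary changes.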

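In MMChineseTokenizer, the public Token next() method must be implemented. The constructor is:

public MMChineseTokenizer(Reader in, HashSet<String> dic){
  input=in;
  this.dic=dic;
}

dic is the dictionary, stored as a HashSet, and input is the Reader over the text to be analyzed.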

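In the next() method, the contents of input are read into ioBuffer. A character c is then taken from the buffer and classified by character type. If c is a Chinese character: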

if (cUnicodeBlock == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
    tokenType = "chinese";
    int start = bufferIndex;
    char[] temp = new char[7]; // the longest dictionary entry has 7 characters
    temp[0] = c;
    // collect up to 6 more characters from the same Unicode block
    for (int i = 1; i < 7; i++, start++) {
        if (start >= ioBuffer.length) { // check the bound before reading
            break;
        }
        char temp1 = ioBuffer[start];
        if (cUnicodeBlock == Character.UnicodeBlock.of(temp1)) {
            temp[i] = temp1;
        } else {
            break;
        }
    }
    String temp2 = new String(temp);
    temp2 = temp2.trim(); // drops the unused trailing '\0' slots
    int length = temp2.length();
    // the key part of the algorithm
    while (true) {
        if (dic.contains(temp2)) {
            word.append(temp2);
            offset = start;
            bufferIndex = start;
            break LABLE; // LABLE is a label on an enclosing statement not shown in this excerpt
        } else {
            if (length == 1) {
                // a single character is accepted even if it is not in the dictionary
                word.append(temp2);
                offset = start;
                bufferIndex = start;
                break LABLE;
            }
            temp2 = temp2.substring(0, --length);
            start--;
        }
    }
}

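temp holds the candidate taken on each pass, and word is a StringBuffer. The while loop ends as soon as the candidate is found in the dictionary. If it is not found, the last character is dropped with temp2 = temp2.substring(0, --length); and the lookup is repeated until length equals 1. At that point the candidate is a single character, which is appended to word, and the loop ends.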

If c is a Latin character, isSameUnicodeBlock checks whether the current character and the next character belong to the same Unicode block:

else if (cUnicodeBlock == Character.UnicodeBlock.BASIC_LATIN) {
    tokenType = "english";
    if (Character.isWhitespace(c)) {
        // whitespace ends the current token, if there is one
        if (word.length() != 0)
            break;
    } else {
        word.append(c);
        // peek at the next character; assumes bufferIndex is still inside ioBuffer
        nextChar = ioBuffer[bufferIndex];
        nextCharUnicodeBlock = Character.UnicodeBlock.of(nextChar);
        boolean isSameUnicodeBlock = cUnicodeBlock.toString().equalsIgnoreCase(nextCharUnicodeBlock.toString());
        // a change of Unicode block also ends the token
        if (word.length() != 0 && (!isSameUnicodeBlock)) {
            break;
        }
    }
}
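The block-boundary rule used in the Latin branch can be tried in isolation. The following is a small sketch under stated assumptions: the class UnicodeBlockRunsSketch and the method splitRuns are hypothetical, and the code works on a String rather than on the tokenizer's ioBuffer. It ends a run at whitespace or whenever Character.UnicodeBlock.of reports a different block from the previous character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original tokenizer.
class UnicodeBlockRunsSketch {
    // Splits text into runs of consecutive characters from one Unicode block.
    static List<String> splitRuns(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        Character.UnicodeBlock current = null;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
            boolean boundary = Character.isWhitespace(c) || block != current;
            if (boundary && word.length() > 0) {
                runs.add(word.toString()); // close the run in progress
                word.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                word.append(c);
                current = block;
            }
        }
        if (word.length() > 0) {
            runs.add(word.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // prints [成立於, 1949, 年, 10, 月]
        System.out.println(splitRuns("成立於1949年10月"));
    }
}
```

Digits fall in BASIC_LATIN, which is why numbers come out as separate tokens between the Chinese characters.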

That completes the tokenizer.

The test string is “今天是個難忘的日子中華人民共和國成立於1949年10月1號 ”

The result is:

(今天是,0,3,type=chinese)
(個,3,4,type=chinese)
(難忘,4,6,type=chinese)
(日子,7,9,type=chinese)
(中華人民共和國,9,16,type=chinese)
(成立,16,18,type=chinese)
(1949,19,23,type=latin)
(年,23,24,type=chinese)
(10,24,26,type=latin)
(月,26,27,type=chinese)
(1,27,28,type=latin)
(號,28,29,type=chinese)
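The offsets in this output skip positions 6 to 7 and 18 to 19, where 的 and 於 sit in the input. That is consistent with stopwords being dropped after matching while later tokens keep their original positions, but the post does not show the filtering code, so the following is only a sketch of that assumption. The class TokenOffsetSketch, the method tokens, and the small dictionary and stopword sets are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; stopword handling here is an assumption.
class TokenOffsetSketch {
    // Returns "(term,start,end)" entries, skipping stopwords but still
    // advancing the position so later offsets refer to the original text.
    static List<String> tokens(String text, Set<String> dic, Set<String> stop, int maxLen) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(pos + maxLen, text.length());
            while (end - pos > 1 && !dic.contains(text.substring(pos, end))) {
                end--;
            }
            String term = text.substring(pos, end);
            if (!stop.contains(term)) {
                out.add("(" + term + "," + pos + "," + end + ")");
            }
            pos = end;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dic = new HashSet<>(Arrays.asList("難忘", "日子"));
        Set<String> stop = new HashSet<>(Arrays.asList("的"));
        // prints [(難忘,0,2), (日子,3,5)]
        System.out.println(tokens("難忘的日子", dic, stop, 7));
    }
}
```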
