mmseg4j: supporting search on single letters, digits, and their combinations

Original article: http://blog.csdn.net/july_2/article/details/24481935

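As the title says. At first glance this feature may seem redundant: when letters and digits are written together, mmseg4j does not split them apart but keeps them together as one token, which matches normal search needs. For example, given the input: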

String txt = "IBM12二次修改123"; Segmentation result:

ibm | 123 | 二 | 次 | 修 | 改

Now there is a new requirement: every letter and every digit must become its own token, so that the segmentation result is:

i | b | m | 1 | 2 | 3 | 二 | 次 | 修 | 改

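The effect is similar to a database LIKE query with a percent sign on both sides. The stock mmseg4j cannot satisfy this requirement. After reading part of the mmseg4j source code, I found that a very small change is enough to meet it (efficiency is not considered for now).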
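To make the difference between the two behaviours concrete, here is a minimal standalone sketch. It does not use mmseg4j at all: the class `RunVersusSingle`, the helper `isAsciiLetterOrDigit`, and the `groupRuns` flag are hypothetical names chosen for illustration, and the toy tokenizer simply ignores every character that is not an ASCII letter or digit (so the Chinese part of the sample is skipped).

```java
import java.util.ArrayList;
import java.util.List;

public class RunVersusSingle {
    // Hypothetical helper, not part of mmseg4j.
    static boolean isAsciiLetterOrDigit(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    /**
     * groupRuns = true : consecutive ASCII letters/digits form one token.
     * groupRuns = false: every ASCII letter or digit becomes its own token.
     * All other characters only act as separators in this toy version.
     */
    static List<String> tokenize(String text, boolean groupRuns) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isAsciiLetterOrDigit(c)) {
                buf.append(Character.toLowerCase(c));
                if (!groupRuns) {
                    // Emit immediately instead of reading on.
                    tokens.add(buf.toString());
                    buf.setLength(0);
                }
            } else if (buf.length() > 0) {
                // A separator ends the current run.
                tokens.add(buf.toString());
                buf.setLength(0);
            }
        }
        if (buf.length() > 0) {
            tokens.add(buf.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String txt = "IBM12二次修改123";
        System.out.println(tokenize(txt, true));   // [ibm12, 123]
        System.out.println(tokenize(txt, false));  // [i, b, m, 1, 2, 1, 2, 3]
    }
}
```

The only difference between the two modes is whether the tokenizer keeps reading after the first letter or digit, which is the kind of "read the rest of the run" step that the requirement asks to switch off.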

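Before the change, searching by a whole word returns results, but searching by a single letter returns nothing, as the screenshots below show:

Exact whole-word search:

Fuzzy search by letter: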

Modify part of the next method in MMSeg.java in the mmseg4j source. The change is simple and really just disables some code:


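    public Word next() throws IOException {
        // Take from the buffer first.
        Word word = bufWord.poll();
        if(word == null) {
            bufSentence.setLength(0);
            int data = -1;
            boolean read = true;
            // `read` stops the loop once one group of tokens has been buffered. Without it
            // the loop would run to the end of the input, and a later OTHER_LETTER run
            // would overwrite currentSentence before the earlier one is segmented.
            while(read && (data=readNext()) != -1) {
                read = false;   // By default one pass reads a run of same-type characters, which is enough to tokenize.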

                int type = Character.getType(data);
                String wordType = Word.TYPE_WORD;
                switch(type) {
                case Character.UPPERCASE_LETTER:
                case Character.LOWERCASE_LETTER:
                case Character.TITLECASE_LETTER:
                case Character.MODIFIER_LETTER:
                    /*
                     * 1. 0x410-0x44f -> А-я // Russian
                     * 2. 0x391-0x3a9 -> Α-Ω // Greek uppercase
                     * 3. 0x3b1-0x3c9 -> α-ω // Greek lowercase
                     */
                    data = toAscii(data);
                    NationLetter nl = getNation(data);
                    if(nl == NationLetter.UNKNOW) {
                        read = true;
                        break;
                    }
                    wordType = Word.TYPE_LETTER;
                    bufSentence.appendCodePoint(data);
                    switch(nl) {
                    case EN:
                        // Digits following letters, e.g. VH049PA
//                      ReadCharByAsciiOrDigit rcad = new ReadCharByAsciiOrDigit();
//                      readChars(bufSentence, rcad);
//                      if(rcad.hasDigit()) {
//                          wordType = Word.TYPE_LETTER_OR_DIGIT;
//                      }
                        // only english
                        //readChars(bufSentence, new ReadCharByAscii());
                        break;
                    case RA:
                        readChars(bufSentence, new ReadCharByRussia());
                        break;
                    case GE:
                        readChars(bufSentence, new ReadCharByGreece());
                        break;
                    }
                    bufWord.add(createWord(bufSentence, wordType));
                    bufSentence.setLength(0);
                    break;

                case Character.OTHER_LETTER:
                    /*
                     * 1. 0x3041-0x30f6 -> ぁ-ヶ   // Japanese kana (hiragana | katakana)
                     * 2. 0x3105-0x3129 -> ㄅ-ㄩ   // Bopomofo (Zhuyin) symbols
                     */
                    bufSentence.appendCodePoint(data);
                    readChars(bufSentence, new ReadCharByType(Character.OTHER_LETTER));
                    currentSentence = createSentence(bufSentence);
                    bufSentence.setLength(0);
                    break;

                case Character.DECIMAL_DIGIT_NUMBER:
                    bufSentence.appendCodePoint(toAscii(data));
//                  readChars(bufSentence, new ReadCharDigit());    // Read the following digits, AsciiLetterOr
                    wordType = Word.TYPE_DIGIT;
                    int d = readNext();
                    if(d > -1) {
                        if(seg.isUnit(d)) { // A unit, e.g. of time
                            bufWord.add(createWord(bufSentence, startIdx(bufSentence)-1, Word.TYPE_DIGIT)); // Add the digit first, as its own token
                            bufSentence.setLength(0);
                            bufSentence.appendCodePoint(d);
                            wordType = Word.TYPE_WORD;  // The unit is a word
                        } else {    // Letters or digits may follow
                            pushBack(d);
//                          if(readChars(bufSentence, new ReadCharByAsciiOrDigit()) > 0) {   // Any following letters or digits are joined together.
//                              wordType = Word.TYPE_DIGIT_OR_LETTER;
//                          }
                        }
                    }
                    bufWord.add(createWord(bufSentence, wordType));
                    bufSentence.setLength(0);   // Clear the buffered characters
                    break;

                case Character.LETTER_NUMBER:
                    // ⅠⅡⅢ are split into individual tokens
                    bufSentence.appendCodePoint(data);
                    readChars(bufSentence, new ReadCharByType(Character.LETTER_NUMBER));
                    int startIdx = startIdx(bufSentence);
                    for(int i=0; i<bufSentence.length(); i++) {
                        bufWord.add(new Word(new char[] {bufSentence.charAt(i)}, startIdx++, Word.TYPE_LETTER_NUMBER));
                    }
                    bufSentence.setLength(0);   // Clear the buffered characters
                    break;

                case Character.OTHER_NUMBER:
                    // ①⑩㈠㈩⒈⒑⒒⒛⑴⑽⑾⒇ are kept together
                    bufSentence.appendCodePoint(data);
                    readChars(bufSentence, new ReadCharByType(Character.OTHER_NUMBER));
                    bufWord.add(createWord(bufSentence, Word.TYPE_OTHER_NUMBER));
                    bufSentence.setLength(0);
                    break;

                default :
                    // Everything else is treated as an invalid character
                    read = true;
                }//switch
            }

            // Chinese word segmentation
            if(currentSentence != null) {
                do {
                    Chunk chunk = seg.seg(currentSentence);
                    for(int i=0; i<chunk.getCount(); i++) {
                        bufWord.add(chunk.getWords()[i]);
                    }
                } while (!currentSentence.isFinish());
                currentSentence = null;
            }
            word = bufWord.poll();
        }
        return word;
    }

The main change is that some code is commented out, so that letters and digits are no longer read as continuous runs.

Searching by letter again now gives the following result:

In summary, this simple change gives an effect similar to a database LIKE query with a percent sign on both sides.
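Why single-character tokens behave like a two-sided LIKE can be sketched as follows. The article does not say how the search layer combines the query tokens, so this sketch assumes a phrase-style match: the query is tokenized the same way as the document, and it matches when its tokens appear as a contiguous run in the document's tokens. The class `SubstringByTokens` and its methods are hypothetical and independent of mmseg4j and of any search engine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SubstringByTokens {
    /** One lowercase token per letter or digit (CJK characters included); other characters are skipped. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetterOrDigit(cp)) {
                tokens.add(new String(Character.toChars(Character.toLowerCase(cp))));
            }
        }
        return tokens;
    }

    /** True if the query's tokens occur consecutively in the document's tokens (assumed phrase-style match). */
    static boolean matches(String doc, String query) {
        List<String> queryTokens = tokenize(query);
        if (queryTokens.isEmpty()) {
            return false; // nothing to search for
        }
        return Collections.indexOfSubList(tokenize(doc), queryTokens) >= 0;
    }

    public static void main(String[] args) {
        String txt = "IBM12二次修改123";
        System.out.println(matches(txt, "bm1"));  // true
        System.out.println(matches(txt, "次修")); // true
        System.out.println(matches(txt, "b1"));   // false
    }
}
```

Because this toy tokenizer drops punctuation and lowercases letters, the match is looser than a literal LIKE: case is ignored, and characters separated only by punctuation count as adjacent.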
