Trying Out the Stanford Segmenter for Chinese Word Segmentation

    The Stanford Segmenter is an open-source word segmentation tool from Stanford University. It currently supports Chinese and Arabic. It is rather memory-hungry, but it appears to be faster than the segmenter from the Chinese Academy of Sciences (I have not actually measured this).

    The Stanford Segmenter is based on CRFs (Conditional Random Fields), a machine learning model. Its underlying idea is that words are built out of characters, so segmentation is treated as a classification problem over each character's position within a word; I have not dug into the exact details and have not had time to study them. Let me first paste the small demo that ships with the Stanford Segmenter and look at the segmentation quality; a rough sketch of the character-position idea follows the demo. (A side note: this tool recognizes person names, place names, and other named entities fairly accurately, which is why my company wants to experiment with it, but startup is painfully slow and I do not know whether there is a way to improve that. I have only just started working on segmentation, so experts, please go easy on me.)

import java.io.PrintStream;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class SegDemo {

  // System.getProperty(String key, String def) returns the system property for
  // the given key, or the supplied default ("data") when it is not set.
  private static final String basedir = System.getProperty("SegDemo", "data");

  public static void main(String[] args) throws Exception {
    System.setOut(new PrintStream(System.out, true, "utf-8"));

    Properties props = new Properties();
    props.setProperty("sighanCorporaDict", basedir);
    // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
    // props.setProperty("normTableEncoding", "UTF-8");
    // below is needed because CTBSegDocumentIteratorFactory accesses it
    props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
    if (args.length > 0) {
      props.setProperty("testFile", args[0]);
    }
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    CRFClassifier<CoreLabel> segmenter = new CRFClassifier<CoreLabel>(props);
    segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);
    for (String filename : args) {
      segmenter.classifyAndWriteAnswers(filename);
    }

    String sample = "我叫李塗,你叫李塗胡說嗎。";
    List<String> segmented = segmenter.segmentString(sample);
    System.out.println(segmented);
  }

}
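    As promised above, here is a rough sketch of the "segmentation as character-position classification" idea. This is purely my own illustration using the common B/M/E/S character tags; the actual label set and features inside the Stanford CRF may well differ, and the class name BmesSketch is just something I made up:

import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;

public class BmesSketch {

  // Turn one space-segmented sentence into "character/TAG" pairs, where the tag
  // marks the character's position inside its word:
  // B(egin), M(iddle), E(nd), S(ingle-character word).
  static List<String> toCharTags(String segmentedSentence) {
    List<String> tagged = new ArrayList<String>();
    for (String word : segmentedSentence.split("\\s+")) {
      if (word.isEmpty()) continue;
      int n = word.length();
      for (int i = 0; i < n; i++) {
        String tag;
        if (n == 1) {
          tag = "S";
        } else if (i == 0) {
          tag = "B";
        } else if (i == n - 1) {
          tag = "E";
        } else {
          tag = "M";
        }
        tagged.add(word.charAt(i) + "/" + tag);
      }
    }
    return tagged;
  }

  public static void main(String[] args) throws Exception {
    System.setOut(new PrintStream(System.out, true, "utf-8"));
    // One line of already-segmented text, in the training format described later.
    System.out.println(toCharTags("中國 進出口 銀行 與 中國 銀行 加強 合作"));
  }

}

    Seen this way, the CRF only has to predict one of four tags for each character, and the word boundaries fall straight out of the tag sequence.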

To run this demo you need to copy the data folder from the Stanford Segmenter project directory into your own project. data holds a number of files, including the segmentation dictionary and, as far as I can tell, the segmentation standard; these files are loaded when the program runs, and loading them takes some time.

    The demo is simple and easy to get running, but what exactly are the files being loaded here? For example

dict-chris6.ser.gz, ctb.gz

and so on. A small demo raised a whole series of questions, and since the questions were there I went looking for answers. The process was rather dull: I found nothing of real value in Chinese sources, so I had to go read the original material on the official site. Below are my notes on what I understood from it:

    The official website has this sentence:

Two models with two different segmentation standards are included:
    Chinese Penn Treebank standard and Peking University standard.

Here the Chinese Penn Treebank standard refers to CTB, a Chinese treebank built at the University of Pennsylvania, and the Peking University standard is a word segmentation standard from Peking University. Is this the essence of how the Stanford Segmenter segments? (A small aside: the Stanford Segmenter's workflow is to train a segmentation model from some data and then use the trained model to do the actual segmentation.)
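    Since the download ships both models, I assume switching standards is just a matter of loading the other model file. Below is a minimal sketch based on the demo above; I am assuming the Peking University model is the pku.gz file sitting next to ctb.gz in the data folder and that the shipped dict-chris6.ser.gz dictionary can be reused, so adjust the paths if your copy differs (the class name PkuSegDemo is my own):

import java.io.PrintStream;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class PkuSegDemo {

  public static void main(String[] args) throws Exception {
    System.setOut(new PrintStream(System.out, true, "utf-8"));

    String basedir = System.getProperty("SegDemo", "data");

    // Same property setup as SegDemo above.
    Properties props = new Properties();
    props.setProperty("sighanCorporaDict", basedir);
    props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    CRFClassifier<CoreLabel> segmenter = new CRFClassifier<CoreLabel>(props);
    // Assumption: the Peking University model ships as data/pku.gz next to ctb.gz.
    segmenter.loadClassifierNoExceptions(basedir + "/pku.gz", props);
    System.out.println(segmenter.segmentString("我叫李塗,你叫李塗胡說嗎。"));
  }

}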

    Then I came across the FAQ question "How can I retrain the Chinese Segmenter?", which answers this in detail, though I am still not entirely clear about it. The answer is pasted below; if your English is good, feel free to help me work through it:

In general you need four things in order to retrain the Chinese Segmenter.  You will need a data set with segmented text, a dictionary with words that the segmenter should know about, and various small data files for other feature generators.

The most important thing you need is a data file with text segmented according to the standard you want to use.  For example, for the CTB model we distribute, which follows the Penn Chinese Treebank segmentation standard, we use the Chinese Treebank 7.0 data set.

You will need to convert your data set to text in the following format:

中國 進出口 銀行 與 中國 銀行 加強 合作
新華社 北京 十二月 二十六日 電 ( 記者 周根良 )
...

Each individual sentence is on its own line, and spaces are used to denote word breaks.

Some data sets will come in the format of Penn trees.  There are various ways to convert this to segmented text; one way which uses our tool suite is to use the Treebanks tool:

java edu.stanford.nlp.trees.Treebanks -words ctb7.mrg

The Treebanks tool is not included in the segmenter download, but it is available in the corenlp download.

Another useful tool is a dictionary of known words.  This should include named entities such as people, places, companies, etc. which the model might segment as a single word.  This is not actually required, but it will help identify named entities which the segmenter has not seen before.  For example, our file of named entities includes names such as

吳毅成
吳浩康
吳淑珍
...

To build a dictionary usable by our model, you want to collect lists of words and then use the ChineseDictionary tool to combine them into one serialized dictionary.

java edu.stanford.nlp.wordseg.ChineseDictionary \
                        -inputDicts <filename>,<filename>,... -output dict.ser.gz

If you want to use our existing dictionary as a starting point, you can include it as one of the filenames.  Words have a maximum lexicon length (probably 6, see the ChineseDictionary source) and words longer than that will be truncated.  There is also handling of words with a "mid dot" character; this occasionally shows up in machine translation, and if a word with a mid dot shows up in the dictionary, we accept the word either with or without the dot.

You will also need a properties file which tells the classifier which features to use.  An example properties file is included in the data directory of the segmenter download.

Finally, some of the features used by the existing models require additional data files, which are included in the data/dict directory of the segmenter download.  To figure out which files correspond to which features, please search in the source code for the appropriate filename.  You can probably just reuse the existing data files.

Once you have all of these components, you can then train a new model with a command line such as

  java -mx15g edu.stanford.nlp.ie.crf.CRFClassifier \
          -prop ctb.prop -serDictionary dict-chris6.ser.gz -sighanCorporaDict data \
          -trainFile train.txt -serializeTo newmodel.ser.gz > newmodel.log 2> newmodel.err
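    Stepping outside the quoted answer for a moment: once training finishes and produces newmodel.ser.gz, I assume it can be loaded exactly the way ctb.gz is loaded in the demo at the top. I have not actually retrained a model yet, so the following is only a sketch under that assumption (the class name NewModelDemo is mine; the file names come from the training command above):

import java.io.PrintStream;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class NewModelDemo {

  public static void main(String[] args) throws Exception {
    System.setOut(new PrintStream(System.out, true, "utf-8"));

    Properties props = new Properties();
    props.setProperty("sighanCorporaDict", "data");
    // Assumption: reuse the dictionary that was passed as -serDictionary during training.
    props.setProperty("serDictionary", "dict-chris6.ser.gz");
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    CRFClassifier<CoreLabel> segmenter = new CRFClassifier<CoreLabel>(props);
    // newmodel.ser.gz is the file written by -serializeTo in the training command above.
    segmenter.loadClassifierNoExceptions("newmodel.ser.gz", props);
    System.out.println(segmenter.segmentString("中國進出口銀行與中國銀行加強合作"));
  }

}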

    The first paragraph of the answer lists the four things you need to prepare to train a model: a data set, which must be text that has already been segmented; a dictionary; some small data files; and, most importantly, the last one, a data file segmented according to the standard you want, which you could also think of as the segmentation standard itself.

    The English itself is not hard to follow; here I will just lay out my remaining questions:

1. The data set: what is the segmented data set actually used for? Is it there to generate the dictionary? And how should the data set be chosen?

2. What exactly does the dictionary refer to? Is it only a dictionary of named entities such as person names? How is it obtained? What kind of dictionary does the java edu.stanford.nlp.wordseg.ChineseDictionary command produce, and what format do its input files use: sentences that have already been segmented, or plain sentences? Is words.txt simply the segmented data set produced in the first step?

3. In the final training command, what kind of text is train.txt: plain text, or text that has already been segmented?

