1. Use the Stanford Word Segmenter for Chinese word segmentation; download it from http://nlp.stanford.edu/software/segmenter.shtml
2. Version 1.6.7
3. Put seg.jar on the classpath and place the data directory under the src directory
4. Write a test program based on the bundled demo:
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ie.crf.CRFClassifier;

public class SegDemo {

    // Join the segmenter's tokens with single spaces.
    // Note: casting c.segmentString(data).toArray() to String[] throws a
    // ClassCastException (toArray() returns Object[]); iterate the List instead.
    public static String doSegment(String data, CRFClassifier c) {
        List<String> segmented = c.segmentString(data);
        StringBuilder buf = new StringBuilder();
        for (String s : segmented) {
            buf.append(s).append(' ');
        }
        return buf.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        CRFClassifier classifier = new CRFClassifier(props);
        classifier.loadClassifierNoExceptions("data/ctb.gz", props);
        classifier.flags.setProperties(props);

        String sentence = "他和我在學校裏常打桌球。";
        String ret = doSegment(sentence, classifier);
        System.out.println(ret);
    }
}
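The doSegment helper above only joins the segmenter's token list with spaces. That joining step can be sketched in plain Java, with no Stanford classes required; the token list here is hard-coded for illustration (it mirrors the expected output for the demo sentence):

```java
import java.util.Arrays;
import java.util.List;

public class JoinDemo {
    // Same idea as doSegment: join a List<String> of tokens with single spaces.
    // String.join avoids the trailing space that manual appending leaves behind.
    static String joinTokens(List<String> tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Hypothetical token list, standing in for segmentString's result.
        List<String> tokens = Arrays.asList(
            "他", "和", "我", "在", "學校", "裏", "常", "打", "桌球", "。");
        System.out.println(joinTokens(tokens));
    }
}
```

This is only the string-handling half; the actual tokens must come from a loaded CRFClassifier as in the demo above.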
5. Add the JVM run option (the dictionary and model are large, so the default heap is not enough):
-mx1g
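-mx1g is a synonym for -Xmx1g and raises the maximum heap to 1 GB. If you are unsure the option took effect, you can print the effective limit at runtime; a minimal check:

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Max heap in MiB, as set by -mx1g / -Xmx1g (JVM defaults vary).
        long maxMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxMiB + " MiB");
    }
}
```

With -mx1g the printed value should be close to 1024 MiB (the exact number depends on the JVM).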
6. Run output:
serDictionary=data/dict-chris6.ser.gz
sighanCorporaDict=data
inputEncoding=UTF-8
sighanPostProcessing=true
Loading classifier from data/ctb.gz ... Loading Chinese dictionaries from 1 files:
data/dict-chris6.ser.gz
loading dictionaries from data/dict-chris6.ser.gz...Done. Unique words in ChineseDictionary is: 423200
done [31.8 sec].
serDictionary=data/dict-chris6.ser.gz
sighanCorporaDict=data
inputEncoding=UTF-8
sighanPostProcessing=true
INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from data/dict/character_list and data/dict/in.ctb
Loading character dictionary file from data/dict/character_list
Loading affix dictionary from data/dict/in.ctb
他 和 我 在 學校 裏 常 打 桌球 。
7. Other details worth noting: http://www.cnblogs.com/XP007/archive/2011/10/27/2227158.html