Forcing custom tokenization when using Stanford NLP

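This article applies when both of the following hold:
1. You use Stanford NLP for more than tokenization, for example for constituency parsing or dependency parsing;
2. You are not satisfied with the default tokenization and want to enforce a custom dictionary.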

1. Basic usage of Stanford NLP

// build pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(
	PropertiesUtils.asProperties(
		"annotators", "tokenize,ssplit,pos,lemma,parse,natlog",
		"ssplit.isOneSentence", "true",
		"parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
		"tokenize.language", "en"));

// read some text in the text variable
String text = ... // Add your text here!
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

See the official documentation: https://stanfordnlp.github.io/CoreNLP/api.html

2. Adding a custom dictionary

Set the following property:

segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz,yourDictionaryFile

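The custom dictionary file contains one word per line.
Even with a custom dictionary added, however, the segmenter does not strictly follow it; the dictionary only serves as a hint during segmentation.
Stanford NLP does not provide a built-in way to force a particular tokenization.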
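The property above can also be set in code. The following is a minimal sketch that uses only java.util.Properties; the class and method names are made up for illustration, and "yourDictionaryFile" is the article's placeholder, not a real file. Note that the separator is an ASCII comma with no surrounding spaces.

```java
import java.util.Properties;

class SegmenterDictionaryConfig {
    // Default dictionary path, as quoted in the article.
    static final String DEFAULT_DICT =
        "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz";

    /** Builds properties whose segment.serDictionary lists the default dictionary plus a custom one. */
    static Properties withCustomDictionary(String customDictPath) {
        Properties props = new Properties();
        // Comma-separated list: ASCII comma, no spaces.
        props.setProperty("segment.serDictionary", DEFAULT_DICT + "," + customDictPath);
        return props;
    }

    public static void main(String[] args) {
        // "yourDictionaryFile" is the placeholder used in the article, not a real file.
        Properties props = withCustomDictionary("yourDictionaryFile");
        System.out.println(props.getProperty("segment.serDictionary"));
    }
}
```

In a real pipeline these entries would be merged into the rest of the Chinese pipeline configuration before the Properties object is handed to the StanfordCoreNLP constructor.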

3. Forcing custom tokenization

3.1 How annotate() works

public void annotate(Annotation annotation)

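This method performs every step defined in the configuration (for example tokenize, ssplit, pos, lemma, parse).

Internally, it calls annotate() on the annotator responsible for each step, one after another.

All results are stored in the Annotation object as key-value pairs.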
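The control flow described above can be sketched in a few lines of plain Java. This is a simplified stand-in, not CoreNLP's actual implementation: MiniPipeline and Step are hypothetical names, and the shared map uses String keys for brevity where the real Annotation uses class keys.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MiniPipeline {
    /** Hypothetical stand-in for an annotator: reads from and writes to the shared map. */
    interface Step {
        void annotate(Map<String, Object> annotation);
    }

    private final List<Step> steps;

    MiniPipeline(List<Step> steps) {
        this.steps = steps;
    }

    /** Runs every step in configuration order on the same annotation object. */
    void annotate(Map<String, Object> annotation) {
        for (Step step : steps) {
            step.annotate(annotation);
        }
    }

    /** Builds a two-step demo pipeline: whitespace tokenization, then token counting. */
    static MiniPipeline demo() {
        Step tokenize = a -> a.put("tokens", Arrays.asList(((String) a.get("text")).split(" ")));
        Step count = a -> a.put("count", ((List<?>) a.get("tokens")).size());
        return new MiniPipeline(Arrays.asList(tokenize, count));
    }

    public static void main(String[] args) {
        Map<String, Object> annotation = new HashMap<>();
        annotation.put("text", "a b c");
        demo().annotate(annotation);
        System.out.println(annotation.get("tokens") + " " + annotation.get("count"));
    }
}
```

Because each step only sees what earlier steps left in the shared object, changing the token list between two steps changes the input of everything that runs afterwards, which is what the rest of this article relies on.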

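3.2 Calling the annotators manually, one by one

The idea is to call each required annotator by hand and to modify the result of tokenizerAnnotator right after it has finished.

The difficulties are:

The modified result must be valid, otherwise the annotators that run afterwards cannot interpret it;
Finding the right Annotator class for each step.

The following code can be used in place of annotate():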

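    import java.util.Properties;
    import edu.stanford.nlp.pipeline.POSTaggerAnnotator;
    import edu.stanford.nlp.pipeline.TokenizerAnnotator;
    import edu.stanford.nlp.pipeline.WordsToSentencesAnnotator;
    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    // Tokenization
    Properties properties = ...
    TokenizerAnnotator tokenizerAnnotator = new TokenizerAnnotator(properties);
    tokenizerAnnotator.annotate(annotation);

    // Insert the forced-tokenization changes to annotation here

    // Sentence splitting
    properties = ...
    WordsToSentencesAnnotator sentencesAnnotator = new WordsToSentencesAnnotator(properties);
    sentencesAnnotator.annotate(annotation);

    // Part-of-speech tagging
    String taggerPath = "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger";
    MaxentTagger tagger = new MaxentTagger(taggerPath);
    POSTaggerAnnotator taggerAnnotator = new POSTaggerAnnotator(tagger);
    taggerAnnotator.annotate(annotation);

    // XXXAnnotator: any further annotators follow the same pattern
    // ......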

Official documentation on the annotators: https://stanfordnlp.github.io/CoreNLP/annotators.html

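3.3 Manually modifying the tokenization result stored in the Annotation

First you need to understand the structure of the Annotation object. It is essentially a Map<Class,Object>; the details are not covered here.
The result of each annotator is one key-value pair in the Annotation.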

	// Get the tokenization result, i.e. the object that will be modified below
	List<CoreLabel> tokens = annotation.get(edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation.class);

Next you need to understand the CoreLabel class, which is also a Map<Class,Object>.
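To make the "Map<Class,Object>" idea concrete, here is a minimal sketch of a map whose keys are classes and where each key class declares the type of its value. TypedMap, Key, TextKey and BeginKey are hypothetical names chosen for illustration; they only imitate the style of calls such as annotation.get(TokensAnnotation.class) and are not CoreNLP classes.

```java
import java.util.HashMap;
import java.util.Map;

class TypedMap {
    /** Marker interface: a key class declares the type V of the value stored under it. */
    interface Key<V> {}

    // Hypothetical keys, loosely modelled on the nested key classes in CoreAnnotations.
    static final class TextKey implements Key<String> {}
    static final class BeginKey implements Key<Integer> {}

    private final Map<Class<?>, Object> values = new HashMap<>();

    /** Stores a value under a key class; the compiler checks that the value type matches the key. */
    <V> void set(Class<? extends Key<V>> key, V value) {
        values.put(key, value);
    }

    /** Returns the value stored under the key class, or null if none was set. */
    @SuppressWarnings("unchecked")
    <V> V get(Class<? extends Key<V>> key) {
        return (V) values.get(key);
    }

    public static void main(String[] args) {
        TypedMap token = new TypedMap();
        token.set(TextKey.class, "Stanford");
        token.set(BeginKey.class, 0);
        String text = token.get(TextKey.class); // no cast needed at the call site
        int begin = token.get(BeginKey.class);
        System.out.println(text + " @ " + begin);
    }
}
```

The practical consequence for this article is that every piece of token information (text, offsets, indices) lives under its own key, so a replacement token has to set each of those keys explicitly.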

	// e.g. replace the token at position i with a new CoreLabel
	CoreLabel newLabel = CoreLabel.wordFromString("...");
	newLabel.setBeginPosition(startIdx); // start offset of the new token in the text
	newLabel.setEndPosition(endIdx);     // end offset of the new token in the text
	newLabel.set(edu.stanford.nlp.ling.CoreAnnotations.TokenBeginAnnotation.class, i);   // index of the new token
	newLabel.set(edu.stanford.nlp.ling.CoreAnnotations.TokenEndAnnotation.class, i + 1); // index of the token after it
	newLabel.set(edu.stanford.nlp.ling.CoreAnnotations.IsNewlineAnnotation.class, false);

	// Put the new label at position i; remove(i) followed by add() would append it at the end of the list
	tokens.set(i, newLabel);

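Once tokenizerAnnotator has run, a CoreLabel object should carry the following attributes:

the token's start position within the whole sentence
its end position
its position in the List, i.e. which CoreLabel it is
the position of the CoreLabel that follows it
isNewline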
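The article leaves the forced-tokenization step itself open. One possible approach, sketched below under stated assumptions, is to merge runs of adjacent tokens whose concatenation is a dictionary word and then renumber the tokens. Tok is a hypothetical stand-in for CoreLabel that carries only the attributes listed above, and forceMerge is an illustrative helper, not a CoreNLP API. The sketch only merges existing tokens; it does not split a token that the segmenter made too long. ASCII fragments stand in for unsegmented text; Chinese text, which has no spaces between tokens, would be handled the same way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ForcedMerge {
    /** Hypothetical stand-in for CoreLabel: text, character offsets, token index. */
    static final class Tok {
        final String word;
        final int begin; // character offset of the first character
        final int end;   // character offset one past the last character
        final int index; // position in the token list

        Tok(String word, int begin, int end, int index) {
            this.word = word;
            this.begin = begin;
            this.end = end;
            this.index = index;
        }
    }

    /** Merges runs of contiguous tokens whose concatenation is a dictionary word (longest run wins). */
    static List<Tok> forceMerge(List<Tok> tokens, Set<String> dictionary) {
        List<Tok> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Tok first = tokens.get(i);
            String bestWord = first.word;
            int bestEnd = i; // index of the last token in the longest matching run
            StringBuilder run = new StringBuilder(first.word);
            for (int j = i + 1; j < tokens.size(); j++) {
                // Stop at a gap (e.g. whitespace): only contiguous tokens can form one word.
                if (tokens.get(j).begin != tokens.get(j - 1).end) {
                    break;
                }
                run.append(tokens.get(j).word);
                if (dictionary.contains(run.toString())) {
                    bestWord = run.toString();
                    bestEnd = j;
                }
            }
            // The index is taken from the output position, so indices stay consecutive.
            merged.add(new Tok(bestWord, first.begin, tokens.get(bestEnd).end, merged.size()));
            i = bestEnd + 1;
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
            new Tok("my", 0, 2, 0), new Tok("data", 2, 6, 1),
            new Tok("base", 6, 10, 2), new Tok("server", 10, 16, 3));
        Set<String> dictionary = new HashSet<>(Arrays.asList("database"));
        for (Tok t : forceMerge(tokens, dictionary)) {
            System.out.println(t.word + " [" + t.begin + "," + t.end + ") #" + t.index);
        }
    }
}
```

When porting this to CoreLabel, begin and end would correspond to setBeginPosition and setEndPosition, index to TokenBeginAnnotation, and index + 1 to TokenEndAnnotation, following the replacement code shown earlier in this section.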
