Lemmatizing English Words with Stanford CoreNLP

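I recently had a small task: given the past participle, present participle, or plural form of an English word, recover its base form. My original plan was to build a table of irregular words and look those up directly, and to write my own rules to undo the regular inflections. I then found that some inflections depend on pronunciation. For example, a word ending in a stressed closed syllable doubles its final consonant before the suffix is added, which is hard to reverse. Looking up phonetic transcriptions online did not give good results either.

Searching further, I found Stanford CoreNLP, an open-source toolkit written in Java that can be used directly in your own project. It can be downloaded from http://nlp.stanford.edu/software/corenlp.shtml. After downloading, extract the archive and add ejml-0.19-nogui.jar, joda-time.jar, jollyday.jar, stanford-corenlp-3.2.0.jar, stanford-corenlp-3.2.0-models.jar, and xom.jar to your project. The code snippet is: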

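	import java.util.List;
	import java.util.Properties;
	import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
	import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
	import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
	import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
	import edu.stanford.nlp.ling.CoreLabel;
	import edu.stanford.nlp.pipeline.Annotation;
	import edu.stanford.nlp.pipeline.StanfordCoreNLP;
	import edu.stanford.nlp.util.CoreMap;

	// The lemma annotator depends on tokenize, ssplit and pos, so all four are listed.
	Properties props = new Properties();
	props.put("annotators", "tokenize,ssplit,pos, lemma");
	// Building the pipeline loads the models and is slow; create it once and reuse it.
	StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
	Annotation document = new Annotation(txtWord);
	pipeline.annotate(document);
	List<CoreMap> sentences = document.get(SentencesAnnotation.class);
	for (CoreMap sentence : sentences) {
		for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
			String word = token.get(TextAnnotation.class);   // surface form as it appears in the text
			String lema = token.get(LemmaAnnotation.class);  // lemma (base form) of that token
			logger.info(word + "," + lema);
			originWord = lema;   // keeps the lemma of the last token processed
			originFlag = true;
		}
	}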

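Here txtWord is the text to be processed. The annotators listed in

props.put("annotators", "tokenize,ssplit,pos, lemma");

are, in order, tokenization, sentence splitting, part-of-speech tagging, and lemmatization.

String word = token.get(TextAnnotation.class);

gets the word as it appears in the text.

String lema = token.get(LemmaAnnotation.class);

gets the lemma for the word above, which is the base form I needed.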
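For comparison, here is a minimal sketch of the rule-based approach described at the start (an irregular-word table plus suffix rules). It is not the code I originally wrote: the class name NaiveLemmatizer, the table entries, and the rules are all hypothetical and chosen only for illustration. It shows why undoing consonant doubling without pronunciation or part-of-speech information is unreliable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a naive lemmatizer built from a lookup
// table and a few suffix rules. It uses no CoreNLP classes.
class NaiveLemmatizer {
    // Tiny sample table of irregular forms; a real table would be far larger.
    private static final Map<String, String> IRREGULAR = new HashMap<>();
    static {
        IRREGULAR.put("went", "go");
        IRREGULAR.put("taken", "take");
        IRREGULAR.put("children", "child");
        IRREGULAR.put("mice", "mouse");
    }

    static String lemmatize(String word) {
        String w = word.toLowerCase();
        String irregular = IRREGULAR.get(w);
        if (irregular != null) {
            return irregular;
        }
        if (w.endsWith("ies") && w.length() > 4) {
            return w.substring(0, w.length() - 3) + "y";      // studies -> study
        }
        if (w.endsWith("ing") && w.length() > 5) {
            return undouble(w.substring(0, w.length() - 3));  // running -> runn -> run
        }
        if (w.endsWith("ed") && w.length() > 4) {
            return undouble(w.substring(0, w.length() - 2));  // stopped -> stopp -> stop
        }
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            return w.substring(0, w.length() - 1);            // cats -> cat
        }
        return w;
    }

    // Drops one letter of a doubled final consonant. The rule cannot tell a
    // doubling added by inflection ("runn") from one that belongs to the
    // base form ("tell").
    private static String undouble(String stem) {
        int n = stem.length();
        if (n >= 3 && stem.charAt(n - 1) == stem.charAt(n - 2)
                && "aeiou".indexOf(stem.charAt(n - 1)) < 0) {
            return stem.substring(0, n - 1);
        }
        return stem;
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("running"));  // run
        System.out.println(lemmatize("went"));     // go
        System.out.println(lemmatize("hoping"));   // hop  (wrong: should be hope)
        System.out.println(lemmatize("telling"));  // tel  (wrong: should be tell)
    }
}
```

With these rules "hoping" and "hopping" both come out as "hop", and "telling" loses a letter. Each fix for one case tends to break another, which is the reason for using a lemmatizer that draws on part-of-speech information instead.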

小小小小小飛鳥
