Configuring and Using deepnlp on Ubuntu 16.04

This guide mainly follows the deepnlp README.
DeepNLP includes the following modules:

  • NLP Pipeline Modules:

    • Word Segmentation/Tokenization
    • Part-of-speech (POS) tagging
    • Named-entity recognition (NER)
    • textsum: automatic summarization with Seq2Seq-Attention models
    • textrank: extraction of the most important sentences
    • textcnn: document classification
    • Web API: free TensorFlow-powered web API
    • Planned: parsing, automatic summarization
  • Algorithms (closely following the state of the art)

    • Word segmentation: linear-chain CRF (conditional random field), based on the Python CRF++ module
    • POS: LSTM/BI-LSTM network, based on TensorFlow
    • NER: LSTM/BI-LSTM/LSTM-CRF network, based on TensorFlow
    • Textsum: Seq2Seq with attention mechanism
    • Textcnn: CNN
  • Pre-trained models
    • Chinese: segmentation, POS, NER (1998 China Daily corpus)
    • English: POS (Brown corpus)
    • For other languages, you can easily use the training scripts to train a model on a corpus of your choice.

Installation

The models require TensorFlow 1.0. Install it with the following commands:

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.0.1-cp35-cp35m-linux_x86_64.whl
sudo pip install --upgrade $TF_BINARY_URL
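
Note that the wheel above targets Python 3.5 (cp35) with GPU support; since everything later in this walkthrough runs under Python 2.7 (see the tracebacks below), the matching cp27 wheel (for TF 1.0.1 it should be named tensorflow_gpu-1.0.1-cp27-none-linux_x86_64.whl) is presumably the one to install. Either way, a quick sanity check confirms the installed version:

python -c "import tensorflow as tf; print(tf.__version__)"   # expect 1.0.1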

The models cannot be used with Python 3.
Install deepnlp with the following command:

sudo pip install deepnlp

Usage

Downloading the pre-trained models

The deepnlp package installed via pip does not include the model files, so they must be downloaded separately. In a Python session, run:

import deepnlp
# Download all the modules
deepnlp.download()
# Download only specific module
deepnlp.download('segment')
deepnlp.download('pos')
deepnlp.download('ner')
deepnlp.download('textsum')
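
Each deepnlp.download(...) call fetches the corresponding pre-trained model files from the project's GitHub repository into the installed package directory, so the first run needs network access and can take a while.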

Word segmentation

Run the following Python program:

#coding=utf-8
from __future__ import unicode_literals

from deepnlp import segmenter

text = "我剛剛在浙江衛視看了電視劇老九門,覺得陳偉霆很帥"
segList = segmenter.seg(text)   # segment the raw sentence into a list of word tokens
text_seg = " ".join(segList)    # join tokens with spaces

print (text.encode('utf-8'))
print (text_seg.encode('utf-8'))

It fails with the following error:

Traceback (most recent call last):
  File "test_segment.py", line 4, in <module>
    from deepnlp import segmenter
  File "/usr/local/lib/python2.7/dist-packages/deepnlp/segmenter.py", line 6, in <module>
    import CRFPP
ImportError: No module named CRFPP

Word segmentation depends on CRF++ (>= 0.54). Download CRF++ 0.58 from its website, extract it, and run:

./configure
make
sudo make install

Then enter the python folder inside the CRF++ source tree and run:

python setup.py build
su
python setup.py install
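
A quick import check confirms that the Python binding was installed (no output means success):

python -c "import CRFPP"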

After the installation completes, run the same segmentation program as above.

This time a different error appears:

    import CRFPP
  File "/usr/lib/python2.7/dist-packages/bpython/curtsiesfrontend/repl.py", line 257, in load_module
    module = pkgutil.ImpLoader.load_module(self, name)
  File "/usr/lib/python2.7/pkgutil.py", line 246, in load_module
    mod = imp.load_module(fullname, self.file, self.filename, self.etc)
  File "/usr/local/lib/python2.7/dist-packages/CRFPP.py", line 26, in <module>
    _CRFPP = swig_import_helper()
  File "/usr/local/lib/python2.7/dist-packages/CRFPP.py", line 22, in swig_import_helper
    _mod = imp.load_module('_CRFPP', fp, pathname, description)
ImportError: libcrfpp.so.0: cannot open shared object file: No such file or directory

This happens because the dynamic linker cannot find the newly installed shared library. Symlinking it into a default library path fixes it:

sudo ln -s /usr/local/lib/libcrfpp.so.* /usr/lib/ 
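
Alternatively, refreshing the dynamic linker cache usually suffices, since /usr/local/lib is normally already listed under /etc/ld.so.conf.d/ on Ubuntu:

sudo ldconfig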

Part-of-speech tagging

Run the following program:

#coding:utf-8
from __future__ import unicode_literals # compatible with python3 unicode

from deepnlp import segmenter
from deepnlp import pos_tagger
tagger = pos_tagger.load_model(lang = 'zh')

#Segmentation
text = "我愛吃北京烤鴨"         # unicode coding, py2 and py3 compatible
words = segmenter.seg(text)
print(" ".join(words).encode('utf-8'))

#POS Tagging
tagging = tagger.predict(words)
for (w,t) in tagging:
    str = w + "/" + t
    print(str.encode('utf-8'))

#Results
#我/r
#愛/v
#吃/v
#北京/ns
#烤鴨/n

For English POS tagging, the same API applies; just load the English model:

#coding:utf-8
from __future__ import unicode_literals

import deepnlp
deepnlp.download('pos')                     # download the POS pretrained models from github if installed from pip

from deepnlp import pos_tagger
tagger = pos_tagger.load_model(lang = 'en')  # Loading English model, lang code 'en'

#Segmentation
text = "I want to see a funny movie"
words = text.split(" ")
print (" ".join(words).encode('utf-8'))

#POS Tagging
tagging = tagger.predict(words)
for (w,t) in tagging:
    str = w + "/" + t
    print (str.encode('utf-8'))

#Results
#I/nn
#want/vb
#to/to
#see/vb
#a/at
#funny/jj
#movie/nn

Named-entity recognition

Run the following program:

#coding:utf-8
from __future__ import unicode_literals # compatible with python3 unicode

import deepnlp
deepnlp.download('ner')  # download the NER pretrained models from github if installed from pip

from deepnlp import segmenter
from deepnlp import ner_tagger
tagger = ner_tagger.load_model(lang = 'zh')

#Segmentation
text = "我愛吃北京烤鴨"
words = segmenter.seg(text)
print (" ".join(words).encode('utf-8'))

#NER tagging
tagging = tagger.predict(words)
for (w,t) in tagging:
    str = w + "/" + t
    print (str.encode('utf-8'))

#Results
#我/nt
#愛/nt
#吃/nt
#北京/p
#烤鴨/nt

Pipeline

Run the following program:

#coding:utf-8
from __future__ import unicode_literals # compatible with python3 unicode

import sys,os
import codecs

import deepnlp
deepnlp.download('segment')   # download all the required pretrained models from github if installed from pip
deepnlp.download('pos')       
deepnlp.download('ner')

from deepnlp import pipeline
p = pipeline.load_model('zh')

# concatenate tuples into one string "w1/t1 w2/t2 ..."
def _concat_tuples(tagging):
  TOKEN_BLANK = " "
  wl = [] # wordlist
  for (x, y) in tagging:
    wl.append(x + "/" + y) # unicode
  concat_str = TOKEN_BLANK.join(wl)
  return concat_str

# input file
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
docs = []
fileIn = codecs.open(os.path.join(BASE_DIR, 'docs_test.txt'), 'r', encoding='utf-8')
for line in fileIn:
    line = line.replace("\n", "").replace("\r", "")
    docs.append(line)

# output file
fileOut = codecs.open(os.path.join(BASE_DIR, 'pipeline_test_results.txt'), 'w', encoding='utf-8')

# analyze function
# @return: list of 3 elements [seg, pos, ner]
text = docs[0]
res = p.analyze(text)
words = p.segment(text)
pos_tagging = p.tag_pos(words)
ner_tagging = p.tag_ner(words)

# print pipeline.analyze() results
fileOut.writelines("pipeline.analyze results:" + "\n")
fileOut.writelines(res[0] + "\n")
fileOut.writelines(res[1] + "\n")
fileOut.writelines(res[2] + "\n")

print (res[0].encode('utf-8'))
print (res[1].encode('utf-8'))
print (res[2].encode('utf-8'))

# print modules results
fileOut.writelines("modules results:" + "\n")
fileOut.writelines(" ".join(words) + "\n")
fileOut.writelines(_concat_tuples(pos_tagging) + "\n")
fileOut.writelines(_concat_tuples(ner_tagging) + "\n")
fileOut.close()
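
As the comment above notes, p.analyze(text) returns a list of three elements, so res[0], res[1], and res[2] hold the segmentation, POS tagging, and NER tagging respectively, matching the output of the individual p.segment, p.tag_pos, and p.tag_ner calls.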

Automatic summarization

See https://github.com/rockingdingo/deepnlp/tree/master/deepnlp/textsum or the README in the textsum folder.

Interactive prediction

cd ./ckpt
cat headline_large.ckpt-48000.* > headline_large.ckpt-48000.data-00000-of-00001.tar.gz
tar xzvf headline_large.ckpt-48000.data-00000-of-00001.tar.gz
# copy the restored checkpoint into the textsum/ckpt directory of the deepnlp installation
# (adjust this path to wherever deepnlp is installed on your machine)
sudo mkdir /mnt/python/pypi/deepnlp/deepnlp/textsum/ckpt
sudo cp * /mnt/python/pypi/deepnlp/deepnlp/textsum/ckpt
cd ..
python predict.py

Then interactively enter pre-segmented Chinese news body text, with words separated by spaces; the script returns an automatically generated headline.
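
Since predict.py expects space-separated words, the segmenter from earlier can prepare raw text for it. A minimal sketch, assuming the segmentation model is already downloaded (the sample sentence is illustrative):

#coding:utf-8
from __future__ import unicode_literals

from deepnlp import segmenter

# convert a raw news sentence into the space-separated form predict.py expects
raw = "我剛剛在浙江衛視看了電視劇老九門"
print(" ".join(segmenter.seg(raw)).encode('utf-8'))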

Prediction and ROUGE score evaluation

python predict.py news/test/content-test.txt news/test/title-test.txt news/test/summary.txt
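
Judging by the file names, the three arguments are the segmented news bodies, the reference titles, and the output file for the generated summaries; the generated headlines are then scored against the references with ROUGE.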