Named Entity Recognition with CRF++

【Definition】
CRF++ is a well-known open-source toolkit for conditional random fields, written in C++, and is widely regarded as one of the best-performing CRF implementations available. Its most important feature is its support for feature templates, which automatically generate a whole series of feature functions so that we do not have to write them ourselves; all we need to do is identify useful features, such as part-of-speech tags.
【Installation】
On Windows, CRF++ needs no installation: just download and unzip the CRF++ 0.58 package and it is ready to use.
【Corpus】
Note that the delimiter between a token and its tags is the tab character \t.

played VBD O
on IN O
Monday NNP O
( ( O
home NN O
team NN O
in IN O
CAPS NNP O

【Feature templates】
The template is the key to using CRF++: it generates a whole series of feature functions automatically so that we do not have to write them ourselves, and feature functions are one of the core concepts of the CRF algorithm.
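Each line of a template addresses the input with %x[row,col] macros: row is the offset from the current token (e.g. -2 to 2) and col is the column index in the data file, so in the format above column 0 is the word and column 1 is the POS tag. A minimal annotated sketch (the U-prefixed IDs are just unique feature names):

# column 0 = word, column 1 = POS tag
# the current word
U02:%x[0,0]
# the word two positions to the left
U00:%x[-2,0]
# bigram of the previous and current POS tags
U16:%x[-1,1]/%x[0,1]
# a single B line adds bigram features over adjacent output labels
B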
【Training】
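Training is done with the crf_learn command-line tool. Its basic form is:

crf_learn template_file train_file model_file

Common options include -c, which sets the hyperparameter balancing overfitting against underfitting (larger values fit the training data more tightly), -f NUM, which discards features occurring fewer than NUM times, and -t, which additionally saves the model in text format. The concrete command used for this corpus appears in the example below.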
【Prediction】
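Prediction uses the crf_test tool: -m points at the trained model, and the input file must have the same column layout as the training data (the label column may hold a placeholder value, as the prediction script below does):

crf_test -m model_file test_file > output_file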
【Example】
The corpus contains 42,000 lines in groups of three: the first line of each group is an English sentence, the second line gives the part of speech of every word in the sentence, and the third line gives the NER annotation. There are four annotation categories: PER (person), LOC (location), ORG (organization) and MISC; a B prefix marks the beginning of an entity, an I prefix its continuation, O marks a token outside any entity (not counted for NER), and sO marks special single tokens. First we split the corpus into a training set and a test set at a ratio of 9:1.
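Concretely, the tokens within each of the three lines are tab-separated. Reusing the tokens from the sample above, one group in train.txt looks like this (an illustrative sketch; the columns are separated by \t in the real file):

played  on  Monday  (  home  team  in  CAPS
VBD  IN  NNP  (  NN  NN  IN  NNP
O  O  O  O  O  O  O  O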

# -*- coding: utf-8 -*-

# path containing the NER corpus train.txt
data_dir = "/Users/Shared/CRF_4_NER/CRF_TEST"

with open("%s/train.txt" % data_dir, "r") as f:
    sents = [line.strip() for line in f.readlines()]

# ratio of training set to test set is 9:1
RATIO = 0.9
train_num = int((len(sents)//3)*RATIO)

# split the file into a training set and a test set
with open("%s/NER_train.data" % data_dir, "w") as g:
    for i in range(train_num):
        words = sents[3*i].split('\t')
        postags = sents[3*i+1].split('\t')
        tags = sents[3*i+2].split('\t')
        for word, postag, tag in zip(words, postags, tags):
            g.write(word+' '+postag+' '+tag+'\n')
        g.write('\n')

# the test set starts at index train_num, right after the training groups
with open("%s/NER_test.data" % data_dir, "w") as h:
    for i in range(train_num, len(sents)//3):
        words = sents[3*i].split('\t')
        postags = sents[3*i+1].split('\t')
        tags = sents[3*i+2].split('\t')
        for word, postag, tag in zip(words, postags, tags):
            h.write(word+' '+postag+' '+tag+'\n')
        h.write('\n')

print('OK!')

The content of the template file template is as follows:

# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]

U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
U15:%x[-2,1]/%x[-1,1]
U16:%x[-1,1]/%x[0,1]
U17:%x[0,1]/%x[1,1]
U18:%x[1,1]/%x[2,1]

U20:%x[-2,1]/%x[-1,1]/%x[0,1]
U21:%x[-1,1]/%x[0,1]/%x[1,1]
U22:%x[0,1]/%x[1,1]/%x[2,1]

# Bigram
B
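To see what these templates do, take the token Monday from the sample data, whose row is "Monday NNP O". CRF++ expands each macro against that row and its neighbours, producing feature strings such as:

U02:Monday   (current word)
U12:NNP      (current POS tag)
U16:IN/NNP   (previous POS tag / current POS tag)

Each expanded string is then paired with every possible output label to form the actual feature functions; the single B line additionally generates bigram features over pairs of adjacent output labels.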

Train on this data:

crf_learn -c 3.0 template NER_train.data model -t

Evaluate the model's predictions on the test set:

crf_test -m model NER_test.data > result.txt
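crf_test copies the input columns and appends the predicted label as a new last column, so every non-empty line of result.txt has four fields: word, POS tag, gold label, predicted label. For example (illustrative output; actual predictions depend on the trained model):

played VBD O O
Monday NNP O O

The accuracy script below relies on this layout when it compares the last two columns of each line.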

Use a Python script to compute the per-token prediction accuracy (note that this metric also counts the many O tokens, so it is more lenient than an entity-level evaluation):

# -*- coding: utf-8 -*-

data_dir = "/Users/Shared/CRF_4_NER/CRF_TEST"

# each non-empty line of result.txt is "word POS gold_tag predicted_tag"
with open("%s/result.txt" % data_dir, "r") as f:
    sents = [line.strip() for line in f.readlines() if line.strip()]

total = len(sents)
print(total)

# a token is correct when the predicted tag (last column)
# matches the gold tag (second-to-last column)
count = 0
for sent in sents:
    words = sent.split()
    if words[-1] == words[-2]:
        count += 1

print("Accuracy: %.4f" % (count/total))

Now let's see how the model performs on new data:

# -*- coding: utf-8 -*-

import os
import nltk

data_dir = "/Users/Shared/CRF_4_NER/CRF_TEST"

sentence = "Venezuelan opposition leader and self-proclaimed interim president Juan Guaidó said Thursday he will return to his country by Monday, and that a dialogue with President Nicolas Maduro won't be possible without discussing elections."
#sentence = "Real Madrid's season on the brink after 3-0 Barcelona defeat"
# sentence = "British artist David Hockney is known as a voracious smoker, but the habit got him into a scrape in Amsterdam on Wednesday."
# sentence = "India is waiting for the release of an pilot who has been in Pakistani custody since he was shot down over Kashmir on Wednesday, a goodwill gesture which could defuse the gravest crisis in the disputed border region in years."
# sentence = "Instead, President Donald Trump's second meeting with North Korean despot Kim Jong Un ended in a most uncharacteristic fashion for a showman commander in chief: fizzle."
# sentence = "And in a press conference at the Civic Leadership Academy in Queens, de Blasio said the program is already working."
#sentence = "The United States is a founding member of the United Nations, World Bank, International Monetary Fund."

default_wt = nltk.word_tokenize  # tokenizer
words = default_wt(sentence)
print(words)
postags = nltk.pos_tag(words)
print(postags)

with open("%s/NER_predict.data" % dir, 'w', encoding='utf-8') as f:
    for item in postags:
        f.write(item[0]+' '+item[1]+' O\n')

print("write successfully!")

os.chdir(data_dir)
os.system("crf_test -m model NER_predict.data > predict.txt")
print("get predict file!")

# read the prediction file predict.txt
with open("%s/predict.txt" % data_dir, 'r', encoding='utf-8') as f:
    sents = [line.strip() for line in f.readlines() if line.strip()]

tokens = []
predicts = []

for sent in sents:
    fields = sent.split()
    tokens.append(fields[0])
    predicts.append(fields[-1])

# drop the tokens whose NER tag is O
ner_reg_list = []
for word, tag in zip(tokens, predicts):
    if tag != 'O':
        ner_reg_list.append((word, tag))

# print the model's NER results
print("NER results:")
if ner_reg_list:
    ner_type_dict = {'PER': 'PERSON: ',
                     'LOC': 'LOCATION: ',
                     'ORG': 'ORGANIZATION: ',
                     'MISC': 'MISC: '
                    }
    for i, item in enumerate(ner_reg_list):
        if item[1].startswith('B'):
            # extend the entity span over the following I- tagged tokens
            end = i+1
            while end <= len(ner_reg_list)-1 and ner_reg_list[end][1].startswith('I'):
                end += 1

            ner_type = item[1].split('-')[1]
            print(ner_type_dict[ner_type], ' '.join([w for w, _ in ner_reg_list[i:end]]))