CRFs and LSTMs in Sequence Labeling Tasks

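This post first briefly introduces a classic model for sequence labeling, then uses entity recognition in medical text as an example to show how CRFs and LSTMs are applied.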

I. A classic model for sequence labeling
Reference paper: Neural Architectures for Named Entity Recognition

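The model combines character-based and word-based embeddings. Character embeddings are randomly initialized, and a BiLSTM is run over the characters of each word: the final output of the forward LSTM reflects the word's suffix, while the final output of the backward LSTM reflects its prefix. The word embedding comes from pre-trained vectors. The final forward hidden state, the final backward hidden state, and the word embedding are concatenated and passed through a dropout layer; the result is fed to the sentence-level BiLSTM, whose outputs go to the CRF layer.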

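II. CRF versus LSTM
1. CRFs tend to do relatively well on small datasets, while deep models such as LSTMs can beat CRFs when enough data is available;
2. CRFs and LSTMs are often used together, and this combination has become the mainstream architecture;
3. A common LSTM failure in sequence labeling is emitting an end marker at some position with no start marker before it. Putting a CRF on top of the LSTM addresses this.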

III. Entity recognition in medical text
1. CRF model
The data features are shown in the figure below

The "B" column holds each character's "basic tag", which sorts characters into 7 types, defined as follows

import string
import re
# Tags for the basic information

def basicTags(word):
    punStr = string.punctuation + '＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､\u3000、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·！？｡。'
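    engReg = r'[A-Za-z]'
    # Check the ASCII and the full-width percent sign (the second literal is
    # assumed to have been the full-width form; the post shows '%' twice).
    if '%' in word or '％' in word:
        return 'PERC'
    elif re.match(r'[0-9]', word):
        return 'NUM'
    # '@' apparently stands in for whitespace. It must be tested before the
    # punctuation check, because string.punctuation also contains '@'.
    elif word == '@':
        return 'SPA'
    elif word in punStr:
        return 'PUNC'
    elif '\u4e00' <= word <= '\u9fff':
        return 'CHN'
    elif re.match(engReg, word):
        return 'ENG'
    else:
        return 'OTHER'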

The "D" column is generated from a medical dictionary

def matrixPreparing(matrix):
    matrix.sort(key = lambda x:len(x))
    return matrix[::-1]

# Medical Dictionary Tag
def dictTags(word):
    units = 'kBq kbq mg Mg UG Ug ug MG ml ML Ml GM iu IU u U g G l L cm CM mm s S T % % mol mml mmol MMOL HP hp mmHg umol ng'.split(' ')
    chn_units = '毫升 毫克 單位 升 克 第 粒 顆粒 支 件 散 丸 瓶 袋 板 盒 合 包 貼 張 泡 國際單位 萬 特充 個 分 次'.split(' ')
    med_units = 'qd bid tid qid qh q2h q4h q6h qn qod biw hs am pm St DC prn sos ac pc gtt IM IV po iH'.split(' ')
    all_units = units + chn_units + med_units

    site_units = '上 下 左 右 間 片 部 內 外 前 側 後'.split(' ')
    sym_units = '大 小 增 減 多 少 升 降 高 低 寬 厚 粗 兩 雙 延 長 短 疼 痛 終 炎 咳'.split(' ')
    part_units = '腦 心 肝 脾 肺 腎 胸 髒 口 腹 膽 眼 耳 鼻 頸 手 足 腳 指 壁 膜 管 竇 室 管 髖 頭 骨 膝 肘 肢 腰 背 脊 腿 莖 囊 精 脣 咽'.split(' ')
    break_units = '呈 示 見 伴 的 因'.split(' ')
    more_units = '較 稍 約 頻 偶 偏'.split(' ')
    non_units = '無 不 非 未 否'.split(' ')
    tr_units = '服 予 行'.split(' ')

    all_units = matrixPreparing(all_units)
    units = matrixPreparing(units)
    chn_units = matrixPreparing(chn_units)
    med_units = matrixPreparing(med_units)

    if word in units:
        return 'UNIT'
    elif word in chn_units:
        return 'CHN_UNIT'
    elif word in med_units:
        return 'MED_UNIT'
    elif word in site_units:
        return 'SITE_UNIT'
    elif word in sym_units:
        return 'SYM_UNIT'
    elif word in part_units:
        return 'PART_UNIT'
    elif word in break_units:
        return 'BREAK_UNIT'
    elif word in more_units:
        return 'more_UNIT'
    elif word in non_units:
        return 'NON_UNIT'
    elif word in tr_units:
        return 'TR_UNIT'
    else:
        return 'OTHER'
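
The helper matrixPreparing orders each unit list from longest to shortest, which is the usual preparation for longest-match lookup. The sketch below illustrates that idea only; the function name `longest_match_tags` and the toy lexicon are assumptions made here, and the code above does not do this (it only tests a single token with `in`).

```python
def longest_match_tags(text, lexicon, tag):
    """Tag every character covered by a lexicon entry, preferring longer entries.

    `lexicon` and `tag` are illustrative assumptions, not names from the post.
    Returns one tag per character; uncovered characters get 'OTHER'.
    """
    # Longest entries first, so 'mmol' wins over 'mol' or 'l'.
    entries = sorted(lexicon, key=len, reverse=True)
    tags = ['OTHER'] * len(text)
    i = 0
    while i < len(text):
        for entry in entries:
            if entry and text.startswith(entry, i):
                for j in range(i, i + len(entry)):
                    tags[j] = tag
                i += len(entry)
                break
        else:
            i += 1  # no entry starts here; move on one character
    return tags


toy_units = ['mol', 'mmol', 'l', 'mg']
print(longest_match_tags('5mmol', toy_units, 'UNIT'))
```

Without the longest-first ordering, 'mol' or 'l' could match inside 'mmol' and split the unit into pieces.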

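The "R" column holds the radical of each Chinese character. It comes from two sources. One is a radical table, shown in the figure below

The other is a Baidu API. The class that computes the radical is as follows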

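import pandas as pd
import requests
from bs4 import BeautifulSoup

# baiduhanyu_url is not shown in the original post. It is expected to be a
# module-level format string with one %s placeholder for the character.

class Radical(object):
    def __init__(self, rootPath):
        self.dictionary_filepath = rootPath + 'sources/xinhua.csv'
        self.dictionary = pd.read_csv(self.dictionary_filepath)
        self.baiduhanyu_url = baiduhanyu_url

    def get_radical(self, word):
        # Look the character up in the local table first, then fall back to the web.
        if word in self.dictionary.char.values:
            return self.dictionary[self.dictionary.char == word].radical.values[0]
        else:
            return self.get_radical_from_baiduhanyu(word)

    def get_radical_from_baiduhanyu(self, word):
        url = self.baiduhanyu_url % word
        try:
            r = requests.get(url)
            # r.content is bytes; decode it directly (str(...).decode fails on Python 3).
            html = r.content.decode("utf-8")
        except Exception as e:
            print('URL Request Error:', e)
            return None

        soup = BeautifulSoup(html, 'html.parser')
        li = soup.find(id="radical")
        # Guard against pages that have no radical entry.
        if li is None or li.span is None or not li.span.contents:
            return None
        radical = str(li.span.contents[0])

        # Cache the new entry in the local table so the next lookup stays offline.
        # DataFrame.append was removed in pandas 2.0, so use pd.concat.
        new_row = pd.DataFrame([{'char': word, 'radical': radical}])
        self.dictionary = pd.concat([self.dictionary, new_row], ignore_index=True)
        self.dictionary.to_csv(self.dictionary_filepath, encoding='utf-8', index=False)

        return radical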

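The "P" column holds the part-of-speech tags. For the tag codes, see the part-of-speech code table.
We call jieba's part-of-speech tagger posseg. Given a sentence, the function below returns its part-of-speech tags, one per character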

from jieba import posseg as ppseg

def getPOSTagsList(text):
    # One [index, character, tag] entry per character. The first character of
    # each word gets the '-B' suffix and the remaining ones get '-I'.
    POSTagsList = []
    offset = 0  # index of the current word's first character in the sentence
    for p in ppseg.cut(text):
        for i, ch in enumerate(p.word):
            suffix = '-B' if i == 0 else '-I'
            POSTagsList.append([offset + i, ch, p.flag + suffix])
        offset += len(p.word)

    return POSTagsList

>>> text = u"美國總統特朗普宣佈對中國的貿易戰正式生效"
>>> for i in getPOSTagsList(text):
...     print(str(i[0]) + "\t" + i[1] + "\t" + i[2])
0   美   ns-B
1   國   ns-I
2   總   n-B
3   統   n-I
4   特   nr-B
5   朗   nr-I
6   普   nr-I
7   宣   v-B
8   布   v-I
9   對   p-B
10  中   ns-B
11  國   ns-I
12  的   uj-B
13  貿   nz-B
14  易   nz-I
15  戰   nz-I
16  正   ad-B
17  式   ad-I
18  生   n-B
19  效   n-I

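The "R" column is a simplified version of the "E" column: it marks only whether a character belongs to an entity, without distinguishing entity types.
The "E" column is the label column. Entities fall into five types, "Sy", "Bo", "Ch", "Tr", and "Di"; adding the BIO markers to these gives the E column.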
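
As a rough illustration of how these two columns relate, the sketch below builds an E column from entity spans and then derives a simplified column from it. The span format `(start, end, type)` with an exclusive end, the function names, and the choice to keep the B/I/O prefix in the simplified column are all assumptions made for this example.

```python
def spans_to_bio(text, spans):
    """Turn entity spans into one BIO tag per character.

    `spans` is a list of (start, end, type) with `end` exclusive; this input
    format is an assumption made for the sketch.
    """
    tags = ['O'] * len(text)
    for start, end, etype in spans:
        tags[start] = 'B-' + etype
        for i in range(start + 1, end):
            tags[i] = 'I-' + etype
    return tags


def simplify(tags):
    """Drop the entity type, keeping only the B/I/O part."""
    return [t.split('-')[0] for t in tags]


text = 'abcdefg'  # placeholder sentence of 7 characters
e_col = spans_to_bio(text, [(1, 3, 'Sy'), (4, 5, 'Bo')])
r_col = simplify(e_col)
print(e_col)  # ['O', 'B-Sy', 'I-Sy', 'O', 'B-Bo', 'O', 'O']
print(r_col)  # ['O', 'B', 'I', 'O', 'B', 'O', 'O']
```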

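Now the CRF model can be built. The paper Named Entity Recognition for Chinese Clinical Texts by CRF Models uses two CRF architectures, a single-layer one and a two-layer one.

The single-layer model in the paper uses the four columns "A", "B", "D", and "P" to predict the label.

The two-layer CRF consists of two sub-models: model1 uses "A", "B", and "P" to predict the "R" column, and model2 uses "A", "B", "P", and "R" to predict the "E" column. model1 and model2 are trained independently. At inference time, model1 first predicts "R", and model2 then predicts "E".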
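
The two-stage inference can be sketched roughly as below. The two models are stand-in callables so that the example runs without a CRF toolkit; in practice they would wrap two trained CRFs. The helper names, the toy rules inside the stand-ins, and the reading of "A" as the character itself are assumptions made for this sketch.

```python
def add_column(rows, name, values):
    """Return new rows with an extra feature column."""
    return [dict(row, **{name: v}) for row, v in zip(rows, values)]


def two_stage_predict(rows, model1, model2):
    """Run model1 to get 'R', append it as a feature, then run model2 for 'E'.

    `model1` and `model2` are any callables that map a list of feature dicts
    to a list of tags.
    """
    r_pred = model1(rows)
    rows_with_r = add_column(rows, 'R', r_pred)
    return model2(rows_with_r)


# Stand-in models (pure assumptions, not trained CRFs).
def toy_model1(rows):
    return ['B' if row['B'] == 'CHN' else 'O' for row in rows]


def toy_model2(rows):
    return ['B-Sy' if row['R'] == 'B' else 'O' for row in rows]


rows = [
    {'A': 'x', 'B': 'ENG', 'P': 'eng-B'},
    {'A': '痛', 'B': 'CHN', 'P': 'v-B'},
]
print(two_stage_predict(rows, toy_model1, toy_model2))  # ['O', 'B-Sy']
```

Because model2 only ever sees predicted "R" values at inference time, any mistake made by model1 is passed on to model2.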

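2. BiLSTM-CRF
We want to model $p (y | x) = \frac{p (x | y) \cdot p (y)}{p (x)}$. Instead of the generative term $p (x | y)$, we plug in an estimate of $p (y | x)$. This is acceptable because CRF feature functions are quite flexible: we can define them however we like, as long as they are meaningful.
We fit $p (y | x)$ with a BiLSTM network and treat its output as the features of the CRF on top.
The BiLSTM-CRF architecture is shown in the figure below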

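Suppose there are only two entity types, person and organization. The input sentence $x$ has 5 words, namely $w_{0}, w_{1}, w_{2}, w_{3}, w_{4}$.
The inputs $w_{0}, w_{1}, w_{2}, w_{3}, w_{4}$ consist of character embeddings and word embeddings. The character embeddings are randomly initialized, while the word embeddings come from pre-trained word vectors. All embeddings are updated during training.
So the input of the BiLSTM-CRF is embeddings and its output is tags.
Now look at the BiLSTM at the bottom. What does it output?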

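The figure above shows that the BiLSTM outputs a score for each label.
It looks as if the BiLSTM already gives us what we want, so why add a CRF layer?
In the example above the BiLSTM output happens to be the correct tagging, but that is a coincidence. Often we are not so lucky, as with the BiLSTM output in the figure below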

In this case the BiLSTM output is not the correct answer.
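
To make this concrete, the sketch below picks the highest-scoring label at each position independently, which is what reading tags straight off the BiLSTM scores amounts to. The scores are made-up numbers chosen for illustration, not values from the figure.

```python
LABELS = ['B-Person', 'I-Person', 'B-Organization', 'I-Organization', 'O']


def greedy_decode(emissions):
    """Pick the highest-scoring label at each position independently."""
    return [LABELS[max(range(len(LABELS)), key=row.__getitem__)] for row in emissions]


# One row per word w0..w4, one score per label (made-up numbers).
emissions = [
    [0.2, 0.9, 0.1, 0.3, 0.05],  # w0: I-Person scores highest
    [0.1, 0.3, 0.2, 0.8, 0.05],  # w1: I-Organization scores highest
    [0.1, 0.1, 0.1, 0.1, 0.9],   # w2: O scores highest
    [0.1, 0.2, 0.3, 0.8, 0.1],   # w3: I-Organization scores highest
    [0.1, 0.1, 0.1, 0.1, 0.7],   # w4: O scores highest
]
print(greedy_decode(emissions))
```

The result starts with "I-Person" followed by "I-Organization", a sequence that no valid BIO annotation would contain. Nothing in a per-position argmax prevents it.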

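The CRF layer can learn constraints from the training data
The CRF can add constraints to the tag prediction, including
1) the tag of the first word in a sentence should start with "B-" or "O", not "I-";
2) a tag sequence should have the form "B-label I-label I-label I-…", with "B-" first and "I-" after it;
3) "O I-" is invalid;
and so on.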
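
The constraints listed above can be written as a small checker. This is only a sketch for inspecting a predicted tag sequence; the function name `bio_violations` is an assumption, and a CRF learns these preferences as transition scores rather than applying hard rules like this.

```python
def bio_violations(tags):
    """Return (position, message) pairs for tags that break the BIO scheme."""
    problems = []
    prev = 'O'
    for i, tag in enumerate(tags):
        if tag.startswith('I-'):
            label = tag[2:]
            # An I- tag must continue a B- or I- tag with the same label.
            if prev not in ('B-' + label, 'I-' + label):
                before = prev if i else 'start of sentence'
                problems.append((i, '%s cannot follow %s' % (tag, before)))
        prev = tag
    return problems


print(bio_violations(['B-Person', 'I-Person', 'O', 'B-Organization', 'I-Organization']))
print(bio_violations(['I-Person', 'O', 'I-Organization']))
```

The first call returns an empty list, while the second reports positions 0 and 2, matching constraints 1) and 3).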
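The CRF can do this, which greatly reduces the number of invalid tag sequences. So why can't the LSTM?
The strength of the CRF is that it learns the dependencies between tags. The LSTM does take earlier time steps into account through its hidden state, but the tag at each position is still chosen from that position's own scores, so nothing in the output ties neighboring tags together. Perhaps the dependence between tags that the LSTM picks up this way is simply not strong enough.
So how does the CRF do it?
Let us look at the CRF. It has two core components, the emission score and the transition score.
The emission scores come from the BiLSTM.
We write $t_{y_{1} y_{2}}$ for the transition score from tag $y_{1}$ to tag $y_{2}$. Computing the transition score between every pair of tags gives a transition matrix, as shown in the figure below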
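
A minimal sketch of how the two kinds of score combine, under the usual linear-chain CRF definitions: the score of a tag path is the sum of its emission scores plus the transition scores between consecutive tags, and $p (y | x)$ is $\exp$ of that score divided by the sum of $\exp$ over all paths. The numbers are made up, and START/END tags are left out to keep the example short; the function names are assumptions, not the post's code.

```python
import math


def path_score(emissions, transitions, path):
    """Sum of emission scores plus transition scores along one tag path."""
    score = emissions[0][path[0]]
    for i in range(1, len(path)):
        score += transitions[path[i - 1]][path[i]] + emissions[i][path[i]]
    return score


def _logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_partition(emissions, transitions):
    """log of the sum of exp(score) over all paths (forward algorithm)."""
    n_tags = len(emissions[0])
    alpha = list(emissions[0])
    for row in emissions[1:]:
        alpha = [
            row[j] + _logsumexp([alpha[k] + transitions[k][j] for k in range(n_tags)])
            for j in range(n_tags)
        ]
    return _logsumexp(alpha)


def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score."""
    n_tags = len(emissions[0])
    best = list(emissions[0])
    back = []
    for row in emissions[1:]:
        pointers, new_best = [], []
        for j in range(n_tags):
            k = max(range(n_tags), key=lambda k: best[k] + transitions[k][j])
            pointers.append(k)
            new_best.append(best[k] + transitions[k][j] + row[j])
        back.append(pointers)
        best = new_best
    last = max(range(n_tags), key=best.__getitem__)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


TAGS = ['B-Person', 'I-Person', 'O']
# transitions[i][j] is the score of moving from tag i to tag j (made-up values).
transitions = [
    [-1.0, 1.0, 0.0],    # from B-Person
    [-1.0, 0.5, 0.0],    # from I-Person
    [0.0, -10.0, 0.0],   # from O: "O I-" is heavily penalized
]
# Per-position argmax of these emissions would give O, I-Person, O.
emissions = [
    [1.0, 0.2, 1.2],
    [0.1, 2.0, 0.3],
    [0.1, 0.2, 1.0],
]
path, score = viterbi(emissions, transitions)
print([TAGS[i] for i in path], score)
print(math.exp(score - log_partition(emissions, transitions)))  # p(y | x) of the best path
```

With these numbers the per-position argmax would output the invalid "O I-Person O", while the best path under the combined score is "B-Person I-Person O": the large negative transition score for "O" to "I-Person" outweighs the slightly higher emission score for "O" at the first word.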

References:
https://github.com/floydluo/cctner

https://createmomo.github.io/2017/09/12/CRF_Layer_on_the_Top_of_BiLSTM_1/#more
