Notes on "Named Entity Recognition in Chinese Clinical Text Using Deep Neural Network"

Abstract

We investigated a novel deep learning method to recognize clinical entities in Chinese clinical documents using the minimal feature engineering approach.

We developed a deep neural network (DNN) to generate word embeddings from a large unlabeled corpus through unsupervised learning and another DNN for the NER task.

Two DNNs are used: the first generates the word embeddings, and the second performs entity recognition.

Introduction

The introduction covers the practical value of electronic health records (EHRs) and the entity recognition problems they raise.

Many existing clinical NLP systems use dictionaries and rule-based methods to identify clinical concepts, such as MedLEE, MetaMap, and cTAKES.

More recently, a number of challenges on NER involving shared tasks in clinical text have been organized, including the 2009 i2b2, the 2010 i2b2, the 2013 ShARe/CLEF challenge and the 2014 Semantic Evaluation challenge. (Worth a closer look when I have time.)

Conventional ML-based methods have been applied to Chinese clinical NER tasks.

In summary, current efforts on NER in Chinese clinical text primarily focus on investigating different machine learning algorithms or optimizing combinations of different types of features via human engineering.

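Interest in deep-learning-based NLP systems has been growing recently. Such systems can learn useful feature representations from large unlabeled corpora in an unsupervised way. Deep learning is an area of machine learning research that learns high-level feature representations through deep neural networks. It has achieved state-of-the-art performance in image processing, automatic speech recognition, and machine translation. NLP researchers have developed DNNs that learn useful features from large amounts of unlabeled data, so there is no longer a need to spend a lot of time searching for task-specific features. Dr. Ronan Collobert's system achieved state-of-the-art performance on many NLP tasks with a single deep neural network.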

This paper is the first to apply DNNs to NER in Chinese clinical records, and it compares them against the conventional CRF method.

Methods

Datasets

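Two datasets were used. The first is the annotated dataset from an earlier study by Lei (Jianbo Lei) et al. It contains 400 admission notes randomly selected from the EHR database of Peking Union Medical College Hospital. Each admission note is annotated with four entity types: problem, lab test, procedure, and medication. The counts are given in the table below.
[Table: Jianbo Lei's annotated dataset]
The second dataset, also from Peking Union Medical College Hospital, contains 36,828 unannotated admission notes and is used to learn the character embeddings. The embedding matrix is trained on individual Chinese characters, so preprocessing requires no word segmentation.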
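
To make the "no word segmentation" point concrete, here is a minimal sketch of character-level indexing, the step that would come before training an embedding matrix. The function names, the special tokens `<PAD>` and `<UNK>`, and the sample sentences are all hypothetical; the paper's actual preprocessing is not described in these notes.

```python
from collections import Counter

def build_char_vocab(documents, min_count=1, unk="<UNK>", pad="<PAD>"):
    """Map every character seen at least min_count times to an integer id."""
    counts = Counter(ch for doc in documents for ch in doc if not ch.isspace())
    vocab = {pad: 0, unk: 1}
    # Most frequent characters first; ties are broken by code point.
    for ch, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            vocab[ch] = len(vocab)
    return vocab

def encode(text, vocab, unk="<UNK>"):
    """Turn a string into character ids; each character is one token."""
    return [vocab.get(ch, vocab[unk]) for ch in text if not ch.isspace()]

# Made-up example sentences, not taken from the dataset.
docs = ["患者發熱三天", "患者咳嗽"]
vocab = build_char_vocab(docs)
ids = encode("患者頭痛", vocab)  # unseen characters map to the <UNK> id
```

Each id would then select one row of the embedding matrix, so the network sees one vector per character rather than per segmented word.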

Experiments and evaluation

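Three NER methods were compared:

  1. The conventional CRF-based NER method; see Lei's paper.
  2. A DNN-based NER method with a randomly initialized character embedding matrix.
  3. A DNN-based NER method with a character embedding matrix derived from the unlabeled corpus.

All scores were calculated using the official evaluation script of the CoNLL-2000 challenge. The Wilcoxon signed-rank test was used to test the statistical significance of the difference between two classifiers.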
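
The CoNLL-style evaluation scores whole entities rather than individual tokens: a predicted entity counts as correct only if both its boundaries and its type match the gold annotation. The sketch below illustrates that idea on BIO tags. It is a simplified stand-in, not the official script, and the tag names and example sequences are assumptions made for illustration.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        # A B- tag always opens a span; a stray I- tag opens one too.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return entities

def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level scores: a hit needs exact boundaries and the same type."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = ["B-problem", "I-problem", "O", "B-medication", "O"]
pred = ["B-problem", "I-problem", "O", "B-procedure", "O"]
scores = precision_recall_f1(gold, pred)
```

Here one of the two predicted entities matches exactly, so precision, recall, and F1 are all 0.5. In a real evaluation the counts would be accumulated over every sentence in the test set before the ratios are taken.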

CRF vs. DNN Approaches

CRF-based NER

The CRF model solves the sequence labeling problem with an undirected Markov chain and the Viterbi algorithm; its training criterion is to maximize the likelihood of the conditional probability of the output variable y given the observation x.
CRFs were intrinsically designed for sequence labeling, as they model the relationships between neighboring tokens in a sequence.
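
As an illustration of the decoding step, below is a minimal Viterbi sketch for a linear-chain model. It assumes the per-position label scores (`emissions`) and label-to-label scores (`transitions`) are already available; in a trained CRF they would come from weighted features. The numbers are arbitrary and the function name is hypothetical.

```python
def viterbi(emissions, transitions):
    """Find the best label path.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, max(score)

emissions = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
transitions = [[0.5, -1.0], [-1.0, 0.5]]
best_path, best_score = viterbi(emissions, transitions)
```

With these scores the transition bonus for staying in label 0 outweighs the emission preference for label 1 at the middle position, which is the kind of neighbor dependency the paragraph above refers to.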

DNN-based NER

The system architecture is shown in Figure 1 (the sentence approach DNN).
In this research, we adopted one of the popular architectures from Dr. Ronan Collobert – the sentence level log-likelihood approach [23], which consists of a convolutional layer, a non-linear layer using the hard version of the hyperbolic tangent (HardTanh), and several linear layers.
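The paper then gives the formula for each layer and explains what it means, discusses the advantages of the embedding approach, and describes how the learning rate and other parameters were chosen.
(I won't go into the algorithm's details here, since I don't fully understand them yet; more on that later.)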
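
Two of the pieces named above can still be sketched without the full network: the HardTanh non-linearity, and a sentence-level log-likelihood in the style of Collobert et al., where a tag path is scored by per-position network outputs plus tag-transition scores and normalized over all possible paths. This is a rough sketch under assumptions: `emissions` stands in for the network's outputs, the initial-tag score is omitted, and all names are hypothetical rather than taken from the paper.

```python
import math

def hard_tanh(x):
    """HardTanh: -1 below -1, 1 above 1, identity in between."""
    return max(-1.0, min(1.0, x))

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of scores."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(emissions, transitions, path):
    """Sum of network scores and transition scores along one tag path."""
    s = emissions[0][path[0]]
    for t in range(1, len(path)):
        s += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return s

def sentence_log_likelihood(emissions, transitions, path):
    """log p(path | sentence) = score(path) - log-sum-exp over all paths."""
    n = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        # Forward recursion: log-sum-exp of all paths ending in tag j at t.
        alpha = [logsumexp([alpha[i] + transitions[i][j] for i in range(n)])
                 + emissions[t][j] for j in range(n)]
    return path_score(emissions, transitions, path) - logsumexp(alpha)
```

Training would maximize this quantity for the gold path of each sentence. Because the normalizer covers every path, the probabilities of all tag sequences for a sentence sum to one.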

Results

The results are shown in the figure below; I won't go into detail.
[Figure: results]

Discussion

(To be completed)

Conclusion

(To be completed)
