Haodf Online's Exploration of Applied AI: Structuring Medical Report Images

> Every day, the online consultations on Haodf Online include large numbers of photos of hospital reports, lab test sheets, and the like. Recognizing these reports and structuring their data became a major challenge for us: as long as the reports sit on our servers only as images, their value to the consultation is lost, and they cannot give the consulting physician more accurate reference information. This article walks through how we tackled the problem, from the actual business requirements to the technical challenges to the various algorithms we tried.

## 1. Project Background and Challenges

### 1.1 Business Requirements

Users upload tens of thousands of images to the Haodf site every day, most of them medical test reports (complete blood count, liver function, and so on). These images contain a wealth of valuable information, but because photos are not structured data, they cannot support indexing, retrieval, or other advanced features.

On one hand, physicians want the key values from a report transcribed into the medical record so they can track the patient's condition more clearly, but doing this by hand is tedious. On the other hand, as long as the specific values cannot be extracted, this medical data cannot be put to use at all.

The "medical report structuring" project was created against this backdrop. It requires an algorithm that reads the text in a patient's uploaded report and assembles it into a table for storage and downstream use (Figure 1.1).

![Figure 1.1](https://static001.geekbang.org/wechat/images/7d/7dca5743b141406a0913dfb4c7a7ac4a.png)

### 1.2 Overview of the Techniques Used

All of the techniques in this project belong to artificial intelligence. Figure 1.2 shows three major branches of AI and the main sub-branches of each.

![Figure 1.2](https://static001.geekbang.org/wechat/images/af/af41171235833c9d69f87aa066ca51b2.png)

The project draws on two of these branches at once: image processing and natural language processing.

First, OCR (Optical Character Recognition) converts the image into text. OCR is a branch of image processing and typically consists of two stages, text region detection and text recognition.

Second, text classification assigns each report to a category such as complete blood count or liver function; this is an important branch of NLP (Natural Language Processing).

Finally, NER (Named Entity Recognition) extracts fields such as "test item name", "item value", and "reference range" from the recognized text; NER is likewise an important branch of NLP.

### 1.3 Challenges

Accuracy. As Figure 1.3 shows, user-uploaded report photos are heavily degraded: noise, skewed text, even curved text, all of which severely hurt recognition accuracy. To gauge the difficulty, we assembled the "Haodf Online Medical Report Text-Line Dataset v1.0" (Figure 1.4): nearly 2,000 text-line slices cut from report images and labeled by hand. When we fed this dataset to the general-purpose OCR APIs of Baidu and Tencent, neither reached 85% whole-string accuracy (detailed results in Section 2.2), far short of our target, which also gives a sense of how hard the problem is.

![Figure 1.3](https://static001.geekbang.org/wechat/images/8a/8ac88e1fedb3e755b7931fa93d806d30.jpeg)

![Figure 1.4](https://static001.geekbang.org/wechat/images/d1/d132fb1061d23e4762d16627e9b0227a.png)

Throughput. Roughly 60,000 reports need to be processed per day; to finish within the day, the average processing time per image must stay under 1.44 seconds, or a backlog quickly builds up and hurts the user experience. The Baidu and Tencent general OCR APIs each take more than 7 seconds per report, nowhere near fast enough for our needs.

Labeled data. Our servers store hundreds of millions of report photos, but none of them are labeled. Improving an AI model takes large amounts of labeled data, and the higher the data quality, the better the model performs, but labeling costs a great deal of manual effort.

Hardware. For cost-effectiveness reasons, we ruled out algorithms with very heavy runtime loads.

## 2. Techniques Applied and Their Difficulties

The project is written mainly in Python; for efficiency, some modules are written in C++ and compiled into shared libraries called from Python. In our test environment, the pipeline processes one report image in 1.04 seconds on average. Figure 2.1 shows the overall block diagram:

![Figure 2.1](https://static001.geekbang.org/wechat/images/b5/b50726f6eeebe083f5a75475cbf46bc0.png)

The pipeline has five stages: 1) text region detection; 2) text recognition; 3) report classification; 4) named entity recognition; 5) structured content extraction.

### 2.1 Text Region Detection

Text region detection cuts each line of text in the image out into individual text-line slices, as illustrated in Figure 2.2. Popular detection algorithms fall into two families, regression-based methods and segmentation-based methods:

![Figure 2.2](https://static001.geekbang.org/wechat/images/95/9511beb15ef87ff4fd90093967cb6e53.jpeg)

The first version was adapted from an open-source project built around YOLO [1] for text region detection plus CRNN [2] for text recognition. YOLO detection is a regression-based method. On ordinary medical reports its detection results were quite satisfactory, no worse than some of the paid public OCR APIs on the market.

Typical text detectors treat isolated characters such as '+' and '-' as noise, and our YOLO detector shared this tendency. That prior works on the majority of medical reports, but a small fraction of report categories (urinalysis, stool tests, and so on) do contain these characters, and there they carry critical meaning: '+' usually means positive and '-' negative (Figure 2.3). To handle those categories properly, we still needed a better detector, one able to pick up these inconspicuous little characters.

![Figure 2.3](https://static001.geekbang.org/wechat/images/f3/f3c24e8e90d95c83f8092d9fdf99416e.png)

That is when CRAFT [3], proposed in a CVPR 2019 paper, entered our field of view. CRAFT is a segmentation-based detector. It is worth spelling out first how CRAFT and YOLO differ on the text region detection task.

YOLO covers detected text with small fixed-width bounding boxes. It first extracts high-dimensional image features with a convolutional network, then regresses candidate boxes for text regions, and finally filters the candidates with non-maximum suppression and a recurrent network. Because of the recurrent network, the algorithm effectively treats detection as a sequence-processing task. That works very well for medium and long runs of text, but an isolated '+' or '-' in a report has no surrounding characters, so sequence processing actually degrades detection of such characters.

CRAFT does not treat detection as a sequence task; it treats it as image segmentation, so it handles text of any length uniformly. CRAFT classifies every pixel in the image into two classes, text-region pixels and non-text pixels, which makes it much better at finding isolated '+' and '-' characters.

For these reasons, we replaced the original YOLO detector with CRAFT.

Once the algorithm was chosen, a bigger problem appeared: we had no labeled data. Training CRAFT requires a manually drawn bounding box around every single character. We estimated that labeling one report image takes about 25 minutes, and that a satisfactory dataset would cost more than 1,000 person-days of work. The price was simply too high.

Work by Gupta at Oxford [4] then caught our attention: Gupta argues that entirely synthetic text images are enough to train an excellent text region detector. Following the method in [4], we built our own synthetic report generator (for reasons of space, I will cover it in detail in a future post), which can produce a complete report that is almost indistinguishable from a real one. Figure 2.4 shows a synthetic report before a background is added; compositing the report onto manually chosen backgrounds (desktops and the like) yields the final training data.

![Figure 2.4](https://static001.geekbang.org/wechat/images/a3/a393e198edd594fb754c7efa6eb00fb8.png)

The CRAFT model trained this way greatly improved detection of isolated '+' and '-' characters, with no regression on other characters. Figure 2.5 compares the old and new detectors:

![Figure 2.5](https://static001.geekbang.org/wechat/images/d1/d119f454f47b6eb236316828047de52a.png)

Because skewed and curved text exists, the detected text-line images then need rectification. Drawing on [5] and [6], we designed a text rectification algorithm for this task (again, I will describe its details and implementation in a later post). Figures 2.6, 2.7, and 2.8 show its effect on skewed and curved text:

![Figure 2.6](https://static001.geekbang.org/wechat/images/3f/3f9ab1a86dffd790a3e7635d741df460.png)

![Figure 2.7](https://static001.geekbang.org/wechat/images/4d/4d9f140f2d3471948370451517492613.png)

![Figure 2.8](https://static001.geekbang.org/wechat/images/6c/6c0de7e07a1bc0b00b30376b67738374.png)

### 2.2 Text Recognition

Text recognition converts the cut-out text-line slices into characters. The project has used CRNN [2] for this throughout, trained on images automatically generated from manually collected report fields. We also tried DAN [7], an attention-based recognizer, but our machines could not sustain its load.

On the "Haodf Online Medical Report Text-Line Dataset v1.0" we compared three systems: our own model, Baidu OCR, and Tencent OCR (Tables 2.1 and 2.2). To keep the comparison fair, whenever the Baidu or Tencent cloud API failed to detect the correct text region in an image, we skipped that image instead of counting it as a recognition error. Table 2.3 shows, per category, the fraction of images on which each cloud API's region detection failed:

![Tables 2.1-2.3](https://static001.geekbang.org/wechat/images/97/97017d62584f30ce410bf68a15d739cc.png)

### 2.3 Report Classification

Report classification is at heart a text classification task.

The module's first version was a keyword method we designed together with the product manager: extract keywords from the detected text and classify the report by their weighted score. We call it the "weighted-keyword classifier". It is reasonably accurate in the common case, and early in the project it supplied us with fairly high-quality classified report images. But because its parameters are set by hand, it is very sensitive to noise. For example, if a complete blood count report happens to contain a heavily weighted phrase such as "liver function" in its header, the classifier is easily fooled into calling it a biochemistry / liver-kidney panel. So we still needed a more accurate, faster replacement.

We surveyed many algorithms for this module. We dropped BERT [8], the most accurate, mainly because it runs too slowly and uses too much memory, and in the end chose FastText [9].

FastText has several advantages:

1. It is fast: on a 10th-generation i7 CPU it classifies the text of one image in about 15 ms on average.
2. With only two fully connected layers, its memory footprint is very small.
3. Its accuracy is high: on the THUCTC dataset [10] it beats the widely used TextCNN [11] and TextRNN [12] algorithms, though it trails BERT [8]. See [16] for the experimental results.

Thanks to the earlier weighted-keyword classifier, we could obtain a reasonably trustworthy classified dataset of report images. After filtering it manually and labeling 5,000-odd images, we could train FastText.

In our experiments, the trained FastText model reached 92.2% accuracy on the detected text of nearly 500 randomly selected report images, versus 75.2% for the weighted-keyword classifier.

### 2.4 Named Entity Recognition

NER extracts particular classes of terms from a sentence. As Figure 2.9 shows, our NER module has to pull fields such as "test item name", "item value", and "reference range" out of report text.

![Figure 2.9](https://static001.geekbang.org/wechat/images/69/69460d7e6afc5eb6e685c4ed3ba97a07.png)

NER is a classic NLP problem. Early academic work favored the linear-chain conditional random field (CRF) [13]. A linear-chain CRF treats every character of the text as a node in a chain, with adjacent characters as adjacent nodes, and assumes that any node's state depends only on its neighbors' states. In real language, however, the link between characters often spans several characters or even several sentences, and a linear-chain CRF clearly cannot model these long-range relations well.

In recent years the field turned to deep learning for NER. The LSTM [14] began to be used for the task: in theory it can carry character-to-character dependencies arbitrarily far, which further improved NER performance. However, Zhiheng Huang of Baidu Research found that while the LSTM strengthens a model's perception of medium- and long-range dependencies between characters, it is weaker than the linear-chain CRF at very short range. He therefore attached a linear-chain CRF after a bidirectional LSTM, producing the stronger BiLSTM-CRF [15].

In 2018 a landmark algorithm, BERT [8], arrived and posted striking results on eleven NLP tasks. For both the text classification task above and the NER task of this section, BERT is the undisputed leader.

The company has a doctor-patient dialogue NER dataset with nearly one million labeled characters. On it we tested three models: 1) BiLSTM-CRF; 2) BERT-CRF; 3) BERT-BiLSTM-CRF.

Following the advice of earlier work, every candidate includes a CRF for very-short-range label correction. Of the three, BERT-CRF performed best, so the module initially used BERT-CRF.

After extracting the OCR text, I labeled more than two million characters of data covering 24 tags, as shown in Table 2.4:

![Table 2.4](https://static001.geekbang.org/wechat/images/ef/ef1a4f6bb53a1d030e756a778205ab7e.png)

In the table, (B) marks a beginning tag and (I) an inside tag; the remaining labels have no beginning/inside split. Patient sex is usually a single character, so it has only a beginning tag.

BERT is also an attention-based algorithm, and attention-based algorithms share an obvious limitation: a hard maximum input length, 512 characters in BERT's case. Yet many reports, such as complete blood counts and ultrasound exams, run well over a thousand characters. Tabular reports like the complete blood count have little dependency between lines and can still be split line by line; two-part narrative reports like ultrasound exams have strong front-to-back dependencies and cannot simply be split on line boundaries.

To solve this we designed Bert-rb (short for Bert-roll-back). Bert-rb is a derivative of BERT that exploits the inherent structure of reports to split them intelligently (for reasons of space, I will explain its principle and implementation in a future post). Attaching a CRF after Bert-rb gives Bert-rb-crf, the model we finally use.

After replacing BERT-CRF with Bert-rb-crf, the module can in principle process report text of unlimited length in a single pass.

In our experiments, Bert-rb-crf handles very long two-part reports essentially as well as an unsplit BERT-CRF does, while the unsplit BERT-CRF can only accept up to 512 characters. This demonstrates Bert-rb-crf's effectiveness.

With the algorithm question settled, training could proceed on the two-million-plus labeled characters. Table 2.5 shows Bert-rb-crf's results on the test split of the labeled dataset:

![Table 2.5](https://static001.geekbang.org/wechat/images/c2/c21405a66700e164c1be12232ec3bfe1.png)

### 2.5 Structured Content Extraction

Once the named entities have been obtained from a report's text, they must be assembled into a table. The idea is very simple: find each "test item name", then look on the item's own line and the following line for an "item value" and a "reference range" that have not yet been matched.

## 3. Remaining Problems

Although every module of the project has produced encouraging results, problems undeniably remain.

The CRAFT detector has no stamp detection, so when a stamp overlaps the report text it disturbs region detection. CRAFT is also fond of mistaking line charts for text; fortunately, line charts rarely carry important information, so this misdetection barely affects the final result. Certain textures (a blue striped sweater, say) can still confuse CRAFT as well, though such textures usually bear no text. As for recognition, CRNN beats some commercial cloud OCR APIs on report images, but it still has considerable headroom, and closing that gap will require us to push further with current state-of-the-art techniques.

Report classification accuracy stands at only 92.2%. We are not satisfied with this and believe it can be pushed above 95%, but that will take more high-quality data.

The NER module cannot yet detect the "test item unit" field. We did not label that entity at first: the recognition module then used a generic text recognition model, and because units contain many Greek letters, the untuned model misread those fields so often that they could not be labeled at all. Recognition of units has since improved greatly, so fixing this is mainly a matter of putting in the time.

**References:**

1. Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016: 779-788.
2. Shi B, Bai X, Yao C. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(11): 2298-2304.
3. Baek Y, Lee B, Han D, et al. Character Region Awareness for Text Detection. CVPR 2019.
4. Gupta A, Vedaldi A, Zisserman A. Synthetic Data for Text Localisation in Natural Images. CVPR 2016: 2315-2324.
5. Ma K, Shu Z, Bai X, et al. DocUNet: Document Image Unwarping via a Stacked U-Net. CVPR 2018.
6. Zhang L, Tan C L. Warped Image Restoration with Applications to Digital Libraries. 2005.
7. Wang T, Zhu Y, Jin L, et al. Decoupled Attention Network for Text Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 12216-12224.
8. Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.
9. Joulin A, Grave E, Bojanowski P, et al. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759, 2016.
10. Sun M, Li J, Guo Z, et al. THUCTC: An Efficient Chinese Text Classifier. GitHub repository, 2016.
11. Kim Y. Convolutional Neural Networks for Sentence Classification. arXiv preprint arXiv:1408.5882, 2014.
12. Liu P, Qiu X, Huang X. Recurrent Neural Network for Text Classification with Multi-Task Learning. arXiv preprint arXiv:1605.05101, 2016.
13. Lafferty J, McCallum A, Pereira F C N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. 2001.
14. Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Computation, 1997, 9(8): 1735-1780.
15. Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv preprint arXiv:1508.01991, 2015.
16. [https://github.com/649453932/Chinese-Text-Classification-Pytorch](https://github.com/649453932/Chinese-Text-Classification-Pytorch)

**About the author:**

Zhang Jianqi (張健琦): machine learning engineer at Haodf Online, focused on AI image processing and natural language processing. Author of the image processing paper *WePBAS: A Weighted Pixel-Based Adaptive Segmenter for Change Detection*.
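As an appendix, the first-version "weighted-keyword classifier" described in Section 2.3 can be sketched in a few lines. This is a minimal illustration only: the category names, keywords, and weights below are invented for the example and are not the production tables.

```python
# Sketch of the weighted-keyword classifier: each category carries a
# hand-set table of keyword weights; the category whose keywords score
# highest over the OCR'd text wins. All keywords/weights are illustrative.
KEYWORD_WEIGHTS = {
    "blood_routine": {"白細胞": 3.0, "紅細胞": 3.0, "血紅蛋白": 2.0},
    "liver_function": {"谷丙轉氨酶": 3.0, "總膽紅素": 2.0, "白蛋白": 1.0},
}

def classify_by_keywords(ocr_text: str) -> str:
    # Sum the weight of every keyword that appears in the recognized text.
    scores = {cat: 0.0 for cat in KEYWORD_WEIGHTS}
    for cat, table in KEYWORD_WEIGHTS.items():
        for keyword, weight in table.items():
            if keyword in ocr_text:
                scores[cat] += weight
    best = max(scores, key=scores.get)
    # Fall back to "unknown" when no keyword fires at all.
    return best if scores[best] > 0 else "unknown"
```

The fragility the article describes is visible here: a stray high-weight phrase like "肝功能" in a blood-routine header shifts the score toward the wrong class, which is what motivated the switch to a trained FastText model.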
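The table-assembly heuristic of Section 2.5 (match each item name with the first unmatched value and range on its own line or the next) can be sketched as follows. The entity representation, a `(label, line_no, text)` tuple, and the label names are assumptions made for this illustration, not the project's actual data model.

```python
# Sketch of structured content extraction: for each "item name" entity,
# claim the first not-yet-matched "item value" and "item range" found on
# the same line or the line directly below it.
def assemble_rows(entities):
    names = [e for e in entities if e[0] == "item_name"]
    # Pool of candidate values/ranges, each with a "used" flag appended.
    pool = [list(e) + [False] for e in entities
            if e[0] in ("item_value", "item_range")]
    rows = []
    for _, line, name in names:
        row = {"name": name, "value": None, "range": None}
        for ent in pool:
            label, ent_line, text, used = ent
            if used or ent_line not in (line, line + 1):
                continue
            key = "value" if label == "item_value" else "range"
            if row[key] is None:
                row[key] = text
                ent[3] = True  # mark as matched so no other item takes it
        rows.append(row)
    return rows
```

Missing fields simply stay `None`, which mirrors real reports where a range column is sometimes absent.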
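Section 2.4 notes that tabular reports can be split line by line to fit under BERT's 512-character limit, while two-part narrative reports cannot. That naive line-based baseline (not the Bert-rb algorithm itself, whose splitting logic is deferred to a future post) can be sketched as:

```python
# Naive baseline for long tabular reports: pack whole lines into chunks
# that stay under BERT's input limit. Safe only when lines are nearly
# independent; a single line longer than max_len is kept whole here.
MAX_LEN = 512

def split_by_lines(report_text: str, max_len: int = MAX_LEN):
    chunks, current = [], ""
    for line in report_text.splitlines():
        # Start a new chunk when appending this line would overflow.
        if current and len(current) + 1 + len(line) > max_len:
            chunks.append(current)
            current = line
        else:
            current = line if not current else current + "\n" + line
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be tagged independently and the results concatenated; the point of Bert-rb is precisely to avoid the information loss this cut introduces on reports with strong cross-line dependencies.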