A curated collection of medical NLP resources

GitHub link; continuously updated

Chinese_medical_NLP

Evaluation datasets, papers, and other resources for the medical NLP field (with a primary focus on Chinese).

Chinese evaluation datasets

Chinese medical knowledge graphs

English datasets

Related papers

Chinese medical-domain corpora

Open-source toolkits

Industrial products/solutions

Related links

Chinese evaluation datasets

1. Yidu-S4K: Yiducloud structured 4K dataset

Dataset description:

The Yidu-S4K dataset originates from CCKS 2019 evaluation task 1, "Named Entity Recognition for Chinese Electronic Medical Records", which comprises two subtasks:
1) Medical named entity recognition: because no publicly available dataset for medical entity recognition in Chinese electronic medical records existed, this task was carried over from 2017, with a revised dataset released together with the task. The subtask's data comprise a training set and a test set.
2) Medical entity and attribute extraction (cross-hospital transfer): building on medical entity recognition, extract predefined attributes of the entities. This is a transfer-learning task: given only a small amount of labeled data from the target setting, use labeled and unlabeled data from other settings to perform recognition in the target setting. The subtask's data comprise a training set (labeled data from non-target and target settings, plus unlabeled data from all settings) and a test set (labeled data from the target setting).
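Entity-recognition corpora of this kind are commonly consumed as BIO-tagged sequences. A minimal sketch of reading such tags back into entity spans; the character-level scheme, the sentence, and the SYMPTOM tag are made-up illustrations, not the actual Yidu-S4K format:

```python
# A made-up BIO-tagged example (character-level, as is common for Chinese NER);
# the sentence and tag set are illustrative, not the real Yidu-S4K annotation.
chars = list("患者出現頭痛")
tags = ["O", "O", "O", "O", "B-SYMPTOM", "I-SYMPTOM"]

def extract_entities(chars, tags):
    """Collect (entity_text, entity_type) spans from a BIO sequence."""
    entities, buf, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [ch], tag[2:]
        elif tag.startswith("I-") and buf:
            buf.append(ch)
        else:
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [], None
    if buf:
        entities.append(("".join(buf), etype))
    return entities

print(extract_entities(chars, tags))  # prints [('頭痛', 'SYMPTOM')]
```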

Dataset link

Baidu Pan download: https://pan.baidu.com/s/1QqYtqDwhc_S51F3SYMChBQ

Extraction code: flql

2. Ruijin Hospital diabetes dataset

Dataset description:

The dataset comes from a Tianchi competition aimed at mining the diabetes literature and building a diabetes knowledge graph from diabetes-related textbooks and research papers. Participants were asked to design accurate and efficient algorithms for this challenge. The season-1 task was "entity annotation construction based on diabetes clinical guidelines and research papers"; the season-2 task was "entity relation construction based on diabetes clinical guidelines and research papers".

The official release contains only the training set; the test set used for the final ranking was not published.

Dataset link

Baidu Pan download: https://pan.baidu.com/s/1CWKblBNBqR-vs2h0xiXSdQ

Extraction code: 0c54

3. Yidu-N7K: Yiducloud standardization 7K dataset

Dataset description:

The Yidu-N7K dataset originates from CHIP 2019 evaluation task 1, "Clinical Terminology Standardization".
Clinical terminology standardization is an indispensable task in medical statistics. In clinical practice, the same diagnosis, procedure, drug, examination, lab test, or symptom can be written in hundreds of different ways; standardization (normalization) maps each of these variants to its corresponding standard expression. Only on top of standardized terminology can researchers carry out downstream statistical analysis of electronic medical records. In essence, clinical terminology standardization is a form of semantic similarity matching, but because the source terms are phrased in so many different ways, a single matching model struggles to achieve good results.
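As a toy illustration of the normalization-as-matching idea, variant terms can be mapped to a standard vocabulary by plain string similarity; the vocabulary and variants below are made-up examples, and, as noted above, a single matching model of this kind is far from sufficient in practice:

```python
import difflib

# A hypothetical standard vocabulary and some variant spellings (made-up examples).
standard_terms = [
    "type 2 diabetes mellitus",
    "acute upper respiratory infection",
    "pulmonary tuberculosis",
]
variants = [
    "diabetes mellitus type II",
    "upper respiratory tract infection, acute",
    "lung tuberculosis",
]

def normalize(term, vocab, cutoff=0.4):
    """Map a variant term to the closest standard term, or None if nothing is close enough."""
    matches = difflib.get_close_matches(term, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for v in variants:
    print(v, "->", normalize(v, standard_terms))
```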

Dataset link

4. Chinese medical question answering dataset

Dataset description:

A Chinese medical question-answering dataset with more than 100,000 entries.

Data notes:

questions.csv: all questions and their content. answers.csv: answers to all questions.
train_candidates.txt, dev_candidates.txt, test_candidates.txt: splits derived from the two files above.
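A minimal sketch of joining the two CSV files by question ID; the column names `question_id` and `content` are assumptions for illustration, since the actual headers are not documented here:

```python
import csv
import io

# Tiny in-memory stand-ins for questions.csv and answers.csv;
# the real column names may differ (assumed here: question_id, content).
questions_csv = io.StringIO("question_id,content\nq1,What causes a persistent cough?\n")
answers_csv = io.StringIO("question_id,content\nq1,Common causes include infection and asthma.\n")

# Index questions by ID, then group answers under each question ID.
questions = {row["question_id"]: row["content"] for row in csv.DictReader(questions_csv)}
answers = {}
for row in csv.DictReader(answers_csv):
    answers.setdefault(row["question_id"], []).append(row["content"])

pairs = [(questions[qid], answers.get(qid, [])) for qid in questions]
print(pairs)
```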

Dataset link

Dataset GitHub link

5. Ping An Healthcare Technology disease QA transfer learning competition

Dataset description:

This competition is evaluation task 2 of CHIP 2019, hosted by Ping An Healthcare Technology. See http://cips-chip.org.cn/evaluation for conference details.
Transfer learning is an important topic in natural language processing: it aims to improve learning on a new task by transferring knowledge from related tasks that have already been learned, thereby improving a model's ability to generalize.
The goal of this evaluation is cross-disease transfer learning on Chinese disease QA data. Concretely, given question pairs drawn from five different diseases, decide whether the two sentences are semantically identical or similar. All sentences come from real patient questions posted online and were filtered and manually annotated for intent matching.

Dataset link (registration required)

6. Tianchi COVID-19 question-pair matching competition

Dataset description:

The competition data consist of de-identified medical question pairs with labels. The questions cover ten conditions: pneumonia, mycoplasma pneumonia, bronchitis, upper respiratory tract infection, tuberculosis, asthma, pleurisy, emphysema, common cold, and hemoptysis.
The data comprise three files: train.csv, dev.csv, and test.csv. Participants receive the training set train.csv and the validation set dev.csv; the test set test.csv is withheld.
Each record consists of Category, Query1, Query2, and Label, i.e. the question category, the two queries, and the label. Label indicates whether the two queries have the same meaning: 1 if they do, 0 if they do not. Labels are provided for the training set and withheld for the validation and test sets.
Example:
Category: 肺炎 (pneumonia)
Query 1: 肺部發炎是什麼原因引起的? (What causes lung inflammation?)
Query 2: 肺部發炎是什麼引起的 (What is lung inflammation caused by?)
Label: 1
Category: 肺炎 (pneumonia)
Query 1: 肺部發炎是什麼原因引起的? (What causes lung inflammation?)
Query 2: 肺部炎症有什麼症狀 (What are the symptoms of lung inflammation?)
Label: 0
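The Category,Query1,Query2,Label layout described above can be parsed with the standard csv module. A minimal sketch using the two example rows (the real files are far larger, and their exact encoding may differ):

```python
import csv
import io

# A tiny in-memory train.csv in the Category,Query1,Query2,Label layout;
# rows taken from the example pairs above.
train_csv = io.StringIO(
    "Category,Query1,Query2,Label\n"
    "肺炎,肺部發炎是什麼原因引起的?,肺部發炎是什麼引起的,1\n"
    "肺炎,肺部發炎是什麼原因引起的?,肺部炎症有什麼症狀,0\n"
)

rows = list(csv.DictReader(train_csv))
positives = [r for r in rows if r["Label"] == "1"]
print(len(rows), len(positives))  # prints 2 1
```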

Dataset link (registration required)

4th-place solution and code

Chinese medical knowledge graphs

CMeKG

Link

Overview: CMeKG (Chinese Medical Knowledge Graph) is a Chinese medical knowledge graph built with natural language processing and text mining techniques from large-scale medical text data, in a human-machine collaborative process. Its construction draws on authoritative international medical standards such as ICD, ATC, SNOMED, and MeSH, as well as large, heterogeneous, multi-source medical texts including clinical guidelines, industry standards, treatment protocols, and medical encyclopedias. CMeKG 1.0 contains structured knowledge for 6,310 diseases, 19,853 drugs (Western medicines, Chinese patent medicines, and Chinese herbal medicines), and 1,237 diagnostic and therapeutic techniques and devices. It covers more than 30 common relation types, such as a disease's clinical symptoms, affected sites, drug treatments, surgical treatments, differential diagnoses, imaging examinations, risk factors, transmission routes, susceptible populations, and relevant departments, as well as a drug's ingredients, indications, dosage, shelf life, and contraindications. In total, CMeKG describes over one million concept-relation instances and attribute triples.

English datasets

PubMedQA: A Dataset for Biomedical Research Question Answering

Dataset description: a medical question-answering dataset built from PubMed. PubMedQA has 1k expert-annotated, 61.2k unlabeled, and 211.3k artificially generated QA instances.

Paper link

Related papers

1. Pretrained embeddings for the medical domain

Note: no open-source pretrained model for the Chinese medical domain has been found so far; the English-language papers below are listed for reference.

BioBERT

Title: BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Paper link

Project link

Summary: initialized from general-domain pretrained BERT weights and further trained on a large corpus of English biomedical papers from PubMed; it surpasses SOTA models on several medical downstream tasks.

Abstract:

Motivation: Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora.
Results: We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.
Availability and implementation: We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.

SciBERT

Title: SCIBERT: A Pretrained Language Model for Scientific Text

Paper link

Project link

Summary: from the AllenAI team; a scientific-domain BERT trained on more than 1.1 million papers from Semantic Scholar.

Abstract: Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SCIBERT, a pretrained language model based on BERT (Devlin et al., 2019) to address the lack of high-quality, large-scale labeled scientific data. SCIBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.

Clinical BERT

Title: Publicly Available Clinical BERT Embeddings

Paper link

Project link

Summary: from the NAACL Clinical NLP Workshop 2019; a clinical-domain BERT trained on 2 million medical records from the MIMIC-III database.

Abstract: Contextual word embedding models such as ELMo and BERT have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly-available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset. We find that these domain-specific models are not as performant on 2 clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non de-identified task text.

ClinicalBERT (another team's version)

Title: ClinicalBert: Modeling Clinical Notes and Predicting Hospital Readmission

Paper link

Project link

Summary: also based on the MIMIC-III database, but trained on only a random sample of 100,000 medical records.

Abstract: Clinical notes contain information about patients that goes beyond structured data like lab values and medications. However, clinical notes have been underused relative to structured data, because notes are high-dimensional and sparse. This work develops and evaluates representations of clinical notes using bidirectional transformers (ClinicalBert). ClinicalBert uncovers high-quality relationships between medical concepts as judged by humans. ClinicalBert outperforms baselines on 30-day hospital readmission prediction using both discharge summaries and the first few days of notes in the intensive care unit. Code and model parameters are available.

BEHRT

Title: BEHRT: TRANSFORMER FOR ELECTRONIC HEALTH RECORDS

Paper link

Project link: not yet open-sourced

Summary: in this paper, the embeddings are trained over medical entities rather than words.

Abstract: Today, despite decades of developments in medicine and the growing interest in precision healthcare, the vast majority of diagnoses happen once patients begin to show noticeable signs of illness. Early indication and detection of diseases, however, can provide patients and carers with the chance of early intervention, better disease management, and efficient allocation of healthcare resources. The latest developments in machine learning (more specifically, deep learning) provides a great opportunity to address this unmet need. In this study, we introduce BEHRT: A deep neural sequence transduction model for EHR (electronic health records), capable of multitask prediction and disease trajectory mapping. When trained and evaluated on the data from nearly 1.6 million individuals, BEHRT shows a striking absolute improvement of 8.0-10.8%, in terms of Average Precision Score, compared to the existing state-of-the-art deep EHR models (in terms of average precision, when predicting for the onset of 301 conditions). In addition to its superior prediction power, BEHRT provides a personalised view of disease trajectories through its attention mechanism; its flexible architecture enables it to incorporate multiple heterogeneous concepts (e.g., diagnosis, medication, measurements, and more) to improve the accuracy of its predictions; and its (pre-)training results in disease and patient representations that can help us get a step closer to interpretable predictions.

2. Review articles

Review published in Nature Medicine

Title: A guide to deep learning in healthcare

Paper link

Summary: published in Nature Medicine; surveys applications of computer vision, NLP, reinforcement learning, and other techniques in medicine.

Abstract: Here we present deep-learning techniques for healthcare, centering our discussion on deep learning in computer vision, natural language processing, reinforcement learning, and generalized methods. We describe how these computational techniques can impact a few key areas of medicine and explore how to build end-to-end systems. Our discussion of computer vision focuses largely on medical imaging, and we describe the application of natural language processing to domains such as electronic health record data. Similarly, reinforcement learning is discussed in the context of robotic-assisted surgery, and generalized deep learning methods for genomics are reviewed.

3. Electronic health records

Transfer Learning from Medical Literature for Section Prediction in Electronic Health Records

Paper link

Summary: published at EMNLP 2019; transfer learning for EHRs using a small amount of in-domain data and a large amount of out-of-domain data.

Abstract: …sections such as Assessment and Plan, Social History, and Medications. These sections help physicians find information easily and can be used by an information retrieval system to return specific information sought by a user. However, it is common that the exact format of sections in a particular EHR does not adhere to known patterns. Therefore, being able to predict sections and headers in EHRs automatically is beneficial to physicians. Prior approaches in EHR section prediction have only used text data from EHRs and have required significant manual annotation. We propose using sections from medical literature (e.g., textbooks, journals, web content) that contain content similar to that found in EHR sections. Our approach uses data from a different kind of source where labels are provided without the need of a time-consuming annotation effort. We use this data to train two models: an RNN and a BERT-based model. We apply the learned models along with source data via transfer learning to predict sections in EHRs. Our results show that medical literature can provide helpful supervision signal for this classification task.

4. Medical relation extraction

Leveraging Dependency Forest for Neural Medical Relation Extraction

Paper link

Summary: published at EMNLP 2019. Uses dependency forests to raise the recall of dependency relations in medical sentences (at the cost of introducing some noise) and a graph recurrent network for feature extraction, offering a way to exploit dependency information in medical relation extraction while reducing error propagation.

Abstract: Medical relation extraction discovers relations between entity mentions in text, such as research articles. For this task, dependency syntax has been recognized as a crucial source of features. Yet in the medical domain, 1-best parse trees suffer from relatively low accuracies, diminishing their usefulness. We investigate a method to alleviate this problem by utilizing dependency forests. Forests contain more than one possible decision and therefore have higher recall but more noise compared with 1-best outputs. A graph neural network is used to represent the forests, automatically distinguishing the useful syntactic information from parsing noise. Results on two benchmarks show that our method outperforms the standard tree-based methods, giving the state-of-the-art results in the literature.

5. Medical knowledge graphs

Learning a Health Knowledge Graph from Electronic Medical Records

Paper link

Summary: published in Nature Scientific Reports (2017). A disease-symptom knowledge graph built from more than 270,000 electronic medical records.

Abstract: Demand for clinical decision support systems in medicine and self-diagnostic symptom checkers has substantially increased in recent years. Existing platforms rely on knowledge bases manually compiled through a labor-intensive process or automatically derived using simple pairwise statistics. This study explored an automated process to learn high quality knowledge bases linking diseases and symptoms directly from electronic medical records. Medical concepts were extracted from 273,174 de-identified patient records and maximum likelihood estimation of three probabilistic models was used to automatically construct knowledge graphs: logistic regression, naive Bayes classifier and a Bayesian network using noisy OR gates. A graph of disease-symptom relationships was elicited from the learned parameters and the constructed knowledge graphs were evaluated and validated, with permission, against Google’s manually-constructed knowledge graph and against expert physician opinions. Our study shows that direct and automated construction of high quality health knowledge graphs from medical records using rudimentary concept extraction is feasible. The noisy OR model produces a high quality knowledge graph reaching precision of 0.85 for a recall of 0.6 in the clinical evaluation. Noisy OR significantly outperforms all tested models across evaluation frameworks (p < 0.01).
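The noisy-OR gate behind the best-performing model above has a simple closed form: a symptom is absent only if a small "leak" cause and every active disease independently fail to produce it. A minimal sketch with made-up probabilities:

```python
def noisy_or(active_disease_probs, leak=0.01):
    """P(symptom | active diseases) under a noisy-OR gate:
    1 - (1 - leak) * product over active diseases of (1 - p_i),
    where p_i is the probability that disease i alone causes the symptom."""
    p_absent = 1.0 - leak
    for p in active_disease_probs:
        p_absent *= 1.0 - p
    return 1.0 - p_absent

# Made-up per-disease causal probabilities for a single symptom.
p = noisy_or([0.7, 0.4], leak=0.01)
print(round(p, 4))  # prints 0.8218
```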

6. Computer-aided diagnosis

Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence

Paper link

Summary: this work was jointly completed by Guangzhou Women and Children's Medical Center, Yitu Healthcare, and other companies and research institutions. Using machine-learning-based natural language processing (NLP), the system achieves diagnostic performance comparable to human physicians and can be applied in multiple scenarios. It is reported to be the first study published in a top medical journal on clinical intelligent diagnosis from electronic health records (EHRs) using NLP, and a landmark result in applying artificial intelligence to the diagnosis of pediatric diseases.

Abstract: Artificial intelligence (AI)-based methods have emerged as powerful tools to transform medical care. Although machine learning classifiers (MLCs) have already demonstrated strong performance in image-based diagnoses, analysis of diverse and massive electronic health record (EHR) data remains challenging. Here, we show that MLCs can query EHRs in a manner similar to the hypothetico-deductive reasoning used by physicians and unearth associations that previous statistical methods have not found. Our model applies an automated natural language processing system using deep learning techniques to extract clinically relevant information from EHRs. In total, 101.6 million data points from 1,362,559 pediatric patient visits presenting to a major referral center were analyzed to train and validate the framework. Our model demonstrates high diagnostic accuracy across multiple organ systems and is comparable to experienced pediatricians in diagnosing common childhood diseases. Our study provides a proof of concept for implementing an AI-based system as a means to aid physicians in tackling large amounts of data, augmenting diagnostic evaluations, and to provide clinical decision support in cases of diagnostic uncertainty or complexity. Although this impact may be most evident in areas where healthcare providers are in relative shortage, the benefits of such an AI system are likely to be universal.

Chinese medical-domain corpora

Medical textbooks + training/exam materials (57 GB in total)

Corpus notes: compiled from this Douban link and consolidated into a single folder for easier storage; the video materials have been removed.

Baidu Pan download: https://pan.baidu.com/s/1P2WHX7hNTqErZ3j1vhkr_Q

Extraction code: xd0c

HIT's BigCilin (《大詞林》): 750,000 open-sourced core entity terms with related concept and relation lists (including the Chinese medicine / hospital / biology categories)

Corpus notes: Harbin Institute of Technology has open-sourced 750,000 core entity terms from BigCilin (《大詞林》), together with their fine-grained concept terms (18,000 concept terms and 3 million entity-concept tuples) and related relation triples (3 million in total). The 750,000 core entities cover common terms such as person names, place names, and object names, while the concept list provides fine-grained entity concept information. With its fine-grained hypernym hierarchy and rich inter-entity relations, the released data can support applications such as human-machine dialogue and intelligent recommendation.

Official corpus download link

Baidu Pan download: https://pan.baidu.com/s/1NG8xybrEGTVYPepMM12xNw

Extraction code: mwmj

Open-source toolkits

Word segmentation tools

PKUSEG

Project link

Notes: a multi-domain Chinese word segmentation toolkit released by Peking University; a medical-domain model is available.

Industrial products/solutions

靈醫智慧

左手醫生

Related links

awesome_Chinese_medical_NLP

Chinese NLP dataset search

GitHub link

GitHub project link
