Open-sourcing a BERT-based Chinese entity relation recognition (entity relation extraction) project

Original

2020-02-20 18:28

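Chinese entity relation recognition, implemented on top of the OpenNRE project open-sourced by Tsinghua University.

GitHub project page: click here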

I. Chinese relation extraction

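Using the Chinese BERT-wwm model from HIT (Harbin Institute of Technology), accuracy reaches 0.97 on a dataset of 200,000 Chinese person-relation samples (the training log further down ends at 0.986 on the test set after three epochs).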

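Implementation

The implementation is very simple:

1) Token stage: split the text into five pieces according to the positions of the two entities;
2) Index stage: put [CLS] at the start of the text and [SEP] at the end, and mark the split points in between with [unused1] to [unused4];
3) Padding stage: pad with 0 up to a maximum length of 80;
4) Build the attention mask and complete the embedding;
5) Pass the result through the BERT model;
6) Apply a fully connected layer;
7) Apply softmax.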
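The first four steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual code: text is split per character instead of with the real WordPiece tokenizer, entity positions are treated as half-open (start, end) offsets, the head entity is assumed to get [unused1]/[unused2] and the tail entity [unused3]/[unused4], and the sample sentence is made up. Padding uses the [PAD] token, which has id 0 in the standard BERT vocabulary.

```python
from typing import List, Tuple

MAX_LEN = 80  # maximum sequence length stated in the post
Span = Tuple[int, int]  # ASSUMPTION: half-open (start, end) character offsets


def split_by_entities(text: str, head: Span, tail: Span) -> List[str]:
    """Step 1: cut the text into five pieces around the two entity spans."""
    first, second = sorted([head, tail])
    if first[1] > second[0]:
        raise ValueError("entity spans overlap")
    return [
        text[:first[0]],
        text[first[0]:first[1]],
        text[first[1]:second[0]],
        text[second[0]:second[1]],
        text[second[1]:],
    ]


def build_tokens(text: str, head: Span, tail: Span) -> List[str]:
    """Step 2: add [CLS]/[SEP] and the four [unusedN] entity markers."""
    before, ent_a, between, ent_b, after = split_by_entities(text, head, tail)
    head_marks = ("[unused1]", "[unused2]")  # ASSUMPTION: marker assignment
    tail_marks = ("[unused3]", "[unused4]")
    if head[0] <= tail[0]:
        a_marks, b_marks = head_marks, tail_marks
    else:
        a_marks, b_marks = tail_marks, head_marks
    # list(str) splits per character: a stand-in for the real tokenizer.
    return (
        ["[CLS]"] + list(before)
        + [a_marks[0]] + list(ent_a) + [a_marks[1]]
        + list(between)
        + [b_marks[0]] + list(ent_b) + [b_marks[1]]
        + list(after) + ["[SEP]"]
    )


def pad_and_mask(tokens: List[str], max_len: int = MAX_LEN):
    """Steps 3-4: truncate or pad to max_len and build the attention mask."""
    tokens = tokens[:max_len]  # naive truncation; may drop the final [SEP]
    n_pad = max_len - len(tokens)
    mask = [1] * len(tokens) + [0] * n_pad  # 1 = real token, 0 = padding
    return tokens + ["[PAD]"] * n_pad, mask


# Made-up sentence: "張三" is the head entity, "李四" the tail entity.
tokens = build_tokens("張三和李四結婚了", head=(0, 2), tail=(3, 5))
padded, mask = pad_and_mask(tokens)
print(tokens)
print(len(padded), sum(mask))
```

In a real pipeline the token strings would then be mapped to vocabulary ids before being fed to BERT (step 5); that mapping is left out here.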

Training results

=== Epoch 0 train ===
100%|██████████████████████████████████████████████████████████████████| 3094/3094 [40:12<00:00, 1.28it/s, acc=0.773, loss=0.687]
=== Epoch 0 val ===
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:06<00:00, 2.42it/s, acc=0.934]
Best ckpt and saved.
=== Epoch 1 train ===
100%|██████████████████████████████████████████████████████████████████| 3094/3094 [38:17<00:00, 1.35it/s, acc=0.923, loss=0.235]
=== Epoch 1 val ===
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:05<00:00, 2.78it/s, acc=0.972]
Best ckpt and saved.
=== Epoch 2 train ===
100%|██████████████████████████████████████████████████████████████████| 3094/3094 [22:43<00:00, 2.27it/s, acc=0.961, loss=0.121]
=== Epoch 2 val ===
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:05<00:00, 2.71it/s, acc=0.986]
Best ckpt and saved.
Best acc on val set: 0.986000
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:06<00:00, 2.64it/s, acc=0.986]
Accuracy on test set: 0.986

Test results

model.infer({'text': '場照片事後將發給媒體，避免採訪時出現混亂，[3]舉行婚禮侯佩岑黃伯俊婚紗照2011年4月17日下午2點，70名親友見 證下，侯佩', 'h': {'pos': (28, 30)}, 't': {'pos': (31, 33)}})

('夫妻', 0.9995878338813782)  # 夫妻: husband and wife

model.infer({'text': '及他們的女兒小蘋果與汪峯感情糾葛2004年，葛薈婕在歐洲盃期間錄製節目時與汪峯相識並相戀，汪峯那首《我如此愛你', 'h': {'pos': (10, 11)}, 't': {'pos': (22, 24)}})

('情侶', 0.9992896318435669)  # 情侶: couple

model.infer({'text': '14日，彭加木的侄女彭丹凝打通了彭加木兒子彭海的電話，“堂哥已經知道了，他說這些年傳得太多，他不相信是真的', 'h': {'pos': (4, 6)}, 't': {'pos': (21, 22)}})

('父母', 0.8954808712005615)  # 父母: parents

model.infer({'text': '名旦吳菱仙是位列“同治十三絕”的名旦時小福的弟子，算得梅蘭芳的開蒙老師，早年曾搭過梅巧玲的四喜班，舊誼', 'h': {'pos': (2, 4)}, 't': {'pos': (27, 29)}})

('師生', 0.996309220790863)  # 師生: teacher and student
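Each result above is a (relation, score) pair. One way such a pair can be produced from the softmax step is to take the most probable class and its probability. The sketch below shows that idea in plain Python; it is an assumption about how the score is derived, the label list is only an illustrative subset taken from the outputs above, and the logits are made up.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_relation(logits: Sequence[float], id2rel: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable relation label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2rel[best], probs[best]


# ASSUMPTION: illustrative labels and order; the real label set is larger.
id2rel = ["夫妻", "情侶", "父母", "師生"]
# Made-up scores standing in for the output of the fully connected layer.
logits = [6.1, -1.3, -2.0, -1.7]
print(pick_relation(logits, id2rel))
```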

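II. Preparation before use

1. BERT model download: place the chinese_wwm_pytorch model under ./pretrain/. Download it from https://github.com/ymcui/Chinese-BERT-wwm

2. Data download: run gen.py under ./benchmark/people-relation/ to generate the Chinese person-relation data. The script itself contains the details.

3. Configure the environment variable: run vim ~/.bash_profile and add the following, replacing /path/to/project with the project location

# openNRE

export openNRE=/path/to/project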
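As a quick sanity check of this setup, the following sketch reads the openNRE variable and derives the two directories mentioned in steps 1 and 2. How the project itself consumes the variable is not described in the post, so the helper name resolve_paths and the returned dictionary keys are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict, Mapping


def resolve_paths(env: Mapping[str, str] = os.environ) -> Dict[str, Path]:
    """Derive the model and data directories from the openNRE variable."""
    root = env.get("openNRE")
    if not root:
        raise RuntimeError("environment variable openNRE is not set")
    base = Path(root)
    return {
        # Directory names taken from steps 1 and 2 of the post.
        "bert": base / "pretrain" / "chinese_wwm_pytorch",
        "data": base / "benchmark" / "people-relation",
    }


if __name__ == "__main__":
    for name, path in resolve_paths().items():
        status = "found" if path.exists() else "missing"
        print(f"{name}: {path} ({status})")
```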

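III. Notes

If you have trained your own TensorFlow BERT, you can convert it to a PyTorch version with the convert_bert_original_tf_checkpoint_to_pytorch.py script from https://github.com/huggingface/transformers.

Pitfalls encountered:

1. Install TensorFlow 2.0. Only the PyTorch model is used in the end, but TensorFlow still has to be installed for the conversion.

2. Construct the checkpoint file.

3. Error: 'Embedding' object has no attribute 'shape'. Fix: delete the assert lines at the location of the error.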
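The conversion can be sketched as follows. The post does not say what "construct the checkpoint file" involves; this sketch assumes it means writing TensorFlow's small text index file named checkpoint next to the model files. The three command-line flags are the ones the conversion script accepts; the directory, checkpoint prefix, script path and output file below are hypothetical placeholders, and the script's location inside the transformers repository depends on the version.

```python
from pathlib import Path
from typing import List


def write_checkpoint_index(ckpt_dir: Path, ckpt_prefix: str) -> Path:
    """Write a TensorFlow 'checkpoint' index file pointing at ckpt_prefix.

    ASSUMPTION: this is what the post means by constructing the file.
    """
    index = ckpt_dir / "checkpoint"
    index.write_text(
        f'model_checkpoint_path: "{ckpt_prefix}"\n'
        f'all_model_checkpoint_paths: "{ckpt_prefix}"\n',
        encoding="utf-8",
    )
    return index


def conversion_command(script: Path, ckpt_dir: Path, ckpt_prefix: str,
                       out_file: Path) -> List[str]:
    """Build (but do not run) the command line for the conversion script."""
    return [
        "python", str(script),
        "--tf_checkpoint_path", str(ckpt_dir / ckpt_prefix),
        "--bert_config_file", str(ckpt_dir / "bert_config.json"),
        "--pytorch_dump_path", str(out_file),
    ]


# Hypothetical paths, for illustration only.
cmd = conversion_command(
    script=Path("convert_bert_original_tf_checkpoint_to_pytorch.py"),
    ckpt_dir=Path("my_tf_bert"),
    ckpt_prefix="bert_model.ckpt",
    out_file=Path("my_tf_bert/pytorch_model.bin"),
)
print(" ".join(cmd))
```

The command is only printed here; pass the list to subprocess.run once the paths point at real files.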

Author: 陶瑞同學
