Question Answering (QA), Part 2: Hands-On BERT-Based Knowledge Base QA on NLPCC2017 KBQA

GitHub:

I. Problem Description

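This hands-on knowledge base QA project is based on Task 5 of NLPCC 2017, Open Domain Question Answering. The task provides a training set of 14,609 question-answer pairs and a test set of 9,870 question-answer pairs, along with a knowledge base containing 6,502,738 entities, 587,875 attributes, and 43,063,796 triples.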

Knowledge base (nlpcc-iccpol-2016.kbqa.kb)

Training set (nlpcc-iccpol-2016.kbqa.training-data)

Test set (nlpcc-iccpol-2016.kbqa.testing-data; results are submitted for evaluation)

NLPCC 2017: http://tcci.ccf.org.cn/conference/2017/taskdata.php
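For orientation, a knowledge base of triples like this one can be loaded into an in-memory index keyed by entity. The sketch below assumes each line has the form `entity ||| attribute ||| value`; that separator and the function name `load_kb` are assumptions made for illustration and are not stated in the text above.

```python
from collections import defaultdict


def load_kb(lines, sep="|||"):
    """Index triples as {entity: {attribute: [values]}}.

    Assumes one triple per line, written as "entity ||| attribute ||| value".
    `lines` can be any iterable of strings, e.g. an open file.
    """
    kb = defaultdict(lambda: defaultdict(list))
    for line in lines:
        # Split on the first two separators only, so a value may contain "|||"
        parts = [p.strip() for p in line.split(sep, 2)]
        if len(parts) != 3 or not all(parts):
            continue  # skip malformed or incomplete lines
        entity, attribute, value = parts
        kb[entity][attribute].append(value)
    return kb
```

With a real file this would be called as `load_kb(open(path, encoding="utf-8"))`, where `path` points to the knowledge base file. At 43 million triples the full index may not fit comfortably in memory, so filtering to the entities that occur in the questions first is a reasonable variation.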

II. Approach

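Knowledge base question answering is split into two main steps: named entity recognition and attribute mapping. The entity recognition step finds the name of the entity the question asks about, and the attribute mapping step finds which attribute of that entity the question asks for.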

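Named entity recognition uses BERT+BiLSTM+CRF (adding some rule-based mappings can improve coverage).
Attribute mapping is recast as a text similarity problem and solved with BERT as a binary classifier (ambiguous answers require the question-answer context).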
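The two steps can be sketched as one small pipeline. This is a minimal illustration, not the author's implementation: `answer_question`, `toy_ner`, `toy_score`, and the toy knowledge base with placeholder values are all hypothetical stand-ins so the flow runs without any trained model.

```python
def answer_question(question, kb, recognize_entity, score_attribute):
    """Two-step KBQA: find the entity, then pick its best-matching attribute.

    recognize_entity(question) -> entity string or None
        (stands in for the BERT+BiLSTM+CRF model)
    score_attribute(question, attribute) -> float, higher means a better match
        (stands in for the BERT binary classifier's positive-class score)
    """
    entity = recognize_entity(question)
    if entity is None or entity not in kb:
        return None
    candidates = kb[entity]  # {attribute: value} for this entity
    if not candidates:
        return None
    best = max(candidates, key=lambda attr: score_attribute(question, attr))
    return entity, best, candidates[best]


# Toy stand-ins (hypothetical); the values are placeholders, not real data
toy_kb = {"機械設計基礎": {"作者": "<author value>", "出版社": "<publisher value>"}}


def toy_ner(question):
    # Longest KB entity name that appears verbatim in the question
    hits = [e for e in toy_kb if e in question]
    return max(hits, key=len) if hits else None


def toy_score(question, attribute):
    # 1.0 if the attribute name appears in the question, else 0.0
    return 1.0 if attribute in question else 0.0


print(answer_question("《機械設計基礎》這本書的作者是誰？", toy_kb, toy_ner, toy_score))
```

Passing the two models in as callables keeps the control flow independent of how each step is trained, which mirrors the split described above.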

III. BERT Named Entity Recognition Results

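To build the NER dataset, each Question is labeled in reverse from the Entity of its triple: the characters of the Question that match the Entity are tagged as the entity span.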

Code:

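Training example (question, then entity; the question asks who wrote the book 機械設計基礎, "Fundamentals of Mechanical Design"):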

《機械設計基礎》這本書的作者是誰？     機械設計基礎



After labeling:
《 O
機 B-LOC
械 I-LOC
設 I-LOC
計 I-LOC
基 I-LOC
礎 I-LOC
》 O
這 O
本 O
書 O
的 O
作 O
者 O
是 O
誰 O
？ O
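The reverse labeling shown above can be sketched as a character-level string match. This is an illustrative sketch that assumes the entity appears verbatim in the question and that only its first occurrence is tagged; the function name `label_question` is hypothetical, and the `LOC` tag simply follows the example above.

```python
def label_question(question, entity, tag="LOC"):
    """Tag each character of `question`: B-/I-<tag> for the first occurrence
    of `entity`, and O everywhere else."""
    tags = ["O"] * len(question)
    start = question.find(entity)
    if entity and start != -1:
        tags[start] = f"B-{tag}"
        for i in range(start + 1, start + len(entity)):
            tags[i] = f"I-{tag}"
    return list(zip(question, tags))


# Prints one "character tag" pair per line, in the format shown above
for char, t in label_question("《機械設計基礎》這本書的作者是誰？", "機械設計基礎"):
    print(char, t)
```

Questions whose entity does not match verbatim come out as all `O` under this sketch, so they would need either a fuzzy match or to be dropped from the training data.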

Training command:

python bert_lstm_ner.py   \
        --task_name="NER" \
        --do_train=True \
        --do_eval=True  \
        --do_predict=True  \
        --data_dir=NERdata  \
        --max_seq_length=128 \
        --train_batch_size=32  \
        --learning_rate=2e-5  \
        --num_train_epochs=3.0 \
        --output_dir=./output/result_dir_ner/

Prediction command:

python terminal_predict.py

Result: entity recognition works reasonably well; I checked the accuracy and it was decent.
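The text does not say how the accuracy was computed. One plausible measure, sketched here as an assumption, is exact match: decode the predicted tags back into an entity string and compare it with the gold entity. The helper names `decode_entity` and `exact_match_accuracy` are hypothetical.

```python
def decode_entity(chars, tags):
    """Return the first B-/I- tagged span as a string, or "" if there is none."""
    span = []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if span:
                break  # a second entity starts; keep only the first
            span.append(ch)
        elif tag.startswith("I-") and span:
            span.append(ch)
        elif span:
            break  # the span has ended
    return "".join(span)


def exact_match_accuracy(predicted, gold):
    """Fraction of questions whose decoded entity equals the gold entity."""
    if not gold:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)
```

Exact match is stricter than per-character tag accuracy, because one wrong boundary character makes the whole entity count as a miss.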

IV. Evaluating BERT Attribute Mapping

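Building the dataset for the BERT binary classification task:

1. Build the full attribute set of the test set by extracting and deduplicating its attributes, which yields 4,373 attributes (RelationList);

2. Each sample consists of question + attribute + label; the attribute from the original data gets label 1;

3. Randomly draw five attributes from RelationList as negative samples (label 0).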
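The three steps above can be sketched as follows. This is a minimal illustration: the function name `build_samples` and its parameters are hypothetical, and excluding the gold attribute from the negative pool is an assumption added here, since the steps only say the negatives are drawn at random from RelationList.

```python
import random


def build_samples(qa_pairs, num_negatives=5, seed=0):
    """Turn (question, gold_attribute) pairs into (question, attribute, label) rows.

    Label 1 marks the gold attribute; label 0 marks randomly drawn negatives.
    """
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    # Step 1: extract and deduplicate all attributes (the RelationList)
    relation_list = sorted({attr for _, attr in qa_pairs})
    samples = []
    for question, gold in qa_pairs:
        # Step 2: the original attribute is the positive sample
        samples.append((question, gold, 1))
        # Step 3: draw negatives, skipping the gold attribute (assumption)
        pool = [a for a in relation_list if a != gold]
        k = min(num_negatives, len(pool))
        for neg in rng.sample(pool, k):
            samples.append((question, neg, 0))
    return samples
```

With five negatives per positive, roughly five of every six samples carry label 0, which is worth keeping in mind when reading an accuracy figure on this dataset.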

Training command:

export BERT_BASE_DIR=/home/bert/chinese_L-12_H-768_A-12

export MY_DATASET=/home/bert/data_sim 

python run_classifier.py \
  --data_dir=$MY_DATASET \
  --task_name=similarity \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --output_dir=./data_sim/output/ \
  --do_train=true \
  --do_eval=true \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0




INFO:tensorflow:***** Eval results *****
INFO:tensorflow:  eval_accuracy = 0.98575
INFO:tensorflow:  eval_loss = 0.06471516
INFO:tensorflow:  global_step = 4727
INFO:tensorflow:  loss = 0.06471516

Test results:

export BERT_BASE_DIR=/home/mqq/zwshi/bert/chinese_L-12_H-768_A-12
export MY_DATASET=/home/mqq/zwshi/bert/data_kbqa
  
python run_classifier.py \
  --task_name=similarity \
  --do_predict=true \
  --data_dir=$MY_DATASET \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=./data_kbqa/output \
  --max_seq_length=128 \
  --output_dir=./data_kbqa/output/
  


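Prediction accuracy: 0.986

import pandas as pd

# Gold labels: column 3 of the tab-separated test file
test_df = pd.read_csv('test.csv', header=None, sep='\t')
test_label = test_df[3].tolist()

# Each row of test_results.tsv holds the scores for class 0 and class 1
test_predict_df = pd.read_csv('./output/test_results.tsv', header=None, sep='\t')

# The predicted label is the class with the higher score
test_predict_df['label'] = test_predict_df.apply(lambda x: 0 if x[0] > x[1] else 1, axis=1)
test_predict_label = test_predict_df['label'].tolist()

# Accuracy: fraction of samples where the prediction equals the gold label
result = [1 if x == y else 0 for x, y in zip(test_label, test_predict_label)]

print(sum(result) / len(result))
0.9863194162950952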

