Understanding the Problem
- Competition goal: AI-assisted construction of a diabetes knowledge graph: mine diabetes-related textbooks and research papers and build a knowledge graph from the extracted information.
- Competition schedule: the contest runs in three stages. Season one is entity extraction, season two is relation extraction, and the third stage is the grand final.
- Competition data: excerpts from relevant textbooks and papers. The training data consists of annotated files; there are also test data and a specified submission format.
Approach
Entity extraction here is named entity recognition (NER). There are many NER algorithms, for example hidden Markov models, conditional random fields, RNNs, and LSTMs. Train the model parameters on the annotated data, tune them, and predict.
- Model choice: the widely used LSTM+CRF, plus an experiment with a CNN for feature extraction.
- Data processing: this takes the most time and has many steps: converting characters to numbers, reformatting the annotated files, and converting the final output into the required submission format.
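As a minimal sketch of the text-to-number step (hypothetical helper names, not the competition code), a character vocabulary can be built like this:

```python
def build_vocab(texts, pad="<PAD>", unk="<UNK>"):
    """Map every character in the corpus to an integer id.

    Ids 0 and 1 are reserved for padding and unknown characters,
    so unseen characters at prediction time still get an id.
    """
    char_to_id = {pad: 0, unk: 1}
    for text in texts:
        for ch in text:
            if ch not in char_to_id:
                char_to_id[ch] = len(char_to_id)
    id_to_char = {i: c for c, i in char_to_id.items()}
    return char_to_id, id_to_char

def encode(text, char_to_id):
    """Turn a sentence into a list of character ids."""
    return [char_to_id.get(ch, char_to_id["<UNK>"]) for ch in text]

char_to_id, id_to_char = build_vocab(["高血糖和高血壓"])
print(encode("高血糖", char_to_id))   # [2, 3, 4]
```

The reverse map `id_to_char` is kept so that predictions can be converted back into text.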
Implementation
- Embedding: each character is digitized so the input can be mapped directly to a vector. word2vec is used here; the code is as follows:
```python
# -*- coding: utf-8 -*-
import random
from gensim.models import word2vec

# Train a 100-dimensional word2vec model on the pre-segmented corpus
sentences = word2vec.Text8Corpus("fenzi.txt")
model = word2vec.Word2Vec(sentences, size=100)
model.save("yixue.model")
print("Model training finished")

# Collect every distinct character in the corpus
zidian = []
with open('fenzi.txt', 'r', encoding='utf8') as fp:
    for line in fp:
        for k in line:
            if k != " " and k not in zidian:
                zidian.append(k)

# Write one line per character: the character followed by its 100 values
with open('vec.txt', 'w', encoding='utf8') as fc:
    for ch in zidian:
        try:
            vect = model[ch].tolist()
        except KeyError:
            # Characters absent from the model get a random vector
            vect = [random.random() for _ in range(100)]
        fc.write(ch + " " + " ".join(str(v) for v in vect) + "\n")
```
This produces the character dictionary and vec.txt, with one character and its 100-dimensional vector per line.
- Annotation processing: entities are tagged with the BIO scheme, so each character carries a B-&lt;type&gt;, I-&lt;type&gt;, or O label.
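To illustrate the conversion (a hedged sketch, not the competition's actual preprocessing), a brat-style character-offset annotation can be turned into per-character BIO tags like this:

```python
def spans_to_bio(text, spans):
    """Convert (start, end, type) character spans to BIO tags.

    Every character defaults to 'O'; the first character of an entity
    gets 'B-<type>' and the remaining characters get 'I-<type>'.
    """
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

text = "高血糖和高血壓"
spans = [(0, 3, "Disease"), (4, 7, "Disease")]
print(list(zip(text, spans_to_bio(text, spans))))
```

The span offsets match the `start`/`end` fields used later when writing the prediction .ann files.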
Next, build the entity dictionary. The code is as follows:
```python
import os
import pandas as pd

# Collect (entity text, entity type) pairs from every brat .ann file
c_root = os.getcwd() + os.sep + "data_source" + os.sep
li01, li02 = [], []
for file in os.listdir(c_root):
    if ".ann" in file:
        with open(c_root + file, 'r', encoding='utf8') as fp:
            for line in fp:
                fields = line.strip("\n").split("\t")
                li01.append(fields[-1])                # entity text
                li02.append(fields[-2].split(" ")[0])  # entity type

data = pd.DataFrame({'a': li01, 'b': li02})
# Deduplicate on the entity text to build the dictionary
da = data.drop_duplicates(subset="a", keep='first', inplace=False)
# da.to_csv("./DICT_NOW.csv", index=False, header=False, encoding='utf8')
print(data["b"].value_counts())
```
At this point the offline preprocessing is finished. During training the following processing is still needed:
- Collect all the label categories.
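Collecting the categories amounts to scanning the BIO tags seen in training and assigning each one an id; a minimal sketch (hypothetical helper, not the project's code):

```python
def build_tag_map(tag_sequences):
    """Collect every BIO tag seen in training and assign it an id.

    'O' is forced to id 0 so the padding tag and the default tag coincide.
    """
    tag_to_id = {"O": 0}
    for seq in tag_sequences:
        for tag in seq:
            if tag not in tag_to_id:
                tag_to_id[tag] = len(tag_to_id)
    id_to_tag = {i: t for t, i in tag_to_id.items()}
    return tag_to_id, id_to_tag

tag_to_id, id_to_tag = build_tag_map(
    [["O", "B-Disease", "I-Disease"], ["B-Treatment", "I-Treatment", "O"]])
print(tag_to_id)
```

`id_to_tag` is what the prediction step later uses to turn tag ids back into entity types.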
Writing the Training Code
The training program is written in TensorFlow; the core code is as follows:
```python
def project_layer_bilstm(self, lstm_outputs, name=None):
    """
    hidden layer between lstm layer and logits
    :param lstm_outputs: [batch_size, num_steps, emb_size]
    :return: [batch_size, num_steps, num_tags]
    """
    with tf.variable_scope("project" if not name else name):
        with tf.variable_scope("hidden"):
            W = tf.get_variable("W", shape=[self.lstm_dim * 2, self.lstm_dim],
                                dtype=tf.float32, initializer=self.initializer)
            b = tf.get_variable("b", shape=[self.lstm_dim], dtype=tf.float32,
                                initializer=tf.zeros_initializer())
            output = tf.reshape(lstm_outputs, shape=[-1, self.lstm_dim * 2])
            hidden = tf.tanh(tf.nn.xw_plus_b(output, W, b))

        # project to score of tags
        with tf.variable_scope("logits"):
            W = tf.get_variable("W", shape=[self.lstm_dim, self.num_tags],
                                dtype=tf.float32, initializer=self.initializer)
            b = tf.get_variable("b", shape=[self.num_tags], dtype=tf.float32,
                                initializer=tf.zeros_initializer())
            pred = tf.nn.xw_plus_b(hidden, W, b)
        return tf.reshape(pred, [-1, self.num_steps, self.num_tags])

# IDCNN layer
def IDCNN_layer(self, model_inputs, name=None):
    """
    :param model_inputs: [batch_size, num_steps, emb_size]
    :return: [batch_size, num_steps, cnn_output_width]
    """
    # tf.expand_dims inserts a dimension at the given position (0-based)
    model_inputs = tf.expand_dims(model_inputs, 1)
    reuse = False
    if self.dropout == 1.0:
        reuse = True
    with tf.variable_scope("idcnn" if not name else name):
        filter_weights = tf.get_variable(
            "idcnn_filter",
            shape=[1, self.filter_width, self.embedding_dim, self.num_filter],
            initializer=self.initializer)
        """
        shape of input  = [batch, in_height, in_width, in_channels]
        shape of filter = [filter_height, filter_width, in_channels, out_channels]
        height defaults to 1, width is the sentence length, channels are 120-dim
        """
        layerInput = tf.nn.conv2d(model_inputs, filter_weights,
                                  strides=[1, 1, 1, 1],
                                  padding="SAME", name="init_layer",
                                  use_cudnn_on_gpu=True)
        finalOutFromLayers = []
        totalWidthForLastDim = 0
        # Repeating the dilated convolutions lets later passes cover
        # positions that a single dilated pass skips over
        for j in range(self.repeat_times):
            for i in range(len(self.layers)):
                dilation = self.layers[i]['dilation']
                isLast = True if i == (len(self.layers) - 1) else False
                with tf.variable_scope("atrous-conv-layer-%d" % i,
                                       reuse=True if (reuse or j > 0) else False):
                    w = tf.get_variable(
                        "filterW",
                        shape=[1, self.filter_width, self.num_filter, self.num_filter],
                        initializer=tf.contrib.layers.xavier_initializer())
                    b = tf.get_variable("filterB", shape=[self.num_filter])
                    # Dilated convolution: rate-1 zeros are inserted between
                    # filter taps; with dilations {1, 1, 2} the first two
                    # layers are effectively undilated
                    conv = tf.nn.atrous_conv2d(layerInput, w,
                                               rate=dilation, padding="SAME")
                    conv = tf.nn.bias_add(conv, b)
                    conv = tf.nn.relu(conv)
                    if isLast:
                        finalOutFromLayers.append(conv)
                        totalWidthForLastDim += self.num_filter
                    layerInput = conv
        finalOut = tf.concat(axis=3, values=finalOutFromLayers)
        keepProb = 1.0 if reuse else 0.5
        finalOut = tf.nn.dropout(finalOut, keepProb)
        # Squeeze out the dummy height dimension; values are unchanged
        finalOut = tf.squeeze(finalOut, [1])
        finalOut = tf.reshape(finalOut, [-1, totalWidthForLastDim])
        self.cnn_output_width = totalWidthForLastDim
        return finalOut

def project_layer_idcnn(self, idcnn_outputs, name=None):
    """
    :param idcnn_outputs: [batch_size, num_steps, cnn_output_width]
    :return: [batch_size, num_steps, num_tags]
    """
    with tf.variable_scope("project" if not name else name):
        # project to score of tags
        with tf.variable_scope("logits"):
            W = tf.get_variable("W", shape=[self.cnn_output_width, self.num_tags],
                                dtype=tf.float32, initializer=self.initializer)
            b = tf.get_variable("b",
                                initializer=tf.constant(0.001, shape=[self.num_tags]))
            pred = tf.nn.xw_plus_b(idcnn_outputs, W, b)
        return tf.reshape(pred, [-1, self.num_steps, self.num_tags])

def loss_layer(self, project_logits, lengths, name=None):
    """
    calculate crf loss
    :param project_logits: [1, num_steps, num_tags]
    :return: scalar loss
    """
    # num_steps is the sentence length; project_logits is the output of the
    # feature extractor after the projection layer
    with tf.variable_scope("crf_loss" if not name else name):
        small = -1000.0
        # pad logits for crf loss
        # start_logits = [batch_size, 1, num_tags + 1]
        start_logits = tf.concat(
            [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]),
             tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1)
        # pad_logits = [batch_size, num_steps, 1]
        pad_logits = tf.cast(
            small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
        # logits = [batch_size, num_steps, num_tags + 1]
        logits = tf.concat([project_logits, pad_logits], axis=-1)
        # logits = [batch_size, num_steps + 1, num_tags + 1]
        logits = tf.concat([start_logits, logits], axis=1)
        targets = tf.concat(
            [tf.cast(self.num_tags * tf.ones([self.batch_size, 1]), tf.int32),
             self.targets], axis=-1)
        # targets = [batch_size, 1 + actual number of labels]
        self.trans = tf.get_variable(
            "transitions",
            shape=[self.num_tags + 1, self.num_tags + 1],
            initializer=self.initializer)
        # logits are the model's emission scores, targets are the labels,
        # and trans is the CRF transition matrix.
        # crf_log_likelihood computes the log-likelihood of tag sequences
        # under a linear-chain CRF:
        #   inputs: a [batch_size, max_seq_len, num_tags] tensor of emission
        #           scores (usually the reshaped BiLSTM output)
        #   tag_indices: a [batch_size, max_seq_len] matrix of gold tag ids
        #   sequence_lengths: a [batch_size] vector of sequence lengths
        #   transition_params: a [num_tags, num_tags] transition matrix
        #   log_likelihood: scalar log-likelihood
        # Because of the added start tag the real dimension is num_tags + 1.
        # Example inputs for 高血糖 (hyperglycemia) 和 (and) 高血壓 (hypertension):
        #   高:3 血:22 糖:23 和:24 高:3 血:22 壓:25
        #   char_inputs = [3, 22, 23, 24, 3, 22, 25]
        #   seg_inputs: 高血糖 = [1,2,3], 和 = [0], 高血壓 = [1,2,3]
        #   seg_inputs = [1, 2, 3, 0, 1, 2, 3]
        log_likelihood, self.trans = crf_log_likelihood(
            inputs=logits,
            tag_indices=targets,
            transition_params=self.trans,
            sequence_lengths=lengths + 1)
        return tf.reduce_mean(-log_likelihood)
```
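The CRF layer above only computes the training loss; at prediction time the best tag sequence is recovered with Viterbi decoding. As a self-contained illustration of that decoding (a pure-Python sketch, independent of the TensorFlow code above and not the project's implementation):

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence under a linear-chain CRF.

    emissions:   list of per-step [num_tags] emission scores
    transitions: [num_tags][num_tags] transition scores
    Returns the best tag-id sequence as a list of ids.
    """
    num_tags = len(emissions[0])
    score = list(emissions[0])          # best score ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_score, bp = [], []
        for j in range(num_tags):
            # Best previous tag i for current tag j
            cand = [score[i] + transitions[i][j] + emit[j]
                    for i in range(num_tags)]
            best_i = max(range(num_tags), key=lambda i: cand[i])
            bp.append(best_i)
            new_score.append(cand[best_i])
        score = new_score
        backpointers.append(bp)
    # Walk the backpointers from the best final tag to recover the path
    best_last = max(range(num_tags), key=lambda j: score[j])
    path = [best_last]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return path[::-1]

# A strong penalty for switching tags keeps the path on one tag even
# though the middle step's emission prefers the other tag.
print(viterbi_decode([[3.0, 0.0], [0.0, 3.0], [3.0, 0.0]],
                     [[0.0, -10.0], [-10.0, 0.0]]))   # [0, 0, 0]
```

This is the same dynamic program `tf.contrib.crf` applies with the learned `transitions` matrix.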
- Training. The loss during training looks like this:
```
2018-10-31 14:41:39,399 - log\train.log - INFO - iteration:77 step:4/996, NER loss:10.520254
2018-10-31 14:43:19,310 - log\train.log - INFO - iteration:77 step:104/996, NER loss:12.477299
2018-10-31 14:44:43,748 - log\train.log - INFO - iteration:77 step:204/996, NER loss:12.602566
2018-10-31 14:45:48,943 - log\train.log - INFO - iteration:77 step:304/996, NER loss: 9.900908
2018-10-31 14:47:19,396 - log\train.log - INFO - iteration:77 step:404/996, NER loss:12.695493
2018-10-31 14:49:51,545 - log\train.log - INFO - iteration:77 step:504/996, NER loss:14.701593
```
- Save the model.
Prediction Results
```python
def evaluate_line():
    config = load_config(FLAGS.config_file)
    logger = get_logger(FLAGS.log_file)
    # limit GPU memory
    tf_config = tf.ConfigProto()
    tf_config.gpu_options.allow_growth = True
    with open(FLAGS.map_file, "rb") as f:
        char_to_id, id_to_char, tag_to_id, id_to_tag = pickle.load(f)
    with tf.Session(config=tf_config) as sess:
        model = create_model(sess, Model, FLAGS.ckpt_path, load_word2vec,
                             config, id_to_char, logger)
        c_root = os.getcwd() + os.sep + "data_test" + os.sep
        c_root01 = os.getcwd() + os.sep + "data_finall" + os.sep
        for file in os.listdir(c_root):
            f = open(c_root + file, 'r', encoding='utf8')
            k = open(c_root01 + file.strip(".txt") + ".ann", 'w',
                     encoding='utf-8')
            # Join the whole file into one line for the model
            line = ''.join(i.strip('\n') for i in f.readlines())
            result = model.evaluate_line(
                sess, input_from_line(line, char_to_id), id_to_tag)
            result = result['entities']
            # Each entity looks like:
            # {'word': '基於胰高血糖', 'start': 0, 'end': 6, 'type': 'Treatment'}
            for i in range(len(result)):
                k.write("T" + str(i + 1) + '\t' + str(result[i]["type"]) + ' '
                        + str(result[i]['start']) + " " + str(result[i]['end'])
                        + "\t" + str(result[i]['word']) + "\n")
            f.close()
            k.close()

def main(_):
    if 0:
        if FLAGS.clean:
            clean(FLAGS)
        train()
    else:
        evaluate_line()
```
The predicted entities are written out in the required .ann submission format.
Personally, the results look quite accurate. They have been submitted; now to wait for tomorrow's scores.