Solving Part-of-Speech Tagging with a CRF (Conditional Random Field) in TensorFlow

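Lately I have been reading Wu Maogui's book "Python深度學習：基於TensorFlow" (Python Deep Learning: Based on TensorFlow). A few days ago I reached the part on probabilistic graphical models, which covers Bayesian networks and Markov networks; for the latter it mainly discusses Markov random fields and conditional random fields. So today I typed out the book's code myself. Honestly, the code is a bit messy and could be discouraging for a beginner. Let's get going~~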

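A quick word on part-of-speech tagging. Take the sentence: You are beautiful. It has the standard subject + linking verb + predicative structure. A linking verb cannot serve as the predicate on its own; it must be followed by a predicative. That is a rule, and it can be expressed as a feature function. There are many more such rules, for example that a verb is usually not followed directly by another verb, which is another feature function. We can define a whole set of feature functions and use them to judge how plausible a tag sequence is. You can look up the background on your own; I won't go through it point by point, because today is mainly about implementing it in code~~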
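To make the idea of feature functions concrete, here is a minimal sketch in plain Python. The tag names, the two rules and their weights are hypothetical and chosen only for illustration; a real CRF learns the weights from data rather than having them written by hand.

```python
# Hypothetical tag set and hand-written feature functions, for illustration only.
def f_linking_verb_then_adj(words, tags, i):
    """Fires when a linking verb is followed by an adjective (a predicative)."""
    return 1.0 if i > 0 and tags[i - 1] == "LINK_VERB" and tags[i] == "ADJ" else 0.0


def f_verb_after_verb(words, tags, i):
    """Fires when a verb directly follows another verb."""
    verbs = {"VERB", "LINK_VERB"}
    return 1.0 if i > 0 and tags[i - 1] in verbs and tags[i] in verbs else 0.0


# (feature function, weight): positive weights reward, negative weights penalize.
FEATURES = [
    (f_linking_verb_then_adj, 2.0),
    (f_verb_after_verb, -3.0),
]


def score(words, tags):
    """Sum of weighted feature values over all positions of the sentence."""
    return sum(w * f(words, tags, i)
               for i in range(len(words))
               for f, w in FEATURES)


words = ["You", "are", "beautiful"]
good = ["PRON", "LINK_VERB", "ADJ"]
bad = ["PRON", "LINK_VERB", "VERB"]
print(score(words, good))  # 2.0
print(score(words, bad))   # -3.0
```

The tagging that respects the rules scores higher than the one that breaks them, which is how a set of feature functions ranks candidate tag sequences.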

Let's get to it:

1. Set the parameters

num_exam=10
num_words=20
num_feat=100
num_tags=5

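We use 10 samples, each with 20 words (shorter sentences would be padded to 20), a 100-dimensional feature vector per word, and 5 possible tags. These values can be set freely, as long as they are reasonable.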

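2. Build random features and random tags

import numpy as np

# Build random features: one num_feat-dimensional vector per word
x=np.random.rand(num_exam,num_words,num_feat).astype(np.float32)
# Build random tags: one integer in [0, num_tags) per word
y=np.random.randint(num_tags,size=[num_exam,num_words]).astype(np.int32)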

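Next, get the length of each sample sentence. Real sentences differ in length, so they would be padded to a common length of 20, i.e. num_words; here every sample is simply given the full length: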

length_se=np.full(num_exam,num_words,dtype=np.int32)
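In this toy setup every sentence has the full length, so nothing is actually padded. As a sketch of what the lengths would look like with real data, the snippet below pads sentences of different lengths to num_words and records their true lengths. The token-id sentences and the pad id 0 are made-up assumptions.

```python
import numpy as np

num_words = 6  # smaller than in the article, to keep the output readable
# Hypothetical sentences, already mapped to integer token ids.
sentences = [[4, 9, 2], [7, 1, 3, 8, 5], [6]]

# True length of each sentence, capped at num_words.
lengths = np.array([min(len(s), num_words) for s in sentences], dtype=np.int32)

# Pad (or truncate) every sentence to exactly num_words ids; 0 is the assumed pad id.
padded = np.zeros((len(sentences), num_words), dtype=np.int32)
for row, sent in enumerate(sentences):
    padded[row, :lengths[row]] = sent[:num_words]

print(lengths)  # [3 5 1]
print(padded)
```

An array like lengths is what would take the place of length_se when the sentences are not all the same length.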

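3. Build the model

Convert x, y and length_se to constants:

import tensorflow as tf  # TensorFlow 1.x; tf.contrib is not available in 2.x

x_t=tf.constant(x)
y_t=tf.constant(y)
length_se_t=tf.constant(length_se)

Add a linear layer without bias: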

weights=tf.get_variable("weights",[num_feat,num_tags])
x_t_matr=tf.reshape(x_t,[-1,num_feat])
unary_scores_matr=tf.matmul(x_t_matr,weights)
unary_scores=tf.reshape(unary_scores_matr,[num_exam,num_words,num_tags])

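This creates a new variable weights with shape (number of features) × (number of tags). The -1 in reshape means the number of rows is inferred automatically, while the number of columns is fixed to the number of features. These four lines are just a basic matrix multiplication wrapped in two reshapes: x_t is flattened from [10, 20, 100] to [200, 100], multiplied by the [100, 5] weights to give [200, 5], and reshaped back to [10, 20, 5]. Plugging in the concrete numbers makes it easier to picture~~~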
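The same reshape, multiply, reshape pattern can be tried out in NumPy with tiny sizes. This is only an illustrative sketch: the sizes and the random weights below are arbitrary and are not the values used in the article.

```python
import numpy as np

num_exam, num_words, num_feat, num_tags = 2, 3, 4, 5  # tiny sizes for illustration
rng = np.random.RandomState(0)
x = rng.rand(num_exam, num_words, num_feat).astype(np.float32)
weights = rng.rand(num_feat, num_tags).astype(np.float32)

x_matr = x.reshape(-1, num_feat)   # [2*3, 4]: one row per word
scores_matr = x_matr @ weights     # [6, 5]: one score per word and tag
unary_scores = scores_matr.reshape(num_exam, num_words, num_tags)  # [2, 3, 5]

print(x_matr.shape, scores_matr.shape, unary_scores.shape)
# (6, 4) (6, 5) (2, 3, 5)

# Same result as applying the weights to every word vector directly.
direct = np.einsum("bwf,ft->bwt", x, weights)
print(np.allclose(unary_scores, direct, atol=1e-5))  # True
```

Flattening first simply lets a single 2-D matrix product handle all words of all sentences at once.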

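Now we can compute the log-likelihood of the tag sequences and obtain the transition parameters (a matrix of tag-to-tag scores, not normalized probabilities):

log_likelihood,tran_params=tf.contrib.crf.crf_log_likelihood(unary_scores,y_t,length_se_t)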

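The tf.contrib.crf.crf_log_likelihood function computes the log-likelihood of tag sequences in a conditional random field. Its signature is:

crf_log_likelihood(inputs,tag_indices,sequence_lengths,transition_params=None)

Here inputs is the [batch_size, max_seq_len, num_tags] tensor of unary scores, tag_indices is the [batch_size, max_seq_len] matrix of gold tags, and sequence_lengths is the [batch_size] vector of sentence lengths. If transition_params is left as None, the function creates a trainable [num_tags, num_tags] transition matrix and returns it together with the per-sentence log-likelihood.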
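To see what such a log-likelihood amounts to for a single sentence, here is a NumPy sketch of a linear-chain CRF: the score of the gold tag sequence minus the log of the summed exponentiated scores of all sequences, the latter computed with the forward algorithm. This is my own illustration under the assumption of no start or end transitions; it is not the library's implementation and it ignores batching and padding.

```python
import itertools
import numpy as np


def logsumexp(a, axis=None):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def sequence_score(unary, trans, tags):
    """Unary scores of the chosen tags plus transition scores between neighbours."""
    s = sum(unary[t, tag] for t, tag in enumerate(tags))
    return s + sum(trans[a, b] for a, b in zip(tags[:-1], tags[1:]))


def log_partition(unary, trans):
    """Forward algorithm: log of the summed exp-scores over all tag sequences."""
    alpha = unary[0]
    for t in range(1, len(unary)):
        # alpha[i] + trans[i, j], reduced over the previous tag i.
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + unary[t]
    return logsumexp(alpha)


def crf_log_likelihood_single(unary, trans, tags):
    return sequence_score(unary, trans, tags) - log_partition(unary, trans)


rng = np.random.RandomState(0)
num_words, num_tags = 4, 3           # tiny sizes so brute force stays cheap
unary = rng.randn(num_words, num_tags)
trans = rng.randn(num_tags, num_tags)
tags = [0, 2, 1, 1]                  # arbitrary "gold" tags

# Brute force over all 3**4 sequences, to check the forward algorithm.
all_scores = np.array([sequence_score(unary, trans, seq)
                       for seq in itertools.product(range(num_tags), repeat=num_words)])
brute = logsumexp(all_scores)
ll = crf_log_likelihood_single(unary, trans, tags)
print(np.isclose(log_partition(unary, trans), brute))  # True
print(ll < 0)                                          # True
```

The forward recursion needs only num_words × num_tags² operations, whereas the brute-force sum grows exponentially with sentence length.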

4. Decoding, loss and optimizer

viterbi_sequence,viterbi_score=tf.contrib.crf.crf_decode(unary_scores,tran_params,length_se_t)
loss=tf.reduce_mean(-log_likelihood)
train_op=tf.train.GradientDescentOptimizer(0.01).minimize(loss)
session.run(tf.global_variables_initializer())
mask=(np.expand_dims(np.arange(num_words),axis=0)<np.expand_dims(length_se,axis=1))

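The function in the first line performs Viterbi decoding inside the TensorFlow graph. The lines after it are the usual loss setup: the loss is the mean negative log-likelihood, minimized by gradient descent with a learning rate of 0.01. Because all 10 samples are fed in at every step, this is full-batch rather than stochastic gradient descent. The last line builds a mask marking the non-padded positions of each sentence. The learning rate can of course be changed, and it is worth trying a few different values; an earlier post covers this in more detail: https://blog.csdn.net/beyond9305/article/details/88902616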
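For readers who want to see what decoding does, below is a NumPy sketch of Viterbi decoding for one sentence, checked against brute-force enumeration. It is an illustration under the same assumptions as a plain linear-chain CRF (unary scores plus a tag-to-tag transition matrix, no start or end transitions), not the code that runs inside TensorFlow; the helper names are my own.

```python
import itertools
import numpy as np


def viterbi_decode(unary, trans):
    """Return the highest-scoring tag sequence and its score."""
    num_words, num_tags = unary.shape
    best = unary[0].copy()                       # best score ending in each tag
    backptr = np.zeros((num_words, num_tags), dtype=np.int32)
    for t in range(1, num_words):
        cand = best[:, None] + trans             # cand[i, j]: from tag i to tag j
        backptr[t] = np.argmax(cand, axis=0)     # best previous tag for each j
        best = np.max(cand, axis=0) + unary[t]
    # Follow the back-pointers from the best final tag.
    path = [int(np.argmax(best))]
    for t in range(num_words - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    path.reverse()
    return path, float(np.max(best))


def path_score(unary, trans, tags):
    """Score of one tag sequence: unary terms plus transition terms."""
    s = sum(unary[t, tag] for t, tag in enumerate(tags))
    return s + sum(trans[a, b] for a, b in zip(tags[:-1], tags[1:]))


rng = np.random.RandomState(1)
unary = rng.randn(5, 3)   # 5 words, 3 tags
trans = rng.randn(3, 3)

path, best_score = viterbi_decode(unary, trans)
brute_best = max(itertools.product(range(3), repeat=5),
                 key=lambda seq: path_score(unary, trans, seq))
print(path == list(brute_best))                                 # True
print(np.isclose(best_score, path_score(unary, trans, path)))   # True
```

Like the forward algorithm, Viterbi replaces the sum over previous tags with a max and remembers which previous tag achieved it.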

5. Train the model

# Total number of labels (non-padded positions)
total_labels=np.sum(length_se)
# Start training
for i in range(2001):
    tf_viterbi_sequence,_=session.run([viterbi_sequence,train_op])
    if i%100==0:
        # Count predictions that match the labels, ignoring padded positions
        correct_labels=np.sum((y==tf_viterbi_sequence)*mask)
        accuracy=100.0*correct_labels/float(total_labels)
        print("Accuracy-NO.%d:%.2f%%" % (i,accuracy))

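Finally we print the accuracy: training runs for 2001 iterations and reports every 100. Since both the features and the tags are random, this accuracy only measures how well the model fits the training data.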

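Feel free to change the parameters and experiment with the learning rate, the optimizer, the number of iterations and so on. You might be pleasantly surprised.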
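One detail of the training loop that is easier to follow in isolation is the mask-based accuracy. The sketch below uses made-up lengths, labels and predictions (all hypothetical) so that some positions really are padding.

```python
import numpy as np

num_words = 5
lengths = np.array([5, 3, 2], dtype=np.int32)   # hypothetical true sentence lengths
y_true = np.array([[1, 0, 2, 2, 1],
                   [0, 1, 1, 0, 0],
                   [2, 2, 0, 0, 0]])
y_pred = np.array([[1, 0, 2, 0, 1],
                   [0, 1, 0, 0, 0],
                   [2, 2, 1, 1, 1]])

# mask[b, t] is True when position t lies inside sentence b.
mask = np.expand_dims(np.arange(num_words), axis=0) < np.expand_dims(lengths, axis=1)

total_labels = np.sum(lengths)                      # 10 real positions
correct_labels = np.sum((y_true == y_pred) * mask)  # matches on padding are ignored
accuracy = 100.0 * correct_labels / float(total_labels)
print("%.2f%%" % accuracy)  # 80.00%
```

In the second row the two padded positions happen to match, but the mask keeps them from inflating the accuracy.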

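OK, this experiment gave a rough walkthrough of using a conditional random field for part-of-speech tagging with TensorFlow. It did not cover the underlying theory; the focus was on the concrete code. The mathematics takes a lot of space, and other bloggers have already written it up thoroughly, so I won't reinvent the wheel. It is worth concentrating on a few key concepts, such as feature functions and tag sequences. In short, to understand an idea fully and precisely, start from the basics, go deeper step by step, and finally implement it in code; that way you get twice the result for half the effort~~

The complete code is attached below. I recommend typing it out yourself to get a better feel for it~~~~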

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Requires TensorFlow 1.x (tf.contrib is not available in TensorFlow 2.x).

import numpy as np
import tensorflow as tf

# Set the parameters
num_exam=10
num_words=20
num_feat=100
num_tags=5
# Build random features
x=np.random.rand(num_exam,num_words,num_feat).astype(np.float32)
# Build random tags
y=np.random.randint(num_tags,size=[num_exam,num_words]).astype(np.int32)
# Sentence length of each sample (all full length here)
length_se=np.full(num_exam,num_words,dtype=np.int32)
# Train the model
with tf.Graph().as_default():
    with tf.Session() as session:
        x_t=tf.constant(x)
        y_t=tf.constant(y)
        length_se_t=tf.constant(length_se)
        # Add a linear layer without bias
        weights=tf.get_variable("weights",[num_feat,num_tags])
        x_t_matr=tf.reshape(x_t,[-1,num_feat])
        unary_scores_matr=tf.matmul(x_t_matr,weights)
        unary_scores=tf.reshape(unary_scores_matr,[num_exam,num_words,num_tags])
        # Compute the log-likelihood of the tag sequences and get the transition parameters
        log_likelihood,tran_params=tf.contrib.crf.crf_log_likelihood(unary_scores,y_t,length_se_t)
        # Decode
        viterbi_sequence,viterbi_score=tf.contrib.crf.crf_decode(unary_scores,tran_params,length_se_t)
        # Loss: mean negative log-likelihood, minimized by gradient descent
        loss=tf.reduce_mean(-log_likelihood)
        train_op=tf.train.GradientDescentOptimizer(0.01).minimize(loss)
        session.run(tf.global_variables_initializer())
        # Mask of non-padded positions, shape [num_exam, num_words]
        mask=(np.expand_dims(np.arange(num_words),axis=0)<np.expand_dims(length_se,axis=1))
        # Total number of labels
        total_labels=np.sum(length_se)
        # Start training
        for i in range(2001):
            tf_viterbi_sequence,_=session.run([viterbi_sequence,train_op])
            if i%100==0:
                correct_labels=np.sum((y==tf_viterbi_sequence)*mask)
                accuracy=100.0*correct_labels/float(total_labels)
                print("Accuracy-NO.%d:%.2f%%" % (i,accuracy))
