A Detailed Annotation of DeepQA-1, the Open Insurance-Industry Q&A Dataset for Machine Learning

Original

2020-05-27 08:15

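First, thanks to the author of https://github.com/chatopera/insuranceqa-corpus-zh for the hard work of building a Chinese corpus for the insurance industry and for providing example training and testing code. It met an urgent need for many people and arrived just when it was needed.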

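The one regret is that the project's code has very few comments. The two articles https://blog.csdn.net/samurais/article/details/77036461 and https://blog.csdn.net/samurais/article/details/77193529 give some high-level explanation, but not nearly enough, so a newcomer has to spend a lot of effort working through the code. I went through that "painful" process myself. To spare later readers the same ordeal and help them understand the code and get started quickly, this article annotates the project's source code in detail.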

Enough preamble; on to the main topic.

https://github.com/chatopera/insuranceqa-corpus-zh/blob/release/deep_qa_1/network.py

The source in network.py is fairly long, so it is read here in sections. The first part of the code is shown below.

import numpy as np
# `corpus` is the project's data-loading module; its import elsewhere in network.py is not shown in this excerpt.

class NeuralNetwork():
    def __init__(self, hidden_layers = [100, 50], 
                 question_max_length = 20, 
                 utterance_max_length = 99, 
                 lr = 0.001, epoch = 10, 
                 batch_size = 100,
                 eval_every_N_steps = 500):
        '''
        Neural Network to train question and answering model
        '''
        self.input_layer_size = question_max_length + utterance_max_length + 1 # 1 is for <GO>
        self.output_layer_size = 2 # just the same shape as labels
        self.layers = [self.input_layer_size] + hidden_layers + [self.output_layer_size] # [2] is for output layer
        self.layers_num = len(self.layers)
        self.weights = [np.random.randn(y, x) for x,y in zip(self.layers[:-1], self.layers[1:])]
        self.biases = [np.random.randn(x, 1) for x in self.layers[1:]]
        self.epoch = epoch
        self.lr = lr
        self.batch_size = batch_size
        self.eval_every_N_steps = eval_every_N_steps
        self.test_data = corpus.load_test()

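(1) input_layer_size:

self.input_layer_size = question_max_length + utterance_max_length + 1 # 1 is for <GO>

According to the author's notes, preprocessing adds two auxiliary tokens to the vocabulary (vocab): <PAD> and <GO>. Let Q be the question sequence and U the reply sequence. The input sequence can then be written as:

(Q_1, Q_2, ..., Q_question_max_length, <GO>, U_1, U_2, ..., U_utterance_max_length)

Here question_max_length is the maximum question length in the model, and utterance_max_length is the maximum reply length.

So input_layer_size is the length of the input layer: maximum question length + separator + maximum reply length, which defaults to 20 + 1 + 99 = 120.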
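The layout above can be sketched in a few lines. This is only an illustration under stated assumptions: the token ids PAD_ID and GO_ID, the helper names pad_or_truncate and build_input, and the choice to pad on the right are all hypothetical, and the project's own preprocessing may differ.

```python
# Hypothetical token ids, chosen only for illustration.
PAD_ID = 0
GO_ID = 1

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut the sequence to max_length, or right-pad it with pad_id."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))

def build_input(question_ids, utterance_ids,
                question_max_length=20, utterance_max_length=99):
    """Concatenate the padded question, <GO>, and the padded reply."""
    return (pad_or_truncate(question_ids, question_max_length)
            + [GO_ID]
            + pad_or_truncate(utterance_ids, utterance_max_length))

example = build_input([5, 8, 13], [21, 34, 55, 89])
print(len(example))  # 120
print(example[20])   # 1, the <GO> separator sits at index 20
```

Whatever the actual question and reply lengths are, the result always has 120 entries, which is why the input layer can have a fixed size.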

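(2) output_layer_size:

self.output_layer_size = 2 # just the same shape as labels

According to the author's notes, a reply can be either a positive or a negative example. A positive example is labeled [1,0] and a negative example is labeled [0,1].

So output_layer_size is the length of the output layer, and its value is 2.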
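As a small illustration of this labeling scheme, the sketch below encodes the two cases. The helper name encode_label is hypothetical and does not come from the project.

```python
import numpy as np

def encode_label(is_positive):
    """Return [1, 0] for a positive reply and [0, 1] for a negative one."""
    return np.array([1, 0]) if is_positive else np.array([0, 1])

positive = encode_label(True)
negative = encode_label(False)
print(positive, negative)  # [1 0] [0 1]
print(positive.shape)      # (2,)
```

Each label has two entries, which matches the two neurons of the output layer.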

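(3) layers:

self.layers = [self.input_layer_size] + hidden_layers + [self.output_layer_size] # [2] is for output layer

According to the author's notes, hidden_layers describes the hidden layers. For example, [100, 50] means two hidden layers with 100 and 50 neurons respectively.

So layers describes the layout of the whole network. Its default value is [120, 100, 50, 2], meaning: an input layer of 120 neurons, hidden layer 1 with 100 neurons, hidden layer 2 with 50 neurons, and an output layer of 2 neurons.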

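(4) layers_num:

self.layers_num = len(self.layers)

layers_num is the number of layers. With 1 input layer, 2 hidden layers, and 1 output layer there are 4 layers in total, so the default value is 4.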
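The two values can be reproduced outside the class. The sketch below repeats the same expressions with the default arguments, then with a single hidden layer of 64 neurons (an arbitrary value chosen for illustration) to show how the layout follows hidden_layers.

```python
question_max_length = 20
utterance_max_length = 99
hidden_layers = [100, 50]

input_layer_size = question_max_length + utterance_max_length + 1  # +1 for <GO>
output_layer_size = 2
layers = [input_layer_size] + hidden_layers + [output_layer_size]
layers_num = len(layers)
print(layers)      # [120, 100, 50, 2]
print(layers_num)  # 4

# A different hidden-layer setting changes both the layout and the count.
small_layers = [input_layer_size] + [64] + [output_layer_size]
print(small_layers, len(small_layers))  # [120, 64, 2] 3
```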

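(5) weights:

self.weights = [np.random.randn(y, x) for x,y in zip(self.layers[:-1], self.layers[1:])]

weights, and biases after it, deserve a closer look. Many readers follow everything up to this point and then get lost here. The expression can be broken into several parts, each explained below.

self.layers[:-1]: from the analysis above, this is the layout without the output layer. Its default value is [120, 100, 50].

self.layers[1:]: this is the layout without the input layer. Its default value is [100, 50, 2].

zip(self.layers[:-1], self.layers[1:]): zip pairs up the elements at the same position in the two lists. Here the resulting pairs are [(120, 100), (100, 50), (50, 2)].

These pairs stand for the connections from the input layer to hidden layer 1, from hidden layer 1 to hidden layer 2, and from hidden layer 2 to the output layer.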
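The slicing and pairing can be tried on the default layout. Note that in Python 3 zip returns a lazy iterator, so the sketch wraps it in list() to print the pairs.

```python
layers = [120, 100, 50, 2]

without_output = layers[:-1]  # [120, 100, 50]
without_input = layers[1:]    # [100, 50, 2]

# Each pair is (neurons in one layer, neurons in the next layer).
pairs = list(zip(without_output, without_input))
print(pairs)  # [(120, 100), (100, 50), (50, 2)]
```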

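np.random.randn(y, x):

numpy.random.randn(d0, d1, ..., dn)

randn returns one sample or an array of samples drawn from the standard normal distribution.
d0 through dn give the size of each dimension.
The return value is an array of the specified shape.

Because the call is randn(y, x), where x is the size of the earlier layer and y the size of the next one, it returns arrays of 100 rows by 120 columns, 50 rows by 100 columns, and 2 rows by 50 columns.

At this point weights is fully determined: [array(100 rows x 120 columns), array(50 rows x 100 columns), array(2 rows x 50 columns)].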
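The shapes can be checked directly. This minimal sketch reuses the list comprehension from the source on the default layout [120, 100, 50, 2]; each matrix has as many rows as the next layer has neurons and as many columns as the previous layer has neurons.

```python
import numpy as np

layers = [120, 100, 50, 2]
weights = [np.random.randn(y, x) for x, y in zip(layers[:-1], layers[1:])]

print([w.shape for w in weights])  # [(100, 120), (50, 100), (2, 50)]
```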

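(6) biases:

self.biases = [np.random.randn(x, 1) for x in self.layers[1:]]

self.layers[1:]: this is the layout without the input layer. Its default value is [100, 50, 2].

np.random.randn(x, 1):

This returns arrays of 100 rows by 1 column, 50 rows by 1 column, and 2 rows by 1 column.

At this point biases is fully determined as well: [array(100 rows x 1 column), array(50 rows x 1 column), array(2 rows x 1 column)].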
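One way to see why the weights and biases have these shapes is to push a column vector through the layers and watch the shape change. The sketch below is an assumption-laden illustration, not the project's forward pass: the sigmoid activation and the random input vector are chosen here only to make the shape check runnable, since this part of the source does not show how the parameters are used.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function (assumed activation, for illustration)."""
    return 1.0 / (1.0 + np.exp(-z))

layers = [120, 100, 50, 2]
weights = [np.random.randn(y, x) for x, y in zip(layers[:-1], layers[1:])]
biases = [np.random.randn(x, 1) for x in layers[1:]]

a = np.random.randn(layers[0], 1)  # a stand-in input as a 120 x 1 column vector
shapes = [a.shape]
for w, b in zip(weights, biases):
    # (next x prev) dot (prev x 1) gives (next x 1), which matches the bias shape.
    a = sigmoid(np.dot(w, a) + b)
    shapes.append(a.shape)

print([b.shape for b in biases])  # [(100, 1), (50, 1), (2, 1)]
print(shapes)                     # [(120, 1), (100, 1), (50, 1), (2, 1)]
```

Each bias is a column vector with one entry per neuron in its layer, so it can be added to the product of that layer's weight matrix and the previous layer's output without any reshaping.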
