[NLP] LSTM Text Generation in Python (Pure NumPy)

Preface

GitHub: code download
RNNs suffer from the vanishing-gradient problem. Hochreiter & Schmidhuber (1997) proposed the LSTM (Long Short-Term Memory) unit to address it, and most NLP tasks today, such as machine translation and text generation, use LSTMs; the GRU plays a similar role.

Dataset

The dataset is the same corpus of classical Chinese poems used in the previous post on RNN text generation, so it is not described again here.

Algorithm Implementation

The equations below are based on this reference: LSTM formulas. The original does not say which activation function connects the hidden layer to the output layer, nor which loss function is used, so this post completes the derivation with softmax as the output activation and cross-entropy as the loss, with a few modifications for consistency.

Forward Propagation

$$
\begin{aligned}
a^t &= \tanh(W_a x^t + U_a h^{t-1}) = \tanh(\hat{a}^t) \\
i^t &= \sigma(W_i x^t + U_i h^{t-1}) = \sigma(\hat{i}^t) \\
f^t &= \sigma(W_f x^t + U_f h^{t-1}) = \sigma(\hat{f}^t) \\
o^t &= \sigma(W_o x^t + U_o h^{t-1}) = \sigma(\hat{o}^t) \\
c^t &= i^t \odot a^t + f^t \odot c^{t-1} \\
h^t &= o^t \odot \tanh(c^t) \\
y^t &= \operatorname{softmax}(W_y h^t)
\end{aligned}
$$
The loss function is the cross-entropy over all time steps:
$$
E = -\sum_{t = 1}^{N} \log\left( y^t_{k_t} \right)
$$
where $k_t$ is the index of the target word at time step $t$.
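One step worth making explicit (this is the standard softmax-with-cross-entropy result, and it is used directly in the backward pass below): the gradient of $E$ with respect to the pre-softmax scores $z^t = W_y h^t$ is

$$
\frac{\partial E}{\partial z^t_j} =
\begin{cases}
y^t_j - 1, & j = k_t \\
y^t_j, & j \neq k_t
\end{cases}
$$

which is exactly the $\delta y^t$ that appears in the backpropagation section.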
The code is largely unchanged from the RNN text-generation post; once the LSTM unit is added, only the weight initialization, the forward pass, and the backward pass are written differently.

import numpy as np

class LSTM:
    def __init__(self):
        self.wordDim = 6000
        self.hiddenDim = 100
        self.Wi, self.Ui = self.initWeights()   # input gate
        self.Wf, self.Uf = self.initWeights()   # forget gate
        self.Wo, self.Uo = self.initWeights()   # output gate
        self.Wa, self.Ua = self.initWeights()   # candidate ("memory") gate
        self.Wy = np.random.uniform(-np.sqrt(1. / self.wordDim), np.sqrt(1. / self.wordDim), (self.wordDim, self.hiddenDim))  # hidden-to-output weight matrix (6000, 100)

    def initWeights(self):
        W = np.random.uniform(-np.sqrt(1. / self.wordDim), np.sqrt(1. / self.wordDim), (self.hiddenDim, self.wordDim))  # input-to-hidden weight matrix (100, 6000)
        U = np.random.uniform(-np.sqrt(1. / self.hiddenDim), np.sqrt(1. / self.hiddenDim), (self.hiddenDim, self.hiddenDim))  # hidden-to-hidden weight matrix (100, 100)
        return W, U
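The snippets also call `self.sigmoid` and `self.softmax`, which the post never shows. A minimal sketch consistent with how they are used (assumed helpers, not the original code):

    def sigmoid(self, x):
        # element-wise logistic function used by the three gates
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(self, x):
        # subtract the max before exponentiating for numerical stability
        e = np.exp(x - np.max(x))
        return e / np.sum(e)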

Next is the code for the forward pass.

    def forward(self, data):  # forward pass over one sample (a sequence of word indices)
        T = len(data)
        output = np.zeros((T, self.wordDim, 1))     # per-step output distributions
        hidden = np.zeros((T+1, self.hiddenDim, 1)) # hidden states; hidden[-1] is never written, so it serves as the zero initial state
        cPre = np.zeros((self.hiddenDim, 1))        # initial cell state
        states = list()
        for t in range(T):  # loop over time steps
            state = dict()
            X = np.zeros((self.wordDim, 1))  # build a (6000, 1) one-hot vector
            X[data[t]][0] = 1                # set the entry for the current word
            a = np.tanh(np.dot(self.Wa, X) + np.dot(self.Ua, hidden[t-1]))
            i = self.sigmoid(np.dot(self.Wi, X) + np.dot(self.Ui, hidden[t-1]))
            f = self.sigmoid(np.dot(self.Wf, X) + np.dot(self.Uf, hidden[t-1]))
            o = self.sigmoid(np.dot(self.Wo, X) + np.dot(self.Uo, hidden[t-1]))
            c = np.multiply(i, a) + np.multiply(f, cPre)
            state['a'] = a
            state['i'] = i
            state['f'] = f
            state['o'] = o
            state['c'] = c
            states.append(state.copy())
            cPre = c
            hidden[t] = np.multiply(o, np.tanh(c))
            y = self.softmax(np.dot(self.Wy, hidden[t]))
            output[t] = y
        # append a dummy state so that states[-1]['c'] yields a zero initial cell state during backpropagation
        state = dict()
        state['c'] = np.zeros((self.hiddenDim, 1))
        states.append(state.copy())
        return hidden, output, states
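As a quick illustration, a hypothetical call to `forward` on a toy sequence (the word indices here are made up):

model = LSTM()
data = [5, 42, 7]                             # three made-up word indices
hidden, output, states = model.forward(data)
print(output.shape)                           # (3, 6000, 1): one distribution over the vocabulary per time step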

Backpropagation

The equations for the backward pass are as follows:

$$
\begin{aligned}
\delta y^t_j &= \begin{cases} y^t_j - 1, & j = k_t \\ y^t_j, & j \neq k_t \end{cases} \\
\delta h^t &= \frac{\partial E}{\partial h^t} = (W_y)^T \delta y^t \\
\delta o^t &= \delta h^t \odot \tanh(c^t) \\
\delta c^t &\mathrel{+}= \delta h^t \odot o^t \odot (1 - \tanh^2(c^t)) \\
\delta i^t &= \delta c^t \odot a^t \\
\delta f^t &= \delta c^t \odot c^{t-1} \\
\delta a^t &= \delta c^t \odot i^t \\
\delta \hat{a}^t &= \delta a^t \odot (1 - \tanh^2(\hat{a}^t)) = \delta c^t \odot i^t \odot (1 - (a^t)^2) \\
\delta \hat{i}^t &= \delta i^t \odot i^t \odot (1 - i^t) = \delta c^t \odot a^t \odot i^t \odot (1 - i^t) \\
\delta \hat{f}^t &= \delta f^t \odot f^t \odot (1 - f^t) = \delta c^t \odot c^{t-1} \odot f^t \odot (1 - f^t) \\
\delta \hat{o}^t &= \delta o^t \odot o^t \odot (1 - o^t) = \delta h^t \odot \tanh(c^t) \odot o^t \odot (1 - o^t) \\
W_g &\leftarrow W_g - \eta\, \delta \hat{g}^t (x^t)^T, \qquad g \in \{a, i, f, o\} \\
U_g &\leftarrow U_g - \eta\, \delta \hat{g}^t (h^{t-1})^T, \qquad g \in \{a, i, f, o\}
\end{aligned}
$$
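The $\mathrel{+}=$ on the $\delta c^t$ line hides a recurrence worth spelling out: because $c^t$ feeds into $c^{t+1}$ through the forget gate ($\partial c^{t+1} / \partial c^t = f^{t+1}$), the cell-state gradient arriving from step $t+1$ comes gated by $f^{t+1}$:

$$
\delta c^t = \delta c^{t+1} \odot f^{t+1} + \delta h^t \odot o^t \odot (1 - \tanh^2(c^t))
$$

The code below implements this by carrying `deltaCPre = deltaC * f` from one iteration of the time loop to the next.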

    def backPropagation(self, data, label, alpha = 0.002):  # backward pass over one sample
        hidden, output, states = self.forward(data)  # output: (T, 6000, 1)
        T = len(output)  # number of time steps
        deltaCPre = np.zeros((self.hiddenDim, 1))  # cell-state gradient flowing in from step t+1 (zero at t = T-1)
        WiUpdate = np.zeros_like(self.Wi)
        WfUpdate = np.zeros_like(self.Wf)
        WoUpdate = np.zeros_like(self.Wo)
        WaUpdate = np.zeros_like(self.Wa)
        UiUpdate = np.zeros_like(self.Ui)
        UfUpdate = np.zeros_like(self.Uf)
        UoUpdate = np.zeros_like(self.Uo)
        UaUpdate = np.zeros_like(self.Ua)
        WyUpdate = np.zeros_like(self.Wy)
        for t in range(T-1, -1, -1):
            c = states[t]['c']
            i = states[t]['i']
            f = states[t]['f']
            o = states[t]['o']
            a = states[t]['a']
            cPre = states[t-1]['c']  # at t = 0 this reads the dummy zero state appended in forward()
            X = np.zeros((self.wordDim, 1))  # rebuild the (6000, 1) one-hot input vector
            X[data[t]][0] = 1
            output[t][label[t]][0] -= 1  # softmax/cross-entropy gradient: subtract 1 at the target index
            deltaK = output[t].copy()    # error at the output layer
            deltaH = np.dot(self.Wy.T, deltaK)  # note: the gradient flowing into h^t from step t+1 (via the U matrices) is ignored here for simplicity
            deltaO = np.multiply(np.multiply(deltaH, np.tanh(c)), o * (1 - o))
            deltaC = deltaCPre + deltaH * o * (1 - np.tanh(c) ** 2)
            deltaCPre = np.multiply(deltaC, f)  # gate the cell gradient by f^t before passing it to step t-1
            deltaA = np.multiply(np.multiply(deltaC, i), 1 - a ** 2)  # a = tanh(a_hat)
            deltaI = np.multiply(np.multiply(deltaC, a), i * (1 - i))
            deltaF = np.multiply(np.multiply(deltaC, cPre), f * (1 - f))
            WiUpdate += np.dot(deltaI, X.T)
            WfUpdate += np.dot(deltaF, X.T)
            WaUpdate += np.dot(deltaA, X.T)
            WoUpdate += np.dot(deltaO, X.T)
            UiUpdate += np.dot(deltaI, hidden[t-1].T)
            UfUpdate += np.dot(deltaF, hidden[t-1].T)
            UaUpdate += np.dot(deltaA, hidden[t-1].T)
            UoUpdate += np.dot(deltaO, hidden[t-1].T)
            WyUpdate += np.dot(deltaK, hidden[t].T)
        self.Wi -= alpha * WiUpdate
        self.Wf -= alpha * WfUpdate
        self.Wa -= alpha * WaUpdate
        self.Wo -= alpha * WoUpdate
        self.Ui -= alpha * UiUpdate
        self.Uf -= alpha * UfUpdate
        self.Ua -= alpha * UaUpdate
        self.Uo -= alpha * UoUpdate
        self.Wy -= alpha * WyUpdate

Those are the equations and code for the backward pass; the rest of the code is the same as in the RNN text-generation post.
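Since that shared code is not reproduced here, a minimal sketch of a loss function and training loop consistent with this class (my assumption, mirroring the structure of the RNN post rather than quoting it) might look like:

    def computeLoss(self, data, label):
        # total cross-entropy: E = -sum_t log(y^t at the target index)
        _, output, _ = self.forward(data)
        return -sum(np.log(output[t][label[t]][0]) for t in range(len(data)))

    def train(self, corpus, labels, epochs=10, alpha=0.002):
        # corpus/labels: lists of index sequences; each label sequence is the
        # input sequence shifted by one step, as usual for next-word prediction
        for epoch in range(epochs):
            loss = sum(self.computeLoss(d, l) for d, l in zip(corpus, labels))
            print('epoch %d, loss %.4f' % (epoch, loss))
            for d, l in zip(corpus, labels):
                self.backPropagation(d, l, alpha)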
