End-to-End Speech Recognition (3): Sequence to Sequence and Attention

Original

2018-09-04 12:14

History

encoder-decoder

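In 2014, Kyunghyun Cho et al. [1] proposed the RNN Encoder-Decoder network architecture, used mainly for machine translation.
The encoder maps a variable-length input sequence to a fixed-length vector, and the decoder maps that vector to another variable-length output sequence. The network is defined as follows: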

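encoder:

$$h_{\langle t \rangle} = f\left(h_{\langle t-1 \rangle}, x_t\right)$$

decoder:

$$h_{\langle t \rangle} = f\left(h_{\langle t-1 \rangle}, y_{t-1}, c\right)$$

$$P\left(y_t \mid y_{t-1}, \ldots, y_1, c\right) = g\left(h_{\langle t \rangle}, y_{t-1}, c\right)$$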

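where $c$ is the encoder hidden state $h$ at the final time step, and $f()$ is a unit similar to a simplified LSTM cell, with a reset gate and an update gate, so that it can capture both short-term and long-term dependencies.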
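The encoder recursion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from [1]: the weight names (`W_r`, `U_r`, and so on), the random initialization, and the helper names `init_params`, `gated_step` and `encode` are all assumptions made for the example. It shows how the reset gate and the update gate combine inside $f()$, and that sequences of different lengths end up as a vector $c$ of the same size.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(input_dim, hidden_dim, rng):
    """Random weights for one gated unit (illustrative initialization only)."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {
        "W_r": mat(hidden_dim, input_dim), "U_r": mat(hidden_dim, hidden_dim),
        "W_z": mat(hidden_dim, input_dim), "U_z": mat(hidden_dim, hidden_dim),
        "W_h": mat(hidden_dim, input_dim), "U_h": mat(hidden_dim, hidden_dim),
    }

def gated_step(params, h_prev, x_t):
    """One application of f(): h_t = f(h_{t-1}, x_t)."""
    # Reset gate: how much of the previous state feeds the candidate state.
    r = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev)
    # Update gate: how much of the previous state is carried over unchanged.
    z = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev)
    h_tilde = np.tanh(params["W_h"] @ x_t + params["U_h"] @ (r * h_prev))
    return z * h_prev + (1.0 - z) * h_tilde

def encode(params, xs):
    """Run the encoder over xs (T, input_dim) and return c, the last hidden state."""
    h = np.zeros(params["U_r"].shape[0])
    for x_t in xs:
        h = gated_step(params, h, x_t)
    return h

rng = np.random.default_rng(0)
params = init_params(input_dim=3, hidden_dim=4, rng=rng)
c_short = encode(params, rng.normal(size=(5, 3)))    # 5 input frames
c_long = encode(params, rng.normal(size=(12, 3)))    # 12 input frames
print(c_short.shape, c_long.shape)  # (4,) (4,)
```

Both inputs are compressed into a 4-dimensional vector regardless of length, which is exactly the fixed-size bottleneck that the attention mechanism later relaxes.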

sequence to sequence

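In 2014, Ilya Sutskever et al. [2] at Google proposed the sequence to sequence learning method for English-to-French translation.

[Figure: overall framework of the sequence to sequence model]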

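Compared with [1], the main differences are that the network uses LSTM and that the input sequence is reversed, which mitigates the performance degradation on long sequences.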
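The input reversal is a data preparation step rather than a change to the network. A minimal sketch, assuming token lists as input and a hypothetical helper name `make_training_pair` and end-of-sequence marker `<EOS>`: only the source side is reversed, while the target keeps its natural order.

```python
def make_training_pair(source_tokens, target_tokens, eos="<EOS>"):
    """Reverse only the source side; the target keeps its natural order."""
    encoder_input = list(reversed(source_tokens))
    decoder_target = list(target_tokens) + [eos]
    return encoder_input, decoder_target

enc_in, dec_out = make_training_pair(["a", "b", "c"], ["x", "y", "z"])
print(enc_in)   # ['c', 'b', 'a']
print(dec_out)  # ['x', 'y', 'z', '<EOS>']
```

After reversal, the first source token `a` is read last by the encoder, so it sits close in time to the first target token `x`. These short-range dependencies are the usual explanation for why the trick helps optimization.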

attention

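Graves [3] first introduced an attention mechanism in 2013, for handwriting synthesis. Unlike plain sequence generation, the model uses additional input information through a soft window when making each prediction. While generating predictions dynamically, it also determines the alignment between the text and the pen locations.
Dzmitry Bahdanau et al. [4] replaced the $c$ in the decoder of [1] with $c_i$, so that the output probabilities at different time steps $i$ are no longer computed from the same $c$.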

The computation of $c_i$ depends on the input annotations $(h_1, \ldots, h_{T_x})$, as follows:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

$$e_{ij} = a(s_{i-1}, h_j)$$

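Here $a()$ is represented by a feedforward neural network and is trained jointly with the encoder-decoder, so the model has to learn the alignment while it learns to translate. $\alpha_{ij}$ is the probability that output $y_i$ is aligned to input $x_j$, which amounts to introducing an attention mechanism. This relieves some of the pressure on the encoder, which previously had to map all of the input information into a single fixed vector $c$.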
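The three attention equations translate almost line by line into NumPy. The sketch below assumes one common parameterization of $a()$, a single hidden layer with tanh, $e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$; the function name `additive_attention`, the dimensions and the random weights are illustrative assumptions, not values from any of the cited papers.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Compute the context vector c_i and the weights alpha_i for one decoder step.

    s_prev: previous decoder state s_{i-1}, shape (n,)
    H:      encoder annotations h_1..h_Tx, shape (T_x, m)
    W_a:    (d, n), U_a: (d, m), v_a: (d,)
    """
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for all j at once.
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # shape (T_x,)
    # Softmax over the input positions j (shifted by the max for stability).
    e = e - e.max()
    alpha = np.exp(e) / np.exp(e).sum()
    # c_i = sum_j alpha_ij h_j
    c = alpha @ H                                    # shape (m,)
    return c, alpha

rng = np.random.default_rng(0)
T_x, m, n, d = 6, 5, 4, 3
H = rng.normal(size=(T_x, m))
s_prev = rng.normal(size=n)
W_a = rng.normal(size=(d, n))
U_a = rng.normal(size=(d, m))
v_a = rng.normal(size=d)

c_i, alpha_i = additive_attention(s_prev, H, W_a, U_a, v_a)
print(alpha_i.shape, c_i.shape)  # (6,) (5,)
```

Because `s_prev` changes at every decoder step, each output position gets its own weights and its own context vector, instead of sharing one fixed $c$.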

Speech Application

phone recognition

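In 2014, Jan Chorowski et al. [5] applied the encoder-decoder architecture with attention to phone recognition in speech. The attention in [4] can assign large weights to inputs that are far from the current position, which is what makes long-distance word reordering possible in translation. This paper instead constrains the attention distribution so that the inputs near the current output receive larger weights and the weight distribution moves forward over time, i.e. it is monotonic. There are two main changes:
1. The attention computation is modified by introducing d(), which learns how the weights shift forward.

2. A penalty term is added to the loss: "We penalize any alignment that maps to inputs which were already considered for output emission."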

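In 2015, Chorowski et al. [6] improved on [5]: the previous step's alignment is used directly through a convolution, the softmax function is modified to emphasize the most relevant frames, and instead of using $h$ over the whole sequence, only the $h$ within a specific window is used.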
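The three ideas in [6] can be illustrated with a small NumPy sketch. Several details here are assumptions made for the example and not taken from the paper: sharpening is done with an inverse temperature `beta`, the window is centred on the expected position under the previous alignment, and the names `location_features` and `sharpened_windowed_softmax` are hypothetical. In a full model the convolved features would feed into the scoring function; here they are only computed to show their shape.

```python
import numpy as np

def location_features(prev_alpha, filters):
    """Convolve the previous alignment with each 1-D filter: returns (T, k)."""
    return np.stack(
        [np.convolve(prev_alpha, f, mode="same") for f in filters], axis=1
    )

def sharpened_windowed_softmax(e, prev_alpha, beta=2.0, window=3):
    """Softmax over scores e (T,), restricted to a window and sharpened by beta."""
    T = e.shape[0]
    # Window centre: expected position under the previous alignment (assumption).
    center = int(round(float(np.dot(np.arange(T), prev_alpha))))
    lo, hi = max(0, center - window), min(T, center + window + 1)
    scores = beta * e[lo:hi]          # beta > 1 makes the distribution peakier
    scores = scores - scores.max()    # numerical stability
    w = np.exp(scores)
    alpha = np.zeros(T)
    alpha[lo:hi] = w / w.sum()        # frames outside the window get weight 0
    return alpha

prev_alpha = np.array([0.0, 0.1, 0.8, 0.1, 0.0, 0.0, 0.0, 0.0])
scores = np.array([0.1, 0.2, 2.0, 1.5, 0.3, 0.0, 0.1, 2.5])
filters = np.array([[0.25, 0.5, 0.25]])   # one illustrative smoothing filter

feats = location_features(prev_alpha, filters)
soft = sharpened_windowed_softmax(scores, prev_alpha, beta=1.0, window=2)
sharp = sharpened_windowed_softmax(scores, prev_alpha, beta=3.0, window=2)
print(feats.shape)                 # (8, 1)
print(int(np.argmax(sharp)))       # 2
```

Frame 7 has the highest raw score but lies outside the window around frame 2, so it receives zero weight. This is the kind of distant jump that the windowing is meant to rule out.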

speech recognition

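[5] and [6] applied attention and the encoder-decoder network mainly to phone recognition. In 2016, Dzmitry Bahdanau et al. [7] extended the approach to LVCSR, with characters as the output units and decoding combined with a language model. The paper proposes pooling to reduce the computational cost caused by long input sequences.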

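Without an external language model, it substantially outperforms the CTC approach. This is mainly because the encoder-decoder framework implicitly learns the relationships between characters, whereas in CTC the output at the current time step is independent of the output at the previous time step, so CTC cannot model the dependencies between output characters.
[8] is similar to [7]: it also outputs characters, and it applies the pooling idea through a pyramid BLSTM network structure to address the difficulty of training on very long input sequences.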
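The time reduction in a pyramid structure can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `pyramid_reduce` is hypothetical, an odd trailing frame is simply dropped, and the BLSTM layer that would sit between successive reductions in a real network is omitted so that only the change in shape is visible.

```python
import numpy as np

def pyramid_reduce(h):
    """Concatenate each pair of consecutive frames: (T, D) -> (T // 2, 2 * D)."""
    T, D = h.shape
    if T % 2 == 1:
        h = h[:-1]  # drop the last frame when T is odd (one possible convention)
    return h.reshape(T // 2, 2 * D)

frames = np.arange(8 * 3, dtype=float).reshape(8, 3)   # 8 frames, 3 features each
level1 = pyramid_reduce(frames)
level2 = pyramid_reduce(level1)
level3 = pyramid_reduce(level2)
print(frames.shape, level1.shape, level2.shape, level3.shape)
# (8, 3) (4, 6) (2, 12) (1, 24)
```

Each level halves the number of time steps, so three levels shorten the sequence by a factor of 8. The attention mechanism then has far fewer annotations to score at every decoder step.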

Reference

[1]. Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014).
[2]. Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS 2014).
[3]. Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs].
[4]. Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In ICLR.
[5]. Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv:1412.1602 [cs, stat], December 2014.
[6]. Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577–585.
[7]. Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., and Bengio, Y. (2016). End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945–4949, March 2016. doi: 10.1109/ICASSP.2016.7472618.
[8]. Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. (2015). Listen, attend and spell. arXiv preprint arXiv:1508.01211.


