Neural Networks: CNN Structure and Applications in Speech Recognition

Original

2020-02-20 15:53

I. Basic Structure

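Introductory reading: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
Reference: Chapter 9 of Deep Learning (Ian Goodfellow et al.)
cross-correlation: $S(i,j) = (I * K)(i,j) = \sum_m \sum_n I(i+m, j+n)\, K(m,n)$
convolution: $S(i,j) = (I * K)(i,j) = \sum_m \sum_n I(i-m, j-n)\, K(m,n)$
The two operations differ only in whether the kernel is flipped; in practice both are simply called convolution.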
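
The difference between the two formulas can be checked numerically. The following is a minimal NumPy sketch, not taken from any library: the function names `cross_correlate2d` and `convolve2d` are illustrative, and only the "valid" region (no padding) is computed, so the convolution matches the formula above up to an index offset.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid cross-correlation: S(i,j) = sum_m sum_n I(i+m, j+n) K(m, n)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Slide the unflipped kernel over the image patch at (i, j).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Valid convolution: cross-correlation with the kernel flipped on both axes."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 2.0], [3.0, 4.0]])

corr = cross_correlate2d(image, kernel)
conv = convolve2d(image, kernel)
print(corr[0, 0], conv[0, 0])  # the two results differ for an asymmetric kernel
```

For a kernel that is symmetric under flipping, the two operations give the same output, which is one reason the distinction rarely matters when the kernel is learned.
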
Three advantages [1]:
- sparse interactions
- parameter sharing
- equivariant representations

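Three stages:
convolution: several convolutions run in parallel to produce a set of linear activations
nonlinearity: each linear activation is passed through a nonlinear activation function
pooling: keeps the output approximately invariant to small perturbations of the input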
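
To illustrate the invariance claim for pooling, here is a small sketch with a hypothetical helper `max_pool1d` and made-up input values. A one-sample shift of the input leaves the pooled output unchanged in this case; the invariance is only approximate and does not hold for every shift.

```python
import numpy as np

def max_pool1d(x, pool_size):
    """Non-overlapping max pooling; trailing samples that do not fill a window are dropped."""
    n = len(x) // pool_size
    return x[:n * pool_size].reshape(n, pool_size).max(axis=1)

x = np.array([0.0, 0.0, 1.0, 0.0, 0.5, 0.0])
x_shifted = np.roll(x, -1)  # same signal moved one sample to the left

pooled = max_pool1d(x, 3)
pooled_shifted = max_pool1d(x_shifted, 3)
print(pooled, pooled_shifted)
```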

II. Kaldi Implementation

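Reference: the nnet2 implementation in Kaldi, nnet2/nnet-component.h
- CNN input: 36*33
Assume a feature dimension of 36*1 per frame
With first- and second-order deltas appended: 36*3
Splicing 5 frames on each side (11 frames in total): 36*33 (a two-dimensional image that serves as the CNN input)
- Filter: 7*33*128
First dimension: frequency axis size of the patch
Second dimension: time axis size of the patch
Third dimension: number of output feature maps
Step setting: patch_step_=1
Patches span several frequency bands and the whole time axis
- Output feature maps: 30*1*128
First dimension: (36-7)/1+1=30
Second dimension: 33-33+1=1
Third dimension: number of output feature maps
- Max-pooling output: 10*1*128
pool_size=3, no overlap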
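
The shape arithmetic above can be reproduced in a few lines. This is only a sketch of the calculation: the variable and function names below are illustrative and are not Kaldi identifiers.

```python
def conv_output_size(input_size, patch_size, step):
    """Number of patch positions along one axis when no padding is used."""
    return (input_size - patch_size) // step + 1

feat_dim = 36                       # static features per frame
num_streams = 3                     # static + delta + delta-delta
left_context = right_context = 5    # spliced frames on each side
time_dim = num_streams * (left_context + 1 + right_context)

freq_out = conv_output_size(feat_dim, 7, 1)    # patch covers 7 frequency bands
time_out = conv_output_size(time_dim, 33, 1)   # patch covers the whole time axis
num_filters = 128
pool_size = 3
pooled_freq = freq_out // pool_size            # non-overlapping max pooling

print("input:", (feat_dim, time_dim))
print("conv output:", (freq_out, time_out, num_filters))
print("pooled output:", (pooled_freq, time_out, num_filters))
```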

III. Network Variants

(1) Network in Network

Lin, M., Chen, Q., and Yan, S. Network in network. In Proc. ICLR, 2014.

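Two innovations:

- global average pooling: removes the need for fully connected layers and so reduces the parameter count; fully connected layers are prone to overfitting.

- mlpconv: replaces the linear convolution filter with a small multilayer perceptron that is slid over the input.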
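
As a rough illustration of why global average pooling saves parameters, the sketch below averages each feature map to a single score and compares that with the weight count of a dense layer on the flattened maps. The map size (6*6) and the number of classes (10) are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

def global_average_pool(feature_maps):
    """feature_maps has shape (channels, height, width); returns one value per channel."""
    return feature_maps.mean(axis=(1, 2))

rng = np.random.default_rng(0)
num_classes, height, width = 10, 6, 6   # assumed sizes, one feature map per class
feature_maps = rng.standard_normal((num_classes, height, width))

scores = global_average_pool(feature_maps)   # class scores with no extra weights

# A dense layer on the flattened maps would need weights plus biases.
fc_params = num_classes * height * width * num_classes + num_classes
gap_params = 0
print(scores.shape, fc_params, gap_params)
```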

(2) VGGNet

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. Oxford
Runner-up in the ILSVRC 2014 classification task.

Increases the network depth to 16-19 layers while using smaller filters (3*3).
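
One way to see the benefit of small filters is to compare a stack of 3*3 layers with a single larger filter that has the same receptive field. The sketch below assumes stride-1 convolutions, equal input and output channel counts, and ignores biases; the channel count of 64 is an arbitrary example value.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (no biases) with `channels` input and output channels per layer."""
    return num_layers * kernel_size * kernel_size * channels * channels

channels = 64  # example value only

# Two 3*3 layers see a 5*5 region, three 3*3 layers see a 7*7 region.
print(stacked_receptive_field(3, 2), stacked_receptive_field(3, 3))
print(conv_weights(3, channels, 2), "vs", conv_weights(5, channels))
print(conv_weights(3, channels, 3), "vs", conv_weights(7, channels))
```

The stacked version uses fewer weights and also inserts an extra nonlinearity between the layers.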

(3) ResNet

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
winner of ILSVRC 2015

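Problem addressed: increasing the number of layers can improve the capacity of a network, but beyond a certain depth performance saturates, and going deeper still leads to the degradation problem; for example, a 56-layer network has higher error than a 20-layer one. The authors conjecture that deep plain networks may "have exponentially low convergence rates, which impact the reducing of the training error".
Method: introduce shortcut connections, which speed up convergence and address the degradation problem without introducing any new parameters.
Result: going from the roughly 30 layers used previously to 152 layers, performance keeps improving; beyond 1000 layers, however, the training error stays small while the test error gets worse, which is attributed to overfitting.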
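
The shortcut idea can be sketched in a few lines. This is a simplified illustration, not the architecture from the paper: it uses fully connected layers instead of convolutions, omits batch normalization, and the names `residual_block` and `plain_block` are made up for the example. Both blocks use exactly the same weights, so the identity shortcut adds no parameters.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def plain_block(x, w1, w2):
    """Two stacked layers without a shortcut."""
    return relu(w2 @ relu(w1 @ x))

def residual_block(x, w1, w2):
    """Same two layers, with the input added back before the final nonlinearity."""
    return relu(w2 @ relu(w1 @ x) + x)

rng = np.random.default_rng(0)
dim = 8
x = rng.standard_normal(dim)
w1 = rng.standard_normal((dim, dim))
w2 = np.zeros((dim, dim))  # residual branch switched off

# With a zero residual branch the block passes relu(x) through,
# whereas the plain block loses the input entirely.
print(residual_block(x, w1, w2))
print(plain_block(x, w1, w2))
```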

IV. Speech Applications

(1) deep-cnn

D. Yu, W. Xiong, J. Droppo, A. Stolcke, G. Ye, J. Li, and G. Zweig, “Deep convolutional neural networks with layer-wise context expansion and attention”, in Proc. Interspeech, 2016.

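Two innovations:
- context expansion: layer n takes a context window of the outputs of layer n-1 as its input.
- attention mechanism: for the spectral input to a convolutional layer, points at different times and frequencies may differ in importance (the frame at the current time step matters more than the frames a few steps before or after it). An importance weight matrix is introduced (with weights initialized to 1), and before the convolution at each layer the input is first multiplied element-wise by this matrix, which amounts to weighting the input by importance.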
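
The element-wise weighting step can be sketched as follows. The matrix sizes and the "trained" weight values below are invented for illustration; in the paper the weights are learned, and only the initialization to 1 comes from the description above.

```python
import numpy as np

freq_bins, frames = 4, 5
layer_input = np.arange(freq_bins * frames, dtype=float).reshape(freq_bins, frames)

# Initialized to 1, the importance matrix leaves the input unchanged.
importance = np.ones((freq_bins, frames))
weighted = importance * layer_input

# Hypothetical weights after training: emphasize the centre frame,
# de-emphasize the outermost frames.
importance_trained = importance.copy()
importance_trained[:, frames // 2] = 1.5
importance_trained[:, [0, -1]] = 0.5
weighted_trained = importance_trained * layer_input  # this is what the convolution would see
print(weighted_trained)
```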

Parameters:
- jump blocks: 4
- each block: (20; 31; 128), (10; 16; 256), (5; 8; 512), (3; 4; 1024)
The final results are better than those of DNN and LSTM.

T. Sercu and V. Goel, “Advances in Very Deep Convolutional Neural Networks for LVCSR,” in INTERSPEECH, 2016.

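Compares time pooling and time padding along the time dimension; pooling performs better than no pooling.
However, pooling or padding causes problems for full-utterance discriminative training: pooling reduces the number of output frames, and padding changes the outputs at the edges.
Proposes a larger context window to solve these problems: no pooling and no padding, at the cost of some extra computation.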
Batch Normalization

The idea of BN is to standardize the internal representations inside the network (i.e. the layer outputs), which helps the network to converge faster and generalize better, inspired by the way whitening the network input improves performance. BN is implemented by standardizing the output of a layer before applying the nonlinearity, using the local mean and variance computed over the minibatch, then correcting with a learned variance and bias term:

$$\mathrm{BN}(x) = \gamma \, \frac{x - \mathrm{E}(x)}{\left(\mathrm{Var}(x) + \epsilon\right)^{1/2}} + \beta$$
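
A minimal NumPy sketch of the training-time BN formula is given below. It assumes a 2-D input of shape (batch, features) and an `eps` of 1e-5, both chosen for the example; running statistics for inference are left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the minibatch, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)   # E(x), per feature
    var = x.var(axis=0)     # Var(x), per feature
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4)) * 3.0 + 7.0   # minibatch with non-zero mean, non-unit variance

y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
y_shifted = batch_norm(x, gamma=2.0 * np.ones(4), beta=np.ones(4))
print(y.mean(axis=0), y.std(axis=0))
```

With gamma set to 1 and beta set to 0 the output has approximately zero mean and unit variance per feature; other values of gamma and beta let the network rescale and shift the standardized output.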

(2) ctc-cnn

Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., Courville, A. (2016) Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. Proc. Interspeech 2016, 410-414.

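Performance is comparable to LSTM, with a 2.5X speedup at the same parameter count.
The earlier LSTM network structure is replaced with a CNN followed by fully connected layers, and the top layer is trained with the CTC criterion.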

W. Song and J. Cai, “End-to-End Deep Neural Network for Automatic Speech Recognition,” Technical Report, Stanford, 2015.

CNNs are exceptionally good at capturing high-level features in the spatial domain and have demonstrated unparalleled success in computer vision tasks. One natural advantage of using a CNN is that it is invariant to translations along the frequency axis, a kind of variation commonly observed across speakers whose pitch differs with age or gender.

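A time window over the data frames yields a single-channel image. A 5X3 filter is used, given that the frequency dimension is longer than the time dimension.
A frame classifier is first trained with CNN+softmax; the CNN parameters are then frozen, and the softmax is replaced by DNN+RNN+CTC for CTC training. Pretraining with the CNN works somewhat better than training with CTC directly.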

xmucas
