Deep Neural Networks in Kaldi

This article is reposted from http://blog.csdn.net/wbgxx333/article/details/45641341,

to reinforce my own learning.

Overview

This document describes Karel Vesely's deep neural network code in Kaldi.

For an overview of all the deep neural network code in Kaldi, see Deep Neural Networks in Kaldi; for Dan's version, see Dan's DNN implementation.

The goal of this document is to describe the DNN part in more detail and to briefly introduce the neural-network training tools. We start from the Top-level script, explain what the Training script internals do, show some Advanced features, give a short introduction to The C++ code, and explain how to extend it.

Top-level script

Let's look at the script egs/wsj/s5/local/nnet/run_dnn.sh. It assumes a single CUDA GPU and Kaldi compiled with CUDA support (you can check for 'CUDA = true' in src/kaldi.mk). We also assume that 'cuda_cmd' is set correctly in egs/wsj/s5/cmd.sh, either to 'queue.pl' for a GPU cluster node or to 'run.pl' for a local machine. Finally, we assume that a SAT GMM system exp/tri4b and the corresponding fMLLR transforms were obtained from egs/wsj/s5/run.sh. Note that for other databases, run_dnn.sh is generally at the same location, s5/local/nnet/run_dnn.sh.

The script egs/wsj/s5/local/nnet/run_dnn.sh performs the following steps:

0. Storing 40-dimensional fMLLR features locally, using steps/nnet/make_fmllr_feats.sh; this simplifies the training scripts. The 40-dimensional features are MFCC-LDA-MLLT-fMLLR with CMN.
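As a concrete illustration, this step may be invoked as follows (a sketch loosely following the WSJ recipe; the directory names and option values are illustrative):

# Dump fMLLR features of the training set to a local data directory.
gmmdir=exp/tri4b
dir=data-fmllr-tri4b/train_si284
steps/nnet/make_fmllr_feats.sh --nj 20 --cmd "$train_cmd" \
  --transform-dir exp/tri4b_ali_si284 \
  $dir data/train_si284 $gmmdir $dir/log $dir/data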

1. RBM pre-training, steps/nnet/pretrain_dbn.sh, implemented according to Geoff Hinton's tutorial paper. The training method is Contrastive Divergence with 1-step Markov-chain Monte-Carlo sampling (CD-1). The first RBM is Gaussian-Bernoulli, the following RBMs are Bernoulli-Bernoulli. The hyperparameter defaults were tuned on a 100-hour Switchboard subset; if the dataset is smaller, the number of epochs N is scaled up to roughly 100h/dataset_size. The training is unsupervised, so it is sufficient to provide a data directory with the input features.

When training the Gaussian-Bernoulli RBM, there is a high risk of weight explosion, especially with a large learning rate and thousands of hidden neurons. To avoid it, the implementation compares the variance of the training data with the variance of the reconstructed data within a mini-batch. If the reconstruction variance is more than twice the training-data variance, the weights are shrunk and the learning rate is temporarily reduced.
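An invocation sketch of the pre-training (the option values and paths are illustrative; see the header of steps/nnet/pretrain_dbn.sh for the full option list):

# Pre-train a stack of 6 RBMs, 2048 hidden units each, with CD-1.
dir=exp/dnn5b_pretrain-dbn
$cuda_cmd $dir/log/pretrain_dbn.log \
  steps/nnet/pretrain_dbn.sh --nn-depth 6 --hid-dim 2048 --rbm-iter 1 \
  data-fmllr-tri4b/train_si284 $dir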

2. Frame cross-entropy training, steps/nnet/train.sh. This stage trains a DNN to classify frames into the corresponding triphone states (i.e., PDFs), using mini-batch stochastic gradient descent. The defaults are Sigmoid hidden units, a Softmax output layer, and fully-connected AffineTransform layers; the learning rate is 0.008 and the mini-batch size is 256. No momentum or regularization is used. (Note: the optimal learning rate depends on the hidden-unit type: 0.008 for sigmoid, 0.00001 for tanh.)

The input transform and the pre-trained DBN (i.e., Deep Belief Network, a stack of RBMs) are passed to the script with the options '--input-transform' and '--dbn'; only the output layer is initialized randomly. We use early stopping to prevent over-fitting; for this, the objective function is measured on a cross-validation set (i.e., a held-out set), so two pairs of feature/alignment directories are needed for the supervised training.

A good overview article on DNN training is http://research.google.com/pubs/archive/38131.pdf
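For reference, this stage is typically launched as follows (a sketch following the WSJ recipe; note that the recipe passes the transform and the DBN via '--feature-transform' and '--dbn', and uses '--hid-layers 0' because all hidden layers come from the DBN):

# Train the output layer on top of the pre-trained DBN (paths are illustrative).
dir=exp/dnn5b_pretrain-dbn_dnn
ali=exp/tri4b_ali_si284
feature_transform=exp/dnn5b_pretrain-dbn/final.feature_transform
dbn=exp/dnn5b_pretrain-dbn/6.dbn
$cuda_cmd $dir/log/train_nnet.log \
  steps/nnet/train.sh --feature-transform $feature_transform --dbn $dbn \
  --hid-layers 0 --learn-rate 0.008 \
  data-fmllr-tri4b/train_si284_tr90 data-fmllr-tri4b/train_si284_cv10 \
  data/lang $ali $ali $dir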

3-6. sMBR sequence-discriminative training, steps/nnet/train_mpe.sh. This stage trains the neural network to jointly optimize whole sentences, which is closer to the general ASR objective than frame-level training (an invocation sketch follows the list below).

  • The objective of sMBR is to maximize the expected accuracy of state labels derived from the reference alignments; a lattice framework is used to represent the competing hypotheses.
  • The training is done by per-utterance stochastic gradient descent with a low, fixed learning rate of 1e-5 (for sigmoids), running for 3-5 epochs.
  • We observe faster convergence when the lattices are regenerated after the first epoch. We support MMI, BMMI, MPE, and sMBR training; all the techniques perform about the same on the Switchboard 100h set, with sMBR being slightly better.
  • In sMBR optimization, silence frames are excluded when computing the approximate accuracy. A more detailed description is at http://www.danielpovey.com/files/2013_interspeech_dnn.pdf
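The invocation sketch promised above (it follows the WSJ recipe; alignments and denominator lattices are generated first, and the option values are illustrative):

# Generate alignments and denominator lattices, then run sMBR epochs.
srcdir=exp/dnn5b_pretrain-dbn_dnn
acwt=0.10
steps/nnet/align.sh --nj 100 --cmd "$train_cmd" \
  data-fmllr-tri4b/train_si284 data/lang $srcdir ${srcdir}_ali
steps/nnet/make_denlats.sh --nj 100 --cmd "$decode_cmd" --acwt $acwt \
  data-fmllr-tri4b/train_si284 data/lang $srcdir ${srcdir}_denlats
steps/nnet/train_mpe.sh --cmd "$cuda_cmd" --num-iters 6 --acwt $acwt --do-smbr true \
  data-fmllr-tri4b/train_si284 data/lang $srcdir ${srcdir}_ali ${srcdir}_denlats ${srcdir}_smbr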

Some other interesting top-level scripts:

Besides the DNN script, there are several other scripts: 
* DNN : egs/wsj/s5/local/nnet/run_dnn.sh , (main top-level script) 
* CNN : egs/rm/s5/local/nnet/run_cnn.sh , (CNN = Convolutional Neural Network, see paper, we have 1D convolution on frequency axis) 
* Autoencoder training : egs/timit/s5/local/nnet/run_autoencoder.sh 
* Tandem system : egs/swbd/s5c/local/nnet/run_dnn_tandem_uc.sh , (uc = Universal context network, see paper) 
* Multilingual/Multitask : egs/rm/s5/local/nnet/run_multisoftmax.sh, (Network with output trained on RM and WSJ, same C++ design as was used in SLT2012 paper)

Training script internals

The main neural-network training script, steps/nnet/train.sh, is invoked as follows: 
steps/nnet/train.sh <data-train> <data-dev> <lang-dir> <ali-train> <ali-dev> <exp-dir>

The input features of the neural network are taken from the data directories <data-train> and <data-dev>, while the training targets are taken from the directories <ali-train> and <ali-dev>.

The directory <lang-dir> is used only when the LDA feature transform is estimated and when phone-frame statistics are generated from the alignments; it is not crucial for the training itself. The outputs (i.e., the trained networks and the log files) are stored in <exp-dir>.
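The held-out set is usually created by splitting the training data, e.g. 90/10, with utils/subset_data_dir_tr_cv.sh (the directory names are illustrative):

# Split off 10% of the data for cross-validation.
utils/subset_data_dir_tr_cv.sh data-fmllr-tri4b/train_si284 \
  data-fmllr-tri4b/train_si284_tr90 data-fmllr-tri4b/train_si284_cv10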

Internally, the script prepares the feature and target pipelines, generates a neural-network prototype and initializes it, creates the feature transform, and then runs the scheduler script steps/nnet/train_scheduler.sh, which runs the training epochs and controls the learning rate.

Looking inside steps/nnet/train.sh, we see:

1. CUDA is required; the script exits if no GPU was detected or the CUDA support was not compiled in. (You can insist on running on CPU by passing '--skip-cuda-check true', but it is 10-20x slower.)

2. The alignment pipeline is prepared. The training tool expects the targets in posterior format, so ali-to-post.cc is used: 

labels_tr="ark:ali-to-pdf $alidir/final.mdl \"ark:gunzip -c $alidir/ali.*.gz |\" ark:- | ali-to-post ark:- ark:- |"
labels_cv="ark:ali-to-pdf $alidir/final.mdl \"ark:gunzip -c $alidir_cv/ali.*.gz |\" ark:- | ali-to-post ark:- ark:- |"
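If in doubt what the targets look like, the pipeline can be inspected by hand (an illustrative example; the alignment directory follows the WSJ setup):

# Show the pdf-posteriors of the first utterance in text form.
ali-to-pdf exp/tri4b_ali_si284/final.mdl \
  "ark:gunzip -c exp/tri4b_ali_si284/ali.1.gz |" ark:- | \
  ali-to-post ark:- ark,t:- | head -n 1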

3. Shuffled features are copied to /tmp/???/...; this is disabled by '--copy-feats false', and the directory can be changed with '--copy-feats-tmproot <dir>'.

  • The features are re-saved to local disk in the order of a shuffled utterance list; this significantly reduces disk seeking during training, preventing a bottleneck from heavy random-access reads.

4. The feature pipeline is prepared:

# begins with copy-feats:
feats_tr="ark:copy-feats scp:$dir/train.scp ark:- |"
feats_cv="ark:copy-feats scp:$dir/cv.scp ark:- |"
# optionally apply-cmvn is appended: 
feats_tr="$feats_tr apply-cmvn --print-args=false --norm-vars=$norm_vars --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp ark:- ark:- |"
feats_cv="$feats_cv apply-cmvn --print-args=false --norm-vars=$norm_vars --utt2spk=ark:$data_cv/utt2spk scp:$data_cv/cmvn.scp ark:- ark:- |"
# optionally add-deltas is appended:
feats_tr="$feats_tr add-deltas --delta-order=$delta_order ark:- ark:- |"
feats_cv="$feats_cv add-deltas --delta-order=$delta_order ark:- ark:- |"
5. The feature transform is prepared:

  • The feature transform is a fixed function applied in the DNN front-end that is computed on the GPU. Usually it performs a type of dimensionality expansion; this allows keeping low-dimensional features on disk while the DNN front-end works in high dimensions, saving both disk space and read throughput.
  • Most of the nnet binaries have the option '--feature-transform'.
  • How it is generated depends on the option '--feat-type', whose value is one of (plain|traps|transf|lda).
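A handy sanity check is to compare the feature dimension on disk with the dimension after the transform (an illustrative example; 40 and 440 are the values of the setup shown later):

# Dimension of the raw features on disk (40 in this setup) ...
feat-to-dim scp:data-fmllr-tri4b/train_si284/feats.scp -
# ... and after applying final.feature_transform (440 = 40 x 11 splice).
feat-to-dim "ark:copy-feats scp:data-fmllr-tri4b/train_si284/feats.scp ark:- | nnet-forward exp/dnn5b_pretrain-dbn_dnn/final.feature_transform ark:- ark:- |" -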

6. The network prototype is generated by utils/nnet/make_nnet_proto.py:

  • each component is on a separate line, where the dimensions and the initialization hyperparameters are specified
  • for AffineTransform, the biases are initialized from a uniform distribution given by <BiasMean> and <BiasRange>, while the weights are initialized from a normal distribution scaled by <ParamStddev>
  • note: if you would like to experiment with an externally prepared network prototype, use the option '--mlp-proto'

$ cat exp/dnn5b_pretrain-dbn_dnn/nnet.proto
<NnetProto>
<AffineTransform> <InputDim> 2048 <OutputDim> 3370 <BiasMean> 0.000000 <BiasRange> 0.000000 <ParamStddev> 0.067246
<Softmax> <InputDim> 3370 <OutputDim> 3370
</NnetProto>
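The prototype above has no hidden layers, since they come from the pre-trained DBN ('--hid-layers 0'). For a network built from scratch, the generator might be called like this (a hedged sketch; the argument order is <feat-dim> <num-leaves> <num-hid-layers> <num-hid-neurons>):

# 440-dim input (40 x 11 splice), 3370 PDFs, 6 hidden layers of 2048 neurons.
utils/nnet/make_nnet_proto.py 440 3370 6 2048 > nnet.proto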
7. The network is initialized by nnet-initialize.cc; in the next step, the DBN gets prepended to it using nnet-concat.cc.
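In shell terms this step is roughly (a sketch; the file names are illustrative):

# Initialize the output layer from the prototype, then prepend the DBN.
nnet-initialize $dir/nnet.proto $dir/nnet_output.init
nnet-concat $dbn $dir/nnet_output.init $dir/nnet_dbn_dnn.init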

8. Finally, the training is run by the scheduler script steps/nnet/train_scheduler.sh.

Note: both the neural network and the feature transform can be inspected with nnet-info.cc, or shown in ASCII with nnet-copy.cc.

Looking inside steps/nnet/train_scheduler.sh, we see:

It starts with a cross-validation run, followed by the main for-loop over $iter, which runs the epochs and controls the learning rate. Typically, train_scheduler.sh is called from train.sh.

  • The default learning-rate schedule is based on the relative improvement of the objective function:
  • the initial learning rate is kept constant while the improvement is larger than 'start_halving_impr=0.01',
  • then the learning rate is multiplied by 'halving_factor=0.5' at each epoch,
  • finally, the training is terminated once the improvement is smaller than 'end_halving_impr=0.001'.
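A toy simulation of this logic may make the schedule concrete (this is not the actual script; the cv-loss sequence is fabricated):

# Toy simulation of the default learning-rate schedule.
learn_rate=0.008; halving=0; loss_prev=2.0
for loss_cv in 1.60 1.45 1.442 1.4415 1.4414; do
  rel_impr=$(awk -v p=$loss_prev -v c=$loss_cv 'BEGIN{print (p-c)/p}')
  if [ $halving -eq 1 ]; then
    learn_rate=$(awk -v l=$learn_rate 'BEGIN{print l*0.5}')     # halving_factor=0.5
  elif awk -v r=$rel_impr 'BEGIN{exit !(r < 0.01)}'; then
    halving=1                                                   # start_halving_impr=0.01
  fi
  echo "cv_loss=$loss_cv rel_impr=$rel_impr learn_rate=$learn_rate"
  awk -v r=$rel_impr 'BEGIN{exit !(r < 0.001)}' && break        # end_halving_impr=0.001
  loss_prev=$loss_cv
done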

The neural networks are stored in $dir/nnet, the log files in $dir/log:

1. The network names contain the epoch number, the learning rate, and the objective-function values on the training and cross-validation sets.

We can see that the learning rate starts being halved from the 5th epoch, which is a common behavior.

$ ls exp/dnn5b_pretrain-dbn_dnn/nnet
nnet_6.dbn_dnn_iter01_learnrate0.008_tr1.1919_cv1.5895
nnet_6.dbn_dnn_iter02_learnrate0.008_tr0.9566_cv1.5289
nnet_6.dbn_dnn_iter03_learnrate0.008_tr0.8819_cv1.4983
nnet_6.dbn_dnn_iter04_learnrate0.008_tr0.8347_cv1.5097_rejected
nnet_6.dbn_dnn_iter05_learnrate0.004_tr0.8255_cv1.3760
nnet_6.dbn_dnn_iter06_learnrate0.002_tr0.7920_cv1.2981
nnet_6.dbn_dnn_iter07_learnrate0.001_tr0.7803_cv1.2412
...
nnet_6.dbn_dnn_iter19_learnrate2.44141e-07_tr0.7770_cv1.1448
nnet_6.dbn_dnn_iter20_learnrate1.2207e-07_tr0.7769_cv1.1446
nnet_6.dbn_dnn_iter20_learnrate1.2207e-07_tr0.7769_cv1.1446_final_
2. Separate log files are stored for the training set and the cross-validation set.

Each log file shows the command line:

$ cat exp/dnn5b_pretrain-dbn_dnn/log/iter01.tr.log
nnet-train-frmshuff --learn-rate=0.008 --momentum=0 --l1-penalty=0 --l2-penalty=0 --minibatch-size=256 --randomizer-size=32768 --randomize=true --verbose=1 --binary=true --feature-transform=exp/dnn5b_pretrain-dbn_dnn/final.feature_transform --randomizer-seed=777 'ark:copy-feats scp:exp/dnn5b_pretrain-dbn_dnn/train.scp ark:- |' 'ark:ali-to-pdf exp/tri4b_ali_si284/final.mdl "ark:gunzip -c exp/tri4b_ali_si284/ali.*.gz |" ark:- | ali-to-post ark:- ark:- |' exp/dnn5b_pretrain-dbn_dnn/nnet_6.dbn_dnn.init exp/dnn5b_pretrain-dbn_dnn/nnet/nnet_6.dbn_dnn_iter01
information about the GPU that is used:

LOG (nnet-train-frmshuff:IsComputeExclusive():cu-device.cc:214) CUDA setup operating under Compute Exclusive Process Mode.
LOG (nnet-train-frmshuff:FinalizeActiveGpu():cu-device.cc:174) The active GPU is [1]: GeForce GTX 780 Ti	free:2974M, used:97M, total:3071M, free/total:0.968278 version 3.5
and internal statistics of the network training, prepared by the functions Nnet::InfoPropagate, Nnet::InfoBackPropagate, and Nnet::InfoGradient. They are printed at the beginning of the epoch and a second time at its end.

Note: the per-component statistics are particularly handy when debugging network training with newly implemented features, as we can compare the observed values with the expected ones:

VLOG[1] (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:236) ### After 0 frames,
VLOG[1] (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:237) ### Forward propagation buffer content :
[1] output of <Input> ( min -6.1832, max 7.46296, mean 0.00260791, variance 0.964268, skewness -0.0622335, kurtosis 2.18525 ) 
[2] output of <AffineTransform> ( min -18.087, max 11.6435, mean -3.37778, variance 3.2801, skewness -3.40761, kurtosis 11.813 ) 
[3] output of <Sigmoid> ( min 1.39614e-08, max 0.999991, mean 0.085897, variance 0.0249875, skewness 4.65894, kurtosis 20.5913 ) 
[4] output of <AffineTransform> ( min -17.3738, max 14.4763, mean -2.69318, variance 2.08086, skewness -3.53642, kurtosis 13.9192 ) 
[5] output of <Sigmoid> ( min 2.84888e-08, max 0.999999, mean 0.108987, variance 0.0215204, skewness 4.78276, kurtosis 21.6807 ) 
[6] output of <AffineTransform> ( min -16.3061, max 10.9503, mean -3.65226, variance 2.49196, skewness -3.26134, kurtosis 12.1138 ) 
[7] output of <Sigmoid> ( min 8.28647e-08, max 0.999982, mean 0.0657602, variance 0.0212138, skewness 5.18622, kurtosis 26.2368 ) 
[8] output of <AffineTransform> ( min -19.9429, max 12.5567, mean -3.64982, variance 2.49913, skewness -3.2291, kurtosis 12.3174 ) 
[9] output of <Sigmoid> ( min 2.1823e-09, max 0.999996, mean 0.0671024, variance 0.0216422, skewness 5.07312, kurtosis 24.9565 ) 
[10] output of <AffineTransform> ( min -16.79, max 11.2748, mean -4.03986, variance 2.15785, skewness -3.13305, kurtosis 13.9256 ) 
[11] output of <Sigmoid> ( min 5.10745e-08, max 0.999987, mean 0.0492051, variance 0.0194567, skewness 5.73048, kurtosis 32.0733 ) 
[12] output of <AffineTransform> ( min -24.0731, max 13.8856, mean -4.00245, variance 2.16964, skewness -3.14425, kurtosis 16.7714 ) 
[13] output of <Sigmoid> ( min 3.50889e-11, max 0.999999, mean 0.0501351, variance 0.0200421, skewness 5.67209, kurtosis 31.1902 ) 
[14] output of <AffineTransform> ( min -2.53919, max 2.62531, mean -0.00363421, variance 0.209117, skewness -0.0302545, kurtosis 0.63143 ) 
[15] output of <Softmax> ( min 2.01032e-05, max 0.00347782, mean 0.000296736, variance 2.08593e-08, skewness 6.14324, kurtosis 35.6034 ) 

VLOG[1] (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:239) ### Backward propagation buffer content :
[1] diff-output of <AffineTransform> ( min -0.0256142, max 0.0447016, mean 1.60589e-05, variance 7.34959e-07, skewness 1.50607, kurtosis 97.2922 ) 
[2] diff-output of <Sigmoid> ( min -0.10395, max 0.20643, mean -2.03144e-05, variance 5.40825e-05, skewness 0.226897, kurtosis 10.865 ) 
[3] diff-output of <AffineTransform> ( min -0.0246385, max 0.033782, mean 1.49055e-05, variance 7.2849e-07, skewness 0.71967, kurtosis 47.0307 ) 
[4] diff-output of <Sigmoid> ( min -0.137561, max 0.177565, mean -4.91158e-05, variance 4.85621e-05, skewness 0.020871, kurtosis 7.7897 ) 
[5] diff-output of <AffineTransform> ( min -0.0311345, max 0.0366407, mean 1.38255e-05, variance 7.76937e-07, skewness 0.886642, kurtosis 70.409 ) 
[6] diff-output of <Sigmoid> ( min -0.154734, max 0.166145, mean -3.83602e-05, variance 5.84839e-05, skewness 0.127536, kurtosis 8.54924 ) 
[7] diff-output of <AffineTransform> ( min -0.0236995, max 0.0353677, mean 1.29041e-05, variance 9.17979e-07, skewness 0.710979, kurtosis 48.1876 ) 
[8] diff-output of <Sigmoid> ( min -0.103117, max 0.146624, mean -3.74798e-05, variance 6.17777e-05, skewness 0.0458594, kurtosis 8.37983 ) 
[9] diff-output of <AffineTransform> ( min -0.0249271, max 0.0315759, mean 1.0794e-05, variance 1.2015e-06, skewness 0.703888, kurtosis 53.6606 ) 
[10] diff-output of <Sigmoid> ( min -0.147389, max 0.131032, mean -0.00014309, variance 0.000149306, skewness 0.0190403, kurtosis 5.48604 ) 
[11] diff-output of <AffineTransform> ( min -0.057817, max 0.0662253, mean 2.12237e-05, variance 1.21929e-05, skewness 0.332498, kurtosis 35.9619 ) 
[12] diff-output of <Sigmoid> ( min -0.311655, max 0.331862, mean 0.00031612, variance 0.00449583, skewness 0.00369107, kurtosis -0.0220473 ) 
[13] diff-output of <AffineTransform> ( min -0.999905, max 0.00347782, mean -1.33212e-12, variance 0.00029666, skewness -58.0197, kurtosis 3364.53 ) 

VLOG[1] (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:240) ### Gradient stats :
Component 1 : <AffineTransform>, 
  linearity_grad ( min -0.204042, max 0.190719, mean 0.000166458, variance 0.000231224, skewness 0.00769091, kurtosis 5.07687 ) 
  bias_grad ( min -0.101453, max 0.0885828, mean 0.00411107, variance 0.000271452, skewness 0.728702, kurtosis 3.7276 ) 
Component 2 : <Sigmoid>, 
Component 3 : <AffineTransform>, 
  linearity_grad ( min -0.108358, max 0.0843307, mean 0.000361943, variance 8.64557e-06, skewness 1.0407, kurtosis 21.355 ) 
  bias_grad ( min -0.0658942, max 0.0973828, mean 0.0038158, variance 0.000288088, skewness 0.68505, kurtosis 1.74937 ) 
Component 4 : <Sigmoid>, 
Component 5 : <AffineTransform>, 
  linearity_grad ( min -0.186918, max 0.141044, mean 0.000419367, variance 9.76016e-06, skewness 0.718714, kurtosis 40.6093 ) 
  bias_grad ( min -0.167046, max 0.136064, mean 0.00353932, variance 0.000322016, skewness 0.464214, kurtosis 8.90469 ) 
Component 6 : <Sigmoid>, 
Component 7 : <AffineTransform>, 
  linearity_grad ( min -0.134063, max 0.149993, mean 0.000249893, variance 9.18434e-06, skewness 1.61637, kurtosis 60.0989 ) 
  bias_grad ( min -0.165298, max 0.131958, mean 0.00330344, variance 0.000438555, skewness 0.739655, kurtosis 6.9461 ) 
Component 8 : <Sigmoid>, 
Component 9 : <AffineTransform>, 
  linearity_grad ( min -0.264095, max 0.27436, mean 0.000214027, variance 1.25338e-05, skewness 0.961544, kurtosis 184.881 ) 
  bias_grad ( min -0.28208, max 0.273459, mean 0.00276327, variance 0.00060129, skewness 0.149445, kurtosis 21.2175 ) 
Component 10 : <Sigmoid>, 
Component 11 : <AffineTransform>, 
  linearity_grad ( min -0.877651, max 0.811671, mean 0.000313385, variance 0.000122102, skewness -1.06983, kurtosis 395.3 ) 
  bias_grad ( min -1.01687, max 0.640236, mean 0.00543326, variance 0.00977744, skewness -0.473956, kurtosis 14.3907 ) 
Component 12 : <Sigmoid>, 
Component 13 : <AffineTransform>, 
  linearity_grad ( min -22.7678, max 0.0922921, mean -5.66685e-11, variance 0.00451415, skewness -151.169, kurtosis 41592.4 ) 
  bias_grad ( min -22.8996, max 0.170164, mean -8.6555e-10, variance 0.421778, skewness -27.1075, kurtosis 884.01 ) 
Component 14 : <Softmax>,
The log file ends with a summary: the objective-function value over the whole dataset, the progress vector (which is produced only in the first epoch), and the frame accuracy:

LOG (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:273) Done 34432 files, 21 with no tgt_mats, 0 with other errors. [TRAINING, RANDOMIZED, 50.8057 min, fps8961.77]
LOG (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:282) AvgLoss: 1.19191 (Xent), [AvgXent: 1.19191, AvgTargetEnt: 0]
progress: [3.09478 1.92798 1.702 1.58763 1.49913 1.45936 1.40532 1.39672 1.355 1.34153 1.32753 1.30449 1.2725 1.2789 1.26154 1.25145 1.21521 1.24302 1.21865 1.2491 1.21729 1.19987 1.18887 1.16436 1.14782 1.16153 1.1881 1.1606 1.16369 1.16015 1.14077 1.11835 1.15213 1.11746 1.10557 1.1493 1.09608 1.10037 1.0974 1.09289 1.11857 1.09143 1.0766 1.08736 1.10586 1.08362 1.0885 1.07366 1.08279 1.03923 1.06073 1.10483 1.0773 1.0621 1.06251 1.07252 1.06945 1.06684 1.08892 1.07159 1.06216 1.05492 1.06508 1.08979 1.05842 1.04331 1.05885 1.05186 1.04255 1.06586 1.02833 1.06131 1.01124 1.03413 0.997029 ]
FRAME_ACCURACY >> 65.6546% <<
The very end of the log file contains CUDA profiling information; CuMatrix::AddMatMat is the matrix multiplication, and it takes most of the time:

[cudevice profile]
Destroy	23.0389s
AddVec	24.0874s
CuMatrixBase::CopyFromMat(from other CuMatrixBase)	29.5765s
AddVecToRows	29.7164s
CuVector::SetZero	37.7405s
DiffSigmoid	37.7669s
CuMatrix::Resize	41.8662s
FindRowMaxId	42.1923s
Sigmoid	48.6683s
CuVector::Resize	56.4445s
AddRowSumMat	75.0928s
CuMatrix::SetZero	86.5347s
CuMatrixBase::CopyFromMat(from CPU)	166.27s
AddMat	174.307s
AddMatMat	1922.11s
Running steps/nnet/train_scheduler.sh directly:

* The script train_scheduler.sh can be called outside of train.sh; it allows overriding the default NN-input and NN-target streams, which can be convenient.

* However, the script assumes that everything else is already set up correctly, so it is suitable only for advanced users.

* Before calling it directly, it is highly recommended to look at how train_scheduler.sh is usually called.
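For illustration, a direct call mirrors the way train.sh invokes it (a sketch; "$feats_tr", "$feats_cv", "$labels_tr", "$labels_cv" are pipeline strings like the ones shown earlier):

steps/nnet/train_scheduler.sh --learn-rate 0.008 \
  --feature-transform $dir/final.feature_transform \
  $dir/nnet.init "$feats_tr" "$feats_cv" "$labels_tr" "$labels_cv" $dir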

Training tools

The nnet1-related tools are located in the directory src/nnetbin; the important tools are:

nnet-train-frmshuff.cc:

The most commonly used neural-network training tool; it performs one epoch of the training.
  • The process is:
1. on-the-fly feature expansion by --feature-transform,
2. per-frame shuffling of NN input-target pairs,
3. mini-batch stochastic gradient descent (SGD) training.

  • Supported per-frame objective functions (option --objective-function):

1. Xent: per-frame cross-entropy $ \mathcal{L}_{Xent}(\mathbf{t},\mathbf{y}) = -\sum_D{t_d \log y_d} $,
2. Mse: per-frame mean-square error $ \mathcal{L}_{Mse}(\mathbf{t},\mathbf{y}) = \frac{1}{2}\sum_D{(t_d - y_d)^2} $,
where $ t_d $ is an element of the target vector $ \mathbf{t} $, $ y_d $ is an element of the DNN output vector $ \mathbf{y} $, and D is the output dimension of the DNN.

nnet-forward.cc:

Propagates data forward through the neural network; by default it runs on CPU.
Options:
--apply-log: produce the logarithm of the NN outputs (i.e., obtain log-posteriors)
--no-softmax: remove the softmax layer from the model (decoding with pre-softmax values leads to the same lattices as with log-posteriors)
--class-frame-counts: counts used to calculate the log-priors, which get subtracted from the acoustic scores (a typical trick in hybrid decoding)
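In hybrid decoding, a typical use is to convert the DNN outputs into pseudo log-likelihoods (a sketch; the counts file ali_train_pdf.counts is produced by the training recipe, and "$feats" stands for a feature pipeline like the ones above):

# Pre-softmax outputs minus log-priors give the scaled acoustic scores.
nnet-forward --no-softmax=true \
  --class-frame-counts=exp/dnn5b_pretrain-dbn_dnn/ali_train_pdf.counts \
  --feature-transform=exp/dnn5b_pretrain-dbn_dnn/final.feature_transform \
  exp/dnn5b_pretrain-dbn_dnn/final.nnet "$feats" ark:loglikes.ark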

rbm-train-cd1-frmshuff.cc

Trains an RBM using CD-1; the training data is processed several times, while the learning rate and momentum are adjusted internally.

nnet-train-mmi-sequential.cc

MMI / bMMI DNN training

nnet-train-mpe-sequential.cc

MPE / sMBR DNN training

Other tools

nnet-info.cc

prints information about a neural network

nnet-copy.cc

converts a neural network to ASCII format with the option --binary=false; it can also be used to remove components
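For example (an illustrative path):

# Dump the model in ASCII and look at the first lines.
nnet-copy --binary=false exp/dnn5b_pretrain-dbn_dnn/final.nnet - | head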

Showing the network topology with nnet-info

The following print-out of nnet-info.cc shows the 'feature_transform' corresponding to '--feat-type plain' in steps/nnet/train.sh; it contains three components:

  • <Splice>, which splices the features to contain left/right context, using frames with offsets [-5 -4 -3 -2 -1 0 1 2 3 4 5] relative to the central frame
  • <AddShift>, which shifts the features to zero mean
  • <Rescale>, which scales the features to unit variance
  • Note: we read low-dimensional features from disk and expand them to high-dimensional features via the 'feature_transform', which saves both disk space and read throughput.

$ nnet-info exp/dnn5b_pretrain-dbn_dnn/final.feature_transform
num-components 3
input-dim 40
output-dim 440
number-of-parameters 0.00088 millions
component 1 : <Splice>, input-dim 40, output-dim 440,
  frame_offsets [ -5 -4 -3 -2 -1 0 1 2 3 4 5 ]
component 2 : <AddShift>, input-dim 440, output-dim 440,
  shift_data ( min -0.265986, max 0.387861, mean -0.00988686, variance 0.00884029, skewness 1.36947, kurtosis 7.2531 )
component 3 : <Rescale>, input-dim 440, output-dim 440,
  scale_data ( min 0.340899, max 1.04779, mean 0.838518, variance 0.0265105, skewness -1.07004, kurtosis 0.697634 )
LOG (nnet-info:main():nnet-info.cc:57) Printed info about exp/dnn5b_pretrain-dbn_dnn/final.feature_transform
The following print-out shows the information for a 6-layer network:

  • each layer is composed of 2 components, typically an <AffineTransform> and a non-linearity (<Sigmoid> or <Softmax>)
  • for each <AffineTransform>, statistics (min, max, mean, variance, ...) are shown separately for the weights and the biases

$ nnet-info exp/dnn5b_pretrain-dbn_dnn/final.nnet
num-components 14
input-dim 440
output-dim 3370
number-of-parameters 28.7901 millions
component 1 : <AffineTransform>, input-dim 440, output-dim 2048,
  linearity ( min -8.31865, max 12.6115, mean 6.19398e-05, variance 0.0480065, skewness 0.234115, kurtosis 56.5045 )
  bias ( min -11.9908, max 3.94632, mean -5.23527, variance 1.52956, skewness 1.21429, kurtosis 7.1279 )
component 2 : <Sigmoid>, input-dim 2048, output-dim 2048,
component 3 : <AffineTransform>, input-dim 2048, output-dim 2048,
  linearity ( min -2.85905, max 2.62576, mean -0.00995374, variance 0.0196688, skewness 0.145988, kurtosis 5.13826 )
  bias ( min -18.4214, max 2.76041, mean -2.63403, variance 1.08654, skewness -1.94598, kurtosis 29.1847 )
component 4 : <Sigmoid>, input-dim 2048, output-dim 2048,
component 5 : <AffineTransform>, input-dim 2048, output-dim 2048,
  linearity ( min -2.93331, max 3.39389, mean -0.00912637, variance 0.0164175, skewness 0.115911, kurtosis 5.72574 )
  bias ( min -5.02961, max 2.63683, mean -3.36246, variance 0.861059, skewness 0.933722, kurtosis 2.02732 )
component 6 : <Sigmoid>, input-dim 2048, output-dim 2048,
component 7 : <AffineTransform>, input-dim 2048, output-dim 2048,
  linearity ( min -2.18591, max 2.53624, mean -0.00286483, variance 0.0120785, skewness 0.514589, kurtosis 15.7519 )
  bias ( min -10.0615, max 3.87953, mean -3.52258, variance 1.25346, skewness 0.878727, kurtosis 2.35523 )
component 8 : <Sigmoid>, input-dim 2048, output-dim 2048,
component 9 : <AffineTransform>, input-dim 2048, output-dim 2048,
  linearity ( min -2.3888, max 2.7677, mean -0.00210424, variance 0.0101205, skewness 0.688473, kurtosis 23.6768 )
  bias ( min -5.40521, max 1.78146, mean -3.83588, variance 0.869442, skewness 1.60263, kurtosis 3.52121 )
component 10 : <Sigmoid>, input-dim 2048, output-dim 2048,
component 11 : <AffineTransform>, input-dim 2048, output-dim 2048,
  linearity ( min -2.9244, max 3.0957, mean -0.00475199, variance 0.0112682, skewness 0.372597, kurtosis 25.8144 )
  bias ( min -6.00325, max 1.89201, mean -3.96037, variance 0.847698, skewness 1.79783, kurtosis 3.90105 )
component 12 : <Sigmoid>, input-dim 2048, output-dim 2048,
component 13 : <AffineTransform>, input-dim 2048, output-dim 3370,
  linearity ( min -2.0501, max 5.96146, mean 0.000392621, variance 0.0260072, skewness 0.678868, kurtosis 5.67934 )
  bias ( min -0.563231, max 6.73992, mean 0.000585582, variance 0.095558, skewness 9.46447, kurtosis 177.833 )
component 14 : <Softmax>, input-dim 3370, output-dim 3370,
LOG (nnet-info:main():nnet-info.cc:57) Printed info about exp/dnn5b_pretrain-dbn_dnn/final.nnet

Advanced features

Frame-weighted training

Call steps/nnet/train.sh with the option:
--frame-weights <weights-rspecifier>
where <weights-rspecifier> is typically an ark file with a float vector of per-frame weights for each utterance.
  • the weights are used to scale gradients computed on single frames, which is useful in confidence-weighted semi-supervised training,
  • or weights can be used to mask-out frames we don't want to train with by generating vectors composed of weights 0,1
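A minimal sketch of masking frames out with 0/1 weights (the file and utterance names are hypothetical):

# utt1 has 5 frames; train only on the middle three.
echo 'utt1 [ 0 1 1 1 0 ]' | copy-vector ark,t:- ark:frame_weights.ark
# Then: steps/nnet/train.sh --frame-weights ark:frame_weights.ark (plus the usual arguments)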

Training with external targets

Call steps/nnet/train.sh with the options:

--labels <posterior-rspecifier> --num-tgt <dim-output>
while the ali-dirs and lang-dir become dummy directories. The <posterior-rspecifier> is typically an ark file with Posteriors stored in it, and <dim-output> is the number of neural-network outputs. In this case the Posterior does not have a probabilistic meaning; it is simply a data-type carrier for representing the targets, and the target values can be arbitrary float numbers.

When training with a single label per frame (i.e., 1-hot encoding), one can prepare an ark file with integer vectors having the same length as the input features. The elements of the integer vector encode the indices of the target class, which corresponds to the target value being 1 at the neural-network output with that index. The integer vectors get converted to Posterior format by ali-to-post.cc; the integer-vector format is simple:

utt1 0 0 0 0 1 1 1 1 1 2 2 2 2 2 2 ... 9 9 9
utt2 0 0 0 0 0 3 3 3 3 3 3 2 2 2 2 ... 9 9 9
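Such integer vectors can be converted to Posterior format on the fly when passing the labels (a sketch; targets.ark and the output dimension 10 are hypothetical):

# ali-to-post wraps the 1-hot integer targets into the Posterior data type.
steps/nnet/train.sh --labels "ark:ali-to-post ark,t:targets.ark ark:- |" --num-tgt 10 \
  <data-train> <data-dev> <lang-dummy> <ali-dummy-1> <ali-dummy-2> <exp-dir>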

In the case of multiple non-zero targets, one can prepare the Posteriors directly in ASCII format:

  • each non-zero target value is encoded by a pair <int32,float>, where int32 is the index of the NN output (starting from 0) and float is the target value
  • each frame (i.e., datapoint) is represented by the values in brackets [ ... ]; the <int32,float> pairs get concatenated
utt1 [ 0 0.9991834 64 0.0008166544 ] [ 1 1 ] [ 0 1 ] [ 111 1 ] [ 0 1 ] [ 63 1 ] [ 0 1 ] [ 135 1 ] [ 0 1 ] [ 162 1 ] [ 0 1 ] [ 1 0.9937257 12 0.006274292 ] [ 0 1 ]

The external targets are used in the autoencoder example egs/timit/s5/local/nnet/run_autoencoder.sh

Mean-Square-Error training

Call steps/nnet/train.sh with the options:

--train-tool "nnet-train-frmshuff --objective-function=mse" 
--proto-opts "--no-softmax --activation-type=<Tanh> --hid-bias-mean=0.0 --hid-bias-range=1.0"

The mean-square-error training is used in the autoencoder example egs/timit/s5/local/nnet/run_autoencoder.sh.

Training with tanh

Call steps/nnet/train.sh with the option:

--proto-opts "--activation-type=<Tanh> --hid-bias-mean=0.0 --hid-bias-range=1.0"

The optimal learning rate is smaller than with sigmoid; usually 0.00001 works well.

Conversion of a DNN model between nnet1 -> nnet2

In Kaldi there are two DNN setups: Karel's (this page) and Dan's (see Dan's DNN implementation). The setups use incompatible DNN formats, but there is a converter from Karel's DNN format into Dan's format.

  • The example script is: egs/rm/s5/local/run_dnn_convert_nnet2.sh
  • The model-conversion script is: steps/nnet2/convert_nnet1_to_nnet2.sh; it calls the model-conversion binary nnet1-to-raw-nnet.cc
  • For the list of supported components, see ConvertComponent.

The C++ code

The nnet1 code is located in src/nnet, the tools in src/nnetbin. It is based on src/cudamatrix.

Neural network representation

A neural network is built from blocks called components; simple examples are AffineTransform or the non-linearities Sigmoid and Softmax. A single DNN layer is typically composed of two components: an AffineTransform and a non-linearity.

The class representing the neural network is Nnet, which holds a vector of Component pointers, Nnet::components_. The most important methods of Nnet are:

  • Nnet::Propagate: propagates the input to the output, while keeping the per-component buffers that are needed for the gradient computation
  • Nnet::Backpropagate: back-propagates the loss derivative and updates the weights
  • Nnet::Feedforward: propagates while using two flipping buffers to save memory
  • Nnet::SetTrainOptions: sets the training hyperparameters (learning rate, momentum, L1/L2 cost)

For debugging, the components and the buffers are accessible via Nnet::GetComponent, Nnet::PropagateBuffer, and Nnet::BackpropagateBuffer.

Extending the network by a new component

When creating a new component, you choose one of these two interfaces:

  1. Component: a building block that contains no trainable parameters (see the example implementation in nnet-activation.h)
  2. UpdatableComponent: a child of Component, a building block with trainable parameters (implemented, for example, in nnet-affine-transform.h)

The important virtual methods to implement are (not a complete list):

  • Component::PropagateFnc: the forward-pass function
  • Component::BackpropagateFnc: the backward-pass function (applies one step of the chain rule, multiplying the loss derivative by the derivative of the forward-pass function)
  • UpdatableComponent::Update: computes the gradient and updates the weights

To extend the framework with a new component, you need to:

  1. define a new entry in Component::ComponentType
  2. add a new row to the table Component::kMarkerMap
  3. add a 'new Component' call to the factory-like function Component::Read
  4. implement all the virtual methods of the interface Component or UpdatableComponent
