Kaldi MMI/BMMI/MPE/sMBR training and source code


Training pipeline

  1. make_denlats.sh
    • Builds the decoding graph using a unigram language model
    • nnet-latgen-faster generates the denominator lattices, which act as the denominator in sMBR training, i.e. the competing paths
  2. align.sh
    • Builds a WFST using the reference transcript as G
    • Decoding yields the best path, i.e. the best per-frame numerator alignment, which acts as the numerator in sMBR training, i.e. the reference path
  3. get_egs_discriminative2.sh
    Reorganizes the data: the numerator alignments and denominator lattices produced by the two steps above, together with the input features, are packed into the following data structure (with split and excise operations where necessary):
struct DiscriminativeNnetExample {
  /// The weight we assign to this example;
  /// this will typically be one, but we include it
  /// for the sake of generality.  
  BaseFloat weight; 

  /// The numerator alignment
  std::vector<int32> num_ali; 

  /// The denominator lattice.  Note: any acoustic
  /// likelihoods in the denominator lattice will be
  /// recomputed at the time we train.
  CompactLattice den_lat; 

  /// The input data-- typically with a number of frames [NumRows()] larger than
  /// labels.size(), because it includes features to the left and right as
  /// needed for the temporal context of the network.  (see also the
  /// left_context variable).
  /// Caution: when we write this to disk, we do so as CompressedMatrix.
  /// Because we do various manipulations on these things in memory, such
  /// as splitting, we don't want it to be a CompressedMatrix in memory
  /// as this would be wasteful in time and also would lead to further loss of
  /// accuracy.
  Matrix<BaseFloat> input_frames;

  /// The number of frames of left context in the features (we can work out the
  /// #frames of right context from input_frames.NumRows(), num_ali.size(), and
  /// this).
  int32 left_context;


  /// spk_info contains any component of the features that varies slowly or not
  /// at all with time (and hence, we would lose little by averaging it over
  /// time and storing the average).  We'll append this to each of the input
  /// features, if used.
  Vector<BaseFloat> spk_info;
};
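As the comment on left_context says, the right context is implicit; here is a minimal sketch of recovering it from the struct's own fields (RightContext is an illustrative helper, not a function from the Kaldi sources):

#include "base/kaldi-common.h"  // int32 and KALDI_ASSERT, assuming the kaldi namespace

// The labeled frames number num_ali.size(); input_frames carries
// left_context extra rows on the left and the remainder on the right.
int32 RightContext(const DiscriminativeNnetExample &eg) {
  int32 right_context = eg.input_frames.NumRows()
      - static_cast<int32>(eg.num_ali.size())
      - eg.left_context;
  KALDI_ASSERT(right_context >= 0);
  return right_context;
}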
  4. train_discriminative2.sh
    • nnet-combine-egs-discriminative: repacks the egs so that 512 frames form one eg, corresponding to one minibatch
    • Discriminative training then starts; the outermost loop controls the number of iterations, i.e. the epoch count is converted into an iteration count:
      num_archives=5000 (e.g. the number of archives produced by the degs stage)
      num_jobs_nnet=4
      num_epochs=4
      num_iters=num_epochs*num_archives/num_jobs_nnet=5000
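As a quick sanity check of the epoch-to-iteration arithmetic above (a toy program, with the variable names taken from the script):

#include <cassert>

int main() {
  const int num_archives = 5000;  // e.g. number of degs archives
  const int num_jobs_nnet = 4;    // archives consumed per iteration
  const int num_epochs = 4;
  // Covering every archive num_epochs times, num_jobs_nnet at a time:
  const int num_iters = num_epochs * num_archives / num_jobs_nnet;
  assert(num_iters == 5000);
  return 0;
}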

Kaldi source code

Data

The data part, egs_, contains:
- num_ali
  Dimension: number of frames
  The alignment sequence, used as the reference
- den_lat
  The corresponding denominator lattice
- input_frames
  Dimension: (left_context + num_frames + right_context) × frame_dim
  The acoustic features

The model part contains:
- am_nnet
  Used to compute $P(s|o)$; it is the model being updated during training

Training flow

  1. Propagate()
    Computes the posteriors $P(s|o)$ for the input features
  2. LatticeComputation()
    Computes the loss for the chosen criterion
    • Divide the posteriors $P(s|o)$ from the previous step by the priors $P(s)$ to obtain pseudo-likelihoods $p(o|s)$, and replace the acoustic scores $p(o|s)$ on the lattice arcs with them (see the sketch after this list)
    • Compute, per criterion, the posterior-domain gradient $\mathrm{post} = \partial J / \partial \log p(o_t|s)$:
      <1> LatticeForwardBackwardMpeVariants
      Computes $\mathrm{post} = \frac{\partial J}{\partial \log p(o_t|s)} = r_q\,(c(q) - c_{\mathrm{avg}})$, where
      $r_q$: the occupation probability of arc $q$, corresponding to $\alpha_q \beta_q / \sum_r \alpha_r \beta_r$
      $c(q)$: the average state accuracy of all paths through arc $q$, corresponding to alpha_smbr[q] + beta_smbr[q]
      $c_{\mathrm{avg}}$: the average state accuracy over all paths, corresponding to tot_forward_score
      <2> LatticeForwardBackwardMmi
      LatticeForwardBackward: corresponds to the second term $\gamma_t^{\mathrm{den}}(i) = \frac{\sum_{w:\, s_t = i} p(o_m|s_w)^{\kappa} P(w)}{\sum_{w} p(o_m|s_w)^{\kappa} P(w)}$
      AlignmentToPosterior: corresponds to the first term $\delta(i = s_{mt})$
    • CompObjfAndDeriv
      Computes the derivative with respect to the posteriors via the chain rule:
      $\frac{\partial J}{\partial P(s|o_t)} = \frac{\partial J}{\partial \log p(o_t|s)} \cdot \frac{\partial \log p(o_t|s)}{\partial P(s|o_t)} = \mathrm{post} \cdot \frac{1}{P(s|o_t)}$
  3. Backprop()
    Backpropagates through the network layer by layer
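The division by the prior in step 2 is a purely element-wise operation in the log domain; here is a minimal sketch for a single frame (PosteriorsToLogLikes is an illustrative helper, not the actual Kaldi API):

#include <cmath>
#include <vector>

// Convert network posteriors P(s|o_t) for one frame into pseudo
// log-likelihoods log p(o_t|s) = log P(s|o_t) - log P(s); these replace
// the acoustic scores on the denominator-lattice arcs.
std::vector<double> PosteriorsToLogLikes(const std::vector<double> &posteriors,
                                         const std::vector<double> &priors) {
  std::vector<double> loglikes(posteriors.size());
  for (size_t s = 0; s < posteriors.size(); ++s)
    loglikes[s] = std::log(posteriors[s]) - std::log(priors[s]);
  return loglikes;
}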

Formula derivation

For the detailed derivations, see this blog post.
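For reference, the two objectives whose gradients are computed above, as written in [2] ($\kappa$ is the acoustic scale, $s_m$ the numerator alignment of utterance $m$, and $A(s_w, s_m)$ the raw state accuracy):

\begin{align*}
\mathcal{F}_{\mathrm{MMI}}  &= \sum_m \log
  \frac{p(\mathbf{o}_m \mid \mathbf{s}_m)^{\kappa}\, P(w_m)}
       {\sum_{w} p(\mathbf{o}_m \mid \mathbf{s}_w)^{\kappa}\, P(w)} \\
\mathcal{F}_{\mathrm{sMBR}} &= \sum_m
  \frac{\sum_{w} p(\mathbf{o}_m \mid \mathbf{s}_w)^{\kappa}\, P(w)\, A(\mathbf{s}_w, \mathbf{s}_m)}
       {\sum_{w'} p(\mathbf{o}_m \mid \mathbf{s}_{w'})^{\kappa}\, P(w')}
\end{align*}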
The forward-backward algorithm actually used in sMBR training follows the pseudocode in the "Computation for approximate MPE" section of [1]. The standard $\alpha$ and $\beta$ are easy to follow, but the extra $\alpha'$ and $\beta'$ (alpha_smbr and beta_smbr in the code) introduced to compute $A(s, s_m)$ are less intuitive: they can be read as the average state accuracy of the partial paths reaching an arc.
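A sketch of those quantities in the probability domain, following my reading of the pseudocode in [1] and of alpha_smbr/beta_smbr in Kaldi's lattice-functions.cc ($a(q)$ is the frame accuracy of arc $q$, $s_q$ and $e_q$ its start and end states, $p(q)$ its likelihood):

\begin{align*}
\alpha(n)  &= \sum_{q:\, e_q = n} \alpha(s_q)\, p(q) \\
\alpha'(n) &= \sum_{q:\, e_q = n} \frac{\alpha(s_q)\, p(q)}{\alpha(n)}
              \bigl(\alpha'(s_q) + a(q)\bigr) \\
\beta(n)   &= \sum_{q:\, s_q = n} p(q)\, \beta(e_q) \\
\beta'(n)  &= \sum_{q:\, s_q = n} \frac{p(q)\, \beta(e_q)}{\beta(n)}
              \bigl(\beta'(e_q) + a(q)\bigr)
\end{align*}

$\alpha'(n)$ is thus a likelihood-weighted average of the accuracy accumulated by partial paths into $n$, and the $c(q)$ above works out to $\alpha'(s_q) + a(q) + \beta'(e_q)$.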

Other notes

  1. Each lattice arc carries two weights:
    • graph cost: LM + transition + pronunciation
    • acoustic cost: $-\log p(o|s)$
  2. The difference between MPE and sMBR shows up mainly in how $A(s, s_m)$ is computed:
    • MPE: whether the phone at each frame matches the alignment
    • sMBR: whether the state at each frame matches the alignment
  3. The one_silence_class option changes what counts as a match in $A(s, s_m)$ (see the first sketch after this list):
    • true: a frame counts as correct if pdf == ref_pdf, or if pdf and ref_pdf both belong to silence
    • false: a frame counts as correct only if pdf == ref_pdf and pdf is not silence
  4. drop_frames
    According to [2]: in MMI training, when the state from the numerator alignment does not appear in the denominator lattice at a frame, the gradient for that frame blows up, so it is discarded. This mostly happens in utterances shorter than 50 frames, and rejecting such frames (drop_frames) works around it (see the second sketch after this list). Causes include:
    • search errors
    • a poor match between the acoustics and the model
    • errors in the reference
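A minimal sketch of the per-frame accuracy test from point 3 (FrameAccuracy is an illustrative helper that mirrors the rule above, not the exact Kaldi code):

#include "base/kaldi-common.h"  // int32, assuming the kaldi namespace

// Per-frame contribution to A(s, s_m): compare the arc's pdf against the
// reference (numerator-alignment) pdf, with silence handling selected by
// one_silence_class.
double FrameAccuracy(int32 pdf, bool pdf_is_silence,
                     int32 ref_pdf, bool ref_is_silence,
                     bool one_silence_class) {
  if (one_silence_class) {
    // Correct if the pdfs match, or if both frames are (any kind of) silence.
    return (pdf == ref_pdf || (pdf_is_silence && ref_is_silence)) ? 1.0 : 0.0;
  } else {
    // Correct only if the pdfs match and the frame is not silence.
    return (pdf == ref_pdf && !pdf_is_silence) ? 1.0 : 0.0;
  }
}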
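And a sketch of the drop_frames test from point 4 (again an assumed helper; den_pdfs_at_t would hold the pdf-ids active in the denominator lattice at frame t):

#include <set>
#include "base/kaldi-common.h"  // int32, assuming the kaldi namespace

// Returns true if frame t's gradient should be dropped: the numerator pdf
// never appears in the denominator lattice at that frame, so the MMI
// gradient delta(i = s_t) - gamma_t^den(i) cannot cancel and blows up.
bool DropFrame(int32 num_pdf_at_t, const std::set<int32> &den_pdfs_at_t) {
  return den_pdfs_at_t.count(num_pdf_at_t) == 0;
}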

Practical settings

Summarizing from [1]:
- Setting the lattice beam too small hurts accuracy (MPE is less affected than MMI); regenerating lattices during training brings little gain and is very time-consuming
- Use a unigram language model
- Keeping the acoustic scale the same as in decoding is fine; sometimes a smaller value (e.g. 1/4 of the original) helps

sMBR only pays off with large amounts of data.

References

[1] D. Povey, "Discriminative Training for Large Vocabulary Speech Recognition", PhD thesis, University of Cambridge, 2003.
[2] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks", in Proc. Interspeech, 2013.
