Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Link: https://blog.csdn.net/xmdxcsj/article/details/54695512
Training pipeline
make_denlats.sh
Generates the decoding graph for a unigram language model.
nnet-latgen-faster produces the denominator lattices, which act as the denominator in sMBR training, i.e. the competing paths.
align.sh
Builds a WFST using the reference transcription as G.
Decoding then yields the best path, giving the per-frame numerator alignment, which acts as the numerator in sMBR training, i.e. the reference path.
get_egs_discriminative2.sh
Reorganizes the data: the numerator alignments and denominator lattices produced by the two steps above, together with the input features, are packed (with split and excise operations applied where necessary) into the following data structure:
```cpp
struct DiscriminativeNnetExample {
  /// The weight we assign to this example;
  /// this will typically be one, but we include it
  /// for the sake of generality.
  BaseFloat weight;

  /// The numerator alignment.
  std::vector<int32> num_ali;

  /// The denominator lattice. Note: any acoustic
  /// likelihoods in the denominator lattice will be
  /// recomputed at the time we train.
  CompactLattice den_lat;

  /// The input data -- typically with a number of frames [NumRows()] larger
  /// than labels.size(), because it includes features to the left and right
  /// as needed for the temporal context of the network (see also the
  /// left_context variable).
  /// Caution: when we write this to disk, we do so as CompressedMatrix.
  /// Because we do various manipulations on these things in memory, such
  /// as splitting, we don't want it to be a CompressedMatrix in memory
  /// as this would be wasteful in time and also would lead to further loss
  /// of accuracy.
  Matrix<BaseFloat> input_frames;

  /// The number of frames of left context in the features (we can work out
  /// the #frames of right context from input_frames.NumRows(),
  /// num_ali.size(), and this).
  int32 left_context;

  /// spk_info contains any component of the features that varies slowly or
  /// not at all with time (and hence, we would lose little by averaging it
  /// over time and storing the average). We'll append this to each of the
  /// input features, if used.
  Vector<BaseFloat> spk_info;
};
```
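The comment on left_context implies a simple frame-count relation. As a minimal sketch (the helper name `RightContext` is mine, not Kaldi's), the number of right-context frames can be recovered like this:

```cpp
#include <cstdint>

// Hypothetical helper mirroring the comment in DiscriminativeNnetExample:
// the input rows consist of left_context + labeled frames + right_context,
// so the right context is whatever is left over.
int32_t RightContext(int32_t num_input_rows,
                     int32_t num_ali_size,
                     int32_t left_context) {
  return num_input_rows - num_ali_size - left_context;
}
```

For example, 110 feature rows covering 100 aligned frames with 4 frames of left context imply 6 frames of right context.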
train_discriminative2.sh
nnet-combine-egs-discriminative: regroups the egs so that 512 frames form one eg, corresponding to one minibatch.
Discriminative training then starts; the outermost loop controls the number of iterations used, i.e. the epochs are converted into a number of iterations:
num_archives = 5000 (e.g. the number of files produced by the degs stage)
num_jobs_nnet = 4
num_epochs = 4
num_iters = num_epochs * num_archives / num_jobs_nnet = 5000
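The iteration bookkeeping above can be sketched in a few lines (a simplification of the script's arithmetic: each epoch visits every archive once, and one archive is consumed per parallel job per iteration):

```cpp
// Sketch of the epoch-to-iteration conversion described above:
//   num_iters = num_epochs * num_archives / num_jobs_nnet
// With 4 epochs over 5000 archives and 4 parallel jobs, this gives 5000.
int NumIters(int num_epochs, int num_archives, int num_jobs_nnet) {
  return num_epochs * num_archives / num_jobs_nnet;
}
```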
Kaldi source code
Data
The data part, egs_, includes:
- num_ali
dimension: number of frames
the aligned state sequence, used as the reference
- den_lat
the corresponding denominator lattice
- input_frames
dimension: (left_context + number of frames + right_context) × frame_dim
the acoustic features
The model part includes:
- am_nnet
used to compute P(s|o), and updated during training
Training flow
Propagate()
Computes the posterior $P(s|o)$ for the input features.
LatticeComputation()
Computes the loss for the chosen criterion.
The posterior $P(s|o)$ from the previous step is divided by the prior $P(s)$ to obtain a scaled likelihood $P(o|s)$, which replaces the acoustic likelihoods on the lattice arcs.
For each criterion it then computes the posterior $\mathrm{post} = \frac{\partial J}{\partial \log P(o_t|s)}$:
<1> LatticeForwardBackwardMpeVariants
Computes $\mathrm{post} = \frac{\partial J}{\partial \log P(o_t|s)} = r_q\,(c(q) - c_{\mathrm{avg}})$
$r_q$: the occupancy of arc $q$, i.e. $\frac{\alpha_q \beta_q}{\sum_r \alpha_r \beta_r}$
$c(q)$: the average state accuracy of all paths through arc $q$, corresponding to alpha_smbr[q] + beta_smbr[q]
$c_{\mathrm{avg}}$: the average state accuracy over all paths, corresponding to tot_forward_score
<2> LatticeForwardBackwardMmi
LatticeForwardBackward: corresponds to the second (denominator) term $\frac{\sum_{w:\,s_t=i} p(o|s_w)^k P(w)}{\sum_w p(o|s_w)^k P(w)}$
AlignmentToPosterior: corresponds to the first (numerator) term $\delta(i = s_t^m)$
CompObjfAndDeriv
Computes the derivative with respect to the posterior via the chain rule:
$\frac{\partial J}{\partial P(s|o_t)} = \frac{\partial J}{\partial \log P(o_t|s)} \cdot \frac{\partial \log P(o_t|s)}{\partial P(s|o_t)} = \mathrm{post} \cdot \frac{1}{P(s|o_t)}$
Backprop()
Backpropagates the derivative through the layers.
Formula derivation
For the detailed derivations, see this blog post.
The forward-backward algorithm actually used in sMBR training follows the pseudocode in the "Computation for approximate MPE" section of [1]. The quantities $\alpha$ and $\beta$ are easy to understand, but the $\alpha'$ and $\beta'$ introduced to compute $A(s, s^m)$ are less so; they can be read as the average state accuracy of the partial paths reaching an arc.
Other notes
Each lattice arc carries two weights:
graph cost: LM + transition + pronunciation
acoustic cost: $-\log p(o|s)$
The main difference between MPE and sMBR shows up in how $A(s, s^m)$ is computed:
MPE: whether the phone at each frame matches the alignment
sMBR: whether the state at each frame matches the alignment
The one_silence_class option affects how $A(s, s^m)$ counts a frame as correct:
true: a frame counts as correct if pdf = ref_pdf, or if pdf and ref_pdf are both silence
false: a frame counts as correct only if pdf = ref_pdf and pdf is not silence
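The two branches above can be sketched as a single predicate (identifiers here are illustrative, not Kaldi's actual code):

```cpp
#include <set>

// Sketch of the per-frame accuracy test controlled by one_silence_class:
// with the option on, any two silence pdfs count as matching;
// with it off, silence frames never count as correct.
bool FrameCorrect(int pdf, int ref_pdf, const std::set<int> &silence_pdfs,
                  bool one_silence_class) {
  bool pdf_sil = silence_pdfs.count(pdf) > 0;
  bool ref_sil = silence_pdfs.count(ref_pdf) > 0;
  if (one_silence_class)
    return pdf == ref_pdf || (pdf_sil && ref_sil);
  else
    return pdf == ref_pdf && !pdf_sil;
}
```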
drop_frames
According to [2]: in MMI training, when the numerator-alignment state does not appear in the denominator lattice at a given frame, the gradient for that frame blows up, so that frame's gradient is discarded. This mostly happens in utterances shorter than about 50 frames, and frame rejection (drop_frames) addresses it. Possible causes include:
search errors
a poor match of the acoustics to the model
errors in the reference
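The rejection test itself is simple. A sketch, assuming the set of pdfs active in the denominator lattice at frame t has been collected (function and variable names are mine):

```cpp
#include <set>

// Sketch of the frame-rejection heuristic described above: drop the
// frame's gradient if the numerator pdf does not occur among the pdfs
// active in the denominator lattice at that frame.
bool DropFrame(int num_pdf, const std::set<int> &den_pdfs_at_t) {
  return den_pdfs_at_t.count(num_pdf) == 0;
}
```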
Empirical settings
Summarizing [1]:
- A small lattice beam hurts accuracy (MPE is affected less than MMI); regenerating lattices during training brings little benefit and is very time-consuming
- Use a unigram language model
- Keep the acoustic scale the same as in decoding; sometimes a smaller value (e.g. 1/4 of it) helps
sMBR only pays off with large amounts of data.
References
[1] Discriminative Training for Large Vocabulary Speech Recognition
[2] Sequence-discriminative Training of Deep Neural Networks