A detailed walkthrough of triphone training in Kaldi

Triphone
Train a context-dependent triphone model, using the monophone model's alignments as input.

 #triphone
steps/train_deltas.sh --boost-silence 1.25 --cmd "$train_cmd" 2000 10000 data/mfcc/train data/lang exp/mono_ali exp/tri1 || exit 1;
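
When this finishes, exp/tri1 contains the phonetic decision tree and the trained model. A minimal sanity check of the tree (a sketch; tree-info is a standard Kaldi tool and the path follows this recipe):

# For a triphone system this reports context-width 3 and central-position 1;
# the number of pdfs should be close to the 2000 leaves requested above.
tree-info exp/tri1/tree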

The relevant configuration in train_deltas.sh is as follows (an example of overriding these options from the command line appears after the usage block):

stage=-4 # This allows restarting partway through, if something went wrong.
config=
cmd=run.pl
scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
realign_iters="10 20 30";
num_iters=35    # Number of iterations of training
max_iter_inc=25 # Last iter to increase #Gauss on.
beam=10
careful=false
retry_beam=40
boost_silence=1.0 # Factor by which to boost silence likelihoods in alignment
power=0.25 # Exponent for number of gaussians according to occurrence counts
cluster_thresh=-1  # for build-tree: controls final bottom-up clustering of leaves
norm_vars=false # deprecated.  Prefer --cmvn-opts "--norm-vars=true"
                # use the option --cmvn-opts "--norm-means=false"
cmvn_opts=
delta_opts=
context_opts=   # use "--context-width=5 --central-position=2" for quinphone
Usage:

echo "Usage: steps/train_deltas.sh <num-leaves> <tot-gauss> <data-dir> <lang-dir> <alignment-dir> <exp-dir>"
echo "e.g.: steps/train_deltas.sh 2000 10000 data/train_si84_half data/lang exp/mono_ali exp/tri1"

LDA_MLLT
Transform the features with LDA and MLLT, and train a triphone model on top of the transformed features.
LDA+MLLT refers to the way we transform the features after computing the MFCCs: we splice across several frames, reduce the dimension (to 40 by default) using Linear Discriminant Analysis (LDA), and then later estimate, over multiple iterations, a diagonalizing transform known as MLLT (also called STC, semi-tied covariance).
See http://kaldi-asr.org/doc/transform.html for details.
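
Concretely, the LDA and MLLT estimates end up composed into a single projection matrix, final.mat, in the output directory of the train_lda_mllt.sh step below (exp/tri2b). A rough sketch of the resulting feature pipeline (splice-feats and transform-feats are standard Kaldi tools; the CMVN stage used by the real scripts is omitted for brevity):

# splice +/-3 frames of 13-dim MFCCs (91 dims), then project to 40 dims with final.mat
splice-feats --left-context=3 --right-context=3 scp:data/mfcc/train/feats.scp ark:- | \
  transform-feats exp/tri2b/final.mat ark:- ark,t:- | head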

#triphone_ali

steps/align_si.sh --nj $n --cmd "$train_cmd" data/mfcc/train data/lang exp/tri1 exp/tri1_ali || exit 1;
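
The alignment directory exp/tri1_ali holds per-job Viterbi alignments in ali.*.gz. To inspect them, they can be converted to phone-id sequences; a small sketch (ali-to-phones is a standard Kaldi tool):

# print the phone-id sequence of the first few utterances in job 1
ali-to-phones exp/tri1_ali/final.mdl \
  "ark:gunzip -c exp/tri1_ali/ali.1.gz |" ark,t:- | head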

lda_mllt

steps/train_lda_mllt.sh --cmd "$train_cmd" --splice-opts "--left-context=3 --right-context=3" 2500 15000 data/mfcc/train data/lang exp/tri1_ali exp/tri2b || exit 1;

test tri2b model

local/thchs-30_decode.sh --nj $n "steps/decode.sh" exp/tri2b data/mfcc &
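
The decode wrapper runs in the background (the trailing &). Once it finishes, the standard way to summarize error rates is to scan the scoring files in the decode directories; a sketch (utils/best_wer.sh ships with Kaldi; the decode directory names depend on the wrapper script):

for d in exp/tri2b/decode*; do
  [ -d $d ] && grep WER $d/wer_* | utils/best_wer.sh
done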

The relevant configuration in train_lda_mllt.sh is as follows (a small check of the resulting transform appears after the list):

cmd=run.pl
config=
stage=-5
scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
realign_iters="10 20 30";
mllt_iters="2 4 6 12";
num_iters=35    # Number of iterations of training
max_iter_inc=25  # Last iter to increase #Gauss on.
dim=40
beam=10
retry_beam=40
careful=false
boost_silence=1.0 # Factor by which to boost silence likelihoods in alignment
power=0.25 # Exponent for number of gaussians according to occurrence counts
randprune=4.0 # This is approximately the ratio by which we will speed up the
              # LDA and MLLT calculations via randomized pruning.
splice_opts=
cluster_thresh=-1  # for build-tree: controls final bottom-up clustering of leaves
norm_vars=false # deprecated.  Prefer --cmvn-opts "--norm-vars=false"
cmvn_opts=
context_opts=   # use "--context-width=5 --central-position=2" for quinphone.
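
After training, the composed LDA+MLLT transform sits in exp/tri2b/final.mat; dumping it in text form is a quick sanity check (copy-matrix is a standard Kaldi tool):

copy-matrix --binary=false exp/tri2b/final.mat - | head -n 2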

SAT
Speaker Adapted Training using feature-space Maximum Likelihood Linear Regression (fMLLR).
This does Speaker Adapted Training (SAT), i.e. train on fMLLR-adapted features. It can be done on top of either LDA+MLLT, or delta and delta-delta features. If there are no transforms supplied in the alignment directory, it will estimate transforms itself before building the tree (and in any case, it estimates transforms a number of times during training).
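
During SAT training the per-speaker fMLLR transforms are written as trans.JOB in the experiment directory and applied on top of the spliced LDA+MLLT features. A simplified sketch of that part of the pipeline (transform-feats is a standard Kaldi tool; job number 1 and the split-data path are assumptions, and CMVN is again omitted):

splice-feats --left-context=3 --right-context=3 scp:data/mfcc/train/split$n/1/feats.scp ark:- | \
  transform-feats exp/tri3b/final.mat ark:- ark:- | \
  transform-feats --utt2spk=ark:data/mfcc/train/split$n/1/utt2spk \
    ark:exp/tri3b/trans.1 ark:- ark,t:- | head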

lda_mllt_ali

steps/align_si.sh --nj $n --cmd "$train_cmd" --use-graphs true data/mfcc/train data/lang exp/tri2b exp/tri2b_ali || exit 1;

sat

steps/train_sat.sh --cmd "$train_cmd" 2500 15000 data/mfcc/train data/lang exp/tri2b_ali exp/tri3b || exit 1;

test tri3b model

local/thchs-30_decode.sh --nj $n "steps/decode_fmllr.sh" exp/tri3b data/mfcc &
The configuration in train_sat.sh is as follows (an example with a non-default fMLLR update type follows the list):

stage=-5
exit_stage=-100 # you can use this to require it to exit at the
                # beginning of a specific stage.  Not all values are
                # supported.
fmllr_update_type=full
cmd=run.pl
scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
beam=10
retry_beam=40
careful=false
boost_silence=1.0 # Factor by which to boost silence likelihoods in alignment
context_opts=  # e.g. set this to "--context-width 5 --central-position 2" for quinphone.
realign_iters="10 20 30";
fmllr_iters="2 4 6 12";
silence_weight=0.0 # Weight on silence in fMLLR estimation.
num_iters=35   # Number of iterations of training
max_iter_inc=25 # Last iter to increase #Gauss on.
power=0.2 # Exponent for number of gaussians according to occurrence counts
cluster_thresh=-1  # for build-tree: controls final bottom-up clustering of leaves
phone_map=
train_tree=true
tree_stats_opts=
cluster_phones_opts=
compile_questions_opts=
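
fmllr_update_type=full above means a full fMLLR matrix is estimated per speaker; diagonal or offset-only updates are also supported and are cheaper to estimate. A hedged example of choosing a diagonal update (the output directory exp/tri3b_diag is made up for illustration):

steps/train_sat.sh --cmd "$train_cmd" --fmllr-update-type diag \
  2500 15000 data/mfcc/train data/lang exp/tri2b_ali exp/tri3b_diag || exit 1;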

decode_fmllr.sh: decodes with a speaker-adapted (fMLLR) model.
Decoding script that does fMLLR. This can be on top of delta+delta-delta, or LDA+MLLT features.
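
local/thchs-30_decode.sh is only a wrapper; decode_fmllr.sh can also be called directly once a decoding graph exists for the SAT model. A sketch assuming a test set under data/mfcc/test and a lang directory containing G.fst (these paths are assumptions, not taken verbatim from the recipe):

utils/mkgraph.sh data/graph/lang exp/tri3b exp/tri3b/graph || exit 1;
steps/decode_fmllr.sh --nj $n --cmd "$decode_cmd" \
  exp/tri3b/graph data/mfcc/test exp/tri3b/decode_test || exit 1;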
train_quick.sh trains a larger model on top of the existing features and alignments, initializing it from the previous system's model rather than from scratch (hence "quick"). The fMLLR-space alignments in exp/tri3b_ali are produced from the tri3b model with steps/align_fmllr.sh.

steps/train_quick.sh --cmd "$train_cmd" 4200 40000 data/mfcc/train data/lang exp/tri3b_ali exp/tri4b || exit 1;

test tri4b model

local/thchs-30_decode.sh --nj $n "steps/decode_fmllr.sh" exp/tri4b data/mfcc &
The configuration in train_quick.sh is as follows (a quick model-size check follows the list):

# Begin configuration section.

cmd=run.pl
scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
realign_iters="10 15"; # Only realign twice.
num_iters=20    # Number of iterations of training
maxiterinc=15 # Last iter to increase #Gauss on.
batch_size=750 # batch size to use while compiling graphs... memory/speed tradeoff.
beam=10 # alignment beam.
retry_beam=40
stage=-5
cluster_thresh=-1  # for build-tree: controls final bottom-up clustering of leaves

# End configuration section.
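
Once train_quick.sh has finished, a quick way to confirm the model reached roughly the requested size (4200 leaves, 40000 Gaussians) is gmm-info, a standard Kaldi tool:

# prints, among other things, "number of pdfs" and "number of gaussians"
gmm-info exp/tri4b/final.mdl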
