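Overview of Kaldi Decoding Acceleration Strategies

Preface

This article describes several ways to speed up a decoder. It is based on the Kaldi chain-model decoder (online2-wav-nnet3-latgen-faster), with a model trained for a wake-word scenario. The optimizations cover feature extraction, TDNN neural network computation, the FST, and obtaining the 1-best result from the lattice.

In addition to these methods, the Kaldi decoder, OpenFst, OpenBLAS, and similar dependencies can be compiled with the -O3 optimization flag and, where the hardware supports it, with hardware floating-point options (for example, for ARM NEON: -mfloat-abi=softfp -mfpu=neon).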

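1. Speeding up feature extraction

This section covers MFCC feature extraction. By default, MFCC extraction enables dither, which adds a small amount of random noise to the waveform so that degenerate input, such as all-zero samples, does not produce values like log(0). Generating the random numbers is expensive (see snippet 3 below), so removing dither saves roughly 1/2 to 2/3 of the MFCC computation time.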

// Snippet 1
void ProcessWindow(const FrameExtractionOptions &opts,
                   const FeatureWindowFunction &window_function,
                   VectorBase<BaseFloat> *window,
                   BaseFloat *log_energy_pre_window) {
  ...
  if (opts.dither != 0.0)
    Dither(window, opts.dither);
  ...
  window->MulElements(window_function.window);
}

// Snippet 2
void Dither(VectorBase<BaseFloat> *waveform, BaseFloat dither_value) {
  ...
  RandomState rstate;
  for (int32 i = 0; i < dim; i++)
    data[i] += RandGauss(&rstate) * dither_value;
}

// Snippet 3
/// Returns a random number strictly between 0 and 1.
inline float RandUniform(struct RandomState* state = NULL) {
  return static_cast<float>((Rand(state) + 1.0) / (RAND_MAX+2.0));
}

inline float RandGauss(struct RandomState* state = NULL) {
  return static_cast<float>(sqrtf (-2 * Log(RandUniform(state)))
                            * cosf(2*M_PI*RandUniform(state)));
}
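
To make the per-sample cost concrete, here is a minimal standalone sketch of the same work. It is not Kaldi code: std::rand() stands in for Kaldi's Rand(), and the names UniformOpen01, GaussBoxMuller and DitherSketch are hypothetical. It only illustrates that every sample of every frame pays for two random draws plus a log, a sqrt and a cos, and that a dither value of 0 skips all of it.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

// Uniform value strictly between 0 and 1, so that log() never sees zero.
// std::rand() is an assumption standing in for Kaldi's Rand().
inline float UniformOpen01() {
  return static_cast<float>((std::rand() + 1.0) / (RAND_MAX + 2.0));
}

// Box-Muller transform: one log, one sqrt, one cos and two random draws
// for every Gaussian sample.
inline float GaussBoxMuller() {
  const float kTwoPi = 6.28318530718f;
  return std::sqrt(-2.0f * std::log(UniformOpen01())) *
         std::cos(kTwoPi * UniformOpen01());
}

// Adds Gaussian noise scaled by dither_value to each sample of the window.
// A dither_value of 0 returns immediately and does no random-number work.
void DitherSketch(std::vector<float> *window, float dither_value) {
  assert(window != nullptr);
  if (dither_value == 0.0f) return;
  for (float &sample : *window) sample += GaussBoxMuller() * dither_value;
}
```

In Kaldi itself the switch is the dither value checked in snippet 1 (opts.dither). Kaldi's help text for that option suggests setting an energy floor when dither is turned off, so that silent frames remain well behaved.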

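2. TDNN neural network computation

Optimization 1:
Besides tuning the OpenBLAS build options, the other approach is to rewrite the matrix multiply-add operation (X * W + B) for the specific hardware platform. Why only the matrix multiply-add? Because loading the acoustic model final.mdl involves an initialization step: after the model is compiled, only the weights and biases are kept, and parameters stored in final.mdl such as batch normalization and dropout are not used during decoding. The entry point for compiling the final.mdl model is: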
nnet3::CollapseModel(nnet3::CollapseModelConfig(), &(am_nnet.GetNnet()));

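Decoding only runs the forward pass, and only three component types appear in it: NaturalGradientAffineComponent, AffineComponent, and RectifiedLinearComponent (ReLU). Since NaturalGradientAffineComponent inherits from AffineComponent, the forward pass effectively uses only AffineComponent and RectifiedLinearComponent. Every decoding run loops over the commands produced by compilation (the weights and biases of each layer are fixed once the model is compiled).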

void NnetComputer::ExecuteCommand() {
  const NnetComputation::Command &c = computation_.commands[program_counter_];
  int32 m1, m2;
  ...
      case kPropagate: {
        const Component *component = nnet_.GetComponent(c.arg1);
        ComponentPrecomputedIndexes *indexes =
            computation_.component_precomputed_indexes[c.arg2].data;
        const CuSubMatrix<BaseFloat> input(GetSubMatrix(c.arg3));
        CuSubMatrix<BaseFloat> output(GetSubMatrix(c.arg4));
        void *memo = component->Propagate(indexes, input, &output);
        ...
        }
        SaveMemo(c.arg5, *component, memo);
        break;
      }
}

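Because the weights and biases of each layer do not change after compilation, we can extract them and quantize them to 8 or 16 bits. Quantization replaces floating-point arithmetic with fixed-point arithmetic, which is faster, and it also reduces memory use, since float values take more space. Its one drawback is that the matrix operations have to be rewritten, and the rewrite only pays off if it is faster than OpenBLAS. That is a real difficulty: the OpenBLAS routines take float or double arguments and do not support quantized types.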
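
As a rough illustration of what such a rewrite involves, the sketch below applies symmetric per-tensor int8 quantization to one affine layer. It is a plain reference loop under stated assumptions, not the author's implementation and not tuned for any platform: the names QuantizedTensor, Quantize and AffineInt8 are hypothetical, the weight matrix is stored row-major, and the bias is kept in float.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-tensor int8 quantization: real_value ~= scale * q.
struct QuantizedTensor {
  std::vector<int8_t> values;
  float scale;
};

// Maps the largest absolute value to 127 and rounds everything else.
QuantizedTensor Quantize(const std::vector<float> &x) {
  float max_abs = 0.0f;
  for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
  QuantizedTensor q;
  q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
  q.values.reserve(x.size());
  for (float v : x) {
    long r = std::lround(v / q.scale);
    q.values.push_back(static_cast<int8_t>(std::max(-127L, std::min(127L, r))));
  }
  return q;
}

// Computes y = W * x + b for one input frame, with W stored row-major as
// rows x cols int8 values. Products are accumulated in int32 and converted
// back to float once per output row.
std::vector<float> AffineInt8(const QuantizedTensor &w, int rows, int cols,
                              const QuantizedTensor &x,
                              const std::vector<float> &bias) {
  assert(static_cast<int>(w.values.size()) == rows * cols);
  assert(static_cast<int>(x.values.size()) == cols);
  assert(static_cast<int>(bias.size()) == rows);
  std::vector<float> y(rows);
  for (int r = 0; r < rows; ++r) {
    int32_t acc = 0;
    for (int c = 0; c < cols; ++c)
      acc += static_cast<int32_t>(w.values[r * cols + c]) * x.values[c];
    y[r] = acc * w.scale * x.scale + bias[r];
  }
  return y;
}
```

The weights can be quantized once after the model is loaded, while the input has to be quantized per frame. A real speed-up would come from replacing the inner loop with the platform's integer SIMD instructions, which this sketch leaves out.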

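Optimization 2: Reduce the size of the acoustic model

Remove the ivector features when training the model; ivectors make the acoustic model considerably larger.
Apply singular value decomposition (SVD) to an already trained final.mdl to shrink it. The apply-svd edit, passed through the "--edits-config" option of nnet3-copy or nnet3-am-copy, reduces the size of final.mdl; an acoustic model of about 1.2M shrinks by about 300k. The resulting final.mdl has very poor recognition accuracy and must be retrained; after a few epochs, accuracy is essentially back to its previous level.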
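
The size reduction from SVD comes down to parameter counting: a rows x cols weight matrix is replaced by two thinner matrices that share a bottleneck dimension k. The sketch below only does that arithmetic. The function names are hypothetical, and the assumption that the bias stays with the second factor is an illustration rather than a statement about Kaldi's internals.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Parameters of a full affine layer: rows x cols weights plus rows biases.
std::size_t FullAffineParams(std::size_t rows, std::size_t cols) {
  return rows * cols + rows;
}

// Parameters after a rank-k factorization W ~= A * B, where A is rows x k
// and B is k x cols. The rows biases are assumed to stay with A.
std::size_t FactoredAffineParams(std::size_t rows, std::size_t cols,
                                 std::size_t k) {
  assert(k > 0 && k <= std::min(rows, cols));
  return rows * k + k * cols + rows;
}

// True when the factored form is strictly smaller than the full layer,
// which requires k < rows * cols / (rows + cols).
bool FactoringSavesSpace(std::size_t rows, std::size_t cols, std::size_t k) {
  return FactoredAffineParams(rows, cols, k) < FullAffineParams(rows, cols);
}
```

For a hypothetical 256 x 256 layer, a bottleneck of 64 roughly halves the parameter count, while a bottleneck of 128 saves nothing. The bottleneck has to be well below half the layer width before the factorization is worth the retraining.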

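3. FST optimization

Optimization 1:
Strictly speaking, this is not an FST optimization. The idea is to adjust the pruning beam used when processing emitting arcs. Kaldi's default is 15.0; lowering it speeds up decoding at the cost of only a slight drop in accuracy, and memory use also falls somewhat because fewer tokens are kept.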
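
The effect of the beam on the token count can be sketched as follows. This is a simplified standalone illustration, not Kaldi's decoder: the Token struct and PruneToBeam are hypothetical names, and Kaldi's real pruning also involves limits on the number of active tokens that are omitted here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A decoding hypothesis on the current frame. Lower cost is better.
struct Token {
  int state;
  float cost;
};

// Keeps only the tokens whose cost is within `beam` of the best token.
// A smaller beam keeps fewer tokens, so fewer arcs are expanded next frame.
std::vector<Token> PruneToBeam(const std::vector<Token> &tokens, float beam) {
  assert(beam >= 0.0f);
  if (tokens.empty()) return {};
  float best = tokens[0].cost;
  for (const Token &t : tokens) best = std::min(best, t.cost);
  const float cutoff = best + beam;
  std::vector<Token> kept;
  for (const Token &t : tokens)
    if (t.cost <= cutoff) kept.push_back(t);
  return kept;
}
```

With token costs of 10, 12, 18 and 27, a beam of 15 keeps three tokens and a beam of 5 keeps two. The risk of a narrow beam is that the token that would eventually win gets dropped early, which is where the small accuracy loss comes from.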

Optimization 2:
Prune HCLG.fst. The smaller HCLG.fst is, the fewer arcs are traversed, and the faster decoding becomes.

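4. Getting the 1-best from the lattice

The online2-wav-nnet3-latgen-faster decoder ultimately produces a CompactLattice. Before the CompactLattice is generated there is also a Lattice (called the RawLattice in the code), and this RawLattice is all we need to obtain the 1-best path. The key reason is that generating the CompactLattice is expensive, because it performs determinization at the phone and word level. In noisy environments with background speech in particular, it makes decoding time unstable, with delays sometimes reaching 6 to 8 s, whereas obtaining the RawLattice takes a fairly stable time (within 50 ms on a CPU of the same performance). The CompactLattice is also unnecessary for this purpose: when GetDiagnosticsAndPrintOutput prints the final result, it converts the CompactLattice back into a Lattice to get the decoded sequence.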

template <typename FST>
void SingleUtteranceNnet3DecoderTpl<FST>::GetLattice(bool end_of_utterance,
                                             CompactLattice *clat) const {
  if (NumFramesDecoded() == 0)
    KALDI_ERR << "You cannot get a lattice if you decoded no frames.";
  Lattice raw_lat;
  decoder_.GetRawLattice(&raw_lat, end_of_utterance);

  if (!decoder_opts_.determinize_lattice)
    KALDI_ERR << "--determinize-lattice=false option is not supported at the moment";

  BaseFloat lat_beam = decoder_opts_.lattice_beam;
  //
  DeterminizeLatticePhonePrunedWrapper(
      trans_model_, &raw_lat, lat_beam, clat, decoder_opts_.det_opts);
}
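
Taking the 1-best directly from the raw lattice amounts to a single shortest-path pass over an acyclic graph, with no determinization. The sketch below shows that pass on a toy stand-in for a lattice. It does not use Kaldi or OpenFst types: the Arc struct and BestPathWords are hypothetical, states are assumed to be numbered in topological order, and state 0 is assumed to be the start state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Toy stand-in for a raw lattice arc. Every arc goes from a lower-numbered
// state to a higher-numbered one, so state order is a topological order.
struct Arc {
  int from;
  int to;
  int word;    // 0 means no word on this arc (epsilon)
  float cost;  // combined graph and acoustic cost, lower is better
};

// Returns the word sequence on the cheapest path from state 0 to
// final_state, or an empty sequence if final_state cannot be reached.
std::vector<int> BestPathWords(const std::vector<Arc> &arcs, int num_states,
                               int final_state) {
  assert(num_states > 0 && final_state >= 0 && final_state < num_states);
  const float kInf = std::numeric_limits<float>::infinity();
  std::vector<float> best(num_states, kInf);
  std::vector<int> back(num_states, -1);  // index of the best incoming arc
  best[0] = 0.0f;
  // Relax outgoing arcs state by state, in topological order.
  for (int s = 0; s < num_states; ++s) {
    if (best[s] == kInf) continue;
    for (std::size_t i = 0; i < arcs.size(); ++i) {
      const Arc &a = arcs[i];
      if (a.from != s) continue;
      assert(a.to > a.from && a.to < num_states);
      if (best[s] + a.cost < best[a.to]) {
        best[a.to] = best[s] + a.cost;
        back[a.to] = static_cast<int>(i);
      }
    }
  }
  std::vector<int> words;
  if (best[final_state] == kInf) return words;
  // Walk the back-pointers from the final state to the start state.
  for (int s = final_state; s != 0; s = arcs[back[s]].from)
    if (arcs[back[s]].word != 0) words.push_back(arcs[back[s]].word);
  std::reverse(words.begin(), words.end());
  return words;
}
```

The work is proportional to the size of the lattice and does not depend on how ambiguous the audio is, which is consistent with the stable timing reported for the RawLattice. In Kaldi the same step would go through the library's own best-path routines on the Lattice returned by GetRawLattice rather than code like this.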

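Summary

This article summarized several ways to speed up decoding. In a noisy environment, decoding time went from 5 to 8 s initially to a stable 400 to 500 ms now (VAD cuts speech segments of a fixed length; 1.2 GHz CPU).

Slimming down the decoder by removing redundant code also helps somewhat, though less than expected. After trimming the decoder code and replacing and rewriting the OpenFst library (which uses a lot of memory), the decoder executable can be reduced to under 300k, not counting the final.mdl and HCLG.fst models (OpenBLAS is retained).

If you know of other methods, you are welcome to discuss them.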
