Reading the source code behind Kaldi's MFCC features


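In my previous post I compared the MFCC feature values produced by three different methods. Kaldi's features gave the best results, but I could not work out what made them better than the other two. This post tries to find the answer.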

Finding out how Kaldi produces MFCC data

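Every Kaldi example (under the egs folder) contains a run.sh script that runs that example, so I wondered whether this script would show how the MFCC data we need is generated. I opened run.sh in vim with the command: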

vi run.sh

Locate the place where the MFCC files are generated. There we find the following statement:

#make  mfcc
   steps/make_mfcc.sh --nj $n --cmd "$train_cmd" data/mfcc/$x exp/make_mfcc/$x mfcc/$x || exit 1;

Following that lead, let's look at make_mfcc.sh in the steps folder. Change into that folder and open the file in vim:

vi make_mfcc.sh

Again, locate the statement that generates the MFCCs. The relevant part calls the compute-mfcc-feats program to compute the MFCC feature values of the speech data:

compute-mfcc-feats $vtln_opts $write_utt2dur_opt --verbose=2 \
      --config=$mfcc_config scp,p:$logdir/wav_${name}.JOB.scp ark:- \| \

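But where is that program? After some searching, I found it in the featbin directory under Kaldi's src source folder.
The file found there is the compiled executable. To read its source we need the corresponding .cc file, which is the C++ file compute-mfcc-feats.cc.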
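Having traced things back this far, we have finally reached the bottom. I copied these files to a Windows 10 machine to read them, though you could just as well keep reading in a virtual machine or on a dual-boot system. Opening compute-mfcc-feats.cc from the src/featbin folder in Notepad++ shows the following: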

// featbin/compute-mfcc-feats.cc

// Copyright 2009-2012  Microsoft Corporation
//                      Johns Hopkins University (author: Daniel Povey)

// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//  http://www.apache.org/licenses/LICENSE-2.0
//
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.

#include "base/kaldi-common.h"
#include "feat/feature-mfcc.h"
#include "feat/wave-reader.h"
#include "util/common-utils.h"

int main(int argc, char *argv[]) {
  try {
    using namespace kaldi;
    const char *usage =
        "Create MFCC feature files.\n"
        "Usage:  compute-mfcc-feats [options...] <wav-rspecifier> "
        "<feats-wspecifier>\n";

    // Construct all the global objects.
    ParseOptions po(usage);
    MfccOptions mfcc_opts;
    // Define defaults for global options.
    bool subtract_mean = false;
    BaseFloat vtln_warp = 1.0;
    std::string vtln_map_rspecifier;
    std::string utt2spk_rspecifier;
    int32 channel = -1;
    BaseFloat min_duration = 0.0;
    std::string output_format = "kaldi";
    std::string utt2dur_wspecifier;

    // Register the MFCC option struct.
    mfcc_opts.Register(&po);

    // Register the options.
    po.Register("output-format", &output_format, "Format of the output "
                "files [kaldi, htk]");
    po.Register("subtract-mean", &subtract_mean, "Subtract mean of each "
                "feature file [CMS]; not recommended to do it this way. ");
    po.Register("vtln-warp", &vtln_warp, "Vtln warp factor (only applicable "
                "if vtln-map not specified)");
    po.Register("vtln-map", &vtln_map_rspecifier, "Map from utterance or "
                "speaker-id to vtln warp factor (rspecifier)");
    po.Register("utt2spk", &utt2spk_rspecifier, "Utterance to speaker-id map "
                "rspecifier (if doing VTLN and you have warps per speaker)");
    po.Register("channel", &channel, "Channel to extract (-1 -> expect mono, "
                "0 -> left, 1 -> right)");
    po.Register("min-duration", &min_duration, "Minimum duration of segments "
                "to process (in seconds).");
    po.Register("write-utt2dur", &utt2dur_wspecifier, "Wspecifier to write "
                "duration of each utterance in seconds, e.g. 'ark,t:utt2dur'.");

    po.Read(argc, argv);

    if (po.NumArgs() != 2) {
      po.PrintUsage();
      exit(1);
    }

    std::string wav_rspecifier = po.GetArg(1);

    std::string output_wspecifier = po.GetArg(2);

    Mfcc mfcc(mfcc_opts);

    if (utt2spk_rspecifier != "" && vtln_map_rspecifier == "")
      KALDI_ERR << ("The --utt2spk option is only needed if "
                    "the --vtln-map option is used.");
    RandomAccessBaseFloatReaderMapped vtln_map_reader(vtln_map_rspecifier,
                                                      utt2spk_rspecifier);

    SequentialTableReader<WaveHolder> reader(wav_rspecifier);
    BaseFloatMatrixWriter kaldi_writer;  // typedef to TableWriter<something>.
    TableWriter<HtkMatrixHolder> htk_writer;

    if (output_format == "kaldi") {
      if (!kaldi_writer.Open(output_wspecifier))
        KALDI_ERR << "Could not initialize output with wspecifier "
                  << output_wspecifier;
    } else if (output_format == "htk") {
      if (!htk_writer.Open(output_wspecifier))
        KALDI_ERR << "Could not initialize output with wspecifier "
                  << output_wspecifier;
    } else {
      KALDI_ERR << "Invalid output_format string " << output_format;
    }

    DoubleWriter utt2dur_writer(utt2dur_wspecifier);

    int32 num_utts = 0, num_success = 0;
    for (; !reader.Done(); reader.Next()) {
      num_utts++;
      std::string utt = reader.Key();
      const WaveData &wave_data = reader.Value();
      if (wave_data.Duration() < min_duration) {
        KALDI_WARN << "File: " << utt << " is too short ("
                   << wave_data.Duration() << " sec): producing no output.";
        continue;
      }
      int32 num_chan = wave_data.Data().NumRows(), this_chan = channel;
      {  // This block works out the channel (0=left, 1=right...)
        KALDI_ASSERT(num_chan > 0);  // should have been caught in
        // reading code if no channels.
        if (channel == -1) {
          this_chan = 0;
          if (num_chan != 1)
            KALDI_WARN << "Channel not specified but you have data with "
                       << num_chan  << " channels; defaulting to zero";
        } else {
          if (this_chan >= num_chan) {
            KALDI_WARN << "File with id " << utt << " has "
                       << num_chan << " channels but you specified channel "
                       << channel << ", producing no output.";
            continue;
          }
        }
      }
      BaseFloat vtln_warp_local;  // Work out VTLN warp factor.
      if (vtln_map_rspecifier != "") {
        if (!vtln_map_reader.HasKey(utt)) {
          KALDI_WARN << "No vtln-map entry for utterance-id (or speaker-id) "
                     << utt;
          continue;
        }
        vtln_warp_local = vtln_map_reader.Value(utt);
      } else {
        vtln_warp_local = vtln_warp;
      }

      SubVector<BaseFloat> waveform(wave_data.Data(), this_chan);
      Matrix<BaseFloat> features;
      try {
        mfcc.ComputeFeatures(waveform, wave_data.SampFreq(),
                             vtln_warp_local, &features);
      } catch (...) {
        KALDI_WARN << "Failed to compute features for utterance " << utt;
        continue;
      }
      if (subtract_mean) {
        Vector<BaseFloat> mean(features.NumCols());
        mean.AddRowSumMat(1.0, features);
        mean.Scale(1.0 / features.NumRows());
        for (int32 i = 0; i < features.NumRows(); i++)
          features.Row(i).AddVec(-1.0, mean);
      }
      if (output_format == "kaldi") {
        kaldi_writer.Write(utt, features);
      } else {
        std::pair<Matrix<BaseFloat>, HtkHeader> p;
        p.first.Resize(features.NumRows(), features.NumCols());
        p.first.CopyFromMat(features);
        HtkHeader header = {
          features.NumRows(),
          100000,  // 10ms shift
          static_cast<int16>(sizeof(float)*(features.NumCols())),
          static_cast<uint16>( 006 | // MFCC
          (mfcc_opts.use_energy ? 0100 : 020000)) // energy; otherwise c0
        };
        p.second = header;
        htk_writer.Write(utt, p);
      }
      if (utt2dur_writer.IsOpen()) {
        utt2dur_writer.Write(utt, wave_data.Duration());
      }
      if (num_utts % 10 == 0)
        KALDI_LOG << "Processed " << num_utts << " utterances";
      KALDI_VLOG(2) << "Processed features for key " << utt;
      num_success++;
    }
    KALDI_LOG << " Done " << num_success << " out of " << num_utts
              << " utterances.";
    return (num_success != 0 ? 0 : 1);
  } catch(const std::exception &e) {
    std::cerr << e.what();
    return -1;
  }
}
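
One detail in the listing above that is easy to misread is the HTK header. In C++ an integer literal with a leading 0 is octal, so 006, 0100 and 020000 are not decimal numbers. The sketch below isolates that one expression in a standalone function. HtkParmKind is a hypothetical name used only for illustration and is not part of Kaldi; the meanings given in the comments (MFCC, energy, c0) are taken from the comments in the listing itself.

```cpp
#include <cstdint>

// Hypothetical helper (not a Kaldi function): reproduces the expression the
// listing uses to fill the "parameter kind" field of the HTK header.
std::uint16_t HtkParmKind(bool use_energy) {
  // A leading 0 makes these literals octal:
  //   006    == 6    (MFCC)
  //   0100   == 64   (energy flag)
  //   020000 == 8192 (c0 flag, used when energy is not)
  return static_cast<std::uint16_t>(006 | (use_energy ? 0100 : 020000));
}
```

Under this sketch, HtkParmKind(true) evaluates to 70 and HtkParmKind(false) to 8198. The 100000 in the same header is the frame shift expressed in HTK's units of 100 ns, which is the 10 ms mentioned in the listing's comment.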

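The source is written in C++, so understanding its details and steps also takes some knowledge of C++.
My own C++ is limited, so for now I cannot yet work out what every part of the code means.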
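
One step that can be read on its own is the --subtract-mean branch of the listing, which averages each feature column over all frames of an utterance and subtracts that average from every frame. The following is a minimal sketch of the same arithmetic using std::vector instead of Kaldi's Matrix and Vector classes. FeatureMatrix and SubtractMean are hypothetical names introduced for illustration, and the sketch says nothing about how Kaldi implements AddRowSumMat, Scale or AddVec internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rows are frames, columns are feature dimensions (illustrative type only).
using FeatureMatrix = std::vector<std::vector<float>>;

// Hypothetical stand-in for the --subtract-mean branch of the listing.
void SubtractMean(FeatureMatrix *features) {
  if (features->empty()) return;
  const std::size_t num_rows = features->size();
  const std::size_t num_cols = (*features)[0].size();

  // Sum every column over all frames (the role of mean.AddRowSumMat).
  std::vector<float> mean(num_cols, 0.0f);
  for (const auto &row : *features) {
    assert(row.size() == num_cols);  // all frames must have the same width
    for (std::size_t j = 0; j < num_cols; ++j) mean[j] += row[j];
  }

  // Turn the sums into means (the role of mean.Scale).
  for (float &m : mean) m /= static_cast<float>(num_rows);

  // Subtract the mean from every frame (the role of Row(i).AddVec(-1.0, mean)).
  for (auto &row : *features)
    for (std::size_t j = 0; j < num_cols; ++j) row[j] -= mean[j];
}
```

For example, two frames {1, 10} and {3, 30} have column means 2 and 20, so after the call they become {-1, -10} and {1, 10}. Note that the option's own help text in the listing says doing mean subtraction this way is not recommended.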

How to find the source files for other parts of Kaldi

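Through this exploration I also came across the source files for other components, such as hmm (hidden Markov models), gmm (Gaussian mixture models), ivector, and so on. I only mention them here; if later work requires it, the code in each of these source files will need a closer look.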
The road of speech recognition is still long and I still know too little; only by continuing to learn can I keep growing.
