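Reading the MFCC Source Code in Kaldi
In my previous post I compared the MFCC features produced by three different methods and found that Kaldi gave the best results, but I did not work out what exactly makes it better than the other two. This post is an attempt to answer that question.
Tracing how Kaldi generates MFCC data
Every Kaldi example (under the egs folder) contains a run.sh script that runs the recipe, so I wondered whether the step that produces the MFCC data could be found in that script. Open run.sh in vim with the command: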
vi run.sh
Locate the place where the MFCC files are generated, as shown below:
In the screenshot above, we can also see the following statement:
#make mfcc
steps/make_mfcc.sh --nj $n --cmd "$train_cmd" data/mfcc/$x exp/make_mfcc/$x mfcc/$x || exit 1;
Following the trail, let's look at make_mfcc.sh in the steps folder. Go to that folder and open the file in vim:
vi make_mfcc.sh
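Again we locate the statements related to MFCC generation:
The highlighted part in the figure above calls the compute-mfcc-feats program to compute the MFCC features of the speech data: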
compute-mfcc-feats $vtln_opts $write_utt2dur_opt --verbose=2 \
--config=$mfcc_config scp,p:$logdir/wav_${name}.JOB.scp ark:- \| \
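But where does this program live? After some searching, I found it under Kaldi's src directory: the executable and its source file are in src/featbin, and the MFCC code it includes (feat/feature-mfcc.h) is in src/feat:
The file selected in the figure is the compiled executable. To read its source code we need the corresponding .cc file, which is compute-mfcc-feats.cc.
Having dug all the way down, we have finally reached the bottom. I copied these files to a Windows 10 system to read them; you can of course keep reading them in the virtual machine or on a dual-boot system. Open compute-mfcc-feats.cc in the src/featbin folder with Notepad++; its contents are as follows: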
// featbin/compute-mfcc-feats.cc

// Copyright 2009-2012  Microsoft Corporation
//                      Johns Hopkins University (author: Daniel Povey)

// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//  http://www.apache.org/licenses/LICENSE-2.0
//
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.

#include "base/kaldi-common.h"
#include "feat/feature-mfcc.h"
#include "feat/wave-reader.h"
#include "util/common-utils.h"

int main(int argc, char *argv[]) {
  try {
    using namespace kaldi;
    const char *usage =
        "Create MFCC feature files.\n"
        "Usage:  compute-mfcc-feats [options...] <wav-rspecifier> "
        "<feats-wspecifier>\n";

    // Construct all the global objects.
    ParseOptions po(usage);
    MfccOptions mfcc_opts;
    // Define defaults for global options.
    bool subtract_mean = false;
    BaseFloat vtln_warp = 1.0;
    std::string vtln_map_rspecifier;
    std::string utt2spk_rspecifier;
    int32 channel = -1;
    BaseFloat min_duration = 0.0;
    std::string output_format = "kaldi";
    std::string utt2dur_wspecifier;

    // Register the MFCC option struct.
    mfcc_opts.Register(&po);

    // Register the options.
    po.Register("output-format", &output_format, "Format of the output "
                "files [kaldi, htk]");
    po.Register("subtract-mean", &subtract_mean, "Subtract mean of each "
                "feature file [CMS]; not recommended to do it this way. ");
    po.Register("vtln-warp", &vtln_warp, "Vtln warp factor (only applicable "
                "if vtln-map not specified)");
    po.Register("vtln-map", &vtln_map_rspecifier, "Map from utterance or "
                "speaker-id to vtln warp factor (rspecifier)");
    po.Register("utt2spk", &utt2spk_rspecifier, "Utterance to speaker-id map "
                "rspecifier (if doing VTLN and you have warps per speaker)");
    po.Register("channel", &channel, "Channel to extract (-1 -> expect mono, "
                "0 -> left, 1 -> right)");
    po.Register("min-duration", &min_duration, "Minimum duration of segments "
                "to process (in seconds).");
    po.Register("write-utt2dur", &utt2dur_wspecifier, "Wspecifier to write "
                "duration of each utterance in seconds, e.g. 'ark,t:utt2dur'.");

    po.Read(argc, argv);

    if (po.NumArgs() != 2) {
      po.PrintUsage();
      exit(1);
    }

    std::string wav_rspecifier = po.GetArg(1);
    std::string output_wspecifier = po.GetArg(2);

    Mfcc mfcc(mfcc_opts);

    if (utt2spk_rspecifier != "" && vtln_map_rspecifier == "")
      KALDI_ERR << ("The --utt2spk option is only needed if "
                    "the --vtln-map option is used.");
    RandomAccessBaseFloatReaderMapped vtln_map_reader(vtln_map_rspecifier,
                                                      utt2spk_rspecifier);

    SequentialTableReader<WaveHolder> reader(wav_rspecifier);
    BaseFloatMatrixWriter kaldi_writer;  // typedef to TableWriter<something>.
    TableWriter<HtkMatrixHolder> htk_writer;

    if (output_format == "kaldi") {
      if (!kaldi_writer.Open(output_wspecifier))
        KALDI_ERR << "Could not initialize output with wspecifier "
                  << output_wspecifier;
    } else if (output_format == "htk") {
      if (!htk_writer.Open(output_wspecifier))
        KALDI_ERR << "Could not initialize output with wspecifier "
                  << output_wspecifier;
    } else {
      KALDI_ERR << "Invalid output_format string " << output_format;
    }

    DoubleWriter utt2dur_writer(utt2dur_wspecifier);

    int32 num_utts = 0, num_success = 0;
    for (; !reader.Done(); reader.Next()) {
      num_utts++;
      std::string utt = reader.Key();
      const WaveData &wave_data = reader.Value();
      if (wave_data.Duration() < min_duration) {
        KALDI_WARN << "File: " << utt << " is too short ("
                   << wave_data.Duration() << " sec): producing no output.";
        continue;
      }
      int32 num_chan = wave_data.Data().NumRows(), this_chan = channel;
      {  // This block works out the channel (0=left, 1=right...)
        KALDI_ASSERT(num_chan > 0);  // should have been caught in
        // reading code if no channels.
        if (channel == -1) {
          this_chan = 0;
          if (num_chan != 1)
            KALDI_WARN << "Channel not specified but you have data with "
                       << num_chan << " channels; defaulting to zero";
        } else {
          if (this_chan >= num_chan) {
            KALDI_WARN << "File with id " << utt << " has "
                       << num_chan << " channels but you specified channel "
                       << channel << ", producing no output.";
            continue;
          }
        }
      }
      BaseFloat vtln_warp_local;  // Work out VTLN warp factor.
      if (vtln_map_rspecifier != "") {
        if (!vtln_map_reader.HasKey(utt)) {
          KALDI_WARN << "No vtln-map entry for utterance-id (or speaker-id) "
                     << utt;
          continue;
        }
        vtln_warp_local = vtln_map_reader.Value(utt);
      } else {
        vtln_warp_local = vtln_warp;
      }

      SubVector<BaseFloat> waveform(wave_data.Data(), this_chan);
      Matrix<BaseFloat> features;
      try {
        mfcc.ComputeFeatures(waveform, wave_data.SampFreq(),
                             vtln_warp_local, &features);
      } catch (...) {
        KALDI_WARN << "Failed to compute features for utterance " << utt;
        continue;
      }
      if (subtract_mean) {
        Vector<BaseFloat> mean(features.NumCols());
        mean.AddRowSumMat(1.0, features);
        mean.Scale(1.0 / features.NumRows());
        for (int32 i = 0; i < features.NumRows(); i++)
          features.Row(i).AddVec(-1.0, mean);
      }
      if (output_format == "kaldi") {
        kaldi_writer.Write(utt, features);
      } else {
        std::pair<Matrix<BaseFloat>, HtkHeader> p;
        p.first.Resize(features.NumRows(), features.NumCols());
        p.first.CopyFromMat(features);
        HtkHeader header = {
          features.NumRows(),
          100000,  // 10ms shift
          static_cast<int16>(sizeof(float)*(features.NumCols())),
          static_cast<uint16>( 006 |  // MFCC
          (mfcc_opts.use_energy ? 0100 : 020000))  // energy; otherwise c0
        };
        p.second = header;
        htk_writer.Write(utt, p);
      }
      if (utt2dur_writer.IsOpen()) {
        utt2dur_writer.Write(utt, wave_data.Duration());
      }
      if (num_utts % 10 == 0)
        KALDI_LOG << "Processed " << num_utts << " utterances";
      KALDI_VLOG(2) << "Processed features for key " << utt;
      num_success++;
    }
    KALDI_LOG << " Done " << num_success << " out of " << num_utts
              << " utterances.";
    return (num_success != 0 ? 0 : 1);
  } catch(const std::exception &e) {
    std::cerr << e.what();
    return -1;
  }
}
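To get a feel for one small piece of the listing, the subtract_mean branch can be tried out without Kaldi. The following is only a minimal sketch of that one step, not Kaldi's implementation: FeatureMatrix and SubtractMean are hypothetical names introduced here, and plain std::vector stands in for Kaldi's Matrix<BaseFloat>. It follows the same three steps as the listing: sum the rows, scale by 1/num_rows, then subtract the mean from every row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Kaldi's Matrix<BaseFloat>:
// one inner vector per frame, one element per feature dimension.
using FeatureMatrix = std::vector<std::vector<float>>;

// Subtracts the per-column mean (taken over all frames) in place.
// Unlike the listing, this sketch returns early on an empty matrix
// instead of dividing by zero rows.
void SubtractMean(FeatureMatrix *features) {
  assert(features != nullptr);
  if (features->empty()) return;
  const std::size_t num_rows = features->size();
  const std::size_t num_cols = (*features)[0].size();

  // Step 1: accumulate the sum of every column.
  std::vector<float> mean(num_cols, 0.0f);
  for (const std::vector<float> &row : *features) {
    assert(row.size() == num_cols);  // all frames must share one dimension
    for (std::size_t j = 0; j < num_cols; ++j) mean[j] += row[j];
  }
  // Step 2: turn the sums into means.
  for (std::size_t j = 0; j < num_cols; ++j)
    mean[j] /= static_cast<float>(num_rows);
  // Step 3: subtract the mean from every frame.
  for (std::vector<float> &row : *features)
    for (std::size_t j = 0; j < num_cols; ++j) row[j] -= mean[j];
}
```

Calling SubtractMean on a two-frame matrix {{1, 10}, {3, 30}} leaves {{-1, -10}, {1, 10}}: each column now averages to zero. Note that the option's own help text in the listing says doing the normalization this way is not recommended.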
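compute-mfcc-feats.cc is written in C++, so understanding its details and the steps it performs requires some knowledge of the language.
My own C++ knowledge is limited, so for now I cannot yet work out what every part of the code means.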
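One part that can be read in isolation is the header written when --output-format=htk is used. The sketch below reproduces only the arithmetic visible in the listing; HtkHeaderSketch, MakeHeader and the constant names are hypothetical and are not Kaldi's own types. The octal values 006, 0100 and 020000 and the period 100000 are copied from the listing, and the reading of the period as 100 ns units is inferred from its "10ms shift" comment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; the octal values come straight from the listing.
constexpr std::uint16_t kKindMfcc = 006;      // base kind: MFCC
constexpr std::uint16_t kKindEnergy = 0100;   // set when use_energy is true
constexpr std::uint16_t kKindC0 = 020000;     // set otherwise (c0)

// Hypothetical stand-in for the four fields initialized in the listing.
struct HtkHeaderSketch {
  std::int32_t num_samples;    // number of frames
  std::int32_t sample_period;  // 100000, commented as a 10 ms shift
  std::int16_t sample_size;    // bytes per frame
  std::uint16_t sample_kind;   // MFCC code plus a qualifier bit
};

HtkHeaderSketch MakeHeader(std::int32_t num_frames, std::int32_t num_cols,
                           bool use_energy) {
  assert(num_frames >= 0 && num_cols > 0);
  HtkHeaderSketch h;
  h.num_samples = num_frames;
  h.sample_period = 100000;  // 100000 * 100 ns = 10 ms
  h.sample_size = static_cast<std::int16_t>(sizeof(float) * num_cols);
  h.sample_kind = static_cast<std::uint16_t>(
      kKindMfcc | (use_energy ? kKindEnergy : kKindC0));
  return h;
}
```

For example, MakeHeader(100, 13, true) gives a sample_kind of 0106 (octal), while MakeHeader(100, 13, false) gives 020006. The sample period is a fixed constant in the listing, so the header says 10 ms whatever frame shift the features were actually computed with.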
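How to find the source files of other Kaldi components
While exploring, I also came across the source files of other components, such as hmm (hidden Markov models), gmm (Gaussian mixture models), ivector and so on. I only show a screenshot of them here; if later research calls for it, the code in each of these source files will need to be studied carefully.
The road to speech recognition is still long and I know far too little about it; only by continuing to learn can I keep growing.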