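This is my second time setting this up. The first attempt was on a friend's server, which was later shut down before I could copy the data off, so I am writing everything up again.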
Reference links:
https://github.com/kaldi-asr/kaldi/issues/1014 on what the trials file is for
http://blog.csdn.net/zjm750617105/article/details/52421814 a summary of adapting sre10/v1 in Kaldi to do speaker recognition on the TIMIT dataset
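Data directories:
sre  dataset for training PLDA
train  dataset for training the i-vector extractor
sre10_train  enrollment dataset
sre10_test  test (recognition) dataset
Since the Tsinghua dataset is not very large, sre and train are merged into sre_train. The actual directories are:
sre_train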
sre10_train
sre10_test
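If sre and train already exist as separate Kaldi-style data directories, the merge could be sketched roughly as follows. This is only an illustration: `merge_data_dirs` is a hypothetical helper, not part of Kaldi, and it assumes utterance IDs do not collide between the two sets. Kaldi ships utils/combine_data.sh for this job, which is probably the better choice in a real recipe.

```shell
#!/bin/bash
# Hypothetical helper (not a Kaldi script): merge the per-utterance and
# per-speaker files of several data directories into one destination.
# Usage: merge_data_dirs <dest-dir> <src-dir1> <src-dir2> ...
merge_data_dirs() {
  local dest=$1
  shift
  mkdir -p "$dest"
  local f src
  for f in wav.scp utt2spk spk2gender; do
    for src in "$@"; do
      # Skip files that a source directory does not have
      if [ -f "$src/$f" ]; then
        cat "$src/$f"
      fi
    done | LC_ALL=C sort -u > "$dest/$f"
  done
}

# Example call (paths are assumptions):
#   merge_data_dirs data/sre_train data/sre data/train
```

spk2utt is left out on purpose, since it can be regenerated from the merged utt2spk.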
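Files that need to be generated in each directory:
utt2spk  maps each utterance to the speaker who produced it
<utterance-id> <speaker-id>
sw02001-A_000098-001156 2001-A
spk2utt  lists all utterances of each speaker; generated automatically from utt2spk
utils/utt2spk_to_spk2utt.pl data/train/utt2spk >data/train/spk2utt
<speaker-id> <utterance-id1> <utterance-id2> ...
wav.scp  maps each utterance ID to its audio file
<utterance-id> <path-to-wav-file>
spk2gender  gives the gender of each speaker
<speaker-id> <gender>
adg0 f
trials  defines the verification trials (generated only for sre10_test and used for evaluation)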
<spkA> <spkA-utt1> target
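To make the utt2spk and spk2utt formats concrete, here is a small sketch that derives spk2utt from utt2spk with awk. It is meant to illustrate the file layout, not to replace utils/utt2spk_to_spk2utt.pl; the function name and the sample IDs in the usage comment are made up.

```shell
#!/bin/bash
# Illustrative only: group utterance IDs by speaker.
# Reads "<utterance-id> <speaker-id>" lines from stdin and prints
# "<speaker-id> <utterance-id1> <utterance-id2> ..." to stdout,
# keeping speakers in order of first appearance.
utt2spk_to_spk2utt() {
  awk '
    NF != 2 {
      print "Invalid line in utt2spk: " $0 > "/dev/stderr"
      bad = 1
      exit 1
    }
    # First utterance of a new speaker: remember the speaker order
    !($2 in utts) { order[++n] = $2; utts[$2] = $1; next }
    # Further utterances of a known speaker: append to its list
    { utts[$2] = utts[$2] " " $1 }
    END {
      if (bad) exit 1
      for (i = 1; i <= n; i++) print order[i], utts[order[i]]
    }
  '
}

# Example (sample IDs are made up):
#   printf 'A1_0 A1\nA1_1 A1\nB2_0 B2\n' | utt2spk_to_spk2utt
```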
Code that generates the files above:
Generating the trials file:
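if ( @ARGV > 1 ) {
  die "Usage: generater_trials.pl [ utt2spk ] > trials";
}
# Read utt2spk and collect the utterances of each speaker
while(<>){
  @A = split(" ", $_);
  @A == 2 || die "Invalid line in utt2spk file: $_";
  ($u,$s) = @A;
  if(!$seen_spk{$s}) {
    $seen_spk{$s} = 1;
    push @spklist, $s;
  }
  push (@{$spk_hash{$s}}, "$u");
}
# Percentage of speakers that produce target trials (70 would mean about 70% target)
$ratio = 50;
# Number of speakers in the test set
$len = @spklist;
# A nontarget trial needs a second speaker to draw from; without one the loop below never ends
$len > 1 || die "Need at least two speakers to generate nontarget trials";
# Assign each speaker to target or nontarget according to $ratio
foreach $s (@spklist) {
  $is_target=int(rand(100));
  if($is_target < $ratio){
    # Target: pair the speaker with each of its own utterances
    @spk_utt = @{$spk_hash{$s}};
    foreach $u (@spk_utt) {
      print "$s $u target\n";
    }
  }
  else{
    # Nontarget: pick a random different speaker and use its utterances
    $other_spk_id=int(rand($len));
    while($spklist[$other_spk_id] eq $s)
    {
      $other_spk_id=int(rand($len));
    }
    $other_spk = $spklist[$other_spk_id];
    @spk_utt = @{$spk_hash{$other_spk}};
    foreach $u (@spk_utt) {
      print "$s $u nontarget\n";
    }
  }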
}
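Because the target/nontarget decision is made once per speaker and every utterance of the chosen speaker is then emitted, the balance of the resulting trials file can differ noticeably from $ratio. A quick way to check is to count the labels in the third column. The helper name `count_trials` below is hypothetical.

```shell
#!/bin/bash
# Illustrative only: count target and nontarget lines in a trials file.
# Usage: count_trials <trials-file>
count_trials() {
  awk '
    $3 == "target"    { t++ }
    $3 == "nontarget" { n++ }
    END { printf "target=%d nontarget=%d\n", t, n }
  ' "$1"
}

# Example call (path is an assumption):
#   count_trials data/sre10_test/trials
```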
Code that generates the other files:
#!/bin/bash
# ----egs\thchs30\s5\local\thchs-30_data_prep.sh
# Generates the wav.scp, utt2spk, spk2utt and spk2gender files
# sort -u removes duplicate lines
# xargs -i basename {} .wav strips the directory and the .wav suffix
# $dir should be an absolute path, because the loop below changes directory
dir=$1
(
for x in sre_train sre10_test sre10_train; do
  echo "cleaning data/$x"
  cd $dir/data/$x || exit 1
  rm -rf wav.scp utt2spk spk2utt spk2gender spk2gender_repeat trials
  # Extract the base name of every wav file
  for nn in `find $dir/data/$x/*.wav | xargs -i basename {} .wav`; do
    # Speaker ID: the part before the first underscore
    spkid=`echo $nn | awk -F"_" '{print "" $1}'`
    # Gender: the leading letter of the speaker ID
    spk_char=`echo $spkid | sed 's/\([A-Z]\).*/\1/'`
    # Speaker number: the speaker ID without the leading letter
    spk_num=`echo $spkid | sed 's/[A-Z]\([0-9]\)/\1/'`
    # Reassemble the speaker ID
    spkid=$(printf '%s%s' "$spk_char" "$spk_num")
    # The utterance ID is the file name without its extension
    uttid=$nn
    # Convert the gender letter to lowercase
    spk_gender=$(echo $spk_char | tr '[A-Z]' '[a-z]')
    echo $uttid $dir/data/$x/$nn.wav >> wav.scp
    echo $uttid $spkid >> utt2spk
    echo $spkid $spk_gender >> spk2gender_repeat
  done
  sort wav.scp -o wav.scp
  sort utt2spk -o utt2spk
  # One line per speaker: drop the duplicates written in the loop above
  sort -u spk2gender_repeat > spk2gender
  rm -rf spk2gender_repeat
  echo "files in data/$x have been created"
done
) || exit 1
utils/utt2spk_to_spk2utt.pl data/sre_train/utt2spk > data/sre_train/spk2utt
utils/utt2spk_to_spk2utt.pl data/sre10_test/utt2spk > data/sre10_test/spk2utt
utils/utt2spk_to_spk2utt.pl data/sre10_train/utt2spk > data/sre10_train/spk2utt
echo "prepare_data over"
echo "create trials of sre10_test"
# Generate the trials file for the test set
utils/generater_trials.pl data/sre10_test/utt2spk > data/sre10_test/trials
echo "create trials over"
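After the script has run, it is worth confirming that wav.scp and utt2spk list exactly the same utterance IDs. The sketch below does this with plain awk; `check_utt_ids` is a hypothetical name. Kaldi's own utils/validate_data_dir.sh (with options such as --no-feats and --no-text at this stage) performs a much more thorough check and is preferable when it is available.

```shell
#!/bin/bash
# Illustrative only: verify that wav.scp and utt2spk in a data directory
# contain the same set of utterance IDs (first column of each file).
# Usage: check_utt_ids <data-dir>
check_utt_ids() {
  local dir=$1
  awk '
    # Lines from the first file argument (wav.scp)
    FILENAME == ARGV[1] { in_wav[$1] = 1; next }
    # Lines from the second file argument (utt2spk)
    { in_utt[$1] = 1 }
    END {
      for (u in in_wav) if (!(u in in_utt)) { print "only in wav.scp: " u; bad = 1 }
      for (u in in_utt) if (!(u in in_wav)) { print "only in utt2spk: " u; bad = 1 }
      exit bad + 0
    }
  ' "$dir/wav.scp" "$dir/utt2spk"
}

# Example call (path is an assumption):
#   check_utt_ids data/sre10_test && echo "IDs match"
```

The function returns a non-zero status when any ID appears in only one of the two files, so it can gate the later steps of a recipe.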