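This is my second time setting this up. The first time was on a friend's server, which was later shut down before I could copy the data off, so I am writing everything up again.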
Reference links:
https://github.com/kaldi-asr/kaldi/issues/1014 on the purpose of the trials file
http://blog.csdn.net/zjm750617105/article/details/52421814 a summary of adapting sre10/v1 in Kaldi to do speaker recognition with the TIMIT dataset
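Data directories:
sre data set for training the PLDA
train data set for training the i-vector extractor
sre10_train enrollment data set
sre10_test test (recognition) data set
Because the Tsinghua data set is not very large, sre and train are merged into sre_train. The actual directories are:
sre_train
sre10_train
sre10_test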
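Files that need to be generated in each directory:
utt2spk maps each utterance to the speaker who produced it
<utterance-id> <speaker-id>
sw02001-A_000098-001156 2001-A
spk2utt lists all utterances of each speaker; it is generated automatically from utt2spk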
utils/utt2spk_to_spk2utt.pl data/train/utt2spk >data/train/spk2utt
<speaker-id> <utterance-id1> <utterance-id2> ...
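wav.scp maps each utterance ID to its audio file
<utterance-id> <path-to-wav-file>
spk2gender gives the gender of each speaker
<speaker-id> <gender>
adg0 f
trials defines the verification trials (generated only for sre10_test and used for verification)
<spkA> <spkA-utt1> target
<spkA> <spkB-utt1> nontarget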
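To make the relationship between utt2spk and spk2utt concrete, here is a minimal sketch that builds a toy utt2spk and derives spk2utt from it. The utterance and speaker IDs are made up for illustration, and the awk function is only a stand-in for utils/utt2spk_to_spk2utt.pl, not its actual implementation.

```shell
#!/bin/bash
# Toy example: the IDs below are hypothetical, not taken from the real data set.
tmp=$(mktemp -d)
cat > "$tmp/utt2spk" <<'EOF'
A11_0 A11
A11_1 A11
B22_0 B22
EOF

# Group utterances by speaker, keeping first-seen speaker order.
utt2spk_to_spk2utt() {
  awk '{
    if (!($2 in utts)) { order[++n] = $2; utts[$2] = $1 }
    else               { utts[$2] = utts[$2] " " $1 }
  }
  END { for (i = 1; i <= n; i++) print order[i], utts[order[i]] }' "$1"
}

utt2spk_to_spk2utt "$tmp/utt2spk" > "$tmp/spk2utt"
cat "$tmp/spk2utt"
```

Each output line starts with a speaker ID followed by all of that speaker's utterance IDs, which is the layout the spk2utt description above expects.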
Code that generates the files above:
Generating the trials file (saved as utils/generater_trials.pl):
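#!/usr/bin/env perl
if ( @ARGV > 1 ) {
    die "Usage: generater_trials.pl [ utt2spk ] > trials";
}
# Read utt2spk and collect each speaker's utterances, keeping first-seen speaker order.
while(<>){
    @A = split(" ", $_);
    @A == 2 || die "Invalid line in utt2spk file: $_";
    ($u,$s) = @A;
    if(!$seen_spk{$s}) {
        $seen_spk{$s} = 1;
        push @spklist, $s;
    }
    push (@{$spk_hash{$s}}, "$u");
}
# Percentage of speakers that yield target trials: 50 means about 50% (70 would mean about 70%).
$ratio = 50;
# Number of test speakers
$len = @spklist;
# A nontarget trial needs a second speaker; with only one, the search loop below would never end.
$len > 1 || die "Need at least two speakers to generate nontarget trials";
# For each speaker, emit target or nontarget trials according to the ratio
foreach $s (@spklist) {
    $is_target=int(rand(100));
    if($is_target < $ratio){
        # Target: pair the speaker with each of its own utterances
        @spk_utt = @{$spk_hash{$s}};
        foreach $u (@spk_utt) {
            print "$s $u target\n";
        }
    }
    else{
        # Nontarget: pick a random different speaker and pair $s with that speaker's utterances
        $other_spk_id=int(rand($len));
        while($spklist[$other_spk_id] eq $s)
        {
            $other_spk_id=int(rand($len));
        }
        $other_spk = $spklist[$other_spk_id];
        @spk_utt = @{$spk_hash{$other_spk}};
        foreach $u (@spk_utt) {
            print "$s $u nontarget\n";
        }
    }
}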
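Because the script decides target or nontarget at random for each speaker, the actual mix of labels differs from run to run. A quick way to check it is to count the third column of the generated file. The sketch below does this on a small hypothetical trials file; the IDs and the count_trials helper are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical trials content in the "<speaker-id> <utterance-id> <label>" format.
tmp=$(mktemp -d)
cat > "$tmp/trials" <<'EOF'
A11 A11_0 target
A11 A11_1 target
B22 A11_0 nontarget
B22 A11_1 nontarget
EOF

# Print the number of trials per label, one label per line, sorted by label.
count_trials() {
  awk '{ c[$3]++ } END { for (k in c) print k, c[k] }' "$1" | sort
}

count_trials "$tmp/trials"
```

On the real data this would be run as count_trials data/sre10_test/trials after the trials file has been generated.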
Code that generates the other files:
#!/bin/bash
# Adapted from egs/thchs30/s5/local/thchs-30_data_prep.sh
# Generates the wav.scp, utt2spk, spk2utt and spk2gender files
# sort -u removes duplicate lines
# xargs -I{} basename {} .wav strips the directory and the .wav suffix
# $dir should be an absolute path, because the loop below changes directory
dir=$1
# Kaldi expects its data files to be sorted in the C locale
export LC_ALL=C
(
for x in sre_train sre10_test sre10_train; do
  echo "cleaning data/$x"
  cd $dir/data/$x || exit 1
  rm -rf wav.scp utt2spk spk2utt spk2gender spk2gender_repeat trials
  # Take the base name of every wav file, without the .wav suffix
  for nn in `find $dir/data/$x/*.wav | xargs -I{} basename {} .wav`; do
    # Speaker ID: the part of the file name before the first "_"
    spkid=`echo $nn | awk -F"_" '{print "" $1}'`
    # Gender: taken from the leading capital letter of the speaker ID
    spk_char=`echo $spkid | sed 's/\([A-Z]\).*/\1/'`
    # Speaker number: the speaker ID without its leading letter
    spk_num=`echo $spkid | sed 's/[A-Z]\([0-9]\)/\1/'`
    # Reassemble the speaker ID
    spkid=$(printf '%s%s' "$spk_char" "$spk_num")
    # The utterance ID is the file name itself
    uttid=$nn
    # Convert the gender letter to lower case
    spk_gender=$(echo $spk_char | tr '[A-Z]' '[a-z]')
    echo $uttid $dir/data/$x/$nn.wav >> wav.scp
    echo $uttid $spkid >> utt2spk
    echo $spkid $spk_gender >> spk2gender_repeat
  done
  sort wav.scp -o wav.scp
  sort utt2spk -o utt2spk
  # Sort by speaker ID and drop the duplicate lines in spk2gender
  sort -u spk2gender_repeat > spk2gender
  rm -rf spk2gender_repeat
  echo "files in data/$x have been created"
done
) || exit 1
# Build spk2utt from utt2spk (run from the directory that contains utils/ and data/)
utils/utt2spk_to_spk2utt.pl data/sre_train/utt2spk > data/sre_train/spk2utt
utils/utt2spk_to_spk2utt.pl data/sre10_test/utt2spk > data/sre10_test/spk2utt
utils/utt2spk_to_spk2utt.pl data/sre10_train/utt2spk > data/sre10_train/spk2utt
echo "prepare_data over"
echo "create trials of sre10_test"
# Generate the trials file for the test set
utils/generater_trials.pl data/sre10_test/utt2spk > data/sre10_test/trials
echo "create trials over"
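Before moving on to feature extraction, it helps to confirm that wav.scp and utt2spk list the same utterance IDs in C-locale sorted order, since Kaldi tools generally expect that. Below is a minimal sketch of such a check. The helper name check_data_dir and the toy directory are hypothetical; Kaldi's own utils/validate_data_dir.sh (run with --no-feats --no-text at this stage) performs a far more thorough validation.

```shell
#!/bin/bash
# check_data_dir is a hypothetical helper, not part of Kaldi.
check_data_dir() {
  d=$1
  # Both files must exist and be non-empty.
  [ -s "$d/wav.scp" ] && [ -s "$d/utt2spk" ] || { echo "missing files in $d" >&2; return 1; }
  # Both files must already be sorted in the C locale.
  LC_ALL=C sort -c "$d/wav.scp" 2>/dev/null && LC_ALL=C sort -c "$d/utt2spk" 2>/dev/null \
    || { echo "files in $d are not sorted" >&2; return 1; }
  # The utterance IDs (first column) must be identical in both files.
  ids_wav=$(awk '{print $1}' "$d/wav.scp")
  ids_utt=$(awk '{print $1}' "$d/utt2spk")
  [ "$ids_wav" = "$ids_utt" ] || { echo "utterance IDs differ in $d" >&2; return 1; }
  echo "$d looks consistent"
}

# Toy data directory with made-up IDs and paths, only to exercise the check.
tmp=$(mktemp -d)
printf 'A11_0 /data/A11_0.wav\nA11_1 /data/A11_1.wav\n' > "$tmp/wav.scp"
printf 'A11_0 A11\nA11_1 A11\n' > "$tmp/utt2spk"
check_data_dir "$tmp"
```

On the real data the same function would be called once per directory, for example check_data_dir data/sre10_test.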