Code analysis of the fix_data_dir.sh script


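Usage and purpose of the fix_data_dir.sh script:

This script helps ensure that the various files in a data directory are correctly sorted and filtered, for example by removing utterances that have no corresponding features (if feats.scp is present).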

  echo "Usage: utils/data/fix_data_dir.sh <data-dir>"
  echo "e.g.: utils/data/fix_data_dir.sh data/train"
  echo "This script helps ensure that the various files in a data directory"
  echo "are correctly sorted and filtered, for example removing utterances"
  echo "that have no features (if feats.scp is present)"
  exit 1

The functions are called in the following order; the code of each one is explained in detail below:

filter_recordings
filter_speakers
filter_utts
filter_speakers
filter_recordings

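The script checks whether the data directory (here, train) contains each of the files below; every file that exists is backed up to .backup and passed to check_sorted. For the format and details of each file, see this blog post: (link)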

for x in utt2spk spk2utt feats.scp text segments wav.scp cmvn.scp vad.scp \
    reco2file_and_channel spk2gender utt2lang utt2uniq utt2dur reco2dur utt2num_frames; do
  if [ -f $data/$x ]; then
    cp $data/$x $data/.backup/$x
    check_sorted $data/$x
  fi
done

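check_sorted: checks whether a file in the data directory is sorted on its first column and has no duplicate keys; if it is not, the file is replaced by its sorted, de-duplicated version.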

function check_sorted {
  file=$1
  sort -k1,1 -u <$file >$file.tmp
  if ! cmp -s $file $file.tmp; then
    echo "$0: file $1 is not in sorted order or not unique, sorting it"
    mv $file.tmp $file
  else
    rm $file.tmp
  fi
}
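
To make the behaviour of check_sorted concrete, here is a minimal sketch that runs the same sort-and-compare test on a small file. The file contents and the demo_dir variable are made up for illustration, and LC_ALL=C is set on the assumption that the data is sorted in byte order, as Kaldi recipes normally do.

```shell
# Illustration only: the utt2spk contents below are invented.
export LC_ALL=C   # assumption: byte-order sorting, as Kaldi recipes normally use

demo_dir=$(mktemp -d)
# Unsorted, with one duplicated line.
printf 'utt2 spkA\nutt1 spkA\nutt1 spkA\n' > "$demo_dir/utt2spk"

# Same test as check_sorted: sort on field 1, drop duplicate keys, compare.
sort -k1,1 -u < "$demo_dir/utt2spk" > "$demo_dir/utt2spk.tmp"
if ! cmp -s "$demo_dir/utt2spk" "$demo_dir/utt2spk.tmp"; then
  echo "file is not in sorted order or not unique, sorting it"
  mv "$demo_dir/utt2spk.tmp" "$demo_dir/utt2spk"
else
  rm "$demo_dir/utt2spk.tmp"
fi

fixed=$(cat "$demo_dir/utt2spk")
echo "$fixed"
rm -r "$demo_dir"
```

After the run, the file holds one line per utterance-id, in sorted order.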

filter_recordings: this function is called once before filtering on utterance-id, and once more at the very end.

The "segments" file looks like this:

s5# head -3 data/train/segments
sw02001-A_000098-001156 sw02001-A 0.98 11.56
sw02001-A_001980-002131 sw02001-A 19.8 21.31
sw02001-A_002736-002893 sw02001-A 27.36 28.93

function filter_recordings {
  # We call this once before the stage when we filter on utterance-id, and once
  # after.

  if [ -f $data/segments ]; then
  # We have a segments file -> we need to filter this and the file wav.scp, and
  # reco2file_and_utt, if it exists, to make sure they have the same list of
  # recording-ids.
    # First make sure wav.scp exists; segments without wav.scp is an error.
    if [ ! -f $data/wav.scp ]; then
      echo "$0: $data/segments exists but not $data/wav.scp"
      exit 1;
    fi
    # Take the second column of segments (the recording-ids), sort and
    # de-duplicate it, and write the result to recordings.
    awk '{print $2}' < $data/segments | sort | uniq > $tmpdir/recordings
    # Number of lines in recordings.
    n1=$(cat $tmpdir/recordings | wc -l)
    [ ! -s $tmpdir/recordings ] && \
      echo "Empty list of recordings (bad file $data/segments)?" && exit 1;
    # "utils/filter_scp.pl a b" prints the lines of b whose first field
    # appears as the first field of some line in a.
    # filter_scp.pl takes two more options: -f selects which field of b is
    # matched (default 1), and --exclude inverts the filter so that the
    # non-matching lines are printed instead of the matching ones.
    utils/filter_scp.pl $data/wav.scp $tmpdir/recordings > $tmpdir/recordings.tmp
    mv $tmpdir/recordings.tmp $tmpdir/recordings
    # recordings now holds only the recording-ids found in both wav.scp and segments.
    # Swap columns 1 and 2 so that the recording-id comes first.
    cp $data/segments{,.tmp}; awk '{print $2, $1, $3, $4}' <$data/segments.tmp >$data/segments
    # Using recordings as the filter, drop the segments lines whose
    # recording-id is not in wav.scp.
    filter_file $tmpdir/recordings $data/segments
    # Swap columns 1 and 2 back.
    cp $data/segments{,.tmp}; awk '{print $2, $1, $3, $4}' <$data/segments.tmp >$data/segments
    rm $data/segments.tmp
    # Using recordings as the filter, drop the wav.scp lines whose
    # recording-id is not in segments.
    filter_file $tmpdir/recordings $data/wav.scp
    [ -f $data/reco2file_and_channel ] && filter_file $tmpdir/recordings $data/reco2file_and_channel
    [ -f $data/reco2dur ] && filter_file $tmpdir/recordings $data/reco2dur
    true
  fi
}
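
The swap, filter, swap-back sequence in filter_recordings exists because the filter matches on the first column, while segments keeps the recording-id in the second. The sketch below shows that the round trip gives the same result as filtering on column 2 directly. It is only an illustration: filter_on_field is a hypothetical awk stand-in, not the real utils/filter_scp.pl, the data is invented, and wav.scp is used directly as the id list for brevity.

```shell
export LC_ALL=C
demo_dir=$(mktemp -d)

# Made-up data: rec2 has a segment but no entry in wav.scp.
printf 'rec1_0001 rec1 0.00 1.50\nrec2_0001 rec2 0.00 2.00\n' > "$demo_dir/segments"
printf 'rec1 /path/to/rec1.wav\n' > "$demo_dir/wav.scp"

# Hypothetical stand-in for the filtering step (NOT the real filter_scp.pl):
# print the lines of <input> whose <field>-th column occurs as a first
# column in <id-list>.
filter_on_field() {  # usage: filter_on_field <id-list> <input> <field>
  awk -v list="$1" -v f="$3" '
    BEGIN { while ((getline line < list) > 0) { if (split(line, a, " ") > 0) keep[a[1]] = 1 } }
    ($f in keep)' "$2"
}

# Route 1, as in filter_recordings: swap columns, filter on column 1, swap back.
awk '{print $2, $1, $3, $4}' "$demo_dir/segments" > "$demo_dir/swapped"
filter_on_field "$demo_dir/wav.scp" "$demo_dir/swapped" 1 |
  awk '{print $2, $1, $3, $4}' > "$demo_dir/by_swap"

# Route 2: filter on column 2 directly.
filter_on_field "$demo_dir/wav.scp" "$demo_dir/segments" 2 > "$demo_dir/direct"

by_swap=$(cat "$demo_dir/by_swap")
direct=$(cat "$demo_dir/direct")
echo "$by_swap"
rm -r "$demo_dir"
```

Both routes keep only the rec1 segment, since rec2 has no audio entry.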

filter_file:
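It takes two input files, filter and file_to_filter. It deletes the lines of file_to_filter whose first field (for example the utterance-id) does not appear in filter, and prints the line counts before and after whenever they differ.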

function filter_file {
  filter=$1
  file_to_filter=$2
  cp $file_to_filter ${file_to_filter}.tmp
  utils/filter_scp.pl $filter ${file_to_filter}.tmp > $file_to_filter
  if ! cmp ${file_to_filter}.tmp  $file_to_filter >&/dev/null; then
    length1=$(cat ${file_to_filter}.tmp | wc -l)
    length2=$(cat ${file_to_filter} | wc -l)
    if [ $length1 -ne $length2 ]; then
      echo "$0: filtered $file_to_filter from $length1 to $length2 lines based on filter $filter."
    fi
  fi
  rm $file_to_filter.tmp
}
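
As a rough illustration of what filter_file does to a file, the sketch below filters a small cmvn.scp against a speaker list and reports the change in line count. filter_file_demo is a hypothetical name, awk stands in for utils/filter_scp.pl, and all file contents are invented.

```shell
demo_dir=$(mktemp -d)

# Made-up data: spk3 has a cmvn.scp entry but is not in the speaker list.
printf 'spk1\nspk2\n' > "$demo_dir/speakers"
printf 'spk1 cmvn.ark:10\nspk2 cmvn.ark:250\nspk3 cmvn.ark:490\n' > "$demo_dir/cmvn.scp"

# Hypothetical stand-in for filter_file; awk replaces utils/filter_scp.pl.
filter_file_demo() {
  filter=$1
  file_to_filter=$2
  cp "$file_to_filter" "$file_to_filter.tmp"
  # Keep only the lines whose first field is listed in the filter file.
  awk -v list="$filter" '
    BEGIN { while ((getline line < list) > 0) { if (split(line, a, " ") > 0) keep[a[1]] = 1 } }
    ($1 in keep)' "$file_to_filter.tmp" > "$file_to_filter"
  length1=$(awk 'END { print NR }' "$file_to_filter.tmp")
  length2=$(awk 'END { print NR }' "$file_to_filter")
  if [ "$length1" -ne "$length2" ]; then
    echo "filtered $file_to_filter from $length1 to $length2 lines based on filter $filter."
  fi
  rm "$file_to_filter.tmp"
}

filter_file_demo "$demo_dir/speakers" "$demo_dir/cmvn.scp"
remaining=$(awk '{print $1}' "$demo_dir/cmvn.scp" | paste -s -d ' ' -)
echo "$remaining"
rm -r "$demo_dir"
```

Note the argument order: the first argument is the list of ids to keep, the second is the file that gets rewritten in place.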

filter_speakers:
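Throughout the program, utt2spk is treated as the primary file and spk2utt as derived from it, so utt2spk_to_spk2utt.pl is used to generate spk2utt from utt2spk.
The main job of this function is to make the speaker lists in utt2spk/spk2utt, cmvn.scp and spk2gender consistent, so that only speakers present in all of them are kept.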

function filter_speakers {
  # throughout this program, we regard utt2spk as primary and spk2utt as derived, so...
  utils/utt2spk_to_spk2utt.pl $data/utt2spk > $data/spk2utt
  # Build the speaker list, then drop any speaker that is missing from cmvn.scp or spk2gender.
  cat $data/spk2utt | awk '{print $1}' > $tmpdir/speakers
  for s in cmvn.scp spk2gender; do
    f=$data/$s
    if [ -f $f ]; then
      filter_file $f $tmpdir/speakers
    fi
  done

  filter_file $tmpdir/speakers $data/spk2utt
  utils/spk2utt_to_utt2spk.pl $data/spk2utt > $data/utt2spk

  for s in cmvn.scp spk2gender $spk_extra_files; do
    f=$data/$s
    if [ -f $f ]; then
      filter_file $tmpdir/speakers $f
    fi
  done
}
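
To show why spk2utt can be treated as derived from utt2spk, here is a small sketch of the two conversions written in awk. These one-liners are assumed equivalents for illustration, not the actual utt2spk_to_spk2utt.pl and spk2utt_to_utt2spk.pl scripts, and the data is invented.

```shell
export LC_ALL=C
demo_dir=$(mktemp -d)

# Made-up utt2spk: "<utterance-id> <speaker-id>" per line.
printf 'spk1_utt1 spk1\nspk1_utt2 spk1\nspk2_utt1 spk2\n' > "$demo_dir/utt2spk"

# utt2spk -> spk2utt: collect each speaker's utterances on one line,
# keeping speakers in order of first appearance.
awk '{ if ($2 in utts) { utts[$2] = utts[$2] " " $1 }
       else { order[++n] = $2; utts[$2] = $1 } }
     END { for (i = 1; i <= n; i++) print order[i], utts[order[i]] }' \
  "$demo_dir/utt2spk" > "$demo_dir/spk2utt"

# spk2utt -> utt2spk: one "<utterance-id> <speaker-id>" line per utterance.
awk '{ for (i = 2; i <= NF; i++) print $i, $1 }' \
  "$demo_dir/spk2utt" > "$demo_dir/utt2spk.roundtrip"

spk2utt=$(cat "$demo_dir/spk2utt")
if cmp -s "$demo_dir/utt2spk" "$demo_dir/utt2spk.roundtrip"; then
  roundtrip_ok=yes
else
  roundtrip_ok=no
fi
echo "$spk2utt"
rm -r "$demo_dir"
```

Because the round trip reproduces utt2spk, filtering spk2utt by speaker and converting back, as filter_speakers does, removes exactly the utterances of the dropped speakers.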

filter_utts:
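It extracts the utterance-ids that all of the files have in common, then deletes from each file the lines whose utterance-id is not in that common set.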

function filter_utts {
  cat $data/utt2spk | awk '{print $1}' > $tmpdir/utts
  # Check that utt2spk is already sorted.
  ! cat $data/utt2spk | sort | cmp - $data/utt2spk && \
    echo "utt2spk is not in sorted order (fix this yourself)" && exit 1;
  # Check that utt2spk is also unchanged when sorted on the second column (speaker-id).
  ! cat $data/utt2spk | sort -k2 | cmp - $data/utt2spk && \
    echo "utt2spk is not in sorted order when sorted first on speaker-id " && \
    echo "(fix this by making speaker-ids prefixes of utt-ids)" && exit 1;
  # Check that spk2utt is already sorted.
  ! cat $data/spk2utt | sort | cmp - $data/spk2utt && \
    echo "spk2utt is not in sorted order (fix this yourself)" && exit 1;
  
  if [ -f $data/utt2uniq ]; then
    ! cat $data/utt2uniq | sort | cmp - $data/utt2uniq && \
      echo "utt2uniq is not in sorted order (fix this yourself)" && exit 1;
  fi

  maybe_wav=
  maybe_reco2dur=
  [ ! -f $data/segments ] && maybe_wav=wav.scp # wav indexed by utts only if segments does not exist.
  [ -s $data/reco2dur ] && [ ! -f $data/segments ] && maybe_reco2dur=reco2dur # reco2dur indexed by utts

  maybe_utt2dur=
  if [ -f $data/utt2dur ]; then
    cat $data/utt2dur | \
      awk '{ if (NF == 2 && $2 > 0) { print }}' > $data/utt2dur.ok || exit 1
    maybe_utt2dur=utt2dur.ok
  fi

  maybe_utt2num_frames=
  if [ -f $data/utt2num_frames ]; then
    cat $data/utt2num_frames | \
      awk '{ if (NF == 2 && $2 > 0) { print }}' > $data/utt2num_frames.ok || exit 1
    maybe_utt2num_frames=utt2num_frames.ok
  fi
  # Reduce utts to the utterance-ids shared by feats.scp, text, segments, utt2lang,
  # $maybe_wav, $maybe_utt2dur and $maybe_utt2num_frames.
  for x in feats.scp text segments utt2lang $maybe_wav $maybe_utt2dur $maybe_utt2num_frames; do
    if [ -f $data/$x ]; then
      utils/filter_scp.pl $data/$x $tmpdir/utts > $tmpdir/utts.tmp
      mv $tmpdir/utts.tmp $tmpdir/utts
    fi
  done
  rm $data/utt2dur.ok 2>/dev/null || true
  rm $data/utt2num_frames.ok 2>/dev/null || true

  [ ! -s $tmpdir/utts ] && echo "fix_data_dir.sh: no utterances remained: not proceeding further." && \
    rm $tmpdir/utts && exit 1;


  if [ -f $data/utt2spk ]; then
    new_nutts=$(cat $tmpdir/utts | wc -l)
    old_nutts=$(cat $data/utt2spk | wc -l)
    if [ $new_nutts -ne $old_nutts ]; then
      echo "fix_data_dir.sh: kept $new_nutts utterances out of $old_nutts"
    else
      echo "fix_data_dir.sh: kept all $old_nutts utterances."
    fi
  fi
  # Delete from each file the lines whose utterance-id is not in the common set.
  for x in utt2spk utt2uniq feats.scp vad.scp text segments utt2lang utt2dur utt2num_frames $maybe_wav $maybe_reco2dur $utt_extra_files; do
    if [ -f $data/$x ]; then
      cp $data/$x $data/.backup/$x
      if ! cmp -s $data/$x <( utils/filter_scp.pl $tmpdir/utts $data/$x ) ; then
        utils/filter_scp.pl $tmpdir/utts $data/.backup/$x > $data/$x
      fi
    fi
  done

}
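
The two utt2spk checks at the top of filter_utts are easy to trip over, so here is a minimal sketch of when they pass and when they fail. The file names, ids and the passes_checks helper are invented for illustration, and LC_ALL=C is assumed.

```shell
export LC_ALL=C
demo_dir=$(mktemp -d)

# Made-up utt2spk whose utterance-ids do NOT start with the speaker-id.
printf 'a_utt1 spk2\nb_utt1 spk1\n' > "$demo_dir/utt2spk_bad"
# The same utterances, renamed so the speaker-id is a prefix of the utterance-id.
printf 'spk1-b_utt1 spk1\nspk2-a_utt1 spk2\n' > "$demo_dir/utt2spk_good"

# Hypothetical helper combining the two checks from filter_utts: the file
# must be sorted as whole lines, and must stay unchanged when sorted from
# the speaker column onwards.
passes_checks() {
  sort "$1" | cmp -s - "$1" && sort -k2 "$1" | cmp -s - "$1"
}

if passes_checks "$demo_dir/utt2spk_bad"; then bad=pass; else bad=fail; fi
if passes_checks "$demo_dir/utt2spk_good"; then good=pass; else good=fail; fi
echo "bad: $bad, good: $good"
rm -r "$demo_dir"
```

The first file is sorted by utterance-id but changes order when sorted by speaker, which is the case the script reports as "fix this by making speaker-ids prefixes of utt-ids". With prefixed ids, both orderings agree.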