SAD Paper Reading Notes - INTERSPEECH 2019

Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection [1]

Published at INTERSPEECH 2019

Research Background

The 2019 Inaugural Fearless Steps Challenge - Task 1: Speech Activity Detection [2]. Link: link
The proposed method ranked first among all 27 submitted systems (1/27): DCF = 3.318% (on the evaluation dataset).

Dataset: the Fearless Steps (FS) Challenge Corpus
  • comprises three mission-critical stages of NASA's Apollo-11 mission, viz., Lift Off, Lunar Landing, and Lunar Walking
  • 30 individual synchronized analog communication channels, with multiple speakers in different locations working in real time to accomplish NASA's Apollo missions
  • most of the audio channels suffer from a wide range of issues such as high channel noise, system noise, attenuated signal bandwidth, transmission noise, cosmic noise, analog tape static noise, and noise from tape aging, with noise levels varying within each channel across time

Dataset duration: training set ~60 h; development set ~20 h 10 min; evaluation set ~20 h. Sampling rate: 8 kHz.
NOTE: The training labels provided are not ground truth; they are system outputs generated by the organizers' baseline systems.

Evaluation metric: DCF (Detection Cost Function)

$DCF(\theta) = 0.75 \cdot P_{FN}(\theta) + 0.25 \cdot P_{FP}(\theta)$

where $\theta$ is the decision threshold, which is tuned to minimize the DCF. $P_{FN}$ is the False Negative rate, i.e., the Missed Alarm Rate (MAR); $P_{FP}$ is the False Positive rate, i.e., the False Alarm Rate (FAR). The DCF is thus a weighted average of MAR and FAR, with MAR weighted more heavily than FAR.
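A minimal sample-level sketch of this metric (assuming binary 0/1 labels and per-sample scores; the official scorer computes the rates over time durations, which is equivalent at a fixed sampling rate):

```python
# Minimal sketch of the DCF metric on sample-level labels and scores.
import numpy as np

def dcf(y_true, scores, theta):
    """DCF(theta) = 0.75 * P_FN(theta) + 0.25 * P_FP(theta)."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(scores) >= theta).astype(int)
    p_fn = np.mean(y_hat[y_true == 1] == 0)   # missed speech / total speech
    p_fp = np.mean(y_hat[y_true == 0] == 1)   # false alarms / total non-speech
    return 0.75 * p_fn + 0.25 * p_fp

def min_dcf(y_true, scores, thetas=np.linspace(0.01, 0.99, 99)):
    # the threshold is swept to report the minimum achievable DCF
    return min(dcf(y_true, scores, t) for t in thetas)
```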

Method: 2D CRNN

Basic idea: take the 2D time-frequency spectrogram of the audio signal as the network input, treating SAD as a 2D, multi-label image classification problem.
Input: the audio is split into 1 s segments (8000 samples each); an STFT spectrogram is extracted from each segment with an FFT size of 256 and a hop length of 64, giving an input feature of dimension 129 × 126.
Output: 8000 × 1; the output layer defined in the source code has 8000 units, one label per input sample.
Model architecture: 5 Conv2D layers + 2 bidirectional GRU layers + an FNN output layer (see the table and the model sketch below).
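A short sketch of how such an input could be computed; librosa and the log compression are assumptions here, as the notes only fix the FFT size and hop length:

```python
# Sketch: one 1 s segment -> 129 x 126 STFT magnitude spectrogram.
# librosa and the log compression are assumptions, not the authors' exact code.
import numpy as np
import librosa

sr = 8000
segment = np.random.randn(sr).astype(np.float32)   # stand-in for a 1 s audio segment
spec = np.abs(librosa.stft(segment, n_fft=256, hop_length=64))
print(spec.shape)                                  # (129, 126): 256/2+1 bins x 8000/64+1 frames
log_spec = np.log1p(spec)                          # log compression for dynamic range
```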

| layer | 2D-CRNN | params | output dimensions |
| --- | --- | --- | --- |
| 0 | input (STFT) | - | 129 × 126 |
| 1 | Conv2D | filters: 7 × 7, 16, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 65 × 126 × 16 |
| 2 | Conv2D | filters: 5 × 5, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 33 × 126 × 32 |
| 3 | Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 17 × 126 × 32 |
| 4 | Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 9 × 126 × 32 |
| 5 | Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 5 × 126 × 32 |
| - | flatten by frame | - | 126 × 160 |
| 6 | Bi-GRU | 126 units, return_sequences=True | 126 × 252 |
| 7 | Bi-GRU | 126 units, return_sequences=False | 252 × 1 |
| 8 | output (FNN) | 8000 units, 'sigmoid' | 8000 × 1 |
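The following Keras sketch reproduces the layer sizes from the table; the ReLU activations, optimizer, and other training details are assumptions, not taken from the paper:

```python
# 2D CRNN sketch following the table above. ReLU activations and the Adam
# optimizer are assumptions; layer sizes match the table.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_2d_crnn(freq_bins=129, time_steps=126, n_samples=8000):
    inp = layers.Input(shape=(freq_bins, time_steps, 1))          # STFT image
    x = inp
    for kernel, filters in [(7, 16), (5, 32), (3, 32), (3, 32), (3, 32)]:
        x = layers.Conv2D(filters, kernel, padding="same", activation="relu")(x)
        # pool along frequency only (stride 2), keeping all 126 time frames
        x = layers.MaxPooling2D(pool_size=3, strides=(2, 1), padding="same")(x)
    # (batch, 5, 126, 32) -> per-frame feature vectors (batch, 126, 160)
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((time_steps, -1))(x)
    x = layers.Bidirectional(layers.GRU(126, return_sequences=True))(x)   # (126, 252)
    x = layers.Bidirectional(layers.GRU(126, return_sequences=False))(x)  # (252,)
    out = layers.Dense(n_samples, activation="sigmoid")(x)        # per-sample labels
    return models.Model(inp, out)

model = build_2d_crnn()
model.compile(optimizer="adam", loss="binary_crossentropy")
```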

Figure: 2D CRNN for SAD

Experimental Results

Performance comparison on the development set:
Figure: Performance
Figure: 1D CRNN
Compared methods:

  • 1D CRNN: raw waveform + 1D CRNN
  • 2D CRNN (STFT spec. image): proposed
  • 2D CRNN (MFCC image): 20 double-delta coefficients (including the 0th energy coefficient), FFT size 2048, hop length 512, giving MFCC features of dimension 20 × 16
  • MFCC RNN [3]: 2 GRU layers
  • Google VAD [4] (mode=0): WebRTC (see the sketch after this list)
  • Challenge baseline [2]: GMM-based
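For reference, the WebRTC baseline can be run through the py-webrtcvad Python binding (the binding is an assumption; the paper cites only the WebRTC project itself):

```python
# Running the WebRTC VAD at mode=0 on 8 kHz audio via py-webrtcvad.
import webrtcvad

vad = webrtcvad.Vad(0)              # mode=0: least aggressive of modes 0-3
# the VAD consumes 10/20/30 ms frames of 16-bit mono PCM bytes
frame = b"\x00\x00" * 80            # one 10 ms frame at 8 kHz (pure silence)
print(vad.is_speech(frame, sample_rate=8000))   # -> False
```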

Multi-level Adaptive Speech Activity Detector for Speech in Naturalistic Environments [5]

Published at INTERSPEECH 2019
Same challenge as above; this method ranked seventh among the 27 submissions (7/27): DCF = 7.35% (on the evaluation dataset)
Code link: link

Method

Basic idea: a signal-processing approach that extracts features from the speech signal for multi-level decisions and adaptively adjusts the decision thresholds.

  • the noisy speech is first denoised with spectral subtraction (a minimal sketch follows this list)
  • two features serve as evidence: the modulation spectrum and the Hilbert envelope of the LP residual, used in a two-stage decision
  • a Q-factor computed from short-time energy reflects the speech/non-speech ratio of a recording segment; the threshold is set according to the Q-factor
    Figure: SAD based on signal processing
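A minimal sketch of the spectral-subtraction front end (my illustration: the leading-frames noise estimate and librosa are assumptions; the paper's exact noise tracking is not reproduced):

```python
# Toy magnitude spectral subtraction; noise is estimated from the first frames.
import numpy as np
import librosa

def spectral_subtract(y, n_fft=256, hop=64, noise_frames=10):
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(S), np.angle(S)
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # assumed noise-only lead-in
    clean = np.maximum(mag - noise, 0.0)                       # subtract and floor at zero
    return librosa.istft(clean * np.exp(1j * phase), hop_length=hop)
```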
Experimental Results and Performance Comparison

Figure: Results

Compared methods:

  • Energy-based: G.729 [6]
  • Statistical model-based: GMM, Sohn 1999 [7]
  • Self-adaptive [8]
  • Spectral subtraction + energy detection (a toy illustration follows this list)
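A toy version of the "spectral subtraction + energy detection" idea; the frame length and the relative threshold are my choices for illustration, not values from the paper:

```python
# Toy frame-level energy detector; pairs with spectral_subtract() above.
import numpy as np

def energy_vad(y, frame_len=200, rel_threshold_db=-35.0):  # 25 ms frames at 8 kHz
    n = len(y) // frame_len
    frames = y[: n * frame_len].reshape(n, frame_len)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    # mark frames within 35 dB of the loudest frame as speech
    return energy_db > energy_db.max() + rel_threshold_db
```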

References

[1] A. Vafeiadis, E. Fanioudakis, I. Potamitis, et al., "Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection," Proc. Interspeech 2019, 2019.

[2] J. H. Hansen, A. Joglekar, M. Chandra Shekhar, V. Kothapally, C. Yu, L. Kaushik, and A. Sangwan, "The 2019 inaugural fearless steps challenge: A giant leap for naturalistic audio," Proc. Interspeech 2019, 2019.

[3] T. Hughes and K. Mierle, "Recurrent neural networks for voice activity detection," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7378-7382.

[4] "Google WebRTC," 2016. [Online]. Available: https://webrtc.org/

[5] B. Sharma, R. K. Das, and H. Li, "Multi-level Adaptive Speech Activity Detector for Speech in Naturalistic Environments," Proc. Interspeech 2019, 2019.

[6] A. Benyassine, E. Shlomot, H. Y. Su, D. Massaloux, C. Lamblin, and J. P. Petit, "ITU-T recommendation G.729 annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications," IEEE Communications Magazine, vol. 35, no. 9, pp. 64-73, Sep. 1997.

[7] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1-3, Jan. 1999.

[8] T. Kinnunen and P. Rajan, "A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data," in ICASSP, May 2013, pp. 7229-7233.
