SAD Paper Reading Notes - INTERSPEECH 2019

Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection [1]

Published at INTERSPEECH 2019

Research Background

The 2019 Inaugural Fearless Steps Challenge - Task 1: Speech Activity Detection [2]. Link: link
The proposed method ranked first among all 27 submitted systems (1/27): DCF = 3.318% (on the evaluation dataset).

Dataset: the Fearless Steps (FS) Challenge Corpus
  • comprises three mission-critical stages of NASA's Apollo-11 mission, viz., Lift Off, Lunar Landing, and Lunar Walking
  • 30 individual synchronized analog communication channels, with multiple speakers in different locations working in real time to accomplish NASA's Apollo missions
  • most of the audio channels suffer from a wide range of issues such as high channel noise, system noise, attenuated signal bandwidth, transmission noise, cosmic noise, analog tape static noise, and noise from tape aging, with noise levels varying within each channel across time

Dataset duration: training set ~60 h; development set ~20 h 10 min; evaluation set ~20 h. Sampling rate: 8 kHz.
NOTE: The training labels provided are not ground truth; they are system outputs generated by the organizers' baseline systems.

Evaluation metric: DCF (Detection Cost Function)

$DCF(\theta) = 0.75 \cdot P_{FN}(\theta) + 0.25 \cdot P_{FP}(\theta)$

where $\theta$ is the decision threshold, which is tuned to minimize the DCF. $P_{FN}$ is the False Negative rate, i.e., the Missed Alarm Rate (MAR); $P_{FP}$ is the False Positive rate, i.e., the False Alarm Rate (FAR). The DCF is thus a weighted average of MAR and FAR, with MAR weighted more heavily than FAR.
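A minimal sample-level sketch of this metric (assuming binary 0/1 labels and per-sample scores; the official scorer computes the rates over time durations, which is equivalent at a fixed sampling rate):

```python
# Minimal sketch of the DCF metric on sample-level labels and scores.
import numpy as np

def dcf(y_true, scores, theta):
    """DCF(theta) = 0.75 * P_FN(theta) + 0.25 * P_FP(theta)."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(scores) >= theta).astype(int)
    p_fn = np.mean(y_hat[y_true == 1] == 0)   # missed speech / total speech
    p_fp = np.mean(y_hat[y_true == 0] == 1)   # false alarms / total non-speech
    return 0.75 * p_fn + 0.25 * p_fp

def min_dcf(y_true, scores, thetas=np.linspace(0.01, 0.99, 99)):
    # the threshold is swept to report the minimum achievable DCF
    return min(dcf(y_true, scores, t) for t in thetas)
```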

Method: 2D CRNN

Basic idea: take the 2D time-frequency spectrogram of the audio signal as the network input, treating SAD as a 2D, multi-label image classification problem.
Input: the audio is split into 1 s segments (8000 samples each); an STFT spectrogram is extracted from each segment with an FFT size of 256 and a hop length of 64, giving an input feature of dimension 129 × 126.
Output: 8000 × 1; the output layer defined in the source code has 8000 units, one label per input sample.
Model architecture: 5 Conv2D layers + 2 bidirectional GRU layers + an FNN output layer (see the table and the model sketch below).
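A short sketch of how such an input could be computed; librosa and the log compression are assumptions here, as the notes only fix the FFT size and hop length:

```python
# Sketch: one 1 s segment -> 129 x 126 STFT magnitude spectrogram.
# librosa and the log compression are assumptions, not the authors' exact code.
import numpy as np
import librosa

sr = 8000
segment = np.random.randn(sr).astype(np.float32)   # stand-in for a 1 s audio segment
spec = np.abs(librosa.stft(segment, n_fft=256, hop_length=64))
print(spec.shape)                                  # (129, 126): 256/2+1 bins x 8000/64+1 frames
log_spec = np.log1p(spec)                          # log compression for dynamic range
```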

| layer | 2D-CRNN | params | output dimensions |
| --- | --- | --- | --- |
| 0 | input (STFT) | - | 129 × 126 |
| 1 | Conv2D | filters: 7 × 7, 16, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 65 × 126 × 16 |
| 2 | Conv2D | filters: 5 × 5, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 33 × 126 × 32 |
| 3 | Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 17 × 126 × 32 |
| 4 | Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 9 × 126 × 32 |
| 5 | Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 5 × 126 × 32 |
| - | flatten by frame | - | 126 × 160 |
| 6 | Bi-GRU | 126 units, return_sequences=True | 126 × 252 |
| 7 | Bi-GRU | 126 units, return_sequences=False | 252 × 1 |
| 8 | output (FNN) | 8000 units, 'sigmoid' | 8000 × 1 |
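The following Keras sketch reproduces the layer sizes from the table; the ReLU activations, optimizer, and other training details are assumptions, not taken from the paper:

```python
# 2D CRNN sketch following the table above. ReLU activations and the Adam
# optimizer are assumptions; layer sizes match the table.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_2d_crnn(freq_bins=129, time_steps=126, n_samples=8000):
    inp = layers.Input(shape=(freq_bins, time_steps, 1))          # STFT image
    x = inp
    for kernel, filters in [(7, 16), (5, 32), (3, 32), (3, 32), (3, 32)]:
        x = layers.Conv2D(filters, kernel, padding="same", activation="relu")(x)
        # pool along frequency only (stride 2), keeping all 126 time frames
        x = layers.MaxPooling2D(pool_size=3, strides=(2, 1), padding="same")(x)
    # (batch, 5, 126, 32) -> per-frame feature vectors (batch, 126, 160)
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((time_steps, -1))(x)
    x = layers.Bidirectional(layers.GRU(126, return_sequences=True))(x)   # (126, 252)
    x = layers.Bidirectional(layers.GRU(126, return_sequences=False))(x)  # (252,)
    out = layers.Dense(n_samples, activation="sigmoid")(x)        # per-sample labels
    return models.Model(inp, out)

model = build_2d_crnn()
model.compile(optimizer="adam", loss="binary_crossentropy")
```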

Figure: 2D CRNN for SAD

Experimental Results

Performance comparison on the development set:
Figure: Performance
Figure: 1D CRNN
Compared methods:

  • 1D CRNN: raw waveform + 1D CRNN
  • 2D CRNN (STFT spec. image): proposed
  • 2D CRNN (MFCC image): 20 double-delta coefficients (including the 0th energy coefficient), FFT size 2048, hop length 512, giving MFCC features of dimension 20 × 16
  • MFCC RNN [3]: 2 GRU layers
  • Google VAD [4] (mode=0): WebRTC (see the sketch after this list)
  • Challenge baseline [2]: GMM-based
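For reference, the WebRTC baseline can be run through the py-webrtcvad Python binding (the binding is an assumption; the paper cites only the WebRTC project itself):

```python
# Running the WebRTC VAD at mode=0 on 8 kHz audio via py-webrtcvad.
import webrtcvad

vad = webrtcvad.Vad(0)              # mode=0: least aggressive of modes 0-3
# the VAD consumes 10/20/30 ms frames of 16-bit mono PCM bytes
frame = b"\x00\x00" * 80            # one 10 ms frame at 8 kHz (pure silence)
print(vad.is_speech(frame, sample_rate=8000))   # -> False
```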

Multi-level Adaptive Speech Activity Detector for Speech in Naturalistic Environments [5]

Published at INTERSPEECH 2019
Same challenge as above; this method ranked seventh among the 27 submissions (7/27): DCF = 7.35% (on the evaluation dataset)
Code link: link

Method

Basic idea: a signal-processing approach that extracts features from the speech signal for multi-level decisions and adaptively adjusts the decision thresholds.

  • the noisy speech is first denoised with spectral subtraction (a minimal sketch follows this list)
  • two features serve as evidence: the modulation spectrum and the Hilbert envelope of the LP residual, used in a two-stage decision
  • a Q-factor computed from short-time energy reflects the speech/non-speech ratio of a recording segment; the threshold is set according to the Q-factor
    Figure: SAD based on signal processing
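A minimal sketch of the spectral-subtraction front end (my illustration: the leading-frames noise estimate and librosa are assumptions; the paper's exact noise tracking is not reproduced):

```python
# Toy magnitude spectral subtraction; noise is estimated from the first frames.
import numpy as np
import librosa

def spectral_subtract(y, n_fft=256, hop=64, noise_frames=10):
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(S), np.angle(S)
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # assumed noise-only lead-in
    clean = np.maximum(mag - noise, 0.0)                       # subtract and floor at zero
    return librosa.istft(clean * np.exp(1j * phase), hop_length=hop)
```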
Experimental Results and Performance Comparison

Figure: Results

Compared methods:

  • Energy-based: G.729 [6]
  • Statistical model-based: GMM, Sohn 1999 [7]
  • Self-adaptive [8]
  • Spectral subtraction + energy detection (a toy illustration follows this list)
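A toy version of the "spectral subtraction + energy detection" idea; the frame length and the relative threshold are my choices for illustration, not values from the paper:

```python
# Toy frame-level energy detector; pairs with spectral_subtract() above.
import numpy as np

def energy_vad(y, frame_len=200, rel_threshold_db=-35.0):  # 25 ms frames at 8 kHz
    n = len(y) // frame_len
    frames = y[: n * frame_len].reshape(n, frame_len)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    # mark frames within 35 dB of the loudest frame as speech
    return energy_db > energy_db.max() + rel_threshold_db
```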

References

[1] A. Vafeiadis, E. Fanioudakis, I. Potamitis, et al., "Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection," Proc. Interspeech 2019, 2019.

[2] J. H. Hansen, A. Joglekar, M. Chandra Shekhar, V. Kothapally, C. Yu, L. Kaushik, and A. Sangwan, "The 2019 inaugural fearless steps challenge: A giant leap for naturalistic audio," Proc. Interspeech 2019, 2019.

[3] T. Hughes and K. Mierle, "Recurrent neural networks for voice activity detection," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7378-7382.

[4] "Google WebRTC," 2016. [Online]. Available: https://webrtc.org/

[5] B. Sharma, R. K. Das, and H. Li, "Multi-level Adaptive Speech Activity Detector for Speech in Naturalistic Environments," Proc. Interspeech 2019, 2019.

[6] A. Benyassine, E. Shlomot, H. Y. Su, D. Massaloux, C. Lamblin, and J. P. Petit, "ITU-T recommendation G.729 annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications," IEEE Communications Magazine, vol. 35, no. 9, pp. 64-73, Sep. 1997.

[7] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1-3, Jan. 1999.

[8] T. Kinnunen and P. Rajan, "A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data," in ICASSP, May 2013, pp. 7229-7233.
