Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection [1]
Published at INTERSPEECH 2019
Background
The 2019 Inaugural Fearless Steps Challenge - Task 1: Speech Activity Detection [2]. Link: link
This paper's system ranked first among all 27 submitted systems (1/27): DCF = 3.318% (on the evaluation dataset)
Dataset: the Fearless Steps (FS) Challenge Corpus
- comprised of three mission-critical stages of NASA's Apollo-11 mission, viz., Lift Off, Lunar Landing, and Lunar Walking
- 30 individual synchronized analog communications channels with multiple speakers in different locations working in real time to accomplish NASA's Apollo missions
- most of the audio channels suffer from a wide range of issues such as high channel noise, system noise, attenuated signal bandwidth, transmission noise, cosmic noise, analog tape static noise, and noise from tape aging, with noise levels varying within each channel across time
Dataset durations: training set ~60 h; development set ~20 h 10 min; evaluation set ~20 h. Sampling rate: 8 kHz
NOTE: The training labels provided are not ground truth; they are system outputs generated by the challenge's baseline systems.
Evaluation metric: DCF (Detection Cost Function)

DCF(θ) = 0.75 × P_FN(θ) + 0.25 × P_FP(θ)

where θ is the decision threshold, which is tuned so that the DCF is minimized. P_FN(θ) is the False Negative rate, i.e., the Missed Alarm Rate (MAR); P_FP(θ) is the False Positive rate, i.e., the False Alarm Rate (FAR). The DCF is therefore a weighted average of FAR and MAR, with MAR weighted more heavily than FAR.
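The metric can be sketched in a few lines (an illustrative sketch; the function and variable names are mine, not from the challenge toolkit):

```python
import numpy as np

def dcf(y_true, y_score, theta):
    """Fearless Steps DCF: 0.75 * P_FN + 0.25 * P_FP at threshold theta."""
    y_pred = y_score >= theta
    speech = y_true == 1
    p_fn = np.mean(~y_pred[speech])    # missed speech frames (MAR)
    p_fp = np.mean(y_pred[~speech])    # false alarms on non-speech frames (FAR)
    return 0.75 * p_fn + 0.25 * p_fp

# sweep the threshold and keep the minimizing one, as the challenge requires
y_true = np.array([0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.6, 0.4, 0.8, 0.9])
best_dcf, best_theta = min((dcf(y_true, y_score, t), t) for t in np.linspace(0, 1, 101))
```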
Method: 2D-CRNN
Core idea: use the 2D time-frequency spectrogram of the audio signal as the network input, and treat SAD as a 2D, multi-label image classification problem
Input: the audio is split into 1 s segments (8000 samples each at 8 kHz); for each segment an STFT spectrogram is computed with 256 FFT points and a hop size of 64, giving an input feature map of size 129 × 126
Output: 8000 × 1. The output layer defined in the source code has 8000 units, one label per input sample
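As a sanity check on the stated feature size, here is a sketch using `scipy.signal.stft`, whose default zero-padded boundary handling happens to reproduce the 129 × 126 shape (the paper's exact framing code may differ):

```python
import numpy as np
from scipy.signal import stft

x = np.random.randn(8000)            # one 1-second segment at 8 kHz
# 256-point FFT, hop 64 (noverlap = 256 - 64 = 192)
f, t, Zxx = stft(x, fs=8000, nperseg=256, noverlap=192)
spec = np.abs(Zxx)                   # magnitude spectrogram
print(spec.shape)                    # (129, 126): 256/2 + 1 bins x 126 frames
```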
Model structure: 5 Conv2D layers + 2 bidirectional GRU layers + an FNN output layer
layer | 2D-CRNN params | dimensions |
---|---|---|
0: input, STFT | - | 129 × 126 |
1: Conv2D | filters: 7 × 7, 16, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 65 × 126 × 16 |
2: Conv2D | filters: 5 × 5, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 33 × 126 × 32 |
3: Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 17 × 126 × 32 |
4: Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 9 × 126 × 32 |
5: Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 5 × 126 × 32 |
flatten by frame | - | 126 × 160 |
6: Bi-GRU | 126 units, return_sequences=True | 126 × 252 |
7: Bi-GRU | 126 units, return_sequences=False | 252 × 1 |
8: output, FNN | 8000 units, 'sigmoid' | 8000 × 1 |
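The frequency-axis halving in the table follows from 'same' pooling with stride 2 along frequency and stride 1 along time; the per-frame flatten then yields 5 × 32 = 160 features per time step. A quick check of the progression:

```python
import math

freq, time = 129, 126
dims = []
for filters in (16, 32, 32, 32, 32):
    # Conv2D with padding='same' preserves (freq, time);
    # max-pooling with stride (2, 1) halves freq (ceiling) and keeps time
    freq = math.ceil(freq / 2)
    dims.append((freq, time, filters))
print(dims)  # [(65,126,16), (33,126,32), (17,126,32), (9,126,32), (5,126,32)]

flat = (time, freq * 32)  # flatten per frame before the recurrent layers
print(flat)  # (126, 160)
```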
Experimental results
Performance comparison on the development set:
Comparison methods:
- 1D CRNN: raw waveform + 1D CRNN
- 2D CRNN (STFT spec. image) - proposed
- 2D CRNN (MFCC image): 20 double-delta coefficients (including the 0th energy coefficient), 2048 FFT points, hop size 512, giving MFCC features of size 20 × 16
- MFCC RNN [3]: 2-layer GRU
- Google VAD [4] (mode=0): WebRTC
- Challenge baseline [2]: GMM-based
Multi-level Adaptive Speech Activity Detector for Speech in Naturalistic Environments [5]
Published at INTERSPEECH 2019
Same challenge as above; this system ranked seventh (7/27): DCF = 7.35% (on the evaluation dataset)
Code link: link
Method
Core idea: based on signal-processing techniques, extract features from the speech signal for multi-level decisions, and adapt the decision thresholds automatically
- The noisy speech signal is first denoised with spectral subtraction
- Two features are proposed as evidence: the modulation spectrum, and the Hilbert envelope of the LP residual; they drive a two-stage decision
- A Q-factor is computed from short-time energy; it reflects the speech/non-speech proportion of a recording, and the decision threshold is set according to the Q-factor
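A minimal sketch of the spectral-subtraction front end (illustrative only; the paper's exact noise-estimation scheme and parameters are not given here, so the leading-frames noise estimate and the STFT settings are my assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs=8000, noise_frames=10):
    # STFT analysis
    _, _, Z = stft(x, fs=fs, nperseg=256, noverlap=192)
    mag, phase = np.abs(Z), np.angle(Z)
    # crude noise estimate: mean magnitude of the leading (assumed non-speech) frames
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # subtract the noise floor and half-wave rectify
    clean = np.maximum(mag - noise, 0.0)
    # resynthesize with the original phase
    _, y = istft(clean * np.exp(1j * phase), fs=fs, nperseg=256, noverlap=192)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)      # stationary noise only, for demonstration
y = spectral_subtraction(x)        # energy drops once the noise floor is removed
```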
Experimental results and performance comparison
Comparison methods:
- Energy-based: G.729 [6]
- Statistical model-based: GMM, Sohn 1999 [7]
- Self-adaptive [8]
- Spectral subtraction + energy detection
References
[1] A. Vafeiadis, E. Fanioudakis, I. Potamitis, et al., "Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection," Proc. Interspeech 2019, 2019.
[2] J. H. Hansen, A. Joglekar, M. Chandra Shekhar, V. Kothapally, C. Yu, L. Kaushik, and A. Sangwan, "The 2019 inaugural Fearless Steps challenge: A giant leap for naturalistic audio," Proc. Interspeech 2019, 2019.
[3] T. Hughes and K. Mierle, "Recurrent neural networks for voice activity detection," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2013, pp. 7378–7382.
[4] "Google WebRTC," 2016. [Online]. Available: https://webrtc.org/
[5] B. Sharma, R. Kumar Das, and H. Li, "Multi-level Adaptive Speech Activity Detector for Speech in Naturalistic Environments," Proc. Interspeech 2019, 2019.
[6] A. Benyassine, E. Shlomot, H. Y. Su, D. Massaloux, C. Lamblin, and J. P. Petit, "ITU-T recommendation G.729 annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications," IEEE Communications Magazine, vol. 35, no. 9, pp. 64–73, Sep. 1997.
[7] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, Jan. 1999.
[8] T. Kinnunen and P. Rajan, "A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data," in ICASSP, May 2013, pp. 7229–7233.