Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection [1]
Published at INTERSPEECH 2019
Background
The 2019 Inaugural Fearless Steps Challenge - Task 1: Speech Activity Detection [2]. Link: link
This paper's system ranked first among all 27 submitted systems (1/27): DCF = 3.318% (on the evaluation dataset)
Dataset: the Fearless Steps (FS) Challenge Corpus
- comprised of three mission-critical stages of NASA's Apollo-11 mission, viz., Lift Off, Lunar Landing, and Lunar Walking
- 30 individual synchronized analog communications channels with multiple speakers in different locations working in real time to accomplish NASA's Apollo missions
- most of the audio channels suffer from a wide range of issues such as high channel noise, system noise, attenuated signal bandwidth, transmission noise, cosmic noise, analog tape static noise, and noise from tape aging, with noise levels varying within each channel across time
Dataset durations: training set ~60 h; development set ~20 h 10 min; evaluation set ~20 h. Sampling rate: 8 kHz
NOTE: The training labels provided are not ground truth; they are system outputs generated by the challenge's baseline systems.
Evaluation metric: DCF (Detection Cost Function)

DCF(θ) = 0.75 × P_FN(θ) + 0.25 × P_FP(θ)

where θ is the decision threshold, which is tuned so that the DCF is minimized. P_FN(θ) is the False Negative rate, i.e., the Missed Alarm Rate (MAR); P_FP(θ) is the False Positive rate, i.e., the False Alarm Rate (FAR). The DCF is therefore a weighted average of FAR and MAR, with MAR weighted more heavily than FAR.
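The metric can be sketched in a few lines (an illustrative sketch; the function and variable names are mine, not from the challenge toolkit):

```python
import numpy as np

def dcf(y_true, y_score, theta):
    """Fearless Steps DCF: 0.75 * P_FN + 0.25 * P_FP at threshold theta."""
    y_pred = y_score >= theta
    speech = y_true == 1
    p_fn = np.mean(~y_pred[speech])    # missed speech frames (MAR)
    p_fp = np.mean(y_pred[~speech])    # false alarms on non-speech frames (FAR)
    return 0.75 * p_fn + 0.25 * p_fp

# sweep the threshold and keep the minimizing one, as the challenge requires
y_true = np.array([0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.6, 0.4, 0.8, 0.9])
best_dcf, best_theta = min((dcf(y_true, y_score, t), t) for t in np.linspace(0, 1, 101))
```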
Method: 2D-CRNN
Core idea: use the 2D time-frequency spectrogram of the audio signal as the network input, and treat SAD as a 2D, multi-label image classification problem
Input: the audio is split into 1 s segments (8000 samples each at 8 kHz); for each segment an STFT spectrogram is computed with 256 FFT points and a hop size of 64, giving an input feature map of size 129 × 126
Output: 8000 × 1. The output layer defined in the source code has 8000 units, one label per input sample
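As a sanity check on the stated feature size, here is a sketch using `scipy.signal.stft`, whose default zero-padded boundary handling happens to reproduce the 129 × 126 shape (the paper's exact framing code may differ):

```python
import numpy as np
from scipy.signal import stft

x = np.random.randn(8000)            # one 1-second segment at 8 kHz
# 256-point FFT, hop 64 (noverlap = 256 - 64 = 192)
f, t, Zxx = stft(x, fs=8000, nperseg=256, noverlap=192)
spec = np.abs(Zxx)                   # magnitude spectrogram
print(spec.shape)                    # (129, 126): 256/2 + 1 bins x 126 frames
```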
Model structure: 5 Conv2D layers + 2 bidirectional GRU layers + an FNN output layer
layer | 2D-CRNN params | dimensions |
---|---|---|
0: input, STFT | - | 129 × 126 |
1: Conv2D | filters: 7 × 7, 16, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 65 × 126 × 16 |
2: Conv2D | filters: 5 × 5, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 33 × 126 × 32 |
3: Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 17 × 126 × 32 |
4: Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 9 × 126 × 32 |
5: Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 5 × 126 × 32 |
flatten by frame | - | 126 × 160 |
6: Bi-GRU | 126 units, return_sequences=True | 126 × 252 |
7: Bi-GRU | 126 units, return_sequences=False | 252 × 1 |
8: output, FNN | 8000 units, 'sigmoid' | 8000 × 1 |
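The frequency-axis halving in the table follows from 'same' pooling with stride 2 along frequency and stride 1 along time; the per-frame flatten then yields 5 × 32 = 160 features per time step. A quick check of the progression:

```python
import math

freq, time = 129, 126
dims = []
for filters in (16, 32, 32, 32, 32):
    # Conv2D with padding='same' preserves (freq, time);
    # max-pooling with stride (2, 1) halves freq (ceiling) and keeps time
    freq = math.ceil(freq / 2)
    dims.append((freq, time, filters))
print(dims)  # [(65,126,16), (33,126,32), (17,126,32), (9,126,32), (5,126,32)]

flat = (time, freq * 32)  # flatten per frame before the recurrent layers
print(flat)  # (126, 160)
```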
Experimental results
Performance comparison on the development set:
Comparison methods:
- 1D CRNN: raw waveform + 1D CRNN
- 2D CRNN (STFT spec. image) - proposed
- 2D CRNN (MFCC image): 20 double-delta coefficients (including the 0th energy coefficient), 2048 FFT points, hop size 512, giving MFCC features of size 20 × 16
- MFCC RNN [3]: 2-layer GRU
- Google VAD [4] (mode=0): WebRTC
- Challenge baseline [2]: GMM-based
Multi-level Adaptive Speech Activity Detector for Speech in Naturalistic Environments [5]
Published at INTERSPEECH 2019
Same challenge as above; this system ranked seventh (7/27): DCF = 7.35% (on the evaluation dataset)
Code link: link
Method
Core idea: based on signal-processing techniques, extract features from the speech signal for multi-level decisions, and adapt the decision thresholds automatically
- The noisy speech signal is first denoised with spectral subtraction
- Two features are proposed as evidence: the modulation spectrum, and the Hilbert envelope of the LP residual; they drive a two-stage decision
- A Q-factor is computed from short-time energy; it reflects the speech/non-speech proportion of a recording, and the decision threshold is set according to the Q-factor
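A minimal sketch of the spectral-subtraction front end (illustrative only; the paper's exact noise-estimation scheme and parameters are not given here, so the leading-frames noise estimate and the STFT settings are my assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs=8000, noise_frames=10):
    # STFT analysis
    _, _, Z = stft(x, fs=fs, nperseg=256, noverlap=192)
    mag, phase = np.abs(Z), np.angle(Z)
    # crude noise estimate: mean magnitude of the leading (assumed non-speech) frames
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # subtract the noise floor and half-wave rectify
    clean = np.maximum(mag - noise, 0.0)
    # resynthesize with the original phase
    _, y = istft(clean * np.exp(1j * phase), fs=fs, nperseg=256, noverlap=192)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)      # stationary noise only, for demonstration
y = spectral_subtraction(x)        # energy drops once the noise floor is removed
```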
Experimental results and performance comparison
Comparison methods:
- Energy-based: G.729 [6]
- Statistical model-based: GMM, Sohn 1999 [7]
- Self-adaptive [8]
- Spectral subtraction + energy detection
References
[1] A. Vafeiadis, E. Fanioudakis, I. Potamitis, et al., "Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection," Proc. Interspeech 2019, 2019.
[2] J. H. Hansen, A. Joglekar, M. Chandra Shekhar, V. Kothapally, C. Yu, L. Kaushik, and A. Sangwan, "The 2019 inaugural Fearless Steps challenge: A giant leap for naturalistic audio," Proc. Interspeech 2019, 2019.
[3] T. Hughes and K. Mierle, "Recurrent neural networks for voice activity detection," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2013, pp. 7378–7382.
[4] "Google WebRTC," 2016. [Online]. Available: https://webrtc.org/
[5] B. Sharma, R. Kumar Das, and H. Li, "Multi-level Adaptive Speech Activity Detector for Speech in Naturalistic Environments," Proc. Interspeech 2019, 2019.
[6] A. Benyassine, E. Shlomot, H. Y. Su, D. Massaloux, C. Lamblin, and J. P. Petit, "ITU-T recommendation G.729 annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications," IEEE Communications Magazine, vol. 35, no. 9, pp. 64–73, Sep. 1997.
[7] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, Jan. 1999.
[8] T. Kinnunen and P. Rajan, "A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data," in ICASSP, May 2013, pp. 7229–7233.