SAD Paper Reading Notes - INTERSPEECH 2019

Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection [1]

Published at INTERSPEECH 2019

Research Background

The 2019 Inaugural Fearless Steps Challenge - Task 1: Speech Activity Detection [2]. Link: link
The proposed method ranked first among all 27 submitted systems (1/27): DCF = 3.318% (on the evaluation dataset).

Dataset: the Fearless Steps (FS) Challenge Corpus
  • comprised of three mission-critical stages of NASA's Apollo-11 mission, viz., Lift Off, Lunar Landing, and Lunar Walking
  • 30 individual synchronized analog communication channels, with multiple speakers in different locations working in real time to accomplish NASA's Apollo missions
  • most of the audio channels suffer from a wide range of issues, such as high channel noise, system noise, attenuated signal bandwidth, transmission noise, cosmic noise, analog tape static noise, and noise from tape aging, with noise levels varying within each channel across time

Dataset durations: training set ~60 h; development set ~20 h 10 min; evaluation set ~20 h. Sampling rate: 8 kHz.
NOTE: the training labels provided are not ground truth; they are system outputs generated by the organizers' baseline systems.

Evaluation metric: DCF (Detection Cost Function)

$$\mathrm{DCF}(\theta) = 0.75 \cdot P_{FN}(\theta) + 0.25 \cdot P_{FP}(\theta)$$

where $\theta$ is the decision threshold, which is tuned to minimize the DCF. $P_{FN}$ is the False Negative rate, i.e., the Missed Alarm Rate (MAR); $P_{FP}$ is the False Positive rate, i.e., the False Alarm Rate (FAR). The DCF is thus a weighted average of FAR and MAR, with MAR weighted more heavily than FAR.
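Since the metric is just a weighted average of two error rates, it is easy to compute directly. A minimal sketch in Python, assuming sample-level 0/1 labels and scores (the function names and threshold grid are mine, not the challenge toolkit's):

```python
import numpy as np

def dcf(y_true, y_score, theta):
    """Detection Cost Function at threshold theta (speech = positive class)."""
    y_pred = y_score >= theta
    speech = y_true == 1
    p_fn = np.mean(~y_pred[speech])    # missed speech, i.e., MAR
    p_fp = np.mean(y_pred[~speech])    # false alarms on non-speech, i.e., FAR
    return 0.75 * p_fn + 0.25 * p_fp

def min_dcf(y_true, y_score, thresholds=np.linspace(0.0, 1.0, 101)):
    """Sweep thresholds and keep the best operating point."""
    return min(dcf(y_true, y_score, t) for t in thresholds)
```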

Method: 2D-CRNN

Basic idea: take the 2D time-frequency spectrogram of the audio signal as the network input, treating SAD as a 2D, multi-label image classification problem.
Input: the audio signal is split into 1 s segments (8000 samples each); an STFT spectrogram is extracted from each segment with 256 FFT points and a hop size of 64, yielding an input feature of size 129 × 126 (see the extraction sketch below).
Output: 8000 × 1; the source code defines an output layer of 8000 units, one label per input sample.
Model architecture: 5 Conv2D layers + 2 bidirectional GRU layers + an FNN output layer.
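A minimal feature-extraction sketch with SciPy; with its default zero-padding the frame count works out to exactly 8000/64 + 1 = 126 frames, with 256/2 + 1 = 129 frequency bins (the paper's exact framing and magnitude/log convention is my assumption):

```python
import numpy as np
from scipy.signal import stft

def segment_spectrogram(wav_1s, fs=8000):
    """Log-magnitude STFT spectrogram of one 1 s segment (8000 samples)."""
    # 256 FFT points, hop = nperseg - noverlap = 64 samples
    f, t, Z = stft(wav_1s, fs=fs, nperseg=256, noverlap=256 - 64)
    return np.log1p(np.abs(Z))  # shape (129, 126)
```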

Layer | Type | Parameters | Output dimensions
0 | Input (STFT) | - | 129 × 126
1 | Conv2D | filters: 7 × 7, 16, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 65 × 126 × 16
2 | Conv2D | filters: 5 × 5, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 33 × 126 × 32
3 | Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 17 × 126 × 32
4 | Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 9 × 126 × 32
5 | Conv2D | filters: 3 × 3, 32, padding='same'; max-pooling: 3 × 3, stride=(2,1) | 5 × 126 × 32
- | Flatten by frame | - | 126 × 160
6 | Bi-GRU | 126 units, return_sequences=True | 126 × 252
7 | Bi-GRU | 126 units, return_sequences=False | 252 × 1
8 | Output (FNN) | 8000 units, sigmoid | 8000 × 1
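The table maps almost line-for-line onto a Keras model. A sketch under stated assumptions: the ReLU activations, 'same' pooling padding, and frequency-first input layout are my choices, and the authors' released code may differ in these details:

```python
from tensorflow.keras import layers, models

def build_2dcrnn(freq_bins=129, frames=126):
    inp = layers.Input(shape=(freq_bins, frames, 1))  # STFT image
    x = inp
    for n_filters, k in [(16, 7), (32, 5), (32, 3), (32, 3), (32, 3)]:
        x = layers.Conv2D(n_filters, k, padding='same', activation='relu')(x)
        # 3x3 pooling with stride (2, 1): halve frequency, keep all 126 frames
        x = layers.MaxPooling2D(pool_size=3, strides=(2, 1), padding='same')(x)
    # "flatten by frame": (5, 126, 32) -> (126, 5 * 32) = (126, 160)
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((frames, 5 * 32))(x)
    x = layers.Bidirectional(layers.GRU(126, return_sequences=True))(x)   # (126, 252)
    x = layers.Bidirectional(layers.GRU(126, return_sequences=False))(x)  # (252,)
    out = layers.Dense(8000, activation='sigmoid')(x)  # one label per input sample
    return models.Model(inp, out)

model = build_2dcrnn()
model.compile(optimizer='adam', loss='binary_crossentropy')
```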

[Figure: 2D-CRNN for SAD]

Experimental Results

Performance comparison on the development set:

[Figure: Performance]
[Figure: 1D CRNN]
Compared methods:

  • 1D CRNN: raw waveform + 1D CRNN
  • 2D CRNN (STFT spectrogram image): the proposed method
  • 2D CRNN (MFCC image): 20 double-delta coefficients (including the 0th energy coefficient), 2048 FFT points, hop size 512, giving an MFCC feature of size 20 × 16 (see the sketch after this list)
  • MFCC RNN [3]: 2 GRU layers
  • Google VAD [4] (mode=0): WebRTC
  • Challenge baseline [2]: GMM-based
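The MFCC-image dimensions in the third bullet check out with librosa's defaults (centered frames: ceil(8000/512) = 16). A sketch, assuming the network input is the double-delta matrix itself:

```python
import librosa

def mfcc_image(wav_1s, fs=8000):
    """20 MFCCs (incl. the 0th energy coeff.) -> double-delta image of shape (20, 16)."""
    mfcc = librosa.feature.mfcc(y=wav_1s, sr=fs, n_mfcc=20,
                                n_fft=2048, hop_length=512)
    return librosa.feature.delta(mfcc, order=2)  # double-delta coefficients
```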

Multi-level Adaptive Speech Activity Detector for Speech in Naturalistic Environments [5]

Published at INTERSPEECH 2019
Submitted to the same challenge as above; ranked seventh among the 27 systems (7/27): DCF = 7.35% (on the evaluation dataset).
Code link: link

Method

Basic idea: a signal-processing-based approach that extracts speech features as evidence for multi-level decisions and adaptively adjusts the decision thresholds.

  • The noisy speech signal is first denoised using spectral subtraction.
  • Two features are proposed as evidence: the modulation spectrum and the Hilbert envelope of the LP residual, used in a two-stage decision.
  • A Q-factor computed from short-time energy reflects the speech/non-speech ratio of a recording segment; the decision threshold is set according to the Q-factor (a toy sketch follows below).

[Figure: SAD based on signal processing]
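The paper has its own Q-factor definition and threshold schedule; the sketch below only illustrates the idea (the energy proxy and the linear threshold mapping are my assumptions, not the paper's formulas):

```python
import numpy as np

def short_time_energy(x, frame_len=200, hop=80):
    """Frame-wise energy: 25 ms frames with a 10 ms hop at 8 kHz."""
    frames = np.lib.stride_tricks.sliding_window_view(x, frame_len)[::hop]
    return np.mean(frames ** 2, axis=1)

def q_factor(energy):
    """Crude proxy for a recording's speech/non-speech ratio."""
    return float(np.mean(energy > energy.mean()))

def adaptive_threshold(q, lo=0.3, hi=0.7):
    """Hypothetical mapping: speech-heavy recordings get a lower threshold."""
    return hi - (hi - lo) * q
```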
Experimental Results and Performance Comparison

[Figure: Results]

Compared methods:

  • Energy-based: G.729 [6]
  • Statistical model-based: GMM, Sohn 1999 [7]
  • Self-adaptive [8]
  • Spectral subtraction + energy detection (a minimal front-end sketch follows below)
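For the last baseline, a textbook-style spectral-subtraction front end would look like the sketch below (estimating noise from the first half-second and the 1% spectral floor are my assumptions, not the paper's scheme):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs=8000, noise_secs=0.5):
    """Magnitude spectral subtraction with noise estimated from the leading frames."""
    f, t, Z = stft(x, fs=fs, nperseg=256)            # default hop = 128 samples
    mag, phase = np.abs(Z), np.angle(Z)
    n_noise = int(noise_secs * fs / 128)             # frames assumed noise-only
    noise_mag = mag[:, :n_noise].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, 0.01 * mag)  # spectral floor
    _, x_hat = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=256)
    return x_hat
```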

References

[1] A. Vafeiadis, E. Fanioudakis, I. Potamitis, et al., "Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection," Proc. Interspeech 2019, 2019.

[2] J. H. Hansen, A. Joglekar, M. Chandra Shekhar, V. Kothapally, C. Yu, L. Kaushik, and A. Sangwan, "The 2019 inaugural Fearless Steps challenge: A giant leap for naturalistic audio," Proc. Interspeech 2019, 2019.

[3] T. Hughes and K. Mierle, "Recurrent neural networks for voice activity detection," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7378–7382.

[4] "Google WebRTC," 2016. [Online]. Available: https://webrtc.org/

[5] B. Sharma, R. Kumar Das, and H. Li, "Multi-level Adaptive Speech Activity Detector for Speech in Naturalistic Environments," Proc. Interspeech 2019, 2019.

[6] A. Benyassine, E. Shlomot, H. Y. Su, D. Massaloux, C. Lamblin, and J. P. Petit, "ITU-T recommendation G.729 annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications," IEEE Communications Magazine, vol. 35, no. 9, pp. 64–73, Sep. 1997.

[7] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, Jan. 1999.

[8] T. Kinnunen and P. Rajan, "A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data," in ICASSP, May 2013, pp. 7229–7233.
