Category | Venue, year | Authors | Title | Features | Classifier/Decision rule | Dataset | Noise condition | Comparisons | Performance |
---|---|---|---|---|---|---|---|---|---|
Statistical model | IEEE SPL 1999 | Jongseo Sohn | A Statistical Model-Based Voice Activity Detection | DFT coefficients, Gaussian modeling | Log-likelihood ratio test (LRT), HMM-based hang-over scheme | - | NOISEX-92: vehicle, white, babble; 5dB, 15dB, 25dB | G.729 | ROC curves |
Statistical model | IEEE SPL 2005 | Javier Ramírez | Statistical Voice Activity Detection Using a Multiple Observation Likelihood Ratio Test | DFT coefficients, Gaussian modeling | Multiple observation LRT | AURORA-3 Spanish SpeechDat-Car (SDC) | distant and close-talking in car environments; 5dB | Sohn’s VAD, G.729, AMR1/2, AFE | ROC curves |
Statistical model | IEEE TASLP 2011 | Dongwen Ying | Voice Activity Detection Based on an Unsupervised Learning Framework | log-mel energies | GMM | TIMIT | NOISEX-92 | Sohn, G.729, AMR | ROC curves |
Statistical model | 2015 | Tomi Kinnunen | HAPPY Team Entry to NIST OpenSAD Challenge: A Fusion of Short-Term Unsupervised and Segment i-Vector Based Speech Activity Detectors | - | Fusion of 6 VADs | NIST 2015 OpenSAD | - | Sohn, G.729, GMM, rSAD, i-vectors | DCF |
Deep learning | Interspeech 2016 | Ruben Zazo, Google | Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection | Raw waveform | CLDNN | synthetic, 3800h, balanced | daily life noises, 5~30dB | 40-dim. log-mel energies+DNN, LSTM, CLDNN | ROC curves, FAR, MAR |
Deep learning | IEEE TASLP 2016 | Xiao-Lei Zhang | Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection | MRCG | Multi-resolution stacking (MRS)+bDNN | AURORA-2 (8 kHz, noisy), AURORA-4 (16 kHz, clean) | NOISEX-92; -5~20dB | SVM, Zhang13, DNN, bDNN | AUC |
Deep learning | Interspeech 2014 | Xiao-Lei Zhang | Boosted Deep Neural Networks and Multi-resolution Cochleagram Features for Voice Activity Detection | MRCG | bDNN | AURORA-4 | NOISEX-92; -5~5dB | Sohn, Ramirez05, Ying, SVM, Zhang13 | AUC |
Deep learning | IEEE TASLP 2013 | Xiao-Lei Zhang | Deep Belief Networks Based Voice Activity Detection | Pitch, DFT, MFCC, LPC, PLP, AMS | Deep belief network (DBN) | AURORA-2 | -5~10dB | G.729, ETSI Wiener filtering, Sohn, Ramirez05/07, Yu, Shin, Ying, SVM | AUC |
Deep learning | IEEE SPL 2018 | Juntae Kim | Voice Activity Detection Using an Adaptive Context Attention Model | MRCG | Attention model+LSTM | TIMIT, self-recorded dataset, HAVIC | NOISEX-92; -5~10dB | HFCL, MSFI, DNN, bDNN, LSTM | AUC |
Deep learning | IEEE ISSPIT 2019 | Guan-Bo Wang | A Fusion Model for Robust Voice Activity Detection | Fbank | Fusion of BUT, CRNN, RNN, SOV | OpenSAT19 (16 kHz), 160h | various background noise | BUT, CRNN, RNN, SOV | DCF |
Deep learning | APSIPA 2019 | Guan-Bo Wang | An RNN and CRNN Based Approach to Robust Voice Activity Detection | Fbank | Fusion of RNN, CRNN | OpenSAT19, OpenSAT17 | - | BUT, RNN, CRNN | DCF |
Deep learning | Interspeech 2019 | Ruixi Lin | Optimizing Voice Activity Detection for Noisy Conditions | MFCC, Fbank | DAE, CNN | AISHELL (16 kHz, manually labeled), AURORA-2 | self-collected noises; 0~20dB | G.729, SVM, DBN, DDNN | Accuracy |
Deep learning | Interspeech 2015 | Qing Wang | A Universal VAD Based on Jointly Trained Deep Neural Networks | MRCG | Jointly learning DNN with speech enhancement | AURORA-4 | 115 noise types including NOISEX-92; -5~20dB | DNN | AUC |
Deep learning | IEEE ICASSP 2016 | Sibo Tong | A Comparative Study of Robustness of Deep Learning Approaches for VAD | log-mel energies | Noise-aware training, DNN, LSTM, CNN | AURORA-4, WSJ0 | 6 noises; 5~20dB | DNN, LSTM, CNN | AUC, EER |
Deep learning | ICMSCE 2018 | Jaeseok Kim | Voice Activity Detection based on Multi-Dilated Convolutional Neural Network | MRCG | CNN with multi-dilated convolution | TIMIT | NOISEX-92, sound effect library; -12~10dB | bDNN, RNN, CNN | AUC |
Deep learning | Interspeech 2013 | Neville Ryant | Speech Activity Detection on YouTube Using Deep Neural Networks | MFCC | DNN | HAVIC, 65h, web videos | - | GMM | EER |
Deep learning | IEEE ICASSP 2019 | Rajat Hebbar | Robust Speech Activity Detection in Movie Audio: Data Resources and Experimental Evaluation | log-mel energies | CNN-TD | Movies | MUSAN, Audioset | CNN, CLDNN | F1, TPR+FPR |
Deep learning | Interspeech 2016 | Yuya Fujita | Robust DNN-based VAD augmented with phone entropy based rejection of background speech | Fbank | Acoustic model based DNN, entropy criterion | self-collected, mobile voice search, 1200h | - | DNN | ER |
Deep learning | IEEE ICASSP 2013 | Thad Hughes | Recurrent Neural Networks for Voice Activity Detection | PLP | RNN | - | - | GMM+SM | ROC curves |
Deep learning | IEEE ICASSP 2013 | Florian Eyben | Real-Life Voice Activity Detection with LSTM Recurrent Neural Networks and an Application to Hollywood Movies | RASTA-PLP | LSTM | Buckeye (26h), TIMIT | 4 noises; -6~25dB | Sohn, Ramirez05, ARG | AUC, FNR+FPR |
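
Most entries in the Performance column (ROC curves, AUC, EER, FAR, MAR) are frame-level metrics derived from per-frame speech scores. The sketch below shows one way such numbers could be computed in plain Python. It is only an illustration: the function names, the toy scores, and the convention that label 1 means speech are assumptions made here, not taken from any of the listed papers, and each paper may define its metrics with different details (for example collars or frame weighting).

```python
def far_mar(scores, labels, threshold):
    """False-alarm rate and miss rate at one decision threshold.

    Assumed convention: label 1 = speech frame, 0 = non-speech frame,
    and a frame is declared speech when its score >= threshold.
    """
    speech = [s for s, y in zip(scores, labels) if y == 1]
    nonspeech = [s for s, y in zip(scores, labels) if y == 0]
    far = sum(s >= threshold for s in nonspeech) / len(nonspeech)
    mar = sum(s < threshold for s in speech) / len(speech)
    return far, mar


def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a
    speech frame scores higher than a non-speech frame (ties count 0.5)."""
    speech = [s for s, y in zip(scores, labels) if y == 1]
    nonspeech = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in speech for n in nonspeech)
    return wins / (len(speech) * len(nonspeech))


def eer(scores, labels):
    """Approximate equal error rate: sweep the observed scores as
    thresholds, pick the one where FAR and MAR are closest, and
    return the mean of the two rates at that point."""
    candidates = (far_mar(scores, labels, t) for t in sorted(set(scores)))
    far, mar = min(candidates, key=lambda fm: abs(fm[0] - fm[1]))
    return (far + mar) / 2


# Toy example: six frames with hypothetical detector scores.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

print(far_mar(scores, labels, 0.5))
print(auc(scores, labels))
print(eer(scores, labels))
```

An ROC curve is the set of (FAR, 1 - MAR) points obtained by sweeping the threshold, so `far_mar` evaluated over all thresholds gives the curve that `auc` summarizes in one number.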
A compilation of SAD papers