Paper Reading: Deep Multimodal Speaker Naming

wlwchina

2020-02-25 02:17

http://herohuyongtao.github.io/research/publications/speaker-naming/
Problem description:

Speaker naming (SN): localizing + identifying each speaking character.

The difficulty of the problem lies in its multimodal nature.
Existing methods process each modality separately and then merge the results with handcrafted heuristics.

This paper:
- a CNN-based learning framework that uses face and audio information jointly
- needs no face tracking, facial landmark localization, or subtitle/transcript, and still reaches state-of-the-art performance
- trained end-to-end
- uses only cropped face regions and the corresponding audio
- runs in real time

Architecture:
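- face feature extractor: a CNN whose last layer outputs a vector
- audio feature extractor: MFCC, which also yields a vector
- the two vectors are concatenated and fed through several fully connected layers of gradually increasing dimension
- the whole network is trained end to end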
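
The fusion step above can be sketched roughly as follows. This is only a toy illustration in plain Python, not the paper's implementation: the vector sizes (8 and 6), the layer widths, the 5-class output layer, the random placeholder weights, and the names `fuse_and_score` and `random_layer` are all assumptions made for the example.

```python
import random

def relu(values):
    return [max(0.0, v) for v in values]

def dense(values, weights, biases):
    # One output per weight row: dot(row, values) + bias.
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]

def random_layer(n_in, n_out, rng):
    # Placeholder random parameters; a trained network would learn these.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def fuse_and_score(face_vec, audio_vec, layers):
    # Concatenate the two modality vectors, then apply the fully connected stack.
    hidden = list(face_vec) + list(audio_vec)
    for index, (weights, biases) in enumerate(layers):
        hidden = dense(hidden, weights, biases)
        if index < len(layers) - 1:
            hidden = relu(hidden)  # no activation on the final score layer
    return hidden

rng = random.Random(0)
face_vec = [rng.random() for _ in range(8)]    # stand-in for the CNN output (toy size)
audio_vec = [rng.random() for _ in range(6)]   # stand-in for the MFCC feature (toy size)
# Hidden widths grow (16, then 32); the last layer maps to 5 hypothetical identities.
layers = [random_layer(14, 16, rng), random_layer(16, 32, rng), random_layer(32, 5, rng)]
scores = fuse_and_score(face_vec, audio_vec, layers)
predicted_id = max(range(len(scores)), key=lambda i: scores[i])
```

Because the concatenation happens before the fully connected layers, gradients from the classification loss reach both extractors, which is what makes end-to-end training possible.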

Experiment:

Three tasks:
- 1) face recognition (using both face and audio information)
- 2) identifying non-matched face-audio pairs
- 3) real-world SN

Network settings:
- 2 conv layers (15*15, 5*4) and 2 pooling layers
- a 7*5 filter is used between the last pooling layer and the fully connected layers?
- number of feature maps in the two conv layers: 48 + 256
- two fully connected layers: 1024, 2028
- pooling factor: 2
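
One way to make sense of the 7*5 filter is to trace the feature-map sizes through the layers. The sketch below assumes a 50*40 input, unpadded stride-1 convolutions, and non-overlapping pooling; none of these are stated in the notes, they are chosen here because they make the arithmetic land on 7*5.

```python
def conv_out(size, kernel):
    # Unpadded, stride-1 convolution (assumption).
    return (size[0] - kernel[0] + 1, size[1] - kernel[1] + 1)

def pool_out(size, factor):
    # Non-overlapping pooling with the given factor (assumption).
    return (size[0] // factor, size[1] // factor)

size = (50, 40)  # assumed face crop size, not given in the notes
trace = [size]
for kernel in [(15, 15), (5, 4)]:
    size = conv_out(size, kernel)
    trace.append(size)
    size = pool_out(size, 2)
    trace.append(size)

print(trace)  # [(50, 40), (36, 26), (18, 13), (14, 10), (7, 5)]
```

Under these assumptions the last pooling layer outputs 7*5 maps, so a 7*5 filter covers a whole map and acts like a fully connected connection.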

Parameter initialization:
- all bias terms set to 0.01 (prevents the dead units caused by rectifier units)
- other parameters: drawn from a Gaussian in [-1, 1], then scaled according to the number of hidden units
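
A minimal sketch of this initialization, in plain Python. The notes do not give the exact scaling rule, so the `1 / sqrt(n_in)` factor and the unit-variance Gaussian below are assumptions, as is the helper name `init_layer`.

```python
import random

def init_layer(n_in, n_out, rng):
    # Assumed scaling rule: shrink the Gaussian draw by 1 / sqrt(fan-in).
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
               for _ in range(n_out)]
    # Small positive bias so a rectifier unit starts in its active region.
    biases = [0.01] * n_out
    return weights, biases

rng = random.Random(0)
weights, biases = init_layer(n_in=20, n_out=10, rng=rng)
```

With a positive bias, a rectifier unit outputs a nonzero value for a zero input, so it receives a gradient from the first update instead of starting out stuck at zero.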

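Network with audio added:
- for training, the two feature extractors are each initialized with pre-trained parameters
- the audio for each frame uses a 20 ms window (each audio sample yields a 75-dimensional feature)
- 5 audio samples are randomly selected for each face, so the audio feature has 375 dimensions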
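
The 375 comes from 5 * 75. A small sketch of how such a feature could be assembled, assuming the 5 samples are drawn without replacement from a pool of candidate windows (the notes do not say how the pool is formed or how sampling is done; `build_audio_feature` and the pool of 20 random windows are made up for illustration):

```python
import random

AUDIO_DIM = 75          # per-window feature size, from the notes
SAMPLES_PER_FACE = 5    # audio samples per face, from the notes

def build_audio_feature(candidate_windows, rng):
    # Pick 5 windows at random and concatenate their 75-D features.
    chosen = rng.sample(candidate_windows, SAMPLES_PER_FACE)
    feature = []
    for window in chosen:
        feature.extend(window)
    return feature

rng = random.Random(0)
# Toy pool of 20 windows filled with random numbers instead of real MFCCs.
windows = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(20)]
audio_feature = build_audio_feature(windows, rng)
print(len(audio_feature))  # 375
```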

Face recognition accuracy:
- only use face : 86.7%
- use face-audio : 88.5%
- other methods: <70%

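Identifying non-matched pairs: three SVMs for binary classification (matched / non-matched)
- 1) features from the 1024D face-audio model (82.2%)
- 2) features from the 1024D face-alone model + 75D audio feature (82.9%)
- 3) the SVM actually adopted has the same setup as the second one, except that the 1024D face-alone model is replaced by the 1024D face-audio model (84.1%)

So, for distinguishing non-matched pairs, combining the face-audio model features with the extra audio feature gives the best performance.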
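
To illustrate the third setup (model feature concatenated with the raw audio feature, then a binary SVM), here is a toy sketch. It swaps in a bias-free linear SVM trained by subgradient descent on the hinge loss, since the notes do not say which kernel or solver was used. The 2-D and 1-D vectors stand in for the 1024D and 75D features, and the data values are invented.

```python
def concat(model_feature, audio_feature):
    return list(model_feature) + list(audio_feature)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_linear_svm(samples, labels, epochs=20, lr=0.1, reg=0.01):
    # Hinge-loss subgradient descent, no bias term (simplification).
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * dot(w, x)
            w = [wi * (1.0 - lr * reg) for wi in w]       # L2 shrinkage
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if dot(w, x) > 0 else -1

# Invented toy pairs: +1 = matched, -1 = non-matched.
pairs = [([1.0, 0.8], [0.9], 1), ([-1.0, -0.8], [-0.9], -1),
         ([0.7, 1.0], [1.0], 1), ([-0.7, -1.0], [-1.0], -1)]
features = [concat(m, a) for m, a, _ in pairs]
labels = [y for _, _, y in pairs]
w = train_linear_svm(features, labels)
predictions = [predict(w, x) for x in features]
```

With the real features, each SVM input would be 1024 + 75 = 1099 dimensional.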

Speaker Naming

Essentially a combination of the previous two experiments.
First use the SVM to remove non-matched pairs,
then run recognition on the remaining pairs.
SN accuracy: Friends 90.5%, BBT 82.9%
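
The two-stage procedure can be sketched as a filter followed by a classifier. Everything named below (`name_speakers`, `is_matched`, `recognize`, the toy features and boxes) is hypothetical; in the described system the first callable would be the SVM and the second the face-audio network.

```python
def name_speakers(pairs, is_matched, recognize):
    # pairs: list of (face_box, face_audio_feature) for one frame.
    named = []
    for face_box, feature in pairs:
        if not is_matched(feature):
            continue  # stage 1: drop faces whose audio does not match
        named.append((face_box, recognize(feature)))  # stage 2: identify
    return named

# Toy stand-ins for the SVM and the recognition network.
def toy_is_matched(feature):
    return sum(feature) > 0

def toy_recognize(feature):
    return "character_A" if feature[0] > 0.5 else "character_B"

frame_pairs = [((10, 20, 50, 40), [0.9, 0.3]),     # speaking face
               ((80, 25, 50, 40), [-0.6, -0.2])]   # silent face
result = name_speakers(frame_pairs, toy_is_matched, toy_recognize)
print(result)  # [((10, 20, 50, 40), 'character_A')]
```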

@inproceedings{hu2015deep,
  title={{Deep Multimodal Speaker Naming}},
  author={Hu, Yongtao and Ren, Jimmy SJ. and Dai, Jingwen and Yuan, Chang and Xu, Li and Wang, Wenping},
  booktitle={Proceedings of the 23rd Annual ACM International Conference on Multimedia},
  pages={xxx--xxx},
  year={2015},
  organization={ACM}
}
