Abstract
Fine-grained action recognition datasets exhibit environmental bias, where multiple video sequences are captured from a limited number of environments. The paper exploits the multi-modal nature of video and proposes two components: multi-modal self-supervision, and adversarial training applied per modality.
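The multi-modal self-supervision component can be sketched as a correspondence task: predict whether an RGB clip and a flow clip come from the same video. This is a minimal sketch of how such training pairs could be built (the function name, `neg_ratio` parameter, and batch construction are assumptions, not the paper's exact procedure):

```python
import random

def make_correspondence_batch(rgb_clips, flow_clips, neg_ratio=0.5, rng=None):
    """Build (rgb, flow) pairs with binary correspondence labels.

    Positives keep RGB and flow from the same clip (label 1);
    negatives pair RGB with flow sampled from a different clip (label 0).
    A classifier trained on these labels learns modality correspondence
    without any action annotations -- hence self-supervision.
    """
    rng = rng or random.Random(0)
    pairs, labels = [], []
    for i, rgb in enumerate(rgb_clips):
        if rng.random() < neg_ratio and len(flow_clips) > 1:
            # mismatched pair: flow drawn from any other clip
            j = rng.choice([k for k in range(len(flow_clips)) if k != i])
            pairs.append((rgb, flow_clips[j]))
            labels.append(0)
        else:
            # corresponding pair: same clip index in both modalities
            pairs.append((rgb, flow_clips[i]))
            labels.append(1)
    return pairs, labels

pairs, labels = make_correspondence_batch(["r0", "r1", "r2"],
                                          ["f0", "f1", "f2"])
```

Because the task is defined on both domains, it gives the network a shared training signal on unlabelled target videos.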
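Per-modality adversarial training is typically built on a gradient reversal layer (GRL), DANN-style: a domain classifier learns to distinguish source from target features, while the reversed gradient pushes the feature extractor to confuse it. A toy scalar version showing the sign flip (all names and the single-weight setup are illustrative assumptions; the paper would attach one such discriminator to each modality stream):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grl_grads(w, v, x, d, lam=1.0):
    """Gradients for one step of adversarial domain alignment.

    Feature extractor: h = w * x; domain classifier: p = sigmoid(v * h).
    d is the domain label (0 = source, 1 = target), lam the GRL weight.
    Returns (grad_v, grad_w): the classifier's gradient, and the feature
    extractor's gradient AFTER the GRL negates it.
    """
    h = w * x
    p = sigmoid(v * h)
    dlogit = p - d            # d(BCE loss)/d(logit)
    grad_v = dlogit * h       # classifier descends: learns to tell domains apart
    # GRL: flip the sign of the gradient flowing into the feature extractor,
    # so descending on grad_w makes features LESS domain-discriminative.
    grad_w = -lam * dlogit * v * x
    return grad_v, grad_w

gv, gw = grl_grads(w=0.5, v=1.0, x=2.0, d=1.0)
```

With one discriminator per modality, RGB and flow features are each aligned across domains independently.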
Introduction
The task is fine-grained action recognition.
Few works have attempted deep UDA for video data, e.g. "Temporal Attentive Alignment for Large-Scale Video Domain Adaptation" (ICCV 2019) and "Deep Domain Adaptation in Action Space" (BMVC 2018).
Conclusion
Here "modality" refers to the two input streams (RGB and optical flow); future work includes adding audio.
Key points: strong motivation; a new dataset is proposed.