Self-Supervised Learning and Attention Modeling in Computer Vision

> Since the advent of deep learning, AI has advanced rapidly, with many new results emerging every year. 2020 was a particularly fruitful year, producing milestone results across the AI subfields, and computer vision saw several major technical breakthroughs. This talk covers two of them, self-supervised learning in computer vision and Transformer attention modeling in computer vision, together with related work by the speaker's research group at Microsoft Research Asia.
>
> The talk is organized into three parts:
>
> - Three breakthroughs in computer vision research in 2020
> - Self-supervised learning in computer vision
> - Transformer attention modeling in computer vision

## Three breakthroughs in computer vision research in 2020

Let us first look at the breakthrough advances in computer vision in 2020.

![figure](https://static001.geekbang.org/infoq/68/68f3d22ce930aff74aa71aa44f824399.jpeg)

### 1. Self-supervised learning

The first breakthrough is in self-supervised learning: in 2020, self-supervised learning surpassed supervised pre-training for the first time, a genuine milestone. The landmark works are MoCo by Kaiming He et al. (see "Momentum Contrast for Unsupervised Visual Representation Learning") and SimCLR by Hinton et al. (see "A Simple Framework for Contrastive Learning of Visual Representations"). In under a year these two papers gathered roughly 450 and 550 citations respectively, giving self-supervised learning an enormous boost.

### 2. Transformer attention modeling

The second breakthrough is the successful application of Transformers to mainstream vision problems. The representative works are DETR (see "End-to-End Object Detection with Transformers") and ViT (see "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"), which for the first time successfully applied Transformers to mainstream vision tasks, namely object detection and image classification respectively. These two works pushed CV and NLP toward a unified model family and opened up a new research trend; as a result, the second half of 2020 saw a flood of Transformer-related CV papers.

### 3. Neural radiance fields for view synthesis

The third breakthrough is NeRF (see "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis"), a milestone advance for low-level vision.

## Self-supervised learning in vision

### 1. Why self-supervised learning matters: Yann LeCun's cake

In his Turing Award lecture, Yann LeCun drew a famous cake analogy for the various modes of learning, including reinforcement learning, supervised learning, and self-supervised learning. He compared reinforcement learning to the cherry on the cake: eye-catching, but not fundamental. Supervised learning is the icing: delicious, but likewise not fundamental. Self-supervised learning is the cake itself, which in his view is the most fundamental ingredient of human-level intelligence.

![figure](https://static001.geekbang.org/infoq/d1/d18510b23b597edadb1946975ea9eb90.jpeg)

**Why is self-supervised learning so important?**

LeCun argues that human infants come to understand the world through self-supervised learning. A newborn cannot yet communicate directly with adults, so its learning is not supervised learning. It interacts with its environment somewhat, but not nearly enough for its learning to be mainly reinforcement learning. In fact, most of an infant's learning comes from observing its surroundings and from the self-supervised tasks implicit in that observation; in other words, self-supervised learning is the most essential path toward human intelligence.

On how infants learn, Indiana University's Linda Smith and Michael Gasser give an excellent account in their 2005 report "The Development of Embodied Cognition: Six Lessons from Babies", which interested readers may consult.

### 2. The "supervised pre-training + downstream fine-tuning" paradigm

AlexNet burst onto the scene in 2012, cutting the error rate in that year's ImageNet competition by an unprecedented ~40%, and artificial intelligence entered the deep learning era. From then on the word "deep" appeared in more and more computer vision papers, with a marked surge starting in 2014.

![figure](https://static001.geekbang.org/infoq/84/84021fb2c3ca505a407fe7fb371da7fb.jpeg)

Part of the reason for that surge is that in 2014 the research community validated a very important paradigm (another likely reason is that open-source deep learning frameworks began to emerge): "supervised pre-training + downstream fine-tuning".

As is well known, deep learning needs large amounts of training data, yet real downstream tasks rarely have many labeled samples, often only a few thousand or even a few hundred. We can still train models that perform well, precisely because of the "supervised pre-training + downstream fine-tuning" paradigm.

Take ImageNet as an example: a model is first pre-trained on the ImageNet classification dataset with its 1.2 million labeled images; downstream tasks, typically semantic segmentation, object detection, fine-grained recognition and so on, then fine-tune this pre-trained model. Compared with training from scratch, downstream tasks that start from a pre-trained model perform substantially better.

![figure](https://static001.geekbang.org/infoq/9d/9ddfbe602ac16a71e5e55b6728ef6512.jpeg)

### 3. "Self-supervised pre-training + downstream fine-tuning"

In 2019 the two threads above, self-supervised learning and "supervised pre-training + downstream fine-tuning", came together as "self-supervised pre-training + downstream fine-tuning". This was driven by a milestone work, MoCo by Kaiming He et al.: "Momentum Contrast for Unsupervised Visual Representation Learning" (released in 2019 and published at CVPR 2020).

On 7 downstream tasks, MoCo's self-supervised pre-training surpassed supervised pre-training for the first time. This very likely signals the arrival of the self-supervised (or unsupervised) era in AI. It not only lets us exploit nearly unlimited training data without annotation; more importantly, from a cognitive standpoint, the "self-supervised pre-training + downstream fine-tuning" paradigm is also much closer to how humans learn.

### 4. How self-supervised learning developed

How did self-supervised pre-training evolve, step by step? The image below shows the various self-supervised learning methods that appeared over the past decade and more.

![figure](https://static001.geekbang.org/infoq/01/012b8c8ef08dec896f588083507086b4.jpeg)

The methods are diverse, but notably the recent breakthroughs largely descend from the 2014 "Exemplar networks", whose defining idea is to treat every training image as its own class: ImageNet has 1.2 million images, so the model is trained with 1.2 million classes. The task is rather counter-intuitive, since the usual goal of learning is greater abstraction, so at first few people pursued this line. In mid-2018 the "Memory bank" work finally pushed the idea to real prominence; at the end of 2018, "Deep metric transfer" demonstrated its value for semi-supervised learning; and at the end of 2019, MoCo achieved the milestone result of surpassing supervised learning on several important downstream tasks. From MoCo onward the field accelerated rapidly: in February 2020 Google proposed SimCLR; in June 2020 our group proposed PIC, while DeepMind and FAIR proposed BYOL and SwAV respectively; and in November 2020 our group went a step further with PixPro, which takes self-supervised learning from the image level down to the pixel level and markedly improves downstream tasks such as object detection and segmentation.

![figure](https://static001.geekbang.org/infoq/f7/f77782fe2eb85751db90308c72bb61f3.jpeg)

### 5. PIC: a single-branch unsupervised feature learning algorithm

Memory bank, MoCo, SimCLR and their kin are all two-branch algorithms: each input image is turned into two augmented views, a convolutional network extracts features from both views, and the network is trained on the principle of "pull together the features of two views of the same image; push apart the features of views generated from different images".

![figure](https://static001.geekbang.org/infoq/b6/b6d77a50afd2e0a90bf0634c1d0bd10a.jpeg)

PIC, by contrast, is a single-branch network: each input image yields one augmented view, a convolutional network again extracts features, and a classifier head after the convolutions classifies the image. Compared with the two-branch algorithms, PIC is simpler and performs just as well.

### 6. PixPro: pixel-level self-supervised learning

Over the past year, self-supervised learning improved dramatically on the ImageNet 1000-class linear classification evaluation: from MoCo to CLSA is an absolute gain of 15.6%.

![figure](https://static001.geekbang.org/infoq/73/73137e4f70ae094cef7239692c65a22c.jpeg)

But on downstream tasks that depend on dense prediction the gains were minimal: on Pascal VOC object detection, for instance, MoCo to InfoMin is only a 1.7% absolute improvement, whereas PixPro improves on InfoMin by 2.6% on the same task.

![figure](https://static001.geekbang.org/infoq/1b/1b2afb5a6f1fb1ee33b10c5a24e0f193.jpeg)

PixPro also improves performance on other downstream tasks:

![figure](https://static001.geekbang.org/infoq/88/885ef03aed91ea7f8939ba6c620c4983.jpeg)

Compared with other self-supervised algorithms, PixPro's key idea is to turn the image-level pre-training task into a pixel-level pre-training task. Many downstream tasks, such as image segmentation and object detection, are pixel-level tasks; if pre-training is also carried out at the pixel level, it matches the downstream tasks much more closely and can therefore deliver better performance.

![figure](https://static001.geekbang.org/infoq/87/879bf7a0bc8536edaf5de58908006403.jpeg)

We first proposed a pre-training task that discriminates individual pixels. Given an input image, we augment it into two views just as the earlier image-level methods do, and then train the network with the task "pull together the features of nearby pixels across the two views; push apart the features of pixels that are far apart", thereby generalizing image-level pre-training to pixel-level pre-training. We call this the "pixel discrimination" pre-training task, or PixContrast for short.

![figure](https://static001.geekbang.org/infoq/86/86ab7c6b148a10b145199609dc5b498b.jpeg)

Building on this, we proposed PixPro, which improves on PixContrast with two changes:

- Pixel smoothing: feature extraction for view 1 is unchanged, while the features of view 2 are smoothed, i.e. each target pixel's feature is smoothed using the surrounding pixels.
- The push-apart branch is removed.

We call the improved method PixPro, the "pixel-to-propagation consistency" pre-training task.

![figure](https://static001.geekbang.org/infoq/ee/ee1eaac9e5deaed03f87efd4b544c433.jpeg)

The original "pixel discrimination" task is mainly pixel contrast and strengthens the model's spatial sensitivity; the improved task, with pixel smoothing added, strengthens the model's spatial smoothness. Since neighboring pixels are correlated in downstream tasks, increasing spatial smoothness improves downstream training.

The architecture of PixPro is shown below:

![figure](https://static001.geekbang.org/infoq/c4/c46cf39f0bfd64c588c6e8c57d11cfa3.jpeg)

A further advantage of pixel-level self-supervised pre-training is that the pre-trained model and the downstream model can share the same network structure; for example, an FPN can be brought into the pre-training model.

![figure](https://static001.geekbang.org/infoq/e0/e07d4d6969621a51c130f37c001428de.jpeg)

### 7. Self-supervised pre-training beyond static images

Beyond the static-image self-supervised pre-training described above, the past year also brought progress in video-based pre-training and multi-modal feature pre-training. Representative researchers and groups for video-based pre-training include Andrew Zisserman and Weidi Xie at Oxford, and Alexei Efros, Xiaolong Wang and others at UC Berkeley. Multi-modal pre-training combines images with sound, language and other input modalities for self-supervised training, which closely resembles how humans learn.

## Transformer attention modeling in vision

### 1. A grand unification story for AI

A grand unified theory is the holy grail of physics, and countless physicists have worked to unify the four fundamental forces. AI has a similar aspiration, and the deep learning wave has in fact carried us a long way toward unification: the learning machinery is already essentially unified, namely annotated data plus training by error backpropagation.

![figure](https://static001.geekbang.org/infoq/2d/2d7fa9a43e10040e9f2d438314665fd3.jpeg)

The mainstream modeling approaches of AI's different subfields, however, still differ.

From Yann LeCun's LeNet until now, the foundational model of computer vision has been the convolutional network.

In NLP and sequence modeling, by contrast, models went through a long evolution and only stabilized after the Transformer was proposed in 2017; today the Transformer is the mainstream modeling approach in NLP. The main model transitions in NLP/sequence modeling are shown below:

![figure](https://static001.geekbang.org/infoq/da/da874293d11ba19f2a3eb767f60ee827.jpeg)

In 2020, the application of Transformers in computer vision changed this situation in which CV and NLP used different model structures, and greatly advanced the unification of models across the two fields. We have been exploring Transformers in vision for several years, starting from the goal of "unified modeling for natural language processing and computer vision": partly to unify the mathematical form of NLP and CV models, and partly in the hope that the two fields can interoperate and reinforce each other in the future. In 2020, the research community took a large step in this direction.

### 2. Unified modeling of CV and NLP

① Convolution in NLP

In NLP, people have tried applying convolution and produced some genuinely good methods, for example ConvSeq2Seq from FAIR in 2017, Dynamic Convolution from FAIR in 2019, and Deformable Convolution from Microsoft Research Asia. These achieved real results but still trail the Transformer in performance, and after the pre-trained models GPT and BERT appeared, the Transformer's position became even more entrenched.

![figure](https://static001.geekbang.org/infoq/e2/e2a939de540ed3cb2de731802e2a277e.png)

② Transformers/attention in CV

Likewise in CV, people have been trying to model with Transformers, and several works in 2020 gave Transformers their breakthrough in vision. The first to draw wide attention was FAIR's DETR, which successfully applied a complete Transformer to object detection; Microsoft Research Asia then proposed RelationNet++, which uses a Transformer decoder to fuse different object representations; and in October Google proposed the Vision Transformer, which uses a Transformer as the backbone network for image classification.

![figure](https://static001.geekbang.org/infoq/c0/c02ddb7a02a5cb180a86fed87b6a8bb6.jpeg)

These works are introduced one by one below.

### 3. Works applying Transformers in CV

① DETR

DETR applies the Transformer to object detection. Its hallmark is that it carries the Transformer's NLP usage over to CV directly, encoding and decoding image features and training end to end. It does, however, still use a CNN for feature extraction and has not fully moved away from CNNs.

![figure](https://static001.geekbang.org/infoq/d4/d430ccf46f6e7d6820e49549e17839f8.jpeg)

② RelationNet++

In RelationNet++, we use the Transformer decoder to bridge the different object representations used in detection. Objects can be represented in several ways: center points, anchor boxes, bounding boxes, the diagonal corners of the box, and so on. Each representation has its own strengths, yet a detector normally uses only one of them. With a Transformer decoder the representations can be unified so that their strengths combine, giving a single model 52.7 mAP on COCO.

![figure](https://static001.geekbang.org/infoq/6d/6d768f0c53c799361f7476b2f281e29f.jpeg)

③ Vision Transformers

The Vision Transformer applies Transformers to image classification, surpassing ResNet in both practical speed and accuracy. It splits the input image into 16×16 sub-images; for an RGB image each sub-image becomes a 768-dimensional (16×16×3) vector, which is fed into a Transformer encoder. Simple as the method is, both its accuracy and its practical running speed are strong.

![figure](https://static001.geekbang.org/infoq/fb/fbe937cb1bea453d36355e8cdeeab74a.jpeg)

### 4. Attention mechanisms in CV

Even before full Transformers were applied to CV, the Transformer's attention mechanism was already in use in vision: FAIR proposed NLNet in 2017, and from 2017 to 2020 our group proposed RelationNet, LRF and LR-Net. A large body of other work in this period also used attention to tackle various vision problems; for lack of time this talk touches on it only briefly.

![figure](https://static001.geekbang.org/infoq/1e/1e4e6c6cd453669057c22fa693f90c87.jpeg)

**Attention for modeling relations between basic visual elements**

Visual modeling involves concepts at two levels, "pixels" and "objects", and modeling vision amounts to modeling the relations between pixel and pixel, object and pixel, or object and object. For pixel-pixel relations the mainstream tool is convolution; for object-pixel relations the mainstream tools are methods such as RoIAlign; object-object relations were rarely modeled before. In fact, all of these relation models can be replaced by attention.

![figure](https://static001.geekbang.org/infoq/18/18e3a32ed1130b65d86a5c9f209ecfb2.jpeg)

### 5. Object-object relation modeling with attention

Before attention entered CV, object-object relations were not modeled; attention makes it possible to account for the interactions between objects. Much of our work uses attention to model object relations, for example the object detector RelationNet [CVPR 2018], the multi-object tracker Spatial-Temporal Relation Networks [ICCV 2019], and the video object detector MEGA [CVPR 2020].

![figure](https://static001.geekbang.org/infoq/f3/f38b6f6e537819c50fc7b390570e5660.jpeg)

① Object detection

We were the first to achieve end-to-end object detection with attention; see "Relation Networks for Object Detection", CVPR 2018.

![figure](https://static001.geekbang.org/infoq/f9/f97766f69c5f9c8bb504f426161dc3df.jpeg)

The key is that an attention module replaces NMS, so the whole pipeline can be trained by backpropagation and the duplicate-removal module itself becomes learnable.

② Multi-object tracking

We also applied attention to multi-object tracking; see "Spatial-Temporal Relation Networks for Multi-Object Tracking", ICCV 2019.

![figure](https://static001.geekbang.org/infoq/06/064c01255b0779177a42f75fe577d885.jpeg)

③ Video object detection

Our CVPR 2020 paper "Memory Enhanced Global-Local Aggregation for Video Object Detection" applies attention to video object detection. Most of today's best-performing video object detection methods use attention modeling.

### 6. Object-pixel relation modeling with attention

Before attention, an object's region features were essentially cropped out of the feature map with methods such as RoIAlign; with attention, the region features of a target object can instead be gathered adaptively.

![figure](https://static001.geekbang.org/infoq/ed/eded3fd30bee1e8a6ecb32f965d5e5c8.jpeg)

Our ECCV 2018 paper "Learning Region Features for Object Detection" uses attention to learn region-feature extraction automatically.

![figure](https://static001.geekbang.org/infoq/7c/7c42ceec1d1c6cb854c680a1d8253a0e.jpeg)

### 7. Pixel-pixel relation modeling with attention

Before attention, pixel-pixel relations in images were modeled with convolution. Once attention is introduced for pixel-pixel relations, it can complement convolution, or even replace it entirely.

① Attention as a complement to convolution

By its very structure, convolution models pixel-pixel relations only within a local region; supplementing it with attention provides global information. The non-local network (NL-Net) of Xiaolong Wang, Kaiming He et al. at CVPR 2018 inserts attention-based global modules into a convolutional network, improving performance on many tasks.

![figure](https://static001.geekbang.org/infoq/d1/d1617d6cd2459c435d85b9e115742c7c.jpeg)

The approach above suffers from a degeneration, however: different query pixels turn out to be influenced by essentially the same set of key pixels. We published this finding in TPAMI 2020.

To address this degeneration of non-local networks, our ECCV 2020 paper "Disentangled Non-Local Neural Networks" presents a disentangled non-local network that lets the model learn more physically meaningful relations.

![figure](https://static001.geekbang.org/infoq/40/4076c5d07fa0bcd717e0a40ba2313750.jpeg)

② Replacing convolution with attention

A step beyond pairing attention with convolution is replacing convolution with attention altogether. Our ICCV 2019 paper "Local Relation Networks for Visual Recognition" proposes such a replacement: the convolution units in ResNet are swapped for attention units, and at the same FLOPs the model achieves higher accuracy.

![figure](https://static001.infoq.cn/resource/image/18/16/18bff3fc2fa4d9da313f5b3d63b89416.jpg)

## Summary

- Computer vision has begun to enter the era of self-supervised or unsupervised pre-training.
- Transformers and attention models are currently the modeling approach most likely to unify vision and natural language.

**Speaker:**

**Dr. Han Hu**

Microsoft Research Asia | Researcher

Speaker bio: Han Hu is currently a principal researcher in the Visual Computing Group at Microsoft Research Asia (MSRA). He received the Ph.D. degree in 2014 and the B.S. degree in 2008 from Tsinghua University. His Ph.D. dissertation was awarded the Excellent Doctoral Dissertation Award of CAAI in 2016. He was a visiting student at the University of Pennsylvania from October 2012 to April 2013. Before joining MSRA in Dec. 2016, he worked at the Institute of Deep Learning (IDL), Baidu Research. His research interests include visual representation learning, joint visual-linguistic representation learning, and object recognition. He will serve as an area chair of CVPR 2021.

Homepage: https://ancientmooner.github.io/

This article is reproduced from DataFunTalk (ID: DataFunTalk).

Original link: [Self-Supervised Learning and Attention Modeling in Computer Vision](https://mp.weixin.qq.com/s/OFDO2o61zagfifjDcNQ9cg)
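The pull-together/push-apart principle behind the two-branch algorithms (Memory bank, MoCo, SimCLR) discussed in the PIC section is typically implemented as an InfoNCE-style contrastive loss: a cross-entropy over similarities, where the positive is the other view of the same image and the negatives are views of other images. Below is a minimal, stdlib-only Python sketch; the function names and the temperature value are illustrative, not taken from any of the papers.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce_loss(query, keys, positive_index, temperature=0.2):
    """InfoNCE: cross-entropy that pulls the query toward its positive key
    and pushes it away from every other (negative) key."""
    q = l2_normalize(query)
    logits = [sum(a * b for a, b in zip(q, l2_normalize(k))) / temperature
              for k in keys]
    # numerically stable log-sum-exp over all keys
    m = max(logits)
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_z - logits[positive_index]
```

For example, with `query` as the embedding of one augmented view, `keys[positive_index]` as the embedding of the second view of the same image, and the remaining keys as embeddings of other images, the loss is small when the two views already agree and large when they do not.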
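The pixel-level positive-pair rule in the PixContrast section ("pull together the features of nearby pixels across the two views") can be sketched as a matching step: map each feature-map cell of both views back to coordinates in the original image, and treat cross-view cells whose locations are close as positive pairs. This is a toy illustration; the coordinate convention and the distance threshold are our assumptions, not the paper's exact normalization.

```python
def positive_pairs(coords1, coords2, threshold=0.7):
    """Return index pairs (i, j) of feature-map cells from the two augmented
    views whose back-projected locations in the shared original-image frame
    are within `threshold` of each other (Euclidean distance)."""
    pairs = []
    for i, (x1, y1) in enumerate(coords1):
        for j, (x2, y2) in enumerate(coords2):
            if ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5 < threshold:
                pairs.append((i, j))
    return pairs
```

The resulting pairs are the ones whose features a pixel-level objective pulls together; in PixContrast the remaining (distant) cross-view pairs would be pushed apart, while PixPro drops that push-apart branch.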
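The ViT patch arithmetic described above (each 16×16 RGB sub-image flattens to a 768-dimensional vector, 16×16×3) can be made concrete with a small sketch. `patchify` is our own illustrative helper operating on nested lists, not an API of any library; in a real ViT the flattened patches would additionally pass through a learned linear projection before entering the Transformer encoder.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into non-overlapping
    patch_size x patch_size patches, each flattened to one vector of
    length patch_size * patch_size * C."""
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for py in range(0, h, patch_size):
        for px in range(0, w, patch_size):
            vec = []
            for dy in range(patch_size):
                for dx in range(patch_size):
                    vec.extend(image[py + dy][px + dx])  # C channel values
            patches.append(vec)
    return patches
```

A 224×224 RGB input would yield 196 such 768-dimensional vectors, which is exactly the token sequence length ViT feeds to its encoder.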
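The contrast drawn in the pixel-pixel section between convolution's fixed local window and attention's global aggregation can be illustrated with a minimal scaled dot-product attention over a set of pixel feature vectors. This is a pure-Python sketch with names of our choosing; a non-local block additionally wraps this with learned query/key/value projections and a residual connection.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: each query aggregates over ALL values,
    weighted by a softmax of its similarity to the keys -- global context,
    unlike a convolution's fixed local neighborhood."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        out.append([sum(w * v[t] for w, v in zip(weights, values))
                    for t in range(len(values[0]))])
    return out
```

The degeneration mentioned above corresponds to the `weights` rows coming out nearly identical for every query, so that all positions aggregate the same set of key pixels.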