微軟發佈VinVL:推進視覺語言模型的最新進展

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"人類藉由從各渠道感知及融合信息,來理解這個世界,具體方式包括:用雙眼觀察圖像,用雙耳聆聽聲音,以及其他的感知輸入方式。人工智能的核心願景之一,便是開發一種賦予計算機類似功能的算法,通過有效地從多模態數據(比如視覺語言)來學習,從而瞭解我們周圍的世界。例如:視覺語言(即 VL)系統允許我們在相關的圖像中搜索文本,並使用自然語言來描述圖像的內容,反之亦然。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如圖一所示,典型的視覺語言系統會藉助由兩個模塊組成的模塊化架構,來獲得視覺語言方面的認知:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"模塊一:圖像編碼模塊,也稱爲視覺特徵提取器。具體實現:通過卷積神經網絡(CNN)模型,生成輸入圖像的特徵圖譜。在此之前,最爲常見的做法,是使用視覺基因數據集( Visual Genome, VG),來訓練基於卷積神經網絡的對象檢測模型。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"模塊二:視覺語言融合模塊。具體實現:將編碼後的圖像與文本映射到同一語義空間中的向量上,從而可以通過其向量的餘弦距離,來計算它們的語義相似度。該模塊通常使用基於 Transformer 的模型,比如說 OSCAR 來實現。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"近來,通過大規模運用一一對應的圖像與文本語料庫,視覺語言預訓練(VLP)在改善視覺語言融合模塊方面取得了極大進展。最有代表性的方法,是以自監督的方式,使用海量的圖像與文本配對數據,來訓練基於 Transformer 的大型模型。比如:基於其上下文,預測相應遮蔽的元素。我們可以對預訓練的視覺語言融合模型進行微調,以適應各式各樣的下游視覺語言任務。不過,儘管在改善圖像編碼和物體檢測方面的研究已經獲得了巨大進步,但自從 2017 年經典的自下而上式局部特徵推理機制出現以來,現有的視覺語言預訓練方法都是將圖像編碼模塊作爲黑盒子來對待的,而不去涉及視覺特徵的改善。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文中,我們將會介紹近來微軟在改善圖像編碼模塊方面的進展。針對圖像編碼,微軟的研究員開發了一種新的對象屬性檢測模型,這種模型也被稱爲 VinVL,翻譯過來就是視覺語言中的視覺特徵。我們在全面驗證後確認,在視覺語言模型中,視覺特徵的關係十分重大。微軟的視覺語言系統將 VinVL 與最先進的視覺語言融合模塊(如 OSCAR 及 VIVO)相結合,表現非常優秀,不僅在七個主要的視覺語言基準上都達到了最先進的技術水平,在最具競爭力的視覺語言榜單上——包括視覺問答(VQA),微軟 COCO 圖像字幕和 NOCAPS(新型對象字幕)競賽中均名列前茅。最出彩的是,就常用語圖像字幕生成(CIDEr)而言,微軟的視覺語言系統在 NOCAPS 榜單上的表現遠勝人類(分別獲得 92.5 和 85.3 分)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/33\/331602501692fd45fc403e327b459f2d.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖一:用於視覺語言任務的最新模塊化架構圖,包括兩個模塊、圖像編碼模塊以及視覺語言融合模塊。通常來說,各個模塊分別用視覺基因數據集(Visual Genome)和概念字幕數據集(Conceptual Captions)訓練。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"VinVL:通用的對象屬性檢測模型"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"與傳統的計算機視覺任務(如對象檢測)相反,視覺語言任務需要計算機對於各種各樣的視覺概念有更廣泛的理解,從而能夠將其與文本中的相應概念相匹配。一方面來說,最常見的對象檢測基準(如 COCO、Open Images 和 Objects365)包含的對象類註釋多達 600 個,其主要關注的是形狀明確的對象,比如汽車或人;但缺少無固定形狀的可視對象,比如草叢和天空——而後者對於描述圖像是非常有用的。受限和有所偏好的對象類使得這些對象檢測數據集,在訓練實際應用中非常有用的視覺語言理解模型時捉襟見肘。另一方面,儘管 VG 數據集對於更爲多樣化和不具偏好的對象及屬性類有註釋,也只是個包含了 11 萬張圖像的數據集,從統計學上來說,想要藉以訓練可靠的圖像編碼模型還是太小了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對視覺語言任務,爲了訓練對象屬性檢測模型,我們將四個大型的公共對象檢測數據集,包括 COCO,Open Images,Objects365 和 VG 合併起來,構造了一個包含 2.49M 圖像的大型對象檢測數據集,其中有 1848 個對象類和 524 個屬性類。鑑於大多數數據集並沒有屬性註釋,我們採用了預訓練和微調的策略,來構建我們的對象屬性檢測模型。起初,我們通過這個合併後的數據集,來預訓練對象檢測模型,然後在 VG 數據集的額外屬性分支上對模型進行微調,使得它能夠檢測對象及屬性。最終的對象屬性檢測模型是個具有 152 個卷積層和 133M 參數的 Faster-RCNN 模型,也是 VL 任務中最大的圖像編碼模型了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的圖像屬性檢測模型可以檢測 1594 個對象類以及 524 個視覺屬性。最終,根據我們的實驗,針對輸入圖像上幾乎所有語義上有意義的區域,該模型能夠檢測並進行相應編碼。如圖二所示,與傳統的對象檢測模型(左側)相比,我們的模型(右側)能夠檢測到圖像中更多的視覺對象及屬性,並以更爲豐富的視覺特徵進行編碼,對於大部分的 VL 任務來說,這是至關重要的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/1b\/1ba1f366da9658674a2ee16342f5c405.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖二:通過 Open Images 數據集訓練的傳統對象檢測模型(左側),以及通過四個公共對象檢測數據集訓練的對象屬性檢測模型(右側)。我們的模型包含了更豐富的語義,比如更豐富的視覺概念以及屬性信息,同時所檢測到的邊界框幾乎涵蓋了所有具有語義的區域。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"視覺語言任務方面的最新進展"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於圖像編碼模塊是圖像語言系統的基礎,如圖一所示,我們的新圖像編碼可以結合許多現有的 VL 融合模塊一起使用,以提升 VL 任務上的表現。如表一所示,只將常見自下而上的模型所生成的視覺特徵替換爲我們的新模型所生成的那些,完整保留像 OSCAR 和 VIVO 這樣的視覺語言融合模塊,我們發現:現有已建立的七個 VL 任務上,新的結合體表現要明顯優於之前的 SoTA 模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"備註:我們仍然對 VL 融合模塊執行訓練,但使用了同樣的模型架構、訓練數據和訓練方法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/0d\/0dfa41519b9a34d59d27767a12fe6e69.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"表一:在將視覺特徵從常見的自下而上特徵換成我們的之後,七個 VL 任務上都有提升,NOCAPS 基準來自 VIVO,其他任務的基準來自 OSCAR。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"考慮到參數效率,我們在表二中,針對不同大小的模型進行了對比。在大多數任務中,我們的基礎模型都優於之前的大型模型,這表明使用更好的圖像編碼,VL 融合模塊的參數效率更佳。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/5d\/5dd82b71c93efd3df9af400f0bc1e814.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"表二:Oscar+,使用我們的對象屬性檢測模型生成視覺特徵,並在已建立的七個 VL 任務中獲得了更好的表現。標記爲 S、B、L 的 SoTA 模型,表示小型、基礎型和大型模型(BERT 提供規模評測)分別可獲得的最佳性能。本文中的所有表格,藍色都表示任務的最佳結果,灰色背景表示 Oscar 所生成的結果。之前 SoTA 的結果是從 ERNIE-VIL 模型、神經狀態機(NSM)、VIVO、VILLA 和 OSCAR 收集來的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的新視覺語言模型包含兩個模塊,一是作爲圖像編碼模型的新對象屬性檢測模型,二是作爲視覺語言融合模塊的 OSCAR,這個新模型截止 2020 年 12 月 31 日一直輕鬆居於多個 AI 基準的頂端,包括視覺問答(VQA)、微軟的 COCO 圖像字幕和 Nocaps。最出彩的是,就常用語圖像字幕生成(CIDEr)而言,我們的 VL 系統在 Nocaps 榜單上的表現遠勝人類(分別獲得 92.5 和 85.3 分)。以 GQA 基準而言,我們的模型也是第一個能勝過 NSM 的 VL 模型,幷包含一些專爲特定任務設計的複雜推理組件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/cd\/cd26c686132eb011932ef49fda483e60.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Visual Question Answering (VQA)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/ce\/ceacfcc01774613ab64c4a226bdf97bd.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Microsoft COCO Image Captioning"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/70\/70596dba0f4fbc104d5279ceff10a407.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Novel Object Captioning (nocaps)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"展望未來"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"VinVL 在改善圖像編碼以提升 VL 理解方面展現出了巨大的潛力。如本文中的圖片所示,我們新開發的圖像編碼模型在諸多 VL 任務中表現優秀。不過,儘管結果很令人振奮,比如在圖像字幕基準上勝過了人類的表現,但我們的模型還絕未達到人類的 VL 理解級別。未來有趣的開發方向包括:1)利用海量圖像分類或加了標籤的數據,進一步擴大對象 - 屬性檢測的預訓練;2)擴展跨模型的視覺語言表徵學習方法,從而建立基於感知的語言模型,令計算機可以像人類一樣,通過自然語言表達視覺概念,也可以反過來通過視覺概念來形象化自然語言。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/vinvl-advancing-the-state-of-the-art-for-vision-language-models\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/vinvl-advancing-the-state-of-the-art-for-vision-language-models\/"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章