MURAL: A Multimodal, Multitask Retrieval Model Across Languages

There is often no direct one-to-one translation from one language to another, and even when there is, such translations are not always exact; their different associations and connotations are easily lost on a non-native speaker. In such cases, however, a grounding visual example can make the meaning much clearer.

Take the word "wedding". In English it typically evokes a bride in a white dress and a groom in a tuxedo, but when translated into Hindi (शादी), a more fitting association is a bride in vibrant colors and a groom in a sherwani, the long, high-collared coat worn by men in India. What different people associate with the word may vary widely, but shown an image of the intended concept, its meaning becomes much clearer.

[Figure: The word "wedding" evokes different mental imagery in English and Hindi.]

With current advances in neural machine translation and image recognition, such ambiguity can be reduced during translation by supplying an image alongside the text. Prior research has made considerable progress in learning joint image-text representations for high-resource languages such as English. These representation models strive to encode images and text as vectors in a shared embedding space, such that an image and the text describing it lie close to each other in that space. ALIGN and CLIP have shown that, given enough training data, training a dual-encoder model (that is, a model with two separate encoders) with a contrastive learning loss on image-text pairs works remarkably well.

Unfortunately, image-text pair data does not exist at the same scale for most languages. In fact, more than 90% of such web data belongs to the top ten high-resource languages, such as English and Chinese, with far less available for under-resourced languages. To address this, one could try to manually collect image-text pairs for under-resourced languages, but the scale of that effort makes it impractical; alternatively, one can leverage existing datasets, such as translation pairs, which can supply the representations needed for many languages.
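To make the dual-encoder idea concrete, here is a minimal pure-Python sketch of the symmetric in-batch contrastive (InfoNCE) loss that models like ALIGN and CLIP train with. It is a toy illustration, not the actual training code of either model; the function names and the temperature value are illustrative choices.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric in-batch contrastive (InfoNCE) loss for a dual encoder.

    Each image is pulled toward its paired text (same batch index) and
    pushed away from every other text in the batch, and vice versa.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Scaled cosine-similarity matrix between all images and all texts.
    sim = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
           for i in range(n)]

    def nll(scores, target):
        # Numerically stable negative log-softmax at the target index.
        z = max(scores)
        log_sum = z + math.log(sum(math.exp(s - z) for s in scores))
        return log_sum - scores[target]

    image_to_text = sum(nll(sim[i], i) for i in range(n)) / n
    text_to_image = sum(nll([sim[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2
```

Correctly aligned pairs yield a near-zero loss, while mismatched pairs are penalized, which is what pushes paired images and texts together in the shared embedding space.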
In the paper "MURAL: Multimodal, Multitask Retrieval Across Languages", presented at EMNLP 2021, we describe a representation model for image-text matching that applies multitask learning to image-text pairs combined with translation pairs covering more than 100 languages. This technique lets users express words through images when those words have no direct translation into a target language. For example, the word "valiha" refers to a tube zither played by the Malagasy people; it has no direct translation in most languages, but can easily be conveyed with an image. In practice, MURAL shows consistent improvements over state-of-the-art models, other benchmarks, and competitive baselines across the board. Moreover, MURAL performs well on the majority of under-resourced languages on which it was tested. We also found interesting linguistic correlations learned by MURAL's representations.

MURAL Architecture

The MURAL architecture is based on ALIGN, but employed in a multitask fashion. Whereas ALIGN uses a dual-encoder architecture to draw together representations of images and their associated text descriptions, MURAL uses a dual encoder for the same goal while also extending it to other languages by incorporating translation pairs. The image-text pair dataset is the same one used for ALIGN, and the translation pairs are those used for LaBSE.

MURAL solves two contrastive learning tasks:
1) image-text matching;
2) text-text (bitext) matching.

The two tasks share the text encoder module. From the image-text data the model learns associations between images and text, and from the translation pairs it learns representations for hundreds of languages. The idea is that the shared encoder will transfer the image-text associations learned from high-resource languages to under-resourced ones. We found that the best model uses an EfficientNet-B7 image encoder and a BERT-large text encoder, both trained from scratch. The learned representations can be used for downstream visual and vision-language tasks.

[Figure: The MURAL architecture depicts dual encoders with a shared text encoder between the two tasks, trained with contrastive learning losses.]

Multilingual Image-to-Text and Text-to-Image Retrieval
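The multitask setup above can be sketched as two contrastive objectives routed through one shared text encoder. The following is a hypothetical pure-Python illustration, assuming toy callable "encoders"; the function names, loss weighting, and temperature are illustrative, not taken from the paper.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def in_batch_nll(left, right, temperature=0.1):
    """Average InfoNCE loss matching left[i] to right[i] within the batch."""
    n = len(left)
    total = 0.0
    for i in range(n):
        scores = [cosine(left[i], right[j]) / temperature for j in range(n)]
        z = max(scores)  # stabilize the log-softmax
        total += z + math.log(sum(math.exp(s - z) for s in scores)) - scores[i]
    return total / n

def mural_style_step(image_encoder, text_encoder, images, captions,
                     src_texts, tgt_texts, w_t2t=1.0):
    """Combine both contrastive tasks through one shared text encoder.

    The SAME text_encoder embeds the captions and both sides of the
    translation pairs, so image-text alignment learned for high-resource
    languages can transfer to under-resourced ones via shared parameters.
    """
    i2t = in_batch_nll([image_encoder(x) for x in images],
                       [text_encoder(t) for t in captions])
    t2t = in_batch_nll([text_encoder(s) for s in src_texts],
                       [text_encoder(t) for t in tgt_texts])
    return i2t + w_t2t * t2t
```

The key design point is parameter sharing: nothing forces the two tasks to agree except that every text, whether a caption or one side of a translation pair, passes through the same encoder.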
To demonstrate MURAL's capabilities, we chose the task of cross-modal retrieval (that is, retrieving relevant images given a text, and vice versa) and report scores on a range of academic image-text datasets covering high-resource languages, such as MS-COCO (and its Japanese variant, STAIR), Flickr30K (English) and Multi30K (extended to German, French, and Czech), and XTD (test set only, covering seven high-resource languages: Italian, Spanish, Russian, Chinese, Polish, Turkish, and Korean).

Beyond high-resource languages, we also evaluated MURAL on the recently released Wikipedia Image-Text (WIT) dataset, which spans 108 languages, both high-resource (English, French, Chinese, etc.) and under-resourced (Swahili, Hindi, etc.).

Evaluated on both well-resourced and under-resourced languages, in both the zero-shot and the fine-tuned setting, MURAL consistently outperforms prior state-of-the-art models such as M3P, UC2, and ALIGN. We observed especially large performance gains on under-resourced languages relative to the state-of-the-art ALIGN model.

[Figure: Mean recall on various multilingual image-text retrieval benchmarks.]

Mean recall is a common metric (higher is better) for evaluating cross-modal retrieval performance on image-text datasets. It is the average of six measurements of Recall@N (i.e., the probability that the ground-truth item appears among the top N retrieved results): image→text and text→image retrieval for N = [1, 5, 10]. Note that XTD scores report Recall@10 for text→image retrieval.
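The mean-recall definition above is simple to compute; here is a minimal sketch, assuming retrieval results are given as ranked lists of item indices per query (the function names are our own):

```python
def recall_at_n(ranked_lists, ground_truth, n):
    """Fraction of queries whose ground-truth item appears in the top-n results."""
    hits = sum(1 for ranked, gt in zip(ranked_lists, ground_truth)
               if gt in ranked[:n])
    return hits / len(ranked_lists)

def mean_recall(i2t_ranked, i2t_gt, t2i_ranked, t2i_gt):
    """Average of Recall@1/5/10 over image->text and text->image (six numbers)."""
    scores = [recall_at_n(ranked, gt, n)
              for ranked, gt in ((i2t_ranked, i2t_gt), (t2i_ranked, t2i_gt))
              for n in (1, 5, 10)]
    return sum(scores) / len(scores)
```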
Retrieval Analysis

We also analyzed zero-shot retrieval examples on the WIT dataset, comparing ALIGN and MURAL on English (en) and Hindi (hi). For under-resourced languages such as Hindi, MURAL shows better retrieval performance, reflecting a stronger grasp of the text's semantics.

[Figure: Comparison of the top-5 images retrieved by ALIGN and MURAL for a text→image retrieval task on the WIT dataset, for the Hindi text "एकतश्तरी परबिना मसाले या सब्ज़ी के रखी ह सादी स्पगॅत्ती" ("A bowl containing plain noodles without any spices or vegetables").]

Even for image→text retrieval in a well-resourced language like French, MURAL shows a better understanding of some words. For example, MURAL returns better results than ALIGN for a query about a "cadran solaire" (French for "sundial"), for which ALIGN retrieves no texts describing sundials (below).

[Figure: Comparison of the top-5 texts returned by ALIGN and MURAL in an image→text retrieval task for the same image of a sundial.]
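At inference time, the retrieval shown above reduces to ranking one modality's embeddings by similarity to the other's. A minimal sketch, assuming precomputed embeddings (the helper names are our own):

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def retrieve_top_k(query_emb, image_embs, k=5):
    """Return the indices of the k images most similar to the text query."""
    ranked = sorted(range(len(image_embs)),
                    key=lambda i: cosine(query_emb, image_embs[i]),
                    reverse=True)
    return ranked[:k]
```

In production systems this exhaustive scan would typically be replaced by an approximate nearest-neighbor index, but the ranking criterion is the same.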
Embedding Visualization

Previously, researchers have shown that visualizing model embeddings can reveal interesting connections among languages. For instance, representations learned by a neural machine translation (NMT) model have been shown to form clusters based on their membership in a language family. We performed a similar visualization for a subset of languages belonging to the Germanic, Romance, Slavic, Uralic, Finnic, Celtic, and Finno-Ugric language families (widely spoken in Europe and western Asia), comparing MURAL's text embeddings with those of LaBSE, a text-only encoder.

The plot of LaBSE's embeddings shows distinct language clusters shaped by family membership. For instance, the Romance languages (purple, below) fall in a different region than the Slavic languages (brown, below). This finding is consistent with prior work on the intermediate representations learned by NMT systems.

[Figure: Visualization of LaBSE text representations for 35 languages. Languages are color-coded by their genealogical association. Representative languages include: Germanic (red) - German, English, Dutch; Uralic (orange) - Finnish, Estonian; Slavic (brown) - Polish, Russian; Romance (purple) - Italian, Portuguese, Spanish; Celtic (blue) - Welsh, Irish.]

In contrast to LaBSE's visualization, MURAL's embeddings, shaped by its multimodal training, show some clusters that are consistent with areal linguistics (where languages or dialects within a geographic area share features) and contact linguistics (where languages or dialects influence one another).

Notably, in the MURAL embedding space, Romanian (ro) sits closer to the Slavic languages Bulgarian (bg) and Macedonian (mk) than it does for LaBSE, in line with the Balkan sprachbund. Another possible language-contact effect brings the Finnic languages, Estonian (et) and Finnish (fi), closer to the Slavic cluster. The fact that MURAL pivots on images as well as translations appears to add an additional view of language relatedness as learned in deep representations, beyond the language-family clusters observed in a text-only setting.

[Figure: Visualization of MURAL text representations for 35 languages, color-coded as in the figure above.]
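Observations like "Romanian sits closer to Bulgarian and Macedonian" can be checked quantitatively by comparing per-language mean embeddings with cosine similarity; the visual clusters are the 2-D shadow of exactly these distances. A toy sketch (the embeddings below are fabricated for illustration, not MURAL's actual vectors):

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nearest_languages(lang_embs, query, k=2):
    """Rank the other languages by cosine similarity to `query`'s embedding."""
    others = [lang for lang in lang_embs if lang != query]
    others.sort(key=lambda lang: cosine(lang_embs[query], lang_embs[lang]),
                reverse=True)
    return others[:k]
```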
Conclusion

Our results show that jointly training with translation pairs is an effective way to overcome the scarcity of image-text pairs in many under-resourced languages and to improve cross-modal performance. It is also interesting to observe hints of areal and contact linguistics in the text representations learned by a multimodal model, which invites further exploration of the different connections learned implicitly by multimodal models such as MURAL.

Finally, we hope this work motivates further research in the multimodal, multilingual space, where models learn representations of, and connections between, languages (expressed via images and text) beyond high-resource languages alone.

About the authors:

Aashi Jain is an AI Resident at Google.
Yinfei Yang is a Research Scientist at Google Research.

Original link: https://ai.googleblog.com/2021/11/mural-multimodal-multi-task-retrieval.html