從5大挑戰帶你瞭解多模態機器學習

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"​​摘要:多模態機器學習旨在從多種模態建立一種模型,能夠處理和關聯多種模態的信息。考慮到數據的異構性,MMML(Multimodal Machine Learning)領域帶來了許多獨特的挑戰,總體而言五種:表示、轉化、對齊、融合、協同學習。","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文分享自華爲雲社區","attrs":{}},{"type":"link","attrs":{"href":"https://bbs.huaweicloud.com/blogs/264134?utm_source=infoq&utm_medium=bbs-ex&utm_campaign=ei&utm_content=content","title":"","type":null},"content":[{"type":"text","text":"《多模態學習綜述》","attrs":{}}]},{"type":"text","text":",原文作者:Finetune小能手。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"MultimodalMachine Learning: A Survey and Taxonomy","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"前言","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一種模態指事物發生或體驗的方式","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於多模態研究的問題就是指包含多種模態","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態機器學習旨在從多種模態建立一種模型,能夠處理和關聯多種模態的信息","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"考慮到數據的異構性,MMML(Multimodal Machine Learning)領域帶來了許多獨特的挑戰,總體而言五種:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"表示","attrs":{}},{"type":"text","text":":最爲基礎的挑戰,學習利用多種模態的互補性和冗餘性,來表示和概括模態數據的方法。模態的異構性爲這種表示帶來挑戰。例如:語言通常是符號表示,而語音通常是信號表示。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"轉化","attrs":{}},{"type":"text","text":":如何轉換(映射)一種模態的數據到另一種模態。多模態不僅數據異構,而且模態間的關係通常是開放的或者說是主觀的。例如,有許多種正確的方法來描述一張圖片,其中可能並不存在最好的模態轉譯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"對齊","attrs":{}},{"type":"text","text":":模態對齊主要是識別多種模態的要素(子要素)間的直接聯繫。例如,將菜譜的每一步和做菜的視頻進行對應。解決這個問題需要衡量不同模態的間的相似性,而且要考慮可能的長距離依賴和歧義。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"融合","attrs":{}},{"type":"text","text":":將多種模態的信息進行連接,從而完成推理。例如,視聽語音識別,視覺描述的嘴脣的運動與音頻信號進行混合來完成所說的詞的推理。當可能至少一種模態數據丟失的時候,來自不同模態的信息在推理中有不同的預測能力和噪聲拓撲。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"協同學習","attrs":{}},{"type":"text","text":":在不同模態、表示、預測模型間進行知識遷移。協同訓練、conceptual grounding和zero-shot learning中有典型應用。當某一種模態資源有限時(標註數據很少)有很大意義。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"應用:應用很多,包括視聽語音識別(AVSR),多媒體數據索引和搜索,社交互動行爲理解,視頻描述等","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"多模態表示","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態表示需解決問題:異構數據如何結合,不同級別的噪聲如何處理,丟失數據如何處理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Bengio指出,好的特徵表示應該:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"平滑","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"時空一致","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"稀疏","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自然聚類","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Srivastava et.al.補充三點:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"表徵空間應該反映對應概念相似性","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"即便一些模態不存在,表徵應該易於獲得","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使填充丟失的模態成爲可能","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在之前的研究中(before 2019),大部分多模態表徵簡單的將單模態特徵進行連接","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"兩種多模態表徵方法:聯合表示(joint representation),協同表示(coordinatedrepresentation)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"聯合表示:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個模態爲x_i","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"xi","attrs":{}},{"type":"text","text":"​,x_m = f(x_1, \\dots, x_n)","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"xm","attrs":{}},{"type":"text","text":"​=","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"f","attrs":{}},{"type":"text","text":"(","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"x","attrs":{}},{"type":"text","text":"1​,…,","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"xn","attrs":{}},{"type":"text","text":"​)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b0/b0de9c4019356eabac6c2bb637ab5dd9.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"​聯合表示常用於訓練推理均爲多模態數據的任務中,最爲簡單的方法就是特徵拼接","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"深度學習方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"深度學習特徵後幾層天然的包含高層語義信息,常用最後或倒數第二層特徵","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於深度學習網絡需要大量標註數據,常會利用無監督方法,例如自編碼器,進行特徵表示預訓練","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"深度學習不能天然的解決數據丟失問題","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"概率圖模型:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"利用隱隨機變量構建特徵表示","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最常見的基於圖模型的特徵表示方法,利用深度玻爾茲曼機(DBM)、受限玻爾茲曼機(RBM)作爲模塊構建,類似深度學習,特徵分層,是無監督方法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"也有用深度信念網絡(DBN)表徵每個模態然後進行聯合表示的","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"利用多模態深度玻爾茲曼機學習多模態特徵表示,由於天然的生成特性,能夠輕鬆處理丟失數據問題,整個模態數據丟失也可以自然解決;還可以用某種模態來生成另一種種模態的樣本;DBM缺陷在於難訓練,計算代價高,需要變分近似訓練方法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"序列表徵:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當數據的長度是變長序列時,例如句子、視頻或者音頻流,使用序列表徵","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RNN,LSTM當前主要用於表示單模態序列,而RNN的某個時刻的hidden state,可以看做在這個時刻前的所有序列的特徵的整合","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AVSR中Cosi等人使用RNN來表示多模態特徵","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"協同表示:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個模態爲x_i","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"xi","attrs":{}},{"type":"text","text":"​,f(x_1) \\sim g(x_2)","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"f","attrs":{}},{"type":"text","text":"(","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"x","attrs":{}},{"type":"text","text":"1​)∼","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"g","attrs":{}},{"type":"text","text":"(","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"x","attrs":{}},{"type":"text","text":"2​),每個模態有對應的映射函數,將它映射到多模態空間中,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"每個模態的投影過程是獨立的","attrs":{}},{"type":"text","text":",但是最終的多模態空間是通過某種限制來協同表示的","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"兩種協同表示方式:相似度模型,結構化模型,前者保證特徵表示的相似性,後者加強在特徵結果空間中的結構化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相似度模型:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相似度模型最小化不同模態在協同表示空間中的距離,例如狗和狗的圖像的距離,小於狗和車的圖像的距離","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"深度神經網絡在協同表示中的優勢在於能夠以端到端的方式進行協同表示的聯合學習","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"結構化協同空間模型:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"結構化協同表示模型加強了不同模態表示的附加限制,具體的結構化限制根據應用而異","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"結構化協同表示空間常用在跨模態哈希中,將高維的數據壓縮到緊湊的二進制表示使得相似的object有相似的編碼,常用於跨模態檢索中","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"哈希的方法迫使最終多模態空間表示有如下限制:1) N維的漢明空間,可控位數的二進制表示;2) 不同模態的相同object有着相似的哈希編碼;3) 多模態空間必須保持數據相似性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一種結構化協同表示的方法來源於圖像和語言的“順序嵌入”。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如,Vendrov et al. 在多模態空間中強化了一種不相似度量,它是非對稱的偏序關係。主要思想是在語言和圖像的表示中抓住了一種偏序關係,強制了一種層級結構。對於一張圖像,這種偏序關係爲“a woman walking her dog” > “woman walking her dog” > “woman walking”。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一種特殊的結構化協同空間是基於典型相關分析(CCA)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CCA利用線性投影最大化兩個隨機變量相關性,增強了新空間的正交性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CCA模型多用於跨模態搜索,和語音視覺信號分析。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"利用核方法,CCA可以擴展爲KCCA,這種非參數的方法隨着訓練數據規模的增長可擴展性較差。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"深度典型相關分析DCCA作爲KCCA的替代品被提出,解決了可擴展性問題,可以得到更好的相關表示空間。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"深度相關性RBM也可以作爲跨模態搜索的方法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KCCA,CCA,DCCA都是非監督的方法,僅能優化特徵表示的相關性,能夠獲取到跨模態的共享特徵。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其它的方法例如,深度典型相關自編碼器,語義相關性最大化方法也用於結構化協同空間表示中","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"小結:聯合和協同表示方法是多模態特徵表示的兩種主要方法。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"聯合特徵表示方法將多模態數據投影到一個共同的特徵表示空間,最適用於推理時所有模態的數據都出現的場景。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"協同特徵表示方法將每個模態投影到分離但相關的空間,這種方法適用於推理時僅有一種模態出現的情況。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"聯合表示方法已經用在構建多於兩種模態表示的場景,而協同空間表示常限定爲兩種模態。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"多模態轉化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從一種模態轉化爲另一種模態是很多多模態機器學習關注的內容","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態轉化的任務是給定一個模態中的一個實體,生成另一種模態中的相同實體","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如,給定一張圖像,我們可以生成一句話來描述這張圖像,或者,給定一個文字描述,我們能夠生成與之匹配的圖像","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態轉化已經研究了很長時間,早期的語音合成、視聽語音生成,視頻描述,跨模態檢索","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"近來,NLP和CV領域的結合,以及大規模多模態數據都推動這方面發展。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"熱門應用:視覺場景描述(圖像、視頻描述),除了識別主體部分、理解視覺場景,還需要生成語法正確,理解精確的描述語句。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態轉化可以分爲兩類,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"基於實例的方法和生成式方法","attrs":{}},{"type":"text","text":",前者使用字典實現模態轉化,後者使用模型生成轉化結果","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b2/b248fb2b946df0a3d43b1a533f58a2d1.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"​考慮到生成式模型需要生成信號或者符號序列(句子),生成式模型方法挑戰更大。所以早期很多方法都傾向於基於實例的方法進行模態轉化。然而隨着深度學習的發展,生成式模型也具備了生成圖像、聲音、文本的能力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於實例的方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於實例的方法受限於訓練數據——字典(源模態、目標模態構成的實例對)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"兩種算法:基於檢索的方法,基於組合的方法,前者直接使用搜索到的轉化結果,不會修改它們,後者依賴於更復雜的規則,基於大量搜索到的實例建立模態轉化結果","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於檢索的方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於檢索的方法是多模態轉化最簡單的方法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"它依賴字典中搜索到的最近的樣本,利用它作爲轉化的結果","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"檢索在單模態空間中完成,也可以在中間語義空間中完成","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"給定一個待轉化的源模態的實例,單模態檢索通過在字典中查找最近的源模態實例實現模態轉化,本質上就是通過KNN找到源模態到目標模態的映射。一些典型應用場景比如TTS,圖像描述等。這種方法的好處是僅需要單一模態的表示,就可以通過檢索實現。但也是由於採用搜索的方法,所以需要考慮搜索結果的重排序問題。這種方法的問題在於,在單模態空間中相似度高實例的並不一定就是好的模態轉化。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一種方法是利用中間語義空間來實現相似性比較。這種方法一般會搭配協同表示使用,應該是由於協同表示空間本身就對向量表示進行了相似性限制。在語義空間中進行模態檢索的方法比單模態檢索的方法效果更好,因爲它的搜索空間同時反映了兩種模態,更具有含義。同時,它支持雙向的轉化,這在單模態檢索中不是很直接。然而,中間語義空間檢索的方法需要學習一個語義空間,這需要大量的訓練字典(源模態、目標模態樣本對)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於組合的方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過將檢索結果進行有意義的組合來得到更好的模態轉化結果","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於組合的媒體描述(mediadescription)主要是基於圖像的描述語句都有着相同的簡單結構這一特點","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通常組合的規則都是人工指定的或者啓發式生成的","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於實例的方法面臨的最大問題在於它的模型就是整個字典,模型會隨着數據集的增加而不斷增大,而且推理會變慢;另一個問題就是除非整個字典非常大,否則不能覆蓋所有可能的源模態查詢。這個問題可以通過多種模型組合解決。基於實例的方法進行多模態轉化是單方向的,基於語義空間的方法可以在源模態和目標模態間雙向轉化。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"生成式方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"生成式方法在多模態轉化中構建的模型能夠對給定單一模態實例進行多模態轉化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"挑戰在於需要理解源模態來生成目標序列、信號","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可能正確的轉化結果非常多,因此這類方法較難評估","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"三種生成式方法:基於語法,編碼器-解碼器,連續生成模型,第一種方法利用語法來限定目標域,例如生成基於這種模板限定的句子;編碼器解碼器模型先將原模態編碼到一個隱空間表示,然後解碼器生成目標模態;第三種方法基於源模態的一個流式輸入連續生成目標模態,特別適用於時序句子翻譯如TTS。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於語法規則的模型:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"依賴於爲了生成特定模式而預先定義的語法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種方法先從源模態中檢測高層含義,例如圖像中的實體、視頻中的行爲;然後將這些檢測結果送入一個機遇預定義語法的生成過程來得到目標模態。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一些基於語法的方法依賴於圖模型生成目標模式","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於語法的方法有事在於更傾向於生成語句結構上或者邏輯上正確的實例,因爲他們是基於預先定義模板的、限定的語法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"缺點在於生成語法化的結果而不是創新式的轉化,沒有生成新的內容;而且基於語法的方法依賴於複雜的概念,這些概念的detection的pipeline很複雜,每個概念的提取可能需要單獨的模型和獨立的訓練集","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"編碼器解碼器模型:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於端到端神經網絡訓練,是最近最流行的多模態轉化技術","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"核心思想是受限將源模態編碼一種向量表示,然後利用解碼器模塊生成目標模態","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"起初用於機器翻譯,當前已經成功用於圖片解說,視頻描述;當前主要用於生成文本,也可以用於生成圖像和連續的語音、聲音","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"編碼","attrs":{}},{"type":"text","text":":首先將源實例進行特定模態編碼。對聲音信號比較流行的編碼方法是RNN和DBN;對詞、句子編碼常用distributionalsemantics和RNN的變種;對於圖像用CNN;視頻編碼仍然常用人工特徵。也可以使用單一的模態表示方法,例如利用協同表示,能夠得到更好的結果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"解碼","attrs":{}},{"type":"text","text":":通常利用RNN或者LSTM,將編碼後的特徵表示作爲初始隱藏狀態。Venugopalan et al.驗證了利用預訓練的LSTM解碼器用於圖像解說對於視頻描述任務是有益的。利用RNN面臨的問題在於模型需要從單一的圖像、句子或者視頻向量表示來生成一種描述。當需要生成長序列時,模型會忘記初始輸入。這個問題可以通過注意力機制解決,讓網絡在生成過程中更關注與圖像、句子、視頻的部分內容。基於注意力的生成式RNN也被用於從句子生成圖像的任務,不真實但是有潛質。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於編碼器解碼器的網絡雖然成功但是仍面臨很多問題。Devlin et al.指出網絡可能記住了訓練數據,而不是學習到了如何理解和生成視覺場景。他觀察到kNN模型生成的結果和編解碼網絡的生成結果非常相似。編解碼模型需要的訓練數據規模非常大。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"連續生成模型:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"連續生成模型用於序列翻譯和在線的方式在每個時間戳生成輸出","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當sequence到sequence轉化時,這種方法很有效,例如文本轉語音,語音轉文本,視頻轉文本","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"許多其它的方法也被提出用於這種建模:圖模型,連續編解碼方法,各種其它的迴歸分類方法。這些模型需要額外解決的問題是模態間的時序一致性問題","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"近來,Encoder-Decoder模型常用於序列轉化建模。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"小結和討論:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態轉化所面臨的一大挑戰是很難進行評估,有些任務(例如語音識別)有一個正確的translation,而像語音合成和媒體描述則沒有。有時就像在語言翻譯場景中一樣,多種答案都是正確的,哪種翻譯更好通常非常主觀。當前,大量近似自動化評價的標準也在輔助模態轉化結果評估。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"人的評價標準是最理想的。一些自動化評價指標例如在媒體描述中常用的:BLEU、ROUGE、Meteor、CIDEr也被提出,但是褒貶不一。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解決評估問題非常重要,不但能夠用於比較不同的方法,而且能夠提供更好的優化目標。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"多模態對齊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態對齊是指找到兩種或多種模態的instances中sub-components之間的對應關係","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如:給定一張圖片和一個描述,找到詞或者短語對應圖片中的區域;另一個例子是給定一個電影,將它和字幕或者書中的章節對齊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態對齊分成兩類:隱式對齊和顯示對齊,顯示對齊顯示的關注模態間sub-components的對應關係,例如將視頻和菜譜中對應的步驟對齊;隱式對齊常作爲其它任務的一個環節,例如基於文本的圖像搜索中,將關鍵詞和圖片的區域進行對齊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"顯示對齊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sub-components間的相似性衡量是顯示對齊的基礎,兩類算法無監督方法和(弱)監督方法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"無監督方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"無監督方法不需要模態間對齊的標註","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dynamic time warping衡量兩個序列的相似性,找到一個optimal的match,是一種dynamic programming的方法。由於DTW需要預定義的相似性度量,可以利用CCA(典型相關性分析)將模態映射到一個協同表達空間。","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"DTW和CCA都是線性變換,不能找到模態間的非線性關係","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖模型也可以用於無監督多模態序列的對齊。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DTW和圖模型的方法用於多模態對齊需要遵循一些限制條件,例如時序一致性、時間上沒有很大的跳躍、單調性。DTW能夠同時學習相似性度量和模態對齊,圖模型方法在建模過程中需要專家知識。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(弱)監督方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"監督方法需要標註好的模態對齊實例,用於訓練模態對齊中的相似性度量","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"許多監督式序列對齊方法收到非監督方法的啓發","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當前深度學習方法用於模態對齊更加常見","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隱式對齊:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"常用作其它任務的中間步驟,使得例如語音識別、機器翻譯、多媒體描述和視覺問答達到更好的性能","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"早期工作基於圖模型,當前更多基於神經網絡","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖模型:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"需要人工構建模態間的映射關係","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"神經網絡:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"模態轉換如果能夠使用模態對齊,任務的性能可以得到提升","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"單純的使用encoder只能通過調整權重來總結整張圖片、句子、視頻,作爲單一的向量表示;注意力機制的引入,使得decoder能夠關注到sub-components。注意力機制會讓decoder更多的關注sub-components","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"注意力機制可以認爲是深度學習模態對齊的一種慣用方法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"小結:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"模態對齊面臨着許多困難:少有顯示標註模態對齊的數據集;很難設計模態間的相似性度量;存在多種可能的模態對齊,而且一個模態中的elements可能在另一個模態中沒有對應","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"多模態融合","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態融合就是整合多種模態的信息進行分類或者回歸任務,多模態融合研究可以追溯到25年前","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態融合帶來的好處有:1)同一個現象的不同模態表示能夠產生更robust的推理結果;2)從多種規模中能夠得到輔助的信息,這些信息在單一模態中是不可見的;3)對於一個多模態系統而言,模態融合能夠在某一種模態消失時仍正常運行","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當前多模態表示和融合的界限愈發模糊,因爲在深度學習中,表示學習和分類/迴歸任務交織在一起","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"兩種多模態混合方法:模型無關和基於模型的方法","attrs":{}},{"type":"text","text":",前者不直接依賴於一種特定的機器學習方法,後者顯示的在構建過程中進行融合(核方法、圖模型、神經網絡)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"模型無關方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"模型無關的方法,三種:前期融合、後期融合和混合融合","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前期融合是特徵級別的融合,後期融合是推理結果的融合,混合融合同時包括兩種融合方法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"模型無關的融合方法好處是:","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"可以兼容任何一種分類器或者回歸器","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前期融合可以看做是多模態表示的一種前期嘗試","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"後期融合利用單一模態的預測結果,通過投票機制、加權、signal variance或者一個模型進行融合。後期融合忽略了模態底層特徵之間的關係","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於模型的方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多核學習(MKL)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"kernel SVM的擴展,對於不同模態使用不同的kernel","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MKL方法是深度學習之前最常用的方法,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"優勢在於loss function是凸的,模型訓練可以使用標準的優化package和全局優化方法,劣勢在於測試時對於數據集的依賴推理速度慢","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖模型:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本篇綜述中僅考慮淺層的圖模型,深度圖模型例如DBN可以參考前面章節內容","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大多數圖模型可以分成兩類:生成式(聯合概率)和判別式(條件概率)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"圖模型能夠很容易的發掘數據中的空間和時序結構,同時可以將專家知識嵌入到模型中,模型也可解釋","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"神經網絡:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"神經網絡用於模態融合所使用的模態、優化方法可能不同,通過joint hidden layers進行信息融合的思路是一致的","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"神經網絡也用於時序多模態融合,通常採用RNN和LSTM,典型的應用是audio-visual情感分類,圖片解說","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"深度神經網絡用於模態融合優點:1)可以從大量數據學習;2)端到端學習多模態特徵表示和融合;3)和非深度學習方法相比性能好,能學習複雜的decision boundary","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"缺點:可解釋性差,不知道網絡根據什麼進行推理,也不知道每個模態起的作用;需要大量訓練數據才能得到好的效果","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"小結:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態融合任務中有如下挑戰:1)signal可能不是時序對齊的,例如密集的連續信號vs稀疏的事件;2)很難建立一個模型來發掘補充信息而非輔助信息;3)每個模態在不同時間點可能展現出不同類型、不同級別的噪聲","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"多模態共同學習","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態共同學習旨在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"通過發掘另一種模態的信息來幫助當前模態建模","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相關場景:一種模態的資源有限,缺乏標註數據或者輸入噪聲大,標籤可靠性低","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"三種co-learning的方法:並行,非並行,混合;第一種方法需要一種模態的observation和另一種模態的observation直接連接,例如在audio-visual speech數據及上,video和speech sample來自同一個speaker;非並行數據方法不需要兩種observation的直接連接,通常利用類別間的交集,例如在zero shot learning中利用Wiki的文本數據擴展傳統的視覺目標識別數據集提升目標識別的性能;混合數據的方法通過一種共享模態或者數據連接起來","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1f/1fd1b233279e5099ac5943888ba11c10.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"​","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"並行數據:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"模態之間共享一個實例集合,兩種方法:協同訓練和表徵學習","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"協同訓練:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當某一個模態的標記數據非常少時,可以利用協同訓練生成更多的標註訓練數據,或者利用模態間的不一致性過濾不可靠標註樣本","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"協同訓練方法能夠生成更多的標註數據,但也可能會導致overfitting","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"遷移學習:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態玻爾茲曼機或者多模態自編碼器將一種模態特徵表示轉化爲另一種,這樣不僅能得到多模態表徵,而且對於單模態而言推理過程中也能得到更好的性能","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"非並行數據:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不需要依賴模態間共享的實例,有共享的類別或者概念(concept)即可","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"遷移學習:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"遷移學習能夠從一個數據充分、乾淨的模態學習特徵表示遷移到另一個數據稀缺、噪聲大的模態,這種遷移學習常用多模態協同特徵表示實現","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Conceptual grounding:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過語言以及其他附加模態,例如視覺、聲音、甚至味覺,學習語義含義,單純的利用文本信息不能很好的學習到語義含義,例如人學習一個概念的時候利用的是所有的感知信息而非單純的符號","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"grounding通常通過尋找特徵表徵間的共同隱空間或者分別學習每個模態的特徵表示然後進行拼接","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"conceptual grounding和多模態特徵對齊之間有很高的重合部分,因爲視覺場景和對應描述對齊本身能夠帶來更好的文本或者視覺特徵表示","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"需要注意的是,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"grounding並不能在所有情況下帶來性能的提升,僅當grounding與具體任務相關時有效","attrs":{}},{"type":"text","text":",例如在視覺相關任務中利用圖像進行grounding","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Zero-shot learning:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ZSL任務是指在沒有顯示的見過任何sample的情況下識別一種概念,例如不提供任何貓的圖片對圖片中的貓進行分類","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"兩種方法:單模態方法和多模態方法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"單模態方法:關注待識別類別的組成部分和屬性,例如視覺方面通過顏色、大小、形狀等屬性去預測爲見過的類別","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態方法:利用另一個模態的信息,在另一個模態中該類別出現過","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"混合數據:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過共享的模態或者數據集連接兩種非數據並行的模態,典型的任務例如多種語言進行圖像描述,圖片會與至少一種語言之間建立聯繫,語言之間的可以利用機器翻譯任務建立起聯繫","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目標任務如果僅有少量標註數據,也可利用類似或相關任務去提升性能,例如利用大量文本語料指導圖像分割任務","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"小結:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態協同學習通過尋找模態之間的互補信息,使一種模態影響另一種模態的訓練過程","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多模態協同學習是與任務無關的,可以用於更好的多模態特徵融合、轉換和對齊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://bbs.huaweicloud.com/blogs/264134?utm_source=infoq&utm_medium=bbs-ex&utm_campaign=ei&utm_content=content","title":"","type":null},"content":[{"type":"text","text":"點擊關注,第一時間瞭解華爲雲新鮮技術~","attrs":{}}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章