Technical Deep Dive | How Alibaba Cloud's Multimedia AI Team Took 5 Championships and 1 Runner-up at CVPR 2021

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ce/ce2152f22e3c4019533f8f705ce87069.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"6 月 19-25 日,備受全球矚目的國際頂級視覺會議 ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"CVPR2021","attrs":{}},{"type":"text","text":"(Computer Vision and Pattern Recognition,即國際機器視覺與模式識別)在線上舉行,但依然人氣爆棚,參會者的激情正如夏日般火熱。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"今年阿里雲多媒體 AI 團隊(由阿里雲視頻雲和達摩院視覺團隊組成,以下簡稱 MMAI)參加了大規模人體行爲理解公開挑戰賽 ActivityNet、當前最大時空動作定位挑戰賽 AVA-Kinetics、超大規模時序行爲檢測挑戰賽 HACS 和第一視角人體行爲理解挑戰賽 EPIC-Kitchens 上的總共 ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"6 個賽道,一舉拿下了 5 項冠軍和 1 項亞軍","attrs":{}},{"type":"text","text":",其中在 ActivityNet 和 HACS 兩個賽道上連續兩年蟬聯冠軍!","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"頂級挑戰賽戰績顯赫","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"大規模時序動作檢測挑戰賽 ActivityNet","attrs":{}},{"type":"text","text":" 於 2016 年開始,由 KAUST、Google、DeepMind 等主辦,至今已經成功舉辦六屆。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該挑戰賽主要解決時序行爲檢測問題,以驗證 AI 算法對長時視頻的理解能力,是該領域最具影響力的挑戰賽之一。歷屆參賽者來自許多國內外知名機構,包括微軟、百度、上交、華爲、商湯、北大、哥大等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"今年阿里雲 MMAI 團隊最終以 Avg. 
mAP 44.67% 的成績獲得該項挑戰賽的冠軍!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e0/e09e76c6ab16a80606d30c706c779078.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖 1 ActivityNet 挑戰賽證書","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"時空動作定位挑戰賽 AVA-Kinetics","attrs":{}},{"type":"text","text":" 由 2018 年開始,至今已成功舉辦四屆,由 Google、DeepMind 和 Berkeley 舉辦,旨在時空兩個維度識別視頻中發生的原子級別行爲。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因其難度與實用性,歷年來吸引了衆多國際頂尖高校與研究機構參與,如 DeepMind、FAIR、SenseTime-CUHK、清華大學等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"今年阿里雲 MMAI 團隊以 40.67% mAP 擊敗對手,獲得第一!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9c/9cf08e20370421ecee170607ba9f654f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖 2 AVA-Kinetics 挑戰賽獲獎證書","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"超大規模行爲檢測挑戰賽 HACS","attrs":{}},{"type":"text","text":" 始於 2019 年,由 MIT 主辦,是當前時序行爲檢測任務中的最大挑戰賽。該項挑戰賽包括兩個賽道:全監督行爲檢測和弱監督行爲檢測。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於數據量是 ActivityNet 的兩倍以上,因此具有很大的挑戰性。歷屆參賽隊伍包括微軟、三星、百度、上交、商湯、西交等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"今年阿里雲 MMAI 團隊同時參加兩個賽道,並分別以 Avg. 
This year the Alibaba Cloud MMAI team entered both tracks and won both, with average mAPs of 44.67% and 22.45% respectively.

Figure 3: Award certificates for the two HACS tracks

The egocentric action understanding challenge EPIC-Kitchens started in 2019 and has run for three editions. Organized by the University of Bristol, it addresses the understanding of interactions between human actions and target objects from a first-person viewpoint.

Past participants include Baidu, FAIR, NTU, NUS, Inria-Facebook and Samsung (SAIC-Cambridge).

This year the Alibaba Cloud MMAI team entered the temporal action detection and action recognition tracks, finishing first with an average mAP of 16.11% and second with an accuracy of 48.5% respectively.

Figure 4: EPIC-Kitchens challenge award certificates

Key technical explorations behind the four challenges

Action understanding challenges face four main difficulties:

First, action durations vary widely, from 0.5 seconds to 400 seconds. Taking a 200-second test video as an example, with 15 frames sampled per second the algorithm must localize actions precisely among 3,000 frames.

Second, video backgrounds are complex: many irregular, non-target activities are embedded in the video, which greatly increases the difficulty of action detection.

Third, intra-class variance is large: the visual appearance of the same action can change noticeably with the person, the viewpoint and the environment.

Finally, detecting human actions also has to cope with mutual occlusion between people, insufficient video resolution, and varied lighting, viewpoints and other disturbances.

The team's strong results in these challenges rest on its technical framework EMC2, which explores the following core techniques.

(1) Strengthened optimization and training of the backbone networks

The backbone network is one of the core elements of action understanding.

In these challenges, the Alibaba Cloud MMAI team focused on two directions: an in-depth study of the Video Transformer (ViViT), and the complementarity of heterogeneous Transformer and CNN models.

As the main backbone, ViViT is trained in the usual two stages of pre-training and fine-tuning. During fine-tuning, the MMAI team thoroughly analyzed the impact of variables such as input size and data augmentation to find the best configuration for each task.

In addition, to exploit the complementarity between Transformer and CNN architectures, the team also used SlowFast, CSN and other backbones. Through ensemble learning, the final models reached classification performance of 48.5%, 93.6% and 96.1% on EPIC-Kitchens, ActivityNet and HACS respectively, a clear improvement over last year's winning results.

Figure 5: ViViT architecture and its performance
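As a rough illustration of the ensemble step, the sketch below fuses the class probabilities of several heterogeneous video backbones at the score level. The model handles (vivit, slowfast, csn) and the fusion weights are assumptions made for clarity, not the team's exact configuration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_scores(clip, models, weights):
    """Fuse class probabilities from several video backbones at the score level.

    clip:    video tensor of shape (B, C, T, H, W)
    models:  list of nn.Module classifiers, each returning logits (B, num_classes)
    weights: per-model fusion weights, e.g. tuned on a validation split
    """
    fused = 0.0
    for model, weight in zip(models, weights):
        model.eval()
        fused = fused + weight * F.softmax(model(clip), dim=-1)
    return fused / sum(weights)

# Hypothetical usage, assuming vivit, slowfast and csn are pretrained classifiers:
# scores = ensemble_scores(clip, [vivit, slowfast, csn], weights=[0.5, 0.3, 0.2])
```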
(2) Spatio-temporal relation modeling of entities in video understanding

For spatio-temporal action detection, learning the person-person, person-object and person-scene relations in a video through relation modeling is essential for correct action recognition, especially for interactive actions.

In these challenges, the Alibaba Cloud MMAI team therefore focused on modeling and analyzing these relations.

Concretely, the pipeline first localizes the persons and objects in the video and extracts feature representations for each. To model different types of action relations at a finer granularity, these features are combined with global video features in the spatio-temporal domain to enrich the representation, and Transformer-based relation learning modules are applied across different temporal or spatial positions; sharing the module weights across positions makes the relation learning invariant to the location of the related regions.

To further model long-range temporal dependencies, the team built a two-stage temporal feature bank maintained both online and offline, fusing feature information from before and after each video clip into the relation learning.

Finally, the relation-enhanced person features are used for action recognition, and a decoupled learning scheme enables effective learning of hard and rare categories under the long-tailed distribution of action classes.

Figure 6: Relation modeling network
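The minimal sketch below shows what such a Transformer-based relation module could look like: detected actor features, object features and pooled global context are concatenated into one token sequence and passed through a shared encoder, and only the refined actor tokens are kept for classification. The dimensions and layer counts are illustrative assumptions, not the team's exact architecture.

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """Toy relation learner over actors, objects and global context."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # The same encoder weights are shared over all spatio-temporal positions,
        # which keeps the relation learning position-invariant.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, actor_feats, object_feats, context_feats):
        # actor_feats:   (B, Na, dim)  features of detected persons
        # object_feats:  (B, No, dim)  features of detected objects
        # context_feats: (B, Nc, dim)  pooled global video features
        tokens = torch.cat([actor_feats, object_feats, context_feats], dim=1)
        tokens = self.encoder(tokens)
        # Only the relation-enhanced actor tokens feed the action classifier.
        return tokens[:, :actor_feats.size(1)]
```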
(3) Long-video understanding based on relation encoding between action proposals

For many action understanding tasks, the long duration of videos under limited compute is one of the main challenges, and temporal relation learning is an important means of tackling long-video understanding.

In EMC2, a module based on relation encoding between action proposals was designed to improve the algorithm's long-range perception.

Concretely, a base action detection network first generates dense action proposals, where each proposal can be roughly viewed as a time interval in which a specific action instance occurs.

A self-attention mechanism then encodes temporal relations among these proposals along the time axis, so that every proposal becomes aware of global information and can predict more accurate action boundaries. With this technique, EMC2 took first place in temporal action detection challenges such as ActivityNet.

Figure 7: Relation encoding between action proposals
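A minimal sketch of such proposal-level relation encoding is shown below: one self-attention layer lets every proposal feature attend to all others before a refined confidence score is predicted. The layer sizes and the single score head are assumptions made for brevity; a full detector would also regress refined boundaries.

```python
import torch
import torch.nn as nn

class ProposalRelationEncoder(nn.Module):
    """Toy self-attention over dense action proposals."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.score_head = nn.Linear(dim, 1)  # refined confidence per proposal

    def forward(self, proposal_feats):
        # proposal_feats: (B, N, dim), one feature per candidate segment
        attended, _ = self.attn(proposal_feats, proposal_feats, proposal_feats)
        feats = self.norm(proposal_feats + attended)  # residual connection
        return torch.sigmoid(self.score_head(feats)).squeeze(-1)  # (B, N)
```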
(4) Self-supervised network initialization

Initialization is an important part of training deep networks and one of the main components of EMC2.

The Alibaba Cloud MMAI team designed a self-supervised initialization method called MoSI, which trains video models from static images.

MoSI has two main components: pseudo-motion generation and static mask design.

First, pseudo video clips are generated with a sliding window that moves across a static image in a specified direction and at a specified speed. A carefully designed mask then keeps motion patterns only in a local region, so the network learns to perceive local motion. During training, the model's objective is to predict the speed and direction of the input pseudo video.

Trained this way, the model acquires the ability to perceive motion in video. In the challenges, given the rule that no extra data may be used, MoSI training on the limited challenge video frames alone already brought a clear performance gain and ensured the quality of model training across the competitions.

Figure 8: MoSI training process and its semantic analysis
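To make the pseudo-motion idea concrete, the sketch below builds one pseudo clip from a static image by sliding a crop window in a sampled direction at a sampled speed, and returns the clip together with its (direction, speed) label. The crop size, speed set and label format are illustrative assumptions, and the static mask design described above is omitted for brevity.

```python
import random
import torch

def pseudo_motion_clip(image, num_frames=8, crop=112, speeds=(1, 2)):
    """image: tensor (C, H, W), assumed larger than the crop.
    Returns a pseudo clip (C, T, crop, crop) and its motion label."""
    C, H, W = image.shape
    dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])  # motion direction
    speed = random.choice(speeds)
    # Per-frame displacement, scaled so even the fastest clip stays inside the image.
    unit = (min(H, W) - crop) // ((num_frames - 1) * max(speeds))
    step = speed * unit
    # Start at the far side for negative directions so every crop is in bounds.
    x0 = 0 if dx >= 0 else (num_frames - 1) * step
    y0 = 0 if dy >= 0 else (num_frames - 1) * step
    frames = []
    for t in range(num_frames):
        x = x0 + dx * t * step
        y = y0 + dy * t * step
        frames.append(image[:, y:y + crop, x:x + crop])
    # The self-supervised objective is to predict this label from the clip.
    return torch.stack(frames, dim=1), {"direction": (dx, dy), "speed": speed}
```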
)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該產品實現視頻搜索、審覈、結構化和生產等核心功能,日處理視頻數據數百萬小時,爲客戶在視頻搜索、視頻推薦、視頻審覈、版權保護、視頻編目、視頻交互、視頻輔助生產等應用場景中提供了核心能力,極大提高了客戶的工作效率和流量效率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0f/0fb16408658c08ca5ed385c249c73f25.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖 9 多媒體 AI 產品","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前,多媒體 AI 雲產品在傳媒行業、泛娛樂行業、短視頻行業、體育行業以及電商行業均有落地:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"傳媒行業","attrs":{}},{"type":"text","text":",主要支撐央視、人民日報等傳媒行業頭部客戶的業務生產流程,極大提升生產效率,降低人工成本,例如在新聞生成場景中提升了 70% 的編目效率和 50% 的搜索效率;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"泛娛樂行業以及短視頻行業","attrs":{}},{"type":"text","text":",主要支撐集團內業務方優酷、微博、趣頭條等泛娛樂視頻行業下視頻結構化、圖像 / 視頻審覈、視頻指紋搜索、版權溯源、視頻去重、封面圖生成、集錦生成等場景,幫助保護視頻版權、提高流量分發效率,日均調用數億次;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"體育行業","attrs":{}},{"type":"text","text":",支撐第 21 屆世界盃足球賽,打通了視覺、運動、音頻、語音等多模態信息,實現足球賽事直播流跨模態分析,相比傳統剪輯效率提升一個數量級;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4)在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"電商行業","attrs":{}},{"type":"text","text":",支撐淘寶、閒魚等業務方,支持新發視頻的結構化,視頻 / 圖像審覈,輔助客戶快速生成短視頻,提升分發效率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d8/d863efb6ab865d95fab0fad52bed5a97.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖 10 多媒體 AI 
Figure 11: Multimedia AI label recognition for the media and e-commerce industries

Backed by EMC2, the Retina Video Cloud Multimedia AI Experience Center has the following strengths:

1) Multimodal learning: it uses massive amounts of video, audio and text data for cross-media understanding, fusing knowledge from different domains into its understanding and production pipeline.

2) Lightweight customization: users can register the entities they need to recognize; new entity labels are plug-and-play for the algorithms, and with a small amount of data new categories can approach the accuracy of existing ones.

3) High performance: self-developed high-performance audio/video codecs, a deep learning inference engine and GPU pre-processing libraries are optimized for the IO- and compute-intensive characteristics of video workloads, achieving close to 10x performance gains in various scenarios.

4) Strong generality: the Multimedia AI cloud product has production deployments across the media, entertainment, short-video, sports and e-commerce industries.

"Video makes content much easier to understand, accept and spread. Over the past few years we have seen industries and scenarios of all kinds accelerating the shift of their content to video, and society's demand for video production keeps growing. How to produce videos that meet users' needs efficiently and with high quality has become the core problem. It involves many detailed issues, such as spotting trending topics, understanding the content of large volumes of video material, multimodal retrieval, and building templates based on user profiles and scenarios, all of which depend heavily on advances in visual AI. The MMAI team keeps improving its visual AI technology for specific industries and scenarios, and builds production-grade Multimedia AI cloud products on top of it, so that video can be produced efficiently and with high quality, effectively advancing content videoization across industries and scenarios," commented Bi Xuan, head of Alibaba Cloud Video Cloud.

At CVPR 2021, MMAI beat many strong domestic and international competitors across multiple academic challenges and took home several championships, a strong validation of its technology. Its Multimedia AI cloud product already serves leading customers in several industries and will continue to create value across them.
Try the Multimedia AI cloud product experience center: http://retina.aliyun.com

Open-source code: https://github.com/alibaba-mmai-research/pytorch-video-understanding

References:

[1] Huang Z, Zhang S, Jiang J, et al. Self-supervised motion learning from static images. CVPR 2021: 1276-1285.
[2] Arnab A, Dehghani M, Heigold G, et al. ViViT: A video vision transformer. arXiv preprint arXiv:2103.15691, 2021.
[3] Feichtenhofer C, Fan H, Malik J, et al. SlowFast networks for video recognition. ICCV 2019: 6202-6211.
[4] Tran D, Wang H, Torresani L, et al. Video classification with channel-separated convolutional networks. ICCV 2019: 5552-5561.
[5] Lin T, Liu X, Li X, et al. BMN: Boundary-matching network for temporal action proposal generation. ICCV 2019: 3889-3898.
[6] Feng Y, Jiang J, Huang Z, et al. Relation modeling in spatio-temporal action localization. arXiv preprint arXiv:2106.08061, 2021.
[7] Qing Z, Huang Z, Wang X, et al. A stronger baseline for ego-centric action detection. arXiv preprint arXiv:2106.06942, 2021.
[8] Huang Z, Qing Z, Wang X, et al. Towards training stronger video vision transformers for EPIC-Kitchens-100 action recognition. arXiv preprint arXiv:2106.05058, 2021.
[9] Wang X, Qing Z, et al. Proposal relation network for temporal action detection. arXiv preprint arXiv:2106.11812, 2021.
[10] Wang X, Qing Z, et al. Weakly-supervised temporal action localization through local-global background modeling. arXiv preprint arXiv:2106.11811, 2021.
[11] Qing Z, Huang Z, Wang X, et al. Exploring stronger feature for temporal action localization.

"Video Cloud Technology" is an audio/video technology WeChat account worth following: every week it publishes hands-on technical articles from Alibaba Cloud's front line, where you can exchange ideas with first-class engineers in the audio/video field. Reply "技術" to the account to join the Alibaba Cloud Video Cloud technical discussion group, discuss audio/video technology with the authors, and get the latest industry news.