Technical Deep Dive | How Alibaba Cloud's Multimedia AI Team Took 5 Championships and 1 Runner-up at CVPR 2021

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ce/ce2152f22e3c4019533f8f705ce87069.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"6 月 19-25 日,備受全球矚目的國際頂級視覺會議 ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"CVPR2021","attrs":{}},{"type":"text","text":"(Computer Vision and Pattern Recognition,即國際機器視覺與模式識別)在線上舉行,但依然人氣爆棚,參會者的激情正如夏日般火熱。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"今年阿里雲多媒體 AI 團隊(由阿里雲視頻雲和達摩院視覺團隊組成,以下簡稱 MMAI)參加了大規模人體行爲理解公開挑戰賽 ActivityNet、當前最大時空動作定位挑戰賽 AVA-Kinetics、超大規模時序行爲檢測挑戰賽 HACS 和第一視角人體行爲理解挑戰賽 EPIC-Kitchens 上的總共 ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"6 個賽道,一舉拿下了 5 項冠軍和 1 項亞軍","attrs":{}},{"type":"text","text":",其中在 ActivityNet 和 HACS 兩個賽道上連續兩年蟬聯冠軍!","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"頂級挑戰賽戰績顯赫","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"大規模時序動作檢測挑戰賽 ActivityNet","attrs":{}},{"type":"text","text":" 於 2016 年開始,由 KAUST、Google、DeepMind 等主辦,至今已經成功舉辦六屆。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該挑戰賽主要解決時序行爲檢測問題,以驗證 AI 算法對長時視頻的理解能力,是該領域最具影響力的挑戰賽之一。歷屆參賽者來自許多國內外知名機構,包括微軟、百度、上交、華爲、商湯、北大、哥大等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"今年阿里雲 MMAI 團隊最終以 Avg. 
mAP 44.67% 的成績獲得該項挑戰賽的冠軍!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e0/e09e76c6ab16a80606d30c706c779078.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖 1 ActivityNet 挑戰賽證書","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"時空動作定位挑戰賽 AVA-Kinetics","attrs":{}},{"type":"text","text":" 由 2018 年開始,至今已成功舉辦四屆,由 Google、DeepMind 和 Berkeley 舉辦,旨在時空兩個維度識別視頻中發生的原子級別行爲。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因其難度與實用性,歷年來吸引了衆多國際頂尖高校與研究機構參與,如 DeepMind、FAIR、SenseTime-CUHK、清華大學等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"今年阿里雲 MMAI 團隊以 40.67% mAP 擊敗對手,獲得第一!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9c/9cf08e20370421ecee170607ba9f654f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖 2 AVA-Kinetics 挑戰賽獲獎證書","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"超大規模行爲檢測挑戰賽 HACS","attrs":{}},{"type":"text","text":" 始於 2019 年,由 MIT 主辦,是當前時序行爲檢測任務中的最大挑戰賽。該項挑戰賽包括兩個賽道:全監督行爲檢測和弱監督行爲檢測。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於數據量是 ActivityNet 的兩倍以上,因此具有很大的挑戰性。歷屆參賽隊伍包括微軟、三星、百度、上交、商湯、西交等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"今年阿里雲 MMAI 團隊同時參加兩個賽道,並分別以 Avg. 
This year the Alibaba Cloud MMAI team entered both tracks and won both, with average mAPs of 44.67% and 22.45% respectively.

Figure 3: Award certificates for the two HACS tracks

The egocentric action understanding challenge EPIC-Kitchens started in 2019 and has run for three editions. Organized by the University of Bristol, it addresses the understanding of interactions between human actions and target objects from a first-person viewpoint.

Past participants include Baidu, FAIR, NTU, NUS, Inria-Facebook and Samsung (SAIC-Cambridge).

This year the Alibaba Cloud MMAI team entered the temporal action detection and action recognition tracks, finishing first with an average mAP of 16.11% and second with an accuracy of 48.5% respectively.

Figure 4: EPIC-Kitchens challenge award certificates

Key technical explorations behind the four challenges

Action understanding challenges face four main difficulties:

First, action durations vary widely, from 0.5 seconds to 400 seconds. Taking a 200-second test video as an example, with 15 frames sampled per second the algorithm must localize actions precisely among 3,000 frames.

Second, video backgrounds are complex: many irregular, non-target activities are embedded in the video, which greatly increases the difficulty of action detection.

Third, intra-class variance is large: the visual appearance of the same action can change noticeably with the person, the viewpoint and the environment.

Finally, detecting human actions also has to cope with mutual occlusion between people, insufficient video resolution, and varied lighting, viewpoints and other disturbances.

The team's strong results in these challenges rest on its technical framework EMC2, which explores the following core techniques.

(1) Strengthened optimization and training of the backbone networks

The backbone network is one of the core elements of action understanding.

In these challenges, the Alibaba Cloud MMAI team focused on two directions: an in-depth study of the Video Transformer (ViViT), and the complementarity of heterogeneous Transformer and CNN models.

As the main backbone, ViViT is trained in the usual two stages of pre-training and fine-tuning. During fine-tuning, the MMAI team thoroughly analyzed the impact of variables such as input size and data augmentation to find the best configuration for each task.

In addition, to exploit the complementarity between Transformer and CNN architectures, the team also used SlowFast, CSN and other backbones. Through ensemble learning, the final models reached classification performance of 48.5%, 93.6% and 96.1% on EPIC-Kitchens, ActivityNet and HACS respectively, a clear improvement over last year's winning results.

Figure 5: ViViT architecture and its performance
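As a rough illustration of the ensemble step, the sketch below fuses the class probabilities of several heterogeneous video backbones at the score level. The model handles (vivit, slowfast, csn) and the fusion weights are assumptions made for clarity, not the team's exact configuration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_scores(clip, models, weights):
    """Fuse class probabilities from several video backbones at the score level.

    clip:    video tensor of shape (B, C, T, H, W)
    models:  list of nn.Module classifiers, each returning logits (B, num_classes)
    weights: per-model fusion weights, e.g. tuned on a validation split
    """
    fused = 0.0
    for model, weight in zip(models, weights):
        model.eval()
        fused = fused + weight * F.softmax(model(clip), dim=-1)
    return fused / sum(weights)

# Hypothetical usage, assuming vivit, slowfast and csn are pretrained classifiers:
# scores = ensemble_scores(clip, [vivit, slowfast, csn], weights=[0.5, 0.3, 0.2])
```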
(2) Spatio-temporal relation modeling of entities in video understanding

For spatio-temporal action detection, learning the person-person, person-object and person-scene relations in a video through relation modeling is essential for correct action recognition, especially for interactive actions.

In these challenges, the Alibaba Cloud MMAI team therefore focused on modeling and analyzing these relations.

Concretely, the pipeline first localizes the persons and objects in the video and extracts feature representations for each. To model different types of action relations at a finer granularity, these features are combined with global video features in the spatio-temporal domain to enrich the representation, and Transformer-based relation learning modules are applied across different temporal or spatial positions; sharing the module weights across positions makes the relation learning invariant to the location of the related regions.

To further model long-range temporal dependencies, the team built a two-stage temporal feature bank maintained both online and offline, fusing feature information from before and after each video clip into the relation learning.

Finally, the relation-enhanced person features are used for action recognition, and a decoupled learning scheme enables effective learning of hard and rare categories under the long-tailed distribution of action classes.

Figure 6: Relation modeling network
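The minimal sketch below shows what such a Transformer-based relation module could look like: detected actor features, object features and pooled global context are concatenated into one token sequence and passed through a shared encoder, and only the refined actor tokens are kept for classification. The dimensions and layer counts are illustrative assumptions, not the team's exact architecture.

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """Toy relation learner over actors, objects and global context."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # The same encoder weights are shared over all spatio-temporal positions,
        # which keeps the relation learning position-invariant.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, actor_feats, object_feats, context_feats):
        # actor_feats:   (B, Na, dim)  features of detected persons
        # object_feats:  (B, No, dim)  features of detected objects
        # context_feats: (B, Nc, dim)  pooled global video features
        tokens = torch.cat([actor_feats, object_feats, context_feats], dim=1)
        tokens = self.encoder(tokens)
        # Only the relation-enhanced actor tokens feed the action classifier.
        return tokens[:, :actor_feats.size(1)]
```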
(3) Long-video understanding based on relation encoding between action proposals

For many action understanding tasks, the long duration of videos under limited compute is one of the main challenges, and temporal relation learning is an important means of tackling long-video understanding.

In EMC2, a module based on relation encoding between action proposals was designed to improve the algorithm's long-range perception.

Concretely, a base action detection network first generates dense action proposals, where each proposal can be roughly viewed as a time interval in which a specific action instance occurs.

A self-attention mechanism then encodes temporal relations among these proposals along the time axis, so that every proposal becomes aware of global information and can predict more accurate action boundaries. With this technique, EMC2 took first place in temporal action detection challenges such as ActivityNet.

Figure 7: Relation encoding between action proposals
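A minimal sketch of such proposal-level relation encoding is shown below: one self-attention layer lets every proposal feature attend to all others before a refined confidence score is predicted. The layer sizes and the single score head are assumptions made for brevity; a full detector would also regress refined boundaries.

```python
import torch
import torch.nn as nn

class ProposalRelationEncoder(nn.Module):
    """Toy self-attention over dense action proposals."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.score_head = nn.Linear(dim, 1)  # refined confidence per proposal

    def forward(self, proposal_feats):
        # proposal_feats: (B, N, dim), one feature per candidate segment
        attended, _ = self.attn(proposal_feats, proposal_feats, proposal_feats)
        feats = self.norm(proposal_feats + attended)  # residual connection
        return torch.sigmoid(self.score_head(feats)).squeeze(-1)  # (B, N)
```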
(4) Self-supervised network initialization

Initialization is an important part of training deep networks and one of the main components of EMC2.

The Alibaba Cloud MMAI team designed a self-supervised initialization method called MoSI, which trains video models from static images.

MoSI has two main components: pseudo-motion generation and static mask design.

First, pseudo video clips are generated with a sliding window that moves across a static image in a specified direction and at a specified speed. A carefully designed mask then keeps motion patterns only in a local region, so the network learns to perceive local motion. During training, the model's objective is to predict the speed and direction of the input pseudo video.

Trained this way, the model acquires the ability to perceive motion in video. In the challenges, given the rule that no extra data may be used, MoSI training on the limited challenge video frames alone already brought a clear performance gain and ensured the quality of model training across the competitions.

Figure 8: MoSI training process and its semantic analysis
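To make the pseudo-motion idea concrete, the sketch below builds one pseudo clip from a static image by sliding a crop window in a sampled direction at a sampled speed, and returns the clip together with its (direction, speed) label. The crop size, speed set and label format are illustrative assumptions, and the static mask design described above is omitted for brevity.

```python
import random
import torch

def pseudo_motion_clip(image, num_frames=8, crop=112, speeds=(1, 2)):
    """image: tensor (C, H, W), assumed larger than the crop.
    Returns a pseudo clip (C, T, crop, crop) and its motion label."""
    C, H, W = image.shape
    dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])  # motion direction
    speed = random.choice(speeds)
    # Per-frame displacement, scaled so even the fastest clip stays inside the image.
    unit = (min(H, W) - crop) // ((num_frames - 1) * max(speeds))
    step = speed * unit
    # Start at the far side for negative directions so every crop is in bounds.
    x0 = 0 if dx >= 0 else (num_frames - 1) * step
    y0 = 0 if dy >= 0 else (num_frames - 1) * step
    frames = []
    for t in range(num_frames):
        x = x0 + dx * t * step
        y = y0 + dy * t * step
        frames.append(image[:, y:y + crop, x:x + crop])
    # The self-supervised objective is to predict this label from the clip.
    return torch.stack(frames, dim=1), {"direction": (dx, dy), "speed": speed}
```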
)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該產品實現視頻搜索、審覈、結構化和生產等核心功能,日處理視頻數據數百萬小時,爲客戶在視頻搜索、視頻推薦、視頻審覈、版權保護、視頻編目、視頻交互、視頻輔助生產等應用場景中提供了核心能力,極大提高了客戶的工作效率和流量效率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0f/0fb16408658c08ca5ed385c249c73f25.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖 9 多媒體 AI 產品","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前,多媒體 AI 雲產品在傳媒行業、泛娛樂行業、短視頻行業、體育行業以及電商行業均有落地:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"傳媒行業","attrs":{}},{"type":"text","text":",主要支撐央視、人民日報等傳媒行業頭部客戶的業務生產流程,極大提升生產效率,降低人工成本,例如在新聞生成場景中提升了 70% 的編目效率和 50% 的搜索效率;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"泛娛樂行業以及短視頻行業","attrs":{}},{"type":"text","text":",主要支撐集團內業務方優酷、微博、趣頭條等泛娛樂視頻行業下視頻結構化、圖像 / 視頻審覈、視頻指紋搜索、版權溯源、視頻去重、封面圖生成、集錦生成等場景,幫助保護視頻版權、提高流量分發效率,日均調用數億次;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"體育行業","attrs":{}},{"type":"text","text":",支撐第 21 屆世界盃足球賽,打通了視覺、運動、音頻、語音等多模態信息,實現足球賽事直播流跨模態分析,相比傳統剪輯效率提升一個數量級;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4)在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"電商行業","attrs":{}},{"type":"text","text":",支撐淘寶、閒魚等業務方,支持新發視頻的結構化,視頻 / 圖像審覈,輔助客戶快速生成短視頻,提升分發效率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d8/d863efb6ab865d95fab0fad52bed5a97.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖 10 多媒體 AI 
Figure 11: Multimedia AI label recognition for the media and e-commerce industries

Backed by EMC2, the Retina Video Cloud Multimedia AI Experience Center has the following strengths:

1) Multimodal learning: it uses massive amounts of video, audio and text data for cross-media understanding, fusing knowledge from different domains into its understanding and production pipeline.

2) Lightweight customization: users can register the entities they need to recognize; new entity labels are plug-and-play for the algorithms, and with a small amount of data new categories can approach the accuracy of existing ones.

3) High performance: self-developed high-performance audio/video codecs, a deep learning inference engine and GPU pre-processing libraries are optimized for the IO- and compute-intensive characteristics of video workloads, achieving close to 10x performance gains in various scenarios.

4) Strong generality: the Multimedia AI cloud product has production deployments across the media, entertainment, short-video, sports and e-commerce industries.

"Video makes content much easier to understand, accept and spread. Over the past few years we have seen industries and scenarios of all kinds accelerating the shift of their content to video, and society's demand for video production keeps growing. How to produce videos that meet users' needs efficiently and with high quality has become the core problem. It involves many detailed issues, such as spotting trending topics, understanding the content of large volumes of video material, multimodal retrieval, and building templates based on user profiles and scenarios, all of which depend heavily on advances in visual AI. The MMAI team keeps improving its visual AI technology for specific industries and scenarios, and builds production-grade Multimedia AI cloud products on top of it, so that video can be produced efficiently and with high quality, effectively advancing content videoization across industries and scenarios," commented Bi Xuan, head of Alibaba Cloud Video Cloud.

At CVPR 2021, MMAI beat many strong domestic and international competitors across multiple academic challenges and took home several championships, a strong validation of its technology. Its Multimedia AI cloud product already serves leading customers in several industries and will continue to create value across them.
Try the Multimedia AI cloud product experience center: http://retina.aliyun.com

Open-source code: https://github.com/alibaba-mmai-research/pytorch-video-understanding

References:

[1] Huang Z, Zhang S, Jiang J, et al. Self-supervised motion learning from static images. CVPR 2021: 1276-1285.
[2] Arnab A, Dehghani M, Heigold G, et al. ViViT: A video vision transformer. arXiv preprint arXiv:2103.15691, 2021.
[3] Feichtenhofer C, Fan H, Malik J, et al. SlowFast networks for video recognition. ICCV 2019: 6202-6211.
[4] Tran D, Wang H, Torresani L, et al. Video classification with channel-separated convolutional networks. ICCV 2019: 5552-5561.
[5] Lin T, Liu X, Li X, et al. BMN: Boundary-matching network for temporal action proposal generation. ICCV 2019: 3889-3898.
[6] Feng Y, Jiang J, Huang Z, et al. Relation modeling in spatio-temporal action localization. arXiv preprint arXiv:2106.08061, 2021.
[7] Qing Z, Huang Z, Wang X, et al. A stronger baseline for ego-centric action detection. arXiv preprint arXiv:2106.06942, 2021.
[8] Huang Z, Qing Z, Wang X, et al. Towards training stronger video vision transformers for EPIC-Kitchens-100 action recognition. arXiv preprint arXiv:2106.05058, 2021.
[9] Wang X, Qing Z, et al. Proposal relation network for temporal action detection. arXiv preprint arXiv:2106.11812, 2021.
[10] Wang X, Qing Z, et al. Weakly-supervised temporal action localization through local-global background modeling. arXiv preprint arXiv:2106.11811, 2021.
[11] Qing Z, Huang Z, Wang X, et al. Exploring stronger feature for temporal action localization.

"Video Cloud Technology" is an audio/video technology WeChat account worth following: every week it publishes hands-on technical articles from Alibaba Cloud's front line, where you can exchange ideas with first-class engineers in the audio/video field. Reply "技術" to the account to join the Alibaba Cloud Video Cloud technical discussion group, discuss audio/video technology with the authors, and get the latest industry news.