Applications of Foundational Video Technology at Baidu

{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Over the past few years Baidu has built up a body of foundational video technology. It is used in many scenarios inside Baidu, in both consumer-facing (C-end) and business-facing (B-end) products, with good results. This talk introduces the key techniques Baidu applies in different video scenarios."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Baidu's Video Technology Architecture"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, an overview of Baidu's foundational video technology architecture."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/de\/deaae5ca833386912e12e485025a2e6c.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Video R&D Platform"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On the platform side, Baidu mainly uses PaddlePaddle (飛槳) and Paddle CV, which is built on top of it. PaddlePaddle draws on Baidu's years of deep learning research and production use. It is a deep learning development platform that combines a core training and inference framework, a library of base models, end-to-end development kits, and a rich set of tool components."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Video AI Technology"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Video AI technology falls into four areas: video understanding, video editing, video surveillance, and general vision."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Video understanding: "},{"type":"text","text":"covers video semantic analysis, video quality, and video retrieval. It is mainly applied to (1) content analysis, including analysis of a video's content, on-screen text (OCR), and faces; and (2) quality assessment, which judges video quality at playback and distribution time so that low-quality videos can be filtered out or enhanced."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Video editing: "},{"type":"text","text":"covers segmentation\/keypoints\/AR, super-resolution, adaptive encoding\/decoding, and GAN techniques. It is mainly applied to portrait segmentation and keypoints, AR effects, intelligent content creation, and reducing video bandwidth."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Video surveillance: "},{"type":"text","text":"covers person\/vehicle\/object detection, video tracking, and similarity measurement. It is mainly applied to detecting people, vehicles, and objects and to tracking people."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"General vision: "},{"type":"text","text":"covers pretrained models, classification\/detection\/segmentation, and neural architecture search (NAS). It is mainly applied to classification, detection, and segmentation, and to searching for better network structures with NAS."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Baidu's Foundational Video Technologies"}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Video Understanding"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/27\/27dcd5f0ad79a8b71ab1bada5cb0a6d3.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For video understanding, this section focuses on video classification:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/33\/33096aa9eb450af929fb806d1ebd2272.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Video classification is one of the key techniques in video understanding, and it differs considerably from conventional image classification. Its main characteristics are:"}]},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It is computationally expensive, because there are many frames per unit of time. A 30-second video at 25 frames per second contains 750 frames, and labelling every one of them individually takes a lot of computation."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It requires multimodal fusion. A video contains not only images but also motion information, audio, and more, which raises the problem of fusing multiple modalities."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It has to handle multiple frames over time, that is, to model the temporal relationship between frames."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Its labels are not unique. A video is multi-label by nature and may carry labels along different dimensions such as people, places, settings, and actions."}]}]}
]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baidu has done some novel work to address these characteristics."}]},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the computational cost, Baidu designed a dedicated framework for large-scale video classification that supports classifying tens of millions of videos."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For multimodal fusion, Baidu developed the KeyLess Attention approach."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For multi-frame temporal information, Baidu uses Attention-Cluster to model information across frames and StNet to improve temporal modelling."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For multiple labels, Baidu uses convolution over multi-label information to assign multiple labels."}]}]}
]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"① Attention-Cluster Network"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following describes one video classification architecture in detail: the attention-cluster network (Attention-Cluster)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ff\/fff957512c399747048149ce08f8c102.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The attention-cluster network is motivated by four observations about video: inter-frame redundancy, local discriminability, approximate unorderedness, and multi-segment separability."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Inter-frame redundancy:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The four images in the first row of the figure above come from a video of someone playing golf. Whichever frame of the swing we look at, we can tell that this is a golf video. Because a video contains many near-duplicate frames, there is a great deal of redundancy between frames."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Local discriminability:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The four images in the second row come from a video of a person brushing their teeth. The first frame alone is enough to recognise the action, so the remaining frames are probably redundant. In such a video, finding the key frame makes the analysis far more efficient."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Approximate unorderedness and multi-segment separability:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The four images in the third row come from a high-jump video whose temporal order has been shuffled, yet we can still tell what the video shows. In some cases, then, video is approximately unordered. Moreover, either the first two frames or the last two are enough to identify a high jump: the first two show the landing and the last two show the take-off. A video may therefore contain several segments that each indicate its class, much like having multiple key frames. Based on these observations, an attention mechanism can be used for recognition: it acts as a selection mechanism that picks out key frames for analysis. Because there are several different key frames, Attention-Cluster learns multiple attention units that capture different key segments, which improves the classification result."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5d\/5d6844122aa6755ee5d962c2ff16a2d0.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The attention-cluster network resembles the Transformer in that it introduces multiple attention mechanisms to capture different aspects of the content. In the network diagram, the blue bars on the far left represent the feature sequence. Attention is applied to this input sequence. Each teal bar represents one attention unit, that is, the output vector obtained by weighting the sequence with one attention mechanism, and the multiple teal bars correspond to multiple attention units. The output vectors are then concatenated. The same is done for Flow and Audio as for RGB, so that each modality learns its own attention features, and the results are finally concatenated into one complete feature."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Optimising the attention network:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When different attention units are used to weight key frames, how do we make sure they do not all learn the same combination of weights? Baidu's approach is to map the features into a new space to increase expressiveness, so that the attention units learn parameters that are as diverse as possible and the attention mechanisms complement one another."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3b\/3b4792924283167be9219786e921a9c1.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This year Baidu improved the attention-cluster model to address two issues: the earlier model did not take into account the temporal order of the frames fed to each attention unit, and it did not weight feature channels differently. The first result is the Channel Pyramid Attention Cluster, which learns weights for different local feature parts from multi-scale channel features. The feature of each frame is split, pyramid-style, into two, four, or more parts along the channel dimension. Previously every local feature part shared the same weight. After splitting, each part learns its own parameters, which makes the local feature parts more diverse. The second result is the Temporal Pyramid Attention Cluster, which applies the same idea along the time axis: the sequence is divided into one, two, or more parts, Attention Cluster is applied within each part, and the outputs are concatenated in temporal order, so that multi-scale sequence modelling retains temporal information."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For details, see the paper:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Purely Attention Based Local Feature Integration For Video Classification, TPAMI 2020."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"② Cross-Modal Attention + Graph Convolution for Multi-Label Classification"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/67\/672d900dcd4e0b68be52d728bb609a21.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The multi-label problem arises because a video or image can carry several labels at once, so the question is how to use that label information to improve classification. Baidu's approach combines cross-modal attention with graph convolution. The left side of the figure above shows this applied to image classification, where multimodal learning is done over a text space and an image feature space. An image can have several labels, and labels co-occur: if many images contain both a bicycle and a person, the labels 'bicycle' and 'person' are closely related. In a graph where each node is a label, the edge between those two labels gets a relatively high weight. The co-occurrence statistics give a conditional probability for each pair of nodes, which expresses their similarity and supervises the learning of the label text embeddings. When classifying an image, each label vector is matched against the image features to produce attention maps, which are then used to compute weighted sums of the features. Each class thereby learns its own cross-modal feature. That is the idea behind using cross-modal attention plus graph convolution for multi-label image classification. Video works similarly, with multimodal learning over the text space and temporal video features. In the YouTube-8M training set, for example, each video has one or more labels. The labels are again built into a label graph to learn an embedding feature for each label. Spatial attention is no longer used for video. Instead the video is convolved along the time axis, each temporal segment attends to the other segments, and for the cross-modal part the feature of each label is matched against the feature of each video segment. This improves video classification noticeably."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For details, see the paper:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification, AAAI 2020"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/da\/dacfad9c4d3bed81e1d0508451d2d16f.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The recognition networks above have been open-sourced on PaddlePaddle and are available there for anyone who needs them."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Video Editing"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8c\/8c5a65794670afc0e54c3c05d7441863.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For video editing, this section covers three techniques: GAN, super-resolution, and adaptive encoding\/decoding."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9b\/9b6b0fed7e793810710e62e86ed6f24f.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"① GAN"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A generative adversarial network (GAN) is a deep learning model and, in recent years, one of the most promising approaches to unsupervised learning on complex distributions."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GANs provide a way to generate high-quality data, and in some scenarios GAN-generated data has reached a new level of quality."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d6\/d61ece1f445916e0569efecebe5f259a.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The core idea of a GAN comes from game theory. A generator network G and a discriminator network D play against each other, steadily improving their ability to generate and to discriminate respectively, until the data produced by G follows the distribution of the real data. Training is iterative: in each round G generates realistic images to fool D, and D in turn updates itself to judge whether generated images are real or fake. Once D can no longer tell G's images from real ones, G has matched the real data distribution."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GANs avoid making assumptions about the data distribution and can incorporate many kinds of loss functions, which improves generation quality."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An important application of GANs is face attribute editing."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/cc\/ccbb68efee233f69bd74c75394d9debe.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Face attribute editing uses StarGAN. Besides judging whether an image is real or fake, StarGAN's discriminator also has to classify the image's attributes, and the generator takes the target attribute parameters as input along with the original image. For each generated image, the discriminator checks whether its attributes match what was requested. There is also a cyclic generation step with a constraint that keeps the image reconstruction loss as small as possible, which makes it possible to control both the attributes and the quality of the generated image."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7b\/7b8cd0e44c00cce3f3bcf182f1f38f6c.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"StarGAN is a classic architecture, and Baidu has made some improvements to it. The main contribution is the selective transfer unit, which edits local attributes instead of global ones. Each attribute is represented as 0 or 1. Taking a face image as an example, if the face has 10 attributes and one of them indicates whether glasses are worn, that attribute can be controlled by setting it to 0 or 1. The paper borrows from residual networks: only the attributes that need to change are set to 1, so that during training the network focuses on the regions where attributes change, while attributes that should stay the same are set to 0 and left untouched. This differs from earlier algorithms, which relearned and modified the whole image, so the network had no particular focus while learning. The selective transfer unit steers deep features according to the attributes to be changed and repeatedly passes the deep features of the regions to be changed to the shallow layers for fusion."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For details, see the paper:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Unified Selective Transfer Network for Arbitrary Image Attribute Editing, CVPR 2019"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"② Super-Resolution"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Video has widespread demand for super-resolution, which converts low-resolution images and video into high-resolution ones to improve how they look."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e8\/e8729978777ea9abf694bc9beb3add18.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baidu has tried a number of algorithms for super-resolution. In practice GANs performed only moderately and fell short of conventional algorithms: a GAN tends to invent content, generating local detail that fits the data distribution rather than the input. For a one-to-one reconstruction task, an L1 loss is more direct and does a better job of preserving quality after super-resolution."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"③ Adaptive Encoding\/Decoding"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3d\/3dcbeacafce7b0679e441082f0987d17.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Adaptive encoding\/decoding targets scenarios with high bandwidth costs, such as short video. In these scenarios there is a pressing need to find the best compression parameters for each video, saving bandwidth while preserving quality. Last year Baidu built a content-adaptive codec called Content Adaptive Encoding. The main idea is to first split the video at shot level, which makes the algorithm easy to parallelise. Each shot can then be given the most suitable CRF compression parameter based on its content and the quality that must be maintained, yielding the best compression result."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For details, see the paper:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Predicting Rate Control Target Through A Learning Based Content Adaptive Model, PCS 2019"}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Video Surveillance"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/44\/4471c864095609de7315767269fe4606.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For video surveillance, this section focuses on person\/vehicle\/object detection."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/11\/11d8c2160aaebbad163cf9d9006b3b9e.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Person\/vehicle\/object detection is illustrated here with smart-traffic perception. The goal is to improve 2D perception, including detection, tracking, and lane-line segmentation, while balancing speed and accuracy."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/75\/758d8d2205b74a498e02fc094f1dc9b4.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baidu's optimisation took the following approach. It first built a pretrained model on large-scale data. For the backbone, it replaced YOLO's backbone network with ResNet34-D and made further changes on top of that to improve speed, and it introduced DropBlock into the feature pyramid to improve accuracy further. There are many other tricks, which are not listed here."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. General Vision"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For general vision, this section covers classification\/detection\/segmentation:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bc\/bcd40ff6aba4432154462dfbb11cf1cd.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Whether in consumer-facing image understanding or business-facing video surveillance, the basic image tasks of classification, detection, and segmentation are closely tied to the application."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/40\/409b25ad29feade0e38e8b48b130f60c.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baidu developed the PP-YOLO architecture on PaddlePaddle. PP-YOLO's main goal is to improve detection accuracy while keeping inference fast: the network should reach as high an accuracy as possible with as little loss of speed as possible."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For details, see the paper: PP-YOLO: An Effective and Efficient Implementation of Object Detector"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Speaker:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"文石磊 (Wen Shilei), "},{"type":"text","text":"head of video understanding technology at Baidu | principal architect, Baidu Smart City."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reposted from: DataFunTalk (ID: dataFunTalk)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/8yZXqRDI0kWCSchMufrKAQ","title":"xxx","type":null},"content":[{"type":"text","text":"Applications of Foundational Video Technology at Baidu"}]}]}
]}
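The attention-cluster pooling described in section ① of the article can be illustrated with a minimal sketch. It assumes NumPy, a single linear scoring layer per attention unit, and softmax normalisation over frames. The function names, parameter shapes, and inputs are hypothetical choices for illustration, not Baidu's or PaddlePaddle's implementation, and the feature mapping the talk mentions for encouraging diversity between units is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention_cluster(features, weights, biases):
    """Pool a frame-feature sequence with several attention units.

    features: (T, D) array, one D-dimensional feature per frame.
    weights:  (N, D) array, one scoring vector per attention unit (assumed form).
    biases:   (N,) array, one scoring bias per attention unit.
    Returns a (N * D,) vector: the N weighted sums, concatenated.
    """
    scores = features @ weights.T + biases   # (T, N): one score per frame and unit
    alphas = softmax(scores, axis=0)         # each unit's weights sum to 1 over frames
    pooled = alphas.T @ features             # (N, D): one weighted sum per unit
    return pooled.reshape(-1)


def multimodal_descriptor(modalities, params):
    """Apply a separate attention cluster per modality and concatenate.

    modalities: dict name -> (T_m, D_m) feature array (e.g. rgb, flow, audio).
    params:     dict name -> (weights, biases) for that modality.
    """
    parts = [attention_cluster(modalities[name], *params[name]) for name in modalities]
    return np.concatenate(parts)
```

Because each unit computes a weighted sum over frames, shuffling the frame order leaves the output unchanged, which matches the approximate-unorderedness observation in the article. That same property is why a temporal pyramid variant needs to split the sequence into parts to recover ordering information.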