Youku Quick-Watch Short Videos: An Automated Production Solution

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Introduction"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.1 Abstract"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As users' time becomes increasingly fragmented, turning videos “from long to short” has become a trend, and demand for short-video consumption in feed scenarios keeps growing. Youku provides users with a large volume of high-quality video every year, which gives it a natural “long to short” advantage, and its algorithm research has achieved a breakthrough in the automated production of quick-watch short videos."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d1\/f4\/d1699d9f71fa2efe2e5f4b7b3eb094f4.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.2 Related Work"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Academia calls this problem text video alignment: given the script of a video, align the two sequences based on the similarity between video shots and sentences. It involves two tasks: the first computes the similarity between text and video clips, and the second aligns the text sequence with the video sequence."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Text video alignment differs from video text grounding in that it is insensitive to clip boundaries: it does not regress boundaries and only measures the similarity between shots and text. Like video text retrieval, it needs to compute features and similarity for video clips and text; unlike retrieval, text video alignment carries temporal information, and the order is strictly sequential with no shuffling. Text video 
alignment compares similarity only within the specified video; there is no cross-video retrieval."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Videos usually contain information from several modalities, such as optical flow, faces, and audio, whereas earlier methods considered features from only one modality. Paper [1] proposes a similarity framework that brings all modality features into the video-text similarity computation; it extends flexibly to more modalities and can handle cases where a modality's features are missing."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/18\/10\/18bcebbe6608a801774019b7aee32a10.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Paper [2] abstracts cross-modal matching between video and text as a series of operations on a video sequence stack and a text sequence stack. It models the video and text sequences with LSTMs to form the two stacks, then performs sequence matching by repeatedly predicting which operation to apply at the stack tops. This design can satisfy different types of matching requirements."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/29\/ed\/2925f2caa2b4529828fe4fe4644ee7ed.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Paper [3] applies the similarity framework from paper [1] to video-text retrieval. On top of the original structure it adds an information filtering module and fusion channels between modalities, so features from different modalities are fused more effectively."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/9d\/7a\/9db78df413002401830b645737ea5b7a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass"
:true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Paper [4] applies graph neural networks to video-text retrieval. It extracts features at different levels from the text and video modalities, fuses features within each modality using a graph neural network, and finally computes similarity. Compared with other methods, a graph-structured representation organizes information more sensibly and improves model performance."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/c5\/8a\/c5e709a5ee14f86e254b46335db9bb8a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. Algorithm Description"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1 Framework Overview"}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/4b\/a6\/4bf7a0696395ef370806ed72b378afa6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2 Feature Design"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"2.2.1 Video Features"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature extraction on the video side starts with video structuring, that is, intelligently analyzing the image information in the video, extracting the key information, and describing it semantically as text."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d9\/1c\/d925e3a3c3042d735c0bf55a2237cc1c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"2.2.2 
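Text Features"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Text-side information extraction has several parts: text classification, named entity recognition (NER), coreference resolution, and dependency parsing. Together these modules form a complete text-processing pipeline whose extracted key features feed multimodal matching."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Text classification provides an important basis for the weights of the matching algorithm, which picks a matching strategy according to each sentence's class. For example, descriptive text is matched using embedding vectors of characters, scenes, and actions, while dialogue text is matched against OCR text."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Named entity recognition extracts the named entities in the text, such as characters, actions, scenes, and other key information. A similarity algorithm can compute the semantic distance between this structured data and the video embedding vectors, which provides an important scoring function for the embedding-based and tag-based matching algorithms. We use a BERT model for both text classification and NER: specifically, we take a model pre-trained on a larger Chinese corpus and fine-tune it on our own annotated dataset."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Coreference resolution and dependency parsing provide tools for removing ambiguity and redundancy from text features. Sentences in plot text often refer to characters with pronouns, so NER alone cannot infer the key character. For example: Chen Yongren (陳永仁) heard that Han Chen (韓琛) had just brought in a batch of drugs, so he quickly passed the news to Huang Zhicheng (黃志誠). Without coreference resolution, the 'he' in the second clause cannot be resolved to the right character."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dependency parsing then builds on this to distill the most important part of each sentence and discard distractors, which greatly improves the quality of the extracted features. Plot text usually contains many attributive and adverbial modifiers, which help little with the text2video task and interfere with extracting the core of the sentence. We therefore use dependency parsing to extract the subject, predicate (action), and object as the sentence backbone and use them as matching features."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.3 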
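Cross-Modal Matching"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cross-modal matching addresses how to align sentences in the text with video clips, which is a very hard, system-level problem. To solve it we designed a multi-level matching algorithm that works at two semantic levels: the embedding-vector level and the tag level."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the embedding-vector level, we train a separate semantic embedding model for text and for video, compute a semantic embedding for every sentence and every video clip, and then use a neural network to learn the matching relationship between the two vectors. Part of the training data for this step was annotated manually."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The embedding-vector level handles semantic matching in the broad sense, but some simple logic can be handled quickly, accurately, and cheaply by tag-level matching. For example, when the same character appears in both the text and the video, the character tag can be used to filter out non-matching clips. For this purpose we designed several effective similarity scoring functions that compute the semantic distance between tags and are used to score and rank candidate matches."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.4 Text Matching"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Text matching has two distinct requirements: phrase-level short-text matching and sentence-level matching. Both use word vectors to compute text similarity. We trained a word-vector model on a public Chinese corpus (8 million Chinese words) and use it to compute the vectors of phrases."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For phrase-level matching, the vector computed by the word-vector model is used directly as the basis for matching. For sentence-level matching, a vector is computed for each word in the sentence, and their weighted average serves as the vector of the whole sentence."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once phrases and sentences have vectors, the distance between texts still has to be computed from them. The baseline is very simple: average the word embeddings of the phrases in each sentence, then compute the cosine similarity between the two sentence embeddings. Simple as it is, this method performs as expected in most scenarios. For harder cases we use Word Mover's Distance, which computes the minimum distance the words of one text must travel in the semantic space to reach the words of the other."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. Results"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","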
text":"4. References and Notes"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] Learning a Text-Video Embedding from Incomplete and Heterogeneous Data"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] A Neural Multi-sequence Alignment TeCHnique (NeuMATCH)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[3] Use What You Have: Video Retrieval Using Representations From Collaborative Experts"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[4] Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Note: TTS speech synthesis technology is provided by the Speech Lab of Alibaba DAMO Academy."}]}]}
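As a rough illustration of the baseline in section 2.4 (a weighted average of word vectors per sentence, followed by cosine similarity), here is a minimal Python sketch. It is not the authors' implementation: the three-dimensional `TOY_VECTORS`, the English tokens, and the function names are all hypothetical stand-ins for a trained word-vector model and a real tokenizer.

```python
import math

# Hypothetical toy word vectors; a real system would load a trained model.
TOY_VECTORS = {
    "police": [0.9, 0.1, 0.0],
    "officer": [0.8, 0.2, 0.1],
    "chases": [0.1, 0.9, 0.0],
    "pursues": [0.2, 0.8, 0.1],
    "suspect": [0.7, 0.0, 0.6],
    "dinner": [0.0, 0.1, 0.9],
}


def sentence_vector(tokens, vectors, weights=None):
    """Weighted average of the vectors of known tokens (uniform by default)."""
    known = [t for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a word vector")
    dim = len(vectors[known[0]])
    total = [0.0] * dim
    weight_sum = 0.0
    for token in known:
        w = 1.0 if weights is None else weights.get(token, 1.0)
        weight_sum += w
        for i, x in enumerate(vectors[token]):
            total[i] += w * x
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    return [x / weight_sum for x in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def sentence_similarity(tokens_a, tokens_b, vectors, weights=None):
    """Baseline: average the word vectors of each sentence, then take cosine."""
    return cosine_similarity(
        sentence_vector(tokens_a, vectors, weights),
        sentence_vector(tokens_b, vectors, weights),
    )


if __name__ == "__main__":
    a = ["police", "chases", "suspect"]
    b = ["officer", "pursues", "suspect"]
    c = ["dinner"]
    print(sentence_similarity(a, b, TOY_VECTORS))  # paraphrase pair
    print(sentence_similarity(a, c, TOY_VECTORS))  # unrelated pair
```

The optional `weights` mapping is one way to realize the weighted average mentioned in the text (for example, down-weighting very frequent words); the article does not say which weighting scheme is used. Word Mover's Distance, used for the harder cases, is not shown here.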