Key Technologies of Baidu's Rich Media Retrieval and Matching System, Explained

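{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Overview: ","attrs":{}},{"type":"text","text":"Baidu's rich media retrieval and matching system is built on ANN (approximate nearest neighbor) search and content-feature matching, and provides similarity search over multimedia resources such as images, audio, and video. It covers offline training and index building as well as online feature extraction and retrieval. Besides handling anti-spam, delivery-time deduplication, related recommendation, and screening of pornographic and other prohibited content for all video and images in Baidu FEED, it also serves dozens of other business teams, including video search, Tieba, and Wenku, at a scale of hundreds of billions of items. It is in a leading position in data scale, system performance, recall, and precision.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"Full text: 5,612 characters; estimated reading time: 11 minutes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"I. Background","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the development of the internet and AI, the main medium for information online has shifted from traditional web text to images, video, and audio. Compared with text-only pages, rich media content conveys more information and offers a fresher user experience. As rich media content grows explosively, understanding this content and finding the similarity relations within it makes it possible not only to filter and process the content, but also to make it better understood by recommendation and search engines, and so to serve users better.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Thanks to rapid progress in neural networks, multimedia retrieval can usually be recast as similarity search over high-dimensional vectors: features from a CNN (Convolutional Neural 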
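Network) describe the semantic information of these multimedia resources. Using these CNN feature vectors, the query is compared against the data in the index to retrieve the relevant result set. Taking images as an example, we first ingest and filter the existing images, compute their CNN features, and build an index over those features. When a query image arrives and we need to retrieve identical or similar images from the historical index, we first compute the query image's CNN feature, then compare it against all CNN features in the index (usually by Euclidean or cosine distance) and take the top-k nearest images as the recalled results. We usually also extract visual descriptors of the images for a post-verification step that further improves precision. Video retrieval and matching works similarly: on top of the image pipeline it adds keyframe extraction, and after the matching frame sequences are recalled it performs video-level and audio-level comparison.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/80/8062f7a5f2d4017b125ae653ca27a5fe.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7f/7f45b43954d3da5f01ab2618779858bc.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 1. 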
Video retrieval and matching method","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Retrieval over CNN feature vectors involves large data volumes, high feature dimensionality, and tight response-time requirements. With the rapid growth of multimedia data, the number of image frames has reached hundreds of billions or even trillions. At that scale, traditional brute-force computation meets the accuracy requirement, but its cost in compute resources and query time is enormous and unacceptable. To reduce the space and time complexity of search, researchers over the past decade or so have developed an alternative: approximate nearest neighbor (ANN) search. These methods typically preprocess the vector set to produce knowledge that guides the lookup, and speed up the search at the cost of some accuracy.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"II. Overall architecture","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baidu's rich media content matching system is a general-purpose multimedia retrieval and matching system that covers offline ANN training, index building, and model training, as well as online feature inference and retrieval matching. It handles images, video, and audio.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ee/ee3939e8e7600a10b86bad83b73fd768.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 2. 
Overall architecture","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"cnn-service: the module that extracts resource features, covering three resource types: images, video, and audio;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"feature-service: the unified feature module. It provides unified feature extraction and caching, hides the heterogeneous CNN features from upper layers, and can be configured to call a specified cnn-service;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"vs-image: before recall, it calls feature-service to compute the query's features, then asks the as module for the ANN recall results and performs video-level and audio-level comparison;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"bs: the ANN index service. It computes the top-k recall from CNN features and then post-verifies with visual features to produce the final recall results. It supports multiple shards, with automatic shard updates and scale-out;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"as: supports concurrent access to multiple bs shards and merging of results from heterogeneous indexes;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"finger-builder: the unified entry point for ingesting resources. It obtains the resources' CNN feature data and writes it to afs;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"index-builder: 
builds the ANN index on a schedule.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"III. Application scenarios","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Producer-side (B-end) anti-spam","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Full coverage of author-uploaded and crawled videos","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Filters out 60%+ of duplicate videos every day, easing the review workload","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"High precision for strict anti-spam enforcement","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baijiahao posts, UGC, mini programs, Tieba, and more","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Consumer-side (C-end) delivery deduplication","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"User experience","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Protection of original content and ecosystem building","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Related recommendation","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","att
rs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Short-to-long linking: brings in long-form video from outside Baidu so that users can be pointed to the full version of the video they are watching","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Risk control","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Identification and blocking of sensitive resources, such as political, pornographic, and other prohibited content","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/03/0342d0c383d9c2bdcc9f167fa08818ab.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 3. 
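Business applications","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"IV. Key technologies","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. ANN","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ANN search methods partition the full space into many small subspaces. At query time they quickly narrow the search to one or a few subspaces and scan only those, which gives sublinear computational complexity. Because the scanned region is so much smaller, ANN can index data at large scale. Common ANN search algorithms include:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tree-based methods: classic implementations include KD-Tree and Annoy. The core of Annoy is to split the space repeatedly with the hyperplane normal to the line between two chosen centroids, until each resulting subspace holds at most K samples. Tree-based methods speed up search by partitioning the space along dimensions to narrow the search range, and suit low-dimensional spaces.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hash-based methods: classic implementations include LSH and PCAH. The core idea of LSH is that after hash functions project data from the high-dimensional space into a low-dimensional one, points that were neighbors have a high probability of falling into the same bucket, while points that were not neighbors have a low probability of doing so. At query time, distance computation in Euclidean space is converted to Hamming space, and the global search becomes a search over the data mapped to the same bucket, which speeds up retrieval.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Vector quantization methods: PQ, OPQ, and others. The main idea of PQ is to decompose the feature vector into orthogonal low-dimensional subspaces, quantize within each subspace, and encode with a small codebook, thereby reducing storage.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"cont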
ent":[{"type":"text","text":"Inverted-index-based methods: IVF, IMI, GNO-IMI, and others.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Graph-based methods: NSW, HNSW, NSG, and others.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. GNOIMI","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GNOIMI (The Generalized Non-Orthogonal Inverted Multi-Index) is an ANN search engine implemented in-house at Baidu. It partitions the space into many subspaces by clustering. At query time it quickly narrows the search to one or a few subspaces and scans only those, which gives sublinear computational complexity.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CNN features are usually high-dimensional, and the memory needed to hold the full feature set grows in proportion to the number of samples. For datasets beyond the tens of millions, memory usually becomes the limit. GNOIMI uses PQ (product quantization) to encode the full feature space with a finite subset, which greatly reduces memory consumption.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Training","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"1) Space partitioning","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GNOIMI partitions the feature vector space of the training set by clustering.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The user guarantees that the original feature data contains no duplicates, and samples are drawn from it at random. The sampled dataset contains at most 5 million items,","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/74/746fa437eb33040c319f3714c8c7eb05.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":".","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"
origin":null},"content":[{"type":"text","text":"K-means clustering is run on the sampled data to obtain the initial first-level cluster centroids","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e4/e444af39effbd5dafb754facb907fe10.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":". The residual vectors between the samples and the centroid of their first-level subspace are then computed, and K-means is run on the residuals to split the residual space into L subspaces, giving the second-level centroid codebook","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3d/3def26c62bafaa3f09645e14b4b5bc01.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":". Together, the first- and second-level centroids partition the whole data space into subspaces (cells), one for each pair of a first-level and a second-level centroid. The centroid of each cell is defined as","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bb/bbb7a581d63a93b5023c693464ad1a4c.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":". The cell to which any training-set feature vector belongs satisfies","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/75/7502b5b7bb77c6b137718b62863803a4.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":".","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The space partition is shown in Figure 4; all first-level centroids share the second-level centroids.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a
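e/ae27505f5359cf9d12be7b99182c5472.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 4","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because the second-level centroids are computed from the residual vectors of all the original features, they are assumed to be distributed similarly within every first-level subspace, and all the original feature data shares them. This method is called NO-IMI (The Non-Orthogonal Inverted Multi-Index). In the figure, the blue points are the first-level centroids and the red points are the centroids of the cells. Clearly, the shape and size of a cell should vary with the data distribution, especially when the full feature space contains both sparse and dense regions. To allow this, a parameter matrix is introduced into the definition of the cell centroids. With the parameter matrix, the cells are distributed as shown in Figure 5: their shape and size follow the actual data distribution, and the partition better matches the data within each first-level subspace. The parameter matrix is computed from the full dataset and therefore describes the data distribution more accurately. This method is called GNO-IMI.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/30/30e80b9d56d51600d297ff2aee9b4fca.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 5","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"2) Product quantization","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For each sample in the sampled dataset, find the first- and second-level centroids it belongs to, and compute the residual between the sample and those centroids to obtain a residual dataset. Split the residual vectors into nsq subspaces, each of dimension feature_dim/nsq, and run K-means in each subspace separately to obtain 256 centroids (a char is 8 bits, so a single char can label every centroid ID), which form that subspace's codebook. The Cartesian product of the nsq sub-codebooks gives the PQ codebook for the whole dataset.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs"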
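:{"align":null,"level":3},"content":[{"type":"text","text":"2. Index building","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For each sample in the original feature dataset, find the first- and second-level centroids it belongs to.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Compute the residual between each sample in the original feature dataset and its first- and second-level centroids. Split the residual vector into nsq subspaces; in each subspace, find the centroid nearest to that sub-vector and record its ID. This maps a feature_dim-dimensional feature vector to nsq centroid IDs, which together identify the vector. Typically nsq = feature_dim is chosen, which quantizes the vector to one quarter of its size: feature_dim * sizeof(float) -> nsq * sizeof(char).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During search, the distance between the query and a sample in each subspace is replaced by the distance between the query and the centroid nearest to that sample. The original feature vectors therefore do not need to be loaded at search time, which reduces the memory required.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Search","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Normalize the feature;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Compute the distances between the query and the first-level centroids and sort them;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Compute the distances between the query and the second-level centroids under the nearest gnoimi_search_cells first-level centroids and sort them, for a total of gnoimi_search_cells * gnoimi_fine_cells_count second-level centroids;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) Using a priority queue, start from the nearest second-level centroid, take out the samples under each centroid in turn, and compute the distance between the query and those samples, until neighbors_count samples have been collected;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5) 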
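Sort and return the top-K samples with their distances.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. Implementation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ANN algorithms are not especially complex in themselves; the difficulty lies mainly in the implementation. GNOIMI includes many implementation optimizations, summarized briefly below:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) A new training scheme reorganizes the relationship between first- and second-level centroids; training is 1000% faster, with a slight gain in recall.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) In L2/COS distance spaces, any three points satisfy the triangle inequality. At index-building time this property is used to prune with the pairwise distances among a sample, the first-level centroids, and the second-level centroids, which removes 95%+ of the computation and makes index building 550%+ faster.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Training and index building need far less memory: only 10% of that of Faiss-IVF* and nmslib-HNSW.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) At search time, with the space partitioned into more than ten million cells, new computation and sorting logic for the query-to-second-level-centroid step keeps the latency of computing and sorting over millions of centroids within 2 ms, a 20x reduction. When computing query-to-sample distances, the PQ computation is optimized, reducing the amount of computation by 800%+ and raising overall throughput by 30%+.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5. Application","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When GNOIMI is compared with IVF*, search time, recall and precision, and memory are compared under the same number of centroids and the same number of docs searched. When it is compared with HNSW, recall and precision and memory are compared at the same search time.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In our evaluation, at the million-item scale and the same search time, GNOIMI's quality is slightly below HNSW's and far above IVF's, with very low memory use; HNSW gives the best quality but consumes the most memory. As the data grows and the dimensionality increases, GNOIMI gives better quality than the others at the same search time, while its memory grows slowly.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GNOIMI is now widely used across Baidu, in scenarios including video matching, image/video search, and FEED, supporting more than a hundred billion features and over 1 billion PVs a day.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. HNSW","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HNSW (Hierarchical Navigable Small World) is a graph-based algorithm for ANN search. Its predecessor is NSW (Navigable-Small-World-Graph). NSW addresses the divergent-search problem of nearest-neighbor graphs by designing a navigable graph, but its search complexity is still too high, at a polylogarithmic level, and its overall performance is easily affected by the size of the graph. HNSW sets out to solve this. Borrowing the idea of the SkipList, its authors proposed Hierarchical-NSW. Put simply, a single graph is split into several layers by certain rules: the closer a layer is to the top, the lower its average degree and the farther apart its nodes; the closer to the bottom, the higher the average degree and the closer together the nodes (see Figure 6 below).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Search starts at the top layer; once the nearest node in the current layer is found, it moves down a layer. The starting node for the next layer is the nearest node from the layer above, and this repeats until the result is found. Because upper layers have fewer nodes, lower average degree, and longer distances, they provide a good search direction at very little cost, which avoids a great deal of useless computation and lowers the search complexity. Furthermore, if the maximum degree of nodes in HNSW is set to a constant, the resulting graph has a search complexity of only log(n).","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3c/3c07a333f8f6af9b2ad5192b112a5a9b.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 6. 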
HNSW","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A notable advantage of HNSW is that it needs no training, which makes it very convenient in scenarios with no initial data.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baidu's content side currently uses an optimized implementation of HNSW; with many optimizations on top of the open-source version, performance has improved by nearly 3.6x.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5e/5efe53911ff536be47d49c605819c6e3.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"V. Matching techniques","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Image matching","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are currently two main ways to represent an image: local feature points and image CNN vectors.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Local feature points: visual descriptors of the image, such as SIFT and ORB. They are invariant to scale, rotation, and brightness, and also fairly stable under viewpoint changes, affine transformations, and noise.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Image CNN features: semantic features of the image, usually taken from the outputs of the last few layers of a network such as a CNN classification model.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f6/f6c875b2f0fa05128c2d778ee12041dc.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pas
tePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/68/68c93109645cd241508a0da8c251225a.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△Figure 7","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The current matching technique therefore filters candidates by CNN features and then post-verifies them with local visual features.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d1/d12ebea390f999f98700dabf1315b20b.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△Figure 8","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. Video matching","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Video matching reuses the image matching technique and adds video-level and audio-level comparison on top of frame retrieval, based mainly on dynamic programming to compute the best-matching sequence.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/77/77dce0249a212eaf4a13191622288936.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph"
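,"attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△Figure 9","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"VI. Summary","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In recent years, with advances in computing, rich media such as images, video, and audio has grown explosively, and content retrieval and matching technology has found wider use in recommendation, search, and other fields. This article gave an overview of the basic principles and core technologies of Baidu's rich media retrieval and matching system and described how each module works, including feature extraction, offline training and index building, online inference, and retrieval matching. The system provides a general-purpose solution for multimedia retrieval and matching that delivers high recall, high precision, and high performance. Built on Baidu's two core businesses, FEED and search, it has the largest data scale and the richest set of resource types on the web, covering the vast majority of rich media resources such as short video, mini video, live streams, and images. It serves 30+ product lines and has helped improve the results of Baidu's products.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Read the original: ","attrs":{}},{"type":"link","attrs":{"href":"https://link.zhihu.com/?target=https%3A//mp.weixin.qq.com/s%3F__biz%3DMzg5MjU0NTI5OQ%3D%3D%26mid%3D2247491769%26idx%3D1%26sn%3D4009630e1583d92c37770d59f8e4ffd7%26chksm%3Dc03ed0c5f74959d35eba94b0c239281bac274028754aef4e12267788c44a9a845768002f93f9%26token%3D1381754262%26lang%3Dzh_CN%23rd","title":null,"type":null},"content":[{"type":"text","text":"Key Technologies of Baidu's Rich Media Retrieval and Matching System, Explained","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"For more technical articles and internal referral opportunities, follow the official account of the same name, 「百度Geek說」.","attrs":{}}]}]}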
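The product quantization steps in the article's index building and search sections can be illustrated with a minimal sketch. It is not Baidu's implementation: the function names (`pq_encode`, `distance_tables`, `approx_distance`) and the toy codebook values are hypothetical, the codebooks are assumed to be already trained, and the inputs are assumed to be residuals relative to a cell centroid. The sketch only shows how a vector is replaced by nsq centroid IDs and how a query-to-sample distance is then approximated from those IDs.

```python
from typing import List, Sequence

Codebook = List[List[float]]  # centroids of one sub-space


def sq_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vec: Sequence[float], nsq: int) -> List[Sequence[float]]:
    """Cut a vector into nsq contiguous sub-vectors of equal length."""
    sub_dim = len(vec) // nsq
    return [vec[i * sub_dim:(i + 1) * sub_dim] for i in range(nsq)]


def pq_encode(residual: Sequence[float], codebooks: List[Codebook]) -> List[int]:
    """Index building: replace each sub-vector by the ID of its nearest centroid."""
    code = []
    for sub, centroids in zip(split(residual, len(codebooks)), codebooks):
        nearest = min(range(len(centroids)), key=lambda k: sq_l2(sub, centroids[k]))
        code.append(nearest)
    return code


def distance_tables(query_residual: Sequence[float],
                    codebooks: List[Codebook]) -> List[List[float]]:
    """Search: per sub-space, the distance from the query to every centroid."""
    subs = split(query_residual, len(codebooks))
    return [[sq_l2(sub, c) for c in centroids]
            for sub, centroids in zip(subs, codebooks)]


def approx_distance(code: List[int], tables: List[List[float]]) -> float:
    """Approximate query-to-sample distance using only the stored code."""
    return sum(table[cid] for cid, table in zip(code, tables))


# Toy codebooks (hypothetical values): nsq = 2 sub-spaces, 2 centroids each.
# The article uses 256 centroids per sub-space, trained with K-means.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [2.0, 2.0]]]
code = pq_encode([0.9, 1.2, 0.1, -0.2], codebooks)
tables = distance_tables([1.0, 1.0, 0.0, 0.0], codebooks)
print(code, approx_distance(code, tables))
```

Because the tables are built once per query, scoring each stored sample costs nsq table lookups and additions, and the original float vectors never have to be loaded.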