Key Technologies of Baidu's Rich-Media Retrieval and Matching System Explained

Original

2021-05-11 16:03

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Introduction: ","attrs":{}},{"type":"text","text":"Baidu's rich-media retrieval and matching system is a similarity retrieval system for images, audio, video and other multimedia resources, built on ANN (approximate nearest neighbor) retrieval and content-feature matching. It covers offline training and index building as well as online feature extraction and retrieval. The system currently handles anti-cheating, delivery-side deduplication, related recommendation, and the screening of pornographic and politically sensitive content for all video and images in Baidu FEED. It also supports dozens of other business teams, including video search, Tieba and Wenku, at a scale of hundreds of billions of items. It is in a leading position in data scale, system performance, recall and accuracy.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"Full text: 5,612 characters; estimated reading time: 11 minutes.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"I. Background","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the development of the internet and AI, the main carriers of online information have shifted from traditional text web pages to images, video, audio and other resources. Compared with text-only pages, rich-media content conveys more information and offers a fresher user experience. As the volume of rich-media content explodes, understanding this content and finding the similarity relationships within it makes it possible to filter and process the content, and also lets recommendation and search engines understand it better and serve users better.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Thanks to the rapid progress of neural networks, retrieving multimedia data can usually be reduced to similarity search over high-dimensional vectors. Features produced by a CNN (Convolutional Neural Network) describe the semantics of each multimedia resource; the query's CNN feature vector is scored against all vectors in the index, and the relevant result set is retrieved. Take images as an example. We first ingest and filter the existing images, compute their CNN features, and build an index over those features. When a query image arrives and we need to find similar or identical images in the historical index, we first compute the query image's CNN feature, then compute its distance (usually Euclidean or cosine) to the CNN features in the index, and return the top-k nearest images as the recall results. We usually also extract visual descriptors from the images for post-verification, which further improves precision. Video retrieval and matching works similarly: it adds key-frame extraction on top of the image pipeline, and once the matching frame sequences have been recalled, it performs video-level and audio-level matching.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/80/8062f7a5f2d4017b125ae653ca27a5fe.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7f/7f45b43954d3da5f01ab2618779858bc.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 1. Video retrieval and matching method","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Retrieval over CNN feature vectors involves large data volumes, high feature dimensionality and tight response-time requirements. With multimedia data growing quickly, the number of image frames has reached hundreds of billions or even trillions. On datasets of this size, brute-force computation meets the accuracy requirement, but its cost in compute resources and retrieval time is enormous and unacceptable. To reduce the space and time complexity of search, researchers over the past decade or so have developed an alternative: approximate nearest neighbor (ANN) retrieval. These methods typically preprocess the vector set to produce auxiliary structures that guide the search, and at query time they trade some accuracy for speed.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"II. Overall Architecture","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baidu's rich-media content matching system is a general-purpose multimedia retrieval and matching system that includes offline ANN training, index building and model training, plus online feature inference and retrieval matching. It handles images, video and audio.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ee/ee3939e8e7600a10b86bad83b73fd768.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 2. Overall architecture","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"cnn-service: extracts resource features for three resource types: images, video and audio;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"feature-service: the unified feature module. It provides unified feature extraction and caching, hides the heterogeneous CNN features from upper layers, and can be configured to call a specified cnn-service;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"vs-image: before recall, calls feature-service to compute the query's features, then requests as for the ANN recall results and performs video-level and audio-level matching;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"bs: the ANN index service. It computes the top-k recall from CNN features, then runs visual-feature post-verification to produce the final recall results. It supports multiple shards as well as automatic shard updates and scale-out;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"as: supports concurrent access to multiple bs shards and merges retrieval results across heterogeneous indexes;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"finger-builder: the single entry point for ingesting resources. It obtains each resource's CNN feature data and writes it to afs;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"index-builder: builds the ANN index on a schedule;","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"III. Application Scenarios","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"B-side anti-cheating","attrs":{}}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Full coverage of author-uploaded and crawled videos","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Filters out more than 60% of videos as duplicates every day, reducing the review workload","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"High precision, strict anti-cheating","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baijiahao posts, UGC, mini programs, Tieba and more","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"C-side delivery deduplication","attrs":{}}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"User experience","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Protection of original content and ecosystem building","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Related recommendation","attrs":{}}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","att
rs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Short-to-long linking: by bringing in long-form video from outside the company, the system can link the current video to its full-length version for the user","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Risk control","attrs":{}}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Identification and blocking of politically sensitive, pornographic and other sensitive resources","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/03/0342d0c383d9c2bdcc9f167fa08818ab.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 3. Business applications","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"IV. Key Technologies","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. ANN","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ANN search methods partition the whole space into many small subspaces. At query time they quickly narrow the search to one or a few subspaces and scan only those, which yields sub-linear computational complexity. Because the scanned region is so much smaller, ANN can index data at very large scale. Common ANN retrieval algorithms include:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tree-based methods: classic implementations include KD-Tree and Annoy. Annoy repeatedly splits the space with the hyperplane that perpendicularly bisects the segment between two chosen centroids, until every resulting subspace holds at most K samples. Tree-based methods speed up search by partitioning the space and narrowing the search range, and are suited to low-dimensional spaces.","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hash-based methods: classic implementations include LSH and PCAH. The core idea of LSH is that points that are close in the high-dimensional space fall into the same bucket with high probability after the hash functions project them into a low-dimensional space, while points that are not close fall into the same bucket with low probability. At query time, distance computation in Euclidean space is converted to Hamming space, and the global search becomes a search over the data mapped to the same bucket, which speeds up retrieval.","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Vector quantization methods: PQ, OPQ and others. The main idea of PQ is to decompose the feature vector into orthogonal low-dimensional subspaces, quantize each subspace separately, and encode it with a small codebook, which reduces storage.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"cont
ent":[{"type":"text","text":"Inverted-index-based methods: IVF, IMI, GNO-IMI and others.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Graph-based methods: NSW, HNSW, NSG and others.","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. GNOIMI","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GNOIMI (the Generalized Non-Orthogonal Inverted Multi-Index) is an ANN retrieval engine implemented in-house at Baidu. It partitions the space into many subspaces by clustering. At query time it quickly narrows the search to one or a few subspaces and scans only those, which yields sub-linear computational complexity.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CNN features are usually high-dimensional, and the memory needed to hold the full set of features grows in proportion to the number of samples. Datasets beyond the ten-million scale usually run into memory limits. GNOIMI uses PQ (product quantization) to encode the full feature space with a finite set of codewords, which greatly reduces memory consumption.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Training","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"1) Space partitioning","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GNOIMI partitions the feature-vector space of the training set by clustering.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The user ensures the raw feature data contains no duplicates, and a random sample is drawn from it. The sample contains at most 5 million vectors,","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/74/746fa437eb33040c319f3714c8c7eb05.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":".","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"
origin":null},"content":[{"type":"text","text":"K-means clustering on the sample yields the initial first-level cluster centers","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e4/e444af39effbd5dafb754facb907fe10.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":". For each sampled vector, compute its residual with respect to the center of the first-level subspace it belongs to. Running K-means on these residuals partitions the residual space into L subspaces and yields the second-level codebook of cluster centers","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3d/3def26c62bafaa3f09645e14b4b5bc01.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":". Together, the first- and second-level centers partition the whole data space into subspaces (cells), one for each pairing of a first-level center with a second-level center. The centroid of each cell is defined as","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bb/bbb7a581d63a93b5023c693464ad1a4c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":". The cell to which any training feature vector belongs satisfies","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/75/7502b5b7bb77c6b137718b62863803a4.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":".","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 4 shows the partition; all first-level centers share the same second-level centers.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a
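e/ae27505f5359cf9d12be7b99182c5472.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 4","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because the second-level centers are computed from the residual vectors of the full set of raw features, they are assumed to be distributed similarly inside every first-level subspace, so the whole dataset shares one set of second-level centers. This method is called NO-IMI (the Non-Orthogonal Inverted Multi-Index). In the figure, the blue points are the first-level centers and the red points are the cell centroids. Clearly, the shape and size of a cell should adapt to the data distribution, especially when the feature space contains both sparse and dense regions. To allow this, a parameter matrix is introduced into the definition of the cell centroids. Figure 5 shows the cells after the parameter matrix is introduced: their shape and size vary with the actual data distribution, and the partition fits the data inside each first-level subspace more closely. The parameter matrix is computed from the full dataset and therefore describes the data distribution more accurately; this method is called GNO-IMI.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/30/30e80b9d56d51600d297ff2aee9b4fca.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 5","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"2) Product quantization","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For each vector in the sample, find the first- and second-level cluster centers it belongs to and compute its residual with respect to them, giving a residual dataset. Split the residual dataset into nsq subspaces, each of dimension feature_dim/nsq, and run K-means in each subspace to obtain 256 cluster centers (a char is 8 bits, so a single char can label every center ID). This gives one codebook per subspace. The Cartesian product of the nsq sub-codebooks forms the PQ codebook for the whole dataset.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs"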
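:{"align":null,"level":3},"content":[{"type":"text","text":"2. Index building","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For each vector in the raw feature dataset, find the first- and second-level cluster centers it belongs to.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Compute the residual of each vector with respect to its first- and second-level centers. Split the residual into nsq sub-vectors; in each subspace, find the cluster center nearest to the sub-vector and record its ID. A feature vector of dimension feature_dim is thus mapped to nsq center IDs, which together identify the vector. Typically nsq = feature_dim, which quantizes the vector to one quarter of its original size: feature_dim * sizeof(float) -> nsq * sizeof(char).","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During retrieval, the distance between the query and a sample in each subspace is replaced by the distance between the query and the cluster center nearest to that sample. Retrieval therefore never needs to load the raw feature vectors, which reduces the memory it requires.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Retrieval","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Normalize the feature;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Compute the distances between the query and the first-level cluster centers, and sort them;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Compute the distances between the query and the second-level cluster centers under the nearest gnoimi_search_cells first-level centers, and sort them; this covers gnoimi_search_cells * gnoimi_fine_cells_count second-level centers in total;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) Using a priority queue, start from the nearest second-level center, take out the samples under each center in turn, and compute the distance between the query and those samples, until neighbors_count samples have been collected;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5) Sort and return the topK samples with their distances","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. Implementation","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ANN algorithms themselves are not especially complex; the difficulty lies mainly in the implementation. GNOIMI includes many implementation-level optimizations, summarized briefly below:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) A new training scheme reorganizes the relationship between the first- and second-level cluster centers; recall improves slightly while training speed improves by 1000%.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) In L2/COS distance spaces, any three points satisfy the triangle inequality. During index building, this property is used to prune with the pairwise distances among a sample, the first-level centers and the second-level centers, which cuts more than 95% of the computation and speeds up index building by more than 550%.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) The memory needed for training and index building is greatly reduced, to only 10% of that of Faiss-IVF* and nmslib-HNSW.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) At retrieval time the space is partitioned into more than ten million cells. New computation and sorting logic for the distances between the query and the second-level centers keeps the latency of computing and sorting millions of centers within 2 ms, a 20-fold reduction. For query-to-sample distances, the PQ distance computation is optimized, lowering the amount of computation by 800%+ and raising overall throughput by more than 30%.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5. Application","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When comparing GNOIMI with IVF*, we use the same number of cluster centers and the same number of scanned docs, and compare retrieval time, recall and precision, and memory. When comparing with HNSW, we fix the retrieval time and compare recall and precision, and memory.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In our evaluation at the million-vector scale with equal retrieval time, GNOIMI's quality is slightly below HNSW's and far above IVF's, with a very small memory footprint; HNSW has the best quality but uses the most memory. As the data volume and dimensionality grow, GNOIMI achieves better quality than the others within the same retrieval time, while its memory grows slowly.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GNOIMI is now widely used across Baidu, in scenarios including video matching, image/video retrieval and FEED. It supports hundreds of billions of features and serves more than 1 billion PV per day.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. HNSW","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HNSW (Hierarchical Navigable Small World) is a graph-based algorithm for ANN search. Its predecessor is NSW (Navigable-Small-World-Graph). NSW addresses the problem of divergent search on nearest-neighbor graphs by building a navigable graph, but its search complexity is still too high, at the polylogarithmic level, and its overall performance is easily affected by the size of the graph. HNSW sets out to solve this problem. Borrowing the idea of the SkipList, its authors proposed Hierarchical NSW. Put simply, a single graph is split into multiple layers according to certain rules: the closer a layer is to the top, the lower its average degree and the farther apart its nodes; the closer a layer is to the bottom, the higher its average degree and the closer together its nodes (see Figure 6).","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Search starts at the top layer. After finding the nearest node in the current layer, it moves down one layer, using that node as the entry point, and repeats until the result is found. Because upper layers have fewer nodes, lower average degree and longer edges, they provide a good search direction at very low cost. This avoids a large amount of useless computation and lowers the complexity of the search. Furthermore, if the maximum degree of the nodes in HNSW is set to a constant, the resulting graph has a search complexity of only log(n).","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3c/3c07a333f8f6af9b2ad5192b112a5a9b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△","attrs":{}},{"type":"text","text":"Figure 6. HNSW","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A notable advantage of HNSW is that it needs no training, which makes it very handy in scenarios where no initial data is available.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baidu's content side currently uses an optimized implementation of HNSW. It builds on the open-source version with many optimizations and improves performance by nearly 3.6 times.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5e/5efe53911ff536be47d49c605819c6e3.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"V. Matching Techniques","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Image matching","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are currently two main ways to represent an image: local feature points and image CNN vectors.","attrs":{}}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Local feature points: visual descriptors of the image, such as SIFT and ORB. They are invariant to scale, rotation and brightness, and are reasonably stable under viewpoint changes, affine transformations and noise.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Image CNN features: semantic features of the image, usually taken from the outputs of the last few layers of a CNN classification model or similar network.","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f6/f6c875b2f0fa05128c2d778ee12041dc.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pas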
tePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/68/68c93109645cd241508a0da8c251225a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△Figure 7","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The current matching pipeline therefore uses CNN features for candidate filtering, followed by post-verification with local visual features","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d1/d12ebea390f999f98700dabf1315b20b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△Figure 8","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. Video matching","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Video matching reuses the image matching techniques and adds video-level and audio-level matching on top of frame retrieval, mainly by using dynamic programming to compute the best-matching sequence.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/77/77dce0249a212eaf4a13191622288936.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph"
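,"attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"△Figure 9","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"VI. Summary","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In recent years, as computing technology has advanced, rich-media content such as images, video and audio has grown explosively, and content retrieval and matching techniques have found wider use in recommendation, search and many other fields. This article gives an overview of the basic principles and core technologies of Baidu's rich-media retrieval and matching system, and describes how each module works, including feature extraction, offline training and index building, online inference, and retrieval matching. The system provides a general-purpose solution for multimedia retrieval and matching, with high recall, high precision and high performance. Built on Baidu's two core businesses, FEED and search, it has the largest data scale and the richest range of resource types on the web, covering the vast majority of rich-media resources such as short videos, mini videos, live streams and images. It serves more than 30 product lines and has helped improve the effectiveness of Baidu's products.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Read the original: ","attrs":{}},{"type":"link","attrs":{"href":"https://link.zhihu.com/?target=https%3A//mp.weixin.qq.com/s%3F__biz%3DMzg5MjU0NTI5OQ%3D%3D%26mid%3D2247491769%26idx%3D1%26sn%3D4009630e1583d92c37770d59f8e4ffd7%26chksm%3Dc03ed0c5f74959d35eba94b0c239281bac274028754aef4e12267788c44a9a845768002f93f9%26token%3D1381754262%26lang%3Dzh_CN%23rd","title":null,"type":null},"content":[{"type":"text","text":"Key Technologies of Baidu's Rich-Media Retrieval and Matching System Explained","attrs":{}}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"For more technical articles and internal referral opportunities, follow the official account of the same name, 百度Geek說.","attrs":{}}]}]}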
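
The product quantization used in the article's index-building and retrieval steps can be sketched as follows. This is a minimal illustration in plain Python, not Baidu's implementation. The helper names (`split`, `pq_encode`, `adc_table`, `adc_distance`) are hypothetical, the toy codebooks hold 2 centers per subspace where the article uses 256, and the codebooks are written by hand rather than trained with K-means.

```python
from math import dist

def split(vec, nsq):
    """Split a vector into nsq equal-width sub-vectors."""
    width = len(vec) // nsq
    return [vec[i * width:(i + 1) * width] for i in range(nsq)]

def pq_encode(residual, codebooks):
    """Index building: keep only the ID of the nearest center in each subspace."""
    codes = []
    for sub, book in zip(split(residual, len(codebooks)), codebooks):
        codes.append(min(range(len(book)), key=lambda i: dist(sub, book[i])))
    return bytes(codes)  # one byte per subspace, enough for up to 256 centers

def adc_table(query_residual, codebooks):
    """Retrieval: squared distance from each query sub-vector to every center."""
    subs = split(query_residual, len(codebooks))
    return [[dist(sub, c) ** 2 for c in book] for sub, book in zip(subs, codebooks)]

def adc_distance(table, codes):
    """Approximate squared distance to a sample, using only its stored IDs."""
    return sum(table[s][c] for s, c in enumerate(codes))

# Toy setup (assumed values): feature_dim = 4, nsq = 2, 2 centers per subspace.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # codebook of subspace 0
    [[0.0, 1.0], [1.0, 0.0]],  # codebook of subspace 1
]
codes = pq_encode([0.9, 1.2, 0.1, 0.8], codebooks)
table = adc_table([1.0, 1.0, 0.0, 0.0], codebooks)
d = adc_distance(table, codes)
```

Each sample is stored as nsq bytes, and scoring it against a query costs one table lookup per subspace, which is why the raw feature vectors never need to be loaded at retrieval time.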

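
The five retrieval steps listed in the article can be sketched as follows. This is a toy under stated assumptions, not the GNOIMI engine. The parameter names gnoimi_search_cells and neighbors_count come from the article; the function names, the layout of `cells`, and the data are hypothetical. The cell centroid is assumed to be the sum of a first-level and a second-level center (the NO-IMI form, since the article's own formula is only available as an image), and exact distances stand in for the PQ approximation.

```python
import heapq
from math import dist, sqrt

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def search(query, coarse, fine, cells, gnoimi_search_cells, neighbors_count, top_k):
    """cells maps (first_level_id, second_level_id) to a list of (sample_id, vector)."""
    q = normalize(query)                                       # step 1
    ranked = sorted(range(len(coarse)),
                    key=lambda k: dist(q, coarse[k]))          # step 2
    heap = []                                                  # step 3
    for k in ranked[:gnoimi_search_cells]:
        for l, t in enumerate(fine):
            centroid = [s + r for s, r in zip(coarse[k], t)]   # assumed: S_k + T_l
            heapq.heappush(heap, (dist(q, centroid), k, l))
    candidates = []                                            # step 4
    while heap and len(candidates) < neighbors_count:
        _, k, l = heapq.heappop(heap)                          # nearest unvisited cell
        for sample_id, vec in cells.get((k, l), []):
            candidates.append((dist(q, vec), sample_id))
    return sorted(candidates)[:top_k]                          # step 5

# Toy data (assumed): 2 first-level centers, 2 shared second-level centers.
coarse = [[1.0, 0.0], [0.0, 1.0]]
fine = [[0.2, 0.0], [-0.1, 0.0]]
cells = {
    (0, 0): [("a", [1.0, 0.05])],
    (0, 1): [("b", [0.9, 0.1])],
    (1, 0): [("c", [0.1, 1.0])],
}
result = search([2.0, 0.0], coarse, fine, cells,
                gnoimi_search_cells=1, neighbors_count=2, top_k=2)
ids = [sample_id for _, sample_id in result]
```

The loop in step 4 checks neighbors_count only between cells, so it may collect slightly more candidates than requested; the final sort then trims the list to top_k.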