Text Relevance and Knowledge Distillation in Zhihu Search

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"導讀:"},{"type":"text","text":"大家好,我是申站,知乎搜索團隊的算法工程師。今天給大家分享下知乎搜索中文本相關性和知識蒸餾的工作實踐,主要內容包括:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"知乎搜索文本相關性的演進"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BERT在知乎搜索的應用和問題"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"知識蒸餾及常見方案"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"知乎搜索在BERT蒸餾上的實踐"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"知乎搜索文本相關性的演進"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 文本相關性的演進"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/42\/4251090247e937d0f89b5a9c8933522b.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們首先來介紹下知乎搜索中的文本相關性。在搜索場景中,文本相關性可以定義爲⽤戶搜索query的意圖與召回 doc 內容的相關程度。我們需要通過不同模型來對這種相關程度進行建模。整體而言,文本的相關性一般可以分爲兩個維度,字面匹配和語義相關。知乎搜索中文本相關性模型的演進也是從這兩個方面出發並有所側重和發展。在知乎搜索的整個架構中,文本相關性模型主要定位於爲二輪精排模型提供更高維\/抽象的特徵,同時也兼顧了一部分召回相關的工作。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Before NN"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d0\/d08280d435c813c91913b45b8bbca784.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"知乎搜索中的文本相關性整體演進可以分爲三個階段。在引入深度語義匹配模型前,知乎搜索的文本相關性主要是基於TF-IDF\/BM25的詞袋模型,下圖右邊是BM25的公式。詞袋模型通常來說是一個系統的工程,除了需要人工設計公式外,在統計詞的權重、詞頻的基礎上,還需要覆蓋率、擴展同義詞,緊密度等各種模塊的協同配合,才能達到一個較好的效果。知乎搜索相關性的一個比較早期的版本就是在這個基礎上迭代的。右下部分爲在基於詞袋模型的基礎上,可以參考使用的一些具體特徵。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. 
Before BERT"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2b\/2bd9c7202360eb52534db5ffa81f9bf0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於 BM25 的詞袋模型不管如何設計,主要還是隻解決文本相關性中的字面匹配這部分問題。第二階段引入的深度語義匹配模型則聚焦於解決語義相關的問題,主要分爲兩部分:雙塔表示模型和底層交互模型。微軟的DSSM(左下)是雙塔模型的典型代表。雙塔模型通過兩個不同的 encoder來分別獲取query和doc的低維語義句向量表示,然後針對兩個語義向量來設計相關性函數(比如cosine)。DSSM擺脫了詞袋模型複雜的特徵工程和子模塊設計,但也存在固有的缺陷:query和doc的語義表示是通過兩個完全獨立的 encoder 來獲取的,兩個固定的向量無法動態的擬合doc在不同 query的不同表示。這個反應到最後的精度上,肯定會有部分的損失。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"底層交互模型一定程度上解決了這個問題。這個交互主要體現在 query 和 doc term\/char 交互矩陣(中)的設計上,交互矩陣使模型能夠在靠近輸入層就能獲取 query 和 doc 的相關信息。在這個基礎上,後續通過不同的神經網絡設計來實現特徵提取得到 query-doc pair 的整體表示,最後通過全連接層來計算最終相關性得分。Match-Pyramid(右下)、KNRM(右上)是交互模型中比較有代表性的設計,我們在這兩個模型的基礎上做了一些探索和改進,相比於傳統的 BM25 詞袋模型取得了很大的提升。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. BERT"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/71\/7180bbcda0afe685f3e3da24cbf20352.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BERT模型得益於 transformer 結構擁有非常強大的文本表示能力。第三階段我們引入了 BERT希望能夠進一筆提高知乎搜索中文本相關性的表型。BERT 的應用也分爲表示模型和交互模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於交互模型來說,如下左圖,query和doc分別爲sentence1和sentence2直接輸入到BERT模型中,通過BERT做一個整體的encoder去得到sentence pair的向量表示,再通過全連接層得到相似性打分,因爲每個doc都是依賴query的,每個query-doc pair都需要線上實時計算,對GPU機器資源的消耗非常大,對整體的排序服務性能有比較大的影響。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於上述原因,我們也做了類似於DSSM形式的表示模型,將BERT作爲encoder,訓練數據的中的每個query和doc在輸入層沒有區分,都是做爲不同的句子輸入,得到每個句向量表示,之後再對兩個表示向量做點乘,得到得到相關度打分。通過大量的實驗,我們最終採用了 BERT 輸出 token 序列向量的 average 作爲句向量的表示。從交互模型到表示模型的妥協本質是空間換時間,因爲doc是可以全量離線計算存儲的,在線只需要實時計算比較短的 query ,然後doc直接通過查表,節省了大量的線上計算。相比於交互模型,精度有一部分損失。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"BERT在知乎搜索的應用和問題"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
## Applying BERT in Zhihu Search and the Problems It Brings

### 1. BERT in the search architecture

![figure](https://static001.geekbang.org/infoq/e4/e4cbbe1e27aeb58614a96480513689b5.jpeg)

As the figure shows, BERT plays an important role in both the recall and the ranking stages of Zhihu search. The interaction model mainly serves the second-stage fine-ranking model: it scores query–doc pairs online in real time and provides relevance features to the fine-ranking module. The representation model is split into online and offline parts: the online part computes sentence vectors for user queries in real time, while the offline part computes sentence vectors for the docs in the index in batch. On one hand, doc vectors are served for online ranking lookups through a two-level storage design of TableStore/TiDB plus Redis; on the other hand, faiss builds a semantic index over the batch of doc vectors, supplementing traditional term-based recall with vector-based semantic recall.

### 2. Semantic recall with the BERT representation model

![figure](https://static001.geekbang.org/infoq/9e/9e6ca320fec0618e728cf4cf0cf53161.jpeg)

Let's look at the semantic recall model in detail, starting with an example. For the query "瑪莎拉蒂 ghlib", what the user actually wants is the "瑪莎拉蒂 Ghibli" (the Maserati Ghibli), but users often cannot remember the full name and mistype it. With the misspelling, traditional term matching (the Google search example in the figure) can only recall docs about "瑪莎拉蒂" in general and cannot recall this specific model; this is where semantic recall is needed. More generally, semantic recall adds docs that do not match lexically but are semantically relevant.

The semantic recall model is an application of the dual-tower representation model from our BERT relevance work. BERT serves as the encoder to produce query and doc vectors; faiss builds a semantic index over the full set of doc vectors, and online the query vector is used for real-time retrieval. After this strategy went live, semantically recalled docs accounted for 5%+ of the total docs among the online top 20.
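As an illustration of the offline index build and the online query path, here is a minimal faiss sketch with an exact inner-product index; the production system would use an approximate index at much larger scale, and the vectors below are random stand-ins:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 64  # dimensionality of the (compressed) doc vectors

# Offline: batch-encoded doc vectors (random stand-ins here).
doc_vectors = np.random.rand(100_000, dim).astype("float32")
index = faiss.IndexFlatIP(dim)   # exact inner-product search; production would use an ANN index
index.add(doc_vectors)

# Online: encode the query in real time, then retrieve the nearest docs by inner product.
query_vector = np.random.rand(1, dim).astype("float32")
scores, doc_ids = index.search(query_vector, k=20)
print(doc_ids[0], scores[0])
```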
### 3. Problems brought by BERT

![figure](https://static001.geekbang.org/infoq/2a/2a0dc2cc6aaba67899525d0fe3b6a8bd.jpeg)

After the BERT models went live, they delivered good gains for multiple modules but also created quite a few problems for the whole system. These problems fall into three areas: online real-time computation, offline storage, and model iteration. See the figure above for details.

### 4. Attempts before distillation

Before distilling BERT, we made many different attempts to address the performance and storage problems above.

![figure](https://static001.geekbang.org/infoq/73/73df712c4e20c152414cadfef8640f44.jpeg)

For deploying the BERT interaction model we abandoned native TF Serving and instead rewrote model loading and serving in C++ on top of CUDA, combined with mixed precision. At our traffic scale, online serving performance improved to roughly 1.5x the original, which brought the BERT interaction model up to the minimum bar for going live. On top of that, we added a cache in front of the online BERT representation model, cutting requests by about 60% and noticeably reducing GPU resource consumption.

Another direction was to slim BERT down horizontally and vertically. Horizontally, we could reduce max_seq_length at serving time to cut computation, and compress the representation vectors to lower storage cost; both attempts caused varying degrees of loss in offline and online metrics and were abandoned. Vertically, the idea is to reduce model depth, i.e. the number of transformer layers, which significantly reduces both GPU memory and computation. Early on we tried training a small model directly, and fine-tuning a few layers of BERT-base on the downstream relevance task; neither scheme met the bar even on offline metrics, so neither went live.

For the problem that the sheer number of docs makes storage expensive and semantic index construction slow, we adopted a compromise: rules such as the Wilson score filter out most low-quality docs, and we only store representation vectors and build the semantic index for roughly 1/3 of the docs. As a result, some docs lack the relevance feature. For the weak-interaction problem of the representation model, we tried Poly-encoder (Facebook's approach), converting the single fixed 768-dimensional representation into multiple heads and computing attention over them, trading a partial drop in serving performance for a partial gain in accuracy.

## Knowledge Distillation and Common Approaches

### 1. Knowledge distillation

![figure](https://static001.geekbang.org/infoq/d7/d708ee2d326c584c15775e2c09d83d93.jpeg)

Here is a brief introduction to knowledge distillation. As the figure shows, it can be simplified as: the large model learns as much knowledge (data) as possible without worrying about serving cost, and the small model uses a moderate amount of data to efficiently learn the large model's outputs, achieving knowledge transfer. The small model is what actually serves online.

![figure](https://static001.geekbang.org/infoq/a2/a2855e31e71aecea18afc0906b0b5aad.jpeg)

Why does knowledge distillation work? The key lies in the soft target and the temperature. The soft target is the teacher model's output, which looks like a probability distribution; learning from soft targets instead of hard targets helps the student fit the labels better. The temperature further smooths the labels so the student can learn the knowledge between classes and within classes. Note that the temperature should not be too large, otherwise the differences between classes are smoothed away entirely.
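A minimal PyTorch sketch of the soft-target loss with temperature described above, in the standard Hinton-style formulation; the loss weight and temperature are illustrative, not the values used in production:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target distillation: KL to the temperature-softened teacher distribution
    plus cross-entropy to the hard labels."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # T^2 keeps the soft-loss gradient magnitude comparable across temperatures.
    soft_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage: batch of 4 examples, 2 relevance classes.
student_logits = torch.randn(4, 2, requires_grad=True)
teacher_logits = torch.randn(4, 2)
labels = torch.tensor([1, 0, 1, 1])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```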
### 2. BERT distillation approaches

![figure](https://static001.geekbang.org/infoq/d0/d0d2c2c98aacbb0f9aacd3b51cdebbee.jpeg)

We surveyed BERT distillation extensively and categorized the mainstream approaches. Along the task dimension, the schemes map onto today's two-stage pretrain + fine-tune paradigm, with quite a few methods targeting the pretraining stage and the downstream-task stage respectively. Along the technique dimension, they mainly differ in what knowledge is transferred and in how the model structure is designed. Below I'll briefly introduce two representative models.

### 3. Distillation: MiniLM

![figure](https://static001.geekbang.org/infoq/35/356984f07be717b337e26534bb781d75.jpeg)

MiniLM performs distillation on the pretraining task and is a general-purpose compression algorithm for Transformer-based pretrained models. It has three main ideas: first, distill the self-attention module of the teacher's last Transformer layer; second, also transfer the values–values dot-product relation within the self-attention module; third, use an assistant network to aid distillation.
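As a rough illustration of the first two ideas, the sketch below computes the two MiniLM-style transfer terms for a single layer, assuming teacher and student use the same number of attention heads and ignoring attention masks; it is not the paper's full training recipe:

```python
import torch
import torch.nn.functional as F

def minilm_transfer_loss(teacher_q, teacher_k, teacher_v,
                         student_q, student_k, student_v):
    """KL between teacher and student self-attention distributions, plus KL between
    their value-value relations. Tensors are [batch, heads, seq_len, head_dim]."""
    def relation(a, b):
        d = a.size(-1)
        return F.softmax(a @ b.transpose(-1, -2) / d ** 0.5, dim=-1)  # [b, h, seq, seq]

    def kl(p_teacher, p_student):
        return F.kl_div(p_student.clamp_min(1e-9).log(), p_teacher, reduction="batchmean")

    attn_loss = kl(relation(teacher_q, teacher_k), relation(student_q, student_k))
    value_loss = kl(relation(teacher_v, teacher_v), relation(student_v, student_v))
    return attn_loss + value_loss
```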
### 4. Distillation: BERT to Simple NN

![figure](https://static001.geekbang.org/infoq/f2/f2e3c30fc64706f79e9a1f479b86b9bc.jpeg)

BERT to Simple NN is mostly about designing the loss so that training is more efficient.

## Zhihu Search's Practice of BERT Distillation

### 1. Practice and gains from BERT distillation

![figure](https://static001.geekbang.org/infoq/99/9925bff638023b1c1cfacaf7539522bd.jpeg)

As mentioned earlier, we had already tried many things before distilling BERT, but all of them lost some accuracy. Our first goal for distillation was therefore that the offline model match the accuracy of the BERT model online. Distilling directly from BERT-base could not avoid an accuracy drop no matter what, so we tried larger models (BERT-large / RoBERTa-large / XLNet) as teachers. These deeper models were first pretrained on the full Zhihu corpus, then fine-tuned, and the fine-tuned models were used for distillation.

### 2. Distillation: Patient KD

![figure](https://static001.geekbang.org/infoq/51/5168d46b701643a40e883f3a715a196c.jpeg)

We distilled both the interaction model and the representation model, mainly following the Patient KD structure: the student is built from several layers of BERT-base, with its parameters initialized using different strategies, and learns from the RoBERTa-large teacher.

Knowledge transfer has three parts: the cross-entropy between the student's predictions and the ground-truth labels, the cross-entropy between the student's and the teacher's predictions, and the normalized MSE between intermediate hidden-layer vectors.

### 3. Distilling the BERT interaction model

![figure](https://static001.geekbang.org/infoq/48/4803f2639d76b3fa4539e6218356a6cf.jpeg)

For our chosen teacher, RoBERTa-large, the fine-tuned model reaches an nDCG of 0.914; the BERT-base previously used online is at 0.907; directly fine-tuning a 6-layer slice of BERT-base tops out at 0.903, a substantial loss relative to BERT-base.

With distillation, going from RoBERTa-large's 24 layers down to 6 layers reaches 0.911, exceeding the online BERT-base.

On the training-data side, we moved from mining click logs to gradually building a complete labeled dataset. Relevance training and distillation are now mainly based on the labeled data, which is split into title and content parts, with 100k+ queries and 3–4 million labeled docs.

### 4. Distilling the BERT representation model

![figure](https://static001.geekbang.org/infoq/c3/c3d232fcdb7238c332b9416eaf00333e.jpeg)

For the representation model we hoped to compress both the vector dimensionality and the number of layers at the same time, but the resulting student underperformed, so the final online representation model keeps 12 layers. To improve accuracy, the interaction model serves as the teacher during distillation. Because the interaction model scores a query–doc pair directly, its logits and the representation model's dot-product scores differ considerably in scale, so the loss is computed in a pairwise form, fitting the teacher's score differences rather than its absolute scores.

For dimensionality compression we ran comparison experiments: the BERT output is average-pooled and passed through a fully connected layer that compresses it to between 8 and 768 dimensions. As the figure shows, 128 and 64 dimensions perform about as well as 768. We tried 64 and 128 online, saw no obvious difference between them, and finally chose 64 dimensions, compressing the output dimensionality by 12x and lowering storage cost.
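Below is a minimal sketch of these two ideas together: a pooling-plus-projection head that compresses the 768-dimensional output to 64 dimensions, and a pairwise loss that fits the teacher's score differences. Shapes, names, and the exact loss form are assumptions for illustration, not the production recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Average-pool BERT token states, then compress 768-d to 64-d for storage."""
    def __init__(self, hidden_size=768, out_dim=64):
        super().__init__()
        self.proj = nn.Linear(hidden_size, out_dim)

    def forward(self, token_states, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_states * mask).sum(dim=1) / mask.sum(dim=1)
        return self.proj(pooled)

def pairwise_distill_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Fit the teacher's score differences between a doc pair rather than its absolute
    scores, since interaction-model logits and dot-product scores differ in scale."""
    return F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)

# Toy usage with random tensors standing in for encoder outputs and teacher scores.
head = ProjectionHead()
token_states = torch.randn(8, 32, 768)               # [batch, seq_len, hidden]
attention_mask = torch.ones(8, 32, dtype=torch.long)
query_vec = head(token_states, attention_mask)        # 64-d query vectors
doc_pos_vec, doc_neg_vec = torch.randn(8, 64), torch.randn(8, 64)
s_pos = (query_vec * doc_pos_vec).sum(-1)
s_neg = (query_vec * doc_neg_vec).sum(-1)
loss = pairwise_distill_loss(s_pos, s_neg, torch.randn(8), torch.randn(8))
loss.backward()
```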
### 5. Gains from distillation

The gains from distillation fall into online and offline parts.

![figure](https://static001.geekbang.org/infoq/40/40c5efcf7bf28c760d87791106baa510.jpeg)

**Online:**

The interaction model was compressed from 12 layers to 6. The P95 latency of the ranking relevance feature was cut in half, overall latency at the search entry dropped by 40ms, and the number of GPU machines needed for model serving was also halved, reducing resource consumption.

For the representation model's semantic index, title storage shrank to 1/4. For content, the dimensionality was compressed from 768 to 64; although that is a 12x reduction, the number of docs in the index increased, so content storage ended up at 1/6. Semantic-index recall also improved considerably, with title down to 1/3 and content down to 1/2. The fine-ranking module queries the offline-computed vectors online in real time, so the lookup service improved as well.

**Offline:**

Semantic index construction time for the representation model dropped to 1/4, and the underlying TableStore/TiDB storage used by Zhihu dropped to 1/6 of the original. LTR training data and training time also improved greatly. The pre-ranking stage initially used basic features such as BM25; later a 32-dimensional BERT vector was introduced, improving ranking precision.

**Speaker:**

申站, M.Sc., search algorithm engineer at Zhihu's community business center. He currently works on exploring and productionizing text relevance in the Zhihu search system, and also has long-term experience in query understanding and recall.

This article is reproduced from DataFunTalk (ID: datafuntalk).

Original link: [知乎搜索文本相關性與知識蒸餾](https://mp.weixin.qq.com/s/xgCtgEMRZ1VgzRZWjYIjTQ)