詳解預訓練模型在信息檢索第一階段的應用

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"導讀","attrs":{}},{"type":"text","text":":近年來,預訓練模型在自然語言處理的不同任務中都取得了極大的成功,在信息檢索中也進步不小。近期,我們邀請到了中科院的範意興博士現場分享,內容聚焦預訓練模型在信息檢索中第一階段檢索(召回階段)的應用,並對最近幾年的相關研究進行系統的梳理和回顧。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"全文6110字,預計閱讀時間21分鐘。","attrs":{}}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"一、信息檢索的發展過程","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1、信息檢索三個層次的視角","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)從基礎問題的角度。查詢與文檔的相關性度量是信息檢索的核心問題,重點關注檢索結果的準確度,但相關性目前仍無明確定義,是與場景、需求和認知有關的一個概念。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)從檢索框架的角度,需要從大量的文檔集合裏快速返回相關的文檔並排序,除了檢索結果的正確性,還要考慮檢索的效率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)從系統的角度,不僅要解決排序的問題,還要解決用戶意圖模糊、用戶輸入有錯、文檔結構異質等問題。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bf/bf0c3b2781f5c5e4d0004235d3cdedf2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2、信息檢索的發展過程","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1) 相關性度量","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相關性度量的建模範式,經歷了","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"傳統檢索模型、Learning-to-rank (LTR)模型、Neural IR(NeuIR)模型","attrs":{}},{"type":"text","text":"的演進。傳統檢索模型更多關注查詢和文檔中重疊的詞,在此基礎上建模權重,後將不同權重的得分組合,得到相關性得分。代表性的方法包括 PlV、DIR、Language Model、BM25等,都是通過對詞頻、文檔長度、以及詞的IDF的不同組合方式,來得到度量的得分。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cd/cdefc7fde9b9b03130cbebfcd29f925c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"LTR","attrs":{}},{"type":"text","text":"模型的建模過程,是從指標函數的計算變成特徵學習的過程。通過人工構造不同特徵,後採用神經網絡或者函數優化的方式,來解決特徵組合的問題。代表方法包括RankNet、RankSVM、LambdaMart等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c7/c7ef2b2cfab01ce33fe235a94897bafb.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"NeuIR方法","attrs":{}},{"type":"text","text":"更進一步,通過模型學習構建人工設計特徵的過程,採用機器學習優化的方式來抽取特徵。大致可以分爲3類:一是基於表示的方法,單獨學習查詢和文檔的表示進行打分;二是基於交互的方法,把查詢和文檔從底層做交互,對交互信號進行抽象得到打分;三是將二者結合起來進行打分。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5f/5fc28baa957c2d1190d5a588fbc01f3e.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2) 檢索框架","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"檢索框架的發展經歷了從單階段檢索到多階段檢索的過程,早期的檢索引擎(例如Indri、ES等)主要是構建單階段檢索,使用BM25、LM等傳統算法。後續引入了更多的階段來進一步提升系統的效率,包括L2R和Neural等,這些方法的計算複雜度很高,通常放在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Reranker","attrs":{}},{"type":"text","text":"階段。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/13/1337caf264bddcd969e49d0319d2bf4a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3) 檢索系統","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了滿足實際檢索場景的需要,檢索系統包含更多的模塊,包括查詢改寫、意圖理解、文檔索引等。在檢索系統中,從底層文檔存儲的結構來看,它經歷了從基於符號的系統架構到基於向量的系統架構的轉變,在真實的檢索系統中,通常也會結合二者的優勢來建模。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8d/8d83a50a3450b5d93d809948239e7dd6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於符號的檢索系統,通過把文檔表示成一個高維稀疏的向量,每個詞在向量裏佔一維,每個文檔映射在詞表向量上就是一個極度稀疏的表示。存儲結構採用倒排索引的方式,在離線階段把文檔索引起來。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c5/c5477e108e086d4626dad2ec14965743.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"基於向量的檢索系統","attrs":{}},{"type":"text","text":"主要通過語義映射將文檔映射到一個低維稠密的向量空間。這裏採用的索引方式包括基於圖和量化的方法,這種方式也帶動了整個系統的架構變遷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/24/245d2a4c8e5f0d642e4796ca4c5ff748.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"預訓練的方法在檢索系統中的不同模塊都有應用:在Query Parser層面有rewriting、expansion、suggestion;在Doc Encode層面也有很多文檔編碼的方式;在第一階段檢索、重排階段也提出了不同的模型。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"二、預訓練在檢索第一階段的應用","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1、基於Term-based檢索模型的兩個問題","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一階段檢索早期的方法是基於Term-based的方法,主要是對文檔清洗,通過倒排方式索引到引擎中。這類方法面臨兩個經典的問題:一是查詢稀疏與文檔冗餘帶來的詞失配的問題;二是無序詞排列導致丟失語義依賴信息的問題。這兩個問題造成語義信息丟失,導致後續建模無法提升性能。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2、三種解決方法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對以上問題主要有三類不同提升方法,分別是:稀疏檢索方法(Sparse Retrieval)、稠密檢索方法(Dense Retrieval)、混合方法(Hybrid Retrieval)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Sparse Retrieval:","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"保持表示的稀疏性,與倒排索引集成以實現高效檢索。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Dense Retrieval:","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從符號空間到語義空間,使用神經網絡算法進行語義相似檢索","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"**Hybrid:**結合基於倒排的檢索和語義檢索,繼承兩者的優點。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1)Sparse Retrieval","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據改進倒排索引中模塊的不同,Sparse Retrieval可以分爲Neural Weighting Scheme和Sparse Representation Learning兩類。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Neural Weighting Scheme","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"利用離散符號空間中的預訓練模型改進項加權方案。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主要包括:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"a)Term Re-weighting","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過改變原始文檔在倒排索引中的權重來提升語義,通過預訓練的模型從語義的角度來估計詞的權重。其中有兩個代表方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"DeepCT [Dai et al., SIGIR 2020]","attrs":{}},{"type":"text","text":":核心思想是用BERT裏學好的基於上下文的向量來估計每一個詞對於文檔的重要度並把它索引到倒排索引中替換原來的基於詞頻統計權重的方式。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/71/71dd0bfde5fd1c15880ec4a8633586ea.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏的關鍵問題是如何構建監督信號來學習詞權重,論文提出了一種","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Query term recall","attrs":{}},{"type":"text","text":"的監督信號進行學習。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"HDCT[Dai et al., WWW 2020]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/65/650c795999187f45378339b1c232b0c4.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"核心思想是將文檔拆成段落,對每個段落去估計每個詞的權重最後把每個段落裏的權重組合起來。存在兩個核心問題:一是每個段落中詞的權重如何評估,這裏直接採用了DeepCT的估計方式,此外,對預測的打分開根乘以N再Round到整數後進行索引,避免匹配過程被部分高權重的詞所主導。二是如何組合段落的詞權重形成文檔的詞權重,論文采用的是平均加權和按段落位置衰減的方式加權。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"b)Document Expansion","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲文檔擴展語義相關詞項,提高部分詞項的權重,緩解詞彙失配問題。其思想是在索引的過程中預測和文檔相關的詞並全部加入文檔中,添加到倒排索引中去。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"doc2query/docTTTTTquery [Noqueira et al., 2019]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d2/d2b7f101e8561b1cb26596cc6b11608c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"doc2query的結構是6層的Transformer,docT5query與doc2query的區別是利用了一個更強大的預訓練模型T5來生成擴展詞項。其核心思想都是利用給定的查詢和文檔的標註對訓練一個Seq2Seq生成模型,然後對每個文檔生成僞query拼接到文檔尾部進行索引構建。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"UED (Unified Encoder-Decoder) [Yan et al., AAAI 2021]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3d/3d8ed1cb0992e0efae40bbc90bf45382.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該方法的核心思想是利用訓練集中有標註好的(q,d)對同時訓練判別模型和生成模型,一方面可以只用Encoder部分將(q,d)拼起來做相關性的判別預測,另一方面可以額外加一個Decoder部分來生成查詢。兩部分同時學習互相增益會使Encoder部分更強。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"c)Expansion + Re-weighting","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該種方式是把文檔擴展與詞權重同時進行學習,代表性工作有:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"SparTerm [Bai et al., AAAI 2021]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SparTerm方法在整個詞表上進行詞權重評估,從而達到詞項擴展以及詞權重估計同時進行的效果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/03/037c2f7de5f44f9197541ae62c807dbf.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"SPLADE [Formal et al., SIGIR 2021]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SparTerm的重要性估計部分是提前學習好的並且從實驗結果來看詞項擴展能力比較受限,SPLADE在它的基礎上做了擴展:一是把原來詞估計的權重進行saturate function使擴展能力變強,二是利用FLOPS正則項約束來使整體網絡可以端到端地學習。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Sparse Representation Learning","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一種方法就是把文檔映射到一個語義向量的隱空間中去。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Latent Term Indexing","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/70/70c9d249758b48a2481c60fae2416b41.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"文檔倒排索引中每個維度對應的詞,現在映射到一個關聯多維語義的“latent term ”的ID。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SNRM [Zamani et al., CIKM 2018]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"核心思想是對文檔的詞序列進行卷積操作獲取每一個N-grams的表示,然後將其經過全連接的映射進行語義抽象後再反向稀疏化放大到更高的維度,最後通過池化操作得到高維稀疏的文檔表達。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/af/af2cb41ce114fc1f9135df55e90d324b.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"UHD-BERT (Bucketed Ultra-High Dimensional Sparse Representations) [Jang et al., 2021]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f0/f0f063b5e4abe1d5442d0021285e58df.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在SNRM基礎上提出一個基於BERT的稀疏表示學習方法,區別於SNRM用的卷積和全連接,UHD-BERT底層Encoder用的是BERT。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2)Dense Retrieval","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dense Retrieval直接改變了原有的檢索模式,把查詢和文檔映射到語義空間中,然後用ANN算法來進行檢索。一般採用雙塔的編碼方式對查詢和文檔單獨地編碼得到二者獨立的表達,從而使文檔可以進行離線索引。這類方法可以分爲兩類:Term-level Representations、Doc-level Representations。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Term-level Representations","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ColBERT [Khattab et al., SIGIR 2020]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2b/2b314cbc6a899b0b751433e08c1dbd73.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"核心思想是把交互的過程延遲到後面去使得前面表達的計算可以單獨進行,上層交互採用MaxSim計算函數對每個查詢詞和文檔中的每一個詞計算相似度取最大,然後把每個查詢詞的得分加和起來得到最終的相似度分值,ColBERT性能相對於上文提到的方法都有較大提升。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Doc-level Representations","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RepBERT [Zhan et al., 2020] , DPR [Karpukhin et al., EMNLP 2020]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/07/07cca2e2daee358bc7d361e8cd792a48.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該方法是採用雙塔的BERT來建模查詢和文檔的表示進行相似度計算。在訓練的方式上有一些特別,會採用一些BM25召回的強負例來提升模型的建模能力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0b/0bfc7b94c8c18f8912b3a070f133fbef.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以看到,RepBERT方法非常簡單而且性能比DeepCT等都要好但相對於BM25+BERT Large這樣交互式的模型還是要差。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TCT-ColBERT [Lin et al., 2020]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5d/5d41ff0edd263141f79d0bbbf3536381.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ColBERT性能相比傳統方法得到大幅提升但需要大量的文檔存儲,TCT-ColBERT通過蒸餾技術,將ColBERT的強建模能力蒸餾到類似於RepBERT這樣的雙塔架構上去,從而降低搜索延遲減少存儲代價。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multi-vector —— ME-BERT [Luan et al., TACL 2020]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/68/687303f8fff69c3ce3ef12dc556ad503.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將BERT得到的前M個詞的表達作爲文檔的表達,構建索引的時候同時索引多個向量,計算最終得分的時候通過max來取與查詢表達相似度最高的分值。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multi-vector  [Tang et al., TACL 2021]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e9/e9115d52ad81d2c1780f61827b50a85c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相較於直接取前M個詞的表達,該種方法用聚類的方法,利用k-means對文檔所有詞的表達進行聚類得到k個簇中心,所有簇中心的向量共同作爲文檔表達,然後與查詢計算相似度,取最大的相似度分值作爲最終的相關性得分。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3) Hybrid of Sparse-dense Retrieval","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了以上的兩種建模方式,還可以將兩種方式結合起來利用Sparse Retrieval和Dense Retrieval各自的優勢來提升性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/30/30903c5eed5dd05d8fa551c8ea20a482.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CLEAR  [Gao et al., ECIR 2021]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該方法是先用Sparse Retrieval進行檢索將檢索不好的查詢再用Dense Retrieval進行增強。方式就是先計算Sparse Retrieval的結果並估計殘差項,然後用殘差項來監督Dense Retrieval模型的訓練。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1e/1e2352fd1d3363d70d365f162e5cecd0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"COIL (COntextualized Inverted List) [Gao et al., NAACL 2021]","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ee/ee74fe04603b753952068fcd2dcafbc7.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一種是基於ColBERT基礎之上,ColBERT依賴於ANN的檢索方式計算量很大,而COIL改變底層索引結構改用倒排索引方式能夠減少計算量。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"三、在檢索第一階段的挑戰性問題","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1、負樣本採樣","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現在研究較多問題是,如何挖掘更強的負例來提升模型的判斷能力。一般來說,無論用怎樣的神經網絡方法,最後算loss時都是區分正例和負例間的差異。由於召回階段的文檔規模非常大,在訓練時不能將所有負例文檔都進行計算,實際訓練時一般通過採樣來選擇負例,如何高效挖掘最具信息量的負例是模型訓練的一個關鍵問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當前幾種方法有:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"‍‍‍‍‍‍‍‍‍ANCE‍‍‍‍‍‍‍‍‍","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"核心思想是初始學習先用BM25的方法採一些負例學習雙塔的BERT模型,後續的學習過程利用前面學好的模型進行初始化,並對文檔進行索引來檢索得到top-n文檔隨機採樣負樣本,其中每隔一段時間都需要對文檔索引進行更新,因此該方法的訓練代價較大。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/24/24996aa9908f093d7aeb5a11c42709b5.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"RocketQA","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RocketQA是百度提出的方法,它跟ANCE比較接近只是做了額外的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"denoised","attrs":{}},{"type":"text","text":"操作,如果使用學到的雙塔模型的召回結果進行負樣本採樣的話可能會帶來假負樣本。該方法額外訓練了一個更強的排序模型對召回的文檔中靠前的部分進行相關性評估並去除可能的僞負樣本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/95/95aa0d4984e11ee1d7abd8869a57b4bd.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"TAS-Balanced","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TAS-Balanced用了一個更復雜的採樣策略,先對查詢進行聚類,然後同一batch內的樣本按簇進行採樣。除此之外,在採樣負樣本時考慮了與正樣本具有不同margin的文檔。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a5/a53ef48ce7b172f508c14c5470af9fb8.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2、索引聯合學習","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大部分雙塔模型學習是獨立於索引過程的,在訓練過程中,查詢和文檔的相關性是基於神經網絡輸出的稠密向量計算得到。然而,在檢索時,相關性的計算是基於索引後的文檔表達,例如量化後的向量表示。兩者之間的差異會對檢索的性能造成一定的損失。那麼怎麼樣能在學文檔表示的同時,與索引階段聯合優化,也是一個很重要的問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e9/e9d9b75f1148e38ad4772b3cae631e60.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"京東做了一個較新的方法就是把相關性的計算跟中間索引的過程聯合起來。將乘積量化的索引過程的梯度回傳到表達學習模塊,以實現表達學習和索引構建的聯合訓練。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3、聯合生成模型和判別模型","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"判別模型學的是查詢和文檔相關性的一個判別問題,生成模型需要輸入一個文檔去生成查詢的序列。生成模型通常對模型要求更高因爲需要對文檔語義有深度的理解。如何結合兩種建模方式的優勢來提升檢索的性能也是一個值得探索的方向。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a4/a40dbb8eaa1e2326adea2e2d1a728e14.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前存在的一種做法是讓模型同時學習判別任務與序列生成任務,需要注意的是在學習序列生成任務時無法提前獲知「查詢」後面的詞,所以","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Mixed Attention階段","attrs":{}},{"type":"text","text":"需要一個特別的操作。這個操作就是對文檔而言每個詞都可以進行雙向交互,上下文mask。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"嘉賓介紹","attrs":{}},{"type":"text","text":":","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"範意興,博士,中科院計算所副研究員,主要研究內容包括信息檢索、自然語言處理等,在國際頂級學術會議SIGIR、WWW、CIKM等發表論文30餘篇,開發了深度文本匹配工具MatchZoo, Github的star數累計4000+。曾獲得中科院院長優秀獎,入選中國科協青年人才託舉工程、中國科學院青年創新促進會會員,擔任CIPS信息檢索專委會委員、CIPS青年工作委員會委員、以及國內外多個會議組委成員。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"推薦閱讀","attrs":{}},{"type":"text","text":":","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247506720&idx=1&sn=fafc2be64d476fa4bb0a93f286cbf9ab&chksm=c03eeb5cf749624a146a29150750b1194c605044e65dae0221dbd3f8501a21030519dae5c56b&scene=21#wechat_redirect","title":"","type":null},"content":[{"type":"text","text":"|百度商業大規模高性能全息日誌檢索技術揭祕","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247506407&idx=1&sn=3e65787b30cd6ab2b06d784cc5d0698a&chksm=c03ee99bf749608d3f7c74b4e2370d3b7d3ff102d57763ec8c6d4abdf22ee6229829f3f12d93&scene=21#wechat_redirect","title":"","type":null},"content":[{"type":"text","text":"|快速剪輯-助力度咔智能剪輯提效實踐","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247506362&idx=1&sn=7d2e59b8daed0b4adf369f6656dc09da&chksm=c03ee9c6f74960d00a859de5f33d936fc92cc8a9a9eaa49a221b8afa5118ae6312cbd590e7e7&scene=21#wechat_redirect","title":"","type":null},"content":[{"type":"text","text":"|短視頻個性化Push工程精進之路","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"---------- END ----------","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"百度 Geek 說","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"百度官方技術公衆號上線啦!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"技術乾貨 · 行業資訊 · 線上沙龍 · 行業大會","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"招聘信息 · 內推信息 · 技術書籍 · 百度周邊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"歡迎各位同學關注","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章