基於深度學習的短文本相似度學習與行業測評

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"文本相似度計算作爲NLP的熱點研究方向之一,在搜索推薦、智能客服、閒聊等領域得到的廣泛的應用。在不同的應用領域,也存在着一定的差異,例如在搜索領域大多是計算query與document的相似度;而在智能客服、聊天領域更注重的是query與query之間的匹配,即短文本之間的相似度計算。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不同的文本長度,相似度的計算方案也存在差異,長文本匹配更多注重文本的關鍵詞或者主題的匹配,業界使用的較多的算法如:TF-IDF、LSA、LDA;而短文本匹配更多的是句子整體的語義一致性,業界較爲主流的算法有:word2vec、esim、abcnn、bert等深度模型。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比於長文本的相似度計算,短文本的相似度計算存在更大的挑戰。其一,短文本可以利用的上下文信息有限,語義刻畫不夠全面;其二,短文本通常情況下,口語化程度更高,存在缺省的可能性更大;第三,短文本更注重文本整體語義的匹配,對文本的語序、句式等更爲敏感。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/dd/dd609df367630b74f2f11c9a8aa2d301.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不同文本相似度算法的得分分佈不一致,無法通過評分來對算法進行評估。因此對於不同的算法方案,可以設定特定的得分門限,得分高於門限,可判斷爲語義相同;否則,判斷爲語義不同。對於一個給定標籤的數據集,可以通過準確率來衡量相似度計算的效果。常用的中文評估語料有:LCQMC、BQ Corpus、PAWS-X (中文)、afqmc等。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. 主流方案","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業界常用的短文本相似度計算方案大致可以分爲兩類:監督學習與無監督學習,通常情況下,監督學習效果相對較好。在沒有足夠的訓練數據需要冷啓動的情況下,可優先考慮使用無監督學習來進行上線。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.1 無監督學習","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最簡單有效的無監督學習方案就是預訓練的方式,使用word2vec或者bert等預訓練模型,對任務領域內的無標籤數據進行預訓練。使用得到的預訓練模型,獲取每個詞以及句子的語義表示,用於相似度的計算。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Word2vec是nlp領域一個劃時代的產物,將word的表徵從離散的one-hot的方式轉化成連續的embedding的形式,不僅降低了計算維度,各個任務上的效果也取得了質的飛躍。Word2vec通過對大規模語料來進行語言模型(language model)的建模,使得語義相近的word,在embedding的表示上,也具有很強的相關性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過cbow或者max-pooling的方式,使用句子中每個詞的word embedding計算得到sentence embedding,可以使得語義相似的句子在sentence embedding的表示上也具備較高的相關性,相比於傳統的TF-IDF等相似度計算具有更好的泛化性。但是cbow的方式來計算sentence embedding,句子中所有word使用相同的權重,無法準確獲取句子中的keyword,導致語義計算的準確率有限,難以達到上線標準。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/18/184efb66a1785ca52c0ac08c3474ee6d.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1e/1eecf39f367c0333756c185f41a07ca4.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然Word2vec提供了一定的泛化性,但其最大的弱點是在不同的語境下,同一個word的表徵完全相同,無法滿足豐富的語言變化。gpt、bert等大規模預訓練模型的出現,徹底解決了這個問題,做到了word的表徵與上下文相關,同時也不斷刷新了各個領域任務的榜單。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但實驗證明直接使用bert輸出的token embedding來計算句子的sentence embedding,無論使用cbow的方式對所有token embedding求平均或者直接使用[CLS] token的embedding來表示,語義計算的效果都不佳,甚至不如GloVe。究其原因,在bert的預訓練過程中,高頻詞之間共現概率更大,MLM任務訓練使得它們之間語義表徵更加接近,而低頻詞之間的分佈更爲稀疏。語義空間分佈的不均勻,導致低頻詞周圍中存在很多語義的“hole”,由於這些“hole”的存在,導致語義計算的相似度存在偏差。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/be/be623a9123b4077705f4e5b8076a9b7b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了解決bert語義空間不均勻的問題,CMU與字節跳動合作的bert-flow提出將bert的語義空間映射到一個標準的高斯隱空間,由於標準高斯分佈滿足各向同性,區域內不存在“hole”,不會破壞語義空間的連續性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5d/5d7c11ff77221eef2a72b2c44494945e.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Bert-flow的訓練過程就是學習一個可逆的映射","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"f","attrs":{}},{"type":"text","text":",把服從高斯分佈的變量z映射到BERT編碼的u,那","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"就可以把u映射到均勻的高斯分佈,這時我們最大化從高斯分佈中產生BERT表示的概率,就學習到了這個映射:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6c/6c08c12b68c898c0a5072046ec2e7995.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實驗表明,通過bert-flow的方式來進行語義表徵與相似度計算的效果,要遠遠優於word2vec以及直接使用bert的方式。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5c/5ca0fb026d1b46d9ebe4291cf4faaf46.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.2 監督學習","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Bert-flow的出現使得無監督學習在文本相似度計算方面取得了較大進步,但是在特定任務上相比於監督學習,效果還存在一定的差距。監督學習常用的相似度計算模型大致可以分爲兩類:語義表徵模型,語義交互式模型。語義表徵模型常用於海量query召回,交互式模型更多使用於語義排序階段。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DSSM是搜索領域最常用的語義表徵模型之一,而在短文本匹配領域,使用最多的網絡結構是孿生網絡,常用的孿生網絡包括:siamese cbow,siamese cnn,siamese lstm等。孿生網絡訓練時,所有query使用相同模型來進行語義表徵,通過餘弦相似度等方式來計算query間的相似度,不斷最大化正樣本之間的相關性,抑制負樣本之間的相關性。預測時,每個query通過語義模型單獨獲取語義向量,用來計算query之間的相似度得分。由於query 語義表徵僅與本身有關,因此在進行query檢索時,可以提前對語料庫中query構建語義索引,大大提升系統的檢索效率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3f/3f023abc2132bae26366e94bd524f345.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比於語義表徵模型,交互式語義模型具有更好的匹配效果,模型結構往往也更加複雜,常用的交互式語義模型有ABCNN、ESIM等。交互式模型在計算query之間的語義相似度時,不僅對單個query的語義特徵進行建模,還需要query之間的交互特徵。交互式模型通常使用二分類的任務來進行訓練,當模型輸入的兩個query語義一致,label爲“1”,反之,label爲“0”。在預測時,可通過logits來作爲置信度判斷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a0/a06513318c8f6c1b01e72913a6bb80e9.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cd/cdde4171ae8a8ab83073d536d157be96.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大規模預訓練模型的出現,也橫掃了文本相似度任務的各項榜單。Bert將lcqmc數據集的SOTA帶到了86%的水平。隨後,Roberta、albert、ernie等新的預訓練模型層出不窮,也不斷刷新着匹配準確率的SOTA水平。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0f/0fba64261159a8fdff268446be336d03.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. 業務應用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在語義問答的業務中,通常會使用召回+排序的算法架構,在我們的閒聊業務中,我們也使用了類似的架構。使用siamese cnn語義表徵模型來進行語義召回,用蒸餾後的transformer語義交互模型來做排序。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/26/26f96fbf67b83830b6f173e9024af969.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在語義表徵模型的loss構建上,我們參考了人臉識別領域的損失函數設計。這個兩個任務在本質上是相似的,人臉識別是將人臉圖片用向量表示,而文本檢索式將文本用向量來進行表示,都期望正樣本之間有足夠高的相關性,負樣本之間足夠好區分。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在使用Siamese cnn進行語義建模時,我們使用了1個標準query,1個正樣本,5個負樣本(嘗試過其他負樣本數量,在我們的數據上效果不如5個負樣本),訓練過程其實是在這6個樣本中,識別出對應正樣本的位置,因此可將其轉化爲分類任務來進行訓練,每個正負樣本分別對應一個類別。使用每個樣本與標準query之間的相似度,來作爲對應類別的logits,對logits進行歸一化並構建loss函數。傳統的softmax歸一化構建的分類邊界使得類別之間可分,爲了更好的語義表徵效果,需要使得類內更加匯聚,類間更加分散。ASoftmax、AMSoftmax、ArcFace等歸一化方式,提出將所有query映射到一個球面,query之間的相似度通過他們之間的夾角來計算,夾角越小相似度越高,通過在角度域添加margin的方式,使得類內更匯聚,類間更可分,達到更好的語義表徵效果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a8/a867ed5cbdbf5515b54ee47eec073342.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們對比了softmax、Asoftmax、AMSoftmax、Arcface等不同歸一化方式,其中,Softmax沒有添加任何margin,ASoftmax通過倍角的方式在角度域添加margin,AMSoftmax則是在餘弦域添加margin,而Arcface則是直接在角度域添加固定margin。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/33/33f5f3e60af935464b4082131cc32931.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們使用30W的語料庫來構建索引,使用12900條線上query(語料庫中不包含完全相同的query)來進行召回測試,使用相同的向量索引工具,對比發現AMSoftmax、Arcface召回效果上有很大提升,在我們的業務中得到了應用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3a/3afd2d3405e93564b074744a10391db1.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在排序模型方面,我們嘗試了ABCNN、ESIM、transformer等交互式語義模型,但效果相比於bert等預訓練模型,還存在一定的差距。我們團隊自研的預訓練模型Xbert,在與Roberta large同規模的情況下,融入了自研知識圖譜數據,添加了WWM(whole word MLM)、DAE、Entity MLM等任務,使用LAMB優化器進行優化。我們使用XBert在業務數據上進行了測試,相比於同規模的Roberta large準確率有接近0.9%的提升。爲了滿足上線需求,我們參考tiny bert的方式,用Xbert蒸餾了一個4層的transformer model用於線上推斷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/20/20f7ae0464e52f2800798c7173b755f9.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們在內部的問答數據集上對不同排序方案做了的效果對比,使用12900條線上用戶真實query,進行全鏈路的效果對比測試。用語義召回top1的準確率來評估語義表徵模型的效果,並且通過消歧模塊進一步提升應答準確率;測試排序模型效果時,我們使用了多路召回,共召回30個候選,使用排序模型對候選排序,選擇排序後的top1作爲最終答案。若經過消歧模塊,所有候選均被消歧掉,或排序後的top1候選排序得分不滿足應答門限時,則該query系統無應答。因此,我們使用應答率與應答準確率來作爲系統最終的評測指標,來評估不同方案的效果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/90/90da7ea8f0b480b26a84bedb3130ef54.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了測試自研的Xbert在公開的語義相似度數據集上的效果,在lcqmc數據集上,單模型準確率88.96%,較Roberta large單模型87.9%的準確率,提升了1%;通過使用正樣本之間的傳遞性以及負樣本採樣的方式,來進行數據增強以及FGM對抗訓練的方式,準確率提升至89.23%;通過ensemble的方式,將準確率進一步提升至90.47%。通過相同的方式,在bq_corpus上達到了87.17%,在paws-x任務上達到了88%,在afqmc數據集上也達到了77.234%,在百度舉辦的千言文本相似度比賽中完成登頂。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/98/9871ed69af44827b1a53b7bcef86d44d.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. 總結與展望","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"短文本相似度在我們的閒聊領域得到了應用,使用語義表徵學習來進行召回+交互模型排序的算法架構,在保證系統性能的前提下,取得了不錯的業務效果。在語義表徵模型上,我們使用人臉識別領域的loss來提升召回效果;在語義排序方面,我們也利用了大規模預訓練模型以及模型蒸餾,來進一步提升業務效果。在大規模預訓練語言模型方面,我們積極探索與改進,相比於現有開源預訓練模型,我們的Xbert在業務上以及公開數據集上的評測效果,都有了進一步的提升。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在今後的工作中,我們會利用好預訓練模型這個核武器,在我們Xbert的基礎上努力優化突破,將文本的相似度匹配任務帶新的臺階。在解決單輪相似度匹配的情況下,我們也會繼續探索結合上下文的多輪匹配以及多輪生成等任務,來進一步提升我們閒聊業務的體驗。","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章