An ALBERT-Based Text Similarity Solution

# 1. Introduction

Many of our company's projects share a common requirement: when a new document arrives, search the historical repository to check whether a similar document already exists. In natural language processing this falls under text similarity computation. Put simply: given a text a to be compared and a collection C that may contain similar texts, find every text in C that is similar to a.

# 2. Exploring Approaches

Broadly, there are two approaches: first, compute similarity directly between the texts themselves; second, extract features from the texts and compute similarity between those features.

## 2.1 Computing Similarity Between Texts

### 2.1.1 Are there algorithms for this?

The answer is yes. Many algorithms can support it, such as BERT, ALBERT, and word vectors (hereafter: word2vec).

### 2.1.2 Advantages

- No "deep processing" of the text is required.
- The computational cost of invoking the algorithm is low (explained in detail below).

### 2.1.3 Disadvantages

- Texts contain many "meaningless" passages; computing similarity over them wastes compute and can also skew the final result.
- A text may be too long for the model to accept in a single pass, so it has to be split into several chunks, which gets complicated. For example, in the figure below the two texts T1 and T2 both describe three matters A, B, and C, but possibly in different lengths and orders. Suppose each has 1,000 characters in total. Taking ALBERT as an example, it accepts at most about 500 characters, so T1 and T2 must each be split into 4 chunks of 250 characters. A chunk of T1 may then describe only A while the corresponding chunk of T2 covers A plus part of B, which creates unnecessary confusion for the model. [**Note: in this case only 4 × 4 = 16 similarity computations are performed.**]

![Direct text-to-text comparison](https://static001.geekbang.org/infoq/16/16e00abd22d69e4cfddf80031c434a69.png)

Word vectors can sidestep the input-length limit, but once every word in a text has a vector, how do we compute similarity from all of them? Moreover, word vectors cannot handle polysemous words and do not capture semantic context well.

## 2.2 Computing Similarity Between Text Features

### 2.2.1 Are there algorithms for this?

Any algorithm for text-to-text similarity can be repurposed for feature-to-feature similarity.

### 2.2.2 Advantages

- No need to worry about the adverse effects of poor text chunking.
- Data annotation is simple.

### 2.2.3 Disadvantages

- Invoking the algorithm is costly.
- Returning to the example above: we can extract keywords from T1 and T2, say 40 from T1 and 30 from T2; the similarity algorithm then has to be invoked 40 × 30 = 1,200 times, versus only 16 for direct text-to-text comparison.

## 2.3 Final Choice

Both approaches have their pros and cons. But similar-text data is scarce, so annotating training data for direct text-to-text similarity would be difficult, and Chinese is subtle: a single changed character can alter a sentence's meaning. With feature-based similarity we only need to label similar pairs of features (e.g., keywords). It does consume more compute, but that can be solved by deploying a few more instances. We therefore ultimately adopted similarity computation between text features.

# 3. Keyword Similarity Computation

Texts have many kinds of features. Weighing several considerations, we chose "keywords" as the feature for similarity computation. (Rationale: before the AI algorithm computes similarity, structured information about the text's subject, such as person names, addresses, organizations, and ranks, is checked first; for the text fed into the algorithm, we only need to judge whether the keywords are similar, with no need to re-examine the full documents item by item.)

## 3.1 Keyword Extraction Algorithms

There are many keyword extraction algorithms, each with its strengths and weaknesses; below we introduce just a few commonly used ones.

### 3.1.1 TF-IDF Keyword Extraction

TF-IDF (term frequency-inverse document frequency) is a weighting technique widely used in information retrieval and text mining. It is a statistical method for assessing how important a term is to one document within a collection or corpus. A term's importance increases in proportion to the number of times it appears in the document, but decreases in proportion to its frequency across the corpus.

Example:

![TF-IDF extraction example](https://static001.geekbang.org/infoq/a4/a49047b0b17eb7138ea8512e95a37d4c.png)

TF-IDF's advantage is that it is simple and fast, and its results match reality reasonably well. Its drawback is that measuring a word's importance by term frequency alone is not comprehensive: important words sometimes do not appear often. The algorithm also ignores position, treating words near the beginning and words near the end as equally important, which is incorrect. (One remedy is to give greater weight to the first paragraph of the document and the first sentence of each paragraph.)
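For reference, here is a minimal sketch of TF-IDF keyword extraction. It is an illustration, not the project's implementation: the corpus is a placeholder, and jieba is assumed as one common choice for Chinese word segmentation.

```python
# Minimal TF-IDF keyword extraction sketch (illustrative, not production code).
# Assumes jieba for Chinese segmentation and scikit-learn for the TF-IDF weights.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["...document 1...", "...document 2..."]   # placeholder corpus

vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
tfidf = vectorizer.fit_transform(docs)            # (n_docs, n_terms) sparse matrix
terms = vectorizer.get_feature_names_out()

def top_keywords(doc_index: int, k: int = 10):
    """Return the k terms with the highest TF-IDF weight in one document."""
    row = tfidf[doc_index].toarray().ravel()
    best = row.argsort()[::-1][:k]                # indices of largest weights first
    return [(terms[i], float(row[i])) for i in best if row[i] > 0]
```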
### 3.1.2 TextRank

[TextRank](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) is a graph-based ranking algorithm for text. Its core idea comes from Google's [PageRank](http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf) algorithm: split the text into units (words, sentences), build a graph model, and use a voting mechanism to rank the text's important elements. Keyword extraction and summarization can then be performed using only the information in a single document. Unlike models such as LDA and HMM, TextRank needs no prior training over multiple documents, and its simplicity and effectiveness have made it widely used.

The general TextRank model can be represented as a weighted directed graph G = (V, E), consisting of a vertex set V and an edge set E, where E is a subset of V × V. The weight of the edge between any two vertices Vi and Vj is wji. For a given vertex Vi, In(Vi) is the set of vertices pointing to it and Out(Vi) is the set of vertices that Vi points to. The score of vertex Vi is defined as:

![TextRank score definition](https://static001.geekbang.org/infoq/d3/d39008c6fb221d5af8d8128f94d71855.png)

Here d is the damping factor, ranging from 0 to 1, representing the probability of jumping from a particular vertex to any other vertex; it is usually set to 0.85. When computing vertex scores with TextRank, the vertices are assigned arbitrary initial values and the computation iterates until convergence, i.e., until the error at every vertex falls below a given threshold, typically 0.0001.
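The iteration above is simple to implement. Below is a minimal sketch of the score update over a prebuilt graph; graph construction (co-occurrence window, part-of-speech filtering) is omitted, and the `edges` dictionary is a hypothetical input format.

```python
# Minimal sketch of the TextRank score iteration defined above.
# `edges` is a hypothetical input: {(u, v): weight} for a weighted directed graph.
from collections import defaultdict

def textrank(edges, d=0.85, tol=1e-4, max_iter=100):
    out_weight = defaultdict(float)   # total weight leaving each vertex
    in_edges = defaultdict(list)      # vertex -> [(source, weight), ...]
    nodes = set()
    for (u, v), w in edges.items():
        out_weight[u] += w
        in_edges[v].append((u, w))
        nodes.update((u, v))

    scores = {n: 1.0 for n in nodes}  # arbitrary initial values
    for _ in range(max_iter):
        new = {n: (1 - d) + d * sum(w / out_weight[u] * scores[u]
                                    for u, w in in_edges[n])
               for n in nodes}
        if max(abs(new[n] - scores[n]) for n in nodes) < tol:
            return new                # converged: every vertex below tol (0.0001)
        scores = new
    return scores
```

Ranking the vertices by score and keeping the top ones yields the keywords.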
### 3.1.3 Word Vector Clustering

Word vectors represent words as vectors; common word vector algorithms include Word2Vec, GloVe, and ELMo. This method first trains word vectors on the corpus, then clusters them with the K-means algorithm, and finally takes the K words closest to the cluster centers as the keywords.

Advantages:

- Fast, with accurate extraction results.

Disadvantages:

- The word vectors rarely cover all the keywords of a specific task.
- Missing keywords have to be added manually, which is cumbersome.
- Only K keywords are extracted; every other word is silently discarded, so extraction is very likely incomplete.
- Training word vectors is simple, but the resulting files tend to be large and are slow to load the first time.

### 3.1.4 Keyword Extraction via NER with BERT or ALBERT

BERT and ALBERT have been widely acclaimed over the past two years; NER is named entity recognition. In our "Smart Forensics" project we used BERT and ALBERT for NER, mainly to identify the location, size, and type of injuries in injury descriptions. We can therefore model a keyword as an entity to be extracted. The benefits are twofold. First, the "keyword" can be customized per task: in "Smart Forensics" the injury location is the keyword, while in "cause-of-action extraction" the cause of action is. This greatly improves the model's reusability; it no longer solves one specific problem but a whole class of problems. Second, generalization is strong: the extraction methods above share the weakness of requiring Chinese word segmentation, whose quality heavily affects the extraction results, whereas BERT and ALBERT can be trained at the character level, free from the constraints of word segmentation. The drawbacks: annotation effort and computation time grow with text length, though results are still returned within 3 s.

***Comparing the keyword extraction algorithms above, we ultimately adopted ALBERT + BiLSTM + CRF for keyword extraction.*** (BiLSTM: bidirectional long short-term memory network; CRF: conditional random field.)
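A minimal sketch of such a tagger follows, assuming the Hugging Face transformers library for the ALBERT encoder and the pytorch-crf package for the CRF layer. The tag inventory (e.g., B/I/O over keyword spans) and hyperparameters are assumptions for illustration, not the project's actual configuration.

```python
# Sketch of an ALBERT + BiLSTM + CRF keyword tagger (illustrative, not production).
# Assumes: pip install transformers pytorch-crf
import torch.nn as nn
from torchcrf import CRF
from transformers import AlbertModel

class AlbertBiLstmCrf(nn.Module):
    """Character-level tagger: ALBERT encoder -> BiLSTM -> CRF decoding."""
    def __init__(self, albert_name: str, num_tags: int, lstm_hidden: int = 128):
        super().__init__()
        self.encoder = AlbertModel.from_pretrained(albert_name)
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        x, _ = self.lstm(x)
        emissions = self.emission(x)          # per-character tag scores
        mask = attention_mask.bool()
        if tags is not None:                  # training: CRF negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction='mean')
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```

The decoded tag sequences (e.g., B/I/O spans over characters) are then read off as the extracted keywords.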
## 3.2 Similarity Computation Algorithm

Since keyword extraction already uses ALBERT + BiLSTM + CRF, we also adopt ALBERT for similarity computation.

**Why ALBERT rather than BERT?**

ALBERT is an improvement on BERT: it reworks BERT mainly through factorized embedding parameterization and cross-layer parameter sharing, while the overall architecture and the input/output format remain unchanged. ALBERT converges faster, predicts faster (roughly one tenth of BERT's inference time), and is far smaller (ALBERT_tiny is only a dozen or so MB), with accuracy comparable to BERT's.

Model structure (**the BERT in the figure is replaced by ALBERT in practice**):

![Model structure](https://static001.geekbang.org/infoq/f7/f7aef4cd9bde7c8139b07f66dbf6f0aa.png)

In practice, keyword 1 is prefixed with the [CLS] token and keyword 2 with the [SEP] token. On the output side we take only C, the sentence-level vector, attach a fully connected layer to it, and perform binary classification (1: similar, 0: dissimilar). This completes the similarity computation model; a minimal sketch of this pair classifier is given at the end of the article.

# 4. A Practical Example

The flowchart of the complete text similarity pipeline as currently used in our projects is as follows:

![Pipeline flowchart](https://static001.geekbang.org/infoq/c6/c6338bd653fc4bf213d6c9b729f7c0c4.png)

**Note: the final output is a weighted combination of the ALBERT similarity result and the text's structured-information checks.**

# 5. Summary

The overall text similarity pipeline is highly reusable, extracts keywords accurately, and yields good similarity results. One weakness remains: when a single request involves many keyword pairs, latency grows. Real-time performance is still acceptable for now, at around 3 s for 1,200 keyword pairs. If stricter latency is required, additional instances can be deployed, and the computation time is still being optimized.
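To close, here is a minimal sketch of the pair classifier described in section 3.2, again assuming the transformers library. The checkpoint name in the usage comment is a placeholder, and the real system wraps this model in the weighting logic shown in the flowchart above.

```python
# Sketch of the ALBERT keyword-pair similarity classifier from section 3.2.
import torch.nn as nn
from transformers import AlbertModel, BertTokenizerFast

class KeywordSimilarity(nn.Module):
    """Encodes '[CLS] kw1 [SEP] kw2 [SEP]' and classifies the pair."""
    def __init__(self, albert_name: str):
        super().__init__()
        self.encoder = AlbertModel.from_pretrained(albert_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]     # C: the output at the [CLS] position
        return self.classifier(cls)           # logits: 0 = dissimilar, 1 = similar

# Usage sketch (checkpoint name is a placeholder; Chinese ALBERT checkpoints
# commonly ship with a BERT-style tokenizer):
# tok = BertTokenizerFast.from_pretrained("voidful/albert_chinese_tiny")
# batch = tok(["keyword one"], ["keyword two"], return_tensors="pt", padding=True)
# logits = KeywordSimilarity("voidful/albert_chinese_tiny")(**batch)
```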