Exploring Semantic Similarity Optimization in Haodf Search

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着近年來自然語言處理技術的飛躍式發展,許多之前很難實現的自動化效果被逐步用於互聯網業務生產實際中,給我們帶來了高效便捷的服務體驗。本文記錄了好大夫在線在搜索業務上優化問答搜索相似性效果的探索。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"搜索引擎中,召回和排序是搜索流程的重要組分。當用戶進行查詢檢索時,搜索引擎首先會檢索召回大量的文檔,然後根據候選文檔的文本相關性、內容質量等特徵,綜合計算出每一個文檔的排序分值,最終展現給用戶,其中的"},{"type":"text","marks":[{"type":"strong"}],"text":"核心問題"},{"type":"text","text":"包括:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"理解用戶在找什麼;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"哪些文檔是和用戶意圖"},{"type":"text","marks":[{"type":"strong"}],"text":"真正相關的;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"哪些信息是可以信賴的優質內容;"}]}]}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/29\/2910413a66547c3ba95061b882da2bd8.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖1:搜索流程簡圖"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要理解這些查詢,提供更好的搜索體驗,用最小的成本找出用戶想要的相關文檔,並儘可能的把找到的相關度好的結果放到前面,讓用戶一眼能看到自己想要的結果(手氣不錯),或者讓用戶走火入魔陷入點了還想點的境地。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、好大夫搜索現狀和難點"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在好大夫在線的搜索業務中,我們需要理解用戶的檢索意圖,並在站內收錄的病程\/文章\/介紹中,返回用戶真正想要的結果。比如,用戶在檢索“感冒了能不能喫西瓜”時,除了返回“感冒”和“西瓜”相關的條目外,還應該理解用戶是在找感冒條件下一些相關的科普或者提問,應該觸類旁通,返回類似 “感冒了能不能喫水果” 或者 “感冒飲食禁忌” 之類的條目。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"好大夫在線搜索業務的特點:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"集中在醫療垂類,描述性查詢非常多,涉及大量實體和知識的不規範表述(口語化);"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"收錄的內容絕大部分經過了嚴格的審覈,很少出現標題黨和歪曲事實的東西。 
"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"傳統的搜索相關性技術中,最經典的是bm25,根據TF-IDF來計算查詢和文檔的相似度,主要考慮在詞級別上的匹配,但是需要維護實體詞和同義詞詞表,同時如果用詞索引無法召回相關性好的文檔,比如一些用戶記不清實際的名字(如西藥藥品)錯字少字或者描述性表示(胳臂上有好多個紅點),展示出的效果就不夠好,對於非專業用戶來說易用性也比較差。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5f\/5f2f0772056602c54a1591c912657732.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖2:搜索召回排序的新發展方向"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們面臨的難點包括並不限於:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"醫學知識專業性強、涵蓋廣泛,由於人力成本高昂,很難去發掘和標註更好的專業知識和語料;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"數據基礎建設不足,搜索業務積累數據少;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"描述性文本的非標準化性高,真實的全召回評價標準比較難建立;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"線上召回和排序響應時間需要足夠短等。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、相關性優化的目標"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"什麼纔是好的相關性?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"兩句話出現了很多一樣的字符,有很大的概率它們是相關的,這樣的結果很少;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相應句法結構上出現了同義詞,也有可能是相關的,這樣的依賴同義詞的積累;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然沒幾個詞相同,但是講了差不多一個意思或是想要的答案,更有可能是相關的;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"進一步理解用戶的潛在意圖,擴展到同類別或者總結性的知識,也可能是用戶想要的。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以,如果能有一種方式把語義相近的句子放在一堆,不受字符和詞的約束,檢索的時候去相應的堆裏找,會有很大概率獲得效果的提升。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align
":null,"origin":null},"content":[{"type":"text","text":"於是,我們就需要一個相關性打分模型,可以輸入兩個句子\/文檔對,很快的輸出一個相關性得分,語義相近的得分較高。 這樣就可以把模型認爲的結果放在前面讓用戶看到。例如判斷 “手心腫痛” 跟 “手掌心腫痛” 相似性要高於 “手心起紅色點,又腫又疼”,還高於 “手癢痛手臂麻”。這個排序模型要能足夠準確的衡量相關性。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8a\/8a4a625569c47236792fefc829a96ccd.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖3:相關性優化任務示意"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"相似度模型訓練初探索"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"飛速發展的自然語言處理技術已經在語義表示領域被廣泛使用,可以實現類似的效果,就是把一段文本通過模型編碼成一個向量,相近的意思得出的編碼向量可以比較接近。從最開始的讓詞義相近Word2vec到zero-shot預訓練模型FLAN,模型越做越大,語義表示做的越來越好,在搜索實際業務中使用這些技術是需要一定的適配和取捨的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c0\/c018a06cd2daef25f8e2b21c2045aa7a.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖片來源: Xu H, Zhengyan Z, Ning D, et al. Pre-Trained Models: Past, Present and Future[J]. arXiv preprint arXiv:2106.07139, 2021."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"業務數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先是數據,數據是人工智能的基石,模型擬合的就是輸入的訓練數據,沒數據再好的算法也出不來效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼如何找到搜索需要的講了同一個意思或者答案的句子? 搜索問答類場景下用戶有明確的搜索目標,會對需求進行顯式的描述,每個人的表述可能不盡相同。用戶很可能會對返回的想要的結果進行點擊進一步查看,所以從點擊日誌來的數據可以作爲初步的數據。當然,這些數據存在一些問題,比如頭部點擊過度(排的越靠前的被點的概率越大),暴露偏差expose-bias(展現列表沒出現的不會被點擊),以及獲取信息後進一步決策(想找一個看乳腺結節的醫生,搜索乳腺疾病醫生排行,然後點了一個醫院知名度高的醫生)等,需要一定的清洗和處理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的優勢在於公司已經積累了大量醫學相關的文本數據,我們都可以用來進行領域垂類預訓練,以加強預訓練語言模型的表示能力。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"模型結構"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後是模型結構,搜索場景對服務延時有很高的要求,大模型固然效果好,從Roberta-large到GPT-3人盡皆知,但動輒上億的參數對於我司CPU的線上推理十分不友好。需要相對取捨,用小的模型進行快速計算,知識蒸餾如DistilBERT[2]\/參數空間搜索AutoTinyBERT[3]類似的操作和推理優化如TVM[4]是少不了的。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"訓練任務"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更重要的是訓練模型的任務,讓模型學什麼? 如何讓模型儘可能多的貼合相關性打分實際? 
Training Tasks

Most important of all is the task the model is trained on: what should the model learn, and how do we make it match real relevance scoring as closely as possible? From pointwise to pairwise to listwise: pointwise methods predict a relevance score for each document-query pair; to learn the ordering between documents, pairwise methods recast ranking as two-way comparisons between documents; listwise methods learn richer interactions across the whole ranked list. From Sentence-BERT[5] to SimCSE[6], the ways documents and queries interact and are contrasted also keep evolving. We ran experiments on these tasks and adopted a positive/negative sampling scheme for contrastive learning similar to SimCSE's.
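A minimal sketch of that SimCSE-style objective: within a batch, each query's paired text is its positive and every other pair in the batch serves as a negative, optimized with an InfoNCE-style cross-entropy over the similarity matrix (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, pos_emb, tau=0.05):
    """SimCSE-style in-batch contrastive loss. Row i of query_emb is
    paired with row i of pos_emb; all other rows act as negatives.
    tau is an illustrative temperature, not a tuned value."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / tau                # (B, B) cosine similarity matrix
    labels = torch.arange(q.size(0))      # true pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: a batch of 8 text pairs with 128-dim embeddings.
loss = in_batch_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss)
```

This construction also explains why the large batch sizes mentioned below help: every extra pair in the batch contributes more negatives for free.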
(Image source: Gao T, Yao X, Chen D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv preprint arXiv:2104.08821, 2021.) (https://static001.geekbang.org/infoq/82/8215c80907ff5b229f51622c8d624b69.webp)

Following Sentence-BERT's finding that topic-level interaction in the embeddings yields better representations, we also added modules that constrain the embeddings to agree with the corresponding disease topic or content topic.

Figure 4: Model task modules (https://static001.geekbang.org/infoq/80/80893abb6a2b5fb1b7b723435d0dfcbe.webp)

Training tricks are also force multipliers: large batch sizes, virtual adversarial training[7], creative dropout schemes (ConSERT[8]), combined with well-matched loss functions such as InfoNCE[9] or triplet loss, all bring visible gains in robustness and metrics.

Similarity Model Training Optimization

From the early experiments we built a first model on our rather lean data. It approximates descriptive text reasonably well and separates relevant documents from the barely relevant. But we also ran into problems: the model does not recognize some long-tail terms, such as certain drug names, so when judging similarity it cannot tell which drug a mention should align with; and in queries containing many words, the model does not grasp well which words are important and which can be dropped, producing bad cases that pick up the sesame seeds while dropping the watermelon — the topic is identified correctly, but the results are still not good enough.

Stronger Training Tasks

We now need this small model to carry a heavier load, which calls for more demanding training tasks that encode real business experience and vertical-domain knowledge. Inspired by Su Jianlin's work[10] and Google's description of MUM, we make fuller use of existing data by designing from it additional training tasks likely to help the downstream task, using business data for unsupervised or semi-supervised learning that deepens the model's representations. Given our current gaps, the model needs to recognize words of various categories and the knowledge relations among them; to find the important words in a document or query, distinguishing which matter and which can be dropped; to decide whether a wrong word can be corrected; to match a document to its closest query; and to construct good negative query-query pairs...

Figure 5: Improved model training tasks (https://static001.geekbang.org/infoq/c9/c93626f96515483940669f0100dda46a.webp)

So, based on our foundational data and knowledge, we designed tasks over documents and queries such as detecting relevant entities, replacement-based error correction, and rewrite contrast. The model trains these tasks in parallel in a contrastive learning setup, mapping all the text categories it must handle into one shared abstract semantic space, paired with losses such as am-softmax[11] and KL-regularized dropout (R-Drop[12]) to pull samples defined as similar as close together as possible and push samples defined as dissimilar farther apart.
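Of the losses named above, R-Drop is the least self-explanatory, so here is a short sketch following the paper's recipe: the same batch is passed through the model twice with different dropout masks, and a symmetric KL term penalizes disagreement between the two predicted distributions (alpha is illustrative):

```python
import torch
import torch.nn.functional as F

def r_drop_loss(logits1, logits2, labels, alpha=1.0):
    """R-Drop sketch: logits1 and logits2 come from two forward passes
    of the same batch under different dropout masks. The task loss is
    averaged over both passes, plus a symmetric KL consistency term."""
    ce = 0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(logp1, logp2.exp(), reduction="batchmean")
        + F.kl_div(logp2, logp1.exp(), reduction="batchmean")
    )
    return ce + alpha * kl  # alpha is an illustrative weight

# Toy usage: two stochastic passes over the same 4-example batch.
l1, l2 = torch.randn(4, 3), torch.randn(4, 3)
print(r_drop_loss(l1, l2, torch.tensor([0, 1, 2, 0])))
```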
Figure 6: Training samples and tasks (https://static001.geekbang.org/infoq/df/df6e7716cba4cbce09972a42716f2f64.webp)

Entity-level meanings and categories are now represented much better, and results improved accordingly.

Figures 7/8: Examples of the improved results (https://static001.geekbang.org/infoq/82/82e165bec6935fd37b8fbc33c0bf1ff5.webp, https://static001.geekbang.org/infoq/e2/e249cb0ef2d7c83e578268730106bc14.webp)

Actual Results

After the optimization went live, the page click-through rate on Q&A search results rose by 4.6%, indicating that users were more willing to click the returned results (search recalled more relevant results), as shown below:

Figure 9: Q&A click-through rate before and after launch (https://static001.geekbang.org/infoq/8d/8dce16782e884c97c63f31621fbfd00a.webp)

In addition, users' action length on Q&A result pages (e.g., one search followed by three clicks gives an action length of 4) increased by 8.5%, indicating that users clicked more results (the results were more relevant), as shown below:

Figure 10: User action length on search pages before and after launch (Q&A and informational) (https://static001.geekbang.org/infoq/fe/fe7f97a1682cb2aa0bef0d3baca01358.webp)

III. Challenges Ahead

Since the semantic similarity optimization went live, proactive user evaluation and click-data analysis have confirmed that this direction works. Next, we will keep building on medical domain knowledge and steadily improve the related data and model capabilities.

Of course, beyond similarity models, Haodf search still has a long way to go. Building on the vast store of medical content Haodf has accumulated over 15 years, we hope to create "the most practical medical search" and be a simple, trustworthy companion for our users.

References:

1. Han X, Zhang Z, Ding N, et al. Pre-Trained Models: Past, Present and Future. arXiv preprint arXiv:2106.07139, 2021.
2. Sanh V, Debut L, Chaumond J, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
3. Yin Y, Chen C, Shang L, et al. AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models. arXiv preprint arXiv:2107.13686, 2021.
4. Chen T, et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018.
5. Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084, 2019.
6. Gao T, Yao X, Chen D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv preprint arXiv:2104.08821, 2021.
7. Miyato T, Maeda S, Koyama M, et al. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1979-1993.
8. Yan Y, Li R, Wang S, et al. ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer.
arXiv preprint arXiv:2105.11741, 2021.
9. Hjelm R D, Fedorov A, Lavoie-Marchildon S, et al. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
10. 蘇劍林 (Su Jianlin). SimBERTv2來了!融合檢索和生成的RoFormer-Sim模型 [Blog post], Jun. 11, 2021. Retrieved from https://spaces.ac.cn/archives/8454
11. Wang F, Cheng J, Liu W, et al. Additive margin softmax for face verification. IEEE Signal Processing Letters, 2018, 25(7): 926-930.
12. Liang X, Wu L, Li J, et al. R-Drop: Regularized Dropout for Neural Networks. arXiv preprint arXiv:2106.14448, 2021.

About the author: Cao Teng (曹騰), algorithm engineer at Haodf Online, focused on research and production deployment of natural language processing techniques.