How Baidu Content Risk Control Completes Word-List Matching Within Seconds

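{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Introduction: ","attrs":{}},{"type":"text","text":"Checking whether one string contains another can be done with a simple string-matching algorithm. Checking whether a string contains any of N strings is a different problem when N can exceed ten million, and a simple string-matching algorithm no longer meets the need. Tens of millions of words must be flexibly maintainable, each business side must be able to match against its own words, and matching must be fast enough to return results within seconds. We therefore need a solution built for this class of problem: the word-list service.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"Full text: 5,370 characters; estimated reading time: 12 minutes.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Background","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The content review platform needs to detect whether articles published by authors contain specific sensitive words. Different business lines have different requirements for these words: some are strict and some lenient; some need single words and some need multi-word combinations; some need to detect implicit words and variant words; some rules apply to the title and some to the body; some hits are sent to manual review and some are rejected outright; some lines need a few thousand words, while others need tens of thousands, millions, or even tens of millions. Each business line should be able to maintain its own words, with easy adding, deleting, and editing, and configure the rules under which each word takes effect. At detection time, a business side should be able to check an article against the words it maintains, and detection must be timely enough to return results in real time.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Architecture","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/45/451937a0c192d651ec23b18d91baedd3.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"orig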
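in":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"The figure above shows the overall architecture of the word-list service:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(1) Word-list management","attrs":{}},{"type":"text","text":": Each business line maintains its own word lists on the word-list management platform. A business line can add multiple word-list groups, and within each group it can maintain sensitive words and dynamically add attributes to them. The platform uses ES to provide efficient tokenized search over word lists and tens of millions of words. Word-list management periodically generates a word-list BOS file per business line and uploads it to the BOS service.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(2) Service layer","attrs":{}},{"type":"text","text":": Business sides call the unified external matching interface of the word-list service, and the service layer sends the matching task to the strategy operator layer, which performs the matching. The unified service acts as a simple gateway. It provides authentication, verifying that each request is legitimate; rate limiting, with a limit configurable per caller; result processing, since the strategy operators return only part of each sensitive word's attributes and the remaining attributes can be filled in according to the business side's needs; and traffic forwarding, routing each business line's requests to different clusters by configuration so that strategy operators can be deployed in separate clusters per business.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(3) Strategy operator layer","attrs":{}},{"type":"text","text":": Strategy operators match sensitive words in text. The matching modes are contains matching, strong-filter matching, and multi-pattern matching, and the sensitive words that hit are returned to the service layer. Each operator loads every business line's word lists into memory, either by full refresh or by real-time synchronization of incremental data. Full refresh: the word-list management platform periodically generates a BOS file per business line and uploads it to the BOS service, and the strategy operators periodically sync sensitive words from the BOS files into memory. Real-time sync: the strategy operators scan the word-list database in real time and load incremental words into memory.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(4) Base services","attrs":{}},{"type":"text","text":": The service is developed on the GDP framework and deployed on the Pandora platform; MySQL 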
stores the word-list data, ES provides tokenized search over the word lists, bdrp provides rate limiting and caching, and the BOS service handles the transfer of word-list files.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Word-List Management Platform","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The word-list management platform lets each business line maintain its own word lists. A business line can create multiple word-list groups to organize its sensitive words by category. The meaning of each group is defined by the business side, and shows up in whether the business side handles a hit differently depending on which group the sensitive word belongs to. Sensitive words are maintained within each group, and the business side chooses their attributes. Take the review-type attribute: if a hit on a given word should be sent to review, the business side selects the review-word value; if it should be rejected, it selects the reject-word value.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.1 Word-List Management","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each business line can add, modify, and search word lists.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) Creating a word list: select the business line it belongs to and add a name and remarks. A word list can be created under multiple business lines at once, and if another business line has a reusable word list, it can be copied directly into the newly created one, which makes management quick and convenient for administrators. See Figure 1:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c0/c05ee65828714bccd247cd04af8751eb.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type"
:"text","text":"(2) Modifying a word list: the name and remarks can be changed, and the word list can be reassigned to a different business line. If another business line has a reusable word list, it can be copied directly into this one, which makes management quick and convenient for administrators. See Figure 2:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/15/156bc4811c44cfdbec9b7f93119302d7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(3) Searching word lists: word lists can be searched by word-list ID, word-list name, business line, and creation time. Name search uses ES to support tokenized search on word-list names. The result list shows each word list's ID, name, business line, creation time, update time, number of entries, remarks, status, operator, and other attributes. Clicking the status in the list toggles a word list between active and inactive. In the actions column, Modify edits the word list, Append adds words to it, and View shows its details. See Figure 3:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ad/add599d4ac966e9cbc7b1bd215402dc9.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.2 Sensitive-Word Maintenance","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Sensitive words can be maintained quickly and efficiently within a word list. The important sensitive-word attributes are:","attrs":{}}]},{"type":"paragraph","attrs":
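{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) Entry type: whether the sensitive word is a review word or a filter word;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) Sensitivity type: the sensitivity category of the entry;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(3) Matching mode: contains matching checks whether the text contains the sensitive word; strong-filter matching checks whether the text contains the sensitive word after its Chinese characters, letters, digits, and special characters are recombined; multi-pattern matching checks whether the text hits 2 or 3 words whose spacing falls within a valid range.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(4) Effective position: where in the article the sensitive word applies, such as the title, the body, or text inside images.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(5) Exemption word: an attribute of sensitive words in contains matching. If the sensitive word is A and the exemption word is B, and the text contains AB, the sensitive word A is not hit.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(6) Extension strategy: multi-pattern position swap means that for a multi-pattern word AB, text containing BA also hits AB; letter case conversion ignores case, so if the sensitive word is cd, text containing cD, Cd, or CD hits cd.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(7) 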
Expiration time: either permanently valid or a specific expiration time.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Sensitive-word maintenance provides single add, batch add, single edit, batch edit, word-list search, and entry search:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) Single append: the target word list is already determined, and the business side chooses the word's attributes according to its needs. If the matching mode is set to contains, exemption words can be added for the word during the append. See Figure 4:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e6/e6600c4608028fb6e7e665286dc1341f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) Batch add: up to 3,000 entries can be added synchronously at a time, and they can be added to different word lists across different business lines in one operation, which simplifies maintenance for administrators. All words being added must share the same attributes to use this feature, and phase two does not support adding exemption words to contains words here. Multiple entries are entered one per line in the sensitive-word input box. See Figure 5:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/02/02e3810aa693d7a600069ee2ff596445.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":
[{"type":"text","text":"(3) Batch create: the business side maintains the sensitive words and their attributes in an Excel sheet, with up to 30,000 words per file. Submitting the file creates a creation task that runs in the background. Multiple tasks can be created at once; they are executed sequentially. See Figure 6:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/25/256a3f11fb6c30c7b458c60b92270c17.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(4) Single edit: any attribute of an entry can be modified. If a sensitive word is a contains word that was added through synchronous batch add, its exemption words can be added here. See Figure 7:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fc/fcdf3b3c8cc32504d8738108a878f3d8.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(5) Batch edit: the business side maintains the sensitive words and the attributes to change in an Excel sheet, with up to 30,000 words per file. Submitting the file creates an update task that runs in the background. Multiple update tasks can be created at once; they are executed sequentially. See Figure 8:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4e/4e801087e6535362da4e2145672277e5.png","alt":"圖片","title
":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(4)敏感詞檢索,可以根據敏感詞的審覈類型、匹配模式、生效位置、敏感類型、操作人、所屬業務線、所屬詞表、敏感詞的創建時間等屬性檢索,敏感詞的檢索使用了ES分詞檢索的特性,可以支持分詞檢索,也可以實現精確檢索;檢索的列表中展示了敏感詞的名稱、所屬業務線、所屬詞表、操作人、操作時間、備註等字段,可以查看總體數量,可以導出,批量解除,操作欄中,可以點擊修改,進入修改頁面,可以點擊解除,解除此條敏感詞。如圖9:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/68/68fd7d4e8aff53fc3dff04f80ef84d75.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"四、詞表服務統一入口","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"詞表服務統一入口,提供了標準的 API 
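endpoint. Business sides call the unified external matching interface of the word-list service, and the service layer sends the matching task to the strategy operator layer, which performs the matching. The unified service acts as a simple gateway. It provides authentication, verifying that each request is legitimate; rate limiting, with a limit configurable per caller; result processing, since the strategy operators return only part of each sensitive word's attributes and the remaining attributes can be filled in according to the business side's needs; and traffic forwarding, routing each business line's requests to different clusters by configuration so that strategy operators can be deployed in separate clusters per business. The detailed flow is shown in Figure 10.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a3/a3dce8f8d4df7c33945021a763d1152b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5. Loading Word Lists into the Strategy","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The way the strategy loads word lists went through several iterations before it became mature and stable.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Version 1: the word-list management platform generated a single word-list file covering all business lines and uploaded it to BOS, and the word-list strategy scanned and loaded it every 30 minutes. Because all business lines were packed into one file and loaded in a single pass, loading was slow.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Version 2: the 30-minute delay eventually could not meet business needs, so the word-list management platform generates one word-list file per business line and pushes them to BOS, and the word-list strategy periodically loads them in multiple threads, split by business line. The time for a word list to take effect dropped from 30 minutes 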
to 5 minutes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Version 3: 5 minutes was still too slow for some special scenarios, so we added real-time synchronization: every 10 seconds the word-list strategy scans the database for incremental data and loads it into memory. This approach is not suited to increments in the tens of thousands of words; it only fits increments of fewer than ten thousand.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Today versions 2 and 3 run side by side and complement each other. The whole evolution is shown in Figure 11:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d4/d466629a6aca4f1b7eed63619a16a758.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BOS file format: columns are separated by tabs, the parts of a multi-pattern word are joined with & 
(ampersand), and contains words are identified by a + prefix. The main fields are sensitive-word ID, sensitive-word name, ID of the word list it belongs to, multi-pattern word spacing, expiration time, review type, matching type, business line, effective position, sensitivity type, extension strategy, and exemption words. See Figure 12:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5f/5fd8ab7c8dbffee836c168269fccd4b9.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Full loading and real-time incremental synchronization: a full load runs once at startup and then repeats at an interval of half an hour or more, configurable per business line. Incremental sync checks the database every 10 seconds for new data and loads it into memory page by page. See Figure 13:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f5/f51bb296acfb93017d639d9d04ad3acf.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"The in-memory structures the strategy uses to cache word lists are as follows:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) A dictionary mapping business line, effective position, and sensitive word to sensitive-word ID. When a sensitive word is matched, its ID can be found quickly from the business line and effective position, and the ID 
is then used to fetch the word's attribute rules, which determine whether the matched word is valid. See Figure 14:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/37af0ee3dd38b996644b8b610972c24c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) A dictionary mapping sensitive-word ID to attribute rules, built by processing and storing each sensitive-word line of the BOS file. The attribute rules can be looked up quickly by sensitive-word ID and are used to determine whether a matched word is valid. See Figure 15:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/31/314689f9536e36e005174849ceeb7528.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(3) Sensitive words are mounted onto a dictionary tree (Trie), with one Trie per business line and effective position. The Trie is the core of the word-list strategy: matching against tens of millions of sensitive words returns a result within 10 ms. See Figure 16:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5b/5b1a24414ac3f74de8211cda4ad727b2.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":nu
ll,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6. Word-List Strategy Matching Implementation","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"6.1 Matching Flow of the Word-List Strategy","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The strategy's matching flow is shown in Figure 17:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/20/20d2ad1e2746853875c8ba58747cd882.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) Input parameters: request_id is the unique identifier of the request, used for tracing across upstream and downstream; req_from is the request source, identifying the calling business side; token is used for permission checks; service_line identifies the business line and thus the word lists to match against; conent is the text to match, together with its configuration, which identifies which effective position's sensitive words are needed. See Figure 18:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/85/85431653e72422dec74de8a0c043dcc0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type"
:"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) Extract the Chinese characters, letters, digits, and special symbols from the text and combine them into different text fragments for strong-filter matching. See Figure 19:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1d/1d28c203a01f5efc2fd32e3559ef0591.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(3) Based on the business line and the position of the text, send the text to the corresponding Trie to match individual sensitive words. Each result contains the sensitive word, its position in the text, and its length; position and length are used to check whether the spacing of a multi-pattern word is valid. The match results are shown in Figure 20:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/89/8941a63ca42fd2bd25cf234df4521915.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(4) Using the business line, effective position, and sensitive word, look up the sensitive-word ID in the 
match_data cache (Figure 14), then use the ID to fetch the word's attribute rules from the line_cahe cache. If the matched word is a contains word or a filter word, it is output as a hit directly. If it is a multi-pattern word, check whether the other words in the combination are also hit; if they are, and the order and spacing of the words satisfy the multi-pattern word's attribute rules, it is output as a hit. The returned result is shown in Figure 21:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1b/1b7c6296c1af1ca3e1fe225b17109022.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"6.2 Solving Matching Timeouts on Large Texts","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PGC image-and-text articles often run to several hundred thousand characters. With that much text, tens of thousands of words can be recalled, and evaluating the matching rules for all of them takes so long that matching times out.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The optimization is shown in Figure 22:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) Before optimization: for a large article with a 100-character title and a 199,000-character body, the title is matched first, taking 10 ms, and then the body, which takes 19 s because of its length. The total is the sum of the two, roughly 20 s.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) After optimization: a body longer than 5,000 characters is first split into multiple fragments of at most 5,000 characters, and the fragments are matched in parallel. The total time is that of the slowest parallel fragment, 50 ms in this example.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://sta
tic001.geekbang.org/infoq/21/2161e2a8c6d4a479d19de5ad03dc2c24.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"6.3 Trie Implementation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Trie matching algorithm uses dictmatch, an open-source C++ library from inside the company. dictmatch implements the simplest form of the Trie algorithm without threading improvements, so matching requires backtracking. It represents the Trie with two tables, which greatly reduces the Trie's large memory footprint. The trade-off is that building the tree is relatively slow, while queries are very fast.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Trie structure is shown in Figure 23:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b1/b184bf8f05939b3e25b5b6320feb0b44.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"7. Future Work and Reflections","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Special-character support: the current word storage and Trie matching algorithm do not support emoji and other special characters. The next round of optimization will focus on supporting special characters so that the service can meet more business needs.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number"
:0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Deployment by business line: the word-list service currently has 60+ business sides, all deployed together. Every instance loads a copy of every business line's word lists, which consumes a great deal of memory, and a problem in the word-list service affects all business sides. Deploying every business line in its own cluster would increase maintenance costs, so we are exploring a way to deploy by business line automatically.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Recommended reading:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247496007&idx=1&sn=ea4e0dc518177e456ff01a2961af2842&chksm=c03ec13bf749482dd2a5d241d68d087454fd79f4f204fcbc98821dbbaecf6b2d51945c111a41&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"|揭祕百度微服務監控:百度遊戲服務監控的演進","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247495573&idx=1&sn=ed2ab72bdea9ac56cb3c63c2afc5fbb0&chksm=c03edfe9f74956ff87711de4647dbf0e9fc3067bacd9a1363bb08827d8257af4b59dff7a1012&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"|如何像百度直播一樣優化用戶體驗(起播篇)","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247495344&idx=1&sn=6337a259a68066fd0c25dc807e3ca29c&chksm=c03edeccf74957da8269a7ca6e4c4ee067ae59645d5196289e688f721523eece797348ea19b1&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"|百度搜索穩定性問題分析的故事(下)","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},
"content":[{"type":"text","text":"---------- END ----------","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"百度Geek說","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"百度官方技術公衆號上線啦!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"技術乾貨 · 行業資訊 · 線上沙龍 · 行業大會","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"招聘信息 · 內推信息 · 技術書籍 · 百度周邊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"歡迎各位同學關注","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}}]}