International Hotel Aggregation Algorithm Optimization

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"『Aggregation solves the problem of “making data comparable”; its success rate and accuracy directly set the tone for users' price-comparison experience on the site.』"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the hotel channel, aggregation has always been regarded as the foundation and core of the business. Whether as the quote search engine Qunar originally positioned itself as or the price-comparison platform it has since become, the business model requires us to collect large volumes of hotel data from many agents and channels and to consolidate it, "},{"type":"text","marks":[{"type":"strong"}],"text":"so aggregation solves the problem of “making data comparable”"},{"type":"text","text":". It is no exaggeration to say that the success rate and accuracy of aggregation directly set the tone for users' price-comparison experience on the site."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In short, the job of hotel aggregation is:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"To unify hotel data from different agents (the hotel tree) under the corresponding Qunar hotel, giving the price-comparison platform a mapping between equivalent hotels."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Take the following aggregation relationship as an example:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/56\/c1\/56b1889d8286a784d73331854b06d3c1.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text"
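,"text":"The current hotel aggregation algorithm makes its decision mainly from "},{"type":"text","marks":[{"type":"strong"}],"text":"hotel name, address, city,"},{"type":"text","text":" coordinates, and phone number. Name and address are the primary signals and in most scenarios decide the aggregation result directly; coordinates and phone numbers serve only as auxiliary signals because of problems with source-data conventions (for example, inconsistent coordinate systems, or hotel and agent phone numbers being supplied interchangeably)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. Pain Points and Difficulties"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"『Address conventions are localized and vary widely between countries; an approach based on text-similarity computation has limited ability to parse the information in hotel names and addresses.』"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"International hotels started late at Qunar, so the accumulated base data, especially at the aggregation level, is limited, and "},{"type":"text","marks":[{"type":"strong"}],"text":"data from international agents is uneven in quality and poorly standardized"},{"type":"text","text":". On top of that, operations resources are limited, so "},{"type":"text","marks":[{"type":"strong"}],"text":"building aggregation relationships by hand is costly, inefficient, and unrealistic"},{"type":"text","text":". Given all this, we urgently needed to improve the automatic aggregation algorithm for international hotels. This article describes the aggregation algorithm optimizations made over the past year for several key countries, in the hope of offering some ideas and reference points to readers facing similar business pain points."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The pain points and difficulties of international hotel aggregation currently concentrate in two areas:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"A. Localized differences in address information between countries"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/4e\/ca\/4e76218331b691b34ba7a540e031feca.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the figure above shows, address formats differ from country to country and are interspersed with localized information (the parts marked in red in the figure)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"B. The original algorithm computes text similarity and has limited ability to parse the information in hotel names and addresses"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At its core, the hotel aggregation algorithm computes text similarity between two hotel records and declares an aggregation relationship once the similarity score reaches a threshold. But long text fields such as names and addresses contain several kinds of components, such as brand, branch, hotel industry words, street name, house number, and city, and these components differ in importance."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, “永正” (Yongzheng) as brand information almost directly pins down “北京永正商務酒店” (Beijing Yongzheng Business Hotel), whereas “商務酒店” (business hotel), as an industry word, is too vague to locate any specific hotel. If these components are not distinguished and text similarity is computed directly, the two hotels below may be wrongly aggregated together."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/f0\/c0\/f07b4cd7fbb90578e3f44a20b31214c0.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the figure below shows, the current international hotel aggregation algorithm is almost unable to split hotel names and addresses into their detailed components."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null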
}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/69\/51\/69ee2a7de731e89bbf4ab41fd22dee51.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. Optimization Approach"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"『Collect the name and address formats that occur most frequently and generate all candidate word segmentations; move from plain text matching to matching segmentation results against candidate patterns, then compute a weighted similarity score.』"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned above, hotel names and addresses can be broken down into several kinds of components, the important ones being:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/43\/27\/4309e7140f27064a6d89966db2b8c927.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Of these, the "},{"type":"text","marks":[{"type":"strong"}],"text":"city hierarchy, brand, and POI dictionaries"},{"type":"text","text":" are built on Qunar's base hotel information, with coverage and synonyms extended using agent data and official hotel websites;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By segmenting hotel names and running reverse high-frequency statistics, combined with manual screening by the operations team, we can compile hotel "},{"type":"text","marks":[{"type":"strong"}],"text":"descriptor-word"},{"type":"text","text":" and "},{"type":"text","marks":[{"type":"strong"}],"text":"industry-word"},{"type":"text","text":" dictionaries;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Street-name information in hotel addresses contains a great deal of localized content, so these keyword dictionaries have to be built per region (generally per country): segment the addresses, count the high-frequency results, and screen them against official data sources (Google, Wikipedia, and so on)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Based on the hotel data already in our database, we collected statistics on how names and addresses are composed and compiled the most frequent formats, as shown in the excerpt in the figure below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/a8\/fa\/a8d1e1195a3d974782774d315103b7fa.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With this notion of pattern matching, the various components can be extracted easily from a given hotel name and address, for example:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Hotel name splitting example"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/38\/47\/38e948762a5cde26568ceb9fe9e89b47.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Hotel address splitting example"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/80\/d8\/808a8f68a3fc864b9ab7f251221271d8.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":tru
e,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"[Key points]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A complete aggregation run after the optimization proceeds as follows"},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"A hotel record awaiting aggregation enters the pipeline;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Using dictionaries of each vocabulary type collected in advance (brand, city, country), parse the hotel name and address and find all possible segmentation results;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Match the segmentation results against all candidate patterns and take the best valid match as the parse result;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Using the key components from the parse (brand + branch name, street name + house number), run a full-text search over the already-aggregated data and shortlist the n most similar results as the candidate set;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"Compare the pending hotel with each candidate component by component. The string-based similarity from each comparison (possible match types include exact match, containment, prefix, suffix, and so on) is combined with each component's own weight (for hotel names: brand > branch > industry word > city) to produce an overall score;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"Adjust the score in certain special cases: deduct points when the street name is the same but the house number differs; add points when the contact phone number is the same or the coordinates are close together;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":7,"align":null,"origin":null},"content":[{"type":"text","text":"Pick the candidate hotel with the highest similarity score as the final candidate and check whether its score exceeds the given aggregation threshold; if so, the aggregation is considered successful."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. Results"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For eight key international countries, we optimized the aggregation algorithm using the pattern-matching approach; the results are shown in the figure below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/e9\/bc\/e99e7891dc159da89719e2c41d463abc.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We will go on to share our experience with room-type aggregation for international hotels. Feedback and discussion are welcome."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the author"},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Wang Gang (王剛) joined Qunar in 2013 and is currently responsible for the international hotel supply chain system, focusing on base data integration, competitor analysis, and search & aggregation algorithms."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"This article is reprinted from the WeChat official account Qunar技術沙龍 (ID: QunarTL)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original link"},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzA3NDcyMTQyNQ==&mid=2649263854&idx=1&sn=ce65fb27b7d3cc0e8422ec64f8c6f41d&chksm=87675f10b010d606822f4586ae3dcdd07ab7e1d872861e72fa608991e31982bf4d2368406a79&token=1438974945&lang=zh_CN#rd","title":"","type":null},"content":[{"type":"text","text":"International Hotel Aggregation Algorithm Optimization"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
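Steps 5 to 7 of the aggregation flow in section 3 (weighted per-component comparison, special-case adjustments, threshold check) can be sketched roughly as below. This is a minimal illustration rather than Qunar's implementation: the article gives only the weight ordering brand > branch > industry word > city, so the numeric weights, the match-type grades, the adjustment sizes, the threshold, and every field and function name here are assumptions. It also assumes name and address have already been parsed into components, and it omits the coordinate-distance bonus.

```python
from typing import Dict, List, Optional

# Hypothetical weights; the article only states brand > branch > industry word > city.
WEIGHTS = {"brand": 0.5, "branch": 0.3, "industry": 0.15, "city": 0.05}
THRESHOLD = 0.7              # hypothetical aggregation score line
HOUSE_NUMBER_PENALTY = 0.2   # hypothetical: same street, different house number
PHONE_BONUS = 0.1            # hypothetical: identical contact phone


def component_similarity(a: str, b: str) -> float:
    """Grade one component pair by match type (grades are assumptions)."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0                      # exact match
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if long_.startswith(short) or long_.endswith(short):
        return 0.8                      # prefix or suffix match
    if short in long_:
        return 0.6                      # containment
    return 0.0


def score_pair(pending: Dict[str, str], candidate: Dict[str, str]) -> float:
    """Step 5 (weighted sum) followed by step 6 (special-case adjustments)."""
    score = sum(
        weight * component_similarity(pending.get(key, ""), candidate.get(key, ""))
        for key, weight in WEIGHTS.items()
    )
    street = pending.get("street")
    if street and street == candidate.get("street"):
        no_p, no_c = pending.get("house_number"), candidate.get("house_number")
        if no_p and no_c and no_p != no_c:
            score -= HOUSE_NUMBER_PENALTY
    phone = pending.get("phone")
    if phone and phone == candidate.get("phone"):
        score += PHONE_BONUS
    return score


def aggregate(pending: Dict[str, str],
              candidates: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Step 7: return the best candidate if it clears the threshold, else None."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score_pair(pending, c))
    return best if score_pair(pending, best) > THRESHOLD else None
```

Keeping the weights in a single table means the component ordering can be tuned per country without changing the comparison logic.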