Meituan Waimai Feature Platform: Construction and Practice

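{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1 Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Meituan Waimai (Meituan's food delivery business) spans many kinds of business and a rich set of scenarios. By business characteristics it divides into three major lines (recommendation, advertising, and search) plus several sub-lines such as merchant recommendation, dish recommendation, list ads, and food delivery search, which together serve the food delivery needs of hundreds of millions of users. Behind every business line lies a balance among the interests of three parties: users need accurate results, merchants need as much exposure and conversion as possible, and the platform needs to maximize revenue. Algorithm strategies maintain this balance by iteratively optimizing model mechanisms, which keeps the ecosystem healthy."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the business has grown, the Waimai algorithm models have kept evolving. They have moved from simple linear and tree models to today's complex deep learning models, and predictions have become increasingly accurate. This is due not only to continuous tuning of model parameters but also to the engineering support that the Waimai algorithm platform provides for growing compute demand. By unifying the algorithm engineering framework, the platform has solved the systemic problems of model and feature iteration and greatly improved the iteration efficiency of Waimai algorithms."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By function, the Waimai algorithm platform consists of three parts: model serving, model training, and the feature platform. Model serving provides online model prediction, model training produces trained models, and the feature platform supplies the feature and sample data behind both. This article focuses on the challenges encountered while building the Waimai feature platform and the ideas used to address them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/03\/03281c304142f4f6f718f6d9b4fe8ff0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Admittedly, feature systems have been studied widely in the industry. For example, WeChat's FeatureKV storage system focuses on fast synchronization of feature data; Tencent Ads feature engineering focuses on Pre-Trainer problems in machine learning platforms; and the Meituan Hotel & Travel online feature system focuses on feature access and production scheduling under high concurrency. The Waimai feature platform, by contrast, focuses on providing a one-stop pipeline of sample generation -> feature production -> feature computation, in order to enable rapid feature iteration."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the Waimai business grows, feature volume is also growing rapidly, and the challenges and pressure on the platform keep increasing. The platform currently hosts nearly ten thousand feature configurations across nearly 50 feature dimensions, processes tens of TB of feature data and on the order of hundreds of billions of features per day, and schedules several hundred tasks daily. Faced with this volume of data, how can the platform achieve rapid feature iteration, efficient feature computation, and configuration-driven sample generation? The rest of this article shares some of Meituan Waimai's thinking and optimization ideas from building the platform, in the hope that they prove helpful or inspiring."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2 Evolution of the Feature Framework"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1 Shortcomings of the Old Framework"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the early days of the Waimai business, to speed up strategy iteration, the algorithm engineers distilled their accumulated experience into a general-purpose feature production framework. It consisted of three parts: feature statistics, feature push, and feature fetching and loading, as shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4a\/4af6966e988b6ff367e3b0d714915767.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature statistics"},{"type":"text","text":": Based on the underlying data tables, the framework computes statistical features such as totals and distributions for specific dimensions over multiple time windows."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature push"},{"type":"text","text":": The framework maps records in Hive tables to Domain objects and writes the serialized results to KV storage."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature fetching and loading"},{"type":"text","text":": The framework reads Domain objects from KV storage online and provides the deserialized results for model prediction."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The framework was used across several Waimai business lines and provided strong support for iterating algorithm strategies. However, as the business grew, business lines multiplied, and data volume increased, it gradually exposed three shortcomings:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"High feature iteration cost"},{"type":"text","text":": The framework lacked configuration-based management. Launching a new feature required changing both offline and online code, so iteration cycles were long."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Difficult feature reuse"},{"type":"text","text":": Similar scenarios exist across Waimai business lines, which makes feature reuse possible, but the framework offered little support for it. The result was wasted resources and features whose value could not be fully realized."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Missing platform capabilities"},{"type":"text","text":": The framework provided low-level development capabilities for reading and writing features, but it could not track and manage the full feature iteration lifecycle as a platform."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2 Advantages of the New Platform"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address the shortcomings of the old framework, we began building a new feature platform in mid-2018. Through continued exploration, practice, and optimization, the platform's functionality has gradually matured and taken our feature iteration capability to a new level."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature platform framework consists of three parts: training sample generation (offline), feature production (nearline), and feature fetching and computation (online), as shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/28\/28649c3859b9ee0a3df330b1811d12b0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Training sample generation"},{"type":"text","text":": On the offline side, the platform provides unified, configuration-driven training sample generation, supplying the data needed to validate model effectiveness."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature production"},{"type":"text","text":": On the nearline side, the platform provides processing, scheduling, storage, and synchronization for massive feature data, ensuring that feature data takes effect online quickly."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature fetching and computation"},{"type":"text","text":": On the online side, the platform provides highly available feature fetching and high-performance feature computation, flexibly supporting the feature needs of many complex models."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"t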
ype":"text","text":"The Waimai feature platform now serves multiple Waimai business lines and covers dozens of scenarios, providing platform-level support for strategy iteration. Its advantages are twofold:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Business efficiency"},{"type":"text","text":": Configuration-based feature management, reuse of features, operators, and solutions, and the unification of offline and online processing improve feature iteration efficiency, lower onboarding cost, and help teams get results quickly."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Business enablement"},{"type":"text","text":": The platform builds a feature evaluation system on a unified standard, which helps features be borrowed and shared across business lines and maximizes their value."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3 Building the Feature Platform"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1 Feature Production: Producing Features at Massive Scale"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are many ways to synchronize features. A common industry practice is to develop MR jobs or Spark jobs, or to use a synchronization component, that read multiple fields from multiple data sources and sync the aggregated result to KV storage. This is simple to implement but has the following problems:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Repeated feature pulls"},{"type":"text","text":": When the same feature is used by different jobs, it is pulled repeatedly, wasting resources."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"No global scheduling"},{"type":"text","text":": Sync jobs are isolated and independent of each other. Without a global scheduling mechanism across jobs, feature reuse, incremental updates, and global rate limiting are impossible, which slows feature synchronization."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Inflexible and fragile storage"},{"type":"text","text":": Storing a new feature requires changes to upstream and downstream code or files, so iteration is costly. When feature data is abnormal, old data must be re-imported over a long period, so rollback is slow."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Around these problems, this article introduces the core mechanisms of feature production from three angles:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature semantics mechanism"},{"type":"text","text":": Addresses the efficiency of pulling and transforming features from hundreds of data sources."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature multi-task scheduling mechanism"},{"type":"text","text":": Addresses fast synchronization of massive feature data."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature storage mechanism"},{"type":"text","text":": Addresses the configurability and reliability of feature storage."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.1.1 Feature Semantics"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature platform currently ingests hundreds of upstream Hive tables and nearly ten thousand feature configurations, most of which need daily updates. How can the platform pull features from upstream efficiently? The intuitive options are to organize the work around feature configurations or around upstream Hive tables:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Per feature configuration"},{"type":"text","text":": The platform launches a separate pull task for each feature configuration."}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pros: flexible control."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cons: every feature launches its own pull task, which is inefficient and resource-intensive."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Per upstream Hive table"},{"type":"text","text":": Multiple feature fields from the same Hive table are pulled in a single task."}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":
{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pros: the number of tasks is bounded and resource usage is low."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cons: task logic is tightly coupled. Adding a feature requires knowing how the other fields of the Hive table are pulled, which raises onboarding cost."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Both approaches have their own problems and neither meets business needs well. The feature platform therefore combined the strengths of both and, after exploration and analysis, introduced the concept of feature semantics:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature semantics"},{"type":"text","text":": Extracted and merged from four fields of a feature configuration: upstream Hive table, feature dimension, feature filter condition, and feature aggregation condition. In essence it represents an identical query condition, for example: Select "},{"type":"text","marks":[{"type":"strong"}],"text":"KeyInHive"},{"type":"text","text":",f1,f2 "}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From "},{"type":"text","marks":[{"type":"strong"}],"text":"HiveSrc"},{"type":"text","text":" Where "},{"type":"text","marks":[{"type":"strong"}],"text":"Condition"},{"type":"text","text":" Group by "},{"type":"text","marks":[{"type":"strong"}],"text":"Group"},{"type":"text","text":". When these four fields are configured identically, fetching the two features f1 and f2 can be merged into a single SQL statement, reducing the total number of queries. The platform also makes semantic merging automatic and transparent: whoever adds a feature only needs to define the pull logic for the new feature and does not need to know about other fields in the same table, which lowers onboarding cost."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature platform processes feature semantics in two stages, semantic extraction and semantic merging, as shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f7\/f75b68c7cfbb988a5012b9f338239bd9.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Semantic extraction"},{"type":"text","text":": The platform parses the feature configuration, builds a SQL syntax tree, and applies several equivalence rules (such as commutativity and equivalent substitution) to generate a canonical SQL statement that uniquely represents the semantics."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Semantic merging"},{"type":"text","text":": If different features share the same semantics, the platform merges their extraction, for example: Select KeyInHive, "},{"type":"text","marks":[{"type":"strong"}],"text":"Extract1 as f1"},{"type":"text","text":", "},{"type":"text","marks":[{"type":"strong"}],"text":"Extract2 as f2"},{"type":"text","text":" From HiveSrc Where Condition Group by Group, where Extract is the feature's extraction logic. The extraction of f1 and f2 can be merged, and the extracted feature data is written to the shared feature table for use by multiple business teams."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.1.2 Feature Multi-Task Scheduling"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To synchronize tens of TB of data quickly every day, the feature platform first defined three task types that follow the feature processing flow of fetching, aggregation, and synchronization: feature semantic tasks, feature aggregation tasks, and feature sync tasks:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature semantic task"},{"type":"text","text":": Pulls and parses feature data from data sources and writes it to the shared feature table."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature aggregation task"},{"type":"text","text":": Lets each business line (tenant) fetch the specific features it needs from the shared feature table and aggregate them, producing a full snapshot and incremental data."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature sync task"},{"type":"text","text":": Syncs incremental data (daily) and full data (periodically) to KV storage."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, the feature platform built a multi-task scheduling mechanism that chains the different task types together to make feature synchronization more timely, as shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d8\/d830d5bb42ef62ada2189e5da02d4b42.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","at
trs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Task scheduler"},{"type":"text","text":": Following the task execution order, it polls the status of upstream tasks in a loop to ensure that tasks run in order."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature semantic task scheduling"},{"type":"text","text":": Runs the semantic task once the upstream Hive table is ready."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Upstream monitoring: the ready status of upstream Hive tables is obtained in real time through the upstream task scheduling interface, and data is pulled as soon as a table is ready, keeping feature pulls timely."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Semantic priority: every semantic is assigned a priority. Features of high-priority semantics are aggregated and synced first, so that important features are updated promptly."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Queue selection: the platform obtains the real-time status of multiple queues and prefers the queue with the most available resources for running semantic tasks, improving execution efficiency."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature reuse"},{"type":"text","text":": The value of a feature lies in reuse. A feature needs to be onboarded only once and can then circulate among different business lines, which is a form of business enablement."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Features are stored centrally in the shared feature table, where downstream business teams can read them on demand and use them flexibly."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type"
:"text","text":"Unified onboarding and reuse of features avoids recomputing and storing the same data, saving resources."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature aggregation task scheduling"},{"type":"text","text":": Runs the aggregation task once the upstream semantic tasks are ready."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multi-tenancy: multi-tenancy is the foundation for onboarding multiple business lines. Each business line manages its features as a tenant and shares the platform's compute and storage costs."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature grouping: multiple features of the same dimension are aggregated into a group to reduce the number of feature keys, avoiding the impact that reading and writing large numbers of keys would have on KV storage performance."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Full snapshot: the platform generates a full feature snapshot through daily aggregation, which both makes incremental detection easier and prevents loss of historical data."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Incremental detection: the latest feature data is compared with the values in the full snapshot to find the features that have changed, in preparation for incremental sync."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature compensation: features that were not synced on a given day because their upstream data was ready late can be synced on a later day, so that features are not lost across days."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature sync task scheduling"},{"type":"text","text":": Runs the sync task once the upstream aggregation tasks are ready."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":
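[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Incremental sync: the incremental data detected against the full snapshot is written to KV storage, which greatly reduces write volume and improves sync efficiency."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Full refresh: because data in KV storage has an expiration time, a full refresh is needed periodically to avoid data loss caused by feature expiration."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Global rate limiting: by monitoring the parallelism of sync tasks and the status metrics of KV storage, the global sync rate is adjusted in real time, making full use of available resources to speed up feature sync while keeping KV storage stable."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.1.3 Feature Storage"}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"3.1.3.1 Dynamic Feature Serialization"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After aggregation, feature data must be stored in HDFS or a KV system for use by subsequent tasks and services. Storing the data involves choosing a storage format. Common formats in the industry include JSON, Object, and Protobuf: JSON is flexible to configure, Object supports custom structures, and Protobuf offers good encoding performance and a high compression ratio. Because the data types supported by the feature platform are fairly fixed, while serialization and deserialization performance and data compression matter a great deal, Protobuf was chosen as the feature storage format."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Protobuf is conventionally used by maintaining feature configuration in Proto files. Adding a feature requires editing the Proto file and compiling a new version of the JAR package, and the new feature can be produced and parsed only after the package is released both offline and online, which makes iteration costly. Protobuf also offers dynamic self-description and reflection mechanisms that help producers and consumers adapt dynamically to message format changes and avoid the JAR upgrade cost of static compilation. The price is that both space and performance costs are higher than with static compilation, so this approach is unsuitable for high-performance, low-latency online scenarios."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To solve this problem, the feature platform designed a Protobuf-based dynamic feature serialization mechanism built on feature metadata management, which makes reading and writing new features fully configuration-driven without affecting read and write performance."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For clarity, here is a brief overview of the Protobuf encoding format. As shown below, Protobuf serializes each field as a key-value pair, where the key identifies the field's number and type. In principle, then, serialization depends mainly on the field number and type defined in the key."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/09\/09dc4f13a3ec6b232ce7bc7c627da918.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Therefore, instead of the conventional Proto file configuration, the feature platform queries metadata through the metadata management interface and uses it to dynamically fill in and parse the field number and type defined in the key, thereby completing serialization and deserialization, as shown below:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d3\/d34177e0c712b8f949af2ff873ea2685.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature serialization"},{"type":"text","text":": Query the feature metadata to obtain the feature's number and type, write the feature number into the number part of the key, and use the feature type to determine the type part of the key and how the feature value is written."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature deserialization"},{"type":"text","text":": Parse the key to obtain the feature number, query the feature metadata to obtain the corresponding feature type, and use the feature type to determine how the feature value is parsed (fixed-length or variable-length)."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"al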
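ign":null,"level":5},"content":[{"type":"text","text":"3.1.3.2 Feature Multi-Versioning"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature data is stored in a KV system to provide real-time feature lookups for online services. There are two common ways to store features online in the industry:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"single-version storage and multi-version storage."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Single-version storage means overwrite updates: new data directly overwrites old data. It is simple to implement and uses less physical storage, but it cannot roll back quickly when data is abnormal."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multi-version storage adds the concept of a version, with each copy of the data corresponding to a specific version. Although it uses more physical storage, it can roll back quickly by switching versions when data is abnormal, keeping the online service stable."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature platform therefore chose multi-version storage for online feature data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Traditional multi-versioning works by switching full data sets, that is, switching versions each day after the full data has been written. The feature platform, however, has both incremental and full update modes, so it cannot simply reuse the full-data switch and must account for the dependency between incremental and full data. The feature platform therefore designed a version switching scheme that works with both incremental and full updates (shown below). The scheme uses full data as the base: incremental updates are applied during the day while the version stays unchanged, and after incremental updates finish, a full update (rewrite) is performed periodically and the version is switched."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8a\/8a7185c000b11c50fa623f576e2ded46.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2 Feature Fetching and Computation: High-Performance Feature Fetching and Computation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature fetching and computation provides real-time feature fetching and high-performance computation to model serving, business systems, and offline training. It is an important channel through which the feature platform delivers its capabilities."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the old framework, feature processing was scattered across business systems and tightly coupled with business logic. As models grew and business system architectures were upgraded, feature processing performance gradually became a bottleneck. The main problems were:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Code development required"},{"type":"text","text":": Feature processing code was lengthy. This made code easy to add to but hard to change, and identical logic was copied repeatedly, so reusability declined and code quality kept deteriorating."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Potential performance risk"},{"type":"text","text":": With many experiments running at the same time, each request processed the union of all their features, so experiments dragged down each other's performance."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Consistency hard to guarantee"},{"type":"text","text":": It was hard to keep feature processing logic identical between offline training samples and online prediction."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In building the new platform, we therefore abstracted feature processing logic into an independent module and clearly defined its responsibilities: through a unified API, it is responsible only for fetching and computing features and is unaware of the business process context. In designing the new feature fetching and computation module, we focused on two aspects:"}]},{"type":"paragraph","attrs":{"ind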
ent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Ease of use"},{"type":"text","text":": How easy feature processing is to configure affects how fast users can iterate. If adding a feature or changing its computation logic requires code changes, iteration will inevitably slow down."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Performance"},{"type":"text","text":": Feature processing must handle the fetching and computation of large numbers of features in real time, and its efficiency directly affects the overall performance of upstream services."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Around these two points, this article describes feature fetching and computation from the following two angles:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Model feature self-description (MFDL)"},{"type":"text","text":": Makes the feature computation flow configuration-driven, improving ease of use."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature fetching flow"},{"type":"text","text":": Unifies the feature fetching flow to solve feature fetching performance problems."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.2.1 Model Feature Self-Description with MFDL"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Model feature processing is part of model preprocessing. Common industry practices are:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Package the feature processing logic together with the model, described in PMML or a similar format. The advantage is concise configuration; the disadvantage is that the model file and the feature configuration cannot be updated separately."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Separate the feature processing logic from the model and describe feature processing in its own configuration, for example in JSON or CSV. The advantage is that the feature processing configuration and the model file are decoupled and can be iterated separately; the disadvantage is that the feature configuration and the loaded model may become inconsistent, which adds system complexity."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To stay compatible with existing models, we defined our own configuration format that can be set up quickly and independently of the model file. After reviewing the existing feature processing logic, we abstracted feature processing into two parts:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Model feature computation"},{"type":"text","text":": Describes how features are computed. A distinction is made between raw features and model features: features fetched directly from data sources are called raw features, and features that are computed and then fed to the model are called model features. This allows the same raw feature to yield different model features through different processing logic."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Model feature transformation"},{"type":"text","text":": Converts the generated model features, according to configuration, into a data format that can be fed directly to the model. The output of model feature computation cannot be used by the model directly and needs further transformation, for example into Tensor or Matrix formats."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":
null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於該兩點,特徵平臺設計了MFDL(Model Feature Description Language)來完整的描述模型特徵的生成流程,用配置化的方式描述模型特徵計算和轉換過程。其中,特徵計算部分通過自定義的DSL來描述,而特徵轉換部分則針對不同類型的模型設計不同的配置項。通過將特徵計算和轉換分離,就可以很方便的擴展支持不同的機器學習框架或模型結構。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/87\/87a53b53a19bcb2197e914f8d334bcb8.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在MFDL流程中,特徵計算DSL是模型處理的重點和難點。一套易用的特徵計算規範需既要滿足特徵處理邏輯的差異性,又要便於使用和理解。經過對算法需求的瞭解和對業界做法的調研,我們開發了一套易用易讀且符合編程習慣的特徵表達式,並基於JavaCC實現了高性能的執行引擎,支持了以下特性:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵類型"},{"type":"text","text":":支持以下常用的特徵數據結構:"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"單值類型(String\/Long\/Double)"},{"type":"text","text":":數值和文本類型特徵。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Map類型"},{"type":"text","text":":交叉或字典類型的特徵。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"c
ontent":[{"type":"text","marks":[{"type":"strong"}],"text":"List類型"},{"type":"text","text":":Embedding或向量特徵。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"邏輯運算"},{"type":"text","text":":支持常規的算術和邏輯運算,比如a>b?(a-b):0。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"函數算子"},{"type":"text","text":":邏輯運算只適合少量簡單的處理邏輯,而更多複雜的邏輯通常需要通過函數算子來完成。業務方既可以根據自己的需求編寫算子,也可快速複用平臺定期收集整理的常用算子,以降低開發成本,提升模型迭代效率。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特徵計算DSL舉例如下所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c7\/c7a1f303a2a254543237a8525ebbc6f4.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於規範化的DSL,一方面可以讓執行引擎在執行階段做一些主動優化,包括向量化計算、並行計算等,另一方面也有助於使用方將精力聚焦於特徵計算的業務邏輯,而不用關心實現細節,既降低了使用門檻,也避免了誤操作對線上穩定性造成的影響。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於MFDL是獨立於模型文件之外的配置,因此特徵更新迭代時只需要將新的配置推送到服務器上,經過加載和預測後即可生效,實現了特徵處理的熱更新,提升了迭代效率。同時,MFDL也是離線訓練時使用的特徵配置文件,結合統一的算子邏輯,保證了離線訓練樣本\/在線預估特徵處理的一致性。在系統中,只需要在離線訓練時配置一次,訓練完成後即可一鍵推送到線上服務,安全高效。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0
,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面是一個TF模型的MFDL配置示例:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fb\/fb69f5f3d36661c988b90e9c0138ddbc.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.2.2 特徵獲取流程"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MFDL中使用到的特徵數據,需在特徵計算之前從KV存儲進行統一獲取。爲了提升特徵獲取效率,平臺會對多個特徵數據源異步並行獲取,並針對不同的數據源,使用不同的手段進行優化,比如RPC聚合等。特徵獲取的基本流程如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5d\/5d8acb81c21fa14517cd718322251595.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在特徵生產章節已經提到,特徵數據是按分組進行聚合存儲。特徵獲取在每次訪問KV存儲時,都會讀取整個分組下所有的特徵數據,一個分組下特徵數量的多少將會直接影響到在線特徵獲取的性能。因此,我們在特徵分組分配方面進行了相關優化,既保證了特徵獲取的高效性,又保證了線上服務的穩定性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"3.2.2.1 
智能分組"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特徵以分組的形式進行聚合,用於特徵的寫入和讀取。起初,特徵是以固定分組的形式進行組織管理,即不同業務線的特徵會被人工聚合到同一分組中,這種方式實現簡單,但卻暴露出以下兩點問題:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵讀取性能差"},{"type":"text","text":":線上需要讀取解析多個業務線聚合後的特徵大Value,而每個業務線只會用到其中部分特徵,導致計算資源浪費、讀取性能變差。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"影響KV集羣穩定性"},{"type":"text","text":":特徵大Value被高頻讀取,一方面會將集羣的網卡帶寬打滿,另一方面大Value不會被讀取至內存,只能磁盤查找,影響集羣查詢性能(特定KV存儲場景)。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此,特徵平臺設計了智能分組,打破之前固定分組的形式,通過合理機制進行特徵分組的動態調整,保證特徵聚合的合理性和有效性。如下圖所示,平臺打通了線上線下鏈路,線上用於上報業務線所用的特徵狀態,線下則通過收集分析線上特徵,從全局視角對特徵所屬分組進行智能化的整合、遷移、反饋和管理。同時,基於存儲和性能的折中考慮,平臺建立了兩種分組類型:業務分組和公共分組:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c2\/c2a2901bef5925bd8222130ea3fc332d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content
":[{"type":"text","marks":[{"type":"strong"}],"text":"業務分組"},{"type":"text","text":":用於聚合每個業務線各自用到的專屬特徵,保證特徵獲取的有效性。如果特徵被多業務共用,若仍存儲在各自業務分組,會導致存儲資源浪費,需遷移至公共分組(存儲角度)。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"公共分組"},{"type":"text","text":":用於聚合多業務線同時用到的特徵,節省存儲資源開銷,但分組增多會帶來KV存儲讀寫量增大,因此公共分組數量需控制在合理範圍內(性能角度)。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過特徵在兩種分組間的動態遷移以及對線上的實時反饋,保證各業務對特徵所拉即所用,提升特徵讀取性能,保證KV集羣穩定性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"3.2.2.2 分組合並"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"智能分組可以有效的提升特徵獲取效率,但同時也引入了一個問題:在智能分組過程中,特徵在分組遷移階段,會出現一個特徵同時存在於多個分組的情況,造成特徵在多個分組重複獲取的問題,增加對KV存儲的訪問壓力。爲了優化特徵獲取效率,在特徵獲取之前需要對特徵分組進行合併,將特徵儘量放在同一個分組中進行獲取,從而減少訪問KV存儲的次數,提升特徵獲取性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下圖所示,經過分組合並,將特徵獲取的分組個數由4個(最壞情況)減少到2個,從而對KV存儲訪問量降低一半。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ed\/ed838faa3065675a933e9b3b3e7b56f7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"lev
el":3},"content":[{"type":"text","text":"3.3 訓練樣本構建:統一配置化的一致性訓練樣本生成能力"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.3.1 現狀分析"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"訓練樣本是特徵工程連接算法模型的一個關鍵環節,訓練樣本構建的本質是一個數據加工過程,而這份數據如何做到“能用”(數據質量要準確可信)、“易用”(生產過程要靈活高效)、“好用”(通過平臺能力爲業務賦能)對於算法模型迭代的效率和效果至關重要。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在特徵平臺統一建設之前,外賣策略團隊在訓練樣本構建流程上主要遇到幾個問題:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"重複性開發"},{"type":"text","text":":缺少體系化的平臺系統,依賴一些簡單工具或定製化開發Hive\/Spark任務,與業務耦合性較高,在流程複用、運維成本、性能調優等方面都表現較差。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"靈活性不足"},{"type":"text","text":":樣本構建流程複雜,包括但不限於數據預處理、特徵抽取、特徵樣本拼接、特徵驗證,以及數據格式轉換(如TFRecord)等,已有工具在配置化、擴展性上很難滿足需求,使用成本較高。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"一致性較差"},{"type":"text","text":":線上、線下在配置文件、算子上使用不統一,導致在線預測樣本與離線訓練樣本的特徵值不一致,模型訓練正向效果難保障。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text
","text":"3.3.2 配置化流程"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"平臺化建設最重要的流程之一是“如何進行流程抽象”,業界有一些機器學習平臺的做法是平臺提供較細粒度的組件,讓用戶自行選擇組件、配置依賴關係,最終生成一張樣本構建的DAG圖。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於用戶而言,這樣看似是提高了流程編排的自由度,但深入瞭解算法同學實際工作場景後發現,算法模型迭代過程中,大部分的樣本生產流程都比較固定,反而讓用戶每次都去找組件、配組件屬性、指定關係依賴這樣的操作,會給算法同學帶來額外的負擔,所以我們嘗試了一種新的思路來優化這個問題:"},{"type":"text","marks":[{"type":"strong"}],"text":"模板化 + 配置化"},{"type":"text","text":",即平臺提供一個基準的模板流程,該流程中的每一個節點都抽象爲一個或一類組件,用戶基於該模板,通過簡單配置即可生成自己樣本構建流程,如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7b\/7b117f554762d007dd5c742b4340b33f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整個流程模板包括三個部分:"},{"type":"text","marks":[{"type":"strong"}],"text":"輸入(Input)"},{"type":"text","text":"、"},{"type":"text","marks":[{"type":"strong"}],"text":"轉化(Transform)"},{"type":"text","text":"、"},{"type":"text","marks":[{"type":"strong"}],"text":"輸出(Output)"},{"type":"text","text":", 
其中包含的組件有:Label數據預處理、實驗特徵抽取、特徵樣本關聯、特徵矩陣生成、特徵格式轉換、特徵統計分析、數據寫出,組件主要功能:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Label數據預處理"},{"type":"text","text":":支持通過自定義Hive\/Spark SQL方式抽取Label數據,平臺也內置了一些UDF(如URL Decode、MD5\/Murmur Hash 等),通過自定義SQL+UDF方式靈活滿足各種數據預處理的需求。在數據源方面,支持如下類型:"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一致性特徵樣本:指線上模型預測時,會將一次預測請求中使用到的特徵及Label相關字段收集、加工、拼接,爲離線訓練提供基礎的樣本數據,推薦使用,可更好保障一致性。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自定義:不使用算法平臺提供的一致性特徵樣本數據源,通過自定義方式抽取Label數據。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"父訓練樣本:可依賴之前或其他同學生產的訓練樣本結果,只需要簡單修改特徵或採樣等配置,即可實現對原數據微調,快速生成新的訓練數據,提高執行效率。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"實驗特徵抽取"},{"type":"text","text":":線下訓練如果需要調研一些新特徵(即在一致性特徵樣本中不存在)效果,可以通過特徵補錄方式加入新的特徵集。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵樣本關聯"},{"type":"text","text":":將Label數據與補錄的實驗特徵根據唯一標識(如:poi_id)進行關聯。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","
marks":[{"type":"strong"}],"text":"特徵矩陣生成"},{"type":"text","text":":根據用戶定義的特徵MFDL配置文件,將每一個樣本需要的特徵集計算合併,生成特徵矩陣,得到訓練樣本中間表。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵格式轉換"},{"type":"text","text":":基於訓練樣本中間表,根據不同模型類型,將數據轉換爲不同格式的文件(如:CSV\/TFRecord)。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵統計分析"},{"type":"text","text":":輔助功能,基於訓練樣本中間表,對特徵統計分析,包括均值、方差、最大\/最小值、分位數、空值率等多種統計維度,輸出統計分析報告。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"數據寫出"},{"type":"text","text":":將不同中間結果,寫出到Hive表\/HDFS等存儲介質。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上面提到,整個流程是模板化,模板中的多數環節都可以通過配置選擇開啓或關閉,所以整個流程也支持從中間的某個環節開始執行,靈活滿足各類數據生成需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.3.3 
一致性保障"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"(1)爲什麼會不一致?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上文還提到了一個關鍵的問題:一致性較差。先來看下爲什麼會不一致?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/76\/76453b6b21653b2a3f5a3df032e7cb27.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖展示了在離線訓練和在線預測兩條鏈路中構建樣本的方式,最終導致離線、在線特徵值Diff的原因主要有三點:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵配置文件不一致"},{"type":"text","text":":在線側、離線側對特徵計算、編排等配置描述未統一,靠人工較難保障一致性。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵更新時機不一致"},{"type":"text","text":":特徵一般是覆蓋更新,特徵抽取、計算、同步等流程較長,由於數據源更新、重刷、特徵計算任務失敗等諸多不確定因素,在線、離線在不同的更新時機下,數據口徑不一致。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵算子定義不一致"},{"type":"text","text":":從數據源抽取出來的原始特徵一般都需要經過二次運算,線上、線下算子不統一。"}]}]}]
},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"(2)如何保證一致性?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"明確了問題所在,我們通過如下方案來解決一致性問題:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/24\/24100a1345a493aa725f4cc2a51f9573.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"打通線上線下配置"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"線下生成訓練樣本時,用戶先定義特徵MFDL配置文件,在模型訓練後,通過平臺一鍵打包功能,將MFDL配置文件以及訓練輸出的模型文件,打包、上傳到模型管理平臺,通過一定的版本管理及加載策略,將模型動態加載到線上服務,從而實現線上、線下配置一體化。"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"提供一致性特徵樣本"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過實時收集在線Serving輸出的特徵快照,經過一定的規則處理,將結果數據輸出到Hive表,作爲離線訓練樣本的基礎數據源,提供一致性特徵樣本,保障在線、離線數據口徑一致。"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text
":"統一特徵算子庫"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上文提到可以通過特徵補錄方式添加新的實驗特徵,補錄特徵如果涉及到算子二次加工,平臺既提供基礎的算子庫,也支持自定義算子,通過算子庫共用保持線上、線下計算口徑一致。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.3.4 爲業務賦能"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從特徵生產,到特徵獲取計算,再到生成訓練樣本,特徵平臺的能力不斷得到延展,逐步和離線訓練流程、在線預測服務形成一個緊密協作的整體。在特徵平臺的能力邊界上,我們也在不斷的思考和探索,希望能除了爲業務提供穩定、可靠、易用的特徵數據之外,還能從特徵的視角出發,更好的建設特徵生命週期閉環,通過平臺化的能力反哺業務,爲業務賦能。在上文特徵生產章節,提到了特徵平臺一個重要能力:特徵複用,這也是特徵平臺爲業務賦能最主要的一點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特徵複用需要解決兩個問題:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵快速發現"},{"type":"text","text":":當前特徵平臺有上萬特徵,需要通過平臺化的能力,讓高質量的特徵快速被用戶發現,另外,特徵的“高質量”如何度量,也需要有統一的評價標準來支撐。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵快速使用"},{"type":"text","text":":對於用戶發現並篩選出的目標特徵,平臺需要能夠以較低的配置成本、計算資源快速支持使用(參考上文3.1.2 
小節“特徵複用”)。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本小節重點介紹如何幫助用戶快速發現特徵,主要包括兩個方面:主動檢索和被動推薦,如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/87\/8742f3a2153046fddb3bd6e2a91ea513.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,用戶可以通過主動檢索,從特徵倉庫篩選出目標特徵候選集,然後結合特徵畫像來進一步篩選,得到特徵初選集,最後通過離線實驗流程、在線ABTest,結合模型效果,評估篩選出最終的結果集。其中特徵畫像主要包括以下評價指標:"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵複用度"},{"type":"text","text":":通過查看該特徵在各業務、各模型的引用次數,幫助用戶直觀判斷該特徵的價值。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特徵標註信息"},{"type":"text","text":":通過查看該特徵在其他業務離線、在線效果的標註信息,幫助用戶判斷該特徵的正負向效果。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"數據質量評估"},{"type":"text","text":":平臺通過離線統計任務,按天粒度對特徵進行統計分析,包括特徵的就緒時間、空值率、均值、方差、最大\/小值、分位點統計等,生成特徵評估報告,幫助用戶判斷該特徵是否可靠。"}]}]},{"type"
:"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其次,平臺根據特徵的評價體系,將表現較好的Top特徵篩選出來,通過排行榜展現、消息推送方式觸達用戶,幫助用戶挖掘高分特徵。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲業務賦能是一個長期探索和實踐的過程,未來我們還會繼續嘗試在深度學習場景中,建立每個特徵對模型貢獻度的評價體系,並通過自動化的方式打通模型在線上、線下的評估效果,通過智能化的方式挖掘特徵價值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4 總結與展望"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文分別從特徵框架演進、特徵生產、特徵獲取計算以及訓練樣本生成四個方面介紹了特徵平臺在建設與實踐中的思考和優化思路。經過兩年的摸索建設和實踐,外賣特徵平臺已經建立起完善的架構體系、一站式的服務流程,爲外賣業務的算法迭代提供了有力支撐。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"未來,外賣特徵平臺將繼續推進從離線->近線->在線的全鏈路優化工作,在計算性能、資源開銷、能力擴展、合作共建等方面持續投入人力探索和建設,並在更多更具挑戰的業務場景中發揮平臺的價值。同時,平臺將繼續和模型服務和模型訓練緊密結合,共建端到端算法閉環,助力外賣業務蓬勃發展。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"頭圖:Unsplash"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作者:英亮 陳龍 
劉磊等"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文:https:\/\/mp.weixin.qq.com\/s\/YyRLJa9NomPvzTWJKaCesQ"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文:美團外賣特徵平臺的建設與實踐"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"來源:美團技術團隊 - 微信公衆號 [ID:meituantech]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"轉載:著作權歸作者所有。商業轉載請聯繫作者獲得授權,非商業轉載請註明出處。"}]}]}