The Evolution of the Machine Learning Feature System at Palfish

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Preface"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/u\/banyu\/publish","title":"xxx","type":null},"content":[{"type":"text","text":"Palfish"}]},{"type":"text","text":", we use machine learning in several online scenarios to improve the user experience. For example, in Palfish Picture Books we recommend posts that users are likely to find interesting based on their post browsing history; in the conversion backend we recommend courses they may be interested in based on their picture-book purchase history."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Features are the inputs to machine learning models. How efficiently features can be derived from data sources, and how efficiently online services can then access them, determines whether we can use machine learning reliably in production. We built the feature system to address this problem systematically. Today, Palfish's machine learning feature system runs close to 100 features and serves the online feature retrieval needs of models across multiple business lines."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Below, we describe how the feature system evolved at Palfish and the trade-offs involved."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Feature System V1"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature System V1 
consists of three core components: the feature pipeline, the feature store, and the feature service. The overall architecture is shown in the figure below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/21\/c5\/2194e3e2088f0317fb62cd4c12dec4c5.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature pipelines come in two kinds, streaming and batch. They consume streaming and batch data sources respectively, preprocess the data into features (a step known as feature engineering), and write the features to the feature store. The batch feature pipeline is implemented with "},{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/","title":"xxx","type":null},"content":[{"type":"text","text":"Spark"}]},{"type":"text","text":", is scheduled by DolphinScheduler, and runs on a YARN cluster. To keep the technology stack consistent, the streaming feature pipeline is implemented with Spark Structured Streaming and, like the batch pipeline, runs on the YARN cluster."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature store uses a storage component (Redis) and a data structure (Hashes) chosen to give the model service low-latency access to features. We chose Redis as the storage for the following reasons:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Palfish has extensive experience with Redis."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Industry feature stores such as "},{"type":"link","attrs":{"href":"https:\/\/doordash.engineering\/2020\/11\/19\/building-a-gigascale-ml-feature-store-with-redis\/","title":null,"type":null},"content":[{"type":"text","text":"DoorDash Feature Store"}]},{"type":"text","text":" and "},{"type":"link","attrs":{"href":"https:\/\/docs.feast.dev\/feast-on-kubernetes\/concepts\/stores#online-store","title":null,"type":null},"content":[{"type":"text","text":"Feast"}]},{"type":"text","text":" also use Redis."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature service hides the feature store's storage and data structure behind an RPC interface, "},{"type":"codeinline","content":[{"type":"text","text":"GetFeatures(EntityName, FeatureNames)"}]},{"type":"text","text":", which provides low-latency point lookups of features. In the implementation, this interface essentially maps to the Redis operation "},{"type":"codeinline","content":[{"type":"text","text":"HMGET EntityName FeatureName_1 ... FeatureName_N"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This version of the feature system had the following problems:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Algorithm engineers lacked control, which made iteration slow. This problem stems from the system's technology stack and the company's organizational structure. Of all the components, the feature pipeline changes most often: whenever a model needs a new feature, a Spark job has to be modified or written from scratch. Writing Spark jobs, however, requires some Java or Scala 
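knowledge, which is not a common skill among algorithm engineers, so this work was handed over entirely to the big data team. The big data team handles many data requests at once and usually has a long backlog. As a result, launching a new feature involved frequent cross-department communication, and iteration was slow."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature pipeline performed only lightweight feature engineering, which reduced online inference efficiency. Because the pipelines were written by big data engineers rather than algorithm engineers, complex preprocessing carried a higher communication cost. The features were therefore only lightly preprocessed, and more of the preprocessing was deferred to the model service or even to the model itself, increasing inference latency."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address these problems, Feature System V2 set out the following design goals:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Return control to algorithm engineers to speed up iteration."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Move heavier feature engineering into the feature pipeline to improve online inference efficiency."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Feature System V2"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Architecturally, Feature System V2 differs from V1 in only one respect: it splits the feature pipeline into three parts, namely the feature generation pipeline, the feature source, and the feature ingestion pipeline. It is also worth noting that the pipelines were all migrated from Spark to Flink, in line with the evolution of the company's data infrastructure. The overall architecture of Feature System V2 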
is shown in the figure below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/71\/71\/71bb73d26679a34950dac22f2a16a071.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature generation pipeline reads from raw data sources, processes the data into features, and writes them to a designated feature source (rather than to the feature store). A pipeline whose raw data source is a stream is a streaming feature generation pipeline; one whose raw data source is a batch source is a batch feature generation pipeline."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Algorithm engineers are fully responsible for writing the logic of feature generation pipelines. Batch feature generation pipelines are written in HiveQL and scheduled by DolphinScheduler. Streaming feature generation pipelines are implemented with PyFlink, as detailed in the figure below."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/dd\/50\/ddd1d9531a905b142eb5b23b24591950.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Algorithm engineers follow these steps:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Use Flink SQL to declare the Flink 
job source (source.sql) and define the feature engineering logic (transform.sql)."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"(Optional) Implement in Python any UDFs used by the feature engineering logic (udf_def.py)."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Use the in-house code generation tool to generate an executable PyFlink job script (run.py)."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Debug the PyFlink script locally in the Docker environment prepared by the platform, and make sure it runs correctly."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"Submit the code to a central repository that manages all feature pipelines, where the AI platform team reviews it. Scripts that pass review are deployed to the Palfish real-time computing platform, which completes the launch of the feature generation pipeline."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This workflow ensures that:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Algorithm engineers have autonomy over launching features."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Platform engineers control the code quality of feature generation pipelines and can refactor them when necessary without involving algorithm engineers."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature source stores the features derived from raw data sources. Importantly, it also serves as the bridge between algorithm engineers and AI platform engineers. Algorithm engineers are responsible only for implementing the feature engineering logic: processing raw data into features and writing them to the feature source. The rest is left to the AI 
platform. Platform engineers implement the feature ingestion pipeline, which writes features to the feature store, and expose data access through the feature service."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature ingestion pipeline reads features from the feature source and writes them to the feature store. Because the Flink community lacks native support for a Redis sink, we extended "},{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/flink\/blob\/master\/flink-streaming-java\/src\/main\/java\/org\/apache\/flink\/streaming\/api\/functions\/sink\/RichSinkFunction.java","title":null,"type":null},"content":[{"type":"text","text":"RichSinkFunction"}]},{"type":"text","text":" to build simple implementations of "},{"type":"codeinline","content":[{"type":"text","text":"StreamRedisSink"}]},{"type":"text","text":" and "},{"type":"codeinline","content":[{"type":"text","text":"BatchRedisSink"}]},{"type":"text","text":", which meet our needs well. Of the two, "},{"type":"codeinline","content":[{"type":"text","text":"BatchRedisSink"}]},{"type":"text","text":" writes in batches, which greatly reduces the number of requests sent to the Redis server, increases throughput, and yields a 7x improvement in write efficiency; see the "},{"type":"link","attrs":{"href":"https:\/\/tech.ipalfish.com\/blog\/2021\/06\/25\/flink-bulk-insert-redis\/","title":null,"type":null},"content":[{"type":"text","text":"blog post"}]},{"type":"text","text":" for details."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature System V2 meets the design goals we set out."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because writing a feature generation pipeline requires only SQL and Python, 
two tools algorithm engineers know very well, they now own the writing and launch of feature generation pipelines end to end without relying on the big data team, which greatly speeds up iteration. Once familiar with the workflow, an algorithm engineer can usually write, debug, and launch a streaming feature in under half an hour. The same process used to take several days, depending on the big data team's schedule."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the same reason, algorithm engineers can do heavier feature engineering when needed, which lightens the load on the model service and the model and improves online inference efficiency."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature System V1 made it possible to launch features; Feature System V2 built on that to make launching features easy. Over the course of this evolution, we drew several lessons as platform developers:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A platform should offer the tools its users want to use. This matches the "},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/scaling-michelangelo\/","title":null,"type":null},"content":[{"type":"text","text":"experience"}]},{"type":"text","text":" of the Uber ML platform team in promoting its platform internally. Algorithm engineers are most productive in Python and SQL and are unfamiliar with Java and Scala. So if we want them to write feature pipelines on their own, the platform should let them do so in Python and SQL, rather than asking them to learn Java and Scala or handing the work off to the big data team."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A platform should offer easy-to-use local debugging tools. The Docker environment we provide bundles Kafka and Flink, so users can quickly debug PyFlink 
scripts locally, without having to wait for the pipeline to be deployed to a test environment first."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While encouraging self-service, a platform should keep a firm grip on quality through automated checks, code review, and similar measures."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Author: 陳易生"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original article: https:\/\/tech.ipalfish.com\/blog\/2021\/07\/30\/palfish-feature-system\/"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original title: 機器學習特徵系統在伴魚的演進"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Source: Palfish Tech Blog"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reprint notice: Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source."}]}]}
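As a supplementary illustration of the Feature System V1 feature service, the mapping from the `GetFeatures(EntityName, FeatureNames)` interface to a Redis `HMGET` lookup might look roughly like the sketch below. This is not the article's implementation: `InMemoryHashStore` is a hypothetical stand-in that only mimics Redis hash semantics so the example runs without a server, and the entity key `user:42` and the feature names are invented for the example.

```python
from typing import Dict, List, Optional


class InMemoryHashStore:
    """Hypothetical stand-in for Redis Hashes (assumption, not a real client)."""

    def __init__(self) -> None:
        self._hashes: Dict[str, Dict[str, str]] = {}

    def hset(self, key: str, mapping: Dict[str, str]) -> None:
        # Store several field/value pairs under one hash key.
        self._hashes.setdefault(key, {}).update(mapping)

    def hmget(self, key: str, fields: List[str]) -> List[Optional[str]]:
        # Like Redis HMGET: one value per requested field, None when absent.
        stored = self._hashes.get(key, {})
        return [stored.get(field) for field in fields]


def get_features(
    store: InMemoryHashStore, entity_name: str, feature_names: List[str]
) -> Dict[str, Optional[str]]:
    """Point lookup: one hash per entity, one hash field per feature."""
    values = store.hmget(entity_name, feature_names)
    return dict(zip(feature_names, values))


store = InMemoryHashStore()
# Invented sample data: an ingestion pipeline would normally write these.
store.hset("user:42", mapping={"books_bought_7d": "3", "posts_viewed_1d": "12"})

result = get_features(store, "user:42", ["books_bought_7d", "unknown_feature"])
print(result)
```

A missing feature comes back as `None` rather than raising, which leaves it to the caller (for example, the model service) to decide on a default value.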