Uber's Machine Learning Platform in Practice

{"type":"doc","content":[
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Preface"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is the second in the "},{"type":"link","attrs":{"href":"https:\/\/tech.ipalfish.com\/blog\/2021\/05\/31\/mlsys-we-love\/","title":null,"type":null},"content":[{"type":"text","text":"Survey of Algorithm Engineering Practices"}]},{"type":"text","text":" series. It covers the Uber engineering blog post published in September 2017, "},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/michelangelo-machine-learning-platform\/","title":null,"type":null},"content":[{"type":"text","text":"Meet Michelangelo: Uber’s Machine Learning Platform"}]},{"type":"text","text":" [1]. The post describes the role of each component of the machine learning platform Michelangelo (named after the great Italian Renaissance painter, sculptor, architect, and poet), offering one of the first detailed public descriptions of what a complete machine learning platform looks like."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Background and Problems"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber runs many lines of business, and many of them use ML. For example, the ride-sharing app predicts riders' arrival times, the food-delivery app ranks restaurants according to each user's preferences, and the customer service center uses ML to reduce the need for human agents. To meet their ML needs quickly, the business lines naturally adopted siloed architectures, which gradually led to systems that were hard to maintain and models that were hard to ship to production. A unified solution covering the full ML workflow was urgently needed."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address these problems systematically, the Michelangelo platform (hereafter, the platform) first clarified and defined the full ML workflow: data management, model training, model evaluation, model deployment, making predictions, and prediction monitoring. It then provided a solution for each stage. The sections below discuss how the platform handles each stage. For ease of discussion, the figure below shows an overview of the platform, which is referenced repeatedly in the rest of this article."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/06\/06fe039ef971b81eace79599de1bd737.png","alt":"Michelangelo","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Data Management"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data is the hardest part of ML. The platform's data management components consist of feature generation pipelines and feature stores."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature generation pipelines (Stream Engine and Data Prep Job) are scheduled by a workflow engine. They read from the Kafka streaming source and the Data Lake batch source, transform the data into features for ML models, and write them to the feature stores (Cassandra Feature Store and Hive Feature Store). Once features are stored, they are available for online prediction (Realtime Predict Service), offline training (Batch Training Job), and offline prediction (Batch Predict Job)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Notably, to simplify the implementation of feature generation pipelines, the platform has a built-in Scala-based DSL that lets users define the logic declaratively instead of writing Samza (note: the platform had migrated this logic to Flink before 2019) and Spark code. The platform's later "},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/michelangelo-pyml\/","title":null,"type":null},"content":[{"type":"text","text":"PyML"}]},{"type":"text","text":" supports calling Python libraries directly and writing Python code, which further simplifies feature generation pipelines."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What are the benefits of this architecture?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature generation pipeline is the single designated place where data is turned into features, which avoids errors caused by logic scattered across training and prediction code."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The feature store acts as the bridge between data and models, giving different teams a clear division of labor."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A centralized feature store makes it possible to discover and share features, so business lines do not each build and maintain duplicate feature generation pipelines. This matters a great deal given how many business lines Uber has."}]}]}
]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Model Training"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Model training is a highly interactive process. The platform lets algorithm engineers complete the entire training process from the Jupyter Notebook environment they know best, by calling a Python SDK."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Define the model. In the model configuration, the algorithm engineer declares the model type (which must be on the platform's supported list), hyperparameters (the platform supports hyperparameter search), data sources, the feature generation pipeline DSL, and compute resource requirements (number of machines, memory size, whether to use GPUs, and so on)."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Trigger training. Once triggered, the workflow engine runs the training job."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Store the training results. When training completes, the evaluation report (P-R and ROC curves), the model configuration, and the model parameters are uploaded to the model repository (Cassandra Model Repo) for analysis and deployment."}]}]}
]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Model training is iterative, so training efficiency matters. Through "},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/dsw\/","title":null,"type":null},"content":[{"type":"text","text":"Data Science Workbench"}]},{"type":"text","text":", the platform meets algorithm engineers' different training needs: distributed training of deep learning models on GPU clusters, training tree and linear models on CPU clusters, and experimenting with various models in a plain Python environment. The platform also provides additional support for deep learning training; see "},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/horovod\/","title":null,"type":null},"content":[{"type":"text","text":"Horovod"}]},{"type":"text","text":"."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Model Evaluation"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before arriving at a satisfactory model, algorithm engineers often need to train many models. Recording, evaluating, and comparing these models gives them a lot of useful information. The platform records detailed model metadata in its Cassandra-based model repository, including:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Who started the training."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Start and end times of the training workflow."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Model configuration."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Training and validation datasets used."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Distribution of each feature."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Model accuracy metrics."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Common charts, such as ROC, P-R, and the confusion matrix."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Learned parameters."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Model visualizations."}]}]}
]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Model Deployment"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The platform supports quick model deployment through a CLI and a UI. The model artifacts needed for deployment (metadata, model parameter files, and the compiled feature generation DSL) are packaged into a ZIP archive and sent to the designated servers. The prediction service then reloads the model and resumes handling prediction requests."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Making Predictions"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The platform supports both online and offline prediction."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For online prediction, once a model is deployed, the online prediction service has already loaded the required model from the model repository into memory. When the service receives a prediction request containing an entity ID from a client, it first uses the entity ID to fetch the corresponding feature vector from the online feature store, then feeds the feature vector into the model, computes the prediction, and returns it to the client. The prediction service is written in Java to achieve high concurrency and low latency."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For offline prediction, the workflow loads the model into its own memory, reads features in bulk from the offline feature store, makes predictions, and writes the predictions to Hive or Kafka."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Prediction Monitoring"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A model that performs well in training can be badly wrong in production, so the platform supports detailed monitoring of model predictions. The most effective approach is to log a sample of predictions in production and later join them with the observed outcomes for comparison. Another approach is to record the distribution of predicted values."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Uber team drew three important lessons from building Michelangelo."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Start small and iterate quickly. Starting from the entry point with the narrowest scope and the greatest impact makes it easier to deliver results and win leadership support, which in turn enables fast iteration. Early on, the platform focused on supporting large-scale offline training and offline prediction. It then gradually added the feature store, model evaluation, the online prediction service, deep learning, Jupyter Notebook integration, partitioned models, and more."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let engineers use the tools that suit them best. In the platform's early days there were few external users and most development was done by platform engineers, so the Spark \/ Spark ML \/ Scala \/ Java stack favored fast iteration. As the platform matured, its focus shifted to rapid model experimentation and iteration. At that stage most development is done by algorithm engineers, who may need to add custom support for certain models at different stages, for example implementing data preprocessing that is not built in, or supporting distributed training for a new deep learning model. For this work they want to use the familiar TensorFlow \/ PyTorch \/ Python \/ Jupyter Notebook stack. PyML is an effort in this direction."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data is the most important part of an ML system, and also the hardest. The difficulty is both technical and human. On the technical side, every company has data warehouses, data pipelines, and workflow systems that have been running for years, and their imperfections can make it hard to cope with frequent, rapid change once an ML platform is connected to them. On the human side, data requests are raised by algorithm engineers and implemented by data engineers, and collaboration across departments and across working languages (JVM vs. Python) is usually difficult. The feature store abstraction was proposed to solve exactly this problem. For details, see the platform's feature store, "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.com\/presentations\/michelangelo-palette-uber\/","title":null,"type":null},"content":[{"type":"text","text":"Palette"}]},{"type":"text","text":"."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Through extensive business practice, Uber has built up a mature ML platform and has generously shared, through engineering blog posts, public talks, and open-source software, the trade-offs, technology choices, and priorities at different stages of the platform's development. This lets us see how a mature ML platform evolved, and it is a valuable reference for us at Palfish (伴魚) as we build an ML platform from scratch."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"References"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] Meet Michelangelo: Uber’s Machine Learning Platform. "},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/michelangelo-machine-learning-platform\/","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/eng.uber.com\/michelangelo-machine-learning-platform\/"}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] Scaling Machine Learning at Uber with Michelangelo. "},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/scaling-michelangelo\/","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/eng.uber.com\/scaling-michelangelo\/"}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Author: 陳易生"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original: https:\/\/tech.ipalfish.com\/blog\/2021\/05\/31\/uber-michelangelo-overview\/"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original title: Uber 機器學習平臺實踐 (Uber's Machine Learning Platform in Practice)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Source: Palfish (伴魚) Tech Blog"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reprints: Copyright belongs to the author. For commercial reprints, contact the author for permission; for non-commercial reprints, credit the source."}]}
]}
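The article's first prediction-monitoring approach (log a sample of production predictions, then join them with the outcomes observed later) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the record fields `request_id` and `predicted`, the function names, and the choice of mean absolute error as the comparison metric are hypothetical and are not Michelangelo's actual API or schema.

```python
import random
from statistics import mean


def sample_predictions(predictions, rate, seed=0):
    """Keep roughly `rate` of the prediction records (hypothetical sampling policy)."""
    rng = random.Random(seed)
    return [p for p in predictions if rng.random() < rate]


def join_with_outcomes(logged, outcomes):
    """Inner-join logged predictions with observed outcomes on request_id.

    Predictions whose outcome has not been observed yet are skipped.
    """
    return [
        (p["predicted"], outcomes[p["request_id"]])
        for p in logged
        if p["request_id"] in outcomes
    ]


def mean_absolute_error(pairs):
    """Average absolute gap between predicted and observed values."""
    return mean(abs(pred - obs) for pred, obs in pairs)


if __name__ == "__main__":
    # Hypothetical logged predictions (e.g. arrival times in minutes).
    logged = [
        {"request_id": "r1", "predicted": 10.0},
        {"request_id": "r2", "predicted": 20.0},
        {"request_id": "r3", "predicted": 5.0},
    ]
    # Outcomes observed later; r3 has no outcome yet.
    outcomes = {"r1": 12.0, "r2": 17.0}

    pairs = join_with_outcomes(sample_predictions(logged, rate=1.0), outcomes)
    print(mean_absolute_error(pairs))
```

In a real deployment the join would more likely run as a batch job over the prediction logs and an outcomes table; the in-memory dictionary lookup above only shows the shape of the comparison.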