Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark (Part 11)

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"寫在前面:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家好,我是強哥,一個熱愛分享的技術狂。目前已有 12 年大數據與AI相關項目經驗, 10 年推薦系統研究及實踐經驗。平時喜歡讀書、暴走和寫作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業餘時間專注於輸出大數據、AI等相關文章,目前已經輸出了40萬字的推薦系統系列精品文章,今年 6 月底會出版「構建企業級推薦系統:算法、工程實現與案例分析」一書。如果這些文章能夠幫助你快速入門,實現職場升職加薪,我將不勝歡喜。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"想要獲得更多免費學習資料或內推信息,一定要看到文章最後喔。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"內推信息","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你正在看相關的招聘信息,請加我微信:liuq4360,我這裏有很多內推資源等着你,歡迎投遞簡歷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"免費學習資料","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你想獲得更多免費的學習資料,請關注同名公衆號【數據與智能】,輸入“資料”即可!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"學習交流羣","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你想找到組織,和大家一起學習成長,交流經驗,也可以加入我們的學習成長羣。羣裏有老司機帶你飛,另有小哥哥、小姐姐等你來勾搭!加小姐姐微信:epsila,她會帶你入羣。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上一章中,我們介紹瞭如何使用MLlib構建機器學習管道。本章將重點介紹如何管理和部署你訓練的模型。在本章結束時,你將能夠使用MLflow來追蹤、重現和部署MLlib模型,以及討論各種模型部署方案之間的困難和折衷方案,並設計可擴展的機器學習解決方案。但是,在討論部署模型之前,讓我們首先討論一些模型管理的最佳實踐,在進行部署之前將你的模型準備好。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"模型管理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在部署機器學習模型之前,應確保可以複製和追蹤模型的效果。對我們而言,機器學習解決方案的端到端可重現性意味着我們需要能夠再現生成模型的代碼,包括用於訓練的環境,用於訓練的數據以及模型本身。每個數據科學家都喜歡提醒你設置隨機數種子,以便你可以重現實驗(例如,在使用具有固有隨機性的模型(例如隨機森林)時進行訓練集/測試集拆分)。但是,有許多方面比設置種子更有助於提高可重複性,其中有些方面很微妙。下面有幾個例子:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"庫版本控制","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當數據科學家將他們的代碼交給你時,他們可能會或者可能不會提及依賴庫。雖然你可以通過查看錯誤消息從而確定需要哪些庫,但是不確定它們使用的是哪個庫版本,因此你可能會安裝最新的庫。這樣一來,如果他們的代碼是建立在庫的先前版本上的,則會存在不同版本之間的某些默認行爲是不一致的,從而導致不同的結果,因此使用最新版本可能會導致代碼中斷或結果有所不同(例如,考慮一下XGBoost如何更改了v0.90版本中處理缺失值的方式)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"數據演進","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設你在2020年6月1日建立模型,並追蹤所有超參數、庫等。然後你嘗試在2020年7月1日重現相同的模型,但是由於基礎數據發送變更,所以引起管道中斷或結果不同,如果有人在初始構建後添加了額外的列或更多數量級的數據,則可能會發生這種情況。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"執行順序","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果數據科學家將他們的代碼交給你,那麼你應該能夠從上到下運行它而不會出錯。但是,數據科學家以無序運行或多次運行同一個有狀態單元而臭名昭著,這使得它們的結果很難重現。(他們還可能引入具有與用於訓練最終模型的參數不同的超參數的代碼副本!)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"並行作業","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了最大化吞吐量,GPU將並行運行許多操作。但是,執行順序不一定總是得到保證,這可能導致結果的不確定性。這是類似tf.reduce_sum()函數之類的已知問題,例如在對浮點數進行聚合(精度有限)時:添加它們的順序可能會產生略有不同的結果,並且在許多迭代中都會加劇這種結果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"無法複製實驗通常會阻礙業務部門採用你的模型或將其投入生產。儘管你可以構建自己的內部工具來追蹤模型、數據和依賴項版本等,但是它們可能變得過時,脆弱並且需要花費大量的開發工作來維護。同樣重要的是要擁有用於管理模型的行業範圍的標準,以便可以與合作伙伴輕鬆共享它們。開源和專有工具都可以通過抽象出許多常見困難來幫助我們重現我們的機器學習實驗。本節將重點介紹MLflow,因爲它與當前可用的開源模型管理工具MLlib的集成最爲緊密。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"MLflow","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MLflow是一個開放源代碼平臺,可幫助開發人員複製和共享實驗,管理模型等。它提供Python,R和Java / Scala的接口,以及REST 
Failing to reproduce your experiments can often block business units from adopting your model or putting it into production. While you could build your own in-house tooling to track your models, data, dependency versions, and so on, such tools tend to become outdated, brittle, and costly to maintain. Equally important is having an industry-wide standard for managing models so they can be shared easily with partners. Both open source and proprietary tools can help us reproduce our machine learning experiments by abstracting away many of these common difficulties. This section focuses on MLflow, as it currently has the tightest integration with MLlib among the open source model-management tools available.

MLflow

MLflow is an open source platform that helps developers reproduce and share experiments, manage models, and more. It provides interfaces in Python, R, and Java/Scala, as well as a REST API. As shown in Figure 11-1, MLflow has four main components:

Tracking
Provides APIs to record parameters, metrics, code versions, models, and other artifacts such as plots and text.

Projects
A standardized format for packaging your data science projects and their dependencies so they can run on other platforms. It helps you manage the model training process.

Models
A standardized way of packaging models to deploy to diverse execution environments. It provides a consistent API for loading and applying models, regardless of the algorithm or library used to build them.

Registry
A repository for tracking model lineage, model versions, stage transitions, and annotations.

Figure 11-1. MLflow components: https://static001.geekbang.org/infoq/80/80e5d51afb6bf017e3724f951cbfc05a.png

Let's track the MLlib model experiments from Chapter 10 for reproducibility. We will then see how MLflow's other components come into play when we discuss model deployment. To get started with MLflow, simply run pip install mlflow on your local machine.

Tracking
MLflow Tracking is a logging API that is agnostic to the libraries and environments that actually do the training. It is organized around the concept of runs, which are executions of data science code. Runs are aggregated into experiments, so that many runs can be part of a given experiment.

The MLflow tracking server can host many experiments. You can log to the tracking server from a notebook, a local application, or a cloud job, as shown in Figure 11-2.

Figure 11-2. Logging to the MLflow tracking server: https://static001.geekbang.org/infoq/bd/bda7d7bd296965941470e618d723a549.png

Let's examine a few things that can be logged to the tracking server:

Parameters
Key/value inputs to your code, e.g., hyperparameters such as num_trees or max_depth in a random forest

Metrics
Numeric values (which can be updated over time), e.g., RMSE or accuracy values

Artifacts
Files, data, and models, e.g., matplotlib images or Parquet files

Metadata
Information about the run, such as the source code that executed the run or the version of the code (e.g., the Git commit hash string of the code version)

Models
The model(s) you trained

By default, the tracking server records everything to the filesystem, but you can specify a database for faster querying of, for example, the parameters and metrics. Let's add MLflow tracking to our random forest code from Chapter 10:
# In Python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

filePath = """/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet"""
airbnbDF = spark.read.parquet(filePath)
(trainDF, testDF) = airbnbDF.randomSplit([.8, .2], seed=42)

categoricalCols = [field for (field, dataType) in trainDF.dtypes
                   if dataType == "string"]
indexOutputCols = [x + "Index" for x in categoricalCols]
stringIndexer = StringIndexer(inputCols=categoricalCols,
                              outputCols=indexOutputCols,
                              handleInvalid="skip")

numericCols = [field for (field, dataType) in trainDF.dtypes
               if ((dataType == "double") & (field != "price"))]
assemblerInputs = indexOutputCols + numericCols
vecAssembler = VectorAssembler(inputCols=assemblerInputs,
                               outputCol="features")

rf = RandomForestRegressor(labelCol="price", maxBins=40, maxDepth=5,
                           numTrees=100, seed=42)

pipeline = Pipeline(stages=[stringIndexer, vecAssembler, rf])

To start logging with MLflow, you need to start a run using mlflow.start_run(). Rather than explicitly calling mlflow.end_run(), the examples in this chapter use a with clause to end the run automatically:
outputCol=\"features\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"rf = RandomForestRegressor(labelCol=\"price\", maxBins=40, maxDepth=5,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                           numTrees=100, seed=42)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"pipeline = Pipeline(stages=[stringIndexer, vecAssembler, rf])","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要開始使用MLflow進行記錄,你將需要使用開始運行mlflow.start_run(),而不是顯示調用mlflow.end_run()。本章中的示例將使用一個with子句來自動結束運行:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import mlflow","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import mlflow.spark","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import pandas as pd","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"with mlflow.start_run(run_name=\"random-forest\") as run:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Log params: num_trees and max_depth","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  mlflow.log_param(\"num_trees\", rf.getNumTrees())","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  mlflow.log_param(\"max_depth\", rf.getMaxDepth())","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Log model","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  pipelineModel = pipeline.fit(trainDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  mlflow.spark.log_model(pipelineModel, \"model\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Log metrics: RMSE and 
R2","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  predDF = pipelineModel.transform(testDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  regressionEvaluator = RegressionEvaluator(predictionCol=\"prediction\",","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                                            labelCol=\"price\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  rmse = regressionEvaluator.setMetricName(\"rmse\").evaluate(predDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  r2 = regressionEvaluator.setMetricName(\"r2\").evaluate(predDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  mlflow.log_metrics({\"rmse\": rmse, \"r2\": r2})","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Log artifact: feature importance scores","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  rfModel = pipelineModel.stages[-1]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  pandasDF = (pd.DataFrame(list(zip(vecAssembler.getInputCols(),","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                                    rfModel.featureImportances)),","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                           columns=[\"feature\", \"importance\"])","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"              .sort_values(by=\"importance\", ascending=False))","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" First write to local filesystem, then tell MLflow where to find that file","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  pandasDF.to_csv(\"feature-importance.csv\", index=False)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  mlflow.log_artifact(\"feature-importance.csv\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"讓我們檢查一下MLflow UI,你可以通過在終端中運行mlflow 
Figure 11-3. The MLflow UI: https://static001.geekbang.org/infoq/84/849037feffbca5149a5767c0f890dc28.png

The UI stores all the runs for a given experiment. You can search across the runs, filter for runs that meet particular criteria, compare runs side by side, and so on. If you wish, you can also export the contents as a CSV file to analyze locally. Click on the run named "random-forest" in the UI. You should see a screen like Figure 11-4.

Figure 11-4. Details of the random-forest run: https://static001.geekbang.org/infoq/cd/cd883f8e8b4c37203bef3e958caa86b0.png

You'll notice that the UI keeps track of the source code used for this MLflow run and stores all the corresponding parameters, metrics, and so on. You can add notes about the run in free text, as well as tags. You cannot modify the parameters or metrics after the run has finished.

You can also query the tracking server using the MlflowClient or the REST API:

# In Python
from mlflow.tracking import MlflowClient

client = MlflowClient()
runs = client.search_runs(run.info.experiment_id,
                          order_by=["attributes.start_time desc"],
                          max_results=1)

run_id = runs[0].info.run_id
runs[0].data.metrics

This produces the following output:

{'r2': 0.22794251914574226, 'rmse': 211.5096898777315}
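The client can also retrieve logged artifacts, such as the feature-importance CSV logged earlier. A brief sketch, with a hypothetical destination directory:

# In Python
# Download a logged artifact from the tracking server to a local directory
# (the destination directory here is hypothetical).
local_path = client.download_artifacts(run_id, "feature-importance.csv", "/tmp")
print(local_path)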
We host the code for this part of the book as an MLflow project in the book's GitHub repo, so you can experiment with running it with different hyperparameter values for max_depth and num_trees. The YAML file in the MLflow project specifies the library dependencies, so this code can be run in other environments:

# In Python
mlflow.run(
  "https://github.com/databricks/LearningSparkV2/#mlflow-project-example",
  parameters={"max_depth": 5, "num_trees": 100})

# Or on the command line
mlflow run https://github.com/databricks/LearningSparkV2/#mlflow-project-example -P max_depth=5 -P num_trees=100
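For orientation, an MLproject file has roughly the following shape. This is a hypothetical sketch, not the actual file from the LearningSparkV2 repo; the entry-point script and file names are illustrative:

# A hypothetical MLproject file (YAML)
name: mlflow-project-example
conda_env: conda.yaml          # pins the library dependencies
entry_points:
  main:
    parameters:
      max_depth: {type: int, default: 5}
      num_trees: {type: int, default: 100}
    command: "python train.py --max_depth {max_depth} --num_trees {num_trees}"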
Now that you have tracked and reproduced your experiments, let's discuss the various deployment options available for your MLlib models.

Model Deployment Options with MLlib

Deploying machine learning models means something different for every organization and use case. Business constraints will impose different requirements for latency, throughput, cost, and so on, which dictate which mode of model deployment is suitable for the task at hand: batch, streaming, real-time, or mobile/embedded. Deploying models on mobile/embedded systems is outside the scope of this book, so we will focus primarily on the other options. Table 11-1 shows the throughput and latency trade-offs for these three deployment options for generating predictions. We care about both the number of concurrent requests and the size of those requests, and the resulting solutions will look quite different.

Table 11-1. Throughput and latency trade-offs of the deployment options: https://static001.geekbang.org/infoq/36/36fc4693e4f88d56b394ae1fa784db17.png

Batch processing generates predictions on a regular schedule and writes the results out to persistent storage to be served elsewhere. It is typically the cheapest and simplest deployment option, as you only pay for compute during your scheduled runs. Batch processing is also more efficient per data point, because you accumulate less overhead when it is amortized across all of your predictions. This is especially true for Spark, where communicating back and forth between the driver and the executors adds overhead; you wouldn't want to make predictions one data point at a time! However, its main drawback is high latency, since batch jobs are typically scheduled hours or days apart.

Streaming provides a nice trade-off between throughput and latency. You continually make predictions on micro-batches of data and get your predictions in seconds to minutes. If you are using Structured Streaming, almost all of your code will be identical to the batch use case, making it easy to go back and forth between the two options. With streaming, you will have to pay for the VMs or compute resources you use to stay up and running continuously, and you need to ensure that the stream is properly configured to be fault tolerant and to provide buffering if there are spikes in the incoming data.

Real-time deployment prioritizes latency over throughput and generates predictions in a few milliseconds. Your infrastructure will need to support load balancing and be able to scale to many concurrent requests if there is a large spike in demand (e.g., for online retailers around the holidays). Sometimes when people say "real-time deployment" they mean extracting precomputed predictions in real time, but here we mean generating model predictions in real time. Real-time deployment is the only option for which Spark cannot meet the latency requirements, so to use it you will need to export your model out of Spark. For example, if you intend to use a REST endpoint for real-time model inference (say, computing predictions in under 50 ms), MLlib does not meet the latency requirements of this application, as shown in Figure 11-5. You will need to take your feature preparation and modeling outside of Spark, which can be time-consuming and difficult.

Figure 11-5. Latency requirements for real-time inference: https://static001.geekbang.org/infoq/ca/cae2efb5aeef68a6295eeb4f57761a81.png

You need to define your model deployment requirements before starting the modeling process. MLlib and Spark are just a few of the tools in your toolbox, and you need to understand when and where each should be applied. The remainder of this section discusses MLlib's deployment options in more depth; after that, we'll consider Spark's deployment options for non-MLlib models.

Batch

Batch deployments represent the majority of use cases for deploying machine learning models, and arguably the easiest option to implement. You run a regular job to generate predictions and save the results to a table, database, data lake, etc., for downstream consumption. In fact, you already saw how to generate batch predictions with MLlib in Chapter 10. MLlib's model.transform() applies the model in parallel to all partitions of your DataFrame:

# In Python
# Load saved model with MLflow
import mlflow.spark
pipelineModel = mlflow.spark.load_model(f"runs:/{run_id}/model")

# Generate predictions
inputDF = spark.read.parquet("/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet")
predDF = pipelineModel.transform(inputDF)
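A batch job would typically finish by persisting those predictions somewhere downstream consumers can read them; one hedged example, with a hypothetical output path:

# In Python
# Persist the batch predictions for downstream consumers (the path is hypothetical).
predDF.write.mode("overwrite").parquet("/tmp/batch-predictions")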
A few things to keep in mind with batch deployments:

How frequently will you generate predictions?
There is a trade-off between latency and throughput. You will get higher throughput by batching many predictions together, but the time it takes to receive any individual prediction will be longer, delaying your ability to act on those predictions.

How often will you retrain the model?
Unlike libraries such as sklearn or TensorFlow, MLlib does not support online updates or warm starts. If you'd like to retrain your model to incorporate the latest data, you'll have to retrain the entire model from scratch rather than leveraging the existing parameters. In terms of retraining frequency, some people set up a regular job to retrain the model (e.g., once a month), while others actively monitor for model drift to decide when retraining is needed.

How will you version the model?
You can use the MLflow Model Registry to keep track of the models you are using and control how they transition between staging, production, and archived. You can see a screenshot of the Model Registry in Figure 11-6. You can also use the Model Registry with the other deployment options.

Figure 11-6. The MLflow Model Registry: https://static001.geekbang.org/infoq/03/0311519ebeb184dbeff5accd32c33faa.png

In addition to using the MLflow UI to manage your models, you can manage them programmatically. For example, once you have registered a production model, it has a consistent URI you can use to retrieve the latest version:

# Retrieve latest production model
model_production_uri = f"models:/{model_name}/production"
model_production = mlflow.spark.load_model(model_production_uri)
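Stage transitions themselves can also be scripted through the MlflowClient; a brief sketch (the model name and version here are hypothetical):

# In Python
# Promote a registered model version to Production (name/version hypothetical).
client.transition_model_version_stage(name=model_name,
                                      version=1,
                                      stage="Production")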
Streaming

Instead of waiting for an hourly or nightly job to process your data and generate predictions, Structured Streaming can continuously perform inference on incoming data. While this approach is more costly than a batch solution, since you must continually pay for compute time (and get lower throughput), you gain the benefit of generating predictions more frequently so you can act on them sooner. Streaming solutions are generally more complicated to maintain and monitor than batch solutions, but they offer lower latency.

With Spark it's very easy to convert your batch predictions to streaming predictions, and almost all of the code is the same. The only differences are that when reading in the data you need to use spark.readStream() rather than spark.read(), and you need to change the source of the data. In the following example we simulate reading in streaming data by streaming from a directory of Parquet files. You'll notice that we specify a schema even though we are working with Parquet files: this is because we need to define the schema a priori when working with streaming data. In this example we will use the random forest model trained on the Airbnb dataset from the previous chapter to perform these streaming predictions, loading the saved model with MLflow. We have partitioned the source file into one hundred small Parquet files so you can see the output changing at every trigger interval:

# In Python
# Load saved model with MLflow
pipelineModel = mlflow.spark.load_model(f"runs:/{run_id}/model")

# Set up simulated streaming data
repartitionedPath = "/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean-100p.parquet"
schema = spark.read.parquet(repartitionedPath).schema

streamingData = (spark
                 .readStream
                 .schema(schema) # Can set the schema this way
                 .option("maxFilesPerTrigger", 1)
                 .parquet(repartitionedPath))

# Generate predictions
streamPred = pipelineModel.transform(streamingData)
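To actually materialize that stream you would attach a sink; a minimal hedged sketch, where the output and checkpoint paths are hypothetical:

# In Python
# Write the streaming predictions to Parquet (paths are hypothetical).
query = (streamPred
         .writeStream
         .format("parquet")
         .option("path", "/tmp/stream-predictions")
         .option("checkpointLocation", "/tmp/stream-predictions-checkpoint")
         .outputMode("append")
         .start())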
After generating these predictions, you can write them out to any target location for later retrieval (see Chapter 8 for Structured Streaming tips). As you can see, the code barely changes between the batch and streaming scenarios, which makes MLlib a great solution for both. However, depending on the latency demands of your task, MLlib may not be the best choice. With Spark, there is significant overhead involved in generating the query plan and communicating tasks and results between the driver and the worker nodes. So if you need really low-latency predictions, you'll need to export your model out of Spark.

Near real-time
If your use case requires predictions in the hundreds of milliseconds to seconds range, you could build a prediction server that uses MLlib to generate the predictions. While this is not an ideal use case for Spark because you are processing very small amounts of data, you'll get lower latency than with streaming or batch solutions.

Model Export Patterns for Real-Time Inference

There are some domains where real-time inference is required, including fraud detection, ad recommendation, and the like. While making predictions with a small number of records may achieve the low latency required for real-time inference, you will also need to handle load balancing (handling many concurrent requests) as well as geolocation in latency-critical tasks. Popular managed solutions, such as AWS SageMaker and Azure ML, provide low-latency model serving. In this section we'll show you how to export your MLlib models so they can be deployed to those services.

One way to export your model out of Spark is to reimplement the model natively in Python, C, etc. While extracting the model's coefficients may seem simple, exporting all of the feature engineering and preprocessing steps along with them (OneHotEncoder, VectorAssembler, etc.) quickly gets troublesome and is very error-prone. There are a few open source libraries, such as MLeap and ONNX, that can help you automatically export a supported subset of MLlib models to remove their dependency on Spark. However, as of this writing, the company that developed MLeap no longer supports it, and MLeap does not support Scala 2.12 / Spark 3.0.

ONNX (Open Neural Network Exchange), on the other hand, has become the de facto open standard for machine learning interoperability. Some of you might recall other ML interoperability formats, such as PMML (Predictive Model Markup Language), but those never gained the traction that ONNX now enjoys. ONNX is very popular in the deep learning community as a tool that allows developers to easily switch between libraries and languages, and as of this writing it has experimental support for MLlib.

Instead of exporting MLlib models, there are other third-party libraries that integrate with Spark and are convenient to deploy in real-time scenarios, such as XGBoost and H2O.ai's Sparkling Water (whose name derives from the combination of H2O and Spark).

XGBoost is one of the most successful algorithms in Kaggle competitions for structured data problems, and it's a very popular library among data scientists. Although XGBoost is not technically part of MLlib, the XGBoost4J-Spark library allows you to integrate distributed XGBoost into your MLlib pipelines. A benefit of XGBoost is the ease of deployment: after you train your MLlib pipeline, you can extract the XGBoost model and save it as a non-Spark model for serving in Python, as demonstrated here:

// In Scala
val xgboostModel =
  xgboostPipelineModel.stages.last.asInstanceOf[XGBoostRegressionModel]
xgboostModel.nativeBooster.saveModel(nativeModelPath)

# In Python
import xgboost as xgb
bst = xgb.Booster({'nthread': 4})
bst.load_model("xgboost_native_model")
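From there the loaded booster serves predictions on plain in-memory data with no Spark dependency. A quick hedged sketch; the feature matrix below is a random stand-in for whatever features your preprocessing actually produces:

# In Python
# Single-machine, low-latency inference with the native booster.
import numpy as np
features = np.random.rand(3, 10)        # hypothetical feature matrix
preds = bst.predict(xgb.DMatrix(features))
print(preds)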
At the time of this writing, the distributed XGBoost API is only available in Java/Scala. A full example is included in the book's GitHub repo.

Now that you have learned about the different ways of exporting MLlib models for real-time serving environments, let's discuss how to apply Spark to non-MLlib models.

Leveraging Spark for Non-MLlib Models

As mentioned previously, MLlib isn't always the best solution for your machine learning needs. It may not meet ultra low-latency inference requirements, or it may lack built-in support for the algorithm you want to use. For these cases you can still leverage Spark, just not MLlib. In this section we will discuss how you can use Spark to perform distributed inference of single-node models using Pandas UDFs, perform hyperparameter tuning, and scale feature engineering.

Pandas UDFs

While MLlib is fantastic for distributed training of models, you are not limited to MLlib for making batch or streaming predictions at scale with Spark: you can create custom functions to apply your pretrained models at scale, known as user-defined functions (UDFs, covered in Chapter 5). A common use case is to build a scikit-learn or TensorFlow model on a single machine, perhaps on a subset of your data, but then perform distributed inference on the entire dataset using Spark.

If you define your own UDF to apply a model to each record of your DataFrame in Python, opt for Pandas UDFs for optimized serialization and deserialization, as discussed in Chapter 5. However, if your model is very large, there is high overhead for the Pandas UDF to repeatedly load the same model for every batch in the same Python worker process. In Spark 3.0, Pandas UDFs can accept an iterator of pandas.Series or pandas.DataFrame, so you only need to load the model once per worker process instead of once for every series in the iterator. For more details on what's new in Apache Spark 3.0 with Pandas UDFs, see Chapter 12.

If the worker nodes cache the model weights after loading them for the first time, subsequent calls of the same UDF with the same model loading will become significantly faster.
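Here is one hedged shape such an iterator-style Pandas UDF can take; load_model_somehow and the single feature column are hypothetical placeholders, not code from the book:

# In Python
from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model_somehow()  # hypothetical loader; runs once per worker process
    for batch in batches:
        # Score one pandas.Series batch at a time, reusing the loaded model
        yield pd.Series(model.predict(batch.to_numpy().reshape(-1, 1)))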
In the following example we'll use mapInPandas(), introduced in Spark 3.0, to apply a scikit-learn model to our Airbnb dataset. mapInPandas() takes an iterator of pandas.DataFrame as input and outputs another iterator of pandas.DataFrame. It's flexible and easy to use if your model requires all of your columns as input, but it requires serialization/deserialization of the whole DataFrame (as it is passed to its input). You can control the size of each pandas.DataFrame with the spark.sql.execution.arrow.maxRecordsPerBatch configuration. A full copy of the code to generate the model is available in the book's GitHub repo, but here we will just focus on loading the saved scikit-learn model from MLflow and applying it to our Spark DataFrame:

# In Python
import mlflow.sklearn
import pandas as pd

def predict(iterator):
  model_path = f"runs:/{run_id}/random-forest-model"
  model = mlflow.sklearn.load_model(model_path) # Load model
  for features in iterator:
    yield pd.DataFrame(model.predict(features))

df.mapInPandas(predict, "prediction double").show(3)

+-----------------+
|       prediction|
+-----------------+
| 90.4355866254844|
|255.3459534312323|
| 499.625544914651|
+-----------------+
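As noted above, the number of rows handed to each pandas.DataFrame is configurable; for example (the value is illustrative):

# In Python
# Cap the rows per batch passed to mapInPandas / Pandas UDFs.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")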
In addition to applying models at scale with Pandas UDFs, you can also use them to parallelize the process of building many models. For example, you might want to build a model per IoT device type to predict time to failure. You can use pyspark.sql.GroupedData.applyInPandas() (introduced in Spark 3.0) for tasks like this. The function takes a pandas.DataFrame and returns another pandas.DataFrame. The book's GitHub repo contains a full code example for building a model per IoT device type and tracking the individual models with MLflow; just a snippet is included here for brevity (a hypothetical sketch of what build_model might look like follows below):

# In Python
df.groupBy("device_id").applyInPandas(build_model, schema=trainReturnSchema)

The groupBy() will cause a full shuffle of your dataset, and you need to ensure that your model and the data for each group can fit on a single machine. Some of you might be familiar with pyspark.sql.GroupedData.apply() (e.g., df.groupBy("device_id").apply(build_model)), but that API will be deprecated in future releases of Spark in favor of pyspark.sql.GroupedData.applyInPandas().
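For illustration, here is one hypothetical shape build_model could take; the column names (device_id, failure_time) and the return schema are assumptions, not the repo's actual code:

# In Python -- a hypothetical sketch of a per-group model builder.
# Assumes trainReturnSchema = "device_id string, n_used long, mse double".
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def build_model(df_pandas: pd.DataFrame) -> pd.DataFrame:
    device_id = df_pandas["device_id"].iloc[0]
    X = df_pandas.drop(["device_id", "failure_time"], axis=1)  # hypothetical columns
    y = df_pandas["failure_time"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    rf = RandomForestRegressor(random_state=42)
    rf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, rf.predict(X_test))

    # One row summarizing this device's model, matching trainReturnSchema
    return pd.DataFrame([[device_id, len(df_pandas), mse]],
                        columns=["device_id", "n_used", "mse"])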
Now that you know how to apply UDFs to perform distributed inference and parallelize model building, let's look at how to use Spark for distributed hyperparameter tuning.

Spark for Distributed Hyperparameter Tuning

Even if you do not intend to do distributed inference or do not need MLlib's distributed training capabilities, you can still leverage Spark for distributed hyperparameter tuning. This section covers two open source libraries in particular: Joblib and Hyperopt.

Joblib

According to its documentation, Joblib is "a set of tools to provide lightweight pipelining in Python." It has a Spark backend to distribute tasks on a Spark cluster. Joblib can be used for hyperparameter tuning because it automatically broadcasts a copy of your data to all of the workers, each of which then creates its own model with different hyperparameters on its copy of the data. This allows you to train and evaluate many models in parallel. You still have the fundamental limitation that a single model plus all the data must fit on a single machine, but you can trivially parallelize the hyperparameter search, as shown in Figure 11-7.

Figure 11-7. Parallelizing a hyperparameter search with Joblib: https://static001.geekbang.org/infoq/5e/5edc99e54b81c28ce1a89f626b7350a0.png

To use Joblib, install it with pip install joblibspark. Make sure you are using scikit-learn version 0.21 or later and pyspark version 2.4.4 or later. An example of how to do distributed cross-validation is shown here; the same approach also applies to distributed hyperparameter tuning:

# In Python
from sklearn.utils import parallel_backend
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import pandas as pd
from joblibspark import register_spark

register_spark() # Register Spark backend

df = pd.read_csv("/dbfs/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-numeric.csv")
X_train, X_test, y_train, y_test = train_test_split(df.drop(["price"], axis=1),
  df[["price"]].values.ravel(), random_state=42)

rf = RandomForestRegressor(random_state=42)
param_grid = {"max_depth": [2, 5, 10], "n_estimators": [20, 50, 100]}
gscv = GridSearchCV(rf, param_grid, cv=3)

with parallel_backend("spark", n_jobs=3):
  gscv.fit(X_train, y_train)

print(gscv.cv_results_)

See the scikit-learn GridSearchCV documentation for an explanation of the parameters returned from the cross-validator.
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Hyperopt","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hyperopt是一個Python庫,用於“在笨拙的搜索空間上進行串行和並行優化,其中可能包括實值,離散值和條件維。” 你可以通過pip install hyperopt安裝它。使用Apache Spark擴展Hyperopt的主要方法有兩種:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 將單機Hyperopt與分佈式訓練算法(例如MLlib)一起使用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 使用分佈式Hyperopt和包含SparkTrials類單機訓練算法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於前一種情況,你不需要進行任何配置就可以將MLlib與Hyperopt以及其他任何庫一起使用。因此,讓我們看一下後一種情況:具有單節點模型的分佈式Hyperopt。不幸的是,在撰寫本文時,你無法將分佈式超參數評估與分佈式訓練模型結合在一起。可以在本書的GitHub repo中找到用於並行化Keras模型的超參數搜索的完整代碼示例;此處僅包含一個片段以說明Hyperopt的關鍵組件:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import hyperopt","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"best_hyperparameters = hyperopt.fmin(","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  fn = training_function,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  space = search_space,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  algo = hyperopt.tpe.suggest,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  max_evals = 64,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  trials = 
fmin() generates new hyperparameter configurations to use for your training_function and passes them to SparkTrials. SparkTrials runs batches of these training tasks in parallel as single-task Spark jobs on the Spark executors. When a Spark task completes, it returns the results and the corresponding loss to the driver. Hyperopt uses these new results to compute better hyperparameter configurations for future tasks. This allows for massive scale-out of hyperparameter tuning. MLflow also integrates with Hyperopt, so you can track the results of all the models you've trained as part of your hyperparameter tuning.

An important parameter for SparkTrials is parallelism, which determines the maximum number of trials to evaluate concurrently. If parallelism=1, you train each model sequentially, but you may get better models by making full use of the adaptive algorithm. If you set parallelism=max_evals (the total number of models to train), you are just doing random search. Any number between 1 and max_evals lets you trade off scalability against adaptivity. By default, parallelism is set to the number of Spark executors. You can also specify a timeout to limit the maximum number of seconds fmin() is allowed to run.

Even if MLlib isn't the right fit for your problem, hopefully you can see the value of using Spark in any of your machine learning tasks.

Koalas

Pandas is a very popular data analysis and manipulation library in Python, but it is limited to running on a single machine. Koalas is an open source library that implements the Pandas DataFrame API on top of Apache Spark, easing the transition from Pandas to Spark. You can install it with pip install koalas, and then simply replace any pd (Pandas) logic in your code with ks (Koalas). This way, you can scale up your analyses with Pandas without needing to entirely rewrite your codebase in PySpark. Here is an example of how to change your Pandas code to Koalas (you'll need PySpark installed):

# In pandas
import pandas as pd
pdf = pd.read_csv(csv_path, header=0, sep=";", quotechar='"')
pdf["duration_new"] = pdf["duration"] + 100

# In koalas
import databricks.koalas as ks
kdf = ks.read_csv(file_path, header=0, sep=";", quotechar='"')
kdf["duration_new"] = kdf["duration"] + 100

While Koalas aims to eventually implement all Pandas functionality, not all of it is implemented yet. If there is functionality that you need that Koalas does not provide, you can always switch to using the Spark API by calling kdf.to_spark(). Alternatively, you can bring the data to the driver by calling kdf.to_pandas() and use the Pandas API (but be careful that the dataset isn't too large, or you will crash the driver!).
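Switching between the two APIs looks roughly like this (a sketch, assuming the kdf from above):

# In Python
# Move between Koalas, Spark, and pandas representations.
sdf = kdf.to_spark()        # Koalas -> Spark DataFrame
pdf_small = kdf.to_pandas() # collects to the driver; keep it small!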
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"小結","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本章中,我們介紹了用於管理和部署機器學習管道的各種最佳實踐。你瞭解了MLflow如何幫助你追蹤和重現實驗以及打包代碼及其依賴項以將其部署到其他地方。我們還討論了主要的部署選項(批處理,流和實時)及其相關的權衡取捨。MLlib是用於大規模模型訓練和批處理/流式使用案例的絕佳解決方案,但對於小數據集的實時推理,相對於單節點模型就沒什麼優勢。你的部署需求直接影響你可以使用的模型和框架的類型,因此在開始模型構建過程之前,討論這些需求至關重要。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在下一章中,我們將重點介紹Spark 3.0中的一些關鍵新功能,以及如何將其合併到Spark工作負載中。","attrs":{}}]}]}