Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark (Part 11)

In the previous chapter, we covered how to build machine learning pipelines with MLlib. This chapter focuses on how to manage and deploy the models you train. By the end of this chapter, you will be able to use MLflow to track, reproduce, and deploy your MLlib models, discuss the difficulties of and trade-offs among the various model deployment scenarios, and design scalable machine learning solutions. Before we discuss deploying models, though, let's first review some best practices for model management to get your models ready for deployment.

Model Management

Before deploying your machine learning model, you should ensure that you can reproduce and track the model's performance. For us, end-to-end reproducibility of a machine learning solution means being able to reproduce the code that generated the model, the environment used for training, the data the model was trained on, and the model itself. Every data scientist loves to remind you to set your random seeds so that you can reproduce your experiments (e.g., for the train/test split, or when training a model with inherent randomness, such as a random forest). However, many more aspects contribute to reproducibility than just setting seeds, and some of them are quite subtle. Here are a few examples:

Library versioning
When a data scientist hands you their code, they may or may not mention the libraries it depends on. While you can work out which libraries are required from the error messages, you cannot be sure which versions they used, so you will likely install the latest ones. If their code was built on a previous version of a library whose default behavior has since changed, using the latest version can break the code or produce different results (e.g., consider how XGBoost changed the way it handles missing values in v0.90).

Data evolution
Suppose you build a model on June 1, 2020, and keep track of all your hyperparameters, libraries, and so on. You then try to reproduce the same model on July 1, 2020, but the pipeline breaks or the results differ because the underlying data has changed. This can happen if someone added an extra column, or orders of magnitude more data, after the initial build.

Order of execution
If a data scientist hands you their code, you should be able to run it top to bottom without error. However, data scientists are notorious for running notebook cells out of order, or running the same stateful cell multiple times, which makes their results very hard to reproduce. (They may also check in a copy of the code with hyperparameters that differ from the ones used to train the final model!)

Parallel operations
To maximize throughput, GPUs run many operations in parallel. However, the order of execution is not always guaranteed, which can lead to nondeterministic results. This is a known issue with functions like tf.reduce_sum(), for example when aggregating floating-point numbers (which have limited precision): the order in which they are added can produce slightly different results, and the differences compound over many iterations.

An inability to reproduce your experiments is often a blocker for business units adopting your model or putting it into production. While you could build your own in-house tools to track your models, data, dependency versions, and so on, such tools can become outdated, brittle, and costly to maintain. Equally important is having industry-wide standards for managing models, so that they can easily be shared with partners. Both open source and proprietary tools can help us reproduce our machine learning experiments by abstracting away many of these common difficulties. This section focuses on MLflow, as it has the tightest integration with MLlib among the open source model management tools currently available.

MLflow

MLflow is an open source platform that helps developers reproduce and share experiments, manage models, and much more. It provides interfaces in Python, R, and Java/Scala, as well as a REST API. As shown in Figure 11-1, MLflow has four main components:

Tracking
Provides APIs to record parameters, metrics, code versions, models, and artifacts such as plots and text.

Projects
A standardized format for packaging your data science projects and their dependencies so they can run on other platforms. It helps you manage the model training process.

Models
A standardized format for packaging models so they can be deployed to diverse execution environments. It provides a consistent API for loading and applying models, regardless of the algorithm or library used to build them.

Registry
A repository for tracking model lineage, model versions, stage transitions, and annotations.

Figure 11-1. The four main components of MLflow (image: https://static001.geekbang.org/infoq/80/80e5d51afb6bf017e3724f951cbfc05a.png)

Let's track the MLlib model experiments we ran in Chapter 10 for reproducibility. We will then see how the other components of MLflow come into play when we discuss model deployment. To get started with MLflow, simply run pip install mlflow on your local host.
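
Installing the library is also a good moment to guard against the library versioning pitfall described above: record the exact versions of your key dependencies alongside every experiment. A minimal sketch (the file name and the list of libraries are our own choices, not from the book):

# In Python
# Capture exact dependency versions so the experiment can be reproduced later.
import sys
import pyspark
import pandas as pd

with open("environment.txt", "w") as f:  # hypothetical file name
    f.write(f"python=={sys.version.split()[0]}\n")
    f.write(f"pyspark=={pyspark.__version__}\n")
    f.write(f"pandas=={pd.__version__}\n")

A file like this (or a pinned requirements.txt) can then be logged as an artifact of the run.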

Tracking

MLflow Tracking is a logging API that is agnostic to the libraries and environments that actually do the training. It is organized around the concept of runs, which are executions of data science code. Runs are aggregated into experiments, so that many runs can be part of a given experiment.

The MLflow tracking server can host many experiments. You can log to the tracking server from a notebook, a local app, or a cloud job, as shown in Figure 11-2.

Figure 11-2. (image: https://static001.geekbang.org/infoq/bd/bda7d7bd296965941470e618d723a549.png)

Let's examine a few things that can be logged to the tracking server:

Parameters
Key/value inputs to your code, e.g., hyperparameters like num_trees or max_depth in your random forest

Metrics
Numeric values (which can be updated over time), e.g., RMSE or accuracy

Artifacts
Files, data, and models, e.g., matplotlib plots or Parquet files

Metadata
Information about the run, such as the source code that executed it or the version of the code (e.g., the Git commit hash string of the code version)

Models
The model(s) you trained

By default, the tracking server records everything to the filesystem, but you can specify a database for faster querying of things like parameters and metrics.
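
If you are logging to a shared tracking server rather than the local filesystem, you can point the client at it before starting any runs. A minimal sketch (the server URI and experiment name below are hypothetical):

# In Python
import mlflow

mlflow.set_tracking_uri("http://my-tracking-server:5000")  # hypothetical URI
mlflow.set_experiment("airbnb-price-prediction")           # hypothetical name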

Let's add MLflow tracking to the random forest code from Chapter 10:

# In Python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

filePath = "/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet"
airbnbDF = spark.read.parquet(filePath)
(trainDF, testDF) = airbnbDF.randomSplit([.8, .2], seed=42)

categoricalCols = [field for (field, dataType) in trainDF.dtypes
                   if dataType == "string"]
indexOutputCols = [x + "Index" for x in categoricalCols]
stringIndexer = StringIndexer(inputCols=categoricalCols,
                              outputCols=indexOutputCols,
                              handleInvalid="skip")

numericCols = [field for (field, dataType) in trainDF.dtypes
               if ((dataType == "double") & (field != "price"))]
assemblerInputs = indexOutputCols + numericCols
vecAssembler = VectorAssembler(inputCols=assemblerInputs,
                               outputCol="features")

rf = RandomForestRegressor(labelCol="price", maxBins=40, maxDepth=5,
                           numTrees=100, seed=42)

pipeline = Pipeline(stages=[stringIndexer, vecAssembler, rf])

To start logging with MLflow, you will need to start a run using mlflow.start_run(). Rather than explicitly calling mlflow.end_run(), the examples in this chapter use a with clause, which automatically ends the run at the end of the with block:

# In Python
import mlflow
import mlflow.spark
import pandas as pd

with mlflow.start_run(run_name="random-forest") as run:
  # Log params: num_trees and max_depth
  mlflow.log_param("num_trees", rf.getNumTrees())
  mlflow.log_param("max_depth", rf.getMaxDepth())

  # Log model
  pipelineModel = pipeline.fit(trainDF)
  mlflow.spark.log_model(pipelineModel, "model")

  # Log metrics: RMSE and R2
  predDF = pipelineModel.transform(testDF)
  regressionEvaluator = RegressionEvaluator(predictionCol="prediction",
                                            labelCol="price")
  rmse = regressionEvaluator.setMetricName("rmse").evaluate(predDF)
  r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)
  mlflow.log_metrics({"rmse": rmse, "r2": r2})

  # Log artifact: feature importance scores
  rfModel = pipelineModel.stages[-1]
  pandasDF = (pd.DataFrame(list(zip(vecAssembler.getInputCols(),
                                    rfModel.featureImportances)),
                           columns=["feature", "importance"])
              .sort_values(by="importance", ascending=False))

  # First write to local filesystem, then tell MLflow where to find that file
  pandasDF.to_csv("feature-importance.csv", index=False)
  mlflow.log_artifact("feature-importance.csv")

Let's examine the MLflow UI, which you can access by running mlflow ui in your terminal and navigating to http://localhost:5000/. Figure 11-3 shows a screenshot of the UI.

Figure 11-3. The MLflow UI (image: https://static001.geekbang.org/infoq/84/849037feffbca5149a5767c0f890dc28.png)

The UI stores all the runs for a given experiment. You can search across the runs, filter for runs that meet particular criteria, compare runs side by side, etc. If you wish, you can also export the contents as a CSV file to analyze locally. Click on the run named "random-forest" in the UI. You should see a screen like Figure 11-4.

Figure 11-4. (image: https://static001.geekbang.org/infoq/cd/cd883f8e8b4c37203bef3e958caa86b0.png)

You'll notice that it keeps track of the source code used for this MLflow run, and stores all the corresponding parameters, metrics, etc. You can add notes about the run in free text, as well as tags. You cannot modify the parameters or metrics once the run has finished.

You can also query the tracking server using the MlflowClient or the REST API:

# In Python
from mlflow.tracking import MlflowClient

client = MlflowClient()
runs = client.search_runs(run.info.experiment_id,
                          order_by=["attributes.start_time desc"],
                          max_results=1)

run_id = runs[0].info.run_id
runs[0].data.metrics

This produces the following output:

{'r2': 0.22794251914574226, 'rmse': 211.5096898777315}
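
The same query can be issued over the REST API as well; here is a sketch using the requests library (the endpoint path reflects our reading of MLflow's REST documentation at the time of writing, so treat it as an assumption):

# In Python
import requests

# Fetch a single run by ID from a local tracking server
resp = requests.get("http://localhost:5000/api/2.0/mlflow/runs/get",
                    params={"run_id": run_id})
print(resp.json()["run"]["data"]["metrics"])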

We are hosting the code for this part of the book as an MLflow project in the book's GitHub repo, so you can experiment with running it with different hyperparameter values for max_depth and num_trees. The YAML file in the MLflow project specifies the library dependencies, so this code can be run in other environments:

# In Python
mlflow.run(
  "https://github.com/databricks/LearningSparkV2/#mlflow-project-example",
  parameters={"max_depth": 5, "num_trees": 100})

# Or on the command line
mlflow run https://github.com/databricks/LearningSparkV2/#mlflow-project-example -P max_depth=5 -P num_trees=100

Now that you have tracked and reproduced your experiments, let's discuss the deployment options available for your MLlib models.

Model Deployment Options with MLlib

Deploying machine learning models means something different for every organization and use case. Business constraints impose different requirements on latency, throughput, cost, etc., which dictate which mode of model deployment is suitable for the task at hand, be it batch, streaming, real-time, or mobile/embedded. Deploying models on mobile/embedded systems is outside the scope of this book, so we will focus primarily on the other options. Table 11-1 shows the throughput and latency trade-offs for the three deployment options for generating predictions. We care about both the number of concurrent requests and the size of those requests, and the resulting solutions will look quite different.

Table 11-1. Throughput and latency trade-offs of the deployment options (image: https://static001.geekbang.org/infoq/36/36fc4693e4f88d56b394ae1fa784db17.png)

Batch processing generates predictions on a regular schedule and writes the results to persistent storage to be served elsewhere. It is typically the cheapest and easiest deployment option, as you only pay for compute during your scheduled runs. It is also more efficient per data point, because the overhead is amortized across all of the predictions. That is particularly true with Spark, where communicating back and forth between the driver and the executors adds overhead, so you don't want to make predictions one data point at a time! Its main drawback, however, is high latency, as it is typically scheduled with a period of hours or days to generate the next batch of predictions.

Streaming provides a nice trade-off between throughput and latency. You continuously make predictions on micro-batches of data and get your predictions in seconds to minutes. If you are using Structured Streaming, almost all of your code will be identical to the batch use case, making it easy to go back and forth between the two options. With streaming, you have to pay for the VMs or compute resources that stay continually up and running, and you must make sure the stream is configured to be fault tolerant and to buffer spikes in the incoming data.

Real-time deployment prioritizes latency over throughput and generates predictions in a few milliseconds. Your infrastructure will need to support load balancing and be able to scale to many concurrent requests when demand spikes (e.g., for online retailers around the holidays). Sometimes when people say "real-time deployment" they mean extracting precomputed predictions in real time, but here we mean generating model predictions in real time. Real-time deployment is the only option for which Spark cannot meet the latency requirements, so to use it you will need to export your model out of Spark. For example, if you intend to use a REST endpoint for real-time model inference (say, computing predictions in under 50 ms), MLlib does not meet the latency requirements of the application, as shown in Figure 11-5. You will need to take your feature preparation and modeling out of Spark, which can be time-consuming and difficult.

Figure 11-5. (image: https://static001.geekbang.org/infoq/ca/cae2efb5aeef68a6295eeb4f57761a81.png)

You need to define your model deployment requirements before you even begin the modeling process. MLlib and Spark are just a few of the tools in your toolbox, and you need to understand when and where to apply them. The remainder of this section discusses the deployment options for MLlib in more depth, and then we'll consider the Spark deployment options for non-MLlib models.

Batch

Batch deployments represent the majority of use cases for deploying machine learning models, and arguably the easiest option to implement. You run a regular job to generate predictions and save the results to a table, database, data lake, etc., for downstream consumption. In fact, you already saw in Chapter 10 how to generate batch predictions with MLlib. MLlib's model.transform() applies the model in parallel to all the partitions of your DataFrame:

# In Python
# Load saved model with MLflow
import mlflow.spark
pipelineModel = mlflow.spark.load_model(f"runs:/{run_id}/model")

# Generate predictions
inputDF = spark.read.parquet("/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet")
predDF = pipelineModel.transform(inputDF)

A few things to keep in mind with batch deployments:

How frequently will you generate predictions?
There is a trade-off between latency and throughput. You will get higher throughput by batching many predictions together, but then the time it takes to receive any individual prediction will be much longer, delaying your ability to act on these predictions.

How often will you retrain the model?
Unlike libraries such as sklearn or TensorFlow, MLlib does not support online updates or warm starts. If you'd like to retrain your model to incorporate the latest data, you have to retrain the entire model from scratch rather than leveraging the existing parameters. In terms of retraining frequency, some people set up a regular job to retrain the model (e.g., once a month), while others actively monitor model drift to determine when retraining is needed.

How will you version the model?
You can use the MLflow Model Registry to keep track of the models you are using and control how they transition between staging, production, and archived. You can see a screenshot of the Model Registry in Figure 11-6. You can also use the Model Registry with the other deployment options.

Figure 11-6. The MLflow Model Registry (image: https://static001.geekbang.org/infoq/03/0311519ebeb184dbeff5accd32c33faa.png)

In addition to managing your models with the MLflow UI, you can manage them programmatically. For example, once you have registered your production model, it has a consistent URI that you can use to retrieve the latest version:

# Retrieve latest production model
model_production_uri = f"models:/{model_name}/production"
model_production = mlflow.spark.load_model(model_production_uri)
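
Getting a model into the registry and promoting it can likewise be done in code. A minimal sketch (the registry name "airbnb-rf" is hypothetical, and this assumes your tracking server has the Model Registry enabled):

# In Python
import mlflow
from mlflow.tracking import MlflowClient

model_name = "airbnb-rf"  # hypothetical registry name
model_version = mlflow.register_model(f"runs:/{run_id}/model", model_name)

client = MlflowClient()
client.transition_model_version_stage(name=model_name,
                                      version=model_version.version,
                                      stage="Production")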

Streaming

Instead of waiting for an hourly or nightly job to process your data and generate predictions, Structured Streaming can continuously perform inference on incoming data. While this approach is more costly than a batch solution, as you have to continually pay for compute time (and you get lower throughput), you get the added benefit of generating predictions more frequently so you can act on them sooner. Streaming solutions are in general more complicated to maintain and monitor than batch solutions, but they offer lower latency.

With Spark it is very easy to convert your batch predictions to streaming predictions, and virtually all of the code is the same. The only difference is that when reading in the data you need to use spark.readStream() rather than spark.read(), and change the source of the data. In the following example we are going to simulate reading in streaming data by streaming from a directory of Parquet files. You'll notice that we specify a schema even though we are working with Parquet files: this is because we need to define the schema a priori when working with streaming data. In this example we will use the random forest model trained on our Airbnb data set in the previous chapter to perform these streaming predictions, loading the saved model with MLflow. We have partitioned the source file into one hundred small Parquet files, so you can see the output changing at every trigger interval:

# In Python
# Load saved model with MLflow
pipelineModel = mlflow.spark.load_model(f"runs:/{run_id}/model")

# Set up simulated streaming data
repartitionedPath = "/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean-100p.parquet"
schema = spark.read.parquet(repartitionedPath).schema
streamingData = (spark
                 .readStream
                 .schema(schema) # Can set the schema this way
                 .option("maxFilesPerTrigger", 1)
                 .parquet(repartitionedPath))

# Generate predictions
streamPred = pipelineModel.transform(streamingData)

After making these predictions, you can write them out to any target location for later retrieval (see Chapter 8 for Structured Streaming tips).
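
For example, here is a minimal sketch of writing the streaming predictions to a Parquet sink (the output and checkpoint paths are hypothetical):

# In Python
query = (streamPred
         .writeStream
         .format("parquet")
         .option("path", "/tmp/predictions")               # hypothetical sink
         .option("checkpointLocation", "/tmp/checkpoint")  # for fault tolerance
         .start())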

As you can see, there is hardly any change in the code between the batch and streaming scenarios, which makes MLlib a great solution for both. However, depending on the latency demands of your task, MLlib may not be the best choice. With Spark there is significant overhead involved in generating the query plan and communicating tasks and results between the driver and the workers. Thus, if you need really low-latency predictions, you'll need to export your model out of Spark.

Near Real-Time

If your use case requires predictions on the order of hundreds of milliseconds to seconds, you could build a prediction server that uses MLlib to generate the predictions. While this is not an ideal use case for Spark, because you are working with very small amounts of data, you will get lower latency than with a streaming or batch solution.
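
To make the idea concrete, here is a hypothetical sketch of such a prediction server built with Flask (the endpoint, the payload shape, and the assumption that spark and pipelineModel are created once at startup are all ours, not the book's):

# In Python
from flask import Flask, request, jsonify

app = Flask(__name__)
# Assumes `spark` and `pipelineModel` were created once at startup

@app.route("/predict", methods=["POST"])
def predict():
    rows = request.get_json()  # e.g., [{"bedrooms": 2.0, ...}]
    df = spark.createDataFrame(rows)
    preds = pipelineModel.transform(df).select("prediction").collect()
    return jsonify([row["prediction"] for row in preds])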

Model Export Patterns for Real-Time Inference

Some domains do require real-time inference, including fraud detection, ad recommendation, and the like. While making predictions on a small number of records may achieve the low latency required for real-time inference, you will also need to contend with load balancing (handling many concurrent requests) and geolocation for latency-critical tasks. There are popular managed solutions, such as AWS SageMaker and Azure ML, that provide low-latency model serving. In this section we'll show you how to export your MLlib models so they can be deployed to those services.

One way to export your model out of Spark is to reimplement the model natively in Python, C, etc. While extracting the model's coefficients may seem simple, exporting all of the feature engineering and preprocessing steps along with them (OneHotEncoder, VectorAssembler, etc.) quickly gets troublesome and is very error-prone. There are a few open source libraries, such as MLeap and ONNX, that can help you automatically export a supported subset of MLlib models to remove their dependency on Spark. However, as of this writing the company that developed MLeap is no longer supporting it, and MLeap does not yet support Scala 2.12/Spark 3.0.

ONNX (Open Neural Network Exchange), on the other hand, has become the de facto open standard for machine learning interoperability. Some of you might remember other ML interoperability formats, such as PMML (Predictive Model Markup Language), but those never gained quite the traction that ONNX enjoys today. ONNX is very popular in the deep learning community as a tool that allows developers to easily switch between libraries and languages, and as of this writing it offers experimental support for MLlib.

Instead of exporting MLlib models, there are other third-party libraries that integrate with Spark and are convenient to deploy in real-time scenarios, such as XGBoost and H2O.ai's Sparkling Water (whose name derives from the combination of H2O and Spark).

XGBoost is one of the most successful algorithms in Kaggle competitions for structured data problems, and a very popular library among data scientists. Although XGBoost is not technically part of MLlib, the XGBoost4J-Spark library allows you to integrate distributed XGBoost into your MLlib pipelines. A benefit of XGBoost is its ease of deployment: after training your MLlib pipeline, you can extract the XGBoost model and save it as a non-Spark model for serving in Python, as demonstrated here:

// In Scala
val xgboostModel =
  xgboostPipelineModel.stages.last.asInstanceOf[XGBoostRegressionModel]
xgboostModel.nativeBooster.saveModel(nativeModelPath)

# In Python
import xgboost as xgb
bst = xgb.Booster({'nthread': 4})
bst.load_model("xgboost_native_model")

As of this writing, the distributed XGBoost API is only available in Java/Scala. A full example is included in the book's GitHub repo.

Now that you have learned about the different ways of exporting MLlib models for real-time serving environments, let's discuss how to apply Spark to non-MLlib models.

Leveraging Spark for Non-MLlib Models

As mentioned previously, MLlib isn't always the best solution for your machine learning needs. It may not meet super low-latency inference requirements, or have built-in support for the algorithm you'd like to use. For these cases you can still leverage Spark, just not MLlib. In this section we will discuss how to use Spark to perform distributed inference of single-node models with Pandas UDFs, perform hyperparameter tuning, and scale feature engineering.

Pandas UDFs

While MLlib is fantastic for distributed training of models, you are not limited to MLlib for making batch or streaming predictions with Spark: you can create custom functions to apply your pretrained models at scale, known as user-defined functions (UDFs, covered in Chapter 5). A common use case is to build a scikit-learn or TensorFlow model on a single machine, perhaps on a subset of your data, but perform distributed inference on the entire data set with Spark.

If you define your own UDF to apply a model to each record of a DataFrame in Python, opt for Pandas UDFs for optimized serialization and deserialization, as discussed in Chapter 5. However, if your model is very large, there is high overhead for the Pandas UDF to repeatedly load the same model for every batch in the same Python worker process. In Spark 3.0, Pandas UDFs can accept an iterator of pandas.Series or pandas.DataFrame, so you only need to load the model once per worker instead of once for every series in the iterator. For more details on what's new in Apache Spark 3.0 with Pandas UDFs, see Chapter 12. If the workers cache the model weights after loading the model for the first time, subsequent calls of the same UDF with the same model loading will become significantly faster.
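
A minimal sketch of this iterator pattern (our illustration, not the book's code; it assumes a scikit-learn model trained on a single feature column was logged under the hypothetical artifact name shown):

# In Python
from typing import Iterator
import pandas as pd
import mlflow.sklearn
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model once per Python worker process, not once per batch
    model = mlflow.sklearn.load_model(f"runs:/{run_id}/random-forest-model")
    for batch in batches:
        yield pd.Series(model.predict(batch.to_frame()))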

In the following example, we'll use mapInPandas(), introduced in Spark 3.0, to apply a scikit-learn model to our Airbnb data set. mapInPandas() takes an iterator of pandas.DataFrame as input and outputs another iterator of pandas.DataFrame. It is flexible and easy to use if your model requires all of your columns as input, but it requires serialization/deserialization of the whole DataFrame (as it is passed to its input). You can control the size of each pandas.DataFrame with the spark.sql.execution.arrow.maxRecordsPerBatch configuration. A full copy of the code to generate the model is available in the book's GitHub repo, but here we will just focus on loading the saved scikit-learn model from MLflow and applying it to our Spark DataFrame:

# In Python
import mlflow.sklearn
import pandas as pd

def predict(iterator):
  model_path = f"runs:/{run_id}/random-forest-model"
  model = mlflow.sklearn.load_model(model_path) # Load model
  for features in iterator:
    yield pd.DataFrame(model.predict(features))

df.mapInPandas(predict, "prediction double").show(3)

+-----------------+
|       prediction|
+-----------------+
| 90.4355866254844|
|255.3459534312323|
| 499.625544914651|
+-----------------+

In addition to applying models at scale with a Pandas UDF, you can also use Pandas UDFs to parallelize the process of building many models. For example, you might want to build a model for each IoT device type to predict time to failure. You can use pyspark.sql.GroupedData.applyInPandas() (introduced in Spark 3.0) for tasks like this. The function takes a pandas.DataFrame and returns another pandas.DataFrame. The book's GitHub repo contains a full code example for building a model per IoT device type and tracking the individual models with MLflow; just a snippet is included here for brevity:

# In Python
df.groupBy("device_id").applyInPandas(build_model,
                                      schema=trainReturnSchema)
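
Here build_model and trainReturnSchema come from the repo's example. A hypothetical sketch of what such a function could look like (the feature and label column names are invented for illustration):

# In Python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# trainReturnSchema would then be, e.g.:
# "device_id string, n_used long, train_score double"
def build_model(pdf: pd.DataFrame) -> pd.DataFrame:
    X = pdf.drop(["device_id", "failure_time"], axis=1)  # hypothetical columns
    y = pdf["failure_time"]
    model = RandomForestRegressor().fit(X, y)
    return pd.DataFrame([{"device_id": pdf["device_id"].iloc[0],
                          "n_used": pdf.shape[0],
                          "train_score": model.score(X, y)}])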

The groupBy() will cause a full shuffle of your data set, and you need to ensure that the model and the data for each group can fit on a single machine. Some of you might be familiar with pyspark.sql.GroupedData.apply() (e.g., df.groupBy("device_id").apply(build_model)), but that API will be deprecated in future releases of Spark in favor of pyspark.sql.GroupedData.applyInPandas().

Now that you have seen how to apply UDFs to perform distributed inference and parallelize model building, let's look at how to use Spark for distributed hyperparameter tuning.

Spark for Distributed Hyperparameter Tuning

Even if you do not intend to do distributed inference or do not need MLlib's distributed training capabilities, you can still leverage Spark for distributed hyperparameter tuning. This section covers two open source libraries in particular: Joblib and Hyperopt.

Joblib

According to its documentation, Joblib is "a set of tools to provide lightweight pipelining in Python." It has a Spark backend to distribute tasks on a Spark cluster. Joblib can be used for hyperparameter tuning because it automatically broadcasts a copy of your data to all of the workers, which then create their own models with different hyperparameters on their copies of the data. This allows you to train and evaluate multiple models in parallel. You still have the fundamental limitation that a single model and all the data must fit on a single machine, but you can trivially parallelize the hyperparameter search, as shown in Figure 11-7.

Figure 11-7. Distributed hyperparameter search with Joblib (image: https://static001.geekbang.org/infoq/5e/5edc99e54b81c28ce1a89f626b7350a0.png)

To use Joblib, install it with pip install joblibspark. Ensure you are using scikit-learn 0.21 or later and pyspark 2.4.4 or later. The example here shows how to do distributed cross-validation; the same approach also works for distributed hyperparameter tuning:

# In Python
from sklearn.utils import parallel_backend
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import pandas as pd
from joblibspark import register_spark

register_spark() # Register Spark backend

df = pd.read_csv("/dbfs/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-numeric.csv")
X_train, X_test, y_train, y_test = train_test_split(df.drop(["price"], axis=1),
  df[["price"]].values.ravel(), random_state=42)

rf = RandomForestRegressor(random_state=42)
param_grid = {"max_depth": [2, 5, 10], "n_estimators": [20, 50, 100]}
gscv = GridSearchCV(rf, param_grid, cv=3)

with parallel_backend("spark", n_jobs=3):
  gscv.fit(X_train, y_train)

print(gscv.cv_results_)

See the scikit-learn GridSearchCV documentation for an explanation of the parameters returned from the cross-validator.

Hyperopt

Hyperopt is a Python library for "serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions." You can install it via pip install hyperopt. There are two main ways to scale Hyperopt with Apache Spark:

- Using single-machine Hyperopt with a distributed training algorithm (e.g., MLlib)
- Using distributed Hyperopt with single-machine training algorithms, via the SparkTrials class

For the former case, there is nothing special you need to configure to use MLlib with Hyperopt versus any other library. So, let's take a look at the latter case: distributed Hyperopt with single-node models. Unfortunately, as of this writing you cannot combine distributed hyperparameter evaluation with distributed training of the models. The full code example for parallelizing the hyperparameter search for a Keras model can be found in the book's GitHub repo; just a snippet is included here to illustrate the key components of Hyperopt:

# In Python
import hyperopt

best_hyperparameters = hyperopt.fmin(
  fn = training_function,
  space = search_space,
  algo = hyperopt.tpe.suggest,
  max_evals = 64,
  trials = hyperopt.SparkTrials(parallelism=4))

fmin() generates new hyperparameter configurations for your training_function and passes them to SparkTrials. SparkTrials runs batches of these training tasks in parallel, each as a single-task Spark job on a Spark executor. When a Spark task finishes, it returns the results and the corresponding loss to the driver. Hyperopt uses these new results to compute better hyperparameter configurations for future tasks. This allows massive scale-out of hyperparameter tuning. MLflow also integrates with Hyperopt, so you can track the results of all the models you've trained as part of your hyperparameter tuning.

An important parameter for SparkTrials is parallelism: the maximum number of trials to evaluate concurrently. With parallelism=1, you train each model sequentially, but you may get better models by making full use of the adaptive algorithm. If you set parallelism=max_evals (the total number of models to train), you are just doing a random search. Any number between 1 and max_evals lets you trade off scalability against adaptiveness. By default, parallelism is set to the number of Spark executors. You can also specify a timeout to limit the maximum number of seconds fmin() is allowed to run. Even if MLlib isn't suitable for your problem, hopefully you can see the value of using Spark in any of your machine learning tasks.

Koalas

Pandas is a very popular data analysis and manipulation library in Python, but it is limited to running on a single machine. Koalas is an open source library that implements the Pandas DataFrame API on top of Apache Spark, easing the transition from Pandas to Spark. You can install it with pip install koalas, and then simply replace the pd (Pandas) logic in your code with ks (Koalas). This way you can scale up your analyses with Pandas without entirely rewriting your codebase in PySpark. Here is an example of how to change your Pandas code to Koalas (you'll need PySpark already installed):

# In pandas
import pandas as pd
pdf = pd.read_csv(csv_path, header=0, sep=";", quotechar='"')
pdf["duration_new"] = pdf["duration"] + 100

# In koalas
import databricks.koalas as ks
kdf = ks.read_csv(file_path, header=0, sep=";", quotechar='"')
kdf["duration_new"] = kdf["duration"] + 100

While Koalas aims to implement all Pandas features eventually, not all of them are implemented yet. If you need functionality that Koalas does not provide, you can always switch to using the Spark API by calling kdf.to_spark(). Alternatively, you can bring the data to the driver by calling kdf.to_pandas() and use the Pandas API (be careful that the data set isn't too large, or you will crash the driver!).

Summary

In this chapter, we covered a variety of best practices for managing and deploying machine learning pipelines. You saw how MLflow can help you track and reproduce experiments and package your code and its dependencies for deployment elsewhere. We also discussed the main deployment options (batch, streaming, and real-time) and their associated trade-offs. MLlib is a great solution for large-scale model training and for batch/streaming use cases, but for real-time inference on small data sets it has no advantage over a single-node model. Your deployment requirements directly affect the types of models and frameworks you can use, and it is critical to discuss these requirements before you begin the model-building process.

In the next chapter, we will highlight a handful of key new features in Spark 3.0 and how you can incorporate them into your Spark workloads.