Moving from Pandas to Spark: 8 Q&As to Resolve All Your Doubts

*This article was originally published on Medium and is translated and shared by InfoQ China with the original author's permission.*

*Original link: [https://towardsdatascience.com/moving-from-pandas-to-spark-7b0b7d956adb](https://towardsdatascience.com/moving-from-pandas-to-spark-7b0b7d956adb)*

> As your datasets grow larger, moving to Spark can improve speed and save you time.

Most data science workflows start with Pandas.

Pandas is a great library: you can use it for all kinds of transformations, and it handles many data types such as CSV and JSON. I love Pandas; I even made a [podcast](https://podcasts.apple.com/us/podcast/17-why-pandas-is-the-new-excel/id1453716761?i=1000454831790) episode called "Why Pandas is the New Excel".

I still think Pandas is a terrific library in the data scientist's arsenal. But one day you will need to work with datasets so large that Pandas runs out of memory. That is where Spark comes in.

![Spark is great for large datasets ❤️](https://static001.infoq.cn/resource/image/07/ce/076cb95057d0a6985f137943d0008dce.png)

This post covers, in Q&A form, some questions you are likely to run into, along with the doubts I had when I started.

#### Question 1: What is Spark?

Spark is a framework for processing massive datasets. It handles big data files in a distributed fashion: several workers each process a chunk of your large dataset, all orchestrated by a driver node.

The framework's distributed nature means it can scale to terabytes of data; you are no longer limited by the memory of a single machine. The Spark ecosystem is now quite mature, so you don't need to worry about orchestrating workers yourself. It works out of the box, and it is very fast.

![The Spark ecosystem](https://static001.geekbang.org/resource/image/52/87/52ee476525d289041b75ba5043cefb87.png)

*The Spark ecosystem ([reference](https://spark.apache.org/))*

#### Question 2: When should I move off Pandas and seriously consider Spark?

It depends on how much memory your machine has. I would say that datasets larger than 10 GB are already too big for Pandas, and at that point Spark becomes a good choice.

Suppose your dataset has 10 columns and each cell holds about 100 characters, which is roughly 100 bytes, since most characters are ASCII and encode to 1 byte each. At around 10 million rows, you should start thinking about Spark.

#### Question 3: Is Spark better than Pandas at everything?

Not at all! Pandas is definitely easier for beginners to learn. Spark is harder, but with its latest APIs you can work with big data through DataFrames that are almost as easy to use as Pandas DataFrames.

Also, until recently Spark's support for visualization was poor: you could only visualize a subset of your data. That is changing now that Databricks has announced native visualization support in Spark (I am still waiting to see the results).

Until that support matures, though, Spark will not fully replace Pandas, at least for visualization. You can always convert a Spark DataFrame to Pandas with `df.toPandas()` and then run your visualization or Pandas code.

#### Question 4: Spark is hard to set up. What should I do?

Spark can be driven from Python via PySpark, or from Scala (or R or SQL). I wrote a [blog post](https://towardsdatascience.com/how-to-get-started-with-pyspark-1adc142456ec) about getting started with PySpark locally or on a custom server, and the comments were full of people saying how hard it is to get going. I think the easiest way to try Spark is through a managed cloud solution.

I recommend two ways to get started with Spark:

1. **Databricks**: a fully managed service that manages Spark clusters for you in AWS/Azure/GCP. It provides notebooks much like Jupyter notebooks.
2. **Amazon EMR and Zeppelin notebooks**: a semi-managed AWS service. You host a Spark EMR endpoint and run Zeppelin notebooks against it. Other cloud vendors offer similar services, which I won't cover here.

![Databricks is a popular managed way to run Spark clusters](https://static001.geekbang.org/resource/image/04/4b/049c8eaca60e7217b3eea3302e03e84b.png)

*Databricks is a popular managed way to run Spark clusters*

#### Question 5: Which is better, Databricks or EMR?

After spending a few hours trying to understand the pros and cons of each, here is what I found:

1. EMR is fully managed by Amazon, and you never have to leave the AWS ecosystem.
2. If you have DevOps expertise, or DevOps people to help you, EMR may be the cheaper option: you need to know how to spin up instances and shut them down when you are done. That said, EMR can be less stable, and you may spend hours debugging. Databricks Spark is much more stable.
3. Scheduling jobs on Databricks is easy: you can very easily schedule a notebook to run at specific times of the day or week. Databricks also provides an interface to metrics in the Ganglia UI.
4. Databricks jobs can cost 30-40% more than the equivalent EMR Spark jobs, but given the flexibility, stability, and strong customer support, I think it is worth it. Databricks charges 6-7x more when you run notebooks interactively in Spark, so be aware of that. Given that you can shut instances down after 30/60/120 minutes of inactivity and save money, I still think Databricks can work out cheaper overall.

With these points in mind: if this is your first Spark project, I would recommend Databricks; if you have solid DevOps expertise, you can try EMR or run Spark on your own machines. If you don't mind sharing your work publicly, you can use the free Databricks Community Edition, or a 14-day trial of the enterprise edition.

#### Question 6: How does PySpark compare with Pandas?

I think this topic deserves an article of its own. This [talk](https://www.youtube.com/watch?v=XrpSRCwISdk) by Andrew Ray, a Spark contributor, should answer some of your questions.

The main similarities are:

1. Spark DataFrames are very much like Pandas DataFrames.
2. PySpark's groupby, aggregations, selection, and other transformations are all very similar to their Pandas counterparts. PySpark is slightly harder than Pandas and has a bit of a learning curve, but it feels much the same to use.

The main differences are:

1. Spark lets you query DataFrames with SQL, which I think is really great. Sometimes it is easier to write a piece of logic in SQL than to remember the exact Pandas/PySpark API, and you can switch between the two approaches freely.
2. **Spark DataFrames are immutable.** Slicing, overwriting data, and so on are not allowed.
3. **Spark is lazily evaluated.** It builds a graph of all your transformations and only evaluates them when you invoke an action such as collect, show, or take. Transformations can be wide (looking at the entire data across all nodes, e.g. orderBy or groupBy) or narrow (looking at individual rows within each node, e.g. contains or filter). Running several wide transformations can be slower than narrow ones, so compared with Pandas you need to be much more mindful of which wide transformations you use!

![Narrow vs. wide transformations in Spark](https://static001.geekbang.org/resource/image/ea/13/ea57d5fbc1873c983d5097e771d1f113.png)

*Narrow vs. wide transformations in Spark. Wide transformations are slower.*

#### Question 7: Does Spark have other advantages?

Spark offers not only DataFrames (a higher-level abstraction over RDDs), but also excellent APIs for streaming data and for distributed machine learning via MLlib. So if you want to transform streaming data, or do machine learning over large datasets, Spark serves you well.

#### Question 8: Is there an example data pipeline architecture that uses Spark?

Yes. Below is an ETL pipeline in which raw data from a data lake (S3) is processed and transformed in Spark, loaded back into S3, and then loaded into a data warehouse such as Snowflake or Redshift, which in turn feeds BI tools like Tableau or Looker.

![An example ETL pipeline for big data processing for BI tools](https://static001.geekbang.org/resource/image/68/y9/689595ef251cacdcf70f606405a47yy9.png)

*An example ETL pipeline for big data processing for BI tools*

![An example pipeline for machine learning in Amazon SageMaker](https://static001.geekbang.org/resource/image/e6/b6/e636ee8799889107bb47226631c75fb6.png)

*An example pipeline for machine learning in Amazon SageMaker*

You can also first collect data from various sources into the warehouse, transform those large datasets with Spark, load them into S3 as Parquet files, and then read them from SageMaker (if you prefer SageMaker to Spark's MLlib).

Another advantage of SageMaker is that it makes it easy to deploy a model and trigger it from a Lambda function, which in turn connects to the outside world through a REST endpoint in API Gateway.

I wrote a [blog post](https://towardsdatascience.com/what-makes-aws-sagemaker-great-for-machine-learning-c8a42c208aa3) about this architecture. In addition, the book [Learning Spark](https://www.amazon.com/dp/1492050040/?tag=omnilence-20) by Jules Damji is a great way to learn Spark.

That wraps up this post. We covered some of the similarities and differences between Spark and Pandas, the best ways to get started with Spark, and some common architectures that make use of Spark.

If you have any questions or comments, reach out to me on [LinkedIn](https://www.linkedin.com/in/sanketgupta107/) (https://www.linkedin.com/in/sanketgupta107/)!

**Resources**:

1. Jules Damji's [talk](https://databricks.com/session/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets) on how Spark works under the hood is really good.
2. [Learning Spark](https://www.amazon.com/dp/1492050040/?tag=omnilence-20) by Jules Damji.
3. Andrew Ray's [talk](https://www.youtube.com/watch?v=XrpSRCwISdk) comparing Pandas and PySpark syntax.
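As a hands-on footnote to the sizing estimate in Question 2, the back-of-envelope arithmetic (10 columns, roughly 100 bytes per cell, a 10 GB comfort limit for Pandas) can be checked with a few lines of plain Python. The 10 GB threshold is the article's rule of thumb, not a hard limit:

```python
# Rough in-memory size estimate for a tabular dataset, following the
# article's rule of thumb: ~10 columns, ~100 ASCII chars (~100 bytes) per cell.
BYTES_PER_CELL = 100
N_COLUMNS = 10
PANDAS_COMFORT_LIMIT = 10 * 1024**3  # ~10 GB, the article's threshold

def estimated_size_bytes(n_rows: int) -> int:
    """Approximate raw size of the dataset in bytes."""
    return n_rows * N_COLUMNS * BYTES_PER_CELL

def rows_before_spark() -> int:
    """Row count at which the estimate crosses the 10 GB threshold."""
    return PANDAS_COMFORT_LIMIT // (N_COLUMNS * BYTES_PER_CELL)

print(f"{rows_before_spark():,} rows")  # a bit over 10 million rows
```

The result lands just above 10 million rows, matching the article's "around 10M rows" guidance.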
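To make the similarity claimed in Question 6 concrete, here is a filter-then-aggregate written in Pandas, with the PySpark equivalent sketched in comments. The data and column names are made up for illustration, and the PySpark part assumes an existing SparkSession and Spark DataFrame rather than being runnable as-is:

```python
import pandas as pd

# A small made-up dataset for illustration.
df = pd.DataFrame({
    "city":  ["NYC", "NYC", "SF", "SF", "SF"],
    "sales": [100, 200, 50, 75, 125],
})

# Pandas: filter rows, then group and aggregate.
result = (
    df[df["sales"] > 60]
    .groupby("city", as_index=False)["sales"]
    .sum()
)
print(result)

# The PySpark version reads almost the same (sketch only; assumes a
# SparkSession named `spark` and a Spark DataFrame `sdf`):
#
#   from pyspark.sql import functions as F
#   result = (
#       sdf.filter(F.col("sales") > 60)
#          .groupBy("city")
#          .agg(F.sum("sales").alias("sales"))
#   )
#   result.show()
```

The method names differ slightly (`groupby` vs. `groupBy`, `sum` vs. `F.sum`), but the shape of the code carries over almost unchanged, which is the learning-curve point the article makes.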
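Question 6 also notes that you can swap freely between SQL and the DataFrame API in Spark. In PySpark the SQL side works by registering a DataFrame as a view with `createOrReplaceTempView` and calling `spark.sql(...)`. Since a Spark cluster isn't assumed here, the snippet below illustrates the same "just write it in SQL" workflow with the standard library's sqlite3 instead; the table and column names are invented for the example:

```python
import sqlite3

# In Spark this workflow is native:
#   df.createOrReplaceTempView("orders")
#   spark.sql("SELECT city, SUM(sales) AS total FROM orders GROUP BY city")
# Here we use stdlib sqlite3 so the example runs without a Spark cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("NYC", 100), ("NYC", 200), ("SF", 50)],
)

# Sometimes a GROUP BY is easier to express in SQL than to recall
# the exact DataFrame API call.
rows = conn.execute(
    "SELECT city, SUM(sales) AS total FROM orders GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('NYC', 300), ('SF', 50)]
conn.close()
```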
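The lazy-evaluation behavior described in Question 6 (transformations build a plan; an action triggers execution) can be mimicked in plain Python. This is a deliberately tiny toy model of the idea, not Spark's actual execution engine, and the class and method names are invented for the sketch:

```python
class LazyFrame:
    """Toy model of Spark's deferred execution: transformations only
    record themselves in a plan; an action (collect) runs the plan."""

    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []          # recorded steps, not yet executed

    def filter(self, predicate):         # narrow: each row judged locally
        return LazyFrame(self._data, self._plan + [("filter", predicate)])

    def order_by(self, key):             # wide: must see all rows at once
        return LazyFrame(self._data, self._plan + [("orderBy", key)])

    def collect(self):                   # action: now the plan actually runs
        rows = list(self._data)
        for op, fn in self._plan:
            if op == "filter":
                rows = [r for r in rows if fn(r)]
            elif op == "orderBy":
                rows = sorted(rows, key=fn)
        return rows

lf = LazyFrame([5, 3, 8, 1]).filter(lambda x: x > 2).order_by(lambda x: x)
# Nothing has been computed yet; the plan merely records two steps.
print(len(lf._plan))   # 2
print(lf.collect())    # [3, 5, 8]
```

Real Spark goes much further (it optimizes the plan and distributes the work across nodes), but this captures why chaining transformations is free and why the cost only appears at collect, show, or take.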