At what situation I can use Dask instead of Apache Spark? [closed]

Question:

Closed. This question is opinion-based. It is not currently accepting answers.

Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.

Closed 5 years ago.

I am currently using Pandas and Spark for data analysis. I found that Dask provides a parallelized NumPy array and Pandas DataFrame.

Pandas is easy and intuitive for doing data analysis in Python. But I find it difficult to handle multiple larger dataframes in Pandas due to limited system memory.

Simple Answer:

Apache Spark is an all-inclusive framework combining distributed computing, SQL queries, machine learning, and more, that runs on the JVM and is commonly co-deployed with other Big Data frameworks like Hadoop. ... Generally Dask is smaller and lighter weight than Spark.

I learned the following details from http://dask.pydata.org/en/latest/spark.html:

  • Dask is lightweight.
  • Dask is typically used on a single machine, but also runs well on a distributed cluster.
  • Dask provides parallel arrays, dataframes, machine learning, and custom algorithms.
  • Dask has an advantage for Python users because it is itself a Python library, so serialization and debugging when things go wrong happen more smoothly.
  • Dask gives up high-level understanding to allow users to express more complex parallel algorithms.
  • Dask is lighter weight and is easier to integrate into existing code and hardware.
  • If you want a single project that does everything and you're already on Big Data hardware, then Spark is a safe bet.
  • Spark is typically used on small to medium-sized clusters but also runs well on a single machine.
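To illustrate the "parallel arrays" point in the list above (a sketch, assuming `dask` is installed), `dask.array` splits a large array into chunks, so a reduction can run over the chunks in parallel and then be combined:

```python
import dask.array as da

# A 1000x1000 array of ones, split into 250x250 chunks.
# Each chunk can be processed on a separate core.
x = da.ones((1000, 1000), chunks=(250, 250))

# The sum is computed chunk-by-chunk, then the partial sums are combined.
total = x.sum().compute()
print(total)
```

The same chunked model is what lets `dask.array` handle arrays that would not fit in memory as a single NumPy array.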

I understood more about Dask from the link below: https://www.continuum.io/blog/developer-blog/high-performance-hadoop-anaconda-and-dask-your-cluster

  • If you're running into memory issues, storage limitations, or CPU boundaries on a single machine when using Pandas, NumPy, or other computations with Python, Dask can help you scale up on all of the cores of a single machine, or scale out on all of the cores and memory across your cluster.
  • Dask works well on a single machine to make use of all of the cores on your laptop and process larger-than-memory data.
  • Dask scales up resiliently and elastically on clusters with hundreds of nodes.
  • Dask works natively from Python with data in different formats and storage systems, including the Hadoop Distributed File System (HDFS) and Amazon S3. Anaconda and Dask can work with your existing enterprise Hadoop distribution, including Cloudera CDH and Hortonworks HDP.

http://dask.pydata.org/en/latest/dataframe-overview.html

Limitations

Dask.DataFrame does not implement the entire Pandas interface. Users expecting this will be disappointed. Notably, dask.dataframe has the following limitations:

  1. Setting a new index from an unsorted column is expensive.
  2. Many operations, like groupby-apply and join on unsorted columns, require setting the index, which, as mentioned above, is expensive.
  3. The Pandas API is very large. Dask.dataframe does not attempt to implement many pandas features or any of the more exotic data structures like NDFrames.

Thanks to the Dask developers. It seems like a very promising technology.

Overall, I can see that Dask is simpler to use than Spark. Dask is as flexible as Pandas, with the added power of computing in parallel across more CPUs.

I understand all the above facts about Dask.

So, roughly how much data (in terabytes) can be processed with Dask?

