In what situations can I use Dask instead of Apache Spark? [closed]

Question:

Closed. This question is opinion-based. It is not currently accepting answers.


Closed 5 years ago.

I am currently using Pandas and Spark for data analysis. I found that Dask provides parallelized NumPy arrays and Pandas DataFrames.

Pandas is easy and intuitive for doing data analysis in Python, but I have difficulty handling multiple larger dataframes in Pandas due to limited system memory.
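For context, here is a minimal sketch (the file pattern and column names are hypothetical) of how a pandas-style workflow maps onto dask.dataframe when the data no longer fits in memory:

```python
import dask.dataframe as dd

# Lazily read many CSV files as one partitioned DataFrame; nothing is
# loaded into memory yet.
df = dd.read_csv("data/2017-*.csv")

# Familiar pandas-style operations build a task graph instead of
# executing immediately.
result = df.groupby("user_id").amount.mean()

# .compute() runs the graph across all local cores and returns a
# regular pandas object.
print(result.compute())
```

Each partition is an ordinary pandas DataFrame, so most existing pandas knowledge carries over directly.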

Simple Answer:

Apache Spark is an all-inclusive framework combining distributed computing, SQL queries, machine learning, and more that runs on the JVM and is commonly co-deployed with other Big Data frameworks like Hadoop. ... Generally Dask is smaller and lighter weight than Spark.

I learned the details below from http://dask.pydata.org/en/latest/spark.html:

  • Dask is lightweight.
  • Dask is typically used on a single machine, but it also runs well on a distributed cluster (see the sketch after this list).
  • Dask provides parallel arrays, dataframes, machine learning, and custom algorithms.
  • Dask has an advantage for Python users because it is itself a Python library, so serialization and debugging when things go wrong happen more smoothly.
  • Dask gives up high-level understanding to allow users to express more complex parallel algorithms.
  • Dask is lighter weight and easier to integrate into existing code and hardware.
  • If you want a single project that does everything and you're already on Big Data hardware, then Spark is a safe bet.
  • Spark is typically used on small to medium sized clusters, but it also runs well on a single machine.
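To illustrate the single-machine-versus-cluster point, here is a hedged sketch: the same Dask code runs locally or on a cluster depending only on what the dask.distributed Client connects to (the scheduler address below is a placeholder):

```python
from dask.distributed import Client
import dask.array as da

# Local mode: Client() with no arguments starts a scheduler and
# workers on this machine's own cores.
client = Client()
# Cluster mode: point the same Client at a running dask-scheduler:
# client = Client("tcp://scheduler-address:8786")

# 50,000 x 50,000 doubles (~20 GB), split into ~8 MB chunks; only a
# few chunks are materialized in memory at any one time.
x = da.random.random((50_000, 50_000), chunks=(1_000, 1_000))
print(x.mean().compute())
```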

I learned more about Dask from the link below: https://www.continuum.io/blog/developer-blog/high-performance-hadoop-anaconda-and-dask-your-cluster

  • If you're running into memory issues, storage limitations, or CPU boundaries on a single machine when using Pandas, NumPy, or other computations with Python, Dask can help you scale up on all of the cores of a single machine, or scale out on all of the cores and memory across your cluster.
  • Dask works well on a single machine to make use of all of the cores on your laptop and to process larger-than-memory data.
  • Dask scales up resiliently and elastically on clusters with hundreds of nodes.
  • Dask works natively from Python with data in different formats and storage systems, including the Hadoop Distributed File System (HDFS) and Amazon S3 (see the sketch after this list). Anaconda and Dask can work with your existing enterprise Hadoop distribution, including Cloudera CDH and Hortonworks HDP.
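As a sketch of that native storage integration (the bucket name is hypothetical, and the s3fs package must be installed for the s3:// protocol; hdfs:// paths work analogously):

```python
import dask.dataframe as dd

# Dask resolves the s3:// protocol through s3fs and reads the
# matching files in parallel as partitions of one DataFrame.
df = dd.read_csv("s3://my-bucket/logs/*.csv")
print(df.head())  # materializes only the first partition
```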

http://dask.pydata.org/en/latest/dataframe-overview.html

Limitations

Dask.DataFrame does not implement the entire Pandas interface. Users expecting this will be disappointed. Notably, dask.dataframe has the following limitations:

  1. Setting a new index from an unsorted column is expensive (illustrated in the sketch after this list).
  2. Many operations, like groupby-apply and join on unsorted columns, require setting the index, which, as mentioned above, is expensive.
  3. The Pandas API is very large. Dask.dataframe does not attempt to implement many pandas features or any of the more exotic data structures like NDFrames.
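To make limitations 1 and 2 concrete, here is a hedged sketch (file and column names are hypothetical): setting the index on an unsorted column shuffles the whole dataset between partitions, so it pays to do it once and reuse the sorted result:

```python
import dask.dataframe as dd

df = dd.read_csv("events-*.csv", parse_dates=["timestamp"])

# Expensive: every row may move between partitions so that the data
# ends up sorted by timestamp.
df = df.set_index("timestamp")

# Cheap afterwards: selections and joins along the sorted index only
# touch the relevant partitions.
recent = df.loc["2017-01-01":"2017-06-30"]
print(recent.head())
```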

Thanks to the Dask developers. It seems like a very promising technology.

Overall, I understand that Dask is simpler to use than Spark. Dask is as flexible as Pandas, with more power to compute in parallel across more CPUs.

I understand all the above facts about Dask.

So, roughly how much data (in terabytes) can be processed with Dask?

