Python太慢了嗎?

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然Python比許多編譯語言都慢,但它易於使用,而且功能多樣。對於許多人來說,語言的實用性要勝過速度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我是一名Python工程師,因此你可能會認爲我帶有偏見。 但是我想澄清一些對Python的批評,並反思一下在使用Python進行數據工程、數據科學及分析等日常工作中,對速度的擔憂是否有必要。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Python太慢了嗎?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我認爲,這類問題應該基於特定的上下文或用例來說。與C之類的編譯語言相比,Python處理數值的速度慢嗎?是的,它是慢的。這一事實人們很多年前就已經知道了,這就是爲什麼會存在在速度方面起着至關重要作用的Python庫了,比如numpy,它的底層使用的是C。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是對於所有用例來說,Python難道都比其他(更難學習和使用的)語言慢得多嗎?如果你查看那些爲解決特定問題而優化了的Python庫的性能基準測試,就會發現與編譯語言相比,它們的表現是相當不錯的。例如,看看FastAPI的性能基準測試——顯然,作爲編譯語言的Go比Python快得多。不過,FastAPI在構建REST API方面還是勝過了一些Go庫:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b5\/b500337ee7556598c7b0ff065a6dd635.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Web框架基準測試——圖片由作者提供"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"旁註:上面的列表中不包含具有更高性能的C++和Java Web框架。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同樣地,在數據密集型神經成像管道中,對Dask(用Python編寫)和Spark(用Scala編寫)進行對比時,作者得出瞭如下結論:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總體而言,我們的結果表明,兩個引擎之間的性能沒有實質性的差異。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們應該捫心自問的問題是我們真正需要的是怎樣的速度。如果你每天只需觸發一次ETL作業,則可以不必關心它是需要20秒還是200秒。然後,你可能更希望代碼易於理解、打包和維護,特別是考慮到,與昂貴的引擎耗時相比,計算資源變得越來越便宜了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"代碼速度 vs 實用性"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從實用的角度來看,在爲日常工作而選擇編程語言時,我們需要回答幾個不同的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"用這種語言你能可靠地解決多個業務問題嗎?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你只關心速度,那就不要用Python了。對於各種用例,都有更快的替代方案。Python的最主要的優點在於它的可讀性、易用性,以及可以用它解決各種問題。Python可以作爲一種“粘合劑”,將各種不同的系統、服務和用例連接在一起。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"你能找到足夠多的懂這門語言的員工嗎?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於Python非常易於學習和使用,因此Python的用戶數一直在不斷增長。以前用Excel處理數值的業務用戶,現在可以非常快速地學會用Pandas編碼,從而學會在不依賴IT資源的情況下自給自足。同時,這也卸下了IT和分析部門的負擔。同時還縮短了價值實現的時間。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如今,找到懂Python並能用這種語言維護Spark數據處理應用程序的數據工程師,比找到做同樣事情的Java或Scala工程師要容易得多。許多組織僅僅因爲找到會“講”這種語言的員工的機會更高些,而逐漸在許多用例上都轉向使用Python來處理了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比之下,我知道有些公司迫切需要Java或C#開發人員來維護現有的應用程序,但是這些語言很難(需要數年的時間才能掌握),而且對於新手程序員來說似乎沒有吸引力,因爲他們可以利用更簡單的語言(比如Go或Python)來獲得更多的收入。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"不同領域專家之間的協同效應"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你的公司使用Python,那麼業務用戶、數據分析師、數據科學家、數據工程師、後端和Web開發人員、DevOps工程師甚至系統管理員都很有可能會使用相同的語言。這可以在項目中產生協同效應,使來自不同領域的人們可以一起工作,並利用相同的工具。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"數據處理中真正的瓶頸是什麼?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據我自己的工作,我通常遇到的瓶頸其實不是語言本身,而是外部資源。更具體地,我們來看幾個例子。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"寫入到關係型數據庫"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當以ETL的方式進行數據處理時,我們最終需要將這些數據加載到某個集中的地方。雖然我們可以利用Python中的多線程來更快地(通過使用更多的線程)將數據寫入到某些關係型數據庫中,但是並行寫入數的增加可能會使該數據庫的CPU容量達到最大值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實際上,當我使用多線程來加快對AWS上RDS Aurora數據庫的寫入時,這種情況就發生過一次。隨後我注意到writer節點的CPU利用率上升得非常高,以至於我不得不通過使用更少的線程來故意降低代碼的速度,以確保不會破壞數據庫實例。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這意味着Python具有並行化和加速許多操作的機制,但是關係型數據庫(受CPU內核數量的限制)有其侷限性,僅通過使用更快的編程語言是不可能解決這個問題的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"調用外部API"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一個語言本身不是瓶頸的例子是使用外部REST API(你可能希望從中提取數據以滿足數據分析的需求)。雖然我們可以利用並行來加快數據提取的速度,但這可能是徒勞的,因爲許多外部API限制了我們在特定時間段內可以發出的請求數。因此,你可能經常會發現自己需要故意降低腳本的運行速度,以確保不超過API的請求限制:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"time.sleep(10)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"處理大數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據我處理大量數據集的經驗,無論使用哪種語言,都無法將真正的“大數據”加載到筆記本電腦的內存中。對於此類用例,你可能需要利用分佈式處理框架,如Dask、Spark、Ray等。使用單個服務器實例或筆記本電腦時,可以處理的數據量是有限的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你想把實際的數據處理工作轉移到一組計算節點上,甚至可能想利用GPU實例來進一步加快計算速度,那麼Python恰好擁有一個龐大的框架生態系統,可以簡化這項任務:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你想利用GPU來加快數據科學的計算速度嗎?使用Pytorch、Tensorflow、Ray或Rapids(甚至是使用SQL ——BlazingSQL)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你想加快處理大數據的Python代碼的速度嗎?使用Spark(或Databricks)、Dask或Prefect(在底層抽象化了Dask)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你想加快數據分析的處理速度嗎?使用fast專用於內存的列式數據庫,僅通過使用SQL查詢即可確保高速處理。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你需要對計算節點集羣上進行的數據處理進行編排和監控的話,有幾個Python編寫的工作流管理平臺可以使用,它們可以加快數據管道的開發和維護,比如Apache Airflow、Prefect或Dagster。如果你想了解更多相關知識,請查看我之前的文章。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"順便說一句,我可以想象一些抱怨Python的人並沒有充分利用它,或者可能沒有使用恰當的數據結構來解決手頭的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總而言之,如果你需要快速處理大量的數據,可能需要更多的計算資源,而不是更快的編程語言,而且有些Python庫可以方便地將工作分發到數百個節點上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"結論"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本文中,我們討論了Python是否是當前數據處理領域的真正瓶頸。雖然Python比許多編譯語言都慢,但它易於使用,而且功能多樣。我們注意到,對於許多人來說,語言的實用性要勝過速度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後,我們討論了,至少在數據工程中,語言本身可能不是瓶頸,而是外部系統的限制,以及無論選擇哪種編程語言,都無法在單個機器上處理的大量數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"參考:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] TechEmpower:"},{"type":"link","attrs":{"href":"https:\/\/www.techempower.com\/benchmarks\/#section=data-r19&hw=ph&test=query&l=ziix33-1r&f=0-0-0-0-0-1ekg-0-0-4fti4g-0-0","title":"","type":null},"content":[{"type":"text","text":"Web框架基準測試"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2]"},{"type":"link","attrs":{"href":"https:\/\/arxiv.org\/abs\/1907.13030","title":"","type":null},"content":[{"type":"text","text":"“數據密集型神經成像管道的Dask和Apache Spark的性能比較”"}]},{"type":"text","text":"——"},{"type":"link","attrs":{"href":"https:\/\/arxiv.org\/search\/cs?searchtype=author&query=Dugr%C3%A9%2C+M","title":"","type":null},"content":[{"type":"text","text":"Mathieu Dugré"}]},{"type":"text","text":","},{"type":"link","attrs":{"href":"https:\/\/arxiv.org\/search\/cs?searchtype=author&query=Hayot-Sasson%2C+V","title":"","type":null},"content":[{"type":"text","text":"Valérie Hayot-Sasson"}]},{"type":"text","text":","},{"type":"link","attrs":{"href":"https:\/\/arxiv.org\/search\/cs?searchtype=author&query=Glatard%2C+T","title":"","type":null},"content":[{"type":"text","text":"Tristan Glatard"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文鏈接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/is-python-really-a-bottleneck-786d063e2921?gi=3b1490fa23d3","title":"","type":null},"content":[{"type":"text","text":"https:\/\/towardsdatascience.com\/is-python-really-a-bottleneck-786d063e2921?gi=3b1490fa23d3"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章