Make Python Functions Run 30x Faster with Just a Few Lines of Code

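{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Python is a popular programming language and the one most favored by the data science community. Compared with other popular languages, Python's main drawback is speed: its dynamic nature and versatility come at a performance cost. Python code is interpreted at runtime rather than compiled to native code ahead of time."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/dc\/67\/dca4d36177130d61f6e67f8827aab867.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"A basic guide to multiprocessing in Python"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"C code can run 10 to 100 times faster than Python code. In terms of development speed, however, Python is faster than C. For data science research, development speed matters far more than runtime performance. Thanks to its wealth of APIs, frameworks, and packages, Python is the preferred choice of data scientists and data analysts, even though it lags far behind in performance optimization."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this article, we discuss how to run a custom Python function in parallel with the multiprocessing module, and then compare the runtime numbers."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Getting Started with Multiprocessing"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Consider a single-core CPU. If it is assigned several tasks at once, it must keep interrupting the current task and switching to the next one to keep all processes running. A multi-core processor, by contrast, can execute multiple tasks simultaneously on different cores. This concept is known as parallel processing."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Why does it matter?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data wrangling, feature engineering, and data exploration are all important parts of a data science model development pipeline. Raw data has to be engineered before it is fed to a machine learning model. On a small dataset these steps finish in a few seconds, but on a large dataset they become a heavy task."}]},{"type":"paragraph","attrs":{"indent":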
0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Parallel processing is an effective way to improve the performance of Python programs. Python ships with a multiprocessing module that lets us run code in parallel across the different cores of a CPU."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Implementation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We will use the "},{"type":"text","marks":[{"type":"strong"}],"text":"multiprocessing"},{"type":"text","text":" module's "},{"type":"text","marks":[{"type":"strong"}],"text":"Pool"},{"type":"text","text":" class to run a function in parallel over multiple input values. This is called data parallelism, and it is the main purpose of the Pool class."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From "},{"type":"link","attrs":{"href":"https:\/\/www.kaggle.com\/","title":"","type":null},"content":[{"type":"text","text":"Kaggle"}]},{"type":"text","text":", I downloaded the "},{"type":"link","attrs":{"href":"https:\/\/www.kaggle.com\/c\/quora-question-pairs","title":"","type":null},"content":[{"type":"text","text":"Quora Question Pairs similarity dataset"}]},{"type":"text","text":", which I will use to demonstrate this module."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The dataset contains a large number of text questions asked on the Quora platform. I will apply multiprocessing to a Python function that processes text data by removing stop words, stripping HTML tags, removing punctuation, stemming, and so on."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"preprocess()"},{"type":"text","text":" 
is the function that performs the text-processing steps above."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The code snippet for the preprocess() function, hosted on my GitHub, can be found "},{"type":"link","attrs":{"href":"https:\/\/gist.github.com\/satkr7\/7d66e00bc2db9742a6c77cbf206af3f9","title":"","type":null},"content":[{"type":"text","text":"here"}]},{"type":"text","text":"."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now we use the Pool class from the multiprocessing module to run the function in parallel on different chunks of the dataset. Each chunk is handed to a worker process in the pool."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"import multiprocessing\nfrom functools import partial\nfrom QuoraTextPreprocessing import preprocess\n\n# df is assumed to be a pandas DataFrame loaded beforehand, with a \"question\" column\nBUCKET_SIZE = 50000\n\ndef run_process(df, start):\n    # Preprocess one chunk of BUCKET_SIZE rows beginning at row `start`\n    df = df[start:start+BUCKET_SIZE]\n    print(start, \"to \", start+BUCKET_SIZE)\n    temp = df[\"question\"].apply(preprocess)\n    return temp  # without a return, pool.map would collect only None\n\n# Starting row index of every chunk\nchunks = [x for x in range(0, df.shape[0], BUCKET_SIZE)]\npool = multiprocessing.Pool()\nfunc = partial(run_process, df)\ntemp = 
pool.map(func,chunks)\npool.close()\npool.join()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The dataset has 537,361 records (text questions) to process. With a bucket size of 50,000, the dataset is split into 11 smaller chunks, which can be processed in parallel to reduce the program's execution time."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Benchmarking:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A common question is how much faster execution becomes with the multiprocessing module. I compared benchmark execution times for running the preprocess() function over the entire dataset with data parallelism and in a single process."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The tests were run on a machine with 64 GB of RAM and 10 CPU cores."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/2a\/b5\/2a67d7020dfa3041751b25e3c12150b5.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Benchmark times for multiprocessing and single-process execution"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From the figure above, we can see that parallelizing the Python function made execution nearly "},{"type":"text","marks":[{"type":"strong"}],"text":"30 times"},{"type":"text","text":" faster."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Python "},{"type":"link","attrs":{"href":"https:\/\/gist.github.com\/satkr7\/1087afdd4291638122186f5741564dd9","title":"","type":null},"content":[{"type":"text","text":"file"}]},{"type":"text","text":" used to record the benchmark data is available on my GitHub."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/7c\/75\/7c628e8fdeb1abdf6450c123dc061175.gif","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"The benchmarking process"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Conclusion"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this article, we covered how to use Python's multiprocessing module to speed up the execution of Python functions. After adding a few lines of multiprocessing code, processing a dataset of about 537k records ran nearly 30 times faster."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When working with large datasets, I recommend using parallel processing, since it can save a great deal of time and speed up your workflow."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"See my other articles on speeding up Python workflows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/4-libraries-that-can-parallelize-the-existing-pandas-ecosystem-f46e5a257809","title":"","type":null},"content":[{"type":"text","text":"4 Libraries That Can Parallelize the Existing Pandas Ecosystem"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/400x-time-faster-pandas-data-frame-iteration-16fb47871a0a","title":"","type":null},"content":[{"type":"text","text":"400x Faster Pandas DataFrame Iteration"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/optimize-pandas-memory-usage-while-reading-large-datasets-1b047c762c9b","title":"","type":null},"content":[{"type":"text","text":"Optimizing Pandas Memory Usage for Large Datasets"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/3x-times-faster-pandas-with-pypolars-7550e605805e","title":"","type":null},"content":[{"type":"text","text":"Making Pandas 3x Faster with PyPolars"}]}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"References"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] multiprocessing documentation: "},{"type":"link","attrs":{"href":"https:\/\/docs.python.org\/3\/library\/multiprocessing.html","title":"","type":null},"content":[{"type":"text","text":"https:\/\/docs.python.org\/3\/library\/multiprocessing.html"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original article:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience
.com\/25x-times-faster-python-function-execution-in-a-few-lines-of-code-4c82bdd0f64c","title":"","type":null},"content":[{"type":"text","text":"https:\/\/towardsdatascience.com\/25x-times-faster-python-function-execution-in-a-few-lines-of-code-4c82bdd0f64c"}]}]}]}
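The chunk-and-map pattern from the implementation section can be sketched as a self-contained script. This is only an illustration under stated assumptions: the `preprocess` function here is a hypothetical stand-in (the article's real one lives in the author's gist), the helper names `make_chunks`, `process_chunk`, and `run_parallel` are invented for the example, and plain lists replace the pandas DataFrame so the script runs without extra dependencies.

```python
import multiprocessing
import re

BUCKET_SIZE = 3  # tiny value for the demo; the article uses 50,000


def preprocess(text):
    """Hypothetical stand-in for the article's preprocess() function."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation
    return " ".join(text.lower().split())  # lowercase, normalize whitespace


def make_chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    """Worker function: clean every text in one chunk and return the results."""
    return [preprocess(text) for text in chunk]


def run_parallel(items, size=BUCKET_SIZE, workers=None):
    """Process the chunks in a worker pool and flatten results in input order."""
    chunks = make_chunks(items, size)
    with multiprocessing.Pool(processes=workers) as pool:
        results = pool.map(process_chunk, chunks)  # map keeps chunk order
    return [row for chunk in results for row in chunk]


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on
    # platforms that start workers by re-importing the main module.
    questions = [
        "What is <b>Python</b>?",
        "How do I use multiprocessing?",
        "Is C faster than Python?",
        "Why use a Pool?",
    ]
    for cleaned in run_parallel(questions):
        print(cleaned)
```

Returning the cleaned rows from the worker is what lets `pool.map` hand results back to the parent process. Any speedup depends on the number of cores and on how expensive each call is relative to the cost of pickling chunks between processes, so it is worth measuring on your own data.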