仅用几行代码,让Python函数执行快30倍

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Python是一种流行的编程语言,也是数据科学社区中最受欢迎的语言。与其他流行编程语言相比,Python的主要缺点是它的动态特性和多功能属性拖慢了速度表现。Python代码是在运行时被解释的,而不是在编译时被编译为原生代码。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/dc\/67\/dca4d36177130d61f6e67f8827aab867.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Python多线程处理的基本指南"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"C语言的执行速度比Python代码快10到100倍。但如果对比开发速度的话,Python比C语言要快。对于数据科学研究来说,开发速度远比运行时性能更重要。由于存在大量API、框架和包,Python更受数据科学家和数据分析师的青睐,只是它在性能优化方面落后太多了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本文中,我们将讨论如何用多处理模块并行执行自定义Python函数,并进一步对比运行时间指标。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"多处理入门"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"考虑一个单核心CPU,如果它被同时分配多个任务,就必须不断地中断当前执行的任务并切换到下一个任务才能保持所有进程正常运行。对于多核处理器来说,CPU可以在不同内核中同时执行多个任务,这一概念被称为并行处理。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"它为什么如此重要?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"数据整理、特征工程和数据探索都是数据科学模型开发管道中的重要元素。在输入机器学习模型之前,原始数据需要做工程处理。对于较小的数据集来说,执行过程只需几秒钟就能完成;但对于较大的数据集而言,这项任务就比较繁重了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"并行处理是提高Python程序性能的一种有效方法。Python有一个多处理模块,让我们能够跨CPU的不同内核并行执行程序。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"实现"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们将使用来自"},{"type":"text","marks":[{"type":"strong"}],"text":"multiprocessing"},{"type":"text","text":"模块的"},{"type":"text","marks":[{"type":"strong"}],"text":"Pool"},{"type":"text","text":"类,针对多个输入值并行执行一个函数。这个概念称为数据并行性,它是Pool类的主要目标。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我将使用从"},{"type":"link","attrs":{"href":"https:\/\/www.kaggle.com\/","title":"","type":null},"content":[{"type":"text","text":"Kaggle"}]},{"type":"text","text":"下载的"},{"type":"link","attrs":{"href":"https:\/\/www.kaggle.com\/c\/quora-question-pairs","title":"","type":null},"content":[{"type":"text","text":"Quora问题对相似性数据"}]},{"type":"text","text":"集来演示这个模块。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上述数据集包含了很多在Quora平台上提出的文本问题。我将在一个Python函数上执行多处理模块,这个函数通过删除停用词、删除HTML标签、删除标点符号、词干提取等过程来处理文本数据。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"preprocess()"},{"type":"text","text":" 就是执行上述文本处理步骤的函数。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以在"},{"type":"link","attrs":{"href":"https:\/\/gist.github.com\/satkr7\/7d66e00bc2db9742a6c77cbf206af3f9","title":"","type":null},"content":[{"type":"text","text":"这里"}]},{"type":"text","text":"找到托管在我的GitHub上的函数preprocess()的代码片段。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"现在,我们使用multiprocessing模块中的Pool类为数据集的不同块并行执行该函数。数据集的每个块都将并行处理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"import multiprocessing\nfrom functools import partial\nfrom QuoraTextPreprocessing import preprocess\n\nBUCKET_SIZE = 50000\n\ndef run_process(df, start):\n df = df[start:start+BUCKET_SIZE]\n print(start, \"to \",start+BUCKET_SIZE)\n temp = df[\"question\"].apply(preprocess)\n \nchunks = [x for x in range(0,df.shape[0], BUCKET_SIZE)] \npool = multiprocessing.Pool()\nfunc = partial(run_process, df)\ntemp = pool.map(func,chunks)\npool.close()\npool.join()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"该数据集有537,361条记录(文本问题)需要处理。对于50,000的桶大小,数据集被分成11个较小的数据块,这些块可以并行处理以加快程序的执行时间。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"基准测试:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"人们常问的问题是使用多处理模块后执行速度能快多少。我在实现了数据并行性,对整个数据集执行一次preprocess()函数后对比了基准执行时间。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"运行测试的机器有64GB内存和10个CPU内核。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/2a\/b5\/2a67d7020dfa3041751b25e3c12150b5.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"多处理和单处理执行的基准时间"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"从上图中,我们可以观察到Python函数的并行处理将执行速度提高了近"},{"type":"text","marks":[{"type":"strong"}],"text":"30倍"},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你可以在我的GitHub中找到用于记录基准测试数据的Python"},{"type":"link","attrs":{"href":"https:\/\/gist.github.com\/satkr7\/1087afdd4291638122186f5741564dd9","title":"","type":null},"content":[{"type":"text","text":"文件"}]},{"type":"text","text":"。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/7c\/75\/7c628e8fdeb1abdf6450c123dc061175.gif","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"基准测试过程"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"结论"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本文中,我们讨论了Python中多处理模块的实现,该模块可用于加速Python函数的执行。添加几行多处理代码后,具有537k实例的数据集的执行时间几乎快了30倍。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"处理大型数据集的时候,我建议大家使用并行处理,因为它可以节省大量时间并加快工作流程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"请参阅我关于加速Python工作流程的其他文章:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/4-libraries-that-can-parallelize-the-existing-pandas-ecosystem-f46e5a257809","title":"","type":null},"content":[{"type":"text","text":"4个可以并行化现有Pandas生态系统的库"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/400x-time-faster-pandas-data-frame-iteration-16fb47871a0a","title":"","type":null},"content":[{"type":"text","text":"Pandas数据帧迭代速度提高400倍"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/optimize-pandas-memory-usage-while-reading-large-datasets-1b047c762c9b","title":"","type":null},"content":[{"type":"text","text":"优化大数据集的Pandas内存使用"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/3x-times-faster-pandas-with-pypolars-7550e605805e","title":"","type":null},"content":[{"type":"text","text":"使用PyPolars将Pandas的速度提高3倍"}]}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"参考文章"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1]多处理文档:"},{"type":"link","attrs":{"href":"https:\/\/docs.python.org\/3\/library\/multiprocessing.html","title":"","type":null},"content":[{"type":"text","text":"https:\/\/docs.python.org\/3\/library\/multiprocessing.html"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文链接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/25x-times-faster-python-function-execution-in-a-few-lines-of-code-4c82bdd0f64c","title":"","type":null},"content":[{"type":"text","text":"https:\/\/towardsdatascience.com\/25x-times-faster-python-function-execution-in-a-few-lines-of-code-4c82bdd0f64c"}]}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"福利推荐"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2021年ArchSummit全球架构师峰会深圳站设置【高可用高性能业务架构】专题,将邀请阿里巴巴、同程旅行、Shopee等一线专家,分享在不同团队、不同业务场景、不同技术栈下,如何实现业务的快速开发并保证其架构具备良好的扩展性和容错能力,如何分析关乎用户体验的系统瓶颈和构建高性能系统。9月3-4日,我们重回深圳。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"点击底部【阅读原文】查看所有上线专题。大会席位 9 折限时优惠,联系票务小姐姐小倩预定现场席位:18514549229(同微信)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/1b\/cc\/1b17e9bb4e1a751496da02a4ddc19ecc.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":""}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":""}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static.geekbang.org\/infoq\/5c6947ecc1649.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章