Quick tip: speeding up pandas with Modin
https://python.freelycode.com/contribution/detail/1454
GitHub
https://github.com/modin-project/modin
Manual
1. Installation
pip install modin
During installation, I was prompted to install cpython.
2. Usage: change one line of code
# import pandas as pd
import modin.pandas as pd
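Because the switch is a single import line, a script can also fall back to plain pandas when Modin is not installed. A minimal sketch, assuming a hypothetical helper named first_available (it is not part of Modin or pandas):

```python
import importlib


def first_available(module_names):
    """Return the first module in module_names that imports, else None.

    Hypothetical helper for illustration; not part of Modin or pandas.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer Modin, fall back to plain pandas.
pd = first_available(["modin.pandas", "pandas"])
print("backend:", pd.__name__ if pd is not None else "none installed")
```

The rest of the script keeps using the name pd either way, which is what makes the one-line change possible.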
Example 1:
import modin.pandas as pd
import numpy as np
frame_data = np.random.randint(0, 100, size=(2**10, 2**8))
df = pd.DataFrame(frame_data)
4. Speedup
import modin.pandas as pd
df = pd.read_csv("my_dataset.csv")
5. Testing with a file
1. File size
-rw-r--r-- 1 toucan toucan 289K Dec 20 17:17 IthaGenes_variations_export_all.csv
2. Reading with pandas
# Run: python read_pandas.py
$cat read_pandas.py
from timeit import default_timer as timer
import pandas as pd
# Run 2 iterations of read_csv and report the average time in seconds.
times = []
for i in range(2):
    start = timer()
    df = pd.read_csv("IthaGenes_variations_export_all.csv")
    end = timer()
    times.append(end - start)
time_read = sum(times) / len(times)
print(time_read)
Output:
$python read_pandas.py
0.009299777299929701
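The same timing loop is reused for every run in this note, so it could be factored into a small helper. A sketch only, assuming a hypothetical function named average_time and a stand-in workload in place of read_csv:

```python
from timeit import default_timer as timer


def average_time(func, repeats):
    """Call func() `repeats` times and return the mean wall-clock seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = timer()
        func()
        durations.append(timer() - start)
    return sum(durations) / len(durations)


# Stand-in workload; swap in e.g. lambda: pd.read_csv("my_dataset.csv")
print(average_time(lambda: sum(range(100_000)), repeats=2))
```

Passing the call as a function keeps the pandas and Modin scripts identical apart from the import line.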
3. Reading with Modin
# Run: python read_modin.py
$cat read_pandas.py
from timeit import default_timer as timer
import modin.pandas as pd
# Run 10 iterations of read_csv and report the average time in seconds.
times = []
for i in range(10):
    start = timer()
    df = pd.read_csv("IthaGenes_variations_export_all.csv")
    end = timer()
    times.append(end - start)
time_read = sum(times) / len(times)
print(time_read)
Output:
$python read_pandas.py
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-13_10-37-43_6323/logs.
Waiting for redis server at 127.0.0.1:35024 to respond...
Waiting for redis server at 127.0.0.1:62923 to respond...
Starting Redis shard with 10.0 GB max memory.
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 6283886592 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Starting the Plasma object store with 7.0 GB memory using /tmp.
0.20180192090001584
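The warning in the log gives the free space in /dev/shm as a raw byte count. A small sketch for converting that figure and checking the mount on the current machine; bytes_to_gib is a hypothetical helper name, and the check is skipped on systems without /dev/shm:

```python
import os
import shutil


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# The figure reported in the warning above.
print(f"{bytes_to_gib(6283886592):.2f} GiB")

# Free space on the shared-memory mount, if this system has one.
shm_path = "/dev/shm"
if os.path.isdir(shm_path):
    free = shutil.disk_usage(shm_path).free
    print(f"{shm_path}: {bytes_to_gib(free):.2f} GiB free")
```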
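Question
Is it because the input file is too small, or because the laptop is short on memory, that Modin shows no advantage? The virtual machine is configured with 12 GB of memory, but it has only 2 cores and 4 threads.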
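Since the core count is the suspect here, it may help to confirm what the Python process actually sees inside the virtual machine. A minimal sketch using only the standard library:

```python
import os

# Logical CPUs visible to the operating system (None if it cannot be determined).
logical_cpus = os.cpu_count()
print("logical CPUs:", logical_cpus)

# On Linux, the set of CPUs this process may run on can be smaller than that.
if hasattr(os, "sched_getaffinity"):
    print("usable CPUs:", len(os.sched_getaffinity(0)))
```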
Experiment 2
1. File size
Switched to a large file (791M).
-rw-rw-r-- 1 toucan toucan 791M Apr 13 10:43 hapmap_3.3_hg19_pop_stratified_af.vcf
2. Reading with pandas
$cat read_pandas.py
from timeit import default_timer as timer
import pandas as pd
# Run 22 iterations of read_csv and report the average time in seconds.
times = []
for i in range(22):
    start = timer()
    df = pd.read_csv("hapmap_3.3_hg19_pop_stratified_af.vcf", sep="\t")
    end = timer()
    times.append(end - start)
time_read = sum(times) / len(times)
print(time_read)
Output:
$python read_pandas.py
12.275232009363746
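The VCF format allows a file to begin with meta-information lines that start with "##", ahead of the tab-separated header. The script above reads the file as-is; if those lines need to be skipped, they can be counted first. A sketch under that assumption, with a hypothetical helper count_meta_lines and a tiny generated sample file rather than the real dataset:

```python
import os
import tempfile


def count_meta_lines(path, prefix="##"):
    """Count the leading lines of a text file that start with prefix."""
    count = 0
    with open(path, "r") as handle:
        for line in handle:
            if not line.startswith(prefix):
                break
            count += 1
    return count


# Build a tiny VCF-like sample so the sketch runs on its own.
sample = "##fileformat=VCFv4.1\n##source=example\n#CHROM\tPOS\n1\t100\n"
with tempfile.NamedTemporaryFile("w", suffix=".vcf", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

meta_count = count_meta_lines(sample_path)
print("meta lines:", meta_count)
os.remove(sample_path)
```

The count could then be passed to read_csv as skiprows, for example pd.read_csv(path, sep="\t", skiprows=count_meta_lines(path)).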
3. Reading with Modin
$cat read_pandas.py
from timeit import default_timer as timer
import modin.pandas as pd
# Run 22 iterations of read_csv and report the average time in seconds.
times = []
for i in range(22):
    start = timer()
    df = pd.read_csv("hapmap_3.3_hg19_pop_stratified_af.vcf", sep="\t")
    end = timer()
    times.append(end - start)
time_read = sum(times) / len(times)
print(time_read)
In top, it does not run as python, but as
Output:
$python read_pandas.py
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-13_10-52-14_6531/logs.
Waiting for redis server at 127.0.0.1:48416 to respond...
Waiting for redis server at 127.0.0.1:29343 to respond...
Starting Redis shard with 10.0 GB max memory.
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 6283886592 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Starting the Plasma object store with 7.0 GB memory using /tmp.
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0413 10:52:51.888617 6546 node_manager.cc:245] Last heartbeat was sent 524 ms ago
W0413 10:52:56.548642 6546 node_manager.cc:245] Last heartbeat was sent 539 ms ago
W0413 10:53:03.599262 6546 node_manager.cc:245] Last heartbeat was sent 532 ms ago
W0413 10:53:07.556165 6546 node_manager.cc:245] Last heartbeat was sent 782 ms ago
W0413 10:53:08.947691 6546 node_manager.cc:245] Last heartbeat was sent 636 ms ago
W0413 10:53:17.075079 6546 node_manager.cc:245] Last heartbeat was sent 643 ms ago
W0413 10:53:19.810811 6546 node_manager.cc:245] Last heartbeat was sent 804 ms ago
W0413 10:53:20.800647 6546 node_manager.cc:245] Last heartbeat was sent 513 ms ago
W0413 10:53:22.806788 6546 node_manager.cc:245] Last heartbeat was sent 699 ms ago
W0413 10:54:00.502030 6546 node_manager.cc:245] Last heartbeat was sent 585 ms ago
W0413 10:54:10.019619 6546 node_manager.cc:245] Last heartbeat was sent 513 ms ago
W0413 10:54:24.286998 6546 node_manager.cc:245] Last heartbeat was sent 732 ms ago
W0413 10:54:28.974217 6546 node_manager.cc:245] Last heartbeat was sent 865 ms ago
W0413 10:54:44.903314 6546 node_manager.cc:245] Last heartbeat was sent 537 ms ago
W0413 10:54:45.480008 6546 node_manager.cc:245] Last heartbeat was sent 576 ms ago
W0413 10:54:50.389829 6546 node_manager.cc:245] Last heartbeat was sent 556 ms ago
W0413 10:54:52.274536 6546 node_manager.cc:245] Last heartbeat was sent 522 ms ago
W0413 10:54:52.873443 6546 node_manager.cc:245] Last heartbeat was sent 599 ms ago
W0413 10:55:15.301537 6546 node_manager.cc:245] Last heartbeat was sent 1008 ms ago
W0413 10:55:16.863193 6546 node_manager.cc:245] Last heartbeat was sent 1129 ms ago
W0413 10:55:18.049829 6546 node_manager.cc:245] Last heartbeat was sent 603 ms ago
W0413 10:55:24.432444 6546 node_manager.cc:245] Last heartbeat was sent 959 ms ago
W0413 10:56:02.659128 6546 node_manager.cc:245] Last heartbeat was sent 643 ms ago
W0413 10:56:09.559237 6546 node_manager.cc:245] Last heartbeat was sent 607 ms ago
W0413 10:56:12.926802 6546 node_manager.cc:245] Last heartbeat was sent 595 ms ago
W0413 10:56:14.754364 6546 node_manager.cc:245] Last heartbeat was sent 830 ms ago
W0413 10:56:17.414083 6546 node_manager.cc:245] Last heartbeat was sent 526 ms ago
W0413 10:56:21.293486 6546 node_manager.cc:245] Last heartbeat was sent 539 ms ago
W0413 10:56:23.624935 6546 node_manager.cc:245] Last heartbeat was sent 576 ms ago
W0413 10:56:25.183625 6546 node_manager.cc:245] Last heartbeat was sent 703 ms ago
W0413 10:57:06.594352 6546 node_manager.cc:245] Last heartbeat was sent 544 ms ago
W0413 10:57:09.569542 6546 node_manager.cc:245] Last heartbeat was sent 693 ms ago
W0413 10:57:12.113721 6546 node_manager.cc:245] Last heartbeat was sent 506 ms ago
W0413 10:57:13.748317 6546 node_manager.cc:245] Last heartbeat was sent 690 ms ago
W0413 10:57:18.617753 6546 node_manager.cc:245] Last heartbeat was sent 1032 ms ago
W0413 10:57:25.745839 6546 node_manager.cc:245] Last heartbeat was sent 580 ms ago
W0413 10:57:38.555815 6546 node_manager.cc:245] Last heartbeat was sent 1772 ms ago
W0413 10:58:38.673028 6546 node_manager.cc:245] Last heartbeat was sent 1484 ms ago
W0413 10:58:59.426427 6546 node_manager.cc:245] Last heartbeat was sent 2125 ms ago
21.333674854863624
After waiting several minutes, the computed average came out to 21 seconds.
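To put the two experiments side by side, the averages printed above can be turned into ratios. A small sketch; the numbers are copied from the outputs in this note, and slowdown is a hypothetical helper name:

```python
# Average read_csv times in seconds, copied from the runs above.
timings = {
    "289K CSV": {"pandas": 0.009299777299929701, "modin": 0.20180192090001584},
    "791M VCF": {"pandas": 12.275232009363746, "modin": 21.333674854863624},
}


def slowdown(pandas_seconds, modin_seconds):
    """Return Modin time divided by pandas time (above 1 means Modin was slower)."""
    return modin_seconds / pandas_seconds


for label, t in timings.items():
    ratio = slowdown(t["pandas"], t["modin"])
    print(f"{label}: Modin took {ratio:.1f}x as long as pandas")
```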
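Conclusion
In tests on this virtual machine, Modin showed no clear advantage and was in fact slower than pandas: about 0.20 s versus 0.009 s on the 289K CSV file, and about 21.3 s versus 12.3 s on the 791M VCF file. The small number of CPU cores is the likely reason.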