Consider this scenario: there is a huge pile of txt files to be read in batch by batch, and we want to see how efficient torch's DataLoader is at it.
Setup:
- Machine: 8 cores, 32 GB RAM; the workstation has a 2 TB mechanical hard drive, and all the data lives on that drive
- OS: Ubuntu 16.04 LTS
- PyTorch: 1.0
- Python: 3.6
1. First, generate a pile of random txt files
import os
import random
import string

def gen_test_txt():
    # 52 ASCII letters plus newline, sampled uniformly
    population = list(string.ascii_letters) + ['\n']
    os.makedirs('./test_txt', exist_ok=True)
    for i in range(1000):
        # each file is ~1 MB of random text
        with open(f'./test_txt/{i}.txt', 'w') as f:
            f.write(''.join(random.choices(population, k=1_000_000)))
2. Then read the files sequentially as a baseline
import numpy as np
import matplotlib.pyplot as plt
from time import time
from torch.utils.data import Dataset, DataLoader

def test_torch_reader():
    class Dst(Dataset):
        def __init__(self, paths):
            self.paths = paths

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, i):
            # we only care about IO cost, so read and discard the content
            open(self.paths[i], 'r').read()
            return 1

    dst = Dst([f'./test_txt/{i}.txt' for i in range(1000)])
    loader = DataLoader(dst, batch_size=128, num_workers=0)
    ts = time()
    time_cost = []
    for i, ele in enumerate(loader, 1):
        dur = time() - ts
        time_cost.append(dur)
        print(i, dur)
        ts = time()
    print(f"{sum(time_cost):.3f}, "
          f"{np.mean(time_cost):.3f}, "
          f"{np.std(time_cost):.3f}, "
          f"{max(time_cost):.3f}, "
          f"{min(time_cost):.3f}")
    plt.plot(time_cost)
    plt.grid()
    plt.show()
Basic statistics of the per-batch time (in seconds) are below; it stays around 0.9 s / batch:
total, mean, std, max, min
7.148, 0.893, 0.074, 1.009, 0.726
So: 1000 files at batch size 128 means 8 batches, for a total of about 7.1 s. That is roughly 1 GB of data (1000 files of ~1 MB each) in 7.1 s, i.e. about 140 MB/s. Before the next run, clear the cache.
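Clearing the page cache between runs matters, since otherwise later runs measure RAM, not the disk. A minimal sketch of doing it without root via `posix_fadvise` (the whole cache can also be dropped with `sync; echo 3 > /proc/sys/vm/drop_caches` as root; the helper name here is my own):

```python
import os

def drop_file_cache(path):
    """Ask the kernel to evict this file's pages from the OS page cache (Linux)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # dirty pages must be written back first (e.g. via os.sync()),
        # otherwise POSIX_FADV_DONTNEED leaves them cached
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # length 0 = whole file
    finally:
        os.close(fd)

# e.g. before each timed run:
# for i in range(1000):
#     drop_file_cache(f'./test_txt/{i}.txt')
```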
3. Set num_workers to 4
Now every 4 batches, the loader has to prepare 4 batches' worth of data, and apparently does so serially, so that batch takes about 4x as long; the following 3 batches then take almost no time:
total, mean, std, max, min
7.667, 0.958, 1.652, 3.983, 0.000
The next experiments run on an SSD, again with num_workers first 0 and then 4. With num_workers = 0:
total, mean, std, max, min
3.251, 0.406, 0.026, 0.423, 0.338
On the SSD, the timings are noticeably more stable than on the mechanical drive.
Then with num_workers = 4:
total, mean, std, max, min
1.934, 0.242, 0.421, 1.088, 0.000
The same pattern appears, but the spike should be 0.4 × 4 = 1.6 s; here batch 4 (0-indexed) instead comes in at about half that, 0.8 s.
Tentative conclusion: whether on the SSD or the mechanical drive, the total time stays basically the same (slightly lower on the SSD, though that may just be an artifact of limited runs) and does not shrink as num_workers grows, which puzzles me. My understanding has always been: with num_workers = 4, each worker loads one batch, and since this machine has more than 4 cores, the 4 workers should run in parallel, so the time ought to be about 1/4 of the num_workers = 0 case. Why isn't it? (This experiment was originally meant for loading audio data; running the same experiment on audio shows exactly the same behavior.)
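To take DataLoader out of the equation, the "N workers should give ~1/N of the time" intuition can be checked with a bare process pool doing the same read-and-discard work (a sketch; the helper names are my own):

```python
from concurrent.futures import ProcessPoolExecutor
from time import time

def read_chunk(paths):
    # same work as the Dataset above: read each file and discard the content
    for p in paths:
        with open(p, 'r') as f:
            f.read()

def timed_parallel_read(paths, n_workers):
    """Read `paths` split across n_workers processes; return wall-clock seconds."""
    chunks = [paths[i::n_workers] for i in range(n_workers)]
    ts = time()
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        # map() blocks until every chunk is done
        list(ex.map(read_chunk, chunks))
    return time() - ts
```

Comparing `timed_parallel_read(paths, 1)` against `timed_parallel_read(paths, 4)` (with the cache cleared before each run) separates DataLoader overhead from raw IO: if the 4-worker run is not close to 4x faster here either, the bottleneck is the disk rather than the loader.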
As a supplementary experiment, I tried reading with ray; the code is below.
import os
import ray
import numpy as np
import matplotlib.pyplot as plt
from time import time

def test_ray():
    ray.init()

    @ray.remote
    def read(paths):
        for path in paths:
            open(path, 'r').read()
        return 1

    def ray_read(paths, n_cpu=4):
        # split paths into n_cpu chunks; each chunk is read by one async ray task
        chunk_size = len(paths) // n_cpu
        object_ids = []
        for i in range(n_cpu):
            x = read.remote(paths[i * chunk_size: (i + 1) * chunk_size])
            object_ids.append(x)
        return ray.get(object_ids)

    def batch(l, bs):
        out = []
        i = 0
        while i < len(l):
            out.append(l[i: i + bs])
            i += bs
        return out

    paths = [os.path.expanduser(f'~/test_txt/{i}.txt') for i in range(1000)]
    paths = batch(paths, 128)
    time_cost = []
    ts = time()
    for i, ele in enumerate(paths, 1):
        # read(paths[i - 1])
        ray_read(paths[i - 1], 8)
        dur = time() - ts
        time_cost.append(dur)
        print(i, dur)
        ts = time()
    print(f"{sum(time_cost):.3f}, "
          f"{np.mean(time_cost):.3f}, "
          f"{np.std(time_cost):.3f}, "
          f"{max(time_cost):.3f}, "
          f"{min(time_cost):.3f}")
    plt.plot(time_cost)
    plt.grid()
    plt.show()
The flow: split the input paths into n_cpu chunks, and execute the chunks asynchronously through ray. The result: on the same SSD, each batch should in theory take 1/4 of the earlier time, i.e. around 0.1 s, but the measured time is about 0.2 s. In other words, the effective n_cpu tops out at 2.