【PyTorch】A hidden pitfall that's easy to hit when writing a custom DataLoader

A note on a subtle mistake I made while writing a custom DataLoader; in the end it took reading the source code to find the root cause.

Scenario: build a custom DataLoader from a dataframe.

The raw data:

import pandas as pd
import random

df = pd.DataFrame({
    'feature': [[random.randint(0, 5) for _ in range(5)] for _ in range(10)],
    'label': [random.randint(0, 5) for _ in range(10)]
})

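Each row holds a 5-element feature list and an integer label. The values are random and differ from run to run, but the dataframe looks roughly like this:

print(df.head(2))
"""
           feature  label
0  [3, 1, 0, 2, 4]      5
1  [0, 5, 3, 1, 2]      1
"""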

Without thinking much about it, I wrote the following:

from torch.utils.data import Dataset, DataLoader

# record each row of the dataframe as a (feature, label) tuple
def load_data(data):
    contents = []
    for i in range(len(data)):
        feature = data.iloc[i]['feature']
        label = data.iloc[i]['label']
        contents.append((feature, label))
    return contents
    
class Dataset1(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __getitem__(self, item):
        return self.data[item]
    
    def __len__(self):
        return len(self.data) 

result = load_data(df)
dataset1 = Dataset1(result)
dataloader1 = DataLoader(dataset1, batch_size=2)

Everything looks perfectly ordinary so far. Now, time to show off some real skills:

for x, y in dataloader1:
    print(x)
    print(y)
    
"""[tensor([0, 2]), tensor([3, 5]), tensor([4, 3]), tensor([2, 5]), tensor([3, 1])]
tensor([5, 1])
[tensor([0, 3]), tensor([4, 0]), tensor([1, 3]), tensor([4, 2]), tensor([0, 1])]
tensor([2, 0])
[tensor([3, 4]), tensor([0, 1]), tensor([0, 2]), tensor([0, 2]), tensor([4, 1])]
tensor([5, 1])
[tensor([1, 1]), tensor([0, 5]), tensor([0, 0]), tensor([2, 1]), tensor([5, 5])]
tensor([1, 5])
[tensor([3, 3]), tensor([5, 3]), tensor([2, 0]), tensor([0, 4]), tensor([4, 4])]
tensor([2, 2])"""

??? Not at all the expected result. Look closely: each batch's features come back as five tensors of length 2, one per feature position, instead of a single (2, 5) tensor; the batch has been transposed. Let's struggle a little more:

class Dataset2(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __getitem__(self, item):
        return {'feature': self.data[item][0], 'label':self.data[item][1]}, 
    
    def __len__(self):
        return len(self.data) 

dataset2 = Dataset2(result)
dataloader2 = DataLoader(dataset2, batch_size=2)

for x in dataloader2:
    print(x['feature'])
    print(x['label'])

This time it failed outright:

TypeError: list indices must be integers or slices, not str

Let's look at what the iteration actually yields:

for x in dataloader2:
    print(x)
'''
[{'feature': [tensor([0, 2]), tensor([3, 5]), tensor([4, 3]), tensor([2, 5]), tensor([3, 1])], 'label': tensor([5, 1])}]
[{'feature': [tensor([0, 3]), tensor([4, 0]), tensor([1, 3]), tensor([4, 2]), tensor([0, 1])], 'label': tensor([2, 0])}]
[{'feature': [tensor([3, 4]), tensor([0, 1]), tensor([0, 2]), tensor([0, 2]), tensor([4, 1])], 'label': tensor([5, 1])}]
[{'feature': [tensor([1, 1]), tensor([0, 5]), tensor([0, 0]), tensor([2, 1]), tensor([5, 5])], 'label': tensor([1, 5])}]
[{'feature': [tensor([3, 3]), tensor([5, 3]), tensor([2, 0]), tensor([0, 4]), tensor([4, 4])], 'label': tensor([2, 2])}]
'''
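Notice that each batch above comes back wrapped in an extra list, which is exactly why x['feature'] blew up: x is a list, not a dict. One easy-to-miss culprit is the trailing comma in Dataset2.__getitem__, which turns every sample into a one-element tuple wrapping the dict. A quick check:

sample = dataset2[0]
print(type(sample))     # <class 'tuple'> -- the trailing comma created a 1-tuple
print(type(sample[0]))  # <class 'dict'>

Even without that comma, though, the feature lists would still come back transposed, so something deeper is going on.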

Still nothing like the intended result, and even with the 1-tuple accounted for, the transposed feature lists remain unexplained. Where exactly does it go wrong? The only option left is to read the source. When a DataLoader builds each batch, it calls collate_fn to merge and convert the samples; the default, default_collate, looks like this (this is the implementation from an older PyTorch release; newer versions are reorganized but behave the same way for these cases):

def default_collate(batch):
    r"""Puts each data field into a tensor with outer dimension batch size"""

    error_msg = "batch must contain tensors, numbers, dicts or lists; found {}"
    elem_type = type(batch[0])
    if isinstance(batch[0], torch.Tensor):
        out = None
        if _use_shared_memory:
            # If we're in a background process, concatenate directly into a
            # shared memory tensor to avoid an extra copy
            numel = sum([x.numel() for x in batch])
            storage = batch[0].storage()._new_shared(numel)
            out = batch[0].new(storage)
        return torch.stack(batch, 0, out=out)
    elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
            and elem_type.__name__ != 'string_':
        elem = batch[0]
        if elem_type.__name__ == 'ndarray':
            # array of string classes and object
            if re.search('[SaUO]', elem.dtype.str) is not None:
                raise TypeError(error_msg.format(elem.dtype))

            return torch.stack([torch.from_numpy(b) for b in batch], 0)
        if elem.shape == ():  # scalars
            py_type = float if elem.dtype.name.startswith('float') else int
            return numpy_type_map[elem.dtype.name](list(map(py_type, batch)))
    elif isinstance(batch[0], int_classes):
        return torch.LongTensor(batch)
    elif isinstance(batch[0], float):
        return torch.DoubleTensor(batch)
    elif isinstance(batch[0], string_classes):
        return batch
    elif isinstance(batch[0], container_abcs.Mapping):
        return {key: default_collate([d[key] for d in batch]) for key in batch[0]}
    elif isinstance(batch[0], container_abcs.Sequence):
        transposed = zip(*batch)
        return [default_collate(samples) for samples in transposed]

    raise TypeError((error_msg.format(type(batch[0]))))

Clearly, default_collate dispatches on the type of the elements in the batch. For scheme 1, each sample is a tuple, a container_abcs.Sequence, so the batch is transposed with zip and each transposed slot is collated recursively; the feature lists are themselves Sequences, so they get zipped again, which transposes the features across the batch. For scheme 2 the same machinery applies twice over: the 1-tuple samples get zipped into a list wrapping the dict, and inside the dict the value under 'feature' is again a Sequence that gets the same transposing treatment. That accounts for everything observed above. A quick experiment:

for i in zip(*([0, 3, 4, 2, 3], [2, 5, 3, 5, 1])):
    print(i)

"""
(0, 2)
(3, 5)
(4, 3)
(2, 5)
(3, 1)
"""

This matches exactly what scheme 1 produced for the first batch's features.
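We can also run the collation step by hand. default_collate is importable directly (in recent PyTorch releases as torch.utils.data.default_collate; older releases expose it from torch.utils.data.dataloader, so the import below may need adjusting for your version):

import numpy as np
from torch.utils.data import default_collate  # import path varies across PyTorch versions

# scheme 1: samples are (list, int) tuples -> the feature lists get zipped/transposed
batch = [([0, 3, 4, 2, 3], 5), ([2, 5, 3, 5, 1], 1)]
print(default_collate(batch))
# [[tensor([0, 2]), tensor([3, 5]), tensor([4, 3]), tensor([2, 5]), tensor([3, 1])], tensor([5, 1])]

# the same samples with np.array features -> stacked into a proper (2, 5) batch tensor
batch = [(np.array([0, 3, 4, 2, 3]), 5), (np.array([2, 5, 3, 5, 1]), 1)]
print(default_collate(batch))
# [tensor([[0, 3, 4, 2, 3],
#          [2, 5, 3, 5, 1]]), tensor([5, 1])]

The second call already hints at the fix below.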

So wrapping each sample's features in a container_abcs.Sequence (a plain list or tuple) is risky: default_collate treats the positions of the sequence as separate, possibly heterogeneous fields and collates them one by one, transposing your features. How do we avoid this?

A reliable fix is to convert the Sequence-typed features to np.array (or a tensor), which default_collate stacks along a new batch dimension instead of zipping.

import numpy as np
class Dataset3(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __getitem__(self, item):
        return np.array(self.data[item][0]), self.data[item][1]
    
    def __len__(self):
        return len(self.data) 

dataset3 = Dataset3(result)
dataloader3 = DataLoader(dataset3, batch_size=2)
for x,y in dataloader3:
    print(x)
    print(y)

"""
tensor([[0, 3, 4, 2, 3],
        [2, 5, 3, 5, 1]])
tensor([5, 1])
tensor([[0, 4, 1, 4, 0],
        [3, 0, 3, 2, 1]])
tensor([2, 0])
tensor([[3, 0, 0, 0, 4],
        [4, 1, 2, 2, 1]])
tensor([5, 1])
tensor([[1, 0, 0, 2, 5],
        [1, 5, 0, 1, 5]])
tensor([1, 5])
tensor([[3, 5, 2, 0, 4],
        [3, 3, 0, 4, 4]])
tensor([2, 2])
"""

Everything is finally back to normal!
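As a side note, another way to sidestep the zip entirely, sketched here as an alternative rather than the fix above, is to keep Dataset1 as-is and hand DataLoader a custom collate_fn (my_collate below is an illustrative name):

import torch

# assemble the (batch, 5) feature tensor and (batch,) label tensor directly,
# bypassing default_collate's recursive Sequence handling
def my_collate(batch):
    features = torch.tensor([feature for feature, _ in batch])
    labels = torch.tensor([label for _, label in batch])
    return features, labels

dataloader4 = DataLoader(dataset1, batch_size=2, collate_fn=my_collate)
for x, y in dataloader4:
    print(x.shape)  # torch.Size([2, 5]), same layout as the np.array solution
    break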

The takeaway: when something behaves strangely, there's always a reason. Even a simple-looking API can hide plenty of tricks. When a problem appears, pin down exactly where it goes wrong and read the source to understand the mechanism; trying to write a similar API yourself is a good way to strengthen your grasp of the underlying logic.
