Worth bookmarking: a roundup of problems encountered when using PyTorch

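Error: TypeError: unhashable type: 'numpy.ndarray'
Cause: this happens when a PyTorch LongTensor is converted to numpy and the result is used as a dict key. The value prints like an int, but it is still a 0-dimensional ndarray, which is not hashable.
Fix: append .item() to the original expression.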
```
classId = support_y[i].long().cpu().detach().numpy().item()
```
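A minimal sketch of why the key fails and what `.item()` changes. The label tensor `support_y` below is made up for illustration; only its name comes from the snippet above.

```python
import torch

# Hypothetical labels standing in for the original support_y.
support_y = torch.tensor([2, 0, 2])

as_array = support_y[0].long().cpu().detach().numpy()  # 0-dim ndarray, prints as 2
try:
    {as_array: "x"}
    hashable = True
except TypeError:
    hashable = False  # an ndarray cannot be used as a dict key

class_id = as_array.item()  # plain Python int

counts = {}
for y in support_y:
    key = y.item()  # Tensor.item() also works directly on a one-element tensor
    counts[key] = counts.get(key, 0) + 1
```

Calling `.item()` on the tensor itself gives the same Python int, so the numpy round trip is optional.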
Error during data loading: TypeError: 'int' object is not callable
Cause: the data is not a Tensor but an np.array or some other type.
Fix:
```
import torch

# Convert the array-like data to tensors before handing it to the loader.
# (On PyTorch >= 0.4 the autograd.Variable wrapper is no longer needed.)
data_x = torch.LongTensor(data_x)
data_y = torch.LongTensor(data_y)
```
Error during data loading: RuntimeError: DataLoader worker (pid(s) 18620, 45872) exited unexpectedly
Fix: set num_workers=0 in the loader.
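As a rough illustration, a loader with `num_workers=0` reads every batch in the main process, so there are no worker processes that can die. The toy dataset below is an assumption made only for this sketch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset; the shapes are arbitrary.
dataset = TensorDataset(torch.arange(10).view(10, 1), torch.arange(10))

# num_workers=0 loads batches in the main process instead of worker subprocesses.
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

batch_sizes = [x.size(0) for x, _ in loader]
```

This trades loading speed for robustness. On Windows, the same error often goes away if the training code is placed under `if __name__ == "__main__":` while keeping workers enabled.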
Error: RuntimeError: input.size(-1) must be equal to input_size. Expected 10, got 2000
Cause: the dimensions passed to view are wrong. For LSTM(input, (h0, c0)), with batch_first=True the input is (batch_size, seq_len, input_size); otherwise it is (seq_len, batch, input_size).
Fix:
```
lstm_out, self.hidden = self.lstm(
        embeds.view(self.batch_size, 200, EMBEDDING_DIM), self.hidden) 
```
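The two input layouts can be checked with a small sketch. All sizes here are illustrative (seq_len 200 and input_size 10 simply echo the numbers in the error message and the snippet above).

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 4, 200, 10, 16  # illustrative sizes

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
flat = torch.randn(batch_size * seq_len * input_size)

# The last dimension passed to view must equal input_size (10 here).
x = flat.view(batch_size, seq_len, input_size)
out, (h_n, c_n) = lstm(x)

# Without batch_first=True the layout is (seq_len, batch, input_size).
lstm_seq_first = nn.LSTM(input_size, hidden_size)
out_sf, _ = lstm_seq_first(x.transpose(0, 1))
```

Note that `h_n` and `c_n` keep the shape (num_layers * num_directions, batch, hidden_size) regardless of `batch_first`.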
Error: AttributeError: module 'torch.utils.data' has no attribute 'random_split'.
Cause: in PyTorch 1.1.0 random_split lives in torch.utils.data, while in 0.4.0 it lives in torch.utils.data.dataset.
Fix: from torch.utils.data.dataset import random_split.
Error: ValueError: Sum of input lengths does not equal the length of the input dataset!
Cause: a dataset problem: the lengths passed to random_split do not add up to len(dataset).
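One way to avoid the mismatch is to derive the last split length from the others. The dataset size and the 80/20 ratio below are assumptions for the sake of the sketch.

```python
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.arange(103))  # 103 samples, deliberately not a round number

# Derive the last length from the first so the sum always equals len(dataset).
n_train = int(0.8 * len(dataset))
n_val = len(dataset) - n_train
train_set, val_set = random_split(dataset, [n_train, n_val])
```

Computing both lengths with `int(ratio * len(dataset))` is the usual way to end up one or two samples short.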
Error: TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
Fix: change the accuracy calculation to:
```
acc1 = (pred_cls1 == val_y1).cpu().sum().numpy() / pred_cls1.size()[0]
```
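An alternative sketch that avoids numpy altogether and runs on both CPU and GPU. The prediction and label values are made up; only the names `pred_cls1` and `val_y1` come from the line above.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Made-up predictions and labels standing in for pred_cls1 and val_y1.
pred_cls1 = torch.tensor([1, 0, 2, 2], device=device)
val_y1 = torch.tensor([1, 1, 2, 0], device=device)

# .item() copies the single scalar to the host, so no .cpu().numpy() is needed.
acc1 = (pred_cls1 == val_y1).float().mean().item()
```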
Error: RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:1 and hidden tensor at cuda:0
Cause: the following code was used
```
if torch.cuda.device_count() > 1:
	print("Let's use", torch.cuda.device_count(), "GPUs!")
	model = nn.DataParallel(model)
model.to(device)
```
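while the tensors were not given a specific GPU ID.
Fix: there are two ways.
1) First define device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') (this already places device on card 0, 'cuda:0'), then wrap the model with model = torch.nn.DataParallel(model, device_ids=[0, 1, 2]) (assuming three cards). After that the tensors must also be moved to the GPU. Note that all tensors must be on the same GPU, i.e. tensor1 = tensor1.to(device), tensor2 = tensor2.to(device), and so on. Note: never call tensor1.to(device) without assigning the result; Tensor.to is not in-place and only returns a copy.
2) Use tensor.cuda() directly. First model = torch.nn.DataParallel(model, device_ids=[0, 1, 2]) (assuming three cards with IDs 0, 1, 2), then tensor1 = tensor1.cuda(0), tensor2 = tensor2.cuda(0), and so on. (Here I put all tensors on the card with ID 0; putting all of them on the card with ID 1 works too.)
Reference: Pitfalls encountered while learning PyTorch (continuously updated)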
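The first approach can be sketched as follows. The `nn.Linear` model and tensor shapes are placeholders, and the code falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 4)  # placeholder model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)  # nn.Module.to moves the parameters in place

tensor1 = torch.randn(6, 8)
tensor2 = torch.randn(6, 4)

# Tensor.to is not in-place: keep the returned tensor.
tensor1 = tensor1.to(device)
tensor2 = tensor2.to(device)

out = model(tensor1) + tensor2  # both operands now live on the same device
```

The asymmetry is the usual trap: `model.to(device)` works without assignment, while `tensor.to(device)` does not.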
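Error: 'DataParallel' object has no attribute 'init_hidden'.
Cause: nn.DataParallel(m) no longer returns the original m but a DataParallel wrapper; the original m is stored in the wrapper's module attribute.
Fix: if the model needs to be modified after DataParallel and to(device), use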
```
if isinstance(model, nn.DataParallel):
    model = model.module
```
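A small sketch of the same idea with a hypothetical `Tagger` model. The class and its sizes are invented here; only the `init_hidden` name comes from the error message.

```python
import torch
import torch.nn as nn

class Tagger(nn.Module):  # hypothetical model with an init_hidden helper
    def __init__(self, hidden_dim=4):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(3, hidden_dim)

    def init_hidden(self, batch_size=1):
        return (torch.zeros(1, batch_size, self.hidden_dim),
                torch.zeros(1, batch_size, self.hidden_dim))

wrapped = nn.DataParallel(Tagger())

# The wrapper does not expose custom methods of the wrapped model.
has_method = hasattr(wrapped, "init_hidden")

# Reach through .module when the model is wrapped.
core = wrapped.module if isinstance(wrapped, nn.DataParallel) else wrapped
h0, c0 = core.init_hidden(batch_size=2)
```

Keeping a separate name such as `core` leaves the wrapper available for the forward pass.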
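Error: Assertion `cur_target >= 0 && cur_target < n_classes' failed.
Cause: this comes up often in classification training. It generally appears when the number of classes output by the network differs from the number of classes in the labels. PyTorch also requires the labels to start from 0 when CrossEntropyLoss is used.
Fix:
```
import pandas as pd

tags_ids = range(len(tags_set))  # start from 0
tag2id = pd.Series(tags_ids, index=tags_set)
```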
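The remapping can also be sketched end to end with a plain dict in place of the pandas Series. The raw labels below are made up, and the random logits stand in for a real network output.

```python
import torch
import torch.nn as nn

raw_labels = [3, 7, 7, 5, 3]  # made-up labels that do not start at 0
tags_set = sorted(set(raw_labels))
tag2id = {tag: i for i, tag in enumerate(tags_set)}  # 3 -> 0, 5 -> 1, 7 -> 2

targets = torch.tensor([tag2id[t] for t in raw_labels])
n_classes = len(tags_set)

logits = torch.randn(len(raw_labels), n_classes)  # stand-in for the network output

# Every target must satisfy 0 <= target < n_classes before the loss is called.
assert targets.min() >= 0 and targets.max() < n_classes
loss = nn.CrossEntropyLoss()(logits, targets)
```

The network's output width must equal `n_classes` as well, otherwise the same assertion fires for the largest label.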
Error: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
Cause: a model trained on a GPU is being loaded on a CPU.
Fix: model = torch.load(model_path, map_location='cpu')
Likewise, if the model was trained on 4 GPUs and is now loaded on one, use model = torch.load(model_path, map_location='cuda:0').
Going from 4 GPUs to two, pass a dict that remaps saved devices to available ones, e.g. map_location={'cuda:1': 'cuda:0'}.
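A minimal sketch of loading with `map_location`. It saves a `state_dict` to an in-memory buffer instead of loading a whole model object from `model_path`. That is a swap made for this example, not the original workflow, and the `nn.Linear` model is a placeholder.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # placeholder model

# Save to an in-memory buffer here; a real checkpoint would be a file path.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# map_location="cpu" remaps every stored tensor to the CPU while loading.
state = torch.load(buffer, map_location="cpu")

restored = nn.Linear(4, 2)
restored.load_state_dict(state)
```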
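Error: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([3403, 128]) from checkpoint, the shape in current model is torch.Size([12386, 128]).
Cause: the word_embedding layer parameters in the loaded checkpoint do not match those of the current model.
Fix: make sure word2id and tag2id have the same lengths as the ones the checkpoint was trained with.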
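The mismatch can be detected before loading by comparing the checkpoint's embedding rows with the vocabulary size. The sketch below uses small invented sizes and a bare `nn.Embedding` as a stand-in for the real model.

```python
import torch.nn as nn

saved = nn.Embedding(30, 8).state_dict()   # stands in for the checkpoint (vocabulary of 30)
word2id = {f"w{i}": i for i in range(50)}  # hypothetical vocabulary rebuilt with 50 words

model = nn.Embedding(len(word2id), 8)
ckpt_rows = saved["weight"].shape[0]

# Compare sizes up front instead of waiting for load_state_dict to fail.
sizes_match = ckpt_rows == len(word2id)

try:
    model.load_state_dict(saved)
    load_failed = False
except RuntimeError:
    load_failed = True  # the size mismatch reported above
```

Saving `word2id` and `tag2id` next to the checkpoint is one way to keep them in sync.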
Error: RuntimeError: Expected hidden[0] size (2, 359, 256), got (2, 512, 256).
Cause: the data is loaded with a DataLoader and the number of training examples is not divisible by batch_size, so the last batch is smaller than batch_size, while the hidden state is initialized with a fixed batch_size: autograd.Variable(torch.zeros(self.num_layers * 2, self.batch_size, self.hidden_dim // 2))
Fix: if the model cannot handle a batch size that changes on the fly, set drop_last=True on the DataLoader from torch.utils.data so that only full batches are processed during training, i.e. testset_loader = DataLoader(test_db, batch_size=args.batch_size, shuffle=False, num_workers=1, pin_memory=True, drop_last=True)
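An alternative to dropping the last batch is to size the hidden state from the batch that actually arrives. The `BiLSTM` class and its dimensions below are hypothetical; only the hidden-state shape formula follows the line quoted above.

```python
import torch
import torch.nn as nn

class BiLSTM(nn.Module):  # hypothetical model, sizes are illustrative
    def __init__(self, input_dim=6, hidden_dim=8, num_layers=1):
        super().__init__()
        self.num_layers, self.hidden_dim = num_layers, hidden_dim
        self.lstm = nn.LSTM(input_dim, hidden_dim // 2, num_layers,
                            batch_first=True, bidirectional=True)

    def init_hidden(self, batch_size, device):
        shape = (self.num_layers * 2, batch_size, self.hidden_dim // 2)
        return (torch.zeros(shape, device=device),
                torch.zeros(shape, device=device))

    def forward(self, x):
        # Size the hidden state from the current batch, not a stored batch_size.
        hidden = self.init_hidden(x.size(0), x.device)
        out, _ = self.lstm(x, hidden)
        return out

model = BiLSTM()
full = model(torch.randn(5, 3, 6))  # a full batch
last = model(torch.randn(2, 3, 6))  # a smaller final batch
```

With this change no samples are discarded, which matters more for evaluation loaders than for training.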
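Loss becomes NaN (loss=nan) during PyTorch training.
Cause:
1. The learning rate is too high. With a large learning rate the parameters may overshoot and never settle at a minimum; lowering the learning rate lets them move toward the minimum.
2. The loss function has a problem.
3. For regression problems a division by zero may have occurred; adding a small epsilon term may fix it.
4. The data itself may contain NaN; check the input and target with numpy.any(numpy.isnan(x)).
5. The target must be something the loss function can handle. For example, with a sigmoid activation the targets should fall within the sigmoid's output range, 0 to 1. Check the dataset for this as well.
Fix: lower the learning rate or increase batch_size.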
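The checks in the list above can be sketched with torch equivalents of the numpy call. The data, the `eps` value and the learning rate are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

def has_nan(t):
    """Return True if any element of the tensor is NaN."""
    return bool(torch.isnan(t).any())

inputs = torch.tensor([[1.0, 2.0], [float("nan"), 4.0]])
targets = torch.tensor([[1.0], [0.0]])

# Drop rows that contain a NaN before training on them.
bad_rows = torch.isnan(inputs).any(dim=1)
clean_inputs, clean_targets = inputs[~bad_rows], targets[~bad_rows]

eps = 1e-8  # small term guarding a division by zero; the value is an assumption
ratio = torch.tensor(1.0) / (torch.tensor(0.0) + eps)

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # a reduced learning rate
loss = nn.functional.mse_loss(model(clean_inputs), clean_targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Checking `has_nan(loss)` inside the training loop and stopping at the first hit makes it easier to find the offending batch.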
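Error: RuntimeError: Trying to backward through the graph second time, but the buffers have already been freed. Please specify retain_variables=True when calling backward for the first time.
Cause: the network contains several sub-networks and two or more losses that each update the parameters, e.g. loss1.backward() followed by loss2.backward(). The two losses may share part of the graph. After the first call, loss1.backward(), finishes, PyTorch automatically frees the saved computation graph, so the second call, loss2.backward(), finds the graph gone.
Fix:
1 Call loss.backward(retain_graph=True) to keep the computation graph, though this can easily run out of memory (CUDA out of memory). By default every call to .backward() frees all buffers, which is why the message suggests retaining the graph. Once the graph is retained the buffers are no longer freed, hence the OOM. Reference: https://blog.csdn.net/Mundane_World/article/details/81038274
2 When the generator's gradients are not needed, pass the generated data through .detach() before feeding it to the discriminator. This cuts the current graph and returns a new Variable detached from it that never requires gradients. Reference: https://blog.csdn.net/u011276025/article/details/76997425
3 In my code, retain_graph=True ran out of memory and I could not find a place that needed .detach(). It turned out that the model's hidden state was not re-initialized on each training step. After model.zero_grad(), call model.hidden = model.init_hidden() to clear the LSTM hidden state and detach it from the history of the previous instance, so earlier runs do not interfere. Without re-initializing, the error is raised.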
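The third fix can be sketched as a training loop that builds a fresh hidden state on every step. The model, sizes and random data are placeholders, and `init_hidden` is written here as a free function rather than the model method used in the original code.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=4)
head = nn.Linear(4, 1)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

def init_hidden(batch_size=1):
    return (torch.zeros(1, batch_size, 4), torch.zeros(1, batch_size, 4))

losses = []
for step in range(3):
    optimizer.zero_grad()
    hidden = init_hidden()    # fresh state, no link to the previous step's graph
    x = torch.randn(5, 1, 3)  # (seq_len, batch, input_size), random stand-in data
    out, hidden = lstm(x, hidden)
    loss = head(out[-1]).pow(2).mean()
    loss.backward()           # succeeds on every step without retain_graph=True
    optimizer.step()
    losses.append(loss.item())
```

If the state has to carry over between steps, `hidden = tuple(h.detach() for h in hidden)` keeps the values while cutting the link to the old graph.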
