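In TensorFlow, TensorBoard lets you watch how the loss changes during training to judge whether the model has converged, and compare results on the train dataset and the dev dataset to spot overfitting.
PyTorch now ships its own TensorBoard API (torch.utils.tensorboard), so tensorboardX is no longer needed to call it. You still have to install the tensorboard package to launch the TensorBoard web page, but you do not need to install TensorFlow for the logged metrics to refresh.
Here is an example with a CNN: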
# -*- encoding: utf-8 -*-
import warnings
# pytorch=1.4.0
# tensorboard=1.14.0
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from torch.utils.tensorboard import SummaryWriter
from torchvision.transforms import transforms
# transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5))])
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
trainset = torchvision.datasets.FashionMNIST('./data', download=True, train=True, transform=transform)
testset = torchvision.datasets.FashionMNIST('./data', download=True, train=False, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)
dataiter = iter(trainloader)
# use the built-in next(); dataiter.next() fails on recent PyTorch releases
images, labels = next(dataiter)
# create grid of images
img_grid = torchvision.utils.make_grid(images)
def matplotlib_imshow(img, one_channel=False):
    if one_channel:
        img = img.mean(dim=0)
    img = img / 2 + 0.5  # unnormalize
    npimg = img.numpy()
    if one_channel:
        plt.imshow(npimg, cmap="Greys")
    else:
        plt.imshow(np.transpose(npimg, (1, 2, 0)))
# show images
matplotlib_imshow(img_grid, one_channel=True)
writer = SummaryWriter('runs/experiment', flush_secs=1)
# write to tensorboard
writer.add_image('four_fashion_mnist_images', img_grid)
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
net = Net()
lossfn = nn.CrossEntropyLoss()
optimizer = Adam(net.parameters(), lr=1e-3)
print(net)
# train and log the running loss to TensorBoard
num_epochs = 1
loss_value = 0.0
for epoch in range(num_epochs):
    for i, data in enumerate(trainloader, 0):
        images, labels = data
        optimizer.zero_grad()
        # forward and backward
        outputs = net(images)
        # output shape [batch_size, num_classes], labels shape [batch_size]
        loss = lossfn(outputs, labels)
        loss.backward()
        optimizer.step()
        loss_value += loss.item()
        if i % 1000 == 999:
            # average loss over the last 1000 batches
            global_step = epoch * len(trainloader) + i
            writer.add_scalar('training_loss', loss_value / 1000, global_step)
            print(f'epoch:{epoch}, global step:{global_step},loss:{loss_value / 1000}')
            loss_value = 0.0
print('Finished Training')
writer.close()  # flush any pending events to disk
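The code does two main things: it writes one batch of images to TensorBoard, and it logs the loss during training. It trains for 1 epoch. The global step is computed as epoch * len(trainloader) + i, so it keeps increasing across epochs, and the average loss is computed and logged every 1000 batches.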
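The step bookkeeping can be isolated into a small sketch. The helper names global_step and windowed_means are hypothetical, the loss values are made up rather than real training output, and the window is 3 instead of 1000 to keep the example short.

```python
def global_step(epoch, batch_idx, batches_per_epoch):
    """Step index that keeps increasing across epochs."""
    return epoch * batches_per_epoch + batch_idx


def windowed_means(losses, window):
    """Yield (batch_idx, mean loss) once every `window` batches."""
    running = 0.0
    for i, loss in enumerate(losses):
        running += loss
        if i % window == window - 1:
            yield i, running / window
            running = 0.0  # start a fresh window


# Made-up loss values for one epoch of 6 batches.
fake_losses = [3.0, 2.0, 1.0, 0.9, 0.6, 0.3]
logged = [
    (global_step(epoch=0, batch_idx=i, batches_per_epoch=len(fake_losses)), mean)
    for i, mean in windowed_means(fake_losses, window=3)
]
print(logged)
```

With the 60,000 FashionMNIST training images and batch_size=4, len(trainloader) is 15000, so the first batch of the second epoch would get step 15000 rather than restarting at 0.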
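One thing to watch: SummaryWriter takes a flush_secs parameter that controls how often pending events are flushed to disk. It is measured in seconds and defaults to 120 s, i.e. 2 minutes. If training is short and you leave the default, TensorBoard will not update as the training steps advance. And if you never call close() at the end either, the last events may never reach the event file. This cost me half a day.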
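A minimal sketch of the two ways to make sure events reach disk: a small flush_secs, and closing the writer. It assumes torch and tensorboard are installed. The log directory is a temporary folder, and the tag demo_loss and the logged values are made up for illustration.

```python
import os
import tempfile

from torch.utils.tensorboard import SummaryWriter

# Temporary log directory, used only for this illustration.
log_dir = tempfile.mkdtemp()

# flush_secs=1 asks the writer to flush pending events about once per second.
# Leaving the `with` block closes the writer, which flushes whatever is left.
with SummaryWriter(log_dir, flush_secs=1) as writer:
    for step in range(5):
        writer.add_scalar('demo_loss', 1.0 / (step + 1), step)

event_files = [f for f in os.listdir(log_dir) if 'tfevents' in f]
print(event_files)
```

Outside a with block, calling writer.flush() or writer.close() explicitly has the same effect.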
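The network here is fairly simple: two convolutional layers, each followed by max pooling, and three fully connected layers. It solves a multi-class classification problem, and its output has shape [batch_size, num_classes].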
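The 16 * 4 * 4 input size of the first fully connected layer follows from the layer sizes. The sketch below traces the spatial size of a 28 x 28 FashionMNIST image through the network, assuming stride 1 and no padding for the convolutions, as in the code. The helper names conv_out and pool_out are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution over a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Output width/height of max pooling without padding."""
    return (size - kernel) // stride + 1


size = 28                      # FashionMNIST images are 28 x 28
size = conv_out(size, 5)       # conv1, 5 x 5 kernel
size = pool_out(size, 2, 2)    # 2 x 2 max pooling
size = conv_out(size, 5)       # conv2, 5 x 5 kernel
size = pool_out(size, 2, 2)    # 2 x 2 max pooling
flat_features = 16 * size * size  # conv2 has 16 output channels
print(size, flat_features)
```

If the input resolution or kernel sizes change, the same arithmetic gives the new value to use in fc1 and in x.view.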
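The loss function is cross entropy, nn.CrossEntropyLoss. It takes two inputs: the predictions and the ground-truth labels. The predictions have shape [batch_size, num_classes] and are expected to be raw, unnormalized scores (logits), since nn.CrossEntropyLoss applies log-softmax internally. The labels have shape [batch_size], and each label is an integer class index in [0, num_classes).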
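For example, with three classes and batch_size 2, the predictions could be
[[0.1, 0.1, 0.8],
[0.2, 0.6, 0.2]]
and the ground-truth labels would be
[2, 1]
Pay attention to the shapes of the two cross-entropy inputs.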
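The numbers above can be run directly. This sketch, which assumes torch is installed, feeds them to nn.CrossEntropyLoss and compares the result with a manual computation: log-softmax over the class dimension, then the negative log-probability of the true class, averaged over the batch. The loss treats the predictions as raw scores rather than probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

predict = torch.tensor([[0.1, 0.1, 0.8],
                        [0.2, 0.6, 0.2]])   # shape [batch_size, num_classes]
target = torch.tensor([2, 1])               # shape [batch_size], class indices

loss = nn.CrossEntropyLoss()(predict, target)

# Manual version: log-softmax, pick the true class, negate, average.
log_probs = F.log_softmax(predict, dim=1)
manual = -log_probs[torch.arange(len(target)), target].mean()

print(loss.item(), manual.item())
```

Passing one-hot labels of shape [batch_size, num_classes] with an integer dtype, or labels outside [0, num_classes), is a common source of errors here.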
Another thing to note: the optimizer is Adam. Its first argument is params, the model parameters to be trained, which you can get directly from net.parameters().
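A minimal sketch of what parameters() hands to the optimizer, assuming torch is installed. The tiny nn.Linear model stands in for the CNN purely for illustration.

```python
import torch.nn as nn
from torch.optim import Adam

model = nn.Linear(4, 2)  # stand-in model: a 2 x 4 weight and a bias of size 2

# named_parameters() shows what parameters() yields, with names attached.
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

optimizer = Adam(model.parameters(), lr=1e-3)
# The optimizer stores the tensors it will update in its parameter groups.
num_in_optimizer = sum(p.numel() for g in optimizer.param_groups for p in g['params'])

print(shapes, num_trainable, num_in_optimizer)
```

Every tensor yielded by parameters() ends up in the optimizer, so the two counts match.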
References:
https://pytorch.org/docs/stable/tensorboard.html
https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html