PyTorch Learning Notes: Tensors, Autograd, and Computational Graphs

1) Revisiting NumPy
2) Tensors in PyTorch: Tensor
3) PyTorch's automatic differentiation engine: Autograd
4) Defining a custom Autograd function
5) PyTorch computational graphs
6) Packaging computational graphs into layers: the nn module
7) The automatic parameter updater: Optim
8) Defining a custom Module
9) Dynamic computational graphs

Overall, PyTorch provides two main features:

  1. An n-dimensional tensor, similar to a numpy array but able to run on a GPU;
  2. An automatic differentiation mechanism for building and training neural networks.

In this article, a fully connected network with a ReLU activation serves as the running example for a supervised learning problem. To keep things simple, the network has a single hidden layer, and the training data (x, y) are randomly generated. The goal of this supervised task is to minimize the Euclidean distance between the network output and the target output, i.e. the sum-of-squares loss Σ‖ŷ − y‖².

Revisiting NumPy

NumPy needs little introduction as a scientific computing package. Its key strength is that computations are expressed directly on whole matrices (arrays) and accelerated through vectorization and parallelization. With numpy arrays you get the kind of fast large-scale matrix computation familiar from MATLAB; the difference is that a commercial MATLAB license easily costs thousands of dollars, whereas numpy is open source, which has greatly helped developers share research and code.

NumPy itself knows nothing about deep learning: it has no notion of forward propagation, computational graphs, or gradient updates. Even so, it is entirely possible to build a simple fully connected network with it:

# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

Printing the loss at every iteration shows the squared distance between the network output and the target dropping steadily toward zero.

So the results look quite good. However, many modern models rely on massive GPU parallelism for speed, and unfortunately numpy cannot run on GPU computing platforms such as CUDA.

The most fundamental building block in PyTorch, the Tensor, solves exactly this problem.

PyTorch: Tensors

You can think of a PyTorch Tensor as a numpy array that can also be computed on a GPU. Beyond that, tensors support operations closely tied to deep learning: in particular, a tensor can take part in a computational graph and track gradients.
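As a quick illustration (this snippet is not part of the original example; the shapes are arbitrary), a tensor can be created on a chosen device, manipulated much like an array, and converted to and from numpy:

import numpy as np
import torch

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(3, 4, device=device)   # a tensor living on `device`
b = a * 2 + 1                          # elementwise ops, just like numpy

# Moving between numpy and PyTorch (numpy arrays always live on the CPU).
c = b.cpu().numpy()                    # tensor -> ndarray
d = torch.from_numpy(c).to(device)     # ndarray -> tensor, back on `device`
print(b.shape, c.shape, d.device)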

The fully connected network written by hand with numpy above can be rewritten with tensors as follows:

# -*- coding: utf-8 -*-

import torch

# Unlike the numpy version, we must pick the device (CPU or GPU) before computing.
dtype = torch.float
device = torch.device("cpu")      # run on the CPU
# device = torch.device("cuda:0") # uncomment to run on the GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)  # each tensor is created on the chosen device
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

PyTorch: Autograd

In the examples above we derived the backward-pass gradients of the network by hand. That is manageable for such a simple network, but as soon as the network becomes even slightly more complex, deriving the gradients manually gets much harder.

Fortunately, PyTorch provides a package that computes gradients for us automatically.

That package is autograd. When autograd is used, the forward pass of the network defines a computational graph: nodes in the graph are tensors, and edges are functions that produce output tensors from input tensors. Backpropagating through this graph then makes computing gradients straightforward.

Concretely, if x is a tensor (a node in the graph) with x.requires_grad=True, then after the backward pass x.grad is another tensor holding the gradient of some scalar value (typically the loss) with respect to x.
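A minimal sketch of this mechanism (not part of the example below; the values are arbitrary):

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
s = (x ** 2).sum()   # s = x1^2 + x2^2 + x3^2, a scalar node in the graph
s.backward()         # populates x.grad with ds/dx
print(x.grad)        # tensor([2., 4., 6.])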

Here is the same model rewritten using autograd:

# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
# w1 and w2 are the parameters we want to update. The backward pass differentiates
# the loss with respect to them, so any tensor we want gradients for (i.e. any
# tensor we want to update) must be created with requires_grad=True.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    # With autograd we can backpropagate directly and no longer need to keep the
    # intermediate values, so the three forward-pass steps above collapse into one line.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a zero-dimensional (scalar) Tensor;
    # loss.item() gets the Python number held in the loss.
    # Note: calling backward() with no arguments requires the tensor being
    # differentiated (here, the loss) to be a scalar.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    # Calling backward() runs the backward pass automatically; afterwards the
    # tensors w1.grad and w2.grad hold the gradients of the loss w.r.t. w1 and w2.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

Defining a New autograd Function

As the examples above suggest, under the hood each autograd operator is defined by two functions: forward and backward.

forward computes output tensors from input tensors; backward receives the gradient of some scalar value (the loss) with respect to the output tensors and computes the gradient of that same scalar with respect to the input tensors.

In PyTorch we can define our own autograd operator by subclassing torch.autograd.Function and implementing these two functions.

In this section we define a custom autograd operator that implements the ReLU nonlinearity and use it in our network:

# -*- coding: utf-8 -*-
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        # ctx can be used to stash information needed for the backward computation.
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

PyTorch: nn Module, Packaging the Computational Graph into Layers

Computational graphs and autograd can express complex operators and differentiate them automatically, but for large neural networks working with raw graphs alone is still too low-level.

Similar to TensorFlow and Keras, PyTorch solves this by packaging parts of the computational graph into layers. A layer holds learnable parameters, which are later updated based on gradient information. In PyTorch, layers are provided by the nn module.
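As a small illustration (this snippet is not part of the original walkthrough), a single nn.Linear layer already owns a weight matrix and a bias vector, both created with requires_grad=True, so autograd tracks gradients for them automatically:

import torch

layer = torch.nn.Linear(3, 2)   # computes y = x @ W^T + b
for name, p in layer.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)
# weight (2, 3) True
# bias (2,) True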

The nn module contains the commonly needed layers, each computing outputs from inputs; in addition it provides a set of useful loss functions that can be called directly. Next, we rewrite the model above using the nn module:

# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    # Unlike the hand-built graphs above, the nn modules already create their
    # learnable parameters with requires_grad=True.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

PyTorch: optim

Up to now, even though the gradients were computed for us, we were still applying the gradient-descent update to the parameters by hand. That works for simple models trained with plain SGD, but for more complex models, and especially with the more sophisticated optimizers such as Adagrad or RMSProp that are common today, updating the parameters manually is no longer practical.

Fortunately, PyTorch provides the optim package, which contains many ready-made optimizers. Building on the gradients obtained automatically by autograd, optim can also update the parameters automatically.
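Switching the optimization algorithm is then a one-line change. A small sketch (the tiny Linear model here is just a placeholder and the learning rates are only illustrative):

import torch

model = torch.nn.Linear(10, 1)  # any nn.Module works here

# Each optimizer is constructed from the parameters it should update plus its
# own hyperparameters; any of the commented lines could replace the first one.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
# optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2)
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)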

In this section we take the nn-based model defined above and let optim perform the gradient updates automatically:

# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    # In short: before the backward pass, always call the optimizer's zero_grad()
    # to clear the parameter gradients, because they accumulate every time
    # backward() is called.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

With this, the three key stages of training a fully connected network, namely forward computation (forward), backpropagation (backward), and the gradient update, are each handled by a PyTorch component: the nn module, loss.backward(), and optimizer.step(). The resulting training code becomes very lightweight.
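Stripped of the model details, the resulting training-loop skeleton looks like this (a generic sketch; the tiny model and random data are only placeholders):

import torch

model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 4), torch.randn(8, 1)

for t in range(100):
    y_pred = model(x)           # forward pass (nn module)
    loss = loss_fn(y_pred, y)   # loss (nn loss function)
    optimizer.zero_grad()       # clear gradients accumulated in the last step
    loss.backward()             # backward pass (autograd)
    optimizer.step()            # parameter update (optim)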

PyTorch: Custom nn Modules

In research, the model we want to build is often more complex than a plain Sequential stack, for example with multiple branches or other structures. PyTorch lets us define such models ourselves by subclassing nn.Module.

Here we redefine the Sequential model from the code above as a subclass of nn.Module:

# -*- coding: utf-8 -*-
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        # In the constructor we define the layers and store them as attributes.
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

PyTorch: Dynamic Computational Graphs

One major difference from frameworks such as TensorFlow, Keras, and Caffe is that PyTorch supports dynamic computational graphs.

With static-graph frameworks such as classic TensorFlow, Keras, or Caffe, once the model (the graph structure) has been defined, we can no longer change or adjust it. PyTorch, by contrast, rebuilds the graph on every forward pass, so it can change from run to run. On top of that, PyTorch has mechanisms such as weight sharing that combine naturally with dynamic graphs.

As an example of dynamic graphs and weight sharing, we implement a rather odd model: a fully connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights several times to compute the innermost hidden layers.

For this model we can use ordinary Python flow control to implement the loop, and we get weight sharing among the innermost layers simply by reusing the same Module multiple times when defining the forward pass.

# -*- coding: utf-8 -*-
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()