Custom Activation Functions in PyTorch and Adding Learnable Parameters to Them

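PyTorch ships with many built-in activation functions, such as ReLU. Sometimes, though, you need to define your own, or even give an activation function learnable parameters, as PReLU does; see the PyTorch source code for the built-in activation functions for details.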

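Activation functions without learnable parameters are fairly simple to implement. Here is an example from the official PyTorch tutorial (written against PyTorch 1.2):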

from torch.autograd import Function

# Inherit from Function
class LinearFunction(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
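            # sum(0) already has the shape of bias; an extra squeeze(0) would
            # drop the only dimension when there is a single output feature.
            grad_bias = grad_output.sum(0)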

        return grad_input, grad_weight, grad_bias

Now, to make it easier to use these custom ops, we recommend aliasing their apply method:

linear = LinearFunction.apply

Here, we give an additional example of a function that is parametrized by non-Tensor arguments:

class MulConstant(Function):
    @staticmethod
    def forward(ctx, tensor, constant):
        # ctx is a context object that can be used to stash information
        # for backward computation
        ctx.constant = constant
        return tensor * constant

    @staticmethod
    def backward(ctx, grad_output):
        # We return as many input gradients as there were arguments.
        # Gradients of non-Tensor arguments to forward must be None.
        return grad_output * ctx.constant, None

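As you can see, no learnable parameters are involved here, so the implementation is relatively simple. (In the code above, ctx stands for context: it stores whatever the backward pass will need, such as tensors or constants, and backward reads them back from it. The text below also refers to torch.nn.functional by its usual alias, as in: import torch.nn.functional as F.)

Subclassing Function directly and writing both forward and backward yourself is mostly useful when the activation function is not differentiable. In backward you can then return an approximate gradient, or any gradient rule you choose. In my opinion, this is the biggest advantage of defining activation functions this way.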
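To make the idea of an approximate gradient concrete, here is a minimal sketch of a sign activation with a straight-through style backward. The class name SignSTE and the clipping rule (pass the gradient where |input| <= 1) are illustrative assumptions, not something taken from the official tutorial.

```python
import torch
from torch.autograd import Function


class SignSTE(Function):
    """Sign activation with a hand-written, approximate backward.

    Hypothetical example: the name and the clipping rule are assumptions.
    """

    @staticmethod
    def forward(ctx, input):
        # Keep the input so backward can decide where to pass the gradient.
        ctx.save_for_backward(input)
        return torch.sign(input)

    @staticmethod
    def backward(ctx, grad_output):
        (input,) = ctx.saved_tensors
        # The true derivative of sign() is zero almost everywhere, so we
        # substitute an approximation: let the gradient through unchanged
        # where |input| <= 1 and block it elsewhere.
        mask = (input.abs() <= 1).to(grad_output.dtype)
        return grad_output * mask


sign_ste = SignSTE.apply

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = sign_ste(x)
y.sum().backward()
# x.grad is nonzero only for the two entries with |x| <= 1.
```

With torch.sign alone, x.grad would be all zeros and nothing upstream could learn; the custom backward is what makes the layer trainable.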

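So how do we add learnable parameters to an activation function? This comes down to the difference between nn.Module and nn.functional (torch.nn.functional). nn.Module is a class that defines a complete network layer; it can maintain state and store parameters. nn.functional only provides the computation itself and holds no state or parameters. Operations with no trainable parameters, such as the ReLU activation mentioned earlier, dropout, and pooling, can therefore be used directly from the functional module.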

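To answer that, look more closely at how nn.Module and nn.functional relate. A layer that subclasses nn.Module holds learnable parameters such as weight and bias, but the actual computation, convolution for example, is still carried out by a function in nn.functional. nn.Module simply wraps that function so that state and parameters can be kept alongside it. For simple layers with no learnable parameters you can use either form, for example nn.ReLU or F.relu. The difference is that nn.ReLU is a layer object in the network, while F.relu is just a function call.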
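The relationship can be checked directly. The sketch below compares nn.ReLU with F.relu (no parameters on either side) and nn.PReLU with F.prelu (the layer owns the slope, the function has to be handed one). The variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 2.0])

# Stateless case: the layer and the function compute the same thing.
relu_layer = nn.ReLU()
same_output = torch.equal(relu_layer(x), F.relu(x))

# Stateful case: nn.PReLU stores a learnable slope as a Parameter,
# while F.prelu is a plain function that takes the slope as an argument.
prelu_layer = nn.PReLU()
num_relu_params = len(list(relu_layer.parameters()))
num_prelu_params = len(list(prelu_layer.parameters()))
prelu_out = F.prelu(x, prelu_layer.weight)
```

Here nn.ReLU reports no parameters and nn.PReLU reports one, which is exactly the state that the functional form leaves to the caller.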

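In short: to give an activation function learnable parameters, wrap the computation (a function from nn.functional or a custom Function) in an nn.Module subclass that owns the parameters. The official PyTorch example below shows the pattern: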

import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self, input_features, output_features, bias=True):
        super(Linear, self).__init__()
        self.input_features = input_features
        self.output_features = output_features

        # nn.Parameter is a special kind of Tensor, that will get
        # automatically registered as Module's parameter once it's assigned
        # as an attribute. Parameters and buffers need to be registered, or
        # they won't appear in .parameters() (doesn't apply to buffers), and
        # won't be converted when e.g. .cuda() is called. You can use
        # .register_buffer() to register buffers.
        # nn.Parameters require gradients by default.
        self.weight = nn.Parameter(torch.Tensor(output_features, input_features))
        if bias:
            self.bias = nn.Parameter(torch.Tensor(output_features))
        else:
            # You should always register all possible parameters, but the
            # optional ones can be None if you want.
            self.register_parameter('bias', None)

        # Not a very smart way to initialize weights
        self.weight.data.uniform_(-0.1, 0.1)
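        # Check the registered parameter: the bias argument is a bool, so
        # "bias is not None" would be true even when bias=False.
        if self.bias is not None: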
            self.bias.data.uniform_(-0.1, 0.1)

    def forward(self, input):
        # See the autograd section for explanation of what happens here.
        return LinearFunction.apply(input, self.weight, self.bias)

    def extra_repr(self):
        # (Optional)Set the extra information about this module. You can test
        # it by printing an object of this class.
        return 'in_features={}, out_features={}, bias={}'.format(
            self.input_features, self.output_features, self.bias is not None
        )

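Note: Linear is constructed with the feature sizes and called with the input only; weight and bias are held by the module itself. The correct way to use the Linear defined above is:

output = Linear(input_features, output_features)(input)
# Or, inside another module:
self.linear = Linear(input_features, output_features)
output = self.linear(input)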
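The same wrapping pattern applies to an activation function. As a minimal sketch, here is a Swish-style activation, x * sigmoid(beta * x), with a learnable beta. The class name LearnableSwish and the init_beta argument are hypothetical. Every operation in it is differentiable, so it relies on autograd rather than a custom Function; if a hand-written backward were needed, forward would call a Function's apply instead, as Linear does above.

```python
import torch
import torch.nn as nn


class LearnableSwish(nn.Module):
    """x * sigmoid(beta * x) with a learnable scalar beta (hypothetical name)."""

    def __init__(self, init_beta=1.0):
        super().__init__()
        # Assigning an nn.Parameter as an attribute registers it, so it
        # shows up in .parameters() and is updated by the optimizer.
        self.beta = nn.Parameter(torch.tensor(float(init_beta)))

    def forward(self, input):
        return input * torch.sigmoid(self.beta * input)


act = LearnableSwish()
x = torch.randn(4, 3)
out = act(x)

# Backpropagate through the activation; beta receives a gradient
# just like any weight in the network.
out.sum().backward()
```

After backward, act.beta.grad holds the gradient of the summed output with respect to beta.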

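By now we have seen how to define a custom activation function and how to add learnable parameters to one. A few details about nn.Module and nn.functional are still worth spelling out.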

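First, when you subclass Function (from torch.autograd import Function), you must define both forward and backward yourself; this is one of the places where PyTorch is most flexible. Learnable parameters still have to be held by an nn.Module wrapper. Second, when you subclass nn.Module you only define forward and do not write backward. The operations inside an nn.Module are built from functions in nn.functional and other tensor operations, each of which already has its backward defined, so nothing needs to be redefined in the module. In other words, as long as the operations are differentiable, torch.autograd differentiates them automatically. This is why a custom loss function usually just subclasses nn.Module and puts the loss computation in forward: the backward pass is derived automatically. For background on backpropagation, see CS231n Notes | 4: Backpropagation and Neural Networks.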
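As an illustration of a loss written this way, the sketch below defines a root-mean-square error loss with only a forward method. The class name RMSELoss and the eps argument are assumptions chosen for the example.

```python
import torch
import torch.nn as nn


class RMSELoss(nn.Module):
    """Root-mean-square error; only forward is defined (hypothetical name)."""

    def __init__(self, eps=1e-8):
        super().__init__()
        # Small constant that keeps the sqrt gradient finite at zero error.
        self.eps = eps

    def forward(self, prediction, target):
        return torch.sqrt(torch.mean((prediction - target) ** 2) + self.eps)


criterion = RMSELoss()
prediction = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
target = torch.tensor([1.0, 2.0, 5.0])

loss = criterion(prediction, target)
# No backward method was written; autograd derives it from the
# differentiable operations used in forward.
loss.backward()
```

Only the third prediction differs from its target, so only that entry of prediction.grad is nonzero.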

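References:

1. CS231n Notes | 4: Backpropagation and Neural Networks

2. A Look at the backward() Function of nn.Module and nn.autograd.Function in PyTorch

3. Getting Started with PyTorch (8): Implementing Custom Layers (Including backward for Non-Differentiable Operations)

4. Extending PyTorch

5. Subclassing torch.autograd.Function to Define Custom Operations and Their Backward Functions

6. PyTorch Series 12: Custom Loss Functions in PyTorch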