Custom Activation Functions in PyTorch and Adding Learnable Parameters to Them

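PyTorch already ships with many built-in activation functions, such as ReLU. Sometimes, however, you need to define your own activation function, or even give it learnable parameters, as PReLU does; see the official PyTorch source code for the activation functions for details.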

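An activation function that needs no learnable parameters is fairly simple to implement. The mechanism is shown by the following example from the official PyTorch tutorial (PyTorch version 1.2). The example implements a linear operation rather than an activation, but a custom activation is written the same way: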

from torch.autograd import Function

# Inherit from Function
class LinearFunction(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)

        return grad_input, grad_weight, grad_bias

Now, to make it easier to use these custom ops, we recommend aliasing their apply method:

linear = LinearFunction.apply

Here, we give an additional example of a function that is parametrized by non-Tensor arguments:

class MulConstant(Function):
    @staticmethod
    def forward(ctx, tensor, constant):
        # ctx is a context object that can be used to stash information
        # for backward computation
        ctx.constant = constant
        return tensor * constant

    @staticmethod
    def backward(ctx, grad_output):
        # We return as many input gradients as there were arguments.
        # Gradients of non-Tensor arguments to forward must be None.
        return grad_output * ctx.constant, None
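A hand-written backward is easy to get wrong, so it helps to check it numerically. The following is a minimal sketch using torch.autograd.gradcheck; MulConstant is repeated so the snippet runs on its own, and the constant 2.5 and the tensor shapes are arbitrary choices made for illustration.

```python
import torch
from torch.autograd import Function, gradcheck


class MulConstant(Function):
    # Same definition as above, repeated so this snippet is self-contained.
    @staticmethod
    def forward(ctx, tensor, constant):
        ctx.constant = constant
        return tensor * constant

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward argument; None for the non-Tensor constant.
        return grad_output * ctx.constant, None


# Direct check: d(sum(x * c))/dx should equal c everywhere.
x = torch.randn(4, 3, requires_grad=True)
MulConstant.apply(x, 2.5).sum().backward()
grad_matches = torch.allclose(x.grad, torch.full_like(x, 2.5))

# gradcheck compares the analytical gradient with finite differences;
# double precision inputs keep the numerical estimate accurate.
x_check = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
check_passed = gradcheck(lambda t: MulConstant.apply(t, 2.5), (x_check,),
                         eps=1e-6, atol=1e-4)
print(grad_matches, check_passed)
```

gradcheck is only meaningful when backward is meant to be the exact derivative; a deliberately approximate backward will fail it by design.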

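As you can see, no learnable parameters are involved here, so the implementation is relatively simple. Subclassing Function directly and defining forward and backward yourself is mostly useful when the activation function is not differentiable: in the backward pass you can substitute an approximate derivative, or any gradient rule of your own. In my opinion, this is the biggest advantage of custom activation functions.

(In the code above, ctx stands for "context". It stores whatever the backward pass will need, such as tensors and other values, and these can be retrieved when the backward function is defined.)

The functional interface discussed next is conventionally imported under the alias F:

import torch.nn.functional as F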
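As one illustration of the non-differentiable case described above, here is a minimal sketch of a sign activation whose backward uses a straight-through style approximation. The class name BinarizeSTE and the clipping range |x| <= 1 are assumptions chosen for this example, not something taken from the PyTorch tutorial.

```python
import torch
from torch.autograd import Function


class BinarizeSTE(Function):
    """Sign activation with an approximate (straight-through) gradient."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        # The true derivative of this step function is zero almost everywhere,
        # which would stop all learning if used as-is.
        return torch.where(input >= 0,
                           torch.ones_like(input),
                           -torch.ones_like(input))

    @staticmethod
    def backward(ctx, grad_output):
        (input,) = ctx.saved_tensors
        # Approximation: treat the function as the identity where |input| <= 1
        # and block the gradient elsewhere.
        return grad_output * (input.abs() <= 1).to(grad_output.dtype)


x = torch.tensor([-2.0, -0.5, 0.0, 0.3, 1.5], requires_grad=True)
y = BinarizeSTE.apply(x)
y.sum().backward()
print(y)       # forward values are -1 or +1
print(x.grad)  # 1 where |x| <= 1, 0 elsewhere
```

Because the backward here is intentionally not the true derivative, it is the author of the Function who decides which surrogate gradient is appropriate for the task.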

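So how can learnable parameters be added to an activation function? This comes down to the difference between nn.Module and nn.functional (torch.nn.functional). nn.Module is a ready-made class that defines a network layer and can maintain state and store parameters, whereas nn.functional only provides the computation and neither maintains state nor stores parameters. Operations with no trainable parameters, such as the ReLU activation mentioned earlier, dropout, and pooling, can therefore be taken from the functional module.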

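So what should we do to add learnable parameters to a custom activation function? The answer again lies in the relationship between nn.Module and nn.functional. A layer that subclasses nn.Module holds learnable parameters such as weights and a bias, but the actual operation, a convolution for example, is still carried out by nn.functional. nn.Module is only a wrapper around nn.functional that allows state to be maintained and parameters to be stored. For simple layers with no learnable parameters, either form can be used, for example nn.ReLU or F.relu. The difference is that nn.ReLU acts as a layer of the network, while F.relu is just a function call.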
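The relationship can be seen directly in a short sketch. It compares nn.ReLU with F.relu, which have no parameters, and nn.PReLU with F.prelu, where the module owns the learnable slope and the functional form needs it passed in. The input values are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 2.0])

# Module form and functional form compute the same thing.
relu_layer = nn.ReLU()
same_relu = torch.equal(relu_layer(x), F.relu(x))

# nn.ReLU stores no parameters at all.
n_relu_params = len(list(relu_layer.parameters()))

# nn.PReLU stores its learnable slope as a parameter named "weight".
prelu_layer = nn.PReLU()
prelu_param_names = [name for name, _ in prelu_layer.named_parameters()]

# The functional form holds no state, so the slope is an explicit argument.
same_prelu = torch.equal(prelu_layer(x), F.prelu(x, prelu_layer.weight))

print(same_relu, n_relu_params, prelu_param_names, same_prelu)
```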

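In short: to add learnable parameters to an activation function, wrap the underlying computation (an nn.functional call or a custom Function) in an nn.Module subclass. See the official PyTorch example below: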

import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self, input_features, output_features, bias=True):
        super(Linear, self).__init__()
        self.input_features = input_features
        self.output_features = output_features

        # nn.Parameter is a special kind of Tensor, that will get
        # automatically registered as Module's parameter once it's assigned
        # as an attribute. Parameters and buffers need to be registered, or
        # they won't appear in .parameters() (doesn't apply to buffers), and
        # won't be converted when e.g. .cuda() is called. You can use
        # .register_buffer() to register buffers.
        # nn.Parameters require gradients by default.
        self.weight = nn.Parameter(torch.empty(output_features, input_features))
        if bias:
            self.bias = nn.Parameter(torch.empty(output_features))
        else:
            # You should always register all possible parameters, but the
            # optional ones can be None if you want.
            self.register_parameter('bias', None)

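        # Not a very smart way to initialize weights
        self.weight.data.uniform_(-0.1, 0.1)
        # Check the registered parameter, not the boolean flag: with
        # bias=False, self.bias is None and has no .data attribute.
        if self.bias is not None:
            self.bias.data.uniform_(-0.1, 0.1)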

    def forward(self, input):
        # See the autograd section for explanation of what happens here.
        return LinearFunction.apply(input, self.weight, self.bias)

    def extra_repr(self):
        # (Optional)Set the extra information about this module. You can test
        # it by printing an object of this class.
        return 'in_features={}, out_features={}, bias={}'.format(
            self.input_features, self.output_features, self.bias is not None
        )

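Note: the Linear module above is constructed with its feature sizes and then called with the input only. The weight and bias are stored in the module and passed to LinearFunction.apply inside forward:

output = Linear(input_features, output_features)(input)
### Or, inside another module
self.linear = Linear(input_features, output_features)
output = self.linear(input)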
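The same pattern applies to an actual activation function. The following is a minimal sketch, assuming a Swish-like activation x * sigmoid(beta * x) with a learnable scalar beta; the class name LearnableSwish and the argument init_beta are hypothetical names introduced only for this example.

```python
import torch
import torch.nn as nn


class LearnableSwish(nn.Module):
    """x * sigmoid(beta * x), with beta as a learnable scalar (illustrative)."""

    def __init__(self, init_beta=1.0):
        super().__init__()
        # Assigning an nn.Parameter as an attribute registers it with the module.
        self.beta = nn.Parameter(torch.tensor(float(init_beta)))

    def forward(self, input):
        # Built only from differentiable ops, so autograd derives the backward.
        return input * torch.sigmoid(self.beta * input)


act = LearnableSwish()
x = torch.randn(8, 4)
act(x).sum().backward()

# beta shows up among the module's parameters and receives a gradient.
param_names = [name for name, _ in act.named_parameters()]
print(param_names, act.beta.grad is not None)

# An optimizer updates beta like any other weight.
optimizer = torch.optim.SGD(act.parameters(), lr=0.1)
beta_before = act.beta.item()
optimizer.step()
print(beta_before, act.beta.item())
```

If the activation were not differentiable, forward could call a custom Function's apply method instead, exactly as Linear calls LinearFunction.apply above.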

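At this point we have seen how to define a custom activation function and how to add learnable parameters to one. A few details about nn.Module and nn.functional are still worth spelling out.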

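First, when subclassing Function (from torch.autograd import Function), you must define both forward and backward yourself, which is one of the places where PyTorch is very flexible. Adding learnable parameters requires wrapping the Function in an nn.Module.

Second, when subclassing nn.Module you do not write backward yourself; you only define forward. The operations inside an nn.Module are carried out by calling functions from nn.functional, and these already have their backward defined, so there is nothing to redefine in the module. In fact, for differentiable operations, a subclass of nn.Module relies on torch.autograd to differentiate automatically. A custom loss function usually just subclasses nn.Module and puts the loss computation in forward; no backward is needed, and the backward pass is derived automatically. For background on backpropagation, see CS231n Notes | 4: Backpropagation and Neural Networks.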
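The custom-loss case can be sketched as follows. The loss below, a smoothed absolute error, is an arbitrary choice for illustration, and the names SmoothAbsLoss and eps are hypothetical; the point is only that forward is defined and backward is not.

```python
import torch
import torch.nn as nn


class SmoothAbsLoss(nn.Module):
    """Mean of sqrt((prediction - target)^2 + eps); illustrative only."""

    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, prediction, target):
        # Only differentiable tensor ops are used, so no backward is written.
        return torch.sqrt((prediction - target) ** 2 + self.eps).mean()


model = nn.Linear(3, 1)
criterion = SmoothAbsLoss()
inputs, targets = torch.randn(16, 3), torch.randn(16, 1)

loss = criterion(model(inputs), targets)
loss.backward()  # autograd traverses the ops recorded during forward

# Every parameter of the model now has a gradient.
has_grads = all(p.grad is not None for p in model.parameters())
print(loss.item(), has_grads)
```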

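References:

1. CS231n Notes | 4: Backpropagation and Neural Networks (CS231n笔记|4 反向传播与神经网络)

2. A discussion of the backward() function of nn.Module and nn.autograd.Function in PyTorch (探讨Pytorch中nn.Module与nn.autograd.Function的backward()函数)

3. Getting started with PyTorch (8): implementing custom layers, including how to write backward for non-differentiable operations (Pytorch入门学习（八）-----自定义层的实现（甚至不可导operation的backward写法）)

4. Extending PyTorch

5. Defining a subclass of torch.autograd.Function to implement custom operations and their backward functions (定义torch.autograd.Function的子类，自己定义某些操作，且定义反向求导函数)

6. pytorch series 12: custom loss functions in PyTorch (pytorch系列12 --pytorch自定义损失函数custom loss function)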