PyTorch(三):常见的网络层

本文参考:
PyTorch官方教程中文版:http://pytorch123.com/FirstSection/PyTorchIntro/
Pytorch中文文档:https://pytorch-cn.readthedocs.io/zh/latest/package_references/Tensor/
PyTorch英文文档:https://pytorch.org/docs/stable/tensors.html
《深度学习之PyTorch物体检测实战》
第一次接触PyTorch,网上很难找到最新版本的教程,先从它的官方资料入手吧!


默认导入模块:

import os
import json
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
import torchvision
from torchvision import models
from torch.utils.data import Dataset
from torchvision import transforms
from torch.utils.data import DataLoader
import visdom
# from tensorboardX import SummaryWriter
from torch.utils.tensorboard import SummaryWriter

全连接层

nn.Linear(in_features, out_features, bias=True)
>>> linear = nn.Linear(784, 10)
>>> input = torch.randn(4, 784)
>>> output = linear(input)
>>> output.shape
torch.Size([4, 10])

卷积层

nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, 
		dilation=1, groups=1, bias=True, padding_mode='zeros')
  • dilation:空洞卷积,大于 1 时可以增大感受野;配合合适的 padding,可以在增大感受野的同时保持特征图尺寸不变
  • groups:可实现组卷积,即卷积时不再是所有输入通道与所有输出通道全部相连,而是将输入通道分成若干组,组内连接、组间稀疏,以达到降低计算量的目的
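
下面是一个简单的示意(通道数等取值仅作演示),对比 dilation 与 groups 的效果:

>>> # 3x3 卷积,dilation=2,配合 padding=2 可以保持特征图尺寸不变
>>> conv_dilated = nn.Conv2d(16, 32, 3, stride=1, padding=2, dilation=2)
>>> conv_dilated(torch.randn(1, 16, 28, 28)).shape
torch.Size([1, 32, 28, 28])
>>> # 组卷积:groups=4 时卷积核被分成 4 组,参数量约为普通卷积的 1/4
>>> conv_group = nn.Conv2d(16, 32, 3, padding=1, groups=4)
>>> conv_group.weight.shape    # (out_channels, in_channels // groups, kH, kW)
torch.Size([32, 4, 3, 3])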

通过 .weight 与 .bias 可以查看卷积核的权重与偏置

>>> conv = nn.Conv2d(1, 1, 3, 1, 1)
>>> conv.weight.shape
torch.Size([1, 1, 3, 3])
>>> conv.bias.shape
torch.Size([1])

输入特征图必须写为 (N, C, H, W) 的形式

>>> input = torch.randn(1, 1, 5, 5)
>>> output = conv(input)
>>> output.shape
torch.Size([1, 1, 5, 5])

池化层

最大池化层

nn.MaxPool2d(kernel_size, stride=None, padding=0, 
			dilation=1, return_indices=False, ceil_mode=False)
  • return_indices – if True, will return the max indices along with the outputs.
  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
  • stride – 注意:stride 默认值为 kernel_size,而非1
>>> max_pooling = nn.MaxPool2d(2, stride=2)
>>> input = torch.randn(1, 1, 4, 4)
>>> max_pooling(input)
tensor([[[[0.9636, 0.7075],
          [1.0641, 1.1749]]]])
>>> max_pooling(input).shape
torch.Size([1, 1, 2, 2])

平均池化层

nn.AvgPool2d(kernel_size, stride=None, padding=0, 
			ceil_mode=False, count_include_pad=True, divisor_override=None)

If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.

  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
  • count_include_pad – when True, will include the zero-padding in the averaging calculation
  • divisor_override – if specified, it will be used as divisor, otherwise kernel_size will be used

The parameters kernel_size, stride, padding can either be:

  • a single int – in which case the same value is used for the height and width dimension
  • a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension
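
例如(数值仅作示意),传入二元组时高和宽方向可以使用不同的池化窗口:

>>> pool = nn.AvgPool2d((2, 4))    # 高度方向窗口为 2,宽度方向窗口为 4
>>> pool(torch.randn(1, 1, 8, 8)).shape
torch.Size([1, 1, 4, 2])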

全局平均池化层

nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten()
)
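
一个简单的使用示意(输入尺寸仅作演示):无论输入特征图多大,输出都会被压成 (N, C):

>>> gap = nn.Sequential(nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten())
>>> gap(torch.randn(4, 512, 7, 7)).shape
torch.Size([4, 512])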

激活函数层

当然,下面的层也可以用torch.nn.functional中的函数替代
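
例如下面两种写法结果一致(仅作示意):

>>> x = torch.randn(2, 3)
>>> torch.equal(nn.ReLU()(x), F.relu(x))
True
>>> torch.equal(nn.Softmax(dim=1)(x), F.softmax(x, dim=1))
True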

Sigmoid层

nn.Sigmoid()
>>> sigmoid = nn.Sigmoid()
>>> sigmoid(torch.Tensor([1, 1, 2, 2]))
tensor([0.7311, 0.7311, 0.8808, 0.8808])

ReLU层

nn.ReLU(inplace=False)
>>> relu = nn.ReLU(inplace=True)
>>> input = torch.randn(2, 2)
>>> input
tensor([[-0.4853,  2.3864],
        [ 0.7122, -0.6493]])
>>> relu(input)
tensor([[0.0000, 2.3864],
        [0.7122, 0.0000]])
>>> input
tensor([[0.0000, 2.3864],
        [0.7122, 0.0000]])

Softmax层

nn.Softmax(dim=None)
>>> softmax = nn.Softmax(dim=1)
>>> score = torch.randn(1, 4)
>>> score
tensor([[ 0.3101,  3.5648,  1.0988, -1.5856]])
>>> softmax(score)
tensor([[0.0342, 0.8855, 0.0752, 0.0051]])

LogSoftmax层

nn.LogSoftmax(dim=None)

后接nn.NLLLoss层相当于CrossEntropyLoss
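
可以用下面的小例子验证这一点(数值为随机生成,仅作示意):

>>> score = torch.randn(4, 10)
>>> label = torch.randint(0, 10, (4,))
>>> loss1 = nn.NLLLoss()(nn.LogSoftmax(dim=1)(score), label)
>>> loss2 = nn.CrossEntropyLoss()(score, label)
>>> torch.allclose(loss1, loss2)
True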

Dropout层

nn.Dropout(p=0.5, inplace=False)
>>> dropout = nn.Dropout(0.5, inplace=False)
>>> input = torch.randn(1, 20)
>>> output = dropout(input)
>>> output
tensor([[-2.9413,  0.0000,  1.8461,  1.9605,  0.2774, -0.0000, -2.5381, -2.0313,
         -0.1914,  0.0000,  0.5346, -0.0000,  0.0000,  4.4960, -3.8345, -1.0938,
          4.3297,  2.1258, -4.1431,  0.0000]])
>>> input
tensor([[-1.4707,  0.5105,  0.9231,  0.9802,  0.1387, -0.4195, -1.2690, -1.0156,
         -0.0957,  0.8108,  0.2673, -2.0898,  0.6666,  2.2480, -1.9173, -0.5469,
          2.1648,  1.0629, -2.0716,  0.9974]])
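
可以看到训练模式下保留下来的元素都被放大了 1/(1-p) 倍(这里是 2 倍),以保持期望不变;调用 .eval() 切换到推理模式后 Dropout 不再生效:

>>> dropout.eval()
Dropout(p=0.5, inplace=False)
>>> torch.equal(dropout(input), input)
True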

BN层

torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, 
					affine=True, track_running_stats=True)
  • num_features – C from an expected input of size (N, C, H, W)
  • eps – a value added to the denominator for numerical stability. Default: 1e-5
  • momentum – the value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average). Default: 0.1
  • affine – a boolean value that when set to True, this module has learnable affine parameters. Default: True
  • track_running_stats – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default: True

Because the Batch Normalization is done over the C dimension, computing statistics on (N, H, W) slices, it’s common terminology to call this Spatial Batch Normalization.

The mean and standard-deviation are calculated per-dimension over the mini-batches and γ and β are learnable parameter vectors of size C (where C is the input size). By default, the elements of γ are set to 1 and the elements of β are set to 0.

>>> bn = nn.BatchNorm2d(64)
>>> input = torch.randn(4, 64, 28, 28)
>>> output = bn(input)
>>> output.shape
torch.Size([4, 64, 28, 28])
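
其中可学习的 γ 和 β 分别对应 bn.weight 和 bn.bias,长度都为 C;running_mean 和 running_var 则是推理时使用的统计量:

>>> bn.weight.shape, bn.bias.shape
(torch.Size([64]), torch.Size([64]))
>>> bn.running_mean.shape
torch.Size([64])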

损失函数层

NLLLoss

nn.NLLLoss(weight=None, size_average=None, 
			ignore_index=-100, reduce=None, reduction='mean')

The input given through a forward call is expected to contain log-probabilities of each class. input has to be a Tensor of size either (minibatch, C) or (minibatch, C, d_1, d_2, ..., d_K) with K ≥ 1 for the K-dimensional case (described later).

It is useful to train a classification problem with C (C = number of classes) classes.

The target that this loss expects should be a class index in the range [0, C-1] where C = number of classes; if ignore_index is specified, this loss also accepts this class index (this index may not necessarily be in the class range).

The unreduced (i.e. with reduction set to 'none') loss can be described as:

$$
\begin{aligned}
\ell(x, y) &= L = \{l_1, \dots, l_N\}^T \\
l_n &= -\,weight[y_n] \cdot x_{n, y_n} \cdot \mathbb{1}\{y_n \neq \text{ignore\_index}\}
\end{aligned}
$$

其中 $x$ 为输入(log-probabilities),$y$ 为标签,$weight$ 表示每个类别在计算 loss 时的权重,$x_{n, y_n}$ 表示第 $n$ 个样本中正确类别对应的 $\log(\text{score})$。

If reduction is 'mean' (the default), then
$$\ell(x, y) = \sum_{n=1}^N \frac{l_n}{\sum_{n=1}^N weight[y_n]}$$

If reduction is 'sum', then
$$\ell(x, y) = \sum_{n=1}^N l_n$$

Can also be used for higher dimension inputs, such as 2D images, by providing an input of size (minibatch, C, d_1, d_2, ..., d_K) with K ≥ 1, where K is the number of dimensions, and a target of appropriate shape (see below). In the case of images, it computes NLL loss per-pixel.

  • weight (Tensor, optional) – a manual rescaling weight given to each class. If given, it has to be a Tensor of size C. Otherwise, it is treated as if having all ones. If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set. 就是在计算loss时给每个类别加的权重
  • size_average (bool, optional) – Deprecated
  • ignore_index (int, optional) – Specifies a target value that is ignored and does not contribute to the input gradient.
  • reduce (bool, optional) – Deprecated
  • reduction (string, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'

Shape:

  • Input: (N, C) where C = number of classes, or (N, C, d_1, d_2, ..., d_K) with K ≥ 1 in the case of K-dimensional loss.
  • Target: (N) where each value is 0 ≤ targets[i] ≤ C-1, or (N, d_1, d_2, ..., d_K) with K ≥ 1 in the case of K-dimensional loss.
  • Output: scalar. If reduction is 'none', then the same size as the target: (N), or (N, d_1, d_2, ..., d_K) with K ≥ 1 in the case of K-dimensional loss.
# 一维(常规分类)情况
m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()

input = torch.randn(3, 5, requires_grad=True)
target = torch.tensor([1, 0, 4])

output = loss(m(input), target)

# 二维(如图像逐像素分类)情况
N, C = 5, 4
loss = nn.NLLLoss()

# input is of size N x C x height x width
data = torch.randn(N, C, 8, 8)
m = nn.LogSoftmax(dim=1)

# each element in target has to have 0 <= value < C
target = torch.empty(N, 8, 8, dtype=torch.long).random_(0, C)
output = loss(m(data), target)
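
下面再用一个小例子手动验证上面带 weight 的 'mean' 公式(数值仅作示意):

>>> weight = torch.tensor([1.0, 2.0, 3.0])
>>> log_prob = nn.LogSoftmax(dim=1)(torch.randn(4, 3))
>>> target = torch.tensor([0, 2, 1, 2])
>>> loss = nn.NLLLoss(weight=weight)(log_prob, target)
>>> # 每个样本取 -weight[y_n] * x_{n, y_n},再除以所有样本权重之和
>>> manual = (-weight[target] * log_prob[torch.arange(4), target]).sum() / weight[target].sum()
>>> torch.allclose(loss, manual)
True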

CrossEntropyLoss

nn.CrossEntropyLoss(weight=None, size_average=None, 
					ignore_index=-100, reduce=None, reduction='mean')

This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.

其实就是 Softmax + 交叉熵损失,虽然现在还没看过源码,但应该也是因为两者结合在一起,在梯度反向传播时结果就会是漂亮的 $y - t$(softmax 输出减去 one-hot 标签)。

参数的意义跟上面的nn.NLLLoss一样,这里就不多说了

loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)

output = loss(input, target)
output.backward()
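
关于上面提到的 $y - t$:在 reduction='mean' 下,input 的梯度正好是 softmax 输出减去 one-hot 标签再除以 batch 大小,可以用下面的小例子验证(数值仅作示意):

>>> x = torch.randn(3, 5, requires_grad=True)
>>> t = torch.tensor([1, 0, 4])
>>> nn.CrossEntropyLoss()(x, t).backward()
>>> one_hot = F.one_hot(t, num_classes=5).float()
>>> torch.allclose(x.grad, (F.softmax(x.detach(), dim=1) - one_hot) / 3)
True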

优化器

SGD(包含了Momentum以及Nesterov Momentum)

optim.SGD(params, lr=<required parameter>, momentum=0, 
			dampening=0, weight_decay=0, nesterov=False)
  • dampening (float, optional) – dampening for momentum (default: 0)
    疑问:这个 dampening 是干啥的?等看源码时再解答
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# 每次优化之前都要先清空梯度
optimizer.zero_grad()
loss.backward()
optimizer.step()

Adagrad

optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, 
			initial_accumulator_value=0, eps=1e-10)
  • lr (float, optional) – learning rate (default: 1e-2)
  • lr_decay (float, optional) – learning rate decay (default: 0)

RMSProp

optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, 
			weight_decay=0, momentum=0, centered=False)
  • alpha (float, optional) – smoothing constant (default: 0.99)
  • momentum (float, optional) – momentum factor (default: 0)
  • centered (bool, optional) – if True, compute the centered RMSProp, the gradient is normalized by an estimation of its variance

这个alpha应该就是RMSProp中遗忘过去梯度的动量参数,那么这个momentum又是什么?同样也只能等看了源码再解答

Adadelta

optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
  • lr (float, optional) – coefficient that scale delta before it is applied to the parameters (default: 1.0) 按照Adadelta原公式的话应该是不用lr的,这里却有lr参数,还是需要阅读源码后再解答
  • rho (float, optional) – coefficient used for computing a running average of squared gradients (default: 0.9)

Adam

optim.Adam(params, lr=0.001, betas=(0.9, 0.999), 
			eps=1e-08, weight_decay=0, amsgrad=False)
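
使用方式与 SGD 相同(下面的 net 与 loss 仅为示意):

optimizer = optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)

optimizer.zero_grad()
loss.backward()
optimizer.step()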