CNN Principles

CNN (Convolutional Neural Network)


Reference used in writing this article

Brief overview

[Figures: network topology diagram (left) and ConvNet model diagram (right)]

Left: A regular 3-layer Neural Network. Right: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).

A ConvNet is made up of Layers. Every Layer has a simple API: it transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters.

A simple convolutional neural network (ConvNet) is a sequence of layers. Three main types of layers are used to build one: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer.

  • Example architecture: a ConvNet for CIFAR-10 classification. We will go into more details below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail:
    • INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.

    • CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in volume such as [32x32x12] if we decided to use 12 filters.

    • RELU layer will apply an elementwise activation function, such as the $\max(0,x)$ thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).

    • POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in volume such as [16x16x12].

    • FC (i.e. fully-connected) layer will compute the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume.
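The shape bookkeeping in this example can be traced with a few lines of Python. This is only a sketch: the text does not state the CONV filter size or the POOL window, so the 3x3 filter with stride 1 and padding 1, and the 2x2 pool with stride 2, are assumptions chosen to reproduce the [32x32x12] and [16x16x12] volumes quoted above. The function name `trace_cifar10_shapes` is made up for illustration.

```python
def trace_cifar10_shapes(num_filters=12, num_classes=10):
    """Return (layer name, output shape) pairs for INPUT-CONV-RELU-POOL-FC."""
    w, h, d = 32, 32, 3
    shapes = [("INPUT", (w, h, d))]

    # CONV: assumed 3x3 filters, stride 1, zero-padding 1 (keeps width/height).
    f, s, p = 3, 1, 1
    w = (w - f + 2 * p) // s + 1
    h = (h - f + 2 * p) // s + 1
    d = num_filters  # output depth equals the number of filters
    shapes.append(("CONV", (w, h, d)))

    # RELU: elementwise max(0, x), so the shape does not change.
    shapes.append(("RELU", (w, h, d)))

    # POOL: assumed 2x2 window with stride 2 (halves width and height).
    f, s = 2, 2
    w = (w - f) // s + 1
    h = (h - f) // s + 1
    shapes.append(("POOL", (w, h, d)))

    # FC: one score per class.
    shapes.append(("FC", (1, 1, num_classes)))
    return shapes


for name, shape in trace_cifar10_shapes():
    print(name, shape)
```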

In summary

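  • A ConvNet is a sequence of layers, each of which transforms one volume into another.
  • The layer types in common use today are CONV, POOL, RELU and FC.
  • Each layer performs this volume-to-volume transformation with a different function.
  • Some layers have parameters (e.g. CONV/FC); others do not (e.g. RELU/POOL).
  • Some layers have hyperparameters (e.g. CONV/FC/POOL); others do not (e.g. RELU).

Parameters are normally learned automatically from the training set while the model trains. Hyperparameters are normally set by hand before training, with the aim of making the model perform better once its parameters are trained.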

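Convolutional Layer

The convolutional layer is the core building block of a ConvNet and accounts for most of its computation.

filter on a layer

  • filter: the convolution kernel, typically of a size such as [5x5x3].

  • receptive field: the spatial extent of the filter (its width by its height), e.g. [5x5] in the example above.

  • Local Connectivity: each neuron connects only to a local region of the input volume.

  • Two examples of counting the connections (weights):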

    Example 1. Suppose that the input volume has size [32x32x3] (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.

    Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
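Both counts follow from the same rule: the connectivity is local in width and height but spans the full input depth. A minimal sketch, with a function name (`weights_per_neuron`) invented for this note:

```python
def weights_per_neuron(receptive_field, input_depth):
    """Weights of one CONV neuron: an F x F window across the full input depth."""
    return receptive_field * receptive_field * input_depth


# Example 1: 5x5 receptive field on a [32x32x3] input.
print(weights_per_neuron(5, 3))   # 75
# Example 2: 3x3 receptive field on a [16x16x20] input.
print(weights_per_neuron(3, 20))  # 180
```

Note that the input width and height do not appear in the count; only the filter size and the input depth matter.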

Spatial arrangement

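Three hyperparameters control the size of the output volume: the depth, the stride and the zero-padding.

  • depth: the depth of the output volume, i.e. the number of filters
  • stride: the step with which the filter slides over the input, usually 1 or 2
  • zero-padding: padding the border of the input with zeros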

Computing the size of the output volume

  • $W$: size of the input

  • $F$: size of the filter

  • $P$: amount of zero padding

  • $S$: stride

The size of the output volume is $(W - F + 2P)/S + 1$.

In general, setting zero padding to be $P = (F - 1)/2$ when the stride is $S = 1$ ensures that the input volume and output volume will have the same size spatially.

Constraints on strides. The value computed from these four quantities must be an integer.
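The formula and the integer constraint can be checked together. The sketch below is an illustration only; the helper names `conv_output_size` and `same_padding` are made up here, and the choice to raise an error on a non-integer result is one possible policy (a library might instead crop or pad).

```python
def conv_output_size(w, f, p, s):
    """Spatial output size (W - F + 2P)/S + 1, rejecting non-integer results."""
    numerator = w - f + 2 * p
    if numerator % s != 0:
        raise ValueError(
            f"W={w}, F={f}, P={p}, S={s}: filter does not fit an integer number of times"
        )
    return numerator // s + 1


def same_padding(f):
    """Padding P = (F - 1)/2 that preserves the spatial size when S = 1 (odd F only)."""
    if f % 2 == 0:
        raise ValueError("P = (F - 1)/2 is only an integer for odd F")
    return (f - 1) // 2


print(conv_output_size(32, 5, same_padding(5), 1))  # 32: size preserved
print(conv_output_size(5, 3, 1, 2))                 # 3
try:
    conv_output_size(10, 3, 0, 2)                   # (10 - 3 + 0)/2 + 1 = 4.5
except ValueError as err:
    print("invalid:", err)
```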

Parameter Sharing
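If every value in the output volume were computed with its own set of weights, the number of parameters would be enormous. Therefore all neurons in the same depth slice of the output volume share a single filter.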

As a real-world example, take the first Conv Layer of AlexNet, whose output volume is [55x55x96] and whose filters have size [11x11x3]. There are 55*55*96 = 290,400 neurons in this layer, and each has 11*11*3 = 363 weights and 1 bias. Without parameter sharing, this adds up to 290,400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high.

With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (+96 biases).
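The two totals can be reproduced directly from the numbers in the text (a 55x55x96 output volume and 11x11x3 filters). The function names below are invented for this sketch.

```python
def params_without_sharing(out_w, out_h, k, f, d_in):
    """Every output neuron owns its own F*F*D_in weights plus one bias."""
    neurons = out_w * out_h * k
    return neurons * (f * f * d_in + 1)


def params_with_sharing(k, f, d_in):
    """One filter (F*F*D_in weights) and one bias per depth slice."""
    return k * (f * f * d_in) + k


print(params_without_sharing(55, 55, 96, 11, 3))  # 105705600
print(params_with_sharing(96, 11, 3))             # 34944
```

Sharing removes the dependence on the output width and height entirely, which is where the roughly 3000x reduction comes from in this case (55*55 = 3025 positions per depth slice).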

Summary. The Conv Layer:

  • Accepts a volume of size $W_1 \times H_1 \times D_1$
  • Requires four hyperparameters:
    • the number of filters $K$,
    • their spatial extent $F$,
    • the stride $S$,
    • the amount of zero padding $P$.
  • Produces a volume of size $W_2 \times H_2 \times D_2$ where:
    • $W_2 = (W_1 - F + 2P)/S + 1$
    • $H_2 = (H_1 - F + 2P)/S + 1$ (i.e. width and height are computed equally by symmetry)
    • $D_2 = K$
  • With parameter sharing, it introduces $F \cdot F \cdot D_1$ weights per filter, for a total of $(F \cdot F \cdot D_1) \cdot K$ weights and $K$ biases.
  • In the output volume, the $d$-th depth slice (of size $W_2 \times H_2$) is the result of performing a valid convolution of the $d$-th filter over the input volume with a stride of $S$, and then offset by the $d$-th bias.

A common setting of the hyperparameters is $F=3, S=1, P=1$. However, there are common conventions and rules of thumb that motivate these hyperparameters. See the ConvNet architectures section below.
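To make the summary concrete, here is a deliberately naive forward pass written with plain nested lists. It is a sketch for reading, not an efficient or official implementation: the [row][col][channel] layout, the function name `conv_forward`, and the decision to raise on non-integer output sizes are all assumptions of this example. As is usual for ConvNets, the filter is not flipped, so strictly speaking this computes a cross-correlation.

```python
def conv_forward(x, filters, biases, stride=1, pad=0):
    """Naive CONV forward pass.

    x:       input volume as nested lists indexed [row][col][channel] (H1 x W1 x D1)
    filters: K filters, each indexed [row][col][channel] (F x F x D1)
    biases:  K bias values, one per filter
    Returns the output volume indexed [row][col][k] (H2 x W2 x K).
    """
    h1, w1, d1 = len(x), len(x[0]), len(x[0][0])
    k, f = len(filters), len(filters[0])
    if (h1 - f + 2 * pad) % stride or (w1 - f + 2 * pad) % stride:
        raise ValueError("output size (W1 - F + 2P)/S + 1 is not an integer")
    h2 = (h1 - f + 2 * pad) // stride + 1
    w2 = (w1 - f + 2 * pad) // stride + 1

    def at(r, c, ch):
        # Zero padding: any position outside the input reads as 0.
        if 0 <= r < h1 and 0 <= c < w1:
            return x[r][c][ch]
        return 0.0

    out = [[[0.0] * k for _ in range(w2)] for _ in range(h2)]
    for i in range(h2):
        for j in range(w2):
            for n in range(k):  # the n-th depth slice uses the n-th filter everywhere
                total = biases[n]
                for di in range(f):
                    for dj in range(f):
                        for ch in range(d1):
                            total += filters[n][di][dj][ch] * at(
                                i * stride - pad + di, j * stride - pad + dj, ch
                            )
                out[i][j][n] = total
    return out


# Usage: a 4x4x1 input of ones and a single 3x3x1 filter of ones, F=3, S=1, P=1.
ones_input = [[[1.0] for _ in range(4)] for _ in range(4)]
ones_filter = [[[1.0] for _ in range(3)] for _ in range(3)]
result = conv_forward(ones_input, [ones_filter], [0.0], stride=1, pad=1)
print(len(result), len(result[0]), len(result[0][0]))  # 4 4 1
```

With this input each output value simply counts how many filter taps land inside the image: 4 at a corner, 6 on an edge, 9 in the interior.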

Pooling Layer

Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting.

The pooling layer:

  • Accepts a volume of size $W_1 \times H_1 \times D_1$
  • Requires two hyperparameters:
    • the spatial extent $F$,
    • the stride $S$.
  • Produces a volume of size $W_2 \times H_2 \times D_2$ where:
    • $W_2 = (W_1 - F)/S + 1$
    • $H_2 = (H_1 - F)/S + 1$
    • $D_2 = D_1$
  • Introduces zero parameters since it computes a fixed function of the input
  • For Pooling layers, it is not common to pad the input using zero-padding.

It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: a pooling layer with $F=3, S=2$ (also called overlapping pooling), and more commonly $F=2, S=2$. Pooling sizes with larger receptive fields are too destructive.
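A max pooling layer following the rules above can be sketched in a few lines. The nested-list [row][col][channel] layout and the name `max_pool` are assumptions of this example; each depth slice is pooled independently, so the depth is unchanged.

```python
def max_pool(x, f=2, stride=2):
    """Max pooling over an H1 x W1 x D1 volume stored as [row][col][channel]."""
    h1, w1, d1 = len(x), len(x[0]), len(x[0][0])
    if (h1 - f) % stride or (w1 - f) % stride:
        raise ValueError("output size (W1 - F)/S + 1 is not an integer")
    h2 = (h1 - f) // stride + 1
    w2 = (w1 - f) // stride + 1
    return [
        [
            [
                max(
                    x[i * stride + di][j * stride + dj][ch]
                    for di in range(f)
                    for dj in range(f)
                )
                for ch in range(d1)
            ]
            for j in range(w2)
        ]
        for i in range(h2)
    ]


# Usage: a 4x4x1 volume holding 1..16 in row-major order, pooled with F=2, S=2.
grid = [[[r * 4 + c + 1] for c in range(4)] for r in range(4)]
pooled = max_pool(grid)
print(pooled)  # [[[6], [8]], [[14], [16]]]
```

The same function covers the overlapping variant by calling it with `f=3, stride=2` on an input whose size satisfies the integer constraint (for example 5x5 or 7x7).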

Other pooling functions are also used, such as L2-norm pooling and average pooling, but in practice they have not worked as well as max pooling.

Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs). It seems likely that future architectures will feature very few to no pooling layers.

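Normalization Layer

In practice these layers have been found to contribute little, if anything, so they have gradually fallen out of favor. For various types of normalizations, see the discussion in Alex Krizhevsky's cuda-convnet library API.

Fully-connected Layer

I did not fully follow what this section of the reference is trying to say...

An FC layer can be implemented as a Conv layer, because both fundamentally compute dot products. Does it also work better that way?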
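The equivalence can at least be checked numerically. The sketch below uses toy sizes (a 2x2x3 volume and 4 output neurons, chosen only for illustration) and hypothetical helper names. It compares an FC layer with a CONV layer whose filter covers the whole input ($F$ equal to the input size, $P=0$, $S=1$), which yields a 1x1xK output.

```python
import random


def fc_forward(volume, weights, biases):
    """FC layer: flatten the H x W x D volume, then one dot product per neuron."""
    flat = [v for row in volume for cell in row for v in cell]
    return [sum(w * v for w, v in zip(ws, flat)) + b for ws, b in zip(weights, biases)]


def as_filter(ws, h, w, d):
    """Reshape one FC weight row into an H x W x D filter (same flattening order)."""
    it = iter(ws)
    return [[[next(it) for _ in range(d)] for _ in range(w)] for _ in range(h)]


def conv_full_extent(volume, filters, biases):
    """CONV with F equal to the input size, P=0, S=1: the output is 1 x 1 x K."""
    scores = []
    for filt, b in zip(filters, biases):
        total = b
        for r, row in enumerate(volume):
            for c, cell in enumerate(row):
                for ch, v in enumerate(cell):
                    total += filt[r][c][ch] * v
        scores.append(total)
    return scores


rng = random.Random(0)
H, W, D, K = 2, 2, 3, 4  # toy sizes, chosen only for illustration
volume = [[[rng.uniform(-1, 1) for _ in range(D)] for _ in range(W)] for _ in range(H)]
weights = [[rng.uniform(-1, 1) for _ in range(H * W * D)] for _ in range(K)]
biases = [rng.uniform(-1, 1) for _ in range(K)]

fc_scores = fc_forward(volume, weights, biases)
conv_scores = conv_full_extent(volume, [as_filter(ws, H, W, D) for ws in weights], biases)
print(all(abs(a - b) < 1e-9 for a, b in zip(fc_scores, conv_scores)))
```

Since the two layers produce the same scores from the same weights, the conversion by itself does not change accuracy; whatever benefit there is would have to come from how the converted network is applied, not from the numbers it computes on a single input.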

ConvNet Architectures

The most common ConvNet architecture follows the pattern:

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

where the * indicates repetition, and the POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3). For example, here are some common ConvNet architectures you may see that follow this pattern:

  • INPUT -> FC, implements a linear classifier. Here N = M = K = 0.
  • INPUT -> CONV -> RELU -> FC
  • INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC. Here we see that there is a single CONV layer between every POOL layer.
  • INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC. Here we see two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.
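The pattern is easy to expand mechanically. In this sketch the function name `build_architecture` and the `pool` flag (standing in for the optional `POOL?`) are inventions of this note; it only produces layer names, not a trainable network.

```python
def build_architecture(n, m, k, pool=True):
    """Expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC."""
    layers = ["INPUT"]
    for _ in range(m):
        layers += ["CONV", "RELU"] * n
        if pool:
            layers.append("POOL")
    layers += ["FC", "RELU"] * k
    layers.append("FC")
    return layers


print(" -> ".join(build_architecture(0, 0, 0)))  # INPUT -> FC (linear classifier)
print(" -> ".join(build_architecture(1, 2, 1)))
print(" -> ".join(build_architecture(2, 3, 2)))
```

The second and third calls reproduce the last two bullet examples above; the `INPUT -> CONV -> RELU -> FC` example corresponds to `build_architecture(1, 1, 0, pool=False)`.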

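Prefer a stack of small filter CONV to one large receptive field CONV layer. A stack of small filters covers the same effective receptive field as a single large filter, but it contains non-linearities that make the features more expressive, and it uses fewer parameters.

In practice, rather than designing or training a network from scratch, it is more effective to look at whichever architecture currently performs well on ImageNet, download a pretrained model, and fine-tune it on your own data.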
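The parameter claim can be checked with a worked case: three stacked 3x3 CONV layers versus one 7x7 CONV layer, both with stride 1. The channel count `C = 64` is a hypothetical value picked for illustration, and the sketch assumes every layer has `C` input and `C` output channels and ignores biases.

```python
def stacked_receptive_field(num_layers, f):
    """Effective receptive field of num_layers stacked f x f CONV layers (stride 1)."""
    return 1 + num_layers * (f - 1)


def conv_weights(f, channels):
    """Weights of one f x f CONV layer with `channels` inputs and outputs (no biases)."""
    return f * f * channels * channels


C = 64  # hypothetical channel count
stack_params = 3 * conv_weights(3, C)   # 3 * (3*3*C*C) = 27 * C^2
single_params = conv_weights(7, C)      # 7*7*C*C      = 49 * C^2

print(stacked_receptive_field(3, 3))    # 7, same as one 7x7 filter
print(stack_params, single_params)      # 110592 200704
```

The stack also applies a non-linearity after each of its three layers, whereas the single 7x7 layer computes one linear function of its input window.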

Layer Sizing Patterns

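Choosing the hyperparameters in practice:

  • input layer

    The input size should be divisible by 2 at least twice. Common values include 32, 64, 96, 224, 384 and 512.

  • conv layer

    Usually use 3×3 or 5×5 filters with stride $S=1$, padding with $P=1$ for $F=3$ or $P=2$ for $F=5$. 7×7 filters are generally only used in the first layer.

  • pool layer

    Usually max pooling with a 2×2 receptive field and a stride of 2. A less common setting is a 3×3 receptive field with a stride of 2.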
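The divisibility advice for the input layer matters because every 2x2, stride-2 pooling layer halves the width and height. A small sketch (the helper name `halvings` is made up here) counts how many such halvings each of the common input sizes allows before the size becomes odd:

```python
def halvings(size):
    """How many times `size` can be halved exactly, i.e. how many 2x2/stride-2 pools fit."""
    count = 0
    while size > 1 and size % 2 == 0:
        size //= 2
        count += 1
    return count


common_sizes = [32, 64, 96, 224, 384, 512]
halving_counts = {size: halvings(size) for size in common_sizes}
print(halving_counts)
```

All of the listed sizes allow at least five halvings, which is why they work with architectures that contain several pooling stages.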

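Case studies

  • LeNet.

    The first successful applications of convolutional networks were developed by Yann LeCun in the 1990s. Of these, the best known is the LeNet architecture, which was used to read zip codes, digits, etc.

  • AlexNet.

    Developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton, this was the first work to popularize convolutional networks in computer vision. AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and significantly outperformed the runner-up (a top-5 error of 16% compared to 26%). The network had a very similar architecture to LeNet, but was deeper, bigger, and featured convolutional layers stacked on top of each other (previously it was common to have a single CONV layer always immediately followed by a POOL layer).

  • ZFNet.

    The ILSVRC 2013 winner was a convolutional network from Matthew Zeiler and Rob Fergus. It became known as ZFNet (short for Zeiler & Fergus Net). It improved on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and by making the stride and filter size of the first layer smaller.

  • GoogLeNet.

    The ILSVRC 2014 winner was a convolutional network from Szegedy et al. at Google. Its main contribution was the development of an Inception module that dramatically reduced the number of parameters in the network (4M, compared to 60M for AlexNet). Additionally, the paper uses average pooling instead of fully connected layers at the top of the ConvNet, eliminating a large number of parameters that do not seem to matter much. There are also several follow-up versions of GoogLeNet, most recently Inception-v4.

  • VGGNet

    The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman that became known as VGGNet. Its main contribution was in showing that the depth of the network is a critical component of good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from beginning to end. Their pretrained model is available for plug-and-play use in Caffe. A downside of VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters (140M). Most of these parameters are in the first fully connected layer, and it has since been found that these FC layers can be removed with no loss in performance, significantly reducing the number of necessary parameters.

  • ResNet. Residual Network

    The Residual Network developed by Kaiming He et al. was the winner of ILSVRC 2015. It features special skip connections and a heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network. The reader is also referred to Kaiming's presentation (video, slides), and to some recent experiments that reproduce these networks in Torch. ResNets are currently the state-of-the-art convolutional neural network models and are the default choice for using ConvNets in practice (as of May 10, 2016). In particular, also see more recent developments that tweak the original architecture, from Kaiming He et al.: Identity Mappings in Deep Residual Networks (published March 2016).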

VGGNet in detail. Let's break down the VGGNet in more detail as a case study. The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding). We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights:

INPUT: [224x224x3]        memory:  224*224*3=150K   weights: 0
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*64)*64 = 36,864
POOL2: [112x112x64]  memory:  112*112*64=800K   weights: 0
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*128)*128 = 147,456
POOL2: [56x56x128]  memory:  56*56*128=400K   weights: 0
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
POOL2: [28x28x256]  memory:  28*28*256=200K   weights: 0
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512]  memory:  14*14*512=100K   weights: 0
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512]  memory:  7*7*512=25K  weights: 0
FC: [1x1x4096]  memory:  4096  weights: 7*7*512*4096 = 102,760,448
FC: [1x1x4096]  memory:  4096  weights: 4096*4096 = 16,777,216
FC: [1x1x1000]  memory:  1000 weights: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

As is common with Convolutional Networks, notice that most of the memory (and also compute time) is used in the early CONV layers, and that most of the parameters are in the last FC layers. In this particular case, the first FC layer contains 100M weights, out of a total of 140M.
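The shape and weight columns of the table above can be regenerated from the layer configuration alone. This is a sketch under the conventions stated in the text (3x3 CONV with stride 1 and pad 1, 2x2 POOL with stride 2); the names `CFG`, `FC_SIZES` and `vgg_summary` are made up for this example, and biases are left out, as they are in the table.

```python
# Integers are CONV3 filter counts, "P" marks a POOL2 layer (as listed in the table).
CFG = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
       512, 512, 512, "P", 512, 512, 512, "P"]
FC_SIZES = [4096, 4096, 1000]


def vgg_summary(size=224, depth=3):
    """Return (name, output shape, weight count) for every layer after INPUT."""
    rows = []
    for item in CFG:
        if item == "P":
            size //= 2  # 2x2 pooling with stride 2 halves width and height
            rows.append(("POOL2", (size, size, depth), 0))
        else:
            weights = 3 * 3 * depth * item  # (F*F*D_in) * K, no biases
            depth = item
            rows.append((f"CONV3-{item}", (size, size, depth), weights))
    fan_in = size * size * depth  # 7*7*512 going into the first FC layer
    for n in FC_SIZES:
        rows.append(("FC", (1, 1, n), fan_in * n))
        fan_in = n
    return rows


rows = vgg_summary()
total_weights = sum(weights for _, _, weights in rows)
fc_weights = sum(weights for name, _, weights in rows if name == "FC")
for name, shape, weights in rows:
    print(f"{name:10s} {str(shape):18s} weights: {weights:,}")
print(f"TOTAL weights: {total_weights:,}")
```

The total comes to 138,344,128 weights, matching the 138M figure, and the three FC layers account for 123,633,664 of them.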
