CNN Principles

CNN (Convolutional Neural Network)


References used in writing this article

A brief overview

[Figure: a regular neural-network topology diagram (left) and a ConvNet model diagram (right)]

Left: A regular 3-layer Neural Network. Right: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).

A ConvNet is made up of Layers. Every Layer has a simple API: it transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters.

A simple ConvNet is just a sequence of layers. Three main types of layers are used to build ConvNet architectures: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer.

  • An example CIFAR-10 classification network (Example Architecture: Overview). We will go into more details below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail:
    • INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.

    • CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in volume such as [32x32x12] if we decided to use 12 filters.

    • RELU layer will apply an elementwise activation function, such as the $ max(0,x) $ thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).

    • POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in volume such as [16x16x12].

    • FC (i.e. fully-connected) layer will compute the class scores, resulting in volume of size [1x1x10], where each of the 10 numbers correspond to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume.
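The shape bookkeeping of this [INPUT - CONV - RELU - POOL - FC] pipeline can be checked with a short script. This is a sketch; the 12 filters (with F=3, S=1, P=1 assumed for the CONV layer) and the 2x2/stride-2 pooling follow the example above:

```python
# Track how each layer of the example CIFAR-10 ConvNet transforms the volume shape.
def conv_out(w, h, d, filters, f, stride, pad):
    # CONV: spatial size (W - F + 2P)/S + 1, output depth = number of filters
    return ((w - f + 2 * pad) // stride + 1,
            (h - f + 2 * pad) // stride + 1,
            filters)

def pool_out(w, h, d, f, stride):
    # POOL: downsample width/height, keep depth unchanged
    return ((w - f) // stride + 1, (h - f) // stride + 1, d)

shape = (32, 32, 3)                                         # INPUT [32x32x3]
shape = conv_out(*shape, filters=12, f=3, stride=1, pad=1)  # CONV -> [32x32x12]
# RELU is elementwise and leaves the shape unchanged
shape = pool_out(*shape, f=2, stride=2)                     # POOL -> [16x16x12]
print(shape)  # (16, 16, 12); the FC layer then maps this to [1x1x10]
```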

In summary

  • A ConvNet is a sequence of layers, each of which transforms one volume into another
  • The most common layer types are CONV / POOL / RELU / FC
  • Each layer implements its volume-to-volume transformation with a differentiable function
  • Some layers have parameters, e.g. CONV / FC; others have none, e.g. RELU / POOL
  • Some layers have hyperparameters, e.g. CONV / FC / POOL; others have none, e.g. RELU

Parameters are values learned automatically from the training data during training. Hyperparameters are set manually before training begins, with the aim of making the trained model perform better.

Convolutional Layer

The convolutional layer is the core building block of a ConvNet and carries the bulk of the computation.

filter on a layer

  • filter: the convolution kernel, typically of size [5x5x3] (width, height, depth).

  • receptive field: the spatial extent of the filter, i.e. its width by height, such as 5x5 in the example above.

  • Local connectivity: each neuron is connected only to a local region of the input volume.

  • Two examples of counting the number of connections or weights:

    Example 1. For example, suppose that the input volume has size [32x32x3] (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.

    Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
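The counts in both examples are just products of the filter dimensions. As a quick sanity check (a minimal sketch; the helper name is mine):

```python
def weights_per_neuron(f, input_depth, bias=True):
    """Number of values one Conv-layer neuron connects to: F*F*depth (+1 bias)."""
    return f * f * input_depth + (1 if bias else 0)

# Example 1: 5x5 filter on a [32x32x3] input -> 5*5*3 = 75 weights, +1 bias
print(weights_per_neuron(5, 3))               # 76
# Example 2: 3x3 filter on a [16x16x20] input -> 3*3*20 = 180 connections
print(weights_per_neuron(3, 20, bias=False))  # 180
```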

Spatial arrangement

Three hyperparameters control the size of the output volume: the depth, the stride, and the zero-padding.

  • depth: the depth of the output volume, i.e. the number of filters
  • stride: the step size of the filter, usually 1 or 2
  • zero-padding: the number of zeros padded around the input border

Computing the size of the output volume:

  • W: size of the input

  • F: size of the filter

  • P: amount of zero-padding

  • S: stride

The size of the output volume is (W − F + 2P)/S + 1.

In general, setting zero padding to P = (F − 1)/2 when the stride is S = 1 ensures that the input volume and output volume will have the same spatial size.

Constraints on strides. The output size computed from these four values must be an integer; otherwise the hyperparameter setting is invalid.
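The formula and its integer constraint can be wrapped in a small helper (a sketch; the function name is mine):

```python
def conv_output_size(w, f, p, s):
    """Spatial output size (W - F + 2P)/S + 1; raise if the stride doesn't fit."""
    numer = w - f + 2 * p
    if numer % s != 0:
        raise ValueError(f"stride {s} does not fit: ({w} - {f} + 2*{p}) "
                         f"is not divisible by {s}")
    return numer // s + 1

print(conv_output_size(32, 5, 2, 1))  # 32: P=(F-1)/2 with S=1 preserves the size
print(conv_output_size(7, 3, 0, 2))   # 3
# conv_output_size(10, 3, 0, 2) would raise: (10 - 3) = 7 is not divisible by 2
```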

Parameter Sharing
If every value in the output volume were computed by its own filter, the number of parameters would be enormous. Instead, all neurons within one depth slice of the output volume share the same filter (the same weights and bias).

Using a real-world example (the first AlexNet layer: a [227x227x3] input with 96 filters of size 11x11 at stride 4, giving a [55x55x96] output), we see that there are 55*55*96 = 290,400 neurons in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias. Together, this adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high.

With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (+96 biases).
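Both counts in this parameter-sharing example (an AlexNet-style first conv layer, as described above) reduce to simple arithmetic:

```python
neurons = 55 * 55 * 96            # neurons in the first Conv layer's output volume
per_neuron = 11 * 11 * 3 + 1      # weights + bias per neuron
print(neurons * per_neuron)       # 105,705,600 parameters without sharing

shared = 96 * (11 * 11 * 3) + 96  # one filter (plus one bias) per depth slice
print(shared)                     # 34,944 parameters with sharing
```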

Summary. The Conv Layer:

  • Accepts a volume of size $W_1 × H_1 × D_1$
  • Requires four hyperparameters:
    • Number of filters $K$,
    • their spatial extent $F$,
    • the stride $S$,
    • the amount of zero padding $P$.
  • Produces a volume of size $W_2 × H_2 × D_2$ where:
    • $W_2 = (W_1 − F + 2P)/S + 1$
    • $H_2 = (H_1 − F + 2P)/S + 1$ (i.e. width and height are computed equally by symmetry)
    • $D_2 = K$
  • With parameter sharing, it introduces $F⋅F⋅D_1$ weights per filter, for a total of $(F⋅F⋅D_1)⋅K$ weights and $K$ biases.
  • In the output volume, the $d$-th depth slice (of size $W_2 × H_2$) is the result of performing a valid convolution of the $d$-th filter over the input volume with a stride of $S$, and then offset by the $d$-th bias.

A common setting of the hyperparameters is $F = 3, S = 1, P = 1$. However, there are common conventions and rules of thumb that motivate these hyperparameters. See the ConvNet architectures section below.
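The summary above can be turned into a minimal, deliberately unvectorized forward pass of one Conv layer. This is a sketch (numpy assumed; shapes and names are mine), not an efficient implementation:

```python
import numpy as np

def conv_forward(x, w, b, stride=1, pad=1):
    """Naive convolution. x: (H1, W1, D1); w: (K, F, F, D1); b: (K,)."""
    h1, w1, d1 = x.shape
    k, f, _, _ = w.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero-pad width and height only
    h2 = (h1 - f + 2 * pad) // stride + 1
    w2 = (w1 - f + 2 * pad) // stride + 1
    out = np.empty((h2, w2, k))
    for i in range(h2):
        for j in range(w2):
            patch = xp[i*stride:i*stride+f, j*stride:j*stride+f, :]
            for d in range(k):
                # the d-th depth slice uses the d-th filter, offset by the d-th bias
                out[i, j, d] = np.sum(patch * w[d]) + b[d]
    return out

x = np.random.randn(32, 32, 3)
w = np.random.randn(12, 3, 3, 3)  # K=12 filters of spatial extent F=3
out = conv_forward(x, w, np.zeros(12), stride=1, pad=1)
print(out.shape)                  # (32, 32, 12)
```

Note how the loop bounds are exactly the $(W_1 − F + 2P)/S + 1$ formula, and the innermost line is the per-depth-slice convolution-plus-bias from the summary.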

Pooling Layer

Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting.

The pooling layer:

  • Accepts a volume of size $W_1 × H_1 × D_1$
  • Requires two hyperparameters:
    • their spatial extent $F$,
    • the stride $S$.
  • Produces a volume of size $W_2 × H_2 × D_2$ where:
    • $W_2 = (W_1 − F)/S + 1$
    • $H_2 = (H_1 − F)/S + 1$
    • $D_2 = D_1$
  • Introduces zero parameters since it computes a fixed function of the input
  • For Pooling layers, it is not common to pad the input using zero-padding.

It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: a pooling layer with $F = 3, S = 2$ (also called overlapping pooling), and more commonly $F = 2, S = 2$. Pooling sizes with larger receptive fields are too destructive.

Other pooling functions, such as the L2 norm or average pooling, have also been used, but in practice max pooling has proven to work better.
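A sketch of the most common setting, 2x2 max pooling with stride 2 (numpy assumed; the function name is mine):

```python
import numpy as np

def max_pool(x, f=2, stride=2):
    """Max pooling over a (H, W, D) volume; the depth dimension is untouched."""
    h, w, d = x.shape
    h2, w2 = (h - f) // stride + 1, (w - f) // stride + 1
    out = np.empty((h2, w2, d))
    for i in range(h2):
        for j in range(w2):
            window = x[i*stride:i*stride+f, j*stride:j*stride+f, :]
            out[i, j, :] = window.max(axis=(0, 1))  # max over the spatial window
    return out

x = np.arange(16.0).reshape(4, 4, 1)
print(max_pool(x)[:, :, 0])  # [[ 5.  7.] [13. 15.]]
```

Swapping `window.max` for `window.mean` gives average pooling; as noted above, max pooling is what works best in practice.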

Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs). It seems likely that future architectures will feature very few to no pooling layers.

Normalization Layer

Normalization layers have been shown to have little effect in practice, so they have gradually been abandoned, if they are used at all. For various types of normalizations, see the discussion in Alex Krizhevsky's cuda-convnet library API.

Fully-Connected Layer

An FC layer can be implemented as a Conv layer, since both ultimately compute dot products between weights and inputs: an FC layer is equivalent to a Conv layer whose filter size equals the spatial size of its input. This conversion is useful in practice because it lets a trained network slide efficiently over a larger image in a single forward pass.

ConvNet Architectures

The most common ConvNet architecture follows the pattern:

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

where the * indicates repetition, and the POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3). For example, here are some common ConvNet architectures you may see that follow this pattern:

  • INPUT -> FC, implements a linear classifier. Here N = M = K = 0.
  • INPUT -> CONV -> RELU -> FC
  • INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC. Here we see that there is a single CONV layer between every POOL layer.
  • INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC Here we see two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.

Prefer a stack of small-filter CONV layers to one CONV layer with a large receptive field. For the same effective receptive field, a stack of small filters contains non-linearities between the layers that make the features more expressive, and it uses fewer parameters.
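For example, three stacked 3x3 CONV layers see the same 7x7 effective receptive field as a single 7x7 layer but use fewer weights (a sketch; C = 64 channels throughout is my illustrative choice, biases ignored):

```python
C = 64                           # channels in and out of every layer (illustrative)
three_3x3 = 3 * (3 * 3 * C * C)  # 27*C^2 weights, with 3 interleaved non-linearities
one_7x7 = 7 * 7 * C * C          # 49*C^2 weights, one linear map before the ReLU
print(three_3x3, one_7x7)        # 110592 200704
```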

In practice, an effective approach is to look at whichever architectures currently work best on ImageNet, download a pretrained model, and fine-tune it on your own data, rather than designing or training a network from scratch.

Layer Sizing Patterns

Concrete guidance on choosing the hyperparameters:

  • input layer

    The input layer size should be divisible by 2 many times; common sizes include 32, 64, 96, 224, 384, and 512.

  • conv layer

    Conv layers typically use small filters of 3×3 or 5×5 with stride S = 1, padded so that the spatial size is preserved: F = 3 with P = 1, or F = 5 with P = 2. Larger filters such as 7×7 are generally only seen on the first layer.

  • pool layer

    Pooling layers usually use the max function with a 2×2 receptive field and a stride of 2; a less common setting is a 3×3 receptive field with a stride of 2.

Case studies

  • LeNet.

    Yann LeCun developed the first successful applications of convolutional networks in the 1990s. The best known of these is the LeNet architecture, which was used to read zip codes, digits, and so on.

  • AlexNet.

    Developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton, this was the first work to popularize convolutional networks in computer vision. AlexNet competed in the ImageNet ILSVRC challenge in 2012 and significantly outperformed the runner-up (a top-5 error of 16% versus 26%). The network had an architecture very similar to LeNet, but was deeper and bigger, and featured convolutional layers stacked on top of each other (previously it was common to have only a single CONV layer, always immediately followed by a POOL layer).

  • ZFNet.

    The ILSVRC 2013 winner was a convolutional network from Matthew Zeiler and Rob Fergus, known as ZFNet (short for Zeiler & Fergus Net). It improved on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size of the first layer smaller.

  • GoogLeNet.

    The ILSVRC 2014 winner was a convolutional network from Szegedy et al. at Google. Its main contribution was the development of the Inception module, which dramatically reduced the number of parameters in the network (4M, compared to AlexNet's 60M). Additionally, the paper used average pooling instead of fully-connected layers at the top of the ConvNet, eliminating a large number of parameters that do not seem to matter much. There are several follow-up versions of GoogLeNet, most recently Inception-v4.

  • VGGNet.

    The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman, known as VGGNet. Its main contribution was showing that network depth is a critical factor for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that performs only 3x3 convolutions and 2x2 pooling from beginning to end. Their pretrained model is available for plug-and-play use in Caffe. A downside of VGGNet is that it is more expensive to evaluate and uses much more memory and parameters (140M). Most of these parameters are in the first fully-connected layer, and it was since found that these FC layers can be removed with no performance downgrade, significantly reducing the number of necessary parameters.

  • ResNet. Residual Network

    Residual networks, developed by Kaiming He et al., won ILSVRC 2015. They feature special skip connections and heavy use of batch normalization. The architecture also omits fully-connected layers at the end of the network. The reader is referred to Kaiming's presentation (video, slides) and to recent experiments reproducing these networks in Torch. ResNets are the state-of-the-art convolutional neural network models and the default choice for using ConvNets in practice (as of May 10, 2016). See also the more recent developments that tweak the original architecture in Kaiming He et al., Identity Mappings in Deep Residual Networks (published March 2016).

VGGNet in detail. As an example, let's break down the VGGNet as a case study. The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding). We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights:

INPUT: [224x224x3]        memory:  224*224*3=150K   weights: 0
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*64)*64 = 36,864
POOL2: [112x112x64]  memory:  112*112*64=800K   weights: 0
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*128)*128 = 147,456
POOL2: [56x56x128]  memory:  56*56*128=400K   weights: 0
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
POOL2: [28x28x256]  memory:  28*28*256=200K   weights: 0
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512]  memory:  14*14*512=100K   weights: 0
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512]  memory:  7*7*512=25K  weights: 0
FC: [1x1x4096]  memory:  4096  weights: 7*7*512*4096 = 102,760,448
FC: [1x1x4096]  memory:  4096  weights: 4096*4096 = 16,777,216
FC: [1x1x1000]  memory:  1000 weights: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

As is common with Convolutional Networks, notice that most of the memory (and also compute time) is used in the early CONV layers, and that most of the parameters are in the last FC layers. In this particular case, the first FC layer contains 100M weights, out of a total of 140M.
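The parameter total from the table can be reproduced programmatically. This is a sketch (biases omitted, matching the per-layer weight counts in the table; the layer-list encoding is mine):

```python
# Output depth of each VGGNet layer in order; 'P' marks a 2x2/stride-2 pool.
layers = ['C64', 'C64', 'P', 'C128', 'C128', 'P', 'C256', 'C256', 'C256', 'P',
          'C512', 'C512', 'C512', 'P', 'C512', 'C512', 'C512', 'P']
depth, params = 3, 0
for l in layers:
    if l != 'P':                     # pooling layers introduce zero parameters
        k = int(l[1:])
        params += 3 * 3 * depth * k  # weights of a 3x3 conv layer (biases omitted)
        depth = k
for fan_in, fan_out in [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]:
    params += fan_in * fan_out       # fully-connected weights
print(params)                        # 138,344,128 ~ 138M, matching the total above
```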
