Classic Networks -- Image Classification (01 AlexNet / NIN / VGG) (continuously updated)

Recently, the members of our lab group decided to study classic network models on a regular schedule. This blog was started for that purpose and will be updated continuously with our study notes and my own understanding of the various classic networks. Where anything is lacking or misunderstood, readers are welcome to question and criticize, and I will gladly make improvements. I hope we can discuss, learn, and make progress together.

Series contents:

Classic Networks -- Image Classification (01 AlexNet / NIN / VGG)

Classic Networks -- Image Classification (02 Inception v1-v4) (in progress)

Classic Networks -- Image Classification (03 ResNet v1-v2)

 

Table of Contents

Series contents:

Classic Networks -- Image Classification (01 AlexNet / NIN / VGG)

AlexNet 

The Architecture

Reducing Overfitting

NIN (Network In Network)

Introduction

Network In Network

VGG

Convnet Configurations

Training


 

Classic Networks -- Image Classification (01 AlexNet / NIN / VGG)

This part covers AlexNet, NIN (Network in Network), and VGG.


AlexNet 


[paper] ImageNet Classification with Deep Convolutional Neural Networks

[pytorch] https://pytorch.org/hub/pytorch_vision_alexnet/

Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities
between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts
at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and
the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–
4096–4096–1000.

The Architecture

It contains eight learned layers — five convolutional and three fully-connected.

  • ReLU Nonlinearity

The standard way to model a neuron’s output f as a function of its input \small x is with \small f(x) = \tanh(x) or \small f(x) = (1 + e^{-x})^{-1} . In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity \small f(x) = \max(0, x). Following Nair and Hinton [20], we refer to neurons with this nonlinearity as Rectified Linear Units (ReLUs). Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units. This is demonstrated in Figure 1. This plot shows that we would not have been able to experiment with such large neural networks for this work if we had used traditional saturating neuron models.

  

Figure 1: A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.

 

Key points:

Figure 1 shows that the ReLU network (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than the tanh network (dashed line); in other words, with traditional saturating neuron models it would not have been possible to experiment with networks this large.
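As a quick, minimal illustration of the three nonlinearities mentioned above (a sketch with arbitrary input values, not from the paper):

```python
import torch

x = torch.linspace(-3.0, 3.0, steps=7)

tanh_out = torch.tanh(x)        # saturating: f(x) = tanh(x)
sigmoid_out = torch.sigmoid(x)  # saturating: f(x) = (1 + e^{-x})^{-1}
relu_out = torch.relu(x)        # non-saturating: f(x) = max(0, x)

# For large |x| the tanh/sigmoid outputs flatten out (their gradients vanish),
# while the ReLU gradient stays 1 for every x > 0.
print(tanh_out, sigmoid_out, relu_out, sep="\n")
```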

We are not the first to consider alternatives to traditional neuron models in CNNs. For example, Jarrett et al. [11] claim that the nonlinearity \small f(x) = |\tanh(x)|  works particularly well with their type of contrast normalization followed by local average pooling on the Caltech-101 dataset. However, on this dataset the primary concern is preventing overfitting, so the effect they are observing is different from the accelerated ability to fit the training set which we report when using ReLUs. Faster learning has a great influence on the performance of large models trained on large datasets.

This paper is not the first to consider changes to the traditional neuron model; for example, [What is the best multi-stage architecture for object recognition?] proposed the nonlinearity \small f(x) = |\tanh(x)|.

The authors make the point that faster learning has a great influence on the performance of large models trained on large datasets.

 

  • Training on Multiple GPUs

The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU. Choosing the pattern of connectivity is a problem for cross-validation, but this allows us to precisely tune the amount of communication until it is an acceptable fraction of the amount of computation.

This scheme reduces our top-1 and top-5 error rates by 1.7% and 1.2%, respectively, as compared with a net with half as many kernels in each convolutional layer trained on one GPU. The two-GPU net takes slightly less time to train than the one-GPU net.

Key points: each GPU holds half of the kernels (or neurons), and the GPUs communicate only in certain layers. For example, the kernels of layer 3 take input from all kernel maps in layer 2, whereas the kernels of layer 4 take input only from the layer-3 kernel maps residing on the same GPU. Choosing the connectivity pattern is a matter of cross-validation, but it allows the amount of communication to be tuned precisely until it is an acceptable fraction of the amount of computation.

 

  • Local Response Normalization

Denoting by \small a_{x, y}^{i} the activity of a neuron computed by applying kernel \small i at position \small (x, y) and then applying the ReLU nonlinearity, the response-normalized activity \small b_{x, y}^{i} is given by the expression

\small b_{x, y}^{i}=a_{x, y}^{i} /\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^{j}\right)^{2}\right)^{\beta}

where the sum runs over \small n “adjacent” kernel maps at the same spatial position, and \small N is the total number of kernels in the layer. The ordering of the kernel maps is of course arbitrary and determined before training begins. This sort of response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels.

We applied this normalization after applying the ReLU nonlinearity in certain layers (see Section 3.5). This scheme bears some resemblance to the local contrast normalization scheme of Jarrett et al. [11], but ours would be more correctly termed “brightness normalization”, since we do not subtract the mean activity. 

Key points:

1. The sum in the formula runs over \small n "adjacent" kernel maps at the same spatial position, and \small N is the total number of kernels in the layer;

2. the ordering of the kernel maps is arbitrary and is fixed before training begins;

3. inspired by the lateral inhibition found in real neurons, this response normalization creates competition for large activations among neuron outputs computed with different kernels;

4. the normalization is applied after the ReLU nonlinearity in certain layers;

5. the scheme is more correctly termed "brightness normalization", since the mean activity is not subtracted;

6. response normalization reduces the top-1 and top-5 error rates by 1.4% and 1.2%, respectively;

7. effectiveness on CIFAR-10: a four-layer CNN has a 13% test error rate without normalization and 11% with it.
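A minimal PyTorch sketch of the normalization formula above, using torch.nn.LocalResponseNorm; the constants k = 2, n = 5, α = 10⁻⁴, β = 0.75 are the values reported in the AlexNet paper. Note that PyTorch scales α by 1/size internally, so this is an approximation of the formula as written rather than an exact reproduction:

```python
import torch
import torch.nn as nn

# Local response normalization as applied after the ReLU of the first two conv layers.
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

x = torch.randn(1, 96, 55, 55)  # e.g. the ReLU output of the first convolutional layer
y = lrn(x)
print(y.shape)                  # torch.Size([1, 96, 55, 55]) -- the shape is unchanged
```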

 

  • Overlapping Pooling

Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (e.g., [17, 11, 4]). To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced \small s pixels apart, each summarizing a neighborhood of size \small z \times z centered at the location of the pooling unit. If we set \small s = z, we obtain traditional local pooling as commonly employed in CNNs. If we set \small s<z, we obtain overlapping pooling. This is what we use throughout our network, with \small s = 2 and \small z = 3. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme \small s = 2, \small z = 2, which produces output of equivalent dimensions. We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.

Key point: using overlapping pooling makes the model slightly harder to overfit (and reduces the top-1/top-5 error rates by 0.4%/0.3% compared with non-overlapping pooling of the same output size).
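A minimal sketch comparing the two pooling settings discussed above; on a 55×55 map both happen to produce the same output size:

```python
import torch
import torch.nn as nn

overlapping_pool = nn.MaxPool2d(kernel_size=3, stride=2)  # s = 2 < z = 3 (overlapping)
traditional_pool = nn.MaxPool2d(kernel_size=2, stride=2)  # s = z = 2 (non-overlapping)

x = torch.randn(1, 96, 55, 55)
print(overlapping_pool(x).shape)  # torch.Size([1, 96, 27, 27])
print(traditional_pool(x).shape)  # torch.Size([1, 96, 27, 27]) -- equivalent dimensions
```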

 

  • Overall Architecture

As depicted in Figure 2, the net contains eight layers with weights; the first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. Our network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.

The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU. The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully-connected layers are connected to all neurons in the previous layer.

Response-normalization layers follow the first and second convolutional layers. Max-pooling layers, of the kind described in Section 3.4, follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.

Key points:

1. The network has 5 convolutional layers and 3 fully-connected layers;

2. the output of the last fully-connected layer is fed to a 1000-way softmax for the 1000 classes;

3. the multinomial logistic regression objective is maximized;

4. the kernels of conv layers 2/4/5 take input only from the previous-layer kernel maps on the same GPU, while conv layer 3 and the fully-connected layers are connected to all previous-layer outputs (from both GPUs);

5. response-normalization layers follow the 1st and 2nd convolutional layers;

6. max-pooling follows each response-normalization layer and the 5th convolutional layer;

7. ReLU is applied after every convolutional and fully-connected layer.

The first convolutional layer filters the 224×224×3 input image with 96 kernels of size 11×11×3 with a stride of 4 pixels (this is the distance between the receptive field centers of neighboring neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192 , and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully-connected layers have 4096 neurons each.

Key points:

1. The 1st convolutional layer filters the 224×224×3 input image with 96 kernels of size 11×11×3 and a stride of 4 pixels;

2. the 2nd convolutional layer takes the (response-normalized and pooled) output of the 1st as its input and filters it with 256 kernels of size 5×5×48;

3. the 3rd/4th/5th convolutional layers are connected to one another without any intervening pooling or normalization layers;

4. the 3rd convolutional layer has 384 kernels of size 3×3×256 connected to the (normalized, pooled) output of the 2nd;

5. the 4th convolutional layer has 384 kernels of size 3×3×192;

6. the 5th convolutional layer has 256 kernels of size 3×3×192;

7. each fully-connected layer has 4096 neurons.
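Putting the pieces above together, here is a hedged single-GPU sketch of the architecture in PyTorch. It ignores the two-GPU split (so conv layers 2/4/5 see all channels of the previous layer rather than only half), and the paddings follow the common torchvision-style choices so that a 224×224 input yields 55×55 maps after conv1 and 6×6 maps after the last pooling layer; it is an illustration, not the authors' exact model:

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                      # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(p=0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),          # fed to a 1000-way softmax via the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = AlexNetSketch()
print(model(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])
```

The dropout layers in the classifier correspond to the "Reducing Overfitting" section below.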

 

Reducing Overfitting

Below, we describe the two primary ways in which we combat overfitting.

  • Data Augmentation

The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (e.g., [25, 4, 5]). We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk. In our implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images. So these data augmentation schemes are, in effect, computationally free.

The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256×256 images and training our network on these extracted patches . This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.

The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore to each RGB image pixel \small I_{x y}=\left[I_{x y}^{R}, I_{x y}^{G}, I_{x y}^{B}\right]^{T} we add the following quantity:

\small \left[\mathbf{p}_{1}, \mathbf{p}_{2}, \mathbf{p}_{3}\right]\left[\alpha_{1} \lambda_{1}, \alpha_{2} \lambda_{2}, \alpha_{3} \lambda_{3}\right]^{T}

where \small \mathbf{p}_{i} and \small \lambda _{i} are \small ith eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values, respectively, and \small \alpha _{i} is the aforementioned random variable. 

Key points:

Two ways of enlarging the training set:

1. translations and horizontal reflections;

2. altering the intensities of the RGB channels of the training images.

The PCA colour augmentation in particular:

1. PCA is first performed on the set of RGB pixel values over the entire training set;

2. to each training image, multiples of the found principal components are added, with magnitudes proportional to the corresponding eigenvalue times a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1.
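A minimal NumPy sketch of this colour augmentation (often called "fancy PCA"). The helper name fancy_pca and the toy random data are illustrative assumptions, not from the paper:

```python
import numpy as np

def fancy_pca(image: np.ndarray, eigvecs: np.ndarray, eigvals: np.ndarray,
              std: float = 0.1) -> np.ndarray:
    """Add [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T to every pixel of an HxWx3 image.

    eigvecs (3x3, columns p_i) and eigvals (3,) come from PCA over the RGB pixel
    values of the whole training set; each a_i ~ N(0, std) is drawn once per image.
    """
    alphas = np.random.normal(0.0, std, size=3)
    shift = eigvecs @ (alphas * eigvals)           # shape (3,), added to every RGB pixel
    return image + shift

# Toy demonstration with random stand-in data (real usage would use the dataset's pixels).
pixels = np.random.rand(10000, 3)                  # stand-in for all RGB pixel values
cov = np.cov(pixels, rowvar=False)                 # 3x3 covariance matrix of RGB values
eigvals, eigvecs = np.linalg.eigh(cov)             # eigvecs[:, i] is the i-th principal component

img = np.random.rand(224, 224, 3)
print(fancy_pca(img, eigvecs, eigvals).shape)      # (224, 224, 3)
```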

 

  • Dropout

Combining the predictions of many different models is a very successful way to reduce test errors [1, 3], but it appears to be too expensive for big neural networks that already take several days to train. There is, however, a very efficient version of model combination that only costs about a factor of two during training. The recently-introduced technique, called “dropout” [Improving neural networks by preventing co-adaptation of feature detectors], consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially-many dropout networks. We use dropout in the first two fully-connected layers of Figure 2. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.

Key points:

1. Dropout reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons;

2. in this paper dropout is used in the first two fully-connected layers (i.e., layers 6/7);

3. dropout effectively combats overfitting;

4. however, dropout roughly doubles the number of iterations required to converge, so training takes nearly twice as long.
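A minimal sketch of dropout in PyTorch. Note that modern frameworks implement "inverted" dropout: surviving activations are scaled by 1/(1-p) during training and left untouched at test time, which is equivalent in expectation to the paper's scheme of multiplying all outputs by 0.5 at test time:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # applied to the first two fully-connected layers in AlexNet

x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the entries are zeroed; survivors are scaled by 1 / (1 - p) = 2

drop.eval()
print(drop(x))  # identity at test time
```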

 


NIN (Network In Network)


Introduction

Convolutional neural networks (CNNs) [1] consist of alternating convolutional layers and pooling layers. Convolution layers take inner product of the linear filter and the underlying receptive field followed by a nonlinear activation function at every local portion of the input. The resulting outputs are called feature maps.

The convolution filter in CNN is a generalized linear model (GLM) for the underlying data patch, and we argue that the level of abstraction is low with GLM. By abstraction we mean that the feature is invariant to the variants of the same concept [2]. Replacing the GLM with a more potent nonlinear function approximator can enhance the abstraction ability of the local model. GLM can achieve a good extent of abstraction when the samples of the latent concepts are linearly separable, i.e. the variants of the concepts all live on one side of the separation plane defined by the GLM. Thus conventional CNN implicitly makes the assumption that the latent concepts are linearly separable. However, the data for the same concept often live on a nonlinear manifold, therefore the representations that capture these concepts are generally highly nonlinear function of the input. In NIN, the GLM is replaced with a ”micro network” structure which is a general nonlinear function approximator. In this work, we choose multilayer perceptron [3] as the instantiation of the micro network, which is a universal function approximator and a neural network trainable by back-propagation.

This section gives the paper's motivation.

The convolution filter is a linear model for the underlying data patch, called a GLM (generalized linear model); however, the data for the same concept often live on a nonlinear manifold, so the representations that capture these concepts are generally highly nonlinear functions of the input.

The paper therefore replaces the GLM with a "micro network" structure, which is required to be a nonlinear function approximator.

The micro network chosen in this work is a multilayer perceptron, which can be trained by back-propagation.

Figure 1: Comparison of linear convolution layer and mlpconv layer. The linear convolution layer includes a linear filter while the mlpconv layer includes a micro network (we choose the multilayer perceptron in this paper). Both layers map the local receptive field to a confidence value of the latent concept.

In a conventional convolutional layer, the operation over a local receptive field is just a single-layer linear computation;

the mlpconv layer instead uses a multilayer-perceptron micro network to perform a more complex, nonlinear computation over each local receptive field.

The resulting structure which we call an mlpconv layer is compared with CNN in Figure 1. Both the linear convolutional layer and the mlpconv layer map the local receptive field to an output feature vector. The mlpconv maps the input local patch to the output feature vector with a multilayer perceptron (MLP) consisting of multiple fully connected layers with nonlinear activation functions. The MLP is shared among all local receptive fields. The feature maps are obtained by sliding the MLP over the input in a similar manner as CNN and are then fed into the next layer. The overall structure of the NIN is the stacking of multiple mlpconv layers. It is called “Network In Network” (NIN) as we have micro networks (MLP), which are composing elements of the overall deep network, within mlpconv layers.

Key points:

1. A convolutional layer containing an MLP is called an mlpconv layer (mlpconv = multilayer perceptron + convolution); the network built by stacking mlpconv layers throughout is called Network In Network (NIN);

2. mlpconv maps the input local patch to the output feature vector with a multilayer perceptron (MLP) consisting of multiple fully-connected layers with nonlinear activation functions;

3. the MLP is shared among all local receptive fields;

4. the feature maps are obtained by sliding the MLP over the input in the same manner as a CNN and are then fed into the next layer (see the sketch below).
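Because the MLP is shared across all spatial positions, an mlpconv layer can be written as an ordinary convolution followed by 1×1 convolutions. A minimal sketch under that reading; the channel counts are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def mlpconv_block(in_ch: int, mid_ch: int, out_ch: int,
                  kernel_size: int, stride: int = 1, padding: int = 0) -> nn.Sequential:
    """One mlpconv layer: a normal convolution followed by two 1x1 convolutions.

    The 1x1 convolutions apply the same fully-connected MLP at every spatial
    position, i.e. the MLP is shared across all local receptive fields.
    """
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size, stride=stride, padding=padding), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True),
    )

block = mlpconv_block(3, 192, 160, kernel_size=5, padding=2)
print(block(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 160, 32, 32])
```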

 

Instead of adopting the traditional fully connected layers for classification in CNN, we directly output the spatial average of the feature maps from the last mlpconv layer as the confidence of categories via a global average pooling layer, and then the resulting vector is fed into the softmax layer.

In traditional CNN, it is difficult to interpret how the category level information from the objective cost layer is passed back to the previous convolution layer due to the fully connected layers which act as a black box in between.

In contrast, global average pooling is more meaningful and interpretable as it enforces correspondance between feature maps and categories, which is made possible by a stronger local modeling using the micro network.

Furthermore, the fully connected layers are prone to overfitting and heavily depend on dropout regularization [4] [5], while global average pooling is itself a structural regularizer, which natively prevents overfitting for the overall structure.

Key points:

1. Instead of the traditional fully-connected layers for classification, the spatial average of the feature maps from the last mlpconv layer is output directly as the category confidences via a global average pooling layer, and the resulting vector is fed into the softmax layer.

2. Reasons:

    1) In a traditional CNN it is difficult to interpret how the category-level information from the objective cost layer is passed back to the earlier convolutional layers, because the fully-connected layers in between act as a black box. Global average pooling is more meaningful and interpretable because it enforces correspondence between feature maps and categories, which is made possible by the stronger local modeling of the micro network.

    2) Fully-connected layers are prone to overfitting and depend heavily on dropout regularization [4][5], whereas global average pooling is itself a structural regularizer that natively prevents overfitting of the overall structure. (Why is global average pooling a structural regularizer? The Global Average Pooling section below notes that it has no parameters to optimize and that it sums out the spatial information.)

 

Network In Network

  • MLP Convolution Layers

Given no priors about the distributions of the latent concepts, it is desirable to use a universal function approximator for feature extraction of the local patches, as it is capable of approximating more abstract representations of the latent concepts. Radial basis network and multilayer perceptron are two well known universal function approximators. We choose multilayer perceptron in this work for two reasons. First, multilayer perceptron is compatible with the structure of convolutional neural networks, which is trained using back-propagation. Second, multilayer perceptron can be a deep model itself, which is consistent with the spirit of feature re-use [2]. This new type of layer is called mlpconv in this paper, in which MLP replaces the GLM to convolve over the input. Figure 1 illustrates the difference between linear convolutional layer and mlpconv layer. The calculation performed by mlpconv layer is shown as follows:

\small \begin{aligned} f_{i, j, k_{1}}^{1} &=\max \left({w_{k_{1}}^{1}}^{T} x_{i, j}+b_{k_{1}}, 0\right) \\ & \vdots \\ f_{i, j, k_{n}}^{n} &=\max \left({w_{k_{n}}^{n}}^{T} f_{i, j}^{n-1}+b_{k_{n}}, 0\right) \end{aligned}                             (2)

Here \small n is the number of layers in the multilayer perceptron. Rectified linear unit is used as the activation function in the multilayer perceptron.

In Eq. (2), \small n is the number of layers in the multilayer perceptron, and the rectified linear unit is used as its activation function.

Key points:

There are two well-known universal function approximators: the radial basis network and the multilayer perceptron.

The paper chooses the multilayer perceptron for two reasons:

1. the multilayer perceptron is compatible with the structure of convolutional neural networks, which are trained by back-propagation;

2. the multilayer perceptron can itself be a deep model, which is consistent with the spirit of feature re-use.

 

Comparison to maxout layers:

the maxout layers in the maxout network performs max pooling across multiple affine feature maps [Maxout Networks]. The feature maps of maxout layers are calculated as follows:

\small f_{i, j, k}=\max _{m}\left(w_{k_{m}}^{T} x_{i, j}\right)

Maxout over linear functions forms a piecewise linear function which is capable of modeling any convex function. For a convex function, samples with function values below a specific threshold form a convex set. Therefore, by approximating convex functions of the local patch, maxout has the capability of forming separation hyperplanes for concepts whose samples are within a convex set (i.e. l2 balls, convex cones). Mlpconv layer differs from maxout layer in that the convex function approximator is replaced by a universal function approximator, which has greater capability in modeling various distributions of latent concepts.

Maxout over linear functions forms a piecewise linear function that can model any convex function. For a convex function, the samples whose function values lie below a given threshold form a convex set; therefore, by approximating convex functions of the local patch, maxout can form separating hyperplanes for concepts whose samples lie within a convex set (e.g., l2 balls, convex cones). The mlpconv layer differs from the maxout layer in that the convex function approximator is replaced by a universal function approximator, which has a greater capability of modeling the various distributions of latent concepts. [This paragraph is adapted from another translation.]

 

  • Global Average Pooling

In this paper, we propose another strategy called global average pooling to replace the traditional fully connected layers in CNN. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer.

One advantage of global average pooling over the fully connected layers is that it is more native to the convolution structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as categories confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling thus overfitting is avoided at this layer.

Futhermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input. We can see global average pooling as a structural regularizer that explicitly enforces feature maps to be confidence maps of concepts (categories). This is made possible by the mlpconv layers, as they makes better approximation to the confidence maps than GLMs.

In this paper another strategy, called global average pooling (GAP), is proposed to replace the traditional fully-connected layers in a CNN. The idea is to generate, in the last mlpconv layer, one feature map for each category of the classification task. Instead of adding fully-connected layers on top of the feature maps, the average of each feature map is taken, and the resulting vector is fed directly into the softmax layer. One advantage of GAP over fully-connected layers is that it is more native to the convolution structure: by enforcing correspondences between feature maps and categories, the feature maps can easily be interpreted as category confidence maps. Another advantage is that GAP has no parameters to optimize, so overfitting is avoided at this layer. Furthermore, GAP sums out the spatial information, so it is more robust to spatial translations of the input.

GAP can be seen as a structural regularizer that explicitly enforces feature maps to be confidence maps of concepts (categories). This is made possible by the mlpconv layers, because they approximate the confidence maps better than GLMs do. [This paragraph is adapted from another translation.]
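A minimal sketch of global average pooling feeding a softmax; the 10-class setting is an illustrative assumption:

```python
import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)             # global average pooling: one scalar per map, no parameters

feature_maps = torch.randn(1, 10, 8, 8)   # last mlpconv output: one feature map per category
logits = gap(feature_maps).flatten(1)     # shape (1, 10), fed directly to the softmax layer
probs = torch.softmax(logits, dim=1)
print(probs.shape, float(probs.sum()))    # torch.Size([1, 10]), probabilities sum to ~1.0
```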

Figure 2: The overall structure of Network In Network. In this paper the NINs include the stacking of three mlpconv layers and one global average pooling layer.

  • Network In Network Structure

The overall structure of NIN is a stack of mlpconv layers, on top of which lie the global average pooling and the objective cost layer. Sub-sampling layers can be added in between the mlpconv layers as in CNN and maxout networks. Figure 2 shows an NIN with three mlpconv layers. Within each mlpconv layer, there is a three-layer perceptron. The number of layers in both NIN and the micro networks is flexible and can be tuned for specific tasks.

The overall structure of NIN is a stack of mlpconv layers, topped by a global average pooling layer and the objective cost (classification) layer. Sub-sampling layers can be added between the mlpconv layers, as in CNN and maxout networks. Figure 2 shows a NIN with three mlpconv layers; within each mlpconv layer there is a three-layer perceptron. The number of layers in both the NIN and the micro networks is flexible and can be tuned for the specific task.
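A compact sketch of such a NIN for 10-class, CIFAR-sized inputs: three mlpconv layers (each a convolution followed by two 1×1 convolutions, i.e. a three-layer perceptron per receptive field), sub-sampling in between, and global average pooling in place of fully-connected layers. The channel counts and pooling choices are illustrative assumptions, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn

nin = nn.Sequential(
    # mlpconv 1
    nn.Conv2d(3, 192, 5, padding=2), nn.ReLU(True),
    nn.Conv2d(192, 160, 1), nn.ReLU(True),
    nn.Conv2d(160, 96, 1), nn.ReLU(True),
    nn.MaxPool2d(3, stride=2, padding=1),
    # mlpconv 2
    nn.Conv2d(96, 192, 5, padding=2), nn.ReLU(True),
    nn.Conv2d(192, 192, 1), nn.ReLU(True),
    nn.Conv2d(192, 192, 1), nn.ReLU(True),
    nn.MaxPool2d(3, stride=2, padding=1),
    # mlpconv 3: the last 1x1 conv outputs one feature map per class
    nn.Conv2d(192, 192, 3, padding=1), nn.ReLU(True),
    nn.Conv2d(192, 192, 1), nn.ReLU(True),
    nn.Conv2d(192, 10, 1), nn.ReLU(True),
    # global average pooling -> class confidences fed to softmax
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

print(nin(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10])
```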


 


VGG


[paper] Very Deep Convolutional Networks for Large-Scale Image Recognition

[pytorch] https://pytorch.org/docs/stable/_modules/torchvision/models/vgg.html

Convnet Configurations

To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.

 

  • Architecture

The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel. 

The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity).

The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers.

Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window, with stride 2.

A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000- way ILSVRC classification and thus contains 1000 channels (one for each class).

The final layer is the soft-max layer.

The configuration of the fully connected layers is the same in all networks.

All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity.

We note that none of our networks (except for one) contain Local Response Normalisation (LRN) normalisation (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.

Key points:

1. The only preprocessing is subtracting the mean RGB value, computed on the training set, from each pixel;

2. the convolution kernels are almost all 3×3; only configuration C (VGG-16C) also uses 1×1 kernels, see Table 1;

3. stride = 1; padding = 1 for the 3×3 convolutions;

4. max-pooling is used, 5 times, with a 2×2 window and stride = 2;

5. three fully-connected layers with 4096, 4096, and 1000 channels, respectively;

6. the final layer is the soft-max layer;

7. the configuration of the three fully-connected layers is the same in all VGG variants;

8. every hidden layer is equipped with ReLU;

9. Local Response Normalisation is not used (except in one configuration), since experiments showed that it does not improve performance (the comparison configuration is VGG-A-LRN, see Table 1).

Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as "conv⟨receptive field size⟩-⟨number of channels⟩". The ReLU activation function is not shown for brevity.

  • Configurations

The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512. In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).

Table 2: Number of parameters (in millions).

Key points:

1. Table 1 lists the different network configurations; apart from depth, every configuration follows the same generic design;

2. despite the depth, the number of parameters is no greater than in some shallower nets with larger conv-layer widths and receptive fields, e.g. the 144M weights of Sermanet et al., 2014 [OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks]; the main reason is the exclusive use of small 3×3 kernels.
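A minimal sketch that builds the convolutional part of a VGG net from a Table-1-style configuration list, similar in spirit to the torchvision implementation linked above; the list shown corresponds to configuration D (VGG-16), and the classifier follows the fixed FC layout described earlier:

```python
import torch
import torch.nn as nn

# Numbers are output channels of 3x3 convs (stride 1, padding 1); 'M' is 2x2 max-pooling, stride 2.
cfg_D = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']

def make_layers(cfg, in_channels: int = 3) -> nn.Sequential:
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

features = make_layers(cfg_D)
classifier = nn.Sequential(                      # the same FC configuration in all VGG variants
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),
)

x = torch.randn(1, 3, 224, 224)
out = classifier(torch.flatten(features(x), 1))
print(out.shape)                                 # torch.Size([1, 1000])
```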

 

  • Discussion 

It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7 × 7 effective receptive field.

So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer?

First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative.

Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has \small C channels, the stack is parametrised by \small 3\left(3^{2} C^{2}\right)=27 C^{2} weights; at the same time, a single 7 × 7 conv. layer would require \small 7^{2} C^{2}=49 C^{2} parameters, i.e. 81% more.

This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).

Key points:

1. A stack of two 3×3 conv layers has an effective receptive field of 5×5, and a stack of three 3×3 layers has a 7×7 effective receptive field;

2. why use 3×3 kernels?

    1) Three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative;

    2) fewer parameters: \small 3(3^{2}C^{2})=27C^{2} for the three-layer stack versus \small 7^{2}C^{2}=49C^{2} for a single 7×7 layer, i.e. 81% more (see the check below);

    3) this can be seen as imposing a regularisation on the 7×7 filters, forcing them to decompose through the 3×3 filters (with non-linearity injected in between).
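A quick numeric check of the parameter comparison above (biases omitted), using a hypothetical channel count C = 256:

```python
import torch.nn as nn

C = 256  # assume C input channels and C output channels

stack_3x3 = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False) for _ in range(3)])
single_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)

n_stack = sum(p.numel() for p in stack_3x3.parameters())    # 3 * (3^2 * C^2) = 27 C^2
n_single = sum(p.numel() for p in single_7x7.parameters())  # 7^2 * C^2 = 49 C^2
print(n_stack, n_single)                                    # 1769472 3211264
print(f"{(n_single - n_stack) / n_stack:.0%} more")         # 81% more
```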

The incorporation of 1 × 1 conv. layers (configuration C, Table 1) is a way to increase the nonlinearity of the decision function without affecting the receptive fields of the conv. layers. Even though in our case the 1 × 1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that 1×1 conv. layers have recently been utilised in the “Network in Network” architecture of Lin et al. (2014).

Key point:

The incorporation of 1×1 conv layers (VGG-C, Table 1) is a way to increase the nonlinearity of the decision function without affecting the receptive fields of the conv layers.

GoogLeNet (Szegedy et al., 2014 [Going Deeper with Convolutions] ), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNet (22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions). Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Sect. 4.5, our model is outperforming that of Szegedy et al. (2014) in terms of the single-network classification accuracy.

Direct comparison with GoogLeNet: VGG is somewhat shallower (up to 19 weight layers versus 22), has a simpler topology, and achieves better single-network classification accuracy.

 

Training

The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later). Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum. The batch size was set to 256, momentum to 0.9.

The training was regularised by weight decay (the L2 penalty multiplier set to 5 · 10E-4 ) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5).

The learning rate was initially set to 10E-2 , and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs).

We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required less epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.

Key points:

1. Training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent with momentum; the batch size is 256 and the momentum is 0.9;

2. training is regularised by weight decay (L2 penalty multiplier 5·10E-4) and by dropout for the first two fully-connected layers (dropout ratio 0.5);

3. the learning rate is initially set to 10E-2 and is decreased by a factor of 10 whenever the validation accuracy stops improving; in total it is decreased 3 times, and training stops after 370K iterations (74 epochs);

4. the authors conjecture that, despite having more parameters and greater depth than (Krizhevsky et al., 2012), VGG needs fewer epochs to converge because of (a) the implicit regularisation imposed by greater depth and smaller conv filters, and (b) the pre-initialisation of certain layers. (A training-setup sketch follows after the reading list below.)

[Papers and pages on implicit regularisation]

Implicit Regularization in Deep Learning    Behnam Neyshabur,2017

Understanding implicit regularization in deep learning by analyzing trajectories of gradient descent    Nadav Cohen and Wei Hu,2019

Why Deep Learning Works: Implicit Self-Regularization in Deep Neural Networks   Michael W. Mahoney,2018
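Returning to the training setup summarised above, here is a minimal PyTorch sketch applied to a torchvision VGG-16. The data pipeline is omitted; train_loader and val_accuracy() are placeholders, and ReduceLROnPlateau stands in for the paper's manual "divide the learning rate by 10 when validation accuracy stops improving" rule:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision.models import vgg16

model = vgg16(num_classes=1000)                      # torchvision VGG-16, dropout 0.5 in the classifier
criterion = nn.CrossEntropyLoss()                    # multinomial logistic regression objective
optimizer = optim.SGD(model.parameters(), lr=1e-2,   # initial learning rate 10^-2
                      momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.1)

# for epoch in range(74):                            # ~370K iterations with batch size 256
#     for images, labels in train_loader:            # train_loader is a placeholder
#         optimizer.zero_grad()
#         loss = criterion(model(images), labels)
#         loss.backward()
#         optimizer.step()
#     scheduler.step(val_accuracy())                 # val_accuracy() is a placeholder
```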

 

The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fullyconnected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning.

For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and 10E-2 variance. The biases were initialised with zero.

It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).

To obtain the fixed-size 224×224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012).

Key points:

1. Two initialisation schemes:

    1) train VGG-A first; when training the deeper VGG models, initialise the first four convolutional layers and the last three fully-connected layers with the layers of the trained net A (the intermediate layers are initialised randomly);

    2) alternatively, use the random initialisation procedure of Glorot & Bengio (2010);

2. data augmentation: random 224×224 crops, horizontal flipping, and random RGB colour shift.
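A minimal sketch of the random initialisation described above (copying layers from the pre-trained net A is not shown). The paper states zero mean and 10^-2 variance, i.e. a standard deviation of 0.1; the Glorot & Bengio alternative corresponds to xavier initialisation:

```python
import torch.nn as nn

def init_weights(module: nn.Module, use_glorot: bool = False) -> None:
    """Randomly initialise conv/linear layers; biases are set to zero."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        if use_glorot:
            nn.init.xavier_uniform_(module.weight)             # Glorot & Bengio (2010)
        else:
            nn.init.normal_(module.weight, mean=0.0, std=0.1)  # variance 10^-2 -> std 0.1
        if module.bias is not None:
            nn.init.zeros_(module.bias)

net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)
net.apply(lambda m: init_weights(m, use_glorot=False))
```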

 
