
1. Top ILSVRC Models

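When applying convolutional neural networks to image classification, it can be hard to know exactly how to prepare images for modeling, for example how to scale or normalize pixel values. In addition, image data augmentation can improve model performance and reduce generalization error, and test-time augmentation can improve the predictive performance of a fitted model.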

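Rather than guessing at what might work, a better approach is to look closely at the data preparation, train-time augmentation, and test-time augmentation used by the top-performing models described in the literature.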

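Between 2010 and 2017, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) produced a series of state-of-the-art deep convolutional neural network models for image classification, whose architectures and configurations have become heuristics and best practices in the field.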

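The papers describing the models that performed well in this annual competition can be reviewed to discover the kinds of data preparation and image augmentation they used. These can in turn serve as suggestions and best practices when preparing image data for your own image classification task. The following sections review the data preparation and image augmentation used in four top models: SuperVision / AlexNet, GoogLeNet / Inception, VGG, and ResNet.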

2. SuperVision (AlexNet) Data Preparation

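AlexNet, proposed in the paper "ImageNet Classification with Deep Convolutional Neural Networks", achieved top results on the ILSVRC-2010 and ILSVRC-2012 image classification tasks, and those results sparked interest in deep learning for computer vision. The authors called their model SuperVision, but it has since become known as AlexNet.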

2.1 Data Preparation



2.2 Train-Time Augmentation

Specifically, the augmentation was performed in memory and the results were not saved. This **just-in-time augmentation** is now the standard way of applying the technique.


The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224×224 patches (and their horizontal reflections) from the 256×256 images and training our network on these extracted patches.
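The patch extraction described in the quote above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `random_crop_and_flip`, the 50% flip probability, and the H x W x C array layout are assumptions made for the example.

```python
import numpy as np

def random_crop_and_flip(image, crop_size=224, rng=None):
    """Return a random crop_size x crop_size patch, mirrored half of the time.

    `image` is assumed to be an H x W x C array with H and W >= crop_size.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    # Pick the top-left corner so the patch stays inside the image.
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:  # assumed flip probability
        patch = patch[:, ::-1]  # horizontal reflection
    return patch

# Usage: a blank 256x256 RGB image stands in for a training image.
image = np.zeros((256, 256, 3), dtype=np.uint8)
patch = random_crop_and_flip(image, rng=np.random.default_rng(0))
print(patch.shape)
```

Because the patch is sampled each time the function is called, nothing needs to be written to disk, which matches the in-memory style of augmentation described above.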


The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1.
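The PCA-based colour augmentation in the quote above could look roughly like the following. It is a sketch under stated assumptions: the helper names `fit_rgb_pca` and `pca_color_jitter` are hypothetical, the PCA is fitted on whatever pixel sample is passed in (the paper uses the whole ImageNet training set), and no clipping to a valid pixel range is applied.

```python
import numpy as np

def fit_rgb_pca(pixels):
    """Fit PCA on an N x 3 array of RGB values.

    Returns (eigenvalues, eigenvectors); eigenvector i is column i.
    """
    covariance = np.cov(pixels.astype(np.float64), rowvar=False)  # 3 x 3
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    return eigenvalues, eigenvectors

def pca_color_jitter(image, eigenvalues, eigenvectors, std=0.1, rng=None):
    """Add a random multiple of each principal component to every pixel."""
    rng = np.random.default_rng() if rng is None else rng
    # One Gaussian draw per component, shared by all pixels of this image.
    alphas = rng.normal(0.0, std, size=3)
    shift = eigenvectors @ (alphas * eigenvalues)  # a single RGB offset
    return image.astype(np.float64) + shift

# Usage: a small random array stands in for the training set (assumption).
rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(8, 8, 3))
eigenvalues, eigenvectors = fit_rgb_pca(sample.reshape(-1, 3))
jittered = pca_color_jitter(sample, eigenvalues, eigenvectors, rng=rng)
```

Note that the same offset is added to every pixel, so the image content is unchanged and only its overall colour balance shifts.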

2.3 Test-Time Augmentation



At test time, the network makes a prediction by extracting five 224×224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.
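The ten-patch scheme in the quote above can be sketched as follows. The names `ten_crop` and `predict_ten_crop` are hypothetical, and `predict_proba` is an assumed callable that maps a batch of shape (N, H, W, C) to softmax outputs of shape (N, classes); it stands in for whatever model is being evaluated.

```python
import numpy as np

def ten_crop(image, crop_size=224):
    """Four corner crops plus the center crop, and their horizontal mirrors."""
    height, width = image.shape[:2]
    bottom, right = height - crop_size, width - crop_size
    offsets = [(0, 0), (0, right), (bottom, 0), (bottom, right),
               (bottom // 2, right // 2)]
    crops = [image[t:t + crop_size, l:l + crop_size] for t, l in offsets]
    crops += [crop[:, ::-1] for crop in crops]  # mirrored versions
    return np.stack(crops)  # shape (10, crop_size, crop_size, C)

def predict_ten_crop(image, predict_proba, crop_size=224):
    """Average the model's softmax outputs over the ten patches."""
    return predict_proba(ten_crop(image, crop_size)).mean(axis=0)

# Usage with a dummy model that always returns a uniform distribution.
def dummy_model(batch):
    return np.full((len(batch), 5), 0.2)

image = np.zeros((256, 256, 3), dtype=np.uint8)
averaged = predict_ten_crop(image, dummy_model)
```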

3. GoogLeNet (Inception) Data Preparation

GoogLeNet was proposed in the 2014 paper "Going Deeper with Convolutions" and achieved top results on object detection.

3.1 Data Preparation


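The network proposed in the first paper is commonly referred to as Inception v1. The 2015 follow-up paper, "Rethinking the Inception Architecture for Computer Vision", proposed Inception v2 and v3. The Inception v3 model and its weights are available in Keras. In that implementation, which is based on the TensorFlow implementation, images are not centered; instead, the pixel values of each image are scaled to the range [-1, 1], and the input shape is 299×299 pixels. This normalization and lack of centering does not appear to be mentioned in the more recent paper.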
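Scaling pixel values to [-1, 1] is a one-line operation. The sketch below assumes 8-bit input in [0, 255]; the function name `scale_to_unit_range` is made up for this example and is not the Keras API.

```python
import numpy as np

def scale_to_unit_range(image):
    """Map 8-bit pixel values in [0, 255] to floats in [-1, 1]."""
    return image.astype(np.float32) / 127.5 - 1.0

# Usage: black maps to -1 and white maps to +1.
pixels = np.array([[[0, 255, 51]]], dtype=np.uint8)
scaled = scale_to_unit_range(pixels)
```

In practice, it is usually safer to call the `preprocess_input` function that ships alongside the Keras Inception v3 model than to reimplement the scaling by hand.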

3.2 Train-Time Augmentation



Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area and whose aspect ratio is chosen randomly between 3/4 and 4/3.
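One way to sample such patches is sketched below using only the standard library. Several details are assumptions, since the quote does not specify them: the aspect ratio is drawn uniformly, a sampled box that does not fit the image is rejected and redrawn, and after `max_tries` failures the whole image is returned. The name `sample_crop_box` is hypothetical.

```python
import math
import random

def sample_crop_box(height, width, rng=None, max_tries=10):
    """Sample (top, left, crop_height, crop_width) for a random patch.

    The patch covers 8% to 100% of the image area with an aspect ratio
    (width / height) between 3/4 and 4/3.
    """
    rng = rng or random.Random()
    for _ in range(max_tries):
        area = rng.uniform(0.08, 1.0) * height * width
        aspect = rng.uniform(3 / 4, 4 / 3)  # uniform draw is an assumption
        crop_w = int(round(math.sqrt(area * aspect)))
        crop_h = int(round(math.sqrt(area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            top = rng.randint(0, height - crop_h)
            left = rng.randint(0, width - crop_w)
            return top, left, crop_h, crop_w
    # Fallback (assumption): use the whole image if no valid box was found.
    return 0, 0, height, width

# Usage: sample a box for a 300 x 400 image.
top, left, crop_h, crop_w = sample_crop_box(300, 400, rng=random.Random(0))
```

The sampled patch would then be resized to the model's input shape before being fed to the network.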

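In addition, photometric distortions were used, involving random changes to image properties such as color, contrast, and brightness. Images were resized to fit the model's expected input shape, with the interpolation method chosen at random.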

In addition, we started to use random interpolation methods (bilinear, area, nearest neighbor and cubic, with equal probability) for resizing.

3.3 Test-Time Augmentation



Specifically, we resize the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, take the left, center and right square of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions. This results in 4×3×6×2 = 144 crops per image.


The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction.
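The crop count in the recipe above is easy to check by enumerating the combinations. The snippet only lists crop descriptions, it does not touch any pixels; the label strings are made up for the example.

```python
from itertools import product

SCALES = (256, 288, 320, 352)                     # shorter side after resizing
SQUARES = ("left/top", "center", "right/bottom")  # three squares per scale
VIEWS = ("top-left", "top-right", "bottom-left", "bottom-right",
         "center", "square resized to 224")       # six views per square
MIRRORED = (False, True)                          # each view and its mirror

crop_specs = list(product(SCALES, SQUARES, VIEWS, MIRRORED))
print(len(crop_specs))  # 144
```

Each of the 144 entries corresponds to one forward pass per classifier, which is why this scheme is considerably more expensive than the ten-crop approach used by AlexNet.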

4. VGG Data Preparation

In 2015, the Visual Geometry Group (VGG) at the University of Oxford proposed VGG Net in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition".

4.1 Data Preparation


During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image. The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel.
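Mean RGB subtraction as described in the quote above can be sketched like this. The function names are hypothetical, and the training set is assumed to fit in memory as an N x H x W x 3 array, which will not hold for ImageNet-sized data (there the mean would be accumulated in batches).

```python
import numpy as np

def channel_means(train_images):
    """Per-channel mean over an N x H x W x 3 array of training images."""
    return train_images.astype(np.float64).mean(axis=(0, 1, 2))

def subtract_mean(image, mean_rgb):
    """Center an image by subtracting the training-set mean RGB value."""
    return image.astype(np.float64) - mean_rgb

# Usage: random data stands in for the training set (assumption).
rng = np.random.default_rng(0)
train_images = rng.integers(0, 256, size=(4, 8, 8, 3))
mean_rgb = channel_means(train_images)
centered = subtract_mean(train_images[0], mean_rgb)
```

The mean is computed once on the training set and the same three values are reused for validation and test images.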

4.2 Train-Time Augmentation

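First, a model is trained with a fixed, smaller image size; its weights are kept and then used to initialize a new model trained on a larger, but still fixed, image size. This was done to speed up training of the larger (second) model. The other approach to image scaling, called multi-scale training, involves randomly selecting a rescaling size for each image.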

In both training approaches, the model input is then taken as a smaller crop of the rescaled image. In addition, horizontal flips and color shifts are applied to the crops.

To obtain the fixed-size 224×224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift.
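Multi-scale training followed by a random crop and flip might be sketched as below. Several choices are assumptions rather than details from the text above: the shorter side is drawn from the range 256 to 512, resizing uses plain nearest-neighbour indexing to keep the example dependency-free, and the RGB colour shift is left out. The function names are hypothetical.

```python
import numpy as np

def resize_shorter_side(image, target):
    """Nearest-neighbour resize so that the shorter side equals `target`."""
    height, width = image.shape[:2]
    scale = target / min(height, width)
    new_h = max(target, int(round(height * scale)))
    new_w = max(target, int(round(width * scale)))
    # Map each output row/column back to its nearest source index.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), height - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), width - 1)
    return image[rows][:, cols]

def multi_scale_crop(image, s_min=256, s_max=512, crop_size=224, rng=None):
    """Rescale to a random shorter side, then take a random crop and flip."""
    rng = np.random.default_rng() if rng is None else rng
    target = int(rng.integers(s_min, s_max + 1))  # assumed scale range
    resized = resize_shorter_side(image, target)
    height, width = resized.shape[:2]
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    crop = resized[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]  # random horizontal flip
    return crop

# Usage: one crop per image per training iteration.
image = np.zeros((300, 400, 3), dtype=np.uint8)
crop = multi_scale_crop(image, rng=np.random.default_rng(0))
```

A real pipeline would use a proper interpolation routine from an image library instead of nearest-neighbour indexing.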

4.3 Test-Time Augmentation

The multi-scale approach evaluated at training time was also evaluated at test time, where it is more generally referred to as scale jitter.


5. ResNet Data Preparation

5.1 Data Preparation


5.2 Train-Time Augmentation

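The image data augmentation is a combination of the approaches above, drawing mainly on AlexNet and VGG. Images are randomly resized to either a small or a large size, the so-called scale augmentation used in VGG. A small square crop is then taken, with a possible horizontal flip and color augmentation.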

5.3 Test-Time Augmentation

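As with AlexNet, 10 square crops are created for each image in the test set, although these crops are computed on multiple fixed-size versions of each test image, which implements the scale jittering described for VGG. The predictions across all variations are then averaged.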

In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully-convolutional form as in [41, 13], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).
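Averaging scores over several scales reduces to a short loop once a model and a resize routine are available. In the sketch below both are passed in as hypothetical callables: `resize_fn(image, shorter_side)` returns the rescaled image, and `score_fn(image)` returns a vector of class scores for an image of any size, standing in for a fully-convolutional network.

```python
import numpy as np

TEST_SCALES = (224, 256, 384, 480, 640)  # shorter-side lengths from the quote

def multi_scale_scores(image, score_fn, resize_fn, scales=TEST_SCALES):
    """Average class scores over several rescaled copies of one image."""
    scores = [score_fn(resize_fn(image, scale)) for scale in scales]
    return np.mean(scores, axis=0)

# Usage with dummy callables that only exercise the control flow.
def dummy_resize(image, shorter_side):
    return np.zeros((shorter_side, shorter_side, 3))

def dummy_score(image):
    return np.array([float(image.shape[0]), 1.0])

averaged = multi_scale_scores(np.zeros((300, 400, 3)), dummy_score, dummy_resize)
```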

6. Data Preparation Recommendations

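  • Data Preparation: A fixed size must be chosen for input images, and all images must be resized to that shape. The most common type of pixel scaling involves centering the pixel values per channel, followed by some form of normalization.
  • Train-Time Augmentation: Train-time augmentation is a requirement, most commonly involving resizing and cropping of input images, as well as image modifications such as shifts, flips, and color changes.
  • Test-Time Augmentation: Test-time augmentation focuses on systematic crops of the input images to ensure that features present in them are detected.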

