【Thoughts】How Fast R-CNN Does Away with the SVM

    What made me curious at first was why R-CNN needs an SVM to boost performance instead of simply using the softmax classifier that comes "built into" the network, and why Fast R-CNN can do without the SVM entirely: a network with multiple output heads is already enough to get good results.

Figure: the Fast R-CNN network architecture

    The appendix of the arXiv version of R-CNN also discusses using SVMs instead of softmax: in their experiment, relying on softmax alone dropped mAP from 54.2% to 50.9%, a loss of roughly three percentage points. The authors attribute the gap to the different ways the training sets for the SVMs and for the CNN are constructed. Because fine-tuning the CNN needs far more data, the CNN's training set uses a looser standard for how precisely a positive RoI must be localized, which leaves the network's softmax classifier insensitive to precise localization; this is why the SVMs bring a sizable mAP improvement. (In the quotes below, the first paragraph describes how the two training sets are built, the second the reasoning behind the CNN's fine-tuning data, and the third the thinking about why softmax alone is not used.)


...

To review the definitions briefly, for finetuning we map each object proposal to the ground-truth instance with which it has maximum IoU overlap (if any) and label it as a positive for the matched ground-truth class if the IoU is at least 0.5. All other proposals are labeled “background” (i.e., negative examples for all classes). For training SVMs, in contrast, we take only the ground-truth boxes as positive examples for their respective classes and label proposals with less than 0.3 IoU overlap with all instances of a class as a negative for that class. Proposals that fall into the grey zone (more than 0.3 IoU overlap, but are not ground truth) are ignored.

...

Our hypothesis is that this difference in how positives and negatives are defined is not fundamentally important and arises from the fact that fine-tuning data is limited. Our current scheme introduces many “jittered” examples (those proposals with overlap between 0.5 and 1, but not ground truth), which expands the number of positive examples by approximately 30x. We conjecture that this large set is needed when fine-tuning the entire network to avoid overfitting. However, we also note that using these jittered examples is likely suboptimal because the network is not being fine-tuned for precise localization.

This leads to the second issue: Why, after fine-tuning, train SVMs at all? It would be cleaner to simply apply the last layer of the fine-tuned network, which is a 21-way softmax regression classifier, as the object detector. We tried this and found that performance on VOC 2007 dropped from 54.2% to 50.9% mAP. This performance drop likely arises from a combination of several factors including that the definition of positive examples used in fine-tuning does not emphasize precise localization and the softmax classifier was trained on randomly sampled negative examples rather than on the subset of “hard negatives” used for SVM training. This result shows that it’s possible to obtain close to the same level of performance without training SVMs after fine-tuning. We conjecture that with some additional tweaks to fine-tuning the remaining performance gap may be closed. If true, this would simplify and speed up R-CNN training with no loss in detection performance.
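
    To make the two labeling schemes in the quote concrete, here is a minimal Python sketch (my own reconstruction, not the authors' code) of how a proposal would be labeled for fine-tuning versus for SVM training; the `iou` helper and the box format are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def finetune_label(proposal, gt_boxes, gt_classes):
    """Fine-tuning (softmax) rule: IoU >= 0.5 with the best-matching
    ground-truth box -> that box's class; everything else -> background (0)."""
    best = max(range(len(gt_boxes)),
               key=lambda i: iou(proposal, gt_boxes[i]), default=None)
    if best is not None and iou(proposal, gt_boxes[best]) >= 0.5:
        return gt_classes[best]
    return 0  # background

def svm_label_for_proposal(proposal, gt_boxes_of_class):
    """Per-class SVM rule: the positives are the ground-truth boxes themselves
    (added outside this function); a proposal whose IoU with every instance of
    the class is below 0.3 is a negative; the grey zone is ignored."""
    max_iou = max((iou(proposal, g) for g in gt_boxes_of_class), default=0.0)
    if max_iou < 0.3:
        return -1   # negative example for this class's SVM
    return None     # grey zone (>= 0.3 IoU but not ground truth): dropped
```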

    So is Fast R-CNN able to get rid of the SVM because its training set demands more precisely localized positives, or because it can construct that training set more efficiently? Or is it the network design that lets Fast R-CNN be trained with less data? Do the two actually use the same amount of data?

    From the quote below, the training set Fast R-CNN builds for its network matches R-CNN's CNN training set in one respect and differs in another: both take RoIs whose IoU with a ground-truth box is at least 0.5 as positives for that class. The difference lies in how background (negative) examples are chosen: Fast R-CNN samples negatives from RoIs whose maximum IoU falls in [0.1, 0.5), whereas R-CNN uses the whole [0, 0.5) range; excluding the [0, 0.1) RoIs acts, in effect, as a heuristic for hard example mining.

    But this seems to contradict the two points made when explaining R-CNN: that training the network's parameters requires a large amount of data, and that counting every RoI with IoU above 0.5 as a positive hurts mAP to some extent. We need to look more closely at how Fast R-CNN handles localization.

Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a groundtruth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.
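
    As a concrete reading of the sampling rule quoted above, the Python sketch below (an approximation, not the released implementation) draws 64 RoIs per image, 25% from the foreground set (max IoU ≥ 0.5) and the rest from the [0.1, 0.5) background interval; the array and parameter names are my own.

```python
import numpy as np

def sample_rois(max_ious, rois_per_image=64, fg_fraction=0.25,
                fg_thresh=0.5, bg_lo=0.1, bg_hi=0.5, rng=np.random):
    """max_ious[i] is the maximum IoU of proposal i with any ground-truth box.
    Returns indices of the sampled foreground and background RoIs."""
    fg_inds = np.where(max_ious >= fg_thresh)[0]
    bg_inds = np.where((max_ious >= bg_lo) & (max_ious < bg_hi))[0]

    n_fg = min(int(round(fg_fraction * rois_per_image)), len(fg_inds))
    n_bg = min(rois_per_image - n_fg, len(bg_inds))

    fg_sample = rng.choice(fg_inds, size=n_fg, replace=False)
    bg_sample = rng.choice(bg_inds, size=n_bg, replace=False)
    return fg_sample, bg_sample  # labels: u >= 1 for fg, u = 0 for bg
```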

    At test time Fast R-CNN takes the original image together with roughly 2000 candidate RoIs as input, just like R-CNN, but its outputs are a per-class score for each RoI and a bounding-box regression offset. It is precisely this offset that supervises Fast R-CNN toward more precise localization. Back in R-CNN, the authors explained that the fine-tuning data's lack of localization precision was the gap between the softmax-only network and the full pipeline with SVMs, so closing that gap should also be one of the motivations behind Fast R-CNN.
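
    The multi-head output mentioned above can be sketched as two sibling fully connected layers on top of the pooled RoI features. The PyTorch code below is my own illustration of that structure (the 4096-dimensional input and the 4 offsets per class are assumptions following the common VGG-16 setup), not the released code.

```python
import torch
import torch.nn as nn

class FastRCNNHead(nn.Module):
    def __init__(self, in_features=4096, num_classes=21):  # 20 VOC classes + background
        super().__init__()
        self.cls_score = nn.Linear(in_features, num_classes)      # class logits
        self.bbox_pred = nn.Linear(in_features, 4 * num_classes)  # (dx, dy, dw, dh) per class

    def forward(self, roi_features):
        # roi_features: (num_rois, in_features), e.g. the fc7 output after RoI pooling
        scores = self.cls_score(roi_features)        # softmax is applied inside the loss
        bbox_deltas = self.bbox_pred(roi_features)   # regression offsets per class
        return scores, bbox_deltas

# Example: push 128 RoIs through the head
head = FastRCNNHead()
scores, deltas = head(torch.randn(128, 4096))
print(scores.shape, deltas.shape)  # torch.Size([128, 21]) torch.Size([128, 84])
```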

    In our recent work on ECG classification we did something similar: we first trained a 1-D CNN to classify the ECG signals, then fed the features extracted by the CNN into an SVM classifier, which improved the results considerably.
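
    For reference, the "CNN features into an SVM" pattern looks roughly like the scikit-learn sketch below; the random arrays merely stand in for features produced by the trained 1-D CNN (truncated before its softmax layer), so shapes and labels are illustrative assumptions rather than the actual ECG pipeline.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
cnn_features_train = rng.normal(size=(200, 128))   # stand-in for CNN-extracted features
labels_train = rng.integers(0, 2, size=200)        # stand-in for ECG class labels
cnn_features_test = rng.normal(size=(50, 128))
labels_test = rng.integers(0, 2, size=50)

# Train a linear SVM on the fixed CNN features, mirroring how R-CNN trains
# per-class SVMs on top of the network's fc7 features.
svm = SVC(kernel="linear", C=1.0)
svm.fit(cnn_features_train, labels_train)
print("test accuracy:", svm.score(cnn_features_test, labels_test))
```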

    The two classifiers use the same dataset, and given how tightly the fully connected layers are coupled to the rest of the network, the SVM's results should not be dramatically better than the network's own classifier. I used to suspect that the amount of data was insufficient, making the fully connected layers hard to train, but if the fully connected layers fail to learn the parameters needed for classification, how could they still guide the training of the feature extractor (the convolutional layers) in front of them so well? Is some information being lost inside the fully connected layers?
