【Thoughts】Fast R-CNN's "De-SVM-ification"

    My curiosity started with a question: why does R-CNN need an SVM to boost performance instead of directly using the softmax classifier that comes "built in" with the neural network? And why can Fast R-CNN drop the SVM entirely and achieve good results with nothing more than a multi-head output network?

(Figure: the Fast R-CNN network architecture)

    The appendix of the arXiv version of the R-CNN paper also discusses the choice of SVM over softmax: experiments showed that using softmax alone drops mAP from 54.2% to 50.9%, roughly 3.3 points. The authors attribute the gap to the different ways the training sets for the SVMs and for the CNN are constructed. Because fine-tuning the CNN demands far more data, the CNN's training set imposes looser requirements on how precisely positive RoIs are localized, which leaves the CNN's softmax classifier insensitive to precise localization. That is why the SVMs yield a sizeable mAP gain. (In the excerpts below, the first paragraph describes how the two training sets are constructed, the second the considerations behind the CNN's fine-tuning data, and the third the reasoning about why softmax alone falls short.)


...

To review the definitions briefly, for finetuning we map each object proposal to the ground-truth instance with which it has maximum IoU overlap (if any) and label it as a positive for the matched ground-truth class if the IoU is at least 0.5. All other proposals are labeled “background” (i.e., negative examples for all classes). For training SVMs, in contrast, we take only the ground-truth boxes as positive examples for their respective classes and label proposals with less than 0.3 IoU overlap with all instances of a class as a negative for that class. Proposals that fall into the grey zone (more than 0.3 IoU overlap, but are not ground truth) are ignored.
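The two labeling schemes quoted above can be sketched as a pair of functions. This is my own illustration, not the paper's code: `ious` is assumed to be an N×M matrix of proposal-vs-ground-truth IoU overlaps, and `gt_classes` the class index of each ground-truth box.

```python
import numpy as np

def label_proposals_finetune(ious, gt_classes):
    """R-CNN fine-tuning labels: a proposal is a positive for the class
    of its best-matching ground-truth box if its max IoU >= 0.5;
    everything else is background (label 0)."""
    best_gt = ious.argmax(axis=1)
    best_iou = ious.max(axis=1)
    labels = np.zeros(len(ious), dtype=int)
    pos = best_iou >= 0.5
    labels[pos] = gt_classes[best_gt[pos]]
    return labels

def label_proposals_svm(ious, gt_classes, cls):
    """R-CNN per-class SVM labels for class `cls`: only ground-truth
    boxes are positives (added outside this function); proposals with
    IoU < 0.3 against *all* instances of the class are negatives;
    the grey zone in between is ignored (label -1)."""
    cls_ious = ious[:, gt_classes == cls]
    max_iou = cls_ious.max(axis=1) if cls_ious.size else np.zeros(len(ious))
    labels = np.full(len(ious), -1, dtype=int)  # -1 = ignored
    labels[max_iou < 0.3] = 0                   # negative for this class
    return labels
```

Note how much stricter the SVM scheme is: its positives are the ground-truth boxes themselves, while fine-tuning accepts any proposal above 0.5 IoU, which is exactly the "jittered examples" point the next excerpt makes.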

...

Our hypothesis is that this difference in how positives and negatives are defined is not fundamentally important and arises from the fact that fine-tuning data is limited. Our current scheme introduces many “jittered” examples (those proposals with overlap between 0.5 and 1, but not ground truth), which expands the number of positive examples by approximately 30x. We conjecture that this large set is needed when fine-tuning the entire network to avoid overfitting. However, we also note that using these jittered examples is likely suboptimal because the network is not being fine-tuned for precise localization.

This leads to the second issue: Why, after fine-tuning, train SVMs at all? It would be cleaner to simply apply the last layer of the fine-tuned network, which is a 21-way softmax regression classifier, as the object detector. We tried this and found that performance on VOC 2007 dropped from 54.2% to 50.9% mAP. This performance drop likely arises from a combination of several factors including that the definition of positive examples used in fine-tuning does not emphasize precise localization and the softmax classifier was trained on randomly sampled negative examples rather than on the subset of “hard negatives” used for SVM training. This result shows that it’s possible to obtain close to the same level of performance without training SVMs after fine-tuning. We conjecture that with some additional tweaks to fine-tuning the remaining performance gap may be closed. If true, this would simplify and speed up R-CNN training with no loss in detection performance.

    So can Fast R-CNN shed the SVM because its dataset construction demands more precise localization, or because it builds its training set more efficiently? Or is it the network design that lets Fast R-CNN train without needing more data? Do the two methods even use the same amount of data?

    The excerpt below shows how Fast R-CNN's training set for the network compares with R-CNN's. They agree on positives: both take RoIs with IoU ≥ 0.5 against a ground-truth box as class positives. They differ on background (negative) examples: Fast R-CNN draws negatives from the IoU interval [0.1, 0.5), whereas R-CNN uses [0, 0.5); the 0.1 lower threshold, which excludes RoIs in [0, 0.1) from sampling altogether, appears to act as a heuristic for hard example mining.

    But this contradicts the two claims from R-CNN's explanation: that training the network's parameters requires a large amount of data, and that counting anything above 0.5 IoU as a positive hurts mAP to some degree. We need to look further at how Fast R-CNN handles localization.

Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a groundtruth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.
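The per-image sampling rules in this excerpt can be sketched as follows; the function name and signature are my own illustration, not the paper's code. `max_ious` is each proposal's maximum IoU against any ground-truth box.

```python
import numpy as np

def sample_rois(max_ious, rois_per_image=64, fg_fraction=0.25, rng=None):
    """Sketch of Fast R-CNN's per-image RoI sampling: 25% foreground
    (max IoU >= 0.5), the rest background drawn from [0.1, 0.5).
    RoIs below 0.1 are never sampled at all, which is the heuristic
    hard-example-mining effect of the 0.1 lower threshold."""
    if rng is None:
        rng = np.random.default_rng()
    fg_idx = np.flatnonzero(max_ious >= 0.5)
    bg_idx = np.flatnonzero((max_ious >= 0.1) & (max_ious < 0.5))
    n_fg = min(int(rois_per_image * fg_fraction), len(fg_idx))
    n_bg = min(rois_per_image - n_fg, len(bg_idx))
    fg = rng.choice(fg_idx, size=n_fg, replace=False)   # labeled u >= 1
    bg = rng.choice(bg_idx, size=n_bg, replace=False)   # labeled u = 0
    return fg, bg
```

With N = 2 images this yields the R = 128 RoIs per mini-batch described in the excerpt.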

    At test time, Fast R-CNN takes the original image and ~2000 candidate RoIs as input, just like R-CNN, but it outputs per-RoI class scores together with bounding-box regression offsets. It is precisely these offsets that steer Fast R-CNN, in a supervised way, toward more precise localization. Back in R-CNN, the authors explained that the imprecise localization learned from this kind of training data was what separated the plain network from the SVM-augmented pipeline, so this should also be one of the motivations behind Fast R-CNN.
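The two-headed output is trained jointly with Fast R-CNN's multi-task loss, roughly L = L_cls + λ·[u ≥ 1]·L_loc. Below is a minimal NumPy sketch of that loss; the shapes, and the simplification that the regression offsets have already been selected for each RoI's true class, are my own assumptions for illustration.

```python
import numpy as np

def smooth_l1(x):
    """Smooth-L1 penalty Fast R-CNN uses for box regression."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(cls_probs, labels, box_deltas, box_targets, lam=1.0):
    """Sketch of L = L_cls + lam * [u >= 1] * L_loc: log loss on the
    softmax scores plus smooth-L1 on the regression offsets, the latter
    applied only to foreground RoIs (u >= 1).
    cls_probs: (N, K+1) softmax outputs; box_deltas/box_targets: (N, 4)."""
    n = len(labels)
    l_cls = -np.log(cls_probs[np.arange(n), labels]).mean()
    fg = labels >= 1
    if fg.any():
        l_loc = smooth_l1(box_deltas[fg] - box_targets[fg]).sum(axis=1).mean()
    else:
        l_loc = 0.0  # background-only batch: no localization term
    return l_cls + lam * l_loc
```

The key point for this post is the second term: unlike R-CNN's fine-tuning, every foreground RoI now also receives a direct, supervised localization gradient.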

    In a recent ECG classification project of ours, we did something similar: we first trained a 1-D CNN to classify ECG signals, then fed the features extracted by the CNN into an SVM classifier, and the results improved substantially.
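That pipeline (freeze the CNN, train a linear SVM on its features) can be sketched as below. The random vectors merely stand in for features a trained 1-D CNN would extract from ECG segments, and all names and shapes are illustrative, not from our actual project.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-in for CNN-extracted features: two well-separated "classes"
# of 32-dimensional vectors (50 samples each).
features = np.vstack([rng.normal(0.0, 1.0, (50, 32)),
                      rng.normal(2.0, 1.0, (50, 32))])
labels = np.repeat([0, 1], 50)

# Linear SVM trained on the frozen features, replacing the softmax head.
svm = LinearSVC(C=1.0).fit(features, labels)
acc = svm.score(features, labels)
```

The SVM here plays the same role as R-CNN's per-class SVMs: a separately trained classifier sitting on top of fixed network features.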

    Both classifiers use the same dataset, and given how tightly the SVM is coupled to the layers feeding it, the SVM should not do dramatically better than the network itself. I previously suspected that limited data made the fully connected layers hard to train, but if the fully connected layers fail to learn the parameters needed for classification, how could they in turn so effectively guide the training of the feature extractor (the convolutional layers) before them? Is information being lost inside the fully connected layers?
