R-CNN

Rich feature hierarchies for accurate object detection and semantic segmentation

Here is the paper link. To understand this paper, we should first learn some background knowledge.

Background knowledge

1, Non-maximum suppression (NMS): determines which of the bounding boxes produced by selective search are kept in the final detection set.


function pick = nms(boxes, overlap)
% top = nms(boxes, overlap)
% Non-maximum suppression. (FAST VERSION)
% Greedily select high-scoring detections and skip detections
% that are significantly covered by a previously selected
% detection.
%
% NOTE: This is adapted from Pedro Felzenszwalb's version (nms.m),
% but an inner loop has been eliminated to significantly speed it
% up in the case of a large number of boxes

% Copyright (C) 2011-12 by Tomasz Malisiewicz
% All rights reserved.
%
% This file is part of the Exemplar-SVM library and is made
% available under the terms of the MIT license (see COPYING file).
% Project homepage: https://github.com/quantombone/exemplarsvm

if isempty(boxes)
  pick = [];
  return;
end

x1 = boxes(:,1);
y1 = boxes(:,2);
x2 = boxes(:,3);
y2 = boxes(:,4);
s = boxes(:,end);

area = (x2-x1+1) .* (y2-y1+1);    % area of each bounding box
[vals, I] = sort(s);              % sort scores in ascending order

pick = s*0;
counter = 1;
while ~isempty(I)
  last = length(I);
  i = I(last);
  pick(counter) = i;              % pick the box with the highest remaining score
  counter = counter + 1;

  xx1 = max(x1(i), x1(I(1:last-1)));
  yy1 = max(y1(i), y1(I(1:last-1)));
  xx2 = min(x2(i), x2(I(1:last-1)));
  yy2 = min(y2(i), y2(I(1:last-1)));

  w = max(0.0, xx2-xx1+1);
  h = max(0.0, yy2-yy1+1);

  inter = w.*h;                   % intersection of each remaining box with the picked box
  o = inter ./ (area(i) + area(I(1:last-1)) - inter);  % IoU (intersection-over-union)

  I = I(find(o<=overlap));        % keep only boxes whose IoU is below the overlap threshold
end

pick = pick(1:(counter-1));
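
A minimal usage sketch of the function above (the box coordinates and the 0.3 threshold are only illustrative values):

% Each row of boxes is [x1 y1 x2 y2 score].
boxes = [ 10  10 100 100 0.95;
          12  12  98  98 0.90;
         200  50 280 140 0.80];
pick = nms(boxes, 0.3);     % indices of the surviving detections
kept = boxes(pick, :);      % rows 1 and 3 survive; row 2 is suppressed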

2, Standard hard-negative mining
Suppose we have a training set D that contains a very large number of negative examples, most of which do not help training, e.g., the large background regions in medical images. Hard-negative mining selects the representative negatives, i.e., the ones that are hard to distinguish, and trains on those.
The method is as follows (a code sketch follows the list):
Let C1 ⊆ D be an initial training set. E(M, D) denotes the negative training examples that are classified correctly (the easy negatives) by model M; H(M, D) denotes the negative training examples that are classified incorrectly (the hard negatives) by M. The algorithm repeatedly trains a model and updates Ci, Mi as follows:

 1, Train a model Mi on the training set Ci.
 2, If H(Mi, D) ⊆ Ci, stop and return Mi.
 3, Let Ci := Ci \ X for any X ∈ E(Mi, D). ('\' denotes the set difference: remove the negatives the current model already classifies correctly.)
 4, Let Ci+1 := Ci ∪ X for any X ∈ D with X ∈ H(Mi, D) \ Ci. (Add the negatives the current model misclassifies.)
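
A minimal MATLAB sketch of this loop. The helpers trainFn and scoreFn are assumptions, not part of any released code: trainFn(pos, neg) trains a binary classifier, and scoreFn(model, X) returns one score per row of X, with score > 0 meaning "classified as positive" (so a positive score on a negative example marks it as hard).

function model = hard_negative_mining(pos, negAll, trainFn, scoreFn, maxIter)
% pos:    positive examples, one feature row per example
% negAll: the full pool of negatives D
n = size(negAll, 1);
negCache = negAll(randperm(n, min(1000, n)), :);   % C1: random initial subset of D
for i = 1:maxIter
  model = trainFn(pos, negCache);                  % 1, train Mi on Ci
  hard = negAll(scoreFn(model, negAll) > 0, :);    % H(Mi, D): misclassified negatives
  isNew = ~ismember(hard, negCache, 'rows');
  if ~any(isNew)                                   % 2, H(Mi, D) is inside Ci: stop, return Mi
    return;
  end
  negCache = negCache(scoreFn(model, negCache) > 0, :);  % 3, drop the easy negatives E(Mi, D)
  negCache = [negCache; hard(isNew, :)];           % 4, add the newly found hard negatives
end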

6, How to calculate average precision (AP):

[so,si]=sort(-out);  % out holds the classifier scores; negate to sort in descending order
tp=groundtruth(si)>0;
fp=groundtruth(si)<0;

fp=cumsum(fp);  % cumulative number of false positives
tp=cumsum(tp);  % cumulative number of true positives
rec=tp/sum(groundtruth>0);  % recall
prec=tp./(fp+tp);           % precision

ap=VOCap(rec,prec);

function ap = VOCap(rec,prec)

mrec=[0 ; rec ; 1];
mpre=[0 ; prec ; 0];
for i=numel(mpre)-1:-1:1
    mpre(i)=max(mpre(i),mpre(i+1));   % make precision monotonically non-increasing
end
i=find(mrec(2:end)~=mrec(1:end-1))+1; % drop duplicate recall values; +1 keeps the indices aligned
ap=sum((mrec(i)-mrec(i-1)).*mpre(i)); % area under the precision-recall curve: (x(2)-x(1))*y
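
A small worked example of the snippet above (the scores and labels are invented purely for illustration; note that VOCap expects column vectors):

out = [0.9; 0.7; 0.6; 0.3];        % classifier scores
groundtruth = [1; -1; 1; 1];       % +1 = positive, -1 = negative
[so, si] = sort(-out);             % descending by score
tp = cumsum(groundtruth(si) > 0);  % [1; 1; 2; 3]
fp = cumsum(groundtruth(si) < 0);  % [0; 1; 1; 1]
rec = tp / sum(groundtruth > 0);   % [1/3; 1/3; 2/3; 1]
prec = tp ./ (fp + tp);            % [1; 1/2; 2/3; 3/4]
ap = VOCap(rec, prec)              % = 0.8333 after the interpolation above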

7, Evaluating detection


Dataset: the ILSVRC2013 detection dataset is split into three sets: train (395,918 images), val (20,121), and test (40,152). The val and test splits are exhaustively annotated, meaning that in each image all instances from all 200 classes are labeled with bounding boxes. Unlike val and test, the train images (due to their large number) are not exhaustively annotated, so the general strategy is to rely heavily on the val set and use some of the train images as an auxiliary source of positive examples.


Method

1, Category-independent region proposals. Selective search is adopted. However, selective search is not scale invariant, so each image is resized to a fixed width (500 pixels) before proposals are extracted.

2, A CNN that extracts a fixed-length feature vector from each region.
1) Region proposal transformations.
The regions all have different shapes, while the CNN needs a fixed-size input, so each region must be resized. The paper considers three methods (Fig. 2). In a pilot set of experiments, warping with context padding outperformed the alternatives by a large margin (3-5 mAP points).

[Fig. 2: the region proposal transformations]
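
A minimal MATLAB sketch of warping with context padding, assuming the network's 227×227 input and p = 16 context pixels. Clipping at the image border is a simplification (the paper replicates the image mean there), and warp_region is my name, not a function from the released code:

function warped = warp_region(img, box, p, outSize)
% box = [x1 y1 x2 y2]; p = context pixels at the warped size; outSize, e.g. 227
w = box(3) - box(1) + 1;
h = box(4) - box(2) + 1;
% dilate the box so that, after warping to outSize, p pixels of context
% surround the original box on each side
padx = round(p * w / (outSize - 2*p));
pady = round(p * h / (outSize - 2*p));
x1 = max(1, box(1) - padx);
y1 = max(1, box(2) - pady);
x2 = min(size(img,2), box(3) + padx);
y2 = min(size(img,1), box(4) + pady);
warped = imresize(img(y1:y2, x1:x2, :), [outSize outSize]);  % anisotropic warp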

2) Supervised pre-training on the ILSVRC2012 classification dataset, followed by domain-specific fine-tuning on the ILSVRC2013 detection dataset (all regions with >= 0.5 IoU overlap with a ground-truth box are treated as positives for that box's class, and the rest as negatives).
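
A sketch of how such fine-tuning labels could be assigned. The helper name assign_labels, the row layouts, and the 0-for-background convention are illustrative, not from the paper's code:

function labels = assign_labels(proposals, gt, gtClass, thresh)
% proposals: Nx4 [x1 y1 x2 y2]; gt: Mx4 ground-truth boxes
% gtClass:   Mx1 class ids; labels: Nx1 class id, or 0 for background
areaG = (gt(:,3)-gt(:,1)+1) .* (gt(:,4)-gt(:,2)+1);
labels = zeros(size(proposals,1), 1);
for n = 1:size(proposals,1)
  p = proposals(n,:);
  xx1 = max(p(1), gt(:,1));  yy1 = max(p(2), gt(:,2));
  xx2 = min(p(3), gt(:,3));  yy2 = min(p(4), gt(:,4));
  inter = max(0, xx2-xx1+1) .* max(0, yy2-yy1+1);
  areaP = (p(3)-p(1)+1) * (p(4)-p(2)+1);
  iou = inter ./ (areaP + areaG - inter);   % IoU against every ground-truth box
  [best, j] = max(iou);
  if best >= thresh                         % e.g. thresh = 0.5 for fine-tuning
    labels(n) = gtClass(j);
  end
end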

3, Training 200 binary SVM classifiers. (For each class, regions with < 0.3 IoU overlap with every ground-truth instance of that class are treated as negatives for that class, and the ground-truth boxes themselves are the positives; the 0.3 overlap threshold was selected on a validation set.)
Since the negative data vastly outnumber the positive data, the standard hard-negative mining method described above is adopted.

4, Bounding-box regression.
The input to the training algorithm is a set of N training pairs (Pi, Gi), i = 1, ..., N, where P denotes a proposal box and G a ground-truth box. How are these pairs selected? Intuitively, if P is far from all ground-truth boxes, then the task of transforming P into a ground-truth box G does not make sense; using such examples would lead to a hopeless learning problem. Therefore, we only learn from a proposal P if it is near at least one ground-truth box. "Nearness" is implemented by assigning P to the ground-truth box G with which it has maximum IoU overlap (in case it overlaps more than one), if and only if that overlap is greater than a threshold (set to 0.6 using a validation set). All unassigned proposals are discarded. (Note that a proposal whose IoU exceeds 0.6 with several ground-truth boxes is not discarded; it is assigned to the one with the maximum IoU.)
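
The regression targets themselves are defined in the paper's Appendix C: tx = (Gx - Px)/Pw, ty = (Gy - Py)/Ph, tw = log(Gw/Pw), th = log(Gh/Ph). A minimal MATLAB sketch (the helper name and the [center_x center_y width height] row layout are my conventions, not the paper's code):

function t = bbox_targets(P, G)
% P, G: Nx4 matrices of [center_x center_y width height] for each proposal
% and its assigned ground-truth box. At test time the learned d(P) is
% inverted: x = Pw*dx + Px, y = Ph*dy + Py, w = Pw*exp(dw), h = Ph*exp(dh).
t = zeros(size(P));
t(:,1) = (G(:,1) - P(:,1)) ./ P(:,3);  % tx: x-shift in units of proposal width
t(:,2) = (G(:,2) - P(:,2)) ./ P(:,4);  % ty: y-shift in units of proposal height
t(:,3) = log(G(:,3) ./ P(:,3));        % tw: log-space width scaling
t(:,4) = log(G(:,4) ./ P(:,4));        % th: log-space height scaling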

Overview

Summary: instead of using the CNN directly as a classifier, the paper uses it as a feature extractor and leaves the final classification to SVMs. Appendix B discusses this choice; the likely reasons are explained as follows:

the definition of positive examples used in fine-tuning does not emphasize precise localization and the softmax classifier was trained on randomly sampled negative examples rather than on the subset of “hard negatives” used for SVM training.
