R-CNN

Rich feature hierarchies for accurate object detection and semantic segmentation

Paper link: https://arxiv.org/abs/1311.2524. To understand this paper, we should first cover some background knowledge.

Background knowledge

1. Non-maximum suppression (NMS): decides which of the bounding boxes produced by selective search are kept in the final detection set and which are suppressed as near-duplicates.


function pick = nms(boxes, overlap)
% top = nms(boxes, overlap)
% Non-maximum suppression. (FAST VERSION)
% Greedily select high-scoring detections and skip detections
% that are significantly covered by a previously selected
% detection.
%
% NOTE: This is adapted from Pedro Felzenszwalb's version (nms.m),
% but an inner loop has been eliminated to significantly speed it
% up in the case of a large number of boxes


% Copyright (C) 2011-12 by Tomasz Malisiewicz
% All rights reserved.
% 
% This file is part of the Exemplar-SVM library and is made
% available under the terms of the MIT license (see COPYING file).
% Project homepage: https://github.com/quantombone/exemplarsvm

if isempty(boxes)
  pick = [];
  return;
end


x1 = boxes(:,1);
y1 = boxes(:,2);
x2 = boxes(:,3);
y2 = boxes(:,4);
s = boxes(:,end);


area = (x2-x1+1) .* (y2-y1+1);    % compute the area of each bounding box
[vals, I] = sort(s);              % sort by score in ascending order


pick = s*0;
counter = 1;
while ~isempty(I)
  last = length(I);
  i = I(last);  
  pick(counter) = i;            % pick the box with the highest remaining score
  counter = counter + 1;

  xx1 = max(x1(i), x1(I(1:last-1)));
  yy1 = max(y1(i), y1(I(1:last-1)));
  xx2 = min(x2(i), x2(I(1:last-1)));
  yy2 = min(y2(i), y2(I(1:last-1)));

  w = max(0.0, xx2-xx1+1);
  h = max(0.0, yy2-yy1+1);

  inter = w.*h;     % intersection area between each remaining box and the picked box
  o = inter ./ (area(i) + area(I(1:last-1)) - inter);  % IoU (intersection-over-union)

  I = I(find(o<=overlap));  % keep only boxes whose IoU is at or below the overlap threshold
end


pick = pick(1:(counter-1));
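As a quick usage check, here is a toy example (box coordinates and scores invented for illustration): five detections given as [x1 y1 x2 y2 score] rows, where an overlap threshold of 0.5 suppresses the near-duplicates.

boxes = [ 10  10 100 100 0.9;
          12  12 102 102 0.8;   % near-duplicate of box 1, suppressed
         200 200 300 300 0.7;
          11   9  99 101 0.6;   % near-duplicate of box 1, suppressed
         205 198 298 305 0.5];  % near-duplicate of box 3, suppressed
keep = nms(boxes, 0.5);          % returns [1; 3]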

2. Standard hard-negative mining
Suppose a training set D contains a very large number of negative examples, most of which do not help training, such as the large background regions in medical images. Hard-negative mining selects the representative negatives, i.e., the ones that are hard to classify, and trains on those.
The method is as follows:
Let C_1 ⊆ D be an initial training set. E(M, D) denotes the negative training data classified correctly by model M (the easy negatives), and H(M, D) denotes the negative training data classified incorrectly by M (the hard negatives). The algorithm repeatedly trains a model and updates C_i, M_i as follows (a code sketch follows the list):

 1. Train a model M_i on the training set C_i.
 2. If H(M_i, D) ⊆ C_i, stop and return M_i.
 3. Let C_i' := C_i \ X for any X ⊆ C_i ∩ E(M_i, D) (the '\' denotes set difference: drop from the cache the negatives the current model already classifies correctly).
 4. Let C_{i+1} := C_i' ∪ X for any X ⊆ D with X ∩ (H(M_i, D) \ C_i') ≠ ∅ (add negatives the current model misclassifies).
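
A minimal MATLAB sketch of this loop, assuming hypothetical helpers train_svm(feats, labels) and svm_scores(model, feats), with positive features pos and a large negative pool negPool (one feature vector per row); treating score >= -1 (margin violation) as "hard" follows the usual SVM convention:

negIdx = (1:min(1000, size(negPool,1)))';      % C_1: positives plus a small negative cache
for it = 1:10
    feats  = [pos; negPool(negIdx, :)];
    labels = [ones(size(pos,1),1); -ones(numel(negIdx),1)];
    model  = train_svm(feats, labels);          % step 1: train M_i on C_i
    s      = svm_scores(model, negPool);        % score every negative in D
    hard   = find(s >= -1);                     % H(M_i, D): margin violators
    if all(ismember(hard, negIdx))              % step 2: all hard negatives cached -> stop
        break;
    end
    sCached = svm_scores(model, negPool(negIdx, :));
    negIdx  = negIdx(sCached >= -1);            % step 3: shrink, drop easy negatives
    negIdx  = union(negIdx, hard);              % step 4: grow, add new hard negatives
end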

6. How to calculate average precision (AP):

[so,si]=sort(-out);            % out holds the classifier scores; sorting -out ranks them in descending order
tp=groundtruth(si)>0;
fp=groundtruth(si)<0;

fp=cumsum(fp);                 % cumulative count of false positives
tp=cumsum(tp);                 % cumulative count of true positives
rec=tp/sum(groundtruth>0);     % recall
prec=tp./(fp+tp);              % precision

ap=VOCap(rec,prec);

function ap = VOCap(rec,prec)

mrec=[0 ; rec ; 1];
mpre=[0 ; prec ; 0];
for i=numel(mpre)-1:-1:1
    mpre(i)=max(mpre(i),mpre(i+1));
end
i=find(mrec(2:end)~=mrec(1:end-1))+1;  % indices where recall changes (drops duplicates); +1 keeps indexing consistent
ap=sum((mrec(i)-mrec(i-1)).*mpre(i));  % area under the PR curve: sum of (delta recall) x precision

7. Evaluating detection


Dataset: the ILSVRC2013 detection dataset is split into three sets: train (395,918 images), val (20,121), and test (40,152). The val and test splits are exhaustively annotated, meaning that in each image all instances from all 200 classes are labeled with bounding boxes. Unlike val and test, the train images (due to their large number) are not exhaustively annotated. So the general strategy is to rely heavily on the val set and use some of the train images as an auxiliary source of positive examples.


Method

1. Category-independent region proposals. Selective search is adopted. However, selective search is not scale invariant, so each image is resized to a fixed width (500 pixels) before running it.
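A one-line sketch of that preprocessing step, assuming im is the loaded image (imresize preserves the aspect ratio when one dimension is NaN):

im = imresize(im, [NaN 500]);   % scale to a fixed width of 500 px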

2. A CNN that extracts a fixed-length feature vector from each region.
1) Region proposal transformations.
The regions have different shapes while the CNN needs a fixed-size input, so each region must be converted to that size. The paper considers three transformations (Fig. 2). In a pilot set of experiments, warping with context padding (p = 16 pixels) outperformed the alternatives by a large margin (3-5 mAP points).

Fig. 2: the three region transformations (tightest square with context, tightest square without context, warp).
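
A minimal sketch of the warp-with-context transformation, assuming im is an H×W×3 image and box = [x1 y1 x2 y2]. For simplicity it pads p pixels in the original image and replicates border pixels, whereas the paper pads so that exactly p context pixels survive at the warped resolution and fills missing context with the image mean:

function warped = warp_with_context(im, box, p, outSize)
% Warp a region proposal, including p pixels of surrounding context,
% to an outSize-by-outSize CNN input (227 for the AlexNet used in R-CNN).
imp = padarray(im, [p p], 'replicate');        % make context available at image borders
x1 = box(1); y1 = box(2); x2 = box(3); y2 = box(4);
crop = imp(y1:y2+2*p, x1:x2+2*p, :);           % box grown by p on every side
warped = imresize(crop, [outSize outSize]);    % anisotropic scaling = warp
end

% usage: warped = warp_with_context(im, [30 40 180 220], 16, 227);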

2) Supervised pre-training on the ILSVRC2012 classification dataset, followed by domain-specific fine-tuning on the ILSVRC2013 detection dataset (all region proposals with >= 0.5 IoU overlap with a ground-truth box are treated as positives for that box's class; the rest as negatives).

3. Training 200 per-class binary SVM classifiers. The labeling here differs from fine-tuning: only the ground-truth boxes are used as positives, proposals with less than 0.3 IoU overlap with all instances of a class are negatives for that class, and proposals in between are ignored (the 0.3 threshold was selected on a validation set).
Since negatives vastly outnumber positives, the standard hard-negative mining method described above is adopted. A sketch of the label construction follows.
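A minimal sketch of building the SVM training set for one class, assuming a hypothetical helper iou(A, B) that returns the pairwise IoU matrix between two sets of [x1 y1 x2 y2] boxes, and gtFeats/propFeats holding the CNN features of the ground-truth boxes and proposals:

O = iou(proposals, gtBoxes);            % n_proposals x n_gt IoU matrix
maxO = max(O, [], 2);                   % each proposal's best overlap with this class
negMask = maxO < 0.3;                   % negatives: below 0.3 against every instance
X = [gtFeats; propFeats(negMask, :)];   % positives are the ground-truth boxes themselves
y = [ones(size(gtFeats,1),1); -ones(nnz(negMask),1)];
% model = train_svm(X, y);              % then refine with hard-negative mining (item 2 above)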

4. Bounding-box regression.
The input to the training algorithm is a set of N training pairs (P^i, G^i), i = 1, ..., N, where P denotes proposal boxes and G the ground-truth boxes. How are these pairs selected? Intuitively, if P is far from all ground-truth boxes, then the task of transforming P to a ground-truth box G does not make sense; using examples like P would lead to a hopeless learning problem. Therefore, we only learn from a proposal P if it is near at least one ground-truth box. "Nearness" is implemented by assigning P to the ground-truth box G with which it has maximum IoU overlap (in case it overlaps more than one), if and only if the overlap is greater than a threshold (set to 0.6 using a validation set). All unassigned proposals are discarded. (Note: a P^i whose IoU with several G exceeds 0.6 is assigned to the highest-IoU one, not discarded.)
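
A minimal sketch of computing the regression targets and fitting the ridge regressors on proposal features Phi (in the paper these are pool5 features; the names Phi and lambda and the column layout are assumptions for illustration):

function W = train_bbox_regressor(Phi, P, G, lambda)
% P, G: N-by-4 boxes as [x1 y1 x2 y2]; Phi: N-by-d features; lambda: ridge weight.
Pw = P(:,3)-P(:,1)+1;  Ph = P(:,4)-P(:,2)+1;   % proposal width/height
Px = P(:,1)+0.5*Pw;    Py = P(:,2)+0.5*Ph;     % proposal center
Gw = G(:,3)-G(:,1)+1;  Gh = G(:,4)-G(:,2)+1;
Gx = G(:,1)+0.5*Gw;    Gy = G(:,2)+0.5*Gh;
% targets t = (t_x, t_y, t_w, t_h): scale-invariant shift and log-space size change
T = [(Gx-Px)./Pw, (Gy-Py)./Ph, log(Gw./Pw), log(Gh./Ph)];
% regularized least squares, one weight vector per target dimension
W = (Phi'*Phi + lambda*eye(size(Phi,2))) \ (Phi'*T);
end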

Overview

Summary: the paper does not use the CNN directly as the classifier; instead, the CNN serves as a feature extractor and SVMs perform the final classification. Appendix B of the paper explains why; the likely reasons are:

the definition of positive examples used in fine-tuning does not emphasize precise localization and the softmax classifier was trained on randomly sampled negative examples rather than on the subset of “hard negatives” used for SVM training.
