(ECCV 2016) Instance-sensitive Fully Convolutional Networks
Paper: https://arxiv.org/abs/1603.08678
our FCN is designed to compute a small set of instance-sensitive score maps, each of which is the outcome of a pixel-wise classifier of a relative position to instances.
our method does not have any high-dimensional layer related to the mask resolution, but instead exploits image local coherence for estimating instances.
Introduction
we develop an end-to-end fully convolutional network that is capable of segmenting candidate instances.
our method computes a set of instance-sensitive score maps, where each pixel is a classifier of relative positions to an object instance.
With this set of score maps, we are able to generate an object instance segment in each sliding window by assembling the output from the score maps.
Unlike DeepMask, our method has no layer whose size is related to the mask size.
Related Work
Instance-sensitive FCNs for Segment Proposal
From FCN to InstanceFCN
the output is indeed reusable for most pixels, except for those where one object adjoins another.
Instance-sensitive score maps
we propose an FCN where each output pixel is a classifier of relative positions of instances.
In our practice, we define the relative positions using a
Instance assembling module
We slide a window of resolution
Local Coherence
for a pixel in a natural image, its prediction is most likely the same when evaluated in two neighboring windows.
This allows us to conserve a large number of parameters when the mask resolution
This not only reduces the computational cost of the mask prediction layers but, more importantly, reduces the number of parameters required for mask regression, leading to less risk of overfitting on small datasets.
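The parameter saving can be made concrete with a little arithmetic. The sketch below compares a fully connected layer that regresses an m×m mask directly with a convolutional layer that outputs k^2 score maps. The feature width d = 512, the mask sizes, k = 5, and the 3×3 kernel are illustrative assumptions, not values taken from these notes.

```python
def fc_mask_params(d, m):
    """Weights of a fully connected layer mapping a d-dim feature to an m*m mask."""
    return d * m * m


def conv_score_map_params(d, k, kernel=3):
    """Weights of a kernel x kernel conv layer mapping d channels to k*k score maps."""
    return kernel * kernel * d * k * k


# Hypothetical sizes chosen only for illustration.
d, k = 512, 5
for m in (28, 56, 112):
    fc = fc_mask_params(d, m)
    conv = conv_score_map_params(d, k)
    print(f"m={m:3d}  fc weights={fc:>9,d}  conv weights={conv:>7,d}")
```

Under these assumptions the fully connected count grows quadratically with m, while the convolutional count depends only on k and stays fixed as the mask resolution increases.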
Algorithm and Implementation
Network architecture
We use the VGG-16 network [22] pre-trained on ImageNet [23] as the feature extractor. We follow the practice in [24] to reduce the network stride and increase feature map resolution: the max pooling layer pool4 (between conv4_3 and conv5_1) is modified to have a stride of 1 instead of 2, and accordingly the filters in conv5_1 to conv5_3 are adjusted by the “hole algorithm” [24].
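The effect of the pool4 change on the feature stride can be checked with a short calculation. This is a minimal sketch that assumes the standard VGG-16 layout, in which four stride-2 pooling layers precede conv5; the function and variable names are hypothetical.

```python
def effective_stride(pool_strides):
    """Product of the pooling strides that precede a layer."""
    stride = 1
    for s in pool_strides:
        stride *= s
    return stride


# pool1..pool4 in standard VGG-16 (assumed layout), then with pool4 at stride 1.
original = [2, 2, 2, 2]
modified = [2, 2, 2, 1]

stride_before = effective_stride(original)
stride_after = effective_stride(modified)

# The conv5 filters must be dilated by the removed stride factor so that they
# sample the input at the same image-space spacing as the pre-trained filters.
dilation = stride_before // stride_after

print(stride_before, stride_after, dilation)
```

With these assumptions the feature stride drops from 16 to 8, and the conv5 filters need a dilation ("hole") factor of 2.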
On top of the feature map, there are two fully convolutional branches, one for estimating segment instances and the other for scoring the instances.
Figure 4. Details of the InstanceFCN architecture. On the top is a fully convolutional branch for generating k^2 instance-sensitive score maps, followed by the assembling module that outputs instances. On the bottom is a fully convolutional branch for predicting the objectness score of each window. The highly scored output instances are on the right. In this figure, the objectness map and the “all instances” map have been sub-sampled for the purpose of illustration.
Training
We adopt the image-centric strategy in [21,19].
The forward pass computes the set of instance-sensitive score maps and the objectness score map.
After that, a set of 256 sliding windows is randomly sampled [21,19], and the instances are only assembled from these 256 windows for computing the loss function.
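The window sampling step might be sketched as follows. Uniform sampling without replacement is an assumption made here for simplicity; the cited works may use a different scheme, for example balancing positive and negative windows. The function name, the feature map size, and the window size m are hypothetical.

```python
import random


def sample_windows(feat_h, feat_w, m, num_samples=256, seed=None):
    """Pick top-left corners of m x m windows on a feat_h x feat_w map.

    Sampling is uniform without replacement (an assumption of this sketch).
    """
    rng = random.Random(seed)
    positions = [
        (y, x)
        for y in range(feat_h - m + 1)
        for x in range(feat_w - m + 1)
    ]
    count = min(num_samples, len(positions))
    return rng.sample(positions, count)


# Hypothetical feature map of 60 x 80 with 21 x 21 windows.
windows = sample_windows(60, 80, 21, seed=0)
print(len(windows))
```

Only the sampled windows would then be assembled and passed to the loss, which keeps the cost of each training iteration independent of the number of possible window positions.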
The loss function is defined as:
Here
We follow the scale jittering in [26] for training: a training image is resized such that its shorter side is randomly sampled from
Inference
A forward pass of the network is run on the input image, generating the instance-sensitive score maps and the objectness score map.
The assembling module then applies densely sliding windows on these maps to produce a segment instance at each position.
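The assembling step for a single window can be sketched in NumPy. The sketch assumes the k^2 score maps are stacked in row-major order of relative position (top-left first), that the window is m×m, and that m is divisible by k. These conventions and all names are assumptions for illustration, not details confirmed by these notes.

```python
import numpy as np


def assemble_instance(score_maps, top, left, m):
    """Assemble one m x m instance mask from k*k instance-sensitive score maps.

    score_maps: array of shape (k*k, H, W), one map per relative position.
    (top, left): top-left corner of the sliding window on the score maps.
    Each of the k x k cells of the window is copied from the score map that
    corresponds to that cell's relative position; nothing is learned here.
    """
    k2 = score_maps.shape[0]
    k = int(round(k2 ** 0.5))
    assert k * k == k2, "number of score maps must be a perfect square"
    assert m % k == 0, "this sketch assumes m is divisible by k"
    cell = m // k

    out = np.empty((m, m), dtype=score_maps.dtype)
    for r in range(k):
        for c in range(k):
            ys = slice(top + r * cell, top + (r + 1) * cell)
            xs = slice(left + c * cell, left + (c + 1) * cell)
            out[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = \
                score_maps[r * k + c, ys, xs]
    return out
```

Because the module only copies values, sliding the window by one pixel reuses the same score-map pixels in a different arrangement, which is how the approach relies on local coherence.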
To handle multiple scales, we resize the shorter side of images to
For each output segment, we truncate the values to form a binary mask.
Then we adopt non-maximum suppression (NMS) to generate the final set of segment proposals.
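The last two steps can be sketched as below. The sketch assumes each segment has already been placed in full-image coordinates, that binarization uses a fixed threshold, and that NMS compares the tight bounding boxes of the binary masks. The threshold values (0.5 and 0.8), the box-based IoU, and all function names are assumptions for illustration rather than details given in these notes.

```python
import numpy as np


def tight_box(mask):
    """Tight bounding box (x0, y0, x1, y1) of a binary mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)


def box_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def binarize_and_nms(segments, scores, mask_thresh=0.5, iou_thresh=0.8):
    """Threshold soft segments, then greedily suppress overlapping proposals.

    Returns a list of (index, binary_mask) pairs in descending score order.
    """
    masks = [seg >= mask_thresh for seg in segments]
    boxes = [tight_box(mask) for mask in masks]
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)

    keep = []
    for i in order:
        if boxes[i] is None:
            continue  # an empty mask yields no proposal
        if all(box_iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return [(i, masks[i]) for i in keep]
```

The greedy loop keeps the highest-scoring proposal first and drops any later proposal whose box overlaps a kept one by more than the IoU threshold.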
Experiments
Experiments on PASCAL VOC 2012
Ablations on the number of relative positions
Our method is not sensitive to
Ablation comparisons with the DeepMask scheme
Comparisons on Instance Semantic Segmentation
Experiments on MS COCO
Conclusion
We have presented InstanceFCN, a fully convolutional scheme for proposing segment instances.
A simple assembling module is then able to generate segment instances from these score maps.
Our network architecture handles instance segmentation without using any high-dimensional layers that depend on the mask resolution.