Training Protocol
- backbone: ResNet-101 or modified aligned Xception
- pretrain: ImageNet-1K
- dataset: PASCAL VOC 2012 (20 foreground object classes, 1 background class)
  10582 (trainaug) training images, 1449 (val), 1456 (test)
- lr schedule: “poly” policy (initial lr: 0.007)
  The initial learning rate is multiplied by (1 - iter/max_iter)^power with power = 0.9.
- crop size: 513×513
  For atrous convolution with large rates to be effective, a large crop size is required.
- fine-tune batch normalization parameters when output stride = 16
  output stride: the ratio of input image spatial resolution to final output resolution.
  The added modules (ASPP, decoder, etc.) on top of ResNet all include batch normalization parameters.
- batch size = 16
  The batch normalization parameters are trained with decay = 0.9997.
  After training on the trainaug set for 30K iterations with initial learning rate = 0.007, we then freeze the batch normalization parameters, employ output stride = 8, and train on the official PASCAL VOC 2012 trainval set for another 30K iterations with a smaller base learning rate = 0.001.
- random scale data augmentation: randomly scaling the input image (from 0.5 to 2.0) and randomly left-right flipping during training
- include batch normalization parameters in the proposed decoder module
- train end-to-end
- Upsampling logits
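
The “poly” learning-rate policy and the two-stage schedule in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' training code. The function name `poly_lr`, the `STAGES` layout, and its keys are hypothetical. It assumes power = 0.9 and that the second stage restarts the decay from its own base learning rate.

```python
def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 0.9) -> float:
    """Learning rate at `step` under the "poly" policy.

    The base rate is multiplied by (1 - step / max_steps) ** power.
    power = 0.9 is an assumed default for this sketch.
    """
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1.0 - step / max_steps) ** power


# Two training stages as described in the list; the key names are illustrative only.
STAGES = [
    {"split": "trainaug", "base_lr": 0.007, "steps": 30_000,
     "output_stride": 16, "train_batch_norm": True},
    {"split": "trainval", "base_lr": 0.001, "steps": 30_000,
     "output_stride": 8, "train_batch_norm": False},
]

if __name__ == "__main__":
    for stage in STAGES:
        # Sample the schedule at the start, middle, and end of each stage.
        for step in (0, stage["steps"] // 2, stage["steps"]):
            lr = poly_lr(stage["base_lr"], step, stage["steps"])
            print(f'{stage["split"]:8s} step {step:6d}  lr = {lr:.6f}')
```

The rate starts at the base value and decays smoothly to zero at the final iteration, so each stage ends with very small updates.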
Inference strategy on val set
- output stride = 8
  The model is trained with output stride = 16, and output stride = 8 is applied during inference to get a more detailed feature map.
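
To make the effect of the output stride concrete, the sketch below computes the side length of the final feature map for a 513×513 input. The helper `feature_map_size` is hypothetical. It assumes the common convention for inputs of the form k × stride + 1, where the output side is (n - 1) / stride + 1. The plain ratio n / stride from the definition above gives nearly the same value.

```python
def feature_map_size(input_size: int, output_stride: int) -> int:
    """Side length of the final feature map for a square input.

    Assumes the (n - 1) // stride + 1 convention, which suits
    inputs such as 513 = 32 * 16 + 1. This is an assumption of the sketch.
    """
    return (input_size - 1) // output_stride + 1


if __name__ == "__main__":
    for stride in (16, 8):
        side = feature_map_size(513, stride)
        print(f"output stride {stride:2d}: {side}x{side} feature map")
    # output stride 16: 33x33 feature map
    # output stride  8: 65x65 feature map
```

Halving the output stride roughly doubles the feature-map side length, which is why output stride = 8 gives a more detailed map at inference but costs more computation.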