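Preface

Let's take a look at how Mask R-CNN, a very capable instance segmentation model, works. It is quite an interesting design!

What is Mask R-CNN

Mask R-CNN is a 2017 work led by Kaiming He. It performs object detection and instance segmentation at the same time and achieves excellent results.
The network design is fairly simple: on top of Faster R-CNN, which has two branches (classification and box coordinate regression), it adds a third branch that predicts a segmentation mask for each detected object.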

Source code download

https://github.com/bubbliiiing/mask-rcnn-keras
If you like it, feel free to give it a star.

How Mask R-CNN is implemented

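I. Prediction

1. The backbone network

Mask R-CNN uses ResNet101 as its backbone feature extraction network, which is the CNN part of the architecture diagram. It places a constraint on the input image size: the height and width must be divisible by 2^6 (64). After feature extraction, the feature maps whose height and width have been halved two, three, four, and five times are used to build the feature pyramid.

ResNet101 is built from two basic blocks, the Conv Block and the Identity Block. The input and output dimensions of a Conv Block differ, so Conv Blocks cannot be chained back to back; their role is to change the dimensions of the network. The input and output dimensions of an Identity Block are the same, so Identity Blocks can be chained, and they are used to deepen the network.
The structure of the Conv Block is as follows:

The structure of the Identity Block is as follows:

Both are residual structures.

Taking the COCO dataset input used by the official implementation as an example, the input shape is 1024x1024, and the shape changes as follows:

We take the results whose height and width have been halved two, three, four, and five times to build the feature pyramid.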
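
To make the shape progression concrete, here is a minimal sketch that computes the spatial size and channel count of C1 to C5 for a given input size. The helper name `backbone_shapes` is hypothetical and not part of the repository; the downsampling factors and channel counts are read off the shape comments in `get_resnet`.

```python
def backbone_shapes(height, width):
    """Return {name: (h, w, channels)} for the ResNet101 stages C1..C5.

    Hypothetical helper for illustration only; the number of halvings and
    the channel counts follow the shape comments in get_resnet.
    """
    if height % 64 != 0 or width % 64 != 0:
        raise ValueError("height and width must be divisible by 2**6 = 64")
    # (stage name, number of times height/width were halved, channels)
    stages = [("C1", 2, 64), ("C2", 2, 256), ("C3", 3, 512),
              ("C4", 4, 1024), ("C5", 5, 2048)]
    return {name: (height >> n, width >> n, ch) for name, n, ch in stages}


shapes = backbone_shapes(1024, 1024)
for name, shape in shapes.items():
    print(name, shape)
```

For a 1024x1024 input, C2 to C5 come out at 256, 128, 64, and 32 pixels per side, which are the four maps handed to the feature pyramid.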

Implementation:

from keras.layers import ZeroPadding2D,Conv2D,MaxPooling2D,BatchNormalization,Activation,Add


def identity_block(input_tensor, kernel_size, filters, stage, block,
                   use_bias=True, train_bn=True):
    nb_filter1, nb_filter2, nb_filter3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    x = Conv2D(nb_filter1, (1, 1), name=conv_name_base + '2a',
                  use_bias=use_bias)(input_tensor)
    x = BatchNormalization(name=bn_name_base + '2a')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter2, (kernel_size, kernel_size), padding='same',
                  name=conv_name_base + '2b', use_bias=use_bias)(x)
    x = BatchNormalization(name=bn_name_base + '2b')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter3, (1, 1), name=conv_name_base + '2c',
                  use_bias=use_bias)(x)
    x = BatchNormalization(name=bn_name_base + '2c')(x, training=train_bn)

    x = Add()([x, input_tensor])
    x = Activation('relu', name='res' + str(stage) + block + '_out')(x)
    return x

def conv_block(input_tensor, kernel_size, filters, stage, block,
               strides=(2, 2), use_bias=True, train_bn=True):

    nb_filter1, nb_filter2, nb_filter3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    x = Conv2D(nb_filter1, (1, 1), strides=strides,
                  name=conv_name_base + '2a', use_bias=use_bias)(input_tensor)
    x = BatchNormalization(name=bn_name_base + '2a')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter2, (kernel_size, kernel_size), padding='same',
                  name=conv_name_base + '2b', use_bias=use_bias)(x)
    x = BatchNormalization(name=bn_name_base + '2b')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter3, (1, 1), name=conv_name_base +
                  '2c', use_bias=use_bias)(x)
    x = BatchNormalization(name=bn_name_base + '2c')(x, training=train_bn)

    shortcut = Conv2D(nb_filter3, (1, 1), strides=strides,
                         name=conv_name_base + '1', use_bias=use_bias)(input_tensor)
    shortcut = BatchNormalization(name=bn_name_base + '1')(shortcut, training=train_bn)

    x = Add()([x, shortcut])
    x = Activation('relu', name='res' + str(stage) + block + '_out')(x)
    return x

def get_resnet(input_image,stage5=False, train_bn=True):
    # Stage 1
    x = ZeroPadding2D((3, 3))(input_image)
    x = Conv2D(64, (7, 7), strides=(2, 2), name='conv1', use_bias=True)(x)
    x = BatchNormalization(name='bn_conv1')(x, training=train_bn)
    x = Activation('relu')(x)
    # Height/4,Width/4,64
    C1 = x = MaxPooling2D((3, 3), strides=(2, 2), padding="same")(x)
    # Stage 2
    x = conv_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1), train_bn=train_bn)
    x = identity_block(x, 3, [64, 64, 256], stage=2, block='b', train_bn=train_bn)
    # Height/4,Width/4,256
    C2 = x = identity_block(x, 3, [64, 64, 256], stage=2, block='c', train_bn=train_bn)
    # Stage 3
    x = conv_block(x, 3, [128, 128, 512], stage=3, block='a', train_bn=train_bn)
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='b', train_bn=train_bn)
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='c', train_bn=train_bn)
    # Height/8,Width/8,512
    C3 = x = identity_block(x, 3, [128, 128, 512], stage=3, block='d', train_bn=train_bn)
    # Stage 4
    x = conv_block(x, 3, [256, 256, 1024], stage=4, block='a', train_bn=train_bn)
    block_count = 22
    for i in range(block_count):
        x = identity_block(x, 3, [256, 256, 1024], stage=4, block=chr(98 + i), train_bn=train_bn)
    # Height/16,Width/16,1024
    C4 = x
    # Stage 5
    if stage5:
        x = conv_block(x, 3, [512, 512, 2048], stage=5, block='a', train_bn=train_bn)
        x = identity_block(x, 3, [512, 512, 2048], stage=5, block='b', train_bn=train_bn)
        # Height/32,Width/32,2048
        C5 = x = identity_block(x, 3, [512, 512, 2048], stage=5, block='c', train_bn=train_bn)
    else:
        C5 = None
    return [C1, C2, C3, C4, C5]

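2. Building the feature pyramid (FPN)

The feature pyramid (FPN) is built to fuse features at multiple scales. In Mask R-CNN, we take the backbone outputs whose height and width have been halved two times (C2), three times (C3), four times (C4), and five times (C5) to build the feature pyramid.

The resulting P2, P3, P4, P5, and P6 serve as the effective feature layers for the RPN. The region proposal network (RPN) processes these layers, and the anchors are decoded to obtain proposals.

The resulting P2, P3, P4, and P5 serve as the effective feature layers for the Classifier and Mask networks. The Classifier network processes these layers and decodes the proposals into the final predicted boxes; the Mask network processes these layers to produce the segmentation result inside each predicted box.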
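
The top-down fusion at the heart of the pyramid can be sketched without any deep learning library. The toy example below works on single-channel maps stored as nested lists; the helper names `upsample2x` and `top_down_merge` are hypothetical, and the 1x1 lateral convolutions and 3x3 smoothing convolutions of the real network are deliberately left out. It only shows the "upsample the coarser map by 2, then add the finer map" step.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D list (one channel)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def top_down_merge(coarse, lateral):
    """Add the upsampled coarse map to the finer lateral map element-wise."""
    up = upsample2x(coarse)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


# Toy single-channel maps standing in for the laterally convolved C5, C4, C3.
c5 = [[1.0]]                               # 1x1
c4 = [[0.1, 0.2], [0.3, 0.4]]              # 2x2
c3 = [[0.0] * 4 for _ in range(4)]         # 4x4

p5 = c5
p4 = top_down_merge(p5, c4)
p3 = top_down_merge(p4, c3)
print(p4)
print(len(p3), len(p3[0]))
```

Each merged level keeps the resolution of the finer input while inheriting the content of the coarser one, which is why the pyramid levels all carry high-level information.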

The implementation is as follows:

from keras.layers import Add, Conv2D, MaxPooling2D, UpSampling2D

# Get backbone layers at different levels of downsampling
# (input_image and config are defined elsewhere in the model code)
_, C2, C3, C4, C5 = get_resnet(input_image, stage5=True, train_bn=config.TRAIN_BN)

# Assemble the feature pyramid
# P5: height and width halved 5 times in total
# Height/32,Width/32,256
P5 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c5p5')(C5)
# P4: height and width halved 4 times in total
# Height/16,Width/16,256
P4 = Add(name="fpn_p4add")([
    UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5),
    Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c4p4')(C4)])
# P3: height and width halved 3 times in total
# Height/8,Width/8,256
P3 = Add(name="fpn_p3add")([
    UpSampling2D(size=(2, 2), name="fpn_p4upsampled")(P4),
    Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c3p3')(C3)])
# P2: height and width halved 2 times in total
# Height/4,Width/4,256
P2 = Add(name="fpn_p2add")([
    UpSampling2D(size=(2, 2), name="fpn_p3upsampled")(P3),
    Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c2p2')(C2)])

# Apply a 3x3 convolution with 256 channels to each level; P2, P3, P4, and P5 then have the same channel count
# Height/4,Width/4,256
P2 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p2")(P2)
# Height/8,Width/8,256
P3 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p3")(P3)
# Height/16,Width/16,256
P4 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p4")(P4)
# Height/32,Width/32,256
P5 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p5")(P5)
# The region proposal network also uses an extra level, P6, to obtain proposals
# Height/64,Width/64,256
P6 = MaxPooling2D(pool_size=(1, 1), strides=2, name="fpn_p6")(P5)

# P2, P3, P4, P5, and P6 are used to obtain proposals
rpn_feature_maps = [P2, P3, P4, P5, P6]
# P2, P3, P4, and P5 are used by the classifier and mask heads
mrcnn_feature_maps = [P2, P3, P4, P5]

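3. Obtaining proposals

The effective feature layers obtained in the previous step are the Feature Maps in the architecture diagram. They are used in two ways: one is together with ROIAlign, and the other is as input to the Region Proposal Network to obtain proposals.

When obtaining proposals, the effective feature layers we use are P2, P3, P4, P5, and P6. They share a single RPN, which predicts the adjustment parameters of each anchor and whether each anchor contains an object.

In Mask R-CNN, the structure of the RPN is similar to that of the RPN in Faster R-CNN.

First, a 3x3 convolution with 512 channels is applied.

Then a convolution with anchors_per_location x 4 channels and a convolution with anchors_per_location x 2 channels are applied separately.

The anchors_per_location x 4 convolution predicts how each anchor at each grid point of the shared feature layer should change. (Why "change"? Because, as in Faster R-CNN, the prediction must be combined with the anchor to obtain the predicted box; the prediction itself is the adjustment of the anchor.)

The anchors_per_location x 2 convolution predicts whether each anchor at each grid point of the shared feature layer contains an object.

When the input image has shape 1024x1024x3, the shared feature layers have shapes 256x256x256, 128x128x256, 64x64x256, 32x32x256, and 16x16x256. This is equivalent to dividing the input image into grids of different sizes, with 3 (anchors_per_location) anchors at each grid cell by default. These anchors have different sizes and densely cover the image.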
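
As an illustration of how such a dense set of anchors can be laid out, here is a minimal sketch for a single feature level. The function name `anchors_for_level`, the choice of one scale per level, the three width/height ratios (0.5, 1, 2), and the placement of anchor centers on multiples of the stride are all assumptions made for this example; the text above only fixes the count of 3 anchors per grid cell.

```python
import math


def anchors_for_level(feature_size, stride, scale, ratios=(0.5, 1.0, 2.0)):
    """Generate (y1, x1, y2, x2) anchors in pixels for one feature level.

    Assumed for illustration: one scale per level and three width/height
    ratios, which gives 3 anchors per grid cell.
    """
    anchors = []
    for gy in range(feature_size):
        for gx in range(feature_size):
            # Anchor centers are spaced `stride` pixels apart (assumption).
            cy, cx = gy * stride, gx * stride
            for r in ratios:
                h = scale / math.sqrt(r)
                w = scale * math.sqrt(r)
                anchors.append((cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2))
    return anchors


# P6 of a 1024x1024 input: a 16x16 grid with stride 64.
p6_anchors = anchors_for_level(feature_size=16, stride=64, scale=512)
print(len(p6_anchors))
print(p6_anchors[1])  # the square (ratio 1.0) anchor of the first cell
```

Dividing and multiplying the scale by the square root of the ratio keeps the area of the three anchors roughly equal while changing their aspect ratio.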

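The result of the anchors_per_location x 4 convolution adjusts these anchors to produce new boxes.
The anchors_per_location x 2 convolution decides whether each of these new boxes contains an object.

At this point we have a set of candidate boxes, each with a score from the anchors_per_location x 2 convolution that says whether it contains an object.

So far this is only a rough localization, that is, a proposal. Next we will look for objects inside the proposals.

The implementation is: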

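from keras.layers import Activation, Conv2D, Input, Reshape
from keras.models import Model

#------------------------------------#
#   The five feature layers of
#   different sizes are fed into
#   the RPN to obtain proposals
#------------------------------------#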
def rpn_graph(feature_map, anchors_per_location):
    
    shared = Conv2D(512, (3, 3), padding='same', activation='relu',
                       name='rpn_conv_shared')(feature_map)
    
    x = Conv2D(2 * anchors_per_location, (1, 1), padding='valid',
                  activation='linear', name='rpn_class_raw')(shared)
    # Shape: [batch_size, num_anchors, 2]
    # Object/background logits for each anchor
    rpn_class_logits = Reshape([-1,2])(x)

    rpn_probs = Activation(
        "softmax", name="rpn_class_xxx")(rpn_class_logits)
    
    x = Conv2D(anchors_per_location * 4, (1, 1), padding="valid",
                  activation='linear', name='rpn_bbox_pred')(shared)
    # Shape: [batch_size, num_anchors, 4]
    # Adjustment parameters for each anchor
    rpn_bbox = Reshape([-1,4])(x)

    return [rpn_class_logits, rpn_probs, rpn_bbox]

#------------------------------------#
#   Build the region proposal
#   network (RPN) model
#------------------------------------#
def build_rpn_model(anchors_per_location, depth):
    input_feature_map = Input(shape=[None, None, depth],
                                 name="input_rpn_feature_map")
    outputs = rpn_graph(input_feature_map, anchors_per_location)
    return Model([input_feature_map], outputs, name="rpn_model")

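4. Decoding the proposals

In the previous step (step 3) we obtained predictions for a large number of anchors. The prediction has two parts.

The anchors_per_location x 4 convolution predicts how each anchor at each grid point of the effective feature layer should change.

The anchors_per_location x 2 convolution predicts whether each anchor at each grid point of the effective feature layer contains an object.

This is equivalent to dividing the whole image into a number of grid cells and creating 3 anchors at the center of each cell. When the input image is 1024x1024x3, the total number of anchors is 196608 + 49152 + 12288 + 3072 + 768 = 261,888.

When the input image shape changes, the number of anchors changes as well.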
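
The anchor count is easy to reproduce. The sketch below uses a hypothetical helper, `count_anchors`, and assumes the five levels P2 to P6 have strides 4, 8, 16, 32, and 64, as in the shape comments of the FPN code.

```python
def count_anchors(height, width, anchors_per_location=3, strides=(4, 8, 16, 32, 64)):
    """Total number of anchors over P2..P6 for an input of the given size."""
    total = 0
    for s in strides:
        # One grid cell per feature-map position, 3 anchors per cell.
        total += (height // s) * (width // s) * anchors_per_location
    return total


print(count_anchors(1024, 1024))
print(count_anchors(512, 512))
```

Halving the input side length divides the number of anchors by four, which is why the anchor set has to be regenerated whenever the input shape changes.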

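Although anchors carry some information about box position and size, they are a finite set and cannot represent arbitrary boxes, so they still need to be adjusted.

In anchors_per_location x 4, anchors_per_location is the number of anchors at each grid point, and the 4 stands for the adjustments to the box center and to its height and width.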
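
A worked example may help here. The following pure-Python sketch applies the four adjustment values to a single anchor, mirroring the arithmetic of `apply_box_deltas_graph`; the name `decode_box` is hypothetical, and the rescaling by `RPN_BBOX_STD_DEV` is left out to keep the example short.

```python
import math


def decode_box(anchor, delta):
    """Apply (dy, dx, log(dh), log(dw)) deltas to a (y1, x1, y2, x2) anchor."""
    y1, x1, y2, x2 = anchor
    dy, dx, dh, dw = delta
    h, w = y2 - y1, x2 - x1
    cy, cx = y1 + 0.5 * h, x1 + 0.5 * w
    # Shift the center in units of the anchor size, scale the size by exp().
    cy += dy * h
    cx += dx * w
    h *= math.exp(dh)
    w *= math.exp(dw)
    return (cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w)


anchor = (0.25, 0.25, 0.75, 0.75)
# Zero deltas leave the anchor unchanged.
print(decode_box(anchor, (0.0, 0.0, 0.0, 0.0)))
# Move the center down by half the anchor height and halve the height.
print(decode_box(anchor, (0.5, 0.0, math.log(0.5), 0.0)))
```

Expressing the shift relative to the anchor size and the size change as a logarithm keeps the predicted values in a similar numeric range for small and large anchors.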

The implementation is as follows:

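#----------------------------------------------------------#
#   Proposal Layer
#   This code converts anchors into proposals
#----------------------------------------------------------#
import numpy as np
import tensorflow as tf
from keras.layers import Layer
# `utils` (used below for batch_slice) is a helper module from the source repository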

def apply_box_deltas_graph(boxes, deltas):
    # Compute the center, height, and width of the anchors
    height = boxes[:, 2] - boxes[:, 0]
    width = boxes[:, 3] - boxes[:, 1]
    center_y = boxes[:, 0] + 0.5 * height
    center_x = boxes[:, 1] + 0.5 * width
    # Apply the deltas to get the adjusted center, height, and width
    center_y += deltas[:, 0] * height
    center_x += deltas[:, 1] * width
    height *= tf.exp(deltas[:, 2])
    width *= tf.exp(deltas[:, 3])
    # Compute the top-left and bottom-right corners
    y1 = center_y - 0.5 * height
    x1 = center_x - 0.5 * width
    y2 = y1 + height
    x2 = x1 + width
    result = tf.stack([y1, x1, y2, x2], axis=1, name="apply_box_deltas_out")
    return result


def clip_boxes_graph(boxes, window):
    """
    boxes: [N, (y1, x1, y2, x2)]
    window: [4] in the form y1, x1, y2, x2
    """
    # Split
    wy1, wx1, wy2, wx2 = tf.split(window, 4)
    y1, x1, y2, x2 = tf.split(boxes, 4, axis=1)
    # Clip
    y1 = tf.maximum(tf.minimum(y1, wy2), wy1)
    x1 = tf.maximum(tf.minimum(x1, wx2), wx1)
    y2 = tf.maximum(tf.minimum(y2, wy2), wy1)
    x2 = tf.maximum(tf.minimum(x2, wx2), wx1)
    clipped = tf.concat([y1, x1, y2, x2], axis=1, name="clipped_boxes")
    clipped.set_shape((clipped.shape[0], 4))
    return clipped

class ProposalLayer(Layer):

    def __init__(self, proposal_count, nms_threshold, config=None, **kwargs):
        super(ProposalLayer, self).__init__(**kwargs)
        self.config = config
        self.proposal_count = proposal_count
        self.nms_threshold = nms_threshold
    # [rpn_class, rpn_bbox, anchors]
    def call(self, inputs):

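        # Objectness score of each anchor, [batch, num_anchors]
        scores = inputs[0][:, :, 1]

        # Adjustment parameters of each anchor, [batch, num_anchors, 4]
        deltas = inputs[1]

        # Rescale by RPN_BBOX_STD_DEV ([0.1 0.1 0.2 0.2]) to restore the original magnitude
        deltas = deltas * np.reshape(self.config.RPN_BBOX_STD_DEV, [1, 1, 4])

        # Anchors
        anchors = inputs[2]

        # Keep only the top-scoring anchors, at most config.PRE_NMS_LIMIT (6000)
        pre_nms_limit = tf.minimum(self.config.PRE_NMS_LIMIT, tf.shape(anchors)[1])
        # Indices of these anchors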
        ix = tf.nn.top_k(scores, pre_nms_limit, sorted=True,
                         name="top_anchors").indices
        
        # Scores of these anchors
        scores = utils.batch_slice([scores, ix], lambda x, y: tf.gather(x, y),
                                   self.config.IMAGES_PER_GPU)
        # Adjustment parameters of these anchors
        deltas = utils.batch_slice([deltas, ix], lambda x, y: tf.gather(x, y),
                                   self.config.IMAGES_PER_GPU)
        # The selected anchors themselves
        pre_nms_anchors = utils.batch_slice([anchors, ix], lambda a, x: tf.gather(a, x),
                                    self.config.IMAGES_PER_GPU,
                                    names=["pre_nms_anchors"])

        # [batch, N, (y1, x1, y2, x2)]
        # Decode the anchors
        boxes = utils.batch_slice([pre_nms_anchors, deltas],
                                  lambda x, y: apply_box_deltas_graph(x, y),
                                  self.config.IMAGES_PER_GPU,
                                  names=["refined_anchors"])

        # [batch, N, (y1, x1, y2, x2)]
        # Clip the boxes to the image boundary
        window = np.array([0, 0, 1, 1], dtype=np.float32)
        boxes = utils.batch_slice(boxes,
                                  lambda x: clip_boxes_graph(x, window),
                                  self.config.IMAGES_PER_GPU,
                                  names=["refined_anchors_clipped"])


        # Non-maximum suppression
        def nms(boxes, scores):
            indices = tf.image.non_max_suppression(
                boxes, scores, self.proposal_count,
                self.nms_threshold, name="rpn_non_max_suppression")
            proposals = tf.gather(boxes, indices)
            # If fewer proposals than proposal_count remain,
            # pad with zeros
            padding = tf.maximum(self.proposal_count - tf.shape(proposals)[0], 0)
            proposals = tf.pad(proposals, [(0, padding), (0, 0)])
            return proposals

        proposals = utils.batch_slice([boxes, scores], nms,
                                      self.config.IMAGES_PER_GPU)
        return proposals

    def compute_output_shape(self, input_shape):
        return (None, self.proposal_count, 4)
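
The layer above delegates non-maximum suppression to `tf.image.non_max_suppression`. To show what that step does, here is a minimal pure-Python sketch of greedy NMS. The function names `iou` and `nms` are hypothetical, and this is an illustration of the general technique rather than the TensorFlow implementation.

```python
def iou(a, b):
    """Intersection over union of two (y1, x1, y2, x2) boxes."""
    ih = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iw = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ih * iw
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, max_output, iou_threshold):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if len(keep) == max_output:
            break
        # Keep box i only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 0.5, 0.5), (0.02, 0.02, 0.52, 0.52), (0.5, 0.5, 1.0, 1.0)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, max_output=10, iou_threshold=0.7))
```

The second box overlaps the first almost completely and has a lower score, so it is suppressed; the third box does not overlap either of them and is kept.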

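5. Using the proposals (ROIAlign)

Let's get an overall understanding of proposals:
a proposal is in fact a preliminary filter that marks which regions of the image contain an object.

What Mask R-CNN has done up to this point is this: the backbone feature extraction network yields several shared feature layers, and the proposals are then used to crop these shared feature layers.

Each point in a shared feature layer is in effect a condensed summary of all the features inside a region of the original image.

Each proposal crops its corresponding shared feature layer, and the cropped result is resized. In the classifier model the cropped content is resized to 7x7x256; in the mask model it is resized to 14x14x256.

When cropping the shared feature layers with a proposal, note that we first have to find which feature layer the proposal belongs to, which is decided by the size of the proposal.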
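
The level assignment can be written out as a small standalone function. The sketch below mirrors the formula used in `PyramidROIAlign` (boxes in normalized coordinates, result clamped to levels 2 to 5); the function name `roi_level` is hypothetical.

```python
import math


def roi_level(box, image_height, image_width):
    """Pick the pyramid level (2..5) for a normalized (y1, x1, y2, x2) box."""
    y1, x1, y2, x2 = box
    h, w = y2 - y1, x2 - x1
    image_area = float(image_height * image_width)
    # A 224x224-pixel box maps to level 4; each halving of the box side
    # moves one level down (finer), each doubling one level up (coarser).
    level = math.log2(math.sqrt(h * w) / (224.0 / math.sqrt(image_area)))
    return min(5, max(2, 4 + int(round(level))))


for side in (56, 112, 224, 448, 1024):
    box = (0.0, 0.0, side / 1024, side / 1024)
    print(side, roi_level(box, 1024, 1024))
```

Small proposals are therefore cropped from the high-resolution P2 and large proposals from the low-resolution P5, so each crop covers a comparable number of feature-map cells.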

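In the classifier model, the 7x7x256 region produced by ROIAlign goes through one 7x7 convolution with 1024 channels and one 1x1 convolution with 1024 channels. These two 1024-channel convolutions emulate two fully connected layers of size 1024. The result is then fed to two fully connected layers with num_classes and num_classes * 4 outputs, which represent the class of the object inside the proposal and the adjustment parameters of the proposal, respectively.

In the mask model, the resized local feature layer first goes through four 3x3 convolutions with 256 channels, then one transposed convolution, and then one convolution with num_classes channels. The final shape is 28x28xnum_classes: for each class, one channel holds the per-pixel probability that the pixel belongs to an object of that class.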


#------------------------------------#
#   Build the classifier model
#   Its predictions adjust the proposals
#   to give the final predicted boxes
#------------------------------------#
def fpn_classifier_graph(rois, feature_maps, image_meta,
                         pool_size, num_classes, train_bn=True,
                         fc_layers_size=1024):
    # ROIAlign: crop the feature layers with the proposals
    # Shape: [batch, num_rois, POOL_SIZE, POOL_SIZE, channels]
    x = PyramidROIAlign([pool_size, pool_size],
                        name="roi_align_classifier")([rois, image_meta] + feature_maps)

    # Shape: [batch, num_rois, 1, 1, fc_layers_size]; this and the next convolution act as two fully connected layers
    x = TimeDistributed(Conv2D(fc_layers_size, (pool_size, pool_size), padding="valid"),
                           name="mrcnn_class_conv1")(x)
    x = TimeDistributed(BatchNormalization(), name='mrcnn_class_bn1')(x, training=train_bn)
    x = Activation('relu')(x)

    # Shape: [batch, num_rois, 1, 1, fc_layers_size]
    x = TimeDistributed(Conv2D(fc_layers_size, (1, 1)),
                           name="mrcnn_class_conv2")(x)
    x = TimeDistributed(BatchNormalization(), name='mrcnn_class_bn2')(x, training=train_bn)
    x = Activation('relu')(x)

    # Shape: [batch, num_rois, fc_layers_size]
    shared = Lambda(lambda x: K.squeeze(K.squeeze(x, 3), 2),
                       name="pool_squeeze")(x)

    # Classifier head
    # The prediction is the class of the object inside each proposal
    mrcnn_class_logits = TimeDistributed(Dense(num_classes),
                                            name='mrcnn_class_logits')(shared)
    mrcnn_probs = TimeDistributed(Activation("softmax"),
                                     name="mrcnn_class")(mrcnn_class_logits)


    # BBox head
    # The prediction is used to adjust each proposal
    # [batch, num_rois, NUM_CLASSES * (dy, dx, log(dh), log(dw))]
    x = TimeDistributed(Dense(num_classes * 4, activation='linear'),
                           name='mrcnn_bbox_fc')(shared)
    # Reshape to [batch, num_rois, NUM_CLASSES, (dy, dx, log(dh), log(dw))]
    mrcnn_bbox = Reshape((-1, num_classes, 4), name="mrcnn_bbox")(x)

    return mrcnn_class_logits, mrcnn_probs, mrcnn_bbox



def build_fpn_mask_graph(rois, feature_maps, image_meta,
                         pool_size, num_classes, train_bn=True):
    # ROIAlign: crop the feature layers with the proposals
    # Shape: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels]
    x = PyramidROIAlign([pool_size, pool_size],
                        name="roi_align_mask")([rois, image_meta] + feature_maps)

    # Shape: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels]
    x = TimeDistributed(Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv1")(x)
    x = TimeDistributed(BatchNormalization(),
                           name='mrcnn_mask_bn1')(x, training=train_bn)
    x = Activation('relu')(x)

    # Shape: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels]
    x = TimeDistributed(Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv2")(x)
    x = TimeDistributed(BatchNormalization(),
                           name='mrcnn_mask_bn2')(x, training=train_bn)
    x = Activation('relu')(x)

    # Shape: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels]
    x = TimeDistributed(Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv3")(x)
    x = TimeDistributed(BatchNormalization(),
                           name='mrcnn_mask_bn3')(x, training=train_bn)
    x = Activation('relu')(x)

    # Shape: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels]
    x = TimeDistributed(Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv4")(x)
    x = TimeDistributed(BatchNormalization(),
                           name='mrcnn_mask_bn4')(x, training=train_bn)
    x = Activation('relu')(x)

    # Shape: [batch, num_rois, 2xMASK_POOL_SIZE, 2xMASK_POOL_SIZE, channels]
    x = TimeDistributed(Conv2DTranspose(256, (2, 2), strides=2, activation="relu"),
                           name="mrcnn_mask_deconv")(x)
    # After the transposed convolution, a 1x1 convolution sets the channel count to num_classes, one mask channel per class
    x = TimeDistributed(Conv2D(num_classes, (1, 1), strides=1, activation="sigmoid"),
                           name="mrcnn_mask")(x)
    return x


#----------------------------------------------------------#
#   ROIAlign Layer
#   Crops the feature layers with the proposals
#----------------------------------------------------------#

def log2_graph(x):
    return tf.log(x) / tf.log(2.0)

def parse_image_meta_graph(meta):
    """
    Split image_meta into its individual fields.
    """
    image_id = meta[:, 0]
    original_image_shape = meta[:, 1:4]
    image_shape = meta[:, 4:7]
    window = meta[:, 7:11]  # (y1, x1, y2, x2) window of image in pixels
    scale = meta[:, 11]
    active_class_ids = meta[:, 12:]
    return {
        "image_id": image_id,
        "original_image_shape": original_image_shape,
        "image_shape": image_shape,
        "window": window,
        "scale": scale,
        "active_class_ids": active_class_ids,
    }

class PyramidROIAlign(Layer):
    def __init__(self, pool_shape, **kwargs):
        super(PyramidROIAlign, self).__init__(**kwargs)
        self.pool_shape = tuple(pool_shape)

    def call(self, inputs):
        # Proposal boxes
        boxes = inputs[0]

        # image_meta holds the required image information
        image_meta = inputs[1]

        # All feature layers, each [batch, height, width, channels]
        feature_maps = inputs[2:]

        y1, x1, y2, x2 = tf.split(boxes, 4, axis=2)
        h = y2 - y1
        w = x2 - x1

        # Size of the input image
        image_shape = parse_image_meta_graph(image_meta)['image_shape'][0]
        
        # Use the size of each proposal to find the feature layer it belongs to
        image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
        roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))
        roi_level = tf.minimum(5, tf.maximum(
            2, 4 + tf.cast(tf.round(roi_level), tf.int32)))
        # Shape: [batch_size, num_boxes]
        roi_level = tf.squeeze(roi_level, 2)

        # Loop through levels and apply ROI pooling to each. P2 to P5.
        pooled = []
        box_to_level = []
        # Crop from each of P2 to P5
        for i, level in enumerate(range(2, 6)):
            # Boxes assigned to this feature level
            ix = tf.where(tf.equal(roi_level, level))
            level_boxes = tf.gather_nd(boxes, ix)
            box_to_level.append(ix)

            # Image index of each of these boxes
            box_indices = tf.cast(ix[:, 0], tf.int32)

            # Do not propagate gradients through the box coordinates
            level_boxes = tf.stop_gradient(level_boxes)
            box_indices = tf.stop_gradient(box_indices)

            # Result: [batch * num_boxes, pool_height, pool_width, channels]
            pooled.append(tf.image.crop_and_resize(
                feature_maps[i], level_boxes, box_indices, self.pool_shape,
                method="bilinear"))

        pooled = tf.concat(pooled, axis=0)

        # Concatenate the (image, box) indices and append each box's position in `pooled`
        box_to_level = tf.concat(box_to_level, axis=0)
        box_range = tf.expand_dims(tf.range(tf.shape(box_to_level)[0]), 1)
        box_to_level = tf.concat([tf.cast(box_to_level, tf.int32), box_range],
                                 axis=1)

        # box_to_level[:, 0] is the image index
        # box_to_level[:, 1] is the box index within that image
        sorting_tensor = box_to_level[:, 0] * 100000 + box_to_level[:, 1]
        # Sort so that boxes are ordered by image, then by box index
        ix = tf.nn.top_k(sorting_tensor, k=tf.shape(
            box_to_level)[0]).indices[::-1]

        # Reorder the pooled features into the original box order
        ix = tf.gather(box_to_level[:, 2], ix)
        pooled = tf.gather(pooled, ix)

        # Reshape back to the original layout,
        # that is,
        # Shape: [batch, num_rois, POOL_SIZE, POOL_SIZE, channels]
        shape = tf.concat([tf.shape(boxes)[:2], tf.shape(pooled)[1:]], axis=0)
        pooled = tf.reshape(pooled, shape)
        return pooled

    def compute_output_shape(self, input_shape):
        return input_shape[0][:2] + self.pool_shape + (input_shape[2][-1], )

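6. Decoding the predicted boxes

The proposals obtained in step 4 also represent regions of the image, and in the classifier model that follows they play the role of anchors.

In other words, the prediction of the classifier model gives the class of the object inside each proposal and the adjustment parameters of the proposal.

The adjusted proposals are the final predictions, which can then be drawn on the image.

Decoding the predicted boxes involves the following steps:
1. Select the proposals that are not background and whose score is at least config.DETECTION_MIN_CONFIDENCE.
2. Decode the proposals with the classifier model's predictions to obtain the positions of the final predicted boxes.
3. Apply non-maximum suppression using the scores and the final box positions to avoid duplicate detections.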
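
Steps 1 and 3 can be sketched in plain Python as follows. The function name `filter_detections` and the threshold values in the example are chosen for illustration only, and the boxes are assumed to be already decoded (step 2). Suppression is applied within each class separately, as in the per-class NMS of the detection code.

```python
def filter_detections(class_ids, scores, boxes, min_confidence, nms_threshold):
    """Keep confident non-background boxes, then run NMS within each class."""

    def iou(a, b):
        ih = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iw = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ih * iw
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    # Step 1: drop background (class 0) and low-confidence boxes.
    keep = [i for i in range(len(boxes))
            if class_ids[i] > 0 and scores[i] >= min_confidence]
    # Step 3: greedy NMS, applied separately to each class.
    keep.sort(key=lambda i: scores[i], reverse=True)
    selected = []
    for i in keep:
        same_class = [j for j in selected if class_ids[j] == class_ids[i]]
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in same_class):
            selected.append(i)
    return selected


class_ids = [1, 1, 2, 0, 1]
scores = [0.95, 0.90, 0.85, 0.99, 0.30]
boxes = [(0.0, 0.0, 0.5, 0.5), (0.02, 0.02, 0.52, 0.52),
         (0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0), (0.6, 0.6, 0.9, 0.9)]
print(filter_detections(class_ids, scores, boxes,
                        min_confidence=0.7, nms_threshold=0.3))
```

Box 1 is removed as a duplicate of box 0 (same class, high overlap), box 3 is background, and box 4 falls below the confidence threshold. Box 2 survives even though it coincides with box 0, because it belongs to a different class.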

The code for this decoding process is as follows:

#----------------------------------------------------------#
#   Detection Layer
#----------------------------------------------------------#

def refine_detections_graph(rois, probs, deltas, window, config):
    """Refine classified proposals, filter overlaps, and return the final detections.
    Inputs:
        rois: [N, (y1, x1, y2, x2)] in normalized coordinates
        probs: [N, num_classes]. Class probabilities.
        deltas: [N, num_classes, (dy, dx, log(dh), log(dw))]. Class-specific
                bounding box deltas.
        window: (y1, x1, y2, x2) in normalized coordinates. The part of the image
            that contains the image excluding the padding.

    Returns detections shaped: [num_detections, (y1, x1, y2, x2, class_id, score)] where
        coordinates are normalized.
    """
    # Class with the highest score for each ROI
    class_ids = tf.argmax(probs, axis=1, output_type=tf.int32)
    # Pairs of (ROI index, class id)
    indices = tf.stack([tf.range(probs.shape[0]), class_ids], axis=1)
    # Score of the top class
    class_scores = tf.gather_nd(probs, indices)
    # Box deltas of the top class
    deltas_specific = tf.gather_nd(deltas, indices)
    # Decode
    # Shape: [boxes, (y1, x1, y2, x2)] in normalized coordinates
    refined_rois = apply_box_deltas_graph(
        rois, deltas_specific * config.BBOX_STD_DEV)
    # Clip the boxes to the image window
    refined_rois = clip_boxes_graph(refined_rois, window)

    # Remove background boxes
    keep = tf.where(class_ids > 0)[:, 0]
    # Remove low-confidence boxes as well
    if config.DETECTION_MIN_CONFIDENCE:
        conf_keep = tf.where(class_scores >= config.DETECTION_MIN_CONFIDENCE)[:, 0]
        keep = tf.sets.set_intersection(tf.expand_dims(keep, 0),
                                        tf.expand_dims(conf_keep, 0))
        keep = tf.sparse_tensor_to_dense(keep)[0]

    # Gather the boxes, classes, and scores that are not background and score high enough
    # 1. Prepare variables
    pre_nms_class_ids = tf.gather(class_ids, keep)
    pre_nms_scores = tf.gather(class_scores, keep)
    pre_nms_rois = tf.gather(refined_rois,   keep)
    unique_pre_nms_class_ids = tf.unique(pre_nms_class_ids)[0]

    def nms_keep_map(class_id):

        ixs = tf.where(tf.equal(pre_nms_class_ids, class_id))[:, 0]

        class_keep = tf.image.non_max_suppression(
                tf.gather(pre_nms_rois, ixs),
                tf.gather(pre_nms_scores, ixs),
                max_output_size=config.DETECTION_MAX_INSTANCES,
                iou_threshold=config.DETECTION_NMS_THRESHOLD)

        class_keep = tf.gather(keep, tf.gather(ixs, class_keep))

        gap = config.DETECTION_MAX_INSTANCES - tf.shape(class_keep)[0]
        class_keep = tf.pad(class_keep, [(0, gap)],
                            mode='CONSTANT', constant_values=-1)

        class_keep.set_shape([config.DETECTION_MAX_INSTANCES])
        return class_keep

    # 2. Apply non-maximum suppression per class
    nms_keep = tf.map_fn(nms_keep_map, unique_pre_nms_class_ids,
                         dtype=tf.int64)
    # 3. Keep the boxes that passed NMS (drop the -1 padding)
    nms_keep = tf.reshape(nms_keep, [-1])
    nms_keep = tf.gather(nms_keep, tf.where(nms_keep > -1)[:, 0])
    # 4. Compute intersection between keep and nms_keep
    keep = tf.sets.set_intersection(tf.expand_dims(keep, 0),
                                    tf.expand_dims(nms_keep, 0))
    keep = tf.sparse_tensor_to_dense(keep)[0]

    # Keep the num_keep highest-scoring boxes
    roi_count = config.DETECTION_MAX_INSTANCES
    class_scores_keep = tf.gather(class_scores, keep)
    num_keep = tf.minimum(tf.shape(class_scores_keep)[0], roi_count)
    top_ids = tf.nn.top_k(class_scores_keep, k=num_keep, sorted=True)[1]
    keep = tf.gather(keep, top_ids)

    # Arrange output as [N, (y1, x1, y2, x2, class_id, score)]
    detections = tf.concat([
        tf.gather(refined_rois, keep),
        tf.to_float(tf.gather(class_ids, keep))[..., tf.newaxis],
        tf.gather(class_scores, keep)[..., tf.newaxis]
        ], axis=1)

    # Pad with zeros if there are fewer than DETECTION_MAX_INSTANCES detections
    gap = config.DETECTION_MAX_INSTANCES - tf.shape(detections)[0]
    detections = tf.pad(detections, [(0, gap), (0, 0)], "CONSTANT")
    return detections

def norm_boxes_graph(boxes, shape):
    h, w = tf.split(tf.cast(shape, tf.float32), 2)
    scale = tf.concat([h, w, h, w], axis=-1) - tf.constant(1.0)
    shift = tf.constant([0., 0., 1., 1.])
    return tf.divide(boxes - shift, scale)

class DetectionLayer(Layer):

    def __init__(self, config=None, **kwargs):
        super(DetectionLayer, self).__init__(**kwargs)
        self.config = config

    def call(self, inputs):
        rois = inputs[0]
        mrcnn_class = inputs[1]
        mrcnn_bbox = inputs[2]
        image_meta = inputs[3]

        # Convert the window to normalized coordinates
        m = parse_image_meta_graph(image_meta)
        image_shape = m['image_shape'][0]
        window = norm_boxes_graph(m['window'], image_shape[:2])

        # Run detection refinement graph on each item in the batch
        detections_batch = utils.batch_slice(
            [rois, mrcnn_class, mrcnn_bbox, window],
            lambda x, y, w, z: refine_detections_graph(x, y, w, z, self.config),
            self.config.IMAGES_PER_GPU)

        # Reshape output
        # [batch, num_detections, (y1, x1, y2, x2, class_id, class_score)] in
        # normalized coordinates
        return tf.reshape(
            detections_batch,
            [self.config.BATCH_SIZE, self.config.DETECTION_MAX_INSTANCES, 6])

    def compute_output_shape(self, input_shape):
        return (None, self.config.DETECTION_MAX_INSTANCES, 6)

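7. Obtaining the mask segmentation

In step 6 we obtained the final predicted boxes, which are more accurate than the proposals obtained earlier. We therefore use these predicted boxes as the crop regions for the mask model, cropping the shared feature layers used by the mask model.

After cropping, the mask model classifies the pixels inside each box to obtain the segmentation result.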

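II. Training

The loss function used to train Mask R-CNN has several parts: the loss of the region proposal network, the loss of the classifier network, and the loss of the mask network.

1. Training the region proposal network

To obtain proposal predictions from a shared feature layer, we apply a 3x3 convolution, followed by a 1x1 convolution with anchors_per_location x 2 channels and a 1x1 convolution with anchors_per_location x 4 channels.

In Mask R-CNN, anchors_per_location, the number of anchors per grid point, is 3 by default, so the results of the two 1x1 convolutions are:

The anchors_per_location x 4 convolution predicts how each anchor at each grid point of the effective feature layer should change.

The anchors_per_location x 2 convolution predicts whether each anchor at each grid point of the effective feature layer contains an object.

In other words, the raw output of the Mask R-CNN region proposal network is not the actual position of the proposals on the image; it has to be decoded to obtain the actual positions.

During training we need to compute a loss, and this loss is defined on the raw predictions of the region proposal network. We feed the image into the current region proposal network to obtain its predictions; we also need to encode the ground truth, that is, convert the position of each ground-truth box into the same format as the network's predictions.

That is, for every ground-truth box of every training image, we need to find the anchors that correspond to it and work out what the network would have to predict for those anchors in order to produce that ground-truth box.

Going from the network's predictions to boxes is called decoding, and going from ground-truth boxes to the predictions the network should make is called encoding.

So encoding is simply the decoding process run in reverse.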
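
The inverse relationship can be checked with a short round trip. In the sketch below, `encode_box` follows the encoding arithmetic of `build_rpn_targets` and `decode_box` follows the decoding arithmetic of `apply_box_deltas_graph`. Both function names are hypothetical, and the division and multiplication by `RPN_BBOX_STD_DEV` are omitted because they cancel out.

```python
import math


def encode_box(anchor, gt):
    """Deltas (dy, dx, log(dh), log(dw)) that move `anchor` onto `gt`."""
    a_h, a_w = anchor[2] - anchor[0], anchor[3] - anchor[1]
    a_cy, a_cx = anchor[0] + 0.5 * a_h, anchor[1] + 0.5 * a_w
    g_h, g_w = gt[2] - gt[0], gt[3] - gt[1]
    g_cy, g_cx = gt[0] + 0.5 * g_h, gt[1] + 0.5 * g_w
    return ((g_cy - a_cy) / a_h, (g_cx - a_cx) / a_w,
            math.log(g_h / a_h), math.log(g_w / a_w))


def decode_box(anchor, delta):
    """Inverse of encode_box: apply the deltas to the anchor."""
    a_h, a_w = anchor[2] - anchor[0], anchor[3] - anchor[1]
    cy = anchor[0] + 0.5 * a_h + delta[0] * a_h
    cx = anchor[1] + 0.5 * a_w + delta[1] * a_w
    h, w = a_h * math.exp(delta[2]), a_w * math.exp(delta[3])
    return (cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w)


anchor = (100.0, 100.0, 200.0, 300.0)   # (y1, x1, y2, x2) in pixels
gt = (120.0, 90.0, 260.0, 310.0)        # ground-truth box
deltas = encode_box(anchor, gt)
restored = decode_box(anchor, deltas)
print(deltas)
print(restored)
```

Decoding the encoded deltas recovers the ground-truth box up to floating-point error, which is the property the training targets rely on.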

The implementation is as follows:


def build_rpn_targets(image_shape, anchors, gt_class_ids, gt_boxes, config):
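    # 1 marks a positive sample
    # -1 marks a negative sample
    # 0 marks an ignored (neutral) anchor
    rpn_match = np.zeros([anchors.shape[0]], dtype=np.int32)
    # Holds the targets encoded from the anchors and ground-truth boxes
    rpn_bbox = np.zeros((config.RPN_TRAIN_ANCHORS_PER_IMAGE, 4))

    '''
    iscrowd=0 means a single object whose outline is stored as a polygon (a list of points).
    iscrowd=1 means objects that are not separated from each other, whose outline is stored with RLE encoding.
    For example, an image shows three people: one stands alone and the other two have their arms
    around each other (too close to separate when annotating). In that case the person standing
    alone has iscrowd=0 in the annotation and a polygon segmentation,
    while the other two share a single annotation whose segmentation is stored in RLE form.
    '''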
    crowd_ix = np.where(gt_class_ids < 0)[0]
    if crowd_ix.shape[0] > 0:
        non_crowd_ix = np.where(gt_class_ids > 0)[0]
        crowd_boxes = gt_boxes[crowd_ix]
        gt_class_ids = gt_class_ids[non_crowd_ix]
        gt_boxes = gt_boxes[non_crowd_ix]
        crowd_overlaps = utils.compute_overlaps(anchors, crowd_boxes)
        crowd_iou_max = np.amax(crowd_overlaps, axis=1)
        no_crowd_bool = (crowd_iou_max < 0.001)
    else:
        no_crowd_bool = np.ones([anchors.shape[0]], dtype=bool)

    # Overlap (IoU) between anchors and ground-truth boxes, [num_anchors, num_gt_boxes]
    overlaps = utils.compute_overlaps(anchors, gt_boxes)

    # 1. Anchors whose IoU is below 0.3 (and that do not overlap a crowd box) are negative samples
    anchor_iou_argmax = np.argmax(overlaps, axis=1)
    anchor_iou_max = overlaps[np.arange(overlaps.shape[0]), anchor_iou_argmax]
    rpn_match[(anchor_iou_max < 0.3) & (no_crowd_bool)] = -1
    # 2. For each ground-truth box, the anchor with the highest IoU is a positive sample
    gt_iou_argmax = np.argwhere(overlaps == np.max(overlaps, axis=0))[:,0]
    rpn_match[gt_iou_argmax] = 1
    # 3. Anchors whose IoU is 0.7 or higher are positive samples
    rpn_match[anchor_iou_max >= 0.7] = 1

    # Balance positive and negative samples
    # Indices of the positive samples
    ids = np.where(rpn_match == 1)[0]
    # If there are more than (config.RPN_TRAIN_ANCHORS_PER_IMAGE // 2), reset the extras to neutral
    extra = len(ids) - (config.RPN_TRAIN_ANCHORS_PER_IMAGE // 2)
    if extra > 0:
        ids = np.random.choice(ids, extra, replace=False)
        rpn_match[ids] = 0
    # Indices of the negative samples
    ids = np.where(rpn_match == -1)[0]
    # Keep the total number of samples at config.RPN_TRAIN_ANCHORS_PER_IMAGE
    extra = len(ids) - (config.RPN_TRAIN_ANCHORS_PER_IMAGE -
                        np.sum(rpn_match == 1))
    if extra > 0:
        # Reset the extra ones to neutral
        ids = np.random.choice(ids, extra, replace=False)
        rpn_match[ids] = 0

    # Encode the matched ground-truth box for every positive anchor
    ids = np.where(rpn_match == 1)[0]
    ix = 0
    for i, a in zip(ids, anchors[ids]):
        gt = gt_boxes[anchor_iou_argmax[i]]
        # Height, width and center of the ground-truth box
        gt_h = gt[2] - gt[0]
        gt_w = gt[3] - gt[1]
        gt_center_y = gt[0] + 0.5 * gt_h
        gt_center_x = gt[1] + 0.5 * gt_w
        # Height, width and center of the anchor
        a_h = a[2] - a[0]
        a_w = a[3] - a[1]
        a_center_y = a[0] + 0.5 * a_h
        a_center_x = a[1] + 0.5 * a_w
        # Encode: center offsets relative to the anchor size, log of the size ratios
        rpn_bbox[ix] = [
            (gt_center_y - a_center_y) / a_h,
            (gt_center_x - a_center_x) / a_w,
            np.log(gt_h / a_h),
            np.log(gt_w / a_w),
        ]
        # Rescale by the configured standard deviations
        rpn_bbox[ix] /= config.RPN_BBOX_STD_DEV
        ix += 1

    return rpn_match, rpn_bbox

The code above finds, for each ground-truth box, the anchors that overlap it strongly, and computes the regression targets that those anchors should predict.
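To make the encoding step easier to follow, here is a minimal pure-Python sketch of the same arithmetic together with its inverse. The helper names `center_form`, `encode_box` and `decode_box` are made up for this illustration, and the `STD_DEV` values are an assumed stand-in for `config.RPN_BBOX_STD_DEV`.

```python
import math

# Assumed scaling factors for illustration; the real values come from
# config.RPN_BBOX_STD_DEV.
STD_DEV = (0.1, 0.1, 0.2, 0.2)


def center_form(box):
    """Convert (y1, x1, y2, x2) to (center_y, center_x, height, width)."""
    h = box[2] - box[0]
    w = box[3] - box[1]
    return box[0] + 0.5 * h, box[1] + 0.5 * w, h, w


def encode_box(anchor, gt, std_dev=STD_DEV):
    """Targets that would turn `anchor` into `gt`, divided by std_dev."""
    a_cy, a_cx, a_h, a_w = center_form(anchor)
    g_cy, g_cx, g_h, g_w = center_form(gt)
    raw = ((g_cy - a_cy) / a_h, (g_cx - a_cx) / a_w,
           math.log(g_h / a_h), math.log(g_w / a_w))
    return tuple(r / s for r, s in zip(raw, std_dev))


def decode_box(anchor, deltas, std_dev=STD_DEV):
    """Inverse of encode_box: apply the deltas to the anchor."""
    dy, dx, dh, dw = (d * s for d, s in zip(deltas, std_dev))
    a_cy, a_cx, a_h, a_w = center_form(anchor)
    cy, cx = a_cy + dy * a_h, a_cx + dx * a_w
    h, w = a_h * math.exp(dh), a_w * math.exp(dw)
    return (cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w)


anchor = (10.0, 10.0, 50.0, 50.0)
gt = (12.0, 8.0, 60.0, 44.0)
deltas = encode_box(anchor, gt)
recovered = decode_box(anchor, deltas)
print(deltas)
print(recovered)
```

Decoding the encoded targets returns the original ground-truth box (up to floating-point error), which is a quick way to check that an encoder and decoder agree.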

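Mask R-CNN ignores anchors whose overlap is fairly high but not high enough: anchors whose best IoU lies between 0.3 and 0.7 are labelled 0 and left out of training, unless they are the best match for some ground-truth box.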
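The three labelling rules can be tried on a toy example. This is a simplified sketch, not the repository code: it leaves out crowd handling and subsampling, and on ties it marks only the first best anchor. The function names `iou` and `label_anchors` are made up for the illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (y1, x1, y2, x2)."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def label_anchors(anchors, gt_boxes, low=0.3, high=0.7):
    """Return one label per anchor: 1 positive, -1 negative, 0 ignored."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    best_per_anchor = [max(row) for row in overlaps]
    # Rule 1: low overlap means negative, everything else starts as ignored.
    labels = [-1 if best < low else 0 for best in best_per_anchor]
    # Rule 2: the best anchor for each ground-truth box is positive,
    # even if its IoU is below `high`.
    for j in range(len(gt_boxes)):
        column = [row[j] for row in overlaps]
        labels[column.index(max(column))] = 1
    # Rule 3: high overlap means positive.
    for i, best in enumerate(best_per_anchor):
        if best >= high:
            labels[i] = 1
    return labels


anchors = [
    (0, 0, 10, 10),    # IoU 1.0 with the ground truth
    (0, 5, 10, 15),    # IoU 1/3, falls in the ignored band
    (0, 8, 10, 18),    # IoU 1/9, negative
    (20, 20, 30, 30),  # no overlap, negative
]
gt_boxes = [(0, 0, 10, 10)]
labels = label_anchors(anchors, gt_boxes)
print(labels)
```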

Comparing the targets the RPN should produce with what it actually predicts gives the RPN loss.
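As a rough sketch of how such a comparison can work, the snippet below assumes the usual Faster R-CNN style formulation: cross-entropy over anchors labelled 1 or -1, and smooth L1 over the box targets of positive anchors. The exact loss functions used by the repository are not shown in this section, so the function names and the reduction by mean are assumptions.

```python
import math


def smooth_l1(diff):
    """Smooth L1 penalty for one coordinate difference."""
    d = abs(diff)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def rpn_class_loss(rpn_match, fg_probs):
    """Mean cross-entropy over anchors labelled 1 or -1; label 0 is skipped."""
    terms = []
    for label, p in zip(rpn_match, fg_probs):
        if label == 1:
            terms.append(-math.log(p))
        elif label == -1:
            terms.append(-math.log(1.0 - p))
    return sum(terms) / len(terms) if terms else 0.0


def rpn_bbox_loss(rpn_match, target_deltas, pred_deltas):
    """Mean smooth L1 over positive anchors only.

    target_deltas has one row per positive anchor, in anchor order, which
    mirrors how rpn_bbox is filled by build_rpn_targets.
    """
    positive_preds = [p for label, p in zip(rpn_match, pred_deltas) if label == 1]
    terms = [smooth_l1(t - p)
             for target, pred in zip(target_deltas, positive_preds)
             for t, p in zip(target, pred)]
    return sum(terms) / len(terms) if terms else 0.0


rpn_match = [1, 0, -1, -1]
fg_probs = [0.9, 0.5, 0.2, 0.1]           # predicted foreground probability
target_deltas = [(1.5, -1.0, 0.9, -0.5)]  # one row for the single positive
pred_deltas = [(1.0, -1.0, 0.9, 1.0), (0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 0, 0)]

class_loss = rpn_class_loss(rpn_match, fg_probs)
bbox_loss = rpn_bbox_loss(rpn_match, target_deltas, pred_deltas)
print(class_loss, bbox_loss)
```

The ignored anchor (label 0) contributes to neither term, which is the practical effect of the 0.3 to 0.7 band.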

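2. Training the classifier model

The previous part gives the loss for the RPN. In Mask R-CNN the proposals still have to be refined to obtain the final predicted boxes. For the classifier model, the proposals play the role that the anchors play for the RPN.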

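We therefore compute the IoU between every proposal and every ground-truth box and filter on it: a proposal whose IoU with some ground-truth box is at least 0.5 is a positive sample, and a proposal whose IoU is below 0.5 with every ground-truth box is a negative sample.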

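The ground-truth boxes are then encoded relative to the proposals. In other words, given these proposals, the encoding states what the classifier model must predict in order to adjust each proposal into its ground-truth box.

The implementation code is as follows: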
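Before the full graph code, the selection rule can be sketched in plain Python. The function name `sample_rois` is made up for this illustration, and the defaults of 200 ROIs per image and a 0.33 positive ratio are assumed values standing in for `config.TRAIN_ROIS_PER_IMAGE` and `config.ROI_POSITIVE_RATIO`. Crowd boxes are left out.

```python
import random


def sample_rois(roi_iou_max, rois_per_image=200, positive_ratio=0.33, seed=0):
    """Split proposals at IoU 0.5 and subsample to the target ratio.

    roi_iou_max[i] is the best IoU of proposal i with any ground-truth box.
    Returns (positive_indices, negative_indices).
    """
    rng = random.Random(seed)
    positive = [i for i, v in enumerate(roi_iou_max) if v >= 0.5]
    negative = [i for i, v in enumerate(roi_iou_max) if v < 0.5]
    rng.shuffle(positive)
    rng.shuffle(negative)
    # Cap the positives at rois_per_image * positive_ratio.
    positive = positive[:int(rois_per_image * positive_ratio)]
    # Pick negatives so that positives make up about positive_ratio of the total.
    negative_count = int((1.0 / positive_ratio) * len(positive)) - len(positive)
    negative = negative[:negative_count]
    return positive, negative


roi_iou_max = [0.9, 0.6, 0.55, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.1, 0.05, 0.0]
positive, negative = sample_rois(roi_iou_max)
print(sorted(positive), sorted(negative))
```

With three positives available, six of the nine negatives are kept, so positives end up as one third of the sampled ROIs.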

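#----------------------------------------------------------#
#   Detection Target Layer
#   Takes the proposals as input,
#   measures their overlap with the ground-truth boxes,
#   selects the proposals that contain an object,
#   encodes the ground-truth boxes against those proposals,
#   and reshapes the masks to match the prediction format
#----------------------------------------------------------#

def overlaps_graph(boxes1, boxes2):
    """
    Compute the IoU between every box in boxes1 and every box in boxes2.
    boxes1, boxes2: [N, (y1, x1, y2, x2)].
    Returns a tensor of shape [len(boxes1), len(boxes2)].
    """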
    b1 = tf.reshape(tf.tile(tf.expand_dims(boxes1, 1),
                            [1, 1, tf.shape(boxes2)[0]]), [-1, 4])
    b2 = tf.tile(boxes2, [tf.shape(boxes1)[0], 1])
    b1_y1, b1_x1, b1_y2, b1_x2 = tf.split(b1, 4, axis=1)
    b2_y1, b2_x1, b2_y2, b2_x2 = tf.split(b2, 4, axis=1)
    y1 = tf.maximum(b1_y1, b2_y1)
    x1 = tf.maximum(b1_x1, b2_x1)
    y2 = tf.minimum(b1_y2, b2_y2)
    x2 = tf.minimum(b1_x2, b2_x2)
    intersection = tf.maximum(x2 - x1, 0) * tf.maximum(y2 - y1, 0)
    b1_area = (b1_y2 - b1_y1) * (b1_x2 - b1_x1)
    b2_area = (b2_y2 - b2_y1) * (b2_x2 - b2_x1)
    union = b1_area + b2_area - intersection
    iou = intersection / union
    overlaps = tf.reshape(iou, [tf.shape(boxes1)[0], tf.shape(boxes2)[0]])
    return overlaps


def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    asserts = [
        tf.Assert(tf.greater(tf.shape(proposals)[0], 0), [proposals],
                  name="roi_assertion"),
    ]
    with tf.control_dependencies(asserts):
        proposals = tf.identity(proposals)

    # Remove the zero padding added in earlier steps
    proposals, _ = trim_zeros_graph(proposals, name="trim_proposals")
    gt_boxes, non_zeros = trim_zeros_graph(gt_boxes, name="trim_gt_boxes")
    gt_class_ids = tf.boolean_mask(gt_class_ids, non_zeros,
                                   name="trim_gt_class_ids")
    gt_masks = tf.gather(gt_masks, tf.where(non_zeros)[:, 0], axis=2,
                         name="trim_gt_masks")

    # Handle COCO crowds
    # A crowd box in COCO is a bounding box around several instances. Exclude
    # them from training. A crowd box is given a negative class ID.
    crowd_ix = tf.where(gt_class_ids < 0)[:, 0]
    non_crowd_ix = tf.where(gt_class_ids > 0)[:, 0]
    crowd_boxes = tf.gather(gt_boxes, crowd_ix)
    gt_class_ids = tf.gather(gt_class_ids, non_crowd_ix)
    gt_boxes = tf.gather(gt_boxes, non_crowd_ix)
    gt_masks = tf.gather(gt_masks, non_crowd_ix, axis=2)

    # IoU between proposals and all ground-truth boxes [proposals, gt_boxes]
    overlaps = overlaps_graph(proposals, gt_boxes)

    # IoU between proposals and crowd boxes [proposals, crowd_boxes]
    crowd_overlaps = overlaps_graph(proposals, crowd_boxes)
    crowd_iou_max = tf.reduce_max(crowd_overlaps, axis=1)
    no_crowd_bool = (crowd_iou_max < 0.001)

    # Determine positive and negative ROIs
    roi_iou_max = tf.reduce_max(overlaps, axis=1)
    # 1. Positive ROIs have an IoU of at least 0.5 with a ground-truth box
    positive_roi_bool = (roi_iou_max >= 0.5)
    positive_indices = tf.where(positive_roi_bool)[:, 0]
    # 2. Negative ROIs have an IoU below 0.5 with every ground-truth box. Skip crowds.
    negative_indices = tf.where(tf.logical_and(roi_iou_max < 0.5, no_crowd_bool))[:, 0]

    # Subsample ROIs. Aim for 33% positive
    # Balance positive and negative samples
    # Keep at most TRAIN_ROIS_PER_IMAGE * ROI_POSITIVE_RATIO positives (about 33%)
    positive_count = int(config.TRAIN_ROIS_PER_IMAGE *
                         config.ROI_POSITIVE_RATIO)
    positive_indices = tf.random_shuffle(positive_indices)[:positive_count]
    positive_count = tf.shape(positive_indices)[0]
    # Choose the number of negatives that keeps the positive:negative ratio
    r = 1.0 / config.ROI_POSITIVE_RATIO
    negative_count = tf.cast(r * tf.cast(positive_count, tf.float32), tf.int32) - positive_count
    negative_indices = tf.random_shuffle(negative_indices)[:negative_count]
    # Gather the positive and negative ROIs
    positive_rois = tf.gather(proposals, positive_indices)
    negative_rois = tf.gather(proposals, negative_indices)

    # IoU of each positive ROI with the ground-truth boxes
    positive_overlaps = tf.gather(overlaps, positive_indices)

    # Guard against the case where there are no ground-truth boxes
    roi_gt_box_assignment = tf.cond(
        tf.greater(tf.shape(positive_overlaps)[1], 0),
        true_fn = lambda: tf.argmax(positive_overlaps, axis=1),
        false_fn = lambda: tf.cast(tf.constant([]),tf.int64)
    )
    # Ground-truth box and class assigned to each positive ROI
    roi_gt_boxes = tf.gather(gt_boxes, roi_gt_box_assignment)
    roi_gt_class_ids = tf.gather(gt_class_ids, roi_gt_box_assignment)

    # Encode to get the refinement the network should predict
    deltas = utils.box_refinement_graph(positive_rois, roi_gt_boxes)
    deltas /= config.BBOX_STD_DEV

    # Rearrange the masks to [N, height, width, 1]
    transposed_masks = tf.expand_dims(tf.transpose(gt_masks, [2, 0, 1]), -1)

    # Pick the mask that belongs to each ROI
    roi_masks = tf.gather(transposed_masks, roi_gt_box_assignment)

    # Compute mask targets
    boxes = positive_rois
    if config.USE_MINI_MASK:
        # Transform ROI coordinates from normalized image space
        # to normalized mini-mask space.
        y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1)
        gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1)
        gt_h = gt_y2 - gt_y1
        gt_w = gt_x2 - gt_x1
        y1 = (y1 - gt_y1) / gt_h
        x1 = (x1 - gt_x1) / gt_w
        y2 = (y2 - gt_y1) / gt_h
        x2 = (x2 - gt_x1) / gt_w
        boxes = tf.concat([y1, x1, y2, x2], 1)
    box_ids = tf.range(0, tf.shape(roi_masks)[0])
    masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes,
                                     box_ids,
                                     config.MASK_SHAPE)
    # Remove the extra dimension from masks.
    masks = tf.squeeze(masks, axis=3)

    # Round so that the values are still 0 or 1 after resizing
    masks = tf.round(masks)

    # config.TRAIN_ROIS_PER_IMAGE ROIs are normally passed on for training;
    # pad with zeros if there are fewer
    rois = tf.concat([positive_rois, negative_rois], axis=0)
    N = tf.shape(negative_rois)[0]
    P = tf.maximum(config.TRAIN_ROIS_PER_IMAGE - tf.shape(rois)[0], 0)
    rois = tf.pad(rois, [(0, P), (0, 0)])
    roi_gt_boxes = tf.pad(roi_gt_boxes, [(0, N + P), (0, 0)])
    roi_gt_class_ids = tf.pad(roi_gt_class_ids, [(0, N + P)])
    deltas = tf.pad(deltas, [(0, N + P), (0, 0)])
    masks = tf.pad(masks, [[0, N + P], (0, 0), (0, 0)])

    return rois, roi_gt_class_ids, deltas, masks

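def trim_zeros_graph(boxes, name='trim_zeros'):
    """
    If the previous step produced fewer than POST_NMS_ROIS_TRAINING proposals,
    the tensor is zero-padded. Remove the all-zero rows and also return the
    boolean mask of the rows that were kept.
    """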
    non_zeros = tf.cast(tf.reduce_sum(tf.abs(boxes), axis=1), tf.bool)
    boxes = tf.boolean_mask(boxes, non_zeros, name=name)
    return boxes, non_zeros

class DetectionTargetLayer(Layer):
    """Finds the ground truth that belongs to each proposal.

    Inputs:
    proposals: [batch, N, (y1, x1, y2, x2)] proposal boxes
    gt_class_ids: [batch, MAX_GT_INSTANCES] class of each ground-truth box
    gt_boxes: [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)] ground-truth box positions
    gt_masks: [batch, height, width, MAX_GT_INSTANCES] mask of each ground-truth instance

    Returns:
    rois: [batch, TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)] sampled proposals
        (positives first, then negatives, zero-padded)
    target_class_ids: [batch, TRAIN_ROIS_PER_IMAGE] class assigned to each proposal
    target_deltas: [batch, TRAIN_ROIS_PER_IMAGE, (dy, dx, log(dh), log(dw))] refinement each proposal should predict
    target_mask: [batch, TRAIN_ROIS_PER_IMAGE, height, width] mask target for each proposal
    """

    def __init__(self, config, **kwargs):
        super(DetectionTargetLayer, self).__init__(**kwargs)
        self.config = config

    def call(self, inputs):
        proposals = inputs[0]
        gt_class_ids = inputs[1]
        gt_boxes = inputs[2]
        gt_masks = inputs[3]

        # Build the training targets image by image across the batch
        names = ["rois", "target_class_ids", "target_bbox", "target_mask"]
        outputs = utils.batch_slice(
            [proposals, gt_class_ids, gt_boxes, gt_masks],
            lambda w, x, y, z: detection_targets_graph(
                w, x, y, z, self.config),
            self.config.IMAGES_PER_GPU, names=names)
        return outputs

    def compute_output_shape(self, input_shape):
        return [
            (None, self.config.TRAIN_ROIS_PER_IMAGE, 4),  # rois
            (None, self.config.TRAIN_ROIS_PER_IMAGE),  # class_ids
            (None, self.config.TRAIN_ROIS_PER_IMAGE, 4),  # deltas
            (None, self.config.TRAIN_ROIS_PER_IMAGE, self.config.MASK_SHAPE[0],
             self.config.MASK_SHAPE[1])  # masks
        ]

    def compute_mask(self, inputs, mask=None):
        return [None, None, None, None]

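3. Training the mask model

One thing to watch when training the mask model: when a proposal is used to crop the shared feature layer that the mask model consumes, the cropped region is not the same as the region covered by the ground-truth box. We therefore have to compute the position of the cropping box relative to the ground-truth box in order to obtain the correct segmentation target.

The code used is shown below; most of it computes the position of the proposal relative to the ground-truth box. Once that is done, the relative position is used to crop the ground-truth mask so that the target lines up with the proposal.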

# Compute mask targets
boxes = positive_rois
if config.USE_MINI_MASK:
    # Transform ROI coordinates from normalized image space
    # to normalized mini-mask space.
    y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1)
    gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1)
    gt_h = gt_y2 - gt_y1
    gt_w = gt_x2 - gt_x1
    y1 = (y1 - gt_y1) / gt_h
    x1 = (x1 - gt_x1) / gt_w
    y2 = (y2 - gt_y1) / gt_h
    x2 = (x2 - gt_x1) / gt_w
    boxes = tf.concat([y1, x1, y2, x2], 1)
box_ids = tf.range(0, tf.shape(roi_masks)[0])
masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes,
                                    box_ids,
                                    config.MASK_SHAPE)

The mask targets obtained this way are then compared with the model's predictions to train the mask model.
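A small worked example may help with the coordinate change in the `USE_MINI_MASK` branch. The helper name `roi_in_gt_space` and the box values are made up for the illustration; the arithmetic is the same as in the snippet above.

```python
def roi_in_gt_space(roi, gt_box):
    """Express roi (y1, x1, y2, x2) in coordinates where gt_box spans [0, 1]."""
    gt_h = gt_box[2] - gt_box[0]
    gt_w = gt_box[3] - gt_box[1]
    return ((roi[0] - gt_box[0]) / gt_h,
            (roi[1] - gt_box[1]) / gt_w,
            (roi[2] - gt_box[0]) / gt_h,
            (roi[3] - gt_box[1]) / gt_w)


# Both boxes are in normalized image coordinates.
gt_box = (0.25, 0.25, 0.75, 0.75)
roi = (0.375, 0.125, 0.875, 0.625)
relative = roi_in_gt_space(roi, gt_box)
print(relative)  # (0.25, -0.25, 1.25, 0.75)
```

Values below 0 or above 1 mean that the proposal sticks out beyond the ground-truth box on that side. When such a box is passed to `tf.image.crop_and_resize`, the part outside the stored mask is filled with the extrapolation value, which is 0 by default, so it counts as background.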

Training your own Mask-RCNN model

The overall folder layout of Mask-RCNN is as follows:

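1. Preparing the dataset

This article is aimed at readers who want to train on their own dataset. First, annotate the data with labelme.

Put the annotated files in the before folder:

This article provides a script that converts labelme output into a dataset; run it from the directory that contains before.

Running it generates train_dataset; place this train_dataset in the root directory of the Mask-RCNN model.

The conversion code is as follows: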

import base64
import json
import os
import os.path as osp
import warnings

import PIL.Image
import yaml
from labelme import utils
 
def main():
    count = os.listdir("./before/") 
    index = 0
    for i in range(0, len(count)):
        path = os.path.join("./before", count[i])

        if os.path.isfile(path) and path.endswith('json'):
            # Read the labelme annotation and close the file afterwards
            with open(path) as json_file:
                data = json.load(json_file)
            
            if data['imageData']:
                imageData = data['imageData']
            else:
                imagePath = os.path.join(os.path.dirname(path), data['imagePath'])
                with open(imagePath, 'rb') as f:
                    imageData = f.read()
                    imageData = base64.b64encode(imageData).decode('utf-8')
            img = utils.img_b64_to_arr(imageData)
            label_name_to_value = {'_background_': 0}
            for shape in data['shapes']:
                label_name = shape['label']
                if label_name in label_name_to_value:
                    label_value = label_name_to_value[label_name]
                else:
                    label_value = len(label_name_to_value)
                    label_name_to_value[label_name] = label_value
            
            # label_values must be dense
            label_values, label_names = [], []
            for ln, lv in sorted(label_name_to_value.items(), key=lambda x: x[1]):
                label_values.append(lv)
                label_names.append(ln)
            
            assert label_values == list(range(len(label_values)))
            
            lbl = utils.shapes_to_label(img.shape, data['shapes'], label_name_to_value)
            
            captions = ['{}: {}'.format(lv, ln)
                for ln, lv in label_name_to_value.items()]
            lbl_viz = utils.draw_label(lbl, img, captions)

            if not os.path.exists("train_dataset"):
                os.mkdir("train_dataset")
            label_path = "train_dataset/mask"
            if not os.path.exists(label_path):
                os.mkdir(label_path)
            img_path = "train_dataset/imgs"
            if not os.path.exists(img_path):
                os.mkdir(img_path)
            yaml_path = "train_dataset/yaml"
            if not os.path.exists(yaml_path):
                os.mkdir(yaml_path)
            label_viz_path = "train_dataset/label_viz"
            if not os.path.exists(label_viz_path):
                os.mkdir(label_viz_path)

            PIL.Image.fromarray(img).save(osp.join(img_path, str(index)+'.jpg'))

            utils.lblsave(osp.join(label_path, str(index)+'.png'), lbl)
            PIL.Image.fromarray(lbl_viz).save(osp.join(label_viz_path, str(index)+'.png'))
 
            warnings.warn('info.yaml is being replaced by label_names.txt')
            info = dict(label_names=label_names)
            with open(osp.join(yaml_path, str(index)+'.yaml'), 'w') as f:
                yaml.safe_dump(info, f, default_flow_style=False)
            index = index+1
            print('Saved : %s' % str(index))
if __name__ == '__main__':
    main()
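
After running the conversion, it can be worth checking that every sample has an image, a mask and a yaml file. The sketch below is not part of the repository; `check_dataset` is a hypothetical helper that relies only on the folder names and index-based file names used by the script above. The demo builds a throwaway directory so it can run without a real dataset.

```python
import os
import tempfile


def check_dataset(root="train_dataset"):
    """Return the sample ids that have an image, a mask and a yaml file."""
    def stems(subdir, ext):
        folder = os.path.join(root, subdir)
        return {os.path.splitext(name)[0]
                for name in os.listdir(folder) if name.endswith(ext)}

    imgs = stems("imgs", ".jpg")
    masks = stems("mask", ".png")
    yamls = stems("yaml", ".yaml")
    incomplete = (imgs | masks | yamls) - (imgs & masks & yamls)
    if incomplete:
        raise ValueError("samples missing a file: %s" % sorted(incomplete))
    # File names are integer indices, so sort them numerically.
    return sorted(imgs, key=int)


# Demo on a temporary directory that mimics the generated layout.
with tempfile.TemporaryDirectory() as root:
    for sub, ext in (("imgs", ".jpg"), ("mask", ".png"), ("yaml", ".yaml")):
        os.makedirs(os.path.join(root, sub))
        for index in (0, 1, 10):
            open(os.path.join(root, sub, str(index) + ext), "w").close()
    sample_ids = check_dataset(root)
print(sample_ids)
```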

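2. Modifying the parameters

Once the dataset has been generated, adjust the parameters in train.py as required and training can start. NUM_CLASSES is the total number of classes plus 1.

In dataset.py, change the classes to your own: edit the class-related parts of the load_shapes and load_mask functions, replacing the original circle, square and so on with your own classes.

In train.py, edit ShapesConfig(Config): NUM_CLASSES equals the number of your classes plus 1 (the extra one is the background).

Set IMAGE_MAX_DIM, IMAGE_MIN_DIM, BATCH_SIZE and IMAGES_PER_GPU according to your GPU memory. Adjust RPN_ANCHOR_SCALES to match IMAGE_MAX_DIM and IMAGE_MIN_DIM.

STEPS_PER_EPOCH is the number of training steps per epoch.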
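As an illustration of how these settings fit together, here is a self-contained sketch. The `Config` class below is only a stand-in with assumed attribute names and default values, not the repository's base class, and the three-class dataset, the halved anchor scales and the `steps_for` helper are all hypothetical.

```python
class Config:
    """Stand-in for the repository's Config base class (assumed attributes)."""
    NUM_CLASSES = 1
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    IMAGE_MIN_DIM = 1024
    IMAGE_MAX_DIM = 1024
    RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)
    STEPS_PER_EPOCH = 1000


class ShapesConfig(Config):
    # Background plus three hypothetical classes of your own.
    NUM_CLASSES = 1 + 3
    # Lower these if the GPU runs out of memory.
    IMAGES_PER_GPU = 2
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    # Images are half the assumed 1024 default, so the anchors are halved too.
    RPN_ANCHOR_SCALES = (16, 32, 64, 128, 256)
    STEPS_PER_EPOCH = 250


def steps_for(num_images, config):
    """Steps needed to see every image once per epoch (one possible choice).

    Assumes the batch size is IMAGES_PER_GPU * GPU_COUNT.
    """
    batch_size = config.IMAGES_PER_GPU * config.GPU_COUNT
    return max(1, num_images // batch_size)


print(ShapesConfig.NUM_CLASSES, steps_for(500, ShapesConfig))
```

Setting STEPS_PER_EPOCH so that one epoch covers the dataset once is a common convention rather than a requirement; smaller values simply make each epoch shorter.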

3. Training the model

Once all the changes are made, run train.py to start training.

Smart Object Detection 19: Building a Mask R-CNN instance segmentation platform with Keras
