Introduction
The SSD object detection algorithm was proposed in 2016. It is faster than Faster R-CNN and more accurate than YOLO (with the exception of YOLOv3). In this article we focus on its network structure.
Network Structure
The SSD network is built on top of the VGG network, shown in the figure below:
SSD removes the fully connected layers of VGG and appends ten more convolutional layers after it. It then takes the outputs of VGG's Conv4_3 together with the newly added Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2, producing multi-scale outputs (similar to a feature pyramid), as shown below:
When a 300x300x3 image is fed into the network, it undergoes the transformations shown in the figure:
We draw boxes of fixed aspect ratios on each of these feature maps. Boxes on high-resolution maps help detect small objects, while boxes on low-resolution maps help detect large objects, so we obtain predictions at multiple scales.
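As a rough sketch of how these multi-scale boxes get their sizes: the SSD paper spaces the default-box scales linearly between s_min = 0.2 and s_max = 0.9 across the m feature maps. The values below follow the paper's formula, not anything in this article's code, and real implementations often tune them:

```python
# Sketch: default-box scale for each of m feature maps, following the
# linear scale rule from the SSD paper: s_k = s_min + (s_max - s_min)(k-1)/(m-1)
def default_box_scales(m, s_min=0.2, s_max=0.9):
    return [round(s_min + (s_max - s_min) * (k - 1) / (m - 1), 2)
            for k in range(1, m + 1)]

print(default_box_scales(6))  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

Earlier (higher-resolution) feature maps thus get small boxes and later ones get large boxes, matching the intuition above.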
Building the SSD Network
```python
import tensorflow as tf


class SSD(tf.keras.Model):
    def __init__(self, num_class=21):
        super(SSD, self).__init__()
        self.num_class = num_class  # e.g. 20 VOC classes + background
        # conv1
        self.conv1_1 = tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same')
        self.conv1_2 = tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same')
        self.pool1 = tf.keras.layers.MaxPooling2D(2, strides=2, padding='same')
        # conv2
        self.conv2_1 = tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same')
        self.conv2_2 = tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same')
        self.pool2 = tf.keras.layers.MaxPooling2D(2, strides=2, padding='same')
        # conv3
        self.conv3_1 = tf.keras.layers.Conv2D(256, 3, activation='relu', padding='same')
        self.conv3_2 = tf.keras.layers.Conv2D(256, 3, activation='relu', padding='same')
        self.conv3_3 = tf.keras.layers.Conv2D(256, 3, activation='relu', padding='same')
        self.pool3 = tf.keras.layers.MaxPooling2D(2, strides=2, padding='same')
        # conv4
        self.conv4_1 = tf.keras.layers.Conv2D(512, 3, activation='relu', padding='same')
        self.conv4_2 = tf.keras.layers.Conv2D(512, 3, activation='relu', padding='same')
        self.conv4_3 = tf.keras.layers.Conv2D(512, 3, activation='relu', padding='same')
        self.pool4 = tf.keras.layers.MaxPooling2D(2, strides=2, padding='same')
        # conv5
        self.conv5_1 = tf.keras.layers.Conv2D(512, 3, activation='relu', padding='same')
        self.conv5_2 = tf.keras.layers.Conv2D(512, 3, activation='relu', padding='same')
        self.conv5_3 = tf.keras.layers.Conv2D(512, 3, activation='relu', padding='same')
        self.pool5 = tf.keras.layers.MaxPooling2D(3, strides=1, padding='same')
        # fc6 -- the VGG backbone ends here; everything below is added by SSD
        self.fc6 = tf.keras.layers.Conv2D(1024, 3, dilation_rate=6, activation='relu', padding='same')
        # fc7
        self.fc7 = tf.keras.layers.Conv2D(1024, 1, activation='relu', padding='same')
        # Blocks 8-11: a 1x1 convolution followed by a 3x3 convolution,
        # with stride 2 in blocks 8/9 and 'valid' padding in blocks 10/11
        # conv8
        self.conv8_1 = tf.keras.layers.Conv2D(256, 1, activation='relu', padding='same')
        self.conv8_2 = tf.keras.layers.Conv2D(512, 3, strides=2, activation='relu', padding='same')
        # conv9
        self.conv9_1 = tf.keras.layers.Conv2D(128, 1, activation='relu', padding='same')
        self.conv9_2 = tf.keras.layers.Conv2D(256, 3, strides=2, activation='relu', padding='same')
        # conv10
        self.conv10_1 = tf.keras.layers.Conv2D(128, 1, activation='relu', padding='same')
        self.conv10_2 = tf.keras.layers.Conv2D(256, 3, activation='relu', padding='valid')
        # conv11
        self.conv11_1 = tf.keras.layers.Conv2D(128, 1, activation='relu', padding='same')
        self.conv11_2 = tf.keras.layers.Conv2D(256, 3, activation='relu', padding='valid')

    def call(self, x, training=False):
        h = self.conv1_1(x)
        h = self.conv1_2(h)
        h = self.pool1(h)
        h = self.conv2_1(h)
        h = self.conv2_2(h)
        h = self.pool2(h)
        h = self.conv3_1(h)
        h = self.conv3_2(h)
        h = self.conv3_3(h)
        h = self.pool3(h)
        h = self.conv4_1(h)
        h = self.conv4_2(h)
        h = self.conv4_3(h)  # [1, 38, 38, 512]
        print(h.shape)
        h = self.pool4(h)
        h = self.conv5_1(h)
        h = self.conv5_2(h)
        h = self.conv5_3(h)
        h = self.pool5(h)
        h = self.fc6(h)      # [1, 19, 19, 1024]
        h = self.fc7(h)      # [1, 19, 19, 1024]
        print(h.shape)
        h = self.conv8_1(h)
        h = self.conv8_2(h)  # [1, 10, 10, 512]
        print(h.shape)
        h = self.conv9_1(h)
        h = self.conv9_2(h)  # [1, 5, 5, 256]
        print(h.shape)
        h = self.conv10_1(h)
        h = self.conv10_2(h)  # [1, 3, 3, 256]
        print(h.shape)
        h = self.conv11_1(h)
        h = self.conv11_2(h)  # [1, 1, 1, 256]
        print(h.shape)
        return h
```
When we feed a 300x300x3 image into the network, we get:
```python
model = SSD(21)
x = model(tf.ones(shape=[1, 300, 300, 3]))
```

```
(1, 38, 38, 512)
(1, 19, 19, 1024)
(1, 10, 10, 512)
(1, 5, 5, 256)
(1, 3, 3, 256)
(1, 1, 1, 256)
```
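Given these six feature-map resolutions, we can count how many default boxes SSD300 predicts per image. The per-location box counts below (4 or 6 per cell) come from the SSD paper, not from the code above:

```python
# Sketch: total default boxes for SSD300. Each (size, n) pair is a
# feature-map resolution and the number of default boxes per cell
# (4 on conv4_3, conv10_2, conv11_2; 6 on the rest, per the SSD paper).
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total = sum(size * size * n for size, n in feature_maps)
print(total)  # 8732
```

This is the well-known 8732 boxes per image of SSD300; the detection heads (not built in the model above) would predict 4 offsets and num_class scores for each of them.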
Dilated Convolution
When building the self.fc6 layer in the code above, we used a dilated convolution, which introduces a new parameter, the dilation rate. Its principle is illustrated in the figure:
Figure (a) shows a 3x3 kernel with dilation rate 1, which is just an ordinary convolution. Figure (b) shows a 3x3 kernel with dilation rate 2: within a 7x7 region of the image, only the 9 red points are convolved with the 3x3 kernel, and all other points are skipped. Equivalently, you can think of it as a 7x7 kernel in which only those 9 weights are non-zero. So although the kernel is still only 3x3, its receptive field has grown to 7x7. Figure (c) shows a 3x3 kernel with dilation rate 4, which reaches a 15x15 receptive field. By contrast, stacking three ordinary 3x3 convolutions with stride 1 only yields a receptive field of (kernel - 1) × layers + 1 = 7, i.e. linear in the number of layers, whereas stacking dilated convolutions whose dilation rate doubles at each layer makes the receptive field grow exponentially with depth.
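The receptive-field arithmetic above can be checked with a small sketch. For a stack of stride-1 convolutions, each layer adds (kernel - 1) × dilation to the receptive field, a standard formula assumed here:

```python
# Sketch: receptive field of stacked 3x3, stride-1 convolutions.
# Each stride-1 layer adds (kernel - 1) * dilation to the receptive field.
def receptive_field(dilations, kernel=3):
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

print(receptive_field([1, 1, 1]))  # three plain 3x3 convs -> 7
print(receptive_field([1, 2, 4]))  # dilations 1, 2, 4   -> 15
```

Three plain layers reach 7 (linear growth), while three layers with dilations 1, 2, 4 reach 15, matching figures (a) through (c).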