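This post walks through an implementation of YOLO v1 and annotates several of its key functions in detail.
The post is organized as follows: source overview; 1. Building the network; 2. Training; 3. Prediction.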
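Source overview
Source code: https://github.com/1273545169/object-detection/tree/master/yolo
./data holds the VOC dataset and the model weights;
config is the configuration file, where the model parameters can be changed;
yolo_net.py builds the network and the loss function;
pascal_voc.py in utils prepares the training examples;
train.py trains the model;
predict.py runs the model on test images;
./test holds the test images.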
1. Building the network
The network is built by the build_network() method in yolo_net.py, which uses dropout to reduce overfitting.
def build_network(self,
                  images,
                  num_outputs,
                  alpha,
                  keep_prob=0.5,
                  is_training=True,
                  scope='yolo'):
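To make the role of keep_prob and is_training concrete, here is a minimal NumPy sketch of inverted dropout. It is an illustration only, not the layer used in yolo_net.py; the function name dropout and the rng argument are assumptions made for this example.

```python
import numpy as np


def dropout(x, keep_prob=0.5, is_training=True, rng=None):
    """Inverted dropout: keep each unit with probability keep_prob.

    Kept units are scaled by 1 / keep_prob so the expected activation
    is unchanged, which lets inference skip dropout entirely.
    """
    if not is_training or keep_prob >= 1.0:
        return x
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) < keep_prob  # True where the unit is kept
    return x * mask / keep_prob


activations = np.ones((4, 8), dtype=np.float32)
# Training: roughly half of the units are zeroed, the rest are doubled.
train_out = dropout(activations, keep_prob=0.5, rng=np.random.default_rng(0))
# Inference: the input passes through unchanged.
test_out = dropout(activations, keep_prob=0.5, is_training=False)
```

With keep_prob=0.5 every surviving activation is multiplied by 2, so the layer that follows sees the same expected input during training and testing.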
2. Training
The model is trained by main() in train.py:
pascal = pascal_voc('train')
yolo = YOLONet()
solver = Solver(yolo, pascal)
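2.1 Getting and processing training data from PASCAL VOC
First, the indices (index) of all training examples are read from VOC2007/ImageSets/Main/. The load_pascal_JPEGImages_annotation(index) method then uses each image index to obtain the path of the training example and its label information. In addition, prepare() adds horizontally flipped copies of the training examples, a form of data augmentation that helps the model generalize.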
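The horizontal flip has to be applied to the label as well as to the image. The sketch below shows one way to do this for a (cell, cell, 25) label whose box values are pixel coordinates on the resized image. It is not copied from prepare(); the name flip_label and the 448-pixel input size are assumptions made for illustration.

```python
import numpy as np

IMAGE_SIZE = 448  # assumed input resolution


def flip_label(label, image_size=IMAGE_SIZE):
    """Mirror a (cell, cell, 25) label left to right.

    Layout per cell: [p(object), x, y, w, h, 20 class scores].
    """
    # Reverse the column axis so each object moves to the mirrored cell.
    flipped = label[:, ::-1, :].copy()
    # Mirror the x center of every cell that holds an object.
    has_obj = flipped[..., 0] == 1
    flipped[has_obj, 1] = image_size - 1 - flipped[has_obj, 1]
    return flipped


label = np.zeros((7, 7, 25))
label[3, 1, 0] = 1                    # object present in row 3, column 1
label[3, 1, 1:5] = [100, 200, 50, 80]  # (x, y, w, h) in pixels
label[3, 1, 5 + 11] = 1               # one-hot class entry
flipped = flip_label(label)
# The matching image flip would be image[:, ::-1, :].
```

After the flip the object sits in column 5 with x = 347, which is still the cell that contains its center.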
# Requires: import os, cv2, numpy as np, xml.etree.ElementTree as ET
def load_pascal_JPEGImages_annotation(self, index):
    """
    Load image and bounding boxes info from XML file in the PASCAL VOC
    format.
    """
    # data/VOCdevkit/VOC2007/JPEGImages holds the source images
    # imname is the path of the training example
    imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
    im = cv2.imread(imname)
    # scale factors that map the original image onto image_size x image_size
    h_ratio = 1.0 * self.image_size / im.shape[0]
    w_ratio = 1.0 * self.image_size / im.shape[1]
    # im = cv2.resize(im, [self.image_size, self.image_size])
    # per cell: [p(object), x, y, w, h, one-hot class scores] = 25 values
    label = np.zeros((self.cell_size, self.cell_size, 25))
    # data/VOCdevkit/VOC2007/Annotations holds the xml files with the box info;
    # there is one xml file per image, matching the images in JPEGImages one to one
    filename = os.path.join(self.data_path, 'Annotations', index + '.xml')
    # parse the xml document into a tree
    tree = ET.parse(filename)
    # all box info for this image
    objs = tree.findall('object')
    for obj in objs:
        bbox = obj.find('bndbox')
        # Make pixel indexes 0-based, rescale, and clamp to [0, image_size - 1]
        x1 = max(min((float(bbox.find('xmin').text) - 1) * w_ratio, self.image_size - 1), 0)
        y1 = max(min((float(bbox.find('ymin').text) - 1) * h_ratio, self.image_size - 1), 0)
        x2 = max(min((float(bbox.find('xmax').text) - 1) * w_ratio, self.image_size - 1), 0)
        y2 = max(min((float(bbox.find('ymax').text) - 1) * h_ratio, self.image_size - 1), 0)
        # index of the class
        cls_ind = self.class_to_ind[obj.find('name').text.lower().strip()]
        # boxes (x1, y1, x2, y2) -> (x, y, w, h), with (x, y) the box center
        boxes = [(x2 + x1) / 2.0, (y2 + y1) / 2.0, x2 - x1, y2 - y1]
        # find the grid cell that contains the center (x, y)
        x_ind = int(boxes[0] * self.cell_size / self.image_size)
        y_ind = int(boxes[1] * self.cell_size / self.image_size)
        # each cell holds at most one object; skip this one if the cell is taken
        if label[y_ind, x_ind, 0] == 1:
            continue
        # p(object)
        label[y_ind, x_ind, 0] = 1
        # box
        label[y_ind, x_ind, 1:5] = boxes
        # p(class)
        label[y_ind, x_ind, 5 + cls_ind] = 1
    return imname, label, len(objs)
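As a worked example of the cell assignment above, the following standalone sketch encodes a single box into a label grid. The helper name encode_box and the default sizes (448-pixel input, 7 x 7 grid, 20 classes) are assumptions made for illustration; the arithmetic mirrors the listing.

```python
import numpy as np


def encode_box(x1, y1, x2, y2, cls_ind, image_size=448, cell_size=7, num_class=20):
    """Encode one corner-format box (already in resized-image pixels) into a label grid."""
    label = np.zeros((cell_size, cell_size, 5 + num_class))
    # (x1, y1, x2, y2) -> (x_center, y_center, w, h)
    box = [(x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1]
    # Grid cell that contains the box center.
    x_ind = int(box[0] * cell_size / image_size)
    y_ind = int(box[1] * cell_size / image_size)
    label[y_ind, x_ind, 0] = 1            # p(object)
    label[y_ind, x_ind, 1:5] = box        # box in pixels
    label[y_ind, x_ind, 5 + cls_ind] = 1  # one-hot class
    return label, (y_ind, x_ind)


# Center is (200, 250): 200 * 7 / 448 = 3.125 and 250 * 7 / 448 = 3.906,
# so the object belongs to row 3, column 3.
label, cell = encode_box(100, 150, 300, 350, cls_ind=14)
```

Only the cell holding the center is marked; the neighbouring cells that the box overlaps stay at zero.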
2.2 Loss function
The loss is defined by the loss_layer() method in yolo_net.py.
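For orientation, the four terms computed in loss_layer() can be summarized per image roughly as follows. This is a reading of the code rather than a formula quoted from the repository: the symbols are assumptions, with S = 7 grid cells per side, B = 2 boxes per cell, and the four weights standing for self.coord_scale, self.object_scale, self.noobject_scale and self.class_scale. The code additionally averages each term over the batch, and the network output for width and height is treated directly as the square root.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ (\hat{x}_{ij} - x_i)^2 + (\hat{y}_{ij} - y_i)^2
       + \left(\sqrt{\hat{w}_{ij}} - \sqrt{w_i}\right)^2
       + \left(\sqrt{\hat{h}_{ij}} - \sqrt{h_i}\right)^2 \right] \\
& + \lambda_{\text{obj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left( \hat{C}_{ij} - \mathrm{IOU}_{ij} \right)^2 \\
& + \lambda_{\text{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{noobj}}\, \hat{C}_{ij}^{\,2} \\
& + \lambda_{\text{class}} \sum_{i=1}^{S^2} \mathbb{1}_{i}^{\text{obj}}
  \sum_{c} \left( \hat{p}_i(c) - p_i(c) \right)^2
\end{aligned}
```

Here the indicator for (i, j) is 1 only for the box in cell i that has the highest IOU with the ground truth when that cell contains an object center, and the noobj indicator is its complement.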
# Requires: import tensorflow as tf (TensorFlow 1.x API)
# predicts: network output, flattened per image, shape (45, 7*7*30)
# labels: ground truth, shape (45, 7, 7, 25)
def loss_layer(self, predicts, labels, scope='loss_layer'):
    with tf.variable_scope(scope):
        # ---- predictions ----
        # class scores: 20 per cell
        predict_classes = tf.reshape(
            predicts[:, :self.boundary1],
            [self.batch_size, self.cell_size, self.cell_size, self.num_class])
        # confidences: 2 per cell, one per box
        predict_confidence = tf.reshape(
            predicts[:, self.boundary1:self.boundary2],
            [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell])
        # bounding boxes: 2 * 4 per cell
        predict_boxes = tf.reshape(
            predicts[:, self.boundary2:],
            [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell, 4])
        # ---- ground truth ----
        # shape (45, 7, 7, 1)
        # response is 1 for a cell that contains an object and 0 otherwise.
        # "Contains an object" means the object's center point falls in the cell,
        # not that part of the object overlaps it, so only the cell holding the
        # center is 1 and every other cell is 0.
        response = tf.reshape(
            labels[..., 0],
            [self.batch_size, self.cell_size, self.cell_size, 1])
        # shape (45, 7, 7, 1, 4)
        boxes = tf.reshape(
            labels[..., 1:5],
            [self.batch_size, self.cell_size, self.cell_size, 1, 4])
        # shape (45, 7, 7, 2, 4); dividing by image_size puts all four values in 0~1
        boxes = tf.tile(
            boxes, [1, 1, 1, self.boxes_per_cell, 1]) / self.image_size
        # shape (45, 7, 7, 20)
        classes = labels[..., 5:]
        # self.offset shape (7, 7, 2): the column index of each cell
        # offset shape (1, 7, 7, 2)
        offset = tf.reshape(
            tf.constant(self.offset, dtype=tf.float32),
            [1, self.cell_size, self.cell_size, self.boxes_per_cell])
        # shape (45, 7, 7, 2)
        x_offset = tf.tile(offset, [self.batch_size, 1, 1, 1])
        # shape (1, 7, 7, 2): the row index of each cell; broadcasts over the batch
        y_offset = tf.transpose(offset, (0, 2, 1, 3))
        # convert the x, y to the coordinates relative to the top left point of the image
        # the predictions of w, h are the square root
        # shape (45, 7, 7, 2, 4) -> (x, y, w, h)
        predict_boxes_tran = tf.stack(
            [(predict_boxes[..., 0] + x_offset) / self.cell_size,
             (predict_boxes[..., 1] + y_offset) / self.cell_size,
             tf.square(predict_boxes[..., 2]),
             tf.square(predict_boxes[..., 3])], axis=-1)
        # IOU between predicted boxes and ground-truth boxes, shape (45, 7, 7, 2)
        iou_predict_truth = self.calc_iou(predict_boxes_tran, boxes)
        # calculate I tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
        # shape (45, 7, 7, 1), find the maximum iou_predict_truth in every cell
        # During training, if a cell really contains an object, only the box with
        # the largest IOU is responsible for predicting it; the other boxes are
        # treated as containing no object.
        # (keep_dims is the older spelling; later TensorFlow 1.x releases use keepdims)
        object_mask = tf.reduce_max(iou_predict_truth, 3, keep_dims=True)
        # object probability, shape (45, 7, 7, 2)
        object_probs = tf.cast(
            (iou_predict_truth >= object_mask), tf.float32) * response
        # calculate no_I tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
        # no-object probability, shape (45, 7, 7, 2)
        noobject_probs = tf.ones_like(
            object_probs, dtype=tf.float32) - object_probs
        # shape (45, 7, 7, 2, 4): put the ground-truth boxes in the same form as the
        # raw predictions; x, y are relative to the top left corner of the cell,
        # w, h are square roots, all in the range 0~1
        boxes_tran = tf.stack(
            [boxes[..., 0] * self.cell_size - x_offset,
             boxes[..., 1] * self.cell_size - y_offset,
             tf.sqrt(boxes[..., 2]),
             tf.sqrt(boxes[..., 3])], axis=-1)
        # class_loss; class_delta has shape (45, 7, 7, 20)
        class_delta = response * (predict_classes - classes)
        class_loss = tf.reduce_mean(
            tf.reduce_sum(tf.square(class_delta), axis=[1, 2, 3]),
            name='class_loss') * self.class_scale
        # object_loss: the target confidence is iou * p(object),
        # where p(object) is 1 or 0
        object_delta = object_probs * (predict_confidence - iou_predict_truth)
        object_loss = tf.reduce_mean(
            tf.reduce_sum(tf.square(object_delta), axis=[1, 2, 3]),
            name='object_loss') * self.object_scale
        # noobject_loss: p(object) is 0, so the target confidence is 0
        noobject_delta = noobject_probs * predict_confidence
        noobject_loss = tf.reduce_mean(
            tf.reduce_sum(tf.square(noobject_delta), axis=[1, 2, 3]),
            name='noobject_loss') * self.noobject_scale
        # coord_loss: only the responsible boxes contribute
        coord_mask = tf.expand_dims(object_probs, 4)
        boxes_delta = coord_mask * (predict_boxes - boxes_tran)
        coord_loss = tf.reduce_mean(
            tf.reduce_sum(tf.square(boxes_delta), axis=[1, 2, 3, 4]),
            name='coord_loss') * self.coord_scale
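The loss relies on self.calc_iou(), which is not listed in this post. As a rough idea of what such a helper computes, here is a NumPy sketch of IOU for boxes in (x_center, y_center, w, h) form, followed by the "responsible box" selection used for object_probs. The function name iou_xywh and the sample boxes are assumptions made for illustration, not the repository's implementation.

```python
import numpy as np


def iou_xywh(boxes1, boxes2):
    """IOU of boxes given as (x_center, y_center, w, h); broadcasts over leading axes."""
    boxes1 = np.asarray(boxes1, dtype=np.float64)
    boxes2 = np.asarray(boxes2, dtype=np.float64)
    # Convert centers and sizes to top-left and bottom-right corners.
    b1_min = boxes1[..., :2] - boxes1[..., 2:] / 2.0
    b1_max = boxes1[..., :2] + boxes1[..., 2:] / 2.0
    b2_min = boxes2[..., :2] - boxes2[..., 2:] / 2.0
    b2_max = boxes2[..., :2] + boxes2[..., 2:] / 2.0
    # Width and height of the overlap, clipped at zero for disjoint boxes.
    inter_wh = np.maximum(np.minimum(b1_max, b2_max) - np.maximum(b1_min, b2_min), 0.0)
    inter = inter_wh[..., 0] * inter_wh[..., 1]
    area1 = boxes1[..., 2] * boxes1[..., 3]
    area2 = boxes2[..., 2] * boxes2[..., 3]
    union = np.maximum(area1 + area2 - inter, 1e-10)  # avoid division by zero
    return np.clip(inter / union, 0.0, 1.0)


# Two predicted boxes in one cell and the ground-truth box for that cell.
preds = np.array([[0.6, 0.5, 0.4, 0.4],
                  [0.5, 0.5, 0.4, 0.4]])
truth = np.array([0.5, 0.5, 0.4, 0.4])
ious = iou_xywh(preds, truth)
# Same rule as object_probs: the box whose IOU equals the cell maximum is responsible.
responsible = ious >= ious.max()
```

In this example the first box is shifted by 0.1 and reaches an IOU of 0.6, while the second matches the ground truth, so only the second box is marked as responsible.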
3. Prediction
Prediction is done with predict.py:
# Requires: import numpy as np, import tensorflow as tf (TensorFlow 1.x API)
def interpret_output(self, output):
    # split the flat output vector into class scores, confidences and boxes
    class_probs = np.reshape(
        output[0:self.boundary1],
        (self.cell_size, self.cell_size, self.num_class))
    confs = np.reshape(
        output[self.boundary1:self.boundary2],
        (self.cell_size, self.cell_size, self.boxes_per_cell))
    boxes = np.reshape(
        output[self.boundary2:],
        (self.cell_size, self.cell_size, self.boxes_per_cell, 4))
    # x_offset[row, col, :] is the column index, y_offset[row, col, :] the row index
    x_offset = np.transpose(np.reshape(np.array([np.arange(self.cell_size)] * self.cell_size * self.boxes_per_cell),
                                       [self.boxes_per_cell, self.cell_size, self.cell_size]), [1, 2, 0])
    y_offset = np.transpose(x_offset, [1, 0, 2])
    # convert the x, y to the coordinates relative to the top left point of the image
    # the predictions of w, h are the square root
    # multiply the width and height of image
    boxes = tf.stack([(boxes[:, :, :, 0] + tf.constant(x_offset, dtype=tf.float32)) / self.cell_size * self.image_size,
                      (boxes[:, :, :, 1] + tf.constant(y_offset, dtype=tf.float32)) / self.cell_size * self.image_size,
                      tf.square(boxes[:, :, :, 2]) * self.image_size,
                      tf.square(boxes[:, :, :, 3]) * self.image_size], axis=3)
    # The bounding boxes are filtered in three steps:
    # Step 1: find the largest class confidence of each bounding box (7*7*2 results)
    # Step 2: filter the bounding boxes by the confidence threshold
    # Step 3: NMS
    # class confidence = box confidence * class probability, shape (7, 7, 2, 20)
    class_confs = tf.expand_dims(confs, -1) * tf.expand_dims(class_probs, 2)
    # 4 dimensions -> 2 dimensions, shape (7*7*2, 20)
    class_confs = tf.reshape(class_confs, [-1, self.num_class])
    # shape (7*7*2, 4)
    boxes = tf.reshape(boxes, [-1, 4])
    # Step 1: find each box class, only select the max confidence
    # there are 7*7*2 bounding boxes, so this gives 98 results
    class_index = tf.argmax(class_confs, axis=1)
    class_confs = tf.reduce_max(class_confs, axis=1)
    # Step 2: filter the boxes by the class confidence threshold
    filter_mask = class_confs >= self.threshold
    class_index = tf.boolean_mask(class_index, filter_mask)
    class_confs = tf.boolean_mask(class_confs, filter_mask)
    boxes = tf.boolean_mask(boxes, filter_mask)
    # Step 3: non max suppression (do not distinguish different classes)
    # One object may have several predicted boxes; NMS removes the redundant
    # ones so that each object keeps a single box.
    # box (x, y, w, h) -> nms_boxes (x1, y1, x2, y2)
    # Note: tf.image.non_max_suppression documents its boxes as [y1, x1, y2, x2];
    # swapping x and y consistently for every box leaves the IOU unchanged.
    nms_boxes = tf.stack([boxes[:, 0] - 0.5 * boxes[:, 2], boxes[:, 1] - 0.5 * boxes[:, 3],
                          boxes[:, 0] + 0.5 * boxes[:, 2], boxes[:, 1] + 0.5 * boxes[:, 3]], axis=1)
    # NMS:
    # sort class_confs in descending order, compute the IOU between the box with the
    # highest confidence and every remaining box, and discard the remaining boxes
    # whose IOU exceeds iou_threshold; repeat with the boxes that are left.
    nms_index = tf.image.non_max_suppression(nms_boxes, class_confs,
                                             max_output_size=10,
                                             iou_threshold=self.iou_threshold)
    class_index = tf.gather(class_index, nms_index)
    class_confs = tf.gather(class_confs, nms_index)
    boxes = tf.gather(boxes, nms_index)
    # tensor -> numpy, because a tensor does not support len()
    class_index = class_index.eval(session=self.sess)
    class_confs = class_confs.eval(session=self.sess)
    boxes = boxes.eval(session=self.sess)
    # each result: [class name, x, y, w, h, class confidence]
    result = []
    for i in range(len(class_index)):
        result.append([self.classes[class_index[i]],
                       boxes[i][0], boxes[i][1], boxes[i][2], boxes[i][3],
                       class_confs[i]])
    return result
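The three filtering steps can also be written in plain NumPy, which makes the greedy nature of NMS easier to see. The sketch below is an illustration under assumed values: the function names, the thresholds (0.2 and 0.5), and the tiny 2 x 2 grid with 3 classes are all made up for the example, and the NMS loop is a simple reimplementation rather than the TensorFlow op.

```python
import numpy as np


def iou_corners(box, others):
    """IOU between one (x1, y1, x2, y2) box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.maximum(x2 - x1, 0.0) * np.maximum(y2 - y1, 0.0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / np.maximum(area + areas - inter, 1e-10)


def filter_boxes(confs, class_probs, boxes,
                 threshold=0.2, iou_threshold=0.5, max_output=10):
    """confs (S, S, B), class_probs (S, S, C), boxes (S, S, B, 4) as (x, y, w, h)."""
    num_class = class_probs.shape[-1]
    # Class confidence = box confidence * class probability, flattened to (S*S*B, C).
    scores = (confs[..., None] * class_probs[:, :, None, :]).reshape(-1, num_class)
    boxes = boxes.reshape(-1, 4)
    # Step 1: best class and its confidence for every box.
    class_index = scores.argmax(axis=1)
    class_confs = scores.max(axis=1)
    # Step 2: drop boxes below the confidence threshold.
    keep = class_confs >= threshold
    class_index, class_confs, boxes = class_index[keep], class_confs[keep], boxes[keep]
    # Step 3: greedy NMS on corner-format boxes, highest confidence first.
    corners = np.stack([boxes[:, 0] - 0.5 * boxes[:, 2], boxes[:, 1] - 0.5 * boxes[:, 3],
                        boxes[:, 0] + 0.5 * boxes[:, 2], boxes[:, 1] + 0.5 * boxes[:, 3]], axis=1)
    order = np.argsort(-class_confs)
    selected = []
    while order.size > 0 and len(selected) < max_output:
        best, rest = order[0], order[1:]
        selected.append(best)
        # Keep only the boxes that do not overlap the chosen one too much.
        order = rest[iou_corners(corners[best], corners[rest]) <= iou_threshold]
    selected = np.array(selected, dtype=int)
    return class_index[selected], class_confs[selected], boxes[selected]


# Tiny example: 2 x 2 grid, 2 boxes per cell, 3 classes.
confs = np.zeros((2, 2, 2))
class_probs = np.zeros((2, 2, 3))
boxes = np.zeros((2, 2, 2, 4))
confs[0, 0] = [0.9, 0.8]
class_probs[0, 0] = [0.1, 0.8, 0.1]
boxes[0, 0, 0] = [100, 100, 80, 80]
boxes[0, 0, 1] = [105, 100, 80, 80]   # near duplicate of the box above
confs[1, 1, 0] = 0.7
class_probs[1, 1] = [0.9, 0.05, 0.05]
boxes[1, 1, 0] = [300, 300, 60, 60]
idx, scores_out, boxes_out = filter_boxes(confs, class_probs, boxes)
```

Of the eight candidate boxes, three pass the threshold; the near duplicate in cell (0, 0) is then suppressed because its IOU with the stronger box is about 0.88, leaving one box of class 1 and one box of class 0.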
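The prediction results are as follows:
Full code:
https://github.com/1273545169/object-detection/tree/master/yolo
For more details on the training process, see:
YOLO v1 TensorFlow version: classification training and detection training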