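1. Object Detection
1.1 What is object detection?
Object detection: determine the class and the location of each object in an image.
The two components of object detection:
- Classification: a class-score vector [p0, …, pn]
- Regression: a bounding box [x1, y1, x2, y2]
1.2 Code example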
# -*- coding: utf-8 -*-
import os
import time
import torch.nn as nn
import torch
import numpy as np
import torchvision.transforms as transforms
import torchvision
from PIL import Image
from matplotlib import pyplot as plt
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# classes_coco
COCO_INSTANCE_CATEGORY_NAMES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
if __name__ == "__main__":
    # path_img = os.path.join(BASE_DIR, "demo_img1.png")
    path_img = os.path.join(BASE_DIR, "demo_img2.png")
    # config
    preprocess = transforms.Compose([
        transforms.ToTensor(),
    ])
    # 1. load data & model
    input_image = Image.open(path_img).convert("RGB")
    # note: newer torchvision releases deprecate pretrained=True in favor of the weights argument
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    model.eval()
    # 2. preprocess: PIL image -> float tensor of shape [C, H, W] with values in [0, 1]
    img_chw = preprocess(input_image)
    # 3. to device
    img_chw = img_chw.to(device)
    model.to(device)
    # 4. forward: the model takes a list of [C, H, W] tensors and returns one dict per image
    input_list = [img_chw]
    with torch.no_grad():
        tic = time.time()
        print("input img tensor shape:{}".format(input_list[0].shape))
        output_list = model(input_list)
        output_dict = output_list[0]
        print("pass: {:.3f}s".format(time.time() - tic))
        for k, v in output_dict.items():
            print("key:{}, value:{}".format(k, v))
    # 5. visualization
    out_boxes = output_dict["boxes"].cpu()
    out_scores = output_dict["scores"].cpu()
    out_labels = output_dict["labels"].cpu()
    fig, ax = plt.subplots(figsize=(12, 12))
    ax.imshow(input_image, aspect='equal')
    num_boxes = out_boxes.shape[0]
    max_vis = 40   # draw at most this many boxes
    thres = 0.5    # skip boxes whose score is below this threshold
    for idx in range(0, min(num_boxes, max_vis)):
        score = out_scores[idx].item()
        bbox = out_boxes[idx].numpy()
        class_name = COCO_INSTANCE_CATEGORY_NAMES[out_labels[idx].item()]
        if score < thres:
            continue
        ax.add_patch(plt.Rectangle((bbox[0], bbox[1]), bbox[2] - bbox[0], bbox[3] - bbox[1], fill=False,
                                   edgecolor='red', linewidth=3.5))
        ax.text(bbox[0], bbox[1] - 2, '{:s} {:.3f}'.format(class_name, score), bbox=dict(facecolor='blue', alpha=0.5),
                fontsize=14, color='white')
    plt.show()
    plt.close()
# appendix
classes_pascal_voc = ['__background__',
'aeroplane', 'bicycle', 'bird', 'boat',
'bottle', 'bus', 'car', 'cat', 'chair',
'cow', 'diningtable', 'dog', 'horse',
'motorbike', 'person', 'pottedplant',
'sheep', 'sofa', 'train', 'tvmonitor']
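2. How Object Detection Is Implemented
2.1 How does a model perform object detection?
The model maps the 3-D image tensor to two tensors, where N is the number of bounding boxes and c is the number of object classes (the extra 1 is the background class):
- Classification tensor: shape [N, c+1]
- Bounding-box tensor: shape [N, 4]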
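To make the two output shapes concrete, here is a minimal NumPy sketch. The values N = 5 and c = 3 and the random numbers are made up for illustration; a real detector would compute both tensors from the image.

```python
import numpy as np

N, c = 5, 3  # hypothetical: 5 candidate boxes, 3 object classes
rng = np.random.default_rng(0)

# Classification tensor: one row of c + 1 scores per box (index 0 = background).
cls_logits = rng.normal(size=(N, c + 1))
# A softmax over the class axis turns each row into probabilities [p0, ..., pc].
exp = np.exp(cls_logits - cls_logits.max(axis=1, keepdims=True))
cls_probs = exp / exp.sum(axis=1, keepdims=True)

# Bounding-box tensor: one [x1, y1, x2, y2] row per box (placeholder values).
top_left = rng.uniform(0, 50, size=(N, 2))
size = rng.uniform(10, 50, size=(N, 2))
boxes = np.concatenate([top_left, top_left + size], axis=1)

labels = cls_probs.argmax(axis=1)  # predicted class index for each box
scores = cls_probs.max(axis=1)     # probability of that class

print(cls_probs.shape, boxes.shape)
```

Row i of the classification tensor and row i of the box tensor describe the same candidate box, which is why both tensors share the leading dimension N.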
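2.2 How is the number of bounding boxes N determined?
Traditional approach: the sliding-window strategy
Drawbacks:
- A large amount of repeated computation
- The window size is hard to choose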
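A rough way to see both drawbacks is to count how many windows a naive sliding-window detector has to classify. The sketch below is only an illustration; the image size, the three window sizes, and the stride are hypothetical choices.

```python
def count_windows(img_h, img_w, win, stride):
    """Number of win x win windows visited when sliding over an img_h x img_w image."""
    if win > img_h or win > img_w:
        return 0
    n_rows = (img_h - win) // stride + 1
    n_cols = (img_w - win) // stride + 1
    return n_rows * n_cols

# Hypothetical setting: a 640 x 480 image, three window sizes, a stride of 8 pixels.
total = sum(count_windows(480, 640, win, stride=8) for win in (64, 128, 256))
print(total)
```

Each of these windows would need its own classifier pass, and neighbouring windows overlap almost entirely, so most of the pixel-level computation is repeated.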
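Using convolution to reduce the repeated computation:
Explanation:
In the figure, the top-left vector of the final 2x2x4 output of the lower network is the classification output that the convolutional network produces for the 14x14 window in the top-left corner of the original image. Likewise, the other three vectors are the network outputs for the other three windows of the original image.
So once the FC layers of the upper network are replaced by the convolutional layers of the lower network, a single forward pass yields the classification output for every window of the original image. The sliding-window strategy is thereby implemented by the convolution itself.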
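The equivalence can be checked numerically. The sketch below assumes a small classifier for 14x14 windows; the channel counts (16 and 400) are an assumption, since the text only fixes the 14x14 window and the 2x2x4 output. Running the network once on a 16x16 image should give the same scores as running it separately on each 14x14 crop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Classifier for 14x14 windows whose FC layers are written as convolutions.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),     # 14x14 -> 10x10
    nn.ReLU(),
    nn.MaxPool2d(2),                     # 10x10 -> 5x5
    nn.Conv2d(16, 400, kernel_size=5),   # stands in for the first FC layer: 5x5 -> 1x1
    nn.ReLU(),
    nn.Conv2d(400, 400, kernel_size=1),  # stands in for the second FC layer
    nn.ReLU(),
    nn.Conv2d(400, 4, kernel_size=1),    # 4 class scores
).eval()

image = torch.rand(1, 3, 16, 16)
with torch.no_grad():
    full = net(image)                            # one pass over the whole image
    top_left = net(image[:, :, 0:14, 0:14])      # explicit top-left window
    bottom_right = net(image[:, :, 2:16, 2:16])  # explicit bottom-right window

print(full.shape)  # torch.Size([1, 4, 2, 2])
print(torch.allclose(full[:, :, 0, 0], top_left[:, :, 0, 0], atol=1e-5))
print(torch.allclose(full[:, :, 1, 1], bottom_right[:, :, 0, 0], atol=1e-5))
```

The windows are two pixels apart because the only strided layer is the 2x2 pooling; the shared intermediate activations are computed once instead of once per window.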
Key concept:
One pixel of the feature map corresponds to a region of the original image.
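How large that region is can be computed layer by layer. The sketch below uses the standard receptive-field recurrence for layers without padding or dilation; the layer stack (5x5 conv, 2x2 max pool with stride 2, 5x5 conv, two 1x1 convs) is a hypothetical example chosen to match a 14x14 window.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output.

    Returns (rf, jump): the side length of the input region seen by one output
    pixel, and the distance in input pixels between neighbouring output pixels.
    Assumes no padding and no dilation.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the region by (k - 1) steps
        jump *= stride             # strides multiply the step between outputs
    return rf, jump

# Hypothetical stack: conv5, maxpool2 (stride 2), conv5, conv1, conv1.
layers = [(5, 1), (2, 2), (5, 1), (1, 1), (1, 1)]
rf, jump = receptive_field(layers)
print(rf, jump)  # 14 2
```

Under these assumptions every output pixel sees a 14x14 patch of the input, and adjacent output pixels see patches that are shifted by 2 input pixels.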
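3. Overview of Deep Learning Object Detection Models
3.1 Deep learning object detection models
Overview of object detection models:
Survey of object detection: "Object Detection in 20 Years: A Survey"
3.2 One-stage and two-stage
One-stage vs. two-stage detectors:
One-stage:
The input image passes through the network, which directly outputs the class and location information.
Two-stage:
The image first passes through a proposal generation layer, which produces the locations of candidate boxes. The candidates then go through an ROI pooling layer and further convolution to produce the final class and location information.
Note:
Proposal generation outputs the locations of candidate boxes, not a feature map. The number of candidate boxes typically defaults to 2000.
3.3 Pipelines of classic object detection models
3.3.1 One-stage: YOLO
Model architecture:
Overall pipeline:
The input is a 3-D tensor. Convolution turns it into a feature vector, which is reshaped into a feature map and divided into an n x n grid (each grid cell corresponds to a region of the original image). Location regression and classification are then performed for every grid cell.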
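The link between grid cells and image regions can be sketched as follows. In YOLO the cell that contains the center of an object is the one responsible for predicting it; the helper name `responsible_cell` and the example numbers (a 448x448 image with a 7x7 grid) are illustrative assumptions, not taken from the text above.

```python
def responsible_cell(box, img_w, img_h, n):
    """Return (row, col) of the n x n grid cell that contains the box center.

    box is (x1, y1, x2, y2) in pixel coordinates of the original image.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Scale the center to grid units; clamp so a center on the far edge stays inside.
    col = min(int(cx / img_w * n), n - 1)
    row = min(int(cy / img_h * n), n - 1)
    return row, col

print(responsible_cell((100, 200, 300, 400), img_w=448, img_h=448, n=7))  # (4, 3)
```

Each cell covers img_w / n by img_h / n pixels, so the prediction made for cell (4, 3) describes an object centered in that block of the original image.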
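3.3.2 Two-stage: Faster RCNN
Model architecture:
Overall pipeline:
The input is a 3-D tensor. The backbone network turns it into a feature map. The RPN layer classifies the anchor boxes on the feature map as foreground or background and proposes candidate boxes for the foreground regions. The proposed boxes are sorted by classification score in descending order and then filtered by non-maximum suppression (NMS), which leaves 2000 candidate boxes.
The adaptive pooling layer in the ROI layer pools the differently sized candidate regions to a common size. The pooled features then pass through fully connected layers for the final location regression and classification.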
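The sort-then-suppress step applied to the proposals can be sketched in a few lines of NumPy. This is a plain greedy NMS written for illustration, not the implementation used inside any particular library; the default IoU threshold of 0.7 and the toy boxes are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression.

    boxes: [N, 4] array of (x1, y1, x2, y2); scores: [N] array.
    Returns the indices of the kept boxes, highest score first.
    """
    order = np.argsort(-scores)  # indices sorted by score, descending
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop every remaining box that overlaps the kept box too much.
        order = rest[iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores, iou_thresh=0.5))  # [0, 2]
```

The first two boxes overlap with an IoU of about 0.68, so with a threshold of 0.5 the lower-scoring one is suppressed, while the third box does not overlap anything and is kept.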
4. Training Faster RCNN in PyTorch
4.1 Faster RCNN code structure
- torchvision.models.detection.fasterrcnn_resnet50_fpn() returns a FasterRCNN instance
- class FasterRCNN(GeneralizedRCNN)
- class GeneralizedRCNN(nn.Module)
Inheritance: FasterRCNN inherits from GeneralizedRCNN, which inherits from nn.Module.
Main steps of the forward function, which FasterRCNN inherits from GeneralizedRCNN:
forward():
- features = self.backbone(images.tensors)
- proposals, proposal_losses = self.rpn(images, features, targets)
- detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
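Part 1: the backbone extracts features from the input images and produces the feature maps.
Part 2: two main modules, rpn + NMS
rpn: maps the input feature maps to classification outputs and box-regression outputs; implemented by self.head()
NMS: filter_proposals() starts from the hundreds of thousands of anchor-based proposals, keeps the top-scoring ones on each feature level (num_anchors_per_level is the number of anchors on each level), applies NMS, and keeps at most 2000 proposals
Part 3: roi_heads (the ROI pooling layer and everything after it)
select_training_samples(): further samples 512 of the 2000 proposals
box_roi_pool(): crops the region of each proposal out of the feature maps and pools the crops to feature maps of the same size
box_head(): passes the pooled feature maps through two FC layers to obtain feature vectors (stored in box_features)
box_predictor(): passes the feature vectors through two parallel FC layers to obtain two outputs (class scores and box regression)
This finally yields the output detections (class, location information, and loss values); the locations are then mapped back to the original image size to give the final output.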
Main components of Faster RCNN:
- backbone
- rpn
- filter_proposals(NMS)
- roi_heads
Data flow of Faster RCNN:
- Feature map: [256, h_f, w_f]
- 2 Softmax: [num_anchors, h_f, w_f]
- Regressors: [num_anchors*4, h_f, w_f]
- NMS OUT: [n_proposals=2000, 4]
- ROI Layer: [512, 256, 7, 7], 512 proposals sampled from the 2000
- FC1 FC2: [512, 1024]
- c+1 Softmax: [512, c+1]
- Regressors: [512, (c+1)*4]
4.2 Faster RCNN: pedestrian detection
Data: PennFudanPed dataset, 170 pedestrian images with 345 labeled pedestrians in total
Official site: http://www.cis.upenn.edu/~jshi/ped_html/
Model: fine-tune fasterrcnn_resnet50_fpn
# -*- coding: utf-8 -*-
"""
# @file name : fasterrcnn_train.py
# @author : TingsongYu https://github.com/TingsongYu
# @date : 2019-11-30
# @brief : train Faster RCNN
"""
import os
import time
import torch.nn as nn
import torch
import random
import numpy as np
import torchvision.transforms as transforms
import torchvision
from PIL import Image
from tools.my_dataset import PennFudanDataset
from tools.common_tools import set_seed
from torch.utils.data import DataLoader
from matplotlib import pyplot as plt
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.transforms import functional as F
set_seed(1)  # set the random seed
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# classes_coco
COCO_INSTANCE_CATEGORY_NAMES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
def vis_bbox(img, output_dict, classes, max_vis=40, prob_thres=0.4):
    """Draw the predicted boxes whose score is at least prob_thres on top of img."""
    fig, ax = plt.subplots(figsize=(12, 12))
    ax.imshow(img, aspect='equal')
    out_boxes = output_dict["boxes"].cpu()
    out_scores = output_dict["scores"].cpu()
    out_labels = output_dict["labels"].cpu()
    num_boxes = out_boxes.shape[0]
    for idx in range(0, min(num_boxes, max_vis)):
        score = out_scores[idx].item()
        bbox = out_boxes[idx].numpy()
        class_name = classes[out_labels[idx].item()]
        if score < prob_thres:
            continue
        ax.add_patch(plt.Rectangle((bbox[0], bbox[1]), bbox[2] - bbox[0], bbox[3] - bbox[1], fill=False,
                                   edgecolor='red', linewidth=3.5))
        ax.text(bbox[0], bbox[1] - 2, '{:s} {:.3f}'.format(class_name, score), bbox=dict(facecolor='blue', alpha=0.5),
                fontsize=14, color='white')
    plt.show()
    plt.close()
class Compose(object):
    """Chain transforms that take and return an (image, target) pair."""
    def __init__(self, transforms):
        self.transforms = transforms
    def __call__(self, image, target):
        for t in self.transforms:
            image, target = t(image, target)
        return image, target
class RandomHorizontalFlip(object):
    """Flip the image tensor and its boxes horizontally with probability prob."""
    def __init__(self, prob):
        self.prob = prob
    def __call__(self, image, target):
        if random.random() < self.prob:
            height, width = image.shape[-2:]
            image = image.flip(-1)
            bbox = target["boxes"]
            # mirror the x coordinates; x1 and x2 swap roles so that x1 < x2 still holds
            bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
            target["boxes"] = bbox
        return image, target
class ToTensor(object):
    """Convert the PIL image to a tensor; the target is passed through unchanged."""
    def __call__(self, image, target):
        image = F.to_tensor(image)
        return image, target
if __name__ == "__main__":
    # config
    LR = 0.001
    num_classes = 2  # background + pedestrian
    batch_size = 1
    start_epoch, max_epoch = 0, 30
    train_dir = os.path.join(BASE_DIR, "..", "..", "data", "PennFudanPed")
    train_transform = Compose([ToTensor(), RandomHorizontalFlip(0.5)])
    # step 1: data
    train_set = PennFudanDataset(data_dir=train_dir, transforms=train_transform)
    # collate function: turns [(image, target), ...] into ((image, ...), (target, ...))
    def collate_fn(batch):
        return tuple(zip(*batch))
    train_loader = DataLoader(train_set, batch_size=batch_size, collate_fn=collate_fn)
    # step 2: model
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)  # replace the pre-trained head with a new one
    model.to(device)
    # step 3: loss
    # in lib/python3.6/site-packages/torchvision/models/detection/roi_heads.py
    # def fastrcnn_loss(class_logits, box_regression, labels, regression_targets)
    # step 4: optimizer scheduler
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=LR, momentum=0.9, weight_decay=0.0005)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    # step 5: Iteration
    for epoch in range(start_epoch, max_epoch):
        model.train()
        for iteration, (images, targets) in enumerate(train_loader):
            images = list(image.to(device) for image in images)
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            # in training mode the model returns a dict of losses
            loss_dict = model(images, targets)  # images is list; targets is [ dict["boxes":**, "labels":**], dict[] ]
            losses = sum(loss for loss in loss_dict.values())
            print("Training:Epoch[{:0>3}/{:0>3}] Iteration[{:0>3}/{:0>3}] Loss: {:.4f} ".format(
                epoch, max_epoch, iteration + 1, len(train_loader), losses.item()))
            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
        lr_scheduler.step()  # update the learning rate once per epoch
    # test
    model.eval()
    # config
    vis_num = 5
    vis_dir = os.path.join(BASE_DIR, "..", "..", "data", "PennFudanPed", "PNGImages")
    img_names = list(filter(lambda x: x.endswith(".png"), os.listdir(vis_dir)))
    random.shuffle(img_names)
    preprocess = transforms.Compose([transforms.ToTensor(), ])
    for i in range(0, vis_num):
        path_img = os.path.join(vis_dir, img_names[i])
        # preprocess
        input_image = Image.open(path_img).convert("RGB")
        img_chw = preprocess(input_image)
        # to device (the model was already moved to the device in step 2)
        img_chw = img_chw.to(device)
        # forward
        input_list = [img_chw]
        with torch.no_grad():
            tic = time.time()
            print("input img tensor shape:{}".format(input_list[0].shape))
            output_list = model(input_list)
            output_dict = output_list[0]
            print("pass: {:.3f}s".format(time.time() - tic))
        # visualization
        # with num_classes = 2 the only foreground label is 1, which is 'person' in the COCO name list
        vis_bbox(input_image, output_dict, COCO_INSTANCE_CATEGORY_NAMES, max_vis=20, prob_thres=0.5)  # for 2 epoch for nms
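The box update inside RandomHorizontalFlip is easy to get wrong, so here is a small stand-alone check of the same indexing expression. It uses NumPy arrays in place of tensors, and the helper name `hflip_boxes` and the example box are made up for illustration.

```python
import numpy as np

def hflip_boxes(boxes, width):
    """Mirror [N, 4] boxes (x1, y1, x2, y2) horizontally inside an image of the given width."""
    flipped = boxes.copy()
    # The new x1 comes from the old x2 and vice versa, so x1 < x2 still holds.
    flipped[:, [0, 2]] = width - boxes[:, [2, 0]]
    return flipped

boxes = np.array([[10., 20., 50., 80.]])
flipped = hflip_boxes(boxes, width=100)
print(flipped)  # x1 = 100 - 50 = 50, x2 = 100 - 10 = 90, y values unchanged
```

Flipping twice returns the original boxes, which is a convenient sanity check when writing augmentation code for detection targets.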