"YOLOv4: Optimal Speed and Accuracy of Object Detection" Paper Translation

The latest YOLOv4 has been out for quite a while now. Today I will read through it to see what improvements and innovations it brings compared to YOLOv3, mainly as a learning exercise. Without further ado, let's begin.


Abstract


There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-Adversarial Training (SAT) and Mish activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a real-time speed of ∼65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.


1. Introduction


The majority of CNN-based object detectors are largely applicable only for recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow accurate models, whereas car collision warning is related to fast inaccurate models. Improving the real-time object detector accuracy enables using them not only for hint generating recommendation systems, but also for stand-alone process management and human input reduction. Real-time object detector operation on conventional Graphics Processing Units (GPU) allows their mass usage at an affordable price. The most accurate modern neural networks do not operate in real time and require a large number of GPUs for training with a large mini-batch size. We address such problems through creating a CNN that operates in real time on a conventional GPU, and for which training requires only one conventional GPU.

The main goal of this work is designing a fast operating speed of an object detector in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We hope that the designed object can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high quality, and convincing object detection results, as the YOLOv4 results shown in Figure 1. Our contributions are summarized as follows:
 
1. We develop an efficient and powerful object detection model. It makes it possible for everyone to use a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.

2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods of object detection during the detector training.

3. We modify state-of-the-art methods and make them more efficient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc.


 

2. Related work

2.1. Object detection models


A modern detector is usually composed of two parts, a backbone which is pre-trained on ImageNet and a head which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part, it is usually categorized into two kinds, i.e., one-stage object detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. It is also possible to make a two-stage object detector an anchor-free object detector, such as RepPoints [87]. As for one-stage object detector, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors have been developed. The detectors of this sort are CenterNet [13], CornerNet [37, 38], FCOS [78], etc. Object detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We can call it the neck of an object detector. Usually, a neck is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17]. In addition to the above models, some researchers put their emphasis on directly building a new backbone (DetNet [43], DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection.
 

To sum up, an ordinary object detector is composed of several parts:

• Input: Image, Patches, Image Pyramid
• Backbones: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]
• Neck:
  • Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]
  • Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]
• Heads:
  • Dense Prediction (one-stage):
    • RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor based)
    • CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor free)
  • Sparse Prediction (two-stage):
    • Faster R-CNN [64], R-FCN [9], Mask R-CNN [23] (anchor based)
    • RepPoints [87] (anchor free)


2.2. Bag of freebies


Usually, a conventional object detector is trained offline. Therefore, researchers always like to take this advantage and develop better training methods which can make the object detector receive better accuracy without increasing the inference cost. We call these methods that only change the training strategy or only increase the training cost "bag of freebies." What is often adopted by object detection methods and meets the definition of bag of freebies is data augmentation. The purpose of data augmentation is to increase the variability of the input images, so that the designed object detection model has higher robustness to the images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods and they definitely benefit the object detection task. In dealing with photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.
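As a concrete illustration, a minimal pipeline along these lines could be sketched with torchvision; the value ranges below are illustrative rather than the paper's settings, and for detection the geometric transforms would also have to be applied to the bounding boxes:

```python
import torchvision.transforms as T

# Sketch of the two families of distortions described above.
augment = T.Compose([
    # photometric distortions: brightness, contrast, saturation, hue
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    # geometric distortions: random scaling + cropping, flipping, rotating
    T.RandomResizedCrop(size=608, scale=(0.5, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=10),
    T.ToTensor(),
])
```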

The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers engaged in data augmentation put their emphasis on simulating object occlusion issues. They have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] can randomly select the rectangle region in an image and fill in a random or complementary value of zero. As for hide-and-seek [69] and grid mask [6], they randomly or evenly select multiple rectangle regions in an image and replace them with all zeros. If similar concepts are applied to feature maps, there are DropOut [71], DropConnect [80], and DropBlock [16] methods. In addition, some researchers have proposed the methods of using multiple images together to perform data augmentation. For example, MixUp [92] uses two images to multiply and superimpose with different coefficient ratios, and then adjusts the label with these superimposed ratios. As for CutMix [91], it is to cover the cropped image to the rectangle region of other images, and adjusts the label according to the size of the mix area. In addition to the above mentioned methods, style transfer GAN [15] is also used for data augmentation, and such usage can effectively reduce the texture bias learned by CNN.
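The label-mixing idea behind MixUp can be sketched in a few lines; this is a classification-style sketch with an illustrative alpha (for detection, CutMix-style methods instead adjust labels by the area of the pasted region):

```python
import torch

def mixup(images, labels, alpha=0.2):
    # Blend each image with a randomly chosen partner using a ratio
    # lam ~ Beta(alpha, alpha), and weight their (one-hot or soft)
    # label vectors by the same ratio.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_images, mixed_labels
```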

Different from the various approaches proposed above, some other bag of freebies methods are dedicated to solving the problem that the semantic distribution in the dataset may have bias. In dealing with the problem of semantic distribution bias, a very important issue is that there is a problem of data imbalance between different classes, and this problem is often solved by hard negative example mining [72] or online hard example mining [67] in two-stage object detectors. But the example mining method is not applicable to one-stage object detectors, because this kind of detector belongs to the dense prediction architecture. Therefore Lin et al. [45] proposed focal loss to deal with the problem of data imbalance existing between various classes. Another very important issue is that it is difficult to express the relationship of the degree of association between different categories with the one-hot hard representation. This representation scheme is often used when executing labeling. The label smoothing proposed in [73] is to convert hard labels into soft labels for training, which can make the model more robust. In order to obtain a better soft label, Islam et al. [33] introduced the concept of knowledge distillation to design the label refinement network.
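To make the two ideas concrete, here is a minimal sketch of label smoothing and of the focal loss; the function names and the binary focal-loss formulation are our illustrative choices, not code from the paper:

```python
import torch
import torch.nn.functional as F

def smooth_labels(hard_labels, num_classes, eps=0.1):
    # Label smoothing [73]: the true class gets 1 - eps, and the
    # remaining probability mass eps is spread uniformly.
    one_hot = F.one_hot(hard_labels, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Focal loss [45], binary form: down-weight easy examples by
    # (1 - p_t)^gamma so the dense easy negatives do not dominate.
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```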

The last bag of freebies is the objective function of Bounding Box (BBox) regression. The traditional object detector usually uses Mean Square Error (MSE) to directly perform regression on the center point coordinates and height and width of the BBox, i.e., {x_center, y_center, w, h}, or the upper left point and the lower right point, i.e., {x_top_left, y_top_left, x_bottom_right, y_bottom_right}. As for the anchor-based method, it is to estimate the corresponding offset, for example {x_center_offset, y_center_offset, w_offset, h_offset} and {x_top_left_offset, y_top_left_offset, x_bottom_right_offset, y_bottom_right_offset}. However, to directly estimate the coordinate values of each point of the BBox is to treat these points as independent variables, which in fact does not consider the integrity of the object itself. In order to make this issue processed better, some researchers recently proposed IoU loss [90], which puts the coverage of predicted BBox area and ground truth BBox area into consideration. The IoU loss computing process will trigger the calculation of the four coordinate points of the BBox by executing IoU with the ground truth, and then connecting the generated results into a whole code. Because IoU is a scale-invariant representation, it can solve the problem that when traditional methods calculate the l1 or l2 loss of {x, y, w, h}, the loss will increase with the scale. Recently, some researchers have continued to improve IoU loss. For example, GIoU loss [65] is to include the shape and orientation of the object in addition to the coverage area. They proposed to find the smallest area BBox that can simultaneously cover the predicted BBox and ground truth BBox, and use this BBox as the denominator to replace the denominator originally used in IoU loss. As for DIoU loss [99], it additionally considers the distance of the center of an object, and CIoU loss [99], on the other hand, simultaneously considers the overlapping area, the distance between center points, and the aspect ratio. CIoU can achieve better convergence speed and accuracy on the BBox regression problem.
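Since CIoU combines the three terms just described, a compact sketch may help. It follows the published CIoU definition (IoU minus a normalized center distance minus an aspect-ratio penalty) for boxes in (x1, y1, x2, y2) form, but the implementation details here are ours:

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # Intersection and union for the IoU (overlap) term.
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Squared distance between box centers (the DIoU term) ...
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2
            + (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    # ... normalized by the diagonal of the smallest enclosing box.
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```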


2.3. Bag of specials


For those plugin modules and post-processing methods that only increase the inference cost by a small amount but can significantly improve the accuracy of object detection, we call them "bag of specials". Generally speaking, these plugin modules are for enhancing certain attributes in a model, such as enlarging receptive field, introducing attention mechanism, or strengthening feature integration capability, etc., and post-processing is a method for screening model prediction results.

Common modules that can be used to enhance receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module originated from Spatial Pyramid Matching (SPM) [39], and SPM's original method was to split the feature map into several d × d equal blocks, where d can be {1, 2, 3, ...}, thus forming a spatial pyramid, and then extracting bag-of-word features. SPP integrates SPM into CNN and uses a max-pooling operation instead of the bag-of-word operation. Since the SPP module proposed by He et al. [25] will output a one-dimensional feature vector, it is infeasible to be applied in Fully Convolutional Network (FCN). Thus in the design of YOLOv3 [63], Redmon and Farhadi improve the SPP module to the concatenation of max-pooling outputs with kernel size k × k, where k = {1, 5, 9, 13}, and stride equal to 1. Under this design, a relatively large k × k max-pooling effectively increases the receptive field of the backbone feature. After adding the improved version of the SPP module, YOLOv3-608 upgrades AP50 by 2.7% on the MS COCO object detection task at the cost of 0.5% extra computation. The difference in operation between the ASPP [5] module and the improved SPP module is mainly the change from the original k × k kernel size max-pooling of stride 1 to several 3 × 3 kernel size dilated convolutions with dilation ratio equal to k and stride equal to 1. The RFB module is to use several dilated convolutions of a k × k kernel, with dilation ratio equal to k and stride equal to 1, to obtain a more comprehensive spatial coverage than ASPP. RFB [47] only costs 7% extra inference time to increase the AP50 of SSD on MS COCO by 5.7%.
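The improved SPP block described here is easy to express in code. A minimal PyTorch sketch might look as follows (k = 1 is realized as the identity branch, so the spatial resolution is unchanged and the channel count grows by a factor of four):

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    # Sketch of the YOLOv3-style SPP block [63]: concatenate the input with
    # max-pooling outputs of kernel sizes 5, 9, 13 (all stride 1, padded).
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes)

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```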

The attention module that is often used in object detection is mainly divided into channel-wise attention and point-wise attention, and the representatives of these two attention models are Squeeze-and-Excitation (SE) [29] and Spatial Attention Module (SAM) [85], respectively. Although the SE module can improve the top-1 accuracy of ResNet50 on the ImageNet image classification task by 1% at the cost of only increasing the computational effort by 2%, on a GPU it usually increases the inference time by about 10%, so it is more appropriate to be used in mobile devices. But for SAM, it only needs to pay 0.1% extra calculation and it can improve ResNet50-SE by 0.5% top-1 accuracy on the ImageNet image classification task. Best of all, it does not affect the speed of inference on the GPU at all.
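For reference, a minimal sketch of SAM as defined in CBAM [85] (the original module, pooling over the channel axis and predicting a per-location weight map; the class name is ours):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # SAM [85]: concatenate channel-wise average and max maps, run one conv,
    # squash with sigmoid, and rescale the input point-wise.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # channel-wise average
        mx, _ = x.max(dim=1, keepdim=True)  # channel-wise maximum
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn
```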
In terms of feature integration, the early practice is to use skip connection [51] or hyper-column [22] to integrate low-level physical features to high-level semantic features. Since multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramids have been proposed. The modules of this sort include SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use the SE module to execute channel-wise level re-weighting on multi-scale concatenated feature maps. As for ASFF, it uses softmax as point-wise level re-weighting and then adds feature maps of different scales. In BiFPN, the multi-input weighted residual connections are proposed to execute scale-wise level re-weighting, and then add feature maps of different scales.
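The point-wise softmax re-weighting used by ASFF can be sketched as follows. This is a simplified sketch assuming the feature maps have already been resized to a common shape, and the function and variable names are ours:

```python
import torch

def asff_fuse(feats, logits):
    # feats: list of N feature maps, each (B, C, H, W), already resized.
    # logits: (B, N, H, W) learned per-location fusion scores.
    weights = torch.softmax(logits, dim=1)       # per-pixel weights sum to 1
    stacked = torch.stack(feats, dim=1)          # (B, N, C, H, W)
    return (weights.unsqueeze(2) * stacked).sum(dim=1)
```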

In the research of deep learning, some people put their focus on searching for good activation functions. A good activation function can make the gradient more efficiently propagated, and at the same time it will not cause too much extra computational cost. In 2010, Nair and Hinton [56] proposed ReLU to substantially solve the gradient vanishing problem which is frequently encountered in the traditional tanh and sigmoid activation functions. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], etc., which are also used to solve the gradient vanishing problem, have been proposed. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. As for ReLU6 and hard-Swish, they are specially designed for quantization networks. For self-normalizing a neural network, the SELU activation function is proposed to satisfy the goal. One thing to be noted is that both Swish and Mish are continuously differentiable activation functions.
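The two continuously differentiable activations mentioned last are one-liners; a PyTorch sketch of both:

```python
import torch
import torch.nn.functional as F

def mish(x):
    # Mish [55]: x * tanh(softplus(x)); smooth everywhere, unlike ReLU.
    return x * torch.tanh(F.softplus(x))

def swish(x):
    # Swish [59]: x * sigmoid(x).
    return x * torch.sigmoid(x)
```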

The post-processing method commonly used in deep learning-based object detection is NMS, which can be used to filter those BBoxes that badly predict the same object, and only retain the candidate BBoxes with higher response. The way NMS tries to improve is consistent with the method of optimizing an objective function. The original method proposed by NMS does not consider the context information, so Girshick et al. [19] added the classification confidence score in R-CNN as a reference, and according to the order of confidence scores, greedy NMS was performed in the order of high score to low score. As for soft NMS [1], it considers the problem that the occlusion of an object may cause the degradation of confidence score in greedy NMS with IoU score. The DIoU NMS [99] developers' way of thinking is to add the information of the center point distance to the BBox screening process on the basis of soft NMS. It is worth mentioning that, since none of the above post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of anchor-free methods.
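A minimal sketch of greedy NMS, with an optional flag that subtracts the normalized center-distance penalty in the spirit of DIoU-NMS [99] (the flag and function name are ours):

```python
import torch

def greedy_nms(boxes, scores, iou_thresh=0.5, use_diou=False):
    # boxes: (N, 4) in (x1, y1, x2, y2) form; scores: (N,).
    # Keep the highest-scoring box, suppress boxes whose (D)IoU with it
    # exceeds the threshold, and repeat on the remainder.
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        x1 = torch.max(boxes[i, 0], boxes[rest, 0])
        y1 = torch.max(boxes[i, 1], boxes[rest, 1])
        x2 = torch.min(boxes[i, 2], boxes[rest, 2])
        y2 = torch.min(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-7)
        if use_diou:
            # DIoU-NMS: subtract the normalized center distance from IoU.
            cx = (boxes[i, 0] + boxes[i, 2] - boxes[rest, 0] - boxes[rest, 2]) / 2
            cy = (boxes[i, 1] + boxes[i, 3] - boxes[rest, 1] - boxes[rest, 3]) / 2
            cw = torch.max(boxes[i, 2], boxes[rest, 2]) - torch.min(boxes[i, 0], boxes[rest, 0])
            ch = torch.max(boxes[i, 3], boxes[rest, 3]) - torch.min(boxes[i, 1], boxes[rest, 1])
            iou = iou - (cx ** 2 + cy ** 2) / (cw ** 2 + ch ** 2 + 1e-7)
        order = rest[iou <= iou_thresh]
    return keep
```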


3. Methodology


The basic aim is fast operating speed of a neural network in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We present two options of real-time neural networks:

• For GPU we use a small number of groups (1 - 8) in convolutional layers: CSPResNeXt50 / CSPDarknet53
• For VPU we use grouped convolutions, but we refrain from using Squeeze-and-Excitation (SE) blocks; specifically this includes the following models: EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3
 

3.1. Selection of architecture


Our objective is to find the optimal balance among the input network resolution, the convolutional layer number, the parameter number (filter_size² × filters × channels / groups), and the number of layer outputs (filters). For instance, our numerous studies demonstrate that the CSPResNeXt50 is considerably better compared to CSPDarknet53 in terms of object classification on the ILSVRC2012 (ImageNet) dataset [10]. However, conversely, the CSPDarknet53 is better compared to CSPResNeXt50 in terms of detecting objects on the MS COCO dataset [46].

The next objective is to select additional blocks for increasing the receptive field and the best method of parameter aggregation from different backbone levels for different detector levels: e.g. FPN, PAN, ASFF, BiFPN.

A reference model which is optimal for classification is not always optimal for a detector. In contrast to the classifier, the detector requires the following:

• Higher input network size (resolution) - for detecting multiple small-sized objects
• More layers - for a higher receptive field to cover the increased size of the input network
• More parameters - for greater capacity of a model to detect multiple objects of different sizes in a single image

Hypothetically speaking, we can assume that a model with a larger receptive field size (with a larger number of convolutional layers 3 × 3) and a larger number of parameters should be selected as the backbone. Table 1 shows the information of CSPResNeXt50, CSPDarknet53, and EfficientNet B3. The CSPResNeXt50 contains only 16 convolutional layers 3 × 3, a 425 × 425 receptive field and 20.6 M parameters, while CSPDarknet53 contains 29 convolutional layers 3 × 3, a 725 × 725 receptive field and 27.6 M parameters. This theoretical justification, together with our numerous experiments, shows that the CSPDarknet53 neural network is the optimal model of the two as the backbone for a detector.

The influence of receptive fields with different sizes is summarized as follows:

• Up to the object size - allows viewing the entire object
• Up to the network size - allows viewing the context around the object
• Exceeding the network size - increases the number of connections between the image point and the final activation

We add the SPP block over the CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features and causes almost no reduction of the network operation speed. We use PANet as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3.

Finally, we choose the CSPDarknet53 backbone, SPP additional module, PANet path-aggregation neck, and YOLOv3 (anchor based) head as the architecture of YOLOv4.

In the future we plan to expand significantly the content of Bag of Freebies (BoF) for the detector, which theoretically can address some problems and increase the detector accuracy, and sequentially check the influence of each feature in an experimental fashion.

We do not use Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This allows anyone to reproduce our state-of-the-art outcomes on a conventional graphics processor, e.g. GTX 1080 Ti or RTX 2080 Ti.
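As a small aid to reading the parameter formula above, this hypothetical helper computes the per-layer weight count the text refers to (biases and batch-norm parameters are ignored; the function name is ours, not from the paper):

```python
def conv_params(filter_size: int, filters: int, channels: int, groups: int = 1) -> int:
    # Weights of one conv layer: filter_size^2 * filters * channels / groups.
    return filter_size ** 2 * filters * channels // groups

# e.g. a 3x3 layer with 512 filters over 256 input channels:
# conv_params(3, 512, 256) -> 1179648 weights
```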
 

 
