"YOLOv4: Optimal Speed and Accuracy of Object Detection" Paper Translation

The latest YOLOv4 has been out for quite a while now. Today I will read through it to see what improvements and innovations it brings compared to YOLOv3, mainly as a learning exercise. Without further ado, let's begin.


Abstract


There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-Adversarial Training (SAT) and Mish activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a real-time speed of ∼65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.


1. Introduction


The majority of CNN-based object detectors are largely applicable only for recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow accurate models, whereas car collision warning is related to fast inaccurate models. Improving the real-time object detector accuracy enables using them not only for hint generating recommendation systems, but also for stand-alone process management and human input reduction. Real-time object detector operation on conventional Graphics Processing Units (GPU) allows their mass usage at an affordable price. The most accurate modern neural networks do not operate in real time and require a large number of GPUs for training with a large mini-batch size. We address such problems through creating a CNN that operates in real time on a conventional GPU, and for which training requires only one conventional GPU.

The main goal of this work is designing a fast operating speed of an object detector in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We hope that the designed object can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high quality, and convincing object detection results, as the YOLOv4 results shown in Figure 1. Our contributions are summarized as follows:
 
1. We develop an efficient and powerful object detection model. It makes it possible for everyone to use a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.

2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods of object detection during the detector training.

3. We modify state-of-the-art methods and make them more efficient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc.


 

2. Related work

2.1. Object detection models


A modern detector is usually composed of two parts, a backbone which is pre-trained on ImageNet and a head which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part, it is usually categorized into two kinds, i.e., one-stage object detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. It is also possible to make a two-stage object detector an anchor-free object detector, such as RepPoints [87]. As for one-stage object detector, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors have been developed. The detectors of this sort are CenterNet [13], CornerNet [37, 38], FCOS [78], etc. Object detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We can call it the neck of an object detector. Usually, a neck is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17]. In addition to the above models, some researchers put their emphasis on directly building a new backbone (DetNet [43], DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection.
 

To sum up, an ordinary object detector is composed of several parts:

• Input: Image, Patches, Image Pyramid
• Backbones: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]
• Neck:
  • Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]
  • Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]
• Heads:
  • Dense Prediction (one-stage):
    • RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor based)
    • CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor free)
  • Sparse Prediction (two-stage):
    • Faster R-CNN [64], R-FCN [9], Mask R-CNN [23] (anchor based)
    • RepPoints [87] (anchor free)


2.2. Bag of freebies


Usually, a conventional object detector is trained offline. Therefore, researchers always like to take this advantage and develop better training methods which can make the object detector receive better accuracy without increasing the inference cost. We call these methods that only change the training strategy or only increase the training cost "bag of freebies." What is often adopted by object detection methods and meets the definition of bag of freebies is data augmentation. The purpose of data augmentation is to increase the variability of the input images, so that the designed object detection model has higher robustness to the images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods and they definitely benefit the object detection task. In dealing with photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.
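As a concrete illustration, a minimal pipeline along these lines could be sketched with torchvision; the value ranges below are illustrative rather than the paper's settings, and for detection the geometric transforms would also have to be applied to the bounding boxes:

```python
import torchvision.transforms as T

# Sketch of the two families of distortions described above.
augment = T.Compose([
    # photometric distortions: brightness, contrast, saturation, hue
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    # geometric distortions: random scaling + cropping, flipping, rotating
    T.RandomResizedCrop(size=608, scale=(0.5, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=10),
    T.ToTensor(),
])
```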

The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers engaged in data augmentation put their emphasis on simulating object occlusion issues. They have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] can randomly select the rectangle region in an image and fill in a random or complementary value of zero. As for hide-and-seek [69] and grid mask [6], they randomly or evenly select multiple rectangle regions in an image and replace them with all zeros. If similar concepts are applied to feature maps, there are DropOut [71], DropConnect [80], and DropBlock [16] methods. In addition, some researchers have proposed the methods of using multiple images together to perform data augmentation. For example, MixUp [92] uses two images to multiply and superimpose with different coefficient ratios, and then adjusts the label with these superimposed ratios. As for CutMix [91], it is to cover the cropped image to the rectangle region of other images, and adjusts the label according to the size of the mix area. In addition to the above mentioned methods, style transfer GAN [15] is also used for data augmentation, and such usage can effectively reduce the texture bias learned by CNN.
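The label-mixing idea behind MixUp can be sketched in a few lines; this is a classification-style sketch with an illustrative alpha (for detection, CutMix-style methods instead adjust labels by the area of the pasted region):

```python
import torch

def mixup(images, labels, alpha=0.2):
    # Blend each image with a randomly chosen partner using a ratio
    # lam ~ Beta(alpha, alpha), and weight their (one-hot or soft)
    # label vectors by the same ratio.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_images, mixed_labels
```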

Different from the various approaches proposed above, some other bag of freebies methods are dedicated to solving the problem that the semantic distribution in the dataset may have bias. In dealing with the problem of semantic distribution bias, a very important issue is that there is a problem of data imbalance between different classes, and this problem is often solved by hard negative example mining [72] or online hard example mining [67] in two-stage object detectors. But the example mining method is not applicable to one-stage object detectors, because this kind of detector belongs to the dense prediction architecture. Therefore Lin et al. [45] proposed focal loss to deal with the problem of data imbalance existing between various classes. Another very important issue is that it is difficult to express the relationship of the degree of association between different categories with the one-hot hard representation. This representation scheme is often used when executing labeling. The label smoothing proposed in [73] is to convert hard labels into soft labels for training, which can make the model more robust. In order to obtain a better soft label, Islam et al. [33] introduced the concept of knowledge distillation to design the label refinement network.
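To make the two ideas concrete, here is a minimal sketch of label smoothing and of the focal loss; the function names and the binary focal-loss formulation are our illustrative choices, not code from the paper:

```python
import torch
import torch.nn.functional as F

def smooth_labels(hard_labels, num_classes, eps=0.1):
    # Label smoothing [73]: the true class gets 1 - eps, and the
    # remaining probability mass eps is spread uniformly.
    one_hot = F.one_hot(hard_labels, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Focal loss [45], binary form: down-weight easy examples by
    # (1 - p_t)^gamma so the dense easy negatives do not dominate.
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```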

The last bag of freebies is the objective function of Bounding Box (BBox) regression. The traditional object detector usually uses Mean Square Error (MSE) to directly perform regression on the center point coordinates and height and width of the BBox, i.e., {x_center, y_center, w, h}, or the upper left point and the lower right point, i.e., {x_top_left, y_top_left, x_bottom_right, y_bottom_right}. As for the anchor-based method, it is to estimate the corresponding offset, for example {x_center_offset, y_center_offset, w_offset, h_offset} and {x_top_left_offset, y_top_left_offset, x_bottom_right_offset, y_bottom_right_offset}. However, to directly estimate the coordinate values of each point of the BBox is to treat these points as independent variables, which in fact does not consider the integrity of the object itself. In order to make this issue processed better, some researchers recently proposed IoU loss [90], which puts the coverage of predicted BBox area and ground truth BBox area into consideration. The IoU loss computing process will trigger the calculation of the four coordinate points of the BBox by executing IoU with the ground truth, and then connecting the generated results into a whole code. Because IoU is a scale-invariant representation, it can solve the problem that when traditional methods calculate the l1 or l2 loss of {x, y, w, h}, the loss will increase with the scale. Recently, some researchers have continued to improve IoU loss. For example, GIoU loss [65] is to include the shape and orientation of the object in addition to the coverage area. They proposed to find the smallest area BBox that can simultaneously cover the predicted BBox and ground truth BBox, and use this BBox as the denominator to replace the denominator originally used in IoU loss. As for DIoU loss [99], it additionally considers the distance of the center of an object, and CIoU loss [99], on the other hand, simultaneously considers the overlapping area, the distance between center points, and the aspect ratio. CIoU can achieve better convergence speed and accuracy on the BBox regression problem.
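Since CIoU combines the three terms just described, a compact sketch may help. It follows the published CIoU definition (IoU minus a normalized center distance minus an aspect-ratio penalty) for boxes in (x1, y1, x2, y2) form, but the implementation details here are ours:

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # Intersection and union for the IoU (overlap) term.
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Squared distance between box centers (the DIoU term) ...
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2
            + (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    # ... normalized by the diagonal of the smallest enclosing box.
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```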


2.3. Bag of specials


For those plugin modules and post-processing methods that only increase the inference cost by a small amount but can significantly improve the accuracy of object detection, we call them "bag of specials". Generally speaking, these plugin modules are for enhancing certain attributes in a model, such as enlarging receptive field, introducing attention mechanism, or strengthening feature integration capability, etc., and post-processing is a method for screening model prediction results.

Common modules that can be used to enhance receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module originated from Spatial Pyramid Matching (SPM) [39], and SPM's original method was to split the feature map into several d × d equal blocks, where d can be {1, 2, 3, ...}, thus forming a spatial pyramid, and then extracting bag-of-word features. SPP integrates SPM into CNN and uses a max-pooling operation instead of the bag-of-word operation. Since the SPP module proposed by He et al. [25] will output a one-dimensional feature vector, it is infeasible to be applied in Fully Convolutional Network (FCN). Thus in the design of YOLOv3 [63], Redmon and Farhadi improve the SPP module to the concatenation of max-pooling outputs with kernel size k × k, where k = {1, 5, 9, 13}, and stride equal to 1. Under this design, a relatively large k × k max-pooling effectively increases the receptive field of the backbone feature. After adding the improved version of the SPP module, YOLOv3-608 upgrades AP50 by 2.7% on the MS COCO object detection task at the cost of 0.5% extra computation. The difference in operation between the ASPP [5] module and the improved SPP module is mainly the change from the original k × k kernel size max-pooling of stride 1 to several 3 × 3 kernel size dilated convolutions with dilation ratio equal to k and stride equal to 1. The RFB module is to use several dilated convolutions of a k × k kernel, with dilation ratio equal to k and stride equal to 1, to obtain a more comprehensive spatial coverage than ASPP. RFB [47] only costs 7% extra inference time to increase the AP50 of SSD on MS COCO by 5.7%.
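The improved SPP block described here is easy to express in code. A minimal PyTorch sketch might look as follows (k = 1 is realized as the identity branch, so the spatial resolution is unchanged and the channel count grows by a factor of four):

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    # Sketch of the YOLOv3-style SPP block [63]: concatenate the input with
    # max-pooling outputs of kernel sizes 5, 9, 13 (all stride 1, padded).
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes)

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```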

The attention module that is often used in object detection is mainly divided into channel-wise attention and point-wise attention, and the representatives of these two attention models are Squeeze-and-Excitation (SE) [29] and Spatial Attention Module (SAM) [85], respectively. Although the SE module can improve the top-1 accuracy of ResNet50 on the ImageNet image classification task by 1% at the cost of only increasing the computational effort by 2%, on a GPU it usually increases the inference time by about 10%, so it is more appropriate to be used in mobile devices. But for SAM, it only needs to pay 0.1% extra calculation and it can improve ResNet50-SE by 0.5% top-1 accuracy on the ImageNet image classification task. Best of all, it does not affect the speed of inference on the GPU at all.
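For reference, a minimal sketch of SAM as defined in CBAM [85] (the original module, pooling over the channel axis and predicting a per-location weight map; the class name is ours):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # SAM [85]: concatenate channel-wise average and max maps, run one conv,
    # squash with sigmoid, and rescale the input point-wise.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # channel-wise average
        mx, _ = x.max(dim=1, keepdim=True)  # channel-wise maximum
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn
```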
In terms of feature integration, the early practice is to use skip connection [51] or hyper-column [22] to integrate low-level physical features to high-level semantic features. Since multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramids have been proposed. The modules of this sort include SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use the SE module to execute channel-wise level re-weighting on multi-scale concatenated feature maps. As for ASFF, it uses softmax as point-wise level re-weighting and then adds feature maps of different scales. In BiFPN, the multi-input weighted residual connections are proposed to execute scale-wise level re-weighting, and then add feature maps of different scales.
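The point-wise softmax re-weighting used by ASFF can be sketched as follows. This is a simplified sketch assuming the feature maps have already been resized to a common shape, and the function and variable names are ours:

```python
import torch

def asff_fuse(feats, logits):
    # feats: list of N feature maps, each (B, C, H, W), already resized.
    # logits: (B, N, H, W) learned per-location fusion scores.
    weights = torch.softmax(logits, dim=1)       # per-pixel weights sum to 1
    stacked = torch.stack(feats, dim=1)          # (B, N, C, H, W)
    return (weights.unsqueeze(2) * stacked).sum(dim=1)
```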

In the research of deep learning, some people put their focus on searching for good activation functions. A good activation function can make the gradient more efficiently propagated, and at the same time it will not cause too much extra computational cost. In 2010, Nair and Hinton [56] proposed ReLU to substantially solve the gradient vanishing problem which is frequently encountered in the traditional tanh and sigmoid activation functions. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], etc., which are also used to solve the gradient vanishing problem, have been proposed. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. As for ReLU6 and hard-Swish, they are specially designed for quantization networks. For self-normalizing a neural network, the SELU activation function is proposed to satisfy the goal. One thing to be noted is that both Swish and Mish are continuously differentiable activation functions.
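The two continuously differentiable activations mentioned last are one-liners; a PyTorch sketch of both:

```python
import torch
import torch.nn.functional as F

def mish(x):
    # Mish [55]: x * tanh(softplus(x)); smooth everywhere, unlike ReLU.
    return x * torch.tanh(F.softplus(x))

def swish(x):
    # Swish [59]: x * sigmoid(x).
    return x * torch.sigmoid(x)
```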

The post-processing method commonly used in deep learning-based object detection is NMS, which can be used to filter those BBoxes that badly predict the same object, and only retain the candidate BBoxes with higher response. The way NMS tries to improve is consistent with the method of optimizing an objective function. The original method proposed by NMS does not consider the context information, so Girshick et al. [19] added the classification confidence score in R-CNN as a reference, and according to the order of confidence scores, greedy NMS was performed in the order of high score to low score. As for soft NMS [1], it considers the problem that the occlusion of an object may cause the degradation of confidence score in greedy NMS with IoU score. The DIoU NMS [99] developers' way of thinking is to add the information of the center point distance to the BBox screening process on the basis of soft NMS. It is worth mentioning that, since none of the above post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of anchor-free methods.
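A minimal sketch of greedy NMS, with an optional flag that subtracts the normalized center-distance penalty in the spirit of DIoU-NMS [99] (the flag and function name are ours):

```python
import torch

def greedy_nms(boxes, scores, iou_thresh=0.5, use_diou=False):
    # boxes: (N, 4) in (x1, y1, x2, y2) form; scores: (N,).
    # Keep the highest-scoring box, suppress boxes whose (D)IoU with it
    # exceeds the threshold, and repeat on the remainder.
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        x1 = torch.max(boxes[i, 0], boxes[rest, 0])
        y1 = torch.max(boxes[i, 1], boxes[rest, 1])
        x2 = torch.min(boxes[i, 2], boxes[rest, 2])
        y2 = torch.min(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-7)
        if use_diou:
            # DIoU-NMS: subtract the normalized center distance from IoU.
            cx = (boxes[i, 0] + boxes[i, 2] - boxes[rest, 0] - boxes[rest, 2]) / 2
            cy = (boxes[i, 1] + boxes[i, 3] - boxes[rest, 1] - boxes[rest, 3]) / 2
            cw = torch.max(boxes[i, 2], boxes[rest, 2]) - torch.min(boxes[i, 0], boxes[rest, 0])
            ch = torch.max(boxes[i, 3], boxes[rest, 3]) - torch.min(boxes[i, 1], boxes[rest, 1])
            iou = iou - (cx ** 2 + cy ** 2) / (cw ** 2 + ch ** 2 + 1e-7)
        order = rest[iou <= iou_thresh]
    return keep
```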


3. Methodology


The basic aim is fast operating speed of a neural network in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We present two options of real-time neural networks:

• For GPU we use a small number of groups (1 - 8) in convolutional layers: CSPResNeXt50 / CSPDarknet53
• For VPU we use grouped convolutions, but we refrain from using Squeeze-and-Excitation (SE) blocks; specifically this includes the following models: EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3
 

3.1. Selection of architecture


Our objective is to find the optimal balance among the input network resolution, the convolutional layer number, the parameter number (filter_size² × filters × channels / groups), and the number of layer outputs (filters). For instance, our numerous studies demonstrate that the CSPResNeXt50 is considerably better compared to CSPDarknet53 in terms of object classification on the ILSVRC2012 (ImageNet) dataset [10]. However, conversely, the CSPDarknet53 is better compared to CSPResNeXt50 in terms of detecting objects on the MS COCO dataset [46].

The next objective is to select additional blocks for increasing the receptive field and the best method of parameter aggregation from different backbone levels for different detector levels: e.g. FPN, PAN, ASFF, BiFPN.

A reference model which is optimal for classification is not always optimal for a detector. In contrast to the classifier, the detector requires the following:

• Higher input network size (resolution) - for detecting multiple small-sized objects
• More layers - for a higher receptive field to cover the increased size of the input network
• More parameters - for greater capacity of a model to detect multiple objects of different sizes in a single image

Hypothetically speaking, we can assume that a model with a larger receptive field size (with a larger number of convolutional layers 3 × 3) and a larger number of parameters should be selected as the backbone. Table 1 shows the information of CSPResNeXt50, CSPDarknet53, and EfficientNet B3. The CSPResNeXt50 contains only 16 convolutional layers 3 × 3, a 425 × 425 receptive field and 20.6 M parameters, while CSPDarknet53 contains 29 convolutional layers 3 × 3, a 725 × 725 receptive field and 27.6 M parameters. This theoretical justification, together with our numerous experiments, shows that the CSPDarknet53 neural network is the optimal model of the two as the backbone for a detector.

The influence of receptive fields with different sizes is summarized as follows:

• Up to the object size - allows viewing the entire object
• Up to the network size - allows viewing the context around the object
• Exceeding the network size - increases the number of connections between the image point and the final activation

We add the SPP block over the CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features and causes almost no reduction of the network operation speed. We use PANet as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3.

Finally, we choose the CSPDarknet53 backbone, SPP additional module, PANet path-aggregation neck, and YOLOv3 (anchor based) head as the architecture of YOLOv4.

In the future we plan to expand significantly the content of Bag of Freebies (BoF) for the detector, which theoretically can address some problems and increase the detector accuracy, and sequentially check the influence of each feature in an experimental fashion.

We do not use Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This allows anyone to reproduce our state-of-the-art outcomes on a conventional graphics processor, e.g. GTX 1080 Ti or RTX 2080 Ti.
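As a small aid to reading the parameter formula above, this hypothetical helper computes the per-layer weight count the text refers to (biases and batch-norm parameters are ignored; the function name is ours, not from the paper):

```python
def conv_params(filter_size: int, filters: int, channels: int, groups: int = 1) -> int:
    # Weights of one conv layer: filter_size^2 * filters * channels / groups.
    return filter_size ** 2 * filters * channels // groups

# e.g. a 3x3 layer with 512 filters over 256 input channels:
# conv_params(3, 512, 256) -> 1179648 weights
```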
 

 
