YOLO翻譯

You Only Look Once: Unified, Real-Time Object Detection

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

摘要

我們提出了YOLO,一種新的目標檢測方法。以前的目標檢測工作重新利用分類器來執行檢測。與之不同,我們將目標檢測視爲一個迴歸問題,直接回歸出空間上分離的邊界框和相關的類別概率。單個神經網絡在一次評估中直接從完整圖像預測邊界框和類別概率。由於整個檢測流水線是單一網絡,因此可以直接針對檢測性能進行端到端優化。

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

我們的統一架構速度極快。我們的基礎YOLO模型以每秒45幀的速度實時處理圖像。該網絡的一個較小版本,快速YOLO,每秒能處理驚人的155幀,同時mAP仍達到其它實時檢測器的兩倍。與最先進的檢測系統相比,YOLO的定位誤差更多,但較少在背景上產生假陽性預測。最後,YOLO能學習到目標非常通用的表示。當從自然圖像泛化到藝術品等其它領域時,它優於包括DPM和R-CNN在內的其它檢測方法。

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

1. 引言

人們瞥一眼圖像,立即就能知道圖像中有哪些物體、它們在哪裏以及它們如何相互作用。人類的視覺系統快速而準確,使我們幾乎不需要有意識的思考就能完成像駕駛這樣的複雜任務。快速、準確的目標檢測算法可以讓計算機在沒有專門傳感器的情況下駕駛汽車,使輔助設備能夠向人類用戶傳達實時的場景信息,並釋放通用、響應式機器人系統的潛力。

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

目前的檢測系統重用分類器來執行檢測。爲了檢測某個目標,這些系統採用該目標的分類器,並在測試圖像的不同位置和不同尺度上對其進行評估。像可變形部件模型(DPM)這樣的系統使用滑動窗口方法,其分類器在整個圖像上均勻間隔的位置運行[10]。

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

最近的方法,如R-CNN,使用區域提出方法首先在圖像中生成潛在的邊界框,然後在這些提出的框上運行分類器。分類之後,再通過後處理來細化邊界框、消除重複檢測,並根據場景中的其它目標對這些框重新評分[13]。這些複雜的流程很慢,也很難優化,因爲每個單獨的組件都必須分開訓練。

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

我們將目標檢測重新定義爲單一的迴歸問題,直接從圖像像素得到邊界框座標和類別概率。使用我們的系統,您只需對圖像看一次(You Only Look Once,YOLO),即可預測出現了哪些目標以及它們在哪裏。

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

Figure 1

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.

YOLO簡單得令人耳目一新:參見圖1。單個卷積網絡同時預測多個邊界框以及這些框的類別概率。YOLO在完整圖像上訓練,並直接優化檢測性能。與傳統的目標檢測方法相比,這種統一的模型有幾個優點。

Figure 1

圖1:YOLO檢測系統。用YOLO處理圖像簡單直接。我們的系統(1)將輸入圖像調整爲448×448,(2)在圖像上運行單個卷積網絡,以及(3)根據模型的置信度對得到的檢測結果進行閾值處理。

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

首先,YOLO速度非常快。由於我們將檢測視爲迴歸問題,所以不需要複雜的流程。測試時,我們只需在新圖像上運行我們的神經網絡來預測檢測結果。在Titan X GPU上不使用批處理時,我們的基礎網絡以每秒45幀的速度運行,而快速版本的運行速度超過150fps。這意味着我們可以以不到25毫秒的延遲實時處理流媒體視頻。此外,YOLO的平均精度均值達到其它實時系統的兩倍以上。關於我們的系統在網絡攝像頭上實時運行的演示,請參閱我們的項目網頁:http://pjreddie.com/yolo/。

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

其次,YOLO在進行預測時會對圖像進行全局推理。與基於滑動窗口和區域提出的技術不同,YOLO在訓練和測試期間都能看到整張圖像,因此它隱式地編碼了各類別的上下文信息及其外觀。Fast R-CNN是一種頂級的檢測方法[14],但因爲它看不到更大的上下文,會把圖像中的背景塊誤檢爲目標。YOLO的背景誤檢數量不到Fast R-CNN的一半。

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

第三,YOLO學習目標的泛化表示。當在自然圖像上進行訓練並對藝術作品進行測試時,YOLO大幅優於DPM和R-CNN等頂級檢測方法。由於YOLO具有高度泛化能力,因此在應用於新領域或碰到意外的輸入時不太可能出故障。

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

YOLO在精度上仍落後於最先進的檢測系統。雖然它可以快速識別圖像中的目標,但難以精確定位某些目標,尤其是小目標。我們會在實驗中進一步考察這些權衡。

All of our training and testing code is open source. A variety of pretrained models are also available to download.

我們所有的訓練和測試代碼都是開源的。各種預訓練模型也都可以下載。

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

2. 統一檢測

我們將目標檢測的單獨組件集成到單個神經網絡中。我們的網絡使用整個圖像的特徵來預測每個邊界框。它還可以同時預測一張圖像中的所有類別的所有邊界框。這意味着我們的網絡全面地推理整張圖像和圖像中的所有目標。YOLO設計可實現端到端訓練和實時的速度,同時保持較高的平均精度。

Our system divides the input image into an S×S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

我們的系統將輸入圖像分成S×S的網格。如果一個目標的中心落入一個網格單元中,該網格單元負責檢測該目標。

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as $\Pr(\text{Object}) * \text{IOU}^{truth}_{pred}$. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

每個網格單元預測B個邊界框以及這些框的置信度分數。這些置信度分數反映了該模型認爲框內包含目標的把握,以及它認爲自己預測的框有多準確。形式上,我們將置信度定義爲$\Pr(\text{Object}) * \text{IOU}^{truth}_{pred}$。如果該單元格中不存在目標,則置信度分數應爲零。否則,我們希望置信度分數等於預測框與真實框之間的交並比(IOU)。
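The confidence target hinges on the intersection over union between a predicted box and the ground truth. As a minimal, self-contained sketch (our illustration, not the paper's Darknet code), an IOU for axis-aligned boxes given as (x_min, y_min, x_max, y_max) can be computed like this:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```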

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

每個邊界框包含5個預測值:x、y、w、h和置信度。(x, y)座標表示框的中心相對於網格單元邊界的位置。寬度和高度是相對於整張圖像預測的。最後,置信度預測表示預測框與任意真實框之間的IOU。
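To make the grid assignment and box parametrization concrete, the sketch below (an illustration under the conventions described above, with S=7 as in the Pascal VOC setting; the helper name and argument layout are ours) encodes one ground-truth box into the cell containing its center, with (x, y) as offsets inside that cell and (w, h) normalized by the image size:

```python
def encode_box(center_x, center_y, box_w, box_h, img_w, img_h, S=7):
    """Map a ground-truth box to its responsible grid cell and YOLO-style targets."""
    cx = center_x / img_w * S              # box center in grid coordinates (0..S)
    cy = center_y / img_h * S
    col = min(int(cx), S - 1)              # cell containing the center
    row = min(int(cy), S - 1)
    x, y = cx - col, cy - row              # offsets within that cell, in [0, 1)
    w, h = box_w / img_w, box_h / img_h    # relative to the whole image
    return (row, col), (x, y, w, h)
```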

Each grid cell also predicts C conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

每個網格單元還預測C個條件類別概率$\Pr(\text{Class}_i \mid \text{Object})$。這些概率以網格單元包含目標爲條件。無論邊界框的數量B是多少,每個網格單元我們只預測一組類別概率。

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

$$\Pr(\text{Class}_i \mid \text{Object}) * \Pr(\text{Object}) * \text{IOU}^{truth}_{pred} = \Pr(\text{Class}_i) * \text{IOU}^{truth}_{pred}$$
which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

在測試時,我們將條件類別概率與每個框的置信度預測相乘,

$$\Pr(\text{Class}_i \mid \text{Object}) * \Pr(\text{Object}) * \text{IOU}^{truth}_{pred} = \Pr(\text{Class}_i) * \text{IOU}^{truth}_{pred}$$
它爲我們提供了每個框特定類別的置信度分數。這些分數編碼了該類出現在框中的概率以及預測框擬合目標的程度。
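At test time these class-specific scores can be read off the S×S×(B*5+C) output tensor directly. The NumPy sketch below uses the VOC setting (S=7, B=2, C=20); the per-cell memory layout (B boxes of (x, y, w, h, confidence) followed by C class probabilities) is an assumption for illustration and may not match Darknet's actual ordering:

```python
import numpy as np

S, B, C = 7, 2, 20

def class_specific_scores(pred):
    """pred: (S, S, B*5 + C) array.
    Returns (S, S, B, C) scores Pr(Class_i|Object) * Pr(Object) * IOU = Pr(Class_i) * IOU."""
    boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # B boxes of (x, y, w, h, conf) per cell
    box_conf = boxes[..., 4]                        # (S, S, B) box confidences
    class_prob = pred[..., B * 5:]                  # (S, S, C) conditional class probabilities
    return box_conf[..., None] * class_prob[:, :, None, :]

# e.g. scores = class_specific_scores(np.random.rand(S, S, B * 5 + C))
```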

For evaluating YOLO on Pascal VOC, we use S=7, B=2. Pascal VOC has 20 labelled classes so C=20. Our final prediction is a 7×7×30 tensor.

Figure 2

The Model. Our system models detection as a regression problem. It divides the image into an S×S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S×S×(B*5+C) tensor.

爲了在Pascal VOC上評估YOLO,我們使用S=7,B=2。Pascal VOC有20個標註類別,所以C=20。我們最終的預測是一個7×7×30的張量。

Figure 2

模型。我們的系統將檢測建模爲迴歸問題。它將圖像分成S×S的網格,每個網格單元預測B個邊界框、這些邊界框的置信度以及C個類別概率。這些預測被編碼爲S×S×(B*5+C)的張量。

2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the Pascal VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

2.1. 網絡設計

我們將此模型作爲卷積神經網絡來實現,並在Pascal VOC檢測數據集[9]上進行評估。網絡的初始卷積層從圖像中提取特徵,而全連接層預測輸出概率和座標。

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1×1 reduction layers followed by 3×3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

Figure 3

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1×1 convolutional layers reduce the feature space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224×224 input image) and then double the resolution for detection.

我們的網絡架構受GoogLeNet圖像分類模型[34]的啓發。我們的網絡有24個卷積層,後面是2個全連接層。我們沒有使用GoogLeNet的Inception模塊,而是簡單地使用1×1降維層後接3×3卷積層,這與Lin等人[22]類似。完整的網絡如圖3所示。

Figure 3

圖3:架構。我們的檢測網絡有24個卷積層,後面是2個全連接層。交替的1×1卷積層減少了來自前面層的特徵空間。我們在ImageNet分類任務上以一半的分辨率(224×224的輸入圖像)預訓練卷積層,然後將分辨率加倍用於檢測。
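The basic building block described above is a 1×1 convolution that reduces the channel count followed by a 3×3 convolution. A minimal PyTorch-style sketch of one such reduction block (our illustration only; the full 24-layer network follows Figure 3, and the channel sizes here are placeholders) could look like this, using the leaky activation introduced in Section 2.2:

```python
import torch.nn as nn

def reduction_block(in_ch, mid_ch, out_ch):
    """1x1 'reduction' convolution followed by a 3x3 convolution,
    the pattern YOLO uses in place of GoogLeNet's inception modules."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),              # reduce the feature space
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),  # 3x3 convolution
        nn.LeakyReLU(0.1, inplace=True),
    )

# e.g. block = reduction_block(512, 256, 512)
```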


We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

我們還訓練了快速版本的YOLO,旨在推動快速目標檢測的界限。快速YOLO使用具有較少卷積層(9層而不是24層)的神經網絡,在這些層中使用較少的濾波器。除了網絡規模之外,YOLO和快速YOLO的所有訓練和測試參數都是相同的。

The final output of our network is the 7×7×30 tensor of predictions.

我們網絡的最終輸出是7×7×30的預測張量。

2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. We use the Darknet framework for all training and inference [26].

2.2. 訓練

我們在ImageNet 1000類競賽數據集[30]上預訓練我們的卷積層。對於預訓練,我們使用圖3中的前20個卷積層,後接一個平均池化層和一個全連接層。我們對這個網絡進行了大約一週的訓練,在ImageNet 2012驗證集上獲得了88%的單一裁剪圖像top-5準確率,與Caffe模型庫(Model Zoo)中的GoogLeNet模型相當[24]。我們使用Darknet框架進行所有的訓練和推斷[26]。

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224×224 to 448×448.

然後我們轉換模型來執行檢測。Ren等人表明,在預訓練網絡上增加卷積層和全連接層可以提高性能[29]。按照他們的做法,我們添加了四個卷積層和兩個權重隨機初始化的全連接層。檢測通常需要細粒度的視覺信息,因此我們將網絡的輸入分辨率從224×224提高到448×448。

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

我們的最後一層預測類別概率和邊界框座標。我們用圖像的寬度和高度對邊界框的寬度和高度進行歸一化,使它們落在0和1之間。我們將邊界框的x和y座標參數化爲相對於特定網格單元位置的偏移量,因此它們也被限制在0和1之間。

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}$$

我們對最後一層使用線性激活函數,所有其它層使用下面的漏泄修正線性激活:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}$$
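In code this activation is a one-liner; the NumPy sketch below simply applies the piecewise definition above elementwise:

```python
import numpy as np

def leaky_relu(x):
    """phi(x) = x if x > 0, else 0.1 * x, applied elementwise."""
    return np.where(x > 0, x, 0.1 * x)
```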

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

我們對模型輸出的平方和誤差進行優化。我們使用平方和誤差是因爲它易於優化,但它並不完全符合我們最大化平均精度的目標。它對定位誤差與分類誤差賦予同樣的權重,這可能並不理想。另外,在每張圖像中,許多網格單元不包含任何目標。這會將這些單元格的“置信度”分數推向零,往往蓋過了包含目標的單元格的梯度。這可能導致模型不穩定,使訓練在早期就發散。

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, $\lambda_{coord}$ and $\lambda_{noobj}$ to accomplish this. We set $\lambda_{coord}=5$ and $\lambda_{noobj}=.5$.

爲了彌補這一點,我們增加邊界框座標預測的損失,並減少不包含目標的框的置信度預測的損失。我們使用兩個參數$\lambda_{coord}$和$\lambda_{noobj}$來完成這一點。我們設置$\lambda_{coord}=5$,$\lambda_{noobj}=.5$。

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

平方和誤差還對大邊界框和小邊界框中的誤差賦予同樣的權重。我們的誤差度量應當反映出:大框中的小偏差沒有小框中的小偏差那麼重要。爲了部分解決這個問題,我們預測邊界框寬度和高度的平方根,而不是直接預測寬度和高度。
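A quick numerical check (our own worked example, not from the paper) shows the effect: the same 5-unit width error is weighted equally under a plain squared error, but far less for the large box once square roots are compared:

```python
import math

def sq_err(pred, truth):
    return (pred - truth) ** 2

def sqrt_err(pred, truth):
    return (math.sqrt(pred) - math.sqrt(truth)) ** 2

print(sq_err(205, 200), sq_err(25, 20))      # 25 vs 25: large and small box penalized equally
print(sqrt_err(205, 200), sqrt_err(25, 20))  # ~0.03 vs ~0.28: small-box error now matters more
```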

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

YOLO在每個網格單元中預測多個邊界框。在訓練時,對每個目標我們只希望有一個邊界框預測器對其負責。我們根據哪個預測框與真實框具有當前最高的IOU,來指定該預測器“負責”預測這個目標。這導致邊界框預測器之間的專業化。每個預測器會更擅長預測特定大小、長寬比或類別的目標,從而改善整體召回率。
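The responsibility rule is just an argmax over IOU with the ground truth. Given the IOUs of a cell's B predicted boxes against an object (computed with a helper like the earlier IOU sketch), the selection reduces to:

```python
def responsible_predictor(ious):
    """Given IOUs of a cell's B predicted boxes against one ground-truth box,
    return the index of the box 'responsible' for that object (highest IOU)."""
    return max(range(len(ious)), key=lambda j: ious[j])

# e.g. responsible_predictor([0.31, 0.58]) -> 1
```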

During training we optimize the following, multi-part loss function:

$$\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right]+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}(p_i(c)-\hat{p}_i(c))^2$$

where $\mathbb{1}_{i}^{obj}$ denotes if object appears in cell $i$ and $\mathbb{1}_{ij}^{obj}$ denotes that the $j$th bounding box predictor in cell $i$ is "responsible" for that prediction.

在訓練期間,我們優化以下多部分損失函數:

$$\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right]+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}(p_i(c)-\hat{p}_i(c))^2$$

其中$\mathbb{1}_{i}^{obj}$表示目標是否出現在網格單元$i$中,$\mathbb{1}_{ij}^{obj}$表示網格單元$i$中的第$j$個邊界框預測器“負責”該預測。
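To make the five terms concrete, here is a per-image NumPy sketch of the loss under some simplifying assumptions of ours: the targets are already encoded as an (S, S, B, 5) box tensor plus an (S, S, C) class tensor, and the responsibility indicators are passed in as boolean masks rather than recomputed from IOUs inside the function:

```python
import numpy as np

def yolo_loss(pred_boxes, pred_classes, true_boxes, true_classes, obj_ij,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Multi-part YOLO loss for one image.

    pred_boxes, true_boxes:     (S, S, B, 5) arrays of (x, y, w, h, C).
    pred_classes, true_classes: (S, S, C) arrays of class probabilities.
    obj_ij: (S, S, B) boolean mask, True where box j of cell i is
            responsible for an object (the 1_ij^obj indicator above).
    """
    noobj_ij = ~obj_ij
    obj_i = obj_ij.any(axis=-1)  # 1_i^obj: cell contains an object

    xy_err = np.sum(obj_ij * np.sum((pred_boxes[..., 0:2] - true_boxes[..., 0:2]) ** 2, axis=-1))
    wh_err = np.sum(obj_ij * np.sum((np.sqrt(pred_boxes[..., 2:4])
                                     - np.sqrt(true_boxes[..., 2:4])) ** 2, axis=-1))
    conf_obj = np.sum(obj_ij * (pred_boxes[..., 4] - true_boxes[..., 4]) ** 2)
    conf_noobj = np.sum(noobj_ij * (pred_boxes[..., 4] - true_boxes[..., 4]) ** 2)
    class_err = np.sum(obj_i * np.sum((pred_classes - true_classes) ** 2, axis=-1))

    return lambda_coord * (xy_err + wh_err) + conf_obj + lambda_noobj * conf_noobj + class_err
```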

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

注意,僅當目標存在於該網格單元中時(因此纔有前面討論的條件類別概率),損失函數纔會懲罰分類錯誤。同樣,僅當某個預測器對真實邊界框“負責”時(即在該網格單元中具有最高IOU的預測器),損失函數纔會懲罰它的邊界框座標錯誤。

We train the network for about 135 epochs on the training and validation data sets from Pascal VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.

我們在Pascal VOC 2007和2012的訓練集和驗證集上對網絡進行了大約135個迭代週期的訓練。在Pascal VOC 2012上測試時,我們的訓練數據還包含了Pascal VOC 2007的測試數據。在整個訓練過程中,我們使用64的批大小、0.9的動量和0.0005的衰減。

Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from $10^{-3}$ to $10^{-2}$. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with $10^{-2}$ for 75 epochs, then $10^{-3}$ for 30 epochs, and finally $10^{-4}$ for 30 epochs.

我們的學習率方案如下:在最初的幾個迭代週期裏,我們將學習率從$10^{-3}$慢慢提高到$10^{-2}$。如果從高學習率開始,我們的模型往往會因爲不穩定的梯度而發散。之後我們以$10^{-2}$的學習率訓練75個迭代週期,然後以$10^{-3}$訓練30個迭代週期,最後以$10^{-4}$訓練30個迭代週期。
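Written as a lookup, the schedule is straightforward. The paper does not say how many "first epochs" the warm-up spans, so the 5-epoch linear ramp below is purely our assumption for illustration:

```python
def learning_rate(epoch, warmup_epochs=5):
    """Piecewise learning-rate schedule from Section 2.2 (warm-up length is assumed)."""
    if epoch < warmup_epochs:                       # slowly raise 1e-3 -> 1e-2
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:                  # 75 epochs at 1e-2
        return 1e-2
    if epoch < warmup_epochs + 75 + 30:             # 30 epochs at 1e-3
        return 1e-3
    return 1e-4                                     # final 30 epochs at 1e-4
```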

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate $=.5$ after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

爲了避免過擬合,我們使用丟棄(dropout)和大量的數據增強。在第一個全連接層之後使用比例爲0.5的丟棄層,以防止層之間的互相適應[18]。對於數據增強,我們引入最高爲原始圖像尺寸20%的隨機縮放和平移。我們還在HSV色彩空間中以最高1.5的因子隨機調整圖像的曝光度和飽和度。
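One plausible reading of those augmentation ranges is sketched below; the paper does not give the sampling code, so the symmetric sampling of the exposure and saturation factors (allowing both brightening and darkening) is our assumption:

```python
import random

def sample_augmentation(img_w, img_h, jitter=0.2, hsv_factor=1.5):
    """Sample random augmentation parameters for one training image."""
    scale = random.uniform(1.0 - jitter, 1.0 + jitter)          # scaling up to 20% of image size
    dx = random.uniform(-jitter, jitter) * img_w                 # translation up to 20% of width
    dy = random.uniform(-jitter, jitter) * img_h                 # translation up to 20% of height
    exposure = random.uniform(1.0 / hsv_factor, hsv_factor)      # V-channel factor up to 1.5
    saturation = random.uniform(1.0 / hsv_factor, hsv_factor)    # S-channel factor up to 1.5
    return scale, (dx, dy), exposure, saturation
```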

2.3. Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On Pascal VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

2.3. 推斷

就像在訓練中一樣,預測測試圖像的檢測只需要一次網絡評估。在Pascal VOC上,每張圖像上網絡預測98個邊界框和每個框的類別概率。YOLO在測試時非常快,因爲它只需要一次網絡評估,不像基於分類器的方法。

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls in to and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.

網格設計強化了邊界框預測中的空間多樣性。通常,一個目標落在哪個網格單元中是很明顯的,網絡只會爲每個目標預測一個框。然而,一些大目標或靠近多個網格單元邊界的目標可能會被多個單元都很好地定位。可以使用非極大值抑制來修正這些重複檢測。雖然非極大值抑制對性能的影響不像對R-CNN或DPM那樣關鍵,但它能增加2-3%的mAP。
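Non-maximal suppression itself is standard greedy filtering; the self-contained sketch below (boxes as (x_min, y_min, x_max, y_max), one class at a time) is an illustration rather than Darknet's implementation:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximal suppression for one class; returns indices of kept boxes."""
    def iou(a, b):
        ix, iy = max(a[0], b[0]), max(a[1], b[1])
        ax, ay = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ax - ix) * max(0.0, ay - iy)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:                 # keep a box only if it does not overlap
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)          # an already-kept, higher-scoring box too much
    return keep
```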

2.4. Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

2.4. YOLO的限制

YOLO對邊界框預測施加了很強的空間約束,因爲每個網格單元只預測兩個框,且只能有一個類別。這個空間約束限制了我們的模型能夠預測的鄰近目標的數量。我們的模型難以處理成羣出現的小目標,比如鳥羣。

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

由於我們的模型是從數據中學習預測邊界框的,因此它很難泛化到具有新的或不常見的長寬比或配置的目標。我們的模型還使用相對粗糙的特徵來預測邊界框,因爲我們的架構從輸入圖像開始有多個下采樣層。

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

最後,雖然我們訓練所用的損失函數是對檢測性能的近似,但該損失函數對小邊界框和大邊界框中的誤差一視同仁。大框中的小誤差通常無關緊要,但小框中的小誤差對IOU的影響要大得多。我們的主要錯誤來源是不正確的定位。

3. Comparison to Other Detection Systems

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.

3. 與其它檢測系統的比較

目標檢測是計算機視覺中的核心問題。檢測流程通常從輸入圖像中提取一組魯棒特徵(Haar [25],SIFT [23],HOG [4],卷積特徵[6])開始。然後,使用分類器[36,21,13,10]或定位器[1,32]來識別特徵空間中的目標。這些分類器或定位器或者以滑動窗口的方式在整張圖像上運行,或者在圖像中的某些區域子集上運行[35,15,39]。我們將YOLO檢測系統與幾種頂級檢測框架進行比較,突出關鍵的相似點和差異。

Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, non-maximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.

可變形部件模型。可變形部件模型(DPM)使用滑動窗口方法進行目標檢測[10]。DPM使用不相交的流程來提取靜態特徵、對區域進行分類、爲高分區域預測邊界框等。我們的系統用單個卷積神經網絡替換所有這些不同的部分。該網絡同時進行特徵提取、邊界框預測、非極大值抑制和上下文推理。網絡不使用靜態特徵,而是在線訓練特徵並針對檢測任務優化它們。我們的統一架構帶來了比DPM更快、更準確的模型。

R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].

R-CNN。R-CNN及其變種使用區域提出而不是滑動窗口來查找圖像中的目標。選擇性搜索[35]產生潛在的邊界框,卷積網絡提取特徵,SVM對邊界框進行評分,線性模型調整邊界框,非極大值抑制消除重複檢測。這個複雜流程的每個階段都必須獨立地進行精確調整,所得到的系統非常慢,測試時每張圖像需要超過40秒[14]。

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.

YOLO與R-CNN有一些相似之處。每個網格單元提出潛在的邊界框,並使用卷積特徵對這些框進行評分。但是,我們的系統對網格單元的提出施加了空間限制,這有助於緩解對同一目標的多次檢測。我們的系統提出的邊界框也少得多,每張圖像只有98個,而選擇性搜索約有2000個。最後,我們的系統將這些單獨的組件組合成一個單一的、共同優化的模型。

Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [28]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.

其它快速檢測器。Fast R-CNN和Faster R-CNN專注於通過共享計算以及用神經網絡代替選擇性搜索來提出區域,從而加速R-CNN框架[14],[28]。雖然它們相比R-CNN在速度和準確度上都有提升,但兩者仍然達不到實時性能。

Many research efforts focus on speeding up the DPM pipeline [31] [38] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [31] actually runs in real-time.

許多研究工作集中在加速DPM流程上[31],[38],[5]。它們加速HOG計算、使用級聯,並將計算轉移到GPU上。但是,實際上只有30Hz的DPM [31]能實時運行。

Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.

YOLO不是試圖優化大型檢測流程中的單個組件,而是完全拋棄了這種流程,其設計本身就是爲了速度。

Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [37]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.

像人臉或行人這樣單一類別的檢測器可以被高度優化,因爲它們需要處理的變化要少得多[37]。YOLO是一種通用檢測器,它學習同時檢測多種目標。

Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.

Deep MultiBox。與R-CNN不同,Szegedy等人訓練了一個卷積神經網絡來預測感興趣區域[8],而不是使用選擇性搜索。MultiBox還可以通過用單類預測替換置信度預測來執行單目標檢測。然而,MultiBox無法執行通用的目標檢測,並且仍然只是一個較大的檢測流程中的一部分,需要進一步的圖像塊分類。YOLO和MultiBox都使用卷積網絡來預測圖像中的邊界框,但是YOLO是一個完整的檢測系統。

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

OverFeat。Sermanet等人訓練了一個卷積神經網絡來執行定位,並將該定位器改造用於檢測[32]。OverFeat高效地執行滑動窗口檢測,但它仍然是一個不相交的系統。OverFeat優化的是定位而不是檢測性能。像DPM一樣,定位器在進行預測時只能看到局部信息。OverFeat無法推理全局上下文,因此需要大量的後處理來產生連貫的檢測結果。

MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al [27]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn't have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.

MultiGrasp。我們的工作在設計上類似於Redmon等人[27]的抓取檢測工作。我們用於邊界框預測的網格方法基於MultiGrasp系統對抓取的迴歸。然而,抓取檢測是比目標檢測簡單得多的任務。MultiGrasp只需要爲包含單個目標的圖像預測一個可抓取的區域。它不必估計目標的大小、位置或邊界,也不必預測目標的類別,只需找到適合抓取的區域。YOLO則預測圖像中多個類別的多個目標的邊界框和類別概率。

4. Experiments

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

4. 實驗

首先,我們在PASCAL VOC 2007上將YOLO與其它實時檢測系統進行比較。爲了理解YOLO和各種R-CNN變種之間的差異,我們分析了YOLO和Fast R-CNN(R-CNN中性能最高的版本之一[14])在VOC 2007上所犯的錯誤。根據兩者不同的錯誤特性,我們表明YOLO可以用來對Fast R-CNN的檢測結果重新評分,並減少背景假陽性帶來的錯誤,從而顯著提升性能。我們還展示了在VOC 2012上的結果,並將mAP與目前最先進的方法進行比較。最後,我們在兩個藝術品數據集上表明,YOLO比其它檢測器能更好地泛化到新領域。

4.1. Comparison to Other Real-Time Systems

Many research efforts in object detection focus on making standard detection pipelines fast [5] [38] [31] [14] [17] [28]. However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don’t reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.

4.1. 與其它實時系統的比較

目標檢測方面的許多研究工作都集中在加快標準檢測流程上[5],[38],[31],[14],[17],[28]。然而,只有Sadeghi等人真正做出了一個實時運行的檢測系統(每秒30幀或更快)[31]。我們將YOLO與他們的DPM的GPU實現進行了比較,後者可以在30Hz或100Hz下運行。雖然其它的工作沒有達到實時性的里程碑,我們也比較了它們的相對mAP和速度,以考察目標檢測系統中可用的精度與性能的權衡。

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.

快速YOLO是PASCAL上最快的目標檢測方法;據我們所知,它是現存最快的目標檢測器。憑藉52.7%的mAP,其精度是以前實時檢測工作的兩倍以上。YOLO將mAP推到63.4%,同時仍保持實時性能。

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.

我們還使用VGG-16訓練了YOLO。這個模型比YOLO更準確,但也明顯更慢。它對於與其它依賴VGG-16的檢測系統進行比較是有用的,但由於它達不到實時速度,本文的其餘部分將重點放在我們更快的模型上。

Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [38]. It also is limited by DPM’s relatively low accuracy on detection compared to neural network approaches.

Fastest DPM在不犧牲太多mAP的情況下有效地加速了DPM,但其速度與實時性能仍相差2倍[38]。此外,與神經網絡方法相比,DPM相對較低的檢測精度也限制了它。

R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.

R-CNN minus R用靜態邊界框提出取代了選擇性搜索[20]。雖然它比R-CNN快得多,但仍然達不到實時,並且由於沒有好的邊界框提出,準確性受到了嚴重影響。

Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but at 0.5 fps it is still far from real-time.

Fast R-CNN加快了R-CNN的分類階段,但是仍然依賴選擇性搜索,每張圖像需要花費大約2秒來生成邊界框提出。因此,它具有很高的mAP,但0.5fps的速度仍離實時很遠。

The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8]. In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.

Table 1

Table 1:Real-Time Systems on Pascal VOC 2007. Comparing the performance and speed of fast detectors. Fast YOLO is the fastest detector on record for Pascal VOC detection and is still twice as accurate as any other real-time detector. YOLO is 10 mAP more accurate than the fast version while still well above real-time in speed.

最近的Faster R-CNN用神經網絡代替選擇性搜索來提出邊界框,類似於Szegedy等人[8]。在我們的測試中,他們最精確的模型達到了7fps,而較小的、精度較低的模型以18fps運行。VGG-16版本的Faster R-CNN的mAP要高出10,但也比YOLO慢6倍。Zeiler-Fergus版本的Faster R-CNN只比YOLO慢2.5倍,但精度也較低。

Table 1

表1:Pascal VOC 2007上的實時系統。比較快速檢測器的性能和速度。快速YOLO是Pascal VOC檢測記錄中速度最快的檢測器,其精度仍然是其它任何實時檢測器的兩倍。YOLO的精度比快速版本高10mAP,同時速度仍遠超實時要求。

4.2. VOC 2007 Error Analysis

To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast R-CNN since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.

4.2. VOC 2007錯誤分析

爲了進一步考察YOLO和最先進的檢測器之間的差異,我們詳細分析了VOC 2007上的結果。我們將YOLO與Fast R-CNN進行比較,因爲Fast R-CNN是PASCAL上性能最高的檢測器之一,並且它的檢測結果是公開可得的。

We use the methodology and tools of Hoiem et al. [19]. For each category at test time we look at the top N predictions for that category. Each prediction is either correct or it is classified based on the type of error:

  • Correct: correct class and IOU > .5
  • Localization: correct class, .1 < IOU < .5
  • Similar: class is similar, IOU > .1
  • Other: class is wrong, IOU > .1
  • Background: IOU < .1 for any object

Figure 4 shows the breakdown of each error type averaged across all 20 classes.

Figure 4

Figure 4: Error Analysis: Fast R-CNN vs. YOLO These charts show the percentage of localization and background errors in the top N detections for various categories (N = # objects in that category).

我們使用Hoiem等人[19]的方法和工具。對於測試時的每個類別,我們看這個類別的前N個預測。每個預測或者是正確的,或者根據錯誤類型進行分類:

  • Correct:類別正確且IOU > 0.5
  • Localization:類別正確,0.1 < IOU < 0.5
  • Similar:類別相似,IOU > 0.1
  • Other:類別錯誤,IOU > 0.1
  • Background:與任何目標的IOU都 < 0.1。

圖4顯示了在所有的20個類別上每種錯誤類型平均值的分解圖。

Figure 4

圖4:誤差分析:Fast R-CNN vs. YOLO。這些圖顯示了各種類別的前N個檢測中定位錯誤和背景錯誤的百分比(N = 該類別中的目標數量)。
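The five categories above translate directly into a small classification rule. The sketch below is our bookkeeping of those thresholds, not the original diagnosis toolkit of Hoiem et al.; `best_iou` is the detection's highest IOU with any ground-truth object and `true_class` is that object's class:

```python
def error_type(pred_class, true_class, best_iou, classes_similar):
    """Classify one top-N detection per the VOC 2007 error analysis categories."""
    if pred_class == true_class and best_iou > 0.5:
        return "Correct"
    if pred_class == true_class and 0.1 < best_iou < 0.5:
        return "Localization"
    if classes_similar(pred_class, true_class) and best_iou > 0.1:
        return "Similar"
    if best_iou > 0.1:
        return "Other"          # wrong class but overlapping some object
    return "Background"         # IOU < 0.1 with every object
```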

YOLO struggles to localize objects correctly. Localization errors account for more of YOLO's errors than all other sources combined. Fast R-CNN makes much fewer localization errors but far more background errors. 13.6% of its top detections are false positives that don't contain any objects. Fast R-CNN is almost 3x more likely to predict background detections than YOLO.

YOLO難以正確地定位目標。定位錯誤佔YOLO錯誤的大多數,比其它所有來源的總和還多。Fast R-CNN的定位錯誤少得多,但背景錯誤要多得多。其排名靠前的檢測中有13.6%是不包含任何目標的假陽性。Fast R-CNN預測出背景檢測的可能性幾乎是YOLO的3倍。

4.3. Combining Fast R-CNN and YOLO

YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.

4.3. 結合Fast R-CNN和YOLO

YOLO比Fast R-CNN的背景誤檢要少得多。通過使用YOLO消除Fast R-CNN的背景檢測,我們獲得了顯著的性能提升。對於R-CNN預測的每個邊界框,我們檢查YOLO是否預測一個類似的框。如果是這樣,我們根據YOLO預測的概率和兩個盒子之間的重疊來對這個預測進行提升。
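The paper states the idea but not the exact rescoring formula here, so the sketch below only captures the described behaviour: for each Fast R-CNN box, find the best-overlapping YOLO box and, if the overlap is high enough, boost the score as a function of YOLO's probability and the overlap. The additive boost and the 0.5 overlap threshold are assumptions:

```python
def rescore(frcnn_boxes, frcnn_scores, yolo_boxes, yolo_probs, iou_fn, min_iou=0.5):
    """Boost Fast R-CNN detections that YOLO agrees with (illustrative combination only)."""
    boosted = []
    for box, score in zip(frcnn_boxes, frcnn_scores):
        overlap, prob = max(((iou_fn(box, yb), p) for yb, p in zip(yolo_boxes, yolo_probs)),
                            default=(0.0, 0.0))
        if overlap >= min_iou:
            score = score + prob * overlap      # reward agreement between the two detectors
        boosted.append(score)
    return boosted
```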

The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between .3 and .6%, see Table 2 for details.

Table 2

Table 2: Model combination experiments on VOC 2007. We examine the effect of combining various models with the best version of Fast R-CNN. Other versions of Fast R-CNN provide only a small benefit while YOLO provides a significant performance boost.

最好的Fast R-CNN模型在VOC 2007測試集上達到了71.8%的mAP。與YOLO結合後,其mAP增加了3.2%,達到75.0%。我們也嘗試將最好的Fast R-CNN模型與其它幾個版本的Fast R-CNN結合。這些組合使mAP產生了0.3%到0.6%之間的小幅提升,詳見表2。

Table 2

表2:VOC 2007模型組合實驗。我們檢驗了各種模型與Fast R-CNN最佳版本結合的效果。Fast R-CNN的其它版本只提供很小的好處,而YOLO則提供了顯著的性能提升。

The boost from YOLO is not simply a byproduct of model ensembling since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN’s performance.

來自YOLO的提升不僅僅是模型組合的副產品,因爲組合不同版本的Fast R-CNN幾乎帶不來什麼好處。相反,正是因爲YOLO在測試時所犯的錯誤類型與Fast R-CNN不同,它才能如此有效地提升Fast R-CNN的性能。

Unfortunately, this combination doesn't benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn't add any significant computational time compared to Fast R-CNN.

遺憾的是,這個組合並沒有從YOLO的速度中受益,因爲我們分別運行每個模型,然後結合結果。但是,由於YOLO速度如此之快,與Fast R-CNN相比,不會增加任何顯著的計算時間。

4.4. VOC 2012 Results

On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16, see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor YOLO scores 8-10% lower than R-CNN or Feature Edit. However, on other categories like cat and train YOLO achieves higher performance.

Table 3

Table 3: PASCAL VOC 2012 Leaderboard. YOLO compared with the full comp4 (outside data allowed) public leaderboard as of November 6th, 2015. Mean average precision and per-class average precision are shown for a variety of detection methods. YOLO is the only real-time detector. Fast R-CNN + YOLO is the fourth highest scoring method, with a 2.3% boost over Fast R-CNN.

4.4. VOC 2012的結果

在VOC 2012測試集上,YOLO的mAP得分爲57.9%。這低於目前的最新水平,接近使用VGG-16的原始R-CNN,見表3。與最接近的競爭對手相比,我們的系統在小目標上表現不佳。在bottle、sheep和tv/monitor等類別上,YOLO的得分比R-CNN或Feature Edit低8-10%。然而,在cat和train等其它類別上,YOLO取得了更高的性能。

Table 3

表3:PASCAL VOC 2012排行榜。截至2015年11月6日,YOLO與完整comp4(允許外部數據)公開排行榜進行了比較。顯示了各種檢測方法的平均精度均值和每類的平均精度。YOLO是唯一的實時檢測器。Fast R-CNN + YOLO是評分第四高的方法,比Fast R-CNN提升了2.3%。

Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.

我們聯合的Fast R-CNN + YOLO模型是性能最高的檢測方法之一。Fast R-CNN通過與YOLO組合獲得了2.3%的提升,在公開排行榜上上升了5位。

4.5. Generalizability: Person Detection in Artwork

Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.

4.5. 泛化能力:藝術品中的行人檢測

用於目標檢測的學術數據集以相同分佈獲取訓練和測試數據。在現實世界的應用中,很難預測所有可能的用例,而且測試數據可能與系統之前看到的不同[3]。我們在Picasso數據集上[12]和People-Art數據集[3]上將YOLO與其它的檢測系統進行比較,這兩個數據集用於測試藝術品中的行人檢測。

Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC 2007 detection AP on person where all models are trained only on VOC 2007 data. On Picasso models are trained on VOC 2012 while on People-Art they are trained on VOC 2010.

Figure 5

Figure 5: Generalization results on Picasso and People-Art datasets.

圖5顯示了YOLO和其它檢測方法之間的比較性能。作爲參考,我們在person上提供VOC 2007的檢測AP,其中所有模型僅在VOC 2007數據上訓練。在Picasso數據集上的模型在VOC 2012上訓練,而People-Art數據集上的模型則在VOC 2010上訓練。

Figure 5

圖5:Picasso和People-Art數據集上的泛化結果。

R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.

R-CNN在VOC 2007上有很高的AP。然而,當應用於藝術品時,R-CNN的性能大幅下降。R-CNN使用選擇性搜索來生成邊界框提出,而選擇性搜索是針對自然圖像調優的。R-CNN中的分類器步驟只能看到小的區域,並且需要好的提出。

DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn’t degrade as much as R-CNN, it starts from a lower AP.

DPM在應用於藝術品時保持了其AP。之前的工作認爲DPM表現良好,因爲它具有目標形狀和佈局的強大空間模型。雖然DPM不會像R-CNN那樣退化,但它開始時的AP較低。

YOLO has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level but they are similar in terms of the size and shape of objects, thus YOLO can still predict good bounding boxes and detections.

Figure 6

Figure 6: Qualitative Results. YOLO running on sample artwork and natural images from the internet. It is mostly accurate although it does think one person is an airplane.

YOLO在VOC 2007上有很好的性能,應用於藝術品時其AP的下降幅度低於其它方法。像DPM一樣,YOLO對目標的大小和形狀、目標之間的關係以及目標通常出現的位置進行建模。藝術品和自然圖像在像素層面上有很大不同,但它們在目標的大小和形狀方面是相似的,因此YOLO仍然可以預測出好的邊界框和檢測結果。

Figure 6

圖6:定性結果。YOLO在來自互聯網的藝術品和自然圖像樣本上的運行結果。儘管它確實把一個人誤認成了飛機,但結果在大多數情況下是準確的。

5. Real-Time Detection In The Wild

YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.

5. 現實環境下的實時檢測

YOLO是一種快速,精確的目標檢測器,非常適合計算機視覺應用。我們將YOLO連接到網絡攝像頭,並驗證它是否能保持實時性能,包括從攝像頭獲取圖像並顯示檢測結果的時間。

The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: http://pjreddie.com/yolo/.

由此產生的系統是交互式的,並且引人入勝。雖然YOLO單獨處理每張圖像,但當連接到網絡攝像頭時,它的功能類似於跟蹤系統,可在目標移動和外觀變化時對其進行檢測。系統演示和源代碼可以在我們的項目網站上找到:http://pjreddie.com/yolo/。

6. Conclusion

We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.

6. 結論

我們介紹了YOLO,一種統一的目標檢測模型。我們的模型構建簡單,可以直接在整張圖像上進行訓練。與基於分類器的方法不同,YOLO直接在對應檢測性能的損失函數上訓練,並且整個模型聯合訓練。

Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.

快速YOLO是文獻中最快的通用目標檢測器,YOLO則推動了實時目標檢測的最新水平。YOLO還能很好地泛化到新領域,使其成爲依賴快速、魯棒的目標檢測的應用的理想選擇。

Acknowledgements: This work is partially supported by ONR N00014-13-1-0720, NSF IIS-1338054, and The Allen Distinguished Investigator Award.

致謝:這項工作得到了ONR N00014-13-1-0720,NSF IIS-1338054和艾倫傑出研究者獎的部分支持。

References

[1] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In Computer Vision–ECCV 2008, pages 2–15. Springer, 2008. 4

[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In International Conference on Computer Vision (ICCV), 2009. 8

[3] H. Cai, Q. Wu, T. Corradi, and P. Hall. The cross-depiction problem: Computer vision algorithms for recognising objects in artwork and in photographs. arXiv preprint arXiv:1505.00110, 2015. 7

[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005. 4, 8

[5] T. Dean, M. Ruzon, M. Segal, J. Shlens, S. Vijaya-narasimhan, J. Yagnik, et al. Fast, accurate detection of 100,000 object classes on a single machine. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1814–1821. IEEE, 2013. 5

[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013. 4

[7] J. Dong, Q. Chen, S. Yan, and A. Yuille. Towards unified object detection and semantic segmentation. In Computer Vision–ECCV 2014, pages 299–314. Springer, 2014. 7

[8] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2155–2162. IEEE, 2014. 5, 6

[9] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015. 2

[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010. 1, 4

[11] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. CoRR, abs/1505.01749, 2015. 7

[12] S. Ginosar, D. Haas, T. Brown, and J. Malik. Detecting people in cubist art. In Computer Vision-ECCV 2014 Workshops, pages 101–116. Springer, 2014. 7

[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014. 1, 4, 7

[14] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015. 2, 5, 6, 7

[15] S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. In Advances in neural information processing systems, pages 655–663, 2009. 4

[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In Computer Vision–ECCV 2014, pages 297–312. Springer, 2014. 7

[17] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. arXiv preprint arXiv:1406.4729, 2014. 5

[18] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012. 4

[19] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In Computer Vision–ECCV 2012, pages 340–353. Springer, 2012. 6

[20] K. Lenc and A. Vedaldi. R-cnn minus r. arXiv preprint arXiv:1506.06981, 2015. 5, 6

[21] R. Lienhart and J. Maydt. An extended set of haar-like features for rapid object detection. In Image Processing. 2002. Proceedings. 2002 International Conference on, volume 1, pages I–900. IEEE, 2002. 4

[22] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013. 2

[23] D. G. Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. Ieee, 1999. 4

[24] D. Mishkin. Models accuracy on imagenet 2012 val. https://github.com/BVLC/caffe/wiki/Models-accuracy-on-ImageNet-2012-val. Accessed: 2015-10-2. 3

[25] C. P. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In Computer vision, 1998. sixth international conference on, pages 555–562. IEEE, 1998. 4

[26] J. Redmon. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016. 3

[27] J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. CoRR, abs/1412.3128, 2014. 5

[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015. 5, 6, 7

[29] S. Ren, K. He, R. B. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. CoRR, abs/1504.06066, 2015. 3, 7

[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015. 3

[31] M. A. Sadeghi and D. Forsyth. 30hz object detection with dpm v5. In Computer Vision–ECCV 2014, pages 65–79. Springer, 2014. 5, 6

[32] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013. 4, 5

[33] Z. Shen and X. Xue. Do more dropouts in pool5 feature maps for better object detection. arXiv preprint arXiv:1409.6911, 2014. 7

[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. 2

[35] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013. 4, 5

[36] P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 4:34–47, 2001. 4

[37] P. Viola and M. J. Jones. Robust real-time face detection. International journal of computer vision, 57(2):137–154, 2004. 5

[38] J. Yan, Z. Lei, L. Wen, and S. Z. Li. The fastest deformable part model for object detection. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2497–2504. IEEE, 2014. 5, 6

[39] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Computer Vision–ECCV 2014, pages 391–405. Springer, 2014. 4
