Abstract
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300x300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on an Nvidia Titan X, and for 512x512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model.
Paragraph function: presents the core method and summarizes the main contributions and experimental results.
Logical role: serves as a condensed summary of the whole paper, establishing SSD's three selling points: a single forward pass, multi-scale feature-map prediction, and no need for object proposals.
Argumentative technique / potential weakness: builds an advantage narrative on the dual metrics of speed and accuracy, but the comparison baselines for mAP and speed (hardware, batch size) are left for the reader to verify.
1. Introduction
Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. This pipeline has prevailed on detection benchmarks since the Selective Search work through the current leading results on PASCAL VOC, COCO, and ILSVRC detection achieved by Faster R-CNN. Although accurate, these approaches are too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications. Often, detection speed for these approaches is measured in frames per second (FPS), and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 FPS.
Paragraph function: reviews the existing detection pipeline and identifies its speed bottleneck.
Logical role: motivates SSD's "single-shot" approach by critiquing the efficiency of the prevailing two-stage paradigm.
Argumentative technique / potential weakness: using Faster R-CNN's 7 FPS as the benchmark effectively highlights the speed gap, but omits contemporaneous attempts to accelerate two-stage methods.
There have been many attempts to build faster detectors by attacking each stage of the detection pipeline, but so far, significantly increased speed comes only at the cost of significantly decreased detection accuracy. Our work introduces the first deep network based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. This results in a significant improvement in speed for high-accuracy detection. The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage.
Paragraph function: states the core claim of novelty.
Logical role: establishes a causal chain between the problem (slow detection) and the solution (SSD).
Argumentative technique / potential weakness: the wording "the first" claims pioneering status, but YOLO was also a single-shot detector before this; SSD's real distinction lies in its multi-scale feature maps.
2. The SSD Architecture
The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. The early network layers are based on a standard architecture used for high quality image classification (truncated before any classification layers), which we call the base network. We then add auxiliary structure to the network to produce detections with the following key features: multi-scale feature maps for detection — we add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales.
Paragraph function: describes SSD's overall architecture.
Logical role: the core statement of the technical solution, introducing the concept of multi-scale feature-map detection.
Argumentative technique / potential weakness: treats the base network (VGG-16) as an off-the-shelf component and focuses on the added structure, making the incremental innovation easy to grasp.
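The non-maximum suppression step mentioned above can be sketched in a few lines. This is a minimal greedy version for illustration, not the paper's implementation; the 0.45 overlap threshold and the function names are assumptions.

```python
def iou(a, b):
    """Jaccard overlap (IoU) of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every remaining box that overlaps it above the threshold, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Because SSD emits a fixed-size collection of boxes per image regardless of content, a deduplication step like this is what turns the dense raw output into the final sparse detections.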
Convolutional predictors for detection — each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. For a feature layer of size m x n with p channels, the basic element for predicting parameters of a potential detection is a 3x3xp small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. Default boxes and aspect ratios — we associate a set of default bounding boxes with each feature map cell, for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed.
Paragraph function: details the prediction mechanism and the default-box design.
Logical role: fills in architectural details, explaining how detections are produced directly from multi-scale feature maps.
Argumentative technique / potential weakness: the default boxes are inspired by Faster R-CNN's anchor boxes, but deploying them across multiple feature-map levels effectively broadens the detection range.
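The convolutional tiling of default boxes can be illustrated with a short sketch. The square-root aspect-ratio parameterization (w = s·√ar, h = s/√ar) follows the paper's design, but this function, its name, and its default arguments are illustrative simplifications (one scale per feature map, square maps only):

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Tile default boxes over an m x m feature map, coordinates in [0, 1].

    Each cell centre gets one box per aspect ratio ar, with
    width = scale * sqrt(ar) and height = scale / sqrt(ar).
    Returns a list of [cx, cy, w, h] boxes.
    """
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            # The position of each box relative to its cell is fixed:
            # the box is centred on the cell.
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                boxes.append([cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)])
    return boxes
```

Running this for each feature map in the pyramid (small maps with large scales, large maps with small scales) yields the full fixed set of detector outputs that the 3x3xp prediction kernels score and refine.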
3. Training
The key difference between training SSD and training a typical detector that uses region proposals is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Once this assignment is determined, the loss function and back propagation are applied end-to-end. Training also involves choosing the set of default boxes and scales for detection as well as the hard negative mining and data augmentation strategies. Matching strategy: during training we need to determine which default boxes correspond to a ground truth detection. We begin by matching each ground truth box to the default box with the best jaccard overlap. Unlike MultiBox, we then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5).
Paragraph function: explains label assignment and the matching strategy used during training.
Logical role: follows the architecture description with the key technical details that make the model trainable.
Argumentative technique / potential weakness: the 0.5 jaccard threshold is an empirical choice; relaxing the matching criterion increases the number of positives but may introduce noise.
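The two-step matching rule described above (every ground truth claims its best-overlapping default box, then any default box with jaccard overlap above 0.5 is also matched) can be sketched as follows; the function name and list-based representation are illustrative, not from the paper:

```python
def iou(a, b):
    """Jaccard overlap (IoU) of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_default_boxes(defaults, ground_truths, threshold=0.5):
    """Assign each default box a ground-truth index, or -1 for background."""
    matches = [-1] * len(defaults)
    # Step 2 of the rule: match any default box whose overlap with some
    # ground truth exceeds the threshold.
    for d, box in enumerate(defaults):
        overlaps = [iou(box, gt) for gt in ground_truths]
        best = max(range(len(ground_truths)), key=lambda g: overlaps[g])
        if overlaps[best] > threshold:
            matches[d] = best
    # Step 1 of the rule: guarantee every ground truth gets at least its
    # single best-overlapping default box, regardless of the threshold.
    for g, gt in enumerate(ground_truths):
        best_d = max(range(len(defaults)), key=lambda d: iou(defaults[d], gt))
        matches[best_d] = g
    return matches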
Hard negative mining: after the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. This introduces a significant imbalance between the positive and negative training examples. Instead of using all the negative examples, we sort them using the highest confidence loss for each default box and pick the top ones so that the ratio between the negatives and positives is at most 3:1. We found that this leads to faster optimization and more stable training. Data augmentation is crucial to improving performance. We generate additional training examples by random cropping, horizontal flipping, and photometric distortions.
Paragraph function: explains the key technique for addressing the positive/negative sample imbalance.
Logical role: supplies the engineering details behind training stability, laying a reliable foundation for the subsequent experimental results.
Argumentative technique / potential weakness: the 3:1 ratio is a rule of thumb; the importance of data augmentation is verified in ablation experiments.
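The hard negative mining step can be sketched as a small selection function; the name, mask-based interface, and list types are illustrative assumptions, but the logic (sort unmatched boxes by confidence loss, keep at most 3 negatives per positive) follows the paragraph above:

```python
def mine_hard_negatives(conf_losses, positive_mask, neg_pos_ratio=3):
    """Select the highest-loss negatives so negatives <= ratio x positives.

    conf_losses:   per-default-box confidence loss
    positive_mask: True where the box was matched to a ground truth
    Returns a mask marking the negatives kept for the classification loss.
    """
    num_pos = sum(positive_mask)
    negatives = [i for i, p in enumerate(positive_mask) if not p]
    # Sort the unmatched boxes by descending confidence loss: the model's
    # most confidently wrong predictions are the "hardest" negatives.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    kept = set(negatives[:neg_pos_ratio * num_pos])
    return [i in kept for i in range(len(conf_losses))]
```

Only the positives plus this mined subset contribute to the confidence loss; the easy negatives (most of the thousands of default boxes) are simply ignored in that training step.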
4. Experiments
To understand the performance of our SSD model in detail, we carried out a comprehensive analysis using PASCAL VOC 2007. In all experiments, we use VGG-16 as the base network, pre-trained on ILSVRC CLS-LOC dataset. We compare against Fast R-CNN and Faster R-CNN. Our SSD300 model achieves 74.3% mAP at 59 FPS while SSD512 achieves 76.9% mAP at 22 FPS. For reference, Faster R-CNN achieves 73.2% mAP at 7 FPS. This means SSD300 is 8.4x faster than Faster R-CNN while achieving higher mAP. Our SSD512 model outperforms Faster R-CNN in both accuracy and speed.
Paragraph function: reports the core experimental results against baseline methods.
Logical role: validates the performance claims from the abstract with quantitative data.
Argumentative technique / potential weakness: presenting the speed advantage as a multiple (8.4x) is highly persuasive, but whether the GPU conditions are identical across models deserves scrutiny.
We also evaluate SSD on the COCO dataset. We compare against the state-of-the-art Faster R-CNN and ION on various metrics including mAP@0.5 and mAP@[0.5:0.95]. Our SSD300 achieves 23.2% mAP@[0.5:0.95] compared to Faster R-CNN's 21.9%. However, SSD has relatively worse performance on smaller objects compared to larger objects. This is not surprising because those small objects may not even have any information at the very top layers. Increasing the input size (e.g., from 300x300 to 512x512) can help improve detecting small objects.
Paragraph function: reports the COCO results and concedes the weakness on small objects.
Logical role: demonstrates cross-dataset generalization while honestly confronting a limitation.
Argumentative technique / potential weakness: proactively admitting a weakness (poorer small-object detection) and offering a simple mitigation (larger input size) increases the paper's credibility.
5. Conclusions
This paper introduces SSD, a fast single-shot object detector for multiple categories. A key feature of our model is the use of multi-scale convolutional bounding box outputs attached to multiple feature maps at the top of the network. This representation allows us to efficiently model the space of possible object shapes. We demonstrate that given appropriate training strategies, a larger number of carefully chosen default bounding boxes results in improved performance. We build SSD models with at least an order of magnitude more box predictions, sampling location, scale, and aspect ratio, than existing methods. We demonstrate that SSD achieves accuracy that is competitive with state-of-the-art object detectors while being significantly faster, making it an ideal candidate for real-time object detection systems.
Paragraph function: summarizes the core contributions and emphasizes practical value.
Logical role: echoes the abstract's claims, closing on SSD as the ideal candidate for real-time detection.
Argumentative technique / potential weakness: the "at least an order of magnitude more" box predictions serve as the differentiating argument, reinforcing the comprehensiveness of the SSD approach.
Overview of the argument structure

Problem: two-stage detectors are too slow ➔ Claim: single-shot detection + multi-scale features ➔ Evidence: accuracy and speed on VOC/COCO ➔ Rebuttal: small objects remain a weakness ➔ Conclusion: an ideal solution for real-time detection
Core claim
By predicting bounding boxes and class scores directly on multi-scale feature maps, SSD eliminates the proposal-generation step, greatly increasing detection speed while maintaining high accuracy.
Strongest argument
SSD300 reaches 74.3% mAP at 59 FPS, 8.4x faster and more accurate than Faster R-CNN (7 FPS, 73.2% mAP); the quantitative data firmly support the dual claim of speed and accuracy.
Weakest link
Detection of small objects is comparatively poor; the authors acknowledge the problem and propose larger input sizes as a mitigation, but offer no deeper architectural remedy.