Simultaneous Detection and Segmentation

Abstract

We aim to detect all instances of a category in an image and, for each instance, mark the pixels that belong to it. We call this task Simultaneous Detection and Segmentation (SDS). Unlike classical bounding box detection, SDS requires a segmentation and not just a box. Unlike classical semantic segmentation, we require individual object instances. We build on recent work that uses convolutional neural networks to classify category-independent region proposals (R-CNN), introducing a novel architecture tailored for SDS.

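Paragraph function: Defines the SDS task and distinguishes it clearly from detection and semantic segmentation.

Logical role: Pins down what is unique about the task through a two-way contrast: it is not just bounding boxes, and not just semantic segmentation.

Argumentative technique / potential weaknesses: The task definition is clear and forceful; SDS became an important precursor to later research on "instance segmentation".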
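The two-way contrast can be made concrete with a toy example. The sketch below is illustrative only: the 6x6 grid, the two hand-drawn instances, and the helper `mask_to_box` are assumptions introduced here, not part of the paper. It shows what each task would output for two touching objects of the same category.

```python
import numpy as np

# Toy 6x6 image with two touching instances of one (hypothetical) category.
instance_a = np.zeros((6, 6), dtype=bool)
instance_a[1:3, 1:3] = True          # 2x2 object
instance_b = np.zeros((6, 6), dtype=bool)
instance_b[3:5, 2:5] = True          # 2x3 object, directly below instance_a


def mask_to_box(mask):
    """Tightest inclusive box (row_min, col_min, row_max, col_max) around a mask."""
    rows, cols = np.nonzero(mask)
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())


# Bounding box detection: one box per instance, no pixel membership.
boxes = [mask_to_box(instance_a), mask_to_box(instance_b)]

# Semantic segmentation: a single label map; the two instances are merged.
semantic_map = instance_a | instance_b

# SDS: one pixel mask per detected instance.
sds_output = [instance_a, instance_b]
```

Because the two objects touch, the merged label map alone does not say where one instance ends and the other begins, while the boxes alone do not say which pixels inside each box belong to the object.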

1. Introduction

Traditional object detection outputs a bounding box for each detected object, which provides only a coarse localization. Semantic segmentation assigns each pixel a class label but does not distinguish between different instances of the same class. The SDS task bridges these two by requiring instance-level segmentation: each detected object must be accompanied by a pixel-level mask. This is critical for applications like robotic manipulation, autonomous driving, and image editing.

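Paragraph function: Explains the practical value of the SDS task.

Logical role: Uses concrete application scenarios to show why instance-level segmentation is a necessity rather than merely an academic refinement.

Argumentative technique / potential weaknesses: Listing application scenarios strengthens the case for the work's practical value.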

2. Method

Our approach has two stages. First, we use MCG (Multiscale Combinatorial Grouping) to generate candidate region proposals, and classify them using CNN features extracted from both the bounding box region and the foreground region. The concatenation of these two feature types provides complementary information: the bounding box features capture context while the foreground features capture the object's appearance.

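Paragraph function: Describes the two-stream feature extraction strategy.

Logical role: The two-stream design (bounding box plus foreground) is the paper's key technical innovation.

Argumentative technique / potential weaknesses: The "context versus appearance" complementarity argument is intuitive and plausible, but the computational cost of extracting two feature streams deserves attention.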
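The two-stream idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_features` is a stand-in for a CNN (it only computes per-channel mean and standard deviation), and blanking background pixels with a constant `fill_value` is one hypothetical way to build the foreground input.

```python
import numpy as np


def toy_features(patch):
    """Stand-in for a CNN feature extractor (assumption): per-channel mean and std."""
    flat = patch.reshape(-1, patch.shape[-1]).astype(np.float64)
    return np.concatenate([flat.mean(axis=0), flat.std(axis=0)])


def two_stream_features(image, mask, fill_value=0.0):
    """Concatenate features of the box crop with features of the foreground-only crop."""
    rows, cols = np.nonzero(mask)
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    # Box stream: everything inside the bounding box, including nearby context.
    box_crop = image[r0:r1, c0:c1]
    # Foreground stream: same crop, with pixels outside the region blanked out.
    region_crop = np.where(mask[r0:r1, c0:c1, None], box_crop, fill_value)
    return np.concatenate([toy_features(box_crop), toy_features(region_crop)])


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))            # hypothetical RGB image
mask = np.zeros((8, 8), dtype=bool)      # hypothetical region proposal
mask[2:6, 3:7] = True
mask[2, 3] = False                       # make the region non-rectangular
features = two_stream_features(image, mask)
```

The first half of `features` describes the box and the second half describes the region. A classifier trained on the concatenation sees both, which is the complementarity the paragraph argues for; it also means two feature extraction passes per proposal.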

After initial detection, we refine the segmentation using category-specific, top-down figure-ground predictions. For each detected region, we train a CNN to predict which pixels belong to the object (figure) versus background (ground). This refinement step is critical: it improves segmentation quality by 7 points in the SDS metric, a 16% relative improvement over the baseline.

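Paragraph function: Shows the substantial effect of top-down refinement.

Logical role: The 7-point improvement underscores how indispensable the refinement step is.

Argumentative technique / potential weaknesses: Reporting both the absolute gain (7 points) and the relative gain (16%) makes the numbers easier to interpret.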
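The absolute and relative figures can be cross-checked against each other. The snippet below is a small worked calculation, not a result from the paper: the function name `implied_baseline` is introduced here for illustration, and since both quoted numbers are rounded, the derived baseline is only approximate.

```python
def implied_baseline(absolute_gain, relative_gain):
    """Baseline score implied by an absolute gain and the matching relative gain."""
    # relative_gain = absolute_gain / baseline, so baseline = absolute_gain / relative_gain
    return absolute_gain / relative_gain


# 7 points reported as a 16% relative improvement.
baseline = implied_baseline(7.0, 0.16)   # about 43.75 points
refined = baseline + 7.0                 # about 50.75 points
```

A baseline in the low-to-mid 40s is what the two quoted numbers jointly imply; the exact value depends on how the original figures were rounded.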

4. Experiments

We evaluate on PASCAL VOC 2012 using the SDS metric (AP at IoU threshold of 0.5 measured on segmentations). Our method achieves state-of-the-art results on both the SDS task and the classical semantic segmentation task. On SDS, we obtain a mean AP of 49.7%. On semantic segmentation, our approach provides a 5 point boost (10% relative) over the previous state-of-the-art. On object detection with bounding boxes, we also achieve competitive performance.

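Paragraph function: Provides quantitative results on three tasks.

Logical role: Strong results on SDS, semantic segmentation, and detection demonstrate the breadth of the method.

Argumentative technique / potential weaknesses: Evaluating on multiple tasks makes the method more credible, but an absolute AP of 49.7% still leaves a great deal of room for improvement.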
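The SDS metric described above, AP with IoU measured on segmentation masks rather than boxes, can be sketched as below. This is a simplified illustration and not the official evaluation code: it handles a single category on a single image, uses a strict `>` comparison at the threshold (an assumption), and computes AP as the area under the monotone precision envelope. The masks and scores in the example are made up.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0


def average_precision(detections, gt_masks, iou_threshold=0.5):
    """AP for scored mask detections against ground-truth masks (one category)."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = [False] * len(gt_masks)
    hits = []
    for _, mask in ranked:
        ious = [mask_iou(mask, gt) for gt in gt_masks]
        best = int(np.argmax(ious)) if ious else -1
        # True positive only if the best-overlapping ground truth is still unclaimed.
        if best >= 0 and ious[best] > iou_threshold and not matched[best]:
            matched[best] = True
            hits.append(1.0)
        else:
            hits.append(0.0)
    cum_tp = np.cumsum(np.array(hits))
    precision = cum_tp / np.arange(1, len(hits) + 1)
    recall = cum_tp / max(len(gt_masks), 1)
    # Pad, take the monotone envelope of precision, then integrate over recall.
    mrec = np.concatenate([[0.0], recall, [1.0]])
    mpre = np.concatenate([[0.0], precision, [0.0]])
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    steps = np.nonzero(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[steps + 1] - mrec[steps]) * mpre[steps + 1]))


def square(r0, r1, c0, c1, size=8):
    m = np.zeros((size, size), dtype=bool)
    m[r0:r1, c0:c1] = True
    return m


gt = [square(0, 4, 0, 4), square(4, 8, 4, 8)]
dets = [
    (0.9, square(0, 4, 0, 4)),   # exact match with the first object
    (0.8, square(0, 4, 4, 8)),   # overlaps neither object: false positive
    (0.7, square(4, 8, 5, 8)),   # IoU 0.75 with the second object
]
ap = average_precision(dets, gt)
```

In this toy case the false positive ranked second lowers the AP below 1 even though both objects are eventually found. The mean AP quoted in the text would be the average of such per-category values over all categories and the full test set.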

5. Conclusion

We have introduced the Simultaneous Detection and Segmentation (SDS) task and proposed an effective approach combining region-based CNN features with top-down figure-ground refinement. Our results demonstrate that jointly addressing detection and segmentation leads to better performance on both tasks. We believe SDS will become an important task as the community moves beyond bounding boxes toward pixel-level understanding.

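Paragraph function: Summarizes the contributions and predicts the future importance of the task.

Logical role: SDS did become the forerunner of "instance segmentation"; later work such as Mask R-CNN continues directly in this direction.

Argumentative technique / potential weaknesses: The "from bounding boxes to pixel level" vision accurately anticipated where computer vision was heading and shows the authors' foresight.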

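Argument Structure Overview

Bounding boxes fall short: no pixel-level precision
→ SDS task definition: detection + instance segmentation
→ Two-stream CNN features: bounding box + foreground
→ Top-down refinement: figure-ground prediction
→ SDS / segmentation / detection: state of the art on three tasks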

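Core Claim

Handling object detection and pixel-level segmentation simultaneously (SDS) yields better overall performance than handling either task alone, and top-down refinement is the key to improving segmentation quality.

Strongest Argument

The 7-point (16% relative) improvement from top-down refinement provides strong empirical support for the core method. The comprehensive evaluation on three related tasks adds to the method's credibility.

Weakest Link

The method depends on the quality of the region proposals supplied by MCG, and the overall pipeline involves multiple stages and is not yet trained end to end.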