Microsoft COCO: Two-Column Annotation

Abstract

We present a new large-scale dataset for advancing the state of the art in object recognition by placing the question of recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement.

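Paragraph function: Opens by stating the dataset's core idea, namely that object recognition belongs in the context of scene understanding.

Logical role: The abstract lets the numbers speak (91 object categories, 2.5 million instances, 328k images) to establish credibility through scale.

Argument technique / potential weakness: "Recognizable by a 4 year old" is a rhetorically effective image, signaling that the dataset targets basic visual competence rather than rare objects. Still, 91 categories is few next to the 1,000 classes of the ImageNet ILSVRC benchmark, let alone the more than 20,000 categories of the full ImageNet.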

1. Introduction

One of the primary goals of computer vision is the understanding of visual scenes. Scene understanding involves numerous tasks including recognizing what objects are present, localizing the objects in 2D and 3D, determining the objects' and scene's attributes, characterizing relationships between objects and providing a semantic description of the scene. Current object recognition datasets address some of these tasks, but not all, and they do not place object recognition in the context of scene understanding.

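Paragraph function: Lays out the full scope of scene understanding and points to where existing datasets fall short.

Logical role: Builds the research motivation in three steps: the overall goal, the list of subtasks, and the gaps in existing work.

Argument technique / potential weakness: Listing the subtasks in full both conveys the complexity of the problem and justifies COCO's multi-faceted annotation.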

We address this by collecting a dataset with the following properties: (1) a large number of object instances per category, (2) a large number of categories, (3) a large number of instances with per-instance segmentation, (4) images showing objects in their natural contexts, and (5) five captions per image. The combination of these properties makes the COCO dataset unique in its breadth and depth of annotation.

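Paragraph function: Defines COCO's distinctive position through five design principles.

Logical role: The numbered list presents the design criteria clearly and gives a yardstick for the technical decisions in later sections.

Argument technique / potential weakness: The structured five-point presentation lets readers grasp the dataset's differentiating strengths quickly; the argument is concise and forceful.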

Several datasets have been instrumental in advancing object recognition. PASCAL VOC provides high-quality annotations for 20 categories, while ImageNet offers over 14 million images spanning more than 20,000 categories. However, PASCAL VOC has limited categories and ImageNet's annotations are primarily bounding boxes without segmentation masks. SUN provides scene-level annotations but lacks per-instance segmentation. Our COCO dataset bridges these gaps by combining rich per-instance annotations with natural scene contexts.

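Paragraph function: Compares the strengths and weaknesses of existing datasets to position COCO's distinctive value.

Logical role: A systematic comparison of what each dataset does well and where it falls short leads naturally to COCO's complementary role.

Argument technique / potential weakness: Presenting each dataset's merits fairly before naming its shortcomings avoids the breach of academic etiquette that disparaging prior work would be.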

3. Dataset Design

The COCO dataset collection process involved several stages. First, we selected 91 common object categories that are easily recognizable and have a clear visual identity. Category selection was guided by the principle that objects should be relevant to everyday life and should occur frequently in natural images. We then used Amazon Mechanical Turk (AMT) workers to collect images and annotate them. Each image was annotated with instance-level segmentation masks, bounding boxes, and five textual captions.

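Paragraph function: Details the dataset construction pipeline and the annotation specification.

Logical role: Demonstrates the rigor of the construction process, laying the groundwork for later claims about reproducibility and quality assurance.

Argument technique / potential weakness: Crowd annotation on AMT introduces quality-control challenges; the paper needs to explain how annotation consistency is ensured.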
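
To make the three annotation types concrete (instance-level segmentation masks, bounding boxes, and captions), here is a minimal Python sketch of how one annotated image could be represented. The class and field names (`Instance`, `AnnotatedImage`, `polygon`, and so on) are illustrative assumptions, not the schema of the released annotation files, and the bounding box is simply derived from the polygon outline.

```python
from dataclasses import dataclass, field

Point = tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class Instance:
    """One labeled object: a category plus a polygon outlining its mask."""
    category: str
    polygon: list[Point]  # hypothetical stand-in for a segmentation mask

    def bounding_box(self) -> tuple[float, float, float, float]:
        """Return (x_min, y_min, width, height) of the tightest enclosing box."""
        xs = [x for x, _ in self.polygon]
        ys = [y for _, y in self.polygon]
        return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


@dataclass
class AnnotatedImage:
    """An image together with its per-instance annotations and captions."""
    file_name: str
    instances: list[Instance] = field(default_factory=list)
    captions: list[str] = field(default_factory=list)


# Made-up example data, for illustration only.
image = AnnotatedImage(
    file_name="example.jpg",  # placeholder name
    instances=[
        Instance("dog", [(10.0, 20.0), (60.0, 20.0), (60.0, 90.0), (10.0, 90.0)]),
        Instance("frisbee", [(70.0, 15.0), (95.0, 15.0), (82.5, 30.0)]),
    ],
    captions=["A dog chases a frisbee."],  # a full record would hold five
)
boxes = [inst.bounding_box() for inst in image.instances]
print(boxes)
```

Keeping the mask as the primary annotation and deriving the box from it reflects the paragraph's emphasis on per-instance segmentation: a box can always be recovered from a mask, but not the other way around.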

A key design choice is our emphasis on non-iconic images. Unlike datasets that primarily contain iconic views of objects (centered, well-lit, unoccluded), COCO images depict objects in natural, often cluttered environments. This makes the recognition task more challenging but also more representative of real-world visual understanding. Statistics show that COCO images contain an average of 7.7 object instances per image, compared to fewer than 3 in PASCAL VOC.

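Paragraph function: Explains non-iconic imagery, the core design idea.

Logical role: Distinguishes COCO from existing datasets at the level of design philosophy rather than by size alone.

Argument technique / potential weakness: The contrast between 7.7 and fewer than 3 instances per image gives concise, effective support to the claim that COCO is closer to the real world.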
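
The density statistic is just the number of labeled instances divided by the number of images. The sketch below, with a hypothetical helper `mean_instances_per_image`, applies that ratio to the rounded totals quoted in the abstract and to a small made-up set of per-instance records. It is a sanity check under those assumptions, not a reproduction of the paper's computation.

```python
from collections import Counter


def mean_instances_per_image(num_instances: int, num_images: int) -> float:
    """Average number of labeled instances per image."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return num_instances / num_images


# Rounded totals quoted in the abstract: 2.5 million instances in 328k images.
headline_density = mean_instances_per_image(2_500_000, 328_000)

# Toy per-instance records (made up): each entry names the image an instance is in.
toy_instance_image_ids = ["img_a"] * 9 + ["img_b"] * 5 + ["img_c"] * 1
instances_in_image = Counter(toy_instance_image_ids)  # e.g. img_a -> 9

# Note: this counts only images that have at least one labeled instance.
toy_density = mean_instances_per_image(
    len(toy_instance_image_ids), len(instances_in_image)
)

print(f"headline: {headline_density:.2f} instances per image")
print(f"toy data: {toy_density:.2f} instances per image")
```

The rounded totals give about 7.62 instances per image, which is in line with the reported 7.7 once the rounding of both totals is taken into account.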

4. Dataset Analysis and Evaluation

We provide extensive analysis of the dataset properties. COCO contains 328,000 images with 2.5 million labeled instances across 91 categories. The 2014 release is split into training (83k images), validation (41k images), and test (41k images) sets; the cumulative 2015 release roughly doubles each split, to about 165k training, 81k validation, and 81k test images, which accounts for the 328k total. We also establish baseline results for detection and segmentation tasks using state-of-the-art methods. The results show that current methods perform significantly worse on COCO than on PASCAL VOC, indicating the increased difficulty and room for improvement.

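Paragraph function: Provides dataset-scale statistics and baseline experimental results.

Logical role: Uses the fact that methods perform worse on COCO to argue that the dataset is challenging, which indirectly supports its value.

Argument technique / potential weakness: The chain "worse performance = more challenging = more valuable" is persuasive, but the performance gap could also stem from differences in annotation standards.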

5. Conclusion

We have introduced Microsoft COCO, a new large-scale dataset for object detection, segmentation, and captioning. Through its emphasis on non-iconic imagery, rich per-instance annotations, and natural scene contexts, COCO provides a challenging benchmark that pushes the boundaries of current recognition systems. We believe COCO will be an important resource for the computer vision community for years to come.

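Paragraph function: Summarizes the dataset's positioning and vision.

Logical role: Closes with a forward-looking statement, implying that COCO will become a benchmark with long-term influence.

Argument technique / potential weakness: In hindsight, the claim that COCO would be "an important resource ... for years to come" has been borne out: COCO did become one of the most important benchmarks in computer vision.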

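Argument structure overview

Need for scene understanding: beyond recognition in isolation
→ Shortcomings of existing datasets: no scene context
→ COCO design principles: non-iconic / per-instance / natural scenes
→ Large-scale crowd annotation: 328k images / 2.5 million instances
→ Challenging benchmark: drives progress in recognition systems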

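Core claim

Place object recognition in the context of scene understanding, pairing non-iconic images with rich per-instance annotations to build a benchmark dataset that is closer to the real world.

Strongest argument

Detailed dataset statistics and the cross-dataset comparison (COCO vs PASCAL VOC) clearly show the effect of the design choices. COCO has since been validated as one of the most important benchmarks in the field.

Weakest link

With 91 object categories, COCO remains limited next to the scale of ImageNet; the quality control for crowd annotation is described, but annotation consistency is hard to verify fully.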