DETR: End-to-End Object Detection with Transformers

Abstract

We present a new method that views object detection as a direct set prediction problem. Our approach, called DEtection TRansformer (DETR), streamlines the detection pipeline, effectively removing the need for many hand-designed components like non-maximum suppression (NMS) or anchor generation that explicitly encode prior knowledge about the task. The main ingredients of DETR are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel.

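Paragraph function: Overview of the whole paper. Recasts object detection as a set prediction problem and introduces DETR's core design.

Logical role: By framing the contribution as "removing hand-designed components", the abstract positions DETR as a paradigm shift in detection.

Argumentative technique / potential weaknesses: Making "a simpler pipeline" the central selling point is very appealing, but whether that simplification costs accuracy has to be verified by the experiments that follow.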

1. Introduction

The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest. Modern detectors address this set prediction task in an indirect way, by defining surrogate regression and classification problems on a large set of proposals, anchors, or window centers. Their performance is significantly influenced by postprocessing steps to collapse near-duplicate predictions, by the design of the anchor sets, and by the heuristics that assign target boxes to anchors. To simplify these pipelines, we propose a direct set prediction approach to bypass the surrogate tasks.

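Paragraph function: Problem statement. Criticizes the indirect approach of existing detectors, which depends on many hand-designed components.

Logical role: Lists three hand-designed dependencies (postprocessing that collapses near-duplicates, anchor set design, and target-assignment heuristics) to establish how serious the problem is and to motivate an end-to-end alternative.

Argumentative technique / potential weaknesses: Framing mature components such as anchors and NMS as defects is a bold rhetorical move. These components have been refined over many years, and the cost of removing them may be underestimated.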

Our DEtection TRansformer (DETR) makes predictions through two key ingredients: a set loss function that performs bipartite matching between predicted and ground-truth objects, eliminating the need for NMS, and a transformer-based architecture that models all pairwise interactions between elements using attention mechanisms. The self-attention mechanisms in the transformer allow DETR to perform global reasoning about the image, enabling it to capture long-range dependencies and remove duplicate detections without NMS.

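Paragraph function: Method overview. Introduces DETR's two core mechanisms.

Logical role: Answers the problem raised in the previous paragraph with a concrete solution: bipartite matching replaces NMS, and attention replaces local processing.

Argumentative technique / potential weaknesses: Neatly links the transformer's global reasoning to the detector's need to remove duplicates. However, the quadratic complexity of global attention may limit its use on high-resolution images.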
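To make the quadratic cost noted above concrete, here is a small back-of-the-envelope sketch. It assumes the backbone downsamples the image by a factor of 32 before the features are flattened into tokens; that stride, and the function name `attention_entries`, are illustrative assumptions and are not stated in the text.

```python
def attention_entries(height, width, stride=32):
    """Return (tokens, pairwise entries) for one self-attention map.

    Assumes the feature map is the image downsampled by `stride`
    (an assumed value) and that every token attends to every token.
    """
    tokens = (height // stride) * (width // stride)
    return tokens, tokens * tokens

for size in [(512, 512), (1024, 1024)]:
    print(size, attention_entries(*size))
```

Under these assumptions, doubling the side length of the image quadruples the number of tokens and multiplies the number of pairwise attention entries by sixteen.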

2. The DETR Model

2.1 Object Detection Set Prediction Loss

DETR infers a fixed-size set of N predictions, in a single pass through the decoder, where N is set to be significantly larger than the typical number of objects in an image. One of the main difficulties is to score predicted objects with respect to the ground truth. Our loss produces an optimal bipartite matching between predicted and ground truth objects, and then optimizes object-specific losses. We search for a permutation of N elements with the lowest cost using the Hungarian algorithm. The matching cost takes into account both the class prediction and the similarity of predicted and ground truth boxes.

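Paragraph function: Loss design. Describes the core bipartite matching mechanism.

Logical role: This is the mathematical basis that lets DETR train end to end: the Hungarian algorithm guarantees a one-to-one correspondence between predictions and annotations.

Argumentative technique / potential weaknesses: Borrowing the well-established Hungarian algorithm from combinatorial optimization is an elegant choice. However, N must be fixed in advance and set well above the actual number of objects, and this redundancy is not computationally optimal.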
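The matching step can be illustrated with a minimal sketch. To stay dependency-free it does not implement the Hungarian algorithm; it finds the same optimal one-to-one assignment by brute force over all candidate assignments, which is only practical for toy sizes (a real implementation would use a polynomial-time solver such as SciPy's `linear_sum_assignment`). The cost function is an assumption for illustration: negative class probability plus a weighted L1 box distance. The dictionary layout, the `box_weight` parameter, and the function names are hypothetical.

```python
from itertools import permutations

def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))

def matching_cost(pred, target, box_weight=1.0):
    """Cost of assigning one prediction to one ground-truth object.

    Lower is better: a high probability for the target class and a small
    box distance both reduce the cost. `box_weight` is an assumed knob.
    """
    class_term = -pred["probs"][target["label"]]
    box_term = box_weight * l1_distance(pred["box"], target["box"])
    return class_term + box_term

def match(preds, targets):
    """Return (assignment, cost), where assignment[j] is the index of the
    prediction matched to targets[j]. Brute force, small inputs only."""
    best_assignment, best_cost = None, float("inf")
    for candidate in permutations(range(len(preds)), len(targets)):
        cost = sum(matching_cost(preds[i], t) for i, t in zip(candidate, targets))
        if cost < best_cost:
            best_assignment, best_cost = candidate, cost
    return best_assignment, best_cost

# N = 4 predictions, 2 ground-truth objects; boxes are (cx, cy, w, h) in [0, 1].
preds = [
    {"probs": [0.1, 0.8], "box": (0.50, 0.50, 0.20, 0.20)},
    {"probs": [0.7, 0.2], "box": (0.20, 0.30, 0.10, 0.10)},
    {"probs": [0.6, 0.3], "box": (0.52, 0.49, 0.20, 0.22)},
    {"probs": [0.5, 0.5], "box": (0.90, 0.90, 0.05, 0.05)},
]
targets = [
    {"label": 0, "box": (0.20, 0.30, 0.10, 0.10)},
    {"label": 1, "box": (0.50, 0.50, 0.20, 0.20)},
]
assignment, cost = match(preds, targets)
unmatched = [i for i in range(len(preds)) if i not in assignment]
print(assignment, unmatched)
```

In this toy case, prediction 2 is a near-duplicate of prediction 0, yet only one of them can be matched to the second object. The leftover predictions (indices 2 and 3) would be trained toward a "no object" outcome, which is how the one-to-one constraint discourages duplicates without NMS.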

2.2 DETR Architecture

The overall DETR architecture is surprisingly simple and consists of three main components: a CNN backbone to extract a compact feature representation, an encoder-decoder transformer, and a simple feed forward network (FFN) that makes the final detection prediction. The encoder receives a flattened sequence of features supplemented with positional encodings and applies self-attention. The decoder receives N learned object queries and attends to the encoder output through cross-attention, producing N output embeddings that are independently decoded into box coordinates and class labels.

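Paragraph function: Architecture description. Walks through the full DETR pipeline from feature extraction to prediction output.

Logical role: The phrase "surprisingly simple" stresses the elegance of the design: a complex detection task is reduced to a combination of three standard components.

Argumentative technique / potential weaknesses: The learned object query is DETR's most innovative concept; each query can be read as a probe that searches the image for a particular object. The interpretability of this design nonetheless remains an open question.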
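The decoder's cross-attention step, in which N object queries attend to the flattened encoder output, can be sketched in a few lines. This is a deliberately reduced illustration and not the model's actual layer: it uses a single head, omits the learned query/key/value projections and positional encodings, and the names `cross_attention`, `memory`, and `object_queries` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, memory):
    """Single-head scaled dot-product attention without learned projections.

    queries: N vectors (object queries); memory: L vectors (encoder output).
    Returns N output vectors and the N x L attention weights.
    """
    dim = len(memory[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * mi for qi, mi in zip(q, m)) / math.sqrt(dim) for m in memory]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of the memory vectors.
        outputs.append([sum(wj * m[d] for wj, m in zip(w, memory)) for d in range(dim)])
    return outputs, weights

# Toy sizes: L = 3 flattened feature positions, N = 2 object queries, dim = 2.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
object_queries = [[4.0, 0.0], [0.0, 4.0]]
outputs, weights = cross_attention(object_queries, memory)
print(len(outputs), len(weights[0]))
```

The point of the sketch is the shapes: N queries go in, N embeddings come out, and each embedding is a mixture of all L encoder positions, so every query can draw on the whole image before the FFN decodes it into a box and a class.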

3. Experiments

We evaluate DETR on the COCO 2017 detection dataset. With a ResNet-50 backbone, DETR achieves 42.0 AP, matching the 42.0 AP of a well-established Faster R-CNN baseline whose training schedule was optimized. With a ResNet-101 backbone, DETR achieves 43.5 AP. Notably, DETR significantly outperforms Faster R-CNN on large objects, gaining +7.8 AP_L, likely due to the global reasoning performed by the transformer's self-attention. However, DETR performs worse on small objects, achieving only 20.5 AP_S compared to Faster R-CNN's 22.5.

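Paragraph function: Quantitative evaluation. A comprehensive comparison with Faster R-CNN on COCO.

Logical role: Shows that DETR matches Faster R-CNN on overall AP and leads by a wide margin on large objects, establishing that the method is competitive.

Argumentative technique / potential weaknesses: Openly reporting the deficit on small objects reflects sound scientific practice. The +7.8 AP advantage on large objects and the global-reasoning explanation form a self-consistent account.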

We further demonstrate the versatility of DETR by extending it to panoptic segmentation. By adding a simple segmentation head on top of the decoder outputs, DETR achieves competitive panoptic quality (PQ) results, demonstrating that the unified architecture can be naturally extended to related tasks. We also perform extensive ablation studies on the number of encoder/decoder layers, the role of positional encodings, and the importance of the FFN in the transformer, confirming that each component contributes to the final performance.

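Paragraph function: Extension and ablation. Shows how well the architecture generalizes and what each component contributes.

Logical role: Moves from single-task performance to multi-task generalization, reinforcing DETR's positioning as a general detection framework.

Argumentative technique / potential weaknesses: The panoptic segmentation extension is achieved by "adding a simple segmentation head", which highlights the modularity of the architecture. However, training takes 300 epochs, far more than Faster R-CNN, and this efficiency problem is not emphasized.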

4. Conclusion

We have presented DETR, a new design for object detection systems based on transformers and bipartite matching loss for direct set prediction. DETR achieves comparable results to an optimized Faster R-CNN baseline on COCO while being conceptually simpler and requiring no hand-designed components. DETR is easily extensible to panoptic segmentation, demonstrating its potential as a unified architecture for visual recognition. We believe that the transformer-based approach opens up new possibilities for end-to-end object detection and beyond.

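Paragraph function: Summary. Restates DETR's core value and forward-looking significance.

Logical role: Elevates the technical contribution to the level of a paradigm shift, positioning DETR as a new direction for detection.

Argumentative technique / potential weaknesses: Follow-up work such as Deformable DETR and DAB-DETR shows that DETR did open a new paradigm. However, the conclusion does not adequately discuss the original DETR's training efficiency or its weakness on small objects.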

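Argument structure overview

Problem: detectors depend on hand-designed components
→ Claim: set prediction replaces the surrogate tasks
→ Method: transformer + bipartite matching
→ Evidence: on par with Faster R-CNN on COCO
→ Conclusion: a new paradigm for end-to-end detection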

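Core claim

With a transformer encoder-decoder architecture and a bipartite matching loss, object detection can be reduced to a direct set prediction problem, with no need for hand-designed components such as NMS or anchors.

Strongest point of the argument

A minimal architecture matches the performance of a heavily optimized Faster R-CNN and substantially outperforms it on large objects. The seamless extension to panoptic segmentation further supports the generality of the architecture.

Weakest point of the argument

Training takes 300 epochs (Faster R-CNN needs only 36), performance on small objects falls short, and the fixed set of N queries is conceptually inflexible.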