Abstract
In this paper, we present an open-set object detector, called Grounding DINO, which marries the Transformer-based detector DINO with grounded pre-training and can detect arbitrary objects given human inputs such as category names or referring expressions. The key to open-set object detection is introducing language into a closed-set detector for open-set concept generalization. To fuse the language and vision modalities effectively, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution comprising a feature enhancer, language-guided query selection, and a cross-modality decoder. Grounding DINO achieves 52.5 AP on the COCO zero-shot transfer benchmark and sets a new record on the ODinW zero-shot benchmark with a mean of 26.1 AP.
Paragraph function
Whole-paper overview: situates the open-set detection problem and proposes a three-phase fusion solution.
Logical role
"Marrying DINO with grounded pre-training" precisely captures the essence of the method: standing on the shoulders of two giants.
Argumentation / potential weaknesses
The 52.5 AP zero-shot COCO result is highly persuasive, but the claim of "arbitrary objects" may be limited by the linguistic coverage of the training data.
1. Introduction
Object detection has long been studied under the closed-set setting, where the detector is trained and evaluated on a fixed set of object categories. This setting severely limits the applicability of detectors in real-world scenarios where novel objects appear frequently and cannot all be anticipated at training time. Open-set object detection addresses this limitation by allowing the detector to recognize objects from categories not seen during training, guided by natural language descriptions. Recent advances in vision-language pretraining have opened new possibilities for open-set detection, but existing approaches often struggle to achieve strong performance on both closed-set and open-set benchmarks simultaneously.
Paragraph function
Establishes the research landscape: the evolution from closed-set to open-set detection.
Logical role
First establishes the limitations of the closed-set setting, then introduces the open-set setting as the remedy, and finally points out the shortcomings of existing open-set methods.
Argumentation / potential weaknesses
The "closed-set vs. open-set" framing cleanly defines the research motivation.
2. Method
Our approach builds upon DINO, a state-of-the-art Transformer-based detector, and introduces tight cross-modality fusion at multiple stages. We divide the detection pipeline into three phases: feature extraction and enhancement, query initialization and selection, and box prediction and refinement. In the first phase, we employ a feature enhancer that performs bidirectional cross-attention between image and text features, allowing both modalities to be enriched by each other. Unlike approaches that only use language to query visual features, our bidirectional fusion allows visual context to also improve language understanding, leading to better grounding of ambiguous or context-dependent expressions.
Paragraph function
Lays out the core method: the three-phase pipeline and bidirectional cross-modality fusion.
Logical role
"Bidirectional fusion" is the method's key differentiator, addressing the information asymmetry of unidirectional fusion.
Argumentation / potential weaknesses
The structured three-phase description makes the method easy to follow, but the computational cost of stacked cross-attention layers may be significant.
In the second phase, we introduce language-guided query selection. Instead of using learnable queries that are fixed regardless of the input text, our approach selects the most relevant features from the enhanced image features based on their similarity to the text input. This ensures that the decoder focuses on image regions most likely to contain the objects described by the language input. In the third phase, a cross-modality decoder iteratively refines the predicted boxes while maintaining cross-attention between object queries and both image and text features, enabling precise localization guided by linguistic context.
Paragraph function
Describes phases two and three: language-guided queries and cross-modality decoding.
Logical role
The three phases progress step by step, fusing language and vision from coarse to fine.
Argumentation / potential weaknesses
Language-guided query selection elegantly answers the question of "where to look", but its ability to handle long or compositionally complex text descriptions may be limited.
3. Experiments
We evaluate Grounding DINO on COCO, LVIS, ODinW, and RefCOCO/+/g benchmarks. On COCO zero-shot detection, Grounding DINO achieves 52.5 AP without any COCO training data, surpassing previous open-set detectors including GLIP (49.8 AP) and DetCLIP (45.8 AP). On the challenging LVIS benchmark with over 1,200 categories, our method demonstrates strong performance on rare categories (AP_r), achieving 32.7 compared to 26.9 for GLIP. For referring expression comprehension on RefCOCO, RefCOCO+, and RefCOCOg, Grounding DINO achieves state-of-the-art accuracy, demonstrating its versatility in understanding diverse linguistic inputs from simple category names to complex spatial descriptions.
Paragraph function
Provides the core empirical evidence: comprehensive quantitative results across multiple benchmarks.
Logical role
COCO zero-shot, LVIS rare categories, and RefCOCO referring comprehension validate the method along three complementary dimensions.
Argumentation / potential weaknesses
The rare-category improvement (32.7 vs. 26.9 AP_r) speaks particularly to the value of language guidance, but performance on non-English inputs is left untested.
4. Conclusion
We have presented Grounding DINO, an open-set object detector that effectively fuses language and vision through a three-phase tight fusion architecture. By marrying the powerful DINO detector with grounded pre-training, our model can detect arbitrary objects specified through natural language, achieving state-of-the-art results across multiple zero-shot, open-vocabulary, and referring expression benchmarks. Our work demonstrates that deep, multi-stage cross-modality fusion is key to bridging the gap between closed-set and open-set object detection.
Paragraph function
Summarizes the paper: restates the method's core and its broad applicability.
Logical role
Closes by "bridging the gap between closed-set and open-set detection", positioning the work as a milestone for the field.
Argumentation / potential weaknesses
Grounding DINO has since become widely used infrastructure for open-set detection; its influence extends beyond the paper itself.
Argument structure overview
Problem: closed-set detection limits applicability
→ Thesis: language fusion enables open-set detection
→ Method: three-phase tight cross-modality fusion
→ Evidence: 52.5 AP zero-shot on COCO
→ Conclusion: a new baseline for open-set detection
Core claim
Tight bidirectional language-vision fusion at multiple stages of a Transformer detector makes it possible to build an open-set detection system that can detect objects described by arbitrary natural language.
Strongest point of the argument
Achieving 52.5 AP in the COCO zero-shot setting, together with a clear lead over prior methods on LVIS rare categories, demonstrates the strong generalization of language-guided detection.
Weakest point of the argument
The analysis of computational cost is insufficient: the inference latency of stacked cross-attention layers may limit real-time applications. In addition, the degree of support for non-English language inputs is not explicitly discussed.