Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection

Abstract

This paper investigates the Mixture of Experts (MoE) architecture in real-time open-vocabulary object detection. The authors observe that in the shallow layers, experts tend to cooperate with diverse peers to expand the search space, while deeper layers develop stable expert collaboration, with specialized combinations for specific token patterns. Based on these findings, they propose Dynamic-DINO, which converts Grounding DINO 1.5 Edge into a dynamic inference framework through efficient MoE-Tuning. A granularity decomposition mechanism splits the Feed-Forward Network into finer expert networks, and a pre-trained weight allocation strategy ensures stable initialization. Trained on only 1.56M open-source data, Dynamic-DINO surpasses the baseline pretrained on a private 20M dataset with comparable inference speed.

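Paragraph function: Overview of the whole paper. It starts from an observed phenomenon, derives the method design from it, and closes with a data-efficiency comparison.

Logical role: The abstract serves as both an empirical observation and a preview of the solution. It first uses the difference in expert behavior between shallow and deep layers as its grounding, then derives the granularity decomposition strategy from it, and finally highlights the method's practical value through the large gap in data scale.

Argumentation technique / potential weakness: The 1.56M vs. 20M comparison is striking, but the baseline is the official result trained on private data. The two datasets may differ considerably in quality and distribution, so this is not a pure comparison of quantity.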

1. Introduction

Open-vocabulary object detection enables flexible localization of arbitrary objects by integrating language modality, advancing beyond fixed-category detection systems. Real-time detectors have gained emphasis due to practical applications in anomaly detection, robotics, and autonomous driving. Current approaches employ dense models with fixed inference architectures, where a single FFN in each layer is required to process all tokens, encompassing extensive patterns in open scenarios, including visual patterns such as category and attribute, and contextual patterns like relative position and relationship. This causes gradient conflicts and long-tail issues.

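Paragraph function: Establishes the research setting. It defines why open-vocabulary detection matters and points out the fundamental bottleneck of dense models.

Logical role: Starting point of the argument chain. It moves from the demands of open scenarios to the structural problem that a single FFN cannot cope with diverse patterns, which establishes the need for MoE-style dynamic dispatch.

Argumentation technique / potential weakness: The diagnosis of "gradient conflicts" and "long-tail issues" targets the pain points of dense models precisely, but no quantitative evidence is given. Readers have to trust the authors' empirical judgment of the problem.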

For an MoE with N experts and top-k activation across L layers, the subnet search space is (C_N^k)^L, where C_N^k is the number of ways to choose k of the N experts. There are two straightforward ways to expand it: increasing k raises computational cost, while increasing N increases memory and training overhead. The authors propose Dynamic-DINO, which implements efficient MoE-Tuning on top of Grounding DINO 1.5 Edge. It employs a granularity decomposition strategy that splits a single FFN into multiple expert networks. Crucially, the activated parameters remain equivalent to a single FFN, which maintains inference efficiency.

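Paragraph function: Presents the core innovation. A mathematical analysis shows the dilemma of expanding the search space, and granularity decomposition is introduced as the way out.

Logical role: Following the problem statement, this paragraph uses the combinatorial search-space formula as its rational basis, quantifies the dilemma of scaling MoE, and then proposes granularity decomposition as a third path out of that dilemma.

Argumentation technique / potential weakness: "Activated parameters equivalent to a single FFN" is an attractive engineering promise, but in actual inference the router's computation and the sequential execution of experts can add latency, which the authors also acknowledge in their experiments.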
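To make the search-space expression (C_N^k)^L concrete, the following minimal sketch compares N whole-FFN experts with top-1 routing against k x N fine-grained experts with top-k routing. The values of N and k are hypothetical and chosen only for illustration; L = 6 matches the number of decoder layers stated in the experiments. The function name `search_space` is an assumption of this sketch, not something from the paper.

```python
from math import comb


def search_space(num_experts: int, top_k: int, num_layers: int) -> int:
    """Number of distinct expert subnets: C(num_experts, top_k) ** num_layers."""
    return comb(num_experts, top_k) ** num_layers


# Illustrative values only (N and k are NOT the paper's configuration).
L = 6  # decoder layers
N = 4  # hypothetical number of FFN copies
k = 2  # hypothetical number of partitions per FFN

coarse = search_space(N, 1, L)    # N whole-FFN experts, top-1 routing
fine = search_space(k * N, k, L)  # k*N fine-grained experts, top-k routing

print(f"coarse: {coarse:,}")  # 4**6
print(f"fine:   {fine:,}")    # C(8, 2)**6 = 28**6
```

In both settings each token activates the equivalent of one FFN's parameters, yet under these toy values the fine-grained variant has a far larger number of possible subnets.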

2. Related Work

Representative works in open-vocabulary object detection include GLIP, OpenSeeD, OWL-ViT, Grounding DINO, and DetCLIP variants. Real-time detectors like YOLO-World inherit efficiency from the YOLO series. Grounding DINO 1.5 Edge focuses on computational efficiency, with subsequent versions expanding pre-training datasets. Dynamic-DINO uniquely incorporates MoE-driven dynamic inference to achieve significant improvements in accuracy without compromising efficiency.

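Paragraph function: Literature review. It traces the development of open-vocabulary detection and positions Dynamic-DINO within the field.

Logical role: Sets up a two-axis frame of efficiency vs. accuracy: existing methods trade one against the other, whereas Dynamic-DINO claims to improve both at once through MoE.

Argumentation technique / potential weakness: Dynamic-DINO is positioned as the only approach that combines MoE with real-time detection, but MoE is already widely used in large language models. The novelty here lies mainly in the specific design needed to transfer it to the detection task.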

Mixture of Experts (MoE) enables conditional computation, scaling model capacity while ensuring efficient computation through selective expert activation. Early works adopted hard routing; recent LLM and LVLM work employs soft routers enabling dynamic token allocation. DeepSeekMoE and QwenMoE segment experts by splitting FFN intermediate dimensions. Dynamic-DINO differs by segmenting pre-trained FFN parameters for incremental fine-tuning rather than random initialization for full pre-training.

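Paragraph function: Technical background. It reviews the evolution of MoE from hard routing to soft routing and then to dimension splitting.

Logical role: The key distinction is fine-tuning vs. training from scratch: methods such as DeepSeekMoE require full pre-training from random initialization, whereas Dynamic-DINO inherits the weights of an existing dense model, which greatly reduces training cost.

Argumentation technique / potential weakness: The "incremental fine-tuning" positioning frames Dynamic-DINO as a lightweight upgrade rather than a new training run, but whether splitting pre-trained weights preserves the original model's representation quality has to be verified in the ablation studies.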

3. Method

3.1 Overview

Dynamic-DINO extends Grounding DINO 1.5 Edge into a dynamic framework via MoE-Tuning. MoE is applied to the decoder for two reasons: (1) Language-guided Query Selection retains only 900 tokens, which minimizes the router's computational cost; (2) the decoder output directly influences bounding box regression, which improves fine-tuning efficiency. During training, only the Cross-Attention in the Feature Enhancer, the MoE Layer in the Cross-Modality MoE Decoder, and the Detection Head are updated; all other parameters are frozen.

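Paragraph function: Architectural placement. It explains why MoE is introduced in the decoder rather than the encoder.

Logical role: Gives two justifications for the design decision: a controlled number of tokens and the directness of the output. This provides the computational feasibility that the later granularity decomposition relies on.

Argumentation technique / potential weakness: The 900-token limit makes the router overhead negligible, but it also means that MoE diversity is bounded by the quality of query selection. If query selection is itself biased, expert dispatch may be biased as well.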

3.2 Cross-Modality MoE Decoder

The FFN in each decoder layer is expanded into N identical FFNs via Supernet Expansion. The intermediate hidden dimension of each FFN is then divided evenly into k partitions, yielding k x N fine-grained experts. The finer granularity creates a larger subnet search space. During Subnet Inference, a router R(x) scores the experts with softmax normalization, and the top-k experts are activated through a gating mechanism, so the activated parameters equal those of a single FFN: $h(x) = \sum_{i=1}^{kN} g_i E_i(x)$, where $g_i \in \{0, 1\}$ indicates whether expert i is selected.

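Paragraph function: Core mechanism. It describes how granularity decomposition enlarges the search space while keeping inference cost unchanged.

Logical role: This paragraph is the core of the paper's technical argument: the two-dimensional k x N expansion decouples the exponential growth of the search space from the linear cost in parameters.

Argumentation technique / potential weakness: Guaranteeing "activated parameters = a single FFN" by construction is an elegant design, but each expert has less representational capacity after fine-grained splitting. If some patterns need a single high-capacity expert, granularity decomposition could become a bottleneck instead.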
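The gated sum described in Section 3.2 (softmax router scores, top-k selection, and an output formed by summing the selected experts) can be sketched in plain Python as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the helper names (`softmax`, `top_k_gates`, `moe_output`), the scaling "experts", and the router logits are all hypothetical, and the gates are assumed to be binary indicators as the text describes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of router logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gates(router_logits, k):
    """Binary gates g_i: 1 for the k highest-scoring experts, else 0."""
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(probs))]


def moe_output(x, experts, router_logits, k):
    """h(x) = sum_i g_i * E_i(x); only the selected experts are evaluated."""
    gates = top_k_gates(router_logits, k)
    out = [0.0] * len(x)
    for g, expert in zip(gates, experts):
        if g:
            out = [a + b for a, b in zip(out, expert(x))]
    return out


# Toy experts (hypothetical): each scales the token by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gates = top_k_gates([0.1, 2.0, -1.0, 1.5], k=2)
h = moe_output([1.0, -1.0], experts, router_logits=[0.1, 2.0, -1.0, 1.5], k=2)
print(gates, h)
```

Because unselected experts are skipped entirely, the compute per token depends on k rather than on the total number of experts.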

3.3 MoE-Tuning

Expert Initialization segments the parameters of the pre-trained FFN. The first-layer weight W1 is divided horizontally into k blocks, and the second-layer weight W2 is divided vertically into k blocks. The second-layer bias is adjusted to b2* = b2/k, which ensures that "the sum of the outputs from the k fine-grained experts matches the output of the original FFN". For Router Initialization, the router weights are initialized randomly as centroids and each centroid is replicated k times, ensuring that the router invariably selects the k experts derived from the same FFN at the start of fine-tuning. The total loss combines the detection loss L_det (L1, GIoU, and Focal losses) with an auxiliary load balancing loss L_aux to prevent expert collapse.

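Paragraph function: Stability guarantee. It explains how the starting point of fine-tuning is kept no worse than the original model.

Logical role: This paragraph addresses the central risk of MoE-Tuning: the performance collapse that can occur when a pre-trained model is converted to an MoE architecture. The guarantee that the expert outputs sum to the original FFN output provides a safe starting line.

Argumentation technique / potential weakness: The initialization is carefully designed: the router's initial state is equivalent to the dense model, so fine-tuning starts at the dense model's performance rather than below it. This constrains only the starting point, not the later updates. It also means that if the router is under-trained, the model may remain close to the original dense model and gain none of the benefits of MoE.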
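The equivalence claimed for Expert Initialization can be checked numerically on a toy FFN. The sketch below assumes a two-layer FFN of the form W2 · act(W1 · x + b1) + b2 with an element-wise activation (ReLU here, which is an assumption), and stores W1 as hidden x d and W2 as d x hidden, so "splitting the hidden dimension" means slicing rows of W1 and columns of W2. The row/column orientation depends on the weight layout convention and may differ from the paper's. All function names and numbers are hypothetical. The last lines illustrate replicating each router centroid k times, so that the k copies tie for the top score.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def ffn(x, W1, b1, W2, b2):
    """Dense FFN: W2 @ relu(W1 @ x + b1) + b2."""
    hidden = relu([a + b for a, b in zip(matvec(W1, x), b1)])
    return [a + b for a, b in zip(matvec(W2, hidden), b2)]


def split_ffn(W1, b1, W2, b2, k):
    """Split the hidden dimension into k equal chunks, one expert per chunk."""
    step = len(W1) // k
    experts = []
    for j in range(k):
        lo, hi = j * step, (j + 1) * step
        experts.append((
            W1[lo:hi],                   # rows of W1 for this chunk
            b1[lo:hi],
            [row[lo:hi] for row in W2],  # matching columns of W2
            [b / k for b in b2],         # adjusted bias: b2* = b2 / k
        ))
    return experts


def replicate_centroids(centroids, k):
    """Router init sketch: copy each centroid vector k times."""
    return [list(c) for c in centroids for _ in range(k)]


# Toy weights (hypothetical): d = 2, hidden = 4, k = 2.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0], [2.0, 1.0]]
b1 = [0.0, 1.0, 0.5, -1.0]
W2 = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.0, 0.0, 1.0]]
b2 = [0.2, -0.4]
x = [1.0, 2.0]

dense = ffn(x, W1, b1, W2, b2)
parts = [ffn(x, *expert) for expert in split_ffn(W1, b1, W2, b2, k=2)]
summed = [sum(vals) for vals in zip(*parts)]

# Two hypothetical centroids, each replicated k = 2 times -> 4 router rows.
router = replicate_centroids([[1.0, 0.0], [0.0, 1.0]], k=2)
logits = matvec(router, x)
selected = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
print(dense, summed, selected)
```

The split works because the activation is applied element-wise, so each hidden unit contributes to the output independently; the bias b2 is the only term shared by all experts, which is why it is divided by k.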

4. Experiments

Experiments use the Objects365 (V1), GoldG, and V3Det datasets, totaling approximately 1.56M images, with COCO images excluded from GoldG. Zero-shot evaluation is conducted on COCO, LVIS, and ODinW using standard Average Precision metrics. The model uses an EfficientViT-L1 image backbone and a BERT-base text backbone, with 900 queries and 6 decoder layers. Dynamic-DINO achieves performance comparable to the official Grounding DINO 1.5 Edge across resolutions. Notably, performance on rare classes improves significantly, indicating that MoE-Tuning effectively alleviates the long-tail problem.

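Paragraph function: Experimental validation. Multiple benchmarks under a zero-shot setting are used to show the method's generalization.

Logical role: This paragraph supplies the key empirical support. In particular, the improvement on rare classes maps directly onto the long-tail problem raised in the introduction, closing the loop between problem and solution.

Argumentation technique / potential weakness: The phrase "comparable performance" is vague: if the method only matches rather than surpasses the baseline, is the extra complexity of MoE worth it? The clear improvement on rare classes does provide differentiating value. The ablation study also acknowledges a slight drop in inference speed.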

Statistical analysis of the routing distributions reveals that expert load varies notably across layers, indicating that the experts have learned task-specific divisions of labor. In shallow layers, experts cooperate with a diverse range of peers to explore a wider search space. In deeper layers, each expert settles into consistent collaborations with 2-3 specific partners to process distinct patterns. Token routing examples confirm that distinct expert combinations specialize in specific patterns: refrigerator tokens select experts 0 and 3, while clothing tokens select experts 1 and 7.

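Paragraph function: Mechanism analysis. Empirical data are used to explain the internal behavior that the MoE has learned.

Logical role: Provides interpretability evidence: it shows not only that the method works but also why. The pattern of exploration in shallow layers and specialization in deep layers gives an intuitive account of MoE dynamic inference.

Argumentation technique / potential weakness: The concrete refrigerator vs. clothing example is persuasive but is only a single case. A systematic pattern analysis, such as the consistency of expert assignment across datasets, would be more convincing.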
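One simple way to quantify the "diverse peers vs. stable partners" observation is to count, per layer, how often each pair of experts is activated together and how many distinct partners each expert has. The sketch below is only an assumed analysis procedure, not the authors' script. The routing logs are invented for illustration; the pairs (0, 3) and (1, 7) merely echo the example in the text.

```python
from collections import Counter
from itertools import combinations


def co_selection_counts(routing_log):
    """Count how often each pair of experts is activated together in one layer."""
    pairs = Counter()
    for selected in routing_log:
        for a, b in combinations(sorted(selected), 2):
            pairs[(a, b)] += 1
    return pairs


def partners_per_expert(pairs):
    """Number of distinct peers each expert has been co-selected with."""
    partners = {}
    for a, b in pairs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return {expert: len(peers) for expert, peers in partners.items()}


# Hypothetical logs: one tuple of selected expert ids per token.
shallow_layer = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
deep_layer = [(0, 3), (0, 3), (1, 7), (1, 7), (0, 3), (1, 7)]

print(partners_per_expert(co_selection_counts(shallow_layer)))
print(partners_per_expert(co_selection_counts(deep_layer)))
```

In this made-up data the shallow layer spreads co-selection across many pairs, while the deep layer concentrates on two fixed pairs, which is the qualitative pattern the paragraph describes.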

5. Conclusion

Dynamic-DINO explores MoE integration in real-time open-vocabulary detection. The framework demonstrates that diverse expert combinations can adaptively process specific patterns, activating only relevant experts during inference. Building on a reproduced Grounding DINO 1.5 Edge, Dynamic-DINO extends it into a dynamic framework via MoE-Tuning with granularity decomposition, expanding the subnet search space while maintaining activated parameters. A novel weight allocation and router initialization strategy prevents performance degradation at the start of fine-tuning. Extensive validation on multiple benchmarks confirms the effectiveness of the approach.

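Paragraph function: Summarizes the paper, restating the core contributions and design philosophy.

Logical role: The conclusion echoes the three contributions of the abstract: the exploration of MoE in detection, the granularity decomposition strategy, and the weight-preserving initialization. Together they close the argument.

Argumentation technique / potential weakness: The conclusion does not discuss limitations in enough depth, including the practical impact of inference latency, the bottleneck of sequential expert execution, and scalability to larger models. The authors acknowledge the constraint on computational resources in the limitations section, but the conclusion does not emphasize it.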

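Overview of the argument structure

Problem: The single FFN of a dense model cannot handle the diverse patterns of open scenarios.
→ Claim: Granularity-decomposed MoE expands the search space while keeping inference cost unchanged.
→ Evidence: Training on 1.56M open-source data surpasses the performance of the baseline trained on 20M private data.
→ Rebuttal: Weight inheritance ensures that initial performance does not degrade; load balancing prevents expert collapse.
→ Conclusion: MoE-driven dynamic inference is an efficient upgrade path for real-time detection.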

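The authors' core claim (in one sentence)

Converting a pre-trained dense detection model into a Mixture of Experts architecture through granularity decomposition preserves inference efficiency while, with very little open-source data, achieving open-vocabulary detection performance that surpasses a baseline pre-trained on large-scale private data.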

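Strongest point of the argument

The weight-inheriting initialization: it ensures that the output at the start of MoE-Tuning equals that of the original dense model, which mathematically guarantees that fine-tuning does not begin from a degraded state. Combined with the load balancing loss and the router initialization, this forms a coherent set of stability safeguards and makes MoE-Tuning a low-risk, high-reward upgrade path.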

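Weakest point of the argument

The candid disclosure of inference latency: the authors acknowledge that in the current implementation the experts run sequentially rather than in parallel, which slightly reduces inference speed. This weakens the core promise of maintaining inference efficiency. In addition, the experiments were run on only 8 NVIDIA 3090 GPUs, and scalability to larger models and datasets was not verified.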