InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Abstract

Compared to the rapid development of large-scale vision transformers (ViTs) in recent years, large-scale models based on CNNs have not been thoroughly explored. This work presents InternImage, a new large-scale CNN-based foundation model that can obtain comparable or even better performance than current state-of-the-art ViTs. Different from recent CNNs that adopt large dense kernels, InternImage takes deformable convolution as the core operator, so that the model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has adaptive spatial aggregation conditioned by input and task information.

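Paragraph function: Positioning the paper. It opens from the research gap in large vision models, points out that CNN foundation models have been neglected, and introduces InternImage's core positioning.

Logical role: The first half of the abstract builds the research motivation on the contrast "ViTs thriving vs. CNNs lagging", then separates this work from contemporaneous CNN improvements with "deformable convolution instead of large dense kernels", stating both the problem and the solution in a single sentence.

Argumentation technique / potential weakness: Using the success of ViTs as a positive reference to argue for the potential of CNNs rhetorically turns a "disadvantage" into an "untapped opportunity". However, the relative merits of deformable convolution versus large dense kernels still need experimental support from later sections.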

Starting from InternImage, the authors explore how to design and train CNN-based foundation models at a large scale. By customizing a series of block-level, stage-level, and model-level designs, together with a tailored large-scale training strategy, InternImage is successfully scaled to over 1 billion parameters and trained on 427 million images. It achieves 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs on these benchmarks. The code and models are available at github.com/OpenGVLab/InternImage.

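Paragraph function: Quantitative claim. Concrete parameter counts and benchmark numbers back the core argument.

Logical role: The second half of the abstract serves as an "empirical preview": one billion parameters and several SOTA results establish the reader's confidence in the method in advance and set expectations for the detailed discussion that follows.

Argumentation technique / potential weakness: Reporting results on COCO and ADE20K, two demanding dense prediction tasks, rather than only ImageNet classification, suggests that InternImage has a particular advantage on tasks that need large receptive fields. The abstract does not mention the gap to the largest models such as ViT-G on ImageNet (about 0.9 points), leaving readers to find it themselves.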

1. Introduction

Since the introduction of AlexNet, convolutional neural networks have been the de facto standard for visual recognition. However, the emergence of Vision Transformers (ViTs) has shifted the paradigm significantly. Models like ViT, Swin Transformer, and their scaled variants have demonstrated remarkable performance on a wide range of vision benchmarks. A natural question arises: Can CNN-based foundation models also achieve comparable or even better performance than ViTs when equipped with similar operator-level and architecture-level designs?

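Paragraph function: Establishing the research field. It lays out the paradigm shift from CNNs to ViTs in historical context and raises the core research question in question form.

Logical role: The starting point of the argument chain. It first establishes the historical standing of CNNs, then notes the challenge from ViTs, and finally poses the research hypothesis as a question, leading readers into the authors' framework.

Argumentation technique / potential weakness: Posing a question ("Can ...?") instead of making a direct claim softens the argument and arouses the reader's curiosity. But the question carries a hidden premise: that the difference between CNNs and ViTs reduces to "operator-level and architecture-level designs" rather than a more fundamental difference in mathematical structure.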

To answer this question, the authors first identify two key differences between the core operators of modern CNNs and ViTs. First, multi-head self-attention (MHSA) provides long-range dependencies and adaptive spatial aggregation, whereas standard convolutions have limited and static receptive fields. Second, ViTs benefit from advanced architectural components such as Layer Normalization, Feed-Forward Networks (FFN), and GELU activation, which are absent in traditional CNN designs. Previous attempts to bridge this gap using very large kernels (e.g., 31x31) still lag behind state-of-the-art ViTs and introduce optimization difficulties.

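Paragraph function: Gap analysis. It systematically lists the core differences between CNNs and ViTs and criticizes the shortcomings of existing bridging strategies.

Logical role: This paragraph performs the key "problem diagnosis": it decomposes the vague "CNNs are worse than ViTs" into two actionable technical dimensions (operator properties and architectural components), giving the subsequent design precise directions for improvement.

Argumentation technique / potential weakness: Splitting the problem into two independent dimensions, "operator" and "architecture", makes the solution look both complete and feasible. But in implying that large-kernel methods (such as RepLKNet and SLaK) perform poorly, it does not fully consider their advantages on certain tasks and may oversimplify the competitive landscape.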

In this work, the authors propose a different approach. Instead of using large dense kernels, they adopt deformable convolution v3 (DCNv3) as the core operator, which uses 3x3 convolution kernels with learnable sampling offsets. This design provides three key advantages: (1) flexible receptive fields that can be either short-range or long-range depending on the learned offsets; (2) input-adaptive spatial aggregation similar to MHSA; and (3) avoidance of the optimization problems inherent in dense large kernels. Combined with transformer-inspired architectural components and customized stacking and scaling rules, InternImage becomes the first CNN-based model effectively scaled to over 1 billion parameters while matching ViT-level performance.

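Paragraph function: Presenting the solution. A complete overview of InternImage's technical core and its three advantages.

Logical role: Following the diagnosis in the previous paragraph, this one makes the turn from "why existing CNNs fall short" to "how InternImage responds". The three advantages map onto the two dimensions of difference raised earlier, closing the logical loop.

Argumentation technique / potential weakness: The numbered list presents the three advantages clearly, making them more memorable and persuasive. "The first CNN beyond one billion parameters" is a strong historical claim, but its significance depends on whether performance really matches ViTs; if a 0.9-point gap remains on ImageNet, "matching ViT-level performance" needs finer qualification.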

2. Related Work

Vision foundation models have evolved from AlexNet through a long line of CNN architectures including VGGNet, GoogLeNet, ResNet, DenseNet, and EfficientNet. These models progressively improved accuracy and efficiency through innovations in depth, width, connectivity patterns, and neural architecture search. More recently, Vision Transformers (ViT) and their hierarchical variants such as Swin Transformer, PVT, and CSWin have introduced self-attention mechanisms that provide global receptive fields and dynamic weight computation. This has sparked renewed interest in modernizing CNN architectures by borrowing techniques from ViTs, as seen in works like ConvNeXt that systematically incorporate transformer-inspired design choices into pure convolutional networks.

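Paragraph function: Literature review. It connects the evolution of the CNN and ViT camps in historical context.

Logical role: It builds a complete academic lineage, traditional CNNs -> ViTs -> modernized CNNs (such as ConvNeXt), giving context for where InternImage sits on this line.

Argumentation technique / potential weakness: The linear-evolution narrative covers a large body of related work, showing academic rigor while implying that InternImage is the latest product of this evolution. But classifying ConvNeXt and similar work merely as "borrowing from ViTs" may understate their original contributions.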

The success of large-scale pre-training in NLP has inspired similar efforts in computer vision. Zhai et al. extended ViT to 2 billion parameters, and Liu et al. developed 3 billion parameter Swin variants, demonstrating that model scaling is a crucial factor for achieving state-of-the-art performance. However, CNN-based large-scale models have significantly lagged behind transformer variants in both parameter count and downstream performance. Recent attempts to improve CNNs through large kernel designs — such as RepLKNet with 31x31 kernels and SLaK with up to 51x51 kernels — have narrowed the gap but still fall short of the best ViTs and require specialized re-parameterization techniques to stabilize training.

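Paragraph function: Competitive analysis. It points out the absence of large-scale CNNs and the limitations of large-kernel approaches.

Logical role: Within the related work, this paragraph narrows the focus from the broad evolution of models to "the gap in large-scale CNNs", directly positioning InternImage's research contribution.

Argumentation technique / potential weakness: Citing the specific kernel sizes of RepLKNet and SLaK (31x31, 51x51) uses the visual impact of the numbers to suggest that the "dense large kernel" route is clumsy, setting off the "sparse 3x3" design of deformable convolution. But large-kernel methods have the advantage of simple implementation and hardware friendliness, which is not fairly presented here.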

Deformable convolution was first introduced by Dai et al. as DCNv1, which augments the standard convolution grid with 2D learnable offsets, allowing each sampling point to shift from its regular position. DCNv2 further introduced modulation scalars that re-weight the contribution of each sampling point. While these operators have proven effective as plug-in modules in detection and segmentation backbones, they have not been explored as the core building block for designing large-scale foundation models. This paper bridges this gap by extending DCNv2 into DCNv3 with modifications tailored for foundation model design.

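Paragraph function: Technical lineage. It traces deformable convolution from v1 to v2 and points out its untapped potential as the core component of a foundation model.

Logical role: It establishes the lineage "DCNv1 -> DCNv2 -> DCNv3 (this paper)", making InternImage's core innovation look like a natural, incremental extension rather than an abrupt jump.

Argumentation technique / potential weakness: Positioning earlier DCNs as "plug-in modules" rather than "core building blocks" neatly draws the boundary of this paper's contribution. But the claim "not explored as a core building block" depends on whether other studies have made similar attempts; if they have, the novelty claim needs adjusting.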

3. Method

3.1 Deformable Convolution v3 (DCNv3)

The authors begin by analyzing the limitations of standard convolution compared to multi-head self-attention (MHSA). Standard convolution suffers from two critical shortcomings: (1) the effective receptive field remains relatively small despite increasing network depth, constrained by the fixed kernel size (typically 3x3 or 5x5); and (2) static convolution weights impose strong inductive biases, specifically 2D locality and translation equivariance, which restrict the model's ability to learn from web-scale diverse data. In contrast, MHSA computes dynamic attention weights conditioned on input features, enabling both long-range dependencies and adaptive spatial aggregation.

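Paragraph function: Problem diagnosis. A structured dissection of the core weaknesses of standard convolution.

Logical role: This paragraph provides the theoretical motivation for DCNv3: each advantage of MHSA corresponds to a weakness of convolution, and the subsequent DCNv3 modifications must respond to each of them.

Argumentation technique / potential weakness: Framing inductive bias as a "restriction" rather than an "advantage" is a strategic choice. On small datasets, the inductive biases of CNNs are in fact an advantage; the authors' argument holds only under the premise of "web-scale data", and this premise is not explicitly delimited.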

Recall that DCNv2 computes its output as $y(p_0) = \sum_{k=1}^{K} w_k \, m_k \, x(p_0 + p_k + \Delta p_k)$, where $w_k$ is the weight of the $k$-th sampling point, $p_k$ denotes the regular grid sampling locations, $\Delta p_k$ are learnable offsets, and $m_k$ are modulation scalars. DCNv2 already shares favorable properties with MHSA: the learnable offsets enable variable-length receptive fields, and the input-conditioned modulation provides adaptive spatial aggregation. However, DCNv2 has several design choices that hinder its scalability to large-scale models: location-specific convolution weights, single-group operation, and element-wise sigmoid normalization of modulation scalars.

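Paragraph function: Technical foundation. It reviews the DCNv2 formula and identifies its scalability bottlenecks.

Logical role: A bridging paragraph. It first affirms the good properties DCNv2 already has (its similarity to MHSA), then lists exactly three design flaws to be changed, giving the three DCNv3 improvements a one-to-one connection.

Argumentation technique / potential weakness: The formula lays a rigorous technical foundation, while the wording "hinder scalability" frames DCNv2's design choices as "problems" rather than "trade-offs". But location-specific weights may bring better expressiveness in small models, which is not discussed here.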

DCNv3 introduces three critical modifications. First, weight sharing among convolutional neurons: the original DCNv2 maintains separate weights for each sampling point, so its parameter count grows linearly with the number of sampling points. DCNv3 decomposes this into a depth-wise part (handling spatial modulation) and a point-wise projection shared across sampling points, dramatically reducing parameters. Second, the multi-group mechanism: inspired by group convolution and multi-head attention, DCNv3 splits spatial aggregation into G groups, each with its own sampling offsets Δp_gk and modulation scalars m_gk, enabling different groups to learn diverse spatial aggregation patterns from different representation subspaces. Third, softmax normalization: under the original element-wise sigmoid each modulation scalar lies in [0, 1], so their sum over the K sampling points can range anywhere from 0 to K, which is unstable. DCNv3 instead applies softmax along the K sampling points, constraining the sum to 1 and stabilizing gradient flow in large-scale training.

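Paragraph function: Core technical innovation. It describes the three DCNv3 improvements and their motivations one by one.

Logical role: The technical core of the paper. The three modifications respond to the three DCNv2 shortcomings listed in the previous paragraph, forming a one-to-one "problem-solution" correspondence that makes the argument hard to fault.

Argumentation technique / potential weakness: The "first, second, third" structure improves clarity. Each modification follows a clear "problem -> solution -> benefit" logic. But whether the three modifications interact (for example, weight sharing might reduce the expressiveness of the multi-group design) is not discussed.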
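
To make the normalization change concrete, the following minimal sketch contrasts element-wise sigmoid with softmax on K = 9 modulation logits. The function names and the test logits are illustrative assumptions, not taken from the authors' code.

```python
import math

def sigmoid_modulation(logits):
    """Element-wise sigmoid (DCNv2 style): each scalar lies in (0, 1) independently."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax_modulation(logits):
    """Softmax over the K sampling points (DCNv3 style): the scalars sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

K = 9
low = [-10.0] * K    # hypothetical logits, all strongly negative
high = [10.0] * K    # hypothetical logits, all strongly positive

# The sigmoid sum swings between roughly 0 and K depending on the logits.
print(sum(sigmoid_modulation(low)), sum(sigmoid_modulation(high)))
# The softmax sum stays at 1 regardless of the logits.
print(sum(softmax_modulation(low)), sum(softmax_modulation(high)))
```

The sigmoid sum depends on the overall scale of the logits, while the softmax sum does not. That scale dependence is the instability the paragraph attributes to the sigmoid variant.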

The resulting DCNv3 operator is formulated as $y(p_0) = \sum_{g=1}^{G} \sum_{k=1}^{K} w_g \, m_{gk} \, x_g(p_0 + p_k + \Delta p_{gk})$, where $G$ denotes the number of aggregation groups, $K$ is the number of sampling points per group, $x_g$ is the slice of the input feature map belonging to group $g$, $w_g \in \mathbb{R}^{C \times C'}$ is the group-wise projection weight with $C' = C/G$, and the modulation scalars $m_{gk}$ are normalized via softmax over $k$. Compared to MHSA, DCNv3 is more efficient because it uses sparse sampling ($K = 9$ for a 3x3 kernel) rather than attending to all spatial locations, while still achieving adaptive spatial aggregation through the learned offsets and modulation.

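Paragraph function: Mathematical formalization. It gives the full DCNv3 formula and compares its efficiency with MHSA.

Logical role: This paragraph completes the technical specification of DCNv3 and, through the contrast "sparse sampling vs. global attention", places DCNv3 at the "sweet spot" between MHSA and standard convolution.

Argumentation technique / potential weakness: The formula lends authority, and the efficiency comparison answers the likely question "why not just use MHSA". But the efficiency gap between K=9 sparse samples and the HW tokens of global attention is most pronounced at high input resolution, and the authors could quantify this advantage more explicitly.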
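
The DCNv3 formula can be read as a small procedure: per group, softmax the modulation logits, bilinearly sample the nine shifted locations, take the weighted sum, then project with w_g. The sketch below evaluates it at a single output location in pure Python. It is an illustration under stated assumptions: bilinear interpolation with zero padding is assumed for fractional positions, and every function and variable name is hypothetical rather than the authors' implementation.

```python
import math

# Regular 3x3 grid p_k (K = 9), as (dy, dx) displacements around p0.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear(feat, y, x):
    """Bilinearly sample feat[H][W][C'] at fractional (y, x); zero outside the map."""
    height, width, dim = len(feat), len(feat[0]), len(feat[0][0])
    y0, x0 = math.floor(y), math.floor(x)
    out = [0.0] * dim
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < height and 0 <= xi < width:
                for c in range(dim):
                    out[c] += wy * wx * feat[yi][xi][c]
    return out

def dcnv3_at(groups, p0, offsets, logits, weights):
    """Evaluate y(p0) = sum_g w_g sum_k m_gk x_g(p0 + p_k + dp_gk).

    groups[g]     : H x W x C' feature slice x_g
    offsets[g][k] : (dy, dx) offset dp_gk
    logits[g][k]  : pre-softmax modulation value for m_gk
    weights[g]    : C x C' projection matrix w_g
    """
    out = [0.0] * len(weights[0])
    for g, feat in enumerate(groups):
        m = softmax(logits[g])
        agg = [0.0] * len(feat[0][0])
        for k, (dy, dx) in enumerate(GRID):
            oy, ox = offsets[g][k]
            sample = bilinear(feat, p0[0] + dy + oy, p0[1] + dx + ox)
            agg = [a + m[k] * s for a, s in zip(agg, sample)]
        # w_g does not depend on k, so it is applied once per group.
        for c, row in enumerate(weights[g]):
            out[c] += sum(w * a for w, a in zip(row, agg))
    return out

# Toy usage: G = 2 groups, C' = 1, C = 2, constant 5x5 feature maps.
ones = [[[1.0] for _ in range(5)] for _ in range(5)]
groups = [ones, ones]
offsets = [[(0.5, -0.25)] * 9 for _ in range(2)]
logits = [[0.0] * 9 for _ in range(2)]
weights = [[[1.0], [0.0]], [[0.0], [1.0]]]
result = dcnv3_at(groups, (2, 2), offsets, logits, weights)
print(result)  # each entry is close to 1.0 on a constant map
```

Each output location touches only G x K sampled points, whatever the size of the feature map, which is the sparse-sampling efficiency argument made against global attention.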

3.2 InternImage Model

The basic block of InternImage departs from the traditional bottleneck design used in ResNet and instead adopts a transformer-like structure. Each block consists of DCNv3 as the core operator, preceded by Layer Normalization (LN) and followed by a Feed-Forward Network (FFN) with GELU activation. The sampling offsets and modulation scales are predicted via separable convolution from input features. This design combines the spatial modeling strength of deformable convolution with the training stability and representation capacity of transformer-style components.

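Paragraph function: Architecture design. It describes the composition and design philosophy of InternImage's basic block.

Logical role: It responds to the second difference raised in the introduction ("ViTs' advanced architectural components"): by integrating Transformer components such as LN, FFN, and GELU, it directly removes the architecture-level gap between CNNs and ViTs.

Argumentation technique / potential weakness: The phrase "transformer-like structure" implies that InternImage takes the best of both worlds. But it also raises a fundamental question: once a CNN adopts so many Transformer components, is its identity as a "CNN" still clear? The only remaining core difference is DCNv3 vs. MHSA.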
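
The block layout can be sketched on a single feature vector as follows. This is a rough illustration only: the residual connections and the second Layer Normalization before the FFN are assumptions borrowed from common pre-norm Transformer blocks, `core_op` is a placeholder for DCNv3 (which needs spatial context and is not implemented here), and the weights are hypothetical.

```python
import math

def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def gelu(a):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * a * (1.0 + math.erf(a / math.sqrt(2.0)))

def ffn(v, w_in, w_out):
    """Two linear maps with a GELU in between (biases omitted for brevity)."""
    hidden = [gelu(sum(w * a for w, a in zip(row, v))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def basic_block(v, core_op, w_in, w_out):
    # Assumed pre-norm residual layout: LN -> core operator -> add, LN -> FFN -> add.
    v = [a + b for a, b in zip(v, core_op(layer_norm(v)))]
    v = [a + b for a, b in zip(v, ffn(layer_norm(v), w_in, w_out))]
    return v

identity_op = lambda u: list(u)            # stand-in for DCNv3
x = [1.0, 2.0, 3.0, 4.0]
w_in = [[0.1] * 4 for _ in range(8)]       # 4 -> 8 expansion (hypothetical weights)
w_out = [[0.1] * 8 for _ in range(4)]      # 8 -> 4 projection (hypothetical weights)
print(basic_block(x, identity_op, w_in, w_out))
```

Swapping `core_op` between a convolution-style operator and an attention-style operator while keeping everything else fixed is what makes the layout "transformer-like".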

The stem layer consists of two 3x3 convolutions with stride 2 and padding 1, flanked by Layer Normalization and GELU activations, which reduce the input resolution by 4x before the first stage. Between consecutive stages, downsampling layers employ a single 3x3 convolution with stride 2 and padding 1 followed by Layer Normalization, providing a 2x spatial reduction. The overall architecture follows the four-stage hierarchical design common to both modern CNNs and ViTs, producing feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution, which are compatible with standard downstream task heads such as FPN and UperNet.

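Paragraph function: Engineering details. It describes the stem, the downsampling layers, and the overall hierarchy.

Logical role: This paragraph ensures InternImage's compatibility with the existing ecosystem: the four-stage design and standard resolution ratios let it plug into existing detection and segmentation frameworks, lowering the barrier to adoption.

Argumentation technique / potential weakness: Emphasizing compatibility with FPN and UperNet is a pragmatic engineering consideration and shows the authors' attention to practical deployment. The four-stage design is an industry standard, but it may limit the freedom of architecture search; whether a non-standard hierarchy better suited to DCNv3 exists is worth exploring.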
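
As a quick check of the resolution arithmetic, the sketch below applies the standard convolution output-size formula, floor((n + 2p - k) / s) + 1, to the stem and downsampling layers described above. The helper names are illustrative.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def stage_resolutions(input_size):
    """Feature map sizes of the four stages for a square input."""
    size = conv_out(conv_out(input_size))  # stem: two stride-2 3x3 convolutions
    sizes = [size]
    for _ in range(3):                     # one stride-2 3x3 convolution between stages
        size = conv_out(size)
        sizes.append(size)
    return sizes

print(stage_resolutions(224))  # [56, 28, 14, 7], i.e. 1/4, 1/8, 1/16, 1/32
```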

3.3 Stacking Rules

To reduce the vast hyperparameter space when designing model variants, the authors propose four stacking rules that constrain the architecture. Rule 1: the channel dimensions of stages 2, 3, and 4 are determined by the first stage channel C₁ through fixed scaling factors (2x, 4x, 8x). Rule 2: the group number of each stage follows from its channel count and the per-group dimension, G_i = C_i / C'. Rules 3 and 4: the block count follows an "AABA" pattern, in which stages 1, 2, and 4 have equal numbers of blocks (L₁) while stage 3 has significantly more blocks (L₃), concentrating the majority of computational capacity in the third stage. These rules reduce the full 12-dimensional hyperparameter space (channels, groups, and block counts for each of the four stages) to just 4 variables: (C₁, C', L₁, L₃).

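Paragraph function: Systematic design. It reduces the complexity of architecture search through rules.

Logical role: This paragraph answers the practical question "how to build a model family systematically". Compressing the hyperparameters from 12 dimensions to 4 makes large-scale search feasible and also signals that the architecture was not chosen arbitrarily but through systematic exploration.

Argumentation technique / potential weakness: The "AABA" pattern resembles the "1:1:3:1" or "1:1:9:1" allocation ratios of Swin Transformer, suggesting the design may stem from empirical observation rather than theoretical derivation. Packaging it as "rules" adds a sense of methodological rigor, but the optimality of these rules is not rigorously proven.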
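
The four rules amount to a function from four free variables to twelve per-stage settings. A minimal sketch, assuming Rule 2 means G_i = C_i / C' and requiring that division to be exact (the function name and the divisibility check are illustrative assumptions):

```python
def expand_config(c1, c_prime, l1, l3):
    """Expand (C1, C', L1, L3) into per-stage channels, groups, and block counts."""
    channels = [c1 * 2 ** i for i in range(4)]      # Rule 1: 1x, 2x, 4x, 8x
    for c in channels:
        if c % c_prime:
            raise ValueError(f"channel count {c} is not divisible by C'={c_prime}")
    groups = [c // c_prime for c in channels]       # Rule 2: G_i = C_i / C'
    depths = [l1, l1, l3, l1]                       # Rules 3 and 4: AABA pattern
    return {"channels": channels, "groups": groups, "depths": depths}

config = expand_config(64, 16, 4, 18)
print(config)
# {'channels': [64, 128, 256, 512], 'groups': [4, 8, 16, 32], 'depths': [4, 4, 18, 4]}
```

The returned dictionary holds 4 + 4 + 4 = 12 values, which is the 12-dimensional space the rules collapse to four inputs.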

With the constrained search space, the authors discretize the remaining hyperparameters: C₁ ∈ {48, 64, 80}, L₁ ∈ {1, 2, 3, 4, 5}, and C' ∈ {16, 32}, producing 30 candidate models that are evaluated on ImageNet with a short training schedule. The optimal configuration (C₁=64, C'=16, L₁=4, L₃=18) yields approximately 30 million parameters and serves as the "origin model" (InternImage-T) from which all other variants are derived through systematic scaling.

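Paragraph function: Search validation. It reports the concrete process and final choice of the hyperparameter exploration.

Logical role: It turns "rules" into "experimental results": the discrete search over 30 candidates gives a reproducible exploration framework, and the final choice of (64, 16, 4, 18) is backed by experiments.

Argumentation technique / potential weakness: Thirty candidates look few but reasonable, showing efficiency. But whether rankings under a short training schedule reliably predict results under full training is a potential methodological risk; a different schedule might yield a different optimal configuration.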
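
The candidate count follows directly from the grid sizes, 3 x 5 x 2 = 30. The sketch below enumerates the grid; how L₃ is fixed for each candidate is not stated in the text above, so it is left out here.

```python
from itertools import product

C1_CHOICES = (48, 64, 80)
L1_CHOICES = (1, 2, 3, 4, 5)
C_PRIME_CHOICES = (16, 32)

# Each candidate is a (C1, L1, C') triple from the discretized search space.
candidates = list(product(C1_CHOICES, L1_CHOICES, C_PRIME_CHOICES))
print(len(candidates))              # 30
print((64, 4, 16) in candidates)    # True: the reported origin model lies on the grid
```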

3.4 Scaling Rules

Starting from the origin model, larger variants are obtained by jointly scaling depth and width using the compound scaling approach. The depth is scaled as D' = α^φ · D (where D = 3L₁ + L₃) and the width as C₁' = β^φ · C₁, subject to the constraint α · β^1.99 ≈ 2, ensuring that each scaling step approximately doubles the model's FLOPs. Through empirical evaluation of multiple (α, β) combinations, the authors select α = 1.09 and β = 1.36, which balances depth and width growth for optimal performance.

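Paragraph function: Scaling strategy. It defines a systematic method for extending from small models to large ones.

Logical role: This paragraph addresses the core engineering problem of "how to scale from 30M to over 1B parameters". The compound scaling idea is borrowed from EfficientNet, but the constraint (α · β^1.99 ≈ 2) provides a more theoretically grounded scaling framework.

Argumentation technique / potential weakness: The exponent 1.99 (close to 2) implies that FLOPs are driven mainly by the square of the width, consistent with the theoretical cost of convolution. But at very large scale (such as 1B parameters), memory bottlenecks may matter more than FLOPs, and a purely FLOPs-based scaling strategy may not be optimal.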
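
The constraint is easy to verify numerically for the selected pair. The sketch below checks α · β^1.99 for α = 1.09 and β = 1.36 and applies one scaling step to the origin model (D = 3·4 + 18 = 30, C₁ = 64). The function names are illustrative, and the raw outputs are unrounded: actual configurations must use integer block counts and channel widths, so the published variants need not equal these values exactly.

```python
ALPHA, BETA = 1.09, 1.36   # depth and width factors selected by the authors
WIDTH_EXPONENT = 1.99      # exponent on the width factor in the FLOPs constraint

def flops_factor(alpha=ALPHA, beta=BETA):
    """Approximate FLOPs multiplier of one scaling step."""
    return alpha * beta ** WIDTH_EXPONENT

def scale(depth, width, phi):
    """Raw (unrounded) depth and width after phi scaling steps."""
    return ALPHA ** phi * depth, BETA ** phi * width

origin_depth = 3 * 4 + 18  # D = 3 * L1 + L3 for the origin model
print(round(flops_factor(), 3))      # about 2.01, close to the target of 2
print(scale(origin_depth, 64, 1))    # roughly (32.7, 87.04)
```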

This scaling strategy produces six model variants spanning a wide range of computational budgets: InternImage-T (30M params), InternImage-S (50M), InternImage-B (97M), InternImage-L (223M), InternImage-XL (335M), and InternImage-H (1.08B params). The block configurations are: T has {4,4,18,4}, S has {4,4,21,4}, B has {4,4,21,4} with wider channels, L uses {5,5,22,5}, XL uses {5,5,24,5}, and the largest H variant uses {6,6,32,6} blocks across the four stages, with C₁ = 320 and C' = 32.

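Paragraph function: Concrete specifications. It lists the full model family and its configuration details.

Logical role: It turns the abstract scaling formula into six reproducible model variants. The span from 30M to 1.08B parameters covers the whole spectrum from lightweight to flagship, demonstrating the generality of the method.

Argumentation technique / potential weakness: The names of the six variants (T/S/B/L/XL/H) follow Swin Transformer's naming convention, making direct comparison easy. But the H variant's C' jumps from 16 to 32, departing from the other variants, which suggests the largest model needed extra manual tuning and slightly weakens the "fully automatic scaling" narrative.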
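
The listed block configurations can be checked against the AABA rule, and the total depth D = 3·L₁ + L₃ computed for each variant. The numbers are copied from the paragraph above; the helper names are illustrative.

```python
# Blocks per stage for each variant, as listed in the text.
VARIANTS = {
    "T": (4, 4, 18, 4),
    "S": (4, 4, 21, 4),
    "B": (4, 4, 21, 4),
    "L": (5, 5, 22, 5),
    "XL": (5, 5, 24, 5),
    "H": (6, 6, 32, 6),
}

def follows_aaba(blocks):
    """Stages 1, 2, and 4 share one block count; stage 3 has more."""
    l1, l2, l3, l4 = blocks
    return l1 == l2 == l4 and l3 > l1

def total_depth(blocks):
    """Total number of blocks, equal to 3 * L1 + L3 under the AABA pattern."""
    return sum(blocks)

for name, blocks in VARIANTS.items():
    print(name, follows_aaba(blocks), total_depth(blocks))
```

S and B share the same block counts, so they differ only in channel width, as the paragraph notes.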

4. Experiments

Image Classification on ImageNet. InternImage-T/S/B are trained for 300 epochs on ImageNet-1K, while InternImage-L/XL are first pre-trained on ImageNet-22K for 90 epochs then fine-tuned on ImageNet-1K for 20 epochs. For the largest model, InternImage-H is pre-trained on a joint dataset of 427 million images (Laion-400M, YFCC-15M, CC12M) for 30 epochs using the M3I pre-training approach, then fine-tuned on ImageNet-1K. Among small models, InternImage-T achieves 83.5% top-1 accuracy, surpassing ConvNeXt-T (82.1%) by 1.4 points. InternImage-B reaches 84.9%, outperforming ConvNeXt-B (83.8%) by 1.1 points. At the largest scale, InternImage-H achieves 89.6% on 640x640 inputs, approaching ViT-G/14's 90.5% with a gap of only 0.9 points.

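Paragraph function: Classification experiments. It validates models of every scale on ImageNet.

Logical role: This paragraph provides the full empirical chain for classification: small models clearly beat comparable CNNs, and the large model approaches the strongest ViT. Together they support the core claim that "CNNs can rival ViTs".

Argumentation technique / potential weakness: The small models are compared against ConvNeXt (a CNN), showing DCNv3's advantage over large-kernel methods. But the 0.9-point gap between InternImage-H and ViT-G is not negligible; given that ViT-G used a larger private dataset (JFT-3B), the gap may reflect a difference in data scale rather than in architecture.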

Object Detection on COCO. On the standard Mask R-CNN 1x schedule, InternImage-T achieves 47.2 box AP, outperforming Swin-T (42.7) by 4.5 points and ConvNeXt-T (44.2) by 3.0 points. InternImage-B reaches 48.8 box AP, 1.8 points ahead of ConvNeXt-B (47.0). With the advanced DINO detector and Objects365 pre-training, InternImage-H achieves a record 65.4 mAP on COCO test-dev, surpassing the previous best FD-SwinV2-G (64.2) by 1.2 points while using 27% fewer parameters (2.18B vs 3.00B). Notably, InternImage-H achieves this without knowledge distillation, which was employed by competing methods, further highlighting the inherent strength of the DCNv3-based backbone for dense prediction tasks.

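Paragraph function: Detection experiments. It shows broad advantages on COCO across several detection frameworks.

Logical role: This is the strongest empirical link in the argument chain. The SOTA record on COCO test-dev, achieved with a lower parameter count, directly and forcefully supports the core claim that "CNN foundation models are not inferior to ViTs".

Argumentation technique / potential weakness: Emphasizing "no knowledge distillation" is a clever bonus: it implies that InternImage's advantage comes from the architecture itself rather than from training tricks. But DINO is itself a strong detection framework, and part of the credit may belong to the detector rather than the backbone. In addition, different methods use different detectors, which complicates direct comparison.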
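
The "27% fewer parameters" figure can be reproduced from the two parameter counts quoted above. A small arithmetic check (the helper name is illustrative):

```python
def relative_reduction(ours, baseline):
    """Fraction by which `ours` is smaller than `baseline`."""
    return (baseline - ours) / baseline

# Parameter counts in billions, as quoted in the text.
print(f"{relative_reduction(2.18, 3.00):.1%}")   # 27.3%
# mAP margin over FD-SwinV2-G on COCO test-dev.
print(round(65.4 - 64.2, 1))                     # 1.2
```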

Semantic Segmentation on ADE20K. Using the UperNet framework with single-scale testing, InternImage-T achieves 47.9 mIoU, surpassing ConvNeXt-T (46.0) by 1.9 points. InternImage-B reaches 50.8 mIoU, ahead of RepLKNet-31B (49.9) by 0.9 points. At the large scale with Mask2Former and 896x896 crop inputs, InternImage-H achieves 62.9 mIoU, surpassing SwinV2-G (59.9) by 3.0 points with approximately 3x fewer parameters, and even slightly exceeding BEiT-3 (62.8 mIoU). On multi-scale evaluation, InternImage-H reaches 60.3 mIoU with UperNet, which alone surpasses SwinV2-G's 59.9 mIoU.

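Paragraph function: Segmentation experiments. It validates InternImage on pixel-level prediction with ADE20K.

Logical role: Together with the detection results, it completes the evidence for dense prediction tasks. Surpassing BEiT-3 on segmentation is especially important, because BEiT-3 is one of the strongest ViT-based segmentation models.

Argumentation technique / potential weakness: Consistent advantages on both detection and segmentation strongly support the claim that "DCNv3's adaptive spatial aggregation is particularly suited to dense prediction". But the classification gap (0.9 points behind ViT-G) suggests that DCNv3's advantage may be concentrated in spatially dense tasks, with room to improve in global semantic understanding.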

4.4 Ablation Studies

Effect of weight sharing. Comparing shared versus unshared convolution weights in DCNv3, weight sharing reduces parameters by 42.0% and GPU memory by 84.2% at the InternImage-H scale. Accuracy stays nearly identical, at 83.5% (shared) vs 83.6% (unshared) on ImageNet and 47.2 vs 47.4 box AP on COCO; these shared-weight figures match the InternImage-T results reported in the classification and detection experiments, so the accuracy comparison appears to be made at the -T scale rather than the -H scale. This indicates that weight sharing is essential for scaling DCN-based models to billions of parameters with almost no loss in performance, and the memory savings alone make large-scale training feasible.

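Paragraph function: Component validation. It quantifies the effect of weight sharing on efficiency and accuracy.

Logical role: This ablation directly validates the necessity of DCNv3's first modification: the 84.2% memory saving is striking and turns billion-parameter training from "impossible" into "feasible".

Argumentation technique / potential weakness: The 42% parameter reduction and 84% memory saving are striking numbers. But the accuracy gap (0.1 points on ImageNet, 0.2 AP on COCO), though small, is not zero, and whether it widens at larger scale or with longer training is worth watching.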

Multi-group mechanism and softmax normalization. Removing the multi-group design (using a single group) leads to a 1.2-point drop on ImageNet (from 83.5% to 82.3%) and a significant 3.4-point drop on COCO (from 47.2 to 43.8 box AP). Visualization of learned sampling locations shows that different groups concentrate their sampling points in different spatial regions, confirming that the multi-group mechanism enables learning diverse spatial patterns. For the normalization choice, replacing softmax with element-wise sigmoid causes severe training instability, with ImageNet accuracy plummeting to 65.7% and COCO box AP falling to 38.7. This validates that softmax normalization is critical for stable training of DCN-based large-scale models.

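Paragraph function: Design-choice validation. Ablation experiments confirm the necessity of the multi-group mechanism and softmax normalization.

Logical role: This completes the ablation of all three DCNv3 modifications. The collapse to 65.7% accuracy under sigmoid normalization is the most persuasive result: it nearly destroys the usability of the model, strongly demonstrating that softmax is indispensable.

Argumentation technique / potential weakness: The 3.4-point drop on COCO is larger than the 1.2-point drop on ImageNet, suggesting that the multi-group mechanism matters more for dense prediction tasks. The catastrophic failure of sigmoid may in turn suggest that DCNv3's stability depends heavily on one specific normalization choice, so the model's robustness may be limited.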

Effective receptive field analysis. Using gradient-based visualization, the authors compare the effective receptive fields (ERFs) of ResNet-101 and InternImage-S. Before training, both models exhibit localized ERFs. After training, InternImage achieves near-global ERFs in stages 3 and 4, whereas ResNet's ERF remains relatively small even in its deepest layers. Interestingly, InternImage's ERF pattern differs from ViTs: while ViTs maintain global ERFs throughout all layers, InternImage shows a progressive expansion of the effective receptive field with increasing depth, suggesting it learns a hierarchical spatial representation — local details in early layers and global context in later layers.

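Paragraph function: Mechanism analysis. It visualizes the receptive-field behavior of DCNv3 to give an intuitive understanding.

Logical role: This paragraph responds to the introduction's core criticism that "the effective receptive field is limited", using visual evidence that DCNv3 does overcome this limitation. The finding of progressive expansion also gives InternImage distinctive theoretical value.

Argumentation technique / potential weakness: The "hierarchical spatial representation" interpretation turns InternImage's difference from ViTs into an advantage, implying that progressive receptive-field expansion may suit multi-scale vision tasks better than the "global throughout" behavior of ViTs. But this claim needs more rigorous causal evidence rather than inference from ERF visualizations alone.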

5. Conclusion

This paper presents InternImage, which demonstrates that CNN-based foundation models can match or even surpass transformer-based models when properly designed with deformable convolution operators, transformer-inspired architectural components, and systematic scaling strategies. Through the proposed DCNv3 operator — featuring weight sharing, multi-group aggregation, and softmax normalization — the model effectively addresses the long-standing limitations of CNNs in receptive field flexibility and adaptive spatial aggregation. The comprehensive experimental results across image classification, object detection, and semantic segmentation confirm that InternImage achieves state-of-the-art results on COCO (65.4 mAP) and ADE20K (62.9 mIoU) while remaining parameter-efficient.

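Paragraph function: Overall summary. It restates the core contributions, echoing the abstract and the introduction.

Logical role: The conclusion closes the loop of the argument: starting from the introduction's question "can CNNs rival ViTs?", passing through method design and experimental validation, it ends with an affirmative answer.

Argumentation technique / potential weakness: The wording is concise, condensing the three technical innovations into one sentence. The phrase "match or even surpass" is relatively conservative and accurately reflects that InternImage surpasses ViTs on dense prediction and trails slightly on classification.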

The authors also acknowledge limitations of the current work. The latency of DCN operators remains problematic for real-time downstream applications, as the irregular memory access patterns of deformable convolution are less hardware-friendly than standard convolutions or self-attention. Additionally, large-scale CNN research is still in its early stages, and InternImage's performance gap with methods that leverage even larger private datasets (such as JFT-3B) suggests room for further improvement. The authors position this work as a starting point for efficient large-scale CNN development, encouraging the community to continue exploring the potential of CNN-based vision foundation models.

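Paragraph function: Limitations. It candidly points out the shortcomings of the method and directions for future improvement.

Logical role: This paragraph adds academic honesty to the argument: acknowledging the latency problem and the data-scale gap avoids overclaiming and points the way for follow-up research.

Argumentation technique / potential weakness: Latency is a fundamental bottleneck of DCN: irregular memory access is far less efficient on GPUs than regular operations. This limitation may hinder broad industrial adoption of InternImage. Framing the work as a "starting point" rather than an "end point" is a wise positioning that both manages expectations and leaves room for follow-up work such as InternImage-v2.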

Overview of the Argument Structure

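Problem: CNN foundation models lag behind ViTs on large-scale vision tasks

→

Claim: DCNv3 deformable convolution closes the gap in receptive field and adaptive aggregation

→

Evidence: SOTA results of 65.4 mAP on COCO and 62.9 mIoU on ADE20K

→

Rebuttal: Latency and the data-scale gap are currently known limitations

→

Conclusion: CNN foundation models still have potential and merit continued exploration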

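The Authors' Core Claim (in one sentence)

With deformable convolution v3 as the core operator, combined with Transformer-style architectural components and a systematic scaling strategy, CNN foundation models can match or even surpass vision Transformers on dense prediction tasks at a scale of over one billion parameters.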

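Strongest Point of the Argument

The overwhelming advantage on dense prediction tasks: InternImage-H sets a new record of 65.4 mAP on COCO test-dev while using 27% fewer parameters than SwinV2-G and without relying on knowledge distillation. The three DCNv3 modifications (weight sharing, multi-group, softmax) are all backed by rigorous ablation experiments, and each shows a clear and irreplaceable contribution.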

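Weakest Point of the Argument

The classification gap and the latency problem: on ImageNet classification, InternImage-H (89.6%) still trails ViT-G (90.5%) by 0.9 points, and ViT-G used a larger private dataset, so the comparison is not entirely fair. The more fundamental problem is that the irregular memory access pattern of DCN makes actual inference latency higher than the theoretical FLOPs suggest, which limits its adoption in real-time applications.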