Deformable Convolutional Networks

Abstract

Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations because of the fixed geometric structures of their building modules. In this work, we introduce two new modules to enhance the transformation modeling capability of CNNs, namely deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard backpropagation. Extensive experiments validate the effectiveness of our approach on object detection and semantic segmentation tasks.

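Paragraph function: Overview of the whole paper. It identifies the fixed-geometry limitation of CNNs and introduces deformable convolution and deformable RoI pooling as the remedy.

Logical role: The abstract follows a limitation, solution, validation structure, and its core claim is clear: fixed geometric structures limit the representational capacity of CNNs, and learnable offsets are the remedy.

Argumentative technique / potential weaknesses: Stressing "without additional supervision" and "standard backpropagation" lowers the barrier to adoption. But is the fixed geometric structure really the core bottleneck of CNNs? For tasks that do not involve large geometric deformations, the argument may be less persuasive.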

1. Introduction

A key challenge in visual recognition is how to accommodate geometric variations or transformations in object scale, pose, viewpoint, and part deformation. There are generally two approaches: building training datasets with sufficient variation (data augmentation), or using transformation-invariant features and algorithms. Both are limited: data augmentation is costly and cannot cover all transformations, while hand-crafted invariances may not generalize. The authors argue that the geometric transformation modeling should be internal and adaptive, learned from the data rather than fixed by design.

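Paragraph function: Establishes the motivation. It starts from the challenge of geometric transformations and criticizes the shortcomings of the two existing strategies.

Logical role: This paragraph sets up a dichotomy and then moves beyond it. Data augmentation (an external strategy) is set against invariant features (a static strategy); both fall short, so a new strategy that is internal and adaptive is needed (deformable convolution).

Argumentative technique / potential weaknesses: The claim that data augmentation cannot cover all transformations holds in theory, but in practice large-scale datasets combined with rich augmentation have proven highly effective. The incremental benefit of deformable convolution therefore needs to be quantified explicitly in the experiments.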

CNNs are built on fixed-structure modules: convolutions sample at fixed grid locations, pooling reduces over fixed spatial bins, and RoI pooling separates features into fixed spatial partitions. This fixed structure limits their capacity to handle unknown geometric transformations. The authors propose to make these structures adaptive by learning spatial offsets. Deformable convolution adds 2D offsets to each sampling position of the regular grid, enabling free-form deformation of the sampling pattern. Deformable RoI pooling adds offsets to each bin position, enabling adaptive part localization. Both modules are lightweight, adding minimal parameters and computation overhead.

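Paragraph function: Presents the solution. It outlines the core idea behind deformable convolution and deformable RoI pooling.

Logical role: This paragraph turns the abstract problem of the previous paragraph ("fixed structure") into three concrete instances (convolution, pooling, RoI pooling) and then proposes a matching remedy for each. The emphasis on "lightweight" preempts the likely objection about whether the added complexity is worth it.

Argumentative technique / potential weaknesses: The capacity for "free-form deformation" sounds powerful, but it raises a question: how does the model avoid learning degenerate offset patterns (for example, all offsets collapsing to zero, which reduces the module to standard convolution)? This regularization issue is not adequately discussed.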

The work is distinguished from several related approaches. Unlike Spatial Transformer Networks (STN), which perform global parametric warping, deformable convolution performs local, dense, per-position sampling. Unlike Active Convolution, which learns static offsets shared across all locations, deformable convolution uses dynamic, input-dependent offsets that vary for each spatial position. The approach generalizes atrous (dilated) convolution, which corresponds to the special case in which the offsets are fixed integers. Compared to Deformable Part Models (DPM), the proposed method is simpler, trainable end-to-end, and integrated into modern deep architectures.

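Paragraph function: Positions the work in the literature. Four precise contrasts define what is distinctive about deformable convolution.

Logical role: This paragraph systematically rules out possible confusions: STN (global versus local), Active Convolution (static versus dynamic), atrous convolution (fixed versus learned), and DPM (traditional versus end-to-end). Each contrast reinforces the positioning of deformable convolution as local, dynamic, and learnable.

Argumentative technique / potential weaknesses: Treating atrous convolution as a special case of deformable convolution is an elegant generalization argument. The comparison with STN, however, may not be entirely fair: STN is designed for global transformations (such as rotation and scaling), whereas deformable convolution targets local deformation, so the two address problems at different granularities.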
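
The claim that atrous convolution is a special case can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names (`sampling_locations`, `dilation_offsets`, `dilated_locations`) are hypothetical and not taken from the paper, and locations are written as (row, column) pairs. It shows that choosing the fixed integer offsets (d - 1) * p_n turns the 3x3 sampling pattern p_0 + p_n + Delta_p_n into that of a convolution with dilation d.

```python
# 3x3 regular grid R, as (row, col) displacements from the center.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def sampling_locations(p0, offsets):
    """Return p0 + p_n + delta_p_n for every grid position p_n in R."""
    return [(p0[0] + pn[0] + d[0], p0[1] + pn[1] + d[1])
            for pn, d in zip(R, offsets)]


def dilation_offsets(dilation):
    """Fixed integer offsets that stretch R by `dilation` (hypothetical helper)."""
    return [((dilation - 1) * dy, (dilation - 1) * dx) for dy, dx in R]


def dilated_locations(p0, dilation):
    """Sampling locations of an ordinary 3x3 convolution with the given dilation."""
    return [(p0[0] + dilation * dy, p0[1] + dilation * dx) for dy, dx in R]


p0 = (5, 5)
standard = sampling_locations(p0, dilation_offsets(1))  # all offsets are zero
atrous = sampling_locations(p0, dilation_offsets(2))    # offsets equal p_n
```

With dilation 1 every offset is zero and the pattern is the plain 3x3 grid; with dilation 2 the corners move from distance 1 to distance 2 from the center. A deformable layer differs in that the offsets are predicted per position from the input and are generally fractional.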

3. Deformable Convolutional Networks

3.1 Deformable Convolution

In standard convolution, the sampling grid R defines a fixed set of offsets (e.g., R = {(-1,-1), (-1,0), ..., (1,1)} for a 3x3 kernel). Deformable convolution augments each position p_n in R with a learnable offset Delta_p_n, so the output becomes: y(p_0) = sum_{p_n in R} w(p_n) * x(p_0 + p_n + Delta_p_n). Since the offsets are typically fractional, bilinear interpolation is used: x(p) = sum_q G(q, p) * x(q), where G is a bilinear interpolation kernel. The offsets are learned through a separate convolutional layer applied to the input feature map, producing 2N offset values (x and y for N sampling positions). This is fully differentiable and trained end-to-end.

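Paragraph function: Core method. It defines the operation of deformable convolution in mathematical form.

Logical role: This paragraph is the technical core of the paper. Starting from standard convolution, adding a single Delta_p_n term is enough to move from fixed to adaptive sampling. Bilinear interpolation keeps fractional offsets differentiable, which is the key technical guarantee for end-to-end training.

Argumentative technique / potential weaknesses: The derivation is only one step away from standard convolution, which keeps the conceptual barrier very low. However, bilinear interpolation may introduce inaccurate approximations when offsets are large, and whether the learned offsets converge to meaningful geometric patterns has to be verified through visualization.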
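
The sampling rule of Section 3.1 can be sketched for a single channel and a single output location. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the offsets are passed in directly rather than predicted by a convolutional layer, locations are (row, column) pairs, and samples that fall outside the feature map are assumed to contribute zero. The bilinear kernel is taken in its usual separable form, G(q, p) = max(0, 1 - |q_y - p_y|) * max(0, 1 - |q_x - p_x|).

```python
import math

# 3x3 regular grid R, as (row, col) displacements from the center.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear_sample(x, py, px):
    """Evaluate x(p) = sum_q G(q, p) * x(q) at a fractional location (py, px).

    x is a 2D feature map given as a list of rows. Only the four integer
    neighbors of p have a nonzero kernel value, so only those are visited.
    Neighbors outside the map are skipped (assumed zero padding).
    """
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                g = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
                value += g * x[qy][qx]
    return value


def deformable_conv_at(x, weights, offsets, p0):
    """Compute y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n) for one channel."""
    total = 0.0
    for (dy, dx), w_n, (oy, ox) in zip(R, weights, offsets):
        total += w_n * bilinear_sample(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return total


# Toy 5x5 feature map with x[i][j] = 10*i + j, and an all-ones kernel.
feature = [[10 * i + j for j in range(5)] for i in range(5)]
weights = [1.0] * 9
y_standard = deformable_conv_at(feature, weights, [(0.0, 0.0)] * 9, (2, 2))
y_shifted = deformable_conv_at(feature, weights, [(0.5, 0.5)] * 9, (2, 2))
```

With zero offsets the result equals the ordinary 3x3 sum around (2, 2). Shifting every sample by half a pixel changes the output smoothly, which is what allows gradients to flow back to the offsets during training.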

3.2 Deformable RoI Pooling

Deformable RoI pooling extends the same idea to region-of-interest pooling. In standard RoI pooling, an input region is divided into k x k fixed bins. Deformable RoI pooling adds offsets to the center of each bin, enabling the network to adaptively focus on the most discriminative parts of an object. The offsets are computed by a small FC network that takes the standard pooled features as input and outputs 2k^2 normalized offsets. This also extends to position-sensitive (PS) RoI pooling, where offsets enable adaptive part localization beyond the fixed grid partition. Both variants show that learned offsets correlate with object scale and shape, confirming the adaptive receptive field behavior.

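Paragraph function: Extension of the method. It carries the deformable idea over from convolution to RoI pooling.

Logical role: This paragraph shows that "deformable" works as a general design principle: it applies to pooling as well as to convolution. The correlation between the offsets and object scale and shape provides preliminary evidence of interpretability.

Argumentative technique / potential weaknesses: Extending the idea from convolution to pooling strengthens the consistency and completeness of the method. However, the offset visualizations are mainly qualitative, and a quantitative analysis of the offsets' learning dynamics and convergence behavior is missing.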
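
The bin-shifting idea of Section 3.2 can be sketched in the same style. Several details here are assumptions that the text above does not state: the normalized offsets are scaled by a factor `gamma` times the RoI height and width (the default 0.1 is an illustrative choice), each bin is averaged over a small fixed set of bilinear samples, and the offsets are supplied by the caller instead of being produced by an FC layer. The function names and the `(top, left, height, width)` RoI layout are hypothetical.

```python
import math


def bilinear_sample(x, py, px):
    """Bilinearly interpolate the 2D map x at (py, px); outside counts as zero."""
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                g = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
                value += g * x[qy][qx]
    return value


def deformable_roi_pool(x, roi, k, norm_offsets, gamma=0.1, samples=2):
    """Average-pool an RoI into k x k bins, each bin shifted by its own offset.

    roi is (top, left, height, width). norm_offsets[i][j] is the normalized
    (dy, dx) offset of bin (i, j); it is scaled by gamma and the RoI size
    before being added to every sample location of that bin.
    """
    top, left, h, w = roi
    bin_h, bin_w = h / k, w / k
    out = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            oy = gamma * norm_offsets[i][j][0] * h
            ox = gamma * norm_offsets[i][j][1] * w
            acc = 0.0
            for a in range(samples):
                for b in range(samples):
                    py = top + (i + (a + 0.5) / samples) * bin_h + oy
                    px = left + (j + (b + 0.5) / samples) * bin_w + ox
                    acc += bilinear_sample(x, py, px)
            out[i][j] = acc / (samples * samples)
    return out


# Toy 8x8 feature map with x[i][j] = 10*i + j and a 4x4 RoI split into 2x2 bins.
feature = [[10 * i + j for j in range(8)] for i in range(8)]
roi = (1, 1, 4, 4)
zero = [[(0.0, 0.0)] * 2 for _ in range(2)]
down = [[(1.0, 0.0)] * 2 for _ in range(2)]  # push every bin downward
plain = deformable_roi_pool(feature, roi, 2, zero)
moved = deformable_roi_pool(feature, roi, 2, down)
```

With zero offsets the function reduces to ordinary average RoI pooling over the fixed bins. Scaling the offsets by the RoI size makes the same normalized value correspond to a larger pixel shift for a larger RoI.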

4. Experiments

Experiments integrate deformable modules into ResNet-101 with Faster R-CNN, R-FCN, and DeepLab. On COCO object detection, deformable convolution achieves an 11-13% relative improvement across methods. On PASCAL VOC semantic segmentation, significant mIoU gains are observed. The computational overhead is minimal: making one to six convolutional layers deformable adds negligible parameters and runtime. Ablation studies confirm that (1) more deformable layers yield better results up to a saturation point; (2) the improvements are particularly strong for large and deformable object categories; and (3) the learned offsets visually correspond to object structure, expanding for large objects and contracting for small ones.

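Paragraph function: Core empirical evidence. It validates the deformable modules across several frameworks and tasks.

Logical role: This paragraph covers three dimensions of validation: (1) consistent improvement across frameworks (generality); (2) practicality through low overhead; (3) interpretability of the offsets. The particular gains on large and deformable categories directly support the method's design motivation.

Argumentative technique / potential weaknesses: The offset visualizations are very strong evidence of interpretability. But what are the concrete numbers behind the 11-13% relative improvement? A relative percentage can mask a limited gain in absolute terms. In addition, whether the improvement on small objects and rigid categories is significant deserves closer scrutiny.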
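
The "negligible parameters" claim can be made concrete with a rough count for one layer. The sketch below is a back-of-the-envelope estimate, not a figure from the paper: it assumes the offset branch is a convolution with the same kernel size as the main layer and 2N output channels, and the channel width of 512 is a hypothetical value chosen only for illustration.

```python
def conv_params(c_in, c_out, kernel=3, bias=True):
    """Number of learnable parameters in a kernel x kernel convolution."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def offset_branch_params(c_in, kernel=3):
    """Parameters of the offset-predicting convolution (assumed same kernel size).

    It outputs 2N channels: an x and a y offset for each of the
    N = kernel * kernel sampling positions.
    """
    n = kernel * kernel
    return conv_params(c_in, 2 * n, kernel)


channels = 512  # hypothetical layer width, for illustration only
main = conv_params(channels, channels)
extra = offset_branch_params(channels)
ratio = extra / main
```

Under these assumptions the offset branch adds roughly 18 / C of the main layer's weights, where C is the channel width, so the relative cost shrinks as layers get wider. This concerns parameters only and does not estimate the runtime cost of bilinear sampling.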

5. Conclusion

This paper presents the first demonstration that learning dense spatial transformations in deep CNNs is effective for sophisticated vision tasks. Deformable convolution and deformable RoI pooling are simple, lightweight modules that augment existing networks with adaptive geometric transformation modeling. The learned offsets show meaningful correspondence with object geometry, and the approach achieves consistent improvements across detection and segmentation frameworks with minimal overhead. The work opens a new direction for building transformation-adaptive neural network components.

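Paragraph function: Summarizes the paper. It restates the pioneering "first demonstration" claim and looks ahead to a new direction.

Logical role: The conclusion echoes the abstract and stresses three key messages: (1) first validation; (2) simple and lightweight; (3) consistent improvement. The phrase "new direction" implies that this work is the starting point for a line of follow-up research.

Argumentative technique / potential weaknesses: The "first demonstration" claim calls for caution, since STN had already explored this direction. In addition, the conclusion does not discuss the potential risks of deformable convolution: the offsets may learn degenerate patterns that sample far from the relevant region, and overfitting is possible on small datasets.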

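Argument structure overview

Problem: the fixed geometric structure of CNNs limits their transformation modeling capability
→ Claim: learnable offsets enable adaptive spatial sampling
→ Evidence: consistent improvements on COCO/VOC; offsets correspond to object structure
→ Rebuttal: lightweight modules with minimal overhead; plug-and-play, end-to-end training
→ Conclusion: transformation-adaptive components open a new research direction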

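Authors' core claim (in one sentence)

Adding learnable 2D offsets to the sampling locations of convolution and RoI pooling lets CNNs model geometric transformations adaptively, yielding consistent and significant improvements on object detection and semantic segmentation at very small additional cost.

Strongest point of the argument

Plug-and-play practicality: the deformable modules can directly replace their counterparts in existing CNNs, with no changes to the training pipeline and no additional supervision. The consistent improvement across three frameworks (Faster R-CNN, R-FCN, DeepLab), together with the visual correspondence between offsets and object geometry, provides both practical value and interpretability.

Weakest point of the argument

Regularization and stability of the offsets: offset learning lacks an explicit regularization constraint. How are degenerate patterns (offsets collapsing to zero or diverging) avoided? The limited gains on small objects and rigid categories suggest that the method's benefit may be confined to particular object types. The later Deformable ConvNets v2 introduced a modulation mechanism to address this problem, which indirectly confirms the shortcomings of v1.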