RepVGG: Making VGG-Style ConvNets Great Again

Abstract

The authors present RepVGG, a simple architecture whose inference-time body consists of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. The decoupling of training and inference architectures is realized by a structural re-parameterization technique, which converts the multi-branch training model into a plain inference model. RepVGG reaches over 80% top-1 accuracy on ImageNet with higher speed than ResNet-50 and ResNet-101, demonstrating that plain models can be made as powerful as complex multi-branch architectures.

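Paragraph function: Overview of the whole paper. It sums up RepVGG's core innovation as a duality: simple inference, complex training.

Logical role: The abstract builds its case in three steps: (1) the minimalism of the inference-time model (only 3x3 convolution and ReLU); (2) the means of achieving it (structural re-parameterization); (3) the performance evidence (over 80% top-1 on ImageNet).

Argumentative technique / potential weakness: The title "Making VGG-Style ConvNets Great Again" is itself strong rhetoric. It evokes nostalgia for the simplicity of a classic architecture while implying that complex architectures are not necessarily the final answer. There is a gap, however, between the "simplicity" of the VGG era and RepVGG's multi-branch training model: here the simplicity holds only at inference time.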

1. Introduction

Since the introduction of ResNet, modern convolutional neural network architectures have become increasingly complex with multi-branch designs such as residual connections, dense connections, and neural architecture search (NAS) derived structures. While these designs improve accuracy, they come at the cost of "increased memory consumption due to branch-level outputs" and "reduced inference speed due to the lack of support for efficient parallel computation." The classic VGG architecture with its plain stack of convolutions is "fast and memory-efficient" but has been deemed "outdated in terms of accuracy."

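Paragraph function: Sets up the research landscape by contrasting the accuracy advantage of complex multi-branch architectures with their efficiency cost.

Logical role: Starting point of the argument chain. It establishes the tension between accuracy and efficiency so that the reader sees an inherent trade-off in existing designs. Casting VGG as "fast but outdated" sets up the later "revival".

Argumentative technique / potential weakness: Presenting the efficiency problem of multi-branch architectures side by side with the accuracy problem of VGG hints that both can be solved at once. However, modern hardware (such as GPU tensor cores) is already optimized for certain multi-branch patterns, so the size of the efficiency penalty depends on the specific hardware.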

The authors argue that "the training-time and inference-time architecture need not be the same." A model can benefit from multi-branch topology during training (for better gradient flow and implicit regularization) while being converted to a plain architecture at inference time (for speed and simplicity). The key enabling technique is structural re-parameterization: "we equivalently convert a trained multi-branch model into a plain model using algebraic transformations on the convolution kernels."

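Paragraph function: Core insight. Proposes the philosophy of decoupling the training and inference architectures.

Logical role: This paragraph carries the paper's central claim: it breaks the implicit assumption that training and inference must use the same architecture. Structural re-parameterization is the technical means of realizing this insight.

Argumentative technique / potential weakness: The claim that training and inference can differ has paradigm-shifting potential: it turns architecture design from a single objective (jointly optimal for training and inference) into a two-objective optimization. Its scope is limited, though: it applies only to structures that can be converted by algebraically equivalent transformations (for example, merging linear operations).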
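
The restriction to linear operations can be made concrete with a short derivation. As a sketch, write $*$ for convolution, $x$ for the input, and $W_1, W_2$ for two kernels of the same shape (these symbols are chosen here for illustration and are not taken from the paper):

```latex
\begin{aligned}
x * W_1 + x * W_2 &= x * (W_1 + W_2), \\
\mathrm{ReLU}(x * W_1) + \mathrm{ReLU}(x * W_2) &\neq \mathrm{ReLU}\bigl(x * (W_1 + W_2)\bigr)
  \quad \text{in general.}
\end{aligned}
```

The first line is why parallel convolution branches can collapse into one kernel. The second is why a nonlinearity placed inside a branch blocks the merge.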

2. Related Work

Multi-branch architectures have dominated since ResNet introduced skip connections, followed by DenseNet's dense connectivity and NAS-derived models like EfficientNet. Model compression techniques such as pruning, quantization, and knowledge distillation aim to improve inference efficiency but "work within the same architectural paradigm." The concept of re-parameterization has appeared in kernel decomposition and Winograd-domain transformations, but prior work has not exploited it for converting between fundamentally different architectures.

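Paragraph function: Literature review covering three lines of research: multi-branch architectures, model compression, and re-parameterization.

Logical role: Positions RepVGG as an innovation that spans two subfields, architecture design and model compression. It neither simply designs a new architecture nor compresses an existing model; it redefines the relationship between training and inference.

Argumentative technique / potential weakness: Classifying model compression as working "within the same paradigm" makes RepVGG's cross-paradigm positioning stand out. But knowledge distillation also embodies the idea of "complex at training time, simple at inference time", which resembles RepVGG's philosophy, so the differentiation may be less clear-cut than the text suggests.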

3. Method

3.1 Training-Time Multi-Branch Block

During training, each RepVGG block consists of three parallel branches: a 3x3 convolution, a 1x1 convolution, and an identity shortcut (when input and output dimensions match). All branches are followed by batch normalization (BN), and their outputs are summed element-wise before ReLU activation. This multi-branch structure provides "the benefits of implicit regularization and better gradient flow similar to ResNet," resulting in higher training accuracy than a plain 3x3 stack.

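Paragraph function: First step of the method description. Defines the structure of the training-time multi-branch block.

Logical role: Shows that RepVGG does not give up the benefits of multiple branches but confines them to training. The choice of the three branches (3x3, 1x1, identity) is not arbitrary: they are exactly the ones that can be merged into a single 3x3 convolution.

Argumentative technique / potential weakness: The branch design looks simple but is carefully engineered: both the 1x1 and the identity branch can be zero-padded into 3x3 kernels, which is the precondition for the later merge. The authors do not explicitly discuss why 5x5 or larger kernels are excluded. The likely reason is that a branch can be folded into the 3x3 kernel only if its own kernel fits inside 3x3 after zero-padding, so a larger kernel would force the merged inference kernel to grow beyond 3x3.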
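
The block described above can be sketched in plain Python to make the data flow explicit. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, tensors are nested lists of shape [channels][height][width], and BN is applied with fixed per-channel statistics (real training would compute them from each batch).

```python
import math

def conv2d(x, w, pad):
    """Stride-1 convolution. x is [C_in][H][W]; w is [C_out][C_in][k][k]."""
    height, width, k = len(x[0]), len(x[0][0]), len(w[0][0])
    out = []
    for w_out in w:
        plane = [[0.0] * width for _ in range(height)]
        for x_in, kern in zip(x, w_out):
            for r in range(height):
                for c in range(width):
                    for dr in range(k):
                        for dc in range(k):
                            rr, cc = r + dr - pad, c + dc - pad
                            if 0 <= rr < height and 0 <= cc < width:  # zero padding
                                plane[r][c] += kern[dr][dc] * x_in[rr][cc]
        out.append(plane)
    return out

def batch_norm(x, bn, eps=1e-5):
    """BN with fixed per-channel (gamma, beta, mean, var); no batch statistics."""
    out = []
    for plane, (gamma, beta, mean, var) in zip(x, bn):
        scale = gamma / math.sqrt(var + eps)
        out.append([[scale * (v - mean) + beta for v in row] for row in plane])
    return out

def repvgg_train_block(x, w3, bn3, w1, bn1, bn_id):
    """Sum the 3x3, 1x1 and identity branches (each followed by BN), then ReLU."""
    b3 = batch_norm(conv2d(x, w3, pad=1), bn3)
    b1 = batch_norm(conv2d(x, w1, pad=0), bn1)
    b0 = batch_norm(x, bn_id)  # identity shortcut: requires C_in == C_out
    return [[[max(0.0, p + q + s) for p, q, s in zip(r3, r1, r0)]
             for r3, r1, r0 in zip(p3, p1, p0)]
            for p3, p1, p0 in zip(b3, b1, b0)]

# Toy example: one channel, 2x2 input, BN set to (almost) the identity map.
x = [[[1.0, -2.0], [3.0, 4.0]]]
w3 = [[[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]]]  # copies the input
w1 = [[[[2.0]]]]                                               # doubles the input
neutral_bn = [(1.0, 0.0, 0.0, 1.0)]
y = repvgg_train_block(x, w3, neutral_bn, w1, neutral_bn, neutral_bn)
print(y)  # close to ReLU(4 * x), up to the small eps inside BN
```

In the toy example the three branches contribute x, 2x and x, so the block returns roughly ReLU(4x); the negative input is clipped to zero.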

3.2 Structural Re-parameterization

The key insight is that a convolution followed by BN can be fused into a single convolution with bias, and that multiple parallel convolutions of compatible kernel sizes can then be merged into one. Specifically: (1) The 1x1 convolution kernel is zero-padded to 3x3. (2) The identity branch is represented as a 3x3 kernel that has 1 at the center of the slice mapping each input channel to the same output channel and 0 everywhere else. (3) Each conv-BN sequence is fused by folding the BN parameters (scale, shift, running mean and running variance) into the convolution weights and bias. (4) The three resulting 3x3 kernels, and their three biases, are summed to produce a single 3x3 convolution. This conversion is "mathematically exact: the plain model produces identical outputs to the multi-branch model for any input."
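
The fusion in step (3) can be written out explicitly. As a sketch, let $W$ be the kernel of one branch and let $\mu$, $\sigma^2$, $\gamma$, $\beta$ be the running mean, running variance, scale and shift of the BN that follows it, with $\epsilon$ the usual small constant for numerical stability. The notation is chosen here for illustration; the index $i$ runs over output channels.

```latex
\begin{aligned}
\mathrm{BN}(x * W)_i
  &= \gamma_i \, \frac{(x * W)_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \beta_i
   = (x * W')_i + b'_i, \\
W'_i &= \frac{\gamma_i}{\sqrt{\sigma_i^2 + \epsilon}} \, W_i, \qquad
b'_i = \beta_i - \frac{\gamma_i \, \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} .
\end{aligned}
```

Once each of the three branches has been rewritten in this form with a 3x3 kernel, step (4) is a plain sum: the merged kernel is the sum of the three $W'$ and the merged bias is the sum of the three $b'$.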

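Paragraph function: Core technique. Walks step by step through the algebraic operations of structural re-parameterization.

Logical role: The technical heart of the paper. A "mathematically exact" equivalent conversion guarantees no loss of accuracy, which is RepVGG's key advantage over approximate methods such as pruning and distillation.

Argumentative technique / potential weakness: The four-step derivation is clear and verifiable, and "mathematically exact" is a very strong guarantee. The exactness holds only when BN is in inference mode (using accumulated running statistics); if BN is in training mode (using per-batch statistics), the fusion is not exact. The authors implicitly assume the reader understands this distinction.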
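
The four steps and the exactness claim can be checked numerically. The following is a minimal pure-Python sketch, not the authors' code: all function names are hypothetical, tensors are nested lists, weights are random, and BN is assumed to be in inference mode with fixed (gamma, beta, mean, var) per channel. ReLU is left out because it is applied identically to both sides.

```python
import math
import random

def conv2d(x, w, pad, bias=None):
    """Stride-1 convolution. x is [C_in][H][W]; w is [C_out][C_in][k][k]."""
    height, width, k = len(x[0]), len(x[0][0]), len(w[0][0])
    out = []
    for o, w_out in enumerate(w):
        start = bias[o] if bias is not None else 0.0
        plane = [[start] * width for _ in range(height)]
        for x_in, kern in zip(x, w_out):
            for r in range(height):
                for c in range(width):
                    for dr in range(k):
                        for dc in range(k):
                            rr, cc = r + dr - pad, c + dc - pad
                            if 0 <= rr < height and 0 <= cc < width:  # zero padding
                                plane[r][c] += kern[dr][dc] * x_in[rr][cc]
        out.append(plane)
    return out

def batch_norm(x, bn, eps=1e-5):
    """Inference-mode BN with per-channel (gamma, beta, mean, var)."""
    out = []
    for plane, (gamma, beta, mean, var) in zip(x, bn):
        s = gamma / math.sqrt(var + eps)
        out.append([[s * (v - mean) + beta for v in row] for row in plane])
    return out

def pad_1x1_to_3x3(w):
    """Step 1: place each 1x1 weight at the center of a zero 3x3 kernel."""
    return [[[[0.0, 0.0, 0.0], [0.0, k[0][0], 0.0], [0.0, 0.0, 0.0]] for k in w_out]
            for w_out in w]

def identity_1x1(channels):
    """Step 2 (before padding): a 1x1 kernel that copies channel i to channel i."""
    return [[[[1.0 if i == o else 0.0]] for i in range(channels)]
            for o in range(channels)]

def fuse_conv_bn(w, bn, eps=1e-5):
    """Step 3: fold BN into the kernel and a per-channel bias."""
    w_f, b_f = [], []
    for w_out, (gamma, beta, mean, var) in zip(w, bn):
        s = gamma / math.sqrt(var + eps)
        w_f.append([[[s * v for v in row] for row in kern] for kern in w_out])
        b_f.append(beta - s * mean)
    return w_f, b_f

def reparameterize(w3, bn3, w1, bn1, bn_id):
    """Step 4: sum the three fused 3x3 kernels and their biases."""
    c = len(w3)
    fused = [fuse_conv_bn(w3, bn3),
             fuse_conv_bn(pad_1x1_to_3x3(w1), bn1),
             fuse_conv_bn(pad_1x1_to_3x3(identity_1x1(c)), bn_id)]
    kernel = [[[[sum(f[0][o][i][r][s] for f in fused) for s in range(3)]
                for r in range(3)] for i in range(c)] for o in range(c)]
    bias = [sum(f[1][o] for f in fused) for o in range(c)]
    return kernel, bias

def multi_branch_sum(x, w3, bn3, w1, bn1, bn_id):
    """Reference: the three training-time branches, summed before ReLU."""
    branches = [batch_norm(conv2d(x, w3, 1), bn3),
                batch_norm(conv2d(x, w1, 0), bn1),
                batch_norm(x, bn_id)]
    return [[[sum(vals) for vals in zip(*rows)] for rows in zip(*planes)]
            for planes in zip(*branches)]

random.seed(0)
C, H, W = 2, 4, 4

def rnd():
    return random.uniform(-1.0, 1.0)

def rand_bn():
    return [(rnd(), rnd(), rnd(), random.uniform(0.5, 2.0)) for _ in range(C)]

x = [[[rnd() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w3 = [[[[rnd() for _ in range(3)] for _ in range(3)] for _ in range(C)] for _ in range(C)]
w1 = [[[[rnd()]] for _ in range(C)] for _ in range(C)]
bn3, bn1, bn_id = rand_bn(), rand_bn(), rand_bn()

kernel, bias = reparameterize(w3, bn3, w1, bn1, bn_id)
plain = conv2d(x, kernel, pad=1, bias=bias)
reference = multi_branch_sum(x, w3, bn3, w1, bn1, bn_id)
max_diff = max(abs(p - q) for pp, qq in zip(plain, reference)
               for pr, qr in zip(pp, qq) for p, q in zip(pr, qr))
print(max_diff)  # differs from zero only by floating-point rounding
```

The single fused convolution and the three-branch reference agree up to rounding error, which is what "mathematically exact" means in practice. Replacing the fixed statistics with per-batch statistics would break this agreement, as noted above.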

3.3 Architecture Details

The overall RepVGG architecture follows a VGG-like layout with 5 stages, where each stage begins with a stride-2 layer for downsampling. The width is controlled by two multipliers applied to the base channel numbers: a scales the first four stages and b scales the last. The authors propose several variants: RepVGG-A (a = 0.75 to 1.5, b = 2.5 to 2.75) for lightweight models and the deeper RepVGG-B (with larger multipliers) for high-accuracy models. Notably, RepVGG uses 3x3 convolution as its only type of spatial convolution, an operator that is "highly optimized on modern GPU and CPU hardware" through libraries like cuDNN and MKL.

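Paragraph function: Architecture details. Describes RepVGG's overall layout and its variants.

Logical role: Turns the theory (structural re-parameterization) into a usable family of architectures. The 3x3-only design choice is not just an aesthetic preference but a hardware-efficiency consideration.

Argumentative technique / potential weakness: Citing cuDNN/MKL optimization turns the 3x3 restriction from a weakness into a strength, a clever hardware-aware argument. But as hardware evolves (for example, chips optimized for depthwise convolution), this advantage may not last.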
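
The width rule is only summarized in these notes, so the sketch below fills in details as assumptions: the base widths [64, 128, 256, 512], the cap min(64, 64a) on the first stage, and the split in which a scales stages 1 to 4 while b scales stage 5 are recalled from the RepVGG paper rather than stated here, and should be checked against it. `stage_widths` is a hypothetical helper name.

```python
# Assumed base channel counts for stages 2-5 (recalled from the paper, not from these notes).
BASE_WIDTHS = [64, 128, 256, 512]

def stage_widths(a, b):
    """Channel width of each of the 5 stages for width multipliers a and b."""
    first = min(64, int(64 * a))                      # stage 1 is capped at 64 channels
    middle = [int(w * a) for w in BASE_WIDTHS[:3]]    # stages 2-4 scale with a
    last = int(BASE_WIDTHS[3] * b)                    # stage 5 scales with b
    return [first] + middle + [last]

print(stage_widths(0.75, 2.5))  # [48, 48, 96, 192, 1280]
print(stage_widths(2.0, 4.0))   # [64, 128, 256, 512, 2048]
```

Giving the last stage its own, larger multiplier concentrates width where the feature map is smallest, so the extra channels cost relatively little computation.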

4. Experiments

On ImageNet classification, RepVGG-A0 achieves 72.41% top-1 accuracy at 1.36ms inference latency on a single 1080Ti GPU, while RepVGG-B3 reaches 80.52% at 5.01ms. For comparison, ResNet-50 achieves 76.15% at 3.07ms and ResNet-101 achieves 77.37% at 5.38ms. RepVGG-B1g2 achieves 77.79% at 3.46ms, outperforming ResNet-101 in both accuracy and speed. On semantic segmentation (Cityscapes) and object detection (COCO), RepVGG backbones show consistent improvements over ResNet counterparts. Compared to EfficientNet and RegNet which require depthwise or grouped convolutions, RepVGG achieves comparable accuracy with significantly higher actual throughput on standard GPU hardware.

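Paragraph function: Comprehensive experimental validation. Shows RepVGG's advantage on the two-dimensional accuracy-speed metric.

Logical role: The empirical pillar, framed in terms of Pareto optimality: RepVGG sits at a better position on the accuracy-speed plane. Consistency across tasks (classification, detection, segmentation) strengthens the generality of the conclusion.

Argumentative technique / potential weakness: Using actual inference latency (ms) rather than theoretical FLOPs as the speed metric is pragmatic, since it directly reflects deployment needs. But the speed comparison depends heavily on the hardware (1080Ti) and software versions, and conclusions may differ on other platforms. EfficientNet may still have the edge on mobile devices.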
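
The Pareto framing can be checked directly against the five accuracy and latency pairs quoted in this section. The sketch below uses only those numbers; the helper names `dominates` and `pareto_front` are hypothetical.

```python
# (model, top-1 accuracy in %, latency in ms) as quoted in this section.
RESULTS = [
    ("RepVGG-A0", 72.41, 1.36),
    ("ResNet-50", 76.15, 3.07),
    ("RepVGG-B1g2", 77.79, 3.46),
    ("RepVGG-B3", 80.52, 5.01),
    ("ResNet-101", 77.37, 5.38),
]

def dominates(p, q):
    """True if p is at least as accurate and as fast as q, and strictly better in one."""
    _, acc_p, lat_p = p
    _, acc_q, lat_q = q
    return acc_p >= acc_q and lat_p <= lat_q and (acc_p > acc_q or lat_p < lat_q)

def pareto_front(results):
    """Models that no other model dominates, ordered from fastest to slowest."""
    front = [q for q in results if not any(dominates(p, q) for p in results)]
    return sorted(front, key=lambda m: m[2])

front_names = [name for name, _, _ in pareto_front(RESULTS)]
print(front_names)
# ['RepVGG-A0', 'ResNet-50', 'RepVGG-B1g2', 'RepVGG-B3']
```

On these five points only ResNet-101 is dominated (by both RepVGG-B1g2 and RepVGG-B3). ResNet-50 stays on the front, because none of the listed RepVGG models is both faster and more accurate than it.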

5. Conclusion

RepVGG demonstrates that plain VGG-style architectures can compete with and even surpass complex multi-branch designs when equipped with structural re-parameterization. By decoupling training and inference architectures, the method achieves the "best of both worlds: the training benefits of multi-branch topology and the inference efficiency of plain models." The resulting models offer a favorable trade-off between accuracy and actual speed, and the simplicity of the inference model facilitates hardware optimization, quantization, and deployment.

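Paragraph function: Summarizes the paper, closing on the "best of both worlds" argument.

Logical role: The conclusion lifts RepVGG's contribution from the technical level to the methodological level: the principle of decoupling training and inference architectures may influence future thinking about architecture design.

Argumentative technique / potential weakness: "Best of both worlds" is a powerful closing line, but the ceiling of the method goes undiscussed: structural re-parameterization applies only to merging linear operations and cannot handle nonlinear branches (such as SE modules or attention). With the rise of Transformer architectures, the long-term competitiveness of purely convolutional approaches remains to be seen.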
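
A toy example shows why an input-dependent branch resists merging. The sketch below is an assumption-laden simplification and not an actual SE module: `gated_branch` merely rescales a weight by a sigmoid of the input mean, and all names are hypothetical.

```python
import math

def linear_branch(x, w):
    """A linear branch: every output is the input times a fixed weight."""
    return [w * v for v in x]

def gated_branch(x, w):
    """SE-style toy branch: the weight is rescaled by a gate computed from the input."""
    gate = 1.0 / (1.0 + math.exp(-sum(x) / len(x)))  # sigmoid of the input mean
    return [gate * w * v for v in x]

def effective_weight(branch, x, w):
    """Weight a single fixed kernel would need in order to reproduce this output."""
    return branch(x, w)[0] / x[0]

small, large = [1.0, 1.0], [4.0, 4.0]
linear_weights = [effective_weight(linear_branch, x, 0.5) for x in (small, large)]
gated_weights = [effective_weight(gated_branch, x, 0.5) for x in (small, large)]
print(linear_weights)  # the same value twice: one kernel fits all inputs
print(gated_weights)   # two different values: no single kernel fits both
```

Because the gated branch needs a different effective weight for each input, no fixed kernel computed once after training can stand in for it.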

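Argument Structure Overview

Problem: multi-branch architectures trade inference efficiency for accuracy
→ Thesis: decouple the training and inference architectures through structural re-parameterization
→ Evidence: 80.52% top-1 on ImageNet, faster than ResNet
→ Rebuttal: the conversion is mathematically exact, with no accuracy loss
→ Conclusion: plain models can rival complex architectures

Authors' core claim (one sentence)

By using structural re-parameterization to convert the training-time multi-branch topology into an equivalent inference-time stack of pure 3x3 convolutions, a plain VGG-style architecture can surpass complex designs such as ResNet in both accuracy and actual inference speed.

Strongest point of the argument

A mathematically exact equivalent conversion: structural re-parameterization is not an approximation or a compression but an algebraically fully equivalent operation. The converted plain model therefore produces exactly the same output as the multi-branch training model, which removes any concern about whether simplification sacrifices accuracy. Combined with experiments that measure actual latency rather than FLOPs, the claim is solidly supported in both theory and practice.

Weakest point of the argument

Limited scope: structural re-parameterization applies only to linear operations that can be merged algebraically and cannot cover nonlinear branches such as attention mechanisms or Squeeze-and-Excitation modules. With the rise of the Vision Transformer, the long-term competitiveness of purely convolutional architectures is in doubt. In addition, the speed comparison is based on one specific GPU (1080Ti); the advantage has not been verified on other hardware such as mobile NPUs.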