Contents: Abstract · 1. Introduction · 2. Analysis of Residual Units · 3. Importance of Identity Shortcuts · 4. Experiments · 5. Conclusions · Argument Overview

Abstract

Deep residual networks have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which makes training easier and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at: https://github.com/KaimingHe/resnet-1k-layers.
Paragraph function: summarizes the paper's core finding, the theoretical basis and experimental validation of identity mappings as skip connections.
Logical role: establishes the chain from theoretical analysis to architectural improvement.
Argument technique / potential weakness: showcasing a 1001-layer network is striking, but whether practical applications need such depth is debatable.

1. Introduction

Deep residual networks (ResNets) consist of many stacked "Residual Units". Each unit can be expressed in a general form: y_l = h(x_l) + F(x_l, W_l), x_{l+1} = f(y_l), where x_l and x_{l+1} are the input and output of the l-th unit, F is a residual function, and f is the after-addition activation. In the original work, h(x_l) = x_l is an identity mapping and f is ReLU. ResNets more than 100 layers deep have shown state-of-the-art accuracy on several challenging recognition tasks in the ImageNet and MS COCO competitions. The central idea of ResNets is to learn an additive residual function F with respect to h(x_l), with the key choice of using an identity mapping h(x_l) = x_l, realized by attaching an identity skip connection.
Paragraph function: reviews ResNet's basic formulation and achievements.
Logical role: establishes the mathematical notation and conceptual groundwork for the deeper analysis that follows.
Argument technique / potential weakness: defining the problem with precise formulas gives the later theoretical derivation a rigorous formal framework.
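The general form above can be sketched in code. This is a minimal numeric illustration, not the paper's implementation: F is reduced to a single linear map (standing in for the conv/BN/ReLU stack), and the helper name `residual_unit` is ours.

```python
import numpy as np

def residual_unit(x, W, h=lambda x: x, f=lambda y: np.maximum(y, 0.0)):
    """One Residual Unit in the paper's general form:
    y_l = h(x_l) + F(x_l, W_l),  x_{l+1} = f(y_l).
    Here F is a single linear map; h defaults to the identity mapping
    and f to ReLU, matching the original design."""
    residual = W @ x      # F(x_l, W_l), greatly simplified
    y = h(x) + residual   # skip connection plus residual branch
    return f(y)           # after-addition activation

x = np.ones(4)
W = 0.1 * np.eye(4)
x_next = residual_unit(x, W)  # h and f as in the original ResNet
```

Passing a different `h` (e.g. a scaling `lambda x: 0.5 * x`) is exactly the kind of modification the later sections ablate.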
In this paper, we analyze deep residual networks by focusing on creating a "direct" path for propagating information, not only within a residual unit but through the entire network. Our derivations reveal that if both h(x_l) and f(y_l) are identity mappings, the signal can be directly propagated from one unit to any other unit, in both the forward and backward passes. Our experiments show empirically that training generally becomes easier as the architecture approaches these two conditions. To understand the role of skip connections, we analyze and compare various choices of h(x_l). We find that the identity mapping h(x_l) = x_l chosen in the original ResNet achieves the fastest error reduction and the lowest training loss among all the variants we investigated.
Paragraph function: states the paper's central research direction and key finding.
Logical role: transitions from intuition to theory, setting up the argument for the optimality of identity mappings.
Argument technique / potential weakness: validating identity mappings by comparing many variants is a highly persuasive ablation-style methodology.

2. Analysis of Deep Residual Networks

The ResNets developed in the original work are modularized architectures that stack building blocks of the same connecting shape. In this paper we call these blocks "Residual Units". The original Residual Unit performs the following computation: y_l = h(x_l) + F(x_l, W_l), x_{l+1} = f(y_l). If both h and f are identity mappings, recursive substitution gives x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i) for any deeper unit L and any shallower unit l. This equation exhibits some nice properties. (i) The feature x_L of any deeper unit L can be represented as the feature x_l of any shallower unit l plus a summed residual function, indicating that the model is in a residual fashion between any units L and l. (ii) The feature x_L = x_0 + Σ_{i=0}^{L-1} F(x_i, W_i) is the sum of the input and all preceding residual functions, in contrast to a "plain network" where x_L is the product of a series of matrix-vector multiplications.
Paragraph function: derives the signal-propagation properties of residual networks mathematically.
Logical role: establishes the core theoretical foundation, explaining why identity mappings are so effective.
Argument technique / potential weakness: the recursive unrolling elegantly exposes the "additive" nature of residual networks, in sharp contrast to the "multiplicative" nature of plain networks.
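The recursive unrolling can be verified numerically. This is a toy sketch under stated assumptions: h and f are both identities, and our residual function F is a scaled tanh standing in for the paper's conv layers.

```python
import numpy as np

def F(x, w):
    # toy residual function standing in for the conv layers
    return w * np.tanh(x)

# With h and f both identity, x_{l+1} = x_l + F(x_l, W_l); unrolling gives
# x_L = x_l + sum_{i=l}^{L-1} F(x_i, W_i): a purely additive composition.
weights = [0.5, -0.3, 0.2]
xs = [np.array([1.0, -2.0])]          # x_0
for w in weights:
    xs.append(xs[-1] + F(xs[-1], w))  # one identity-shortcut unit per step

x_L = xs[-1]
unrolled = xs[0] + sum(F(xs[i], w) for i, w in enumerate(weights))
# x_L equals the shallow feature plus the summed residuals
```

The same additivity is what makes the backward gradient a sum of terms, one of which flows to x_l unattenuated.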

3. On the Importance of Identity Skip Connections

Let us consider a simple modification, h(x_l) = lambda_l * x_l, that breaks the identity shortcut. We investigate various combinations and find that scaling, gating, and 1x1 convolutional shortcuts all lead to higher training error and higher test error. These experiments suggest that keeping a "clean" information path is helpful for easing optimization. We also study the impact of the activation functions. The original design uses post-addition activation: x_{l+1} = ReLU(y_l). We propose pre-activation instead: x_{l+1} = x_l + F(ReLU(BN(x_l)), W_l), with BN and ReLU moved before the weight layers. In the new design, the signals between building blocks travel along clean identity mappings.
Paragraph function: verifies the necessity of identity mappings through ablation experiments and proposes the pre-activation design.
Logical role: argues the central claim from both directions, negatively (breaking the identity mapping hurts) and positively (pre-activation is cleaner).
Argument technique / potential weakness: the notion of a "clean information path" is both intuitive and theoretically grounded; the ablations supply strong negative evidence.
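The contrast between post-addition activation and pre-activation can be sketched as follows. This is a hedged toy version: `bn` is a per-vector normalization standing in for batch normalization, and the residual branch is a single weight matrix rather than the paper's stacked conv layers.

```python
import numpy as np

def bn(x, eps=1e-5):
    # stand-in for batch normalization on a single feature vector
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def relu(x):
    return np.maximum(x, 0.0)

def post_act_unit(x, W):
    # original design: after-addition activation, x_{l+1} = ReLU(x_l + F(x_l, W_l))
    return relu(x + bn(W @ x))

def pre_act_unit(x, W):
    # full pre-activation: x_{l+1} = x_l + F(ReLU(BN(x_l)), W_l);
    # the shortcut carries x_l untouched
    return x + W @ relu(bn(x))

x = np.array([1.0, -1.0, 2.0, 0.5])
zero_W = np.zeros((4, 4))
# With a zero residual branch, pre-activation preserves x exactly,
# while post-activation still clips negative entries.
identity_out = pre_act_unit(x, zero_W)
clipped_out = post_act_unit(x, zero_W)
```

The zero-weight case makes the point concrete: only the pre-activation unit reduces to a true identity, which is exactly the "clean path" property the section argues for.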

4. Experiments

We conduct experiments on CIFAR-10, CIFAR-100, and ImageNet. Our baseline ResNets are the 110-layer and 164-layer networks on CIFAR, and the 101-layer network on ImageNet. On CIFAR-10, our 1001-layer pre-activation ResNet achieves 4.62% test error, which is significantly better than the original ResNet's 7.61%. The improvement becomes more pronounced as the network goes deeper. On CIFAR-100, the trend is similar — our 1001-layer network achieves 22.71% error. On ImageNet, our 200-layer pre-activation ResNet achieves a top-1 error of 21.66%, which is lower than the baseline 34-layer plain net's 28.54%. The pre-activation design also demonstrates improved generalization: the gap between training and testing error is smaller.
Paragraph function: reports quantitative results across multiple datasets.
Logical role: validates the theoretical analysis with data, showing that the pre-activation design really does train deeper networks better.
Argument technique / potential weakness: consistent improvements across three datasets are very persuasive, but the compute cost versus practical benefit of a 1001-layer network deserves further scrutiny.
We also investigate the placement of the activation functions in detail. We compare: (a) the original post-activation design (ReLU after addition), (b) BN after addition, (c) ReLU before addition, (d) ReLU-only pre-activation, and (e) full pre-activation (BN and ReLU before the weight layers). The full pre-activation variant achieves the best results. The key insight is that BN as pre-activation improves the regularization of the model, since the inputs to all weight layers are normalized, while the shortcut path itself remains an unnormalized identity. This is evidenced by the reduced overfitting observed in our experiments.
Paragraph function: compares the effects of different activation-placement choices in detail.
Logical role: provides fine-grained architectural tuning guidance, complementing the theoretical analysis with a practical dimension.
Argument technique / potential weakness: the systematic comparison of five variants is a model ablation study, lending the conclusions extra credibility.
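The five placements can be written out side by side. This is a simplified sketch of each ordering with a single weight layer per branch (the paper's units use two or three), intended only to make the orderings concrete:

```python
import numpy as np

def bn(x, eps=1e-5):
    # stand-in for batch normalization on a feature vector
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def relu(x):
    return np.maximum(x, 0.0)

# `W @ x` stands in for the weight (conv) layers of the residual branch.
variants = {
    "(a) original post-activation": lambda x, W: relu(x + bn(W @ x)),
    "(b) BN after addition":        lambda x, W: relu(bn(x + W @ x)),
    "(c) ReLU before addition":     lambda x, W: x + relu(bn(W @ x)),
    "(d) ReLU-only pre-activation": lambda x, W: x + bn(W @ relu(x)),
    "(e) full pre-activation":      lambda x, W: x + W @ relu(bn(x)),
}

x = np.array([1.0, -1.0, 2.0, 0.5])
W = 0.1 * np.eye(4)
outputs = {name: f(x, W) for name, f in variants.items()}
```

Only variants (c), (d), and (e) keep the shortcut path free of any activation or normalization; (e) additionally normalizes the weight-layer input, which is the combination the experiments favor.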

5. Conclusions

This paper investigates the propagation formulations behind the connection mechanisms of deep residual networks. Our derivations imply that identity shortcut connections and identity after-addition activation are essential for making information propagation smooth. Our proposed architecture with pre-activation residual units facilitates training of very deep networks (over 1000 layers) and improves generalization. We believe that exploring the propagation properties will give further insight for the design of deep networks.
Paragraph function: summarizes the core contributions and future outlook.
Logical role: returns to the starting point of the theoretical analysis, closing the paper with "smooth information propagation" as the unifying thread.
Argument technique / potential weakness: the conclusion is concise and forceful, distilling the mathematical analysis into the intuitive message that identity mappings keep information flowing smoothly.

Argument Structure Overview

Problem
Very deep ResNets are hard to train
Claim
Identity mappings guarantee direct signal propagation
Evidence
4.62% on CIFAR-10 with a 1001-layer network
Rebuttal
All non-identity variants degrade
Conclusion
Pre-activation residual units are optimal

Core Claim

Identity mappings as skip connections let signals reach any layer directly in both forward and backward propagation; the pre-activation design further keeps the information path clean.

Strongest Argument

Starting from mathematical derivation and backed by a systematic ablation over five variants, the paper demonstrates the necessity of identity mappings from both theory and practice.

Weakest Link

Although a 1001-layer network was trained successfully, the cost-benefit of such extreme depth in real applications, and whether more efficient alternatives exist, is not fully discussed.
