Deep Residual Learning for Image Recognition

Abstract — 摘要

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

更深的神經網路更難以訓練。我們提出一種殘差學習框架，以降低訓練遠比以往更深之網路的難度。我們明確地將各層重新建構為學習相對於層輸入的殘差函數，而非學習無參照的函數。我們提供了全面的實證結果，證明這些殘差網路更易於最佳化，且能從大幅增加的深度中獲得精確度的提升。

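Paragraph function: Overview of the whole paper. One sentence exposes the core tension (depth vs. trainability), and the residual learning framework is immediately offered as the remedy.

Logical role: The abstract opens by establishing the central tension of the argument: depth is crucial for representational power, yet training difficulty rises with depth. Residual learning is presented as satisfying both goals, "deeper" and "easier to train".

Argumentative technique / potential weakness: The short opening sentence is forceful and names the pain point directly. The reformulation as "learning residual functions" looks simple but is deep: it shifts the optimization target from H(x) to H(x) − x, which the authors argue makes learning easier. The abstract does not explain why residuals are easier to learn; that is deferred to later sections.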

On the ImageNet dataset, we evaluate residual nets with a depth of up to 152 layers — 8× deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1,000 layers.

在 ImageNet 資料集上，我們評估了深達 152 層的殘差網路——比 VGG 網路深 8 倍，但仍具備更低的計算複雜度。這些殘差網路的集成在 ImageNet 測試集上達到 3.57% 的錯誤率，此結果贏得 ILSVRC 2015 分類任務第一名。我們同時在 CIFAR-10 上呈現了 100 層與 1,000 層的分析。

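Paragraph function: Quantified results. Concrete numbers present the advances of residual networks in depth, efficiency, and accuracy.

Logical role: Supplies empirical support for the conceptual claims of the previous paragraph: the feasibility of 152 layers, the top-ranked 3.57% error, and lower computation than VGG together form a chain of evidence that is hard to dismiss.

Argumentative technique / potential weakness: The "8× deeper yet lower complexity" contrast is persuasive and defuses the reader's intuition that deeper means slower. The 1,000-layer experiment on CIFAR-10 adds shock value, hinting that, in principle, residual learning imposes almost no upper limit on depth.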

1. Introduction — 緒論

Deep convolutional neural networks have led to a series of breakthroughs for image classification. Deep networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multi-layer fashion, and the "levels" of features can be enriched by the number of stacked layers (depth). Recent evidence reveals that network depth is of crucial importance, and the leading results on the challenging ImageNet dataset all exploit "very deep" models, with a depth of sixteen to thirty layers.

深度摺積神經網路已為影像分類帶來一系列突破。深層網路以端對端的多層方式自然整合了低階、中階與高階特徵及分類器，且特徵的「層級」可藉由堆疊層數（深度）來豐富。近期的證據揭示網路深度至關重要，在極具挑戰性的 ImageNet 資料集上，領先的結果皆利用了十六至三十層的「極深」模型。

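Paragraph function: Sets the research context by establishing the central role of depth in representation learning.

Logical role: Starting point of the argument. Success stories (VGGNet and others) first show that depth helps, setting up the core question that follows: can depth be increased indefinitely?

Argumentative technique / potential weakness: Opening with well-known successes lends authority. The specific figure "sixteen to thirty layers" hints that depth has hit a ceiling, which naturally makes the reader ask why networks are not simply made deeper.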

Driven by the significance of depth, a question arises: "Is learning better networks as easy as stacking more layers?" An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hamper convergence from the beginning. This problem has been largely addressed by normalized initialization and intermediate normalization layers. When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.

受到深度重要性的驅動，一個問題隨之浮現：「學習更好的網路是否只需堆疊更多層？」回答此問題的一大障礙是眾所周知的梯度消失／爆炸問題，它從一開始便阻礙了收斂。此問題已大致被正規化初始化與中間正規化層所解決。然而，當更深的網路能夠開始收斂時，一個退化問題隨即浮現：隨著網路深度增加，精確度先飽和然後急速退化。出乎意料的是，這種退化並非由過度擬合所導致，對一個已足夠深的模型添加更多層反而會造成更高的訓練誤差。

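Paragraph function: Defines the core problem by exposing the degradation phenomenon: deeper networks end up with higher training error.

Logical role: The pivotal turn of the argument. The authors first set aside vanishing gradients (largely addressed by normalized initialization and intermediate normalization layers such as BN), then identify degradation as the real obstacle and explicitly rule out overfitting. This is a failure of optimization itself.

Argumentative technique / potential weakness: The rhetorical question "Is learning better networks as easy as stacking more layers?" guides the reader's thinking, a classic rhetorical device. The observation that training error rises, rather than the model overfitting, is precise: it moves the problem from the statistical-learning level to the optimization level and paves the way for residual learning.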

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping H(x), we explicitly let these layers fit a residual mapping F(x) := H(x) − x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

本文中，我們藉由引入深度殘差學習框架來解決退化問題。我們不再期望每幾個堆疊層直接擬合所欲的底層映射 H(x)，而是明確地讓這些層擬合殘差映射 F(x) := H(x) − x。原始映射因此重新表述為 F(x) + x。我們的假設是：最佳化殘差映射比最佳化原始的無參照映射更為容易。在極端情況下，若恆等映射為最佳解，則將殘差推至零遠比用一連串非線性層擬合恆等映射來得容易。

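Paragraph function: Proposes the core solution, defining the residual reformulation in mathematical form.

Logical role: The most central paragraph of the paper. The change from H(x) to F(x) + x is the paper's intellectual contribution. The paragraph gives the mechanism that answers the degradation problem: if the optimal mapping is close to the identity, the residual function is close to zero and the network only has to learn a small deviation.

Argumentative technique / potential weakness: The intuition that "pushing the residual to zero is easier than fitting an identity mapping" is compelling but lacks a formal proof. The authors openly call it a hypothesis, not a theorem, which leaves the theory open. That modesty arguably strengthens credibility, since the hypothesis is then tested experimentally.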

2. Related Work

Residual Representations. In image recognition, VLAD is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector can be formulated as a probabilistic version of VLAD. Both are powerful shallow representations for image retrieval and classification. In low-level vision, for solving Partial Differential Equations (PDEs), the Multigrid method reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. These solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions.

殘差表示。在影像辨識中，VLAD 是一種透過相對於辭典的殘差向量進行編碼的表示法，費雪向量則可視為 VLAD 的機率版本。兩者都是用於影像檢索與分類的強大淺層表示。在低階視覺中，求解偏微分方程（PDE）時，多重網格法將系統重新建構為多尺度子問題，每個子問題負責粗細尺度之間的殘差解。這類求解器比不了解解的殘差本質之標準求解器收斂得更快。

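Paragraph function: Builds cross-domain links by citing precedents for residual thinking in image retrieval and numerical methods.

Logical role: Places residual learning in a broader intellectual context: "residual" is not a new idea, and it has proven effective in VLAD, Fisher Vector, and the Multigrid method. This makes the residual reformulation more plausible.

Argumentative technique / potential weakness: Cross-domain analogy is an efficient persuasive device; shallow representations and numerical analysis are both cited in support of the idea that residuals are easier to handle than the original function. The notions of residual in these analogies are not mathematically equivalent to residual connections in deep learning, however, so readers should interpret them with care.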

Shortcut Connections. Practices and theories of shortcut connections have been studied for a long time. An early practice of training multi-layer perceptrons is to add a linear layer connected from the network input to the output. Intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. Concurrent with this work, "highway networks" present shortcut connections with gating functions. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is "closed," the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through.

捷徑連結。捷徑連結的實踐與理論已被研究了相當長的時間。訓練多層感知器的早期做法是添加一個從網路輸入直連到輸出的線性層。中間層直接連接至輔助分類器以處理梯度消失／爆炸問題。與本研究同期的「高速公路網路」使用了帶有閘控函數的捷徑連結。這些閘控是資料相依且具有參數的，相對於我們無參數的恆等捷徑。當閘控捷徑「關閉」時，高速公路網路中的層代表的是非殘差函數。相反地，我們的建構始終學習殘差函數；我們的恆等捷徑永遠不會關閉，所有資訊始終被傳遞。

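Paragraph function: Differentiation. Draws a precise technical contrast between ResNet and highway networks.

Logical role: Answers the potential objection that ResNet is just a variant of highway networks. The contrast between parameterized gates and parameter-free identity shortcuts establishes what is distinctive about ResNet and where its advantage lies.

Argumentative technique / potential weakness: "Identity shortcuts are never closed" is the key design difference from highway networks. The authors imply that the simpler, parameter-free design is the more effective one, a "less is more" philosophy. They do not fully explain why gating would hurt; the later experimental results answer this only indirectly.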

3. Deep Residual Learning — 深度殘差學習

3.1 Residual Learning — 殘差學習

Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x. So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) − x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired functions, the ease of learning might differ.

令 H(x) 為欲以若干堆疊層（不一定是整個網路）來擬合的底層映射，其中 x 為這些層中第一層的輸入。若假設多個非線性層能漸近地逼近複雜函數，則等價地假設它們亦能漸近地逼近殘差函數，即 H(x) − x。因此我們不再期望堆疊層逼近 H(x)，而是明確讓這些層逼近殘差函數 F(x) := H(x) − x。原始函數因而成為 F(x) + x。儘管兩種形式理應都能漸近逼近所欲函數，學習的難易度可能有所不同。

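Paragraph function: Formal definition. States the core idea of residual learning in mathematical language.

Logical role: Turns the intuition from the introduction into a more careful mathematical statement. The key insight: the two forms are equivalent in expressive power (both can approximate the desired function), but their optimization landscapes may differ greatly.

Argumentative technique / potential weakness: The argument exploits a symmetry in the universal-approximation-style hypothesis: if stacked nonlinear layers can approximate H, they can also approximate H − x, so the residual reformulation costs nothing in expressiveness. But "ease of learning" is a statement about optimization, not about approximation, and the authors acknowledge that no theoretical guarantee is offered here.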

This reformulation is motivated by the counterintuitive degradation problem. If the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings. In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping.

此重新建構受到反直覺的退化問題所驅動。若新增的層能被構造為恆等映射，更深的模型應具有不高於其較淺對應模型的訓練誤差。退化問題暗示求解器可能難以透過多個非線性層來逼近恆等映射。在殘差學習的重新建構下，若恆等映射為最佳解，求解器只需將多個非線性層的權重驅近零值即可趨近恆等映射。在實際情境中，恆等映射不太可能是最佳解，但我們的重新建構有助於對問題進行預條件化。若最佳函數較接近恆等映射而非零映射，求解器應更容易找到以恆等映射為參照的擾動。

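Paragraph function: Gives the intuition for why the residual reformulation addresses the degradation problem.

Logical role: The core of the theoretical motivation for residual learning. The chain of reasoning is: degradation problem -> nonlinear layers struggle to learn identity mappings -> the residual reformulation turns "learn the identity" into "drive the weights to zero" -> driving weights to zero is easier than learning the identity -> the problem is eased.

Argumentative technique / potential weakness: The "preconditioning" metaphor, borrowed from numerical optimization, is apt. The authors pragmatically admit that identity mappings are unlikely to be optimal, and argue instead that the residual framing lets the search start from a residual near zero rather than from an arbitrary random initialization. The argument is insightful but still heuristic; a fuller theoretical analysis only came with later work.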
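
The preconditioning intuition above can be illustrated with a small numerical sketch. It assumes a hypothetical target mapping H that is the identity plus a small random linear perturbation; the matrix `A`, its scale of 0.05, and the data are illustrative choices, not taken from the paper. The sketch only compares how far each regression target is from a zero output and says nothing about actual training dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: 1000 samples with 16 features each.
x = rng.normal(size=(1000, 16))

# Assumed target: H(x) = x + x A, i.e. close to the identity mapping.
A = 0.05 * rng.normal(size=(16, 16))
H = x + x @ A

# A plain stack must fit H(x) itself; a residual stack fits F(x) = H(x) - x.
plain_target = H
residual_target = H - x

plain_norm = np.linalg.norm(plain_target)
residual_norm = np.linalg.norm(residual_target)

print(f"||H(x)||     = {plain_norm:.1f}")
print(f"||H(x) - x|| = {residual_norm:.1f}")
```

Under this assumption the residual target is much smaller in norm than the direct target, which is one way to read "closer to an identity mapping than to a zero mapping".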

3.2 Identity Mapping by Shortcuts — 恆等捷徑映射

We adopt residual learning to every few stacked layers. A building block is defined as: y = F(x, {W_i}) + x. Here x and y are the input and output vectors of the layers considered. The function F(x, {W_i}) represents the residual mapping to be learned. For the example of two layers, F = W₂σ(W₁x) in which σ denotes ReLU and the biases are omitted for simplifying notations. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., σ(y)).

我們將殘差學習套用於每若干堆疊層。一個構建區塊定義為：y = F(x, {W_i}) + x。其中 x 與 y 為所考量各層的輸入與輸出向量。函數 F(x, {W_i}) 代表待學習的殘差映射。以兩層為例，F = W₂σ(W₁x)，其中 σ 代表 ReLU，偏差項為簡化符號而省略。F + x 的運算透過一個捷徑連結與逐元素相加來實現。我們在相加之後才採用第二個非線性函數（即 σ(y)）。

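Paragraph function: Technical specification. Defines the concrete mathematical form of the residual block.

Logical role: Moves from the abstract idea to an implementable formula. y = F(x) + x is the most important equation in the paper, and all later architectural design builds on it.

Argumentative technique / potential weakness: The formulation is very concise and easy to implement. Placing the ReLU after the addition (post-addition activation) was revisited in later work: He et al.'s 2016 follow-up paper reported that a pre-activation variant works better, which suggests that the design here is not the final form.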
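
A minimal NumPy sketch of the two-layer building block, assuming fully connected layers on a single vector for simplicity (the paper's blocks are convolutional). The function names `residual_block` and `plain_block` are illustrative. It shows that with all weights at zero the residual block passes a non-negative input through unchanged, while the plain stack collapses to zero.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU, the nonlinearity written as sigma in the text."""
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = F(x, {W_i}) + x with F = W2 sigma(W1 x), then sigma(y). Biases omitted."""
    f = W2 @ relu(W1 @ x)
    return relu(f + x)

def plain_block(x, W1, W2):
    """The same two layers without the shortcut connection."""
    return relu(W2 @ relu(W1 @ x))

# Non-negative input, as it would be after a preceding ReLU.
x = np.array([1.0, 2.0, 0.5])
Z = np.zeros((3, 3))

print(residual_block(x, Z, Z))  # zero weights: the block reduces to the identity on x
print(plain_block(x, Z, Z))     # zero weights: the plain stack outputs all zeros
```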

The shortcut connections in the above equation introduce neither extra parameters nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition). The dimensions of x and F must be equal in the equation. If this is not the case, we can perform a linear projection W_s by the shortcut connections to match the dimensions: y = F(x, {W_i}) + W_sx.

上述方程式中的捷徑連結既不引入額外參數，也不增加計算複雜度。這不僅在實務上具有吸引力，對我們在一般網路與殘差網路之間的比較也至關重要。我們可以公平地比較同時擁有相同參數數量、深度、寬度和計算成本的一般／殘差網路（除了可忽略的逐元素相加之外）。方程中 x 與 F 的維度必須相等。若維度不匹配，可透過捷徑連結執行線性投影 W_s 以匹配維度：y = F(x, {W_i}) + W_sx。

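Paragraph function: Practical advantage. Stresses that identity shortcuts are cost-free and that this makes a fair comparison possible.

Logical role: Pre-empts the objection that residual networks are better only because they have more parameters: the shortcuts are parameter-free, so the performance difference is attributed to the optimization benefit of the residual reformulation.

Argumentative technique / potential weakness: The "zero extra cost" point is one of the key reasons ResNet was widely adopted; it is close to a free lunch. The linear projection W_s used when dimensions do not match does add a small number of parameters, but the authors' later experiments indicate that projections are not essential and that zero-padding (option A) suffices, which further supports the "free" argument.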
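
The dimension-matching case y = F(x, {W_i}) + W_s x can be sketched as follows, again assuming fully connected layers on one vector. The function name `block`, the sizes 4 and 8, and the random weights are hypothetical. The sketch shows that an identity shortcut fails when dimensions differ and that the projection adds exactly `d_out * d_in` parameters.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def block(x, W1, W2, Ws=None):
    """Residual block; uses the identity shortcut unless a projection Ws is given."""
    f = W2 @ relu(W1 @ x)
    shortcut = x if Ws is None else Ws @ x
    if shortcut.shape != f.shape:
        raise ValueError("shortcut and residual branch dimensions differ")
    return relu(f + shortcut)

rng = np.random.default_rng(0)
d_in, d_out = 4, 8                      # illustrative sizes
x = rng.normal(size=d_in)
W1 = rng.normal(size=(d_out, d_in))
W2 = rng.normal(size=(d_out, d_out))
Ws = rng.normal(size=(d_out, d_in))     # linear projection on the shortcut

y = block(x, W1, W2, Ws)
extra_params = Ws.size                  # parameters added by the projection
print(y.shape, extra_params)
```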

3.3 Network Architectures — 網路架構

Plain Network. Our plain baselines are mainly inspired by the philosophy of VGG nets. The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34. It is worth noting that our model has fewer filters and lower complexity than VGG-19. Our 34-layer baseline has 3.6 billion FLOPs, which is only 18% of VGG-19 (19.6 billion FLOPs).

一般網路。我們的一般網路基準主要受 VGG 網路的設計哲學所啟發。摺積層多數使用 3×3 的濾波器，並遵循兩條簡單的設計原則：(i) 對於相同的輸出特徵圖尺寸，各層具有相同的濾波器數量；(ii) 若特徵圖尺寸減半，濾波器數量則加倍，以保持每層的時間複雜度。下取樣直接透過步幅為 2 的摺積層來執行。網路以一個全域平均池化層與 1000 路全連接層搭配 softmax 結尾。加權層的總數為 34 層。值得注意的是，我們的模型比 VGG-19 具有更少的濾波器與更低的複雜度。我們的 34 層基準僅有 36 億次浮點運算，僅為 VGG-19（196 億次浮點運算）的 18%。

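Paragraph function: Establishes the baseline by defining the plain (non-residual) architecture needed for the controlled comparison.

Logical role: Lays the groundwork for a fair comparison. The plain network follows VGG's established design rules, which avoids introducing other confounding factors. The choice of 34 layers sits at the "sweet spot" for the degradation problem: deep enough to expose it, but not so deep that training fails altogether.

Argumentative technique / potential weakness: The "only 18% of VGG-19" figure is striking and pre-empts concerns about computational cost. It also hints that VGG is very inefficient in its use of parameters, foreshadowing the later conclusion that ResNet is deeper yet more efficient.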
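
Design rule (ii) and the 18% figure can be checked with simple arithmetic. The sketch below assumes a standard cost model for a k×k convolution (output height × width × k² × input channels × output channels) and illustrative stage sizes of 56/28/14/7 with 64/128/256/512 filters; those stage sizes are an assumption for the example and are not stated in the text above.

```python
def conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulates of one k x k convolution producing an h x w map."""
    return h * w * k * k * c_in * c_out

# (feature map side, number of filters) per stage; assumed for illustration.
stages = [(56, 64), (28, 128), (14, 256), (7, 512)]

# Halving the map side quarters h*w; doubling filters quadruples c_in*c_out.
per_layer = [conv_macs(s, s, c, c) for s, c in stages]
print(per_layer)

# FLOP ratio quoted in the text: 34-layer baseline vs. VGG-19.
ratio = 3.6 / 19.6
print(f"{ratio:.1%}")
```

Within each stage the per-layer cost comes out identical, which is what "preserve the time complexity per layer" means under this cost model.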

Residual Network. Based on the above plain network, we insert shortcut connections which turn the network into its residual version. The identity shortcuts can be directly used when the input and output are of the same dimensions. When the dimensions increase, we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter. (B) The projection shortcut is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.

殘差網路。在上述一般網路的基礎上，我們插入捷徑連結將其轉換為殘差版本。當輸入與輸出維度相同時，可直接使用恆等捷徑。當維度增加時，我們考慮兩個選項：(A) 捷徑仍執行恆等映射，對增加的維度補零，此選項不引入額外參數；(B) 使用投影捷徑以匹配維度（透過 1×1 摺積實現）。對兩個選項而言，當捷徑跨越兩個不同尺寸的特徵圖時，皆以步幅 2 執行。

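Paragraph function: Architectural detail. Describes how the plain network is converted into a residual network.

Logical role: Offers two options for handling dimension mismatch (zero-padding vs. projection) in preparation for the later ablation experiments. The "no extra parameter" property of option A again underlines ResNet's efficiency.

Argumentative technique / potential weakness: Offering several options and comparing them systematically in the experiments reflects rigorous methodology. Zero-padding (option A) is simple but semantically less intuitive than projection: the new channels receive zeros from the shortcut, so residual learning for them starts from zero. The impact of this design choice is quantified in the later experiments.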
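
The two shortcut options can be sketched on a single feature map of shape (channels, height, width). This is a hedged illustration: the function names are hypothetical, and implementing the stride-2 identity as plain spatial subsampling is an assumption, since the text only says the shortcut is "performed with a stride of 2".

```python
import numpy as np

def shortcut_option_a(x, c_out, stride=2):
    """Option A: strided identity, then zero-pad the extra channels (no parameters)."""
    x = x[:, ::stride, ::stride]
    extra = c_out - x.shape[0]
    return np.pad(x, ((0, extra), (0, 0), (0, 0)))  # pads with zeros by default

def shortcut_option_b(x, Ws, stride=2):
    """Option B: 1x1 convolution with the given stride; Ws has shape (c_out, c_in)."""
    x = x[:, ::stride, ::stride]
    return np.einsum("oc,chw->ohw", Ws, x)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8, 8))      # 16 channels on an 8x8 map (illustrative)
Ws = rng.normal(size=(32, 16))       # projection weights (illustrative)

a = shortcut_option_a(x, c_out=32)
b = shortcut_option_b(x, Ws)
print(a.shape, b.shape)
```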

4. Experiments — 實驗

4.1 ImageNet Classification — ImageNet 分類

We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net has higher validation error than the 18-layer plain net. To reveal the reasons, we compare their training error. The 34-layer plain net has higher training error throughout the whole training procedure, even though the solution space of the 18-layer network is a subspace of that of the 34-layer one. We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with Batch Normalization (BN), which ensures that forward propagated signals have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN.

我們首先評估 18 層與 34 層的一般網路。34 層一般網路的驗證誤差高於 18 層。為揭示原因，我們比較它們的訓練誤差。34 層一般網路在整個訓練過程中訓練誤差皆較高，儘管 18 層網路的解空間是 34 層的子空間。我們論述此最佳化困難不太可能由梯度消失所導致。這些一般網路使用批次正規化（BN）進行訓練，它確保前向傳播的信號具有非零的變異數。我們也驗證了使用 BN 後，反向傳播的梯度呈現健康的範數。

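Paragraph function: Empirical evidence of degradation. ImageNet experiments confirm the phenomenon.

Logical role: Turns the degradation problem from the introduction into an empirical finding. The case against vanishing gradients is tight: BN keeps forward signals and backward gradients well behaved, so the problem lies in the optimization landscape rather than in the training mechanics.

Argumentative technique / potential weakness: The point that the 18-layer solution space is a subspace of the 34-layer one is strong: in principle, the 34-layer net should be no worse than the 18-layer net. But the authors only rule out vanishing gradients and do not give a precise mechanism for degradation. Their conjecture of "exponentially low convergence rates" is deferred to future study.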

Next we evaluate 18-layer and 34-layer residual nets (ResNets). The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3×3 filters. Three major observations emerge. (1) The situation is reversed with residual learning — the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error, suggesting that the degradation problem is well addressed. (2) Compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5%, resulting from the successfully reduced training error. (3) The 18-layer plain/residual nets are comparably accurate, but the 18-layer ResNet converges faster at early stages.

接著我們評估 18 層與 34 層的殘差網路（ResNet）。基礎架構與上述一般網路相同，唯一差別是在每對 3×3 濾波器上添加了捷徑連結。三項主要觀察如下：(1) 透過殘差學習情況逆轉——34 層 ResNet 優於 18 層 ResNet（高出 2.8%）。更重要的是，34 層 ResNet 展現了顯著較低的訓練誤差，表明退化問題已被有效解決。(2) 與一般網路對應版本相比，34 層 ResNet 將 top-1 誤差降低了 3.5%，此改善源自成功降低的訓練誤差。(3) 18 層的一般網路與殘差網路精確度相當，但 18 層 ResNet 在早期階段收斂更快。

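Paragraph function: Key experimental results. Three observations together make the case that residual learning works.

Logical role: The climax of the argument. Observation (1) directly confirms that residual learning addresses the degradation problem, observation (2) quantifies the improvement, and observation (3) shows that residual learning speeds convergence even in networks that are not very deep. Together they provide thorough empirical support.

Argumentative technique / potential weakness: Numbering the three observations is clear and effective. The phrase "the situation is reversed" dramatizes the contrast between degradation in plain nets and improvement in ResNets. Observation (3) is a bonus: residual learning not only addresses degradation but also accelerates convergence, which adds to the method's appeal.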

Deeper Bottleneck Architectures. Due to concerns on the training time, we modify the building block as a bottleneck design. For each residual function F, we use a stack of 3 layers instead of 2. The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. The 50-layer ResNet is constructed by replacing each 2-layer block in the 34-layer net with this 3-layer bottleneck block. We further construct 101-layer and 152-layer ResNets by using more bottleneck blocks. Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 (15.3/19.6 billion FLOPs).

更深的瓶頸架構。基於訓練時間的考量，我們將構建區塊修改為瓶頸設計。對每個殘差函數 F，我們使用 3 層堆疊取代 2 層。三層分別為 1×1、3×3 與 1×1 摺積，其中 1×1 層負責縮減再還原維度，使 3×3 層成為具有較小輸入／輸出維度的瓶頸。50 層 ResNet 透過將每個 2 層區塊替換為此 3 層瓶頸區塊而建構。我們進一步使用更多瓶頸區塊建構了 101 層與 152 層 ResNet。值得注意的是，儘管深度顯著增加，152 層 ResNet（113 億次浮點運算）仍低於 VGG-16/19（153/196 億次浮點運算）的複雜度。

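Paragraph function: Architectural extension. Introduces the bottleneck block to make deeper networks affordable.

Logical role: Demonstrates the scalability of the residual framework. The reduce-then-restore use of 1×1 convolutions extends ideas from Network in Network and GoogLeNet; combined with residual connections, it opens the way to networks of more than a hundred layers.

Argumentative technique / potential weakness: "152 layers yet lower complexity than VGG-16" is one of the most striking data points in the paper. It shows, almost counterintuitively, that good architectural design can keep depth and efficiency from being at odds. The bottleneck design is essentially a rank-reducing projection that lowers each block's computational load.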
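
To see why the bottleneck keeps cost down, the weight counts of the two block types can be compared. The channel widths below (a 64-channel two-layer block versus a 256-channel bottleneck with a 64-channel 3×3 layer) are illustrative assumptions not given in the text above, and biases and normalization parameters are ignored.

```python
def conv_weights(k, c_in, c_out):
    """Number of weights in one k x k convolution (no bias)."""
    return k * k * c_in * c_out

# Two-layer block: two 3x3 convolutions at 64 channels (assumed width).
basic = 2 * conv_weights(3, 64, 64)

# Bottleneck: 1x1 reduce 256 -> 64, 3x3 at 64, 1x1 restore 64 -> 256 (assumed widths).
bottleneck = (
    conv_weights(1, 256, 64)
    + conv_weights(3, 64, 64)
    + conv_weights(1, 64, 256)
)

print(basic, bottleneck)
```

Under these assumed widths the three-layer block operates on four times as many channels at its input and output, yet needs slightly fewer weights than the two-layer block.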

4.2 CIFAR-10 and Analysis — CIFAR-10 與分析

We conduct more studies on CIFAR-10, focusing on the behaviors of extremely deep networks rather than pushing the state-of-the-art results, using intentionally simple architectures. The plain/residual architectures use a stack of 3×3 convolutions on feature maps of sizes {32, 16, 8}, with filter numbers of {16, 32, 64} respectively, for a total of 6n + 2 weighted layers. We compare networks with n = {3, 5, 7, 9}, yielding 20-, 32-, 44-, and 56-layer networks. The deep plain nets suffer from increased depth and exhibit higher training error when going deeper. This phenomenon is similar to that on ImageNet, suggesting that such an optimization difficulty is a fundamental problem.

我們在 CIFAR-10 上進行更深入的研究，聚焦於極深網路的行為而非追求最先進的結果，刻意使用簡單的架構。一般／殘差架構遵循在 {32, 16, 8} 尺寸特徵圖上的 3×3 摺積形式，濾波器數量為 {16, 32, 64}。比較 n = {3, 5, 7, 9}，對應 20、32、44 與 56 層的網路：深層一般網路隨著深度增加而表現不佳，更深時展現更高的訓練誤差。此現象與 ImageNet 上觀察到的一致，暗示此最佳化困難是一個根本性的問題。

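Paragraph function: Cross-dataset validation. Reproduces degradation on CIFAR-10 to confirm that it is general.

Logical role: Broadens the empirical base: the degradation problem is not specific to ImageNet but is a basic obstacle in optimizing deep networks. Consistency across datasets strengthens the case for residual learning.

Argumentative technique / potential weakness: The phrase "intentionally simple architectures" emphasizes that the aim is to understand the problem rather than to top a leaderboard, a rigor that is uncommon in deep learning papers. The small scale of CIFAR-10 makes experiments with many more layers computationally feasible and sets up the 1,000-layer experiment.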
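
The (n, depth) pairs quoted above follow the rule depth = 6n + 2: three feature map sizes with 2n convolutional layers each, plus a first convolution and a final fully connected layer. The helper name `cifar_depth` is hypothetical; the short check below only restates that arithmetic.

```python
def cifar_depth(n):
    """Weighted layers in the CIFAR-10 nets: 3 stages x 2n convs, plus first conv and FC."""
    return 6 * n + 2

depths = [cifar_depth(n) for n in (3, 5, 7, 9)]
print(depths)

# The same rule with n = 200 gives the 1202-layer network.
print(cifar_depth(200))
```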

We further explore an aggressively deep model with n = 200, leading to a 1202-layer network. This network shows no optimization difficulty, achieving a training error < 0.1%. Its test error is still fairly good (7.93%). But there are still open problems on such aggressively deep models: the testing result of this 1202-layer network is worse than that of the 110-layer network, although both have similar training error. We argue that this is because of overfitting. The 1202-layer network may be unnecessarily large (19.4M parameters) for this small dataset. Analysis of Layer Responses shows that ResNets have generally smaller responses than their plain counterparts, supporting the notion that residual functions might be generally closer to zero than the non-residual functions.

我們進一步探索了一個極度深的模型，n = 200，產生 1202 層的網路。此網路未展現最佳化困難，達到低於 0.1% 的訓練誤差。其測試誤差仍相當不錯（7.93%）。但此極深模型仍有待解決的問題：1202 層網路的測試結果不如 110 層網路，儘管兩者的訓練誤差相近。我們認為這是由於過度擬合。1202 層網路（1,940 萬參數）對此小型資料集而言可能過大。層回應分析顯示 ResNet 通常具有比其一般對應網路更小的回應值，支持了殘差函數通常比非殘差函數更接近零的觀點。

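Paragraph function: Stress test and candid discussion. The success and the limits of the 1202-layer network.

Logical role: The paragraph does two things: (1) the successful training of 1202 layers shows that, as far as optimization is concerned, residual learning has almost no depth limit; (2) the rise in test error separates the optimization problem (addressed) from the generalization problem (still present), a key conceptual distinction.

Argumentative technique / potential weakness: The authors candidly report that 1202 layers does worse than 110 layers, a welcome display of academic honesty. The layer-response analysis (residual functions close to zero) gives experimental support for the theoretical motivation of residual learning and closes the loop of the argument. But the authors do not explore whether regularization techniques such as dropout could improve the generalization of extremely deep models, an obvious direction for follow-up.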

4.3 Object Detection on PASCAL and MS COCO — PASCAL 與 MS COCO 上的物件偵測

Deep residual nets show excellent generalization performance on other recognition tasks. We adopt Faster R-CNN as the detection method and evaluate the improvements solely due to replacing VGG-16 with ResNet-101. Most remarkably, on the challenging COCO dataset, we obtain a 6.0% increase in COCO's standard metric (mAP@[.5, .95]), which is a 28% relative improvement. This gain is solely due to the learned representations. Based on deep residual nets, we won 1st place in several tracks of the ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

深度殘差網路在其他辨識任務上展現出優異的泛化性能。我們採用 Faster R-CNN 作為偵測方法，並評估僅將 VGG-16 替換為 ResNet-101 所帶來的改善。最令人矚目的是，在極具挑戰性的 COCO 資料集上，我們在 COCO 標準指標（mAP@[.5, .95]）上獲得 6.0% 的提升，相當於 28% 的相對改善。此增益完全歸因於學習到的表示。基於深度殘差網路，作者在 ILSVRC 與 COCO 2015 競賽的多個賽道中贏得第一名：ImageNet 偵測、ImageNet 定位、COCO 偵測，以及 COCO 分割。

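Paragraph function: Transfer validation. Shows that residual representations benefit downstream tasks broadly.

Logical role: Extends the value of residual learning from classification to detection, localization, and segmentation, showing that the learned representations are general. Four first-place competition results provide strong external validation.

Argumentative technique / potential weakness: Quoting a "28% relative improvement" as a percentage amplifies the perceived effect, although the absolute gain of 6.0% on mAP@[.5, .95] is indeed substantial. The claim that the gain is "solely due to the learned representations" is strong: swapping only the backbone while keeping the detection framework fixed is a clean ablation. The list of four first places rests on facts and is persuasive without rhetoric.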
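
The relation between the absolute and relative figures can be made explicit. The sketch below back-computes the baseline implied by the two numbers quoted above (6.0 points absolute, 28% relative); the resulting baseline is an inference from rounded figures, not a value stated in the text.

```python
abs_gain = 6.0      # absolute increase in mAP@[.5, .95], in percentage points
rel_gain = 0.28     # relative improvement quoted in the text

# relative = absolute / baseline, so baseline = absolute / relative.
implied_baseline = abs_gain / rel_gain
implied_new = implied_baseline + abs_gain

print(f"implied baseline mAP: {implied_baseline:.1f}")
print(f"implied new mAP:      {implied_new:.1f}")
```

Because the baseline is only in the low twenties, a 6-point gain reads as a large relative change, which is the amplification effect the commentary points out.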

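Argument structure overview

Problem: deep networks degrade; deeper is worse
→ Claim: the residual mapping F(x) + x is easier to learn than the direct mapping H(x)
→ Evidence: ImageNet / CIFAR-10; the degradation problem is addressed
→ Rebuttal: vanishing gradients ruled out; optimization separated from generalization
→ Conclusion: 152 layers beats VGG; four first-place competition results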

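Authors' core claim (one sentence)

A residual learning framework, in which stacked layers learn a residual function F(x) relative to their input rather than the original mapping H(x), makes it possible to train deep networks with hundreds or even a thousand layers, addresses the degradation that accompanies increasing depth, and achieves unprecedented accuracy on image recognition tasks.

Strongest point of the argument

Precise diagnosis of the degradation problem and a simple fix: the authors first use careful experiments to rule out vanishing gradients and pin degradation down as an optimization problem rather than a statistical one. The proposed solution, identity shortcut connections, adds no parameters or computation and shows consistent gains from depth at 18/34/50/101/152 layers. This "free lunch at zero cost" makes the argument highly persuasive both theoretically and in engineering terms.

Weakest point of the argument

Incomplete theoretical explanation: the authors acknowledge that residual learning being "easier to optimize" is a hypothesis rather than a proven theorem, and they do not analyze the root cause of degradation in depth (offering only the conjecture of "exponentially low convergence rates"). In addition, the 1202-layer experiment exposes a generalization concern for extremely deep models, but remedies such as regularization are not explored. Why the residual reformulation helps the loss landscape was only gradually understood through later research (for example, loss surface visualization).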