Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet

Abstract

Rectified activation units (rectifiers) are essential to recent advances in deep neural networks. In this work, we study rectifier neural networks from two aspects. First, we propose the Parametric Rectified Linear Unit (PReLU), which generalizes the traditional rectified unit and improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that explicitly accounts for the rectifier nonlinearities, enabling us to train extremely deep rectified models directly from scratch. Based on these two techniques, we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset, which surpasses the reported human-level performance (5.1%) for the first time.

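Paragraph function: Overview of the whole paper. It defines the paper's core through two technical contributions and one milestone result.

Logical role: The abstract is organized as "two contributions -> one breakthrough". PReLU and the initialization method complement each other and together deliver the better-than-human result.

Argumentative technique / potential weakness: "Surpassing human-level performance" is a headline-grabbing claim, but the 5.1% "human-level" figure comes from a specific experimental setting (the error rate of a single annotator over 1000 classes) and is not equivalent to general human visual recognition ability. The limits of this definition have been widely discussed in later literature.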

1. Introduction

Deep neural networks have recently achieved remarkable progress in visual recognition. A key factor behind this success is the Rectified Linear Unit (ReLU): f(x) = max(0, x). While ReLU has enabled training of deeper networks than sigmoid and tanh, its hard zero on the negative side leads to "dying neurons": a unit whose input stays in the negative regime outputs zero, receives zero gradient, and so never activates again. Moreover, standard initialization methods such as Xavier assume linear activations and are not designed for networks with rectifier nonlinearities, which causes training difficulties for very deep models.
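The dying-neuron mechanism can be illustrated with a minimal sketch. It assumes a single scalar unit y = max(0, w*x + b) and a constant upstream gradient; the helper names and the numeric values are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def relu(x: float) -> float:
    """ReLU activation: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU; the subgradient at x == 0 is taken to be 0."""
    return 1.0 if x > 0.0 else 0.0


def unit_gradients(w: float, b: float, xs: Sequence[float],
                   upstream: float = 1.0) -> Tuple[float, float]:
    """Return (dL/dw, dL/db) summed over the inputs xs for y = relu(w*x + b),
    assuming the same upstream gradient dL/dy for every input."""
    grad_w = 0.0
    grad_b = 0.0
    for x in xs:
        local = relu_grad(w * x + b) * upstream
        grad_w += local * x
        grad_b += local
    return grad_w, grad_b


inputs = [0.5, 1.0, 2.0]
# Pre-activations are all negative: no gradient reaches w or b.
dead = unit_gradients(w=1.0, b=-5.0, xs=inputs)
# Pre-activations are all positive: gradients flow normally.
alive = unit_gradients(w=1.0, b=0.0, xs=inputs)
print("dead unit gradients: ", dead)
print("alive unit gradients:", alive)
```

Because both gradients of the dead unit are exactly zero, a gradient step leaves w and b unchanged, so the unit stays inactive for these inputs.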

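Paragraph function: Establishes the problem context by pointing out known shortcomings of ReLU and of existing initialization methods.

Logical role: Starting point of the argument. After acknowledging ReLU's contribution, it names two shortcomings (dying neurons, unsuitable initialization) that map exactly onto the paper's two contributions.

Argumentative technique / potential weakness: "Dying neurons" is a widely recognized problem, so it works well as motivation. However, Leaky ReLU had already proposed a similar fix (a nonzero slope on the negative side), so the authors need to explain what PReLU adds beyond it.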

2. Parametric Rectified Linear Unit

We propose PReLU, which generalizes ReLU by allowing a learnable slope for negative inputs: f(y_i) = y_i if y_i > 0, and f(y_i) = a_i · y_i if y_i ≤ 0. The coefficient a_i is learned jointly with the other model parameters through back-propagation and requires no manual tuning. When a_i = 0, PReLU reduces to ReLU; when a_i is a small fixed value, it becomes Leaky ReLU. We investigate both a channel-wise variant (each channel has its own a_i) and a channel-shared variant (all channels share one a). On a 14-layer model, the channel-wise version shows a 1.2% gain over the ReLU baseline. The extra parameters are negligible: the channel-wise variant adds only one parameter per channel, at virtually zero computational cost.
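The definition above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the slope values are arbitrary, and the gradient with respect to a_i is simply the derivative of the piecewise formula.

```python
from typing import List, Sequence


def prelu(y: float, a: float) -> float:
    """PReLU: f(y) = y if y > 0, else a * y."""
    return y if y > 0 else a * y


def prelu_grad_slope(y: float) -> float:
    """Derivative of f with respect to the slope a: 0 if y > 0, else y."""
    return 0.0 if y > 0 else y


def prelu_channel_wise(ys: Sequence[float], slopes: Sequence[float]) -> List[float]:
    """Channel-wise variant: one learnable slope per channel."""
    if len(ys) != len(slopes):
        raise ValueError("need exactly one slope per channel")
    return [prelu(y, a) for y, a in zip(ys, slopes)]


def prelu_channel_shared(ys: Sequence[float], slope: float) -> List[float]:
    """Channel-shared variant: a single slope used for every channel."""
    return [prelu(y, slope) for y in ys]


channels = [-2.0, 3.0, -4.0]                 # one response per channel (illustrative)
as_relu = prelu_channel_shared(channels, 0.0)     # a = 0 behaves like ReLU
as_leaky = prelu_channel_shared(channels, 0.01)   # small fixed a behaves like Leaky ReLU
per_channel = prelu_channel_wise(channels, [0.25, 0.25, 0.5])
print(as_leaky, per_channel)
```

In a real network the slopes would be updated by back-propagation together with the weights; here they are fixed numbers so that the special cases are easy to see.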

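Paragraph function: Core contribution one. Defines the mathematical form of PReLU and its variants.

Logical role: By presenting ReLU and Leaky ReLU as special cases, the section positions PReLU as a generalizing framework rather than an entirely new invention.

Argumentative technique / potential weakness: Subsuming ReLU and Leaky ReLU as special cases of PReLU is a shrewd positioning strategy. The 1.2% improvement is modest, but the "zero extra cost" framing makes it a "free improvement", which greatly lowers the barrier to adoption.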

3. Initialization Method

We derive a new initialization method specifically designed for rectifier networks. The key insight is that the variance of the responses in each layer must be preserved in both the forward and backward passes. For ReLU networks, the weights should be initialized with standard deviation √(2/n_l), where n_l is the number of input connections, rather than Xavier's √(1/n_l), which assumes linear activations. The factor of 2 compensates for ReLU zeroing out roughly half of the activations. This seemingly simple correction has dramatic practical impact: it enables training extremely deep rectified models (30+ layers) directly from scratch, without pre-training or other tricks, whereas Xavier initialization causes such models to fail to converge.
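The origin of the factor of 2 can be written out as a short worked derivation for the forward pass. It is a sketch under the usual assumptions (zero-mean, symmetrically distributed weights w_l, zero biases, and x_l = max(0, y_{l-1})); the symbols y_l, x_l and L are notation introduced here for illustration.

```latex
% y_l: pre-activation of layer l, x_l: its ReLU-rectified input, L: number of layers
\begin{align}
\mathrm{Var}[y_l] &= n_l \,\mathrm{Var}[w_l]\, E[x_l^2],
  \qquad E[x_l^2] = \tfrac{1}{2}\,\mathrm{Var}[y_{l-1}] \\
\mathrm{Var}[y_L] &= \mathrm{Var}[y_1] \prod_{l=2}^{L} \tfrac{1}{2}\, n_l \,\mathrm{Var}[w_l] \\
\tfrac{1}{2}\, n_l \,\mathrm{Var}[w_l] = 1 \;&\Longrightarrow\; \mathrm{std}(w_l) = \sqrt{2/n_l} \\
\mathrm{Var}[w_l] = 1/n_l \;&\Longrightarrow\; \mathrm{Var}[y_L] = 2^{-(L-1)}\,\mathrm{Var}[y_1]
\end{align}
```

The last line shows why the √(1/n_l) choice is harmful at depth: each ReLU layer halves the variance, so the signal shrinks exponentially with the number of layers.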

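Paragraph function: Core contribution two. Derives an initialization formula suited to rectifiers.

Logical role: This section turns an intuition (ReLU zeros out half of the activations) into mathematics (the factor of 2) and then validates it experimentally (30+ layers trainable vs. failing to converge). The three-level structure of theory, intuition, and practice is very complete.

Argumentative technique / potential weakness: √(2/n_l) and √(1/n_l) differ only by a factor of √2, yet the effect on training deep models is a qualitative change. This "small correction, large impact" narrative is very persuasive, and it became the paper's most widely cited contribution.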
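A rough numerical check of how the √2 factor compounds with depth is sketched below. It is only an assumption-laden toy: fully connected layers of equal width, zero biases, Gaussian weights, one random input, and forward-pass statistics only. The function name, width, depth and seed are hypothetical choices, not settings from the paper.

```python
import math
import random
from typing import List


def preactivation_second_moments(width: int, depth: int, gain: float,
                                 seed: int = 0) -> List[float]:
    """Push one random input through `depth` fully connected ReLU layers and
    return the mean squared pre-activation of every layer.

    Weights are drawn from N(0, gain / width) and biases are zero, so
    gain=2.0 gives std sqrt(2/n_l) and gain=1.0 gives std sqrt(1/n_l).
    """
    rng = random.Random(seed)
    std = math.sqrt(gain / width)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    moments = []
    for _ in range(depth):
        # Each output unit uses a fresh row of random weights.
        y = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(width)]
        moments.append(sum(v * v for v in y) / width)
        x = [max(0.0, v) for v in y]  # ReLU
    return moments


he = preactivation_second_moments(width=128, depth=30, gain=2.0)
plain = preactivation_second_moments(width=128, depth=30, gain=1.0)
he_ratio = he[-1] / he[0]           # stays within a few orders of magnitude of 1
plain_ratio = plain[-1] / plain[0]  # expected to be around 2**-29
print(f"sqrt(2/n): last/first = {he_ratio:.3g}")
print(f"sqrt(1/n): last/first = {plain_ratio:.3g}")
```

With the √(2/n_l) scaling the signal magnitude is roughly preserved across 30 layers, while the √(1/n_l) scaling loses about a factor of 2 per layer, which is consistent with the reported failure of very deep models to converge.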

4. Experiments

We train a series of deep models on ImageNet 2012 using PReLU and our initialization. In single-model top-5 error, Model A (19-layer) achieves 6.28% with PReLU, Model B (22-layer) achieves 6.27%, and Model C (a wider variant) achieves 5.71%. The multi-model ensemble reaches 4.94% top-5 test error, a 26% relative improvement over the ILSVRC 2014 winner GoogLeNet (6.66%). This is the first result to surpass the reported human-level performance of 5.1%. Our models use spatial pyramid pooling, aggressive data augmentation with scale jittering in the range [256, 512], and multi-GPU training. We find that increasing width (more filters per layer) contributes more than increasing depth alone.
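The headline comparisons reduce to simple arithmetic, sketched here with the error rates quoted above. The function name is a hypothetical helper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


GOOGLENET = 6.66     # ILSVRC 2014 winner, top-5 error (%)
ENSEMBLE = 4.94      # multi-model result, top-5 test error (%)
BEST_SINGLE = 5.71   # Model C, single-model top-5 error (%)
HUMAN = 5.1          # reported human-level top-5 error (%)

rel = relative_improvement(GOOGLENET, ENSEMBLE)
gap_to_human = HUMAN - ENSEMBLE            # in percentage points
single_beats_human = BEST_SINGLE < HUMAN

print(f"relative improvement over GoogLeNet: {rel:.1%}")
print(f"ensemble margin over human figure:   {gap_to_human:.2f} points")
print(f"best single model beats human figure: {single_beats_human}")
```

The relative improvement is about 25.8%, which rounds to the quoted 26%, while the margin over the human figure is only 0.16 percentage points and is reached by the ensemble, not by the best single model.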

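Paragraph function: Comprehensive experimental validation, using a progression of models to show the improvements and the final breakthrough.

Logical role: The experiments follow a progression A -> B -> C -> ensemble to show the incremental contribution of each technique, and the 4.94% ensemble result finally supports the paper's "surpassing human" claim.

Argumentative technique / potential weakness: The gap between 4.94% and 5.1% (0.16 percentage points) may not be statistically significant, and the definition of "human level" is fairly loose. The finding that width matters more than depth is insightful, but it seems to conflict with the later ResNet paper's claim that depth is crucial; in practice, both hold in different settings.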

5. Conclusion

We have investigated rectifier neural networks from the perspectives of activation functions and initialization. PReLU provides a principled generalization of ReLU with negligible extra cost, while our initialization method enables training of very deep models that were previously intractable. Together, these contributions achieve 4.94% top-5 error on ImageNet, surpassing human-level performance for the first time. These results demonstrate that careful attention to seemingly minor design choices, namely activation functions and initialization, can yield substantial practical improvements in deep learning.

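Paragraph function: Summarizes the paper and distills a general lesson that goes beyond the specific methods.

Logical role: The conclusion moves from specific techniques to a general principle, "seemingly minor design choices can have a large impact", which gives the deep learning community a methodological takeaway.

Argumentative technique / potential weakness: Raising the conclusion to this more general level (minor design -> large impact) is an effective rhetorical device. However, it does not discuss the dispute over how "human level" is defined, nor does it make clear how much the result depends on ensembling (the best single model, at 5.71%, still does not beat 5.1%).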

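Argument structure overview

Problem: ReLU dying neurons; Xavier initialization ill-suited to rectifiers
→ Thesis: PReLU with a learnable slope; √(2/n) initialization
→ Evidence: 4.94% top-5 on ImageNet; first result to surpass the 5.1% human figure
→ Rebuttal: near-zero extra cost; deep models with 30+ layers become trainable
→ Conclusion: minor design choices can bring major practical improvements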

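The authors' core claim (in one sentence)

With the learnable PReLU activation and the √(2/n_l) initialization that accounts for rectifier nonlinearities, extremely deep models can be trained from scratch and reach 4.94% top-5 error on ImageNet, surpassing human-level performance for the first time.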

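Strongest point of the argument

The theoretical elegance and practical impact of the initialization method: √(2/n_l) is derived from a rigorous variance analysis, a single factor of √2 turns an untrainable 30+ layer network into a trainable one, and the method (He initialization) has become standard practice in deep learning, with influence that reaches far beyond the paper itself.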

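Weakest point of the argument

The contested definition behind the "surpassing human" claim: the 5.1% human baseline reflects a single annotator's performance on the 1000-class classification task, not general human visual ability. The best single-model result (5.71%) does not actually beat this baseline; reaching 4.94% requires a six-model ensemble. In addition, PReLU's incremental contribution over Leaky ReLU is fairly limited.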