Abstract
Deep residual networks have been shown to scale to thousands of layers while still improving in performance. However, each fraction of a percent of added accuracy costs nearly a doubling of the number of layers, and training such very deep residual networks suffers from diminishing feature reuse, which makes them very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study of the architecture of ResNet blocks, based on which we propose a novel architecture that decreases the depth and increases the width of residual networks. We call the resulting structures wide residual networks (WRNs). We demonstrate that a 16-layer-deep wide residual network has comparable or better accuracy than very deep thin networks (over 1000 layers), while being several times faster to train. We achieve state-of-the-art results on CIFAR-10 (3.89%), CIFAR-100 (18.85%), and SVHN (1.54%).
Paragraph function: challenges the prevailing "deeper is better" mindset and proposes substituting width for depth as the alternative.
Logical role: criticizes the efficiency of very deep networks through the economic notion of diminishing marginal returns.
Argumentative technique / potential weakness: the contrast between 16 layers and more than 1000 layers is striking and directly challenges the deep learning community's faith in depth.
1. Introduction
Since the introduction of ResNets, the dominant trend in neural network design has been to make networks deeper. ResNets with 100, 200, and even 1000 layers have been explored. However, we observe that the benefits of going deeper diminish rapidly, and much of the representational capacity of very deep networks may be wasted. In contrast, increasing the width (number of channels) of residual blocks is a more computationally efficient way to improve performance. Wider networks are also more amenable to parallelization on modern GPUs, since the larger matrix multiplications of wider layers utilize GPU cores more effectively.
Paragraph function: argues for width over depth from the standpoint of computational efficiency and GPU utilization.
Logical role: supplements the theoretical argument with hardware friendliness, adding practical persuasiveness.
Argumentative technique / potential weakness: the GPU parallelization argument contrasts effectively with the serialization bottleneck that depth imposes.
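To make the GPU-utilization argument concrete, here is a minimal, hypothetical timing sketch in PyTorch that compares a deep stack of thin 3x3 convolutions against a shallow stack of wide ones. The layer counts and channel widths below are illustrative assumptions only; they are not configurations or measurements from the paper.

```python
# Hypothetical micro-benchmark: deep-and-thin vs. shallow-and-wide conv stacks.
# Depths and widths here are illustrative only; they are not the WRN configurations.
import time
import torch
import torch.nn as nn

def conv_stack(depth: int, width: int) -> nn.Sequential:
    """Stack of `depth` 3x3 convolutions, each with `width` output channels."""
    layers = [nn.Conv2d(3, width, 3, padding=1)]
    layers += [nn.Conv2d(width, width, 3, padding=1) for _ in range(depth - 1)]
    return nn.Sequential(*layers)

def avg_step_time(model: nn.Module, device: torch.device, iters: int = 20) -> float:
    """Average forward+backward time per step on a CIFAR-sized dummy batch."""
    model = model.to(device)
    x = torch.randn(64, 3, 32, 32, device=device)
    for _ in range(3):                        # warm-up so one-time setup is excluded
        model(x).sum().backward()
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x).sum().backward()
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
deep_thin = conv_stack(depth=100, width=16)      # many small, strictly sequential matmuls
shallow_wide = conv_stack(depth=10, width=160)   # few large, parallel-friendly matmuls
print("deep/thin    :", avg_step_time(deep_thin, device))
print("shallow/wide :", avg_step_time(shallow_wide, device))
```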
2. Width vs Depth
We parameterize the width of residual blocks by a widening factor k. A standard thin ResNet uses channel widths of [16, 32, 64] across its three block groups; with k=10, these become [160, 320, 640]. We systematically vary both depth (d) and width (k) to study their interaction. Our key finding: WRN-28-10 (28 layers, widening factor 10) achieves 3.89% error on CIFAR-10, outperforming a 1001-layer pre-activation ResNet (4.62%) while being 8x faster to train. Further, WRN-16-8 achieves accuracy similar to the 1001-layer network while training 10x faster. These results strongly suggest that, beyond a certain point, width is a more effective dimension to scale than depth.
Paragraph function: reports the core experimental results on the depth-width interaction.
Logical role: directly contrasts the efficiency of deep vs. wide with quantitative data.
Argumentative technique / potential weakness: a 28-layer network beating a 1001-layer one while training 8x faster is a highly disruptive result.
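The widening-factor parameterization is easy to state in code. Below is a minimal Python sketch, assuming the standard WRN convention that a WRN-d-k has depth of the form d = 6n + 4 (three groups of n blocks, two 3x3 convolutions per block, plus a few fixed layers) and group widths of 16k, 32k, and 64k after an initial 16-channel convolution; the helper name wrn_config is hypothetical.

```python
# Minimal sketch of the WRN-d-k parameterization (the helper name is hypothetical).
def wrn_config(depth: int, k: int):
    """Return blocks-per-group n and channel widths for a WRN-depth-k."""
    assert (depth - 4) % 6 == 0, "WRN depth is expected to have the form 6n + 4"
    n = (depth - 4) // 6                     # residual blocks in each of the three groups
    widths = [16, 16 * k, 32 * k, 64 * k]    # stem width, then the three widened groups
    return n, widths

print(wrn_config(28, 10))  # WRN-28-10 -> (4, [16, 160, 320, 640])
print(wrn_config(16, 8))   # WRN-16-8  -> (2, [16, 128, 256, 512])
```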
3. Dropout in Residual Blocks
Wider networks have more parameters and may be prone to overfitting. We study dropout between the convolutional layers inside residual blocks as a regularization technique, inserting it between the two 3x3 convolutions of each block. Our experiments show that dropout with a rate of 0.3-0.4 consistently improves performance for wide networks. Interestingly, dropout does not help thin (standard-width) networks, suggesting that regularization becomes particularly important as width increases. The combination of widening and dropout achieves the best results across all datasets.
Paragraph function: studies the regularizing effect of dropout in wide networks.
Logical role: addresses the legitimate concern that wide networks may overfit.
Argumentative technique / potential weakness: the finding that dropout helps only wide networks is an interesting insight, hinting that the redundancy introduced by width can be exploited by regularization.
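To show where the dropout sits, here is a minimal PyTorch sketch of a wide residual block with dropout between its two 3x3 convolutions. It assumes the pre-activation (BN-ReLU-conv) ordering and a 1x1 projection shortcut when shapes change; these details and the class name WideBasicBlock are simplifying assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WideBasicBlock(nn.Module):
    """Pre-activation residual block: BN-ReLU-conv3x3-dropout-BN-ReLU-conv3x3 + shortcut."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, drop_rate: float = 0.3):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.dropout = nn.Dropout(p=drop_rate)          # dropout sits between the two convs
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        # 1x1 projection when the shape changes, identity otherwise.
        self.shortcut = (
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
            if stride != 1 or in_ch != out_ch else nn.Identity()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.dropout(out)
        out = self.conv2(F.relu(self.bn2(out)))
        return out + self.shortcut(x)

# Example: one block of a WRN-28-10 group (160 -> 160 channels) with dropout rate 0.3.
block = WideBasicBlock(160, 160, stride=1, drop_rate=0.3)
y = block(torch.randn(8, 160, 32, 32))
print(y.shape)  # torch.Size([8, 160, 32, 32])
```

Stacking n such blocks per group with the widths from the wrn_config sketch above would yield the WRN-d-k family, under the same assumptions.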
4. Experiments
We evaluate WRNs on CIFAR-10, CIFAR-100, SVHN, and ImageNet. On CIFAR-10, WRN-28-10 achieves 3.89% error, the best result at the time. On CIFAR-100, WRN-28-10 achieves 18.85% error, and on SVHN, WRN-16-8 achieves 1.54%. Training-time comparisons show that WRNs train significantly faster than very deep networks of similar accuracy. On ImageNet, WRNs also show competitive performance. We provide comprehensive ablation studies covering depth, width, dropout rate, and their interactions.
Paragraph function: reports comprehensive experimental results across multiple datasets.
Logical role: uses consistent results across datasets to establish the generality of the width advantage.
Argumentative technique / potential weakness: consistent results on four datasets are highly persuasive, and the thorough ablation study is a highlight of the paper.
5. Conclusions
We have conducted a thorough experimental study of residual network architectures and demonstrated that wide residual networks are a simple, efficient, and effective way to improve their performance. Our results challenge the prevailing wisdom that deeper is always better, showing that increasing width with appropriate regularization can achieve better results than increasing depth while being much faster to train.
Paragraph function: summarizes the core finding that width beats depth.
Logical role: closes the paper by challenging mainstream thinking and emphasizing its experiment-driven methodology.
Argumentative technique / potential weakness: positioning the work as a challenge to the prevailing consensus gives the paper high academic impact and provokes discussion.
Argument Structure Overview
Problem: very deep networks show diminishing marginal returns ➔ Claim: decrease depth, increase width ➔ Evidence: 16 layers beats 1001 layers ➔ Counterargument addressed: wide networks need dropout regularization ➔ Conclusion: width is the more efficient dimension to scale
Core claim
Increasing the width (number of channels) of residual networks is more efficient than increasing their depth (number of layers): comparable or better accuracy can be reached with far fewer layers, while training time is cut substantially.
Strongest argument
WRN-28-10 reaches 3.89% error with only 28 layers, beating the 4.62% of a 1001-layer ResNet while training 8x faster; the numbers forcefully overturn the "deeper is better" consensus.
Weakest link
Wide networks still carry a large parameter count (even though they train faster), so in memory-constrained settings they may be less advantageous than deep, thin networks.