Squeeze-and-Excitation Networks

Abstract

The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. We propose the Squeeze-and-Excitation (SE) block, an architectural unit designed to improve the representational power of a network by enabling it to perform dynamic channel-wise feature recalibration. SE networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%.

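Paragraph function: Overview of the whole paper. It uses the limitations of convolution to introduce the SE block's core innovation and cites the competition win as strong supporting evidence.

Logical role: The abstract follows a three-part structure of problem -> solution -> result: channel fusion in convolution is implicit -> the SE block makes it explicit and dynamic -> the ILSVRC win demonstrates its effectiveness.

Argumentative technique / potential weaknesses: Opening with the ILSVRC win is a highly persuasive appeal to authority. However, the 2.251% top-5 error is the result of the entire system, so the SE block's independent contribution has to be isolated in the later experiments.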

1. Introduction

CNNs have proven to be effective models for image understanding. At each convolutional layer, a set of filters expresses local spatial connectivity patterns along input channels. However, the channel relationships are modeled only implicitly by the filters and are entangled with spatial information. We argue that explicitly modeling the interdependencies between channels of convolutional features can significantly improve feature quality. To achieve this, we propose an SE block that operates in two phases: squeeze (aggregating spatial information) and excitation (learning channel-wise dependencies).

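Paragraph function: Establishes the motivation by pointing out that channel relationships are modeled only implicitly in the convolution operation.

Logical role: Starting point of the argument. The contrast between implicit and explicit channel modeling supplies the case for why the SE block is needed.

Argumentative technique / potential weaknesses: The observation that channel relationships are implicit is correct, but a 1x1 convolution in a standard network already performs channel mixing. The SE block's real innovation is that the recalibration is adaptive (input-dependent), not that it is explicit per se.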

2. Related Work

Prior work has explored deeper architectures (VGGNet, ResNet, Inception) and architecture search methods (NAS) to improve representations. Attention mechanisms have been applied in various forms: spatial attention focuses on informative spatial regions, while channel attention has been less explored. The SE block can be viewed as a lightweight gating mechanism applied to channels, related to but distinct from highway networks and gated recurrent units. The key difference is that SE blocks operate as self-attention on channel features, conditioned on the input.

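Paragraph function: Literature positioning. Places the SE block at the intersection of attention mechanisms and gated networks.

Logical role: Distinguishes spatial attention (well explored) from channel attention (less explored), pinpointing the research gap that the SE block fills.

Argumentative technique / potential weaknesses: Linking the SE block to self-attention adds theoretical depth. The analogy with gating mechanisms, however, may leave readers wondering where the real novelty lies; the answer is the specific two-stage squeeze-and-excitation design.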

3. Squeeze-and-Excitation Block

3.1 Squeeze: Global Information Embedding

The squeeze operation aggregates feature maps across spatial dimensions to produce a channel descriptor. This is achieved through global average pooling, generating a vector z of dimension C where each element z_c is the spatial average of the c-th channel. The rationale is that each channel of the output after convolution is unable to exploit contextual information outside of its local receptive field. The squeeze operation addresses this by embedding global spatial information into the channel descriptor.

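Paragraph function: First stage of the method. Defines the form and motivation of the squeeze operation.

Logical role: Squeeze is the SE block's information-gathering step: it compresses each channel's H x W spatial map into a single scalar, which is the precondition for the channel-wise modeling that follows.

Argumentative technique / potential weaknesses: Global average pooling is an extremely simple design choice, but it discards all spatial position information. The later ablation study shows that global average pooling outperforms max pooling, but alternatives that retain some spatial information are not explored.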
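
The squeeze step described above, z_c = (1 / (H * W)) * sum over (i, j) of u_c(i, j), can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the nested-list layout [channel][row][column], the function name `squeeze`, and the toy values are all assumptions made for the example.

```python
from typing import List

# Assumed layout for this sketch: feature maps indexed as [channel][row][column].
FeatureMaps = List[List[List[float]]]


def squeeze(u: FeatureMaps) -> List[float]:
    """Global average pooling: reduce each H x W channel to one scalar z_c."""
    z = []
    for channel in u:
        height = len(channel)
        width = len(channel[0])
        total = sum(sum(row) for row in channel)
        z.append(total / (height * width))
    return z


# Toy input with C = 2 channels and H x W = 2 x 2 (values are made up).
u = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
z = squeeze(u)
print(z)  # [2.5, 2.0]
```

Note that the second channel's single strong activation is averaged into the same kind of scalar as the first channel's evenly spread values, which is exactly the loss of spatial detail that the commentary above points out.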

3.2 Excitation: Adaptive Recalibration

The excitation operation uses the channel descriptor to learn channel-wise dependencies. It employs a bottleneck architecture with two fully-connected layers: first a dimensionality reduction layer with ratio r (default r=16) followed by ReLU, then a dimensionality increasing layer followed by sigmoid activation. The output is a set of per-channel modulation weights that are applied to the original feature maps via channel-wise multiplication. This mechanism introduces dynamics conditioned on the input, which can be regarded as a self-attention function on channels.

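Paragraph function: Second stage of the method. Describes the concrete architecture of the excitation operation.

Logical role: Excitation is the SE block's decision step: based on global information, it learns which channels to amplify and which to suppress. The bottleneck design (r=16) keeps the parameter overhead in check.

Argumentative technique / potential weaknesses: The sigmoid's output range of (0, 1) is a natural fit for channel weights, since a channel can be almost fully suppressed or almost fully preserved. The choice of r=16 is empirical, though: a smaller r increases parameters and computation, while a larger r may limit the capacity to model interactions between channels.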
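
Putting the two stages together, the block computes s = sigmoid(W2 * ReLU(W1 * z)) and rescales each channel by its gate s_c. The sketch below is one way to write this in plain Python under stated assumptions: bias terms are omitted, weights are hand-picked rather than learned, the names `se_block`, `matvec`, `w1`, and `w2` are hypothetical, and the toy sizes (C = 4, r = 2) are chosen only to keep the example readable.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]               # indexed as [output][input]
FeatureMaps = List[List[List[float]]]    # indexed as [channel][row][column]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Fully-connected layer without bias: one dot product per output unit."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def se_block(u: FeatureMaps, w1: Matrix, w2: Matrix) -> FeatureMaps:
    """Squeeze, excite, then rescale each channel by its gate."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in u]
    # Excitation: reduce to C / r units with ReLU, expand back to C with sigmoid.
    hidden = [max(0.0, h) for h in matvec(w1, z)]
    s = [sigmoid(a) for a in matvec(w2, hidden)]
    # Scale: channel-wise multiplication of the original feature maps.
    return [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, u)]


# Toy example: C = 4 channels, r = 2, so the bottleneck has C / r = 2 units.
u = [[[1.0, 3.0]], [[2.0, 2.0]], [[0.0, 4.0]], [[5.0, 5.0]]]   # shape 4 x 1 x 2
w1 = [[0.5, -0.5, 0.0, 0.1], [0.0, 0.2, -0.3, 0.1]]            # shape 2 x 4
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0], [0.5, -0.5]]        # shape 4 x 2
out = se_block(u, w1, w2)
print(out)
```

Because every gate lies strictly between 0 and 1, the output keeps the shape of the input and each channel is attenuated by an input-dependent factor; with all-zero weights every gate is exactly 0.5.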

4. Experiments

ImageNet classification: SE-ResNet-50 achieves 6.62% top-5 error versus 7.48% for ResNet-50, a reduction of 0.86 percentage points. More remarkably, SE-ResNet-101 (6.07% error) outperforms the deeper ResNet-152 (6.34% error), which suggests that SE blocks can be a more effective use of capacity than simply adding layers. The computational overhead is minimal: SE-ResNet-50 needs only about 0.26% more GFLOPs than ResNet-50. SE blocks also generalize across architectures, consistently improving VGGNet, Inception, ResNeXt, and MobileNet.

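Paragraph function: Core experiments. Demonstrates the effect of SE blocks with comprehensive ImageNet results.

Logical role: A three-part argument: (1) absolute improvement (6.62% vs. 7.48%); (2) efficiency (SE-ResNet-101 beats ResNet-152 with less computation); (3) generality (consistent gains across architectures).

Argumentative technique / potential weaknesses: The comparison in which SE-ResNet-101 outperforms ResNet-152 is especially well chosen, because it directly answers the objection "why not just make the network deeper?". The 0.26% increase in GFLOPs is almost negligible, which makes adopting SE blocks a nearly free improvement.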
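
To make the size of these gains concrete, the short calculation below converts the top-5 error rates quoted above into absolute and relative reductions. It uses only the four figures reported in this section; the helper name `error_reduction` is made up for the example.

```python
from typing import Tuple


def error_reduction(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute reduction in points, relative reduction as a fraction)."""
    absolute = baseline - improved
    relative = absolute / baseline
    return absolute, relative


# Top-5 error rates (%) as quoted in the experiments paragraph.
reported = {
    "ResNet-50 -> SE-ResNet-50": (7.48, 6.62),
    "ResNet-152 -> SE-ResNet-101": (6.34, 6.07),
}

for name, (baseline, improved) in reported.items():
    absolute, relative = error_reduction(baseline, improved)
    print(f"{name}: {absolute:.2f} points absolute, {relative:.1%} relative")
# ResNet-50 -> SE-ResNet-50: 0.86 points absolute, 11.5% relative
# ResNet-152 -> SE-ResNet-101: 0.27 points absolute, 4.3% relative
```

Set against the roughly 0.26% increase in GFLOPs, an 11.5% relative error reduction is what underlies the "nearly free improvement" reading in the commentary.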

Beyond classification: On COCO object detection with Faster R-CNN, SE blocks improve AP by 2.4 points over the ResNet-50 baseline. On Places365 scene classification, SE-ResNet-152 surpasses the previous state of the art. On CIFAR-10 and CIFAR-100, SE blocks improve ResNet-110, ResNet-164, and WideResNet across the board. These results indicate that the benefits of channel recalibration extend well beyond ImageNet classification to diverse tasks and datasets.

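Paragraph function: Cross-task validation. Extends the benefits of SE blocks to detection, scene classification, and other settings.

Logical role: Supports the claim of a general-purpose improvement: SE blocks are effective across tasks (classification, detection) and datasets (ImageNet, COCO, Places365, CIFAR).

Argumentative technique / potential weaknesses: Consistent improvement across tasks is the most persuasive kind of evidence. Whether the 2.4 AP gain on COCO is statistically significant, however, is not explicitly discussed.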

5. Ablation Studies

Comprehensive ablation studies reveal several design insights. A reduction ratio of r=16 gives a good balance between accuracy and complexity. Global average pooling outperforms max pooling for the squeeze operation. For the excitation nonlinearity, the sigmoid matters: replacing it with ReLU causes significant performance degradation. One plausible reading is that a ReLU gate is unbounded above, so it no longer acts as a normalized weight in (0, 1) that attenuates less informative channels. A role analysis of SE activations shows that early layers exhibit class-agnostic behavior (similar weights across categories), while deeper layers become increasingly class-specific, dynamically selecting the most relevant channels for each input.

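Paragraph function: In-depth analysis. Uses systematic ablations to justify each design choice.

Logical role: The ablation studies answer the question "why is it designed this way?", turning the method from an effective black box into a transparent design in which every component has a rationale.

Argumentative technique / potential weaknesses: The sigmoid vs. ReLU analysis is especially instructive: it suggests that a bounded gate, one that can attenuate channels rather than only amplify them, is important. The role analysis shows that SE blocks learn meaningful representations rather than arbitrary weight adjustments.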
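
The accuracy-complexity tradeoff behind the reduction ratio can be made tangible by counting the weights that the two fully-connected layers add. Each SE block with C channels adds 2 * C * (C / r) weights, so the total is the sum over all blocks. The sketch below rests on assumptions that are not stated in these notes: it uses the standard ResNet-50 layout of 3, 4, 6, and 3 residual blocks with 256, 512, 1024, and 2048 output channels, places one SE block in every residual block, and ignores biases. The names `RESNET50_STAGES` and `se_extra_parameters` are hypothetical.

```python
from typing import List, Tuple

# Assumed ResNet-50 layout: (number of residual blocks, output channels) per stage.
RESNET50_STAGES: List[Tuple[int, int]] = [(3, 256), (4, 512), (6, 1024), (3, 2048)]


def se_extra_parameters(stages: List[Tuple[int, int]], r: int) -> int:
    """Weights added by SE blocks: two FC layers of size C x (C / r) per block."""
    return sum(blocks * 2 * channels * (channels // r) for blocks, channels in stages)


for r in (4, 8, 16, 32):
    print(r, se_extra_parameters(RESNET50_STAGES, r))
# 4 10059776
# 8 5029888
# 16 2514944
# 32 1257472
```

Under these assumptions, r=16 adds roughly 2.5 million weights, and each halving of r doubles that overhead, which is why the ratio is a direct lever on model size. Most of the cost comes from the last stage, where the channel count is largest.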

6. Conclusion

We have proposed the Squeeze-and-Excitation block, a lightweight architectural unit that enhances the representational capacity of CNNs by explicitly modeling channel interdependencies. Through the squeeze operation (global information embedding) and the excitation operation (adaptive recalibration), SE blocks learn to dynamically emphasize informative features and suppress less useful ones. With minimal computational overhead (~0.26% GFLOPs increase), SE blocks deliver consistent improvements across multiple architectures, datasets, and tasks, including the winning entry of ILSVRC 2017 with 2.251% top-5 error.

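Paragraph function: Summarizes the paper, closing the argument with three key points (0.26% overhead, 2.251% error, cross-architecture generality).

Logical role: The conclusion echoes the abstract and closes the loop with a three-way argument: lightweight, highly effective, and general.

Argumentative technique / potential weaknesses: The conclusion stays carefully focused on validated results and avoids over-extrapolation. It does not, however, discuss the theoretical limits of SE blocks: when a network is already very deep and has very many channels, do the marginal gains diminish?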

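Argument Structure Overview

Problem: convolution handles inter-channel dependencies only implicitly
→ Claim: explicit channel recalibration improves representational power
→ Evidence: ILSVRC 2017 first place; consistent improvement across architectures
→ Rebuttal: only a 0.26% increase in computation; ablations confirm that each component is needed
→ Conclusion: a lightweight, general-purpose improvement presented as applicable to any CNN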

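Authors' Core Claim (One Sentence)

Through two steps, squeeze (global pooling) and excitation (adaptive gating), the SE block explicitly models inter-channel dependencies at very low computational cost and brings consistent, significant performance gains to any CNN architecture.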

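Strongest Point of the Argument

An extremely high benefit-to-cost ratio: a 0.26% increase in computation buys close to one percentage point of top-5 error reduction (7.48% to 6.62% for ResNet-50), and the improvement is reproduced across architectures (VGGNet, ResNet, Inception, ResNeXt, MobileNet) and tasks (image classification, scene classification, object detection). The result that SE-ResNet-101 outperforms ResNet-152 directly supports the idea that a smarter design beats brute-force depth.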

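Weakest Point of the Argument

The cost of compressing information with global pooling: the squeeze operation reduces each channel's entire spatial extent to a single scalar, discarding all information about spatial distribution. For tasks that require fine-grained spatial reasoning (such as small-object detection), this compression may not be the best strategy. The authors do not explore alternative squeeze operations that retain some spatial information.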