CBAM: Convolutional Block Attention Module

Abstract

We propose the Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions, channel and spatial, and then multiplies them with the input feature map for adaptive feature refinement. Because CBAM is lightweight and general, it can be integrated into any CNN architecture with negligible overhead and trained end-to-end together with the base CNN. We validate CBAM through extensive experiments on the ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets.

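Paragraph function: Overview of the whole paper. It defines CBAM as a general-purpose attention module and states its core properties.

Logical role: The abstract positions CBAM with three keywords (simple, effective, general) and closes with validation on several benchmark datasets.

Argumentation technique / potential weakness: "Lightweight and general" is the classic selling point for an attention module, but the claim of "negligible overhead" needs concrete numbers to back it up.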

1. Introduction

Attention mechanisms have become an integral part of deep neural networks. In CNNs, attention can be broadly divided into channel attention (what to focus on) and spatial attention (where to focus). Previous work such as SENet explored channel attention alone and demonstrated significant improvements. However, combining channel and spatial attention in a single unified module remains underexplored.

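Paragraph function: Background. It moves from a taxonomy of attention mechanisms to the research gap.

Logical role: The "what vs. where" dichotomy sets up a clear conceptual framework, and SENet's channel-only limitation serves as the springboard.

Argumentation technique / potential weakness: "What vs. where" is an intuitive split, but it is debatable whether these two dimensions exhaust all forms of attention (the temporal dimension, for example).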

We propose CBAM, which sequentially applies a channel attention module and a spatial attention module to any given feature map. The channel attention module exploits the inter-channel relationships of features using both average-pooled and max-pooled descriptors. The spatial attention module exploits inter-spatial relationships by applying pooling operations along the channel axis. This sequential arrangement lets the network first identify informative channels and then locate informative spatial regions within them.

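Paragraph function: Method overview. It introduces CBAM's dual-attention design.

Logical role: A natural transition from the conceptual framework to the technical proposal: first decide "what" (channel), then decide "where" (spatial).

Argumentation technique / potential weakness: The intuitive explanation of the sequential arrangement is persuasive. But why channel first and then spatial, rather than the reverse? Ablation experiments are needed to validate this design choice.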

2. Related Work

SENet (Squeeze-and-Excitation Networks) introduced channel attention via global average pooling followed by fully connected layers. While effective, SENet considers only channel-wise attention and ignores spatial information. STN (Spatial Transformer Networks) introduced spatial attention via learned geometric transformations, but it is computationally expensive and difficult to train. Attention mechanisms from NLP, such as self-attention in Transformers, have demonstrated the power of attention but are not directly applicable to convolutional architectures in their original form.

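Paragraph function: Literature review. It compares the strengths and weaknesses of existing attention methods.

Logical role: It systematically presents the limitation of each existing method (SENet has no spatial attention, STN is too heavy, Transformers are not directly applicable), clearing the way for CBAM's positioning as "channel + spatial, lightweight, plug-and-play".

Argumentation technique / potential weakness: Mapping the weaknesses of the three families of methods exactly onto CBAM's three strengths is a carefully designed contrast strategy. The criticism of SENet, however, may be oversimplified.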

3. Method

Given an input feature map F in R^(C x H x W), the Channel Attention Module generates a channel attention map M_c in R^(C x 1 x 1). It applies both average pooling and max pooling across the spatial dimensions, passes each pooled descriptor through a shared multi-layer perceptron (MLP), sums the two outputs, and applies a sigmoid activation: M_c(F) = sigma(MLP(AvgPool(F)) + MLP(MaxPool(F))). The two pooling operations capture complementary information: average pooling captures general channel statistics, while max pooling captures discriminative features.

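Paragraph function: Detailed design of the channel attention module.

Logical role: Whereas SENet uses only average pooling, CBAM adds max pooling, making it an enhanced version of channel attention.

Argumentation technique / potential weakness: The complementarity of average and max pooling is intuitively reasonable, but whether a shared MLP can really learn different mappings for the two inputs is an assumption worth verifying.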
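
The channel attention formula above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the two-layer bias-free MLP with a ReLU in between, the reduction ratio `r`, the toy tensor sizes, and the random weights are all assumptions that the text does not specify.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(F, W0, W1):
    """Return M_c(F) with shape (C, 1, 1) for a feature map F of shape (C, H, W).

    W0 (C//r, C) and W1 (C, C//r) are the weights of the shared MLP.
    The hidden ReLU and the missing biases are assumptions of this sketch.
    """
    avg = F.mean(axis=(1, 2))  # AvgPool over H and W -> (C,)
    mx = F.max(axis=(1, 2))    # MaxPool over H and W -> (C,)

    def mlp(v):
        # The same weights are applied to both pooled descriptors.
        return W1 @ np.maximum(W0 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx)).reshape(-1, 1, 1)


rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2  # toy sizes; r is a hypothetical reduction ratio
F = rng.standard_normal((C, H, W))
W0 = rng.standard_normal((C // r, C)) * 0.1
W1 = rng.standard_normal((C, C // r)) * 0.1

M_c = channel_attention(F, W0, W1)
F_refined = M_c * F  # broadcasts one weight per channel over H and W
print(M_c.shape, F_refined.shape)  # (8, 1, 1) (8, 4, 4)
```

Because the MLP weights are shared, the only difference between the two branches is the pooled descriptor fed into them, which is exactly the point the commentary above questions.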

The Spatial Attention Module takes the channel-refined feature map F' = M_c(F) * F and generates a spatial attention map M_s in R^(1 x H x W). It applies average pooling and max pooling along the channel axis, concatenates the two results, and passes them through a 7x7 convolution layer followed by a sigmoid: M_s(F') = sigma(Conv7x7([AvgPool(F'); MaxPool(F')])). The final output is F'' = M_s(F') * F' = M_s(F') * (M_c(F) * F), where * denotes element-wise multiplication with broadcasting, so the two attention maps are applied sequentially.

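Paragraph function: Design of the spatial attention module and the overall pipeline.

Logical role: Completes the definition of CBAM: channel attention -> spatial attention -> feature refinement.

Argumentation technique / potential weakness: The choice of a 7x7 convolution implies that a larger receptive field benefits spatial attention, but the sensitivity to this hyperparameter needs ablation experiments to support it.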
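
A minimal NumPy sketch of the spatial attention step and the final sequential refinement follows. It rests on several assumptions that the text does not state: zero padding that preserves H x W, no bias term, random kernel weights, and a random stand-in for the channel attention map M_c so that the block is self-contained.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(Fp, kernel):
    """Return M_s(F') with shape (1, H, W) for F' of shape (C, H, W).

    kernel has shape (2, k, k): one k x k filter per pooled map (k = 7 in the text).
    Zero padding and the absence of a bias are assumptions of this sketch.
    """
    # Pool along the channel axis and stack the two maps -> (2, H, W).
    pooled = np.stack([Fp.mean(axis=0), Fp.max(axis=0)])
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(pooled, ((0, 0), (p, p), (p, p)))  # zero padding
    H, W = Fp.shape[1:]
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # Cross-correlation, which is what CNN "convolution" layers compute.
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)[None, :, :]


rng = np.random.default_rng(0)
C, H, W = 8, 6, 6  # toy sizes
F = rng.standard_normal((C, H, W))
M_c = sigmoid(rng.standard_normal((C, 1, 1)))  # stand-in channel attention map
kernel = rng.standard_normal((2, 7, 7)) * 0.1

F_prime = M_c * F                         # F'  = M_c(F) * F
M_s = spatial_attention(F_prime, kernel)  # (1, H, W)
F_out = M_s * F_prime                     # F'' = M_s(F') * F'
print(M_s.shape, F_out.shape)  # (1, 6, 6) (8, 6, 6)
```

The explicit double loop is chosen for readability; a real layer would use a framework convolution, and the kernel here carries only 2 x 7 x 7 = 98 weights.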

4. Experiments

On ImageNet-1K classification, CBAM consistently improves several architectures: ResNet-50 rises from 75.44% to 77.34% top-1 accuracy, and MobileNet rises from 68.36% to 70.99%. The overhead is minimal: only a 0.06% increase in parameters for ResNet-50. Compared with the SE module, CBAM achieves better performance at similar computational cost, which supports the benefit of adding spatial attention.

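Paragraph function: Core experiment. It reports the quantitative results on ImageNet classification.

Logical role: Uses the most standard benchmark to demonstrate both generality and lightness: several architectures benefit, and the parameter increase is tiny.

Argumentation technique / potential weakness: The 0.06% parameter increase is a highly persuasive figure. The direct comparison with SENet effectively supports the claim that spatial attention adds value.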
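
To make the quoted ImageNet figures easier to compare, the short sketch below derives the absolute gain and the share of baseline top-1 error removed. It uses only the accuracies stated in the text; the helper name `summarize` is hypothetical.

```python
# Top-1 accuracies (%) quoted in the text: (baseline, with CBAM).
results = {
    "ResNet-50": (75.44, 77.34),
    "MobileNet": (68.36, 70.99),
}


def summarize(base, cbam):
    """Return (absolute gain in points, fraction of baseline top-1 error removed)."""
    gain = cbam - base
    return gain, gain / (100.0 - base)


for name, (base, cbam) in results.items():
    gain, cut = summarize(base, cbam)
    print(f"{name}: +{gain:.2f} points, {cut:.1%} of baseline error removed")
# ResNet-50: +1.90 points, 7.7% of baseline error removed
# MobileNet: +2.63 points, 8.3% of baseline error removed
```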

On MS COCO object detection using Faster R-CNN with a ResNet-50 backbone, CBAM improves mAP from 31.1% to 32.4%. On VOC 2007 detection, mAP improves from 73.2% to 75.1%. Ablation studies confirm that the sequential channel-then-spatial arrangement outperforms the parallel and reverse orderings, and that using both average and max pooling is better than using either alone.

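Paragraph function: Extended experiments. It covers detection tasks and ablation analysis.

Logical role: Cross-task validation strengthens the "general module" claim. The ablations validate the design decisions of sequential ordering and dual pooling.

Argumentation technique / potential weakness: A complete ablation matrix is essential for this kind of module-design paper. However, whether CBAM's marginal gain stays stable across model scales (a larger ResNet-101, or smaller models) still needs further validation.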
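
The three arrangements compared in the ablation differ only in how the two attention maps are composed. The sketch below shows that difference structurally. It is heavily simplified and rests on assumptions: the gates are parameter-free stand-ins (hypothetical, with no learned MLP or convolution), and the "parallel" variant is assumed to compute both maps from the same input and multiply them, which the text does not specify.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


# Parameter-free stand-ins for the learned attention maps (hypothetical).
def channel_gate(F):
    """Per-channel weights of shape (C, 1, 1) from spatial avg + max pooling."""
    return sigmoid(F.mean(axis=(1, 2)) + F.max(axis=(1, 2))).reshape(-1, 1, 1)


def spatial_gate(F):
    """Per-location weights of shape (1, H, W) from channel-wise avg + max pooling."""
    return sigmoid(F.mean(axis=0) + F.max(axis=0))[None, :, :]


def channel_then_spatial(F):
    Fp = channel_gate(F) * F       # refine channels first
    return spatial_gate(Fp) * Fp   # spatial map sees the refined features


def spatial_then_channel(F):
    Fp = spatial_gate(F) * F       # refine locations first
    return channel_gate(Fp) * Fp   # channel map sees the refined features


def parallel(F):
    # Assumed formulation: both maps are computed from the same input F.
    return channel_gate(F) * spatial_gate(F) * F


rng = np.random.default_rng(0)
F = rng.standard_normal((8, 5, 5))  # toy feature map
outputs = {f.__name__: f(F) for f in (channel_then_spatial, spatial_then_channel, parallel)}
for name, out in outputs.items():
    print(name, out.shape)
```

All three variants keep the input shape, so any of them can be dropped into the same position in a network; which one performs best is an empirical question that only the ablation can answer.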

5. Conclusion

We have presented CBAM, a simple and effective attention module that combines channel and spatial attention for convolutional neural networks. Our extensive experiments demonstrate consistent improvements across multiple architectures and tasks with negligible computational overhead. CBAM is a plug-and-play module that enhances CNN feature representations without requiring architectural modifications, which makes it practical for a wide range of applications.

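Paragraph function: Summary of the paper. It restates the core positioning of "simple, effective, general".

Logical role: The conclusion closes the loop with the abstract, consistently emphasizing CBAM's three core properties.

Argumentation technique / potential weakness: The plug-and-play positioning makes CBAM very attractive in practice. With the rise of Transformer architectures, however, the long-term influence of attention modules for purely convolutional architectures may be limited.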

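Overview of the argument structure

Problem: CNNs lack an effective combination of channel and spatial attention
→ Thesis: sequential dual attention; lightweight and plug-and-play
→ Evidence: ImageNet +1.9%; COCO +1.3 mAP; 0.06% parameters
→ Rebuttal: SENet already provides channel attention; marginal contribution of spatial attention
→ Conclusion: general-purpose attention module; consistent improvement across architectures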

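Authors' core claim (in one sentence)

Through sequential channel and spatial attention, CBAM adaptively refines CNN feature maps at very small computational cost, making it a general plug-and-play module.

Strongest point of the argument

Extreme lightness and generality: a 0.06% increase in parameters yields a gain of nearly 2 percentage points in top-1 accuracy, and the module is consistently effective across architectures such as ResNet and MobileNet. The completeness of the ablation studies (ordering, pooling type, kernel size) supports the rationale behind each design choice.

Weakest point of the argument

Limited depth for an incremental improvement: CBAM is essentially an extension of SENet (adding the spatial dimension), so its conceptual novelty is limited. The extra improvement from spatial attention is smaller than that from channel attention, and no in-depth theoretical analysis explains why channel-then-spatial is the best ordering.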