Learning Deep Features for Discriminative Localization (CAM)

Abstract

In this work, we revisit the global average pooling layer proposed in Network in Network, and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the simplicity of our approach, we achieve 37.1% top-5 error for weakly-supervised object localization on ILSVRC 2014, which is remarkably close to the 34.2% top-5 error achieved by fully supervised methods. We demonstrate that our network is able to localize the discriminative regions used for classification across different tasks.

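Paragraph function: Overview of the whole paper. It reveals the hidden capability of global average pooling (localization) and uses the weakly supervised localization results to preview the method's effectiveness.

Logical role: The abstract opens with a "rediscovery" narrative. Global average pooling is not a new technique, but the paper reveals its overlooked localization ability, which gives the work an "old bottle, new wine" kind of novelty.

Argumentation technique / potential weakness: The 37.1% vs. 34.2% comparison is highly persuasive: image labels alone bring the method close to fully supervised performance. However, the term "weakly supervised" may mislead readers, because ImageNet image labels themselves carry a strong implicit position bias (objects usually sit near the image center).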

1. Introduction

A remarkable property of convolutional neural networks is that, despite being trained on image-level classification labels without any bounding box annotations, the convolutional layers still retain substantial spatial information about objects. This information is typically lost when fully-connected layers are used for classification. Recent works such as Network in Network (NIN) and GoogLeNet have replaced fully-connected layers with global average pooling (GAP), originally motivated as a structural regularizer to prevent overfitting. We show that with a little tweaking, the network can retain its remarkable localization ability until the final layer, enabling single forward-pass localization across diverse tasks without explicit localization training.

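Paragraph function: Establishes the core insight that CNNs have an innate localization ability, which the fully-connected layers destroy.

Logical role: Starting point of the argument. It establishes the observation that spatial information exists but is overlooked, laying the groundwork for the soundness of the CAM method.

Argumentation technique / potential weakness: The phrase "with a little tweaking" frames the method's simplicity as a strength rather than a limitation. Yet this tweak (removing the fully-connected layers and adding GAP) does affect classification accuracy, a trade-off the authors must address later in the paper.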

Weakly-supervised object localization has been approached through various methods, including multiple instance learning (MIL) and attention-based mechanisms. Oquab et al. proposed using global max pooling (GMP) for localization, but this only captures the single most discriminative point rather than the full extent of the object. In contrast, "average pooling encourages the network to identify the complete extent of the object" because it aggregates information from all spatial locations, incentivizing the network to activate broadly over the object region. For CNN visualization, prior methods such as backpropagation-based saliency maps and deconvolution provide insights into network behavior, but they are computationally expensive and do not directly relate to the classification decision.

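Paragraph function: Literature contrast. It distinguishes GAP from GMP and positions CAM's advantages over existing visualization methods.

Logical role: The GAP vs. GMP contrast is the key argument: average pooling pushes the network to attend to the whole object, which offers a theoretical explanation for CAM's localization quality.

Argumentation technique / potential weakness: The claim that GAP "encourages identifying the complete extent" is intuitive but not rigorously proven. In practice, GAP may also lead the network to attend to co-occurring background textures rather than the object itself.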
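
The GAP vs. GMP intuition can be illustrated with a toy example. The sketch below is not from the paper: the 4x4 feature map and the helper names (`gap`, `gmp`, `gap_grad`, `gmp_grad`) are hypothetical, and it only shows the mechanism behind the claim, not a proof of it. Weakening the object response everywhere except at its peak leaves the GMP output unchanged but lowers the GAP output, and the GMP gradient reaches a single location while the GAP gradient reaches all of them.

```python
import numpy as np

def gap(feature_map):
    """Global average pooling: mean over all spatial locations."""
    return feature_map.mean()

def gmp(feature_map):
    """Global max pooling: value at the single strongest location."""
    return feature_map.max()

def gap_grad(feature_map):
    """d(GAP)/d(feature_map): every location gets the same share 1/(H*W)."""
    return np.full(feature_map.shape, 1.0 / feature_map.size)

def gmp_grad(feature_map):
    """d(GMP)/d(feature_map): 1 at the argmax location, 0 elsewhere."""
    grad = np.zeros_like(feature_map, dtype=float)
    idx = np.unravel_index(np.argmax(feature_map), feature_map.shape)
    grad[idx] = 1.0
    return grad

# Toy 4x4 feature map (an assumption for illustration):
# a 2x2 "object" with one especially strong response.
fmap = np.zeros((4, 4))
fmap[1:3, 1:3] = 1.0
fmap[1, 1] = 3.0

# Weaken the object everywhere except at its peak.
weaker = fmap.copy()
weaker[1:3, 1:3] *= 0.5
weaker[1, 1] = 3.0

print(gmp(fmap), gmp(weaker))  # GMP output is unchanged
print(gap(fmap), gap(weaker))  # GAP output drops
print(np.count_nonzero(gmp_grad(fmap)), np.count_nonzero(gap_grad(fmap)))
```

Under GMP, only the peak location receives a training signal, so low responses elsewhere on the object cost nothing. Under GAP, they lower the pooled value, which is the incentive the paragraph above describes.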

3. Class Activation Mapping

3.1 Global Average Pooling Revisited

The Class Activation Map (CAM) is derived as follows. For a given image, let f_k(x, y) represent the activation of unit k in the last convolutional layer at spatial location (x, y). For a given class c, the class score before softmax is S_c = sum_k w^c_k * F_k, where F_k is the global average pool of f_k and w^c_k is the weight connecting the k-th feature map to class c. The class activation map is then simply: M_c(x, y) = sum_k w^c_k * f_k(x, y). This map "directly indicates the importance of the activation at spatial grid (x, y) leading to the classification of an image to class c".
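
The link between the class score and the map can be written out as a short derivation. This is a sketch under two assumptions that the text above does not state: H and W denote the height and width of the last convolutional feature map (symbols introduced here), and the classifier has no bias term.

```latex
\begin{align}
F_k &= \frac{1}{HW} \sum_{x,y} f_k(x, y), \\
M_c(x, y) &= \sum_k w^c_k \, f_k(x, y), \\
S_c &= \sum_k w^c_k \, F_k
     = \frac{1}{HW} \sum_{x,y} \sum_k w^c_k \, f_k(x, y)
     = \frac{1}{HW} \sum_{x,y} M_c(x, y).
\end{align}
```

Under these assumptions the class score is the spatial average of the class activation map, so each M_c(x, y) is the contribution of location (x, y) to S_c.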

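Paragraph function: Core algorithm. It defines how CAM is computed with a mathematical formula.

Logical role: The technical core of the paper: a concise weighted sum constitutes the entire method. The formula's simplicity is a key factor in CAM's wide adoption.

Argumentation technique / potential weakness: The formula's elegance (a single weighted sum) makes the method very easy to understand and implement. However, CAM requires the GAP layer to feed directly into the classification layer, which limits its applicability to arbitrary architectures. This limitation was later addressed by Grad-CAM.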
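
The weighted sum in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the array shapes, the random stand-in activations and weights, and the function names `class_activation_map` and `class_scores` are all assumptions, and the classifier is taken to have no bias term.

```python
import numpy as np

def class_activation_map(features, weights, class_idx):
    """Compute M_c(x, y) = sum_k w^c_k * f_k(x, y).

    features: (K, H, W) activations f_k of the last convolutional layer
    weights:  (C, K) classifier weights w^c_k
    """
    # Contract over the channel axis k: (K,) with (K, H, W) -> (H, W).
    return np.tensordot(weights[class_idx], features, axes=([0], [0]))

def class_scores(features, weights):
    """S_c = sum_k w^c_k * F_k, with F_k the spatial mean of f_k (no bias)."""
    pooled = features.mean(axis=(1, 2))  # F_k, shape (K,)
    return weights @ pooled              # S_c, shape (C,)

# Random stand-ins for real activations and trained weights (assumption).
rng = np.random.default_rng(0)
K, H, W, C = 8, 7, 7, 3
features = rng.random((K, H, W))
weights = rng.standard_normal((C, K))

scores = class_scores(features, weights)
top_class = int(np.argmax(scores))
cam = class_activation_map(features, weights, top_class)

# The spatial mean of the map recovers the class score.
print(cam.shape, np.isclose(cam.mean(), scores[top_class]))
```

Passing a different `class_idx` reuses the same `features` and only swaps the weight row, which is why one forward pass can yield a map for every class.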

3.2 Localization Capability

The class activation map can be upsampled to the input image resolution to produce a heatmap highlighting the regions most relevant to the predicted class. By thresholding the heatmap and computing the bounding box of the largest connected component, we obtain an object localization without any bounding box supervision. This localization emerges as a natural byproduct of classification training with GAP. Crucially, the same network can generate different activation maps for different classes, revealing what regions the network focuses on for each category. This provides an interpretable visualization of the CNN decision-making process.

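Paragraph function: Extension to applications at three levels: from classification to localization to interpretability.

Logical role: Broadens CAM's value proposition: it is not only a localization tool but also a foundational method for CNN interpretability, which extends the paper's influence from weakly supervised localization to the whole subfield of model explanation.

Argumentation technique / potential weakness: The demonstration that different classes yield different activation maps is highly persuasive. However, the heatmap's resolution is bounded by the spatial size of the last convolutional layer, so localization accuracy for small objects may be insufficient.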
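
The upsample, threshold, and bounding-box steps described in this section can be sketched as follows. Several choices here are assumptions made for illustration and are not stated in the text above: nearest-neighbour upsampling stands in for whatever interpolation is actually used, the threshold is a fraction of the map's maximum (`rel_threshold=0.2`), connectivity is 4-neighbour, and the helper names are hypothetical.

```python
import numpy as np
from collections import deque

def upsample_nearest(cam, factor):
    """Nearest-neighbour upsampling by an integer factor (a simple stand-in)."""
    return np.kron(cam, np.ones((factor, factor)))

def largest_component_bbox(heatmap, rel_threshold=0.2):
    """Bounding box (x_min, y_min, x_max, y_max) of the largest 4-connected
    region whose values exceed rel_threshold * heatmap.max()."""
    mask = heatmap > rel_threshold * heatmap.max()
    visited = np.zeros_like(mask, dtype=bool)
    best = []
    h, w = mask.shape
    for sy, sx in zip(*np.nonzero(mask)):
        if visited[sy, sx]:
            continue
        # Breadth-first flood fill of one connected component.
        queue, component = deque([(sy, sx)]), []
        visited[sy, sx] = True
        while queue:
            y, x = queue.popleft()
            component.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                    visited[ny, nx] = True
                    queue.append((ny, nx))
        if len(component) > len(best):
            best = component
    if not best:
        return None
    ys, xs = zip(*best)
    return int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))

# Toy 4x4 map (assumption): a 2x2 object plus a small isolated distractor.
cam = np.zeros((4, 4))
cam[1:3, 1:3] = 1.0
cam[0, 3] = 0.9
heatmap = upsample_nearest(cam, 8)  # 4x4 -> 32x32
bbox = largest_component_bbox(heatmap)
print(bbox)  # (8, 8, 23, 23)
```

Keeping only the largest connected component is what discards the distractor in the corner. The box is still quantized to the 8-pixel cells of the original map, which illustrates the resolution limit noted above.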

4. Experiments

On the ILSVRC 2014 validation set, GoogLeNet-GAP achieves 43% top-5 localization error for weakly-supervised localization, outperforming global max pooling variants and backpropagation-based approaches (the 37.1% quoted in the abstract is the test-set result obtained with an additional bounding-box selection heuristic). For fine-grained recognition on CUB-200 birds, using CAM to automatically crop bounding boxes achieves 67.8% accuracy, approaching fully-supervised crop methods. The method also demonstrates strong pattern discovery capabilities: when applied to scene recognition, CAM highlights relevant objects (e.g., beds in bedrooms, bookshelves in libraries) that contribute to scene classification. Furthermore, the technique enables informative error diagnosis: when the classifier makes mistakes, CAM reveals what the network was "looking at," providing actionable insights for model improvement.

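Paragraph function: Multi-faceted validation, spanning localization accuracy, fine-grained recognition, pattern discovery, and error diagnosis.

Logical role: The empirical pillar unfolds step by step: quantitative localization results -> fine-grained application -> qualitative pattern discovery -> error diagnosis, gradually widening CAM's range of application.

Argumentation technique / potential weakness: "Pattern discovery" in scene recognition opens a new direction for interpretability with CAM. However, some of these "discoveries" may reflect dataset bias rather than genuine semantic understanding; for example, the network may have learned only textures associated with beds rather than the concept of a bed.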
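
The section reports top-5 localization error without defining it. The sketch below assumes the usual ILSVRC-style criterion: a prediction counts as a hit when its label matches the ground truth and its box overlaps the ground-truth box with intersection over union of at least 0.5, and an image is correct if any of the top five predictions is a hit. The function names, box format, and example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes,
    treating coordinates as continuous corner positions."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def top5_localization_hit(predictions, true_label, true_box, min_iou=0.5):
    """predictions: list of (label, box) pairs, best first; only five are used."""
    return any(label == true_label and iou(box, true_box) >= min_iou
               for label, box in predictions[:5])

# Hypothetical predictions for one image.
preds = [("cat", (0, 0, 10, 10)), ("dog", (2, 2, 12, 12))]
print(top5_localization_hit(preds, "cat", (0, 0, 10, 10)))  # right label, full overlap
print(top5_localization_hit(preds, "dog", (0, 0, 10, 10)))  # right label, overlap too small
```

The localization error over a dataset is then the fraction of images for which this check fails, so a misplaced box is penalized even when the class label is right.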

5. Conclusion

We have shown that Class Activation Mapping provides a simple yet powerful way to leverage global average pooling for discriminative localization. Despite the simplicity of this technique, it is remarkably effective and can be applied to a variety of computer vision tasks for fast and accurate localization. The ability to generate class-specific heatmaps opens avenues for understanding and interpreting CNN decisions, bridging the gap between high classification performance and model transparency.

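Paragraph function: Summarizes the paper, closing on the theme that "simple is powerful" and looking ahead to interpretability.

Logical role: The conclusion elevates CAM from a localization method to the level of interpretable AI, greatly broadening the paper's academic impact.

Argumentation technique / potential weakness: "Simple" is repeatedly stressed as a strength, but the conclusion does not acknowledge CAM's architectural restriction (GAP is mandatory) or its resolution limit. The later Grad-CAM was proposed largely to lift the architectural restriction.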

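Argument Structure Overview

Problem: CNNs lose spatial localization information at classification time
-> Claim: GAP preserves localization ability; a weighted sum yields the CAM
-> Evidence: weakly supervised localization approaches fully supervised performance
-> Rebuttal: GAP outperforms GMP by capturing the object's full extent
-> Conclusion: a simple method opens up CNN interpretability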

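Authors' Core Claim (One Sentence)

Through global average pooling and a simple weighted sum with the class weights, a convolutional neural network trained only on classification labels naturally acquires object localization close to fully supervised performance, along with interpretable visualizations.

Strongest Point of the Argument

Extreme simplicity of the method: the core of CAM is a single weighted-sum formula, M_c(x,y) = sum_k w^c_k * f_k(x,y). It needs no localization-specific training and no change to the loss function; a small adjustment to the network structure is enough to unlock localization. The 37.1% vs. 34.2% comparison between weak and full supervision makes the case even more striking.

Weakest Point of the Argument

Architectural restriction and resolution bottleneck: CAM requires the network to use a "GAP + single linear classifier" structure, so it cannot be applied directly to an arbitrary CNN. In addition, the activation map's resolution is bounded by the spatial size of the final convolutional layer (typically 14x14), so localization accuracy is insufficient for small objects or multi-object scenes. These limitations do not diminish the paper's pioneering status, but they leave clear room for later improvements.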