Going Deeper with Convolutions (GoogLeNet/Inception)

Abstract

We propose a deep convolutional neural network architecture codenamed Inception, which set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows the depth and width of the network to increase while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22-layer deep network, the quality of which is assessed in the context of classification and detection.

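Paragraph function: Overview of the whole paper. Summarizes the architecture's name, core idea, design principles, and competition results in compact language.

Logical role: The abstract plays a dual role, announcing results and previewing the method. It first establishes credibility with the winning ILSVRC14 results, then reveals an efficiency-first design philosophy that prepares the ground for the Inception module.

Argumentative technique / potential weakness: The phrase "keeping the computational budget constant" is highly appealing, but a 22-layer network still takes longer at inference than a shallow model. The claim is about efficiency relative to a dense network of comparable depth, not about low computation in absolute terms.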

1. Introduction

In the last three years, mainly due to the advances of deep learning, more concretely convolutional networks, the quality of image recognition and object detection has been progressing at a dramatic pace. Encouragingly, most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. The GoogLeNet network that won ILSVRC14 uses 12x fewer parameters than AlexNet while being significantly more accurate. The biggest gains in object detection have not come from the utilization of deep networks alone but from the synergy of deep architectures and classical computer vision.

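Paragraph function: Establishes the research territory. Acknowledges the progress of deep learning while redefining where that progress really comes from.

Logical role: The starting point of the argument chain. It first describes the field's rapid development, then uses the "not brute-force scaling" framing to bring out the importance of architectural innovation, which legitimizes Inception's design philosophy.

Argumentative technique / potential weakness: "12x fewer parameters yet more accurate" is a persuasive numerical contrast, but it is not entirely fair: GoogLeNet also relied on several training and testing techniques (such as data augmentation and multi-scale testing), not on the architecture alone. The passage credits the gains mainly to architectural design.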

In this paper, we focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the "network in network" paper by Lin et al. as well as from the famous "we need to go deeper" internet meme. In our context, the word "deep" is used in two senses: first, in the sense that we introduce a new level of organization in the form of the "Inception module", and second, in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of Network in Network while drawing inspiration from Arora et al. on the theoretical justification.

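Paragraph function: Naming and scholarly positioning. Explains where the name Inception comes from and what theory it builds on.

Logical role: Establishes theoretical legitimacy. Inception is presented not merely as a product of engineering intuition but as grounded in Network in Network and in the sparse-approximation theory of Arora et al., which strengthens its academic credibility.

Argumentative technique / potential weakness: Citing popular culture (the meme from the film Inception) alongside theoretical literature balances approachability with rigor. However, the link between the sparse-representation theory of Arora et al. and the actual Inception module is fairly loose, leaving a gap between the theoretical argument and the implementation.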

Starting with LeNet-5, convolutional neural networks have typically had a standard structure — stacked convolutional layers followed by one or more fully-connected layers. Variations of this basic design are prevalent in the image classification literature. For larger datasets such as ImageNet, the recent trend has been to increase the number of layers and layer size, while using dropout to address the problem of overfitting. Network in Network is an approach by Lin et al. that increases the representational power of neural networks by applying micro neural networks (1x1 convolutions) within each stage. This approach is adopted in our architecture for dimension reduction to remove computational bottlenecks.

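Paragraph function: Literature review. Traces how CNN architectures evolved and introduces the key technique of 1x1 convolutions.

Logical role: Sets a precedent for the use of 1x1 convolutions in the Inception module: the design does not appear out of nowhere but directly extends Network in Network.

Argumentative technique / potential weakness: Brute-force network enlargement is framed as a shortcoming of the existing trend, setting up Inception's "smart enlargement". The authors, however, leave out the fact that contemporaneous work such as VGGNet also achieved excellent results through simple stacking.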

3. Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is by increasing their size, including both the depth — the number of levels — of the network and its width — the number of units at each level. This is an easy and safe way of training higher quality models, especially given the availability of large labeled training sets. However, this approach has two major drawbacks. Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples is limited. It also results in dramatically increased use of computational resources. The fundamental way of solving both issues would be by moving from fully connected to sparsely connected architectures, even inside the convolutions.
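The cost of widening a network can be made concrete with a little arithmetic. The sketch below is an illustration only: the helper names and the toy setup (two chained 3x3 convolutions whose input and output widths are equal) are assumptions, not taken from the paper. It shows that doubling the width of both layers quadruples the weight count, which is the growth in parameters and computation that the paragraph above warns about.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Number of weights in one dense kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def two_layer_params(width: int, kernel: int = 3) -> int:
    """Weights in two chained conv layers, each mapping `width` channels to `width`.

    ASSUMPTION: a toy setup chosen for illustration, not a layer from GoogLeNet.
    """
    return 2 * conv_params(kernel, width, width)


if __name__ == "__main__":
    base = two_layer_params(64)
    doubled = two_layer_params(128)
    # Doubling the width of both layers multiplies the weight count by four.
    print(base, doubled, doubled / base)
```

Because a dense convolution connects every input channel to every output channel, its cost scales with the product of the two widths. Sparse connectivity is one way to break that product, which is the direction the paragraph argues for.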

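Paragraph function: Problem diagnosis. Identifies two inherent drawbacks of the brute-force scaling strategy.

Logical role: The turning point of the argument: from "scaling up works but has costs" it derives "sparsification is the right direction", supplying the logical basis for the Inception module's design motivation.

Argumentative technique / potential weakness: The twin pressures of overfitting and computational cost make a tightly reasoned case for sparsity. However, the later ResNet showed that, with residual connections and suitable regularization, simply adding depth can keep improving performance, so the rejection of brute-force scaling here may be too absolute.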

Arora et al. provided theoretical justification suggesting that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs. Although the mathematical proof requires very strong conditions, the fact that the Inception architecture was a close approximation to a sparse structure appears to validate the underlying assumption. The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components.
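The clustering step described above can be illustrated with a toy example. The sketch below is not the algorithm of Arora et al.; it is a minimal single-linkage grouping under assumed inputs. The function names, the 0.9 threshold, and the activation traces are all hypothetical. Units whose activation traces have a Pearson correlation above the threshold end up in the same group, which is the kind of cluster the paragraph suggests would feed one unit in the next layer.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length traces (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_units(activations, threshold=0.9):
    """Group unit indices whose traces correlate above `threshold` (single linkage)."""
    parent = list(range(len(activations)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(activations)):
        for j in range(i + 1, len(activations)):
            if pearson(activations[i], activations[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(activations)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical activation traces: one row per unit, one column per input example.
ACTS = [
    [0.1, 0.9, 0.2, 0.8],  # unit 0
    [0.2, 1.0, 0.3, 0.9],  # unit 1: unit 0 plus a constant offset
    [0.9, 0.1, 0.8, 0.3],  # unit 2: fires when units 0 and 1 do not
    [0.8, 0.2, 0.9, 0.2],  # unit 3: closely tracks unit 2
]

if __name__ == "__main__":
    print(cluster_units(ACTS))
```

On these traces, units 0 and 1 fall into one group and units 2 and 3 into another. A real network would need far more samples and a principled threshold; the sketch only shows the mechanics of turning correlation statistics into connectivity groups.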

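Paragraph function: Theoretical foundation. Cites the sparse-approximation theory of Arora et al. as the scholarly basis for the Inception design.

Logical role: The key conceptual bridge: it turns the abstract idea of sparsity into the practical strategy of approximating a sparse structure with dense components, which is exactly the design logic behind running multi-scale convolution kernels in parallel.

Argumentative technique / potential weakness: The authors concede that the theory requires "very strong conditions", yet immediately claim that Inception's success appears to "validate the underlying assumption", which risks circular reasoning. In practice, Inception's success may stem from the engineering benefit of multi-scale feature extraction rather than from the correctness of the sparse-approximation theory.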

4. Architectural Details

4.1 Inception Module

The Inception module performs parallel convolutions with multiple filter sizes (1x1, 3x3, 5x5) along with a parallel max pooling path. The outputs are concatenated along the channel dimension. A naive version of this module would suffer from computational explosion due to the large number of filters. To address this, 1x1 convolutions are used as dimensionality reduction modules before the expensive 3x3 and 5x5 convolutions. This reduces the computational cost by an order of magnitude while preserving representational power. The resulting architecture can increase both the depth and width of the network without uncontrolled increase in computational complexity.
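The effect of the 1x1 reduction layers can be checked with a multiply-accumulate (MAC) count. The sketch below is an illustration under stated assumptions: the filter counts (64 for 1x1, 96 reducing into 128 for 3x3, 16 reducing into 32 for 5x5, 32 for the pooling projection, on a 28x28x192 input) are the values usually quoted for the first Inception module in the paper's architecture table and do not appear in the text above, so treat them as illustrative. The helper names are hypothetical.

```python
def conv_macs(kernel, c_in, c_out, height, width):
    """Multiply-accumulates for a stride-1, same-padded kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out * height * width


# ASSUMPTION: input size and filter counts are illustrative, see the lead-in.
H = W = 28
C_IN = 192
N_1X1, R_3X3, N_3X3, R_5X5, N_5X5, POOL_PROJ = 64, 96, 128, 16, 32, 32

# 5x5 branch applied directly to the input versus after a 1x1 reduction.
NAIVE_5X5 = conv_macs(5, C_IN, N_5X5, H, W)
REDUCED_5X5 = conv_macs(1, C_IN, R_5X5, H, W) + conv_macs(5, R_5X5, N_5X5, H, W)

# Whole module: the naive version has no reduction layers and no pooling projection.
NAIVE_MODULE = (
    conv_macs(1, C_IN, N_1X1, H, W)
    + conv_macs(3, C_IN, N_3X3, H, W)
    + NAIVE_5X5
)
REDUCED_MODULE = (
    conv_macs(1, C_IN, N_1X1, H, W)
    + conv_macs(1, C_IN, R_3X3, H, W) + conv_macs(3, R_3X3, N_3X3, H, W)
    + REDUCED_5X5
    + conv_macs(1, C_IN, POOL_PROJ, H, W)
)

# Channels after concatenation. The naive module passes all pooled input channels
# through, so its output keeps growing from module to module.
OUT_CHANNELS_REDUCED = N_1X1 + N_3X3 + N_5X5 + POOL_PROJ
OUT_CHANNELS_NAIVE = N_1X1 + N_3X3 + N_5X5 + C_IN

if __name__ == "__main__":
    print("5x5 branch saving:", NAIVE_5X5 / REDUCED_5X5)
    print("whole-module saving:", NAIVE_MODULE / REDUCED_MODULE)
    print("output channels:", OUT_CHANNELS_NAIVE, "vs", OUT_CHANNELS_REDUCED)
```

With these particular counts the 5x5 branch becomes roughly ten times cheaper, while the saving over the whole module is closer to a factor of two or three. The "order of magnitude" figure is therefore best read as applying to the most expensive branches. The naive module also emits more channels, so its cost compounds in every subsequent module.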

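Paragraph function: Core innovation. Details the structure of the Inception module and the key role of 1x1 convolutions.

Logical role: The central pillar of the whole argument: parallel multi-scale convolutions answer the theoretical motivation of sparse approximation, and 1x1 dimension reduction answers the practical need for computational efficiency. Their combination is Inception's core contribution.

Argumentative technique / potential weakness: Showing the problem with the naive version first and then offering 1x1 convolutions as the fix is a classic problem-solution narrative. The "order of magnitude" efficiency figure is compelling, but the number of filters in each branch (a hyperparameter) still has to be tuned empirically, and the authors offer no systematic selection strategy.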

The overall GoogLeNet architecture is 22 layers deep (27 layers if counting pooling) and uses approximately 5 million parameters, which is 12x fewer than AlexNet (60M) and significantly fewer than VGGNet (138M). The network employs 9 Inception modules stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution. Notably, the network applies average pooling before the final classifier instead of a stack of fully-connected layers (a single linear layer remains), which further reduces parameters. Additionally, two auxiliary classifiers are attached to intermediate layers to combat the vanishing gradient problem and provide regularization.
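The numbers in this paragraph lend themselves to a quick check. The sketch below recomputes the parameter ratios from the figures quoted above, then compares the classifier weights needed with and without average pooling, and finally shows how auxiliary losses might be folded into the training objective. The 7x7x1024 final feature map, the 1000 classes, and the 0.3 auxiliary weight are assumptions recalled from the paper rather than stated in the text above, and all function names are hypothetical.

```python
# Approximate parameter counts quoted in the paragraph above, in millions.
PARAMS_M = {"GoogLeNet": 5, "AlexNet": 60, "VGGNet": 138}


def ratio_to_googlenet(name):
    """How many times more parameters `name` has than GoogLeNet."""
    return PARAMS_M[name] / PARAMS_M["GoogLeNet"]


def classifier_weights(height, width, channels, classes, global_avg_pool):
    """Weights of the final linear layer, with or without average pooling first."""
    features = channels if global_avg_pool else height * width * channels
    return features * classes


# ASSUMPTION: a 7x7x1024 final feature map and 1000 output classes.
FLATTENED = classifier_weights(7, 7, 1024, 1000, global_avg_pool=False)
POOLED = classifier_weights(7, 7, 1024, 1000, global_avg_pool=True)


def training_loss(main_loss, aux_losses, aux_weight=0.3):
    """Main loss plus down-weighted auxiliary classifier losses.

    ASSUMPTION: the 0.3 weight is illustrative; the auxiliary heads are used
    during training only and dropped at inference.
    """
    return main_loss + aux_weight * sum(aux_losses)


if __name__ == "__main__":
    print(ratio_to_googlenet("AlexNet"), ratio_to_googlenet("VGGNet"))
    print(FLATTENED, POOLED, FLATTENED / POOLED)
    print(training_loss(1.0, [0.5, 0.5]))
```

Under these assumptions, average pooling shrinks the classifier by the spatial size of the feature map (a factor of 49). That is a large share of the overall parameter saving, independent of the Inception modules themselves.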

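Paragraph function: Quantitative indicators. Uses concrete numbers to show the architecture's parameter efficiency and design ingenuity.

Logical role: Makes the efficiency claim concrete: the 5M versus 60M/138M parameter comparison directly supports the core argument of greater efficiency. The auxiliary classifiers pre-empt the likely objection that 22 layers is too deep to train.

Argumentative technique / potential weakness: The parameter comparison with AlexNet and VGGNet is a powerful rhetorical device. But the auxiliary classifiers suggest that the authors were themselves aware of how hard very deep networks are to train, the very problem that ResNet later addressed far more thoroughly with residual connections. The auxiliary classifiers are an engineering compromise.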

5. Experiments

On the ILSVRC 2014 classification challenge, GoogLeNet achieved a top-5 error rate of 6.67%, ranking first among all participants. This was significantly better than the previous year's best result and close to human-level performance. An ensemble of 7 GoogLeNet models was used for the competition submission. On the detection task, the GoogLeNet approach achieved 43.9% mAP, which was also the winning entry, surpassing the previous best by a large margin. Importantly, the strong detection results were achieved by combining the Inception architecture with the R-CNN approach by Girshick et al., demonstrating the generalizability of the Inception features beyond pure classification.
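For readers unfamiliar with the metric, the sketch below shows how top-5 error and a simple ensemble could be computed. It is a minimal illustration, not the competition pipeline: the function names and the toy scores are hypothetical, and averaging per-class scores across models is assumed as the combination rule.

```python
def top5_error(scores_per_image, labels):
    """Fraction of images whose true label is not among the 5 highest-scoring classes."""
    wrong = 0
    for scores, label in zip(scores_per_image, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:5]:
            wrong += 1
    return wrong / len(labels)


def ensemble_average(per_model_scores):
    """Average the class scores that several models assign to one image.

    ASSUMPTION: plain unweighted averaging; other combination rules are possible.
    """
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


if __name__ == "__main__":
    # Toy example with 6 classes; class 4 has the lowest score.
    scores = [[0.10, 0.20, 0.30, 0.15, 0.05, 0.20]]
    print(top5_error(scores, [4]))  # label outside the top 5
    print(top5_error(scores, [0]))  # label inside the top 5
    print(ensemble_average([[1.0, 3.0], [3.0, 1.0]]))
```

An ensemble result such as the 6.67% figure is computed on the averaged scores, so it is typically lower than the error of any single member model.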

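Paragraph function: Empirical support. Uses the winning competition results to validate the architecture's effectiveness.

Logical role: The empirical pillar, covering two core dimensions: (1) top-5 error on the classification task; (2) mAP on the detection task. The emphasis on combining with R-CNN positions Inception as a general-purpose feature extractor rather than a task-specific solution.

Argumentative technique / potential weakness: Ensembling 7 models is a common competition strategy, but it makes the performance of a single model look better than it is. The reported 6.67% is the ensemble result, not single-model performance. In addition, the "close to human-level" claim depends on how the human baseline is defined.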

6. Conclusion

Our results provide solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and less wide networks. Our Inception architecture demonstrates that moving to sparser architectures is feasible and useful. This suggests that there is hope for automatic creation of network topologies based on correlation statistics in the future, complementing or even replacing hand-crafted solutions.

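Paragraph function: Concluding summary. Restates the core contribution and looks ahead to automated network design.

Logical role: The conclusion closes the argumentative loop: from sparse approximation in theory, to competition wins in practice, to the prospect of automated topology design. It echoes the introduction's claim that architectural innovation matters more than brute-force scaling.

Argumentative technique / potential weakness: The prospect of "automatic creation of network topologies" was forward-looking; it is exactly the core concern of the later field of neural architecture search (NAS). Against the technical background of 2015, though, the prospect lacked a concrete path to realization. The conclusion also says too little about the limitations of the Inception architecture, such as manual hyperparameter tuning and the need for auxiliary classifiers.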

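Overview of the argument structure

Problem: enlarging the network leads to overfitting and exploding computation
→
Thesis: approximate the optimal sparse structure with dense components
→
Evidence: first place in both ILSVRC14 classification and detection
→
Rebuttal: 1x1 convolutions for dimension reduction keep the computational cost under control
→
Conclusion: sparse architectures are viable, and their design may be automated in the future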

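Author's core claim (one sentence)

Using Inception modules to approximate the optimal sparse structure with parallel multi-scale convolutions can substantially improve the recognition quality of deep neural networks at modest computational cost.

Strongest point of the argument

A double win on efficiency and performance: with only about 5 million parameters (1/12 of AlexNet), GoogLeNet won both the classification and detection tracks of ILSVRC14, which is strong evidence that architectural innovation is more effective than brute-force scaling. The combination of parallel multi-scale convolutions and 1x1 dimension reduction finds an elegant meeting point between theoretical motivation and engineering practice.

Weakest point of the argument

The loose connection between the theoretical argument and the implementation: the sparse-approximation theory of Arora et al. requires very strong preconditions, and there is a significant interpretive gap between it and the actual design of the Inception module. In addition, the choice of filter counts in each Inception module still depends heavily on empirical tuning, so the promise of "systematic design principles" is not fully realized.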