Fully Convolutional Networks for Semantic Segmentation

Abstract

Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks, trained end-to-end, pixels-to-pixels, on semantic segmentation exceed the state-of-the-art without further machinery. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, adapt contemporary classification networks (AlexNet, VGGNet, GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations.

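Paragraph function: Overview of the whole paper. Summarizes the core contributions in three progressive steps: fully convolutional conversion, transfer learning, and skip connections.

Logical role: The abstract follows a clear insight-method-result structure: it first states the key insight of full convolutionality, then explains how classification networks are adapted, and finally uses the skip architecture to improve accuracy.

Argumentative technique / potential weaknesses: "Without further machinery" suggests that the method is simple, yet the skip architecture is itself an additional piece of architectural design. The emphasis on "end-to-end" was highly persuasive in 2015, when many methods still relied on multi-stage pipelines.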

1. Introduction

Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification, but also making progress on local tasks with structured output. This includes advances in bounding box object detection, part and keypoint prediction, and local correspondence. The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation, but have done so in ways that involve patchwise training, post-processing with superpixels or CRFs, or both.

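Paragraph function: Establishes the research field by tracing the evolution of convnets from classification to dense prediction.

Logical role: The coarse-to-fine narrative positions pixelwise prediction as a natural progression while criticizing the multi-stage complexity of prior methods, paving the way for the end-to-end fully convolutional approach.

Argumentative technique / potential weaknesses: Portraying semantic segmentation as the "natural next step" after classification makes FCN's contribution seem inevitable. However, patchwise training and post-processing methods such as CRFs may yield more precise boundaries in some settings, so the criticism here is somewhat one-sided.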

We show that a fully convolutional network (FCN), trained end-to-end, pixels-to-pixels on semantic segmentation, surpasses the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end for pixelwise prediction and from supervised pre-training. Fully convolutional versions of existing classification networks predict dense outputs from inputs of any size. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction in nets with subsampled pooling.

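Paragraph function: Core claim. Declares FCN's three "first" contributions.

Logical role: The strong "first" claim establishes the paper's historical position and previews the three core elements of the method: fully convolutional conversion, transfer learning, and in-network upsampling.

Argumentative technique / potential weaknesses: A claim of being "the first" carries considerable weight in an academic paper. The efficiency advantage of whole-image processing over patchwise methods is evident, but it may sacrifice fine-grained control over local context.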
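
To make "pixels-to-pixels" training concrete, the sketch below computes a loss over a whole image at once. It assumes a per-pixel softmax cross-entropy averaged over all pixels. The passage above does not spell out the loss, so the function name, the toy scores, and the averaging are illustrative assumptions rather than the authors' exact formulation.

```python
import math


def pixelwise_cross_entropy(scores, labels):
    """Mean softmax cross-entropy over every pixel of one image.

    scores[i][j] is a list of per-class scores for pixel (i, j);
    labels[i][j] is the index of the true class for that pixel.
    """
    total, count = 0.0, 0
    for score_row, label_row in zip(scores, labels):
        for pixel_scores, label in zip(score_row, label_row):
            # Log-sum-exp with the peak subtracted for numerical stability.
            peak = max(pixel_scores)
            log_norm = peak + math.log(sum(math.exp(s - peak) for s in pixel_scores))
            total += log_norm - pixel_scores[label]
            count += 1
    return total / count


# Hypothetical 2 x 2 image with 3 classes.
labels = [[0, 1], [2, 1]]

# Uniform scores: every pixel contributes log(3).
uniform_scores = [[[0.0, 0.0, 0.0] for _ in row] for row in labels]
uniform_loss = pixelwise_cross_entropy(uniform_scores, labels)
expected_uniform = math.log(3)

# Confident, correct scores: the loss approaches zero.
confident_scores = [
    [[20.0 if c == label else 0.0 for c in range(3)] for label in row]
    for row in labels
]
confident_loss = pixelwise_cross_entropy(confident_scores, labels)
```

Because the loss is an average over all pixels of the image, a single forward and backward pass covers every pixel, which is the whole-image-at-a-time behavior the paragraph describes.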

2. Related Work

Semantic segmentation has a long history in computer vision. Current approaches often combine bottom-up recognition with top-down refinement. Deep classification networks have been repurposed for segmentation by applying them in a sliding window fashion, which is computationally expensive and limits the receptive field. Others have used recurrent neural networks or conditional random fields (CRFs) as post-processing to refine coarse predictions. In contrast, our approach is entirely feedforward and does not require any post-processing. The idea of deconvolution for upsampling has appeared before, but not in the context of end-to-end training for dense prediction from full images.

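Paragraph function: Literature review. Systematically compares the architectures of existing methods with FCN.

Logical role: Contrasts the inefficiency of sliding windows and the overhead of post-processing with the feedforward simplicity of FCN, reinforcing the case for the superiority of the end-to-end approach.

Argumentative technique / potential weaknesses: Dismissing CRFs as mere "post-processing" is not entirely fair: the later DeepLab family showed that combining a CRF with an FCN improves performance further. Forgoing post-processing makes FCN simple, but possibly at the cost of boundary accuracy.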

3. Fully Convolutional Networks

Each layer of data in a convnet is a three-dimensional array of size h × w × d, where h and w are spatial dimensions, and d is the feature or channel dimension. The first layer is the image, with pixel size h × w, and d color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields. A network built only from convolutional (and pooling) layers computes a nonlinear filter and is called a fully convolutional network. An FCN naturally operates on an input of any size, and produces an output of corresponding spatial dimensions. The classification nets (AlexNet, VGGNet, GoogLeNet) can be cast into fully convolutional form by replacing their fully connected layers with convolutional layers.

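Paragraph function: Concept definition. Defines fully convolutional networks starting from their mathematical foundations.

Logical role: This paragraph is the theoretical cornerstone of the paper. Converting fully connected layers into convolutional layers is a seemingly simple yet far-reaching insight: it makes inputs of arbitrary size possible and fundamentally changes the paradigm of dense prediction.

Argumentative technique / potential weaknesses: The textbook-style, rigorous definition builds a conceptual foundation that lets readers grasp FCN's mathematical meaning. Converting existing classification networks directly into FCNs cleverly reuses available pre-trained weights.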
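
The replacement of fully connected layers by convolutional layers can be illustrated with a toy example. The sketch below is a minimal, single-channel illustration in plain Python, not the conversion code for any real network: the helper names, the 2 x 2 window size, and all weight values are hypothetical. It shows that sliding the unchanged fully connected weights over a larger input yields a grid of outputs, while a fixed-size input yields exactly the original fully connected output.

```python
def fc_layer(window, weights, bias):
    """Fully connected layer on a flattened k x k window: one output per weight row."""
    flat = [value for row in window for value in row]
    return [
        sum(w * x for w, x in zip(weight_row, flat)) + b
        for weight_row, b in zip(weights, bias)
    ]


def fc_as_convolution(feature_map, weights, bias, k):
    """Slide the same FC weights over every k x k window (stride 1, no padding).

    Returns a grid of (H - k + 1) x (W - k + 1) cells; each cell holds the
    FC outputs for the window whose top-left corner is at that cell.
    """
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            fc_layer([row[j:j + k] for row in feature_map[i:i + k]], weights, bias)
            for j in range(width - k + 1)
        ]
        for i in range(height - k + 1)
    ]


# Hypothetical toy layer: a 2 x 2 input window mapped to two outputs.
K = 2
WEIGHTS = [[1.0, 0.0, 0.0, -1.0],   # output 0: top-left minus bottom-right
           [0.5, 0.5, 0.5, 0.5]]    # output 1: half the window sum
BIAS = [0.0, 1.0]

fixed_input = [[1.0, 2.0],
               [3.0, 4.0]]
larger_input = [[1.0, 2.0, 0.0],
                [3.0, 4.0, 1.0],
                [0.0, 1.0, 2.0]]

single = fc_as_convolution(fixed_input, WEIGHTS, BIAS, K)   # 1 x 1 grid
dense = fc_as_convolution(larger_input, WEIGHTS, BIAS, K)   # 2 x 2 grid
```

On the 2 x 2 input the result is a 1 x 1 grid identical to the fully connected output. On the 3 x 3 input the same weights produce a 2 x 2 grid, one prediction per window position, which is how the converted network accepts inputs of any size.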

3.2 Upsampling and Skip Connections

The output of the fully convolutional classification network is a coarse map at 1/32 of the input resolution. While this captures high-level semantic information, it lacks fine spatial detail. To address this, we use backwards convolution (deconvolution) with learned filters for upsampling. More importantly, we introduce skip connections that combine predictions from the final layer with those from earlier, finer layers. Our FCN-32s uses only the final prediction layer; FCN-16s fuses predictions from the final layer and the pool4 layer; FCN-8s further adds the pool3 layer. This skip architecture yields a nonlinear feature hierarchy that combines deep, coarse, semantic information with shallow, fine, appearance information for progressively finer predictions.

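Paragraph function: Core innovation. Details how skip connections compensate for the limited spatial precision of fully convolutional networks.

Logical role: This paragraph addresses the side effect of fully convolutional conversion (reduced resolution) and, through the progressive refinement FCN-32s -> FCN-16s -> FCN-8s, demonstrates the cumulative benefit of skip connections.

Argumentative technique / potential weaknesses: The stepwise presentation of three variants (32s, 16s, 8s) is a well-designed ablation study. Even so, the output of FCN-8s is still not fine enough; later work such as U-Net (denser skip connections) and DeepLab (atrous convolution with CRF refinement) showed that boundary quality can be improved substantially.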
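
The three variants can be sketched as a data-flow example. The code below makes several simplifying assumptions that are not stated in the passage above: score maps hold a single class score per location, fusion is an element-wise sum after 2x upsampling of the coarser map, and nearest-neighbour repetition stands in for the learned deconvolution filters. All names and values are hypothetical.

```python
def upsample2x(grid):
    """Nearest-neighbour 2x upsampling (a fixed stand-in for learned deconvolution)."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def upsample(grid, factor):
    """Apply 2x upsampling repeatedly; factor must be a power of two."""
    while factor > 1:
        grid = upsample2x(grid)
        factor //= 2
    return grid


def fuse(coarse, fine):
    """Element-wise sum of two score maps of equal size."""
    return [[c + f for c, f in zip(c_row, f_row)] for c_row, f_row in zip(coarse, fine)]


# Hypothetical score maps for a 32 x 32 input image.
score_final = [[1.0]]                                   # stride 32: 1 x 1
score_pool4 = [[0.0, 1.0],
               [2.0, 3.0]]                              # stride 16: 2 x 2
score_pool3 = [[0.25 * j for j in range(4)] for _ in range(4)]  # stride 8: 4 x 4

# FCN-32s: final layer only, upsampled 32x.
fcn32s = upsample(score_final, 32)

# FCN-16s: fuse the final layer with pool4, then upsample 16x.
fused16 = fuse(upsample2x(score_final), score_pool4)
fcn16s = upsample(fused16, 16)

# FCN-8s: fuse the result with pool3, then upsample 8x.
fused8 = fuse(upsample2x(fused16), score_pool3)
fcn8s = upsample(fused8, 8)
```

All three outputs are 32 x 32, but FCN-32s can express only one value for the whole image in this toy setting, FCN-16s varies over a 2 x 2 grid, and FCN-8s over a 4 x 4 grid, which mirrors the progressively finer predictions described above.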

4. Experiments

We test FCN on PASCAL VOC 2011 and 2012 segmentation challenges. FCN-8s achieves 62.2% mean IU on the PASCAL VOC 2012 test set, a 20% relative improvement over the previous state-of-the-art. We also report results on NYUDv2 and SIFT Flow datasets, achieving state-of-the-art on both. Inference takes less than one-fifth of a second per image for a typical 500x500 input, compared to minutes for patch-based approaches. Different base networks yield different quality: VGG-16 as the base network provides the best performance, while GoogLeNet yields slightly lower accuracy. Fine-tuning the entire network end-to-end is critical: freezing layers significantly degrades results.

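Paragraph function: Empirical support. Validates the method's superiority with results across multiple benchmarks and dimensions.

Logical role: Covers accuracy (mean IU), efficiency (inference speed), and ablation analysis (choice of base network, fine-tuning strategy), fully supporting the core claim of "end-to-end, fully convolutional" learning.

Argumentative technique / potential weaknesses: The 20% relative improvement and the speedup of roughly two orders of magnitude or more (under 0.2 s per image versus minutes) are overwhelming figures. Yet 62.2% mean IU still leaves substantial room for improvement in absolute terms, indicating that semantic segmentation is far from solved. The need for end-to-end fine-tuning also implies a need for large amounts of labeled data.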
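
For readers unfamiliar with the metric, the sketch below computes mean IU (mean intersection over union) from a confusion matrix. It follows the usual definition, per-class true positives divided by the union of ground-truth and predicted pixels, averaged over classes. Skipping classes that appear in neither map is an assumption made here, and the toy label maps are hypothetical.

```python
def confusion_matrix(truth, pred, n_classes):
    """counts[t][p] = number of pixels with true class t predicted as class p."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for truth_row, pred_row in zip(truth, pred):
        for t, p in zip(truth_row, pred_row):
            counts[t][p] += 1
    return counts


def mean_iu(counts):
    """Average over classes of TP / (ground truth + predicted - TP)."""
    ious = []
    for c in range(len(counts)):
        true_positive = counts[c][c]
        ground_truth = sum(counts[c])
        predicted = sum(row[c] for row in counts)
        union = ground_truth + predicted - true_positive
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(true_positive / union)
    return sum(ious) / len(ious)


# Hypothetical 2 x 2 label maps with two classes.
truth = [[0, 0],
         [1, 1]]
pred = [[0, 1],
        [1, 1]]

counts = confusion_matrix(truth, pred, n_classes=2)
score = mean_iu(counts)  # class 0: 1/2, class 1: 2/3
```

In this example one of four pixels is wrong, so pixel accuracy is 75%, while mean IU is about 58%. The metric penalizes errors against both the class that lost the pixel and the class that wrongly gained it.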

5. Conclusion

Fully convolutional networks are a rich class of models that can be trained end-to-end for pixelwise prediction. We have shown that adapting classification networks and applying them fully convolutionally for dense prediction is a simple, effective, and scalable approach. The key insight is the combination of deep, coarse layers with shallow, fine layers through skip connections, enabling predictions that respect both global structure and local detail. We believe that fully convolutional training and prediction will become standard for dense prediction tasks.

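Paragraph function: Summarizes the paper, restating the core contributions and predicting broad adoption of the FCN approach.

Logical role: The conclusion echoes the abstract and closes the argumentative loop: "simple, effective, and scalable" responds directly to the introduction's criticism of multi-stage methods.

Argumentative technique / potential weaknesses: The prediction that FCNs "will become standard" is highly confident and was borne out in hindsight: FCN did become the foundation for nearly all subsequent semantic segmentation methods. However, the conclusion does not adequately discuss FCN's limitations, such as its inability to distinguish object instances (instance segmentation) and its blurry boundaries.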

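Argument structure overview

Problem: semantic segmentation relies on multi-stage pipelines and is inefficient
→ Claim: fully convolutional conversion enables end-to-end dense prediction
→ Evidence: 20% relative improvement on VOC 2012, plus inference in under 0.2 s per image versus minutes for patch-based approaches
→ Rebuttal: skip connections compensate for the loss of spatial precision
→ Conclusion: FCN will become the standard for dense prediction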

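Authors' core claim (in one sentence)

Converting classification networks into fully convolutional form and fusing multi-scale features through skip connections enables efficient, end-to-end semantic segmentation that surpasses the prior state of the art.

Strongest point of the argument

The simplicity of the paradigm shift: converting fully connected layers into convolutional layers is a seemingly minor design change with far-reaching consequences. It removes the fixed input size restriction and, through transfer learning, reuses the large body of existing pre-trained classification weights. The progressive presentation of FCN-32s/16s/8s clearly quantifies the cumulative benefit of skip connections.

Weakest point of the argument

An inherent limit on boundary precision: even with skip connections, FCN output remains blurry at object boundaries, because pooling layers inevitably discard spatial information. The authors do not adequately discuss this limitation, nor do they explore alternatives to skip connections such as atrous (dilated) convolution. These limitations were addressed systematically only in later work (DeepLab, U-Net).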