DeepLabv3+: Two-Column Annotation

Abstract 摘要

Spatial pyramid pooling modules or encoder-decoder structures are used in deep neural networks for the semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information.

空間金字塔池化模組或編碼器-解碼器結構被用於深度神經網路的語意分割任務中。前者能夠透過以不同擴張率和不同有效感受野的濾波器或池化操作來探測輸入特徵，從而編碼多尺度上下文資訊，而後者則能透過逐步恢復空間資訊來捕捉更清晰的物體邊界。

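Paragraph function: Sets up the problem background by identifying the respective strengths of the two mainstream architectures in semantic segmentation.

Logical role: Motivates the core claim that follows, namely combining the advantages of both.

Argumentative technique: Uses a contrastive construction (former vs. latter) to define the problem space explicitly, so the reader quickly grasps how the two paradigms complement each other.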

In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.

在本研究中，我們提出結合兩種方法的優勢。具體而言，我們提出的模型 DeepLabv3+ 在 DeepLabv3 的基礎上增加了一個簡潔而有效的解碼器模組，以改善分割結果，尤其是物體邊界處的精細度。我們進一步探索了 Xception 模型，並將深度可分離摺積應用於 Atrous 空間金字塔池化及解碼器模組，從而實現了更快速且更強大的編碼器-解碼器網路。

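Paragraph function: Presents the core proposal, a technical overview of DeepLabv3+.

Logical role: Summarizes the paper's core claim and technical contributions.

Argumentative technique: The phrase "simple yet effective" eases the reader's concerns about complexity, while the text stresses that depthwise separable convolution improves both speed and accuracy.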

We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, achieving the test set performance of 89.0% and 82.1% without any post-processing. Our paper is accompanied with a publicly available reference implementation of the proposed models in TensorFlow.

我們在 PASCAL VOC 2012 和 Cityscapes 資料集上驗證了所提模型的有效性，在測試集上分別達到 89.0% 和 82.1% 的表現，且未使用任何後處理技術。本文附帶以 TensorFlow 實作的公開參考程式碼。

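Paragraph function: Reports experimental results and offers a reproducibility guarantee.

Logical role: Backs the core claim with quantitative results and strengthens credibility with open-source code.

Argumentative technique: Reporting results on two benchmark datasets at once strengthens the generalization argument; releasing the code is a strong endorsement of academic transparency.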

1. Introduction 緒論

Semantic segmentation with the goal of assigning a semantic label to every pixel in an image is one of the fundamental topics in computer vision. Deep convolutional neural networks based on the Fully Convolutional Neural Network show striking improvement over systems relying on hand-crafted features on benchmark tasks. In this work, we consider two types of neural networks that use spatial pyramid pooling module or encoder-decoder structure for semantic segmentation, where the former captures rich contextual information by pooling features at different resolution while the latter is able to obtain sharp object boundaries.

語意分割的目標是為影像中的每個像素指派一個語意標籤，這是電腦視覺的基礎課題之一。基於全摺積神經網路的深度摺積神經網路在標竿任務上相較於依賴手工特徵的系統展現了顯著提升。本研究考慮兩類使用空間金字塔池化模組或編碼器-解碼器結構的神經網路進行語意分割，其中前者透過在不同解析度下池化特徵來捕捉豐富的上下文資訊，而後者則能獲得清晰的物體邊界。

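Paragraph function: Defines the problem scope and reviews the two main technical approaches.

Logical role: Positions the starting point of this work within the broader research context.

Argumentative technique: Narrows from the most general problem definition to specific technical options, leading the reader to accept the combined solution that follows.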

In order to combine the advantages of both approaches, we propose to add a decoder module to the existing encoder module, DeepLabv3, to form an encoder-decoder architecture. The rich semantic information is encoded in the output of DeepLabv3, while the detailed object boundaries are recovered by the simple yet effective decoder module. The encoder module allows us to extract features at an arbitrary resolution by applying atrous convolution, and the decoder module refines the segmentation results along object boundaries.

為了結合兩種方法的優勢，我們提出在現有的編碼器模組 DeepLabv3 上增加一個解碼器模組，形成編碼器-解碼器架構。豐富的語意資訊被編碼在 DeepLabv3 的輸出中，而精細的物體邊界則由簡潔而有效的解碼器模組恢復。編碼器模組使我們能透過空洞摺積在任意解析度下提取特徵，解碼器模組則沿物體邊界改善分割結果。

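Paragraph function: Explains the motivation for the core proposal and gives an architectural overview.

Logical role: Expands the claim from the abstract into a concrete description of the technical approach.

Argumentative technique: "Arbitrary resolution" stresses the flexibility of atrous convolution and implies an advantage over conventional downsampling schemes.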

We further explore the Xception model and apply depthwise separable convolution to both ASPP and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on two competitive benchmark datasets: PASCAL VOC 2012 and Cityscapes. The proposed model, called DeepLabv3+, attains new state-of-the-art performance on both datasets without DenseCRF post-processing.

我們進一步探索 Xception 模型，並將深度可分離摺積應用於 ASPP 和解碼器模組，從而得到更快速且更強大的編碼器-解碼器網路。我們在兩個具競爭性的標竿資料集上驗證了所提模型的有效性：PASCAL VOC 2012 和 Cityscapes。所提模型被稱為 DeepLabv3+，在兩個資料集上均達成新的最先進水準，且無需 DenseCRF 後處理。

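Paragraph function: Previews the main contributions and experimental results.

Logical role: Closes the introduction and sets the reader's expectations for the experiments that follow.

Argumentative technique: "Without DenseCRF post-processing" deliberately excludes external enhancements, highlighting the competitiveness of the model itself.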

2. Related Work 相關工作

Models based on Fully Convolutional Networks (FCNs) have demonstrated significant improvements for semantic segmentation. Several model variants have been proposed to exploit the contextual information for segmentation, including those that employ multi-scale inputs or those that adopt probabilistic graphical models such as CRFs. In this work, we mainly discuss four types of FCN-based approaches that are most closely related to our model: atrous convolution, spatial pyramid pooling, encoder-decoder, and depthwise separable convolution.

基於全摺積網路（FCN）的模型在語意分割方面展現了顯著進展。為了利用上下文資訊，已有多種模型變體被提出，包括採用多尺度輸入或利用機率圖模型（如 CRF）的方法。本研究主要討論與我們的模型最相關的四類基於 FCN 的方法：空洞摺積、空間金字塔池化、編碼器-解碼器以及深度可分離摺積。

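Paragraph function: Systematically categorizes the related literature.

Logical role: Establishes a technical lineage that helps the reader understand where DeepLabv3+ sits in the literature.

Argumentative technique: Divides related work into exactly four categories that map onto the four technical components of DeepLabv3+, implying that the method is the best integration of these lines of work.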

Atrous convolution, also known as dilated convolution, allows us to repurpose ImageNet pretrained networks to extract denser feature maps by removing the downsampling operations from the last few layers and upsampling the corresponding filter kernels. It is an alternative to using deconvolution layers and offers the advantage of being able to control the resolution of features computed by deep CNNs and adjust the filter's field-of-view in order to capture multi-scale information.

空洞摺積（又稱擴張摺積）使我們能透過移除最後幾層的下採樣操作並相應地上採樣濾波器核，將 ImageNet 預訓練網路轉化為提取更密集特徵圖的工具。它是反摺積層的替代方案，具備控制深度 CNN 所計算特徵的解析度以及調整濾波器感受野來捕捉多尺度資訊的優勢。

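Paragraph function: Explains the principle and advantages of atrous convolution.

Logical role: Provides the theoretical basis for the design of the ASPP module.

Argumentative technique: Highlights the design flexibility of atrous convolution through comparison with deconvolution.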
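To make the field-of-view point concrete, the sketch below computes how much of the input a dilated filter spans. It assumes the usual relation k + (k - 1)(r - 1) for a kernel of size k at rate r. The function name and the sample rates are illustrative assumptions and are not taken from the text above.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Input extent covered by a kernel with (rate - 1) holes between taps."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


# Sample rates are assumptions chosen for illustration only.
for rate in (1, 6, 12, 18):
    span = effective_kernel_size(3, rate)
    print(f"rate={rate:2d}: a 3-tap kernel spans {span} input positions")
```

The number of weights stays at three per dimension in every case; only the spacing between the taps changes, which is why the field-of-view can grow without adding parameters.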

3. Methods 方法

3.1 Encoder-Decoder with Atrous Convolution

Atrous convolution is a powerful tool that allows us to explicitly control the resolution of features computed by deep neural networks and adjust filter's field-of-view in order to capture multi-scale information, generalizing standard convolution operation. In the case of two-dimensional signals, for each location i on the output feature map y and a convolution filter w, atrous convolution is applied over the input feature map x as: y[i] = sum_k x[i + r * k] * w[k], where the atrous rate r determines the stride with which we sample the input signal.

空洞摺積是一個強大的工具，使我們能明確控制深度神經網路所計算特徵的解析度，並調整濾波器的感受野以捕捉多尺度資訊，是標準摺積操作的推廣。在二維訊號的情形中，對於輸出特徵圖 y 上的每個位置 i 和摺積濾波器 w，空洞摺積作用於輸入特徵圖 x 上的公式為：y[i] = sum_k x[i + r * k] * w[k]，其中空洞率 r 決定了我們對輸入訊號的取樣步幅。

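Paragraph function: Formally defines the mathematical form of atrous convolution.

Logical role: Provides the mathematical basis for the multi-rate design of the ASPP module that follows.

Argumentative technique: The formula adds rigor, while "generalizing standard convolution" frames atrous convolution as the more general operation.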
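The formula y[i] = sum_k x[i + r * k] * w[k] can be tried out directly. The following is a minimal one-dimensional sketch in pure Python, not the paper's implementation: it evaluates only positions where every tap falls inside the input (no padding), and the function name and sample values are assumptions made for illustration.

```python
def atrous_conv1d(x, w, rate):
    """Compute y[i] = sum_k x[i + rate * k] * w[k] over valid positions."""
    span = rate * (len(w) - 1)  # distance between the first and last tap
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]

print(atrous_conv1d(x, w, rate=1))  # [-2, -2, -2, -2, -2]
print(atrous_conv1d(x, w, rate=2))  # [-4, -4, -4]
```

With rate 1 the function reduces to a standard convolution (in the cross-correlation form used by deep learning libraries). With rate 2 the same three weights compare samples four positions apart instead of two.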

We employ DeepLabv3 as the encoder module in our proposed encoder-decoder structure. DeepLabv3 employs atrous convolution to extract the features computed by deep convolutional neural networks at an arbitrary resolution. Specifically, we apply the Atrous Spatial Pyramid Pooling (ASPP) module, which probes convolutional features at multiple scales by applying atrous convolution with different rates. The proposed decoder module consists of a 1x1 convolution to reduce the number of channels of the low-level features, followed by concatenation with the upsampled encoder features, and a few 3x3 convolutions to refine the features.

我們在所提出的編碼器-解碼器結構中採用 DeepLabv3 作為編碼器模組。DeepLabv3 利用空洞摺積以任意解析度提取深度摺積神經網路計算的特徵。具體而言，我們使用 Atrous 空間金字塔池化（ASPP）模組，透過以不同擴張率施加空洞摺積來探測多尺度的摺積特徵。所提出的解碼器模組包含 1x1 摺積來減少低階特徵的通道數，接著與上採樣後的編碼器特徵進行串接，再經過幾個 3x3 摺積來改善特徵。

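Paragraph function: Details the specific components of the encoder-decoder architecture.

Logical role: The central methodological paragraph of the paper, presenting the complete DeepLabv3+ architecture.

Argumentative technique: The decoder is deliberately kept simple (only 1x1 convolution, concatenation, and 3x3 convolutions), in sharp contrast to the more complex U-Net-style decoders, reinforcing the "simple yet effective" narrative.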
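The decoder steps described above can be followed as a sequence of tensor shapes. The sketch below traces shapes only and performs no convolution. All default values (encoder stride 16, low-level stride 4, 256 and 48 channels, 21 classes) are assumptions chosen for illustration; the passage above does not state them, and input sizes are assumed to be divisible by the strides.

```python
def decoder_shapes(height, width, encoder_stride=16, low_level_stride=4,
                   encoder_channels=256, reduced_channels=48,
                   refine_channels=256, num_classes=21):
    """Trace (H, W, C) shapes through the decoder; every default is an assumption."""
    enc_hw = (height // encoder_stride, width // encoder_stride)
    low_hw = (height // low_level_stride, width // low_level_stride)
    return [
        ("encoder output", (*enc_hw, encoder_channels)),
        # A 1x1 convolution shrinks the channel count of the low-level features.
        ("low-level features after 1x1 conv", (*low_hw, reduced_channels)),
        # Encoder features are upsampled to the low-level resolution.
        ("encoder features upsampled", (*low_hw, encoder_channels)),
        # Concatenation stacks both feature maps along the channel axis.
        ("concatenated", (*low_hw, encoder_channels + reduced_channels)),
        # A few 3x3 convolutions refine the merged features.
        ("after 3x3 convs", (*low_hw, refine_channels)),
        # Assumed final step: per-class scores upsampled to the input size.
        ("logits at input resolution", (height, width, num_classes)),
    ]


for name, shape in decoder_shapes(512, 512):
    print(f"{name:36s} {shape}")
```

The trace shows why the 1x1 reduction matters in this configuration: after concatenation the low-level branch contributes 48 of 304 channels, so it does not outweigh the encoder features.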

3.2 Modified Aligned Xception

The Xception model has shown promising image classification results on ImageNet with fast computation. More recently, the MSRA team modified the Xception model, called Aligned Xception, and further pushed the performance. Motivated by these findings, we adapt the Xception model for the task of semantic segmentation. In particular, we make several changes: (1) a deeper Xception, the same as in the MSRA version except that we do not modify the entry flow network structure; (2) all max pooling operations are replaced by depthwise separable convolutions with striding; (3) extra batch normalization and ReLU activation are added after each 3x3 depthwise convolution.

Xception 模型在 ImageNet 影像分類上以快速計算展現了良好的結果。近期，MSRA 團隊修改了 Xception 模型（稱為 Aligned Xception）並進一步提升了效能。受這些成果啟發，我們將 Xception 模型調整以適用於語意分割任務。具體變更包括：（1）採用與 MSRA 版本相同的更深 Xception，但不修改入口流網路結構；（2）所有最大池化操作均替換為帶步幅的深度可分離摺積；（3）每個 3x3 深度摺積後額外加入批次正規化和 ReLU 啟動函數。

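Paragraph function: Describes three key modifications to the Xception backbone.

Logical role: Supplies the technical details behind the efficiency gains, complementing the main architecture.

Argumentative technique: A numbered list presents the modifications clearly, making them easy to check one by one and to reproduce.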
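The efficiency argument for depthwise separable convolution rests on simple arithmetic. The sketch below counts multiply-accumulate operations for a single layer, comparing a standard convolution with a depthwise convolution followed by a 1x1 pointwise convolution. The feature map size and channel counts are assumed values for illustration. This is a per-layer count and is not expected to match the whole-model figures reported in the paper.

```python
def standard_conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulates of a k x k standard convolution (stride 1, same size)."""
    return h * w * c_in * c_out * k * k


def separable_conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulates of a k x k depthwise conv plus a 1x1 pointwise conv."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 conv mixes channels
    return depthwise + pointwise


# Assumed layer: 32 x 32 feature map, 256 input and 256 output channels.
std = standard_conv_macs(32, 32, 256, 256)
sep = separable_conv_macs(32, 32, 256, 256)
print(f"standard:  {std:,}")
print(f"separable: {sep:,}")
print(f"ratio:     {sep / std:.4f}")
```

The ratio equals 1/c_out + 1/k^2, so for 3x3 kernels and wide layers the separable form costs a little more than one ninth of the standard convolution.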

4. Experimental Evaluation 實驗評估

We employ ImageNet-1k pretrained ResNet-101 or modified aligned Xception as network backbones for our DeepLabv3+ model. Our implementation is built on TensorFlow. We evaluate the proposed model on the PASCAL VOC 2012 semantic segmentation benchmark which contains 20 foreground object classes and one background class. The original dataset contains 1,464 (train), 1,449 (val), and 1,456 (test) pixel-level annotated images. We augment the dataset by the extra annotations provided by Hariharan et al., resulting in 10,582 (trainaug) training images.

我們採用 ImageNet-1k 預訓練的 ResNet-101 或修改版 Aligned Xception 作為 DeepLabv3+ 模型的骨幹網路。實作基於 TensorFlow 框架。我們在 PASCAL VOC 2012 語意分割標竿上評估所提模型，該資料集包含 20 個前景物體類別和一個背景類別。原始資料集包含 1,464（訓練）、1,449（驗證）和 1,456（測試）張像素級標注影像。我們利用 Hariharan 等人提供的額外標注擴增資料集，得到 10,582（trainaug）張訓練影像。

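Paragraph function: Describes the experimental setup: backbone, framework, and dataset.

Logical role: Lays the groundwork for reproducing the experiments.

Argumentative technique: Listing the size of each dataset split shows experimental rigor; mentioning the dataset augmentation makes the training conditions transparent.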

Using the proposed encoder-decoder structure, DeepLabv3+ with the modified Xception as network backbone achieves the performance of 87.8% on the PASCAL VOC 2012 val set and 89.0% on the test set. Compared with other state-of-the-art models, our DeepLabv3+ outperforms PSPNet, Large Kernel Matters, and Multipath-RefineNet. When employing the depthwise separable convolution, the model is significantly faster (multiply-adds are reduced by 33% to 41%) while maintaining similar or better accuracy.

使用所提出的編碼器-解碼器結構，DeepLabv3+ 以修改版 Xception 作為骨幹網路，在 PASCAL VOC 2012 驗證集上達到 87.8%，測試集上達到 89.0%。與其他最先進模型相比，DeepLabv3+ 超越了 PSPNet、Large Kernel Matters 和 Multipath-RefineNet。當採用深度可分離摺積時，模型速度顯著提升（乘加運算減少 33% 至 41%），同時維持相近甚至更好的精度。

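Paragraph function: Reports the core quantitative results.

Logical role: Supports the state-of-the-art claim for DeepLabv3+ with numbers.

Argumentative technique: Presenting accuracy and speed together argues for a Pareto advantage in the efficiency-accuracy trade-off, which is more persuasive than a comparison along a single dimension.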
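The excerpt above does not name the metric behind these percentages; in the paper they are mean intersection-over-union (mIoU) scores. The sketch below shows how mIoU can be computed on flat label lists. It is a simplified illustration: the function name and toy labels are assumptions, and benchmark evaluation accumulates counts over the whole dataset and handles ignored pixels, which this sketch leaves out.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels (assumed values).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))  # per-class IoU: 1/3, 2/3, 1/2
```

Because each class contributes equally to the mean, errors on small or rare classes weigh as much as errors on large ones, which is one reason boundary refinement can move the score noticeably.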

We also evaluate DeepLabv3+ on the Cityscapes dataset, a large-scale dataset that contains 5,000 high quality pixel-level annotated images collected in street scenes from 50 different cities. Our best model achieves 82.1% on the test set, setting a new state-of-the-art. We note that we do not employ DenseCRF post-processing or multi-scale testing for our submissions, which may further improve the performance.

我們也在 Cityscapes 資料集上評估 DeepLabv3+，這是一個大規模資料集，包含來自 50 個不同城市的街景中收集的 5,000 張高品質像素級標注影像。我們的最佳模型在測試集上達到 82.1%，創下新的最先進水準。值得注意的是，我們的提交未使用 DenseCRF 後處理或多尺度測試，這些技術或可進一步提升效能。

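Paragraph function: Demonstrates generalization across datasets.

Logical role: Reinforces the generalization argument and shows fairness through self-imposed restraint.

Argumentative technique: Volunteering that no extra enhancements (DenseCRF, multi-scale testing) were used implies that the model's raw score is already ahead; this is a clever concession strategy.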

5. Conclusion 結論

Our proposed model, DeepLabv3+, employs the encoder-decoder structure where DeepLabv3 is used to encode the rich contextual information and a simple yet effective decoder module is adopted to recover the object boundaries. One could also apply the atrous convolution to extract the encoder features at an arbitrary resolution, depending on the available computation resources. We also explore the Xception model and atrous separable convolution to make the proposed model faster and stronger.

我們提出的模型 DeepLabv3+ 採用編碼器-解碼器結構，其中 DeepLabv3 用於編碼豐富的上下文資訊，而簡潔有效的解碼器模組則用於恢復物體邊界。根據可用的計算資源，可以應用空洞摺積在任意解析度下提取編碼器特徵。我們也探索了 Xception 模型和空洞可分離摺積，使所提模型更快速且更強大。

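Paragraph function: Summarizes the paper's core technical contributions.

Logical role: Returns to the motivation from the introduction, closing the argument so that the ending echoes the beginning.

Argumentative technique: "Depending on the available computation resources" points to the method's practical flexibility and reinforces its engineering value.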

Finally, our experimental results show that the proposed model achieves state-of-the-art performance on both PASCAL VOC 2012 and Cityscapes datasets. Our combination of encoder-decoder architecture with atrous separable convolution provides a principled approach that balances accuracy and efficiency for semantic segmentation. We hope our simple yet effective method could serve as a strong baseline for future research in the field.

最後，實驗結果表明所提模型在 PASCAL VOC 2012 和 Cityscapes 資料集上均達到最先進水準。我們將編碼器-解碼器架構與空洞可分離摺積結合的方式，提供了一種兼顧精度與效率的系統性方案。我們希望這個簡潔而有效的方法能成為該領域未來研究的強力基線。

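Paragraph function: Looks ahead to future impact.

Logical role: Ends with a modest positioning as a baseline while hinting at the method's long-term value.

Argumentative technique: "Principled approach" stresses the methodological grounding rather than mere engineering accumulation, raising the academic tone.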

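Overview of the Argument Structure

Problem: spatial pyramid pooling vs. encoder-decoder, each with its own strengths and weaknesses
→ Thesis: combine the advantages of both to form DeepLabv3+
→ Method: DeepLabv3 encoder + lightweight decoder + Xception
→ Evidence: VOC 89.0%, Cityscapes 82.1%
→ Conclusion: a systematic balance of accuracy and efficiency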

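Core claim

Adding a lightweight decoder module on top of the atrous-convolution encoder of DeepLabv3, together with depthwise separable convolution, achieves both accurate recovery of object boundaries and efficient multi-scale feature extraction.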

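Strongest point of the argument

Reaching state of the art on two mainstream benchmark datasets without any post-processing, combined with the 33% to 41% reduction in multiply-adds from depthwise separable convolution, makes the case persuasive on both accuracy and efficiency.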

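Weakest point of the argument

The "simplicity" of the decoder design is a selling point, but a systematic ablation over decoder designs of different complexity is missing. The reader cannot tell whether the current design is already optimal or merely an engineering compromise.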