SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Abstract

Most recent semantic segmentation methods adopt a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focusing on increasing the receptive field. This paper proposes SETR, which treats semantic segmentation as a sequence-to-sequence prediction task. Specifically, an image is treated as a sequence of patches, and a pure transformer (without convolution and resolution reduction) is deployed to encode the image as a sequence of feature representations with global context modeled in every layer of the transformer.

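Paragraph function: Overview of the whole paper. It pinpoints the paradigm-level limitation of existing methods and states the core claim that a Transformer can replace the FCN.

Logical role: The abstract does double duty, setting up the problem and proposing the solution. It first identifies the fundamental limitation that FCNs rely on progressive resolution reduction to enlarge the receptive field, then recasts segmentation from a new sequence-to-sequence perspective.

Argumentative technique / potential weakness: The core selling point, global context modeled in every layer, directly targets the FCN receptive-field bottleneck. However, the cost of self-attention in a pure Transformer grows quadratically with sequence length, so its practicality on high-resolution images still needs to be verified.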
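
The quadratic cost flagged above can be made concrete with a little arithmetic. The sketch below counts patch tokens and the entries of a single attention matrix (one head, one layer) for a few input sizes. The patch size of 16 and the square resolutions are illustrative assumptions, not figures quoted in this section.

```python
def attention_cost(height, width, patch=16):
    """Return (num_tokens, attention_entries) for one head of one layer.

    Assumes height and width are divisible by `patch` (an illustrative choice).
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    num_tokens = (height // patch) * (width // patch)   # HW / p^2
    return num_tokens, num_tokens ** 2                  # n x n attention matrix


for side in (256, 512, 1024):   # hypothetical square inputs
    n, entries = attention_cost(side, side)
    print(f"{side}x{side}: {n} tokens, {entries:,} attention entries")
```

Under these assumptions, doubling the side length quadruples the token count and multiplies the attention matrix size by sixteen.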

1. Introduction

Fully Convolutional Networks (FCNs) have been the dominant paradigm for semantic segmentation since their introduction. These approaches typically employ an encoder-decoder structure where the encoder, often a deep convolutional neural network pre-trained on ImageNet, progressively reduces spatial resolution while increasing feature abstraction. The decoder then recovers spatial details through upsampling. A fundamental limitation is that the receptive field grows only linearly with network depth, making it difficult to capture long-range dependencies efficiently.

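Paragraph function: Establishes the research setting by defining how the FCN paradigm works and where its core limitation lies.

Logical role: The starting point of the argument chain. It first establishes the FCN bottleneck of linear receptive-field growth, which motivates the need for a Transformer with global attention.

Argumentative technique / potential weakness: The problem is framed as a contrast between linear growth and global modeling, but this paragraph passes over techniques such as dilated convolution and ASPP that already partly mitigate the issue, which makes the problem statement somewhat one-sided.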
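
To see what linear growth means in practice, and how dilation changes the picture, the following minimal sketch computes the one-dimensional receptive field of a stack of convolution layers. The layer stacks are hypothetical and are not taken from any network discussed in the paper.

```python
def receptive_field(layers):
    """Receptive field (input pixels along one axis) of stacked conv layers.

    Each layer is a (kernel_size, stride, dilation) tuple.
    """
    rf, jump = 1, 1   # jump: input-pixel distance between adjacent outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


for depth in (1, 5, 10):
    plain = [(3, 1, 1)] * depth                        # 3x3 convs, stride 1
    dilated = [(3, 1, 2 ** i) for i in range(depth)]   # dilation doubles per layer
    print(depth, receptive_field(plain), receptive_field(dilated))
```

With plain 3x3 convolutions at stride 1, each extra layer adds only two pixels to the receptive field. The dilated stack grows much faster, which is why the commentary above calls the problem statement one-sided.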

The Transformer architecture, originally designed for natural language processing, employs self-attention mechanisms that model dependencies between all positions in a sequence regardless of their distance. The recent success of Vision Transformer (ViT) on image classification demonstrates that a pure transformer applied to sequences of image patches can perform very well on recognition tasks. This raises a natural question: can we rethink the semantic segmentation task from a sequence-to-sequence perspective, replacing convolution-based encoders entirely?

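Paragraph function: Introduces the new technical background, moving from the Transformer's success in NLP to its potential in computer vision.

Logical role: This paragraph acts as a bridge. It connects the FCN limitation raised in the previous paragraph to the Transformer's global modeling ability, and uses a rhetorical question to lead the reader toward accepting the new paradigm as reasonable.

Argumentative technique / potential weakness: Closing the paragraph with a question is an effective rhetorical strategy that prompts readers to think through the answer themselves. However, ViT's success rests on large-scale pre-training data (JFT-300M), and whether that condition can be reproduced for segmentation is not discussed.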

Prior efforts to enlarge receptive fields include dilated/atrous convolutions, Atrous Spatial Pyramid Pooling (ASPP), and non-local operations. While these methods partially address the limited receptive field problem, they are still built upon the FCN backbone and inherit its fundamental constraints. Concurrently, DETR demonstrated that transformers can tackle object detection as a set prediction problem, but it still relies on a CNN backbone for feature extraction. The recent ViT showed that a pure transformer without any convolutional layers achieves competitive classification performance, yet its application to dense prediction tasks like segmentation remains unexplored.

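Paragraph function: Literature review. It systematically points out the shortcomings of existing methods and constructs the research gap.

Logical role: This paragraph justifies SETR's positioning. Existing refinements (ASPP, non-local) remain bound to the FCN framework; DETR uses a Transformer but does not shed the CNN; ViT sheds the CNN but is limited to classification. SETR fills exactly this gap.

Argumentative technique / potential weakness: The escalating critique (FCN refinements, then DETR stopping halfway, then ViT not addressing dense prediction) builds a very effective "no man's land" narrative. However, the non-local operation is in essence a self-attention mechanism, and its difference from the Transformer is deliberately overstated.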

3. Method

3.1 Image Serialization and Transformer Encoder

Given an image of size H x W x 3, SETR first splits it into a sequence of flattened 2D patches of size p x p, producing a sequence of HW/p^2 tokens. Each patch is linearly projected into an embedding of dimension d, and learnable 1D position embeddings are added. The resulting sequence is then fed into a standard Transformer encoder consisting of L layers of Multi-Head Self-Attention (MHSA) and Feed-Forward Networks (FFN). Critically, every layer performs global self-attention over the entire patch sequence, meaning that each patch attends to all other patches regardless of spatial distance from the very first layer.

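Paragraph function: The core of the method. It describes how an image is serialized and fed into a pure Transformer encoder.

Logical role: This paragraph is the technical core of the paper: the concrete operation that turns an image problem into a sequence problem. "Global attention from the very first layer" directly answers the Introduction's critique of the limited FCN receptive field.

Argumentative technique / potential weakness: Emphasizing global self-attention in every layer highlights a structural advantage over FCNs. However, the O(n^2) complexity of self-attention means the sequence length (that is, HW/p^2) is tightly constrained, which usually calls for a relatively large patch size (such as 16x16) and may lose fine-grained spatial information.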
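
The serialization and global-attention steps described in this subsection can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy sizes, the random matrices standing in for learned weights, and the helper names `patchify` and `self_attention` are all hypothetical, and multi-head splitting, the FFN, layer normalization, and residual connections are omitted.

```python
import numpy as np


def patchify(image, p):
    """Split an (H, W, C) image into an (H*W/p^2, p*p*C) sequence of flat patches."""
    H, W, C = image.shape
    assert H % p == 0 and W % p == 0, "H and W must be divisible by p"
    x = image.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)             # (H/p, W/p, p, p, C)
    return x.reshape((H // p) * (W // p), p * p * C)


def self_attention(z, Wq, Wk, Wv):
    """Single-head global self-attention: every token attends to every token."""
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, n) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v, weights


rng = np.random.default_rng(0)
H, W, p, d = 64, 96, 16, 32           # toy sizes, chosen only for illustration
image = rng.standard_normal((H, W, 3))

patches = patchify(image, p)                              # (24, 768)
E = rng.standard_normal((p * p * 3, d)) * 0.02            # stand-in for the linear projection
pos = rng.standard_normal((patches.shape[0], d)) * 0.02   # stand-in for 1D position embeddings
tokens = patches @ E + pos                                # (24, 32)

Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out, weights = self_attention(tokens, Wq, Wk, Wv)
print(patches.shape, tokens.shape, weights.shape)
```

Every row of `weights` spans all 24 tokens, so even this single layer mixes information across the whole image. The n x n shape of `weights` is also where the quadratic cost comes from.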

3.2 Decoder Design: PUP and MLA

The authors propose two simple decoder designs: Progressive UPsampling (PUP) and Multi-Level feature Aggregation (MLA). PUP progressively upsamples the Transformer output by a factor of 2 in each step using bilinear interpolation followed by convolution, restoring the original resolution in a multi-stage fashion. MLA aggregates intermediate features from multiple Transformer layers (e.g., layers 6, 12, 18, 24) and fuses them through channel-wise concatenation. A third variant, SETR-Naive, simply reshapes and upsamples the final layer output in one step. The key insight is that even these simple decoders, combined with the powerful Transformer encoder, achieve state-of-the-art results.

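Paragraph function: Method details. It describes the concrete design of the three decoder variants.

Logical role: This paragraph conveys an important implicit argument: the key bottleneck for segmentation performance is the representational power of the encoder, not the sophistication of the decoder. This provides indirect support for the value of a pure Transformer encoder.

Argumentative technique / potential weakness: The claim that a simple decoder is enough to reach SOTA neatly credits the Transformer encoder, but the reverse reading also holds: a simple decoder may cap performance, and a more sophisticated decoder might improve the results further.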
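
A shape-level sketch may help keep the three decoder variants apart. It is a sketch under assumptions, not the paper's code. Nearest-neighbour repetition stands in for bilinear interpolation, a per-pixel channel-mixing matrix stands in for each convolution, MLA fusion is reduced to a plain channel-wise concatenation, and a patch size of 16 is assumed, so PUP needs four 2x stages. All function names and sizes are hypothetical.

```python
import numpy as np


def tokens_to_map(tokens, h, w):
    """Reshape an (h*w, d) token sequence back into an (h, w, d) feature map."""
    return tokens.reshape(h, w, -1)


def upsample(x, factor):
    """Nearest-neighbour upsampling; a stand-in for bilinear interpolation."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)


def naive_decoder(tokens, h, w, W_cls, patch=16):
    """Classify each token, then upsample to full resolution in one step."""
    logits = tokens_to_map(tokens, h, w) @ W_cls
    return upsample(logits, patch)


def pup_decoder(tokens, h, w, mixers, W_cls):
    """Alternate 2x upsampling with a channel-mixing layer (stand-in for a conv)."""
    x = tokens_to_map(tokens, h, w)
    for M in mixers:                     # four stages take H/16 up to H
        x = upsample(x, 2)
        x = np.maximum(x @ M, 0.0)       # 1x1 "conv" followed by ReLU
    return x @ W_cls


def mla_features(layer_outputs, h, w, picks=(6, 12, 18, 24)):
    """Gather maps from selected layers (1-indexed) and concatenate their channels."""
    maps = [tokens_to_map(layer_outputs[i - 1], h, w) for i in picks]
    return np.concatenate(maps, axis=-1)


rng = np.random.default_rng(0)
h, w, d, classes = 4, 6, 8, 5            # toy sizes: a 64x96 image with 16x16 patches
layer_outputs = [rng.standard_normal((h * w, d)) for _ in range(24)]
W_cls = rng.standard_normal((d, classes))
mixers = [rng.standard_normal((d, d)) * 0.3 for _ in range(4)]

naive = naive_decoder(layer_outputs[-1], h, w, W_cls)
pup = pup_decoder(layer_outputs[-1], h, w, mixers, W_cls)
mla = mla_features(layer_outputs, h, w)
print(naive.shape, pup.shape, mla.shape)
```

The Naive and PUP paths end at the same full-resolution logit shape and differ only in how many steps they take to get there. The MLA path widens the channel dimension by drawing on several encoder depths before any upsampling.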

4. Experiments

Experiments are conducted on three major benchmarks: ADE20K (150 categories), Pascal Context (59 categories), and Cityscapes (19 categories). The Transformer encoder uses ViT-Large with 24 layers, hidden dimension 1024, and 16 attention heads, pre-trained on ImageNet-21K. On ADE20K, SETR-MLA achieves 50.28% mIoU, surpassing all previous methods including HRNet+OCR (45.66%) and DeepLabV3+ (45.47%). On Pascal Context, SETR-MLA reaches 55.83% mIoU, again establishing a new state of the art. On Cityscapes test, SETR-PUP achieves 82.2% mIoU, competitive with the best methods.

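Paragraph function: Provides comprehensive experimental evidence, validating the method on multiple benchmarks.

Logical role: This paragraph is the empirical pillar of the paper. The wide margin on ADE20K, 50.28% versus 45.66% (+4.62 percentage points), strongly supports the core claim that a Transformer encoder outperforms a CNN encoder.

Argumentative technique / potential weakness: The data-dense presentation adds persuasive force. However, ViT-Large's parameter count (307M) is far larger than the backbones of the compared methods, and it depends on ImageNet-21K pre-training (about 14 million images), so the fairness of the comparison is questionable. The Cityscapes result (82.2%) is only "competitive" rather than "superior", suggesting that the advantage is less pronounced in structured scenes such as street views.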
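
Since every number in this section is an mIoU, it is worth recalling how the metric is computed. The sketch below derives per-class intersection-over-union from a confusion matrix and averages over the classes that appear. The tiny label maps are made up for illustration. Benchmark protocols usually accumulate the confusion matrix over the whole dataset and handle ignore labels, both of which are left out here.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union from integer label maps of the same shape."""
    pred, target = np.asarray(pred).ravel(), np.asarray(target).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0                  # skip classes absent from both maps
    return (intersection[present] / union[present]).mean()


target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])
print(mean_iou(pred, target, num_classes=3))   # per-class IoUs: 1/3, 2/3, 1/2
```

Because the metric averages over classes rather than pixels, rare classes weigh as much as common ones. This matters on a benchmark like ADE20K with 150 categories.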

Ablation studies reveal several key findings: increasing the number of Transformer layers from 12 to 24 improves mIoU by approximately 2%. MLA consistently outperforms PUP and Naive decoders, confirming the value of multi-level feature aggregation. Position embedding analysis shows that learned 1D position embeddings perform comparably to 2D variants, suggesting the Transformer implicitly learns spatial structure. Importantly, features from intermediate layers (e.g., layer 15) sometimes outperform the final layer, indicating that not all useful information propagates to the deepest layers.

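Paragraph function: Ablation analysis. It checks that each design choice is justified.

Logical role: The ablation experiments provide independent causal evidence for each component of the method. The finding that intermediate layers can outperform the final layer gives the MLA decoder design a post hoc theoretical rationale.

Argumentative technique / potential weakness: That 1D position embeddings are on par with 2D ones is a counterintuitive but strong finding, suggesting that the Transformer's representational capacity is sufficient to recover 2D structure from a 1D arrangement. However, this conclusion may be specific to the patch size and datasets used, and its generality remains to be verified.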
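
One way to read the 1D-versus-2D result is that, for a fixed patch grid, a raster-order index already determines the 2D position, so a separately learned vector per index can in principle encode both coordinates. The sketch below only illustrates that index mapping. The grid width is a hypothetical value, and nothing here reproduces the paper's ablation.

```python
def to_2d(index, cols):
    """Map a raster-order 1D patch index to its (row, col) grid position."""
    return divmod(index, cols)


def to_1d(row, col, cols):
    """Inverse mapping: (row, col) back to the raster-order index."""
    return row * cols + col


cols = 6   # hypothetical grid width, e.g. a 96-pixel-wide image with 16x16 patches
for i in (0, 5, 6, 23):
    print(i, to_2d(i, cols))
```

The mapping is only one-to-one while the grid width stays fixed, which is one reason the finding may not carry over unchanged to other patch sizes or input resolutions.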

5. Conclusion

This paper presents SETR, an alternative perspective that reformulates semantic segmentation as a sequence-to-sequence prediction task using a pure Transformer encoder. By treating images as sequences of patches and applying global self-attention in every layer, SETR overcomes the fundamental receptive field limitation of FCN-based methods. Combined with simple decoder designs, SETR achieves new state-of-the-art results on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU). The results demonstrate that the Transformer architecture offers a viable and powerful alternative to convolutional encoders for dense prediction tasks.

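Paragraph function: Summarizes the paper, restating the core contribution and experimental results.

Logical role: The conclusion closely mirrors the structure of the abstract, closing the paper in four moves: problem restatement, method summary, results summary, and statement of significance.

Argumentative technique / potential weakness: By positioning itself as a "viable and powerful alternative", the conclusion stops short of claiming to fully replace CNNs while still clearly asserting the Transformer's competitiveness. However, it does not discuss practical deployment concerns such as computational cost and inference speed, nor performance on small-scale datasets.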

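Argument Structure Overview

Problem: The FCN receptive field is limited, making long-range dependencies hard to capture.
→ Claim: Redefine semantic segmentation from a sequence-to-sequence perspective.
→ Evidence: ADE20K 50.28% mIoU; Pascal Context 55.83% mIoU.
→ Rebuttal: Simple decoders suffice, which rules out the decoder as the source of the gains.
→ Conclusion: The Transformer is a viable and powerful alternative for dense prediction.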

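Authors' Core Claim (in One Sentence)

Recasting semantic segmentation as a sequence-to-sequence prediction task and replacing the FCN with a pure Transformer encoder, with global self-attention in every layer, effectively overcomes the receptive-field bottleneck of convolutional networks and achieves state-of-the-art segmentation performance on major benchmarks.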

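Strongest Point of the Argument

Empirical support for a paradigm shift: on ADE20K, SETR reaches 50.28% mIoU, well ahead of the best CNN-based method (HRNet+OCR, 45.66%), and does so with only a simple decoder. This makes a strong case for the representational advantage of the Transformer encoder in dense prediction tasks. The ablation experiments further demonstrate the value of intermediate-layer features, lending theoretical support to the MLA design.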

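Weakest Point of the Argument

Computational cost and fairness: the ViT-Large encoder's parameter count (307M) far exceeds that of the backbones in the compared methods (ResNet-101, for example, has about 45M), and it requires large-scale ImageNet-21K pre-training. On Cityscapes the result is only "competitive" rather than "superior", suggesting that in certain scenarios the Transformer's global modeling advantage may be less pronounced than expected. Inference speed and memory consumption are not discussed.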
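
The scale of the encoder can be sanity-checked with a back-of-the-envelope count. The sketch below tallies the weights of the Transformer layers only, using the 24 layers and hidden dimension 1024 given in the Experiments section. The MLP expansion ratio of 4 and the presence of bias terms are assumptions about a standard ViT layout, not details stated in these notes.

```python
def vit_encoder_params(layers=24, d=1024, mlp_ratio=4):
    """Approximate parameter count of the Transformer layers alone."""
    attn = 4 * d * d + 4 * d                     # Q, K, V, output projections with biases
    hidden = mlp_ratio * d
    mlp = d * hidden + hidden + hidden * d + d   # two linear layers with biases
    norms = 2 * 2 * d                            # two LayerNorms (scale and shift)
    return layers * (attn + mlp + norms)


total = vit_encoder_params()
print(f"{total / 1e6:.1f}M parameters in the Transformer layers")
```

Under these assumptions the layers alone come to roughly 302M parameters, in the same range as the 307M figure cited above. Patch and position embeddings and any prediction head are not counted here.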