Vision Transformers for Dense Prediction

Abstract

This paper introduces Dense Prediction Transformers (DPT), an architecture that uses vision transformers in place of convolutional networks as the backbone for dense prediction tasks. The model reassembles tokens from several stages of the vision transformer into image-like representations at multiple resolutions and progressively combines them with a convolutional decoder. The key advantage is that the transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. On monocular depth estimation, DPT improves relative performance by up to 28% over a state-of-the-art fully-convolutional network. It also sets new state-of-the-art results while producing "more fine-grained and globally coherent predictions" than fully-convolutional architectures.

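Paragraph function: Overview of the whole paper, with ViT's two inherent advantages (constant resolution plus a global receptive field) as the central selling point.

Logical role: The abstract positions DPT as a natural extension of ViT: ViT's properties are inherently suited to dense prediction, and DPT supplies a concrete architecture that realizes this potential. The 28% improvement provides strong quantitative support.

Argumentative technique / potential weakness: The phrase "constant and relatively high resolution" needs qualification. The number of ViT tokens depends on the patch size, and 16x16 patches actually reduce the resolution to 1/16 of the input along each dimension. What the authors mean is that the tokens are not downsampled any further, not that the resolution is high in a literal sense.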

1. Introduction

Dense prediction tasks such as monocular depth estimation and semantic segmentation require generating per-pixel outputs from input images. Dominant approaches use convolutional neural networks with encoder-decoder architectures, where the encoder progressively downsamples the input, losing fine-grained spatial information. While skip connections and feature pyramid networks partially mitigate this loss, the fundamental issue remains: "downsampling has distinct drawbacks that are particularly salient in dense prediction tasks: feature resolution and granularity are lost." The authors argue that vision transformers offer a natural solution because they maintain constant resolution throughout processing and have global receptive fields at every stage.

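Paragraph function: Establishes the research motivation, centered on a critique of the structural drawbacks of CNN downsampling.

Logical role: The starting point of the chain of argument: CNN downsampling is a structural obstacle for dense prediction, and skip connections are only a patch rather than a fundamental fix. ViT's constant resolution avoids the problem at its root.

Argumentative technique / potential weakness: Describing skip connections as "partially mitigating" the loss rather than "effectively solving" it neatly leaves room for the ViT approach to appear necessary. However, architectures such as HRNet already achieve strong dense prediction without sacrificing resolution, and the authors do not mention them.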

2. Related Work

Encoder-decoder architectures for dense prediction, pioneered by FCN and refined by U-Net and DeepLab, have become the standard paradigm. MiDaS achieved strong zero-shot monocular depth estimation by training on multiple mixed datasets with a ResNeXt-101 encoder. Vision Transformer (ViT) demonstrated that a pure transformer encoder applied to sequences of image patches can match or exceed CNNs for classification, but its adaptation to dense prediction is non-trivial because the output tokens lack explicit spatial structure. SETR made an initial attempt at using ViT for semantic segmentation but did not fully exploit the representation capacity of intermediate transformer layers.

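Paragraph function: Positions the work in the literature, placing DPT at the intersection of CNN-based dense prediction and the adaptation of ViT.

Logical role: MiDaS serves as the comparison baseline for depth estimation (its CNN backbone is replaced directly in the later experiments). SETR serves as a pioneering but incomplete attempt at using ViT for dense prediction, paving the way for DPT's more complete approach.

Argumentative technique / potential weakness: Pointing out that SETR "did not fully exploit" intermediate layers neatly foreshadows DPT's core design: extracting and fusing features from multiple intermediate layers. But SETR is concurrent work, so whether DPT's criticism is fair depends on the exact timeline and technical details.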

3. Method

3.1 Transformer Encoder

DPT uses a Vision Transformer (ViT) as the encoder backbone. Three variants are employed: ViT-Base (12 layers, patch size p=16), ViT-Large (24 layers, token feature dimension D=1024), and ViT-Hybrid (a ResNet50 embedding network followed by 12 transformer layers). Tokens are extracted from four evenly spaced transformer layers: for ViT-Base with 12 layers, from layers {3, 6, 9, 12}. Unlike CNN encoders, which produce feature maps at decreasing resolutions, all extracted token sets have the same spatial resolution (H/p x W/p), preserving fine-grained information that would be lost in a CNN's deeper layers.

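Paragraph function: Encoder design, describing how multi-level features are extracted from intermediate ViT layers.

Logical role: The evenly spaced extraction strategy is simple but needs theoretical support: does the CNN intuition that shallow layers capture local texture and deep layers capture semantics also hold for transformers? The authors validate this assumption indirectly through experiments.

Argumentative technique / potential weakness: The emphasis on "the same spatial resolution" is the key differentiator relative to CNNs. But p=16 means the token resolution is only 1/16 of the original image along each dimension, which remains a significant bottleneck for tasks that need pixel-level precision, such as depth estimation.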
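
The token-grid arithmetic and the evenly spaced layer selection described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the 384x384 input size are assumptions chosen for the example, and the even-spacing rule is only checked against the ViT-Base case stated above (real implementations may pick slightly different layers for deeper variants).

```python
# Sketch: token grid size and evenly spaced tap layers for a ViT encoder.
# Function names and the 384x384 example input are illustrative assumptions.

def token_grid(height, width, patch_size):
    """Return (rows, cols) of the token grid for non-overlapping patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return height // patch_size, width // patch_size


def evenly_spaced_layers(depth, num_taps=4):
    """Pick num_taps 1-based layer indices, evenly spaced, ending at the last layer."""
    if depth % num_taps:
        raise ValueError("depth must be divisible by num_taps for even spacing")
    step = depth // num_taps
    return [step * (i + 1) for i in range(num_taps)]


rows, cols = token_grid(384, 384, patch_size=16)
print(rows, cols, rows * cols)    # 24 24 576
print(evenly_spaced_layers(12))   # [3, 6, 9, 12]
```

With p=16, a 384x384 image becomes a 24x24 grid, which makes the 1/16 bottleneck noted above concrete: every tapped layer sees the same 576 patch tokens.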

3.2 Reassemble Operation

The Reassemble operation converts 1D token sequences back into 2D spatial feature maps at various resolutions. It consists of three steps: (1) Read: handles the readout token (the extra token that is not tied to any image patch) through one of three strategies: ignore, add, or project (concatenate and project); (2) Concatenate: reshapes the 1D sequence of patch tokens into a 2D feature map at H/p x W/p resolution; (3) Resample: applies a 1x1 convolution to adjust the channel dimension, followed by a strided 3x3 transposed convolution (to upsample) or a strided 3x3 convolution (to downsample), producing feature maps at four resolutions: H/4, H/8, H/16, and H/32. This creates a multi-scale feature representation analogous to CNN feature pyramids.
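
The Read and Concatenate steps can be illustrated on plain Python lists. This is a minimal sketch under stated assumptions: the readout token is assumed to sit at index 0, patch tokens are assumed to be in row-major order, and all function names are hypothetical. The "project" variant is omitted because it requires a learned linear layer.

```python
# Sketch of the Read and Concatenate steps on plain Python lists.
# Assumptions: tokens[0] is the readout token; the rest are patch tokens
# in row-major order. All names here are illustrative.

def read_ignore(tokens):
    """Drop the readout token and keep only the patch tokens."""
    return [list(t) for t in tokens[1:]]


def read_add(tokens):
    """Add the readout token element-wise to every patch token."""
    readout = tokens[0]
    return [[x + r for x, r in zip(t, readout)] for t in tokens[1:]]


def concatenate(patch_tokens, rows, cols):
    """Reshape a flat token list into a rows x cols grid of feature vectors."""
    if len(patch_tokens) != rows * cols:
        raise ValueError("token count does not match the grid size")
    return [patch_tokens[r * cols:(r + 1) * cols] for r in range(rows)]


# One readout token plus a 2x3 grid of 2-dimensional patch tokens.
tokens = [[10.0, 20.0]] + [[float(i), float(-i)] for i in range(6)]

grid = concatenate(read_add(tokens), rows=2, cols=3)
print(len(grid), len(grid[0]), len(grid[0][0]))  # 2 3 2
print(grid[1][2])                                # [15.0, 15.0]
```

After this step the tokens form an image-like grid, so ordinary convolutions can be applied to it in the Resample step.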

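Paragraph function: The core architectural module, describing how spatial structure is rebuilt from transformer tokens.

Logical role: The Reassemble operation is DPT's bridging module: it converts the transformer's sequence output into spatial feature maps that the decoder can process. The four resolutions match the input requirements of the RefineNet decoder.

Argumentative technique / potential weakness: Using transposed convolutions to upsample creates the multi-scale representation "artificially," in contrast to CNNs, which produce it "naturally" through downsampling. This suggests that the transformer's constant resolution is both an advantage and a challenge: an extra module is needed to fit it into existing multi-scale frameworks.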
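
To make the "artificial" multi-scale construction concrete, the sketch below works out which direction and factor of resampling each target resolution needs when all tokens start at H/p. It is bookkeeping only, with a hypothetical helper name, and it does not say how a given factor is realized in layers (for example, one transposed convolution or several).

```python
# Sketch: which Resample direction each target resolution needs.
# The helper name and the tuple layout are illustrative assumptions.

def resample_plan(patch_size, target_strides=(4, 8, 16, 32)):
    """For each target stride s (feature map at H/s), return (s, op, factor)."""
    plan = []
    for s in target_strides:
        if s < patch_size:
            # Finer than the token grid: upsample (transposed convolution).
            plan.append((s, "upsample", patch_size // s))
        elif s > patch_size:
            # Coarser than the token grid: downsample (strided convolution).
            plan.append((s, "downsample", s // patch_size))
        else:
            plan.append((s, "keep", 1))
    return plan


for stride, op, factor in resample_plan(patch_size=16):
    print(f"H/{stride}: {op} x{factor}")
# H/4: upsample x4
# H/8: upsample x2
# H/16: keep x1
# H/32: downsample x2
```

Two of the four pyramid levels are finer than the H/16 token grid, so they are interpolated from it by learned upsampling and contain no information beyond what the tokens already hold.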

3.3 Fusion Decoder

The reassembled multi-scale features are combined using a RefineNet-based fusion module. Starting from the lowest resolution (H/32), features are progressively upsampled and fused with higher-resolution features using residual convolution units and multi-resolution fusion blocks. The final output is at H/2 resolution for depth estimation and at the full input resolution (H) for semantic segmentation. The decoder is deliberately kept lightweight to emphasize that the representation quality of the transformer encoder, rather than decoder complexity, drives the performance gains.

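Paragraph function: Decoder design, describing the progressive multi-scale fusion strategy.

Logical role: The deliberately lightweight design signals that the authors intend to attribute the performance gains to the transformer encoder rather than to an elaborate decoder. This keeps the paper's contribution focused.

Argumentative technique / potential weakness: Reusing the existing RefineNet as the decoder is a controlled-variable experimental strategy: only the encoder changes (CNN -> ViT) while the decoder stays fixed, so performance differences can be attributed to the encoder. A more sophisticated decoder, however, might unlock further potential in the transformer features.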
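
The coarse-to-fine order of the fusion decoder can be traced with simple stride bookkeeping. The sketch below assumes that each fusion stage ends with a 2x upsampling, which is consistent with going from H/32 to H/2 over four stages but is not spelled out in the text above. The function name is hypothetical and no actual convolution is performed.

```python
# Sketch: resolution trace through a four-stage coarse-to-fine fusion decoder.
# Assumption: every fusion stage ends with a 2x upsampling of its output.

def fusion_trace(strides=(32, 16, 8, 4)):
    """Return (input_stride, output_stride) for each stage, coarsest first."""
    trace = []
    current = None  # stride of the previous stage's output, if any
    for s in strides:
        if current is not None and current != s:
            raise ValueError("previous output does not match this stage's resolution")
        current = s // 2  # 2x upsampling halves the stride
        trace.append((s, current))
    return trace


for in_stride, out_stride in fusion_trace():
    print(f"fuse at H/{in_stride} -> output at H/{out_stride}")
# fuse at H/32 -> output at H/16
# fuse at H/16 -> output at H/8
# fuse at H/8 -> output at H/4
# fuse at H/4 -> output at H/2
```

Each stage's output lands exactly on the resolution of the next reassembled feature map, which is why the two can be summed without further resizing.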

4. Experiments

On monocular depth estimation, DPT-Large achieves a 28% relative improvement in zero-shot transfer over the CNN-based MiDaS baseline, including 13.2% on DIW and 31.2% on ETH3D. After fine-tuning, DPT sets a new state of the art on NYUv2 (0.110 AbsRel) and KITTI (0.062 AbsRel). On semantic segmentation, DPT-Hybrid achieves 49.02% mIoU on ADE20K, exceeding the prior best of 48.36%, and 60.46% mIoU on Pascal Context, also a new state of the art. Qualitative analysis reveals that DPT produces "finer-grained delineations of object boundaries" and more globally consistent depth maps: objects at similar distances receive more uniform depth values. Inference latency is comparable: 35ms for DPT-Large vs. 32ms for MiDaS.

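Paragraph function: Comprehensive empirical validation, verifying DPT's advantage across two dense prediction tasks and multiple datasets.

Logical role: The experiments cover four dimensions: (1) zero-shot depth (28% improvement), (2) fine-tuned depth (SOTA), (3) semantic segmentation (SOTA), and (4) latency (comparable). The zero-shot improvement is especially compelling because it reflects a difference in pure representational capacity.

Argumentative technique / potential weakness: Placing qualitative advantages (boundary detail, global consistency) alongside quantitative metrics strengthens the case. The 35ms vs. 32ms latency comparison defuses efficiency concerns. But DPT-Large uses ViT-Large, which is larger than MiDaS's ResNeXt-101, and a fair comparison of parameter counts and compute is not presented explicitly.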
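
As a quick sanity check on the "comparable latency" and "new best mIoU" claims, the following sketch computes the relative latency overhead and the absolute mIoU gain from the figures quoted above. The helper name is an assumption. No formula for the 28% zero-shot figure is given in this summary, so it is not recomputed here.

```python
# Sketch: putting the quoted latency and mIoU numbers in relative terms.
# Only figures stated in the text above are used; the helper is illustrative.

def relative_change(baseline, value):
    """Fractional change of value with respect to baseline."""
    return (value - baseline) / baseline


latency_overhead = relative_change(32.0, 35.0)  # MiDaS 32ms -> DPT-Large 35ms
miou_gain = 49.02 - 48.36                       # ADE20K, percentage points

print(f"latency overhead: {latency_overhead:.1%}")  # latency overhead: 9.4%
print(f"mIoU gain: {miou_gain:.2f} points")         # mIoU gain: 0.66 points
```

The 3ms difference is roughly a 9% slowdown, and the ADE20K margin is under one mIoU point.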

5. Conclusion

Dense Prediction Transformers demonstrate that vision transformers can effectively serve as backbones for dense prediction tasks, producing "more fine-grained and globally coherent predictions when compared to fully-convolutional architectures." The key architectural contributions — the reassemble operation and multi-scale fusion decoder — enable seamless integration of ViT into existing dense prediction frameworks. The consistent improvements across depth estimation and semantic segmentation validate that the constant resolution and global receptive field of transformers are particularly beneficial for tasks requiring per-pixel understanding. The results suggest that transformer-based backbones will become the dominant paradigm for dense prediction.

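Paragraph function: Summarizes the paper and elevates DPT's success into a prediction of a paradigm shift.

Logical role: The conclusion moves from concrete results (SOTA on depth and segmentation) to a paradigm prediction (transformers will dominate dense prediction). Later work such as SegFormer and Mask2Former has partly borne this prediction out.

Argumentative technique / potential weakness: The bold "dominant paradigm" prediction was forward-looking at the time. But DPT's success depends in part on large-scale pretraining (ImageNet-21K); on small and medium-scale data, CNN backbones may still have the edge. In addition, later hierarchical designs such as Swin Transformer suggest that constant resolution may not be the only right direction.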

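Argument Structure Overview

Problem: CNN downsampling loses fine-grained spatial information
→ Claim: ViT's constant resolution plus global receptive field
→ Evidence: depth estimation +28%; segmentation 49.02 mIoU
→ Rebuttal: comparable inference latency (35ms vs. 32ms)
→ Conclusion: transformer backbones will dominate dense prediction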

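The Authors' Core Claim (in One Sentence)

Replacing convolutional networks with vision transformers as the encoder backbone for dense prediction, combined with the Reassemble operation and multi-scale fusion, yields finer-grained and more globally coherent predictions and significantly outperforms fully-convolutional architectures on depth estimation and semantic segmentation.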

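Strongest Part of the Argument

The large improvement in zero-shot transfer: a 28% relative improvement in zero-shot depth estimation is a metric that purely reflects the encoder's representational capacity, unaffected by fine-tuning tricks or task-specific design. The 31.2% improvement on ETH3D is especially notable, and inference latency rises by only 3ms, showing an excellent performance-efficiency trade-off. The sharper object boundaries in the qualitative results directly support the core hypothesis that constant resolution preserves fine-grained information.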

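Weakest Part of the Argument

Incomplete attribution analysis: the performance gains stem from three entangled factors of ViT: (1) constant resolution (no downsampling), (2) a global receptive field (self-attention), and (3) larger-scale pretraining (ImageNet-21K). The paper does not adequately separate the individual contributions of these three factors. Under the same amount of pretraining data (for example, if ResNeXt-101 were also pretrained on ImageNet-21K), the improvement might shrink. In addition, the existence of ViT-Hybrid (ResNet50 + transformer) suggests that a pure transformer is not the only effective design.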