Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Abstract

Although the Transformer has recently shown encouraging results in computer vision, it still has limitations as a backbone for dense prediction tasks such as object detection and segmentation. Vision Transformer (ViT) produces single-scale, low-resolution outputs. Pyramid Vision Transformer (PVT), by contrast, generates multi-scale feature maps through a progressive shrinking pyramid and keeps computation manageable with Spatial-Reduction Attention (SRA). PVT can be plugged into many representative dense prediction pipelines, where it improves significantly over ResNet on COCO detection and ADE20K segmentation.

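Paragraph function: Overview of the whole paper. Starting from the limitations of ViT, it introduces PVT's pyramid design as the solution.

Logical role: The abstract serves a dual function of identifying the problem and previewing the solution. It first pins down ViT's structural shortcomings for dense prediction (single-scale, low-resolution output), then summarizes PVT's core innovation (pyramid structure + SRA) in one sentence.

Argumentation technique / potential weakness: The authors frame ViT as "limited", but ViT was designed for classification, not dense prediction. The problem framing is therefore somewhat strategic: it points out ViT's deficiencies in a domain that ViT never targeted.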

1. Introduction

Convolutional Neural Networks (CNNs) have dominated computer vision for years, with architectures like ResNet and ResNeXt serving as standard backbones for both image classification and dense prediction tasks. Recently, Vision Transformer (ViT) demonstrated that a pure Transformer architecture can achieve competitive classification performance. However, convolution-free Transformer backbones for dense prediction remain largely unexplored. ViT produces only single-scale, low-resolution feature maps, which makes it unsuitable as a drop-in replacement for CNN backbones in detection and segmentation frameworks that require multi-scale feature pyramids.

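Paragraph function: Establishes the research landscape, moving from the dominance of CNNs to the breakthrough of ViT and then to the gap in dense prediction.

Logical role: The starting point of the argument chain. It acknowledges ViT's success in classification while pointing out its structural inadequacy for dense prediction, which builds the case for why PVT is needed.

Argumentation technique / potential weakness: The authors define the research gap narrowly as "convolution-free Transformer backbones for dense prediction", a positioning that secures novelty. In practice, several similar works (such as Swin Transformer) appeared around the time PVT was published, so describing the area as "rarely studied" may overstate the novelty.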

To address this gap, the authors introduce Pyramid Vision Transformer (PVT), a pure transformer backbone that generates multi-scale feature maps for dense prediction. PVT inherits the advantages of both CNN and Transformer: like CNNs, it produces hierarchical feature maps at four stages with progressively decreasing resolutions (stride 4, 8, 16, 32); like Transformers, it captures long-range dependencies through self-attention. The key innovation is the Spatial-Reduction Attention (SRA) mechanism, which reduces the spatial dimensions of keys and values before computing attention, achieving significant computational savings without sacrificing performance.

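Paragraph function: Presents the solution, giving a complete overview of PVT's architecture and core innovation.

Logical role: Following the problem statement in the previous paragraph, this paragraph is the pivot: PVT is positioned as the best fusion of the CNN pyramid structure and the Transformer attention mechanism.

Argumentation technique / potential weakness: The rhetoric of "inheriting the advantages of both" is persuasive, but it invites an implicit question: if PVT combines the traits of both, does it also inherit the weaknesses of both? For example, is the Transformer's high memory demand aggravated in a multi-scale setting? Whether SRA truly comes "without sacrificing performance" needs experimental verification.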

2. Related Work

CNN backbones have evolved from VGGNet to ResNet, ResNeXt, and EfficientNet, all sharing a pyramidal design that produces multi-scale features. Dense prediction frameworks such as Feature Pyramid Network (FPN), RetinaNet, and Mask R-CNN rely heavily on these multi-scale features. In the Transformer domain, ViT and DeiT have demonstrated strong classification performance but keep a columnar, single-resolution architecture throughout all layers, which makes them incompatible with multi-scale detection and segmentation pipelines.

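Paragraph function: Literature review. It systematically connects the evolution of CNN pyramid designs with the structural limitations of Transformers.

Logical role: This paragraph sets up two parallel lines of research (CNN pyramid vs. Transformer columnar), implying that PVT will be where the two lines meet.

Argumentation technique / potential weakness: Describing ViT as a "columnar architecture" highlights its structural weakness, but ViT was designed for classification, where a columnar architecture is a design choice rather than a defect. The authors implicitly reframe a feature of a classification architecture as a defect for dense prediction.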

3. Method

3.1 Overall Architecture

PVT consists of four stages, each producing a feature map at a different scale. In Stage 1, the input image is divided into 4×4 patches, which are linearly embedded into tokens. These tokens are processed by a Transformer encoder, and the output is reshaped into a feature map at 1/4 of the input resolution. Each subsequent stage reduces the spatial resolution by a further factor of 2 through its patch embedding layer, yielding feature maps at 1/8, 1/16, and 1/32 of the input resolution. This progressive shrinking pyramid mirrors the multi-scale design of CNN backbones while relying on patch embeddings and attention rather than convolutions.

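Paragraph function: Architecture overview, describing the overall design of the four-stage pyramid.

Logical role: This paragraph is the foundation of the method. By shrinking step by step over four stages, PVT builds the same multi-scale feature hierarchy as a CNN, so it can be plugged directly into existing dense prediction frameworks such as FPN.

Argumentation technique / potential weakness: The phrase "mirrors CNN backbones" suggests that PVT is a drop-in replacement for a CNN and lowers the barrier to understanding. The correspondence is not perfect in the details, however: a CNN's local receptive field grows layer by layer, whereas a Transformer has a global receptive field from the first layer, so the two kinds of features may differ in nature.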
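
The four-stage pyramid of Section 3.1 can be traced with a little arithmetic. The sketch below is illustrative only: the function name `pvt_stage_shapes` and the 224×224 input size are assumptions made for the example, and only the patch sizes (4 in Stage 1, then 2 per stage) come from the text above.

```python
def pvt_stage_shapes(height, width, patch_sizes=(4, 2, 2, 2)):
    """Return (stride, feature_h, feature_w, num_tokens) for each stage.

    patch_sizes follows the text: 4x4 patches in Stage 1, then a further
    2x reduction in each of the three later stages.
    """
    shapes = []
    stride = 1
    for patch in patch_sizes:
        stride *= patch  # cumulative downsampling relative to the input
        if height % stride or width % stride:
            raise ValueError("input size must be divisible by the total stride")
        feat_h, feat_w = height // stride, width // stride
        shapes.append((stride, feat_h, feat_w, feat_h * feat_w))
    return shapes


# Hypothetical 224x224 input, chosen only for illustration.
for stage, (stride, h, w, n) in enumerate(pvt_stage_shapes(224, 224), start=1):
    print(f"stage {stage}: stride {stride}, feature map {h}x{w}, {n} tokens")
```

Under these assumptions the token count falls from 3136 in Stage 1 to 49 in Stage 4, which is the multi-scale hierarchy that FPN-style heads expect.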

3.2 Spatial-Reduction Attention

Standard Multi-Head Attention (MHA) has computational complexity O(n^2), where n is the number of tokens, which makes it prohibitively expensive for high-resolution feature maps. To address this, the authors propose Spatial-Reduction Attention (SRA), which applies a spatial reduction operation to the keys and values before computing attention. Specifically, the key and value sequences are reshaped into 2D feature maps and downsampled by a factor R_i along each spatial dimension using a linear layer, reducing the sequence length from n to n/R_i^2. This cuts the computational and memory cost of attention by a factor of R_i^2 compared to standard MHA, making it feasible to process high-resolution feature maps in early stages.

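Paragraph function: Core technical innovation, detailing how SRA works and where its efficiency advantage comes from.

Logical role: SRA is the key factor that differentiates PVT from ViT. This paragraph builds the causal chain from the O(n^2) problem to the SRA solution, directly answering the common objection that Transformers are too expensive to compute.

Argumentation technique / potential weakness: The design of SRA is elegant and intuitive: spatially reducing the keys and values amounts to having full-resolution queries attend to a downsampled copy of the feature map. The potential problem is whether spatial reduction discards fine-grained spatial information. In segmentation tasks that need pixel-level precision in particular, this information loss could affect final performance.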
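
To make the mechanics of Section 3.2 concrete, here is a minimal single-head sketch in plain Python. It is not the authors' implementation. The learned linear reduction is replaced by simple window averaging (an assumption, chosen to avoid parameters), and the query/key/value projections, multi-head splitting, and normalization are omitted. The names `spatial_reduce`, `attention`, and `sra` and the toy sizes are hypothetical.

```python
import math


def spatial_reduce(tokens, height, width, ratio):
    """Shrink height*width tokens to (height/ratio)*(width/ratio) tokens.

    ASSUMPTION: each ratio x ratio window is averaged. The text describes a
    learned linear layer here; averaging is a parameter-free stand-in.
    """
    assert height % ratio == 0 and width % ratio == 0
    channels = len(tokens[0])
    reduced = []
    for top in range(0, height, ratio):
        for left in range(0, width, ratio):
            # Tokens are stored row-major, so (row, col) -> row * width + col.
            window = [tokens[(top + dy) * width + (left + dx)]
                      for dy in range(ratio) for dx in range(ratio)]
            reduced.append([sum(tok[c] for tok in window) / len(window)
                            for c in range(channels)])
    return reduced


def attention(queries, keys, values):
    """Plain single-head scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values)) / total
                        for c in range(len(values[0]))])
    return outputs


def sra(tokens, height, width, ratio):
    """Full-resolution queries attend to spatially reduced keys and values."""
    reduced = spatial_reduce(tokens, height, width, ratio)
    return attention(tokens, reduced, reduced), len(reduced)


# Toy 8x8 feature map with 4 channels (values are arbitrary).
H, W, C, R = 8, 8, 4, 2
tokens = [[math.sin(i + c) for c in range(C)] for i in range(H * W)]
out, num_keys = sra(tokens, H, W, R)
print(len(out), num_keys)                      # 64 16
print((H * W * H * W) // (H * W * num_keys))   # 4, i.e. R^2 fewer score entries
```

The output keeps one vector per query, so the feature map resolution is unchanged. Only the key/value side shrinks, which is where the R_i^2 saving in the attention score matrix comes from. With `ratio=1` the sketch reduces to ordinary self-attention.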

3.3 Progressive Shrinking

Unlike CNNs, which use convolutional strides to reduce spatial resolution, PVT places a patch embedding layer at the beginning of each stage to progressively shrink the feature map. At stage i, the feature map from the previous stage is divided into patches of size P_i x P_i, and each patch is linearly projected to form a new token sequence. This gives flexible control over the resolution and channel dimensions at each stage, which is what makes a feature pyramid possible in a Transformer while the early, high-resolution stages retain fine-grained spatial information. The reduction ratios R_i decrease from stage to stage, with the largest reduction in the first stage (R_1=8) and none in the last (R_4=1), which balances computation across stages.

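Paragraph function: Design details, explaining the cross-stage resolution reduction strategy and the allocation of computation.

Logical role: Building on the efficiency design of SRA, this paragraph explains how computation is balanced across stages: large reduction ratios at high resolution and small ones at low resolution, forming a systematic efficiency strategy.

Argumentation technique / potential weakness: The decreasing reduction schedule is a carefully engineered choice, but its optimality is confirmed only empirically. The authors do not explore whether a better schedule exists, such as learned, adaptive reduction ratios. In addition, replacing convolutional strides with patch embedding is conceptually very close to a convolution, so whether the design is truly "convolution-free" is open to debate.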
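
The patch embedding step of Section 3.3 can be sketched as a patch split followed by a linear projection. This is a rough illustration only: `patchify`, `linear`, the toy feature map, and the weight values are all made up for the example. The last loop also assumes R_2 = 4 and R_3 = 2 and a 224×224 input, since the text above states only R_1 = 8 and R_4 = 1.

```python
def patchify(feature_map, patch):
    """Split an H x W x C nested list into flattened patch x patch tokens."""
    height, width = len(feature_map), len(feature_map[0])
    if height % patch or width % patch:
        raise ValueError("feature map size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for dy in range(patch):
                for dx in range(patch):
                    token.extend(feature_map[top + dy][left + dx])
            tokens.append(token)  # length patch * patch * C
    return tokens


def linear(tokens, weight):
    """Project each token with an (in_dim x out_dim) weight matrix, no bias."""
    return [[sum(x * w_row[j] for x, w_row in zip(tok, weight))
             for j in range(len(weight[0]))] for tok in tokens]


# Toy 4x4 map with 3 channels; P_i = 2 gives a 2x2 grid of 12-dim tokens.
fmap = [[[float(y * 4 + x + c) for c in range(3)] for x in range(4)]
        for y in range(4)]
tokens = patchify(fmap, 2)
weight = [[0.01 * (i + j) for j in range(8)] for i in range(12)]  # made-up weights
embedded = linear(tokens, weight)
print(len(tokens), len(tokens[0]), len(embedded[0]))  # 4 12 8

# Key/value length per stage for a hypothetical 224x224 input.
# ASSUMPTION: R_2 = 4 and R_3 = 2; the text only states R_1 = 8 and R_4 = 1.
for stride, ratio in zip((4, 8, 16, 32), (8, 4, 2, 1)):
    side = 224 // stride
    print(stride, side * side, (side // ratio) ** 2)
```

Under these assumptions every stage ends up attending over the same number of reduced keys (49), even though the query count falls from 3136 to 49. That is one way to read the claim that the schedule balances computation across stages.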

4. Experiments

PVT is evaluated on three major benchmarks. On ImageNet classification, PVT-Large achieves 81.7% top-1 accuracy, competitive with ResNeXt-101. For COCO object detection, replacing ResNet-50 with PVT-Small in RetinaNet improves AP by 4.1 points (36.3 to 40.4); in Mask R-CNN, PVT-Small achieves +3.9 AP_box and +3.4 AP_mask improvements over ResNet-50. On ADE20K semantic segmentation with Semantic FPN, PVT-Large achieves 42.1 mIoU, surpassing ResNet-101 (38.8 mIoU) by 3.3 points. These results demonstrate that PVT can serve as a versatile, convolution-free backbone that is competitive with well-established CNN architectures across diverse dense prediction tasks.

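Paragraph function: Comprehensive experimental validation, with quantitative comparisons on classification, detection, and segmentation.

Logical role: This paragraph is the empirical pillar of the paper. The three benchmarks cover classification (ImageNet), detection (COCO), and segmentation (ADE20K), which together support PVT's viability as a general-purpose backbone.

Argumentation technique / potential weakness: The experiments deliberately span several downstream frameworks (RetinaNet, Mask R-CNN, Semantic FPN) to show PVT's versatility. The baselines, however, are mainly ResNet and ResNeXt, with no comparison against contemporaneous hierarchical Transformers such as Swin Transformer. It also remains an open question whether PVT's advantage stays clear under matched computational cost (FLOPs).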

5. Conclusion

This paper presents PVT and demonstrates that a pure Transformer architecture with a pyramid structure can effectively replace CNN backbones for dense prediction tasks. Through progressive shrinking and Spatial-Reduction Attention, PVT generates multi-scale features efficiently while capturing long-range dependencies. Extensive experiments on classification, detection, and segmentation validate its versatility. The authors note that Transformer-based models remain in early development compared to mature CNN designs, and they anticipate further improvements as the community develops better training strategies, data augmentation techniques, and architectural innovations for vision Transformers.

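Paragraph function: Summarizes the paper, restating the core contributions and looking ahead to future directions.

Logical role: The conclusion echoes the structure of the abstract and closes the argument loop: problem (ViT is ill-suited to dense prediction) -> solution (PVT pyramid + SRA) -> validation (three benchmarks) -> outlook (early stage, with room to grow).

Argumentation technique / potential weakness: Admitting that the field is "still in early development" is a modest and honest stance, but it also implies that PVT itself may soon be superseded by more mature designs. Indeed, follow-up works such as PVTv2 and Swin Transformer surpassed PVT within a short time, bearing out that prediction.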

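Argument Structure Overview

Problem: ViT lacks multi-scale outputs and is unsuited to dense prediction.
→ Claim: A pyramid Transformer can replace CNN backbones.
→ Evidence: Across COCO and ADE20K tasks, PVT significantly outperforms ResNet.
→ Rebuttal: The SRA mechanism resolves the O(n^2) computation bottleneck.
→ Conclusion: A pure Transformer backbone is a viable direction for dense prediction.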

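Authors' Core Claim (One Sentence)

With a pyramid structure and Spatial-Reduction Attention, a pure Transformer architecture can generate multi-scale feature maps and effectively replace CNN backbones for dense prediction tasks such as detection and segmentation.

Strongest Point of the Argument

Consistent validation across tasks: PVT is competitive on classification, detection, and segmentation, and it can be plugged directly into existing frameworks such as RetinaNet, Mask R-CNN, and Semantic FPN. This demonstrates its practicality as a general-purpose backbone rather than strength on a single task only.

Weakest Point of the Argument

Opaque trade-off between efficiency and accuracy: the paper lacks a systematic analysis of whether SRA's spatial reduction loses fine-grained spatial information. In addition, comparisons with contemporaneous hierarchical Transformers (such as Swin) under matched FLOPs are insufficient, which makes it hard to judge whether PVT's design choices are the best available.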