DaViT: Dual Attention Vision Transformers

Abstract

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that captures global context while maintaining computational efficiency. The key idea is to alternate between spatial window attention and channel group attention in each transformer block. Spatial window attention captures local fine-grained features within non-overlapping windows, while channel group attention models global interactions by attending among channel tokens, each of which spans all spatial locations. Extensive experiments demonstrate that DaViT achieves state-of-the-art performance on ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation, outperforming existing vision transformers with comparable computational budgets.

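Paragraph function: Overview of the whole paper. It summarizes DaViT's core idea, the dual attention mechanism, and its performance on multiple benchmarks.

Logical role: The abstract previews the argument as "design (alternating dual attention) → advantage (global plus local) → validation (multi-task SOTA)".

Argumentative technique / potential weakness: The phrase "simple yet effective" lowers the reader's expectation of complexity, while the breadth of multi-task validation adds persuasive force. However, the abstract does not discuss how the "alternating" strategy compares with a "parallel" dual attention design.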

1. Introduction

Vision Transformers (ViTs) have achieved remarkable success in computer vision, demonstrating strong performance on image classification, object detection, and semantic segmentation. However, the quadratic computational complexity of global self-attention with respect to the number of tokens poses a significant challenge for processing high-resolution images. Various strategies have been proposed to address this limitation, including local window attention as used in Swin Transformer, dilated attention, and linear attention approximations. While effective at reducing computation, these approaches often sacrifice the ability to model long-range dependencies at every layer.

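Paragraph function: Establishes the problem. It points out the tension in existing ViTs between efficiency and global modeling.

Logical role: Starting point of the argument chain: quadratic complexity → existing solutions → their respective shortcomings (sacrificing long-range dependencies), which paves the way for DaViT's dual attention design.

Argumentative technique / potential weakness: By listing several existing approaches (Swin, dilated, linear) and criticizing them collectively for "sacrificing long-range dependencies", the paragraph builds a strong motivation. However, some methods (such as Swin's shifted windows) do propagate information across windows indirectly, so the criticism is somewhat overstated.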
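
To make the quadratic growth mentioned in this paragraph concrete, the following sketch counts tokens and token pairs for two input sizes. The patch size of 4, the image sizes, and the function names are assumptions chosen for illustration; they are not taken from the text above.

```python
def num_tokens(image_size: int, patch_size: int = 4) -> int:
    """Tokens produced by splitting a square image into square patches."""
    side = image_size // patch_size
    return side * side


def global_attention_pairs(n_tokens: int) -> int:
    """Global self-attention scores every token against every token."""
    return n_tokens * n_tokens


small = num_tokens(224)  # hypothetical lower-resolution input
large = num_tokens(448)  # hypothetical input with twice the side length
ratio = global_attention_pairs(large) / global_attention_pairs(small)
print(small, large, ratio)  # prints: 3136 12544 16.0
```

Doubling the side length gives four times as many tokens but sixteen times as many token pairs, which is the cost that local and linear attention variants try to avoid.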

We observe that attention can be computed along two orthogonal dimensions: spatial and channel. Spatial attention computes relationships between tokens at different spatial positions within a local region, focusing on "where" to attend. Channel attention computes relationships among the feature channels, each of which summarizes the whole spatial extent of the feature map, focusing on "what" features are important. By alternating between these two types of attention, we can effectively capture both local spatial patterns and global semantic information without incurring the quadratic cost of full global attention. This dual perspective provides a more comprehensive representation than either attention type alone.

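Paragraph function: Introduces the core insight, the view of spatial and channel attention as orthogonal.

Logical role: Transition from problem to solution. The "orthogonal dimensions" view gives the dual attention design a theoretical basis and recasts the tension between efficiency and performance as a strategy of handling the two dimensions separately.

Argumentative technique / potential weakness: The notion of "orthogonality" lends the design mathematical elegance. However, whether spatial and channel attention are truly orthogonal (that is, whether the information they capture does not overlap at all) is an empirical question, and the theoretical guarantee is not yet sufficient.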
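
The claim that alternating the two attention types avoids the quadratic cost can be checked with a rough operation count. The sketch below counts only the multiply-adds of the two attention matrix products (scores and weighted sum) and ignores the linear projections. It assumes the channel-token form of channel attention, in which each group attends among its own channels. The width of 96, the window of 49 tokens, and the 3 groups are hypothetical values.

```python
def global_cost(n: int, c: int) -> int:
    """Scores (n x n) and weighted sum over all n tokens of width c."""
    return 2 * n * n * c


def window_cost(n: int, c: int, window_tokens: int) -> int:
    """n / window_tokens windows, each running full attention on its own tokens."""
    num_windows = n // window_tokens
    return num_windows * 2 * window_tokens * window_tokens * c


def channel_group_cost(n: int, c: int, groups: int) -> int:
    """Each group attends among c / groups channel tokens of length n."""
    cg = c // groups
    return groups * 2 * cg * cg * n


n, c = 56 * 56, 96  # hypothetical token count and channel width
for scale in (1, 4):
    tokens = n * scale
    print(tokens,
          global_cost(tokens, c),
          window_cost(tokens, c, 49),
          channel_group_cost(tokens, c, 3))
```

Under these assumptions, quadrupling the token count multiplies the global cost by 16 but the window and channel group costs only by 4.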

2. Related Work

Hierarchical vision transformers have become the dominant architecture paradigm. Swin Transformer introduced shifted window attention to enable cross-window information exchange. PVT and Twins use spatial reduction to decrease the key-value length in attention computation. CSWin Transformer proposes cross-shaped window attention for more efficient global modeling. Despite their effectiveness, most of these methods focus solely on spatial attention and do not explicitly model channel-wise interactions. In the CNN literature, channel attention mechanisms such as SE-Net and CBAM have demonstrated the importance of adaptive channel weighting. DaViT bridges these two lines of research by incorporating both spatial and channel attention within the transformer framework.

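Paragraph function: Literature review. It traces the development of hierarchical ViTs and of channel attention.

Logical role: Positions DaViT at the intersection of "spatial-attention ViTs" and "channel-attention CNNs", establishing a distinctive research niche.

Argumentative technique / potential weakness: The narrative of "bridging two lines of research" raises DaViT's perceived academic value. However, channel attention methods such as SE-Net compute attention in a way that differs fundamentally from DaViT's channel group attention, so the accuracy of the analogy is debatable.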

Beyond attention mechanism design, multi-scale feature extraction is another critical component of modern vision architectures. Methods like FPN and BiFPN have demonstrated the importance of combining features at different resolutions for dense prediction tasks. DaViT adopts a four-stage hierarchical structure with progressively reduced spatial resolution and increased channel dimensions, similar to the design philosophy of Swin and PVT. However, unlike these methods, DaViT's dual attention mechanism ensures that global context is captured at every stage through channel group attention, rather than relying on indirect cross-window mechanisms.

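Paragraph function: Differentiation. It emphasizes DaViT's distinctive advantage over other hierarchical ViTs.

Logical role: The claim that "global context is captured at every stage" answers the introduction's criticism about "sacrificing long-range dependencies" and establishes DaViT's core differentiation.

Argumentative technique / potential weakness: Calling them "indirect mechanisms" is a tactful criticism of Swin's shifted-window strategy and effectively highlights DaViT's own advantage. However, whether channel group attention truly amounts to "global context" depends on the grouping scheme and the number of channels.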
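
The four-stage layout described above can be illustrated by tracking feature map shapes from stage to stage. This is a minimal sketch that assumes each transition halves the side length and doubles the channel count, as the Method section states for patch merging. The input size of 224, the patch size of 4, and the base width of 96 are hypothetical values used only for illustration.

```python
def stage_shapes(image_size=224, patch_size=4, base_channels=96, num_stages=4):
    """Return (height, width, channels) of the feature map at each stage."""
    side = image_size // patch_size
    channels = base_channels
    shapes = []
    for _ in range(num_stages):
        shapes.append((side, side, channels))
        side //= 2       # spatial resolution reduced by 2x between stages
        channels *= 2    # channel dimension doubled between stages
    return shapes


print(stage_shapes())
# prints: [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```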

3. Method

The DaViT architecture follows a hierarchical design with four stages. Each stage consists of multiple Dual Attention Transformer (DAT) blocks, where each block contains two sequential attention layers: a spatial window attention layer followed by a channel group attention layer. In the spatial window attention layer, the input feature map is partitioned into non-overlapping windows of fixed size (e.g., 7x7), and standard multi-head self-attention is applied within each window. This captures local spatial relationships at a cost that grows linearly with the number of tokens. In the channel group attention layer, the feature channels are divided into groups, and attention is computed among the channels of each group, where every channel acts as a token that spans all spatial positions. This captures global context at a cost that is also linear in the number of tokens and is governed by the number of channels per group.

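Paragraph function: Develops the method. It describes the dual attention structure of the DAT block in detail.

Logical role: Turns the design idea of "alternating dual attention" into concrete architectural details: the ordering of spatial window attention followed by channel group attention, and the complexity of each.

Argumentative technique / potential weakness: The explicit complexity analysis (linear for window attention versus a cost set by the channel grouping) effectively answers efficiency concerns. However, the choice of the number of channel groups directly affects the granularity of global modeling, and a sensitivity analysis of this hyperparameter deserves attention.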
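
The window partition used by the spatial window attention layer can be sketched as follows. The code only groups token coordinates into non-overlapping windows and counts the token pairs that attention would score inside them. The 56x56 map is a hypothetical input, and the function names are illustrative rather than taken from any DaViT implementation.

```python
def window_partition(height: int, width: int, window: int):
    """Group the (row, col) coordinates of an H x W map into non-overlapping windows."""
    assert height % window == 0 and width % window == 0, "map must tile evenly"
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([(r, c)
                            for r in range(top, top + window)
                            for c in range(left, left + window)])
    return windows


def attention_pairs(windows) -> int:
    """Token pairs scored when attention runs separately inside each window."""
    return sum(len(w) ** 2 for w in windows)


windows = window_partition(56, 56, 7)   # hypothetical 56x56 map, 7x7 windows
pairs = attention_pairs(windows)
print(len(windows), len(windows[0]), pairs)
```

Because every window holds a fixed number of tokens, the pair count equals the number of tokens times the window size in tokens, so it grows linearly with the size of the feature map.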

The channel group attention operates by reshaping the feature tensor from (B, H, W, C) to (B, G, C/G, N), where G is the number of groups and N = H x W is the number of spatial positions, so that each of the C/G channels in a group becomes a token with an N-dimensional feature. Attention is then computed among the channel tokens of each group independently, resulting in G attention maps of size (C/G) x (C/G) that collectively cover all channels. This formulation allows each group to specialize in different semantic aspects while keeping the attention cost at O(G * N * (C/G)^2) = O(N * C^2 / G), which is linear in N, in contrast to the O(N^2 * C) cost of standard global attention, and it uses significantly smaller attention matrices whenever C/G is much smaller than N. Between stages, patch merging layers reduce the spatial resolution by 2x and double the channel dimension, creating the hierarchical multi-scale structure necessary for dense prediction tasks.

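Paragraph function: Deepens the technical detail. It dissects the tensor operations and the complexity of channel group attention.

Logical role: Uses precise mathematical notation to argue for the computational efficiency of channel group attention, and explains how the multi-scale structure is built.

Argumentative technique / potential weakness: The complexity derivation is clearly laid out. However, the comparison of the stated cost with standard global attention may leave readers questioning the actual speedup: what matters is the difference in constant factors and in attention matrix size, and this point needs to be stated more explicitly.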
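
A minimal sketch of channel group attention for a single image, written in plain Python so the shapes are easy to follow. It assumes the channel-token form, in which each channel of a group is a token whose feature vector runs over all N spatial positions, so each group produces a (C/G) x (C/G) attention map. To stay short it omits the learned query, key and value projections (Q = K = V), and the 1/sqrt(C/G) scaling is an illustrative choice. The function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_group_attention(x, groups):
    """x: N spatial tokens, each a list of C floats. Returns (output, attention maps)."""
    n, c = len(x), len(x[0])
    assert c % groups == 0, "C must be divisible by the number of groups"
    cg = c // groups
    out = [[0.0] * c for _ in range(n)]
    maps = []
    for g in range(groups):
        # Channel tokens of this group: cg vectors, each of length N.
        tokens = [[x[i][g * cg + j] for i in range(n)] for j in range(cg)]
        # Attention among channels: one (cg x cg) map per group.
        attn = [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(cg)
                         for k in tokens])
                for q in tokens]
        maps.append(attn)
        # Each output channel is a weighted mix of the group's channel tokens.
        for j in range(cg):
            for i in range(n):
                out[i][g * cg + j] = sum(attn[j][k] * tokens[k][i] for k in range(cg))
    return out, maps


x = [[(i + j) / 10 for j in range(4)] for i in range(6)]  # toy input: N = 6, C = 4
out, maps = channel_group_attention(x, groups=2)
print(len(out), len(out[0]), len(maps), len(maps[0]), len(maps[0][0]))
```

The attention maps here have size (C/G) x (C/G) regardless of N, which is why this form scales linearly with the number of spatial positions.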

4. Experiments

We evaluate DaViT on three representative benchmarks. On ImageNet-1K classification, DaViT-Tiny (28.3M parameters) achieves 82.8% top-1 accuracy, outperforming Swin-T (81.3%), PVTv2-B2 (82.0%), and CSWin-T (82.7%). DaViT-Small (49.7M) reaches 84.2% accuracy, surpassing Swin-S (83.0%) and CSWin-S (83.6%). DaViT-Base (87.9M) achieves 84.6%, competitive with the best models at this scale. These results demonstrate that the dual attention mechanism consistently provides improvements across different model sizes.

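Paragraph function: Provides classification evidence through a broad comparison with mainstream ViTs on ImageNet.

Logical role: The consistent gains at three scales (Tiny/Small/Base) support the claim that dual attention is broadly effective.

Argumentative technique / potential weakness: The comparison across scales makes the conclusion more robust. However, at the Base scale DaViT is only "competitive" with the best models rather than ahead of them, which suggests the marginal benefit of dual attention may diminish for larger models.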

On COCO object detection using Mask R-CNN, DaViT-Tiny achieves 45.0 box AP and 40.7 mask AP, outperforming Swin-T (43.7/39.8) by significant margins. On ADE20K semantic segmentation with UperNet, DaViT-Small achieves 49.4 mIoU, surpassing Swin-S (48.5) and CSWin-S (49.2). Ablation studies confirm the complementary nature of the two attention types: removing channel group attention decreases ImageNet accuracy by 0.9%, while removing spatial window attention causes a 1.3% drop, demonstrating that both attention mechanisms are essential and contribute unique information.

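Paragraph function: Supplementary evidence, extending to detection, segmentation, and ablation analysis.

Logical role: Multi-task validation together with ablation analysis supports the core claim from two directions: externally (generalization across tasks) and internally (necessity of each component).

Argumentative technique / potential weakness: The ablation study directly tests the core hypothesis that the two attention types are complementary. The contribution of spatial attention (1.3%) exceeds that of channel attention (0.9%), suggesting that local features remain more important for classification.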

5. Conclusion

We have introduced DaViT, a vision transformer architecture that combines spatial window attention with channel group attention to achieve both local fine-grained feature extraction and global context modeling. Through extensive experiments on classification, detection, and segmentation benchmarks, we demonstrate that DaViT consistently outperforms existing vision transformers with comparable computational budgets. The dual attention design provides a simple and effective approach to bridging the gap between computational efficiency and representational capacity in vision transformers. We hope that DaViT inspires further exploration of orthogonal attention designs that jointly optimize for different aspects of visual understanding.

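Paragraph function: Summary of the whole paper, restating the core contribution of dual attention and the outlook for future work.

Logical role: Sums up the method's positioning as a "bridge between efficiency and capacity" and closes with an open-ended outlook on "orthogonal attention designs", leaving room for follow-up research.

Argumentative technique / potential weakness: The conclusion is worded with appropriate caution. However, it does not discuss DaViT's performance under larger-scale pre-training (such as ImageNet-22K) or comparisons with later architectures (such as MetaFormer).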
