Abstract
Transformers have recently gained significant attention in computer vision. However, the quadratic complexity of self-attention with respect to input size has limited their application to tasks that require large receptive fields or high-resolution inputs. In this paper, we introduce Multi-axis Vision Transformer (MaxViT), an efficient and scalable attention model that enjoys global-local interaction throughout the network. We achieve this by decomposing the full attention into two sparse forms — blocked local attention and dilated global attention — which together form a multi-axis attention scheme. MaxViT can be used as a general-purpose vision backbone for a wide range of tasks. We demonstrate state-of-the-art performance on ImageNet classification (85.17% top-1 at 224x224), COCO object detection, and image generation (FID 2.47 on ImageNet 256x256).
Paragraph function: overview of the entire paper, positioning multi-axis attention as the solution to the quadratic-complexity problem.
Logical role: demonstrating SOTA performance on three very different tasks (classification, detection, generation) strongly supports the "general-purpose backbone" positioning.
Argumentation / potential weakness: SOTA across three tasks is an extremely strong argument, but one should check whether the experimental setup for each task is fair (e.g., whether pre-training is used).
1. Introduction
Vision Transformers have shown great promise, but existing efficient attention mechanisms trade off between local and global information. Window attention (Swin Transformer) provides efficient local attention but lacks direct global connectivity within each block. Linear attention approximates global attention but often sacrifices accuracy. Hybrid CNN-Transformer models combine local convolutions with global attention but typically confine global attention to deeper layers only. We argue that both local and global receptive fields should be available at every network stage, from shallow to deep layers.
Paragraph function: critical analysis, systematically identifying the limitations of existing methods.
Logical role: by enumerating the shortcomings of three families of methods one by one, it establishes the case for global-local attention at every stage.
Argumentation / potential weakness: the systematic analysis is persuasive, but Swin's shifted-window mechanism does in fact provide cross-window connections, so the authors' characterization of it may be oversimplified.
2. Method
The core of MaxViT is the Multi-axis Attention, which consists of two complementary operations applied sequentially within each block. First, Block Attention partitions the input into non-overlapping local windows (e.g., 7x7) and applies standard self-attention within each window, capturing local patterns; for a fixed window size the cost is linear in the total number of tokens n, i.e., O(n). Second, Grid Attention groups tokens that occupy the same position within their respective windows, forming a sparse global grid, and applies self-attention within each group. This dilated pattern enables every token to attend to tokens across the entire spatial extent, again at O(n) cost for a fixed grid size. Together, the two operations achieve full spatial coverage with linear overall complexity.
Paragraph function: core method, the complementary design of block attention and grid attention.
Logical role: combining two O(n) operations to achieve full spatial coverage is the method's central innovation.
Argumentation / potential weakness: the "local + global = full coverage" argument is intuitive and compelling, but grid attention selects only one token per local position from each window, which may lose diversity within a spatial region.
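The two partition schemes described above reduce to plain array reshapes. A minimal sketch, assuming a square feature map whose side is divisible by the window and grid sizes; the function names are illustrative, not taken from the paper's released code:

```python
import numpy as np

def block_partition(x, window):
    """Split an (H, W, C) map into non-overlapping windows.

    Returns (num_windows, window*window, C); applying self-attention
    within each group yields local block attention.
    """
    H, W, C = x.shape
    x = x.reshape(H // window, window, W // window, window, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, C)

def grid_partition(x, grid):
    """Group tokens that share the same intra-window offset.

    Returns (num_groups, grid*grid, C); each group spans the whole
    image, so attention within it is dilated and global.
    """
    H, W, C = x.shape
    x = x.reshape(grid, H // grid, grid, W // grid, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, grid * grid, C)
```

On a 4x4 map with 2x2 windows, `block_partition` groups spatially adjacent tokens, while `grid_partition` groups tokens strided two apart, which is exactly the dilated global pattern the paragraph describes.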
Each MaxViT block further integrates MBConv (Mobile Inverted Bottleneck Convolution) before the attention layers, forming a Conv-Block-Grid architecture. The MBConv layer provides inductive biases such as locality and translation equivariance that complement the attention mechanism. The overall architecture follows a hierarchical multi-scale design with 4 stages, where spatial resolution decreases and channel dimension increases progressively, similar to traditional CNNs. This design enables seamless integration as a drop-in backbone replacement for existing detection and segmentation frameworks.
Paragraph function: architecture design, a hybrid combining convolution and attention.
Logical role: MBConv supplies inductive biases and the hierarchical design ensures compatibility, making MaxViT more practical.
Argumentation / potential weakness: the drop-in design philosophy lowers the adoption barrier, but the Conv + Attention combination increases architectural complexity.
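The hierarchical layout can be made concrete with a small shape calculation. A sketch assuming a stride-2 stem and a stride-2 downsample at each of the four stages (overall /32, typical of hierarchical backbones); the channel widths are illustrative, not quoted from the paper:

```python
def backbone_shapes(h, w, stem_stride=2, channels=(64, 128, 256, 512)):
    """Feature-map shapes produced by a 4-stage hierarchical backbone.

    Resolution halves at every stage while channel width grows, so the
    backbone emits the multi-scale feature pyramid that detection and
    segmentation frameworks expect from a drop-in replacement.
    """
    h, w = h // stem_stride, w // stem_stride  # stem downsampling
    shapes = []
    for c in channels:
        h, w = h // 2, w // 2  # each stage downsamples by 2
        shapes.append((h, w, c))
    return shapes
```

For a 224x224 input this yields 56x56, 28x28, 14x14, and 7x7 maps, the familiar multi-scale pyramid consumed by frameworks such as Cascade Mask R-CNN.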
3. Experiments
On ImageNet-1K classification, MaxViT-L achieves 85.17% top-1 accuracy at 224x224 without extra training data, outperforming CoAtNet-3 by 0.67% and Swin-L by 1.27%. At 384x384, MaxViT-L reaches 86.40%. On COCO object detection with Cascade Mask R-CNN, MaxViT-L achieves 55.4 box AP, surpassing Swin-L by 2.0 AP. For image generation, we integrate MaxViT as the backbone of a diffusion model, achieving FID 2.47 on ImageNet 256x256 class-conditional generation, setting a new record. These results demonstrate MaxViT's versatility as a universal vision backbone.
Paragraph function: core experimental results, comprehensive SOTA across three major tasks.
Logical role: simultaneously best-in-class performance across three domains is the strongest support for the "universal backbone" claim.
Argumentation / potential weakness: the consistent advantage across tasks is highly persuasive, but the FID comparison in image generation should confirm that the same diffusion framework was used.
4. Conclusion
We have presented MaxViT, a multi-axis vision transformer that achieves both local and global attention interaction at every network stage with linear complexity. The combination of block attention, grid attention, and MBConv creates a powerful and flexible architecture that excels across classification, detection, and generation tasks. We believe MaxViT demonstrates that carefully designed sparse attention patterns can match or exceed full attention while being orders of magnitude more efficient.
Paragraph function: overall conclusion, restating the core value of multi-axis attention.
Logical role: framing "sparse attention can match full attention" as a broader insight elevates the paper's impact.
Argumentation / potential weakness: describing the efficiency gain as "orders of magnitude" needs concrete compute comparisons (e.g., FLOPs) as support.