Focal Self-Attention for Local-Global Interactions in Vision Transformers

Abstract

Recently, Vision Transformer and its variants have shown great potential in various computer vision tasks. The ability of self-attention to model both short- and long-range visual dependencies is the key. However, existing approaches either use coarse-grained global attention sacrificing fine-grained local details, or fine-grained local attention at the cost of long-range modeling. In this work, the authors propose focal self-attention, a new mechanism where each token attends its closest surrounding tokens at fine granularity and tokens far away at coarse granularity. This enables efficient capture of both short- and long-range visual dependencies. With focal self-attention, the proposed Focal Transformer achieves superior performance on image classification (83.8% top-1 on ImageNet), object detection, and semantic segmentation.

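Paragraph function: Overview of the whole paper. It starts from the dilemma facing self-attention and introduces focal attention as a unified solution.

Logical role: The abstract builds a clear "either-or dilemma -> unified solution" narrative: coarse-grained global attention and fine-grained local attention each have shortcomings, and focal attention covers both.

Argumentative technique / potential weakness: Framing existing methods as a strict binary is an effective rhetorical strategy, but some methods (such as Swin Transformer's shifted windows) already try to add cross-window connections to local attention. The dichotomy may oversimplify the spectrum of existing techniques.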

1. Introduction

Vision Transformers have emerged as powerful alternatives to convolutional networks. The core strength lies in self-attention's ability to model both short- and long-range dependencies. Yet this capability carries substantial computational costs when processing high-resolution feature maps. Existing approaches diverge into two strategies: coarse-grained global attention (such as downsampled tokens in PVT) sacrificing local detail, or fine-grained local attention (such as fixed windows in Swin Transformer) limiting long-range modeling. The authors observe that full self-attention ViT models indeed learn to attend local surroundings and global contexts simultaneously, suggesting both interaction types are necessary and should be captured in a single attention mechanism.

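Paragraph function: Establishes the motivation. Starting from an empirical observation, it argues that local and global attention should be unified.

Logical role: The key here is the empirical observation (ViT naturally learns local plus global attention patterns), which gives the design of focal attention a data-driven motivation rather than one based on intuition alone.

Argumentative technique / potential weakness: Citing ViT's attention patterns as the design basis is a strong argument: it moves from "what the model learns on its own" to "what we should design explicitly". However, the observation comes from a ViT trained for classification; whether attention patterns are the same in detection and segmentation is not verified.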

To this end, the authors propose focal self-attention, in which each query token attends to its closest surrounding tokens at fine granularity and to distant tokens at coarse granularity. This is achieved through multi-level sub-window pooling: at each focal level, the feature map is partitioned into sub-windows of increasing size and pooled to create multi-granularity representations. The query then attends to keys and values from all levels simultaneously, capturing both fine-grained local patterns and coarse-grained global context in a single attention operation. This design keeps receptive field coverage comparable to global attention while achieving computational complexity that is linear in the spatial dimensions.
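The "fine nearby, coarse far away" idea can be illustrated with a 1D toy. This is only a sketch under stated assumptions, not the authors' implementation: the function name `focal_keys_1d`, the parameters `radius` and `block`, and the use of mean pooling are all hypothetical choices made for illustration, and the coarse level simply pools the whole sequence, including the part that overlaps the fine neighborhood.

```python
def focal_keys_1d(tokens, q, radius, block):
    """Toy 1D key selection for the query at index q (illustrative only).

    fine   : raw tokens within `radius` of q (fine granularity)
    coarse : mean of every length-`block` chunk of the sequence
             (coarse granularity; chunks overlapping the fine range are kept)
    """
    lo, hi = max(0, q - radius), min(len(tokens), q + radius + 1)
    fine = tokens[lo:hi]
    coarse = [
        sum(tokens[s:s + block]) / len(tokens[s:s + block])
        for s in range(0, len(tokens), block)
    ]
    return fine, coarse


tokens = [float(i) for i in range(16)]
fine, coarse = focal_keys_1d(tokens, q=8, radius=2, block=4)
print(fine)    # [6.0, 7.0, 8.0, 9.0, 10.0]
print(coarse)  # [1.5, 5.5, 9.5, 13.5]
print(len(fine) + len(coarse), "keys instead of", len(tokens))  # 9 keys instead of 16
```

The query still "sees" the whole sequence through the pooled chunks, but the number of keys grows with `radius` and `len(tokens) / block` rather than with the full sequence length.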

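Paragraph function: Solution overview. Describes the core mechanism of focal self-attention and its efficiency guarantee.

Logical role: Turns the intuitive idea of "look closely nearby, look coarsely far away" into a concrete technical realization (sub-window pooling plus multi-level attention), bridging motivation and method.

Argumentative technique / potential weakness: The design is highly intuitive from the standpoint of human vision, resembling how the fovea of the retina allocates attention. However, pooling at the coarse levels may discard important fine detail from distant regions; small-object detection, for example, may suffer from coarse-grained global attention.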

Prior approaches to reducing attention complexity fall into two categories. The first employs coarse-grained global self-attention by attending downsampled or summarized tokens, as seen in PVT's spatial-reduction attention and Twins' global sub-sampled attention. The second uses fine-grained local attention within constant window sizes, exemplified by Swin Transformer's shifted window mechanism. Both strategies involve fundamental trade-offs: global methods lose fine-grained spatial resolution, while local methods require additional mechanisms (window shifting, relative position bias) to enable cross-window communication. The focal self-attention represents the first reconciliation of both approaches in a single transformer layer.

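Paragraph function: Positioning in the literature. Places focal attention in the middle of the spectrum between global and local attention.

Logical role: Organizing the literature as a dichotomy makes focal attention the natural "third way". PVT and Swin stand for the two extremes, and focal attention is the unified solution.

Argumentative technique / potential weakness: The "first reconciliation" claim calls for caution: Swin's shifted windows also try, to some degree, to bring cross-window information into local attention. "First" here probably refers to achieving multi-granularity attention within a single attention operation, not to any form of reconciliation.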

3. Method

3.1 Motivation

The design is motivated by a key empirical observation: visualizing attention patterns in a full self-attention ViT reveals that attention heads naturally develop both local and global patterns. Some heads focus narrowly on immediate neighbors, while others attend broadly across the image. This suggests that an ideal attention mechanism should explicitly model both scales. Furthermore, ablation studies show that removing either local attention (82.2% to 80.1%) or global attention (82.2% to 81.5%) degrades performance, confirming that local and global interactions are complementary.

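Paragraph function: Empirical evidence. Supports the design motivation with attention visualization and ablation experiments.

Logical role: This paragraph is the foundation of the paper's argument: the design is data-driven rather than purely intuitive, with attention visualization supplying qualitative evidence and the ablation study supplying quantitative evidence.

Argumentative technique / potential weakness: Quantifying the contributions of local and global attention with ablation figures (80.1% vs. 81.5% vs. 82.2%) is highly persuasive. Note, though, that the drop from removing local attention (2.1 points) is larger than the drop from removing global attention (0.7 points), which suggests local detail may matter more than global context. This sits slightly at odds with the authors' narrative that the two are equally important.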

3.2 Focal Self-Attention

The focal self-attention mechanism operates at the window level. For each focal level l, the input feature map is partitioned into sub-windows of size s_w^l, which are then pooled with linear layers to create multi-granularity representations. Query tokens attend to keys and values extracted from multiple focal levels simultaneously. For a given query window, surrounding regions are sampled at progressively coarser granularities: the further away the region, the coarser the granularity. Three key parameters specify focal attention: the number of focal levels (L), which controls how many granularity levels are used; the focal window size (s_w^l), which specifies the sub-window dimension at level l; and the focal region size (s_r^l), which defines the extent of the attended region at level l. For an M x N feature map with channel dimension d, the overall computational complexity is O((L + sum_l (s_r^l)^2)(MN)d), which is linear in the spatial dimensions rather than quadratic.
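The parameterization above can be sketched in a few lines of plain Python. This is a minimal illustration under several assumptions, not the authors' code: mean pooling stands in for the learned linear pooling, keys are gathered for a single query token rather than a query window, the region size s_r is read as a count of pooled sub-windows per side, and the helper names and all numeric values are hypothetical.

```python
def pool_sub_windows(fmap, s_w):
    """Mean-pool a 2D grid (list of rows) over non-overlapping s_w x s_w sub-windows.
    Assumption: mean pooling replaces the learned linear pooling in the text."""
    rows, cols = len(fmap), len(fmap[0])
    assert rows % s_w == 0 and cols % s_w == 0
    pooled = []
    for r in range(0, rows, s_w):
        out_row = []
        for c in range(0, cols, s_w):
            block = [fmap[r + i][c + j] for i in range(s_w) for j in range(s_w)]
            out_row.append(sum(block) / len(block))
        pooled.append(out_row)
    return pooled


def crop_region(pooled, row, col, s_r):
    """Crop an s_r x s_r region around (row, col), clamped to the map borders."""
    n_rows, n_cols = len(pooled), len(pooled[0])
    r0 = min(max(row - s_r // 2, 0), n_rows - s_r)
    c0 = min(max(col - s_r // 2, 0), n_cols - s_r)
    return [pooled[r][c0:c0 + s_r] for r in range(r0, r0 + s_r)]


def focal_keys(fmap, query_rc, window_sizes, region_sizes):
    """Gather keys for one query token from every focal level."""
    keys = []
    for s_w, s_r in zip(window_sizes, region_sizes):
        pooled = pool_sub_windows(fmap, s_w)
        region = crop_region(pooled, query_rc[0] // s_w, query_rc[1] // s_w, s_r)
        keys.extend(v for region_row in region for v in region_row)
    return keys


# Hypothetical setup: 8x8 map, three levels with s_w = 1, 2, 4 and s_r = 4, 4, 2.
fmap = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
keys = focal_keys(fmap, query_rc=(3, 3), window_sizes=(1, 2, 4), region_sizes=(4, 4, 2))
print(len(keys), "keys instead of", 8 * 8)  # 36 keys instead of 64
```

In this toy, the two coarser levels already cover the whole 8x8 map, so the query reaches every location while attending to 4^2 + 4^2 + 2^2 = 36 keys.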

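Paragraph function: Core technical detail. Gives a complete description of the parameterization of focal self-attention and its computational complexity.

Logical role: Turns the "look closely nearby, look coarsely far away" intuition into a precise mathematical formulation, completing the bridge from concept to implementation. The explicit definition of the three parameters makes the method reproducible.

Argumentative technique / potential weakness: The linear-complexity guarantee is a key advantage, but the constant factor in the complexity (the extra overhead of L levels) may open a gap between actual runtime and the theoretical figure. In addition, the best number and size of focal levels must be tuned per task, which adds to the hyperparameter search burden.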
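To make the scaling behavior and the role of the constant factor concrete, the sketch below evaluates the stated complexity term next to the usual (MN)^2 d cost of full self-attention. It assumes the sum in the formula is read as the sum over levels of (s_r^l)^2, drops all constants, and uses illustrative region sizes and channel width that are not taken from the text.

```python
def focal_cost(region_sizes, M, N, d):
    """(L + sum_l (s_r^l)^2) * (M * N) * d, with constants dropped."""
    L = len(region_sizes)
    return (L + sum(s_r ** 2 for s_r in region_sizes)) * (M * N) * d


def full_attention_cost(M, N, d):
    """Standard quadratic cost of full self-attention, (M * N)^2 * d."""
    return (M * N) ** 2 * d


region_sizes = (13, 7)  # illustrative values only
d = 96                  # illustrative channel dimension

for side in (56, 112):
    print(side, focal_cost(region_sizes, side, side, d), full_attention_cost(side, side, d))

# Doubling the side length quadruples the token count:
# the focal term grows 4x, the full-attention term grows 16x.
print(focal_cost(region_sizes, 112, 112, d) // focal_cost(region_sizes, 56, 56, d))   # 4
print(full_attention_cost(112, 112, d) // full_attention_cost(56, 56, d))             # 16
```

The per-token factor (L + sum of squared region sizes) is fixed with respect to resolution, which is why the cost is linear in MN, but it is also the constant that the critique above warns may dominate in practice.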

3.3 Architecture

The Focal Transformer follows a multi-scale design with four stages. Input images undergo 4x4 patch embedding and then pass through the stages; at each stage transition the spatial resolution is halved and the feature dimension is doubled. Each stage contains focal transformer layers that process feature maps at progressively lower resolution. The model comes in three variants: Focal-Tiny (29M parameters), Focal-Small (51M), and Focal-Base (90M), designed to match the parameter counts of the Swin Transformer variants for fair comparison.
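The four-stage layout can be tabulated with a short helper. This is a sketch of the schedule as described (4x4 patch embedding, resolution halved and width doubled at each transition); the 224-pixel input and the base width of 96 channels are assumptions chosen for illustration, and `stage_schedule` is a hypothetical name.

```python
def stage_schedule(image_size=224, patch=4, base_dim=96, num_stages=4):
    """Per-stage feature-map side length, token count, and channel width.
    image_size and base_dim are illustrative assumptions, not from the text."""
    stages = []
    side, dim = image_size // patch, base_dim
    for stage in range(1, num_stages + 1):
        stages.append({"stage": stage, "side": side, "tokens": side * side, "dim": dim})
        side //= 2  # spatial resolution halves at each transition
        dim *= 2    # feature dimension doubles at each transition
    return stages


for s in stage_schedule():
    print(s)
# {'stage': 1, 'side': 56, 'tokens': 3136, 'dim': 96}
# {'stage': 2, 'side': 28, 'tokens': 784, 'dim': 192}
# {'stage': 3, 'side': 14, 'tokens': 196, 'dim': 384}
# {'stage': 4, 'side': 7, 'tokens': 49, 'dim': 768}
```

Most tokens sit in the early, high-resolution stages, which is where the cost of the attention mechanism matters most.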

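Paragraph function: Architecture specification. Describes the overall architecture of the Focal Transformer and its model variants.

Logical role: Ensures the experimental comparison is fair: by matching Swin's parameter counts, later performance differences can be attributed to the attention design rather than model capacity.

Argumentative technique / potential weakness: Matching parameter counts is a reasonable fairness guarantee, but FLOPs (floating-point operations) may be higher than Swin's because of the multi-level design of focal attention. A comparison at equal FLOPs might lead to a different conclusion.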

4. Experiments

On ImageNet-1K image classification, Focal-Tiny achieves 82.2% top-1 accuracy (+1.0 points vs. Swin-Tiny) at a similar parameter count. Focal-Small reaches 83.5% and Focal-Base achieves 83.8%, consistently outperforming comparable models. For COCO object detection, Focal-Tiny improves over Swin-Tiny by +1.7 box mAP and +1.2 mask mAP with Mask R-CNN. Testing across four further detection methods (Cascade R-CNN, ATSS, RepPoints, Sparse R-CNN) shows consistent gains of 1.0-2.3 mAP. On ADE20K semantic segmentation, Focal-Large achieves 55.4 mIoU (+1.9 vs. Swin-Large), establishing state-of-the-art performance. Notably, removing window shifting from Focal Transformers causes minimal degradation (unlike Swin), confirming that focal attention inherently captures cross-window information without requiring additional mechanisms.

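Paragraph function: Comprehensive experimental validation. Provides a systematic comparison on three major tasks: classification, detection, and segmentation.

Logical role: This is the empirical core of the paper, covering three dimensions: (1) classification accuracy; (2) consistency across detection frameworks; (3) state-of-the-art segmentation. The ablation that removes window shifting further supports the superiority of the focal attention design.

Argumentative technique / potential weakness: Consistent gains across multiple detection methods are very persuasive, since they rule out the explanation that the gains depend on a particular method. However, every comparison uses Swin as the main baseline, and a comprehensive comparison with other contemporaneous methods (such as CSWin and Twins) is missing. The computational overhead that accompanies the gains (actual latency vs. theoretical FLOPs) is also not fully quantified.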

5. Conclusion

Focal self-attention successfully reconciles efficiency with modeling capacity by performing local self-attention at fine granularity and global self-attention at coarse granularity. The approach generalizes across classification, detection, and segmentation with consistent improvements over state-of-the-art baselines. The authors acknowledge that computational overhead remains a consideration for practical deployment, suggesting depth reduction and broader architectural exploration as future directions for making focal attention more efficient.

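Paragraph function: Summarizes the paper. Restates the core contribution and acknowledges limitations.

Logical role: The conclusion echoes the abstract's "either-or dilemma -> unified solution" narrative, closing the loop. It also candidly points to the computational overhead, leaving room for future work.

Argumentative technique / potential weakness: Acknowledging the computational overhead is appropriate transparency. However, the paper does not discuss whether focal attention applies to non-vision tasks (such as NLP), nor does it explore combining it with other efficiency techniques (such as sparse attention or linear attention).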

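Argument Structure Overview

Problem: global and local attention each carry structural trade-offs
→ Claim: focal attention unifies multi-granularity local-global interactions
→ Evidence: consistently outperforms Swin Transformer on three major tasks
→ Rebuttal: linear complexity is guaranteed, and no window-shifting mechanism is needed
→ Conclusion: multi-granularity attention is the direction for vision Transformers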

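Authors' core claim (one sentence)

By performing local self-attention at fine granularity and global self-attention at coarse granularity, focal attention captures both short- and long-range visual dependencies at linear complexity and outperforms existing methods across classification, detection, and segmentation.

Strongest point of the argument

Data-driven design motivation: the design of focal attention is not purely intuitive but grounded in empirical observation of ViT attention patterns, and it is verified quantitatively through ablation (local vs. global contribution). Consistent gains across multiple detection methods further rule out a method-dependent explanation.

Weakest point of the argument

Gap between the efficiency claim and practical deployment: although the theoretical complexity is linear, the actual memory and latency overhead of multi-level sub-window pooling and multi-granularity attention is not fully quantified. On edge devices or in real-time applications, the added complexity of focal attention may offset its accuracy advantage.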