Multiscale Vision Transformers (MViT)

Abstract

The authors present Multiscale Vision Transformers (MViT) for video and image recognition, connecting multiscale feature hierarchies with transformer models. The architecture features multiple channel-resolution scale stages that hierarchically expand channel capacity while reducing spatial resolution. This enables early layers to operate on high-resolution features with compact channel dimensions, while deeper layers process coarse, complex features with expanded channels. A key innovation is Multi Head Pooling Attention (MHPA), which applies pooling operators to query, key, and value tensors for flexible resolution modeling within transformer blocks. MViT achieves state-of-the-art results on Kinetics-400 video classification (81.2%) with 5-10x fewer FLOPs than concurrent ViT-based models, and competitive ImageNet performance (84.8%) with 1.7x fewer FLOPs than DeiT, all without large-scale pre-training.

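Paragraph function: Overview of the whole paper. It summarizes MViT's full contribution, from the principle behind the multiscale design to its efficiency advantage.

Logical role: The abstract builds a three-step argument, "design principle (multiscale) -> technical realization (MHPA) -> efficiency validation (5-10x fewer FLOPs)", and stresses "without large-scale pre-training" to highlight the method's standalone value.

Argumentation technique / potential weakness: "5-10x fewer FLOPs" is a striking figure, but the comparison targets may not be the strongest baselines. Most concurrent ViT models were not optimized for video, so part of MViT's efficiency advantage comes from task fit (a multiscale design naturally suits the spatiotemporal structure of video) rather than purely from architectural improvement.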

1. Introduction

Vision Transformers (ViT) have demonstrated impressive results on image classification, but their columnar architecture processes all layers at the same spatial resolution and channel dimension. This is fundamentally at odds with the multiscale nature of visual signals: natural images contain features at multiple scales, and biological visual systems process them hierarchically. In convolutional networks, the progressive reduction in spatial resolution alongside increase in channel capacity — a multiscale pyramid — has been a cornerstone of successful architectures from VGGNet to ResNet. The authors argue that this multiscale principle should be the foundation of transformer architectures for vision, not an afterthought.

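Paragraph function: Establishes the motivation. It argues for a multiscale transformer from the multiscale nature of visual signals.

Logical role: The strategy is to elevate the multiscale principle to a "first principle" of vision architectures, which makes ViT's columnar design look unnatural. Citing biological visual systems adds intuitive persuasiveness.

Argumentation technique / potential weakness: The analogy to biological vision is effective rhetoric, but deep learning architectures need not follow biological design; ViT's success shows that "non-biological" architectures can work too. In addition, the phrase "not an afterthought" implies that methods such as PVT merely graft multiscale processing onto a transformer, whereas MViT integrates it more fundamentally.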

For video understanding, the multiscale principle is even more critical. Video data is extremely dense: even a short clip can yield tens of thousands of spatiotemporal tokens when it is tokenized at a fine stride. Processing all tokens at full resolution through standard self-attention is computationally prohibitive, because the attention cost grows quadratically with the number of tokens. MViT addresses this by starting with thin channel dimensions at high spatiotemporal resolution in early layers and gradually expanding channels while reducing resolution through pooling. This creates a multiscale pyramid of features where computational complexity remains roughly constant across stages, enabling efficient processing of dense visual data without sacrificing the ability to capture fine-grained patterns.

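Paragraph function: Motivation from the video setting. It argues that multiscale design is especially important for video understanding.

Logical role: This paragraph extends MViT's scope from images to video, stressing that the extreme density of video upgrades multiscale design from "beneficial" to "necessary".

Argumentation technique / potential weakness: "Computational complexity remains roughly constant across stages" is a careful piece of engineering, but it depends on the channel expansion rate matching the resolution reduction rate exactly. Does this balance still hold for other tasks or input resolutions?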
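
To make the density argument concrete, the following minimal sketch counts tokens and query-key pairs for one clip at two patch strides. The clip size (16 frames of 224x224) and both strides are illustrative assumptions, not values stated in the text above, and the helper names `token_grid` and `attention_pairs` are hypothetical.

```python
from math import prod

def token_grid(clip_shape, stride):
    """Tokens per axis (T, H, W) after a strided patch embedding (floor division)."""
    return tuple(size // step for size, step in zip(clip_shape, stride))

def attention_pairs(num_tokens):
    """Query-key pairs scored by one full self-attention layer."""
    return num_tokens * num_tokens

# Assumed example clip: 16 frames of 224x224 pixels.
clip = (16, 224, 224)
fine = token_grid(clip, (2, 4, 4))      # (8, 56, 56): fine spatial stride
coarse = token_grid(clip, (2, 16, 16))  # (8, 14, 14): ViT-style 16x16 patches

fine_tokens = prod(fine)      # 25088
coarse_tokens = prod(coarse)  # 1568

# 16x more tokens means 256x more query-key pairs per attention layer.
ratio = attention_pairs(fine_tokens) // attention_pairs(coarse_tokens)
print(fine_tokens, coarse_tokens, ratio)  # 25088 1568 256
```

Under these assumptions, a 4x finer spatial stride multiplies the attention cost by 256, which is why full-resolution attention in every layer is hard to afford for video.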

2. Related Work

Prior vision transformers maintain constant resolution throughout all layers. ViT and DeiT process fixed-size tokens from patch embedding to output. For video, TimeSformer and ViViT extend ViT to spatiotemporal tokens but maintain the same columnar design, requiring massive pre-training on ImageNet-21K. Concurrent works like PVT and Swin Transformer introduce multiscale features for dense prediction, but they focus on image tasks and do not address the spatiotemporal nature of video. In the CNN domain, SlowFast networks demonstrated the value of multi-rate temporal processing for video. MViT uniquely combines multiscale spatial and temporal modeling within a pure transformer architecture, without relying on large-scale pre-training.

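Paragraph function: Positioning in the literature. It locates MViT along three axes: image transformers, video transformers, and CNNs.

Logical role: The strategy is to point out one gap in each of the three research directions: image transformers neglect multiscale structure, video transformers need large-scale pre-training, and CNNs lack global attention. MViT is positioned as the approach that fills all three gaps at once.

Argumentation technique / potential weakness: "Without relying on large-scale pre-training" is an important differentiator for MViT. But how does a fair comparison between MViT and ViViT look under equal pre-training conditions (both with or both without ImageNet-21K)? Does the efficiency advantage come entirely from the architecture, or partly from unequal comparison conditions?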

3. Method

3.1 Multiscale Architecture

MViT implements a progressive channel-resolution trade-off across stages. The network begins with a small channel dimension (e.g., 96) at high spatiotemporal resolution and progressively doubles the channel dimension while halving the spatial resolution at stage transitions. This design is motivated by the observation that early visual processing requires high spatial resolution to capture fine-grained local patterns (edges, textures), while deeper processing requires rich channel capacity to encode complex semantic concepts. The normalized skip connections at stage transitions accommodate the changing dimensions, and separate space and time positional embeddings (rather than joint spatiotemporal) provide position information with factored complexity.

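Paragraph function: Architecture design. It describes how the multiscale pyramid is realized concretely.

Logical role: This paragraph formally brings the classic CNN design principle (high resolution and narrow channels in low layers, low resolution and wide channels in high layers) into the transformer architecture. The separate space and time positional embeddings are a video-specific consideration.

Argumentation technique / potential weakness: "Translating" CNN design principles to transformers is a reasonable strategy, but global attention may undermine the assumption that early layers capture local patterns, since the early layers of ViT can already capture global patterns. The real value of multiscale design in transformers may lie more in computational efficiency than in feature quality.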
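
The channel-resolution trade-off of Section 3.1 can be sketched as a small schedule generator. This is an illustration under stated assumptions, not the authors' configuration code: the function name `stage_schedule` is hypothetical, the starting token grid of 8x56x56 is assumed, and the temporal length is assumed to stay fixed across stages. Only the 96-channel start and the doubling/halving rule come from the text.

```python
def stage_schedule(base_channels, base_grid, num_stages, spatial_stride=2):
    """Return [(channels, (T, H, W)), ...] for a channel-resolution pyramid.

    Assumption: each stage transition doubles the channels and divides H and W
    by `spatial_stride`, while the temporal length T is left unchanged.
    """
    channels = base_channels
    t, h, w = base_grid
    schedule = []
    for _ in range(num_stages):
        schedule.append((channels, (t, h, w)))
        channels *= 2
        h //= spatial_stride
        w //= spatial_stride
    return schedule

# 96 channels comes from the text; the 8x56x56 starting grid is an assumption.
stages = stage_schedule(base_channels=96, base_grid=(8, 56, 56), num_stages=4)
for channels, (t, h, w) in stages:
    print(f"{channels:4d} channels, {t}x{h}x{w} = {t * h * w} tokens")
```

With four stages the channel width runs 96, 192, 384, 768, while the token count falls by 4x at every transition.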

3.2 Multi Head Pooling Attention

The core innovation is Multi Head Pooling Attention (MHPA), which enables flexible resolution modeling within a single transformer block. Unlike standard Multi-Head Attention that maintains constant sequence length, MHPA applies learnable pooling operators to query (Q), key (K), and value (V) tensors with configurable kernel sizes, strides, and padding. The pooling reduces the sequence length of K and V, decreasing the attention computation cost, while optionally also pooling Q to produce downsampled output for stage transitions. This mechanism serves dual purposes: it reduces computational cost within stages and enables resolution changes across stages, all within the attention mechanism itself rather than requiring separate downsampling layers.

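Paragraph function: Core technique. It details the mechanism of MHPA and its dual function.

Logical role: MHPA is the technical core of MViT. It merges "efficiency gains" and "resolution change" into a single operation, so that multiscale design becomes an inherent property of the attention mechanism rather than an add-on component.

Argumentation technique / potential weakness: Applying different pooling strategies to Q, K, and V is an elegant design: pooling K and V lowers cost, while pooling Q implements stage transitions. However, this design means different attention heads may process information at different resolutions; could inconsistent resolutions hurt the integration of information across heads?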
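
The two roles of pooling in Section 3.2 can be seen in a deliberately simplified sketch. It is not the MHPA implementation: it uses a single head, a 1D token sequence, fixed average pooling instead of learnable pooling, and it omits the linear projections. The names `avg_pool` and `pooling_attention` are hypothetical.

```python
import math

def avg_pool(tokens, stride):
    """Average non-overlapping windows of `stride` consecutive token vectors."""
    pooled = []
    for start in range(0, len(tokens) - stride + 1, stride):
        window = tokens[start:start + stride]
        pooled.append([sum(col) / stride for col in zip(*window)])
    return pooled

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pooling_attention(q, k, v, q_stride=1, kv_stride=1):
    """Single-head attention where Q, K and V are pooled along the sequence first."""
    q = avg_pool(q, q_stride)
    k = avg_pool(k, kv_stride)
    v = avg_pool(v, kv_stride)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for query in q:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in k]
        weights = softmax(scores)
        out.append([sum(w * val[d] for w, val in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

# Toy input: 8 tokens with 2 channels each; Q, K and V all reuse it.
x = [[float(i), float(i % 2)] for i in range(8)]

within_stage = pooling_attention(x, x, x, q_stride=1, kv_stride=2)  # cheaper attention
transition = pooling_attention(x, x, x, q_stride=2, kv_stride=2)    # shorter output
print(len(within_stage), len(transition))  # 8 4
```

Pooling only K and V keeps the output length at 8 while halving the number of scored pairs; pooling Q as well halves the output length, which is the behavior used at a stage transition.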

3.3 Design Principles

The architecture follows a channel-resolution scaling principle: when each spatial axis is downsampled by a factor s (so the token count falls by s^2), the channel dimension grows by the same factor s, keeping the total computation per stage roughly constant. MViT-B (36.6M parameters) is the base instantiation: it starts from 96 channels at the finest spatiotemporal resolution and expands to 768 channels at reduced resolution over four stages. For video inputs, the initial patch embedding turns 3D space-time cubes into spatiotemporal tokens. A wider variant (MViT-B-24-wide) achieves the best ImageNet accuracy. The design avoids large-scale pre-training: MViT is trained from scratch on the target dataset, which is a significant practical advantage.

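Paragraph function: Design principles. It explains the arithmetic behind channel-resolution scaling and the model configuration.

Logical role: This paragraph gives concrete, reproducible numbers (96 -> 768 channels, 36.6M parameters), moving the method from concept to implementation. The emphasis on training from scratch highlights the key difference from methods that need pre-training, such as ViViT and TimeSformer.

Argumentation technique / potential weakness: "No large-scale pre-training" is a double-edged sword. It lowers the barrier to use, but it may also mean that MViT's headroom is limited when pre-training data is available. MViTv2 did later obtain further gains through pre-training, which weakens the early claim that training from scratch is good enough.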
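
The "roughly constant computation" claim of Section 3.3 can be checked with a rough cost count. The derivation below is a sketch under common approximations that are assumed here rather than taken from the text: linear (projection and MLP) layers cost on the order of N D^2, and the attention product costs on the order of N^2 D. The symbols N (tokens per stage), D (channels) and s (per-axis spatial stride) are notation introduced for this example.

```latex
% N: tokens in a stage, D: channel width, s: per-axis spatial stride (assumed notation)
\begin{align*}
N' &= \frac{N}{s^{2}}, \qquad D' = sD \\
C_{\text{linear}}' &\propto N' D'^{2} = \frac{N}{s^{2}} \, s^{2} D^{2} = N D^{2} \\
C_{\text{attn}}' &\propto N'^{2} D' = \frac{N^{2}}{s^{4}} \, sD = \frac{N^{2} D}{s^{3}}
\end{align*}
```

Under these assumptions the linear-layer cost is unchanged by a stage transition, while the attention term shrinks by s^3. With s = 2 this corresponds to doubling the channels while the token count drops by 4x.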

4. Experiments

On Kinetics-400 video classification, MViT-B achieves 81.2% top-1 accuracy, surpassing concurrent transformer models that require ImageNet-21K pre-training and are 5-10x more costly in computation and parameters. A temporal modeling experiment shows that shuffling the input frames causes a 7.1% accuracy drop for MViT but only 0.1% for a standard ViT. This suggests that MViT makes use of temporal order, while the standard ViT behaves like a "bag-of-frames" classifier. On ImageNet image classification, MViT-B-24-wide reaches 84.8% top-1 accuracy with 1.7x fewer FLOPs than the higher-resolution DeiT-B. Ablation studies indicate that MHPA, separate positional embeddings, and the multiscale design each contribute to the final accuracy.

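Paragraph function: Comprehensive experimental validation. It gives quantitative results on video and images plus an in-depth behavioral analysis.

Logical role: The most striking element here is the frame shuffling experiment (7.1% vs. 0.1%), which goes beyond a plain performance comparison and offers insight into model behavior. It shows not only that MViT is better, but also why.

Argumentation technique / potential weakness: The frame shuffling experiment is an excellent analysis: a simple manipulation exposes an essential difference between the models. But is the 7.1% drop entirely attributable to temporal modeling? The multiscale design itself may be sensitive to local spatial consistency between frames, and shuffling may also disrupt the temporal coherence of spatial features.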
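
The logic of the frame shuffling test can be illustrated with two toy scoring functions. This is not the paper's evaluation protocol and involves no trained model: each "frame" is reduced to one number, and both scorers (`bag_of_frames_score`, `temporal_score`) are hypothetical stand-ins for an order-invariant and an order-sensitive model.

```python
import random

def bag_of_frames_score(frames):
    """Order-invariant: averages a per-frame feature and ignores frame order."""
    return sum(frames) / len(frames)

def temporal_score(frames):
    """Order-sensitive: sums frame-to-frame increases (a stand-in for motion)."""
    return sum(max(b - a, 0.0) for a, b in zip(frames, frames[1:]))

def shuffled(frames, seed):
    """Return a copy of the frames in a random order (seeded for repeatability)."""
    copy = list(frames)
    random.Random(seed).shuffle(copy)
    return copy

# Toy clip: one scalar feature per frame, steadily increasing over time.
clip = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
mixed = shuffled(clip, seed=0)

# The order-invariant score cannot change under any reordering.
print(bag_of_frames_score(clip), bag_of_frames_score(mixed))
# The order-sensitive score generally does change when frames are reordered.
print(temporal_score(clip), temporal_score(mixed))
```

A model whose output barely moves under shuffling, like the 0.1% drop reported for ViT, behaves like the first scorer; a model with a large drop, like MViT's 7.1%, must depend on frame order in some way, although the toy example does not say which mechanism is responsible.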

5. Conclusion

MViT demonstrates that multiscale feature hierarchies are essential for effective vision transformers, particularly for dense, high-dimensional data like video. The Multi Head Pooling Attention mechanism enables flexible resolution modeling within the attention framework, achieving state-of-the-art video recognition with dramatically fewer resources and competitive image classification without pre-training. The channel-resolution scaling principle aligns transformer architectures with classical vision principles of multiscale processing, suggesting that the most effective vision transformers will be those that embrace, rather than ignore, the multiscale nature of visual data.

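Paragraph function: Summary and manifesto. It restates the core contributions and claims that multiscale design is generally applicable.

Logical role: The conclusion elevates MViT from a specific method to a statement of design philosophy: "embrace the multiscale nature". This position was later supported by MViTv2 and many other multiscale transformers.

Argumentation technique / potential weakness: The phrase "embrace, rather than ignore" reads like a manifesto. But is multiscale really the only right direction? Later research (such as the strong performance of plain ViT) indicates that, given enough data and compute, columnar architectures can also compete. The advantage of multiscale design lies more in efficiency than in fundamental representational power.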

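Argument structure overview

Problem: ViT's columnar architecture ignores the multiscale nature of visual signals
→ Thesis: multiscale feature hierarchy + pooling attention
→ Evidence: Kinetics 81.2%, 5-10x fewer FLOPs
→ Rebuttal: the frame shuffling experiment demonstrates genuine temporal modeling
→ Conclusion: the multiscale principle is the direction for vision transformers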

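Authors' core claim (one sentence)

Multiscale feature hierarchies realized through Multi Head Pooling Attention let vision transformers reach state-of-the-art video recognition with roughly an order of magnitude less computation and without large-scale pre-training.

Strongest point of the argument

Insight from the frame shuffling experiment: the contrast between MViT's 7.1% drop after frame shuffling and ViT's 0.1% drop is the most persuasive evidence in the paper. It goes beyond performance numbers and shows that MViT captures the temporal structure of video rather than classifying frame by frame. This kind of behavioral analysis is more informative than a plain performance comparison.

Weakest point of the argument

Comparison conditions behind the efficiency advantage: "5-10x fewer FLOPs" is measured against video transformers that require ImageNet-21K pre-training (such as ViViT), but those models were not designed for efficiency. Against video transformers that also pursue efficiency (such as Video Swin), the FLOPs advantage could shrink considerably. In addition, MViT's advantage on image tasks is less pronounced than on video tasks.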