
Abstract

We present VideoMamba, addressing the dual challenges of local redundancy and global dependencies in video understanding by innovatively adapting the Mamba architecture to the video domain. Unlike 3D convolutions that have limited receptive fields and video Transformers that suffer from quadratic complexity, VideoMamba's linear-complexity operator enables efficient long-term modeling crucial for high-resolution long video understanding. Extensive evaluations reveal VideoMamba's four core abilities: (1) Scalability without extensive pretraining via a novel self-distillation technique; (2) Sensitivity for short-term action recognition with fine-grained motion differences; (3) Superiority in long-term video understanding; and (4) Compatibility with other modalities in multi-modal contexts.
Paragraph function: Full-text overview, positioning the adaptation of Mamba to the video domain and its four core abilities.
Logical role: Opens with the "dual challenges" to establish the problem, then presents the "four core abilities" as a comprehensive solution.
Argument technique / potential weakness: Linear complexity is the key advantage for long-video understanding, but the theoretical understanding of the Mamba architecture on visual tasks is still incomplete.
The adaptation of State Space Models (SSMs) to video understanding represents a paradigm shift from attention-based temporal modeling. While attention mechanisms provide global context, their O(n^2) scaling with sequence length makes them impractical for long videos where the number of tokens easily reaches hundreds of thousands. SSMs offer O(n) complexity through their recurrent formulation while maintaining the ability to capture long-range dependencies through learned state transitions. VideoMamba is the first comprehensive exploration of this paradigm for video understanding, providing both architectural innovations and practical guidelines.
Paragraph function: Frames the paradigm shift from attention-based temporal modeling to state space models.
Logical role: Casting the work as a "paradigm shift" elevates its historical significance beyond a mere technical improvement.
Argument technique / potential weakness: The O(n) vs. O(n^2) comparison is theoretically unassailable, but actual speed also depends on hardware utilization.
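To make the recurrent O(n) formulation concrete, here is a minimal NumPy sketch of a (non-selective) linear SSM scan; the dimensions and matrices are illustrative toy values, not taken from VideoMamba:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time SSM recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (n, d_in) token sequence. Each step touches only the running
    state h, so total cost is O(n) in sequence length, unlike the
    O(n^2) pairwise interactions of self-attention.
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                 # one pass over the sequence: O(n)
        h = A @ h + B @ x_t       # state update carries long-range context
        ys.append(C @ h)          # per-token readout
    return np.stack(ys)

rng = np.random.default_rng(0)
n, d_in, d_state = 8, 4, 16
y = ssm_scan(rng.normal(size=(n, d_in)),
             0.9 * np.eye(d_state),               # stable fixed transition
             rng.normal(size=(d_state, d_in)) * 0.1,
             rng.normal(size=(2, d_state)) * 0.1)
print(y.shape)  # (8, 2)
```

Mamba replaces the fixed A, B, C here with input-dependent versions (the "selective" part), which is discussed in the Method section.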

1. Introduction

The explosive growth of video content — with over 500 hours uploaded to YouTube every minute — has created an urgent need for scalable video understanding systems. Applications ranging from content moderation and recommendation to autonomous driving and robotics require models that can efficiently process video at scale. However, the computational cost of existing methods has limited deployment to short clips (typically 4-16 seconds), leaving the vast majority of long-form video content underserved by current AI systems.
Paragraph function: Industry background: the scale of video content and the demand for processing it.
Logical role: The 500-hours-per-minute figure makes the urgency of scalable video understanding concrete.
Argument technique / potential weakness: Linking to concrete application scenarios makes the motivation more specific and pressing.
Video understanding presents unique challenges compared to image understanding: videos contain massive spatio-temporal redundancy (adjacent frames are highly similar) while also requiring modeling of long-range temporal dependencies (understanding a story or complex action sequence). 3D CNNs handle local patterns well but cannot capture long-range dependencies efficiently. Video Transformers can model global interactions but their O(n^2) complexity makes them prohibitively expensive for long videos. The recent State Space Model (SSM) paradigm, exemplified by Mamba, offers a compelling alternative with linear complexity while maintaining the ability to model long-range dependencies through its selective scan mechanism.
Paragraph function: Establishes the research landscape by comparing the strengths and weaknesses of three architectural paradigms.
Logical role: The three-way comparison (CNN / Transformer / SSM) clearly positions Mamba's advantage.
Argument technique / potential weakness: The complexity analysis is mathematically rigorous, but linear complexity does not imply small constants; actual speed still depends on hardware efficiency.
The emergence of VideoMamba is situated within a broader trend of efficient sequence modeling gaining traction across machine learning. The success of Mamba in natural language processing — where it achieves competitive results with Transformers at a fraction of the computational cost for long sequences — naturally raised the question of whether similar benefits transfer to visual domains. The answer, as demonstrated by Vim for images and now VideoMamba for videos, is affirmative but nuanced: SSMs excel particularly in scenarios where sequence length is the computational bottleneck, which is precisely the case for video understanding where temporal extent creates orders-of-magnitude more tokens than static images.
Paragraph function: Academic context: the trend of transferring SSMs from NLP to vision.
Logical role: Places VideoMamba within the larger narrative of Mamba's cross-domain transfer.
Argument technique / potential weakness: "Affirmative but nuanced" is an accurate characterization that avoids overclaiming.
The evolution of video understanding architectures spans three generations. 3D CNNs (C3D, I3D, SlowFast) extended 2D convolutions to the temporal dimension but were limited to local temporal windows. Video Transformers (ViViT, TimeSformer, Video Swin) introduced global attention but required various efficiency tricks such as factorized attention, windowed attention, or sparse sampling to manage complexity. Mamba and its visual variant Vim (Vision Mamba) recently demonstrated that SSMs can achieve competitive results on image tasks with linear complexity. VideoMamba extends this success to the more challenging video domain, where the advantages of linear scaling become even more pronounced due to the multiplicative increase in token count from temporal frames.
Paragraph function: Three generations of architecture evolution: from CNNs to Transformers to SSMs.
Logical role: The historical narrative makes VideoMamba the natural next step rather than an isolated attempt.
Argument technique / potential weakness: The "three generations" framing is clear and forceful, but Transformer efficiency improvements (e.g., FlashAttention) are rapidly closing the gap.
The fundamental advantage of Mamba over attention-based models becomes particularly stark in the video domain. Consider a video of 16 frames at 224x224 resolution with patch size 16: this produces approximately 3,136 tokens. Self-attention requires ~9.8 million pairwise computations. Scaling to 64 frames produces 12,544 tokens and ~157 million attention computations — a 16x increase for only 4x more frames. In contrast, VideoMamba's SSM processes the same 64-frame input with only 4x the compute of 16 frames, maintaining perfect linear scaling. This makes VideoMamba uniquely suited for applications requiring long video understanding, such as movie analysis, surveillance, and instructional video comprehension.
Paragraph function: Makes the complexity argument concrete, using actual numbers to show the linear vs. quadratic gap.
Logical role: The 9.8 million vs. 157 million computation contrast makes the abstract complexity analysis tangible.
Argument technique / potential weakness: Concrete numbers are more persuasive than O(n) vs. O(n^2), directly quantifying the gap in real deployments.
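The token and attention counts quoted above can be reproduced with a few lines of arithmetic:

```python
def video_tokens(frames, height, width, patch):
    """Token count for a video split into patch x patch spatial patches per frame."""
    return frames * (height // patch) * (width // patch)

t16 = video_tokens(16, 224, 224, 16)   # 16 * 14 * 14 = 3136 tokens
t64 = video_tokens(64, 224, 224, 16)   # 12544 tokens

attn16 = t16 ** 2                      # pairwise attention computations
attn64 = t64 ** 2
print(t16, t64)                        # 3136 12544
print(attn16, attn64)                  # 9834496 157351936  (~9.8M vs ~157M)
print(attn64 // attn16)                # 16: 4x the frames, 16x the attention cost
```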

3. Method

VideoMamba extends the bidirectional Mamba (Vim) architecture from images to videos. We treat a video as a sequence of 3D patches (spatial-temporal tubes) and process them through stacked bidirectional SSM blocks. Each block applies the selective scan mechanism in both forward and backward directions along the flattened patch sequence, enabling the model to capture dependencies from both temporal directions. To address the challenge of training data scarcity for video models, we introduce a self-distillation strategy that transfers knowledge from a pretrained image Mamba model to initialize the video model, enabling effective training even with limited video data.
Paragraph function: Presents the core method: 3D patch tokenization, bidirectional scanning, and self-distillation.
Logical role: A natural extension from images to video, plus self-distillation to address data scarcity; the design is pragmatic.
Argument technique / potential weakness: The self-distillation strategy cleverly reuses image-pretrained knowledge, but how well the flattened scan order preserves spatio-temporal structure deserves scrutiny.
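As a rough illustration of the 3D patch ("tube") tokenization step, the following NumPy sketch flattens a video into tube tokens. The 16x16x2 tube size matches the architecture specification later in this section, but the function is a simplified stand-in for the model's learned patch embedding:

```python
import numpy as np

def tubify(video, ps=16, pt=2):
    """Split a (T, H, W, C) video into flattened 3D patches ('tubes').

    Each tube covers pt frames x ps x ps pixels; the result is the
    token sequence fed to the stacked bidirectional SSM blocks.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ps == 0 and W % ps == 0
    v = video.reshape(T // pt, pt, H // ps, ps, W // ps, ps, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)      # gather each tube's values together
    return v.reshape(-1, pt * ps * ps * C)    # (num_tokens, tube_dim)

video = np.zeros((16, 224, 224, 3), dtype=np.float32)
tokens = tubify(video)
print(tokens.shape)   # (1568, 1536): 8*14*14 tubes of 2*16*16*3 values each
```

With a temporal patch of 2, a 16-frame clip yields 1568 tokens (half the 3136 of the frame-by-frame count used in the Introduction's attention arithmetic).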
The selective scan mechanism in Mamba is key to its efficiency and expressiveness. Unlike traditional SSMs with fixed dynamics, Mamba uses input-dependent state transitions: the matrices governing state evolution are computed as functions of each input token. This makes the model content-aware while maintaining linear complexity. For video, this means the model can selectively attend to informative frames while efficiently skipping redundant ones. The bidirectional scanning ensures that temporal context from both past and future is available for each token's representation. We use a spatial-first, temporal-second flattening order that preserves spatial locality within each frame before connecting across time.
Paragraph function: Technical deep dive into the content-aware nature of the selective scan.
Logical role: Input-dependent state transitions give a linear-complexity model the effect of dynamic attention.
Argument technique / potential weakness: The flattening order (spatial-first, temporal-second) is an important design decision; different orders may yield different results.
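The input-dependent state update can be sketched as follows. This is a heavily simplified, single-direction toy version of the selective scan (scalar step size, diagonal transition), not Mamba's actual hardware-aware kernel:

```python
import numpy as np

def selective_scan(x, W_B, W_C, W_dt, A):
    """Input-dependent SSM, the core idea behind Mamba's selective scan.

    Unlike a fixed SSM, B_t, C_t and the step size dt_t are computed
    from each token x_t, so the state update is content-aware while the
    loop stays O(n) in sequence length.
    """
    d_state = A.shape[0]
    h = np.zeros((d_state, x.shape[1]))
    ys = []
    for x_t in x:
        dt = np.log1p(np.exp(W_dt @ x_t)).mean()   # softplus step size from input
        B_t = W_B @ x_t                            # input-dependent input projection
        C_t = W_C @ x_t                            # input-dependent readout
        A_bar = np.exp(dt * A)                     # discretized diagonal transition
        h = A_bar[:, None] * h + dt * B_t[:, None] * x_t[None, :]
        ys.append(C_t @ h)
    return np.stack(ys)

rng = np.random.default_rng(1)
n, d, d_state = 6, 4, 8
x = rng.normal(size=(n, d))
y = selective_scan(x,
                   rng.normal(size=(d_state, d)) * 0.1,
                   rng.normal(size=(d_state, d)) * 0.1,
                   rng.normal(size=(1, d)) * 0.1,
                   -np.ones(d_state))              # negative A keeps the scan stable
print(y.shape)  # (6, 4)
```

A small dt shrinks the update for a token (effectively "skipping" it), while a large dt lets the token dominate the state, which is how redundant frames can be passed over cheaply.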
The self-distillation strategy addresses a practical bottleneck: large-scale video datasets are scarce compared to image datasets. We initialize VideoMamba from a pretrained Vim (Vision Mamba) model trained on ImageNet. The key challenge is adapting the 2D patch embedding to 3D: we inflate the 2D patch embedding weights by repeating them along the temporal dimension and dividing by the number of temporal frames, ensuring the output magnitude remains consistent. During training, we use a distillation loss that encourages the video model to maintain the spatial understanding of the image model while learning temporal dynamics. This allows VideoMamba to achieve strong performance with only ImageNet-1K pretraining, without requiring large-scale video pretraining datasets like Kinetics-710 or HowTo100M.
Paragraph function: Self-distillation details: weight inflation and the knowledge-transfer mechanism.
Logical role: Dividing by the number of frames during weight inflation is a simple but crucial trick that keeps initialization stable.
Argument technique / potential weakness: Avoiding dependence on large-scale video pretraining substantially lowers the barrier to using the method.
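The weight-inflation trick ("repeat along time, divide by the number of frames") can be verified directly: for a static tube of identical frames, the inflated 3D kernel reproduces the 2D embedding exactly. Shapes here are illustrative:

```python
import numpy as np

def inflate_patch_embed(w2d, t_patch):
    """Inflate a 2D patch-embedding kernel to 3D for video input.

    w2d: (d_model, ps, ps, c) kernel from a pretrained image model.
    Repeating along a new temporal axis and dividing by t_patch keeps
    the 3D embedding's output magnitude equal to the 2D one when all
    t_patch frames in a tube are identical.
    """
    w3d = np.repeat(w2d[:, None], t_patch, axis=1)  # (d_model, t_patch, ps, ps, c)
    return w3d / t_patch

w2d = np.random.default_rng(2).normal(size=(8, 16, 16, 3))
w3d = inflate_patch_embed(w2d, t_patch=2)

# A static 2-frame tube produces the same embedding as the 2D kernel:
patch = np.random.default_rng(3).normal(size=(16, 16, 3))
tube = np.stack([patch, patch])                     # (2, 16, 16, 3)
e2d = np.einsum('dhwc,hwc->d', w2d, patch)
e3d = np.einsum('dthwc,thwc->d', w3d, tube)
print(np.allclose(e2d, e3d))  # True
```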
The VideoMamba architecture comes in three sizes to accommodate different computational budgets: VideoMamba-Ti (Tiny, 7M parameters), VideoMamba-S (Small, 26M), and VideoMamba-M (Middle, 74M). All variants use a patch size of 16x16x2 (spatial x temporal), processing videos at 224x224 spatial resolution. The number of bidirectional SSM blocks varies from 24 for Tiny to 32 for Middle, with hidden dimensions scaling accordingly. Position embeddings use learnable absolute embeddings that are interpolated when the number of input frames changes between training and inference, enabling flexible temporal resolution at test time.
Paragraph function: Architecture specifications: three model sizes and design details.
Logical role: The multi-scale design makes the method usable under different compute budgets, improving practicality.
Argument technique / potential weakness: Interpolating position embeddings gives the model flexible temporal resolution, a pragmatic engineering choice.
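Here is a minimal sketch of interpolating the temporal part of the learned absolute position embeddings when the frame count changes between training and inference. Linear interpolation is one plausible choice; the paper's exact scheme may differ:

```python
import numpy as np

def interp_temporal_pos(pos, new_t):
    """Linearly interpolate learned temporal position embeddings.

    pos: (t_train, d) absolute embeddings from training. Returns a
    (new_t, d) table so a model trained at t_train frames can run at
    a different frame count at inference time.
    """
    t_train, d = pos.shape
    src = np.linspace(0, t_train - 1, new_t)  # new positions on the old grid
    return np.stack([np.interp(src, np.arange(t_train), pos[:, j])
                     for j in range(d)], axis=1)

pos = np.random.default_rng(4).normal(size=(8, 4))   # trained with 8 temporal slots
pos32 = interp_temporal_pos(pos, 32)                 # inference with 32 slots
print(pos32.shape)  # (32, 4)
```

The endpoints of the interpolated table coincide with the first and last trained embeddings, so the temporal extremes keep their learned positions.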

4. Experiments

VideoMamba demonstrates strong results across multiple video understanding benchmarks. On Kinetics-400, VideoMamba-M achieves 82.0% top-1 accuracy, competitive with Video Swin Transformer while being significantly more efficient (3x fewer GFLOPs). On the long-form video benchmark Breakfast (average 2.3 minutes), VideoMamba achieves 89.4% accuracy, surpassing TimeSformer by 4.7%, demonstrating its strength in long-term modeling. For multi-modal video-text retrieval on MSR-VTT, VideoMamba achieves 46.2 R@1, showing strong compatibility with language models. The self-distillation strategy provides 3.2% accuracy improvement on Kinetics-400 compared to training from scratch.
Paragraph function: Core empirical evidence: validation across short-video, long-video, and multi-modal tasks.
Logical role: The 4.7% advantage on long videos directly validates the value of linear complexity for long-range modeling.
Argument technique / potential weakness: Multi-dimensional validation strengthens the case, but limited gains on some short-video tasks suggest Mamba's advantage lies mainly in long sequences.
Efficiency analysis reveals VideoMamba's computational advantages. At 16 frames input, VideoMamba-M uses 47 GFLOPs compared to Video Swin-B's 282 GFLOPs and TimeSformer-L's 590 GFLOPs. More importantly, scaling to 64 frames increases VideoMamba's cost to only 188 GFLOPs (linear scaling), while Video Swin would require over 1100 GFLOPs (quadratic scaling). Throughput measurements on a single A100 GPU show that VideoMamba processes 38 videos/second at 16 frames, compared to 12 for Video Swin and 6 for TimeSformer. Ablation on the self-distillation reveals that spatial knowledge preservation accounts for 2.1% of the total 3.2% improvement, while the remaining 1.1% comes from the weight initialization itself.
Paragraph function: Quantified efficiency: GFLOPs, throughput, and the self-distillation ablation.
Logical role: The 188 vs. 1100+ GFLOPs contrast at 64 frames is the most direct proof of linear vs. quadratic scaling.
Argument technique / potential weakness: Absolute throughput numbers (38 vs. 12 vs. 6 videos/second) directly quantify the benefit in real deployments.
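As a sanity check on the scaling claims, a small helper extrapolates a 16-frame GFLOPs figure under linear or quadratic growth. The 47-GFLOP VideoMamba baseline reproduces the reported 188 GFLOPs at 64 frames; the order-2 line shows what the same 16-frame budget would cost under dense attention. These extrapolations are illustrative, not measurements:

```python
def extrapolate(base_gflops, base_frames, frames, order):
    """Scale a compute estimate with frame count: order=1 linear, order=2 quadratic."""
    return base_gflops * (frames / base_frames) ** order

# VideoMamba-M: 47 GFLOPs at 16 frames; linear scaling predicts 188 at 64 frames.
print(extrapolate(47, 16, 64, order=1))   # 188.0
# The same 16-frame budget under dense (quadratic) attention:
print(extrapolate(47, 16, 64, order=2))   # 752.0
```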
We provide detailed analysis of where VideoMamba's advantages are most pronounced. On Something-Something V2, a dataset requiring fine-grained temporal reasoning (distinguishing "putting X into Y" from "taking X out of Y"), VideoMamba-M achieves 70.8% top-1 accuracy, outperforming Video Swin-T by 1.2% despite having fewer parameters. This demonstrates the selective scan's ability to focus on temporally discriminative moments. On the Epic-Kitchens-100 action anticipation task, requiring prediction of future actions from video context, VideoMamba achieves state-of-the-art results with 14.3% top-5 recall, suggesting that the SSM's sequential state propagation naturally suits temporal prediction tasks. Failure cases concentrate on activities requiring understanding of spatial relationships between multiple objects, where the flattened 1D sequence may lose critical spatial structure.
Paragraph function: Scenario analysis: where the method's advantages are most pronounced and where it fails.
Logical role: Success on fine-grained temporal reasoning and action anticipation directly validates the selective scan.
Argument technique / potential weakness: The failure cases on multi-object spatial relationships honestly expose an inherent limitation of 1D flattening.

5. Conclusion

We have presented VideoMamba, demonstrating that State Space Models provide a scalable and efficient solution for comprehensive video understanding. Through bidirectional scanning, 3D patch tokenization, and self-distillation, VideoMamba achieves competitive or superior performance across short-term, long-term, and multi-modal video tasks while maintaining linear computational complexity. Our work establishes Mamba as a viable and often preferable alternative to Transformers for video understanding.
Paragraph function: Summarizes the paper and establishes Mamba's place in video understanding.
Logical role: "A viable and often preferable alternative" neither overclaims nor understates the contribution.
Argument technique / potential weakness: VideoMamba opens a research direction for SSMs in video understanding; its influence extends beyond a single paper.
Future work should address several open questions. The flattening of 3D structure into 1D sequences may lose important spatial topology information; exploring multi-directional scanning strategies (e.g., separate spatial and temporal scans) could preserve this structure better. Combining VideoMamba with large language models for video question answering and video captioning represents a natural extension given the demonstrated multi-modal compatibility. The theoretical understanding of why SSMs work well for visual tasks remains incomplete and warrants further investigation, particularly regarding the relationship between the learned state dynamics and the visual features they encode.
Paragraph function: Outlook: scanning strategies, multi-modal extensions, and theoretical understanding.
Logical role: Candidly acknowledging the incomplete theory shows academic honesty while pointing to valuable research directions.
Argument technique / potential weakness: The lack of theoretical grounding is a long-term challenge for visual SSMs, but it does not diminish their practical value.
The significance of VideoMamba extends beyond benchmark numbers to the broader question of whether attention is the only path to effective visual understanding. For the past five years, the field has witnessed a steady march toward Transformer dominance in vision, with ConvNets gradually being displaced. VideoMamba suggests that state space models represent a viable third paradigm — one that combines the global modeling capability of Transformers with the computational efficiency closer to ConvNets. For video understanding specifically, where sequence lengths easily reach tens of thousands of tokens, the SSM paradigm may ultimately prove more practical than attention-based approaches. The ongoing development of hardware-optimized SSM implementations further strengthens this case, as dedicated kernels close the gap between theoretical and actual computational savings.
Paragraph function: Academic positioning: the historical significance of a third paradigm.
Logical role: Places VideoMamba within the grand narrative of architectural evolution from ConvNets to Transformers to SSMs.
Argument technique / potential weakness: The "third paradigm" framing is bold but grounded; progress in hardware-optimized kernels is key to realizing it.

Argument structure overview

Problem: CNNs are limited to local patterns; Transformers are computationally expensive.
Thesis: SSMs provide a linear-complexity solution.
Method: Bidirectional Mamba + self-distillation.
Evidence: 82.0% on K400; +4.7% on long videos.
Conclusion: An alternative to Transformers for video understanding.

Core claim (in one sentence)

By adapting Mamba's selective scan mechanism to the spatio-temporal structure of video, and using self-distillation to address data scarcity, video understanding performance on par with or better than Transformers can be achieved at linear complexity.

Strongest part of the argument

The clear advantage on the long-video benchmark (Breakfast, +4.7%) directly validates the key value of linear complexity for long-sequence modeling. The efficiency comparison (47 vs. 282 vs. 590 GFLOPs) and throughput numbers (38 vs. 12 vs. 6 videos/second) put the computational advantage beyond dispute.

Weakest part of the argument

The theoretical foundations of Mamba on visual tasks remain incomplete, and flattening 3D structure into a 1D sequence may lose spatial topology information. Gains on some short-video tasks are limited, suggesting that Mamba's advantage is concentrated in long-sequence settings.
