Tracking Everything Everywhere All at Once (OmniMotion)

Abstract

We present a test-time optimization method for estimating dense and long-range motion from video. Our approach, OmniMotion, represents a video using a quasi-3D canonical volume and maps between this volume and each frame through learned bijective transformations. This representation enables computing accurate, full-length motion trajectories for every pixel while tracking through occlusions and maintaining global consistency. We demonstrate that OmniMotion significantly outperforms prior state-of-the-art methods on the TAP-Vid benchmark.

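Paragraph function: Overview of the whole paper. It positions OmniMotion by four goals: dense, long-range, occlusion-aware, and globally consistent.

Logical role: The abstract carries both a technical preview and a claim of results. The quasi-3D canonical volume and the bijective transformations are the core of the method, and the significant gain on TAP-Vid is the empirical support.

Argumentation technique / potential weaknesses: "Full-length trajectories for every pixel" is a highly ambitious claim. "Test-time optimization" means each video must be optimized separately, so computational cost may be the main limit on practicality, yet the abstract does not mention it.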

1. Introduction

Estimating dense pixel correspondences across video frames is fundamental to computer vision. Three key challenges persist: maintaining accurate tracks across long sequences, tracking points through occlusions, and maintaining coherence in space and time. Traditional methods use either sparse feature tracking — effective for distinctive points but limited in coverage — or dense optical flow, which works well for consecutive frames but accumulates drift errors when chained over long sequences. We propose a holistic approach that uses "all the information in a video to jointly estimate full-length motion trajectories".

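Paragraph function: Establishes the research setting by defining the three challenges and pointing out the fundamental limits of existing methods.

Logical role: Starting point of the argument chain. The dilemma of "sparse but long-range" versus "dense but short-range" motivates the need for a holistic approach that is both dense and long-range.

Argumentation technique / potential weaknesses: The three-challenge framing is clear and forceful. The phrase "all the information" hints at the advantage of global optimization, but it also foreshadows quadratic growth in computational cost, an inherent price of holistic methods.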

Prior approaches to long-range motion estimation include flow chaining (concatenating pairwise optical flow), PIPs (tracking within 8-frame windows), and TAP-Net (single-frame predictions). Pairwise estimators such as RAFT produce excellent flow between two frames, but chaining that flow accumulates errors over long sequences and lacks temporal coherence. PIPs handles occlusions within temporal windows but requires chaining for longer sequences. TAP-Net makes independent per-frame predictions that lack temporal context. Deformable Sprites provides a layered representation but is limited to fixed layer ordering. Our work fundamentally differs by optimizing a global 3D representation that enforces consistency across the entire video.

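Paragraph function: Literature review that systematically compares the capabilities and limits of four classes of methods.

Logical role: Judges all prior methods along a single "consistency" dimension, so that OmniMotion's global 3D representation emerges naturally as the only complete solution.

Argumentation technique / potential weaknesses: Listing each method's specific weakness one by one is an effective elimination strategy. However, TAP-Net and PIPs are feed-forward methods with the advantage of real-time inference, whereas OmniMotion's test-time optimization takes hours. This efficiency gap is passed over.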

3. Method

3.1 Canonical 3D Volume

OmniMotion represents the entire video using a canonical 3D volume: a coordinate-based network F_theta that maps each canonical 3D coordinate to a density (sigma) and a color (c). This serves as a three-dimensional atlas of the observed scene. It resembles a NeRF, but its purpose is to establish dense correspondences across time rather than to synthesize novel views. The key insight is that mapping all frames to a shared 3D space makes correspondences implicit: any two points in different frames that map to the same canonical location are in correspondence.

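Paragraph function: Foundation of the method. Defines the structure and purpose of the canonical volume.

Logical role: Establishes the method's core abstraction. It borrows NeRF's volumetric representation but repurposes it from rendering to correspondence. The shared space makes correspondences implicit and is the mathematical basis for global consistency.

Argumentation technique / potential weaknesses: Reinterpreting NeRF as a correspondence tool is a creative conceptual shift. However, the canonical volume assumes the scene has a stable 3D structure. For highly non-rigid scenes (such as fluids or deforming cloth), this assumption may not hold.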
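To make the canonical volume of Section 3.1 concrete, the following is a minimal sketch of a coordinate-based network that maps a canonical 3D point to a density and a color. The layer sizes, random weights, and the names `make_mlp` and `canonical_field` are assumptions chosen for illustration. The softplus and sigmoid output activations are also assumptions, not details confirmed by the source.

```python
import math
import random

def make_mlp(sizes, seed=0):
    """Build random weights for a small fully connected network (illustrative only)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
             for _ in range(n_out)]
        layers.append((w, [0.0] * n_out))
    return layers

def canonical_field(layers, u):
    """Map a canonical 3D coordinate u to (sigma, rgb)."""
    h = list(u)
    for k, (w, b) in enumerate(layers):
        h = [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(w, b)]
        if k < len(layers) - 1:
            h = [max(0.0, v) for v in h]  # ReLU on hidden layers
    sigma = math.log1p(math.exp(h[0]))  # softplus keeps density non-negative (assumption)
    rgb = tuple(1.0 / (1.0 + math.exp(-v)) for v in h[1:4])  # sigmoid keeps colors in (0, 1)
    return sigma, rgb

# Hypothetical F_theta: 3 inputs (x, y, z), 4 outputs (sigma, r, g, b).
F_theta = make_mlp([3, 16, 16, 4])
sigma, rgb = canonical_field(F_theta, (0.1, -0.2, 0.3))
print(sigma, rgb)
```

Because the field is a function of the canonical coordinate only, two points from different frames that map to the same canonical location receive the same density and color, which is what makes correspondences implicit.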

3.2 Bijective Transformations

The representation defines invertible mappings T_i between local frame coordinates L_i and canonical space: u = T_i(x_i). Crucially, "bijective mappings ensure that the resulting correspondences between 3D points in individual frames are all cycle consistent". The bijections are parameterized using Real-NVP (invertible neural networks) conditioned on per-frame latent codes psi_i. To map a point from frame i to frame j, compose the two maps: x_j = (T_j^(-1) ∘ T_i)(x_i) = T_j^(-1)(T_i(x_i)). The method creates a "quasi-3D representation" as a "relaxation of dynamic multi-view geometry", allowing flexible modeling without solving the ill-posed full 3D reconstruction problem.

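Paragraph function: Core innovation. Bijective mappings guarantee cycle consistency.

Logical role: This paragraph is the technical backbone of the paper. The mathematical properties of a bijection (invertible and one-to-one) guarantee cycle consistency by construction, with no extra constraints. The composition T_j^(-1) ∘ T_i elegantly unifies the mapping between any two frames as a path through canonical space.

Argumentation technique / potential weaknesses: The "quasi-3D" positioning is a subtle compromise: more structured than pure 2D optical flow, more flexible than full 3D reconstruction. However, Real-NVP has limited expressive power (it must remain invertible) and may fail to capture extreme topological changes (such as an object splitting or a new object appearing).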
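The cycle-consistency argument of Section 3.2 can be checked numerically with a small sketch. It uses affine coupling layers in the style of Real-NVP, where one coordinate passes through unchanged and drives a scale and shift on the others. The scale and shift functions here are fixed toy formulas standing in for learned networks, and the scalar latent codes and the names `coupling`, `T`, `T_inv`, and `map_between` are assumptions for illustration.

```python
import math

def coupling(x, psi, keep, inverse=False):
    """One affine coupling layer on a 3D point.

    The coordinate at index `keep` is left unchanged and determines the
    scale s and shift t applied to the other two coordinates.
    """
    a = x[keep]
    out = list(x)
    for d in range(3):
        if d == keep:
            continue
        # Toy stand-ins for learned scale/shift networks (assumption).
        s = 0.5 * math.tanh(a + psi + d)
        t = math.sin(2.0 * a - psi + d)
        out[d] = (x[d] - t) * math.exp(-s) if inverse else x[d] * math.exp(s) + t
    return tuple(out)

def T(x, psi):
    """Frame-to-canonical map: three stacked coupling layers."""
    for keep in (0, 1, 2):
        x = coupling(x, psi, keep)
    return x

def T_inv(u, psi):
    """Canonical-to-frame map: invert the layers in reverse order."""
    for keep in (2, 1, 0):
        u = coupling(u, psi, keep, inverse=True)
    return u

def map_between(x_i, psi_i, psi_j):
    """x_j = T_j^(-1)(T_i(x_i))."""
    return T_inv(T(x_i, psi_i), psi_j)

# Hypothetical per-frame latent codes for frames i, j, k.
psi_i, psi_j, psi_k = 0.3, -0.7, 1.5
x_i = (0.2, -0.4, 1.1)
x_j = map_between(x_i, psi_i, psi_j)
x_k = map_between(x_j, psi_j, psi_k)
x_back = map_between(x_k, psi_k, psi_i)
print(max(abs(a - b) for a, b in zip(x_back, x_i)))  # round-trip error
```

Going i to j to k and back to i returns the starting point up to floating-point error, and mapping i to k directly agrees with going through j. Both follow from invertibility alone, without any consistency loss.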

For any query pixel p_i in frame i, the motion computation proceeds as: (1) lift to 3D by sampling K points along a ray, (2) map to canonical space via T_i to obtain densities and colors, (3) map to target frame j via T_j^(-1), (4) aggregate via alpha compositing using density-based weights, and (5) project to 2D to obtain the predicted location. The optimization uses three complementary losses: flow loss (MAE against input optical flow), photometric loss (MSE against observed colors), and regularization loss (penalizing large accelerations in 3D trajectories).

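Paragraph function: The complete pipeline: five-step motion computation and three-loss optimization.

Logical role: Grounds the abstract bijective mappings in a concrete computation. Using alpha compositing neatly folds occlusion handling into motion estimation: occluded points have low density and are naturally suppressed in the composite.

Argumentation technique / potential weaknesses: The five-step pipeline description is clear and reproducible. However, the flow loss relies on an external optical flow estimate (such as RAFT) as supervision, so the method's ceiling is bounded by the quality of the input flow. In addition, K=32 ray samples may not be dense enough in scenes with complex occlusions.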
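The five steps and three losses described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the frame maps are simple translations instead of learned bijections, the density is a hand-written step function, the opacity is taken as 1 - exp(-sigma), projection is orthographic, and the loss normalizations are one plausible reading of "MAE" and "MSE". All function names are hypothetical.

```python
import math

def composite(points, sigmas):
    """Alpha-composite 3D points front to back with density-based weights."""
    transmittance, acc = 1.0, [0.0, 0.0, 0.0]
    for p, sigma in zip(points, sigmas):
        alpha = 1.0 - math.exp(-sigma)   # opacity of this sample
        weight = transmittance * alpha   # what remains visible behind nearer samples
        acc = [a + weight * c for a, c in zip(acc, p)]
        transmittance *= 1.0 - alpha
    return tuple(acc)

def track(p_i, T_i, T_j_inv, density, K=32, near=0.0, far=2.0):
    """Predict where pixel p_i of frame i lands in frame j."""
    depths = [near + (far - near) * (k + 0.5) / K for k in range(K)]
    samples = [(p_i[0], p_i[1], z) for z in depths]  # (1) lift along the ray
    canon = [T_i(x) for x in samples]                # (2) map to canonical space
    sigmas = [density(u) for u in canon]             #     and query densities
    mapped = [T_j_inv(u) for u in canon]             # (3) map to frame j
    x_j = composite(mapped, sigmas)                  # (4) alpha compositing
    return x_j[0], x_j[1]                            # (5) orthographic projection

def flow_loss(pred, target):
    """Mean L1 distance between predicted and input 2D flow vectors."""
    return sum(abs(a - b) for f, g in zip(pred, target) for a, b in zip(f, g)) / len(pred)

def photometric_loss(pred, target):
    """Mean squared distance between rendered and observed RGB colors."""
    return sum((a - b) ** 2 for c, d in zip(pred, target) for a, b in zip(c, d)) / len(pred)

def acceleration_penalty(traj):
    """Mean L1 norm of the second difference of a 3D trajectory (needs >= 3 points)."""
    acc = [sum(abs(a - 2 * b + c) for a, b, c in zip(x0, x1, x2))
           for x0, x1, x2 in zip(traj, traj[1:], traj[2:])]
    return sum(acc) / len(acc)

# Toy scene (assumption): frame i coincides with canonical space, frame j is
# shifted by (0.3, -0.1), and an opaque surface fills canonical depth z >= 1.
def T_i(x):
    return x

def T_j_inv(u):
    return (u[0] + 0.3, u[1] - 0.1, u[2])

def density(u):
    return 50.0 if u[2] >= 1.0 else 0.0

print(track((0.5, 0.5), T_i, T_j_inv, density))
```

In this toy scene the samples in front of the surface have zero density and contribute nothing, so the prediction is governed by the first sample on the surface, and the query pixel moves by the frame shift.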

4. Experiments

We evaluate on the TAP-Vid benchmark comprising DAVIS (30 real videos), Kinetics (100 videos), and RGB-Stacking (50 synthetic videos). On DAVIS, OmniMotion achieves AJ: 51.7 (best), position accuracy: 67.5 (best), occlusion accuracy: 85.3 (best), and temporal coherence: 0.74 (best). The improvements are substantial: compared to the strongest baseline, AJ improves by 1.1 points absolute while temporal coherence improves dramatically. Ablation studies confirm all components are critical: removing invertible networks drops AJ from 51.7 to 12.5; removing photometric loss drops AJ to 42.3; uniform sampling instead of hard mining drops AJ to 47.8.

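Paragraph function: Provides comprehensive quantitative evidence through benchmark evaluation and ablation analysis.

Logical role: Empirical backbone. The method is best on all four metrics, and the ablation shows the absolute necessity of the invertible network through an extreme drop (51.7 to 12.5).

Argumentation technique / potential weaknesses: The importance of the invertible network in the ablation (AJ falls to 12.5 without it) is highly persuasive. However, an absolute AJ of 51.7 means many tracking failures remain. In addition, each video requires 200K optimization iterations (hours of compute), a significant practical limitation.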
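For readers unfamiliar with the position accuracy figure quoted above, here is a small sketch of how such a score can be computed. It assumes the common TAP-Vid convention of averaging, over pixel thresholds of 1, 2, 4, 8, and 16, the fraction of visible points predicted within each threshold. The source does not spell out this definition, so treat the thresholds, the strict inequality, and the function name `position_accuracy` as assumptions.

```python
import math

def position_accuracy(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Average, over pixel thresholds, of the fraction of visible points
    whose predicted location lies within that threshold of ground truth."""
    errors = [math.hypot(p[0] - g[0], p[1] - g[1])
              for p, g, v in zip(pred, gt, visible) if v]
    if not errors:
        raise ValueError("no visible ground-truth points to score")
    fractions = [sum(e < t for e in errors) / len(errors) for t in thresholds]
    return sum(fractions) / len(thresholds)

# Made-up example: three visible points with errors of 0.5, 3, and 20 pixels,
# plus one occluded point that is ignored.
gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0), (40.0, 40.0)]
pred = [(10.5, 10.0), (20.0, 23.0), (50.0, 30.0), (0.0, 0.0)]
visible = [True, True, True, False]
score = position_accuracy(pred, gt, visible)
print(round(100 * score, 1))
```

Reported values such as 67.5 are this kind of fraction expressed as a percentage. Occlusion accuracy and AJ additionally account for whether visibility itself is predicted correctly, which this sketch does not cover.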

5. Conclusion

OmniMotion presents a complete solution to dense, long-range video motion estimation. Through its quasi-3D canonical representation with guaranteed cycle consistency via bijective mappings, the method achieves state-of-the-art performance while handling general videos with arbitrary camera and scene motion. The test-time optimization approach leverages global information across entire videos, fundamentally addressing limitations of local, frame-pair-based methods. Limitations include struggles with rapid non-rigid motion, sensitivity to initialization, and quadratic scaling with sequence length.

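Paragraph function: Summarizes the paper by restating the contributions and candidly listing the limitations.

Logical role: The conclusion balances achievements and limitations. It first stresses the core advantage of global consistency, then admits three specific limitations, which signals academic integrity.

Argumentation technique / potential weaknesses: Explicitly listing three limitations increases credibility. The quadratic computational cost is the most severe one, since it limits the method's applicability to long videos. Sensitivity to initialization also deserves attention, because it suggests results may vary non-negligibly across random seeds.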

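Argument structure overview

Problem
Dense, long-range motion estimation lacks global consistency

→

Claim
A quasi-3D canonical volume plus bijective mappings guarantees consistency

→

Evidence
Best on all four TAP-Vid metrics
Ablations confirm each component is necessary

→

Rebuttal
The quasi-3D relaxation avoids ill-posed full 3D reconstruction

→

Conclusion
Global optimization is the right direction for dense, long-range tracking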

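Authors' core claim (one sentence)

By representing a video as a quasi-3D canonical volume and guaranteeing cycle consistency through bijective mappings, one can compute dense, full-length motion trajectories for every pixel in general videos while naturally handling occlusions and maintaining global spatiotemporal consistency.

Strongest point of the argument

The mathematical elegance of bijective mappings: invertible maps guarantee cycle consistency by construction, with no extra consistency loss or post-processing. In the ablation, removing the invertible network makes AJ collapse from 51.7 to 12.5, a gap large enough to establish this design as the core of the method.

Weakest point of the argument

Computational efficiency and scalability: the 200K iterations of test-time optimization per video (hours of compute) and a cost that grows quadratically with sequence length severely limit practical use. In addition, the method relies on external optical flow (RAFT) as its supervision signal, so its ceiling is bounded by the quality of the input flow.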