SEA-RAFT: Two-Column Annotation

Abstract

We present SEA-RAFT, a simpler, more efficient, and more accurate method for optical flow estimation built on the RAFT architecture. Compared to RAFT, SEA-RAFT introduces three key improvements: (1) a new mixture-of-Laplace loss function that better handles the multi-modal nature of optical flow uncertainty; (2) direct regression of an initial flow estimate, which gives iterative refinement a better starting point and leads to faster convergence; and (3) rigid-motion pre-training on synthetic data to improve generalization to real-world scenes. SEA-RAFT achieves state-of-the-art accuracy on the Spring benchmark with an endpoint error (EPE) of 3.69 and a 1-pixel outlier rate of 0.36, which are 22.9% and 17.8% error reductions, respectively, relative to the best published results, while running at least 2.3x faster than existing methods.

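Paragraph function: Overview of the whole paper; clearly lists the three improvements and the quantified results.

Logical role: Positions the method with three words, "simple, efficient, accurate" (exactly the acronym SEA), and backs the positioning with headline benchmark numbers.

Argumentation technique / potential weakness: The name SEA-RAFT cleverly encodes the method's properties. The 22.9% error reduction is impressive, but how representative the Spring benchmark is needs to be considered.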
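The two metrics quoted in the abstract can be made concrete with a small sketch. It assumes the usual definitions: endpoint error is the Euclidean distance between predicted and ground-truth flow vectors, averaged over pixels, and the 1-pixel outlier rate is the percentage of pixels whose endpoint error exceeds 1 pixel. The function names and the four-pixel toy data are hypothetical, chosen only for illustration.

```python
import math


def endpoint_errors(pred, gt):
    """Per-pixel Euclidean distance between predicted and true flow vectors."""
    return [math.hypot(pu - gu, pv - gv) for (pu, pv), (gu, gv) in zip(pred, gt)]


def mean_epe(pred, gt):
    """Endpoint error (EPE): the mean of the per-pixel endpoint errors."""
    errors = endpoint_errors(pred, gt)
    return sum(errors) / len(errors)


def outlier_rate(pred, gt, threshold=1.0):
    """Percentage of pixels whose endpoint error exceeds `threshold` pixels."""
    errors = endpoint_errors(pred, gt)
    return 100.0 * sum(e > threshold for e in errors) / len(errors)


# Hypothetical toy data: four pixels, each flow vector is (u, v) in pixels.
gt_flow = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0), (0.0, 0.0)]
pred_flow = [(1.0, 0.0), (0.0, 2.5), (0.0, 0.0), (0.3, 0.4)]

print("EPE:", mean_epe(pred_flow, gt_flow))
print("1px outlier rate (%):", outlier_rate(pred_flow, gt_flow))
```

In this toy case a single badly wrong pixel dominates the EPE while counting only once toward the outlier rate, which is one reason benchmarks tend to report both numbers.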

1. Introduction

Optical flow estimation, the task of computing dense per-pixel motion between consecutive video frames, is a fundamental problem in computer vision with applications in video understanding, autonomous driving, video editing, and action recognition. The introduction of RAFT (Recurrent All-Pairs Field Transforms) marked a paradigm shift in the field, establishing an architecture based on iterative refinement of flow through a recurrent update operator. RAFT's design has proven remarkably influential, and most subsequent work makes incremental improvements on top of its framework. However, many of these follow-up works introduce significant computational overhead through larger feature extractors, more complex correlation computations, or additional network components.

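Paragraph function: Establishes the research setting; reviews RAFT's influence and points out the efficiency problems of follow-up work.

Logical role: First acknowledges RAFT's contribution, then points out that follow-up work has regressed in efficiency, paving the way for SEA-RAFT's "back to simplicity" positioning.

Argumentation technique / potential weakness: A "back to simplicity" strategy is increasingly valued in ML, so this positioning is both pragmatic and persuasive.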

2. Method

The Mixture of Laplace (MoL) loss is our first key contribution. Standard optical flow training uses an L1 or L2 loss over the sequence of iterative predictions, treating flow estimation as a deterministic regression problem. However, optical flow is inherently ambiguous in many regions: at occlusion boundaries, in textureless areas, and on repetitive patterns, multiple flow values are plausible. The MoL loss models the flow prediction as a mixture of Laplace distributions, where the model outputs both the predicted flow and a confidence map. This allows the model to express uncertainty and to allocate its capacity to learning from reliable regions while staying robust to ambiguous ones. Empirically, the MoL loss provides consistent improvements across all benchmarks compared to the standard loss.

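Paragraph function: Presents the first innovation; the design motivation and mechanism of the mixture-of-Laplace loss.

Logical role: Moves from the inherent ambiguity of optical flow (the problem) to probabilistic modeling (the solution); the chain of reasoning is clear.

Argumentation technique / potential weakness: Recasting a deterministic problem as a probabilistic one is a well-considered design, but the choice of the number of mixture components and the stability of training may require careful tuning.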
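To make the idea of a mixture-of-Laplace loss concrete, here is a minimal sketch of a negative log-likelihood for a single flow component at a single pixel. It is an illustration under stated assumptions, not the paper's exact formulation: it assumes two mixture components that share the predicted flow as their location, scales parameterized by their logarithm, and a mixing weight `alpha` strictly between 0 and 1. All function and parameter names are hypothetical.

```python
import math


def laplace_log_pdf(x, mu, log_b):
    """Log density of Laplace(mu, b) at x, with scale b = exp(log_b)."""
    return -math.log(2.0) - log_b - abs(x - mu) * math.exp(-log_b)


def mol_nll(target, mu, alpha, log_b1, log_b2):
    """Negative log-likelihood of a two-component Laplace mixture.

    Assumption: both components share the location `mu` (the predicted flow);
    `alpha` in (0, 1) weights the first component, 1 - alpha the second.
    """
    a = math.log(alpha) + laplace_log_pdf(target, mu, log_b1)
    b = math.log(1.0 - alpha) + laplace_log_pdf(target, mu, log_b2)
    m = max(a, b)  # log-sum-exp trick for numerical stability
    return -(m + math.log(math.exp(a - m) + math.exp(b - m)))


# With equal unit scales the mixture collapses to a plain L1-style penalty:
# NLL = log(2) + |target - mu|.
l1_like = mol_nll(target=10.0, mu=0.0, alpha=0.5, log_b1=0.0, log_b2=0.0)

# Adding a wide second component lets a large residual (for example at an
# ambiguous pixel) be explained with a much smaller penalty.
with_wide = mol_nll(target=10.0, mu=0.0, alpha=0.5, log_b1=0.0, log_b2=2.0)

print(l1_like, with_wide)
```

In a real model the mixing weight and the scale would be predicted per pixel, which is one way to read the "confidence map" mentioned above; the sketch fixes them by hand to keep the arithmetic visible.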

Our second contribution is direct initial flow regression. In the original RAFT, iterative refinement starts from a zero-initialized flow field and needs many iterations to converge to the correct solution, especially for large displacements. We add a lightweight regression head that directly predicts an initial flow estimate from the correlation volume. This gives the iterative updates a much better starting point and reduces the number of iterations needed for convergence by approximately 50%. Our third contribution, rigid-motion pre-training, uses synthetically generated scenes with known rigid-body transformations to give the model a strong prior for real-world motion patterns, significantly improving cross-dataset generalization, particularly on KITTI and Spring.
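The effect of a better starting point on an iterative refiner can be illustrated with a deliberately simple toy. It is not the RAFT update operator: it assumes a hypothetical one-dimensional "update" that moves the estimate toward the true flow by at most `max_step` pixels per iteration, so the iteration count depends directly on how far the initial estimate is from the target. The iteration counts it produces are properties of the toy, not measurements of SEA-RAFT.

```python
def refine(flow_init, flow_target, max_step=1.0, tol=1e-3, max_iters=100):
    """Toy stand-in for iterative refinement (1-D, hypothetical update rule).

    Each iteration moves the estimate toward the target by at most
    `max_step` pixels. Returns the final estimate and the iterations used.
    """
    flow = flow_init
    for iteration in range(max_iters):
        residual = flow_target - flow
        if abs(residual) <= tol:
            return flow, iteration
        # Clip the correction, mimicking an update with limited reach.
        flow += max(-max_step, min(max_step, residual))
    return flow, max_iters


true_flow = 12.0  # a large displacement, in pixels

# Zero initialization, as in the original RAFT.
_, iters_from_zero = refine(0.0, true_flow)

# Warm start from a coarse regressed estimate (hypothetical value).
_, iters_from_init = refine(9.0, true_flow)

print(iters_from_zero, iters_from_init)
```

The closer the regressed initial flow is to the true flow, the fewer refinement steps remain, which is the intuition behind the claimed reduction in iterations.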

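Paragraph function: Presents the second and third innovations: initial flow regression and rigid-motion pre-training.

Logical role: The three improvements are independent yet complementary: the loss function improves learning, the initial flow speeds up convergence, and pre-training improves generalization.

Argumentation technique / potential weakness: "About 50% fewer iterations" is a strong efficiency argument. The orthogonality of the three improvements makes the ablation studies more convincing.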
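For rigid-motion pre-training, the useful property of "known rigid-body transformations" is that ground-truth flow follows directly from geometry. The sketch below shows this for one pixel, assuming a pinhole camera with focal length `focal` and principal point `(cx, cy)`, a known depth, and a rotation and translation that map a scene point from frame-1 to frame-2 camera coordinates. These conventions and all names are assumptions made for illustration; the text does not specify how the training data is generated.

```python
def rigid_flow(u, v, depth, focal, cx, cy, rotation, translation):
    """Flow at pixel (u, v) induced by a rigid transform of its scene point.

    `rotation` is a 3x3 nested list and `translation` a 3-tuple; together
    they map the point from frame-1 to frame-2 camera coordinates.
    """
    # Back-project the pixel to a 3-D point using its depth.
    point = ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)
    # Apply the rigid-body transform X' = R X + t.
    moved = [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    # Re-project into the image plane of the second frame.
    u2 = focal * moved[0] / moved[2] + cx
    v2 = focal * moved[1] / moved[2] + cy
    return (u2 - u, v2 - v)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = (0.1, 0.0, 0.0)  # 0.1 units sideways, no rotation

# The same motion yields larger flow for a near point than for a far one.
near = rigid_flow(320.0, 240.0, 2.0, 500.0, 320.0, 240.0, identity, shift)
far = rigid_flow(320.0, 240.0, 20.0, 500.0, 320.0, 240.0, identity, shift)
print(near, far)
```

Because flow generated this way is exact wherever depth and pose are known, it can supply dense supervision without manual labeling.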

3. Experiments

We evaluate SEA-RAFT on four major optical flow benchmarks: Sintel, KITTI, Spring, and FlyingThings. On the Spring benchmark, SEA-RAFT achieves a new state of the art with 3.69 EPE, surpassing the previous best of 4.79 EPE. On KITTI 2015, we achieve an Fl-all of 4.31%, competitive with significantly larger models. Importantly, SEA-RAFT achieves the best cross-dataset generalization performance: when trained only on FlyingThings and evaluated on KITTI and Spring without fine-tuning, our method outperforms all baselines. In terms of efficiency, SEA-RAFT processes 1080p frames at 8.3 FPS, which is 2.3x faster than FlowFormer and 3.1x faster than GMFlow+, while achieving better or comparable accuracy.

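Paragraph function: Provides the core empirical evidence; a comprehensive validation across multiple benchmarks and metrics.

Logical role: Evidence along two dimensions, cross-dataset generalization and speed, strongly supports the core claim of being "simple and efficient".

Argumentation technique / potential weakness: The improvement on Spring, 3.69 versus 4.79 EPE, is substantial. The cross-dataset generalization results are especially persuasive because they test the method's real ability to generalize rather than overfitting to one benchmark.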
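The headline percentages can be checked with simple arithmetic on the numbers quoted above. The sketch computes the relative error reduction on Spring from 4.79 and 3.69 EPE, and the baseline throughput that the stated speedups would imply at 8.3 FPS. The implied FPS values are derived from the text, not reported measurements, and the function names are hypothetical.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


def implied_baseline_fps(our_fps, speedup):
    """Throughput a baseline would have if `our_fps` is `speedup` times faster."""
    return our_fps / speedup


# Spring EPE: previous best 4.79, SEA-RAFT 3.69 (figures from the text).
spring_reduction = relative_reduction(4.79, 3.69)

# 1080p throughput: 8.3 FPS, stated as 2.3x and 3.1x faster than two baselines.
flowformer_fps = implied_baseline_fps(8.3, 2.3)
gmflow_plus_fps = implied_baseline_fps(8.3, 3.1)

print(f"Spring EPE reduction: {spring_reduction:.2f}%")
print(f"Implied FlowFormer FPS: {flowformer_fps:.2f}")
print(f"Implied GMFlow+ FPS: {gmflow_plus_fps:.2f}")
```

The computed reduction is about 22.96%, in line with the 22.9% quoted in the abstract, so the Spring figures in the abstract and in this section are mutually consistent.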

4. Conclusion

We have presented SEA-RAFT, demonstrating that carefully designed, simple modifications to the RAFT framework can yield substantial improvements in accuracy, efficiency, and generalization. Our three contributions (mixture-of-Laplace loss, direct initial flow regression, and rigid-motion pre-training) are each independently beneficial and combine synergistically. SEA-RAFT achieves state-of-the-art results on Spring with significant error reductions, the best cross-dataset generalization, and at least a 2.3x speedup over comparable methods. Our work demonstrates the value of revisiting fundamental design choices rather than pursuing ever more complex architectures.

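Paragraph function: Summarizes the paper; restates the independent and synergistic effects of the three contributions.

Logical role: Closes with the philosophy that simplicity beats complexity, echoing the "Simple" in the name.

Argumentation technique / potential weakness: The "back to simplicity" message is clear and forceful. As an Award Candidate, SEA-RAFT shows mature judgment in its engineering taste.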

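Argument structure overview

Problem: RAFT follow-up work is overly complex
→
Thesis: Simple improvements can outperform complex methods
→
Method: MoL loss + initial flow + pre-training
→
Evidence: Spring EPE 3.69, at least 2.3x faster
→
Conclusion: The value of simple design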

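Core claim

Through three carefully designed, simple improvements (loss function, initialization strategy, and pre-training scheme), accuracy, efficiency, and generalization can all be improved substantially within the RAFT framework.

Strongest point of the argument

The cross-dataset generalization results (outperforming the baselines without fine-tuning) are the most persuasive evidence, showing that the improvements reflect a real gain in capability rather than overfitting to a specific benchmark.

Weakest point of the argument

Although each of the three improvements is effective on its own, an in-depth analysis of their interactions is missing. In addition, performance in some scenes with dense small objects may still be limited by the resolution of RAFT's correlation computation.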