RAFT: Recurrent All-Pairs Field Transforms for Optical Flow

Abstract

We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves state-of-the-art performance on the Sintel and KITTI benchmarks. On Sintel (final pass), RAFT obtains an end-point-error of 2.855, a 16% error reduction from the best published result. RAFT has strong cross-dataset generalization as well as high efficiency in inference time, training speed, and parameter count.

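Paragraph function: Overview of the whole paper, concisely outlining RAFT's three-stage pipeline and its core performance advantages.

Logical role: The abstract sets up a clear chain from architecture description to quantitative results to additional advantages, guiding the reader toward RAFT's twofold breakthrough in accuracy and efficiency.

Argumentation technique / potential weakness: Citing a concrete figure (a 16% error reduction) is highly persuasive. However, the computational cost of the all-pairs design is not addressed here, which may leave readers worried about scalability.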

1. Introduction

Optical flow is the task of estimating per-pixel motion between video frames. It is a fundamental problem in computer vision, with applications in action recognition, video editing, and autonomous navigation. Most recent approaches frame optical flow as a learning problem, directly regressing the flow field from a pair of images in a single forward pass. While these approaches have shown strong results, they lack the iterative refinement that is common in classical optimization-based methods, which rely on principles such as coarse-to-fine processing and variational optimization.

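Paragraph function: Establishes the research context by defining the optical flow problem and identifying the core gap between deep learning methods and classical methods.

Logical role: The starting point of the argument. It acknowledges the success of learning-based methods while pointing to a structural weakness, the lack of iterative refinement, which motivates RAFT's design.

Argumentation technique / potential weakness: The paragraph contrasts the strength of classical methods (iterative refinement) with the weakness of learning-based methods, implying that an ideal solution should combine the two. However, a "single forward pass" does not characterize all prior methods; PWC-Net, for example, already refines flow across multiple scales.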

We propose RAFT, a new architecture that combines the benefits of both paradigms. Our approach has three main components: (1) a feature encoder that extracts per-pixel features from both input images, (2) a correlation layer that computes visual similarity between all pairs of pixels to produce a 4D correlation volume, and (3) a recurrent update operator that iteratively refines the flow estimate by performing lookups on the correlation volume. Unlike prior work that operates at a single fixed resolution or uses a coarse-to-fine pyramid, RAFT maintains and updates a single high-resolution flow field.

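Paragraph function: Presents the method, clearly enumerating RAFT's three components and its key design difference.

Logical role: Follows the problem statement of the previous paragraph with a concrete solution. The enumerated description of the three components makes the architecture easy to grasp at a glance.

Argumentation technique / potential weakness: Although the all-pairs design is comprehensive, its O(N^2) memory complexity (with N the number of feature-map pixels) may become a bottleneck in practical deployment. The authors use a multi-scale pyramid to ease this, but it is not mentioned here.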

2. Approach

Given a pair of consecutive images I_1 and I_2, we estimate a dense displacement field (f^1, f^2) that maps each pixel (u, v) in I_1 to its corresponding coordinates (u', v') = (u + f^1(u, v), v + f^2(u, v)) in I_2. Here a superscript denotes the horizontal or vertical component of the flow, while a subscript denotes the iteration: our approach produces a sequence of flow estimates {f_1, ..., f_N}, and the update operator is trained so that this sequence converges to a fixed point. The feature encoder network g_theta maps both input images to dense feature maps at 1/8 of the input resolution. The encoder weights are shared, so the same network is applied to both frames.

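Paragraph function: Mathematical formalization, defining the inputs and outputs of the optical flow problem and describing the feature extraction stage.

Logical role: Transitions from the problem definition to the concrete implementation and establishes the notation used in later sections.

Argumentation technique / potential weakness: The phrase "converges to a fixed point" suggests good theoretical properties, but no rigorous convergence proof is actually provided. Downsampling to 1/8 resolution may also lose detail in the motion of small objects.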
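The displacement-field definition above can be made concrete with a small NumPy sketch. The function name warp_coordinates and the (H, W, 2) array layout are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def warp_coordinates(flow):
    """Map every pixel (u, v) of I_1 to (u + f^1, v + f^2) in I_2.

    flow has shape (H, W, 2); flow[v, u] = (f^1, f^2) holds the horizontal
    and vertical displacement of the pixel in column u, row v.
    Returns an (H, W, 2) array with the target coordinates (u', v').
    """
    height, width, _ = flow.shape
    # u indexes columns (x), v indexes rows (y); both grids have shape (H, W).
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    base = np.stack([u, v], axis=-1).astype(flow.dtype)
    return base + flow

# Toy flow field: every pixel moves 1.5 px to the right and 0.5 px up.
flow = np.zeros((4, 6, 2), dtype=np.float32)
flow[..., 0] = 1.5
flow[..., 1] = -0.5
coords = warp_coordinates(flow)
print(coords[2, 3])  # pixel (u=3, v=2) lands at (4.5, 1.5)
```

Because the targets are generally not integers, any quantity read at (u', v') has to be interpolated.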

2.1 Computing Visual Similarity (Correlation Volume)

We compute the correlation volume by taking the dot product between all pairs of feature vectors. Given feature maps g_theta(I_1) in R^{H x W x D} and g_theta(I_2) in R^{H x W x D}, where H and W are the height and width of the feature maps, the correlation volume C(g_theta(I_1), g_theta(I_2)) in R^{H x W x H x W} has entries C_{ijkl} = sum_h g_theta(I_1)_{ijh} * g_theta(I_2)_{klh}. We then construct a 4-layer correlation pyramid by pooling the last two dimensions with kernel sizes 1, 2, 4, and 8, so that the level pooled with kernel size s has shape H x W x H/s x W/s. Because only the last two dimensions are pooled, the first two keep the full feature-map resolution. This provides access to both large displacements (coarser levels) and small displacements (finer levels) within a single framework.

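Paragraph function: Technical core, detailing how the correlation volume is built and how the multi-scale pyramid is designed.

Logical role: This is one of RAFT's key innovations: all-pairs correlation replaces the local search windows of earlier methods, while pyramid pooling handles the multi-scale problem.

Argumentation technique / potential weakness: Pyramid pooling balances global search against memory efficiency. However, for a high-resolution 1024 x 1024 image, the memory required by the 4D correlation volume may still be too large.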
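As a rough illustration of the construction described above, the following NumPy sketch builds the all-pairs volume and a four-level pyramid, and then counts the entries of the full-resolution volume for a 1024 x 1024 input. Average pooling with stride equal to the kernel size, float32 storage, and all function names are assumptions made for this sketch; the text only says "pooling".

```python
import numpy as np

def all_pairs_correlation(feat1, feat2):
    """C[i, j, k, l] = dot(feat1[i, j, :], feat2[k, l, :])."""
    return np.einsum("ijd,kld->ijkl", feat1, feat2)

def average_pool_last_two(volume, kernel):
    """Average-pool the last two axes with a kernel x kernel window (assumed stride = kernel)."""
    h1, w1, h2, w2 = volume.shape
    h2c, w2c = h2 // kernel, w2 // kernel
    trimmed = volume[:, :, :h2c * kernel, :w2c * kernel]
    blocks = trimmed.reshape(h1, w1, h2c, kernel, w2c, kernel)
    return blocks.mean(axis=(3, 5))

def correlation_pyramid(feat1, feat2, kernels=(1, 2, 4, 8)):
    """One pooled copy of the volume per kernel size; the first two axes are never pooled."""
    base = all_pairs_correlation(feat1, feat2)
    return [average_pool_last_two(base, k) for k in kernels]

def volume_entries(image_height, image_width, downsample=8):
    """Number of entries in the full-resolution H x W x H x W volume."""
    h, w = image_height // downsample, image_width // downsample
    return (h * w) ** 2

rng = np.random.default_rng(0)
feat1 = rng.standard_normal((16, 24, 32))  # toy H x W x D feature maps
feat2 = rng.standard_normal((16, 24, 32))
pyramid = correlation_pyramid(feat1, feat2)
print([level.shape for level in pyramid])

entries = volume_entries(1024, 1024)  # feature maps are 128 x 128
print(entries, entries * 4 / 2**30)   # 268435456 entries, 1.0 GiB as float32
```

Under these assumptions the finest level alone takes about 1 GiB for a single 1024 x 1024 image pair, and the count grows with the square of the number of feature-map pixels, which is the scalability concern raised in the note above.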

The key operation in our approach is the correlation lookup. Given a current estimate of the flow field, we use the estimated correspondences to index the correlation volume and retrieve a local neighborhood of correlation values. Specifically, we define a local grid centered at the position predicted by the current flow estimate and use bilinear interpolation to sample the correlation volume at the grid points. The lookups are performed on all levels of the correlation pyramid, providing multi-resolution information while keeping the cost of each lookup fixed: every pixel samples the same small grid at every level.

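Paragraph function: Operational detail, describing the concrete mechanism of the correlation lookup.

Logical role: Builds on the construction of the correlation volume by explaining how to use it efficiently. The lookup is the bridge between the correlation representation and the iterative update.

Argumentation technique / potential weakness: The local-grid lookup is an elegant design: it exploits global correlation information while bounding the computation of each lookup. This "build globally, look up locally" strategy is key to RAFT's efficiency.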
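The lookup can be sketched for a single pixel as follows. This is a minimal illustration rather than the authors' implementation: the grid radius, the zero value used outside the volume, the division of coordinates by 2^level on coarser levels, and the names bilinear_sample and lookup are all assumptions.

```python
import numpy as np

def bilinear_sample(grid2d, x, y):
    """Bilinearly interpolate grid2d at real-valued (x, y); outside counts as 0."""
    height, width = grid2d.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    wx, wy = x - x0, y - y0
    value = 0.0
    for yy, weight_y in ((y0, 1.0 - wy), (y0 + 1, wy)):
        for xx, weight_x in ((x0, 1.0 - wx), (x0 + 1, wx)):
            if 0 <= yy < height and 0 <= xx < width:
                value += weight_y * weight_x * grid2d[yy, xx]
    return value

def lookup(pyramid, u, v, flow_uv, radius=1):
    """Gather correlation features for pixel (u, v) of image 1.

    pyramid[k] has shape (H, W, H / 2**k, W / 2**k).
    Returns a vector of length len(pyramid) * (2 * radius + 1) ** 2.
    """
    target_x = u + flow_uv[0]  # current correspondence estimate in image 2
    target_y = v + flow_uv[1]
    offsets = range(-radius, radius + 1)
    features = []
    for level, volume in enumerate(pyramid):
        slice2d = volume[v, u]  # correlation of (u, v) with every pixel of image 2
        cx, cy = target_x / 2**level, target_y / 2**level
        for dy in offsets:
            for dx in offsets:
                features.append(bilinear_sample(slice2d, cx + dx, cy + dy))
    return np.asarray(features)

# Toy two-level pyramid: pixel (u=1, v=1) matches exactly (u'=3, v'=2).
level0 = np.zeros((4, 4, 4, 4))
level0[1, 1, 2, 3] = 1.0
level1 = level0.reshape(4, 4, 2, 2, 2, 2).mean(axis=(3, 5))
features = lookup([level0, level1], u=1, v=1, flow_uv=(2.0, 1.0), radius=1)
print(features.shape)  # (18,)
print(features[4])     # 1.0, the centre of the level-0 grid hits the match
```

The output length depends only on the number of levels and the radius, not on the image size, which is one way to read the fixed lookup cost mentioned in the paragraph above.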

2.2 Iterative Update Operator

The update operator is built using a gated recurrent unit (GRU). At each iteration, it takes as input the current flow estimate, the correlation features from the lookup operation, and the context features extracted from the first image. The GRU produces a flow update delta-f that is added to the current estimate. This recurrent design allows the network to learn an optimization trajectory that mimics the iterative refinement of classical methods. We train the network with an L1 loss between the ground truth and each flow prediction in the sequence, using exponentially increasing weights so that later (more accurate) predictions count more.

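Paragraph function: Core innovation, describing the GRU-based iterative update mechanism and the training strategy.

Logical role: This is RAFT's third component and the source of "Recurrent" in the architecture's name. The GRU unifies deep learning with the iterative refinement of classical optimization.

Argumentation technique / potential weakness: Exponentially increasing weights encourage convergent behavior. However, the number of GRU iterations can be set freely at inference time, and the practical effect of this flexibility (the relationship between iteration count and accuracy) needs to be verified by ablation experiments.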
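The residual update and the weighted sequence loss might look like the sketch below. The decay factor gamma = 0.8, the use of the mean absolute error as the L1 term, and the toy update function standing in for the GRU are assumptions for illustration; none of them is specified in the text above.

```python
import numpy as np

def refine(flow_init, update_fn, num_iters):
    """Apply f_{k+1} = f_k + delta_f for num_iters steps and keep every estimate."""
    flow = flow_init
    predictions = []
    for _ in range(num_iters):
        flow = flow + update_fn(flow)  # update_fn stands in for the learned GRU
        predictions.append(flow)
    return predictions

def sequence_loss(flow_predictions, flow_gt, gamma=0.8):
    """Sum of L1 errors where prediction i (1-based) of N gets weight gamma ** (N - i)."""
    n = len(flow_predictions)
    total = 0.0
    for i, flow in enumerate(flow_predictions, start=1):
        weight = gamma ** (n - i)  # grows towards 1 for later predictions
        total += weight * np.abs(flow - flow_gt).mean()
    return total

flow_gt = np.full((2, 2, 2), 4.0)

def toy_update(flow):
    """Hypothetical update that moves halfway to the ground truth each step."""
    return 0.5 * (flow_gt - flow)

predictions = refine(np.zeros((2, 2, 2)), toy_update, num_iters=3)
loss = sequence_loss(predictions, flow_gt)
print(round(loss, 2))  # 0.64 * 2.0 + 0.8 * 1.0 + 1.0 * 0.5 = 2.58
```

Since num_iters is just a loop bound, the same weights can be unrolled for more or fewer steps at test time, which is the flexibility the note above refers to.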

3. Experiments

We evaluate RAFT on the Sintel and KITTI 2015 optical flow benchmarks. We first train on FlyingChairs followed by FlyingThings3D, then fine-tune on the target dataset. On Sintel (final pass), RAFT achieves an end-point-error (EPE) of 2.855, compared to 3.38 for the previous best method. On KITTI 2015, RAFT achieves 5.10% Fl-all, surpassing the previous state of the art at 6.24%. These results represent a 16% reduction in error on Sintel and an 18% reduction on KITTI.

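Paragraph function: Quantitative evaluation, demonstrating RAFT's performance advantage on standard benchmarks.

Logical role: Backs the abstract's claims with concrete numbers; consistent gains on both benchmarks strengthen the credibility of the conclusion.

Argumentation technique / potential weakness: Significant gains on both Sintel and KITTI at once are highly persuasive. Improvements of 16% and 18% are extremely rare on mature benchmarks, which is also one of the reasons the paper received a best paper award.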
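The percentages quoted above follow from the reported scores by simple arithmetic. The sketch below also includes the usual definition of end-point error as the mean Euclidean distance between predicted and ground-truth flow vectors; the function names are illustrative.

```python
import numpy as np

def end_point_error(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and true flow vectors, in pixels."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

def relative_reduction(previous, current):
    """Fraction by which an error metric dropped relative to the previous value."""
    return (previous - current) / previous

# Toy check: predicting zero flow when the truth is (3, 4) everywhere.
flow_gt = np.zeros((2, 2, 2))
flow_gt[..., 0], flow_gt[..., 1] = 3.0, 4.0
print(end_point_error(np.zeros((2, 2, 2)), flow_gt))  # 5.0

sintel = relative_reduction(3.38, 2.855)  # EPE on Sintel (final pass)
kitti = relative_reduction(6.24, 5.10)    # Fl-all on KITTI 2015
print(round(100 * sintel, 1), round(100 * kitti, 1))  # 15.5 18.3
```

The two values, 15.5% and 18.3%, round to the 16% and 18% stated in the text.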

We also evaluate the generalization ability of RAFT. When trained only on synthetic data (FlyingChairs + FlyingThings3D) and evaluated directly on Sintel and KITTI without fine-tuning, RAFT achieves significantly better results than prior methods that were also trained only on synthetic data. Additionally, we study the effect of the number of iterations at test time. RAFT shows consistent improvement as the number of iterations increases, with most of the gain captured in the first 12 iterations. The model uses only 5.3M parameters, fewer than most competing methods.

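Paragraph function: Supplementary evaluation, demonstrating generalization ability and efficiency.

Logical role: Extends the case from accuracy to generalization and efficiency, building a multi-dimensional argument for RAFT's advantages.

Argumentation technique / potential weakness: The combination of a lightweight 5.3M-parameter model with high accuracy is striking. The flexible iteration count is a distinctive advantage of RAFT: inference accuracy can be traded against the available compute budget.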

4. Conclusion

We have presented RAFT, a new architecture for optical flow that achieves state-of-the-art accuracy, strong generalization, and high efficiency. The key design elements are the all-pairs correlation volume, the multi-scale correlation pyramid, and the GRU-based iterative update operator. RAFT demonstrates that iterative refinement, when implemented through learned recurrent updates, can dramatically improve optical flow estimation. The simplicity and strong performance of RAFT suggest that this design philosophy — building dense correlation volumes and iteratively refining predictions — may be applicable to other dense correspondence problems such as stereo matching and scene flow estimation.

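Paragraph function: Summarizes the paper, restating the core contributions and looking ahead to future directions.

Logical role: Elevates the specific technical results to a broader design philosophy, suggesting that RAFT's influence will extend beyond optical flow.

Argumentation technique / potential weakness: The conclusion's forward-looking claim (applicability to other dense correspondence problems) was borne out by later research: RAFT's design paradigm has indeed been widely adopted for tasks such as stereo matching, which speaks to the authors' insight.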

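Argument Structure Overview

Problem: learning-based optical flow lacks iterative refinement
→
Thesis: combine classical and learning-based methods
→
Method: all-pairs correlation + GRU updates
→
Evidence: large lead on Sintel and KITTI
→
Conclusion: the design philosophy generalizes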

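Core Claim

With an all-pairs correlation volume and gated recurrent updates, iterative refinement can be carried out on a single high-resolution flow field, achieving state-of-the-art accuracy and high efficiency at the same time.

Strongest Point of the Argument

Error reductions of 16% and 18% on the two major benchmarks, Sintel and KITTI, with a model of only 5.3M parameters; the experimental design is comprehensive and convincing.

Weakest Point of the Argument

The memory requirements for high-resolution images are not analyzed in enough depth, the scalability limits of the 4D correlation volume are not fully discussed, and there is no theoretical guarantee of convergence.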