DUSt3R: Two-Column Annotation

Abstract

We present DUSt3R, a new paradigm for Dense and Unconstrained Stereo 3D Reconstruction from arbitrary image collections. Our approach does not require prior information about camera calibration or viewpoint poses. We cast the pairwise reconstruction problem as a regression of pointmaps, relaxing the hard constraints of usual projective camera models. We show that this formulation smoothly unifies the monocular and binocular reconstruction cases. In the multi-view case, we introduce a simple yet effective global alignment procedure. We demonstrate that our formulation allows leveraging powerful pretrained models and leads to state-of-the-art results on several 3D vision benchmarks, including monocular and multi-view depth estimation, relative and absolute pose estimation, and 3D reconstruction.

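Paragraph function: Overview of the whole paper. Pointmap regression replaces the traditional geometric pipeline to achieve unconstrained 3D reconstruction.

Logical role: The abstract states the core claim that no camera calibration is needed and supports it with results on multiple benchmarks.

Argumentative technique / potential weakness: The name "Dense and Unconstrained" directly signals the difference from traditional SfM. State-of-the-art results across several tasks are strong evidence, but the fairness of the comparison conditions for each task needs to be verified.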

Our key insight is that by directly regressing 3D pointmaps from image pairs, we can bypass the traditional sequential pipeline of keypoint detection, matching, essential matrix estimation, and triangulation. Each of these steps can introduce errors that propagate downstream, leading to brittle systems that fail with few views, non-Lambertian surfaces, or insufficient camera baselines. DUSt3R replaces this fragile pipeline with a single feed-forward network that learns geometric priors from large-scale training data.

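Paragraph function: States the core motivation by pointing out error accumulation in the traditional pipeline.

Logical role: Concrete failure scenarios (few views, non-Lambertian surfaces, small baselines) establish the severity of the problem.

Argumentative technique / potential weakness: Describing traditional methods as "brittle" is an effective rhetorical strategy, but traditional SfM remains irreplaceably stable on large-scale scenes.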

1. Introduction

Reconstructing 3D geometry and camera parameters from uncalibrated image collections is a fundamental challenge in computer vision. Traditional Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipelines decompose the problem into sequential steps (keypoint matching, essential matrix estimation, triangulation, bundle adjustment), each of which can introduce errors that propagate downstream. This sequential design is unsatisfactory for two reasons: errors accumulate across stages, and the substeps do not communicate with one another. In practice, SfM breaks down with few views, non-Lambertian surfaces, or insufficient camera motion.

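Paragraph function: Frames the problem by exposing the structural flaws of the traditional pipeline.

Logical role: Characterizing the traditional approach as a decomposition into sequential steps sets the stage for an end-to-end method.

Argumentative technique / potential weakness: "No communication between substeps" is a precise criticism. However, recent learning-based matchers (such as SuperGlue) have alleviated this problem to some extent.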

DUSt3R takes a radically different stance: it requires no camera calibration at all. The core innovation is to regress pointmaps (dense 2D fields of 3D points) directly from image pairs, dropping explicit geometric constraints in favor of priors learned from large datasets. This formulation naturally unifies the monocular and multi-view scenarios: a single network handles both by outputting two pointmaps in a shared coordinate frame, and its architecture can build on powerful pretrained models such as CroCo. A simple global alignment procedure extends the pairwise approach to multi-image scenarios without the complications of traditional bundle adjustment.

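Paragraph function: Presents the solution, which unifies monocular and multi-view reconstruction through pointmap regression.

Logical role: A three-level progression: the core innovation (pointmap regression), the unified framework (monocular and multi-view), and the extension mechanism (global alignment).

Argumentative technique / potential weakness: Using CroCo pretraining is a strategic choice, since its cross-view prediction ability transfers directly to 3D tasks. But the dependence on pretraining data may limit generalization across domains.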

2. Method

A pointmap is formally defined as X in R^(W x H x 3), establishing a one-to-one correspondence between image pixels and 3D scene points such that I_(i,j) corresponds to X_(i,j) for all pixel coordinates. The network processes two RGB images and outputs two pointmaps expressed in the first image's coordinate frame, plus associated confidence maps. This design choice — outputting both pointmaps in a shared coordinate system — represents a key departure from conventional approaches, as it implicitly embeds relative pose information without requiring explicit pose parameterization.

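Paragraph function: Defines the core representation, covering the mathematical definition of the pointmap and the output design.

Logical role: The shared coordinate frame turns pose estimation from explicit computation into implicit learning, which is the methodological pivot of the paper.

Argumentative technique / potential weakness: Using the first image as the reference frame is a simple choice, but when the viewpoints differ drastically it may hurt the accuracy of the second image's pointmap.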
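To make the pixel-to-point correspondence concrete, the following minimal sketch builds a pointmap from a depth map. It assumes a pinhole camera with known intrinsics (fx, fy, cx, cy), which the network itself does not need at inference; the function name and the row-major layout (height first, then width) are illustrative choices, not part of the paper's code.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_to_pointmap(depth: List[List[float]],
                      fx: float, fy: float,
                      cx: float, cy: float) -> List[List[Point]]:
    """Back-project a depth map into a pointmap in the camera frame.

    depth[v][u] is the depth at pixel column u, row v. The result has the
    same height and width, with exactly one 3D point per pixel.
    Assumption: pinhole model with focal lengths (fx, fy) and principal
    point (cx, cy), all hypothetical inputs for this illustration.
    """
    pointmap = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            x = (u - cx) * z / fx  # horizontal offset scaled by depth
            y = (v - cy) * z / fy  # vertical offset scaled by depth
            out_row.append((x, y, z))
        pointmap.append(out_row)
    return pointmap


# A 2x3 depth map of a fronto-parallel plane 2 units from the camera.
depth = [[2.0, 2.0, 2.0],
         [2.0, 2.0, 2.0]]
pointmap = depth_to_pointmap(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
```

Each entry pointmap[v][u] is the 3D point seen by pixel (u, v), which is the one-to-one correspondence the definition describes. Expressing a second view's pointmap in the first view's frame would additionally require the relative pose, which is exactly the information the network is said to encode implicitly.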

The architecture builds on CroCo and comprises: a Siamese ViT-Large encoder processing both images with shared weights, Transformer decoders with cross-attention mechanisms enabling constant information exchange between branches, and separate DPT regression heads outputting pointmaps and confidence maps. The cross-attention structure ensures properly aligned pointmap outputs by having each decoder block attend to tokens from both views sequentially. Importantly, the architecture enforces no explicit geometric constraints — instead, the network learns geometric priors purely from training data containing only geometrically consistent pointmaps.

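Paragraph function: Architecture details, namely the Siamese encoder and the cross-attention decoder.

Logical role: "No explicit geometric constraints" is the paper's boldest design decision, leaving geometric knowledge entirely to the data.

Argumentative technique / potential weakness: CroCo pretraining provides a strong foundation for cross-view understanding. But removing geometric constraints entirely means the model may be unstable on scenes outside the training distribution, such as extreme viewpoints.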

The primary training objective uses Euclidean distance in 3D space rather than 2D image-space metrics. To handle scale ambiguity, predictions and ground-truth are normalized by average point distances from origin. The network jointly learns pixel-wise confidence scores indicating prediction reliability through a confidence-aware loss that includes a regularization term, forcing the model to extrapolate in difficult areas such as single-view regions and translucent surfaces. For multi-image scenarios, a global alignment procedure constructs a connectivity graph of image pairs and optimizes all pairwise predictions into a joint 3D space through rigid transformations and per-pair scaling factors, converging within hundreds of gradient descent iterations.

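Paragraph function: Training and global alignment, covering the loss design and the extension to multiple views.

Logical role: Three key design elements (the 3D loss, confidence awareness, and global alignment) together cover the full path from training to inference.

Argumentative technique / potential weakness: Computing the loss in 3D space rather than in 2D is an important design choice that directly constrains geometric consistency. Global alignment is faster than traditional bundle adjustment but gives up the precision of reprojection error.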
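The training objective described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it treats one flat list of points instead of two pointmaps, the function names are hypothetical, and the weight alpha of the regularization term is an arbitrary placeholder. Confidences are assumed to be strictly positive.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def mean_norm(points: List[Point]) -> float:
    """Average Euclidean distance of the points from the origin."""
    return sum(math.sqrt(sum(c * c for c in p)) for p in points) / len(points)


def regression_errors(pred: List[Point], gt: List[Point]) -> List[float]:
    """Per-point 3D Euclidean error after scale normalization.

    Prediction and ground truth are each divided by their own average
    distance from the origin, which removes the global scale ambiguity.
    """
    z_pred, z_gt = mean_norm(pred), mean_norm(gt)
    return [
        math.dist([c / z_pred for c in p], [c / z_gt for c in g])
        for p, g in zip(pred, gt)
    ]


def confidence_loss(errors: List[float], conf: List[float],
                    alpha: float = 0.2) -> float:
    """Confidence-weighted error with a log regularizer.

    The first term lets the model down-weight hard pixels by lowering
    their confidence; the -alpha * log(conf) term penalizes doing so,
    so the confidence cannot collapse toward zero everywhere.
    alpha is a placeholder value chosen for this sketch.
    """
    total = sum(c * e - alpha * math.log(c) for e, c in zip(errors, conf))
    return total / len(errors)


# A prediction that equals the ground truth up to a global scale of 3.
gt = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
pred = [(3.0, 0.0, 0.0), (0.0, 6.0, 0.0)]
errors = regression_errors(pred, gt)
loss = confidence_loss(errors, conf=[1.0, 1.0])
```

Because both point sets are normalized before comparison, the scaled prediction incurs essentially zero error, which is the intended effect of the normalization.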

3. Experiments

The model trains on 8.5 million image pairs from eight diverse datasets: Habitat, MegaDepth, ARKitScenes, Static Scenes 3D, Blended MVS, ScanNet++, CO3D-v2, and Waymo — covering indoor, outdoor, synthetic, real-world, and object-centric scenarios. The architecture employs a ViT-Large encoder and ViT-Base decoder with DPT head, initialized from CroCo pretraining. Training proceeds sequentially: first at 224x224 resolution, then at 512-pixel maximum dimension with variable aspect ratios per batch for generalization.

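Paragraph function: Training setup, with large-scale diverse data and a progressive training strategy.

Logical role: The scale of 8.5 million image pairs across eight datasets underpins the model's generalization.

Argumentative technique / potential weakness: Data diversity is key to the method's success, but this volume of training data also raises the bar for reproduction.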

On multi-view pose estimation using CO3Dv2 and RealEstate10K datasets, DUSt3R with global alignment achieves 96.2% RRA@15 and 86.8% RTA@15 on CO3Dv2, significantly surpassing the state-of-the-art PoseDiffusion. Both PnP and global alignment variants outperform existing methods, demonstrating strong performance even with limited input views (3-10 frames). On monocular depth estimation, the model outperforms self-supervised baselines and performs comparably to state-of-the-art supervised methods in a zero-shot transfer setting on DDAD, KITTI, NYUv2, BONN, and TUM datasets. For multi-view depth estimation on DTU, ETH3D, Tanks and Temples, and ScanNet, DUSt3R achieves state-of-the-art accuracy on ETH3D, even outperforming methods that require ground-truth poses.

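Paragraph function: Provides the core empirical evidence, with benchmark results across multiple tasks.

Logical role: Systematic evaluation across several task categories supports the "unified framework" claim; 96.2% RRA@15 is a highly persuasive number.

Argumentative technique / potential weakness: Outperforming methods that require ground-truth poses, without using ground-truth poses, is a very strong argument. But the evaluation conditions of each benchmark, such as test-set overlap, need careful checking.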
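For readers unfamiliar with the pose metrics quoted above, here is a minimal sketch under the usual reading of RRA@15 and RTA@15: the fraction of camera pairs whose relative rotation error, or translation direction error, is below 15 degrees. The function names and the toy data are illustrative assumptions, and a benchmark's official evaluation script may differ in details.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def rotation_angle_deg(r_pred: Matrix, r_gt: Matrix) -> float:
    """Angle (degrees) of the residual rotation between two 3x3 rotations."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_angle))


def direction_angle_deg(t_pred: Sequence[float], t_gt: Sequence[float]) -> float:
    """Angle (degrees) between two non-zero translation directions."""
    dot = sum(a * b for a, b in zip(t_pred, t_gt))
    norm = math.hypot(*t_pred) * math.hypot(*t_gt)
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def accuracy_at(errors_deg: List[float], threshold_deg: float = 15.0) -> float:
    """Fraction of errors strictly below the threshold."""
    return sum(e < threshold_deg for e in errors_deg) / len(errors_deg)


def rot_z(deg: float) -> List[List[float]]:
    """Rotation about the z axis, used only to build toy examples."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


# Toy relative rotations for four camera pairs: ground truth is identity,
# predictions are off by 5, 10, 20 and 40 degrees.
gt_rotation = rot_z(0.0)
rotation_errors = [rotation_angle_deg(rot_z(d), gt_rotation)
                   for d in (5.0, 10.0, 20.0, 40.0)]
rra_at_15 = accuracy_at(rotation_errors, 15.0)
```

In this toy case two of the four pairs fall under the threshold, so the score is 0.5; a reported 96.2% means nearly all evaluated pairs are within 15 degrees.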

Systematic ablations demonstrate the critical importance of key design choices. CroCo pretraining provides consistent improvements across all tasks, validating the value of cross-view completion as a pretraining objective. Higher input resolution (512 pixels versus 224) yields substantial performance gains, suggesting that spatial detail is crucial for geometric reasoning. On 3D reconstruction quality using the DTU dataset in a zero-shot setting without finetuning, DUSt3R achieves 2.7mm average accuracy, 0.8mm completeness, and 1.7mm overall error, demonstrating practical reconstruction precision despite not matching pixel-accurate triangulation methods.

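Paragraph function: Ablations and reconstruction quality, verifying the contributions of pretraining and resolution.

Logical role: The ablation results attribute the improvements to specific design choices rather than to brute-force scaling of the model.

Argumentative technique / potential weakness: A reconstruction accuracy of 2.7mm is impressive in a zero-shot setting. The candid admission that it does not match traditional triangulation methods adds to the paper's credibility.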
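The three DTU figures are related, and a small sketch may help in reading them. It assumes the common convention that accuracy is the mean distance from reconstructed points to the nearest ground-truth point, completeness is the mean distance in the opposite direction, and overall is the average of the two. The real DTU protocol also involves masking and outlier handling that this brute-force version omits, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def mean_nearest_distance(source: List[Point], target: List[Point]) -> float:
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(s, t) for t in target) for s in source) / len(source)


def dtu_style_metrics(reconstruction: List[Point],
                      ground_truth: List[Point]) -> Tuple[float, float, float]:
    """Return (accuracy, completeness, overall) under the stated convention."""
    accuracy = mean_nearest_distance(reconstruction, ground_truth)
    completeness = mean_nearest_distance(ground_truth, reconstruction)
    return accuracy, completeness, (accuracy + completeness) / 2.0


# Toy example: the reconstruction is close to the surface but misses
# the third ground-truth point, so completeness is worse than accuracy.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
reconstruction = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
accuracy, completeness, overall = dtu_style_metrics(reconstruction, ground_truth)
```

Under this convention, the rounded figures quoted above give (2.7 + 0.8) / 2 = 1.75, which is consistent with the reported 1.7mm overall error once rounding of the inputs is taken into account.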

4. Conclusion

We have presented DUSt3R, a unified paradigm that solves multiple 3D vision tasks — reconstruction, depth estimation, pose estimation, and localization — without requiring camera calibration. By reformulating geometric problems as pointmap regression with learned priors, our work simplifies traditional pipelines while achieving competitive or superior performance across diverse benchmarks. The pointmap representation preserves pixel-to-point correspondence while handling implicit pose relationships. Our generic Transformer-based architecture leverages pretrained models without task-specific constraints, and the 3D global alignment optimization enables fast convergence for multi-view scenarios. DUSt3R demonstrates that relaxing explicit geometric constraints while leveraging large-scale pretraining and learned priors can effectively solve challenging unconstrained 3D reconstruction problems.

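Paragraph function: Summarizes the paper and restates the core value of the unified paradigm.

Logical role: The paper closes on the philosophy of "relaxing explicit constraints plus learned priors", lifting the technical contribution to the level of methodology.

Argumentative technique / potential weakness: DUSt3R has already spawned a large body of follow-up work (such as MASt3R), which attests to the paradigm's far-reaching influence. But the robustness of fully data-driven methods on extreme out-of-distribution scenes still awaits long-term validation.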

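Argument structure overview

Problem
Traditional sequential SfM/MVS pipeline
Errors accumulate and steps do not communicate

→

Thesis
Replace the explicit geometric pipeline
with pointmap regression

→

Method
Siamese ViT + cross-attention
+ global alignment

→

Evidence
State of the art on multiple benchmarks
Zero-shot generalization

→

Conclusion
A unified paradigm that solves multiple
3D vision tasks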

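Core claim (one sentence)

By reformulating 3D reconstruction as pointmap regression and using a Transformer to learn geometric priors, DUSt3R achieves state-of-the-art performance on multiple 3D vision tasks without camera calibration.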

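Strongest point of the argument

Without using ground-truth poses, multi-view pose estimation reaches 96.2% RRA@15, and multi-view depth estimation on ETH3D surpasses methods that require ground-truth poses. This "free lunch" style demonstration of the paradigm's superiority is highly persuasive.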

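Weakest point of the argument

The training scale of 8.5 million image pairs and the dependence on CroCo pretraining raise the bar for reproduction. How a method that abandons geometric constraints entirely behaves on extreme out-of-distribution scenes (such as global consistency across large building complexes) still needs more validation.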