RayZer: A Self-supervised Large View Synthesis Model

Abstract

We present RayZer, a model that takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. The system trains without 3D supervision, using self-predicted camera poses to render target views rather than ground-truth annotations. The framework achieves novel view synthesis performance comparable to, or even better than, "oracle" methods that require pose annotations. The model is built around a ray structure that connects camera, pixel, and scene simultaneously.

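Paragraph function: Overview of the whole paper. Concisely summarizes RayZer's inputs and outputs, its core innovation (zero 3D supervision), and its performance positioning.

Logical role: The core tension of the abstract is the contrast between unsupervised and supervised training: a self-supervised method that matches or even beats methods using ground-truth poses makes for a strong academic hook.

Argumentation technique / potential weakness: The claim of "surpassing oracle methods" needs careful reading. It may hold because COLMAP annotations are themselves imperfect, not because self-supervision is inherently better. The authors need to distinguish these two explanations later in the paper.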

1. Introduction

While self-supervised learning has advanced LLMs and VLMs, 3D Vision models still rely heavily on ground-truth 3D geometry and camera pose labels. The authors propose breaking from this paradigm by asking: "How far can we push a 3D Vision model without any 3D supervision?" RayZer processes unposed and uncalibrated multi-view images through three stages: camera parameter recovery, scene representation reconstruction, and novel view rendering. During training, the model uses camera poses predicted by RayZer itself to render views that provide photometric supervision.

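Paragraph function: Establishes the research setting. Opens with the success of self-supervision in NLP and VLMs and points out that 3D vision lags behind.

Logical role: Builds the argument on a cross-domain analogy: since language and 2D vision have successfully shed their dependence on annotations, 3D vision should be able to as well. This gives the paper's research motivation strong conceptual support.

Argumentation technique / potential weakness: The bootstrapping strategy of "supervising yourself with your own predicted poses" carries a conceptual risk of circularity. How does the model avoid converging to a degenerate solution (for example, all poses identical)? The authors need to explain the information-flow control mechanism in the method section.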

The framework interprets the task as "3D-aware image auto-encoding", where input images are disentangled into camera and scene representations, then re-entangled through rendering. The model uses transformers with no 3D representation, no hand-crafted rendering equation, and no 3D-informed architectures. The only 3D prior is the ray structure: pixel-aligned Plücker ray maps guide scene reconstruction. Evaluation across three datasets shows that RayZer achieves novel view synthesis performance comparable to, or even better than, "oracle" methods, suggesting that potentially noisy pose annotations from COLMAP can limit the performance of supervised models.

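Paragraph function: Core architecture overview. Describes the "auto-encoding" framework and the minimalist 3D prior design.

Logical role: This paragraph reveals the paper's most provocative finding: noise in the supervision data may actually drag performance down. This elevates self-supervision from a "compromise" to a "possibly better option".

Argumentation technique / potential weakness: The minimalist design of "the ray structure as the only 3D prior" is very appealing, but it also makes the model rely entirely on data-driven 3D understanding. Robustness on out-of-distribution scenes (such as extreme lighting or repetitive textures) is a potential concern.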

2. Related Work

Large-scale 3D vision models such as LRM, DUSt3R, and LVSM use transformers to convert 2D images into 3D representations. However, these approaches still require ground-truth camera poses for supervised training, accurate camera annotations at inference time, or both. Existing self-supervised 3D representation learning methods either work only for a specific category or can only recover partial observations. The most relevant prior work is RUST, which learns latent scene representations from unposed imagery. RayZer differs in three key respects: it estimates camera poses first rather than reconstructing the scene first; it employs explicit pose representations for better information disentanglement; and it uses pure self-attention transformers.

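Paragraph function: Literature positioning. Places RayZer in the context of existing large-scale 3D vision models and self-supervised methods.

Logical role: Uses three points of difference (pose-first, explicit representation, pure self-attention) to distinguish RayZer from its closest predecessor, RUST, clearly delineating the paper's incremental contribution.

Argumentation technique / potential weakness: The three differences from RUST are stated precisely, but the text does not fully explain why "pose-first" is better than "reconstruction-first". This is an important design choice that needs ablation support in the method section.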

3. RayZer (Method)

RayZer's input consists of a set of unposed and uncalibrated multi-view images. To enable self-supervised training, the framework controls the flow of information by splitting the images into two non-overlapping subsets: one subset is used to predict the scene representation, and the other provides photometric supervision. The loss function combines MSE and perceptual loss. A key design element is the cascaded prediction of camera and scene representations, motivated by the observation that even noisy camera estimates can serve as a strong conditioning signal for scene reconstruction, so that the two predictions regularize each other during training.

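Paragraph function: Self-supervised framework. Describes the core designs of information-flow control and cascaded prediction.

Logical role: This paragraph directly answers the question of how bootstrapping avoids degeneracy: splitting the images into two groups and controlling the information flow keeps the model from taking shortcuts. The cascaded design, in turn, gives pose and scene a mutually beneficial relationship.

Argumentation technique / potential weakness: Information-flow control is a classic way to prevent collapse in self-supervised learning, but the split ratio and strategy for the two subsets (random vs. structured) could significantly affect training stability. The authors need to examine this design choice in the ablation study.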
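
The view split and the combined loss described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `split_views` and `photometric_loss`, the random split strategy, and the `perceptual_weight` default are all hypothetical, and the perceptual term is passed in as a callable because the summary does not say which perceptual network is used.

```python
import random


def split_views(num_views, num_input, seed=None):
    """Partition view indices into two disjoint subsets.

    The first subset would feed scene reconstruction; the second would be
    held out as render targets. A random split is an assumption here.
    """
    if not 0 < num_input < num_views:
        raise ValueError("need at least one input view and one target view")
    rng = random.Random(seed)
    indices = list(range(num_views))
    rng.shuffle(indices)
    return sorted(indices[:num_input]), sorted(indices[num_input:])


def mse(pred, target):
    """Mean squared error over two equal-length flat pixel sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def photometric_loss(pred, target, perceptual_fn=None, perceptual_weight=0.5):
    """MSE plus an optional weighted perceptual term.

    perceptual_fn is a hypothetical callable (pred, target) -> float; the
    0.5 default weight is a placeholder, not a value from the paper.
    """
    loss = mse(pred, target)
    if perceptual_fn is not None:
        loss += perceptual_weight * perceptual_fn(pred, target)
    return loss
```

Because the target views never enter the scene reconstruction input, the only way to lower the loss on them is to predict cameras and a scene representation that generalize across views.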

3.2 Model Architecture

RayZer is a pure transformer-based model with three components. The Camera Estimator combines learnable camera tokens with image features via full self-attention to predict SE(3) camera poses (using the continuous 6D rotation representation) and intrinsics. The camera parameters are then converted to pixel-aligned Plücker ray maps. The Scene Reconstructor fuses raw image tokens with ray information via an MLP and uses self-attention transformer layers to predict a latent set scene representation z. Critically, it uses raw image tokens rather than the pose transformer's output to prevent information leakage. The Rendering Decoder represents target images as Plücker rays, fuses them with scene tokens via self-attention, and decodes RGB patches through an MLP.
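
To make the conversion from camera parameters to a pixel-aligned Plücker ray map concrete, here is a minimal pure-Python sketch. It assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a camera-to-world rotation `R_c2w`, a camera center `origin`, rays through pixel centers, and the ordering (direction, moment); none of these conventions are stated in the summary, and the function names are hypothetical.

```python
import math


def _cross(a, b):
    """Cross product of two 3-vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def _matvec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def plucker_ray(u, v, fx, fy, cx, cy, R_c2w, origin):
    """Return the 6D Plücker coordinates (d, o x d) of the ray through pixel (u, v)."""
    # Back-project the pixel center into a camera-space direction.
    d_cam = [(u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0]
    # Rotate into world space and normalize to unit length.
    d = _matvec(R_c2w, d_cam)
    norm = math.sqrt(sum(x * x for x in d))
    d = [x / norm for x in d]
    # The moment encodes the ray's position independently of the point chosen on it.
    moment = _cross(origin, d)
    return d + moment


def plucker_ray_map(height, width, fx, fy, cx, cy, R_c2w, origin):
    """Build a height x width grid of 6D Plücker rays, one per pixel."""
    return [
        [plucker_ray(u, v, fx, fy, cx, cy, R_c2w, origin) for u in range(width)]
        for v in range(height)
    ]
```

Each pixel gets six numbers that depend only on the camera, which is what lets image tokens and ray information be fused patch by patch.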

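Paragraph function: Architecture details. Describes the design of the three core components and how they interact.

Logical role: This paragraph reveals several careful engineering choices: the 6D rotation representation avoids gimbal lock, Plücker rays provide a geometric prior, and the leakage-preventing design keeps the self-supervision valid.

Argumentation technique / potential weakness: Using raw image tokens rather than the pose transformer's output is the key detail that prevents information leakage, and it shows a deep understanding of the pitfalls of self-supervised training. However, a pure transformer architecture lacks explicit 3D inductive biases and may need more training data to learn good 3D understanding.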
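
The continuous 6D rotation representation mentioned above is usually decoded into a rotation matrix with a Gram-Schmidt step. The sketch below shows that standard construction in pure Python; it is not taken from the RayZer code, the function names are hypothetical, and stacking the three basis vectors as columns is an assumption (some libraries stack them as rows instead).

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    if norm < 1e-12:
        raise ValueError("cannot normalize a near-zero vector")
    return [x / norm for x in v]


def rotation_from_6d(r6):
    """Map an unconstrained 6-vector to a 3x3 rotation matrix (list of rows)."""
    if len(r6) != 6:
        raise ValueError("expected six values")
    a1, a2 = list(r6[:3]), list(r6[3:])
    b1 = _normalize(a1)
    # Remove the component of a2 along b1, then normalize (Gram-Schmidt).
    proj = _dot(b1, a2)
    b2 = _normalize([a2[i] - proj * b1[i] for i in range(3)])
    # The third axis is fixed by right-handedness.
    b3 = _cross(b1, b2)
    # Assumption: b1, b2, b3 form the columns of the matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def is_rotation(R, tol=1e-9):
    """Check that R has orthonormal columns and determinant +1."""
    cols = [[R[i][j] for i in range(3)] for j in range(3)]
    for j in range(3):
        for k in range(3):
            expected = 1.0 if j == k else 0.0
            if abs(_dot(cols[j], cols[k]) - expected) > tol:
                return False
    det = _dot(cols[0], _cross(cols[1], cols[2]))
    return abs(det - 1.0) <= tol
```

Any six numbers whose two halves are non-zero and non-parallel decode to a valid rotation, so a network can regress them directly without a separate constraint.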

4. Experiments

RayZer is evaluated on DL3DV and RealEstate10k (scene-level) and on Objaverse (object-level). Its performance is comparable to that of the best oracle model, LVSM: it outperforms LVSM on DL3DV and RealEstate10k and performs slightly worse on Objaverse. The authors conjecture that this is because camera poses in DL3DV and RealEstate10k are annotated by COLMAP, which can be imperfect, while Objaverse has perfect pose annotations from the rendering tool. Analysis shows that the predicted poses are interpolatable and 3D-aware. Ablation studies verify that the latent set representation, Plücker ray maps, and the cascaded pose-first paradigm are all essential. Notably, replacing the latent representation with 3D Gaussian Splatting prevents training from converging, which supports the flexibility advantage of learned representations.

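Paragraph function: Provides comprehensive experimental evidence, covering multi-dataset comparison, pose analysis, and ablation studies.

Logical role: The experimental results neatly support the core claim: surpassing the supervised method on COLMAP-annotated datasets while trailing slightly on the perfectly annotated dataset is exactly what the "noisy annotations limit supervised models" hypothesis predicts.

Argumentation technique / potential weakness: The ablation result that 3DGS fails to converge is very persuasive, but the authors did not try other explicit 3D representations (such as NeRF). The performance gap on Objaverse is explained as "the advantage of perfect annotations", but it may also reflect an inherent weakness of the self-supervised method on object-level scenes.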

5. Conclusion

The authors introduce RayZer, a self-supervised large multi-view 3D vision model trained with zero 3D supervision, meaning no 3D geometry and no camera annotations. Results verify the feasibility of breaking free from supervised learning in 3D vision tasks. The model's ability to learn a pose space that is potentially better suited to novel view synthesis than noisy ground-truth annotations represents a paradigm shift in how we think about 3D supervision.

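Paragraph function: Summarizes the paper and positions its impact on the field of 3D vision from a broad perspective.

Logical role: The conclusion returns to the core question of the introduction ("How far can we push?") and answers it encouragingly with experimental results, closing the argumentative loop.

Argumentation technique / potential weakness: The "paradigm shift" claim is strong but will take time to validate. The self-supervised method has not yet been thoroughly tested on more diverse real-world scenes (such as extreme lighting or large-scale outdoor environments), so there is still a gap between a research breakthrough and actually replacing supervised methods.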

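Argument Structure Overview

Problem: 3D vision models depend on expensive pose annotations
→ Claim: a self-supervised ray structure, trained with zero 3D supervision
→ Evidence: tests on three datasets; matches or surpasses oracle methods
→ Rebuttal: noisy COLMAP poses may limit supervised models
→ Conclusion: 3D vision can shed its dependence on supervised learning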

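The Authors' Core Claim (in One Sentence)

With a pure transformer self-supervised architecture whose only 3D prior is the ray structure, a 3D vision model can reach novel view synthesis performance comparable to, or even better than, supervised oracle methods without any 3D supervision.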

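Strongest Point of the Argument

The counterintuitive finding that self-supervision beats supervision: on real-world datasets annotated by COLMAP, the self-supervised RayZer surpasses the oracle method that uses ground-truth poses, which is strong evidence that noisy annotations can become a bottleneck for supervised learning. This finding has far-reaching implications for the whole 3D vision community.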

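Weakest Point of the Argument

Concerns about how well the self-supervised method generalizes: the model performs worse than the oracle method on the perfectly annotated Objaverse, which suggests that self-supervision offers no advantage where annotation quality is reliable. In addition, whether the model can keep its advantage on larger and more diverse real-world scenes still needs further validation.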