LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos

Abstract

LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. The method introduces three main components: Incremental Joint Optimization, which concurrently optimizes camera poses and 3D Gaussians; a robust Pose Estimation Module that leverages learned 3D priors from MASt3R; and an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. The authors report state-of-the-art results in rendering quality, pose accuracy, and computational efficiency.

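Paragraph function: Overview of the whole paper. It defines the problem through three challenges (irregular motion, unknown poses, large scenes) and answers with a three-component architecture.

Logical role: The abstract is organized as a one-to-one "challenge-to-component" mapping, with each component answering one specific challenge, which keeps the argument compact and forceful.

Argumentative technique / potential weakness: The triple state-of-the-art claim (quality, accuracy, efficiency) has to be verified item by item in the experiments. Relying on MASt3R as an external prior means the method's success is partly attributable to the pretrained model rather than to the framework itself.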

1. Introduction

High-quality 3D reconstruction and novel view synthesis are essential for applications such as virtual reality, augmented reality, virtual tourism, and cultural heritage preservation. Traditional methods rely on accurate poses from Structure-from-Motion (SfM) preprocessing, yet pipelines like COLMAP frequently fail in casual settings. LongSplat departs from conventional approaches by jointly optimizing camera poses and 3D Gaussian Splatting (3DGS) in a unified framework, integrating correspondence-guided pose estimation with 3DGS-based geometric and photometric refinement.

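Paragraph function: Establishes the research context. It points out the fragility of SfM preprocessing and motivates unposed methods.

Logical role: "COLMAP frequently fails" serves as a strong motivation, elevating unposed 3D reconstruction from an academic curiosity to a practical need.

Argumentative technique / potential weakness: COLMAP's failure rate depends on scene characteristics and image quality. COLMAP usually performs well on structured scenes (such as indoor ones), so "frequently fail" may be an overgeneralization.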

Neural Radiance Fields (NeRF) revolutionized photorealistic rendering, with subsequent improvements in anti-aliasing, reflectance, sparse view training, and faster rendering. More recently, 3D Gaussian Splatting "enables real-time rendering through explicit representations". However, "most approaches still rely on pre-computed camera parameters from SfM". For unposed NVS, methods like i-NeRF, BARF, and GARF "typically assume small pose perturbations, limited camera motion, or additional priors, struggling with challenging trajectories". Casual long videos present unique challenges: free-moving trajectories, lack of pose information, and continuously expanding scenes.

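Paragraph function: Literature review. It traces the evolution from NeRF to 3DGS and then the current limitations of unposed methods.

Logical role: The problem is narrowed step by step: real-time rendering is solved (3DGS) and small pose perturbations are solved (BARF and others), but the combination of long video, free motion, and no poses remains open.

Argumentative technique / potential weakness: The limitations of existing methods are attributed to the constraints of their assumptions, implying that LongSplat needs none of them. Yet LongSplat's reliance on MASt3R is itself a prior assumption.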

3. Method

3.1 Preliminaries

3D Gaussian Splatting represents scenes with explicit Gaussians, each of which is projected onto the image plane using the camera pose W. The anchor-based representation from Scaffold-GS divides scenes into sparse voxels, with k Gaussians initialized per anchor. Unlike Scaffold-GS's fixed-resolution approach, LongSplat constructs structured anchors from MASt3R's per-frame dense point clouds using an adaptive octree.

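Paragraph function: Technical foundation. It introduces the prerequisites on 3DGS and anchor-based representation and leads into the octree improvement.

Logical role: Starting from Scaffold-GS's fixed resolution, the paragraph motivates the adaptive octree: scene density varies widely in long videos, so a fixed resolution wastes memory in sparse regions and gives insufficient quality in dense ones.

Argumentative technique / potential weakness: MASt3R as the source of dense point clouds is a critical dependency, since its prediction quality directly determines anchor quality. If MASt3R breaks down in particular scenes (such as textureless regions), the whole pipeline may be affected.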
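
As a rough illustration of the projection described in the Preliminaries paragraph, the sketch below projects a single Gaussian center into the image. It assumes W is a 4x4 world-to-camera matrix and a pinhole camera with intrinsics fx, fy, cx, cy; the function name project_center and these parameter names are hypothetical and do not come from the paper. Full 3DGS also projects each Gaussian's covariance, which this sketch leaves out.

```python
def project_center(point_world, W, fx, fy, cx, cy):
    """Project a 3D point to pixel coordinates.

    W is assumed to be a 4x4 world-to-camera matrix given as nested lists.
    Returns (u, v), or None when the point lies behind the camera.
    """
    x, y, z = point_world
    # Camera-space coordinates: rotate and translate with the top three rows of W.
    xc, yc, zc = (
        W[r][0] * x + W[r][1] * y + W[r][2] * z + W[r][3] for r in range(3)
    )
    if zc <= 0.0:
        return None  # not visible: behind the image plane
    # Pinhole projection with perspective division.
    return (fx * xc / zc + cx, fy * yc / zc + cy)


# Hypothetical pose: no rotation, camera shifted 2 units along the viewing axis.
W_example = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 0.0, 1.0],
]
pixel = project_center((1.0, 0.5, 2.0), W_example, fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

In a joint optimization setting, W would be a learnable quantity, so gradients of the rendered image with respect to the pose flow through exactly this kind of transform.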

3.2 Octree Anchor Formation

LongSplat constructs structured anchors from MASt3R's per-frame dense point clouds using an adaptive octree. The method progressively subdivides space based on local point density: voxels whose density exceeds the threshold tau_split are split into 8 smaller voxels, while low-density voxels are removed to reduce redundancy. Each anchor inherits a spatial scale proportional to its voxel size, which yields coarse anchors for sparsely observed areas and finer anchors for detailed regions. The authors report 7.92x compression relative to the dense point cloud representation.

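Paragraph function: One of the core innovations. It describes how the adaptive octree achieves memory efficiency.

Logical role: The 7.92x compression is key to handling long videos: point clouds accumulated over thousands of frames would exhaust memory without compression, and the octree provides compression that adapts to the local density of the scene.

Argumentative technique / potential weakness: The density threshold tau_split is a critical hyperparameter. Set too high, it leaves coarse anchors with insufficient quality in detailed regions; set too low, it reduces compression efficiency. How the threshold is chosen and how sensitive the results are to it deserve attention.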
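
The split-and-prune procedure of Octree Anchor Formation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: density is approximated by the raw point count per voxel, and tau_prune, max_depth, root_origin, and root_size are hypothetical parameters added here (the source names only tau_split).

```python
from collections import defaultdict


def build_octree_anchors(points, root_origin, root_size,
                         tau_split=8, tau_prune=2, max_depth=4):
    """Return a list of (center, voxel_size) anchors for points inside the root voxel."""
    anchors = []
    # Each entry: (voxel min corner, edge length, points inside, depth).
    stack = [(tuple(root_origin), float(root_size), list(points), 0)]
    while stack:
        origin, size, pts, depth = stack.pop()
        if len(pts) < tau_prune:
            continue  # low-density voxel: removed to reduce redundancy
        if len(pts) > tau_split and depth < max_depth:
            half = size / 2.0
            children = defaultdict(list)
            for p in pts:
                # Octant index (0 or 1 per axis); clamp points on the upper face.
                key = tuple(min(int((p[i] - origin[i]) // half), 1) for i in range(3))
                children[key].append(p)
            for key, child_pts in children.items():
                child_origin = tuple(origin[i] + key[i] * half for i in range(3))
                stack.append((child_origin, half, child_pts, depth + 1))
        else:
            # Leaf voxel becomes one anchor; its scale is the voxel edge length.
            center = tuple(origin[i] + size / 2.0 for i in range(3))
            anchors.append((center, size))
    return anchors
```

On a toy input with one tight cluster, a few scattered points, and a lone outlier, the tight cluster ends up in a small voxel, the scattered points share one coarse anchor, and the outlier is pruned.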

3.3 Pose Estimation Module

For each new frame, MASt3R provides 2D correspondences, allowing back-projection of matched points to 3D. The resulting 2D-3D correspondences are used to solve for the initial pose via Perspective-n-Point (PnP), followed by photometric refinement. To correct MASt3R's depth scale drift, the method computes a scale factor by comparing rendered depth with MASt3R's aligned depth. Newly visible regions are detected via an occlusion mask and unprojected into 3D, then converted into anchors via Octree Anchor Formation. If PnP fails, a fallback mechanism triggers global re-optimization of all past frames.

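Paragraph function: Robustness design. It describes the full pipeline from coarse pose to refined pose, including the failure fallback.

Logical role: The multi-stage pose estimation strategy (MASt3R correspondences -> PnP -> photometric refinement -> global fallback) reflects robustness-minded engineering, with later stages acting as a safety net when earlier ones fail.

Argumentative technique / potential weakness: Scale drift correction is a practical but after-the-fact patch. Global re-optimization may be computationally expensive on long sequences, but as a rarely triggered fallback it is an acceptable engineering compromise.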
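
The depth scale correction in the Pose Estimation Module can be illustrated with a small sketch. The source only says that a scale factor is computed by comparing rendered depth with MASt3R's aligned depth; taking the median of per-pixel ratios is an assumption made here for robustness to outliers, and the names depth_scale_factor and align_depth are hypothetical.

```python
from statistics import median


def depth_scale_factor(rendered_depth, prior_depth, min_depth=1e-6):
    """Estimate the scale s such that s * prior_depth approximates rendered_depth.

    Both inputs are flat sequences of per-pixel depths at the same pixels.
    Pixels with non-positive depth in either map are ignored.
    """
    ratios = [
        r / p
        for r, p in zip(rendered_depth, prior_depth)
        if r > min_depth and p > min_depth
    ]
    if not ratios:
        raise ValueError("no valid depth pairs to compare")
    # Median ratio (an assumed choice): insensitive to a few bad pixels.
    return median(ratios)


def align_depth(prior_depth, scale):
    """Rescale the prior depth map into the scale of the reconstructed scene."""
    return [scale * p for p in prior_depth]


# Toy data: three consistent pixels, one invalid pixel, one outlier.
rendered = [2.0, 4.0, 6.0, 0.0, 50.0]
prior = [1.0, 2.0, 3.0, 1.5, 5.0]
scale = depth_scale_factor(rendered, prior)
```

A least-squares fit would be another reasonable choice; the median is used here only because a single outlier ratio (the last pixel in the toy data) would otherwise pull the estimate.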

3.4 Incremental Joint Optimization

The process begins with a small set of initial frames, using MASt3R for pose and point cloud estimation. After initialization, all 3D Gaussian parameters and camera poses are jointly optimized across all processed frames. Covisibility between frames is measured using the Intersection-over-Union (IoU) of visible anchors, and frames whose covisibility falls below a threshold are excluded from the local optimization window. This visibility-adaptive mechanism ensures local Gaussians are consistently supervised by reliable multi-view constraints. The total loss combines a photometric loss, a depth loss from MASt3R's scale-aligned depth prior, and a keypoint reprojection loss.

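Paragraph function: Core optimization strategy. It describes how global and local optimization alternate.

Logical role: The covisibility-adaptive window is the key innovation for long sequences: a fixed window may include non-overlapping frames when the camera moves quickly, whereas the adaptive window ensures every local optimization has valid multi-view constraints.

Argumentative technique / potential weakness: The three loss terms provide complementary supervision: photometric for appearance quality, depth for geometric accuracy, and reprojection for pose consistency. The balance of their weights may, however, need tuning per scene.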
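
The visibility-adaptive window and the combined loss from Incremental Joint Optimization can be sketched as follows. This is an illustration only: the threshold value, the loss weights w_depth and w_reproj, and all function names are hypothetical, and anchors are represented simply as integer ids.

```python
def anchor_iou(visible_a, visible_b):
    """Intersection-over-Union of two sets of visible anchor ids."""
    a, b = set(visible_a), set(visible_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def local_window(current_id, visible_anchors, threshold=0.3):
    """Ids of earlier frames whose covisibility with the current frame meets the threshold.

    visible_anchors maps frame id -> set of anchor ids visible in that frame.
    """
    current = visible_anchors[current_id]
    return [
        fid
        for fid, anchors in visible_anchors.items()
        if fid != current_id and anchor_iou(current, anchors) >= threshold
    ]


def total_loss(photometric, depth, reprojection, w_depth=0.1, w_reproj=0.1):
    """Weighted sum of the three loss terms; the weights are assumed values."""
    return photometric + w_depth * depth + w_reproj * reprojection


# Toy example: frame 3 is the current frame.
visible = {
    0: {1, 2, 3, 4},
    1: {3, 4, 5, 6},
    2: {4, 5, 6, 7},
    3: {2, 3, 4, 5},
    4: {10, 11},
}
window = local_window(3, visible, threshold=0.5)
```

Because membership depends on overlap rather than on frame index, the window shrinks automatically when the camera moves quickly and grows when it lingers on the same region.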

4. Experiments

Evaluation is conducted on three datasets: Tanks and Temples (smooth forward-facing), Free Dataset (complex unconstrained), and Hike Dataset (long videos with complex geometry). On Tanks and Temples, LongSplat achieves state-of-the-art rendering quality with an average PSNR of 32.83 dB. Competing methods like CF-3DGS often run out of memory, while LocalRF produces fragmented geometry and pose drift. LongSplat renders at 281.71 FPS and trains in just 1 hour on an NVIDIA RTX 4090, nearly 30x faster than LocalRF.

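Paragraph function: Comprehensive experimental comparison. Three datasets of increasing difficulty demonstrate the method's robustness.

Logical role: The 30x speedup and the freedom from out-of-memory failures are highly persuasive practical advantages. The increasing difficulty across the three datasets (smooth -> complex -> long sequences) systematically validates the method's scalability.

Argumentative technique / potential weakness: A PSNR of 32.83 dB is an excellent result, but PSNR is a limited proxy for perceptual quality; the corresponding LPIPS and SSIM results would be more convincing. The 30x speedup is partly due to the inherent efficiency advantage of 3DGS over NeRF.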
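
For readers unfamiliar with the metric discussed above, PSNR is derived from the mean squared error between a rendered image and its reference. The sketch below uses the standard definition on flat lists of pixel values in [0, 1]; the function name psnr and the toy values are illustrative and unrelated to the paper's evaluation code.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: two of four pixels are off by 0.1.
value_db = psnr([0.0, 0.5, 1.0, 0.25], [0.1, 0.5, 0.9, 0.25])
```

Since the score depends only on per-pixel squared error, two renderings with the same PSNR can look quite different, which is why the note above asks for LPIPS and SSIM alongside it.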

Ablation studies confirm each component's contribution. Removing pose estimation, global optimization, or local optimization significantly degrades performance. The visibility-adaptive window achieves the best balance compared to fixed-size windows. The adaptive octree achieves superior reconstruction quality with 7.92x memory compression. The method is also robust to pose drift, keeping drift small and trajectories stable even over long sequences.

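Paragraph function: Component validation. The ablation studies confirm the necessity of the three core components one by one.

Logical role: The 7.92x compression and the advantage of the visibility-adaptive window give empirical support to the design decisions. The drift robustness analysis directly addresses the central concern about accumulated error in long sequences.

Argumentative technique / potential weakness: The ablations are thorough but lack an ablation of MASt3R itself. How would performance change if MASt3R were replaced with another depth estimation method? That would reveal how much the framework depends on a specific pretrained model.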

5. Conclusion

The paper presents a complete framework addressing the challenges of reconstructing casual long videos. The method integrates incremental joint optimization, a robust tracking module, and adaptive octree anchors, significantly improving pose accuracy, reconstruction quality, and memory efficiency. The authors note limitations shared with other unposed reconstruction methods: it assumes static scenes and fixed intrinsics, which makes it unsuitable for dynamic objects or varying focal lengths. Future work includes handling dynamic scenes and enhancing pose estimation robustness.

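Paragraph function: Summary and outlook. It restates the contributions and candidly discloses the limitations.

Logical role: The candor of the conclusion is commendable: it states plainly that the static-scene assumption is a fundamental limitation, which is a major constraint in practice (for example, pedestrians and vehicles in street-level capture).

Argumentative technique / potential weakness: The static-scene assumption is shared by all methods built on multi-view consistency, and the candor here actually strengthens the paper's credibility. The fixed-intrinsics assumption may be violated on modern smartphones (autofocus).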

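Argument Structure Overview

Problem: casual long videos; unposed, large scenes; COLMAP prone to failure
→ Claim: incremental joint optimization + octree anchors + MASt3R prior
→ Evidence: PSNR 32.83 dB; 30x speedup; 7.92x compression
→ Rebuttal: covisibility-adaptive window + fallback mechanism, guarding against drift and failure
→ Conclusion: unposed 3DGS can now handle casual long videos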

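Authors' core claim (one sentence)

By integrating incremental joint optimization, pose estimation with learned 3D priors, and adaptive octree anchors, high-quality and efficient 3D reconstruction and novel view synthesis can be achieved from unposed, casually captured long videos.

Strongest point of the argument

Systematic robustness design: multi-stage pose estimation (correspondences -> PnP -> photometric refinement -> global fallback) and the covisibility-adaptive window keep the method stable on highly challenging long sequences. The 7.92x memory compression makes the method practical, whereas competing methods (such as CF-3DGS) run out of memory under the same conditions.

Weakest point of the argument

Heavy dependence on MASt3R: several core components of the framework (point cloud initialization, pose estimation, scale correction) all rely on MASt3R, which makes it hard to credit the method's success entirely to LongSplat's own design. In addition, the static-scene and fixed-intrinsics assumptions are often violated in real-world casual capture, limiting the method's applicability.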