SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Abstract

The paper introduces SceneSplat, described as "the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS". It addresses the challenge of effectively integrating semantic reasoning into 3DGS in a generalizable manner. Key contributions include SceneSplat-7K, a new dataset containing 7,916 scenes derived from seven established datasets totaling 8.15 billion 3D Gaussians. The authors propose a self-supervised learning scheme called GaussSSL that enables rich 3D feature learning from unlabeled scenes, achieving state-of-the-art performance in zero-shot semantic segmentation.

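Paragraph function: Overview of the whole paper. The "first" claim establishes differentiation, and the large-scale dataset and the self-supervised method serve as its two pillars.

Logical role: The scale of 8.15 billion Gaussians lends credibility to the empirical foundation, and the "native 3DGS" positioning separates the work from existing methods that project to 2D and then fuse.

Argumentation technique / potential weakness: The "first" claim calls for caution. Prior work such as LangSplat also integrated language features into 3DGS, so the definition of "native" used here (an inference stage that does not depend on 2D fusion) needs to be made more explicit.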

1. Introduction

Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Current systems trained on fixed closed-set categories from datasets like ScanNet fail to capture real-world diversity. The fundamental challenge is "the absence of a robust model for processing 3D data end-to-end for semantic learning, along with the lack of sufficient data for training such a model". 3D Gaussian Splatting has emerged as the de facto standard for 3D scene representation, but integrating semantic reasoning remains problematic. Prior approaches, such as optimizing additional semantic features for each scene, are inefficient and do not transfer beyond the single scene they were fitted to.

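Paragraph function: Defines a dual gap: a model gap (end-to-end 3D semantic learning) and a data gap (a large-scale 3DGS dataset).

Logical role: The dual gap gives a one-to-one motivation for SceneSplat's two main contributions (the SceneSplat-7K dataset and the GaussSSL method).

Argumentation technique / potential weakness: Calling per-scene optimization inefficient is a fair criticism, but it is also the most direct way to integrate semantics into 3DGS. SceneSplat's feed-forward approach trades per-scene adaptability for efficiency.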

Existing 3D datasets (ScanNet, ScanNet++, Hypersim, ARKitScenes) are essential for 3D perception but lack large-scale support for emerging 3D representations like 3DGS. For open vocabulary scene understanding, LERF integrated language queries into NeRF but requires time-consuming preprocessing. LangSplat combined 3DGS with open-vocabulary embeddings from SAM and CLIP. SceneSplat's distinction: it enables "feed-forward open-vocabulary understanding of 3DGS" without requiring time-consuming preprocessing.

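Paragraph function: Literature review. Moving from the dataset gap to the method gap, it positions SceneSplat systematically.

Logical role: "Feed-forward vs. per-scene optimization" is the core distinction: the former scales to large numbers of scenes, while the latter must be trained independently on each scene.

Argumentation technique / potential weakness: The efficiency of the feed-forward approach comes at the cost of scene-specific accuracy. LangSplat's per-scene optimization may reach higher accuracy on a particular scene.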
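To make "feed-forward open-vocabulary understanding" concrete, here is a minimal sketch of how a zero-shot label could be read off once a model has predicted one language-aligned feature per Gaussian. It assumes the common recipe of comparing each feature with text embeddings of candidate class names by cosine similarity. The function names and the toy 3-dimensional embeddings are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_labels(gaussian_feats, text_embeds):
    """Give each Gaussian the class whose text embedding is most similar."""
    names = list(text_embeds)
    labels = []
    for feat in gaussian_feats:
        sims = [cosine(feat, text_embeds[name]) for name in names]
        labels.append(names[sims.index(max(sims))])
    return labels

# Toy embeddings (hypothetical values, not real vision-language model outputs).
text_embeds = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
gaussian_feats = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
labels = zero_shot_labels(gaussian_feats, text_embeds)
print(labels)  # ['chair', 'table']
```

Because the class list is only a dictionary of text embeddings, new categories can be queried without retraining, which is the practical meaning of "open vocabulary" in this setting.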

3. Method

3.1 3DGS Language Label Collection

The approach uses SAMv2 for object-level segmentation and SigLIP2 for feature extraction, then employs Occam's LGS to efficiently lift 2D feature maps to a 3D Gaussian feature field. Rather than aligning with text embeddings, the method directly aligns Gaussians with the image embedding space of vision-language models, preserving richer latent semantic information. A dynamic weighting mechanism combines three feature types: global features capturing full scene context, local features from crops with background, and masked features focusing solely on objects.

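Paragraph function: Data preparation pipeline. Describes how language features are distilled from 2D foundation models into 3D Gaussians.

Logical role: Choosing image embeddings over text embeddings is a meaningful design decision: image embeddings retain more visual detail, whereas text embeddings pass through linguistic compression and may lose spatial information.

Argumentation technique / potential weakness: Dynamically weighting three feature types adds complexity but provides multi-scale semantic understanding. However, the label collection process depends on several external components (SAMv2, SigLIP2, Occam's LGS), so its quality ceiling is bounded by what they can do.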
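The summary does not state the rule behind the dynamic weighting, so the sketch below only illustrates the general shape of such a fusion: a convex combination of the global, local, and masked features, renormalized to unit length. The softmax over caller-supplied scores is a placeholder assumption, and `fuse_features` and its parameters are hypothetical names.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_features(f_global, f_local, f_masked, scores):
    """Blend three feature vectors with weights derived from `scores`.

    `scores` holds one relevance score per feature type. How these scores
    are obtained is NOT specified in the source; they are an assumption.
    """
    weights = softmax(scores)
    feats = [l2_normalize(f) for f in (f_global, f_local, f_masked)]
    dim = len(feats[0])
    fused = [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(dim)]
    return l2_normalize(fused), weights

# Equal scores give equal weights of 1/3 each.
fused, weights = fuse_features([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0, 0.0])
print(weights)
print(fused)
```

Normalizing each input before blending keeps any one feature type from dominating through magnitude alone, so the weights carry the whole trade-off between scene context and object focus.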

3.2 Vision-Language 3DGS Pretraining

The model architecture adapts a transformer encoder-decoder backbone to map input Gaussians to language features. Three training objectives are employed: Cosine Similarity Loss minimizes angular difference, L2 Loss enforces feature similarity in Euclidean space, and Aggregated Contrastive Loss encourages class-level feature separation through class-wise mean pooling. Notably, "applying the contrastive loss later during training helps promote early feature learning while effectively refining class distinctions".

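Paragraph function: Training strategy. Covers the design and scheduling of the three loss functions.

Logical role: The three losses provide complementary learning signals: cosine similarity handles direction, L2 handles distance, and the contrastive loss handles class boundaries. Delaying the contrastive loss prevents over-specialization early in training.

Argumentation technique / potential weakness: The delayed contrastive loss is an interesting finding, suggesting that contrastive learning can be harmful before the feature space has taken shape. However, the best delay point (75%) appears empirical rather than principled.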
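The three objectives and the delayed schedule can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the contrastive term is written as an InfoNCE-style loss over class-wise mean features, the three terms are summed with equal weight, the temperature is arbitrary, and the 75% figure is read as the fraction of training elapsed before the contrastive term switches on.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_loss(pred, target):
    """Mean of (1 - cosine similarity): penalizes angular difference."""
    return sum(1.0 - cosine(p, t) for p, t in zip(pred, target)) / len(pred)

def l2_loss(pred, target):
    """Mean squared Euclidean distance between predicted and target features."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, t))
               for p, t in zip(pred, target)) / len(pred)

def class_means(feats, labels):
    """Class-wise mean pooling: average all features that share a label."""
    groups = {}
    for feat, label in zip(feats, labels):
        groups.setdefault(label, []).append(feat)
    return {c: [sum(col) / len(fs) for col in zip(*fs)] for c, fs in groups.items()}

def aggregated_contrastive_loss(pred, target, labels, temperature=0.1):
    """Each pooled prediction should match its own class's pooled target
    more closely than any other class's (assumed InfoNCE form)."""
    pred_means = class_means(pred, labels)
    target_means = class_means(target, labels)
    classes = sorted(pred_means)
    loss = 0.0
    for c in classes:
        logits = [cosine(pred_means[c], target_means[k]) / temperature for k in classes]
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        loss += log_sum_exp - logits[classes.index(c)]
    return loss / len(classes)

def total_loss(pred, target, labels, progress, contrast_start=0.75):
    """`progress` is the fraction of training completed, in [0, 1]."""
    loss = cosine_loss(pred, target) + l2_loss(pred, target)
    if progress >= contrast_start:  # contrastive term only late in training
        loss += aggregated_contrastive_loss(pred, target, labels)
    return loss

target = [[1.0, 0.0], [0.0, 1.0]]
pred = [[0.8, 0.2], [0.1, 0.9]]
labels = ["chair", "table"]
early = total_loss(pred, target, labels, progress=0.5)
late = total_loss(pred, target, labels, progress=0.9)
print(early, late)
```

Pooling by class before contrasting means the loss compares a handful of class prototypes rather than every pair of Gaussians, which keeps the contrastive term cheap even for scenes with millions of primitives.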

3.3 Self-Supervised Pretraining (GaussSSL)

Three components are integrated. Masked Gaussian Modeling (MGM) samples subsets of Gaussians, masks them, and reconstructs them with an L2 loss. Self-Distillation follows the DINO framework, with student and teacher networks and EMA updates for the teacher. Language-Gaussian Alignment regularizes self-supervised learning with precomputed language features; because those features are high-dimensional, they are first compressed by an autoencoder, and the compressed representation replaces the originals.

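Paragraph function: Label-free learning. Describes the three-component self-supervised pretraining strategy of GaussSSL.

Logical role: The three components provide, respectively, spatial structure learning (MGM), semantic consistency learning (self-distillation), and language-alignment regularization. Together they let the model learn useful representations from unlabeled 3DGS scenes.

Argumentation technique / potential weakness: GaussSSL's three-component design draws on mature self-supervised paradigms (MAE, DINO) and transfers them to the 3DGS domain. But the ablation results show only limited gains (+0.1% to +0.5%), so the value of self-supervision needs stronger validation.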
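Two of the GaussSSL ingredients are mechanical enough to sketch: choosing which Gaussians to mask and scoring the reconstruction only on those, and the exponential-moving-average teacher update used in DINO-style self-distillation. The mask ratio, the momentum value, and all function names below are illustrative assumptions; the summary does not give the settings used in the paper.

```python
import random

def mask_gaussians(num_gaussians, mask_ratio, rng):
    """Pick the indices of the Gaussians to hide from the model."""
    num_masked = int(num_gaussians * mask_ratio)
    return sorted(rng.sample(range(num_gaussians), num_masked))

def masked_l2_loss(reconstructed, original, masked_idx):
    """Mean squared error computed only over the masked Gaussians."""
    total = 0.0
    for i in masked_idx:
        total += sum((a - b) ** 2 for a, b in zip(reconstructed[i], original[i]))
    return total / len(masked_idx)

def ema_update(teacher, student, momentum=0.99):
    """teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

rng = random.Random(0)  # fixed seed so the mask is reproducible
masked_idx = mask_gaussians(10, 0.5, rng)

# Toy per-Gaussian attributes (hypothetical 2-D values).
original = [[float(i), float(i) + 1.0] for i in range(10)]
reconstructed = [[a + 0.5, b] for a, b in original]
mgm_loss = masked_l2_loss(reconstructed, original, masked_idx)

# Flattened toy parameters; the teacher drifts slowly toward the student.
teacher = ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9)
print(masked_idx, mgm_loss, teacher)
```

The teacher receives no gradients. It only tracks a smoothed copy of the student, which is what gives the student a stable target to distill from.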

4. Experiments

On ScanNet200, SceneSplat achieves a 5.9% f-mIoU increase over prior work when trained on single sources. On ScanNet++, results show 11.1% f-mIoU improvement while using significantly less training data. Remarkably, inference results can "even be better than using the collected labels", demonstrating that large-scale pretraining is able to filter noise and learn meaningful patterns. For efficiency, SceneSplat requires 0.24 minutes per scene versus 107 minutes for Occam's LGS — approximately 445.8 times faster.

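Paragraph function: Core experimental results. Demonstrates the method's advantage with figures on both accuracy and efficiency.

Logical role: The 445.8x speedup is the most striking figure and directly answers the motivating complaint that per-scene optimization is too slow. The finding that inference can beat the labels suggests the generalization value of large-scale pretraining.

Argumentation technique / potential weakness: The 445.8x speedup arises partly because SceneSplat is feed-forward while Occam's LGS must process each scene separately; the two designs are fundamentally different. The f-mIoU gains are significant, but the differing data conditions across methods should be kept in mind.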
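The efficiency figures can be checked with a few lines of arithmetic. The two per-scene timings are the ones quoted above; the variable names are only for illustration.

```python
occam_minutes = 107.0       # Occam's LGS, minutes per scene (from the text)
scenesplat_minutes = 0.24   # SceneSplat, minutes per scene (from the text)

speedup = occam_minutes / scenesplat_minutes
scenesplat_seconds = scenesplat_minutes * 60
occam_hours = occam_minutes / 60

print(f"{speedup:.1f}x, {scenesplat_seconds:.1f} s, {occam_hours:.2f} h")
# 445.8x, 14.4 s, 1.78 h
```

In other words, the per-scene cost drops from roughly an hour and three quarters to under fifteen seconds.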

Ablation studies reveal a clear positive trend between input 3DGS quality (PSNR) and resulting mIoU, highlighting the importance of data curation. When comparing 3DGS parameters versus point cloud properties, the model trained on 3DGS parameters consistently outperforms the point cloud variant. The paper also identifies temporal inconsistency in the SAMv2+SigLIP2 pipeline, particularly for large background objects, resulting in corrupted feature fields.

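Paragraph function: In-depth analysis. The ablation studies show both the soundness of the design decisions and the known limitations.

Logical role: The "3DGS > point cloud" ablation directly validates the "native 3DGS" positioning. The positive PSNR-mIoU trend underscores the importance of quality control in the SceneSplat-7K dataset.

Argumentation technique / potential weakness: Candidly disclosing the temporal inconsistency of SAMv2+SigLIP2 strengthens the paper's credibility, but it also exposes an inherent limitation of the language label collection pipeline: the quality of the feed-forward model is bounded by the quality of its distillation source.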

5. Conclusion

SceneSplat establishes "the first large-scale 3D scene understanding model for indoor environments operating directly on 3D Gaussian splats". It enables "open-vocabulary scene recognition without relying on 2D fusion" through vision-language pretraining, and unlocks "label-free 3DGS pretraining at the scene level" through self-supervised techniques. The approach "achieves state-of-the-art performance in zero-shot semantic segmentation, establishing new benchmarks" for future 3D understanding research.

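Paragraph function: Summarizes the paper, restating its pioneering nature through three "first" claims.

Logical role: The conclusion sums up the contributions in a three-part structure of dataset, method, and benchmark, implying that SceneSplat is not just a method but the starting point of an ecosystem.

Argumentation technique / potential weakness: The conclusion does not discuss applicability to outdoor scenes, the handling of dynamic scenes, or the fact that self-supervised pretraining (GaussSSL) yields only limited gains (+0.1% to +0.5%). The "first" positioning may quickly become dated in such a fast-moving field.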

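Overview of the Argument Structure

Problem: 3DGS lacks native semantic understanding, and no large-scale dataset exists.

→

Claim: Feed-forward vision-language pretraining plus self-supervision enables native 3DGS understanding.

→

Evidence: f-mIoU +5.9% / +11.1%; 445.8x speedup; inference outperforms the labels.

→

Rebuttal: The SceneSplat-7K dataset; quality curation ensures a positive PSNR-mIoU correlation.

→

Conclusion: Large-scale native 3DGS semantic understanding is now feasible and sets a new benchmark.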

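The Authors' Core Claim (in One Sentence)

By building a large-scale 3DGS dataset (SceneSplat-7K) and combining vision-language pretraining with self-supervised learning, it becomes possible for the first time to perform feed-forward open-vocabulary scene understanding that operates natively on 3D Gaussian splats, with a roughly 445x speed advantage over per-scene optimization methods.

Strongest Point of the Argument

A dual breakthrough in efficiency and accuracy: the 445.8x speed advantage transforms the usability of 3DGS semantic understanding, cutting the time per scene from nearly two hours to about 15 seconds. At the same time, the 11.1% f-mIoU improvement on ScanNet++ shows that the efficiency gain does not come at the expense of accuracy. The finding that inference can outperform the labels further reveals the denoising ability of large-scale pretraining.

Weakest Point of the Argument

Limited marginal benefit of self-supervised pretraining (GaussSSL): GaussSSL yields only a +0.1% to +0.5% mIoU improvement, and given its design complexity (three components: masked modeling, self-distillation, and language alignment), the return on investment is questionable. In addition, the language label collection pipeline relies heavily on several 2D foundation models, so the "native 3DGS" claim does not fully hold at the training stage.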