3D Gaussian Splatting for Real-Time Radiance Field Rendering

Abstract

Radiance field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and, importantly, enabling real-time (≥30 fps) novel-view synthesis at 1080p resolution: (1) anisotropic 3D Gaussians as a flexible scene representation, (2) interleaved optimization with adaptive density control of the 3D Gaussians, and (3) a fast, visibility-aware GPU rendering algorithm that supports anisotropic splatting.

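Paragraph function: Overview of the whole paper. It opens with the quality-versus-speed dilemma and uses it to introduce the three core contributions.

Logical role: The abstract positions the work precisely: it advances both axes of the quality-speed Pareto frontier at once. The three contributions map onto the full pipeline of representation → optimization → rendering.

Rhetorical technique / potential weakness: "Real-time at 1080p" is a striking metric that moves NeRF-style methods from offline tools to real-time applications. The claim that state-of-the-art visual quality and real-time rendering are achieved together still needs careful verification, since in some scenes the two may not both hold.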

1. Introduction

Neural Radiance Fields (NeRF) and its variants have enabled stunning novel-view synthesis quality. However, they rely on costly volumetric ray-marching with neural network evaluations at each sample point. Fast NeRF variants like InstantNGP and Plenoxels use spatial data structures to accelerate rendering but cannot achieve real-time rendering at high resolutions. A key insight from point-based rendering is that "point-based alpha-blending and NeRF-style volumetric rendering share essentially the same image formation model". We build on this insight, demonstrating that a "continuous representation is not strictly necessary for fast, high-quality training".

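Paragraph function: Establishes the research context. Starting from NeRF's speed bottleneck, it introduces the equivalence insight from point-based rendering.

Logical role: The start of the argument chain: it first acknowledges NeRF's quality, then identifies the speed bottleneck, and finally uses the insight that alpha-blending is equivalent to volumetric rendering to legitimize discrete representations.

Rhetorical technique / potential weakness: "A continuous representation is not strictly necessary" is a bold and important claim that challenges the implicit assumption, held since NeRF, that continuous means superior. However, the equivalence holds in theory; in practice, a discrete representation may produce aliasing or holes at boundaries.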
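
The shared image-formation model quoted in the introduction can be written out explicitly. The following is a sketch in standard volume-rendering notation; the symbols (density \sigma_i, sample spacing \delta_i, color c_i, opacity \alpha_i, transmittance T_i) are introduced here for illustration and do not appear in the summary above.

```latex
% NeRF-style volumetric rendering along one ray with N samples
C = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) c_i,
\qquad
T_i = \exp\!\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big)

% Substituting \alpha_i = 1 - e^{-\sigma_i \delta_i} gives
% T_i = \prod_{j=1}^{i-1} (1 - \alpha_j), so the same color is
% obtained by point-based alpha blending of N ordered points:
C = \sum_{i=1}^{N} c_i \, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)
```

Under this substitution the two sums are term-by-term identical; what differs between the approaches is how the samples and their opacities are produced, not how they are composited.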

2. Related Work

Traditional reconstruction methods used Structure-from-Motion (SfM) and Multi-View Stereo (MVS) for view synthesis through image reprojection and blending. NeRF-based methods employ volumetric ray-marching with neural networks; recent fast variants (InstantNGP, Plenoxels) use spatial data structures but still require hundreds of sample evaluations per pixel per frame (Plenoxels, notably, evaluates its samples without a neural network). Point-based rendering has a long history, from surfels to differentiable point rasterization. Our work establishes that the discrete point-based representation is flexible enough to allow creation, destruction, and displacement of geometric primitives during optimization.

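Paragraph function: Literature review. It brings three technical lineages together to position 3D Gaussian Splatting.

Logical role: Establishes the line of development: traditional methods (explicit) → NeRF (implicit, continuous) → this work (explicit, differentiable), implying that the return from continuous to discrete is a reasonable technical progression.

Rhetorical technique / potential weakness: Quantifying NeRF's bottleneck as hundreds of evaluations per pixel gives readers an intuitive sense of why an alternative is needed. However, the quality problems of traditional point-based rendering (holes, aliasing) are downplayed; how 3D Gaussians resolve them is the central question the paper must answer.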

3. Differentiable 3D Gaussian Splatting

Each 3D Gaussian is defined by a mean mu (position) and a full 3D covariance matrix Sigma in world space. Rather than directly optimizing Sigma (which must remain positive semi-definite), we decompose it as Sigma = R S S^T R^T, where S is the diagonal scaling matrix and R is the rotation matrix, and store them separately as a scaling vector s and a rotation quaternion q. Each Gaussian additionally stores an opacity alpha and spherical harmonics (SH) coefficients for view-dependent appearance. For rendering, the 3D covariance is projected to a 2D screen-space covariance via Sigma' = J W Sigma W^T J^T, where J is the Jacobian of the affine approximation of the projective transformation and W is the viewing transformation.

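Paragraph function: Core representation. It defines how a 3D Gaussian is parameterized.

Logical role: This paragraph is the mathematical foundation of the method. Replacing direct optimization of Sigma with the s + q decomposition neatly turns the positive semi-definite constraint into an unconstrained optimization. Introducing spherical harmonics gives the discrete Gaussians view-dependent color.

Rhetorical technique / potential weakness: The choice of parameterization (s + q rather than Sigma directly) is a key engineering decision: it guarantees a physically valid covariance and suits gradient descent. However, the degrees of freedom of an anisotropic Gaussian (3 position + 3 scale + 4 rotation + 1 opacity + 48 SH coefficients = 59 parameters per Gaussian) can make memory consumption a problem when the number of Gaussians is large.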
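
To make the parameterization concrete, here is a minimal pure-Python sketch of Sigma = R S S^T R^T and Sigma' = J W Sigma W^T J^T. It is an illustration under stated assumptions, not the authors' implementation: the function names, the (w, x, y, z) quaternion order, the pinhole camera with focal lengths fx and fy, and the use of only the 3x3 rotation part of the viewing transformation as W are all choices made here.

```python
import math

def quat_to_rotmat(q):
    """Return the 3x3 rotation matrix of quaternion q = (w, x, y, z), normalizing q first."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def covariance_3d(scale, quat):
    """Sigma = R S S^T R^T with S = diag(scale); symmetric PSD by construction."""
    R = quat_to_rotmat(quat)
    S = [[scale[0], 0.0, 0.0], [0.0, scale[1], 0.0], [0.0, 0.0, scale[2]]]
    M = matmul(R, S)
    return matmul(M, transpose(M))

def project_covariance(sigma, W, t, fx, fy):
    """Sigma' = J W Sigma W^T J^T as a 2x2 matrix.

    W: 3x3 rotation part of the viewing transformation (assumption).
    t: Gaussian mean in camera space, with t[2] > 0.
    J: Jacobian at t of the pinhole projection (x, y, z) -> (fx*x/z, fy*y/z).
    """
    tx, ty, tz = t
    J = [[fx / tz, 0.0, -fx * tx / tz ** 2],
         [0.0, fy / tz, -fy * ty / tz ** 2]]
    JW = matmul(J, W)
    return matmul(matmul(JW, sigma), transpose(JW))

if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    sigma = covariance_3d((1.0, 2.0, 3.0), (1.0, 0.0, 0.0, 0.0))
    print(sigma)
    print(project_covariance(sigma, identity, (0.0, 0.0, 2.0), 1.0, 1.0))
```

Because Sigma is assembled as M M^T with M = R S, any real-valued s and q yield a symmetric positive semi-definite matrix, which is why the optimizer can update s and q freely without a constraint.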

4. Optimization with Adaptive Density Control

The loss function is L = (1 - lambda) * L1 + lambda * L_D-SSIM with lambda = 0.2. The key innovation is adaptive density control with two complementary strategies. Cloning: for under-reconstructed regions (small Gaussians with large positional gradients), duplicate the Gaussian and offset the copy along the gradient direction. Splitting: for over-reconstructed regions (large Gaussians covering too much space), replace the Gaussian with two smaller ones whose scale is divided by a factor of 1.6. Gaussians with opacity below a threshold are periodically pruned, and every 3000 iterations opacity is reset to near zero to allow pruning of floaters accumulated during optimization.

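Paragraph function: Optimization strategy. It describes the cloning and splitting mechanisms of adaptive density control.

Logical role: This paragraph presents one of the method's core innovations and exploits the biggest advantage of a discrete representation: the number of primitives can be adjusted dynamically. Cloning handles sparse regions, splitting handles oversized Gaussians, and pruning removes redundancy; together they form a complete topology-adaptation mechanism.

Rhetorical technique / potential weakness: The three-part mechanism of cloning, splitting, and pruning answers the central question of how a discrete representation adapts. However, the density-control hyperparameters (gradient threshold, opacity threshold, 3000-iteration period) require empirical tuning and may not be robust in extreme scenes.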
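
The loss and the clone / split / prune rules described in this section can be sketched as follows. This is a simplified illustration, not the authors' code: only lambda = 0.2 and the split factor 1.6 come from the text, while the `Gaussian` record, the threshold values, the clone step size, and the axis-aligned sampling of split positions (rotation is ignored) are assumptions made here.

```python
import math
import random
from dataclasses import dataclass, replace

LAMBDA = 0.2            # from the text
SPLIT_FACTOR = 1.6      # from the text
GRAD_THRESHOLD = 2e-4   # assumed positional-gradient threshold
SIZE_FRACTION = 0.01    # assumed: "large" means scale > 1% of scene extent
MIN_OPACITY = 0.005     # assumed pruning threshold
CLONE_STEP = 0.01       # assumed offset length for clones

def combined_loss(l1, d_ssim, lam=LAMBDA):
    """L = (1 - lambda) * L1 + lambda * L_D-SSIM."""
    return (1.0 - lam) * l1 + lam * d_ssim

@dataclass(frozen=True)
class Gaussian:            # hypothetical record, for illustration only
    mean: tuple            # (x, y, z) position
    scale: tuple           # per-axis scale s
    opacity: float
    pos_grad: tuple        # accumulated positional gradient

def densify_and_prune(gaussians, scene_extent, rng=None):
    """Apply one round of pruning, cloning, and splitting."""
    rng = rng or random.Random(0)
    out = []
    for g in gaussians:
        if g.opacity < MIN_OPACITY:
            continue                      # prune nearly transparent Gaussians
        grad_norm = math.sqrt(sum(c * c for c in g.pos_grad))
        if grad_norm < GRAD_THRESHOLD:
            out.append(g)                 # well reconstructed: keep as is
        elif max(g.scale) <= SIZE_FRACTION * scene_extent:
            # Under-reconstruction: keep the original and add a same-size
            # copy moved along the positional gradient direction.
            direction = tuple(c / grad_norm for c in g.pos_grad)
            moved = tuple(m + CLONE_STEP * d for m, d in zip(g.mean, direction))
            out.append(g)
            out.append(replace(g, mean=moved))
        else:
            # Over-reconstruction: replace with two smaller Gaussians whose
            # scale is divided by 1.6, positioned by sampling the original.
            small = tuple(s / SPLIT_FACTOR for s in g.scale)
            for _ in range(2):
                pos = tuple(rng.gauss(m, s) for m, s in zip(g.mean, g.scale))
                out.append(replace(g, mean=pos, scale=small))
    return out

if __name__ == "__main__":
    scene = [
        Gaussian((0.0, 0.0, 0.0), (0.01, 0.01, 0.01), 0.8, (1e-3, 0.0, 0.0)),
        Gaussian((1.0, 0.0, 0.0), (1.0, 1.0, 1.0), 0.8, (1e-3, 0.0, 0.0)),
        Gaussian((2.0, 0.0, 0.0), (0.01, 0.01, 0.01), 0.001, (1e-3, 0.0, 0.0)),
    ]
    print(len(densify_and_prune(scene, scene_extent=10.0)))
    print(combined_loss(0.1, 0.3))
```

The periodic opacity reset mentioned in the text is omitted; in this sketch it would amount to overwriting `opacity` with a small value every 3000 iterations so that Gaussians the optimizer does not restore fall below `MIN_OPACITY` and are pruned.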

5. Fast Differentiable Rasterizer

Our GPU rasterizer uses a tile-based approach: the screen is divided into tiles of 16x16 pixels. Each Gaussian is instantiated once for every tile it overlaps, assigned a 64-bit key combining depth and tile ID, and then globally sorted via GPU radix sort. The forward pass launches one thread block per tile, which accumulates color and opacity front-to-back until alpha saturates. The backward pass traverses the stored sorted lists back-to-front, recovering intermediate opacity values without storing explicit per-pixel lists. This enables "gradients on arbitrary numbers of blended Gaussians" while achieving 134 FPS on Mip-NeRF360 scenes at 1080p.

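Paragraph function: Rendering engine. It describes the design of the tile-based GPU rasterizer.

Logical role: This paragraph is the technical realization of the real-time rendering promise. The combination of a tile-based design, a global sort, and front-to-back and back-to-front traversal brings rasterization techniques from traditional graphics into differentiable rendering and is the key to the speed breakthrough.

Rhetorical technique / potential weakness: "134 FPS at 1080p" is a highly persuasive figure, three orders of magnitude faster than the 0.1 FPS reported for Mip-NeRF360. However, the design depends heavily on the parallelism of modern GPUs, and performance may drop sharply on older or mobile GPUs. In addition, sorting cost may become a bottleneck when the number of Gaussians is very large.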
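
The forward pass described in this section can be mimicked on the CPU in a few lines. The sketch below is a sequential stand-in for the GPU algorithm, written under assumptions: the key layout (tile ID in the high 32 bits, float32 depth bits in the low 32 bits), the square bounding radius per splat, the function names, and the 1e-4 transmittance cutoff are illustrative choices, and Python's built-in sort replaces the GPU radix sort.

```python
import struct

TILE = 16  # tile edge length in pixels, from the text

def overlapped_tiles(cx, cy, radius, width, height):
    """Return IDs of all tiles touched by a square of half-size radius at (cx, cy)."""
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    x0 = max(0, int((cx - radius) // TILE))
    x1 = min(tiles_x - 1, int((cx + radius) // TILE))
    y0 = max(0, int((cy - radius) // TILE))
    y1 = min(tiles_y - 1, int((cy + radius) // TILE))
    return [ty * tiles_x + tx
            for ty in range(y0, y1 + 1) for tx in range(x0, x1 + 1)]

def make_key(tile_id, depth):
    """Pack a 64-bit key: tile ID (high 32 bits) | float32 depth bits (low 32 bits).

    For positive depths the IEEE-754 bit pattern orders like the number itself,
    so sorting keys as integers sorts by tile first and by depth within a tile.
    """
    depth_bits = struct.unpack("<I", struct.pack("<f", depth))[0]
    return (tile_id << 32) | depth_bits

def sort_instances(splats, width, height):
    """splats: list of (cx, cy, radius, depth). Returns sorted (key, splat_index) pairs."""
    pairs = [(make_key(tile, depth), index)
             for index, (cx, cy, radius, depth) in enumerate(splats)
             for tile in overlapped_tiles(cx, cy, radius, width, height)]
    pairs.sort()  # stands in for the global GPU radix sort
    return pairs

def composite(samples, min_transmittance=1e-4):
    """Blend (rgb, alpha) samples front-to-back; stop once the pixel saturates.

    Returns the accumulated color and the remaining transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for rgb, alpha in samples:
        weight = alpha * transmittance
        for channel in range(3):
            color[channel] += weight * rgb[channel]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # later Gaussians can no longer contribute visibly
    return color, transmittance

if __name__ == "__main__":
    splats = [(8, 8, 2, 5.0), (8, 8, 2, 1.0), (40, 8, 2, 3.0)]
    print([index for _, index in sort_instances(splats, 64, 16)])
    print(composite([((1.0, 0.0, 0.0), 0.5), ((0.0, 1.0, 0.0), 0.5)]))
```

A single sort over combined keys yields, for every tile, a contiguous depth-ordered run of Gaussians. The same ordering can be walked in reverse during the backward pass, which is consistent with the text's point that no per-pixel lists need to be stored.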

6. Experiments

We evaluate on Mip-NeRF360, Tanks & Temples, Deep Blending, and synthetic Blender scenes. On Mip-NeRF360 (30K iterations): SSIM 0.815 (vs. Mip-NeRF360: 0.792), PSNR 27.21 dB (vs. 27.69 dB), training time ~42 minutes (vs. 48 hours), and rendering at 134 FPS (vs. 0.1 FPS). The method achieves comparable or superior visual quality to Mip-NeRF360 with over 1000x faster rendering and over 60x faster training. Ablation studies confirm that all components are critical — removing anisotropy, densification cloning/splitting, or unlimited gradient backpropagation each significantly degrades quality.

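Paragraph function: Provides a comprehensive quantitative comparison along three dimensions: quality, rendering speed, and training time.

Logical role: The empirical pillar: SSIM exceeds Mip-NeRF360 (0.815 vs. 0.792) while PSNR is slightly lower (27.21 vs. 27.69), and together with the 1000x rendering speedup this builds a highly persuasive "Pareto improvement" narrative.

Rhetorical technique / potential weakness: The "1000x rendering speedup" figure has enormous impact. However, PSNR is 0.48 dB below Mip-NeRF360, which may correspond to visible quality differences in some scenes. In addition, memory consumption (peak of about 20 GB) far exceeds that of NeRF methods but is not mentioned among the headline numbers.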
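
The headline ratios can be checked directly from the figures quoted in this section. The short script below only re-derives them; the dictionary layout and variable names are illustrative, and the inputs are the rounded values given above (about 42 minutes, 48 hours, 134 FPS, 0.1 FPS).

```python
# Figures as quoted in the summary for the Mip-NeRF360 dataset.
ours = {"ssim": 0.815, "psnr_db": 27.21, "train_min": 42, "fps": 134}
mipnerf360 = {"ssim": 0.792, "psnr_db": 27.69, "train_min": 48 * 60, "fps": 0.1}

render_speedup = ours["fps"] / mipnerf360["fps"]
train_speedup = mipnerf360["train_min"] / ours["train_min"]
psnr_gap_db = mipnerf360["psnr_db"] - ours["psnr_db"]
ssim_gain = ours["ssim"] - mipnerf360["ssim"]

print(f"rendering speedup: {render_speedup:.0f}x")
print(f"training speedup:  {train_speedup:.1f}x")
print(f"PSNR gap:          {psnr_gap_db:.2f} dB in favor of Mip-NeRF360")
print(f"SSIM gain:         {ssim_gain:.3f} in favor of 3D Gaussian Splatting")
```

With these inputs the rendering speedup is roughly 1340x and the training speedup roughly 68.6x, consistent with the "over 1000x" and "over 60x" claims, while the 0.48 dB PSNR gap shows that the improvement does not hold on every quality metric.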

7. Conclusion

We have presented 3D Gaussian Splatting, demonstrating that a continuous neural representation is not strictly necessary for fast, high-quality radiance field rendering. Our approach of representing scenes with anisotropic 3D Gaussians, optimized with adaptive density control, and rendered via a tile-based GPU rasterizer achieves state-of-the-art quality with real-time rendering speeds. Limitations include artifacts in poorly-observed regions, challenges with view-dependent effects, and higher memory consumption (peak ~20 GB) compared to NeRF-based methods. We believe this work opens a new direction in the tradeoff between representation expressiveness, optimization efficiency, and rendering speed.

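Paragraph function: Summarizes the paper, restating the core thesis and acknowledging limitations.

Logical role: The conclusion echoes the introduction's core thesis (a continuous representation is not necessary) and backs it with the full experimental evidence. Candidly listing three limitations demonstrates academic integrity.

Rhetorical technique / potential weakness: The conclusion frames the work as a new direction rather than a final solution, leaving room for development. Memory consumption (20 GB) is the most serious practical limitation: it far exceeds that of NeRF methods of comparable quality and restricts applicability to large scenes and low-end hardware.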

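Argument structure overview

Problem: NeRF delivers high quality, but rendering is extremely slow
→ Claim: discrete 3D Gaussians can replace a continuous neural representation
→ Evidence: SSIM 0.815, 134 FPS, 42 minutes of training
→ Rebuttal: adaptive density control addresses the sparsity of a discrete representation
→ Conclusion: a continuous representation is not necessary; the work opens a new direction for real-time radiance fields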

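Authors' core claim (one sentence)

Using anisotropic 3D Gaussians as the scene representation, combined with adaptive density control and a tile-based GPU rasterizer, achieves real-time novel-view synthesis at 1080p while maintaining state-of-the-art quality, demonstrating that a continuous neural representation is not a necessary condition for high-quality radiance field rendering.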

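Strongest point of the argument

A comprehensive advance on the speed-quality Pareto frontier: compared with Mip-NeRF360, SSIM is better (0.815 vs. 0.792), rendering is over 1000x faster (134 vs. 0.1 FPS), and training is over 60x faster (42 minutes vs. 48 hours). Simultaneous improvement along three dimensions leaves competing methods with almost no rebuttal.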

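Weakest point of the argument

Memory consumption and scene scalability: peak GPU memory of about 20 GB far exceeds that of NeRF methods and severely limits use on large scenes and low-end hardware. In addition, artifacts in poorly observed regions and a PSNR slightly below Mip-NeRF360 (27.21 vs. 27.69 dB) suggest that the discrete representation still has room to improve on fine detail.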