D-NeRF: Neural Radiance Fields for Dynamic Scenes

Abstract

Neural Radiance Fields (NeRF) have emerged as a powerful representation for synthesizing novel views of complex scenes from a sparse set of input views. However, NeRF is limited to static scenes. This paper introduces D-NeRF, a method that extends neural radiance fields to dynamic, time-varying scenes. The key idea is to decompose the scene into a canonical configuration and a time-conditioned deformation field. Given a monocular video as input, D-NeRF simultaneously learns a canonical radiance field and a deformation network that maps observation-space 3D points at each time step to their canonical positions, enabling novel view synthesis of dynamic objects captured with a single camera.

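Paragraph function: Overview of the whole paper. It identifies NeRF's restriction to static scenes and proposes handling dynamic scenes by decomposing them into a canonical space and a deformation field.

Logical role: The abstract follows a three-step argument (NeRF's success, its static limitation, the D-NeRF extension) that clearly establishes the research motivation and contribution. Specifying monocular video as the input hints at the method's practicality.

Argumentative technique / potential weakness: Taking monocular video as input is a double-edged sword: it lowers the barrier to data capture, but it also removes multi-view geometric constraints, so 3D reconstruction of a dynamic scene is, in theory, a highly ill-posed problem.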

1. Introduction

Novel view synthesis aims to generate images from viewpoints not present in the input data. NeRF achieves photorealistic results by representing a scene as a continuous volumetric function mapping 3D coordinates and viewing directions to color and density, optimized through differentiable volume rendering. Despite its remarkable quality, the original NeRF formulation assumes a completely static scene captured from multiple viewpoints. This assumption severely limits applicability to real-world scenarios where objects move, deform, or change over time.

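Paragraph function: Establishes the research context. It reviews how NeRF works and points out the limitation of its static-scene assumption.

Logical role: The starting point of the argument chain: it first acknowledges NeRF's achievement (photorealistic quality), then criticizes its core assumption (a static scene), which creates the need for the research.

Argumentative technique / potential weakness: The phrase "severely limits applicability" effectively amplifies the urgency of the problem. Note, however, that static scenes remain the dominant requirement in applications such as architecture and the digitization of cultural artifacts; the need for dynamic scenes is real but may be overstated.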
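
The differentiable volume rendering mentioned above can be illustrated with the standard NeRF quadrature, in which each sample along a ray contributes its color weighted by its opacity and by the transmittance accumulated in front of it. The following is a minimal sketch in plain Python, not the authors' implementation; the function name `composite_ray` and its argument layout are assumptions made for illustration.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, front to back.

    densities: list of volume densities sigma_i (non-negative floats)
    colors:    list of (r, g, b) tuples, one per sample
    deltas:    list of distances between adjacent samples
    Returns the pixel color and the transmittance left after the last sample.
    """
    transmittance = 1.0  # fraction of light that still reaches the current sample
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return tuple(pixel), transmittance
```

Because every step is a smooth function of the densities and colors, a gradient-based optimizer can adjust the network that predicts them so that composited pixels match the input images.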

Extending NeRF to dynamic scenes poses fundamental challenges: the number of observations per time step is typically very limited (often just one view), making the problem severely under-constrained. Existing approaches either require multi-view synchronized capture systems or rely on pre-computed 3D reconstructions as input. D-NeRF addresses this by learning a shared canonical representation alongside a time-dependent deformation field, effectively amortizing observations across time to overcome the single-view-per-timestep limitation.

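Paragraph function: Lays out the technical challenge, criticizes the limitations of existing solutions, and presents D-NeRF's core strategy.

Logical role: This paragraph is the key bridge from problem to solution. "Amortizing observations across time" is a concise and insightful formulation that directly answers the core difficulty of the under-constrained problem.

Argumentative technique / potential weakness: The "amortizing observations" argument carries a key implicit assumption: that appearance changes across time steps are smooth and can be modeled by a deformation field. For topological changes (such as objects appearing or disappearing) or large non-rigid deformations, this assumption may not hold.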

2. Related Work

Dynamic scene reconstruction has been a long-standing challenge. Traditional methods such as non-rigid structure from motion (NRSfM) and scene flow estimation typically require multi-view input or strong priors about scene geometry. Recent learning-based approaches include Neural Volumes, which uses an encoder-decoder architecture to predict a voxel grid, but is limited by the voxel resolution. Nerfies and Neural Scene Flow Fields address dynamic scenes but require multi-view capture or scene flow supervision. In contrast, D-NeRF operates on monocular video input without any 3D supervision.

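Paragraph function: Literature review. It systematically compares the limitations of existing dynamic scene reconstruction methods.

Logical role: By listing the input requirements and constraints of each competing method (multi-view capture, scene flow supervision, voxel resolution limits), it highlights D-NeRF's advantage in input conditions: it needs only monocular video.

Argumentative technique / potential weakness: Using the input barrier as the axis of comparison is an effective differentiation strategy. However, lower input requirements usually call for stronger regularization or prior assumptions, and these hidden costs are not disclosed in this paragraph.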

3. Method

3.1 Canonical Radiance Field (Canonical NeRF)

D-NeRF models a dynamic scene through two networks: a canonical network and a deformation network. The canonical network follows the standard NeRF formulation, mapping a 3D position x = (x, y, z) and viewing direction d = (theta, phi) to color c = (r, g, b) and volume density sigma. This canonical network represents the scene in a fixed reference configuration, analogous to a template mesh in traditional non-rigid registration. The canonical space is chosen as the configuration at time t=0.

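Paragraph function: First step of the method. It defines the canonical radiance field as a fixed reference representation of the scene.

Logical role: This paragraph lays the methodological foundation: the dynamic scene is decomposed into an invariant part (the canonical configuration) and a varying part (the deformation field), a classic and intuitive problem decomposition.

Argumentative technique / potential weakness: The "template mesh" analogy links traditional geometric methods to neural implicit representations and helps the reader follow along. However, the choice of t=0 as the canonical space lacks theoretical justification: if the scene configuration at t=0 contains occlusions or self-intersections, learning may become difficult.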
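
To make the input and output signature of the canonical network concrete, the sketch below maps a position (x, y, z) and a viewing direction (theta, phi) to a color (r, g, b) and a density sigma. It is a toy stand-in with random, untrained weights, written in plain Python. The class name `TinyCanonicalField`, the single hidden layer, the layer width, and the omission of positional encoding are all assumptions for illustration and do not reflect the architecture used in the paper.

```python
import math
import random


class TinyCanonicalField:
    """Toy stand-in for the canonical network: (x, y, z, theta, phi) -> (rgb, sigma)."""

    def __init__(self, hidden=16, seed=0):
        rng = random.Random(seed)  # fixed seed so the toy weights are reproducible
        self.w1 = [[rng.gauss(0.0, 0.5) for _ in range(5)] for _ in range(hidden)]
        self.w2 = [[rng.gauss(0.0, 0.5) for _ in range(hidden)] for _ in range(4)]

    def __call__(self, position, direction):
        inputs = list(position) + list(direction)
        # Hidden layer with ReLU activation.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, inputs))) for row in self.w1]
        raw = [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]
        # Sigmoid keeps each color channel in [0, 1]; ReLU keeps density non-negative.
        color = tuple(1.0 / (1.0 + math.exp(-r)) for r in raw[:3])
        sigma = max(0.0, raw[3])
        return color, sigma
```

A call such as `TinyCanonicalField()((0.1, 0.2, 0.3), (0.5, 1.0))` returns one color and one density. The point to notice is that time is not an input here: all time dependence is delegated to the deformation network.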

3.2 Time-Conditioned Deformation Field

The deformation network learns a mapping Delta_x = D(x, t) that transforms a 3D point x at time t into its corresponding position in canonical space: x_canonical = x + Delta_x. This network is implemented as an MLP conditioned on both spatial coordinates and time, with positional encoding applied to both inputs. The deformation field implicitly enforces temporal coherence by sharing the canonical representation across all time steps. At t=0, the deformation network should output zero displacement, which is achieved through a regularization loss penalizing non-zero deformations at the canonical time.

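Paragraph function: Core of the method. It describes how the deformation network maps observation space back to canonical space.

Logical role: This paragraph is the high point of the paper's technical contribution. The residual form of the deformation (x + Delta_x) keeps learning stable when deformations are small, and the t=0 regularization provides an explicit anchor.

Argumentative technique / potential weakness: The residual design (Delta_x) is a reasonable inductive bias that assumes most points deform by a relatively small amount. For large motions (such as rotations of more than 180 degrees), however, a residual representation may be less effective than representations such as rotation matrices. In addition, the smoothness of the deformation field is not explicitly constrained.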
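
The mapping x_canonical = x + Delta_x and the penalty on non-zero displacement at t=0 described above can be sketched as follows. This is a hedged illustration rather than the paper's code: the helper names are hypothetical, the trained MLP is replaced by a hand-written toy deformation, and the mean-squared form of the penalty is an assumption, since the text only says that non-zero deformations at the canonical time are penalized.

```python
def to_canonical(point, t, deformation):
    """Apply x_canonical = x + Delta_x, where Delta_x = deformation(x, t)."""
    delta = deformation(point, t)
    return tuple(p + d for p, d in zip(point, delta))


def canonical_time_penalty(points, deformation):
    """Mean squared displacement predicted at t = 0 (ideally zero)."""
    total = 0.0
    for point in points:
        delta = deformation(point, 0.0)
        total += sum(d * d for d in delta)
    return total / len(points)


def make_toy_deformation(bias):
    """Hypothetical deformation for an object rising along y at unit speed.

    A point observed at time t sits t units above its canonical position,
    so the ideal displacement is (0, -t, 0). `bias` adds a constant error
    to every component to mimic an imperfectly trained network.
    """
    def deformation(point, t):
        return (bias, -t + bias, bias)
    return deformation
```

With `bias=0.0` the penalty is zero and the point (1.0, 2.0, 3.0) observed at t=0.5 maps to (1.0, 1.5, 3.0). A non-zero bias shifts every canonical position and produces a positive penalty, which is the ambiguity the regularization is meant to remove.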

4. Experiments

D-NeRF is evaluated on a set of synthetic dynamic scenes including bouncing balls, jumping jacks, a hook, a T-Rex, a stand-up motion, a mutant, a hellwarrior, and a Lego bulldozer. Each scene provides a monocular image sequence with one rendered view per time step, each from a different camera pose, together with ground-truth novel views for evaluation. Quantitative results show that D-NeRF achieves PSNR values ranging from 29 to 35 dB across scenes, significantly outperforming baselines including NeRF applied per-frame (21-25 dB) and T-NeRF without canonical decomposition (26-30 dB). The method also demonstrates convincing novel view synthesis at unseen time steps through temporal interpolation.

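Paragraph function: Provides experimental evidence. Quantitative metrics on several synthetic scenes validate the method's effectiveness.

Logical role: This paragraph is the empirical pillar of the paper. The comparison with two baselines (per-frame NeRF and T-NeRF without canonical decomposition) argues systematically for the superiority of the canonical space plus deformation field decomposition.

Argumentative technique / potential weakness: The experiments use synthetic data only. This provides perfect ground truth for evaluation, but real-world dynamic scenes (lighting changes, occlusion, background clutter) may pose very different challenges. The absence of qualitative or quantitative evaluation on real video is a clear weakness.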
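
For readers who want to interpret the PSNR figures quoted above, the sketch below computes PSNR from the mean squared error between a rendered image and its ground truth, both given as flat lists of pixel values. The function name `psnr` and the assumption that pixel values lie in [0, 1] are illustrative choices, not details taken from the paper.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Since PSNR is logarithmic in the error, a gap of 10 dB corresponds to a tenfold difference in mean squared error. For reference, a uniform error of 0.1 on values in [0, 1] gives 20 dB.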

Ablation studies confirm the importance of key design choices: removing positional encoding for time reduces PSNR by 2-4 dB, demonstrating that high-frequency temporal signals are necessary for capturing fine-grained motions. Without the canonical-time regularization, performance drops by 1-2 dB as the canonical space becomes ambiguous. The two-stage coarse-to-fine training strategy inherited from NeRF proves essential, with the fine network contributing 2-3 dB improvement over the coarse network alone.

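Paragraph function: Ablation analysis. It verifies the individual contribution of each component to final performance.

Logical role: The ablations supply causal evidence for each design decision, refining the claim that the method works as a whole into the claim that every component contributes.

Argumentative technique / potential weakness: The ablation results are clear and consistent. The importance of the temporal positional encoding is especially notable: it suggests that motion in dynamic scenes has high-frequency content, but it also raises a concern about overfitting. Do the positional-encoding frequencies need to be tuned for different motion speeds?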
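
The role of the temporal positional encoding can be illustrated with the sinusoidal encoding popularized by NeRF, applied here to a scalar time value. This is a sketch under stated assumptions: the function names are hypothetical, and the number of frequency bands is a free parameter chosen for the example because the text does not specify it.

```python
import math


def encode_time(t, num_freqs):
    """Return [sin(2^k * pi * t), cos(2^k * pi * t)] for k = 0 .. num_freqs - 1."""
    features = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * t
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


def max_feature_gap(t_a, t_b, num_freqs):
    """Largest per-feature difference between the encodings of two time values."""
    enc_a = encode_time(t_a, num_freqs)
    enc_b = encode_time(t_b, num_freqs)
    return max(abs(a - b) for a, b in zip(enc_a, enc_b))
```

Comparing `max_feature_gap(0.50, 0.51, 1)` with `max_feature_gap(0.50, 0.51, 6)` shows that higher frequency bands separate nearby time steps much more strongly. This is consistent with the ablation's reading that the network needs high-frequency temporal input to represent fast motion, and it is also why the choice of frequencies can matter for different motion speeds.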

5. Conclusion

This paper presents D-NeRF, the first approach that successfully extends Neural Radiance Fields to dynamic scenes using only monocular video input. By decomposing the problem into a canonical radiance field and a time-conditioned deformation field, D-NeRF effectively shares information across time steps to overcome the single-view-per-timestep constraint. Experiments on synthetic dynamic scenes demonstrate significant improvements over baselines in both novel view synthesis quality and temporal consistency. The work opens new directions for neural scene representations that handle the full complexity of the dynamic visual world.

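Paragraph function: Summarizes the paper. It restates the core method and results and looks ahead to future directions.

Logical role: The conclusion claims a pioneering position with the phrase "the first" and points to follow-up research with "opens new directions", closing the argument loop.

Argumentative technique / potential weakness: The claim of being "the first" calls for caution: concurrent works such as Nerfies and Neural Scene Flow Fields also handle dynamic scenes, and the difference lies in the input conditions. The conclusion does not discuss the method's main limitations (validation on synthetic data only, long training time, inability to handle topological changes), which makes it read as overly optimistic.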

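Argument structure overview

Problem: NeRF is limited to static scenes; the dynamic world cannot be handled

→

Claim: canonical space + deformation field; decompose the dynamic scene

→

Evidence: PSNR of 29-35 dB; significantly outperforms the baselines

→

Rebuttal: monocular input suffices; no multi-view system is needed

→

Conclusion: the first monocular dynamic NeRF method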

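Authors' core claim (in one sentence)

By decomposing a dynamic scene into a shared canonical radiance field and a time-conditioned deformation field, novel view synthesis of dynamic scenes becomes possible from monocular video alone, which overcomes NeRF's static-scene assumption.

Strongest point of the argument

The elegance of the problem decomposition: the complexity of a dynamic scene is split into two separately learnable components, a spatial representation and a temporal deformation. This lowers the learning difficulty and also passes information between time steps through the shared canonical space. The ablation experiments show the contribution of each component, which gives the argument a layered structure.

Weakest point of the argument

The limited scope of validation: all experiments use synthetic data, with no validation on real-world dynamic scenes. The synthetic scenes have simple backgrounds, constant lighting, and limited motion, so they cannot represent real-world complexity. In addition, practical limitations such as long training time (several hours per scene) and the inability to handle topological changes are not adequately discussed.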