DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time

Abstract

We present DynamicFusion, the first dense SLAM system capable of reconstructing non-rigidly deforming scenes in real-time, using only a single commodity depth camera. We achieve this by estimating a dense volumetric 6D motion field (a warp field) that maps the canonical model frame to the live frame. We fuse the live depth maps into the canonical model by first estimating the warp field, then applying the inverse warp to integrate measurements. The warp field is parameterized as a set of deformation nodes with associated dual quaternion transformations and is solved efficiently using a GPU-accelerated Gauss-Newton solver. Our system enables dense 3D reconstruction of complex, non-rigidly moving scenes such as people, animals, and clothing, tracking and accumulating surface detail over time at real-time frame rates (30 Hz).

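Paragraph function: Overview of the whole paper. Anchored by the strong claim of being "the first", it summarizes the system's capabilities, method, and application scenarios.

Logical role: The abstract conveys the core contribution through a three-part structure: problem (non-rigid reconstruction), method (warp field plus fusion), and result (real-time operation at 30 Hz).

Argumentation technique / potential weaknesses: The claim of being "the first" stakes out very high academic value, and "commodity depth camera" stresses practicality. However, "non-rigid" covers a very broad range, and the abstract does not say whether the system can handle topological changes (such as tearing) or fast motion.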

1. Introduction

Real-time 3D reconstruction from depth cameras has become a transformative technology. KinectFusion demonstrated that dense, high-quality 3D models can be built in real-time using a moving depth camera. However, KinectFusion and subsequent systems fundamentally assume that the scene is rigid: objects cannot move or deform. This is a severe limitation, because most real-world scenes contain non-rigid objects such as people, animals, or deformable materials. Previous methods for non-rigid reconstruction either require pre-scanned templates or multi-view setups, or are limited to offline processing.

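Paragraph function: Establishes the research territory. Starting from KinectFusion's achievements, it points out the fundamental limitation of the rigidity assumption.

Logical role: A classic "affirm, negate, propose" structure: it first praises KinectFusion, then identifies the severe limitation of its rigidity assumption, building strong motivation for DynamicFusion.

Argumentation technique / potential weaknesses: The phrase "most real-world scenes" effectively frames the rigidity assumption as a fatal flaw. The three-fold critique of earlier non-rigid methods (templates, multi-view, offline) precisely defines the challenges that DynamicFusion must overcome simultaneously.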

We present DynamicFusion, a system that extends dense SLAM to non-rigid scenes without templates, using a single depth camera in real-time. Our key insight is to represent the scene motion as a volumetric warp field that maps every point in a canonical model to its corresponding location in the live frame. By estimating this warp field per frame using dense non-rigid alignment, we can fuse incoming depth data into a single, cumulative volumetric model that grows in detail over time. The warp field is efficiently represented using a sparse set of deformation nodes with dual quaternion interpolation, enabling both real-time performance and smooth, physically plausible deformations.

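Paragraph function: Solution overview. Presents DynamicFusion's core concept and technical approach.

Logical role: Responds directly to the three limitations raised in the previous paragraph: "without templates" answers the pre-scanning requirement, "a single depth camera" answers multi-view setups, and "real-time" answers offline processing.

Argumentation technique / potential weaknesses: The warp field concept elegantly recasts the non-rigid problem as a field estimation problem, which lets KinectFusion's fusion pipeline be reused. However, the smoothness assumption behind dual quaternion interpolation may break down under drastic deformation or discontinuous motion.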

2. Related Work

Dense SLAM systems such as KinectFusion perform real-time reconstruction by tracking the camera with the iterative closest point (ICP) algorithm and fusing depth maps into a truncated signed distance function (TSDF) volume. Extensions have improved scalability and accuracy, but all assume scene rigidity. Non-rigid surface reconstruction methods, including template-based approaches and physics-based simulation, typically require known geometry, multi-camera setups, or extensive offline computation. Embedded deformation graphs have been used for non-rigid registration in offline settings, and we adapt this representation for real-time dense tracking.

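Paragraph function: Literature review. Traces two lines of research, rigid SLAM and non-rigid reconstruction, and identifies the gap where they meet.

Logical role: Positions DynamicFusion in the academic lineage: it sits at the intersection of "rigid real-time SLAM" and "offline non-rigid reconstruction" and overcomes the limitations of both.

Argumentation technique / potential weaknesses: Adapting embedded deformation graphs from offline to real-time use is a clever technical transfer. However, the stability of deformation-graph methods under large deformations is not discussed here, and it could lead to tracking failure in the experiments.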

3. Method

3.1 Warp Field Representation

The warp field W maps every 3D point from the canonical model space to the live frame space. It is parameterized by a set of N deformation nodes, each with a position, a dual quaternion transformation, and a radial basis weight. The warp at any point is computed by dual quaternion blending (DQB) of the K nearest deformation nodes, with each node weighted by a radial basis function of its distance to the point. Dual quaternion blending gives smooth interpolation: unlike linear blend skinning, it avoids the "candy wrapper" artifacts associated with linear interpolation of rigid transformations. New deformation nodes are dynamically added as new surface geometry is discovered.

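Paragraph function: Core representation. Defines the mathematical form and physical meaning of the warp field.

Logical role: The warp field is the theoretical cornerstone of the whole system. Choosing dual quaternions over linear blending of matrices brings not only a mathematical advantage but also geometric intuition (the blended transformations remain rigid and hence physically plausible).

Argumentation technique / potential weaknesses: Explicitly mentioning the "candy wrapper" artifact shows a solid understanding of the shortcomings of the alternatives. Dynamically adding nodes makes the system adaptive, but growth in the node count may undermine sustained real-time performance.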
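
The blending described in Section 3.1 can be illustrated with a small pure-Python sketch. It is an illustration under stated assumptions, not the paper's GPU implementation: the node layout `(position, radius, dual_quaternion)`, the Gaussian form of the radial basis weight, and all function names are hypothetical choices made for this example.

```python
import math

def q_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def q_conj(q):
    """Quaternion conjugate."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def dq_from_rt(rot, trans):
    """Unit dual quaternion (real, dual) for the rigid map x -> R x + t.

    rot is a unit rotation quaternion, trans a 3-vector; dual = 0.5 * t * rot.
    """
    t = (0.0, trans[0], trans[1], trans[2])
    dual = tuple(0.5 * c for c in q_mul(t, rot))
    return (tuple(rot), dual)

def rbf_weight(point, node_pos, radius):
    """Gaussian radial basis weight (assumed form) of a node at a point."""
    d2 = sum((p - n) ** 2 for p, n in zip(point, node_pos))
    return math.exp(-d2 / (2.0 * radius ** 2))

def warp_point(point, nodes, k=4):
    """Warp a canonical point by blending its k nearest deformation nodes.

    nodes is a list of (position, radius, dual_quaternion) tuples.
    Weights are assumed not to underflow to zero for the chosen nodes.
    """
    nearest = sorted(nodes, key=lambda n: math.dist(point, n[0]))[:k]
    pivot = nearest[0][2][0]
    real = [0.0] * 4
    dual = [0.0] * 4
    for pos, radius, (qr, qd) in nearest:
        w = rbf_weight(point, pos, radius)
        # q and -q encode the same rotation; keep all terms in one hemisphere.
        if sum(a * b for a, b in zip(qr, pivot)) < 0.0:
            w = -w
        for i in range(4):
            real[i] += w * qr[i]
            dual[i] += w * qd[i]
    # Normalise by the real part so the blend is a rigid transform again.
    norm = math.sqrt(sum(c * c for c in real))
    real = tuple(c / norm for c in real)
    dual = tuple(c / norm for c in dual)
    # Apply the blended transform: rotate, then add translation 2 * dual * real^*.
    p = (0.0, point[0], point[1], point[2])
    rotated = q_mul(q_mul(real, p), q_conj(real))[1:]
    shift = q_mul(dual, q_conj(real))[1:]
    return tuple(r + 2.0 * s for r, s in zip(rotated, shift))
```

As a usage example, a point midway between a node carrying the identity and a node carrying a pure translation of (2, 0, 0) receives equal weights from both, so `warp_point` moves it by half that translation. Because the blend is renormalised, the result is always a rigid transform, which is the property that avoids the collapse seen with linearly blended matrices.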

3.2 Non-rigid Fusion

The fusion pipeline operates in three stages per frame. First, dense non-rigid ICP alignment estimates the warp field parameters by minimizing the point-to-plane error between the warped canonical model and the live depth map. This is solved using a GPU-parallelized Gauss-Newton optimization with a data term, regularization term (as-rigid-as-possible), and Tikhonov damping. Second, the estimated warp field is applied inversely to warp the live depth map into the canonical frame. Third, the warped data is integrated into the TSDF volume using standard volumetric fusion. This three-stage pipeline enables progressive accumulation of surface detail from a non-rigidly moving scene, just as KinectFusion does for rigid scenes.
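
The first stage can be written down compactly. The block below gives one plausible form of the objective and of the damped Gauss-Newton step, using plain squared residuals for simplicity. All symbols ($v_u$, $p_u$, $\hat{n}_u$, $g_j$, $T_i$, $\mathcal{N}(i)$, $\lambda$, $\mu$) are notation assumed for this sketch and are not taken from the paper.

```latex
% Assumed notation:
%   v_u        canonical model point matched to depth pixel u
%   p_u        live depth point at pixel u
%   \hat{n}_u  unit normal used for the point-to-plane projection
%   g_j, T_i   position of node j and rigid transformation of node i
%   N(i)       neighbours of node i in the deformation graph
E(\mathcal{W}) =
  \underbrace{\sum_{u} \Big( \hat{n}_u^{\top} \big( \mathcal{W}(v_u) - p_u \big) \Big)^{2}}_{\text{data term (point-to-plane)}}
  \;+\; \lambda
  \underbrace{\sum_{i=1}^{N} \sum_{j \in \mathcal{N}(i)} \big\| T_i\, g_j - T_j\, g_j \big\|^{2}}_{\text{as-rigid-as-possible term}}

% One Gauss-Newton step with Tikhonov damping \mu > 0:
\big( J^{\top} J + \mu I \big)\, \delta = - J^{\top} r
```

Here $r$ stacks all data and regularization residuals, $J$ is their Jacobian with respect to the node transformation parameters, and $\delta$ is the parameter update. The damping term $\mu I$ keeps the linear system well conditioned when some nodes are only weakly constrained by the visible data.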

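Paragraph function: Pipeline details. Walks through each step of the three-stage fusion process.

Logical role: Elegantly extends KinectFusion's rigid fusion pipeline to the non-rigid case: alignment -> inverse warp -> fusion. Each of the three stages has a clear responsibility, which keeps the system design easy to follow.

Argumentation technique / potential weaknesses: The as-rigid-as-possible regularizer is a clever use of a physical prior. However, the whole pipeline depends heavily on the alignment quality of the first stage: if non-rigid ICP fails under fast motion or heavy occlusion, the later stages accumulate errors.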
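
The third stage, standard volumetric fusion, amounts to a weighted running average per voxel. The sketch below shows that update for a single column of voxels along one camera ray, assuming the measurement has already been brought into the canonical frame. The function names, the truncation distance, and the weight cap are hypothetical values chosen for illustration, not the paper's settings.

```python
def fuse_voxel(tsdf, weight, sdf_sample, trunc=0.05,
               sample_weight=1.0, max_weight=100.0):
    """Fold one signed-distance sample into a voxel's running average.

    sdf_sample is (measured surface depth - voxel depth) along the ray:
    positive in front of the surface, negative behind it.
    """
    # Voxels far behind the observed surface are occluded: leave them alone.
    if sdf_sample < -trunc:
        return tsdf, weight
    # Truncate and normalise to the range [-1, 1].
    d = min(sdf_sample, trunc) / trunc
    total = weight + sample_weight
    fused = (tsdf * weight + d * sample_weight) / total
    # Capping the weight lets the model keep adapting to new observations.
    return fused, min(total, max_weight)

def integrate_ray(voxel_depths, tsdf, weights, surface_depth, trunc=0.05):
    """Update every voxel along one ray, in place, from one depth measurement."""
    for i, z in enumerate(voxel_depths):
        tsdf[i], weights[i] = fuse_voxel(tsdf[i], weights[i],
                                         surface_depth - z, trunc)

# Usage: three voxels along a ray, one depth measurement at 1.0 m.
depths = [0.9, 1.0, 1.2]
tsdf = [1.0, 1.0, 1.0]       # conventionally initialised to "empty space"
weights = [0.0, 0.0, 0.0]
integrate_ray(depths, tsdf, weights, surface_depth=1.0)
```

Averaging over many frames is what lets the cumulative model gain detail and suppress sensor noise: each observation nudges the stored distance, and the zero crossing of the averaged field defines the reconstructed surface.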

4. Experiments

We evaluate DynamicFusion on a variety of non-rigid scenes captured with a single Kinect sensor, including people performing actions, hands manipulating objects, and a dog. The system runs at approximately 30 Hz on a desktop GPU. Qualitative results show that the system successfully tracks and reconstructs complex deformations such as facial expressions, hand articulation, and body motion. The cumulative model progressively gains detail that is not visible in any single frame. We compare against rigid KinectFusion, which fails catastrophically when any part of the scene moves. Failure modes include very fast motions that exceed the convergence basin of the ICP solver, and topological changes such as surfaces separating.

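Paragraph function: Empirical demonstration. Validates the system on diverse non-rigid scenes and candidly acknowledges its limitations.

Logical role: The experiments cover scenes ranging from people to animals and from faces to full bodies, demonstrating the system's generality. The direct comparison with KinectFusion strongly highlights the value of handling non-rigid motion.

Argumentation technique / potential weaknesses: Honestly listing the failure modes (fast motion, topological changes) strengthens credibility. The lack of quantitative benchmark evaluation (such as geometric error against ground truth) is regrettable, although quantitative benchmarks for non-rigid reconstruction were indeed scarce in 2015.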

5. Conclusion

We have presented DynamicFusion, the first system for real-time dense reconstruction of non-rigid scenes from a single depth camera. Our approach extends the volumetric fusion paradigm to handle arbitrary non-rigid motion through a dense, per-frame warp field. The dual quaternion-based deformation graph enables smooth, artifact-free warping while maintaining real-time performance. We believe DynamicFusion opens new possibilities for interactive applications including telepresence, augmented reality, and performance capture with commodity sensors.

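Paragraph function: Summary of the paper. Restates the claim to be first and looks ahead to applications.

Logical role: The conclusion echoes the abstract's "first" claim, closing the argumentative loop. The application outlook (telepresence, AR, performance capture) lifts the technical contribution to a broader level of impact.

Argumentation technique / potential weaknesses: The application outlook is compelling, but a large gap remains between a research prototype and a real product. The conclusion does not discuss possible ways to address the known limitations (fast motion, topological changes), which feels somewhat thin for a Best Paper.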

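Overview of the argument structure

Problem: real-time SLAM is restricted to scenes that satisfy the rigidity assumption
→ Claim: a warp field maps the non-rigid scene to a canonical model
→ Evidence: dense non-rigid reconstruction at a real-time 30 Hz across multiple scenes
→ Rebuttal: dual quaternion blending ensures smooth deformation
→ Conclusion: opens a new paradigm for interactive non-rigid applications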

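The authors' core claim (in one sentence)

By estimating a volumetric warp field per frame and using a dual quaternion deformation graph, DynamicFusion achieves, for the first time, real-time dense 3D reconstruction of non-rigid scenes from a single commodity depth camera.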

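Strongest point of the argument

An elegant extension from rigid to non-rigid: the warp field concept turns the non-rigid problem into a field estimation problem, so KinectFusion's mature fusion pipeline can be reused directly. Dual quaternion blending mathematically guarantees smooth deformation, while the GPU-accelerated Gauss-Newton solver secures real-time performance. The overall design strikes an excellent balance between theoretical elegance and engineering practicality.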

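Weakest point of the argument

Lack of quantitative evaluation and the fundamental nature of the failure modes: the paper relies mainly on qualitative visual results rather than quantitative benchmark comparisons. The known failure modes (fast motion, topological changes) are not edge cases but common occurrences in real scenes. The system's reliance on ICP convergence means that the risk of tracking failure in scenes with fast motion or heavy occlusion cannot be ignored.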