3D Photography using Context-aware Layered Depth Inpainting

Abstract

This paper proposes a method for converting a single RGB-D image into a 3D photo: a multi-layered representation that supports novel view synthesis with motion parallax. The key contribution is "a learning-based inpainting model that synthesizes new local color-and-depth content into the occluded region in a spatial context-aware manner." The method uses a Layered Depth Image (LDI) representation that stores multiple color and depth values per pixel, enabling efficient rendering with realistic motion parallax effects.

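Paragraph function: Overview of the paper. Outlines the full pipeline and core techniques for generating a 3D photo from a single RGB-D image.

Logical role: The abstract sets up a clear input-method-output framework: RGB-D input, context-aware inpainting, 3D photo output. The LDI representation serves as the bridge between inpainting and rendering.

Argumentative technique / potential weakness: The "3D photo" concept has strong application appeal, but the abstract does not state where the depth is assumed to come from (a particular depth sensor or a depth estimation algorithm), which affects the method's practical scope.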
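
To make "multiple color and depth values per pixel" concrete, here is a minimal sketch of an LDI-like container. The class and method names (`LDISample`, `LayeredDepthImage`, `add`, `layers_at`) are hypothetical and chosen for illustration only; the paper's actual data structure is not described in these notes and may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Color = Tuple[int, int, int]


@dataclass
class LDISample:
    """One color-and-depth sample stored at a pixel (hypothetical name)."""
    color: Color
    depth: float


@dataclass
class LayeredDepthImage:
    """Grid in which every pixel holds a list of samples, nearest first."""
    width: int
    height: int
    pixels: List[List[List[LDISample]]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start with an empty sample list at every pixel.
        if not self.pixels:
            self.pixels = [[[] for _ in range(self.width)]
                           for _ in range(self.height)]

    def add(self, x: int, y: int, color: Color, depth: float) -> None:
        samples = self.pixels[y][x]
        samples.append(LDISample(color, depth))
        samples.sort(key=lambda s: s.depth)  # keep front-to-back order

    def layers_at(self, x: int, y: int) -> int:
        return len(self.pixels[y][x])


ldi = LayeredDepthImage(width=2, height=1)
ldi.add(0, 0, (200, 50, 50), depth=1.0)  # visible foreground sample
ldi.add(0, 0, (40, 40, 120), depth=4.0)  # synthesized background behind it
ldi.add(1, 0, (40, 40, 120), depth=4.0)  # pixel with a single layer
```

A plain RGB-D image corresponds to the special case where every pixel has exactly one sample; the extra samples are what inpainting adds behind foreground objects.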

1. Introduction

Generating 3D experiences from 2D images is a long-standing goal in computer vision. Recent advances in monocular depth estimation have made it possible to obtain depth maps from ordinary photographs, but simply warping an image using its depth map reveals disoccluded regions: areas that were hidden behind foreground objects in the original view. These disocclusion artifacts appear as "holes and stretched textures" that break the 3D illusion. The core challenge is to plausibly fill in these missing regions with both color and depth information.

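Paragraph function: Establishes the research setting. Defines the core challenge in 2D-to-3D conversion: inpainting disoccluded regions.

Logical role: The starting point of the argument chain. It first acknowledges progress in depth estimation (the precondition is met), then points out that depth alone is not enough (disocclusion remains), which establishes the need for an inpainting method.

Argumentative technique / potential weakness: The visual description "holes and stretched textures" conveys the problem vividly, so readers need no specialist background to follow it. However, the passage implies that depth estimation is a solved problem, whereas depth quality directly determines how good the 3D photo looks; this dependency is downplayed.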
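
The disocclusion problem can be reproduced on a single scanline. The sketch below is a toy forward warp, not the paper's renderer: each pixel shifts horizontally by an amount inversely proportional to its depth, and a z-buffer keeps the nearest sample. The function name `warp_scanline` and the parameter `shift_scale` are assumptions made for this illustration.

```python
from typing import List, Optional


def warp_scanline(colors: List[str], depths: List[float],
                  shift_scale: float) -> List[Optional[str]]:
    """Shift each pixel by round(shift_scale / depth); None marks a hole."""
    width = len(colors)
    out: List[Optional[str]] = [None] * width
    zbuf = [float("inf")] * width
    for x, (color, z) in enumerate(zip(colors, depths)):
        tx = x + round(shift_scale / z)  # near pixels move farther
        if 0 <= tx < width and z < zbuf[tx]:
            out[tx] = color
            zbuf[tx] = z
    return out


# "F" is a near foreground object, "B" is the far background.
colors = ["B", "B", "F", "F", "B", "B", "B", "B"]
depths = [16.0, 16.0, 2.0, 2.0, 16.0, 16.0, 16.0, 16.0]
warped = warp_scanline(colors, depths, shift_scale=4.0)
print(warped)  # ['B', 'B', None, None, 'F', 'F', 'B', 'B']
```

The two `None` entries are the disoccluded region: the foreground moved away, and the input image holds no color or depth for what was behind it.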

Existing approaches for novel view synthesis include multi-plane images (MPI), which require "a fixed number of depth planes" and struggle with continuous depth variation, and mesh-based methods, which produce "stretching artifacts at depth discontinuities." The authors instead propose a layered depth image approach with learned context-aware inpainting, which separates the scene into depth layers and synthesizes plausible content behind foreground objects.

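Paragraph function: Critiques existing methods. Points out the inherent limitations of MPI and mesh-based methods to locate the research gap.

Logical role: By contrasting MPI (limited by discretization) with mesh-based methods (stretching artifacts), the paragraph establishes how the paper's LDI approach differs and where its advantage lies.

Argumentative technique / potential weakness: The shortcomings of the alternatives are framed as structural (a fixed number of planes, depth discontinuities), which implies that an entirely new representation is needed. Yet the LDI is not a new concept; it was proposed in 1998, and the authors' real contribution is the learned inpainting rather than the representation itself.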
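
The "fixed number of depth planes" limitation can be illustrated with a toy discretization. This is only a sketch of the quantization effect: it snaps each depth to the nearest of a few evenly spaced planes, whereas real MPI renderers typically blend across planes with per-plane alpha. The helper names `make_planes` and `snap_to_plane` are hypothetical.

```python
from typing import List


def make_planes(near: float, far: float, count: int) -> List[float]:
    """Evenly spaced plane depths between near and far (count >= 2)."""
    step = (far - near) / (count - 1)
    return [near + i * step for i in range(count)]


def snap_to_plane(depth: float, planes: List[float]) -> float:
    """Return the plane depth closest to the given depth."""
    return min(planes, key=lambda p: abs(p - depth))


planes = make_planes(1.0, 5.0, 3)          # three planes: 1.0, 3.0, 5.0
ramp = [1.0, 1.5, 2.5, 3.5, 4.5, 5.0]      # a smoothly receding surface
snapped = [snap_to_plane(d, planes) for d in ramp]
errors = [abs(d - s) for d, s in zip(ramp, snapped)]
print(snapped)  # [1.0, 1.0, 3.0, 3.0, 5.0, 5.0]
```

The smooth ramp collapses into three steps. Adding planes shrinks the error but raises memory and rendering cost, which is the trade-off a per-pixel depth representation such as an LDI avoids.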

2. Related Work

Prior work spans three areas: single-image view synthesis methods that learn to predict novel views directly, image inpainting techniques that fill missing regions in 2D, and depth inpainting/completion methods that reconstruct missing depth. Most image inpainting methods operate in 2D and do not jointly reason about color and depth. Depth completion methods focus on sparse-to-dense depth but do not address the color synthesis needed for rendering. This work uniquely combines joint color-and-depth inpainting with spatial context awareness in a layered 3D representation.

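Paragraph function: Literature review. Surveys three related areas and identifies the shortcoming of each.

Logical role: By showing that each area falls short in some respect, the review establishes the paper's distinctive position: joint color-and-depth inpainting. The intersection of the three areas is the paper's research space.

Argumentative technique / potential weakness: Positioning the work at the intersection of three areas is an effective differentiation strategy. Whether joint reasoning actually yields a synergy (as opposed to reasoning separately and then combining the results) needs to be supported by ablation experiments.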

3. Method

3.1 Layered Depth Image Representation

The method constructs a Layered Depth Image (LDI) from the input RGB-D image. The process begins by detecting depth discontinuities to identify occlusion boundaries between foreground and background objects. The scene is then decomposed into multiple depth layers, where each layer contains pixels at similar depths. Foreground objects are separated from their backgrounds, and the occluded regions (areas behind foreground objects that are invisible in the input view) are marked for inpainting. This layered structure lets each layer be rendered independently, with proper depth ordering, during novel view synthesis.

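Paragraph function: Method details. Describes how the LDI representation is built from the RGB-D input.

Logical role: This paragraph lays the representational foundation of the method: the LDI structure must exist before inpainting can run on it. Depth discontinuity detection is the key step linking the input to the layered representation.

Argumentative technique / potential weakness: The layering strategy is intuitively sound, but how the discontinuity detection threshold is set and how the number of layers is chosen are both critical implementation choices. For complex scenes (multiple occlusions, semi-transparent objects), the robustness of the layering strategy is open to question.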
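
The first step, detecting depth discontinuities, might look like the sketch below. It flags pairs of 4-connected neighbors whose disparity (1/depth) differs by more than a threshold. Both the use of disparity and the threshold value are assumptions for illustration; as noted above, these notes do not say how the paper sets them.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)


def depth_edges(depth: List[List[float]],
                threshold: float) -> Set[Tuple[Pixel, Pixel]]:
    """Return neighbor pairs whose disparity gap exceeds the threshold."""
    rows, cols = len(depth), len(depth[0])
    edges: Set[Tuple[Pixel, Pixel]] = set()
    for r in range(rows):
        for c in range(cols):
            # Check the right and bottom neighbors so each pair is seen once.
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols:
                    gap = abs(1.0 / depth[r][c] - 1.0 / depth[nr][nc])
                    if gap > threshold:
                        edges.add(((r, c), (nr, nc)))
    return edges


# Left half is a near object (depth 2), right half is far (depth 8).
depth_map = [
    [2.0, 2.0, 8.0, 8.0],
    [2.0, 2.0, 8.0, 8.0],
]
edges = depth_edges(depth_map, threshold=0.2)  # threshold is hypothetical
```

On this toy map, only the two pairs that straddle the near/far boundary are flagged. In each flagged pair, the pixel with the smaller depth lies on the foreground side and the region behind it is a candidate for inpainting.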

3.2 Context-aware Inpainting

The core of the method is a context-aware inpainting network that simultaneously synthesizes color and depth in the occluded regions. Unlike standard image inpainting, which operates on the full image, this network processes local regions around each depth edge and stays spatially context-aware by conditioning on the surrounding visible content. The inpainting proceeds in an edge-guided manner: it starts from the outermost boundary of the occluded region and progressively fills inward, so that each newly synthesized strip is informed by both the original visible content and previously synthesized content. This iterative edge-based synthesis naturally handles occluded regions of varying size without requiring fixed-size inputs.

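Paragraph function: Core algorithm. Describes how context-aware inpainting works and its iterative strategy.

Logical role: This paragraph is the heart of the methodology: it answers the question of how occluded regions are inpainted. The edge-guided, progressive filling strategy keeps the synthesized content consistent with its surroundings, which is key to the method's realistic results.

Argumentative technique / potential weakness: The iterative edge-based design elegantly handles occluded regions of varying size. Progressive filling may accumulate error, however: the closer to the center of an occluded region, the worse the synthesis quality may be. In addition, how joint color-and-depth synthesis ensures geometric consistency (rather than mere visual plausibility) calls for deeper analysis.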
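
The outside-in filling order can be sketched independently of the network. In the toy below, a neighbor average stands in for the learned inpainting model (an explicit simplification, not the paper's method), and a single number per cell stands in for the color-and-depth content. The function name `fill_inward` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Grid = List[List[Optional[float]]]


def fill_inward(grid: Grid) -> List[List[int]]:
    """Fill None cells ring by ring, in place.

    Returns the pass in which each cell was filled (0 = originally known).
    """
    rows, cols = len(grid), len(grid[0])
    pass_index = [[0] * cols for _ in range(rows)]
    current_pass = 0
    while any(v is None for row in grid for v in row):
        current_pass += 1
        updates: Dict[Tuple[int, int], float] = {}
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] is not None:
                    continue
                known = [grid[nr][nc]
                         for nr, nc in ((r - 1, c), (r + 1, c),
                                        (r, c - 1), (r, c + 1))
                         if 0 <= nr < rows and 0 <= nc < cols
                         and grid[nr][nc] is not None]
                if known:
                    # Stand-in for the network: average the known neighbors.
                    updates[(r, c)] = sum(known) / len(known)
        if not updates:
            break  # nothing known to grow from
        # Apply after the scan so one pass fills exactly one ring.
        for (r, c), value in updates.items():
            grid[r][c] = value
            pass_index[r][c] = current_pass
    return pass_index


grid: Grid = [[1.0] * 5 for _ in range(5)]
for r in range(1, 4):
    for c in range(1, 4):
        grid[r][c] = None  # a 3x3 occluded region
passes = fill_inward(grid)
```

Here the eight outer cells of the hole are filled in pass 1 and the center in pass 2. The center depends only on synthesized values, which is where the accumulated-error concern raised above would show up first.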

4. Experiments

The method is evaluated on diverse real-world images, including indoor scenes, outdoor landscapes, and portraits. Comparisons against baselines, including Niklaus et al.'s 3D Ken Burns method and mesh-based warping, show that the proposed approach produces fewer artifacts at depth discontinuities and more plausible disoccluded content. User studies indicate that participants preferred the proposed method's results in the majority of comparisons. The method processes an image in approximately 3 to 5 minutes on a single GPU and generates a complete LDI representation that enables real-time rendering of novel views.

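Paragraph function: Provides experimental evidence. Validates the method on several scene types and backs it with a user study.

Logical role: The empirical support covers three dimensions: (1) qualitative comparison (fewer artifacts); (2) subjective evaluation (user preference); (3) efficiency (processing time and real-time rendering).

Argumentative technique / potential weakness: A user study is an appropriate way to assess visual quality, but the experiments as summarized here report no objective quantitative metrics (such as PSNR or SSIM). The 3 to 5 minute processing time is acceptable but still far from real-time methods. Failure cases and the limits of the method's applicability are not discussed.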
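
For reference, PSNR, one of the objective metrics named above, compares a rendered view against a ground-truth view pixel by pixel. The sketch below uses the standard definition, 10 * log10(peak^2 / MSE), on flat lists of intensities; the function name and the flat-list input format are choices made for this illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to reference."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# A uniform error of 0.1 on a [0, 1] intensity scale gives about 20 dB.
score = psnr([0.0, 0.0], [0.1, 0.1], peak=1.0)
```

Such a metric requires a ground-truth image of the novel view, so it applies to multi-view datasets rather than to arbitrary single photographs. That may be one reason user studies are attractive for this task.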

5. Conclusion

This work presents a complete pipeline for generating 3D photographs from single RGB-D images. The context-aware layered depth inpainting approach effectively synthesizes plausible color and depth content in occluded regions, producing Layered Depth Images that enable real-time rendering with convincing motion parallax. The method demonstrates robustness across diverse scene types and opens new possibilities for immersive photo viewing experiences.

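Paragraph function: Summarizes the paper. Restates the complete pipeline, the core technique, and the application outlook.

Logical role: The conclusion mirrors the structure of the abstract and closes the paper in the order input, method, output, application, completing the argument loop.

Argumentative technique / potential weakness: The claim of robustness rests mainly on qualitative results in the experiments section, without a systematic robustness analysis. Known failure modes (such as large occluded areas or complex geometry) and sensitivity to depth estimation quality are not discussed.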

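Argument structure overview

Problem: Converting a single image to 3D; occluded regions are missing

→

Claim: Context-aware inpainting; joint color-and-depth synthesis

→

Evidence: Validation across multiple scene types; favorable user study results

→

Rebuttal: Better than MPI and mesh-based methods; real-time rendering efficiency

→

Conclusion: A complete 3D photo generation pipeline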

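The authors' core claim (in one sentence)

With context-aware layered depth inpainting, a single RGB-D image is enough to generate a 3D photo with realistic motion parallax, without multi-view input or complex scene reconstruction.

Strongest point of the argument

End-to-end practicality: from a single RGB-D image to a 3D photo that renders in real time, the whole pipeline is complete and automated. The edge-guided iterative inpainting strategy elegantly handles occluded regions of varying size, and the user study further validates the visual quality of the results.

Weakest point of the argument

Implicit dependence on depth quality: the entire method presupposes RGB-D input, yet the quality of the depth map (whether from a sensor or from estimation) directly affects the correctness of the LDI layering and the inpainting results. Accumulated error from iterative inpainting may degrade quality in large occluded regions, and there are no quantitative metrics for measuring synthesis quality systematically.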