Holistic Scene Understanding for 3D Object Detection with RGBD Cameras

Abstract

This paper addresses the problem of 3D object detection in indoor environments using RGBD cameras. Unlike prior approaches that treat object detection in isolation, we propose a holistic approach that jointly reasons about the 3D scene layout, 3D object hypotheses, and their mutual interactions. We extend the Constrained Parametric Min-Cuts (CPMC) framework to 3D in order to generate candidate 3D object cuboids from depth data. We then formulate the problem as inference in a conditional random field (CRF) that couples scene classification with 3D object recognition, leveraging both 2D appearance features and 3D geometric cues. Experiments on the NYU Depth V2 dataset demonstrate that our holistic model achieves substantial improvements over state-of-the-art methods that reason about objects independently.

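Paragraph function: Overview of the whole paper, moving from "the limits of isolated detection" to "holistic joint reasoning" and previewing the three main technical contributions.

Logical role: The abstract sets up the core contrast of "isolated vs. holistic", summarizes the CPMC-3D + CRF pipeline in one sentence, and anchors its persuasiveness in the empirical results on NYU V2.

Argumentative technique / potential weakness: The term "holistic" carries a strong positive connotation, implying that prior methods are "incomplete". Whether the computational cost and scalability of holistic reasoning are kept under reasonable control needs to be checked in the method section.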

1. Introduction

The recent availability of consumer-grade RGBD sensors such as the Microsoft Kinect has opened new possibilities for indoor scene understanding. Depth information provides strong cues about the 3D geometry of a scene, complementing the appearance information in RGB images. However, most existing methods for 3D object detection from RGBD data follow a pipeline approach: they first detect objects independently and then optionally apply contextual reasoning as a post-processing step. This sequential design fails to capture the rich mutual dependencies between scene layout, object categories, and spatial relationships.

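Paragraph function: Establishes the research setting, using the spread of the Kinect as background and pointing out a structural flaw in pipeline approaches.

Logical role: The starting point of the argument chain. It first affirms the data advantage of RGBD, then explains why existing methods fail to exploit that advantage fully: they lack holistic reasoning.

Argumentative technique / potential weakness: The "pipeline" vs. "holistic" dichotomy makes the argument rhetorically clear. However, some pipeline methods can also capture cross-task dependencies through iterative refinement, so the criticism here is somewhat oversimplified.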

We argue that scene understanding and object detection should be addressed jointly. Knowing the scene type (e.g., kitchen vs. bedroom) provides strong priors on which objects are likely to be present, and conversely, detected objects inform the scene type. Moreover, spatial relationships between objects (e.g., a monitor typically sits on a desk) provide additional constraints. Our model captures all these dependencies through a unified CRF framework that performs joint inference over scene class, object categories, and 3D layouts.
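
The two-way dependency described above can be illustrated with a small Bayes-rule calculation. This is a minimal sketch, not the paper's model: the scene prior, the object likelihoods, and the function name `scene_posterior` are all hypothetical values and names chosen for illustration.

```python
# All probabilities below are made-up illustrative values, not results from the paper.
scene_prior = {"kitchen": 0.5, "bedroom": 0.5}

# P(object is present | scene type), hypothetical.
p_obj_given_scene = {
    "kitchen": {"stove": 0.80, "bed": 0.01},
    "bedroom": {"stove": 0.02, "bed": 0.90},
}

def scene_posterior(observed_object):
    """Posterior over scene types after observing one detected object (Bayes' rule)."""
    unnormalized = {
        scene: scene_prior[scene] * p_obj_given_scene[scene][observed_object]
        for scene in scene_prior
    }
    total = sum(unnormalized.values())
    return {scene: value / total for scene, value in unnormalized.items()}

# Scene -> object: a kitchen makes a stove far more likely than a bed.
# Object -> scene: detecting a stove shifts the belief strongly toward "kitchen".
posterior = scene_posterior("stove")
```

With these assumed numbers, a single stove detection moves an even prior to a posterior above 0.97 for "kitchen", which is the kind of mutual evidence a joint model can exploit.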

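Paragraph function: States the central claim and uses concrete examples to show why joint reasoning is needed.

Logical role: This paragraph uses intuitive examples such as "kitchen vs. bedroom" and "a monitor on a desk" to make the abstract idea of "holistic reasoning" concrete and more persuasive.

Argumentative technique / potential weakness: Everyday examples lower the barrier to understanding. However, the strength of these contextual relationships varies widely across scenes; in atypical scenes (such as mixed-use spaces), contextual priors may mislead detection.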

2. Related Work

RGBD-based scene understanding has attracted increasing attention with the availability of depth sensors. Silberman et al. introduced the NYU Depth dataset and proposed methods for semantic segmentation from RGBD data. Gupta et al. focused on contour detection and hierarchical segmentation in RGBD images. For 3D object detection, Song and Xiao proposed sliding shapes that search for objects in 3D space. However, these approaches treat detection as an isolated task without leveraging holistic scene context. Our work is also related to contextual models in 2D detection, such as Desai et al.'s work on modeling object co-occurrence and spatial layouts, which we extend to the 3D domain.

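Paragraph function: Literature review, tracing three main lines of work in RGBD scene understanding.

Logical role: Builds the academic lineage (segmentation methods, detection methods, and contextual models), identifies the gap that the three have not yet been integrated, and positions the paper's unified framework in that gap.

Argumentative technique / potential weakness: Presenting existing methods as three parallel lines implies that integration is the natural next step. However, extending 2D contextual models to 3D is not trivial, because object relationships in 3D space are far more complex than in 2D.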

3. Method

3.1 3D CPMC

We extend the Constrained Parametric Min-Cuts (CPMC) framework from 2D to 3D. Given an RGBD image, we first compute a 3D point cloud from the depth map and extract planar surfaces using RANSAC-based plane fitting. We then generate 3D cuboid proposals by placing axis-aligned bounding boxes at multiple locations and scales in the 3D space. Each cuboid is scored based on how well it encloses a coherent group of 3D points, using features such as point density, surface normal consistency, and color homogeneity. This generates a manageable set of high-quality 3D object candidates that cover the scene.
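
The first two steps above (depth map to point cloud, then RANSAC plane fitting) might look roughly like the following NumPy sketch. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and the iteration count and inlier tolerance are illustrative defaults; the source does not give the authors' settings.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Convert a depth map (meters) into an (N, 3) point cloud with a pinhole model."""
    v, u = np.indices(depth.shape)          # v = row index, u = column index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]         # drop invalid (zero-depth) pixels

def ransac_plane(points, n_iters=200, tol=0.02, seed=0):
    """Fit one plane n.p + d = 0 by RANSAC; returns ((normal, d), inlier_mask)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue                        # degenerate (collinear) sample
        normal = normal / norm
        d = -normal @ sample[0]
        inliers = np.abs(points @ normal + d) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane, best_inliers
```

To extract several planar surfaces, one common pattern is to run `ransac_plane`, remove the inliers, and repeat on the remaining points; whether the paper does exactly this is not stated.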

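Paragraph function: First step of the method, defining how 3D object proposals are generated.

Logical role: This is the input end of the overall framework. The 3D extension of CPMC supplies the candidate object set for the subsequent CRF inference and forms the first stage of a coarse-to-fine pipeline.

Argumentative technique / potential weakness: The axis-aligned bounding box assumption greatly reduces the search space, but it may fit poorly for objects placed at an angle (such as a book leaning against a wall). RANSAC plane fitting may also degrade in scenes with many curved objects.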
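
The cuboid proposal and scoring step of Section 3.1 can be sketched as a sliding axis-aligned box scored by point density. This is a deliberately simplified stand-in: it uses only point density, while the paper also mentions surface normal consistency and color homogeneity, and the function names, `stride`, and `top_k` are hypothetical.

```python
import numpy as np

def points_in_box(points, box_min, box_max):
    """Boolean mask of points that lie inside an axis-aligned box."""
    return np.all((points >= box_min) & (points <= box_max), axis=1)

def density_score(points, box_min, box_max):
    """Points per unit volume inside the box (a stand-in for the full score)."""
    volume = np.prod(box_max - box_min)
    return points_in_box(points, box_min, box_max).sum() / volume

def propose_cuboids(points, sizes, stride, top_k=5):
    """Slide boxes of each size over the cloud's extent and keep the densest ones."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    proposals = []
    for size in sizes:
        size = np.asarray(size, dtype=float)
        starts = [
            np.arange(lo[a], max(lo[a], hi[a] - size[a]) + 1e-9, stride)
            for a in range(3)
        ]
        for x in starts[0]:
            for y in starts[1]:
                for z in starts[2]:
                    box_min = np.array([x, y, z])
                    box_max = box_min + size
                    score = density_score(points, box_min, box_max)
                    proposals.append((score, box_min, box_max))
    proposals.sort(key=lambda p: p[0], reverse=True)
    return proposals[:top_k]
```

Because every box is aligned with the coordinate axes, the containment test is a pair of element-wise comparisons, which is what makes the exhaustive search cheap; it is also the source of the poor fit for tilted objects noted above.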

3.2 Joint CRF Inference

We formulate the joint reasoning as inference in a Conditional Random Field. The CRF includes three types of variables: a scene class variable (kitchen, bedroom, etc.), object detection variables for each cuboid proposal, and support relationship variables encoding which objects support others. The unary potentials capture individual evidence: scene type is predicted from global features (GIST, spatial pyramid), and each object cuboid is classified using 3D shape descriptors and 2D appearance features. The pairwise potentials encode contextual interactions: scene-object compatibility, object-object co-occurrence statistics, and spatial support relationships. Inference is performed via message passing, iterating until convergence.
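
One plausible way to write down the model just described is as a score over the scene class $s$, the object labels $\mathbf{o} = (o_1, \dots, o_N)$ for the $N$ cuboid proposals, and the support variables $\mathbf{r} = \{r_{ij}\}$. The notation below is an assumption made for this summary, not the paper's own, and the exact form of the support term in particular is a guess.

```latex
\[
\begin{aligned}
S(s, \mathbf{o}, \mathbf{r}) ={}& \phi_{\text{scene}}(s) + \sum_{i} \phi_{\text{obj}}(o_i) \\
&+ \sum_{i} \psi_{\text{so}}(s, o_i)
 + \sum_{i<j} \psi_{\text{oo}}(o_i, o_j)
 + \sum_{i \neq j} \psi_{\text{sup}}(r_{ij}, o_i, o_j), \\
(s^{*}, \mathbf{o}^{*}, \mathbf{r}^{*}) ={}& \arg\max_{s,\, \mathbf{o},\, \mathbf{r}} \; S(s, \mathbf{o}, \mathbf{r}).
\end{aligned}
\]
```

Here $\phi_{\text{scene}}$ and $\phi_{\text{obj}}$ are the unary terms (global scene features and per-cuboid classifiers), while $\psi_{\text{so}}$, $\psi_{\text{oo}}$, and $\psi_{\text{sup}}$ stand for scene-object compatibility, object-object co-occurrence, and spatial support respectively.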

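Paragraph function: The core innovation, describing the full structure of the CRF and its inference mechanism.

Logical role: This paragraph is the backbone of the paper's argument. The three variable types (scene, object, support) map directly onto the three kinds of dependency raised in the introduction. The CRF's graph structure turns intuitive contextual relationships into a computable energy function.

Argumentative technique / potential weakness: The CRF design elegantly unifies several contextual cues. However, message-passing inference may become inefficient with many candidate objects, and convergence is not always guaranteed. In addition, co-occurrence statistics depend on the training data distribution and may fail in rare scene configurations.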
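
The effect of joint inference in Section 3.2 can be seen in a toy example. The sketch below replaces message passing with exhaustive enumeration, which is only feasible for a handful of proposals, and it omits the support variables. Every score in it is hypothetical.

```python
import itertools

SCENES = ["kitchen", "bedroom"]
LABELS = ["background", "stove", "bed"]

# Hypothetical unary scores (higher is better).
scene_unary = {"kitchen": 0.1, "bedroom": 0.3}
object_unary = [  # one dict per cuboid proposal
    {"background": 0.2, "stove": 0.5, "bed": 0.4},
    {"background": 0.3, "stove": 0.6, "bed": 0.1},
]
# Hypothetical pairwise scores; missing pairs contribute 0.
scene_object = {("kitchen", "stove"): 1.0, ("bedroom", "bed"): 1.0}
object_object = {("stove", "bed"): -1.0}  # treated as symmetric

def pair_score(table, a, b):
    return table.get((a, b), table.get((b, a), 0.0))

def total_score(scene, labels):
    """Sum of unary, scene-object, and object-object terms for one assignment."""
    score = scene_unary[scene]
    for unary, label in zip(object_unary, labels):
        score += unary[label] + scene_object.get((scene, label), 0.0)
    for a, b in itertools.combinations(labels, 2):
        score += pair_score(object_object, a, b)
    return score

def joint_map():
    """Exhaustive search over all joint assignments (stand-in for message passing)."""
    label_sets = itertools.product(LABELS, repeat=len(object_unary))
    candidates = itertools.product(SCENES, label_sets)
    return max(candidates, key=lambda c: total_score(*c))

def independent_map():
    """Baseline: pick the best scene and the best label per object in isolation."""
    scene = max(SCENES, key=scene_unary.get)
    labels = tuple(max(LABELS, key=unary.get) for unary in object_unary)
    return scene, labels
```

With these assumed scores, `independent_map()` returns `("bedroom", ("stove", "stove"))`, while `joint_map()` returns `("kitchen", ("stove", "stove"))`: the two stove detections override the weak scene classifier. The search space grows as `len(SCENES) * len(LABELS) ** N`, which is why an approximate scheme such as message passing is needed for realistic numbers of proposals.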

4. Experiments

We evaluate our approach on the NYU Depth V2 dataset, which contains 1449 RGBD images of indoor scenes with dense per-pixel labels. We follow the standard split with 795 training and 654 testing images. For 3D object detection, we evaluate using average precision (AP) at an IoU threshold of 0.25 in 3D space. Our holistic model achieves significant improvements across most object categories compared to baselines that detect objects independently. Specifically, the joint model improves AP by an average of 4.2 points over the strongest baseline. Scene classification accuracy also improves when jointly optimized, reaching 72.1% compared to 67.8% for the independent classifier, a gain of 4.3 percentage points. Ablation studies confirm that both the scene-object coupling and the support relationship modeling contribute to the overall performance gain.
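
The matching criterion behind the AP metric can be sketched as follows. This assumes both the predicted and ground-truth cuboids are axis-aligned and given as min/max corners; the source does not say how the paper's evaluation handles box orientation, and the function names are hypothetical.

```python
import numpy as np

def iou_3d_axis_aligned(min_a, max_a, min_b, max_b):
    """Volume intersection-over-union of two axis-aligned 3D boxes."""
    min_a, max_a, min_b, max_b = (
        np.asarray(v, dtype=float) for v in (min_a, max_a, min_b, max_b)
    )
    # Per-axis overlap length, clipped at zero when the boxes do not overlap.
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)
    inter = overlap.prod()
    union = (max_a - min_a).prod() + (max_b - min_b).prod() - inter
    return float(inter / union) if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.25):
    """A detection counts as correct when its 3D IoU reaches the threshold."""
    return iou_3d_axis_aligned(*pred_box, *gt_box) >= threshold

# Two 2x2x2 boxes shifted by 1 along x: intersection 4, union 12, IoU = 1/3.
example_iou = iou_3d_axis_aligned([0, 0, 0], [2, 2, 2], [1, 0, 0], [3, 2, 2])
```

In the example, a box shifted by half its width along one axis still scores an IoU of 1/3 and would count as a true positive at the 0.25 threshold.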

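Paragraph function: Provides experimental evidence, using quantitative results to support the holistic approach.

Logical role: The empirical pillar, covering three dimensions: (1) the improvement in 3D detection AP; (2) the accompanying gain in scene classification; (3) ablation studies confirming the contribution of each component.

Argumentative technique / potential weakness: Reporting gains in both detection and scene classification is compelling, since it shows that holistic reasoning benefits both tasks. However, an IoU threshold of 0.25 is fairly loose, and whether the improvement holds under stricter thresholds remains to be verified.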

5. Conclusion

We have presented a holistic approach to 3D scene understanding from RGBD data that jointly reasons about scene layout, object detection, and support relationships. By extending CPMC to 3D for proposal generation and formulating the joint reasoning as CRF inference, our method effectively leverages the mutual dependencies between scene-level and object-level understanding. Experiments on the NYU Depth V2 dataset demonstrate substantial improvements over methods that treat detection in isolation. Our work suggests that holistic scene reasoning is a fruitful direction for indoor 3D understanding, and future work could incorporate richer relationships such as functional affordances and temporal consistency.

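Paragraph function: Summarizes the paper, restating the core contributions and pointing to future directions.

Logical role: The conclusion echoes the "holistic" theme of the abstract and names two concrete directions for follow-up work, functional affordances and temporal consistency, closing the argumentative loop.

Argumentative technique / potential weakness: The conclusion is concise and forceful, but it does not discuss the method's computational cost or real-time limitations. For RGBD applications that require real-time interaction (such as robot navigation), inference efficiency may be a key bottleneck.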

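Argument structure overview

Problem: pipeline-style 3D detection ignores scene context
→ Claim: joint CRF inference over scene, objects, and support
→ Evidence: AP on NYU V2 improves by 4.2 points on average
→ Rebuttal: ablation studies confirm that every component contributes
→ Conclusion: holistic reasoning is a fruitful direction for indoor 3D understanding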

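Authors' core claim (in one sentence)

Jointly inferring scene type, object categories, and support relationships with a conditional random field can significantly improve RGBD-based indoor 3D object detection.

Strongest point of the argument

Empirical evidence of two-way benefit: the joint model not only raises object detection AP but also improves scene classification accuracy, showing that scene-level and object-level reasoning reinforce each other rather than information flowing in only one direction.

Weakest point of the argument

Axis-aligned assumption and computational cost: the axis-aligned bounding box assumption in 3D CPMC limits detection of objects in arbitrary poses. The convergence speed of CRF message passing and its scalability to large scenes are not adequately discussed, which may limit practical deployment.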