Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images

Abstract — 摘要

We address the problems of contour detection, bottom-up grouping and semantic segmentation of RGB-D images from indoor environments. We propose algorithms for object boundary detection and hierarchical segmentation that generalize the gPb-ucm approach by leveraging depth information. We label contours by type: depth, normal, or albedo discontinuities, and introduce a long-range amodal completion mechanism for grouping. For semantic segmentation, we classify superpixels into 40 object categories on the NYUD2 dataset using both generic and class-specific features capturing appearance and geometry. We demonstrate the effectiveness of our approach for scene understanding and show that contextual information further enhances object recognition, achieving significant improvements over the state-of-the-art.

本文處理室內環境中 RGB-D 影像的輪廓偵測、由下而上分組及語意分割問題。我們提出物件邊界偵測與階層式分割的演算法，藉由深度資訊擴展 gPb-ucm 方法。我們依據類型標記輪廓：深度不連續、法線不連續或反照率不連續，並引入長距離非模態補全機制以改善分組效果。在語意分割方面，我們使用兼顧外觀與幾何的通用特徵及類別專屬特徵，在 NYUD2 資料集上將超像素分類為 40 種物件類別。我們展示了此方法在場景理解上的有效性，並說明情境資訊能進一步提升物件辨識，達成顯著優於現有最佳方法的成果。

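Paragraph function: Overview of the whole paper, presenting the complete pipeline progressively from low-level vision (contour detection) through mid-level (grouping) to high-level (semantic segmentation).

Logical role: The abstract serves a dual role of problem definition and contribution preview: it first delimits the three sub-problems of RGB-D scene understanding, then outlines a unified framework for solving them.

Argumentative technique / potential weaknesses: The authors close with "significant improvements over the state-of-the-art" but give no concrete numbers in the abstract. In addition, how tightly the three sub-problems are coupled, that is, whether they truly benefit one another, must be verified in the experiments.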

1. Introduction — 緒論

The availability of affordable RGB-D sensors such as the Microsoft Kinect has opened up new possibilities for indoor scene understanding. Unlike traditional RGB images, depth maps provide explicit geometric information about the scene, enabling algorithms to reason about 3D structure, surface orientations, and spatial relationships between objects. However, most existing approaches for scene recognition and object detection have been developed primarily for RGB data and do not fully exploit the rich geometric cues available in depth images.

價格親民的 RGB-D 感測器（如 Microsoft Kinect）的問世，為室內場景理解開啟了新的可能性。有別於傳統 RGB 影像，深度圖提供了場景的顯式幾何資訊，使演算法能推理三維結構、表面朝向以及物件間的空間關係。然而，大多數現有的場景辨識與物件偵測方法主要是為 RGB 資料所設計，並未充分利用深度影像中豐富的幾何線索。

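Paragraph function: Establishes the research setting, using the spread of the Kinect as historical background to motivate the importance of RGB-D scene understanding.

Logical role: The starting point of the chain of argument: it first affirms the potential of depth sensors, then points out that existing methods fail to fully exploit depth information, paving the way for the paper's approach.

Argumentative technique / potential weaknesses: Using a hardware innovation (the Kinect) as research motivation is highly timely. However, the claim that depth is "not fully exploited" needs to be more precise: some prior work had already attempted to fuse depth information, so this may oversimplify the existing literature.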

In this work, we take a principled approach to indoor scene understanding from RGB-D images, addressing three interrelated tasks. First, we extend the gPb contour detector and UCM hierarchical segmentation framework to incorporate depth-derived features including height above ground, angle with gravity, and surface normal differences. Second, we introduce an amodal completion mechanism that uses long-range depth cues to infer occluded surface boundaries. Third, we build a semantic segmentation system that uses region-level features combining appearance, shape, and 3D geometric context to achieve state-of-the-art recognition performance on the NYUD2 benchmark.

在本研究中，我們採取有系統的方法處理 RGB-D 影像的室內場景理解，涵蓋三項相互關聯的任務。首先，我們擴展 gPb 輪廓偵測器與 UCM 階層式分割框架，融入深度衍生特徵，包括離地高度、與重力的夾角以及表面法線差異。其次，我們引入非模態補全機制，利用長距離深度線索推斷被遮擋的表面邊界。第三，我們建構語意分割系統，使用結合外觀、形狀與三維幾何情境的區域級特徵，在 NYUD2 基準上達成最先進的辨識效能。

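Paragraph function: Proposes the solution by listing three technical contributions.

Logical role: Following the previous paragraph's point about the shortcomings of existing methods, this paragraph acts as the pivot from problem to concrete solution. The three contributions map exactly onto the three sub-problems in the abstract, so the logic is tight.

Argumentative technique / potential weaknesses: Listing the three contributions in enumerated form makes them easy for the reader to track. However, every contribution is built on an existing framework (gPb-ucm); if that base framework performs poorly in the depth domain, overall performance may be limited.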

2. Related Work — 相關工作

Contour detection has a rich history in computer vision, from early edge detectors to the more recent gPb (globalized probability of boundary) framework which combines multiscale local brightness, color, and texture gradients with spectral clustering. The UCM (ultrametric contour map) provides a natural hierarchy of regions. However, these methods were designed for RGB images and do not leverage depth information. Recent works on RGB-D segmentation have used depth in various ways, but typically treat it as an additional channel rather than exploiting its geometric meaning.

輪廓偵測在電腦視覺中有悠久的歷史，從早期的邊緣偵測器到較新的 gPb（全域化邊界機率）框架，後者結合了多尺度局部亮度、顏色與紋理梯度及譜聚類。UCM（超度量輪廓圖）提供了自然的區域階層。然而，這些方法是為 RGB 影像設計的，並未利用深度資訊。近期關於 RGB-D 分割的研究雖以各種方式使用深度，但通常僅將其視為額外通道，而未發掘其幾何意涵。

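Paragraph function: Literature review, tracing the line of research on contour detection and segmentation.

Logical role: Builds a technical lineage from generic edge detectors to gPb-ucm, then points out the shortcoming of treating depth as an additional channel, paving the way for the paper's stance of treating depth as a geometric cue.

Argumentative technique / potential weaknesses: Characterizing prior work as merely treating depth as an extra channel is somewhat simplistic, since some studies had already tried extracting geometric features such as normals. Still, this framing effectively highlights the paper's distinctiveness in geometric modeling.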

For semantic segmentation of indoor scenes, the seminal NYU Depth V2 dataset by Silberman et al. provides 1449 densely labeled RGB-D images with 40 semantic classes. Prior approaches have used hand-crafted features with random forest classifiers or kernel descriptors on RGB-D patches. Scene classification methods typically rely on global scene descriptors such as GIST. Our work differs by building a unified pipeline that integrates low-level perceptual organization with high-level semantic recognition, leveraging the geometric structure revealed by depth at every stage.

在室內場景的語意分割方面，NYU Depth V2 資料集提供了 1449 張具有 40 個語意類別密標註的 RGB-D 影像。先前的方法使用手工設計特徵搭配隨機森林分類器，或在 RGB-D 區塊上使用核描述子。場景分類方法通常依賴全域場景描述子（如 GIST）。本文的不同之處在於建構統一的管線，將低階感知組織與高階語意辨識整合在一起，在每個階段皆利用深度所揭示的幾何結構。

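Paragraph function: Positions the paper's contribution, highlighting the distinctiveness of a unified pipeline within the semantic segmentation literature.

Logical role: Distinguishes this paper from prior work in terms of technical tools (datasets, features, classifiers), emphasizing that the unified pipeline is the key difference.

Argumentative technique / potential weaknesses: The promise to use depth "at every stage" is persuasive, but later sections need to verify stage by stage that depth integration actually yields gains rather than merely adding computational overhead.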

3. Contour Detection — 輪廓偵測

We extend the gPb framework to RGB-D images by incorporating depth-derived gradient channels. Specifically, we compute oriented gradients of depth values, surface normals, and height above the estimated ground plane. Each gradient channel captures different types of boundaries: depth gradients detect occlusion boundaries, normal gradients reveal orientation discontinuities at surface folds, and height gradients help separate objects from the support surface. These are combined with the standard brightness, color, and texture gradients using a logistic regression classifier trained to predict true boundary probability.

我們藉由納入深度衍生梯度通道，將 gPb 框架擴展至 RGB-D 影像。具體而言，我們計算深度值、表面法線與離估計地面高度的方向梯度。每個梯度通道捕捉不同類型的邊界：深度梯度偵測遮擋邊界，法線梯度揭示表面摺痕處的朝向不連續性，高度梯度有助於將物件與支撐面分離。這些梯度通道與標準的亮度、顏色及紋理梯度結合，透過經訓練的邏輯迴歸分類器預測真實邊界機率。

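Paragraph function: First step of the method derivation, defining the depth-derived gradient channels.

Logical role: This paragraph is the foundation of the method: the three depth gradients (depth values, normals, height) each correspond to a different physical meaning, showing the authors' systematic analysis of depth information.

Argumentative technique / potential weaknesses: Decomposing depth information into three physically interpretable gradient channels is an elegant design choice. However, depth sensor noise (especially at object edges) can severely degrade gradient quality, so the authors need to explain their denoising strategy.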
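
The combination step described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the channel names, the weights, and the bias below are hypothetical placeholders (in the described setup they would be learned by logistic regression on labeled boundaries), and the depth gradient is reduced to a central difference along one image row rather than the oriented gradients used by gPb.

```python
import math

# Hypothetical weights for illustration only; in practice they would be
# learned by logistic regression on labeled boundary data.
WEIGHTS = {
    "brightness": 1.0,
    "color": 1.0,
    "texture": 1.0,
    "depth": 2.0,
    "normal": 1.5,
    "height": 0.5,
}
BIAS = -3.0


def boundary_probability(gradients, weights=WEIGHTS, bias=BIAS):
    """Combine per-channel gradient magnitudes into a boundary probability."""
    score = bias + sum(weights[name] * gradients.get(name, 0.0) for name in weights)
    return 1.0 / (1.0 + math.exp(-score))


def horizontal_depth_gradient(depth_row):
    """Absolute central difference along one image row (interior pixels only)."""
    return [abs(depth_row[i + 1] - depth_row[i - 1]) / 2.0
            for i in range(1, len(depth_row) - 1)]


# One row of depth values in metres with a step between pixels 2 and 3,
# as would occur at an occlusion boundary.
depth_row = [1.0, 1.0, 1.0, 2.5, 2.5, 2.5]
grads = horizontal_depth_gradient(depth_row)
probs = [boundary_probability({"depth": g}) for g in grads]
```

With these placeholder weights, the pixels adjacent to the depth step receive a higher boundary probability than the pixels on the flat parts of the row. Missing channels default to zero, so the same function can score RGB-only or depth-only responses.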

Furthermore, we introduce a contour classification scheme that labels each detected boundary as arising from a depth discontinuity, a surface normal discontinuity, or an albedo/texture change. This classification is performed by training a multi-class classifier on the relative contributions of each gradient channel. Knowing the type of boundary provides valuable semantic information: depth boundaries indicate occlusion relationships, normal boundaries indicate object parts or surface curvature, and albedo boundaries suggest material or illumination changes.

此外，我們引入輪廓分類方案，將每條偵測到的邊界標記為源自深度不連續、表面法線不連續或反照率/紋理變化。此分類透過在各梯度通道的相對貢獻上訓練多類別分類器來實現。了解邊界類型能提供寶貴的語意資訊：深度邊界指示遮擋關係，法線邊界指示物件部件或表面曲率，而反照率邊界暗示材質或光照變化。

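Paragraph function: Extends the method from detecting boundaries to classifying boundary types.

Logical role: This paragraph lifts contour detection from "where is a boundary" to "why is there a boundary", providing richer cues for downstream grouping and semantic segmentation.

Argumentative technique / potential weaknesses: Boundary type classification is a distinctive contribution of the paper, giving contours a physical meaning. In real scenes, however, boundary types may co-occur (for example, an occlusion may exhibit both depth and albedo changes), so a single-label classification may oversimplify.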
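
The idea of classifying a contour from the relative contributions of its gradient channels can be sketched as follows. This is a stand-in, not the trained multi-class classifier the text describes: it simply picks the boundary type whose channels contribute the largest share. The grouping of channels into types (`TYPE_CHANNELS`) and all function names are assumptions made for illustration.

```python
def relative_contributions(responses):
    """Normalize channel responses so they sum to 1 (all zeros stay zeros)."""
    total = sum(responses.values())
    if total == 0:
        return {name: 0.0 for name in responses}
    return {name: value / total for name, value in responses.items()}


# Hypothetical grouping of gradient channels into the three boundary types.
# Any other channel (for example a height channel) is left unassigned here.
TYPE_CHANNELS = {
    "depth": ["depth"],
    "normal": ["normal"],
    "albedo": ["brightness", "color", "texture"],
}


def classify_contour(responses):
    """Stand-in for a trained classifier: return the dominant boundary type.

    Returns (label, type_scores); label is None when no channel responds.
    """
    shares = relative_contributions(responses)
    type_scores = {
        label: sum(shares.get(channel, 0.0) for channel in channels)
        for label, channels in TYPE_CHANNELS.items()
    }
    if all(score == 0.0 for score in type_scores.values()):
        return None, type_scores
    return max(type_scores, key=type_scores.get), type_scores


occlusion_edge = {"depth": 0.6, "normal": 0.1,
                  "brightness": 0.2, "color": 0.05, "texture": 0.05}
label, scores = classify_contour(occlusion_edge)
```

Returning the per-type scores alongside the label keeps the information needed for boundaries where several types co-occur, such as an occlusion edge that also shows an albedo change.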

4. Hierarchical Grouping — 階層式分組

Building on the extended contour detector, we construct ultrametric contour maps (UCMs) that provide a hierarchical segmentation of the RGB-D image. At each level of the hierarchy, regions are merged based on the minimum boundary strength along their shared contour. We further enhance grouping with an amodal completion mechanism: using depth-based reasoning about occlusion relationships, we infer the full extent of surfaces that are partially hidden behind other objects. This allows the algorithm to group together disconnected image regions that belong to the same physical surface, such as a table visible on both sides of a chair.

基於擴展的輪廓偵測器，我們建構超度量輪廓圖（UCM），提供 RGB-D 影像的階層式分割。在階層的每個層級，區域依據其共享輪廓上的最小邊界強度進行合併。我們進一步以非模態補全機制增強分組：利用基於深度的遮擋關係推理，推斷部分被其他物件遮擋的表面的完整範圍。這使演算法能將屬於同一物理表面但在影像中不連通的區域分組在一起，例如在椅子兩側可見的桌面。

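Paragraph function: Core innovation, describing the depth-driven amodal completion mechanism.

Logical role: This paragraph is one of the pillars of the paper's argument: amodal completion directly addresses the occlusion that is pervasive in indoor scenes, elevating pure image segmentation to the level of 3D surface reasoning.

Argumentative technique / potential weaknesses: The concrete example of a table visible on both sides of a chair makes the abstract algorithm intuitive. However, amodal completion assumes that occlusion relationships can be reliably inferred from the depth map, which may not hold in regions where the depth sensor fails (such as reflective or transparent surfaces).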
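
The hierarchical merging behind a UCM-style segmentation can be sketched as a greedy agglomeration over a region adjacency graph. The sketch below assumes that the dissimilarity between two adjacent regions is the mean contour strength along their shared boundary, and that boundaries are pooled when regions merge. Both are assumptions of this illustration rather than details confirmed by the text, and the amodal completion step is not modeled. The function name and input format are hypothetical.

```python
def build_hierarchy(num_regions, boundaries):
    """Greedily merge regions across their weakest shared boundary.

    boundaries maps a pair (a, b) to (total_strength, length) for the
    contour shared by regions a and b. Returns a list of merges
    (region_a, region_b, new_region_id, mean_strength) in merge order.
    """
    edges = {frozenset(pair): list(value) for pair, value in boundaries.items()}
    next_id = num_regions
    merges = []
    while edges:
        # Pick the pair of adjacent regions with the weakest mean boundary.
        pair = min(edges, key=lambda p: edges[p][0] / edges[p][1])
        strength = edges[pair][0] / edges[pair][1]
        a, b = sorted(pair)
        del edges[pair]
        # Re-attach neighbours of a and b to the merged region, pooling the
        # strength and length of boundaries that now face the same region.
        updated = {}
        for other_pair, (total, length) in edges.items():
            if other_pair & pair:
                (neighbour,) = other_pair - pair
                key = frozenset((neighbour, next_id))
            else:
                key = other_pair
            pooled = updated.setdefault(key, [0.0, 0.0])
            pooled[0] += total
            pooled[1] += length
        edges = updated
        merges.append((a, b, next_id, strength))
        next_id += 1
    return merges


# Three regions: 0 and 1 are separated by a weak contour, 2 by strong ones.
boundaries = {(0, 1): (2.0, 10.0), (1, 2): (45.0, 10.0), (0, 2): (35.0, 10.0)}
merges = build_hierarchy(3, boundaries)
```

In this example regions 0 and 1 merge first (mean strength 0.2), and the merged region joins region 2 only at a much higher strength (4.0). Thresholding the recorded strengths at different levels yields coarser or finer segmentations.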

5. Semantic Segmentation — 語意分割

For semantic segmentation, we classify each superpixel in the hierarchical segmentation into one of 40 object categories defined in the NYUD2 dataset. Our feature set includes generic features (color histograms, texture descriptors, shape statistics, 3D bounding box properties, surface normal distributions) as well as class-specific features that capture the geometric context of each region relative to the scene structure — e.g., height above ground, distance to walls, and local support relationships. A linear SVM classifier is trained on these features, followed by contextual refinement using a CRF that encourages spatial consistency.

在語意分割方面，我們將階層式分割中的每個超像素分類為 NYUD2 資料集定義的 40 種物件類別之一。我們的特徵集包括通用特徵（色彩直方圖、紋理描述子、形狀統計量、三維包圍框屬性、表面法線分布）以及類別專屬特徵——捕捉每個區域相對於場景結構的幾何情境，例如離地高度、與牆壁的距離及局部支撐關係。使用線性 SVM 分類器在這些特徵上進行訓練，隨後透過條件隨機場（CRF）進行情境精煉以鼓勵空間一致性。

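Paragraph function: The highest level of the method, describing the feature engineering and classification architecture for semantic classification.

Logical role: This paragraph shows how the low- and mid-level results of the preceding sections are turned into semantic understanding. The combination of generic and class-specific features reflects a general-to-specific design philosophy.

Argumentative technique / potential weaknesses: The richness of the feature engineering is impressive, but it also exposes a heavy reliance on hand-designed features. With the deep learning era about to arrive (2013), the scalability of this approach deserves scrutiny. The benefit of CRF refinement needs to be weighed against its additional computational cost.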
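
A minimal sketch of the feature-then-classify step follows. It computes a few geometric context features for one superpixel and applies the decision rule of a one-vs-rest linear classifier. Everything specific here is an assumption for illustration: the coordinate convention (y pointing up), the four features, the three example classes, and the weights, which are hand-picked rather than trained. The CRF refinement is omitted.

```python
def region_features(points, floor_height=0.0):
    """Simple geometric features for one superpixel.

    points is a list of (x, y, z) coordinates in metres, with y pointing
    up along the gravity direction (an assumption of this sketch).
    """
    heights = [y - floor_height for _, y, _ in points]
    xs = [x for x, _, _ in points]
    zs = [z for _, _, z in points]
    return [
        min(heights),                                # lowest point above the floor
        max(heights),                                # highest point above the floor
        sum(heights) / len(heights),                 # mean height above the floor
        (max(xs) - min(xs)) * (max(zs) - min(zs)),   # horizontal footprint area
    ]


def predict_label(features, class_weights, class_biases):
    """One-vs-rest linear scoring, the decision rule of a linear SVM."""
    scores = {
        label: class_biases[label] + sum(w * f for w, f in zip(weights, features))
        for label, weights in class_weights.items()
    }
    return max(scores, key=scores.get), scores


# Hand-picked placeholder weights; a real system would train them.
CLASS_WEIGHTS = {
    "floor": [-4.0, -4.0, -4.0, 0.0],
    "table": [0.0, 0.0, 0.0, 0.0],
    "ceiling": [1.0, 1.0, 1.0, 0.0],
}
CLASS_BIASES = {"floor": 1.0, "table": -1.0, "ceiling": -6.0}

table_top = [(0.0, 0.7, 1.0), (1.0, 0.7, 1.0), (0.0, 0.75, 2.0), (1.0, 0.75, 2.0)]
table_features = region_features(table_top)
table_label, table_scores = predict_label(table_features, CLASS_WEIGHTS, CLASS_BIASES)
```

With these placeholder weights, a region about 0.7 m above the floor is labeled `table`, while regions at floor level or near 2.5 m fall to `floor` and `ceiling`. This shows how height above ground alone can separate some classes before appearance features are considered.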

6. Experiments — 實驗

We evaluate our approach on the NYU Depth V2 dataset, which contains 1449 RGB-D images of indoor scenes with dense semantic labels across 40 categories. For contour detection, our RGB-D gPb achieves an ODS F-score of 0.72, compared to 0.65 for the RGB-only baseline, demonstrating the significant benefit of depth features. For semantic segmentation, our method achieves a pixel-wise accuracy of 60.3% and a mean class accuracy of 35.1%, significantly outperforming prior methods. The addition of contextual features and CRF-based refinement provides a further 3-4% improvement in pixel accuracy. We also demonstrate strong performance on scene classification, correctly categorizing 73.7% of test scenes into room types.

我們在 NYU Depth V2 資料集上評估本方法，該資料集包含 1449 張室內場景的 RGB-D 影像，具有橫跨 40 個類別的密集語意標註。在輪廓偵測方面，我們的 RGB-D gPb 達到 ODS F-score 0.72，相較於僅使用 RGB 的基準線 0.65 有顯著提升，展現深度特徵的顯著效益。在語意分割方面，我們的方法達到像素精確度 60.3% 與平均類別精確度 35.1%，顯著優於先前方法。情境特徵與基於 CRF 的精煉進一步帶來 3-4% 的像素精確度提升。我們也展示了在場景分類上的優異表現，正確地將 73.7% 的測試場景歸類至房間類型。

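Paragraph function: Provides comprehensive experimental evidence, validating the method across multiple tasks and metrics.

Logical role: The empirical pillar, covering three dimensions: (1) the F-score for contour detection; (2) pixel accuracy for semantic segmentation; (3) accuracy for scene classification. Only the contour result is reported here with an explicit baseline number (0.72 vs 0.65); the segmentation and scene classification figures are compared with prior methods only in words.

Argumentative technique / potential weaknesses: The concrete numerical comparison (0.72 vs 0.65) is convincing. However, a mean class accuracy of only 35.1% over 40 categories shows that recognizing long-tail categories remains challenging. Moreover, evaluating only on NYUD2 limits how strongly generalization can be argued.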
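
The gap between pixel accuracy and mean class accuracy noted above follows directly from how the two metrics are defined. The sketch below computes both from a confusion matrix, along with the F-score used for contour evaluation. The toy confusion matrix and the precision/recall pairs are made-up numbers for illustration, not results from the paper, and the ODS helper assumes the dataset-level precision and recall at each threshold have already been computed.

```python
def pixel_accuracy(confusion):
    """Fraction of all pixels labeled correctly (rows are ground truth)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def mean_class_accuracy(confusion):
    """Average of per-class recall, so every class counts equally."""
    per_class = [row[i] / sum(row)
                 for i, row in enumerate(confusion) if sum(row) > 0]
    return sum(per_class) / len(per_class)


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def ods_f_score(pr_pairs):
    """Best F-score over thresholds, given dataset-level (precision, recall)."""
    return max(f_score(p, r) for p, r in pr_pairs)


# Toy example: one frequent class and two rare classes.
confusion = [
    [90, 10, 0],   # frequent class: 100 pixels, 90 correct
    [6, 4, 0],     # rare class: 10 pixels, 4 correct
    [8, 0, 2],     # rare class: 10 pixels, 2 correct
]
overall = pixel_accuracy(confusion)
per_class_mean = mean_class_accuracy(confusion)
best_f = ods_f_score([(0.9, 0.4), (0.7, 0.7), (0.4, 0.9)])
```

In the toy matrix the pixel accuracy is 0.8 while the mean class accuracy is 0.5, because the frequent class dominates the pixel count and the rare classes are mostly mislabeled. The same mechanism explains how 60.3% pixel accuracy can coexist with 35.1% mean class accuracy over 40 categories.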

7. Conclusion — 結論

We have presented a comprehensive framework for perceptual organization and recognition of indoor scenes from RGB-D images. By extending the gPb-ucm pipeline with depth-derived features, introducing amodal completion for better grouping, and building a rich feature-based semantic segmentation system, we achieve significant improvements over prior work across multiple tasks. Our results demonstrate that depth information, when properly exploited through geometric reasoning rather than treated as a simple additional channel, provides substantial benefits for scene understanding. Future directions include incorporating learned feature representations and extending to full 3D scene parsing.

我們提出了一個從 RGB-D 影像進行室內場景感知組織與辨識的完整框架。透過以深度衍生特徵擴展 gPb-ucm 管線、引入非模態補全以改善分組、以及建構豐富特徵的語意分割系統，我們在多項任務上達成顯著優於先前工作的成果。我們的結果表明，深度資訊在透過幾何推理而非僅作為額外通道被適當利用時，能為場景理解提供實質性的效益。未來方向包括納入學習式特徵表示，以及擴展至完整的三維場景解析。

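Paragraph function: Summarizes the paper, restating the core contributions and looking ahead.

Logical role: The conclusion mirrors the structure of the abstract, moving from concrete results back to a higher-level insight: the geometric meaning of depth matters more than its raw values. This closes the argumentative loop.

Argumentative technique / potential weaknesses: The outlook on "learned feature representations" proved prescient: only a year later, deep learning methods substantially surpassed hand-crafted features. This modest acknowledgment suggests the authors were already aware of the method's historical limitations.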

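Argument structure overview

Problem: RGB-D scene understanding fails to exploit the geometric meaning of depth
→ Claim: Depth-derived geometric features should be integrated into every processing stage
→ Evidence: Significant multi-task gains on the NYUD2 benchmark
→ Rebuttal: Amodal completion handles the grouping difficulties caused by occlusion
→ Conclusion: Geometric reasoning, rather than a raw extra channel, is the right way to use depth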

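Authors' core claim (one sentence)

By converting depth information into physically meaningful geometric features (normals, height, occlusion relationships) and systematically integrating them into every stage of contour detection, hierarchical grouping, and semantic segmentation, RGB-D understanding of indoor scenes can be significantly improved.

Strongest point of the argument

Geometric decomposition of depth information: depth is decomposed into three separate channels (depth-value gradients, normal gradients, and height above ground), each with a clear physical meaning (occlusion, surface folds, support relationships). This principled design is more convincing than simply treating depth as a fourth channel, and the rise in ODS F-score from 0.65 to 0.72 directly validates it.

Weakest point of the argument

Scalability concerns of hand-crafted features: a mean class accuracy of only 35.1% across the 40 categories shows that the method still falls short on long-tail categories. More fundamentally, the whole framework depends heavily on hand-designed features and traditional classifiers. With deep learning about to dominate (2013), the method's technical lifespan is limited, as the authors hint in their conclusion.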